
AnyCrawl

High-performance web, site, and SERP crawler with AI extraction

AnyCrawl offers fast crawling, scraping, and SERP collection with JSON extraction, handling both static and JavaScript-rendered pages via Cheerio, Playwright, or Puppeteer.


Overview

AnyCrawl is a TypeScript toolkit designed for developers, data engineers, and SEO analysts who need to harvest web content at scale. It supports three core modes: SERP crawling (Google), single‑page scraping, and full‑site traversal. Each mode can run with Cheerio for static HTML, Playwright for modern JavaScript rendering, or Puppeteer for Chrome‑based rendering, giving you flexibility across site complexities.

Capabilities & Deployment

The engine leverages native multi‑threading and multi‑process execution to handle bulk jobs efficiently. Built‑in LLM‑friendly JSON schema extraction lets you turn raw pages into structured data ready for AI pipelines. Self‑hosting is straightforward with Docker Compose; generate API keys via the provided pnpm commands and optionally enable token‑based authentication. Proxy support is available per request or via environment variables, and a default high‑quality proxy is bundled. Documentation and a Playground simplify integration in any language.
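The JSON schema extraction described above can be sketched as a request payload. This is a minimal sketch, assuming a self-hosted instance on `localhost:8080` and field names such as `engine` and `json_options`; the endpoint path and payload shape are assumptions, so check the documentation for your deployment.

```typescript
// Hedged sketch: a scrape request with LLM-friendly JSON schema
// extraction. Endpoint path, port, and field names (engine,
// json_options) are assumptions, not confirmed API details.
const extractionRequest = {
  url: "https://example.com/product",
  engine: "cheerio", // static HTML; use "playwright"/"puppeteer" for JS pages
  json_options: {
    schema: {
      type: "object",
      properties: {
        title: { type: "string" },
        price: { type: "string" },
      },
      required: ["title"],
    },
  },
};

// Against a self-hosted instance (Docker Compose defaults assumed):
async function scrape(): Promise<unknown> {
  const res = await fetch("http://localhost:8080/v1/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(extractionRequest),
  });
  return res.json();
}
```

Defining the schema up front keeps the raw-HTML-to-structured-data step declarative, which is what makes the output AI-pipeline ready.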

Who Benefits

AnyCrawl shines for teams building training datasets, monitoring competitor SEO, archiving websites, or constructing knowledge bases where reliable, high‑throughput crawling and AI‑ready output are essential.

Highlights

SERP crawling with Google support (more engines planned)
Threaded and process‑based crawling for bulk workloads
LLM‑friendly JSON schema extraction
Flexible rendering engines: Cheerio, Playwright, Puppeteer

Pros

  • High performance through native multi‑threading
  • Handles JavaScript‑heavy pages via Playwright/Puppeteer
  • Built‑in LLM‑ready structured data extraction
  • Easy self‑hosting with Docker and API key management

Considerations

  • SERP support limited to Google at present
  • Requires a Node.js/TypeScript runtime
  • Authentication must be configured for secure deployments
  • Proxy configuration may need manual setup

Managed products teams compare with

When teams consider AnyCrawl, these hosted platforms usually appear on the same shortlist.


Apify

Web automation & scraping platform powered by serverless Actors


Browserbase

Cloud platform for running and scaling headless web browsers, enabling reliable browser automation and scraping at scale


Browserless

Headless browser platform & APIs for Puppeteer/Playwright with autoscaling

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Developers building AI training data pipelines
  • SEO researchers needing bulk search‑result data
  • Data engineers automating site archives
  • Teams requiring customizable rendering options

Not ideal when

  • Simple one‑off scrapes where a lightweight library suffices
  • Projects not using a Node.js environment
  • Use cases requiring Bing or Baidu SERP out‑of‑the‑box
  • Real‑time scraping where low latency is critical

How teams use it

Generate LLM training data

Extract structured JSON from product pages to feed language models

Monitor competitor SEO

Crawl sites and collect SERP rankings for keyword analysis

Archive website content

Perform depth‑limited full‑site traversal and store pages for preservation

Build a knowledge base

Scrape articles, extract key fields, and ingest into a vector store

Tech snapshot

TypeScript 91%
MDX 7%
JavaScript 1%
Dockerfile 1%
Shell 1%
CSS 1%

Tags

web-scraper · serp · scraper · rag · html-to-markdown · scraping · ai-scraping · data · crawl · ai · tools

Frequently asked questions

Can I use proxies?

Yes. AnyCrawl includes a default proxy and lets you set a custom proxy per request or via the ANYCRAWL_PROXY_URL environment variable.
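A minimal sketch of both options. The per-request `proxy` field name is an assumption; the `ANYCRAWL_PROXY_URL` environment variable is the documented global alternative.

```typescript
// Per-request proxy override (the `proxy` field name is an assumption;
// verify against the AnyCrawl docs for your version).
const proxiedRequest = {
  url: "https://example.com",
  engine: "cheerio",
  proxy: "http://user:pass@proxy.example.com:8080",
};

// Or set the documented environment variable before starting the service:
//   export ANYCRAWL_PROXY_URL=http://user:pass@proxy.example.com:8080
```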

How do I handle JavaScript‑rendered pages?

Select the Playwright or Puppeteer engine in the request to render JavaScript before extraction.
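For example, the same request shape can target different engines depending on how the page is built. The field names here are assumptions, not confirmed API details.

```typescript
// Engine selection per request (field names are assumptions):
// Cheerio parses static HTML; Playwright/Puppeteer run a real browser
// so client-side JavaScript executes before extraction.
const staticPage = { url: "https://example.com/docs", engine: "cheerio" };
const spaPage = { url: "https://example.com/app", engine: "playwright" };
```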

How can I generate an API key?

Run `pnpm --filter api key:generate` inside the Docker container or host environment; the command returns a UUID, key, and credit balance.

Is authentication required?

Authentication is optional. Enable it by setting `ANYCRAWL_API_AUTH_ENABLED=true` and provide the generated Bearer token with each request.
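A sketch of an authenticated call, assuming the key is supplied via an `ANYCRAWL_API_KEY` environment variable on the client side (that variable name and the endpoint path are assumptions; the `Authorization: Bearer` header follows the answer above).

```typescript
// Authenticated request sketch. The ANYCRAWL_API_KEY client-side env
// var and the /v1/scrape path are assumptions for illustration.
const apiKey = process.env.ANYCRAWL_API_KEY ?? "<your-key>";

async function authedScrape(url: string): Promise<unknown> {
  const res = await fetch("http://localhost:8080/v1/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // key from `pnpm --filter api key:generate`
    },
    body: JSON.stringify({ url, engine: "cheerio" }),
  });
  if (res.status === 401) throw new Error("Invalid or missing API key");
  return res.json();
}
```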

Which search engines are supported for SERP crawling?

Currently AnyCrawl supports Google; additional engines may be added in future releases.
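A SERP request might look like the following sketch; the payload shape (`query`, `engine`, `pages`) is an assumption for illustration, not the confirmed API.

```typescript
// SERP request sketch. Only Google is supported today; the payload
// field names are assumptions -- consult the docs before use.
const serpRequest = {
  query: "web scraping frameworks",
  engine: "google",
  pages: 1, // number of result pages to collect
};
```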

Project at a glance

Status: Active
Stars: 2,529
Watchers: 2,529
Forks: 260
License: MIT
Repo age: 10 months
Last commit: 3 weeks ago
Primary language: TypeScript

Last synced 4 hours ago