
AnyCrawl

High-performance web, site, and SERP crawler with AI extraction

AnyCrawl offers fast crawling, scraping, and SERP collection with JSON extraction, handling both static and JavaScript-rendered pages via Cheerio, Playwright, or Puppeteer.


Overview

AnyCrawl is a TypeScript toolkit designed for developers, data engineers, and SEO analysts who need to harvest web content at scale. It supports three core modes: SERP crawling (Google), single‑page scraping, and full‑site traversal. Each mode can run with Cheerio for static HTML, Playwright for modern JavaScript rendering, or Puppeteer for Chrome‑based rendering, giving you flexibility across site complexities.

Capabilities & Deployment

The engine leverages native multi‑threading and multi‑process execution to handle bulk jobs efficiently. Built‑in LLM‑friendly JSON schema extraction lets you turn raw pages into structured data ready for AI pipelines. Self‑hosting is straightforward with Docker Compose; generate API keys via the provided pnpm commands and optionally enable token‑based authentication. Proxy support is available per request or via environment variables, and a default high‑quality proxy is bundled. Documentation and a Playground simplify integration in any language.
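The JSON schema extraction described above can be sketched as a request payload. This is a minimal sketch, assuming a self-hosted instance on `localhost:8080` and field names such as `engine` and `json_options`; the endpoint path and payload shape are assumptions, so check the documentation for your deployment.

```typescript
// Hedged sketch: a scrape request with LLM-friendly JSON schema
// extraction. Endpoint path, port, and field names (engine,
// json_options) are assumptions, not confirmed API details.
const extractionRequest = {
  url: "https://example.com/product",
  engine: "cheerio", // static HTML; use "playwright"/"puppeteer" for JS pages
  json_options: {
    schema: {
      type: "object",
      properties: {
        title: { type: "string" },
        price: { type: "string" },
      },
      required: ["title"],
    },
  },
};

// Against a self-hosted instance (Docker Compose defaults assumed):
async function scrape(): Promise<unknown> {
  const res = await fetch("http://localhost:8080/v1/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(extractionRequest),
  });
  return res.json();
}
```

Defining the schema up front keeps the raw-HTML-to-structured-data step declarative, which is what makes the output AI-pipeline ready.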

Who Benefits

AnyCrawl shines for teams building training datasets, monitoring competitor SEO, archiving websites, or constructing knowledge bases where reliable, high‑throughput crawling and AI‑ready output are essential.

Highlights

SERP crawling with Google support (more engines planned)
Threaded and process‑based crawling for bulk workloads
LLM‑friendly JSON schema extraction
Flexible rendering engines: Cheerio, Playwright, Puppeteer

Pros

  • High performance through native multi‑threading
  • Handles JavaScript‑heavy pages via Playwright/Puppeteer
  • Built‑in LLM‑ready structured data extraction
  • Easy self‑hosting with Docker and API key management

Considerations

  • SERP support limited to Google at present
  • Requires a Node.js/TypeScript runtime
  • Authentication must be configured for secure deployments
  • Proxy configuration may need manual setup

Managed products teams compare with

When teams consider AnyCrawl, these hosted platforms usually appear on the same shortlist.


Apify

Web automation & scraping platform powered by serverless Actors


Browserbase

Cloud platform for running and scaling headless web browsers, enabling reliable browser automation and scraping at scale


Browserless

Headless browser platform & APIs for Puppeteer/Playwright with autoscaling

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Developers building AI training data pipelines
  • SEO researchers needing bulk search‑result data
  • Data engineers automating site archives
  • Teams requiring customizable rendering options

Not ideal when

  • Simple one‑off scrapes where a lightweight library suffices
  • Projects not using a Node.js environment
  • Use cases requiring Bing or Baidu SERP out‑of‑the‑box
  • Real‑time scraping where low latency is critical

How teams use it

Generate LLM training data

Extract structured JSON from product pages to feed language models

Monitor competitor SEO

Crawl sites and collect SERP rankings for keyword analysis

Archive website content

Perform depth‑limited full‑site traversal and store pages for preservation

Build a knowledge base

Scrape articles, extract key fields, and ingest into a vector store

Tech snapshot

TypeScript 91%
MDX 7%
JavaScript 1%
Dockerfile 1%
Shell 1%
CSS 1%

Tags

web-scraper · serp · scraper · rag · html-to-markdown · scraping · ai-scraping · data · crawl · ai · tools

Frequently asked questions

Can I use proxies?

Yes. AnyCrawl includes a default proxy and lets you set a custom proxy per request or via the ANYCRAWL_PROXY_URL environment variable.
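A minimal sketch of both options. The per-request `proxy` field name is an assumption; the `ANYCRAWL_PROXY_URL` environment variable is the documented global alternative.

```typescript
// Per-request proxy override (the `proxy` field name is an assumption;
// verify against the AnyCrawl docs for your version).
const proxiedRequest = {
  url: "https://example.com",
  engine: "cheerio",
  proxy: "http://user:pass@proxy.example.com:8080",
};

// Or set the documented environment variable before starting the service:
//   export ANYCRAWL_PROXY_URL=http://user:pass@proxy.example.com:8080
```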

How do I handle JavaScript‑rendered pages?

Select the Playwright or Puppeteer engine in the request to render JavaScript before extraction.
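For example, the same request shape can target different engines depending on how the page is built. The field names here are assumptions, not confirmed API details.

```typescript
// Engine selection per request (field names are assumptions):
// Cheerio parses static HTML; Playwright/Puppeteer run a real browser
// so client-side JavaScript executes before extraction.
const staticPage = { url: "https://example.com/docs", engine: "cheerio" };
const spaPage = { url: "https://example.com/app", engine: "playwright" };
```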

How can I generate an API key?

Run `pnpm --filter api key:generate` inside the Docker container or host environment; the command returns a UUID, key, and credit balance.

Is authentication required?

Authentication is optional. Enable it by setting `ANYCRAWL_API_AUTH_ENABLED=true` and provide the generated Bearer token with each request.
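A sketch of an authenticated call, assuming the key is supplied via an `ANYCRAWL_API_KEY` environment variable on the client side (that variable name and the endpoint path are assumptions; the `Authorization: Bearer` header follows the answer above).

```typescript
// Authenticated request sketch. The ANYCRAWL_API_KEY client-side env
// var and the /v1/scrape path are assumptions for illustration.
const apiKey = process.env.ANYCRAWL_API_KEY ?? "<your-key>";

async function authedScrape(url: string): Promise<unknown> {
  const res = await fetch("http://localhost:8080/v1/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // key from `pnpm --filter api key:generate`
    },
    body: JSON.stringify({ url, engine: "cheerio" }),
  });
  if (res.status === 401) throw new Error("Invalid or missing API key");
  return res.json();
}
```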

Which search engines are supported for SERP crawling?

Currently AnyCrawl supports Google; additional engines may be added in future releases.
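A SERP request might look like the following sketch; the payload shape (`query`, `engine`, `pages`) is an assumption for illustration, not the confirmed API.

```typescript
// SERP request sketch. Only Google is supported today; the payload
// field names are assumptions -- consult the docs before use.
const serpRequest = {
  query: "web scraping frameworks",
  engine: "google",
  pages: 1, // number of result pages to collect
};
```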

Project at a glance

Status: Active
Stars: 2,529
Watchers: 2,529
Forks: 260
License: MIT
Repo age: 10 months
Last commit: 3 weeks ago
Primary language: TypeScript

Last synced 4 hours ago