

High-performance web, site, and SERP crawler with AI extraction
AnyCrawl offers fast crawling, scraping, and SERP collection with JSON extraction, handling static and JavaScript pages via Cheerio, Playwright or Puppeteer.

AnyCrawl is a TypeScript toolkit designed for developers, data engineers, and SEO analysts who need to harvest web content at scale. It supports three core modes: SERP crawling (Google), single-page scraping, and full-site traversal. Each mode can run with Cheerio for static HTML, Playwright for modern JavaScript rendering, or Puppeteer for Chrome-based rendering, so you can match the engine to the complexity of each target site.
The engine leverages native multi‑threading and multi‑process execution to handle bulk jobs efficiently. Built‑in LLM‑friendly JSON schema extraction lets you turn raw pages into structured data ready for AI pipelines. Self‑hosting is straightforward with Docker Compose; generate API keys via the provided pnpm commands and optionally enable token‑based authentication. Proxy support is available per request or via environment variables, and a default high‑quality proxy is bundled. Documentation and a Playground simplify integration in any language.
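To make the HTTP workflow concrete, here is a minimal TypeScript sketch of a single-page scrape against a self-hosted instance. The base URL, the /v1/scrape path, and the request-body field names (url, engine) are assumptions drawn from the description above; confirm them against the documentation or the Playground for your version.

```typescript
// Minimal sketch of a single-page scrape against a self-hosted AnyCrawl instance.
// Assumptions to verify against the docs/Playground: the instance listens on
// http://localhost:8080 and exposes POST /v1/scrape accepting { url, engine }.
const BASE_URL = process.env.ANYCRAWL_BASE_URL ?? "http://localhost:8080";

async function scrapePage(url: string): Promise<unknown> {
  const res = await fetch(`${BASE_URL}/v1/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url,               // page to fetch
      engine: "cheerio", // static-HTML engine; swap for "playwright"/"puppeteer" on JS-heavy sites
    }),
  });
  if (!res.ok) throw new Error(`Scrape failed: ${res.status} ${res.statusText}`);
  return res.json(); // structured result (content, metadata, etc.)
}

scrapePage("https://example.com").then((data) => console.log(data));
```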
AnyCrawl shines for teams building training datasets, monitoring competitor SEO, archiving websites, or constructing knowledge bases where reliable, high‑throughput crawling and AI‑ready output are essential.
Generate LLM training data: extract structured JSON from product pages to feed language models (see the sketch after this list).
Monitor competitor SEO: crawl sites and collect SERP rankings for keyword analysis.
Archive website content: perform depth-limited full-site traversal and store pages for preservation.
Build a knowledge base: scrape articles, extract key fields, and ingest them into a vector store.
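For the training-data use case above, the following sketch shows what schema-driven extraction could look like. The extraction option name (shown here as json_options carrying a JSON Schema) and the /v1/scrape path are assumptions to verify in the docs; the product schema is purely illustrative.

```typescript
// Sketch of schema-driven extraction for building LLM training data.
// Assumption: the scrape endpoint accepts an extraction option carrying a JSON
// Schema (shown here as "json_options"); confirm the exact option name and the
// response shape in the AnyCrawl documentation. The schema below is illustrative.
const productSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    price: { type: "string" },
    description: { type: "string" },
  },
  required: ["name", "price"],
};

async function extractProduct(baseUrl: string, url: string): Promise<unknown> {
  const res = await fetch(`${baseUrl}/v1/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url,
      engine: "playwright",                    // render JavaScript before extraction
      json_options: { schema: productSchema }, // assumed option name for schema extraction
    }),
  });
  if (!res.ok) throw new Error(`Extraction failed: ${res.status}`);
  return res.json();
}

extractProduct("http://localhost:8080", "https://shop.example.com/product/123")
  .then((record) => console.log(record)); // ready to append to a training dataset
```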
Does AnyCrawl support proxies?
Yes. AnyCrawl includes a default proxy and lets you set a custom proxy per request or via the ANYCRAWL_PROXY_URL environment variable.
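A minimal sketch of the per-request route is below; the "proxy" field name on the request body is an assumption, while ANYCRAWL_PROXY_URL is the documented environment-variable alternative.

```typescript
// Sketch of a per-request proxy override. The "proxy" request field is an assumption;
// ANYCRAWL_PROXY_URL is the documented environment-variable route, and omitting both
// falls back to the bundled default proxy.
async function scrapeViaProxy(baseUrl: string, url: string, proxyUrl: string): Promise<unknown> {
  const res = await fetch(`${baseUrl}/v1/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, engine: "cheerio", proxy: proxyUrl }),
  });
  return res.json();
}

// Route a single request through a hypothetical residential proxy.
scrapeViaProxy(
  "http://localhost:8080",
  "https://example.com",
  "http://user:pass@proxy.example.net:8000",
).then(console.log);
```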
How do I scrape JavaScript-heavy pages?
Select the Playwright or Puppeteer engine in the request to render JavaScript before extraction.
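For example, a request that opts into Playwright rendering might look like this sketch; the "engine" field name and its accepted values are assumptions based on the engines listed above.

```typescript
// Sketch of opting into browser rendering for a JavaScript-heavy page.
// Assumption: the engine is chosen via an "engine" field on the request body,
// with "cheerio", "playwright", and "puppeteer" as the values listed above.
async function scrapeRendered(baseUrl: string, url: string): Promise<unknown> {
  const res = await fetch(`${baseUrl}/v1/scrape`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, engine: "playwright" }), // or "puppeteer" for Chrome-based rendering
  });
  return res.json();
}
```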
How do I generate an API key?
Run `pnpm --filter api key:generate` inside the Docker container or host environment; the command returns a UUID, key, and credit balance.
Is authentication required?
Authentication is optional. Enable it by setting `ANYCRAWL_API_AUTH_ENABLED=true` and provide the generated key as a Bearer token with each request.
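Putting the two steps together, here is a hedged sketch of an authenticated call: the key comes from the pnpm command above and is sent as a Bearer token as described, while the endpoint path and body fields are assumptions.

```typescript
// Sketch of an authenticated call once ANYCRAWL_API_AUTH_ENABLED=true.
// The key comes from `pnpm --filter api key:generate` and is sent as a Bearer token,
// as described above; the endpoint path and body fields are assumptions.
const API_KEY = process.env.ANYCRAWL_API_KEY ?? "<generated-key>";

async function authedScrape(baseUrl: string, url: string): Promise<unknown> {
  const res = await fetch(`${baseUrl}/v1/scrape`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`, // generated key passed as a Bearer token
    },
    body: JSON.stringify({ url, engine: "cheerio" }),
  });
  if (res.status === 401) throw new Error("Missing or invalid API key");
  return res.json();
}
```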
Which search engines does SERP crawling support?
Currently AnyCrawl supports Google; additional engines may be added in future releases.
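A rough sketch of a Google SERP request follows, assuming a /v1/search endpoint that accepts query and engine fields; both the path and the field names should be checked against the documentation.

```typescript
// Sketch of a Google SERP request. Assumptions: a POST /v1/search endpoint that
// accepts { query, engine } with "google" as the engine value; confirm the exact
// path and field names in the documentation for your release.
async function searchGoogle(baseUrl: string, query: string): Promise<unknown> {
  const res = await fetch(`${baseUrl}/v1/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, engine: "google" }),
  });
  if (!res.ok) throw new Error(`Search failed: ${res.status}`);
  return res.json(); // ranked results for keyword analysis
}

searchGoogle("http://localhost:8080", "open source web crawler").then(console.log);
```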
Project at a glance
Status: Active · Last synced 4 days ago