Crawlee

Build fast, human-like web scrapers with a single library

Crawlee provides a unified API for HTTP and headless-browser crawling, automatic proxy rotation, persistent queues, and flexible storage, enabling reliable, scalable scrapers in Node.js.

Overview

Crawlee is a TypeScript‑first library that lets developers build reliable web scrapers and browser‑automation pipelines in Node.js. It targets data engineers, product teams, and researchers who need to collect structured data from the open web, APIs, or rendered pages.

Core capabilities

The library provides a single interface for both raw HTTP requests and headless‑browser crawling with Playwright or Puppeteer. Requests are made to look human‑like: browser fingerprints, TLS fingerprints, and headers are generated automatically, and proxy rotation is built in. Persistent queues support breadth‑first or depth‑first crawling, while pluggable storage adapters let you save tabular results or files locally or to cloud buckets. Hooks and configurable retries give fine‑grained control over the request lifecycle, and the CLI can bootstrap projects with ready‑to‑run examples and Dockerfiles for container deployment.
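
To illustrate the two queue strategies, here is a minimal sketch in plain TypeScript. It does not use Crawlee itself: `UrlQueue`, `crawlOrder`, and the `demoSite` link graph are hypothetical names invented for this example, and the queue lives in memory rather than being persisted. The only idea it shows is that adding newly found links to the tail of the queue gives breadth-first order, while adding them to the head gives depth-first order.

```typescript
// Hypothetical in-memory queue for illustration; this is not Crawlee's queue.
class UrlQueue {
    private pending: string[] = [];
    private seen = new Set<string>();

    // Adds a URL once. With `forefront`, it goes to the head instead of the tail.
    add(url: string, forefront = false): boolean {
        if (this.seen.has(url)) return false;
        this.seen.add(url);
        if (forefront) this.pending.unshift(url);
        else this.pending.push(url);
        return true;
    }

    // Returns the next URL to visit, or undefined when the queue is empty.
    next(): string | undefined {
        return this.pending.shift();
    }
}

type LinkGraph = Record<string, string[]>;

// Walks a made-up link graph and returns the order in which pages are visited.
function crawlOrder(graph: LinkGraph, start: string, depthFirst: boolean): string[] {
    const queue = new UrlQueue();
    const visited: string[] = [];
    queue.add(start);
    let url = queue.next();
    while (url !== undefined) {
        visited.push(url);
        const links = graph[url] ?? [];
        // For depth-first, add links in reverse so the first link ends up at the head.
        const ordered = depthFirst ? links.slice().reverse() : links;
        for (const link of ordered) queue.add(link, depthFirst);
        url = queue.next();
    }
    return visited;
}

// Assumed example site: a small tree of pages.
const demoSite: LinkGraph = {
    '/': ['/a', '/b'],
    '/a': ['/a/1', '/a/2'],
    '/b': ['/b/1'],
};

const breadthFirstOrder = crawlOrder(demoSite, '/', false);
const depthFirstOrder = crawlOrder(demoSite, '/', true);
console.log(breadthFirstOrder.join(' ')); // / /a /b /a/1 /a/2 /b/1
console.log(depthFirstOrder.join(' '));   // / /a /a/1 /a/2 /b /b/1
```

On this small tree the first call visits pages level by level and the second follows each branch to its end before moving on. On graphs with cross links, deduplicating at enqueue time can make the depth-first order differ from a textbook traversal.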

Deployment

Crawlee runs anywhere Node.js 16+ is available – locally, in CI pipelines, or on the Apify platform. Docker images are supplied for easy scaling, and the library integrates smoothly with existing TypeScript or JavaScript codebases.

Highlights

Single interface for HTTP and headless‑browser crawling

Persistent URL queue with breadth‑first and depth‑first options

Pluggable storage for datasets and file assets

Integrated proxy rotation and session management

Pros

Human‑like fingerprinting works out‑of‑the‑box
Supports Playwright and Puppeteer via a unified API
Scales automatically with available system resources
TypeScript‑first with strong typings

Considerations

Requires Node.js 16+; adding Playwright increases install size
Hook‑based lifecycle can add a learning curve
Primarily JavaScript/TypeScript ecosystem (Python separate repo)
High‑volume proxy rotation may need external services
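
For readers new to hook-based lifecycles, the general pattern can be sketched in a few lines of plain TypeScript. This is an assumption-laden illustration, not Crawlee's API: `CrawlRequest`, `LifecycleOptions`, `processWithRetries`, and `flakyHandler` are hypothetical names, and the code is synchronous for brevity where a real crawler would be asynchronous. It shows hooks running around each attempt, a retry counter, and a final callback once retries are exhausted.

```typescript
// All names below are hypothetical; this is not Crawlee's API.
interface CrawlRequest {
    url: string;
    retryCount: number;
}

interface LifecycleOptions {
    maxRetries: number;                                   // extra attempts after the first failure
    preHooks?: Array<(req: CrawlRequest) => void>;        // run before every attempt
    postHooks?: Array<(req: CrawlRequest) => void>;       // run after every successful attempt
    onFailed?: (req: CrawlRequest, error: Error) => void; // run once retries are exhausted
}

// Runs `handler` for one URL and retries whenever a hook or the handler throws.
function processWithRetries(
    url: string,
    handler: (req: CrawlRequest) => void,
    options: LifecycleOptions,
): boolean {
    const req: CrawlRequest = { url, retryCount: 0 };
    while (true) {
        try {
            for (const hook of options.preHooks ?? []) hook(req);
            handler(req);
            for (const hook of options.postHooks ?? []) hook(req);
            return true;
        } catch (err) {
            const error = err instanceof Error ? err : new Error(String(err));
            if (req.retryCount >= options.maxRetries) {
                options.onFailed?.(req, error);
                return false;
            }
            req.retryCount += 1;
        }
    }
}

const lifecycleLog: string[] = [];

// Hypothetical handler that fails on its first two attempts.
const flakyHandler = (req: CrawlRequest): void => {
    if (req.retryCount < 2) throw new Error('temporary failure');
    lifecycleLog.push(`handled ${req.url} after ${req.retryCount} retries`);
};

const succeeded = processWithRetries('https://example.com/', flakyHandler, {
    maxRetries: 3,
    preHooks: [(req) => lifecycleLog.push(`attempt ${req.retryCount + 1}`)],
    onFailed: (req, error) => lifecycleLog.push(`gave up on ${req.url}: ${error.message}`),
});
console.log(succeeded, lifecycleLog);
```

Here the handler succeeds on the third attempt, so the log holds three "attempt" entries followed by one "handled" entry and `onFailed` is never called.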

Managed products teams compare with

When teams consider Crawlee, these hosted platforms usually appear on the same shortlist.

Apify

Web automation & scraping platform powered by serverless Actors

Browserbase

Cloud platform for running and scaling headless web browsers, enabling reliable browser automation and scraping at scale

Browserless

Headless browser platform & APIs for Puppeteer/Playwright with autoscaling

Fit guide

Great for

Developers building reliable web scrapers in Node.js
Teams collecting data for AI or LLM training pipelines
Projects needing both HTTP and rendered‑page crawling
Users who want CLI bootstrap and Docker deployment

Not ideal when

The environment is limited to Python, with no Node.js available
A one‑off script does not need a heavyweight browser
Legacy systems cannot upgrade to Node.js 16+
The use case demands ultra‑low latency with no room for browser overhead

How teams use it

E‑commerce price monitoring

Continuously extract product listings, prices, and availability, storing results in a dataset for price‑trend analysis.

Content archiving for research

Download HTML, PDFs, and images from scholarly sites, preserving original files in cloud storage.

LLM training data collection

Collect large corpora of web text, JSON API responses, and screenshots to feed LLM training and retrieval‑augmented generation pipelines.

Automated UI testing

Use PlaywrightCrawler to render pages, capture screenshots, and verify element presence across browsers.

Tech snapshot

TypeScript: 61%

MDX: 29%

JavaScript: 7%

CSS: 1%

Dockerfile: 1%

Python: 1%

Frequently asked questions

What Node.js version is required?

Crawlee requires Node.js 16 or higher.

Does Crawlee include a browser engine?

No. Browser crawling relies on Playwright or Puppeteer, which must be installed separately.

Can I run Crawlee in Docker?

Yes, Dockerfiles are provided for containerized deployment.

How does proxy rotation work?

Crawlee integrates proxy rotation and session management, configurable via its API.
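
As a rough illustration of how rotation and sessions fit together, the sketch below rotates through a list of proxies round-robin and pins each session ID to one proxy. It is written in plain TypeScript and does not use Crawlee's own classes: `RotatingProxies`, its methods, and the `proxy-N.example.com` URLs are hypothetical and stand in for whatever your configuration provides.

```typescript
// Hypothetical helper for illustration; this is not Crawlee's proxy API.
class RotatingProxies {
    private readonly proxyUrls: string[];
    private nextIndex = 0;
    private bySession = new Map<string, string>();

    constructor(proxyUrls: string[]) {
        if (proxyUrls.length === 0) throw new Error('At least one proxy URL is required');
        this.proxyUrls = proxyUrls;
    }

    // Without a session ID, proxies rotate round-robin.
    // With one, the same session keeps getting the same proxy.
    pick(sessionId?: string): string {
        if (sessionId !== undefined) {
            const pinned = this.bySession.get(sessionId);
            if (pinned !== undefined) return pinned;
        }
        const url = this.proxyUrls[this.nextIndex % this.proxyUrls.length];
        this.nextIndex += 1;
        if (sessionId !== undefined) this.bySession.set(sessionId, url);
        return url;
    }

    // Forgets a session (for example after a block) so that the next pick
    // for that ID is taken from the rotation again.
    retire(sessionId: string): void {
        this.bySession.delete(sessionId);
    }
}

// Assumed placeholder proxy URLs.
const proxies = new RotatingProxies([
    'http://proxy-1.example.com:8000',
    'http://proxy-2.example.com:8000',
]);

const firstPick = proxies.pick();           // proxy-1
const sessionPick = proxies.pick('user-a'); // proxy-2, now pinned to user-a
const samePick = proxies.pick('user-a');    // still proxy-2
const thirdPick = proxies.pick();           // rotation wraps back to proxy-1
console.log(firstPick, sessionPick, samePick, thirdPick);
```

Pinning a session to one proxy keeps cookies and IP address consistent for a simulated user, while requests without a session spread evenly across the pool.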

Is there a Python version?

Yes. A separate Python implementation is maintained in its own repository under the same project name.

Project at a glance

Active

Stars: 22,060
Watchers: 22,060
Forks: 1,235

License: Apache-2.0

Repo age: 9 years old

Last commit: 2 days ago

Primary language: TypeScript

