WebMagic

Scalable Java crawler framework with flexible API and annotations

WebMagic provides a high‑performance Java crawling engine, supporting multi‑threaded and distributed scraping, simple HTML extraction APIs, and POJO‑based annotations for rapid spider development.

Overview

WebMagic is a Java‑based crawling framework designed for scalability and ease of use. It handles the full spider lifecycle—from URL scheduling and downloading to content extraction and persistence—allowing developers to focus on business logic rather than infrastructure.
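As a concrete sketch of that lifecycle, here is a minimal spider modeled on WebMagic's canonical GitHub example: the framework handles scheduling, downloading, and threading, while the `PageProcessor` only declares what to extract and which links to follow. The XPath and regex below are illustrative and tied to GitHub's markup, which changes over time.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

// Minimal spider: WebMagic drives URL scheduling and downloading;
// this class only defines extraction and link-following rules.
public class GithubRepoPageProcessor implements PageProcessor {

    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue further repository pages discovered on this page.
        page.addTargetRequests(
                page.getHtml().links().regex("(https://github\\.com/[\\w-]+/[\\w-]+)").all());
        // Extract fields; they are handed to the configured Pipeline(s).
        page.putField("author",
                page.getUrl().regex("https://github\\.com/([\\w-]+)/.*").toString());
        page.putField("name",
                page.getHtml().xpath("//strong[@itemprop='name']/a/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft/webmagic")
                .thread(5)   // built-in multi-threading
                .run();
    }
}
```

With no pipeline registered, extracted fields go to the default console pipeline, which is convenient while prototyping.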

Capabilities

The core offers a lightweight, highly flexible API for HTML parsing, while POJO annotations let you define crawlers without XML or code configuration. Built‑in multi‑threading and optional distributed execution (via Redis, Kafka, etc.) enable high‑throughput scraping across multiple machines. Integration is straightforward through Maven dependencies and SLF4J logging, and custom pipelines can be added to store results in databases, files, or message queues.
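The annotation model can be sketched as follows, closely following WebMagic's documented GitHub-repository example: a plain POJO annotated with `@TargetUrl` and `@ExtractBy` becomes a crawler definition, with no processor class or XML. The XPath expressions mirror the project's own sample and may need updating for current GitHub markup.

```java
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

// A POJO annotated as a page model: fields are filled from matched pages.
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='entry-title public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        // OOSpider wires the annotated model into a regular Spider.
        OOSpider.create(Site.me().setSleepTime(1000), GithubRepo.class)
                .addUrl("https://github.com/code4craft")
                .thread(5)
                .run();
    }
}
```

`@HelpUrl` pages (here, user profiles) are crawled only to discover links to `@TargetUrl` pages, which are the ones actually extracted into `GithubRepo` instances.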

Deployment

Deploying a WebMagic spider can be as simple as running a Java main class locally, or scaling out by packaging the crawler into a Docker container and orchestrating multiple instances with Kubernetes. The framework is Apache‑2.0 licensed, and its active community provides extensive documentation, sample projects, and extensions for storage back‑ends such as MySQL, MongoDB, and Elasticsearch.

Highlights

Simple core with high flexibility
POJO annotation for configuration‑free crawlers
Built‑in multi‑thread and distributed support
Easy Maven integration and SLF4J logging

Pros

  • High‑performance multi‑threaded crawling
  • Flexible API for custom extraction logic
  • Annotation model reduces boilerplate code
  • Strong community, documentation, and examples

Considerations

  • Java‑only limits language choice
  • Distributed mode requires external coordination (e.g., Redis)
  • Learning curve for annotation syntax
  • Limited built‑in storage adapters; custom pipelines often needed

Managed products teams compare with

When teams consider WebMagic, these hosted platforms usually appear on the same shortlist.

Apify

Web automation & scraping platform powered by serverless Actors

Browserbase

Cloud platform for running and scaling headless web browsers, enabling reliable browser automation and scraping at scale

Browserless

Headless browser platform & APIs for Puppeteer/Playwright with autoscaling

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Java developers building large‑scale web scrapers
  • Teams that need rapid prototype spiders with minimal configuration
  • Projects requiring distributed crawling across multiple machines
  • Applications preferring annotation‑driven crawler definitions

Not ideal when

  • Environments that use languages other than Java
  • Simple one‑off scripts where a heavyweight framework is overkill
  • Real‑time pipelines demanding ultra‑low latency processing
  • Users seeking an out‑of‑the‑box cloud‑hosted crawling service

How teams use it

GitHub repository metadata extraction

Collect author, repository name, and README content for analytics dashboards

E‑commerce price monitoring

Scrape product pages across multiple sites and store pricing data for competitive analysis

News aggregation

Continuously fetch headlines and article bodies to feed a news portal

Academic paper harvesting

Extract titles, authors, and abstracts from conference websites for research databases

Tech snapshot

Java 77%
HTML 23%
JavaScript 1%
Kotlin 1%
Ruby 1%
Groovy 1%

Tags

framework, java, scraping, crawler

Frequently asked questions

How can I add custom pipelines for result storage?

Implement the Pipeline interface and register an instance with the Spider via addPipeline().
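A minimal sketch of such a pipeline (the class name `CsvPipeline` and the CSV output are illustrative choices, not part of WebMagic): it receives every field collected with `page.putField(...)` and decides where the results go.

```java
import java.util.Map;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

// A custom Pipeline receives the fields collected via page.putField(...).
public class CsvPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Join all extracted field values into one comma-separated row.
        StringBuilder row = new StringBuilder();
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            if (row.length() > 0) {
                row.append(',');
            }
            row.append(entry.getValue());
        }
        // Replace this with a database insert, file write, or queue publish.
        System.out.println(row);
    }
}
```

Registration is one call on the spider builder, e.g. `Spider.create(processor).addPipeline(new CsvPipeline()).run()`; multiple pipelines can be chained.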

Does WebMagic support proxy rotation?

Yes. Older releases exposed setHttpProxyPool on the Site object; newer releases configure a ProxyProvider on the HttpClientDownloader, or you can supply a custom Downloader that handles proxies.
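A sketch of the downloader-based approach, assuming the SimpleProxyProvider API from WebMagic's extension module (hosts, ports, and the inline processor are placeholders):

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class ProxyExample {
    public static void main(String[] args) {
        // SimpleProxyProvider rotates requests across a fixed pool of proxies
        // (the addresses below are placeholders).
        HttpClientDownloader downloader = new HttpClientDownloader();
        downloader.setProxyProvider(SimpleProxyProvider.from(
                new Proxy("127.0.0.1", 1080),
                new Proxy("127.0.0.1", 1081)));

        Spider.create(new PageProcessor() {
            @Override
            public void process(Page page) {
                page.putField("title",
                        page.getHtml().xpath("//title/text()").toString());
            }

            @Override
            public Site getSite() {
                return Site.me().setSleepTime(1000);
            }
        })
        .setDownloader(downloader)   // all requests now go through the proxy pool
        .addUrl("https://github.com/code4craft/webmagic")
        .run();
    }
}
```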

Can the framework run in a distributed cluster?

Yes. Extensions allow using Redis, Kafka, or other queues to coordinate multiple crawler instances.
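For example, with the RedisScheduler from the extension module, each instance on each machine points at the same Redis server, which holds the shared URL queue and the duplicate-removal set (the hostname and the inline processor below are placeholders):

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.RedisScheduler;

public class DistributedCrawler {
    public static void main(String[] args) {
        Spider.create(new PageProcessor() {
            @Override
            public void process(Page page) {
                page.putField("title",
                        page.getHtml().xpath("//title/text()").toString());
            }

            @Override
            public Site getSite() {
                return Site.me().setSleepTime(1000);
            }
        })
        // Shared queue + dedup set in Redis; run the same jar on every node.
        .setScheduler(new RedisScheduler("redis-host"))
        .addUrl("https://github.com/code4craft/webmagic")
        .thread(5)
        .run();
    }
}
```

Because the scheduler is the only shared state, adding capacity is just starting another identical process against the same Redis.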

What logging framework does WebMagic use?

It uses SLF4J, so you can plug in any backend such as Log4j2, Logback, or java.util.logging.

Is there a graphical interface for managing spiders?

The core library does not include a GUI, but the related Gather Platform provides a web console for configuration and management.

Project at a glance

Status: Active
Stars: 11,689
Watchers: 11,689
Forks: 4,160
License: Apache-2.0
Repo age: 12 years
Last commit: last month
Primary language: Java
