

Scalable Java crawler framework with flexible API and annotations
WebMagic provides a high‑performance Java crawling engine, supporting multi‑threaded and distributed scraping, simple HTML extraction APIs, and POJO‑based annotations for rapid spider development.

WebMagic is a Java‑based crawling framework designed for scalability and ease of use. It handles the full spider lifecycle—from URL scheduling and downloading to content extraction and persistence—allowing developers to focus on business logic rather than infrastructure.
The core offers a lightweight, highly flexible API for HTML parsing, while annotation-based POJO mapping lets you define extraction rules declaratively, without separate configuration files or extraction boilerplate. Built-in multi-threading and optional distributed execution (via Redis, Kafka, etc.) enable high-throughput scraping across multiple machines. Integration is straightforward through Maven dependencies and SLF4J logging, and custom pipelines can be added to store results in databases, files, or message queues.
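For example, a minimal spider only needs to implement the PageProcessor interface. The sketch below follows the shape of WebMagic's canonical GitHub-crawling example; the class name and XPath/regex selectors are illustrative and would need adjusting to the real page structure.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoProcessor implements PageProcessor {

    // Per-site settings: retry count, politeness delay between requests, etc.
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Extract fields with the built-in regex/XPath selectors (selectors are illustrative)
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1/strong/a/text()").toString());
        // Queue further repository links discovered on this page
        page.addTargetRequests(page.getHtml().links().regex("https://github\\.com/\\w+/\\w+").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoProcessor())
              .addUrl("https://github.com/code4craft/webmagic")
              .thread(5)   // built-in multi-threading
              .run();
    }
}
```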
Deploying a WebMagic spider can be as simple as running a Java main class locally; to scale out, the crawler can be packaged into a Docker container and orchestrated as multiple instances with Kubernetes. The framework's Apache-2.0 license and active community provide extensive documentation, sample projects, and extensions for storage back-ends such as MySQL, MongoDB, and Elasticsearch.
When teams consider WebMagic, hosted platforms usually appear on the same shortlist; Apify, a web automation and scraping platform powered by serverless Actors, is one of the services engineering teams benchmark against before choosing open source.
Typical WebMagic use cases include:
GitHub repository metadata extraction
Collect author, repository name, and README content for analytics dashboards
E‑commerce price monitoring
Scrape product pages across multiple sites and store pricing data for competitive analysis
News aggregation
Continuously fetch headlines and article bodies to feed a news portal
Academic paper harvesting
Extract titles, authors, and abstracts from conference websites for research databases
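The GitHub repository metadata use case above maps naturally onto WebMagic's annotation mode: a plain POJO annotated with @TargetUrl and @ExtractBy describes both the pages to visit and the fields to extract, and OOSpider (from the webmagic-extension module) drives the crawl. The sketch below mirrors the project's own annotation example; the XPath expressions are illustrative.

```java
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import us.codecraft.webmagic.pipeline.ConsolePageModelPipeline;

// Pages matching @TargetUrl are mapped onto this POJO; @HelpUrl pages are crawled only for links.
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1/strong/a/text()", notNull = true)  // repository name (illustrative XPath)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")             // author taken from the URL
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")                // README body as plain text
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000),
                        new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft/webmagic")
                .thread(5)
                .run();
    }
}
```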
Frequently asked questions
How do I store scraped results in a custom back-end?
Implement the Pipeline interface and register the instance with Spider using addPipeline() (a minimal sketch follows after these questions).
Does WebMagic support HTTP proxies?
Yes, configure the Site object with setHttpProxyPool or provide a custom Downloader that handles proxies.
Can crawls be distributed across multiple machines?
Yes, extensions allow using Redis, Kafka, or other queues to coordinate multiple crawler instances.
Which logging framework does WebMagic use?
It uses SLF4J; you can plug in any backend such as Log4j2, Logback, or java.util.logging.
Does WebMagic provide a graphical interface?
The core library does not include a GUI, but the related Gather Platform provides a web console for configuration and management.
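As a concrete illustration of the first answer, a custom pipeline implements Pipeline's single process method. The hypothetical example below just prints each extracted field (a real one might write to MySQL, MongoDB, or a message queue), reuses the GithubRepoProcessor from the earlier sketch, and shows where a Redis-backed scheduler from webmagic-extension could be plugged in for distributed runs; the host name is a placeholder.

```java
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import us.codecraft.webmagic.scheduler.RedisScheduler;

// Hypothetical pipeline: prints every field stored via page.putField() in the processor.
public class ConsoleReportPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        resultItems.getAll().forEach((field, value) ->
                System.out.println(task.getUUID() + " | " + field + " = " + value));
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoProcessor())               // processor from the earlier sketch
              .addUrl("https://github.com/code4craft/webmagic")
              .addPipeline(new ConsoleReportPipeline())        // register the custom pipeline
              .setScheduler(new RedisScheduler("localhost"))   // optional: shared URL queue for distributed runs (placeholder host)
              .thread(5)
              .run();
    }
}
```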
Project at a glance
Active; last synced 4 days ago