

Scalable, extensible Java web crawler for large‑scale data collection
Apache Nutch is a Java‑based, highly extensible web crawler that scales from single machines to Hadoop clusters, offering plugin support, flexible configuration, and integration with popular IDEs.

Apache Nutch is a Java‑based web crawler designed for both small‑scale projects and massive distributed crawls. Its modular plugin system lets developers add custom parsers, fetchers, and indexers, while the core engine handles URL scheduling, duplicate detection, and politeness.
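As an illustration of the plugin mechanism, the sketch below shows a custom URL filter that drops URLs under a hypothetical /private/ path so they are never scheduled for fetching. It is written against the URLFilter extension point of the Nutch 1.x plugin API; the class name and package are placeholders, the plugin.xml descriptor and build wiring are omitted, and the exact interfaces may differ between Nutch versions.

```java
package org.example.nutch.filters; // hypothetical package for this sketch

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/**
 * Example URL filter plugin: rejects URLs containing "/private/" so the
 * scheduler never adds them to the crawl frontier.
 */
public class PrivateAreaUrlFilter implements URLFilter {

  private Configuration conf;

  @Override
  public String filter(String urlString) {
    // Returning null tells Nutch to discard the URL; returning the
    // (possibly rewritten) URL keeps it in the crawl.
    if (urlString != null && urlString.contains("/private/")) {
      return null;
    }
    return urlString;
  }

  // Configuration hooks used by the plugin framework.
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}
```

A plugin like this is packaged with a plugin.xml descriptor, placed in the plugin folder, and activated by adding its id to the plugin.includes property.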
Nutch runs on a single JVM or integrates with Hadoop for parallel crawling across clusters. Configuration is driven by nutch-site.xml, where you define the user‑agent, plugin folders, and protocol settings. Build and execution are managed through Ant scripts, and the project can be imported into IDEs such as Eclipse or IntelliJ IDEA for debugging and extension. The crawler produces a crawl database, link database, and segment data that can be indexed into search platforms such as Solr or Elasticsearch.
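For example, a minimal nutch-site.xml might look like the snippet below. The property names follow the Nutch 1.x defaults (nutch-default.xml) and the values are placeholders for illustration; check both against the documentation for your release.

```xml
<?xml version="1.0"?>
<!-- Minimal nutch-site.xml sketch; values are placeholders. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyExampleCrawler</value>
    <description>User-agent name sent to web servers (required).</description>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>plugins</value>
    <description>Directory containing the compiled plugins.</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|indexer-solr</value>
    <description>Regular expression selecting which plugins are activated.</description>
  </property>
</configuration>
```

After building with Ant, a crawl is typically driven step by step with the bin/nutch command-line tools (inject, generate, fetch, parse, updatedb, index) or with the bundled crawl script.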
Designed for developers, researchers, and enterprises that need fine‑grained control over crawl behavior, Nutch provides a robust foundation for building custom search solutions, data‑mining pipelines, and web archives.
When teams consider Apache Nutch, hosted platforms usually appear on the same shortlist; these are the services that engineering teams benchmark against before choosing open source.
Apify
Web automation & scraping platform powered by serverless Actors
Academic web‑graph research
Generate a comprehensive link graph for citation analysis
E‑commerce price monitoring
Continuously crawl competitor product pages and feed pricing engine
News aggregation
Harvest articles from thousands of news sites for a custom portal
Digital archiving
Capture and store historical web snapshots for preservation
What language is Apache Nutch written in?
Nutch is written in Java and runs on any JVM.
Can Nutch run without Hadoop?
Yes, it can be executed locally without Hadoop for small crawls.
How does Nutch scale to large crawls?
It integrates with Hadoop MapReduce, allowing crawl tasks to be parallelized across a cluster.
How is Nutch configured?
Configuration is managed through the `nutch-site.xml` file and plugin directories.
Does Nutch include a graphical monitoring interface?
Nutch provides command‑line tools; visual monitoring typically requires integration with external systems like Solr or custom dashboards.
Project at a glance
Active · Last synced 4 days ago