AutoScraper

Automatic, fast, lightweight web scraper that learns from examples

AutoScraper lets you build a scraper by providing a URL and a few example values. It automatically learns extraction rules, enabling fast, repeatable data collection without writing XPath or CSS selectors.

Overview

AutoScraper is a Python library that builds a scraper from a URL (or raw HTML) and a short list of values you expect to find on the page. It infers the underlying HTML patterns and produces a model that can later retrieve the same type of data from other pages with the same structure.

The API provides two retrieval modes: get_result_similar returns elements that match the pattern (useful for lists such as article titles), while get_result_exact returns values in the exact order you supplied (ideal for single‑field data like a stock price). Trained models can be saved to disk and re‑loaded, allowing you to reuse extraction logic across projects. Custom request arguments let you add proxies, headers, or other requests options without modifying the core code.
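
A minimal sketch of that workflow (the URL and sample value below are placeholders; swap in a page you actually want to scrape):

```python
from autoscraper import AutoScraper

scraper = AutoScraper()

# Teach the scraper with one page and a few values you can see on it.
scraper.build('https://example.com/articles', wanted_list=['First article title'])

# Similar mode: every element matching the learned pattern,
# e.g. all article titles on a structurally identical page.
titles = scraper.get_result_similar('https://example.com/articles?page=2')

# Exact mode: values in the same order as the training examples.
first_title = scraper.get_result_exact('https://example.com/articles')
```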

AutoScraper targets developers, data scientists, and small teams that need quick, reliable scraping without the overhead of writing XPath or CSS selectors, and it runs in any Python 3 environment.

Highlights

Learns extraction rules from a few sample values
Supports both similar and exact result retrieval
Saves and loads trained models for reuse
Accepts custom request parameters (proxies, headers, etc.)

Pros

  • No need to write XPath or CSS selectors
  • Works with static HTML using standard requests
  • Lightweight dependency footprint
  • Simple API for rapid prototyping

Considerations

  • Cannot scrape content rendered by JavaScript
  • Limited to pattern‑based extraction, not full crawling
  • Model may need retraining when page layout changes
  • No built‑in concurrency or async support

Managed products teams compare with

When teams consider AutoScraper, these hosted platforms usually appear on the same shortlist.

Apify

Web automation & scraping platform powered by serverless Actors

Browserbase

Cloud platform for running and scaling headless web browsers, enabling reliable browser automation and scraping at scale

Browserless

Headless browser platform & APIs for Puppeteer/Playwright with autoscaling

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Developers who need quick data extraction without learning selector syntax
  • Data scientists prototyping web‑data pipelines
  • Small to medium projects that prefer a lightweight dependency footprint
  • Teams building internal APIs over static websites

Not ideal when

  • Scraping heavily JavaScript‑driven sites that require a headless browser
  • Large‑scale crawling across thousands of domains
  • Real‑time, high‑concurrency scraping workloads
  • Projects that need advanced anti‑bot evasion or CAPTCHA handling

How teams use it

Gather related StackOverflow question titles

Generate a list of similar question titles from any StackOverflow page with a single function call.
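
This mirrors the quick-start example in the project README; the sample title is one that appears on the training page:

```python
from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
wanted_list = ['What are metaclasses in Python?']  # a title visible on that page

scraper = AutoScraper()
scraper.build(url, wanted_list)

# The learned rule now pulls related-question titles from any question page.
print(scraper.get_result_similar(
    'https://stackoverflow.com/questions/606191/convert-bytes-to-a-string'))
```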

Fetch live stock price and market cap

Retrieve current price, market capitalization, or other ticker data from Yahoo Finance by providing a sample value.
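
A sketch based on the README's Yahoo Finance example; the sample price is whatever the page showed at build time, so replace it with a value you currently see:

```python
from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

scraper = AutoScraper()
# '124.81' stands in for the price shown on the page when you train.
scraper.build(url, wanted_list=['124.81'])

# Exact mode suits single fields: same rule, different ticker.
print(scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/'))
```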

Extract GitHub repository metadata

Collect repository description, star count, and issues link for any GitHub repo without writing custom parsers.
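
A sketch in the same spirit; the sample values must match what the repository page displays when you train (the star count in particular changes over time):

```python
from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'
wanted_list = [
    'A Smart, Automatic, Fast and Lightweight Web Scraper for Python',  # description
    '6.2k',                                                             # star count at build time
    'https://github.com/alirezamika/autoscraper/issues',                # issues link
]

scraper = AutoScraper()
# build() returns the learned values in training order.
print(scraper.build(url, wanted_list))
```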

Create a lightweight web API

Wrap AutoScraper in Flask to expose an endpoint that returns scraped data on demand, enabling rapid API development.
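
A minimal Flask sketch, assuming a model was trained and saved earlier as 'so-titles' (the model name and route are illustrative):

```python
from flask import Flask, jsonify, request
from autoscraper import AutoScraper

app = Flask(__name__)

scraper = AutoScraper()
scraper.load('so-titles')  # model previously saved with scraper.save('so-titles')

@app.route('/titles')
def titles():
    # e.g. GET /titles?url=https://stackoverflow.com/questions/2081586/...
    url = request.args.get('url')
    return jsonify(scraper.get_result_similar(url))

if __name__ == '__main__':
    app.run()
```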

Tech snapshot

Python 100%

Tags

ai, automation, scrape, webscraping, machine-learning, artificial-intelligence, webautomation, python, scraper, scraping, web-scraping, crawler

Frequently asked questions

How does AutoScraper infer extraction rules from a few examples?

It parses the HTML of the supplied page, locates the elements containing the example values, and extracts surrounding tag patterns and attributes to build a reusable model.

Can AutoScraper handle pages that load content with JavaScript?

AutoScraper works on the static HTML returned by the request. For JavaScript‑rendered content you need to fetch the rendered HTML yourself (e.g., with Selenium) before passing it to the library.
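
For example, with Selenium (a sketch; the URL and sample value are placeholders):

```python
from autoscraper import AutoScraper
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/js-rendered-page')
html = driver.page_source  # HTML after JavaScript has run
driver.quit()

scraper = AutoScraper()
# Pass the rendered markup directly instead of a URL.
scraper.build(html=html, wanted_list=['A value visible after rendering'])
results = scraper.get_result_similar(html=html)
```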

How do I persist a trained scraper for later use?

Use the `save(filepath)` method to write the model to disk and `load(filepath)` to restore it in a new session.
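
For instance (the file name is arbitrary):

```python
# After training:
scraper.save('article-titles')

# In a later session:
from autoscraper import AutoScraper
scraper = AutoScraper()
scraper.load('article-titles')
```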

Which Python versions are supported?

The library is compatible with Python 3.x (tested on 3.7 and newer).

Is it possible to customize request headers or use proxies?

Yes, you can pass a `request_args` dictionary to `build`, containing any arguments accepted by the `requests` library, such as `headers`, `proxies`, or `auth`.
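
For example (the proxy address and header values are placeholders):

```python
proxies = {
    'http': 'http://127.0.0.1:8001',
    'https': 'https://127.0.0.1:8001',
}
headers = {'User-Agent': 'my-scraper/1.0'}

# request_args is passed through to the underlying requests call.
scraper.build(url, wanted_list, request_args={'proxies': proxies, 'headers': headers})
```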

Project at a glance

Status: Stable
Stars: 7,074
Watchers: 7,074
Forks: 713
License: MIT
Repo age: 5 years
Last commit: 7 months ago
Primary language: Python
