AutoScraper

Automatic, fast, lightweight web scraper that learns from examples

AutoScraper lets you build a scraper by providing a URL and a few example values. It automatically learns extraction rules, enabling fast, repeatable data collection without writing XPath or CSS selectors.

Overview

AutoScraper is a Python library that builds a scraper from a URL (or raw HTML) and a short list of values you expect to find on the page. It infers the underlying HTML patterns and produces a model that can later retrieve the same type of data from other pages with the same structure.

The API provides two retrieval modes: get_result_similar returns elements that match the pattern (useful for lists such as article titles), while get_result_exact returns values in the exact order you supplied (ideal for single‑field data like a stock price). Trained models can be saved to disk and re‑loaded, allowing you to reuse extraction logic across projects. Custom request arguments let you add proxies, headers, or other requests options without modifying the core code.
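
A minimal sketch of that workflow (the URL and sample value below are placeholders; swap in a page you actually want to scrape):

```python
from autoscraper import AutoScraper

scraper = AutoScraper()

# Teach the scraper with one page and a few values you can see on it.
scraper.build('https://example.com/articles', wanted_list=['First article title'])

# Similar mode: every element matching the learned pattern,
# e.g. all article titles on a structurally identical page.
titles = scraper.get_result_similar('https://example.com/articles?page=2')

# Exact mode: values in the same order as the training examples.
first_title = scraper.get_result_exact('https://example.com/articles')
```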

AutoScraper targets developers, data scientists, and small teams that need quick, reliable scraping without the overhead of writing XPath or CSS selectors, and it runs in any Python 3 environment.

Highlights

Learns extraction rules from a few sample values
Supports both similar and exact result retrieval
Saves and loads trained models for reuse
Accepts custom request parameters (proxies, headers, etc.)

Pros

  • No need to write XPath or CSS selectors
  • Works with static HTML using standard requests
  • Lightweight dependency footprint
  • Simple API for rapid prototyping

Considerations

  • Cannot scrape content rendered by JavaScript
  • Limited to pattern‑based extraction, not full crawling
  • Model may need retraining when page layout changes
  • No built‑in concurrency or async support

Managed products teams compare with

When teams consider AutoScraper, these hosted platforms usually appear on the same shortlist.

Apify

Web automation & scraping platform powered by serverless Actors

Browserbase

Cloud platform for running and scaling headless web browsers, enabling reliable browser automation and scraping at scale

Browserless

Headless browser platform & APIs for Puppeteer/Playwright with autoscaling

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Developers who need quick data extraction without learning selector syntax
  • Data scientists prototyping web‑data pipelines
  • Small to medium projects that prefer a lightweight dependency footprint
  • Teams building internal APIs over static websites

Not ideal when

  • Scraping heavily JavaScript‑driven sites that require a headless browser
  • Large‑scale crawling across thousands of domains
  • Real‑time, high‑concurrency scraping workloads
  • Projects that need advanced anti‑bot evasion or CAPTCHA handling

How teams use it

Gather related StackOverflow question titles

Generate a list of similar question titles from any StackOverflow page with a single function call.
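
This mirrors the quick-start example in the project README; the sample title is one that appears on the training page:

```python
from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
wanted_list = ['What are metaclasses in Python?']  # a title visible on that page

scraper = AutoScraper()
scraper.build(url, wanted_list)

# The learned rule now pulls related-question titles from any question page.
print(scraper.get_result_similar(
    'https://stackoverflow.com/questions/606191/convert-bytes-to-a-string'))
```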

Fetch live stock price and market cap

Retrieve current price, market capitalization, or other ticker data from Yahoo Finance by providing a sample value.
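
A sketch based on the README's Yahoo Finance example; the sample price is whatever the page showed at build time, so replace it with a value you currently see:

```python
from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

scraper = AutoScraper()
# '124.81' stands in for the price shown on the page when you train.
scraper.build(url, wanted_list=['124.81'])

# Exact mode suits single fields: same rule, different ticker.
print(scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/'))
```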

Extract GitHub repository metadata

Collect repository description, star count, and issues link for any GitHub repo without writing custom parsers.
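
A sketch in the same spirit; the sample values must match what the repository page displays when you train (the star count in particular changes over time):

```python
from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'
wanted_list = [
    'A Smart, Automatic, Fast and Lightweight Web Scraper for Python',  # description
    '6.2k',                                                             # star count at build time
    'https://github.com/alirezamika/autoscraper/issues',                # issues link
]

scraper = AutoScraper()
# build() returns the learned values in training order.
print(scraper.build(url, wanted_list))
```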

Create a lightweight web API

Wrap AutoScraper in Flask to expose an endpoint that returns scraped data on demand, enabling rapid API development.
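
A minimal Flask sketch, assuming a model was trained and saved earlier as 'so-titles' (the model name and route are illustrative):

```python
from flask import Flask, jsonify, request
from autoscraper import AutoScraper

app = Flask(__name__)

scraper = AutoScraper()
scraper.load('so-titles')  # model previously saved with scraper.save('so-titles')

@app.route('/titles')
def titles():
    # e.g. GET /titles?url=https://stackoverflow.com/questions/2081586/...
    url = request.args.get('url')
    return jsonify(scraper.get_result_similar(url))

if __name__ == '__main__':
    app.run()
```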

Tech snapshot

Python 100%

Tags

ai, automation, scrape, webscraping, machine-learning, artificial-intelligence, webautomation, python, scraper, scraping, web-scraping, crawler

Frequently asked questions

How does AutoScraper infer extraction rules from a few examples?

It parses the HTML of the supplied page, locates the elements containing the example values, and extracts surrounding tag patterns and attributes to build a reusable model.

Can AutoScraper handle pages that load content with JavaScript?

AutoScraper works on the static HTML returned by the request. For JavaScript‑rendered content you need to fetch the rendered HTML yourself (e.g., with Selenium) before passing it to the library.
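
For example, with Selenium (a sketch; the URL and sample value are placeholders):

```python
from autoscraper import AutoScraper
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/js-rendered-page')
html = driver.page_source  # HTML after JavaScript has run
driver.quit()

scraper = AutoScraper()
# Pass the rendered markup directly instead of a URL.
scraper.build(html=html, wanted_list=['A value visible after rendering'])
results = scraper.get_result_similar(html=html)
```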

How do I persist a trained scraper for later use?

Use the `save(filepath)` method to write the model to disk and `load(filepath)` to restore it in a new session.
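
For instance (the file name is arbitrary):

```python
# After training:
scraper.save('article-titles')

# In a later session:
from autoscraper import AutoScraper
scraper = AutoScraper()
scraper.load('article-titles')
```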

Which Python versions are supported?

The library is compatible with Python 3.x (tested on 3.7 and newer).

Is it possible to customize request headers or use proxies?

Yes, you can pass a `request_args` dictionary to `build`, containing any arguments accepted by the `requests` library, such as `headers`, `proxies`, or `auth`.
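
For example (the proxy address and header values are placeholders):

```python
proxies = {
    'http': 'http://127.0.0.1:8001',
    'https': 'https://127.0.0.1:8001',
}
headers = {'User-Agent': 'my-scraper/1.0'}

# request_args is passed through to the underlying requests call.
scraper.build(url, wanted_list, request_args={'proxies': proxies, 'headers': headers})
```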

Project at a glance

Status: Stable
Stars: 7,074
Watchers: 7,074
Forks: 713
License: MIT
Repo age: 5 years
Last commit: 7 months ago
Primary language: Python
