LLM Scraper

Extract structured data from any webpage using LLMs

LLM Scraper is a TypeScript library that turns any webpage into structured data using LLM function calling, offering full type safety, Playwright integration, and support for major AI model providers.

Overview

LLM Scraper targets developers who need reliable, typed extraction of information from dynamic web pages. By leveraging LLM function calling, it converts arbitrary page content into JSON that matches a developer‑defined schema.

Features

The library works with Playwright to load pages in several modes: HTML, raw HTML, Markdown, plain text, or screenshots for multimodal models. Schemas can be expressed with Zod or JSON Schema, giving you compile‑time type safety. It supports a wide range of model families (OpenAI, Anthropic, Google, Groq, Ollama, etc.) and offers streaming responses as well as code generation that produces reusable Playwright scripts.

Getting Started

Install the package and your chosen AI SDK, launch a Playwright browser, create an LLM instance, define a Zod schema, and call `scraper.run(page, schema)`. The optional `stream` and `generate` methods let you receive partial results or auto‑create scraper code. The MIT‑licensed library works in any Node.js/TypeScript project.
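
The shape of that flow can be sketched offline. The block below is only an illustration: `PageLike`, `Schema`, `Model`, and `run` are hypothetical stand-ins, not the library's API, and the page and model are canned so nothing touches a browser or a network. It shows the three steps a call like `scraper.run(page, schema)` implies: read the page, ask a model for JSON, and validate that JSON so the caller gets a typed result.

```typescript
// Hypothetical stand-ins: none of these names come from LLM Scraper itself.
interface PageLike {
  content(): string; // a real Playwright page returns a Promise<string>
}

interface Schema<T> {
  description: string;
  parse(value: unknown): T; // throws if the value does not match
}

type Model = (prompt: string) => string; // returns a JSON string

// Read the page, ask the model for JSON, then validate the answer.
function run<T>(model: Model, page: PageLike, schema: Schema<T>): { data: T } {
  const prompt = `Extract: ${schema.description}\n\n${page.content()}`;
  const raw = model(prompt);
  return { data: schema.parse(JSON.parse(raw)) };
}

interface Story {
  title: string;
  points: number;
}

const storySchema: Schema<Story> = {
  description: "the top story with its title and points",
  parse(value) {
    const record = (value ?? {}) as Record<string, unknown>;
    const title = record.title;
    const points = record.points;
    if (typeof title !== "string" || typeof points !== "number") {
      throw new Error("model output does not match the Story schema");
    }
    return { title, points };
  },
};

// Canned page and model (made-up sample data) so the sketch runs offline.
const fakePage: PageLike = {
  content: () => "<li>Show HN: LLM Scraper (312 points)</li>",
};
const fakeModel: Model = () => '{"title":"Show HN: LLM Scraper","points":312}';

const { data } = run(fakeModel, fakePage, storySchema);
console.log(data.title, data.points);
```

Because `run` is generic over the schema, `data` is typed as `Story` without a cast, which is the kind of type safety the schema‑driven approach is meant to provide.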

Highlights

Multi‑model support: GPT, Sonnet, Gemini, Llama, Qwen, etc.

Schema definition with Zod or JSON Schema and full TypeScript safety

Playwright‑based page handling with HTML, raw HTML, markdown, text, and image modes

Streaming results and code‑generation for reusable scrapers

Pros

Strong compile‑time type safety
Works with many popular LLM providers
Flexible content formats for diverse sites
Streaming and code‑generation boost productivity

Considerations

Requires Playwright and a headless browser setup
Node.js/TypeScript environment only
Streaming limited to Vercel AI SDK
No built‑in CLI or graphical interface

Managed products teams compare with

When teams consider LLM Scraper, these hosted platforms usually appear on the same shortlist.

Apify

Web automation & scraping platform powered by serverless Actors

Browserbase

Cloud platform for running and scaling headless web browsers, enabling reliable browser automation and scraping at scale

Browserless

Headless browser platform & APIs for Puppeteer/Playwright with autoscaling

Fit guide

Great for

Developers building data pipelines that need structured output
Teams already using Playwright for browser automation
Projects that require LLM‑driven parsing with type safety
Applications needing dynamic extraction from JavaScript‑heavy pages

Not ideal when

Users are non‑technical and have no coding experience
The environment cannot run Node.js or a browser
The use case demands ultra‑low latency and cannot absorb LLM overhead
Regex or DOM selectors are enough for simple static HTML scraping

How teams use it

News aggregation

Extract top stories, scores, authors, and comment links from news sites into a structured JSON feed.

E‑commerce price monitoring

Collect product name, price, availability, and SKU from retailer pages for price‑tracking dashboards.

Research data collection

Gather article titles, authors, abstracts, and publication dates from academic journal websites.

CMS content generation

Convert marketing page sections into structured components (headings, copy, images) for automated CMS population.

Tech snapshot

TypeScript: 100%

Frequently asked questions

Which LLM providers are supported?

LLM Scraper works with OpenAI, Anthropic, Google, Groq, Ollama, and any provider compatible with the Vercel AI SDK.

Do I need a browser to use the library?

Yes, it relies on Playwright to load and interact with pages, so a headless browser instance is required.

How are extraction schemas defined?

Schemas can be written using Zod objects or JSON Schema files, and the library parses results against them.
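
What checking a result against a schema involves can be illustrated with a hand‑rolled validator. This is a sketch under stated assumptions, not how the library or Zod works internally: `ObjectSchema` covers only a tiny, flat subset of JSON Schema, and `validate` and `productSchema` are hypothetical names with made‑up sample data.

```typescript
// A deliberately small subset of JSON Schema: flat objects with typed,
// optionally required properties. Real validators cover far more.
type FieldType = "string" | "number" | "boolean";

interface ObjectSchema {
  type: "object";
  properties: Record<string, { type: FieldType }>;
  required: string[];
}

// Returns a list of human-readable problems; an empty list means valid.
function validate(schema: ObjectSchema, value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["expected an object"];
  }
  const record = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of schema.required) {
    if (!(key in record)) problems.push(`missing required field "${key}"`);
  }
  for (const key of Object.keys(schema.properties)) {
    const expected = schema.properties[key].type;
    if (key in record && typeof record[key] !== expected) {
      problems.push(`field "${key}" should be ${expected}`);
    }
  }
  return problems;
}

const productSchema: ObjectSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    price: { type: "number" },
    inStock: { type: "boolean" },
  },
  required: ["name", "price"],
};

// A well-formed result, and one where the model returned the price as text.
const good = validate(productSchema, { name: "Desk lamp", price: 24.5, inStock: true });
const bad = validate(productSchema, { name: "Desk lamp", price: "24.50" });
console.log(good.length, bad);
```

Returning a list of problems rather than throwing on the first one makes it easier to log everything a model got wrong in a single pass.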

Can I receive partial results while scraping?

Yes, the `stream` method returns a partial object stream (available with the Vercel AI SDK).
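
Consuming partial results mostly means tolerating missing fields. The sketch below is an assumption‑laden illustration, not the library's `stream` API: `partialStories` is a hypothetical synchronous generator standing in for what would really be an asynchronous stream, and the sample data is made up.

```typescript
interface Story {
  title: string;
  points: number;
  by: string;
}

// Hypothetical stand-in for a partial object stream: each yielded value is
// the object as known so far. A real stream would be asynchronous.
function* partialStories(updates: Array<Partial<Story>>): Generator<Partial<Story>> {
  let snapshot: Partial<Story> = {};
  for (const update of updates) {
    snapshot = { ...snapshot, ...update };
    yield snapshot;
  }
}

// Every field may still be missing, so the consumer has to handle gaps.
function render(story: Partial<Story>): string {
  const title = story.title ?? "(title pending)";
  const points = story.points === undefined ? "?" : String(story.points);
  return `${title} [${points} points]`;
}

const lines: string[] = [];
for (const partial of partialStories([{ title: "Show HN" }, { points: 312 }, { by: "alice" }])) {
  lines.push(render(partial));
}
console.log(lines.join("\n"));
```

Typing each snapshot as `Partial<Story>` forces the rendering code to decide what to show before a field has arrived, which is the main practical difference from handling a completed result.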

Is the library free for commercial use?

It is released under the MIT license, which permits commercial use, modification, and distribution.

Project at a glance

Active

Stars: 6,225
Watchers: 6,225
Forks: 370

License: MIT

Repo age: 1 year old

Last commit: 5 days ago

Primary language: TypeScript
