LLM Scraper

Extract structured data from any webpage using LLMs

LLM Scraper is a TypeScript library that turns any webpage into structured data using LLM function calling, offering full type safety, Playwright integration, and support for major AI model providers.

Overview

LLM Scraper targets developers who need reliable, typed extraction of information from dynamic web pages. By leveraging LLM function calling, it converts arbitrary page content into JSON that matches a developer‑defined schema.

Features

The library works with Playwright to load pages in several modes: HTML, raw HTML, Markdown, plain text, or screenshots for multimodal models. Schemas can be expressed with Zod or JSON Schema, giving you compile‑time type safety. It supports a wide range of model providers (OpenAI, Anthropic, Google, Groq, Ollama, etc.), offers streaming responses, and can generate reusable Playwright scripts.

Getting Started

Install the package and your chosen AI SDK, launch a Playwright browser, create an LLM instance, define a Zod schema, and call `scraper.run(page, schema)`. The optional `stream` and `generate` methods let you receive partial results or generate reusable scraper code. The library is MIT‑licensed and works in any Node.js/TypeScript project.
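
The call pattern described above can be sketched without a browser or an API key. This is not the library's own API: `PageLike`, `SchemaLike`, `ScraperLike`, `headlineSchema`, `stubScraper`, and `scrapeHeadline` are hypothetical stand-ins, and the `{ data }` return shape is an assumption. The sketch only illustrates how the schema passed to `run(page, schema)` determines the type of the result.

```typescript
// Hypothetical stand-ins: none of these types come from the library itself.
interface PageLike { url: string }                    // stands in for a Playwright page
interface SchemaLike<T> { parse(input: unknown): T }  // stands in for a Zod schema
interface ScraperLike {                               // stands in for the scraper instance
  run<T>(page: PageLike, schema: SchemaLike<T>): Promise<{ data: T }>;
}

type Headline = { title: string };

// Schema stand-in: rejects anything that is not shaped like { title: string }.
const headlineSchema: SchemaLike<Headline> = {
  parse(input) {
    const title = (input as { title?: unknown } | null)?.title;
    if (typeof title !== "string") throw new Error("schema mismatch");
    return { title };
  },
};

// Stub scraper: returns canned "model output" after validating it with the schema.
const stubScraper: ScraperLike = {
  async run(page, schema) {
    return { data: schema.parse({ title: `Headline from ${page.url}` }) };
  },
};

// The call pattern from the text: run(page, schema) resolves to data typed by the schema.
async function scrapeHeadline(scraper: ScraperLike, page: PageLike): Promise<Headline> {
  const { data } = await scraper.run(page, headlineSchema);
  return data;
}
```

In a real project the stand-ins would be replaced by a Playwright page, a Zod schema, and the scraper instance created from your LLM.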

Highlights

  • Multi‑model support: GPT, Sonnet, Gemini, Llama, Qwen, etc.
  • Schema definition with Zod or JSON Schema and full TypeScript safety
  • Playwright‑based page handling with HTML, raw HTML, Markdown, text, and image modes
  • Streaming results and code‑generation for reusable scrapers

Pros

  • Strong compile‑time type safety
  • Works with many popular LLM providers
  • Flexible content formats for diverse sites
  • Streaming and code‑generation boost productivity

Considerations

  • Requires Playwright and a headless browser setup
  • Node.js/TypeScript environment only
  • Streaming is available only with the Vercel AI SDK
  • No built‑in CLI or graphical interface

Managed products that teams compare it with

When teams consider LLM Scraper, these hosted platforms usually appear on the same shortlist.

Apify

Web automation & scraping platform powered by serverless Actors

Browserbase

Cloud platform for running and scaling headless web browsers, enabling reliable browser automation and scraping at scale

Browserless

Headless browser platform & APIs for Puppeteer/Playwright with autoscaling

Fit guide

Great for

  • Developers building data pipelines that need structured output
  • Teams already using Playwright for browser automation
  • Projects that require LLM‑driven parsing with type safety
  • Applications needing dynamic extraction from JavaScript‑heavy pages

Not ideal when

  • Users are non‑technical and have no coding experience
  • The environment cannot run Node.js or a browser
  • The use case demands ultra‑low latency and cannot absorb LLM overhead
  • The job is simple static HTML scraping where regex or DOM selectors suffice

How teams use it

News aggregation

Extract top stories, scores, authors, and comment links from news sites into a structured JSON feed.

E‑commerce price monitoring

Collect product name, price, availability, and SKU from retailer pages for price‑tracking dashboards.

Research data collection

Gather article titles, authors, abstracts, and publication dates from academic journal websites.

CMS content generation

Convert marketing page sections into structured components (headings, copy, images) for automated CMS population.

Tech snapshot

TypeScript 100%

Tags

llama, gpt, ai, puppeteer, playwright, llm, artificial-intelligence, browser-automation, langchain, browser, scraper, gpt-4, openai

Frequently asked questions

Which LLM providers are supported?

LLM Scraper works with OpenAI, Anthropic, Google, Groq, Ollama, and any provider compatible with the Vercel AI SDK.

Do I need a browser to use the library?

Yes, it relies on Playwright to load and interact with pages, so a headless browser instance is required.

How are extraction schemas defined?

Schemas can be written using Zod objects or JSON Schema files, and the library parses results against them.
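
As a rough illustration of what a JSON Schema for extraction might look like, the sketch below describes one product record and checks a candidate result against it. The field names and the `matchesProductSchema` helper are assumptions made for this example. The helper is a deliberately simplified check of required keys and primitive types, not the validation the library performs.

```typescript
// JSON Schema for one product record (field names are illustrative assumptions).
const productSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    price: { type: "number" },
    inStock: { type: "boolean" },
    sku: { type: "string" },
  },
  required: ["name", "price", "inStock", "sku"],
} as const;

// The TypeScript shape expected once a value passes the check.
type Product = { name: string; price: number; inStock: boolean; sku: string };

// Simplified stand-in for schema validation. It works here only because the
// JSON Schema type names used ("string", "number", "boolean") happen to match
// the strings returned by JavaScript's typeof operator.
function matchesProductSchema(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return productSchema.required.every(
    (key) => typeof record[key] === productSchema.properties[key].type,
  );
}
```

A result whose `price` comes back as the string "9.50" would fail this check, which is the kind of mismatch schema validation is meant to catch before the data reaches a pipeline.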

Can I receive partial results while scraping?

Yes, the `stream` method returns a partial object stream (available with the Vercel AI SDK).
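
Consuming a partial object stream might look like the sketch below. It does not call the library: `fakePartialStream` is a hypothetical generator that simulates the stream, and `PartialFeed` and `consume` are names invented for this example. The assumption is that each value received is a progressively more complete snapshot of the final object.

```typescript
type Story = { title: string; points: number };
type PartialFeed = { top?: Partial<Story>[] };

// Hypothetical stand-in for a partial object stream: each yielded value is a
// fuller snapshot of the object than the one before it.
async function* fakePartialStream(): AsyncGenerator<PartialFeed> {
  yield {};
  yield { top: [{ title: "First story" }] };
  yield { top: [{ title: "First story", points: 120 }] };
}

// Read snapshots as they arrive and keep the most recent one.
async function consume(stream: AsyncIterable<PartialFeed>): Promise<PartialFeed> {
  let latest: PartialFeed = {};
  for await (const partial of stream) {
    latest = partial; // a UI could re-render here with whatever fields exist so far
  }
  return latest;
}
```

Because fields may be missing in early snapshots, code that reads them should treat every property as optional until the stream ends.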

Is the library free for commercial use?

It is released under the MIT license, which permits commercial use, modification, and distribution.

Project at a glance

Status: Active
Stars: 6,164
Watchers: 6,164
Forks: 369
License: MIT
Repo age: 1 year old
Last commit: 2 months ago
Primary language: TypeScript