@regressor-data/llm-scraper

v1.0.0

Published 3 months ago

Turn any webpage into structured data using LLMs (OpenAI compatible APIs)

Vulnerabilities: 0 High, 0 Medium, 0 Low

han_yong_hui

llm scraper browser playwright deepseek web-scraping data-extraction ai

LLM Scraper

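LLM Scraper is a TypeScript library that extracts structured data from any webpage using large language models (LLMs).

> [!IMPORTANT]
> LLM Scraper has been updated to version 1.6.
> The new version supports Vercel AI SDK 4 and JSON Schema, and brings better type safety, improved code generation, and updated examples.

> [!TIP]
> Under the hood, it uses function calling to convert webpages into structured data. You can learn more about this approach here.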

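Core features

- Supports the GPT, Sonnet, Gemini, Llama, and Qwen model families
- Schemas defined with Zod or JSON Schema
- Full TypeScript type safety
- Built on the Playwright framework
- Streaming object output
- Code generation
- Supports 5 formatting modes:
  - html - loads pre-processed HTML (scripts, styles, etc. removed)
  - raw_html - loads raw HTML (no processing)
  - markdown - loads the page as Markdown
  - text - loads extracted text (using Readability.js)
  - image - loads a screenshot (multimodal models only)

Remember to give the project a star!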

Getting started

Install the required dependencies from npm:
```
npm i zod playwright llm-scraper
```

Initialize your LLM:

OpenAI

```
npm i @ai-sdk/openai
```

```
import { openai } from '@ai-sdk/openai'

const llm = openai.chat('gpt-4o')
```

Anthropic

```
npm i @ai-sdk/anthropic
```

```
import { anthropic } from '@ai-sdk/anthropic'

const llm = anthropic('claude-3-5-sonnet-20240620')
```

Google

```
npm i @ai-sdk/google
```

```
import { google } from '@ai-sdk/google'

const llm = google('gemini-1.5-flash')
```

Groq

```
npm i @ai-sdk/openai
```

```
import { createOpenAI } from '@ai-sdk/openai'

// Groq exposes an OpenAI-compatible endpoint, so the OpenAI provider is reused
const groq = createOpenAI({
  baseURL: 'https://api.groq.com/openai/v1',
  apiKey: process.env.GROQ_API_KEY,
})

const llm = groq('llama3-8b-8192')
```

Ollama

```
npm i ollama-ai-provider
```

```
import { ollama } from 'ollama-ai-provider'

const llm = ollama('llama3')
```

Create a new scraper instance and pass in the LLM:

```
import LLMScraper from 'llm-scraper'

const scraper = new LLMScraper(llm)
```

Example

In this example, we extract the top stories from Hacker News:

```
import { chromium } from 'playwright'
import { z } from 'zod'
import { openai } from '@ai-sdk/openai'
import LLMScraper from 'llm-scraper'

// Launch a browser instance
const browser = await chromium.launch()

// Initialize the LLM provider
const llm = openai.chat('gpt-4o')

// Create a new LLMScraper
const scraper = new LLMScraper(llm)

// Open a new page
const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')

// Define the schema of the data to extract
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: 'html',
})

// Show the result returned by the LLM
console.log(data.top)

await page.close()
await browser.close()
```

Output:

```
[
  {
    title: 'Palette lighting tricks on the Nintendo 64',
    points: 105,
    by: 'ibobev',
    commentsURL: 'https://news.ycombinator.com/item?id=44014587',
  },
  {
    title: 'Push Ifs Up and Fors Down',
    points: 187,
    by: 'goranmoomin',
    commentsURL: 'https://news.ycombinator.com/item?id=44013157',
  },
  {
    title: "JavaScript's New Superpower: Explicit Resource Management",
    points: 225,
    by: 'olalonde',
    commentsURL: 'https://news.ycombinator.com/item?id=44012227',
  },
  {
    title:
      '"We would be less confidential than Google" Proton threatens to quit Switzerland',
    points: 65,
    by: 'taubek',
    commentsURL: 'https://news.ycombinator.com/item?id=44014808',
  },
  {
    title: 'OBNC – Oberon-07 Compiler',
    points: 37,
    by: 'AlexeyBrin',
    commentsURL: 'https://news.ycombinator.com/item?id=44013671',
  },
]
```

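More examples can be found in the examples folder.

Streaming

Replace the run method with stream to get a stream of partial objects (Vercel AI SDK only).

```
// Run the scraper in streaming mode
const { stream } = await scraper.stream(page, schema)

// Print the LLM result as it streams in
for await (const data of stream) {
  console.log(data.top)
}
```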
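Each value in the stream is a progressively more complete snapshot of the same object, so a caller that only needs the final result can keep the last one. The sketch below illustrates this with a mock stream: `fakeStream`, `lastSnapshot`, and the `Story` / `TopStories` types are hypothetical names for this example, not part of the library.

```typescript
// Shape of the data in the Hacker News example, reduced to two fields.
interface Story {
  title: string
  points: number
}

interface TopStories {
  top: Story[]
}

// Hypothetical stand-in for the `stream` returned by scraper.stream():
// every yielded value is a more complete partial object than the previous one.
async function* fakeStream(): AsyncGenerator<Partial<TopStories>> {
  yield {}
  yield { top: [{ title: 'First story', points: 10 }] }
  yield {
    top: [
      { title: 'First story', points: 10 },
      { title: 'Second story', points: 20 },
    ],
  }
}

// Drain a partial-object stream and return the last snapshot seen.
async function lastSnapshot<T>(
  stream: AsyncIterable<Partial<T>>
): Promise<Partial<T>> {
  let latest: Partial<T> = {}
  for await (const partial of stream) {
    latest = partial
  }
  return latest
}
```

With a real scraper, the same helper would presumably be called as `lastSnapshot(stream)` on the `stream` value shown above; intermediate snapshots are useful mainly for rendering progress.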

Code generation

The generate function produces a reusable Playwright script that scrapes content according to the schema.

```
// Generate code and run it on the page
const { code } = await scraper.generate(page, schema)
const result = await page.evaluate(code)
const data = schema.parse(result)

// Show the parsed result
console.log(data.top)
```
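Because the generated script is reusable, one natural pattern is to generate it once and run it many times without calling the LLM again. The following is a minimal sketch of such a cache; `scriptCache` and `scriptFor` are hypothetical names, and the choice of cache key is an assumption of this example.

```typescript
// Hypothetical in-memory cache: one generated script per key.
const scriptCache = new Map<string, string>()

// Return the cached script for `key`, generating it on first use.
// `generate` stands in for a call such as
// `async () => (await scraper.generate(page, schema)).code`.
async function scriptFor(
  key: string,
  generate: () => Promise<string>
): Promise<string> {
  const cached = scriptCache.get(key)
  if (cached !== undefined) return cached
  const code = await generate()
  scriptCache.set(key, code)
  return code
}
```

A site's hostname is one plausible key when all of its pages share a layout. The output should still be passed through `schema.parse`, since a cached script can go stale when the page structure changes.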

API reference

`LLMScraper` class

Constructor

```
new LLMScraper(client: LanguageModelV1)
```

Creates a new LLMScraper instance.

Parameters:

- client: a language model instance (from the Vercel AI SDK)

Methods

`run<T>(page, schema, options?)`

Pre-processes the page and generates structured data.

Parameters:

- page: Playwright Page object
- schema: Zod schema or JSON Schema
- options?: optional configuration
  - format?: 'html' | 'raw_html' | 'markdown' | 'text' | 'image' - page formatting mode
  - prompt?: custom prompt
  - temperature?: controls output randomness (0-1)
  - maxTokens?: maximum number of tokens
  - topP?: nucleus sampling parameter
  - mode?: 'auto' | 'json' | 'tool' - generation mode
  - output?: 'array' - enables array output

Returns:

```
Promise<{ data: T; url: string }>
```

`stream<T>(page, schema, options?)`

Pre-processes the page and streams back structured data.

Parameters: same as run

Returns:

```
Promise<{ stream: AsyncIterable<Partial<T>>; url: string }>
```

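`generate<T>(page, schema, options?)`

Pre-processes the page and generates executable scraping code.

Parameters:

- page: Playwright Page object
- schema: Zod schema or JSON Schema
- options?: optional configuration
  - format?: 'html' | 'raw_html' - page formatting mode
  - prompt?: custom code generation prompt
  - temperature?: controls output randomness
  - maxTokens?: maximum number of tokens
  - topP?: nucleus sampling parameter

Returns:

```
Promise<{ code: string; url: string }>
```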

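Formatting modes in detail

`html` (default)

Pre-processes the HTML by removing unnecessary elements (script, style, nav, etc.) and attributes, which reduces token usage. Suitable for most scenarios.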
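To give a feel for what this kind of cleanup does, here is a rough, dependency-free sketch that works on an HTML string. It is not the library's implementation: the tag list only mirrors the examples named above, the attribute list (`class`, `style`) is an assumption, and `stripNoise` is a hypothetical name.

```typescript
// Elements whose whole subtree is dropped (mirrors the examples named above).
const NOISE_TAGS = ['script', 'style', 'nav']

// Attributes assumed to carry no extractable data; chosen for illustration only.
const NOISE_ATTRS = /\s(?:class|style)="[^"]*"/gi

// Remove noisy elements and attributes from an HTML string.
// Regex-based to stay dependency-free; not robust against arbitrary HTML.
function stripNoise(html: string): string {
  let out = html
  for (const tag of NOISE_TAGS) {
    const element = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, 'gi')
    out = out.replace(element, '')
  }
  out = out.replace(NOISE_ATTRS, '')
  // Collapse the whitespace left behind by the removals.
  return out.replace(/\s+/g, ' ').trim()
}
```

Attributes such as `href` and `id` are kept here because extracted fields like `commentsURL` depend on them.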

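`raw_html`

Keeps the complete raw HTML without any processing. Use it when the full page information is needed.

`markdown`

Converts the page to Markdown, which keeps a clear structure with fairly low token usage.

`text`

Extracts the main text content with Mozilla Readability, minimizing token usage. Well suited to extracting article bodies.

`image`

Converts a screenshot of the page to a base64 image. Requires a vision-capable multimodal model (such as GPT-4V or Claude 3).

```
const { data } = await scraper.run(page, schema, {
  format: 'image',
  fullPage: true, // capture the full page
})
```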

Advanced usage

Custom format function

```
const { data } = await scraper.run(page, schema, {
  format: 'custom',
  formatFunction: async (page) => {
    // Custom page processing logic
    return await page.evaluate(() => {
      return document.body.innerText
    })
  },
})
```

Custom prompt

```
const { data } = await scraper.run(page, schema, {
  prompt:
    'You are an expert at extracting data from webpages. Analyze the page content carefully and extract the requested information accurately.',
})
```

Array output mode

```
const schema = z.object({
  title: z.string(),
  price: z.number(),
})

const { stream } = await scraper.stream(page, schema, {
  output: 'array', // extract every matching item on the page
})

for await (const items of stream) {
  console.log(items) // array containing all matching items on the page
}
```

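How it works

1. Page pre-processing: converts the webpage into a format suitable for the LLM, according to the selected formatting mode
2. Schema conversion: converts the Zod schema to JSON Schema
3. LLM call: uses function calling so that the LLM extracts data according to the schema
4. Type validation: validates the returned data structure with Zod
5. Result: returns type-safe structured data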
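The flow above can be sketched end to end with stand-ins for the real components. Everything in this example is an assumption made for illustration: `MiniSchema` replaces Zod / JSON Schema with a plain field-to-type map, `FakeModel` replaces the LLM call, and the pre-processing is a crude tag strip rather than any of the library's formatting modes.

```typescript
// A tiny stand-in for a schema: field name -> expected primitive type.
type FieldType = 'string' | 'number'
type MiniSchema = Record<string, FieldType>

// Hypothetical model interface: receives page content and a schema,
// and resolves to a raw JSON string.
type FakeModel = (content: string, schema: MiniSchema) => Promise<string>

// Type validation: check that the model output matches the schema.
function validate(
  raw: unknown,
  schema: MiniSchema
): Record<string, string | number> {
  if (typeof raw !== 'object' || raw === null) {
    throw new Error('model output is not an object')
  }
  const record = raw as Record<string, unknown>
  const result: Record<string, string | number> = {}
  for (const [key, type] of Object.entries(schema)) {
    const value = record[key]
    if (type === 'string' && typeof value === 'string') result[key] = value
    else if (type === 'number' && typeof value === 'number') result[key] = value
    else throw new Error(`field "${key}" should be a ${type}`)
  }
  return result
}

// Pre-process, call the model with the schema, then validate and return.
async function extract(html: string, schema: MiniSchema, model: FakeModel) {
  // Page pre-processing: drop tags and collapse whitespace.
  const content = html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()
  // LLM call: the schema travels with the content.
  const raw = await model(content, schema)
  // Type validation and result.
  return validate(JSON.parse(raw), schema)
}
```

The validation step matters because nothing forces a model to honor the schema; a failed check surfaces as an error instead of silently returning malformed data.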

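Performance tips

- Choose a suitable formatting mode:
  - Use text or markdown on simple pages to save tokens
  - Use html on complex layouts to keep structural information
  - Use image when visual information is needed
- Optimize the schema:
  - Use .describe() to add field descriptions and improve extraction accuracy
  - Avoid overly complex nested structures
- Control token usage:
  - Use maxTokens to limit output length
  - On large pages, consider navigating to a specific section first
- Choose a suitable model:
  - Use a small model (gpt-4o-mini) for simple tasks to reduce cost
  - Use a large model (gpt-4o) for complex extraction to improve accuracy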
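The format and model guidance above can be written down as a small decision helper. This is only one way to encode the tips: `PageTraits`, `chooseFormat`, and `chooseModel` are hypothetical names, and the traits would have to be supplied by the caller.

```typescript
type Format = 'html' | 'markdown' | 'text' | 'image'

// Hypothetical description of a page; none of these fields come from the library.
interface PageTraits {
  needsVisuals: boolean // layout, charts, or images carry the data
  complexLayout: boolean // tables or nested lists carry the structure
  articleLike: boolean // mostly running text
}

// Pick the cheapest format that still preserves what the extraction needs.
function chooseFormat(traits: PageTraits): Format {
  if (traits.needsVisuals) return 'image'
  if (traits.complexLayout) return 'html'
  return traits.articleLike ? 'text' : 'markdown'
}

// Small model for simple tasks, large model for complex extraction.
function chooseModel(complexExtraction: boolean): 'gpt-4o' | 'gpt-4o-mini' {
  return complexExtraction ? 'gpt-4o' : 'gpt-4o-mini'
}
```

The result of `chooseFormat` can be passed as the `format` option of `run`, and the result of `chooseModel` as the model name when initializing the LLM.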

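FAQ

Q: Why not just use CSS selectors?

A: LLM Scraper relies on semantic understanding rather than selectors, so it can:

- Adapt to different page structures
- Understand what the content means instead of relying only on the DOM structure
- Work without site-specific code for each website

Q: Which language models are supported?

A: Any model supported by the Vercel AI SDK can be used, including OpenAI, Anthropic, Google, and local models through Ollama.

Q: How do I handle pages that require login?

A: Use Playwright's standard login flow and complete the login before calling the scraper.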
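A login step of this kind might look like the sketch below. The `LoginPage` interface is a minimal structural type declared here so the example compiles without Playwright installed; Playwright's `Page` provides `goto`, `fill`, `click`, and `waitForURL` methods that match these calls. Every URL and selector is a placeholder, and `login` is a hypothetical helper.

```typescript
// Minimal structural type covering only the Page methods used below.
interface LoginPage {
  goto(url: string): Promise<unknown>
  fill(selector: string, value: string): Promise<void>
  click(selector: string): Promise<void>
  waitForURL(url: string): Promise<void>
}

// Log in before handing the page to the scraper.
// All URLs and selectors are placeholders for a real site's values.
async function login(
  page: LoginPage,
  user: string,
  password: string
): Promise<void> {
  await page.goto('https://example.com/login')
  await page.fill('#username', user)
  await page.fill('#password', password)
  await page.click('button[type="submit"]')
  // Wait until the post-login page has loaded before scraping.
  await page.waitForURL('https://example.com/dashboard')
}
```

After `login(page, user, password)` resolves, the same `page` can be passed to `scraper.run` as in the earlier example, since the session lives in the browser context.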

Q: Can it scrape dynamically loaded content?

A: Yes. Because it runs on Playwright, all JavaScript is executed. Use page.waitForSelector() to wait until the content has finished loading.
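As a sketch of that ordering, the helper below waits for a selector and only then runs the scraper. `PageLike` and `ScraperLike` are minimal structural types declared for this example so that it compiles without Playwright or the library installed; a real Playwright `Page` and an `LLMScraper` instance are assumed to satisfy them. `runWhenReady` and the default timeout are hypothetical.

```typescript
// Only the members used below; assumed to match the real objects.
interface PageLike {
  waitForSelector(
    selector: string,
    options?: { timeout?: number }
  ): Promise<unknown>
}

interface ScraperLike<T> {
  run(page: PageLike, schema: unknown): Promise<{ data: T; url: string }>
}

// Wait for dynamically loaded content, then extract.
async function runWhenReady<T>(
  scraper: ScraperLike<T>,
  page: PageLike,
  schema: unknown,
  selector: string,
  timeout = 10_000
): Promise<T> {
  // Rejects if the selector does not appear within `timeout` milliseconds.
  await page.waitForSelector(selector, { timeout })
  const { data } = await scraper.run(page, schema)
  return data
}
```

Choosing a selector that matches the data itself (for example, a list row) rather than a page-level container helps avoid scraping a half-rendered page.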

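Contributing

As an open-source project, we welcome contributions from the community. If you run into a bug or want to add an improvement, feel free to open an issue or submit a pull request.

License

MIT License - see LICENSE.md for details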
