@leibniz/extractor

v0.0.3, published 6 months ago

Web content extraction library for Playwright - extracts content from web pages and converts to Markdown

xshhs23

web-clipper markdown playwright obsidian

A web content extraction library that intelligently converts web page content to Markdown.

Installation

npm install @leibniz/extractor

Quick Start

import { ClipperService } from '@leibniz/extractor';

const clipper = new ClipperService({ cwd: './output' });

// Extract from HTML and save
const result = await clipper.clip(html, 'https://example.com/article');

console.log(result.markdownPath);  // ./output/example.com/Article Title.md
console.log(result.imagesSaved);   // 5

API

ClipperService

The main service class. It provides the complete workflow for extracting and saving web page content.

import { ClipperService } from '@leibniz/extractor';

const clipper = new ClipperService({ cwd: '/path/to/output' });

`clip(html, url)` - Extract and save

const result = await clipper.clip(html, url);

// result: {
//   success: boolean,
//   markdownPath: string,  // path to the Markdown file
//   assetsDir: string,     // directory holding image assets
//   imagesSaved: number    // number of images saved
// }
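
The result shape above can be mirrored in a small TypeScript type so that the failure case is harder to overlook. This is a minimal sketch under assumptions: `ClipResult` is transcribed from the comment above rather than imported from the package, `summarizeClip` is a hypothetical helper, and the README does not say which fields are populated when `success` is `false`.

```typescript
// Shape transcribed from the README comment above; the package's own
// exported type (if any) may differ.
interface ClipResult {
  success: boolean;
  markdownPath: string;
  assetsDir: string;
  imagesSaved: number;
}

// Hypothetical helper: turn a clip result into a one-line log message.
function summarizeClip(result: ClipResult): string {
  if (!result.success) {
    // Assumption: the other fields are not meaningful on failure.
    return 'clip failed';
  }
  const noun = result.imagesSaved === 1 ? 'image' : 'images';
  return `saved ${result.markdownPath} (${result.imagesSaved} ${noun} in ${result.assetsDir})`;
}
```

In real use, the value returned by `await clipper.clip(html, url)` would be passed to `summarizeClip` before logging.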

`clipToMarkdown(html, url)` - Extract only, without saving

const result = clipper.clipToMarkdown(html, url);

// result: {
//   success: boolean,
//   markdown: string,      // Markdown content
//   title: string,         // page title
//   images: ExtractedImage[]  // list of images (not downloaded)
// }

`clipWithPlaywright(url, options?)` - Fetch the page with Playwright

const result = await clipper.clipWithPlaywright(url, {
  waitFor: 'networkidle',           // wait strategy
  waitForSelector: '[data-ready]',  // wait for a specific element
  cookies: [...],                   // cookies carrying a logged-in session
  headers: { ... },                 // custom request headers
  headless: true                    // headless mode
});
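
The `cookies` option is elided in the call above. One plausible way to build such a list is to split a `Cookie` header copied from a logged-in browser session. The sketch below assumes the option accepts the same object shape as Playwright's `browserContext.addCookies` (`name`, `value`, `domain`, `path`); the README does not confirm this, and `cookiesFromHeader` is a hypothetical helper, not part of the package.

```typescript
// Cookie shape accepted by Playwright's browserContext.addCookies
// (the domain + path form). Whether @leibniz/extractor expects exactly
// this shape is an assumption.
interface CookieParam {
  name: string;
  value: string;
  domain: string;
  path: string;
}

// Hypothetical helper: split a "k=v; k2=v2" Cookie header into objects.
function cookiesFromHeader(header: string, domain: string): CookieParam[] {
  const cookies: CookieParam[] = [];
  for (const raw of header.split(';')) {
    const pair = raw.trim();
    const eq = pair.indexOf('='); // values may themselves contain '='
    if (eq <= 0) {
      continue; // skip empty segments and entries without a name
    }
    cookies.push({
      name: pair.slice(0, eq),
      value: pair.slice(eq + 1),
      domain,
      path: '/',
    });
  }
  return cookies;
}

const sessionCookies = cookiesFromHeader('session=abc123; theme=dark', 'example.com');
```

Under that assumption, `sessionCookies` could then be passed as `cookies` in the options object.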

Using the extraction functions directly

import { clipFromHtml, extractContent } from '@leibniz/extractor';

// Extract from an HTML string
const fromHtml = clipFromHtml(html, url);

// Extract from a Document object (useful with Playwright)
const fromDocument = extractContent(document, url);

Slate editor detection

import { hasSlateEditor, SlateWikiExtractor } from '@leibniz/extractor';

// Check whether the page contains a Slate editor
if (hasSlateEditor(document)) {
  const result = SlateWikiExtractor.extract(document);
}

Integrating with Playwright

import { chromium } from 'playwright';
import { ClipperService } from '@leibniz/extractor';

const clipper = new ClipperService({ cwd: './content' });

const browser = await chromium.launch();
const page = await browser.newPage();

await page.goto('https://example.com/article');
const html = await page.content();

const result = await clipper.clip(html, page.url());
console.log(`✓ ${result.markdownPath}`);

await browser.close();

Built-in extractors

| Extractor | Detection | Use case |
|-----------|-----------|----------|
| SlateWiki | CSS [data-slate-editor] | Pages with a Slate editor |
| GeminiDeepResearch | CSS .deep-research-panel | Gemini Deep Research reports |
| Default | Automatic | General web pages (Readability) |
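
The selection logic implied by the table can be sketched as a first-match lookup over CSS selectors with a fallback. This is an illustration, not the package's implementation: `pickExtractor`, `QueryableDocument`, and the order in which the rules are tried are assumptions, and only the selectors and extractor names come from the table.

```typescript
// Minimal slice of the DOM API needed for detection, so the sketch runs
// against a real Document or a test double.
interface QueryableDocument {
  querySelector(selector: string): object | null;
}

interface ExtractorRule {
  name: string;
  selector: string;
}

// Selectors taken from the table; the ordering is an assumption.
const RULES: ExtractorRule[] = [
  { name: 'SlateWiki', selector: '[data-slate-editor]' },
  { name: 'GeminiDeepResearch', selector: '.deep-research-panel' },
];

// Return the first extractor whose selector matches, else the default one.
function pickExtractor(doc: QueryableDocument): string {
  for (const rule of RULES) {
    if (doc.querySelector(rule.selector) !== null) {
      return rule.name;
    }
  }
  return 'Default';
}
```

A page with neither marker falls through to `Default`, which the table describes as the Readability-based extractor for general pages.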

Output structure

${cwd}/
├── example.com/
│   ├── Article Title.md
│   └── assets/
│       └── 2026_01_23_10_50_23/
│           └── image_0.png
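
To make the layout concrete, the sketch below derives the two paths from a working directory, a page URL, a title, and a timestamp. It is a reading of the tree above, not the package's code: `buildOutputPaths` is a hypothetical helper, and the file-name sanitizing rule and the use of local time for the timestamp folder are assumptions.

```typescript
interface OutputPaths {
  markdownPath: string;
  assetsDir: string;
}

// Zero-pad a number to two digits, e.g. 1 becomes "01".
function pad2(n: number): string {
  return (n < 10 ? '0' : '') + n;
}

// Hypothetical helper mirroring the tree above:
//   <cwd>/<hostname>/<title>.md
//   <cwd>/<hostname>/assets/<YYYY_MM_DD_HH_mm_ss>/
function buildOutputPaths(cwd: string, pageUrl: string, title: string, now: Date): OutputPaths {
  const host = new URL(pageUrl).hostname;
  // Assumed rule: replace characters that are invalid in file names.
  const safeTitle = title.replace(/[\\/:*?"<>|]/g, '_').trim();
  const stamp = [
    String(now.getFullYear()),
    pad2(now.getMonth() + 1), // getMonth() is zero-based
    pad2(now.getDate()),
    pad2(now.getHours()),
    pad2(now.getMinutes()),
    pad2(now.getSeconds()),
  ].join('_');
  const base = `${cwd.replace(/\/+$/, '')}/${host}`;
  return {
    markdownPath: `${base}/${safeTitle}.md`,
    assetsDir: `${base}/assets/${stamp}`,
  };
}
```

Grouping assets under a per-run timestamp folder, as the tree suggests, would keep images from repeated clips of the same page from overwriting each other.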

License

MIT
