news-extractor-node

v0.1.0, published 4 months ago

A Node.js library for extracting news content from HTML pages using a text density algorithm

siping

news extractor content article web-scraping text-density

A Node.js news content extraction library based on a text density algorithm

A Node.js news content extraction tool implemented in TypeScript and inspired by GeneralNewsExtractor (GNE).

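✨ Features

🎯 Smart content extraction: identifies and extracts the article body automatically using a text density algorithm
📰 Automatic metadata detection: extracts the title, publish time, and author
🖼️ Image extraction: collects the URLs of all images in the article
🧹 Noise filtering: filters out ads, comments, and other irrelevant content
⚙️ Flexible configuration: supports custom XPath expressions and CSS selectors
🚀 Zero configuration: works out of the box with no setup required
📦 TypeScript support: complete type definitions
🔗 Companion package: integrates with @siping/html-to-markdown-node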

📦 Installation

npm install news-extractor-node

🚀 Quick Start

Basic usage

import { NewsExtractor } from 'news-extractor-node';

const html = `
<!DOCTYPE html>
<html>
<head>
  <title>News Title</title>
</head>
<body>
  <article>
    <h1>Breaking: AI Technology Achieves a Breakthrough</h1>
    <div class="meta">
      <span class="author">By Zhang San</span>
      <time>2026-03-22</time>
    </div>
    <div class="content">
      <p>Today, researchers announced a major breakthrough...</p>
      <p>This technology will transform the entire industry...</p>
    </div>
  </article>
</body>
</html>
`;

const extractor = new NewsExtractor();
const result = extractor.extract(html, {
  url: 'https://example.com/news/article'
});

console.log(result);
// {
//   title: "Breaking: AI Technology Achieves a Breakthrough",
//   author: "Zhang San",
//   publishTime: "2026-03-22",
//   content: "Today, researchers announced a major breakthrough...",
//   contentHtml: "<p>Today, researchers announced a major breakthrough...</p>...",
//   images: []
// }

Convenience function

import { extractNews } from 'news-extractor-node';

const result = extractNews(html, {
  url: 'https://example.com/news/article'
});

📖 Detailed Usage

1. Filtering noise

Remove ads, comments, and other distracting elements:

const result = extractor.extract(html, {
  url: 'https://example.com/news',
  noiseSelectors: [
    '.advertisement',      // ads
    '.comment-list',       // comment section
    '#related-articles',   // related articles
    'aside',               // sidebar
    'footer'               // page footer
  ]
});

2. Custom extraction rules

When automatic extraction is inaccurate, you can specify custom XPath expressions:

const result = extractor.extract(html, {
  // Custom title extraction
  titleXPath: '//h1[@class="article-title"]/text()',

  // Custom author extraction
  authorXPath: '//span[@class="author-name"]/text()',

  // Custom publish time extraction
  publishTimeXPath: '//time/@datetime',

  // Custom body extraction
  contentXPath: '//div[@class="article-body"]'
});

3. Combining with Markdown conversion

A complete workflow from news extraction to Markdown:

import { NewsExtractor } from 'news-extractor-node';
import { convertString } from '@siping/html-to-markdown-node';

// 1. Extract the news content
const extractor = new NewsExtractor();
const news = extractor.extract(html, {
  url: 'https://example.com/news/article',
  noiseSelectors: ['.ad', '.comments']
});

// 2. Convert to Markdown
const markdown = convertString(news.contentHtml, {
  domain: 'https://example.com'
});

// 3. Assemble the full Markdown document
const fullMarkdown = `
# ${news.title}

**Author**: ${news.author}  
**Published**: ${news.publishTime}  
**Source**: https://example.com/news/article

---

${markdown}

## Images

${news.images.map(img => `- ![](${img})`).join('\n')}
`;

console.log(fullMarkdown);

4. Batch processing

Extract several news pages in one run:

const urls = [
  'https://example.com/news/1',
  'https://example.com/news/2',
  'https://example.com/news/3'
];

const extractor = new NewsExtractor();
const results = [];

for (const url of urls) {
  // Assumes you fetch the HTML yourself (with axios, puppeteer, etc.);
  // fetchHtml is your own helper, not part of this library
  const html = await fetchHtml(url);

  const news = extractor.extract(html, {
    url,
    noiseSelectors: ['.ad', '.sidebar']
  });

  results.push({
    url,
    ...news
  });
}

console.log(`Extracted ${results.length} articles`);
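
The loop above stops at the first page that fails to download. One way to keep going is to record a per-URL outcome, as in the sketch below. `extractAll` and `BatchResult` are hypothetical names, not part of this library; the fetcher and the extractor are passed in as functions so the helper makes no assumption about how HTML is obtained.

```typescript
// Outcome for one URL: either the extracted value or an error message.
type BatchResult<T> =
  | { url: string; ok: true; news: T }
  | { url: string; ok: false; error: string };

// Processes URLs one at a time and never throws for a single bad page.
async function extractAll<T>(
  urls: string[],
  fetchHtml: (url: string) => Promise<string>,
  extract: (html: string, url: string) => T,
): Promise<BatchResult<T>[]> {
  const results: BatchResult<T>[] = [];
  for (const url of urls) {
    try {
      const html = await fetchHtml(url);
      results.push({ url, ok: true, news: extract(html, url) });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ url, ok: false, error });
    }
  }
  return results;
}
```

With this library, the `extract` argument would be something like `(html, url) => extractor.extract(html, { url })`.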

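🔧 API Reference

`NewsExtractor`

The main extractor class.

`extract(html: string, options?: ExtractOptions): NewsContent`

Extracts news content from HTML.

Parameters:

html (string): the HTML to extract from
options (ExtractOptions, optional): extraction options
- url (string): page URL, used to resolve relative image links
- titleXPath (string): custom XPath for title extraction
- authorXPath (string): custom XPath for author extraction
- publishTimeXPath (string): custom XPath for publish time extraction
- contentXPath (string): custom XPath for body extraction
- noiseSelectors (string[]): CSS selectors for the noise elements to remove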
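
As a rough illustration, the options listed above could be typed as follows. This is a sketch inferred from the parameter list, not the package's actual declaration; `ExtractOptionsSketch` and `resolveImageUrls` are hypothetical names. The helper shows one way a page URL can be used to resolve relative image links, using the `URL` class built into Node.js.

```typescript
// Inferred from the documented parameters; the package's own type
// definitions are authoritative.
interface ExtractOptionsSketch {
  url?: string;
  titleXPath?: string;
  authorXPath?: string;
  publishTimeXPath?: string;
  contentXPath?: string;
  noiseSelectors?: string[];
}

// Resolves each image source against options.url when it is provided.
function resolveImageUrls(
  sources: string[],
  options: ExtractOptionsSketch = {},
): string[] {
  return sources.map((src) => {
    try {
      return new URL(src, options.url).href;
    } catch {
      // Relative path without a base URL, or an unparseable value: keep as-is.
      return src;
    }
  });
}
```

For example, with `url: 'https://example.com/news/article'`, the source `/img/a.png` resolves to `https://example.com/img/a.png`.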

Returns:

interface NewsContent {
  title: string;           // article title
  author: string;          // author
  publishTime: string;     // publish time
  content: string;         // plain-text content
  contentHtml: string;     // content as HTML
  images: string[];        // array of image URLs
}

`extractNews(html: string, options?: ExtractOptions): NewsContent`

Convenience function, equivalent to creating a NewsExtractor instance and calling its extract method.

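🎯 How It Works

The library implements a content extraction algorithm based on text density:

1. Text density calculation

Text density = number of text characters / number of HTML tags

Walk every node in the DOM tree
Count the text characters and HTML tags under each node
Weight punctuation more heavily (5x)
Take the region with the highest text density as the article body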
2. Metadata extraction

Title extraction strategy:

Try common title selectors (h1.title, h1.article-title, etc.)
Check meta tags (og:title, twitter:title)
Fall back to the <title> tag
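
The fallback order above can be expressed as a list of strategies tried in turn. The sketch below is an illustration only: it uses regular expressions on the raw HTML to stay dependency-free, whereas a real extractor would query a parsed DOM, and `pickTitle` and `titleStrategies` are hypothetical names.

```typescript
type Strategy = (html: string) => string | undefined;

// Returns the first capture group of the pattern, trimmed, if non-empty.
function firstMatch(html: string, pattern: RegExp): string | undefined {
  const value = pattern.exec(html)?.[1]?.trim();
  return value ? value : undefined;
}

const titleStrategies: Strategy[] = [
  // 1. Common headline markup: <h1 class="title"> or <h1 class="article-title">
  (html) =>
    firstMatch(html, /<h1[^>]*class="(?:title|article-title)"[^>]*>([^<]+)<\/h1>/i),
  // 2. Meta tags (assumes the property/name attribute precedes content)
  (html) =>
    firstMatch(
      html,
      /<meta[^>]*(?:property|name)="(?:og:title|twitter:title)"[^>]*content="([^"]*)"/i,
    ),
  // 3. Last resort: the <title> tag
  (html) => firstMatch(html, /<title[^>]*>([^<]+)<\/title>/i),
];

// Tries each strategy in order and returns the first non-empty result.
function pickTitle(html: string): string {
  for (const strategy of titleStrategies) {
    const title = strategy(html);
    if (title) return title;
  }
  return "";
}
```

Keeping the strategies in an ordered array makes the priority explicit and lets a custom rule be placed ahead of the defaults.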

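Publish time extraction strategy:

Look for <time> tags and datetime attributes
Check meta tags (article:published_time)
Search the page text for date patterns using regular expressions
Support multiple date formats (ISO 8601, Chinese dates, etc.)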
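
The regular-expression step might look roughly like this. The two patterns are assumptions covering only the formats named above (ISO 8601 and Chinese dates); the patterns the library actually uses are not documented here, and `findDate` is a hypothetical name.

```typescript
const DATE_PATTERNS: RegExp[] = [
  // ISO 8601 date with an optional time: 2026-03-22 or 2026-03-22T08:30:00
  /\d{4}-\d{2}-\d{2}(?:[T ]\d{2}:\d{2}(?::\d{2})?)?/,
  // Chinese date: 2026年3月22日
  /\d{4}年\d{1,2}月\d{1,2}日/,
];

// Returns the first match of the first pattern that matches at all.
// Patterns are tried in list order, not by position in the text.
function findDate(text: string): string | undefined {
  for (const pattern of DATE_PATTERNS) {
    const match = pattern.exec(text);
    if (match) return match[0];
  }
  return undefined;
}
```

The function returns the matched substring unchanged; normalising it to a single format would be a separate step.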

Author extraction strategy:

Look for common author selectors (.author, .article-author, etc.)
Check meta tags (author, article:author)
Use regular expressions to match patterns such as "作者：" and "By"
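
A minimal sketch of the regular-expression step follows, assuming two byline shapes: a "作者：" prefix and an English "By" followed by one to three capitalised words. Both patterns and the name `findAuthor` are illustrative assumptions rather than the library's own rules.

```typescript
const AUTHOR_PATTERNS: RegExp[] = [
  // "作者：张三" or "作者: 张三" (full-width or ASCII colon)
  /作者\s*[:：]\s*([^\s,，|]+)/,
  // "By Jane Doe": up to three capitalised words after "By"
  /\bBy\s+([A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*){0,2})/,
];

// Returns the captured author name from the first pattern that matches.
function findAuthor(text: string): string | undefined {
  for (const pattern of AUTHOR_PATTERNS) {
    const match = pattern.exec(text);
    if (match && match[1]) return match[1];
  }
  return undefined;
}
```

The English pattern is deliberately narrow: a capitalised word directly after the name (for example a month) would be captured too, so real bylines usually need site-specific cleanup.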

3. Noise filtering

The specified noise elements are removed before extraction, which improves accuracy.
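
To make the idea concrete, the sketch below prunes matching elements from a toy tree before any scoring would happen. It is an assumption-laden illustration: `SketchElement` is a hypothetical node type, and `matches` understands only bare tag, `.class`, and `#id` selectors, far less than real CSS.

```typescript
// Hypothetical element type standing in for a parsed DOM element.
interface SketchElement {
  tag: string;
  id?: string;
  classes: string[];
  children: SketchElement[];
}

// Supports only "tag", ".class" and "#id" selectors.
function matches(el: SketchElement, selector: string): boolean {
  if (selector.startsWith("#")) return el.id === selector.slice(1);
  if (selector.startsWith(".")) return el.classes.includes(selector.slice(1));
  return el.tag === selector.toLowerCase();
}

// Returns a copy of the tree without any descendant matching a noise selector.
function removeNoise(root: SketchElement, selectors: string[]): SketchElement {
  return {
    ...root,
    children: root.children
      .filter((child) => !selectors.some((s) => matches(child, s)))
      .map((child) => removeNoise(child, selectors)),
  };
}
```

Returning a pruned copy rather than mutating the input keeps the original tree available, for instance to retry with different selectors.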

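💡 Use Cases

1. News aggregation platform

// Collect content from multiple news sources
// (placeholder URLs; fetchFromSource is your own fetching helper)
const sources = [
  { name: 'Sina', url: 'https://example.com/sina' },
  { name: 'NetEase', url: 'https://example.com/netease' },
  { name: 'Tencent', url: 'https://example.com/tencent' }
];
const articles = [];

for (const source of sources) {
  const html = await fetchFromSource(source);
  const news = extractNews(html, { url: source.url });
  articles.push(news);
}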

2. Content archiving system

// Archive news articles on a schedule
// (convertToMarkdown is your own helper)
import fs from 'fs';

const news = extractNews(html, { url });
const markdown = convertToMarkdown(news);

fs.writeFileSync(
  `archives/${news.publishTime}-${news.title}.md`,
  markdown
);
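
Titles and timestamps can contain characters such as `/` or `:` that are not valid in file names, so it may be worth sanitising them before building the archive path. The helper below is a sketch; `archiveFileName`, the 100-character limit, and the `undated` and `untitled` fallbacks are all arbitrary assumptions.

```typescript
// Builds a file name from the publish time and title, replacing characters
// that are reserved on Windows or act as path separators.
function archiveFileName(publishTime: string, title: string): string {
  const clean = (part: string): string =>
    part
      .replace(/[\\/:*?"<>|\u0000-\u001f]/g, "-")
      .replace(/\s+/g, " ")
      .trim();
  const date = clean(publishTime) || "undated";
  const name = clean(title).slice(0, 100) || "untitled";
  return `${date}-${name}.md`;
}
```

The result can be joined onto the archive directory, for example with `path.join('archives', archiveFileName(news.publishTime, news.title))`.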

3. RSS feed generation

// Build RSS feed items
const articles = await fetchLatestArticles();
const rssItems = articles.map(html => {
  const news = extractNews(html);
  return {
    title: news.title,
    description: news.content.substring(0, 200),
    pubDate: news.publishTime,
    author: news.author
  };
});
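
RSS 2.0 expects `pubDate` in RFC 822 form, while `publishTime` is returned as a plain string whose format depends on the page. A small conversion step could look like this. `toRssDate` is a hypothetical helper, and it assumes the string is something the JavaScript `Date` constructor can parse, such as an ISO 8601 date.

```typescript
// Converts a parseable date string to an RFC 822 style UTC timestamp,
// or returns undefined when the string cannot be parsed.
function toRssDate(publishTime: string): string | undefined {
  const date = new Date(publishTime);
  return Number.isNaN(date.getTime()) ? undefined : date.toUTCString();
}

console.log(toRssDate("2026-03-22")); // Sun, 22 Mar 2026 00:00:00 GMT
```

Strings in other formats, such as Chinese dates, may not parse and would need to be normalised first.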

4. Knowledge base construction

// Build a Markdown knowledge base
const news = extractNews(html, { url });
const markdown = convertString(news.contentHtml);

await saveToKnowledgeBase({
  title: news.title,
  content: markdown,
  metadata: {
    author: news.author,
    date: news.publishTime,
    source: url
  }
});

⚠️ Notes

1. HTML source

This library only extracts content; it does not fetch HTML. Fetch the HTML yourself with one of the following tools (the snippets are alternatives, so use only one):

// Using axios
import axios from 'axios';
const { data: html } = await axios.get(url);

// Using node-fetch
import fetch from 'node-fetch';
const html = await fetch(url).then(r => r.text());

// Using puppeteer (suited to JavaScript-rendered pages)
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const html = await page.content();
await browser.close(); // release the browser process when done
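2. Scope

✅ Suitable for: news articles, blog posts, technical documentation
❌ Not suitable for: image galleries, video pages, list pages

3. Accuracy

Accuracy is relatively high in tests on mainstream Chinese news sites
For sites with unusual layouts, custom XPath expressions are recommended
Changes to a site's structure may affect extraction results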

4. Performance considerations

// For large numbers of articles, reuse a single extractor instance
const extractor = new NewsExtractor();

for (const html of htmlList) {
  const news = extractor.extract(html);
  // Process the result...
}

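🔗 Related Projects

@siping/html-to-markdown-node - HTML to Markdown converter
GeneralNewsExtractor - Python version (the inspiration for this project)

📝 Roadmap

[x] Core text density algorithm
[x] Title extraction
[x] Publish time extraction
[x] Author extraction
[x] Image extraction
[x] Noise filtering
[x] TypeScript support
[ ] Unit tests
[ ] Performance optimization
[ ] Multi-page article support
[ ] Adapters for more sites

🤝 Contributing

Issues and pull requests are welcome!

📄 License

MIT © Ping Si

👤 Author

Ping Si

GitHub: @sipingme
Email: [email protected]

If this project helps you, a ⭐️ Star is appreciated!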
