news-extractor-node

v0.1.0

A Node.js library for extracting news content from HTML pages using text density algorithm

Downloads: 119

A Node.js news content extraction library based on a text density algorithm

Inspired by GeneralNewsExtractor (GNE), this is a Node.js news content extraction tool implemented in TypeScript.

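✨ Features

  • 🎯 Smart content extraction: automatically identifies and extracts the article body using a text density algorithm
  • 📰 Automatic metadata detection: extracts the title, publish time, and author
  • 🖼️ Image extraction: collects the URLs of all images in the article
  • 🧹 Noise filtering: filters out ads, comments, and other unrelated content
  • ⚙️ Flexible configuration: supports custom XPath expressions and CSS selectors
  • 🚀 Zero-config usage: works out of the box, with no complex setup
  • 📦 TypeScript support: complete type definitions
  • 🔗 Good companion: integrates with @siping/html-to-markdown-node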

📦 Installation

npm install news-extractor-node

🚀 Quick Start

Basic usage

import { NewsExtractor } from 'news-extractor-node';

const html = `
<!DOCTYPE html>
<html>
<head>
  <title>新闻标题</title>
</head>
<body>
  <article>
    <h1>重大新闻:AI 技术取得突破</h1>
    <div class="meta">
      <span class="author">作者:张三</span>
      <time>2026-03-22</time>
    </div>
    <div class="content">
      <p>今天,研究人员宣布了一项重大突破...</p>
      <p>这项技术将改变整个行业...</p>
    </div>
  </article>
</body>
</html>
`;

const extractor = new NewsExtractor();
const result = extractor.extract(html, {
  url: 'https://example.com/news/article'
});

console.log(result);
// {
//   title: "重大新闻:AI 技术取得突破",
//   author: "张三",
//   publishTime: "2026-03-22",
//   content: "今天,研究人员宣布了一项重大突破...",
//   contentHtml: "<p>今天,研究人员宣布了一项重大突破...</p>...",
//   images: []
// }

Convenience function

import { extractNews } from 'news-extractor-node';

const result = extractNews(html, {
  url: 'https://example.com/news/article'
});

📖 Detailed Usage

1. Filtering noise

Remove ads, comments, and other distracting elements:

const result = extractor.extract(html, {
  url: 'https://example.com/news',
  noiseSelectors: [
    '.advertisement',      // ads
    '.comment-list',       // comment section
    '#related-articles',   // related articles
    'aside',              // sidebar
    'footer'              // footer
  ]
});

2. Custom extraction rules

When automatic extraction is inaccurate, you can specify custom XPath expressions:

const result = extractor.extract(html, {
  // Custom title extraction
  titleXPath: '//h1[@class="article-title"]/text()',

  // Custom author extraction
  authorXPath: '//span[@class="author-name"]/text()',

  // Custom publish time extraction
  publishTimeXPath: '//time/@datetime',

  // Custom body extraction
  contentXPath: '//div[@class="article-body"]'
});

3. Combining with Markdown conversion

A complete workflow from news extraction to Markdown:

import { NewsExtractor } from 'news-extractor-node';
import { convertString } from '@siping/html-to-markdown-node';

// 1. Extract the news content
const extractor = new NewsExtractor();
const news = extractor.extract(html, {
  url: 'https://example.com/news/article',
  noiseSelectors: ['.ad', '.comments']
});

// 2. Convert to Markdown
const markdown = convertString(news.contentHtml, {
  domain: 'https://example.com'
});

// 3. Assemble the full Markdown document
const fullMarkdown = `
# ${news.title}

**Author**: ${news.author}  
**Published**: ${news.publishTime}  
**Source**: https://example.com/news/article

---

${markdown}

## Images

${news.images.map(img => `- ![](${img})`).join('\n')}
`;

console.log(fullMarkdown);

4. Batch processing

Extract multiple news pages in a batch:

const urls = [
  'https://example.com/news/1',
  'https://example.com/news/2',
  'https://example.com/news/3'
];

const extractor = new NewsExtractor();
const results = [];

for (const url of urls) {
  // fetchHtml is your own helper that retrieves the HTML (using axios, puppeteer, etc.)
  const html = await fetchHtml(url);
  
  const news = extractor.extract(html, {
    url,
    noiseSelectors: ['.ad', '.sidebar']
  });
  
  results.push({
    url,
    ...news
  });
}

console.log(`Successfully extracted ${results.length} articles`);

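🔧 API Reference

NewsExtractor

The main extractor class.

extract(html: string, options?: ExtractOptions): NewsContent

Extracts news content from HTML.

Parameters:

  • html (string): the HTML to extract from
  • options (ExtractOptions, optional): extraction options
    • url (string): the page URL, used to resolve relative image links
    • titleXPath (string): custom XPath for title extraction
    • authorXPath (string): custom XPath for author extraction
    • publishTimeXPath (string): custom XPath for publish time extraction
    • contentXPath (string): custom XPath for body extraction
    • noiseSelectors (string[]): array of CSS selectors for noise elements to remove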
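
The option list above can be written down as a TypeScript interface. The sketch below is inferred from that list only. Marking every field optional is an assumption, and the declaration shipped in the package's own type definitions may differ.

```typescript
// Sketch of the options shape, inferred from the parameter list above.
// Assumption: every field is optional; the package's real type may differ.
interface ExtractOptions {
  url?: string;
  titleXPath?: string;
  authorXPath?: string;
  publishTimeXPath?: string;
  contentXPath?: string;
  noiseSelectors?: string[];
}

// Example options object that type-checks against the sketch.
const exampleOptions: ExtractOptions = {
  url: 'https://example.com/news/article',
  contentXPath: '//div[@class="article-body"]',
  noiseSelectors: ['.advertisement', 'footer'],
};
```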

Returns:

interface NewsContent {
  title: string;           // article title
  author: string;          // author
  publishTime: string;     // publish time
  content: string;         // plain-text content
  contentHtml: string;     // content as HTML
  images: string[];        // array of image URLs
}
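
The url option is described as the base for resolving relative image links. How the library does this internally is not documented here, so the following is only a sketch of one common approach using the standard URL constructor. resolveImageUrls is a hypothetical helper, not part of news-extractor-node.

```typescript
// Hypothetical helper: resolve possibly-relative image paths against the page URL.
// This is not the library's implementation, only an illustration of the idea.
function resolveImageUrls(srcs: string[], pageUrl: string): string[] {
  const resolved: string[] = [];
  for (const src of srcs) {
    try {
      // new URL(relative, base) applies standard URL resolution rules.
      resolved.push(new URL(src, pageUrl).href);
    } catch {
      // Skip values that cannot be parsed as a URL at all.
    }
  }
  return resolved;
}

const resolvedImages = resolveImageUrls(
  ['/img/a.png', 'b.jpg', 'https://cdn.example.com/c.png'],
  'https://example.com/news/article'
);
// resolvedImages:
//   https://example.com/img/a.png
//   https://example.com/news/b.jpg
//   https://cdn.example.com/c.png
```

Absolute URLs pass through unchanged, which is why the page URL matters only for pages that reference images with relative paths.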

extractNews(html: string, options?: ExtractOptions): NewsContent

Convenience function, equivalent to creating a NewsExtractor instance and calling its extract method.

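🎯 How It Works

This library implements a content extraction algorithm based on text density:

1. Text density calculation

Text density = number of text characters / number of HTML tags
  • Traverse every node in the DOM tree
  • Compute each node's text character count and HTML tag count
  • Punctuation is given a higher weight (5×)
  • The region with the highest text density is taken as the article body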
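
The formula and steps above can be illustrated with a small, self-contained sketch. It is deliberately naive and is not the library's code: DomNode is a stand-in for a parsed DOM element, and counting each punctuation mark as five characters is an assumed reading of the "5×" weight. A real extractor would likely need further signals, since with this bare formula a single long paragraph can outscore its container.

```typescript
// Minimal stand-in for a DOM element (assumption: a real implementation walks a parsed DOM).
interface DomNode {
  tag: string;
  text: string;          // text directly inside this node, excluding descendants
  children: DomNode[];
}

const PUNCTUATION = /[,。!?;:、,.!?;:]/g;
const PUNCTUATION_WEIGHT = 5; // assumed meaning of the "5×" weight

// Weighted character count of a subtree: each punctuation mark counts 5 times.
function weightedChars(node: DomNode): number {
  const punct = (node.text.match(PUNCTUATION) ?? []).length;
  const own = node.text.length + punct * (PUNCTUATION_WEIGHT - 1);
  return node.children.reduce((sum, child) => sum + weightedChars(child), own);
}

// Number of tags in a subtree, including the node itself.
function tagCount(node: DomNode): number {
  return node.children.reduce((sum, child) => sum + tagCount(child), 1);
}

// Text density = weighted text characters / HTML tags.
function textDensity(node: DomNode): number {
  return weightedChars(node) / tagCount(node);
}

// Visit every node and return the one with the highest density.
function densest(root: DomNode): DomNode {
  let best = root;
  const visit = (node: DomNode): void => {
    if (textDensity(node) > textDensity(best)) best = node;
    node.children.forEach(visit);
  };
  visit(root);
  return best;
}

const nav: DomNode = {
  tag: 'nav',
  text: '',
  children: [
    { tag: 'a', text: 'Home', children: [] },
    { tag: 'a', text: 'News', children: [] },
  ],
};
const article: DomNode = {
  tag: 'article',
  text: '',
  children: [
    { tag: 'p', text: 'Researchers announced a major breakthrough today.', children: [] },
    { tag: 'p', text: 'The technique is expected to change the whole industry, they said.', children: [] },
  ],
};
const page: DomNode = { tag: 'body', text: '', children: [nav, article] };

console.log(textDensity(article) > textDensity(nav)); // the article region is far denser than the navigation
console.log(densest(page).tag);
```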

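2. Metadata extraction

Title extraction strategy:

  1. Try common title selectors (h1.title, h1.article-title, etc.)
  2. Check meta tags (og:title, twitter:title)
  3. Fall back to the <title> tag

Time extraction strategy:

  1. Look for <time> tags and datetime attributes
  2. Check meta tags (article:published_time)
  3. Search the page text for date patterns using regular expressions
  4. Multiple date formats are supported (ISO 8601, Chinese dates, etc.)

Author extraction strategy:

  1. Look for common author selectors (.author, .article-author, etc.)
  2. Check meta tags (author, article:author)
  3. Use regular expressions to match patterns such as "作者:" and "By"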
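
Each strategy above is an ordered fallback chain. The sketch below shows that shape for the publish time only. The PageSignals type, the function names, and the regular expressions are all assumptions made for illustration; the library's actual patterns are not documented in this README.

```typescript
// Hypothetical date patterns, tried in order.
const DATE_PATTERNS: RegExp[] = [
  /\d{4}-\d{2}-\d{2}(?:[T ]\d{2}:\d{2}(?::\d{2})?)?/, // ISO-like: 2026-03-22 or 2026-03-22T08:30:00
  /\d{4}年\d{1,2}月\d{1,2}日/,                          // Chinese date: 2026年3月22日
  /\d{4}\/\d{1,2}\/\d{1,2}/,                           // slash-separated: 2026/3/22
];

// Return the first date-like substring found in free text, or null.
function findDateInText(text: string): string | null {
  for (const pattern of DATE_PATTERNS) {
    const match = text.match(pattern);
    if (match) return match[0];
  }
  return null;
}

// Hypothetical input: values a DOM pass might already have collected.
interface PageSignals {
  timeDatetime?: string;       // datetime attribute of a <time> tag
  metaPublishedTime?: string;  // content of the article:published_time meta tag
  bodyText: string;            // visible page text
}

// Try each source in priority order and return the first non-empty result.
function detectPublishTime(page: PageSignals): string | null {
  const strategies: Array<() => string | null> = [
    () => page.timeDatetime ?? null,
    () => page.metaPublishedTime ?? null,
    () => findDateInText(page.bodyText),
  ];
  for (const strategy of strategies) {
    const value = strategy();
    if (value && value.trim().length > 0) return value.trim();
  }
  return null;
}
```

Keeping the strategies in an array makes the priority order explicit and easy to reorder per site.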

3. Noise filtering

The specified noise elements are removed before extraction, which improves extraction accuracy.

💡 Use Cases

1. News aggregation platforms

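// Collect content from multiple news sources
// (placeholder URLs; fetchFromSource is your own helper)
const sources = [
  { name: 'Sina', url: 'https://example.com/sina' },
  { name: 'NetEase', url: 'https://example.com/netease' },
  { name: 'Tencent', url: 'https://example.com/tencent' }
];
const articles = [];

for (const source of sources) {
  const html = await fetchFromSource(source);
  const news = extractNews(html, { url: source.url });
  articles.push(news);
}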

2. Content archiving systems

// Periodically archive news articles (convertToMarkdown is your own helper)
import fs from 'fs';

const news = extractNews(html, { url });
const markdown = convertToMarkdown(news);

fs.writeFileSync(
  `archives/${news.publishTime}-${news.title}.md`,
  markdown
);
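
Titles and timestamps often contain characters that are not valid in file names, such as "/" or ":". The sketch below shows one way to sanitize them before building the archive path. toSafeFileName is a hypothetical helper, and the replacement rules are an assumption rather than anything the library provides.

```typescript
// Hypothetical helper: make a string safe to use as a single path segment.
function toSafeFileName(raw: string, maxLength = 100): string {
  const cleaned = raw
    .replace(/[\\\/:*?"<>|\x00-\x1f]/g, '-') // characters rejected by common file systems
    .replace(/\s+/g, ' ')                    // collapse runs of whitespace
    .trim()
    .slice(0, maxLength);
  return cleaned.length > 0 ? cleaned : 'untitled';
}

const safeName =
  `${toSafeFileName('2026-03-22T08:30:00')}-${toSafeFileName('AI/ML: what is next?')}.md`;
// safeName === '2026-03-22T08-30-00-AI-ML- what is next-.md'
```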

3. RSS feed generation

// Generate an RSS feed
const articles = await fetchLatestArticles();
const rssItems = articles.map(html => {
  const news = extractNews(html);
  return {
    title: news.title,
    description: news.content.substring(0, 200),
    pubDate: news.publishTime,
    author: news.author
  };
});
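
RSS 2.0 expects pubDate in an RFC 822 style format, while publishTime is returned as a plain string. The sketch below converts it with the built-in Date type. toRssPubDate is a hypothetical helper, and strings that Date cannot parse (which may include Chinese-style dates) yield undefined and would need normalizing first.

```typescript
// Hypothetical helper: convert an extracted publish time to an RSS-style pubDate.
// Returns undefined when the input is not parseable by the built-in Date.
function toRssPubDate(publishTime: string): string | undefined {
  const parsed = new Date(publishTime);
  return Number.isNaN(parsed.getTime()) ? undefined : parsed.toUTCString();
}

console.log(toRssPubDate('2026-03-22')); // Sun, 22 Mar 2026 00:00:00 GMT
```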

4. Knowledge base building

// Build a Markdown knowledge base
const news = extractNews(html, { url });
const markdown = convertString(news.contentHtml);

await saveToKnowledgeBase({
  title: news.title,
  content: markdown,
  metadata: {
    author: news.author,
    date: news.publishTime,
    source: url
  }
});

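⚠️ Notes

1. HTML source

This library only extracts content; it does not fetch HTML. You need to fetch the HTML yourself, for example with one of the following tools:

// Using axios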
import axios from 'axios';
const { data: html } = await axios.get(url);

// Using node-fetch
import fetch from 'node-fetch';
const html = await fetch(url).then(r => r.text());

// Using puppeteer (suited to JavaScript-rendered pages)
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const html = await page.content();
await browser.close(); // release the browser once the HTML has been captured

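2. Scope

  • Suitable for: news articles, blog posts, technical documentation
  • Not suitable for: image galleries, video pages, list pages

3. Accuracy

  • Accuracy is fairly high in tests on mainstream Chinese news sites
  • For sites with unusual layouts, custom XPath expressions are recommended
  • Changes to a site's structure may affect extraction results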
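
One way to manage custom XPath rules for sites with unusual layouts is a per-hostname override table. The sketch below is only an illustration: SiteOptions, SITE_OVERRIDES, optionsFor, the hostname, and the XPath values are all hypothetical, and only the option names come from the API reference.

```typescript
// Subset of the documented options used in this sketch.
type SiteOptions = {
  titleXPath?: string;
  contentXPath?: string;
  noiseSelectors?: string[];
};

// Hypothetical per-site overrides keyed by hostname; the XPath values are illustrative only.
const SITE_OVERRIDES: Record<string, SiteOptions> = {
  'news.example.com': {
    titleXPath: '//h1[@class="article-title"]/text()',
    contentXPath: '//div[@class="article-body"]',
  },
};

const DEFAULT_OPTIONS: SiteOptions = { noiseSelectors: ['.ad', '.sidebar'] };

// Merge defaults, site-specific overrides, and the page URL into one options object.
function optionsFor(url: string): SiteOptions & { url: string } {
  const host = new URL(url).hostname;
  return { ...DEFAULT_OPTIONS, ...(SITE_OVERRIDES[host] ?? {}), url };
}

const tuned = optionsFor('https://news.example.com/a/1');
const plain = optionsFor('https://other.example.org/x');
```

The result of optionsFor can then be passed as the second argument to extract, so that when a site changes its structure only the table entry needs updating.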

4. Performance considerations

// For large numbers of articles, reuse the extractor instance
const extractor = new NewsExtractor();

for (const html of htmlList) {
  const news = extractor.extract(html);
  // Process the result...
}

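🔗 Related Projects

📝 Roadmap

  • [x] Core text density algorithm
  • [x] Title extraction
  • [x] Time extraction
  • [x] Author extraction
  • [x] Image extraction
  • [x] Noise filtering
  • [x] TypeScript support
  • [ ] Unit tests
  • [ ] Performance optimization
  • [ ] Multi-page article support
  • [ ] Support for more websites

🤝 Contributing

Issues and pull requests are welcome!

📄 License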

MIT © Ping Si

👤 Author

Ping Si


If this project helps you, a ⭐️ Star is welcome!