news-to-markdown

v3.3.7

Published

14 days ago

Convert news articles to Markdown with platform-specific optimizations


siping

news markdown converter extractor toutiao wechat xiaohongshu

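Give it an article URL and get clean Markdown body text back. 17 platforms have dedicated optimizations; everything else goes through a generic algorithm.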

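✨ Features

🎯 17 dedicated platform adapters: Toutiao, WeChat, Xiaohongshu, Zhihu, 36kr, Huxiu, Wallstreetcn, The Paper, InfoQ, and more
🔄 Three-tier fetch strategy: curl → wget → Playwright with automatic fallback, covering both static and dynamic pages
🧠 Dual-engine extraction: Mozilla Readability + news-extractor-node, keeping whichever result is better
🖼️ Image localization: downloads remote images locally to produce an offline-ready Markdown + image bundle
🎨 Automatic cover fallback: when no cover image is detected, generates an abstract placeholder cover from the title, so downstream publishing pipelines (such as WeChat Official Accounts) do not fail for lack of a cover
🔌 Extensible: extend BasePlatform to register a custom platform adapter
📦 CLI + API: works both from the command line and as a Node.js API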
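
The three-tier fallback listed above can be pictured as trying each strategy in order and keeping the first one that succeeds. The sketch below is illustrative only: `Strategy`, `FetchOutcome`, and `fetchWithFallback` are hypothetical names, and nothing here is taken from the package's actual Fetcher.

```typescript
// Hypothetical sketch of an ordered fallback chain (not the package's real Fetcher).
type Strategy = {
  name: string;
  fetch: (url: string) => Promise<string>;
};

interface FetchOutcome {
  html: string;
  method: string;   // which strategy finally succeeded
  errors: string[]; // failures collected along the way
}

async function fetchWithFallback(url: string, strategies: Strategy[]): Promise<FetchOutcome> {
  const errors: string[] = [];
  for (const strategy of strategies) {
    try {
      const html = await strategy.fetch(url);
      // Treat an empty body as a failure so the next tier gets a chance.
      if (html.trim().length === 0) throw new Error('empty response');
      return { html, method: strategy.name, errors };
    } catch (err) {
      errors.push(`${strategy.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all strategies failed for ${url}: ${errors.join('; ')}`);
}
```

Passing the strategies in as data keeps the order (curl, then wget, then Playwright) visible at the call site and makes each tier easy to stub in tests.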

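📦 Installation

# Global install (recommended for CLI use)
npm install -g news-to-markdown

# Or run it directly with npx, no install needed
npx --yes news-to-markdown@latest --url "https://www.toutiao.com/article/123"

Optional: install Playwright (needed for pages rendered dynamically with JavaScript)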

npx playwright install chromium

🚀 Quick Start

CLI

# Basic conversion, printed to the terminal
npx --yes news-to-markdown@latest --url "https://www.toutiao.com/article/123"

# Save to a file
npx --yes news-to-markdown@latest --url "https://mp.weixin.qq.com/s/xxx" --output ./article.md

# Download images locally (builds an offline bundle)
npx --yes news-to-markdown@latest --url "https://mp.weixin.qq.com/s/xxx" \
  --download-images --output ./article

# Drop the metadata (body text only)
npx --yes news-to-markdown@latest --url "https://www.zhihu.com/p/xxx" --no-metadata

# Verbose logging (for debugging)
npx --yes news-to-markdown@latest --url "https://36kr.com/p/xxx" --verbose

Node.js API

import { NewsToMarkdownConverter } from 'news-to-markdown';

const converter = new NewsToMarkdownConverter();
const result = await converter.convert({
  url: 'https://www.toutiao.com/article/123',
});

console.log(result.markdown);
console.log(result.metadata.title);    // article title
console.log(result.metadata.author);   // author
console.log(result.metadata.platform); // detected platform

🔗 Using with browser-web-search

The most common AI agent orchestration pattern: search → extract body text

browser-web-search  →  searches and produces a list of URLs
news-to-markdown    →  reads each article and produces Markdown

# Step 1: search Toutiao and get the URLs of 3 articles
bws site toutiao/search "ai agent" --count 3

# Step 2: extract the body text of each article
npx --yes news-to-markdown@latest --url "https://www.toutiao.com/article/111"
npx --yes news-to-markdown@latest --url "https://www.toutiao.com/article/222"
npx --yes news-to-markdown@latest --url "https://www.toutiao.com/article/333"

This works with every bws command that returns a url field, including Toutiao, WeChat, Zhihu, 36kr, Huxiu, Wallstreetcn, InfoQ, and The Paper.
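
When the search step returns many URLs, the same search → extract flow can be driven from the Node.js API instead of repeated CLI calls. The following is a minimal sketch: `ConverterLike`, `BatchItem`, and `convertAll` are hypothetical names, and the converter is passed in so the loop does not depend on any particular implementation.

```typescript
// Minimal shape this sketch needs from a converter (a subset of the API shown earlier).
interface ConverterLike {
  convert(options: { url: string }): Promise<{ markdown: string }>;
}

interface BatchItem {
  url: string;
  markdown?: string;
  error?: string;
}

// Convert URLs one at a time so a single failure does not abort the batch.
async function convertAll(urls: string[], converter: ConverterLike): Promise<BatchItem[]> {
  const results: BatchItem[] = [];
  for (const url of urls) {
    try {
      const { markdown } = await converter.convert({ url });
      results.push({ url, markdown });
    } catch (err) {
      results.push({ url, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return results;
}
```

Going by the Node.js API example, an instance of `NewsToMarkdownConverter` should fit `ConverterLike`, since `convert({ url })` resolves to an object with a `markdown` field. The sequential loop is a deliberate choice here: it keeps request volume low against sites that apply anti-scraping checks.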

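📋 Supported Platforms

Platforms with dedicated optimizations (17)

The platform is detected automatically from the URL's domain; there is no need to specify platform manually.

| Platform | Domain | Dedicated optimizations |
|-----|------|------|
| Toutiao | toutiao.com | Title normalization, data-src images, list-marker fixes |
| WeChat Official Accounts | mp.weixin.qq.com | #js_content extraction, server-side mobile UA fallback |
| Xiaohongshu | xiaohongshu.com | .note-content extraction, lazy-loaded image handling |
| Zhihu | zhihu.com | Uses a real Chrome to get past zse-ck anti-scraping detection |
| 36kr | 36kr.com | Switches to the mobile URL automatically to get past anti-scraping |
| Huxiu | huxiu.com | Body-region extraction, lazy-loaded image fixes |
| Wallstreetcn | wallstreetcn.com | Financial article extraction, strips search-highlight `<em>` tags |
| The Paper | thepaper.cn | .news_txt body-region extraction |
| InfoQ | infoq.cn / infoq.com | Technical article extraction |
| Bilibili | bilibili.com | Column article (/read/) extraction |
| Juejin | juejin.cn | Code block and body extraction |
| CSDN | csdn.net | Removes the ad sidebar, #content_views extraction |
| Cnblogs | cnblogs.com | Technical blog extraction |
| Jianshu | jianshu.com | Article body extraction |
| SegmentFault | segmentfault.com | Technical Q&A extraction |
| OSChina | oschina.net | News body extraction |
| Woshipm | woshipm.com | Product article extraction |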
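
Domain-based detection of this kind usually comes down to a hostname suffix match. The sketch below shows one way it could work; `PLATFORM_DOMAINS` and `detectPlatform` are hypothetical, the domains are a subset of the table above, and the adapter keys mirror the names in the architecture tree rather than any confirmed internal identifiers.

```typescript
// Hypothetical hostname-to-adapter lookup (subset of the supported domains).
const PLATFORM_DOMAINS: Record<string, string> = {
  'toutiao.com': 'toutiao',
  'mp.weixin.qq.com': 'wechat',
  'xiaohongshu.com': 'xiaohongshu',
  'zhihu.com': 'zhihu',
  '36kr.com': '36kr',
  'infoq.cn': 'infoq',
  'infoq.com': 'infoq',
};

function detectPlatform(articleUrl: string): string {
  let host: string;
  try {
    host = new URL(articleUrl).hostname.toLowerCase();
  } catch {
    return 'generic'; // not a parseable URL
  }
  for (const [domain, platform] of Object.entries(PLATFORM_DOMAINS)) {
    // Match the domain itself or any subdomain, but not look-alikes such as "notzhihu.com".
    if (host === domain || host.endsWith(`.${domain}`)) return platform;
  }
  return 'generic';
}
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives when a supported domain merely appears in a path or query parameter.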

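Other platforms (generic algorithm)

URLs not covered by the list above automatically go through generic extraction with the Mozilla Readability + news-extractor-node dual engine. This suits:

English-language media: The Verge, Ars Technica, Engadget, and others
Technical blogs and news sites in general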
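
How the two engines' results are reconciled is not documented here. As a rough illustration of the idea, the sketch below keeps the body from whichever engine extracted more text and fills missing metadata from the other. The length-based scoring rule, the `Extraction` shape, and `pickBest` are all assumptions made for this example.

```typescript
// Hypothetical merge of two extractor outputs; the scoring rule is an assumption.
interface Extraction {
  engine: string;
  title?: string;
  author?: string;
  contentHtml: string;
}

// Rough visible-text length: strip tags and collapse whitespace.
function textLength(html: string): number {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim().length;
}

function pickBest(a: Extraction, b: Extraction): Extraction {
  const aWins = textLength(a.contentHtml) >= textLength(b.contentHtml);
  const winner = aWins ? a : b;
  const other = aWins ? b : a;
  return {
    engine: winner.engine,
    contentHtml: winner.contentHtml,
    // Fall back to the other engine for metadata the winner lacks.
    title: winner.title ?? other.title,
    author: winner.author ?? other.author,
  };
}
```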

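Note: some platforms (GitHub repositories, Reddit posts, X/Twitter posts, Weibo, Xueqiu, and so on) do not host long-form articles. bws already returns a structured summary for them, so there is no need to run this tool on those URLs.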

📐 API Reference

`convert(options): Promise<ConvertResult>`

interface ConvertOptions {
  url: string;                    // article URL (required)
  platform?: string;              // force a platform and skip auto-detection (optional)
  selector?: string;              // custom CSS selector for the content region (optional)
  noiseSelectors?: string[];      // selectors for noise elements to remove (optional)
  includeMetadata?: boolean;      // whether to include metadata, default true
  fetchStrategy?: FetchStrategy;  // 'auto' | 'curl' | 'wget' | 'playwright', default 'auto'
  customPlatform?: Platform;      // one-off custom adapter (optional)
  timeout?: number;               // timeout in milliseconds, default 30000
  verbose?: boolean;              // print detailed logs, default false
  downloadImages?: boolean;       // download images locally, default false
  outputDir?: string;             // output directory (required when downloadImages=true)
}

interface ConvertResult {
  markdown: string;               // converted Markdown body
  metadata: {
    title?: string;               // article title
    author?: string;              // author
    publishTime?: string;         // publication time
    platform: string;             // name of the detected platform
    imageCount: number;           // number of images
    contentLength: number;        // body length in characters
    fetchMethod?: string;         // fetch method actually used
    coverImage?: string;          // cover image URL
    localCoverPath?: string;      // local cover image path (when downloadImages=true)
    downloadedImages?: number;    // number of images downloaded
  };
  html?: string;                  // raw HTML (returned when verbose=true)
}
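
The defaults and the one cross-field rule in the option comments (outputDir is required when downloadImages is true) can be checked before calling `convert`. The helper below is a sketch under those documented defaults; `normalizeOptions` and `BasicOptions` are hypothetical and cover only a subset of the fields.

```typescript
// Hypothetical pre-flight helper: applies the documented defaults and
// enforces that outputDir accompanies downloadImages.
type FetchStrategy = 'auto' | 'curl' | 'wget' | 'playwright';

interface BasicOptions {
  url: string;
  includeMetadata?: boolean;
  fetchStrategy?: FetchStrategy;
  timeout?: number;
  verbose?: boolean;
  downloadImages?: boolean;
  outputDir?: string;
}

function normalizeOptions(options: BasicOptions) {
  if (options.downloadImages && !options.outputDir) {
    throw new Error('outputDir is required when downloadImages is true');
  }
  return {
    url: options.url,
    includeMetadata: options.includeMetadata ?? true,
    fetchStrategy: options.fetchStrategy ?? ('auto' as FetchStrategy),
    timeout: options.timeout ?? 30000, // milliseconds
    verbose: options.verbose ?? false,
    downloadImages: options.downloadImages ?? false,
    outputDir: options.outputDir,
  };
}
```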

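🖼️ Image Localization

Downloads remote images locally and produces an offline-ready Markdown + image bundle, which is convenient for republishing to platforms such as WeChat Official Accounts.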

const result = await converter.convert({
  url: 'https://www.toutiao.com/article/123',
  downloadImages: true,
  outputDir: './output/article-1',
});

// Output layout:
// output/article-1/
// ├── article.md       ← images referenced by relative path ./images/xxx.jpg
// └── images/
//     ├── cover.jpg
//     ├── image_1.jpg
//     └── image_2.jpg
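
The path replacement step can be thought of as a rewrite over Markdown image syntax. The sketch below illustrates that idea only: `localizeImageRefs` is a hypothetical function, the image_N naming follows the output layout shown above, and the extension handling is an assumption. It performs no downloads; it returns the list of files that would need fetching.

```typescript
// Hypothetical rewrite of remote image references to local relative paths.
interface LocalizeResult {
  markdown: string;
  downloads: { remoteUrl: string; localPath: string }[];
}

function localizeImageRefs(markdown: string): LocalizeResult {
  const downloads: { remoteUrl: string; localPath: string }[] = [];
  const seen = new Map<string, string>(); // remote URL -> local path, so repeats reuse one file
  const rewritten = markdown.replace(
    /!\[([^\]]*)\]\((https?:\/\/[^)\s]+)\)/g,
    (_match: string, alt: string, remoteUrl: string) => {
      let localPath = seen.get(remoteUrl);
      if (localPath === undefined) {
        // Keep a known image extension from the URL, defaulting to jpg.
        const found = /\.(jpe?g|png|gif|webp|svg)(?:$|[?#])/i.exec(remoteUrl);
        const ext = found ? found[1].toLowerCase() : 'jpg';
        localPath = `./images/image_${seen.size + 1}.${ext}`;
        seen.set(remoteUrl, localPath);
        downloads.push({ remoteUrl, localPath });
      }
      return `![${alt}](${localPath})`;
    },
  );
  return { markdown: rewritten, downloads };
}
```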

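🎨 Automatic Cover Fallback

Some text-only articles (including certain WeChat Official Account posts, short news items, and short columns) may yield no images at all, which stalls downstream publishing pipelines (for example, WeChat Official Account drafts require thumb_media_id). The tool falls back automatically when the following conditions hold:

coverImage selection failed (neither the meta tags nor the body images provide a usable cover)
title was extracted successfully

Fallback strategy: compute a stable seed from the title, call the open DiceBear API with the shapes style to generate an abstract-pattern SVG, and write it to the cover: field in the frontmatter. The same title always produces the same cover (the seed is the sum of the title's character codes), which makes results reproducible.

Notes:
The fallback cover is a remote SVG URL. With downloadImages: true, it is downloaded along with the other images to the local images/cover.svg.
WeChat Official Account drafts have their own cover requirements (jpg/png, aspect ratio, and so on). To publish there directly, rasterize the SVG locally on the publishing side or replace it with your own cover.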
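
The seed rule described above (summing the title's character codes) is simple enough to write out. In the sketch below, `titleSeed` and `fallbackCoverUrl` are hypothetical names, and the URL shape, including the "9.x" version segment, is an assumption about DiceBear's public HTTP API rather than something taken from this package.

```typescript
// Seed as described: the sum of the title's character codes.
function titleSeed(title: string): number {
  let sum = 0;
  for (let i = 0; i < title.length; i++) {
    sum += title.charCodeAt(i);
  }
  return sum;
}

// Assumed URL shape for the DiceBear "shapes" style; the version segment is a guess.
function fallbackCoverUrl(title: string): string {
  return `https://api.dicebear.com/9.x/shapes/svg?seed=${titleSeed(title)}`;
}
```

One consequence of a summed seed is that titles made of the same characters in a different order (for example "ab" and "ba") map to the same cover. That is harmless for a placeholder image, but worth knowing if covers are ever used to tell articles apart.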

🔧 Custom Platform Adapter

import { NewsToMarkdownConverter, BasePlatform } from 'news-to-markdown';
import { load } from 'cheerio';

class MyPlatform extends BasePlatform {
  name = 'my-platform';

  detect(url: string): boolean {
    return url.includes('mysite.com');
  }

  preprocess(html: string, url: string): string {
    const $ = load(html);
    $('.ad, .sidebar, .comments').remove();
    return $.html($('.article-content'));
  }

  postprocess(markdown: string): string {
    // Strip leftover "[广告]" ("[Ad]") markers from the converted Markdown
    return markdown.replace(/\[广告\]/g, '');
  }
}

const converter = new NewsToMarkdownConverter();
converter.registerPlatform(new MyPlatform());

const result = await converter.convert({ url: 'https://mysite.com/article/123' });
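
To make the role of the three hooks concrete, the sketch below shows one plausible way they could wrap the HTML-to-Markdown step. It is not the package's Converter: `AdapterLike`, `runWithAdapter`, the first-match rule, and the injected `toMarkdown` function are all assumptions for illustration.

```typescript
// Hypothetical view of how adapter hooks could wrap the conversion step.
interface AdapterLike {
  name: string;
  detect(url: string): boolean;
  preprocess(html: string, url: string): string;
  postprocess(markdown: string): string;
}

function runWithAdapter(
  url: string,
  html: string,
  adapters: AdapterLike[],
  toMarkdown: (html: string) => string, // stand-in for the real conversion step
): { platform: string; markdown: string } {
  // Assume the first adapter whose detect() accepts the URL wins.
  const adapter = adapters.find((a) => a.detect(url));
  if (!adapter) {
    return { platform: 'generic', markdown: toMarkdown(html) };
  }
  const cleaned = adapter.preprocess(html, url); // trim the HTML before conversion
  const markdown = adapter.postprocess(toMarkdown(cleaned)); // tidy the Markdown afterwards
  return { platform: adapter.name, markdown };
}
```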

🏗️ Architecture

news-to-markdown
├── Core
│   ├── Converter          Main orchestrator: fetch → extract → convert
│   ├── Fetcher            Three-tier fetching: curl → wget → Playwright
│   └── DualExtractor      Dual engine: Readability (structure) + NewsExtractor (metadata)
├── Platforms              Platform adapters (17 dedicated + 1 generic fallback)
│   ├── toutiao / wechat / xiaohongshu / zhihu / 36kr
│   ├── huxiu / wallstreetcn / thepaper / infoq
│   ├── bilibili / juejin / csdn / cnblogs / jianshu
│   ├── segmentfault / oschina / woshipm
│   └── generic            Generic fallback
└── Utils
    └── ImageDownloader    Image download and path rewriting

License

MIT — Ping Si
