news-to-markdown
v1.2.2
Published
Convert news articles to Markdown with platform-specific optimizations
Maintainers
Readme
news-to-markdown
Convert news articles to Markdown with platform-specific optimizations.
Features
- 🎯 Platform-Specific Adapters: Built-in support for TouTiao, WeChat, XiaoHongShu
- 🔌 Extensible Architecture: Easy to add custom platform adapters
- 🚀 Three-Tier Fetch Strategy: curl → wget → Playwright (auto fallback)
- 🧠 Dual-Engine Content Extraction: Combines Mozilla Readability + news-extractor-node
- Readability: Complete content extraction (preserves images & multimedia)
- NewsExtractor: Metadata extraction (title, author, publish time)
- Smart selection of best extraction result
- 📝 High-Quality Markdown: Powered by
@siping/html-to-markdown-node
Installation
npm install news-to-markdownOptional: Install Playwright for dynamic pages
npx playwright install chromiumQuick Start
import { NewsToMarkdownConverter } from 'news-to-markdown';
const converter = new NewsToMarkdownConverter();
const result = await converter.convert({
url: 'https://www.toutiao.com/article/123',
});
console.log(result.markdown);API
NewsToMarkdownConverter
convert(options: ConvertOptions): Promise<ConvertResult>
Convert a news URL to Markdown.
Options:
interface ConvertOptions {
url: string; // Required: Article URL
platform?: string; // Optional: Force specific platform
selector?: string; // Optional: CSS selector for content
noiseSelectors?: string[]; // Optional: Selectors to remove
includeMetadata?: boolean; // Optional: Include metadata (default: true)
fetchStrategy?: FetchStrategy; // Optional: 'curl' | 'wget' | 'playwright' | 'auto'
customPlatform?: Platform; // Optional: Custom platform adapter
timeout?: number; // Optional: Fetch timeout (default: 30000ms)
verbose?: boolean; // Optional: Enable verbose logging
}Result:
interface ConvertResult {
markdown: string;
metadata: {
title?: string;
author?: string;
publishTime?: string;
platform: string;
imageCount: number;
contentLength: number;
fetchMethod?: string;
coverImage?: string; // Recommended cover image
};
html?: string; // Only if verbose: true
}Built-in Platforms
TouTiao (头条)
- Normalizes headings (
pgc-h-*classes) - Handles
data-srcimages - Converts
●bullet points to lists - Fixes code block spacing
const result = await converter.convert({
url: 'https://www.toutiao.com/article/123',
platform: 'toutiao',
});WeChat (微信)
- Extracts
#js_contentsection - Handles
data-srcimages - Removes inline styles
const result = await converter.convert({
url: 'https://mp.weixin.qq.com/s/xxx',
platform: 'wechat',
});XiaoHongShu (小红书)
- Extracts
.note-contentsection - Handles
data-srcanddata-originalimages - Removes tag elements
const result = await converter.convert({
url: 'https://www.xiaohongshu.com/explore/xxx',
platform: 'xiaohongshu',
});Custom Platform Adapter
import { NewsToMarkdownConverter, BasePlatform } from 'news-to-markdown';
import { load } from 'cheerio';
class MyPlatform extends BasePlatform {
name = 'my-platform';
detect(url: string): boolean {
return url.includes('mysite.com');
}
preprocess(html: string, url: string): string {
const $ = load(html);
$('.ad, .sidebar').remove();
return $.html($('.content'));
}
postprocess(markdown: string): string {
return markdown.replace(/\[AD\]/g, '');
}
}
const converter = new NewsToMarkdownConverter();
converter.registerPlatform(new MyPlatform());
const result = await converter.convert({
url: 'https://mysite.com/article',
});Fetch Strategies
Auto (Default)
Tries curl → wget → Playwright in sequence until success.
const result = await converter.convert({
url: 'https://example.com/article',
fetchStrategy: 'auto', // default
});Curl
Fastest (1-3s), for static pages.
const result = await converter.convert({
url: 'https://example.com/article',
fetchStrategy: 'curl',
});Wget
Fallback (2-4s), better compatibility.
const result = await converter.convert({
url: 'https://example.com/article',
fetchStrategy: 'wget',
});Playwright
Slowest (5-10s), supports JavaScript rendering.
const result = await converter.convert({
url: 'https://example.com/article',
fetchStrategy: 'playwright',
});Examples
See the examples directory:
basic.ts- Simple usageplatform-specific.ts- Platform-specific conversionscustom-platform.ts- Custom platform adapter
Architecture
news├to-markdowny)
│ └── DualExtractor (dual-engine extraction)
│ ├── Readabilit (content structure)
│ └── NewsExtractor (metadata
├── Core
│ ├── Converter (main orchestrator)
│ └── Fetcher (three-tier strategy)
├── Platforms (extensible adapters)
│ ├── TouTiao
│ ├── WeChat
│ ├── Xuzill/eaability (ctntextacti
│ ├── Generic (fallback)meadaaion)
└── @sipng/html-to-markdown-node (HTML → Markdw
└── Dependencies
├── @siping/html-to-markdown-node (HTML → Markdown)
└── news-extractor-node (content extraction)License
MIT
Author
Ping Si [email protected]
