html-content-processor
v1.0.5
Published
A professional library for processing, cleaning, filtering, and converting HTML content to Markdown. Features advanced customization options, presets, plugin support, fluent API, and TypeScript integration for reliable content extraction.
Maintainers
Readme
HTML Content Processor
A modern TypeScript library for cleaning, filtering, and converting HTML content to Markdown with intelligent content extraction. Supports cross-environment execution (Browser/Node.js) with automatic page type detection.
Features
- 🚀 Modern API Design - Clean functional and class-based APIs
- 🧠 Intelligent Filtering - Automatic page type detection with optimal filtering strategies
- 📝 High-Quality Markdown Conversion - Advanced HTML to Markdown transformation
- 🌐 Cross-Environment Support - Full compatibility with both browser and Node.js environments
- 🎯 Smart Presets - Optimized configurations for different content types
- 🔌 Plugin System - Extensible plugin architecture
- 📊 Automatic Detection - Smart detection of search engines, blogs, news, documentation, and more
Installation
npm install html-content-processorQuick Start
Basic Usage
import { htmlToMarkdown, htmlToText, cleanHtml } from 'html-content-processor';
// Convert HTML to Markdown
const markdown = await htmlToMarkdown('<h1>Hello</h1><p>World</p>');
// Convert HTML to plain text
const text = await htmlToText('<h1>Hello</h1><p>World</p>');
// Clean HTML content
const clean = await cleanHtml('<div>Content</div><script>ads</script>');Automatic Page Type Detection (Recommended)
The library can automatically detect page types and apply optimal filtering strategies:
import { htmlToMarkdownAuto, cleanHtmlAuto, extractContentAuto } from 'html-content-processor';
// Automatic detection with URL context
const markdown = await htmlToMarkdownAuto(html, 'https://example.com/blog-post');
// Clean HTML with automatic page type detection
const cleanHtml = await cleanHtmlAuto(html, 'https://news.example.com/article');
// Extract content with detailed page type information
const result = await extractContentAuto(html, 'https://docs.example.com/guide');
console.log('Detected page type:', result.pageType.type);
console.log('Confidence:', result.pageType.confidence);
console.log('Markdown:', result.markdown.content);HtmlProcessor Class (Advanced Usage)
import { HtmlProcessor } from 'html-content-processor';
// Method chaining
const result = await HtmlProcessor
.from(html)
.withBaseUrl('https://example.com')
.withAutoDetection() // Enable automatic page type detection
.filter()
.toMarkdown();
// Manual page type setting
const processor = await HtmlProcessor
.from(html)
.withPageType('blog') // Manually set page type
.filter();
const markdown = await processor.toMarkdown();Content-Specific Presets
import {
htmlToArticleMarkdown,
htmlToBlogMarkdown,
htmlToNewsMarkdown
} from 'html-content-processor';
// Optimized for different content types
const articleMd = await htmlToArticleMarkdown(html, baseUrl);
const blogMd = await htmlToBlogMarkdown(html, baseUrl);
const newsMd = await htmlToNewsMarkdown(html, baseUrl);API Reference
Core Functions
| Function | Description | Return Type |
|----------|-------------|-------------|
| htmlToMarkdown(html, options?) | Convert HTML to Markdown | Promise<string> |
| htmlToMarkdownWithCitations(html, baseUrl?, options?) | Convert HTML to Markdown with citations | Promise<string> |
| htmlToText(html, options?) | Convert HTML to plain text | Promise<string> |
| cleanHtml(html, options?) | Clean HTML content | Promise<string> |
| extractContent(html, options?) | Extract content fragments | Promise<string[]> |
Automatic Detection Functions
| Function | Description | Benefits |
|----------|-------------|----------|
| htmlToMarkdownAuto(html, url?, options?) | Auto-detect page type and convert to Markdown | Optimal filtering for each page type |
| cleanHtmlAuto(html, url?, options?) | Auto-detect page type and clean HTML | Smart noise removal |
| extractContentAuto(html, url?, options?) | Auto-detect and extract with detailed results | Comprehensive page analysis |
Example: Using Auto-Detection
// Blog post detection
const blogResult = await htmlToMarkdownAuto(html, 'https://medium.com/@user/post');
// Automatically applies blog-optimized filtering
// News article detection
const newsResult = await htmlToMarkdownAuto(html, 'https://cnn.com/article');
// Automatically applies news-optimized filtering
// Documentation detection
const docsResult = await htmlToMarkdownAuto(html, 'https://docs.react.dev/guide');
// Automatically applies documentation-optimized filtering
// Search engine results detection
const searchResult = await htmlToMarkdownAuto(html, 'https://google.com/search?q=query');
// Automatically applies search-results-optimized filteringContent-Specific Presets
| Function | Optimized For |
|----------|---------------|
| htmlToArticleMarkdown() | Long-form articles |
| htmlToBlogMarkdown() | Blog posts |
| htmlToNewsMarkdown() | News articles |
| strictCleanHtml() | Aggressive cleaning |
| gentleCleanHtml() | Conservative cleaning |
HtmlProcessor Class
// Create processor
const processor = HtmlProcessor.from(html, options);
// Configuration methods
processor.withBaseUrl(url) // Set base URL
processor.withOptions(options) // Update options
processor.withAutoDetection(url?) // Enable auto-detection
processor.withPageType(type) // Manually set page type
// Processing methods
await processor.filter(options?) // Apply filtering
await processor.toMarkdown(options?) // Convert to Markdown
await processor.toText() // Convert to plain text
await processor.toArray() // Convert to fragment array
processor.toString() // Get cleaned HTML
// Information methods
processor.getOptions() // Get current options
processor.isProcessed() // Check if processed
processor.getPageTypeResult() // Get page type detection resultConfiguration Options
Filter Options (FilterOptions)
{
threshold?: number; // Filtering threshold (default: 2)
strategy?: 'fixed' | 'dynamic'; // Filtering strategy (default: 'dynamic')
ratio?: number; // Text density ratio (default: 0.48)
minWords?: number; // Minimum word count (default: 0)
preserveStructure?: boolean; // Preserve structure (default: false)
keepElements?: string[]; // Elements to keep
removeElements?: string[]; // Elements to remove
}Convert Options (ConvertOptions)
{
citations?: boolean; // Generate citations (default: true)
ignoreLinks?: boolean; // Ignore links (default: false)
ignoreImages?: boolean; // Ignore images (default: false)
baseUrl?: string; // Base URL
threshold?: number; // Filter threshold
strategy?: 'fixed' | 'dynamic'; // Filter strategy
ratio?: number; // Text density ratio
}Automatic Page Type Detection
The library automatically detects and optimizes for these page types:
search-engine- Search engine result pagesblog- Blog posts and personal articlesnews- News articles and journalismdocumentation- Technical documentatione-commerce- E-commerce and product pagessocial-media- Social media contentforum- Forum discussions and Q&Aarticle- General articles and contentlanding-page- Marketing and landing pages
How Auto-Detection Works
import { extractContentAuto } from 'html-content-processor';
const result = await extractContentAuto(html, url);
console.log('Page Type:', result.pageType.type);
console.log('Confidence:', (result.pageType.confidence * 100).toFixed(1) + '%');
console.log('Detection Reasons:', result.pageType.reasons);
console.log('Applied Filter Options:', result.pageType.filterOptions);Environment Support
Node.js
npm install jsdom # Recommended for best performanceBrowser
Direct support, no additional dependencies required.
CDN
<script src="https://unpkg.com/html-content-processor"></script>
<script>
// Global variable: window.htmlFilter
htmlFilter.htmlToMarkdown(html).then(console.log);
// Auto-detection example
htmlFilter.htmlToMarkdownAuto(html, window.location.href).then(result => {
console.log('Auto-detected content:', result);
});
</script>Real-World Examples
Web Scraping with Auto-Detection
import { htmlToMarkdownAuto } from 'html-content-processor';
// Scrape and convert blog post
const response = await fetch('https://blog.example.com/post-123');
const html = await response.text();
const markdown = await htmlToMarkdownAuto(html, response.url);
// Automatically detects it's a blog and applies blog-specific filteringNews Article Processing
import { extractContentAuto } from 'html-content-processor';
const result = await extractContentAuto(newsHtml, 'https://news.site.com/article');
if (result.pageType.type === 'news') {
console.log('High-quality news content extracted');
console.log('Confidence:', result.pageType.confidence);
}Documentation Conversion
import { htmlToMarkdownAuto } from 'html-content-processor';
// Convert technical documentation
const docMarkdown = await htmlToMarkdownAuto(docsHtml, 'https://docs.example.com/api');
// Automatically preserves code blocks, headers, and technical content structurePerformance
- ⚡ Fast Processing: Optimized algorithms for quick content extraction
- 💾 Memory Efficient: Minimal memory footprint
- 🔄 Batch Processing: Handle multiple documents efficiently
- 📊 Smart Caching: Automatic page type detection caching
License
MIT License
