html-content-processor v1.0.5 (published 7 months ago)

A professional library for processing, cleaning, filtering, and converting HTML content to Markdown. Features advanced customization options, presets, plugin support, fluent API, and TypeScript integration for reliable content extraction.

HTML Content Processor

A modern TypeScript library for cleaning, filtering, and converting HTML content to Markdown with intelligent content extraction. Supports cross-environment execution (Browser/Node.js) with automatic page type detection.

Features

🚀 Modern API Design - Clean functional and class-based APIs
🧠 Intelligent Filtering - Automatic page type detection with optimal filtering strategies
📝 High-Quality Markdown Conversion - Advanced HTML to Markdown transformation
🌐 Cross-Environment Support - Full compatibility with both browser and Node.js environments
🎯 Smart Presets - Optimized configurations for different content types
🔌 Plugin System - Extensible plugin architecture
📊 Automatic Detection - Smart detection of search engines, blogs, news, documentation, and more

Installation

npm install html-content-processor

Quick Start

Basic Usage

import { htmlToMarkdown, htmlToText, cleanHtml } from 'html-content-processor';

// Convert HTML to Markdown
const markdown = await htmlToMarkdown('<h1>Hello</h1><p>World</p>');

// Convert HTML to plain text
const text = await htmlToText('<h1>Hello</h1><p>World</p>');

// Clean HTML content
const clean = await cleanHtml('<div>Content</div><script>ads</script>');

Automatic Page Type Detection (Recommended)

The library can automatically detect page types and apply optimal filtering strategies:

import { htmlToMarkdownAuto, cleanHtmlAuto, extractContentAuto } from 'html-content-processor';

// Automatic detection with URL context
const markdown = await htmlToMarkdownAuto(html, 'https://example.com/blog-post');

// Clean HTML with automatic page type detection
const cleanedHtml = await cleanHtmlAuto(html, 'https://news.example.com/article');

// Extract content with detailed page type information
const result = await extractContentAuto(html, 'https://docs.example.com/guide');
console.log('Detected page type:', result.pageType.type);
console.log('Confidence:', result.pageType.confidence);
console.log('Markdown:', result.markdown.content);

HtmlProcessor Class (Advanced Usage)

import { HtmlProcessor } from 'html-content-processor';

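// Method chaining: the with*() configuration methods chain directly;
// filter() is asynchronous, so await it before converting
const filtered = await HtmlProcessor
  .from(html)
  .withBaseUrl('https://example.com')
  .withAutoDetection() // Enable automatic page type detection
  .filter();
const result = await filtered.toMarkdown();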

// Manual page type setting
const processor = await HtmlProcessor
  .from(html)
  .withPageType('blog') // Manually set page type
  .filter();

const markdown = await processor.toMarkdown();

Content-Specific Presets

import { 
  htmlToArticleMarkdown, 
  htmlToBlogMarkdown, 
  htmlToNewsMarkdown 
} from 'html-content-processor';

// Optimized for different content types
const articleMd = await htmlToArticleMarkdown(html, baseUrl);
const blogMd = await htmlToBlogMarkdown(html, baseUrl);
const newsMd = await htmlToNewsMarkdown(html, baseUrl);

API Reference

Core Functions

| Function | Description | Return Type |
|----------|-------------|-------------|
| htmlToMarkdown(html, options?) | Convert HTML to Markdown | Promise<string> |
| htmlToMarkdownWithCitations(html, baseUrl?, options?) | Convert HTML to Markdown with citations | Promise<string> |
| htmlToText(html, options?) | Convert HTML to plain text | Promise<string> |
| cleanHtml(html, options?) | Clean HTML content | Promise<string> |
| extractContent(html, options?) | Extract content fragments | Promise<string[]> |

Automatic Detection Functions

| Function | Description | Benefits |
|----------|-------------|----------|
| htmlToMarkdownAuto(html, url?, options?) | Auto-detect page type and convert to Markdown | Optimal filtering for each page type |
| cleanHtmlAuto(html, url?, options?) | Auto-detect page type and clean HTML | Smart noise removal |
| extractContentAuto(html, url?, options?) | Auto-detect and extract with detailed results | Comprehensive page analysis |

Example: Using Auto-Detection

// Blog post detection
const blogResult = await htmlToMarkdownAuto(html, 'https://medium.com/@user/post');
// Automatically applies blog-optimized filtering

// News article detection  
const newsResult = await htmlToMarkdownAuto(html, 'https://cnn.com/article');
// Automatically applies news-optimized filtering

// Documentation detection
const docsResult = await htmlToMarkdownAuto(html, 'https://docs.react.dev/guide');
// Automatically applies documentation-optimized filtering

// Search engine results detection
const searchResult = await htmlToMarkdownAuto(html, 'https://google.com/search?q=query');
// Automatically applies search-results-optimized filtering
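
The README does not describe how detection works internally. As a rough illustration of why passing a URL helps, here is a toy URL-only heuristic that reproduces the four mappings shown above. The names `guessPageTypeFromUrl`, `URL_RULES`, and `GuessedType` are hypothetical, and this is not the library's algorithm, which presumably also inspects the HTML itself.

```typescript
// Hypothetical URL-only heuristic; NOT the library's detection algorithm.
type GuessedType = 'search-engine' | 'blog' | 'news' | 'documentation' | 'article';

interface UrlRule {
  type: GuessedType;
  pattern: RegExp; // tested against the lowercased URL without its scheme
}

// Rules are checked in order; the first match wins.
const URL_RULES: UrlRule[] = [
  { type: 'search-engine', pattern: /^[^/]*\/search\?/ },
  { type: 'documentation', pattern: /^docs\.|\/docs\// },
  { type: 'news', pattern: /^(www\.)?cnn\.|^news\./ },
  { type: 'blog', pattern: /^(www\.)?medium\.com|^blog\.|\/blog\// },
];

function guessPageTypeFromUrl(url: string): GuessedType {
  const withoutScheme = url.replace(/^[a-z]+:\/\//i, '').toLowerCase();
  for (const rule of URL_RULES) {
    if (rule.pattern.test(withoutScheme)) {
      return rule.type;
    }
  }
  return 'article'; // generic fallback when no rule matches
}
```

A URL alone is weak evidence, which is one reason a detector would report a confidence value alongside the type rather than a bare label.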

Content-Specific Presets

| Function | Optimized For |
|----------|---------------|
| htmlToArticleMarkdown() | Long-form articles |
| htmlToBlogMarkdown() | Blog posts |
| htmlToNewsMarkdown() | News articles |
| strictCleanHtml() | Aggressive cleaning |
| gentleCleanHtml() | Conservative cleaning |
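
When the content kind is already known (for example from your own crawl metadata), one way to choose among the three Markdown presets is a small dispatcher. The sketch below is a hypothetical helper, not part of the library: `Converter`, `MarkdownPresets`, and `pickPreset` are illustrative names, and the `(html, baseUrl)` signature is inferred from the preset calls shown earlier.

```typescript
// Signature inferred from the earlier preset calls: preset(html, baseUrl).
type Converter = (html: string, baseUrl?: string) => Promise<string>;

// The caller supplies the real preset functions, so this sketch needs no import.
interface MarkdownPresets {
  article: Converter;
  blog: Converter;
  news: Converter;
}

// Hypothetical dispatcher: unknown kinds fall back to the article preset.
function pickPreset(presets: MarkdownPresets, kind: string): Converter {
  switch (kind) {
    case 'blog':
      return presets.blog;
    case 'news':
      return presets.news;
    default:
      return presets.article;
  }
}
```

In real use you would pass `{ article: htmlToArticleMarkdown, blog: htmlToBlogMarkdown, news: htmlToNewsMarkdown }` as `presets` and then call the returned function with your HTML and base URL.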

HtmlProcessor Class

// Create processor
const processor = HtmlProcessor.from(html, options);

// Configuration methods
processor.withBaseUrl(url)           // Set base URL
processor.withOptions(options)       // Update options
processor.withAutoDetection(url?)    // Enable auto-detection
processor.withPageType(type)         // Manually set page type

// Processing methods
await processor.filter(options?)     // Apply filtering
await processor.toMarkdown(options?) // Convert to Markdown
await processor.toText()             // Convert to plain text
await processor.toArray()            // Convert to fragment array
processor.toString()                 // Get cleaned HTML

// Information methods
processor.getOptions()               // Get current options
processor.isProcessed()              // Check if processed
processor.getPageTypeResult()        // Get page type detection result

Configuration Options

Filter Options (FilterOptions)

{
  threshold?: number;           // Filtering threshold (default: 2)
  strategy?: 'fixed' | 'dynamic'; // Filtering strategy (default: 'dynamic')
  ratio?: number;              // Text density ratio (default: 0.48)
  minWords?: number;           // Minimum word count (default: 0)
  preserveStructure?: boolean; // Preserve structure (default: false)
  keepElements?: string[];     // Elements to keep
  removeElements?: string[];   // Elements to remove
}
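
To see what a partial options object resolves to, the documented defaults can be written out and merged with overrides. This is only a sketch: the field names and default values are copied from the listing above, while `DOCUMENTED_DEFAULTS` and `withDocumentedDefaults` are hypothetical helpers; the library applies its own defaults internally.

```typescript
// Field names copied from the FilterOptions listing above.
interface FilterOptions {
  threshold?: number;
  strategy?: 'fixed' | 'dynamic';
  ratio?: number;
  minWords?: number;
  preserveStructure?: boolean;
  keepElements?: string[];
  removeElements?: string[];
}

// Defaults as documented above; hypothetical constant, not a library export.
const DOCUMENTED_DEFAULTS = {
  threshold: 2,
  strategy: 'dynamic' as const,
  ratio: 0.48,
  minWords: 0,
  preserveStructure: false,
};

function withDocumentedDefaults(overrides: FilterOptions = {}) {
  // Later spreads win, so caller-supplied values replace the defaults.
  return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

// A stricter configuration for noisy pages: only two fields change.
const strict = withDocumentedDefaults({ minWords: 5, removeElements: ['nav', 'footer'] });
```

Here `strict` keeps `threshold`, `strategy`, `ratio`, and `preserveStructure` at their documented defaults and changes only `minWords` and `removeElements`.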

Convert Options (ConvertOptions)

{
  citations?: boolean;         // Generate citations (default: true)
  ignoreLinks?: boolean;       // Ignore links (default: false)
  ignoreImages?: boolean;      // Ignore images (default: false)
  baseUrl?: string;           // Base URL
  threshold?: number;         // Filter threshold
  strategy?: 'fixed' | 'dynamic'; // Filter strategy
  ratio?: number;             // Text density ratio
}

Automatic Page Type Detection

The library automatically detects and optimizes for these page types:

search-engine - Search engine result pages
blog - Blog posts and personal articles
news - News articles and journalism
documentation - Technical documentation
e-commerce - E-commerce and product pages
social-media - Social media content
forum - Forum discussions and Q&A
article - General articles and content
landing-page - Marketing and landing pages
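
If page types arrive as untrusted strings (a CLI flag, a config file), it can help to validate them before calling withPageType(). The sketch below derives a union type from the nine names listed above; `PAGE_TYPES`, `PageType`, `isPageType`, and `parsePageType` are hypothetical names, and the library may export its own type for this.

```typescript
// The nine page types listed above, as a readonly tuple.
const PAGE_TYPES = [
  'search-engine', 'blog', 'news', 'documentation', 'e-commerce',
  'social-media', 'forum', 'article', 'landing-page',
] as const;

// Union type derived from the tuple: 'search-engine' | 'blog' | ...
type PageType = (typeof PAGE_TYPES)[number];

// Type guard: narrows an arbitrary string to PageType when it is a known name.
function isPageType(value: string): value is PageType {
  return (PAGE_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Returns the value when valid, otherwise the fallback (the generic 'article' by default).
function parsePageType(value: string, fallback: PageType = 'article'): PageType {
  return isPageType(value) ? value : fallback;
}
```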

How Auto-Detection Works

import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(html, url);

console.log('Page Type:', result.pageType.type);
console.log('Confidence:', (result.pageType.confidence * 100).toFixed(1) + '%');
console.log('Detection Reasons:', result.pageType.reasons);
console.log('Applied Filter Options:', result.pageType.filterOptions);
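
One way to act on the detection result is to trust the detected type only above a confidence cutoff. The sketch below assumes the result shape implied by the fields logged above and that confidence lies between 0 and 1 (suggested by the multiplication by 100). `PageTypeResult`, `effectivePageType`, `describeDetection`, and the 0.6 cutoff are hypothetical choices, not library API.

```typescript
// Assumed shape, inferred from the fields logged above; the real type may differ.
interface PageTypeResult {
  type: string;
  confidence: number; // assumed to be in the range 0..1
  reasons: string[];
}

// Hypothetical policy: keep the detected type only at or above the cutoff.
function effectivePageType(
  detected: PageTypeResult,
  minConfidence = 0.6,
  fallback = 'article',
): string {
  return detected.confidence >= minConfidence ? detected.type : fallback;
}

// One-line summary for logs: "<type> (<percent>%): <reasons joined by '; '>".
function describeDetection(detected: PageTypeResult): string {
  const percent = (detected.confidence * 100).toFixed(1);
  return `${detected.type} (${percent}%): ${detected.reasons.join('; ')}`;
}
```

The value returned by `effectivePageType` could then be passed to withPageType() when you prefer a conservative fallback over a low-confidence guess.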

Environment Support

Node.js

npm install jsdom  # Recommended for best performance

Browser

Direct support, no additional dependencies required.

CDN

<script src="https://unpkg.com/html-content-processor"></script>
<script>
  // Global variable: window.htmlFilter
  const html = document.documentElement.outerHTML; // HTML of the current page
  htmlFilter.htmlToMarkdown(html).then(console.log);
  
  // Auto-detection example
  htmlFilter.htmlToMarkdownAuto(html, window.location.href).then(result => {
    console.log('Auto-detected content:', result);
  });
</script>

Real-World Examples

Web Scraping with Auto-Detection

import { htmlToMarkdownAuto } from 'html-content-processor';

// Scrape and convert blog post
const response = await fetch('https://blog.example.com/post-123');
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}
const html = await response.text();
const markdown = await htmlToMarkdownAuto(html, response.url);
// Automatically detects it's a blog and applies blog-specific filtering

News Article Processing

import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(newsHtml, 'https://news.site.com/article');
if (result.pageType.type === 'news') {
  console.log('High-quality news content extracted');
  console.log('Confidence:', result.pageType.confidence);
}

Documentation Conversion

import { htmlToMarkdownAuto } from 'html-content-processor';

// Convert technical documentation
const docMarkdown = await htmlToMarkdownAuto(docsHtml, 'https://docs.example.com/api');
// Automatically preserves code blocks, headers, and technical content structure

Performance

⚡ Fast Processing: Optimized algorithms for quick content extraction
💾 Memory Efficient: Minimal memory footprint
🔄 Batch Processing: Handle multiple documents efficiently
📊 Smart Caching: Automatic page type detection caching
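
The README does not say how the detection cache is keyed or when it expires. If you process the same URLs repeatedly, a caller-side memoization layer is a simple complement. The sketch below is a generic, hypothetical helper (`memoizeByKey`), independent of whatever caching the library does internally; `detectHost` is a stand-in for an expensive call.

```typescript
// Hypothetical caller-side cache: computes each key once, then reuses the result.
function memoizeByKey<T>(compute: (key: string) => T): (key: string) => T {
  const cache = new Map<string, T>();
  return (key: string): T => {
    if (!cache.has(key)) {
      cache.set(key, compute(key));
    }
    return cache.get(key) as T;
  };
}

// Stand-in for an expensive call; counts how often it actually runs.
let calls = 0;
const detectHost = memoizeByKey((url: string) => {
  calls += 1;
  return url.replace(/^[a-z]+:\/\//i, '').split('/')[0];
});

detectHost('https://example.com/a');
detectHost('https://example.com/a'); // second call is served from the cache
```

Because `T` can be a Promise, the same helper can wrap an async function; in that case a rejected promise stays cached, so you may want to evict entries on failure.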

License

MIT License
