html-content-processor

v1.0.5

A professional library for processing, cleaning, filtering, and converting HTML content to Markdown. Features advanced customization options, presets, plugin support, fluent API, and TypeScript integration for reliable content extraction.

HTML Content Processor

A modern TypeScript library for cleaning, filtering, and converting HTML content to Markdown with intelligent content extraction. Supports cross-environment execution (Browser/Node.js) with automatic page type detection.

Features

  • 🚀 Modern API Design - Clean functional and class-based APIs
  • 🧠 Intelligent Filtering - Automatic page type detection with optimal filtering strategies
  • 📝 High-Quality Markdown Conversion - Advanced HTML to Markdown transformation
  • 🌐 Cross-Environment Support - Full compatibility with both browser and Node.js environments
  • 🎯 Smart Presets - Optimized configurations for different content types
  • 🔌 Plugin System - Extensible plugin architecture
  • 📊 Automatic Detection - Smart detection of search engines, blogs, news, documentation, and more

Installation

npm install html-content-processor

Quick Start

Basic Usage

import { htmlToMarkdown, htmlToText, cleanHtml } from 'html-content-processor';

// Convert HTML to Markdown
const markdown = await htmlToMarkdown('<h1>Hello</h1><p>World</p>');

// Convert HTML to plain text
const text = await htmlToText('<h1>Hello</h1><p>World</p>');

// Clean HTML content
const clean = await cleanHtml('<div>Content</div><script>ads</script>');

Automatic Page Type Detection (Recommended)

The library can automatically detect page types and apply optimal filtering strategies:

import { htmlToMarkdownAuto, cleanHtmlAuto, extractContentAuto } from 'html-content-processor';

// Automatic detection with URL context
const markdown = await htmlToMarkdownAuto(html, 'https://example.com/blog-post');

// Clean HTML with automatic page type detection
// (named `cleaned` so it does not shadow the library's `cleanHtml` export)
const cleaned = await cleanHtmlAuto(html, 'https://news.example.com/article');

// Extract content with detailed page type information
const result = await extractContentAuto(html, 'https://docs.example.com/guide');
console.log('Detected page type:', result.pageType.type);
console.log('Confidence:', result.pageType.confidence);
console.log('Markdown:', result.markdown.content);

HtmlProcessor Class (Advanced Usage)

import { HtmlProcessor } from 'html-content-processor';

// Method chaining
const result = await HtmlProcessor
  .from(html)
  .withBaseUrl('https://example.com')
  .withAutoDetection() // Enable automatic page type detection
  .filter()
  .toMarkdown();

// Manual page type setting
const processor = await HtmlProcessor
  .from(html)
  .withPageType('blog') // Manually set page type
  .filter();

const markdown = await processor.toMarkdown();

Content-Specific Presets

import { 
  htmlToArticleMarkdown, 
  htmlToBlogMarkdown, 
  htmlToNewsMarkdown 
} from 'html-content-processor';

// Optimized for different content types
const articleMd = await htmlToArticleMarkdown(html, baseUrl);
const blogMd = await htmlToBlogMarkdown(html, baseUrl);
const newsMd = await htmlToNewsMarkdown(html, baseUrl);

API Reference

Core Functions

| Function | Description | Return Type |
|----------|-------------|-------------|
| htmlToMarkdown(html, options?) | Convert HTML to Markdown | Promise<string> |
| htmlToMarkdownWithCitations(html, baseUrl?, options?) | Convert HTML to Markdown with citations | Promise<string> |
| htmlToText(html, options?) | Convert HTML to plain text | Promise<string> |
| cleanHtml(html, options?) | Clean HTML content | Promise<string> |
| extractContent(html, options?) | Extract content fragments | Promise<string[]> |

Automatic Detection Functions

| Function | Description | Benefits |
|----------|-------------|----------|
| htmlToMarkdownAuto(html, url?, options?) | Auto-detect page type and convert to Markdown | Optimal filtering for each page type |
| cleanHtmlAuto(html, url?, options?) | Auto-detect page type and clean HTML | Smart noise removal |
| extractContentAuto(html, url?, options?) | Auto-detect and extract with detailed results | Comprehensive page analysis |

Example: Using Auto-Detection

// Blog post detection
const blogResult = await htmlToMarkdownAuto(html, 'https://medium.com/@user/post');
// Automatically applies blog-optimized filtering

// News article detection  
const newsResult = await htmlToMarkdownAuto(html, 'https://cnn.com/article');
// Automatically applies news-optimized filtering

// Documentation detection
const docsResult = await htmlToMarkdownAuto(html, 'https://docs.react.dev/guide');
// Automatically applies documentation-optimized filtering

// Search engine results detection
const searchResult = await htmlToMarkdownAuto(html, 'https://google.com/search?q=query');
// Automatically applies search-results-optimized filtering

Content-Specific Presets

| Function | Optimized For |
|----------|---------------|
| htmlToArticleMarkdown() | Long-form articles |
| htmlToBlogMarkdown() | Blog posts |
| htmlToNewsMarkdown() | News articles |
| strictCleanHtml() | Aggressive cleaning |
| gentleCleanHtml() | Conservative cleaning |

HtmlProcessor Class

// Create processor
const processor = HtmlProcessor.from(html, options);

// Configuration methods
processor.withBaseUrl(url)           // Set base URL
processor.withOptions(options)       // Update options
processor.withAutoDetection(url?)    // Enable auto-detection
processor.withPageType(type)         // Manually set page type

// Processing methods
await processor.filter(options?)     // Apply filtering
await processor.toMarkdown(options?) // Convert to Markdown
await processor.toText()             // Convert to plain text
await processor.toArray()            // Convert to fragment array
processor.toString()                 // Get cleaned HTML

// Information methods
processor.getOptions()               // Get current options
processor.isProcessed()              // Check if processed
processor.getPageTypeResult()        // Get page type detection result

Configuration Options

Filter Options (FilterOptions)

{
  threshold?: number;           // Filtering threshold (default: 2)
  strategy?: 'fixed' | 'dynamic'; // Filtering strategy (default: 'dynamic')
  ratio?: number;              // Text density ratio (default: 0.48)
  minWords?: number;           // Minimum word count (default: 0)
  preserveStructure?: boolean; // Preserve structure (default: false)
  keepElements?: string[];     // Elements to keep
  removeElements?: string[];   // Elements to remove
}
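
The `ratio` and `minWords` options suggest a text-density style of filtering. The README does not describe the library's actual algorithm, so the following is only a minimal sketch of the general idea. The helpers `textDensity` and `filterByDensity` are hypothetical names, and the regex-based tag stripping stands in for real DOM parsing.

```typescript
// Hypothetical helpers, not part of html-content-processor's documented API.

// Text density = characters of visible text / total characters in the fragment.
function textDensity(fragment: string): number {
  if (fragment.length === 0) return 0;
  // Naive tag stripping; a real implementation would walk a parsed DOM.
  const text = fragment.replace(/<[^>]*>/g, "").trim();
  return text.length / fragment.length;
}

// Keep fragments that are dense enough and contain enough words.
// The 0.48 and 0 defaults mirror the documented FilterOptions defaults.
function filterByDensity(
  fragments: string[],
  ratio: number = 0.48,
  minWords: number = 0
): string[] {
  return fragments.filter((fragment) => {
    const words = fragment
      .replace(/<[^>]*>/g, " ")
      .split(/\s+/)
      .filter((word) => word.length > 0);
    return textDensity(fragment) >= ratio && words.length >= minWords;
  });
}

const fragments: string[] = [
  "<p>A paragraph with plenty of readable text in it.</p>",
  '<div class="nav"><a href="/a"></a><a href="/b"></a></div>',
];

// The markup-only navigation block has a density of 0 and is dropped.
const kept = filterByDensity(fragments);
console.log(kept.length);
```

In a scheme like this, raising `ratio` removes more markup-heavy blocks such as navigation bars, at the risk of also dropping short but legitimate content.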

Convert Options (ConvertOptions)

{
  citations?: boolean;         // Generate citations (default: true)
  ignoreLinks?: boolean;       // Ignore links (default: false)
  ignoreImages?: boolean;      // Ignore images (default: false)
  baseUrl?: string;           // Base URL
  threshold?: number;         // Filter threshold
  strategy?: 'fixed' | 'dynamic'; // Filter strategy
  ratio?: number;             // Text density ratio
}
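
The README does not show what output `citations: true` produces. In many HTML-to-Markdown tools, "citations" means replacing inline links with numbered markers and appending a reference list. The sketch below illustrates that convention only; `toCitations` is a hypothetical function and its output format is an assumption, not the library's documented behavior.

```typescript
// Hypothetical illustration of a "citations" style; not the library's code.
// Rewrites [text](url) links as text[n] and appends a numbered reference list.
function toCitations(markdown: string): string {
  const refs: string[] = [];
  const linkPattern = /(!?)\[([^\]]+)\]\(([^)\s]+)\)/g;

  const body = markdown.replace(
    linkPattern,
    (match: string, bang: string, text: string, url: string): string => {
      // Leave images (![alt](src)) untouched.
      if (bang === "!") return match;
      // Reuse the same number when a URL is cited more than once.
      let index = refs.indexOf(url);
      if (index === -1) {
        refs.push(url);
        index = refs.length - 1;
      }
      return `${text}[${index + 1}]`;
    }
  );

  if (refs.length === 0) return body;
  const list = refs.map((url, i) => `[${i + 1}] ${url}`).join("\n");
  return `${body}\n\nReferences:\n${list}`;
}

const cited = toCitations(
  "See [docs](https://a.example) and [more](https://b.example), again [docs](https://a.example)."
);
console.log(cited);
```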

Automatic Page Type Detection

The library automatically detects and optimizes for these page types:

  • search-engine - Search engine result pages
  • blog - Blog posts and personal articles
  • news - News articles and journalism
  • documentation - Technical documentation
  • e-commerce - E-commerce and product pages
  • social-media - Social media content
  • forum - Forum discussions and Q&A
  • article - General articles and content
  • landing-page - Marketing and landing pages
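
The README does not spell out which signals drive detection. As a rough sketch of how URL-based hints could map onto some of the page types listed above, the example below scores a URL against a few patterns. The rules, the weights, and the `guessPageType` function are all hypothetical, and the library presumably also inspects the HTML itself.

```typescript
// Hypothetical URL heuristics; not the library's actual detection logic.
type PageType = "search-engine" | "documentation" | "blog" | "news" | "article";

interface PageTypeGuess {
  type: PageType;
  confidence: number; // 0..1
  reasons: string[];
}

interface UrlRule {
  type: PageType;
  pattern: RegExp;
  weight: number;
}

// Illustrative rules and weights only.
const urlRules: UrlRule[] = [
  { type: "search-engine", pattern: /\/search\?/, weight: 0.5 },
  { type: "search-engine", pattern: /[?&]q=/, weight: 0.4 },
  { type: "documentation", pattern: /\/\/docs\.|\/docs\//, weight: 0.8 },
  { type: "blog", pattern: /\/\/blog\.|\/blog\//, weight: 0.8 },
  { type: "news", pattern: /\/\/news\.|\/news\//, weight: 0.8 },
];

function guessPageType(url: string): PageTypeGuess {
  // Fall back to the generic "article" type when nothing matches.
  let best: PageTypeGuess = {
    type: "article",
    confidence: 0,
    reasons: ["no URL rule matched"],
  };
  const candidates: PageType[] = ["search-engine", "documentation", "blog", "news"];

  for (const type of candidates) {
    const matched = urlRules.filter(
      (rule) => rule.type === type && rule.pattern.test(url)
    );
    if (matched.length === 0) continue;
    // Sum the weights of all matching rules, capped at 1.
    const confidence = Math.min(
      1,
      matched.reduce((sum, rule) => sum + rule.weight, 0)
    );
    if (confidence > best.confidence) {
      best = {
        type,
        confidence,
        reasons: matched.map((rule) => `URL matches ${rule.pattern.source}`),
      };
    }
  }
  return best;
}

console.log(guessPageType("https://google.com/search?q=query"));
console.log(guessPageType("https://docs.example.com/guide"));
```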

How Auto-Detection Works

import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(html, url);

console.log('Page Type:', result.pageType.type);
console.log('Confidence:', (result.pageType.confidence * 100).toFixed(1) + '%');
console.log('Detection Reasons:', result.pageType.reasons);
console.log('Applied Filter Options:', result.pageType.filterOptions);
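
When consuming this result in typed code, it can help to pin down the shape you rely on. The interfaces below are inferred only from the fields used in this README. The names `PageTypeInfo` and `AutoExtractResult`, the `string[]` type for `reasons`, and the `summarize` helper are assumptions, and the library may export its own types with more fields.

```typescript
// Shapes inferred from the README examples; names and field types are assumptions.
interface PageTypeInfo {
  type: string;
  confidence: number; // 0..1, as implied by the percentage formatting above
  reasons: string[]; // assumed to be a list of human-readable strings
  filterOptions: Record<string, unknown>;
}

interface AutoExtractResult {
  pageType: PageTypeInfo;
  markdown: { content: string };
}

// Hypothetical helper: report the detected type and flag low-confidence results.
function summarize(result: AutoExtractResult, minConfidence: number = 0.5): string {
  const pct = (result.pageType.confidence * 100).toFixed(1) + "%";
  const verdict =
    result.pageType.confidence >= minConfidence ? "accepted" : "low confidence";
  return `${result.pageType.type} (${pct}, ${verdict})`;
}

// Hand-written sample data, not real library output.
const sample: AutoExtractResult = {
  pageType: {
    type: "blog",
    confidence: 0.875,
    reasons: ["URL contains /blog/"],
    filterOptions: { ratio: 0.48 },
  },
  markdown: { content: "# Hello" },
};

console.log(summarize(sample));
```

Gating on `confidence` like this lets a caller fall back to a manually chosen preset when detection is uncertain.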

Environment Support

Node.js

npm install jsdom  # Recommended for best performance

Browser

Works directly in the browser; no additional dependencies are required.

CDN

<script src="https://unpkg.com/html-content-processor"></script>
<script>
  // Global variable: window.htmlFilter
  const html = document.documentElement.outerHTML; // HTML of the current page
  htmlFilter.htmlToMarkdown(html).then(console.log);
  
  // Auto-detection example
  htmlFilter.htmlToMarkdownAuto(html, window.location.href).then(result => {
    console.log('Auto-detected content:', result);
  });
</script>

Real-World Examples

Web Scraping with Auto-Detection

import { htmlToMarkdownAuto } from 'html-content-processor';

// Scrape and convert blog post
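const response = await fetch('https://blog.example.com/post-123');
// Stop early on HTTP errors so an error page is not converted by mistake
if (!response.ok) throw new Error(`Request failed: ${response.status}`);
const html = await response.text();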
const markdown = await htmlToMarkdownAuto(html, response.url);
// Automatically detects it's a blog and applies blog-specific filtering

News Article Processing

import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(newsHtml, 'https://news.site.com/article');
if (result.pageType.type === 'news') {
  console.log('High-quality news content extracted');
  console.log('Confidence:', result.pageType.confidence);
}

Documentation Conversion

import { htmlToMarkdownAuto } from 'html-content-processor';

// Convert technical documentation
const docMarkdown = await htmlToMarkdownAuto(docsHtml, 'https://docs.example.com/api');
// Automatically preserves code blocks, headers, and technical content structure

Performance

  • Fast Processing: Optimized algorithms for quick content extraction
  • 💾 Memory Efficient: Minimal memory footprint
  • 🔄 Batch Processing: Handle multiple documents efficiently
  • 📊 Smart Caching: Automatic page type detection caching
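
The README does not describe how detection caching is implemented. As a minimal sketch of the general idea, the example below memoizes a detector by URL so that repeated URLs in a batch are analyzed only once. `memoizeByUrl` and `slowDetect` are hypothetical stand-ins, not library functions.

```typescript
// Hypothetical caching sketch; not the library's implementation.

// Wrap a detector so each distinct URL is analyzed at most once.
// Note: a detector that can return `undefined` would not be cached by this check.
function memoizeByUrl<T>(detect: (url: string) => T): (url: string) => T {
  const cache = new Map<string, T>();
  return (url: string): T => {
    const hit = cache.get(url);
    if (hit !== undefined) return hit;
    const value = detect(url);
    cache.set(url, value);
    return value;
  };
}

// Stand-in detector that counts how often it actually runs.
let detectorRuns = 0;
function slowDetect(url: string): string {
  detectorRuns += 1;
  return url.indexOf("/blog/") !== -1 ? "blog" : "article";
}

const cachedDetect = memoizeByUrl(slowDetect);

const batch: string[] = [
  "https://example.com/blog/a",
  "https://example.com/blog/a", // duplicate, served from the cache
  "https://example.com/about",
];

const detected = batch.map((url) => cachedDetect(url));
console.log(detected, detectorRuns);
```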

License

MIT License