h2m-parser

v1.0.0

Published

4 months ago

LLM-ready HTML to Markdown pipeline with Readability, htmlparser2, and post-processing utilities.

h2m-parser

LLM-friendly HTML → Markdown parser with Readability extraction, a streaming renderer, and opinionated post-processing.

Why h2m-parser?

Article aware – runs Mozilla Readability atop Linkedom for fast, script-free DOM extraction.
Deterministic Markdown – single-pass htmlparser2 renderer with stable spacing, link styling, figures, and GFM-friendly tables.
Built for pipelines – YAML front matter, optional chunking, content hashing, and NDJSON transform helpers.
Customisable – per-tag translators, ignore/block lists, regex replacements, and telemetry hooks.
DX-first – TypeScript types, Biome lint/format, Vitest coverage, tsup dual outputs, Changesets releases.

Requirements

Node.js 20.11 or newer.

Installation

bun add h2m-parser
# or
pnpm add h2m-parser
# or
npm install h2m-parser

Quick Start

Minimal conversion

import { H2MParser } from "h2m-parser";

const result = await H2MParser.processHtml(
  "<h1>Hello</h1><p>World</p>",
  "https://example.com",
);

console.log(result.markdown);

End-to-end pipeline with Readability

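import { H2MParser } from "h2m-parser";

// Placeholder input; in practice this is the fetched page source.
const articleHtml = "<article><h1>Title</h1><p>Body text.</p></article>";

const converter = new H2MParser({
  extract: { readability: true },
  markdown: { linkStyle: "inline" },
  llm: { frontMatter: true, addHash: true },
});

const result = await converter.process(articleHtml, "https://example.com");
console.log(result.markdown);
console.log(result.meta); // title, byline, lang, hash, etc.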
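
To make the `frontMatter` option more concrete, here is a minimal sketch of how metadata such as the title and language could be rendered as a YAML front matter block ahead of the Markdown body. It does not use h2m-parser; the `Meta` type and the `withFrontMatter` helper are hypothetical names introduced for illustration, and the library's actual output format may differ.

```typescript
// Hypothetical metadata shape, loosely modelled on the fields named above.
type Meta = Record<string, string | undefined>;

// Render metadata as a YAML front matter block and prepend it to the Markdown body.
function withFrontMatter(markdown: string, meta: Meta): string {
  const lines = Object.entries(meta)
    // Skip fields that were not extracted.
    .filter(([, value]) => value !== undefined && value !== "")
    // JSON.stringify yields a double-quoted string, which YAML 1.2 also accepts.
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`);
  return ["---", ...lines, "---", "", markdown].join("\n");
}

const doc = withFrontMatter("# Hello\n\nWorld", {
  title: "Hello",
  byline: undefined, // omitted from the output
  lang: "en",
});
console.log(doc);
```

Quoting every value keeps the sketch short and avoids YAML edge cases such as titles that contain a colon.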

CLI

# stdin → stdout
h2m --url https://example.com < article.html > article.md

# enable Readability extraction
h2m --readability < raw.html > main-content.md

Pipeline overview

Extract – normalise HTML with Linkedom + Readability (configurable figure retention, URL resolution, tracking-parameter stripping, data URI policy).
Convert – stream nodes through the htmlparser2-based renderer (custom translators, footnotes, reference links, table handling).
Post-process – add optional front matter, hash, chunking, and attach telemetry for observability.
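
The Extract step mentions URL resolution and tracking-parameter stripping. As a rough illustration of what that kind of clean-up involves, the sketch below uses only the standard `URL` API. The `cleanUrl` helper and the parameter list are assumptions made for this example, not the library's implementation.

```typescript
// Assumed list of tracking parameters; the library's real list may differ.
const TRACKING_PARAMS = ["fbclid", "gclid"];

// Resolve `href` against `baseUrl`, then drop utm_* and other tracking parameters.
function cleanUrl(href: string, baseUrl: string): string {
  const url = new URL(href, baseUrl);
  // Copy the keys first so that deleting entries does not disturb iteration.
  for (const key of Array.from(url.searchParams.keys())) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.includes(key)) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

console.log(cleanUrl("/post?id=7&utm_source=feed", "https://example.com/blog/"));
// → https://example.com/post?id=7
```

Non-tracking parameters such as `id` are left untouched, so links that depend on their query string keep working.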

Configuration highlights

import type { Options } from "h2m-parser";

const options: Options = {
  extract: {
    readability: true,
    resolveRelativeUrls: true,
    stripTrackingParams: true,
  },
  markdown: {
    linkStyle: "inline",
    ignoreTags: ["aside"],
    // Placeholder address used for illustration.
    textReplacements: [{ pattern: /user@example\.com/g, replacement: "[redacted]" }],
  },
  llm: {
    frontMatter: true,
    addHash: false,
    chunk: { targetTokens: 500, overlapTokens: 60 },
  },
};
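
The `chunk` option takes a target size and an overlap. The sketch below shows one simple way such overlapping chunks can be produced. It assumes that one whitespace-separated word counts as one token, which is a simplification; `chunkByTokens` is a hypothetical helper, and the library's own tokenisation and boundary rules are not documented here.

```typescript
// Split text into chunks of up to `targetTokens` tokens, repeating the last
// `overlapTokens` tokens of each chunk at the start of the next one.
// Assumption: one whitespace-separated word counts as one token.
function chunkByTokens(text: string, targetTokens: number, overlapTokens: number): string[] {
  if (overlapTokens >= targetTokens) {
    throw new RangeError("overlapTokens must be smaller than targetTokens");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = targetTokens - overlapTokens;
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + targetTokens).join(" "));
    if (start + targetTokens >= words.length) break; // this chunk reached the end
  }
  return chunks;
}

const sample = Array.from({ length: 10 }, (_, i) => `w${i}`).join(" ");
console.log(chunkByTokens(sample, 4, 1));
// → [ 'w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9' ]
```

The overlap gives each chunk a little context from its predecessor, at the cost of some repeated tokens.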

Benchmarks

Performance

Runtime ranking (lower is better):

1. mdream: 1.571ms
2. h2m-parser: 1.793ms
3. Turndown: 7.181ms
4. node-html-markdown: 132.565ms

Benchmark Methodology

Dataset: 95 files (5 synthetic + 90 real HTML documents)
Dataset path: tests/fixtures
File sizes: 21KB to 1771KB (mean: ~123KB)
Iterations: 100 per file for statistical significance
Total runtime: 675.0 seconds
Environment: Node.js with standard V8 optimizations

Average Processing Time

Tested across 95 files in tests/fixtures (up to 1771KB):

| Library | Without Readability | With Readability | Relative |
|---------|---------------------|------------------|----------|
| mdream | 1.571ms | ❌ Not supported | Fastest |
| h2m-parser ✅ | 1.793ms | 13.927ms | 1.14x slower |
| Turndown | 7.181ms | ❌ Not supported | 4.57x slower |
| node-html-markdown | 132.565ms | ❌ Not supported | 84.37x slower |

Readability overhead (h2m-parser): +12.134ms (enables article extraction + content cleaning)

Performance Analysis

Fastest baseline: mdream averages 1.571ms per document without Readability.
h2m-parser vs mdream: 1.14x slower (1.571ms → 1.793ms)
h2m-parser vs Turndown: 4.00x faster (7.181ms → 1.793ms)
h2m-parser vs node-html-markdown: 73.94x faster (132.565ms → 1.793ms)
Readability impact: 7.8x slower when enabled (1.793ms → 13.927ms)
Token savings vs raw HTML: 24051 tokens saved (95.63%) on tests/fixtures/039c4b966d1f2a0c589ac0aad211fe65500ad1cb58c7f45b34251db7056803ec.html.
Algorithmic complexity: O(n) linear scaling confirmed across file sizes
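
For readers who want to check the ratios, the sketch below recomputes them from the rounded means quoted above. Because the published multipliers were presumably derived from unrounded timings, the recomputed values can differ in the last digit. `speedup` is a helper name chosen for this example.

```typescript
// Mean runtimes in milliseconds, copied from the figures above.
const meanMs = {
  mdream: 1.571,
  "h2m-parser": 1.793,
  "h2m-parser (Readability)": 13.927,
  Turndown: 7.181,
  "node-html-markdown": 132.565,
};

// How many times longer the slower run takes than the faster one.
function speedup(slowerMs: number, fasterMs: number): number {
  return slowerMs / fasterMs;
}

const base = meanMs["h2m-parser"];
console.log(`vs mdream: ${speedup(base, meanMs.mdream).toFixed(2)}x slower`);
console.log(`vs Turndown: ${speedup(meanMs.Turndown, base).toFixed(2)}x faster`);
console.log(`vs node-html-markdown: ${speedup(meanMs["node-html-markdown"], base).toFixed(2)}x faster`);
console.log(`Readability: ${speedup(meanMs["h2m-parser (Readability)"], base).toFixed(1)}x slower`);
```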

Performance Projections

Estimated processing times for different file sizes (without Readability):

  100KB  1ms
  1MB    15ms
  10MB   150ms
  100MB  1.5s

Based on linear scaling from 123KB average file size at 1.793ms
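
The projection is plain proportional scaling, which can be sketched as follows. The example assumes 1MB = 1024KB; the results agree with the figures above to within rounding.

```typescript
// Baseline taken from the benchmark: 123KB average file at 1.793ms.
const BASE_SIZE_KB = 123;
const BASE_TIME_MS = 1.793;

// Linear projection: time grows in proportion to input size.
function projectMs(sizeKb: number): number {
  return (sizeKb / BASE_SIZE_KB) * BASE_TIME_MS;
}

// Assumption: 1MB = 1024KB.
const sizesKb: Array<[string, number]> = [
  ["100KB", 100],
  ["1MB", 1024],
  ["10MB", 10 * 1024],
  ["100MB", 100 * 1024],
];

for (const [label, sizeKb] of sizesKb) {
  console.log(`${label}: ${projectMs(sizeKb).toFixed(2)}ms`);
}
```

These are extrapolations rather than measurements, so real timings on very large inputs may deviate.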

Detailed Results by File Size

tiny (18 bytes)

| Library | Mean (ms) | P95 (ms) | P99 (ms) |
|---------|-----------|----------|----------|
| h2m-parser (no Readability) | 0.022 | 0.033 | 0.035 |
| h2m-parser (with Readability) | 0.257 | 0.376 | 0.399 |
| Turndown | 0.021 | 0.037 | 0.041 |
| node-html-markdown | 0.011 | 0.017 | 0.018 |
| mdream | 0.005 | 0.007 | 0.010 |

small (84 bytes)

| Library | Mean (ms) | P95 (ms) | P99 (ms) |
|---------|-----------|----------|----------|
| h2m-parser (no Readability) | 0.015 | 0.022 | 0.023 |
| h2m-parser (with Readability) | 0.180 | 0.262 | 0.280 |
| Turndown | 0.038 | 0.047 | 0.048 |
| node-html-markdown | 0.022 | 0.030 | 0.031 |
| mdream | 0.013 | 0.018 | 0.018 |

medium (369 bytes)

| Library | Mean (ms) | P95 (ms) | P99 (ms) |
|---------|-----------|----------|----------|
| h2m-parser (no Readability) | 0.016 | 0.020 | 0.021 |
| h2m-parser (with Readability) | 0.216 | 0.255 | 0.284 |
| Turndown | 0.046 | 0.054 | 0.056 |
| node-html-markdown | 0.019 | 0.022 | 0.025 |
| mdream | 0.022 | 0.040 | 0.040 |

file_42 (21KB)

| Library | Mean (ms) | P95 (ms) | P99 (ms) |
|---------|-----------|----------|----------|
| h2m-parser (no Readability) | 0.375 | 0.511 | 0.588 |
| h2m-parser (with Readability) | 2.208 | 3.243 | 4.079 |
| Turndown | 1.401 | 1.678 | 1.766 |
| node-html-markdown | 0.392 | 0.414 | 0.428 |
| mdream | 0.328 | 0.337 | 0.341 |

file_57 (88KB)

| Library | Mean (ms) | P95 (ms) | P99 (ms) |
|---------|-----------|----------|----------|
| h2m-parser (no Readability) | 1.116 | 1.244 | 1.270 |
| h2m-parser (with Readability) | 6.270 | 6.814 | 7.168 |
| Turndown | 4.254 | 5.300 | 5.489 |
| node-html-markdown | 2.097 | 2.328 | 2.375 |
| mdream | 1.110 | 1.200 | 1.245 |

file_91 (1771KB)

| Library | Mean (ms) | P95 (ms) | P99 (ms) |
|---------|-----------|----------|----------|
| h2m-parser (no Readability) | 40.421 | 43.418 | 43.590 |
| h2m-parser (with Readability) | 622.410 | 865.618 | 877.353 |
| Turndown | 184.309 | 189.178 | 192.011 |
| node-html-markdown | 12274.670 | 13276.637 | 13418.863 |
| mdream | 50.946 | 51.870 | 51.987 |

See bench/comparison-results.md for complete results across all 95 files

Workflow Comparison (Await vs Stream)

| Mode | Iterations | Mean (ms) | P95 (ms) | Min (ms) | Max (ms) |
|------|------------|-----------|----------|----------|----------|
| h2m-parser (await) | 10 | 12.34 | 41.79 | 7.56 | 41.79 |
| mdream (await) | 10 | 11.57 | 97.94 | 1.65 | 97.94 |
| mdream (stream) | 10 | 14.06 | 110.91 | 1.94 | 110.91 |

Token Savings

Model: gpt-4o-mini
HTML tokens: 25151
Markdown tokens: 1100
Savings: 24051 tokens (95.63%)
Estimated cost delta per document: $0.003608
Markdown length: 4869 characters
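
The savings figures follow from simple arithmetic, sketched below. The price of USD 0.15 per million input tokens is an assumption chosen because it reproduces the cost delta above; check current pricing for your model before relying on it.

```typescript
const htmlTokens = 25151;
const markdownTokens = 1100;
// Assumption: USD 0.15 per one million input tokens.
const usdPerMillionTokens = 0.15;

const savedTokens = htmlTokens - markdownTokens;
const savedPercent = (savedTokens / htmlTokens) * 100;
const costDeltaUsd = (savedTokens / 1e6) * usdPerMillionTokens;

console.log(`${savedTokens} tokens (${savedPercent.toFixed(2)}%)`); // 24051 tokens (95.63%)
console.log(`$${costDeltaUsd.toFixed(6)} per document`); // $0.003608 per document
```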

Memory Snapshot

Mode: h2m-reuse
Iterations: 10
RSS change: 41.80 MB

Bundle Size Snapshot

Generated: 2025-10-06T08:36:12.983Z

| File | Size | Gzipped | Δ Size | Δ Gzipped |
|------|------|---------|--------|-----------|
| cli.cjs | 22KB | 8KB | +0 B (+0.00%) | +0 B (+0.00%) |
| cli.mjs | 22KB | 8KB | +0 B (+0.00%) | +0 B (+0.00%) |
| index.cjs | 19KB | 7KB | +0 B (+0.00%) | +0 B (+0.00%) |
| index.mjs | 19KB | 7KB | +0 B (+0.00%) | +0 B (+0.00%) |

Live Fetch Results

Fetched: https://en.wikipedia.org/wiki/Markdown

| Tool | Mean | Min | Max |
|------|------|-----|-----|
| h2m-parser | 51.53ms | 43.44ms | 65.47ms |
| mdream (await) | 6.70ms | 3.97ms | 11.96ms |
| mdream (stream) | 13.63ms | 11.98ms | 16.39ms |

Feature Comparison

| Feature | h2m-parser | Turndown | node-html-markdown | mdream |
|---------|------------|----------|--------------------|--------|
| Performance | ⚠️ +14% slower | ❌ +357% slower | ❌ +8337% slower | ✅ Fastest |
| Readability | ✅ | ❌ | ❌ | ⚠️ |
| Link cleanup | ✅ | ❌ | ❌ | ⚠️ |
| Front matter | ✅ | ❌ | ❌ | ✅ |
| Chunking | ✅ | ❌ | ❌ | ⚠️ |
| TypeScript | ✅ | ❌ | ✅ | ✅ |
| Streaming | ✅ | ❌ | ❌ | ✅ |

Benchmark Transparency

Raw results: bench/.results/comparison-latest.json
Benchmark runner: bench/compare.js
Test dataset: tests/fixtures/ (90 real HTML files plus 5 synthetic ones)
Statistical data: Includes mean, median, P95, P99, min/max for each test
Reproducible: Run bun bench:compare:full to verify results
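
As a reading aid for the P95 and P99 columns, here is a minimal sketch of how such statistics can be computed from raw timing samples. It uses the nearest-rank percentile definition, which is an assumption; the benchmark runner may use a different method such as linear interpolation.

```typescript
// Arithmetic mean of the samples.
function mean(samples: number[]): number {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up timings in milliseconds, for illustration only.
const timingsMs = [3, 1, 5, 2, 4];
console.log(mean(timingsMs)); // 3
console.log(percentile(timingsMs, 50)); // 3
console.log(percentile(timingsMs, 95)); // 5
```

With only a handful of iterations, P95 and P99 collapse onto the maximum, which is worth keeping in mind when reading results from short runs.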

Run benchmarks yourself:

# Quick comparison (10 iterations)
bun bench:compare:quick

# Full comparison (1000 iterations)
bun bench:compare:full

# Update README with fresh results
bun bench:readme

Development

bun install
bun verify

Contributing

We welcome improvements! See CONTRIBUTING.md for:

Development setup and coding standards
Commit conventions and release workflow
Maintainer scripts and workflows
Performance baselines and troubleshooting
