@kreuzberg/html-to-markdown-node

v2.25.1

Published

7 days ago

High-performance HTML to Markdown converter - Node.js native bindings

nhirschfeld

html markdown converter rust napi native

npm package: @kreuzberg/html-to-markdown-node (this README). Use @kreuzberg/html-to-markdown-wasm for the portable WASM build.

Native Node.js and Bun bindings for html-to-markdown using NAPI-RS v3.

Built on the shared Rust engine that powers the Python wheels, Ruby gem, PHP extension, WebAssembly package, and CLI – ensuring identical Markdown output across every language target.

High-performance HTML to Markdown conversion using native Rust code compiled to platform-specific binaries.

Migration Guide (v2.18.x → v2.19.0)

⚠️ BREAKING CHANGE: Package Namespace Update
In v2.19.0, the npm package namespace changed from html-to-markdown-node to @kreuzberg/html-to-markdown-node to reflect the new Kreuzberg.dev organization.

Install Updated Package

Before (v2.18.x):

npm install html-to-markdown-node

After (v2.19.0+):

npm install @kreuzberg/html-to-markdown-node

Update Import Statements

Before:

import { convert } from 'html-to-markdown-node';

After:

import { convert } from '@kreuzberg/html-to-markdown-node';

Summary of Changes

Package renamed from html-to-markdown-node to @kreuzberg/html-to-markdown-node
All APIs remain identical
Full backward compatibility after updating package name and imports
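
For larger codebases, the rename can be applied mechanically. The following is a minimal sketch, not something shipped with the package: `migrateImports` is a hypothetical helper that rewrites quoted module specifiers in a source string and leaves already-scoped names alone.

```javascript
// Hypothetical helper, not part of the package: rewrites module specifiers
// in a source string from the old unscoped name to the scoped one.
const OLD_NAME = 'html-to-markdown-node';
const NEW_NAME = '@kreuzberg/html-to-markdown-node';

function migrateImports(source) {
  // Match the old name only when it sits directly between matching quotes,
  // so the already-scoped name (quote followed by "@") is left untouched.
  const pattern = new RegExp(`(['"])${OLD_NAME}\\1`, 'g');
  return source.replace(pattern, (_match, quote) => `${quote}${NEW_NAME}${quote}`);
}

const before = "import { convert } from 'html-to-markdown-node';";
console.log(migrateImports(before));
// import { convert } from '@kreuzberg/html-to-markdown-node';
```

Because the scoped name never matches the pattern, running the helper twice over the same file is harmless.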

Performance

Native NAPI-RS bindings deliver the fastest HTML to Markdown conversion available in JavaScript.

Benchmark Results (Apple M4)

| Document Type          | ops/sec | Notes              |
| ---------------------- | ------- | ------------------ |
| Small (5 paragraphs)   | 86,233  | Simple documents   |
| Medium (25 paragraphs) | 18,979  | Nested formatting  |
| Large (100 paragraphs) | 4,907   | Complex structures |
| Tables (20 tables)     | 5,003   | Table processing   |
| Lists (500 items)      | 1,819   | Nested lists       |
| Wikipedia (129KB)      | 1,125   | Real-world content |
| Wikipedia (653KB)      | 156     | Large documents    |

Average: ~16,889 ops/sec across varied workloads (arithmetic mean of the seven rows above).

Comparison

vs WASM: ~1.17× faster (native has zero startup time, direct memory access)
vs Python: ~7.4× faster (avoids FFI overhead)
Best for: Node.js and Bun server-side applications requiring maximum throughput

Benchmark Fixtures (Apple M4)

The shared benchmark harness lives in tools/benchmark-harness. Node keeps pace with the Rust CLI across the board:

| Document             | Size   | ops/sec (Node) |
| -------------------- | ------ | -------------- |
| Lists (Timeline)     | 129 KB | 3,137          |
| Tables (Countries)   | 360 KB | 932            |
| Medium (Python)      | 657 KB | 460            |
| Large (Rust)         | 567 KB | 554            |
| Small (Intro)        | 463 KB | 627            |
| hOCR German PDF      | 44 KB  | 8,724          |
| hOCR Invoice         | 4 KB   | 96,138         |
| hOCR Embedded Tables | 37 KB  | 9,591          |

Run task bench:harness -- --frameworks node to regenerate these numbers.
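
For a quick local sanity check without the full harness, an ops/sec figure can be approximated with a timed loop. This is only an illustrative sketch and is not how `tools/benchmark-harness` works; `measureOpsPerSec` and the stand-in `fakeConvert` are hypothetical names, and in real use you would pass `convert` from the package instead.

```javascript
// Illustrative micro-benchmark, not the project's tools/benchmark-harness.
// `fn` is any synchronous converter, e.g. `convert` from the package.
function measureOpsPerSec(fn, input, durationMs = 200) {
  // Warm up so JIT compilation does not dominate the timed window.
  for (let i = 0; i < 10; i += 1) fn(input);

  let ops = 0;
  let elapsed = 0;
  const start = performance.now();
  while (elapsed < durationMs) {
    fn(input);
    ops += 1;
    elapsed = performance.now() - start;
  }
  return (ops / elapsed) * 1000; // operations per second
}

// Stand-in converter so the sketch runs without the native module installed.
const fakeConvert = (html) => html.replace(/<[^>]+>/g, '');
const rate = measureOpsPerSec(fakeConvert, '<p>Hello</p>', 50);
console.log(`${Math.round(rate)} ops/sec`);
```

Numbers from a loop like this vary with machine load and input, so treat them as rough indicators rather than as comparable to the tables above.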

Installation

Node.js

npm install @kreuzberg/html-to-markdown-node
# or
yarn add @kreuzberg/html-to-markdown-node
# or
pnpm add @kreuzberg/html-to-markdown-node

Bun

bun add @kreuzberg/html-to-markdown-node

Usage

Basic Conversion

import { convert } from '@kreuzberg/html-to-markdown-node';

const html = '<h1>Hello World</h1><p>This is <strong>fast</strong>!</p>';
const markdown = convert(html);
console.log(markdown);
// # Hello World
//
// This is **fast**!

With Options

import { convert } from '@kreuzberg/html-to-markdown-node';

const html = '<h1>Hello World</h1><p>This is <strong>fast</strong>!</p>';

const markdown = convert(html, {
  headingStyle: 'Atx',
  codeBlockStyle: 'Backticks',
  listIndentWidth: 2,
  bullets: '-',
  wrap: true,
  wrapWidth: 80
});

Preserve Complex HTML (NEW in v2.5)

import { convert } from '@kreuzberg/html-to-markdown-node';

const html = `
<h1>Report</h1>
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>Foo</td><td>Bar</td></tr>
</table>
`;

const markdown = convert(html, {
  preserveTags: ['table'] // Keep tables as HTML
});
// # Report
//
// <table>
//   <tr><th>Name</th><th>Value</th></tr>
//   <tr><td>Foo</td><td>Bar</td></tr>
// </table>

TypeScript

Full TypeScript definitions included:

import { convert, convertWithInlineImages, type JsConversionOptions } from '@kreuzberg/html-to-markdown-node';

const options: JsConversionOptions = {
  headingStyle: 'Atx',
  codeBlockStyle: 'Backticks',
  listIndentWidth: 2,
  bullets: '-',
  wrap: true,
  wrapWidth: 80
};

const markdown = convert('<h1>Hello</h1>', options);

Reusing Parsed Options

Avoid re-parsing the same options object on every call (benchmarks, tight render loops) by creating a reusable handle:

import {
  createConversionOptionsHandle,
  convertWithOptionsHandle,
} from '@kreuzberg/html-to-markdown-node';

const handle = createConversionOptionsHandle({ hocrSpatialTables: false });
const markdown = convertWithOptionsHandle('<h1>Handles</h1>', handle);
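
The handle pays off when many documents share one configuration. The sketch below wraps that pattern in a hypothetical `createBatchConverter` helper; the `api` argument stands for the imported module, and a stub is used here so the example runs without the native binary.

```javascript
// Sketch: `api` is expected to expose createConversionOptionsHandle and
// convertWithOptionsHandle, e.g. the module imported from
// '@kreuzberg/html-to-markdown-node'. createBatchConverter is hypothetical.
function createBatchConverter(api, options) {
  // Parse the options once, outside the per-document loop.
  const handle = api.createConversionOptionsHandle(options);
  return (documents) =>
    documents.map((html) => api.convertWithOptionsHandle(html, handle));
}

// Stub API so the example runs standalone; swap in the real module in practice.
let handlesCreated = 0;
const stubApi = {
  createConversionOptionsHandle: (opts) => {
    handlesCreated += 1;
    return { opts };
  },
  convertWithOptionsHandle: (html, handle) => `[${handle.opts.headingStyle}] ${html}`,
};

const convertBatch = createBatchConverter(stubApi, { headingStyle: 'Atx' });
const outputs = convertBatch(['<h1>A</h1>', '<h1>B</h1>']);
console.log(outputs.length, handlesCreated); // 2 1
```

The point of the design is that the handle is created once per configuration, not once per document.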

Zero-Copy Buffer Input

Skip the intermediate UTF-16 string allocation by feeding Buffer/Uint8Array inputs directly—handy for benchmark harnesses or when you already have raw bytes:

import {
  convertBuffer,
  convertInlineImagesBuffer,
  convertBufferWithOptionsHandle,
  createConversionOptionsHandle,
} from '@kreuzberg/html-to-markdown-node';
import { readFileSync } from 'node:fs';

const html = readFileSync('fixtures/lists.html'); // Buffer
const markdown = convertBuffer(html);

const handle = createConversionOptionsHandle({ headingStyle: 'Atx' });
const markdownFromHandle = convertBufferWithOptionsHandle(html, handle);

// Inline images work too:
const extraction = convertInlineImagesBuffer(html, null, {
  maxDecodedSizeBytes: 5 * 1024 * 1024,
});
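
When input may arrive either as a string or as raw bytes, a small dispatcher keeps call sites uniform. This is a sketch under assumptions: `convertAny` is a hypothetical helper, `api` stands for the imported module, and only the single-argument forms of `convert` and `convertBuffer` shown above are used.

```javascript
// Hypothetical dispatcher; `api` stands for the imported module and must
// provide convert (string input) and convertBuffer (Buffer/Uint8Array input).
function convertAny(api, input) {
  if (typeof input === 'string') {
    return api.convert(input);
  }
  if (input instanceof Uint8Array) {
    // Node's Buffer is a Uint8Array subclass, so this branch covers both.
    return api.convertBuffer(input);
  }
  throw new TypeError('Expected a string, Buffer, or Uint8Array');
}

// Stub API so the sketch runs without the native module installed.
const stubApi = {
  convert: (text) => `string:${text.length}`,
  convertBuffer: (bytes) => `bytes:${bytes.length}`,
};

console.log(convertAny(stubApi, '<p>hi</p>')); // string:9
console.log(convertAny(stubApi, new TextEncoder().encode('<p>hi</p>'))); // bytes:9
```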

Inline Images

Extract and decode inline images (data URIs, SVG):

import { writeFileSync } from 'node:fs';
import { convertWithInlineImages } from '@kreuzberg/html-to-markdown-node';

const html = '<img src="data:image/png;base64,iVBORw0..." alt="Logo">';

const result = convertWithInlineImages(html, null, {
  maxDecodedSizeBytes: 5 * 1024 * 1024, // 5MB
  inferDimensions: true,
  filenamePrefix: 'img_',
  captureSvg: true
});

console.log(result.markdown);
console.log(`Extracted ${result.inlineImages.length} images`);

for (const img of result.inlineImages) {
  console.log(`${img.filename}: ${img.format}, ${img.data.length} bytes`);
  // Save image data to disk
  writeFileSync(img.filename, img.data);
}
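
Before writing anything to disk it can help to see what was extracted. The following sketch groups images by format and totals their sizes; `summarizeImages` is a hypothetical helper that relies only on the `format` and `data` fields used in the loop above, and the sample records are made up for demonstration.

```javascript
// Hypothetical helper: groups extracted images by format and sums their sizes.
function summarizeImages(inlineImages) {
  const summary = {};
  for (const img of inlineImages) {
    const entry = summary[img.format] ?? { count: 0, bytes: 0 };
    entry.count += 1;
    entry.bytes += img.data.length;
    summary[img.format] = entry;
  }
  return summary;
}

// Fake records shaped like result.inlineImages, for demonstration only.
const sampleImages = [
  { filename: 'img_1.png', format: 'png', data: new Uint8Array(120) },
  { filename: 'img_2.png', format: 'png', data: new Uint8Array(80) },
  { filename: 'img_3.svg', format: 'svg', data: new Uint8Array(300) },
];

const imageSummary = summarizeImages(sampleImages);
console.log(imageSummary.png.count, imageSummary.png.bytes); // 2 200
console.log(imageSummary.svg.count, imageSummary.svg.bytes); // 1 300
```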

Supported Platforms

Pre-built native binaries are provided for:

| Platform | Architectures                                       |
| -------- | --------------------------------------------------- |
| macOS    | x64 (Intel), ARM64 (Apple Silicon)                  |
| Linux    | x64 (glibc/musl), ARM64 (glibc/musl), ARMv7 (glibc) |
| Windows  | x64, ARM64                                          |
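
The platform table can be expressed as a lookup for deployment checks. This is a sketch under assumptions: `hasPrebuiltBinary` is a hypothetical helper, the keys follow the strings Node.js reports in `process.platform` and `process.arch`, and treating ARMv7 as the `arm` arch string is a simplification.

```javascript
// Support matrix transcribed from the platform table, keyed by the values
// Node reports in process.platform and process.arch. Mapping "ARMv7" to the
// 'arm' arch string and the libc labels are assumptions of this sketch.
const SUPPORTED = {
  darwin: { x64: ['default'], arm64: ['default'] },
  linux: { x64: ['glibc', 'musl'], arm64: ['glibc', 'musl'], arm: ['glibc'] },
  win32: { x64: ['default'], arm64: ['default'] },
};

function hasPrebuiltBinary(platform, arch, libc = 'default') {
  const variants = SUPPORTED[platform]?.[arch];
  return Array.isArray(variants) && variants.includes(libc);
}

console.log(hasPrebuiltBinary('darwin', 'arm64')); // true
console.log(hasPrebuiltBinary('linux', 'arm', 'musl')); // false
```

On Linux the caller has to supply the libc flavour, since Node.js does not expose it as a simple property.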

Runtime Compatibility

✅ Node.js 18+ (LTS)
✅ Bun 1.0+ (full NAPI-RS support)
❌ Deno (use @kreuzberg/html-to-markdown-wasm instead)

When to Use

Choose @kreuzberg/html-to-markdown-node when:

✅ Running in Node.js or Bun
✅ Maximum performance is required
✅ Server-side conversion at scale

Use @kreuzberg/html-to-markdown-wasm for:

🌐 Browser/client-side conversion
🦕 Deno runtime
☁️ Edge runtimes (Cloudflare Workers, Deno Deploy)
📦 Universal packages
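
The decision above can be sketched as a runtime check. `pickPackage` is a hypothetical helper, not something either package exports; it inspects runtime globals and falls back to the WASM build for anything it does not recognise, which is an assumption of this sketch. It does not check the Bun version.

```javascript
const NATIVE_PACKAGE = '@kreuzberg/html-to-markdown-node';
const WASM_PACKAGE = '@kreuzberg/html-to-markdown-wasm';

// Hypothetical helper: picks a package name from runtime globals.
// `env` defaults to globalThis; pass a plain object to simulate other runtimes.
function pickPackage(env = globalThis) {
  // Deno is checked first because it can also expose a Node-style `process`.
  if (typeof env.Deno !== 'undefined') return WASM_PACKAGE;
  if (typeof env.Bun !== 'undefined') return NATIVE_PACKAGE;
  const nodeVersion = env.process?.versions?.node;
  if (typeof nodeVersion === 'string') {
    const major = Number.parseInt(nodeVersion.split('.')[0], 10);
    if (major >= 18) return NATIVE_PACKAGE; // Node.js 18+
  }
  // Browsers, edge runtimes, and anything unrecognised: portable build.
  return WASM_PACKAGE;
}

console.log(pickPackage({ Deno: {} })); // @kreuzberg/html-to-markdown-wasm
console.log(pickPackage({ process: { versions: { node: '20.11.0' } } })); // @kreuzberg/html-to-markdown-node
```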

Configuration Options

See ConversionOptions for all available options including:

Heading styles (ATX, underlined, ATX closed)
Code block styles (indented, backticks, tildes)
List formatting (indent width, bullet characters)
Text escaping and formatting
Tag preservation (preserveTags) and stripping (stripTags)
Preprocessing for web scraping
hOCR table extraction
And more...

Examples

Preserving HTML Tags

Keep specific HTML tags in their original form instead of converting to Markdown:

import { convert } from '@kreuzberg/html-to-markdown-node';

const html = `
<p>Before table</p>
<table class="data">
    <tr><th>Name</th><th>Value</th></tr>
    <tr><td>Item 1</td><td>100</td></tr>
</table>
<p>After table</p>
`;

const markdown = convert(html, {
  preserveTags: ['table']
});

// Result includes the table as HTML:
// "Before table\n\n<table class=\"data\">...</table>\n\nAfter table\n"

Combine with stripTags for fine-grained control:

const markdown = convert(html, {
  preserveTags: ['table', 'form'],  // Keep these as HTML
  stripTags: ['script', 'style']    // Remove these entirely
});
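
When both lists are built dynamically, it may be worth checking that no tag appears in both. The sketch below is a hypothetical pre-flight check; how the converter itself resolves such an overlap is not stated in this README, so the helper only reports conflicts.

```javascript
// Hypothetical pre-flight check: reports tag names listed in both arrays.
// Comparison is case-insensitive because HTML tag names are.
function findTagConflicts(preserveTags = [], stripTags = []) {
  const strip = new Set(stripTags.map((tag) => tag.toLowerCase()));
  return preserveTags.filter((tag) => strip.has(tag.toLowerCase()));
}

console.log(findTagConflicts(['table', 'form'], ['script', 'style'])); // []
console.log(findTagConflicts(['table', 'Form'], ['form', 'style'])); // [ 'Form' ]
```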

Web Scraping

import { convert } from '@kreuzberg/html-to-markdown-node';

// Top-level await requires an ES module, hence `import` rather than `require`.
const scrapedHtml = await fetch('https://example.com').then(r => r.text());

const markdown = convert(scrapedHtml, {
  preprocessing: {
    enabled: true,
    preset: 'Aggressive',
    removeNavigation: true,
    removeForms: true
  },
  headingStyle: 'Atx',
  codeBlockStyle: 'Backticks'
});
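
In a real scraper, failed requests should not be converted as if they were content. The following sketch adds a status check and takes its dependencies as arguments so it can be exercised without network access; `scrapeToMarkdown` and its parameter names are hypothetical.

```javascript
// Sketch with injected dependencies: `fetchImpl` is any fetch-compatible
// function and `convert` is the converter (e.g. from the package).
async function scrapeToMarkdown({ fetchImpl, convert }, url, options) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    // Avoid turning error pages into Markdown.
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  return convert(await response.text(), options);
}

// Stand-ins so the example runs offline; use the global fetch and the real
// convert function in practice.
const fakeFetch = async () => ({ ok: true, status: 200, text: async () => '<p>Hi</p>' });
const fakeConvert = (html) => html.replace(/<[^>]+>/g, '');

scrapeToMarkdown({ fetchImpl: fakeFetch, convert: fakeConvert }, 'https://example.com')
  .then((markdown) => console.log(markdown)); // Hi
```

With the real module, the call would pass `fetch` and `convert` along with the preprocessing options shown above.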

hOCR Document Processing

const { convert } = require('@kreuzberg/html-to-markdown-node');
const fs = require('fs');

// OCR output from Tesseract in hOCR format
const hocrHtml = fs.readFileSync('scan.hocr', 'utf8');

// Automatically detects hOCR and reconstructs tables
const markdown = convert(hocrHtml, {
  hocrSpatialTables: true  // Enable spatial table reconstruction
});
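
If a pipeline mixes ordinary HTML with OCR output, a cheap pre-check can decide when to pass `hocrSpatialTables`. This is a rough heuristic sketch and not the library's own detection: `looksLikeHocr` is hypothetical and only looks for the `ocr_page` class or an `ocr-system` meta tag, both of which come from the hOCR format.

```javascript
// Hypothetical heuristic, independent of the converter's built-in detection.
function looksLikeHocr(html) {
  // hOCR pages are marked with class="ocr_page".
  const hasOcrPage = /class\s*=\s*["'][^"']*\bocr_page\b/i.test(html);
  // hOCR files usually declare the producing engine in a meta tag.
  const hasOcrMeta = /<meta[^>]+name\s*=\s*["']ocr-system["']/i.test(html);
  return hasOcrPage || hasOcrMeta;
}

const scan = "<div class='ocr_page' id='page_1'><span class='ocr_line'>Total</span></div>";
const options = looksLikeHocr(scan) ? { hocrSpatialTables: true } : {};
console.log(options); // { hocrSpatialTables: true }
```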

License

MIT
