@kreuzberg/html-to-markdown

v2.24.4

Published

2 hours ago

High-performance HTML to Markdown converter for TypeScript/Node.js with a Rust core.


nhirschfeld

html markdown converter rust cli napi typescript bun node

html-to-markdown

High-performance HTML to Markdown converter for Node.js and Bun with full TypeScript support. This package wraps native @kreuzberg/html-to-markdown-node bindings and provides a type-safe API.

Installation

npm install @kreuzberg/html-to-markdown

Requires Node.js 18+ or Bun. Native bindings provide superior performance.

npm:

npm install @kreuzberg/html-to-markdown

pnpm:

pnpm add @kreuzberg/html-to-markdown

yarn:

yarn add @kreuzberg/html-to-markdown

bun:

bun add @kreuzberg/html-to-markdown

Alternatively, use the WebAssembly version for browser/edge environments:

npm install @kreuzberg/html-to-markdown-wasm

Performance Snapshot

Apple M4 • Real Wikipedia documents • convert() (TypeScript (Node.js))

| Document | Size | Latency | Throughput |
| -------- | ---- | ------- | ---------- |
| Lists (Timeline) | 129KB | 0.58ms | 222 MB/s |
| Tables (Countries) | 360KB | 1.89ms | 190 MB/s |
| Mixed (Python wiki) | 656KB | 4.21ms | 156 MB/s |
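The throughput column follows directly from the size and latency columns. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 1000 KB); `throughputMBps` is an illustrative helper and not part of the package:

```typescript
// Illustrative helper (not part of the package): derive throughput from
// document size and latency. KB per ms equals MB per s when 1 MB = 1000 KB.
function throughputMBps(sizeKB: number, latencyMs: number): number {
  return sizeKB / latencyMs;
}

// Rows copied from the table above: [document, size in KB, latency in ms].
const rows: Array<[string, number, number]> = [
  ["Lists (Timeline)", 129, 0.58],
  ["Tables (Countries)", 360, 1.89],
  ["Mixed (Python wiki)", 656, 4.21],
];

for (const [name, sizeKB, latencyMs] of rows) {
  const rounded = Math.round(throughputMBps(sizeKB, latencyMs));
  console.log(`${name}: ${rounded} MB/s`); // 222, 190 and 156 MB/s respectively
}
```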

See Performance Guide for detailed benchmarks.

Quick Start

Basic conversion:

import { convert } from '@kreuzberg/html-to-markdown';

const markdown: string = convert('<h1>Hello World</h1>');
console.log(markdown); // # Hello World

With conversion options:

import { convert, type ConversionOptions } from '@kreuzberg/html-to-markdown';

const options: ConversionOptions = {
  headingStyle: 'atx',
  listIndentWidth: 2,
  wrap: true,
};

const markdown = convert('<h1>Title</h1><p>Content</p>', options);

API Reference

Core Functions

convert(html: string, options?: ConversionOptions): string

Basic HTML-to-Markdown conversion. Fast and simple.

convertWithMetadata(html: string, options?: ConversionOptions, config?: MetadataConfig): { markdown: string; metadata: Metadata }

Extract Markdown plus metadata (headers, links, images, structured data) in a single pass. See Metadata Extraction Guide.

convertWithVisitor(html: string, options: { visitor: Visitor } & ConversionOptions): string

Customize conversion with visitor callbacks for element interception. See Visitor Pattern Guide.

convertWithAsyncVisitor(html: string, options: { visitor: AsyncVisitor } & ConversionOptions): Promise<string>

Async version of visitor pattern for I/O operations.

convertWithInlineImages(html: string, config?: InlineImageConfig): { markdown: string; images: ImageData[]; warnings: string[] }

Extract base64-encoded inline images with metadata.
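For context, an inline image is an `img` whose `src` is a `data:` URI carrying a base64 payload. The sketch below only illustrates that format. `parseBase64DataUri` and `ParsedDataUri` are hypothetical names, and this is not how the library extracts images or what its `ImageData` type contains.

```typescript
// Illustrative only: shows what a base64 inline image source looks like.
// Not the library's implementation; uses no library types.
interface ParsedDataUri {
  mimeType: string;
  base64: string;
  byteLength: number;
}

function parseBase64DataUri(src: string): ParsedDataUri | null {
  // Matches e.g. "data:image/png;base64,iVBORw0KGgo="
  const match = /^data:([^;,]+);base64,([A-Za-z0-9+/]*={0,2})$/.exec(src);
  if (match === null) return null;

  const mimeType = match[1] ?? "";
  const base64 = match[2] ?? "";
  const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 1 : 0;
  // Every 4 base64 characters encode 3 bytes, minus the padding bytes.
  const byteLength = Math.floor((base64.length / 4) * 3) - padding;

  return { mimeType, base64, byteLength };
}

console.log(parseBase64DataUri("data:image/png;base64,AAAA"));
console.log(parseBase64DataUri("https://example.com/logo.png")); // null
```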

Options

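ConversionOptions – Key configuration fields (the TypeScript examples in this README pass them in camelCase, e.g. headingStyle, listIndentWidth, outputFormat):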

heading_style: Heading format ("underlined" | "atx" | "atx_closed") — default: "underlined"
list_indent_width: Spaces per indent level — default: 2
bullets: Bullet characters cycle — default: "*+-"
wrap: Enable text wrapping — default: false
wrap_width: Wrap at column — default: 80
code_language: Default fenced code block language — default: none
extract_metadata: Embed metadata as YAML frontmatter — default: false
output_format: Output markup format ("markdown" | "djot") — default: "markdown"
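To illustrate the bullets option: a bullet cycle is commonly applied by nesting depth. The sketch below assumes the cycle advances once per nesting level, which the option description suggests but does not spell out. `bulletForDepth` is a hypothetical helper, not a library function.

```typescript
// Illustrative sketch: pick a bullet for a nesting depth by cycling through
// the configured characters. Assumes depth 0 is the outermost list.
function bulletForDepth(depth: number, bullets: string = "*+-"): string {
  const chars = Array.from(bullets);
  // Fall back to "*" if an empty bullet string is passed.
  return chars[depth % chars.length] ?? "*";
}

for (let depth = 0; depth < 4; depth++) {
  // Indent two spaces per level, matching the documented default indent width.
  console.log(`${"  ".repeat(depth)}${bulletForDepth(depth)} level ${depth}`);
}
```

With the default "*+-", depths 0 through 3 get "*", "+", "-", and then "*" again.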

MetadataConfig – Selective metadata extraction:

extract_headers: h1-h6 elements — default: true
extract_links: Hyperlinks — default: true
extract_images: Image elements — default: true
extract_structured_data: JSON-LD, Microdata, RDFa — default: true
max_structured_data_size: Size limit in bytes — default: 100KB

Djot Output Format

The library supports converting HTML to Djot, a lightweight markup language similar to Markdown but with a different syntax for some elements. Set output_format to "djot" to use this format.

Syntax Differences

| Element | Markdown | Djot |
|---------|----------|------|
| Strong | **text** | *text* |
| Emphasis | *text* | _text_ |
| Strikethrough | ~~text~~ | {-text-} |
| Inserted/Added | N/A | {+text+} |
| Highlighted | N/A | {=text=} |
| Subscript | N/A | ~text~ |
| Superscript | N/A | ^text^ |
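The same table can be written as a lookup, which may help when post-processing output in either format. This is a sketch built only from the rows above. `DELIMITERS` and `wrapInline` are hypothetical names, not part of the package.

```typescript
type OutputFormat = "markdown" | "djot";
type InlineKind =
  | "strong"
  | "emphasis"
  | "strikethrough"
  | "inserted"
  | "highlighted"
  | "subscript"
  | "superscript";

// [open, close] delimiters per format, taken from the table above.
// null marks the cells listed as N/A for Markdown.
const DELIMITERS: Record<InlineKind, Record<OutputFormat, [string, string] | null>> = {
  strong: { markdown: ["**", "**"], djot: ["*", "*"] },
  emphasis: { markdown: ["*", "*"], djot: ["_", "_"] },
  strikethrough: { markdown: ["~~", "~~"], djot: ["{-", "-}"] },
  inserted: { markdown: null, djot: ["{+", "+}"] },
  highlighted: { markdown: null, djot: ["{=", "=}"] },
  subscript: { markdown: null, djot: ["~", "~"] },
  superscript: { markdown: null, djot: ["^", "^"] },
};

// Wrap text in the delimiters for the given element and format, or return
// null when the table lists no syntax for that combination.
function wrapInline(kind: InlineKind, text: string, format: OutputFormat): string | null {
  const pair = DELIMITERS[kind][format];
  return pair === null ? null : `${pair[0]}${text}${pair[1]}`;
}

console.log(wrapInline("strong", "bold", "markdown")); // **bold**
console.log(wrapInline("strong", "bold", "djot")); // *bold*
```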

Example Usage

import { convert } from '@kreuzberg/html-to-markdown';

const html = "<p>This is <strong>bold</strong> and <em>italic</em> text.</p>";

// Default Markdown output
const markdown = convert(html);
// Result: "This is **bold** and *italic* text."

// Djot output
const djot = convert(html, { outputFormat: 'djot' });
// Result: "This is *bold* and _italic_ text."

Djot's extended syntax allows you to express more semantic meaning in lightweight text, making it useful for documents that require strikethrough, insertion tracking, or mathematical notation.

Metadata Extraction

The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass.

Use Cases:

SEO analysis – Extract title, description, Open Graph tags, Twitter cards
Table of contents generation – Build structured outlines from heading hierarchy
Content migration – Document all external links and resources
Accessibility audits – Check for images without alt text, empty links, invalid heading hierarchy
Link validation – Classify and validate anchor, internal, external, email, and phone links
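As a rough picture of the link categories named in the last item, the sketch below classifies an href by its prefix. It is a simplified stand-in under stated assumptions: `classifyHref` and `LinkKind` are hypothetical, every absolute URL is treated as external, and the library's own classification rules may differ.

```typescript
type LinkKind = "anchor" | "internal" | "external" | "email" | "phone";

// Simplified classifier, for illustration only.
function classifyHref(href: string): LinkKind {
  const value = href.trim();
  if (value.startsWith("#")) return "anchor";
  if (/^mailto:/i.test(value)) return "email";
  if (/^tel:/i.test(value)) return "phone";
  // Absolute (scheme://) and protocol-relative (//host) URLs are treated as
  // external here; a stricter check would compare the host with the site's own.
  if (/^([a-z][a-z0-9+.-]*:)?\/\//i.test(value)) return "external";
  // Everything else (relative or root-relative paths) counts as internal.
  return "internal";
}

const samples = ["#top", "/docs/setup", "https://example.com", "mailto:a@example.com", "tel:+15550100"];
for (const href of samples) {
  console.log(`${href} -> ${classifyHref(href)}`);
}
```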

Zero Overhead When Disabled: Metadata is collected during the HTML parsing pass, so enabling it adds negligible overhead. Disable unused metadata types in MetadataConfig to reduce it further.

Example: Quick Start

import { convertWithMetadata } from '@kreuzberg/html-to-markdown';

const html = '<h1>Article</h1><img src="test.jpg" alt="test">';
const { markdown, metadata } = convertWithMetadata(html);

console.log(metadata.document.title);      // Document title
console.log(metadata.headers);             // All h1-h6 elements
console.log(metadata.links);               // All hyperlinks
console.log(metadata.images);              // All images with alt text
console.log(metadata.structuredData);      // JSON-LD, Microdata, RDFa

For detailed examples including SEO extraction, table-of-contents generation, link validation, and accessibility audits, see the Metadata Extraction Guide.
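As a small taste of the table-of-contents use case, the sketch below turns a list of headers into a nested outline. The `HeaderLike` shape (`level`, `text`) is an assumption made for illustration, so check the actual header metadata type before relying on these field names. `buildToc` is a hypothetical helper.

```typescript
// Assumed shape: the real header metadata type may use different field names.
interface HeaderLike {
  level: number; // 1 for h1 through 6 for h6
  text: string;
}

// Build an indented outline, two spaces per level below the shallowest heading.
function buildToc(headers: HeaderLike[]): string {
  if (headers.length === 0) return "";
  const minLevel = Math.min(...headers.map((h) => h.level));
  return headers
    .map((h) => `${"  ".repeat(h.level - minLevel)}- ${h.text}`)
    .join("\n");
}

const sampleHeaders: HeaderLike[] = [
  { level: 1, text: "Article" },
  { level: 2, text: "Intro" },
  { level: 3, text: "Details" },
  { level: 2, text: "Usage" },
];

console.log(buildToc(sampleHeaders));
```

In practice the input would come from `metadata.headers`, mapped onto whatever fields that type actually exposes.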

Visitor Pattern

The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Use visitors to transform content, filter elements, validate structure, or collect analytics.

Use Cases:

Custom Markdown dialects – Convert to Obsidian, Notion, or other flavors
Content filtering – Remove tracking pixels, ads, or unwanted elements
URL rewriting – Rewrite CDN URLs, add query parameters, validate links
Accessibility validation – Check alt text, heading hierarchy, link text
Analytics – Track element usage, link destinations, image sources

Supported Visitor Methods: 40+ callbacks for text, inline elements, links, images, headings, lists, blocks, and tables.

Example: Quick Start

import { convertWithVisitor, type Visitor, type NodeContext, type VisitResult } from '@kreuzberg/html-to-markdown';

const visitor: Visitor = {
  visitLink(ctx: NodeContext, href: string, text: string, title?: string): VisitResult {
    // Rewrite CDN URLs
    if (href.startsWith('https://old-cdn.com')) {
      href = href.replace('https://old-cdn.com', 'https://new-cdn.com');
    }
    return { type: 'custom', output: `[${text}](${href})` };
  },

  visitImage(ctx: NodeContext, src: string, alt?: string, title?: string): VisitResult {
    // Skip tracking pixels
    if (src.includes('tracking')) {
      return { type: 'skip' };
    }
    return { type: 'continue' };
  },
};

const html = '<a href="https://old-cdn.com/file.pdf">Download</a>';
const markdown = convertWithVisitor(html, { visitor });
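Visitor callbacks are plain functions, so their logic can be exercised without running a conversion. The sketch below illustrates the custom-dialect use case with an Obsidian-style wikilink rule. `VisitResultLike` is a local mirror of the result shapes used in the example above, not the package's `VisitResult` type, and the `ctx` argument is omitted for brevity (a real callback receives it first).

```typescript
// Local mirror of the result shapes shown in the example above.
type VisitResultLike =
  | { type: "continue" }
  | { type: "skip" }
  | { type: "custom"; output: string };

// Hypothetical rule: turn root-relative links into wikilinks and leave
// everything else to the default conversion.
function visitLinkAsWikilink(href: string, text: string): VisitResultLike {
  if (href.startsWith("/")) {
    // "/notes/setup.html" becomes "notes/setup"
    const target = href.slice(1).replace(/\.html?$/, "");
    const output = target === text ? `[[${target}]]` : `[[${target}|${text}]]`;
    return { type: "custom", output };
  }
  return { type: "continue" };
}

console.log(visitLinkAsWikilink("/notes/setup.html", "Setup"));
console.log(visitLinkAsWikilink("https://example.com", "Example"));
```

To use it in a conversion, the function body would move into a `visitLink` callback with the signature shown above.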

Async support:

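import { convertWithAsyncVisitor, type AsyncVisitor } from '@kreuzberg/html-to-markdown';

// validateUrl is a placeholder for your own async check (for example, an HTTP HEAD request).
// html is the input string from the previous example.
const asyncVisitor: AsyncVisitor = {
  async visitLink(ctx, href, text, title) {
    const isValid = await validateUrl(href);
    return isValid ? { type: 'continue' } : { type: 'error', message: `Broken link: ${href}` };
  },
};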

const markdown = await convertWithAsyncVisitor(html, { visitor: asyncVisitor });

For comprehensive examples including content filtering, link footnotes, accessibility validation, and asynchronous URL validation, see the Visitor Pattern Guide.

Examples

Contributing

We welcome contributions! Please see our Contributing Guide for details on:

Setting up the development environment
Running tests locally
Submitting pull requests
Reporting issues

All contributions must follow our code quality standards (enforced via pre-commit hooks):

Proper test coverage (Rust 95%+, language bindings 80%+)
Formatting and linting checks
Documentation for public APIs

License

MIT License – see LICENSE.

Support

If you find this library useful, consider sponsoring the project.

Have questions or run into issues? We're here to help:

GitHub Issues: github.com/kreuzberg-dev/html-to-markdown/issues
Discussions: github.com/kreuzberg-dev/html-to-markdown/discussions
Discord Community: discord.gg/pXxagNK2zN
