docling-node-ts
v0.3.1
Convert documents to clean RAG-ready markdown in Node.js
Zero-dependency document-to-markdown conversion for Node.js.
Convert HTML, plain text, and markdown documents into clean, structure-preserving markdown suitable for RAG (Retrieval-Augmented Generation) pipelines, knowledge base construction, and LLM ingestion. Accepts string or Buffer input, auto-detects the format, routes to the appropriate converter, extracts metadata and image references, and returns a typed ConversionResult. No external services, no Python runtime, no network calls -- everything runs locally in Node.js.
Installation
npm install docling-node-ts
Requires Node.js 18 or later.
Quick Start
import { convert } from 'docling-node-ts';
// Convert HTML to markdown
const result = convert('<h1>Quarterly Report</h1><p>Revenue grew <strong>15%</strong> year-over-year.</p>');
console.log(result.markdown);
// # Quarterly Report
//
// Revenue grew **15%** year-over-year.
console.log(result.metadata);
// { wordCount: 5, headingCount: 1, imageCount: 0, readingTimeMinutes: 1 }
console.log(result.durationMs);
// 2
// Convert a Buffer with auto-detection
import { readFileSync } from 'fs';
const buf = readFileSync('report.html');
const { markdown, metadata, images, warnings } = convert(buf);
Features
- HTML to Markdown -- Converts headings (h1-h6), paragraphs, bold, italic, strikethrough, inline code, links, images, ordered and unordered lists (including nested), GFM pipe tables, fenced code blocks with language hints, blockquotes, horizontal rules, <figure>/<figcaption>, and <sup>/<sub> elements.
- Plain Text to Markdown -- Detects setext-style headings (underlined with === or ---), ALL CAPS headings, unordered and ordered lists, and paragraph breaks. Normalizes list markers and line endings.
- Markdown Normalization -- Cleans and normalizes existing markdown: collapses excessive blank lines, standardizes list markers to -, normalizes heading levels to eliminate gaps, fixes broken links with empty hrefs, and ensures consistent spacing around headings.
- Format Auto-Detection -- Detects the input format automatically using file extension, magic bytes (for Buffer inputs), and content analysis (HTML tags, markdown patterns). Supports explicit format override via options.
- Metadata Extraction -- Returns word count, heading count, image count, and estimated reading time. For HTML inputs, extracts title, author, and date from <title>, <meta>, and Open Graph tags.
- Image Reference Extraction -- Collects all image references from HTML with their id, alt text, and src path. Can be disabled with extractImages: false.
- Binary Format Guidance -- Detects PDF, DOCX, and PPTX inputs (via magic bytes or extension) and returns informative messages with suggested packages (pdfjs-dist, mammoth, jszip) and code examples. No binary parsers are bundled to keep the dependency tree at zero.
- HTML Sanitization -- Strips <script>, <style>, <noscript>, <iframe>, <svg>, <canvas>, <nav>, <footer>, <header>, and <aside> elements. Decodes HTML entities including numeric and hex character references.
- Zero Dependencies -- No runtime dependencies. Only devDependencies for building and testing.
API Reference
convert(input, options?)
The primary conversion function. Accepts a string or Buffer, auto-detects the format (or uses the explicit format from options), converts to markdown, and returns a ConversionResult.
function convert(input: string | Buffer, options?: ConvertOptions): ConversionResult;
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| input | string \| Buffer | The document content to convert |
| options | ConvertOptions | Optional conversion settings |
Returns: ConversionResult
import { convert } from 'docling-node-ts';
const result = convert('<table><thead><tr><th>Name</th><th>Age</th></tr></thead><tbody><tr><td>Alice</td><td>30</td></tr></tbody></table>');
console.log(result.markdown);
// | Name | Age |
// | --- | --- |
// | Alice | 30 |
convertHtml(html)
Convenience function that converts HTML to markdown. Equivalent to calling convert(html, { format: 'html' }).
function convertHtml(html: string): ConversionResult;
import { convertHtml } from 'docling-node-ts';
const { markdown } = convertHtml('<ul><li>First</li><li>Second</li></ul>');
// - First
// - Second
convertMarkdown(md)
Cleans and normalizes existing markdown. Standardizes list markers to -, normalizes heading levels to eliminate gaps, collapses excessive blank lines, replaces broken links (empty hrefs) with their link text, and ensures consistent spacing around headings. Equivalent to calling convert(md, { format: 'markdown' }).
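One way to picture the heading-level normalization is a stack of the original levels of the headings that are currently "open": the emitted level is the stack depth, so a jump from # to #### comes out as ##. The sketch below is an assumption about how such a pass could work, not the library's source. normalizeHeadingLevels is a hypothetical helper, it ignores fenced code blocks, and it always renumbers the first heading to level 1, which the package may or may not do.

```typescript
// Hypothetical sketch: remove gaps in ATX heading levels using a stack.
// Not part of docling-node-ts; behavior of the real normalizer may differ.
function normalizeHeadingLevels(markdown: string): string {
  // Original levels of the headings that are still "open" ancestors.
  const stack: number[] = [];
  return markdown
    .split('\n')
    .map((line) => {
      const match = /^(#{1,6})\s+(.*)$/.exec(line);
      if (!match) return line; // not a heading: leave untouched
      const level = match[1].length;
      // Close headings whose original level is the same or deeper.
      while (stack.length > 0 && stack[stack.length - 1] >= level) {
        stack.pop();
      }
      stack.push(level);
      // The new level is the nesting depth, so gaps disappear.
      return `${'#'.repeat(stack.length)} ${match[2]}`;
    })
    .join('\n');
}

console.log(normalizeHeadingLevels('# Title\n\n#### Skipped Level'));
```

Using nesting depth rather than "previous level plus one" keeps sibling headings at the same level after the gap is closed.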
function convertMarkdown(md: string): ConversionResult;
import { convertMarkdown } from 'docling-node-ts';
const { markdown } = convertMarkdown('# Title\n\n\n\n\n#### Skipped Level\n\n* Item');
// # Title
//
// ## Skipped Level
//
// - Item
convertText(text)
Converts plain text to markdown. Detects headings, lists, and paragraph structure. Equivalent to calling convert(text, { format: 'text' }).
function convertText(text: string): ConversionResult;
import { convertText } from 'docling-node-ts';
const { markdown } = convertText('INTRODUCTION\n\nSome body text.\n\n1) First step\n2) Second step');
// ## INTRODUCTION
//
// Some body text.
//
// 1. First step
// 2. Second step
detectFormat(input, fileName?)
Detects the format of a document from its content or file name.
Detection priority:
- File extension from fileName (.pdf, .docx, .pptx, .html, .htm, .xhtml, .txt, .md, .markdown)
- Magic bytes for Buffer inputs (%PDF for PDF, PK\x03\x04 for ZIP-based Office formats)
- Content analysis (HTML tags, markdown patterns)
- Default: 'text'
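The priority order above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: sniffFormat, the extension map, and the content-analysis regexes are hypothetical, and only the order of checks and the magic bytes come from the list above. ZIP-based inputs are reported as a generic 'zip' here, because telling DOCX from PPTX requires looking inside the archive.

```typescript
// Hypothetical sketch of: extension -> magic bytes -> content analysis -> default.
// Names and regexes are illustrative assumptions, not docling-node-ts internals.
type SketchFormat = 'html' | 'markdown' | 'text' | 'pdf' | 'docx' | 'pptx' | 'zip';

const EXTENSION_MAP = new Map<string, SketchFormat>([
  ['pdf', 'pdf'], ['docx', 'docx'], ['pptx', 'pptx'],
  ['html', 'html'], ['htm', 'html'], ['xhtml', 'html'],
  ['txt', 'text'], ['md', 'markdown'], ['markdown', 'markdown'],
]);

function extensionOf(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf('.');
  return dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : undefined;
}

function startsWithBytes(data: Uint8Array, magic: number[]): boolean {
  return data.length >= magic.length && magic.every((b, i) => data[i] === b);
}

function sniffFormat(input: string | Uint8Array, fileName?: string): SketchFormat {
  // 1. A recognized file extension wins.
  const ext = fileName ? extensionOf(fileName) : undefined;
  const byExtension = ext ? EXTENSION_MAP.get(ext) : undefined;
  if (byExtension) return byExtension;

  // 2. Magic bytes, only meaningful for binary input (Buffer is a Uint8Array).
  let text: string;
  if (typeof input === 'string') {
    text = input;
  } else {
    if (startsWithBytes(input, [0x25, 0x50, 0x44, 0x46])) return 'pdf'; // "%PDF"
    if (startsWithBytes(input, [0x50, 0x4b, 0x03, 0x04])) return 'zip'; // "PK\x03\x04"
    // Decode a short prefix byte-by-byte; enough to spot ASCII markers.
    text = '';
    for (let i = 0; i < Math.min(input.length, 1024); i++) {
      text += String.fromCharCode(input[i]);
    }
  }

  // 3. Content analysis: HTML tags first, then markdown patterns.
  if (/<(!doctype|html|body|div|p|h[1-6]|table)\b/i.test(text)) return 'html';
  if (/^#{1,6}\s+\S/m.test(text) || /\[[^\]]+\]\([^)]*\)/.test(text)) return 'markdown';

  // 4. Default.
  return 'text';
}

console.log(sniffFormat('', 'report.pdf'), sniffFormat('# Title\n\n## Section'));
```

Checking the extension before the bytes means a file name hint can override what the content looks like, which matches the stated priority.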
function detectFormat(input: string | Buffer, fileName?: string): InputFormat;
import { detectFormat } from 'docling-node-ts';
detectFormat('', 'report.pdf'); // 'pdf'
detectFormat('<html><body>Hi</body></html>'); // 'html'
detectFormat('# Title\n\n## Section'); // 'markdown'
detectFormat('Just plain text.'); // 'text'
const pdfBuffer = Buffer.from('%PDF-1.4 ...');
detectFormat(pdfBuffer); // 'pdf'
extractMetadata(markdown)
Extracts metadata from a markdown string. Computes word count, heading count, image count, and estimated reading time.
function extractMetadata(markdown: string): Pick<
  DocumentMetadata,
  'wordCount' | 'headingCount' | 'imageCount' | 'readingTimeMinutes'
>;
import { extractMetadata } from 'docling-node-ts';
const meta = extractMetadata('# Title\n\nSome **bold** text with ![image](image.png).\n');
// { wordCount: 4, headingCount: 1, imageCount: 1, readingTimeMinutes: 1 }
Word counting strips markdown syntax (headings, bold/italic, code blocks, image references, links, blockquotes, horizontal rules, table pipes, and HTML tags) before counting. Reading time is calculated at 200 words per minute, rounded up, with a minimum of 1 minute.
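As a rough illustration of the reading-time rule just described (200 words per minute, rounded up, minimum of 1 minute), here is a minimal sketch. estimateReadingTime and countWords are hypothetical helpers, not exports of the package, and the whitespace-based word count is a simplification that skips the markdown stripping the library performs.

```typescript
// Hypothetical helpers illustrating the documented reading-time rule.
const WORDS_PER_MINUTE = 200;

// Simplified word count: splits on whitespace, does not strip markdown syntax.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Round up to whole minutes, never reporting less than 1 minute.
function estimateReadingTime(wordCount: number): number {
  return Math.max(1, Math.ceil(wordCount / WORDS_PER_MINUTE));
}

console.log(estimateReadingTime(countWords('Revenue grew year over year')));
```

The minimum of 1 explains why even an empty document reports a reading time of one minute.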
Types
ConversionResult
The return type of all conversion functions.
interface ConversionResult {
  /** The converted markdown string */
  markdown: string;
  /** Extracted document metadata */
  metadata: DocumentMetadata;
  /** Image references found in the document */
  images: ImageReference[];
  /** Per-page content breakdown (for paginated formats) */
  pages: PageContent[];
  /** Warnings generated during conversion */
  warnings: string[];
  /** Conversion duration in milliseconds */
  durationMs: number;
}
ConvertOptions
Options for the convert function.
interface ConvertOptions {
  /** Explicitly specify the input format (skips auto-detection) */
  format?: InputFormat;
  /** Whether to extract image references (default: true) */
  extractImages?: boolean;
  /** Whether to preserve document structure like headings and lists (default: true) */
  preserveStructure?: boolean;
  /** Maximum number of pages to process (for paginated formats) */
  maxPages?: number;
  /** Whether to insert page break markers (default: false) */
  pageBreaks?: boolean;
  /** File name hint for format detection */
  fileName?: string;
}
InputFormat
Supported input format identifiers.
type InputFormat = 'html' | 'markdown' | 'text' | 'pdf' | 'docx' | 'pptx';
DocumentMetadata
Metadata extracted from a converted document.
interface DocumentMetadata {
  title?: string;
  author?: string;
  date?: string;
  pageCount?: number;
  wordCount: number;
  headingCount: number;
  imageCount: number;
  readingTimeMinutes: number;
}
ImageReference
A reference to an image found in the document.
interface ImageReference {
  /** Unique identifier for the image (e.g., "img-1") */
  id: string;
  /** Alt text for the image */
  alt: string;
  /** Source URL or path of the image */
  src: string;
  /** Page number where the image was found (if applicable) */
  page?: number;
}
PageContent
Content of a single page in a paginated document.
interface PageContent {
  /** Page number (1-based) */
  pageNumber: number;
  /** Markdown content of the page */
  markdown: string;
  /** Headings found on this page */
  headings: string[];
}
Configuration
Format Override
Skip auto-detection by specifying the format explicitly:
const result = convert(content, { format: 'html' });
File Name Hint
Provide a file name for extension-based format detection:
const result = convert(buffer, { fileName: 'report.html' });
Disable Image Extraction
Suppress image reference collection:
const result = convert(html, { extractImages: false });
console.log(result.images); // []
Strip All Formatting
Produce plain text output with no markdown syntax:
const result = convert('# Heading\n\n**bold** and *italic*', {
  format: 'markdown',
  preserveStructure: false,
});
console.log(result.markdown);
// Heading
//
// bold and italic
Error Handling
All conversion functions are synchronous and do not throw under normal operation. Errors and edge cases are communicated through the warnings array in the ConversionResult.
Binary Formats
When a binary format (PDF, DOCX, PPTX) is detected, the library does not throw. Instead, it returns a ConversionResult with an informative markdown message describing the detected format, suggested external packages, and example code:
const result = convert(pdfBuffer);
console.log(result.warnings);
// [
//   'Binary format "pdf" detected. Install a dedicated parser for full support.',
//   'Suggested packages: `pdfjs-dist`, `pdf-parse`, `pdf2json`'
// ]
Unexpected Formats
If the detected format does not match any known converter, the input is treated as plain text and a warning is added:
// result.warnings: ['Unexpected format: xyz. Treating as plain text.']
Empty or Whitespace Input
Empty strings and whitespace-only input produce minimal output without errors:
const result = convert('');
console.log(result.markdown); // '\n'
console.log(result.metadata.wordCount); // 0
Advanced Usage
RAG Pipeline Integration
Use docling-node-ts as the first stage in a document ingestion pipeline. The output markdown is designed for downstream chunking and embedding:
import { convert } from 'docling-node-ts';
function ingestDocument(html: string) {
  const { markdown, metadata, images, warnings } = convert(html);
  if (warnings.length > 0) {
    console.warn('Conversion warnings:', warnings);
  }
  // Chunk the markdown for embedding (e.g., with chunk-smart)
  // const chunks = chunkMarkdown(markdown, { maxTokens: 512 });
  return { markdown, metadata, images };
}
Processing Buffers from File Uploads
import { convert } from 'docling-node-ts';
function handleUpload(buffer: Buffer, originalFileName: string) {
  const result = convert(buffer, { fileName: originalFileName });
  return {
    markdown: result.markdown,
    title: result.metadata.title,
    wordCount: result.metadata.wordCount,
    readingTime: result.metadata.readingTimeMinutes,
    imageCount: result.images.length,
  };
}
HTML Metadata Extraction
When converting HTML, the library extracts metadata from <head> elements:
import { convert } from 'docling-node-ts';
const html = `
<html>
<head>
<title>Annual Report 2024</title>
<meta name="author" content="Finance Team">
<meta name="date" content="2024-12-01">
<meta property="og:title" content="Annual Report">
</head>
<body>
<h1>Annual Report</h1>
<p>Revenue increased by 20%.</p>
</body>
</html>
`;
const result = convert(html);
console.log(result.metadata.title); // 'Annual Report 2024'
console.log(result.metadata.author); // 'Finance Team'
console.log(result.metadata.date); // '2024-12-01'
Title extraction priority: <title> tag, then og:title. Author extraction checks both name="author" and property="article:author". Date extraction checks both name="date" and property="article:published_time".
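The fallback order described above can be sketched as a chain of lookups. This is a simplified illustration, not the library's parser: metaContent and sketchHeadMetadata are hypothetical names, the regexes assume double-quoted attributes with name/property written before content, and the order in which the two author sources and the two date sources are tried is an assumption (the text only says both are checked).

```typescript
// Hypothetical sketch of <head> metadata fallbacks; not docling-node-ts internals.
interface HeadMetadata {
  title: string | undefined;
  author: string | undefined;
  date: string | undefined;
}

// Looks up <meta ATTR="KEY" content="..."> with a deliberately simple regex.
function metaContent(html: string, attr: 'name' | 'property', key: string): string | undefined {
  const escaped = key.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const re = new RegExp(`<meta\\s+${attr}="${escaped}"\\s+content="([^"]*)"`, 'i');
  return re.exec(html)?.[1];
}

function sketchHeadMetadata(html: string): HeadMetadata {
  const titleTag = /<title[^>]*>([^<]*)<\/title>/i.exec(html)?.[1]?.trim();
  return {
    // Documented order: <title> first, then og:title.
    title: titleTag || metaContent(html, 'property', 'og:title'),
    // Assumed order: the name attribute first, then the article: property.
    author: metaContent(html, 'name', 'author') || metaContent(html, 'property', 'article:author'),
    date: metaContent(html, 'name', 'date') || metaContent(html, 'property', 'article:published_time'),
  };
}

const sample = '<head><meta property="og:title" content="Annual Report"></head>';
console.log(sketchHeadMetadata(sample).title);
```

Using || rather than ?? lets an empty <title></title> fall through to og:title instead of yielding an empty string.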
Normalizing Imported Markdown
Clean up markdown from external sources that may have inconsistent formatting:
import { convertMarkdown } from 'docling-node-ts';
const messy = `
# Title
#### Jumped Heading Level
* Mixed
+ List
- Markers
Click [broken]() link.
[Valid link](https://example.com)
`;
const { markdown } = convertMarkdown(messy);
// Heading levels normalized (#### becomes ##)
// List markers standardized to -
// Broken link text extracted without brackets
// Excessive blank lines collapsed
HTML Table Conversion
Tables are converted to GitHub Flavored Markdown pipe tables with column normalization and pipe escaping:
import { convertHtml } from 'docling-node-ts';
const html = `
<table>
<thead>
<tr><th>Product</th><th>Q1</th><th>Q2</th></tr>
</thead>
<tbody>
<tr><td>Widget A</td><td>$1,200</td><td>$1,500</td></tr>
<tr><td>Widget B</td><td>$800</td><td>$950</td></tr>
</tbody>
</table>
`;
const { markdown } = convertHtml(html);
// | Product | Q1 | Q2 |
// | --- | --- | --- |
// | Widget A | $1,200 | $1,500 |
// | Widget B | $800 | $950 |
Rows with fewer columns are padded with empty cells. Pipe characters (|) inside cell content are escaped as \|.
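The padding and escaping rules can be illustrated with a small table builder. This is a sketch under assumptions rather than the library's converter: toPipeTable and escapeCell are hypothetical helpers that take already-extracted cell text, and the first row is treated as the header.

```typescript
// Hypothetical sketch of GFM pipe-table output with padding and pipe escaping.
function escapeCell(cell: string): string {
  // A literal | would end the cell, so it is written as \| instead.
  return cell.trim().replace(/\|/g, '\\|');
}

function toPipeTable(rows: string[][]): string {
  if (rows.length === 0) return '';
  // The widest row decides the column count.
  const width = Math.max(...rows.map((row) => row.length));
  const normalized = rows.map((row) => {
    const cells = row.map(escapeCell);
    // Pad short rows with empty cells so every row has the same width.
    while (cells.length < width) cells.push('');
    return cells;
  });
  const line = (cells: string[]): string => `| ${cells.join(' | ')} |`;
  const [header, ...body] = normalized;
  const separator = line(new Array<string>(width).fill('---'));
  return [line(header), separator, ...body.map(line)].join('\n');
}

console.log(toPipeTable([['Product', 'Q1', 'Q2'], ['Widget A', '$1,200'], ['A|B', '1', '2']]));
```

Normalizing every row to the same width matters because many markdown renderers drop or misalign rows whose cell count differs from the header.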
TypeScript
This package is written in TypeScript and ships type declarations (dist/index.d.ts) alongside the compiled JavaScript. All public types are exported from the package entry point:
import type {
  ConversionResult,
  ConvertOptions,
  InputFormat,
  DocumentMetadata,
  ImageReference,
  PageContent,
} from 'docling-node-ts';
Compiled with strict: true, targeting ES2022 with CommonJS module output.
License
MIT
