@kreuzberg/node

v4.3.8

Kreuzberg document intelligence - Node.js native bindings

TypeScript (Node.js)

Extract text, tables, images, and metadata from 75+ file formats including PDF, Office documents, and images. Native NAPI-RS bindings for Node.js with superior performance, async/await support, and TypeScript type definitions.

Installation

Package Installation

Install via one of the supported package managers:

npm:

npm install @kreuzberg/node

pnpm:

pnpm add @kreuzberg/node

yarn:

yarn add @kreuzberg/node

System Requirements

Node.js 22+ required (NAPI-RS native bindings)
Optional: ONNX Runtime version 1.24+ for embeddings support
Optional: Tesseract OCR for OCR functionality
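
A start-up guard for the Node.js requirement can be sketched as below. The helper names `parseNodeMajor` and `meetsNodeRequirement` are hypothetical and not part of Kreuzberg; only the 22+ threshold comes from the list above.

```typescript
// Minimum major version, taken from the requirement above (Node.js 22+).
const MIN_NODE_MAJOR = 22;

// Parse a version string such as "v22.3.0" or "22.3.0" and return its major number.
function parseNodeMajor(version: string): number {
  const match = /^v?(\d+)\./.exec(version);
  if (!match) {
    throw new Error(`Unrecognized Node.js version string: ${version}`);
  }
  return Number(match[1]);
}

// Hypothetical helper: true when the version satisfies the 22+ requirement.
function meetsNodeRequirement(version: string): boolean {
  return parseNodeMajor(version) >= MIN_NODE_MAJOR;
}
```

In an application, passing `process.version` to `meetsNodeRequirement` before loading the native module gives a clearer error than a failed binding load.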

Format Support Notes:

Legacy formats (DOC, XLS, PPT) are now extracted natively without external tools
Modern Office formats (DOCX, XLSX, PPTX) are fully supported
WASM binding supports all document formats via in-memory parsing

Platform Support

Pre-built binaries available for:

macOS (arm64, x64)
Linux (x64)
Windows (x64)
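
To check ahead of time whether a machine is covered by the list above, a lookup along these lines may help. It is a minimal sketch: `hasPrebuiltBinary` is a hypothetical helper, and the mapping from the listed platforms to Node.js identifiers (`darwin`, `linux`, `win32`) is an assumption about how the list translates to `process.platform` values.

```typescript
// Platform/architecture pairs from the list above, written with the
// identifiers Node.js reports in process.platform and process.arch.
const PREBUILT_TARGETS: ReadonlyArray<readonly [string, string]> = [
  ["darwin", "arm64"],
  ["darwin", "x64"],
  ["linux", "x64"],
  ["win32", "x64"],
];

// Hypothetical helper: true when the pair appears in the pre-built list.
function hasPrebuiltBinary(platform: string, arch: string): boolean {
  return PREBUILT_TARGETS.some(([p, a]) => p === platform && a === arch);
}
```

Calling `hasPrebuiltBinary(process.platform, process.arch)` at install or start-up time shows whether the current host matches one of the listed targets.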

Quick Start

Basic Extraction

Extract text, metadata, and structure from any supported document format:

import { extractFileSync } from '@kreuzberg/node';

const config = {
	useCache: true,
	enableQualityProcessing: true,
};

const result = extractFileSync('document.pdf', null, config);

console.log(result.content);
console.log(`MIME Type: ${result.mimeType}`);

Common Use Cases

Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

import { extractFile } from '@kreuzberg/node';

const config = {
	ocr: {
		backend: 'tesseract',
		language: 'eng+fra',
		tesseractConfig: {
			psm: 3,
		},
	},
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

Table Extraction

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');

for (const table of result.tables) {
	console.log(`Table with ${table.cells.length} rows`);
	console.log(`Page: ${table.pageNumber}`);
	console.log(table.markdown);
}
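
If the extracted tables need to go into a spreadsheet, the rows can be serialized to CSV. The sketch below assumes `cells` is an array of rows, each an array of strings, which is what `table.cells.length` in the example suggests but is not confirmed here. `TableLike` and `tableToCsv` are hypothetical names, not part of the Kreuzberg API.

```typescript
// Assumed shape: only the `cells` field from the example above, as rows of strings.
interface TableLike {
  cells: string[][];
}

// Quote a field (RFC 4180 style) when it contains a comma, quote, or line break.
function escapeCsvField(field: string): string {
  return /[",\r\n]/.test(field) ? `"${field.replace(/"/g, '""')}"` : field;
}

// Join cells with commas and rows with CRLF.
function tableToCsv(table: TableLike): string {
  return table.cells
    .map((row) => row.map(escapeCsvField).join(","))
    .join("\r\n");
}

// Sample data standing in for one entry of result.tables.
const sample: TableLike = {
  cells: [
    ["Name", "Notes"],
    ["Widget", 'He said "hi", twice'],
  ],
};
console.log(tableToCsv(sample));
```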

Processing Multiple Files

import { batchExtractFilesSync } from '@kreuzberg/node';

const files = ['doc1.pdf', 'doc2.docx', 'doc3.pptx'];
const results = batchExtractFilesSync(files);

results.forEach((result, i) => {
	console.log(`File ${i + 1}: ${result.content.length} characters`);
});

Async Processing

For non-blocking document processing:

import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);

Configuration Discovery

import { ExtractionConfig, extractFile } from '@kreuzberg/node';

const config = ExtractionConfig.discover();
if (config) {
  console.log('Found configuration file');
  const result = await extractFile('document.pdf', null, config);
  console.log(result.content);
} else {
  console.log('No configuration file found, using defaults');
  const result = await extractFile('document.pdf');
  console.log(result.content);
}

Worker Thread Pool

import { createWorkerPool, extractFileInWorker, batchExtractFilesInWorker, closeWorkerPool } from '@kreuzberg/node';

// Create a pool with 4 worker threads
const pool = createWorkerPool(4);

try {
  // Extract single file in worker
  const result = await extractFileInWorker(pool, 'document.pdf', null, {
    useCache: true
  });
  console.log(result.content);

  // Extract multiple files concurrently
  const files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
  const results = await batchExtractFilesInWorker(pool, files, {
    useCache: true
  });

  results.forEach((result, i) => {
    console.log(`File ${i + 1}: ${result.content.length} characters`);
  });
} finally {
  // Always close the pool when done
  await closeWorkerPool(pool);
}

Performance Benefits:

Parallel Processing: Multiple documents extracted simultaneously
CPU Utilization: Maximizes multi-core CPU usage for large batches
Queue Management: Automatically distributes work across available workers
Resource Control: Prevents thread exhaustion with configurable pool size

Best Practices:

Use worker pools for batches of 10+ documents
Set pool size to number of CPU cores (default behavior)
Always close pools with closeWorkerPool() to prevent resource leaks
Reuse pools across multiple batch operations for efficiency
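
The first two practices can be folded into a small planning step. This is a sketch under stated assumptions: `planExtraction` and `ExtractionPlan` are hypothetical, the threshold of 10 comes from the list above, and capping the pool at the number of files is an added design choice rather than documented behavior.

```typescript
// "Use worker pools for batches of 10+ documents"
const POOL_THRESHOLD = 10;

interface ExtractionPlan {
  useWorkerPool: boolean;
  poolSize: number; // 0 when no pool is used
}

// Hypothetical helper: decide whether a pool is worthwhile and how large it should be.
function planExtraction(fileCount: number, cpuCount: number): ExtractionPlan {
  if (fileCount < POOL_THRESHOLD) {
    return { useWorkerPool: false, poolSize: 0 };
  }
  // One worker per core, but never more workers than files and never fewer than one.
  const poolSize = Math.max(1, Math.min(fileCount, cpuCount));
  return { useWorkerPool: true, poolSize };
}
```

The core count can come from `os.cpus().length` in `node:os`. When `useWorkerPool` is true, `poolSize` is the value to pass to `createWorkerPool`.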

Next Steps

Installation Guide - Platform-specific setup
API Documentation - Complete API reference
Examples & Guides - Full code examples and usage guides
Configuration Guide - Advanced configuration options

NAPI-RS Implementation Details

Native Performance

This binding uses NAPI-RS to provide native Node.js bindings with:

Zero-copy data transfer between JavaScript and Rust layers
Native thread pool for concurrent document processing
Direct memory management for efficient large document handling
Binary-compatible pre-built native modules across platforms

Threading Model

Single documents are processed synchronously or asynchronously in a dedicated thread
Batch operations distribute work across available CPU cores
Thread count is configurable but defaults to system CPU count
Synchronous APIs block the event loop for the duration of an extraction; use the async APIs for long-running extractions
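
When many files go through a promise-returning call, it can help to cap how many are in flight at once. The helper below is a generic sketch and not part of Kreuzberg: `mapWithConcurrency` is a hypothetical name, and the `task` argument is where a promise-returning call such as `extractFile` would go.

```typescript
// Run `task` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index] as T);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Stand-in for an async extraction call; swap in a real promise-returning API.
const fakeExtract = async (path: string): Promise<number> => path.length;

mapWithConcurrency(["a.pdf", "bb.docx"], 2, fakeExtract).then((sizes) => {
  console.log(sizes);
});
```

Because JavaScript runs these callbacks on a single thread, the shared `next` counter needs no locking. The limit only bounds how many tasks are awaiting completion at the same time.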

Memory Management

Large documents (> 100 MB) are streamed to avoid loading entirely into memory
Temporary files are created in system temp directory for extraction
Memory is automatically released after extraction completion
ONNX models are cached in memory for repeated embeddings operations

Features

Supported File Formats (75+)

75+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |

Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .rst, .org, .rtf | CommonMark, GFM, Djot, reStructuredText, Org Mode |

Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |

Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| Citations | .bib, .biblatex, .ris, .nbib, .enw, .csl | Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |

Complete Format Reference

Key Capabilities

Text Extraction - Extract all text content with position and formatting information
Metadata Extraction - Retrieve document properties, creation date, author, etc.
Table Extraction - Parse tables with structure and cell content preservation
Image Extraction - Extract embedded images and render page previews
OCR Support - Integrate multiple OCR backends for scanned documents
Async/Await - Non-blocking document processing with concurrent operations
Plugin System - Extensible post-processing for custom text transformation
Embeddings - Generate vector embeddings using ONNX Runtime models
Batch Processing - Efficiently process multiple documents in parallel
Memory Efficient - Stream large files without loading entirely into memory
Language Detection - Detect and support multiple languages in documents
Configuration - Fine-grained control over extraction behavior

Performance Characteristics

| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| PDF (text) | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
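
The throughput ranges above translate into rough time budgets. The sketch below only does that arithmetic: `estimateExtractionTime` and the format keys are hypothetical names, and the figures are the table's ranges, not measurements.

```typescript
// Throughput ranges in MB/s, copied from the table above as [slowest, fastest].
const THROUGHPUT_MB_PER_S = {
  pdfText: [10, 100],
  office: [20, 200],
  imageOcr: [1, 5],
  archive: [5, 50],
  web: [50, 200],
} as const;

type FormatClass = keyof typeof THROUGHPUT_MB_PER_S;

interface TimeEstimate {
  bestSeconds: number;  // at the fastest listed throughput
  worstSeconds: number; // at the slowest listed throughput
}

// Hypothetical helper: time = size / throughput at both ends of the range.
function estimateExtractionTime(sizeMb: number, format: FormatClass): TimeEstimate {
  const [slow, fast] = THROUGHPUT_MB_PER_S[format];
  return { bestSeconds: sizeMb / fast, worstSeconds: sizeMb / slow };
}
```

For example, a 100 MB text PDF comes out at 1 to 10 seconds, and a 10 MB scanned image at 2 to 10 seconds with OCR.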

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

Tesseract
PaddleOCR

OCR Configuration Example

import { extractFile } from '@kreuzberg/node';

const config = {
	ocr: {
		backend: 'tesseract',
		language: 'eng+fra',
		tesseractConfig: {
			psm: 3,
		},
	},
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

Async Support

This binding provides full async/await support for non-blocking document processing:

import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit Plugin System Guide.

Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.

Embeddings Guide

Batch Processing

Process multiple documents efficiently:

import { batchExtractFilesSync } from '@kreuzberg/node';

const files = ['doc1.pdf', 'doc2.docx', 'doc3.pptx'];
const results = batchExtractFilesSync(files);

results.forEach((result, i) => {
	console.log(`File ${i + 1}: ${result.content.length} characters`);
});

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Documentation

Contributing

Contributions are welcome! See Contributing Guide.

License

MIT License - see LICENSE file for details.

Support

Discord Community: Join our Discord
GitHub Issues: Report bugs
Discussions: Ask questions