@kreuzberg/node

v4.3.8

Kreuzberg document intelligence - Node.js native bindings

Downloads

21,549

TypeScript (Node.js)

Extract text, tables, images, and metadata from 75+ file formats, including PDFs, Office documents, and images. Native NAPI-RS bindings for Node.js with native performance, async/await support, and TypeScript type definitions.

Installation

Package Installation

Install via one of the supported package managers:

npm:

npm install @kreuzberg/node

pnpm:

pnpm add @kreuzberg/node

yarn:

yarn add @kreuzberg/node

System Requirements

  • Node.js 22+ required (NAPI-RS native bindings)
  • Optional: ONNX Runtime version 1.24+ for embeddings support
  • Optional: Tesseract OCR for OCR functionality
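
A startup guard for the Node.js requirement could look like the sketch below. `meetsMinimumNode` is a hypothetical helper written for this example and is not part of `@kreuzberg/node`; the only fact taken from the list above is the minimum major version of 22.

```typescript
// Hypothetical helper: not part of @kreuzberg/node.
// Returns true when a Node.js version string such as "22.3.0" (or "v22.3.0")
// has a major version greater than or equal to the required one.
function meetsMinimumNode(version: string, requiredMajor = 22): boolean {
  const major = Number.parseInt(version.replace(/^v/, "").split(".")[0], 10);
  return Number.isFinite(major) && major >= requiredMajor;
}

console.log(meetsMinimumNode("22.3.0")); // true
console.log(meetsMinimumNode("v20.11.1")); // false
```

At application startup, the value to pass in would typically be `process.versions.node`.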

Format Support Notes:

  • Legacy formats (DOC, XLS, PPT) are now extracted natively without external tools
  • Modern Office formats (DOCX, XLSX, PPTX) are fully supported
  • WASM binding supports all document formats via in-memory parsing

Platform Support

Pre-built binaries available for:

  • macOS (arm64, x64)
  • Linux (x64)
  • Windows (x64)
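
To check whether the current machine is covered by the list above, one option is a small lookup like this sketch. `hasPrebuiltBinary` and `PREBUILT_TARGETS` are hypothetical names introduced here, not package exports; the pairs are the platforms listed above, written with the values Node.js reports in `process.platform` and `process.arch`.

```typescript
// Platform/architecture pairs from the list above, using Node.js naming
// (macOS is "darwin", Windows is "win32").
const PREBUILT_TARGETS: ReadonlySet<string> = new Set([
  "darwin-arm64",
  "darwin-x64",
  "linux-x64",
  "win32-x64",
]);

// Hypothetical helper: not part of @kreuzberg/node.
function hasPrebuiltBinary(platform: string, arch: string): boolean {
  return PREBUILT_TARGETS.has(`${platform}-${arch}`);
}

console.log(hasPrebuiltBinary("darwin", "arm64")); // true
console.log(hasPrebuiltBinary("linux", "arm64")); // false
```

In an application this would usually be called as `hasPrebuiltBinary(process.platform, process.arch)`.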

Quick Start

Basic Extraction

Extract text, metadata, and structure from any supported document format:

import { extractFileSync } from '@kreuzberg/node';

const config = {
	useCache: true,
	enableQualityProcessing: true,
};

const result = extractFileSync('document.pdf', null, config);

console.log(result.content);
console.log(`MIME Type: ${result.mimeType}`);

Common Use Cases

Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

import { extractFile } from '@kreuzberg/node';

const config = {
	ocr: {
		backend: 'tesseract',
		language: 'eng+fra',
		tesseractConfig: {
			psm: 3,
		},
	},
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);
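
In the configuration above, `language: 'eng+fra'` uses Tesseract's convention of joining language codes with `+`, and `psm: 3` is Tesseract's fully automatic page segmentation mode. The sketch below builds the same object from a list of language codes. The `OcrOptions` and `TesseractOptions` interfaces and the `buildTesseractOcr` function are hypothetical; they mirror only the fields shown in the example and are not the package's own type definitions.

```typescript
// Hypothetical types that mirror only the fields used in the example above.
interface TesseractOptions {
  psm: number;
}

interface OcrOptions {
  backend: string;
  language: string;
  tesseractConfig?: TesseractOptions;
}

// Hypothetical helper: joins language codes with "+" as Tesseract expects.
function buildTesseractOcr(languages: string[], psm = 3): { ocr: OcrOptions } {
  if (languages.length === 0) {
    throw new Error("At least one language code is required");
  }
  return {
    ocr: {
      backend: "tesseract",
      language: languages.join("+"),
      tesseractConfig: { psm },
    },
  };
}

console.log(buildTesseractOcr(["eng", "fra"]).ocr.language); // eng+fra
```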

Table Extraction

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');

for (const table of result.tables) {
	console.log(`Table with ${table.cells.length} rows`);
	console.log(`Page: ${table.pageNumber}`);
	console.log(table.markdown);
}
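
If the table data is needed as CSV rather than Markdown, a conversion along these lines may help. This is a sketch that assumes `table.cells` is an array of rows, each row an array of strings; the example above only shows that `cells.length` is the row count, so the row shape is an assumption. `cellsToCsv` is a hypothetical helper, not a package export.

```typescript
// Hypothetical helper: not part of @kreuzberg/node.
// Assumes cells is an array of rows, each row an array of cell strings.
function cellsToCsv(cells: string[][]): string {
  // Quote a field only when it contains a quote, comma, or line break,
  // doubling any embedded quotes.
  const escapeField = (value: string): string =>
    /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;

  return cells.map((row) => row.map(escapeField).join(",")).join("\n");
}

const sampleCells = [
  ["Name", "Note"],
  ["Widget", '2" bolt, steel'],
];

console.log(cellsToCsv(sampleCells));
// Name,Note
// Widget,"2"" bolt, steel"
```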

Processing Multiple Files

import { batchExtractFilesSync } from '@kreuzberg/node';

const files = ['doc1.pdf', 'doc2.docx', 'doc3.pptx'];
const results = batchExtractFilesSync(files);

results.forEach((result, i) => {
	console.log(`File ${i + 1}: ${result.content.length} characters`);
});

Async Processing

For non-blocking document processing:

import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);

Configuration Discovery

import { ExtractionConfig, extractFile } from '@kreuzberg/node';

const config = ExtractionConfig.discover();
if (config) {
  console.log('Found configuration file');
  const result = await extractFile('document.pdf', null, config);
  console.log(result.content);
} else {
  console.log('No configuration file found, using defaults');
  const result = await extractFile('document.pdf');
  console.log(result.content);
}
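
The two branches above differ only in whether a discovered configuration exists. One way to avoid the duplication is to merge whatever was discovered over a set of application defaults, as in this sketch. `resolveConfig` and the `PartialConfig` type are hypothetical; the field names in the usage lines are taken from the Quick Start example, and the merge is shallow, so nested objects such as `ocr` are replaced rather than combined.

```typescript
// Hypothetical type: a loose stand-in for an extraction configuration object.
type PartialConfig = Record<string, unknown>;

// Hypothetical helper: not part of @kreuzberg/node.
// Discovered values win over defaults; a missing discovery result
// (null or undefined) leaves the defaults untouched.
function resolveConfig(
  discovered: PartialConfig | null | undefined,
  defaults: PartialConfig,
): PartialConfig {
  return { ...defaults, ...(discovered ?? {}) };
}

const appDefaults = { useCache: true, enableQualityProcessing: true };

console.log(resolveConfig(null, appDefaults));
console.log(resolveConfig({ useCache: false }, appDefaults));
```

The resolved object can then be passed as the third argument of `extractFile` in a single call.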

Worker Thread Pool

import { createWorkerPool, extractFileInWorker, batchExtractFilesInWorker, closeWorkerPool } from '@kreuzberg/node';

// Create a pool with 4 worker threads
const pool = createWorkerPool(4);

try {
  // Extract single file in worker
  const result = await extractFileInWorker(pool, 'document.pdf', null, {
    useCache: true
  });
  console.log(result.content);

  // Extract multiple files concurrently
  const files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
  const results = await batchExtractFilesInWorker(pool, files, {
    useCache: true
  });

  results.forEach((result, i) => {
    console.log(`File ${i + 1}: ${result.content.length} characters`);
  });
} finally {
  // Always close the pool when done
  await closeWorkerPool(pool);
}

Performance Benefits:

  • Parallel Processing: Multiple documents extracted simultaneously
  • CPU Utilization: Maximizes multi-core CPU usage for large batches
  • Queue Management: Automatically distributes work across available workers
  • Resource Control: Prevents thread exhaustion with configurable pool size

Best Practices:

  • Use worker pools for batches of 10+ documents
  • Set pool size to number of CPU cores (default behavior)
  • Always close pools with closeWorkerPool() to prevent resource leaks
  • Reuse pools across multiple batch operations for efficiency
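
The first two practices can be folded into a small planning step, sketched below. `planBatch` and `BatchPlan` are hypothetical names, not package exports. The 10-document threshold and the core-count sizing come from the list above; capping the pool at the number of files is an extra choice made for this sketch so that small batches do not start idle workers.

```typescript
interface BatchPlan {
  useWorkerPool: boolean;
  poolSize: number; // 0 when no pool is recommended
}

// Hypothetical helper: not part of @kreuzberg/node.
function planBatch(fileCount: number, cpuCount: number): BatchPlan {
  // Threshold from the best practices above: pools pay off at 10+ documents.
  const useWorkerPool = fileCount >= 10;
  // Size the pool to the core count, but never above the number of files.
  const poolSize = useWorkerPool
    ? Math.max(1, Math.min(cpuCount, fileCount))
    : 0;
  return { useWorkerPool, poolSize };
}

console.log(planBatch(3, 8)); // { useWorkerPool: false, poolSize: 0 }
console.log(planBatch(50, 8)); // { useWorkerPool: true, poolSize: 8 }
```

The `cpuCount` argument would typically come from `os.availableParallelism()` or `os.cpus().length` in `node:os`, and a non-zero `poolSize` can be passed to `createWorkerPool`.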

NAPI-RS Implementation Details

Native Performance

This binding uses NAPI-RS to provide native Node.js bindings with:

  • Zero-copy data transfer between JavaScript and Rust layers
  • Native thread pool for concurrent document processing
  • Direct memory management for efficient large document handling
  • Binary-compatible pre-built native modules across platforms

Threading Model

  • Single documents are processed synchronously or asynchronously in a dedicated thread
  • Batch operations distribute work across available CPU cores
  • Thread count is configurable but defaults to system CPU count
  • Synchronous APIs (such as extractFileSync) block the event loop for the duration of the extraction; use the async APIs (such as extractFile) for long-running work
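
To make "distribute work across available CPU cores" concrete, the sketch below splits a list of files into one queue per worker using round-robin assignment. This is only an illustration of the idea: the library's actual scheduling strategy is not described here, and `distributeRoundRobin` is a hypothetical function, not a package export.

```typescript
// Hypothetical illustration: not the library's scheduler.
// Assigns item i to queue (i mod workerCount), so queue lengths differ by at most one.
function distributeRoundRobin<T>(items: readonly T[], workerCount: number): T[][] {
  if (!Number.isInteger(workerCount) || workerCount < 1) {
    throw new RangeError("workerCount must be a positive integer");
  }
  const queues: T[][] = Array.from({ length: workerCount }, () => []);
  items.forEach((item, index) => {
    queues[index % workerCount].push(item);
  });
  return queues;
}

const queues = distributeRoundRobin(["a.pdf", "b.docx", "c.pptx", "d.pdf", "e.xlsx"], 2);
console.log(queues);
// [ [ 'a.pdf', 'c.pptx', 'e.xlsx' ], [ 'b.docx', 'd.pdf' ] ]
```

A static split like this ignores how long each document takes; a shared queue that idle workers pull from usually balances uneven workloads better.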

Memory Management

  • Large documents (> 100 MB) are streamed to avoid loading entirely into memory
  • Temporary files are created in system temp directory for extraction
  • Memory is automatically released after extraction completion
  • ONNX models are cached in memory for repeated embeddings operations

Features

Supported File Formats (75+)

75+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |

Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .rst, .org, .rtf | CommonMark, GFM, Djot, reStructuredText, Org Mode |

Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |

Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| Citations | .bib, .biblatex, .ris, .nbib, .enw, .csl | Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |

Complete Format Reference

Key Capabilities

  • Text Extraction - Extract all text content with position and formatting information

  • Metadata Extraction - Retrieve document properties, creation date, author, etc.

  • Table Extraction - Parse tables with structure and cell content preservation

  • Image Extraction - Extract embedded images and render page previews

  • OCR Support - Integrate multiple OCR backends for scanned documents

  • Async/Await - Non-blocking document processing with concurrent operations

  • Plugin System - Extensible post-processing for custom text transformation

  • Embeddings - Generate vector embeddings using ONNX Runtime models

  • Batch Processing - Efficiently process multiple documents in parallel

  • Memory Efficient - Stream large files without loading entirely into memory

  • Language Detection - Detect and support multiple languages in documents

  • Configuration - Fine-grained control over extraction behavior

Performance Characteristics

| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| PDF (text) | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
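
The throughput ranges above translate into rough time estimates by dividing file size by speed. The sketch below does that arithmetic; `estimateSeconds` and the `THROUGHPUT_MB_PER_S` keys are hypothetical names introduced for this example, and the numbers are copied from the table. Real timings depend on hardware and document content, so treat the result as an order-of-magnitude guide.

```typescript
// [slowest, fastest] throughput in MB/s, copied from the table above.
const THROUGHPUT_MB_PER_S = {
  pdfText: [10, 100],
  office: [20, 200],
  imageOcr: [1, 5],
  archive: [5, 50],
  web: [50, 200],
} as const;

type FormatKey = keyof typeof THROUGHPUT_MB_PER_S;

// Hypothetical helper: best case uses the fastest rate, worst case the slowest.
function estimateSeconds(sizeMb: number, format: FormatKey): { best: number; worst: number } {
  const [slowest, fastest] = THROUGHPUT_MB_PER_S[format];
  return { best: sizeMb / fastest, worst: sizeMb / slowest };
}

console.log(estimateSeconds(50, "pdfText")); // { best: 0.5, worst: 5 }
console.log(estimateSeconds(20, "imageOcr")); // { best: 4, worst: 20 }
```

For example, a 50 MB text PDF falls between 50 / 100 = 0.5 s and 50 / 10 = 5 s, while a 20 MB scanned image run through OCR falls between 4 s and 20 s.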

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

  • Tesseract

  • PaddleOCR

OCR Configuration Example

import { extractFile } from '@kreuzberg/node';

const config = {
	ocr: {
		backend: 'tesseract',
		language: 'eng+fra',
		tesseractConfig: {
			psm: 3,
		},
	},
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

Async Support

This binding provides full async/await support for non-blocking document processing:

import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit Plugin System Guide.

Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.

Embeddings Guide

Batch Processing

Process multiple documents efficiently:

import { batchExtractFilesSync } from '@kreuzberg/node';

const files = ['doc1.pdf', 'doc2.docx', 'doc3.pptx'];
const results = batchExtractFilesSync(files);

results.forEach((result, i) => {
	console.log(`File ${i + 1}: ${result.content.length} characters`);
});

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Contributing

Contributions are welcome! See Contributing Guide.

License

MIT License - see LICENSE file for details.

Support