kreuzberg

v4.0.0-rc.1

Kreuzberg document intelligence - Node.js native bindings

Kreuzberg for Node.js

High-performance document intelligence for Node.js and TypeScript, powered by Rust.

Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.

🚀 Version 4.0.0 Release Candidate: this is a pre-release version. We invite you to test the library and report any issues you encounter.

Features

  • 50+ File Formats: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
  • OCR Support: Built-in Tesseract, EasyOCR, and PaddleOCR backends for scanned documents
  • Table Extraction: Advanced table detection and structured data extraction
  • High Performance: Native Rust core provides 10-50x performance improvements over pure JavaScript
  • Type-Safe: Full TypeScript definitions for all methods, configurations, and return types
  • Async/Sync APIs: Both asynchronous and synchronous extraction methods
  • Batch Processing: Process multiple documents in parallel with optimized concurrency
  • Language Detection: Automatic language detection for extracted text
  • Text Chunking: Split long documents into manageable chunks for LLM processing
  • Caching: Built-in result caching for faster repeated extractions
  • Zero Configuration: Works out of the box with sensible defaults

Requirements

  • Node.js 18 or higher
  • Native bindings are prebuilt for:
    • macOS (x64, arm64)
    • Linux (x64, arm64, armv7)
    • Windows (x64, arm64)
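
The platform list above can be turned into a quick pre-install check. The sketch below is illustrative only: `hasPrebuiltBinary` is a hypothetical helper, and the mapping to Node's `process.platform` / `process.arch` names (`darwin` for macOS, `win32` for Windows, `arm` for armv7) is an assumption, not something the package documents.

```typescript
// Prebuilt targets as listed in the Requirements section.
// Keys and values follow Node's `process.platform` / `process.arch` naming;
// that mapping is an assumption of this sketch.
const PREBUILT_TARGETS: Record<string, readonly string[]> = {
  darwin: ["x64", "arm64"],
  linux: ["x64", "arm64", "arm"],
  win32: ["x64", "arm64"],
};

// Hypothetical helper: true when a prebuilt binary is expected for the target.
function hasPrebuiltBinary(platform: string, arch: string): boolean {
  return PREBUILT_TARGETS[platform]?.includes(arch) ?? false;
}
```

In a Node.js script you could call it as `hasPrebuiltBinary(process.platform, process.arch)` before deciding whether a source build might be needed.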

Optional System Dependencies

  • Tesseract: For OCR functionality

    • macOS: brew install tesseract
    • Ubuntu: sudo apt-get install tesseract-ocr
    • Windows: Download from GitHub (tesseract-ocr/tesseract)
  • LibreOffice: For legacy MS Office formats (.doc, .ppt)

    • macOS: brew install libreoffice
    • Ubuntu: sudo apt-get install libreoffice
  • Pandoc: For advanced document conversion

    • macOS: brew install pandoc
    • Ubuntu: sudo apt-get install pandoc

Installation

npm install kreuzberg

Or with pnpm:

pnpm add kreuzberg

Or with yarn:

yarn add kreuzberg

The package includes prebuilt native binaries for major platforms, so no additional build steps are required.

Quick Start

Basic Extraction

import { extractFileSync } from 'kreuzberg';

// Synchronous extraction
const result = extractFileSync('document.pdf');
console.log(result.content);
console.log(result.metadata);

Async Extraction (Recommended)

import { extractFile } from 'kreuzberg';

// Asynchronous extraction
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.tables);

With Full Type Safety

import {
  extractFile,
  type ExtractionConfig,
  type ExtractionResult
} from 'kreuzberg';

const config: ExtractionConfig = {
  useCache: true,
  enableQualityProcessing: true
};

const result: ExtractionResult = await extractFile('invoice.pdf', config);

// Type-safe access to all properties
console.log(result.content);
console.log(result.mimeType);
console.log(result.metadata);

if (result.tables) {
  for (const table of result.tables) {
    console.log(table.markdown);
  }
}

Configuration

OCR Configuration

import { extractFile, type ExtractionConfig, type OcrConfig } from 'kreuzberg';

const config: ExtractionConfig = {
  ocr: {
    backend: 'tesseract',
    language: 'eng',
    tesseractConfig: {
      enableTableDetection: true,
      psm: 6,
      minConfidence: 50.0
    }
  } as OcrConfig
};

const result = await extractFile('scanned.pdf', config);
console.log(result.content);

PDF Password Protection

import { extractFile, type PdfConfig } from 'kreuzberg';

const config = {
  pdfOptions: {
    passwords: ['password1', 'password2'],
    extractImages: true,
    extractMetadata: true
  } as PdfConfig
};

const result = await extractFile('protected.pdf', config);

Extract Tables

import { extractFile } from 'kreuzberg';

const result = await extractFile('financial-report.pdf');

if (result.tables) {
  for (const table of result.tables) {
    console.log('Table as Markdown:');
    console.log(table.markdown);

    console.log('Table cells:');
    console.log(JSON.stringify(table.cells, null, 2));
  }
}
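
If you need the table outside of Markdown, the 2D `cells` array can be flattened to CSV. This is a minimal sketch: `cellsToCsv` and `escapeCsvField` are hypothetical helpers, not part of the package, and because the shape of `TableCell` is not described here, the caller supplies a `toText` function (defaulting to `String`) to turn a cell into text.

```typescript
// Escape one CSV field: quote it when it contains a comma, quote, or newline.
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Hypothetical helper: converts a 2D array of cells into CSV text.
// `toText` is an assumption; adapt it to however your cells store their text.
function cellsToCsv<T>(cells: T[][], toText: (cell: T) => string = String): string {
  return cells
    .map((row) => row.map((cell) => escapeCsvField(toText(cell))).join(","))
    .join("\n");
}
```

For example, `cellsToCsv(table.cells)` works directly if the cells are plain strings; otherwise pass a second argument that picks the text out of each cell.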

Text Chunking

import { extractFile, type ChunkingConfig } from 'kreuzberg';

const config = {
  chunking: {
    maxChars: 1000,
    maxOverlap: 200
  } as ChunkingConfig
};

const result = await extractFile('long-document.pdf', config);

if (result.chunks) {
  for (const chunk of result.chunks) {
    console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
  }
}
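
To build intuition for how `maxChars` and `maxOverlap` interact, here is a naive fixed-window chunker. It is only an illustration of the two parameters, not the algorithm Kreuzberg uses (the library may well split on sentence or paragraph boundaries); `chunkText` and `SimpleChunk` are hypothetical names.

```typescript
interface SimpleChunk {
  index: number;
  text: string;
}

// Naive sliding window: each chunk holds at most `maxChars` characters and
// repeats the last `maxOverlap` characters of the previous chunk.
function chunkText(text: string, maxChars: number, maxOverlap: number): SimpleChunk[] {
  if (maxChars <= 0 || maxOverlap < 0 || maxOverlap >= maxChars) {
    throw new RangeError("require maxChars > 0 and 0 <= maxOverlap < maxChars");
  }
  const chunks: SimpleChunk[] = [];
  const step = maxChars - maxOverlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ index: chunks.length, text: text.slice(start, start + maxChars) });
    if (start + maxChars >= text.length) break; // this window reached the end
  }
  return chunks;
}
```

With `maxChars: 1000` and `maxOverlap: 200`, the window advances 800 characters per chunk, so a larger overlap means more chunks for the same document.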

Language Detection

import { extractFile, type LanguageDetectionConfig } from 'kreuzberg';

const config = {
  languageDetection: {
    enabled: true,
    minConfidence: 0.8,
    detectMultiple: false
  } as LanguageDetectionConfig
};

const result = await extractFile('multilingual.pdf', config);

if (result.language) {
  console.log(`Detected language: ${result.language.code}`);
  console.log(`Confidence: ${result.language.confidence}`);
}

Image Extraction

import { extractFile, type ImageExtractionConfig } from 'kreuzberg';
import { writeFile } from 'fs/promises';

const config = {
  images: {
    extractImages: true,
    targetDpi: 300,
    maxImageDimension: 4096,
    autoAdjustDpi: true
  } as ImageExtractionConfig
};

const result = await extractFile('document-with-images.pdf', config);

if (result.images) {
  for (let i = 0; i < result.images.length; i++) {
    const image = result.images[i];
    await writeFile(`image-${i}.${image.format}`, Buffer.from(image.data));
  }
}

Complete Configuration Example

import {
  extractFile,
  type ExtractionConfig,
  type OcrConfig,
  type ChunkingConfig,
  type ImageExtractionConfig,
  type PdfConfig,
  type TokenReductionConfig,
  type LanguageDetectionConfig
} from 'kreuzberg';

const config: ExtractionConfig = {
  useCache: true,
  enableQualityProcessing: true,
  forceOcr: false,
  maxConcurrentExtractions: 8,

  ocr: {
    backend: 'tesseract',
    language: 'eng',
    preprocessing: true,
    tesseractConfig: {
      enableTableDetection: true,
      psm: 6,
      oem: 3,
      minConfidence: 50.0
    }
  } as OcrConfig,

  chunking: {
    maxChars: 1000,
    maxOverlap: 200
  } as ChunkingConfig,

  images: {
    extractImages: true,
    targetDpi: 300,
    maxImageDimension: 4096,
    autoAdjustDpi: true
  } as ImageExtractionConfig,

  pdfOptions: {
    extractImages: true,
    passwords: [],
    extractMetadata: true
  } as PdfConfig,

  tokenReduction: {
    mode: 'moderate',
    preserveImportantWords: true
  } as TokenReductionConfig,

  languageDetection: {
    enabled: true,
    minConfidence: 0.8,
    detectMultiple: false
  } as LanguageDetectionConfig
};

const result = await extractFile('document.pdf', config);

Advanced Usage

Extract from Buffer

import { extractBytes } from 'kreuzberg';
import { readFile } from 'fs/promises';

const buffer = await readFile('document.pdf');
const result = await extractBytes(buffer, 'application/pdf');
console.log(result.content);
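
Because `extractBytes` takes the MIME type as an explicit argument, you may want a small lookup when all you have is a file name. The sketch below uses standard registered MIME types; `mimeTypeForFile` is a hypothetical helper, and whether Kreuzberg expects exactly these strings for formats other than PDF is an assumption you should verify against the documentation.

```typescript
// Standard MIME types for a few common extensions.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  xlsx: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  pptx: "application/vnd.openxmlformats-officedocument.presentationml.presentation",
  html: "text/html",
  txt: "text/plain",
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
};

// Hypothetical helper: returns the MIME type for a file name, or undefined
// when the name has no extension or the extension is not in the table.
function mimeTypeForFile(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0 || dot === fileName.length - 1) return undefined;
  return MIME_BY_EXTENSION[fileName.slice(dot + 1).toLowerCase()];
}
```

A caller would typically fall back to an error or a default when the lookup returns `undefined` rather than guessing a type.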

Batch Processing

import { batchExtractFiles } from 'kreuzberg';

const files = [
  'document1.pdf',
  'document2.docx',
  'document3.xlsx'
];

const results = await batchExtractFiles(files);

for (const result of results) {
  console.log(`${result.mimeType}: ${result.content.length} characters`);
}

Batch Processing with Custom Concurrency

import { batchExtractFiles } from 'kreuzberg';

const config = {
  maxConcurrentExtractions: 4  // Process 4 files at a time
};

const files = Array.from({ length: 20 }, (_, i) => `file-${i}.pdf`);
const results = await batchExtractFiles(files, config);

console.log(`Processed ${results.length} files`);
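
To make "4 files at a time" concrete, the following sketch shows a generic concurrency limiter in plain TypeScript. It illustrates the general idea only; Kreuzberg schedules batch work natively, and `mapWithConcurrency` is a hypothetical helper, not part of the package.

```typescript
// Runs `worker` over `items`, keeping at most `limit` calls in flight.
// Results come back in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each lane repeatedly claims the next unprocessed item until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A lower limit trades throughput for lower peak memory and CPU use, which is the same trade-off `maxConcurrentExtractions` exposes.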

Extract with Metadata

import { extractFile } from 'kreuzberg';

const result = await extractFile('document.pdf');

if (result.metadata) {
  console.log('Title:', result.metadata.title);
  console.log('Author:', result.metadata.author);
  console.log('Creation Date:', result.metadata.creationDate);
  console.log('Page Count:', result.metadata.pageCount);
  console.log('Word Count:', result.metadata.wordCount);
}

Token Reduction for LLM Processing

import { extractFile, type TokenReductionConfig } from 'kreuzberg';

const config = {
  tokenReduction: {
    mode: 'aggressive',  // Options: 'light', 'moderate', 'aggressive'
    preserveImportantWords: true
  } as TokenReductionConfig
};

const result = await extractFile('long-document.pdf', config);

// Token reduction shortens the text while aiming to preserve meaning
console.log(`Content length: ${result.content.length}`);
console.log('Processed for LLM context window');

Error Handling

import {
  extractFile,
  KreuzbergError,
  ValidationError,
  ParsingError,
  OCRError,
  MissingDependencyError
} from 'kreuzberg';

try {
  const result = await extractFile('document.pdf');
  console.log(result.content);
} catch (error) {
  if (error instanceof ValidationError) {
    console.error('Invalid configuration or input:', error.message);
  } else if (error instanceof ParsingError) {
    console.error('Failed to parse document:', error.message);
  } else if (error instanceof OCRError) {
    console.error('OCR processing failed:', error.message);
  } else if (error instanceof MissingDependencyError) {
    console.error(`Missing dependency: ${error.dependency}`);
    console.error('Installation instructions:', error.message);
  } else if (error instanceof KreuzbergError) {
    console.error('Kreuzberg error:', error.message);
  } else {
    throw error;
  }
}
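
The order of the `instanceof` checks above matters: since every specific error extends the base class, the base-class check has to come last or it would swallow the specific cases. The sketch below shows this with stand-in classes; `DemoBaseError`, `DemoParsingError`, `DemoOcrError`, and `describeError` are hypothetical names used only for illustration and are not Kreuzberg's own classes.

```typescript
// Stand-in hierarchy mirroring "all errors extend one base class".
class DemoBaseError extends Error {
  constructor(message: string) {
    super(message);
    // Keeps `instanceof` working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}
class DemoParsingError extends DemoBaseError {}
class DemoOcrError extends DemoBaseError {}

// Most specific checks first, base class last, anything else is unknown.
function describeError(error: unknown): string {
  if (error instanceof DemoParsingError) return "parsing";
  if (error instanceof DemoOcrError) return "ocr";
  if (error instanceof DemoBaseError) return "other-library-error";
  return "unknown";
}
```

Moving the `DemoBaseError` check to the top would make every library error report as `"other-library-error"`, which is why the base class is handled after its subclasses.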

API Reference

Extraction Functions

extractFile(filePath: string, config?: ExtractionConfig): Promise<ExtractionResult>

Asynchronously extract content from a file.

extractFileSync(filePath: string, config?: ExtractionConfig): ExtractionResult

Synchronously extract content from a file.

extractBytes(data: Buffer, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>

Asynchronously extract content from a buffer.

extractBytesSync(data: Buffer, mimeType: string, config?: ExtractionConfig): ExtractionResult

Synchronously extract content from a buffer.

batchExtractFiles(paths: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>

Asynchronously extract content from multiple files in parallel.

batchExtractFilesSync(paths: string[], config?: ExtractionConfig): ExtractionResult[]

Synchronously extract content from multiple files.

Types

ExtractionResult

Main result object containing:

  • content: string - Extracted text content
  • mimeType: string - MIME type of the document
  • metadata?: Metadata - Document metadata
  • tables?: Table[] - Extracted tables
  • images?: ImageData[] - Extracted images
  • chunks?: Chunk[] - Text chunks (if chunking enabled)
  • language?: LanguageInfo - Detected language (if enabled)

ExtractionConfig

Configuration object for extraction:

  • useCache?: boolean - Enable result caching
  • enableQualityProcessing?: boolean - Enable text quality improvements
  • forceOcr?: boolean - Force OCR even for text-based PDFs
  • maxConcurrentExtractions?: number - Max parallel extractions
  • ocr?: OcrConfig - OCR settings
  • chunking?: ChunkingConfig - Text chunking settings
  • images?: ImageExtractionConfig - Image extraction settings
  • pdfOptions?: PdfConfig - PDF-specific options
  • tokenReduction?: TokenReductionConfig - Token reduction settings
  • languageDetection?: LanguageDetectionConfig - Language detection settings

OcrConfig

OCR configuration:

  • backend: string - OCR backend ('tesseract', 'easyocr', 'paddleocr')
  • language: string - Language code (e.g., 'eng', 'fra', 'deu')
  • preprocessing?: boolean - Enable image preprocessing
  • tesseractConfig?: TesseractConfig - Tesseract-specific options

Table

Extracted table structure:

  • markdown: string - Table in Markdown format
  • cells: TableCell[][] - 2D array of table cells
  • rowCount: number - Number of rows
  • columnCount: number - Number of columns

Exceptions

All Kreuzberg exceptions extend the base KreuzbergError class:

  • KreuzbergError - Base error class for all Kreuzberg errors
  • ValidationError - Invalid configuration, missing required fields, or invalid input
  • ParsingError - Document parsing failure or corrupted file
  • OCRError - OCR processing failure
  • MissingDependencyError - Missing optional system dependency (includes installation instructions)

Supported Formats

| Category | Formats |
|----------|---------|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML |
| Text | TXT, MD, CSV, TSV, JSON, YAML, TOML |
| Email | EML, MSG |
| Archives | ZIP, TAR, 7Z |
| Other | And 30+ more formats |

Performance

Kreuzberg is built with a native Rust core, providing significant performance improvements over pure JavaScript solutions:

  • 10-50x faster text extraction compared to pure Node.js libraries
  • Native multithreading for batch processing
  • Optimized memory usage with streaming for large files
  • Zero-copy operations where possible
  • Efficient caching to avoid redundant processing

Benchmarks

Processing 100 mixed documents (PDF, DOCX, XLSX):

| Library | Time | Memory |
|---------|------|--------|
| Kreuzberg | 2.3s | 145 MB |
| pdf-parse + mammoth | 23.1s | 890 MB |
| textract | 45.2s | 1.2 GB |
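
The timings in the benchmark table can be converted into speedup factors with a few lines of arithmetic. The numbers below are copied from the table; `speedup` and the variable names are illustrative only.

```typescript
// Wall-clock times in seconds, copied from the benchmark table.
const benchmarkSeconds = {
  kreuzberg: 2.3,
  pdfParseMammoth: 23.1,
  textract: 45.2,
};

// Speedup factor of the fast run over the slow run, rounded to one decimal.
function speedup(slowSeconds: number, fastSeconds: number): number {
  return Math.round((slowSeconds / fastSeconds) * 10) / 10;
}

const vsPdfParse = speedup(benchmarkSeconds.pdfParseMammoth, benchmarkSeconds.kreuzberg);
const vsTextract = speedup(benchmarkSeconds.textract, benchmarkSeconds.kreuzberg);
```

For this workload that works out to roughly 10x and 20x, which sits at the lower end of the 10-50x range quoted above.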

Troubleshooting

Native Module Not Found

If you encounter errors about missing native modules:

npm rebuild kreuzberg

OCR Not Working

Ensure Tesseract is installed and available in PATH:

tesseract --version

If Tesseract is not found:

  • macOS: brew install tesseract
  • Ubuntu: sudo apt-get install tesseract-ocr
  • Windows: Download from tesseract-ocr/tesseract

Memory Issues with Large PDFs

For very large PDFs, use chunking to reduce memory usage:

import { extractFile } from 'kreuzberg';

const config = {
  chunking: { maxChars: 1000 }
};
const result = await extractFile('large.pdf', config);

TypeScript Types Not Resolving

Make sure you're using:

  • Node.js 18 or higher
  • TypeScript 5.0 or higher

The package includes built-in type definitions.

Performance Optimization

For maximum performance when processing many files:

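import { batchExtractFiles } from 'kreuzberg';

const files = ['report.pdf', 'slides.pptx']; // placeholder paths

// Use batch processing instead of sequential
const results = await batchExtractFiles(files, {
  maxConcurrentExtractions: 8  // Tune based on CPU cores
});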
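
One way to tune the value based on CPU cores is a small heuristic like the one below. This is a sketch under assumptions: `pickConcurrency` is a hypothetical helper, and the default cap of 16 is an arbitrary illustrative choice rather than a recommendation from the library.

```typescript
// Heuristic: use one extraction per core, but never more than the number of
// files or the cap, and never fewer than one.
function pickConcurrency(cpuCores: number, fileCount: number, cap = 16): number {
  const cores = Math.max(1, Math.floor(cpuCores));
  return Math.max(1, Math.min(cores, fileCount, cap));
}
```

In Node.js the core count is available as `os.cpus().length`, so a call might look like `pickConcurrency(os.cpus().length, files.length)`, with the result passed as `maxConcurrentExtractions`.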

Examples

Extract Invoice Data

import { extractFile } from 'kreuzberg';

const result = await extractFile('invoice.pdf');

// Access tables for line items
if (result.tables && result.tables.length > 0) {
  const lineItems = result.tables[0];
  console.log(lineItems.markdown);
}

// Access metadata for invoice details
if (result.metadata) {
  console.log('Invoice Date:', result.metadata.creationDate);
}

Process Scanned Documents

import { extractFile } from 'kreuzberg';

const config = {
  forceOcr: true,
  ocr: {
    backend: 'tesseract',
    language: 'eng',
    preprocessing: true
  }
};

const result = await extractFile('scanned-contract.pdf', config);
console.log(result.content);

Build a Document Search Index

import { batchExtractFiles } from 'kreuzberg';
import { glob } from 'glob';

// Find all documents
const files = await glob('documents/**/*.{pdf,docx,xlsx}');

// Extract in batches
const results = await batchExtractFiles(files, {
  maxConcurrentExtractions: 8,
  enableQualityProcessing: true
});

// Build search index
const searchIndex = results.map((result, i) => ({
  path: files[i],
  content: result.content,
  metadata: result.metadata
}));

console.log(`Indexed ${searchIndex.length} documents`);
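
Once the index exists, querying it can be as simple as a case-insensitive substring match. The sketch below is a deliberately naive illustration; `searchDocuments` and `IndexedDocument` are hypothetical names, and a real search index would usually add tokenization and ranking.

```typescript
interface IndexedDocument {
  path: string;
  content: string;
}

// Returns the paths of documents whose content contains every query term
// (case-insensitive). An empty query matches nothing.
function searchDocuments(index: readonly IndexedDocument[], query: string): string[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];
  return index
    .filter((doc) => {
      const haystack = doc.content.toLowerCase();
      return terms.every((term) => haystack.includes(term));
    })
    .map((doc) => doc.path);
}
```

Entries shaped like those in `searchIndex` above satisfy `IndexedDocument`, so `searchDocuments(searchIndex, 'invoice total')` would return the matching paths.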

Documentation

For comprehensive documentation, visit https://kreuzberg.dev

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

MIT
