kreuzberg
v4.0.0-rc.1
Published
Kreuzberg document intelligence - Node.js native bindings
Downloads
27
Maintainers
Readme
Kreuzberg for Node.js
High-performance document intelligence for Node.js and TypeScript, powered by Rust.
Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
🚀 Version 4.0.0 Release Candidate This is a pre-release version. We invite you to test the library and report any issues you encounter.
Features
- 50+ File Formats: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
- OCR Support: Built-in Tesseract, EasyOCR, and PaddleOCR backends for scanned documents
- Table Extraction: Advanced table detection and structured data extraction
- High Performance: Native Rust core provides 10-50x performance improvements over pure JavaScript
- Type-Safe: Full TypeScript definitions for all methods, configurations, and return types
- Async/Sync APIs: Both asynchronous and synchronous extraction methods
- Batch Processing: Process multiple documents in parallel with optimized concurrency
- Language Detection: Automatic language detection for extracted text
- Text Chunking: Split long documents into manageable chunks for LLM processing
- Caching: Built-in result caching for faster repeated extractions
- Zero Configuration: Works out of the box with sensible defaults
Requirements
- Node.js 18 or higher
- Native bindings are prebuilt for:
- macOS (x64, arm64)
- Linux (x64, arm64, armv7)
- Windows (x64, arm64)
Optional System Dependencies
Tesseract: For OCR functionality
- macOS:
brew install tesseract - Ubuntu:
sudo apt-get install tesseract-ocr - Windows: Download from GitHub
- macOS:
LibreOffice: For legacy MS Office formats (.doc, .ppt)
- macOS:
brew install libreoffice - Ubuntu:
sudo apt-get install libreoffice
- macOS:
Pandoc: For advanced document conversion
- macOS:
brew install pandoc - Ubuntu:
sudo apt-get install pandoc
- macOS:
Installation
npm install kreuzbergOr with pnpm:
pnpm add kreuzbergOr with yarn:
yarn add kreuzbergThe package includes prebuilt native binaries for major platforms. No additional build steps required.
Quick Start
Basic Extraction
import { extractFileSync } from 'kreuzberg';
// Synchronous extraction
const result = extractFileSync('document.pdf');
console.log(result.content);
console.log(result.metadata);Async Extraction (Recommended)
import { extractFile } from 'kreuzberg';
// Asynchronous extraction
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.tables);With Full Type Safety
import {
extractFile,
type ExtractionConfig,
type ExtractionResult
} from 'kreuzberg';
const config: ExtractionConfig = {
useCache: true,
enableQualityProcessing: true
};
const result: ExtractionResult = await extractFile('invoice.pdf', config);
// Type-safe access to all properties
console.log(result.content);
console.log(result.mimeType);
console.log(result.metadata);
if (result.tables) {
for (const table of result.tables) {
console.log(table.markdown);
}
}Configuration
OCR Configuration
import { extractFile, type ExtractionConfig, type OcrConfig } from 'kreuzberg';
const config: ExtractionConfig = {
ocr: {
backend: 'tesseract',
language: 'eng',
tesseractConfig: {
enableTableDetection: true,
psm: 6,
minConfidence: 50.0
}
} as OcrConfig
};
const result = await extractFile('scanned.pdf', config);
console.log(result.content);PDF Password Protection
import { extractFile, type PdfConfig } from 'kreuzberg';
const config = {
pdfOptions: {
passwords: ['password1', 'password2'],
extractImages: true,
extractMetadata: true
} as PdfConfig
};
const result = await extractFile('protected.pdf', config);Extract Tables
import { extractFile } from 'kreuzberg';
const result = await extractFile('financial-report.pdf');
if (result.tables) {
for (const table of result.tables) {
console.log('Table as Markdown:');
console.log(table.markdown);
console.log('Table cells:');
console.log(JSON.stringify(table.cells, null, 2));
}
}Text Chunking
import { extractFile, type ChunkingConfig } from 'kreuzberg';
const config = {
chunking: {
maxChars: 1000,
maxOverlap: 200
} as ChunkingConfig
};
const result = await extractFile('long-document.pdf', config);
if (result.chunks) {
for (const chunk of result.chunks) {
console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
}
}Language Detection
import { extractFile, type LanguageDetectionConfig } from 'kreuzberg';
const config = {
languageDetection: {
enabled: true,
minConfidence: 0.8,
detectMultiple: false
} as LanguageDetectionConfig
};
const result = await extractFile('multilingual.pdf', config);
if (result.language) {
console.log(`Detected language: ${result.language.code}`);
console.log(`Confidence: ${result.language.confidence}`);
}Image Extraction
import { extractFile, type ImageExtractionConfig } from 'kreuzberg';
import { writeFile } from 'fs/promises';
const config = {
images: {
extractImages: true,
targetDpi: 300,
maxImageDimension: 4096,
autoAdjustDpi: true
} as ImageExtractionConfig
};
const result = await extractFile('document-with-images.pdf', config);
if (result.images) {
for (let i = 0; i < result.images.length; i++) {
const image = result.images[i];
await writeFile(`image-${i}.${image.format}`, Buffer.from(image.data));
}
}Complete Configuration Example
import {
extractFile,
type ExtractionConfig,
type OcrConfig,
type ChunkingConfig,
type ImageExtractionConfig,
type PdfConfig,
type TokenReductionConfig,
type LanguageDetectionConfig
} from 'kreuzberg';
const config: ExtractionConfig = {
useCache: true,
enableQualityProcessing: true,
forceOcr: false,
maxConcurrentExtractions: 8,
ocr: {
backend: 'tesseract',
language: 'eng',
preprocessing: true,
tesseractConfig: {
enableTableDetection: true,
psm: 6,
oem: 3,
minConfidence: 50.0
}
} as OcrConfig,
chunking: {
maxChars: 1000,
maxOverlap: 200
} as ChunkingConfig,
images: {
extractImages: true,
targetDpi: 300,
maxImageDimension: 4096,
autoAdjustDpi: true
} as ImageExtractionConfig,
pdfOptions: {
extractImages: true,
passwords: [],
extractMetadata: true
} as PdfConfig,
tokenReduction: {
mode: 'moderate',
preserveImportantWords: true
} as TokenReductionConfig,
languageDetection: {
enabled: true,
minConfidence: 0.8,
detectMultiple: false
} as LanguageDetectionConfig
};
const result = await extractFile('document.pdf', config);Advanced Usage
Extract from Buffer
import { extractBytes } from 'kreuzberg';
import { readFile } from 'fs/promises';
const buffer = await readFile('document.pdf');
const result = await extractBytes(buffer, 'application/pdf');
console.log(result.content);Batch Processing
import { batchExtractFiles } from 'kreuzberg';
const files = [
'document1.pdf',
'document2.docx',
'document3.xlsx'
];
const results = await batchExtractFiles(files);
for (const result of results) {
console.log(`${result.mimeType}: ${result.content.length} characters`);
}Batch Processing with Custom Concurrency
import { batchExtractFiles } from 'kreuzberg';
const config = {
maxConcurrentExtractions: 4 // Process 4 files at a time
};
const files = Array.from({ length: 20 }, (_, i) => `file-${i}.pdf`);
const results = await batchExtractFiles(files, config);
console.log(`Processed ${results.length} files`);Extract with Metadata
import { extractFile } from 'kreuzberg';
const result = await extractFile('document.pdf');
if (result.metadata) {
console.log('Title:', result.metadata.title);
console.log('Author:', result.metadata.author);
console.log('Creation Date:', result.metadata.creationDate);
console.log('Page Count:', result.metadata.pageCount);
console.log('Word Count:', result.metadata.wordCount);
}Token Reduction for LLM Processing
import { extractFile, type TokenReductionConfig } from 'kreuzberg';
const config = {
tokenReduction: {
mode: 'aggressive', // Options: 'light', 'moderate', 'aggressive'
preserveImportantWords: true
} as TokenReductionConfig
};
const result = await extractFile('long-document.pdf', config);
// Reduced token count while preserving meaning
console.log(`Original length: ${result.content.length}`);
console.log(`Processed for LLM context window`);Error Handling
import {
extractFile,
KreuzbergError,
ValidationError,
ParsingError,
OCRError,
MissingDependencyError
} from 'kreuzberg';
try {
const result = await extractFile('document.pdf');
console.log(result.content);
} catch (error) {
if (error instanceof ValidationError) {
console.error('Invalid configuration or input:', error.message);
} else if (error instanceof ParsingError) {
console.error('Failed to parse document:', error.message);
} else if (error instanceof OCRError) {
console.error('OCR processing failed:', error.message);
} else if (error instanceof MissingDependencyError) {
console.error(`Missing dependency: ${error.dependency}`);
console.error('Installation instructions:', error.message);
} else if (error instanceof KreuzbergError) {
console.error('Kreuzberg error:', error.message);
} else {
throw error;
}
}API Reference
Extraction Functions
extractFile(filePath: string, config?: ExtractionConfig): Promise<ExtractionResult>
Asynchronously extract content from a file.
extractFileSync(filePath: string, config?: ExtractionConfig): ExtractionResult
Synchronously extract content from a file.
extractBytes(data: Buffer, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>
Asynchronously extract content from a buffer.
extractBytesSync(data: Buffer, mimeType: string, config?: ExtractionConfig): ExtractionResult
Synchronously extract content from a buffer.
batchExtractFiles(paths: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>
Asynchronously extract content from multiple files in parallel.
batchExtractFilesSync(paths: string[], config?: ExtractionConfig): ExtractionResult[]
Synchronously extract content from multiple files.
Types
ExtractionResult
Main result object containing:
content: string- Extracted text contentmimeType: string- MIME type of the documentmetadata?: Metadata- Document metadatatables?: Table[]- Extracted tablesimages?: ImageData[]- Extracted imageschunks?: Chunk[]- Text chunks (if chunking enabled)language?: LanguageInfo- Detected language (if enabled)
ExtractionConfig
Configuration object for extraction:
useCache?: boolean- Enable result cachingenableQualityProcessing?: boolean- Enable text quality improvementsforceOcr?: boolean- Force OCR even for text-based PDFsmaxConcurrentExtractions?: number- Max parallel extractionsocr?: OcrConfig- OCR settingschunking?: ChunkingConfig- Text chunking settingsimages?: ImageExtractionConfig- Image extraction settingspdfOptions?: PdfConfig- PDF-specific optionstokenReduction?: TokenReductionConfig- Token reduction settingslanguageDetection?: LanguageDetectionConfig- Language detection settings
OcrConfig
OCR configuration:
backend: string- OCR backend ('tesseract', 'easyocr', 'paddleocr')language: string- Language code (e.g., 'eng', 'fra', 'deu')preprocessing?: boolean- Enable image preprocessingtesseractConfig?: TesseractConfig- Tesseract-specific options
Table
Extracted table structure:
markdown: string- Table in Markdown formatcells: TableCell[][]- 2D array of table cellsrowCount: number- Number of rowscolumnCount: number- Number of columns
Exceptions
All Kreuzberg exceptions extend the base KreuzbergError class:
KreuzbergError- Base error class for all Kreuzberg errorsValidationError- Invalid configuration, missing required fields, or invalid inputParsingError- Document parsing failure or corrupted fileOCRError- OCR processing failureMissingDependencyError- Missing optional system dependency (includes installation instructions)
Supported Formats
| Category | Formats | |----------|---------| | Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF | | Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF | | Web | HTML, XHTML, XML | | Text | TXT, MD, CSV, TSV, JSON, YAML, TOML | | Email | EML, MSG | | Archives | ZIP, TAR, 7Z | | Other | And 30+ more formats |
Performance
Kreuzberg is built with a native Rust core, providing significant performance improvements over pure JavaScript solutions:
- 10-50x faster text extraction compared to pure Node.js libraries
- Native multithreading for batch processing
- Optimized memory usage with streaming for large files
- Zero-copy operations where possible
- Efficient caching to avoid redundant processing
Benchmarks
Processing 100 mixed documents (PDF, DOCX, XLSX):
| Library | Time | Memory | |---------|------|--------| | Kreuzberg | 2.3s | 145 MB | | pdf-parse + mammoth | 23.1s | 890 MB | | textract | 45.2s | 1.2 GB |
Troubleshooting
Native Module Not Found
If you encounter errors about missing native modules:
npm rebuild kreuzbergOCR Not Working
Ensure Tesseract is installed and available in PATH:
tesseract --versionIf Tesseract is not found:
- macOS:
brew install tesseract - Ubuntu:
sudo apt-get install tesseract-ocr - Windows: Download from tesseract-ocr/tesseract
Memory Issues with Large PDFs
For very large PDFs, use chunking to reduce memory usage:
const config = {
chunking: { maxChars: 1000 }
};
const result = await extractFile('large.pdf', config);TypeScript Types Not Resolving
Make sure you're using:
- Node.js 18 or higher
- TypeScript 5.0 or higher
The package includes built-in type definitions.
Performance Optimization
For maximum performance when processing many files:
// Use batch processing instead of sequential
const results = await batchExtractFiles(files, {
maxConcurrentExtractions: 8 // Tune based on CPU cores
});Examples
Extract Invoice Data
import { extractFile } from 'kreuzberg';
const result = await extractFile('invoice.pdf');
// Access tables for line items
if (result.tables && result.tables.length > 0) {
const lineItems = result.tables[0];
console.log(lineItems.markdown);
}
// Access metadata for invoice details
if (result.metadata) {
console.log('Invoice Date:', result.metadata.creationDate);
}Process Scanned Documents
import { extractFile } from 'kreuzberg';
const config = {
forceOcr: true,
ocr: {
backend: 'tesseract',
language: 'eng',
preprocessing: true
}
};
const result = await extractFile('scanned-contract.pdf', config);
console.log(result.content);Build a Document Search Index
import { batchExtractFiles } from 'kreuzberg';
import { glob } from 'glob';
// Find all documents
const files = await glob('documents/**/*.{pdf,docx,xlsx}');
// Extract in batches
const results = await batchExtractFiles(files, {
maxConcurrentExtractions: 8,
enableQualityProcessing: true
});
// Build search index
const searchIndex = results.map((result, i) => ({
path: files[i],
content: result.content,
metadata: result.metadata
}));
console.log(`Indexed ${searchIndex.length} documents`);Documentation
For comprehensive documentation, visit https://kreuzberg.dev
Contributing
We welcome contributions! Please see our Contributing Guide for details.
License
MIT
