@kreuzberg/wasm
v4.0.6
Published
Kreuzberg document intelligence - WebAssembly bindings
Downloads
2,765
Maintainers
Readme
WebAssembly
Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.
Installation
Package Installation
Install via one of the supported package managers:
npm:
npm install @kreuzberg/wasmpnpm:
pnpm add @kreuzberg/wasmyarn:
yarn add @kreuzberg/wasmSystem Requirements
- Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
- Optional: Tesseract WASM for OCR functionality
Quick Start
Basic Extraction
Extract text, metadata, and structure from any supported document format:
import { extractBytes, initWasm } from "@kreuzberg/wasm";
async function main() {
await initWasm();
const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
const bytes = new Uint8Array(buffer);
const result = await extractBytes(bytes, "application/pdf");
console.log("Extracted content:");
console.log(result.content);
console.log("MIME type:", result.mimeType);
console.log("Metadata:", result.metadata);
}
main().catch(console.error);Common Use Cases
Extract with Custom Configuration
Most use cases benefit from configuration to control extraction behavior:
With OCR (for scanned documents):
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
async function extractWithOcr() {
await initWasm();
try {
await enableOcr();
console.log("OCR enabled successfully");
} catch (error) {
console.error("Failed to enable OCR:", error);
return;
}
const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
const result = await extractBytes(bytes, "image/png", {
ocr: {
backend: "tesseract-wasm",
language: "eng",
},
});
console.log("Extracted text:");
console.log(result.content);
}
extractWithOcr().catch(console.error);Table Extraction
See Table Extraction Guide for detailed examples.
Processing Multiple Files
import { extractBytes, initWasm } from "@kreuzberg/wasm";
interface DocumentJob {
name: string;
bytes: Uint8Array;
mimeType: string;
}
async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
await initWasm();
const results: Record<string, string> = {};
const queue = [...documents];
const workers = Array(concurrency)
.fill(null)
.map(async () => {
while (queue.length > 0) {
const doc = queue.shift();
if (!doc) break;
try {
const result = await extractBytes(doc.bytes, doc.mimeType);
results[doc.name] = result.content;
} catch (error) {
console.error(`Failed to process ${doc.name}:`, error);
}
}
});
await Promise.all(workers);
return results;
}Async Processing
For non-blocking document processing:
import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
const caps = getWasmCapabilities();
if (!caps.hasWasm) {
throw new Error("WebAssembly not supported");
}
await initWasm();
const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
return results.map((r) => ({
content: r.content,
pageCount: r.metadata?.pageCount,
}));
}
const fileBytes = [new Uint8Array([1, 2, 3])];
const mimes = ["application/pdf"];
extractDocuments(fileBytes, mimes)
.then((results) => console.log(results))
.catch(console.error);Next Steps
- Installation Guide - Platform-specific setup
- API Documentation - Complete API reference
- Examples & Guides - Full code examples and usage guides
- Configuration Guide - Advanced configuration options
Features
Supported File Formats (56+)
56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
Office Documents
| Category | Formats | Capabilities |
|----------|---------|--------------|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
Images (OCR-Enabled)
| Category | Formats | Features |
|----------|---------|----------|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .pnm, .pbm, .pgm, .ppm | OCR, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |
Web & Data
| Category | Formats | Features |
|----------|---------|----------|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .rst, .org, .rtf | CommonMark, GFM, reStructuredText, Org Mode |
Email & Archives
| Category | Formats | Features |
|----------|---------|----------|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |
Academic & Scientific
| Category | Formats | Features |
|----------|---------|----------|
| Citations | .bib, .biblatex, .ris, .enw, .csl | Bibliography parsing, citation extraction |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |
Key Capabilities
Text Extraction - Extract all text content with position and formatting information
Metadata Extraction - Retrieve document properties, creation date, author, etc.
Table Extraction - Parse tables with structure and cell content preservation
Image Extraction - Extract embedded images and render page previews
OCR Support - Integrate multiple OCR backends for scanned documents
Async/Await - Non-blocking document processing with concurrent operations
Plugin System - Extensible post-processing for custom text transformation
Batch Processing - Efficiently process multiple documents in parallel
Memory Efficient - Stream large files without loading entirely into memory
Language Detection - Detect and support multiple languages in documents
Configuration - Fine-grained control over extraction behavior
Performance Characteristics
| Format | Speed | Memory | Notes | |--------|-------|--------|-------| | PDF (text) | 10-100 MB/s | ~50MB per doc | Fastest extraction | | Office docs | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX | | Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend | | Archives | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. | | Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
OCR Support
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
- Tesseract-Wasm
OCR Configuration Example
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
async function extractWithOcr() {
await initWasm();
try {
await enableOcr();
console.log("OCR enabled successfully");
} catch (error) {
console.error("Failed to enable OCR:", error);
return;
}
const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
const result = await extractBytes(bytes, "image/png", {
ocr: {
backend: "tesseract-wasm",
language: "eng",
},
});
console.log("Extracted text:");
console.log(result.content);
}
extractWithOcr().catch(console.error);Async Support
This binding provides full async/await support for non-blocking document processing:
import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
const caps = getWasmCapabilities();
if (!caps.hasWasm) {
throw new Error("WebAssembly not supported");
}
await initWasm();
const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
return results.map((r) => ({
content: r.content,
pageCount: r.metadata?.pageCount,
}));
}
const fileBytes = [new Uint8Array([1, 2, 3])];
const mimes = ["application/pdf"];
extractDocuments(fileBytes, mimes)
.then((results) => console.log(results))
.catch(console.error);Plugin System
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
For detailed plugin documentation, visit Plugin System Guide.
Batch Processing
Process multiple documents efficiently:
import { extractBytes, initWasm } from "@kreuzberg/wasm";
interface DocumentJob {
name: string;
bytes: Uint8Array;
mimeType: string;
}
async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
await initWasm();
const results: Record<string, string> = {};
const queue = [...documents];
const workers = Array(concurrency)
.fill(null)
.map(async () => {
while (queue.length > 0) {
const doc = queue.shift();
if (!doc) break;
try {
const result = await extractBytes(doc.bytes, doc.mimeType);
results[doc.name] = result.content;
} catch (error) {
console.error(`Failed to process ${doc.name}:`, error);
}
}
});
await Promise.all(workers);
return results;
}Configuration
For advanced configuration options including language detection, table extraction, OCR settings, and more:
Documentation
Contributing
Contributions are welcome! See Contributing Guide.
License
MIT License - see LICENSE file for details.
Support
- Discord Community: Join our Discord
- GitHub Issues: Report bugs
- Discussions: Ask questions
