@kreuzberg/wasm

v4.0.6

Kreuzberg document intelligence - WebAssembly bindings

Downloads

2,765

WebAssembly

Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.

Installation

Package Installation

Install via one of the supported package managers:

npm:

npm install @kreuzberg/wasm

pnpm:

pnpm add @kreuzberg/wasm

yarn:

yarn add @kreuzberg/wasm

System Requirements

  • Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
  • Optional: Tesseract WASM for OCR functionality
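
The first requirement can be checked at runtime before loading the package. The sketch below uses only the standard `WebAssembly` global; `supportsWebAssembly` is a hypothetical helper name and not part of `@kreuzberg/wasm` (the package's own `getWasmCapabilities()` appears in a later example).

```typescript
// Feature-detect WebAssembly with the standard global object only.
// `supportsWebAssembly` is a hypothetical helper, not a Kreuzberg API.
function supportsWebAssembly(): boolean {
	try {
		if (typeof WebAssembly !== "object" || typeof WebAssembly.validate !== "function") {
			return false;
		}
		// Smallest valid module: the "\0asm" magic number followed by version 1.
		const emptyModule = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
		return WebAssembly.validate(emptyModule);
	} catch {
		// Some locked-down environments throw instead of returning false.
		return false;
	}
}

console.log("WebAssembly supported:", supportsWebAssembly());
```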

Quick Start

Basic Extraction

Extract text, metadata, and structure from any supported document format:

import { extractBytes, initWasm } from "@kreuzberg/wasm";

async function main() {
	await initWasm();

	const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
	const bytes = new Uint8Array(buffer);

	const result = await extractBytes(bytes, "application/pdf");

	console.log("Extracted content:");
	console.log(result.content);
	console.log("MIME type:", result.mimeType);
	console.log("Metadata:", result.metadata);
}

main().catch(console.error);
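
`extractBytes` is called with a MIME type as its second argument. When only a file name is available, a small lookup table can supply it. The sketch below is an illustration under assumptions: `guessMimeType` and the table are hypothetical, the media types are the standard IANA names, and whether the library accepts each of these strings (beyond `application/pdf` and `image/png`, which the examples here use) is not confirmed by this README.

```typescript
// Hypothetical lookup table: a few extensions mapped to their standard
// IANA media types. Extend it for the formats you actually handle.
const MIME_BY_EXTENSION = new Map<string, string>([
	["pdf", "application/pdf"],
	["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
	["xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
	["pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation"],
	["png", "image/png"],
	["jpg", "image/jpeg"],
	["jpeg", "image/jpeg"],
	["html", "text/html"],
	["json", "application/json"],
	["csv", "text/csv"],
	["txt", "text/plain"],
	["md", "text/markdown"],
]);

// Returns the media type for a file name, or undefined when the name has
// no extension or the extension is not in the table.
function guessMimeType(fileName: string): string | undefined {
	const dot = fileName.lastIndexOf(".");
	if (dot < 0 || dot === fileName.length - 1) return undefined;
	const extension = fileName.slice(dot + 1).toLowerCase();
	return MIME_BY_EXTENSION.get(extension);
}

console.log(guessMimeType("report.PDF")); // application/pdf
console.log(guessMimeType("README")); // undefined
```

Returning `undefined` rather than a default keeps the decision about unknown files with the caller, who can skip the file or fall back to another detection method.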

Common Use Cases

Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";

async function extractWithOcr() {
	await initWasm();

	try {
		await enableOcr();
		console.log("OCR enabled successfully");
	} catch (error) {
		console.error("Failed to enable OCR:", error);
		return;
	}

	const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));

	const result = await extractBytes(bytes, "image/png", {
		ocr: {
			backend: "tesseract-wasm",
			language: "eng",
		},
	});

	console.log("Extracted text:");
	console.log(result.content);
}

extractWithOcr().catch(console.error);

Table Extraction

See Table Extraction Guide for detailed examples.

Processing Multiple Files

import { extractBytes, initWasm } from "@kreuzberg/wasm";

interface DocumentJob {
	name: string;
	bytes: Uint8Array;
	mimeType: string;
}

async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
	await initWasm();

	const results: Record<string, string> = {};
	// Copy the input so shift() does not mutate the caller's array.
	const queue = [...documents];

	// Start `concurrency` workers that pull from the shared queue until it is empty.
	const workers = Array(concurrency)
		.fill(null)
		.map(async () => {
			while (queue.length > 0) {
				const doc = queue.shift();
				if (!doc) break;

				try {
					const result = await extractBytes(doc.bytes, doc.mimeType);
					results[doc.name] = result.content;
				} catch (error) {
					// A failed document is logged and skipped; the rest of the batch continues.
					console.error(`Failed to process ${doc.name}:`, error);
				}
			}
		});

	await Promise.all(workers);
	return results;
}
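
The worker-pool pattern above can be factored into a reusable helper that does not depend on the library at all. The following is a minimal sketch, not part of `@kreuzberg/wasm`: `mapWithConcurrency` and `Settled` are hypothetical names. Unlike the example above, it keeps results in input order and returns failures instead of logging them.

```typescript
// Outcome of one task: either a value or the error it threw.
type Settled<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Runs `task` over `items` with at most `limit` tasks in flight at once.
async function mapWithConcurrency<T, R>(
	items: readonly T[],
	limit: number,
	task: (item: T, index: number) => Promise<R>,
): Promise<Settled<R>[]> {
	const results: Settled<R>[] = new Array(items.length);
	let next = 0; // index of the next unclaimed item, shared by all workers

	async function worker(): Promise<void> {
		while (next < items.length) {
			// Claimed synchronously, so no two workers take the same index.
			const index = next++;
			try {
				results[index] = { ok: true, value: await task(items[index], index) };
			} catch (error) {
				results[index] = { ok: false, error };
			}
		}
	}

	// Clamp the limit to at least 1 and at most the number of items.
	const workerCount = Math.min(items.length, Math.max(1, Math.floor(limit) || 1));
	await Promise.all(Array.from({ length: workerCount }, () => worker()));
	return results;
}
```

With the binding initialized, a call might look like `mapWithConcurrency(documents, 3, (doc) => extractBytes(doc.bytes, doc.mimeType))`, assuming the `DocumentJob` shape from the example above.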

Async Processing

For non-blocking document processing:

import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";

async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
	const caps = getWasmCapabilities();
	if (!caps.hasWasm) {
		throw new Error("WebAssembly not supported");
	}

	await initWasm();

	const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));

	return results.map((r) => ({
		content: r.content,
		pageCount: r.metadata?.pageCount,
	}));
}

// Placeholder bytes for illustration only; pass the contents of a real PDF here.
const fileBytes = [new Uint8Array([1, 2, 3])];
const mimes = ["application/pdf"];

extractDocuments(fileBytes, mimes)
	.then((results) => console.log(results))
	.catch(console.error);
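
`Promise.all`, as used above, rejects as soon as any single extraction fails, so one bad file discards every other result. Where partial results are acceptable, the standard `Promise.allSettled` is an alternative. This is a sketch under assumptions: `Extractor`, `Outcome`, and `extractAllSettled` are hypothetical names, and `Extractor` is a local stand-in that only mirrors the call shape used in the examples here so the code runs without loading the WASM module.

```typescript
// Stand-in for the call shape used above: bytes plus a MIME type in,
// an object with `content` out. Not a type exported by the library.
type Extractor = (bytes: Uint8Array, mimeType: string) => Promise<{ content: string }>;

interface Outcome {
	index: number;
	content?: string;
	error?: string;
}

async function extractAllSettled(
	extract: Extractor,
	files: Uint8Array[],
	mimeTypes: string[],
): Promise<Outcome[]> {
	// The two arrays are matched by position, so their lengths must agree.
	if (files.length !== mimeTypes.length) {
		throw new RangeError("files and mimeTypes must have the same length");
	}

	const settled = await Promise.allSettled(
		files.map((bytes, i) => extract(bytes, mimeTypes[i])),
	);

	// Keep one entry per input, recording either the content or the error message.
	return settled.map((s, index) =>
		s.status === "fulfilled"
			? { index, content: s.value.content }
			: { index, error: s.reason instanceof Error ? s.reason.message : String(s.reason) },
	);
}
```

Passing the extractor as a parameter also makes the function easy to test with a fake; with the real binding, `extractBytes` would be passed after `initWasm()` has resolved, assuming its result includes `content` as the examples show.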

Next Steps

Features

Supported File Formats (56+)

56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |

Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .pnm, .pbm, .pgm, .ppm | OCR, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .rst, .org, .rtf | CommonMark, GFM, reStructuredText, Org Mode |

Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |

Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| Citations | .bib, .biblatex, .ris, .enw, .csl | Bibliography parsing, citation extraction |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |

Complete Format Reference

Key Capabilities

  • Text Extraction - Extract all text content with position and formatting information

  • Metadata Extraction - Retrieve document properties, creation date, author, etc.

  • Table Extraction - Parse tables with structure and cell content preservation

  • Image Extraction - Extract embedded images and render page previews

  • OCR Support - Integrate multiple OCR backends for scanned documents

  • Async/Await - Non-blocking document processing with concurrent operations

  • Plugin System - Extensible post-processing for custom text transformation

  • Batch Processing - Efficiently process multiple documents in parallel

  • Memory Efficient - Stream large files without loading entirely into memory

  • Language Detection - Detect and support multiple languages in documents

  • Configuration - Fine-grained control over extraction behavior

Performance Characteristics

| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| PDF (text) | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
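
The speed ranges above translate into rough time estimates by dividing file size by throughput. The sketch below is plain arithmetic on the table's figures, which presumably vary with hardware and document content; `estimateSeconds` and the `THROUGHPUT` keys are hypothetical names introduced for illustration.

```typescript
interface Throughput {
	minMBps: number;
	maxMBps: number;
}

// Speed ranges copied from the table above (MB/s).
const THROUGHPUT = {
	"pdf-text": { minMBps: 10, maxMBps: 100 },
	"office": { minMBps: 20, maxMBps: 200 },
	"image-ocr": { minMBps: 1, maxMBps: 5 },
	"archive": { minMBps: 5, maxMBps: 50 },
	"web": { minMBps: 50, maxMBps: 200 },
};

// Best case uses the highest throughput, worst case the lowest.
function estimateSeconds(sizeMB: number, t: Throughput): { best: number; worst: number } {
	return { best: sizeMB / t.maxMBps, worst: sizeMB / t.minMBps };
}

const ocr = estimateSeconds(20, THROUGHPUT["image-ocr"]);
console.log(`20 MB of scanned images: ${ocr.best}-${ocr.worst} s`); // 4-20 s

const pdf = estimateSeconds(50, THROUGHPUT["pdf-text"]);
console.log(`50 MB text PDF: ${pdf.best}-${pdf.worst} s`); // 0.5-5 s
```

The gap between the two estimates is a factor of five or ten for every row, so the numbers are best used for order-of-magnitude planning, for example when choosing a batch concurrency or a request timeout.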

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

  • Tesseract-Wasm

Async Support

This binding provides full async/await support for non-blocking document processing. The Async Processing example under Common Use Cases shows the pattern.

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit Plugin System Guide.
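
This README does not show the plugin registration API, so the sketch below does not use it. It only illustrates the idea of post-processing as an ordered chain of plain functions applied to extracted text, such as `result.content` from the examples above. `PostProcessor`, `applyPostProcessors`, and both sample processors are hypothetical names, not Kreuzberg APIs.

```typescript
// A post-processor takes extracted text and returns a transformed copy.
type PostProcessor = (text: string) => string;

// Drops lines that consist only of a page marker such as "Page 3".
const stripPageNumbers: PostProcessor = (text) =>
	text
		.split("\n")
		.filter((line) => !/^\s*Page \d+\s*$/i.test(line))
		.join("\n");

// Collapses runs of spaces or tabs, limits blank lines to one, and trims the ends.
const collapseWhitespace: PostProcessor = (text) =>
	text
		.replace(/[ \t]+/g, " ")
		.replace(/\n{3,}/g, "\n\n")
		.trim();

// Applies the processors left to right, feeding each output into the next.
function applyPostProcessors(text: string, processors: PostProcessor[]): string {
	return processors.reduce((current, processor) => processor(current), text);
}

const cleaned = applyPostProcessors("Hello   world\nPage 3\n\n\n\nBye  ", [
	stripPageNumbers,
	collapseWhitespace,
]);
console.log(JSON.stringify(cleaned)); // "Hello world\n\nBye"
```

Order matters here: stripping page markers first leaves extra blank lines behind, which the whitespace pass then collapses.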

Batch Processing

Process multiple documents efficiently by running a fixed number of extractions at a time. The Processing Multiple Files example under Common Use Cases shows a worker-pool implementation.

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Documentation

Contributing

Contributions are welcome! See Contributing Guide.

License

MIT License - see LICENSE file for details.

Support