pdf-oxide

v0.3.63

High-performance PDF parsing and text extraction library — prebuilt native bindings, no build toolchain required

yfedoseev

pdf text-extraction pdf-parsing rust-ffi native-binding prebuilt

PDF Oxide for Node.js — The Fastest PDF Toolkit for JavaScript & TypeScript

The fastest Node.js PDF library for text extraction, image extraction, and markdown conversion. Powered by a pure-Rust core, exposed to Node.js through a native N-API addon. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. MIT / Apache-2.0 licensed.

Part of the PDF Oxide toolkit. Same Rust core, same speed, same 100% pass rate as the Rust, Python, Go, C# / .NET, and WASM bindings.
Need to run in browsers, Deno, Bun, or Cloudflare Workers? Use the WASM build instead — same API, no native binaries.

Quick Start

npm install pdf-oxide

import { PdfDocument } from "pdf-oxide";

const doc = PdfDocument.open("paper.pdf");
const text = doc.extractText(0);
const markdown = doc.toMarkdown(0);
doc.close();

pdf-oxide is an ES module. Use import (shown above). From CommonJS, load it with a dynamic import: const { PdfDocument } = await import("pdf-oxide");. Open a file with the PdfDocument.open(path) factory — the constructor is internal and does not take a path.
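
The CommonJS route described above might look like the sketch below. The helper name `firstPageText` and its `load` parameter are hypothetical, added only so the import can be swapped out in tests; `PdfDocument.open`, `extractText`, and `close` are the calls shown in the Quick Start.

```javascript
// extract.cjs: a CommonJS file cannot use a static `import`,
// so the ES module is loaded with a dynamic import() instead.
// `load` is a hypothetical injection point, not part of pdf-oxide.
async function firstPageText(path, load = () => import("pdf-oxide")) {
  const { PdfDocument } = await load();
  const doc = PdfDocument.open(path);
  try {
    return doc.extractText(0); // first page, as in the Quick Start
  } finally {
    doc.close(); // release the document even if extraction throws
  }
}

// Usage (requires pdf-oxide to be installed):
// firstPageText("paper.pdf").then(console.log);
```

Because the import is dynamic, the helper has to be async even though the extraction calls themselves return plain values.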

TypeScript:

import { PdfDocument } from "pdf-oxide";

const doc = PdfDocument.open("paper.pdf");
const text: string = doc.extractText(0);
const markdown: string = doc.toMarkdown(0);
doc.close();

Why pdf_oxide?

Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts, no segfaults
Complete — Text extraction, image extraction, search, form fields, PDF creation, and editing in one package
Permissive license — MIT / Apache-2.0 — use freely in commercial and closed-source projects
Pure Rust core — Memory-safe, panic-free, no C dependencies beyond the N-API glue
Native binaries — Pre-built .node addons for Linux, macOS, and Windows (x64 + ARM64)
Full TypeScript support — Type definitions ship in the package

Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only. Single-thread, 60s timeout, no warm-up.

| Library | Mean | p99 | Pass Rate | License |
|---------|------|-----|-----------|---------|
| PDF Oxide | 0.8ms | 9ms | 100% | MIT / Apache-2.0 |
| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |
| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |
| pdfminer | 16.8ms | 124ms | 98.8% | MIT |
| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |
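
To see how the headline speedup figures relate to the table, the ratios can be recomputed from the mean column. This is only arithmetic on the numbers above; the `speedups` function name is illustrative.

```javascript
// Mean extraction times in milliseconds, copied from the table above.
const meanMs = {
  "PDF Oxide": 0.8,
  PyMuPDF: 4.6,
  pypdfium2: 4.1,
  pdftext: 7.3,
  pdfminer: 16.8,
  pypdf: 12.1,
};

// Speedup of the baseline over every other library: otherMean / baselineMean.
function speedups(table, baseline) {
  const base = table[baseline];
  const result = {};
  for (const [name, ms] of Object.entries(table)) {
    if (name !== baseline) result[name] = ms / base;
  }
  return result;
}

const ratios = speedups(meanMs, "PDF Oxide");
for (const [name, ratio] of Object.entries(ratios)) {
  console.log(`${name}: ${ratio.toFixed(2)}x`);
}
```

PyMuPDF comes out at roughly 5.75 and pypdf at roughly 15.1, which is consistent with the rounded-down 5× and 15× figures quoted elsewhere in this README.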

99.5% text parity vs PyMuPDF and pypdfium2 across the full corpus. The Node.js binding adds little overhead: extraction stays within ~25% of direct Rust calls on real-world fixtures.

Installation

npm install pdf-oxide

Pre-built native addons for:

| Platform | x64 | ARM64 |
|---|---|---|
| Linux (glibc) | Yes | Yes |
| Linux (musl) | Yes | Yes |
| macOS | Yes | Yes (Apple Silicon) |
| Windows | Yes | Yes |

Requires Node.js 18 or newer. No system dependencies. No Rust toolchain required.
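
A deployment script could check these requirements before installing. The sketch below is not part of pdf-oxide: `checkEnvironment` is a hypothetical helper that encodes the platform table and the Node.js 18 minimum, using the values Node.js reports in `process.platform`, `process.arch`, and `process.version`. It does not distinguish glibc from musl, since both are listed as supported.

```javascript
// Platform/architecture pairs from the table above, keyed by process.platform.
const SUPPORTED = new Map([
  ["linux", ["x64", "arm64"]],
  ["darwin", ["x64", "arm64"]], // macOS
  ["win32", ["x64", "arm64"]], // Windows
]);
const MIN_NODE_MAJOR = 18;

// Hypothetical helper: returns a list of problems, empty when the
// environment matches the documented requirements.
function checkEnvironment(platform, arch, nodeVersion) {
  const problems = [];
  const major = Number.parseInt(nodeVersion.replace(/^v/, "").split(".")[0], 10);
  if (!(major >= MIN_NODE_MAJOR)) {
    problems.push(`Node.js ${MIN_NODE_MAJOR}+ required, found ${nodeVersion}`);
  }
  if (!(SUPPORTED.get(platform) ?? []).includes(arch)) {
    problems.push(`no prebuilt addon listed for ${platform}/${arch}`);
  }
  return problems;
}

console.log(checkEnvironment(process.platform, process.arch, process.version));
```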

API Tour

Open a document

import { PdfDocument } from "pdf-oxide";

const doc = PdfDocument.open("report.pdf");
console.log(`Pages: ${doc.getPageCount()}`);

const { major, minor } = doc.getVersion();
console.log(`PDF version: ${major}.${minor}`);

doc.close();

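Use using for automatic cleanup (Node.js 22+; the using syntax itself needs a runtime or compiler that supports explicit resource management, for example TypeScript 5.2 or later):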

{
  using doc = PdfDocument.open("report.pdf");
  const text = doc.extractText(0);
} // doc.close() called automatically
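
Where `using` is not available, the same guarantee can be approximated with a small wrapper. `withDocument` below is a hypothetical helper, not part of pdf-oxide; it assumes only that the document has the `close()` method shown above.

```javascript
// Hypothetical helper for runtimes without `using`: runs `fn` with the
// document and always closes it afterwards, whether `fn` returns or throws.
function withDocument(doc, fn) {
  try {
    return fn(doc);
  } finally {
    doc.close();
  }
}

// Usage (requires pdf-oxide to be installed):
// import { PdfDocument } from "pdf-oxide";
// const text = withDocument(PdfDocument.open("report.pdf"), (d) => d.extractText(0));
```

The wrapper suits synchronous callbacks, which matches the calls in this README. If `fn` returned a Promise, the document would be closed before that Promise settled.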

Text extraction

const text = doc.extractText(0);            // single page
const markdown = doc.toMarkdown(0);         // single page → Markdown
const html = doc.toHtml(0);                 // single page → HTML
const plain = doc.toPlainText(0);           // single page → plain text

const allMarkdown = doc.toMarkdownAll();    // entire document
const allHtml = doc.toHtmlAll();
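
When the output format is chosen at runtime (for example from a CLI flag), the four single-page methods above can sit behind one function. This is a sketch: `convertPage` and the format names are illustrative, and only the method names come from the listing above.

```javascript
// Maps a format name to the single-page method listed above.
// The format names themselves are illustrative, not part of pdf-oxide.
const PAGE_CONVERTERS = {
  text: (doc, page) => doc.extractText(page),
  markdown: (doc, page) => doc.toMarkdown(page),
  html: (doc, page) => doc.toHtml(page),
  plain: (doc, page) => doc.toPlainText(page),
};

function convertPage(doc, page, format = "markdown") {
  // Object.hasOwn avoids matching inherited keys such as "toString".
  if (!Object.hasOwn(PAGE_CONVERTERS, format)) {
    throw new RangeError(`unknown format: ${format}`);
  }
  return PAGE_CONVERTERS[format](doc, page);
}

// Usage with an open document:
// const md = convertPage(doc, 0);           // Markdown by default
// const html = convertPage(doc, 0, "html");
```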

Iterate all pages

const doc = PdfDocument.open("document.pdf");
const pageCount = doc.getPageCount();

const pages = [];
for (let i = 0; i < pageCount; i++) {
  pages.push(doc.extractText(i));
}

doc.close();
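
For large documents it may be preferable to process pages one at a time rather than collect them all in an array. A minimal sketch using a generator, assuming only the `getPageCount` and `extractText` calls shown above (`pageTexts` is a hypothetical name):

```javascript
// Lazily yields [pageIndex, text] pairs so callers can stream pages
// instead of holding the whole document's text in memory at once.
function* pageTexts(doc) {
  const pageCount = doc.getPageCount();
  for (let i = 0; i < pageCount; i++) {
    yield [i, doc.extractText(i)];
  }
}

// Usage with an open document:
// for (const [index, text] of pageTexts(doc)) {
//   console.log(`page ${index}: ${text.length} characters`);
// }
// doc.close(); // close only after iteration has finished
```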

Async wrapper

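// extractText() returns its result directly, so this wrapper only provides a
// Promise-returning interface; the extraction itself still runs synchronously.
async function extractAll(filePath) {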
  const doc = PdfDocument.open(filePath);
  try {
    const pageCount = doc.getPageCount();
    const pages = [];
    for (let i = 0; i < pageCount; i++) {
      pages.push(doc.extractText(i));
    }
    return pages;
  } finally {
    doc.close();
  }
}

const pages = await extractAll("document.pdf");
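
The extraction calls in the wrapper return strings directly, so a long document keeps the event loop busy until the loop finishes. One possible mitigation, sketched below, is to pause between pages with `setImmediate` so other callbacks get a turn. `extractAllCooperative` and `yieldToEventLoop` are hypothetical names, and the function takes an already-open document rather than a path.

```javascript
// Resolves on a later turn of the event loop, letting pending callbacks run.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

// Hypothetical variant of extractAll that pauses between pages so a
// server can keep handling other work during a long extraction job.
async function extractAllCooperative(doc) {
  const pages = [];
  const pageCount = doc.getPageCount();
  for (let i = 0; i < pageCount; i++) {
    pages.push(doc.extractText(i)); // each page is still extracted synchronously
    await yieldToEventLoop();
  }
  return pages;
}
```

This does not make a single page faster or move work off the main thread; it only bounds how long the loop runs without interruption to roughly one page.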

Error handling

All methods throw on failure. Catch with try/catch:

try {
  const text = doc.extractText(0);
} catch (err) {
  console.error("Extraction failed:", err.message);
} finally {
  doc.close();
}
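
In batch jobs it is often more useful to record a failing page and continue than to abort the whole document. A minimal sketch of that pattern, assuming only that `extractText` throws on failure as stated above; the `extractTolerant` name and the `{ pages, failures }` result shape are illustrative.

```javascript
// Extracts every page, recording failures instead of stopping at the first one.
function extractTolerant(doc) {
  const pages = [];
  const failures = [];
  const pageCount = doc.getPageCount();
  for (let i = 0; i < pageCount; i++) {
    try {
      pages.push(doc.extractText(i));
    } catch (err) {
      pages.push(null); // keep array indices aligned with page indices
      failures.push({
        page: i,
        message: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return { pages, failures };
}

// Usage with an open document:
// const { pages, failures } = extractTolerant(doc);
// if (failures.length > 0) console.warn("Pages with errors:", failures);
```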

OCR & Auto Mode

OCR ships in the prebuilt pdf-oxide native addon as of v0.3.52, so no --build-from-source step is needed. Install ONNX Runtime via npm, set ORT_DYLIB_PATH to its shared library once, then let pdf_oxide route per page: native text where present, OCR where the page is image-only, and a graceful fallback when OCR is unavailable.

import { createRequire } from 'node:module';
const require = createRequire(import.meta.url);
// Linux x64 path shown; adjust the platform and architecture segments for other systems.
process.env.ORT_DYLIB_PATH = require.resolve(
  'onnxruntime-node/bin/napi-v6/linux/x64/libonnxruntime.so.1');

const px = await import('pdf-oxide');
px.prefetchModels(['english']);                    // one-off provisioning

const doc = px.PdfDocument.open('scanned-or-mixed.pdf');
console.log(doc.extractTextAuto(0));               // recommended

For manual OCR engine setup, doc.classifyPage(0) routing, custom configs, the WebAssembly (wasm-ocr) build, and full per-binding recipes: OCR Guide.

Other languages

PDF Oxide ships the same Rust core through six bindings. Besides this Node.js package, they are:

Rust — cargo add pdf_oxide — see docs.rs/pdf_oxide
Python — pip install pdf_oxide — see python/README.md
Go — go get github.com/yfedoseev/pdf_oxide/go — see go/README.md
C# / .NET — dotnet add package PdfOxide — see csharp/README.md
WASM (browsers, Deno, Bun, edge runtimes) — npm install pdf-oxide-wasm — see wasm-pkg/README.md

A bug fix in the Rust core lands in every binding on the next release.

Documentation

Full Documentation — Complete documentation site
JavaScript Getting Started — Step-by-step Node.js guide
Main Repository — Rust core, CLI, MCP server, all bindings
Performance Benchmarks — Full benchmark methodology and results
GitHub Issues — Bug reports and feature requests

Use Cases

RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain.js, LlamaIndex.js, or any framework
Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
Server-side PDF rendering — Extract structured content for search indexing, archival, or transformation pipelines
PDF generation — Create invoices, reports, certificates, and templated documents programmatically
PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions, no Python required

Why I built this

I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes — fast, MIT, multi-language — so I wrote it. The Rust core is what does the real work; the bindings for Python, Go, JS/TS, C#, and WASM are thin shells around the same code, so a bug fix in one lands in all of them. It now passes 100% of the veraPDF + Mozilla pdf.js + DARPA SafeDocs test corpora (3,830 PDFs) on every platform I've tested.

If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, open an issue — I read all of them.

— Yury

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

Citation

@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust, Python, Go, JavaScript, and C#},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}

JavaScript + TypeScript + Rust core | MIT / Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than the industry leaders