npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@alexstep/doc-extract

v0.1.5

Published

Native document text extraction for Node.js and Bun (PDF, Office, EPUB, and more)

Readme

doc-extract

Text extraction library for RAG pipelines, LLM apps, and AI agents — turn uploaded documents into clean plain text before chunking, embedding, or indexing.

Native addon for Node.js and Bun (Rust + N-API): fast, in-process, no subprocesses. PDF, Office, EPUB, calendars, contacts, Apple Wallet passes, and more.

Why doc-extract

  • Fast - Rust parsers run natively
  • Small footprint - ~6 MB native addon per platform
  • Non-blocking - extractText() returns a Promise; CPU work runs on a Rust thread pool
  • Backpressure - built-in concurrency limit (default 32) via setMaxConcurrent
  • Flexible limits - global, per-instance, or per-call file size in megabytes
  • Large files - file paths and ZIP-based formats read from disk; buffers above threshold auto-spill to temp
  • No subprocesses - PDF and Office handled in-process

Install

npm install @alexstep/doc-extract
# or
bun add @alexstep/doc-extract

Requires Node.js ≥ 18 or Bun ≥ 1.3.

Quick start

import docExtract from '@alexstep/doc-extract'

// Global defaults
docExtract.setMaxConcurrent(4)
docExtract.setMaxFilesizeMB(42)
docExtract.setInMemoryThresholdMB(64)

const text = await docExtract.extractText('./report.pdf')
const fromUrl = await docExtract.extractText('https://example.com/file.pdf')
const pass = await docExtract.parsePkPass('./ticket.pkpass')

// Isolated instance
const custom = new docExtract({ maxConcurrent: 4, maxFileSizeMB: 200 })

// Buffer: auto-detect by magic bytes (%PDF, ZIP/docx, etc.)
const buffer = await Bun.file('./report.pdf').bytes()
const doc = await custom.extractText(Buffer.from(buffer))

// Explicit format when magic is ambiguous (e.g. plain CSV bytes)
const csv = await custom.extractText(someBuffer, 'csv')

Auto-detect

| Input | Hint source | |-------|-------------| | File path | extension hint + magic bytes from file head | | URL | extension from pathname + magic bytes | | Buffer | magic bytes only (no filename) |

Works without a second argument for PDF (%PDF), Office ZIP (docx/xlsx/pptx), ICS/VCF, JSON, HTML, and similar.

For buffers, pass format explicitly when the content has no clear signature (e.g. legacy .doc, ambiguous plain text):

await docExtract.extractText(buffer, 'docx')
await docExtract.extractText(buffer, { format: 'pdf', debug: true })
await docExtract.extractText(buffer, { unknown: 'reject' }) // strict: no text heuristic

Unknown content policy

When format cannot be determined confidently:

| unknown | Behavior | |-----------|----------| | text-if-likely (default) | Treat as txt only if bytes look like text (UTF-8/UTF-16/BOM, low control-byte ratio) | | reject | Return "" (unsupported) | | text-lossy | Try txt unless bytes are obviously binary |

Detection uses explicit format first, then combines extension hints and magic bytes. Magic bytes override conflicting extensions when possible (e.g. %PDF vs a .zip path, ICS content vs a .txt name). Unknown bytes fall back according to unknown policy.

Path-based text heuristics read up to 32 KB of file head; magic/ZIP sniffing uses the first 4 KB.

API

| Method | Description | |--------|-------------| | docExtract.extractText(input, format?) | Extract text. input = Buffer, file path, or http(s):// URL. Optional format or { format, unknown, maxFileSizeMB, inMemoryThresholdMB, tempDir }. | | docExtract.parsePkPass(input, options?) | Parse .pkpass{ pass, localization?, stripImage? }. | | docExtract.setMaxConcurrent(n) | Global parallel parse limit (n === 0 or negative = no-op). | | docExtract.setMaxFilesizeMB(n) | Global max input size in MB (default 42). 0 = unlimited. | | docExtract.setInMemoryThresholdMB(n) | Above this size, paths/URLs skip JS heap; buffers spill to temp (default 64). | | docExtract.setMaxWorkingSetMB(n) | Optional cap on total in-flight parse memory (0 = disabled). | | docExtract.setTempDir(dir) | Directory for auto-spill temp files. | | new docExtract({ maxConcurrent, maxFileSizeMB, inMemoryThresholdMB, tempDir, debug }) | Instance with its own limits and optional debug logging. |

Environment: DOCEXTRACT_MAX_CONCURRENT, DOCEXTRACT_MAX_FILESIZE_MB, DOCEXTRACT_MAX_BYTES, DOCEXTRACT_IN_MEMORY_THRESHOLD_MB, DOCEXTRACT_MAX_WORKING_SET_MB, DOCEXTRACT_TMPDIR, DOCEXTRACT_DEBUG.

Error handling

extractText resolves to a string — no try/catch needed for content issues:

| Situation | Result | |-----------|--------| | Empty PDF (image scan), unsupported format, parse failure | "" | | File too large, missing file, URL fetch error | throws — use .catch() |

const text = await docExtract.extractText('./scan.pdf') // "" if image-only PDF

await docExtract.extractText('https://example.com/huge.pdf').catch((err) => {
  // size limit, network, missing file
})

await docExtract.extractText('./scan.pdf', { debug: true })
// console.debug: PDF has no text layer (likely image-only scan)

parsePkPass still returns null for invalid passes (unchanged).

Large files

doc-extract is built for batch imports and server-side pipelines, not only small uploads.

| Input | Behavior | |-------|----------| | File path | Rust opens the file directly; ZIP formats (docx, xlsx, pptx, epub, odt, pkpass) read entries from disk | | URL | Streamed to a temp file, then parsed via path pipeline | | Buffer ≤ threshold | Passed to native code in memory (default threshold 64 MB) | | Buffer > threshold | Auto-written to temp, parsed from disk, temp removed |

Policy vs memory:

  • maxFileSizeMB — reject files larger than N (0 = no limit)
  • inMemoryThresholdMB — when to avoid keeping full payload in the JS heap

Batch tip: for many large files (e.g. 20 × 200 MB), lower maxConcurrent (2–4) and optionally set setMaxWorkingSetMB to cap total in-flight memory. Peak RAM for ZIP formats is roughly concurrency × parser working set, not concurrency × file size.

pkpass

const text = await docExtract.extractText('./ticket.pkpass') // formatted text for AI
const json = await docExtract.parsePkPass('./ticket.pkpass') // structured pass.json

Supported formats

| Group | Extensions | |-------|------------| | Office | pdf, docx, docm, xlsx, xls, ods, pptx, pptm, odt, rtf | | Books | epub, fb2 | | Calendar / contacts | ics, ifb, ical, vcf, vcard | | Data | json, jsonl, ndjson, csv, tsv | | Web / text | html, htm, xhtml, xml, txt, md, markdown, log | | Wallet | pkpass (auto-detected from ZIP + pass.json) |

Limitations & alternatives

doc-extract targets in-process text extraction: text-layer PDF, Office Open XML, EPUB, calendars, and similar. No OCR, no subprocesses, no Docker sidecar.

Not supported (returns "" or needs another tool):

  • Image-only / scanned PDF (no text layer) — see OCR below
  • Legacy .doc (binary Word), .msg, PostScript
  • Images with text: PNG, JPEG, TIFF (needs Tesseract OCR)
  • Audio: mp3, wav

For those cases, a HTTP sidecar such as textract-docker is a practical fallback. It wraps Python textract with Tesseract OCR, antiword for .doc, and many other backends behind a simple REST API.

Typical integration pattern:

  1. Try docExtract.extractText() first (fast, in-process).
  2. If result is "" or format is unsupported — call textract-docker (or your existing docparser service).

Performance

doc-extract runs inside the Node/Bun process — no HTTP, base64 encoding, or Docker hop per request. That makes it a better fit for high-throughput paths (batch imports) where latency and concurrency matter.

textract-docker adds network and Python/subprocess overhead on each call, but covers OCR and legacy formats doc-extract deliberately skips.

Build from source

git clone https://github.com/alexstep/doc-extract.git
cd doc-extract
bun install
bun run build   # Rust ≥ 1.88
bun test

License

MIT


Stats

| | | |---|---| | Native addon size | ~6 MB per platform | | Default max input | 42 MB (setMaxFilesizeMB, 0 = unlimited) | | In-memory threshold | 64 MB (setInMemoryThresholdMB) | | Zip entry cap | 64 MB per entry — exceeds limit throws (not silent truncate) | | Supported extensions | 30+ | | Default concurrency | 32 (DOCEXTRACT_MAX_CONCURRENT) | | Runtime | Node.js ≥ 18, Bun ≥ 1.3 |