npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

pdfexcavator

v0.1.2

Published

A powerful PDF extraction library for Node.js built on Mozilla's pdf.js - extract text, tables, and visual elements with precision

Readme

PDFExcavator

A powerful PDF extraction library for Node.js built on Mozilla's pdf.js.

The JavaScript/TypeScript alternative to Python's pdfplumber - extract text, tables, graphics, and visual elements from PDF files with precision.

If you're coming from Python and looking for pdfplumber-like functionality in Node.js, PDFExcavator is your drop-in solution with similar APIs and capabilities.

Features

  • Text Extraction - Character, word, line, and paragraph level with positions, fonts, colors
  • Table Extraction - Bordered and borderless tables with confidence scoring
  • Graphics Extraction - Lines, rectangles, curves, images, annotations
  • OCR Support - Tesseract.js integration for scanned documents
  • Layout Analysis - LAParams for precise text grouping
  • Multi-Column - Column detection and reading order
  • CJK Support - Chinese, Japanese, Korean text
  • Font Handling - Automatic substitution for PDF base fonts
  • CLI Tool - Command-line extraction
  • TypeScript - Full type definitions

PDFExcavator vs pdfplumber

| Feature | PDFExcavator (Node.js) | pdfplumber (Python) | |---------|------------------------|---------------------| | Text Extraction | ✅ Full support | ✅ Full support | | Table Extraction | ✅ With confidence scoring | ✅ Basic | | Borderless Tables | ✅ Projection profile analysis | ✅ Basic | | Nested Tables | ✅ Supported | ❌ Not supported | | Character-level Data | ✅ With colors, fonts | ✅ Basic | | Word Extraction | ✅ Configurable tolerance | ✅ Basic | | Graphics (lines/rects/curves) | ✅ Full support | ✅ Full support | | Image Extraction | ✅ Metadata + render | ✅ Basic | | Annotations | ✅ Full support | ✅ Basic | | OCR Integration | ✅ Tesseract.js | ⚠️ External only | | Layout Analysis (LAParams) | ✅ Full pdfminer-style | ✅ Full support | | Multi-Column Detection | ✅ Auto-detect & extract | ❌ Manual | | CJK Text Support | ✅ Full CMap support | ✅ Full support | | Font Substitution | ✅ Auto for 14 base fonts | ❌ Not available | | Precision Mode | ✅ Full state tracking | ❌ Not available | | Visual Debugging | ✅ Render + draw annotations | ✅ Similar | | PDF Repair | ✅ Built-in | ❌ Not available | | CLI Tool | ✅ Built-in | ❌ Not built-in | | TypeScript Types | ✅ Full definitions | N/A (Python) | | Async/Streaming | ✅ Native async | ⚠️ Synchronous | | Large PDF Handling | ✅ Concurrent processing | ⚠️ Sequential |

Why Choose PDFExcavator?

  • Node.js/TypeScript native - No Python dependency, seamless JavaScript integration
  • Modern async API - Non-blocking operations, concurrent page processing
  • Enhanced table detection - Confidence scoring, nested tables, borderless detection
  • Built-in CLI - Quick extraction without writing code
  • Active maintenance - Built on Mozilla's pdf.js

Installation

npm install pdfexcavator

Optional Dependencies

# For rendering pages to images
npm install canvas

# For OCR support (scanned documents)
npm install tesseract.js

Quick Start

import pdfexcavator from 'pdfexcavator';

// Open a PDF
const pdf = await pdfexcavator.open('document.pdf');

// Get metadata
const metadata = await pdf.metadata;
console.log(`Title: ${metadata.title}`);
console.log(`Pages: ${metadata.pageCount}`);

// Extract text from each page
for (const page of pdf.pages) {
  const text = await page.extractText();
  console.log(text);
}

// Extract tables
for (const page of pdf.pages) {
  const tables = await page.extractTables();
  for (const table of tables) {
    console.log(table.rows);
  }
}

// Close when done
await pdf.close();

CLI

# Install globally
npm install -g pdfexcavator

# Extract text
pdfexcavator text document.pdf

# Extract tables
pdfexcavator tables document.pdf

# Get help
pdfexcavator --help

Documentation

For detailed API reference and guides, see the Documentation:

Security

PDFExcavator includes built-in security features to protect against common vulnerabilities:

Path Traversal Protection

When processing user-provided file paths, use the basePath option to restrict file access:

// Restrict file operations to a specific directory
const pdf = await pdfexcavator.open(userProvidedPath, {
  basePath: '/safe/uploads/directory'
});

// Also available for image saving
const image = await page.toImage();
await image.save(userProvidedPath, {
  basePath: '/safe/output/directory'
});

Safe Text Search

The search() method treats string patterns as literal text by default, preventing ReDoS attacks:

// Safe - pattern is escaped automatically
const results = await page.search(userInput);

// If you need regex, pass a RegExp object (use with caution on user input)
const results = await page.search(/pattern/gi);

// Or explicitly enable regex mode
const results = await page.search(userInput, { literal: false });

Requirements

  • Node.js >= 18.0.0

License

MIT