pdfexcavator

v0.1.2

Published

3 months ago

A powerful PDF extraction library for Node.js built on Mozilla's pdf.js - extract text, tables, and visual elements with precision


devadatta

pitabas.nt

mohommed.nt


PDFExcavator

A powerful PDF extraction library for Node.js built on Mozilla's pdf.js.

The JavaScript/TypeScript alternative to Python's pdfplumber - extract text, tables, graphics, and visual elements from PDF files with precision.

If you're coming from Python and looking for pdfplumber-like functionality in Node.js, PDFExcavator offers similar APIs and capabilities.

Features

Text Extraction - Character, word, line, and paragraph level with positions, fonts, colors
Table Extraction - Bordered and borderless tables with confidence scoring
Graphics Extraction - Lines, rectangles, curves, images, annotations
OCR Support - Tesseract.js integration for scanned documents
Layout Analysis - LAParams for precise text grouping
Multi-Column - Column detection and reading order
CJK Support - Chinese, Japanese, Korean text
Font Handling - Automatic substitution for PDF base fonts
CLI Tool - Command-line extraction
TypeScript - Full type definitions

PDFExcavator vs pdfplumber

| Feature | PDFExcavator (Node.js) | pdfplumber (Python) |
|---------|------------------------|---------------------|
| Text Extraction | ✅ Full support | ✅ Full support |
| Table Extraction | ✅ With confidence scoring | ✅ Basic |
| Borderless Tables | ✅ Projection profile analysis | ✅ Basic |
| Nested Tables | ✅ Supported | ❌ Not supported |
| Character-level Data | ✅ With colors, fonts | ✅ Basic |
| Word Extraction | ✅ Configurable tolerance | ✅ Basic |
| Graphics (lines/rects/curves) | ✅ Full support | ✅ Full support |
| Image Extraction | ✅ Metadata + render | ✅ Basic |
| Annotations | ✅ Full support | ✅ Basic |
| OCR Integration | ✅ Tesseract.js | ⚠️ External only |
| Layout Analysis (LAParams) | ✅ Full pdfminer-style | ✅ Full support |
| Multi-Column Detection | ✅ Auto-detect & extract | ❌ Manual |
| CJK Text Support | ✅ Full CMap support | ✅ Full support |
| Font Substitution | ✅ Auto for 14 base fonts | ❌ Not available |
| Precision Mode | ✅ Full state tracking | ❌ Not available |
| Visual Debugging | ✅ Render + draw annotations | ✅ Similar |
| PDF Repair | ✅ Built-in | ❌ Not available |
| CLI Tool | ✅ Built-in | ❌ Not built-in |
| TypeScript Types | ✅ Full definitions | N/A (Python) |
| Async/Streaming | ✅ Native async | ⚠️ Synchronous |
| Large PDF Handling | ✅ Concurrent processing | ⚠️ Sequential |

Why Choose PDFExcavator?

Node.js/TypeScript native - No Python dependency, seamless JavaScript integration
Modern async API - Non-blocking operations, concurrent page processing
Enhanced table detection - Confidence scoring, nested tables, borderless detection
Built-in CLI - Quick extraction without writing code
Actively maintained foundation - Built on Mozilla's pdf.js

Installation

npm install pdfexcavator

Optional Dependencies

# For rendering pages to images
npm install canvas

# For OCR support (scanned documents)
npm install tesseract.js

Quick Start

import pdfexcavator from 'pdfexcavator';

// Open a PDF
const pdf = await pdfexcavator.open('document.pdf');

// Get metadata
const metadata = await pdf.metadata;
console.log(`Title: ${metadata.title}`);
console.log(`Pages: ${metadata.pageCount}`);

// Extract text from each page
for (const page of pdf.pages) {
  const text = await page.extractText();
  console.log(text);
}

// Extract tables
for (const page of pdf.pages) {
  const tables = await page.extractTables();
  for (const table of tables) {
    console.log(table.rows);
  }
}

// Close when done
await pdf.close();
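
The loops above await one page at a time. For large documents, a bounded-concurrency helper is one way to keep several pages in flight without starting them all at once. The sketch below is a generic illustration: mapWithConcurrency is a hypothetical helper, not part of PDFExcavator, and it assumes the per-page calls are safe to run concurrently.

```typescript
// Hypothetical helper (not part of PDFExcavator): apply an async function to
// every item with at most `limit` calls in flight, preserving input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each worker repeatedly claims the next index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Assuming pdf.pages can be converted to an array, a call might look like mapWithConcurrency(Array.from(pdf.pages), 4, (page) => page.extractText()), run before pdf.close(). The limit of 4 is arbitrary; a suitable value depends on document size and available memory.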

CLI

# Install globally
npm install -g pdfexcavator

# Extract text
pdfexcavator text document.pdf

# Extract tables
pdfexcavator tables document.pdf

# Get help
pdfexcavator --help

Documentation

For detailed API reference and guides, see the project documentation.

Security

PDFExcavator includes built-in security features to protect against common vulnerabilities:

Path Traversal Protection

When processing user-provided file paths, use the basePath option to restrict file access:

// Restrict file operations to a specific directory
const pdf = await pdfexcavator.open(userProvidedPath, {
  basePath: '/safe/uploads/directory'
});

// Also available for image saving
const image = await page.toImage();
await image.save(userProvidedPath, {
  basePath: '/safe/output/directory'
});
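
To make the idea behind a base-directory restriction concrete, here is a minimal sketch of a containment check using Node's path module. It illustrates the general technique only: resolveInsideBase is a hypothetical helper, and nothing here is taken from PDFExcavator's actual implementation of basePath.

```typescript
import * as path from "node:path";

// Hypothetical helper (not PDFExcavator's code): resolve userPath against
// basePath and reject any result that lands outside the base directory.
function resolveInsideBase(basePath: string, userPath: string): string {
  const base = path.resolve(basePath);
  const target = path.resolve(base, userPath);
  const rel = path.relative(base, target);

  // rel begins with ".." when target is above base; it is absolute when
  // target sits on a different root or drive (possible on Windows).
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`Path escapes base directory: ${userPath}`);
  }
  return target;
}
```

This check is purely lexical. It does not follow symlinks, so a stricter version would also compare real paths (for example via fs.realpath) before opening the file.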

Safe Text Search

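By default, the search() method treats string patterns as literal text, so special characters in user input are never interpreted as regex syntax. This guards against ReDoS attacks from user-supplied patterns: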

// Safe - pattern is escaped automatically
const literalResults = await page.search(userInput);

// If you need regex, pass a RegExp object (use with caution on user input)
const regexResults = await page.search(/pattern/gi);

// Or explicitly enable regex mode
const userRegexResults = await page.search(userInput, { literal: false });
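
The literal-by-default behavior can be pictured as escaping regex metacharacters before a pattern is compiled. The sketch below shows that general technique. escapeRegExp and toSearchPattern are hypothetical names used for illustration, and this is not necessarily how PDFExcavator implements search().

```typescript
// Hypothetical helpers (not PDFExcavator's code).

// Escape every character that has special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a RegExp from a search argument. Strings are escaped unless the
// caller opts out with literal = false; RegExp objects pass through as-is.
function toSearchPattern(pattern: string | RegExp, literal = true): RegExp {
  if (pattern instanceof RegExp) return pattern;
  return new RegExp(literal ? escapeRegExp(pattern) : pattern, "g");
}
```

With escaping applied, an input such as (a+)+$ matches only that exact character sequence, so it cannot trigger the catastrophic backtracking that the same string would cause if compiled as a regex.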

Requirements

Node.js >= 18.0.0

License

MIT