
node-pdfplumber

v1.0.0

PDF text, table, and data extraction for Node.js.

A Node.js port of the popular Python pdfplumber library. Extract text, tables, and metadata from PDFs with an easy-to-use API.

Features

  • Text extraction with bounding boxes (word-level coordinates)
  • Table detection and extraction using grid line detection (lattice strategy)
  • Metadata extraction (title, author, creation date, etc.)
  • Graphical object detection (lines, rectangles)
  • Page cropping for region-specific extraction
  • Lazy loading for memory efficiency with large PDFs
  • TypeScript support with full type definitions
  • Zero system dependencies (pure JavaScript/WASM)

Installation

npm install node-pdfplumber

Quick Start

import { PDFPlumber } from 'node-pdfplumber';

// Open PDF file
const pdf = await PDFPlumber.open('document.pdf');

// Get first page
const page = pdf.pages[0];

// Extract text
const text = await page.extractText();
console.log(text);

// Extract all tables
const tables = await page.extractTables();
console.log(tables); // Array of tables: string[][][]

// Extract words with positions
const words = await page.extractWords();
words.forEach(word => {
  console.log(`${word.text} at (${word.x0}, ${word.y0})`);
});

// Clean up
pdf.close();

API Reference

PDFPlumber

PDFPlumber.open(path: string | Buffer | Uint8Array): Promise<PDFDocument>

Open a PDF from file path or buffer.

PDFPlumber.fromBuffer(buffer: Buffer | Uint8Array): Promise<PDFDocument>

Load PDF from a buffer.

PDFPlumber.fromURL(url: string): Promise<PDFDocument>

Load PDF from a remote URL.

PDFDocument

const pdf = await PDFPlumber.open('file.pdf');

// Properties
pdf.pages: PDFPage[]              // All pages
pdf.pageCount: number              // Number of pages
pdf.metadata: PDFMetadata          // Document metadata

// Methods
pdf.close(): void                  // Clean up resources

PDFPage

const page = pdf.pages[0];

// Properties
page.width: number                // Page width in points
page.height: number               // Page height in points
page.bbox: BBox                   // Full page bounding box

// Text extraction
await page.extractText(layout?: boolean): Promise<string>
await page.extractWords(): Promise<Word[]>
await page.getChars(): Promise<Char[]>

// Table extraction
await page.extractTable(): Promise<string[][] | null>
await page.extractTables(): Promise<string[][][]>
await page.findTables(settings?): Promise<Table[]>

// Graphical objects
await page.getRects(): Promise<Rect[]>
await page.getLines(): Promise<LineSegment[]>

// Cropping
page.crop(bbox: BBox): CroppedPage

Word

interface Word {
  text: string;
  x0: number;          // Left x-coordinate
  y0: number;          // Top y-coordinate  
  x1: number;          // Right x-coordinate
  y1: number;          // Bottom y-coordinate
  fontName?: string;
  fontSize?: number;
}
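Because every word carries its own coordinates, you can reassemble layout information yourself. As a sketch, this hypothetical helper (not part of the package) clusters words into visual lines by their top y-coordinates:

```javascript
// Group extracted Word objects into visual lines: words whose y0 values
// fall within yTolerance of each other are treated as one line.
function groupWordsIntoLines(words, yTolerance = 2) {
  const lines = [];
  // Sort top-to-bottom, then left-to-right.
  const sorted = [...words].sort((a, b) => (a.y0 - b.y0) || (a.x0 - b.x0));
  for (const word of sorted) {
    const line = lines.find(l => Math.abs(l.y0 - word.y0) <= yTolerance);
    if (line) {
      line.words.push(word);
    } else {
      lines.push({ y0: word.y0, words: [word] });
    }
  }
  // Re-sort each line left-to-right before joining the text.
  return lines.map(l =>
    l.words.sort((a, b) => a.x0 - b.x0).map(w => w.text).join(' ')
  );
}
```

The tolerance absorbs small baseline wobble; tune it to your document's font size.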

Table Extraction Settings

interface TableFinderSettings {
  verticalStrategy?: 'lines' | 'lines_strict' | 'text';
  horizontalStrategy?: 'lines' | 'lines_strict' | 'text';
  snapTolerance?: number;        // Default: 3
  joinTolerance?: number;        // Default: 3
  edgeMinLength?: number;        // Default: 3
  minWords?: number;             // Default: 1
}
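As an illustration, a settings object for a table drawn without ruled borders might look like this (the values are illustrative, not recommendations):

```javascript
// Hypothetical settings for a PDF whose tables have no grid lines:
// fall back to whitespace analysis on both axes and loosen snapping.
const textTableSettings = {
  verticalStrategy: 'text',
  horizontalStrategy: 'text',
  snapTolerance: 5, // merge edges within 5 pt instead of the default 3
  minWords: 2,      // require 2+ aligned words before declaring a boundary
};

// This object would be passed to the finder, e.g.:
//   const tables = await page.findTables(textTableSettings);
```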

Examples

Extract all text from a PDF

const pdf = await PDFPlumber.open('document.pdf');

for (const page of pdf.pages) {
  const text = await page.extractText();
  console.log(`Page ${page.number}: ${text}`);
}

pdf.close();

Extract tables from a specific page

const page = pdf.pages[0];
const tables = await page.extractTables();

tables.forEach((table, i) => {
  console.log(`Table ${i}:`, table);
});
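Each extracted table is a string[][] (rows of cells), so exporting one is straightforward. A small helper, hypothetical and not part of the package, could serialize a table to CSV:

```javascript
// Serialize one extracted table (string[][]) to CSV, quoting cells that
// contain commas, double quotes, or newlines (RFC 4180 style).
function tableToCSV(table) {
  const escapeCell = (cell) => {
    const s = cell ?? ''; // table cells may be null for empty regions
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  return table.map(row => row.map(escapeCell).join(',')).join('\n');
}
```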

Crop and extract from a region

const page = pdf.pages[0];

// Crop to a specific region
const cropped = page.crop({
  x0: 50,
  y0: 100,
  x1: 500,
  y1: 600
});

// All extraction methods work on cropped region
const text = await cropped.extractText();
const tables = await cropped.extractTables();

Find words in a region

const words = await page.extractWords();

// Find words in specific bbox
const region = { x0: 0, y0: 0, x1: 200, y1: 200 };
const regionWords = words.filter(w => 
  w.x0 >= region.x0 && w.y0 >= region.y0 &&
  w.x1 <= region.x1 && w.y1 <= region.y1
);

console.log(regionWords);

Search for text patterns

const words = await page.extractWords();

// Find all words matching a pattern
const emailWords = words.filter(w => /@/.test(w.text));
console.log(emailWords);

How It Works

Table Detection

The library uses the lattice strategy to detect tables by finding grid lines:

  1. Extract lines from PDF operator list (stroke commands)
  2. Snap similar lines together (within tolerance)
  3. Find grid intersections to identify cell boundaries
  4. Map text to cells based on position

For PDFs without explicit grid lines, the library can fall back to a text-based strategy that analyzes whitespace gaps between words to detect column boundaries.
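The snapping in step 2 can be sketched as follows. This is a simplification for illustration, not the package's actual implementation: nearly identical edge positions are clustered so that wobbly strokes land on a single grid line.

```javascript
// Cluster 1-D edge positions (e.g. the x-coordinates of vertical lines):
// positions within `tolerance` of their neighbor join the same cluster,
// and each cluster is replaced by its mean position.
function snapPositions(positions, tolerance = 3) {
  const sorted = [...positions].sort((a, b) => a - b);
  const clusters = [];
  for (const p of sorted) {
    const last = clusters[clusters.length - 1];
    if (last && p - last[last.length - 1] <= tolerance) {
      last.push(p);
    } else {
      clusters.push([p]);
    }
  }
  return clusters.map(c => c.reduce((sum, v) => sum + v, 0) / c.length);
}
```

With the default tolerance of 3, edges at 99, 100, and 101 pt collapse into one grid line at 100 pt.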

Coordinate System

All coordinates use a top-left origin system (matching pdfplumber):

  • X increases rightward
  • Y increases downward
  • Origin (0,0) is at top-left corner
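For reference, raw PDF user space puts the origin at the bottom-left with y increasing upward. If you are comparing against coordinates from another tool, a flip like this hypothetical helper (assuming the page height in points is known) converts a bounding box into the top-left system described above:

```javascript
// Convert a bounding box from PDF's bottom-left origin (y up) to a
// top-left origin (y down), given the page height in points.
function toTopLeft(bbox, pageHeight) {
  return {
    x0: bbox.x0,
    x1: bbox.x1,
    y0: pageHeight - bbox.y1, // old top edge, measured from the new origin
    y1: pageHeight - bbox.y0, // old bottom edge
  };
}
```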

Differences from pdfplumber (Python)

  • Character-level bounding boxes are returned at word granularity (pdfjs-dist limitation)
  • Some advanced pdfplumber features (visual debugging, custom extraction rules) are not yet implemented
  • Table stream detection (text-only tables) has limited support

Performance

  • Large PDFs are loaded lazily (pages parsed on demand)
  • Results are cached to avoid re-parsing
  • Suitable for production use with typical PDFs (10–100 MB)

Building from Source

npm install
npm run build
npm run test

License

MIT

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Related