# node-pdfplumber
A Node.js port of the popular Python pdfplumber library. Extract text, tables, and metadata from PDFs with an easy-to-use API.
## Features
- Text extraction with bounding boxes (word-level coordinates)
- Table detection and extraction using grid line detection (lattice strategy)
- Metadata extraction (title, author, creation date, etc.)
- Graphical object detection (lines, rectangles)
- Page cropping for region-specific extraction
- Lazy loading for memory efficiency with large PDFs
- TypeScript support with full type definitions
- Zero system dependencies (pure JavaScript/WASM)
## Installation

```bash
npm install node-pdfplumber
```

## Quick Start

```ts
import { PDFPlumber } from 'node-pdfplumber';

// Open a PDF file
const pdf = await PDFPlumber.open('document.pdf');

// Get the first page
const page = pdf.pages[0];

// Extract text
const text = await page.extractText();
console.log(text);

// Extract all tables
const tables = await page.extractTables();
console.log(tables); // Array of tables: string[][][]

// Extract words with positions
const words = await page.extractWords();
words.forEach(word => {
  console.log(`${word.text} at (${word.x0}, ${word.y0})`);
});

// Clean up
pdf.close();
```

## API Reference
### `PDFPlumber`

`PDFPlumber.open(path: string | Buffer | Uint8Array): Promise<PDFDocument>`

Open a PDF from a file path or an in-memory buffer.

`PDFPlumber.fromBuffer(buffer: Buffer | Uint8Array): Promise<PDFDocument>`

Load a PDF from a buffer.

`PDFPlumber.fromURL(url: string): Promise<PDFDocument>`

Load a PDF from a remote URL.
### `PDFDocument`

```ts
const pdf = await PDFPlumber.open('file.pdf');

// Properties
pdf.pages: PDFPage[]        // All pages
pdf.pageCount: number       // Number of pages
pdf.metadata: PDFMetadata   // Document metadata

// Methods
pdf.close(): void           // Clean up resources
```

### `PDFPage`
```ts
const page = pdf.pages[0];

// Properties
page.width: number    // Page width in points
page.height: number   // Page height in points
page.bbox: BBox       // Full page bounding box

// Text extraction
await page.extractText(layout?: boolean): Promise<string>
await page.extractWords(): Promise<Word[]>
await page.getChars(): Promise<Char[]>

// Table extraction
await page.extractTable(): Promise<string[][] | null>
await page.extractTables(): Promise<string[][][]>
await page.findTables(settings?): Promise<Table[]>

// Graphical objects
await page.getRects(): Promise<Rect[]>
await page.getLines(): Promise<LineSegment[]>

// Cropping
page.crop(bbox: BBox): CroppedPage
```

### `Word`
```ts
interface Word {
  text: string;
  x0: number;   // Left x-coordinate
  y0: number;   // Top y-coordinate
  x1: number;   // Right x-coordinate
  y1: number;   // Bottom y-coordinate
  fontName?: string;
  fontSize?: number;
}
```

### Table Extraction Settings
```ts
interface TableFinderSettings {
  verticalStrategy?: 'lines' | 'lines_strict' | 'text';
  horizontalStrategy?: 'lines' | 'lines_strict' | 'text';
  snapTolerance?: number;   // Default: 3
  joinTolerance?: number;   // Default: 3
  edgeMinLength?: number;   // Default: 3
  minWords?: number;        // Default: 1
}
```

## Examples
### Extract all text from a PDF

```ts
const pdf = await PDFPlumber.open('document.pdf');
for (const page of pdf.pages) {
  const text = await page.extractText();
  console.log(`Page ${page.number}: ${text}`);
}
pdf.close();
```

### Extract tables from a specific page
```ts
const page = pdf.pages[0];
const tables = await page.extractTables();
tables.forEach((table, i) => {
  console.log(`Table ${i}:`, table);
});
```

### Crop and extract from a region
```ts
const page = pdf.pages[0];

// Crop to a specific region
const cropped = page.crop({
  x0: 50,
  y0: 100,
  x1: 500,
  y1: 600
});

// All extraction methods work on the cropped region
const text = await cropped.extractText();
const tables = await cropped.extractTables();
```

### Find words in a region
```ts
const words = await page.extractWords();

// Keep only words whose bounding box lies fully inside the region
const region = { x0: 0, y0: 0, x1: 200, y1: 200 };
const regionWords = words.filter(w =>
  w.x0 >= region.x0 && w.y0 >= region.y0 &&
  w.x1 <= region.x1 && w.y1 <= region.y1
);
console.log(regionWords);
```

### Search for text patterns
```ts
const words = await page.extractWords();

// Find all words containing "@" (e.g. email addresses)
const emailWords = words.filter(w => /@/.test(w.text));
console.log(emailWords);
```

## How It Works
### Table Detection
The library uses the lattice strategy to detect tables by finding grid lines:

1. Extract lines from the PDF operator list (stroke commands)
2. Snap nearby lines together (within a tolerance)
3. Find grid intersections to identify cell boundaries
4. Map text to cells based on position
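The snapping step above amounts to 1-D clustering of line positions. Here is a deliberately simplified sketch of that idea (not the library's actual implementation), grouping coordinates that fall within a tolerance of each other:

```ts
// Simplified sketch of the "snap" step: cluster nearby line positions
// (e.g. x-coordinates of vertical lines) so that lines within
// `tolerance` points of each other are treated as one grid line.
function snapPositions(positions: number[], tolerance: number): number[] {
  const sorted = [...positions].sort((a, b) => a - b);
  const clusters: number[][] = [];
  for (const p of sorted) {
    const last = clusters[clusters.length - 1];
    if (last && p - last[last.length - 1] <= tolerance) {
      last.push(p);
    } else {
      clusters.push([p]);
    }
  }
  // Represent each cluster by its mean position
  return clusters.map(c => c.reduce((s, v) => s + v, 0) / c.length);
}

console.log(snapPositions([10, 11.5, 12, 100, 101, 250], 3));
// three snapped grid lines: near 11.17, at 100.5, and at 250
```

This is why `snapTolerance` matters: a PDF often draws the "same" ruling line as several slightly offset strokes, and snapping merges them before intersections are computed.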
For PDFs without explicit grid lines, the library can fall back to a text-based strategy that analyzes whitespace gaps between words to detect column boundaries.
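The gap analysis behind that text-based fallback can be sketched as follows. This is a simplified, single-row illustration (the real strategy aggregates evidence across the whole table region), using hypothetical word spans:

```ts
// Simplified sketch of the text-strategy fallback: infer column
// boundaries from horizontal gaps between word bounding boxes.
// A gap wider than `minGap` points is treated as a column boundary.
interface Span { x0: number; x1: number; }

function columnBoundaries(words: Span[], minGap: number): number[] {
  const sorted = [...words].sort((a, b) => a.x0 - b.x0);
  const boundaries: number[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const gap = sorted[i].x0 - sorted[i - 1].x1;
    if (gap >= minGap) {
      // Place the boundary in the middle of the gap
      boundaries.push((sorted[i - 1].x1 + sorted[i].x0) / 2);
    }
  }
  return boundaries;
}

const row = [
  { x0: 10, x1: 60 },    // first column
  { x0: 65, x1: 90 },    // same column (small gap)
  { x0: 150, x1: 200 },  // second column (large gap)
];
console.log(columnBoundaries(row, 20)); // [ 120 ]
```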
### Coordinate System
All coordinates use a top-left origin system (matching pdfplumber):
- X increases rightward
- Y increases downward
- Origin (0,0) is at top-left corner
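Native PDF user space instead puts the origin at the bottom-left with y increasing upward, so converting into this top-left system is a simple flip. A minimal sketch, assuming `pageHeight` is the page height in points (`page.height`):

```ts
// Convert a y-coordinate from native PDF user space (bottom-left
// origin, y up) to the top-left system used here (y down).
function pdfToTopLeftY(yPdf: number, pageHeight: number): number {
  return pageHeight - yPdf;
}

// A point 100pt above the bottom of a 792pt-tall (US Letter) page
// is 692pt below the top edge.
console.log(pdfToTopLeftY(100, 792)); // 692
```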
## Differences from pdfplumber (Python)
- Bounding boxes are available at word granularity only, not per character (a pdfjs-dist limitation)
- Some advanced pdfplumber features (visual debugging, custom extraction rules) are not yet implemented
- Stream-style table detection (tables drawn without ruling lines) has limited support
## Performance
- Large PDFs are loaded lazily (pages parsed on demand)
- Results are cached to avoid re-parsing
- Suitable for production use with typical PDFs (10-100MB)
## Building from Source

```bash
npm install
npm run build
npm run test
```

## License
MIT
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## Related
- pdfplumber - Original Python library
- pdfjs-dist - Underlying PDF parser
