# node-pdfplumber
A Node.js port of the popular Python pdfplumber library. Extract text, tables, and metadata from PDFs with an easy-to-use API.
## Features
- Text extraction with bounding boxes (word-level coordinates)
- Table detection and extraction using grid line detection (lattice strategy)
- Metadata extraction (title, author, creation date, etc.)
- Graphical object detection (lines, rectangles)
- Page cropping for region-specific extraction
- Lazy loading for memory efficiency with large PDFs
- TypeScript support with full type definitions
- Zero system dependencies (pure JavaScript/WASM)
## Installation

```bash
npm install node-pdfplumber
```

## Quick Start

```ts
import { PDFPlumber } from 'node-pdfplumber';

// Open a PDF file
const pdf = await PDFPlumber.open('document.pdf');

// Get the first page
const page = pdf.pages[0];

// Extract text
const text = await page.extractText();
console.log(text);

// Extract all tables
const tables = await page.extractTables();
console.log(tables); // Array of tables: string[][][]

// Extract words with positions
const words = await page.extractWords();
words.forEach(word => {
  console.log(`${word.text} at (${word.x0}, ${word.y0})`);
});

// Clean up
pdf.close();
```

## API Reference
### `PDFPlumber`

`PDFPlumber.open(path: string | Buffer | Uint8Array): Promise<PDFDocument>`

Open a PDF from a file path or an in-memory buffer.

`PDFPlumber.fromBuffer(buffer: Buffer | Uint8Array): Promise<PDFDocument>`

Load a PDF from a buffer.

`PDFPlumber.fromURL(url: string): Promise<PDFDocument>`

Load a PDF from a remote URL.
### `PDFDocument`

```ts
const pdf = await PDFPlumber.open('file.pdf');

// Properties
pdf.pages: PDFPage[]        // All pages
pdf.pageCount: number       // Number of pages
pdf.metadata: PDFMetadata   // Document metadata

// Methods
pdf.close(): void           // Clean up resources
```

### `PDFPage`
```ts
const page = pdf.pages[0];

// Properties
page.width: number    // Page width in points
page.height: number   // Page height in points
page.bbox: BBox       // Full page bounding box

// Text extraction
await page.extractText(layout?: boolean): Promise<string>
await page.extractWords(): Promise<Word[]>
await page.getChars(): Promise<Char[]>

// Table extraction
await page.extractTable(): Promise<string[][] | null>
await page.extractTables(): Promise<string[][][]>
await page.findTables(settings?): Promise<Table[]>

// Graphical objects
await page.getRects(): Promise<Rect[]>
await page.getLines(): Promise<LineSegment[]>

// Cropping
page.crop(bbox: BBox): CroppedPage
```

### `Word`
```ts
interface Word {
  text: string;
  x0: number;   // Left x-coordinate
  y0: number;   // Top y-coordinate
  x1: number;   // Right x-coordinate
  y1: number;   // Bottom y-coordinate
  fontName?: string;
  fontSize?: number;
}
```

### Table Extraction Settings
```ts
interface TableFinderSettings {
  verticalStrategy?: 'lines' | 'lines_strict' | 'text';
  horizontalStrategy?: 'lines' | 'lines_strict' | 'text';
  snapTolerance?: number;   // Default: 3
  joinTolerance?: number;   // Default: 3
  edgeMinLength?: number;   // Default: 3
  minWords?: number;        // Default: 1
}
```

## Examples
### Extract all text from a PDF

```ts
const pdf = await PDFPlumber.open('document.pdf');
for (const page of pdf.pages) {
  const text = await page.extractText();
  console.log(`Page ${page.number}: ${text}`);
}
pdf.close();
```

### Extract tables from a specific page
```ts
const page = pdf.pages[0];
const tables = await page.extractTables();
tables.forEach((table, i) => {
  console.log(`Table ${i}:`, table);
});
```

### Crop and extract from a region
```ts
const page = pdf.pages[0];

// Crop to a specific region
const cropped = page.crop({
  x0: 50,
  y0: 100,
  x1: 500,
  y1: 600
});

// All extraction methods work on the cropped region
const text = await cropped.extractText();
const tables = await cropped.extractTables();
```

### Find words in a region
```ts
const words = await page.extractWords();

// Keep only words whose bounding box lies fully inside the region
const region = { x0: 0, y0: 0, x1: 200, y1: 200 };
const regionWords = words.filter(w =>
  w.x0 >= region.x0 && w.y0 >= region.y0 &&
  w.x1 <= region.x1 && w.y1 <= region.y1
);
console.log(regionWords);
```

### Search for text patterns
```ts
const words = await page.extractWords();

// Find all words containing "@" (e.g. email addresses)
const emailWords = words.filter(w => /@/.test(w.text));
console.log(emailWords);
```

## How It Works
### Table Detection
The library uses the lattice strategy to detect tables by finding grid lines:

1. Extract lines from the PDF operator list (stroke commands)
2. Snap nearby lines together (within a tolerance)
3. Find grid intersections to identify cell boundaries
4. Map text to cells based on position
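The snapping step above amounts to 1-D clustering of line positions. Here is a deliberately simplified sketch of that idea (not the library's actual implementation), grouping coordinates that fall within a tolerance of each other:

```ts
// Simplified sketch of the "snap" step: cluster nearby line positions
// (e.g. x-coordinates of vertical lines) so that lines within
// `tolerance` points of each other are treated as one grid line.
function snapPositions(positions: number[], tolerance: number): number[] {
  const sorted = [...positions].sort((a, b) => a - b);
  const clusters: number[][] = [];
  for (const p of sorted) {
    const last = clusters[clusters.length - 1];
    if (last && p - last[last.length - 1] <= tolerance) {
      last.push(p);
    } else {
      clusters.push([p]);
    }
  }
  // Represent each cluster by its mean position
  return clusters.map(c => c.reduce((s, v) => s + v, 0) / c.length);
}

console.log(snapPositions([10, 11.5, 12, 100, 101, 250], 3));
// three snapped grid lines: near 11.17, at 100.5, and at 250
```

This is why `snapTolerance` matters: a PDF often draws the "same" ruling line as several slightly offset strokes, and snapping merges them before intersections are computed.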
For PDFs without explicit grid lines, the library can fall back to a text-based strategy that analyzes whitespace gaps between words to detect column boundaries.
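The gap analysis behind that text-based fallback can be sketched as follows. This is a simplified, single-row illustration (the real strategy aggregates evidence across the whole table region), using hypothetical word spans:

```ts
// Simplified sketch of the text-strategy fallback: infer column
// boundaries from horizontal gaps between word bounding boxes.
// A gap wider than `minGap` points is treated as a column boundary.
interface Span { x0: number; x1: number; }

function columnBoundaries(words: Span[], minGap: number): number[] {
  const sorted = [...words].sort((a, b) => a.x0 - b.x0);
  const boundaries: number[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const gap = sorted[i].x0 - sorted[i - 1].x1;
    if (gap >= minGap) {
      // Place the boundary in the middle of the gap
      boundaries.push((sorted[i - 1].x1 + sorted[i].x0) / 2);
    }
  }
  return boundaries;
}

const row = [
  { x0: 10, x1: 60 },    // first column
  { x0: 65, x1: 90 },    // same column (small gap)
  { x0: 150, x1: 200 },  // second column (large gap)
];
console.log(columnBoundaries(row, 20)); // [ 120 ]
```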
### Coordinate System
All coordinates use a top-left origin system (matching pdfplumber):
- X increases rightward
- Y increases downward
- Origin (0,0) is at top-left corner
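Native PDF user space instead puts the origin at the bottom-left with y increasing upward, so converting into this top-left system is a simple flip. A minimal sketch, assuming `pageHeight` is the page height in points (`page.height`):

```ts
// Convert a y-coordinate from native PDF user space (bottom-left
// origin, y up) to the top-left system used here (y down).
function pdfToTopLeftY(yPdf: number, pageHeight: number): number {
  return pageHeight - yPdf;
}

// A point 100pt above the bottom of a 792pt-tall (US Letter) page
// is 692pt below the top edge.
console.log(pdfToTopLeftY(100, 792)); // 692
```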
## Differences from pdfplumber (Python)
- Bounding boxes are available at word granularity only, not per character (a pdfjs-dist limitation)
- Some advanced pdfplumber features (visual debugging, custom extraction rules) are not yet implemented
- Stream-style table detection (tables drawn without ruling lines) has limited support
## Performance
- Large PDFs are loaded lazily (pages parsed on demand)
- Results are cached to avoid re-parsing
- Suitable for production use with typical PDFs (10-100MB)
## Building from Source

```bash
npm install
npm run build
npm run test
```

## License
MIT
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## Related
- pdfplumber - Original Python library
- pdfjs-dist - Underlying PDF parser
