# Mursa PDF Parser

A comprehensive, zero-dependency* PDF parsing library for Node.js with support for text extraction, metadata extraction, image extraction, and full PDF object model access.

*Only depends on `pako` for zlib decompression.
## Features
- Text Extraction - Extract text content with positioning information
- Metadata Extraction - Document Info Dictionary and XMP metadata
- Image Extraction - Extract embedded images (JPEG, JPEG2000, raw bitmap)
- Full PDF Object Access - Low-level access to all PDF objects
- Stream Decompression - FlateDecode, LZW, ASCII85, ASCIIHex, RunLength
- ToUnicode CMap Support - Proper character encoding for complex fonts
- No Native Dependencies - Pure JavaScript, works everywhere Node.js runs
## Installation

```bash
npm install mursa-pdf-parser
```

## Quick Start
```javascript
import { MursaPDF } from 'mursa-pdf-parser';

// Load a PDF
const pdf = await MursaPDF.load('document.pdf');

// Extract text
const text = pdf.getText();
console.log(text);

// Get metadata
const metadata = pdf.getMetadata();
console.log(metadata.info.Title);

// Get page count
console.log(`Pages: ${pdf.getPageCount()}`);
```

## API Reference
### Loading PDFs

```javascript
import { MursaPDF, parsePDF } from 'mursa-pdf-parser';

// From file path
const pdfFromFile = await MursaPDF.load('path/to/file.pdf');

// From Buffer
const pdfFromBuffer = MursaPDF.fromBuffer(buffer);

// From base64 string
const pdfFromBase64 = MursaPDF.fromBase64(base64String);

// Using the convenience function
const pdf = await parsePDF('document.pdf');
```

### Text Extraction
```javascript
// Get all text as a single string
const text = pdf.getText();

// Get text with page information
const result = pdf.getTextWithPages();
// Returns:
// {
//   text: "full text...",
//   pages: [
//     { pageNumber: 1, text: "...", items: [...] }
//   ]
// }

// Get text from a specific page (1-indexed)
const page1Text = pdf.getTextFromPage(1);
```

### Metadata Extraction
```javascript
// Get all metadata
const metadata = pdf.getMetadata();
// Returns:
// {
//   info: { Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate },
//   xmp: { raw, parsed },
//   structure: { version, pageCount, pageLayout, pageMode, hasOutlines, hasAcroForm }
// }

// Get document info only
const info = pdf.getInfo();

// Get XMP metadata
const xmp = pdf.getXMP();

// Get bookmarks/outlines
const outlines = pdf.getOutlines();
```

### Image Extraction
```javascript
// Get all images with raw data
const images = pdf.getImages();

// Get images as files with proper extensions
const files = pdf.getImageFiles();
// Returns:
// [
//   {
//     filename: "image_1.jpg",
//     mimeType: "image/jpeg",
//     data: Uint8Array,
//     width: 800,
//     height: 600
//   }
// ]

// Get image summary
const summary = pdf.getImageSummary();
// Returns:
// {
//   totalImages: 5,
//   totalSize: 102400,
//   byPage: { 1: 2, 2: 3 },
//   byFormat: { jpeg: 4, raw: 1 },
//   byColorSpace: { DeviceRGB: 5 },
//   images: [...]
// }
```

### Document Information
```javascript
const version = pdf.getVersion();     // "1.7"
const pageCount = pdf.getPageCount(); // 10
const pages = pdf.getPages();         // Array of page dictionaries
const catalog = pdf.getCatalog();     // Root catalog object
```

### Low-Level Access
```javascript
// Get a raw PDF object by object number and generation
const obj = pdf.getObject(5, 0);

// Resolve an indirect reference
const resolved = pdf.resolveReference(ref);

// Get the cross-reference table
const xref = pdf.getXRef();

// Get the trailer dictionary
const trailer = pdf.getTrailer();
```

### Advanced: Direct Access to Extractors
```javascript
// Access the text extractor directly
const textExtractor = pdf.textExtractor;

// Access the metadata extractor directly
const metadataExtractor = pdf.metadataExtractor;

// Access the image extractor directly
const imageExtractor = pdf.imageExtractor;
```

## Complete Example
```javascript
import { MursaPDF } from 'mursa-pdf-parser';
import { writeFile } from 'fs/promises';

async function extractPDF(filePath) {
  // Load the PDF
  const pdf = await MursaPDF.load(filePath);

  // Get basic info
  console.log(`PDF Version: ${pdf.getVersion()}`);
  console.log(`Pages: ${pdf.getPageCount()}`);

  // Get metadata
  const metadata = pdf.getMetadata();
  console.log(`Title: ${metadata.info.Title || 'Untitled'}`);
  console.log(`Author: ${metadata.info.Author || 'Unknown'}`);

  // Extract all text
  const text = pdf.getText();
  await writeFile('output.txt', text);
  console.log(`Extracted ${text.length} characters`);

  // Extract images
  const images = pdf.getImageFiles();
  for (const img of images) {
    await writeFile(img.filename, img.data);
    console.log(`Saved ${img.filename} (${img.width}x${img.height})`);
  }
  console.log(`Extracted ${images.length} images`);
}

extractPDF('document.pdf');
```

## Architecture
```
mursa-pdf-parser/
├── src/
│   ├── core/
│   │   ├── lexer.js       # Tokenizer - converts bytes to tokens
│   │   ├── parser.js      # Parser - builds PDF object model
│   │   └── objects.js     # PDF object types (Name, String, Array, etc.)
│   ├── filters/
│   │   └── index.js       # Stream decompression (FlateDecode, LZW, etc.)
│   ├── extraction/
│   │   ├── text.js        # Text extraction with font handling
│   │   ├── metadata.js    # Document info and XMP extraction
│   │   └── images.js      # Image extraction and conversion
│   └── index.js           # Main API (MursaPDF class)
├── examples/
│   └── basic-usage.js     # Usage examples
└── test/
    └── test.js            # Test suite
```

## How PDF Parsing Works
### PDF Structure

A PDF file consists of four main parts:

- Header - PDF version identifier (`%PDF-1.7`)
- Body - Objects containing document content (text, images, fonts)
- Cross-Reference Table - Maps object numbers to byte offsets
- Trailer - Points to the document catalog and metadata
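The four parts can be seen in a minimal PDF skeleton. This sketch is illustrative only: the byte offsets are fake and the Pages tree is omitted, so it is not a renderable file.

```javascript
// Minimal PDF skeleton, one part per region (illustrative, not a
// valid renderable file: offsets are fake, no Pages tree).
const skeleton = [
  '%PDF-1.7',                            // 1. Header: version identifier
  '1 0 obj << /Type /Catalog >> endobj', // 2. Body: numbered objects
  'xref',                                // 3. Cross-reference table
  '0 2',
  '0000000000 65535 f ',
  '0000000009 00000 n ',
  'trailer << /Size 2 /Root 1 0 R >>',   // 4. Trailer: points to the catalog
  'startxref',
  '9',
  '%%EOF',
].join('\n');

console.log(skeleton.startsWith('%PDF-')); // true
console.log(skeleton.endsWith('%%EOF'));   // true
```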
### Parsing Process

1. Read Header - Extract the PDF version
2. Find Trailer - Locate it from the end of the file
3. Parse XRef - Build the object location map
4. Parse Objects - Parse referenced objects on demand
5. Extract Content - Process page content streams
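Step 2 can be sketched without the library: the trailer is located by scanning the file's tail for the `startxref` keyword, whose operand is the byte offset of the cross-reference table. The function name and sample buffer below are illustrative, not part of this library's API.

```javascript
// Scan the tail of a PDF buffer for "startxref <offset> %%EOF" and
// return the xref byte offset, or -1 if the marker is missing.
function findStartXref(buf) {
  const tail = buf.subarray(Math.max(0, buf.length - 1024)).toString('latin1');
  const m = tail.match(/startxref\s+(\d+)\s+%%EOF/);
  return m ? parseInt(m[1], 10) : -1;
}

const fakeTail = Buffer.from('...objects and xref...\nstartxref\n12345\n%%EOF\n');
console.log(findStartXref(fakeTail)); // 12345
```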
### Text Extraction Process

1. Get the page content streams
2. Decompress the stream data (FlateDecode, etc.)
3. Parse content stream operators (`Tj`, `TJ`, `Tf`, `Td`, etc.)
4. Map character codes to Unicode using font encodings and ToUnicode CMaps
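As a rough illustration of step 3, here is a toy scan that pulls the string operands of `Tj` operators out of an already-decompressed content stream. A real tokenizer (like the library's lexer) must also handle escape sequences, hex strings, and `TJ` arrays; this regex sketch covers only the simple case.

```javascript
// Extract the literal-string operands of Tj operators from a
// decompressed content stream (toy sketch, simple cases only).
function extractTjStrings(contentStream) {
  const out = [];
  const re = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  let m;
  while ((m = re.exec(contentStream)) !== null) out.push(m[1]);
  return out;
}

const stream = 'BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) Tj ET';
console.log(extractTjStrings(stream)); // [ 'Hello', 'World' ]
```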
## Supported Features
### Compression Filters
- FlateDecode (zlib/deflate)
- ASCIIHexDecode
- ASCII85Decode
- LZWDecode
- RunLengthDecode
### Image Formats
- JPEG / DCTDecode
- JPEG2000 / JPXDecode
- Raw bitmap data
- Indexed color images
### Color Spaces
- DeviceGray
- DeviceRGB
- DeviceCMYK
- ICCBased
- Indexed
- Separation
### PDF Versions
- PDF 1.0 - 2.0
- Traditional XRef tables
- XRef streams (PDF 1.5+)
- Object streams (PDF 1.5+)
## Limitations
- Encrypted PDFs - Password-protected PDFs are not currently supported
- Complex Fonts - Some CID fonts with unusual encodings may not decode correctly
- Scanned PDFs - Documents containing only scanned images require OCR (not included)
- Form Data - Interactive form field values are not extracted
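If you need to flag encrypted files up front, one common heuristic (independent of this library) is to look for an `/Encrypt` entry near the trailer. Files using cross-reference streams may store the trailer dictionary elsewhere, so treat a miss as "probably not encrypted" rather than a guarantee.

```javascript
// Heuristic: scan the tail of the file for an /Encrypt trailer entry,
// which password-protected PDFs carry. A miss is not conclusive for
// files whose trailer lives in a cross-reference stream.
function looksEncrypted(buf) {
  const tail = buf.subarray(Math.max(0, buf.length - 2048)).toString('latin1');
  return /\/Encrypt\b/.test(tail);
}

const plainTail = Buffer.from('trailer << /Size 10 /Root 1 0 R >>\nstartxref\n0\n%%EOF');
const encTail = Buffer.from('trailer << /Size 10 /Encrypt 5 0 R /Root 1 0 R >>\n%%EOF');
console.log(looksEncrypted(plainTail)); // false
console.log(looksEncrypted(encTail));   // true
```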
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Author
Murali - GitHub
## Acknowledgments
- Uses `pako` for zlib decompression
