node-pdf-extractor

v1.0.0

Published

15 days ago

A simple and powerful Node.js PDF text extractor

Downloads

105

Vulnerabilities: 0 High, 0 Medium, 0 Low

planet09ai

pdf text extractor parser document extract

Installation

npm install node-pdf-extractor

Or install globally for CLI usage:

npm install -g node-pdf-extractor

Usage

As a Module

const { extractText, extractFromPath, PDFExtractor } = require('node-pdf-extractor');

// CommonJS modules do not support top-level await, so the calls run inside an async function
async function main() {
    // Simple text extraction
    const text = await extractText('document.pdf');
    console.log(text);

    // Full extraction with metadata
    const result = await extractFromPath('document.pdf');
    console.log(result.text);       // Extracted text
    console.log(result.numPages);   // Number of pages
    console.log(result.info);       // PDF info (title, author, etc.)

    // Using the class
    const extractor = new PDFExtractor();
    const data = await extractor.extract('document.pdf');
    console.log(data.text);
}

main().catch(console.error);

Extract from Buffer

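const fs = require('fs');
const { extractFromBuffer } = require('node-pdf-extractor');

async function main() {
    // Read the whole PDF into memory, then parse the buffer
    const buffer = fs.readFileSync('document.pdf');
    const result = await extractFromBuffer(buffer);
    console.log(result.text);
}

main().catch(console.error);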

With Express/Multer (File Uploads)

const express = require('express');
const multer = require('multer');
const { extractFromBuffer } = require('node-pdf-extractor');

const app = express();
const upload = multer({ storage: multer.memoryStorage() });

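app.post('/extract', upload.single('pdf'), async (req, res) => {
    // multer leaves req.file undefined when the 'pdf' form field is missing
    if (!req.file) {
        return res.status(400).json({ error: 'No file uploaded in the "pdf" field' });
    }
    try {
        const result = await extractFromBuffer(req.file.buffer);
        res.json(result);
    } catch (error) {
        res.status(500).json({ error: error.message });
    }
});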

app.listen(3000);
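
An upload endpoint will receive files that are not PDFs. The sketch below is a cheap pre-check that could run before `extractFromBuffer`. `looksLikePdf` is a hypothetical helper, not part of this package. It relies on the `%PDF-` marker that PDF files begin with, and it searches the first 1024 bytes because some readers are known to tolerate a few leading bytes before the marker.

```javascript
// Hypothetical helper, not part of node-pdf-extractor.
// Returns true when the buffer contains the "%PDF-" marker near its start.
function looksLikePdf(buffer) {
    if (!Buffer.isBuffer(buffer) || buffer.length < 5) {
        return false;
    }
    // Only inspect the head of the file; the marker is expected at or near offset 0
    const head = buffer.subarray(0, 1024);
    return head.indexOf('%PDF-', 0, 'latin1') !== -1;
}

console.log(looksLikePdf(Buffer.from('%PDF-1.7\n...')));  // true
console.log(looksLikePdf(Buffer.from('hello world')));    // false
```

In the route above, a failed check could return a 400 response instead of handing the buffer to the parser. A passing check does not guarantee that the file is a valid PDF.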

CLI Usage

# Extract and print to console
pdf-extract document.pdf

# Extract and save to file
pdf-extract document.pdf output.txt

API Reference

`extractText(input, options)`

Returns only the extracted text of a PDF, as a string.

input - File path (string) or Buffer
options - Optional parsing options

`extractFromPath(filePath, options)`

Extracts text and metadata from a PDF file.

filePath - Path to the PDF file
Returns: { text, numPages, info, metadata, version }

`extractFromBuffer(buffer, options)`

Extracts text and metadata from a PDF buffer.

buffer - PDF file as Buffer
Returns: { text, numPages, info, metadata, version }
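
As a rough illustration of working with the returned object, the sketch below derives a few statistics from `text` and `numPages`. `summarizeResult` is a hypothetical helper, not part of this package. It assumes only that `text` is a string and `numPages` is an integer, and it runs against a stand-in object rather than a real PDF.

```javascript
// Hypothetical helper: summarises a result object shaped like
// { text, numPages, info, metadata, version }. Only text and numPages are read.
function summarizeResult(result) {
    const text = typeof result.text === 'string' ? result.text : '';
    const trimmed = text.trim();
    const words = trimmed === '' ? 0 : trimmed.split(/\s+/).length;
    const pages = Number.isInteger(result.numPages) ? result.numPages : 0;
    return {
        pages,
        characters: text.length,
        words,
        // Avoid dividing by zero for empty or unparsed documents
        wordsPerPage: pages > 0 ? Math.round(words / pages) : 0,
    };
}

// Stand-in object; a real one would come from extractFromPath or extractFromBuffer
const sample = { text: 'Hello PDF world.\n\nSecond page here.', numPages: 2 };
console.log(summarizeResult(sample));
```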

`extractPages(input, startPage, endPage)`

Extracts text from specific pages.

input - File path or Buffer
startPage - Starting page (1-indexed)
endPage - Ending page (1-indexed)
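
Because page numbers are 1-indexed, it can help to validate a requested range before calling `extractPages`. The sketch below is one way to do that. `clampPageRange` is a hypothetical helper, not part of this package. It assumes `endPage` is inclusive, which the reference above does not state, and it assumes the page count is already known, for example from `numPages`.

```javascript
// Hypothetical helper, not part of node-pdf-extractor.
// Clamps a 1-indexed page range to the pages that exist in the document.
// Assumption: endPage is inclusive.
function clampPageRange(startPage, endPage, numPages) {
    if (![startPage, endPage, numPages].every(Number.isInteger)) {
        throw new TypeError('startPage, endPage and numPages must be integers');
    }
    if (numPages < 1) {
        throw new RangeError('document has no pages');
    }
    if (startPage > endPage) {
        throw new RangeError('startPage must not be greater than endPage');
    }
    const start = Math.max(1, startPage);
    const end = Math.min(numPages, endPage);
    if (start > end) {
        throw new RangeError(`pages ${startPage}-${endPage} fall outside 1-${numPages}`);
    }
    return { start, end };
}

console.log(clampPageRange(0, 99, 12));  // { start: 1, end: 12 }
console.log(clampPageRange(3, 5, 12));   // { start: 3, end: 5 }
```

The clamped values could then be passed on as `extractPages(input, start, end)`.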

`saveToFile(text, outputPath)`

Saves text to a file.

text - Text content to save
outputPath - Output file path

`PDFExtractor` Class

Object-oriented interface to the same functionality:

extract(filePath) - Extract from path
extractBuffer(buffer) - Extract from buffer
getText(input) - Get text only
getPages(input, start, end) - Extract specific pages
save(text, path) - Save to file
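
When processing several files, one unreadable PDF should not abort the whole batch. The sketch below shows one possible pattern using `Promise.allSettled`. `extractAll` is a hypothetical helper, not part of this package. The extraction function is passed in as a parameter, so in real use it could be `(p) => extractor.extract(p)`. Here a stub stands in for it so the example runs without any PDF files.

```javascript
// Hypothetical helper. `extract` is any function that takes a path and
// resolves to an object with a `text` field.
async function extractAll(paths, extract) {
    // The async wrapper turns synchronous throws into rejections as well
    const settled = await Promise.allSettled(paths.map(async (p) => extract(p)));
    return settled.map((outcome, i) => {
        if (outcome.status === 'fulfilled') {
            return { path: paths[i], ok: true, text: outcome.value.text };
        }
        const reason = outcome.reason;
        return {
            path: paths[i],
            ok: false,
            error: reason instanceof Error ? reason.message : String(reason),
        };
    });
}

// Stub standing in for the real extractor
const fakeExtract = async (p) => {
    if (p.endsWith('.pdf')) return { text: `text of ${p}` };
    throw new Error(`not a PDF: ${p}`);
};

extractAll(['a.pdf', 'notes.txt'], fakeExtract).then((results) => console.log(results));
```

Each entry in the result keeps its input path, so failures can be reported or retried individually.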

License

MIT
