pdf-extract-text
v0.1.3
PDF Text Extractor
A fast, native Node.js module to extract and process text from PDF files using Rust and N-API. Built with [Tokio](https://tokio.rs/), [`pdf-extract`](https://docs.rs/pdf-extract), and [`text-splitter`](https://crates.io/crates/text-splitter), this package provides efficient and easy-to-use async APIs.
Features
- High-performance native code (Rust)
- Asynchronous functions (non-blocking I/O)
- Useful for LLM pipelines and search indexing
- Extract cleaned text from PDF files
- Split PDF text into pages with automatic page number detection
- Generate overlapping text chunks with configurable sizes
- TypeScript support
Installation
npm install pdf-extract-text
# or
yarn add pdf-extract-text
Usage
Basic Text Extraction
JavaScript
const { extractTextFromPdf } = require('pdf-extract-text');

async function main() {
  try {
    // Resolves with the cleaned text of the whole document
    const text = await extractTextFromPdf('./document.pdf');
    console.log('Cleaned PDF text:', text);
  } catch (error) {
    console.error('Error:', error.message);
  }
}

main();
TypeScript
import { extractTextFromPdf } from 'pdf-extract-text';

async function main() {
  try {
    // Resolves with the cleaned text of the whole document
    const text: string = await extractTextFromPdf('./document.pdf');
    console.log('Cleaned PDF text:', text);
  } catch (error) {
    console.error('Error:', (error as Error).message);
  }
}

main();
Page-based Extraction
JavaScript
const { extractTextPages } = require('pdf-extract-text');

async function extractPages() {
  try {
    // Resolves with one { page, text } entry per detected page
    const pages = await extractTextPages('./document.pdf');
    pages.forEach(page => {
      console.log(`Page ${page.page}:`);
      console.log(page.text);
      console.log('\n---\n');
    });
  } catch (error) {
    console.error('Error:', error.message);
  }
}

extractPages();
TypeScript
import { extractTextPages, Page } from 'pdf-extract-text';

async function extractPages() {
  try {
    // Resolves with one { page, text } entry per detected page
    const pages: Page[] = await extractTextPages('./document.pdf');
    pages.forEach((page: Page) => {
      console.log(`Page ${page.page}:`);
      console.log(page.text);
      console.log('\n---\n');
    });
  } catch (error) {
    console.error('Error:', (error as Error).message);
  }
}

extractPages();
Text Chunking with Overlaps
JavaScript
const { extractTextChunks } = require('pdf-extract-text');

async function chunkText() {
  try {
    // chunkSize = 1000 characters, chunkOverlap = 200 characters
    const chunks = await extractTextChunks('./document.pdf', 1000, 200);
    chunks.forEach(chunk => {
      console.log(`Chunk ${chunk.id}:`);
      console.log(chunk.text);
      console.log('\n=====\n');
    });
  } catch (error) {
    console.error('Error:', error.message);
  }
}

chunkText();
TypeScript
import { extractTextChunks, TextChunk } from 'pdf-extract-text';

async function chunkText() {
  try {
    // chunkSize = 1000 characters, chunkOverlap = 200 characters
    const chunks: TextChunk[] = await extractTextChunks('./document.pdf', 1000, 200);
    chunks.forEach((chunk: TextChunk) => {
      console.log(`Chunk ${chunk.id}:`);
      console.log(chunk.text);
      console.log('\n=====\n');
    });
  } catch (error) {
    console.error('Error:', (error as Error).message);
  }
}

chunkText();
Types
type Page = {
  page: number;
  text: string;
};

type TextChunk = {
  id: number;
  text: string;
};
API Documentation
extractTextFromPdf(path: string): Promise<string>
Extracts and cleans text from a PDF file
- path: Path to the PDF file
- Returns: Cleaned text with numeric-only lines removed
extractTextPages(path: string): Promise<Page[]>
Extracts text split into pages
interface Page {
  page: number;
  text: string;
}
extractTextChunks(path: string, chunkSize: number, chunkOverlap: number): Promise<TextChunk[]>
Generates overlapping text chunks
interface TextChunk {
  id: number;
  text: string;
}
- chunkSize: Target chunk size in characters
- chunkOverlap: Overlap between chunks (must be < chunkSize)
Error Handling
All functions return promises that reject with a descriptive error message for:
- File not found or read errors
- PDF parsing failures
- Invalid chunk configurations (overlap >= chunk size)
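Because an invalid chunk configuration is one of the documented failure cases, it can be convenient to check the parameters before calling into native code. The following is a minimal sketch of such a pre-flight check. The helper name `chunkConfigError` and its messages are hypothetical and not part of this package, and the integer checks are an added assumption. Only the rule that the overlap must be smaller than the chunk size comes from the documentation above.

```typescript
// Hypothetical helper (not exported by pdf-extract-text): returns a message
// describing the first problem found, or null when the configuration looks valid.
function chunkConfigError(chunkSize: number, chunkOverlap: number): string | null {
  // Assumption: sizes are whole numbers of characters.
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    return 'chunkSize must be a positive integer';
  }
  if (!Number.isInteger(chunkOverlap) || chunkOverlap < 0) {
    return 'chunkOverlap must be a non-negative integer';
  }
  // Documented rule: overlap >= chunk size is an invalid configuration.
  if (chunkOverlap >= chunkSize) {
    return 'chunkOverlap must be smaller than chunkSize';
  }
  return null;
}

const validConfig = chunkConfigError(1000, 200);   // null: safe to call extractTextChunks
const invalidConfig = chunkConfigError(200, 200);  // message: overlap equals chunk size
```

A check like this only mirrors the documented rule. The `try`/`catch` around the actual call is still needed for file and parsing errors.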
Use Cases
- Document understanding and chunking for LLMs
- PDF content extraction for chatbots or search
- Indexing and pre-processing for embeddings
Processing Details
Text Cleaning:
- Removes lines containing only numeric characters
- Preserves original line breaks and formatting
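To make the cleaning rule concrete, here is a rough TypeScript sketch of what "removing numeric-only lines" could look like. It is an illustration of the described behavior, not the package's Rust implementation. The function name `removeNumericOnlyLines` and the decision to ignore surrounding whitespace are assumptions.

```typescript
// Illustrative only: not the package's actual implementation.
function removeNumericOnlyLines(text: string): string {
  return text
    .split('\n')
    // Drop lines made up solely of digits (assumption: padding whitespace is ignored).
    .filter((line) => !/^\s*\d+\s*$/.test(line))
    // Re-join with the original separator so remaining line breaks are preserved.
    .join('\n');
}

const sample = 'Introduction\n12\nSection 1.2 covers setup\n\n3';
const cleaned = removeNumericOnlyLines(sample);
```

In this sketch the lines `12` and `3` are dropped, while `Section 1.2 covers setup` and the blank line are kept because they are not purely numeric.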
Page Detection:
- Splits text at lines containing only page numbers
- Handles variable page number positions
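The page-splitting idea can be sketched in a few lines of TypeScript. This is a simplified illustration under stated assumptions, not the package's algorithm: it assumes each page's text is followed by a line holding only that page's number, and it numbers any trailing text after the last page seen. The helper name `splitIntoPages` is hypothetical.

```typescript
type Page = { page: number; text: string };

// Hypothetical helper: splits text wherever a line contains only a number,
// treating that number as the page number of the text collected so far.
function splitIntoPages(text: string): Page[] {
  const pages: Page[] = [];
  let buffer: string[] = [];
  let lastPage = 0;

  for (const line of text.split('\n')) {
    const match = /^\s*(\d+)\s*$/.exec(line);
    if (match) {
      // A numeric-only line closes the current page.
      lastPage = Number(match[1]);
      pages.push({ page: lastPage, text: buffer.join('\n').trim() });
      buffer = [];
    } else {
      buffer.push(line);
    }
  }

  // Assumption: leftover text with no page-number line becomes the next page.
  const rest = buffer.join('\n').trim();
  if (rest.length > 0) {
    pages.push({ page: lastPage + 1, text: rest });
  }
  return pages;
}

const demoPages = splitIntoPages('First page text\n1\nSecond page text\n2\nUnnumbered tail');
```

Real PDFs may place page numbers in headers rather than footers, so the direction of the split in this sketch should not be read as the library's behavior.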
Chunking:
- Uses semantic-aware splitting (paragraphs/sentences)
- Maintains context with overlapping chunks
- Configurable through simple parameters
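To show how `chunkSize` and `chunkOverlap` interact, the sketch below uses plain fixed-size character windows. This is deliberately a simpler technique than the semantic-aware splitting described above, so real chunk boundaries will differ. The function name `chunkWithOverlap` and the zero-based `id` values are assumptions made for illustration.

```typescript
type TextChunk = { id: number; text: string };

// Simplified stand-in: fixed-size character windows, NOT semantic splitting.
function chunkWithOverlap(text: string, chunkSize: number, chunkOverlap: number): TextChunk[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be >= 0 and smaller than chunkSize');
  }
  // Each window starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks: TextChunk[] = [];

  for (let start = 0; start < text.length; start += step) {
    // Assumption: ids are zero-based and sequential.
    chunks.push({ id: chunks.length, text: text.slice(start, start + chunkSize) });
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const demoChunks = chunkWithOverlap('abcdefghij', 4, 2);
// Windows: "abcd", "cdef", "efgh", "ghij"
```

Each window repeats the last two characters of the previous one, which is the context-preserving effect of the overlap. The requirement that the overlap be smaller than the chunk size follows directly from the step calculation: otherwise the window would never advance.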
Requirements
- Node.js 16+
- Rust (for building from source)
License
MIT
