text-extract
v1.0.5
A robust Node.js utility for extracting text from PDF, DOCX, DOC, XLSX, and TXT buffers.
text-extract
Robust, multi-format text extraction from binary buffers in Node.js
Extract readable text from PDFs, Word documents (.doc, .docx), Excel spreadsheets (.xls, .xlsx), plain text files, and legacy Microsoft Office compound files — with graceful error handling and MIME-type detection.
npm install text-extract
Features
- Supports the most common office & document formats:
  - PDF
  - DOC / DOCX (modern and legacy)
  - XLS / XLSX
  - Plain text (.txt)
  - Compound File Binary Format (CFB) containers (old .doc / .xls)
- Automatic file-type detection
- Parallel processing of multiple files
- Clean error handling — one corrupt file doesn't crash the whole batch
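To make "automatic file-type detection" concrete, here is a minimal sketch of signature-based sniffing on the first bytes of a buffer. It is an illustration only, not text-extract's actual detection logic: `sniffContainer` and `SIGNATURES` are hypothetical names, and the check assumes the signature sits at offset 0. The byte sequences themselves are the well-known PDF, ZIP, and CFB headers.

```javascript
// Hypothetical helper, not part of the text-extract API.
// Well-known leading byte signatures for the container formats listed above.
const SIGNATURES = [
  { kind: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { kind: 'zip', bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04" (.docx and .xlsx are ZIP archives)
  { kind: 'cfb', bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // legacy .doc / .xls
];

// Returns 'pdf', 'zip', 'cfb', or null when no signature matches at offset 0.
function sniffContainer(buffer) {
  for (const { kind, bytes } of SIGNATURES) {
    const longEnough = buffer.length >= bytes.length;
    if (longEnough && bytes.every((b, i) => buffer[i] === b)) {
      return kind;
    }
  }
  return null;
}
```

A ZIP or CFB match only identifies the container. Telling .docx from .xlsx, or .doc from .xls, requires looking inside it, and plain text has no signature at all.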
Supported Formats
| Format | Extension(s) | MIME Type(s) | Notes |
|-----------------|------------------|------------------------------------------------------------|------------------------------------|
| PDF | .pdf | application/pdf | Text layer extraction |
| Word (modern) | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | |
| Word (legacy) | .doc | application/msword | |
| Excel (modern) | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | CSV-style output per sheet |
| Excel (legacy) | .xls | application/vnd.ms-excel, application/x-cfb | CSV-style output per sheet |
| Plain Text | .txt | text/plain | UTF-8 decoded |
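If you need to go from an extracted format back to a MIME type (for example, to set a Content-Type header), the table above can be encoded as a small lookup. This is a sketch: `mimeTypesFor` is a hypothetical helper that the library does not export, and whether `result.ext` includes a leading dot is an assumption, so the helper accepts both forms.

```javascript
// Hypothetical lookup built from the Supported Formats table; not part of text-extract.
const MIME_BY_EXT = new Map([
  ['pdf', ['application/pdf']],
  ['docx', ['application/vnd.openxmlformats-officedocument.wordprocessingml.document']],
  ['doc', ['application/msword']],
  ['xlsx', ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet']],
  ['xls', ['application/vnd.ms-excel', 'application/x-cfb']],
  ['txt', ['text/plain']],
]);

// Accepts 'pdf', '.pdf', or '.PDF'; returns an empty array for unknown extensions.
function mimeTypesFor(ext) {
  const key = String(ext).toLowerCase().replace(/^\./, '');
  return MIME_BY_EXT.get(key) ?? [];
}
```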
Usage
Extract text from a single buffer
import { readFile } from 'node:fs/promises';
import { parseText } from 'text-extract';

const buffer = await readFile('invoice.pdf');
const result = await parseText(buffer);
if (result) {
  console.log(`Format: ${result.ext}`);
  console.log('Text length:', result.text.length);
  console.log(result.text.substring(0, 300)); // first 300 chars
} else {
  console.log('Could not extract text');
}

Batch process multiple files
import { readFile } from 'node:fs/promises';
import { parseTexts } from 'text-extract';

// Read each input file into a buffer
const buffers = [
  await readFile('report.pdf'),
  await readFile('proposal.docx'),
  await readFile('data.xlsx'),
  // ...
];
// The callback fires for every successfully processed file
const results = await parseTexts(buffers, (result) => {
  console.log(`Processed ${result.ext} – ${result.text.length} chars`);
});
console.log(`Successfully extracted text from ${results.length} files`);

API
parseText(buffer: Buffer): Promise<{ ext: string, text: string } | null>
Extracts text from a single file buffer.
Returns null if the format is unsupported or extraction fails.
parseTexts(buffers: Buffer[], onComplete?: (result: { ext: string, text: string }) => void): Promise<Array<{ ext: string, text: string }>>
Processes multiple buffers in parallel.
The optional onComplete callback is called for every successfully processed file.
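One way to use onComplete is to tally results per format while a batch runs. The sketch below builds such a callback. `createTally` is a hypothetical helper, not part of the library, and it relies only on the `{ ext, text }` result shape documented above.

```javascript
// Hypothetical helper: builds an onComplete callback that counts results per format.
function createTally() {
  const counts = new Map();
  const onComplete = (result) => {
    counts.set(result.ext, (counts.get(result.ext) ?? 0) + 1);
  };
  return { counts, onComplete };
}

// Usage sketch (assumes text-extract is installed and `buffers` is a Buffer[]):
//   const { counts, onComplete } = createTally();
//   await parseTexts(buffers, onComplete);
//   console.log(Object.fromEntries(counts));
```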
Error Handling
The library is designed to be very forgiving:
- One corrupt or unsupported file → that file returns null
- Exceptions are caught and logged (console.error)
- Invalid UTF-8 or binary garbage is safely handled
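The per-file isolation described above can be sketched as follows. This illustrates the pattern under stated assumptions and is not the library's source: `parseAllIsolated` is a hypothetical name, and `parseOne` stands in for any async single-buffer parser such as parseText.

```javascript
// Hypothetical sketch of per-file error isolation; not text-extract's implementation.
// `parseOne` is any async function that resolves to { ext, text } or null.
async function parseAllIsolated(buffers, parseOne, onComplete) {
  const settled = await Promise.all(
    buffers.map(async (buffer) => {
      try {
        const result = await parseOne(buffer);
        if (result && onComplete) onComplete(result);
        return result ?? null;
      } catch (err) {
        // A failure in one file is logged and turned into null
        // instead of rejecting the whole batch.
        console.error('Extraction failed:', err);
        return null;
      }
    })
  );
  // Keep only the successful results, in input order.
  return settled.filter((result) => result !== null);
}
```

Wrapping each file in its own try/catch means Promise.all never sees a rejection, so one corrupt input cannot abort the others.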
License
MIT
Made with ❤️ for Node.js developers who hate broken document parsers
