text-extract

v1.0.5

Published a month ago

A robust Node.js utility for extracting text from PDF, DOCX, DOC, XLSX, and TXT buffers.

Vulnerabilities: 0 High, 0 Medium, 0 Low

just-node

text-extraction text text-extract pdf docx xlsx node

text-extract

Robust, multi-format text extraction from binary buffers in Node.js

Extract readable text from PDFs, Word documents (.doc, .docx), Excel spreadsheets (.xls, .xlsx), and plain text files, including legacy Microsoft Office compound files, with graceful error handling and MIME-type detection.

npm install text-extract

Features

- Supports the most common office and document formats:
  - PDF
  - DOC / DOCX (legacy and modern)
  - XLS / XLSX
  - Plain text (.txt)
  - Compound File Binary Format (CFB) containers (old .doc / .xls)
- Automatic file-type detection
- Parallel processing of multiple files
- Clean error handling: one corrupt file doesn't crash the whole batch

Supported Formats

| Format | Extension(s) | MIME Type(s) | Notes |
|--------|--------------|--------------|-------|
| PDF | .pdf | application/pdf | Text layer extraction |
| Word (modern) | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | |
| Word (legacy) | .doc | application/msword | |
| Excel (modern) | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | CSV-style output per sheet |
| Excel (legacy) | .xls | application/vnd.ms-excel, application/x-cfb | CSV-style output per sheet |
| Plain Text | .txt | text/plain | UTF-8 decoded |
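
If downstream code needs a MIME type for an extracted file (for example, to set a Content-Type header), the table above can be turned into a lookup. This is a minimal sketch and not part of text-extract: `MIME_BY_EXT` and `mimeTypesFor` are hypothetical names, and it assumes `result.ext` is an extension such as `pdf` or `.pdf`, which the README does not spell out.

```javascript
// MIME types copied from the Supported Formats table.
// Hypothetical lookup; text-extract itself does not export this.
const MIME_BY_EXT = new Map([
  ['pdf', ['application/pdf']],
  ['docx', ['application/vnd.openxmlformats-officedocument.wordprocessingml.document']],
  ['doc', ['application/msword']],
  ['xlsx', ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet']],
  ['xls', ['application/vnd.ms-excel', 'application/x-cfb']],
  ['txt', ['text/plain']],
]);

// Accepts 'pdf', '.pdf', or 'PDF' (the exact shape of `ext` is an assumption).
// Returns an empty array for extensions that are not in the table.
function mimeTypesFor(ext) {
  const key = String(ext).trim().toLowerCase().replace(/^\./, '');
  return MIME_BY_EXT.get(key) ?? [];
}
```

For a result returned by `parseText`, `mimeTypesFor(result.ext)[0]` would give the primary MIME type under these assumptions.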

Usage

Extract text from a single buffer

import { readFile } from 'node:fs/promises';
import { parseText } from 'text-extract';

const buffer = await readFile('invoice.pdf');

const result = await parseText(buffer);

if (result) {
  console.log(`Format: ${result.ext}`);
  console.log('Text length:', result.text.length);
  console.log(result.text.substring(0, 300)); // first 300 chars
} else {
  console.log('Could not extract text');
}

Batch process multiple files

import { readFile } from 'node:fs/promises';
import { parseTexts } from 'text-extract';

const buffers = [
  await readFile('report.pdf'),
  await readFile('proposal.docx'),
  await readFile('data.xlsx'),
  // ...
];

const results = await parseTexts(buffers, (result) => {
  console.log(`Processed ${result.ext} – ${result.text.length} chars`);
});

console.log(`Successfully extracted text from ${results.length} files`);
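
The `onComplete` callback can also feed a simple per-format tally. The helper below is a hypothetical sketch (`createTally` is not part of the package); it relies only on the callback receiving an object with an `ext` field, as in the example above.

```javascript
// Hypothetical helper: builds an onComplete callback that counts results per format.
function createTally() {
  const counts = new Map(); // ext -> number of successfully processed files

  const onComplete = (result) => {
    counts.set(result.ext, (counts.get(result.ext) ?? 0) + 1);
  };

  return { counts, onComplete };
}
```

Pass the callback as the second argument, for example `const { counts, onComplete } = createTally(); await parseTexts(buffers, onComplete);`, then inspect `counts` once the promise resolves.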

API

`parseText(buffer: Buffer): Promise<{ ext: string, text: string } | null>`

Extracts text from a single file buffer.
Returns null if the format is unsupported or extraction fails.
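
Callers that prefer exceptions over a `null` check can wrap the call. This is a sketch under stated assumptions: `extractOrThrow` is a hypothetical name, and the parser is passed in as an argument so the helper works with any function that matches the `parseText` signature.

```javascript
// Hypothetical wrapper: converts the null result into a thrown Error.
// `parse` must behave like parseText: (buffer) => Promise<{ ext, text } | null>.
// `label` is only used to make the error message easier to trace.
async function extractOrThrow(parse, buffer, label = 'input') {
  const result = await parse(buffer);
  if (result == null) {
    throw new Error(`No text could be extracted from ${label}`);
  }
  return result;
}
```

Usage might look like `const { text } = await extractOrThrow(parseText, buffer, 'invoice.pdf');`.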

`parseTexts(buffers: Buffer[], onComplete?: (result: { ext: string, text: string }) => void): Promise<Array<{ ext: string, text: string }>>`

Processes multiple buffers in parallel.
Optional onComplete callback is called for every successfully processed file.
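
The README does not say whether `parseTexts` caps how many buffers it works on at once. For very large batches, one option is to feed it fixed-size chunks. The sketch below assumes that approach; `parseInChunks` and `chunkSize` are hypothetical, and the batch function is injected so the helper does not depend on the library's internals.

```javascript
// Hypothetical helper: calls a parseTexts-like function on fixed-size chunks,
// one chunk at a time, and concatenates the results.
// `parseMany` must behave like parseTexts: (buffers, onComplete?) => Promise<Array<{ ext, text }>>.
async function parseInChunks(parseMany, buffers, chunkSize = 10, onComplete) {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }

  const all = [];
  for (let i = 0; i < buffers.length; i += chunkSize) {
    const chunk = buffers.slice(i, i + chunkSize);
    // Files within a chunk are handled by parseMany; chunks run sequentially.
    const results = await parseMany(chunk, onComplete);
    all.push(...results);
  }
  return all;
}
```

With the library this would be called as `await parseInChunks(parseTexts, buffers, 5)`.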

Error Handling

The library is designed to be very forgiving:

- A corrupt or unsupported file returns null; the rest of the batch is unaffected
- Exceptions are caught and logged with console.error
- Invalid UTF-8 or binary garbage is handled safely
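
Because failures surface as `null` rather than exceptions, it can be useful to record which files failed. The following is a minimal sketch, not the library's own mechanism: `partitionByOutcome` and the `{ name, buffer }` input shape are hypothetical, and it calls a `parseText`-like function once per file so each result stays paired with its name.

```javascript
// Hypothetical helper: splits named inputs into extracted results and failed names.
// `parse` must behave like parseText: (buffer) => Promise<{ ext, text } | null>.
// `files` is an array of { name, buffer } objects (an assumed shape).
async function partitionByOutcome(parse, files) {
  const outcomes = await Promise.all(
    files.map(async ({ name, buffer }) => ({ name, result: await parse(buffer) }))
  );

  const extracted = [];
  const failed = [];
  for (const { name, result } of outcomes) {
    if (result == null) {
      failed.push(name); // unsupported format or extraction error
    } else {
      extracted.push({ name, ...result });
    }
  }
  return { extracted, failed };
}
```

A caller could then log `failed` or retry those files, for example after `const { extracted, failed } = await partitionByOutcome(parseText, files);`.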

License

MIT

Made with ❤️ for Node.js developers who hate broken document parsers
