@wdelhagen/textprep
v0.2.2
Document text extraction with pluggable extractors. Supports PDF, DOCX, DOC, RTF, TXT, and image files with OCR capabilities.
A robust Node.js library for extracting text, HTML, and Markdown from various document formats including PDF, DOCX, DOC, RTF, and plain text files. Features pluggable extractors, OCR support, and comprehensive text processing pipelines.
Configuration Options
Core Extraction Options
| Option | Type | Default | Description |
| ---------------- | ------- | ------------ | -------------------------------------------- |
| useOCR | string or false | 'fallback' | OCR mode: 'only', 'fallback', or false |
| removeNewlines | boolean | false | Remove line breaks from extracted text |
| removeUrls | boolean | true | Remove HTTP/HTTPS URLs from text |
| minLength | number | 0 | Minimum required text length |
| maxLength | number | undefined | Maximum allowed text length (truncates) |
| maxFileSize | number | 52428800 | Maximum file size limit in bytes (50MB) |
| requireSpacing | boolean | true | Validate text has proper spacing |
| debug | boolean | false | Enable verbose logging |
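The text-cleaning options in the table above can be pictured as a small pipeline. The sketch below is only an illustration of the documented semantics: `applyTextOptions` is a hypothetical helper, not part of the library, and the order of the steps (and the whitespace tidy-up after URL removal) is an assumption.

```javascript
// Hypothetical helper: mirrors the documented option semantics,
// not the library's actual implementation.
function applyTextOptions(text, options = {}) {
  const {
    removeNewlines = false,
    removeUrls = true,
    minLength = 0,
    maxLength, // undefined means "do not truncate"
  } = options;

  let out = text;
  if (removeUrls) {
    // Drop http:// and https:// URLs, then collapse the gap they leave behind.
    out = out.replace(/https?:\/\/\S+/g, "").replace(/[ \t]{2,}/g, " ");
  }
  if (removeNewlines) {
    // Replace each run of line breaks with a single space.
    out = out.replace(/\s*[\r\n]+\s*/g, " ");
  }
  out = out.trim();
  if (maxLength !== undefined && out.length > maxLength) {
    // maxLength truncates rather than rejects.
    out = out.slice(0, maxLength);
  }
  if (out.length < minLength) {
    // minLength rejects text that is too short.
    throw new Error(`Text too short: ${out.length} < ${minLength}`);
  }
  return out;
}
```

Calling `applyTextOptions(raw, { removeNewlines: true, maxLength: 10000 })` on a raw string gives a rough preview of what the same options would do to extracted text.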
Text Validation Options
| Option | Type | Default | Description |
| --------------------------- | -------- | ------------- | ----------------------------------------------------- |
| validateCommonWords | boolean | true | Enable common words validation |
| commonWordsTypes | string[] | ['general'] | Word list types: 'general', 'resume' |
| commonWordsMinMatches | number | 1 | Minimum number of words required to match |
| commonWordsCustomWords | string[] | [] | Optional custom word list for domain-specific content |
Common words validation helps detect corrupted or non-text content by checking for recognizable English words:
- General words: Common English words (the, and, for, that, etc.)
- Resume words: CV/resume-specific vocabulary (experience, education, skills, etc.)
- Custom words: Provide your own domain-specific word list
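As a rough illustration of the idea, the check can be thought of as counting how many distinct known words appear in the text. Everything in this sketch is an assumption made for illustration: the tiny word lists, the function names, and the choice to count distinct words rather than occurrences are not taken from the library.

```javascript
// Tiny stand-in word lists; the library's real lists are not shown in this README.
const WORD_LISTS = {
  general: ["the", "and", "for", "that"],
  resume: ["experience", "education", "skills"],
};

// Hypothetical helper: count distinct vocabulary words that occur in the text.
function countCommonWordMatches(text, { types = ["general"], customWords = [] } = {}) {
  const vocabulary = new Set(customWords.map((word) => word.toLowerCase()));
  for (const type of types) {
    for (const word of WORD_LISTS[type] || []) vocabulary.add(word);
  }
  // Tokenize into lowercase alphabetic words.
  const tokens = new Set(text.toLowerCase().match(/[a-z]+/g) || []);
  let matches = 0;
  for (const word of vocabulary) {
    if (tokens.has(word)) matches += 1;
  }
  return matches;
}

// Hypothetical helper: text passes when enough vocabulary words are present.
function passesCommonWords(text, { minMatches = 1, ...rest } = {}) {
  return countCommonWordMatches(text, rest) >= minMatches;
}
```

Binary garbage or badly decoded text tends to contain none of these words, which is why even a low threshold can catch corrupted output.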
Example:
// Validate resumes/CVs with both general and resume-specific words
const result = await extractText('./resume.pdf', {
commonWordsTypes: ['general', 'resume'],
commonWordsMinMatches: 3 // Require at least 3 matching words
});
// Technical documents with custom vocabulary
const techDoc = await extractText('./api-docs.pdf', {
commonWordsTypes: ['general'],
commonWordsCustomWords: ['api', 'endpoint', 'authentication', 'jwt'],
commonWordsMinMatches: 2
});
// Disable common words validation
const lenient = await extractText('./document.pdf', {
validateCommonWords: false
});
OCR Options
| Option | Type | Default | Description |
| ------------- | ------ | ----------- | -------------------------------- |
| ocrLanguage | string | 'eng' | Tesseract language code |
| dpi | number | 300 | DPI for image conversion |
| psm | number | 1 | Tesseract page segmentation mode |
| oem | number | undefined | Tesseract OCR engine mode |
| whitelist | string | undefined | Tesseract character whitelist |
| blacklist | string | undefined | Tesseract character blacklist |
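Since psm and oem are passed through to Tesseract, it can help to sanity-check them before starting a slow OCR run. The sketch below assumes the ranges accepted by the Tesseract command line (page segmentation modes 0 to 13, engine modes 0 to 3); `checkOcrOptions` is a hypothetical helper, and whether the library performs a similar check is not stated here.

```javascript
// Hypothetical pre-flight check for OCR options.
// Ranges follow the Tesseract CLI: --psm 0..13 and --oem 0..3.
function checkOcrOptions({ psm = 1, oem, dpi = 300 } = {}) {
  const problems = [];
  if (!Number.isInteger(psm) || psm < 0 || psm > 13) {
    problems.push(`psm must be an integer from 0 to 13, got ${psm}`);
  }
  // oem is optional; only validate it when it is provided.
  if (oem !== undefined && (!Number.isInteger(oem) || oem < 0 || oem > 3)) {
    problems.push(`oem must be an integer from 0 to 3, got ${oem}`);
  }
  if (typeof dpi !== "number" || !(dpi > 0)) {
    problems.push(`dpi must be a positive number, got ${dpi}`);
  }
  return problems; // an empty array means the options look plausible
}
```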
Markdown-Specific Options
| Option | Type | Default | Description |
| ------------------- | ------- | ------- | ----------------------------------------------------- |
| preserveStructure | boolean | true | Maintain document structure |
| headingStyle | string | 'atx' | Heading style: 'atx' (##) or 'setext' (underline) |
| bulletListMarker | string | '-' | Bullet list marker character |
File Type Options
| Option | Type | Default | Description |
| --------------- | -------- | ----------- | ---------------------------------- |
| forceFileType | string | undefined | Force specific file type detection |
| allowedTypes | string[] | undefined | Restrict to specific file types |
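forceFileType is mainly useful when detection and the file extension disagree. The sketch below shows one way such a mismatch can arise, using well-known file signatures with a fallback to the extension. `sniffType` and its signature table are hypothetical and are not the library's detection code; note that the ZIP signature is shared by other Office formats, so a real detector needs more than four bytes to confirm DOCX.

```javascript
// Well-known leading bytes for a few formats (illustrative subset).
const SIGNATURES = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "rtf", bytes: [0x7b, 0x5c, 0x72, 0x74, 0x66] }, // "{\rtf"
  { type: "doc", bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // OLE2 container
  { type: "docx", bytes: [0x50, 0x4b, 0x03, 0x04] }, // ZIP container (not unique to DOCX)
];

// Hypothetical helper: prefer magic bytes, fall back to the file extension.
function sniffType(buffer, fileName) {
  const dot = fileName.lastIndexOf(".");
  const extensionType = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : null;
  const hit = SIGNATURES.find(({ bytes }) =>
    bytes.every((byte, index) => buffer[index] === byte),
  );
  return {
    detectedType: hit ? hit.type : extensionType,
    extensionType,
    // A mismatch only counts when both sources produced an answer.
    mismatch: Boolean(hit && extensionType && hit.type !== extensionType),
  };
}
```

A PDF saved as `report.txt` would come back with `detectedType: "pdf"` and `mismatch: true`, which is the situation forceFileType and allowedTypes are meant to handle explicitly.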
Extractor Options
| Option | Type | Default | Description |
| ------------------ | ------ | ------- | ------------------------------------------------- |
| extractors | object | {} | Override extractor chains for specific file types |
| extractorOptions | object | {} | Pass options directly to underlying extractors |
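An extractor chain is an ordered list that is tried until one extractor produces usable text. The following is a minimal sketch of that fallback pattern under stated assumptions: `runChain` is hypothetical, the extractors are plain synchronous functions for brevity (real extractors would be asynchronous), and "usable" is taken to mean a non-empty string.

```javascript
// Hypothetical fallback runner: try each extractor in order and record every attempt.
function runChain(chain, input) {
  const attempts = [];
  for (const { name, extract } of chain) {
    try {
      const text = extract(input);
      if (typeof text === "string" && text.trim().length > 0) {
        attempts.push({ extractor: name, success: true, textLength: text.length });
        return { text, extractor: name, attempts };
      }
      // An empty result counts as a failed attempt, so the next extractor runs.
      attempts.push({ extractor: name, success: false, error: "empty result" });
    } catch (err) {
      attempts.push({ extractor: name, success: false, error: err.message });
    }
  }
  // Nothing worked: surface the full attempt history to the caller.
  const failure = new Error("All extractors failed");
  failure.attempts = attempts;
  throw failure;
}
```

Overriding `extractors.pdf` with a single entry, as in `pdf: ["tesseract-text"]`, corresponds to handing a one-element chain to a runner like this.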
Extractor-Specific Options
These options can be passed via extractorOptions:
PDF.js Options
| Option | Type | Default | Description |
| ---------- | ------ | ----------- | ------------------------ |
| maxPages | number | undefined | Maximum pages to extract |
| password | string | undefined | PDF password |
Mammoth Options (DOCX)
| Option | Type | Default | Description |
| ------------------------- | -------- | ----------- | ---------------------------------- |
| styleMap | Array | undefined | Style mapping rules for conversion |
| includeDefaultStyleMap | boolean | true | Include default style mapping |
| convertImage | function | undefined | Image conversion function |
| ignoreEmptyParagraphs | boolean | false | Skip empty paragraphs |
| includeEmbeddedStyleMap | boolean | true | Use embedded style map |
Text File Options
| Option | Type | Default | Description |
| ---------- | ------ | ----------- | -------------------------------------------- |
| encoding | string | undefined | Force specific encoding instead of detection |
Example with all options:
const result = await extractText("./document.pdf", {
// Core options
useOCR: "fallback",
removeNewlines: false,
removeUrls: false,
minLength: 100,
maxLength: 50000,
maxFileSize: 100 * 1024 * 1024,
requireSpacing: true,
debug: true,
// OCR options
ocrLanguage: "eng",
dpi: 300,
psm: 1,
oem: 3,
whitelist: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
blacklist: "",
// Markdown options (for extractMarkdown)
preserveStructure: true,
headingStyle: "atx",
bulletListMarker: "-",
// File type options
forceFileType: "pdf",
allowedTypes: ["pdf", "docx", "txt"],
// Extractor options
extractors: {
pdf: ["tesseract-text"], // Force specific extractor chain
},
extractorOptions: {
"pdfjs-text": {
maxPages: 10,
password: "secret",
},
mammoth: {
styleMap: ['p[style-name="Header"] => h1'],
includeDefaultStyleMap: true,
ignoreEmptyParagraphs: false,
includeEmbeddedStyleMap: true,
},
"fs-text": {
encoding: "utf8",
},
},
});
Features
- Multi-format support: PDF, DOCX, DOC, RTF, TXT, and image files
- Multiple output formats: Extract as text, HTML, or Markdown
- OCR integration: Tesseract OCR for image-only documents and scanned PDFs
- Pluggable extractors: Multiple extraction methods with automatic fallback
- Text processing pipeline: Advanced cleaning, validation, and normalization
- Error handling: Comprehensive error recovery and detailed error information
- Performance tracking: Detailed metadata and timing information
- Debug mode: Verbose logging for troubleshooting
Installation
npm install @wdelhagen/textprep
System Dependencies
The library requires several system utilities for full functionality:
macOS (via Homebrew):
brew install poppler # For PDF processing (pdftotext, pdftohtml, pdftoppm)
brew install tesseract # For OCR capabilities
brew install unrtf # For RTF file processing
brew install --cask libreoffice # For DOC file processing
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install poppler-utils tesseract-ocr unrtf libreoffice
Windows:
- Install Poppler for Windows
- Install Tesseract OCR
- Install LibreOffice
Quick Start
const {
  extractText,
  extractHTML,
  extractMarkdown,
} = require("@wdelhagen/textprep");
// CommonJS modules have no top-level await, so the calls run inside an async function
async function main() {
  // Extract plain text
  const result = await extractText("./document.pdf");
  console.log(result.text);
  console.log(result.metadata.fileType); // 'pdf'
  // Extract HTML
  const htmlResult = await extractHTML("./document.docx");
  console.log(htmlResult.text); // HTML content
  // Extract Markdown
  const mdResult = await extractMarkdown("./document.pdf");
  console.log(mdResult.text); // Markdown content
}
main().catch(console.error);
API Reference
extractText(filePath, options)
Extracts plain text from supported document formats.
Parameters:
- filePath (string): Path to the document file
- options (object, optional): Extraction options
Returns: Promise<{text: string, metadata: object}>
Basic usage:
const result = await extractText("./document.pdf");
console.log(`Extracted ${result.text.length} characters`);
console.log(`File type: ${result.metadata.fileType}`);
console.log(`Extraction method: ${result.metadata.extractor}`);
With options:
const result = await extractText("./document.pdf", {
useOCR: "fallback", // 'only', 'fallback', or false
removeNewlines: false, // Remove line breaks
removeUrls: false, // Remove HTTP/HTTPS URLs
minLength: 100, // Minimum text length required
maxLength: 50000, // Maximum text length allowed
maxFileSize: 100 * 1024 * 1024, // 100MB file size limit
debug: true, // Enable verbose logging
});
extractHTML(filePath, options)
Extracts HTML content from supported document formats.
Parameters:
- filePath (string): Path to the document file
- options (object, optional): Extraction options
Returns: Promise<{text: string, metadata: object}>
const result = await extractHTML("./document.docx");
// For formats without native HTML support, returns text wrapped in <pre> tags
console.log(result.metadata.htmlFallback); // true if fallback was used
extractMarkdown(filePath, options)
Extracts content as Markdown from supported document formats.
Parameters:
- filePath (string): Path to the document file
- options (object, optional): Extraction options including Markdown-specific settings
Returns: Promise<{text: string, metadata: object}>
const result = await extractMarkdown("./document.pdf", {
preserveStructure: true, // Maintain document structure
headingStyle: "atx", // 'atx' (##) or 'setext' (underline)
bulletListMarker: "-", // Bullet list marker character
});
detectFileType(filePath)
Detects file type using magic bytes with fallback to file extension.
Parameters:
- filePath (string): Path to the file
Returns: Promise<{detectedType: string, extensionType: string, mismatch: boolean, metadata: object}>
const typeInfo = await detectFileType("./unknown-file");
console.log(`Detected: ${typeInfo.detectedType}`);
console.log(`Extension suggests: ${typeInfo.extensionType}`);
console.log(`Mismatch: ${typeInfo.mismatch}`);
getExtractors(fileType?, outputType?)
Lists available extractors for a given file type and output format.
Parameters:
- fileType (string, optional): File type ('pdf', 'docx', 'doc', 'rtf', 'txt', 'image')
- outputType (string, optional): Output type ('text' or 'html'), defaults to 'text'
Returns: Array of extractor names or complete registry object
// Get all PDF text extractors
const pdfExtractors = getExtractors("pdf", "text");
// ['pdfjs-text', 'pdftotext-text', 'pdf-parse-text', 'tesseract-text']
// Get entire registry
const allExtractors = getExtractors();
console.log(allExtractors);
Options Reference
Extraction Options
| Option | Type | Default | Description |
| ---------------- | ------- | ------------ | -------------------------------------------- |
| useOCR | string or false | 'fallback' | OCR mode: 'only', 'fallback', or false |
| removeNewlines | boolean | false | Remove line breaks from extracted text |
| removeUrls | boolean | true | Remove HTTP/HTTPS URLs from text |
| minLength | number | 0 | Minimum required text length |
| maxLength | number | undefined | Maximum allowed text length (truncates) |
| maxFileSize | number | 52428800 | Maximum file size limit in bytes (50MB) |
| requireSpacing | boolean | true | Validate text has proper spacing |
| debug | boolean | false | Enable verbose logging |
OCR Options
| Option | Type | Default | Description |
| ------------- | ------ | ------- | ------------------------ |
| ocrLanguage | string | 'eng' | Tesseract language code |
| dpi | number | 300 | DPI for image conversion |
Force Options
| Option | Type | Default | Description |
| --------------- | ------ | ----------- | ---------------------------------- |
| forceFileType | string | undefined | Force specific file type detection |
Extractor-Specific Options
Pass options directly to underlying extractors:
const result = await extractText("./document.pdf", {
extractorOptions: {
"pdfjs-dist": {
// pdfjs-dist specific options
},
mammoth: {
// mammoth specific options
},
tesseract: {
// tesseract specific options
},
},
});
Supported File Types
| Format | Extensions | Text Extraction | HTML Extraction | Extractors Used |
| ----------------- | ----------------------------- | --------------- | --------------- | ------------------------------------------- |
| PDF | .pdf | ✅ | ✅ | pdfjs-dist, pdftotext, pdf-parse, tesseract |
| Word (Modern) | .docx | ✅ | ✅ | mammoth, textutil, tesseract |
| Word (Legacy) | .doc | ✅ | ❌* | libreoffice, textutil |
| Rich Text | .rtf | ✅ | ❌* | unrtf, textutil |
| Plain Text | .txt | ✅ | ❌* | fs (encoding detection) |
| Images | .jpg, .png, .bmp, .webp, .pbm | ✅ | ❌* | tesseract |
* HTML extraction falls back to text wrapped in <pre> tags
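The `<pre>` fallback amounts to wrapping plain text in a preformatted block. The sketch below shows the idea; `escapeHtml` and `wrapInPre` are hypothetical names, and whether the library escapes special characters the same way is an assumption.

```javascript
// Hypothetical helper: escape the characters that would otherwise be read as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: wrap plain text so whitespace and line breaks survive in HTML.
function wrapInPre(text) {
  return `<pre>${escapeHtml(text)}</pre>`;
}
```

Callers that need to know whether they received real HTML or this wrapper can check `metadata.htmlFallback`, as shown in the extractHTML reference.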
Usage Examples
Basic Text Extraction
const { extractText } = require("@wdelhagen/textprep");
async function extractFromFile(filePath) {
try {
const result = await extractText(filePath);
console.log(`Successfully extracted ${result.text.length} characters`);
console.log(`Processing time: ${result.metadata.totalDuration}ms`);
return result.text;
} catch (error) {
console.error("Extraction failed:", error.message);
if (error.context) {
console.error("Context:", error.context);
}
}
}
OCR Processing
// OCR only mode - force OCR even for text-based documents
const ocrResult = await extractText("./scanned-document.pdf", {
useOCR: "only",
ocrLanguage: "eng",
dpi: 300,
});
// OCR fallback - try text extraction first, OCR if no text found
const fallbackResult = await extractText("./mixed-document.pdf", {
useOCR: "fallback",
});
Force Options for Troubleshooting
// Force file type when automatic detection fails
const result = await extractText("./misnamed-file.xyz", {
forceFileType: "pdf",
});
// Force specific extractor chain
const forcedResult = await extractText("./document.pdf", {
forceFileType: "pdf",
extractors: {
pdf: ["tesseract-text"], // Use OCR for PDFs
},
});
Advanced Text Processing
const result = await extractText("./document.pdf", {
removeNewlines: true, // Convert to single paragraph
removeUrls: true, // Clean URLs from text
minLength: 500, // Require at least 500 characters
maxLength: 10000, // Truncate at 10k characters
requireSpacing: true, // Validate proper word spacing
});
Batch Processing
const fs = require("fs").promises;
const path = require("path");
const { extractText } = require("@wdelhagen/textprep");
async function processDirectory(dirPath) {
const files = await fs.readdir(dirPath);
const results = [];
for (const file of files) {
const filePath = path.join(dirPath, file);
try {
const result = await extractText(filePath, {
maxFileSize: 20 * 1024 * 1024, // 20MB limit
useOCR: "fallback",
});
results.push({
file,
success: true,
textLength: result.text.length,
extractor: result.metadata.extractor,
duration: result.metadata.totalDuration,
});
} catch (error) {
results.push({
file,
success: false,
error: error.message,
});
}
}
return results;
}
Error Handling
const { extractText, errors } = require("@wdelhagen/textprep");
const { FileTypeError, ExtractionError, ValidationError } = errors;
async function robustExtraction(filePath) {
try {
return await extractText(filePath);
} catch (error) {
if (error instanceof FileTypeError) {
console.error("File type issue:", error.message);
console.error("File path:", error.filePath);
console.error("Detected type:", error.detectedType);
} else if (error instanceof ExtractionError) {
console.error("Extraction failed:", error.message);
console.error("Context:", error.context);
} else if (error instanceof ValidationError) {
console.error("Validation failed:", error.message);
console.error("Validation errors:", error.errors);
} else {
console.error("Unexpected error:", error);
}
throw error;
}
}
Debug Mode
const result = await extractText("./document.pdf", {
debug: true, // Enables verbose logging
});
// Check debug information in metadata
if (result.metadata.debug) {
console.log("Detection time:", result.metadata.debug.timings.detection);
console.log(
"Extraction attempts:",
result.metadata.debug.extractorsAttempted,
);
console.log("Processing steps:", result.metadata.debug.processingSteps);
}
Metadata Structure
All extraction functions return rich metadata:
{
text: "extracted content...",
metadata: {
// Basic information
fileType: "pdf",
fileSize: 1024576,
extractor: "pdfjs-text",
totalDuration: 1523,
// Timing breakdown
timings: {
detection: 45,
extraction: 1200,
processing: 278,
total: 1523
},
// Content analysis
contentAnalysis: {
originalLength: 15420,
finalLength: 14890,
compressionRatio: 0.97,
hasText: true,
estimatedPages: 3
},
// Processing details
processingSteps: ["detectFileType", "pdfjs-text", "processText"],
extractionAttempts: [
{
extractor: "pdfjs-text",
success: true,
duration: 1200,
textLength: 15420
}
],
// Error information (if any warnings)
warnings: [],
errorSummary: {
hasErrors: false,
errorCount: 0,
warningCount: 0
}
}
}
Performance Considerations
- File size limits: Default 50MB limit prevents memory issues
- Extraction timeout: Operations time out after a reasonable period
- Memory usage: Typically under 500MB for large documents
- OCR processing: Can be slow for large images/scanned documents
- Caching: System utility availability is cached for performance
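The caching point can be illustrated with a small memoizing wrapper: each system utility is probed once and the answer is reused afterwards. This is a sketch under stated assumptions; `createCachedProbe` and the `probe` callback are hypothetical, and the library's own cache may be structured differently.

```javascript
// Hypothetical helper: remember whether each system utility was found.
// `probe` is any function that returns a truthy value when the utility exists,
// for example one that shells out to `which pdftotext`.
function createCachedProbe(probe) {
  const cache = new Map();
  return function isAvailable(utility) {
    if (!cache.has(utility)) {
      // First request for this utility: run the (potentially slow) probe once.
      cache.set(utility, Boolean(probe(utility)));
    }
    return cache.get(utility);
  };
}
```

The trade-off is that a utility installed while the process is running will not be noticed until the cache is rebuilt.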
Error Types
FileTypeError
Thrown when file type detection fails or file type is not supported.
try {
await extractText("./unknown-file");
} catch (error) {
if (error instanceof FileTypeError) {
console.log("File path:", error.filePath);
console.log("Detected type:", error.detectedType);
}
}
ExtractionError
Thrown when all extraction methods fail.
try {
await extractText("./corrupted.pdf");
} catch (error) {
if (error instanceof ExtractionError) {
console.log("Context:", error.context);
console.log("Attempts:", error.context.attempts);
}
}
ValidationError
Thrown when extracted text fails validation.
try {
await extractText("./tiny-file.txt", { minLength: 1000 });
} catch (error) {
if (error instanceof ValidationError) {
console.log("Validation errors:", error.errors);
}
}
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite: npm test
- Submit a pull request
Testing
# Run all tests
npm test
# Run with coverage
npm run test:coverage
# Run only unit tests
npm run test:unit
# Run only integration tests
npm run test:integration
License
MIT License - see LICENSE file for details.
Changelog
v0.1.0
- Initial release
- Support for PDF, DOCX, DOC, RTF, TXT, and image files
- Text, HTML, and Markdown output formats
- OCR integration with Tesseract
- Comprehensive error handling and metadata
- Debug mode and performance tracking
