@wdelhagen/textprep
v0.3.2
Document text extraction with pluggable extractors. Supports PDF, DOCX, DOC, RTF, TXT, and image files with OCR capabilities.
A Node.js library for extracting text, HTML, and Markdown from PDF, DOCX, DOC, RTF, TXT, and image files. Features pluggable extractors, OCR support, and a comprehensive text processing pipeline.
Installation
npm install @wdelhagen/textprep
System Dependencies
macOS (via Homebrew):
brew install poppler # PDF processing (pdftotext, pdftohtml, pdftoppm)
brew install tesseract # OCR
brew install unrtf # RTF processing
brew install --cask libreoffice # DOC/RTF HTML extraction
Ubuntu/Debian:
sudo apt-get install poppler-utils tesseract-ocr unrtf libreoffice
Windows: Install Poppler, Tesseract, and LibreOffice.
Quick Start
const { extractText, extractHTML, extractMarkdown } = require("@wdelhagen/textprep");
// CommonJS modules have no top-level await, so the calls run inside an async function.
async function main() {
  const text = await extractText("./document.pdf");
  console.log(text.text);
  const html = await extractHTML("./document.docx");
  console.log(html.text); // HTML string
  const md = await extractMarkdown("./document.pdf");
  console.log(md.text); // Markdown string
}
main().catch(console.error);
API Reference
extractText(filePath, options?)
Extracts plain text. Runs the full text processing pipeline (validation + transforms) unless raw: true.
extractHTML(filePath, options?)
Extracts HTML. Accepts top-level and ocr options only — validation, transform, and markdown groups throw ValidationError. Throws ExtractionError for file types with no HTML extractor (TXT, images).
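Because extractHTML rejects the validation, transform, and markdown groups, an options object shared with extractText or extractMarkdown has to be trimmed before it is passed in. A minimal sketch of one way to do that; `toHtmlOptions` is a hypothetical helper name, not part of the library:

```javascript
// Hypothetical helper (not part of the library): drop the option groups
// that extractHTML rejects, keeping top-level keys and the ocr group.
function toHtmlOptions(options = {}) {
  const { validation, transform, markdown, ...htmlSafe } = options;
  return htmlSafe;
}

// A config object that might be shared across all three extraction calls.
const shared = {
  useOCR: "fallback",
  ocr: { language: "eng" },
  validation: { minLength: 100 },
  transform: { removeUrls: true },
};

console.log(toHtmlOptions(shared));
// { useOCR: 'fallback', ocr: { language: 'eng' } }
```

The trimmed object can then be passed as the second argument to extractHTML, while the original is still usable with extractText.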
extractMarkdown(filePath, options?)
Extracts HTML then converts to Markdown, then runs the text pipeline. Accepts all option groups. Throws FileTypeError (file not found, size limit, unsupported type), ExtractionError (no HTML extractor for file type), or ValidationError (invalid options, validation failure) — consistent with extractText and extractHTML.
detectFileType(filePath)
Returns { detectedType, extensionType, mismatch, metadata }.
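One use for this result is deciding whether to pass forceFileType when a file's extension disagrees with its content. The sketch below works on a hand-written result object and rests on two assumptions the text above does not confirm: that `mismatch` is a boolean, and that `detectedType` uses the same names forceFileType accepts. `optionsFromDetection` is a hypothetical helper name.

```javascript
// Hypothetical helper: turn a detectFileType result into extraction options.
// Assumes `mismatch` is a boolean and `detectedType` uses the same names
// that forceFileType accepts ('pdf', 'docx', 'doc', 'rtf', 'txt', 'image').
function optionsFromDetection(detection) {
  if (detection.mismatch && detection.detectedType) {
    return { forceFileType: detection.detectedType };
  }
  return {};
}

// Example result shape, written by hand for illustration only.
const sample = {
  detectedType: "pdf",
  extensionType: "txt",
  mismatch: true,
  metadata: {},
};

console.log(optionsFromDetection(sample)); // { forceFileType: 'pdf' }
```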
getExtractors(fileType?, outputType?)
Returns the extractor chain for a file type and output type, or the full registry. Pass 'ocr' or 'image' to query OCR extractors.
processText(text, options?, context?)
Runs the text processing pipeline (validation + cleaning + transforms) on a string directly — useful when text comes from a source other than the extraction functions. Accepts the same validation and transform option groups as extractText. Returns { text, errors, metadata }.
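As a rough illustration, the wrapper below runs the pipeline on a string that came from somewhere other than a file. `processText` is passed in as an argument so the sketch makes no claim about how it is loaded; in an application it would come from `require("@wdelhagen/textprep")`. Whether processText is synchronous or returns a promise is not stated above, so the sketch uses `await`, which handles both. Treating `errors` as an array is also an assumption.

```javascript
// Sketch: clean text obtained from another source (database, API, clipboard).
// `cleanExternalText` is a hypothetical helper name.
async function cleanExternalText(processText, rawText) {
  // await works whether processText returns a plain value or a promise.
  const result = await processText(rawText, {
    validation: { minLength: 20, commonWords: false },
    transform: { removeUrls: true, normalizeBullets: "-" },
  });
  // Assumption: `errors` is an array of collected issues.
  const issueCount = Array.isArray(result.errors) ? result.errors.length : 0;
  return { text: result.text, issueCount };
}
```

A caller would use it as `const { text } = await cleanExternalText(processText, someString);`.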
validateOptions(options)
Validates an options object for structure, types, and unknown keys. Throws ValidationError on any violation. This is method-agnostic — all option groups (ocr, validation, transform, markdown) are accepted. Use this for pre-flight validation or building tools on top of the library.
const { validateOptions, errors } = require("@wdelhagen/textprep");
// Throws ValidationError: Unknown option 'typo'
validateOptions({ typo: true });
// Throws ValidationError: ocr.dpi must be a number
validateOptions({ ocr: { dpi: "high" } });
// Valid — no error thrown
validateOptions({ useOCR: "fallback", validation: { minLength: 100 } });
IMAGE_EXTENSIONS
Array of file extensions recognized as image types: ["bmp", "jpg", "jpeg", "png", "pbm", "webp"].
const { IMAGE_EXTENSIONS } = require("@wdelhagen/textprep");
const ext = filePath.split(".").pop().toLowerCase();
if (IMAGE_EXTENSIONS.includes(ext)) {
// handle as image
}
Options
Options are organized into namespaced groups. Passing an unknown key at any level throws ValidationError.
Top-level (all methods)
| Option | Type | Default | Notes |
| ------------------ | --------------------------- | ------------ | ---------------------------------------------------------- |
| useOCR | 'only'│'fallback'│false | 'fallback' | |
| maxFileSize | number | 52428800 | bytes (50 MB) |
| forceFileType | string | — | 'pdf'│'docx'│'doc'│'rtf'│'txt'│'image'; allowedTypes is still enforced when set |
| allowedTypes | string[] | — | Throws FileTypeError if file type not in list |
| extractors | object | — | Override extractor chains: { pdf: ['pdfjs-text'] } |
| debug | boolean | false | |
| raw | boolean | false | Returns extractor output directly; skips pipeline entirely |
ocr group (all methods; ignored when useOCR: false)
| Option | Type | Default | Notes |
| -------------- | ------ | ----------- | ---------------------------- |
| ocr.language | string | 'eng' | Tesseract language code |
| ocr.dpi | number | 300 | DPI for image conversion; must be >= 72 |
| ocr.psm | number | 1 | Page segmentation mode (0–13)|
| ocr.oem | number | — | OCR engine mode (0–3) |
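
The ranges in this table can be checked before any OCR work starts. The sketch below validates an ocr group against the documented limits and maps it onto flags that the `tesseract` command-line tool accepts (`-l`, `--psm`, `--oem`, `--dpi`). How the library actually invokes Tesseract is not described here, so treat the mapping as an illustration; `buildTesseractArgs` is a hypothetical helper name.

```javascript
// Hypothetical helper: check the documented ranges for the ocr group and
// build an argument list for the tesseract CLI. This is not the library's
// internal code, only an illustration of what the options mean.
function buildTesseractArgs(imagePath, ocr = {}) {
  const { language = "eng", dpi = 300, psm = 1, oem } = ocr;
  if (typeof dpi !== "number" || dpi < 72) {
    throw new RangeError("ocr.dpi must be a number >= 72");
  }
  if (!Number.isInteger(psm) || psm < 0 || psm > 13) {
    throw new RangeError("ocr.psm must be an integer from 0 to 13");
  }
  if (oem !== undefined && (!Number.isInteger(oem) || oem < 0 || oem > 3)) {
    throw new RangeError("ocr.oem must be an integer from 0 to 3");
  }
  // "stdout" as the output base makes tesseract print the recognized text.
  const args = [imagePath, "stdout", "-l", language, "--psm", String(psm), "--dpi", String(dpi)];
  if (oem !== undefined) args.push("--oem", String(oem));
  return args;
}

console.log(buildTesseractArgs("page-1.png", { language: "fra", dpi: 150 }).join(" "));
// page-1.png stdout -l fra --psm 1 --dpi 150
```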
validation group (extractText + extractMarkdown only)
Pass validation: false to disable all validation.
| Option | Type | Default | Notes |
| --------------------------- | ---------------- | ------- | -------------------------------------------------- |
| validation.minLength | number | 0 | Throws ValidationError if text is shorter; must be >= 0 |
| validation.requireSpacing | boolean | true | Rejects text outside spacing range |
| validation.minSpacing | number | 3 | Minimum space percentage |
| validation.maxSpacing | number | 30 | Maximum space percentage |
| validation.commonWords | boolean│object | true | See below |
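
To make the spacing thresholds concrete, the sketch below computes a space percentage and compares it with minSpacing and maxSpacing. It assumes the percentage means space characters divided by total characters, times 100; the library's exact formula is not given here. Both function names are hypothetical.

```javascript
// Assumption: "space percentage" = space characters / total characters * 100.
function spacePercentage(text) {
  if (text.length === 0) return 0;
  const spaces = (text.match(/ /g) || []).length;
  return (spaces / text.length) * 100;
}

// Mirrors the documented defaults: minSpacing 3, maxSpacing 30.
function passesSpacing(text, { minSpacing = 3, maxSpacing = 30 } = {}) {
  const pct = spacePercentage(text);
  return pct >= minSpacing && pct <= maxSpacing;
}

console.log(passesSpacing("Experienced engineer with ten years of work")); // true
console.log(passesSpacing("Experiencedengineerwithtenyears")); // false (no spaces)
console.log(passesSpacing("a b c d e")); // false (about 44% spaces)
```

Text with almost no spaces often comes from a PDF whose words were fused during extraction, and text with very many spaces often comes from letter-spaced output, which is why both bounds are useful.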
validation.commonWords values:
| Value | Behavior |
| ----------------------------------- | ------------------------------------------------------------------ |
| false | Disabled |
| true | Enabled with defaults: types: ['general'], minMatches: 1 |
| { types, minMatches, customWords }| Full config; all fields optional, fall back to defaults |
types may include 'general' (common English words) and/or 'resume' (CV vocabulary). customWords is an array of additional words to match.
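The check can be pictured as counting how many known words appear in the text and comparing that count with minMatches. The sketch below uses tiny stand-in word lists and counts distinct matches; the library's real lists and its counting rule are not shown in this README, so both are assumptions. All names are hypothetical.

```javascript
// Stand-in word lists for illustration; the library's real lists are larger.
const WORD_LISTS = {
  general: ["the", "and", "of", "to", "in"],
  resume: ["experience", "education", "skills"],
};

// Count distinct vocabulary words found in the text (assumed counting rule).
function countCommonWords(text, { types = ["general"], customWords = [] } = {}) {
  const listed = types.flatMap((type) => WORD_LISTS[type] || []);
  const vocabulary = new Set([...listed, ...customWords].map((w) => w.toLowerCase()));
  const found = new Set();
  for (const word of text.toLowerCase().match(/[a-z']+/g) || []) {
    if (vocabulary.has(word)) found.add(word);
  }
  return found.size;
}

function passesCommonWords(text, config = {}) {
  const { minMatches = 1 } = config;
  return countCommonWords(text, config) >= minMatches;
}

const sampleText = "Skills and experience in the field";
console.log(countCommonWords(sampleText, { types: ["general", "resume"] })); // 5
console.log(passesCommonWords("xqzt wvrk", { minMatches: 1 })); // false
```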
transform group (extractText + extractMarkdown only)
| Option | Type | Default | Notes |
| ----------------------------------- | ---------------- | ------- | ---------------------------------------------------------------- |
| transform.removeNewlines | boolean | false | Converts newlines to spaces |
| transform.removeUrls | boolean | false | Removes HTTP/HTTPS URLs |
| transform.removeDocumentArtifacts | boolean | true | Strips page numbers, horizontal rules, headers/footers |
| transform.normalizeBullets | boolean│string | true | false = off; true = normalize to '-'; string = that marker |
| transform.maxLength | number | — | Truncates text if exceeded (applied last, after all other transforms); must be >= 0 |
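
The following sketch shows what the three normalizeBullets settings mean and why maxLength running last matters: the length limit applies to the text after other transforms have changed it. It is not the library's implementation, and the set of bullet characters it recognizes is an assumption. Function names are hypothetical.

```javascript
// Illustrative bullet characters; the set the library recognizes is not documented here.
const BULLET_PATTERN = /^([ \t]*)[•◦▪‣·*][ \t]+/gm;

// false = leave text alone; true = use "-"; string = use that marker.
function normalizeBulletMarkers(text, setting = true) {
  if (setting === false) return text;
  const marker = setting === true ? "-" : setting;
  return text.replace(BULLET_PATTERN, (_match, indent) => `${indent}${marker} `);
}

function applyTransforms(text, { normalizeBullets = true, maxLength } = {}) {
  let out = normalizeBulletMarkers(text, normalizeBullets);
  // Truncation is applied last, after all other transforms.
  if (typeof maxLength === "number") out = out.slice(0, maxLength);
  return out;
}

console.log(applyTransforms("• Node.js\n  * SQL", { normalizeBullets: "+" }));
// + Node.js
//   + SQL
```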
markdown group (extractMarkdown only)
| Option | Type | Default | Notes |
| ---------------------------- | ---------------- | ------- | ----- |
| markdown.preserveStructure | boolean | true | |
| markdown.headingStyle | 'atx'│'setext' | 'atx' | |
| markdown.bulletListMarker | string | '-' | |
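
For readers unfamiliar with the two headingStyle values: ATX headings use leading `#` characters, while setext headings underline the text with `=` or `-`. The sketch below renders both. Setext syntax only exists for levels 1 and 2 in CommonMark, so the sketch falls back to ATX for deeper levels; whether the library does the same is an assumption. `renderHeading` is a hypothetical name.

```javascript
// Illustration of the two heading styles named by markdown.headingStyle.
function renderHeading(text, level, style = "atx") {
  if (style === "setext" && level <= 2) {
    // Level 1 is underlined with "=", level 2 with "-".
    const underline = (level === 1 ? "=" : "-").repeat(Math.max(text.length, 3));
    return `${text}\n${underline}`;
  }
  // ATX: one "#" per heading level.
  return `${"#".repeat(level)} ${text}`;
}

console.log(renderHeading("Summary", 2));
// ## Summary
console.log(renderHeading("Summary", 2, "setext"));
// Summary
// -------
```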
Examples
Basic extraction
const result = await extractText("./resume.pdf");
console.log(result.text);
console.log(result.metadata.fileType); // 'pdf'
console.log(result.metadata.extractor); // 'pdftotext-text'
Validation and transforms
const result = await extractText("./resume.pdf", {
  validation: {
    minLength: 200,
    requireSpacing: true,
    commonWords: { types: ["general", "resume"], minMatches: 3 },
  },
  transform: {
    removeUrls: true,
    normalizeBullets: "-",
  },
});
Disable validation
// validation: false skips all validation; text is returned regardless
const result = await extractText("./document.pdf", {
  validation: false,
});
Raw mode: skip the pipeline
// raw: true returns extractor output directly, no cleaning or validation
const result = await extractText("./document.pdf", { raw: true });
OCR
// OCR only: skip text extractors
const ocrOnly = await extractText("./scanned.pdf", {
  useOCR: "only",
  ocr: { language: "eng", dpi: 300 },
});
// OCR fallback: try text first, OCR if no text found
const withFallback = await extractText("./mixed.pdf", {
  useOCR: "fallback",
});
// Shared config object: ocr group ignored when useOCR is false, no error thrown
const withoutOcr = await extractText("./document.pdf", {
  useOCR: false,
  ocr: { language: "fra" }, // ignored silently
});
HTML extraction
const result = await extractHTML("./document.docx");
console.log(result.text); // HTML string
// extractHTML only accepts top-level + ocr groups
// Passing validation/transform/markdown throws ValidationError
await extractHTML("./doc.pdf", { validation: { minLength: 0 } }); // throws
Markdown extraction
const result = await extractMarkdown("./document.pdf", {
  markdown: {
    headingStyle: "atx",
    bulletListMarker: "-",
  },
  validation: {
    commonWords: false,
  },
});
Force file type
// Force type when detection fails
const result = await extractText("./misnamed-file.xyz", {
  forceFileType: "pdf",
});
Custom extractor chain
const result = await extractText("./document.pdf", {
  extractors: { pdf: ["pdfjs-text"] },
});
Restrict allowed file types
// Throws FileTypeError if file type not in list
const result = await extractText("./file.doc", {
  allowedTypes: ["pdf", "docx"],
});
Batch processing
const fs = require("fs").promises;
const path = require("path");
const { extractText } = require("@wdelhagen/textprep");
async function processDirectory(dirPath) {
  const files = await fs.readdir(dirPath);
  const results = [];
  for (const file of files) {
    try {
      const result = await extractText(path.join(dirPath, file), {
        maxFileSize: 20 * 1024 * 1024,
        validation: { commonWords: true },
      });
      results.push({ file, success: true, length: result.text.length });
    } catch (error) {
      results.push({ file, success: false, error: error.message });
    }
  }
  return results;
}
Error handling
const { extractText, errors } = require("@wdelhagen/textprep");
const { FileTypeError, ExtractionError, ValidationError } = errors;
try {
  await extractText("./document.pdf");
} catch (error) {
  if (error instanceof FileTypeError) {
    // File type unsupported, size exceeded, or file not found
    console.error(error.message, error.filePath);
  } else if (error instanceof ExtractionError) {
    // All extractors failed
    console.error(error.message, error.context);
  } else if (error instanceof ValidationError) {
    // Text failed validation, or invalid options passed
    console.error(error.message, error.errors);
  }
}
Supported File Types
| Format | Extensions | Text | HTML | HTML extractors |
| -------------- | ----------------------------- | ---- | ---- | ---------------------------- |
| PDF | .pdf | ✅ | ✅ | pdftohtml |
| DOCX | .docx | ✅ | ✅ | mammoth |
| DOC | .doc | ✅ | ✅ | libreoffice |
| RTF | .rtf | ✅ | ✅ | libreoffice |
| Plain text | .txt | ✅ | ❌ | — (throws ExtractionError) |
| Images | .jpg, .jpeg, .png, .bmp, .webp, .pbm | ✅ | ❌ | — (throws ExtractionError) |
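
Since extractHTML throws ExtractionError for plain text and images, a caller that handles mixed input may want to pick the output kind up front. The sketch below encodes the HTML column of the table above; it assumes the file type names match the values listed for forceFileType ('pdf', 'docx', 'doc', 'rtf', 'txt', 'image'). `pickOutputKind` is a hypothetical helper name.

```javascript
// File types with an HTML extractor, per the Supported File Types table.
const HTML_CAPABLE_TYPES = new Set(["pdf", "docx", "doc", "rtf"]);

// Returns "html" when extractHTML / extractMarkdown can be used,
// otherwise "text" so the caller falls back to extractText.
function pickOutputKind(fileType) {
  return HTML_CAPABLE_TYPES.has(fileType) ? "html" : "text";
}

console.log(pickOutputKind("docx")); // html
console.log(pickOutputKind("image")); // text
```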
Metadata
All extraction functions return { text, metadata }:
{
  text: "...",
  metadata: {
    fileType: "pdf",
    fileSize: 102400,
    extractor: "pdftotext-text",
    totalDuration: 145, // ms
    extractionDuration: 120, // ms
    processingSteps: [...],
    options: { /* normalized options */ },
    errorSummary: { hasIssues: false, errorCount: 0, warningCount: 0 },
    // content analysis
    content: {
      finalLength: 4800,
      originalLength: 5100,
      estimatedWordCount: 820,
      whitespaceRatio: 16,
      lineCount: 52
    }
  }
}
Testing
npm test # Unit + lightweight integration (default project)
npm run test:unit # Unit tests only
npm run test:integration # Integration tests only
npm run test:heavy # Heavy integration (OCR, PDF tools, LibreOffice)
npm run test:all # All test suites
npm run test:coverage # With coverage
Publishing
npm version patch # bumps version, commits, creates git tag
git push origin dev --tags
npm publish --access public
License
MIT — see LICENSE.
Changelog
v0.3.0
- Namespaced options API — clean break from flat bag of 20+ options
- Options organized into ocr, validation, transform, markdown groups
- Unknown option keys throw ValidationError
- raw: true option bypasses pipeline entirely
- validation: false disables all validation
- normalizeBullets accepts custom marker character
- transform.removeDocumentArtifacts option
- extractHTML throws ExtractionError for file types with no HTML extractor (removed <pre> fallback)
- DOC and RTF HTML extraction via LibreOffice
- Removed requireSpacing suppression bug when forceFileType is set
- commonWords validation now enabled by default
- extractMarkdown now propagates FileTypeError unchanged (file not found, size limit) instead of wrapping as ExtractionError; error types now consistent across all three extraction functions
- 'image' added as a first-class registry key; getExtractors('image', 'text') now returns the OCR extractor chain
- processText exported as public API for running the text pipeline on externally-sourced text
- validateOptions exported as public API for method-agnostic options validation
- IMAGE_EXTENSIONS exported: canonical list of supported image file extensions
- maxLength moved from validation group to transform group; truncation now applies after all other transforms
v0.2.2
- Common words validation
- Improved spacing validation with custom thresholds
- Enhanced text processing pipeline
v0.1.0
- Initial release
