@wdelhagen/textprep

v0.3.2

Document text extraction with pluggable extractors. Supports PDF, DOCX, DOC, RTF, TXT, and image files with OCR capabilities.

A Node.js library for extracting text, HTML, and Markdown from PDF, DOCX, DOC, RTF, TXT, and image files. Features pluggable extractors, OCR support, and a comprehensive text processing pipeline.

Installation

npm install @wdelhagen/textprep

System Dependencies

macOS (via Homebrew):

brew install poppler              # PDF processing (pdftotext, pdftohtml, pdftoppm)
brew install tesseract            # OCR
brew install unrtf                # RTF processing
brew install --cask libreoffice   # DOC/DOCX HTML extraction

Ubuntu/Debian:

sudo apt-get install poppler-utils tesseract-ocr unrtf libreoffice

Windows: Install Poppler, Tesseract, and LibreOffice.

Quick Start

const { extractText, extractHTML, extractMarkdown } = require("@wdelhagen/textprep");

// CommonJS has no top-level await, so the calls run inside an async function.
async function main() {
  const text = await extractText("./document.pdf");
  console.log(text.text);

  const html = await extractHTML("./document.docx");
  console.log(html.text); // HTML string

  const md = await extractMarkdown("./document.pdf");
  console.log(md.text); // Markdown string
}

main().catch(console.error);

API Reference

extractText(filePath, options?)

Extracts plain text. Runs the full text processing pipeline (validation + transforms) unless raw: true is set.

extractHTML(filePath, options?)

Extracts HTML. Accepts top-level and ocr options only — validation, transform, and markdown groups throw ValidationError. Throws ExtractionError for file types with no HTML extractor (TXT, images).

extractMarkdown(filePath, options?)

Extracts HTML, converts it to Markdown, then runs the text pipeline. Accepts all option groups. Throws FileTypeError (file not found, size limit, unsupported type), ExtractionError (no HTML extractor for file type), or ValidationError (invalid options, validation failure), consistent with extractText and extractHTML.

detectFileType(filePath)

Returns { detectedType, extensionType, mismatch, metadata }.
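As an illustration of how the returned fields might be consumed, the sketch below turns a detection result into a log line. The helper `describeDetection` is hypothetical, and it assumes `mismatch` is a boolean that is true when `detectedType` and `extensionType` disagree; the README lists the field names but not their exact semantics.

```javascript
// Hypothetical helper, not part of @wdelhagen/textprep.
// Assumes `mismatch` is a boolean flag; only the field names are documented.
function describeDetection({ detectedType, extensionType, mismatch }) {
  if (mismatch) {
    return `content looks like '${detectedType}' but extension says '${extensionType}'`;
  }
  return `file type '${detectedType}' matches its extension`;
}

// Hand-written sample in the documented shape (not real library output).
const sampleDetection = {
  detectedType: "pdf",
  extensionType: "txt",
  mismatch: true,
  metadata: {},
};

console.log(describeDetection(sampleDetection));
```

A mismatch like this is the kind of case where passing forceFileType may be worth considering.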

getExtractors(fileType?, outputType?)

Returns the extractor chain for a file type and output type, or the full registry. Pass 'ocr' or 'image' to query OCR extractors.

processText(text, options?, context?)

Runs the text processing pipeline (validation + cleaning + transforms) on a string directly — useful when text comes from a source other than the extraction functions. Accepts the same validation and transform option groups as extractText. Returns { text, errors, metadata }.

validateOptions(options)

Validates an options object for structure, types, and unknown keys. Throws ValidationError on any violation. This is method-agnostic — all option groups (ocr, validation, transform, markdown) are accepted. Use this for pre-flight validation or building tools on top of the library.

const { validateOptions, errors } = require("@wdelhagen/textprep");

// Throws ValidationError: Unknown option 'typo'
validateOptions({ typo: true });

// Throws ValidationError: ocr.dpi must be a number
validateOptions({ ocr: { dpi: "high" } });

// Valid — no error thrown
validateOptions({ useOCR: "fallback", validation: { minLength: 100 } });
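For pre-flight checks in a tool or UI, it can be convenient to get a result object instead of an exception. The sketch below wraps any throwing validator that behaves like validateOptions. `checkOptions` and `fakeValidate` are hypothetical names; the stand-in validator exists only so the example runs without the library installed.

```javascript
// Hypothetical wrapper: turns a throwing validator into a result object.
// `validate` is expected to behave like validateOptions (throw on bad input).
// Note: this catches every error, not only ValidationError.
function checkOptions(validate, options) {
  try {
    validate(options);
    return { ok: true, message: null };
  } catch (error) {
    return { ok: false, message: error.message };
  }
}

// Stand-in validator so the sketch is self-contained. With the library:
//   checkOptions(require("@wdelhagen/textprep").validateOptions, options)
function fakeValidate(options) {
  if ("typo" in options) throw new Error("Unknown option 'typo'");
}

console.log(checkOptions(fakeValidate, { typo: true }));
console.log(checkOptions(fakeValidate, { useOCR: "fallback" }));
```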

IMAGE_EXTENSIONS

Array of file extensions recognized as image types: ["bmp", "jpg", "jpeg", "png", "pbm", "webp"].

const { IMAGE_EXTENSIONS } = require("@wdelhagen/textprep");
const ext = filePath.split(".").pop().toLowerCase();
if (IMAGE_EXTENSIONS.includes(ext)) {
  // handle as image
}
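A slightly more defensive variant of the check above, as a sketch: `split(".").pop()` returns the whole name for a file with no extension, so a file literally named `png` would be treated as an image. Using `path.extname` avoids that. `isImagePath` is a hypothetical helper, and the extension list is copied from this README so the block runs without the library.

```javascript
const path = require("path");

// Copied from the list documented above; with the library installed, use
// the exported IMAGE_EXTENSIONS instead of this local copy.
const DOCUMENTED_IMAGE_EXTENSIONS = ["bmp", "jpg", "jpeg", "png", "pbm", "webp"];

// Hypothetical helper. path.extname returns "" when there is no extension,
// whereas split(".").pop() would return the whole file name.
function isImagePath(filePath, extensions = DOCUMENTED_IMAGE_EXTENSIONS) {
  const ext = path.extname(filePath).slice(1).toLowerCase();
  return extensions.includes(ext);
}

console.log(isImagePath("scans/page-01.PNG")); // true
console.log(isImagePath("png"));               // false
```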

Options

Options are organized into namespaced groups. Passing an unknown key at any level throws ValidationError.

Top-level (all methods)

| Option | Type | Default | Notes |
| --- | --- | --- | --- |
| useOCR | 'only'│'fallback'│false | 'fallback' | |
| maxFileSize | number | 52428800 | bytes (50 MB) |
| forceFileType | string | — | 'pdf'│'docx'│'doc'│'rtf'│'txt'│'image'; allowedTypes is still enforced when set |
| allowedTypes | string[] | — | Throws FileTypeError if file type not in list |
| extractors | object | — | Override extractor chains: { pdf: ['pdfjs-text'] } |
| debug | boolean | false | |
| raw | boolean | false | Returns extractor output directly; skips pipeline entirely |
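Since maxFileSize is given in bytes, a small conversion helper keeps option objects readable. The sketch below is illustrative: `megabytes` is a hypothetical helper, and the option values are examples rather than recommendations.

```javascript
// Hypothetical helper: convert a size in MB (1024 * 1024 bytes) to bytes.
const megabytes = (n) => n * 1024 * 1024;

// Option names come from the table above; the values are illustrative.
const topLevelOptions = {
  useOCR: "fallback",
  maxFileSize: megabytes(20),
  allowedTypes: ["pdf", "docx"],
  debug: false,
};

console.log(megabytes(50));               // 52428800, the documented default
console.log(topLevelOptions.maxFileSize); // 20971520
```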

ocr group (all methods; ignored when useOCR: false)

| Option | Type | Default | Notes |
| --- | --- | --- | --- |
| ocr.language | string | 'eng' | Tesseract language code |
| ocr.dpi | number | 300 | DPI for image conversion; must be >= 72 |
| ocr.psm | number | 1 | Page segmentation mode (0–13) |
| ocr.oem | number | — | OCR engine mode (0–3) |
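The numeric limits in this table can be checked before a call, for example in a settings form. The sketch below mirrors the documented ranges only; `findOcrRangeProblems` is a hypothetical helper, and the library's own validation (and its error messages) may differ.

```javascript
// Hypothetical pre-check mirroring the ranges documented in the ocr table.
// It illustrates the limits only; the library performs its own validation.
function findOcrRangeProblems({ dpi, psm, oem } = {}) {
  const problems = [];
  if (dpi !== undefined && dpi < 72) problems.push("ocr.dpi must be >= 72");
  if (psm !== undefined && (psm < 0 || psm > 13)) problems.push("ocr.psm must be 0-13");
  if (oem !== undefined && (oem < 0 || oem > 3)) problems.push("ocr.oem must be 0-3");
  return problems;
}

console.log(findOcrRangeProblems({ dpi: 300, psm: 1 })); // []
console.log(findOcrRangeProblems({ dpi: 50, psm: 14 }));
```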

validation group (extractText + extractMarkdown only)

Pass validation: false to disable all validation.

| Option | Type | Default | Notes |
| --- | --- | --- | --- |
| validation.minLength | number | 0 | Throws ValidationError if text is shorter; must be >= 0 |
| validation.requireSpacing | boolean | true | Rejects text outside spacing range |
| validation.minSpacing | number | 3 | Minimum space percentage |
| validation.maxSpacing | number | 30 | Maximum space percentage |
| validation.commonWords | boolean│object | true | See below |
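To get a feel for the spacing thresholds, the sketch below computes a space percentage for a string and compares it with the default 3 to 30 range. The formula (space characters divided by total characters) is an assumption; the README does not state how the library measures spacing, and both function names are hypothetical.

```javascript
// Assumption: "space percentage" = space characters / total characters * 100.
// The library's real formula may count other whitespace or differ otherwise.
function spacePercentage(text) {
  if (text.length === 0) return 0;
  const spaces = text.split(" ").length - 1;
  return (spaces * 100) / text.length;
}

// Defaults taken from the table: minSpacing 3, maxSpacing 30.
function withinSpacingRange(text, min = 3, max = 30) {
  const pct = spacePercentage(text);
  return pct >= min && pct <= max;
}

console.log(spacePercentage("ab cd"));           // 20
console.log(withinSpacingRange("nospaceshere")); // false
```

Text extracted from a badly encoded PDF often comes out with no spaces or with a space between every character, which is the kind of output a range check like this would catch.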

validation.commonWords values:

| Value | Behavior |
| --- | --- |
| false | Disabled |
| true | Enabled with defaults: types: ['general'], minMatches: 1 |
| { types, minMatches, customWords } | Full config; all fields optional, fall back to defaults |

types may include 'general' (common English words) and/or 'resume' (CV vocabulary). customWords is an array of additional words to match.
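The three accepted shapes can be pictured as shorthand for one full config. The sketch below expands them under that reading. `resolveCommonWords` is a hypothetical helper, the `types` and `minMatches` defaults come from the table, and the empty `customWords` default is an assumption.

```javascript
// Defaults for `types` and `minMatches` come from the table above.
// The empty `customWords` default is an assumption; the README does not state it.
const COMMON_WORDS_DEFAULTS = { types: ["general"], minMatches: 1, customWords: [] };

// Hypothetical helper: expand the shorthand values into a full config,
// or return null when the check is disabled.
function resolveCommonWords(value = true) {
  if (value === false) return null;
  if (value === true) return { ...COMMON_WORDS_DEFAULTS };
  return { ...COMMON_WORDS_DEFAULTS, ...value };
}

console.log(resolveCommonWords(false));
console.log(resolveCommonWords(true));
console.log(resolveCommonWords({ types: ["general", "resume"], minMatches: 3 }));
```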

transform group (extractText + extractMarkdown only)

| Option | Type | Default | Notes |
| --- | --- | --- | --- |
| transform.removeNewlines | boolean | false | Converts newlines to spaces |
| transform.removeUrls | boolean | false | Removes HTTP/HTTPS URLs |
| transform.removeDocumentArtifacts | boolean | true | Strips page numbers, horizontal rules, headers/footers |
| transform.normalizeBullets | boolean│string | true | false = off; true = normalize to '-'; string = that marker |
| transform.maxLength | number | — | Truncates text if exceeded (applied last, after all other transforms); must be >= 0 |
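The note that maxLength is applied last matters in practice: truncating before other transforms can leave far fewer characters than requested. The sketch below shows the effect with simplified stand-ins. `stripUrls` and `truncate` are hypothetical and are not the library's implementations.

```javascript
// Simplified stand-ins for illustration only; not the library's transforms.
const stripUrls = (text) => text.replace(/https?:\/\/\S+/g, "").replace(/ {2,}/g, " ");
const truncate = (text, maxLength) => text.slice(0, maxLength);

const input = "See https://example.com/docs for the full manual and examples.";

// Documented order: other transforms first, truncation last.
const truncateLast = truncate(stripUrls(input), 20);

// Reverse order: the URL fragment eats the budget and is then removed.
const truncateFirst = stripUrls(truncate(input, 20));

console.log(JSON.stringify(truncateLast));  // "See for the full man"
console.log(JSON.stringify(truncateFirst)); // "See "
```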

markdown group (extractMarkdown only)

| Option | Type | Default | Notes |
| --- | --- | --- | --- |
| markdown.preserveStructure | boolean | true | |
| markdown.headingStyle | 'atx'│'setext' | 'atx' | |
| markdown.bulletListMarker | string | '-' | |


Examples

Basic extraction

const result = await extractText("./resume.pdf");
console.log(result.text);
console.log(result.metadata.fileType);   // 'pdf'
console.log(result.metadata.extractor);  // 'pdftotext-text'

Validation and transforms

const result = await extractText("./resume.pdf", {
  validation: {
    minLength: 200,
    requireSpacing: true,
    commonWords: { types: ["general", "resume"], minMatches: 3 },
  },
  transform: {
    removeUrls: true,
    normalizeBullets: "-",
  },
});

Disable validation

// validation: false skips all validation — text is returned regardless
const result = await extractText("./document.pdf", {
  validation: false,
});

Raw mode — skip the pipeline

// raw: true returns extractor output directly, no cleaning or validation
const result = await extractText("./document.pdf", { raw: true });

OCR

// OCR only: skip text extractors
const ocrOnly = await extractText("./scanned.pdf", {
  useOCR: "only",
  ocr: { language: "eng", dpi: 300 },
});

// OCR fallback: try text first, OCR if no text found
const withFallback = await extractText("./mixed.pdf", {
  useOCR: "fallback",
});

// Shared config object: ocr group ignored when useOCR is false, no error thrown
const withoutOcr = await extractText("./document.pdf", {
  useOCR: false,
  ocr: { language: "fra" }, // ignored silently
});

HTML extraction

const result = await extractHTML("./document.docx");
console.log(result.text); // HTML string

// extractHTML only accepts top-level + ocr groups
// Passing validation/transform/markdown throws ValidationError
await extractHTML("./doc.pdf", { validation: { minLength: 0 } }); // throws

Markdown extraction

const result = await extractMarkdown("./document.pdf", {
  markdown: {
    headingStyle: "atx",
    bulletListMarker: "-",
  },
  validation: {
    commonWords: false,
  },
});

Force file type

// Force type when detection fails
const result = await extractText("./misnamed-file.xyz", {
  forceFileType: "pdf",
});

Custom extractor chain

const result = await extractText("./document.pdf", {
  extractors: { pdf: ["pdfjs-text"] },
});

Restrict allowed file types

// Throws FileTypeError if file type not in list
const result = await extractText("./file.doc", {
  allowedTypes: ["pdf", "docx"],
});

Batch processing

const fs = require("fs").promises;
const path = require("path");
const { extractText } = require("@wdelhagen/textprep");

async function processDirectory(dirPath) {
  const files = await fs.readdir(dirPath);
  const results = [];

  for (const file of files) {
    try {
      const result = await extractText(path.join(dirPath, file), {
        maxFileSize: 20 * 1024 * 1024,
        validation: { commonWords: true },
      });
      results.push({ file, success: true, length: result.text.length });
    } catch (error) {
      results.push({ file, success: false, error: error.message });
    }
  }

  return results;
}
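The records collected by processDirectory can then be tallied for a report. The sketch below is one way to do that; `summarize` is a hypothetical helper, and the sample records (including the error message) are hand-written rather than real library output.

```javascript
// Hypothetical helper: tally the { file, success, ... } records produced by
// processDirectory above.
function summarize(results) {
  const failed = results.filter((r) => !r.success);
  return {
    total: results.length,
    succeeded: results.length - failed.length,
    failures: failed.map((r) => `${r.file}: ${r.error}`),
  };
}

// Hand-written sample records in the same shape (not real library output).
const sampleResults = [
  { file: "a.pdf", success: true, length: 1200 },
  { file: "b.xyz", success: false, error: "Unsupported file type" },
];

console.log(summarize(sampleResults));
```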

Error handling

const { extractText, errors } = require("@wdelhagen/textprep");
const { FileTypeError, ExtractionError, ValidationError } = errors;

try {
  await extractText("./document.pdf");
} catch (error) {
  if (error instanceof FileTypeError) {
    // File type unsupported, size exceeded, or file not found
    console.error(error.message, error.filePath);
  } else if (error instanceof ExtractionError) {
    // All extractors failed
    console.error(error.message, error.context);
  } else if (error instanceof ValidationError) {
    // Text failed validation, or invalid options passed
    console.error(error.message, error.errors);
  } else {
    // Not one of the library's error types; let it propagate
    throw error;
  }
}
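When the same branching is needed in several places, it can be factored into a small classifier. The sketch below takes the error classes as a parameter so it runs without the library; `classifyError` and the stand-in classes are hypothetical. It also assumes the three error types do not inherit from one another, since otherwise the order of the checks would matter.

```javascript
// Hypothetical helper: map an error to a short label. `errorTypes` is expected
// to look like the library's `errors` export ({ FileTypeError, ... }).
function classifyError(error, errorTypes) {
  const { FileTypeError, ExtractionError, ValidationError } = errorTypes;
  if (error instanceof FileTypeError) return "file";
  if (error instanceof ExtractionError) return "extraction";
  if (error instanceof ValidationError) return "validation";
  return "unexpected"; // anything that is not one of the three types
}

// Stand-in classes so the sketch is self-contained. With the library:
//   classifyError(error, require("@wdelhagen/textprep").errors)
class FileTypeError extends Error {}
class ExtractionError extends Error {}
class ValidationError extends Error {}
const stubErrors = { FileTypeError, ExtractionError, ValidationError };

console.log(classifyError(new ValidationError("too short"), stubErrors)); // validation
console.log(classifyError(new TypeError("oops"), stubErrors));            // unexpected
```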

Supported File Types

| Format | Extensions | Text | HTML | HTML extractors |
| --- | --- | --- | --- | --- |
| PDF | .pdf | ✅ | ✅ | pdftohtml |
| DOCX | .docx | ✅ | ✅ | mammoth |
| DOC | .doc | ✅ | ✅ | libreoffice |
| RTF | .rtf | ✅ | ✅ | libreoffice |
| Plain text | .txt | ✅ | ❌ | — (throws ExtractionError) |
| Images | .jpg, .jpeg, .png, .bmp, .webp, .pbm | ✅ | ❌ | — (throws ExtractionError) |
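The HTML column can be transcribed into a lookup for deciding up front whether to call extractHTML or fall back to extractText. This is a sketch: `supportsHTML` is a hypothetical helper, and the keys are assumed to match the type names accepted by forceFileType.

```javascript
// Transcribed from the table above. Keys are assumed to match the type names
// used by forceFileType ('pdf', 'docx', 'doc', 'rtf', 'txt', 'image').
const HTML_SUPPORT = {
  pdf: true,
  docx: true,
  doc: true,
  rtf: true,
  txt: false,
  image: false,
};

// Hypothetical helper: unknown types are treated as unsupported.
function supportsHTML(fileType) {
  return HTML_SUPPORT[fileType] === true;
}

console.log(supportsHTML("docx"));  // true
console.log(supportsHTML("image")); // false
```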


Metadata

All extraction functions return { text, metadata }:

{
  text: "...",
  metadata: {
    fileType: "pdf",
    fileSize: 102400,
    extractor: "pdftotext-text",
    totalDuration: 145,          // ms
    extractionDuration: 120,     // ms
    processingSteps: [...],
    options: { /* normalized options */ },
    errorSummary: { hasIssues: false, errorCount: 0, warningCount: 0 },
    // content analysis
    content: {
      finalLength: 4800,
      originalLength: 5100,
      estimatedWordCount: 820,
      whitespaceRatio: 16,
      lineCount: 52
    }
  }
}
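A couple of derived figures are often handy when logging results. The sketch below computes them from the fields shown above, using a hand-written sample with the same values. `summarizeMetadata` is a hypothetical helper, and reading `totalDuration - extractionDuration` as time spent outside extraction is an assumption.

```javascript
// Hypothetical helper working on the metadata shape shown above.
// Assumption: totalDuration - extractionDuration approximates the time
// spent outside the extractor (validation, transforms, bookkeeping).
function summarizeMetadata(metadata) {
  const { originalLength, finalLength } = metadata.content;
  return {
    charactersRemoved: originalLength - finalLength,
    nonExtractionMs: metadata.totalDuration - metadata.extractionDuration,
  };
}

// Hand-written sample using the values from the example above.
const sampleMetadata = {
  totalDuration: 145,
  extractionDuration: 120,
  content: { finalLength: 4800, originalLength: 5100 },
};

console.log(summarizeMetadata(sampleMetadata));
// { charactersRemoved: 300, nonExtractionMs: 25 }
```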

Testing

npm test                 # Unit + lightweight integration (default project)
npm run test:unit        # Unit tests only
npm run test:integration # Integration tests only
npm run test:heavy       # Heavy integration (OCR, PDF tools, LibreOffice)
npm run test:all         # All test suites
npm run test:coverage    # With coverage

Publishing

npm version patch    # bumps version, commits, creates git tag
git push origin dev --tags
npm publish --access public

License

MIT — see LICENSE.

Changelog

v0.3.0

  • Namespaced options API — clean break from flat bag of 20+ options
  • Options organized into ocr, validation, transform, markdown groups
  • Unknown option keys throw ValidationError
  • raw: true option bypasses pipeline entirely
  • validation: false disables all validation
  • normalizeBullets accepts custom marker character
  • transform.removeDocumentArtifacts option
  • extractHTML throws ExtractionError for file types with no HTML extractor (removed <pre> fallback)
  • DOC and RTF HTML extraction via LibreOffice
  • Fixed a bug that suppressed requireSpacing validation when forceFileType is set
  • commonWords validation now enabled by default
  • extractMarkdown now propagates FileTypeError unchanged (file not found, size limit) instead of wrapping as ExtractionError — error types now consistent across all three extraction functions
  • 'image' added as a first-class registry key; getExtractors('image', 'text') now returns the OCR extractor chain
  • processText exported as public API for running the text pipeline on externally-sourced text
  • validateOptions exported as public API for method-agnostic options validation
  • IMAGE_EXTENSIONS exported — canonical list of supported image file extensions
  • maxLength moved from validation group to transform group — truncation now applies after all other transforms

v0.2.2

  • Common words validation
  • Improved spacing validation with custom thresholds
  • Enhanced text processing pipeline

v0.1.0

  • Initial release