@wdelhagen/textprep

v0.2.2

Document text extraction with pluggable extractors. Supports PDF, DOCX, DOC, RTF, TXT, and image files with OCR capabilities.

A robust Node.js library for extracting text, HTML, and Markdown from various document formats including PDF, DOCX, DOC, RTF, and plain text files. Features pluggable extractors, OCR support, and comprehensive text processing pipelines.

Configuration Options

Core Extraction Options

| Option | Type | Default | Description |
| ---------------- | ------- | ------------ | -------------------------------------------- |
| useOCR | string | 'fallback' | OCR mode: 'only', 'fallback', or false |
| removeNewlines | boolean | false | Remove line breaks from extracted text |
| removeUrls | boolean | true | Remove HTTP/HTTPS URLs from text |
| minLength | number | 0 | Minimum required text length |
| maxLength | number | undefined | Maximum allowed text length (truncates) |
| maxFileSize | number | 52428800 | Maximum file size limit in bytes (50MB) |
| requireSpacing | boolean | true | Validate text has proper spacing |
| debug | boolean | false | Enable verbose logging |

Text Validation Options

| Option | Type | Default | Description |
| --------------------------- | -------- | ------------- | ----------------------------------------------------- |
| validateCommonWords | boolean | true | Enable common words validation |
| commonWordsTypes | string[] | ['general'] | Word list types: 'general', 'resume' |
| commonWordsMinMatches | number | 1 | Minimum number of words required to match |
| commonWordsCustomWords | string[] | [] | Optional custom word list for domain-specific content |

Common words validation helps detect corrupted or non-text content by checking for recognizable English words:

  • General words: Common English words (the, and, for, that, etc.)
  • Resume words: CV/resume-specific vocabulary (experience, education, skills, etc.)
  • Custom words: Provide your own domain-specific word list

Example:

// Validate resumes/CVs with both general and resume-specific words
const result = await extractText('./resume.pdf', {
  commonWordsTypes: ['general', 'resume'],
  commonWordsMinMatches: 3  // Require at least 3 matching words
});

// Technical documents with custom vocabulary
const techDoc = await extractText('./api-docs.pdf', {
  commonWordsTypes: ['general'],
  commonWordsCustomWords: ['api', 'endpoint', 'authentication', 'jwt'],
  commonWordsMinMatches: 2
});

// Disable common words validation
const lenient = await extractText('./document.pdf', {
  validateCommonWords: false
});
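To make the idea concrete, here is a minimal sketch of how a common-words check could work. It is not the library's implementation: the word lists are tiny hypothetical stand-ins, the helper names `countCommonWords` and `passesCommonWords` are invented for illustration, and counting distinct matching words (rather than every occurrence) is an assumption.

```javascript
// Hypothetical word lists; the library's real lists are not shown in this README.
const WORD_LISTS = {
  general: ["the", "and", "for", "that", "with"],
  resume: ["experience", "education", "skills"],
};

// Count how many distinct vocabulary words appear in the text.
function countCommonWords(text, { types = ["general"], customWords = [] } = {}) {
  const vocabulary = new Set([
    ...types.flatMap((type) => WORD_LISTS[type] ?? []),
    ...customWords.map((word) => word.toLowerCase()),
  ]);
  const tokens = text.toLowerCase().match(/[a-z]+/g) ?? [];
  const found = new Set(tokens.filter((token) => vocabulary.has(token)));
  return found.size;
}

// True when the text contains at least `minMatches` distinct vocabulary words.
function passesCommonWords(text, { minMatches = 1, ...rest } = {}) {
  return countCommonWords(text, rest) >= minMatches;
}
```

Text that is mostly symbols or binary debris yields few or no matches, which is why a low threshold such as 1 to 3 words is usually enough to flag corrupted output.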

OCR Options

| Option | Type | Default | Description |
| ------------- | ------ | ----------- | -------------------------------- |
| ocrLanguage | string | 'eng' | Tesseract language code |
| dpi | number | 300 | DPI for image conversion |
| psm | number | 1 | Tesseract page segmentation mode |
| oem | number | undefined | Tesseract OCR engine mode |
| whitelist | string | undefined | Tesseract character whitelist |
| blacklist | string | undefined | Tesseract character blacklist |

Markdown-Specific Options

| Option | Type | Default | Description |
| ------------------- | ------- | ------- | ----------------------------------------------------- |
| preserveStructure | boolean | true | Maintain document structure |
| headingStyle | string | 'atx' | Heading style: 'atx' (##) or 'setext' (underline) |
| bulletListMarker | string | '-' | Bullet list marker character |

File Type Options

| Option | Type | Default | Description |
| --------------- | -------- | ----------- | ---------------------------------- |
| forceFileType | string | undefined | Force specific file type detection |
| allowedTypes | string[] | undefined | Restrict to specific file types |

Extractor Options

| Option | Type | Default | Description |
| ------------------ | ------ | ------- | ------------------------------------------------- |
| extractors | object | {} | Override extractor chains for specific file types |
| extractorOptions | object | {} | Pass options directly to underlying extractors |

Extractor-Specific Options

These options can be passed via extractorOptions:

PDF.js Options

| Option | Type | Default | Description |
| ---------- | ------ | ----------- | ------------------------ |
| maxPages | number | undefined | Maximum pages to extract |
| password | string | undefined | PDF password |

Mammoth Options (DOCX)

| Option | Type | Default | Description |
| ------------------------- | -------- | ----------- | ---------------------------------- |
| styleMap | Array | undefined | Style mapping rules for conversion |
| includeDefaultStyleMap | boolean | true | Include default style mapping |
| convertImage | function | undefined | Image conversion function |
| ignoreEmptyParagraphs | boolean | false | Skip empty paragraphs |
| includeEmbeddedStyleMap | boolean | true | Use embedded style map |

Text File Options

| Option | Type | Default | Description |
| ---------- | ------ | ----------- | -------------------------------------------- |
| encoding | string | undefined | Force specific encoding instead of detection |

Example with all options:

const result = await extractText("./document.pdf", {
  // Core options
  useOCR: "fallback",
  removeNewlines: false,
  removeUrls: false,
  minLength: 100,
  maxLength: 50000,
  maxFileSize: 100 * 1024 * 1024,
  requireSpacing: true,
  debug: true,

  // OCR options
  ocrLanguage: "eng",
  dpi: 300,
  psm: 1,
  oem: 3,
  whitelist: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
  blacklist: "",

  // Markdown options (for extractMarkdown)
  preserveStructure: true,
  headingStyle: "atx",
  bulletListMarker: "-",

  // File type options
  forceFileType: "pdf",
  allowedTypes: ["pdf", "docx", "txt"],

  // Extractor options
  extractors: {
    pdf: ["tesseract-text"], // Force specific extractor chain
  },
  extractorOptions: {
    "pdfjs-text": {
      maxPages: 10,
      password: "secret",
    },
    mammoth: {
      styleMap: ['p[style-name="Header"] => h1'],
      includeDefaultStyleMap: true,
      ignoreEmptyParagraphs: false,
      includeEmbeddedStyleMap: true,
    },
    "fs-text": {
      encoding: "utf8",
    },
  },
});
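With this many options, it can help to keep reusable presets and merge them before each call. The sketch below is one possible way to do that, not part of the library: `buildOptions` is a hypothetical helper, the defaults are copied from the tables above, and merging `extractorOptions` per extractor name is a design assumption.

```javascript
// Defaults copied from the option tables in this README.
const DEFAULTS = Object.freeze({
  useOCR: "fallback",
  removeNewlines: false,
  removeUrls: true,
  minLength: 0,
  maxFileSize: 50 * 1024 * 1024, // 52428800 bytes (50MB)
  requireSpacing: true,
  debug: false,
});

// Hypothetical helper: later presets override earlier ones, and
// extractorOptions is merged per extractor instead of being replaced.
function buildOptions(...presets) {
  const merged = { ...DEFAULTS, extractorOptions: {} };
  for (const preset of presets) {
    const { extractorOptions = {}, ...rest } = preset;
    Object.assign(merged, rest);
    for (const [name, options] of Object.entries(extractorOptions)) {
      merged.extractorOptions[name] = {
        ...merged.extractorOptions[name],
        ...options,
      };
    }
  }
  return merged;
}

// Usage (inside an async function, with the library installed):
//   const result = await extractText("./document.pdf", buildOptions(presetA, presetB));
```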

Features

  • Multi-format support: PDF, DOCX, DOC, RTF, TXT, and image files
  • Multiple output formats: Extract as text, HTML, or Markdown
  • OCR integration: Tesseract OCR for image-only documents and scanned PDFs
  • Pluggable extractors: Multiple extraction methods with automatic fallback
  • Text processing pipeline: Advanced cleaning, validation, and normalization
  • Error handling: Comprehensive error recovery and detailed error information
  • Performance tracking: Detailed metadata and timing information
  • Debug mode: Verbose logging for troubleshooting

Installation

npm install @wdelhagen/textprep

System Dependencies

The library requires several system utilities for full functionality:

macOS (via Homebrew):

brew install poppler         # For PDF processing (pdftotext, pdftohtml, pdftoppm)
brew install tesseract       # For OCR capabilities
brew install unrtf          # For RTF file processing
brew install --cask libreoffice  # For DOC file processing

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install poppler-utils tesseract-ocr unrtf libreoffice

Windows:

Quick Start

const {
  extractText,
  extractHTML,
  extractMarkdown,
} = require("@wdelhagen/textprep");

// CommonJS modules do not support top-level await, so the calls are
// wrapped in an async function.
async function main() {
  // Extract plain text
  const result = await extractText("./document.pdf");
  console.log(result.text);
  console.log(result.metadata.fileType); // 'pdf'

  // Extract HTML
  const htmlResult = await extractHTML("./document.docx");
  console.log(htmlResult.text); // HTML content

  // Extract Markdown
  const mdResult = await extractMarkdown("./document.pdf");
  console.log(mdResult.text); // Markdown content
}

main().catch(console.error);

API Reference

extractText(filePath, options)

Extracts plain text from supported document formats.

Parameters:

  • filePath (string): Path to the document file
  • options (object, optional): Extraction options

Returns: Promise<{text: string, metadata: object}>

Basic usage:

const result = await extractText("./document.pdf");
console.log(`Extracted ${result.text.length} characters`);
console.log(`File type: ${result.metadata.fileType}`);
console.log(`Extraction method: ${result.metadata.extractor}`);

With options:

const result = await extractText("./document.pdf", {
  useOCR: "fallback", // 'only', 'fallback', or false
  removeNewlines: false, // Remove line breaks
  removeUrls: false, // Remove HTTP/HTTPS URLs
  minLength: 100, // Minimum text length required
  maxLength: 50000, // Maximum text length allowed
  maxFileSize: 100 * 1024 * 1024, // 100MB file size limit
  debug: true, // Enable verbose logging
});

extractHTML(filePath, options)

Extracts HTML content from supported document formats.

Parameters:

  • filePath (string): Path to the document file
  • options (object, optional): Extraction options

Returns: Promise<{text: string, metadata: object}>

const result = await extractHTML("./document.docx");
// For formats without native HTML support, returns text wrapped in <pre> tags
console.log(result.metadata.htmlFallback); // true if fallback was used

extractMarkdown(filePath, options)

Extracts content as Markdown from supported document formats.

Parameters:

  • filePath (string): Path to the document file
  • options (object, optional): Extraction options including Markdown-specific settings

Returns: Promise<{text: string, metadata: object}>

const result = await extractMarkdown("./document.pdf", {
  preserveStructure: true, // Maintain document structure
  headingStyle: "atx", // 'atx' (##) or 'setext' (underline)
  bulletListMarker: "-", // Bullet list marker character
});
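For readers unfamiliar with the two heading styles, the sketch below shows how the same heading looks in each. `renderHeading` is a hypothetical helper for illustration, not the library's converter. Setext syntax only exists for levels 1 and 2, so falling back to ATX for deeper levels is an assumption about sensible behavior.

```javascript
// Render a Markdown heading in ATX ("## Title") or setext (underlined) style.
function renderHeading(text, level, headingStyle = "atx") {
  if (headingStyle === "setext" && level <= 2) {
    // Level 1 is underlined with "=", level 2 with "-".
    const underline = (level === 1 ? "=" : "-").repeat(text.length);
    return `${text}\n${underline}`;
  }
  // ATX: one "#" per heading level.
  return `${"#".repeat(level)} ${text}`;
}
```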

detectFileType(filePath)

Detects file type using magic bytes with fallback to file extension.

Parameters:

  • filePath (string): Path to the file

Returns: Promise<{detectedType: string, extensionType: string, mismatch: boolean, metadata: object}>

const typeInfo = await detectFileType("./unknown-file");
console.log(`Detected: ${typeInfo.detectedType}`);
console.log(`Extension suggests: ${typeInfo.extensionType}`);
console.log(`Mismatch: ${typeInfo.mismatch}`);
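The following sketch illustrates the general technique of magic-byte detection with an extension fallback. It is an approximation for illustration, not the library's code: `sniffFileType` and the `SIGNATURES` table are hypothetical, and treating every ZIP container as DOCX is a deliberate simplification. The returned field names mirror the documented return shape.

```javascript
const path = require("path");

// Well-known file signatures (magic bytes) at the start of a file.
const SIGNATURES = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "docx", bytes: [0x50, 0x4b, 0x03, 0x04] }, // ZIP container (simplified)
  { type: "doc", bytes: [0xd0, 0xcf, 0x11, 0xe0] }, // OLE2 compound file
  { type: "rtf", bytes: [0x7b, 0x5c, 0x72, 0x74, 0x66] }, // "{\rtf"
];

// Compare the leading bytes against known signatures, falling back to the
// file extension when no signature matches.
function sniffFileType(buffer, fileName) {
  const extensionType = path.extname(fileName).slice(1).toLowerCase() || null;
  const match = SIGNATURES.find(({ bytes }) =>
    bytes.every((byte, index) => buffer[index] === byte)
  );
  return {
    detectedType: match ? match.type : extensionType,
    extensionType,
    mismatch: Boolean(match && extensionType && match.type !== extensionType),
  };
}
```

A mismatch (for example, PDF bytes in a file named `report.txt`) is a useful signal that a file was renamed or mislabeled before upload.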

getExtractors(fileType?, outputType?)

Lists available extractors for a given file type and output format.

Parameters:

  • fileType (string, optional): File type ('pdf', 'docx', 'doc', 'rtf', 'txt', 'image')
  • outputType (string, optional): Output type ('text' or 'html'), defaults to 'text'

Returns: Array of extractor names or complete registry object

// Get all PDF text extractors
const pdfExtractors = getExtractors("pdf", "text");
// ['pdfjs-text', 'pdftotext-text', 'pdf-parse-text', 'tesseract-text']

// Get entire registry
const allExtractors = getExtractors();
console.log(allExtractors);

Options Reference

Extraction Options

| Option | Type | Default | Description |
| ---------------- | ------- | ------------ | -------------------------------------------- |
| useOCR | string | 'fallback' | OCR mode: 'only', 'fallback', or false |
| removeNewlines | boolean | false | Remove line breaks from extracted text |
| removeUrls | boolean | true | Remove HTTP/HTTPS URLs from text |
| minLength | number | 0 | Minimum required text length |
| maxLength | number | undefined | Maximum allowed text length (truncates) |
| maxFileSize | number | 52428800 | Maximum file size limit in bytes (50MB) |
| requireSpacing | boolean | true | Validate text has proper spacing |
| debug | boolean | false | Enable verbose logging |

OCR Options

| Option | Type | Default | Description |
| ------------- | ------ | ------- | ------------------------ |
| ocrLanguage | string | 'eng' | Tesseract language code |
| dpi | number | 300 | DPI for image conversion |

Force Options

| Option | Type | Default | Description |
| --------------- | ------ | ----------- | ---------------------------------- |
| forceFileType | string | undefined | Force specific file type detection |

Extractor-Specific Options

Pass options directly to underlying extractors:

const result = await extractText("./document.pdf", {
  extractorOptions: {
    "pdfjs-dist": {
      // pdfjs-dist specific options
    },
    mammoth: {
      // mammoth specific options
    },
    tesseract: {
      // tesseract specific options
    },
  },
});

Supported File Types

| Format | Extensions | Text Extraction | HTML Extraction | Extractors Used |
| ----------------- | ----------------------------- | --------------- | --------------- | ------------------------------------------- |
| PDF | .pdf | ✅ | ✅ | pdfjs-dist, pdftotext, pdf-parse, tesseract |
| Word (Modern) | .docx | ✅ | ✅ | mammoth, textutil, tesseract |
| Word (Legacy) | .doc | ✅ | ❌* | libreoffice, textutil |
| Rich Text | .rtf | ✅ | ❌* | unrtf, textutil |
| Plain Text | .txt | ✅ | ❌* | fs (encoding detection) |
| Images | .jpg, .png, .bmp, .webp, .pbm | ✅ | ❌* | tesseract |

* HTML extraction falls back to text wrapped in <pre> tags
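As an illustration of what such a fallback involves, the sketch below wraps plain text in `<pre>` tags after escaping HTML-sensitive characters. `textToPreHtml` is a hypothetical helper; whether and how the library escapes the text is not stated in this README, so treat the escaping step as an assumption.

```javascript
// Wrap plain text in <pre> tags, escaping characters that HTML would
// otherwise interpret as markup. "&" must be escaped first.
function textToPreHtml(text) {
  const escaped = text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return `<pre>${escaped}</pre>`;
}
```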

Usage Examples

Basic Text Extraction

const { extractText } = require("@wdelhagen/textprep");

async function extractFromFile(filePath) {
  try {
    const result = await extractText(filePath);
    console.log(`Successfully extracted ${result.text.length} characters`);
    console.log(`Processing time: ${result.metadata.totalDuration}ms`);
    return result.text;
  } catch (error) {
    console.error("Extraction failed:", error.message);
    if (error.context) {
      console.error("Context:", error.context);
    }
  }
}

OCR Processing

// OCR only mode - force OCR even for text-based documents
const ocrResult = await extractText("./scanned-document.pdf", {
  useOCR: "only",
  ocrLanguage: "eng",
  dpi: 300,
});

// OCR fallback - try text extraction first, OCR if no text found
const fallbackResult = await extractText("./mixed-document.pdf", {
  useOCR: "fallback",
});
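One way to picture the three `useOCR` modes is as a choice of which extractors are tried, and in what order. The sketch below is a conceptual model only, not the library's internals: `planExtractorChain` is a hypothetical function, and the extractor names are taken from the `getExtractors` example in this README.

```javascript
// Conceptual model: decide which extractors to try for a given useOCR mode.
function planExtractorChain(useOCR, textExtractors, ocrExtractors) {
  switch (useOCR) {
    case "only": // skip text extraction entirely
      return [...ocrExtractors];
    case "fallback": // try text extraction first, then OCR
      return [...textExtractors, ...ocrExtractors];
    case false: // never run OCR
      return [...textExtractors];
    default:
      throw new TypeError(`Unsupported useOCR value: ${String(useOCR)}`);
  }
}
```

Under this model, `"fallback"` costs nothing extra for text-based PDFs, because OCR only runs when the earlier extractors produce no usable text.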

Force Options for Troubleshooting

// Force file type when automatic detection fails
const result = await extractText("./misnamed-file.xyz", {
  forceFileType: "pdf",
});

// Force specific extractor chain
const forcedChainResult = await extractText("./document.pdf", {
  forceFileType: "pdf",
  extractors: {
    pdf: ["tesseract-text"], // Use OCR for PDFs
  },
});

Advanced Text Processing

const result = await extractText("./document.pdf", {
  removeNewlines: true, // Convert to single paragraph
  removeUrls: true, // Clean URLs from text
  minLength: 500, // Require at least 500 characters
  maxLength: 10000, // Truncate at 10k characters
  requireSpacing: true, // Validate proper word spacing
});
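The sketch below approximates what these options do to a string, based only on their descriptions in the options table. It is not the library's pipeline: `cleanText` is a hypothetical function, and the order of the steps, the URL pattern, and the whitespace cleanup are assumptions.

```javascript
// Approximate the documented text options on a plain string.
function cleanText(raw, options = {}) {
  const {
    removeNewlines = false,
    removeUrls = true,
    minLength = 0,
    maxLength,
  } = options;

  let text = raw;
  if (removeUrls) {
    // Drop HTTP/HTTPS URLs up to the next whitespace character.
    text = text.replace(/https?:\/\/\S+/g, "");
  }
  if (removeNewlines) {
    // Join lines into a single paragraph.
    text = text.replace(/\s*\r?\n\s*/g, " ");
  }
  // Collapse runs of spaces or tabs left behind by the removals above.
  text = text.replace(/[ \t]{2,}/g, " ").trim();

  if (text.length < minLength) {
    throw new RangeError(
      `Text has ${text.length} characters, fewer than minLength ${minLength}`
    );
  }
  // maxLength truncates rather than rejecting.
  return maxLength !== undefined ? text.slice(0, maxLength) : text;
}
```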

Batch Processing

const fs = require("fs").promises;
const path = require("path");
const { extractText } = require("@wdelhagen/textprep");

async function processDirectory(dirPath) {
  const files = await fs.readdir(dirPath);
  const results = [];

  for (const file of files) {
    const filePath = path.join(dirPath, file);
    try {
      const result = await extractText(filePath, {
        maxFileSize: 20 * 1024 * 1024, // 20MB limit
        useOCR: "fallback",
      });

      results.push({
        file,
        success: true,
        textLength: result.text.length,
        extractor: result.metadata.extractor,
        duration: result.metadata.totalDuration,
      });
    } catch (error) {
      results.push({
        file,
        success: false,
        error: error.message,
      });
    }
  }

  return results;
}
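Once a batch has run, the results array can be condensed into a short report. The helper below is a hypothetical addition (`summarizeBatch` is not part of the library) that works on the entry shape produced by `processDirectory` above.

```javascript
// Summarize the entries returned by processDirectory().
function summarizeBatch(results) {
  const summary = {
    total: results.length,
    succeeded: 0,
    failed: 0,
    characters: 0,
    byExtractor: {}, // extractor name -> number of files it handled
    failures: [], // "file: error message" strings
  };

  for (const entry of results) {
    if (entry.success) {
      summary.succeeded += 1;
      summary.characters += entry.textLength;
      summary.byExtractor[entry.extractor] =
        (summary.byExtractor[entry.extractor] ?? 0) + 1;
    } else {
      summary.failed += 1;
      summary.failures.push(`${entry.file}: ${entry.error}`);
    }
  }
  return summary;
}
```

Grouping by extractor makes it easy to spot directories where most files fell through to OCR, which is usually the slowest path.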

Error Handling

const { extractText, errors } = require("@wdelhagen/textprep");
const { FileTypeError, ExtractionError, ValidationError } = errors;

async function robustExtraction(filePath) {
  try {
    return await extractText(filePath);
  } catch (error) {
    if (error instanceof FileTypeError) {
      console.error("File type issue:", error.message);
      console.error("File path:", error.filePath);
      console.error("Detected type:", error.detectedType);
    } else if (error instanceof ExtractionError) {
      console.error("Extraction failed:", error.message);
      console.error("Context:", error.context);
    } else if (error instanceof ValidationError) {
      console.error("Validation failed:", error.message);
      console.error("Validation errors:", error.errors);
    } else {
      console.error("Unexpected error:", error);
    }
    throw error;
  }
}

Debug Mode

const result = await extractText("./document.pdf", {
  debug: true, // Enables verbose logging
});

// Check debug information in metadata
if (result.metadata.debug) {
  console.log("Detection time:", result.metadata.debug.timings.detection);
  console.log(
    "Extraction attempts:",
    result.metadata.debug.extractorsAttempted,
  );
  console.log("Processing steps:", result.metadata.debug.processingSteps);
}

Metadata Structure

All extraction functions return rich metadata:

{
  text: "extracted content...",
  metadata: {
    // Basic information
    fileType: "pdf",
    fileSize: 1024576,
    extractor: "pdfjs-text",
    totalDuration: 1523,

    // Timing breakdown
    timings: {
      detection: 45,
      extraction: 1200,
      processing: 278,
      total: 1523
    },

    // Content analysis
    contentAnalysis: {
      originalLength: 15420,
      finalLength: 14890,
      compressionRatio: 0.97,
      hasText: true,
      estimatedPages: 3
    },

    // Processing details
    processingSteps: ["detectFileType", "pdfjs-text", "processText"],
    extractionAttempts: [
      {
        extractor: "pdfjs-text",
        success: true,
        duration: 1200,
        textLength: 15420
      }
    ],

    // Error information (if any warnings)
    warnings: [],
    errorSummary: {
      hasErrors: false,
      errorCount: 0,
      warningCount: 0
    }
  }
}
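As a small example of putting this metadata to use, the sketch below converts the `timings` block into percentage shares and names the slowest phase. `timingShares` is a hypothetical helper, and it assumes `timings` contains a `total` field plus one numeric field per phase, as in the structure above.

```javascript
// Express each phase in metadata.timings as a percentage of the total.
function timingShares(metadata) {
  const { total, ...phases } = metadata.timings;
  const shares = {};
  for (const [phase, ms] of Object.entries(phases)) {
    shares[phase] = total > 0 ? Math.round((ms / total) * 100) : 0;
  }
  const names = Object.keys(phases);
  const slowest = names.length
    ? names.reduce((a, b) => (phases[a] >= phases[b] ? a : b))
    : null;
  return { shares, slowest };
}
```

For the sample metadata above, extraction accounts for roughly four fifths of the total time (1200 of 1523 ms).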

Performance Considerations

  • File size limits: Default 50MB limit prevents memory issues
  • Extraction timeout: Operations time out after reasonable periods
  • Memory usage: Typically under 500MB for large documents
  • OCR processing: Can be slow for large images/scanned documents
  • Caching: System utility availability is cached for performance
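The caching point can be illustrated with a small memoized availability check. This is a sketch of the general technique, not the library's code: `probeCommand` and `createAvailabilityCache` are hypothetical names, and probing with `--version` is an assumption (only a failure to start the process is treated as "missing").

```javascript
const { spawnSync } = require("child_process");

// Default probe: try to start the command. A spawn error (for example
// ENOENT) means the executable could not be found or started.
function probeCommand(command) {
  const result = spawnSync(command, ["--version"], { stdio: "ignore" });
  return !result.error;
}

// Return a function that probes each command at most once and reuses
// the cached answer afterwards.
function createAvailabilityCache(probe = probeCommand) {
  const cache = new Map();
  return function isAvailable(command) {
    if (!cache.has(command)) {
      cache.set(command, probe(command));
    }
    return cache.get(command);
  };
}

// Usage:
//   const isAvailable = createAvailabilityCache();
//   if (!isAvailable("tesseract")) console.warn("OCR is unavailable");
```

Spawning a process for every document would add noticeable overhead in batch jobs, which is why caching the answer once per run is worthwhile.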

Error Types

FileTypeError

Thrown when file type detection fails or file type is not supported.

try {
  await extractText("./unknown-file");
} catch (error) {
  if (error instanceof FileTypeError) {
    console.log("File path:", error.filePath);
    console.log("Detected type:", error.detectedType);
  }
}

ExtractionError

Thrown when all extraction methods fail.

try {
  await extractText("./corrupted.pdf");
} catch (error) {
  if (error instanceof ExtractionError) {
    console.log("Context:", error.context);
    console.log("Attempts:", error.context.attempts);
  }
}

ValidationError

Thrown when extracted text fails validation.

try {
  await extractText("./tiny-file.txt", { minLength: 1000 });
} catch (error) {
  if (error instanceof ValidationError) {
    console.log("Validation errors:", error.errors);
  }
}
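For logging or API responses, the three error types can be folded into one plain record. The helper below is a hypothetical sketch (`describeError` is not exported by the library). It relies only on the error properties shown in this README and takes the error classes as an argument, so it can be called with the `errors` export from the Error Handling example.

```javascript
// Convert a thrown error into a plain, serializable record.
// `errorTypes` is expected to look like the library's `errors` export.
function describeError(error, errorTypes) {
  const { FileTypeError, ExtractionError, ValidationError } = errorTypes;

  if (error instanceof FileTypeError) {
    return {
      kind: "file-type",
      message: error.message,
      details: { filePath: error.filePath, detectedType: error.detectedType },
    };
  }
  if (error instanceof ExtractionError) {
    return {
      kind: "extraction",
      message: error.message,
      details: { context: error.context },
    };
  }
  if (error instanceof ValidationError) {
    return {
      kind: "validation",
      message: error.message,
      details: { errors: error.errors },
    };
  }
  // Anything else is reported without library-specific details.
  const message = error instanceof Error ? error.message : String(error);
  return { kind: "unexpected", message, details: {} };
}

// Usage (with the library installed):
//   const { errors } = require("@wdelhagen/textprep");
//   catch (error) { console.error(describeError(error, errors)); }
```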

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite: npm test
  6. Submit a pull request

Testing

# Run all tests
npm test

# Run with coverage
npm run test:coverage

# Run only unit tests
npm run test:unit

# Run only integration tests
npm run test:integration

License

MIT License - see LICENSE file for details.

Changelog

v0.1.0

  • Initial release
  • Support for PDF, DOCX, DOC, RTF, TXT, and image files
  • Text, HTML, and Markdown output formats
  • OCR integration with Tesseract
  • Comprehensive error handling and metadata
  • Debug mode and performance tracking