@wdelhagen/textprep

v0.2.2

Published

11 days ago

Document text extraction with pluggable extractors. Supports PDF, DOCX, DOC, RTF, TXT, and image files with OCR capabilities.

A robust Node.js library for extracting text, HTML, and Markdown from various document formats including PDF, DOCX, DOC, RTF, and plain text files. Features pluggable extractors, OCR support, and comprehensive text processing pipelines.

Configuration Options

Core Extraction Options

| Option | Type | Default | Description |
| ---------------- | ------- | ------------ | -------------------------------------------- |
| useOCR | string | 'fallback' | OCR mode: 'only', 'fallback', or false |
| removeNewlines | boolean | false | Remove line breaks from extracted text |
| removeUrls | boolean | true | Remove HTTP/HTTPS URLs from text |
| minLength | number | 0 | Minimum required text length |
| maxLength | number | undefined | Maximum allowed text length (truncates) |
| maxFileSize | number | 52428800 | Maximum file size limit in bytes (50MB) |
| requireSpacing | boolean | true | Validate text has proper spacing |
| debug | boolean | false | Enable verbose logging |

Text Validation Options

| Option | Type | Default | Description |
| --------------------------- | -------- | ------------- | ----------------------------------------------------- |
| validateCommonWords | boolean | true | Enable common words validation |
| commonWordsTypes | string[] | ['general'] | Word list types: 'general', 'resume' |
| commonWordsMinMatches | number | 1 | Minimum number of words required to match |
| commonWordsCustomWords | string[] | [] | Optional custom word list for domain-specific content |

Common words validation helps detect corrupted or non-text content by checking for recognizable English words:

General words: Common English words (the, and, for, that, etc.)
Resume words: CV/resume-specific vocabulary (experience, education, skills, etc.)
Custom words: Provide your own domain-specific word list

Example:

// Validate resumes/CVs with both general and resume-specific words
const result = await extractText('./resume.pdf', {
  commonWordsTypes: ['general', 'resume'],
  commonWordsMinMatches: 3  // Require at least 3 matching words
});

// Technical documents with custom vocabulary
const techDoc = await extractText('./api-docs.pdf', {
  commonWordsTypes: ['general'],
  commonWordsCustomWords: ['api', 'endpoint', 'authentication', 'jwt'],
  commonWordsMinMatches: 2
});

// Disable common words validation
const lenient = await extractText('./document.pdf', {
  validateCommonWords: false
});
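How the check is implemented is not documented here. The following is a minimal sketch of one plausible approach, assuming validation simply counts how many distinct list words appear in the text. The `GENERAL_WORDS` sample and both function names are hypothetical and are not part of the library:

```javascript
// Hypothetical sample list; the library's real word lists are not shown in this README.
const GENERAL_WORDS = ["the", "and", "for", "that", "with", "this"];

// Count how many words from wordList occur at least once in the text.
function countCommonWordMatches(text, wordList) {
  const tokens = new Set(text.toLowerCase().match(/[a-z]+/g) || []);
  return wordList.filter((word) => tokens.has(word)).length;
}

// Mirror the idea behind commonWordsMinMatches: pass only if enough words match.
function passesCommonWords(text, { words = GENERAL_WORDS, minMatches = 1 } = {}) {
  return countCommonWordMatches(text, words) >= minMatches;
}

console.log(passesCommonWords("The cat and the dog"));
console.log(passesCommonWords("x9$ q7# zz"));
```

Text that is mostly symbols or mis-decoded bytes tends to match none of the list words, which is why a small `minMatches` value is usually enough to flag it.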

OCR Options

| Option | Type | Default | Description |
| ------------- | ------ | ----------- | -------------------------------- |
| ocrLanguage | string | 'eng' | Tesseract language code |
| dpi | number | 300 | DPI for image conversion |
| psm | number | 1 | Tesseract page segmentation mode |
| oem | number | undefined | Tesseract OCR engine mode |
| whitelist | string | undefined | Tesseract character whitelist |
| blacklist | string | undefined | Tesseract character blacklist |
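For orientation, these options have direct counterparts among the standard Tesseract command-line flags (`-l`, `--dpi`, `--psm`, `--oem`, and `-c` for character lists). The sketch below builds such an argument list. Whether the library invokes the Tesseract CLI this way is an assumption, and `buildTesseractArgs` is a hypothetical helper, not a library export:

```javascript
// Hypothetical helper: translate the OCR options above into Tesseract CLI arguments.
function buildTesseractArgs(imagePath, options = {}) {
  const { ocrLanguage = "eng", dpi = 300, psm = 1, oem, whitelist, blacklist } = options;

  // "stdout" as the output base tells Tesseract to print the result instead of writing a file.
  const args = [imagePath, "stdout", "-l", ocrLanguage, "--dpi", String(dpi), "--psm", String(psm)];

  // Optional settings are only added when the caller provides them.
  if (oem !== undefined) args.push("--oem", String(oem));
  if (whitelist) args.push("-c", `tessedit_char_whitelist=${whitelist}`);
  if (blacklist) args.push("-c", `tessedit_char_blacklist=${blacklist}`);
  return args;
}

console.log(buildTesseractArgs("page-1.png", { oem: 3, whitelist: "0123456789" }).join(" "));
```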

Markdown-Specific Options

| Option | Type | Default | Description |
| ------------------- | ------- | ------- | ----------------------------------------------------- |
| preserveStructure | boolean | true | Maintain document structure |
| headingStyle | string | 'atx' | Heading style: 'atx' (##) or 'setext' (underline) |
| bulletListMarker | string | '-' | Bullet list marker character |

File Type Options

| Option | Type | Default | Description |
| --------------- | -------- | ----------- | ---------------------------------- |
| forceFileType | string | undefined | Force specific file type detection |
| allowedTypes | string[] | undefined | Restrict to specific file types |

Extractor Options

| Option | Type | Default | Description |
| ------------------ | ------ | ------- | ------------------------------------------------- |
| extractors | object | {} | Override extractor chains for specific file types |
| extractorOptions | object | {} | Pass options directly to underlying extractors |

Extractor-Specific Options

These options can be passed via extractorOptions:

PDF.js Options

| Option | Type | Default | Description |
| ---------- | ------ | ----------- | ------------------------ |
| maxPages | number | undefined | Maximum pages to extract |
| password | string | undefined | PDF password |

Mammoth Options (DOCX)

| Option | Type | Default | Description |
| ------------------------- | -------- | ----------- | ---------------------------------- |
| styleMap | Array | undefined | Style mapping rules for conversion |
| includeDefaultStyleMap | boolean | true | Include default style mapping |
| convertImage | function | undefined | Image conversion function |
| ignoreEmptyParagraphs | boolean | false | Skip empty paragraphs |
| includeEmbeddedStyleMap | boolean | true | Use embedded style map |

Text File Options

| Option | Type | Default | Description |
| ---------- | ------ | ----------- | -------------------------------------------- |
| encoding | string | undefined | Force specific encoding instead of detection |

Example with all options:

const result = await extractText("./document.pdf", {
  // Core options
  useOCR: "fallback",
  removeNewlines: false,
  removeUrls: false,
  minLength: 100,
  maxLength: 50000,
  maxFileSize: 100 * 1024 * 1024,
  requireSpacing: true,
  debug: true,

  // OCR options
  ocrLanguage: "eng",
  dpi: 300,
  psm: 1,
  oem: 3,
  whitelist: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
  blacklist: "",

  // Markdown options (for extractMarkdown)
  preserveStructure: true,
  headingStyle: "atx",
  bulletListMarker: "-",

  // File type options
  forceFileType: "pdf",
  allowedTypes: ["pdf", "docx", "txt"],

  // Extractor options
  extractors: {
    pdf: ["tesseract-text"], // Force specific extractor chain
  },
  extractorOptions: {
    "pdfjs-text": {
      maxPages: 10,
      password: "secret",
    },
    mammoth: {
      styleMap: ['p[style-name="Header"] => h1'],
      includeDefaultStyleMap: true,
      ignoreEmptyParagraphs: false,
      includeEmbeddedStyleMap: true,
    },
    "fs-text": {
      encoding: "utf8",
    },
  },
});

Features

Multi-format support: PDF, DOCX, DOC, RTF, TXT, and image files
Multiple output formats: Extract as text, HTML, or Markdown
OCR integration: Tesseract OCR for image-only documents and scanned PDFs
Pluggable extractors: Multiple extraction methods with automatic fallback
Text processing pipeline: Advanced cleaning, validation, and normalization
Error handling: Comprehensive error recovery and detailed error information
Performance tracking: Detailed metadata and timing information
Debug mode: Verbose logging for troubleshooting

Installation

npm install @wdelhagen/textprep

System Dependencies

The library requires several system utilities for full functionality:

macOS (via Homebrew):

brew install poppler         # For PDF processing (pdftotext, pdftohtml, pdftoppm)
brew install tesseract       # For OCR capabilities
brew install unrtf          # For RTF file processing
brew install --cask libreoffice  # For DOC file processing

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install poppler-utils tesseract-ocr unrtf libreoffice

Windows:

Quick Start

const {
  extractText,
  extractHTML,
  extractMarkdown,
} = require("@wdelhagen/textprep");

// CommonJS modules do not allow top-level await, so run inside an async function
async function main() {
  // Extract plain text
  const result = await extractText("./document.pdf");
  console.log(result.text);
  console.log(result.metadata.fileType); // 'pdf'

  // Extract HTML
  const htmlResult = await extractHTML("./document.docx");
  console.log(htmlResult.text); // HTML content

  // Extract Markdown
  const mdResult = await extractMarkdown("./document.pdf");
  console.log(mdResult.text); // Markdown content
}

main().catch(console.error);

API Reference

extractText(filePath, options)

Extracts plain text from supported document formats.

Parameters:

filePath (string): Path to the document file
options (object, optional): Extraction options

Returns: Promise<{text: string, metadata: object}>

Basic usage:

const result = await extractText("./document.pdf");
console.log(`Extracted ${result.text.length} characters`);
console.log(`File type: ${result.metadata.fileType}`);
console.log(`Extraction method: ${result.metadata.extractor}`);

With options:

const result = await extractText("./document.pdf", {
  useOCR: "fallback", // 'only', 'fallback', or false
  removeNewlines: false, // Remove line breaks
  removeUrls: false, // Remove HTTP/HTTPS URLs
  minLength: 100, // Minimum text length required
  maxLength: 50000, // Maximum text length allowed
  maxFileSize: 100 * 1024 * 1024, // 100MB file size limit
  debug: true, // Enable verbose logging
});

extractHTML(filePath, options)

Extracts HTML content from supported document formats.

Parameters:

filePath (string): Path to the document file
options (object, optional): Extraction options

Returns: Promise<{text: string, metadata: object}>

const result = await extractHTML("./document.docx");
// For formats without native HTML support, returns text wrapped in <pre> tags
console.log(result.metadata.htmlFallback); // true if fallback was used
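The `<pre>` fallback mentioned above can be pictured with a small sketch. It assumes the plain text is HTML-escaped before wrapping, which this README does not confirm; `escapeHtml` and `textToHtmlFallback` are hypothetical names:

```javascript
// Escape the characters that would otherwise be interpreted as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical fallback: wrap plain text in <pre> so whitespace and line breaks survive.
function textToHtmlFallback(text) {
  return `<pre>${escapeHtml(text)}</pre>`;
}

console.log(textToHtmlFallback("if a < b && b > c\n  return true"));
```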

extractMarkdown(filePath, options)

Extracts content as Markdown from supported document formats.

Parameters:

filePath (string): Path to the document file
options (object, optional): Extraction options including Markdown-specific settings

Returns: Promise<{text: string, metadata: object}>

const result = await extractMarkdown("./document.pdf", {
  preserveStructure: true, // Maintain document structure
  headingStyle: "atx", // 'atx' (##) or 'setext' (underline)
  bulletListMarker: "-", // Bullet list marker character
});
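To make the two formatting options concrete, here is a sketch of what `headingStyle` and `bulletListMarker` control in the generated Markdown. The helper names are hypothetical; the setext branch is limited to levels 1 and 2 because Markdown defines setext underlines only for those levels:

```javascript
// Render a heading in ATX ("## Title") or setext (underlined) style.
function renderHeading(text, level, headingStyle = "atx") {
  if (headingStyle === "setext" && level <= 2) {
    const underline = (level === 1 ? "=" : "-").repeat(text.length);
    return `${text}\n${underline}`;
  }
  // Deeper levels always fall back to ATX, since setext has no level 3+.
  return `${"#".repeat(level)} ${text}`;
}

// Render a flat bullet list with the chosen marker character.
function renderBullets(items, bulletListMarker = "-") {
  return items.map((item) => `${bulletListMarker} ${item}`).join("\n");
}

console.log(renderHeading("Summary", 1, "setext"));
console.log(renderBullets(["first", "second"], "*"));
```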

detectFileType(filePath)

Detects file type using magic bytes with fallback to file extension.

Parameters:

filePath (string): Path to the file

Returns: Promise<{detectedType: string, extensionType: string, mismatch: boolean, metadata: object}>

const typeInfo = await detectFileType("./unknown-file");
console.log(`Detected: ${typeInfo.detectedType}`);
console.log(`Extension suggests: ${typeInfo.extensionType}`);
console.log(`Mismatch: ${typeInfo.mismatch}`);
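The idea of magic-byte detection with an extension fallback can be sketched as follows. The byte signatures are the well-known ones for these formats, but the helper names and the simplified extension handling are assumptions and do not reflect the library's internals:

```javascript
// Leading byte signatures for a few formats. Note that the ZIP signature
// matches any ZIP archive, so treating it as DOCX is a simplification.
const SIGNATURES = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "docx", bytes: [0x50, 0x4b, 0x03, 0x04] }, // ZIP container
  { type: "doc", bytes: [0xd0, 0xcf, 0x11, 0xe0] }, // OLE2 compound file
  { type: "rtf", bytes: [0x7b, 0x5c, 0x72, 0x74, 0x66] }, // "{\rtf"
];

// Return the first type whose signature matches the start of the buffer, or null.
function sniffType(buffer) {
  const match = SIGNATURES.find(({ bytes }) => bytes.every((b, i) => buffer[i] === b));
  return match ? match.type : null;
}

// Return the lowercase extension without the dot, or null if there is none.
function extensionOf(fileName) {
  const match = /\.([a-z0-9]+)$/i.exec(fileName);
  return match ? match[1].toLowerCase() : null;
}

// Prefer the magic-byte result, fall back to the extension, and report disagreement.
function detectFromBuffer(fileName, buffer) {
  const magicType = sniffType(buffer);
  const extensionType = extensionOf(fileName);
  return {
    detectedType: magicType || extensionType,
    extensionType,
    mismatch: magicType !== null && extensionType !== null && magicType !== extensionType,
  };
}

console.log(detectFromBuffer("report.txt", Buffer.from("%PDF-1.7")));
```

A mismatch such as a PDF saved with a `.txt` extension is the case where an option like `forceFileType` becomes useful.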

getExtractors(fileType?, outputType?)

Lists available extractors for a given file type and output format.

Parameters:

fileType (string, optional): File type ('pdf', 'docx', 'doc', 'rtf', 'txt', 'image')
outputType (string, optional): Output type ('text' or 'html'), defaults to 'text'

Returns: Array of extractor names or complete registry object

// Get all PDF text extractors
const pdfExtractors = getExtractors("pdf", "text");
// ['pdfjs-text', 'pdftotext-text', 'pdf-parse-text', 'tesseract-text']

// Get entire registry
const allExtractors = getExtractors();
console.log(allExtractors);
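An ordered list of extractors like the one above suggests a try-in-order strategy. The sketch below shows one way such a fallback chain could work; `runExtractorChain` and the shape of the extractor objects are hypothetical, and the library's real chain logic may differ:

```javascript
// Try each extractor in order and return the first non-empty result.
// Each extractor is assumed to be { name, extract(input) => Promise<string> }.
async function runExtractorChain(extractors, input) {
  const attempts = [];
  for (const { name, extract } of extractors) {
    const started = Date.now();
    try {
      const text = await extract(input);
      const success = typeof text === "string" && text.trim().length > 0;
      attempts.push({ extractor: name, success, duration: Date.now() - started });
      if (success) return { text, extractor: name, attempts };
    } catch (error) {
      // A thrown error counts as a failed attempt; move on to the next extractor.
      attempts.push({ extractor: name, success: false, error: error.message });
    }
  }
  // Nothing worked: surface every attempt so the caller can see what was tried.
  const failure = new Error("All extractors failed");
  failure.context = { attempts };
  throw failure;
}

// Usage with stand-in extractors (no real parsing happens here).
runExtractorChain(
  [
    { name: "first", extract: async () => "" },
    { name: "second", extract: async (input) => `text from ${input}` },
  ],
  "document.pdf",
).then((result) => console.log(result.extractor, result.attempts.length));
```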

Options Reference

Extraction Options

OCR Options

| Option | Type | Default | Description |
| ------------- | ------ | ------- | ------------------------ |
| ocrLanguage | string | 'eng' | Tesseract language code |
| dpi | number | 300 | DPI for image conversion |

Force Options

| Option | Type | Default | Description |
| --------------- | ------ | ----------- | ---------------------------------- |
| forceFileType | string | undefined | Force specific file type detection |

Extractor-Specific Options

Pass options directly to underlying extractors:

const result = await extractText("./document.pdf", {
  extractorOptions: {
    "pdfjs-dist": {
      // pdfjs-dist specific options
    },
    mammoth: {
      // mammoth specific options
    },
    tesseract: {
      // tesseract specific options
    },
  },
});

Supported File Types

| Format | Extensions | Text Extraction | HTML Extraction | Extractors Used |
| ----------------- | ----------------------------- | --------------- | --------------- | ------------------------------------------- |
| PDF | .pdf | ✅ | ✅ | pdfjs-dist, pdftotext, pdf-parse, tesseract |
| Word (Modern) | .docx | ✅ | ✅ | mammoth, textutil, tesseract |
| Word (Legacy) | .doc | ✅ | ❌* | libreoffice, textutil |
| Rich Text | .rtf | ✅ | ❌* | unrtf, textutil |
| Plain Text | .txt | ✅ | ❌* | fs (encoding detection) |
| Images | .jpg, .png, .bmp, .webp, .pbm | ✅ | ❌* | tesseract |

* HTML extraction falls back to text wrapped in <pre> tags

Usage Examples

Basic Text Extraction

const { extractText } = require("@wdelhagen/textprep");

async function extractFromFile(filePath) {
  try {
    const result = await extractText(filePath);
    console.log(`Successfully extracted ${result.text.length} characters`);
    console.log(`Processing time: ${result.metadata.totalDuration}ms`);
    return result.text;
  } catch (error) {
    console.error("Extraction failed:", error.message);
    if (error.context) {
      console.error("Context:", error.context);
    }
  }
}

OCR Processing

// OCR only mode - force OCR even for text-based documents
const ocrResult = await extractText("./scanned-document.pdf", {
  useOCR: "only",
  ocrLanguage: "eng",
  dpi: 300,
});

// OCR fallback - try text extraction first, OCR if no text found
const fallbackResult = await extractText("./mixed-document.pdf", {
  useOCR: "fallback",
});
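One way to picture the three `useOCR` modes is as a filter over the extractor chain. This is a sketch under the assumption that OCR extractors can be recognized by a `tesseract` name prefix; `planExtractors` is a hypothetical helper and not how the library is known to work:

```javascript
// Decide which extractors to run for a given useOCR mode.
function planExtractors(chain, useOCR) {
  const isOcr = (name) => name.startsWith("tesseract");
  const textExtractors = chain.filter((name) => !isOcr(name));
  const ocrExtractors = chain.filter(isOcr);

  if (useOCR === "only") return ocrExtractors; // skip text extraction entirely
  if (useOCR === false) return textExtractors; // never run OCR
  return [...textExtractors, ...ocrExtractors]; // "fallback": OCR as a last resort
}

const pdfChain = ["pdfjs-text", "pdftotext-text", "pdf-parse-text", "tesseract-text"];
console.log(planExtractors(pdfChain, "only"));
console.log(planExtractors(pdfChain, false));
```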

Force Options for Troubleshooting

// Force file type when automatic detection fails
const result = await extractText("./misnamed-file.xyz", {
  forceFileType: "pdf",
});

// Force specific extractor chain
const ocrResult = await extractText("./document.pdf", {
  forceFileType: "pdf",
  extractors: {
    pdf: ["tesseract-text"], // Use OCR for PDFs
  },
});

Advanced Text Processing

const result = await extractText("./document.pdf", {
  removeNewlines: true, // Convert to single paragraph
  removeUrls: true, // Clean URLs from text
  minLength: 500, // Require at least 500 characters
  maxLength: 10000, // Truncate at 10k characters
  requireSpacing: true, // Validate proper word spacing
});
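The effect of these options can be approximated with a small standalone function. This is a sketch of the described behavior only: the order of the steps, the URL pattern, and the name `processText` as used here are assumptions, and a plain `Error` stands in for the library's `ValidationError`:

```javascript
// Apply URL removal, newline removal, a minimum-length check, and truncation.
function processText(text, options = {}) {
  const { removeUrls = true, removeNewlines = false, minLength = 0, maxLength } = options;
  let output = text;

  if (removeUrls) output = output.replace(/https?:\/\/\S+/g, "");
  if (removeNewlines) output = output.replace(/\s*\r?\n\s*/g, " ");
  output = output.trim();

  // Too little text is treated as a validation failure rather than silently accepted.
  if (output.length < minLength) {
    throw new Error(`Text too short: ${output.length} < ${minLength}`);
  }
  // Too much text is truncated, matching the "(truncates)" note for maxLength.
  if (maxLength !== undefined && output.length > maxLength) {
    output = output.slice(0, maxLength);
  }
  return output;
}

console.log(processText("First line\nSecond line", { removeNewlines: true }));
```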

Batch Processing

const fs = require("fs").promises;
const path = require("path");
const { extractText } = require("@wdelhagen/textprep");

async function processDirectory(dirPath) {
  const files = await fs.readdir(dirPath);
  const results = [];

  for (const file of files) {
    const filePath = path.join(dirPath, file);
    try {
      const result = await extractText(filePath, {
        maxFileSize: 20 * 1024 * 1024, // 20MB limit
        useOCR: "fallback",
      });

      results.push({
        file,
        success: true,
        textLength: result.text.length,
        extractor: result.metadata.extractor,
        duration: result.metadata.totalDuration,
      });
    } catch (error) {
      results.push({
        file,
        success: false,
        error: error.message,
      });
    }
  }

  return results;
}
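The array returned by `processDirectory` is easy to condense into a report. A minimal sketch follows, where `summarizeBatch` is a hypothetical helper that relies only on the result fields built in the example above:

```javascript
// Condense per-file results into totals plus a list of failure messages.
function summarizeBatch(results) {
  const succeeded = results.filter((r) => r.success);
  const failed = results.filter((r) => !r.success);
  return {
    total: results.length,
    succeeded: succeeded.length,
    failed: failed.length,
    totalCharacters: succeeded.reduce((sum, r) => sum + r.textLength, 0),
    failedFiles: failed.map((r) => `${r.file}: ${r.error}`),
  };
}

// Example input shaped like the output of processDirectory.
console.log(
  summarizeBatch([
    { file: "a.pdf", success: true, textLength: 1200, extractor: "pdfjs-text", duration: 80 },
    { file: "b.doc", success: false, error: "All extractors failed" },
  ]),
);
```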

Error Handling

const { extractText, errors } = require("@wdelhagen/textprep");
const { FileTypeError, ExtractionError, ValidationError } = errors;

async function robustExtraction(filePath) {
  try {
    return await extractText(filePath);
  } catch (error) {
    if (error instanceof FileTypeError) {
      console.error("File type issue:", error.message);
      console.error("File path:", error.filePath);
      console.error("Detected type:", error.detectedType);
    } else if (error instanceof ExtractionError) {
      console.error("Extraction failed:", error.message);
      console.error("Context:", error.context);
    } else if (error instanceof ValidationError) {
      console.error("Validation failed:", error.message);
      console.error("Validation errors:", error.errors);
    } else {
      console.error("Unexpected error:", error);
    }
    throw error;
  }
}

Debug Mode

const result = await extractText("./document.pdf", {
  debug: true, // Enables verbose logging
});

// Check debug information in metadata
if (result.metadata.debug) {
  console.log("Detection time:", result.metadata.debug.timings.detection);
  console.log(
    "Extraction attempts:",
    result.metadata.debug.extractorsAttempted,
  );
  console.log("Processing steps:", result.metadata.debug.processingSteps);
}

Metadata Structure

All extraction functions return rich metadata:

{
  text: "extracted content...",
  metadata: {
    // Basic information
    fileType: "pdf",
    fileSize: 1024576,
    extractor: "pdfjs-text",
    totalDuration: 1523,

    // Timing breakdown
    timings: {
      detection: 45,
      extraction: 1200,
      processing: 278,
      total: 1523
    },

    // Content analysis
    contentAnalysis: {
      originalLength: 15420,
      finalLength: 14890,
      compressionRatio: 0.97,
      hasText: true,
      estimatedPages: 3
    },

    // Processing details
    processingSteps: ["detectFileType", "pdfjs-text", "processText"],
    extractionAttempts: [
      {
        extractor: "pdfjs-text",
        success: true,
        duration: 1200,
        textLength: 15420
      }
    ],

    // Error information (if any warnings)
    warnings: [],
    errorSummary: {
      hasErrors: false,
      errorCount: 0,
      warningCount: 0
    }
  }
}
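The sample values above are consistent with two simple relationships: the total timing equals the sum of the three stages, and `compressionRatio` equals `finalLength / originalLength` rounded to two decimals. These formulas are inferred from the sample rather than documented, and the helper names are hypothetical:

```javascript
// Sum the per-stage timings (milliseconds).
function sumTimings({ detection, extraction, processing }) {
  return detection + extraction + processing;
}

// Ratio of processed length to original length, rounded to two decimals.
function compressionRatio(originalLength, finalLength) {
  return Math.round((finalLength / originalLength) * 100) / 100;
}

console.log(sumTimings({ detection: 45, extraction: 1200, processing: 278 })); // 1523
console.log(compressionRatio(15420, 14890)); // 0.97
```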

Performance Considerations

File size limits: Default 50MB limit prevents memory issues
Extraction timeout: Operations time out after reasonable periods
Memory usage: Typically under 500MB for large documents
OCR processing: Can be slow for large images/scanned documents
Caching: System utility availability is cached for performance

Error Types

FileTypeError

Thrown when file type detection fails or file type is not supported.

try {
  await extractText("./unknown-file");
} catch (error) {
  if (error instanceof FileTypeError) {
    console.log("File path:", error.filePath);
    console.log("Detected type:", error.detectedType);
  }
}

ExtractionError

Thrown when all extraction methods fail.

try {
  await extractText("./corrupted.pdf");
} catch (error) {
  if (error instanceof ExtractionError) {
    console.log("Context:", error.context);
    console.log("Attempts:", error.context.attempts);
  }
}

ValidationError

Thrown when extracted text fails validation.

try {
  await extractText("./tiny-file.txt", { minLength: 1000 });
} catch (error) {
  if (error instanceof ValidationError) {
    console.log("Validation errors:", error.errors);
  }
}
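For readers who want to see how the three error types fit together, the sketch below defines local stand-in classes carrying the same properties the examples read (`filePath`, `detectedType`, `context`, `errors`). These are illustrative only and are not the library's own classes, whose constructors are not documented here; `describeError` is likewise hypothetical:

```javascript
// Local stand-ins, NOT the classes exported by the library.
class FileTypeError extends Error {
  constructor(message, { filePath, detectedType } = {}) {
    super(message);
    this.name = "FileTypeError";
    this.filePath = filePath;
    this.detectedType = detectedType;
  }
}

class ExtractionError extends Error {
  constructor(message, context = {}) {
    super(message);
    this.name = "ExtractionError";
    this.context = context;
  }
}

class ValidationError extends Error {
  constructor(message, errors = []) {
    super(message);
    this.name = "ValidationError";
    this.errors = errors;
  }
}

// Turn any of the three error types into a one-line summary for logs.
function describeError(error) {
  if (error instanceof FileTypeError) return `unsupported type: ${error.detectedType}`;
  if (error instanceof ExtractionError) return `extraction failed after ${error.context.attempts.length} attempt(s)`;
  if (error instanceof ValidationError) return `validation failed: ${error.errors.join("; ")}`;
  return `unexpected: ${error.message}`;
}

console.log(describeError(new ValidationError("invalid", ["text shorter than minLength"])));
```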

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Run the test suite: npm test
Submit a pull request

Testing

# Run all tests
npm test

# Run with coverage
npm run test:coverage

# Run only unit tests
npm run test:unit

# Run only integration tests
npm run test:integration

License

MIT License - see LICENSE file for details.

Changelog

v0.1.0

Initial release
Support for PDF, DOCX, DOC, RTF, TXT, and image files
Text, HTML, and Markdown output formats
OCR integration with Tesseract
Comprehensive error handling and metadata
Debug mode and performance tracking
