pdf-plus

v2.1.5

Published

a month ago

A comprehensive PDF content extraction library with support for text, images, and structured data

Vulnerabilities: 0 High, 0 Medium, 0 Low

kauandotnet

pdf extraction text images document parsing content typescript

Features

📝 Text Extraction - High-quality text extraction with positioning
🖼️ Image Detection - Detect and reference images in PDF content
💾 Image File Extraction - Extract actual image files from PDFs
🎨 Image Optimization - Optional Sharp/Imagemin optimization with quality control
🔄 JP2 Conversion - Automatic JPEG 2000 to JPG conversion for compatibility
🚀 Parallel Processing - 1.5-3x faster with configurable concurrency (Phase 1)
⚡ Async I/O - Non-blocking file operations for better performance (Phase 2)
🧵 Worker Threads - True multi-threading for CPU-intensive operations (Phase 3)
🌊 Streaming API - Process large PDFs with 10-100x lower memory usage (Phase 4)
📄 Page to Image - Convert PDF pages to images (PNG, JPG, WebP) (Phase 5 - NEW!)
🎯 Format Preservation - Preserves original image formats (JPG, PNG) and full quality
🔧 TypeScript Support - Full TypeScript definitions included
🛡️ Robust Validation - Comprehensive input validation and error handling

Installation

# Using pnpm (recommended)
pnpm add pdf-plus

# Using npm
npm install pdf-plus

# Using yarn
yarn add pdf-plus

Quick Start

import { extractPdfContent } from "pdf-plus";

// Extract both text and images
const result = await extractPdfContent("document.pdf", {
  extractText: true,
  extractImages: true,
  verbose: true,
});

console.log(
  `Extracted ${result.images.length} images from ${result.document.pages} pages`
);
console.log(`Text content: ${result.cleanText.substring(0, 100)}...`);

Streaming API for Large PDFs (NEW! - Phase 4)

For large PDFs, use the streaming API for lower memory usage and real-time progress:

import { extractPdfStream } from "pdf-plus";

const stream = extractPdfStream("large-document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",
  streamMode: true,
});

for await (const event of stream) {
  if (event.type === "page") {
    console.log(`Page ${event.pageNumber}/${event.totalPages} complete`);
  } else if (event.type === "progress") {
    console.log(`Progress: ${event.percentComplete.toFixed(1)}%`);
  } else if (event.type === "complete") {
    console.log(`Done! ${event.totalImages} images extracted`);
  }
}

Benefits:

📉 10-100x lower memory usage for large PDFs
⚡ 100x faster time to first result
📊 Real-time progress tracking
🛑 Cancellation support

See PHASE4-STREAMING.md for complete streaming API documentation.
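The event fields used in the loop above can be described as a discriminated union, which lets TypeScript narrow each branch. The following is a minimal sketch: the event shapes are inferred only from the example above (the library's exported types may differ), and `mockPdfStream` is a hypothetical stand-in for `extractPdfStream` so the code runs without a PDF.

```typescript
// Event shapes inferred from the streaming example above; the library's
// real exported types may differ.
type PdfStreamEvent =
  | { type: "page"; pageNumber: number; totalPages: number }
  | { type: "progress"; percentComplete: number }
  | { type: "complete"; totalImages: number };

// Hypothetical stand-in for extractPdfStream(), so the sketch runs standalone.
async function* mockPdfStream(totalPages: number): AsyncGenerator<PdfStreamEvent> {
  for (let pageNumber = 1; pageNumber <= totalPages; pageNumber++) {
    yield { type: "page", pageNumber, totalPages };
    yield { type: "progress", percentComplete: (pageNumber / totalPages) * 100 };
  }
  yield { type: "complete", totalImages: totalPages * 2 };
}

// Collect page numbers and stop early once `maxPages` pages have been seen.
// Breaking out of `for await` asks the generator to finish, which is the
// plain JavaScript way to stop consuming an async iterator.
async function collectPages(
  stream: AsyncIterable<PdfStreamEvent>,
  maxPages: number
): Promise<number[]> {
  const pages: number[] = [];
  for await (const ev of stream) {
    if (ev.type === "page") {
      pages.push(ev.pageNumber);
      if (pages.length >= maxPages) break;
    }
  }
  return pages;
}
```

Whether the library's own cancellation support works through early `break` or through a separate mechanism is not stated here; PHASE4-STREAMING.md is the place to confirm.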

Generate Page Images (NEW! - Phase 5)

Render PDF pages to high-quality images with a simple function call:

import { generatePageImages } from "pdf-plus";

// Simple - render all pages to JPG images
const imagePaths = await generatePageImages(
  "document.pdf", // PDF file path
  "./page-images" // Output directory where images will be saved
);

console.log(`Generated ${imagePaths.length} page images`);
// Returns: ['/path/to/page-images/jpg/page-001.jpg', '/path/to/page-images/jpg/page-002.jpg', ...]

With Options:

const imagePaths = await generatePageImages("document.pdf", "./page-images", {
  pageImageFormat: "jpg", // 'jpg', 'png', or 'webp'
  pageImageDpi: 150, // DPI quality (72, 150, 300, 600)
  pageRenderEngine: "poppler", // 'poppler' (recommended) or 'pdfjs'
  specificPages: [1, 2, 3], // Optional: only render specific pages
  parallelProcessing: true, // Parallel rendering (default: true)
  maxConcurrentPages: 10, // Max parallel pages (default: 10)
  verbose: true, // Show progress
});

Features:

🎨 Multiple formats - JPG, PNG, WebP
📐 Quality control - Adjustable DPI (72, 150, 300, 600)
📄 Page selection - Render specific pages or all pages
🚀 Parallel rendering - Fast multi-page processing
📁 Returns file paths - Array of absolute paths to generated images
🔧 Two engines - Poppler (best quality) or PDF.js

Output Structure:

page-images/
└── jpg/
    ├── page-001.jpg
    ├── page-002.jpg
    └── page-003.jpg

See PAGE-TO-IMAGE-FEATURE.md for complete page-to-image documentation.
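If downstream code needs to predict where a given page will land, the layout shown above can be expressed as a small helper. This is a sketch under assumptions: the `page-NNN` zero-padding and the per-format sub-folder are inferred from the `jpg` example only, and `expectedPageImagePath` is a hypothetical name, not part of the library. Prefer the paths returned by `generatePageImages` when they are available.

```typescript
// Naming inferred from the example output (page-001.jpg). How the library
// names pages beyond 999, or folders for png/webp, is an assumption here.
function expectedPageImagePath(
  outputDir: string,
  pageNumber: number,
  format: "jpg" | "png" | "webp" = "jpg"
): string {
  const padded = String(pageNumber).padStart(3, "0");
  // Images are grouped in a sub-folder named after the format, e.g. page-images/jpg/
  return `${outputDir}/${format}/page-${padded}.${format}`;
}
```

Note that this helper returns a path relative to `outputDir`, whereas the library is documented to return absolute paths.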

Usage Examples

Text-Only Extraction (Fast)

import { extractText } from "pdf-plus";

const text = await extractText("document.pdf");
console.log(`Extracted ${text.length} characters`);

Extract Embedded Images

import { extractImageFiles } from "pdf-plus";

// Extract and save embedded images from PDF
const imagePaths = await extractImageFiles(
  "document.pdf",
  "./extracted-images" // Output directory for embedded images
);

console.log(`Extracted ${imagePaths.length} embedded images`);

Generate Page Images (Render Pages)

import { generatePageImages } from "pdf-plus";

// Render PDF pages to image files
const imagePaths = await generatePageImages(
  "document.pdf",
  "./page-images" // Output directory for page images
);

console.log(`Generated ${imagePaths.length} page images`);
// Each page becomes an image: page-001.jpg, page-002.jpg, etc.

Image Extraction with Optimization

import { extractPdfContent } from "pdf-plus";

const result = await extractPdfContent("document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",

  // Enable optimization
  optimizeImages: true,
  imageOptimizer: "auto", // or 'sharp', 'imagemin'
  imageQuality: 80,
  imageProgressive: true,

  // Convert JP2 (JPEG 2000) to JPG for better compatibility (default: true)
  // imageQuality is a single option (set above); when left unset, JP2 conversion defaults to 100
  convertJp2ToJpg: true,

  verbose: true,
});

// Check optimization results
result.images.forEach((img) => {
  console.log(`${img.filename}: Optimized and saved`);
});

Performance Optimization (NEW! 🚀)

import { extractPdfContent } from "pdf-plus";

// BASIC: Parallel processing (enabled by default)
const result = await extractPdfContent("document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",
  parallelProcessing: true, // 1.5-3x faster
});

// ADVANCED: With worker threads for CPU-intensive operations
const workerResult = await extractPdfContent("large-document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",

  // Enable parallel processing (default: true)
  parallelProcessing: true,

  // Enable worker threads for true multi-threading (default: false)
  useWorkerThreads: true, // 2.5-3.2x additional speedup!
  autoScaleWorkers: true, // Auto-adjust based on system resources
  maxWorkerThreads: 8, // Max worker threads (default: CPU cores - 1)

  // Fine-tune concurrency for your workload
  maxConcurrentPages: 20, // Process up to 20 pages simultaneously
  maxConcurrentImages: 50, // Extract up to 50 images per page in parallel
  maxConcurrentConversions: 5, // Convert up to 5 JP2 files simultaneously
  maxConcurrentOptimizations: 5, // Optimize up to 5 images simultaneously

  verbose: true,
});

// Performance gains (tested on Art Basel PDF, 54 images):
// - Baseline (sequential): 140ms
// - Parallel processing: 47ms (2.96x faster)
// - Parallel + Workers: 44ms (3.23x faster) 🚀

Performance Recommendations:

| PDF Size | Images | Recommended Settings |
| -------- | ------ | -------------------- |
| Small | <20 | parallelProcessing: true (default settings) |
| Medium | 20-50 | parallelProcessing: true, maxConcurrentPages: 10, maxConcurrentImages: 20 |
| Large | 50+ | parallelProcessing: true, useWorkerThreads: true, maxConcurrentPages: 20, maxConcurrentImages: 50 |
| Huge | 200+ | parallelProcessing: true, useWorkerThreads: true, maxWorkerThreads: 8, maxConcurrentPages: 30, maxConcurrentImages: 100 |
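The recommendations above can be turned into a small lookup when the image count is known up front. This is an illustrative sketch, not a library feature: `recommendSettings` and `ConcurrencySettings` are hypothetical names, and because the table's ranges overlap at 50, the sketch treats exactly 50 images as Medium.

```typescript
// Subset of ExtractionOptions that the recommendations table touches.
interface ConcurrencySettings {
  parallelProcessing: boolean;
  useWorkerThreads?: boolean;
  maxWorkerThreads?: number;
  maxConcurrentPages?: number;
  maxConcurrentImages?: number;
}

// Map an image count to the settings row from the table.
function recommendSettings(imageCount: number): ConcurrencySettings {
  if (imageCount < 20) {
    return { parallelProcessing: true }; // Small: defaults are enough
  }
  if (imageCount <= 50) {
    return { parallelProcessing: true, maxConcurrentPages: 10, maxConcurrentImages: 20 };
  }
  if (imageCount < 200) {
    return {
      parallelProcessing: true,
      useWorkerThreads: true,
      maxConcurrentPages: 20,
      maxConcurrentImages: 50,
    };
  }
  return {
    parallelProcessing: true,
    useWorkerThreads: true,
    maxWorkerThreads: 8,
    maxConcurrentPages: 30,
    maxConcurrentImages: 100,
  };
}
```

The returned object can be spread into the options passed to `extractPdfContent`, alongside `extractImageFiles` and `imageOutputDir`.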

Worker Threads Benefits:

✅ True multi-threading (runs on separate CPU cores)
✅ 2.5-3.2x faster for CPU-intensive operations (JP2 conversion, optimization)
✅ Auto-scaling based on memory and CPU usage
✅ Opt-in (default: false) - no breaking changes

See PERFORMANCE.md and PHASE3-WORKERS.md for detailed benchmarks and optimization guide.

Custom Image References

import { extractPdfContent } from "pdf-plus";

const result = await extractPdfContent("document.pdf", {
  imageRefFormat: "📷 Image {index} on page {page}",
  extractImageFiles: true,
  useImagePaths: true,
});

// Text will contain: "📷 Image 1 on page 1" instead of "[IMAGE:img_1]"

Advanced Configuration

import { PDFExtractor } from "pdf-plus";

const extractor = new PDFExtractor();

const result = await extractor.extract("large-document.pdf", {
  extractText: true,
  extractImages: true,
  extractImageFiles: true,
  imageOutputDir: "./extracted-images",
  memoryLimit: "1GB",
  batchSize: 10,
  progressCallback: (progress) => {
    console.log(
      `Processing page ${progress.currentPage}/${progress.totalPages}`
    );
  },
});

Real-World Examples

Extract and Save Images from Academic Papers

import { extractPdfContent } from "pdf-plus";
import { writeFileSync } from "fs";

async function extractAcademicPaper(pdfPath: string) {
  const result = await extractPdfContent(pdfPath, {
    extractText: true,
    extractImages: true,
    extractImageFiles: true,
    imageOutputDir: "./paper-images",
    imageRefFormat: "Figure {index}: {name}",
    verbose: true,
  });

  // Save text content
  writeFileSync("./paper-text.txt", result.cleanText);

  // Log extraction summary
  console.log(`📄 Extracted from ${result.document.filename}:`);
  console.log(`   📝 Text: ${result.document.textLength} characters`);
  console.log(`   🖼️  Images: ${result.images.length} found`);
  console.log(`   📊 Pages: ${result.document.pages}`);

  return result;
}

Batch Process Multiple PDFs

import { PDFExtractor } from "pdf-plus";
import { glob } from "glob";
import path from "path"; // needed for path.basename below

async function batchProcessPDFs(pattern: string) {
  const extractor = new PDFExtractor("./cache"); // Enable caching
  const pdfFiles = await glob(pattern);

  const results = [];

  for (const pdfFile of pdfFiles) {
    console.log(`Processing: ${pdfFile}`);

    try {
      const result = await extractor.extract(pdfFile, {
        extractText: true,
        extractImages: true,
        imageOutputDir: `./output/${path.basename(pdfFile, ".pdf")}`,
        batchSize: 5, // Process 5 pages at a time
        verbose: false,
      });

      results.push({
        file: pdfFile,
        success: true,
        pages: result.document.pages,
        images: result.images.length,
        textLength: result.document.textLength,
      });
    } catch (error) {
      console.error(`Failed to process ${pdfFile}:`, error);
      results.push({
        file: pdfFile,
        success: false,
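        // `error` is `unknown` under strict TypeScript, so narrow before reading .message
        error: error instanceof Error ? error.message : String(error),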
      });
    }
  }

  return results;
}

API Reference

Main Functions

`extractPdfContent(pdfPath, options)`

Extract complete content from a PDF file.

Parameters:

pdfPath (string) - Path to the PDF file
options (ExtractionOptions) - Extraction configuration

Returns: Promise<ExtractionResult>

`extractText(pdfPath, options)`

Extract only text content (optimized for speed).

Returns: Promise<string>

`extractImages(pdfPath, options)`

Extract only image references.

Returns: Promise<ImageItem[]>

`extractImageFiles(pdfPath, outputDir, options)`

Extract and save embedded image files from PDF.

Parameters:

pdfPath - Path to the PDF file
outputDir - Output directory path where embedded images will be saved
options - Optional extraction options

Returns: Promise<string[]> - Array of saved file paths

`generatePageImages(pdfPath, outputDir, options)`

Render PDF pages to image files (page-to-image conversion).

Parameters:

pdfPath - Path to the PDF file
outputDir - Output directory path where page images will be saved
options - Optional rendering options (pageImageFormat, pageImageDpi, pageRenderEngine, etc.)

Returns: Promise<string[]> - Array of absolute paths to generated page images

Example:

import { generatePageImages } from "pdf-plus";

const imagePaths = await generatePageImages("document.pdf", "./page-images", {
  pageImageFormat: "jpg",
  pageImageDpi: 150,
  pageRenderEngine: "poppler",
});

console.log(`Generated ${imagePaths.length} page images`);
// Returns: ['/absolute/path/to/page-images/jpg/page-001.jpg', ...]

Options

interface ExtractionOptions {
  // Basic extraction options
  extractText?: boolean; // Extract text content (default: true)
  extractImages?: boolean; // Extract image references (default: true)
  extractImageFiles?: boolean; // Save actual image files (default: false)
  useImagePaths?: boolean; // Use file paths in references (default: false)
  imageOutputDir?: string; // Directory for image files (default: './extracted-images')
  imageRefFormat?: string; // Custom reference format (default: '[IMAGE:{id}]')
  baseName?: string; // Base name for output files
  verbose?: boolean; // Show detailed progress (default: false)
  memoryLimit?: string; // Memory limit (e.g., '512MB', '1GB')
  batchSize?: number; // Pages per batch (1-100)
  progressCallback?: (progress: ProgressInfo) => void;

  // Image optimization options
  optimizeImages?: boolean; // Enable image optimization (default: false)
  imageOptimizer?: "auto" | "sharp" | "imagemin"; // Optimizer to use (default: 'auto')
  imageQuality?: number; // Image quality 1-100 (default: 80, JP2 conversion: 100)
  imageProgressive?: boolean; // Progressive JPEG (default: true)
  convertJp2ToJpg?: boolean; // Convert JP2 to JPG (default: true)

  // Performance options (NEW!)
  parallelProcessing?: boolean; // Enable parallel processing (default: true)
  maxConcurrentPages?: number; // Max pages in parallel (default: 10)
  maxConcurrentImages?: number; // Max images per page in parallel (default: 20)
  maxConcurrentConversions?: number; // Max JP2 conversions in parallel (default: 5)
  maxConcurrentOptimizations?: number; // Max optimizations in parallel (default: 5)

  // Worker thread options (NEW! 🚀)
  useWorkerThreads?: boolean; // Enable worker threads (default: false)
  autoScaleWorkers?: boolean; // Auto-scale workers (default: true)
  maxWorkerThreads?: number; // Max worker threads (default: CPU cores - 1)
  minWorkerThreads?: number; // Min worker threads (default: 1)
  memoryThreshold?: number; // Memory threshold 0-1 (default: 0.8)
  cpuThreshold?: number; // CPU threshold 0-1 (default: 0.9)
  workerTaskTimeout?: number; // Task timeout ms (default: 30000)
  workerIdleTimeout?: number; // Idle timeout ms (default: 60000)
  workerMemoryLimit?: number; // Memory per worker MB (default: 512)
  enableWorkerForConversion?: boolean; // Workers for JP2 (default: true)
  enableWorkerForOptimization?: boolean; // Workers for optimization (default: true)
  enableWorkerForDecoding?: boolean; // Workers for decoding (default: true)
}
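Several options above carry documented numeric ranges (`batchSize` 1-100, `imageQuality` 1-100, `memoryThreshold` and `cpuThreshold` 0-1). The sketch below shows one way a caller could check those ranges before invoking the library. It is a hypothetical pre-flight helper (`checkOptionRanges` is not a library function) and does not reflect how the library validates internally.

```typescript
// Only the options whose ranges are stated in the interface comments.
interface RangeCheckedOptions {
  batchSize?: number;
  imageQuality?: number;
  memoryThreshold?: number;
  cpuThreshold?: number;
}

// Return a human-readable message for every option outside its documented range.
function checkOptionRanges(options: RangeCheckedOptions): string[] {
  const ranges: Array<[keyof RangeCheckedOptions, number, number]> = [
    ["batchSize", 1, 100],
    ["imageQuality", 1, 100],
    ["memoryThreshold", 0, 1],
    ["cpuThreshold", 0, 1],
  ];
  const problems: string[] = [];
  for (const [key, min, max] of ranges) {
    const value = options[key];
    if (value === undefined) continue; // unset options fall back to library defaults
    if (!Number.isFinite(value) || value < min || value > max) {
      problems.push(`${key} must be between ${min} and ${max}, got ${value}`);
    }
  }
  return problems;
}
```

An empty array means every supplied value is within its documented range.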

Performance Options Explained:

Parallel Processing:

parallelProcessing: Enable/disable parallel processing. Enabled by default for 1.5-3x speedup.
maxConcurrentPages: How many pages to process simultaneously. Higher values = faster for multi-page PDFs, but more memory usage.
maxConcurrentImages: How many images per page to extract in parallel. Increase for pages with many images.
maxConcurrentConversions: How many JP2→JPG conversions to run simultaneously. Keep moderate (5-10) to avoid memory issues.
maxConcurrentOptimizations: How many image optimizations to run simultaneously. Keep moderate (5-10) as optimization is CPU-intensive.
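To make the effect of a `maxConcurrent*` cap concrete, here is a generic concurrency limiter. It is a sketch of the general technique (a fixed pool of runners pulling from a shared index), not the library's implementation, and `mapWithLimit` is a hypothetical name.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the same order as the input.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

A higher limit keeps more work in flight at once, which is why the larger settings trade memory for speed.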

Worker Threads (NEW! 🚀):

useWorkerThreads: Enable true multi-threading using Node.js worker threads. Provides 2.5-3.2x additional speedup for CPU-intensive operations. Default: false (opt-in).
autoScaleWorkers: Automatically adjust worker count based on system memory and CPU usage. Default: true.
maxWorkerThreads: Maximum number of worker threads. Default: CPU cores - 1.
minWorkerThreads: Minimum number of worker threads to keep alive. Default: 1.
memoryThreshold: Memory usage threshold (0-1) before scaling down workers. Default: 0.8 (80%).
cpuThreshold: CPU usage threshold (0-1) before scaling down workers. Default: 0.9 (90%).
workerTaskTimeout: Maximum time (ms) for a worker task before timeout. Default: 30000 (30 seconds).
workerIdleTimeout: Time (ms) before idle workers are terminated. Default: 60000 (60 seconds).
workerMemoryLimit: Memory limit (MB) per worker thread. Default: 512MB.
enableWorkerForConversion: Use workers for JP2 conversion. Default: true.
enableWorkerForOptimization: Use workers for image optimization. Default: true.
enableWorkerForDecoding: Use workers for image decoding. Default: true.
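The interaction between `autoScaleWorkers`, the two thresholds, and the min/max bounds can be pictured with a simple scaling rule. The policy below (step down by one worker under pressure, step up by one otherwise) is purely hypothetical and chosen for illustration; the README does not describe the library's actual scaling algorithm.

```typescript
interface ScalingInputs {
  currentWorkers: number;
  memoryUsage: number; // fraction of system memory in use, 0-1
  cpuUsage: number; // fraction of CPU in use, 0-1
}

interface ScalingLimits {
  minWorkerThreads: number; // documented default: 1
  maxWorkerThreads: number; // documented default: CPU cores - 1
  memoryThreshold: number; // documented default: 0.8
  cpuThreshold: number; // documented default: 0.9
}

// Hypothetical policy: shrink the pool when either threshold is exceeded,
// grow it otherwise, and always clamp to the configured bounds.
function nextWorkerCount(inputs: ScalingInputs, limits: ScalingLimits): number {
  const overloaded =
    inputs.memoryUsage > limits.memoryThreshold || inputs.cpuUsage > limits.cpuThreshold;
  const proposed = overloaded ? inputs.currentWorkers - 1 : inputs.currentWorkers + 1;
  return Math.min(limits.maxWorkerThreads, Math.max(limits.minWorkerThreads, proposed));
}
```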

Format Placeholders

Use these placeholders in imageRefFormat:

{id} - Unique image ID (e.g., img_1)
{name} - Original image name from PDF
{page} - Page number
{index} - Global image index
{path} - File path (when extractImageFiles is true)

Examples:

[IMAGE:{id}] → [IMAGE:img_1]
📷 Image {index} → 📷 Image 1
{name} on page {page} → artwork_1 on page 5
<img src="{path}"> → <img src="./images/img_1.jpg">
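The substitution itself is straightforward to picture. The sketch below expands the five documented placeholders with a single regular expression; it is an illustration of the idea rather than the library's code, and `renderImageRef` and `ImageRefValues` are hypothetical names.

```typescript
// Values available for substitution, mirroring the documented placeholders.
interface ImageRefValues {
  id: string;
  name: string;
  page: number;
  index: number;
  path?: string; // only known when image files are extracted
}

// Replace each known {placeholder} with its value.
function renderImageRef(template: string, values: ImageRefValues): string {
  return template.replace(
    /\{(id|name|page|index|path)\}/g,
    (match: string, key: keyof ImageRefValues) => {
      const value = values[key];
      // Leave the placeholder untouched when no value is available
      // (an assumption of this sketch, e.g. {path} without extracted files).
      return value === undefined ? match : String(value);
    }
  );
}
```

For instance, `renderImageRef("{name} on page {page}", { id: "img_1", name: "artwork_1", page: 5, index: 1 })` yields `artwork_1 on page 5`, matching the example above.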

Image Optimization & Conversion

Extract and optimize images in one step using Sharp or Imagemin:

import { extractPdfContent } from "pdf-plus";

const result = await extractPdfContent("document.pdf", {
  extractImageFiles: true,
  imageOutputDir: "./images",

  // Enable optimization
  optimizeImages: true,
  imageOptimizer: "auto", // Automatically selects best available
  imageQuality: 80,
  imageProgressive: true,

  // Convert JP2 (JPEG 2000) to JPG for better compatibility (default: true)
  convertJp2ToJpg: true,

  verbose: true,
});

// Output:
// 🖼️  Extracting images from: document.pdf
// 📊 Processing 50 pages with PDF-lib engine
//    💾 Extracted real image: img_p1_1.jpg (245KB)
// 🔄 Converting 16 JP2 images to JPG...
//    🔄 Converted JP2 → JPG: img_p2_2.jpg (24026 → 18500 bytes)
// 🎨 Optimizing 54 images...
//    ✅ img_p1_1.jpg: 251904 → 184320 bytes (-26.8%) [sharp]
//    ✅ img_p2_2.jpg: 18500 → 15200 bytes (-17.8%) [sharp]

JP2 to JPG Conversion

JP2 (JPEG 2000) files are not widely supported by browsers and image tools. The library automatically converts them to standard JPG format:

const result = await extractPdfContent("document.pdf", {
  extractImageFiles: true,
  convertJp2ToJpg: true, // Default: true
  imageQuality: 100, // Default: 100 (maximum quality preservation)
});

// All JP2 images are now JPG files with better compatibility

Quality Preservation:

Default quality: 100 - Preserves maximum quality from JP2
Use lower values (80-90) if you want additional compression
Original JP2 files are deleted after successful conversion

Benefits:

✅ Better browser compatibility
✅ Can be optimized by Sharp/Imagemin
✅ Maximum quality preserved (quality=100)
✅ Works everywhere
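To check independently whether an extracted file is JPEG 2000 (for example, when `convertJp2ToJpg` is disabled), the leading bytes can be inspected. The sketch below tests for the standard JP2 signature box and for a raw codestream header; it is not taken from the library, and `looksLikeJpeg2000` is a hypothetical helper.

```typescript
// JP2 container files begin with a 12-byte signature box.
const JP2_SIGNATURE = [0x00, 0x00, 0x00, 0x0c, 0x6a, 0x50, 0x20, 0x20, 0x0d, 0x0a, 0x87, 0x0a];
// Raw JPEG 2000 codestreams begin with the SOC marker followed by SIZ.
const J2K_CODESTREAM = [0xff, 0x4f, 0xff, 0x51];

// True when `data` starts with every byte of `prefix`.
function startsWithBytes(data: Uint8Array, prefix: number[]): boolean {
  return data.length >= prefix.length && prefix.every((byte, i) => data[i] === byte);
}

// True when the buffer starts like a JP2 file or a bare JPEG 2000 codestream.
function looksLikeJpeg2000(data: Uint8Array): boolean {
  return startsWithBytes(data, JP2_SIGNATURE) || startsWithBytes(data, J2K_CODESTREAM);
}
```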

Optimizer Comparison

| Optimizer | Speed | Quality | Formats | Platform |
| --------- | ----- | ------- | ------- | -------- |
| sharp | Fast | Excellent | JPG, PNG, WebP | Native (requires compilation) |
| imagemin | Medium | Excellent | JPG, PNG, GIF, SVG | Cross-platform |
| auto | Variable | Excellent | All supported | Tries sharp first, falls back to imagemin |

Optimization Presets

// Maximum compression (slower, smaller files)
const smallest = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageQuality: 70,
});

// Balanced (recommended)
const balanced = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageQuality: 80, // Default
});

// Fast optimization with Sharp
const fast = await extractPdfContent("document.pdf", {
  optimizeImages: true,
  imageOptimizer: "sharp",
  imageQuality: 85,
});

Performance Modes

Text-Only Mode (Fastest)

const text = await extractText("document.pdf");
// ~40% faster than combined mode

Images-Only Mode

const images = await extractImages("document.pdf");
// ~20% faster than combined mode

Combined Mode (Default)

const result = await extractPdfContent("document.pdf");
// Full extraction with text and image references

Error Handling

import { extractPdfContent } from "pdf-plus";

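try {
  const result = await extractPdfContent("document.pdf");
} catch (error) {
  // Catch variables are `unknown` under strict TypeScript, so narrow before reading fields
  const err = (error ?? {}) as { code?: string; message?: string; validationErrors?: unknown };
  if (err.code === "VALIDATION_ERROR") {
    console.error("Configuration error:", err.validationErrors);
  } else if (err.code === "EXTRACTION_ERROR") {
    console.error("Extraction failed:", err.message);
  } else {
    console.error("Unexpected error:", error);
  }
}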
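For reuse across call sites, the same branching can be wrapped in a type guard. This is a sketch: the two `code` values come from the example above, but the error shape (`CodedError`) is an assumption, and the library may export its own error classes that would be preferable.

```typescript
// Assumed error shape: a standard Error carrying a string `code`.
interface CodedError extends Error {
  code: string;
  validationErrors?: unknown;
}

// Narrow an unknown catch value to an Error that has a string code.
function isCodedError(error: unknown): error is CodedError {
  return (
    error instanceof Error &&
    typeof (error as Error & { code?: unknown }).code === "string"
  );
}

// Turn any thrown value into a one-line description.
function describeFailure(error: unknown): string {
  if (isCodedError(error)) {
    if (error.code === "VALIDATION_ERROR") return `Configuration error: ${error.message}`;
    if (error.code === "EXTRACTION_ERROR") return `Extraction failed: ${error.message}`;
  }
  return `Unexpected error: ${String(error)}`;
}
```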

Development

# Install dependencies
pnpm install

# Build the library
pnpm run build

# Lint and format
pnpm run lint:fix
pnpm run format

# Type checking
pnpm run check

Requirements

Node.js >= 18.0.0
TypeScript >= 5.0 (for development)

License

MIT

Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests to our repository.

Troubleshooting

Common Issues

"Cannot find module" errors

Make sure you're using the correct import syntax for your environment:

// ESM (recommended)
import { extractPdfContent } from "pdf-plus";

// CommonJS
const { extractPdfContent } = require("pdf-plus");

Memory issues with large PDFs

For large documents, cap memory use and process pages in smaller batches (the streaming API described earlier is another option):

const result = await extractPdfContent("large-document.pdf", {
  memoryLimit: "512MB",
  batchSize: 5,
  useCache: true,
});

Image extraction not working

Try different engines:

const result = await extractPdfContent("document.pdf", {
  imageEngine: "poppler", // or 'pdf-lib', 'auto'
  extractImageFiles: true,
});

Text extraction issues

Some PDFs may have encoding issues. Try:

const result = await extractPdfContent("document.pdf", {
  extractText: true,
  textEngine: "pdfjs", // Alternative engine
  verbose: true, // See detailed logs
});

Performance Tips

Use specific extraction modes for better performance:

// Text only (fastest)
const text = await extractText("document.pdf");

// Images only
const images = await extractImages("document.pdf");

Enable caching for repeated operations:

const extractor = new PDFExtractor("./cache");

Process pages in batches for large documents:

const result = await extractPdfContent("large.pdf", {
  batchSize: 10,
  memoryLimit: "1GB",
});

Getting Help

Check the Issues page
Review examples for common use cases
Enable verbose logging for debugging: { verbose: true }

Roadmap

Planned Features

OCR Support: Text extraction from image-based PDFs
Advanced Text Analysis: Font detection, text classification
Streaming API: Process large documents efficiently
Cloud Integration: Direct integration with cloud storage
CLI Tool: Command-line interface for batch processing
Web Worker Support: Browser-based extraction
Plugin System: Extensible architecture for custom extractors

Version 1.x Roadmap

[ ] OCR integration with Tesseract.js
[ ] Advanced image processing options
[ ] Streaming extraction API
[ ] Performance optimizations
[ ] Browser compatibility layer
[ ] CLI tool development

See CHANGELOG.md for detailed version history.
