@febbyrg/pdf-decomposer v1.0.9
A powerful PDF text and image extraction library with universal browser and Node.js support (dual licensed: free for non-commercial use, paid for commercial use).
PDF-Decomposer
A powerful TypeScript library for comprehensive PDF processing and content extraction. Optimized for production use with universal browser and Node.js support.
Core Features
PDF Decomposer Class
- Load Once, Use Many Times - Initialize PDF once, perform multiple operations
- Progress Tracking - Observable pattern with real-time progress callbacks
- Error Handling - Comprehensive error reporting with page-level context
- Memory Efficient - Built-in memory management and cleanup
- Universal Support - Works in Node.js 16+ and all modern browsers
Main Operations
1. Content Decomposition (decompose())
Extract structured text with positioning and formatting:
- Smart element composition with `elementComposer`
- Content area cleaning with `cleanComposer`
- Page-level composition with `pageComposer`
- Image extraction from embedded PDF objects
- Link extraction from PDF annotations and text patterns
- Smart URL detection with comprehensive email and domain pattern matching
2. Screenshot Generation (screenshot())
- High-quality page rendering to PNG/JPEG
- Configurable resolution and quality
- Batch processing with progress tracking
- File output or base64 data URLs
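When no output directory is given, each screenshot arrives as a base64 data URL. A small sketch of decoding such a URL into raw bytes (e.g. for upload to object storage); `dataUrlToBytes` is a hypothetical helper, not part of this library:

```typescript
// Decode a `data:image/...;base64,...` URL into raw bytes.
// The data-URL shape is the standard one; the MIME type depends on
// whether PNG or JPEG output was requested.
function dataUrlToBytes(dataUrl: string): Uint8Array {
  const comma = dataUrl.indexOf(',')
  if (!dataUrl.startsWith('data:') || comma === -1) {
    throw new Error('Not a data URL')
  }
  const base64 = dataUrl.slice(comma + 1)
  return new Uint8Array(Buffer.from(base64, 'base64'))
}
```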
3. PDF Data Generation (data())
- pwa-admin compatible data structure
- Interactive area mapping with normalized coordinates
- Widget ID generation following epub conventions
- Article relationship management
- `skipScreenshots` option for memory-constrained environments
4. PDF Slicing (slice())
- Extract specific page ranges
- Generate new PDF documents
- Replace internal document structure
- Preserve all metadata and formatting
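The exact precedence between `numberPages` and `startPage`/`endPage` is not spelled out here; one plausible resolution, sketched under the assumption that `numberPages` counts from page 1 and ranges are clamped to the document length:

```typescript
// Illustrative only - not the library's actual resolution logic.
function resolveSliceRange(
  totalPages: number,
  opts: { numberPages?: number; startPage?: number; endPage?: number }
): { startPage: number; endPage: number } {
  if (opts.numberPages !== undefined) {
    // Take the first N pages of the document
    return { startPage: 1, endPage: Math.min(opts.numberPages, totalPages) }
  }
  // Otherwise fall back to an explicit, clamped range
  const startPage = opts.startPage ?? 1
  const endPage = Math.min(opts.endPage ?? totalPages, totalPages)
  return { startPage, endPage }
}
```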
Advanced Content Processing
Element Composer
- Groups scattered text elements into coherent paragraphs
- Font-size based header element recognition (h1, h2, h3, etc.)
- Smart span merging for headers with same font-size/family but different colors
- Content consolidation for multiple heading tags
- Preserves reading order and text flow
- Smart font and spacing analysis
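Font-size based header recognition can be pictured as mapping each element's size relative to the body text onto a heading tag. The thresholds below are illustrative assumptions, not the library's actual values:

```typescript
// Sketch: promote text to h1/h2/h3 when it is noticeably larger
// than the detected body font size. Ratios are assumed, not documented.
function headingTagForSize(fontSize: number, bodySize: number): string {
  const ratio = fontSize / bodySize
  if (ratio >= 2.0) return 'h1'
  if (ratio >= 1.5) return 'h2'
  if (ratio >= 1.2) return 'h3'
  return 'p'
}
```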
Page Composer
- Merges continuous content across pages
- Detects article boundaries and section breaks
- Interview and feature content recognition
- Typography consistency analysis
Clean Composer
- Filters out headers, footers, and page numbers
- Content area detection with configurable margins
- Image size validation and filtering
- Control character removal
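The margin-based filtering above can be sketched as a content-area test: an element whose vertical center falls inside the top or bottom margin band is a header/footer candidate. This mirrors the documented `topMarginPercent`/`bottomMarginPercent` defaults, but the check itself is an illustration, not the library's implementation:

```typescript
interface Box { top: number; height: number }

// True when the element's vertical center lies inside the content
// area, i.e. outside the configured header/footer margin bands.
function isInContentArea(
  box: Box,
  pageHeight: number,
  topMarginPercent = 0.1,
  bottomMarginPercent = 0.1
): boolean {
  const centerY = box.top + box.height / 2
  const topLimit = pageHeight * topMarginPercent
  const bottomLimit = pageHeight * (1 - bottomMarginPercent)
  return centerY >= topLimit && centerY <= bottomLimit
}
```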
Image Extraction
- Universal browser-compatible processing
- Multiple format support (RGB, RGBA, Grayscale)
- Auto-scaling for memory safety
- Duplicate detection and removal
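Duplicate removal can be done by hashing decoded image bytes and keeping only the first occurrence of each hash. The FNV-1a sketch below is illustrative; the library's internal fingerprinting scheme may differ:

```typescript
// 32-bit FNV-1a hash over raw bytes.
function fnv1a(bytes: Uint8Array): number {
  let hash = 0x811c9dc5
  for (const b of bytes) {
    hash ^= b
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash >>> 0
}

// Keep only the first image with each byte-level hash.
function dedupeImages(images: Uint8Array[]): Uint8Array[] {
  const seen = new Set<number>()
  return images.filter((img) => {
    const h = fnv1a(img)
    if (seen.has(h)) return false
    seen.add(h)
    return true
  })
}
```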
Link Extraction (extractLinks: true)
- PDF Annotations: Extract interactive link annotations with URLs and destinations
- Text Pattern Matching: Detect URLs in text content (e.g., "GIA.edu/jewelryservices")
- Email Detection: Find email addresses in document text with automatic mailto: prefix
- Smart URL Recognition: Enhanced regex patterns for domain+path detection
- Link Types: Support for external URLs, internal PDF destinations, and email links
- No Duplicates: Intelligent handling prevents text/link element duplication
- Position Data: Accurate bounding box coordinates for each link
- Link Attributes: Rich metadata including link type, context text, and extraction method
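The text-pattern side of this can be pictured with two regexes, one for emails (prefixed with `mailto:`) and one for bare domain+path strings like "GIA.edu/jewelryservices" (promoted to `https` URLs). These patterns are illustrative sketches, not the library's actual regexes:

```typescript
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g
const URL_RE = /\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+(?:\/[^\s]*)?/g

// Collect mailto: links and bare-domain URLs from plain text.
function detectLinks(text: string): string[] {
  const links: string[] = []
  for (const email of text.match(EMAIL_RE) ?? []) {
    links.push(`mailto:${email}`)
  }
  // Strip emails first so the domain inside an address
  // is not also reported as a URL.
  const remaining = text.replace(EMAIL_RE, ' ')
  for (const url of remaining.match(URL_RE) ?? []) {
    links.push(url.startsWith('http') ? url : `https://${url}`)
  }
  return links
}
```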
Performance and Memory
- Memory Manager - Adaptive cleanup and monitoring
- Progress Callbacks - Real-time operation tracking
- Background Processing - Non-blocking operations
- Batch Processing - Efficient multi-page handling
Installation
```bash
npm install @febbyrg/pdf-decomposer

# For Node.js with canvas support (optional)
npm install canvas

# For browser usage
npm install pdfjs-dist
```
Quick Start
Class-Based API (Recommended)
```typescript
import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Load PDF once, use many times
const pdf = new PdfDecomposer(buffer) // Buffer, ArrayBuffer, or Uint8Array
await pdf.initialize()

// Multiple operations on the same PDF
const pages = await pdf.decompose({
  elementComposer: true, // Group text into paragraphs
  pageComposer: true,    // Merge continuous content across pages
  cleanComposer: true,   // Clean headers/footers
  extractImages: true,   // Extract embedded images
  extractLinks: true     // Extract links and annotations from the PDF
})

// Enhanced minifyOptions with element attributes
const styledPages = await pdf.decompose({
  elementComposer: true,
  minify: true,
  minifyOptions: {
    format: 'html',         // data field contains formatted HTML
    elementAttributes: true // Include styling information
  }
})

const screenshots = await pdf.screenshot({
  imageWidth: 1024,
  imageQuality: 90
})

// pwa-admin compatible format
const pdfData = await pdf.data({
  imageWidth: 1024,
  elementComposer: true
})

// Extract the first 5 pages
const sliced = await pdf.slice({
  numberPages: 5
})

// Access PDF properties
console.log(`Pages: ${pdf.numPages}`)
console.log(`Fingerprint: ${pdf.fingerprint}`)
```
Factory Method (One-liner)
```typescript
import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Create and initialize in one step
const pdf = await PdfDecomposer.create(buffer)
const pages = await pdf.decompose({ elementComposer: true })
```
Progress Tracking
```typescript
const pdf = new PdfDecomposer(buffer)

// Subscribe to progress updates
pdf.subscribe((state) => {
  console.log(`${state.progress}% - ${state.message}`)
})

await pdf.initialize()
const result = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true
})
```
Browser Environment (Angular, React, Vue)
```typescript
import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// In the browser, use the File API
async function processPdfFile(file: File) {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = new PdfDecomposer(arrayBuffer)
  await pdf.initialize()
  return await pdf.decompose({
    elementComposer: true,
    extractImages: true
  })
}

// Configure the PDF.js worker (once per app)
import { PdfWorkerConfig } from '@febbyrg/pdf-decomposer'
PdfWorkerConfig.configure() // Auto-configures the worker URL
```
Advanced Usage Examples
Content Processing Pipeline
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Step 1: Extract raw content with advanced processing
const pages = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true,
  pageComposer: true,
  cleanComposer: true,
  extractImages: true,
  minify: true,
  cleanComposerOptions: {
    topMarginPercent: 0.15,
    bottomMarginPercent: 0.1,
    minTextHeight: 8,
    removeControlCharacters: true
  }
})

// Step 2: Generate interactive data for web apps
const interactiveData = await pdf.data({
  startPage: 1,
  endPage: 10,
  imageWidth: 1024,
  elementComposer: true
})

// Step 3: Create high-quality screenshots
const screenshots = await pdf.screenshot({
  startPage: 1,
  endPage: 10,
  imageWidth: 1200,
  imageQuality: 95
})
```
PDF Slicing and Processing
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

console.log(`Original PDF: ${pdf.numPages} pages`)

// Slice to the first 5 pages (modifies the internal PDF)
const sliceResult = await pdf.slice({
  numberPages: 5
})

console.log(`Sliced PDF: ${pdf.numPages} pages`) // Now shows 5
console.log(`Saved ${sliceResult.fileSize} bytes`)

// Process the sliced PDF
const pages = await pdf.decompose({
  elementComposer: true
})
```
Link Extraction
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Extract links from PDF content
const pagesWithLinks = await pdf.decompose({
  extractLinks: true,
  elementComposer: true
})

// Process the links that were found
pagesWithLinks.pages.forEach((page, pageIndex) => {
  const linkElements = page.elements.filter((el) => el.type === 'link')
  linkElements.forEach((link) => {
    console.log(`Page ${pageIndex + 1}: Found ${link.attributes.linkType}`)
    console.log(`  URL: ${link.data}`)
    console.log(`  Position: [${link.boundingBox.left}, ${link.boundingBox.top}]`)
    if (link.attributes.text) {
      console.log(`  Context: "${link.attributes.text}"`)
    }
  })
})
```
API Reference
PdfDecomposer Class
Constructor
```typescript
new PdfDecomposer(input: Buffer | ArrayBuffer | Uint8Array)
```
Static Methods
```typescript
// Factory method - create and initialize in one step
static async create(input: Buffer | ArrayBuffer | Uint8Array): Promise<PdfDecomposer>
```
Instance Methods
```typescript
// Initialize PDF (required before other operations)
async initialize(): Promise<void>

// Extract content and structure
async decompose(options?: PdfDecomposerOptions): Promise<DecomposeResult>

// Generate page screenshots
async screenshot(options?: ScreenshotOptions): Promise<ScreenshotResult>

// Generate a pwa-admin compatible data structure
async data(options?: DataOptions): Promise<DataResult>

// Slice the PDF to a specific page range
async slice(options?: SliceOptions): Promise<SliceResult>

// Subscribe to progress updates
subscribe(callback: (state: PdfDecomposerState) => void): void

// Get PDF and page fingerprints for caching
async getFingerprints(): Promise<{ pdfHash: string; pageHashes: string[]; total: number }>
```
Properties
```typescript
readonly numPages: number     // Total number of pages
readonly fingerprint: string  // PDF fingerprint for caching
readonly initialized: boolean // Initialization status
```
Options Interfaces
PdfDecomposerOptions
```typescript
interface PdfDecomposerOptions {
  startPage?: number        // First page (1-indexed, default: 1)
  endPage?: number          // Last page (1-indexed, default: all)
  outputDir?: string        // Output directory for files
  elementComposer?: boolean // Group text into paragraphs
  pageComposer?: boolean    // Merge continuous content across pages
  extractImages?: boolean   // Extract embedded images
  extractLinks?: boolean    // Extract links and annotations from the PDF
  minify?: boolean          // Compact output format
  cleanComposer?: boolean   // Remove headers/footers
  cleanComposerOptions?: PdfCleanComposerOptions
  minifyOptions?: {
    format?: 'plain' | 'html'   // Data field format
    elementAttributes?: boolean // Include slim element attributes
  }
}
```
ScreenshotOptions
```typescript
interface ScreenshotOptions {
  startPage?: number    // First page (1-indexed)
  endPage?: number      // Last page (1-indexed)
  outputDir?: string    // Output directory for image files
  imageWidth?: number   // Image width (default: 1200)
  imageQuality?: number // JPEG quality 1-100 (default: 90)
}
```
DataOptions
```typescript
interface DataOptions {
  startPage?: number        // First page (1-indexed)
  endPage?: number          // Last page (1-indexed)
  outputDir?: string        // Output directory
  extractImages?: boolean   // Extract embedded images
  extractLinks?: boolean    // Extract links and annotations
  elementComposer?: boolean // Group elements into paragraphs
  cleanComposer?: boolean   // Clean content area
  imageWidth?: number       // Screenshot width (default: 1024)
  imageQuality?: number     // Screenshot quality (default: 90)
  skipScreenshots?: boolean // Skip page screenshots (memory-constrained environments)
}
```
SliceOptions
```typescript
interface SliceOptions {
  numberPages?: number // Number of pages from the start
  startPage?: number   // Starting page (1-indexed, default: 1)
  endPage?: number     // Ending page (1-indexed)
}
```
PdfCleanComposerOptions
```typescript
interface PdfCleanComposerOptions {
  topMarginPercent?: number          // Exclude top % for headers (default: 0.1)
  bottomMarginPercent?: number       // Exclude bottom % for footers (default: 0.1)
  sideMarginPercent?: number         // Exclude side % (default: 0.05)
  minTextHeight?: number             // Minimum text height (default: 8)
  minTextWidth?: number              // Minimum text width (default: 10)
  minTextLength?: number             // Minimum text length (default: 3)
  removeControlCharacters?: boolean  // Remove non-printable chars (default: true)
  removeIsolatedCharacters?: boolean // Remove isolated chars (default: true)
  minImageWidth?: number             // Minimum image width (default: 50)
  minImageHeight?: number            // Minimum image height (default: 50)
  minImageArea?: number              // Minimum image area (default: 2500)
  coverPageDetection?: boolean       // Detect cover pages (default: true)
  coverPageThreshold?: number        // Cover detection threshold (default: 0.8)
}
```
Result Interfaces
DecomposeResult
```typescript
interface DecomposeResult {
  pages: PdfPageContent[]
}

interface PdfPageContent {
  pageIndex: number      // 0-based page index
  pageNumber: number     // 1-based page number
  width: number          // Page width in points
  height: number         // Page height in points
  title: string          // Page title
  elements: PdfElement[] // Extracted elements
  metadata?: {
    composedFromPages?: number[] // Original page indices (for pageComposer)
    [key: string]: any
  }
}
```
ScreenshotResult
```typescript
interface ScreenshotResult {
  totalPages: number
  screenshots: ScreenshotPageResult[]
}

interface ScreenshotPageResult {
  pageNumber: number // 1-based page number
  width: number      // Image width in pixels
  height: number     // Image height in pixels
  screenshot: string // Base64 data URL
  filePath?: string  // File path if outputDir provided
  error?: string     // Error message if the page failed
}
```
DataResult
```typescript
interface DataResult {
  data: PdfData[]
}

interface PdfData {
  id: string        // Unique page identifier
  index: number     // 0-based page index
  image: string     // Page screenshot URL
  thumbnail: string // Thumbnail URL
  areas: PdfArea[]  // Interactive areas
}

interface PdfArea {
  id: string        // Unique area identifier
  coords: number[]  // [x1, y1, x2, y2] normalized 0-1
  articleId: number // Associated article ID
  widgetId: string  // Widget identifier (P: or T:)
}
```
SliceResult
```typescript
interface SliceResult {
  pdfBytes: Uint8Array      // Sliced PDF data
  originalPageCount: number // Original page count
  slicedPageCount: number   // Sliced page count
  pageRange: {
    startPage: number
    endPage: number
  }
  fileSize: number          // Size in bytes
}
```
Testing and Development
Run Tests
```bash
npm test                # Comprehensive test suite
npm run test:screenshot # Screenshot generation tests
npm run test:data       # PDF data generation tests
```
Build and Development
```bash
npm run build       # Build TypeScript to dist/
npm run build:watch # Watch mode for development
npm run lint        # ESLint validation
```
Environment Support
| Feature           | Node.js | Browser | Notes                                       |
| ----------------- | ------- | ------- | ------------------------------------------- |
| Text Extraction   | Yes     | Yes     | Full support in both environments           |
| Image Extraction  | Yes     | Yes     | Universal canvas-based processing           |
| Screenshots       | Yes     | Yes     | Node.js uses `canvas`, browser uses the Canvas API |
| PDF Slicing       | Yes     | Yes     | Uses pdf-lib in both environments           |
| Progress Tracking | Yes     | Yes     | Observable pattern with callbacks           |
| Memory Management | Yes     | Limited | Advanced in Node.js, basic in browser       |
| File Output       | Yes     | No      | Browser returns data URLs/blobs             |
| Element Composer  | Yes     | Yes     | Smart text grouping                         |
| Page Composer     | Yes     | Yes     | Cross-page content merging                  |
| Clean Composer    | Yes     | Yes     | Header/footer removal                       |
Browser Compatibility
- Chrome 60+
- Firefox 55+
- Safari 11+
- Edge 79+
- Mobile browsers (iOS Safari, Chrome Mobile)
Node.js Requirements
- Node.js 16+ required
- Canvas optional for enhanced screenshot quality
- TypeScript 4.9+ for development
Production Usage Examples
Memory Optimization
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Process large PDFs in smaller batches
const totalPages = pdf.numPages
const batchSize = 10

for (let start = 1; start <= totalPages; start += batchSize) {
  const end = Math.min(start + batchSize - 1, totalPages)
  const batch = await pdf.decompose({
    startPage: start,
    endPage: end,
    elementComposer: true
  })
  // Process batch results...
}
```
Built-in Memory Limits (v1.0.6+):
- `MAX_SAFE_PIXELS`: 2M pixels per image
- `MAX_DIMENSION`: 2000px max width/height
- `MAX_IMAGES_PER_PAGE`: 20 images
- Canvas size limits: 1200x1600 for screenshots
- Sequential processing to reduce peak memory
- Use `skipScreenshots: true` in `data()` to skip page image generation
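The pixel and dimension caps above can be enforced by downscaling before decoding; a sketch, using the documented limits (2,000,000 pixels, 2000px max dimension) but with scaling logic that is illustrative rather than the library's internal code:

```typescript
const MAX_SAFE_PIXELS = 2_000_000
const MAX_DIMENSION = 2000

// Scale (width, height) down uniformly until both the per-dimension
// cap and the total-pixel cap are satisfied.
function clampImageSize(width: number, height: number): { width: number; height: number } {
  let scale = 1
  scale = Math.min(scale, MAX_DIMENSION / Math.max(width, height))
  scale = Math.min(scale, Math.sqrt(MAX_SAFE_PIXELS / (width * height)))
  if (scale >= 1) return { width, height } // Already within limits
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale))
  }
}
```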
Error Handling
```typescript
const pdf = new PdfDecomposer(buffer)

pdf.subscribe((state) => {
  console.log(`Progress: ${state.progress}%`)
})

try {
  await pdf.initialize()
  const result = await pdf.decompose()
} catch (error) {
  if (error.name === 'InvalidPdfError') {
    console.error('Invalid PDF format:', error.message)
  } else if (error.name === 'MemoryError') {
    console.error('Memory limit exceeded:', error.message)
  } else {
    console.error('Processing failed:', error.message)
  }
}
```
Caching Strategy
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Use the fingerprint as a cache key
const fingerprints = await pdf.getFingerprints()
const cacheKey = `pdf_${fingerprints.pdfHash}`

// Check the cache before processing
const cached = cache.get(cacheKey)
if (!cached) {
  const result = await pdf.decompose()
  cache.set(cacheKey, result, { ttl: 3600 }) // 1 hour
}
```
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Development Guidelines
- Use TypeScript for all new code
- Add tests for new features
- Update README for API changes
- Follow existing code style
- Test in both Node.js and browser environments
Publishing
Setup for Publishing
```bash
# Initial setup (run once)
npm run setup:publishing

# Verify configuration
npm run setup:verify
```
Publishing Commands
```bash
# Publish to NPM only
npm run publish:npm

# Publish to GitHub Packages only
npm run publish:github

# Publish to both registries
npm run publish:both

# Version bump + publish
npm version patch && npm run publish:both
```
License
PDF-Decomposer is dual-licensed:
Non-Commercial Use (Free)
- Personal projects
- Educational use
- Research purposes
- Open source projects
Commercial Use (Paid License Required)
- Commercial applications
- Revenue-generating products
- Enterprise software
- Distribution in commercial products
For commercial licensing, contact [email protected]
See LICENSE file for complete terms.
