
@febbyrg/pdf-decomposer v1.0.9

A powerful PDF text and image extraction library with universal browser and Node.js support (Dual Licensed: Free for non-commercial, Paid for commercial use)

PDF-Decomposer

A powerful TypeScript library for comprehensive PDF processing and content extraction. Optimized for production use with universal browser and Node.js support.

Core Features

PDF Decomposer Class

  • Load Once, Use Many Times - Initialize PDF once, perform multiple operations
  • Progress Tracking - Observable pattern with real-time progress callbacks
  • Error Handling - Comprehensive error reporting with page-level context
  • Memory Efficient - Built-in memory management and cleanup
  • Universal Support - Works in Node.js 16+ and all modern browsers

Main Operations

1. Content Decomposition (decompose())

Extract structured text with positioning and formatting:

  • Smart element composition with elementComposer
  • Content area cleaning with cleanComposer
  • Page-level composition with pageComposer
  • Image extraction from embedded PDF objects
  • Link extraction from PDF annotations and text patterns
  • Smart URL detection with comprehensive email and domain pattern matching

2. Screenshot Generation (screenshot())

  • High-quality page rendering to PNG/JPEG
  • Configurable resolution and quality
  • Batch processing with progress tracking
  • File output or base64 data URLs

3. PDF Data Generation (data())

  • pwa-admin compatible data structure
  • Interactive area mapping with normalized coordinates
  • Widget ID generation following epub conventions
  • Article relationship management
  • skipScreenshots option for memory-constrained environments

4. PDF Slicing (slice())

  • Extract specific page ranges
  • Generate new PDF documents
  • Replace internal document structure
  • Preserve all metadata and formatting

Advanced Content Processing

Element Composer

  • Groups scattered text elements into coherent paragraphs
  • Font-size based header element recognition (h1, h2, h3, etc.)
  • Smart span merging for headers with same font-size/family but different colors
  • Content consolidation for multiple heading tags
  • Preserves reading order and text flow
  • Smart font and spacing analysis

Page Composer

  • Merges continuous content across pages
  • Detects article boundaries and section breaks
  • Interview and feature content recognition
  • Typography consistency analysis

Clean Composer

  • Filters out headers, footers, and page numbers
  • Content area detection with configurable margins
  • Image size validation and filtering
  • Control character removal

Image Extraction

  • Universal browser-compatible processing
  • Multiple format support (RGB, RGBA, Grayscale)
  • Auto-scaling for memory safety
  • Duplicate detection and removal

Link Extraction (extractLinks: true)

  • PDF Annotations: Extract interactive link annotations with URLs and destinations
  • Text Pattern Matching: Detect URLs in text content (e.g., "GIA.edu/jewelryservices")
  • Email Detection: Find email addresses in document text with automatic mailto: prefix
  • Smart URL Recognition: Enhanced regex patterns for domain+path detection
  • Link Types: Support for external URLs, internal PDF destinations, and email links
  • No Duplicates: Intelligent handling prevents text/link element duplication
  • Position Data: Accurate bounding box coordinates for each link
  • Link Attributes: Rich metadata including link type, context text, and extraction method
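As a sketch of consuming these link results downstream (the helper name is ours, not a library export; it relies only on the `type === 'link'` and `data` fields documented in the result interfaces below):

```typescript
// Illustrative helper (not part of the library): collect unique link URLs
// from a page's decomposed elements.
interface LinkLike {
  type: string
  data: string
}

function uniqueLinkUrls(elements: LinkLike[]): string[] {
  const seen = new Set<string>()
  const urls: string[] = []
  for (const el of elements) {
    if (el.type === 'link' && !seen.has(el.data)) {
      seen.add(el.data)
      urls.push(el.data)
    }
  }
  return urls
}
```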

Performance and Memory

  • Memory Manager - Adaptive cleanup and monitoring
  • Progress Callbacks - Real-time operation tracking
  • Background Processing - Non-blocking operations
  • Batch Processing - Efficient multi-page handling

Installation

npm install @febbyrg/pdf-decomposer

# For Node.js with canvas support (optional)
npm install canvas

# For browser usage
npm install pdfjs-dist

Quick Start

Class-Based API (Recommended)

import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Load PDF once, use many times
const pdf = new PdfDecomposer(buffer) // Buffer, ArrayBuffer, or Uint8Array
await pdf.initialize()

// Multiple operations on same PDF
const pages = await pdf.decompose({
  elementComposer: true, // Group text into paragraphs
  pageComposer: true, // Merge continuous content across pages
  cleanComposer: true, // Clean headers/footers
  extractImages: true, // Extract embedded images
  extractLinks: true // Extract links and annotations from PDF
})

// Enhanced MinifyOptions with Element Attributes
const styledPages = await pdf.decompose({
  elementComposer: true,
  minify: true,
  minifyOptions: {
    format: 'html', // data field contains formatted HTML
    elementAttributes: true // Include styling information
  }
})

const screenshots = await pdf.screenshot({
  imageWidth: 1024,
  imageQuality: 90
})

const pdfData = await pdf.data({
  // pwa-admin compatible format
  imageWidth: 1024,
  elementComposer: true
})

const sliced = await pdf.slice({
  // Extract first 5 pages
  numberPages: 5
})

// Access PDF properties
console.log(`Pages: ${pdf.numPages}`)
console.log(`Fingerprint: ${pdf.fingerprint}`)

Factory Method (One-liner)

import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Create and initialize in one step
const pdf = await PdfDecomposer.create(buffer)
const pages = await pdf.decompose({ elementComposer: true })

Progress Tracking

const pdf = new PdfDecomposer(buffer)

// Subscribe to progress updates
pdf.subscribe((state) => {
  console.log(`${state.progress}% - ${state.message}`)
})

await pdf.initialize()
const result = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true
})

Browser Environment (Angular, React, Vue)

import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// In browser - use File API
async function processPdfFile(file: File) {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = new PdfDecomposer(arrayBuffer)
  await pdf.initialize()

  return await pdf.decompose({
    elementComposer: true,
    extractImages: true
  })
}

// Configure PDF.js worker (once per app)
import { PdfWorkerConfig } from '@febbyrg/pdf-decomposer'
PdfWorkerConfig.configure() // Auto-configures worker URL

Advanced Usage Examples

Content Processing Pipeline

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Step 1: Extract raw content with advanced processing
const pages = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true,
  pageComposer: true,
  cleanComposer: true,
  extractImages: true,
  minify: true,
  cleanComposerOptions: {
    topMarginPercent: 0.15,
    bottomMarginPercent: 0.1,
    minTextHeight: 8,
    removeControlCharacters: true
  }
})

// Step 2: Generate interactive data for web apps
const interactiveData = await pdf.data({
  startPage: 1,
  endPage: 10,
  imageWidth: 1024,
  elementComposer: true
})

// Step 3: Create high-quality screenshots
const screenshots = await pdf.screenshot({
  startPage: 1,
  endPage: 10,
  imageWidth: 1200,
  imageQuality: 95
})

PDF Slicing and Processing

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

console.log(`Original PDF: ${pdf.numPages} pages`)

// Slice to first 5 pages (modifies internal PDF)
const sliceResult = await pdf.slice({
  numberPages: 5
})

console.log(`Sliced PDF: ${pdf.numPages} pages`) // Now shows 5
console.log(`Saved ${sliceResult.fileSize} bytes`)

// Process the sliced PDF
const pages = await pdf.decompose({
  elementComposer: true
})

Link Extraction

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Extract links from PDF content
const pagesWithLinks = await pdf.decompose({
  extractLinks: true,
  elementComposer: true
})

// Process found links
pagesWithLinks.pages.forEach((page, pageIndex) => {
  const linkElements = page.elements.filter((el) => el.type === 'link')

  linkElements.forEach((link) => {
    console.log(`Page ${pageIndex + 1}: Found ${link.attributes.linkType}`)
    console.log(`  URL: ${link.data}`)
    console.log(`  Position: [${link.boundingBox.left}, ${link.boundingBox.top}]`)

    if (link.attributes.text) {
      console.log(`  Context: "${link.attributes.text}"`)
    }
    }
  })
})

API Reference

PdfDecomposer Class

Constructor

new PdfDecomposer(input: Buffer | ArrayBuffer | Uint8Array)

Static Methods

// Factory method - create and initialize in one step
static async create(input: Buffer | ArrayBuffer | Uint8Array): Promise<PdfDecomposer>

Instance Methods

// Initialize PDF (required before other operations)
async initialize(): Promise<void>

// Extract content and structure
async decompose(options?: PdfDecomposerOptions): Promise<DecomposeResult>

// Generate page screenshots
async screenshot(options?: ScreenshotOptions): Promise<ScreenshotResult>

// Generate pwa-admin compatible data structure
async data(options?: DataOptions): Promise<DataResult>

// Slice PDF to specific page range
async slice(options?: SliceOptions): Promise<SliceResult>

// Subscribe to progress updates
subscribe(callback: (state: PdfDecomposerState) => void): void

// Get PDF and page fingerprints for caching
async getFingerprints(): Promise<{ pdfHash: string; pageHashes: string[]; total: number }>

Properties

readonly numPages: number           // Total number of pages
readonly fingerprint: string        // PDF fingerprint for caching
readonly initialized: boolean       // Initialization status

Options Interfaces

PdfDecomposerOptions

interface PdfDecomposerOptions {
  startPage?: number // First page (1-indexed, default: 1)
  endPage?: number // Last page (1-indexed, default: all)
  outputDir?: string // Output directory for files
  elementComposer?: boolean // Group text into paragraphs
  pageComposer?: boolean // Merge continuous content across pages
  extractImages?: boolean // Extract embedded images
  extractLinks?: boolean // Extract links and annotations from PDF
  minify?: boolean // Compact output format
  cleanComposer?: boolean // Remove headers/footers
  cleanComposerOptions?: PdfCleanComposerOptions
  minifyOptions?: {
    format?: 'plain' | 'html' // Data field format
    elementAttributes?: boolean // Include slim element attributes
  }
}

ScreenshotOptions

interface ScreenshotOptions {
  startPage?: number // First page (1-indexed)
  endPage?: number // Last page (1-indexed)
  outputDir?: string // Output directory for image files
  imageWidth?: number // Image width (default: 1200)
  imageQuality?: number // JPEG quality 1-100 (default: 90)
}

DataOptions

interface DataOptions {
  startPage?: number // First page (1-indexed)
  endPage?: number // Last page (1-indexed)
  outputDir?: string // Output directory
  extractImages?: boolean // Extract embedded images
  extractLinks?: boolean // Extract links and annotations
  elementComposer?: boolean // Group elements into paragraphs
  cleanComposer?: boolean // Clean content area
  imageWidth?: number // Screenshot width (default: 1024)
  imageQuality?: number // Screenshot quality (default: 90)
}

SliceOptions

interface SliceOptions {
  numberPages?: number // Number of pages from start
  startPage?: number // Starting page (1-indexed, default: 1)
  endPage?: number // Ending page (1-indexed)
}

PdfCleanComposerOptions

interface PdfCleanComposerOptions {
  topMarginPercent?: number // Exclude top % for headers (default: 0.1)
  bottomMarginPercent?: number // Exclude bottom % for footers (default: 0.1)
  sideMarginPercent?: number // Exclude side % (default: 0.05)
  minTextHeight?: number // Minimum text height (default: 8)
  minTextWidth?: number // Minimum text width (default: 10)
  minTextLength?: number // Minimum text length (default: 3)
  removeControlCharacters?: boolean // Remove non-printable chars (default: true)
  removeIsolatedCharacters?: boolean // Remove isolated chars (default: true)
  minImageWidth?: number // Minimum image width (default: 50)
  minImageHeight?: number // Minimum image height (default: 50)
  minImageArea?: number // Minimum image area (default: 2500)
  coverPageDetection?: boolean // Detect cover pages (default: true)
  coverPageThreshold?: number // Cover detection threshold (default: 0.8)
}
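The margin options above imply a content rectangle roughly like the following (a sketch of the implied math, not the library's actual internals; coordinates are in page points with the origin at the top-left):

```typescript
// Sketch of the content-area math implied by the margin options:
// everything outside this rectangle is treated as header/footer/side margin.
interface ContentArea {
  top: number
  bottom: number
  left: number
  right: number
}

function contentArea(
  pageWidth: number,
  pageHeight: number,
  topMarginPercent = 0.1,
  bottomMarginPercent = 0.1,
  sideMarginPercent = 0.05
): ContentArea {
  return {
    top: pageHeight * topMarginPercent,
    bottom: pageHeight * (1 - bottomMarginPercent),
    left: pageWidth * sideMarginPercent,
    right: pageWidth * (1 - sideMarginPercent)
  }
}
```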

Result Interfaces

DecomposeResult

interface DecomposeResult {
  pages: PdfPageContent[]
}

interface PdfPageContent {
  pageIndex: number // 0-based page index
  pageNumber: number // 1-based page number
  width: number // Page width in points
  height: number // Page height in points
  title: string // Page title
  elements: PdfElement[] // Extracted elements
  metadata?: {
    composedFromPages?: number[] // Original page indices (for pageComposer)
    [key: string]: any
  }
}

ScreenshotResult

interface ScreenshotResult {
  totalPages: number
  screenshots: ScreenshotPageResult[]
}

interface ScreenshotPageResult {
  pageNumber: number // 1-based page number
  width: number // Image width in pixels
  height: number // Image height in pixels
  screenshot: string // Base64 data URL
  filePath?: string // File path if outputDir provided
  error?: string // Error message if failed
}
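In Node.js, the base64 data URL in `screenshot` can be decoded back to raw image bytes, for example before writing a file (a hedged sketch; `dataUrlToBytes` is our helper, not a library export):

```typescript
import { Buffer } from 'node:buffer'

// Decode a `data:image/...;base64,...` URL into raw bytes (Node.js).
function dataUrlToBytes(dataUrl: string): Uint8Array {
  const base64 = dataUrl.split(',')[1] ?? ''
  return Uint8Array.from(Buffer.from(base64, 'base64'))
}
```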

DataResult

interface DataResult {
  data: PdfData[]
}

interface PdfData {
  id: string // Unique page identifier
  index: number // 0-based page index
  image: string // Page screenshot URL
  thumbnail: string // Thumbnail URL
  areas: PdfArea[] // Interactive areas
}

interface PdfArea {
  id: string // Unique area identifier
  coords: number[] // [x1, y1, x2, y2] normalized 0-1
  articleId: number // Associated article ID
  widgetId: string // Widget identifier (P: or T:)
}
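Since `coords` are normalized to 0-1, mapping an area onto a rendered page image is a simple scale (the helper and type names here are illustrative, not part of the library):

```typescript
// Map a PdfArea's normalized [x1, y1, x2, y2] coords onto a rendered
// page image of the given pixel size.
interface PixelRect {
  x: number
  y: number
  width: number
  height: number
}

function areaToPixels(coords: number[], imageWidth: number, imageHeight: number): PixelRect {
  const [x1, y1, x2, y2] = coords
  return {
    x: Math.round(x1 * imageWidth),
    y: Math.round(y1 * imageHeight),
    width: Math.round((x2 - x1) * imageWidth),
    height: Math.round((y2 - y1) * imageHeight)
  }
}
```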

SliceResult

interface SliceResult {
  pdfBytes: Uint8Array // Sliced PDF data
  originalPageCount: number // Original page count
  slicedPageCount: number // Sliced page count
  pageRange: {
    startPage: number
    endPage: number
  }
  fileSize: number // Size in bytes
}

Testing and Development

Run Tests

npm test                    # Comprehensive test suite
npm run test:screenshot     # Screenshot generation tests
npm run test:data          # PDF data generation tests

Build and Development

npm run build              # Build TypeScript to dist/
npm run build:watch        # Watch mode for development
npm run lint               # ESLint validation

Environment Support

| Feature           | Node.js | Browser | Notes                                    |
| ----------------- | ------- | ------- | ---------------------------------------- |
| Text Extraction   | Yes     | Yes     | Full support in both environments        |
| Image Extraction  | Yes     | Yes     | Universal canvas-based processing        |
| Screenshots       | Yes     | Yes     | Node.js uses canvas, browser Canvas API  |
| PDF Slicing       | Yes     | Yes     | Uses pdf-lib in both environments        |
| Progress Tracking | Yes     | Yes     | Observable pattern with callbacks        |
| Memory Management | Yes     | Limited | Advanced in Node.js, basic in browser    |
| File Output       | Yes     | No      | Browser returns data URLs/blobs          |
| Element Composer  | Yes     | Yes     | Smart text grouping                      |
| Page Composer     | Yes     | Yes     | Cross-page content merging               |
| Clean Composer    | Yes     | Yes     | Header/footer removal                    |

Browser Compatibility

  • Chrome 60+
  • Firefox 55+
  • Safari 11+
  • Edge 79+
  • Mobile browsers (iOS Safari, Chrome Mobile)

Node.js Requirements

  • Node.js 16+ required
  • Canvas optional for enhanced screenshot quality
  • TypeScript 4.9+ for development

Production Usage Examples

Memory Optimization

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Process in smaller batches for large PDFs
const totalPages = pdf.numPages
const batchSize = 10

for (let start = 1; start <= totalPages; start += batchSize) {
  const end = Math.min(start + batchSize - 1, totalPages)

  const batch = await pdf.decompose({
    startPage: start,
    endPage: end,
    elementComposer: true
  })

  // Process batch results...
}

Built-in Memory Limits (v1.0.6+):

  • MAX_SAFE_PIXELS: 2M pixels per image
  • MAX_DIMENSION: 2000px max width/height
  • MAX_IMAGES_PER_PAGE: 20 images
  • Canvas size limits: 1200x1600 for screenshots
  • Sequential processing to reduce peak memory
  • Use skipScreenshots: true in data() to skip page image generation

Error Handling

const pdf = new PdfDecomposer(buffer)

pdf.subscribe((state) => {
  console.log(`Progress: ${state.progress}%`)
})

try {
  await pdf.initialize()
  const result = await pdf.decompose()
} catch (error) {
  if (error.name === 'InvalidPdfError') {
    console.error('Invalid PDF format:', error.message)
  } else if (error.name === 'MemoryError') {
    console.error('Memory limit exceeded:', error.message)
  } else {
    console.error('Processing failed:', error.message)
  }
}

Caching Strategy

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Use fingerprint for caching
const fingerprints = await pdf.getFingerprints()
const cacheKey = `pdf_${fingerprints.pdfHash}`

// Check cache before processing (`cache` is any key-value store you provide)
const cached = cache.get(cacheKey)
if (!cached) {
  const result = await pdf.decompose()
  cache.set(cacheKey, result, { ttl: 3600 }) // 1 hour
}

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Use TypeScript for all new code
  • Add tests for new features
  • Update README for API changes
  • Follow existing code style
  • Test in both Node.js and browser environments

Publishing

Setup for Publishing

# Initial setup (run once)
npm run setup:publishing

# Verify configuration
npm run setup:verify

Publishing Commands

# Publish to NPM only
npm run publish:npm

# Publish to GitHub Packages only
npm run publish:github

# Publish to both registries
npm run publish:both

# Version bump + publish
npm version patch && npm run publish:both

License

PDF-Decomposer is dual-licensed:

Non-Commercial Use (Free)

  • Personal projects
  • Educational use
  • Research purposes
  • Open source projects

Commercial Use (Paid License Required)

  • Commercial applications
  • Revenue-generating products
  • Enterprise software
  • Distribution in commercial products

For commercial licensing, contact [email protected]

See LICENSE file for complete terms.
