@febbyrg/pdf-decomposer v1.0.9
A powerful PDF text and image extraction library with universal browser and Node.js support (dual licensed: free for non-commercial use, paid for commercial use).
PDF-Decomposer
A powerful TypeScript library for comprehensive PDF processing and content extraction. Optimized for production use with universal browser and Node.js support.
Core Features
PDF Decomposer Class
- Load Once, Use Many Times - Initialize PDF once, perform multiple operations
- Progress Tracking - Observable pattern with real-time progress callbacks
- Error Handling - Comprehensive error reporting with page-level context
- Memory Efficient - Built-in memory management and cleanup
- Universal Support - Works in Node.js 16+ and all modern browsers
Main Operations
1. Content Decomposition (decompose())
Extract structured text with positioning and formatting:
- Smart element composition with `elementComposer`
- Content area cleaning with `cleanComposer`
- Page-level composition with `pageComposer`
- Image extraction from embedded PDF objects
- Link extraction from PDF annotations and text patterns
- Smart URL detection with comprehensive email and domain pattern matching
2. Screenshot Generation (screenshot())
- High-quality page rendering to PNG/JPEG
- Configurable resolution and quality
- Batch processing with progress tracking
- File output or base64 data URLs
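When no output directory is given, each screenshot arrives as a base64 data URL. A small sketch of decoding such a URL into raw bytes (e.g. for upload to object storage); `dataUrlToBytes` is a hypothetical helper, not part of this library:

```typescript
// Decode a `data:image/...;base64,...` URL into raw bytes.
// The data-URL shape is the standard one; the MIME type depends on
// whether PNG or JPEG output was requested.
function dataUrlToBytes(dataUrl: string): Uint8Array {
  const comma = dataUrl.indexOf(',')
  if (!dataUrl.startsWith('data:') || comma === -1) {
    throw new Error('Not a data URL')
  }
  const base64 = dataUrl.slice(comma + 1)
  return new Uint8Array(Buffer.from(base64, 'base64'))
}
```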
3. PDF Data Generation (data())
- pwa-admin compatible data structure
- Interactive area mapping with normalized coordinates
- Widget ID generation following epub conventions
- Article relationship management
- `skipScreenshots` option for memory-constrained environments
4. PDF Slicing (slice())
- Extract specific page ranges
- Generate new PDF documents
- Replace internal document structure
- Preserve all metadata and formatting
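The exact precedence between `numberPages` and `startPage`/`endPage` is not spelled out here; one plausible resolution, sketched under the assumption that `numberPages` counts from page 1 and ranges are clamped to the document length:

```typescript
// Illustrative only - not the library's actual resolution logic.
function resolveSliceRange(
  totalPages: number,
  opts: { numberPages?: number; startPage?: number; endPage?: number }
): { startPage: number; endPage: number } {
  if (opts.numberPages !== undefined) {
    // Take the first N pages of the document
    return { startPage: 1, endPage: Math.min(opts.numberPages, totalPages) }
  }
  // Otherwise fall back to an explicit, clamped range
  const startPage = opts.startPage ?? 1
  const endPage = Math.min(opts.endPage ?? totalPages, totalPages)
  return { startPage, endPage }
}
```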
Advanced Content Processing
Element Composer
- Groups scattered text elements into coherent paragraphs
- Font-size based header element recognition (h1, h2, h3, etc.)
- Smart span merging for headers with same font-size/family but different colors
- Content consolidation for multiple heading tags
- Preserves reading order and text flow
- Smart font and spacing analysis
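Font-size based header recognition can be pictured as mapping each element's size relative to the body text onto a heading tag. The thresholds below are illustrative assumptions, not the library's actual values:

```typescript
// Sketch: promote text to h1/h2/h3 when it is noticeably larger
// than the detected body font size. Ratios are assumed, not documented.
function headingTagForSize(fontSize: number, bodySize: number): string {
  const ratio = fontSize / bodySize
  if (ratio >= 2.0) return 'h1'
  if (ratio >= 1.5) return 'h2'
  if (ratio >= 1.2) return 'h3'
  return 'p'
}
```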
Page Composer
- Merges continuous content across pages
- Detects article boundaries and section breaks
- Interview and feature content recognition
- Typography consistency analysis
Clean Composer
- Filters out headers, footers, and page numbers
- Content area detection with configurable margins
- Image size validation and filtering
- Control character removal
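The margin-based filtering above can be sketched as a content-area test: an element whose vertical center falls inside the top or bottom margin band is a header/footer candidate. This mirrors the documented `topMarginPercent`/`bottomMarginPercent` defaults, but the check itself is an illustration, not the library's implementation:

```typescript
interface Box { top: number; height: number }

// True when the element's vertical center lies inside the content
// area, i.e. outside the configured header/footer margin bands.
function isInContentArea(
  box: Box,
  pageHeight: number,
  topMarginPercent = 0.1,
  bottomMarginPercent = 0.1
): boolean {
  const centerY = box.top + box.height / 2
  const topLimit = pageHeight * topMarginPercent
  const bottomLimit = pageHeight * (1 - bottomMarginPercent)
  return centerY >= topLimit && centerY <= bottomLimit
}
```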
Image Extraction
- Universal browser-compatible processing
- Multiple format support (RGB, RGBA, Grayscale)
- Auto-scaling for memory safety
- Duplicate detection and removal
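Duplicate removal can be done by hashing decoded image bytes and keeping only the first occurrence of each hash. The FNV-1a sketch below is illustrative; the library's internal fingerprinting scheme may differ:

```typescript
// 32-bit FNV-1a hash over raw bytes.
function fnv1a(bytes: Uint8Array): number {
  let hash = 0x811c9dc5
  for (const b of bytes) {
    hash ^= b
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash >>> 0
}

// Keep only the first image with each byte-level hash.
function dedupeImages(images: Uint8Array[]): Uint8Array[] {
  const seen = new Set<number>()
  return images.filter((img) => {
    const h = fnv1a(img)
    if (seen.has(h)) return false
    seen.add(h)
    return true
  })
}
```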
Link Extraction (extractLinks: true)
- PDF Annotations: Extract interactive link annotations with URLs and destinations
- Text Pattern Matching: Detect URLs in text content (e.g., "GIA.edu/jewelryservices")
- Email Detection: Find email addresses in document text with automatic mailto: prefix
- Smart URL Recognition: Enhanced regex patterns for domain+path detection
- Link Types: Support for external URLs, internal PDF destinations, and email links
- No Duplicates: Intelligent handling prevents text/link element duplication
- Position Data: Accurate bounding box coordinates for each link
- Link Attributes: Rich metadata including link type, context text, and extraction method
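The text-pattern side of this can be pictured with two regexes, one for emails (prefixed with `mailto:`) and one for bare domain+path strings like "GIA.edu/jewelryservices" (promoted to `https` URLs). These patterns are illustrative sketches, not the library's actual regexes:

```typescript
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g
const URL_RE = /\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+(?:\/[^\s]*)?/g

// Collect mailto: links and bare-domain URLs from plain text.
function detectLinks(text: string): string[] {
  const links: string[] = []
  for (const email of text.match(EMAIL_RE) ?? []) {
    links.push(`mailto:${email}`)
  }
  // Strip emails first so the domain inside an address
  // is not also reported as a URL.
  const remaining = text.replace(EMAIL_RE, ' ')
  for (const url of remaining.match(URL_RE) ?? []) {
    links.push(url.startsWith('http') ? url : `https://${url}`)
  }
  return links
}
```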
Performance and Memory
- Memory Manager - Adaptive cleanup and monitoring
- Progress Callbacks - Real-time operation tracking
- Background Processing - Non-blocking operations
- Batch Processing - Efficient multi-page handling
Installation
```bash
npm install @febbyrg/pdf-decomposer

# For Node.js with canvas support (optional)
npm install canvas

# For browser usage
npm install pdfjs-dist
```
Quick Start
Class-Based API (Recommended)
```typescript
import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Load PDF once, use many times
const pdf = new PdfDecomposer(buffer) // Buffer, ArrayBuffer, or Uint8Array
await pdf.initialize()

// Multiple operations on the same PDF
const pages = await pdf.decompose({
  elementComposer: true, // Group text into paragraphs
  pageComposer: true,    // Merge continuous content across pages
  cleanComposer: true,   // Clean headers/footers
  extractImages: true,   // Extract embedded images
  extractLinks: true     // Extract links and annotations from the PDF
})

// Enhanced minifyOptions with element attributes
const styledPages = await pdf.decompose({
  elementComposer: true,
  minify: true,
  minifyOptions: {
    format: 'html',         // data field contains formatted HTML
    elementAttributes: true // Include styling information
  }
})

const screenshots = await pdf.screenshot({
  imageWidth: 1024,
  imageQuality: 90
})

// pwa-admin compatible format
const pdfData = await pdf.data({
  imageWidth: 1024,
  elementComposer: true
})

// Extract the first 5 pages
const sliced = await pdf.slice({
  numberPages: 5
})

// Access PDF properties
console.log(`Pages: ${pdf.numPages}`)
console.log(`Fingerprint: ${pdf.fingerprint}`)
```
Factory Method (One-liner)
```typescript
import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Create and initialize in one step
const pdf = await PdfDecomposer.create(buffer)
const pages = await pdf.decompose({ elementComposer: true })
```
Progress Tracking
```typescript
const pdf = new PdfDecomposer(buffer)

// Subscribe to progress updates
pdf.subscribe((state) => {
  console.log(`${state.progress}% - ${state.message}`)
})

await pdf.initialize()
const result = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true
})
```
Browser Environment (Angular, React, Vue)
```typescript
import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// In the browser, use the File API
async function processPdfFile(file: File) {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = new PdfDecomposer(arrayBuffer)
  await pdf.initialize()
  return await pdf.decompose({
    elementComposer: true,
    extractImages: true
  })
}

// Configure the PDF.js worker (once per app)
import { PdfWorkerConfig } from '@febbyrg/pdf-decomposer'
PdfWorkerConfig.configure() // Auto-configures the worker URL
```
Advanced Usage Examples
Content Processing Pipeline
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Step 1: Extract raw content with advanced processing
const pages = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true,
  pageComposer: true,
  cleanComposer: true,
  extractImages: true,
  minify: true,
  cleanComposerOptions: {
    topMarginPercent: 0.15,
    bottomMarginPercent: 0.1,
    minTextHeight: 8,
    removeControlCharacters: true
  }
})

// Step 2: Generate interactive data for web apps
const interactiveData = await pdf.data({
  startPage: 1,
  endPage: 10,
  imageWidth: 1024,
  elementComposer: true
})

// Step 3: Create high-quality screenshots
const screenshots = await pdf.screenshot({
  startPage: 1,
  endPage: 10,
  imageWidth: 1200,
  imageQuality: 95
})
```
PDF Slicing and Processing
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

console.log(`Original PDF: ${pdf.numPages} pages`)

// Slice to the first 5 pages (modifies the internal PDF)
const sliceResult = await pdf.slice({
  numberPages: 5
})

console.log(`Sliced PDF: ${pdf.numPages} pages`) // Now shows 5
console.log(`Saved ${sliceResult.fileSize} bytes`)

// Process the sliced PDF
const pages = await pdf.decompose({
  elementComposer: true
})
```
Link Extraction
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Extract links from PDF content
const pagesWithLinks = await pdf.decompose({
  extractLinks: true,
  elementComposer: true
})

// Process the links that were found
pagesWithLinks.pages.forEach((page, pageIndex) => {
  const linkElements = page.elements.filter((el) => el.type === 'link')
  linkElements.forEach((link) => {
    console.log(`Page ${pageIndex + 1}: Found ${link.attributes.linkType}`)
    console.log(`  URL: ${link.data}`)
    console.log(`  Position: [${link.boundingBox.left}, ${link.boundingBox.top}]`)
    if (link.attributes.text) {
      console.log(`  Context: "${link.attributes.text}"`)
    }
  })
})
```
API Reference
PdfDecomposer Class
Constructor
```typescript
new PdfDecomposer(input: Buffer | ArrayBuffer | Uint8Array)
```
Static Methods
```typescript
// Factory method - create and initialize in one step
static async create(input: Buffer | ArrayBuffer | Uint8Array): Promise<PdfDecomposer>
```
Instance Methods
```typescript
// Initialize PDF (required before other operations)
async initialize(): Promise<void>

// Extract content and structure
async decompose(options?: PdfDecomposerOptions): Promise<DecomposeResult>

// Generate page screenshots
async screenshot(options?: ScreenshotOptions): Promise<ScreenshotResult>

// Generate a pwa-admin compatible data structure
async data(options?: DataOptions): Promise<DataResult>

// Slice the PDF to a specific page range
async slice(options?: SliceOptions): Promise<SliceResult>

// Subscribe to progress updates
subscribe(callback: (state: PdfDecomposerState) => void): void

// Get PDF and page fingerprints for caching
async getFingerprints(): Promise<{ pdfHash: string; pageHashes: string[]; total: number }>
```
Properties
```typescript
readonly numPages: number     // Total number of pages
readonly fingerprint: string  // PDF fingerprint for caching
readonly initialized: boolean // Initialization status
```
Options Interfaces
PdfDecomposerOptions
```typescript
interface PdfDecomposerOptions {
  startPage?: number        // First page (1-indexed, default: 1)
  endPage?: number          // Last page (1-indexed, default: all)
  outputDir?: string        // Output directory for files
  elementComposer?: boolean // Group text into paragraphs
  pageComposer?: boolean    // Merge continuous content across pages
  extractImages?: boolean   // Extract embedded images
  extractLinks?: boolean    // Extract links and annotations from the PDF
  minify?: boolean          // Compact output format
  cleanComposer?: boolean   // Remove headers/footers
  cleanComposerOptions?: PdfCleanComposerOptions
  minifyOptions?: {
    format?: 'plain' | 'html'   // Data field format
    elementAttributes?: boolean // Include slim element attributes
  }
}
```
ScreenshotOptions
```typescript
interface ScreenshotOptions {
  startPage?: number    // First page (1-indexed)
  endPage?: number      // Last page (1-indexed)
  outputDir?: string    // Output directory for image files
  imageWidth?: number   // Image width (default: 1200)
  imageQuality?: number // JPEG quality 1-100 (default: 90)
}
```
DataOptions
```typescript
interface DataOptions {
  startPage?: number        // First page (1-indexed)
  endPage?: number          // Last page (1-indexed)
  outputDir?: string        // Output directory
  extractImages?: boolean   // Extract embedded images
  extractLinks?: boolean    // Extract links and annotations
  elementComposer?: boolean // Group elements into paragraphs
  cleanComposer?: boolean   // Clean content area
  imageWidth?: number       // Screenshot width (default: 1024)
  imageQuality?: number     // Screenshot quality (default: 90)
  skipScreenshots?: boolean // Skip page screenshots (memory-constrained environments)
}
```
SliceOptions
```typescript
interface SliceOptions {
  numberPages?: number // Number of pages from the start
  startPage?: number   // Starting page (1-indexed, default: 1)
  endPage?: number     // Ending page (1-indexed)
}
```
PdfCleanComposerOptions
```typescript
interface PdfCleanComposerOptions {
  topMarginPercent?: number          // Exclude top % for headers (default: 0.1)
  bottomMarginPercent?: number       // Exclude bottom % for footers (default: 0.1)
  sideMarginPercent?: number         // Exclude side % (default: 0.05)
  minTextHeight?: number             // Minimum text height (default: 8)
  minTextWidth?: number              // Minimum text width (default: 10)
  minTextLength?: number             // Minimum text length (default: 3)
  removeControlCharacters?: boolean  // Remove non-printable chars (default: true)
  removeIsolatedCharacters?: boolean // Remove isolated chars (default: true)
  minImageWidth?: number             // Minimum image width (default: 50)
  minImageHeight?: number            // Minimum image height (default: 50)
  minImageArea?: number              // Minimum image area (default: 2500)
  coverPageDetection?: boolean       // Detect cover pages (default: true)
  coverPageThreshold?: number        // Cover detection threshold (default: 0.8)
}
```
Result Interfaces
DecomposeResult
```typescript
interface DecomposeResult {
  pages: PdfPageContent[]
}

interface PdfPageContent {
  pageIndex: number      // 0-based page index
  pageNumber: number     // 1-based page number
  width: number          // Page width in points
  height: number         // Page height in points
  title: string          // Page title
  elements: PdfElement[] // Extracted elements
  metadata?: {
    composedFromPages?: number[] // Original page indices (for pageComposer)
    [key: string]: any
  }
}
```
ScreenshotResult
```typescript
interface ScreenshotResult {
  totalPages: number
  screenshots: ScreenshotPageResult[]
}

interface ScreenshotPageResult {
  pageNumber: number // 1-based page number
  width: number      // Image width in pixels
  height: number     // Image height in pixels
  screenshot: string // Base64 data URL
  filePath?: string  // File path if outputDir provided
  error?: string     // Error message if the page failed
}
```
DataResult
```typescript
interface DataResult {
  data: PdfData[]
}

interface PdfData {
  id: string        // Unique page identifier
  index: number     // 0-based page index
  image: string     // Page screenshot URL
  thumbnail: string // Thumbnail URL
  areas: PdfArea[]  // Interactive areas
}

interface PdfArea {
  id: string        // Unique area identifier
  coords: number[]  // [x1, y1, x2, y2] normalized 0-1
  articleId: number // Associated article ID
  widgetId: string  // Widget identifier (P: or T:)
}
```
SliceResult
```typescript
interface SliceResult {
  pdfBytes: Uint8Array      // Sliced PDF data
  originalPageCount: number // Original page count
  slicedPageCount: number   // Sliced page count
  pageRange: {
    startPage: number
    endPage: number
  }
  fileSize: number          // Size in bytes
}
```
Testing and Development
Run Tests
```bash
npm test                # Comprehensive test suite
npm run test:screenshot # Screenshot generation tests
npm run test:data       # PDF data generation tests
```
Build and Development
```bash
npm run build       # Build TypeScript to dist/
npm run build:watch # Watch mode for development
npm run lint        # ESLint validation
```
Environment Support
| Feature           | Node.js | Browser | Notes                                       |
| ----------------- | ------- | ------- | ------------------------------------------- |
| Text Extraction   | Yes     | Yes     | Full support in both environments           |
| Image Extraction  | Yes     | Yes     | Universal canvas-based processing           |
| Screenshots       | Yes     | Yes     | Node.js uses `canvas`, browser uses the Canvas API |
| PDF Slicing       | Yes     | Yes     | Uses pdf-lib in both environments           |
| Progress Tracking | Yes     | Yes     | Observable pattern with callbacks           |
| Memory Management | Yes     | Limited | Advanced in Node.js, basic in browser       |
| File Output       | Yes     | No      | Browser returns data URLs/blobs             |
| Element Composer  | Yes     | Yes     | Smart text grouping                         |
| Page Composer     | Yes     | Yes     | Cross-page content merging                  |
| Clean Composer    | Yes     | Yes     | Header/footer removal                       |
Browser Compatibility
- Chrome 60+
- Firefox 55+
- Safari 11+
- Edge 79+
- Mobile browsers (iOS Safari, Chrome Mobile)
Node.js Requirements
- Node.js 16+ required
- Canvas optional for enhanced screenshot quality
- TypeScript 4.9+ for development
Production Usage Examples
Memory Optimization
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Process large PDFs in smaller batches
const totalPages = pdf.numPages
const batchSize = 10

for (let start = 1; start <= totalPages; start += batchSize) {
  const end = Math.min(start + batchSize - 1, totalPages)
  const batch = await pdf.decompose({
    startPage: start,
    endPage: end,
    elementComposer: true
  })
  // Process batch results...
}
```
Built-in Memory Limits (v1.0.6+):
- `MAX_SAFE_PIXELS`: 2M pixels per image
- `MAX_DIMENSION`: 2000px max width/height
- `MAX_IMAGES_PER_PAGE`: 20 images
- Canvas size limits: 1200x1600 for screenshots
- Sequential processing to reduce peak memory
- Use `skipScreenshots: true` in `data()` to skip page image generation
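The pixel and dimension caps above can be enforced by downscaling before decoding; a sketch, using the documented limits (2,000,000 pixels, 2000px max dimension) but with scaling logic that is illustrative rather than the library's internal code:

```typescript
const MAX_SAFE_PIXELS = 2_000_000
const MAX_DIMENSION = 2000

// Scale (width, height) down uniformly until both the per-dimension
// cap and the total-pixel cap are satisfied.
function clampImageSize(width: number, height: number): { width: number; height: number } {
  let scale = 1
  scale = Math.min(scale, MAX_DIMENSION / Math.max(width, height))
  scale = Math.min(scale, Math.sqrt(MAX_SAFE_PIXELS / (width * height)))
  if (scale >= 1) return { width, height } // Already within limits
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale))
  }
}
```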
Error Handling
```typescript
const pdf = new PdfDecomposer(buffer)

pdf.subscribe((state) => {
  console.log(`Progress: ${state.progress}%`)
})

try {
  await pdf.initialize()
  const result = await pdf.decompose()
} catch (error) {
  if (error.name === 'InvalidPdfError') {
    console.error('Invalid PDF format:', error.message)
  } else if (error.name === 'MemoryError') {
    console.error('Memory limit exceeded:', error.message)
  } else {
    console.error('Processing failed:', error.message)
  }
}
```
Caching Strategy
```typescript
const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Use the fingerprint as a cache key
const fingerprints = await pdf.getFingerprints()
const cacheKey = `pdf_${fingerprints.pdfHash}`

// Check the cache before processing
const cached = cache.get(cacheKey)
if (!cached) {
  const result = await pdf.decompose()
  cache.set(cacheKey, result, { ttl: 3600 }) // 1 hour
}
```
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Development Guidelines
- Use TypeScript for all new code
- Add tests for new features
- Update README for API changes
- Follow existing code style
- Test in both Node.js and browser environments
Publishing
Setup for Publishing
```bash
# Initial setup (run once)
npm run setup:publishing

# Verify configuration
npm run setup:verify
```
Publishing Commands
```bash
# Publish to NPM only
npm run publish:npm

# Publish to GitHub Packages only
npm run publish:github

# Publish to both registries
npm run publish:both

# Version bump + publish
npm version patch && npm run publish:both
```
License
PDF-Decomposer is dual-licensed:
Non-Commercial Use (Free)
- Personal projects
- Educational use
- Research purposes
- Open source projects
Commercial Use (Paid License Required)
- Commercial applications
- Revenue-generating products
- Enterprise software
- Distribution in commercial products
For commercial licensing, contact [email protected]
See LICENSE file for complete terms.
