@shubhu/pdfdiff

v1.6.0

Published

6 months ago

Platform/framework agnostic PDF diffing library with CLI support

shubhu

pdf diff cli comparison text-extraction

PDFDiff

A platform/framework agnostic PDF diffing library with CLI support. Compare PDF files and get detailed text-based differences using modern JavaScript.

Features

🔍 Text-based PDF comparison using pdfjs-dist for accurate text extraction
📊 Multiple diff modes - character, word, and line-level comparison
🖥️ CLI interface with comprehensive options and colored output
📦 ESM support for modern JavaScript environments
🌐 Browser compatible with standalone builds for web applications
🔧 Platform agnostic - works across Node.js and browser environments
📝 JSDoc type annotations for better development experience
⚡ Fast comparison with detailed timing information
🎯 Flexible ignore options for whitespace and case differences
🎨 Visual diff support with positioned text overlay for PDF rendering

Installation

Global Installation (CLI usage)

npm install -g pdfdiff

Local Installation (Library usage)

npm install pdfdiff

CLI Usage

Basic Comparison

pdfdiff file1.pdf file2.pdf

Available Options

pdfdiff <pdf1> <pdf2> [options]

Arguments:
  pdf1                  First PDF file to compare
  pdf2                  Second PDF file to compare

Options:
  -i, --ignore-whitespace    Ignore whitespace differences
  -c, --ignore-case          Ignore case differences  
  -u, --show-unchanged       Show unchanged lines in output
  -m, --mode <mode>          Diff mode: char (default), word, or line
  --context <number>         Number of context lines around changes (default: 3)
  --no-color                 Disable color output
  -h, --help                 Show help message
  -v, --version              Show version number

Examples

# Basic comparison (character-level by default)
pdfdiff document1.pdf document2.pdf

# Word-level comparison
pdfdiff file1.pdf file2.pdf --mode word

# Line-level comparison  
pdfdiff file1.pdf file2.pdf --mode line

# Ignore whitespace differences
pdfdiff file1.pdf file2.pdf --ignore-whitespace

# Show unchanged lines with custom context
pdfdiff file1.pdf file2.pdf --show-unchanged --context 5

# Ignore case differences
pdfdiff file1.pdf file2.pdf --ignore-case

# Combined options
pdfdiff file1.pdf file2.pdf --mode word --ignore-case --no-color

Exit Codes

0 - Files are identical
1 - Files are different or an error occurred
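
Because exit code 1 covers both "files differ" and "an error occurred", a calling script can only treat 0 as a definite answer. The following is a minimal sketch of that mapping; `interpretExit` is a hypothetical helper and not part of the package.

```javascript
// Hypothetical helper: maps a pdfdiff exit status to a label.
// `status` is a number, or null when the process was ended by a signal.
function interpretExit(status) {
  if (status === 0) return 'identical';
  // The CLI uses 1 for both outcomes, so they cannot be told apart here.
  if (status === 1) return 'different-or-error';
  return 'unknown';
}

console.log(interpretExit(0)); // identical
console.log(interpretExit(1)); // different-or-error
```

The status can come from any process runner, for example the `status` field of the object returned by Node's `child_process.spawnSync`.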

Diff Modes

PDFDiff supports three comparison modes, each with a different level of granularity:

Character Mode (Default)

Most granular: Compares text character by character
Best for: Detecting small changes, typos, and precise modifications
Output: Shows exact character differences
Example: Hello World vs Hello Earth shows individual character changes

Word Mode

Moderate granularity: Compares text word by word
Best for: Content changes, word replacements, and readability
Output: Shows word-level additions and removals
Example: Hello World vs Hello Earth shows World removed, Earth added

Line Mode

Least granular: Compares text line by line
Best for: Structural changes, paragraph modifications
Output: Shows entire line differences
Example: Full lines shown as added or removed

Performance Considerations

Character mode: More detailed output, larger diffs for big changes
Word mode: Balanced detail and readability
Line mode: Fastest processing, most concise output for large documents
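
To make the granularity of the three modes concrete, the sketch below splits a string into the units each mode would compare. It is illustrative only: `unitsFor` is a hypothetical helper, and the library's actual tokenization (delegated to the `diff` dependency) may treat whitespace and punctuation differently.

```javascript
// Illustrative only: splits text into the units each mode compares.
// This is not the library's tokenizer.
function unitsFor(text, mode = 'char') {
  switch (mode) {
    case 'char':
      return Array.from(text); // one entry per code point
    case 'word':
      return text.split(/\s+/).filter(Boolean); // whitespace-separated words
    case 'line':
      return text.split(/\r?\n/); // one entry per line
    default:
      throw new Error(`Unknown mode: ${mode}`);
  }
}

const sample = 'Hello World\nHello Earth';
console.log(unitsFor(sample, 'char').length); // 23
console.log(unitsFor(sample, 'word'));        // [ 'Hello', 'World', 'Hello', 'Earth' ]
console.log(unitsFor(sample, 'line'));        // [ 'Hello World', 'Hello Earth' ]
```

Fewer, larger units mean fewer comparisons, which is consistent with line mode being the fastest and character mode the most detailed.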

Library Usage

ESM Import

import { comparePdfs, extractPdfText, formatDiff } from 'pdfdiff';

Extract Text from PDF

import { readFile } from 'node:fs/promises';
import { extractPdfText } from 'pdfdiff';

// From file path
const text = await extractPdfText('./document.pdf');
console.log(text);

// From Buffer
const buffer = await readFile('./document.pdf');
const textFromBuffer = await extractPdfText(buffer);

Compare PDFs

import { comparePdfs } from 'pdfdiff';

// Basic comparison (character mode by default)
const basicResult = await comparePdfs('./file1.pdf', './file2.pdf');

// With options
const result = await comparePdfs('./file1.pdf', './file2.pdf', {
  mode: 'word',              // 'char' (default), 'word', or 'line'
  ignoreWhitespace: false,
  ignoreCase: false
});

console.log(result.summary);
console.log(result.identical); // boolean
console.log(result.changes);   // array of diff changes

Format Diff Output

import { comparePdfs, formatDiff } from 'pdfdiff';

const diffResult = await comparePdfs('./file1.pdf', './file2.pdf');
const formatted = formatDiff(diffResult, {
  showUnchanged: true,
  context: 3
});

console.log(formatted);
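
formatDiff returns a string intended for the console. When a different presentation is needed, the changes array can be rendered directly. A minimal sketch, assuming the DiffChange shape documented in this README; `renderInline` is a hypothetical helper (not part of the library), and the markers follow the convention of git's word diff.

```javascript
// Renders a DiffChange array as one string with inline markers:
// removed text as [-...-] and added text as {+...+}.
function renderInline(changes) {
  return changes
    .map((change) => {
      if (change.added) return `{+${change.value}+}`;
      if (change.removed) return `[-${change.value}-]`;
      return change.value; // unchanged text passes through
    })
    .join('');
}

const changes = [
  { value: 'Hello ', count: 6 },
  { value: 'World', removed: true, count: 5 },
  { value: 'Earth', added: true, count: 5 },
  { value: '!', count: 1 },
];
console.log(renderInline(changes)); // Hello [-World-]{+Earth+}!
```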

Visual Diff with Positioned Text

import { extractPositionedPdfText, comparePdfs } from 'pdfdiff';

// Extract positioned text for visual overlays
const positions1 = await extractPositionedPdfText('./file1.pdf');
const positions2 = await extractPositionedPdfText('./file2.pdf');

// Get diff changes
const diffResult = await comparePdfs('./file1.pdf', './file2.pdf');

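// Create visual overlays for PDF viewers.
// side is 'old' for the first PDF and 'new' for the second.
// findTextInRange is a helper you supply: it returns the positioned items
// covering `length` characters starting at `offset` in that PDF's text.
function mapDiffToPositions(diffChanges, positions, side) {
  const overlays = [];
  let textOffset = 0;

  for (const change of diffChanges) {
    // Added text exists only in the new PDF, removed text only in the old one.
    const presentOnSide = side === 'old' ? !change.added : !change.removed;
    if (!presentOnSide) continue;

    if (change.added || change.removed) {
      overlays.push({
        type: change.added ? 'addition' : 'removal',
        text: change.value,
        positions: findTextInRange(positions, textOffset, change.value.length)
      });
    }
    // Advance only through text that is present in this PDF.
    textOffset += change.value.length;
  }

  return overlays;
}

const overlays1 = mapDiffToPositions(diffResult.changes, positions1, 'old');
const overlays2 = mapDiffToPositions(diffResult.changes, positions2, 'new');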

API Reference

`extractPdfText(pdfPath)`

Extract text content from a PDF file.

Parameters:

pdfPath (string|Buffer): Path to PDF file or Buffer containing PDF data

Returns: Promise<string> - Extracted text content

`extractPositionedPdfText(pdfPath)`

Extract text content with positioning information from a PDF file.

Parameters:

pdfPath (string|Buffer): Path to PDF file or Buffer containing PDF data

Returns: Promise<Array<PageTextContent>> - Array of pages with positioned text items

`comparePdfs(pdf1, pdf2, options?)`

Compare two PDF files and return differences.

Parameters:

pdf1 (string|Buffer): First PDF file path or Buffer
pdf2 (string|Buffer): Second PDF file path or Buffer
options (Object, optional):
- mode (string): Diff mode - 'char' (default), 'word', or 'line'
- ignoreWhitespace (boolean): Ignore whitespace differences (default: false)
- ignoreCase (boolean): Ignore case differences (default: false)

Returns: Promise<DiffResult>

DiffResult:

{
  changes: Array<DiffChange>,  // Array of diff changes
  summary: string,             // Summary of changes
  identical: boolean           // Whether PDFs are identical
}

`formatDiff(diffResult, options?)`

Format diff output for console display.

Parameters:

diffResult (DiffResult): Result from comparePdfs
options (Object, optional):
- showUnchanged (boolean): Show unchanged lines (default: false)
- context (number): Number of context lines around changes (default: 3)

Returns: string - Formatted diff output

Output Data Specification

DiffResult Object

The core output from comparePdfs() follows this structure:

{
  changes: Array<DiffChange>,  // Array of individual changes
  summary: string,             // Human-readable summary
  identical: boolean           // Whether files are identical
}

DiffChange Object

Each change in the changes array represents a segment of text with its status:

{
  value: string,        // The text content of this change
  added?: boolean,      // true if this text was added (undefined for unchanged)
  removed?: boolean,    // true if this text was removed (undefined for unchanged)
  count?: number        // Number of units (chars/words/lines) in this change
}

Change Types

Unchanged segments: { value: "text", count: 5 }
Added segments: { value: "new text", added: true, count: 2 }
Removed segments: { value: "old text", removed: true, count: 2 }
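
One consequence of these three segment types is that both input texts can be rebuilt from a single changes array: the old text is every segment that was not added, and the new text is every segment that was not removed. A minimal sketch, where `reconstructTexts` is a hypothetical helper and the segment values are assumed to carry the compared text verbatim:

```javascript
// Rebuilds the old and new text from a DiffChange array.
function reconstructTexts(changes) {
  let before = '';
  let after = '';
  for (const change of changes) {
    if (!change.added) before += change.value;  // unchanged or removed
    if (!change.removed) after += change.value; // unchanged or added
  }
  return { before, after };
}

const { before, after } = reconstructTexts([
  { value: 'Hello ', count: 6 },
  { value: 'World', removed: true, count: 5 },
  { value: 'Earth', added: true, count: 5 },
  { value: '!', count: 1 },
]);
console.log(before); // Hello World!
console.log(after);  // Hello Earth!
```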

Summary Format

The summary string format varies by diff mode:

Character mode: "724 characters added, 775 characters removed"
Word mode: "181 words added, 159 words removed"
Line mode: "2 lines added, 2 lines removed"
Identical files: "PDFs are identical" (all modes)

Example Output

{
  changes: [
    { value: "Hello ", count: 6 },                    // Unchanged
    { value: "World", removed: true, count: 5 },      // Removed
    { value: "Earth", added: true, count: 5 },        // Added
    { value: "!\nThis is a test.", count: 17 }        // Unchanged
  ],
  summary: "5 characters added, 5 characters removed",
  identical: false
}
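
The numbers in the summary can be recomputed from the changes array by adding up `count` per segment type. The sketch below does this for the example output; `countChanges` is a hypothetical helper and is not claimed to be how the library builds its summary.

```javascript
// Totals the units (chars, words, or lines) per segment type.
function countChanges(changes) {
  const totals = { added: 0, removed: 0, unchanged: 0 };
  for (const change of changes) {
    // count is optional in the spec; the string length is a fallback
    // that only matches character mode.
    const units = change.count ?? change.value.length;
    if (change.added) totals.added += units;
    else if (change.removed) totals.removed += units;
    else totals.unchanged += units;
  }
  return totals;
}

const totals = countChanges([
  { value: 'Hello ', count: 6 },
  { value: 'World', removed: true, count: 5 },
  { value: 'Earth', added: true, count: 5 },
  { value: '!\nThis is a test.', count: 17 },
]);
console.log(totals); // { added: 5, removed: 5, unchanged: 23 }
```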

Visual Diff Output Specification

For visual diff applications (like overlaying differences on PDF renderings), the library provides positioned text data that can be used to create visual overlays.

Positioned Text Extraction

import { extractPositionedPdfText } from 'pdfdiff';

const positionedText = await extractPositionedPdfText('./document.pdf');
console.log(positionedText);

PositionedTextData Structure

[
  {
    page: 1,                    // Page number (1-based)
    items: [                    // Array of positioned text items
      {
        text: "Hello World",    // Text content
        x: 72,                  // X coordinate (points)
        y: 720,                 // Y coordinate (points, top-down)
        width: 85.2,            // Text width (points)
        height: 12,             // Text height (points)
        transform: [12, 0, 0, 12, 72, 720], // Full transformation matrix
        fontName: "Arial-Bold", // Font name (if available)
        page: 1                 // Page reference
      }
    ],
    viewport: {
      width: 612,               // Page width (points)
      height: 792               // Page height (points)
    }
  }
]

Visual Diff Overlay Usage

The positioned text data can be combined with diff results to create visual overlays:

import { extractPositionedPdfText, comparePdfs } from 'pdfdiff';

// Extract positioned text from both PDFs
const positions1 = await extractPositionedPdfText('./file1.pdf');
const positions2 = await extractPositionedPdfText('./file2.pdf');

// Get text-based diff
const diffResult = await comparePdfs('./file1.pdf', './file2.pdf');

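// Create visual overlay data by mapping diff changes to text positions.
// side is 'old' for the first PDF and 'new' for the second.
// findTextItemsInRange and calculateBounds are not part of the documented API;
// supply your own implementations.
function createVisualDiff(positions, diffChanges, side) {
  const overlays = [];
  let textOffset = 0;

  for (const change of diffChanges) {
    // Added text exists only in the new PDF, removed text only in the old one.
    const presentOnSide = side === 'old' ? !change.added : !change.removed;
    if (!presentOnSide) continue;

    if (change.added || change.removed) {
      // Find corresponding positioned text items
      const matchingItems = findTextItemsInRange(positions, textOffset, change.value.length);

      overlays.push({
        type: change.added ? 'addition' : 'removal',
        items: matchingItems,
        bounds: calculateBounds(matchingItems)
      });
    }
    // Advance only through text that is present in this PDF.
    textOffset += change.value.length;
  }

  return overlays;
}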
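
createVisualDiff calls two helpers, findTextItemsInRange and calculateBounds, that this README does not define. The sketch below shows one possible implementation of each against the PositionedTextData structure. It rests on two assumptions that should be checked against real output: that character offsets index into the plain concatenation of every item's text (page by page, with no separators), and that each item's x and y give the top-left corner of its box.

```javascript
// Hypothetical helper: returns the positioned items whose text overlaps
// the character range [start, start + length).
// ASSUMPTION: offsets index into the concatenation of item.text values,
// page by page, with no separators between items.
function findTextItemsInRange(pages, start, length) {
  const end = start + length;
  const matches = [];
  let offset = 0;
  for (const page of pages) {
    for (const item of page.items) {
      const itemEnd = offset + item.text.length;
      if (offset < end && itemEnd > start) matches.push(item);
      offset = itemEnd;
    }
  }
  return matches;
}

// Hypothetical helper: smallest rectangle enclosing all items, or null
// when there are none. Only meaningful for items on the same page.
// ASSUMPTION: (x, y) is the top-left corner of each item's box.
function calculateBounds(items) {
  if (items.length === 0) return null;
  const left = Math.min(...items.map((i) => i.x));
  const top = Math.min(...items.map((i) => i.y));
  const right = Math.max(...items.map((i) => i.x + i.width));
  const bottom = Math.max(...items.map((i) => i.y + i.height));
  return { x: left, y: top, width: right - left, height: bottom - top };
}

const pages = [{
  page: 1,
  items: [
    { text: 'Hello ', x: 72, y: 72, width: 30, height: 12, page: 1 },
    { text: 'World', x: 102, y: 72, width: 32, height: 12, page: 1 },
  ],
  viewport: { width: 612, height: 792 },
}];

const hits = findTextItemsInRange(pages, 6, 5); // the range covering "World"
console.log(calculateBounds(hits)); // { x: 102, y: 72, width: 32, height: 12 }
```

If the extracted text contains spaces or newlines between items, the offset bookkeeping has to account for them, otherwise overlays will drift further from the changed text the deeper into the document they appear.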

Coordinate System

Origin: Top-left corner of the page
Units: Points (1/72 inch)
Y-axis: Top-down (0 at top, increases downward)
US Letter page: 612 x 792 points (8.5" x 11" at 72 points per inch)
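
Two conversions come up often when using these coordinates: scaling points to the pixel size of a rendered page, and flipping to native PDF user space, which has its origin at the bottom-left with y increasing upward. A minimal sketch; `toPixels` and `toPdfSpace` are hypothetical helpers, and the box is assumed to be given by its top-left corner.

```javascript
// Scales a box in page points (top-left origin, y increasing downward)
// to pixels for a page rendered at `scale` pixels per point.
function toPixels(box, scale) {
  return {
    left: box.x * scale,
    top: box.y * scale,
    width: box.width * scale,
    height: box.height * scale,
  };
}

// Converts the same box to PDF user space (bottom-left origin, y increasing
// upward). The returned x and y are the box's lower-left corner.
function toPdfSpace(box, pageHeight) {
  return {
    x: box.x,
    y: pageHeight - box.y - box.height,
    width: box.width,
    height: box.height,
  };
}

const box = { x: 72, y: 72, width: 85.2, height: 12 };
console.log(toPixels(box, 2));     // { left: 144, top: 144, width: 170.4, height: 24 }
console.log(toPdfSpace(box, 792)); // { x: 72, y: 708, width: 85.2, height: 12 }
```

For a viewer that renders at CSS resolution, a scale of 96 / 72 maps points to CSS pixels.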

Visual Overlay Applications

The positioned text data enables:

SVG overlays: Create <rect> elements highlighting differences
Canvas rendering: Draw colored rectangles over changed text areas
HTML positioning: Absolutely position diff markers over PDF viewers
Annotation layers: Add visual indicators for additions/removals

Example SVG Overlay

function createSVGOverlay(visualDiff) {
  const svg = document.createElementNS('http://www.w3.org/2000/svg', 'svg');
  
  visualDiff.forEach(overlay => {
    const rect = document.createElementNS('http://www.w3.org/2000/svg', 'rect');
    rect.setAttribute('x', overlay.bounds.x);
    rect.setAttribute('y', overlay.bounds.y);
    rect.setAttribute('width', overlay.bounds.width);
    rect.setAttribute('height', overlay.bounds.height);
    rect.setAttribute('fill', overlay.type === 'addition' ? 'rgba(0,255,0,0.3)' : 'rgba(255,0,0,0.3)');
    rect.setAttribute('stroke', overlay.type === 'addition' ? '#00aa00' : '#aa0000');
    svg.appendChild(rect);
  });
  
  return svg;
}
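
createSVGOverlay needs a DOM. Where none is available (for example when generating a report in Node.js), the same overlay can be produced as a markup string. A sketch under the same assumptions about the overlay shape (`type` plus `bounds`); `svgOverlayMarkup` is a hypothetical helper.

```javascript
// Builds SVG markup for diff overlays without touching the DOM.
// `viewport` is the page size in points, as in PositionedTextData.
function svgOverlayMarkup(visualDiff, viewport) {
  const rects = visualDiff
    .filter((overlay) => overlay.bounds) // skip changes with no matched items
    .map(({ type, bounds }) => {
      const fill = type === 'addition' ? 'rgba(0,255,0,0.3)' : 'rgba(255,0,0,0.3)';
      return `<rect x="${bounds.x}" y="${bounds.y}" width="${bounds.width}" height="${bounds.height}" fill="${fill}"/>`;
    });
  return `<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 ${viewport.width} ${viewport.height}">${rects.join('')}</svg>`;
}

const markup = svgOverlayMarkup(
  [
    { type: 'removal', bounds: { x: 102, y: 72, width: 32, height: 12 } },
    { type: 'addition', bounds: null },
  ],
  { width: 612, height: 792 }
);
console.log(markup);
```

Setting the viewBox to the page size in points lets the SVG be stretched over a rendered page of any pixel size without rescaling each rectangle.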

Browser Usage

For browser environments, import the standalone build:

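<script src="/path/to/pdfdiff.standalone.js"></script>
<script>
  // PDFDiff is available globally.
  // Top-level await is not allowed in a classic script, so wrap the call
  // in an async function. pdf1Data and pdf2Data hold the PDF bytes.
  (async () => {
    const result = await PDFDiff.comparePdfs(pdf1Data, pdf2Data, {
      mode: 'word',
      ignoreCase: true
    });
    console.log(result.summary);
  })();
</script>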
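
In a browser the PDF bytes usually come from a file input or a fetch response. The sketch below reads a File or Blob into a Uint8Array using the standard `Blob.arrayBuffer()` method. `toBytes` is a hypothetical helper, and passing the result to PDFDiff.comparePdfs assumes the standalone build accepts a Uint8Array, which the Browser Compatibility notes suggest but this README does not state outright.

```javascript
// Reads a File or Blob (for example from <input type="file">) into a Uint8Array.
async function toBytes(blob) {
  return new Uint8Array(await blob.arrayBuffer());
}

// Stand-in for a user-selected file: the first four bytes of any PDF, "%PDF".
const pdfHeader = new Blob([new Uint8Array([0x25, 0x50, 0x44, 0x46])]);
toBytes(pdfHeader).then((bytes) => console.log(bytes.length)); // 4
```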

Development

Scripts

# Type checking
npm run typecheck

# Run tests (placeholder)
npm test

Requirements

Node.js 16+ (ESM support)
Modern JavaScript environment

Dependencies

pdfjs-dist - PDF parsing and text extraction
diff - Text diffing algorithms (character, word, line)

Browser Compatibility

Modern browsers supporting ES2020+
PDF.js worker support for PDF processing
ArrayBuffer and Uint8Array support

License

ISC

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.
