@shubhu/pdfdiff
v1.6.0
PDFDiff
A platform/framework agnostic PDF diffing library with CLI support. Compare PDF files and get detailed text-based differences using modern JavaScript.
Features
- 🔍 Text-based PDF comparison using pdfjs-dist for accurate text extraction
- 📊 Multiple diff modes - character, word, and line-level comparison
- 🖥️ CLI interface with comprehensive options and colored output
- 📦 ESM support for modern JavaScript environments
- 🌐 Browser compatible with standalone builds for web applications
- 🔧 Platform agnostic - works across Node.js and browser environments
- 📝 JSDoc type annotations for better development experience
- ⚡ Fast comparison with detailed timing information
- 🎯 Flexible ignore options for whitespace and case differences
- 🎨 Visual diff support with positioned text overlay for PDF rendering
Installation
Global Installation (CLI usage)
npm install -g pdfdiff
Local Installation (Library usage)
npm install pdfdiff
CLI Usage
Basic Comparison
pdfdiff file1.pdf file2.pdf
Available Options
pdfdiff <pdf1> <pdf2> [options]
Arguments:
  pdf1                     First PDF file to compare
  pdf2                     Second PDF file to compare
Options:
  -i, --ignore-whitespace  Ignore whitespace differences
  -c, --ignore-case        Ignore case differences
  -u, --show-unchanged     Show unchanged lines in output
  -m, --mode <mode>        Diff mode: char (default), word, or line
  --context <number>       Number of context lines around changes (default: 3)
  --no-color               Disable color output
  -h, --help               Show help message
  -v, --version            Show version number
Examples
# Basic comparison (character-level by default)
pdfdiff document1.pdf document2.pdf
# Word-level comparison
pdfdiff file1.pdf file2.pdf --mode word
# Line-level comparison
pdfdiff file1.pdf file2.pdf --mode line
# Ignore whitespace differences
pdfdiff file1.pdf file2.pdf --ignore-whitespace
# Show unchanged lines with custom context
pdfdiff file1.pdf file2.pdf --show-unchanged --context 5
# Ignore case differences
pdfdiff file1.pdf file2.pdf --ignore-case
# Combined options
pdfdiff file1.pdf file2.pdf --mode word --ignore-case --no-color
Exit Codes
- 0 - Files are identical
- 1 - Files are different or error occurred
Diff Modes
PDFDiff supports three comparison modes, each with a different level of granularity:
Character Mode (Default)
- Most granular: Compares text character by character
- Best for: Detecting small changes, typos, and precise modifications
- Output: Shows exact character differences
- Example: "Hello World" vs "Hello Earth" shows individual character changes
Word Mode
- Moderate granularity: Compares text word by word
- Best for: Content changes, word replacements, and readability
- Output: Shows word-level additions and removals
- Example: "Hello World" vs "Hello Earth" shows "World" removed, "Earth" added
Line Mode
- Least granular: Compares text line by line
- Best for: Structural changes, paragraph modifications
- Output: Shows entire line differences
- Example: Full lines shown as added or removed
Performance Considerations
- Character mode: More detailed output, larger diffs for big changes
- Word mode: Balanced detail and readability
- Line mode: Fastest processing, most concise output for large documents
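To make the three granularities concrete, here is a minimal sketch of how text might be split into comparison units for each mode. The `splitUnits` function is a hypothetical illustration, not PDFDiff's internal code; the library delegates the actual diffing to its `diff` dependency, whose tokenization rules may differ.

```javascript
// Illustration only: a hypothetical tokenizer, not part of the pdfdiff API.
function splitUnits(text, mode = 'char') {
  switch (mode) {
    case 'char':
      // Array.from splits by code point, so surrogate pairs stay together
      return Array.from(text);
    case 'word':
      // The capture group keeps whitespace runs as their own units
      return text.split(/(\s+)/).filter(Boolean);
    case 'line':
      // Each unit keeps its trailing newline, if it has one
      return text.match(/[^\n]*\n|[^\n]+/g) ?? [];
    default:
      throw new Error(`Unknown mode: ${mode}`);
  }
}

const sample = 'Hello World\nHello Earth';
for (const mode of ['char', 'word', 'line']) {
  console.log(mode, splitUnits(sample, mode).length, 'units');
}
```

Fewer, larger units mean fewer comparisons and a shorter diff, which is why line mode tends to suit large documents.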
Library Usage
ESM Import
import { comparePdfs, extractPdfText, formatDiff } from 'pdfdiff';
Extract Text from PDF
import { readFile } from 'node:fs/promises';
import { extractPdfText } from 'pdfdiff';
// From file path
const text = await extractPdfText('./document.pdf');
console.log(text);
// From Buffer
const buffer = await readFile('./document.pdf');
const textFromBuffer = await extractPdfText(buffer);
Compare PDFs
import { comparePdfs } from 'pdfdiff';
// Basic comparison (character mode by default)
const result = await comparePdfs('./file1.pdf', './file2.pdf');
// With options
const wordResult = await comparePdfs('./file1.pdf', './file2.pdf', {
  mode: 'word', // 'char' (default), 'word', or 'line'
  ignoreWhitespace: false,
  ignoreCase: false
});
console.log(wordResult.summary);
console.log(wordResult.identical); // boolean
console.log(wordResult.changes); // array of diff changes
Format Diff Output
import { comparePdfs, formatDiff } from 'pdfdiff';
const diffResult = await comparePdfs('./file1.pdf', './file2.pdf');
const formatted = formatDiff(diffResult, {
  showUnchanged: true,
  context: 3
});
console.log(formatted);
Visual Diff with Positioned Text
import { extractPositionedPdfText, comparePdfs } from 'pdfdiff';
// Extract positioned text for visual overlays
const positions1 = await extractPositionedPdfText('./file1.pdf');
const positions2 = await extractPositionedPdfText('./file2.pdf');
// Get diff changes
const diffResult = await comparePdfs('./file1.pdf', './file2.pdf');
// Create visual overlays for PDF viewers.
// findTextInRange is not imported from pdfdiff: supply your own helper that
// returns the positioned items covering [offset, offset + length).
// side is 'old' for the first PDF and 'new' for the second.
function mapDiffToPositions(diffChanges, positions, side) {
  const overlays = [];
  let textOffset = 0;
  for (const change of diffChanges) {
    // Removed text exists only in the first PDF, added text only in the second,
    // so skip changes that are absent from this side without advancing the offset
    const presentOnSide = side === 'old' ? !change.added : !change.removed;
    if (!presentOnSide) continue;
    if (change.added || change.removed) {
      // Find text positions corresponding to this change
      const relevantItems = findTextInRange(positions, textOffset, change.value.length);
      overlays.push({
        type: change.added ? 'addition' : 'removal',
        text: change.value,
        positions: relevantItems
      });
    }
    textOffset += change.value.length;
  }
  return overlays;
}
const overlays1 = mapDiffToPositions(diffResult.changes, positions1, 'old');
const overlays2 = mapDiffToPositions(diffResult.changes, positions2, 'new');
API Reference
extractPdfText(pdfPath)
Extract text content from a PDF file.
Parameters:
- pdfPath (string|Buffer): Path to PDF file or Buffer containing PDF data
Returns: Promise<string> - Extracted text content
extractPositionedPdfText(pdfPath)
Extract text content with positioning information from a PDF file.
Parameters:
- pdfPath (string|Buffer): Path to PDF file or Buffer containing PDF data
Returns: Promise<Array<PageTextContent>> - Array of pages with positioned text items
comparePdfs(pdf1, pdf2, options?)
Compare two PDF files and return differences.
Parameters:
- pdf1 (string|Buffer): First PDF file path or Buffer
- pdf2 (string|Buffer): Second PDF file path or Buffer
- options (Object, optional):
  - mode (string): Diff mode - 'char' (default), 'word', or 'line'
  - ignoreWhitespace (boolean): Ignore whitespace differences (default: false)
  - ignoreCase (boolean): Ignore case differences (default: false)
Returns: Promise<DiffResult>
DiffResult:
{
  changes: Array<DiffChange>, // Array of diff changes
  summary: string, // Summary of changes
  identical: boolean // Whether PDFs are identical
}
formatDiff(diffResult, options?)
Format diff output for console display.
Parameters:
- diffResult (DiffResult): Result from comparePdfs
- options (Object, optional):
  - showUnchanged (boolean): Show unchanged lines (default: false)
  - context (number): Number of context lines around changes (default: 3)
Returns: string - Formatted diff output
Output Data Specification
DiffResult Object
The core output from comparePdfs() follows this structure:
{
  changes: Array<DiffChange>, // Array of individual changes
  summary: string, // Human-readable summary
  identical: boolean // Whether files are identical
}
DiffChange Object
Each change in the changes array represents a segment of text with its status:
{
  value: string, // The text content of this change
  added?: boolean, // true if this text was added (undefined for unchanged)
  removed?: boolean, // true if this text was removed (undefined for unchanged)
  count?: number // Number of units (chars/words/lines) in this change
}
Change Types
- Unchanged segments: { value: "text", count: 5 }
- Added segments: { value: "new text", added: true, count: 2 }
- Removed segments: { value: "old text", removed: true, count: 2 }
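Because every segment is either unchanged, added, or removed, both sides of the comparison can be rebuilt from the changes array. The sketch below assumes the DiffChange shape described above; `reconstructTexts` is a hypothetical helper name, not a pdfdiff export, and the rebuilt strings reflect the text as it was compared (after any whitespace or case normalization the library applies).

```javascript
// Hypothetical helper: rebuild the "before" and "after" text from DiffChange objects.
function reconstructTexts(changes) {
  let before = '';
  let after = '';
  for (const change of changes) {
    // Unchanged and removed segments were present in the first PDF
    if (!change.added) before += change.value;
    // Unchanged and added segments are present in the second PDF
    if (!change.removed) after += change.value;
  }
  return { before, after };
}

const { before, after } = reconstructTexts([
  { value: 'Hello ', count: 6 },
  { value: 'World', removed: true, count: 5 },
  { value: 'Earth', added: true, count: 5 },
]);
console.log(before); // Hello World
console.log(after); // Hello Earth
```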
Summary Format
The summary string format varies by diff mode:
- Character mode: "724 characters added, 775 characters removed"
- Word mode: "181 words added, 159 words removed"
- Line mode: "2 lines added, 2 lines removed"
- Identical files: "PDFs are identical" (all modes)
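A summary in these formats can be derived from the count fields of the changes array. The following is a sketch under that assumption, not the library's actual implementation; `summarizeChanges` and `UNIT_NAMES` are hypothetical names, and singular forms such as "1 word" are not handled because the source does not say how the library treats them.

```javascript
// Hypothetical mapping from diff mode to the unit name used in the summary.
const UNIT_NAMES = { char: 'characters', word: 'words', line: 'lines' };

// Hypothetical helper: tally added and removed units into a summary string.
function summarizeChanges(changes, mode = 'char') {
  let added = 0;
  let removed = 0;
  for (const change of changes) {
    const count = change.count ?? 0; // count is optional on DiffChange
    if (change.added) added += count;
    else if (change.removed) removed += count;
  }
  if (added === 0 && removed === 0) return 'PDFs are identical';
  const unit = UNIT_NAMES[mode];
  return `${added} ${unit} added, ${removed} ${unit} removed`;
}

console.log(summarizeChanges([
  { value: 'old text', removed: true, count: 2 },
  { value: 'new text', added: true, count: 2 },
], 'word'));
```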
Example Output
{
  changes: [
    { value: "Hello ", count: 6 }, // Unchanged
    { value: "World", removed: true, count: 5 }, // Removed
    { value: "Earth", added: true, count: 5 }, // Added
    { value: "!\nThis is a test.", count: 17 } // Unchanged
  ],
  summary: "5 characters added, 5 characters removed",
  identical: false
}
Visual Diff Output Specification
For visual diff applications (like overlaying differences on PDF renderings), the library provides positioned text data that can be used to create visual overlays.
Positioned Text Extraction
import { extractPositionedPdfText } from 'pdfdiff';
const positionedText = await extractPositionedPdfText('./document.pdf');
console.log(positionedText);
PositionedTextData Structure
[
  {
    page: 1, // Page number (1-based)
    items: [ // Array of positioned text items
      {
        text: "Hello World", // Text content
        x: 72, // X coordinate (points)
        y: 720, // Y coordinate (points, top-down)
        width: 85.2, // Text width (points)
        height: 12, // Text height (points)
        transform: [12, 0, 0, 12, 72, 720], // Full transformation matrix
        fontName: "Arial-Bold", // Font name (if available)
        page: 1 // Page reference
      }
    ],
    viewport: {
      width: 612, // Page width (points)
      height: 792 // Page height (points)
    }
  }
]
Visual Diff Overlay Usage
The positioned text data can be combined with diff results to create visual overlays:
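import { extractPositionedPdfText, comparePdfs } from 'pdfdiff';
// Extract positioned text from both PDFs
const positions1 = await extractPositionedPdfText('./file1.pdf');
const positions2 = await extractPositionedPdfText('./file2.pdf');
// Get text-based diff
const diffResult = await comparePdfs('./file1.pdf', './file2.pdf');
// Create visual overlay data by mapping diff changes to text positions.
// findTextItemsInRange and calculateBounds are not imported from pdfdiff:
// they are helpers you supply. side is 'old' for the first PDF, 'new' for the second.
function createVisualDiff(positions, diffChanges, side) {
  const overlays = [];
  let textOffset = 0;
  for (const change of diffChanges) {
    // Removed text exists only in the first PDF, added text only in the second,
    // so skip changes that are absent from this side without advancing the offset
    const presentOnSide = side === 'old' ? !change.added : !change.removed;
    if (!presentOnSide) continue;
    if (change.added || change.removed) {
      // Find corresponding positioned text items
      const matchingItems = findTextItemsInRange(positions, textOffset, change.value.length);
      overlays.push({
        type: change.added ? 'addition' : 'removal',
        items: matchingItems,
        bounds: calculateBounds(matchingItems)
      });
    }
    textOffset += change.value.length;
  }
  return overlays;
}
const visualDiff1 = createVisualDiff(positions1, diffResult.changes, 'old');
const visualDiff2 = createVisualDiff(positions2, diffResult.changes, 'new');
Coordinate System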
- Origin: Top-left corner of the page
- Units: Points (1/72 inch)
- Y-axis: Top-down (0 at top, increases downward)
- Standard page: 612x792 points (8.5" x 11" at 72 DPI)
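The createVisualDiff function above calls two helpers, findTextItemsInRange and calculateBounds, that are left undefined. One possible way to write them is sketched below. Both are hypothetical and rest on two loudly stated assumptions: diff offsets index into the item texts concatenated in order with no separators, and each item's (x, y) is its top-left corner in the top-down coordinate system just described. Real extracted text may insert spaces or newlines between items, in which case the offsets need adjusting.

```javascript
// Hypothetical helper: return items whose text overlaps [offset, offset + length).
// ASSUMPTION: offsets count characters across item.text values joined with no separators.
function findTextItemsInRange(pages, offset, length) {
  const end = offset + length;
  const matches = [];
  let cursor = 0; // Character offset at which the current item starts
  for (const page of pages) {
    for (const item of page.items) {
      const itemEnd = cursor + item.text.length;
      if (itemEnd > offset && cursor < end) matches.push(item);
      cursor = itemEnd;
    }
  }
  return matches;
}

// Hypothetical helper: smallest rectangle enclosing all items.
// ASSUMPTION: items are on the same page and (x, y) is each item's top-left corner.
function calculateBounds(items) {
  if (items.length === 0) return null;
  const left = Math.min(...items.map((item) => item.x));
  const top = Math.min(...items.map((item) => item.y));
  const right = Math.max(...items.map((item) => item.x + item.width));
  const bottom = Math.max(...items.map((item) => item.y + item.height));
  return { x: left, y: top, width: right - left, height: bottom - top };
}
```

A change that spans several pages would need one bounding box per page; grouping the matched items by their page field before calling calculateBounds is one way to handle that.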
Visual Overlay Applications
The positioned text data enables:
- SVG overlays: Create <rect> elements highlighting differences
- Canvas rendering: Draw colored rectangles over changed text areas
- HTML positioning: Absolutely position diff markers over PDF viewers
- Annotation layers: Add visual indicators for additions/removals
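For the HTML positioning case, overlay bounds in points have to be converted to CSS values that match the zoom level of the rendered page. The sketch below shows one way to do that; `boundsToStyle` and its `scale` parameter are hypothetical names, and it assumes the viewer renders one CSS pixel per point at scale 1 with the marker placed inside a container aligned to the page's top-left corner.

```javascript
// Hypothetical helper: convert bounds in PDF points to absolute-positioning CSS values.
// ASSUMPTION: at scale 1 the viewer draws one CSS pixel per point.
function boundsToStyle(bounds, scale = 1) {
  const px = (points) => `${points * scale}px`;
  return {
    position: 'absolute',
    left: px(bounds.x),
    top: px(bounds.y),
    width: px(bounds.width),
    height: px(bounds.height),
  };
}

// In a browser, the result can be applied to a marker element:
// Object.assign(marker.style, boundsToStyle(overlay.bounds, 1.5));
console.log(boundsToStyle({ x: 72, y: 100, width: 50, height: 12 }, 2));
```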
Example SVG Overlay
function createSVGOverlay(visualDiff) {
  const svg = document.createElementNS('http://www.w3.org/2000/svg', 'svg');
  visualDiff.forEach(overlay => {
    const rect = document.createElementNS('http://www.w3.org/2000/svg', 'rect');
    rect.setAttribute('x', overlay.bounds.x);
    rect.setAttribute('y', overlay.bounds.y);
    rect.setAttribute('width', overlay.bounds.width);
    rect.setAttribute('height', overlay.bounds.height);
    rect.setAttribute('fill', overlay.type === 'addition' ? 'rgba(0,255,0,0.3)' : 'rgba(255,0,0,0.3)');
    rect.setAttribute('stroke', overlay.type === 'addition' ? '#00aa00' : '#aa0000');
    svg.appendChild(rect);
  });
  return svg;
}
Browser Usage
For browser environments, import the standalone build:
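<script src="/path/to/pdfdiff.standalone.js"></script>
<script>
// PDFDiff is available globally.
// pdf1Data and pdf2Data stand for PDF data you have already loaded.
// Classic scripts do not allow top-level await, so the call is wrapped in an async function.
(async () => {
  const result = await PDFDiff.comparePdfs(pdf1Data, pdf2Data, {
    mode: 'word',
    ignoreCase: true
  });
  console.log(result.summary);
})();
</script>
Development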
Scripts
# Type checking
npm run typecheck
# Run tests (placeholder)
npm test
Requirements
- Node.js 16+ (ESM support)
- Modern JavaScript environment
Dependencies
- pdfjs-dist - PDF parsing and text extraction
- diff - Text diffing algorithms (character, word, line)
Browser Compatibility
- Modern browsers supporting ES2020+
- PDF.js worker support for PDF processing
- ArrayBuffer and Uint8Array support
License
ISC
Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
