pdf-diff-core
v1.0.2
A core library for semantic PDF comparison without UI dependencies | High-precision semantic PDF comparison engine
Part 1: English Documentation
A High-Precision Semantic PDF Comparison Engine (Headless).
pdf-diff-core is a lightweight, pure-logic library for comparing two PDF files. Unlike traditional pixel-based comparison, it extracts the text of each PDF and diffs that text, so it reports changes in content rather than changes in pixels.
It separates calculation from rendering, so it can be used from React, Vue, Angular, or Node.js applications.
✨ Features
- Headless & UI Agnostic: Pure logic. You control how to render the PDF and highlights.
- Pagination Reflow Support: Detects text that moves across a page boundary (e.g., from the bottom of Page 1 to the top of Page 2) and does not mark it as a Delete/Add pair.
- Semantic Diff: Based on Google's diff-match-patch algorithm.
- Precise Coordinates: Returns (x, y, w, h) bounding boxes for easy highlighting on Canvas.
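One way to picture reflow tolerance: if the text of all pages is joined into a single stream before diffing, a passage that merely moves to another page produces the same stream, and each character can still be mapped back to its page afterwards. The sketch below illustrates that idea only. `flattenPages` is a hypothetical helper, and this is an assumption about how such behavior could be achieved, not a description of the library's internals.

```javascript
// Hypothetical helper, not part of pdf-diff-core's documented API.
// Joins per-page text into one stream and records, for every character,
// the 0-based index of the page it came from.
function flattenPages(pages) {
  let text = "";
  const pageOf = []; // pageOf[i] is the page index of text[i]
  pages.forEach((pageText, pageIndex) => {
    for (let i = 0; i < pageText.length; i++) {
      pageOf.push(pageIndex);
    }
    text += pageText;
  });
  return { text, pageOf };
}

// Same content, different pagination: "world. " moved from page 0 to page 1.
const oldDoc = flattenPages(["Hello world. ", "Second paragraph."]);
const newDoc = flattenPages(["Hello ", "world. Second paragraph."]);

// The streams are identical, so a text diff over them sees no change,
// while pageOf still tells us where each character should be highlighted.
const sameText = oldDoc.text === newDoc.text;
const pageOfWOld = oldDoc.pageOf[6]; // the "w" of "world" in the old file
const pageOfWNew = newDoc.pageOf[6]; // the "w" of "world" in the new file
```

Running the diff on the joined stream and translating offsets back through `pageOf` keeps page-level highlighting possible without treating a page break as a content change.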
📦 Installation
npm install pdf-diff-core pdfjs-dist
Note: This library depends on pdfjs-dist for parsing.
🚀 Usage
1. Basic Setup
import { PdfDiff } from "pdf-diff-core";
import * as pdfjsLib from "pdfjs-dist";
// 1. Configure PDF.js Worker (Essential!)
// You can use a local file or a CDN URL.
// Note: pdfjs-dist 4.x and later ship the worker as pdf.worker.min.mjs,
// so adjust the file name to match your installed version.
const workerSrc = `https://unpkg.com/pdfjs-dist@${pdfjsLib.version}/build/pdf.worker.min.js`;
// 2. Initialize
const differ = new PdfDiff({
  workerSrc: workerSrc,
});
// 3. Load files as ArrayBuffer
const oldFileBuffer = await fetch("old.pdf").then((res) => res.arrayBuffer());
const newFileBuffer = await fetch("new.pdf").then((res) => res.arrayBuffer());
// 4. Compare
const results = await differ.compare(oldFileBuffer, newFileBuffer);
console.log(results);
2. Output Data Structure
The compare method returns an array of diff blocks:
[
  {
    pageIndex: 0, // Page number (0-based)
    type: "delete", // 'delete' (Red) or 'add' (Green)
    rects: [
      // Array of bounding boxes
      { x: 50.5, y: 100.2, w: 30.0, h: 12.0 },
      { x: 80.5, y: 100.2, w: 10.0, h: 12.0 },
    ],
  },
  // ... more results
];
3. Rendering Highlights (Frontend Example)
The library provides coordinates. You need to draw them on a Canvas overlaying the PDF.
⚠️ Important: Coordinates are returned at scale = 1.0 (standard PDF points). If you render your PDF at a different scale, for example scale = 1.5 for better quality, you must multiply every coordinate by that scale.
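When a viewer redraws pages often, it can help to index the diff blocks by page once and to keep the scaling in one place. The following is a minimal sketch built only on the documented result shape; `groupByPage` and `scaleRect` are hypothetical helper names, and the sample data is made up for illustration.

```javascript
// Hypothetical helpers, not part of pdf-diff-core's documented API.
// They rely only on the documented result shape:
// { pageIndex, type, rects: [{ x, y, w, h }] }

// Index diff blocks by page so a page render only touches its own blocks.
function groupByPage(results) {
  const byPage = new Map();
  for (const diff of results) {
    if (!byPage.has(diff.pageIndex)) {
      byPage.set(diff.pageIndex, []);
    }
    byPage.get(diff.pageIndex).push(diff);
  }
  return byPage;
}

// Convert a rect from PDF points (scale = 1.0) to canvas pixels.
function scaleRect(rect, renderScale) {
  return {
    x: rect.x * renderScale,
    y: rect.y * renderScale,
    w: rect.w * renderScale,
    h: rect.h * renderScale,
  };
}

// Made-up sample data in the documented shape.
const sample = [
  { pageIndex: 0, type: "delete", rects: [{ x: 10, y: 20, w: 30, h: 12 }] },
  { pageIndex: 1, type: "add", rects: [{ x: 40, y: 60, w: 8, h: 12 }] },
  { pageIndex: 0, type: "add", rects: [{ x: 50, y: 20, w: 10, h: 12 }] },
];

const byPage = groupByPage(sample);
const scaled = scaleRect(sample[0].rects[0], 1.5);
```

Here `byPage.get(0)` holds the two blocks for the first page, and `scaled` is the first rect expressed in pixels for a page rendered at scale 1.5.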
// Example: Drawing on a 2D Context
// ctx is the 2D context of the overlay canvas; currentPageIndex is the
// 0-based index of the page being drawn. Both come from your own viewer code.
const renderScale = 1.5; // The scale you used to render the PDF page
results.forEach((diff) => {
  if (diff.pageIndex === currentPageIndex) {
    // Set color: Red for Delete, Green for Add
    ctx.fillStyle =
      diff.type === "delete"
        ? "rgba(255, 65, 65, 0.3)"
        : "rgba(65, 255, 100, 0.3)";
    diff.rects.forEach((rect) => {
      // Multiply coordinates by your render scale
      ctx.fillRect(
        rect.x * renderScale,
        rect.y * renderScale,
        rect.w * renderScale,
        rect.h * renderScale,
      );
    });
  }
});
🔧 API
new PdfDiff(options)
- options.workerSrc (string): Path or URL to pdf.worker.min.js. If not provided, you must set pdfjsLib.GlobalWorkerOptions.workerSrc manually in your project.
compare(buf1, buf2)
- buf1 (ArrayBuffer): The "Old" (Base) PDF file.
- buf2 (ArrayBuffer): The "New" (Target) PDF file.
- Returns: Promise<Array<DiffResult>>
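Because compare takes ArrayBuffer inputs, data that arrives as a typed-array view needs converting first. In Node.js, for instance, fs.readFile yields a Buffer, which is a Uint8Array view that may sit inside a larger pooled ArrayBuffer, so handing over `view.buffer` directly can include unrelated bytes. The sketch below copies exactly the viewed range; `toArrayBuffer` is a hypothetical helper name, not something the library exports.

```javascript
// Hypothetical helper, not part of pdf-diff-core's documented API.
// Copies exactly the bytes covered by a Uint8Array view (a Node.js Buffer
// is one) into a standalone ArrayBuffer.
function toArrayBuffer(view) {
  return view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
}

// A view over the middle of a larger buffer, as can happen with pooled Buffers.
const backing = new Uint8Array([0, 1, 2, 3, 4, 5]);
const middle = backing.subarray(2, 5); // views bytes 2, 3, 4
const standalone = toArrayBuffer(middle);

// Usage in Node.js (assumes ES modules; the file names are placeholders):
//   import { readFile } from "node:fs/promises";
//   const oldFileBuffer = toArrayBuffer(await readFile("old.pdf"));
//   const newFileBuffer = toArrayBuffer(await readFile("new.pdf"));
// In the browser, a File from <input type="file"> offers file.arrayBuffer().
```

The copy costs one extra allocation per file, which is usually acceptable next to the cost of parsing the PDF itself.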
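Part 2: Chinese Documentation (Chinese Section)
High-Precision Semantic PDF Comparison Engine (Core Library)
pdf-diff-core is a lightweight, pure-logic PDF comparison library. Unlike traditional pixel-based comparison, it extracts the text semantics inside the PDF and compares those.
The library fully separates "calculation" from "rendering", so it is well suited to integration into Vue, React, Angular, or Node.js projects.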
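✨ Features
- UI Agnostic (Headless): A pure-logic library. You decide how to render the PDF and the highlight boxes.
- Pagination Reflow Support: Detects text that moves across pages (e.g., a passage moving from the bottom of Page 1 to the top of Page 2) and treats it as equal content instead of wrongly marking it as "delete + add".
- Semantic Diff: Based on Google's diff-match-patch algorithm.
- Precise Coordinates: Returns precise (x, y, w, h) coordinates for easy highlight drawing on Canvas.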
📦 Installation
npm install pdf-diff-core pdfjs-dist
Note: This library depends on pdfjs-dist for PDF parsing.
🚀 Usage
1. Basic Setup
import { PdfDiff } from "pdf-diff-core";
import * as pdfjsLib from "pdfjs-dist";
// 1. Configure PDF.js Worker (Required!)
// A CDN is recommended, or the path to a worker file in your local public directory
const workerSrc = `https://unpkg.com/pdfjs-dist@${pdfjsLib.version}/build/pdf.worker.min.js`;
// 2. Initialize
const differ = new PdfDiff({
  workerSrc: workerSrc,
});
// 3. Load files as ArrayBuffer
const oldFileBuffer = await fetch("old.pdf").then((res) => res.arrayBuffer());
const newFileBuffer = await fetch("new.pdf").then((res) => res.arrayBuffer());
// 4. Start the comparison
const results = await differ.compare(oldFileBuffer, newFileBuffer);
console.log(results);
2. Output Data Structure
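The compare method returns an array of diff blocks:
[
  {
    pageIndex: 0, // Page number (0-based)
    type: "delete", // 'delete' (deleted / old version, red) or 'add' (added / new version, green)
    rects: [
      // Array of rectangle coordinates
      { x: 50.5, y: 100.2, w: 30.0, h: 12.0 },
      { x: 80.5, y: 100.2, w: 10.0, h: 12.0 },
    ],
  },
  // ... more results
];
3. Frontend Highlight Rendering Example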
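The library only provides coordinate data. You need to create your own Canvas overlaying the PDF to draw the highlights.
⚠️ Key note: The returned coordinates are in standard PDF points (scale = 1.0). If you render the PDF enlarged for sharpness (e.g., scale = 1.5), you must multiply the coordinates by that scale when drawing highlights.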
// Example: Drawing on a Canvas
const renderScale = 1.5; // Assume your PDF Canvas is rendered at scale 1.5
results.forEach((diff) => {
  // Only draw the diffs for the current page
  if (diff.pageIndex === currentPageIndex) {
    // Set color: red for delete, green for add
    ctx.fillStyle =
      diff.type === "delete"
        ? "rgba(255, 65, 65, 0.3)"
        : "rgba(65, 255, 100, 0.3)";
    diff.rects.forEach((rect) => {
      // Key step: coordinates * render scale
      ctx.fillRect(
        rect.x * renderScale,
        rect.y * renderScale,
        rect.w * renderScale,
        rect.h * renderScale,
      );
    });
  }
});
🔧 API Reference
new PdfDiff(options)
- options.workerSrc (string): Path or URL to pdf.worker.min.js. If omitted, you need to set pdfjsLib.GlobalWorkerOptions.workerSrc manually elsewhere in your code.
compare(buf1, buf2)
- buf1 (ArrayBuffer): The old (base) PDF file data.
- buf2 (ArrayBuffer): The new (current) PDF file data.
- Returns: Promise<Array<DiffResult>>
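The README names the element type DiffResult without defining it formally. As a hedged illustration, the runtime check below encodes only the fields visible in the example output (pageIndex, type, rects with x, y, w, h); `isDiffResult` is a hypothetical helper, and the real type may carry additional fields.

```javascript
// Hypothetical runtime check, not part of pdf-diff-core's documented API.
// The field set is inferred from the README's example output only.
function isDiffResult(value) {
  if (typeof value !== "object" || value === null) return false;
  const { pageIndex, type, rects } = value;
  return (
    Number.isInteger(pageIndex) &&
    pageIndex >= 0 &&
    (type === "delete" || type === "add") &&
    Array.isArray(rects) &&
    rects.every(
      (r) =>
        typeof r === "object" &&
        r !== null &&
        ["x", "y", "w", "h"].every((key) => Number.isFinite(r[key])),
    )
  );
}

const wellFormed = isDiffResult({
  pageIndex: 0,
  type: "delete",
  rects: [{ x: 50.5, y: 100.2, w: 30.0, h: 12.0 }],
});
const unknownType = isDiffResult({ pageIndex: 0, type: "change", rects: [] });
const missingHeight = isDiffResult({
  pageIndex: 0,
  type: "add",
  rects: [{ x: 1, y: 2, w: 3 }],
});
```

A check like this can be handy in tests or when results cross a worker or network boundary before being drawn.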
