pdf-diff-core

v1.0.2

Published

4 days ago

A core library for semantic PDF comparison without UI dependencies | A high-precision semantic PDF comparison engine


zlp99

pdf diff compare pdfjs


Part 1: English Documentation

A High-Precision Semantic PDF Comparison Engine (Headless).

pdf-diff-core is a lightweight, pure-logic library for comparing two PDF files. Unlike traditional pixel-based comparison, it extracts text semantics to perform precise "content diffing".

It separates calculation from rendering, making it perfect for React, Vue, Angular, or Node.js applications.

✨ Features

Headless & UI Agnostic: Pure logic. You control how to render the PDF and highlights.
Pagination Reflow Support: Detects text that moves across a page boundary (e.g., from the bottom of page 1 to the top of page 2) and treats it as unchanged content instead of marking it as a delete plus an add.
Semantic Diff: Based on Google's diff-match-patch algorithm.
Precise Coordinates: Returns strict (x, y, w, h) bounding boxes for easy highlighting on Canvas.
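To make the reflow idea concrete, here is a minimal sketch of one general technique: flatten the text of all pages into a single string before diffing, while keeping a map from character offsets back to pages. This is an assumption about how such a feature can work in principle, not a description of pdf-diff-core's internals. The names `flattenPages` and `pageAt`, and the input shape (one array of strings per page), are hypothetical.

```javascript
// Hypothetical input shape: one array of text strings per page.
// Flatten all pages into one string and remember which page each item came from.
function flattenPages(pages) {
  const parts = [];
  const spans = []; // { start, end, pageIndex } for every text item
  let offset = 0;
  pages.forEach((items, pageIndex) => {
    for (const item of items) {
      if (parts.length > 0) offset += 1; // account for the joining space
      spans.push({ start: offset, end: offset + item.length, pageIndex });
      parts.push(item);
      offset += item.length;
    }
  });
  return { text: parts.join(" "), spans };
}

// Map a character offset in the flattened text back to its page (-1 if none).
function pageAt(spans, offset) {
  const span = spans.find((s) => offset >= s.start && offset < s.end);
  return span ? span.pageIndex : -1;
}

// "Tail" moves from page 0 to page 1, but the flattened text is identical,
// so a text diff over the flattened strings would report no change.
const before = flattenPages([["Intro", "Body", "Tail"], ["Appendix"]]);
const after = flattenPages([["Intro", "Body"], ["Tail", "Appendix"]]);
console.log(before.text === after.text);
```

Because the page boundary is not part of the compared text, moved content compares as equal, and the offset map is still available to place real changes on the correct page.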

📦 Installation

npm install pdf-diff-core pdfjs-dist

Note: This library depends on pdfjs-dist for parsing.

🚀 Usage

1. Basic Setup

import { PdfDiff } from "pdf-diff-core";
import * as pdfjsLib from "pdfjs-dist";

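// 1. Configure the PDF.js worker (essential!)
// You can use a local file or a CDN URL.
// Note: pdfjs-dist 4.x ships the worker as pdf.worker.min.mjs, so adjust the
// file name to match the pdfjs-dist version you have installed.
const workerSrc = `https://unpkg.com/pdfjs-dist@${pdfjsLib.version}/build/pdf.worker.min.js`;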

// 2. Initialize
const differ = new PdfDiff({
  workerSrc: workerSrc,
});

// 3. Load files as ArrayBuffer
const oldFileBuffer = await fetch("old.pdf").then((res) => res.arrayBuffer());
const newFileBuffer = await fetch("new.pdf").then((res) => res.arrayBuffer());

// 4. Compare
const results = await differ.compare(oldFileBuffer, newFileBuffer);

console.log(results);
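The setup above loads files with `fetch`, which suits the browser. Since the README also lists Node.js as a target, the following sketch shows one way to obtain an ArrayBuffer from a file on disk. The helper names `toArrayBuffer` and `readPdfAsArrayBuffer` are hypothetical, and the sketch only covers loading the bytes; it does not verify how pdf-diff-core or the PDF.js worker behaves under Node.js.

```javascript
// Copy exactly the bytes that a Buffer/Uint8Array views into a standalone ArrayBuffer.
// Node.js Buffers can be views into a larger shared pool, so `buf.buffer` alone
// may contain unrelated bytes before or after the file contents.
function toArrayBuffer(view) {
  return view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
}

// Hypothetical helper: read a PDF from disk as an ArrayBuffer.
async function readPdfAsArrayBuffer(path) {
  // A dynamic import works in both CommonJS and ES module files.
  const { readFile } = await import("node:fs/promises");
  return toArrayBuffer(await readFile(path));
}

// Usage (assumes `differ` was created as in the setup above):
// const results = await differ.compare(
//   await readPdfAsArrayBuffer("old.pdf"),
//   await readPdfAsArrayBuffer("new.pdf"),
// );
```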

2. Output Data Structure

The compare method returns an array of diff blocks:

[
  {
    pageIndex: 0, // Page number (0-based)
    type: "delete", // 'delete' (Red) or 'add' (Green)
    rects: [
      // Array of bounding boxes
      { x: 50.5, y: 100.2, w: 30.0, h: 12.0 },
      { x: 80.5, y: 100.2, w: 10.0, h: 12.0 },
    ],
  },
  // ... more results
];
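Since each diff block carries its own `pageIndex`, a viewer that renders one page at a time will usually want the results indexed by page. The sketch below assumes only the result shape shown above; `groupByPage` and `countByType` are hypothetical helper names, not part of the library.

```javascript
// Group diff blocks by page so each page can look up its own highlights.
function groupByPage(results) {
  const byPage = new Map();
  for (const diff of results) {
    if (!byPage.has(diff.pageIndex)) byPage.set(diff.pageIndex, []);
    byPage.get(diff.pageIndex).push(diff);
  }
  return byPage;
}

// Count how many blocks are additions and how many are deletions.
function countByType(results) {
  const counts = { add: 0, delete: 0 };
  for (const diff of results) {
    if (diff.type === "add" || diff.type === "delete") counts[diff.type] += 1;
  }
  return counts;
}

// Sample data in the documented shape (values are made up for illustration).
const sampleResults = [
  { pageIndex: 0, type: "delete", rects: [{ x: 50.5, y: 100.2, w: 30.0, h: 12.0 }] },
  { pageIndex: 0, type: "add", rects: [{ x: 50.5, y: 120.0, w: 42.0, h: 12.0 }] },
  { pageIndex: 2, type: "add", rects: [{ x: 72.0, y: 300.0, w: 18.0, h: 12.0 }] },
];
console.log(groupByPage(sampleResults).get(0).length, countByType(sampleResults));
```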

3. Rendering Highlights (Frontend Example)

The library provides coordinates. You need to draw them on a Canvas overlaying the PDF.

⚠️ Important: Coordinates are returned at scale = 1.0 (standard PDF points). If you render the PDF at a different scale, for example scale = 1.5 for better quality, you must multiply every coordinate by that same scale before drawing.

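// Example: drawing on a 2D context.
// `ctx` is the CanvasRenderingContext2D of your overlay canvas and
// `currentPageIndex` is the 0-based index of the page being drawn.
const renderScale = 1.5; // The scale you used to render the PDF page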

results.forEach((diff) => {
  if (diff.pageIndex === currentPageIndex) {
    // Set color: Red for Delete, Green for Add
    ctx.fillStyle =
      diff.type === "delete"
        ? "rgba(255, 65, 65, 0.3)"
        : "rgba(65, 255, 100, 0.3)";

    diff.rects.forEach((rect) => {
      // Multiply coordinates by your render scale
      ctx.fillRect(
        rect.x * renderScale,
        rect.y * renderScale,
        rect.w * renderScale,
        rect.h * renderScale,
      );
    });
  }
});
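The scaling step can also be pulled out into a pure function, which keeps the arithmetic in one place and makes it reusable for non-Canvas overlays such as absolutely positioned elements. This is a sketch under the same assumptions as the example above (coordinates at scale 1.0, used directly as top-left drawing coordinates); `scaleRect` and `rectsForPage` are hypothetical names.

```javascript
// Convert one rectangle from PDF points (scale 1.0) to rendered pixels.
function scaleRect(rect, scale) {
  return {
    x: rect.x * scale,
    y: rect.y * scale,
    w: rect.w * scale,
    h: rect.h * scale,
  };
}

// Collect every scaled rectangle for one page, tagged with its diff type.
function rectsForPage(results, pageIndex, scale) {
  return results
    .filter((diff) => diff.pageIndex === pageIndex)
    .flatMap((diff) =>
      diff.rects.map((rect) => ({ type: diff.type, ...scaleRect(rect, scale) })),
    );
}

// Usage with a 2D context (same colors as the example above):
// for (const r of rectsForPage(results, currentPageIndex, 1.5)) {
//   ctx.fillStyle = r.type === "delete" ? "rgba(255, 65, 65, 0.3)" : "rgba(65, 255, 100, 0.3)";
//   ctx.fillRect(r.x, r.y, r.w, r.h);
// }
```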

🔧 API

new PdfDiff(options)

options.workerSrc (string): Path or URL to pdf.worker.min.js. If not provided, you must set pdfjsLib.GlobalWorkerOptions.workerSrc manually in your project.

compare(buf1, buf2)

buf1 (ArrayBuffer): The "Old" (Base) PDF file.
buf2 (ArrayBuffer): The "New" (Target) PDF file.

Returns: Promise<Array<DiffResult>>
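The README does not spell out the `DiffResult` type, so the following JSDoc sketch infers it from the output example in the usage section. Treat the field list as an assumption rather than the library's published typings; `isRect` and `isDiffResult` are hypothetical runtime checks that may be useful when results cross a network or worker boundary.

```javascript
/**
 * Inferred from the README's output example; not the library's official typings.
 * @typedef {Object} Rect
 * @property {number} x Left edge in PDF points (scale 1.0)
 * @property {number} y Top edge in PDF points (scale 1.0)
 * @property {number} w Width in PDF points
 * @property {number} h Height in PDF points
 */

/**
 * @typedef {Object} DiffResult
 * @property {number} pageIndex 0-based page index
 * @property {"add"|"delete"} type Kind of change
 * @property {Rect[]} rects Bounding boxes to highlight
 */

// Check that a value has four finite numeric rectangle fields.
function isRect(value) {
  return (
    value !== null &&
    typeof value === "object" &&
    ["x", "y", "w", "h"].every((key) => Number.isFinite(value[key]))
  );
}

// Check that a value matches the inferred DiffResult shape.
function isDiffResult(value) {
  return (
    value !== null &&
    typeof value === "object" &&
    Number.isInteger(value.pageIndex) &&
    value.pageIndex >= 0 &&
    (value.type === "add" || value.type === "delete") &&
    Array.isArray(value.rects) &&
    value.rects.every(isRect)
  );
}
```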

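Part 2: Chinese Documentation

High-Precision Semantic PDF Comparison Engine (Core Library)

pdf-diff-core is a lightweight, pure-logic PDF comparison library. Unlike traditional pixel-based comparison, it extracts the text semantics inside the PDF and compares those.

The library fully separates "calculation" from "rendering", which makes it well suited to integration into Vue, React, Angular, or Node.js projects.

✨ Features

UI Agnostic (Headless): A pure-logic library. You decide how to render the PDF and the highlight boxes.
Pagination Reflow Support: Detects text that has moved across pages (for example, a paragraph that moved from the bottom of page 1 to the top of page 2) and treats it as equal content instead of wrongly marking it as "delete + add".
Semantic Diff: Based on Google's diff-match-patch algorithm.
Precise Coordinates: Returns precise (x, y, w, h) coordinates, making it easy to draw highlights on a Canvas.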

📦 Installation

npm install pdf-diff-core pdfjs-dist

Note: This library depends on pdfjs-dist for PDF parsing.

🚀 Usage

1. Basic Setup

import { PdfDiff } from "pdf-diff-core";
import * as pdfjsLib from "pdfjs-dist";

// 1. Configure the PDF.js worker (required!)
// Using a CDN is recommended, or the path to a worker file in your local public directory
const workerSrc = `https://unpkg.com/pdfjs-dist@${pdfjsLib.version}/build/pdf.worker.min.js`;

// 2. Initialize
const differ = new PdfDiff({
  workerSrc: workerSrc,
});

// 3. Load files as ArrayBuffer
const oldFileBuffer = await fetch("old.pdf").then((res) => res.arrayBuffer());
const newFileBuffer = await fetch("new.pdf").then((res) => res.arrayBuffer());

// 4. Start the comparison
const results = await differ.compare(oldFileBuffer, newFileBuffer);

console.log(results);

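2. Output Data Structure

The compare method returns an array of diff blocks:

[
  {
    pageIndex: 0, // Page number (0-based)
    type: "delete", // 'delete' (removed / old version, red) or 'add' (added / new version, green)
    rects: [
      // Array of rectangle coordinates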
      { x: 50.5, y: 100.2, w: 30.0, h: 12.0 },
      { x: 80.5, y: 100.2, w: 10.0, h: 12.0 },
    ],
  },
  // ... more results
];

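3. Frontend Highlight Rendering Example

The library only provides coordinate data. You need to create your own Canvas overlaying the PDF to draw the highlights.

⚠️ Key note: The returned coordinates are based on standard PDF points (scale = 1.0). If you render the PDF enlarged for clarity (for example, scale = 1.5), you must multiply the coordinates by that scale when drawing highlights.

// Example: drawing on a Canvas
const renderScale = 1.5; // Assume your PDF Canvas is rendered at a scale of 1.5

results.forEach((diff) => {
  // Only draw the diffs for the current page
  if (diff.pageIndex === currentPageIndex) {
    // Set color: red for delete, green for add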
    ctx.fillStyle =
      diff.type === "delete"
        ? "rgba(255, 65, 65, 0.3)"
        : "rgba(65, 255, 100, 0.3)";

    diff.rects.forEach((rect) => {
      // Key step: coordinate * render scale
      ctx.fillRect(
        rect.x * renderScale,
        rect.y * renderScale,
        rect.w * renderScale,
        rect.h * renderScale,
      );
    });
  }
});

🔧 API Reference

new PdfDiff(options)

options.workerSrc (string): Path or URL to pdf.worker.min.js. If omitted, you must set pdfjsLib.GlobalWorkerOptions.workerSrc manually elsewhere in your project.

compare(buf1, buf2)

buf1 (ArrayBuffer): The old (base) PDF file data.
buf2 (ArrayBuffer): The new (current) PDF file data.
Returns: Promise<Array<DiffResult>>
