@procommerz/doc-anonymizer

v0.1.2

Published

a month ago

Node.js document anonymizer library and CLI for OCR-based PDF/image redaction.

procommerz

document anonymizer ocr anonymizer redaction pdf redaction image redaction

Document Anonymizer Tool and Component

This is a Node.js component that uses local OCR and image editing tools to produce anonymized versions of PDF documents. It can be used as a standalone CLI tool or as a component within another Node.js application.

Inputs

Input is either a sequence of per-page images (local file paths or URLs) or a single PDF.

The user must also provide one or more target name variations to redact. The OCR step searches for exact matches first. If nothing is found, it searches more aggressively using string similarity comparison (such as the Levenshtein distance). It automatically accounts for potential line breaks when searching for matches.
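The exact-first, fuzzy-fallback search described above can be sketched roughly as follows. This is an illustration only, not the library's actual matcher: the function names, the whitespace normalization used to absorb line breaks, and the 20% distance threshold are all assumptions.

```javascript
// Hypothetical helper: collapse line breaks and repeated whitespace so a name
// split across two OCR lines compares equal to the single-line target.
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Dynamic-programming Levenshtein distance (insert, delete, substitute).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Returns exact matches if any exist; otherwise returns candidates whose edit
// distance to some target is within maxRatio of that target's length.
// maxRatio = 0.2 is an assumed value, not one taken from the library.
function findMatches(candidates, targetNames, maxRatio = 0.2) {
  const targets = targetNames.map(normalize);
  const exact = candidates.filter((c) => targets.includes(normalize(c)));
  if (exact.length > 0) return { mode: 'exact', matches: exact };

  const fuzzy = candidates.filter((c) => {
    const n = normalize(c);
    return targets.some((t) => levenshtein(n, t) <= Math.floor(t.length * maxRatio));
  });
  return { mode: 'fuzzy', matches: fuzzy };
}
```

Keeping the fuzzy pass behind the exact pass means a clean OCR result never triggers approximate matching, which limits accidental redaction of similar-looking words.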

It identifies bounding rectangles for the matched text so that it can then cover them with opaque black rectangles.
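As a rough sketch of the rectangle step, the word boxes that make up one matched name on a single line can be merged into one padded rectangle. The helper name, the `{ x0, y0, x1, y1 }` corner format, and the 2-pixel padding are assumptions for illustration; a name that wraps across lines would need one rectangle per line.

```javascript
// Hypothetical helper: merge the word boxes of one matched name (on one line)
// into a single rectangle. Each box is { x0, y0, x1, y1 } in image pixels,
// where (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner.
function redactionRect(wordBoxes, padding = 2) {
  const left = Math.min(...wordBoxes.map((b) => b.x0)) - padding;
  const top = Math.min(...wordBoxes.map((b) => b.y0)) - padding;
  const right = Math.max(...wordBoxes.map((b) => b.x1)) + padding;
  const bottom = Math.max(...wordBoxes.map((b) => b.y1)) + padding;

  // Clamp to the image origin so padding never produces negative coordinates.
  const x = Math.max(0, left);
  const y = Math.max(0, top);
  return { x, y, width: right - x, height: bottom - y };
}
```

The resulting `{ x, y, width, height }` values fit the standard canvas 2D call `fillRect(x, y, width, height)` with a black fill style.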

Output

The output is always a single PDF file, either generated at the target path or returned as a binary blob in the result of the library function call. Every page of the output PDF is a 150 dpi image in which the target names are redacted with black rectangles.
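To make the 150 dpi figure concrete: PDF page dimensions are expressed in points, with 72 points per inch, so rasterizing at 150 dpi corresponds to a scale factor of 150 / 72. The helper below is a hypothetical illustration of that arithmetic, not code from the library.

```javascript
// PDF user space uses 72 points per inch by default.
const PDF_POINTS_PER_INCH = 72;

// Hypothetical helper: pixel dimensions of a page rasterized at a given dpi.
function rasterSize(widthPt, heightPt, dpi = 150) {
  const scale = dpi / PDF_POINTS_PER_INCH;
  return {
    scale,
    widthPx: Math.round(widthPt * scale),
    heightPx: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points, which gives 1275 x 1650 pixels at 150 dpi.
const letter = rasterSize(612, 792);
console.log(letter.widthPx, letter.heightPx);
```

Because every output page is an image at this resolution, text in the redacted PDF is no longer selectable, and file size grows with page count.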

CLI Script

The library also provides a CLI script that can be run directly to generate an anonymized version of a document.

The script generates a file at the target path and prints JSON output with the run details.

Implementation Notes

The current implementation uses pdfjs-dist to rasterize PDF pages at 150dpi, tesseract.js for local OCR, @napi-rs/canvas for image redaction, and pdf-lib to assemble the output PDF.
OCR matching is exact-first and falls back to conservative fuzzy matching when exact OCR text is not found.
Page image inputs may be local file paths or HTTP/HTTPS URLs. PDF input is local-file based.
The library returns a PDF Buffer and can also write the generated file to disk.
OCR language files are read from ~/.tessdata by default, or from a caller-provided directory.

Install

npm install @procommerz/doc-anonymizer

Library Usage

import { anonymizeDocument } from '@procommerz/doc-anonymizer';

const result = await anonymizeDocument({
  inputPdfPath: './input.pdf',
  targetNames: ['John Smith', 'J. Smith'],
  outputPdfPath: './redacted.pdf',
  ocrDataDir: '~/.tessdata',
  ocrLanguage: 'eng',
  ocrLanguageHints: ['deu', 'fra'],
});

console.log(result.pageCount);
console.log(result.totalRedactions);

Image input mode:

import { anonymizeDocument } from '@procommerz/doc-anonymizer';

const result = await anonymizeDocument({
  inputPages: [
    './page-1.png',
    'https://example.com/page-2.jpg',
  ],
  targetNames: ['John Smith'],
});

console.log(result.pdfBuffer);

CLI Usage

PDF input:

node ./bin/doc-anonymizer.js \
  --input-pdf ./test_document.pdf \
  --document-language eng \
  --name "MUSTERMANN, ELENA" \
  --name "MUSTERMANN ELENA" \
  --name "ELENA MUSTERMANN" \
  --name "MUSTERMANN" \
  --name "P12345678X" \
  --output ./test_document_redacted.pdf

Page image input:

node ./bin/doc-anonymizer.js \
  --input-page ./page-1.png \
  --input-page https://example.com/page-2.jpg \
  --ocr-data-dir ~/.tessdata \
  --name "John Smith" \
  --output ./redacted.pdf

If --ocr-data-dir is omitted, the CLI automatically downloads the OCR data for the requested document language into ~/.tessdata and uses it.

Downloading OCR Data

To pre-download datasets for end users:

npx doc-anonymizer download-tessdata --language eng --data-dir ~/.tessdata

You can repeat --language for multiple OCR languages. If you need the direct script entrypoint during local development, you can also run:

node ./bin/download-tessdata.js --language eng --data-dir ~/.tessdata

On success, both CLI scripts print JSON to stdout. On failure, they print a JSON error object to stderr and exit non-zero.
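A caller that spawns either script can rely on this contract to separate results from errors. The sketch below assumes the caller has already collected the exit code and both output streams (for example through Node's child_process module); the helper name and the `details` property are hypothetical, and no particular JSON fields are assumed.

```javascript
// Hypothetical helper: turn the raw result of a CLI run into a parsed value,
// or throw an Error carrying the parsed error object.
// `run` is { exitCode, stdout, stderr } as collected by the spawning code.
function interpretCliRun(run) {
  if (run.exitCode === 0) {
    // Success: stdout carries the JSON run details.
    return JSON.parse(run.stdout);
  }

  // Failure: stderr carries a JSON error object. Fall back to the raw text
  // if it cannot be parsed, for example when the process crashed early.
  let details;
  try {
    details = JSON.parse(run.stderr);
  } catch {
    details = { raw: run.stderr };
  }
  const error = new Error(`CLI exited with code ${run.exitCode}`);
  error.details = details;
  throw error;
}
```

Keying on the exit code first, rather than on whether stderr is empty, avoids misreading warnings written to stderr during an otherwise successful run.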

OCR Language Hinting

Library callers can bias Tesseract toward additional languages by passing ocrLanguageHints to anonymizeDocument(). The primary OCR language stays in ocrLanguage, while hints are appended when creating the Tesseract worker.
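For illustration, Tesseract conventionally accepts several languages as a single string joined with `+` (for example `eng+deu`). How this library combines ocrLanguage and ocrLanguageHints internally is not documented here, so the helper below is only a guess at the shape of that step, with a hypothetical function name.

```javascript
// Hypothetical helper: build a Tesseract-style language string with the
// primary language first, followed by hints, with duplicates removed.
function buildLanguageSpec(primary, hints = []) {
  const ordered = [primary, ...hints].filter(Boolean);
  return [...new Set(ordered)].join('+');
}

console.log(buildLanguageSpec('eng', ['deu', 'fra'])); // prints "eng+deu+fra"
```

Each hinted language needs its data file available in the OCR data directory, so hints are best kept to languages that actually appear in the documents.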
