@procommerz/doc-anonymizer
v0.1.2
Node.js document anonymizer library and CLI for OCR-based PDF/image redaction.
Document Anonymizer Tool and Component
This is a Node.js component that uses local OCR and image-editing tools to produce anonymized versions of PDF documents. It can be used as a standalone CLI tool or as a component within another Node.js application.
Inputs
The input can be either a sequence of per-page images (local file paths or URLs) or a single PDF.
The user must also provide one or more variations of the target name to redact. The OCR step searches for exact matches first. If nothing is found, it searches more aggressively using string-similarity comparison (such as Levenshtein distance). Potential line breaks are accounted for automatically when searching for matches.
For every match, the tool identifies the bounding rectangle of the matched text and covers it with an opaque black rectangle.
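The exact-first, fuzzy-fallback search described above can be sketched roughly as follows. This is an illustration only, not the library's actual matcher: the helper names (`levenshtein`, `normalize`, `findMatches`), the `{ text }` word shape, and the 0.85 similarity threshold are all assumptions made for the example.

```javascript
// Hypothetical helpers; the library's real matcher may work differently.

// Dynamic-programming Levenshtein distance, keeping a single row in memory.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // distance for (i - 1, j - 1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // distance for (i - 1, j)
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Lower-case and drop everything except letters and digits.
function normalize(text) {
  return text.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, "");
}

// words: OCR words in reading order, e.g. [{ text: "John" }, { text: "Smith" }].
// The words form one flat sequence, so a name split across a line break
// still appears as adjacent words.
function findMatches(words, targetName, minSimilarity = 0.85) {
  const target = normalize(targetName);
  if (target === "") return [];
  const size = targetName.trim().split(/\s+/).length; // window size in words
  const exact = [];
  const fuzzy = [];
  for (let start = 0; start + size <= words.length; start++) {
    const slice = words.slice(start, start + size);
    const candidate = normalize(slice.map((w) => w.text).join(" "));
    if (candidate === target) {
      exact.push({ start, end: start + size, similarity: 1 });
      continue;
    }
    const longest = Math.max(candidate.length, target.length);
    const similarity = 1 - levenshtein(candidate, target) / longest;
    if (similarity >= minSimilarity) {
      fuzzy.push({ start, end: start + size, similarity });
    }
  }
  // Exact-first: fuzzy candidates are used only when nothing matched exactly.
  return exact.length > 0 ? exact : fuzzy;
}
```

Each result holds the start and end word indices of a match, so a caller could look up the matching words and their positions on the page. A stricter or looser threshold trades missed names against false redactions.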
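A minimal sketch of the rectangle step, assuming each matched word comes with a pixel bounding box of the form `{ x0, y0, x1, y1 }` and that the drawing context follows the standard Canvas 2D API. The function names, the two-pixel padding, and the rule for detecting a line break (no vertical overlap with the previous box) are illustrative assumptions, not the library's documented behavior.

```javascript
// Merge the word boxes of one match into rectangles, one per text line.
// A box that does not overlap the current rectangle vertically is treated
// as the start of a new line (assumption for this sketch).
function toRedactionRects(wordBoxes, padding = 2) {
  const merged = [];
  let current = null;
  for (const box of wordBoxes) {
    const sameLine =
      current !== null && box.y0 < current.y1 && box.y1 > current.y0;
    if (sameLine) {
      current.x0 = Math.min(current.x0, box.x0);
      current.y0 = Math.min(current.y0, box.y0);
      current.x1 = Math.max(current.x1, box.x1);
      current.y1 = Math.max(current.y1, box.y1);
    } else {
      current = { ...box };
      merged.push(current);
    }
  }
  // Convert to x/y/width/height and grow each rectangle by the padding.
  return merged.map((r) => ({
    x: r.x0 - padding,
    y: r.y0 - padding,
    width: r.x1 - r.x0 + 2 * padding,
    height: r.y1 - r.y0 + 2 * padding,
  }));
}

// Paint opaque black rectangles on any Canvas 2D compatible context.
function drawRedactions(ctx, rects) {
  ctx.fillStyle = "#000000";
  for (const r of rects) {
    ctx.fillRect(r.x, r.y, r.width, r.height);
  }
}
```

Splitting by line keeps a name that wraps onto the next line from producing one oversized rectangle that hides unrelated text.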
Output
The output is always a single PDF, either written to the target path or returned as a binary blob in the result of the library call. Every page of the output PDF is a 150 dpi image in which the target name is covered by black rectangles.
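To make the 150 dpi figure concrete: PDF page sizes are expressed in points, with 72 points per inch, so rasterizing at 150 dpi means scaling the page by 150 / 72. The helper below is a hypothetical illustration of that arithmetic and is not part of the library.

```javascript
const PDF_POINTS_PER_INCH = 72;

// Scale factor and pixel size of a page rasterized at the given dpi.
// rasterSize is a made-up name used only for this example.
function rasterSize(widthPt, heightPt, dpi = 150) {
  const scale = dpi / PDF_POINTS_PER_INCH;
  return {
    scale,
    widthPx: Math.round(widthPt * scale),
    heightPx: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points.
const letter = rasterSize(612, 792);
console.log(letter.widthPx, letter.heightPx); // 1275 1650
```

An A4 page (about 595.28 x 841.89 points) comes out at 1240 x 1754 pixels with the same calculation.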
CLI Script
The library also provides a CLI script that can be run directly to generate an anonymized version of a document.
The script generates a file at the target path and prints JSON output with the run details.
Implementation Notes
- The current implementation uses pdfjs-dist to rasterize PDF pages at 150dpi, tesseract.js for local OCR, @napi-rs/canvas for image redaction, and pdf-lib to assemble the output PDF.
- OCR matching is exact-first and falls back to conservative fuzzy matching when exact OCR text is not found.
- Page image inputs may be local file paths or HTTP/HTTPS URLs. PDF input is local-file based.
- The library returns a PDF Buffer and can also write the generated file to disk.
- OCR language files are read from ~/.tessdata by default, or from a caller-provided directory.
Install
npm install
Library Usage
import { anonymizeDocument } from 'doc-anonymizer';
const result = await anonymizeDocument({
  inputPdfPath: './input.pdf',
  targetNames: ['John Smith', 'J. Smith'],
  outputPdfPath: './redacted.pdf',
  ocrDataDir: '~/.tessdata',
  ocrLanguage: 'eng',
  ocrLanguageHints: ['deu', 'fra'],
});
console.log(result.pageCount);
console.log(result.totalRedactions);
Image input mode:
import { anonymizeDocument } from 'doc-anonymizer';
const result = await anonymizeDocument({
  inputPages: [
    './page-1.png',
    'https://example.com/page-2.jpg',
  ],
  targetNames: ['John Smith'],
});
console.log(result.pdfBuffer);
CLI Usage
PDF input:
node ./bin/doc-anonymizer.js \
  --input-pdf ./test_document.pdf \
  --document-language eng \
  --name "MUSTERMANN, ELENA" \
  --name "MUSTERMANN ELENA" \
  --name "ELENA MUSTERMANN" \
  --name "MUSTERMANN" \
  --name "P12345678X" \
  --output ./test_document_redacted.pdf
Page image input:
node ./bin/doc-anonymizer.js \
  --input-page ./page-1.png \
  --input-page https://example.com/page-2.jpg \
  --ocr-data-dir ~/.tessdata \
  --name "John Smith" \
  --output ./redacted.pdf
If --ocr-data-dir is omitted, the CLI automatically downloads the requested document language into ~/.tessdata and uses it.
Downloading OCR Data
To pre-download datasets for end users:
npx doc-anonymizer download-tessdata --language eng --data-dir ~/.tessdata
You can repeat --language for multiple OCR languages. If you need the direct script entrypoint during local development, you can also run:
node ./bin/download-tessdata.js --language eng --data-dir ~/.tessdata
On success, both CLI scripts print JSON to stdout. On failure, they print a JSON error object to stderr and exit non-zero.
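A caller that spawns either script from another program could interpret the outcome along these lines. This is a sketch under assumptions: `interpretCliResult` is a hypothetical helper, the caller is expected to have already collected the exit code, stdout, and stderr (for example with Node's child_process module), and the JSON field names used in the example call are made up because the README does not list them.

```javascript
// Turn a finished CLI run into a success or failure object.
// Only the contract stated above is relied on: JSON on stdout for success,
// a JSON error object on stderr plus a non-zero exit code for failure.
function interpretCliResult({ exitCode, stdout, stderr }) {
  if (exitCode === 0) {
    return { ok: true, details: JSON.parse(stdout) };
  }
  let error;
  try {
    error = JSON.parse(stderr);
  } catch {
    // Fall back to raw text if stderr is not valid JSON (e.g. a crash trace).
    error = { message: stderr.trim() };
  }
  return { ok: false, exitCode, error };
}

// Example with a made-up payload; the real field names may differ.
const outcome = interpretCliResult({
  exitCode: 0,
  stdout: '{"exampleField": 3}',
  stderr: "",
});
console.log(outcome.ok, outcome.details.exampleField); // true 3
```

Keeping the parsing separate from the process handling makes the success and failure paths easy to test without running the actual CLI.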
OCR Language Hinting
Library callers can bias Tesseract toward additional languages by passing ocrLanguageHints to anonymizeDocument().
The primary OCR language stays in ocrLanguage, while hints are appended when creating the Tesseract worker.
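One plausible way to combine the primary language with the hints is to build a single "+"-joined language string, which is the conventional Tesseract format for multiple languages. The helper below is a hypothetical sketch of that step; how the library actually appends the hints is not documented here.

```javascript
// Combine the primary language and the hints into one Tesseract language
// string. The primary language stays first and duplicates are dropped.
// buildOcrLanguages is a made-up name used only for this example.
function buildOcrLanguages(ocrLanguage, ocrLanguageHints = []) {
  const ordered = [ocrLanguage, ...ocrLanguageHints]
    .map((code) => code.trim())
    .filter(Boolean);
  return [...new Set(ordered)].join("+");
}

console.log(buildOcrLanguages("eng", ["deu", "fra"])); // eng+deu+fra
```

Each language in the resulting string needs its trained data file available in the OCR data directory.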
