@procommerz/doc-anonymizer

v0.1.2

Node.js document anonymizer library and CLI for OCR-based PDF/image redaction.

Downloads

16

Readme

Document Anonymizer Tool and Component

This is a Node.js component that uses local OCR and image editing tools to produce anonymized versions of PDF documents. It can be used as a standalone CLI tool or as a component within another Node.js application.

Inputs

Input is either a sequence of per-page images, given as local file paths or URLs, or a single PDF.

The user must also provide one or more variations of the target name to redact. The OCR step searches for exact matches first. If nothing is found, it searches more aggressively using string similarity comparison (such as the Levenshtein distance). It automatically accounts for potential line breaks when searching for matches.
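
The exact-first, fuzzy-fallback search can be sketched as follows. This is a minimal illustration, not the library's actual code: the function names, the flat `words` array (OCR words in reading order, so a name split across a line break is still adjacent), and the `0.85` similarity threshold are all assumptions.

```javascript
// Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // value at (i - 1, j - 1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value at (i - 1, j)
      row[j] = Math.min(
        above + 1, // deletion
        row[j - 1] + 1, // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      diag = above;
    }
  }
  return row[b.length];
}

// Similarity in [0, 1]; 1 means identical strings.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Slide a window of the target's word count over the OCR words.
// Exact hits win; fuzzy hits are only considered when there are none.
function findMatches(words, target, minSimilarity = 0.85) {
  const targetWords = target.trim().split(/\s+/);
  const wanted = targetWords.join(' ');
  const windows = [];
  for (let start = 0; start + targetWords.length <= words.length; start++) {
    const end = start + targetWords.length;
    windows.push({ start, end, text: words.slice(start, end).join(' ') });
  }
  const exact = windows.filter((w) => w.text === wanted);
  if (exact.length > 0) {
    return exact.map((w) => ({ ...w, mode: 'exact' }));
  }
  return windows
    .filter((w) => similarity(w.text, wanted) >= minSimilarity)
    .map((w) => ({ ...w, mode: 'fuzzy' }));
}
```

A higher threshold keeps the fallback conservative: `J0hn Smith` (one misread character) still matches `John Smith`, while unrelated word pairs do not.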

It identifies the bounding rectangles of the matched text so that it can then cover them with opaque black rectangles.
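
One way to turn matched words into redaction rectangles is sketched below. It assumes each OCR word carries a `bbox` with `x0`, `y0`, `x1`, `y1` pixel coordinates (the shape tesseract.js reports) and a hypothetical `line` index; the function name and the `padding` parameter are illustrative. Grouping by line means a name that wraps across a line break yields one rectangle per line rather than one oversized box.

```javascript
// Merge the bounding boxes of matched words into one rectangle per line.
function redactionRects(matchedWords, padding = 2) {
  const byLine = new Map();
  for (const word of matchedWords) {
    const { x0, y0, x1, y1 } = word.bbox;
    const rect = byLine.get(word.line);
    if (!rect) {
      byLine.set(word.line, { x0, y0, x1, y1 });
    } else {
      rect.x0 = Math.min(rect.x0, x0);
      rect.y0 = Math.min(rect.y0, y0);
      rect.x1 = Math.max(rect.x1, x1);
      rect.y1 = Math.max(rect.y1, y1);
    }
  }
  // Convert to x/y/width/height, the form a Canvas 2D fillRect call expects.
  return [...byLine.values()].map((r) => ({
    x: r.x0 - padding,
    y: r.y0 - padding,
    width: r.x1 - r.x0 + 2 * padding,
    height: r.y1 - r.y0 + 2 * padding,
  }));
}
```

Each rectangle can then be painted on a 2D canvas context with `ctx.fillStyle = 'black'` followed by `ctx.fillRect(x, y, width, height)`.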

Output

The output is always a single PDF, either written to the target path or returned as a binary blob in the result of the library function call. Every page of the output PDF is a 150 dpi image in which the target names are redacted with black rectangles.
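
For a sense of what 150 dpi means in pixels: PDF page sizes are expressed in points, where one point is 1/72 inch, so the render scale is 150 / 72 (about 2.08). The helper below is a hypothetical illustration of that arithmetic; whether the library rounds or truncates pixel sizes is not stated here.

```javascript
const POINTS_PER_INCH = 72;

// Convert a page size in PDF points to pixel dimensions at a given dpi.
function pagePixelSize(widthPt, heightPt, dpi = 150) {
  return {
    scale: dpi / POINTS_PER_INCH,
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}

// US Letter is 612 x 792 points.
const letter = pagePixelSize(612, 792);
console.log(letter.width, letter.height); // 1275 1650
```

With pdfjs-dist, a factor like `scale` is what would typically be passed to `page.getViewport({ scale })` before rendering.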

CLI Script

The library also provides a CLI script that can be run directly to generate an anonymized version of a document.

The script generates a file at the target path and prints JSON output with run details.

Implementation Notes

  • The current implementation uses pdfjs-dist to rasterize PDF pages at 150dpi, tesseract.js for local OCR, @napi-rs/canvas for image redaction, and pdf-lib to assemble the output PDF.
  • OCR matching is exact-first and falls back to conservative fuzzy matching when exact OCR text is not found.
  • Page image inputs may be local file paths or HTTP/HTTPS URLs. PDF input is local-file based.
  • The library returns a PDF Buffer and can also write the generated file to disk.
  • OCR language files are read from ~/.tessdata by default, or from a caller-provided directory.

Install

npm install

Library Usage

import { anonymizeDocument } from 'doc-anonymizer';

const result = await anonymizeDocument({
  inputPdfPath: './input.pdf',
  targetNames: ['John Smith', 'J. Smith'],
  outputPdfPath: './redacted.pdf',
  ocrDataDir: '~/.tessdata',
  ocrLanguage: 'eng',
  ocrLanguageHints: ['deu', 'fra'],
});

console.log(result.pageCount);
console.log(result.totalRedactions);

Image input mode:

import { anonymizeDocument } from 'doc-anonymizer';

const result = await anonymizeDocument({
  inputPages: [
    './page-1.png',
    'https://example.com/page-2.jpg',
  ],
  targetNames: ['John Smith'],
});

console.log(result.pdfBuffer);

CLI Usage

PDF input:

node ./bin/doc-anonymizer.js \
  --input-pdf ./test_document.pdf \
  --document-language eng \
  --name "MUSTERMANN, ELENA" \
  --name "MUSTERMANN ELENA" \
  --name "ELENA MUSTERMANN" \
  --name "MUSTERMANN" \
  --name "P12345678X" \
  --output ./test_document_redacted.pdf
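
The command above passes several orderings of the same name, since a document may print it as "LAST, FIRST", "LAST FIRST", or "FIRST LAST". When the CLI is driven from a script, a small helper can generate those variations and the matching `--name` arguments. Both functions below are hypothetical conveniences and are not part of the package.

```javascript
// Common orderings of a first/last name pair, plus the last name alone.
function nameVariations(firstName, lastName) {
  const first = firstName.trim();
  const last = lastName.trim();
  return [`${last}, ${first}`, `${last} ${first}`, `${first} ${last}`, last];
}

// Expand a list of names into repeated --name arguments for a spawn call.
function toNameArgs(names) {
  return names.flatMap((name) => ['--name', name]);
}

console.log(toNameArgs(nameVariations('ELENA', 'MUSTERMANN')));
```

Identifiers such as document numbers do not have orderings and can simply be appended to the list before calling `toNameArgs`.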

Page image input:

node ./bin/doc-anonymizer.js \
  --input-page ./page-1.png \
  --input-page https://example.com/page-2.jpg \
  --ocr-data-dir ~/.tessdata \
  --name "John Smith" \
  --output ./redacted.pdf

If --ocr-data-dir is omitted, the CLI downloads the requested document language into ~/.tessdata automatically and uses it.

Downloading OCR Data

To pre-download datasets for end users:

npx doc-anonymizer download-tessdata --language eng --data-dir ~/.tessdata

You can repeat --language for multiple OCR languages. If you need the direct script entrypoint during local development, you can also run:

node ./bin/download-tessdata.js --language eng --data-dir ~/.tessdata

On success, both CLI scripts print JSON to stdout. On failure, they print a JSON error object to stderr and exit non-zero.
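
A wrapper that runs either script can rely on this contract. The sketch below takes an object shaped like the result of Node's `child_process.spawnSync` (with `status`, `stdout`, and `stderr` as strings) and returns the parsed success payload or throws with the parsed error attached. The function name is hypothetical, and no particular field names inside the JSON are assumed.

```javascript
// Interpret the outcome of a doc-anonymizer CLI run.
function interpretCliResult({ status, stdout, stderr }) {
  if (status === 0) {
    return JSON.parse(stdout);
  }
  let details;
  try {
    details = JSON.parse(stderr);
  } catch {
    // stderr was not valid JSON (for example, the process crashed early).
    details = { raw: String(stderr).trim() };
  }
  const error = new Error(`CLI exited with status ${status}`);
  error.details = details;
  throw error;
}
```

The input would typically come from something like `spawnSync('node', ['./bin/doc-anonymizer.js', ...args], { encoding: 'utf8' })`.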

OCR Language Hinting

Library callers can bias Tesseract toward additional languages by passing ocrLanguageHints to anonymizeDocument(). The primary OCR language stays in ocrLanguage, while hints are appended when creating the Tesseract worker.
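
Tesseract conventionally takes multiple languages as a single `+`-joined string, such as `eng+deu+fra`, with the primary language first. A minimal sketch of how the primary language and hints might be combined is shown below; the helper name and the de-duplication step are assumptions, and the library may pass the languages in a different form.

```javascript
// Build a Tesseract language string: primary language first, then hints.
function buildOcrLanguages(ocrLanguage = 'eng', ocrLanguageHints = []) {
  const ordered = [ocrLanguage, ...ocrLanguageHints];
  // A Set preserves insertion order, so the primary language stays first.
  return [...new Set(ordered)].join('+');
}

console.log(buildOcrLanguages('eng', ['deu', 'fra'])); // eng+deu+fra
```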