@nabs23/pdf-ocr-cli
v1.0.4
A high-performance, parallelized PDF OCR tool using Tesseract.js WASM
pdf-ocr-cli
pdf-ocr-cli is a lightweight CLI utility that turns static scanned PDFs into searchable digital documents.
Unlike traditional OCR tools that require complex system-level installations of Tesseract and its language data, this tool leverages Tesseract.js (WebAssembly) to run the OCR engine directly within the Node.js runtime. It features a parallelized processing pipeline that scales across multiple CPU cores, making it efficient for large documents (100+ pages).
Key Features
- Searchable PDF Generation: Automatically merges OCR text layers back into a high-quality PDF.
- Parallel Execution: Configurable worker pools to maximize hardware utilization.
- Zero-Config OCR: No need for tesseract-ocr system binaries or manual language data management (WASM-based).
- Dual Output: Generates both a .txt transcription and a searchable _OCRed.pdf simultaneously.
- Developer Friendly: Built with Node.js, providing a clean CLI interface with real-time progress tracking.
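One way to picture the Parallel Execution feature is a fixed-size pool that pulls page jobs from a shared queue. The sketch below is not taken from the tool's source: `runPool` and the `ocrPage` callback are hypothetical names, and the callback stands in for whatever per-page recognition call the tool actually makes.

```javascript
// Minimal sketch of a fixed-size pool (hypothetical helper, not the tool's code).
// `pages` is an array of jobs; `ocrPage` is an async function supplied by the caller.
async function runPool(pages, workerCount, ocrPage) {
  const results = new Array(pages.length);
  let next = 0; // index of the next unclaimed page

  // Each lane claims one page at a time until the queue is empty.
  async function lane() {
    while (next < pages.length) {
      const index = next++;
      results[index] = await ocrPage(pages[index], index);
    }
  }

  // Never start more lanes than there are pages, and always start at least one.
  const laneCount = Math.max(1, Math.min(workerCount, pages.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results; // same order as `pages`, regardless of completion order
}
```

Calling `runPool(pageImages, 4, ocrPage)` would keep at most four pages in flight at once, which matches the documented default of four workers. Results are stored by index, so the text comes back in page order even when pages finish out of order.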
Prerequisites
This tool requires the following system utilities to be present:
- poppler-utils (for pdftoppm)
- ghostscript (for gs)
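To confirm both utilities are reachable before running a long job, a small pre-flight check along these lines may help. It is a sketch under simple assumptions: `missingCommands` is a hypothetical helper, it only looks for a file named `pdftoppm` or `gs` in each PATH directory, and it does not verify execute permissions or handle Windows `.exe` suffixes.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Returns the entries of `commands` that are not found in any directory of `pathEnv`.
// Sketch only: checks that a file exists, not that it is executable.
function missingCommands(commands, pathEnv = process.env.PATH || "") {
  const dirs = pathEnv.split(path.delimiter).filter(Boolean);
  return commands.filter(
    (cmd) => !dirs.some((dir) => fs.existsSync(path.join(dir, cmd)))
  );
}

// The two commands named in the prerequisites list.
const missing = missingCommands(["pdftoppm", "gs"]);
if (missing.length > 0) {
  console.error(`Missing required tools: ${missing.join(", ")}`);
}
```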
Installation
npm install -g @nabs23/pdf-ocr-cli
Usage
pdf-ocr input.pdf
Options
- -o, --output <prefix>: Custom prefix for output files (defaults to input filename).
- -w, --workers <number>: Number of parallel workers (default: 4).
- -k, --keep: Keep temporary image files after completion.
- -h, --help: Show help.
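To make the interaction between these flags and the two output files concrete, here is a sketch that parses the same options with Node's built-in `util.parseArgs`. It is not the tool's actual parser. `parseCli` is a hypothetical name, and two details are assumptions: that the default prefix is the input path minus its extension, and that the outputs are that prefix plus `.txt` and `_OCRed.pdf`.

```javascript
const { parseArgs } = require("node:util");
const path = require("node:path");

// Sketch of parsing the documented flags (hypothetical helper, not the tool's code).
function parseCli(argv) {
  const { values, positionals } = parseArgs({
    args: argv,
    allowPositionals: true,
    options: {
      output: { type: "string", short: "o" },
      workers: { type: "string", short: "w" },
      keep: { type: "boolean", short: "k" },
      help: { type: "boolean", short: "h" },
    },
  });

  const input = positionals[0];

  // Assumption: the default prefix is the input path without its extension.
  let prefix = values.output;
  if (prefix === undefined && input !== undefined) {
    const { dir, name } = path.parse(input);
    prefix = path.join(dir, name);
  }

  return {
    input,
    workers: Number.parseInt(values.workers ?? "4", 10), // documented default: 4
    keep: values.keep ?? false,
    help: values.help ?? false,
    // Assumption: output names are the prefix plus the documented suffixes.
    textFile: prefix === undefined ? undefined : `${prefix}.txt`,
    pdfFile: prefix === undefined ? undefined : `${prefix}_OCRed.pdf`,
  };
}
```

Under these assumptions, `pdf-ocr scan.pdf -o report -w 8 -k` would run eight workers, keep the temporary images, and write `report.txt` and `report_OCRed.pdf`.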
License
This project is licensed under the ISC License - see the LICENSE file for details.
