docparser-ocr
v1.5.0
Optional OCR runtime package for docparser-core — adaptive OCR pipelines, local Tesseract engines, and image preprocessing
docparser-ocr
Optional OCR runtime package for docparser-core.
Published package name on npm: docparser-ocr
npm install docparser-core docparser-ocr
Table of contents
- Why this package exists
- Installation
- Quick Start
- Usage
- Native Tesseract Provider
- Benchmarking
- Preprocessing Presets
- Retry Profiles
- Page Concurrency
- Orientation And Script Detection
- Windows Smoke Check
- Publishing
- License
This package contains OCR and image-native processing components that are intentionally split out from core:
- ImagePreprocessor (uses sharp)
- TesseractProvider (uses tesseract.js)
- NativeTesseractProvider (uses a local tesseract CLI install)
- OCRPipeline
Why this package exists
docparser-core keeps OCR runtime dependencies optional so the default install stays smaller and avoids image-processing and Tesseract dependencies unless you actually need OCR.
Use docparser-ocr when you want:
- direct OCR/image preprocessing APIs
- the built-in createOCRPlugin() runtime from docparser-core
- Tesseract-based orientation, script, and geometry extraction
In docparser-core 1.3.0 and later, that plugin can now consume:
- raw .png, .jpg/.jpeg, .tiff, and .webp documents parsed directly by core
- image-only PDF pages that core rasterizes into PNG data URLs during parsing
- embedded OOXML images emitted by the DOCX and PPTX parsers
Installation
Package names on npm:
- docparser-core
- docparser-ocr
npm
npm install docparser-core docparser-ocr
yarn
yarn add docparser-core docparser-ocr
pnpm
pnpm add docparser-core docparser-ocr
Runtime requirements
- Node.js: >=20.19.0
- This package is ESM ("type": "module")
Quick Start
Beginner checklist:
- Install both docparser-core and docparser-ocr.
- Create a logger from docparser-core.
- Create a TesseractProvider and pass it into OCRPipeline.
- Call processImage(buffer, pageNumber) with an image buffer.
- Read OCR text from result.text and OCR diagnostics from result.detection.
If you already manage a local Tesseract installation and want a native/offline engine path without the browser-style worker runtime, use NativeTesseractProvider instead of TesseractProvider.
Smallest working example:
import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';
const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
  provider,
  minConfidence: 0.4,
});
const image = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(image, 1);
console.log(result.text);
console.log(result.confidence);
console.log(result.detection?.orientationDegrees);
If you want OCR inside DocParser, install both packages and use createOCRPlugin() from docparser-core. That path now covers direct image inputs and scanned PDFs in addition to embedded document images.
Usage
import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';
const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
  minConfidence: 0.5,
  concurrency: 2,
});
// Any image buffer works here; this path reuses the Quick Start sample.
const imageBuffer = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(imageBuffer, 1);
This package currently exports:
- ImagePreprocessor
- TesseractProvider
- NativeTesseractProvider
- OCRPipeline
- OCR pipeline/result/config types from the package root
TesseractProvider now returns richer OCR geometry through blocks, lines, paragraphs, words, and symbols, plus optional orientation and script metadata in detection.
console.log(result.paragraphs?.[0]?.bbox, result.symbols?.[0]?.bbox);
console.log(result.detection?.orientationDegrees, result.detection?.appliedRotationDegrees);
Native Tesseract Provider
NativeTesseractProvider wraps a locally installed tesseract binary and keeps OCR fully local without adding extra npm runtime dependencies beyond this package.
Use it when you want:
- local native Tesseract instead of tesseract.js
- easier control over local tessdata installs and language packs
- a path toward enterprise/offline OCR deployments that rely on managed system binaries
Example:
import { createLogger } from 'docparser-core';
import { NativeTesseractProvider, OCRPipeline } from 'docparser-ocr';
const logger = createLogger({ level: 'warn' });
const provider = new NativeTesseractProvider(logger, {
  executablePath: 'tesseract',
  tessdataDir: '/usr/share/tessdata',
  languages: ['eng'],
});
const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
});
The native provider reads TSV output for geometry and attempts OSD detection through the same local binary when enabled.
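For readers unfamiliar with Tesseract's TSV output, the sketch below shows one way word-level geometry could be derived from it. It assumes the standard twelve-column layout written by the tesseract CLI (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The names parseTesseractTsv and TsvWord are hypothetical and are not part of this package's API; the provider's actual parser may differ.

```typescript
// Hypothetical shape for illustration; not the package's OCR result types.
interface TsvWord {
  text: string;
  confidence: number; // normalized to 0..1
  bbox: { x0: number; y0: number; x1: number; y1: number };
}

// Assumes Tesseract's standard TSV columns:
// level page_num block_num par_num line_num word_num left top width height conf text
function parseTesseractTsv(tsv: string): TsvWord[] {
  const words: TsvWord[] = [];
  const rows = tsv.split(/\r?\n/).slice(1); // skip the header row
  for (const row of rows) {
    const cols = row.split('\t');
    if (cols.length < 12) continue;
    const level = Number(cols[0]);
    const text = cols.slice(11).join('\t').trim();
    // Level 5 rows are words; lower levels describe pages, blocks, paragraphs, and lines.
    if (level !== 5 || text === '') continue;
    const left = Number(cols[6]);
    const top = Number(cols[7]);
    const width = Number(cols[8]);
    const height = Number(cols[9]);
    words.push({
      text,
      confidence: Number(cols[10]) / 100, // Tesseract reports confidence as 0..100
      bbox: { x0: left, y0: top, x1: left + width, y1: top + height },
    });
  }
  return words;
}
```

Line, paragraph, and block boxes could be collected the same way by keeping rows at levels 4, 3, and 2 instead of discarding them.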
When createOCRPlugin() is configured with engine: 'native-tesseract', the runtime now probes for a local native binary first and automatically falls back to TesseractProvider when the native executable cannot be discovered.
Benchmarking
The package now includes a labeled corpus runner so OCR changes can be measured per document type instead of by confidence alone.
- Example manifest: benchmark/corpus.example.json
- Supported document types: printed-scan, form, screenshot, camera-photo, receipt, handwritten
- Reported metrics: exact match rate, average confidence, character error rate (CER), and word error rate (WER)
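CER and WER are conventionally defined as the edit distance between the OCR output and the labeled reference, divided by the reference length in characters or words. The sketch below follows that textbook definition; the function names are hypothetical, and the benchmark runner's own normalization (case, whitespace, punctuation) may differ.

```typescript
// Levenshtein distance over arbitrary token sequences (characters or words).
function editDistance<T>(reference: readonly T[], hypothesis: readonly T[]): number {
  let previous = Array.from({ length: hypothesis.length + 1 }, (_, j) => j);
  for (let i = 1; i <= reference.length; i++) {
    const current = [i];
    for (let j = 1; j <= hypothesis.length; j++) {
      const cost = reference[i - 1] === hypothesis[j - 1] ? 0 : 1;
      const substitution = previous[j - 1]! + cost;
      const deletion = previous[j]! + 1;
      const insertion = current[j - 1]! + 1;
      current.push(Math.min(substitution, deletion, insertion));
    }
    previous = current;
  }
  return previous[hypothesis.length]!;
}

// Character error rate: character-level edits divided by reference length.
function characterErrorRate(reference: string, hypothesis: string): number {
  const ref = Array.from(reference);
  const hyp = Array.from(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  return editDistance(ref, hyp) / ref.length;
}

// Word error rate: word-level edits divided by the number of reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (value: string): string[] =>
    value.trim().split(/\s+/).filter((token) => token !== '');
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  return editDistance(ref, hyp) / ref.length;
}
```

Both rates can exceed 1 when the OCR output contains many more tokens than the reference, which is one reason to report them alongside exact match rate rather than instead of it.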
Run the benchmark after building the package:
npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --output ./benchmark/results.json --engine native-tesseract --quality thorough
You can also pass native-binary settings through the CLI:
npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --engine native-tesseract --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng
Preprocessing Presets
ImagePreprocessor and OCR preprocessing configs now accept named presets so common OCR tuning does not require a long option bag.
- fast: minimal cleanup for speed-sensitive OCR
- scanned-document: binarized, normalized cleanup for typical document scans
- photo-receipt: stronger cleanup plus deskew for camera captures and receipts
- low-contrast: contrast-heavy cleanup for faded scans
Preset values are applied first, and any explicit preprocessing options override the preset.
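The override order can be pictured as an object merge in which explicit options are spread last. The sketch below is illustrative only: the option names (binarize, deskew, normalize) and the values assigned to each preset are assumptions made for the example, not the package's actual defaults, and resolvePreprocessing is a hypothetical helper.

```typescript
// Hypothetical option names and preset values, for illustration only.
interface PreprocessingOptions {
  preset?: 'fast' | 'scanned-document' | 'photo-receipt' | 'low-contrast';
  binarize?: boolean;
  deskew?: boolean;
  normalize?: boolean;
}

type PresetName = NonNullable<PreprocessingOptions['preset']>;
type ResolvedOptions = Omit<PreprocessingOptions, 'preset'>;

const PRESETS: Record<PresetName, ResolvedOptions> = {
  fast: { binarize: false, deskew: false, normalize: false },
  'scanned-document': { binarize: true, deskew: false, normalize: true },
  'photo-receipt': { binarize: true, deskew: true, normalize: true },
  'low-contrast': { binarize: false, deskew: false, normalize: true },
};

function resolvePreprocessing(options: PreprocessingOptions): ResolvedOptions {
  const { preset, ...explicit } = options;
  const base: ResolvedOptions = preset ? PRESETS[preset] : {};
  // Preset values are applied first; explicit options are spread last so they win.
  return { ...base, ...explicit };
}

// Starts from the photo-receipt preset but turns deskew off explicitly.
const resolved = resolvePreprocessing({ preset: 'photo-receipt', deskew: false });
```

Under these assumed values, resolved keeps binarize and normalize from the preset while deskew reflects the explicit override.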
Retry Profiles
OCRPipeline now supports retryProfiles for low-confidence OCR recovery. The pipeline runs the default preprocessing first, then retries low-confidence pages with profile-specific preprocessing overrides until one clears the confidence threshold.
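As a rough picture of that control flow, the sketch below runs a default pass and then walks a list of retry profiles until one attempt clears the threshold. It is not the pipeline's implementation: recognize is a hypothetical callback standing in for "preprocess and OCR with this profile", and keeping the highest-confidence attempt when nothing clears the threshold is an assumption.

```typescript
interface Attempt {
  profile: string;
  text: string;
  confidence: number; // 0..1
}

// Hypothetical stand-in for "preprocess the page with this profile, then run OCR".
type Recognize = (profile: string) => Promise<{ text: string; confidence: number }>;

async function recognizeWithRetries(
  recognize: Recognize,
  retryProfiles: readonly string[],
  minConfidence: number,
): Promise<{ selected: Attempt; attemptedProfiles: string[] }> {
  const attemptedProfiles: string[] = [];
  let best: Attempt | undefined;
  // The default pass runs first; retry profiles follow in order.
  for (const profile of ['default', ...retryProfiles]) {
    const result = await recognize(profile);
    attemptedProfiles.push(profile);
    if (best === undefined || result.confidence > best.confidence) {
      best = { profile, ...result };
    }
    // Stop as soon as one attempt clears the confidence threshold.
    if (result.confidence >= minConfidence) break;
  }
  // Assumption: when no attempt clears the threshold, keep the best one seen.
  // The loop always runs at least once, so `best` is defined here.
  return { selected: best!, attemptedProfiles };
}
```

The returned attemptedProfiles list plays the same role as the retry diagnostics described later in this section: it records which passes ran before a result was accepted.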
Normal OCR runs now use adaptive retry selection by default. Instead of sweeping every built-in retry profile for every page, the pipeline narrows built-in retries to the most relevant profiles for the document type or image hint. Full profile sweeps are reserved for explicit evaluateAllProfiles: true or forced-attempt evaluation paths.
Use qualityLevel when you want the pipeline to control how much OCR work it performs before giving up:
- fast: one default pass, lowest compute
- balanced: adaptive retries for common printed and scanned text
- thorough: stronger preprocessing plus adaptive retries, with full profile evaluation reserved for forced attempts or explicit opt-in
- extreme: highest-compute adaptive mode, with exhaustive profile sweeps reserved for forced attempts or explicit opt-in
If you pass retryProfiles, they replace the built-in retry set by default. Set useBuiltInRetryProfiles: true when you want to keep the built-in quality-level retries and append your own profiles after them.
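The replace-versus-append rule can be summarized in a few lines. This is a sketch of the behavior as described above, not the pipeline's code; resolveRetryProfiles and the minimal RetryProfile shape are hypothetical, and treating only an omitted option (rather than an empty array) as "not provided" is an assumption.

```typescript
// Minimal hypothetical shape; real profiles also carry preprocessing and recognition overrides.
interface RetryProfile {
  name: string;
}

function resolveRetryProfiles(
  builtIn: readonly RetryProfile[],
  custom?: readonly RetryProfile[],
  useBuiltInRetryProfiles = false,
): RetryProfile[] {
  // No custom profiles provided: fall back to the built-in recovery passes.
  if (custom === undefined) return [...builtIn];
  // Custom profiles replace the built-in set unless the caller opts in to keeping it,
  // in which case custom profiles are appended after the built-in ones.
  return useBuiltInRetryProfiles ? [...builtIn, ...custom] : [...custom];
}
```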
When you do not provide custom retry profiles, the pipeline falls back to built-in recovery passes aimed at common OCR failure modes:
- scanned-document-retry: binarized document cleanup plus single_block segmentation
- low-contrast-retry: contrast-heavy cleanup plus sparse_text segmentation
- handwritten-retry: stronger photo-style cleanup plus both engine mode and sparse_text segmentation
In extreme mode the pipeline also adds extra line-, word-, legacy-, and high-DPI sparse-text retries for especially difficult handwriting and degraded printed scans.
You can also pass documentProfileHint into processImage() to bias adaptive selection toward a narrower strategy family such as handwritten, form, screenshot, camera-photo, or printed-scan.
For receipt and form hints, the pipeline now applies conservative field-aware postprocessing to normalize label/value formatting without changing provider confidence or geometry.
When OSD metadata reports a rotated page on the first pass, the pipeline now rotates the original image before retry profiles run, so the recovery profiles operate on orientation-corrected input instead of repeatedly retrying the same rotated buffer.
Each retry profile accepts the following fields:
- name: optional label for logs and diagnostics
- preprocessing: preprocessing overrides merged on top of the base pipeline preprocessing config
- recognition: optional per-profile OCR engine overrides such as pageSegMode, engineMode, preserveInterwordSpaces, and userDefinedDpi
- minConfidence: optional per-profile threshold override
This is useful for harder scans where a binarized or deskewed pass can recover text that the default preprocessing misses.
Each OCRPageResult now includes retry diagnostics through attemptCount, attemptedProfiles, and selectedProfile, plus geometry and OSD diagnostics through paragraphs, symbols, and detection.appliedRotationDegrees.
Example retry profile using a preset:
const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
  useBuiltInRetryProfiles: true,
  retryProfiles: [
    {
      name: 'receipt-recovery',
      preprocessing: { preset: 'photo-receipt' },
      recognition: { pageSegMode: 'sparse_text', engineMode: 'both' },
    },
  ],
});
Page Concurrency
OCRPipeline.processPages() now supports a concurrency option so multi-page OCR can use multiple workers in parallel while preserving input page order in the returned results.
- concurrency: max number of pages processed at once; values below 1 are clamped to 1
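Bounded concurrency with stable output order is a common pattern, and the sketch below shows one generic way to get it: a fixed number of workers pull the next index from a shared counter and write each result into its original slot. mapWithConcurrency is a hypothetical helper written for this illustration, not the implementation behind processPages().

```typescript
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  // Values below 1 (and NaN) are clamped to 1.
  const limit = Math.max(1, Math.floor(concurrency) || 1);
  const results = new Array<R>(items.length);
  let next = 0;

  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim the next page synchronously
      // Writing by index keeps results in input order regardless of finish order.
      results[index] = await worker(items[index]!, index);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => run()));
  return results;
}
```

Because each worker claims its index before awaiting, no page is processed twice, and a slow early page does not change where later results land.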
Orientation And Script Detection
TesseractProvider now attempts orientation and script detection by default and adds the result to OCRResult.detection when available.
- detectOrientationScript: enable or disable OSD detection in TesseractConfig (default: true)
If OSD is unavailable or fails, text recognition still completes and returns geometry output without detection metadata.
The provider also enables the legacy Tesseract worker for OSD so detect() can return reliable script and orientation metadata. When the OCR pipeline uses that metadata to auto-rotate an image before retries, the final page result carries the applied correction in detection.appliedRotationDegrees.
Windows Smoke Check
The package now includes a Windows native deployment smoke check for local Tesseract installs. It verifies binary discovery and confirms that required language packs such as eng and osd are available.
npm --workspace packages/ocr run smoke:native:windows -- --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng,osd
The command prints a JSON diagnostic payload and exits non-zero when required languages or the binary itself are missing.
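A language-pack check of this kind can be approximated by comparing the required languages against the output of tesseract --list-langs. The sketch below assumes that output consists of a "List of available languages ..." header followed by one language code per line; findMissingLanguages is a hypothetical helper, and the smoke check's actual logic and JSON payload may differ.

```typescript
function findMissingLanguages(
  listLangsOutput: string,
  required: readonly string[],
): string[] {
  const installed = new Set(
    listLangsOutput
      .split(/\r?\n/)
      .map((line) => line.trim())
      // Drop blank lines and the "List of available languages ..." header.
      .filter((line) => line !== '' && !line.startsWith('List of available languages')),
  );
  return required.filter((lang) => !installed.has(lang));
}
```

An empty return value means every required pack was found; a non-empty one could then be mapped to a non-zero exit code.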
Publishing
This package is configured to publish as a public unscoped npm package.
Consumer install command:
npm install docparser-core docparser-ocr
Typical release flow:
npm --workspace packages/ocr run build
npm --workspace packages/ocr run test
npm --workspace packages/ocr run typecheck
npm publish --workspace packages/ocr
The package metadata publishes:
- ESM entrypoint: dist/index.js
- Type declarations: dist/index.d.ts
- package name: docparser-ocr
License
Apache-2.0
