docparser-ocr
v2.2.1
Published
Optional OCR runtime package for docparser-core — adaptive OCR pipelines, local Tesseract engines, and image preprocessing
Maintainers
Readme
docparser-ocr
Optional OCR runtime package for docparser-core.
Published package name on npm: docparser-ocr
npm install docparser-core docparser-ocrTable of contents
- Why this package exists
- Installation
- Quick Start
- Developer Playbook
- Usage
- Native Tesseract Provider
- Benchmarking
- Preprocessing Presets
- Retry Profiles
- Page Concurrency
- Orientation And Script Detection
- Windows Smoke Check
- Publishing
- License
This package contains OCR and image-native processing components that are intentionally split out from core:
ImagePreprocessor(usessharp)TesseractProvider(usestesseract.js)NativeTesseractProvider(uses a localtesseractCLI install)OCRPipeline
Why this package exists
docparser-core keeps OCR runtime dependencies optional so the default install stays smaller and avoids image-processing and Tesseract dependencies unless you actually need OCR.
Use docparser-ocr when you want:
- direct OCR/image preprocessing APIs
- the built-in
createOCRPlugin()runtime fromdocparser-core - Tesseract-based orientation, script, and geometry extraction
In current docparser-core releases, that plugin can now consume:
- raw
.png,.jpg/.jpeg,.tiff, and.webpdocuments parsed directly by core - image-only PDF pages that core rasterizes into PNG data URLs during parsing
- embedded OOXML images emitted by the DOCX and PPTX parsers
Installation
Package names on npm:
docparser-coredocparser-ocr
npm
npm install docparser-core docparser-ocryarn
yarn add docparser-core docparser-ocrpnpm
pnpm add docparser-core docparser-ocrRuntime requirements
- Node.js:
>=20.19.0 - This package is ESM (
"type": "module")
Quick Start
Beginner checklist:
- Install both
docparser-coreanddocparser-ocr. - Create a logger from
docparser-core. - Create a
TesseractProviderand pass it intoOCRPipeline. - Call
processImage(buffer, pageNumber)with an image buffer. - Read OCR text from
result.textand OCR diagnostics fromresult.detection.
If you already manage a local Tesseract installation and want a native/offline engine path without the browser-style worker runtime, use NativeTesseractProvider instead of TesseractProvider.
Smallest working example:
import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';
const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
provider,
minConfidence: 0.4,
});
const image = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(image, 1);
console.log(result.text);
console.log(result.confidence);
console.log(result.detection?.orientationDegrees);If you want OCR inside DocParser, install both packages and use createOCRPlugin() from docparser-core. That path now covers direct image inputs and scanned PDFs in addition to embedded document images.
If you prefer config-driven setup instead of manual plugin registration, docparser-core now supports a top-level providers.ocr section. Supported OCR provider kinds are tesseract, native_tesseract, ollama, claude, openrouter, and organization_gateway. The Tesseract-based options still depend on this package because the OCR runtime stays optional.
That lets a generated config file enable OCR like this:
export default {
parsing: {
ocr: {
languages: ['eng'],
confidenceThreshold: 60,
dpi: 300,
},
},
providers: {
ocr: {
provider: {
kind: 'native_tesseract',
executablePath: process.env.DOCPARSER_NATIVE_TESSERACT_PATH,
languages: ['eng'],
},
pipeline: {
qualityLevel: 'balanced',
},
},
},
};For HTTP-backed OCR providers, add startup health validation and bounded retry/backoff so the parser can fail fast before processing documents:
export default {
providers: {
ocr: {
provider: {
kind: 'organization_gateway',
protocol: 'openai',
baseUrl: process.env.DOCPARSER_GATEWAY_BASE_URL,
apiKey: process.env.DOCPARSER_GATEWAY_API_KEY,
model: 'gpt-4.1-mini',
healthPath: '/readyz',
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
processEmbeddedImages: true,
processImagePDFs: true,
pipeline: {
qualityLevel: 'thorough',
useBuiltInRetryProfiles: true,
},
},
},
};Built-in provider errors also redact common bearer-token and API-key patterns before they surface to callers or logs.
Developer Playbook
Use docparser-ocr for these developer workflows:
| Goal | Use this surface | Start here |
| --- | --- | --- |
| OCR a single image buffer | OCRPipeline.processImage(...) | Quick Start |
| OCR multiple rasterized pages with concurrency | OCRPipeline.processPages(...) | Page Concurrency |
| Use browser-style worker OCR with tesseract.js | TesseractProvider | Usage |
| Use a local system Tesseract binary | NativeTesseractProvider | Native Tesseract Provider |
| Preprocess scans before OCR | ImagePreprocessor | Preprocessing Presets |
| Retry harder OCR profiles for poor scans or handwriting | OCRPipeline retry profiles and quality levels | Retry Profiles |
| Capture geometry, script, and orientation metadata | OCR provider + pipeline results | Usage and Orientation And Script Detection |
| Plug OCR into DocParser for images and scanned PDFs | createOCRPlugin() or providers.ocr in docparser-core | Quick Start and the core package docs |
| Benchmark OCR quality against a corpus | runOCRBenchmark(...) or the benchmark CLI | Benchmarking |
| Verify Windows native Tesseract installs | runWindowsNativeTesseractSmoke(...) | Windows Smoke Check |
Practical path selection:
- Choose
TesseractProviderwhen you want pure npm-based local OCR with no system binary management. - Choose
NativeTesseractProviderwhen your enterprise environment standardizes on a localtesseractinstall and you want predictable binary/runtime ownership. - Choose
OCRPipelinewhen you want preprocessing, retry profiles, text detection heuristics, and page-level result objects. - Choose
createOCRPlugin()indocparser-corewhen you want OCR to happen automatically after parsing standalone images, embedded document images, or rasterized scanned-PDF pages.
Usage
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';
const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
provider,
qualityLevel: 'thorough',
minConfidence: 0.5,
concurrency: 2,
});
const result = await pipeline.processImage(imageBuffer, 1);This package currently exports:
ImagePreprocessorTesseractProviderNativeTesseractProviderOCRPipeline- OCR pipeline/result/config types from the package root
TesseractProvider now returns richer OCR geometry through blocks, lines, paragraphs, words, and symbols, plus optional orientation and script metadata in detection.
console.log(result.paragraphs?.[0]?.bbox, result.symbols?.[0]?.bbox);
console.log(result.detection?.orientationDegrees, result.detection?.appliedRotationDegrees);Native Tesseract Provider
NativeTesseractProvider wraps a locally installed tesseract binary and keeps OCR fully local without adding extra npm runtime dependencies beyond this package.
Use it when you want:
- local native Tesseract instead of
tesseract.js - easier control over local
tessdatainstalls and language packs - a path toward enterprise/offline OCR deployments that rely on managed system binaries
Example:
import { createLogger } from 'docparser-core';
import { NativeTesseractProvider, OCRPipeline } from 'docparser-ocr';
const logger = createLogger({ level: 'warn' });
const provider = new NativeTesseractProvider(logger, {
executablePath: 'tesseract',
tessdataDir: '/usr/share/tessdata',
languages: ['eng'],
});
const pipeline = new OCRPipeline(logger, {
provider,
qualityLevel: 'thorough',
});The native provider reads TSV output for geometry and attempts OSD detection through the same local binary when enabled.
When createOCRPlugin() is configured with engine: 'native-tesseract', the runtime now probes for a local native binary first and automatically falls back to TesseractProvider when the native executable cannot be discovered.
Benchmarking
The package now includes a labeled corpus runner so OCR changes can be measured per document type instead of by confidence alone.
- Example manifest:
benchmark/corpus.example.json - Supported document types:
printed-scan,form,prescription,screenshot,camera-photo,receipt,handwritten - Reported metrics: exact match rate, average confidence, character error rate (CER), word error rate (WER), reviewed-entry counts, average review-finding counts, average OCR attempt counts, selected-profile counts, and aggregated review field types
- Exported benchmark JSON now includes both the raw aggregate metrics and a derived
reportsection with sorted profile and review-field breakdowns for operator dashboards or quick inspection - Existing benchmark result JSON can be used as a comparison baseline so operators can inspect deltas between two OCR runs without manually diffing raw metrics
- Saved benchmark result JSON can also be loaded directly with
--currentso operators can inspect or compare existing benchmark exports without running OCR again
Run the benchmark after building the package:
npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --output ./benchmark/results.json --engine native-tesseract --quality thoroughFor a concise operator-facing console summary instead of JSON, add --format text:
npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --engine native-tesseract --format textTo compare a fresh run against a previous benchmark export, pass --baseline with the earlier result JSON. The CLI will include a structured comparison block in JSON output, or a delta section in text mode:
npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --baseline ./benchmark/previous-results.json --format textTo compare two existing benchmark exports without rerunning OCR, pass both --current and --baseline:
npm --workspace packages/ocr run benchmark -- --current ./benchmark/current-results.json --baseline ./benchmark/previous-results.jsonYou can also pass native-binary settings through the CLI:
npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --engine native-tesseract --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages engPreprocessing Presets
ImagePreprocessor and OCR preprocessing configs now accept named presets so common OCR tuning does not require a long option bag.
fast: minimal cleanup for speed-sensitive OCRscanned-document: binarized, normalized cleanup for typical document scansphoto-receipt: stronger cleanup plus deskew for camera captures and receiptslow-contrast: contrast-heavy cleanup for faded scans
Preset values are applied first, and any explicit preprocessing options override the preset.
Retry Profiles
OCRPipeline now supports retryProfiles for low-confidence OCR recovery. The pipeline runs the default preprocessing first, then retries low-confidence pages with profile-specific preprocessing overrides until one clears the confidence threshold.
Normal OCR runs now use adaptive retry selection by default. Instead of sweeping every built-in retry profile for every page, the pipeline narrows built-in retries to the most relevant profiles for the document type or image hint. Full profile sweeps are reserved for explicit evaluateAllProfiles: true or forced-attempt evaluation paths.
Use qualityLevel when you want the pipeline to control how much OCR work it performs before giving up:
fast: one default pass, lowest computebalanced: adaptive retries for common printed and scanned textthorough: stronger preprocessing plus adaptive retries, with full profile evaluation reserved for forced attempts or explicit opt-inextreme: highest-compute adaptive mode, with exhaustive profile sweeps reserved for forced attempts or explicit opt-in
If you pass retryProfiles, they replace the built-in retry set by default. Set useBuiltInRetryProfiles: true when you want to keep the built-in quality-level retries and append your own profiles after them.
When you do not provide custom retry profiles, the pipeline falls back to built-in recovery passes aimed at common OCR failure modes:
scanned-document-retry: binarized document cleanup plussingle_blocksegmentationlow-contrast-retry: contrast-heavy cleanup plussparse_textsegmentationhandwritten-retry: stronger photo-style cleanup plusbothengine mode andsparse_textsegmentation
In extreme mode the pipeline also adds extra line-, word-, legacy-, and high-DPI sparse-text retries for especially difficult handwriting and degraded printed scans.
You can also pass documentProfileHint into processImage() to bias adaptive selection toward a narrower strategy family such as handwritten, form, screenshot, camera-photo, or printed-scan.
For receipt and form hints, the pipeline now applies conservative field-aware postprocessing to normalize label/value formatting without changing provider confidence or geometry.
When OSD metadata reports a rotated page on the first pass, the pipeline now rotates the original image before retry profiles run, so the recovery profiles operate on orientation-corrected input instead of repeatedly retrying the same rotated buffer.
name: optional label for logs and diagnosticspreprocessing: preprocessing overrides merged on top of the base pipeline preprocessing configrecognition: optional per-profile OCR engine overrides such aspageSegMode,engineMode,preserveInterwordSpaces, anduserDefinedDpiminConfidence: optional per-profile threshold override
This is useful for harder scans where a binarized or deskewed pass can recover text that the default preprocessing misses.
Each OCRPageResult now includes retry diagnostics through attemptCount, attemptedProfiles, and selectedProfile, plus geometry and OSD diagnostics through paragraphs, symbols, and detection.appliedRotationDegrees.
Example retry profile using a preset:
const pipeline = new OCRPipeline(logger, {
provider,
qualityLevel: 'thorough',
useBuiltInRetryProfiles: true,
retryProfiles: [
{
name: 'receipt-recovery',
preprocessing: { preset: 'photo-receipt' },
recognition: { pageSegMode: 'sparse_text', engineMode: 'both' },
},
],
});Page Concurrency
OCRPipeline.processPages() now supports a concurrency option so multi-page OCR can use multiple workers in parallel while preserving input page order in the returned results.
concurrency: max number of pages processed at once; values below1are clamped to1
Orientation And Script Detection
TesseractProvider now attempts orientation and script detection by default and adds the result to OCRResult.detection when available.
detectOrientationScript: enable or disable OSD detection inTesseractConfig(default:true)
If OSD is unavailable or fails, text recognition still completes and returns geometry output without detection metadata.
The provider also enables the legacy Tesseract worker for OSD so detect() can return reliable script and orientation metadata. When the OCR pipeline uses that metadata to auto-rotate an image before retries, the final page result carries the applied correction in detection.appliedRotationDegrees.
Windows Smoke Check
The package now includes a Windows native deployment smoke check for local Tesseract installs. It verifies binary discovery and confirms that required language packs such as eng and osd are available.
npm --workspace packages/ocr run smoke:native:windows -- --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng,osdThe command prints a JSON diagnostic payload and exits non-zero when required languages or the binary itself are missing.
Publishing
This package is configured to publish as a public unscoped npm package.
Consumer install command:
npm install docparser-core docparser-ocrRelease cookbook:
- Publish
docparser-corefirst if both packages are moving together. - Reconfirm the peer dependency range in
packages/ocr/package.jsonbefore publish. - Run OCR verification after core is built and available locally.
- Inspect the tarball contents with
npm pack --dry-run. - Publish and then verify the registry version.
OCR release commands:
npm --workspace packages/core run verify:release
npm --workspace packages/ocr run verify:release
npm pack --workspace packages/core --dry-run
npm pack --workspace packages/ocr --dry-run
npm publish --workspace packages/core
npm --workspace packages/ocr run typecheck
npm publish --workspace packages/ocr
npm view docparser-ocr versionThe package metadata publishes:
- ESM entrypoint:
dist/index.js - Type declarations:
dist/index.d.ts - package name:
docparser-ocr docparser-ocrshould be published afterdocparser-corewhen both packages move together, because the OCR package peers on the matching core release train.- If OCR release verification fails after core has already published, stop and fix OCR locally before retrying instead of widening the peer range or publishing a mismatched OCR package.
License
Apache-2.0
