docparser-ocr

v1.5.0

Published

5 days ago

Optional OCR runtime package for docparser-core — adaptive OCR pipelines, local Tesseract engines, and image preprocessing

rohansahana3000

docparser ocr tesseract native-tesseract benchmark sharp image preprocessing plugin

docparser-ocr

Optional OCR runtime package for docparser-core.

Published package name on npm: docparser-ocr

npm install docparser-core docparser-ocr

This package contains OCR and image-native processing components that are intentionally split out from core:

ImagePreprocessor (uses sharp)
TesseractProvider (uses tesseract.js)
NativeTesseractProvider (uses a local tesseract CLI install)
OCRPipeline

Why this package exists

docparser-core keeps OCR runtime dependencies optional so the default install stays smaller and avoids image-processing and Tesseract dependencies unless you actually need OCR.

Use docparser-ocr when you want:

direct OCR/image preprocessing APIs
the built-in createOCRPlugin() runtime from docparser-core
Tesseract-based orientation, script, and geometry extraction

With docparser-core 1.3.0 or later, that plugin can consume:

raw .png, .jpg/.jpeg, .tiff, and .webp documents parsed directly by core
image-only PDF pages that core rasterizes into PNG data URLs during parsing
embedded OOXML images emitted by the DOCX and PPTX parsers

Installation

Package names on npm:

docparser-core
docparser-ocr

npm

npm install docparser-core docparser-ocr

yarn

yarn add docparser-core docparser-ocr

pnpm

pnpm add docparser-core docparser-ocr

Runtime requirements

Node.js: >=20.19.0
This package is ESM ("type": "module")

Quick Start

Beginner checklist:

Install both docparser-core and docparser-ocr.
Create a logger from docparser-core.
Create a TesseractProvider and pass it into OCRPipeline.
Call processImage(buffer, pageNumber) with an image buffer.
Read OCR text from result.text and OCR diagnostics from result.detection.

If you already manage a local Tesseract installation and want a native/offline engine path without the browser-style worker runtime, use NativeTesseractProvider instead of TesseractProvider.

Smallest working example:

import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
  provider,
  minConfidence: 0.4,
});

const image = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(image, 1);

console.log(result.text);
console.log(result.confidence);
console.log(result.detection?.orientationDegrees);

If you want OCR inside DocParser, install both packages and use createOCRPlugin() from docparser-core. That path now covers direct image inputs and scanned PDFs in addition to embedded document images.

Usage

import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
  minConfidence: 0.5,
  concurrency: 2,
});

// Load the image to recognize; this is the same sample path used in Quick Start.
const imageBuffer = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(imageBuffer, 1);

This package currently exports:

ImagePreprocessor
TesseractProvider
NativeTesseractProvider
OCRPipeline
OCR pipeline/result/config types from the package root

TesseractProvider returns OCR geometry at five levels (blocks, lines, paragraphs, words, and symbols), plus optional orientation and script metadata in detection.

console.log(result.paragraphs?.[0]?.bbox, result.symbols?.[0]?.bbox);
console.log(result.detection?.orientationDegrees, result.detection?.appliedRotationDegrees);

Native Tesseract Provider

NativeTesseractProvider wraps a locally installed tesseract binary and keeps OCR fully local without adding extra npm runtime dependencies beyond this package.

Use it when you want:

local native Tesseract instead of tesseract.js
easier control over local tessdata installs and language packs
a path toward enterprise/offline OCR deployments that rely on managed system binaries

Example:

import { createLogger } from 'docparser-core';
import { NativeTesseractProvider, OCRPipeline } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new NativeTesseractProvider(logger, {
  executablePath: 'tesseract',
  tessdataDir: '/usr/share/tessdata',
  languages: ['eng'],
});

const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
});

The native provider reads TSV output for geometry and attempts OSD detection through the same local binary when enabled.

When createOCRPlugin() is configured with engine: 'native-tesseract', the runtime now probes for a local native binary first and automatically falls back to TesseractProvider when the native executable cannot be discovered.
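
The fallback described above can be pictured as a probe-then-choose step. The sketch below is not the plugin's implementation: probeNativeBinary and chooseProvider are hypothetical names, and probing with `tesseract --version` is only an assumption about how discovery might work.

```javascript
// Hypothetical probe: resolves true when `<executablePath> --version` runs successfully.
// A dynamic import keeps the sketch usable from both ESM and CommonJS files.
async function probeNativeBinary(executablePath = 'tesseract') {
  const { execFile } = await import('node:child_process');
  return new Promise((resolve) => {
    // execFile reports a missing binary or non-zero exit through `error`.
    execFile(executablePath, ['--version'], (error) => resolve(!error));
  });
}

// Hypothetical selector mirroring the documented behavior: prefer the native
// provider, fall back to TesseractProvider when the binary cannot be found.
async function chooseProvider(executablePath, probe = probeNativeBinary) {
  const nativeAvailable = await probe(executablePath);
  return nativeAvailable ? 'NativeTesseractProvider' : 'TesseractProvider';
}
```

The probe is injectable so the selection logic can be exercised without a Tesseract install.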

Benchmarking

The package now includes a labeled corpus runner so OCR changes can be measured per document type instead of by confidence alone.

Example manifest: benchmark/corpus.example.json
Supported document types: printed-scan, form, screenshot, camera-photo, receipt, handwritten
Reported metrics: exact match rate, average confidence, character error rate (CER), and word error rate (WER)
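
For readers unfamiliar with CER and WER, here is a minimal sketch using the standard edit-distance definitions. The function names are hypothetical, and the benchmark runner's own text normalization (casing, whitespace, punctuation) is not documented here, so its numbers may differ.

```javascript
// Levenshtein edit distance between two arrays of tokens.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    const curr = [i];
    for (let j = 1; j <= b.length; j += 1) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Edits divided by reference length. Assumption: an empty reference scores 0
// when the hypothesis is also empty, otherwise 1.
function errorRate(referenceTokens, hypothesisTokens) {
  if (referenceTokens.length === 0) return hypothesisTokens.length === 0 ? 0 : 1;
  return editDistance(referenceTokens, hypothesisTokens) / referenceTokens.length;
}

// CER compares characters; WER compares whitespace-separated words.
function characterErrorRate(reference, hypothesis) {
  return errorRate(Array.from(reference), Array.from(hypothesis));
}

function wordErrorRate(reference, hypothesis) {
  const words = (text) => text.trim().split(/\s+/).filter(Boolean);
  return errorRate(words(reference), words(hypothesis));
}
```

Both rates can exceed 1 when the OCR output contains many more insertions than the reference has tokens.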

Run the benchmark after building the package:

npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --output ./benchmark/results.json --engine native-tesseract --quality thorough

You can also pass native-binary settings through the CLI:

npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --engine native-tesseract --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng

Preprocessing Presets

ImagePreprocessor and OCR preprocessing configs now accept named presets so common OCR tuning does not require a long option bag.

fast: minimal cleanup for speed-sensitive OCR
scanned-document: binarized, normalized cleanup for typical document scans
photo-receipt: stronger cleanup plus deskew for camera captures and receipts
low-contrast: contrast-heavy cleanup for faded scans

Preset values are applied first, and any explicit preprocessing options override the preset.
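
The precedence rule can be sketched as a shallow merge. This is an illustration only: EXAMPLE_PRESETS, resolvePreprocessing, and the option names binarize and deskew are placeholders, not the package's actual preset contents or internals.

```javascript
// Hypothetical preset table; the real presets and their option names may differ.
const EXAMPLE_PRESETS = {
  fast: { binarize: false, deskew: false },
  'photo-receipt': { binarize: true, deskew: true },
};

// Preset values are applied first; explicit options override them.
function resolvePreprocessing(config = {}) {
  const { preset, ...explicit } = config;
  const base = preset ? (EXAMPLE_PRESETS[preset] ?? {}) : {};
  return { ...base, ...explicit };
}
```

For example, resolvePreprocessing({ preset: 'photo-receipt', deskew: false }) keeps the preset's binarize value while the explicit deskew: false wins.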

Retry Profiles

OCRPipeline now supports retryProfiles for low-confidence OCR recovery. The pipeline runs the default preprocessing first, then retries low-confidence pages with profile-specific preprocessing overrides until one clears the confidence threshold.
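
A minimal sketch of that retry loop follows, assuming a stand-in recognize(image, preprocessing) function that resolves to { text, confidence }. recognizeWithRetries, the 'default' attempt label, and keeping the highest-confidence attempt when nothing clears the threshold are assumptions for illustration, not confirmed pipeline behavior.

```javascript
// Sketch of a confidence-driven retry loop; not the pipeline's actual code.
async function recognizeWithRetries(recognize, image, options) {
  const { preprocessing = {}, retryProfiles = [], minConfidence } = options;

  // The first attempt uses the base preprocessing; each retry merges its overrides on top.
  const attempts = [
    { name: 'default', preprocessing, minConfidence },
    ...retryProfiles.map((profile, index) => ({
      name: profile.name ?? `retry-${index + 1}`,
      preprocessing: { ...preprocessing, ...profile.preprocessing },
      minConfidence: profile.minConfidence ?? minConfidence,
    })),
  ];

  const attemptedProfiles = [];
  let best = null;
  for (const attempt of attempts) {
    const result = await recognize(image, attempt.preprocessing);
    attemptedProfiles.push(attempt.name);
    const cleared = result.confidence >= attempt.minConfidence;
    if (cleared || best === null || result.confidence > best.confidence) {
      best = { ...result, selectedProfile: attempt.name };
    }
    if (cleared) break; // stop at the first attempt that clears its threshold
  }
  return { ...best, attemptCount: attemptedProfiles.length, attemptedProfiles };
}
```

Profiles listed after the first successful attempt are never run, which is why profile order matters for compute cost.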

Normal OCR runs now use adaptive retry selection by default. Instead of sweeping every built-in retry profile for every page, the pipeline narrows built-in retries to the most relevant profiles for the document type or image hint. Full profile sweeps are reserved for explicit evaluateAllProfiles: true or forced-attempt evaluation paths.

Use qualityLevel when you want the pipeline to control how much OCR work it performs before giving up:

fast: one default pass, lowest compute
balanced: adaptive retries for common printed and scanned text
thorough: stronger preprocessing plus adaptive retries, with full profile evaluation reserved for forced attempts or explicit opt-in
extreme: highest-compute adaptive mode, with exhaustive profile sweeps reserved for forced attempts or explicit opt-in

If you pass retryProfiles, they replace the built-in retry set by default. Set useBuiltInRetryProfiles: true when you want to keep the built-in quality-level retries and append your own profiles after them.
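
The replace-versus-append rule can be sketched as a small resolver. resolveRetryProfiles is a hypothetical helper, and treating an omitted retryProfiles (rather than an empty array) as "not provided" is an assumption.

```javascript
// Sketch: decide which retry profiles run, given the built-in set and user config.
function resolveRetryProfiles(builtInProfiles, config = {}) {
  const { retryProfiles, useBuiltInRetryProfiles = false } = config;
  // No custom profiles: fall back to the built-in recovery passes.
  if (retryProfiles === undefined) return [...builtInProfiles];
  // Custom profiles replace the built-ins unless the flag asks to append them.
  return useBuiltInRetryProfiles
    ? [...builtInProfiles, ...retryProfiles]
    : [...retryProfiles];
}
```

With useBuiltInRetryProfiles: true, custom profiles run after the built-in ones, so they act as a last resort rather than a first choice.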

When you do not provide custom retry profiles, the pipeline falls back to built-in recovery passes aimed at common OCR failure modes:

scanned-document-retry: binarized document cleanup plus single_block segmentation
low-contrast-retry: contrast-heavy cleanup plus sparse_text segmentation
handwritten-retry: stronger photo-style cleanup plus the both engine mode and sparse_text segmentation

In extreme mode the pipeline also adds extra line-, word-, legacy-, and high-DPI sparse-text retries for especially difficult handwriting and degraded printed scans.

You can also pass documentProfileHint into processImage() to bias adaptive selection toward a narrower strategy family such as handwritten, form, screenshot, camera-photo, or printed-scan.

For receipt and form hints, the pipeline now applies conservative field-aware postprocessing to normalize label/value formatting without changing provider confidence or geometry.

When OSD metadata reports a rotated page on the first pass, the pipeline now rotates the original image before retry profiles run, so the recovery profiles operate on orientation-corrected input instead of repeatedly retrying the same rotated buffer.

Each retry profile accepts these fields:

name: optional label for logs and diagnostics
preprocessing: preprocessing overrides merged on top of the base pipeline preprocessing config
recognition: optional per-profile OCR engine overrides such as pageSegMode, engineMode, preserveInterwordSpaces, and userDefinedDpi
minConfidence: optional per-profile threshold override

Retry profiles are useful for harder scans where a binarized or deskewed pass can recover text that the default preprocessing misses.

Each OCRPageResult now includes retry diagnostics through attemptCount, attemptedProfiles, and selectedProfile, plus geometry and OSD diagnostics through paragraphs, symbols, and detection.appliedRotationDegrees.

Example retry profile using a preset:

const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
  useBuiltInRetryProfiles: true,
  retryProfiles: [
    {
      name: 'receipt-recovery',
      preprocessing: { preset: 'photo-receipt' },
      recognition: { pageSegMode: 'sparse_text', engineMode: 'both' },
    },
  ],
});

Page Concurrency

OCRPipeline.processPages() now supports a concurrency option so multi-page OCR can use multiple workers in parallel while preserving input page order in the returned results.

concurrency: max number of pages processed at once; values below 1 are clamped to 1
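
Order-preserving bounded concurrency can be sketched with a shared index and a fixed number of worker lanes. mapWithConcurrency is a hypothetical helper that illustrates the clamping and ordering behavior described above; it is not how processPages() is necessarily implemented.

```javascript
// Run `worker` over `items` with at most `concurrency` calls in flight,
// returning results in input order regardless of completion order.
async function mapWithConcurrency(items, concurrency, worker) {
  // Values below 1 (or non-numeric values) are clamped to 1.
  const limit = Math.max(1, Math.floor(Number(concurrency) || 1));
  const results = new Array(items.length);
  let next = 0;

  async function runLane() {
    while (next < items.length) {
      const index = next;
      next += 1;
      // Writing by index keeps output order independent of timing.
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A slow first page therefore delays only its own lane; later pages still land in their original positions in the returned array.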

Orientation And Script Detection

TesseractProvider now attempts orientation and script detection by default and adds the result to OCRResult.detection when available.

detectOrientationScript: enable or disable OSD detection in TesseractConfig (default: true)

If OSD is unavailable or fails, text recognition still completes and returns geometry output without detection metadata.

The provider also enables the legacy Tesseract worker for OSD so detect() can return reliable script and orientation metadata. When the OCR pipeline uses that metadata to auto-rotate an image before retries, the final page result carries the applied correction in detection.appliedRotationDegrees.
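
Because detection is optional, callers should read it defensively. The helper below is a small sketch with a hypothetical name (describeOrientation); it relies only on the detection, orientationDegrees, and appliedRotationDegrees fields mentioned above and assumes a missing appliedRotationDegrees means no rotation was applied.

```javascript
// Summarize OSD metadata from a page result without assuming it exists.
function describeOrientation(result) {
  const detection = result?.detection;
  if (!detection) return 'no OSD metadata available';
  const detected = detection.orientationDegrees ?? 'unknown';
  const applied = detection.appliedRotationDegrees ?? 0;
  return `detected orientation: ${detected}, applied rotation: ${applied}`;
}
```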

Windows Smoke Check

The package now includes a Windows native deployment smoke check for local Tesseract installs. It verifies binary discovery and confirms that required language packs such as eng and osd are available.

npm --workspace packages/ocr run smoke:native:windows -- --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng,osd

The command prints a JSON diagnostic payload and exits non-zero when required languages or the binary itself are missing.
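
A language-pack check of this kind can be approximated by parsing the output of `tesseract --list-langs`, which prints a header line followed by one language code per line. The sketch below uses hypothetical helper names and a hypothetical payload shape; whether the smoke check uses this command or this payload is an assumption.

```javascript
// Return the required language codes that do not appear in `tesseract --list-langs` output.
function findMissingLanguages(listLangsOutput, requiredLanguages) {
  const installed = new Set(
    listLangsOutput
      .split(/\r?\n/)
      .map((line) => line.trim())
      // Skip blank lines and the "List of available languages ..." header.
      .filter((line) => line && !line.startsWith('List of available languages')),
  );
  return requiredLanguages.filter((language) => !installed.has(language));
}

// Hypothetical diagnostic payload; a caller could set process.exitCode = 1 when ok is false.
function buildLanguageDiagnostics(listLangsOutput, requiredLanguages) {
  const missingLanguages = findMissingLanguages(listLangsOutput, requiredLanguages);
  return { ok: missingLanguages.length === 0, missingLanguages };
}
```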

Publishing

This package is configured to publish as a public unscoped npm package.

Consumer install command:

npm install docparser-core docparser-ocr

Typical release flow:

npm --workspace packages/ocr run build
npm --workspace packages/ocr run test
npm --workspace packages/ocr run typecheck
npm publish --workspace packages/ocr

The package metadata publishes:

ESM entrypoint: dist/index.js
Type declarations: dist/index.d.ts
Package name: docparser-ocr

License

Apache-2.0
