docparser-ocr

v1.5.0

Optional OCR runtime package for docparser-core — adaptive OCR pipelines, local Tesseract engines, and image preprocessing

Published package name on npm: docparser-ocr

npm install docparser-core docparser-ocr

This package contains OCR and image-native processing components that are intentionally split out from core:

  • ImagePreprocessor (uses sharp)
  • TesseractProvider (uses tesseract.js)
  • NativeTesseractProvider (uses a local tesseract CLI install)
  • OCRPipeline

Why this package exists

docparser-core keeps OCR runtime dependencies optional so the default install stays smaller and avoids image-processing and Tesseract dependencies unless you actually need OCR.

Use docparser-ocr when you want:

  • direct OCR/image preprocessing APIs
  • the built-in createOCRPlugin() runtime from docparser-core
  • Tesseract-based orientation, script, and geometry extraction

In docparser-core 1.3.0 and later, the plugin can consume:

  • raw .png, .jpg/.jpeg, .tiff, and .webp documents parsed directly by core
  • image-only PDF pages that core rasterizes into PNG data URLs during parsing
  • embedded OOXML images emitted by the DOCX and PPTX parsers

Installation

Package names on npm:

  • docparser-core
  • docparser-ocr

npm

npm install docparser-core docparser-ocr

yarn

yarn add docparser-core docparser-ocr

pnpm

pnpm add docparser-core docparser-ocr

Runtime requirements

  • Node.js: >=20.19.0
  • This package is ESM ("type": "module")

Quick Start

Beginner checklist:

  1. Install both docparser-core and docparser-ocr.
  2. Create a logger from docparser-core.
  3. Create a TesseractProvider and pass it into OCRPipeline.
  4. Call processImage(buffer, pageNumber) with an image buffer.
  5. Read OCR text from result.text and OCR diagnostics from result.detection.

If you already manage a local Tesseract installation and want a native/offline engine path without the browser-style worker runtime, use NativeTesseractProvider instead of TesseractProvider.

Smallest working example:

import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
  provider,
  minConfidence: 0.4,
});

const image = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(image, 1);

console.log(result.text);
console.log(result.confidence);
console.log(result.detection?.orientationDegrees);

If you want OCR inside DocParser, install both packages and use createOCRPlugin() from docparser-core. That path now covers direct image inputs and scanned PDFs in addition to embedded document images.


Usage

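import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough', // see "Retry Profiles" for the available levels
  minConfidence: 0.5,
  concurrency: 2,
});

// Any image buffer works; this path matches the Quick Start example.
const imageBuffer = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(imageBuffer, 1);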

This package currently exports:

  • ImagePreprocessor
  • TesseractProvider
  • NativeTesseractProvider
  • OCRPipeline
  • OCR pipeline/result/config types from the package root

TesseractProvider now returns richer OCR geometry through blocks, lines, paragraphs, words, and symbols, plus optional orientation and script metadata in detection.

console.log(result.paragraphs?.[0]?.bbox, result.symbols?.[0]?.bbox);
console.log(result.detection?.orientationDegrees, result.detection?.appliedRotationDegrees);
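
As a rough illustration of working with that geometry, the sketch below merges several boxes into one enclosing region, for example to crop a paragraph from its words. The { x0, y0, x1, y1 } bbox shape is an assumption borrowed from tesseract.js, and unionBBox is a hypothetical helper that is not part of docparser-ocr; check the exported types before relying on either.

```javascript
// Hypothetical helper: merge the bounding boxes of OCR items into one box.
// Assumes each item may carry `bbox: { x0, y0, x1, y1 }` (tesseract.js style).
function unionBBox(items) {
  const boxes = items.map((item) => item.bbox).filter(Boolean);
  if (boxes.length === 0) return null; // nothing to merge

  return boxes.reduce((acc, box) => ({
    x0: Math.min(acc.x0, box.x0),
    y0: Math.min(acc.y0, box.y0),
    x1: Math.max(acc.x1, box.x1),
    y1: Math.max(acc.y1, box.y1),
  }));
}
```

With a real result this could be called as unionBBox(result.words ?? []). Items without a bbox are skipped rather than treated as errors.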

Native Tesseract Provider

NativeTesseractProvider wraps a locally installed tesseract binary and keeps OCR fully local without adding extra npm runtime dependencies beyond this package.

Use it when you want:

  • local native Tesseract instead of tesseract.js
  • easier control over local tessdata installs and language packs
  • a path toward enterprise/offline OCR deployments that rely on managed system binaries

Example:

import { createLogger } from 'docparser-core';
import { NativeTesseractProvider, OCRPipeline } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new NativeTesseractProvider(logger, {
  executablePath: 'tesseract',
  tessdataDir: '/usr/share/tessdata',
  languages: ['eng'],
});

const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
});

The native provider reads TSV output for geometry and attempts OSD detection through the same local binary when enabled.

When createOCRPlugin() is configured with engine: 'native-tesseract', the runtime now probes for a local native binary first and automatically falls back to TesseractProvider when the native executable cannot be discovered.
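
The probe-then-fall-back behavior can be pictured with the sketch below. It is not the plugin's actual code: selectEngine, probeNative, and the 'tesseract-js' label are hypothetical names used only for illustration. A real probe might, for instance, try running the local binary with --version.

```javascript
// Hypothetical sketch of "probe native first, then fall back".
// `probeNative` is any async function that resolves truthy when a local
// tesseract binary is usable; a rejection is treated as "not found".
async function selectEngine(requested, probeNative) {
  if (requested !== 'native-tesseract') {
    return { engine: requested, fellBack: false };
  }

  let nativeAvailable = false;
  try {
    nativeAvailable = Boolean(await probeNative());
  } catch {
    nativeAvailable = false; // discovery failed, so use the fallback engine
  }

  return nativeAvailable
    ? { engine: 'native-tesseract', fellBack: false }
    : { engine: 'tesseract-js', fellBack: true }; // label is an assumption
}
```

Keeping the probe injectable makes the fallback decision easy to exercise without a Tesseract install on the machine.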


Benchmarking

The package now includes a labeled corpus runner so OCR changes can be measured per document type instead of by confidence alone.

  • Example manifest: benchmark/corpus.example.json
  • Supported document types: printed-scan, form, screenshot, camera-photo, receipt, handwritten
  • Reported metrics: exact match rate, average confidence, character error rate (CER), and word error rate (WER)
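
For readers unfamiliar with the reported metrics, here is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between expected and recognized text, divided by the length of the expected text, over characters or whitespace-separated words. The benchmark runner may normalize text differently, so treat editDistance and errorRates as hypothetical helpers and not as the package's implementation.

```javascript
// Levenshtein distance between two sequences (arrays of chars or words).
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Error rates relative to the expected (reference) text.
function errorRates(expected, actual) {
  const refChars = [...expected];
  const refWords = expected.split(/\s+/).filter(Boolean);
  const hypWords = actual.split(/\s+/).filter(Boolean);
  return {
    exactMatch: expected === actual,
    cer: refChars.length ? editDistance(refChars, [...actual]) / refChars.length : 0,
    wer: refWords.length ? editDistance(refWords, hypWords) / refWords.length : 0,
  };
}
```

A single misread character in a short receipt line barely moves CER but can move WER a lot, which is one reason to look at both.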

Run the benchmark after building the package:

npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --output ./benchmark/results.json --engine native-tesseract --quality thorough

You can also pass native-binary settings through the CLI:

npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --engine native-tesseract --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng

Preprocessing Presets

ImagePreprocessor and OCR preprocessing configs now accept named presets so common OCR tuning does not require a long option bag.

  • fast: minimal cleanup for speed-sensitive OCR
  • scanned-document: binarized, normalized cleanup for typical document scans
  • photo-receipt: stronger cleanup plus deskew for camera captures and receipts
  • low-contrast: contrast-heavy cleanup for faded scans

Preset values are applied first, and any explicit preprocessing options override the preset.
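
The override order can be sketched as a plain object merge. The preset contents and option names below (grayscale, binarize, normalize, deskew, contrast) are invented for illustration; only the four preset names and the "preset first, explicit options win" rule come from this README.

```javascript
// Hypothetical preset table; the real values live inside docparser-ocr.
const PRESETS = {
  fast: { grayscale: true },
  'scanned-document': { grayscale: true, binarize: true, normalize: true },
  'photo-receipt': { grayscale: true, binarize: true, normalize: true, deskew: true },
  'low-contrast': { grayscale: true, normalize: true, contrast: 1.5 },
};

// Apply the preset first, then let explicit options override it.
function resolvePreprocessing(options = {}) {
  const { preset, ...explicit } = options;
  if (preset !== undefined && !Object.hasOwn(PRESETS, preset)) {
    throw new Error(`Unknown preset: ${preset}`);
  }
  const base = preset === undefined ? {} : PRESETS[preset];
  return { ...base, ...explicit };
}
```

For example, resolvePreprocessing({ preset: 'photo-receipt', deskew: false }) keeps the preset's cleanup but turns deskew off.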


Retry Profiles

OCRPipeline now supports retryProfiles for low-confidence OCR recovery. The pipeline runs the default preprocessing first, then retries low-confidence pages with profile-specific preprocessing overrides until one clears the confidence threshold.
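
The control flow described above might look roughly like the following. This is a simplified sketch, not the pipeline's source: recognize is a hypothetical stand-in for one OCR pass, and returning the highest-confidence attempt when nothing clears the threshold is an assumption.

```javascript
// `recognize(image, profile)` is a hypothetical stand-in for one OCR pass;
// it must resolve to an object with a numeric `confidence`.
async function recognizeWithRetries(recognize, image, options = {}) {
  const { minConfidence = 0.5, retryProfiles = [] } = options;
  const attemptedProfiles = ['default'];
  let best = { ...(await recognize(image, {})), selectedProfile: 'default' };

  if (best.confidence < minConfidence) {
    for (const [index, profile] of retryProfiles.entries()) {
      const name = profile.name ?? `retry-${index + 1}`;
      attemptedProfiles.push(name);

      const result = await recognize(image, profile);
      const cleared = result.confidence >= (profile.minConfidence ?? minConfidence);
      if (cleared || result.confidence > best.confidence) {
        best = { ...result, selectedProfile: name };
      }
      if (cleared) break; // stop at the first profile that clears its threshold
    }
  }

  return { ...best, attemptCount: attemptedProfiles.length, attemptedProfiles };
}
```

Retries are skipped entirely when the default pass is already confident enough, which keeps the common case cheap.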

Normal OCR runs now use adaptive retry selection by default. Instead of sweeping every built-in retry profile for every page, the pipeline narrows built-in retries to the most relevant profiles for the document type or image hint. Full profile sweeps are reserved for explicit evaluateAllProfiles: true or forced-attempt evaluation paths.

Use qualityLevel when you want the pipeline to control how much OCR work it performs before giving up:

  • fast: one default pass, lowest compute
  • balanced: adaptive retries for common printed and scanned text
  • thorough: stronger preprocessing plus adaptive retries, with full profile evaluation reserved for forced attempts or explicit opt-in
  • extreme: highest-compute adaptive mode, with exhaustive profile sweeps reserved for forced attempts or explicit opt-in

If you pass retryProfiles, they replace the built-in retry set by default. Set useBuiltInRetryProfiles: true when you want to keep the built-in quality-level retries and append your own profiles after them.
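
Put as code, the replace-versus-append rule could be expressed like this. resolveRetryProfiles is a hypothetical helper written for this README, and the built-in list is passed in because its real contents depend on the quality level.

```javascript
// Hypothetical sketch of how custom and built-in retry profiles combine.
function resolveRetryProfiles(builtIn, config = {}) {
  const { retryProfiles, useBuiltInRetryProfiles = false } = config;

  // No custom profiles: fall back to the built-in recovery passes.
  if (!retryProfiles || retryProfiles.length === 0) return [...builtIn];

  // Custom profiles replace the built-ins unless explicitly appended.
  return useBuiltInRetryProfiles
    ? [...builtIn, ...retryProfiles]
    : [...retryProfiles];
}
```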

When you do not provide custom retry profiles, the pipeline falls back to built-in recovery passes aimed at common OCR failure modes:

  • scanned-document-retry: binarized document cleanup plus single_block segmentation
  • low-contrast-retry: contrast-heavy cleanup plus sparse_text segmentation
  • handwritten-retry: stronger photo-style cleanup plus engineMode 'both' and sparse_text segmentation

In extreme mode the pipeline also adds extra line-, word-, legacy-, and high-DPI sparse-text retries for especially difficult handwriting and degraded printed scans.

You can also pass documentProfileHint into processImage() to bias adaptive selection toward a narrower strategy family such as handwritten, form, screenshot, camera-photo, or printed-scan.
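
To make the idea of hint-driven narrowing concrete, here is a purely illustrative sketch. The hint-to-profile mapping and the no-hint default are invented for this example and almost certainly differ from the pipeline's real heuristics; only the hint names, the built-in profile names, and evaluateAllProfiles come from this README.

```javascript
// The three built-in retry profiles named in this README.
const ALL_BUILT_IN = ['scanned-document-retry', 'low-contrast-retry', 'handwritten-retry'];

// Hypothetical mapping from a document hint to the retries worth trying.
const RETRIES_BY_HINT = {
  handwritten: ['handwritten-retry', 'low-contrast-retry'],
  'camera-photo': ['low-contrast-retry', 'scanned-document-retry'],
  'printed-scan': ['scanned-document-retry'],
  form: ['scanned-document-retry'],
  screenshot: [],
};

function selectRetryNames(hint, { evaluateAllProfiles = false } = {}) {
  if (evaluateAllProfiles) return [...ALL_BUILT_IN]; // explicit full sweep
  if (hint !== undefined && Object.hasOwn(RETRIES_BY_HINT, hint)) {
    return [...RETRIES_BY_HINT[hint]]; // narrowed, hint-specific subset
  }
  // Assumed default when no usable hint is given.
  return ['scanned-document-retry', 'low-contrast-retry'];
}
```

The point of narrowing is cost: a screenshot rarely benefits from a handwriting-oriented pass, so skipping it saves a full OCR run per page.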

For receipt and form hints, the pipeline now applies conservative field-aware postprocessing to normalize label/value formatting without changing provider confidence or geometry.

When OSD metadata reports a rotated page on the first pass, the pipeline now rotates the original image before retry profiles run, so the recovery profiles operate on orientation-corrected input instead of repeatedly retrying the same rotated buffer.

Each retry profile accepts the following fields:

  • name: optional label for logs and diagnostics
  • preprocessing: preprocessing overrides merged on top of the base pipeline preprocessing config
  • recognition: optional per-profile OCR engine overrides such as pageSegMode, engineMode, preserveInterwordSpaces, and userDefinedDpi
  • minConfidence: optional per-profile threshold override

This is useful for harder scans where a binarized or deskewed pass can recover text that the default preprocessing misses.

Each OCRPageResult now includes retry diagnostics through attemptCount, attemptedProfiles, and selectedProfile, plus geometry and OSD diagnostics through paragraphs, symbols, and detection.appliedRotationDegrees.
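
One way to surface those diagnostics in logs is a small formatter like the one below. describeAttempts is a hypothetical helper; it assumes attemptedProfiles is an array of profile names and treats every field as optional, since the exact types are not spelled out here.

```javascript
// Hypothetical helper: summarize the retry diagnostics of one page result.
function describeAttempts(page) {
  const tried = (page.attemptedProfiles ?? []).join(', ') || 'none';
  const rotation = page.detection?.appliedRotationDegrees ?? 0;
  return [
    `attempts=${page.attemptCount ?? 1}`,
    `tried=[${tried}]`,
    `selected=${page.selectedProfile ?? 'default'}`,
    `rotated=${rotation}`,
  ].join(' ');
}
```

Logging one such line per page makes it easier to spot which profiles actually recover text on a given corpus.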

Example retry profile using a preset:

// Assumes logger and provider are created as in the Usage example.
const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
  useBuiltInRetryProfiles: true,
  retryProfiles: [
    {
      name: 'receipt-recovery',
      preprocessing: { preset: 'photo-receipt' },
      recognition: { pageSegMode: 'sparse_text', engineMode: 'both' },
    },
  ],
});

Page Concurrency

OCRPipeline.processPages() now supports a concurrency option so multi-page OCR can use multiple workers in parallel while preserving input page order in the returned results.

  • concurrency: max number of pages processed at once; values below 1 are clamped to 1
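
For intuition, bounded concurrency with stable output order is often implemented with a small worker pool like the one below. This is a generic sketch and not the processPages() source; mapWithConcurrency is a hypothetical name.

```javascript
// Run `worker` over `items` with at most `concurrency` calls in flight,
// returning results in input order regardless of completion order.
async function mapWithConcurrency(items, concurrency, worker) {
  // Values below 1 (or non-numeric values) are clamped to 1.
  const limit = Math.max(1, Math.floor(concurrency) || 1);
  const results = new Array(items.length);
  let next = 0;

  async function run() {
    while (next < items.length) {
      const index = next++; // claim the next page before awaiting
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

Each result is written to the slot of its input index, so a slow first page never reorders the output. If any worker call rejects, the returned promise rejects as well.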

Orientation And Script Detection

TesseractProvider now attempts orientation and script detection by default and adds the result to OCRResult.detection when available.

  • detectOrientationScript: enable or disable OSD detection in TesseractConfig (default: true)

If OSD is unavailable or fails, text recognition still completes and returns geometry output without detection metadata.

The provider also enables the legacy Tesseract worker for OSD so detect() can return reliable script and orientation metadata. When the OCR pipeline uses that metadata to auto-rotate an image before retries, the final page result carries the applied correction in detection.appliedRotationDegrees.


Windows Smoke Check

The package now includes a Windows native deployment smoke check for local Tesseract installs. It verifies binary discovery and confirms that required language packs such as eng and osd are available.

npm --workspace packages/ocr run smoke:native:windows -- --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng,osd

The command prints a JSON diagnostic payload and exits non-zero when required languages or the binary itself are missing.
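
The pass/fail logic of such a check can be sketched as follows. The payload field names (ok, binaryFound, missingLanguages) and the evaluateSmokeCheck helper are assumptions for illustration; the actual script's JSON shape may differ.

```javascript
// Hypothetical sketch: decide the smoke-check outcome from probe results.
function evaluateSmokeCheck({ binaryFound, availableLanguages = [], requiredLanguages = [] }) {
  // Without a binary, no language pack can be confirmed.
  const missingLanguages = binaryFound
    ? requiredLanguages.filter((lang) => !availableLanguages.includes(lang))
    : [...requiredLanguages];

  const ok = binaryFound && missingLanguages.length === 0;
  return {
    payload: { ok, binaryFound, missingLanguages },
    exitCode: ok ? 0 : 1, // non-zero when the binary or a language is missing
  };
}
```

A caller would print JSON.stringify(payload) and set process.exitCode to the returned exitCode, so CI jobs can fail fast on an incomplete Tesseract install.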


Publishing

This package is configured to publish as a public unscoped npm package.

Consumer install command:

npm install docparser-core docparser-ocr

Typical release flow:

npm --workspace packages/ocr run build
npm --workspace packages/ocr run test
npm --workspace packages/ocr run typecheck
npm publish --workspace packages/ocr

The package metadata publishes:

  • ESM entrypoint: dist/index.js
  • Type declarations: dist/index.d.ts
  • package name: docparser-ocr

License

Apache-2.0