docparser-ocr

v2.2.1

Optional OCR runtime package for docparser-core — adaptive OCR pipelines, local Tesseract engines, and image preprocessing

Published package name on npm: docparser-ocr

npm install docparser-core docparser-ocr

This package contains OCR and image-native processing components that are intentionally split out from core:

  • ImagePreprocessor (uses sharp)
  • TesseractProvider (uses tesseract.js)
  • NativeTesseractProvider (uses a local tesseract CLI install)
  • OCRPipeline

Why this package exists

docparser-core keeps OCR runtime dependencies optional so the default install stays smaller and avoids image-processing and Tesseract dependencies unless you actually need OCR.

Use docparser-ocr when you want:

  • direct OCR/image preprocessing APIs
  • the built-in createOCRPlugin() runtime from docparser-core
  • Tesseract-based orientation, script, and geometry extraction

In current docparser-core releases, that plugin can consume:

  • raw .png, .jpg/.jpeg, .tiff, and .webp documents parsed directly by core
  • image-only PDF pages that core rasterizes into PNG data URLs during parsing
  • embedded OOXML images emitted by the DOCX and PPTX parsers

Installation

Package names on npm:

  • docparser-core
  • docparser-ocr

npm

npm install docparser-core docparser-ocr

yarn

yarn add docparser-core docparser-ocr

pnpm

pnpm add docparser-core docparser-ocr

Runtime requirements

  • Node.js: >=20.19.0
  • This package is ESM ("type": "module")

Quick Start

Beginner checklist:

  1. Install both docparser-core and docparser-ocr.
  2. Create a logger from docparser-core.
  3. Create a TesseractProvider and pass it into OCRPipeline.
  4. Call processImage(buffer, pageNumber) with an image buffer.
  5. Read OCR text from result.text and OCR diagnostics from result.detection.

If you already manage a local Tesseract installation and want a native/offline engine path without the browser-style worker runtime, use NativeTesseractProvider instead of TesseractProvider.

Smallest working example:

import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
  provider,
  minConfidence: 0.4,
});

const image = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(image, 1);

console.log(result.text);
console.log(result.confidence);
console.log(result.detection?.orientationDegrees);

If you want OCR inside DocParser, install both packages and use createOCRPlugin() from docparser-core. That path now covers direct image inputs and scanned PDFs in addition to embedded document images.

If you prefer config-driven setup instead of manual plugin registration, docparser-core now supports a top-level providers.ocr section. Supported OCR provider kinds are tesseract, native_tesseract, ollama, claude, openrouter, and organization_gateway. The Tesseract-based options still depend on this package because the OCR runtime stays optional.

That lets a generated config file enable OCR like this:

export default {
  parsing: {
    ocr: {
      languages: ['eng'],
      confidenceThreshold: 60,
      dpi: 300,
    },
  },
  providers: {
    ocr: {
      provider: {
        kind: 'native_tesseract',
        executablePath: process.env.DOCPARSER_NATIVE_TESSERACT_PATH,
        languages: ['eng'],
      },
      pipeline: {
        qualityLevel: 'balanced',
      },
    },
  },
};

For HTTP-backed OCR providers, add startup health validation and bounded retry/backoff so the parser can fail fast before processing documents:

export default {
  providers: {
    ocr: {
      provider: {
        kind: 'organization_gateway',
        protocol: 'openai',
        baseUrl: process.env.DOCPARSER_GATEWAY_BASE_URL,
        apiKey: process.env.DOCPARSER_GATEWAY_API_KEY,
        model: 'gpt-4.1-mini',
        healthPath: '/readyz',
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      processEmbeddedImages: true,
      processImagePDFs: true,
      pipeline: {
        qualityLevel: 'thorough',
        useBuiltInRetryProfiles: true,
      },
    },
  },
};
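As a rough illustration of what a bounded retry/backoff policy like the one above implies, the sketch below computes the wait before each retry from the four retry fields. The formula (exponential growth capped at maxDelayMs, no jitter) and the backoffDelays helper are assumptions for illustration; the README does not state how docparser-core actually schedules retries.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

// Hypothetical helper: returns the delay before each retry (attempts 2..maxAttempts).
// The first attempt runs immediately, so maxAttempts = 3 yields two delays.
function backoffDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  let delay = policy.initialDelayMs;
  for (let attempt = 2; attempt <= policy.maxAttempts; attempt++) {
    delays.push(Math.min(delay, policy.maxDelayMs)); // never wait longer than the cap
    delay *= policy.backoffMultiplier;
  }
  return delays;
}

const exampleDelays = backoffDelays({
  maxAttempts: 3,
  initialDelayMs: 250,
  maxDelayMs: 2000,
  backoffMultiplier: 2,
});
console.log(exampleDelays); // [250, 500]
```

Under these assumptions the configuration above would spend at most 750 ms waiting between attempts, on top of up to three 30-second request timeouts.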

Built-in provider errors also redact common bearer-token and API-key patterns before they surface to callers or logs.
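To give a sense of what this kind of redaction can look like, here is a minimal sketch. The two regular expressions and the redactSecrets name are hypothetical; the patterns the built-in providers actually match are not documented here and may differ.

```typescript
// Hypothetical patterns for illustration only.
const SECRET_PATTERNS: Array<[RegExp, string]> = [
  // "Bearer <token>" as it appears in Authorization headers
  [/\bBearer\s+[A-Za-z0-9._~+\/-]+=*/gi, 'Bearer [REDACTED]'],
  // api_key=..., api-key: ..., "apiKey": "..."
  [/\b(api[_-]?key)(["']?\s*[:=]\s*["']?)[^\s"',&]+/gi, '$1$2[REDACTED]'],
];

// Applies every pattern in turn and returns the scrubbed message.
function redactSecrets(message: string): string {
  return SECRET_PATTERNS.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    message,
  );
}

console.log(redactSecrets('401 from gateway: Authorization: Bearer abc.def-123'));
// 401 from gateway: Authorization: Bearer [REDACTED]
```

Wrapping error messages this way before logging keeps credentials out of log aggregation even when an upstream service echoes request details back in its error body.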


Developer Playbook

Use docparser-ocr for these developer workflows:

| Goal | Use this surface | Start here |
| --- | --- | --- |
| OCR a single image buffer | OCRPipeline.processImage(...) | Quick Start |
| OCR multiple rasterized pages with concurrency | OCRPipeline.processPages(...) | Page Concurrency |
| Use browser-style worker OCR with tesseract.js | TesseractProvider | Usage |
| Use a local system Tesseract binary | NativeTesseractProvider | Native Tesseract Provider |
| Preprocess scans before OCR | ImagePreprocessor | Preprocessing Presets |
| Retry harder OCR profiles for poor scans or handwriting | OCRPipeline retry profiles and quality levels | Retry Profiles |
| Capture geometry, script, and orientation metadata | OCR provider + pipeline results | Usage and Orientation And Script Detection |
| Plug OCR into DocParser for images and scanned PDFs | createOCRPlugin() or providers.ocr in docparser-core | Quick Start and the core package docs |
| Benchmark OCR quality against a corpus | runOCRBenchmark(...) or the benchmark CLI | Benchmarking |
| Verify Windows native Tesseract installs | runWindowsNativeTesseractSmoke(...) | Windows Smoke Check |

Practical path selection:

  • Choose TesseractProvider when you want pure npm-based local OCR with no system binary management.
  • Choose NativeTesseractProvider when your enterprise environment standardizes on a local tesseract install and you want predictable binary/runtime ownership.
  • Choose OCRPipeline when you want preprocessing, retry profiles, text detection heuristics, and page-level result objects.
  • Choose createOCRPlugin() in docparser-core when you want OCR to happen automatically after parsing standalone images, embedded document images, or rasterized scanned-PDF pages.

Usage

import { readFile } from 'node:fs/promises';
import { createLogger } from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new TesseractProvider(logger, { languages: ['eng'] });
const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
  minConfidence: 0.5,
  concurrency: 2,
});

// Load the image to OCR; the path matches the Quick Start example.
const imageBuffer = await readFile('./examples/receipt.png');
const result = await pipeline.processImage(imageBuffer, 1);

This package currently exports:

  • ImagePreprocessor
  • TesseractProvider
  • NativeTesseractProvider
  • OCRPipeline
  • OCR pipeline/result/config types from the package root

TesseractProvider now returns richer OCR geometry through blocks, lines, paragraphs, words, and symbols, plus optional orientation and script metadata in detection.

console.log(result.paragraphs?.[0]?.bbox, result.symbols?.[0]?.bbox);
console.log(result.detection?.orientationDegrees, result.detection?.appliedRotationDegrees);

Native Tesseract Provider

NativeTesseractProvider wraps a locally installed tesseract binary and keeps OCR fully local without adding extra npm runtime dependencies beyond this package.

Use it when you want:

  • local native Tesseract instead of tesseract.js
  • easier control over local tessdata installs and language packs
  • a path toward enterprise/offline OCR deployments that rely on managed system binaries

Example:

import { createLogger } from 'docparser-core';
import { NativeTesseractProvider, OCRPipeline } from 'docparser-ocr';

const logger = createLogger({ level: 'warn' });
const provider = new NativeTesseractProvider(logger, {
  executablePath: 'tesseract',
  tessdataDir: '/usr/share/tessdata',
  languages: ['eng'],
});

const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
});

The native provider reads TSV output for geometry and attempts OSD detection through the same local binary when enabled.

When createOCRPlugin() is configured with engine: 'native-tesseract', the runtime now probes for a local native binary first and automatically falls back to TesseractProvider when the native executable cannot be discovered.
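The probe-then-fall-back behaviour described above can be approximated in application code as well. The sketch below is not the plugin's implementation: canRunTesseract and chooseEngine are hypothetical names, and the probe simply checks whether running the binary with --version succeeds.

```typescript
import { spawnSync } from 'node:child_process';

type EngineChoice = 'native-tesseract' | 'tesseract.js';

// Returns true when `<executablePath> --version` can be spawned and exits with 0.
// A missing binary surfaces as `probe.error` (ENOENT) rather than an exception.
function canRunTesseract(executablePath: string): boolean {
  const probe = spawnSync(executablePath, ['--version'], { stdio: 'ignore' });
  return probe.error === undefined && probe.status === 0;
}

// Prefer the native binary, otherwise fall back to the tesseract.js path.
function chooseEngine(executablePath = 'tesseract'): EngineChoice {
  return canRunTesseract(executablePath) ? 'native-tesseract' : 'tesseract.js';
}

console.log(chooseEngine());
```

A check like this is useful in deployment scripts where you want to know ahead of time which provider a given host will end up using.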


Benchmarking

The package now includes a labeled corpus runner so OCR changes can be measured per document type instead of by confidence alone.

  • Example manifest: benchmark/corpus.example.json
  • Supported document types: printed-scan, form, prescription, screenshot, camera-photo, receipt, handwritten
  • Reported metrics: exact match rate, average confidence, character error rate (CER), word error rate (WER), reviewed-entry counts, average review-finding counts, average OCR attempt counts, selected-profile counts, and aggregated review field types
  • Exported benchmark JSON now includes both the raw aggregate metrics and a derived report section with sorted profile and review-field breakdowns for operator dashboards or quick inspection
  • Existing benchmark result JSON can be used as a comparison baseline so operators can inspect deltas between two OCR runs without manually diffing raw metrics
  • Saved benchmark result JSON can also be loaded directly with --current so operators can inspect or compare existing benchmark exports without running OCR again
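For readers unfamiliar with the character and word error rates listed above: both are conventionally defined as edit distance divided by the length of the reference text, measured in characters or words respectively. The sketch below shows that standard definition. It is not the benchmark runner's code, and any text normalization the runner applies before scoring (case folding, whitespace collapsing) is not reflected here.

```typescript
// Levenshtein distance over arbitrary token sequences (characters or words).
function editDistance<T>(a: readonly T[], b: readonly T[]): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Error rate = edits needed to turn the hypothesis into the reference, per reference token.
function errorRate<T>(reference: readonly T[], hypothesis: readonly T[]): number {
  if (reference.length === 0) return hypothesis.length === 0 ? 0 : 1;
  return editDistance(reference, hypothesis) / reference.length;
}

const toWords = (text: string): string[] => text.split(/\s+/).filter(Boolean);

const cer = (ref: string, hyp: string): number => errorRate(Array.from(ref), Array.from(hyp));
const wer = (ref: string, hyp: string): number => errorRate(toWords(ref), toWords(hyp));

// One wrong digit: 1 of 11 characters and 1 of 2 words differ.
console.log(cer('total 12.50', 'total 12.80')); // 1/11
console.log(wer('total 12.50', 'total 12.80')); // 0.5
```

The example shows why both metrics are worth reporting: a single misread digit barely moves CER but halves word-level accuracy on a short field.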

Run the benchmark after building the package:

npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --output ./benchmark/results.json --engine native-tesseract --quality thorough

For a concise operator-facing console summary instead of JSON, add --format text:

npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --engine native-tesseract --format text

To compare a fresh run against a previous benchmark export, pass --baseline with the earlier result JSON. The CLI will include a structured comparison block in JSON output, or a delta section in text mode:

npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --baseline ./benchmark/previous-results.json --format text

To compare two existing benchmark exports without rerunning OCR, pass both --current and --baseline:

npm --workspace packages/ocr run benchmark -- --current ./benchmark/current-results.json --baseline ./benchmark/previous-results.json

You can also pass native-binary settings through the CLI:

npm --workspace packages/ocr run benchmark -- --manifest ./benchmark/corpus.example.json --engine native-tesseract --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng

Preprocessing Presets

ImagePreprocessor and OCR preprocessing configs now accept named presets so common OCR tuning does not require a long option bag.

  • fast: minimal cleanup for speed-sensitive OCR
  • scanned-document: binarized, normalized cleanup for typical document scans
  • photo-receipt: stronger cleanup plus deskew for camera captures and receipts
  • low-contrast: contrast-heavy cleanup for faded scans

Preset values are applied first, and any explicit preprocessing options override the preset.
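The "preset first, explicit options override" rule amounts to a shallow merge. The sketch below illustrates that ordering only; the option names (binarize, normalize, deskew) and the values assigned to each preset are hypothetical and do not reflect what ImagePreprocessor actually sets.

```typescript
type PresetName = 'fast' | 'scanned-document' | 'photo-receipt' | 'low-contrast';

// Hypothetical option bag for illustration.
interface PreprocessingOptions {
  preset?: PresetName;
  binarize?: boolean;
  normalize?: boolean;
  deskew?: boolean;
}

type ResolvedOptions = Omit<PreprocessingOptions, 'preset'>;

// Hypothetical preset contents; the real presets may set different values.
const PRESETS: Record<PresetName, ResolvedOptions> = {
  'fast': { binarize: false, normalize: false, deskew: false },
  'scanned-document': { binarize: true, normalize: true, deskew: false },
  'photo-receipt': { binarize: true, normalize: true, deskew: true },
  'low-contrast': { binarize: false, normalize: true, deskew: false },
};

// Preset values form the base; explicitly passed options win on conflict.
function resolvePreprocessing(options: PreprocessingOptions): ResolvedOptions {
  const { preset, ...explicit } = options;
  const base: ResolvedOptions = preset ? PRESETS[preset] : {};
  return { ...base, ...explicit };
}

console.log(resolvePreprocessing({ preset: 'photo-receipt', deskew: false }));
// { binarize: true, normalize: true, deskew: false }
```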


Retry Profiles

OCRPipeline now supports retryProfiles for low-confidence OCR recovery. The pipeline runs the default preprocessing first, then retries low-confidence pages with profile-specific preprocessing overrides until one clears the confidence threshold.
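The retry flow described above can be pictured as a simple loop. The following sketch is a simplified model, not the pipeline's source: the recognizer is a synchronous stand-in for a real OCR call, and returning the highest-confidence attempt when no profile clears the threshold is an assumption.

```typescript
interface Attempt {
  profile: string;
  text: string;
  confidence: number;
}

// Stand-in for "preprocess with this profile, then run OCR".
type Recognize = (profile: string) => Attempt;

function recognizeWithRetries(
  recognize: Recognize,
  retryProfiles: readonly string[],
  minConfidence: number,
) {
  // The default preprocessing always runs first.
  const attempts: Attempt[] = [recognize('default')];
  for (const profile of retryProfiles) {
    // Stop as soon as the most recent attempt clears the threshold.
    if (attempts[attempts.length - 1].confidence >= minConfidence) break;
    attempts.push(recognize(profile));
  }
  // Assumption: if nothing clears the threshold, keep the best attempt seen.
  const best = attempts.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  return {
    text: best.text,
    confidence: best.confidence,
    attemptCount: attempts.length,
    attemptedProfiles: attempts.map((attempt) => attempt.profile),
    selectedProfile: best.profile,
  };
}

// Fake recognizer with fixed confidences per profile.
const fakeConfidence: Record<string, number> = {
  'default': 0.3,
  'scanned-document-retry': 0.45,
  'low-contrast-retry': 0.8,
  'handwritten-retry': 0.9,
};
const fakeRecognize: Recognize = (profile) => ({
  profile,
  text: `text from ${profile}`,
  confidence: fakeConfidence[profile] ?? 0,
});

const retried = recognizeWithRetries(
  fakeRecognize,
  ['scanned-document-retry', 'low-contrast-retry', 'handwritten-retry'],
  0.5,
);
console.log(retried.attemptCount, retried.selectedProfile); // 3 low-contrast-retry
```

Note that the loop stops at the first profile that clears the threshold, so handwritten-retry is never attempted here even though it would have scored higher.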

Normal OCR runs now use adaptive retry selection by default. Instead of sweeping every built-in retry profile for every page, the pipeline narrows built-in retries to the most relevant profiles for the document type or image hint. Full profile sweeps are reserved for explicit evaluateAllProfiles: true or forced-attempt evaluation paths.

Use qualityLevel when you want the pipeline to control how much OCR work it performs before giving up:

  • fast: one default pass, lowest compute
  • balanced: adaptive retries for common printed and scanned text
  • thorough: stronger preprocessing plus adaptive retries, with full profile evaluation reserved for forced attempts or explicit opt-in
  • extreme: highest-compute adaptive mode, with exhaustive profile sweeps reserved for forced attempts or explicit opt-in

If you pass retryProfiles, they replace the built-in retry set by default. Set useBuiltInRetryProfiles: true when you want to keep the built-in quality-level retries and append your own profiles after them.
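Put as code, the replace-versus-append rule looks roughly like this. The resolveRetryProfiles function is a hypothetical stand-in that models only the ordering described above, using profile names in place of full profile objects.

```typescript
// Hypothetical model of how the effective retry list is assembled.
function resolveRetryProfiles(
  builtIn: readonly string[],
  custom: readonly string[] | undefined,
  useBuiltInRetryProfiles = false,
): string[] {
  if (custom === undefined) return [...builtIn]; // no custom profiles: built-ins only
  return useBuiltInRetryProfiles
    ? [...builtIn, ...custom] // keep built-ins, append custom profiles after them
    : [...custom]; // default: custom profiles replace the built-in set
}

const builtInNames = ['scanned-document-retry', 'low-contrast-retry', 'handwritten-retry'];

console.log(resolveRetryProfiles(builtInNames, ['receipt-recovery']));
// ['receipt-recovery']
console.log(resolveRetryProfiles(builtInNames, ['receipt-recovery'], true));
// ['scanned-document-retry', 'low-contrast-retry', 'handwritten-retry', 'receipt-recovery']
```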

When you do not provide custom retry profiles, the pipeline falls back to built-in recovery passes aimed at common OCR failure modes:

  • scanned-document-retry: binarized document cleanup plus single_block segmentation
  • low-contrast-retry: contrast-heavy cleanup plus sparse_text segmentation
  • handwritten-retry: stronger photo-style cleanup plus the both engine mode (engineMode: 'both') and sparse_text segmentation

In extreme mode the pipeline also adds extra line-, word-, legacy-, and high-DPI sparse-text retries for especially difficult handwriting and degraded printed scans.

You can also pass documentProfileHint into processImage() to bias adaptive selection toward a narrower strategy family such as handwritten, form, screenshot, camera-photo, or printed-scan.

For receipt and form hints, the pipeline now applies conservative field-aware postprocessing to normalize label/value formatting without changing provider confidence or geometry.

When OSD metadata reports a rotated page on the first pass, the pipeline now rotates the original image before retry profiles run, so the recovery profiles operate on orientation-corrected input instead of repeatedly retrying the same rotated buffer.

Each retry profile accepts:

  • name: optional label for logs and diagnostics
  • preprocessing: preprocessing overrides merged on top of the base pipeline preprocessing config
  • recognition: optional per-profile OCR engine overrides such as pageSegMode, engineMode, preserveInterwordSpaces, and userDefinedDpi
  • minConfidence: optional per-profile threshold override

This is useful for harder scans where a binarized or deskewed pass can recover text that the default preprocessing misses.

Each OCRPageResult now includes retry diagnostics through attemptCount, attemptedProfiles, and selectedProfile, plus geometry and OSD diagnostics through paragraphs, symbols, and detection.appliedRotationDegrees.

Example retry profile using a preset:

// Assumes `logger` and `provider` were created as in the Usage section.
const pipeline = new OCRPipeline(logger, {
  provider,
  qualityLevel: 'thorough',
  useBuiltInRetryProfiles: true,
  retryProfiles: [
    {
      name: 'receipt-recovery',
      preprocessing: { preset: 'photo-receipt' },
      recognition: { pageSegMode: 'sparse_text', engineMode: 'both' },
    },
  ],
});

Page Concurrency

OCRPipeline.processPages() now supports a concurrency option so multi-page OCR can use multiple workers in parallel while preserving input page order in the returned results.

  • concurrency: max number of pages processed at once; values below 1 are clamped to 1
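Bounded concurrency with stable output order is a common pattern, and a generic sketch of it is shown below. This is not the processPages() implementation; mapWithConcurrency is a hypothetical helper that demonstrates the two documented properties: results come back in input order, and values below 1 are clamped to 1.

```typescript
// Runs `worker` over `items` with at most `concurrency` calls in flight,
// writing each result into the slot that matches its input index.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const limit = concurrency >= 1 ? Math.floor(concurrency) : 1; // clamp to at least 1
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until the input is exhausted.
  async function runWorker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runWorker());
  await Promise.all(runners);
  return results;
}

// Simulated pages that take different amounts of time to "OCR".
const pageDelaysMs = [30, 5, 20, 1];
const simulated = mapWithConcurrency(pageDelaysMs, 2, async (ms, index) => {
  await new Promise<void>((resolve) => setTimeout(resolve, ms));
  return index + 1; // page number
});
simulated.then((pages) => console.log(pages)); // [1, 2, 3, 4]
```

Because each result is stored by index rather than pushed on completion, a fast page finishing before a slow one cannot reorder the output.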

Orientation And Script Detection

TesseractProvider now attempts orientation and script detection by default and adds the result to OCRResult.detection when available.

  • detectOrientationScript: enable or disable OSD detection in TesseractConfig (default: true)

If OSD is unavailable or fails, text recognition still completes and returns geometry output without detection metadata.

The provider also enables the legacy Tesseract worker for OSD so detect() can return reliable script and orientation metadata. When the OCR pipeline uses that metadata to auto-rotate an image before retries, the final page result carries the applied correction in detection.appliedRotationDegrees.


Windows Smoke Check

The package now includes a Windows native deployment smoke check for local Tesseract installs. It verifies binary discovery and confirms that required language packs such as eng and osd are available.

npm --workspace packages/ocr run smoke:native:windows -- --native-path "C:\Program Files\Tesseract-OCR\tesseract.exe" --tessdata "C:\Program Files\Tesseract-OCR\tessdata" --languages eng,osd

The command prints a JSON diagnostic payload and exits non-zero when required languages or the binary itself are missing.
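If you want a similar language-pack check outside the bundled command, the Tesseract CLI can list installed languages with tesseract --list-langs. The sketch below parses that output and reports which required languages are missing. It is an independent illustration, not the smoke check's implementation; missingLanguages is a hypothetical helper and the sample output is abbreviated.

```typescript
// Parses `tesseract --list-langs` output (a header line followed by one
// language code per line) and returns the required languages not installed.
function missingLanguages(listLangsOutput: string, required: readonly string[]): string[] {
  const installed = new Set(
    listLangsOutput
      .split(/\r?\n/)
      .map((line) => line.trim())
      .filter((line) => line !== '' && !line.startsWith('List of available languages')),
  );
  return required.filter((lang) => !installed.has(lang));
}

const sampleOutput = 'List of available languages in "/usr/share/tessdata/" (2):\neng\nfra\n';
const missing = missingLanguages(sampleOutput, ['eng', 'osd']);
console.log(missing); // ['osd']

// A deployment script could mirror the smoke check's exit behaviour:
// if (missing.length > 0) process.exit(1);
```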


Publishing

This package is configured to publish as a public unscoped npm package.

Consumer install command:

npm install docparser-core docparser-ocr

Release cookbook:

  1. Publish docparser-core first if both packages are moving together.
  2. Reconfirm the peer dependency range in packages/ocr/package.json before publish.
  3. Run OCR verification after core is built and available locally.
  4. Inspect the tarball contents with npm pack --dry-run.
  5. Publish and then verify the registry version.

OCR release commands:

npm --workspace packages/core run verify:release
npm --workspace packages/ocr run verify:release
npm pack --workspace packages/core --dry-run
npm pack --workspace packages/ocr --dry-run
npm publish --workspace packages/core
npm --workspace packages/ocr run typecheck
npm publish --workspace packages/ocr
npm view docparser-ocr version

The package metadata publishes:

  • ESM entrypoint: dist/index.js
  • Type declarations: dist/index.d.ts
  • package name: docparser-ocr

Release ordering:

  • docparser-ocr should be published after docparser-core when both packages move together, because the OCR package peers on the matching core release train.
  • If OCR release verification fails after core has already published, stop and fix OCR locally before retrying instead of widening the peer range or publishing a mismatched OCR package.

License

Apache-2.0