smart-ocr

v1.3.1

Published 2 months ago

OCR library for scanned and text-based PDFs and for image files, built on tesseract.js, with optional AI-powered structured output.


thadeveloper


Smart OCR

smart-ocr is a Node.js OCR library for:

text-based PDFs
scanned PDFs
mixed PDFs with both text-native and scanned pages
PNG and other common raster image formats
optional AI-assisted structured output from extracted OCR text

For PDFs, each page is handled independently. If a page already contains selectable text, Smart OCR extracts it directly. If a page is image-only, it renders the page and falls back to OCR.
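The per-page decision described above can be pictured roughly as follows. This is an illustrative sketch, not the library's internal code: `planPages`, the `embeddedText` field, and the "non-empty after trimming" test are all assumptions made for the example.

```javascript
// Hypothetical helper: decide, page by page, whether embedded text can be
// used directly or whether the page must be rendered and sent to OCR.
// `pages` uses an assumed shape: [{ pageNumber, embeddedText }].
function planPages(pages) {
  return pages.map(({ pageNumber, embeddedText }) => {
    const text = (embeddedText ?? "").trim();
    return text.length > 0
      ? { pageNumber, strategy: "embedded-text", text }
      : { pageNumber, strategy: "render-and-ocr" };
  });
}

// A mixed PDF: page 1 has selectable text, pages 2 and 3 are image-only.
const plan = planPages([
  { pageNumber: 1, embeddedText: "Invoice #1001" },
  { pageNumber: 2, embeddedText: "   " },
  { pageNumber: 3, embeddedText: null },
]);

console.log(plan.map((p) => `${p.pageNumber}: ${p.strategy}`).join("\n"));
```

The point of the sketch is that the choice is made per page, so a single mixed PDF can use both paths.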

Requirements

Node.js >=20.6.0

This package is designed for Node.js. It is not set up for browser use.

Installation

npm install smart-ocr

Quick Start

import { SmartOCR } from "smart-ocr";
import path from "node:path";
import { fileURLToPath } from "node:url";

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

const ocr = new SmartOCR({ language: "eng", workerCount: 2 });

try {
  const pdfText = await ocr.processPDF(path.join(__dirname, "sample-scanned.pdf"));
  console.log(pdfText);
} finally {
  await ocr.terminate();
}
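If you need to process several files with one instance, the same try/finally pattern applies. The helper below is a minimal sketch: `processAll` is a hypothetical name, and it only relies on the `processFile()` and `terminate()` methods documented in this README.

```javascript
// Sketch: run several files through one OCR instance and always release
// its workers afterwards, even if one of the files fails.
// `ocr` is any object exposing processFile() and terminate(), such as a
// SmartOCR instance. `processAll` itself is a hypothetical helper.
async function processAll(ocr, filePaths) {
  const results = {};
  try {
    for (const filePath of filePaths) {
      results[filePath] = await ocr.processFile(filePath);
    }
    return results;
  } finally {
    await ocr.terminate();
  }
}
```

For example, `await processAll(new SmartOCR({ language: "eng" }), ["./a.pdf", "./b.png"])` would resolve to an object keyed by file path. Files are processed one after another here to keep the sketch simple.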

Structured Output

Smart OCR can optionally turn extracted text into structured JSON.

OCR still runs first.
The extracted text is then sent to an AI model to produce structured output.

When structuredOutputOptions.ai is configured, processFile(), processPDF(), and processImage() return a JSON object instead of a plain text string.
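Because the return type depends on configuration, calling code that may run in either mode can branch on it. A minimal sketch, assuming only what is stated above (a string in plain mode, an object in AI mode); `summarizeResult` is a hypothetical helper:

```javascript
// Hypothetical helper: report what kind of result a processing call returned.
function summarizeResult(result) {
  if (typeof result === "string") {
    // Plain OCR mode: the extracted text itself.
    return { kind: "text", characters: result.length };
  }
  // AI mode: a structured object shaped by your schema.
  return { kind: "structured", fields: Object.keys(result ?? {}) };
}

console.log(summarizeResult("Hello"));
console.log(summarizeResult({ fullName: "Jane Doe", idNumber: null }));
```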

Supported providers:

openai - uses structured outputs (response_format: json_schema)
anthropic - uses tool use to enforce schema-shaped output
gemini - uses responseMimeType: "application/json" with responseSchema

Example (OpenAI):

import { SmartOCR } from "smart-ocr";

const ocr = new SmartOCR({
  language: "eng",
  structuredOutputOptions: {
    ai: {
      provider: "openai",
      model: "gpt-4.1-mini",
      apiKey: process.env.OPENAI_API_KEY,
      prompt: "Extract the document fields. Use null when a value is missing or unclear.",
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
        dateOfBirth: { type: ["string", "null"] },
        sex: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber", "dateOfBirth", "sex"],
      additionalProperties: false,
    },
  },
});

try {
  const result = await ocr.processFile("./id.pdf");
  console.log(result);
} finally {
  await ocr.terminate();
}

Example (Anthropic):

import { SmartOCR } from "smart-ocr";

const ocr = new SmartOCR({
  structuredOutputOptions: {
    ai: {
      provider: "anthropic",
      model: "claude-opus-4-5",
      apiKey: process.env.ANTHROPIC_API_KEY,
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber"],
    },
  },
});

Example (Gemini):

import { SmartOCR } from "smart-ocr";

const ocr = new SmartOCR({
  structuredOutputOptions: {
    ai: {
      provider: "gemini",
      model: "gemini-2.5-flash-lite",
      apiKey: process.env.GOOGLE_API_KEY,
    },
    schema: {
      type: "object",
      properties: {
        fullName: { type: ["string", "null"] },
        idNumber: { type: ["string", "null"] },
      },
      required: ["fullName", "idNumber"],
    },
  },
});

Notes for AI mode:

apiKey is required for all providers
prompt overrides the default extraction instruction
schema should be a JSON schema describing the object you want back
for OpenAI strict mode, required must list every key in properties
Gemini schemas are automatically normalized: array type values (e.g. ["string", "null"]) are converted to nullable: true, and unsupported fields like additionalProperties are stripped
when AI mode is enabled, the raw OCR text is not returned by these methods
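The OpenAI strict-mode rule above is easy to violate when a schema grows. The check below is a small sketch you could run before constructing the instance; `findMissingRequired` is a hypothetical helper, not part of smart-ocr.

```javascript
// Hypothetical pre-flight check: list property names that are missing from
// `required`. For OpenAI strict mode this list should be empty.
function findMissingRequired(schema) {
  const properties = Object.keys(schema.properties ?? {});
  const required = new Set(schema.required ?? []);
  return properties.filter((key) => !required.has(key));
}

const draftSchema = {
  type: "object",
  properties: {
    fullName: { type: ["string", "null"] },
    idNumber: { type: ["string", "null"] },
  },
  required: ["fullName"], // idNumber was forgotten
  additionalProperties: false,
};

console.log(findMissingRequired(draftSchema)); // [ 'idNumber' ]
```

Optional fields are still expressed through a nullable type such as `["string", "null"]`, as in the examples above, rather than by leaving them out of `required`.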

Reference

`new SmartOCR(options?)`

Creates an OCR processor.

Options:

language: Tesseract language or language list. Default: "eng"
pdfRenderScale: render scale used before OCR on scanned PDF pages. Default: 2
workerOptions: options passed to the Tesseract worker, such as langPath, cachePath, or logger
workerCount: number of OCR workers to run in parallel
structuredOutputOptions: optional AI configuration for returning structured JSON instead of plain text

Language codes use Tesseract traineddata identifiers, not 2-letter locale codes. For example:

"eng" for English
"spa" for Spanish
"fra" for French
["eng", "spa"] for multilingual OCR

Use "eng", not "en".
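If language values come from user input or locale settings, a small guard can catch 2-letter codes before they reach the constructor. This is a sketch under stated assumptions: `toTesseractLanguages` and its lookup table are hypothetical, and the table covers only the three languages listed above.

```javascript
// Hypothetical lookup for a few common 2-letter mistakes. It is deliberately
// small; extend it for the languages you actually use.
const TWO_LETTER_HINTS = new Map([
  ["en", "eng"],
  ["es", "spa"],
  ["fr", "fra"],
]);

// Accepts a single code or a list and returns a list of Tesseract
// traineddata identifiers. Anything that is not 2 letters long is passed
// through unchanged.
function toTesseractLanguages(input) {
  const codes = Array.isArray(input) ? input : [input];
  return codes.map((code) => {
    const trimmed = String(code).trim();
    if (trimmed.length !== 2) return trimmed;
    const mapped = TWO_LETTER_HINTS.get(trimmed.toLowerCase());
    if (!mapped) {
      throw new Error(`Unknown 2-letter code "${code}"; use a Tesseract traineddata identifier`);
    }
    return mapped;
  });
}

console.log(toTesseractLanguages(["en", "spa"])); // [ 'eng', 'spa' ]
```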

structuredOutputOptions shape:

ai.provider: AI provider name. One of "openai", "anthropic", or "gemini"
ai.model: model name to call for structured extraction
ai.apiKey: API key for the chosen provider
ai.prompt: optional custom extraction prompt
schema: JSON schema describing the expected response object. Gemini schemas are automatically normalized from JSON Schema to Gemini's OpenAPI 3.0 subset.
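To make the Gemini normalization concrete, here is a rough approximation of the two transformations this README describes (array `type` values with `"null"` become `nullable: true`, and `additionalProperties` is dropped). It is a sketch, not the library's implementation: `normalizeForGemini` is a hypothetical name, it only recurses into `properties` and `items`, and it keeps just the first non-null type.

```javascript
// Illustrative only: approximates the normalization described above.
function normalizeForGemini(schema) {
  if (schema === null || typeof schema !== "object" || Array.isArray(schema)) {
    return schema;
  }

  // `additionalProperties` is pulled out and discarded.
  const { additionalProperties, type, properties, items, ...rest } = schema;
  const out = { ...rest };

  if (Array.isArray(type)) {
    const nonNull = type.filter((t) => t !== "null");
    out.type = nonNull[0]; // sketch: keeps only the first non-null type
    if (nonNull.length < type.length) out.nullable = true;
  } else if (type !== undefined) {
    out.type = type;
  }

  if (properties) {
    out.properties = Object.fromEntries(
      Object.entries(properties).map(([name, sub]) => [name, normalizeForGemini(sub)]),
    );
  }
  if (items) out.items = normalizeForGemini(items);

  return out;
}

const normalized = normalizeForGemini({
  type: "object",
  properties: {
    fullName: { type: ["string", "null"] },
    aliases: { type: "array", items: { type: ["string", "null"] } },
  },
  required: ["fullName", "aliases"],
  additionalProperties: false,
});

console.log(JSON.stringify(normalized, null, 2));
```

Since the library performs this step itself, you can write one JSON Schema and reuse it across providers.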

`processFile(filePath)`

Routes a supported file to the correct handler based on file extension.

Returns:

extracted text by default
structured JSON when structuredOutputOptions.ai is configured

Supported extensions:

.pdf
.png
.jpg
.jpeg
.tif
.tiff
.bmp
.webp
.gif
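Extension-based routing of this kind can be sketched as below. This is an illustration rather than the library's code: `pickHandler` is a hypothetical helper, and treating extensions case-insensitively is an assumption.

```javascript
// The image extensions listed above; .pdf is handled separately.
const IMAGE_EXTENSIONS = new Set([
  ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp", ".gif",
]);

// Hypothetical helper: name the method that would handle a given path.
function pickHandler(filePath) {
  const fileName = filePath.split(/[\\/]/).pop();
  const dot = fileName.lastIndexOf(".");
  // A leading dot (as in ".gitignore") is not treated as an extension.
  const extension = dot > 0 ? fileName.slice(dot).toLowerCase() : "";

  if (extension === ".pdf") return "processPDF";
  if (IMAGE_EXTENSIONS.has(extension)) return "processImage";
  throw new Error(`Unsupported file extension: "${extension}"`);
}

console.log(pickHandler("./id.pdf"));
console.log(pickHandler("scans/card.JPEG"));
```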

`processPDF(pdfPath)`

Extracts text from a PDF. Text-native pages are read directly. Scanned pages are rendered to images and OCRed.

The OCR language only affects scanned/image-only pages. If a PDF page already contains selectable text, Smart OCR returns that embedded text directly instead of re-OCRing it.

Returns:

extracted text by default
structured JSON when structuredOutputOptions.ai is configured

`processImage(imagePath)`

Runs OCR on an image file.

Returns:

extracted text by default
structured JSON when structuredOutputOptions.ai is configured

`init(language?)`

Eagerly initializes the Tesseract worker. This is optional because processing methods initialize on demand.

If you pass a language to init(language), Smart OCR keeps using that language for later OCR calls until you switch it again or create a new instance.

`terminate()`

Terminates the Tesseract worker and frees resources.

Notes

Smart OCR is optimized for Node.js workloads, not browser runtimes.
Rendering uses @napi-rs/canvas, which avoids the extra Cairo system setup required by canvas.
Scanned PDFs are preprocessed before OCR so sparse content, such as ID cards on large blank pages, is easier to detect.
Structured output is an optional post-processing step on top of OCR, not a replacement for OCR itself.
AI mode supports OpenAI, Anthropic, and Gemini.
OCR quality still depends on the source document quality, scan resolution, and language data.

Development

npm run typecheck
npm run lint
npm test
npm run build
npm run sample

npm run sample builds the library and runs it against the bundled sample files in src/.
