ocr-ai

v1.0.4

Published

3 days ago

Multi-provider AI document extraction library - Extract text or structured JSON from documents using Gemini, OpenAI, Grok, or Claude

Downloads

604

0High
0Medium
0Low

remdph

ai extraction document pdf ocr gemini openai claude grok text-extraction json-extraction

ocr-ai

Multi-provider AI document extraction for Node.js. Extract text or structured JSON from documents using Gemini, OpenAI, Claude, Grok, or Vertex AI.

Installation

npm install ocr-ai

Quick Start

Using Gemini

import { OcrAI } from 'ocr-ai';

const ocr = new OcrAI({
  provider: 'gemini',
  apiKey: 'YOUR_GEMINI_API_KEY',
});

const result = await ocr.extract('./invoice.png');

if (result.success) {
  const text = result.content;
  console.log(text);
}

Using OpenAI

import { OcrAI } from 'ocr-ai';

const ocr = new OcrAI({
  provider: 'openai',
  apiKey: 'YOUR_OPENAI_API_KEY',
});

const result = await ocr.extract('./document.pdf');

if (result.success) {
  const text = result.content;
  console.log(text);
}

Custom Model

You can specify a custom model for any provider:

const ocr = new OcrAI({
  provider: 'gemini',
  apiKey: 'YOUR_GEMINI_API_KEY',
  model: 'gemini-2.0-flash', // Use a specific model
});

// Or with OpenAI
const ocrOpenAI = new OcrAI({
  provider: 'openai',
  apiKey: 'YOUR_OPENAI_API_KEY',
  model: 'gpt-4o-mini', // Use a different model
});

From URL

Extract directly from a URL:

const result = await ocr.extract('https://example.com/invoice.png');

if (result.success) {
  console.log(result.content);
}

Custom Instructions

You can provide custom instructions to guide the extraction:

const result = await ocr.extract('./receipt.png', {
  prompt: 'Extract only the total amount and date from this receipt',
});

if (result.success) {
  console.log(result.content);
  // Output: "Total: $154.06, Date: 11/02/2019"
}

Output Format

By default, extraction returns text. You can also extract structured JSON:

// Text output (default)
const textResult = await ocr.extract('./invoice.png', {
  format: 'text',
});

if (textResult.success) {
  console.log(textResult.content); // string
}

// JSON output with schema
const jsonResult = await ocr.extract('./invoice.png', {
  format: 'json',
  schema: {
    invoice_number: 'string',
    date: 'string',
    total: 'number',
    items: [{ name: 'string', quantity: 'number', price: 'number' }],
  },
});

if (jsonResult.success) {
  console.log(jsonResult.data); // { invoice_number: "US-001", date: "11/02/2019", total: 154.06, items: [...] }
}

JSON Schema

The schema defines the structure of the data you want to extract. Use a simple object where keys are field names and values are types:

Basic types:

'string' - Text values
'number' - Numeric values
'boolean' - True/false values

Nested objects:

const schema = {
  company: {
    name: 'string',
    address: 'string',
    phone: 'string',
  },
  customer: {
    name: 'string',
    email: 'string',
  },
};

Arrays:

const schema = {
  // Array of objects
  items: [
    {
      description: 'string',
      quantity: 'number',
      unit_price: 'number',
      total: 'number',
    },
  ],
  // Simple array
  tags: ['string'],
};

Complete example (invoice):

const invoiceSchema = {
  invoice_number: 'string',
  date: 'string',
  due_date: 'string',
  company: {
    name: 'string',
    address: 'string',
    phone: 'string',
    email: 'string',
  },
  bill_to: {
    name: 'string',
    address: 'string',
  },
  items: [
    {
      description: 'string',
      quantity: 'number',
      unit_price: 'number',
      total: 'number',
    },
  ],
  subtotal: 'number',
  tax: 'number',
  total: 'number',
};

const result = await ocr.extract('./invoice.png', {
  format: 'json',
  schema: invoiceSchema,
  prompt: 'Extract all invoice data from this document.',
});

Model Configuration

You can pass model-specific parameters like temperature, max tokens, and more:

// Gemini with model config
const result = await ocr.extract('./invoice.png', {
  modelConfig: {
    temperature: 0.2,
    maxTokens: 4096,
    topP: 0.8,
    topK: 40,
  },
});

// OpenAI with model config
const result = await ocr.extract('./invoice.png', {
  modelConfig: {
    temperature: 0,
    maxTokens: 2048,
    topP: 1,
  },
});

Available options:

| Option | Description | Supported Providers | |--------|-------------|---------------------| | temperature | Controls randomness (0.0-1.0+) | All | | maxTokens | Maximum tokens to generate | All | | topP | Nucleus sampling | All | | topK | Top-k sampling | Gemini, Claude, Vertex | | stopSequences | Stop generation at these strings | All |

Token Usage

Access token usage information from the metadata:

const result = await ocr.extract('./invoice.png');

if (result.success) {
  console.log(result.content);

  // Access metadata
  console.log(result.metadata.processingTimeMs); // 2351
  console.log(result.metadata.tokens?.inputTokens); // 1855
  console.log(result.metadata.tokens?.outputTokens); // 260
  console.log(result.metadata.tokens?.totalTokens); // 2115
}

Supported Providers

| Provider | Default Model | Auth | |----------|---------------|------| | gemini | gemini-1.5-flash | API Key | | openai | gpt-4o | API Key | | claude | claude-sonnet-4-20250514 | API Key | | grok | grok-2-vision-1212 | API Key | | vertex | gemini-2.0-flash | Google Cloud |

Note: For enterprise OCR needs, see Advanced: Vertex AI section below.

Supported Inputs

Local files: ./invoice.png, ./document.pdf
URLs: https://example.com/invoice.png

Supported Files

Images: jpg, png, gif, webp
Documents: pdf
Text: txt, md, csv, json, xml, html

Advanced: Vertex AI (Google Cloud)

The vertex provider enables access to Google Cloud's AI infrastructure, which is useful for enterprise scenarios requiring:

Compliance: Data residency and regulatory requirements
Integration: Native integration with Google Cloud services (BigQuery, Cloud Storage, etc.)
Specialized OCR: Access to Google's Document AI and Vision AI processors

Basic Setup

Vertex AI uses Google Cloud authentication instead of API keys:

import { OcrAI } from 'ocr-ai';

const ocr = new OcrAI({
  provider: 'vertex',
  vertexConfig: {
    project: 'your-gcp-project-id',
    location: 'us-central1',
  },
});

const result = await ocr.extract('./invoice.png');

Requirements:

Install the gcloud CLI
Run gcloud auth application-default login
Enable the Vertex AI API in your GCP project

When to Use Vertex AI vs Gemini API

| Scenario | Recommended | |----------|-------------| | Quick prototyping | Gemini (API Key) | | Personal projects | Gemini (API Key) | | Enterprise/production | Vertex AI | | Data residency requirements | Vertex AI | | High-volume processing | Vertex AI |

Related Google Cloud OCR Services

For specialized document processing beyond what Gemini models offer, Google Cloud provides dedicated OCR services:

Document AI - Optimized for structured documents:

Invoice Parser, Receipt Parser, Form Parser
W2, 1040, Bank Statement processors
Custom extractors for domain-specific documents
Higher accuracy for tables, forms, and handwritten text

Vision API - Optimized for images:

Real-time OCR with low latency
80+ language support
Handwriting detection
Simple integration, ~98% accuracy on clean documents

These services are separate from ocr-ai but can complement it for enterprise document pipelines.

Gemini Model Benchmarks

Performance benchmarks for Gemini models extracting data from an invoice image:

| Model | Text Extraction | JSON Extraction | Best For | |-------|-----------------|-----------------|----------| | gemini-2.0-flash-lite | 2.8s | 2.1s | High-volume processing, cost optimization | | gemini-2.5-flash-lite | 2.2s | 1.9s | Fastest option, simple documents | | gemini-2.0-flash | 3.9s | 2.9s | General purpose, good balance | | gemini-2.5-flash | 5.0s | 5.0s | Standard documents, reliable | | gemini-3-flash-preview | 12.3s | 10.6s | Complex layouts, newer capabilities | | gemini-3-pro-image-preview | 8.0s | 11.9s | Image-heavy documents | | gemini-2.5-pro | 12.6s | 5.5s | High accuracy, complex documents | | gemini-3-pro-preview | 24.8s | 13.1s | Maximum accuracy, handwritten text |

Model Recommendations

For digital documents (invoices, receipts, forms):

// Fast and cost-effective
const ocr = new OcrAI({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
  model: 'gemini-2.5-flash-lite', // ~2s response time
});

For complex documents or when accuracy is critical:

// Higher accuracy, slower processing
const ocr = new OcrAI({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
  model: 'gemini-2.5-pro', // Best accuracy/speed ratio
});

For handwritten documents or poor quality scans:

// Maximum accuracy for difficult documents
const ocr = new OcrAI({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
  model: 'gemini-3-pro-preview', // Best for handwriting
});

Quick Reference

| Use Case | Recommended Model | |----------|-------------------| | High-volume batch processing | gemini-2.5-flash-lite | | Standard invoices/receipts | gemini-2.0-flash | | Complex tables and layouts | gemini-2.5-pro | | Handwritten documents | gemini-3-pro-preview | | Poor quality scans | gemini-3-pro-preview | | Real-time applications | gemini-2.5-flash-lite |

OpenAI Model Benchmarks

Performance benchmarks for OpenAI models extracting data from an invoice image:

| Model | Text Extraction | JSON Extraction | Best For | |-------|-----------------|-----------------|----------| | gpt-4.1-nano | 4.4s | 2.4s | Fastest, cost-effective | | gpt-4.1-mini | 4.8s | 3.2s | Good balance speed/accuracy | | gpt-4.1 | 8.2s | 5.4s | High accuracy, reliable | | gpt-4o-mini | 7.2s | 5.7s | Budget-friendly | | gpt-4o | 12.3s | 10.7s | Standard high accuracy | | gpt-5.2 | 6.4s | 5.0s | Latest generation | | gpt-5-mini | 12.2s | 7.9s | GPT-5 balanced option | | gpt-5-nano | 19.9s | 16.1s | GPT-5 economy tier |

Note: gpt-5.2-pro and gpt-image-1 use different API endpoints and are not currently supported.

Model Recommendations

For digital documents (invoices, receipts, forms):

// Fast and cost-effective
const ocr = new OcrAI({
  provider: 'openai',
  apiKey: 'YOUR_API_KEY',
  model: 'gpt-4.1-nano', // ~2-4s response time
});

For complex documents or when accuracy is critical:

// Higher accuracy, reliable extraction
const ocr = new OcrAI({
  provider: 'openai',
  apiKey: 'YOUR_API_KEY',
  model: 'gpt-4.1', // Best accuracy/speed ratio
});

For handwritten documents or poor quality scans:

// Maximum accuracy for difficult documents
const ocr = new OcrAI({
  provider: 'openai',
  apiKey: 'YOUR_API_KEY',
  model: 'gpt-5.2', // Latest generation, best accuracy
});

Quick Reference

| Use Case | Recommended Model | |----------|-------------------| | High-volume batch processing | gpt-4.1-nano | | Standard invoices/receipts | gpt-4.1-mini | | Complex tables and layouts | gpt-4.1 | | Handwritten documents | gpt-5.2 | | Poor quality scans | gpt-5.2 | | Real-time applications | gpt-4.1-nano | | Budget-conscious projects | gpt-4o-mini |

Promise API

For users who prefer callbacks or need more control over async operations, ocr-ai provides an alternative OcrAIPromise class with additional features.

Basic Usage with Callbacks

import { OcrAIPromise } from 'ocr-ai';

const ocr = new OcrAIPromise({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
});

// Using callback style
ocr.extract('./invoice.png', {}, (error, result) => {
  if (error) {
    console.error('Extraction failed:', error.message);
    return;
  }
  console.log('Extracted:', result.content);
});

Using .then()/.catch()

import { OcrAIPromise } from 'ocr-ai';

const ocr = new OcrAIPromise({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
});

// Promise chain style
ocr.extract('./invoice.png')
  .then((result) => {
    if (result.success) {
      console.log('Content:', result.content);
    }
  })
  .catch((error) => {
    console.error('Error:', error);
  });

Extract Multiple Files in Parallel

const ocr = new OcrAIPromise({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
});

// Extract many files at once
const results = await ocr.extractMany([
  './invoice1.png',
  './invoice2.png',
  './invoice3.png',
]);

results.forEach((result, index) => {
  if (result.success) {
    console.log(`File ${index + 1}:`, result.content);
  }
});

Batch Extraction with Individual Options

const ocr = new OcrAIPromise({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
});

// Each file with its own options
const results = await ocr.extractBatch([
  { source: './invoice.png', options: { format: 'json', schema: invoiceSchema } },
  { source: './receipt.png', options: { format: 'text' } },
  { source: './contract.pdf', options: { prompt: 'Extract key dates and amounts' } },
]);

Automatic Retry on Failure

const ocr = new OcrAIPromise({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
});

// Retry up to 3 times with 1 second delay between attempts
const result = await ocr.extractWithRetry(
  './invoice.png',
  { format: 'json', schema: invoiceSchema },
  3,    // retries
  1000  // delay in ms
);

if (result.success) {
  console.log('Extracted after retries:', result.data);
}

Access Underlying OcrAI Instance

const ocrPromise = new OcrAIPromise({
  provider: 'gemini',
  apiKey: 'YOUR_API_KEY',
});

// Get the underlying OcrAI instance for direct access
const ocr = ocrPromise.getOcrAI();

// Use standard async/await if needed
const result = await ocr.extract('./invoice.png');

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

ocr-ai

Installation

Quick Start

Using Gemini

Using OpenAI

Custom Model

From URL

Custom Instructions

Output Format

JSON Schema

Model Configuration

Token Usage

Supported Providers

Supported Inputs

Supported Files

Advanced: Vertex AI (Google Cloud)

Basic Setup

When to Use Vertex AI vs Gemini API

Related Google Cloud OCR Services

Gemini Model Benchmarks

Model Recommendations

Quick Reference

OpenAI Model Benchmarks

Model Recommendations

Quick Reference

Promise API

Basic Usage with Callbacks

Using .then()/.catch()

Extract Multiple Files in Parallel

Batch Extraction with Individual Options

Automatic Retry on Failure

Access Underlying OcrAI Instance

License