ai-vision-parser

v0.1.12

Published

8 months ago

A powerful TypeScript library for extracting and analyzing content from PDF, Image, and Video files using Vision Language Models


vophihungvn

vision llm pdf image video parser ai vision-language-model

AI Vision Parser

TypeScript library for extracting content from PDFs, images, and videos using Vision Language Models.

Features

🤖 Agent Parser - Multi-step workflows with strategies (parallel, iterative, hierarchical)
📋 Structured Parsing - Type-safe extraction with Zod schema validation
🔍 OCR Fallback - Optional OCR providers (Tesseract, Google Vision, Azure)
🎯 Vision Models - OpenAI, Claude, Gemini, Azure via aisuite
📄 Document Types - PDFs, images (JPG, PNG, TIFF, WebP, BMP), videos
💾 Smart Caching - Multi-layer caching (local, S3, Redis-ready)
⚡ Async Processing - Parallel processing with configurable concurrency

Installation

npm install ai-vision-parser zod
# or
pnpm add ai-vision-parser zod

System dependencies (for canvas - required for PDF processing):

# macOS
brew install pkg-config cairo pango libpng jpeg giflib librsvg

# Ubuntu/Debian
sudo apt-get install build-essential libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev

Having canvas installation issues? See the Troubleshooting Guide for detailed solutions.

Quick Start

Basic Document Parsing

import { OpenAIProvider, VisionParser } from 'ai-vision-parser';

// With API key
const provider = new OpenAIProvider({ apiKey: 'your-key' });
const parser = new VisionParser({ visionModel: provider.getModel() });

// Or rely on the OPENAI_API_KEY environment variable instead
// (use one of the two variants, not both, to avoid redeclaring the constants):
// const provider = new OpenAIProvider();
// const parser = new VisionParser({ visionModel: provider.getModel() });

// Process PDF
const result = await parser.processPDF('document.pdf');
console.log(result.file_object.pages[0].page_content);
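
The result shape used above (file_object.pages[i].page_content) suggests a simple way to collect the text of a whole document. The following is a minimal sketch rather than part of the library: the ParsedPage and ParsedResult interfaces and the joinPages helper are hypothetical and model only the fields that appear in the example.

```typescript
// Hypothetical minimal shape, modeling only the fields used in the example above.
interface ParsedPage {
  page_content: string;
}

interface ParsedResult {
  file_object: { pages: ParsedPage[] };
}

// Concatenate the content of every non-empty page, separated by a blank line.
function joinPages(result: ParsedResult): string {
  return result.file_object.pages
    .map((page) => page.page_content.trim())
    .filter((content) => content.length > 0)
    .join('\n\n');
}

// Stand-in data; a real result would come from parser.processPDF().
const sample: ParsedResult = {
  file_object: {
    pages: [
      { page_content: '# Title\n' },
      { page_content: '' },
      { page_content: 'Body text' },
    ],
  },
};

console.log(joinPages(sample));
```

Skipping empty pages is a design choice of this sketch; keep them if page boundaries matter downstream.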

Agent Parser (Multi-Step Workflows)

import { AgentParser, AgentStrategy, AgentTask } from 'ai-vision-parser';

const agent = new AgentParser({
  provider: 'openai',
  strategy: AgentStrategy.ADAPTIVE,  // Auto-selects best strategy
});

const result = await agent.parseDocument('document.pdf', {
  tasks: [
    AgentTask.EXTRACT_TABLES,
    AgentTask.IDENTIFY_ENTITIES,
    AgentTask.EXTRACT_METADATA,
  ],
});

console.log(result.data);
console.log(`Took ${result.executionTime}ms`);

Structured Parsing (Type-Safe with Zod)

import { VisionParser, StructuredParser, CommonSchemas } from 'ai-vision-parser';

const parser = new VisionParser({ provider: 'openai' });
const structured = new StructuredParser(parser);

const result = await structured.parsePDFWithSchema('invoice.pdf', {
  schema: CommonSchemas.Invoice,
  structured: true,
  maxRetries: 2,  // Retry on validation errors
});

// Fully typed and validated
console.log(result.data.invoiceNumber);
console.log(result.data.total);
console.log('Valid:', result.isValid);

OCR Fallback (Optional)

import { VisionParser, OCRProvider } from 'ai-vision-parser';

// Install first: npm install tesseract.js
const parser = new VisionParser({
  provider: 'openai',
  ocrFallback: true,  // Use OCR if vision model fails
  ocrProvider: OCRProvider.TESSERACT,
});

const result = await parser.processPDF('document.pdf');
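
Conceptually, a fallback like this tries the primary extractor first and switches to OCR only when that fails. The sketch below illustrates the pattern without calling the library: the Extractor type, the extractWithFallback helper, and the stand-in extractors are all hypothetical, and treating an empty result as a failure is an assumption of this sketch.

```typescript
type Extractor = (path: string) => Promise<string>;

interface Extraction {
  text: string;
  source: 'vision' | 'ocr';
}

// Try the primary extractor; use the fallback if it throws or returns no text.
async function extractWithFallback(
  path: string,
  primary: Extractor,
  fallback: Extractor,
): Promise<Extraction> {
  try {
    const text = await primary(path);
    if (text.trim().length > 0) {
      return { text, source: 'vision' };
    }
  } catch {
    // Primary extractor failed; fall through to the fallback below.
  }
  return { text: await fallback(path), source: 'ocr' };
}

// Stand-in extractors for demonstration only.
const failingVision: Extractor = async () => {
  throw new Error('vision model unavailable');
};
const fakeOcr: Extractor = async (path) => `OCR text from ${path}`;

extractWithFallback('document.pdf', failingVision, fakeOcr).then((result) => {
  console.log(result.source, result.text);
});
```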

Custom Agent Tool

import { AgentParser, AgentTool } from 'ai-vision-parser';
import { z } from 'zod';

const customTool: AgentTool = {
  name: 'extract_prices',
  description: 'Extract all prices',
  outputSchema: z.object({
    prices: z.array(z.number()),
    total: z.number(),
  }),
  execute: async (input, context) => {
    const text = context.rawResult.file_object.pages
      .map(p => p.page_content).join('\n');
    
    // Match dollar amounts, including an optional decimal part (e.g. $12.50)
    const prices = text.match(/\$\d+(?:\.\d+)?/g)
      ?.map(p => parseFloat(p.replace('$', ''))) || [];
    
    return {
      prices,
      total: prices.reduce((a, b) => a + b, 0),
    };
  },
};

const agent = new AgentParser({ provider: 'openai' });
agent.addTool(customTool);

Environment Setup

# Set API key
export OPENAI_API_KEY=your_key
# or
export ANTHROPIC_API_KEY=your_key
# or
export GEMINI_API_KEY=your_key

Or pass directly in code:

import { OpenAIProvider, ClaudeProvider, GeminiProvider } from 'ai-vision-parser';

const openai = new OpenAIProvider({ apiKey: 'your-key' });
const claude = new ClaudeProvider({ apiKey: 'your-key' });
const gemini = new GeminiProvider({ apiKey: 'your-key' });
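
If an application supports several providers, it can check which of the environment variables above is set before constructing one. This is a minimal sketch, not a library feature: the detectProvider helper and the ProviderName labels are hypothetical, and the order of the checks is an arbitrary choice.

```typescript
type ProviderName = 'openai' | 'claude' | 'gemini';

// Environment variable names as listed above, in the order they are checked.
const ENV_KEYS: Array<[ProviderName, string]> = [
  ['openai', 'OPENAI_API_KEY'],
  ['claude', 'ANTHROPIC_API_KEY'],
  ['gemini', 'GEMINI_API_KEY'],
];

// Return the first provider whose API key is present and non-empty.
function detectProvider(
  env: Record<string, string | undefined>,
): ProviderName | undefined {
  for (const [name, key] of ENV_KEYS) {
    const value = env[key];
    if (value !== undefined && value.trim() !== '') {
      return name;
    }
  }
  return undefined;
}

console.log(detectProvider({ ANTHROPIC_API_KEY: 'your_key' }));
```

In a Node.js application you would typically pass process.env as the argument.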

Core Components

Vision Parser

Basic document processing for PDFs and images.

const parser = new VisionParser({
  provider: 'openai',
  dpi: 333,
  prompt: 'Custom extraction prompt...',
});

Agent Parser

Multi-step workflows with different strategies:

Parallel - Execute tasks simultaneously (fastest)
Iterative - Multiple passes for accuracy
Hierarchical - High-level → detailed extraction
Adaptive - Auto-select based on complexity
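
To make the difference between the first two strategies concrete, here is a minimal sketch that is independent of the library. The Task type and the runParallel and runIterative helpers are hypothetical; the actual AgentParser internals may work differently.

```typescript
// A task receives the previous result (if any) and produces a new one.
type Task<T> = (previous: T | undefined) => Promise<T>;

// Parallel: start every task at once and wait for all of them.
async function runParallel<T>(tasks: Task<T>[]): Promise<T[]> {
  return Promise.all(tasks.map((task) => task(undefined)));
}

// Iterative: run passes one after another, feeding each pass the previous result.
async function runIterative<T>(passes: Task<T>[]): Promise<T | undefined> {
  let current: T | undefined = undefined;
  for (const pass of passes) {
    current = await pass(current);
  }
  return current;
}

// Stand-in tasks that only record their own name.
const extractTables: Task<string[]> = async (prev) => [...(prev ?? []), 'tables'];
const extractEntities: Task<string[]> = async (prev) => [...(prev ?? []), 'entities'];

runParallel([extractTables, extractEntities]).then((r) => console.log('parallel', r));
runIterative([extractTables, extractEntities]).then((r) => console.log('iterative', r));
```

The parallel run yields one independent result per task, while the iterative run yields a single result that each pass has built on.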

Agent Parser vs Normal Parser

Normal Vision Parser is ideal for:

Simple text extraction from documents
Single-page images or basic PDFs
When you need raw markdown text
Speed-critical scenarios where structure isn't needed

Agent Parser provides additional advantages:

Structured Data Extraction
- Extracts tables, entities, metadata, forms, and key-value pairs
- Returns structured objects instead of raw text
- Type-safe with Zod schema validation
Multi-Step Processing Strategies
- Parallel: Run multiple tasks simultaneously (faster)
- Iterative: Refine results over multiple passes (more accurate)
- Hierarchical: Process from high-level to detailed (better for structured docs)
- Adaptive: Automatically selects the best strategy
Context & Memory Management
- Maintains context across processing steps
- Tracks intermediate results
- Builds metadata progressively
Custom Tools & Extensibility
- Add custom extraction tools
- Compose multiple tools together
- Domain-specific extraction logic
Schema Validation
- Validate output against Zod schemas
- Type-safe results
- Automatic validation feedback
Task Decomposition
- Automatically breaks complex tasks into subtasks
- Execute tasks selectively
- Run individual tasks on existing results
Execution Tracking
- Step-by-step execution details
- Success/failure status per step
- Execution time metrics
- Token usage tracking
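
The execution tracking described above can be pictured as a thin wrapper around each step. The sketch below is a hypothetical illustration, not the library's implementation: the StepRecord shape and the runTracked helper are assumptions, and token usage is left out because it depends on the provider's response.

```typescript
interface StepRecord {
  name: string;
  success: boolean;
  durationMs: number;
  error?: string;
}

interface Step {
  name: string;
  run: () => Promise<unknown>;
}

// Run steps in order, recording status and elapsed time for each one.
// A failing step is recorded and does not stop the remaining steps.
async function runTracked(steps: Step[]): Promise<StepRecord[]> {
  const records: StepRecord[] = [];
  for (const step of steps) {
    const startedAt = Date.now();
    try {
      await step.run();
      records.push({ name: step.name, success: true, durationMs: Date.now() - startedAt });
    } catch (err) {
      records.push({
        name: step.name,
        success: false,
        durationMs: Date.now() - startedAt,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return records;
}

// Stand-in steps for demonstration only.
const demoSteps: Step[] = [
  { name: 'extract_tables', run: async () => ['table'] },
  { name: 'identify_entities', run: async () => { throw new Error('model timeout'); } },
];

runTracked(demoSteps).then((records) => console.log(records));
```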

When to use Agent Parser:

Complex multi-page documents
Documents from which you need structured data (tables, entities, metadata)
Workflows that require validation and type safety
Production systems that need reliable structured output
Documents that require multiple extraction passes

Structured Parser

Type-safe parsing with Zod schemas:

Predefined schemas (Invoice, Receipt, Contract, etc.)
Custom schemas with validation
Automatic retry on validation errors
Partial result support
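
Retrying on validation errors amounts to a loop that re-runs extraction until the output passes validation or the retry budget runs out. The sketch below shows that loop with a hand-written type guard in place of a Zod schema so that it runs without dependencies. The parseWithRetry helper, the ParseOutcome shape, and the reading of maxRetries as "extra attempts after the first" are assumptions, not the library's documented behavior.

```typescript
interface Invoice {
  invoiceNumber: string;
  total: number;
}

interface ParseOutcome<T> {
  data?: T;
  isValid: boolean;
  attempts: number;
}

// Hand-written validator standing in for a schema check.
function isInvoice(value: unknown): value is Invoice {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record['invoiceNumber'] === 'string' && typeof record['total'] === 'number';
}

// Call extract() until its output validates or maxRetries extra attempts are used.
async function parseWithRetry<T>(
  extract: (attempt: number) => Promise<unknown>,
  validate: (value: unknown) => value is T,
  maxRetries: number,
): Promise<ParseOutcome<T>> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const candidate = await extract(attempt);
    if (validate(candidate)) {
      return { data: candidate, isValid: true, attempts: attempt + 1 };
    }
  }
  return { isValid: false, attempts: maxRetries + 1 };
}

// Stand-in extractor: the first attempt returns an invalid total, the second a valid one.
const fakeExtract = async (attempt: number): Promise<unknown> =>
  attempt === 0
    ? { invoiceNumber: 'INV-1', total: '42' }
    : { invoiceNumber: 'INV-1', total: 42 };

parseWithRetry(fakeExtract, isInvoice, 2).then((outcome) => {
  console.log(outcome.isValid, outcome.attempts, outcome.data?.total);
});
```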

OCR Providers

Optional OCR when needed:

Tesseract - Free, offline (npm install tesseract.js)
Google Vision - Best accuracy (npm install @google-cloud/vision)
Azure - Enterprise (npm install @azure/cognitiveservices-computervision @azure/ms-rest-js)

Troubleshooting

Canvas Installation Issues

The canvas package is required for PDF processing and needs native system libraries. If you encounter canvas errors:

Quick Fix

For pnpm users (pnpm rebuild often doesn't work):

Manual rebuild (Recommended):

# Find and rebuild canvas
cd node_modules/.pnpm/canvas@*/node_modules/canvas
npx node-gyp rebuild
cd ../../../../..

Or use this one-liner:

CANVAS_DIR=$(find node_modules/.pnpm -name "canvas" -type d -path "*/node_modules/canvas" | head -1) && cd "$CANVAS_DIR" && npx node-gyp rebuild && cd - > /dev/null

Alternative: Use npm for canvas

npm install canvas --prefix ./temp && cp -r ./temp/node_modules/canvas node_modules/.pnpm/canvas@*/node_modules/ && rm -rf ./temp

Install System Dependencies

macOS:

brew install pkg-config cairo pango libpng jpeg giflib librsvg

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install build-essential libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev

Fedora/CentOS/RHEL:

sudo dnf install cairo-devel pango-devel libjpeg-turbo-devel giflib-devel

Common Solutions

Rebuild canvas (try this first with pnpm):

pnpm rebuild canvas

If that doesn't work, manually rebuild:

cd node_modules/.pnpm/canvas@*/node_modules/canvas
npx node-gyp rebuild
cd ../../../../..

Clean install:

rm -rf node_modules pnpm-lock.yaml
pnpm install
pnpm rebuild canvas

Use npm instead (if pnpm issues persist):

npm install ai-vision-parser
npm rebuild canvas

Verify canvas works:

node -e "const { createCanvas } = require('canvas'); console.log('✅ Canvas works!');"

Apple Silicon (M1/M2)

arch -x86_64 brew install pkg-config cairo pango libpng jpeg giflib librsvg
pnpm rebuild canvas

PDF Rendering Error: "TypeError: Image or Canvas expected"

If you see this error when processing PDFs, it's a critical compatibility issue between pdfjs-dist and canvas 3.x.

REQUIRED: Downgrade canvas to 2.x in your project

Add to your project's package.json:

For pnpm:

{
  "pnpm": {
    "overrides": {
      "ai-vision-parser>canvas": "2.11.2"
    }
  }
}

For npm:

{
  "overrides": {
    "ai-vision-parser": {
      "canvas": "2.11.2"
    }
  }
}

Then reinstall:

rm -rf node_modules pnpm-lock.yaml
pnpm install
# Rebuild canvas 2.x
cd node_modules/.pnpm/canvas@2.11.2*/node_modules/canvas
npx node-gyp rebuild
cd ../../../../..

Verify: pnpm list canvas should show canvas@2.11.2

If it still does not work, try npm instead of pnpm:

rm -rf node_modules pnpm-lock.yaml
npm install
npm rebuild canvas

pnpm's dependency hoisting can cause issues with native modules like canvas.

Sharp and Canvas Conflict (macOS)

If you see Class GNotificationCenterDelegate is implemented in both sharp and canvas:

Quick fix:

export DYLD_INSERT_LIBRARIES=""
node your-script.js

Or pin compatible versions:

{
  "dependencies": {
    "sharp": "^0.33.0",
    "canvas": "^2.11.2"
  }
}

See Troubleshooting Guide for more solutions.

Image-Only Processing

If you only need image processing (not PDF), canvas is not required; use only the processImage() method.

For more detailed troubleshooting, see the complete Troubleshooting Guide.

Documentation

Quick Start - Get started in 5 minutes
Providers - Provider configuration (NEW)
Agent Parser - Multi-step workflows
Structured Parsing - Zod integration
OCR Providers - Optional OCR
Cache Interface - Caching system
Troubleshooting - Common issues
Roadmap - Future plans

Examples

See the examples/ directory:

# Basic examples
npm run test:pdf
npm run test:image

# Advanced examples
ts-node examples/example-agent-parser.ts
ts-node examples/example-structured-parser.ts
ts-node examples/example-agent-with-zod.ts
ts-node examples/example-ocr.ts

Development

pnpm install
pnpm run build
pnpm test

License

Apache License 2.0