ai-vision-parser
v0.1.12
A powerful TypeScript library for extracting and analyzing content from PDF, Image, and Video files using Vision Language Models
AI Vision Parser
TypeScript library for extracting content from PDFs, images, and videos using Vision Language Models.
Features
- 🤖 Agent Parser - Multi-step workflows with strategies (parallel, iterative, hierarchical)
- 📋 Structured Parsing - Type-safe extraction with Zod schema validation
- 🔍 OCR Fallback - Optional OCR providers (Tesseract, Google Vision, Azure)
- 🎯 Vision Models - OpenAI, Claude, Gemini, Azure via aisuite
- 📄 Document Types - PDFs, images (JPG, PNG, TIFF, WebP, BMP), videos
- 💾 Smart Caching - Multi-layer caching (local, S3, Redis-ready)
- ⚡ Async Processing - Parallel processing with configurable concurrency
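To make "configurable concurrency" concrete, here is a minimal sketch of a bounded-concurrency map in plain TypeScript. `mapWithConcurrency` is a hypothetical helper written for illustration; it is not part of ai-vision-parser, and the library's own scheduling may work differently.

```typescript
// Hypothetical helper (not a library API): runs `worker` over `items` with at
// most `limit` calls in flight, and returns results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

A caller could pass page numbers as `items` and a per-page processing function as `worker`; raising `limit` trades memory and API rate-limit headroom for throughput.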
Installation
npm install ai-vision-parser zod
# or
pnpm add ai-vision-parser zod
System dependencies (for canvas - required for PDF processing):
# macOS
brew install pkg-config cairo pango libpng jpeg giflib librsvg
# Ubuntu/Debian
sudo apt-get install build-essential libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev
Having canvas installation issues? See the Troubleshooting Guide for detailed solutions.
Quick Start
Basic Document Parsing
import { OpenAIProvider, VisionParser } from 'ai-vision-parser';
// Option 1: pass the API key explicitly
const provider = new OpenAIProvider({ apiKey: 'your-key' });
// Option 2: read the key from the OPENAI_API_KEY environment variable instead
// const provider = new OpenAIProvider();
const parser = new VisionParser({ visionModel: provider.getModel() });
// Process PDF
const result = await parser.processPDF('document.pdf');
console.log(result.file_object.pages[0].page_content);
Agent Parser (Multi-Step Workflows)
import { AgentParser, AgentStrategy, AgentTask } from 'ai-vision-parser';
const agent = new AgentParser({
  provider: 'openai',
  strategy: AgentStrategy.ADAPTIVE, // Auto-selects best strategy
});
const result = await agent.parseDocument('document.pdf', {
  tasks: [
    AgentTask.EXTRACT_TABLES,
    AgentTask.IDENTIFY_ENTITIES,
    AgentTask.EXTRACT_METADATA,
  ],
});
console.log(result.data);
console.log(`Took ${result.executionTime}ms`);
Structured Parsing (Type-Safe with Zod)
import { VisionParser, StructuredParser, CommonSchemas } from 'ai-vision-parser';
const parser = new VisionParser({ provider: 'openai' });
const structured = new StructuredParser(parser);
const result = await structured.parsePDFWithSchema('invoice.pdf', {
  schema: CommonSchemas.Invoice,
  structured: true,
  maxRetries: 2, // Retry on validation errors
});
// Fully typed and validated
console.log(result.data.invoiceNumber);
console.log(result.data.total);
console.log('Valid:', result.isValid);
OCR Fallback (Optional)
import { VisionParser, OCRProvider } from 'ai-vision-parser';
// Install first: npm install tesseract.js
const parser = new VisionParser({
  provider: 'openai',
  ocrFallback: true, // Use OCR if vision model fails
  ocrProvider: OCRProvider.TESSERACT,
});
const result = await parser.processPDF('document.pdf');
Custom Agent Tool
import { AgentParser, AgentTool } from 'ai-vision-parser';
import { z } from 'zod';
const customTool: AgentTool = {
  name: 'extract_prices',
  description: 'Extract all prices',
  outputSchema: z.object({
    prices: z.array(z.number()),
    total: z.number(),
  }),
  execute: async (input, context) => {
    // Join the text of every page into one string
    const text = context.rawResult.file_object.pages
      .map(p => p.page_content).join('\n');
    // Match dollar amounts, including an optional decimal part (e.g. $12.50)
    const prices = text.match(/\$\d+(?:\.\d+)?/g)
      ?.map(p => parseFloat(p.replace('$', ''))) || [];
    return {
      prices,
      total: prices.reduce((a, b) => a + b, 0),
    };
  },
};
const agent = new AgentParser({ provider: 'openai' });
agent.addTool(customTool);
Environment Setup
# Set API key
export OPENAI_API_KEY=your_key
# or
export ANTHROPIC_API_KEY=your_key
# or
export GEMINI_API_KEY=your_key
Or pass directly in code:
import { OpenAIProvider, ClaudeProvider, GeminiProvider } from 'ai-vision-parser';
const openai = new OpenAIProvider({ apiKey: 'your-key' });
const claude = new ClaudeProvider({ apiKey: 'your-key' });
const gemini = new GeminiProvider({ apiKey: 'your-key' });
Core Components
Vision Parser
Basic document processing for PDFs and images.
const parser = new VisionParser({
  provider: 'openai',
  dpi: 333,
  prompt: 'Custom extraction prompt...',
});
Agent Parser
Multi-step workflows with different strategies:
- Parallel - Execute tasks simultaneously (fastest)
- Iterative - Multiple passes for accuracy
- Hierarchical - High-level → detailed extraction
- Adaptive - Auto-select based on complexity
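The difference between running tasks in parallel and running them one after another can be sketched in a few lines of plain TypeScript. Everything here (`ExtractionStep`, `runParallel`, `runSequential`) is a hypothetical illustration, not the library's internals: parallel steps all see the same starting context, while sequential steps (the pattern that iterative and hierarchical processing presumably build on) each see what earlier steps produced.

```typescript
// Hypothetical type for illustration; the library's real task types may differ.
// A step reads the current context and returns the fields it wants to add.
type ExtractionStep<C> = (context: C) => Promise<Partial<C>>;

// Parallel: every step receives the same initial context; outputs are merged after.
async function runParallel<C extends object>(
  initial: C,
  steps: ExtractionStep<C>[],
): Promise<C> {
  const patches = await Promise.all(steps.map((step) => step(initial)));
  return Object.assign({}, initial, ...patches);
}

// Sequential: each step receives the context produced by the steps before it.
async function runSequential<C extends object>(
  initial: C,
  steps: ExtractionStep<C>[],
): Promise<C> {
  let context = initial;
  for (const step of steps) {
    context = { ...context, ...(await step(context)) };
  }
  return context;
}
```

Parallel execution is faster when steps are independent; the sequential form is needed whenever a later step depends on an earlier result, for example a summary that refers to already extracted tables.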
Agent Parser vs Normal Parser
Normal Vision Parser is ideal for:
- Simple text extraction from documents
- Single-page images or basic PDFs
- When you need raw markdown text
- Speed-critical scenarios where structure isn't needed
Agent Parser provides additional advantages:
Structured Data Extraction
- Extracts tables, entities, metadata, forms, and key-value pairs
- Returns structured objects instead of raw text
- Type-safe with Zod schema validation
Multi-Step Processing Strategies
- Parallel: Run multiple tasks simultaneously (faster)
- Iterative: Refine results over multiple passes (more accurate)
- Hierarchical: Process from high-level to detailed (better for structured docs)
- Adaptive: Automatically selects the best strategy
Context & Memory Management
- Maintains context across processing steps
- Tracks intermediate results
- Builds metadata progressively
Custom Tools & Extensibility
- Add custom extraction tools
- Compose multiple tools together
- Domain-specific extraction logic
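Domain-specific logic is easiest to test when it lives in a pure function that a custom tool's `execute` callback can then call. The sketch below isolates the price extraction from the Custom Agent Tool example; `extractPrices` is a hypothetical helper, and the regex assumes prices are written as `$` followed by digits with optional thousands separators and an optional decimal part.

```typescript
// Hypothetical pure helper (not a library API) mirroring the custom tool's logic.
// Assumption: prices look like "$15", "$12.50", or "$1,200.50".
function extractPrices(text: string): { prices: number[]; total: number } {
  const matches = text.match(/\$\d[\d,]*(?:\.\d+)?/g) ?? [];
  // Strip the dollar sign and any commas before converting to a number.
  const prices = matches.map((m) => parseFloat(m.replace(/[$,]/g, '')));
  const total = prices.reduce((sum, p) => sum + p, 0);
  return { prices, total };
}
```

Because the function takes a string and returns plain data, it can be unit-tested without a vision model or a document.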
Schema Validation
- Validate output against Zod schemas
- Type-safe results
- Automatic validation feedback
Task Decomposition
- Automatically breaks complex tasks into subtasks
- Execute tasks selectively
- Run individual tasks on existing results
Execution Tracking
- Step-by-step execution details
- Success/failure status per step
- Execution time metrics
- Token usage tracking
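Per-step status and timing can be collected with a small wrapper around each step. This is a minimal sketch with hypothetical names (`StepRecord`, `trackStep`); it does not reflect the shape of the library's actual execution report, and it leaves out token usage, which depends on the provider's response.

```typescript
// Hypothetical record shape for illustration only.
interface StepRecord {
  step: string;
  success: boolean;
  durationMs: number;
  error?: string;
}

// Runs one step, appends a record to `log`, and returns the step's value
// (or undefined if the step threw).
async function trackStep<T>(
  log: StepRecord[],
  step: string,
  run: () => Promise<T>,
): Promise<T | undefined> {
  const started = Date.now();
  try {
    const value = await run();
    log.push({ step, success: true, durationMs: Date.now() - started });
    return value;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log.push({ step, success: false, durationMs: Date.now() - started, error: message });
    return undefined;
  }
}
```

Recording failures instead of rethrowing lets the remaining steps continue, at the cost of the caller having to check the log for errors.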
When to use Agent Parser:
- Complex multi-page documents
- Need structured data (tables, entities, metadata)
- Require validation and type safety
- Production systems needing reliable structured output
- Documents requiring multiple extraction passes
Structured Parser
Type-safe parsing with Zod schemas:
- Predefined schemas (Invoice, Receipt, Contract, etc.)
- Custom schemas with validation
- Automatic retry on validation errors
- Partial result support
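The retry-on-validation idea can be sketched without any dependencies. The names below (`ValidationResult`, `extractWithRetry`) are hypothetical and the validator is a plain function standing in for a Zod schema; how the library actually feeds validation errors back to the model is not documented here, so treat the feedback mechanism as an assumption.

```typescript
// Hypothetical shapes for illustration; a Zod schema would normally play the
// role of `validate`.
type ValidationResult<T> =
  | { ok: true; value: T }
  | { ok: false; errors: string[] };

interface RetryOutcome<T> {
  data?: T;
  isValid: boolean;
  attempts: number;
  errors: string[];
}

// Calls `extract` up to maxRetries + 1 times, passing the previous attempt's
// validation errors as feedback, and stops at the first valid result.
async function extractWithRetry<T>(
  extract: (feedback: string[]) => Promise<unknown>,
  validate: (raw: unknown) => ValidationResult<T>,
  maxRetries: number,
): Promise<RetryOutcome<T>> {
  const maxAttempts = Math.max(0, maxRetries) + 1;
  let feedback: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const checked = validate(await extract(feedback));
    if (checked.ok) {
      return { data: checked.value, isValid: true, attempts: attempt, errors: [] };
    }
    feedback = checked.errors; // handed to the next attempt
  }
  return { isValid: false, attempts: maxAttempts, errors: feedback };
}
```

With `maxRetries: 2`, as in the Structured Parsing example, this sketch makes at most three extraction calls before returning an invalid outcome with the last set of errors.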
OCR Providers
Optional OCR when needed:
- Tesseract - Free, offline (npm install tesseract.js)
- Google Vision - Best accuracy (npm install @google-cloud/vision)
- Azure - Enterprise (npm install @azure/cognitiveservices-computervision @azure/ms-rest-js)
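The fallback pattern behind the `ocrFallback` option can be sketched generically: try the primary extractor, and switch to the secondary one if it fails. `TextExtractor` and `extractWithFallback` are hypothetical names, and treating an empty result as a failure is an assumption of this sketch, not documented library behavior.

```typescript
// Hypothetical function shape; stands in for a vision-model call or an OCR engine.
type TextExtractor = (filePath: string) => Promise<string>;

// Tries `primary` first and uses `fallback` if it throws or returns only whitespace.
async function extractWithFallback(
  filePath: string,
  primary: TextExtractor,
  fallback: TextExtractor,
): Promise<{ text: string; source: 'primary' | 'fallback' }> {
  try {
    const text = await primary(filePath);
    if (text.trim().length > 0) {
      return { text, source: 'primary' };
    }
  } catch {
    // Primary failed; fall through to the fallback extractor.
  }
  return { text: await fallback(filePath), source: 'fallback' };
}
```

Returning the `source` alongside the text lets callers log how often the fallback is used, which is a useful signal when choosing between OCR providers.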
Troubleshooting
Canvas Installation Issues
The canvas package is required for PDF processing and needs native system libraries. If you encounter canvas errors:
Quick Fix
For pnpm users (pnpm rebuild often doesn't work):
Manual rebuild (Recommended):
# Find and rebuild canvas
cd node_modules/.pnpm/canvas@*/node_modules/canvas
npx node-gyp rebuild
cd ../../../../..
Or use this one-liner:
CANVAS_DIR=$(find node_modules/.pnpm -name "canvas" -type d -path "*/node_modules/canvas" | head -1) && cd "$CANVAS_DIR" && npx node-gyp rebuild && cd - > /dev/null
Alternative: Use npm for canvas
npm install canvas --prefix ./temp && cp -r ./temp/node_modules/canvas node_modules/.pnpm/canvas@*/node_modules/ && rm -rf ./temp
Install System Dependencies
macOS:
brew install pkg-config cairo pango libpng jpeg giflib librsvg
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install build-essential libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev
Fedora/CentOS/RHEL:
sudo dnf install cairo-devel pango-devel libjpeg-turbo-devel giflib-devel
Common Solutions
Rebuild canvas (Recommended for pnpm):
pnpm rebuild canvas
If that doesn't work, manually rebuild:
cd node_modules/.pnpm/canvas@*/node_modules/canvas
npx node-gyp rebuild
cd ../../../../..
Clean install:
rm -rf node_modules pnpm-lock.yaml
pnpm install
pnpm rebuild canvas
Use npm instead (if pnpm issues persist):
npm install ai-vision-parser
npm rebuild canvas
Verify canvas works:
node -e "const { createCanvas } = require('canvas'); console.log('✅ Canvas works!');"
Apple Silicon (M1/M2)
arch -x86_64 brew install pkg-config cairo pango libpng jpeg giflib librsvg
pnpm rebuild canvas
PDF Rendering Error: "TypeError: Image or Canvas expected"
If you see this error when processing PDFs, it's a critical compatibility issue between pdfjs-dist and canvas 3.x.
REQUIRED: Downgrade canvas to 2.x in your project
Add to your project's package.json:
For pnpm:
{
  "pnpm": {
    "overrides": {
      "ai-vision-parser>canvas": "2.11.2"
    }
  }
}
For npm:
{
  "overrides": {
    "ai-vision-parser": {
      "canvas": "2.11.2"
    }
  }
}
Then reinstall:
rm -rf node_modules pnpm-lock.yaml
pnpm install
# Rebuild canvas 2.x
cd node_modules/.pnpm/canvas@2.11.2*/node_modules/canvas
npx node-gyp rebuild
cd ../../../../..
Verify: pnpm list canvas should show canvas@2.11.2
If still not working, try npm instead of pnpm:
rm -rf node_modules pnpm-lock.yaml
npm install
npm rebuild canvas
pnpm's dependency hoisting can cause issues with native modules like canvas.
Sharp and Canvas Conflict (macOS)
If you see Class GNotificationCenterDelegate is implemented in both sharp and canvas:
Quick fix:
export DYLD_INSERT_LIBRARIES=""
node your-script.js
Or pin compatible versions:
{
  "dependencies": {
    "sharp": "^0.33.0",
    "canvas": "^2.11.2"
  }
}
See Troubleshooting Guide for more solutions.
Image-Only Processing
If you only need image processing (not PDF), canvas is not required. Use only the processImage() method.
For more detailed troubleshooting, see the complete Troubleshooting Guide.
Documentation
- Quick Start - Get started in 5 minutes
- Providers - Provider configuration (NEW)
- Agent Parser - Multi-step workflows
- Structured Parsing - Zod integration
- OCR Providers - Optional OCR
- Cache Interface - Caching system
- Troubleshooting - Common issues
- Roadmap - Future plans
Examples
See examples/ directory:
# Basic examples
npm run test:pdf
npm run test:image
# Advanced examples
ts-node examples/example-agent-parser.ts
ts-node examples/example-structured-parser.ts
ts-node examples/example-agent-with-zod.ts
ts-node examples/example-ocr.ts
Development
pnpm install
pnpm run build
pnpm test
License
Apache License 2.0
