ai-vision-parser

v0.1.12

A powerful TypeScript library for extracting and analyzing content from PDF, Image, and Video files using Vision Language Models

AI Vision Parser

TypeScript library for extracting content from PDFs, images, and videos using Vision Language Models.

Features

  • 🤖 Agent Parser - Multi-step workflows with strategies (parallel, iterative, hierarchical)
  • 📋 Structured Parsing - Type-safe extraction with Zod schema validation
  • 🔍 OCR Fallback - Optional OCR providers (Tesseract, Google Vision, Azure)
  • 🎯 Vision Models - OpenAI, Claude, Gemini, Azure via aisuite
  • 📄 Document Types - PDFs, images (JPG, PNG, TIFF, WebP, BMP), videos
  • 💾 Smart Caching - Multi-layer caching (local, S3, Redis-ready)
  • Async Processing - Parallel processing with configurable concurrency

Installation

npm install ai-vision-parser zod
# or
pnpm add ai-vision-parser zod

System dependencies (for canvas - required for PDF processing):

# macOS
brew install pkg-config cairo pango libpng jpeg giflib librsvg

# Ubuntu/Debian
sudo apt-get install build-essential libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev

Having canvas installation issues? See the Troubleshooting Guide for detailed solutions.

Quick Start

Basic Document Parsing

import { OpenAIProvider, VisionParser } from 'ai-vision-parser';

// With an explicit API key
const provider = new OpenAIProvider({ apiKey: 'your-key' });

// Or read the key from the OPENAI_API_KEY environment variable instead:
// const provider = new OpenAIProvider();

const parser = new VisionParser({ visionModel: provider.getModel() });

// Process PDF
const result = await parser.processPDF('document.pdf');
console.log(result.file_object.pages[0].page_content);

Agent Parser (Multi-Step Workflows)

import { AgentParser, AgentStrategy, AgentTask } from 'ai-vision-parser';

const agent = new AgentParser({
  provider: 'openai',
  strategy: AgentStrategy.ADAPTIVE,  // Auto-selects best strategy
});

const result = await agent.parseDocument('document.pdf', {
  tasks: [
    AgentTask.EXTRACT_TABLES,
    AgentTask.IDENTIFY_ENTITIES,
    AgentTask.EXTRACT_METADATA,
  ],
});

console.log(result.data);
console.log(`Took ${result.executionTime}ms`);

Structured Parsing (Type-Safe with Zod)

import { VisionParser, StructuredParser, CommonSchemas } from 'ai-vision-parser';

const parser = new VisionParser({ provider: 'openai' });
const structured = new StructuredParser(parser);

const result = await structured.parsePDFWithSchema('invoice.pdf', {
  schema: CommonSchemas.Invoice,
  structured: true,
  maxRetries: 2,  // Retry on validation errors
});

// Fully typed and validated
console.log(result.data.invoiceNumber);
console.log(result.data.total);
console.log('Valid:', result.isValid);

OCR Fallback (Optional)

import { VisionParser, OCRProvider } from 'ai-vision-parser';

// Install first: npm install tesseract.js
const parser = new VisionParser({
  provider: 'openai',
  ocrFallback: true,  // Use OCR if vision model fails
  ocrProvider: OCRProvider.TESSERACT,
});

const result = await parser.processPDF('document.pdf');
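
The behavior that `ocrFallback` enables can be pictured as a try-then-fall-back wrapper. The sketch below only illustrates that general pattern and is not the library's implementation; `Extractor`, `withFallback`, and the two stand-in extractors are hypothetical names.

```typescript
// Hypothetical type: an extractor turns a file path into text.
type Extractor = (path: string) => Promise<string>;

interface ExtractionResult {
  text: string;
  source: 'primary' | 'fallback';
}

// Try the primary extractor first; if it throws or returns only whitespace,
// use the fallback instead.
async function withFallback(
  primary: Extractor,
  fallback: Extractor,
  path: string,
): Promise<ExtractionResult> {
  try {
    const text = await primary(path);
    if (text.trim().length > 0) {
      return { text, source: 'primary' };
    }
  } catch {
    // Ignore the primary error and fall through to the fallback.
  }
  return { text: await fallback(path), source: 'fallback' };
}

// Stand-ins for a vision model call and an OCR call (assumptions, not real APIs).
const failingVision: Extractor = async () => {
  throw new Error('vision model unavailable');
};
const fakeOcr: Extractor = async (path) => `OCR text from ${path}`;
```

Calling `withFallback(failingVision, fakeOcr, 'document.pdf')` resolves with `source` set to `'fallback'`, which makes it easy to log how often the fallback path is taken.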

Custom Agent Tool

import { AgentParser, AgentTool } from 'ai-vision-parser';
import { z } from 'zod';

const customTool: AgentTool = {
  name: 'extract_prices',
  description: 'Extract all prices',
  outputSchema: z.object({
    prices: z.array(z.number()),
    total: z.number(),
  }),
  execute: async (input, context) => {
    const text = context.rawResult.file_object.pages
      .map(p => p.page_content).join('\n');
    
    // Match dollar amounts with an optional decimal part, e.g. $12 or $3.50
    const prices = text.match(/\$\d+(?:\.\d+)?/g)
      ?.map(p => parseFloat(p.replace('$', ''))) || [];
    
    return {
      prices,
      total: prices.reduce((a, b) => a + b, 0),
    };
  },
};

const agent = new AgentParser({ provider: 'openai' });
agent.addTool(customTool);
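
The price-matching step inside `execute` can be tried on its own, without a parser or an API key. This is a minimal sketch, assuming prices are written as a dollar sign followed by digits and an optional decimal part; `extractPrices` is a hypothetical helper and not part of the library.

```typescript
// Hypothetical helper: pull dollar amounts such as "$12" or "$3.50" out of text.
function extractPrices(text: string): { prices: number[]; total: number } {
  // \$ matches a literal dollar sign; the optional group matches the decimal part.
  const matches = text.match(/\$\d+(?:\.\d+)?/g) ?? [];
  // Drop the leading "$" before converting to a number.
  const prices = matches.map((m) => parseFloat(m.slice(1)));
  const total = prices.reduce((sum, p) => sum + p, 0);
  return { prices, total };
}

const sample = extractPrices('Coffee $3.50, bagel $2, tip $1.25');
// sample.prices is [3.5, 2, 1.25] and sample.total is 6.75
```

Amounts with thousands separators (for example `$1,200`) are not handled by this pattern and would need a wider expression.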

Environment Setup

# Set API key
export OPENAI_API_KEY=your_key
# or
export ANTHROPIC_API_KEY=your_key
# or
export GEMINI_API_KEY=your_key

Or pass directly in code:

import { OpenAIProvider, ClaudeProvider, GeminiProvider } from 'ai-vision-parser';

const openai = new OpenAIProvider({ apiKey: 'your-key' });
const claude = new ClaudeProvider({ apiKey: 'your-key' });
const gemini = new GeminiProvider({ apiKey: 'your-key' });

Core Components

Vision Parser

Basic document processing for PDFs and images.

const parser = new VisionParser({
  provider: 'openai',
  dpi: 333,
  prompt: 'Custom extraction prompt...',
});

Agent Parser

Multi-step workflows with different strategies:

  • Parallel - Execute tasks simultaneously (fastest)
  • Iterative - Multiple passes for accuracy
  • Hierarchical - High-level → detailed extraction
  • Adaptive - Auto-select based on complexity
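
The difference between running tasks in parallel and running them so that later steps build on earlier ones (as the hierarchical strategy suggests) can be sketched with plain promises. This only illustrates the general idea; `Task`, `runParallel`, and `runInOrder` are hypothetical names and say nothing about how AgentParser is implemented.

```typescript
// Hypothetical task shape: receives the results so far and returns a new value.
type Task = {
  name: string;
  run: (previous: Record<string, unknown>) => Promise<unknown>;
};

// Parallel: start every task at once; no task sees another task's output.
async function runParallel(tasks: Task[]): Promise<Record<string, unknown>> {
  const values = await Promise.all(tasks.map((t) => t.run({})));
  const out: Record<string, unknown> = {};
  tasks.forEach((t, i) => {
    out[t.name] = values[i];
  });
  return out;
}

// In order: run tasks one after another; each task can read earlier results.
async function runInOrder(tasks: Task[]): Promise<Record<string, unknown>> {
  const out: Record<string, unknown> = {};
  for (const t of tasks) {
    out[t.name] = await t.run({ ...out });
  }
  return out;
}

// Two stand-in tasks: the second one reports what it can see from the first.
const demoTasks: Task[] = [
  { name: 'tables', run: async () => 2 },
  {
    name: 'summary',
    run: async (prev) => `tables seen: ${String(prev['tables'] ?? 'none')}`,
  },
];
```

With `demoTasks`, `runParallel` yields a summary of `tables seen: none`, while `runInOrder` yields `tables seen: 2`. That is the trade-off in miniature: parallel execution is faster, but dependent steps need ordering.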

Agent Parser vs Normal Parser

Normal Vision Parser is ideal for:

  • Simple text extraction from documents
  • Single-page images or basic PDFs
  • When you need raw markdown text
  • Speed-critical scenarios where structure isn't needed

Agent Parser provides additional advantages:

  1. Structured Data Extraction

    • Extracts tables, entities, metadata, forms, and key-value pairs
    • Returns structured objects instead of raw text
    • Type-safe with Zod schema validation
  2. Multi-Step Processing Strategies

    • Parallel: Run multiple tasks simultaneously (faster)
    • Iterative: Refine results over multiple passes (more accurate)
    • Hierarchical: Process from high-level to detailed (better for structured docs)
    • Adaptive: Automatically selects the best strategy
  3. Context & Memory Management

    • Maintains context across processing steps
    • Tracks intermediate results
    • Builds metadata progressively
  4. Custom Tools & Extensibility

    • Add custom extraction tools
    • Compose multiple tools together
    • Domain-specific extraction logic
  5. Schema Validation

    • Validate output against Zod schemas
    • Type-safe results
    • Automatic validation feedback
  6. Task Decomposition

    • Automatically breaks complex tasks into subtasks
    • Execute tasks selectively
    • Run individual tasks on existing results
  7. Execution Tracking

    • Step-by-step execution details
    • Success/failure status per step
    • Execution time metrics
    • Token usage tracking
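
To make the execution tracking idea concrete, here is a small sketch of per-step status and timing. It is an assumption about what such tracking could look like, not the library's actual data model; `StepRecord` and `runTracked` are hypothetical names, and token usage is left out.

```typescript
// Hypothetical record for one step of a multi-step run.
interface StepRecord {
  name: string;
  success: boolean;
  durationMs: number;
  error?: string;
}

// Run each step in order, recording status and wall-clock time per step.
// A failing step is recorded and does not stop the remaining steps.
async function runTracked(
  steps: Array<{ name: string; run: () => Promise<void> }>,
): Promise<StepRecord[]> {
  const records: StepRecord[] = [];
  for (const step of steps) {
    const start = Date.now();
    try {
      await step.run();
      records.push({ name: step.name, success: true, durationMs: Date.now() - start });
    } catch (err) {
      records.push({
        name: step.name,
        success: false,
        durationMs: Date.now() - start,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return records;
}
```

Recording failures instead of rethrowing keeps partial results available, which suits multi-task extraction where one failed task should not discard the rest.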

When to use Agent Parser:

  • Complex multi-page documents
  • Need structured data (tables, entities, metadata)
  • Require validation and type safety
  • Production systems needing reliable structured output
  • Documents requiring multiple extraction passes

Structured Parser

Type-safe parsing with Zod schemas:

  • Predefined schemas (Invoice, Receipt, Contract, etc.)
  • Custom schemas with validation
  • Automatic retry on validation errors
  • Partial result support
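
Retrying on validation errors can be sketched as a loop that re-runs extraction until the output passes the schema. This is an illustration under stated assumptions, not the StructuredParser implementation: `SchemaLike`, `extractWithRetry`, and `totalSchema` are hypothetical, and the schema interface mirrors the general shape of a `safeParse`-style result so the sketch runs without importing Zod.

```typescript
// Minimal schema shape, modeled on a safeParse-style result (assumption).
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: unknown };

interface SchemaLike<T> {
  safeParse(input: unknown): ParseResult<T>;
}

interface ValidatedResult<T> {
  isValid: boolean;
  data?: T;
  attempts: number;
}

// Call `extract` up to maxRetries + 1 times until the output passes the schema.
async function extractWithRetry<T>(
  extract: (attempt: number) => Promise<unknown>,
  schema: SchemaLike<T>,
  maxRetries: number,
): Promise<ValidatedResult<T>> {
  let attempts = 0;
  while (attempts <= maxRetries) {
    attempts += 1;
    const parsed = schema.safeParse(await extract(attempts));
    if (parsed.success) {
      return { isValid: true, data: parsed.data, attempts };
    }
  }
  return { isValid: false, attempts };
}

// Hand-written schema for illustration: requires a numeric `total` field.
const totalSchema: SchemaLike<{ total: number }> = {
  safeParse(input) {
    if (typeof input === 'object' && input !== null) {
      const total = (input as { total?: unknown }).total;
      if (typeof total === 'number') {
        return { success: true, data: { total } };
      }
    }
    return { success: false, error: 'total must be a number' };
  },
};
```

With `maxRetries: 2` the extraction runs at most three times. A real integration would likely feed the validation error back into the next prompt, which this sketch omits.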

OCR Providers

Optional OCR when needed:

  • Tesseract - Free, offline (npm install tesseract.js)
  • Google Vision - Best accuracy (npm install @google-cloud/vision)
  • Azure - Enterprise (npm install @azure/cognitiveservices-computervision @azure/ms-rest-js)

Troubleshooting

Canvas Installation Issues

The canvas package is required for PDF processing and needs native system libraries. If you encounter canvas errors:

Quick Fix

For pnpm users (pnpm rebuild often doesn't work):

Manual rebuild (Recommended):

# Find and rebuild canvas
cd node_modules/.pnpm/canvas@*/node_modules/canvas
npx node-gyp rebuild
cd ../../../../..

Or use this one-liner:

CANVAS_DIR=$(find node_modules/.pnpm -name "canvas" -type d -path "*/node_modules/canvas" | head -1) && cd "$CANVAS_DIR" && npx node-gyp rebuild && cd - > /dev/null

Alternative: Use npm for canvas

npm install canvas --prefix ./temp && cp -r ./temp/node_modules/canvas node_modules/.pnpm/canvas@*/node_modules/ && rm -rf ./temp

Install System Dependencies

macOS:

brew install pkg-config cairo pango libpng jpeg giflib librsvg

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install build-essential libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev

Fedora/CentOS/RHEL:

sudo dnf install cairo-devel pango-devel libjpeg-turbo-devel giflib-devel

Common Solutions

  1. Rebuild canvas (Recommended for pnpm):

    pnpm rebuild canvas

    If that doesn't work, manually rebuild:

    cd node_modules/.pnpm/canvas@*/node_modules/canvas
    npx node-gyp rebuild
    cd ../../../../..
  2. Clean install:

    rm -rf node_modules pnpm-lock.yaml
    pnpm install
    pnpm rebuild canvas
  3. Use npm instead (if pnpm issues persist):

    npm install ai-vision-parser
    npm rebuild canvas
  4. Verify canvas works:

    node -e "const { createCanvas } = require('canvas'); console.log('✅ Canvas works!');"

Apple Silicon (M1/M2)

arch -x86_64 brew install pkg-config cairo pango libpng jpeg giflib librsvg
pnpm rebuild canvas

PDF Rendering Error: "TypeError: Image or Canvas expected"

If you see this error when processing PDFs, it's a critical compatibility issue between pdfjs-dist and canvas 3.x.

REQUIRED: Downgrade canvas to 2.x in your project

Add to your project's package.json:

For pnpm:

{
  "pnpm": {
    "overrides": {
      "ai-vision-parser>canvas": "2.11.2"
    }
  }
}

For npm:

{
  "overrides": {
    "ai-vision-parser": {
      "canvas": "2.11.2"
    }
  }
}

Then reinstall:

rm -rf node_modules pnpm-lock.yaml
pnpm install
# Rebuild canvas 2.x
cd node_modules/.pnpm/canvas@2.11.2*/node_modules/canvas
npx node-gyp rebuild
cd ../../../../..

Verify: pnpm list canvas should show canvas@2.11.2

If still not working, try npm instead of pnpm:

rm -rf node_modules pnpm-lock.yaml
npm install
npm rebuild canvas

pnpm's dependency hoisting can cause issues with native modules like canvas.

Sharp and Canvas Conflict (macOS)

If you see Class GNotificationCenterDelegate is implemented in both sharp and canvas:

Quick fix:

export DYLD_INSERT_LIBRARIES=""
node your-script.js

Or pin compatible versions:

{
  "dependencies": {
    "sharp": "^0.33.0",
    "canvas": "^2.11.2"
  }
}

See Troubleshooting Guide for more solutions.

Image-Only Processing

If you only need image processing (not PDF), canvas is not required. Use only the processImage() method.

For more detailed troubleshooting, see the complete Troubleshooting Guide.

Documentation

Examples

See examples/ directory:

# Basic examples
npm run test:pdf
npm run test:image

# Advanced examples
ts-node examples/example-agent-parser.ts
ts-node examples/example-structured-parser.ts
ts-node examples/example-agent-with-zod.ts
ts-node examples/example-ocr.ts

Development

pnpm install
pnpm run build
pnpm test

License

Apache License 2.0