npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

ocr-provenance-mcp

v1.0.16

Published

MCP server for document OCR processing, semantic search, document comparison, clustering, and provenance tracking

Readme

OCR Provenance MCP Server

Give your AI the power to read, search, and reason over any document.

Watch the video

Watch: System Overview & Installation Guide | How the Storage System Works | Database System Deep Dive

npm License: Dual MCP Tools Tests Docker

"This repo is life changing." -- Kunal Dalal, Agentic Leadership Strategist


Install

Two commands. That's it.

npm install -g ocr-provenance-mcp
ocr-provenance-mcp-setup

The setup wizard walks you through everything: API keys, Docker image, and connecting to your AI client. Takes about 5 minutes.

Prerequisites: Docker Desktop and Node.js >= 20. No Python install, no GPU drivers, no model downloads -- the Docker image handles all of that.


Getting Your API Keys

You need two free API keys. The setup wizard will ask for both and validate them automatically.

1. Datalab API Key (for OCR)

Datalab handles the document-to-text conversion -- 18 file types including PDFs, Word docs, spreadsheets, presentations, and images.

  1. Go to datalab.to
  2. Click Sign Up (or log in if you have an account)
  3. Go to your Account page
  4. Copy your API key

Datalab offers free credits to start. Their OCR supports three accuracy modes (fast, balanced, accurate) and handles scanned documents, handwriting, tables, charts, and multi-column layouts.

2. Gemini API Key (for vision AI)

Gemini Flash 3 analyzes images extracted from your documents -- describing charts, diagrams, photos, and figures so they become searchable too.

  1. Go to Google AI Studio
  2. Sign in with your Google account
  3. Click Get API Key in the left sidebar
  4. Click Create API Key and select a project (or create one)
  5. Copy your API key

The free tier is generous and more than enough to get started. The server uses the gemini-3-flash-preview model for fast, low-cost image analysis.


The Problem This Solves

AI is powerful, but it can't work with messy data. And real-world data is always messy.

Your documents are trapped in PDFs, scanned images, Word files, spreadsheets, and presentations. Claude can't natively read a 500-page contract PDF. It can't search across 3,000 discovery documents. It can't compare two versions of a lease agreement or find every mention of a specific clause across a corpus of legal filings.

This server is a data cleaning pipeline that makes your documents AI-ready. It takes your messy files -- all 18 supported formats -- and converts them into clean, chunked, embedded, searchable text with full provenance tracking. Then it exposes 149 tools that let the AI search, navigate, compare, and reason over that data.

You don't need clean data. You need this system. It gives the AI the tools to clean the data itself.

What Happens When You Point It at a Folder

Your messy files (PDF, DOCX, XLSX, scanned images, presentations...)
    |
    v
  OCR extracts text from every page (Datalab API, 3 accuracy modes)
    |
    v
  Smart chunking splits text into searchable sections
  (respects headings, tables, page boundaries -- not arbitrary splits)
    |
    v
  Each chunk gets a vector embedding (nomic-embed-text-v1.5, 768-dim)
    |
    v
  Images extracted and described by AI (Gemini Flash 3)
    |
    v
  Full-text search index built (BM25 + FTS5)
    |
    v
  SHA-256 provenance chain on every artifact
  (you can prove exactly where every answer came from)
    |
    v
  149 tools available to your AI -- search, compare, cluster, tag,
  extract tables, fill forms, track obligations, verify integrity

The AI can now search your entire document collection with keyword search (exact terms, case numbers, names), semantic search (conceptual queries, paraphrases), or hybrid search (both combined -- recommended). Every result links back to its exact source page and document through the provenance chain.


Who Is This For

Lawyers and Legal Teams

This is where the system has the highest impact. Legal work is document work, and legal documents are notoriously messy -- scanned contracts, discovery dumps with thousands of files, handwritten notes, exhibits in every format imaginable.

With this server, you can tell Claude:

  • "Search all 3,000 discovery documents for references to the March 2024 amendment"
  • "Compare the original contract with the signed version -- what changed?"
  • "Find every document mentioning Dr. Rivera and cluster them by topic"
  • "Which invoices were submitted after the termination date?"
  • "Extract all obligation deadlines from these contracts and show me what's due this month"

The provenance chain is critical for legal work. Every search result traces back to its exact source page and document -- supporting admissibility, audit, and compliance requirements. Export provenance in W3C PROV-JSON format for regulatory filings.

Insurance and Claims

Adjusters reviewing hundreds of pages across multiple providers: medical records, engineering reports, contractor estimates, policy documents. Search semantically across all of them, compare assessments, and cluster findings by category.

Finance and Compliance

Forensic accountants and compliance teams reviewing years of financial records. Cross-reference bank statements, tax returns, and invoices. Flag duplicates, compare year-over-year, and generate compliance reports (SOC2, SOX, HIPAA).

Research and Due Diligence

Literature reviews, real estate due diligence, HR investigations -- any work that requires reading, searching, and comparing large volumes of documents. The AI does the reading; you do the thinking.


Supported File Types (18)

| Category | Formats | |----------|---------| | Documents | PDF, DOCX, DOC, TXT, CSV, MD | | Spreadsheets | XLSX, XLS | | Presentations | PPTX, PPT | | Images | PNG, JPG, JPEG, TIFF, TIF, BMP, GIF, WEBP |

Scanned documents, handwriting, multi-column layouts, tables, and charts are all handled by the Datalab OCR engine.


Quick Start

Once installed, open your AI client and tell it:

"Create a database called my-case, ingest all the PDFs in ~/Documents/case-files, and process everything"

Or step by step with tool calls:

1. ocr_db_create { name: "my-case" }
2. ocr_db_select { database_name: "my-case" }
3. ocr_ingest_directory { directory_path: "/path/to/docs" }
4. ocr_process_pending {}
5. ocr_search { query: "breach of contract" }
6. ocr_rag_context { question: "What were the settlement terms?" }

Each database is fully isolated. Create one per case, project, or client.


Managing Databases and the AI's Context Window

Storage System Explained Database System Deep Dive

Watch: How the Storage System Works -- a walkthrough of how databases isolate your data and keep AI searches focused. Database System Deep Dive -- covers the registry layer, swappable AI brains, context contamination prevention, and the full data pipeline from OCR to embeddings.

The key to getting good results from this system is understanding how databases control what the AI can see. Each database is an isolated SQLite file. When you select a database, all search tools only operate on the documents stored inside it. This is how you manage the AI's context window -- by choosing exactly what data it has access to.

One database per case (recommended starting point)

If you have three legal cases on your computer, create three databases:

Case 1 (1,000 documents)  -->  database: "case-one"
Case 2 (300 documents)    -->  database: "case-two"
Case 3 (500 documents)    -->  database: "case-three"

When you select case-one, the AI's search tools only see those 1,000 documents. It has no knowledge of case two or three. This keeps searches fast and results precise -- no cross-contamination between unrelated matters.

Combining cases for cross-case analysis

You can also create databases that span multiple cases when you need to find connections:

"cases-two-and-three"  -->  contains Case 2 + Case 3 documents
"all-cases"            -->  contains Case 1 + Case 2 + Case 3 documents

When working in all-cases, the AI searches across everything. When working in cases-two-and-three, it only sees those two. You can create as many databases as you need -- you're only limited by disk space.

Why this matters for AI quality

When the AI searches 1,200 documents and narrows down to 4 relevant ones, it can then go read those 4 documents in full to get complete context before answering your question. The database isolation ensures the AI isn't distracted by irrelevant material, which means better answers and more accurate provenance citations.

Capacity

Each database supports up to ~4 million documents before semantic search quality degrades. For most use cases, you'll never hit this limit. If you do, split into multiple databases by topic, date range, or case.

Advanced database management

For power users who need more control, there are 8 additional database management tools beyond the basics:

| Tool | What it does | |------|-------------| | ocr_db_search | Full-text search across all database names, descriptions, and tags | | ocr_db_recent | Quick access to recently used databases | | ocr_db_tag | Add tags and metadata to databases for organization | | ocr_db_archive / ocr_db_unarchive | Hide databases from listings without deleting them | | ocr_db_rename | Rename a database (updates filesystem, registry, and internal metadata) | | ocr_db_summary | Get an AI-readable profile of a database (coverage, quality, file types) | | ocr_db_workspace | Group databases into named workspaces for scoped cross-database search |

Workspaces let you group related databases and search across just that group:

"Create a workspace called 'legal' and add the contracts and compliance databases to it"
ocr_search_cross_db { query: "liability", workspace: "legal" }

Archiving hides a database from default listings and cross-database search without deleting any data. Unarchive to bring it back.

For the complete database system documentation -- registry internals, schema details, safety mechanisms, and usage patterns -- see Database System Guide.


How Search Works

Three modes, all built-in:

| Mode | Best For | How It Works | |------|----------|--------------| | Keyword (BM25) | Exact terms, case numbers, names, dates | FTS5 full-text with porter stemming | | Semantic | Conceptual queries, paraphrases, "find documents about X" | Vector similarity via nomic-embed-text-v1.5 | | Hybrid (default) | Everything else | Keyword + semantic combined via Reciprocal Rank Fusion |

All modes support: local cross-encoder reranking, query expansion (legal/medical synonyms), auto-routing, quality-weighted ranking, page/section/heading filters, document grouping, header/footer exclusion, and provenance inclusion.


Platform Setup

Docker Desktop

| Platform | Install | |----------|---------| | Windows | Docker Desktop for Windows (requires WSL2) | | macOS | Docker Desktop for Mac (Intel and Apple Silicon) | | Linux | Docker Engine -- add your user to the docker group: sudo usermod -aG docker $USER |

Supported AI Clients

The setup wizard auto-detects your platform and registers the server. Supported clients:

| Client | How It's Configured | |--------|---------------------| | Claude Code | Automatic via claude mcp add | | Claude Desktop | JSON config (macOS: ~/Library/Application Support/Claude/claude_desktop_config.json, Windows: %APPDATA%\Claude\claude_desktop_config.json) | | Cursor | ~/.cursor/mcp.json | | VS Code | .vscode/mcp.json (per-project) | | Windsurf | ~/.codeium/windsurf/mcp_config.json |

Docker Installation

The setup wizard handles this automatically, but if you prefer manual configuration, add the following to your client's config file.

Claude Code

claude mcp add ocr-provenance \
  -s user \
  -e DATALAB_API_KEY=your-datalab-key \
  -e GEMINI_API_KEY=your-gemini-key \
  -- docker run -i --rm \
  -v $HOME:/host:ro \
  -v ocr-data:/data \
  ghcr.io/chrisroyse/ocr-provenance:latest

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "ocr-provenance": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "DATALAB_API_KEY",
        "-e", "GEMINI_API_KEY",
        "-v", "/Users/you:/host:ro",
        "-v", "ocr-data:/data",
        "ghcr.io/chrisroyse/ocr-provenance:latest"
      ],
      "env": {
        "DATALAB_API_KEY": "your-datalab-key",
        "GEMINI_API_KEY": "your-gemini-key"
      }
    }
  }
}

Note: The -e DATALAB_API_KEY and -e GEMINI_API_KEY flags in args are required to forward these environment variables into the Docker container. The env field sets them on the host side, but Docker needs explicit -e flags to pass them through.

Cursor

Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "ocr-provenance": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "DATALAB_API_KEY",
        "-e", "GEMINI_API_KEY",
        "-v", "/Users/you:/host:ro",
        "-v", "ocr-data:/data",
        "ghcr.io/chrisroyse/ocr-provenance:latest"
      ],
      "env": {
        "DATALAB_API_KEY": "your-datalab-key",
        "GEMINI_API_KEY": "your-gemini-key"
      }
    }
  }
}

VS Code

Add to .vscode/mcp.json in your workspace:

{
  "inputs": [
    { "id": "datalab-key", "type": "promptString", "description": "Datalab API key", "password": true },
    { "id": "gemini-key", "type": "promptString", "description": "Gemini API key", "password": true }
  ],
  "servers": {
    "ocr-provenance": {
      "type": "stdio",
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "DATALAB_API_KEY",
        "-e", "GEMINI_API_KEY",
        "-v", "/Users/you:/host:ro",
        "-v", "ocr-data:/data",
        "ghcr.io/chrisroyse/ocr-provenance:latest"
      ],
      "env": {
        "DATALAB_API_KEY": "${input:datalab-key}",
        "GEMINI_API_KEY": "${input:gemini-key}"
      }
    }
  }
}

Windsurf

Add to ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "ocr-provenance": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "DATALAB_API_KEY",
        "-e", "GEMINI_API_KEY",
        "-v", "/Users/you:/host:ro",
        "-v", "ocr-data:/data",
        "ghcr.io/chrisroyse/ocr-provenance:latest"
      ],
      "env": {
        "DATALAB_API_KEY": "your-datalab-key",
        "GEMINI_API_KEY": "your-gemini-key"
      }
    }
  }
}

HTTP Mode

For remote or shared deployments, run as an HTTP server:

docker run -d --name ocr-provenance \
  -e DATALAB_API_KEY=your-datalab-key \
  -e GEMINI_API_KEY=your-gemini-key \
  -e MCP_TRANSPORT=http \
  -p 3100:3100 \
  -v $HOME:/host:ro \
  -v ocr-data:/data \
  ghcr.io/chrisroyse/ocr-provenance:latest

Backup and Restore

Your databases live in the ocr-data Docker volume and persist across container restarts.

# Backup all databases to ./backup/
docker run --rm -v ocr-data:/data:ro -v $(pwd)/backup:/backup alpine cp -a /data/. /backup/

# Restore from ./backup/
docker run --rm -v ocr-data:/data -v $(pwd)/backup:/backup:ro alpine cp -a /backup/. /data/

Configuration

API keys are stored at ~/.ocr-provenance/.env (created by the setup wizard, permissions 0600). All other settings can be changed at runtime via the ocr_config_set tool.

| Setting | Default | Description | |---------|---------|-------------| | DATALAB_DEFAULT_MODE | accurate | OCR mode: fast, balanced, or accurate | | DATALAB_MAX_CONCURRENT | 3 | Max concurrent OCR API requests | | EMBEDDING_DEVICE | cpu | cpu, cuda, or mps (auto-detected in Docker) | | EMBEDDING_BATCH_SIZE | 512 | Batch size for embedding generation | | CHUNKING_SIZE | 2000 | Target chunk size in characters | | CHUNKING_OVERLAP_PERCENT | 10 | Overlap between adjacent chunks | | AUTO_CLUSTER_ENABLED | false | Auto-cluster documents after processing | | AUTO_CLUSTER_THRESHOLD | 5 | Minimum documents to trigger auto-clustering |

Environment Variables (Docker)

Override via -e flags when needed:

| Variable | Default | Description | |----------|---------|-------------| | DATALAB_API_KEY | (required) | Datalab OCR API key | | GEMINI_API_KEY | (required) | Google Gemini API key | | MCP_TRANSPORT | stdio | Transport mode: stdio or http | | MCP_HTTP_PORT | 3100 | HTTP server port (when using http transport) | | OCR_PROVENANCE_DATABASES_PATH | /data | Database storage path inside container | | OCR_PROVENANCE_ALLOWED_DIRS | /host,/data | Allowed directories for file access |

HTTP Mode (Remote/Shared Deployment)

docker compose up -d          # CPU mode
docker compose -f docker-compose.gpu.yml up -d   # GPU mode (NVIDIA CUDA)

Health endpoint: GET /health -- MCP endpoint: POST /mcp -- Port: 3100


Tool Reference (149 Tools)

| Tool | Description | |------|-------------| | ocr_db_create | Create a new isolated database with optional tags and metadata | | ocr_db_list | List databases with filtering, sorting, pagination | | ocr_db_select | Select the active database (RAG swap) | | ocr_db_stats | Detailed statistics synced to central registry | | ocr_db_delete | Permanently delete a database with cascade | | ocr_db_search | Full-text search across database names, descriptions, and tags | | ocr_db_recent | Recently accessed databases ordered by last access time | | ocr_db_tag | Add, remove, or set tags and key-value metadata on databases | | ocr_db_archive | Archive a database (hidden from listings, data preserved) | | ocr_db_unarchive | Restore an archived database to active status | | ocr_db_rename | Rename a database (filesystem + registry + internal metadata) | | ocr_db_summary | AI-readable database profile with coverage stats | | ocr_db_workspace | Create and manage workspace groupings |

| Tool | Description | |------|-------------| | ocr_ingest_directory | Scan directory and register documents (18 file types, recursive) | | ocr_ingest_files | Ingest specific files by path | | ocr_process_pending | Full pipeline: OCR -> Chunk -> Embed -> Vector -> VLM (with auto-clustering) | | ocr_status | Check processing status | | ocr_retry_failed | Reset failed documents for reprocessing | | ocr_reprocess | Reprocess with different OCR settings | | ocr_convert_raw | One-off OCR conversion without storing |

| Tool | Description | |------|-------------| | ocr_search | Unified search -- mode: keyword (BM25), semantic (vector), or hybrid (default) | | ocr_rag_context | Assemble hybrid search results into a markdown context block for LLMs | | ocr_search_export | Export results to CSV or JSON | | ocr_fts_manage | Rebuild or check FTS5 index status | | ocr_search_saved | Save, list, get, or execute named searches | | ocr_search_cross_db | BM25 search across all databases simultaneously | | ocr_benchmark_compare | Compare search results across databases |

| Tool | Description | |------|-------------| | ocr_document_list | List documents with status filtering and cursor pagination | | ocr_document_get | Full document details (text, chunks, blocks, provenance) | | ocr_document_delete | Delete document and all derived data (cascade) | | ocr_document_find_similar | Find similar documents via embedding centroid similarity | | ocr_document_structure | Analyze structure: headings/tables/figures, section tree, or outline | | ocr_document_update_metadata | Batch update document metadata fields | | ocr_document_duplicates | Detect exact (hash) and near (similarity) duplicates | | ocr_export | Export document (json/markdown) or corpus (json/csv) | | ocr_document_versions | List all versions of a document by file path | | ocr_document_workflow | Manage workflow states (draft/review/approved/published/archived) |

| Tool | Description | |------|-------------| | ocr_provenance_get | Get the complete provenance chain for any item | | ocr_provenance_verify | Verify integrity through SHA-256 hash chain | | ocr_provenance_export | Export provenance (JSON, W3C PROV-JSON, CSV) | | ocr_provenance_query | Query provenance records with 12+ filters | | ocr_provenance_timeline | View document processing timeline | | ocr_provenance_processor_stats | Aggregate statistics per processor type |

| Tool | Description | |------|-------------| | ocr_document_compare | Text diff + structural metadata diff + similarity ratio | | ocr_comparison_list | List comparisons with optional filtering | | ocr_comparison_get | Full comparison details with diff operations | | ocr_comparison_discover | Auto-discover similar document pairs for comparison | | ocr_comparison_batch | Batch compare multiple document pairs | | ocr_comparison_matrix | NxN pairwise cosine similarity matrix across documents |

| Tool | Description | |------|-------------| | ocr_cluster_documents | Cluster by semantic similarity (HDBSCAN / agglomerative / k-means) | | ocr_cluster_list | List clusters with filtering by run ID or tag | | ocr_cluster_get | Cluster details with member documents | | ocr_cluster_assign | Auto-assign a document to the nearest cluster | | ocr_cluster_reassign | Move a document to a different cluster | | ocr_cluster_merge | Merge two clusters into one | | ocr_cluster_delete | Delete a clustering run |

| Tool | Description | |------|-------------| | ocr_vlm_describe | Describe an image using Gemini 3 Flash | | ocr_vlm_process | VLM-process images (pass document_id for one doc, omit for all pending) | | ocr_vlm_analyze_pdf | Analyze a PDF directly with Gemini 3 Flash (max 20MB) | | ocr_vlm_status | Service status (API config, rate limits, circuit breaker) |

| Tool | Description | |------|-------------| | ocr_image_list | List images extracted from a document | | ocr_image_get | Get image details | | ocr_image_stats | Processing statistics | | ocr_image_delete | Delete images (by image_id or document_id) | | ocr_image_reset_failed | Reset failed images for reprocessing | | ocr_image_pending | List images pending VLM processing | | ocr_image_search | Search images by keyword filters or semantic similarity (mode=keyword/semantic) | | ocr_image_reanalyze | Re-run VLM analysis with a custom prompt |

| Tool | Description | |------|-------------| | ocr_extract_images | Extract images locally (pass document_id for one doc, omit for batch) |

| Tool | Description | |------|-------------| | ocr_chunk_get | Get a chunk by ID with full metadata | | ocr_chunk_list | List chunks with filtering (content type, section path, page, heading) | | ocr_chunk_context | Get a chunk with N neighboring chunks for context | | ocr_document_page | Get all chunks for a specific page number |

| Tool | Description | |------|-------------| | ocr_embedding_list | List embeddings with filtering | | ocr_embedding_stats | Embedding statistics (counts, models, coverage) | | ocr_embedding_get | Get embedding details by ID | | ocr_embedding_rebuild | Re-generate embeddings for specific targets |

| Tool | Description | |------|-------------| | ocr_extract_structured | Extract structured data using a JSON schema | | ocr_extraction_list | List or search structured extractions (filter by document_id or query) | | ocr_extraction_get | Get a structured extraction by ID |

| Tool | Description | |------|-------------| | ocr_form_fill | Fill PDF/image forms via Datalab | | ocr_form_fill_status | Form fill operation status and results |

| Tool | Description | |------|-------------| | ocr_file_upload | Upload to Datalab cloud (deduplicates by SHA-256) | | ocr_file_list | List uploaded files with duplicate detection | | ocr_file_get | File metadata | | ocr_file_download | Get download URL | | ocr_file_delete | Delete file record | | ocr_file_ingest_uploaded | Bridge uploaded files into the document pipeline |

| Tool | Description | |------|-------------| | ocr_tag_create | Create a tag with optional color and description | | ocr_tag_list | List tags with usage counts | | ocr_tag_apply | Apply a tag to any entity (document, chunk, image, cluster, etc.) | | ocr_tag_remove | Remove a tag from an entity | | ocr_tag_search | Find entities by tag name | | ocr_tag_delete | Delete a tag and all associations |

| Tool | Description | |------|-------------| | ocr_guide | AI agent navigation -- inspects system state and recommends next actions | | ocr_document_tables | Extract and parse tables from OCR JSON blocks | | ocr_document_recommend | Get related document recommendations | | ocr_document_extras | Access OCR extras data (charts, links, tracked changes) | | ocr_table_export | Export tables to CSV, JSON, or markdown |

| Tool | Description | |------|-------------| | ocr_evaluate | Evaluate VLM quality (pass image_id, document_id, or neither for all pending) |

| Tool | Description | |------|-------------| | ocr_evaluation_report | Comprehensive OCR + VLM metrics report | | ocr_document_report | Single document report | | ocr_report_overview | Quality and corpus overview (section=quality/corpus/all) | | ocr_cost_summary | Cost analytics by document, mode, month, or total | | ocr_report_performance | Pipeline, throughput, and bottleneck analytics (section=pipeline/throughput/bottlenecks/all) | | ocr_error_analytics | Error/recovery analytics and failure rates | | ocr_trends | Time-series trends (metric=quality/volume, bucketed by time period) |

| Tool | Description | |------|-------------| | ocr_user_info | Get, create, or list users (viewer/reviewer/editor/admin roles) | | ocr_audit_query | Query the user action audit log with filters |

| Tool | Description | |------|-------------| | ocr_annotation_create | Add comments, suggestions, or highlights on documents | | ocr_annotation_list | List annotations with filtering | | ocr_annotation_get | Get annotation details with thread replies | | ocr_annotation_update | Edit an annotation | | ocr_annotation_delete | Delete an annotation | | ocr_annotation_summary | Summary stats for annotations on a document | | ocr_document_lock | Lock a document (exclusive or shared) | | ocr_document_unlock | Release a document lock | | ocr_document_lock_status | Check lock status and holder | | ocr_search_alert_enable | Set up alerts for new content matching a query | | ocr_search_alert_check | Check for new matches since last alert |

| Tool | Description | |------|-------------| | ocr_workflow_submit | Submit a document for review | | ocr_workflow_review | Approve, reject, or request changes | | ocr_workflow_assign | Assign a reviewer to a document | | ocr_workflow_status | Get current workflow state and history | | ocr_workflow_queue | List documents pending review | | ocr_approval_chain_create | Create multi-step approval chains | | ocr_approval_chain_apply | Apply an approval chain to a document | | ocr_approval_step_decide | Record an approval/rejection decision on a step |

| Tool | Description | |------|-------------| | ocr_webhook_create | Register a webhook endpoint (HMAC-SHA256 signed) | | ocr_webhook_list | List registered webhooks | | ocr_webhook_delete | Remove a webhook | | ocr_export_obligations_csv | Export obligations to CSV | | ocr_export_audit_log | Export audit log entries | | ocr_export_annotations | Export document annotations |

| Tool | Description | |------|-------------| | ocr_contract_extract | Extract contract clauses and terms from OCR text | | ocr_obligation_list | List obligations with status filtering | | ocr_obligation_update | Update obligation status or details | | ocr_obligation_calendar | View obligations by due date range | | ocr_playbook_create | Create a clause comparison playbook | | ocr_playbook_compare | Compare document clauses against a playbook | | ocr_playbook_list | List available playbooks | | ocr_document_summarize | Algorithmic document summarization | | ocr_corpus_summarize | Summarize across multiple documents |

| Tool | Description | |------|-------------| | ocr_compliance_report | Generate compliance report (SOC2/SOX) | | ocr_compliance_hipaa | HIPAA-specific compliance checks and export | | ocr_compliance_export | Export provenance with PROV-AGENT metadata and chain-hash verification |

| Tool | Description | |------|-------------| | ocr_health_check | Detect data integrity gaps with optional auto-fix | | ocr_config_get | Get current system configuration | | ocr_config_set | Update configuration at runtime |


Troubleshooting

Make sure Docker Desktop is running. On Linux, check with docker info. On Windows/macOS, open Docker Desktop from the system tray.

Ensure docker is on your PATH: docker --version. On Windows, you may need to restart your terminal after installing Docker Desktop.

Restart your AI client after running ocr-provenance-mcp-setup. MCP clients only load server configs at startup.

  • Datalab: Make sure your key is from datalab.to (not the docs site). Run the setup wizard again to re-enter.
  • Gemini: Make sure your key is from Google AI Studio. Free tier keys work fine.

The default image uses CPU (works great, just slower for embeddings). For GPU:

docker compose -f docker-compose.gpu.yml up -d

Or build a GPU image yourself:

docker build --build-arg COMPUTE=cu124 \
  --build-arg RUNTIME_BASE=nvidia/cuda:12.4.1-runtime-ubuntu22.04 \
  -t ocr-provenance-mcp:gpu .

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    MCP Server (stdio/http)                    │
│  TypeScript + @modelcontextprotocol/sdk                     │
│  149 tools across 29 tool modules                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Ingestion│  │  Search  │  │ Analysis │  │  Reports │   │
│  │ 7 tools  │  │  7 tools │  │ 33 tools │  │  7 tools │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │              │          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   VLM    │  │  Images  │  │  Tags    │  │  Intel   │   │
│  │ 4 tools  │  │  9 tools │  │ 6 tools  │  │  5 tools │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │              │          │
│  ┌────┴──────────────┴──────────────┴──────────────┴────┐   │
│  │             Service Layer (11 domains)                │   │
│  │  OCR · Chunking · Embedding · Search · VLM          │   │
│  │  Provenance · Comparison · Clustering · Gemini      │   │
│  │  Images · Storage                                    │   │
│  └────┬──────────────┬──────────────┬───────────────────┘   │
│       │              │              │                         │
│  ┌────┴────┐   ┌────┴────┐   ┌────┴─────┐                  │
│  │ SQLite  │   │sqlite-vec│   │ FTS5     │                  │
│  │ 28 tbls │   │ vectors  │   │ indexes  │                  │
│  └─────────┘   └─────────┘   └──────────┘                  │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │            Python Workers (9 processes)               │   │
│  │  OCR · Embedding · Clustering · Image Extraction    │   │
│  │  DOCX Extraction · Image Optimizer · Form Fill      │   │
│  │  File Manager · Local Reranker                      │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │            External APIs                              │   │
│  │  Datalab (OCR/Forms) · Gemini 3 Flash (VLM/AI)     │   │
│  │  Nomic embed v1.5 (local, 768-dim)                  │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
File on disk
  │
  ├─ 1. REGISTER ──► documents table (status: pending)
  │                  ├─ file_hash computed (SHA-256)
  │                  ├─ version detection (new vs re-ingested)
  │                  └─ provenance record (type: DOCUMENT, depth: 0)
  │
  ├─ 2. OCR ──────► ocr_results table
  │                  ├─ Datalab API call (fast/balanced/accurate)
  │                  ├─ extracted_text (markdown)
  │                  ├─ json_blocks (structural hierarchy)
  │                  └─ provenance record (type: OCR_RESULT, depth: 1)
  │
  ├─ 3. CHUNK ────► chunks table
  │                  ├─ Hybrid section-aware chunking
  │                  ├─ 2000 chars with 10% overlap
  │                  ├─ section_path, heading_context, content_types
  │                  └─ provenance records (type: CHUNK, depth: 2)
  │
  ├─ 4. EMBED ────► embeddings + vec_embeddings tables
  │                  ├─ Nomic embed v1.5 (768-dim)
  │                  └─ provenance records (type: EMBEDDING, depth: 3)
  │
  ├─ 5. FTS ──────► fts_index (FTS5 virtual table)
  │
  ├─ 6. IMAGES ───► images table
  │   └─ 7. VLM ──► VLM descriptions + embeddings
  │
  ├─ 8. AUTO-CLUSTER (when configured)
  │
  └─ documents.status = 'complete'

| Table | Purpose | |-------|---------| | documents | Source files, file hash, status, page count | | ocr_results | Extracted text, JSON blocks, quality score | | chunks | Text segments with section path, heading, content types | | embeddings | 768-dim vectors with source metadata | | images | Extracted images with VLM descriptions | | extractions | Structured data extractions | | form_fills | Form filling results | | comparisons | Document pair diffs | | clusters | Document groupings by similarity | | document_clusters | Cluster membership | | provenance | Full audit trail with chain hash | | tags | Cross-entity labels | | entity_tags | Tag associations | | saved_searches | Saved search configurations | | uploaded_files | Cloud file tracking | | database_metadata | DB-level settings | | schema_version | Migration tracking | | fts_index_metadata | FTS index state | | users | Multi-user accounts with RBAC | | audit_log | User action audit trail | | annotations | Document annotations (comments, suggestions, highlights) | | document_locks | Collaborative locking | | workflow_states | Document lifecycle state machine | | approval_chains | Multi-step approval workflows | | approval_steps | Individual approval decisions | | obligations | Contract obligations with due dates | | playbooks | Clause comparison playbooks | | webhooks | Event notification endpoints |


AI Agent Hint System

Every tool response includes guidance that tells your AI what to do next. This eliminates dead ends -- the AI always knows how to continue a workflow, recover from errors, and discover related tools.

How It Works

On success, each tool returns a next_steps array with contextual suggestions based on the actual results:

{
  "success": true,
  "data": {
    "documents": [...],
    "total": 42,
    "next_steps": [
      { "tool": "ocr_document_get", "description": "Get details for a specific document by ID" },
      { "tool": "ocr_search", "description": "Search within the corpus" },
      { "tool": "ocr_document_structure", "description": "View a document outline (headings, tables)" }
    ]
  }
}

On error, every response includes a recovery hint pointing to the right tool to fix the problem:

{
  "success": false,
  "error": {
    "category": "DATABASE_NOT_SELECTED",
    "message": "No database selected",
    "recovery": {
      "tool": "ocr_db_select",
      "hint": "Use ocr_db_list to find database names, then ocr_db_select"
    }
  }
}

Context-Aware Suggestions

Next steps adapt to what actually happened:

| Scenario | Suggested next steps | |----------|---------------------| | Search returns 0 results | Try different keywords, ingest more documents | | Search returns 1 result | Get chunk context, view full document, find similar | | Search returns many results | Get chunk context, view document, browse by page | | Health check finds gaps | Process pending, retry failed, run VLM on images | | Health check all clear | Search, browse documents | | Document list empty | Ingest files, scan a directory | | Post-ingestion | Process pending, compare documents |

Error Recovery Categories (25)

Every possible error maps to a specific recovery action:

| Category | Recovery tool | What to do | |----------|--------------|------------| | DATABASE_NOT_FOUND | ocr_db_list | List available databases | | DATABASE_NOT_SELECTED | ocr_db_select | Select a database first | | DOCUMENT_NOT_FOUND | ocr_document_list | Browse available documents | | OCR_API_ERROR | ocr_health_check | Check API key and service status | | OCR_RATE_LIMIT | ocr_process_pending | Wait and retry with lower concurrency | | EMBEDDING_FAILED | ocr_health_check | Check Python worker and GPU memory | | VLM_API_ERROR | ocr_vlm_status | Check Gemini API key and circuit breaker | | VLM_RATE_LIMIT | ocr_vlm_status | Wait for rate limit reset | | GPU_OUT_OF_MEMORY | ocr_config_set | Reduce embedding batch size | | PATH_NOT_FOUND | ocr_guide | Verify file path exists | | VALIDATION_ERROR | ocr_guide | Check parameter types and required fields |

The Guide Tool

ocr_guide is a meta-tool that analyzes your current database state and recommends what to do. Pass an optional intent parameter to get targeted advice:

| Intent | What it suggests | |--------|-----------------| | ingest | File ingestion and processing workflow | | search | Search strategies based on what's indexed | | explore | Browse and discover what's in the corpus | | analyze | Clustering, comparison, duplicate detection | | status | Health check, gap analysis, retry failed | | (none) | Context-aware suggestions based on current state |

The guide also returns workflow_chains -- named multi-step sequences for common tasks:

find_and_read:    ocr_search -> ocr_chunk_context -> ocr_document_page
compare_documents: ocr_comparison_discover -> ocr_document_compare -> ocr_comparison_get
process_new:      ocr_ingest_files -> ocr_process_pending -> ocr_health_check

Development

npm run build             # Build TypeScript
npm test                  # All tests (2,805 across 118 test suites)
npm run lint:all          # TypeScript + Python linting
npm run check             # typecheck + lint + test
src/
  index.ts              # MCP server entry point
  bin.ts                # CLI entry point (stdio)
  bin-http.ts           # HTTP entry point
  bin-setup.ts          # Setup wizard
  tools/                # 27 tool modules + shared.ts
  services/             # Core services (11 domains, 80 files)
  models/               # Zod schemas and TypeScript types
  utils/                # Hash, validation, path sanitization
  server/               # Server state, types, errors
python/                 # 9 Python workers + GPU utils
tests/                  # Unit, integration, e2e, benchmark tests
docs/                   # System documentation and reports

License

This project uses a dual-license model:

  • Free for non-commercial use -- personal projects, academic research, education, non-profits, evaluation, and contributions to this project are all permitted at no cost.
  • Commercial license required for revenue-generating use -- if you use this software to make money (paid services, SaaS, internal tools at for-profit companies, etc.), you must obtain a commercial license from the copyright holder. Terms are negotiated case-by-case and may include revenue sharing or flat-rate arrangements.

See LICENSE for full details. For commercial licensing inquiries, contact Chris Royse at [email protected] or via GitHub.