@reaatech/media-pipeline-mcp-doc-extraction

v0.3.0

Document extraction operations — OCR, table extraction, field extraction, summarization via vision-capable LLMs


Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Document extraction operations — OCR, table extraction, structured field extraction, and content summarization — via provider delegation to vision-capable LLMs with automatic fallback chains.

Installation

npm install @reaatech/media-pipeline-mcp-doc-extraction
# or
pnpm add @reaatech/media-pipeline-mcp-doc-extraction

Feature Overview

  • OCR (Optical Character Recognition) — extract text from document images and PDFs in plain text, markdown, or structured JSON formats
  • Table extraction — extract tables from documents as markdown tables or structured JSON with headers and rows
  • Field extraction — schema-driven extraction of typed fields (string, number, date, boolean, array) from documents
  • Content summarization — summarize document content in multiple lengths (short, medium, long) and styles (bullet-points, paragraph, executive)
  • Multi-provider routing — operation-based lookup with preferred provider selection
  • Automatic fallback — falls back to image.describe capable providers when document-specific providers are unavailable (Google → Anthropic → OpenAI vision)
  • Provider-agnostic — works with Anthropic, Google Document AI, OpenAI, and any conformant provider

Quick Start

import { createDocumentExtractionOperations } from "@reaatech/media-pipeline-mcp-doc-extraction";
import { GoogleProvider } from "@reaatech/media-pipeline-mcp-google";
import { AnthropicProvider } from "@reaatech/media-pipeline-mcp-anthropic";

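// artifactRegistry (ArtifactRegistry) and storage (ArtifactStore) must already exist in your application
const ops = createDocumentExtractionOperations(artifactRegistry, storage);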

// Register providers
ops.registerProvider("google", new GoogleProvider({
  projectId: "my-gcp-project",
  documentAiProcessorId: "processor-id",
}));
ops.registerProvider("anthropic", new AnthropicProvider({
  apiKey: process.env.ANTHROPIC_API_KEY!,
}));

// Extract text from a document image
const text = await ops.ocr({
  artifactId: "scan-123",
  format: "markdown",
  language: "en",
});

// Extract tables from a scanned report
const tables = await ops.extractTables({
  artifactId: "report-456",
  outputFormat: "json",
});

// Extract typed fields using a schema
const fields = await ops.extractFields({
  artifactId: "invoice-789",
  fields: [
    { name: "invoice_number", type: "string", description: "The invoice number" },
    { name: "invoice_date", type: "date", description: "The invoice date" },
    { name: "total", type: "number", description: "The total amount" },
    { name: "is_paid", type: "boolean", description: "Whether the invoice is paid" },
    { name: "line_items", type: "array", description: "Line items" },
  ],
});

// Summarize a long document
const summary = await ops.summarize({
  artifactId: "article-101",
  length: "medium",
  style: "executive",
});

API Reference

createDocumentExtractionOperations(artifactRegistry, storage)

Factory function that creates a DocumentExtractionOperations instance.

function createDocumentExtractionOperations(
  artifactRegistry: ArtifactRegistry,
  storage: ArtifactStore,
): DocumentExtractionOperations;

DocumentExtractionOperations

Main class providing all document extraction and summarization capabilities. Operations delegate to registered providers based on operation type with automatic fallback chains.

class DocumentExtractionOperations {
  constructor(artifactRegistry: ArtifactRegistry, storage: ArtifactStore);

  registerProvider(name: string, provider: MediaProvider): void;

  ocr(config: OCRConfig): Promise<Artifact>;
  extractTables(config: TableExtractionConfig): Promise<Artifact>;
  extractFields(config: FieldExtractionConfig): Promise<Artifact>;
  summarize(config: SummarizeConfig): Promise<Artifact>;
}

Operation Configs

OCRConfig

interface OCRConfig {
  artifactId: string;                  // ID of the document image or PDF
  format?: "plain-text" | "structured-json" | "markdown";  // Output format (default: "plain-text")
  language?: string;                   // Language code (e.g., "en", "es")
  provider?: string;                   // Force specific provider
}

TableExtractionConfig

interface TableExtractionConfig {
  artifactId: string;                  // ID of the document image or PDF
  outputFormat?: "markdown" | "json";  // Output format (default: "markdown")
  provider?: string;                   // Force specific provider
}

FieldExtractionConfig

interface FieldSchema {
  name: string;                        // Field name
  type: "string" | "number" | "date" | "boolean" | "array";  // Field type
  description?: string;                // Human-readable description
}

interface FieldExtractionConfig {
  artifactId: string;                  // ID of the document, text, or image artifact
  fields: FieldSchema[];               // Schema of fields to extract
  provider?: string;                   // Force specific provider
}

SummarizeConfig

interface SummarizeConfig {
  artifactId: string;                                       // ID of the document
  length?: "short" | "medium" | "long";                     // Summary length (default: "medium")
  style?: "bullet-points" | "paragraph" | "executive";      // Summary style (default: "paragraph")
  provider?: string;                                         // Force specific provider
}
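
The defaults noted in the comments above can be applied explicitly on the caller's side. The following is a minimal sketch, not part of the package: `withSummarizeDefaults` is a hypothetical helper, and the interface is copied from the reference above so the snippet stands alone.

```typescript
// Copied from the API reference so the example is self-contained.
interface SummarizeConfig {
  artifactId: string;
  length?: "short" | "medium" | "long";
  style?: "bullet-points" | "paragraph" | "executive";
  provider?: string;
}

// Hypothetical helper: fills in the documented defaults so downstream code
// can rely on length and style always being set.
function withSummarizeDefaults(
  config: SummarizeConfig,
): SummarizeConfig & Required<Pick<SummarizeConfig, "length" | "style">> {
  return {
    ...config,
    length: config.length ?? "medium",
    style: config.style ?? "paragraph",
  };
}

const resolved = withSummarizeDefaults({ artifactId: "report-123", style: "executive" });
console.log(resolved.length, resolved.style); // medium executive
```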

Usage Patterns

OCR with Different Output Formats

// Plain text (default)
const plainText = await ops.ocr({
  artifactId: "doc-1",
  format: "plain-text",
  language: "en",
});

// Markdown with headings preserved
const markdown = await ops.ocr({
  artifactId: "doc-1",
  format: "markdown",
});
console.log(markdown.metadata.confidence); // 0.95
console.log(markdown.metadata.pageCount);  // 3

// Structured JSON with metadata
const structured = await ops.ocr({
  artifactId: "doc-1",
  format: "structured-json",
});
// Returns JSON with text, confidence, and language fields
const parsed = JSON.parse((await storage.get(structured.id)).data.toString());
console.log(parsed.text);
console.log(parsed.confidence);
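
Because `JSON.parse` returns an untyped value, it can help to check the payload before using it. The sketch below assumes the structured output contains exactly the three fields named above (text, confidence, language); `StructuredOcrResult`, `isStructuredOcrResult`, and `parseStructuredOcr` are hypothetical names, and the real payload may carry more fields.

```typescript
// Assumed shape, based only on the three fields named in this README.
interface StructuredOcrResult {
  text: string;
  confidence: number;
  language: string;
}

// Type guard: true only when all three fields exist with the expected types.
function isStructuredOcrResult(value: unknown): value is StructuredOcrResult {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record["text"] === "string" &&
    typeof record["confidence"] === "number" &&
    typeof record["language"] === "string"
  );
}

// Parses the raw artifact text and throws if the shape is not as assumed.
function parseStructuredOcr(raw: string): StructuredOcrResult {
  const parsed: unknown = JSON.parse(raw);
  if (!isStructuredOcrResult(parsed)) {
    throw new Error("Unexpected structured OCR payload");
  }
  return parsed;
}

const ocrResult = parseStructuredOcr('{"text":"Hello","confidence":0.95,"language":"en"}');
console.log(ocrResult.text, ocrResult.confidence);
```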

Table Extraction in Multiple Formats

// Markdown table format
const mdTables = await ops.extractTables({
  artifactId: "report-123",
  outputFormat: "markdown",
});
// Returns markdown table: | Header 1 | Header 2 |\n|----------|----------|\n| Value A  | Value B  |
console.log(mdTables.metadata.tableCount);  // 1
console.log(mdTables.metadata.rowCount);    // 15

// JSON table format
const jsonTables = await ops.extractTables({
  artifactId: "report-123",
  outputFormat: "json",
});
// Returns structured JSON with headers and rows arrays
console.log(jsonTables.metadata.columnCount);  // 3
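
The JSON output is described here only as having headers and rows arrays. As a minimal sketch under that assumption, a JSON table could be rendered back to markdown on the caller's side. The `ExtractedTable` type and the `toMarkdownTable` helper are hypothetical, and the package's actual JSON schema may differ.

```typescript
// Hypothetical shape: one header row plus an array of string rows.
interface ExtractedTable {
  headers: string[];
  rows: string[][];
}

// Escape pipes and flatten newlines so cell text cannot break the table layout.
function escapeCell(cell: string): string {
  return cell.replace(/\|/g, "\\|").replace(/\r?\n/g, " ");
}

function toMarkdownTable(table: ExtractedTable): string {
  const header = `| ${table.headers.map(escapeCell).join(" | ")} |`;
  const divider = `| ${table.headers.map(() => "---").join(" | ")} |`;
  const body = table.rows.map((row) => {
    // Pad short rows and ignore extra cells so every row matches the header width.
    const cells = table.headers.map((_, i) => escapeCell(row[i] ?? ""));
    return `| ${cells.join(" | ")} |`;
  });
  return [header, divider, ...body].join("\n");
}

const sample: ExtractedTable = {
  headers: ["Item", "Qty"],
  rows: [["Widget", "2"], ["A|B"]],
};
console.log(toMarkdownTable(sample));
```

Short rows are padded with empty cells rather than rejected, which suits OCR output where a cell may simply be missing.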

Schema-Driven Field Extraction

const fields = await ops.extractFields({
  artifactId: "invoice-123",
  fields: [
    { name: "invoice_number", type: "string", description: "Invoice number" },
    { name: "invoice_date", type: "date", description: "Date of invoice" },
    { name: "due_date", type: "date", description: "Payment due date" },
    { name: "vendor_name", type: "string", description: "Vendor company name" },
    { name: "vendor_tax_id", type: "string", description: "VAT/GST/Tax ID" },
    { name: "subtotal", type: "number", description: "Subtotal before tax" },
    { name: "tax", type: "number", description: "Tax amount" },
    { name: "total", type: "number", description: "Total including tax" },
    { name: "is_paid", type: "boolean", description: "Payment status" },
    { name: "line_items", type: "array", description: "List of line items" },
  ],
});

const extracted = JSON.parse(
  (await storage.get(fields.id)).data.toString()
);
// {
//   "invoice_number": "INV-2024-001",
//   "invoice_date": "2024-01-15",
//   "total": 1499.99,
//   "is_paid": true,
//   ...
// }
// Missing or unparseable fields are null in the output

console.log(fields.metadata.fieldCount);         // 10
console.log(fields.metadata.extractedFields);    // ["invoice_number", "invoice_date", ...]
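
Since missing or unparseable fields come back as null, callers may want to re-check the parsed object against the same schema before trusting it. The following is a minimal sketch of such a check; `checkField` and `validateExtraction` are hypothetical helpers, and the assumption that dates arrive as parseable strings is based only on the example output above.

```typescript
// Mirrors the FieldSchema interface from the API reference.
interface FieldSchema {
  name: string;
  type: "string" | "number" | "date" | "boolean" | "array";
  description?: string;
}

type FieldValue = string | number | boolean | unknown[] | null;

// Returns the value if it matches the declared type, otherwise null.
function checkField(schema: FieldSchema, value: unknown): FieldValue {
  switch (schema.type) {
    case "string":
      return typeof value === "string" ? value : null;
    case "number":
      return typeof value === "number" && Number.isFinite(value) ? value : null;
    case "boolean":
      return typeof value === "boolean" ? value : null;
    case "array":
      return Array.isArray(value) ? value : null;
    case "date":
      // Assumption: dates are strings that Date.parse understands, e.g. "2024-01-15".
      return typeof value === "string" && !Number.isNaN(Date.parse(value)) ? value : null;
  }
}

// Builds an object with one key per schema entry; anything invalid or absent becomes null.
function validateExtraction(
  schemas: FieldSchema[],
  extracted: Record<string, unknown>,
): Record<string, FieldValue> {
  const result: Record<string, FieldValue> = {};
  for (const schema of schemas) {
    result[schema.name] = checkField(schema, extracted[schema.name]);
  }
  return result;
}

const checked = validateExtraction(
  [
    { name: "invoice_number", type: "string" },
    { name: "total", type: "number" },
    { name: "invoice_date", type: "date" },
    { name: "is_paid", type: "boolean" },
  ],
  { invoice_number: "INV-2024-001", total: "1499.99", invoice_date: "2024-01-15" },
);
console.log(checked);
```

In this example `total` is rejected because it arrived as a string, and `is_paid` is null because it was absent.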

Summarization with Style Options

// Short bullet-point summary
const short = await ops.summarize({
  artifactId: "report-123",
  length: "short",
  style: "bullet-points",
});

// Medium paragraph summary (default)
const medium = await ops.summarize({
  artifactId: "report-123",
  length: "medium",
  style: "paragraph",
});

// Long executive summary for decision-makers
const long = await ops.summarize({
  artifactId: "report-123",
  length: "long",
  style: "executive",
});

console.log(long.metadata.compressionRatio);  // 0.15 (15% of original)
console.log(long.metadata.originalLength);    // byte count of input
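
To make the two metadata values concrete, here is a minimal sketch of how a ratio like this could be computed. It assumes the ratio is summary bytes divided by original bytes, measured as UTF-8; the README only states that originalLength is a byte count, so the exact formula the package uses is an assumption. `byteLength` and `compressionRatio` are hypothetical helpers.

```typescript
// UTF-8 byte length of a string (TextEncoder always encodes as UTF-8).
function byteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Assumption: ratio = summary bytes / original bytes.
function compressionRatio(original: string, summary: string): number {
  const originalBytes = byteLength(original);
  if (originalBytes === 0) return 0; // avoid dividing by zero for empty input
  return byteLength(summary) / originalBytes;
}

const originalText = "a".repeat(200);
const summaryText = "a".repeat(30);
console.log(compressionRatio(originalText, summaryText)); // 0.15
```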

Provider Fallback Chain

Operations automatically try the best-fit provider first, then fall back:

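  1. Document-specific providers (Google Document AI, Anthropic Claude) are tried first for OCR and extraction.
  2. If no document provider is available, the operation falls back to providers that support image.describe (OpenAI GPT-4 Vision).

// GoogleProvider and AnthropicProvider are imported as in Quick Start;
// OpenAIProvider comes from its own provider package (import not shown in this README).
const ops = createDocumentExtractionOperations(artifactRegistry, storage);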

// Register multiple providers — operations route intelligently
ops.registerProvider("google", new GoogleProvider({
  projectId: "my-gcp-project",
  documentAiProcessorId: "processor-id",
}));
ops.registerProvider("anthropic", new AnthropicProvider({
  apiKey: process.env.ANTHROPIC_API_KEY!,
}));
ops.registerProvider("openai", new OpenAIProvider({
  apiKey: process.env.OPENAI_API_KEY!,
}));

// Force a specific provider
const result = await ops.ocr({
  artifactId: "doc-1",
  provider: "anthropic",  // explicitly use Anthropic Claude
});

// Without provider specified, uses best available:
// - document.ocr → tries Google, then Anthropic, then OpenAI vision
// - document.extract_fields → same fallback chain
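
The routing order described in this section can be illustrated with a small standalone sketch. This is not the package's internal implementation: `ProviderEntry` and `selectProvider` are hypothetical, and the sketch assumes each provider advertises the operations it supports as plain strings such as "document.ocr" and "image.describe".

```typescript
// Hypothetical registry entry: a provider name plus the operations it supports.
interface ProviderEntry {
  name: string;
  operations: string[];
}

// Pick a provider for an operation:
// 1. the forced provider, if one is given (error if it is not registered);
// 2. otherwise the first registered provider that supports the operation directly;
// 3. otherwise the first provider that supports the fallback operation.
function selectProvider(
  providers: ProviderEntry[],
  operation: string,
  fallbackOperation: string,
  forced?: string,
): ProviderEntry {
  if (forced !== undefined) {
    const match = providers.find((p) => p.name === forced);
    if (!match) throw new Error(`Provider "${forced}" is not registered`);
    return match;
  }
  const direct = providers.find((p) => p.operations.includes(operation));
  if (direct) return direct;
  const fallback = providers.find((p) => p.operations.includes(fallbackOperation));
  if (fallback) return fallback;
  throw new Error(`No provider available for ${operation}`);
}

const registered: ProviderEntry[] = [
  { name: "openai", operations: ["image.describe"] },
  { name: "anthropic", operations: ["document.ocr", "image.describe"] },
];
console.log(selectProvider(registered, "document.ocr", "image.describe").name); // anthropic
console.log(selectProvider(registered.slice(0, 1), "document.ocr", "image.describe").name); // openai
```

In this sketch, registration order decides ties within each tier, so a provider with direct support always wins over one that can only handle the fallback operation.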

License

MIT