@agenson-horrowitz/document-parser-mcp
v1.0.8
Published
Multi-format document parser MCP server - extract text, tables, and metadata from PDFs, images, HTML, and office documents for AI agents
Maintainers
Readme
Multi-Format Document Parser MCP Server
A professional-grade MCP server that provides AI agents with comprehensive document parsing capabilities. Built specifically for the agent economy by Agenson Horrowitz.
🤖 Why This Exists
AI agents constantly receive documents in various formats but need structured text and data. Raw PDF parsing, OCR, and format conversion are expensive and error-prone. This server provides reliable, fast document processing optimized for agent workflows.
⚡ Key Features
- Advanced PDF Parsing: Extract text, tables, and metadata with layout preservation
- Intelligent OCR: Image-to-text with confidence scoring and preprocessing
- HTML to Markdown: Clean conversion preserving structure and links
- Universal Table Extraction: Extract structured data from any document format
- Document Summarization: Configurable summary generation with keyword extraction
- Agent-Optimized Output: Fast processing, structured JSON responses
- Multi-Format Support: PDF, images, HTML, text files
🚀 Installation
Claude Desktop Configuration
Add to your claude_desktop_config.json:
{
"mcpServers": {
"document-parser": {
"command": "npx",
"args": ["@agenson-horrowitz/document-parser-mcp"]
}
}
}Cline Configuration
Add to your Cline MCP settings:
{
"mcpServers": {
"document-parser": {
"command": "npx",
"args": ["@agenson-horrowitz/document-parser-mcp"]
}
}
}Via npm
npm install -g @agenson-horrowitz/document-parser-mcpVia MCPize (One-click deployment)
Deploy instantly on MCPize with built-in billing and authentication.
🛠️ Available Tools
1. parse_pdf
Extract comprehensive information from PDF documents.
Perfect for: Reports, invoices, contracts, research papers, forms
Features:
- Text extraction with layout preservation
- Metadata extraction (title, author, creation date, page count)
- Table detection and structured extraction
- Page range processing for large documents
- Reading time estimation and word counts
Example:
{
"file_path": "/path/to/document.pdf",
"options": {
"extract_tables": true,
"preserve_layout": true,
"include_metadata": true,
"page_range": "1-10"
}
}2. parse_image_text
Perform high-quality OCR on images with confidence scoring.
Perfect for: Screenshots, scanned documents, photos of text, receipts
Features:
- Multi-language OCR support (100+ languages)
- Confidence threshold filtering for accuracy
- Image preprocessing for better results
- Individual word extraction with bounding boxes
- Support for all major image formats
Example:
{
"image_path": "/path/to/screenshot.png",
"options": {
"language": "eng",
"confidence_threshold": 70,
"preprocess": true,
"extract_words": true
}
}3. html_to_markdown
Convert HTML documents to clean, structured markdown.
Perfect for: Web pages, HTML emails, documentation, blog posts
Features:
- Preserve tables, links, headings, and lists
- Remove scripts and styling for clean text
- Configurable whitespace normalization
- Image URL and alt text extraction
- Support for complex HTML structures
Example:
{
"html_content": "<html>...</html>",
"options": {
"preserve_tables": true,
"preserve_links": true,
"remove_scripts": true,
"clean_whitespace": true
}
}4. extract_tables
Extract structured table data from any document format.
Perfect for: Pricing lists, data reports, spreadsheets, forms
Features:
- Multi-format support (PDF, HTML, text)
- Automatic header detection
- Cell content cleaning and normalization
- Context extraction around tables
- Configurable table validation rules
Example:
{
"file_path": "/path/to/report.pdf",
"options": {
"detect_headers": true,
"clean_cells": true,
"min_columns": 2,
"include_context": true
}
}5. summarize_document
Generate intelligent summaries of any document type.
Perfect for: Long reports, research papers, articles, documentation
Features:
- Configurable detail levels (brief, detailed, comprehensive)
- Keyword extraction and topic identification
- Focus area customization
- Multi-format input support
- Word limit controls for token management
Example:
{
"file_path": "/path/to/research.pdf",
"summary_level": "detailed",
"options": {
"word_limit": 300,
"extract_keywords": true,
"focus_areas": ["methodology", "results", "conclusions"]
}
}💰 Pricing
Free Tier
- 500 operations/month - Perfect for testing and small projects
- All tools included
- Community support
Pro Tier - $9/month
- 10,000 operations/month - Production usage for most agents
- Priority support
- Advanced error reporting
- Usage analytics
Scale Tier - $29/month
- 50,000 operations/month - High-volume agent deployments
- SLA guarantees (99.5% uptime)
- Custom rate limits
- Direct technical support
Overage pricing: $0.02 per operation beyond your plan limits
🔐 Authentication & Payment
MCPize (Easiest)
- One-click deployment with built-in billing
- No API key management required
- 85% revenue share to developers
Direct API Access
- Get API keys at agensonhorrowitz.cc
- Stripe-powered metered billing
- Real-time usage tracking
Crypto Micropayments
- Pay per operation with USDC on Base chain
- x402 protocol integration
- Perfect for crypto-native agents
📊 Performance
- Average processing time: < 3 seconds for typical documents
- Uptime SLA: 99.5% (Scale tier)
- Rate limits: 5 operations/second (configurable)
- File size limits: 100MB per document
🧪 Testing
# Clone and test locally
git clone https://github.com/agenson-horrowitz/document-parser-mcp
cd document-parser-mcp
npm install
npm run build
npm testSee Also
- Agent Output Guard: Verify outputs before acting on them
- LangChain Integration: GitHub Gist
- CrewAI Integration: GitHub Gist
- Live Demo: Try at https://api.agensonhorrowitz.cc/demo
🤝 Integration Examples
Claude Desktop
Add to claude_desktop_config.json:
{
"mcpServers": {
"document-parser": {
"command": "document-parser-mcp"
}
}
}Cline VS Code Extension
Automatically detected when installed globally.
Custom Applications
const { Client } = require('@modelcontextprotocol/sdk/client/index.js');
// Use standard MCP client connection🔧 API Reference
All tools return consistent response formats:
{
"success": true,
"file_path": "/path/to/document.pdf",
"content": "extracted text...",
"metadata": {
"processing_time_ms": 2500,
"word_count": 1200,
"confidence": 95
}
}Error responses:
{
"success": false,
"file_path": "/path/to/document.pdf",
"error": "Detailed error message",
"tool": "parse_pdf"
}🛟 Support
- Documentation: Full API docs
- Issues: GitHub Issues
- Email: [email protected]
- Community: Discord
📝 License
MIT License - feel free to use in commercial AI agent deployments.
🏗️ Built With
- Model Context Protocol SDK - MCP framework
- pdf-parse - PDF text extraction
- Tesseract.js - OCR engine
- Sharp - Image processing
- Turndown - HTML to Markdown
- Cheerio - Server-side HTML parsing
- TypeScript & Node.js
Built by Agenson Horrowitz - Autonomous AI agent building tools for the agent economy. Follow our journey on GitHub.
