@majkapp/majk-chat-document-tools
v1.0.31
Document processing tools for majk chat - PDF, Excel, Word, PowerPoint parsing and analysis
majk-chat-document-tools
Comprehensive document processing package for majk chat that adds support for parsing and analyzing PDF, Excel, Word, PowerPoint, and CSV files.
Features
- Universal Document Analyzer: Automatically detects file types and routes to appropriate parsers
- PDF Parser: Extract text content and metadata from PDF files
- Excel Parser: Parse XLSX/XLS files with sheet analysis and data extraction
- Word Parser: Extract text from DOCX files with multiple output formats
- PowerPoint Parser: Extract slide content, notes, and presentation structure
- CSV Parser: Intelligent CSV parsing with type detection and column analysis
Supported File Formats
| Format | Extensions | Features |
|--------|------------|----------|
| PDF | .pdf | Text extraction, metadata, page-specific parsing |
| Excel | .xlsx, .xls | Sheet parsing, data analysis, multiple output formats |
| Word | .docx | Text/HTML/Markdown output, style extraction |
| PowerPoint | .pptx | Slide content, speaker notes, presentation metadata |
| CSV | .csv, .tsv | Auto-delimiter detection, type inference, column analysis |
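The routing implied by the table above can be sketched as a lookup from file extension to tool name. This is a minimal illustration, not the package's actual detection logic: `pickTool` and `EXTENSION_TO_TOOL` are hypothetical names, and the real analyzer may also inspect file contents.

```typescript
// Hypothetical sketch: extension-based routing to the tool names used in this README.
type ToolName =
  | 'parse_pdf'
  | 'parse_excel'
  | 'parse_word'
  | 'parse_powerpoint'
  | 'parse_csv';

const EXTENSION_TO_TOOL: Record<string, ToolName> = {
  '.pdf': 'parse_pdf',
  '.xlsx': 'parse_excel',
  '.xls': 'parse_excel',
  '.docx': 'parse_word',
  '.pptx': 'parse_powerpoint',
  '.csv': 'parse_csv',
  '.tsv': 'parse_csv',
};

// Returns the tool name for a path, or undefined when the extension is unsupported.
function pickTool(filePath: string): ToolName | undefined {
  const dot = filePath.lastIndexOf('.');
  const slash = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));
  if (dot <= slash) return undefined; // the file name itself has no extension
  return EXTENSION_TO_TOOL[filePath.slice(dot).toLowerCase()];
}
```

Lower-casing the extension means `REPORT.PDF` and `report.pdf` resolve to the same tool.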
Installation
npm install @majkapp/majk-chat-document-tools
Usage
Quick Start with Universal Analyzer
import { DocumentAnalyzerTool } from '@majkapp/majk-chat-document-tools';
const analyzer = new DocumentAnalyzerTool();
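// Automatically detect and parse any supported document.
// `context` is the tool execution context supplied by the caller; it is not defined in this snippet.
const result = await analyzer.execute({
  file_path: './document.pdf',
  analysis_type: 'auto',
  include_metadata: true
}, context);
Individual Tool Usage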
import {
  PdfParserTool,
  ExcelParserTool,
  WordParserTool,
  PowerPointParserV2Tool,
  CsvParserTool
} from '@majkapp/majk-chat-document-tools';
// PDF parsing
const pdfParser = new PdfParserTool();
const pdfResult = await pdfParser.execute({
  file_path: './report.pdf',
  page_range: { start: 1, end: 5 },
  extract_metadata: true
}, context);
// Excel parsing
const excelParser = new ExcelParserTool();
const excelResult = await excelParser.execute({
  file_path: './data.xlsx',
  sheet_name: 'Sales Data',
  output_format: 'json',
  max_rows: 1000
}, context);
Integration with majk-chat Builder
import { MajkChatBuilder } from '@majkapp/majk-chat-core';
import { registerDocumentTools } from '@majkapp/majk-chat-document-tools';
const builder = new MajkChatBuilder()
  .withProvider('anthropic')
  .withModel('claude-3-5-sonnet-20241022');
// Register all document tools
registerDocumentTools(builder.getToolRegistry());
const chat = builder.build();
Tool Specifications
Universal Document Analyzer (analyze_document)
Automatically detects file type and applies the appropriate parser.
Parameters:
- file_path (required): Path to document file
- analysis_type: auto | text_only | structured | metadata
- max_text_length: Maximum text extraction length (default: 50000)
- include_metadata: Extract document metadata (default: true)
- output_format: json | summary | detailed
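As a typed view of the parameters above, the sketch below declares them as a TypeScript interface and fills in the two documented defaults. The names `AnalyzeDocumentParams` and `withDefaults` are hypothetical and are not exported by the package. The defaults for `analysis_type` and `output_format` are not documented here, so the sketch leaves them unset.

```typescript
// Hypothetical typing of the analyze_document parameters listed in this README.
interface AnalyzeDocumentParams {
  file_path: string;
  analysis_type?: 'auto' | 'text_only' | 'structured' | 'metadata';
  max_text_length?: number;
  include_metadata?: boolean;
  output_format?: 'json' | 'summary' | 'detailed';
}

// Applies only the defaults the README documents; caller-supplied values win.
function withDefaults(params: AnalyzeDocumentParams) {
  return { max_text_length: 50000, include_metadata: true, ...params };
}
```

Calling `withDefaults({ file_path: './document.pdf' })` yields an object with `max_text_length` set to 50000 and `include_metadata` set to true.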
PDF Parser (parse_pdf)
Parameters:
- file_path (required): Path to PDF file
- page_range: { start?: number, end?: number }
- extract_metadata: Extract PDF metadata (default: true)
- max_text_length: Text length limit (default: 50000)
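Because both fields of `page_range` are optional, a parser has to resolve the range against the document's page count. The sketch below shows one way to do that. It assumes pages are 1-based and the range is inclusive, which the `{ start: 1, end: 5 }` example suggests but this README does not state. `resolvePageRange` is a hypothetical helper.

```typescript
interface PageRange {
  start?: number;
  end?: number;
}

// Resolves an optional page range to concrete bounds.
// Assumption: pages are 1-based and both bounds are inclusive.
function resolvePageRange(
  range: PageRange | undefined,
  pageCount: number
): { start: number; end: number } {
  const start = Math.max(1, Math.floor(range?.start ?? 1));
  const end = Math.min(pageCount, Math.floor(range?.end ?? pageCount));
  if (start > end) {
    throw new Error(`Empty page range: ${start}-${end} of ${pageCount}`);
  }
  return { start, end };
}
```

Under these assumptions, a missing `end` runs to the last page and an `end` past the last page is clamped to it.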
Excel Parser (parse_excel)
Parameters:
- file_path (required): Path to Excel file
- sheet_name: Specific sheet to parse
- range: Excel range (e.g., "A1:D10")
- header_row: Header row number (default: 1)
- max_rows: Maximum rows to parse (default: 1000)
- output_format: json | csv | table
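To make the `range` parameter concrete, the sketch below decodes an A1-style range such as "A1:D10" into 1-based row and column bounds. It is illustrative only: `parseA1Range` and `columnToIndex` are hypothetical helpers, and how the package itself interprets `range` is not specified in this README.

```typescript
interface CellRange {
  startRow: number;
  startCol: number;
  endRow: number;
  endCol: number;
}

// Converts column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27.
function columnToIndex(letters: string): number {
  const upper = letters.toUpperCase();
  let index = 0;
  for (let i = 0; i < upper.length; i++) {
    index = index * 26 + (upper.charCodeAt(i) - 64);
  }
  return index;
}

// Parses a range like "A1:D10"; throws on anything that is not two cell references.
function parseA1Range(range: string): CellRange {
  const match = /^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$/.exec(range.trim());
  if (!match) throw new Error(`Invalid range: ${range}`);
  return {
    startCol: columnToIndex(match[1]),
    startRow: Number(match[2]),
    endCol: columnToIndex(match[3]),
    endRow: Number(match[4]),
  };
}
```

The `xlsx` dependency provides `XLSX.utils.decode_range` for the same purpose, returning 0-based indices, so a real implementation would more likely call that than reimplement the decoding.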
Word Parser (parse_word)
Parameters:
- file_path (required): Path to Word file
- output_format: plain | html | markdown
- include_images: Process image references (default: false)
- max_text_length: Text length limit (default: 50000)
- extract_styles: Extract style information (default: false)
PowerPoint Parser (parse_powerpoint)
Parameters:
- file_path (required): Path to PowerPoint file
- include_slide_notes: Extract speaker notes (default: true)
- slide_numbers: Array of specific slides to extract
- max_text_length: Text length limit (default: 50000)
- extract_slide_titles: Extract slide titles (default: true)
- include_shapes: Include shape details (default: false)
CSV Parser (parse_csv)
Parameters:
- file_path (required): Path to CSV file
- delimiter: Column delimiter (auto-detected if not provided)
- has_headers: Whether first row contains headers (auto-detected)
- encoding: File encoding (utf8 | ascii | latin1)
- max_rows: Maximum rows to parse (default: 5000)
- output_format: json | table | summary
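Delimiter auto-detection can be done in several ways, and this README does not say which one the package uses. The sketch below shows one common heuristic: try each candidate on the first few lines and keep the one that yields the same field count (greater than one) on every line. `detectDelimiter` is a hypothetical function, and it deliberately ignores quoted fields.

```typescript
// Hypothetical heuristic: prefer the delimiter that splits every sampled line
// into the same number of fields, with the largest field count winning.
// Quoted fields containing delimiters are not handled in this sketch.
function detectDelimiter(
  sample: string,
  candidates: string[] = [',', '\t', ';', '|']
): string {
  const lines = sample
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .slice(0, 10);
  let best = ','; // fall back to a comma when nothing splits consistently
  let bestFields = 1;
  for (const delimiter of candidates) {
    const counts = lines.map((line) => line.split(delimiter).length);
    const first = counts.length > 0 ? counts[0] : 0;
    const consistent = counts.every((count) => count === first);
    if (consistent && first > bestFields) {
      best = delimiter;
      bestFields = first;
    }
  }
  return best;
}
```

Passing the result as the `delimiter` parameter would bypass the parser's own detection, which can help when a file's first lines are ambiguous.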
Context Management Integration
All parsers are designed to work seamlessly with majk-chat's context management system:
- Smart Truncation: Automatically truncates large documents while preserving structure
- Incremental Reading: Supports offset/limit reading for large files via read_tool_result
- Memory Efficient: Processes documents in chunks to avoid memory issues
- Token Optimization: Formats output to minimize token usage while preserving information
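The offset/limit and truncation behavior described above can be illustrated with a small chunk reader. This is a sketch under assumptions, not the behavior of `read_tool_result`: `readChunk` and the `Chunk` shape are hypothetical, limits are counted in characters, and "preserving structure" is approximated by cutting at the last line break inside the window.

```typescript
interface Chunk {
  text: string;
  offset: number;
  nextOffset: number | null; // null once the end of the text has been reached
  total: number;
}

// Returns up to `limit` characters starting at `offset`, preferring to end
// the chunk on a line boundary so that lines are not split mid-way.
function readChunk(text: string, offset = 0, limit = 50000): Chunk {
  const start = Math.min(Math.max(offset, 0), text.length);
  let end = Math.min(start + limit, text.length);
  if (end < text.length) {
    const breakAt = text.lastIndexOf('\n', end - 1);
    if (breakAt > start) end = breakAt + 1; // keep the newline with its line
  }
  return {
    text: text.slice(start, end),
    offset: start,
    nextOffset: end < text.length ? end : null,
    total: text.length,
  };
}
```

A caller can loop by feeding each `nextOffset` back in as the next `offset` until it becomes null.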
Error Handling
All tools provide comprehensive error handling:
- File Not Found: Clear error messages with resolved paths
- Permission Denied: Specific permission error reporting
- Invalid Format: Format validation with supported format guidance
- Parsing Errors: Detailed parsing error information with context
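The four categories above map naturally onto Node.js system error codes plus a catch-all for parser failures. The sketch below shows one possible classification. `classifyError`, `DocumentErrorKind`, and the `UNSUPPORTED_FORMAT` code are hypothetical and do not describe the package's actual error types; `ENOENT`, `EACCES`, and `EPERM` are standard Node.js system error codes.

```typescript
type DocumentErrorKind =
  | 'file_not_found'
  | 'permission_denied'
  | 'invalid_format'
  | 'parsing_error';

interface ClassifiedError {
  kind: DocumentErrorKind;
  message: string;
}

// Maps a thrown value to one of the categories listed in this README.
function classifyError(err: unknown, filePath: string): ClassifiedError {
  const code =
    typeof err === 'object' && err !== null && 'code' in err
      ? (err as { code?: unknown }).code
      : undefined;
  if (code === 'ENOENT') {
    return { kind: 'file_not_found', message: `File not found: ${filePath}` };
  }
  if (code === 'EACCES' || code === 'EPERM') {
    return { kind: 'permission_denied', message: `Permission denied: ${filePath}` };
  }
  if (code === 'UNSUPPORTED_FORMAT') {
    // Hypothetical code a caller could set after an extension check fails.
    return {
      kind: 'invalid_format',
      message: `Unsupported format: ${filePath} (supported: .pdf, .xlsx, .xls, .docx, .pptx, .csv, .tsv)`,
    };
  }
  const detail = err instanceof Error ? err.message : String(err);
  return { kind: 'parsing_error', message: `Failed to parse ${filePath}: ${detail}` };
}
```

Wrapping an `execute` call in `try`/`catch` and passing the caught value to a function like this gives callers a stable `kind` to branch on.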
Dependencies
- pdf-parse: PDF text extraction
- xlsx: Excel/XLSX parsing
- mammoth: Word document processing
- node-pptx-parser: PowerPoint parsing
- csv-parser: CSV parsing and analysis
License
MIT
