@majkapp/majk-chat-document-tools

v1.0.82

Published

a month ago

Document processing tools for majk chat - PDF, Excel, Word, PowerPoint parsing and analysis

juleswhite

majk-chat document-processing pdf excel word powerpoint csv document-analysis

majk-chat-document-tools

A document processing package for majk chat that parses and analyzes PDF, Excel, Word, PowerPoint, and CSV files.

Features

Universal Document Analyzer: Automatically detects the file type and routes each file to the appropriate parser
PDF Parser: Extract text content and metadata from PDF files
Excel Parser: Parse XLSX/XLS files with sheet analysis and data extraction
Word Parser: Extract text from DOCX files with multiple output formats
PowerPoint Parser: Extract slide content, notes, and presentation structure
CSV Parser: Intelligent CSV parsing with type detection and column analysis

Supported File Formats

| Format | Extensions | Features |
|--------|------------|----------|
| PDF | .pdf | Text extraction, metadata, page-specific parsing |
| Excel | .xlsx, .xls | Sheet parsing, data analysis, multiple output formats |
| Word | .docx | Text/HTML/Markdown output, style extraction |
| PowerPoint | .pptx | Slide content, speaker notes, presentation metadata |
| CSV | .csv, .tsv | Auto-delimiter detection, type inference, column analysis |
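
The format table can be read as a lookup from file extension to parser. The sketch below shows one way such routing might look. The map and the `toolForPath` helper are hypothetical; only the extensions and the tool names (`parse_pdf`, `parse_excel`, and so on) come from this README, and the package's actual detection logic may differ, for example by inspecting file contents.

```typescript
// Hypothetical routing table built from the "Supported File Formats" table.
// The tool names come from this README; the lookup itself is an illustration,
// not the package's actual detection logic.
const TOOL_BY_EXTENSION: Record<string, string> = {
  ".pdf": "parse_pdf",
  ".xlsx": "parse_excel",
  ".xls": "parse_excel",
  ".docx": "parse_word",
  ".pptx": "parse_powerpoint",
  ".csv": "parse_csv",
  ".tsv": "parse_csv",
};

// Returns the tool name for a path, or undefined for unsupported extensions.
function toolForPath(filePath: string): string | undefined {
  const dot = filePath.lastIndexOf(".");
  if (dot < 0) return undefined;
  const ext = filePath.slice(dot).toLowerCase();
  return TOOL_BY_EXTENSION[ext];
}
```

Matching is case-insensitive here, so `./report.PDF` and `./report.pdf` would resolve to the same tool under this sketch.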

Installation

npm install @majkapp/majk-chat-document-tools

Usage

Quick Start with Universal Analyzer

import { DocumentAnalyzerTool } from '@majkapp/majk-chat-document-tools';

const analyzer = new DocumentAnalyzerTool();

// Automatically detect and parse any supported document.
// `context` is the second argument to execute(); it is not defined in this snippet.
const result = await analyzer.execute({
  file_path: './document.pdf',
  analysis_type: 'auto',
  include_metadata: true
}, context);

Individual Tool Usage

import { 
  PdfParserTool, 
  ExcelParserTool,
  WordParserTool,
  PowerPointParserV2Tool,
  CsvParserTool 
} from '@majkapp/majk-chat-document-tools';

// PDF parsing
const pdfParser = new PdfParserTool();
const pdfResult = await pdfParser.execute({
  file_path: './report.pdf',
  page_range: { start: 1, end: 5 },
  extract_metadata: true
}, context);

// Excel parsing
const excelParser = new ExcelParserTool();
const excelResult = await excelParser.execute({
  file_path: './data.xlsx',
  sheet_name: 'Sales Data',
  output_format: 'json',
  max_rows: 1000
}, context);

Integration with majk-chat Builder

import { MajkChatBuilder } from '@majkapp/majk-chat-core';
import { registerDocumentTools } from '@majkapp/majk-chat-document-tools';

const builder = new MajkChatBuilder()
  .withProvider('anthropic')
  .withModel('claude-3-5-sonnet-20241022');

// Register all document tools
registerDocumentTools(builder.getToolRegistry());

const chat = builder.build();

Tool Specifications

Universal Document Analyzer (`analyze_document`)

Automatically detects file type and applies the appropriate parser.

Parameters:

file_path (required): Path to document file
analysis_type: auto | text_only | structured | metadata
max_text_length: Maximum text extraction length (default: 50000)
include_metadata: Extract document metadata (default: true)
output_format: json | summary | detailed
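
For callers who want compile-time checking of these parameters, the list above can be transcribed into a type. This is a sketch only: the type and helper names are hypothetical and are not claimed to be exported by the package. The field names, allowed values, and the two documented defaults are taken from the list above.

```typescript
// Parameter shape transcribed from the analyze_document parameter list.
// Type and helper names are hypothetical, not part of the package's API.
type AnalysisType = "auto" | "text_only" | "structured" | "metadata";
type AnalyzerOutputFormat = "json" | "summary" | "detailed";

interface AnalyzeDocumentParams {
  file_path: string; // required
  analysis_type?: AnalysisType;
  max_text_length?: number; // documented default: 50000
  include_metadata?: boolean; // documented default: true
  output_format?: AnalyzerOutputFormat;
}

// Fills in the two documented defaults. Fields with no documented default
// (analysis_type, output_format) are passed through unchanged.
function withAnalyzerDefaults(params: AnalyzeDocumentParams): AnalyzeDocumentParams {
  return {
    ...params,
    max_text_length: params.max_text_length ?? 50000,
    include_metadata: params.include_metadata ?? true,
  };
}
```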

PDF Parser (`parse_pdf`)

Parameters:

file_path (required): Path to PDF file
page_range: { start?: number, end?: number }
extract_metadata: Extract PDF metadata (default: true)
max_text_length: Text length limit (default: 50000)
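
Because both `start` and `end` are optional in `page_range`, a caller may want to know which pages a partial range covers. The sketch below resolves a range against a known page count. It assumes page numbers are 1-based and inclusive, which this README does not state, and `resolvePageRange` is a hypothetical helper rather than part of the package.

```typescript
interface PageRange {
  start?: number;
  end?: number;
}

// Resolves an optional range against the document's page count.
// Assumption: pages are 1-based and the range is inclusive on both ends.
function resolvePageRange(
  range: PageRange | undefined,
  totalPages: number
): { start: number; end: number } {
  const start = Math.max(1, range?.start ?? 1);
  const end = Math.min(totalPages, range?.end ?? totalPages);
  if (start > end) {
    throw new RangeError(`Empty page range: ${start}-${end} of ${totalPages}`);
  }
  return { start, end };
}
```

Under these assumptions, `{ start: 1, end: 5 }` on a three-page file clamps to pages 1 through 3, and an omitted range covers the whole document.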

Excel Parser (`parse_excel`)

Parameters:

file_path (required): Path to Excel file
sheet_name: Specific sheet to parse
range: Excel range (e.g., "A1:D10")
header_row: Header row number (default: 1)
max_rows: Maximum rows to parse (default: 1000)
output_format: json | csv | table
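
The `range` parameter uses spreadsheet A1 notation. As an illustration of what a string like "A1:D10" denotes, the sketch below converts it to 1-based row and column numbers. `parseA1Range` and `columnIndex` are hypothetical helpers written for this example; the package presumably delegates range handling to its Excel dependency.

```typescript
interface CellRange {
  startRow: number;
  startCol: number;
  endRow: number;
  endCol: number;
}

// Converts column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27.
function columnIndex(letters: string): number {
  const upper = letters.toUpperCase();
  let index = 0;
  for (let i = 0; i < upper.length; i++) {
    index = index * 26 + (upper.charCodeAt(i) - 64); // "A" is char code 65
  }
  return index;
}

// Parses a simple two-corner range such as "A1:D10".
// Sheet prefixes ("Sheet1!A1:B2") and absolute markers ("$A$1") are not handled.
function parseA1Range(range: string): CellRange {
  const match = /^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$/.exec(range.trim());
  if (!match) {
    throw new Error(`Unsupported range: ${range}`);
  }
  return {
    startCol: columnIndex(match[1]),
    startRow: Number(match[2]),
    endCol: columnIndex(match[3]),
    endRow: Number(match[4]),
  };
}
```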

Word Parser (`parse_word`)

Parameters:

file_path (required): Path to Word file
output_format: plain | html | markdown
include_images: Process image references (default: false)
max_text_length: Text length limit (default: 50000)
extract_styles: Extract style information (default: false)

PowerPoint Parser (`parse_powerpoint`)

Parameters:

file_path (required): Path to PowerPoint file
include_slide_notes: Extract speaker notes (default: true)
slide_numbers: Array of specific slides to extract
max_text_length: Text length limit (default: 50000)
extract_slide_titles: Extract slide titles (default: true)
include_shapes: Include shape details (default: false)

CSV Parser (`parse_csv`)

Parameters:

file_path (required): Path to CSV file
delimiter: Column delimiter (auto-detected if not provided)
has_headers: Whether first row contains headers (auto-detected)
encoding: File encoding (utf8 | ascii | latin1)
max_rows: Maximum rows to parse (default: 5000)
output_format: json | table | summary
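
To make "auto-detected" delimiters and type inference concrete, here is a deliberately naive sketch of both ideas. It is not the package's algorithm: the candidate list, the function names, and the three-way type split are all assumptions made for illustration, and quoted fields are ignored.

```typescript
// Candidate delimiters to try; this list is an assumption for the sketch.
const CANDIDATE_DELIMITERS = [",", "\t", ";", "|"];

// Picks the candidate that occurs most often in the first line.
// Naive: ignores quoted fields, which a real CSV parser must handle.
function detectDelimiter(sample: string): string {
  const firstLine = sample.split(/\r?\n/)[0] ?? "";
  let best = ",";
  let bestCount = 0;
  for (const delimiter of CANDIDATE_DELIMITERS) {
    const count = firstLine.split(delimiter).length - 1;
    if (count > bestCount) {
      best = delimiter;
      bestCount = count;
    }
  }
  return best;
}

type ColumnType = "number" | "boolean" | "string";

// Infers a column type from its non-empty values; falls back to "string".
function inferColumnType(values: string[]): ColumnType {
  const present = values.filter((v) => v.trim() !== "");
  if (present.length === 0) return "string";
  if (present.every((v) => isFinite(Number(v)))) return "number";
  if (present.every((v) => /^(true|false)$/i.test(v.trim()))) return "boolean";
  return "string";
}
```

A single non-numeric value is enough to make a column fall back to "string" in this sketch, which keeps the inference conservative.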

Context Management Integration

All parsers are designed to work with majk-chat's context management system:

Smart Truncation: Automatically truncates large documents while preserving structure
Incremental Reading: Supports offset/limit reading for large files via read_tool_result
Memory Efficient: Processes documents in chunks to avoid memory issues
Token Optimization: Formats output to minimize token usage while preserving information
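
The truncation and offset/limit ideas above can be sketched in a few lines. Both helpers below are hypothetical and operate on plain strings; they do not reproduce the signature of `read_tool_result` or the package's own truncation rules, which this README does not specify.

```typescript
interface Truncated {
  text: string;
  truncated: boolean;
  omittedChars: number;
}

// Cuts text to at most maxLength characters, preferring the last line break
// inside the limit so that a line is not split in half.
function truncateAtLineBoundary(text: string, maxLength: number): Truncated {
  if (text.length <= maxLength) {
    return { text, truncated: false, omittedChars: 0 };
  }
  const hardCut = text.slice(0, maxLength);
  const lastBreak = hardCut.lastIndexOf("\n");
  const kept = lastBreak > 0 ? hardCut.slice(0, lastBreak) : hardCut;
  return { text: kept, truncated: true, omittedChars: text.length - kept.length };
}

// Returns one window of a large result plus the offset of the next window,
// or null when the end of the text has been reached.
function readWindow(
  text: string,
  offset: number,
  limit: number
): { chunk: string; nextOffset: number | null } {
  if (limit <= 0) {
    throw new RangeError("limit must be positive");
  }
  const chunk = text.slice(offset, offset + limit);
  const next = offset + chunk.length;
  return { chunk, nextOffset: next < text.length ? next : null };
}
```

A caller would loop on `readWindow`, feeding each `nextOffset` back in as the next `offset` until it comes back as null.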

Error Handling

All tools report the following error conditions:

File Not Found: Clear error messages with resolved paths
Permission Denied: Specific permission error reporting
Invalid Format: Format validation with supported format guidance
Parsing Errors: Detailed parsing error information with context
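
As a rough picture of how these four categories might be distinguished, the sketch below maps an error to a message that includes the resolved path. `ENOENT`, `EACCES`, and `EPERM` are standard Node.js file system error codes; the `UNSUPPORTED_FORMAT` code, the `describeDocumentError` helper, and the message wording are assumptions for illustration and are not taken from the package.

```typescript
// Minimal shape of the errors this sketch inspects.
interface FsLikeError {
  code?: string;
  message: string;
}

// Extensions listed in the "Supported File Formats" table.
const SUPPORTED_EXTENSIONS = [".pdf", ".xlsx", ".xls", ".docx", ".pptx", ".csv", ".tsv"];

// Maps an error to one of the four categories listed above.
function describeDocumentError(err: FsLikeError, resolvedPath: string): string {
  switch (err.code) {
    case "ENOENT":
      return `File not found: ${resolvedPath}`;
    case "EACCES":
    case "EPERM":
      return `Permission denied: ${resolvedPath}`;
    case "UNSUPPORTED_FORMAT": // hypothetical code, not defined by Node.js or the package
      return `Invalid format: ${resolvedPath}. Supported: ${SUPPORTED_EXTENSIONS.join(", ")}`;
    default:
      return `Parsing error in ${resolvedPath}: ${err.message}`;
  }
}
```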

Dependencies

pdf-parse: PDF text extraction
xlsx: Excel (XLSX/XLS) parsing
mammoth: Word document processing
node-pptx-parser: PowerPoint parsing
csv-parser: CSV parsing and analysis

License

MIT
