# @lov3kaizen/agentsea-ingest

TypeScript-native document processing pipeline for AI/RAG applications.
## Features

- **Multi-format parsing**: PDF, DOCX, HTML, Markdown, CSV, Excel, JSON
- **Intelligent chunking**: fixed, recursive, sentence, paragraph, semantic, and hierarchical strategies
- **Table & image extraction**: automatic extraction with metadata
- **Text cleaning**: normalization, deduplication, PII removal
- **Flexible pipelines**: configurable processing stages
- **Streaming support**: process large documents efficiently
## Installation

```sh
pnpm add @lov3kaizen/agentsea-ingest
```

## Quick Start

```ts
import { createIngester, pipelines } from '@lov3kaizen/agentsea-ingest';

// Simple ingestion
const ingester = createIngester();
const doc = await ingester.ingestFile('./document.pdf');
console.log(`Extracted ${doc.chunks.length} chunks`);

// RAG-optimized pipeline
const pipeline = pipelines.rag().build();
const result = await pipeline.process({ path: './document.md' });
```
## Parsing Documents

### Supported Formats

| Format   | Parser           | MIME Type                                                                 |
| -------- | ---------------- | ------------------------------------------------------------------------- |
| PDF      | `PDFParser`      | `application/pdf`                                                          |
| DOCX     | `DOCXParser`     | `application/vnd.openxmlformats-officedocument.wordprocessingml.document`  |
| HTML     | `HTMLParser`     | `text/html`                                                                |
| Markdown | `MarkdownParser` | `text/markdown`                                                            |
| Text     | `TextParser`     | `text/plain`                                                               |
| CSV      | `CSVParser`      | `text/csv`                                                                 |
| Excel    | `ExcelParser`    | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`        |
| JSON     | `JSONParser`     | `application/json`                                                         |
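As a point of reference for the table above, these MIME types can be derived from file extensions. The helper below is a hypothetical sketch for illustration only, not part of the package (the library resolves parsers internally via its `ParserRegistry`):

```ts
// Sketch only: map a file extension to the MIME types listed above.
// `mimeForFile` is a hypothetical helper, not a package export.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  html: 'text/html',
  md: 'text/markdown',
  txt: 'text/plain',
  csv: 'text/csv',
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  json: 'application/json',
};

function mimeForFile(path: string): string | undefined {
  const ext = path.split('.').pop()?.toLowerCase() ?? '';
  return MIME_BY_EXTENSION[ext];
}
```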
### Direct Parser Usage

```ts
import {
  createPDFParser,
  createMarkdownParser,
} from '@lov3kaizen/agentsea-ingest';
import { readFileSync } from 'fs';

const pdfParser = createPDFParser();
const buffer = readFileSync('./document.pdf');
const result = await pdfParser.parse(buffer);

console.log(result.text);
console.log(result.elements);
console.log(result.tables);
```
## Chunking Strategies

### Fixed Size

```ts
import { createFixedChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createFixedChunker();
const chunks = chunker.chunk(text, {
  maxTokens: 512,
  overlap: 50,
  splitOnSentences: true,
});
```
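To make the options above concrete, here is a minimal sketch of what fixed-size chunking with overlap does. This illustrates the technique only, not the library's implementation; tokens are approximated by whitespace-separated words:

```ts
// Illustrative only: approximate tokens as words, then slide a window of
// `maxTokens` words forward by (maxTokens - overlap) each step.
function fixedChunks(text: string, maxTokens: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = maxTokens - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(' '));
    if (start + maxTokens >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

With a window of 4 and overlap of 2, each chunk repeats the last two words of the previous one, which helps retrieval when a query matches text near a chunk boundary.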
### Recursive

```ts
import { createRecursiveChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createRecursiveChunker();
const chunks = chunker.chunk(text, {
  maxTokens: 512,
  separators: ['\n\n', '\n', '. ', ' '],
  keepSeparator: true,
});
```
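The idea behind the `separators` list can be sketched as follows: split on the coarsest separator first, and only re-split pieces that are still too long using the next, finer separator. This is a simplified illustration, not the library's code; it measures characters rather than tokens and always drops separators, unlike the `keepSeparator` option:

```ts
// Illustrative only: recursive splitting by progressively finer separators.
function recursiveSplit(text: string, maxLen: number, seps: string[]): string[] {
  if (text.length <= maxLen || seps.length === 0) return [text];
  const [sep, ...finer] = seps;
  return text
    .split(sep)
    .filter((piece) => piece.length > 0)
    .flatMap((piece) => recursiveSplit(piece, maxLen, finer));
}
```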
### Semantic

```ts
import { createSemanticChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createSemanticChunker();
const chunks = await chunker.chunk(text, {
  maxTokens: 512,
  similarityThreshold: 0.5,
  embedFunction: async (text) => myEmbeddingModel(text),
});
```
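The core test a semantic chunker performs can be sketched like this: embed adjacent sentences and start a new chunk whenever their cosine similarity drops below the threshold. This is an illustration of the technique only; real use supplies a model through `embedFunction`:

```ts
// Illustrative only: group consecutive sentence embeddings, starting a new
// group whenever similarity to the previous sentence falls below a threshold.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function semanticGroups(embeddings: number[][], threshold: number): number[][][] {
  const groups: number[][][] = [[embeddings[0]]];
  for (let i = 1; i < embeddings.length; i++) {
    if (cosine(embeddings[i - 1], embeddings[i]) < threshold) {
      groups.push([]); // similarity dropped: start a new chunk
    }
    groups[groups.length - 1].push(embeddings[i]);
  }
  return groups;
}
```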
### Hierarchical

```ts
import { createHierarchicalChunker } from '@lov3kaizen/agentsea-ingest';

const chunker = createHierarchicalChunker();
const chunks = chunker.chunk(markdownText, {
  maxTokens: 512,
  headingLevels: [1, 2, 3],
  includeParentContext: true,
});
```
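A simplified sketch of the heading-based splitting behind this strategy (illustration only; the real chunker also enforces `maxTokens` and can attach parent-heading context):

```ts
// Illustrative only: cut a Markdown document at headings whose level is in
// `levels`, keeping each heading together with the body that follows it.
function splitByHeadings(markdown: string, levels: number[]): string[] {
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of markdown.split('\n')) {
    const match = line.match(/^(#+)\s/);
    if (match && levels.includes(match[1].length) && current.length > 0) {
      sections.push(current.join('\n'));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join('\n'));
  return sections;
}
```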
## Pipeline Builder

```ts
import { createPipelineBuilder } from '@lov3kaizen/agentsea-ingest';

const pipeline = createPipelineBuilder()
  .withName('my-pipeline')
  .withStages(['load', 'parse', 'clean', 'chunk', 'embed'])
  .withChunking({
    strategy: 'semantic',
    maxTokens: 512,
    overlap: 50,
  })
  .withCleaning({
    operations: ['normalize_whitespace', 'remove_urls', 'trim'],
  })
  .withCallbacks({
    onDocumentComplete: (doc) => console.log(`Processed: ${doc.id}`),
  })
  .build();

const result = await pipeline.process({ path: './document.pdf' });
```
## Pre-built Pipelines

```ts
import { pipelines } from '@lov3kaizen/agentsea-ingest';

// Simple text extraction
const simple = pipelines.simple().build();

// Full processing with all stages
const full = pipelines.full().build();

// RAG-optimized pipeline
const rag = pipelines.rag().build();

// Document analysis (no chunking)
const analysis = pipelines.analysis().build();

// OCR pipeline for scanned documents
const ocr = pipelines.ocr().build();
```
## Ingester

The `Ingester` class provides a high-level API for document ingestion:
```ts
import { createIngester } from '@lov3kaizen/agentsea-ingest';

const ingester = createIngester({
  chunking: {
    strategy: 'recursive',
    maxTokens: 512,
  },
  concurrency: 4,
  fileSizeLimit: 10 * 1024 * 1024, // 10 MB
});

// Ingest a single file
const doc = await ingester.ingestFile('./document.pdf');

// Ingest from a URL
const webDoc = await ingester.ingestUrl('https://example.com/page.html');

// Ingest from a buffer
const bufferDoc = await ingester.ingestBuffer(buffer, 'document.pdf');

// Ingest a directory
const results = await ingester.ingestDirectory('./documents', {
  recursive: true,
  include: ['*.pdf', '*.docx'],
  exclude: ['draft-*'],
});
```
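What the `concurrency` option buys can be sketched with a small worker pool: at most N items are in flight at once. This illustrates the pattern, not the ingester's internals; `fn` stands in for per-file processing:

```ts
// Illustrative only: run `fn` over `items` with at most `limit` in flight,
// preserving result order by index.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```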
### Watch Mode

```ts
const ingester = createIngester({
  watchMode: {
    enabled: true,
    paths: ['./documents'],
    include: ['*.pdf', '*.md'],
    debounceDelay: 1000,
    processExisting: true,
  },
});

ingester.startWatching();
// Files added or modified in ./documents are processed automatically
```
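`debounceDelay` exists because editors and file syncs often fire several change events for a single save; debouncing collapses them so a file is reprocessed once after things settle. A generic sketch of the pattern (not the watcher's actual code):

```ts
// Illustrative only: delay `fn` until `delayMs` ms pass without a new call.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  delayMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer); // reset on each new event
    timer = setTimeout(() => fn(...args), delayMs);
  };
}
```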
## Events

```ts
const pipeline = createPipeline(config);
const emitter = pipeline.getEventEmitter();

emitter.on('document:loaded', (event) => {
  console.log(`Loaded: ${event.documentId}`);
});

emitter.on('document:chunked', (event) => {
  console.log(`Created ${event.chunkCount} chunks`);
});

emitter.on('document:completed', (event) => {
  console.log(`Completed: ${event.document.id}`);
});
```
## API Reference

### Types

- `ProcessedDocument` - Processed document with chunks and metadata
- `Chunk` - Text chunk with metadata and optional embedding
- `Element` - Document element (paragraph, heading, list, etc.)
- `TableData` - Extracted table data
- `ImageData` - Extracted image data
- `PipelineConfig` - Pipeline configuration options
- `ChunkingOptions` - Chunking configuration options
### Core Classes

- `Pipeline` - Document processing pipeline
- `PipelineBuilder` - Fluent pipeline builder
- `Ingester` - High-level document ingester
- `ParserRegistry` - Parser management
- `ChunkerRegistry` - Chunker management
### Parsers

- `PDFParser` - PDF document parsing
- `DOCXParser` - Word document parsing
- `HTMLParser` - HTML document parsing
- `MarkdownParser` - Markdown parsing
- `TextParser` - Plain text parsing
- `CSVParser` - CSV file parsing
- `ExcelParser` - Excel file parsing
- `JSONParser` - JSON file parsing
### Chunkers

- `FixedChunker` - Fixed-size chunks
- `RecursiveChunker` - Recursive splitting
- `SentenceChunker` - Sentence-based chunks
- `ParagraphChunker` - Paragraph-based chunks
- `SemanticChunker` - Semantic similarity-based chunks
- `HierarchicalChunker` - Heading-based hierarchy
## License

MIT
