@reaatech/hybrid-rag-ingestion
v0.1.0
Document loading, preprocessing, and chunking strategies for hybrid RAG systems
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Multi-format document loading, preprocessing, validation, and four configurable chunking strategies for hybrid RAG systems. Supports PDF, Markdown, HTML, and plain text with deterministic chunk ID generation.
Installation
npm install @reaatech/hybrid-rag-ingestion
# or
pnpm add @reaatech/hybrid-rag-ingestion
Feature Overview
- Multi-format loading — PDF, Markdown, HTML, and plain text with automatic format detection
- Text preprocessing — Unicode normalization, whitespace normalization, special character handling
- Document validation — duplicate detection via content hashing, file size limits, format verification
- Four chunking strategies — Fixed-Size, Semantic, Recursive, Sliding Window
- Deterministic chunk IDs — reproducible IDs based on document ID + chunk index
- Chunking benchmarks — compare strategies on your own documents using measured statistics such as chunk count and average tokens per chunk
- Typed errors — UnsupportedFormatError, FileSizeExceededError, DocumentParseError
Quick Start
import {
  DocumentLoader,
  TextPreprocessor,
  DocumentValidator,
  chunkDocument,
  ChunkingStrategy,
} from '@reaatech/hybrid-rag-ingestion';
// Load a document
const loader = new DocumentLoader({ allowedFormats: ['pdf', 'md', 'html', 'txt'] });
const doc = await loader.loadFile('./docs/report.pdf');
console.log(`Loaded: ${doc.id}, ${doc.content.length} chars`);
// Validate, and stop early if any check failed
const validator = new DocumentValidator({ maxFileSize: 10 * 1024 * 1024 }); // 10 MB
const validation = validator.validate(doc);
if (!validation.valid) {
  throw new Error(`Validation failed: ${validation.errors.join('; ')}`);
}
// Chunk
const chunks = await chunkDocument(
  doc.content,
  doc.id,
  {
    strategy: ChunkingStrategy.SEMANTIC,
    chunkSize: 512,
    overlap: 50,
    similarityThreshold: 0.5,
  },
  doc.metadata,
);
API Reference
Document Loading
DocumentLoader
| Constructor Option | Type | Default | Description |
|--------------------|------|---------|-------------|
| allowedFormats | string[] | ['pdf','md','html','txt'] | Whitelist of accepted formats |

| Method | Returns | Description |
|--------|---------|-------------|
| loadFile(filePath) | Document | Load and parse a single file |
| loadDirectory(dirPath) | Document[] | Load all supported files in a directory |
Custom Errors
| Error | When |
|-------|------|
| UnsupportedFormatError | File format not in allowedFormats |
| FileSizeExceededError | File exceeds maxFileSize limit |
| DocumentParseError | Parse failure for the detected format |
Preprocessing
TextPreprocessor
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| normalizeUnicode | boolean | true | Normalize to NFC form |
| normalizeWhitespace | boolean | true | Collapse multiple spaces, normalize newlines |
| removeControlChars | boolean | true | Strip non-printable control characters |
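The three options above can be illustrated with a small standalone function. This is a minimal sketch that assumes the options behave as their descriptions suggest; `preprocessText` and `PreprocessOptions` are hypothetical names, and the package's own `TextPreprocessor` may order or implement these steps differently.

```typescript
// Hypothetical stand-in for the documented options; not the package's implementation.
interface PreprocessOptions {
  normalizeUnicode?: boolean;
  normalizeWhitespace?: boolean;
  removeControlChars?: boolean;
}

function preprocessText(text: string, options: PreprocessOptions = {}): string {
  const {
    normalizeUnicode = true,
    normalizeWhitespace = true,
    removeControlChars = true,
  } = options;
  let result = text;
  if (normalizeUnicode) {
    // Compose characters, e.g. "e" + combining acute accent becomes a single "é".
    result = result.normalize("NFC");
  }
  if (removeControlChars) {
    // Strip C0/C1 control characters but keep tab, line feed, and carriage return.
    result = result.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/g, "");
  }
  if (normalizeWhitespace) {
    result = result
      .replace(/\r\n?/g, "\n") // normalize newlines to \n
      .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
      .replace(/ ?\n ?/g, "\n") // drop spaces around newlines
      .replace(/\n{3,}/g, "\n\n") // keep at most one blank line
      .trim();
  }
  return result;
}
```

Running control-character removal before whitespace normalization means that spaces left behind by stripped characters are still collapsed.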
Validation
DocumentValidator
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxFileSize | number | 10 * 1024 * 1024 | Max file size in bytes |
| minContentLength | number | 1 | Minimum document content length |
ValidationResult
| Property | Type | Description |
|----------|------|-------------|
| valid | boolean | Whether the document passed all checks |
| errors | string[] | List of validation error messages |
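The feature list mentions duplicate detection via content hashing. One way this could work is sketched below, assuming a SHA-256 digest of the document content is used as the duplicate key; `DuplicateDetector` and `CheckResult` are hypothetical names, and the hash function the package actually uses is not stated here.

```typescript
import { createHash } from "node:crypto";

// Mirrors the shape of ValidationResult (valid + errors) for illustration only.
interface CheckResult {
  valid: boolean;
  errors: string[];
}

class DuplicateDetector {
  // Maps a content hash to the ID of the first document seen with that content.
  private readonly seen = new Map<string, string>();

  check(docId: string, content: string): CheckResult {
    const hash = createHash("sha256").update(content, "utf8").digest("hex");
    const firstId = this.seen.get(hash);
    if (firstId !== undefined && firstId !== docId) {
      return { valid: false, errors: [`Duplicate content: matches document ${firstId}`] };
    }
    this.seen.set(hash, docId);
    return { valid: true, errors: [] };
  }
}
```

Hashing the raw content only catches byte-identical duplicates, so in a sketch like this it helps to hash the text after preprocessing rather than before.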
Chunking Strategies
All strategies produce Chunk[] with deterministic IDs.
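To make "deterministic" concrete: an ID derived only from the document ID and the chunk index is identical on every run, so re-ingesting an unchanged document overwrites its chunks instead of duplicating them. The sketch below is an assumption about how such an ID could be built; `chunkId` is a hypothetical helper and the package's real ID format may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derives a stable ID from (docId, index) only.
function chunkId(docId: string, index: number): string {
  const digest = createHash("sha256").update(`${docId}:${index}`).digest("hex");
  // Keep the document ID readable and append a short hash suffix.
  return `${docId}-${digest.slice(0, 12)}`;
}
```

Because nothing random or time-based enters the hash, the same inputs always yield the same ID, while a different index yields a different one.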
Fixed-Size
Splits by token count, word count, or character count with configurable overlap.
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.FIXED_SIZE,
  chunkSize: 512, // tokens
  overlap: 50,
});
| Parameter | Description |
|-----------|-------------|
| chunkSize | Target size in tokens |
| overlap | Overlap between consecutive chunks in tokens |
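The interaction between chunkSize and overlap can be sketched as follows. This is an illustration only: it treats whitespace-separated words as a stand-in for tokens, and `fixedSizeChunks` is a hypothetical function, not the package's tokenizer or implementation.

```typescript
// Hypothetical sketch: words stand in for tokens.
function fixedSizeChunks(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be in [0, chunkSize)");
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const step = chunkSize - overlap; // each chunk starts this many tokens after the previous one
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // the last chunk reached the end
  }
  return chunks;
}

// Ten words, chunkSize 4, overlap 1:
// ["a b c d", "d e f g", "g h i j"]
console.log(fixedSizeChunks("a b c d e f g h i j", 4, 1));
```

Each chunk repeats the last `overlap` tokens of the previous one, which is why overlap must stay below chunkSize: otherwise the start position would never advance.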
Semantic
Splits at topic boundaries using sentence-level similarity. Best for long-form content.
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.SEMANTIC,
  chunkSize: 512,
  overlap: 50,
  similarityThreshold: 0.5,
});
| Parameter | Description |
|-----------|-------------|
| similarityThreshold | Minimum similarity for boundary detection (0–1) |
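The idea behind similarity-based boundaries can be shown with a toy example. The sketch below assumes that a boundary is placed wherever the similarity between adjacent sentences drops below similarityThreshold, and it uses Jaccard word overlap as a cheap stand-in for whatever similarity measure the package actually applies. `semanticChunks`, `wordSet`, and `jaccard` are hypothetical names.

```typescript
// Lowercased set of alphanumeric words in a sentence.
function wordSet(sentence: string): Set<string> {
  return new Set<string>(sentence.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range 0 to 1.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const word of a) {
    if (b.has(word)) shared++;
  }
  return shared / (a.size + b.size - shared);
}

// Hypothetical sketch: start a new chunk when adjacent sentences are dissimilar.
function semanticChunks(text: string, similarityThreshold: number): string[] {
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const chunks: string[] = [];
  let current: string[] = [];
  let previous: Set<string> | null = null;
  for (const sentence of sentences) {
    const words = wordSet(sentence);
    if (previous !== null && jaccard(previous, words) < similarityThreshold) {
      chunks.push(current.join(" ")); // similarity dropped: close the current chunk
      current = [];
    }
    current.push(sentence);
    previous = words;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

Under this reading, a higher threshold produces more, smaller chunks, because fewer adjacent sentences count as similar enough to stay together. The sketch ignores chunkSize and overlap, which a real implementation would also need to enforce.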
Recursive
Hierarchical splitting: headers → paragraphs → sentences. Best for structured documents.
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.RECURSIVE,
  chunkSize: 512,
  separators: ['\n## ', '\n', '. '],
});
| Parameter | Description |
|-----------|-------------|
| separators | Splitting delimiters in priority order |
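"Priority order" means that a later separator is tried only when splitting on the earlier ones still leaves a piece that is too large. The sketch below illustrates that recursion under two simplifying assumptions: size is measured in characters rather than tokens, and separators are dropped at chunk boundaries. `recursiveSplit` is a hypothetical function, not the package's implementation.

```typescript
// Hypothetical sketch of recursive splitting with prioritized separators.
function recursiveSplit(text: string, separators: string[], maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  if (text.length <= maxChars) return text.length > 0 ? [text] : [];
  const [separator, ...rest] = separators;
  if (separator === undefined) {
    // No separators left: fall back to a hard split every maxChars characters.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) {
      pieces.push(text.slice(i, i + maxChars));
    }
    return pieces;
  }
  const chunks: string[] = [];
  let buffer = "";
  for (const part of text.split(separator)) {
    // Greedily merge neighbouring parts while the result still fits.
    const candidate = buffer.length > 0 ? buffer + separator + part : part;
    if (candidate.length <= maxChars) {
      buffer = candidate;
      continue;
    }
    if (buffer.length > 0) chunks.push(buffer);
    if (part.length <= maxChars) {
      buffer = part;
    } else {
      // This part alone is too large: split it with the next separator.
      chunks.push(...recursiveSplit(part, rest, maxChars));
      buffer = "";
    }
  }
  if (buffer.length > 0) chunks.push(buffer);
  return chunks;
}

// ["one two three", "four five six.", "seven"]
console.log(recursiveSplit("one two three. four five six.\n\nseven", ["\n\n", ". "], 20));
```

The greedy merge keeps small neighbouring parts together, so coarse separators preserve as much structure as the size limit allows.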
Sliding Window
Fixed window moving by configurable stride. Best for dense retrieval scenarios.
const chunks = await chunkDocument(content, docId, {
  strategy: ChunkingStrategy.SLIDING_WINDOW,
  windowSize: 512,
  stride: 256,
});
| Parameter | Description |
|-----------|-------------|
| windowSize | Size of each window in tokens |
| stride | Step size between windows in tokens |
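How windowSize and stride determine which tokens each window covers can be sketched as below. The example assumes the input is already tokenized and reports token offsets for each window; `slidingWindows` and `TokenWindow` are hypothetical names, not part of the package.

```typescript
// Hypothetical sketch: a window covers tokens[start, end).
interface TokenWindow {
  start: number;
  end: number;
  text: string;
}

function slidingWindows(tokens: string[], windowSize: number, stride: number): TokenWindow[] {
  if (windowSize <= 0 || stride <= 0) {
    throw new RangeError("windowSize and stride must be positive");
  }
  const windows: TokenWindow[] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    const end = Math.min(start + windowSize, tokens.length);
    windows.push({ start, end, text: tokens.slice(start, end).join(" ") });
    if (end === tokens.length) break; // the final window reached the last token
  }
  return windows;
}
```

With windowSize 512 and stride 256, as in the configuration above, each window shares half of its tokens with the next one. A stride larger than the window size would leave gaps between windows, so most setups keep stride at or below windowSize.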
Chunking Engine
ChunkingEngine
Orchestrator that routes to the correct strategy:
| Method | Description |
|--------|-------------|
| chunkDocument(content, docId, config, metadata?) | Main entry point — returns Chunk[] |
| chunkBatch(documents, config) | Process multiple documents in sequence |
ChunkingBenchmark
Compare strategies head-to-head:
import { ChunkingBenchmark } from '@reaatech/hybrid-rag-ingestion';
const benchmark = new ChunkingBenchmark();
const results = await benchmark.benchmark(documents, [
  { name: 'fixed-512', config: { strategy: ChunkingStrategy.FIXED_SIZE, chunkSize: 512, overlap: 50 } },
  { name: 'semantic-512', config: { strategy: ChunkingStrategy.SEMANTIC, chunkSize: 512, overlap: 50 } },
]);
console.table(results.map(r => ({ name: r.name, chunkCount: r.chunkCount, avgTokens: r.avgTokens })));
Related Packages
- @reaatech/hybrid-rag — Core types (Document, Chunk, ChunkingConfig)
- @reaatech/hybrid-rag-retrieval — Retrieval engines (consume chunks)
- @reaatech/hybrid-rag-pipeline — RAGPipeline orchestrator
