@msbayindir/rag-chunker
PDF → Mistral OCR → Deterministic AST Chunker → Contextual RAG
A production-ready pipeline that turns PDFs into retrieval-optimized chunks. Uses Mistral OCR for accurate text extraction, a deterministic AST-based chunker that respects document structure, and optionally enriches each chunk with a context summary following Anthropic's Contextual Retrieval method — reducing retrieval failures by up to 49%.
Highlights
- Mistral OCR 3 as primary OCR provider; Gemini Vision as automatic fallback
- Deterministic AST chunker powered by remark/mdast — no LLM required for chunking, fast and reproducible
- Anthropic Contextual Retrieval — optional batch or per-chunk context enrichment via Gemini
- Heading normalization — two-phase Gemini pipeline fixes inconsistent heading levels from OCR
- Large PDF auto-split — PDFs over 50 MB are automatically split into 40 MB batches and merged seamlessly
- OCR cache — results cached locally (7-day TTL) so the same PDF is never re-processed
- CLI + programmatic API — use as a command-line tool or import into your own pipeline
Installation
npm install @msbayindir/rag-chunker
@google/genai is a required peer dependency and is installed automatically. If you plan to use OpenAI embeddings:
npm install openai
API Keys — Free to Get Started
You can process hundreds of documents without spending a cent.
Mistral AI — Primary OCR
- Go to console.mistral.ai → Sign Up → API Keys
- New accounts receive $5 free credit (no credit card required on sign-up)
- Mistral OCR pricing: ~$1 per 1,000 pages → $5 gets you ~5,000 pages
- More than enough to evaluate and prototype
If you only have a Mistral key, OCR works. Context enrichment and heading normalization require a Gemini key.
Google Gemini — Context Enrichment, Heading Normalization, Fallback OCR
- Go to aistudio.google.com → Get API Key
- Completely free on the Google AI Studio free tier — no credit card required
- Free tier limits:
- gemini-2.0-flash (context default): 1,500 requests/day
- gemini-2.5-flash (context alternative): 500 requests/day
- gemini-2.5-pro (heading normalization phase 1): 50 requests/day
For most documents, context enrichment uses ~20–30 batch requests. You can process 50+ documents per day on the free tier.
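The throughput claim above is simple arithmetic; a quick sanity check, assuming the upper end of the 20–30 batch-request range per document:

```typescript
// Free-tier budget check: gemini-2.0-flash allows 1,500 requests/day.
// At ~30 context-enrichment batch calls per document, that is 50 documents/day.
const requestsPerDay = 1500
const requestsPerDocument = 30
const documentsPerDay = Math.floor(requestsPerDay / requestsPerDocument)
console.log(documentsPerDay) // 50
```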
Quick Start
# Basic — OCR + chunk, no context enrichment
npx rag-chunker process document.pdf -m YOUR_MISTRAL_KEY -o ./output
# With Anthropic-style context enrichment (recommended for RAG)
npx rag-chunker process document.pdf \
-m YOUR_MISTRAL_KEY \
-k YOUR_GEMINI_KEY \
--context-mode batch \
-o ./output
# With heading normalization (useful when OCR produces inconsistent heading levels)
npx rag-chunker process document.pdf \
-m YOUR_MISTRAL_KEY \
-k YOUR_GEMINI_KEY \
--heading-normalization \
-o ./output
# Using environment variables (recommended)
export MISTRAL_API_KEY=your_key
export GEMINI_API_KEY=your_key
npx rag-chunker process document.pdf -o ./output
After processing, ./output/ will contain:
output/
chunks.jsonl ← one chunk per line, ready for embedding
document.md ← full markdown with page markers
structure.json ← headings, tables, page count
manifest.json ← processing stats and metadata
CLI Reference
rag-chunker process <pdf>
Full pipeline: OCR → chunk → optional context enrichment → save.
rag-chunker process <pdf> [options]
| Option | Default | Description |
|--------|---------|-------------|
| -o, --output <dir> | — | Output directory. If omitted, results are not saved to disk. |
| -m, --mistral-api-key <key> | MISTRAL_API_KEY env | Mistral API key — primary OCR provider |
| -k, --gemini-api-key <key> | GEMINI_API_KEY env | Gemini API key — context enrichment, heading fix, fallback OCR |
| --context-mode <mode> | none | none \| batch \| per-chunk |
| --context-model <model> | gemini-2.0-flash | Gemini model for context summaries |
| --context-batch-size <n> | 10 | Chunks per batch in batch mode |
| --max-chunk-tokens <n> | 512 | Max tokens per chunk |
| --min-chunk-tokens <n> | 50 | Min tokens for a chunk to be emitted |
| --overlap-tokens <n> | 0 | Tokens prepended from the previous chunk |
| --no-preserve-tables | — | Do not keep tables in their own chunk |
| --no-preserve-code | — | Do not keep code blocks in their own chunk |
| --heading-normalization | — | Fix inconsistent heading levels (requires --gemini-api-key) |
| --ocr-cache-path <path> | ~/.rag-chunker/ocr-cache.json | Custom OCR cache file path |
| --ocr-cache-ttl <days> | 7 | OCR cache TTL in days |
| --no-ocr-cache | — | Disable OCR caching |
| --warn-large-chunk <n> | 2000 | Warn when a table/code chunk exceeds N tokens |
| --verbose | — | Show verbose pipeline logs |
Large PDF handling: PDFs over 50 MB automatically trigger a confirmation prompt. If confirmed, the file is split into 40 MB batches, each batch is OCR'd sequentially, and results are merged — page numbers and heading hierarchy are preserved across the split.
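The split sizing described above reduces to a ceiling division; an illustrative sketch (planBatches is a hypothetical name, not the package's internal API):

```typescript
const MB = 1024 * 1024

// Files at or under 50 MB are OCR'd in one pass; larger files are cut into
// 40 MB batches that are processed sequentially and merged.
function planBatches(fileSizeBytes: number): number {
  if (fileSizeBytes <= 50 * MB) return 1
  return Math.ceil(fileSizeBytes / (40 * MB))
}

console.log(planBatches(100 * MB)) // a 100 MB PDF becomes 3 batches
```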
rag-chunker ocr <pdf>
Debug command. Runs OCR and prints each page's markdown to stdout.
rag-chunker ocr document.pdf -m YOUR_MISTRAL_KEY
| Option | Description |
|--------|-------------|
| -m, --mistral-api-key <key> | Mistral API key |
| -k, --gemini-api-key <key> | Gemini API key (for Vision fallback) |
rag-chunker chunk <md>
Debug command. Runs the AST chunker on a .md file and prints chunk boundaries. No API key needed.
rag-chunker chunk document.md --max-tokens 512
| Option | Default | Description |
|--------|---------|-------------|
| --max-tokens <n> | 512 | Max tokens per chunk |
| --min-tokens <n> | 50 | Min tokens per chunk |
| --overlap-tokens <n> | 0 | Overlap tokens from previous chunk |
| --no-preserve-tables | — | Do not isolate tables |
| --no-preserve-code | — | Do not isolate code blocks |
rag-chunker inspect <output-dir>
Reads an output directory and prints a summary of its manifest and structure.
rag-chunker inspect ./output
rag-chunker cache list / clear
Manage the local OCR cache.
rag-chunker cache list
rag-chunker cache clear --expired # remove entries older than TTL
rag-chunker cache clear --all # wipe entire cache
| Option | Description |
|--------|-------------|
| --cache-path <path> | Custom cache file path |
| --ttl <days> | TTL for expired check (default: 7) |
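The --expired check amounts to comparing an entry's age against the TTL; an illustrative sketch (isExpired is a hypothetical helper, not the package's internals):

```typescript
// An entry is expired when its age exceeds the TTL (default 7 days).
function isExpired(entryCreatedMs: number, nowMs: number, ttlDays = 7): boolean {
  const ttlMs = ttlDays * 24 * 60 * 60 * 1000
  return nowMs - entryCreatedMs > ttlMs
}
```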
Programmatic API
process(pdfInput, config)
Full pipeline. Returns a ProcessResult with chunks, markdown, structure, manifest, and a save() method.
import { process } from '@msbayindir/rag-chunker'
const result = await process('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
geminiApiKey: process.env.GEMINI_API_KEY,
contextMode: 'batch',
maxChunkTokens: 512,
})
// Save all output files
await result.save('./output')
// Or work with chunks directly
for (const chunk of result.chunks) {
console.log(chunk.content) // embed this
console.log(chunk.sectionPath) // breadcrumb path
console.log(chunk.pageNumber)
}
chunk(pdfInput, config)
Convenience wrapper. Forces contextMode: 'none' and returns only Chunk[].
import { chunk } from '@msbayindir/rag-chunker'
const chunks = await chunk('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
maxChunkTokens: 512,
})
Full ChunkerConfig Reference
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| mistralApiKey | string | — | Mistral API key — primary OCR |
| geminiApiKey | string | — | Gemini API key — context, heading fix, fallback OCR |
| contextMode | 'none' \| 'batch' \| 'per-chunk' | 'none' | Context enrichment mode |
| contextModel | string | 'gemini-2.0-flash' | Gemini model for context summaries |
| contextBatchSize | number | 10 | Chunks per Gemini batch call |
| contextConcurrency | number | 2 | Max concurrent calls in per-chunk mode |
| maxChunkTokens | number | 512 | Max tokens per chunk |
| minChunkTokens | number | 50 | Min tokens per chunk |
| overlapTokens | number | 0 | Overlap tokens prepended from previous chunk |
| preserveTables | boolean | true | Keep tables in their own chunk |
| preserveCodeBlocks | boolean | true | Keep code blocks in their own chunk |
| ocrCachePath | string \| false | ~/.rag-chunker/ocr-cache.json | Cache path. false to disable. |
| ocrCacheTtlDays | number | 7 | OCR cache TTL in days |
| headingNormalization | boolean | false | Fix OCR heading levels via Gemini |
| headingFixPhase1Model | string | 'gemini-2.5-pro' | Model for structure discovery phase |
| headingFixPhase2Model | string | 'gemini-2.5-flash-preview-05-20' | Model for correction phase |
| warnLargeChunkTokens | number | 2000 | Warn threshold for oversized preserved chunks |
| embeddingProvider | IEmbeddingProvider | — | Embedding provider (optional) |
| logger | ILogger | pino at INFO | Custom logger |
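Spelled out as one object, a config using the table above might look like this. This is a sketch: every value shown is the documented default except the API key, and you only need to set the keys you want to override.

```typescript
// ChunkerConfig fragment with the documented defaults written out explicitly.
const config = {
  mistralApiKey: process.env.MISTRAL_API_KEY, // required for OCR
  contextMode: 'none' as const,               // 'none' | 'batch' | 'per-chunk'
  contextModel: 'gemini-2.0-flash',
  contextBatchSize: 10,
  contextConcurrency: 2,
  maxChunkTokens: 512,
  minChunkTokens: 50,
  overlapTokens: 0,
  preserveTables: true,
  preserveCodeBlocks: true,
  ocrCacheTtlDays: 7,
  headingNormalization: false,
  warnLargeChunkTokens: 2000,
}
```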
Embedding Providers
Built-in providers can optionally generate embeddings inline during the pipeline.
import {
createGeminiEmbeddingProvider,
createOpenAiEmbeddingProvider,
createNullEmbeddingProvider,
} from '@msbayindir/rag-chunker'
// Gemini — 1536 dimensions
const geminiProvider = createGeminiEmbeddingProvider({
apiKey: process.env.GEMINI_API_KEY!,
})
// OpenAI text-embedding-3-large — 3072 dimensions (requires: npm install openai)
const openaiProvider = createOpenAiEmbeddingProvider({
apiKey: process.env.OPENAI_API_KEY!,
})
// Null — returns empty vectors, useful for testing
const nullProvider = createNullEmbeddingProvider()
const result = await process('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
embeddingProvider: openaiProvider,
})
// result.chunks[0].embedding → number[]
Note: Embedding is non-fatal. If the provider throws, the chunk is still returned with embedding: [].
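If you rely on this behavior, you can collect the chunks whose embedding came back empty and retry just those (failedEmbeddings is an illustrative helper, not part of the package):

```typescript
interface EmbeddedChunk {
  chunkId: string
  embedding: number[]
}

// Return the IDs of chunks whose embedding failed (empty vector) so the
// caller can re-embed only those.
function failedEmbeddings(chunks: EmbeddedChunk[]): string[] {
  return chunks.filter((c) => c.embedding.length === 0).map((c) => c.chunkId)
}
```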
Output Files
chunks.jsonl
One JSON object per line. Each line is a Chunk:
{
"chunkId": "a3f7c2d1...", // SHA-256(rawContent), first 32 hex chars — deterministic
"index": 0, // 0-based position in chunk array
"content": "Context: This section covers sodium metabolism in the electrolyte chapter.\n\n## Sodium Metabolism\n\nSodium is...",
"rawContent": "## Sodium Metabolism\n\nSodium is...", // pure markdown, no context
"contextSummary": "This section covers sodium metabolism in the electrolyte chapter.",
"tokenCount": 412,
"contentType": "text", // "text" | "table" | "code" | "mixed"
"sectionPath": ["Electrolyte Disorders", "Sodium Metabolism"],
"pageNumber": 14, // 1-based, from OCR page markers
"prevChunkId": "9f1a...",
"nextChunkId": "c82b...",
"mustPreserve": false, // true for table/code chunks that can't be split
"embedding": [] // number[] if embeddingProvider was configured
}
document.md
Full document markdown with page markers:
<!-- page 1 -->
# Chapter 1
Content...
<!-- page 2 -->
## Section 1.1
structure.json
Document structure extracted from markdown:
{
"headings": [
{ "level": 1, "text": "Chapter 1", "pageNumber": 1, "markdownLine": 3 }
],
"tables": [
{ "index": 0, "caption": "Table 1. Electrolyte values", "pageNumber": 5, "rowCount": 8, "columnCount": 3 }
],
"tableCount": 12,
"codeBlockCount": 0,
"pageCount": 192,
"totalTokens": 68432
}
manifest.json
Processing metadata:
{
"version": "3.0",
"processedAt": "2026-03-16T00:20:51.118Z",
"pdfHash": "0f2860b4...",
"ocrModel": "mistral-ocr-latest",
"contextModel": "gemini-2.0-flash",
"contextMode": "batch",
"chunkStats": {
"total": 183, "avgTokens": 378,
"minTokens": 50, "maxTokens": 2662,
"tableChunks": 25, "codeChunks": 0, "textChunks": 158, "mixedChunks": 0
},
"durationMs": 146689,
"ocrCacheHit": true,
"headingFix": null, // populated if --heading-normalization was used
"contextEnrichment": {
"model": "gemini-2.0-flash",
"chunksEnriched": 176,
"chunksSkipped": 7, // low-quality OCR artifacts skipped
"batchCalls": 19,
"durationMs": 146005,
"cacheUsed": true // whether Gemini CachedContent was used
}
}
Contextual Retrieval — What to Embed and Why
The Problem with Naive Chunking
When you split a document into chunks and embed them independently, each chunk loses its context. A chunk containing "It was founded in 1987 and has since expanded to 42 countries" provides no signal for a query about a specific organization — the surrounding context that identifies the subject is gone.
This is the core retrieval failure in most RAG systems.
Anthropic's Solution — and the Numbers
In September 2024, Anthropic published Contextual Retrieval, showing that prepending a short, situating summary to each chunk before embedding significantly improves retrieval:
| Method | Retrieval Failure Reduction |
|--------|-----------------------------|
| Basic semantic embedding | baseline |
| Contextual embedding | 49% fewer failures |
| Contextual embedding + BM25 hybrid | 67% fewer failures |
The approach: for each chunk, generate a 1–2 sentence summary that situates it within the document — what section it belongs to, what topic it covers. Prepend this to the chunk content before embedding.
How This Package Implements It
When --context-mode batch (or per-chunk) is used:
- The full document markdown is sent to Gemini as context
- For each chunk, a 1–2 sentence summary is generated in the same language as the document
- The contextSummary field is stored separately
- content is assembled as: "Context: <summary>\n\n<rawContent>"
This matches Anthropic's exact format.
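As a sketch, the assembly step amounts to the following (assembleContent is an illustrative name, not the package's API):

```typescript
// Prepend the situating summary in Anthropic's "Context: ..." format;
// chunks without a summary keep their raw markdown unchanged.
function assembleContent(rawContent: string, contextSummary?: string): string {
  if (!contextSummary) return rawContent
  return `Context: ${contextSummary}\n\n${rawContent}`
}
```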
Which Field Should You Embed?
| Your setup | Embed this field |
|-----------|-----------------|
| --context-mode batch or per-chunk (recommended) | content — includes the context summary |
| --context-mode none (default) | content or rawContent — they are identical |
| Hybrid search: vector + BM25/keyword | content for vector index, rawContent for keyword index |
// Reading chunks.jsonl and embedding
import { createReadStream } from 'fs'
import { createInterface } from 'readline'
const rl = createInterface({ input: createReadStream('./output/chunks.jsonl') })
for await (const line of rl) {
const chunk = JSON.parse(line)
const textToEmbed = chunk.content // always use content
// → send to your vector database
}
Should You Strip Markdown Before Embedding?
Modern embedding models (OpenAI text-embedding-3-large, Gemini embedding-001, Cohere embed-v3) are trained on web-scale data that includes markdown. They handle #, **, | and similar syntax gracefully — stripping is generally not necessary.
Research findings:
- Nussbaum et al. (2024) — Nomic Embed: Training a Reproducible Long Context Text Embedder — shows retrieval quality is robust to markdown formatting in general-purpose embedders
- Muennighoff et al. (2022) — MTEB: Massive Text Embedding Benchmark — the benchmark includes mixed-format text; top models perform well without preprocessing
- For table-heavy content, stripping | separators can marginally improve semantic similarity scores; for prose, the effect is negligible
This package does not strip markdown. If your downstream embedding model or use case requires clean text, strip it yourself before embedding — this is intentionally left to the caller:
// Optional: strip markdown before embedding (your responsibility)
function stripMarkdown(text: string): string {
return text
.replace(/#{1,6}\s+/g, '') // headings
.replace(/\*\*([^*]+)\*\*/g, '$1') // bold
.replace(/_([^_]+)_/g, '$1') // italic
.replace(/`([^`]+)`/g, '$1') // inline code
.replace(/\|/g, ' ') // table separators
.replace(/\s+/g, ' ').trim()
}
const textToEmbed = stripMarkdown(chunk.content)
Advanced Usage
Custom Logger
Replace the default pino logger with your own:
import { process } from '@msbayindir/rag-chunker'
const result = await process('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
logger: {
debug: (msg, meta) => console.debug('[rag]', msg, meta),
info: (msg, meta) => console.info('[rag]', msg, meta),
warn: (msg, meta) => console.warn('[rag]', msg, meta),
error: (msg, meta) => console.error('[rag]', msg, meta),
}
})
Disable or Customize OCR Cache
// Disable caching entirely
const result = await process('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
ocrCachePath: false,
})
// Use a project-local cache instead of ~/.rag-chunker/
const result = await process('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
ocrCachePath: './.rag-cache/ocr.json',
ocrCacheTtlDays: 30,
})
Heading Normalization
OCR pipelines often produce inconsistent heading levels — a section title becomes # Title on one page and ## Title on another, or numbered sections (1.1, 1.2) get assigned the wrong level.
The two-phase normalization process:
- Phase 1 (Gemini Pro): Sends the full document to discover structure — document type, main sections, numbering patterns
- Phase 2 (Gemini Flash): Sends the discovered structure + heading list (not the full document) and corrects each heading's level
Enable it when:
- The source PDF has complex, multi-level structure (textbooks, technical reports)
- OCR produces mismatched heading levels
- You need accurate sectionPath breadcrumbs in your chunks
rag-chunker process document.pdf \
-m YOUR_MISTRAL_KEY \
-k YOUR_GEMINI_KEY \
--heading-normalization \
-o ./output
Phase 1 uses gemini-2.5-pro (50 requests/day on the free tier). For a 200-page document, this is 1 request, so you can process up to 50 documents per day with heading normalization on the free tier.
Process a Buffer Instead of a File Path
import { readFileSync } from 'fs'
import { process } from '@msbayindir/rag-chunker'
const pdfBuffer = readFileSync('document.pdf')
const result = await process(pdfBuffer, {
mistralApiKey: process.env.MISTRAL_API_KEY,
})
Cost Reference
| Setup | Approx. cost per 200-page document |
|-------|-------------------------------------|
| OCR only (Mistral) | ~$0.05 |
| OCR + context enrichment (batch) | ~$0.15–0.22 |
| OCR + heading normalization | ~$0.06 |
| OCR + context + heading normalization | ~$0.23 |
| v2 equivalent (LLM-based OCR + chunking) | ~$0.95 |
Requirements
- Node.js ≥ 20
- At least one of: MISTRAL_API_KEY or GEMINI_API_KEY
License
MIT
