adaptive-chunker
Smart, document-aware text chunking for RAG and LLM pipelines. It adaptively detects content types (Markdown, code, HTML, dialogue, LaTeX, logs, emails, plain text) and applies the most appropriate strategy, cascading to sensible fallbacks to respect token limits while preserving structure.
Installation
npm install adaptive-chunker
# or
yarn add adaptive-chunker
# or
pnpm add adaptive-chunker
Quick start
Import the high-level APIs and chunk some text.
import { chunkText, streamChunkText } from "adaptive-chunker";
// Optional: deep import if you prefer
// import { chunkText, streamChunkText } from "adaptive-chunker/chunk";
const text = "# Title\n\nThis is an example paragraph. It has some sentences.";
// Synchronous materialization
const chunks = chunkText(text, { maxTokens: 256, overlap: 0 });
console.log(chunks);
// Streaming (async generator)
for await (const chunk of streamChunkText(text, { maxTokens: 256 })) {
console.log(chunk);
}
Options (ChunkingOptions)
You can pass options to both chunkText and streamChunkText:
- maxTokens: Maximum estimated tokens per chunk. Default: 256.
- overlap: Desired token overlap between successive chunks. Default: 0.
- tokenizer: Optional function to estimate token counts. Defaults to a lightweight internal countTokens heuristic; see the sketch after this list.
- allowFallback: Whether strategies may cascade to smaller units when a block exceeds maxTokens. Default: true.
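For example, you can supply your own estimator. A minimal sketch, assuming the tokenizer receives a string and returns an estimated token count (the exact signature the package expects is not spelled out here):
import { chunkText } from "adaptive-chunker";
// Hypothetical rough estimator: about four characters per token.
const roughTokenizer = (text) => Math.ceil(text.length / 4);
const longText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ".repeat(100);
const chunks = chunkText(longText, { maxTokens: 512, tokenizer: roughTokenizer });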
Example:
const chunks = chunkText(longText, {
maxTokens: 512,
overlap: 32,
allowFallback: true,
});
Strategies
Adaptive strategy
The default behavior uses an adaptive router that inspects the text and chooses a document-type strategy, in the following priority:
- Markdown
- Code
- HTML/XML
- Dialogue/Transcript
- LaTeX/Scientific
- Logs
- Emails
- Plain text (default)
Oversized blocks (relative to maxTokens) cascade to fallbacks (e.g., paragraphs → sentences → fixed-size) when allowFallback is enabled.
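For instance, you can disable the cascade to keep oversized structural blocks intact. A sketch, under the assumption that un-split blocks are then emitted as-is:
import { chunkText } from "adaptive-chunker";
const doc = "## Section\n\n" + "A fairly long paragraph with many sentences. ".repeat(40);
// With fallback disabled, oversized blocks are presumably not split further,
// so individual chunks may exceed the maxTokens budget.
const chunks = chunkText(doc, { maxTokens: 128, allowFallback: false });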
Use adaptive explicitly (it is the default):
import { chunkText } from "adaptive-chunker";
import { adaptiveStrategy } from "adaptive-chunker/core/strategies/adaptive"; // optional explicit
const chunks = chunkText(text, { maxTokens: 256 }, adaptiveStrategy);
Document-type strategies
You can opt into a specific document-type strategy when you know the input’s structure:
- markdownStrategy: Headings, fenced code blocks, lists, tables, paragraphs.
- codeStrategy: Function/class/indentation blocks; falls back to lines.
- htmlStrategy: <p>, <div>, <section>, <pre>, <code>, <table> blocks.
- dialogueStrategy: Speaker turns like Speaker:, Q:, A:.
- latexStrategy: \section{}, \subsection{}, environments, $$...$$.
- logsStrategy: Log lines with timestamps/levels.
- emailStrategy: Headers, quoted replies (>), body paragraphs.
- plainTextStrategy: Paragraph-based for unstructured text.
Usage:
import { chunkText } from "adaptive-chunker";
import { markdownStrategy } from "adaptive-chunker";
const chunks = chunkText(markdownDoc, { maxTokens: 400 }, markdownStrategy);
Fallback strategies
Lower-level, structure-preserving strategies that many doc-type strategies fall back to:
- paragraphStrategy: Splits on paragraphs; falls back to sentences, then fixed.
- sentenceStrategy: Splits on sentences; falls back to fixed.
- lineStrategy: Splits on lines; falls back to fixed.
- fixedStrategy: Fixed-size, token-aligned splitting of words/whitespace.
Example: using fixedStrategy directly
import { chunkText, streamChunkText } from "adaptive-chunker";
import { fixedStrategy } from "adaptive-chunker";
const chunks = chunkText(text, { maxTokens: 200 }, fixedStrategy);
for await (const chunk of streamChunkText(text, { maxTokens: 200 }, fixedStrategy)) {
// process chunk
}
Notes
- All strategies preserve original formatting (including newlines) as much as possible.
- allowFallback controls whether oversized blocks are further split using the next fallback layer.
- Types are included; import ChunkingOptions, ChunkingStrategy, and Tokenizer from the package if needed (see the sketch below).
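For TypeScript users, a minimal sketch of pulling in an exported type, assuming the types are re-exported from the package root alongside the functions:
import { chunkText } from "adaptive-chunker";
import type { ChunkingOptions } from "adaptive-chunker";
// Annotate shared options with the exported type.
const options: ChunkingOptions = { maxTokens: 256, overlap: 16 };
const chunks = chunkText("Some text to split into chunks.", options);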
