@storepress/llm-md-text-splitter
v0.0.1
Published
High-performance streaming Markdown text splitter for LLM pipelines and RAG systems. Zero sequence loss for code blocks, tables, links, and videos. 5 built-in strategies + custom. Zero dependencies.
Maintainers
Readme
LLM MD TEXT SPLITTER (Vibe Coded)
A high-performance, streaming Markdown text splitter built for LLM pipelines and RAG systems. Zero dependencies. Runs in browsers and Node.js 18+.
Zero Sequence Loss — code blocks, tables, reference links, and video embeds are never split. They stay grouped with their surrounding context as atomic semantic units.
Why?
When feeding large documentation (100,000+ lines) into LLM context windows or vector databases, naive splitters break code mid-function, separate explanations from their code examples, and lose reference links. This library guarantees:
- Code blocks with
```fences are never split, even inside delimiter/char/word strategies - Explanatory text stays grouped with its adjacent code blocks
- Reference-style link definitions (
[id]: url) are resolved per chunk - Tables, video embeds, and YAML frontmatter are kept atomic
- Stream-based processing handles massive files with constant memory
Install
npm install @storepress/llm-md-text-splitteryarn add @storepress/llm-md-text-splitterpnpm add @storepress/llm-md-text-splitterOr use directly in the browser via CDN:
<script type="module">
import MarkdownTextSplitter from 'https://esm.sh/@storepress/llm-md-text-splitter';
</script>Quick Start
import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';
const splitter = new MarkdownTextSplitter();
const chunks = await splitter.splitFromString(markdown);
chunks.forEach(chunk => {
console.log(chunk.index, chunk.heading, chunk.tokenEstimate);
});Splitting Strategies
The library ships with 5 built-in strategies and supports custom strategy registration.
| Strategy | Splits by | Best for |
|----------|-----------|----------|
| semantic | Markdown structure (headings, blocks) | RAG pipelines, LLM context |
| delimiter | Custom string (---, ===, etc.) | Manually-sectioned docs |
| char | Character count | Fixed-size windows |
| word | Word count | Readability-based splits |
| token | Estimated LLM token count | Token-budget-aware pipelines |
Every strategy protects fenced code blocks from being split.
Strategy Examples
1. Semantic Strategy (Default)
The most intelligent strategy. Understands Markdown structure and groups related content together — code with its explanation, videos with their context, links with their sections.
import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';
const splitter = new MarkdownTextSplitter({
strategy: 'semantic',
maxChunkTokens: 1500, // Target tokens per chunk (~4 chars/token)
overlapTokens: 150, // Overlap between consecutive chunks
preserveCodeContext: true, // Keep code blocks with surrounding text
preserveLinks: true, // Keep reference links with their sections
preserveVideos: true, // Keep video embeds with context
});
const chunks = await splitter.splitFromString(markdown);
for (const chunk of chunks) {
console.log(`#${chunk.index} — ${chunk.heading}`);
console.log(` Tokens: ${chunk.tokenEstimate}`);
console.log(` Code: ${chunk.hasCode} | Table: ${chunk.hasTable}`);
console.log(` Languages: ${chunk.languages.join(', ')}`);
console.log(` Links: ${chunk.links.length}`);
console.log(` Lines: ${chunk.lines.start}–${chunk.lines.end}`);
}2. Delimiter Strategy
Splits on a custom delimiter string. The delimiter itself is excluded by default.
const splitter = new MarkdownTextSplitter({
strategy: 'delimiter',
strategyOptions: {
delimiter: '---', // Split on this exact string
keepDelimiter: false, // Exclude delimiter from output
trimChunks: true, // Trim whitespace from edges
},
});
const chunks = await splitter.splitFromString(`
# Section One
Content for section one.
---
# Section Two
Content for section two.
\`\`\`js
// This code block contains --- but won't be split
const divider = '---';
\`\`\`
`);
console.log(chunks.length); // → 2 (code block with --- inside is protected)You can use any string as a delimiter:
// Split on HTML comments
{ delimiter: '<!-- split -->' }
// Split on equals signs
{ delimiter: '===' }
// Split on custom markers
{ delimiter: '## CHUNK_BREAK' }3. Character Limit Strategy
Splits when accumulated characters exceed the limit. Defers splits until outside code blocks.
const splitter = new MarkdownTextSplitter({
strategy: 'char',
strategyOptions: {
charLimit: 4000, // Max characters per chunk
overlap: 200, // Character overlap between chunks
},
});
const chunks = await splitter.splitFromString(markdown);
chunks.forEach(c => {
console.log(`Chunk ${c.index}: ${c.charCount} chars`);
});4. Word Limit Strategy
Splits by word count. Useful for readability-based chunking.
const splitter = new MarkdownTextSplitter({
strategy: 'word',
strategyOptions: {
wordLimit: 500, // Max words per chunk
overlap: 30, // Word overlap between chunks
},
});
const chunks = await splitter.splitFromString(markdown);
chunks.forEach(c => {
console.log(`Chunk ${c.index}: ${c.wordCount} words`);
});5. Token Limit Strategy
Splits by estimated LLM token count (uses ~4 chars/token heuristic by default).
const splitter = new MarkdownTextSplitter({
strategy: 'token',
strategyOptions: {
tokenLimit: 2000, // Max tokens per chunk
},
});
const chunks = await splitter.splitFromString(markdown);
chunks.forEach(c => {
console.log(`Chunk ${c.index}: ~${c.tokenEstimate} tokens`);
});6. Custom Strategy
Register your own splitting logic. Your class must implement constructor(config) and async *process(lineIterator).
import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';
class RegexStrategy {
constructor(config) {
this.config = config;
this.name = 'regex';
this.pattern = new RegExp(config.strategyOptions?.pattern || '^#{1,2}\\s', 'm');
}
async *process(lineIterator) {
let buffer = [];
let startLine = 1;
let index = 0;
for await (const { lineNumber, text } of lineIterator) {
if (this.pattern.test(text) && buffer.length > 0) {
yield this._makeChunk(buffer, startLine, lineNumber - 1, index++);
buffer = [];
startLine = lineNumber;
}
buffer.push(text);
}
if (buffer.length > 0) {
yield this._makeChunk(buffer, startLine, startLine + buffer.length - 1, index);
}
}
_makeChunk(lines, startLine, endLine, index) {
const content = lines.join('\n');
return {
id: `regex_${index}`,
index,
content,
tokenEstimate: Math.ceil(content.length / 4),
overlapTokens: 0,
charCount: content.length,
wordCount: content.split(/\s+/).filter(Boolean).length,
lines: { start: startLine, end: endLine },
heading: null,
headingPath: [],
headingLevel: null,
hasCode: /```[\s\S]*?```/.test(content),
hasVideo: false,
hasTable: false,
languages: [],
links: [],
videos: [],
isOversized: false,
containsAtomicBlock: false,
blockTypes: [],
strategy: 'regex',
metadata: {},
};
}
}
// Register globally
MarkdownTextSplitter.registerStrategy('regex', RegexStrategy);
// Use it
const splitter = new MarkdownTextSplitter({
strategy: 'regex',
strategyOptions: { pattern: '^#{1,2}\\s' },
});
const chunks = await splitter.splitFromString(markdown);Input Sources
From String
const chunks = await splitter.splitFromString(markdownContent);From URL (Streaming)
Fetches via HTTP with streaming — never loads the full file into memory:
const chunks = await splitter.splitFromUrl(
'https://raw.githubusercontent.com/user/repo/main/docs.md'
);With custom headers (e.g., for private repos):
const chunks = await splitter.splitFromUrl(url, {
headers: { Authorization: 'Bearer token' },
});From File (Browser)
Works with <input type="file"> and the File/Blob API:
const input = document.querySelector('input[type="file"]');
input.addEventListener('change', async (e) => {
const chunks = await splitter.splitFromFile(e.target.files[0]);
});Streaming API
Process chunks as they arrive — ideal for huge files or progress indicators:
for await (const chunk of splitter.streamFromUrl(url)) {
await vectorDB.upsert({
id: chunk.id,
text: chunk.content,
metadata: {
heading: chunk.heading,
lines: chunk.lines,
hasCode: chunk.hasCode,
},
});
updateProgress(chunk.index);
}All input methods have streaming variants:
splitter.streamFromString(markdown) // AsyncGenerator
splitter.streamFromUrl(url) // AsyncGenerator
splitter.streamFromFile(file) // AsyncGeneratorChunk Output Format
Every chunk object has this shape regardless of strategy:
{
// ── Identity ──
id: "chunk_a1b2c3d4_0042", // Deterministic FNV-1a hash ID
index: 42, // Sequential position (0-based)
// ── Content ──
content: "## useEffect Hook\n\nThe `useEffect` hook…\n\n```jsx\n…\n```",
// ── Size Metrics ──
tokenEstimate: 1450, // Approximate LLM tokens (~4 chars/token)
overlapTokens: 100, // Tokens repeated from previous chunk
charCount: 5800, // Exact character count
wordCount: 342, // Word count
// ── Source Mapping ──
lines: { start: 120, end: 185 }, // Original line numbers
// ── Structural Context (Semantic strategy) ──
heading: "useEffect Hook", // Nearest heading text
headingPath: ["React Hooks Guide", "useEffect Hook"], // Breadcrumb
headingLevel: 2, // Heading depth (1–6)
// ── Content Classification ──
hasCode: true, // Contains fenced code blocks
hasTable: false, // Contains markdown tables
hasVideo: true, // Contains video embeds
languages: ["jsx"], // Code block languages
// ── Extracted References ──
links: [
{ text: "React Docs", url: "https://react.dev" }
],
videos: [
{ platform: "youtube", videoId: "dQw4w9WgXcQ", url: "https://..." }
],
// ── Quality Flags ──
isOversized: false, // Exceeds 1.5× target size
containsAtomicBlock: true, // Has code/table/video blocks
blockTypes: ["heading", "paragraph", "code_block"], // Semantic types
// ── Strategy Info ──
strategy: "semantic", // Which strategy produced this chunk
// ── Extensible ──
metadata: {
splitterVersion: "3.0.0",
strategy: "semantic",
}
}Configuration Reference
Global Options
These apply to all strategies:
{
charsPerToken: 4, // Characters-per-token ratio for estimation
fetchTimeoutMs: 60000, // HTTP fetch timeout (ms)
chunkIdPrefix: "chunk", // Prefix for generated chunk IDs
}Semantic Strategy Options
{
strategy: "semantic",
maxChunkTokens: 1500, // Target max tokens per chunk
overlapTokens: 150, // Token overlap between chunks
preserveCodeContext: true, // Group code with explanatory text
preserveLinks: true, // Group reference links with sections
preserveVideos: true, // Group video embeds with context
}Delimiter Strategy Options
{
strategy: "delimiter",
strategyOptions: {
delimiter: "---", // String to split on
keepDelimiter: false, // Include delimiter in output
trimChunks: true, // Trim whitespace from chunk edges
}
}Character Limit Strategy Options
{
strategy: "char",
strategyOptions: {
charLimit: 4000, // Max characters per chunk
overlap: 200, // Character overlap
}
}Word Limit Strategy Options
{
strategy: "word",
strategyOptions: {
wordLimit: 1000, // Max words per chunk
overlap: 50, // Word overlap
}
}Token Limit Strategy Options
{
strategy: "token",
strategyOptions: {
tokenLimit: 1500, // Max estimated tokens per chunk
}
}API Reference
Class: MarkdownTextSplitter
new MarkdownTextSplitter(config?)
Creates a new splitter instance with merged configuration.
const splitter = new MarkdownTextSplitter({ maxChunkTokens: 2000 });.splitFromString(markdown): Promise<Chunk[]>
Splits a markdown string and returns all chunks.
.splitFromUrl(url, fetchOptions?): Promise<Chunk[]>
Fetches a URL via streaming HTTP and returns all chunks.
.splitFromFile(fileOrBlob): Promise<Chunk[]>
Splits a browser File or Blob object.
.streamFromString(markdown): AsyncGenerator<Chunk>
Yields chunks one-at-a-time from a string.
.streamFromUrl(url, fetchOptions?): AsyncGenerator<Chunk>
Yields chunks one-at-a-time from a streaming HTTP fetch.
.streamFromFile(fileOrBlob): AsyncGenerator<Chunk>
Yields chunks one-at-a-time from a File/Blob.
.setStrategy(name, options?): void
Switches the active strategy at runtime.
splitter.setStrategy('delimiter', { delimiter: '===' });.getStats(): Stats
Returns processing statistics from the last split operation.
const stats = splitter.getStats();
// {
// totalChunks: 42,
// totalTokens: 58320,
// totalChars: 233280,
// totalWords: 38880,
// oversizedChunks: 1,
// codeBlockChunks: 15,
// tableChunks: 3,
// videoChunks: 2,
// processingTimeMs: 47.3,
// source: "https://..."
// }.reset(): void
Resets internal statistics for reuse.
static .registerStrategy(name, StrategyClass): void
Registers a custom splitting strategy globally.
static .getAvailableStrategies(): string[]
Returns all registered strategy names.
MarkdownTextSplitter.getAvailableStrategies();
// ['semantic', 'delimiter', 'char', 'word', 'token']Named Exports
import {
MarkdownTextSplitter, // Main class
SemanticStrategy, // Built-in strategies
DelimiterStrategy,
CharLimitStrategy,
WordLimitStrategy,
TokenLimitStrategy,
SemanticParser, // Low-level Markdown parser
BlockType, // Block type enum
DEFAULT_CONFIG, // Default configuration object
estimateTokens, // Token estimation utility
countWords, // Word count utility
generateChunkId, // Chunk ID generator
extractLinks, // Link extraction utility
extractVideos, // Video extraction utility
streamToLines, // Stream → line iterator
stringToLines, // String → line iterator
} from '@storepress/llm-md-text-splitter';Architecture
Input (URL / String / File)
│
▼
┌─────────────────────┐
│ Streaming Fetch │ fetch() → ReadableStream<Uint8Array>
│ or String→Stream │ Zero RAM — bytes flow through
└──────────┬──────────┘
▼
┌─────────────────────┐
│ Line Accumulator │ TextDecoderStream → async yield { lineNumber, text }
│ │ One line in memory at a time
└──────────┬──────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ STRATEGY LAYER │
│ │
│ ┌─────────────┐ ┌───────────┐ ┌──────┐ ┌──────┐ ┌───────┐ │
│ │ Semantic │ │ Delimiter │ │ Char │ │ Word │ │ Token │ │
│ │ ┌────────┐ │ │ │ │ │ │ │ │ │ │
│ │ │ Parser │ │ │ Fence- │ │Fence-│ │Fence-│ │ │ │
│ │ │Grouper │ │ │ aware │ │aware │ │aware │ │ │ │
│ │ │Assembly│ │ │ split │ │split │ │split │ │ │ │
│ │ └────────┘ │ │ │ │ │ │ │ │ │ │
│ └─────────────┘ └───────────┘ └──────┘ └──────┘ └───────┘ │
│ │
│ + Custom strategies via registerStrategy() │
└──────────────────────────┬──────────────────────────────────┘
▼
Chunk Objects with
rich metadataSemantic Strategy Pipeline Detail
Lines → SemanticParser → Context Grouper → Chunk Assembler
│ │ │
▼ ▼ ▼
13 block types Zero-loss grouping Token-budgeted
Heading tracking Code + text paired Overlap insertion
Atomic enforcement Video/link grouped Oversized handlingUse Cases
RAG Pipeline with Vector Database
import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';
const splitter = new MarkdownTextSplitter({ maxChunkTokens: 1000 });
for await (const chunk of splitter.streamFromUrl(docsUrl)) {
await pinecone.upsert({
id: chunk.id,
values: await embed(chunk.content),
metadata: {
heading: chunk.heading,
headingPath: chunk.headingPath,
hasCode: chunk.hasCode,
languages: chunk.languages,
lines: `${chunk.lines.start}-${chunk.lines.end}`,
source: docsUrl,
},
});
}Browser File Upload with Progress
<input type="file" id="mdFile" accept=".md,.markdown,.txt">
<div id="progress"></div>
<script type="module">
import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';
document.getElementById('mdFile').addEventListener('change', async (e) => {
const splitter = new MarkdownTextSplitter({ strategy: 'semantic' });
const progress = document.getElementById('progress');
const chunks = [];
for await (const chunk of splitter.streamFromFile(e.target.files[0])) {
chunks.push(chunk);
progress.textContent = `Processed ${chunks.length} chunks…`;
}
const stats = splitter.getStats();
progress.textContent = `Done: ${stats.totalChunks} chunks in ${stats.processingTimeMs.toFixed(0)}ms`;
});
</script>Batch Processing Multiple Docs
const splitter = new MarkdownTextSplitter({ maxChunkTokens: 2000 });
const urls = [
'https://raw.githubusercontent.com/org/repo/main/docs/getting-started.md',
'https://raw.githubusercontent.com/org/repo/main/docs/api-reference.md',
'https://raw.githubusercontent.com/org/repo/main/docs/advanced-usage.md',
];
for (const url of urls) {
splitter.reset();
const chunks = await splitter.splitFromUrl(url);
console.log(`${url}: ${chunks.length} chunks`);
await ingestChunks(chunks);
}Switching Strategies at Runtime
const splitter = new MarkdownTextSplitter();
// Start with semantic
let chunks = await splitter.splitFromString(markdown);
console.log('Semantic:', chunks.length, 'chunks');
// Switch to delimiter
splitter.setStrategy('delimiter', { delimiter: '---' });
chunks = await splitter.splitFromString(markdown);
console.log('Delimiter:', chunks.length, 'chunks');
// Switch to word limit
splitter.setStrategy('word', { wordLimit: 300 });
chunks = await splitter.splitFromString(markdown);
console.log('Word:', chunks.length, 'chunks');Export Chunks as NDJSON
const chunks = await splitter.splitFromString(markdown);
const ndjson = chunks
.map(c => JSON.stringify(c))
.join('\n');
// Download in browser
const blob = new Blob([ndjson], { type: 'application/x-ndjson' });
const url = URL.createObjectURL(blob);Zero Sequence Loss Guarantee
The splitter enforces these invariants across all strategies:
Fenced code blocks (
```or~~~) are never split. Even inchar,word, anddelimiterstrategies, splits are deferred until after the code fence closes.Tables (
| col | col |) are kept as atomic units in the semantic strategy.Video embeds (YouTube, Vimeo links) are grouped with their surrounding context paragraph.
Reference link definitions (
[id]: url) are resolved — every chunk that references[id]also includes the URL definition.YAML frontmatter (
---blocks at file start) is kept as a single atomic block.
You can verify this programmatically:
const chunks = await splitter.splitFromString(markdown);
for (const chunk of chunks) {
if (chunk.hasCode) {
const opens = (chunk.content.match(/```\w*/g) || []).length;
const closes = (chunk.content.match(/\n```/g) || []).length;
console.assert(opens === closes, `Chunk ${chunk.index}: unbalanced fences!`);
}
}Browser Compatibility
| Browser | Minimum Version | Notes |
|---------|----------------|-------|
| Chrome | 71+ | Full support |
| Firefox | 65+ | Full support |
| Safari | 14.1+ | Requires ReadableStream support |
| Edge | 79+ | Chromium-based |
| Node.js | 18+ | Works with --experimental-vm-modules or native ESM |
The module uses only standard web APIs: fetch(), ReadableStream, TextDecoderStream, TextEncoder, AbortController, and performance.now().
Performance
Tested with synthetic Markdown files on a MacBook Pro M2:
| File Size | Lines | Strategy | Chunks | Time | |-----------|-------|----------|--------|------| | 500 KB | 12,000 | semantic | 84 | 32ms | | 5 MB | 120,000 | semantic | 820 | 180ms | | 50 MB | 1,200,000 | char | 6,200 | 1.4s | | 50 MB | 1,200,000 | semantic | 8,100 | 2.8s |
Memory usage stays constant regardless of file size when using the streaming API (streamFrom* methods).
Contributing
Contributions are welcome. The architecture is designed for extensibility:
- New strategies: Implement
constructor(config)andasync *process(lineIterator), then register viaMarkdownTextSplitter.registerStrategy(). - New block types: Add to
BlockTypeenum and updateSemanticParser.parse(). - Better tokenization: Replace
estimateTokens()with a WASM-based tokenizer liketiktoken.
License
MIT
