@storepress/llm-md-text-splitter

v0.0.1

Published

2 months ago

High-performance streaming Markdown text splitter for LLM pipelines and RAG systems. Zero sequence loss for code blocks, tables, links, and videos. 5 built-in strategies + custom. Zero dependencies.

LLM MD TEXT SPLITTER (Vibe Coded)

A high-performance, streaming Markdown text splitter built for LLM pipelines and RAG systems. Zero dependencies. Runs in browsers and Node.js 18+.

Zero Sequence Loss — code blocks, tables, reference links, and video embeds are never split. They stay grouped with their surrounding context as atomic semantic units.

Why?

When feeding large documentation (100,000+ lines) into LLM context windows or vector databases, naive splitters break code mid-function, separate explanations from their code examples, and lose reference links. This library guarantees:

Code blocks with ``` fences are never split, even inside delimiter/char/word strategies
Explanatory text stays grouped with its adjacent code blocks
Reference-style link definitions ([id]: url) are resolved per chunk
Tables, video embeds, and YAML frontmatter are kept atomic
Stream-based processing handles massive files with constant memory

Install

npm install @storepress/llm-md-text-splitter

yarn add @storepress/llm-md-text-splitter

pnpm add @storepress/llm-md-text-splitter

Or use directly in the browser via CDN:

<script type="module">
  import MarkdownTextSplitter from 'https://esm.sh/@storepress/llm-md-text-splitter';
</script>

Quick Start

import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';

const splitter = new MarkdownTextSplitter();
const chunks = await splitter.splitFromString(markdown);

chunks.forEach(chunk => {
  console.log(chunk.index, chunk.heading, chunk.tokenEstimate);
});

Splitting Strategies

The library ships with 5 built-in strategies and supports custom strategy registration.

| Strategy | Splits by | Best for | |----------|-----------|----------| | semantic | Markdown structure (headings, blocks) | RAG pipelines, LLM context | | delimiter | Custom string (---, ===, etc.) | Manually-sectioned docs | | char | Character count | Fixed-size windows | | word | Word count | Readability-based splits | | token | Estimated LLM token count | Token-budget-aware pipelines |

Every strategy protects fenced code blocks from being split.

Strategy Examples

1. Semantic Strategy (Default)

The most intelligent strategy. Understands Markdown structure and groups related content together — code with its explanation, videos with their context, links with their sections.

import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';

const splitter = new MarkdownTextSplitter({
  strategy: 'semantic',
  maxChunkTokens: 1500,       // Target tokens per chunk (~4 chars/token)
  overlapTokens: 150,         // Overlap between consecutive chunks
  preserveCodeContext: true,   // Keep code blocks with surrounding text
  preserveLinks: true,         // Keep reference links with their sections
  preserveVideos: true,        // Keep video embeds with context
});

const chunks = await splitter.splitFromString(markdown);

for (const chunk of chunks) {
  console.log(`#${chunk.index} — ${chunk.heading}`);
  console.log(`  Tokens: ${chunk.tokenEstimate}`);
  console.log(`  Code: ${chunk.hasCode} | Table: ${chunk.hasTable}`);
  console.log(`  Languages: ${chunk.languages.join(', ')}`);
  console.log(`  Links: ${chunk.links.length}`);
  console.log(`  Lines: ${chunk.lines.start}–${chunk.lines.end}`);
}

2. Delimiter Strategy

Splits on a custom delimiter string. The delimiter itself is excluded by default.

const splitter = new MarkdownTextSplitter({
  strategy: 'delimiter',
  strategyOptions: {
    delimiter: '---',           // Split on this exact string
    keepDelimiter: false,       // Exclude delimiter from output
    trimChunks: true,           // Trim whitespace from edges
  },
});

const chunks = await splitter.splitFromString(`
# Section One

Content for section one.

---

# Section Two

Content for section two.

\`\`\`js
// This code block contains --- but won't be split
const divider = '---';
\`\`\`
`);

console.log(chunks.length); // → 2 (code block with --- inside is protected)

You can use any string as a delimiter:

// Split on HTML comments
{ delimiter: '<!-- split -->' }

// Split on equals signs
{ delimiter: '===' }

// Split on custom markers
{ delimiter: '## CHUNK_BREAK' }

3. Character Limit Strategy

Splits when accumulated characters exceed the limit. Defers splits until outside code blocks.

const splitter = new MarkdownTextSplitter({
  strategy: 'char',
  strategyOptions: {
    charLimit: 4000,    // Max characters per chunk
    overlap: 200,       // Character overlap between chunks
  },
});

const chunks = await splitter.splitFromString(markdown);

chunks.forEach(c => {
  console.log(`Chunk ${c.index}: ${c.charCount} chars`);
});

4. Word Limit Strategy

Splits by word count. Useful for readability-based chunking.

const splitter = new MarkdownTextSplitter({
  strategy: 'word',
  strategyOptions: {
    wordLimit: 500,     // Max words per chunk
    overlap: 30,        // Word overlap between chunks
  },
});

const chunks = await splitter.splitFromString(markdown);

chunks.forEach(c => {
  console.log(`Chunk ${c.index}: ${c.wordCount} words`);
});

5. Token Limit Strategy

Splits by estimated LLM token count (uses ~4 chars/token heuristic by default).

const splitter = new MarkdownTextSplitter({
  strategy: 'token',
  strategyOptions: {
    tokenLimit: 2000,   // Max tokens per chunk
  },
});

const chunks = await splitter.splitFromString(markdown);

chunks.forEach(c => {
  console.log(`Chunk ${c.index}: ~${c.tokenEstimate} tokens`);
});

6. Custom Strategy

import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';

class RegexStrategy {
  constructor(config) {
    this.config = config;
    this.name = 'regex';
    this.pattern = new RegExp(config.strategyOptions?.pattern || '^#{1,2}\\s', 'm');
  }

  async *process(lineIterator) {
    let buffer = [];
    let startLine = 1;
    let index = 0;

    for await (const { lineNumber, text } of lineIterator) {
      if (this.pattern.test(text) && buffer.length > 0) {
        yield this._makeChunk(buffer, startLine, lineNumber - 1, index++);
        buffer = [];
        startLine = lineNumber;
      }
      buffer.push(text);
    }

    if (buffer.length > 0) {
      yield this._makeChunk(buffer, startLine, startLine + buffer.length - 1, index);
    }
  }

  _makeChunk(lines, startLine, endLine, index) {
    const content = lines.join('\n');
    return {
      id: `regex_${index}`,
      index,
      content,
      tokenEstimate: Math.ceil(content.length / 4),
      overlapTokens: 0,
      charCount: content.length,
      wordCount: content.split(/\s+/).filter(Boolean).length,
      lines: { start: startLine, end: endLine },
      heading: null,
      headingPath: [],
      headingLevel: null,
      hasCode: /```[\s\S]*?```/.test(content),
      hasVideo: false,
      hasTable: false,
      languages: [],
      links: [],
      videos: [],
      isOversized: false,
      containsAtomicBlock: false,
      blockTypes: [],
      strategy: 'regex',
      metadata: {},
    };
  }
}

// Register globally
MarkdownTextSplitter.registerStrategy('regex', RegexStrategy);

// Use it
const splitter = new MarkdownTextSplitter({
  strategy: 'regex',
  strategyOptions: { pattern: '^#{1,2}\\s' },
});

const chunks = await splitter.splitFromString(markdown);

Input Sources

From String

const chunks = await splitter.splitFromString(markdownContent);

From URL (Streaming)

Fetches via HTTP with streaming — never loads the full file into memory:

const chunks = await splitter.splitFromUrl(
  'https://raw.githubusercontent.com/user/repo/main/docs.md'
);

With custom headers (e.g., for private repos):

const chunks = await splitter.splitFromUrl(url, {
  headers: { Authorization: 'Bearer token' },
});

From File (Browser)

Works with <input type="file"> and the File/Blob API:

const input = document.querySelector('input[type="file"]');
input.addEventListener('change', async (e) => {
  const chunks = await splitter.splitFromFile(e.target.files[0]);
});

Streaming API

Process chunks as they arrive — ideal for huge files or progress indicators:

for await (const chunk of splitter.streamFromUrl(url)) {
  await vectorDB.upsert({
    id: chunk.id,
    text: chunk.content,
    metadata: {
      heading: chunk.heading,
      lines: chunk.lines,
      hasCode: chunk.hasCode,
    },
  });
  updateProgress(chunk.index);
}

All input methods have streaming variants:

splitter.streamFromString(markdown)   // AsyncGenerator
splitter.streamFromUrl(url)           // AsyncGenerator
splitter.streamFromFile(file)         // AsyncGenerator

Chunk Output Format

Every chunk object has this shape regardless of strategy:

{
  // ── Identity ──
  id: "chunk_a1b2c3d4_0042",     // Deterministic FNV-1a hash ID
  index: 42,                     // Sequential position (0-based)

  // ── Content ──
  content: "## useEffect Hook\n\nThe `useEffect` hook…\n\n```jsx\n…\n```",

  // ── Size Metrics ──
  tokenEstimate: 1450,             // Approximate LLM tokens (~4 chars/token)
  overlapTokens: 100,              // Tokens repeated from previous chunk
  charCount: 5800,                 // Exact character count
  wordCount: 342,                  // Word count

  // ── Source Mapping ──
  lines: { start: 120, end: 185 }, // Original line numbers

  // ── Structural Context (Semantic strategy) ──
  heading: "useEffect Hook",       // Nearest heading text
  headingPath: ["React Hooks Guide", "useEffect Hook"],  // Breadcrumb
  headingLevel: 2,                 // Heading depth (1–6)

  // ── Content Classification ──
  hasCode: true,                   // Contains fenced code blocks
  hasTable: false,                 // Contains markdown tables
  hasVideo: true,                  // Contains video embeds
  languages: ["jsx"],              // Code block languages

  // ── Extracted References ──
  links: [
    { text: "React Docs", url: "https://react.dev" }
  ],
  videos: [
    { platform: "youtube", videoId: "dQw4w9WgXcQ", url: "https://..." }
  ],

  // ── Quality Flags ──
  isOversized: false,              // Exceeds 1.5× target size
  containsAtomicBlock: true,       // Has code/table/video blocks
  blockTypes: ["heading", "paragraph", "code_block"],  // Semantic types

  // ── Strategy Info ──
  strategy: "semantic",            // Which strategy produced this chunk

  // ── Extensible ──
  metadata: {
    splitterVersion: "3.0.0",
    strategy: "semantic",
  }
}

Configuration Reference

Global Options

These apply to all strategies:

{
  charsPerToken: 4,              // Characters-per-token ratio for estimation
  fetchTimeoutMs: 60000,         // HTTP fetch timeout (ms)
  chunkIdPrefix: "chunk",        // Prefix for generated chunk IDs
}

Semantic Strategy Options

{
  strategy: "semantic",
  maxChunkTokens: 1500,          // Target max tokens per chunk
  overlapTokens: 150,            // Token overlap between chunks
  preserveCodeContext: true,      // Group code with explanatory text
  preserveLinks: true,            // Group reference links with sections
  preserveVideos: true,           // Group video embeds with context
}

Delimiter Strategy Options

{
  strategy: "delimiter",
  strategyOptions: {
    delimiter: "---",             // String to split on
    keepDelimiter: false,         // Include delimiter in output
    trimChunks: true,             // Trim whitespace from chunk edges
  }
}

Character Limit Strategy Options

{
  strategy: "char",
  strategyOptions: {
    charLimit: 4000,              // Max characters per chunk
    overlap: 200,                 // Character overlap
  }
}

Word Limit Strategy Options

{
  strategy: "word",
  strategyOptions: {
    wordLimit: 1000,              // Max words per chunk
    overlap: 50,                  // Word overlap
  }
}

Token Limit Strategy Options

{
  strategy: "token",
  strategyOptions: {
    tokenLimit: 1500,             // Max estimated tokens per chunk
  }
}

API Reference

Class: `MarkdownTextSplitter`

`new MarkdownTextSplitter(config?)`

Creates a new splitter instance with merged configuration.

const splitter = new MarkdownTextSplitter({ maxChunkTokens: 2000 });

`.splitFromString(markdown): Promise<Chunk[]>`

Splits a markdown string and returns all chunks.

`.splitFromUrl(url, fetchOptions?): Promise<Chunk[]>`

Fetches a URL via streaming HTTP and returns all chunks.

`.splitFromFile(fileOrBlob): Promise<Chunk[]>`

Splits a browser File or Blob object.

`.streamFromString(markdown): AsyncGenerator<Chunk>`

Yields chunks one-at-a-time from a string.

`.streamFromUrl(url, fetchOptions?): AsyncGenerator<Chunk>`

Yields chunks one-at-a-time from a streaming HTTP fetch.

`.streamFromFile(fileOrBlob): AsyncGenerator<Chunk>`

Yields chunks one-at-a-time from a File/Blob.

`.setStrategy(name, options?): void`

Switches the active strategy at runtime.

splitter.setStrategy('delimiter', { delimiter: '===' });

`.getStats(): Stats`

Returns processing statistics from the last split operation.

const stats = splitter.getStats();
// {
//   totalChunks: 42,
//   totalTokens: 58320,
//   totalChars: 233280,
//   totalWords: 38880,
//   oversizedChunks: 1,
//   codeBlockChunks: 15,
//   tableChunks: 3,
//   videoChunks: 2,
//   processingTimeMs: 47.3,
//   source: "https://..."
// }

`.reset(): void`

Resets internal statistics for reuse.

`static .registerStrategy(name, StrategyClass): void`

Registers a custom splitting strategy globally.

`static .getAvailableStrategies(): string[]`

Returns all registered strategy names.

MarkdownTextSplitter.getAvailableStrategies();
// ['semantic', 'delimiter', 'char', 'word', 'token']

Named Exports

import {
  MarkdownTextSplitter,     // Main class
  SemanticStrategy,          // Built-in strategies
  DelimiterStrategy,
  CharLimitStrategy,
  WordLimitStrategy,
  TokenLimitStrategy,
  SemanticParser,            // Low-level Markdown parser
  BlockType,                 // Block type enum
  DEFAULT_CONFIG,            // Default configuration object
  estimateTokens,            // Token estimation utility
  countWords,                // Word count utility
  generateChunkId,           // Chunk ID generator
  extractLinks,              // Link extraction utility
  extractVideos,             // Video extraction utility
  streamToLines,             // Stream → line iterator
  stringToLines,             // String → line iterator
} from '@storepress/llm-md-text-splitter';

Architecture

Input (URL / String / File)
  │
  ▼
┌─────────────────────┐
│  Streaming Fetch    │  fetch() → ReadableStream<Uint8Array>
│  or String→Stream   │  Zero RAM — bytes flow through
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Line Accumulator   │  TextDecoderStream → async yield { lineNumber, text }
│                     │  One line in memory at a time
└──────────┬──────────┘
           ▼
┌─────────────────────────────────────────────────────────────┐
│                    STRATEGY LAYER                           │
│                                                             │
│  ┌─────────────┐ ┌───────────┐ ┌──────┐ ┌──────┐ ┌───────┐  │
│  │  Semantic   │ │ Delimiter │ │ Char │ │ Word │ │ Token │  │
│  │  ┌────────┐ │ │           │ │      │ │      │ │       │  │
│  │  │ Parser │ │ │  Fence-   │ │Fence-│ │Fence-│ │       │  │
│  │  │Grouper │ │ │  aware    │ │aware │ │aware │ │       │  │
│  │  │Assembly│ │ │  split    │ │split │ │split │ │       │  │
│  │  └────────┘ │ │           │ │      │ │      │ │       │  │
│  └─────────────┘ └───────────┘ └──────┘ └──────┘ └───────┘  │
│                                                             │
│  + Custom strategies via registerStrategy()                 │
└──────────────────────────┬──────────────────────────────────┘
                           ▼
                   Chunk Objects with
                   rich metadata

Semantic Strategy Pipeline Detail

Lines → SemanticParser → Context Grouper → Chunk Assembler
            │                   │                │
            ▼                   ▼                ▼
     13 block types      Zero-loss grouping   Token-budgeted
     Heading tracking    Code + text paired    Overlap insertion
     Atomic enforcement  Video/link grouped    Oversized handling

Use Cases

RAG Pipeline with Vector Database

import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';

const splitter = new MarkdownTextSplitter({ maxChunkTokens: 1000 });

for await (const chunk of splitter.streamFromUrl(docsUrl)) {
  await pinecone.upsert({
    id: chunk.id,
    values: await embed(chunk.content),
    metadata: {
      heading: chunk.heading,
      headingPath: chunk.headingPath,
      hasCode: chunk.hasCode,
      languages: chunk.languages,
      lines: `${chunk.lines.start}-${chunk.lines.end}`,
      source: docsUrl,
    },
  });
}

Browser File Upload with Progress

<input type="file" id="mdFile" accept=".md,.markdown,.txt">
<div id="progress"></div>

<script type="module">
import MarkdownTextSplitter from '@storepress/llm-md-text-splitter';

document.getElementById('mdFile').addEventListener('change', async (e) => {
  const splitter = new MarkdownTextSplitter({ strategy: 'semantic' });
  const progress = document.getElementById('progress');
  const chunks = [];

  for await (const chunk of splitter.streamFromFile(e.target.files[0])) {
    chunks.push(chunk);
    progress.textContent = `Processed ${chunks.length} chunks…`;
  }

  const stats = splitter.getStats();
  progress.textContent = `Done: ${stats.totalChunks} chunks in ${stats.processingTimeMs.toFixed(0)}ms`;
});
</script>

Batch Processing Multiple Docs

const splitter = new MarkdownTextSplitter({ maxChunkTokens: 2000 });

const urls = [
  'https://raw.githubusercontent.com/org/repo/main/docs/getting-started.md',
  'https://raw.githubusercontent.com/org/repo/main/docs/api-reference.md',
  'https://raw.githubusercontent.com/org/repo/main/docs/advanced-usage.md',
];

for (const url of urls) {
  splitter.reset();
  const chunks = await splitter.splitFromUrl(url);
  console.log(`${url}: ${chunks.length} chunks`);
  await ingestChunks(chunks);
}

Switching Strategies at Runtime

const splitter = new MarkdownTextSplitter();

// Start with semantic
let chunks = await splitter.splitFromString(markdown);
console.log('Semantic:', chunks.length, 'chunks');

// Switch to delimiter
splitter.setStrategy('delimiter', { delimiter: '---' });
chunks = await splitter.splitFromString(markdown);
console.log('Delimiter:', chunks.length, 'chunks');

// Switch to word limit
splitter.setStrategy('word', { wordLimit: 300 });
chunks = await splitter.splitFromString(markdown);
console.log('Word:', chunks.length, 'chunks');

Export Chunks as NDJSON

const chunks = await splitter.splitFromString(markdown);

const ndjson = chunks
  .map(c => JSON.stringify(c))
  .join('\n');

// Download in browser
const blob = new Blob([ndjson], { type: 'application/x-ndjson' });
const url = URL.createObjectURL(blob);

Zero Sequence Loss Guarantee

The splitter enforces these invariants across all strategies:

Fenced code blocks (``` or ~~~) are never split. Even in char, word, and delimiter strategies, splits are deferred until after the code fence closes.
Tables (| col | col |) are kept as atomic units in the semantic strategy.
Video embeds (YouTube, Vimeo links) are grouped with their surrounding context paragraph.
Reference link definitions ([id]: url) are resolved — every chunk that references [id] also includes the URL definition.
YAML frontmatter (--- blocks at file start) is kept as a single atomic block.

You can verify this programmatically:

const chunks = await splitter.splitFromString(markdown);

for (const chunk of chunks) {
  if (chunk.hasCode) {
    const opens = (chunk.content.match(/```\w*/g) || []).length;
    const closes = (chunk.content.match(/\n```/g) || []).length;
    console.assert(opens === closes, `Chunk ${chunk.index}: unbalanced fences!`);
  }
}

Browser Compatibility

| Browser | Minimum Version | Notes | |---------|----------------|-------| | Chrome | 71+ | Full support | | Firefox | 65+ | Full support | | Safari | 14.1+ | Requires ReadableStream support | | Edge | 79+ | Chromium-based | | Node.js | 18+ | Works with --experimental-vm-modules or native ESM |

The module uses only standard web APIs: fetch(), ReadableStream, TextDecoderStream, TextEncoder, AbortController, and performance.now().

Performance

Tested with synthetic Markdown files on a MacBook Pro M2:

| File Size | Lines | Strategy | Chunks | Time | |-----------|-------|----------|--------|------| | 500 KB | 12,000 | semantic | 84 | 32ms | | 5 MB | 120,000 | semantic | 820 | 180ms | | 50 MB | 1,200,000 | char | 6,200 | 1.4s | | 50 MB | 1,200,000 | semantic | 8,100 | 2.8s |

Memory usage stays constant regardless of file size when using the streaming API (streamFrom* methods).

Contributing

Contributions are welcome. The architecture is designed for extensibility:

New strategies: Implement constructor(config) and async *process(lineIterator), then register via MarkdownTextSplitter.registerStrategy().
New block types: Add to BlockType enum and update SemanticParser.parse().
Better tokenization: Replace estimateTokens() with a WASM-based tokenizer like tiktoken.

License

MIT