# mdchunker

v0.0.2 · Chunking markdown files
A sophisticated library for intelligently chunking markdown documents while preserving context and semantic meaning. Built specifically for Large Language Model (LLM) applications, mdchunker ensures optimal chunk sizes while maintaining document structure and relationships.
## Features

- **Intelligent Content Splitting**: Automatically splits markdown content into semantically meaningful chunks while respecting document structure
- **Context Preservation**: Maintains heading hierarchies and document relationships in the chunked output
- **Token-Aware**: Built-in token length calculation ensures chunks are optimally sized for LLM context windows
- **Structure-Aware Parsing**: Special handling for:
  - Headers and heading hierarchies
  - Code blocks with language context
  - Tables with header preservation
  - Markdown link references
- **Flexible Configuration**: Configurable minimum and maximum token lengths to match your specific LLM requirements
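The token-aware sizing above boils down to a budget check per section. A minimal sketch of that idea follows; `countTokens` here is a rough character-based stand-in (the library itself uses tiktoken), and both function names are illustrative, not mdchunker's API:

```ts
// Rough stand-in for a real tokenizer: ~4 characters per token for English text.
// mdchunker uses tiktoken internally for accurate counts.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// A section only needs further splitting when it exceeds the token budget.
function needsSplit(section: string, maxTokenLength: number): boolean {
  return countTokens(section) > maxTokenLength;
}
```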
## Installation

```sh
npm install mdchunker
# or
pnpm add mdchunker
```

## Environment Setup
If you plan to use semantic analysis (recommended for production):
```sh
# Create a .env file
echo "OPENAI_API_KEY=your-api-key-here" > .env
```

## Usage
```ts
import { chunkMarkdown } from 'mdchunker';

const markdown = `# My Document
Some content here...
`;

// With semantic analysis (default - uses OpenAI embeddings)
const chunks = await chunkMarkdown(markdown, {
  minTokenLength: 256, // minimum tokens per chunk
  maxTokenLength: 768, // maximum tokens per chunk
  useSemantics: true   // enable semantic similarity (default)
});

// Without semantic analysis (faster, no API calls)
const fastChunks = await chunkMarkdown(markdown, {
  minTokenLength: 256,
  maxTokenLength: 768,
  useSemantics: false  // use simple heuristics instead
});
```

Each chunk contains:
- **Content**: The actual markdown text
- **Metadata**: Contextual information like heading paths
- **Token Length**: Pre-calculated token count
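The shape of a chunk can be pictured like this; the interface and field names below are illustrative, not mdchunker's exact return type:

```ts
// Hypothetical chunk shape, for illustration only.
interface ChunkMetadata {
  headingPath: string[]; // e.g. the headings above the chunk, outermost first
}

interface Chunk {
  content: string;        // the actual markdown text
  metadata: ChunkMetadata; // contextual information like heading paths
  tokenLength: number;    // pre-calculated token count
}

const example: Chunk = {
  content: "Some content here...",
  metadata: { headingPath: ["My Document"] },
  tokenLength: 6,
};
```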
## How It Works

1. **Preprocessing**: The markdown is parsed into an entity tree, preserving structure and formatting
2. **Token Calculation**: Token lengths are calculated for each entity
3. **Initial Split**: Large sections are split into smaller pieces while respecting markdown structure
4. **Merge Phase**: Small, related chunks are intelligently merged to meet minimum token requirements
5. **Final Processing**: Chunks are finalized with metadata and contextual information
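The merge phase described above can be sketched as a single pass that folds undersized neighbours together. This is a simplified illustration under assumed names, not mdchunker's internals:

```ts
// Simplified merge phase: while the previous chunk is still under the
// minimum token length, fold the next chunk into it.
function mergeSmallChunks(
  chunks: { text: string; tokens: number }[],
  minTokenLength: number
): { text: string; tokens: number }[] {
  const merged: { text: string; tokens: number }[] = [];
  for (const chunk of chunks) {
    const last = merged[merged.length - 1];
    if (last && last.tokens < minTokenLength) {
      // Previous chunk is too small: absorb this one.
      last.text += "\n\n" + chunk.text;
      last.tokens += chunk.tokens;
    } else {
      merged.push({ ...chunk });
    }
  }
  return merged;
}
```

The real merge phase additionally considers relatedness (heading hierarchy, and semantic similarity when enabled) rather than merging purely by adjacency.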
## Roadmap

- Intelligent splitting of code and tables
  - code: class and function headers
  - table: start of a section
- Providing metadata for every chunk
- Better contextual data for every chunk
## Technical Details

The library uses a multi-stage processing pipeline:

- **Entity Tree Building**: Parses markdown into a hierarchical structure using `remark` with GFM support
- **Token Analysis**: Uses `tiktoken` for accurate token counting
- **Smart Splitting**: Multiple splitting strategies:
  - Structure-based (headings, paragraphs)
  - Semantic-based (using embeddings)
  - Token-based (for optimal sizing)
- **Context Preservation**: Maintains document hierarchy and relationships through metadata
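The context-preservation step can be pictured as tracking the current heading path while walking the entity tree: each new heading truncates the path back to its own depth. The sketch below is illustrative only, under an assumed `{ depth, text }` heading shape:

```ts
// Illustrative heading-path tracking: given headings in document order,
// return the hierarchy active after the last one.
function headingPath(
  headings: { depth: number; text: string }[]
): string[] {
  const path: string[] = [];
  for (const h of headings) {
    path.length = h.depth - 1;   // drop any deeper or sibling headings
    path[h.depth - 1] = h.text;  // record the heading at its own depth
  }
  return path.filter(Boolean);   // skip holes left by skipped levels
}
```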
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT
