# semchunk-ts
A TypeScript port of semchunk, a Python library by Isaacus / Umar Butler for splitting text into semantically meaningful chunks.
semchunk uses a novel hierarchical chunking algorithm that preserves local semantic context by splitting text along structurally meaningful boundaries (paragraphs, sentences, clauses, words) before falling back to character-level splits. The original Python library delivers 15% better RAG performance than its closest competitors and is used in production by Docling, the Microsoft Intelligence Toolkit, and the Isaacus API.
This port brings the core algorithm to the TypeScript/JavaScript ecosystem.
## Install

```sh
bun add semchunk-ts
# or
npm install semchunk-ts
```

## Quickstart
```ts
import { chunk, chunkerify } from "semchunk-ts";

const text = "The quick brown fox jumps over the lazy dog.";

// Use chunk() directly with any token counting function.
const chunks = chunk(text, 20, (text) => text.length);
// => ["The quick brown fox", "jumps over the lazy", "dog."]

// Or create a reusable Chunker with chunkerify().
const chunker = chunkerify((text) => text.length, 20);
chunker.chunk(text);
// => ["The quick brown fox", "jumps over the lazy", "dog."]
```

## API
### `chunk(text, chunkSize, tokenCounter, options?)`

Split a single text into chunks.
```ts
import { chunk } from "semchunk-ts";

// Basic usage
const chunks = chunk("Your text here...", 512, tokenCounter);

// With offsets — returns [chunks, offsets] where text.slice(start, end) === chunk
const [offsetChunks, offsets] = chunk("Your text here...", 512, tokenCounter, {
  offsets: true,
});

// With overlap — proportion (<1) or absolute token count (>=1)
const overlapping = chunk("Your text here...", 512, tokenCounter, {
  overlap: 0.5,
});
```

Options:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| memoize | boolean | true | Cache token counter results for performance |
| offsets | boolean | false | Return [chunks, offsets] instead of just chunks |
| overlap | number \| null | null | Chunk overlap as proportion (<1) or token count (>=1) |
| cacheMaxsize | number \| null | null | Max memoization cache entries (null = unlimited) |
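The two `overlap` interpretations in the table can be illustrated with a small helper. This is a hypothetical sketch of the idea, not code from semchunk-ts; the name `resolveOverlap` and the exact capping behavior for large absolute values are assumptions for illustration.

```ts
// Hypothetical helper (not part of semchunk-ts): resolve an `overlap`
// setting into an absolute number of tokens.
function resolveOverlap(overlap: number, chunkSize: number): number {
  // A value below 1 is treated as a proportion of the chunk size.
  if (overlap < 1) return Math.floor(chunkSize * overlap);
  // A value of 1 or more is an absolute token count; capping below the
  // chunk size (an assumption) keeps consecutive chunks advancing.
  return Math.min(Math.floor(overlap), chunkSize - 1);
}

console.log(resolveOverlap(0.5, 512)); // → 256
console.log(resolveOverlap(128, 512)); // → 128
```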
### `chunkerify(tokenCounter, chunkSize, options?)`

Create a reusable `Chunker` instance.
```ts
import { chunkerify } from "semchunk-ts";

// From a token counting function
const chunker = chunkerify((text) => text.split(/\s+/).length, 100);

// From a tokenizer object with an encode() method
const tokenizerChunker = chunkerify(myTokenizer, 512);

// With max token chars optimization (skips tokenization for clearly oversized text)
const fastChunker = chunkerify(tokenCounter, 512, { maxTokenChars: 20 });
```
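The intuition behind `maxTokenChars` can be sketched as follows. This is an illustrative guess at the optimization's reasoning, not the library's actual code; the function names are invented for the example.

```ts
// Hypothetical sketch: if no token can exceed maxTokenChars characters,
// then chunkSize tokens span at most chunkSize * maxTokenChars characters,
// so any longer text must exceed chunkSize tokens. We can report it as
// oversized without running the (possibly expensive) tokenizer at all.
function isCertainlyOversized(
  text: string,
  chunkSize: number,
  maxTokenChars: number,
): boolean {
  return text.length > chunkSize * maxTokenChars;
}

function countTokensFast(
  text: string,
  chunkSize: number,
  maxTokenChars: number,
  count: (t: string) => number,
): number {
  if (isCertainlyOversized(text, chunkSize, maxTokenChars)) {
    return chunkSize + 1; // any value above chunkSize fails the size check
  }
  return count(text); // small enough that we must actually tokenize
}
```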
### `Chunker`

```ts
// Chunk a single text
const chunks = chunker.chunk(text);
const [offsetChunks, offsets] = chunker.chunk(text, { offsets: true });
const overlapped = chunker.chunk(text, { overlap: 0.5 });

// Chunk multiple texts
const results = chunker.chunkBatch([text1, text2, text3]);
const [allChunks, allOffsets] = chunker.chunkBatch(texts, { offsets: true });
```
## How It Works

The algorithm splits text using a hierarchy of structurally meaningful splitters, from most to least desirable:

- Largest sequence of newlines / carriage returns
- Largest sequence of tabs
- Largest sequence of whitespace (with preference for whitespace after punctuation)
- Sentence terminators: `.?!`
- Clause separators: `` ;,()[]"'`* ``
- Sentence interrupters: `:—…`
- Word joiners: `` /\–&- ``
- Individual characters (last resort)
Chunks that exceed the size limit are recursively split. Undersized adjacent chunks are merged using binary search to approach the chunk size as closely as possible.
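The recursive split-and-merge shape described above can be sketched in a few lines. This is a deliberately simplified illustration, not the semchunk-ts implementation: it uses a reduced splitter set, and it merges neighbors greedily rather than with the library's binary search.

```ts
// Illustrative sketch only. Try splitters from most to least desirable,
// recurse on oversized pieces, and greedily merge undersized neighbors.
type TokenCounter = (text: string) => number;

// Simplified splitter hierarchy; "" means character-level fallback.
const SPLITTERS = ["\n", "\t", " ", ". ", ", ", ""];

function simpleChunk(text: string, chunkSize: number, count: TokenCounter): string[] {
  if (count(text) <= chunkSize) return text.trim() ? [text.trim()] : [];

  // Pick the most desirable splitter that actually occurs in the text.
  const splitter = SPLITTERS.find((s) => s === "" || text.includes(s))!;
  const parts = splitter === "" ? Array.from(text) : text.split(splitter);

  // Recursively split oversized parts; merge undersized neighbors.
  const chunks: string[] = [];
  let current = "";
  for (const part of parts) {
    if (count(part) > chunkSize) {
      if (current.trim()) chunks.push(current.trim());
      current = "";
      chunks.push(...simpleChunk(part, chunkSize, count));
    } else if (count(current + splitter + part) <= chunkSize) {
      current = current ? current + splitter + part : part;
    } else {
      if (current.trim()) chunks.push(current.trim());
      current = part;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

const charCount = (t: string) => t.length;
console.log(simpleChunk("The quick brown fox jumps over the lazy dog.", 20, charCount));
// → ["The quick brown fox", "jumps over the lazy", "dog."]
```

Even this toy version reproduces the quickstart output, because the heavy lifting is in the splitter ordering: word boundaries are tried long before any character-level fallback.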
## Publishing to npm

```sh
# Make sure you're logged in
npm login

# Run tests and build
bun test
bun run build

# Dry-run to verify package contents
npm pack --dry-run

# Publish (runs build automatically via prepublishOnly)
npm publish

# Publish with a specific tag (e.g. beta)
npm publish --tag beta
```

To publish a new version:
```sh
# Bump version (patch/minor/major)
npm version patch  # 4.0.0 -> 4.0.1
npm version minor  # 4.0.0 -> 4.1.0
npm version major  # 4.0.0 -> 5.0.0

# Then publish
npm publish
```

## Differences from the Python Library
This port covers the core chunking algorithm. The following Python-specific features are not included:
- AI-powered chunking via Isaacus enrichment models
- Multiprocessing (mpire) — use `chunkBatch()` for sequential batch processing
- Automatic tokenizer loading from tiktoken/transformers by name — pass a token counter function or tokenizer object directly
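Since tokenizers are not loaded by name, a tokenizer object has to be adapted by hand. A sketch of that adapter, assuming a tiktoken-style `encode()` method; `fakeTokenizer` below is a stand-in for illustration, not a real tokenizer:

```ts
// A tokenizer object exposing a tiktoken-style encode() method.
interface Tokenizer {
  encode(text: string): number[];
}

// Stand-in tokenizer for illustration: one "token" per whitespace-separated word.
const fakeTokenizer: Tokenizer = {
  encode: (text) => text.split(/\s+/).filter(Boolean).map((_, i) => i),
};

// Adapt the tokenizer into the token counter shape the library accepts.
const tokenCounter = (text: string) => fakeTokenizer.encode(text).length;

console.log(tokenCounter("splitting text into chunks")); // → 4
```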
## Acknowledgements
This is a TypeScript port of semchunk (v4.0.0), created by Isaacus and Umar Butler. The original library is licensed under MIT. All credit for the chunking algorithm and its design goes to the original authors.
A Rust port (semchunk-rs) is also available, maintained by @dominictarro.
## License
MIT
