# semchunk-ts
A TypeScript port of semchunk, a Python library by Isaacus / Umar Butler for splitting text into semantically meaningful chunks.
semchunk uses a novel hierarchical chunking algorithm that preserves local semantic context by splitting text along structurally meaningful boundaries (paragraphs, sentences, clauses, words) before falling back to character-level splits. The original Python library delivers 15% better RAG performance than its closest competitors and is used in production by Docling, the Microsoft Intelligence Toolkit, and the Isaacus API.
This port brings the core algorithm to the TypeScript/JavaScript ecosystem.
## Install

```sh
bun add semchunk-ts
# or
npm install semchunk-ts
```

## Quickstart
```ts
import { chunk, chunkerify } from "semchunk-ts";

const text = "The quick brown fox jumps over the lazy dog.";

// Use chunk() directly with any token counting function.
const chunks = chunk(text, 20, (text) => text.length);
// => ["The quick brown fox", "jumps over the lazy", "dog."]

// Or create a reusable Chunker with chunkerify().
const chunker = chunkerify((text) => text.length, 20);
chunker.chunk(text);
// => ["The quick brown fox", "jumps over the lazy", "dog."]
```

## API
### `chunk(text, chunkSize, tokenCounter, options?)`

Split a single text into chunks.
```ts
import { chunk } from "semchunk-ts";

// Basic usage
const chunks = chunk("Your text here...", 512, tokenCounter);

// With offsets — returns [chunks, offsets] where text.slice(start, end) === chunk
const [offsetChunks, offsets] = chunk("Your text here...", 512, tokenCounter, {
  offsets: true,
});

// With overlap — proportion (<1) or absolute token count (>=1)
const overlapping = chunk("Your text here...", 512, tokenCounter, {
  overlap: 0.5,
});
```

Options:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| memoize | boolean | true | Cache token counter results for performance |
| offsets | boolean | false | Return [chunks, offsets] instead of just chunks |
| overlap | number \| null | null | Chunk overlap as proportion (<1) or token count (>=1) |
| cacheMaxsize | number \| null | null | Max memoization cache entries (null = unlimited) |
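The two `overlap` interpretations in the table can be illustrated with a small helper. This is a hypothetical sketch of the idea, not code from semchunk-ts; the name `resolveOverlap` and the exact capping behavior for large absolute values are assumptions for illustration.

```ts
// Hypothetical helper (not part of semchunk-ts): resolve an `overlap`
// setting into an absolute number of tokens.
function resolveOverlap(overlap: number, chunkSize: number): number {
  // A value below 1 is treated as a proportion of the chunk size.
  if (overlap < 1) return Math.floor(chunkSize * overlap);
  // A value of 1 or more is an absolute token count; capping below the
  // chunk size (an assumption) keeps consecutive chunks advancing.
  return Math.min(Math.floor(overlap), chunkSize - 1);
}

console.log(resolveOverlap(0.5, 512)); // → 256
console.log(resolveOverlap(128, 512)); // → 128
```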
### `chunkerify(tokenCounter, chunkSize, options?)`

Create a reusable `Chunker` instance.
```ts
import { chunkerify } from "semchunk-ts";

// From a token counting function
const chunker = chunkerify((text) => text.split(/\s+/).length, 100);

// From a tokenizer object with an encode() method
const tokenizerChunker = chunkerify(myTokenizer, 512);

// With max token chars optimization (skips tokenization for clearly oversized text)
const fastChunker = chunkerify(tokenCounter, 512, { maxTokenChars: 20 });
```
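The intuition behind `maxTokenChars` can be sketched as follows. This is an illustrative guess at the optimization's reasoning, not the library's actual code; the function names are invented for the example.

```ts
// Hypothetical sketch: if no token can exceed maxTokenChars characters,
// then chunkSize tokens span at most chunkSize * maxTokenChars characters,
// so any longer text must exceed chunkSize tokens. We can report it as
// oversized without running the (possibly expensive) tokenizer at all.
function isCertainlyOversized(
  text: string,
  chunkSize: number,
  maxTokenChars: number,
): boolean {
  return text.length > chunkSize * maxTokenChars;
}

function countTokensFast(
  text: string,
  chunkSize: number,
  maxTokenChars: number,
  count: (t: string) => number,
): number {
  if (isCertainlyOversized(text, chunkSize, maxTokenChars)) {
    return chunkSize + 1; // any value above chunkSize fails the size check
  }
  return count(text); // small enough that we must actually tokenize
}
```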
### `Chunker`

```ts
// Chunk a single text
const chunks = chunker.chunk(text);
const [offsetChunks, offsets] = chunker.chunk(text, { offsets: true });
const overlapped = chunker.chunk(text, { overlap: 0.5 });

// Chunk multiple texts
const results = chunker.chunkBatch([text1, text2, text3]);
const [allChunks, allOffsets] = chunker.chunkBatch(texts, { offsets: true });
```
## How It Works

The algorithm splits text using a hierarchy of structurally meaningful splitters, from most to least desirable:

- Largest sequence of newlines / carriage returns
- Largest sequence of tabs
- Largest sequence of whitespace (with preference for whitespace after punctuation)
- Sentence terminators: `.?!`
- Clause separators: `` ;,()[]"'`* ``
- Sentence interrupters: `:—…`
- Word joiners: `` /\–&- ``
- Individual characters (last resort)
Chunks that exceed the size limit are recursively split. Undersized adjacent chunks are merged using binary search to approach the chunk size as closely as possible.
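The recursive split-and-merge shape described above can be sketched in a few lines. This is a deliberately simplified illustration, not the semchunk-ts implementation: it uses a reduced splitter set, and it merges neighbors greedily rather than with the library's binary search.

```ts
// Illustrative sketch only. Try splitters from most to least desirable,
// recurse on oversized pieces, and greedily merge undersized neighbors.
type TokenCounter = (text: string) => number;

// Simplified splitter hierarchy; "" means character-level fallback.
const SPLITTERS = ["\n", "\t", " ", ". ", ", ", ""];

function simpleChunk(text: string, chunkSize: number, count: TokenCounter): string[] {
  if (count(text) <= chunkSize) return text.trim() ? [text.trim()] : [];

  // Pick the most desirable splitter that actually occurs in the text.
  const splitter = SPLITTERS.find((s) => s === "" || text.includes(s))!;
  const parts = splitter === "" ? Array.from(text) : text.split(splitter);

  // Recursively split oversized parts; merge undersized neighbors.
  const chunks: string[] = [];
  let current = "";
  for (const part of parts) {
    if (count(part) > chunkSize) {
      if (current.trim()) chunks.push(current.trim());
      current = "";
      chunks.push(...simpleChunk(part, chunkSize, count));
    } else if (count(current + splitter + part) <= chunkSize) {
      current = current ? current + splitter + part : part;
    } else {
      if (current.trim()) chunks.push(current.trim());
      current = part;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

const charCount = (t: string) => t.length;
console.log(simpleChunk("The quick brown fox jumps over the lazy dog.", 20, charCount));
// → ["The quick brown fox", "jumps over the lazy", "dog."]
```

Even this toy version reproduces the quickstart output, because the heavy lifting is in the splitter ordering: word boundaries are tried long before any character-level fallback.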
## Publishing to npm

```sh
# Make sure you're logged in
npm login

# Run tests and build
bun test
bun run build

# Dry-run to verify package contents
npm pack --dry-run

# Publish (runs build automatically via prepublishOnly)
npm publish

# Publish with a specific tag (e.g. beta)
npm publish --tag beta
```

To publish a new version:
```sh
# Bump version (patch/minor/major)
npm version patch  # 4.0.0 -> 4.0.1
npm version minor  # 4.0.0 -> 4.1.0
npm version major  # 4.0.0 -> 5.0.0

# Then publish
npm publish
```

## Differences from the Python Library
This port covers the core chunking algorithm. The following Python-specific features are not included:
- AI-powered chunking via Isaacus enrichment models
- Multiprocessing (mpire) — use `chunkBatch()` for sequential batch processing
- Automatic tokenizer loading from tiktoken/transformers by name — pass a token counter function or tokenizer object directly
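Since tokenizers are not loaded by name, a tokenizer object has to be adapted by hand. A sketch of that adapter, assuming a tiktoken-style `encode()` method; `fakeTokenizer` below is a stand-in for illustration, not a real tokenizer:

```ts
// A tokenizer object exposing a tiktoken-style encode() method.
interface Tokenizer {
  encode(text: string): number[];
}

// Stand-in tokenizer for illustration: one "token" per whitespace-separated word.
const fakeTokenizer: Tokenizer = {
  encode: (text) => text.split(/\s+/).filter(Boolean).map((_, i) => i),
};

// Adapt the tokenizer into the token counter shape the library accepts.
const tokenCounter = (text: string) => fakeTokenizer.encode(text).length;

console.log(tokenCounter("splitting text into chunks")); // → 4
```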
## Acknowledgements
This is a TypeScript port of semchunk (v4.0.0), created by Isaacus and Umar Butler. The original library is licensed under MIT. All credit for the chunking algorithm and its design goes to the original authors.
A Rust port (semchunk-rs) is also available, maintained by @dominictarro.
## License
MIT
