smartchunk
Smart chunking strategies.
- Why?
- How do I install it?
- How do I use it?
- What does the returned data look like?
- Why is the chunked text not returned?
- How does it work?
- What strategies are available?
- What options are available?
- Is there a change log?
- How do I set up the dev environment?
- What versions of Node.js does it support?
- What license is it released under?
Why?
A common requirement is chunking long text before generating embeddings for vector search. The efficacy of vector search is directly impacted by the length of whatever text is sent to your embeddings model. Longer text tends to include more concepts, which dilutes the precision of the vectors returned by the model. But chunking at arbitrary fixed points also dilutes precision, because it splits semantic units of text across chunk boundaries. So a smarter chunking strategy is needed.
A crude workaround is to add some overlap between chunks, but that doesn't really tackle the root cause and is hard to generalise across source material because there's no ideal chunk size or overlap length that works for all input text. Instead, you really want something that looks at the content and splits intelligently around it.
smartchunk is my approach to solving this issue,
based on experience building a few different systems
that had to chunk data for vector search
(some for RAG, some for general context truncation).
How do I install it?
If you're using npm:
npm i smartchunk --save

Or if you just want the git repo:

git clone git@github.com:philbooth/smartchunk.git

How do I use it?
import * as assert from 'node:assert';
import { smartchunk } from 'smartchunk';
const paragraphs = [
"A long time ago in a galaxy far, far away....",
"It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.",
"During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.",
"Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy....",
];
const text = paragraphs.join('\n\n');
const chunks = smartchunk(text);
assert.equal(chunks.length, 4);
assert.equal(chunks[0].offset, 0);
assert.equal(chunks[0].length, paragraphs[0].length + '\n\n'.length);
assert.equal(
chunks[0].input.slice(chunks[0].offset, chunks[0].offset + chunks[0].length),
paragraphs[0] + '\n\n',
);
assert.equal(chunks[1].offset, chunks[0].length);
assert.equal(chunks[1].length, paragraphs[1].length + '\n\n'.length);
assert.equal(
chunks[1].input.slice(chunks[1].offset, chunks[1].offset + chunks[1].length),
paragraphs[1] + '\n\n',
);
assert.equal(chunks[2].offset, chunks[1].offset + chunks[1].length);
assert.equal(chunks[2].length, paragraphs[2].length + '\n\n'.length);
assert.equal(
chunks[2].input.slice(chunks[2].offset, chunks[2].offset + chunks[2].length),
paragraphs[2] + '\n\n',
);
assert.equal(chunks[3].offset, chunks[2].offset + chunks[2].length);
assert.equal(chunks[3].length, paragraphs[3].length);
assert.equal(
chunks[3].input.slice(chunks[3].offset, chunks[3].offset + chunks[3].length),
paragraphs[3],
);

What does the returned data look like?
smartchunk returns an array of Chunk objects:
type Chunk = {
input: string;
length: number;
offset: number;
};

- input is the original input text, unmodified.
- length is the length of the chunk, in bytes.
- offset is the index of the chunk's first byte.
Why is the chunked text not returned?
If you're batch-chunking large volumes of data,
typically you don't want to load all those strings into memory at the same time.
So chunks are returned with input, offset and length instead,
allowing callers to slice chunk strings
for one batch at a time.
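For instance, here is a rough sketch
of how a caller might materialise chunk strings lazily,
one at a time,
where embed is a hypothetical stand-in
for whatever embeddings client you use:

import { smartchunk } from 'smartchunk';

// Hypothetical embeddings call, shown as a stub;
// substitute your own model client here.
async function embed(text: string): Promise<number[]> {
  return [];
}

async function embedChunks(text: string): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const chunk of smartchunk(text)) {
    // Materialise each chunk string only when it's needed,
    // instead of holding every chunk in memory at once.
    const slice = chunk.input.slice(chunk.offset, chunk.offset + chunk.length);
    vectors.push(await embed(slice));
  }
  return vectors;
}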
How does it work?
smartchunk applies a cascading hierarchy of strategies
to divide long text into chunks,
without breaking up semantic units
like paragraphs and sentences
unless forced to do so
by the limits imposed by maxSize and minSize.
If unset,
maxSize defaults to 200 bytes and
minSize defaults to 20 bytes.
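For instance,
to force smaller chunks than the defaults allow,
you can override both limits
(the values here are arbitrary, for illustration):

const chunks = smartchunk(text, {
  maxSize: 100,
  minSize: 10,
});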
You can also override which strategies are used
by setting the strategies option.
What strategies are available?
By default,
smartchunk will split at paragraph boundaries,
falling back to line boundaries for any paragraphs that are too long,
falling back to sentence boundaries for any lines that are too long,
falling back to word boundaries for any sentences that are too long and
finally falling back to grapheme boundaries for any words that are too long
(but hopefully your data doesn't fall back this far 🙂).
The strategies for words and graphemes include some overlap
(up to 20% for words, 10% for graphemes)
to compensate for mangled semantics around split points.
To recreate this default behaviour
with the strategies option,
you would pass:
const chunks = smartchunk(text, {
strategies: ['paragraph', 'line', 'sentence', 'word', 'grapheme'],
});

By omitting items from the array,
or by changing the order of items,
you can control the hierarchy of strategies
that are used by smartchunk.
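For instance,
to skip paragraph and line splitting entirely
and chunk straight at sentence boundaries
(falling back to words for any sentences that are too long),
you could pass:

const chunks = smartchunk(text, {
  strategies: ['sentence', 'word'],
});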
What options are available?
- locale: The language of the source material, in the same BCP 47 format used by Intl.Segmenter (so either a string or an array of strings). The default value is 'en'.
- maxSize: Maximum chunk size in bytes. It's guaranteed that no returned chunk will be greater than this size. The default value is 200.
- minSize: Minimum chunk size in bytes. Most chunks will be greater than or equal to minSize, but it's possible that the very last chunk will be smaller. The default value is 20.
- strategies: Array of strategies that smartchunk will try to use, in order of preference. The default value is ['paragraph', 'line', 'sentence', 'word', 'grapheme'].
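For instance,
a call that combines these options
(the specific values are illustrative, not recommendations):

const chunks = smartchunk(text, {
  locale: 'fr',
  maxSize: 500,
  minSize: 50,
  strategies: ['paragraph', 'sentence', 'word'],
});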
Is there a change log?
Yes.
How do I set up the dev environment?
To compile TypeScript:
make build

To lint the code:

make lint

To run the tests:

make test

What versions of Node.js does it support?
Node.js versions 20 or greater are supported.
What license is it released under?
MIT.
