smartchunk
Smart chunking strategies.
- Why?
- How do I install it?
- How do I use it?
- What does the returned data look like?
- Why is the chunked text not returned?
- How does it work?
- What strategies are available?
- What options are available?
- Is there a change log?
- How do I set up the dev environment?
- What versions of Node.js does it support?
- What license is it released under?
Why?
A common requirement is chunking long text before generating embeddings for vector search. The efficacy of vector search is directly impacted by the length of whatever text is sent to your embeddings model. Longer text tends to include more concepts, which dilutes the precision of the vectors returned by the model. But chunking at arbitrary fixed points also dilutes precision, because it splits semantic units of text across chunk boundaries. So a smarter chunking strategy is needed.
A crude workaround is to add some overlap between chunks, but that doesn't really tackle the root cause and is hard to generalise across source material because there's no ideal chunk size or overlap length that works for all input text. Instead, you really want something that looks at the content and splits intelligently around it.
smartchunk is my approach to solving this issue,
based on experience building a few different systems
that had to chunk data for vector search
(some for RAG, some for general context truncation).
How do I install it?
If you're using npm:
npm i smartchunk --save

Or if you just want the git repo:

git clone git@github.com:philbooth/smartchunk.git

How do I use it?
import * as assert from 'node:assert';
import { smartchunk } from 'smartchunk';
const paragraphs = [
"A long time ago in a galaxy far, far away....",
"It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.",
"During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.",
"Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy....",
];
const text = paragraphs.join('\n\n');
const chunks = smartchunk(text);
assert.equal(chunks.length, 4);
assert.equal(chunks[0].offset, 0);
assert.equal(chunks[0].length, paragraphs[0].length + '\n\n'.length);
assert.equal(
chunks[0].input.slice(chunks[0].offset, chunks[0].offset + chunks[0].length),
paragraphs[0] + '\n\n',
);
assert.equal(chunks[1].offset, chunks[0].length);
assert.equal(chunks[1].length, paragraphs[1].length + '\n\n'.length);
assert.equal(
chunks[1].input.slice(chunks[1].offset, chunks[1].offset + chunks[1].length),
paragraphs[1] + '\n\n',
);
assert.equal(chunks[2].offset, chunks[1].offset + chunks[1].length);
assert.equal(chunks[2].length, paragraphs[2].length + '\n\n'.length);
assert.equal(
chunks[2].input.slice(chunks[2].offset, chunks[2].offset + chunks[2].length),
paragraphs[2] + '\n\n',
);
assert.equal(chunks[3].offset, chunks[2].offset + chunks[2].length);
assert.equal(chunks[3].length, paragraphs[3].length);
assert.equal(
chunks[3].input.slice(chunks[3].offset, chunks[3].offset + chunks[3].length),
paragraphs[3],
);

What does the returned data look like?
smartchunk returns an array of Chunk objects:
type Chunk = {
input: string;
length: number;
offset: number;
};

- input is the original input text, unmodified.
- length is the length of the chunk, in bytes.
- offset is the index of the chunk's first byte.
Why is the chunked text not returned?
If you're batch-chunking large volumes of data,
typically you don't want to load all those strings into memory at the same time.
So chunks are returned with input, offset and length instead,
allowing callers to slice chunk strings
for one batch at a time.
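For instance, here is a rough sketch
of how a caller might materialise chunk strings lazily,
one at a time,
where embed is a hypothetical stand-in
for whatever embeddings client you use:

import { smartchunk } from 'smartchunk';

// Hypothetical embeddings call, shown as a stub;
// substitute your own model client here.
async function embed(text: string): Promise<number[]> {
  return [];
}

async function embedChunks(text: string): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const chunk of smartchunk(text)) {
    // Materialise each chunk string only when it's needed,
    // instead of holding every chunk in memory at once.
    const slice = chunk.input.slice(chunk.offset, chunk.offset + chunk.length);
    vectors.push(await embed(slice));
  }
  return vectors;
}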
How does it work?
smartchunk applies a cascading hierarchy of strategies
to divide long text into chunks,
without breaking up semantic units
like paragraphs and sentences
unless forced to do so
by the limits imposed by maxSize and minSize.
If unset,
maxSize defaults to 200 bytes and
minSize defaults to 20 bytes.
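For instance,
to force smaller chunks than the defaults allow,
you can override both limits
(the values here are arbitrary, for illustration):

const chunks = smartchunk(text, {
  maxSize: 100,
  minSize: 10,
});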
You can also override which strategies are used
by setting the strategies option.
What strategies are available?
By default,
smartchunk will split at paragraph boundaries,
falling back to line boundaries for any paragraphs that are too long,
falling back to sentence boundaries for any lines that are too long,
falling back to word boundaries for any sentences that are too long and
finally falling back to grapheme boundaries for any words that are too long
(but hopefully your data doesn't fall back this far 🙂).
The strategies for words and graphemes include some overlap
(up to 20% for words, 10% for graphemes)
to compensate for mangled semantics around split points.
To recreate this default behaviour
with the strategies option,
you would pass:
const chunks = smartchunk(text, {
strategies: ['paragraph', 'line', 'sentence', 'word', 'grapheme'],
});

By omitting items from the array,
or by changing the order of items,
you can control the hierarchy of strategies
that are used by smartchunk.
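For instance,
to skip paragraph and line splitting entirely
and chunk straight at sentence boundaries
(falling back to words for any sentences that are too long),
you could pass:

const chunks = smartchunk(text, {
  strategies: ['sentence', 'word'],
});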
What options are available?
- locale: The language of the source material, in the same BCP 47 format used by Intl.Segmenter (so either a string or an array of strings). The default value is 'en'.
- maxSize: Maximum chunk size in bytes. It's guaranteed that no returned chunk will be greater than this size. The default value is 200.
- minSize: Minimum chunk size in bytes. Most chunks will be greater than or equal to minSize, but it's possible that the very last chunk will be smaller. The default value is 20.
- strategies: Array of strategies that smartchunk will try to use, in order of preference. The default value is ['paragraph', 'line', 'sentence', 'word', 'grapheme'].
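For instance,
a call that combines these options
(the specific values are illustrative, not recommendations):

const chunks = smartchunk(text, {
  locale: 'fr',
  maxSize: 500,
  minSize: 50,
  strategies: ['paragraph', 'sentence', 'word'],
});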
Is there a change log?
Yes.
How do I set up the dev environment?
To compile TypeScript:
make build

To lint the code:

make lint

To run the tests:

make test

What versions of Node.js does it support?
Node.js versions 20 or greater are supported.
What license is it released under?
MIT.
