smartchunk

Smart chunking strategies.

Why?

A common requirement is chunking long text before generating embeddings for vector search. The efficacy of vector search is directly impacted by the length of whatever text is sent to your embeddings model. Longer text tends to include more concepts, which dilutes the precision of vectors returned by the model. But chunking at arbitrary fixed points also dilutes precision because it splits semantic units of text across one or more discrete chunks. So a smarter chunking strategy is needed.

A crude workaround is to add some overlap between chunks, but that doesn't really tackle the root cause and is hard to generalise across source material because there's no ideal chunk size or overlap length that works for all input text. Instead, you really want something that looks at the content and splits intelligently around it.

smartchunk is my approach to solving this issue, based on experience building a few different systems that had to chunk data for vector search (some for RAG, some for general context truncation).

How do I install it?

If you're using npm:

npm i smartchunk --save

Or if you just want the git repo:

git clone [email protected]:philbooth/smartchunk.git

How do I use it?

import * as assert from 'node:assert';
import { smartchunk } from 'smartchunk';

const paragraphs = [
  "A long time ago in a galaxy far, far away....",
  "It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire.",
  "During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet.",
  "Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy....",
];
const text = paragraphs.join('\n\n');

const chunks = smartchunk(text);

assert.equal(chunks.length, 4);

assert.equal(chunks[0].offset, 0);
assert.equal(chunks[0].length, paragraphs[0].length + '\n\n'.length);
assert.equal(
  chunks[0].input.slice(chunks[0].offset, chunks[0].offset + chunks[0].length),
  paragraphs[0] + '\n\n',
);

assert.equal(chunks[1].offset, chunks[0].length);
assert.equal(chunks[1].length, paragraphs[1].length + '\n\n'.length);
assert.equal(
  chunks[1].input.slice(chunks[1].offset, chunks[1].offset + chunks[1].length),
  paragraphs[1] + '\n\n',
);

assert.equal(chunks[2].offset, chunks[1].offset + chunks[1].length);
assert.equal(chunks[2].length, paragraphs[2].length + '\n\n'.length);
assert.equal(
  chunks[2].input.slice(chunks[2].offset, chunks[2].offset + chunks[2].length),
  paragraphs[2] + '\n\n',
);

assert.equal(chunks[3].offset, chunks[2].offset + chunks[2].length);
assert.equal(chunks[3].length, paragraphs[3].length);
assert.equal(
  chunks[3].input.slice(chunks[3].offset, chunks[3].offset + chunks[3].length),
  paragraphs[3],
);

What does the returned data look like?

smartchunk returns an array of Chunk objects:

type Chunk = {
  input: string;
  length: number;
  offset: number;
};
  • input is the original input text, unmodified.

  • length is the length of the chunk, in bytes.

  • offset is the index of the chunk's first byte.

Why is the chunked text not returned?

If you're batch-chunking large volumes of data, typically you don't want to load all of those strings into memory at the same time. So chunks are returned as input, offset and length instead, allowing callers to slice out chunk strings one batch at a time.
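
For example, chunk strings can be materialised lazily instead of all at once. This is a minimal sketch using the Chunk type shown above; the batchSize value is illustrative:

// Yield chunk strings in small batches, so that only one
// batch of strings is held in memory at a time.
function* batches(chunks: Chunk[], batchSize = 100): Generator<string[]> {
  for (let i = 0; i < chunks.length; i += batchSize) {
    yield chunks
      .slice(i, i + batchSize)
      .map((c) => c.input.slice(c.offset, c.offset + c.length));
  }
}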

How does it work?

smartchunk applies a cascading hierarchy of strategies to divide long text into chunks, without breaking up semantic units like paragraphs and sentences unless forced to do so by the limits imposed by maxSize and minSize. If unset, maxSize defaults to 200 bytes and minSize defaults to 20 bytes. You can also override which strategies are used by setting the strategies option.
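
For example, here's a quick way to see the size limits in action, reusing the sample text from the usage example above (the values are illustrative, not recommendations):

import * as assert from 'node:assert';
import { smartchunk } from 'smartchunk';

const chunks = smartchunk(text, { maxSize: 100, minSize: 10 });

// No chunk ever exceeds maxSize...
for (const chunk of chunks) {
  assert.ok(chunk.length <= 100);
}
// ...but the very last chunk may be smaller than minSize.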

What strategies are available?

By default, smartchunk will split at paragraph boundaries, falling back to line boundaries for any paragraphs that are too long, falling back to sentence boundaries for any lines that are too long, falling back to word boundaries for any sentences that are too long and finally falling back to grapheme boundaries for any words that are too long (but hopefully your data doesn't fall back this far 🙂). The strategies for words and graphemes include some overlap (up to 20% for words, 10% for graphemes) to compensate for mangled semantics around split points.

To recreate this default behaviour with the strategies option, you would pass:

const chunks = smartchunk(text, {
  strategies: ['paragraph', 'line', 'sentence', 'word', 'grapheme'],
});

By omitting items from the array, or by changing the order of items, you can control the hierarchy of strategies that are used by smartchunk.
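
For example, to ignore paragraph and line breaks entirely and split at sentence boundaries first, falling back to words, you might pass (an illustrative configuration):

const chunks = smartchunk(text, {
  strategies: ['sentence', 'word'],
});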

What options are available?

  • locale: The language of the source material, in the same BCP 47 format used by Intl.Segmenter (so either a string or an array of strings). The default value is 'en'.

  • maxSize: Maximum chunk size in bytes. It's guaranteed that no returned chunks will be greater than this size. The default value is 200.

  • minSize: Minimum chunk size in bytes. Most chunks will be greater than or equal to minSize, but it's possible that the very last chunk will be smaller. The default value is 20.

  • strategies: Array of strategies that smartchunk will try to use, in order of preference. The default value is ['paragraph', 'line', 'sentence', 'word', 'grapheme'].
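
Putting those together, a call that overrides every option might look like this (all of the values are illustrative):

const chunks = smartchunk(text, {
  locale: 'en-GB',
  maxSize: 512,
  minSize: 64,
  strategies: ['paragraph', 'sentence', 'word'],
});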

Is there a change log?

Yes.

How do I set up the dev environment?

To compile TypeScript:

make build

To lint the code:

make lint

To run the tests:

make test

What versions of Node does it support?

Node versions 20 and above are supported.

What license is it released under?

MIT.