

semchunk-ts

A TypeScript port of semchunk, a Python library by Isaacus / Umar Butler for splitting text into semantically meaningful chunks.

semchunk uses a novel hierarchical chunking algorithm that preserves local semantic context by splitting text along structurally meaningful boundaries (paragraphs, sentences, clauses, words) before falling back to character-level splits. The original Python library reports 15% better RAG performance than its closest competitors in its benchmarks, and is used in production by Docling, the Microsoft Intelligence Toolkit, and the Isaacus API.

This port brings the core algorithm to the TypeScript/JavaScript ecosystem.

Install

bun add semchunk-ts
# or
npm install semchunk-ts

Quickstart

import { chunk, chunkerify } from "semchunk-ts";

const text = "The quick brown fox jumps over the lazy dog.";

// Use chunk() directly with any token counting function.
const chunks = chunk(text, 20, (text) => text.length);
// => ["The quick brown fox", "jumps over the lazy", "dog."]

// Or create a reusable Chunker with chunkerify().
const chunker = chunkerify((text) => text.length, 20);
chunker.chunk(text);
// => ["The quick brown fox", "jumps over the lazy", "dog."]

API

chunk(text, chunkSize, tokenCounter, options?)

Split a single text into chunks.

import { chunk } from "semchunk-ts";

// Basic usage
const chunks = chunk("Your text here...", 512, tokenCounter);

// With offsets — returns [chunks, offsets] where text.slice(start, end) === chunk
const [chunkList, offsets] = chunk("Your text here...", 512, tokenCounter, {
  offsets: true,
});

// With overlap — proportion (<1) or absolute token count (>=1)
const overlapping = chunk("Your text here...", 512, tokenCounter, {
  overlap: 0.5,
});

Options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| memoize | boolean | true | Cache token counter results for performance |
| offsets | boolean | false | Return [chunks, offsets] instead of just chunks |
| overlap | number \| null | null | Chunk overlap as proportion (<1) or token count (>=1) |
| cacheMaxsize | number \| null | null | Max memoization cache entries (null = unlimited) |
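The memoize and cacheMaxsize options exist because token counting is usually the dominant cost, and the algorithm counts many overlapping spans repeatedly while searching for split points. The following is an illustrative, self-contained sketch of the idea (memoizeCounter is a hypothetical helper, not part of semchunk-ts, and the library's internal cache may differ):

```typescript
// Sketch of token-counter memoization (illustrative, not semchunk-ts internals).
// A bounded Map caches counts so a repeated span is only tokenized once.
function memoizeCounter(
  counter: (text: string) => number,
  maxSize: number | null = null,
): (text: string) => number {
  const cache = new Map<string, number>();
  return (text: string): number => {
    const hit = cache.get(text);
    if (hit !== undefined) return hit;
    const count = counter(text);
    if (maxSize !== null && cache.size >= maxSize) {
      // Evict the oldest entry (simple FIFO; a real cache might use LRU).
      cache.delete(cache.keys().next().value as string);
    }
    cache.set(text, count);
    return count;
  };
}

// The wrapped counter hits the cache on the second identical call.
let calls = 0;
const counted = memoizeCounter((t) => {
  calls++;
  return t.split(/\s+/).length;
}, 100);
counted("hello world"); // tokenizes: calls === 1
counted("hello world"); // cached:    calls is still 1
```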

chunkerify(tokenCounter, chunkSize, options?)

Create a reusable Chunker instance.

import { chunkerify } from "semchunk-ts";

// From a token counting function
const chunker = chunkerify((text) => text.split(/\s+/).length, 100);

// From a tokenizer object with an encode() method
const tokenizerChunker = chunkerify(myTokenizer, 512);

// With max token chars optimization (skips tokenization for clearly oversized text)
const boundedChunker = chunkerify(tokenCounter, 512, { maxTokenChars: 20 });

Chunker

// Chunk a single text
const chunks = chunker.chunk(text);
const [chunksWithOffsets, offsets] = chunker.chunk(text, { offsets: true });
const overlapping = chunker.chunk(text, { overlap: 0.5 });

// Chunk multiple texts
const results = chunker.chunkBatch([text1, text2, text3]);
const [allChunks, allOffsets] = chunker.chunkBatch(texts, { offsets: true });

How It Works

The algorithm splits text using a hierarchy of structurally meaningful splitters, from most to least desirable:

  1. Largest sequence of newlines / carriage returns
  2. Largest sequence of tabs
  3. Largest sequence of whitespace (with preference for whitespace after punctuation)
  4. Sentence terminators: . ? !
  5. Clause separators: ; , ( ) [ ] " ' ` *
  6. Sentence interrupters: :
  7. Word joiners: / \ & -
  8. Individual characters (last resort)

Chunks that exceed the size limit are recursively split. Undersized adjacent chunks are merged using binary search to approach the chunk size as closely as possible.
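The recursion above can be sketched in a few lines. This is an illustrative simplification, not the library's implementation: it uses a reduced splitter hierarchy (newlines, tabs, spaces, then raw characters), character length as the token count, and a greedy merge rather than the binary-search merge described above:

```typescript
// Illustrative sketch of hierarchical chunking (not semchunk-ts internals).
// Splitters are tried from most to least desirable; pieces that are still too
// large recurse to the next level, and undersized neighbours are merged.
const SPLITTERS = ["\n", "\t", " "]; // reduced hierarchy for the sketch

function splitRecursive(text: string, maxLen: number, level = 0): string[] {
  if (text.length <= maxLen) return [text];
  if (level >= SPLITTERS.length) {
    // Last resort: hard character-level split.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) out.push(text.slice(i, i + maxLen));
    return out;
  }
  const parts = text.split(SPLITTERS[level]).filter((p) => p.length > 0);
  if (parts.length <= 1) return splitRecursive(text, maxLen, level + 1);
  // Recurse on each part, then greedily merge undersized neighbours.
  const pieces = parts.flatMap((p) => splitRecursive(p, maxLen, level + 1));
  const merged: string[] = [];
  for (const piece of pieces) {
    const last = merged[merged.length - 1];
    if (last !== undefined && (last + " " + piece).length <= maxLen) {
      merged[merged.length - 1] = last + " " + piece;
    } else {
      merged.push(piece);
    }
  }
  return merged;
}
```

Even this toy version reproduces the Quickstart output for a 20-character budget, because the interesting work is in choosing split points and repacking, not in the token counter.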

Publishing to npm

# Make sure you're logged in
npm login

# Run tests and build
bun test
bun run build

# Dry-run to verify package contents
npm pack --dry-run

# Publish (runs build automatically via prepublishOnly)
npm publish

# Publish with a specific tag (e.g. beta)
npm publish --tag beta

To publish a new version:

# Bump version (patch/minor/major)
npm version patch   # 1.0.0 -> 1.0.1
npm version minor   # 1.0.0 -> 1.1.0
npm version major   # 1.0.0 -> 2.0.0

# Then publish
npm publish

Differences from the Python Library

This port covers the core chunking algorithm. The following Python-specific features are not included:

  • AI-powered chunking via Isaacus enrichment models
  • Multiprocessing (mpire) — use chunkBatch() for sequential batch processing
  • Automatic tokenizer loading from tiktoken/transformers by name — pass a token counter function or tokenizer object directly
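Since tokenizers aren't loaded by name, you wire one up yourself. A minimal sketch of the two shapes chunkerify() accepts, using a hypothetical whitespace tokenizer as a stand-in (in practice you would pass a real tokenizer object, such as one from js-tiktoken, or wrap its encode() in a counting function):

```typescript
// Hypothetical stand-in for a real tokenizer object. Any object exposing
// encode(text) => token array fits the tokenizer shape chunkerify() accepts.
const myTokenizer = {
  encode: (text: string): number[] =>
    text.split(/\s+/).filter(Boolean).map((_, i) => i),
};

// Alternatively, wrap encode() into a plain token-counting function.
const tokenCounter = (text: string): number => myTokenizer.encode(text).length;
```

Either myTokenizer or tokenCounter can then be passed to chunkerify(), matching the two call forms shown in the API section.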

Acknowledgements

This is a TypeScript port of semchunk (v4.0.0), created by Isaacus and Umar Butler. The original library is licensed under MIT. All credit for the chunking algorithm and its design goes to the original authors.

A Rust port (semchunk-rs) is also available, maintained by @dominictarro.

License

MIT