llm-splitter

A JavaScript library for splitting text into configurable chunks with overlap support.

Features

  • 📖 Paragraph-Aware Chunking: Respects document structure while maintaining token limits
  • 🧠 LLM Optimized: Designed for vectorization with tiktoken and other tokenizers
  • 📊 Rich Metadata: Complete character position tracking for all chunks
  • ⚡ High Performance: Single-pass greedy algorithms for efficient processing
  • 🎨 Flexible Input: Supports strings, arrays, and custom tokenization
  • 📝 TypeScript: Full type safety with comprehensive interfaces

Installation

$ npm install llm-splitter

Usage

import { split, getChunk } from 'llm-splitter'
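
For example, a quick split with the defaults (this mirrors the basic example below):

const chunks = split('Hello world! This is a test.')
// => [{ text: 'Hello world! This is a test.', start: 0, end: 28 }]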

API

split(input, options)

Splits text into chunks based on a custom splitter function.

Each chunk contains positional data (start and end) that can later be used on its own, via getChunk(), to retrieve the chunk string (or array of strings). This pairing supports the common scenario of storing embeddings for a chunk in a database (e.g. pgvector) without also storing the chunk text directly -- while still being able to recover the full text of the chunk later from the original string input.
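
As a sketch of that workflow (embed() and saveEmbedding() here are hypothetical stand-ins for your embedding API and vector store -- not part of llm-splitter):

import { split, getChunk } from 'llm-splitter'

const doc = 'Hello world! This is a test.'
const chunks = split(doc, {
  chunkSize: 3,
  chunkOverlap: 1,
  splitter: t => t.split(/\s+/)
})

for (const chunk of chunks) {
  // Persist the embedding plus { start, end } -- but not the chunk text itself.
  await saveEmbedding({
    vector: await embed(chunk.text),
    start: chunk.start,
    end: chunk.end
  })
}

// Later: recover a chunk's text from the original input and its positions alone.
getChunk(doc, chunks[0].start, chunks[0].end)
// => 'Hello world! This'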

Parameters

  • input (string|string[]) - The text or array of texts to split
  • options (object) - Configuration options
    • chunkSize (number) - Maximum number of tokens per chunk (default: 512)
    • chunkOverlap (number) - Number of overlapping tokens between chunks (default: 0)
    • chunkStrategy (string) - Grouping preference for chunks (default: "character")
    • splitter (function) - Function to split text into tokens (default: character-by-character)

Notes:

  • chunkSize must be an integer ≥ 1
  • chunkOverlap must be an integer ≥ 0
  • chunkOverlap must be less than chunkSize
  • splitter functions may drop text when splitting, but must not alter the tokens they emit. Splitting on spaces is fine (e.g. (t) => t.split(" ")), but transforming the text is not (e.g. (t) => t.split(" ").map((x) => x.toUpperCase())).
  • Here are some sample splitter functions (a line-based sketch follows this list):
    • Character: text => text.split('') (default)
    • Word: text => text.split(/\s+/)
    • Sentence: text => text.split(/[.!?]+/)
    • Line: text => text.split(/\n/)
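
For instance, here is a minimal sketch using the Line splitter (positions follow the same conventions as the word-splitting example below):

const text = 'line one\nline two\nline three'
const chunks = split(text, {
  chunkSize: 2,
  splitter: text => text.split(/\n/)
})

// =>
;[
  { text: 'line one\nline two', start: 0, end: 17 },
  { text: 'line three', start: 18, end: 28 }
]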

Returns

Returns an array of chunk objects with the following structure:

{
  text: string | string[], // The chunk text
  start: number,           // Start position in the original text
  end: number              // End position in the original text
}

Examples

Basic usage with default options:

const text = 'Hello world! This is a test.'
const chunks = split(text)

// =>
// Splits into character-level chunks of 512 characters, which is just the original string here ;)
;[{ text: 'Hello world! This is a test.', start: 0, end: 28 }]

Custom chunk size and overlap:

const text = 'Hello world! This is a test.'
const chunks = split(text, {
  chunkSize: 10,
  chunkOverlap: 2
})

// =>
;[
  { text: 'Hello worl', start: 0, end: 10 },
  { text: 'rld! This ', start: 8, end: 18 },
  { text: 's is a tes', start: 16, end: 26 },
  { text: 'est.', start: 24, end: 28 }
]

Word-based splitting:

const text = 'Hello world! This is a test.'
const chunks = split(text, {
  chunkSize: 3,
  chunkOverlap: 1,
  splitter: text => text.split(/\s+/)
})

// =>
;[
  { text: 'Hello world! This', start: 0, end: 17 },
  { text: 'This is a', start: 13, end: 22 },
  { text: 'a test.', start: 21, end: 28 }
]

Array of strings:

const texts = ['Hello world!', 'This is a test.']
const chunks = split(texts, {
  chunkSize: 5,
  splitter: text => text.split(' ')
})

// =>
;[
  { text: ['Hello world!', 'This is a'], start: 0, end: 21 },
  { text: ['test.'], start: 22, end: 27 }
]

Paragraph chunking

By default, chunks are assembled by packing in as many tokens as will fit; this is the default chunkStrategy = "character". The other option, chunkStrategy = "paragraph", fits as many whole paragraphs (delimited by the end of a string array element or by \n\n characters) as possible into each chunk. A paragraph is only split when it sits at the start of a chunk and is too large to fit, in which case it spills across as many subsequent chunks as needed. This keeps paragraph structure more contained within chunks, which can yield better context for upstream usage (in a RAG app, etc.).

// Mix of paragraphs across array items and within items with `\n\n` marker.
const texts = [
  'Who has seen the wind?\n\nNeither I nor you.',
  'But when the leaves hang trembling,',
  'The wind is passing through.',
  'Who has seen the wind?\n\nNeither you nor I.',
  'But when the trees bow down their heads,',
  'The wind is passing by.'
]
const chunks = split(texts, {
  chunkSize: 20,
  chunkOverlap: 2,
  chunkStrategy: 'paragraph',
  splitter: text => text.split(/\s+/)
})

// =>
;[
  {
    text: [
      'Who has seen the wind?\n\nNeither I nor you.',
      'But when the leaves hang trembling,',
      'The wind is passing through.'
    ],
    start: 0,
    end: 105
  },
  {
    text: [
      'passing through.',
      'Who has seen the wind?\n\nNeither you nor I.',
      'But when the trees bow down their heads,'
    ],
    start: 89,
    end: 187
  },
  {
    text: ['their heads,', 'The wind is passing by.'],
    start: 175,
    end: 210
  }
]

getChunk(input, start, end)

Extracts a specific chunk of text from the original input based on start and end positions. For array input the positions are treated as if all elements in the array were concatenated into a single long string.

Note that for arrays, the returned result will be an array and that the first and/or last element of the array may be a substring of that array item's text.

Parameters

  • input (string|string[]) - The original input text or array of texts
  • start (number) - Start position in the original text
  • end (number) - End position in the original text

Returns

  • string - For single string input
  • string[] - For array of strings input

Examples

const text = 'Hello world! This is a test.'
const chunk = getChunk(text, 0, 12)
// =>
;('Hello world!')

const texts = ['Hello world!', 'This is a test.']
const chunk = getChunk(texts, 0, 16)
// =>
;['Hello world!', 'This']
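
Because split() reports positions in the same coordinate space that getChunk() consumes, the two round-trip. A small sketch reusing the word-based example from split() above:

const text = 'Hello world! This is a test.'
const chunks = split(text, {
  chunkSize: 3,
  chunkOverlap: 1,
  splitter: text => text.split(/\s+/)
})

chunks.map(c => getChunk(text, c.start, c.end))
// => ['Hello world! This', 'This is a', 'a test.']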

Advanced Usage

Custom Splitter Functions

You can create custom splitter functions for different tokenization strategies:

Sentences

Split by sentences using a regular expression.

// Sentence-based splitting
const text = 'Hello world! This is a test.'
const sentenceSplitter = text => text.split(/[.!?]+/)
const chunks = split(text, {
  chunkSize: 5,
  splitter: sentenceSplitter
})

// =>
;[{ text: 'Hello world! This is a test', start: 0, end: 27 }]

TikToken

Split using the tiktoken tokenizer with the commonly used text-embedding-ada-002 model.

import tiktoken from 'tiktoken'

// Create a tokenizer for a specific model
const tokenizer = tiktoken.encoding_for_model('text-embedding-ada-002')
const td = new TextDecoder()

// Create a token splitter function
const tokenSplitter = text =>
  Array.from(tokenizer.encode(text)).map(token =>
    td.decode(tokenizer.decode([token]))
  )

const text = 'Hello world! This is a test.'
const chunks = split(text, {
  chunkSize: 3,
  chunkOverlap: 1,
  splitter: tokenSplitter
})

// Don't forget to free the tokenizer when done
tokenizer.free()

// =>
;[
  { text: 'Hello world!', start: 0, end: 12 },
  { text: '! This is', start: 11, end: 20 },
  { text: ' is a test', start: 17, end: 27 },
  { text: ' test.', start: 22, end: 28 }
]

Working with Overlaps

Chunk overlap is useful for maintaining context between chunks:

const text = 'This is a very long document that needs to be split into chunks.'
const chunks = split(text, {
  chunkSize: 10,
  chunkOverlap: 3,
  splitter: text => text.split(' ')
})
// Each chunk will share 3 words with the previous chunk
// =>
;[
  {
    text: 'This is a very long document that needs to be',
    start: 0,
    end: 45
  },
  { text: 'needs to be split into chunks.', start: 34, end: 64 }
]

Multibyte / Unicode Strings

Processing text with multibyte characters (Unicode characters with char codes greater than 255 -- e.g. emojis) is problematic for tokenizers that can split strings across byte boundaries (as noted by other text splitting libraries). llm-splitter needs to determine the start/end locations of each chunk, and thus has to find the locations of the split parts in the original input(s).

llm-splitter approaches the multibyte-character problem as follows: for each part produced by splitter()...

  • If the part doesn't contain multibyte characters, it should match completely.
  • Next, try a simple string startsWith(part) match. This correctly matches many strings that contain multibyte characters.
  • If that fails, ignore the multibyte characters and iterate through the part until the single-byte portions match. At this point, multibyte characters may be skipped so that matching can proceed on substrings that start with single-byte characters.

When the parts are gathered into chunks and aggregated into { text, start, end } array items, this means some chunks will undercount the number of parts produced by the splitter() function -- put another way, a chunk may contain more parts than chunkSize specifies. In a simple test we conducted on 10MB of blog post content, using the tiktoken tokenizer in our splitter() function, 99.6% of the 3 million parts produced matched the input strings without needing to ignore multibyte characters. So, if your chunking needs hard constraints on chunk size (like an embedding API's max tokens), we advise reducing chunkSize to accommodate. Additionally, if a large number of multibyte characters are present, it likely makes sense to do some upfront analysis to determine a proper discount factor for chunkSize.
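
For example (a sketch -- the token limit and discount factor are illustrative assumptions, not library recommendations):

// Leave headroom below a hard token limit, e.g. an embedding API's max input size.
const MAX_TOKENS = 8191 // hypothetical provider limit
const DISCOUNT = 0.9 // illustrative safety factor for multibyte-heavy text
const chunks = split(text, {
  chunkSize: Math.floor(MAX_TOKENS * DISCOUNT),
  splitter: tokenSplitter // tiktoken-based splitter from the examples above
})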

Let's take a quick look at multibyte handling with some emojis and a tiktoken-based splitter:

const text = `
A noiseless 🤫 patient spider, 🕷️
I mark'd where on a little 🏔️ promontory it stood isolated,
Mark'd how to explore 🔍 the vacant vast 🌌 surrounding,
`

const chunks = split(text, {
  chunkSize: 15,
  chunkOverlap: 2,
  chunkStrategy: 'paragraph',
  splitter: tokenSplitter // from examples above
})

console.log(JSON.stringify(chunks, null, 2))
// =>
;[
  {
    text: "\nA noiseless 🤫 patient spider, 🕷️\nI mark'd where on",
    start: 0,
    end: 53
  },
  {
    text: " where on a little 🏔️ promontory it stood isolated,\nMark'd how",
    start: 44,
    end: 107
  },
  {
    text: "'d how to explore 🔍 the vacant vast 🌌 surrounding,\n",
    start: 101,
    end: 154
  }
]

Ultimately, this approach represents a tradeoff: while some higher-level Unicode data may be undercounted during the splitting process, it ensures that chunk start/end positions can be reliably determined with any user-supplied splitter function, preventing malformed chunks and internal errors.

License

MIT