semachunk v0.3.2
Semantic chunking of large texts. Plug in any embedding provider/API. Batch embeddings for efficiency and for handling API rate limits.
🍱 semachunk - a Minimal Semantic Chunker
A lightweight TypeScript/JavaScript library for semantically chunking text, with a single runtime dependency (`sentence-parse`).
Features
- Zero heavy dependencies (only `sentence-parse`)
- Model agnostic: plug in any embedding provider (OpenAI, HuggingFace, etc.) via a simple callback
- Batch embeddings for efficiency and for handling API rate limits
- Optimized sentence/chunk merging algorithm
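The model-agnostic design comes down to a single callback contract: given a batch of strings, return one embedding vector per string. A minimal sketch of that shape (the `Embedder` type name and the toy embedder below are illustrative, not the library's actual exports):

```typescript
// Illustrative sketch of the embedder contract: an async function that
// maps a batch of texts to a batch of embedding vectors.
type Embedder = (texts: string[]) => Promise<number[][]>;

// A deterministic toy embedder for local testing: folds character codes
// into a 4-dimensional vector. Real usage would call OpenAI, HuggingFace,
// or any other provider here instead.
const toyEmbedder: Embedder = async (texts) => {
  return texts.map((t) => {
    const v = [0, 0, 0, 0];
    for (let i = 0; i < t.length; i++) {
      v[i % 4] += t.charCodeAt(i) / 1000;
    }
    return v;
  });
};
```

Because the library only sees this callback, swapping providers never requires changing chunking code.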
Usage
```javascript
import { chunkText } from 'semachunk';

// 1. Define your embedding callback: it receives an array of strings
//    and must resolve to an array of embedding vectors (number[][]).
async function myEmbedder(texts) {
  const res = await openai.embeddings.create({
    input: texts,
    model: "text-embedding-3-small"
  });
  return res.data.map((item) => item.embedding);
}

// 2. Chunk your text
const text = "Your long document text...";
const chunks = await chunkText(text, myEmbedder, {
  maxChunkSize: 500,
  similarityThreshold: 0.5
});

console.log(chunks);
// Output: [{ text: "...", embedding: [...] }, ...]
```
Configuration
| Option | Default | Description |
|---|---|---|
| maxChunkSize | 500 | Max characters per chunk |
| similarityThreshold | 0.5 | Threshold to merge sentences/chunks |
| dynamicThresholdLowerBound | 0.4 | Lower bound for dynamic threshold adjustment |
| dynamicThresholdUpperBound | 0.8 | Upper bound for dynamic threshold adjustment |
| numSimilaritySentencesLookahead | 3 | Number of future sentences to look ahead for similarity context |
| combineChunks | true | Enable the iterative merging optimization |
| combineChunksSimilarityThreshold | 0.5 | Threshold for merging chunks during optimization pass |
| maxUncappedPasses | 100 | Max number of passes where merges are NOT throttled (capped passes are unlimited) |
| maxMergesPerPass | 50 | Absolute limit on the number of merges per pass |
| candiateMergesPercentageCap | 40 | Percentage of valid merge candidates to execute per pass |
| uncappedCandidateMerges | 12 | Soft-minimum number of merges per pass (overrides percentage cap) |
| returnEmbedding | true | Include embeddings in the output |
| chunkPrefix | '' | Prefix to add to each chunk before embedding |
| excludeChunkPrefixInResults | false | Exclude the prefix from the returned text |
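Since the library hands your callback whole batches of sentences, rate-limit handling can live inside the callback itself. One way to sketch this (the `batchedEmbedder` helper and its `batchSize`/`delayMs` knobs are illustrative, not semachunk options):

```typescript
// Hypothetical rate-limit-friendly wrapper: splits texts into fixed-size
// batches, embeds them sequentially, and pauses between provider calls.
async function batchedEmbedder(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize: number = 64,
  delayMs: number = 200
): Promise<number[][]> {
  const out: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    out.push(...(await embedBatch(batch)));
    if (i + batchSize < texts.length) {
      // Simple throttle between batches to stay under provider rate limits.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return out;
}
```

A wrapper like this can be passed to `chunkText` by partially applying the provider call, keeping retry/throttle policy out of the chunking logic.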
Acknowledgements
This project is derived from the original semantic-chunking library by jparkerweb, with the following changes:
- Zero heavy dependencies (only `sentence-parse`)
- Stripped down to the core functionality
- Model agnostic
- Batch embeddings for efficiency and for handling API rate limits
- New optimized chunk merging algorithm
