endee-model
v0.1.0
Sparse text embeddings using the BM25 algorithm for vector database integration
endee-model
A lightweight JavaScript/TypeScript library for generating sparse text embeddings using the BM25 algorithm. Designed to plug into vector databases (Qdrant, Weaviate, etc.) to enable efficient keyword-based (sparse) search, often combined with dense embeddings for hybrid search.
Installation
```bash
npm install endee-model
```

Quick Start
```typescript
import { SparseModel } from 'endee-model';

const model = new SparseModel('endee/bm25');

const documents = [
  'The quick brown fox jumps over the lazy dog',
  'Machine learning enables computers to learn from data',
];

for (const embedding of model.embed(documents)) {
  console.log(embedding.asDict()); // { tokenId: weight, ... }
}
```

Usage
Embed Documents
```typescript
const model = new SparseModel('endee/bm25');

for (const embedding of model.embed(documents, /* batchSize */ 256)) {
  const dict = embedding.asDict();  // { tokenId: weight }
  const obj = embedding.asObject(); // { indices: Int32Array, values: Float32Array }
}
```

Embed Queries
Query embeddings use unit weights (all 1.0) over the unique normalised tokens.
```typescript
for (const embedding of model.queryEmbed('search query text')) {
  console.log(embedding.asDict());
}

// Also accepts an array of queries
for (const embedding of model.queryEmbed(['query one', 'query two'])) {
  console.log(embedding.asDict());
}
```

Count Tokens
```typescript
const count = model.tokenCount('some text here');
```

Work with SparseEmbedding Directly
```typescript
import { SparseEmbedding } from 'endee-model';

const embedding = SparseEmbedding.fromDict({ 100: 0.5, 200: 0.8, 300: 1.2 });

embedding.asDict();   // { 100: 0.5, 200: 0.8, 300: 1.2 }
embedding.asObject(); // { indices: Int32Array([100, 200, 300]), values: Float32Array([0.5, 0.8, 1.2]) }
```

Configuration
SparseModel / Bm25Options
| Option | Default | Description |
|--------|---------|-------------|
| k | 1.2 | BM25 saturation parameter |
| b | 0.75 | Length normalisation factor (0 = none, 1 = full) |
| avgLen | 256 | Expected average document length in tokens |
| language | "english" | Language for stopword removal and stemming |
| maxTokenLen | 40 | Tokens longer than this are discarded |
| disableStemmer | false | Skip stemming |
| cacheDir | undefined | Custom cache directory |
```typescript
const model = new SparseModel('endee/bm25', {
  k: 1.5,
  b: 0.8,
  language: 'english',
});
```

Cache directory
Resolved in this order:
1. `cacheDir` option
2. `ENDEE_CACHE_PATH` environment variable
3. `{os.tmpdir()}/endee_cache`
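
The resolution order above can be sketched as a small pure function. This is an illustration only: `resolveCacheDir` is a hypothetical helper, not part of the endee-model API, and treating an empty string as "not set" is an assumption.

```typescript
// Hypothetical helper mirroring the documented resolution order.
// The library's internal implementation may differ.
function resolveCacheDir(
  cacheDir: string | undefined,             // the `cacheDir` option, if given
  env: Record<string, string | undefined>,  // e.g. process.env
  tmpDir: string,                           // e.g. os.tmpdir()
): string {
  // 1. An explicit option wins.
  if (cacheDir) return cacheDir;
  // 2. Otherwise fall back to the environment variable.
  const fromEnv = env['ENDEE_CACHE_PATH'];
  if (fromEnv) return fromEnv;
  // 3. Otherwise use a folder under the OS temp directory.
  return `${tmpDir}/endee_cache`;
}

const fromOption = resolveCacheDir('/data/cache', { ENDEE_CACHE_PATH: '/env/cache' }, '/tmp');
const fromEnvVar = resolveCacheDir(undefined, { ENDEE_CACHE_PATH: '/env/cache' }, '/tmp');
const fallback = resolveCacheDir(undefined, {}, '/tmp');
```

In a Node.js program the last two arguments would typically be `process.env` and `os.tmpdir()`.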
Available Languages
```typescript
import { bm25Languages } from 'endee-model';

console.log(bm25Languages()); // ['afrikaans', 'arabic', 'english', ...]
```

How It Works
- Tokenisation: text is split into word tokens (punctuation stripped)
- Normalisation: stopwords are removed and oversized tokens are discarded
- Stemming: tokens are reduced to stems via the Porter stemmer (English default)
- BM25 weights: term-frequency weights are computed as `weight = tf * (k + 1) / (tf + k * (1 - b + b * (docLen / avgLen)))`
- Token IDs: each token is hashed to a stable integer via MurmurHash3
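
To make the weight formula concrete, here is a minimal sketch of it as a standalone function. `bm25TfWeight` is a hypothetical name used for illustration and is not exported by the library; the default arguments are the documented defaults for `k`, `b` and `avgLen`.

```typescript
// Sketch of the BM25 term-frequency weight from the formula above.
// `bm25TfWeight` is a hypothetical helper, not part of endee-model's API.
function bm25TfWeight(
  tf: number,      // occurrences of the term in the document
  docLen: number,  // document length in tokens
  k = 1.2,         // saturation parameter
  b = 0.75,        // length normalisation factor
  avgLen = 256,    // expected average document length
): number {
  // Equals 1 for an average-length document, above 1 for longer ones.
  const lengthNorm = 1 - b + b * (docLen / avgLen);
  return (tf * (k + 1)) / (tf + k * lengthNorm);
}

// Average-length document: the length normaliser is exactly 1.
const once = bm25TfWeight(1, 256);    // 2.2 / 2.2 = 1
const thrice = bm25TfWeight(3, 256);  // 6.6 / 4.2, roughly 1.571
// Same term frequency in a document twice as long yields a smaller weight.
const thriceLong = bm25TfWeight(3, 512); // 6.6 / 5.1, roughly 1.294
```

Tripling the term frequency raises the weight by well under a factor of three, which is the saturation effect controlled by `k`; the weight can never exceed `k + 1`.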
Note: IDF weighting must be applied on the vector index side. This library outputs TF weights only.
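
As a rough illustration of what index-side IDF weighting means, the sketch below scores one document against one query from their sparse vectors. Everything here is an assumption made for the example: the `SparseVector` type, the `idf` and `score` functions, and the particular IDF formula (one common BM25 variant) are hypothetical, and a real vector index may compute IDF differently.

```typescript
// Hypothetical index-side scoring; none of these names come from endee-model.
type SparseVector = Record<number, number>; // tokenId -> weight, shaped like asDict() output

// One common BM25 IDF variant (assumed here; your vector index may differ).
function idf(totalDocs: number, docsWithTerm: number): number {
  return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// Sum IDF * query weight * document TF weight over the token IDs both vectors share.
function score(
  query: SparseVector,
  doc: SparseVector,
  docFreq: Record<number, number>, // tokenId -> number of documents containing it
  totalDocs: number,
): number {
  let total = 0;
  for (const key of Object.keys(query)) {
    const tokenId = Number(key);
    const docWeight = doc[tokenId];
    if (docWeight === undefined) continue; // term absent from the document
    total += idf(totalDocs, docFreq[tokenId] ?? 0) * query[tokenId] * docWeight;
  }
  return total;
}

// Query embeddings carry unit weights; document embeddings carry TF weights.
const queryVec: SparseVector = { 100: 1.0, 200: 1.0 };
const docVec: SparseVector = { 100: 1.0, 300: 0.8 };
const result = score(queryVec, docVec, { 100: 2, 200: 5, 300: 1 }, 10);
// Only token 100 matches, so result = idf(10, 2) * 1.0 * 1.0.
```

Because the library emits TF weights only, the document frequencies (`docFreq` and `totalDocs` in this sketch) have to come from the index, which is the only component that sees the whole corpus.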
Requirements
- Node.js >= 14
License
MIT
