endee-model

v0.1.0

Published

2 months ago

Sparse text embeddings using the BM25 algorithm for vector database integration


harshit_endee

vector-database embeddings bm25 sparse keyword-search qdrant weaviate nlp

A lightweight JavaScript/TypeScript library for generating sparse text embeddings using the BM25 algorithm. Designed to plug into vector databases (Qdrant, Weaviate, etc.) to enable efficient keyword-based (sparse) search, often combined with dense embeddings for hybrid search.

Installation

npm install endee-model

Quick Start

import { SparseModel } from 'endee-model';

const model = new SparseModel('endee/bm25');

const documents = [
  'The quick brown fox jumps over the lazy dog',
  'Machine learning enables computers to learn from data',
];

for (const embedding of model.embed(documents)) {
  console.log(embedding.asDict()); // { tokenId: weight, ... }
}

Usage

Embed Documents

const model = new SparseModel('endee/bm25');

for (const embedding of model.embed(documents, /* batchSize */ 256)) {
  const dict = embedding.asDict();    // { tokenId: weight }
  const obj  = embedding.asObject();  // { indices: Int32Array, values: Float32Array }
}

Embed Queries

Query embeddings use unit weights (all 1.0) over the unique normalised tokens.

for (const embedding of model.queryEmbed('search query text')) {
  console.log(embedding.asDict());
}

// Also accepts an array of queries
for (const embedding of model.queryEmbed(['query one', 'query two'])) {
  console.log(embedding.asDict());
}
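
To make the unit-weight rule concrete, here is a minimal sketch. It is not the library's implementation: `unitWeights` is a hypothetical helper, it keys on token strings rather than the hashed token IDs the library emits, and it assumes the tokens have already been normalised.

```typescript
// Hypothetical helper: assign weight 1.0 to each unique token.
// Assumes `tokens` were already normalised (stopwords removed, stemmed).
function unitWeights(tokens: string[]): Map<string, number> {
  const weights = new Map<string, number>();
  for (const token of tokens) {
    weights.set(token, 1.0); // repeats collapse: term frequency is ignored for queries
  }
  return weights;
}

// "search" appears twice but contributes a single entry with weight 1.0.
const queryWeights = unitWeights(["search", "queri", "search"]);
console.log(Array.from(queryWeights.entries()));
```

Because every weight is 1.0, the ranking signal on the query side comes only from which tokens are present, not from how often they repeat.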

Count Tokens

const count = model.tokenCount('some text here');

Work with SparseEmbedding Directly

import { SparseEmbedding } from 'endee-model';

const embedding = SparseEmbedding.fromDict({ 100: 0.5, 200: 0.8, 300: 1.2 });

embedding.asDict();    // { 100: 0.5, 200: 0.8, 300: 1.2 }
embedding.asObject();  // { indices: Int32Array([100, 200, 300]), values: Float32Array([0.5, 0.8, 1.2]) }
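
The two representations carry the same data, which the following stand-alone sketch illustrates. `dictToArrays` and `arraysToDict` are hypothetical helpers written for this example and are not part of `endee-model`; the typed-array types mirror the ones shown above.

```typescript
// Hypothetical stand-alone helpers, not the library's code.
// Convert a { tokenId: weight } dict into parallel typed arrays.
function dictToArrays(dict: Record<number, number>): { indices: Int32Array; values: Float32Array } {
  const keys = Object.keys(dict); // non-negative integer keys come back in ascending order
  const indices = new Int32Array(keys.length);
  const values = new Float32Array(keys.length);
  keys.forEach((key, i) => {
    indices[i] = Number(key);
    values[i] = dict[Number(key)];
  });
  return { indices, values };
}

// Convert parallel arrays back into a { tokenId: weight } dict.
function arraysToDict(indices: Int32Array, values: Float32Array): Record<number, number> {
  const dict: Record<number, number> = {};
  for (let i = 0; i < indices.length; i++) {
    dict[indices[i]] = values[i];
  }
  return dict;
}

const arrays = dictToArrays({ 100: 0.5, 200: 0.8, 300: 1.2 });
const roundTrip = arraysToDict(arrays.indices, arrays.values);
console.log(arrays.indices, roundTrip);
```

In this sketch the values pass through a `Float32Array`, so a weight such as 0.8 comes back as the nearest 32-bit float rather than the exact double.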

Configuration

SparseModel / Bm25Options

| Option | Default | Description |
|--------|---------|-------------|
| k | 1.2 | BM25 saturation parameter |
| b | 0.75 | Length normalisation factor (0 = none, 1 = full) |
| avgLen | 256 | Expected average document length in tokens |
| language | "english" | Language for stopword removal and stemming |
| maxTokenLen | 40 | Tokens longer than this are discarded |
| disableStemmer | false | Skip stemming |
| cacheDir | undefined | Custom cache directory |

const model = new SparseModel('endee/bm25', {
  k: 1.5,
  b: 0.8,
  language: 'english',
});

Cache directory

Resolved in this order:

1. cacheDir option
2. ENDEE_CACHE_PATH environment variable
3. {os.tmpdir()}/endee_cache
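
The lookup order above could be expressed roughly as follows. This is a sketch of the described precedence, not the library's source: `resolveCacheDir` is a hypothetical name, and treating an empty string as "not set" is an assumption. The environment and temp directory are passed in as arguments to keep the function pure and easy to test.

```typescript
// Hypothetical sketch of the lookup order; not the library's source.
function resolveCacheDir(
  cacheDirOption: string | undefined,
  env: Record<string, string | undefined>,
  tmpDir: string,
): string {
  if (cacheDirOption) return cacheDirOption; // 1. explicit cacheDir option
  const fromEnvVar = env["ENDEE_CACHE_PATH"];
  if (fromEnvVar) return fromEnvVar;         // 2. ENDEE_CACHE_PATH environment variable
  return `${tmpDir}/endee_cache`;            // 3. fallback under the temp directory
}

const fromOption = resolveCacheDir("/data/cache", { ENDEE_CACHE_PATH: "/env/cache" }, "/tmp");
const fromEnv = resolveCacheDir(undefined, { ENDEE_CACHE_PATH: "/env/cache" }, "/tmp");
const fromTmp = resolveCacheDir(undefined, {}, "/tmp");
console.log(fromOption, fromEnv, fromTmp);
```

In a Node.js program the last two arguments would typically be `process.env` and `os.tmpdir()`.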

Available Languages

import { bm25Languages } from 'endee-model';

console.log(bm25Languages()); // ['afrikaans', 'arabic', 'english', ...]

How It Works

1. Tokenisation: text is split into word tokens (punctuation stripped)
2. Normalisation: stopwords are removed and oversized tokens discarded
3. Stemming: tokens are reduced to stems via the Porter stemmer (English default)
4. BM25 weights: term-frequency weights are computed:

   weight = tf * (k + 1) / (tf + k * (1 - b + b * (docLen / avgLen)))

5. Token IDs: each token is hashed to a stable integer via MurmurHash3
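
The tokenisation, normalisation and BM25 weighting steps above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: stemming and MurmurHash3 hashing are left out, the stopword list is a tiny stand-in, lowercasing and ASCII-only splitting are assumed, and `docLen` is taken to be the token count after filtering. All function names are hypothetical.

```typescript
// Illustrative only: a tiny stopword list, no stemming, no hashing.
const STOPWORDS = new Set(["the", "a", "an", "over", "to", "from"]);

// Split on non-alphanumerics, then drop empty, oversized and stopword tokens.
function tokenize(text: string, maxTokenLen = 40): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0 && t.length <= maxTokenLen && !STOPWORDS.has(t));
}

// The weight formula from the README, with its documented defaults for k and b.
function bm25Weight(tf: number, docLen: number, avgLen: number, k = 1.2, b = 0.75): number {
  return (tf * (k + 1)) / (tf + k * (1 - b + b * (docLen / avgLen)));
}

// Count term frequencies, then turn each count into a BM25 TF weight.
function tfWeights(text: string, avgLen = 256): Map<string, number> {
  const tokens = tokenize(text);
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  const weights = new Map<string, number>();
  counts.forEach((tf, token) => weights.set(token, bm25Weight(tf, tokens.length, avgLen)));
  return weights;
}

const foxWeights = tfWeights("The quick brown fox jumps over the lazy dog");
console.log(Array.from(foxWeights.entries()));
```

With the defaults and a document of exactly average length, a term that occurs once gets weight 1.0 and a term that occurs twice gets 1.375, which shows how `k` makes repeated terms saturate rather than scale linearly.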

Note: IDF weighting must be applied on the vector index side. This library outputs TF weights only.
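
To illustrate the note above, the sketch below shows how an index might combine this library's TF weights with IDF at query time. Everything here is an assumption made for illustration: the IDF variant is the common `ln(1 + (N - n + 0.5) / (n + 0.5))` form, and `idf`, `score` and the document-frequency map are hypothetical. A real vector database applies its own formula internally.

```typescript
// Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the
// number of indexed documents and n the number containing the token.
function idf(totalDocs: number, docsWithToken: number): number {
  return Math.log(1 + (totalDocs - docsWithToken + 0.5) / (docsWithToken + 0.5));
}

// Score = sum over query tokens of queryWeight * idf * documentTfWeight.
function score(
  queryVec: Map<number, number>,
  docVec: Map<number, number>,
  docFreq: Map<number, number>,
  totalDocs: number,
): number {
  let total = 0;
  queryVec.forEach((queryWeight, tokenId) => {
    const docWeight = docVec.get(tokenId);
    if (docWeight === undefined) return; // token absent from this document
    total += queryWeight * idf(totalDocs, docFreq.get(tokenId) ?? 0) * docWeight;
  });
  return total;
}

// Made-up token IDs and weights, purely for the example.
const queryVec = new Map([[100, 1.0], [200, 1.0]]);
const docVec = new Map([[100, 1.375], [300, 1.0]]);
const docFreq = new Map([[100, 1], [200, 2], [300, 3]]);
const idfScore = score(queryVec, docVec, docFreq, 3);
console.log(idfScore);
```

Only token 100 is shared between the query and the document, so it is the only term that contributes to the score.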

Requirements

Node.js >= 14

License

MIT
