@codemix/text-search

v0.0.2

Published

3 months ago

Fast, accurate text search with BM25 scoring and stemming.

0High
0Medium
0Low

charlespick

@codemix/text-search

Fast, accurate text search with BM25-inspired scoring, English stemming, and phrase-aware ranking. Zero dependencies.

Features

BM25-inspired scoring — term frequency saturation and document length normalization
Porter stemmer — matches "running" to "run", "connections" to "connect"
Stopword filtering — ignores noise words like "the", "and", "is"
Phrase bonuses — consecutive term matches score higher than scattered terms
Exact and prefix bonuses — boosts documents containing the literal query string
Position-aware scoring — terms appearing earlier in a document score slightly higher
Detailed results — inspect each scoring component for debugging or tuning
Scores clamped to [0, 1] — always safe to compare or threshold

Installation

npm install @codemix/text-search

Quick start

import { createMatcher, rankDocuments } from "@codemix/text-search";

// Score a single document against a query
const match = createMatcher("quick brown fox");
match("The quick brown fox jumps over the lazy dog"); // ~0.85
match("A slow gray elephant"); // 0.0

// Rank an array of strings by relevance
const results = rankDocuments("database connection", [
  "How to connect to a database",
  "Database connection pooling best practices",
  "Unrelated article about cooking",
]);
// results[0].document === "Database connection pooling best practices"
// results[0].score    === 0.73 (example)

API

`createMatcher(query, options?): MatcherFn`

Prepares a query and returns a scoring function. Creating the matcher does the expensive tokenization work once; the returned function is cheap to call repeatedly.

const match = createMatcher("error handling");

match("Always handle errors gracefully"); // > 0
match("Cooking recipes for beginners"); // 0

Returns (text: string) => number — a score in [0, 1].

`createDetailedMatcher(query, options?): (text: string) => MatchResult`

Same as createMatcher but returns a full breakdown of each scoring component. Useful for debugging or building custom ranking logic on top.

const match = createDetailedMatcher("quick brown fox");
const result = match("The quick brown fox jumps");
// {
//   score: 0.87,
//   termScore: 0.52,        // BM25 term frequency contribution
//   exactBonus: 0.15,       // literal substring found in document
//   prefixBonus: 0.0,       // document did not start with query
//   consecutiveBonus: 0.1,  // "quick brown" and "brown fox" appeared in order
//   positionScore: 0.05,    // terms appeared near the start
//   matchedTerms: 3,        // all 3 query terms matched
//   totalTerms: 3,
// }

`rankDocuments(query, documents, options?)`

Scores an array of strings and returns them sorted by relevance, highest first.

const results = rankDocuments("react hooks", [
  "React hooks introduction and useState examples",
  "Understanding React component lifecycle methods",
  "Getting started with React Native development",
]);
// [
//   { document: "React hooks introduction...", score: 0.71 },
//   { document: "Getting started with React Native...", score: 0.28 },
//   { document: "Understanding React component...", score: 0.21 },
// ]

`rankDocuments(query, config, documents, options?)`

Scores an array of objects by extracting text via a function, and returns a key from each object alongside its score.

interface Article {
  slug: string;
  title: string;
  content: string;
}

const articles: Article[] = [
  { slug: "intro-react", title: "Introduction to React", content: "Learn React basics" },
  { slug: "vue-guide", title: "Vue.js Guide", content: "Getting started with Vue" },
];

const results = rankDocuments(
  "react",
  { key: "slug", text: (a) => `${a.title} ${a.content}` },
  articles,
);
// [
//   { key: "intro-react", score: 0.55 },
//   { key: "vue-guide",   score: 0.0  },
// ]

`tokenize(text, options?): Token[]`

Splits text into tokens, optionally applying stemming and stopword removal. Each token carries its original form, stemmed form, and word-index position.

import { tokenize } from "@codemix/text-search";

tokenize("The quick brown fox");
// [
//   { original: "quick", stemmed: "quick", position: 1 },
//   { original: "brown", stemmed: "brown", position: 2 },
//   { original: "fox",   stemmed: "fox",   position: 3 },
// ]

tokenize("running dogs", { stem: true });
// [
//   { original: "running", stemmed: "run", position: 0 },
//   { original: "dogs",    stemmed: "dog", position: 1 },
// ]

`stem(word): string`

Applies the Porter stemming algorithm to a single lowercase word.

import { stem } from "@codemix/text-search";

stem("running"); // "run"
stem("connections"); // "connect"
stem("happily"); // "happili"

`extractTerms(text, options?): string[]`

Convenience wrapper around tokenize that returns just the stemmed term strings.

import { extractTerms } from "@codemix/text-search";

extractTerms("Database connection pooling");
// ["databas", "connect", "pool"]

`buildTermFrequency(tokens): Map<string, number>`

Builds a term → count map from a token array.

`buildPositionMap(tokens): Map<string, number[]>`

Builds a term → positions map from a token array.

`STOPWORDS`

A Set<string> of the English stopwords that are filtered by default.

Options

All main functions accept a shared options object.

Tokenization options (`TokenizeOptions`)

| Option | Type | Default | Description | | ----------------- | --------- | ------- | ------------------------------- | | stem | boolean | true | Apply Porter stemming | | removeStopwords | boolean | true | Filter common English stopwords | | minLength | number | 1 | Minimum token length to keep |

Matcher options (`MatcherOptions`)

Extends TokenizeOptions with BM25 tuning and bonus weights.

| Option | Type | Default | Description | | ------------------ | -------- | ------- | ----------------------------------------------------------------- | | k1 | number | 1.2 | BM25 term-frequency saturation. Higher = more weight on frequency | | b | number | 0.75 | BM25 length normalization. 0 = none, 1 = full | | exactMatchBonus | number | 0.15 | Added when query appears as a literal substring | | prefixMatchBonus | number | 0.1 | Added when document starts with query | | consecutiveBonus | number | 0.1 | Added proportionally for consecutive term matches | | positionWeight | number | 0.05 | Weight for position-based scoring |

Disabling all bonuses gives you a pure BM25-style score capped at ~0.6:

const match = createMatcher("error", {
  exactMatchBonus: 0,
  prefixMatchBonus: 0,
  consecutiveBonus: 0,
  positionWeight: 0,
});

How scoring works

The final score is the sum of five components, clamped to [0, 1]:

score = termScore + exactBonus + prefixBonus + consecutiveBonus + positionScore

| Component | Max contribution | Description | | ------------------ | ---------------- | --------------------------------------------------------------- | | termScore | 0.60 | BM25-style term frequency + query term coverage | | exactBonus | 0.15 | Literal query string found in document | | prefixBonus | 0.10 | Document starts with query string | | consecutiveBonus | 0.10 | Fraction of adjacent query term pairs that appear consecutively | | positionScore | 0.05 | Earlier term positions score higher |

Term score detail

Because there is no corpus to compute IDF from, the library uses a simplified BM25 formula:

For each unique stemmed query term, compute the saturated term frequency in the document using the standard BM25 TF formula with the configured k1/b parameters and an assumed average document length of 50 words.
Normalize the total by the number of query terms.
Apply a coverage multiplier: matching 3/3 terms scores higher than matching 1/3.
Scale to a maximum of 0.6 to leave headroom for bonuses.

Stemming

The library ships a complete Porter stemmer (Martin Porter, 1980). It handles all five steps of the algorithm:

Step 1a/b/c — plurals, -ed/-ing endings, -y → -i
Step 2 — suffix reductions with measure > 0 (-ational → -ate, etc.)
Step 3 — additional suffix reductions (-icate → -ic, etc.)
Step 4 — suffix removal with measure > 1
Step 5a/b — trailing -e and -ll removal

Words shorter than 3 characters are returned unchanged.

Note: the Porter stemmer is a rule-based algorithm and does not handle irregular forms. ran will not stem to run.

Performance

Benchmarks on 1000-word documents (Node.js, M-series Mac):

| Operation | Time | | -------------------------------------------- | --------- | | Tokenize 1 document × 100 | < 500 ms | | Score 1 document × 100 | < 500 ms | | Rank 100 × 1000-word documents | < 1000 ms | | 500 scoring operations (50 docs × 10 passes) | < 1000 ms |

The matcher compiles the query once — reuse the function returned by createMatcher when scoring many documents against the same query.

Examples

Search a documentation index

const docs = [
  "How to connect to a PostgreSQL database using Node.js",
  "Database connection pooling best practices",
  "Introduction to NoSQL databases",
  "Setting up a web server with Express",
  "Understanding database transactions and ACID properties",
];

const results = rankDocuments("database connection", docs);
// results[0].document === "Database connection pooling best practices"

Autocomplete / prefix search

const faq = [
  "How do I reset my password?",
  "What payment methods are accepted?",
  "How to cancel my subscription",
];

const match = createMatcher("cancel subscription");
faq.filter((q) => match(q) > 0.3);
// ["How to cancel my subscription"]

Search structured objects with a custom key

const users = [
  { id: 1, name: "John Doe", bio: "Software engineer at TechCorp" },
  { id: 2, name: "Jane Smith", bio: "Product manager with engineering background" },
  { id: 3, name: "Bob Wilson", bio: "Marketing specialist" },
];

const results = rankDocuments(
  "software engineer",
  { key: "id", text: (u) => `${u.name} ${u.bio}` },
  users,
);
// results[0].key === 1  (John Doe)

Debug scoring

const match = createDetailedMatcher("machine learning");
const result = match("Introduction to machine learning algorithms");

console.log(`Score: ${result.score}`);
console.log(`Terms matched: ${result.matchedTerms}/${result.totalTerms}`);
console.log(`Consecutive bonus: ${result.consecutiveBonus}`);

Disable stemming for exact keyword matching

const match = createMatcher("engineering", { stem: false });
// Only matches the literal word "engineering", not "engineer" or "engineered"

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@codemix/text-search

Features

Installation

Quick start

API

createMatcher(query, options?): MatcherFn

createDetailedMatcher(query, options?): (text: string) => MatchResult

rankDocuments(query, documents, options?)

rankDocuments(query, config, documents, options?)

tokenize(text, options?): Token[]

stem(word): string

extractTerms(text, options?): string[]

buildTermFrequency(tokens): Map<string, number>

buildPositionMap(tokens): Map<string, number[]>

STOPWORDS

Options

Tokenization options (TokenizeOptions)

Matcher options (MatcherOptions)

How scoring works

Term score detail

Stemming

Performance

Examples

Search a documentation index

Autocomplete / prefix search

Search structured objects with a custom key

Debug scoring

Disable stemming for exact keyword matching

License

`createMatcher(query, options?): MatcherFn`

`createDetailedMatcher(query, options?): (text: string) => MatchResult`

`rankDocuments(query, documents, options?)`

`rankDocuments(query, config, documents, options?)`

`tokenize(text, options?): Token[]`

`stem(word): string`

`extractTerms(text, options?): string[]`

`buildTermFrequency(tokens): Map<string, number>`

`buildPositionMap(tokens): Map<string, number[]>`

`STOPWORDS`

Tokenization options (`TokenizeOptions`)

Matcher options (`MatcherOptions`)