@zakkster/lite-search

v1.0.1

Published

18 days ago

Fuzzy search over a @zakkster/lite-store array with an index that maintains itself incrementally: trigram candidate generation, bounded zero-allocation Levenshtein ranking, weighted multi-field scoring. Zero runtime dependencies.

0High
0Medium
0Low

zakkster

search fuzzy fuzzy-search trigram levenshtein full-text incremental index reactive signal lite zero-dependency zero-gc

@zakkster/lite-search

Fuzzy search over a @zakkster/lite-store array, with an index that maintains itself incrementally instead of rebuilding from scratch.

Trigram index for fast substring candidate generation.
Bounded Levenshtein for ranking near-matches, with two reused Int32Array rows and early-exit: zero allocation per candidate.
Weighted multi-field scoring (boost the title over the body, and so on).
Optional stop-word filtering.
A derived, read-only view of your data. lite-search never mutates your array and exposes no add/remove/update of its own. You mutate the array; the index reconciles itself.

Why not Fuse.js?

Fuse rebuilds its entire index when the data changes. For a large collection that is expensive on every edit. lite-search keeps a live trigram index and, when the data changes, runs a reference-equality diff to find the items that actually changed and re-trigrams only those (and only their changed fields). Adding or removing an item is caught automatically; the rest of the index is untouched.

Install

npm install @zakkster/lite-search @zakkster/lite-store @zakkster/lite-signal

@zakkster/lite-store and @zakkster/lite-signal are peer dependencies.

Quick start

import { store } from "@zakkster/lite-store";
import { signal, effect } from "@zakkster/lite-signal";
import { createSearchIndex } from "@zakkster/lite-search";

const items = store([
    { id: 1, title: "Buy milk",            body: "whole milk from the store", tags: ["errand"] },
    { id: 2, title: "Milk frother review", body: "comparing frothers",        tags: ["kitchen"] },
    { id: 3, title: "Read a book",         body: "the milky way galaxy",       tags: ["leisure"] },
]);

const search = createSearchIndex(items, {
    fields: ["title", "body", "tags"],
    weights: { title: 3, body: 1, tags: 2 },
    fuzziness: 0.3,
    limit: 20,
});

// One-shot search (a plain array):
search.search("milk");        // -> [{ item: {id:1,...}, id:1, score: 4.0 }, ...]
search.search("frothr");      // typo -> still finds the frother (id 2)

// Reactive search: pass a query signal, get a results signal.
const query = signal("galaxy");
const results = search.query(query);
effect(() => render(results()));   // re-runs when the query OR the item set changes

items.push({ id: 4, title: "Galaxy poster", body: "", tags: [] }); // results() updates automatically
query.set("milk");                                                  // results() updates automatically

How it stays in sync

lite-search is a derived view of your array. It tracks changes through two channels.

flowchart LR
    A["your lite-store array"] -->|"add / remove<br/>(length changes)"| B["length signal fires"]
    A -.->|"in-place field edit<br/>(no length change)"| C["you call search.sync()"]
    B --> R["reconcile()"]
    C --> R
    R -->|"reference-equality diff"| D["re-trigram only<br/>changed items + fields"]
    D --> V["bump version signal"]
    V --> Q["query() results recompute"]

Adds and removes are automatic. lite-search subscribes to the lite-store length signal -- a single signal, so this scales to any collection size. Pushing, splicing, or popping items reconciles the index and refreshes any live query() results.

In-place property edits need a nudge. Editing an existing item's field (items[0].title = "...") does not change the array's length, so it is not caught automatically. Call search.sync() after such edits to reconcile and refresh live results. (A one-shot search.search(...) always reconciles first, so it always sees the latest data.)

Why length-tracking instead of watching every field?

lite-store fires a separate reactive signal per field, and those signals live in a fixed-capacity reactive pool. Subscribing to every field of a large collection would exhaust that pool. Length-tracking plus an explicit sync() is honest and pool-safe: it covers the overwhelming majority of mutations (adds and removes) automatically, and gives a clear escape hatch for in-place edits. It also keeps lite-search strictly unidirectional -- your array is the single source of truth, and the index only ever derives from it.

The query pipeline

flowchart TD
    Q["query string"] --> N["normalize<br/>(lowercase, strip diacritics)"]
    N --> T["tokenize + drop stop-words"]
    T --> G["rolling 3-grams"]
    G --> C["gather candidates<br/>from the trigram index"]
    C --> P["prune by trigram overlap<br/>(similarity threshold)"]
    P --> S["score each candidate"]
    S --> S1["exact / prefix / substring match"]
    S --> S2["bounded Levenshtein<br/>(within the fuzziness budget)"]
    S1 --> W["weight per field, sum"]
    S2 --> W
    W --> O["sort by score, take limit"]

The trigram index generates a small candidate set quickly; Levenshtein then ranks those candidates. Levenshtein is not a fallback that finds terms with no trigram overlap -- see the limitation below.

Incremental indexing in detail

Each item contributes its trigrams to a shared gram -> set of item ids index. Because two fields of the same item can produce the same gram, lite-search keeps a per-item refcount of each gram. Editing one field decrements that field's grams and increments the new ones; an item only leaves a gram's posting set when its refcount reaches zero. So editing the title of an item whose body still contains the term keeps the item correctly indexed under that term.

reconcile() walks the array once and compares each item (and each indexed field) by reference. New items are trigrammed and inserted, removed items are dropped, and a field whose value reference changed is re-trigrammed -- nothing else is touched.

Ranking

A candidate's score is summed across fields. Within a field, each query token takes its best match against the field's tokens; those per-token best scores are averaged over the query tokens to produce the per-field contribution, then multiplied by weights[field] (default 1) and added to the total.

Per query token's best match against a field token:

| match | score contribution | | --- | --- | | exact token | 1.0 | | query is a prefix of the token | 0.9 | | query is a substring of the token | 0.7 | | within the Levenshtein budget | 0.6 * (1 - distance / (maxDistance + 1)) |

maxDistance for a query token is round(fuzziness * token.length). With fuzziness: 0, only exact/prefix/substring matches count -- fuzzy matching is fully disabled.

The trigram limitation (read this)

Candidate generation is trigram-based, so a query only finds items that share some trigrams with it. This is the same behavior as PostgreSQL's pg_trgm:

Typos that preserve trigram overlap are found and ranked by Levenshtein -- frothr finds frother, galxy finds galaxy, kiten finds kitten.
A pathological transposition that destroys all trigram overlap is not found -- mlik shares no trigrams with milk, so it is never a candidate. Levenshtein never sees it, by design (scanning every item with Levenshtein would defeat the index).

If you need to match arbitrary transpositions of very short words, lite-search is not the right tool.

API

`createSearchIndex(items, options)`

items must be an array (or array-like with a numeric .length) -- a lite-store array, or a plain array. A SearchError with code "not_array" is thrown otherwise.

| option | type | default | description | | --- | --- | --- | --- | | fields | string[] | (required) | item fields to index; array fields are joined | | weights | Record<string, number> | 1 each | per-field score multipliers | | fuzziness | number | 0.3 | max Levenshtein ratio; 0 disables fuzzy matching | | limit | number | 20 | maximum results returned | | stopWords | Iterable<string> | none | words to drop from indexed text and queries | | identify | (item) => string \| number | object identity | stable id per item (required for primitive items) |

Returns a SearchIndex:

| member | description | | --- | --- | | query(signal) | pass a query signal, get a reactive computed results signal | | query(string) | pass a string, get a one-shot results array | | search(string) | imperative one-shot search (reconciles first) | | sync() | force a reconcile after in-place edits; returns whether anything changed | | size | number of indexed items | | stats() | { items, grams } | | dispose() | release the length subscription and internal signal; idempotent |

Each result is { item, id, score }, where item is the current object from your array and higher score is more relevant.

`dispose()` is a one-way door

After dispose(), every other method is a no-op: search() and query(string) return an empty array, query(signal) returns a stable accessor that yields the empty array, and sync() returns false. None of them will rebuild the index against the disposed internal signal. dispose() itself is idempotent.

`SearchError`

Thrown for programmer mistakes. error.code is "not_array" or "misconfigured".

`STOP_WORDS_EN`

A small default English stop-word set: createSearchIndex(items, { fields, stopWords: STOP_WORDS_EN }).

Performance characteristics

Query hits the trigram index, prunes by overlap, then ranks the bounded candidate set with allocation-free Levenshtein. The per-candidate path allocates nothing; the only query-path allocations are the query's own tokens/grams and the returned results array.
Add / remove is caught automatically but runs an O(N) reference scan to find what changed (the length signal says that something changed, not which item). It then re-trigrams only the changed item. For very large, very churny datasets, batch your mutations and call sync() once.
reconcile() is O(N) reference comparisons plus O(changed) trigram work.

The zero-allocation hot-path property is locked in by test/07-zero-gc.test.mjs (run via npm run test:gc, which exposes global.gc and asserts < 100 B/query retained across 20k+ iterations). Run the included benchmark with node --expose-gc bench.mjs.

Zero dependencies

No runtime dependencies. lite-store and lite-signal are peers you already have. Ships as a single ESM file plus its type definitions.

License

MIT (c) Zahary Shinikchiev

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@zakkster/lite-search

Why not Fuse.js?

Install

Quick start

How it stays in sync

Why length-tracking instead of watching every field?

The query pipeline

Incremental indexing in detail

Ranking

The trigram limitation (read this)

API

createSearchIndex(items, options)

dispose() is a one-way door

SearchError

STOP_WORDS_EN

Performance characteristics

Zero dependencies

License

`createSearchIndex(items, options)`

`dispose()` is a one-way door

`SearchError`

`STOP_WORDS_EN`