@zakkster/lite-search
v1.0.1
Published
Fuzzy search over a @zakkster/lite-store array with an index that maintains itself incrementally: trigram candidate generation, bounded zero-allocation Levenshtein ranking, weighted multi-field scoring. Zero runtime dependencies.
Maintainers
Readme
@zakkster/lite-search
Fuzzy search over a @zakkster/lite-store array, with an index that maintains itself incrementally instead of rebuilding from scratch.
- Trigram index for fast substring candidate generation.
- Bounded Levenshtein for ranking near-matches, with two reused
Int32Arrayrows and early-exit: zero allocation per candidate. - Weighted multi-field scoring (boost the title over the body, and so on).
- Optional stop-word filtering.
- A derived, read-only view of your data. lite-search never mutates your array and exposes no add/remove/update of its own. You mutate the array; the index reconciles itself.
Why not Fuse.js?
Fuse rebuilds its entire index when the data changes. For a large collection that is expensive on every edit. lite-search keeps a live trigram index and, when the data changes, runs a reference-equality diff to find the items that actually changed and re-trigrams only those (and only their changed fields). Adding or removing an item is caught automatically; the rest of the index is untouched.
Install
npm install @zakkster/lite-search @zakkster/lite-store @zakkster/lite-signal@zakkster/lite-store and @zakkster/lite-signal are peer dependencies.
Quick start
import { store } from "@zakkster/lite-store";
import { signal, effect } from "@zakkster/lite-signal";
import { createSearchIndex } from "@zakkster/lite-search";
const items = store([
{ id: 1, title: "Buy milk", body: "whole milk from the store", tags: ["errand"] },
{ id: 2, title: "Milk frother review", body: "comparing frothers", tags: ["kitchen"] },
{ id: 3, title: "Read a book", body: "the milky way galaxy", tags: ["leisure"] },
]);
const search = createSearchIndex(items, {
fields: ["title", "body", "tags"],
weights: { title: 3, body: 1, tags: 2 },
fuzziness: 0.3,
limit: 20,
});
// One-shot search (a plain array):
search.search("milk"); // -> [{ item: {id:1,...}, id:1, score: 4.0 }, ...]
search.search("frothr"); // typo -> still finds the frother (id 2)
// Reactive search: pass a query signal, get a results signal.
const query = signal("galaxy");
const results = search.query(query);
effect(() => render(results())); // re-runs when the query OR the item set changes
items.push({ id: 4, title: "Galaxy poster", body: "", tags: [] }); // results() updates automatically
query.set("milk"); // results() updates automaticallyHow it stays in sync
lite-search is a derived view of your array. It tracks changes through two channels.
flowchart LR
A["your lite-store array"] -->|"add / remove<br/>(length changes)"| B["length signal fires"]
A -.->|"in-place field edit<br/>(no length change)"| C["you call search.sync()"]
B --> R["reconcile()"]
C --> R
R -->|"reference-equality diff"| D["re-trigram only<br/>changed items + fields"]
D --> V["bump version signal"]
V --> Q["query() results recompute"]Adds and removes are automatic. lite-search subscribes to the lite-store length signal -- a single signal, so this scales to any collection size. Pushing, splicing, or popping items reconciles the index and refreshes any live query() results.
In-place property edits need a nudge. Editing an existing item's field (items[0].title = "...") does not change the array's length, so it is not caught automatically. Call search.sync() after such edits to reconcile and refresh live results. (A one-shot search.search(...) always reconciles first, so it always sees the latest data.)
Why length-tracking instead of watching every field?
lite-store fires a separate reactive signal per field, and those signals live in a fixed-capacity reactive pool. Subscribing to every field of a large collection would exhaust that pool. Length-tracking plus an explicit sync() is honest and pool-safe: it covers the overwhelming majority of mutations (adds and removes) automatically, and gives a clear escape hatch for in-place edits. It also keeps lite-search strictly unidirectional -- your array is the single source of truth, and the index only ever derives from it.
The query pipeline
flowchart TD
Q["query string"] --> N["normalize<br/>(lowercase, strip diacritics)"]
N --> T["tokenize + drop stop-words"]
T --> G["rolling 3-grams"]
G --> C["gather candidates<br/>from the trigram index"]
C --> P["prune by trigram overlap<br/>(similarity threshold)"]
P --> S["score each candidate"]
S --> S1["exact / prefix / substring match"]
S --> S2["bounded Levenshtein<br/>(within the fuzziness budget)"]
S1 --> W["weight per field, sum"]
S2 --> W
W --> O["sort by score, take limit"]The trigram index generates a small candidate set quickly; Levenshtein then ranks those candidates. Levenshtein is not a fallback that finds terms with no trigram overlap -- see the limitation below.
Incremental indexing in detail
Each item contributes its trigrams to a shared gram -> set of item ids index. Because two fields of the same item can produce the same gram, lite-search keeps a per-item refcount of each gram. Editing one field decrements that field's grams and increments the new ones; an item only leaves a gram's posting set when its refcount reaches zero. So editing the title of an item whose body still contains the term keeps the item correctly indexed under that term.
reconcile() walks the array once and compares each item (and each indexed field) by reference. New items are trigrammed and inserted, removed items are dropped, and a field whose value reference changed is re-trigrammed -- nothing else is touched.
Ranking
A candidate's score is summed across fields. Within a field, each query token takes its best match against the field's tokens; those per-token best scores are averaged over the query tokens to produce the per-field contribution, then multiplied by weights[field] (default 1) and added to the total.
Per query token's best match against a field token:
| match | score contribution |
| --- | --- |
| exact token | 1.0 |
| query is a prefix of the token | 0.9 |
| query is a substring of the token | 0.7 |
| within the Levenshtein budget | 0.6 * (1 - distance / (maxDistance + 1)) |
maxDistance for a query token is round(fuzziness * token.length). With fuzziness: 0, only exact/prefix/substring matches count -- fuzzy matching is fully disabled.
The trigram limitation (read this)
Candidate generation is trigram-based, so a query only finds items that share some trigrams with it. This is the same behavior as PostgreSQL's pg_trgm:
- Typos that preserve trigram overlap are found and ranked by Levenshtein --
frothrfindsfrother,galxyfindsgalaxy,kitenfindskitten. - A pathological transposition that destroys all trigram overlap is not found --
mlikshares no trigrams withmilk, so it is never a candidate. Levenshtein never sees it, by design (scanning every item with Levenshtein would defeat the index).
If you need to match arbitrary transpositions of very short words, lite-search is not the right tool.
API
createSearchIndex(items, options)
items must be an array (or array-like with a numeric .length) -- a lite-store array, or a plain array. A SearchError with code "not_array" is thrown otherwise.
| option | type | default | description |
| --- | --- | --- | --- |
| fields | string[] | (required) | item fields to index; array fields are joined |
| weights | Record<string, number> | 1 each | per-field score multipliers |
| fuzziness | number | 0.3 | max Levenshtein ratio; 0 disables fuzzy matching |
| limit | number | 20 | maximum results returned |
| stopWords | Iterable<string> | none | words to drop from indexed text and queries |
| identify | (item) => string \| number | object identity | stable id per item (required for primitive items) |
Returns a SearchIndex:
| member | description |
| --- | --- |
| query(signal) | pass a query signal, get a reactive computed results signal |
| query(string) | pass a string, get a one-shot results array |
| search(string) | imperative one-shot search (reconciles first) |
| sync() | force a reconcile after in-place edits; returns whether anything changed |
| size | number of indexed items |
| stats() | { items, grams } |
| dispose() | release the length subscription and internal signal; idempotent |
Each result is { item, id, score }, where item is the current object from your array and higher score is more relevant.
dispose() is a one-way door
After dispose(), every other method is a no-op: search() and query(string) return an empty array, query(signal) returns a stable accessor that yields the empty array, and sync() returns false. None of them will rebuild the index against the disposed internal signal. dispose() itself is idempotent.
SearchError
Thrown for programmer mistakes. error.code is "not_array" or "misconfigured".
STOP_WORDS_EN
A small default English stop-word set: createSearchIndex(items, { fields, stopWords: STOP_WORDS_EN }).
Performance characteristics
- Query hits the trigram index, prunes by overlap, then ranks the bounded candidate set with allocation-free Levenshtein. The per-candidate path allocates nothing; the only query-path allocations are the query's own tokens/grams and the returned results array.
- Add / remove is caught automatically but runs an
O(N)reference scan to find what changed (the length signal says that something changed, not which item). It then re-trigrams only the changed item. For very large, very churny datasets, batch your mutations and callsync()once. reconcile()isO(N)reference comparisons plusO(changed)trigram work.
The zero-allocation hot-path property is locked in by test/07-zero-gc.test.mjs (run via npm run test:gc, which exposes global.gc and asserts < 100 B/query retained across 20k+ iterations). Run the included benchmark with node --expose-gc bench.mjs.
Zero dependencies
No runtime dependencies. lite-store and lite-signal are peers you already have. Ships as a single ESM file plus its type definitions.
License
MIT (c) Zahary Shinikchiev
