@yoch/minisearch

v8.1.0

Published

4 hours ago

Node.js full-text search with FrozenMiniSearch and binary index snapshots

yoch

search full text fuzzy prefix auto suggest index frozen binary node

@yoch/minisearch

In-memory full-text search for Node.js — a fork of MiniSearch by Luca Ongaro, extended for production serving: smaller indexes, faster loads, and a read-only fast path.

Current release: 8.1.0 · install with npm install @yoch/minisearch

Why this fork?

MiniSearch is excellent for building and querying an index in JavaScript. This fork keeps that API for mutable indexing, and adds FrozenMiniSearch for when the index is built once and queried many times:

| | Mutable MiniSearch | FrozenMiniSearch |
|---|---------------------|-------------------|
| Use when | Documents change (add, remove, discard) | Corpus is fixed, or you reload from disk |
| Memory | Maps and nested objects per posting | Flat Uint32Array / Uint8Array postings |
| On disk | toJSON / loadJSON | saveBinary / loadBinary (MSv4 / MSv3) |
| Typical search | Baseline | Often ~20–35% faster p50 on the same corpus (see benchmarks) |
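To make the Memory row concrete, here is a minimal sketch of how nested Map postings can be flattened into typed arrays. The layout (terms, offsets, docIds, freqs) and both function names are assumptions made for illustration; they are not this package's internal format.

```javascript
// Hypothetical layout: postings for term i occupy docIds/freqs[offsets[i] .. offsets[i + 1]).
function flattenPostings (postingsByTerm) {
  // postingsByTerm: Map<term, Map<docId, termFrequency>>
  const terms = [...postingsByTerm.keys()].sort()
  let total = 0
  for (const term of terms) total += postingsByTerm.get(term).size

  const offsets = new Uint32Array(terms.length + 1)
  const docIds = new Uint32Array(total)
  const freqs = new Uint8Array(total)

  let cursor = 0
  terms.forEach((term, i) => {
    offsets[i] = cursor
    // Sort each posting list by doc id so lookups see a stable order
    const entries = [...postingsByTerm.get(term)].sort((a, b) => a[0] - b[0])
    for (const [docId, tf] of entries) {
      docIds[cursor] = docId
      freqs[cursor] = Math.min(tf, 255) // saturate rather than wrap around in a Uint8
      cursor++
    }
  })
  offsets[terms.length] = cursor
  return { terms, offsets, docIds, freqs }
}

function postingsFor (flat, term) {
  const i = flat.terms.indexOf(term) // linear scan keeps the sketch short
  if (i === -1) return []
  const out = []
  for (let p = flat.offsets[i]; p < flat.offsets[i + 1]; p++) {
    out.push({ docId: flat.docIds[p], tf: flat.freqs[p] })
  }
  return out
}

const nested = new Map([
  ['whale', new Map([[0, 3], [2, 1]])],
  ['zen', new Map([[1, 300]])]
])
const flat = flattenPostings(nested)
console.log(postingsFor(flat, 'whale'))
```

Three flat arrays replace one Map per term, which is where the memory saving in a layout like this would come from.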

BM25 scoring, prefix and fuzzy search, autoSuggest, and query combinators are the same in both. Frozen indexes aim for search ranking parity with addAll + freeze() when built with the same options. Term frequencies are stored as Uint8, so each count is capped at 255 (per term, per document field); extreme repetition can cause a small score drift versus the mutable index.
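A rough feel for the size of that drift, as a sketch: the function below is the textbook BM25 term-frequency component, and the k1 and b values are assumed for illustration rather than read from this package. It compares the weight for a raw count of 300 against the same count capped at 255.

```javascript
// Textbook BM25 term-frequency component. k1 and b are assumed values.
function bm25TermWeight (tf, fieldLength, avgFieldLength, k1 = 1.2, b = 0.7) {
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength)
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm)
}

const UINT8_MAX = 255
const clampTf = tf => Math.min(tf, UINT8_MAX)

const rawTf = 300 // a term repeated 300 times in one field
const exact = bm25TermWeight(rawTf, 1000, 1000)
const capped = bm25TermWeight(clampTf(rawTf), 1000, 1000)
const relativeDrift = (exact - capped) / exact

console.log({ exact, capped, relativeDrift })
```

Because BM25 saturates as the count grows, capping it at 255 moves the weight by well under one percent in this example.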

Quick start

npm install @yoch/minisearch
# pre-releases: npm install @yoch/minisearch@beta

One-shot frozen index (no mutable step):

import { FrozenMiniSearch } from '@yoch/minisearch'

const options = { fields: ['title', 'text'], storeFields: ['title'] }

const index = FrozenMiniSearch.fromDocuments(documents, options)
index.search('ishmael', { prefix: true })
index.autoSuggest('zen')

// Persist and reload
const buf = index.saveBinary()
const loaded = FrozenMiniSearch.loadBinary(buf, options)

Mutable index, then freeze (incremental build):

import MiniSearch, { FrozenMiniSearch } from '@yoch/minisearch'

const ms = new MiniSearch({ fields: ['title', 'text'] })
ms.addAll(documents)

const frozen = ms.freeze()   // immutable snapshot
const buf = frozen.saveBinary()

// ESM
import MiniSearch, { FrozenMiniSearch, buildFrozenFromDocuments } from '@yoch/minisearch'

// CommonJS
const MiniSearch = require('@yoch/minisearch')
const { FrozenMiniSearch } = require('@yoch/minisearch')

Pick the right API

| Goal | API |
|------|-----|
| Live index that changes over time | MiniSearch → freeze() when you need read-only serving |
| Fixed corpus, build frozen directly | FrozenMiniSearch.fromDocuments(documents, options) |
| Build doc-by-doc (no documents[] buffer) | createFrozenIndexBuilder(options) → .add(doc) → freezeFrozenIndexBuilder(builder) |
| Async stream of documents | FrozenMiniSearch.fromAsyncIterable(iterable, options) |
| Load a snapshot from disk | FrozenMiniSearch.loadBinary(buffer, options) |
| Custom assembly pipeline | buildFrozenFromDocuments, assembleFrozen, freezeFromMiniSearch |

fromDocuments produces the same search ranking as a mutable index created with new MiniSearch(opts), filled with addAll(docs), and then frozen with freeze(), given the same corpus and options (fields, tokenize, processTerm, …). Frozen indexes do not support add / remove.

External corpus (e.g. lookup by id after search): keep full rows in your own store (dataCache, DB, etc.) and use minimal storeFields (often ['id'] only) so the frozen index does not duplicate payload text:

import { createFrozenIndexBuilder, freezeFrozenIndexBuilder } from '@yoch/minisearch'

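// buildIndexDocument is not defined here: it stands for your own mapping
// from a stored row to the document shape described by options.fields.
function buildFrozenIndexFromRows (rows, options) {
  // The hint pre-sizes per-document arrays; it does not have to be exact
  const builder = createFrozenIndexBuilder(options, {
    estimatedDocumentCount: rows.length
  })
  for (let i = 0; i < rows.length; i++) {
    builder.add(buildIndexDocument(rows[i], i))
  }
  return freezeFrozenIndexBuilder(builder)
}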

// After search: enrich from your store — frozen.getStoredFields(res.id) or dataCache[type][res.id]
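The enrichment step can be sketched without the library. Here rowsById stands in for your own store and results mimics search hits shaped as { id, score }; both are hypothetical stand-ins, as is the enrichResults helper.

```javascript
// Hypothetical store keyed by document id (in practice: a cache, a DB, etc.)
const rowsById = new Map([
  [1, { id: 1, title: 'Moby Dick', author: 'Herman Melville' }],
  [2, { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', author: 'Robert M. Pirsig' }]
])

// Join ranked hits with full rows, keeping the ranking order of the hits
function enrichResults (results, store) {
  const enriched = []
  for (const hit of results) {
    const row = store.get(hit.id)
    if (row === undefined) continue // row no longer in the store: skip the hit
    enriched.push({ ...row, score: hit.score })
  }
  return enriched
}

// Stand-in for the array returned by a search call
const results = [{ id: 2, score: 4.2 }, { id: 99, score: 1.1 }, { id: 1, score: 0.7 }]
const page = enrichResults(results, rowsById)
console.log(page.map(r => r.title))
```

Keeping only ids in the index and joining afterwards means payload text lives in one place.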

Async stream (no intermediate array; documents are indexed as they arrive):

import { createReadStream } from 'node:fs'
import { parse } from 'csv-parse'
import { FrozenMiniSearch } from '@yoch/minisearch'

async function buildFromCsv (path, options) {
  async function * documents () {
    const parser = createReadStream(path).pipe(parse({ columns: true }))
    for await (const row of parser) {
      yield { id: row.cis, denomination: row.denomination, /* … */ }
    }
  }
  return FrozenMiniSearch.fromAsyncIterable(documents(), options)
}

For a sync iterable (for...of on an array or generator), use the builder directly:

import { createFrozenIndexBuilder, freezeFrozenIndexBuilder } from '@yoch/minisearch'

const builder = createFrozenIndexBuilder(options)
for (const doc of documentGenerator()) {
  builder.add(doc)
}
const frozen = freezeFrozenIndexBuilder(builder)

estimatedDocumentCount in the second argument to createFrozenIndexBuilder pre-allocates per-document arrays when the final size is known; internal buffers are trimmed to the actual count on freeze if the hint was too large.
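The general pattern behind such a hint (pre-size, grow if the hint was too small, trim on finalize) can be sketched as follows. FieldLengthBuffer is a hypothetical class written for this illustration and says nothing about how the builder is actually implemented.

```javascript
// Hypothetical buffer: the hint sizes the array up front, freeze() trims it to the real count.
class FieldLengthBuffer {
  constructor (estimatedDocumentCount = 16) {
    this.lengths = new Uint32Array(Math.max(1, estimatedDocumentCount))
    this.count = 0
  }

  add (fieldLength) {
    if (this.count === this.lengths.length) {
      // Hint was too small: double the capacity and copy the existing values
      const grown = new Uint32Array(this.lengths.length * 2)
      grown.set(this.lengths)
      this.lengths = grown
    }
    this.lengths[this.count++] = fieldLength
  }

  freeze () {
    // slice() copies into a right-sized array, so the oversized one can be collected
    return this.lengths.slice(0, this.count)
  }
}

const buffer = new FieldLengthBuffer(1000) // hint larger than needed
for (const len of [12, 7, 31]) buffer.add(len)
const frozenLengths = buffer.freeze()
console.log(frozenLengths.length)
```

An accurate hint avoids the grow-and-copy path entirely; an oversized one only costs temporary memory until the trim.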

FrozenMiniSearch in a bit more detail

freeze(): snapshots a mutable index into compact typed postings plus a radix tree keyed by term index.
fromDocuments(): builds that structure in one pass (skips nested Map postings and radix cloning at freeze time).
createFrozenIndexBuilder(): produces the same output without a temporary documents[] array; finalize with freezeFrozenIndexBuilder(builder) (or assembleFrozen(builder.freezeParams()) for custom assembly).
fromAsyncIterable(): turns an async document stream (e.g. a CSV parser) into a frozen index; equivalent to builder + for await + freezeFrozenIndexBuilder.
saveBinary() / loadBinary(): snapshots use MSv4 (sparse multi-field, Uint16 doc ids when possible) or MSv3 (single-field dense, Uint32 doc ids). MSv1/MSv2 are not supported; re-save older snapshots. Field names are stored in the snapshot, so the fields option of loadBinary is optional (if provided, it must match exactly). Custom tokenize / processTerm functions are not stored; pass the same functions at load time if you customized them. storeFields data is embedded in the snapshot.
Term frequencies: stored as Uint8 (capped at 255 per term, per document field); this only affects scores under extreme term repetition.
frozenMemoryBreakdown(): reports the postings, radix tree, and stored-field footprint (estimates only; not exact heap accounting).
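The "Uint16 doc ids when possible" idea can be sketched as a width choice made at allocation time. The rule used here (every internal doc id fits in 16 bits) and the allocateDocIds helper are assumptions for illustration; the package's actual condition is not spelled out above.

```javascript
// Assumption: "when possible" means the largest internal doc id fits in 16 bits.
function allocateDocIds (documentCount, postingCount) {
  const maxDocId = documentCount - 1
  return maxDocId <= 0xFFFF
    ? new Uint16Array(postingCount)
    : new Uint32Array(postingCount)
}

const small = allocateDocIds(50000, 200000)
const large = allocateDocIds(70000, 200000)

// Bytes used per stored doc id in each case
console.log(small.BYTES_PER_ELEMENT, large.BYTES_PER_ELEMENT)
console.log(small.byteLength, large.byteLength)
```

For corpora under that threshold, the narrower array halves the bytes spent on doc ids.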

Mutable index → frozen: prefer freezing a corpus that is no longer changing. If you used discard() on a MiniSearch index, run vacuum() before freeze() to shrink the snapshot. Search parity is still expected without vacuum, but the binary may retain sparse slots.

Advanced API (assembleFrozen, freezeFromMiniSearch, FrozenIndexBuilder) is for custom pipelines — most apps should use fromDocuments, freeze(), or the builder helpers above.

Advanced exports:

import {
  FrozenMiniSearch,
  createFrozenIndexBuilder,
  freezeFrozenIndexBuilder,
  FrozenIndexBuilder,
  type FrozenIndexBuilderHints,
  buildFrozenFromDocuments,
  assembleFrozen,
  freezeFromMiniSearch,
  frozenMemoryBreakdown
} from '@yoch/minisearch'

MiniSearch (mutable)

Full upstream-style API: field boosts, fuzzy/prefix, nested queries, AND / OR / AND_NOT, filters, autoSuggest, vacuum after discard, etc.

import MiniSearch from '@yoch/minisearch'

const miniSearch = new MiniSearch({ fields: ['title', 'text'] })
miniSearch.addAll(documents)
miniSearch.search('zen art motorcycle')

TypeScript definitions: dist/es/index.d.ts.

FrozenMiniSearch — optimizations

Already in MSv3 / MSv4 (8.0.0+)

| Area | Change | Effect |
|------|--------|--------|
| Format | MSv3 replaces MSv1/MSv2 (breaking) | CRC32 payload check; binary field names, ids, stored fields, term tree |
| Binary load | Structural validation in decodeFrozenSnapshot / validateFrozenSnapshot | Corrupt snapshots fail fast with Invalid frozen index: … |
| loadBinary | fields optional (embedded in snapshot); if provided, must match exactly | Simpler reload; no silent field subset |
| saveBinary | Single pre-allocated buffer | Lower peak memory while serializing |
| Search | Per-query cache for fieldTermDataFor(termIndex) | Fewer allocations on prefix/fuzzy queries |
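For readers unfamiliar with checksum-guarded payloads, here is a minimal sketch of the technique. The CRC-32 routine is the standard table-driven one; the framing (a 4-byte little-endian checksum followed by the payload) and the frame / unframe names are hypothetical and do not describe the real MSv3 / MSv4 layout.

```javascript
// Table-driven CRC-32 (reflected polynomial 0xEDB88320, as used by zip and PNG)
const CRC_TABLE = new Uint32Array(256)
for (let n = 0; n < 256; n++) {
  let c = n
  for (let k = 0; k < 8; k++) c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1)
  CRC_TABLE[n] = c >>> 0
}

function crc32 (bytes) {
  let crc = 0xFFFFFFFF
  for (let i = 0; i < bytes.length; i++) {
    crc = CRC_TABLE[(crc ^ bytes[i]) & 0xFF] ^ (crc >>> 8)
  }
  return (crc ^ 0xFFFFFFFF) >>> 0
}

// Hypothetical framing: 4-byte little-endian CRC, then the payload bytes
function frame (payload) {
  const out = new Uint8Array(4 + payload.length)
  new DataView(out.buffer).setUint32(0, crc32(payload), true)
  out.set(payload, 4)
  return out
}

function unframe (framed) {
  const view = new DataView(framed.buffer, framed.byteOffset, framed.byteLength)
  const payload = framed.subarray(4)
  if (view.getUint32(0, true) !== crc32(payload)) {
    throw new Error('checksum mismatch') // fail fast instead of decoding garbage
  }
  return payload
}

const snapshot = frame(new TextEncoder().encode('123456789'))
console.log(new TextDecoder().decode(unframe(snapshot)))
```

A single flipped bit in the payload changes the checksum, so the load fails before any structural decoding starts.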

Measure regressions with benchmarks/ (freezeMs, saveBinary, loadBinary, search p50, heap frozen).

Suggested follow-ups (not implemented yet)

| Area | Topic | Idea | Trade-off |
|------|-------|------|-----------|
| Format | Term dictionary | Drop runtime _terms[] duplicate at rest | Saves heap; more complex save path |
| API | loadBinaryAsync | Chunked/async load like loadJSONAsync | Better cold start on huge indexes |
| API | Input types | Accept Uint8Array as well as Buffer on loadBinary | Broader runtime support |
| Build | freeze / builder | One-pass posting flatten with size estimate | Faster freeze on very large corpora |
| Search | Wildcard | Iterate only active document slots after dense remap | Faster wildcard after many discards |
| Search | Hot path | Direct subarray posting access in aggregateTerm | Lower GC; invasive |
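Until the Input types idea lands, a caller holding a Uint8Array can wrap it as a Buffer over the same memory before loading. This is a sketch for user code, not part of the package; toBuffer is a hypothetical helper built on Node's Buffer.from(arrayBuffer, byteOffset, length).

```javascript
// Wrap a Uint8Array as a Buffer that shares its memory (no copy)
function toBuffer (input) {
  if (Buffer.isBuffer(input)) return input
  if (input instanceof Uint8Array) {
    // Honour byteOffset / byteLength so views into a larger ArrayBuffer stay correct
    return Buffer.from(input.buffer, input.byteOffset, input.byteLength)
  }
  throw new TypeError('Expected a Buffer or Uint8Array')
}

const backing = new Uint8Array([0, 1, 2, 3, 4, 5])
const view = backing.subarray(2, 5) // a view with a non-zero byteOffset
const buf = toBuffer(view)
console.log(buf.length, buf[0])
```

Passing the offset and length matters: Buffer.from(view.buffer) alone would expose the whole backing ArrayBuffer, not just the view.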

Intentionally deferred: embedding tokenize / processTerm in the snapshot. Raising the Uint8 term-frequency cap needs a new postings encoding.

For contributor-oriented notes, see DESIGN_DOCUMENT.md — FrozenMiniSearch.

Benchmarks

Reproducible comparisons (heap, load time, search latency) live under benchmarks/:

npm run benchmark:compare    # terminal report
npm run benchmark:diff       # vs versioned baseline
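For context on the p50 figures, this is one common way to compute a median latency from timed runs. It is a generic sketch using the nearest-rank method; the scripts under benchmarks/ may measure differently, and runQuery is a placeholder workload rather than a real search call.

```javascript
// Nearest-rank percentile over latency samples (milliseconds)
function percentile (samples, p) {
  if (samples.length === 0) throw new RangeError('no samples')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length))
  return sorted[rank - 1]
}

// Time `fn` repeatedly and collect one sample per run
function timeRuns (fn, runs) {
  const samples = []
  for (let i = 0; i < runs; i++) {
    const start = performance.now()
    fn()
    samples.push(performance.now() - start)
  }
  return samples
}

// Placeholder workload; swap in a real search call when measuring an index
const runQuery = () => {
  let sum = 0
  for (let i = 0; i < 10000; i++) sum += i
  return sum
}

const p50 = percentile(timeRuns(runQuery, 200), 50)
console.log(`p50: ${p50.toFixed(4)} ms`)
```

The median is less sensitive than the mean to occasional GC pauses, which is why p50 is a common headline number for search latency.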

Development

npm install
npm test
npm run build

Run scripts with npm run. Yarn 1.x on Node 22 prints url.parse deprecation warnings when invoking yarn test / yarn build.

Publish stable (updates npm latest):

npm run release:stable

Publish a pre-release (dist-tag beta only):

npm run release:beta

Requirements: a Node.js version with ES2018 support. This fork ships Node-only ESM and CJS builds; there is no browser UMD/CDN build.

Changelog & credits

See CHANGELOG.md.

MiniSearch — Luca Ongaro (MIT)
This fork — yoch/minisearch: FrozenMiniSearch, MSv4/MSv3 binary snapshots, shared scoring refactor

Upstream docs: MiniSearch site · intro article
