@ccorgz/caiss

v1.0.5

Published

2 months ago

Custom Adaptive Indexed Similarity Search — a fast, rule-driven, phonetic-aware in-memory search engine inspired by FAISS. Built for precision over raw vector similarity.


ccorgz

CAISS

Custom Adaptive Indexed Similarity Search

A fast, rule-driven, phonetic-aware in-memory search engine — inspired by Facebook FAISS but built on inverted indexes and orthographic/phonetic fingerprints instead of floating-point vector embeddings.

Why CAISS?

Traditional vector search (FAISS, Pinecone, etc.) works well for semantic similarity over embeddings, but it requires a model and a GPU/CPU budget, and it loses precision on short, abbreviation-heavy, or domain-specific text.

CAISS was designed to solve a concrete problem: matching product names from purchase orders against a catalog, where:

Abbreviations are common (MECH.KEYB → MECHANICAL KEYBOARD)
Typos happen (HEADFONE vs HEADPHONE)
Numbers matter (200ML ≠ 500ML)
Speed is critical (< 2 ms per query on 4 000-item catalogs)
No GPU / embedding model is available or needed

The result is a pure TypeScript engine that pre-computes four phonetic/orthographic fingerprints per token at index time and answers a typical query in about 1–2 ms on a 4 000-item catalog (see Performance).
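As a rough illustration of what "four fingerprints per token" can look like, here is a minimal, self-contained sketch. It is not the library's source: the names `TokenFingerprint`, `normalizeToken`, `soundex`, and `fingerprint` are hypothetical, and the Soundex shown is the standard American variant, which may differ from the Soundex-derived code CAISS computes internally.

```typescript
// Hypothetical names throughout: an illustration, not CAISS source code.
interface TokenFingerprint {
  normalized: string; // upper-cased, accent-stripped token
  soundex: string;    // 4-character American Soundex code
  prefix3: string;    // first three characters
  digits: string;     // digits only, in their original order
}

const SOUNDEX_GROUPS: Array<[string, string]> = [
  ['BFPV', '1'], ['CGJKQSXZ', '2'], ['DT', '3'], ['L', '4'], ['MN', '5'], ['R', '6'],
];

function soundexDigit(letter: string): string {
  for (const [letters, digit] of SOUNDEX_GROUPS) {
    if (letters.indexOf(letter) !== -1) return digit;
  }
  return ''; // vowels, H, W and Y carry no code
}

function normalizeToken(raw: string): string {
  return raw
    .normalize('NFD')                 // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .toUpperCase()
    .replace(/[^A-Z0-9]/g, '');
}

function soundex(word: string): string {
  const letters = word.replace(/[^A-Z]/g, '');
  if (letters.length === 0) return '';
  let code = letters.charAt(0);
  let previous = soundexDigit(letters.charAt(0));
  for (let i = 1; i < letters.length && code.length < 4; i++) {
    const letter = letters.charAt(i);
    const digit = soundexDigit(letter);
    if (digit !== '' && digit !== previous) code += digit;
    // H and W do not separate letters with the same code; vowels do.
    if (letter !== 'H' && letter !== 'W') previous = digit;
  }
  return (code + '000').slice(0, 4);
}

function fingerprint(rawToken: string): TokenFingerprint {
  const normalized = normalizeToken(rawToken);
  return {
    normalized,
    soundex: soundex(normalized),
    prefix3: normalized.slice(0, 3),
    digits: normalized.replace(/\D/g, ''),
  };
}

const adoc = fingerprint('adoc');
// { normalized: 'ADOC', soundex: 'A320', prefix3: 'ADO', digits: '' }
const size = fingerprint('200ml');
// { normalized: '200ML', soundex: 'M400', prefix3: '200', digits: '200' }
```

Because all four values are computed once per token at index time, a query only has to fingerprint its own tokens and look them up.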

How It Works

TRAIN (once)                          SEARCH (every query)
─────────────────────────────────     ─────────────────────────────────────────
For each item in your array:          1. Tokenise & fingerprint the query
  1. Concatenate the chosen fields    2. Hit-count pruning via global inverted
  2. Normalise + expand abbreviations    indexes → top-N candidate documents
  3. Split into tokens                3. 5-pass scoring kernel per candidate:
  4. Compute 4 fingerprints:              Pass 1 – exact normalised match   1.00
       normalized  →  "MECHANICAL"       Pass 2 – prefix match             0.95
       soundex     →  "A320"             Pass 3 – substring / concat       0.70–1.0
       prefix3     →  "ADO"              Pass 4 – substring equality       0.30–0.50
       digits      →  "200"              Pass 5 – phonetic (Soundex)       0.19–0.90
  5. Push into global inverted        4. Final percentage + extra-word penalty
     indexes + per-doc local maps     5. Sort & return top-K results
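The hit-count pruning in search step 2 can be pictured with a small sketch. This is an assumption-laden simplification: it uses a single inverted index keyed by the 3-character prefix, whereas the diagram describes several global indexes, and `buildPrefixIndex` and `topCandidates` are hypothetical names.

```typescript
// Illustrative only: hypothetical names, one index keyed by 3-character prefix.
type InvertedIndex = Map<string, Set<number>>;

function tokenize(text: string): string[] {
  return text.toUpperCase().split(/[^A-Z0-9]+/).filter((t) => t.length > 0);
}

function buildPrefixIndex(docs: string[]): InvertedIndex {
  const index: InvertedIndex = new Map();
  docs.forEach((doc, docId) => {
    for (const token of tokenize(doc)) {
      const key = token.slice(0, 3);
      let posting = index.get(key);
      if (!posting) {
        posting = new Set<number>();
        index.set(key, posting);
      }
      posting.add(docId);
    }
  });
  return index;
}

// Count how many query tokens hit each document, then keep the best N.
function topCandidates(index: InvertedIndex, query: string, maxCandidates: number): number[] {
  const hits = new Map<number, number>();
  for (const token of tokenize(query)) {
    const posting = index.get(token.slice(0, 3));
    if (!posting) continue;
    posting.forEach((docId) => hits.set(docId, (hits.get(docId) ?? 0) + 1));
  }
  return Array.from(hits.entries())
    .sort((a, b) => b[1] - a[1] || a[0] - b[0]) // most hits first, then lowest id
    .slice(0, maxCandidates)
    .map(([docId]) => docId);
}

const catalog = [
  'WIRELESS HEADPHONE BLUETOOTH 40H',
  'MECHANICAL KEYBOARD TENKEYLESS',
  'PORTABLE SSD 1TB USB C',
];
const prefixIndex = buildPrefixIndex(catalog);
const candidates = topCandidates(prefixIndex, 'MECH.KEYB.TENKEYLESS', 2);
// candidates is [1]: only the keyboard shares prefixes with the query
```

Only the surviving candidates go through the more expensive per-token scoring passes, which is what keeps query time low on larger catalogs.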

Installation

npm install @ccorgz/caiss
# or
pnpm add @ccorgz/caiss
# or
yarn add @ccorgz/caiss

Requires Node.js ≥ 18 (uses Map, Set, String.prototype.normalize).

Quick Start

import caiss from '@ccorgz/caiss';

const products = [
  { id: 1, name: 'WIRELESS HEADPHONE BLUETOOTH 40H', category: 'AUDIO',   stock: 150 },
  { id: 2, name: 'MECHANICAL KEYBOARD TENKEYLESS',   category: 'INPUT',   stock: 80  },
  { id: 3, name: 'PORTABLE SSD 1TB USB C',           category: 'STORAGE', stock: 400 },
];

// ── 1. Train ──────────────────────────────────────────────────────────────────
const index = caiss.train(products, ['name', 'category'], {
  orderBy: ['stock'],   // higher-stock items surface first on tie-breaks
});

// ── 2. Search ─────────────────────────────────────────────────────────────────
const results = index.search('MECH.KEYB.TENKEYLESS', 5, {
  minPercentage: 30,
});

console.log(results[0]);
// {
//   item:             { id: 2, name: 'MECHANICAL KEYBOARD TENKEYLESS', ... },
//   percentage:       79,
//   matched:          [['MECH','MECHANICAL'], ['KEYB','KEYBOARD'], ['TENKEYLESS','TENKEYLESS']],
//   searchableString: 'MECHANICAL KEYBOARD TENKEYLESS INPUT',
// }

API Reference

`caiss.train(items, fields, opts?)`

Builds (trains) a CAISS index.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| items | T[] | - | Array of objects. Any shape is accepted. |
| fields | (keyof T)[] | - | Fields to concatenate into the searchable string. |
| opts.dictionary | DictionaryEntry[] | - | Optional abbreviation-expansion entries. No default: pass your own or omit. |
| opts.orderBy | string[] | - | Sort items before indexing (numbers → desc, strings → asc). |
| opts.phonetic | 'en' \| 'pt' | 'en' | Soundex variant to use. |
| opts.unmatchedWordPenaltyFactor | number | 2 | Multiplier applied to each unmatched document word to compute the extra penalty. |
| opts.unmatchedWordPenaltyCap | number | 20 | Maximum value the unmatched-word extra penalty can reach. |
| opts.basePenalty | number | 5 | Flat penalty added on top of the unmatched-word extra penalty when score is below penaltyGuardThreshold. |
| opts.penaltyGuardThreshold | number | 85 | Scores at or above this value are not penalised. |
| opts.highScorePenaltyMin | number | 80 | When the score is between highScorePenaltyMin and penaltyGuardThreshold, only highScorePenalty is applied (no unmatched-word penalty). |
| opts.highScorePenalty | number | 5 | Flat penalty for scores in the high-score range above. |
| opts.discountMatchRatio | number | 0.5 | Fraction of the larger token count required before the loss-distribution formula activates. |

These scoring parameters are baked into the index at train time — they are stored in the snapshot and fully restored by caiss.fromSnapshot() and CaissStore, so you never have to repeat them at search time.

Returns a CaissIndex<T>.

Tuning tip — short queries vs. long document names

When the query contains very few words (e.g. a single brand name like "TODDY") while document names contain many words (e.g. "ACHOCOL PO TODDY CX 24X200GR"), the default penalties can push scores very low because many document words are unmatched. Reduce the penalties at train time to avoid this:

const index = caiss.train(products, ['name'], {
  orderBy:                    ['stock'],
  unmatchedWordPenaltyFactor: 0.5,  // less penalty per unmatched word
  unmatchedWordPenaltyCap:    5,    // lower ceiling
  basePenalty:                0,    // no flat penalty
});
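To make the effect of these values concrete, the sketch below works through the penalty arithmetic as the parameter table describes it. It is one reading of that table, not the library's code: `applyPenalty` is a hypothetical helper, the lower bound of the high-score range is assumed to be inclusive, and the pre-penalty score of 60 is an invented figure.

```typescript
// A reading of the parameter table; hypothetical helper, not CAISS internals.
interface PenaltyParams {
  unmatchedWordPenaltyFactor: number;
  unmatchedWordPenaltyCap: number;
  basePenalty: number;
  penaltyGuardThreshold: number;
  highScorePenaltyMin: number;
  highScorePenalty: number;
}

const DEFAULT_PENALTIES: PenaltyParams = {
  unmatchedWordPenaltyFactor: 2,
  unmatchedWordPenaltyCap: 20,
  basePenalty: 5,
  penaltyGuardThreshold: 85,
  highScorePenaltyMin: 80,
  highScorePenalty: 5,
};

function applyPenalty(score: number, unmatchedDocWords: number, p: PenaltyParams): number {
  // Scores at or above the guard threshold are left untouched.
  if (score >= p.penaltyGuardThreshold) return score;
  // High-score range: only the flat high-score penalty applies (assumed inclusive at the lower bound).
  if (score >= p.highScorePenaltyMin) return score - p.highScorePenalty;
  // Otherwise: capped per-word penalty plus the flat base penalty.
  const extra = Math.min(
    unmatchedDocWords * p.unmatchedWordPenaltyFactor,
    p.unmatchedWordPenaltyCap,
  );
  return Math.max(0, score - extra - p.basePenalty);
}

// "TODDY" against "ACHOCOL PO TODDY CX 24X200GR": 4 of the 5 document words are unmatched.
const tuned: PenaltyParams = {
  ...DEFAULT_PENALTIES,
  unmatchedWordPenaltyFactor: 0.5,
  unmatchedWordPenaltyCap: 5,
  basePenalty: 0,
};
const withDefaults = applyPenalty(60, 4, DEFAULT_PENALTIES); // 60 - min(8, 20) - 5 = 47
const withTuned = applyPenalty(60, 4, tuned);                // 60 - min(2, 5) - 0 = 58
```

Under this reading, the tuned settings cost the single-word query 2 points instead of 13.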

`index.search(query, topK?, opts?)`

const results = index.search('WIRELESS HEADPHONE 40H', 10, {
  minPercentage:        30,
  maxCandidates:        200,
  debug:                false,
  prioritizeQueryPrefix: false,
  deduplicateChars:     true,
});

| Parameter | Default | Description |
|-----------|---------|-------------|
| query | - | Free-text query string. |
| topK | 10 | Max results to return. Pass 0 for all results above minPercentage. |
| opts.minPercentage | 0 | Exclude results below this percentage. |
| opts.maxCandidates | 200 | Max candidates to score per query token. |
| opts.debug | false | Print detailed scoring info to the console. |
| opts.prioritizeQueryPrefix | false | When true, results whose indexed string begins with the normalised query are moved to the top of the list (order within each group is still by percentage). Useful when a short query like "ADOC" should surface "ADOCANTE …" before products that merely contain "ADOC" in the middle. |
| opts.deduplicateChars | true | When true, consecutive duplicate letters are collapsed before every comparison at search time (e.g. TODDY → TODY, TUTTI → TUTI), so that a query like "TODY" produces a 100 % match against the indexed token "TODDY". The stored index is never modified; disabling this restores the original behaviour instantly. |

Returns CaissResult<T>[] sorted by percentage descending.

interface CaissResult<T> {
  item:             T;                    // original object
  percentage:       number;              // 0–100 similarity score
  matched:          [string, string][];  // [queryToken, docToken] evidence pairs
  searchableString: string;              // assembled & normalised indexed string
}

Search option examples

prioritizeQueryPrefix — bring prefix-matching products to the front:

// Without the option (default false):
// COCO RALADO UMIDO ADOC SINHA CX 24X100GR  ← higher percentage, comes first
// ADOCANTE ASSUGRIN LIQ CX12X100M
// ADOCANTE UNIAO SUCRALOSE SCH 400X600MG

// With prioritizeQueryPrefix: true:
// ADOCANTE ASSUGRIN LIQ CX12X100M           ← starts with "ADOC", moved up
// ADOCANTE UNIAO SUCRALOSE SCH 400X600MG    ← starts with "ADOC", moved up
// COCO RALADO UMIDO ADOC SINHA CX 24X100GR
const results = index.search('ADOC', 10, { prioritizeQueryPrefix: true });
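The reordering can be thought of as a stable partition of an already-sorted result list. The sketch below assumes exactly that; `prioritizePrefix` is a hypothetical helper and the percentages are made up for illustration.

```typescript
// Hypothetical helper; field names mirror CaissResult, percentages are invented.
interface RankedResult {
  percentage: number;
  searchableString: string;
}

function prioritizePrefix<R extends RankedResult>(sortedResults: R[], normalizedQuery: string): R[] {
  const startsWithQuery = (r: R): boolean => r.searchableString.indexOf(normalizedQuery) === 0;
  // filter() keeps the incoming order, so each group stays sorted by percentage.
  const prefixGroup = sortedResults.filter(startsWithQuery);
  const restGroup = sortedResults.filter((r) => !startsWithQuery(r));
  return prefixGroup.concat(restGroup);
}

const byPercentage: RankedResult[] = [
  { percentage: 70, searchableString: 'COCO RALADO UMIDO ADOC SINHA CX 24X100GR' },
  { percentage: 62, searchableString: 'ADOCANTE ASSUGRIN LIQ CX12X100M' },
  { percentage: 55, searchableString: 'ADOCANTE UNIAO SUCRALOSE SCH 400X600MG' },
];
const reordered = prioritizePrefix(byPercentage, 'ADOC');
// The two ADOCANTE entries now come first, followed by COCO RALADO.
```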

deduplicateChars — match despite doubled letters (on by default):

// TODY matches TODDY at 100 % because both collapse to "TODY"
index.search('TODY',  5);  // 100 % match against "TODDY"
index.search('TODDY', 5);  // also 100 % (TODDY → TODY on both sides)

// Disable if you need strict character-exact matching:
index.search('TODDY', 5, { deduplicateChars: false });
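The collapsing step itself is small. A minimal sketch, assuming tokens are already upper-cased and that only letters are collapsed (digits are left alone so that sizes such as 200ML stay distinct); `collapseRepeatedLetters` is a hypothetical name:

```typescript
// Hypothetical helper: collapse runs of the same letter into a single letter.
function collapseRepeatedLetters(token: string): string {
  // ([A-Z])\1+ matches a letter followed by one or more copies of itself.
  return token.replace(/([A-Z])\1+/g, '$1');
}

const toddy = collapseRepeatedLetters('TODDY');   // 'TODY'
const tutti = collapseRepeatedLetters('TUTTI');   // 'TUTI'
const volume = collapseRepeatedLetters('200ML');  // '200ML' (digits untouched)
```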

`index.toSnapshot()` / `caiss.fromSnapshot(data)`

Serialize an index to a plain object (JSON-safe) and restore it later.

import fs from 'fs';

// Save
const snap = index.toSnapshot();
fs.writeFileSync('products-index.json', JSON.stringify(snap));

// Load
const restored = caiss.fromSnapshot(JSON.parse(fs.readFileSync('products-index.json', 'utf-8')));
const results  = restored.search('WIRELESS HEADPHONE', 5);

`caiss.createStore(opts?)`

Creates a CaissStore — a caching layer with memory + disk persistence and TTL.

const store = caiss.createStore({
  cacheDir: './caiss-cache',         // null to disable disk
  ttl:      24 * 60 * 60 * 1000,    // 24 h (default)
  log:      true,
});

`store.getOrTrain(key, loader, fields, buildOpts?)`

Returns a cached index if unexpired, otherwise calls loader() to fetch fresh data, trains a new index, and caches it.

Cache priority: memory → disk → loader.

const runtime = await store.getOrTrain(
  'products-region-10',
  async () => db.query('SELECT * FROM products WHERE region = 10'),
  ['name', 'category'],
  { orderBy: ['stock'] },
);

const results = runtime.index.search('WIRELESS HEADPHONE', 10);
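The memory → disk → loader priority with a TTL can be sketched as follows. This is a simplified, synchronous illustration under several assumptions: the "disk" layer is a plain Map so the example runs anywhere, the loader is synchronous (the real one shown above is async), and `createTwoLevelCache` is a hypothetical name unrelated to CaissStore's internals.

```typescript
// Hypothetical sketch of a two-level cache lookup with expiry.
type CacheSource = 'memory' | 'disk' | 'loader';

interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

function createTwoLevelCache<V>(
  disk: Map<string, CacheEntry<V>>, // stand-in for files in a cache directory
  ttlMs: number,
  now: () => number,
) {
  const memory = new Map<string, CacheEntry<V>>();
  const isFresh = (entry: CacheEntry<V> | undefined): entry is CacheEntry<V> =>
    entry !== undefined && entry.expiresAt > now();

  return {
    getOrLoad(key: string, loader: () => V): { value: V; source: CacheSource } {
      const inMemory = memory.get(key);
      if (isFresh(inMemory)) return { value: inMemory.value, source: 'memory' };

      const onDisk = disk.get(key);
      if (isFresh(onDisk)) {
        memory.set(key, onDisk); // promote to memory for the next call
        return { value: onDisk.value, source: 'disk' };
      }

      const entry: CacheEntry<V> = { value: loader(), expiresAt: now() + ttlMs };
      memory.set(key, entry);
      disk.set(key, entry);
      return { value: entry.value, source: 'loader' };
    },
  };
}

let clock = 0; // fake clock so expiry is easy to follow
const sharedDisk = new Map<string, CacheEntry<string[]>>();
const cache = createTwoLevelCache<string[]>(sharedDisk, 1000, () => clock);

const first = cache.getOrLoad('products', () => ['A']);      // source: 'loader'
const second = cache.getOrLoad('products', () => ['B']);     // source: 'memory'

// A fresh process has empty memory but can still read the disk layer.
const restarted = createTwoLevelCache<string[]>(sharedDisk, 1000, () => clock);
const third = restarted.getOrLoad('products', () => ['C']);  // source: 'disk'

clock = 2000; // past the TTL
const fourth = restarted.getOrLoad('products', () => ['D']); // source: 'loader'
```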

`store.train(key, items, fields, buildOpts?)`

Force-trains and caches a new index under key.

`store.get(key)`

Returns the cached CaissRuntimeIndex (or null if missing / expired).

`store.invalidate(key?)`

store.invalidate('key') — evicts a specific entry (memory + disk).
store.invalidate() — clears everything.

`store.list()`

Returns all currently cached (unexpired) keys.

Custom Dictionary

CAISS does not ship with a built-in dictionary. You supply your own list of { term, replacement } pairs and pass it into caiss.train (or caiss.createStore → getOrTrain). Terms are matched as whole words, and the longest terms are matched first automatically, so the order in which you list them does not need to reflect term length.

import caiss from '@ccorgz/caiss';
import type { DictionaryEntry } from '@ccorgz/caiss';

const myDict: DictionaryEntry[] = [
  { term: 'LIQ',  replacement: 'LIQUID'   },
  { term: 'ORIG', replacement: 'ORIGINAL' },
  { term: 'SPEC', replacement: 'SPECIAL'  },
];

const index = caiss.train(items, ['name'], {
  dictionary: myDict,
});

If you do not pass a dictionary, CAISS skips abbreviation expansion and indexes the normalised text of the chosen fields.
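For intuition, whole-word, longest-first expansion can be sketched like this. It assumes the input is already upper-cased and treats any character outside A-Z and 0-9 as a word boundary; `expandAbbreviations`, `escapeRegExp`, and the local `DictEntry` type are hypothetical stand-ins, not the library's implementation.

```typescript
// Hypothetical sketch of dictionary expansion; not the CAISS implementation.
interface DictEntry {
  term: string;
  replacement: string;
}

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function expandAbbreviations(text: string, dictionary: DictEntry[]): string {
  // Copy before sorting so the caller's array keeps its order.
  const longestFirst = dictionary.slice().sort((a, b) => b.term.length - a.term.length);
  let result = text;
  for (const { term, replacement } of longestFirst) {
    // Match the term only when it is not glued to other letters or digits.
    const wholeWord = new RegExp('(^|[^A-Z0-9])' + escapeRegExp(term) + '(?![A-Z0-9])', 'g');
    result = result.replace(wholeWord, (_match: string, lead: string) => lead + replacement);
  }
  return result;
}

const sampleDict: DictEntry[] = [
  { term: 'LIQ',  replacement: 'LIQUID'   },
  { term: 'ORIG', replacement: 'ORIGINAL' },
  { term: 'SPEC', replacement: 'SPECIAL'  },
];

const expanded = expandAbbreviations('SOAP LIQ ORIG 200ML', sampleDict);
// 'SOAP LIQUID ORIGINAL 200ML'
const untouched = expandAbbreviations('LIQUOR SPECIAL', sampleDict);
// 'LIQUOR SPECIAL' (LIQ and SPEC are only parts of longer words here)
```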

Phonetic Algorithm

CAISS uses a Soundex-derived phonetic fingerprint in Pass 3 and Pass 5 of the scoring kernel. You can choose the variant via the phonetic build option:

| Value | Description |
|-------|-------------|
| 'en' (default) | Standard American Soundex, best for English-language catalogs. Does not perform accent-to-ASCII conversion beyond standard NFD normalisation. |
| 'pt' | Soundex variant adapted for Latin/accented scripts (treats Ç → C and applies slightly different consonant groupings). Useful for Portuguese, Spanish, Italian, or similar languages. |

// English (default — omit the option or pass 'en' explicitly)
const enIndex = caiss.train(products, ['name'], { phonetic: 'en' });

// Portuguese / Latin-script variant
const ptIndex = caiss.train(produtos, ['nome'], { phonetic: 'pt' });

The selected algorithm is applied consistently at both index time and query time, so you only set it once in caiss.train.

Scoring Passes — Quick Reference

The scoring kernel runs up to 5 passes per query token, stopping at the first match:

| Pass | Technique | Score |
|------|-----------|-------|
| 1 | Exact normalised match | 1.00 |
| 2 | Prefix match (≥ 3 chars) | 0.95 |
| 3 | Substring / concat overlap + Soundex | 0.70 – 1.00 |
| 4 | LCS substring equality (≥ 80 % overlap) | 0.30 – 0.50 |
| 5 | Soundex phonetic fallback (length-penalised) | 0.06 – 0.90 |

After all tokens are processed, the raw score is normalised against the maximum token-list length. A penalty-recovery step then distributes the remaining loss across the matched pairs, but only when at least 50 % of the words were matched, counted against the longer of the query and document token lists (this fraction is the discountMatchRatio build option, 0.5 by default). Comparisons that match fewer than half the words receive no recovery and keep the full penalty, so weak matches are not artificially promoted. Finally, an extra-word penalty is applied to discourage matches with large unmatched tails.
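The normalisation step and the recovery gate can be worked through numerically. The sketch below covers only those two pieces; the recovery formula itself and the final penalty are not specified in this section, so they are left out. `baseScore` and `recoveryEligible` are hypothetical names, and the per-word scores assume the two abbreviations in the Quick Start query score as prefix matches.

```typescript
// Hypothetical helpers covering only normalisation and the recovery gate.
function baseScore(wordScores: number[], queryTokenCount: number, docTokenCount: number): number {
  const longest = Math.max(queryTokenCount, docTokenCount);
  if (longest === 0) return 0;
  const total = wordScores.reduce((sum, score) => sum + score, 0);
  return (total / longest) * 100;
}

function recoveryEligible(
  matchedPairs: number,
  queryTokenCount: number,
  docTokenCount: number,
  discountMatchRatio: number = 0.5,
): boolean {
  const longest = Math.max(queryTokenCount, docTokenCount);
  return longest > 0 && matchedPairs / longest >= discountMatchRatio;
}

// Query: MECH KEYB TENKEYLESS (3 tokens)
// Document: MECHANICAL KEYBOARD TENKEYLESS INPUT (4 tokens)
const quickStartScores = [0.95, 0.95, 1.0]; // prefix, prefix, exact
const quickStartBase = baseScore(quickStartScores, 3, 4); // about 72.5
const quickStartGate = recoveryEligible(3, 3, 4);         // true: 3 / 4 >= 0.5

// Query: TODDY (1 token) against a 5-token document name
const shortQueryBase = baseScore([1.0], 1, 5);            // 20
const shortQueryGate = recoveryEligible(1, 1, 5);         // false: 1 / 5 < 0.5
```

The base figure is not meant to reproduce the 79 shown in Quick Start, since recovery and penalties are applied afterwards.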

Soundex Phonetic Fallback — Length-Based Score Penalty

Pass 5 (phonetic fallback) uses Soundex to match query words that survived all earlier passes without a structural hit. Because Soundex encodes every word into a fixed 4-character code, short words (4 letters) use almost all of their characters just to fill the code, leaving no room for meaningful differentiation. For example, COCO and CAJU both encode to C200, producing a Soundex similarity of 1.0 — even though they sound nothing alike in practice.

To correct this, Pass 5 applies a lengthFactor multiplier that scales the word score down based on the length of the shorter of the two words being compared:

lengthFactor = min(1.0, (minWordLength − 3) / 4)

| Word length | lengthFactor | Effective score cap (base sim × 0.90) |
|-------------|--------------|---------------------------------------|
| 4 letters | 0.25 | ≤ 0.225 |
| 5 letters | 0.50 | ≤ 0.45 |
| 6 letters | 0.75 | ≤ 0.675 |
| 7+ letters | 1.00 | ≤ 0.90 (no penalty) |

The final word score in the phonetic fallback becomes soundexSimilarity × baseMultiplier × lengthFactor instead of soundexSimilarity × baseMultiplier.
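The formula is easy to check by hand or in code. In this sketch the function names are hypothetical, the base multiplier is taken to be the 0.90 quoted in the table header, and the lower clamp at 0 is an added guard for words of three letters or fewer that the formula above does not state.

```typescript
// Hypothetical helpers reproducing the arithmetic from the table above.
const BASE_MULTIPLIER = 0.9;

function lengthFactor(wordA: string, wordB: string): number {
  const minWordLength = Math.min(wordA.length, wordB.length);
  // The Math.max(0, ...) clamp is an assumption; the documented formula has only the upper bound.
  return Math.max(0, Math.min(1.0, (minWordLength - 3) / 4));
}

function phoneticWordScore(soundexSimilarity: number, wordA: string, wordB: string): number {
  return soundexSimilarity * BASE_MULTIPLIER * lengthFactor(wordA, wordB);
}

const cocoVsCaju = phoneticWordScore(1.0, 'COCO', 'CAJU');           // 1.0 × 0.9 × 0.25 = 0.225
const longWords = phoneticWordScore(1.0, 'HEADFONE', 'HEADPHONE');   // 1.0 × 0.9 × 1.00 = 0.9
```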

The length factor applies only to Pass 5. Passes 1–4 already have strong structural evidence (exact characters, prefixes, substrings) and are not penalised.

Practical effect: a 4-letter phonetic false-positive like COCO ↔ CAJU (Soundex sim = 1.0) now produces a word score of 0.225 instead of 0.90, pulling the overall match well below any reasonable acceptance threshold. Longer words that genuinely sound alike (7+ letters) are completely unaffected.

See the GitHub repository for the full technical specification.

Performance

| Scenario | Time |
|----------|------|
| Index build (4 000 items) | ~500 ms (one-time) |
| Memory cache hit | < 1 ms |
| Disk snapshot load | ~60 ms |
| Search (single query, 4 000 items) | ~1–2 ms |

TypeScript

CAISS is written in TypeScript and ships full type declarations. No @types/caiss needed.

import caiss from '@ccorgz/caiss';
import type { CaissResult, CaissBuildOptions } from '@ccorgz/caiss';

interface Product { id: number; name: string; }

// Sample data so the snippet is self-contained
const products: Product[] = [{ id: 1, name: 'TEST PRODUCT' }];

const index   = caiss.train<Product>(products, ['name']);
const results: CaissResult<Product>[] = index.search('test', 5);

Contributing

Contributions are welcome! Please open an issue or pull request on GitHub.

Important: the 5-pass scoring kernel inside src/core/CaissIndex.ts (_scoreDocAgainstQuery) must not be changed without a corresponding benchmark showing a measurable accuracy improvement. The algorithm is the core value of this library.
