fuzzy-vir
v0.0.2
Fuzzy string matching via MinHash + LSH with O(1) lookup, designed for streaming use.
Easy-to-use utilities for fuzzy matching.
Install
npm i fuzzy-vir
Usage
FuzzyIndex
Groups near-duplicate strings under a shared cluster key with O(1) lookup. Use it to dedupe or count things like error messages that differ only in embedded IDs, timestamps, or line numbers.
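For background, the package description mentions MinHash + LSH. The sketch below illustrates the general MinHash idea only and is not fuzzy-vir's actual implementation. The names `shingles`, `seededHash`, `minHashSignature`, and `estimateSimilarity`, the shingle size of 3, and the 64-hash signature length are all assumptions made for illustration.

```typescript
/** Break text into overlapping character shingles (substrings of length `size`). */
function shingles(text: string, size = 3): Set<string> {
    const normalized = text.toLowerCase();
    const result = new Set<string>();
    if (normalized.length <= size) {
        result.add(normalized);
        return result;
    }
    for (let start = 0; start + size <= normalized.length; start++) {
        result.add(normalized.slice(start, start + size));
    }
    return result;
}

/** Seeded 32-bit hash (FNV-1a plus a mixing step). Each seed acts as a separate hash function. */
function seededHash(value: string, seed: number): number {
    let hash = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < value.length; i++) {
        hash ^= value.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193);
    }
    hash ^= hash >>> 16;
    hash = Math.imul(hash, 0x85ebca6b);
    hash ^= hash >>> 13;
    hash = Math.imul(hash, 0xc2b2ae35);
    hash ^= hash >>> 16;
    return hash >>> 0;
}

/** For each hash function, keep the smallest hash value seen across all shingles. */
function minHashSignature(text: string, numHashes = 64): number[] {
    const tokens = shingles(text);
    const signature: number[] = [];
    for (let seed = 0; seed < numHashes; seed++) {
        let min = 0xffffffff;
        tokens.forEach((token) => {
            min = Math.min(min, seededHash(token, seed));
        });
        signature.push(min);
    }
    return signature;
}

/** The fraction of matching signature slots approximates the Jaccard similarity of the shingle sets. */
function estimateSimilarity(a: number[], b: number[]): number {
    if (a.length !== b.length || a.length === 0) {
        throw new Error('Signatures must be non-empty and of equal length.');
    }
    let matches = 0;
    for (let i = 0; i < a.length; i++) {
        if (a[i] === b[i]) {
            matches++;
        }
    }
    return matches / a.length;
}

const first = minHashSignature('Database query failed: id abc123 not found');
const second = minHashSignature('Database query failed: id 9f0e3d not found');
const unrelated = minHashSignature('Network request timed out after 30s');

/** Messages that differ only in an embedded id should score well above unrelated ones. */
console.info(estimateSimilarity(first, second), estimateSimilarity(first, unrelated));
```

In a typical LSH setup, the signature is then split into bands and each band is hashed into a bucket, so candidate near-duplicates can be found with a constant number of bucket lookups instead of a scan over every stored string.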
import {FuzzyIndex} from 'fuzzy-vir';
/** A per-cluster counter that lives alongside the index. */
const occurrenceCounts = new Map<string, number>();
const index = new FuzzyIndex({
    /** Cap memory at 1,000 distinct clusters; older clusters are evicted FIFO. */
    maxIndexSize: 1000,
    /**
     * Keep the sidecar Map in sync with the index. Without this, entries for evicted clusters would
     * stay in `occurrenceCounts` forever, since Maps hold strong references to their keys.
     */
    onEvict(clusterKey) {
        occurrenceCounts.delete(clusterKey);
    },
});
function recordOccurrence(message: string): number {
    const key = index.insert(message);
    const next = (occurrenceCounts.get(key) ?? 0) + 1;
    occurrenceCounts.set(key, next);
    return next;
}
/** A near-duplicate (different id, same shape) collapses onto the same cluster. */
recordOccurrence('Database query failed: id abc123 not found'); // returns 1
recordOccurrence('Database query failed: id 9f0e3d not found'); // returns 2
/** A genuinely different message gets its own cluster. */
recordOccurrence('Network request timed out after 30s'); // returns 1
/** Look up without inserting; returns the existing canonical cluster key or `undefined`. */
const existing = index.findKey('Database query failed: id xyz not found');
console.info(existing); // canonical key of the "Database query failed" cluster
fuzzySearch
A one-shot fuzzy match over an array. Each item is converted to a string (via the optional itemToString), punctuation is stripped from both the query and the items, and matches are scored with Fuse.js. The default threshold (0.1) keeps results close to the query. Any Fuse.js option (e.g. threshold) can be passed through to tune matching.
import {fuzzySearch} from 'fuzzy-vir';
const fruits = [
    'banana',
    'grape',
    'apple',
    'orange',
    'pineapple',
];
/** Exact and near matches are returned, ranked by similarity. */
fuzzySearch('apple', fruits); // ['apple', 'pineapple']
/** Typos still match within the default 0.1 threshold. */
fuzzySearch('banan', fruits); // ['banana']
/** Pass `itemToString` to search over objects by a derived string. */
const people = [
    {
        firstName: 'Jane',
        lastName: 'Smith',
    },
    {
        firstName: 'John',
        lastName: 'Doe',
    },
    {
        firstName: 'Janet',
        lastName: 'Jones',
    },
];
fuzzySearch('Jane', people, {
    itemToString: (person) =>
        [
            person.firstName,
            person.lastName,
        ].join(' '),
}); // [{firstName: 'Jane', ...}, {firstName: 'Janet', ...}]
/** Loosen the threshold to allow more distant matches. Other Fuse.js options pass through too. */
fuzzySearch('aple', fruits, {
    threshold: 0.5,
}); // ['apple', 'grape', 'pineapple']
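The punctuation stripping that fuzzySearch applies to the query and the items can be pictured roughly as follows. This is a sketch under assumptions, not the library's code: `stripPunctuation` and `prepareSearch` are hypothetical names, and the set of characters removed (ASCII punctuation only) is a guess, since the exact rule is not documented here.

```typescript
/**
 * Hypothetical helper: remove ASCII punctuation, then collapse runs of whitespace. The character
 * class is an assumption about what counts as punctuation.
 */
function stripPunctuation(text: string): string {
    return text
        .replace(/[!-\/:-@\[-`{-~]/g, '')
        .replace(/\s+/g, ' ')
        .trim();
}

/**
 * Hypothetical helper: normalize the query and every item the same way, keeping each original item
 * next to its normalized text so matches can be mapped back to the caller's objects.
 */
function prepareSearch<T>(
    query: string,
    items: ReadonlyArray<T>,
    itemToString: (item: T) => string = String,
): {query: string; entries: {item: T; text: string}[]} {
    return {
        query: stripPunctuation(query),
        entries: items.map((item) => ({
            item,
            text: stripPunctuation(itemToString(item)),
        })),
    };
}

const prepared = prepareSearch(
    'Jane!',
    [
        {firstName: 'Jane', lastName: 'Smith'},
        {firstName: 'John', lastName: "O'Doe"},
    ],
    (person) => `${person.firstName}, ${person.lastName}`,
);
console.info(prepared.query); // 'Jane'
console.info(prepared.entries.map((entry) => entry.text)); // ['Jane Smith', 'John ODoe']
```

Entries shaped like this could then be handed to a Fuse.js instance keyed on `text`, with the original `item` read back from each result.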