@3leaps/string-metrics-wasm

v0.3.8

Published

7 months ago

High-performance string similarity metrics via WASM bindings to rapidfuzz-rs

3leapsdave

string-metrics-wasm

High-performance string similarity and fuzzy matching via WASM bindings to rapidfuzz-rs.

Description

This library provides fast string similarity metrics through WASM bindings to the Rust rapidfuzz-rs library, plus TypeScript implementations of token-based fuzzy matching, process helpers, and a unified scoring API. It pairs the speed of compiled Rust/WASM with the flexibility of TypeScript in a single text similarity toolkit.

Features:

WASM-powered distance metrics: Levenshtein, Damerau-Levenshtein, OSA, Jaro, Jaro-Winkler, Indel, LCS
Fuzzy matching: Token-based comparison (order-insensitive, set-based)
Process helpers: Find best matches from arrays with configurable scoring
Unified API: Consistent interface across all metrics
TypeScript extensions: Substring similarity, normalization presets, suggestions API
Multi-runtime: Node.js, Bun, Deno support

Prerequisites

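These tools are needed only when building from source (see Building from Source); the published npm package should not require them.

Rust toolchain via rustup
wasm-pack (pinned to the version we build against)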

Install wasm-pack once per machine:

cargo install wasm-pack --version 0.13.1

Installation

npm install @3leaps/string-metrics-wasm

Quick Start

import { levenshtein, ratio, tokenSortRatio, extractOne, score } from '@3leaps/string-metrics-wasm';

// Basic edit distance
const dist = levenshtein('kitten', 'sitting');
console.log(dist); // 3

// Fuzzy matching (0-100 scale)
const fuzzy = ratio('hello', 'hallo');
console.log(fuzzy); // 80.0

// Order-insensitive comparison
const tokens = tokenSortRatio('new york mets', 'mets york new');
console.log(tokens); // 100.0

// Find best match from array
const choices = ['Atlanta Falcons', 'New York Jets', 'Dallas Cowboys'];
const best = extractOne('new york', choices);
console.log(best); // { choice: 'New York Jets', score: 57.14, index: 1 }

// Unified scoring API (0-1 scale)
const similarity = score('hello', 'world', 'jaroWinkler');
console.log(similarity); // 0.4666...

API Documentation

Compatibility: All examples use camelCase option names and metric identifiers. For ecosystems that standardize on snake_case (e.g., Fulmen/Crucible fixtures), the same snake_case names are accepted as aliases and normalized internally.
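As a rough illustration of how such alias handling could work, the sketch below converts a snake_case metric name to camelCase and checks it against the distance metric identifiers listed under the Unified API. The helper names `toCamelCase` and `resolveDistanceMetric` are hypothetical, and the assumption that each alias is the plain snake_case spelling of the camelCase name (for example `lcs_seq`) is not confirmed by this README.

```typescript
// Hypothetical helper: convert a snake_case identifier to camelCase.
function toCamelCase(name: string): string {
  return name.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Metric identifiers copied from the "Supported metrics" list for distance().
const DISTANCE_METRICS = ['levenshtein', 'damerauLevenshtein', 'osa', 'indel', 'lcsSeq'] as const;
type DistanceMetricName = (typeof DISTANCE_METRICS)[number];

// Hypothetical resolver: accept either spelling, return the camelCase identifier.
function resolveDistanceMetric(name: string): DistanceMetricName {
  const camel = toCamelCase(name);
  const found = DISTANCE_METRICS.find((metric) => metric === camel);
  if (found === undefined) {
    throw new Error(`Unknown distance metric: ${name}`);
  }
  return found;
}

console.log(resolveDistanceMetric('damerau_levenshtein')); // damerauLevenshtein
console.log(resolveDistanceMetric('lcs_seq')); // lcsSeq
console.log(resolveDistanceMetric('osa')); // osa
```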

Distance Metrics (WASM)

Edit distance metrics return raw integer distances (lower = more similar):

`levenshtein(a: string, b: string): number`

Minimum edits (insertions, deletions, substitutions) to transform a into b.

levenshtein('kitten', 'sitting'); // 3
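For readers who want to see what the metric computes, here is a minimal TypeScript sketch of the classic two-row dynamic programming algorithm. It is not the library's implementation (that lives in Rust/WASM); the name `levenshteinSketch` and the choice to compare by Unicode code point are assumptions made for illustration.

```typescript
// Illustrative only: textbook Levenshtein distance with two rolling rows.
function levenshteinSketch(a: string, b: string): number {
  // Compare by Unicode code point rather than UTF-16 code unit.
  const s = Array.from(a);
  const t = Array.from(b);
  // prev[j] = distance between the first i-1 characters of s and the first j of t.
  let prev: number[] = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i]; // deleting i characters reaches the empty prefix
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1, // deletion
          curr[j - 1] + 1, // insertion
          prev[j - 1] + cost, // substitution or match
        ),
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(levenshteinSketch('kitten', 'sitting')); // 3
console.log(levenshteinSketch('hello', 'world')); // 4
```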

`damerau_levenshtein(a: string, b: string): number`

Levenshtein + transpositions (unrestricted).

damerau_levenshtein('abcd', 'abdc'); // 1

`osa_distance(a: string, b: string): number`

Optimal String Alignment (restricted Damerau-Levenshtein).

osa_distance('abcd', 'abdc'); // 1
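The two transposition-aware metrics agree on simple swaps like the one above but can differ once a transposed pair is edited again. The sketch below is a textbook OSA implementation (not the library's Rust code; `osaSketch` is a hypothetical name). For 'ca' and 'abc' it yields 3, whereas unrestricted Damerau-Levenshtein is known to give 2 for the same pair.

```typescript
// Illustrative only: Optimal String Alignment distance (restricted Damerau-Levenshtein).
function osaSketch(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  // d[i][j] = distance between the first i characters of s and the first j of t.
  const d: number[][] = [];
  for (let i = 0; i <= s.length; i++) {
    d.push(new Array<number>(t.length + 1).fill(0));
    d[i][0] = i;
  }
  for (let j = 0; j <= t.length; j++) d[0][j] = j;

  for (let i = 1; i <= s.length; i++) {
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      let best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
      // Adjacent transposition; the swapped pair cannot be edited again afterwards.
      if (i > 1 && j > 1 && s[i - 1] === t[j - 2] && s[i - 2] === t[j - 1]) {
        best = Math.min(best, d[i - 2][j - 2] + 1);
      }
      d[i][j] = best;
    }
  }
  return d[s.length][t.length];
}

console.log(osaSketch('abcd', 'abdc')); // 1
console.log(osaSketch('ca', 'abc')); // 3
```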

`indel_distance(a: string, b: string): number`

Insertions and deletions only (no substitutions).

indel_distance('hello', 'hallo'); // 2

`lcs_seq_distance(a: string, b: string): number`

Longest Common Subsequence distance.

lcs_seq_distance('AGGTAB', 'GXTXAYB'); // 3
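Both of these metrics can be derived from the length of the longest common subsequence. The sketch below assumes the definitions that match the examples in this README: Indel distance is len(a) + len(b) - 2 * LCS, and LCS distance is max(len(a), len(b)) - LCS. The function names are hypothetical and this is not the library's code.

```typescript
// Illustrative only: LCS length via a two-row dynamic programming table.
function lcsLengthSketch(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev: number[] = new Array<number>(t.length + 1).fill(0);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [0];
    for (let j = 1; j <= t.length; j++) {
      curr.push(s[i - 1] === t[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]));
    }
    prev = curr;
  }
  return prev[t.length];
}

// Characters outside the LCS must be deleted from a or inserted from b.
function indelDistanceSketch(a: string, b: string): number {
  return Array.from(a).length + Array.from(b).length - 2 * lcsLengthSketch(a, b);
}

// Assumed definition: longer length minus LCS length.
function lcsSeqDistanceSketch(a: string, b: string): number {
  return Math.max(Array.from(a).length, Array.from(b).length) - lcsLengthSketch(a, b);
}

console.log(indelDistanceSketch('hello', 'hallo')); // 2
console.log(lcsSeqDistanceSketch('AGGTAB', 'GXTXAYB')); // 3
```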

Similarity Metrics (WASM)

Normalized similarity scores (0.0-1.0 scale, higher = more similar):

`normalized_levenshtein(a: string, b: string): number`

Normalized Levenshtein similarity.

normalized_levenshtein('kitten', 'sitting'); // 0.5714

`jaro(a: string, b: string): number`

Jaro similarity.

jaro('kitten', 'sitting'); // 0.7460

`jaro_winkler(a: string, b: string): number`

Jaro-Winkler similarity (boosts prefix matches).

jaro_winkler('kitten', 'sitting'); // 0.7460 (no shared prefix, so identical to Jaro)
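To make the prefix boost concrete, here is a textbook sketch of Jaro and Jaro-Winkler in TypeScript. It is an illustration under stated assumptions, not the library's implementation: the names are hypothetical, the prefix weight of 0.1 and the 4-character prefix cap are the conventional defaults, and real implementations may apply the boost only above a similarity threshold.

```typescript
// Illustrative only: textbook Jaro similarity.
function jaroSketch(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  if (s.length === 0 && t.length === 0) return 1;
  if (s.length === 0 || t.length === 0) return 0;

  // Characters match only if they are equal and at most this far apart.
  const matchRange = Math.max(0, Math.floor(Math.max(s.length, t.length) / 2) - 1);
  const sMatched: boolean[] = new Array<boolean>(s.length).fill(false);
  const tMatched: boolean[] = new Array<boolean>(t.length).fill(false);
  let matches = 0;
  for (let i = 0; i < s.length; i++) {
    const lo = Math.max(0, i - matchRange);
    const hi = Math.min(t.length - 1, i + matchRange);
    for (let j = lo; j <= hi; j++) {
      if (!tMatched[j] && s[i] === t[j]) {
        sMatched[i] = true;
        tMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Walk both match sequences in order; mismatched pairs count as half a transposition.
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < s.length; i++) {
    if (!sMatched[i]) continue;
    while (!tMatched[k]) k++;
    if (s[i] !== t[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;
  return (matches / s.length + matches / t.length + (matches - transpositions) / matches) / 3;
}

// Illustrative only: boost the Jaro score by the shared prefix (capped at 4 characters).
function jaroWinklerSketch(a: string, b: string, prefixWeight = 0.1): number {
  const jaro = jaroSketch(a, b);
  const s = Array.from(a);
  const t = Array.from(b);
  let prefix = 0;
  while (prefix < Math.min(4, s.length, t.length) && s[prefix] === t[prefix]) prefix++;
  return jaro + prefix * prefixWeight * (1 - jaro);
}

console.log(jaroSketch('kitten', 'sitting').toFixed(4)); // 0.7460
console.log(jaroWinklerSketch('kitten', 'sitting').toFixed(4)); // 0.7460
console.log(jaroWinklerSketch('MARTHA', 'MARHTA').toFixed(4)); // 0.9611
```

The 'MARTHA'/'MARHTA' pair shares the prefix 'MAR', so its Jaro score of about 0.9444 is lifted to about 0.9611, while 'kitten'/'sitting' is unchanged.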

`indel_normalized_similarity(a: string, b: string): number`

Normalized indel similarity.

indel_normalized_similarity('hello', 'hallo'); // 0.8

`lcs_seq_normalized_similarity(a: string, b: string): number`

Normalized LCS similarity.

lcs_seq_normalized_similarity('AGGTAB', 'GXTXAYB'); // 0.5714

Fuzzy Matching (WASM + TypeScript)

Fuzzy string comparison metrics (0-100 scale):

`ratio(a: string, b: string): number` (WASM)

Basic fuzzy similarity using Indel distance.

ratio('kitten', 'sitting'); // 61.54

`partialRatio(a: string, b: string): number` (TypeScript)

Best matching substring using sliding window.

partialRatio('fuzzy', 'fuzzy wuzzy was a bear'); // 100.0
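The sliding-window idea can be sketched as follows: score the shorter string against every equally long window of the longer one and keep the best result. This is a simplified illustration, not the library's code. The function names are hypothetical, the ratio formula 100 * 2 * LCS / (len(a) + len(b)) is the Indel-based definition implied by the examples above, and production implementations typically also consider partial windows at the string edges.

```typescript
// Length of the longest common subsequence of two code-point arrays.
function lcsLength(s: string[], t: string[]): number {
  let prev: number[] = new Array<number>(t.length + 1).fill(0);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [0];
    for (let j = 1; j <= t.length; j++) {
      curr.push(s[i - 1] === t[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]));
    }
    prev = curr;
  }
  return prev[t.length];
}

// Illustrative only: Indel-based similarity on a 0-100 scale.
function ratioSketch(s: string[], t: string[]): number {
  const total = s.length + t.length;
  if (total === 0) return 100; // assumption: two empty strings count as identical
  return (100 * 2 * lcsLength(s, t)) / total;
}

// Illustrative only: best ratio of the shorter string against each same-length window.
function partialRatioSketch(a: string, b: string): number {
  const x = Array.from(a);
  const y = Array.from(b);
  const [shorter, longer] = x.length <= y.length ? [x, y] : [y, x];
  if (shorter.length === 0) return longer.length === 0 ? 100 : 0;
  let best = 0;
  for (let start = 0; start + shorter.length <= longer.length; start++) {
    const slice = longer.slice(start, start + shorter.length);
    best = Math.max(best, ratioSketch(shorter, slice));
  }
  return best;
}

console.log(ratioSketch(Array.from('kitten'), Array.from('sitting')).toFixed(2)); // 61.54
console.log(partialRatioSketch('fuzzy', 'fuzzy wuzzy was a bear')); // 100
```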

`tokenSortRatio(a: string, b: string): number` (TypeScript)

Order-insensitive token comparison (sorts tokens first).

tokenSortRatio('new york mets', 'mets york new'); // 100.0

`tokenSetRatio(a: string, b: string): number` (TypeScript)

Set-based token comparison (handles duplicates and order).

tokenSetRatio('hello world world', 'world hello'); // 100.0
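Both token metrics come down to a preprocessing step followed by an ordinary ratio comparison. The sketch below shows only that preprocessing, under the assumption that tokens are split on whitespace; the helper names are hypothetical and the library's actual tokenization rules may differ.

```typescript
// Split on runs of whitespace, dropping empty tokens.
function tokenize(text: string): string[] {
  return text.trim().split(/\s+/).filter((token) => token.length > 0);
}

// Token sort: both inputs are reduced to this form before being compared.
function sortedTokenString(text: string): string {
  return tokenize(text).sort().join(' ');
}

interface TokenSetParts {
  common: string; // tokens present in both inputs
  onlyInA: string; // tokens unique to the first input
  onlyInB: string; // tokens unique to the second input
}

// Token set: duplicates collapse, then shared and unique tokens are separated.
function tokenSetParts(a: string, b: string): TokenSetParts {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  const join = (tokens: string[]): string => tokens.sort().join(' ');
  return {
    common: join(Array.from(setA).filter((token) => setB.has(token))),
    onlyInA: join(Array.from(setA).filter((token) => !setB.has(token))),
    onlyInB: join(Array.from(setB).filter((token) => !setA.has(token))),
  };
}

console.log(sortedTokenString('new york mets')); // mets new york
console.log(sortedTokenString('mets york new')); // mets new york
console.log(tokenSetParts('hello world world', 'world hello'));
// { common: 'hello world', onlyInA: '', onlyInB: '' }
```

Because both sorted strings are identical, and because neither input has unique tokens in the set example, a ratio computed on the preprocessed forms is 100 in each case.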

Process Helpers (TypeScript)

Find best matches from arrays:

`extractOne(query: string, choices: string[], options?): ExtractResult | null`

Find the single best match.

Options:

scorer?: (a: string, b: string) => number - Scoring function (default: ratio)
processor?: (str: string) => string - Preprocessing function
scoreCutoff?: number - Minimum score threshold (default: 0)

const choices = ['Atlanta Falcons', 'New York Jets', 'Dallas Cowboys'];
const best = extractOne('jets', choices, { scoreCutoff: 30 });
// { choice: 'New York Jets', score: 35.29, index: 1 }

`extract(query: string, choices: string[], options?): ExtractResult[]`

Find top N matches (sorted by score).

Options:

scorer?: (a: string, b: string) => number - Scoring function
processor?: (str: string) => string - Preprocessing function
scoreCutoff?: number - Minimum score threshold
limit?: number - Maximum results to return

const teams = ['Atlanta Falcons', 'New York Jets', 'New York Giants', 'Dallas Cowboys'];
const results = extract('new york', teams, { limit: 2, scoreCutoff: 40 });
// [
//   { choice: 'New York Jets', score: 57.14, index: 1 },
//   { choice: 'New York Giants', score: 52.17, index: 2 }
// ]
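The way these options interact can be sketched as a score, filter, sort, and truncate pipeline. The code below is an illustration, not the library's implementation: all names ending in `Sketch` are hypothetical, the scorer is a required argument here, `sharedPrefixScore` is a toy stand-in for ratio, and breaking ties by original index is an assumption.

```typescript
interface ExtractResultSketch {
  choice: string;
  score: number;
  index: number;
}

interface ExtractOptionsSketch {
  scorer: (a: string, b: string) => number;
  processor?: (str: string) => string;
  scoreCutoff?: number;
  limit?: number;
}

// Illustrative only: score every choice, drop low scores, sort, keep the top N.
function extractSketch(
  query: string,
  choices: string[],
  options: ExtractOptionsSketch,
): ExtractResultSketch[] {
  const { scorer, processor = (str: string) => str, scoreCutoff = 0, limit = choices.length } = options;
  const processedQuery = processor(query);
  return choices
    .map((choice, index) => ({ choice, score: scorer(processedQuery, processor(choice)), index }))
    .filter((result) => result.score >= scoreCutoff)
    .sort((x, y) => y.score - x.score || x.index - y.index)
    .slice(0, limit);
}

// Illustrative only: the single best match, or null when nothing passes the cutoff.
function extractOneSketch(
  query: string,
  choices: string[],
  options: ExtractOptionsSketch,
): ExtractResultSketch | null {
  return extractSketch(query, choices, { ...options, limit: 1 })[0] ?? null;
}

// Toy scorer (0-100): share of the longer string covered by the common prefix.
function sharedPrefixScore(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 100;
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return (100 * n) / longest;
}

const sampleTeams = ['Atlanta Falcons', 'New York Jets', 'New York Giants', 'Dallas Cowboys'];
const top = extractSketch('new york', sampleTeams, {
  scorer: sharedPrefixScore,
  processor: (str) => str.toLowerCase(),
  scoreCutoff: 40,
  limit: 2,
});
console.log(top.map((result) => result.choice)); // [ 'New York Jets', 'New York Giants' ]
```

Passing a lowercasing processor is what lets 'new york' match 'New York Jets' here; without it the toy scorer would see no shared prefix at all.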

Unified API (TypeScript)

Metric-selectable interface with consistent scales:

`distance(a: string, b: string, metric?: DistanceMetric): number`

Calculate edit distance using any metric (returns raw distance).

Supported metrics: 'levenshtein' (default), 'damerauLevenshtein', 'osa', 'indel', 'lcsSeq'

distance('hello', 'world'); // 4 (default: levenshtein)
distance('hello', 'world', 'indel'); // 8

`score(a: string, b: string, metric?: SimilarityMetric): number`

Calculate similarity using any metric (returns 0-1 normalized score).

Supported metrics: 'jaroWinkler' (default), 'levenshtein', 'damerauLevenshtein', 'osa', 'jaro', 'indel', 'lcsSeq', 'ratio', 'partialRatio', 'tokenSortRatio', 'tokenSetRatio'

score('hello', 'world'); // 0.4666... (default: jaroWinkler)
score('new york mets', 'mets york new', 'tokenSortRatio'); // 1.0

// Fulmen/Crucible users: override default metric if needed
score('hello', 'world', 'levenshtein'); // 0.2 (1 - 4/5, edit distance-based)

Normalization & Suggestions

`normalize(input: string, preset?: NormalizationPreset, locale?: NormalizationLocale): string`

Normalize text for comparison with optional locale-specific case folding.

Presets: 'none', 'minimal', 'default', 'aggressive'

Locales: 'tr' (Turkish), 'az' (Azerbaijani), 'lt' (Lithuanian), or undefined (default Unicode casefold)

normalize('Naïve Café', 'default'); // 'naïve café'

// Turkish/Azerbaijani: dotted/dotless I handling
normalize('İstanbul', 'default', 'tr'); // 'istanbul' (İ→i)
normalize('IĞDIR', 'default', 'tr'); // 'ığdır' (I→ı dotless)

// Default Unicode casefold (no locale)
normalize('İstanbul', 'default'); // 'i̇stanbul' (İ→i + combining dot)

Note: Most applications don't need locale-specific normalization. Use it only when processing Turkish, Azerbaijani, or Lithuanian text, where the dotted/dotless I distinction matters.
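The dotted/dotless I behaviour can be reproduced with the standard JavaScript string methods, as in the sketch below. This only illustrates the case-mapping step and is not the library's normalize(): the helper name is hypothetical, `toLowerCase()` is a simpler operation than full Unicode case folding, and the locale-specific results assume a runtime with ICU-backed `toLocaleLowerCase`.

```typescript
type CaseLocaleSketch = 'tr' | 'az' | 'lt';

// Illustrative only: lowercase with optional locale-specific rules.
function lowerCaseSketch(input: string, locale?: CaseLocaleSketch): string {
  return locale === undefined ? input.toLowerCase() : input.toLocaleLowerCase(locale);
}

// '\u0130' is İ (capital I with dot above).
const withTurkish = lowerCaseSketch('\u0130stanbul', 'tr');
const withoutLocale = lowerCaseSketch('\u0130stanbul');

console.log(withTurkish); // istanbul
console.log(Array.from(withoutLocale).length); // 9 (i + combining dot U+0307 + "stanbul")
console.log(lowerCaseSketch('I\u011EDIR', 'tr')); // ığdır
```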

`suggest(query: string, candidates: string[], options?): Suggestion[]`

Get ranked suggestions with detailed scoring.

const suggestions = suggest('pythn', ['python', 'java', 'javascript'], {
  metric: 'jaroWinkler',
  minScore: 0.6,
  maxSuggestions: 3,
});
// [
//   { value: 'python', score: 0.9555, ... },
//   ...
// ]

See Suggestions API docs for full details.

Implementation Details

WASM vs TypeScript

This library uses a hybrid approach for optimal performance and flexibility:

WASM Implementations (fastest):

Core distance metrics: levenshtein, damerau_levenshtein, osa_distance, jaro, jaro_winkler
RapidFuzz metrics: ratio, indel_*, lcs_seq_*

TypeScript Implementations (flexible):

Token-based fuzzy matching: partialRatio, tokenSortRatio, tokenSetRatio
Process helpers: extractOne, extract
Unified API: distance(), score()
Suggestions and normalization

Token-based metrics benefit from TypeScript's array operations and avoid WASM serialization overhead. The unified API provides a convenient abstraction over both WASM and TypeScript implementations.

Supported Runtimes

Node.js 16+ (ESM and CommonJS)
Bun (native ESM support)
Deno (use npm: specifier)

Building from Source

Install dependencies and tooling: make bootstrap
Build WASM: npm run build:wasm or make build
Build TS: npm run build:ts

Development

This project uses a Makefile for common tasks:

make help           # Show all available targets
make build          # Build WASM and TypeScript (with version check)
make test           # Run tests
make clean          # Remove build artifacts

# Code quality
make quality        # Run all quality checks (format-check, lint, rust checks)
make format         # Format all code (Biome + Prettier + rustfmt)
make format-check   # Check formatting without changes
make lint           # Lint TypeScript code with Biome
make lint-fix       # Lint and auto-fix TypeScript code

# Version management
make version-check  # Verify package.json and Cargo.toml versions match
make bump-patch     # Bump patch version (0.1.0 -> 0.1.1)
make bump-minor     # Bump minor version (0.1.0 -> 0.2.0)
make bump-major     # Bump major version (0.1.0 -> 1.0.0)
make set-version VERSION=x.y.z  # Set explicit version

Explore the rest of the documentation under docs/. Start with the high-level overview or jump straight to the contributor guide in docs/development.md.

Code Quality Tools

This project uses modern, fast tooling for code quality:

TypeScript/JavaScript: Biome for linting and formatting
JSON/YAML/Markdown: Prettier for formatting
Rust: rustfmt for formatting, clippy for linting

Run make quality before committing to ensure all checks pass.

Version Management

This project maintains version sync between package.json (npm) and Cargo.toml (Rust). The Makefile provides targets to bump versions and keep them in sync. Additionally, the test suite includes a version consistency check that will fail if versions drift.

Important: Always use make bump-* or make set-version commands to update versions. This ensures both files stay synchronized.
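A version consistency check of this kind could look roughly like the sketch below. It is not the project's actual test: the function names are hypothetical, the file contents are passed in as strings rather than read from disk, and the Cargo.toml parsing is a deliberately small regex scan of the [package] table rather than a full TOML parser.

```typescript
// Illustrative only: find the version key inside the [package] table of a Cargo.toml.
function cargoPackageVersion(cargoToml: string): string | null {
  let inPackage = false;
  for (const rawLine of cargoToml.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith('[')) {
      inPackage = line === '[package]';
      continue;
    }
    if (!inPackage) continue;
    const match = /^version\s*=\s*"([^"]+)"/.exec(line);
    if (match) return match[1];
  }
  return null;
}

// Illustrative only: true when package.json and Cargo.toml declare the same version.
function versionsMatch(packageJson: string, cargoToml: string): boolean {
  const pkg = JSON.parse(packageJson) as { version?: unknown };
  const cargoVersion = cargoPackageVersion(cargoToml);
  return typeof pkg.version === 'string' && pkg.version === cargoVersion;
}

// Placeholder file contents for demonstration.
const demoPackageJson = '{ "name": "demo", "version": "1.2.3" }';
const demoCargoToml = [
  '[package]',
  'name = "demo"',
  'version = "1.2.3"',
  '',
  '[dependencies]',
  'other = { version = "9.9.9" }',
].join('\n');

console.log(versionsMatch(demoPackageJson, demoCargoToml)); // true
```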

Performance

In the project's benchmarks, every string comparison operation completes in well under 1 ms:

WASM metrics: 0.0003-0.0005ms per operation
Token-based metrics: 0.0003-0.0017ms per operation
Process helpers: 0.0008-0.001ms per operation
Unified API: minimal dispatch overhead

Run node benchmark-phase1b.js for detailed benchmarks.
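Per-operation figures like these are usually obtained by timing many iterations and dividing by the count. The helper below is a generic sketch of that approach, not the contents of the benchmark script: `timePerOpMs` is a hypothetical name, and it assumes a runtime that exposes the global `performance` object (as recent Node.js, Bun, and Deno versions do).

```typescript
// Illustrative only: average wall-clock time per call, in milliseconds.
function timePerOpMs(fn: () => unknown, iterations = 100_000): number {
  // Warm up so one-time JIT compilation is less likely to dominate the measurement.
  for (let i = 0; i < 1_000; i++) fn();
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  return (performance.now() - start) / iterations;
}

// Stand-in workload; swap in a call to any metric you want to measure.
const perOp = timePerOpMs(() => 'kitten'.localeCompare('sitting'));
console.log(`${perOp.toFixed(6)} ms per operation`);
```

Results vary by machine, runtime, and input length, so treat any single run as indicative rather than exact.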

Testing

This project includes comprehensive test coverage:

119 unit tests covering all functions
80 YAML fixture test cases for reproducibility
100% regression-free across all releases

Run tests with npm test or make test.

Related Projects

rapidfuzz-rs - Rust implementation of RapidFuzz
rapidfuzz - Original Python implementation
strsim-rs - String similarity metrics (deprecated in favor of rapidfuzz-rs)

Versioning

This project follows Semantic Versioning. Version history is maintained in CHANGELOG.md.

Current Status: See latest release for the current version and changes.

License

This project is licensed under the MIT License.

Contributing

Contributions welcome! Please see our contributing guidelines:

Development setup: docs/development.md
Release workflow (maintainers): docs/publishing.md

Governance

Authoritative policies repository: https://github.com/3leaps/oss-policies/
Code of Conduct: https://github.com/3leaps/oss-policies/blob/main/CODE_OF_CONDUCT.md
Security Policy: https://github.com/3leaps/oss-policies/blob/main/SECURITY.md
Contributing Guide: https://github.com/3leaps/oss-policies/blob/main/CONTRIBUTING.md

⚡ Fast Strings. Accurate Matches. ⚡

High-performance text similarity for modern TypeScript applications

Built with ⚡ by the 3 Leaps team

String Metrics • Fuzzy Matching • WASM Performance
