shingles
v0.1.1
Published
A WebAssembly library for efficient text deduplication and similarity detection using shingling and MinHash.
Downloads
161
Readme
shingles
Fast, privacy-aware client-side near-duplicate detection in the browser
A lightweight Rust library compiled to WebAssembly (via wasm-bindgen) for computing MinHash signatures and Locality-Sensitive Hashing (LSH) indices to find similar/duplicate texts entirely in the browser, with optional salting for privacy-preserving similarity checks.
No server needed. Runs offline. Perfect for PWAs, note-taking apps, plagiarism checkers, private document comparison, or client-side data cleaning.
✨ Features
- MinHash signatures for estimating Jaccard similarity between texts
- Locality-Sensitive Hashing (LSH) with banding for fast candidate duplicate detection
- Privacy-preserving mode: Optional per-document salt for blind similarity comparison (same salt → comparable signatures :-> different salt → no leakage)
- Efficient shingling (k-grams on bytes, UTF-8 safe)
- Zero-copy slices where possible for fast WASM ↔ JS interop
- Small footprint WASM binary stays tiny (~500KB–1MB gzipped)
- Offline-first works in browsers, PWAs, even low-end devices
Installation
Via npm (recommended for web projects)
npm install shingles
# or
yarn add shingles
# or
pnpm add shinglesUsage (JavaScript / TypeScript)
shingles is designed to be used in a background thread (Web Worker) for large datasets, but the API is simple enough for main-thread use on smaller collections.
import init, { Deduplicator } from 'shingles';
async function main() {
await init();
// 128 hashes divided into 16 bands (Threshold ~0.7-0.8)
// use a BigInt for the seed (n suffix required in JS)
const seed = 42n;
const dedup = new Deduplicator(128, 16, seed);
const privacySalt = "user-secret-key-123";
// add documents
const docs = [
{ id: 1, text: "The quick brown fox jumps over the lazy dog." },
{ id: 2, text: "The quick brown fox leaps over the lazy dog!" }, // near dupe
{ id: 3, text: "I love coding in Rust and WebAssembly." } // different
];
for (const doc of docs) {
const shingless = Deduplicator.get_shingless(doc.text, 5);
const signature = dedup.compute_minhash(shingless, privacySalt);
dedup.add_document(doc.id, signature);
}
// query for near-duplicates
const queryText = "A quick brown fox jumps over a lazy dog.";
const queryshingless = Deduplicator.get_shingless(queryText, 5);
const querySig = dedup.compute_minhash(queryshingless, privacySalt);
// LSH finds candidates in O(1) time instead of O(N)
const candidates = dedup.query_candidates(querySig);
console.log("Candidate IDs:", candidates); // Output: [1, 2]
// fine-grained similarity check
// compare the query against the most likely candidate (ID: 1)
const sig1 = dedup.compute_minhash(Deduplicator.get_shingless(docs[0].text, 5), privacySalt);
const similarity = dedup.estimate_jaccard(querySig, sig1);
console.log(`Estimated Similarity: ${(similarity * 100).toFixed(2)}%`);
}
main();Why Privacy Salt?
When two parties use the same secret salt, their MinHash signatures become comparable without revealing the original text.
Different salts produce completely incomparable signatures → ideal for secure, private similarity checks (e.g., two users compare notes without sending plaintext).
Performance
- Tested on 1,000–5,000 short documents: signature computation + indexing < 500ms (browser, mid-range device)
- Scales to 10k+ docs with Web Workers (recommended for heavy use)
- Much faster than pure JS alternatives for large texts
Tuning the "Sensitivity" (S-Curve)
The relationship between Bands, Rows, and Similarity follows an S-curve. By adjusting the parameters in new Deduplicator(hashes, bands, seed), you control how "fuzzy" the matching is. | Goal | Hashes | Bands | Threshold (T) | |-------------|--------|-------|---------------| | Strict | 128 | 8 | ≈ 0.84 | | Balanced | 128 | 16 | ≈ 0.70 | | Permissive | 128 | 32 | ≈ 0.55 |
Formula: $$T≈(1/b)1/r$$ where b is bands and r is rows per band
Building from Source
# Install wasm-pack if needed
cargo install wasm-pack
# Build for web target
wasm-pack build --target webRoadmap
- Web Worker helpers for non-blocking heavy operations
- Exact similarity (Jaccard on shingles) only on candidates
- Streaming/large-text support
- Advanced PSI extensions for stronger privacy
- TypeScript types & better docs
Contributing
Contributions welcome! Feel free to open issues or PRs.
- Fork the repo
cargo build/wasm-pack build- Make changes
- Submit PR
License
MIT © Preetham Pemmasani
If you use this in a project, star the repo and let me know — I'd love to hear about it!
Follow me on X: @ppmpreetham
