anchor-fingerprint
v0.1.0
Published
64-bit SimHash for text deduplication with Hamming distance calculations
Maintainers
Readme
anchor-fingerprint
64-bit SimHash for text deduplication with Hamming distance calculations
Features
- 64-bit SimHash: Near-duplicate detection for text
- Hamming Distance: Efficient bit-difference counting (uses POPCNT instruction)
- Similarity Scoring: 0.0 to 1.0 similarity metric
- Unicode Support: Handles international text correctly
- Zero Dependencies: Only uses
murmur3for token hashing - Optional SIMD: Enable the
simdfeature for potential acceleration
Performance
| Operation | Target | Measured |
|-----------|--------|----------|
| SimHash (50 chars) | ≤1ms | See cargo bench |
| SimHash (500 chars) | ≤2ms | See cargo bench |
| Hamming Distance | ≥4M ops/sec | See cargo bench |
Quick Start
use anchor_fingerprint::{simhash, hamming_distance, similarity};
// Fingerprint some text
let hash1 = simhash("The quick brown fox jumps over the lazy dog");
let hash2 = simhash("The quick brown fox leaps over the lazy dog");
// Compute distance (number of differing bits)
let dist = hamming_distance(hash1, hash2);
println!("Hamming distance: {}", dist);
// Compute similarity (0.0 to 1.0)
let sim = similarity(hash1, hash2);
println!("Similarity: {:.2}", sim);API
simhash(text: &str) -> u64
Compute the 64-bit SimHash fingerprint of a text string.
let hash = simhash("Hello, world!");
println!("Fingerprint: {:016x}", hash);hamming_distance(a: u64, b: u64) -> u32
Compute the number of differing bits between two fingerprints.
let hash1 = simhash("Hello");
let hash2 = simhash("Hello");
assert_eq!(hamming_distance(hash1, hash2), 0);similarity(a: u64, b: u64) -> f32
Compute similarity score between two fingerprints (0.0 to 1.0).
let hash1 = simhash("Hello");
let hash2 = simhash("Hello");
assert_eq!(similarity(hash1, hash2), 1.0);tokenize(text: &str) -> Vec<String>
Tokenize text into lowercase words (public for custom pipelines).
let tokens = tokenize("Hello, World! 123");
assert_eq!(tokens, vec!["hello", "world", "123"]);simhash_with_tokens(tokens: &[String]) -> u64
Compute SimHash from pre-tokenized words.
let tokens = vec!["hello".to_string(), "world".to_string()];
let hash = simhash_with_tokens(&tokens);Installation
Add to your Cargo.toml:
[dependencies]
anchor-fingerprint = "0.1.0"Or install via cargo:
cargo add anchor-fingerprintUsage Examples
Deduplication
use anchor_fingerprint::{simhash, hamming_distance};
fn is_duplicate(existing: &[u64], new_text: &str, threshold: u32) -> bool {
let new_hash = simhash(new_text);
existing.iter().any(|&hash| hamming_distance(hash, new_hash) < threshold)
}
// Usage
let existing_hashes = vec![/* ... */];
if is_duplicate(&existing_hashes, "New document text", 5) {
println!("This is a near-duplicate!");
}Similarity Search
use anchor_fingerprint::{simhash, similarity};
fn find_most_similar(query: &str, candidates: &[(&str, u64)]) -> Option<&str> {
let query_hash = simhash(query);
candidates
.iter()
.max_by(|a, b| {
let sim_a = similarity(query_hash, a.1);
let sim_b = similarity(query_hash, b.1);
sim_a.partial_cmp(&sim_b).unwrap()
})
.map(|(text, _)| *text)
}Batch Processing
use anchor_fingerprint::simhash;
fn fingerprint_documents(docs: &[&str]) -> Vec<u64> {
docs.iter().map(|&doc| simhash(doc)).collect()
}Algorithm
The SimHash algorithm works as follows:
- Tokenize: Split input into lowercase words
- Hash tokens: Compute 64-bit MurmurHash3 for each token
- Accumulate: For each bit position (0-63), add +1 if token hash bit is 1, -1 if 0
- Finalize: Bit i in fingerprint is 1 if accumulator[i] > 0, else 0
This produces a fingerprint where similar texts have similar bit patterns (small Hamming distance).
Benchmarks
Run benchmarks with:
cargo benchSample output:
simhash_short_50_chars time: [500 ns 520 ns 540 ns]
simhash_medium_500_chars time: [1.5 µs 1.6 µs 1.7 µs]
simhash_long_2000_chars time: [5.0 µs 5.2 µs 5.4 µs]
hamming_distance time: [0.3 ns 0.4 ns 0.5 ns] (~2B ops/sec)Testing
cargo test --all-featuresCoverage report:
cargo tarpaulin --out HtmlLicense
AGPL-3.0 - See LICENSE for details.
Contributing
- Read the specification
- Follow code style
- Write tests per testing standards
- Submit a PR
Acknowledgments
- SimHash algorithm: Moses Charikar (1997)
- MurmurHash3: Austin Appleby
