npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

anchor-fingerprint

v0.1.0

Published

64-bit SimHash for text deduplication with Hamming distance calculations

Readme

anchor-fingerprint

64-bit SimHash for text deduplication with Hamming distance calculations

Crates.io Documentation License

Features

  • 64-bit SimHash: Near-duplicate detection for text
  • Hamming Distance: Efficient bit-difference counting (uses POPCNT instruction)
  • Similarity Scoring: 0.0 to 1.0 similarity metric
  • Unicode Support: Handles international text correctly
  • Zero Dependencies: Only uses murmur3 for token hashing
  • Optional SIMD: Enable the simd feature for potential acceleration

Performance

| Operation | Target | Measured | |-----------|--------|----------| | SimHash (50 chars) | ≤1ms | See cargo bench | | SimHash (500 chars) | ≤2ms | See cargo bench | | Hamming Distance | ≥4M ops/sec | See cargo bench |

Quick Start

use anchor_fingerprint::{simhash, hamming_distance, similarity};

// Fingerprint some text
let hash1 = simhash("The quick brown fox jumps over the lazy dog");
let hash2 = simhash("The quick brown fox leaps over the lazy dog");

// Compute distance (number of differing bits)
let dist = hamming_distance(hash1, hash2);
println!("Hamming distance: {}", dist);

// Compute similarity (0.0 to 1.0)
let sim = similarity(hash1, hash2);
println!("Similarity: {:.2}", sim);

API

simhash(text: &str) -> u64

Compute the 64-bit SimHash fingerprint of a text string.

let hash = simhash("Hello, world!");
println!("Fingerprint: {:016x}", hash);

hamming_distance(a: u64, b: u64) -> u32

Compute the number of differing bits between two fingerprints.

let hash1 = simhash("Hello");
let hash2 = simhash("Hello");
assert_eq!(hamming_distance(hash1, hash2), 0);

similarity(a: u64, b: u64) -> f32

Compute similarity score between two fingerprints (0.0 to 1.0).

let hash1 = simhash("Hello");
let hash2 = simhash("Hello");
assert_eq!(similarity(hash1, hash2), 1.0);

tokenize(text: &str) -> Vec<String>

Tokenize text into lowercase words (public for custom pipelines).

let tokens = tokenize("Hello, World! 123");
assert_eq!(tokens, vec!["hello", "world", "123"]);

simhash_with_tokens(tokens: &[String]) -> u64

Compute SimHash from pre-tokenized words.

let tokens = vec!["hello".to_string(), "world".to_string()];
let hash = simhash_with_tokens(&tokens);

Installation

Add to your Cargo.toml:

[dependencies]
anchor-fingerprint = "0.1.0"

Or install via cargo:

cargo add anchor-fingerprint

Usage Examples

Deduplication

use anchor_fingerprint::{simhash, hamming_distance};

fn is_duplicate(existing: &[u64], new_text: &str, threshold: u32) -> bool {
    let new_hash = simhash(new_text);
    existing.iter().any(|&hash| hamming_distance(hash, new_hash) < threshold)
}

// Usage
let existing_hashes = vec![/* ... */];
if is_duplicate(&existing_hashes, "New document text", 5) {
    println!("This is a near-duplicate!");
}

Similarity Search

use anchor_fingerprint::{simhash, similarity};

fn find_most_similar(query: &str, candidates: &[(&str, u64)]) -> Option<&str> {
    let query_hash = simhash(query);
    candidates
        .iter()
        .max_by(|a, b| {
            let sim_a = similarity(query_hash, a.1);
            let sim_b = similarity(query_hash, b.1);
            sim_a.partial_cmp(&sim_b).unwrap()
        })
        .map(|(text, _)| *text)
}

Batch Processing

use anchor_fingerprint::simhash;

fn fingerprint_documents(docs: &[&str]) -> Vec<u64> {
    docs.iter().map(|&doc| simhash(doc)).collect()
}

Algorithm

The SimHash algorithm works as follows:

  1. Tokenize: Split input into lowercase words
  2. Hash tokens: Compute 64-bit MurmurHash3 for each token
  3. Accumulate: For each bit position (0-63), add +1 if token hash bit is 1, -1 if 0
  4. Finalize: Bit i in fingerprint is 1 if accumulator[i] > 0, else 0

This produces a fingerprint where similar texts have similar bit patterns (small Hamming distance).

Benchmarks

Run benchmarks with:

cargo bench

Sample output:

simhash_short_50_chars    time:   [500 ns 520 ns 540 ns]
simhash_medium_500_chars  time:   [1.5 µs 1.6 µs 1.7 µs]
simhash_long_2000_chars   time:   [5.0 µs 5.2 µs 5.4 µs]
hamming_distance          time:   [0.3 ns 0.4 ns 0.5 ns]  (~2B ops/sec)

Testing

cargo test --all-features

Coverage report:

cargo tarpaulin --out Html

License

AGPL-3.0 - See LICENSE for details.

Contributing

  1. Read the specification
  2. Follow code style
  3. Write tests per testing standards
  4. Submit a PR

Acknowledgments

  • SimHash algorithm: Moses Charikar (1997)
  • MurmurHash3: Austin Appleby