
@arclabs561/rank-refine

v0.7.36


rank-refine

SIMD-accelerated similarity scoring for vector search and RAG. Provides MaxSim (ColBERT/ColPali), cosine similarity, diversity selection (MMR, DPP), token pooling, token-level alignment/highlighting, and Matryoshka refinement. Supports both text (ColBERT) and multimodal (ColPali) late interaction.

```sh
cargo add rank-refine
```

Why Late Interaction?

Dense retrieval encodes each document as a single vector. This works for broad matching but loses token-level alignment.

Problem: A query like "capital of France" might match a document about "France's economic capital" with high similarity, even if it never mentions "capital" in the geographic sense.

Solution: Late interaction (ColBERT-style) keeps one vector per token instead of pooling. At query time, each query token finds its best-matching document token, then we sum those matches.

```
Dense:            "the quick brown fox" → [0.1, 0.2, ...]              (1 vector)
Late interaction: "the quick brown fox" → [[...], [...], [...], [...]] (4 vectors)
```

This preserves token-level semantics that single-vector embeddings lose. Useful for reranking in RAG pipelines.

What This Is

Scoring primitives for retrieval systems:

| You need | This crate provides |
|----------|---------------------|
| Score pre-computed embeddings | cosine, dot, maxsim |
| ColBERT/late interaction | maxsim_vecs, maxsim_batch |
| Token-level alignment/highlighting | maxsim_alignments, highlight_matches |
| Batch alignment/highlighting | maxsim_alignments_batch, highlight_matches_batch |
| Alignment utilities | top_k_alignments, filter_alignments, alignment_stats |
| Diversity selection | mmr_cosine, dpp |
| Compress token embeddings | pool_tokens, pool_tokens_adaptive |
| Two-stage refinement | matryoshka::refine |

What this is NOT: embedding generation, model weights, or storage systems. This crate scores embeddings you provide; it does not run model inference. See fastembed-rs for inference. Trait-based interfaces available for custom models (see crossencoder module).

Usage

```rust
use rank_refine::simd::{cosine, maxsim_vecs};

// Dense similarity
let score = cosine(&query_embedding, &doc_embedding);

// Late interaction (ColBERT)
let score = maxsim_vecs(&query_tokens, &doc_tokens);
```

Realistic Example

```rust
use rank_refine::colbert;

// Query: "capital of France" (32 tokens, 128-dim embeddings)
let query = vec![
    vec![0.12, -0.45, 0.89, ...],  // "capital" token
    vec![0.34, 0.67, -0.23, ...],  // "of" token
    vec![0.78, -0.12, 0.45, ...],  // "France" token
    // ... 29 more tokens
];

// Document: "Paris is the capital of France" (100 tokens)
let doc = vec![
    vec![0.11, -0.44, 0.90, ...],  // "Paris" token
    vec![0.35, 0.66, -0.24, ...],  // "is" token
    // ... 98 more tokens
];

// MaxSim finds the best match for each query token
let score = colbert::maxsim_vecs(&query, &doc);
// "capital" matches "capital" (0.95), "France" matches "France" (0.92)
// Score = 0.95 + 0.92 + ... (sum of best matches per query token)
```

API

Similarity (SIMD-accelerated)

| Function | Input | Notes |
|----------|-------|-------|
| cosine(a, b) | &[f32] | Normalized, -1 to 1 |
| dot(a, b) | &[f32] | Unnormalized |
| maxsim(q, d) | &[&[f32]] | Sum of max similarities |
| maxsim_cosine(q, d) | &[&[f32]] | Cosine variant |
| maxsim_weighted(q, d, w) | &[&[f32]], &[f32] | Per-token weights |
| maxsim_batch(q, docs) | &[Vec<f32>], &[Vec<Vec<f32>>] | Batch scoring |
| maxsim_alignments(q, d) | &[&[f32]] | Token-level alignments |
| maxsim_alignments_batch(q, docs) | &[Vec<f32>], &[Vec<Vec<f32>>] | Batch alignments |
| highlight_matches(q, d, t) | &[&[f32]], f32 | Highlighted token indices |
| highlight_matches_batch(q, docs, t) | &[Vec<f32>], &[Vec<Vec<f32>>], f32 | Batch highlights |
| top_k_alignments(aligns, k) | &[(usize, usize, f32)], usize | Top-k alignments |
| filter_alignments(aligns, min) | &[(usize, usize, f32)], f32 | Filter by score |
| alignment_stats(aligns) | &[(usize, usize, f32)] | Min/max/mean/sum stats |

Token Pooling

| Function | Tokens Kept | Notes |
|----------|-------------|-------|
| pool_tokens(t, 2) | 50% | Safe default |
| pool_tokens(t, 4) | 25% | Use hierarchical feature |
| pool_tokens_adaptive(t, f) | varies | Auto-selects greedy vs ward |
| pool_tokens_with_protected(t, f, n) | varies | Keeps first n tokens unpooled |

Diversity

| Function | Algorithm |
|----------|-----------|
| mmr_cosine(candidates, embeddings, config) | Maximal Marginal Relevance |
| dpp(candidates, embeddings, config) | Determinantal Point Process |

Token Alignment & Highlighting

| Function | Purpose |
|----------|---------|
| maxsim_alignments(query, doc) | Get (query_idx, doc_idx, score) alignment pairs |
| highlight_matches(query, doc, threshold) | Extract highlighted doc token indices |
| colbert::alignments(query, doc) | Convenience wrapper for alignments |
| colbert::highlight(query, doc, threshold) | Convenience wrapper for highlighting |

Utilities

| Function | Purpose |
|----------|---------|
| normalize_maxsim(score, qlen) | Scale to [0,1] |
| softmax_scores(scores) | Probability distribution |
| top_k_indices(scores, k) | Top-k by score |
| blend(a, b, α) | Linear interpolation |
| idf_weights(doc_freqs, total_docs) | IDF weighting for query tokens |
| bm25_weights(doc_freqs, query_freqs, total_docs, k1) | BM25-style weighting |
| extract_snippet_indices(alignments, context, max) | Extract text snippet indices |
| patches_to_regions(patch_indices, w, h, patches) | Convert patches to image regions |

How It Works

MaxSim (Late Interaction)

MaxSim scores token-level alignment. For each query token, find its best-matching document token, then sum:

$$\text{score}(Q, D) = \sum_{i=1}^{|Q|} \max_{j=1}^{|D|} (q_i \cdot d_j)$$

where $|Q|$ is the number of query tokens and $|D|$ is the number of document tokens.
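In code, the sum-of-max formula is just two nested reductions. A minimal plain-Rust reference sketch (`maxsim_ref` is a hypothetical name, independent of this crate's SIMD implementation):

```rust
/// Reference MaxSim: for each query token, take the maximum dot product
/// over all document tokens, then sum those maxima.
fn maxsim_ref(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```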

Visual example:

```
Query tokens:     [q1]  [q2]  [q3]
                    \    |    /
                     \   |   /     Each query token searches
                      v  v  v      for its best match
Document tokens:  [d1] [d2] [d3] [d4]
                   ↑         ↑
                  0.9       0.8    (best matches)

MaxSim = 0.9 + 0.8 + ... (sum of best matches)
```

Example: Query "capital of France" vs document "Paris is the capital of France" (6 tokens), considering the two content tokens "capital" and "France":

  • Query token "capital" finds its best match: dot("capital", "capital") = 0.95
  • Query token "France" finds its best match: dot("France", "France") = 0.92
  • MaxSim contribution = 0.95 + 0.92 = 1.87

This captures token-level alignment: "capital" and "France" both have strong matches, even if they appear in different parts of the document. Single-vector embeddings average these signals and lose precision.

Token-level alignment and highlighting: Unlike single-vector embeddings, ColBERT can show exactly which document tokens match each query token. Use maxsim_alignments() to get alignment pairs, or highlight_matches() to extract highlighted token indices for snippet extraction.
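As a sketch of what alignment extraction computes, each query token's argmax over document tokens yields one (query_idx, doc_idx, score) triple. This mirrors the output shape of maxsim_alignments() but is a hypothetical reference, not the crate's SIMD implementation:

```rust
/// Sketch of ColBERT-style alignment extraction: for each query token,
/// record (query_idx, doc_idx, score) of its best-matching doc token.
/// Assumes a non-empty document.
fn alignments_ref(query: &[Vec<f32>], doc: &[Vec<f32>]) -> Vec<(usize, usize, f32)> {
    query
        .iter()
        .enumerate()
        .map(|(qi, q)| {
            let (di, score) = doc
                .iter()
                .enumerate()
                .map(|(di, d)| (di, q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>()))
                .max_by(|a, b| a.1.total_cmp(&b.1))
                .unwrap();
            (qi, di, score)
        })
        .collect()
}
```

Thresholding these triples (keep alignments above some score) gives the highlighted document token indices described above.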

Multimodal support (ColPali): The same alignment functions work for vision-language retrieval. In ColPali-style systems, image patches are treated as "tokens"—query text tokens align with image patch embeddings. This enables visual snippet extraction: identify which image regions (patches) are relevant to a query, then extract those regions as visual snippets for display. Use patches_to_regions() to convert patch indices to pixel coordinates.

Query augmentation with [MASK] tokens: ColBERT uses [MASK] tokens for soft query expansion. These tokens are added during encoding and should be weighted lower (typically 0.2-0.4) than original query tokens when using maxsim_weighted(). See examples/mask_token_weighting.rs for a complete example.

IDF and BM25 weighting: Use idf_weights() or bm25_weights() to compute importance weights for query tokens based on document frequency. These can be passed to maxsim_weighted() to boost rare terms and improve retrieval quality by ~2-5%.

When to use:

  • Second-stage reranking (after dense retrieval)
  • Precision-critical applications (legal, medical)
  • Queries with multiple important terms

When not to use:

  • First-stage retrieval (too slow for millions of docs)
  • Storage-constrained (typically 10-50x larger than dense, depends on document length)

MMR (Diversity)

Problem: Top-k by relevance returns near-duplicates. A search for "async programming" might return 10 Python asyncio tutorials instead of examples in Python, Rust, JavaScript, and Go.

Solution: Balance relevance with diversity by penalizing similarity to already-selected items:

$$\text{MMR}(d) = \lambda \cdot \text{rel}(d) - (1-\lambda) \cdot \max_{s \in S} \text{sim}(d, s)$$

where:

  • $\lambda$ is the relevance-diversity tradeoff parameter (range [0, 1])
  • $\text{rel}(d)$ is the relevance score of document $d$
  • $S$ is the set of already-selected documents
  • $\text{sim}(d, s)$ is the similarity between document $d$ and selected document $s$

Lambda parameter:

  • λ = 1.0: Pure relevance (equivalent to top-k)
  • λ = 0.5: Balanced (common default for RAG)
  • λ = 0.3: Strong diversity (exploration mode)
  • λ = 0.0: Maximum diversity (ignore relevance)
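The greedy selection loop behind the formula can be sketched in plain Rust. This is a hypothetical reference over pre-normalized embeddings (so dot product equals cosine), not the crate's mmr_cosine API:

```rust
/// Greedy MMR selection: repeatedly pick the candidate maximizing
/// lambda * relevance - (1 - lambda) * max similarity to already-selected.
/// Assumes embeddings are pre-normalized (dot = cosine).
fn mmr_ref(relevance: &[f32], emb: &[Vec<f32>], lambda: f32, k: usize) -> Vec<usize> {
    let dot = |a: &[f32], b: &[f32]| a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
    let mut selected: Vec<usize> = Vec::new();
    while selected.len() < k.min(relevance.len()) {
        let best = (0..relevance.len())
            .filter(|i| !selected.contains(i))
            .map(|i| {
                // Penalty: max similarity to anything already selected
                // (0.0 for the first pick, since nothing is selected yet).
                let max_sim = selected
                    .iter()
                    .map(|&s| dot(&emb[i], &emb[s]))
                    .fold(0.0_f32, f32::max);
                (i, lambda * relevance[i] - (1.0 - lambda) * max_sim)
            })
            .max_by(|a, b| a.1.total_cmp(&b.1))
            .unwrap()
            .0;
        selected.push(best);
    }
    selected
}
```

With two near-duplicate top candidates and λ = 0.5, the second pick skips the duplicate in favor of a less relevant but novel item.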

When to use:

  • RAG pipelines (diverse context helps LLMs)
  • Recommendation systems (avoid redundancy)
  • Search result diversification

When not to use:

  • Known-item search (single correct answer)
  • When relevance is paramount

Token Pooling

Storage: ColBERT stores one vector per token. For 10M documents with 100 tokens each:

  • Storage = 10M × 100 × 128 × 4 bytes = 512 GB (vs ~5 GB for dense: 10M × 128 × 4 bytes)

Token pooling: Cluster similar document tokens and store only cluster centroids:

```
Before:  [tok1] [tok2] [tok3] [tok4] [tok5] [tok6]  (6 vectors)
          similar--^       similar------^
After:   [mean(1,2)]  [tok3] [mean(4,5)] [tok6]     (4 vectors = 33% reduction)
```

Why it works: Many tokens are redundant. Function words cluster together, as do related content words. Merging similar tokens loses little discriminative information.
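A greedy version of this merging can be sketched as follows. This is illustrative only (the crate's pool_tokens uses its own greedy and Ward clustering); it repeatedly merges the most similar pair into its mean until the target count is reached:

```rust
/// Greedy token-pooling sketch: while more than len/factor tokens remain,
/// find the most similar pair (by dot product) and replace it with the
/// element-wise mean of the two vectors.
fn pool_tokens_ref(tokens: &[Vec<f32>], factor: usize) -> Vec<Vec<f32>> {
    let mut pooled: Vec<Vec<f32>> = tokens.to_vec();
    let target = (tokens.len() / factor).max(1);
    let dot = |a: &[f32], b: &[f32]| a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
    while pooled.len() > target {
        // Find the most similar pair of remaining tokens.
        let mut best = (0, 1, f32::NEG_INFINITY);
        for i in 0..pooled.len() {
            for j in (i + 1)..pooled.len() {
                let s = dot(&pooled[i], &pooled[j]);
                if s > best.2 {
                    best = (i, j, s);
                }
            }
        }
        // Merge j into i as the element-wise mean, then drop j.
        let merged: Vec<f32> = pooled[best.0]
            .iter()
            .zip(&pooled[best.1])
            .map(|(a, b)| (a + b) / 2.0)
            .collect();
        pooled[best.0] = merged;
        pooled.remove(best.1);
    }
    pooled
}
```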

| Factor | Tokens Kept | MRR@10 Loss | When to Use |
|--------|-------------|-------------|-------------|
| 2 | 50% | 0.1–0.3% | Default choice |
| 3 | 33% | 0.5–1.0% | Good tradeoff |
| 4 | 25% | 1.5–3.0% | Storage-constrained (use hierarchical feature) |
| 8+ | 12% | 5–10% | Extreme storage constraints only |

Numbers from MS MARCO dev (Clavie et al., 2024). Pool at index time. Re-score with unpooled embeddings at query time if needed.

When pooling hurts more:

  • Long documents with many distinct concepts
  • Queries that need to distinguish between similar tokens
  • Short passages (less redundancy to exploit)

Benchmarks

Measured on Apple M3 Max with cargo bench:

| Operation | Dim | Time |
|-----------|-----|------|
| dot | 128 | 13ns |
| dot | 768 | 126ns |
| cosine | 128 | 40ns |
| cosine | 768 | 380ns |
| maxsim | 32q×128d×128dim | 49μs |
| maxsim (pooled 2x) | 32q×64d×128dim | 25μs |
| maxsim (pooled 4x) | 32q×32d×128dim | 12μs |

These timings enable real-time reranking of 100-1000 candidates.

Vendoring

If you prefer not to add a dependency:

  • src/simd.rs is self-contained (~600 lines)
  • AVX2+FMA / NEON with portable fallback
  • No dependencies

Quick Decision Guide

What problem are you solving?

  1. Scoring embeddings → Use cosine or dot

    • Dense embeddings: cosine(&query, &doc)
    • Pre-normalized: dot(&query, &doc)
  2. Reranking with token-level precision → Use maxsim_vecs

    • After dense retrieval (100-1000 candidates)
    • ColBERT-style token embeddings
    • Precision-critical applications
  3. Diversity selection → Use mmr_cosine or dpp

    • RAG pipelines (diverse context)
    • Recommendation systems
    • Search result diversification
  4. Compressing token embeddings → Use pool_tokens

    • Storage-constrained deployments
    • Factor 2-3 recommended (50-66% reduction)
  5. Two-stage refinement with Matryoshka → Use matryoshka::refine

    • After coarse search using head dimensions
    • Refine candidates using tail dimensions
    • Matryoshka embeddings (hierarchical dimension packing)
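The two-stage Matryoshka pattern from item 5 can be sketched as: rank everything cheaply on the head dimensions, then re-score only the survivors on the full vectors. This is a hypothetical reference, not the matryoshka::refine API:

```rust
/// Two-stage Matryoshka refinement sketch: coarse ranking on the first
/// `head_dims` dimensions, then exact re-scoring of the top `k` on the
/// full vectors. Assumes every vector has at least `head_dims` entries.
fn matryoshka_refine_ref(
    query: &[f32],
    docs: &[Vec<f32>],
    head_dims: usize,
    k: usize,
) -> Vec<usize> {
    let dot = |a: &[f32], b: &[f32]| a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
    // Stage 1: cheap coarse ranking on head dimensions only.
    let mut coarse: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, dot(&query[..head_dims], &d[..head_dims])))
        .collect();
    coarse.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Stage 2: exact re-scoring of the top-k candidates with all dimensions.
    let mut refined: Vec<(usize, f32)> = coarse
        .into_iter()
        .take(k)
        .map(|(i, _)| (i, dot(query, &docs[i])))
        .collect();
    refined.sort_by(|a, b| b.1.total_cmp(&a.1));
    refined.into_iter().map(|(i, _)| i).collect()
}
```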

When not to use:

  • First-stage retrieval over millions of docs (use dense + ANN)
  • Very short documents (<10 tokens, little benefit over dense)
  • Latency-critical first-stage (dense embeddings are faster)
  • Storage-constrained without pooling (typically 10-50x larger than dense, depends on document length)

See REFERENCE.md for algorithm details and edge cases.

Multimodal Support (ColPali)

This crate supports both text-only (ColBERT) and multimodal (ColPali) late interaction:

  • Text-to-text: Query text tokens align with document text tokens (standard ColBERT)
  • Text-to-image: Query text tokens align with image patch embeddings (ColPali-style)

For ColPali systems, document images are split into patches (e.g., 32×32 grid = 1024 patches per page). Each patch becomes a "token" embedding. The same maxsim_alignments() and highlight_matches() functions work for both text and multimodal retrieval, enabling visual snippet extraction.

Visual snippet extraction: In ColPali, identifying which image patches match query tokens enables extracting visual regions (snippets) from document images. These snippets can be displayed to users showing exactly which parts of a document image are relevant to their query.
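The patch-to-pixel mapping behind this can be sketched for a single patch. This mirrors the idea of patches_to_regions() but is a hypothetical helper; the crate's actual signature may differ:

```rust
/// Map a patch index in a row-major grid to a pixel rectangle (x, y, w, h).
/// E.g. in a 32×32 grid over a 1024×1024 page, each patch covers 32×32 px.
fn patch_to_rect(
    patch_idx: usize,
    grid_w: usize,
    grid_h: usize,
    img_w: usize,
    img_h: usize,
) -> (usize, usize, usize, usize) {
    let pw = img_w / grid_w; // patch width in pixels
    let ph = img_h / grid_h; // patch height in pixels
    let col = patch_idx % grid_w;
    let row = patch_idx / grid_w;
    (col * pw, row * ph, pw, ph)
}
```

Running this over the highlighted patch indices yields the rectangles to crop as visual snippets.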

See examples/multimodal_alignment.rs for a demonstration.

Features

| Feature | Dependency | Purpose |
|---------|------------|---------|
| hierarchical | kodama | Ward's clustering (better at 4x+) |

License

MIT OR Apache-2.0