sweet-search

v2.6.0

Sweet Search - SOTA Hybrid Code Search Engine with WASM CatBoost Query Router, Semantic/Lexical/Structural Search, and Multilingual Support

Local code search for AI coding agents. Six fast, purpose-built tools that hand Claude Code, Codex & friends ranked answers, not raw grep. Zero API keys, 100% on-device.

Maybe grep isn't all you need… 🍬 Every coding agent today reaches for grep + Read by reflex. sweet-search challenges the narrative. 😎


✨ Highlights

  • Hybrid retrieval — one of the six tools uses BM25F lexical + dense semantic + structural graph signals, fused per query and reranked by late-interaction
  • Agent-native by design — token-budgeted output tiers, an optional MCP server (and default zero-overhead CLI), and a GEPA-evolved system prompt installed into Claude Code, Codex, Gemini CLI, and Cursor with one command
  • Indexed grep, ~10× faster than ripgrep — a sparse n-gram prefilter skips the files that provably can't match
  • ColBERT-style reranking, locally — per-token MaxSim late interaction on hand-written SIMD kernels
  • GPU-accelerated indexing — Apple Metal, CUDA, CoreML Neural Engine, or plain CPU via ORT; same engine, auto-selected
  • Never stale — incremental indexing keeps the index aligned with your working tree, uncommitted edits included
  • No storage hassle — indexed artifacts maximally optimized without any accuracy tradeoff; up to INT4 quantization
  • Local-first — all models run on-device; nothing is sent anywhere, ever. CPU-inference supported for all models

📚 Table of Contents

GET STARTED

  • 🚀 Quickstart: three commands to a searchable repo
  • 🖥️ Platform Support: macOS · Linux · WASM fallback

USE IT

  • 🧰 The Six Tools: search · grep · find · semantic · trace · read
  • 🧠 The Evolved Agent Prompt: GEPA-optimized search discipline
  • 🔌 Works With Your Agent: MCP · Claude Code · Codex · Gemini · Cursor

UNDER THE HOOD

  • ⚡ GPU-Accelerated Indexing: candle · fused kernels · cAST chunking
  • 🔄 An Index That Never Goes Stale: reconcile daemon tracks your working tree
  • 🦀 The Native Engine Room: four Rust crates + INT4 LI compression

THE RECEIPTS

  • 📊 Benchmarks: agent cost savings · engine speed · full-corpus MRR
  • 🧭 Where sweet-search Fits: honest wins & trade-offs vs peers
  • 🙏 Prior Art & Acknowledgements: the shoulders we stand on
  • 📄 License: Apache-2.0

🚀 Quickstart

npm install -g sweet-search

cd your-repo
sweet-search init     # one-time: downloads local models, wires up your agent
sweet-search index    # builds the index — GPU-accelerated where available

sweet-search "where do we validate JWT tokens?"

That's it. init is idempotent and SHA256-verifies every model binary; re-running it is always safe. From then on the index maintains itself — edit, save, search.

sweet-search init --wizard          # interactive: shows your hardware, recommends a model tier
sweet-search init --profile core    # lexical-only, no model downloads (CI-friendly)
sweet-search init --li-model edge   # compact late-interaction model for constrained machines
sweet-search uninstall              # clean removal: models, caches, config — never your code
  • Requirements: Node ≥ 18. macOS (arm64/x64) and Linux (x64/arm64) ship native binaries; other platforms fall back to WASM/JS automatically.
  • Footprint: CPU-only hosts download a few hundred MB of INT8 models; GPU hosts add ~1.2 GB of FP32 backbones (skipped automatically where they'd be useless); M3+ Macs can additionally fetch a ~3.2 GB CoreML cascade for Neural Engine acceleration. Everything lands in ~/.cache/sweet-search/models/ and is used strictly on-device.
  • Agent wiring: init injects the tool-routing system prompt into CLAUDE.md (and AGENTS.md, GEMINI.md, Cursor rules via flags), registers a session-start prewarm hook so your first query hits a warm daemon, and installs a /sweet-index skill in Claude Code.
  • What gets indexed: what you'd expect — .gitignore is respected, node_modules/build dirs/minified artifacts are denied, files over 1 MB skipped, with a .sweet-search-ignore for extra rules.

📊 Benchmarks

We measure sweet-search four ways — from how much it helps a real agent down to raw engine throughput:

🤖 Code-retrieval (agent-in-the-loop): Does it make a real coding agent cheaper and more useful when it searches your repo? Paired against each model's own grep-and-read loop.

🚧 Task-completion (coming soon): Does cheaper, denser context compound into a higher resolve-rate on multi-step engineering tasks? Harness in progress.

📄 Paper-type IR (academic): The standard NL→code retrieval suites (GCSN, M2CRB, CoSQA…), full-corpus MRR@10.

⚡ Engine speed: Raw systems numbers — grep throughput, query latency, rerank kernels, HNSW.


🤖 1. Code-retrieval benchmarks — the agent-in-the-loop test

We install the evolved agent prompt (the GEPA-evolved search discipline), point a coding agent at a real repo, and pair it probe-for-probe against the same model running its own native grep-and-read loop. Same model, same tasks, same judge — the only difference is whether sweet-search is wired in.

top-of-range figures · full per-harness ranges in the dropdown · 11 model×harness cells, paired, multiplicity-controlled

The headline, in four claims:

  • 💰 Cheaper where the agent thrashes — up to −34% realized cost on Codex; −18 to −32% across the GPT-5.5 / opencode / bare-API harnesses.
  • 🔧 Fewer round-trips — up to −56% tool calls, significant on 9 of 11 cells.
  • ✨ More useful per response: +0.18 to +0.31 on a 5-dimension usefulness score, and still denser when length-matched (significant on 8 of 11 cells).
  • 🎯 Accuracy held — and lifted on the weak — a statistical tie on flagship models (saturated at 0.94–0.99), and +3 pp (up to +8 pp out-of-distribution) on weaker models like GLM-5.1 and DeepSeek.

The win is harness-adaptive: where the native loop is disciplined (Claude Code) it shows up as denser, more useful context per token; where it thrashes (Codex floods 30k+ tokens of its own grep output into context) it shows up as a large cost and tool-call cut. Either way, final-answer accuracy never significantly regresses.

| 🧰 Native agent harness | 💰 Realized cost | 🔧 Tool calls | ✨ Useful content / response | 🎯 Final accuracy |
|---|---:|---:|---:|:--|
| 🤖 Codex (GPT-5.5) | −30 to −34% | −44 to −56% | +0.06 → +0.17 ↑ | tie (saturated) |
| 🐚 opencode (GPT-5.5 / GLM-5.1) | −18 to −22% | −15 to −49% | +0.23 to +0.31 ↑ | tie |
| 🔌 bare API (GPT-5.5 / GLM / DeepSeek) | −15 to −32% ᵃ | −15 to −33% | +0.08 to +0.24 ↑ | tie · +3 pp on weak models |
| 🟣 Claude Code (Sonnet / Opus) | −10% to +14% ᵇ | −5 to −33% | +0.18 to +0.29 ↑ | tie |

↑ "Useful content / response" is the per-response delta on a 5-dimension usefulness score (answer-grounding · workable-code · navigability · edit-locality · sufficiency), 0–1 scale. "tie" = final-answer correctness statistically indistinguishable (saturated in the 0.94–0.99 band on flagships).

ᵃ The two cheapest bare models cost fractions of a cent either way (GLM +27% of $0.008; DeepSeek −15% of $0.004).

ᵇ Opus −5/−10%; Sonnet +8–14%, which is ≈1¢ on a flat-rate subscription for a richer answer.

Denser, not just longer. The usefulness lift survives length-matching — comparing sweet-search and native responses of equal token length, sweet-search's content is significantly higher on 8 of 11 cells. The validated single-number usefulness composite (grounding × content × density) is significant on all 11 sealed cells.

  • What's being compared: the installed sweet-search agent prompt + tools vs. the same model using only its built-in file-reading and shell-grep tools. Not a different model — the same model, with and without sweet-search.
  • Design: 11 model×harness cells. Sealed vault (n=60/arm, the pre-registered primary) opened once; plus held-out (n=30) and out-of-distribution (n=40) sets for generalization. Stratified, fixed-seed splits.
  • Judging: 3-judge panel (DeepSeek-V4-flash + Gemini-3.1-flash-lite + MiniMax-M2.7), paired by probe, 20k-sample bootstrap CIs, Benjamini–Hochberg FDR multiplicity correction across each metric family. We report family-level survival counts, never a single cherry-picked cell.
  • What survives FDR (vault): useful-content 10/11, density-composite 11/11, length-matched content 8/11, fewer-tool-calls 9/11. Generalization (held-out + OOD): content 17–18/20, fewer calls 14/20.
  • The token fact that drives everything: sweet-search's footprint is nearly constant (~1.3k–3.3k tokens) because the tool responses are capped; native's footprint is whatever the model decides to grep — up to 37k tokens on Codex. That single fact is what drives the cost and tool-call gaps.
  • Honest caveats we keep attached: (1) accuracy ties on flagship models — it is not an accuracy win there, it's saturated; the accuracy gains are real only on weaker models. (2) The two weakest cells for length-matched density (Codex-low, DeepSeek) are correct-sign but underpowered — Codex's responses are so token-divergent that too few equal-length pairs exist to reach significance, and DeepSeek is simply under-powered. Those are honest non-victories, not wins.
  • Full methodology and per-cell tables: docs/PHASE7.md.
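
The two statistical steps named above (paired bootstrap CIs and Benjamini–Hochberg correction) can be sketched in a few lines. This is an illustrative sketch, not the project's evaluation harness: the function names, the percentile-style interval, and the per-probe `deltas` input are assumptions made for the example.

```python
import random

def paired_bootstrap_ci(deltas, n_boot=20_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-probe deltas.

    `deltas` holds (sweet-search score - native score) for each probe,
    so pairing is preserved by resampling whole deltas.
    """
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if it survives FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    survives = [False] * m
    for rank, i in enumerate(order, start=1):
        survives[i] = rank <= cutoff
    return survives
```

A "survival count" such as 9/11 then corresponds to `sum(benjamini_hochberg(p_values))` over the p-values of one metric family.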

🚧 2. Task-completion benchmarks — coming soon

Retrieval quality is necessary but not sufficient. Cheaper, denser context only matters if it compounds across a real, multi-step engineering task — finding the code, understanding it, changing it, and not breaking anything. The next suite measures exactly that: resolve-rate on SWE-bench-style multi-file tasks, sweet-search-wired vs. native, on the same paired, multiplicity-controlled bar as above. Harness and pilot are in progress — numbers land here when they clear that bar, and not before.


📄 3. Paper-type retrieval benchmarks — academic NL→code IR

Every number below is the ss-search pipeline end-to-end — the same binary you install — run against the full benchmark corpus (no 99-distractor shortcuts), zero-shot (we never fine-tune on these tasks). Where a benchmark's queries are docstrings, we strip the docstring out of the indexed code so the query can't trivially match itself — the standard retrieval protocol.

As of June 2026, we're SOTA on 3 of the 4 benchmarks we attempted, and at a harder setting than most published attempts: we rank against the full pool.

| 📚 Benchmark | 🔍 What it tests | # Queries | 📂 Pool | 🎯 MRR@10 | 🏆 SOTA? |
|-----------|---------------|---------:|---------:|--------:|--------:|
| 🌐 GenCodeSearchNet | NL→code, 6 languages | 6,000 | full 6,000 | 86.6 | YES ✅ |
| 🐍 CoSQA | web queries → Python | 500 | full 6,267 | 65.5 | ✅ (zero-shot) |
| 🗺️ M2CRB | multilingual NL→code (ES/PT/DE/FR → Py/Java/JS) | 5,795 | full 5,795 | 54.0 | YES ✅ |
| 🛡️ AdvTest | adversarial, identifier-obfuscated Python | 19,210 | full 19,210 | 51.4 | NO ❌ |

SOTA = best result we can find in the published literature as of June 2026; cross-metric/protocol comparisons are spelled out per benchmark below.
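
For readers unfamiliar with the metric, here is a minimal sketch of how MRR@10 is computed when each query has a single gold document. The function name and input shapes are assumptions for illustration; the table reports the same quantity scaled by 100.

```python
def mrr_at_k(ranked_ids_per_query, gold_id_per_query, k=10):
    """Mean reciprocal rank of the gold document, counting only the top k results."""
    total = 0.0
    for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == gold:
                total += 1.0 / rank  # a gold hit outside the top k contributes 0
                break
    return total / len(gold_id_per_query)

# Three queries: gold at rank 1, gold at rank 3, gold not retrieved.
rankings = [["a", "b"], ["x", "y", "z"], ["q"]]
gold = ["a", "z", "missing"]
print(round(mrr_at_k(rankings, gold), 4))  # (1 + 1/3 + 0) / 3
```

Ranking against the full corpus rather than 99 sampled distractors only changes what goes into each `ranked` list; the formula stays the same, which is why the full-pool setting is the harder one.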

🌐 GenCodeSearchNet → 86.6  ·  🏆 SOTA in June 2026

  • The BEST PUBLISHED number we can find, anywhere
  • The benchmark's own paper caps at MRR ≤ 0.42 for fine-tuned baselines (≤ 0.10 cross-lingual); even zero-shot OpenAI Ada-2 reaches 0.79–0.94 — but all of it against a tiny 99-distractor pool.
  • We score 0.866 against the entire 6,000-document corpus, a strictly harder setting, and zero-shot. 🔥

🐍 CoSQA → 65.5  ·  🥇 Zero-shot SOTA in June 2026

  • Beats EVERY PUBLISHED zero-shot model
  • Canonical setup: 500 real web queries → the fixed 6,267-code database, no fine-tuning.
  • Clears the strongest zero-shot results out there — CodeSage-Large 47.5 · OpenAI text-embedding-3-large 55.4 · OASIS 55.8 — and goes toe-to-toe with fine-tuned CodeBERT / GraphCodeBERT (64.7 / 67.5). 💪
  • CoSQA has known label noise, so we read the absolute height with a pinch of salt.

🗺️ M2CRB → 54.0  ·  🏆 SOTA in June 2026

  • the BEST PUBLISHED number we can find, anywhere — and zero-shot
  • 🇪🇸 Spanish · 🇵🇹 Portuguese · 🇩🇪 German · 🇫🇷 French → Python / Java / JavaScript.
  • The paper's best — a CodeBERT fine-tuned on the task — reaches 52.7 auMRRc, a metric that averages over easier, smaller pools (so auMRRc ≥ full-pool MRR for any model). Our 54.0 is full-pool MRR@10 over all 5,795 functions in one pool — a strictly harder measure, cleared with no fine-tuning. 🔥

🛡️ AdvTest → 51.4  ·  🧪 our honest worst case — and we publish it anyway

  • Adversarial obfuscation (def Func(arg_0):) deletes the lexical + graph signals our hybrid feeds on — yet we still beat the classic fine-tuned baselines (CodeBERT 27 · GraphCodeBERT 35 · UniXcoder 41), and our stack still lifts our own encoder ~3pp even here.
  • 🔍 Full transparency: we could not reproduce the often-cited 59.5 for the bare CodeRankEmbed encoder — the reference FP32 model scores 54.7 on our leak-free corpus, our shipped INT8 build 51.4. The gap is stricter preprocessing + INT8 quantization, not the retrieval pipeline. We report exactly what we measured.
  • Reproduction: result artifacts live in eval/results/; rerun via eval/run_all.js. The canonical full-pool loaders are in eval/download_data.py.
  • Full corpus, not distractors. Published baselines for GCSN- and CoSQA-style benchmarks typically rank the gold against 99 sampled distractors; every number here ranks against the benchmark's full corpus (6k–19k candidates) — strictly harder.
  • Zero-shot + docstring-stripped. We never fine-tune on these tasks. For docstring-derived benchmarks (AdvTest, M2CRB) we strip the docstring from the indexed code — otherwise the NL query matches itself verbatim (a no-strip AdvTest run scores a meaningless 0.98). This is the standard protocol; it is also why our AdvTest is lower than naïve setups that leave the docstring in.
  • What we deliberately don't claim yet. CoIR (official metric NDCG@10 over per-subtask corpora up to ~1M docs), CoSQA+ (multi-positive, MAP-primary), and CLARC (per-group pools) use protocols and metrics our single-pool MRR@10 harness doesn't currently match. Rather than publish apples-to-oranges numbers, we omit them; faithful per-subtask CoIR (NDCG@10) runs are queued.
  • M2CRB — the paper's metric is auMRRc (area under the MRR-vs-pool-size curve; best published 52.7, fine-tuned). Because that area averages over easier small pools, auMRRc ≥ full-pool MRR for any model — so our 54.0 full-pool MRR@10 (all 5,795 functions, zero-shot) clears their best on a strictly harder measure. No one publishes a plain full-corpus MRR@10 on M2CRB, so ours is the best available.
  • AdvTest honesty note. We could not reproduce the commonly-cited 59.5 for the bare CodeRankEmbed encoder on our corpus: the reference FP32 model scores 54.7 on our leak-free, docstring-stripped, full-19,210 setup, and our shipped INT8 build 51.4. We report our measured numbers and the reference check rather than the leaderboard figure.
  • Honesty corner: CrossCodeEval — cross-file completion-context retrieval, a different task than NL search — sits at 0.12. We don't optimize for it and report it anyway.

⚡ 4. Engine speed — systems benchmarks, measured in-repo

10.2× ripgrep's median grep  ·  2.9 ms warm queries  ·  47× MaxSim kernels  ·  −33% HNSW search p50

| ⚙️ What | 📈 Result | 📄 Source |
|------|--------|--------|
| ⚡ Indexed grep vs ripgrep | 10.2× faster at the median (8.5–17.7× across 5 repos, 353 realistic queries, 1 ms p50 — identical match counts on every query) | docs/GREP_INDEXING_STRATEGY.md |
| ⏱️ Warm query latency (native CLI) | 2.9 ms warm · 108 ms cold | docs/INIT_STRATEGY.md |
| 🧮 MaxSim rerank kernels | 1.26 s → 27 ms for a 231-candidate pass (47× native Rust; 16× WASM SIMD) | docs/MAXSIM_OPTIMIZATION.md |
| 🧠 HNSW tuning for code | −33% search p50, +5.9 pp recall@200 | docs/HNSW_APPROACH.md |
| 💾 Indexing memory | peak JS heap 785 MB → 213 MB | docs/DISK_FLUSHING_STRATEGY.md |
| 🍏 CoreML cascade (M3 Max) | 18% faster full indexing vs the Metal baseline | docs/INIT_STRATEGY.md |

🧭 Where sweet-search Fits

Code search is a crowded space. Here's an honest read on where sweet-search wins and where it gives ground, against the trending leaders and our closest local peers.

| Capability | sweet-search | claude-context | Cursor index | codebase-memory | SocratiCode |
|---|:---:|:---:|:---:|:---:|:---:|
| 100% local — code never leaves your machine | ✅ | ✅¹ | ❌ | ✅ | ✅ |
| Works with zero API keys | ✅ | ✅¹ | ❌ | ✅ | ✅ |
| No external service to run (vector DB · Ollama · Docker) | ✅ | ❌ Milvus | ❌ cloud | ✅ | ⚠️⁵ |
| ColBERT late-interaction rerank | ✅ | ❌ | ❌ | ❌ | ❌ |
| Faster-than-ripgrep exact grep | ✅ | ❌ | ✅⁷ | ❌ | ❌ |
| Call-graph trace (callers · callees · impact) | ✅ | ❌ | ❌ | ✅ | ✅ |
| Drives any terminal agent (Claude Code · Codex · Gemini CLI) | ✅ | ✅ | ❌² | ✅ | ✅ |
| Published NL→code retrieval benchmarks | ✅ | ⚠️³ | ❌ | ⚠️³ | ⚠️³ |
| …and where sweet-search gives ground | | | | | |
| Native Windows | ❌⁴ | ✅ | ✅ | ✅ | ⚠️⁸ |
| Deep-AST language coverage | ⚠️ 14 (+70 via regex) | ⚠️ | ⚠️ | ✅ 158 | ⚠️ |
| In-editor GUI · writes & edits code | ❌ | ❌ | ✅ | ❌ | ❌⁶ |
| Org-wide, multi-repo scale | ❌ | ⚠️ | ⚠️ | ⚠️ | ✅ |

✅ yes · ⚠️ partial / with caveats · ❌ no. Verified June 2026; capabilities drift.

¹ claude-context's local path (Milvus Lite + Ollama embeddings) needs no API key, but it defaults to OpenAI/Voyage embeddings + Zilliz Cloud — and still runs Milvus + Ollama either way.

² Cursor's index is editor-locked — external terminal agents can't query it.

³ Reports token-reduction / efficiency, not a public NL→code retrieval-quality leaderboard.

⁴ Runs on Windows via WSL2.

⁵ SocratiCode manages a bundled Qdrant for you, but uses an auto-detected Ollama for local embeddings.

⁶ Ships an interactive HTML graph viewer, but doesn't edit code.

⁷ Cursor's local Instant Grep — a literal + regex index it benchmarks at ripgrep 16.8 s → 13 ms (the post that inspired our own n-gram prefilter).

⁸ SocratiCode runs on Windows via Docker only — no native binary, and no GPU there.

Where we lose, plainly: no native Windows yet, no editor GUI, and we index one repo at a time. If you need org-wide search across many repos and branches, that's where SocratiCode and Sourcegraph are built to win. If you live inside one editor, Cursor's index is already there. sweet-search is for the terminal agent that wants the best local retrieval on the repo in front of it. No one else combines all of it: ColBERT late-interaction reranking and faster-than-grep search, fully on-device, with nothing to sign up for.

Also in the space: Sourcegraph/Cody (org-scale, server-based), Continue.dev (local-default RAG), Serena (LSP symbol search, no embeddings), grepai (local CLI + trace), and cocoindex-code (embedded AST search).

🧰 The Six Tools

Six small tools, one shared index. Each returns ranked, deduplicated, token-budgeted output designed to be consumed by an agent — a useful answer, not a wall of matches to scroll through.

| Tool | What you give it | What you get back |
|------|------------------|-------------------|
| 1. ss-search | a natural-language query | ranked, self-contained code blocks |
| 2. ss-grep | an exact regex/literal | every file:line hit, ripgrep-identical |
| 3. ss-find | a regex + a query | regex matches, semantically re-ranked, as code blocks |
| 4. ss-semantic | a file + a question | just the relevant spans of that file |
| 5. ss-trace | a symbol | callers + callees + impact, in one call |
| 6. ss-read | a file (± line range) | exact bytes + symbol metadata |


1. 🔍 ss-search — hybrid search powerhouse

A hybrid search pipeline with late interaction reranking that returns actual code blocks.

Leading published-benchmark results — strongest we can find on GenCodeSearchNet, and above every published zero-shot model on CoSQA. See benchmarks.

flowchart TD
    Q(["🔍  natural-language query"]) --> ROUTE{{"🧭 WASM CatBoost router · lexical / hybrid"}}

    ROUTE --> BM["📑 <b>BM25F</b><br/>field-weighted FTS5"]
    ROUTE --> ANN

    subgraph ANN ["🧬 three-stage ANN cascade"]
        direction LR
        BIN["binary <b>HNSW</b><br/>Hamming · ~100µs"] --> INT["INT8<br/>rescore"] --> FL["float32<br/>mmap sidecar"]
    end

    BM  --> FUSE
    ANN --> FUSE
    FUSE["🔀 <b>CCFusion</b><br/>convex combo · RRF fallback"] --> ROW1

    subgraph ROW1 [" "]
        direction LR
        IAR["⚓ <b>IAR</b><br/>exact-symbol injection"] --> INTENT["🎯 intent rerank<br/>demote docs · tests · config"]
    end

    ROW1 --> ROW2

    subgraph ROW2 [" "]
        direction LR
        GRAPH["🕸️ graph expansion<br/>typed edges · 1–2 hops · <b>PathRAG</b>"] --> MAXSIM["🧮 <b>Late-Interaction Rerank</b><br/>⚡ native Rust MaxSim kernel"] --> OUT(["🏁 <b>self-contained code blocks</b><br/>whole functions · 3k/8k/12k budget"])
    end

    classDef io    fill:#fde68a,stroke:#f59e0b,color:#000;
    classDef out   fill:#bbf7d0,stroke:#15803d,color:#000,stroke-width:3px;
    classDef route fill:#e0e7ff,stroke:#818cf8,color:#000;
    classDef lex   fill:#dbeafe,stroke:#60a5fa,color:#000;
    classDef fuse  fill:#f3e8ff,stroke:#c084fc,color:#000;
    classDef rank  fill:#ffe4e6,stroke:#fb7185,color:#000;

    class Q io;
    class OUT out;
    class ROUTE route;
    class BM,BIN,INT,FL lex;
    class FUSE,IAR fuse;
    class INTENT,GRAPH,MAXSIM rank;

    style ANN  fill:#eff6ff,stroke:#93c5fd,color:#000;
    style ROW1 fill:none,stroke:none;
    style ROW2 fill:none,stroke:none;

↑ The diagram traces the hybrid route. A pure-lexical query — or a literal file path — short-circuits at the router straight to BM25F, skipping the vector cascade and fusion.

| Stage | What it actually does |
|-------|-----------------------|
| 🧭 Route | WASM-exported CatBoost · lexical / hybrid · ~10 µs routing · low-confidence → max-recall hybrid |
| 🧬 Retrieve | • Lexical: BM25F over field-weighted FTS5 (name 10× · signature 5× · alias 4× · doc 1×)<br/>• Embed: query vectorized by the local CodeRankEmbed model (swappable for Voyage / Jina / Codestral)<br/>• Vector cascade: binary HNSW (Hamming, 64-byte, ~100 µs) → INT8 rescore → exact float32 from a memory-mapped sidecar |
| 🔀 Fuse | • CCFusion — convex-combine both rankings · per-route weights · quantile-normalized<br/>• MMR (λ=0.9) diversity pass over the fused list<br/>• auto RRF (k=60) fallback on degenerate score distributions |
| ⚓ Anchor | • IAR (Identifier Anchor Retrieval) — a real symbol in the query fires an exact-name code-graph lookup that injects that entity, even when the encoder ranked it too low |
| 🎯 Intent Rerank | • demote docs / tests / config when you want implementation<br/>• log-scaled call-site boosts surface the most-referenced function |
| 🕸️ Graph Expansion | • typed-edge walks (imports/extends/calls/uses) · adaptive 2-hop on the AST graph · edges picked by intent<br/>• PathRAG flow pruning + degree normalization → hubs can't dominate |
| 🧮 Late interaction Rerank | • Query embedded per-token by LateOn-Code (149M; a 17M edge variant auto-selected on low-RAM hosts)<br/>• MaxSim against the pre-indexed quantized token vectors<br/>• native Rust+Rayon MaxSim kernel ⚡ · WASM-SIMD fallback (1.26 s → 27 ms on a 231-candidate rerank) |
| 📦 Package | • entity-aware expansion → whole functions (imports, docstrings, decorators)<br/>• same-file overlap demotion → diverse, non-overlapping spans<br/>• auto-selected 3k / 8k / 12k token budget |
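
To make the Fuse stage concrete, here is a rough sketch of a quantile-normalized convex combination of a lexical and a dense ranking. It is not the shipped CCFusion code: the function names, the rank-based normalization, and the single `alpha` weight are assumptions, and the MMR pass and RRF fallback are left out.

```python
def quantile_normalize(scores):
    """Map raw scores to (0, 1] by rank: the best doc gets 1.0, the worst 1/n.

    Rank-based scaling makes BM25 scores and cosine similarities comparable
    even though their raw ranges differ.
    """
    ordered = sorted(scores, key=scores.get)  # ascending by raw score
    n = len(ordered)
    return {doc: (i + 1) / n for i, doc in enumerate(ordered)}

def convex_fuse(lexical, dense, alpha=0.5):
    """Fuse two {doc: score} maps as alpha * lexical + (1 - alpha) * dense."""
    lex_q = quantile_normalize(lexical)
    den_q = quantile_normalize(dense)
    docs = set(lex_q) | set(den_q)
    # A doc missing from one retriever contributes 0 from that side.
    fused = {
        d: alpha * lex_q.get(d, 0.0) + (1 - alpha) * den_q.get(d, 0.0)
        for d in docs
    }
    return sorted(fused, key=lambda d: (-fused[d], d))

lexical = {"a": 12.0, "b": 3.0, "c": 1.0}   # e.g. BM25F scores
dense = {"b": 0.9, "c": 0.8, "d": 0.2}      # e.g. cosine similarities
print(convex_fuse(lexical, dense, alpha=0.7))  # ['b', 'a', 'c', 'd']
```

A per-route weight in this picture would simply be a different `alpha` chosen by the router for lexical-leaning versus hybrid queries.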

🧠 The HNSW, in full (full writeup). Stage 1 is a from-scratch binary HNSW, and every "advanced" trick ships on by default:

  • Heuristic neighbor selection (HNSW Algorithm 4) + M0 = 2M on layer 0 — a real graph backbone, not naïve closest-M
  • Shuffled insertion order — no filesystem-ordering bias baked into the highway structure
  • Discovery-rate adaptive early termination + adaptive ef — easy queries stop early, hard ones keep their budget
  • A denser graph than most vendors ship (M=64 · efC=800 · efS=400) — which broke an 80.6 % → 86.5 % recall@200 plateau and cut p50 latency ~33 %
  • Zero-GC search: typed-array heaps + generation-stamped visited lists — no per-query allocation
  • 64-byte sign-bit vectors (Hamming) → INT8 → exact float32 from a memory-mapped sidecar
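
The coarse-to-exact idea in the last bullet can be illustrated with a toy two-stage search. This sketch makes several simplifying assumptions: it scans linearly instead of walking an HNSW graph, skips the INT8 middle stage, and uses 4-dimensional vectors (a 64-byte code would hold 512 sign bits). All names are hypothetical.

```python
def sign_bits(vec):
    """Pack the sign of each dimension into one integer (bit i set when vec[i] > 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two packed sign codes."""
    return bin(a ^ b).count("1")

def cascade_search(query, corpus, shortlist=2, top_k=1):
    q_bits = sign_bits(query)
    # Stage 1: cheap Hamming comparison on sign-bit codes picks a shortlist.
    coarse = sorted(
        corpus, key=lambda d: hamming(q_bits, sign_bits(corpus[d]))
    )[:shortlist]
    # Final stage: exact float dot product, computed for the shortlist only.
    def dot(d):
        return sum(q * x for q, x in zip(query, corpus[d]))
    return sorted(coarse, key=dot, reverse=True)[:top_k]

query = [0.9, -0.2, 0.4, -0.7]
corpus = {
    "near": [0.8, -0.1, 0.5, -0.6],
    "same_signs_small": [0.1, -0.1, 0.1, -0.1],
    "far": [-0.9, 0.3, -0.4, 0.7],
}
print(cascade_search(query, corpus))  # ['near']
```

The binary stage cannot tell "near" from "same_signs_small" (both are at Hamming distance 0), which is exactly why an exact rescore follows it.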

⚡ Why it's quick. A native Rust + Rayon MaxSim kernel (47× over scalar; 16× WASM-SIMD fallback) · int4-quantized, binary-packed token vectors (plain INT4 is the shipped path — the full TurboQuant algorithm is researched but deferred; binary packing alone cut the LI index ~3.4×, 1.34 GiB → ~396 MiB) · a memory-mapped float32 sidecar that skips SQL on the rescore hot path · score-spread adaptive pooling (decisive queries shrink the rescore pool, ambiguous ones widen it) · and a warm daemon that answers in a single NAPI call — no process is ever forked.

🎛️ Priors & structure.

  • Quality priors: every chunk carries a 0–1 prior from test proximity, git recency, symbol centrality (PageRank), comment density, and complexity — production code surfaces, stale fixtures sink.
  • Community structure: a canonical Leiden pass detects code communities on the entity graph at index time, feeding vocabulary prewarming and structural signals — it understands your modules, not just your directories.
  • Multilingual: 14 languages get full tree-sitter AST treatment; a 39-config registry covers 70+ extensions beyond that. Router features handle camelCase/snake_case, CJK density, and German compounds.
  • Format-gated signals: structure-aware boosts and demotions (symbol-exact, path-token, mega-entity) fire only in agent mode — they help agent-shaped queries and would hurt plain NL, so they stay gated by default.

🛟 Rescues & honest trade-offs.

  • Long-query rescue: wordy NL queries that FTS5 would tokenize into an unsatisfiable AND fall back to multi-query BM25F + RRF — one query per content keyword, fused.
  • Near-duplicate dedup: a SimHash + MinHash-LSH pass (Jaccard τ=0.9) clusters copy-paste and vendored code at index time; aliases reuse their exemplar's vectors and skip both the bi-encoder and late-interaction encoding.
  • A negative result we ship anyway: we built a full cross-encoder rerank cascade behind an adaptive confidence gate, measured it on our eval sets — and it didn't beat MaxSim at 3× the latency. So it ships disabled (SWEET_SEARCH_CASCADE_ENABLED=true to try it). We'd rather ship the faster path than a fancier diagram.
  • Budget tiers: the expensive 8k/12k tiers fire on ~1–5 % of queries — the default stays cheap. Force one with --full / --xl, or pick a mode with --mode lexical|semantic|hybrid|pattern.
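
As a rough illustration of the near-duplicate pass, the sketch below estimates Jaccard similarity between two code snippets with MinHash signatures. It is an assumption-laden simplification: token shingles of width 3, 64 seeded BLAKE2 hashes, and no SimHash or LSH banding step, all of which are choices made for the example rather than details of the shipped index.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping n-token windows from whitespace-tokenized text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash function; matching slots estimate Jaccard."""
    def h(seed, item):
        digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return [min(h(seed, it) for it in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots on which the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

original = "def add ( a , b ) : return a + b"
vendored = "def add ( a , b ) : return a + b"
other = "class Foo : pass"

sim = estimated_jaccard(
    minhash_signature(shingles(original)), minhash_signature(shingles(vendored))
)
print(sim >= 0.9)  # True: the copy would be treated as an alias of its exemplar
```

With a threshold of τ=0.9, a pair like `original`/`vendored` clusters together, while `other` stays separate and gets its own embeddings.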

Also available as sweet-search "<query>" on the CLI and the search MCP tool.


2. ⚡ ss-grep — grep, minus every wasted millisecond

10.2× faster than ripgrep end-to-end at the median — measured across 353 realistic queries on 5 real repos (range 8.5–17.7× per repo, 1 ms p50), with identical match counts on every single query. Three things buy that:

  • A sparse n-gram index (inspired by Cursor's fast-regex-search and GitHub's Blackbird): instead of a fixed trigram table, gram boundaries adapt to your codebase's character-pair frequencies, so common trigrams get absorbed into longer, more selective grams.
  • Regex-AST literal extraction + SIMD intersection: required substrings are pulled from the pattern's syntax tree, posting lists are intersected with NEON/SSE2 block merges (galloping search for skewed sizes), and only the files that can match — typically 0.1–5% of the corpus — see the real regex.
  • Fully in-process: verification runs on Rust's regex crate with Rayon across all cores, inside the warm daemon, in a single NAPI call. No child process is ever spawned — zero fork/exec, zero pipe I/O, zero JSON re-parsing.
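
The prefilter idea can be sketched with a plain trigram index. This is a deliberate simplification of what the bullets describe: it uses fixed trigrams rather than frequency-adaptive sparse grams, takes the required literal as an argument instead of extracting it from the regex syntax tree, and intersects Python sets rather than SIMD posting lists. All names are hypothetical.

```python
import re
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Map each trigram to the set of file paths that contain it."""
    index = defaultdict(set)
    for path, content in files.items():
        for gram in trigrams(content):
            index[gram].add(path)
    return index

def indexed_grep(pattern, required_literal, files, index):
    # A file missing any trigram of the required literal provably cannot match,
    # so only the intersection of the posting lists sees the real regex.
    candidates = set(files)
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    hits = []
    for path in sorted(candidates):
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            if regex.search(line):
                hits.append((path, lineno))
    return hits, len(candidates)

files = {
    "a.py": "x = 1\nrefresh_token(user)\n",
    "b.py": "print('hello')\n",
    "c.py": "def refresh():\n    pass\n",
}
index = build_index(files)
print(indexed_grep(r"refresh_[a-z]+\(", "refresh_", files, index))
# ([('a.py', 2)], 1)
```

Here `c.py` contains "refresh" but not the trigram "sh_", so it is ruled out before the regex runs; only one of three files is verified.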

Every match comes back in stable file:line order — ripgrep-identical counts, optional context lines — with no relevance guessing, no subprocess, in one warm call.

  • Full methodology, per-repo table, and the optimization log: docs/GREP_INDEXING_STRATEGY.md.
  • Regexes with no extractable literals fall back to native grep over the indexed file set; fixed-string and glob queries use a ripgrep fallback.

3. ss-find — ColGrep, on a faster engine

ss-find "token refresh logic" --regex "refresh.*[Tt]oken"

Inspired by LightOn's ColGrep — regex precision, semantically ranked — but rebuilt on our own substrate:

  • The regex stage runs on the same indexed sparse-gram engine as ss-grep (in-process, no subprocess), not a filesystem scan.
  • The ranking stage scores candidates with per-token MaxSim over pre-indexed late-interaction embeddings — no model inference over documents at query time — on our custom kernels: native Rust + Rayon takes a 231-candidate MaxSim pass from 1.26 s down to 27 ms (WASM SIMD fallback at 16×).
  • Regex tokens are merged into the semantic query, so the ranking sees both what you typed and what you matched.
  • Like ss-search, it answers with ranked, self-contained code snippets — not bare file:line — so the find and the read collapse into one tool call. In our 30-question agent-workflow eval that eliminated every follow-up read and cut tokens 25.4% vs a grep + read workflow, at quality parity (gap of 0.01 on a 5-point scale).
  • On the 60-query pattern benchmark, MaxSim ranking lifts MRR@10 to 0.45 vs 0.11 for raw grep ordering — 4× more likely the right hit lands on top.
  • Requires the late-interaction index (built by default; --li-model none disables pattern mode).
  • Also available as sweet-search --mode pattern and via the search MCP tool's regex argument.
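
For reference, the MaxSim score used in the ranking stage is simple to state: for every query token, take its best similarity against any document token, then sum. The sketch below is a plain-Python illustration with tiny hand-made vectors; it assumes dot-product similarity and ignores quantization and the native kernels.

```python
def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product against any document token."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def rerank(query_tokens, candidates):
    """Order candidate names by descending MaxSim score."""
    return sorted(
        candidates,
        key=lambda name: maxsim(query_tokens, candidates[name]),
        reverse=True,
    )

query = [[1.0, 0.0], [0.0, 1.0]]            # two query-token embeddings
candidates = {
    "covers_one": [[1.0, 0.0], [0.9, 0.1]],  # strong match for the first token only
    "covers_both": [[1.0, 0.0], [0.0, 1.0]], # a good match for every query token
}
print(rerank(query, candidates))  # ['covers_both', 'covers_one']
```

Because document token vectors are computed at index time, the query-time work is only the query encoding plus these max-and-sum operations, which is what the SIMD kernels accelerate.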

4. ss-semantic — hybrid retrieval, scoped to one file

ss-semantic src/auth/session.ts "where does the cookie get its expiry?"

You know the file; this finds the lines. Every indexed chunk of the file is scored by three independent signals — BM25-style lexical term match, exact symbol-name match (weighted 1.5×), and per-token MaxSim late interaction over the LateOn-Code embeddings — fused with Reciprocal Rank Fusion (k=60), with symbol-less fragment chunks demoted 0.85× so real definitions win ties. The top spans are then re-read from disk (±2 context lines, overlapping spans merged), so the answer is filesystem ground truth even mid-edit; if the file is newer than its index entry you get an explicit staleness warning.
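
A minimal sketch of the fusion step described above, using Reciprocal Rank Fusion with k=60. How the 1.5× symbol weight and the 0.85× fragment demotion enter the formula is an assumption here (a multiplier on the RRF contribution and on the final score, respectively), and the function and chunk names are invented for the example.

```python
def rrf_fuse(rankings, weights=None, k=60, fragments=(), fragment_penalty=0.85):
    """Weighted Reciprocal Rank Fusion over several best-first rankings."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + w / (k + rank)
    # Demote symbol-less fragment chunks so real definitions win close calls.
    for chunk in fragments:
        if chunk in scores:
            scores[chunk] *= fragment_penalty
    return sorted(scores, key=lambda c: (-scores[c], c))

lexical = ["frag", "setCookie", "helper"]    # BM25-style term match
symbol = ["setCookie"]                       # exact symbol-name match
late_interaction = ["frag", "setCookie", "helper"]

print(rrf_fuse([lexical, symbol, late_interaction],
               weights=[1.0, 1.5, 1.0], fragments={"frag"}))
# ['setCookie', 'helper', 'frag']
```

In this toy input the fragment tops two of the three rankings, yet the named definition wins once the symbol signal and the demotion are applied.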

The useful answer: just the relevant spans with line numbers — not the whole file through your context window.

  • Unindexed files degrade gracefully to a plain read. Defaults: top 5 spans, relevance threshold 0.4, 8k-char cap.
  • Also available as sweet-search read-semantic and the read-semantic MCP tool.
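
The fusion step can be sketched as weighted Reciprocal Rank Fusion. The README states k=60, a 1.5× symbol weight, and a 0.85× fragment demotion; exactly where those multipliers are applied is an assumption here, and the function and signal names are hypothetical.

```python
def rrf_fuse(rankings, weights, k=60, fragments=frozenset(), fragment_penalty=0.85):
    """Fuse several ranked lists of chunk ids into one ordering.

    rankings: signal name -> chunk ids, best first
    weights:  signal name -> multiplier (missing signals default to 1.0)
    fragments: chunk ids with no symbol, demoted after fusion (assumed placement)
    """
    scores = {}
    for signal, ranked_ids in rankings.items():
        weight = weights.get(signal, 1.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    for chunk_id in scores:
        if chunk_id in fragments:
            scores[chunk_id] *= fragment_penalty
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

RRF only looks at ranks, never raw scores, so the three signals do not need to be calibrated against each other before they are combined.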

5. ss-trace — graph algorithms, not grep guesswork

ss-trace processOrder --in src/orders/service.py

One call returns a symbol's callers, callees, and transitive impact paths from the AST-derived code graph (entities + typed calls/imports/extends/uses edges, persisted in SQLite at index time). Ranking fuses three signals:

  • Query-time Personalized PageRank via Forward Push — a local algorithm that spreads mass directionally from your target symbol and touches only the neighborhood it reaches, never the whole graph;
  • Index-time edge-weighted global PageRank (damping 0.85), precomputed into a page_rank column — a function called from five sites carries five units of mass, and it costs zero at query time;
  • Structural heuristics — relationship type, depth, exported-API status, fan-in — with penalties for test-only and external paths.

Because the graph is prebuilt, the global ranking is precomputed, and the personalized walk is local, a full three-section trace costs milliseconds. The relation word (callers / callees / impact) re-weights how the response token budget is split; --in disambiguates duplicate names; --depth bounds impact traversal (1–4).
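
To make the "local, never the whole graph" property concrete, here is a minimal sketch of Forward Push in the style of Andersen, Chung and Lang. The teleport probability, the tolerance, and the handling of nodes with no outgoing edges are assumptions, not the shipped configuration.

```python
from collections import deque


def forward_push(graph, source, alpha=0.15, epsilon=1e-4):
    """Approximate Personalized PageRank from `source`.

    graph: node -> list of out-neighbors (a toy stand-in for the code graph).
    Only nodes reachable from `source` are ever touched.
    """
    estimate = {}
    residual = {source: 1.0}
    queue = deque([source])
    queued = {source}
    while queue:
        node = queue.popleft()
        queued.discard(node)
        mass = residual.get(node, 0.0)
        neighbors = graph.get(node, [])
        if mass <= epsilon * max(len(neighbors), 1):
            continue  # too little residual left to be worth pushing
        residual[node] = 0.0
        if not neighbors:
            # Assumption: a node with no out-edges absorbs all of its mass.
            estimate[node] = estimate.get(node, 0.0) + mass
            continue
        estimate[node] = estimate.get(node, 0.0) + alpha * mass
        share = (1.0 - alpha) * mass / len(neighbors)
        for neighbor in neighbors:
            residual[neighbor] = residual.get(neighbor, 0.0) + share
            threshold = epsilon * max(len(graph.get(neighbor, [])), 1)
            if neighbor not in queued and residual[neighbor] > threshold:
                queue.append(neighbor)
                queued.add(neighbor)
    return estimate
```

In a graph where `d` calls `a` but nothing reachable from `a` leads back to `d`, a push from `a` never visits `d`, which is the directional behavior the trace ranking relies on.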

  • Honest caveat: call-graph extraction is precise but incomplete on highly dynamic code (bare-name dispatch, metaprogramming) — traces can be sparse there, and the agent prompt teaches a recovery strategy for exactly that case.
  • Also available as sweet-search trace and the trace MCP tool.

6. ss-read — exact bytes, with the index's knowledge attached

ss-read src/db/pool.js 120 180

A read tool that is filesystem-grounded by construction: bytes come straight from disk (never from the index, so never stale), but each indexed file arrives annotated with its cAST chunk metadata — symbol name, entity type, signature, line span — joined from the AST chunk index. The agent gets the code and the structural map of what it's looking at in one call: cite, navigate, or trace next without another search.

  • The CLI/MCP form scales it up: sweet-search read <file...> (and the read MCP tool) batches 1–20 files in a single call, each with the same symbol metadata — twenty files for the price of one tool invocation.

The ss-* wrappers ship in the npm package and are what the installed agent prompt drives. Every capability is equally available as sweet-search CLI subcommands and as MCP tools — see Works With Your Agent.


🧠 An Agent Prompt That Was Evolved, Not Written

Shipping six tools is easy. Getting an agent to stop grepping in circles is the hard part.

So sweet-search init installs a ~1k-token system prompt that we didn't write — we grew it. A GEPA-style loop mutated candidate prompts, scored each on a dual Pareto front (accuracy × cost) against two different production agents at once — Claude Code (Sonnet) and Codex (GPT-5.5) — kept the survivors, and repeated. A final correctness pass hardened the winner. ~1k tokens, one job: teach the agent to search well.

🎓 The five rules it encodes:

| | Rule | What it kills |
|--|--|--|
| 🥇 | Cheapest tool first | Got an exact symbol? One ss-grep, trust the top hit, stop; no semantic search "just to confirm." |
| 🎯 | Trust the ranking | At most one narrow read to confirm; never re-run a hit that already matched. |
| 🚫 | Absence is an answer | Two empty probes (one semantic, one lexical) settle a negative; no third synonym, no find/ls spiral. |
| ⛔ | No raw-shell escape | The #1 token-waster in our trace analysis: agents bailing to dozens of raw grep/find calls after one miss. Door closed. |
| 📝 | Think before you dig | Before a third probe, the agent states what it knows and what its blind spot is. |

🧾 The receipts. Held-out discipline throughout: a dev set to iterate on, a held-out set touched only at milestones, a sealed vault opened exactly once.

| Validation gate | Result |
|--|--|
| 🎯 Held-out (30 probes × both agents) | joint score (worst of the two) 0.988 |
| 🌍 Out-of-distribution (8 languages never seen in the loop) | 0.952; every language ≥ 0.79, zero weak spots |
| 🛡️ Adversarial counter-probes | 1.00 / 1.00 |
| 🔀 Held-out model families (never optimized on) | MiMo 0.988 · Qwen 0.980; it generalizes, it doesn't memorize |
| 🧩 Paraphrase robustness (reword the prompt, same behavior) | correctness-weighted 0.95 / 0.93 |

  • Seeds → survivors: 15 hand-authored seed prompts entered a reflective-evolution loop (an agent reads the real tool-call traces, proposes one targeted edit, we keep what helps). Operators included trajectory crossover, structural pivots, tool-name masking, and a pruner that fights prompt bloat.
  • Two targets, jointly: every candidate was scored on both Claude Code/Sonnet and Codex/GPT-5.5 with Maximin discipline (a prompt is only as good as its worse target), so it can't overfit one model's quirks.
  • What actually won: not clever phrasing — terseness (a shorter prompt re-sent every turn is cheaper), a leaner tool mix (grep/read over heavy semantic blocks that fatten the transcript), and decisiveness on no-match (stop spiraling). We report this plainly because it's what the traces showed.
  • The correctness pass: the shipped prompt ("M++") is the cost-winner plus 7 edits that fix factual descriptions of the tools — routing byte-identical, accuracy held, cost unchanged. A lateral move that buys honesty.
  • Held-out everything: dev to iterate, held-out checked only at milestones, a sealed vault opened once, plus held-out model families (MiMo, Qwen) and a reasoning-mode replay (MiniMax 0.963) it never trained against. Figures: docs/PHASE7.md (internal probe suites; an externally-reproducible suite is in progress).
  • Idempotent install: init writes a marker-delimited block into CLAUDE.md / AGENTS.md / GEMINI.md / .cursor/rules — re-run it freely, it never touches anything else you wrote.
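
The selection pressure described above (Maximin across two agents, then a Pareto front over accuracy and cost) might be sketched like this. The `Candidate` type, the field names, and the numbers in any usage are hypothetical; the real loop's scoring details are in the project's own docs.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: dict  # agent name -> accuracy on the probe set
    cost: float     # e.g. mean tokens per task; lower is better


def joint_accuracy(candidate):
    """Maximin: a prompt is only as good as its worse target."""
    return min(candidate.accuracy.values())


def dominates(a, b):
    """True if `a` is at least as good as `b` on both axes and better on one."""
    no_worse = joint_accuracy(a) >= joint_accuracy(b) and a.cost <= b.cost
    better = joint_accuracy(a) > joint_accuracy(b) or a.cost < b.cost
    return no_worse and better


def pareto_front(candidates):
    """Keep the survivors: candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

Scoring by the minimum rather than the mean is what prevents a prompt from winning by being excellent on one agent and mediocre on the other.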

⚡ GPU-Accelerated Indexing, Fully Local

Chunk → enrich → embed → quantize — every step on-device and in Rust. Batches are sized to your CPU's actual cache, two open code-models do the encoding, and two separate quantizations make the index both faster to build and small enough to live in RAM. Zero API keys; nothing ever leaves the machine.

① 🧩 Structure-aware chunk: cAST over tree-sitter ASTs; whole functions, never sliced mid-body

② 🏷️ Enrich from structure: deterministic preamble from the code graph, no LLM call

③ 🤖 Embed, two models: dense CodeRankEmbed + per-token LateOn-Code

④ 🗜️ Quantize + persist: INT8 weights → 2× faster build · INT4 vectors → fits in RAM

The inference engine, picked for your silicon:

| Your hardware | What runs |
|--|--|
| 🍏 Apple Silicon (M1+) | candle Metal, BF16, fused SDPA attention |
| 🍏 Apple Silicon (M3+) | … plus a CoreML Neural Engine cascade: ~18% faster full index (measured, M3 Max) |
| 🟩 NVIDIA GPU (SM 7.0+) | candle CUDA; flash-attention on Ampere+ |
| 💻 No accelerator | ONNX Runtime INT8: tuned CPU path, 132 MB model, zero GPU weights downloaded |

🧩 Chunking — every chunk is whole code, never a fixed window

  • cAST structure-aware chunking over real tree-sitter ASTs: a recursive split-then-merge greedily packs sibling AST nodes up to the size cap and recurses into nodes too big to fit. So a chunk is always a function, a class, or a contiguous run of declarations — never a body cut in half, never a string split mid-literal.
  • 14 languages get true AST grammars — JS · TS · TSX · Python · Go · Rust · Java · C · C++ · Ruby · PHP · Kotlin · Swift · C# — and a 39-config regex registry carries structure-aware chunking to 70+ more extensions.
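
The split-then-merge idea can be sketched on a toy tree. This is a simplified illustration, not the shipped chunker: `Node`, its `size` measure, and the choice to emit an oversized leaf on its own are all assumptions, and a real implementation would walk tree-sitter nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    size: int  # e.g. characters covered by this AST node
    children: list = field(default_factory=list)


def chunk(nodes, cap):
    """Greedily pack sibling nodes up to `cap`; recurse into nodes too big to fit."""
    chunks = []
    current, current_size = [], 0

    def flush():
        nonlocal current, current_size
        if current:
            chunks.append(current)
            current, current_size = [], 0

    for node in nodes:
        if node.size > cap:
            flush()
            if node.children:
                chunks.extend(chunk(node.children, cap))
            else:
                chunks.append([node.label])  # assumption: oversized leaf stands alone
        elif current_size + node.size > cap:
            flush()
            current, current_size = [node.label], node.size
        else:
            current.append(node.label)
            current_size += node.size
    flush()
    return chunks
```

Every chunk boundary falls between whole nodes, so a function body is either kept intact or split only along its own child nodes.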

🏷️ Metadata — context the encoder can actually see

  • Every chunk ships its symbol name · entity type · signature · line span — the metadata that powers the code graph, ss-read annotations, and the self-contained answers everywhere else.
  • Contextual enrichment: before embedding, each chunk is prefixed with a structured preamble assembled from the AST + code graph — file path · enclosing-scope breadcrumb · name & type · merged siblings · the imports it actually uses. Both encoders see it, so a bare getId() still retrieves on the class and module around it.
  • Our nod to Anthropic's Contextual Retrieval — except they prepend an LLM-generated summary (one model call per chunk); we derive the context deterministically from structure: no LLM, no per-chunk inference, regenerated for free on every reindex. Tuned per language from GenCodeSearchNet ablations — Python stays minimal, the Java family keeps a slug-stripped path, JS/Ruby/Go/C/C++/Rust get the full preamble where closures and imports earn their keep.

🧠 Cache-aware batching — we read your CPU before we batch it

  • We detect your last-level cache at runtime (hw.perflevel0.l2cachesize on Apple Silicon, which is the 16 MB P-cluster rather than the smaller E-cluster; Intel L3; or /sys/.../cache on Linux), then size every embedding batch so one transformer layer's weights plus the batch's activations stay resident in cache. No spilling to main memory mid-layer; on a long-sequence tail that's the difference between B=1 and a measured 2.1× per-chunk slowdown.
  • Uses every core the hardware really has — full count on ARM/Apple Silicon; x86 SMT siblings discounted because they don't scale inference linearly.
  • ORT drives the CPU path (ONNX Runtime); GPU hosts swap in fused kernels (below). Either way inference runs off the event loop as a napi AsyncTask, so tokenization and SQLite writes overlap compute instead of stalling behind it.
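
The arithmetic behind cache-aware batching can be sketched as a budget calculation: subtract one layer's weights from the cache size, then see how many samples' activations fit in what remains. Every parameter below (bytes per activation, number of live tensors, the batch ceiling) is an illustrative assumption, not the project's actual cost model.

```python
def batch_size_for_cache(cache_bytes, layer_weight_bytes, seq_len, hidden_dim,
                         bytes_per_activation=4, live_tensors=4, max_batch=64):
    """Largest batch whose activations fit beside one layer's weights in cache.

    live_tensors is a rough guess at how many [seq_len x hidden_dim] buffers
    are alive at once inside a layer (hypothetical).
    """
    per_sample = seq_len * hidden_dim * bytes_per_activation * live_tensors
    budget = cache_bytes - layer_weight_bytes
    if budget <= 0:
        return 1  # weights alone overflow the cache; fall back to B=1
    return max(1, min(max_batch, budget // per_sample))
```

Under these assumptions, longer sequences shrink the batch quickly, which matches the observation that the long-sequence tail is where batch size matters most.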

🗜️ Two quantizations — one buys speed, one buys size

| | Model weights · INT8 ORT | Index vectors · INT4 binary |
|:--|:--|:--|
| Job | build the index faster on CPU | keep the on-disk index tiny |
| Win | ~2× faster indexing · 4× smaller model (132 MB) | LI index 1.34 GiB → ~396 MiB · INT4 nibble-packing halves it again |
| Fidelity | ≥ 0.96 cosine vs FP32 | no measurable retrieval loss (A/B-tested vs INT8) |
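
For the index-vector side, INT4 nibble packing can be sketched as min/scale quantization to 16 levels with two codes stored per byte. The function names and the nibble order (high nibble first) are assumptions; the actual on-disk layout is defined by the project, not by this example.

```python
def quantize_int4(vector):
    """Quantize floats to 4-bit codes and pack two codes per byte.

    Expects a non-empty vector. Returns (packed_bytes, minimum, scale).
    """
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 15 or 1.0  # constant vectors get a dummy scale
    codes = [min(15, max(0, round((x - lo) / scale))) for x in vector]
    if len(codes) % 2:
        codes.append(0)  # pad to a whole byte
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, scale


def dequantize_int4(packed, lo, scale, length):
    """Unpack nibbles and map codes back to approximate floats."""
    values = []
    for byte in packed:
        values.append(lo + (byte >> 4) * scale)
        values.append(lo + (byte & 0x0F) * scale)
    return values[:length]
```

Each value costs half a byte plus a shared minimum and scale per vector, and the reconstruction error is bounded by half a quantization step.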

🤖 Two models — both open, both local, both code-specialized

  • CodeRankEmbed — 768-d dense bi-encoder (137M, Apache-2.0) for first-stage recall.
  • LateOn-Code — ModernBERT per-token late interaction (149M) for the rerank.
  • Edge fallback for leaner machines: a 17M edge LateOn-Code (~9× smaller FP32 backbone) auto-selects on low-RAM hosts, and the whole CPU path runs INT8 with no GPU weights ever downloaded — full local search on a laptop with no accelerator.
  • Surgical attention swap: we vendor the upstream model implementations (NomicBERT for embeddings, ModernBERT for late interaction) and replace only the attention forward pass — an MLX-ported fused SDPA kernel on Metal, candle-flash-attn with varlen packing on CUDA Ampere+, and byte-for-byte upstream math on CPU so the fallback is provably identical.
  • A silent-NaN bug, found and fixed: Apple's Metal SDPA kernel downcasts attention masks to F16, which saturates the standard f32::MIN mask to -Inf and quietly produces NaN on padded rows — collapsing retrieval quality. We clamp the mask and serialize Metal command-buffer submissions (concurrent submission corrupts outputs on shared queues). Details in crates/sweet-search-native/src/inference/.
  • CoreML cascade: 18 pre-traced .mlpackage variants (bucketed by sequence length) dispatched to the Apple Neural Engine through an Objective-C shim; oversized batches fall through to Metal. Gated to M3+ because on M1/M2 the ANE doesn't beat its own compile overhead — we measured, so it's off there.
  • Structure-routed enrichment: the preamble (path · scope chain · symbol · siblings · imports) is assembled at index time from a code-graph line-range overlap query — never an LLM call — then routed per language family (full enriched text for JS/Ruby/Go/C-family/Rust, a slimmer path policy for Python and the Java family), every decision settled by per-language ablation rather than a global default.
  • Pipelined, crash-safe indexing: while batch N+1 embeds, batch N's vectors stream into SQLite through zero-copy buffer views; full rebuilds write to a temp file and atomically swap, so a crash never leaves you serving half an index.

🔄 An Index That Never Goes Stale

Most code indexes rot the moment you start typing. sweet-search ships a reconcile daemon that keeps every tier of the index converged with your working tree — uncommitted edits included — without you ever running a command.

  • Save → searchable at the next reconcile tick — auto-tuned per machine between 15 s and 300 s, typically 15–60 s on a warm, idle box
  • Tracks the filesystem, not git — unstaged and uncommitted changes are first-class; deleted or newly-gitignored files disappear from results automatically
  • Atomic by construction — every tick publishes all five index tiers (float HNSW, binary HNSW, late-interaction segments, sparse-gram, code graph) through a single fsync-renamed epoch manifest, so a query never sees a half-updated index
  • No-op edits cost almost nothing — content hashing collapses byte-identical rewrites and editor touch events into skipped re-encoding work
  • Baseline gate: the daemon never plays first-index-builder. It verifies a full-indexer fingerprint (epoch manifest + merkle config fingerprint + the vectors DB it names) before touching anything, and reports waiting_for_initial_index otherwise — no corrupted partial baselines.
  • One admission policy: the full indexer and the reconciler share a single createAdmissionPolicy module (include globs → deny list → .sweet-search-ignore → 1 MB size cap → batched git check-ignore), so the two paths cannot drift.
  • Orphan sweep: files that are deleted, newly excluded, or newly oversized get tombstoned across every tier; the index converges to exactly what a fresh full rebuild would produce.
  • Self-maintenance: per-tier health watermarks (tombstone fraction, stale-doc ratio, delta ratio) schedule low-priority background compaction in a separate worker — the index stays fast over months without a manual rebuild.
  • Worktree-safe: a worktree stamp plus a single-writer lockfile prevent two daemons from silently interleaving index histories across git worktrees.
  • Resource-polite: ticks are budgeted (≤50 files / ≤2 s CPU per tick), run CPU-only (the GPU is reserved for cold full indexing), and the interval auto-tunes from load average, churn, and backlog.
  • sweet-search reconcile status / reconcile inspect <path> explain exactly what the daemon thinks and why. Opt out any time with SWEET_SEARCH_RECONCILE_V2=0.
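
The "single fsync-renamed manifest" pattern is a general technique and can be sketched independently of the project: write the new manifest to a temporary file, fsync it, then rename it over the old one so readers see either the previous epoch or the new one. The file name and manifest fields here are hypothetical.

```python
import json
import os
import tempfile


def publish_manifest(directory, manifest, name="epoch.json"):
    """Atomically replace `directory/name` with the serialized manifest."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(manifest, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # data is on disk before the rename
        final_path = os.path.join(directory, name)
        os.replace(tmp_path, final_path)  # atomic swap on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if hasattr(os, "O_DIRECTORY"):
        # On POSIX, fsync the directory so the rename itself is durable.
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    return final_path
```

Because the tiers are only reachable through the manifest, swapping that one file is what makes an update to several tiers appear as a single step to a concurrent query.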

🦀 The Native Engine Room

Four Rust crates do the heavy lifting, each with a graceful fallback so the engine runs everywhere:

| Crate | What it does |
|-------|--------------|
| sweet-search-native | candle GPU/CPU inference, sparse-gram grep engine, SIMD posting-list intersection, SimHash/MinHash-LSH dedup, HuggingFace tokenizers, all over zero-copy NAPI |
| wasm-maxsim | a hand-written WASM SIMD kernel computing ColBERT MaxSim in ~4 KB (~1.6 KB gzipped), with fused INT8 dequantization inside the SIMD pipeline plus a 4-bit nibble-packed path |
| wasm-router | the 498-tree CatBoost query router, loop-unrolled, zero-allocation |
| sweet-search-cli | a native CLI that talks to a warm search daemon over a per-project Unix socket (2.9 ms measured warm-path queries) |

  • MaxSim, three speeds: scoring auto-selects the best available tier — native Rust + Rayon across all cores (47× vs baseline JS in our microbenchmark), portable WASM SIMD (16×), or a norm-cached pure-JS fallback (3.5×). Equivalent rankings, any platform.
  • SIMD set intersection: posting-list intersection dispatches per-pair — galloping search when one list is ≥8× smaller, 4-wide NEON/SSE2 block merges for balanced lists, scalar merge for small ones — following the Lemire/Clausecker line of work.
  • Dedup at index time: near-duplicate chunks are fingerprinted (64-bit SimHash + 128-permutation MinHash), clustered with banded LSH + union-find, then re-validated pairwise against the exemplar so transitive weak links can't glue unrelated clusters together. Duplicates skip embedding entirely — and at query time the best-matching sibling can take the exemplar's slot, so collapsing copies never hides the right answer.
  • Per-project warm daemon: the CLI derives an isolated socket path from an FNV-1a hash of the project root, auto-starts the server on first use, and falls back to pure JS where no native binary exists (measured: 2.9 ms warm / 108 ms cold / 64.7 ms JS fallback).
  • Native tokenization: the official HuggingFace tokenizers crate over NAPI — batched, cached, no Python anywhere in the stack.
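
The per-pair dispatch for posting-list intersection can be sketched in scalar form. Python has no SIMD, so the balanced-list tier is represented here by an ordinary two-pointer merge; the 8× ratio comes from the description above, while the function names are hypothetical.

```python
from bisect import bisect_left


def merge_intersect(a, b):
    """Two-pointer intersection of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def galloping_intersect(small, large):
    """For each value in `small`, probe `large` with doubling steps, then bisect."""
    out, lo, n = [], 0, len(large)
    for value in small:
        bound = 1
        while lo + bound < n and large[lo + bound] < value:
            bound *= 2
        pos = bisect_left(large, value, lo, min(lo + bound + 1, n))
        if pos < n and large[pos] == value:
            out.append(value)
            lo = pos + 1
        else:
            lo = pos
        if lo >= n:
            break
    return out


def intersect(a, b, gallop_ratio=8):
    """Pick galloping when one list is at least `gallop_ratio` times shorter."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    if not small:
        return []
    if len(large) >= gallop_ratio * len(small):
        return galloping_intersect(small, large)
    return merge_intersect(small, large)
```

Galloping costs roughly a logarithmic search per element of the short list, so it wins when the lists are very unequal and loses to a linear merge when they are similar in length.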

🗜️ INT4 binary segments: the on-disk format behind the RAM-sized index

The quantization headline lives up in indexing: 1.34 GiB → ~396 MiB, INT4-halved again. Here's the SSLX segment format that delivers it: crash-safe by construction, and the three-stage retrieval it feeds at query time.

  • INT4 by default: per-token min/scale quantization with nibble packing (two values per byte), A/B-tested against the INT8 baseline with no meaningful retrieval regression before becoming the default. We borrowed the rotation insight from Google's TurboQuant, but ship plain INT4 — the full TurboQuant algorithm (WHT + PolarQuant + QJL) is researched and deferred, not in the product path.
  • SSLX binary segments: the index persists as ~10k-document binary segment files with structured headers and CRC32 footers — a crash costs you at most one segment, not the index.
  • Three-stage retrieval: a binary HNSW (Hamming distance over 64-byte binarized vectors, ~32× smaller than float HNSW) produces candidates in ~100 µs, INT8 rescoring narrows them, and a float32 sidecar rescores the final pool — speed without giving up top-result quality.
  • Memory-mapped HNSW: the float graph index loads via mmap (USearch view()), contributing 0 MB to the V8 heap at search time; the OS reclaims pages under pressure.
  • Streaming indexer: vectors stream from SQLite cursors instead of materializing in arrays — peak JS heap during indexing dropped from ~785 MB to ~213 MB, with 30-second fsync-ordered checkpoints bounding crash loss. The OOM cliff that used to appear above ~200k chunks is gone; large repos index comfortably on an 8 GB machine.
  • Tuned HNSW parameters and zero-GC search internals (typed-array heaps, generation-stamped visited lists) cut search p50 by 33% while raising recall@200 by 5.9 pp in our internal evaluation (docs/HNSW_APPROACH.md).
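
The coarse-to-fine idea behind the staged retrieval can be sketched with sign binarization and Hamming distance. This is a simplification under stated assumptions: the shortlist is a brute-force scan rather than an HNSW graph, the INT8 middle stage is omitted, and sign-based binarization is an assumed scheme. With 512 dimensions it yields the 64-byte codes mentioned above.

```python
def binarize(vector):
    """Pack one sign bit per dimension (512 floats -> 64 bytes)."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits.to_bytes((len(vector) + 7) // 8, "big")


def hamming(a, b):
    """Number of differing bits between two equal-length codes."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, docs, shortlist=2):
    """Shortlist by Hamming distance, then rescore the survivors in full precision."""
    q_code = binarize(query)
    coarse = sorted(docs, key=lambda name: hamming(q_code, binarize(docs[name])))
    return sorted(coarse[:shortlist], key=lambda name: dot(query, docs[name]), reverse=True)
```

The cheap stage only has to keep the right answer somewhere in the shortlist; the final ordering is decided by the full-precision scores.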

🔌 Works With Your Agent

sweet-search meets your agent wherever it is — shell tools, MCP, or injected instructions:

// .mcp.json (project root) — that's the whole integration
// or just run: sweet-search init --mcp
{
  "mcpServers": {
    "sweet-search": {
      "command": "npx",
      "args": ["-y", "sweet-search-mcp", "--project-root", "/absolute/path/to/your/repo"]
    }
  }
}
  • MCP server — 8 tools (search, trace, read, read-semantic, index, health, repo-map, vocab-prewarm), 2 resources, 2 prompts; all search tools declared read-only and idempotent
  • Harness injection: init writes the evolved system prompt into Claude Code, Codex (--codex, including session hooks), Gemini CLI (--gemini), and Cursor (--cursor) from one canonical source
  • Repo maps for sub-agents — the repo-map tool returns a PageRank-ranked symbol overview squeezed into any token budget, perfect for briefing a delegated agent
  • Warm from the first query — a SessionStart hook pre-launches the search daemon so models, vocabulary, and indexes are loaded before you ask anything
  • Tool routing enforcement (opt-in): init --enforce-tools denies the native Grep tool in Claude Code and installs a hint hook nudging native Read toward ss-read/ss-semantic — for when you want the discipline guaranteed, not suggested.
  • /sweet-index skill: a Claude Code slash command for a full GPU-aware reindex, installed by init.
  • Vocabulary prewarm: sweet-search prewarm-vocab mines your repo's real identifiers, detects code communities (Leiden), and pre-warms all three search modes so even the first semantic query of a session is cache-warm.
  • Honest committed-state: init never writes machine-specific absolute paths into committed settings files, and all instruction injection is marker-delimited and reversible.
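
A marker-delimited, reversible injection can be sketched as a small text transform: replace whatever sits between two markers, or append a new block if none exists. The marker strings and function names are hypothetical; the markers init actually writes may differ.

```python
BEGIN = "<!-- sweet-search:begin -->"  # hypothetical marker text
END = "<!-- sweet-search:end -->"      # hypothetical marker text


def _find_block(document):
    start = document.find(BEGIN)
    end = document.find(END, start + len(BEGIN)) if start != -1 else -1
    return start, end


def upsert_block(document, body):
    """Insert or replace the managed block; text outside the markers is untouched."""
    block = f"{BEGIN}\n{body.strip()}\n{END}"
    start, end = _find_block(document)
    if start != -1 and end != -1:
        return document[:start] + block + document[end + len(END):]
    if not document or document.endswith("\n\n"):
        separator = ""
    else:
        separator = "\n" if document.endswith("\n") else "\n\n"
    return document + separator + block + "\n"


def remove_block(document):
    """Reverse the injection by deleting the managed block, if present."""
    start, end = _find_block(document)
    if start == -1 or end == -1:
        return document
    return document[:start] + document[end + len(END):]
```

Running the upsert twice with the same body produces the same file, which is the property that makes re-running an installer safe.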

🖥️ Platform Support

| Platform | Engine | Acceleration |
|----------|--------|--------------|
| macOS arm64 (Apple Silicon) | native | Metal (M1+) · CoreML Neural Engine (M3+) |
| macOS x64 (Intel) | native | ONNX Runtime INT8 CPU |
| Linux x64 (glibc) | native | CUDA (SM 7.0+, flash-attn on Ampere+) or INT8 CPU |
| Linux arm64 (glibc) | native | CUDA (Jetson Orin / Grace) or INT8 CPU |
| Windows | n/a | via WSL2 (= Linux x64) |
| Everything else | WASM/JS fallback | runs everywhere Node ≥ 18 runs |

Native binaries are selected automatically at npm install time via optionalDependencies — no flags, no postinstall scripts to debug. Every native fast path has a WASM or JS fallback that produces the same results.

🙏 Prior Art & Acknowledgements

sweet-search stands on a lot of shoulders, and we'd rather name them than pretend otherwise:

  • ColBERT (Khattab & Zaharia) — late interaction; LightOn for the LateOn-Code models and the ColGrep concept our pattern mode parallels
  • ripgrep (BurntSushi) — the bar for grep, and our verification baseline
  • GitHub's Blackbird — the sparse n-gram indexing idea we tuned per-codebase
  • candle & MLX — Rust ML and the fused SDPA kernels we build on; HuggingFace tokenizers
  • Aider — the repo-map idea, here rebuilt on a real knowledge graph
  • USearch — memory-mapped HNSW; Malkov & Yashunin for HNSW itself
  • CatBoost — the query router model; Traag et al. for the Leiden algorithm; Cormack et al. for RRF; PathRAG for flow-pruned graph expansion; cAST for structure-aware chunking
  • GEPA — the reflective evolutionary prompt-optimization paradigm behind our agent prompt
  • nomic-ai — the CodeRankEmbed embedding model
  • Anthropic — the Contextual Retrieval idea behind our chunk enrichment, here derived from code structure instead of an LLM summary

📄 License

Apache-2.0 © PanonIT


Found it useful?

If sweet-search saves your agent's tokens, a ⭐ helps other agents' humans find it.
