toolsift

v0.1.0 · published 17 days ago

abdulmunimjemal

mcp model-context-protocol mcp-proxy tool-loading lazy-tools rag-mcp tool-retrieval ai-agent claude-code cursor codex context-window token-efficiency bm25

toolsift

A local, model-agnostic MCP proxy that fixes tool-definition bloat. It sits between your agent and N upstream MCP servers and exposes a tiny surface — search_tools + invoke_tool — so the agent retrieves only the few tools it needs instead of paying for 50+ JSON definitions up front.

When an agent is wired to several MCP servers, their tool definitions can eat a huge slice of the context window before the first message — schemas, descriptions, and parameter lists for dozens of tools the agent won't use this turn. toolsift connects to your upstream servers as a client, aggregates their tools into a local index, and re-exposes just two meta-tools to the agent. The agent calls search_tools("open a github issue") to pull back the handful of relevant definitions, then invoke_tool(...) to run one. This is the RAG-MCP / lazy-tool-loading pattern.

100% local. No API keys, no network beyond your own upstream servers, no telemetry. BM25 retrieval is the zero-dependency default.

Why this matters

RAG-MCP showed that retrieving tools on demand instead of loading them all raises tool-selection accuracy (from ~14% with the all-tools baseline to ~43% with retrieval) while cutting prompt tokens by ~50%. The intuition: a smaller, relevant tool list is easier for the model to choose from and cheaper to carry. toolsift brings that pattern to any MCP client, and ships an eval harness so you can measure the token side of it on your own toolset.

Install

npx toolsift mcp                 # zero-install, or:
npm i -g toolsift                # then the `toolsift` command is on your PATH

Requires Node ≥ 18.

Quick start

1. Tell toolsift about your upstream servers. Create toolsift.json in your project (see toolsift.example.json):

{
  "servers": [
    {
      "name": "github",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_…" }
    },
    {
      "name": "fs",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "."]
    }
  ],
  "pinned": ["read_file"],
  "retriever": "bm25",
  "topK": 5
}

pinned tools are exposed directly (pass-through) so always-needed tools skip the search. HTTP upstreams work too: { "name": "remote", "url": "https://…/mcp" }.
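The config above can be described as TypeScript types. This is a hypothetical sketch inferred only from the example file and the HTTP variant just mentioned:

- The type names (ToolsiftConfig, ServerConfig) and the optional fields are assumptions, not toolsift's published types.
- The URL is a placeholder.
- It shows how an entry with "url" can be told apart from a stdio entry that has "command".

```typescript
// Hypothetical shape of toolsift.json, inferred from the example config;
// toolsift's real types may differ.
type StdioServer = {
  name: string;
  command: string;
  args?: string[];
  env?: Record<string, string>;
};

type HttpServer = {
  name: string;
  url: string;
};

type ServerConfig = StdioServer | HttpServer;

interface ToolsiftConfig {
  servers: ServerConfig[];
  pinned?: string[];
  retriever?: "bm25" | "embeddings" | "hybrid";
  topK?: number;
}

// A "url" key marks an HTTP upstream; anything else is spawned over stdio.
function transportOf(server: ServerConfig): "http" | "stdio" {
  return "url" in server ? "http" : "stdio";
}

// https://example.com/mcp is a placeholder, not a real upstream.
const example: ToolsiftConfig = JSON.parse(`{
  "servers": [
    { "name": "fs", "command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "."] },
    { "name": "remote", "url": "https://example.com/mcp" }
  ],
  "pinned": ["read_file"],
  "retriever": "bm25",
  "topK": 5
}`);

for (const s of example.servers) {
  console.log(s.name, transportOf(s)); // fs stdio, then remote http
}
```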

2. Wire toolsift into your agents — the one-liner:

npx toolsift install             # adds toolsift to Claude Code + Cursor

That writes the MCP server config (non-destructively). Restart the agent and you're done. --agent claude|cursor targets one; --global writes to your home dir instead of the project.

Prefer to wire it by hand? The server command is npx -y toolsift mcp.

Claude Code (.mcp.json or claude mcp add):

{
  "mcpServers": {
    "toolsift": { "command": "npx", "args": ["-y", "toolsift", "mcp"] }
  }
}

Cursor / Codex / any MCP client: use the same command — npx -y toolsift mcp over stdio.

MCP tools the agent sees

| tool | what it does |
|------|--------------|
| search_tools(query, k?) | retrieve the top-k relevant upstream tool definitions (server, name, description, inputSchema) for a task |
| invoke_tool(server, name, arguments) | run one upstream tool and proxy its result back unchanged |
| list_servers() | summary of the upstream servers and their tool counts (orientation) |
| each pinned tool | exposed directly as a pass-through, no search needed |

Tool descriptions are written for the agent — they nudge it to search first and invoke second, instead of expecting every tool to be in context.
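On the wire, search-then-invoke is two MCP tools/call requests sent to toolsift. This is a sketch of those JSON-RPC messages.

- The argument names query, k, server, name and arguments come from the table above.
- The create_issue arguments and all concrete values are illustrative only.
- The initialize handshake an MCP client performs first is omitted.

```typescript
// Shape of an MCP JSON-RPC "tools/call" request.
interface ToolsCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function toolsCall(id: number, name: string, args: Record<string, unknown>): ToolsCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Step 1: ask toolsift for a few candidate tool definitions.
const search = toolsCall(1, "search_tools", { query: "open a github issue", k: 3 });

// Step 2: after reading the returned schemas, run one upstream tool.
// The inner arguments are illustrative; use whatever inputSchema search returned.
const invoke = toolsCall(2, "invoke_tool", {
  server: "github",
  name: "create_issue",
  arguments: { owner: "octocat", repo: "hello-world", title: "Crash on startup" },
});

// Over stdio, each message is one line of JSON.
console.log(JSON.stringify(search));
console.log(JSON.stringify(invoke));
```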

The eval harness — measure it on your own toolset

This is the part to actually run. toolsift eval is a deterministic, no-API-key benchmark that scores retrieval against a labelled query→tool dataset and reports, per retriever:

recall@k — is the right tool in the top-k retrieved?
MRR — how high does it rank, on average?
token reduction — tokens of the all-tools surface (what your agent pays today) vs the surface toolsift exposes, via gpt-tokenizer.
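The three metrics can be written out in a few lines. This is a hypothetical scoring sketch, not toolsift's code, and it makes three assumptions:

- Each query has exactly one expected tool.
- Rankings are lists of "server/tool" ids.
- A tool missing from the ranking contributes 0 to MRR (toolsift's exact handling is not stated here).

The token figure reuses the BM25 row of the sample report below: 223 of 1342 tokens is about 83% saved.

```typescript
// One evaluated query: the expected "server/tool" id and the retriever's ranking.
type Ranked = { expected: string; ranking: string[] };

// Share of queries whose expected tool appears in the top k.
function recallAtK(results: Ranked[], k: number): number {
  const hits = results.filter((r) => r.ranking.slice(0, k).includes(r.expected)).length;
  return hits / results.length;
}

// Mean of 1 / rank of the expected tool (rank is 1-based; a miss scores 0).
function meanReciprocalRank(results: Ranked[]): number {
  let sum = 0;
  for (const r of results) {
    const idx = r.ranking.indexOf(r.expected);
    sum += idx === -1 ? 0 : 1 / (idx + 1);
  }
  return sum / results.length;
}

// Fraction of the all-tools surface no longer sent to the model.
function tokenReduction(baselineTokens: number, exposedTokens: number): number {
  return 1 - exposedTokens / baselineTokens;
}

const results: Ranked[] = [
  { expected: "github/create_issue", ranking: ["github/create_issue", "github/update_issue"] },
  { expected: "slack/send_message", ranking: ["slack/list_channels", "slack/send_message"] },
  { expected: "fs/read_file", ranking: ["fs/write_file", "fs/list_directory"] },
];

console.log(recallAtK(results, 5).toFixed(2));                   // 0.67
console.log(meanReciprocalRank(results).toFixed(2));             // 0.50
console.log((tokenReduction(1342, 223) * 100).toFixed(0) + "%"); // 83%
```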

Run it with no arguments to score BM25 on the built-in toolset (28 tools across GitHub / Slack / filesystem / Postgres / calendar), or pass --embeddings to also score the semantic and hybrid retrievers:

$ toolsift eval --embeddings
tools: 28 · queries: 28 · k: 5 · baseline surface: 1342 tokens

retriever   recall@5  mrr   tokens  saved
----------  --------  ----  ------  -----
all-tools   1.00      1.00  1342    0%
bm25        0.89      0.67  223     83%
embeddings  1.00      0.90  242     82%
hybrid      0.96      0.80  236     82%

→ embeddings: recall@5 1.00 · 82% fewer tokens than loading every tool

The all-tools baseline trivially has recall 1.0 and 0% reduction — that's what every agent pays right now. Retrieval keeps the right tool in the top-k while cutting the surface the model carries by ~80%. On a realistic 60-tool multi-server setup that becomes ~92% fewer tokens at 0.83 recall@5 (embeddings). Full methodology, per-domain generalization, and honest caveats are in BENCHMARKS.md.

Point it at your own toolset; this is the report worth sharing:

toolsift eval --config toolsift.json --dataset my-queries.json

where my-queries.json is an array (or { "examples": [...] }) of:

[
  { "query": "open a bug report", "expectedServer": "github", "expectedTool": "create_issue" },
  { "query": "post to the team channel", "expectedServer": "slack", "expectedTool": "send_message" }
]
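A loader for the dataset has to accept both shapes described above. This is a sketch under assumptions:

- The field names come from the example.
- The validation rules and error messages are this sketch's own, not toolsift's.

```typescript
// One labelled query, using the field names from the example dataset.
interface EvalExample {
  query: string;
  expectedServer: string;
  expectedTool: string;
}

function isExample(x: unknown): x is EvalExample {
  if (typeof x !== "object" || x === null) return false;
  const o = x as Record<string, unknown>;
  return (
    typeof o["query"] === "string" &&
    typeof o["expectedServer"] === "string" &&
    typeof o["expectedTool"] === "string"
  );
}

// Accept either a bare array or { "examples": [...] }, and validate every entry.
function parseDataset(json: string): EvalExample[] {
  const data: unknown = JSON.parse(json);
  let list: unknown[];
  if (Array.isArray(data)) {
    list = data;
  } else if (
    typeof data === "object" &&
    data !== null &&
    Array.isArray((data as { examples?: unknown }).examples)
  ) {
    list = (data as { examples: unknown[] }).examples;
  } else {
    throw new Error('dataset must be an array or { "examples": [...] }');
  }
  return list.map((item, i) => {
    if (!isExample(item)) {
      throw new Error(`example ${i} needs string query, expectedServer and expectedTool`);
    }
    return item;
  });
}

const parsed = parseDataset('{ "examples": [ { "query": "post to the team channel", "expectedServer": "slack", "expectedTool": "send_message" } ] }');
console.log(parsed[0].expectedTool); // send_message
```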

toolsift connects to your configured upstreams, pulls their real tool list, and tells you exactly how many tokens you'd save and whether retrieval still finds the tools you need. An end-to-end --llm accuracy mode (real model, needs an API key) is planned; the default eval stays deterministic and key-free.

How it works

Aggregate. toolsift connects to each configured upstream MCP server as a client (stdio or HTTP), calls tools/list, and builds one in-memory registry of { server, name, description, inputSchema }.
Index. The registry is indexed by a pluggable retriever. The default is a self-contained BM25 over each tool's name (weighted), description, and parameter names — zero dependencies, instant, and tuned for the camelCase / snake_case identifiers tools use. The corpus is tiny (dozens to hundreds of tools), so it lives in memory; no vector DB.
Expose. toolsift runs as its own MCP server over stdio and advertises search_tools, invoke_tool and list_servers (+ any pinned tools), not the upstream definitions. The agent searches, gets a few schemas, and invokes.
Proxy. invoke_tool forwards the call to the owning upstream and returns its result unchanged.
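The Index step can be sketched as a small BM25 over tool definitions. This is a hypothetical illustration of the technique, not toolsift's source. Its assumptions:

- Name tokens are repeated NAME_WEIGHT = 3 times to weight the name.
- Parameter names are read from inputSchema.properties.
- k1 = 1.2 and b = 0.75 are the usual textbook defaults.
- The three tool definitions are made up for the example.

```typescript
// Registry entry, mirroring { server, name, description, inputSchema }.
interface ToolDef {
  server: string;
  name: string;
  description: string;
  inputSchema?: { properties?: Record<string, unknown> };
}

const NAME_WEIGHT = 3; // assumed weighting for the tool name
const K1 = 1.2;
const B = 0.75;

// Split camelCase and snake_case identifiers, then lowercase:
// "sendMessage" -> ["send", "message"], "create_issue" -> ["create", "issue"].
function tokenize(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// One bag of terms per tool: weighted name, description, parameter names.
function toolTokens(tool: ToolDef): string[] {
  const nameTokens = tokenize(tool.name);
  const paramTokens = Object.keys(tool.inputSchema?.properties ?? {}).flatMap(tokenize);
  const weightedName: string[] = [];
  for (let i = 0; i < NAME_WEIGHT; i++) weightedName.push(...nameTokens);
  return [...weightedName, ...tokenize(tool.description), ...paramTokens];
}

function bm25Search(tools: ToolDef[], query: string, k: number): { tool: ToolDef; score: number }[] {
  const docs = tools.map(toolTokens);
  const n = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / n;

  // Document frequency: how many tools contain each term at least once.
  const df = new Map<string, number>();
  for (const d of docs) for (const t of new Set(d)) df.set(t, (df.get(t) ?? 0) + 1);

  const queryTerms = tokenize(query);
  return docs
    .map((d, i) => {
      const tf = new Map<string, number>();
      for (const t of d) tf.set(t, (tf.get(t) ?? 0) + 1);
      let score = 0;
      for (const t of queryTerms) {
        const f = tf.get(t) ?? 0;
        if (f === 0) continue;
        const dft = df.get(t) ?? 0;
        const idf = Math.log(1 + (n - dft + 0.5) / (dft + 0.5)); // always > 0
        score += (idf * f * (K1 + 1)) / (f + K1 * (1 - B + (B * d.length) / avgLen));
      }
      return { tool: tools[i], score };
    })
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Made-up tool definitions for illustration.
const tools: ToolDef[] = [
  { server: "github", name: "create_issue", description: "Create a new issue in a GitHub repository",
    inputSchema: { properties: { owner: {}, repo: {}, title: {} } } },
  { server: "slack", name: "sendMessage", description: "Post a message to a Slack channel",
    inputSchema: { properties: { channel: {}, text: {} } } },
  { server: "fs", name: "read_file", description: "Read the contents of a file",
    inputSchema: { properties: { path: {} } } },
];

console.log(bm25Search(tools, "open a github issue", 1)[0].tool.name); // create_issue
```

Because the corpus is only dozens to hundreds of tools, rebuilding the term statistics per query, as this sketch does, stays cheap. A real index would precompute df and the per-tool term frequencies once.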

Programmatic API

Everything is importable:

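import { readFileSync } from "node:fs";
import { UpstreamManager, createServer, createRetriever, runEval } from "toolsift";

// ESM module with top-level await. toolsift.json is the Quick start config;
// my-queries.json is the eval dataset (array form: { query, expectedServer, expectedTool }).
const config = JSON.parse(readFileSync("toolsift.json", "utf8"));
const dataset = JSON.parse(readFileSync("my-queries.json", "utf8"));

const manager = new UpstreamManager(config.servers);
await manager.connect();                           // connect to each configured upstream

const retriever = createRetriever("hybrid");       // or "bm25" | "embeddings"
await retriever.index(manager.tools());
const hits = await retriever.search("open a github issue", 5); // → scored tool refs

// inject your own embedding backend (model-agnostic);
// myApi is a placeholder for your own embedding call
const custom = createRetriever("embeddings", { embed: async (texts) => myApi(texts) });

// build the proxy MCP server directly:
const server = createServer({ manager, config });

// or benchmark a toolset yourself (compare retrievers):
const report = await runEval(manager.tools(), dataset, {
  k: 5,
  retrievers: ["all-tools", "bm25", "embeddings", "hybrid"],
});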

Retrievers

| retriever | needs | notes |
|-----------|-------|-------|
| bm25 | nothing (default) | self-contained lexical retrieval; zero-dependency, in-memory, instant. Excels at identifier/keyword queries. |
| embeddings | @huggingface/transformers (optional peer) | semantic retrieval via a small model run locally (no API key, model fetched once and cached). Catches paraphrases BM25 misses. |
| hybrid | @huggingface/transformers | fuses BM25 + embeddings rankings with Reciprocal Rank Fusion, combining lexical and semantic matches. |
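Reciprocal Rank Fusion, named in the hybrid row, merges ranked lists without needing comparable scores. Each list adds 1 / (c + rank) for every tool it contains, and tools are re-sorted by the sum. This is a generic sketch:

- c = 60 is the constant from the original RRF paper; toolsift's actual constant is not documented here.
- The two input rankings are made up.

```typescript
// Reciprocal Rank Fusion over any number of ranked id lists.
// c = 60 is the commonly used constant (an assumption for toolsift).
function reciprocalRankFusion(rankings: string[][], c = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      // ranks are 1-based, so position i contributes 1 / (c + i + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (c + i + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// Made-up rankings from a lexical and a semantic retriever.
const bm25Ranking = ["github/create_issue", "github/update_issue", "slack/send_message"];
const embeddingRanking = ["github/update_issue", "github/add_comment", "github/create_issue"];

console.log(reciprocalRankFusion([bm25Ranking, embeddingRanking]));
// [ 'github/update_issue', 'github/create_issue', 'github/add_comment', 'slack/send_message' ]
```

Tools ranked well by both retrievers rise to the top. A tool only one retriever found still survives in the fused list.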

npm i @huggingface/transformers     # opt in to embeddings / hybrid

Set "retriever": "hybrid" in toolsift.json. The embedding backend is pluggable — inject any EmbedFn (a hosted embedding API, a different local model) via the programmatic API; BM25 stays the zero-install default so the core package is lean.
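Any embedding backend has to satisfy the same contract: take a batch of texts and return one vector per text, in order. This sketch shows that contract and the cosine ranking an embeddings retriever typically performs. Its assumptions:

- The EmbedFn type here is inferred from the async (texts) => ... example in the programmatic API and may differ from toolsift's definition.
- toyEmbed is a hypothetical hashed bag-of-words stand-in that only illustrates the contract. Unlike a real model, it catches no paraphrases.

```typescript
// Assumed contract: one vector per input text, in the same order.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Hypothetical toy backend: hash each word into one of DIM buckets.
const DIM = 64;
const toyEmbed: EmbedFn = async (texts) =>
  texts.map((text) => {
    const v = new Array<number>(DIM).fill(0);
    for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      let h = 0;
      for (const ch of word) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
      v[h % DIM] += 1;
    }
    return v;
  });

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Semantic retrieval in miniature: embed tool texts, embed the query,
// rank by cosine similarity.
async function embedSearch(embed: EmbedFn, docs: string[], query: string, k: number) {
  const docVecs = await embed(docs);
  const [q] = await embed([query]);
  return docs
    .map((text, i) => ({ text, score: cosine(q, docVecs[i]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

embedSearch(toyEmbed, ["create issue in github", "send slack message"], "github issue", 1)
  .then((top) => console.log(top[0].text)); // create issue in github
```

Swapping toyEmbed for a function that calls a hosted embedding API or a local model changes retrieval quality but not the surrounding code. That is the point of keeping the backend pluggable.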

Limitations

The default retriever is lexical (BM25) and excels at identifier/keyword queries; install @huggingface/transformers and set "retriever": "hybrid" for paraphrase-heavy toolsets (it fuses lexical + semantic).
Tool definitions are aggregated at startup (tools/list); a refresh hook exists (UpstreamManager.refresh()) but live upstream tool-list change subscriptions aren't wired into the proxy yet.
pinned matches by tool name across all servers, so if two upstreams expose a tool with the same name, the pin is ambiguous; check your config for name collisions before pinning.
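A quick way to catch the pinned-name ambiguity is to group the aggregated registry by tool name. This is a hypothetical helper, not part of toolsift:

- The { server, name } shape mirrors the registry entries described under How it works.
- The s3 server is made up.

```typescript
// Minimal registry entry: which server exposes which tool name.
interface ToolRef {
  server: string;
  name: string;
}

// Return each pinned name that more than one upstream exposes, with its servers.
function ambiguousPins(tools: ToolRef[], pinned: string[]): Map<string, string[]> {
  const owners = new Map<string, string[]>();
  for (const t of tools) owners.set(t.name, [...(owners.get(t.name) ?? []), t.server]);

  const result = new Map<string, string[]>();
  for (const name of pinned) {
    const servers = owners.get(name) ?? [];
    if (servers.length > 1) result.set(name, servers);
  }
  return result;
}

const registry: ToolRef[] = [
  { server: "fs", name: "read_file" },
  { server: "s3", name: "read_file" }, // hypothetical second upstream
  { server: "github", name: "create_issue" },
];

for (const [name, servers] of ambiguousPins(registry, ["read_file", "create_issue"])) {
  console.log(`${name}: ${servers.join(", ")}`); // read_file: fs, s3
}
```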

Roadmap

--llm eval mode — end-to-end tool-selection accuracy with a real model.
Live tool-list refresh — react to upstream notifications/tools/list_changed.
Persistent embedding cache — reuse tool embeddings across restarts.

Contributing

Contributions are very welcome — toolsift is small, fully tested, and has no network in its test suite. See CONTRIBUTING.md.

pnpm install && pnpm test && pnpm typecheck && pnpm build
