shortlist

v0.1.2

Published

18 days ago

Dynamic tool selection for AI SDK agents — index your tools and expose only the most relevant ones to the model on every step.


enemyr

shortlist ai ai-sdk vercel-ai-sdk tools tool-selection tool-routing agents llm rag bm25 embeddings semantic-search

shortlist

Dynamic tool selection for AI SDK agents.

When an agent has 50, 100, or 200+ tools, handing the model all of them on every step is a problem twice over: it burns context tokens, and it lowers tool-calling accuracy — the right tool gets lost in the noise. shortlist indexes your tools once and, on each step, exposes only the handful that actually match what the user is trying to do.

import { createToolIndex } from "shortlist";
import { generateText } from "ai";

const index = createToolIndex(allTools); // 200 tools, no API key needed

const result = await generateText({
  model,
  tools: allTools,
  prepareStep: index.prepareStep({ maxTools: 5 }), // model sees the best 5 per step
  prompt: "Refund the last charge for customer Acme.",
});

- Zero-config default: keyword search (BM25 + TF-IDF) that needs no API key and runs in well under a millisecond.
- Better with embeddings: add an embedding model for semantic search; shortlist fuses it with keyword search automatically.
- Drops into the AI SDK: prepareStep, wrapLanguageModel middleware, or a select() you call yourself.
- Built for agent loops: adaptive cutoffs, miss-driven escalation, recently-used tool retention, related-tool expansion, and an in-memory query-embedding cache.
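
To make the keyword path concrete, here is a minimal BM25 ranking over tool names and descriptions. It is a sketch under stated assumptions, not shortlist's implementation: the tokenizer, the `k1` and `b` constants, and the names `ToolDoc`, `tokenize`, and `bm25Rank` are all illustrative.

```typescript
// Hypothetical sketch: BM25 ranking over tool text. Not shortlist's actual code.
type ToolDoc = { name: string; description: string };

// Split camelCase and punctuation into lowercase tokens (assumed tokenizer).
function tokenize(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);
}

function bm25Rank(docs: ToolDoc[], query: string, k1 = 1.2, b = 0.75) {
  const tokenized = docs.map((d) => tokenize(`${d.name} ${d.description}`));
  const avgLen =
    tokenized.reduce((sum, t) => sum + t.length, 0) / Math.max(docs.length, 1);

  // Document frequency: the number of tools in which each term appears.
  const df = new Map<string, number>();
  for (const tokens of tokenized) {
    for (const term of Array.from(new Set(tokens))) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }

  const queryTerms = Array.from(new Set(tokenize(query)));
  return docs
    .map((doc, i) => {
      const tokens = tokenized[i];
      let score = 0;
      for (const term of queryTerms) {
        const tf = tokens.filter((t) => t === term).length;
        if (tf === 0) continue;
        const n = df.get(term) ?? 0;
        const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
        const lengthNorm = 1 - b + (b * tokens.length) / avgLen;
        score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
      }
      return { name: doc.name, score };
    })
    .sort((x, y) => y.score - x.score);
}
```

A tool whose text shares no term with the query scores 0 here, which is why paraphrases and other languages need the embedding path.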

Status: pre-release (0.0.1). The API below is implemented and tested; expect additive changes before 1.0.

Install

npm install shortlist
# peers (you almost certainly already have these):
npm install ai zod

Requires Node 18+, ai >= 4.0, and zod >= 3.25.

How it works

createToolIndex(tools, options?) builds a search index over each tool's name, description, and parameter names. You then pick an integration point:

| You want… | Use | Returns |
| --- | --- | --- |
| The AI SDK to filter tools per step | index.prepareStep(opts) | a prepareStep function |
| Transparent filtering at the provider level | index.middleware(opts) | a LanguageModelMiddleware |
| To select tools yourself | index.select(query, opts) | string[] of tool names |
| To debug why tools ranked as they did | index.selectWithScores(query, opts) | { tools, results } |
| The model to discover tools on demand | index.searchTool() | a callable meta-tool |

Strategies

import { openai } from "@ai-sdk/openai"; // assumes the @ai-sdk/openai provider
import { createToolIndex } from "shortlist";

// hybrid (default when no embeddingModel): free, keyword-based, instant
createToolIndex(tools);

// combined (default when an embeddingModel is given): keyword + semantic, fused
const embeddingModel = openai.embeddingModel("text-embedding-3-small");
createToolIndex(tools, { embeddingModel });

// semantic: embeddings only
createToolIndex(tools, { strategy: "semantic", embeddingModel });

combined runs keyword and semantic search in parallel and fuses the normalized scores. The base weighting is ≈30% keyword / 70% semantic, but by default the keyword weight is scaled per query by how much of it the keyword index actually matches (adaptiveFusion): a paraphrase or a query in another language — where keyword would only be matching noise — fuses as semantic-only, while a query that reuses the tools' words keeps keyword's exact-match anchoring. Tune the base split with fusionWeights, or set adaptiveFusion: false for a fixed ratio. If the embedding call fails, combined falls back to keyword-only so selection never hard-fails — at warm-up or per query.
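
The coverage-scaled weighting described above can be sketched roughly as follows. Only the 0.3 / 0.7 base split comes from this README; the coverage measure, the renormalization step, and the names `keywordCoverage` and `fuseScores` are assumptions made for illustration, so the library's actual rule may differ.

```typescript
// Hypothetical sketch of coverage-scaled score fusion. Not shortlist's actual code.
type Scores = Record<string, number>; // tool name -> normalized score in [0, 1]

// Assumed coverage measure: fraction of query terms present in the keyword vocabulary.
function keywordCoverage(queryTerms: string[], vocabulary: Set<string>): number {
  if (queryTerms.length === 0) return 0;
  const hits = queryTerms.filter((term) => vocabulary.has(term)).length;
  return hits / queryTerms.length;
}

function fuseScores(
  keyword: Scores,
  semantic: Scores,
  coverage: number,
  base = { keyword: 0.3, semantic: 0.7 }
): Scores {
  // Scale the keyword weight by coverage, then renormalize so the weights sum to 1.
  const keywordWeight = base.keyword * Math.min(Math.max(coverage, 0), 1);
  const total = keywordWeight + base.semantic;
  if (total === 0) return {};

  const fused: Scores = {};
  const names = Array.from(
    new Set([...Object.keys(keyword), ...Object.keys(semantic)])
  );
  for (const name of names) {
    const k = keyword[name] ?? 0;
    const s = semantic[name] ?? 0;
    fused[name] = (keywordWeight * k + base.semantic * s) / total;
  }
  return fused;
}
```

With a coverage of 0 the keyword term vanishes and the result equals the semantic scores, which mirrors the "fuses as semantic-only" behavior for paraphrases and other languages.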

When you use embeddings, call await index.warmUp() once at startup to pre-compute tool embeddings (and persist them with an embedding cache) so the first select() isn't slow.

`prepareStep` — the main path

import { generateText, stepCountIs } from "ai";

const result = await generateText({
  model,
  tools: allTools,
  prepareStep: index.prepareStep({
    maxTools: 5,
    alwaysActive: ["getCurrentUser"], // always exposed, regardless of the query
    recentToolBoost: 3,               // keep up to 3 recently-used tools available
  }),
  stopWhen: stepCountIs(8),
  prompt,
});

prepareStep re-selects tools on every step from the live conversation, and adds two behaviors that matter inside real agent loops:

- Miss-driven escalation. If the model produced no tool calls last step, the next step shows the next page of ranked tools instead of repeating the same set. After two consecutive misses, all tools are exposed so the agent is never stuck.
- Recently-used retention (recentToolBoost, opt-in). A fresh per-step selection can otherwise strip away a tool the agent just used successfully. Set recentToolBoost to keep the most-recently-used tools active across steps. The list is derived entirely from the steps the SDK passes in, so there's no hidden session state. recentToolWindow (default 3) controls how far back it looks.
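
The two behaviors above could look roughly like this in isolation. This is a sketch, not shortlist's code: the helper names `escalate` and `recentTools`, the paging rule, and the minimal `Step` shape (modeled on the `toolCalls` entries the AI SDK reports per step) are assumptions.

```typescript
// Hypothetical sketch of miss-driven escalation and recently-used retention.

// Choose which slice of the ranked tool list to expose, given how many
// consecutive steps ended without a tool call.
function escalate(ranked: string[], maxTools: number, consecutiveMisses: number): string[] {
  if (consecutiveMisses >= 2) return ranked; // two misses in a row: expose everything
  const start = consecutiveMisses * maxTools; // 0 -> first page, 1 -> next page
  const page = ranked.slice(start, start + maxTools);
  return page.length > 0 ? page : ranked; // no further page: fall back to all tools
}

// Minimal stand-in for a step: only the tool names that were called.
type Step = { toolCalls: { toolName: string }[] };

// Collect up to `boost` distinct tool names from the last `window` steps,
// newest step first. Everything is derived from `steps`; nothing is stored.
function recentTools(steps: Step[], boost: number, window = 3): string[] {
  if (boost <= 0 || window <= 0) return [];
  const recent: string[] = [];
  for (const step of steps.slice(-window).reverse()) {
    for (const call of step.toolCalls) {
      if (!recent.includes(call.toolName)) recent.push(call.toolName);
      if (recent.length === boost) return recent;
    }
  }
  return recent;
}
```

Because `recentTools` is a pure function of the step history, re-running the same conversation yields the same active set, which is the point of avoiding hidden session state.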

`select` and `selectWithScores`

Call the selector directly when you manage activeTools yourself:

const names = await index.select("ship it to prod", { maxTools: 5 });
// → ["deployToVercel", ...]

selectWithScores returns the same names plus per-tool provenance — invaluable for tuning and debugging "why did that tool rank there?":

const { tools, results } = await index.selectWithScores("create an invoice", {
  maxTools: 5,
  relatedTools: { createInvoice: ["queryCustomers"] },
});

for (const r of results) {
  console.log(r.name, r.score, {
    reranked: r.reranked,     // an LLM reranker produced this ordering
    viaRelated: r.viaRelated, // pulled in as a companion of a selected tool
    alwaysActive: r.alwaysActive, // pinned, not matched by search
  });
}

`SelectOptions`

| Option | Default | Description |
| --- | --- | --- |
| maxTools | 5 | Maximum tools to return. |
| alwaysActive | [] | Tool names always included. |
| threshold | (none) | Drop results scoring below this. |
| adaptive | true | Return fewer than maxTools when there's a clear score gap (the "elbow"). Automatically skipped when a reranker is active, since the reranker already decides the top-N. |
| relatedTools | (none) | Per-call override of the index-level related-tools map. |

PrepareStepOptions extends SelectOptions with recentToolBoost and recentToolWindow.
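
To illustrate what an "elbow" cutoff for the adaptive option might do, here is a small sketch. The relative-gap rule, the 0.5 ratio, and the name `adaptiveCut` are assumptions chosen for illustration; the README does not specify how shortlist detects the gap.

```typescript
// Hypothetical elbow cutoff. Not shortlist's actual rule.
type Scored = { name: string; score: number };

function adaptiveCut(results: Scored[], maxTools: number, minGapRatio = 0.5): Scored[] {
  // Work on a sorted copy so the caller's array is left untouched.
  const top = results
    .slice()
    .sort((x, y) => y.score - x.score)
    .slice(0, maxTools);

  for (let i = 1; i < top.length; i++) {
    const previous = top[i - 1].score;
    // Cut where the score drops by more than minGapRatio relative to the previous result.
    if (previous > 0 && (previous - top[i].score) / previous > minGapRatio) {
      return top.slice(0, i);
    }
  }
  return top; // no clear gap: keep up to maxTools
}
```

With scores 0.9, 0.85, 0.2, 0.1 this keeps the first two tools; with evenly spaced scores it returns the full `maxTools`.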

Related tools

Some tools travel together — selecting createInvoice is useless without queryCustomers. Declare those links and shortlist pulls companions in whenever the key tool is selected (expansion is one-directional and can exceed maxTools):

const index = createToolIndex(tools, {
  relatedTools: {
    createInvoice: ["queryCustomers", "sendEmail"],
  },
});
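
As a rough picture of one-directional expansion, the sketch below appends companions of each selected tool without following links backwards or transitively. The function name `expandRelated` and the non-transitive behavior are assumptions; the README only states that expansion is one-directional and can exceed maxTools.

```typescript
// Hypothetical sketch of related-tool expansion. Not shortlist's actual code.
function expandRelated(
  selected: string[],
  related: Record<string, string[]>
): string[] {
  const expanded = [...selected];
  // Iterate over the original selection only, so companions of companions
  // are not pulled in (assumed non-transitive).
  for (const name of selected) {
    for (const companion of related[name] ?? []) {
      if (!expanded.includes(companion)) expanded.push(companion);
    }
  }
  return expanded;
}
```

Selecting createInvoice brings in queryCustomers and sendEmail, while selecting queryCustomers alone brings in nothing, because the link is declared only on createInvoice.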

LLM reranking & enrichment (optional)

For maximum accuracy on messy, slangy queries, add a cheap model:

const index = createToolIndex(tools, {
  embeddingModel: openai.embeddingModel("text-embedding-3-small"),
  rerankerModel: openai("gpt-4o-mini"),
  enrichDescriptions: true, // expand descriptions with synonyms at warmUp (cached)
});
await index.warmUp();

The reranker re-scores the top candidates with reasoning a pure vector match can't do (when there are ≤50 tools it sees all of them). enrichDescriptions rewrites each tool's description once with synonyms and common phrasings. Both fail soft — set SHORTLIST_DEBUG=1 to log when they fall back.

Caching embeddings

Persist tool embeddings across restarts so you don't re-pay the embedding API:

import { createToolIndex, fileCache } from "shortlist";

const index = createToolIndex(tools, {
  embeddingModel,
  embeddingCache: fileCache(".shortlist-cache.json"),
});
await index.warmUp();

shortlist also keeps an in-memory query-embedding cache (semantic/combined strategies) so repeated queries in an agent loop skip the embedding call. It's on by default (50 entries, exact-match on the normalized query); tune with queryCacheSize or set it to 0 to disable.
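
A bounded, exact-match cache keyed on a normalized query might be sketched like this. The normalization (trim, lowercase, collapse whitespace), the least-recently-used eviction, and the class name `QueryEmbeddingCache` are assumptions; the README states only the default size of 50 and exact matching on the normalized query.

```typescript
// Hypothetical sketch of a query-embedding cache. Not shortlist's actual code.
class QueryEmbeddingCache {
  private entries = new Map<string, number[]>();
  private maxSize: number;

  constructor(maxSize = 50) {
    this.maxSize = maxSize;
  }

  // Assumed normalization: trim, lowercase, collapse runs of whitespace.
  private static normalize(query: string): string {
    return query.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(query: string): number[] | undefined {
    const key = QueryEmbeddingCache.normalize(query);
    const hit = this.entries.get(key);
    if (hit !== undefined) {
      // Re-insert to mark as most recently used (Map keeps insertion order).
      this.entries.delete(key);
      this.entries.set(key, hit);
    }
    return hit;
  }

  set(query: string, embedding: number[]): void {
    if (this.maxSize <= 0) return; // size 0 disables caching
    const key = QueryEmbeddingCache.normalize(query);
    this.entries.delete(key);
    this.entries.set(key, embedding);
    if (this.entries.size > this.maxSize) {
      // The first key in insertion order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

In an agent loop the same user request is typically re-embedded on every step, so even a small cache like this removes most repeated embedding calls.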

Letting the model discover tools

Expose a meta-tool the model can call when it needs a capability that isn't in its current selection:

const tools = { ...alwaysOnTools, search_tools: index.searchTool() };

Evaluating accuracy

import { evalToolIndex } from "shortlist/eval";

const report = await evalToolIndex(index, [
  { query: "create a ticket", expected: "createJiraTicket" },
  { query: "ship it", expected: "deployToVercel", alternatives: ["deployToProd"] },
]);
console.log(report); // { top1, top3, top5, avgLatencyMs, misses }
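
For readers who want to see what the top-k figures mean, here is a sketch of how such accuracy numbers can be computed from already-ranked tool names. It does not call shortlist and is not how evalToolIndex is implemented; the `RankedCase` shape and the `topKAccuracy` name are assumptions, and latency is left out.

```typescript
// Hypothetical sketch of top-k accuracy over pre-ranked results.
type RankedCase = {
  expected: string;
  alternatives?: string[];
  ranked: string[]; // tool names, best first
};

function topKAccuracy(cases: RankedCase[]) {
  let top1 = 0;
  let top3 = 0;
  let top5 = 0;
  const misses: RankedCase[] = [];

  for (const testCase of cases) {
    // A hit is the expected tool or any accepted alternative.
    const accepted = [testCase.expected, ...(testCase.alternatives ?? [])];
    const rank = testCase.ranked.findIndex((name) => accepted.includes(name));
    if (rank === 0) top1++;
    if (rank >= 0 && rank < 3) top3++;
    if (rank >= 0 && rank < 5) top5++;
    if (rank < 0 || rank >= 5) misses.push(testCase);
  }

  const total = Math.max(cases.length, 1);
  return { top1: top1 / total, top3: top3 / total, top5: top5 / total, misses };
}
```

Each metric is the fraction of cases whose first accepted tool appears within the top 1, 3, or 5 ranked names.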

Benchmarks

There's a real, reproducible benchmark in bench/ — a ~215-tool corpus and a labeled query set split by query type, so you can see when each strategy wins. Run npm run bench (keyword only, no key) or npm run bench:write (adds the embedding modes when OPENAI_API_KEY is set, and regenerates bench/RESULTS.md).

Top-5 accuracy by query type (215 tools, 99 queries, text-embedding-3-small):

| query type | keyword | combined | semantic |
| --- | --- | --- | --- |
| lexical (reuses the tool's words) | 100% | 100% | 100% |
| exact-name (DynamoDB, Lambda, …) | 100% | 100% | 100% |
| typo / misspelling | 80% | 100% | 100% |
| acronym (k8s, PR, FX) | 67% | 100% | 83% |
| paraphrase (same intent, new words) | 10% | 90% | 90% |
| verbose / noisy phrasing | 25% | 88% | 88% |
| multilingual (sv/fr/es/de) | 25% | 100% | 100% |

Keyword search is perfect when the query reuses the tool's words, and it is effectively free; embeddings are what carry paraphrases, noisy phrasing, and other languages. combined (with adaptive fusion) is the best all-rounder: it matches semantic recall and keeps keyword's exact-match anchoring where pure semantic slips (acronyms: 100% vs 83%, top-5).

API reference

function createToolIndex<T extends ToolSet>(tools: T, options?: ToolIndexOptions): ToolIndex<T>;

interface ToolIndex<T> {
  toolNames: (keyof T & string)[];
  warmUp(): Promise<void>;
  select(query: string, options?: SelectOptions): Promise<string[]>;
  selectWithScores(query: string, options?: SelectOptions): Promise<ScoredSelection>;
  prepareStep(options?: PrepareStepOptions): PrepareStepFunction<T>;
  middleware(options?: SelectOptions): LanguageModelMiddleware;
  searchTool(): Tool;
}

interface ToolIndexOptions {
  strategy?: "hybrid" | "semantic" | "combined";
  embeddingModel?: EmbeddingModel;
  embeddingCache?: EmbeddingCacheOptions;
  rerankerModel?: LanguageModel;
  enrichDescriptions?: boolean;
  relatedTools?: Record<string, string[]>;
  queryCacheSize?: number;
  fusionWeights?: { keyword?: number; semantic?: number }; // combined base split, default 0.3/0.7
  adaptiveFusion?: boolean; // scale keyword weight by per-query coverage, default true
}
