@speakeasy-api/docs-mcp-eval
v0.6.0
Evaluation and benchmarking harness for docs-mcp search quality metrics
Beta. Part of the docs-mcp monorepo.
Installation
npm install -g @speakeasy-api/docs-mcp-eval
Eval Types
Search-Quality Eval (run)
Measures retrieval-quality metrics (recall, MRR, precision, latency) by driving the MCP server directly over stdio JSON-RPC.
docs-mcp-eval run --cases ./cases.json \
--server-command "docs-mcp-server --index-dir ./index"
- Recall@K — fraction of expected chunks found in the top-K results
- MRR (Mean Reciprocal Rank) — how early the first relevant result appears
- Precision@K — fraction of top-K results that are relevant
- Delta reports — side-by-side comparison between evaluation runs
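The three rank metrics above can be sketched in a few lines. This is an illustrative implementation of the standard definitions, not the package's actual code; the result IDs and expected-chunk sets are made up:

```typescript
// Reciprocal rank: 1 / (1-based rank of the first relevant result), 0 if none appears.
function reciprocalRank(results: string[], relevant: Set<string>): number {
  const idx = results.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Recall@K: fraction of the expected chunks that appear in the top-K results.
function recallAtK(results: string[], relevant: Set<string>, k: number): number {
  const hits = results.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Precision@K: fraction of the top-K results that are relevant.
function precisionAtK(results: string[], relevant: Set<string>, k: number): number {
  const topK = results.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

// Hypothetical run: two expected chunks, first hit at rank 2.
const relevant = new Set(["auth-guide", "token-refresh"]);
const results = ["quickstart", "auth-guide", "errors", "token-refresh", "sdk-ref"];
console.log(reciprocalRank(results, relevant));  // 0.5
console.log(recallAtK(results, relevant, 3));    // 0.5 (1 of 2 expected chunks in top 3)
console.log(precisionAtK(results, relevant, 3)); // 0.3333... (1 of 3 results relevant)
```

MRR (as reported by the eval) is the mean of `reciprocalRank` over all cases in a run.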
See docs/eval.md for the full search-quality eval specification.
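As a rough sketch of what a cases file passed via `--cases` might look like (the authoritative schema lives in docs/eval.md; every field name below is an assumption for illustration):

```json
{
  "cases": [
    {
      "query": "how do I refresh an access token?",
      "expected_chunks": ["auth/token-refresh"],
      "k": 5
    }
  ]
}
```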
Agent Eval (agent-eval)
End-to-end evaluation that spawns an AI coding agent with docs-mcp tools, runs it against a prompt, and evaluates assertions on the output. Validates the full stack — from search quality to how well a real model uses the tools.
Supports multiple agent providers:
| Provider | Flag | Backend | Prerequisites |
|----------|------|---------|---------------|
| Claude | --provider claude | @anthropic-ai/claude-agent-sdk | ANTHROPIC_API_KEY |
| OpenAI Codex | --provider openai | codex exec --json (CLI spawn) | OPENAI_API_KEY + codex on PATH |
| Auto (default) | --provider auto | Detected from env | Whichever key is set |
# Claude (default when ANTHROPIC_API_KEY is set)
docs-mcp-eval agent-eval --suite acmeauth
# OpenAI Codex
docs-mcp-eval agent-eval --suite dub-go --provider openai
# Custom scenario file
docs-mcp-eval agent-eval --scenarios ./my-scenarios.json --docs-dir ./my-docs
Supports contains, not_contains, matches, and script assertions; per-scenario docs sources (local path or git clone); auto-built index caching; and trend comparison against prior runs.
See docs/agent-eval.md for the full agent eval specification.
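To make the assertion types concrete, a custom scenarios file might look roughly like this (field names are hypothetical; docs/agent-eval.md defines the real schema):

```json
{
  "scenarios": [
    {
      "name": "acmeauth-token-refresh",
      "prompt": "Add token refresh to the AcmeAuth client, using the docs.",
      "docs": { "path": "./docs" },
      "assertions": [
        { "type": "contains", "value": "refreshToken" },
        { "type": "not_contains", "value": "TODO" }
      ]
    }
  ]
}
```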
