@speakeasy-api/docs-mcp-eval
v0.6.0
Evaluation and benchmarking harness for docs-mcp search quality metrics
Beta. Part of the docs-mcp monorepo.
Installation
npm install -g @speakeasy-api/docs-mcp-eval
Eval Types
Search-Quality Eval (run)
Measures retrieval-quality metrics (recall, MRR, precision, latency) by driving the MCP server directly over stdio JSON-RPC.
docs-mcp-eval run --cases ./cases.json \
--server-command "docs-mcp-server --index-dir ./index"
- Recall@K — fraction of expected chunks found in the top-K results
- MRR (Mean Reciprocal Rank) — how early the first relevant result appears
- Precision@K — fraction of top-K results that are relevant
- Delta reports — side-by-side comparison between evaluation runs
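The three rank metrics above can be sketched in a few lines. This is an illustrative implementation of the standard definitions, not the package's actual code; the result IDs and expected-chunk sets are made up:

```typescript
// Reciprocal rank: 1 / (1-based rank of the first relevant result), 0 if none appears.
function reciprocalRank(results: string[], relevant: Set<string>): number {
  const idx = results.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Recall@K: fraction of the expected chunks that appear in the top-K results.
function recallAtK(results: string[], relevant: Set<string>, k: number): number {
  const hits = results.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Precision@K: fraction of the top-K results that are relevant.
function precisionAtK(results: string[], relevant: Set<string>, k: number): number {
  const topK = results.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

// Hypothetical run: two expected chunks, first hit at rank 2.
const relevant = new Set(["auth-guide", "token-refresh"]);
const results = ["quickstart", "auth-guide", "errors", "token-refresh", "sdk-ref"];
console.log(reciprocalRank(results, relevant));  // 0.5
console.log(recallAtK(results, relevant, 3));    // 0.5 (1 of 2 expected chunks in top 3)
console.log(precisionAtK(results, relevant, 3)); // 0.3333... (1 of 3 results relevant)
```

MRR (as reported by the eval) is the mean of `reciprocalRank` over all cases in a run.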
See docs/eval.md for the full search-quality eval specification.
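As a rough sketch of what a cases file passed via `--cases` might look like (the authoritative schema lives in docs/eval.md; every field name below is an assumption for illustration):

```json
{
  "cases": [
    {
      "query": "how do I refresh an access token?",
      "expected_chunks": ["auth/token-refresh"],
      "k": 5
    }
  ]
}
```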
Agent Eval (agent-eval)
End-to-end evaluation that spawns an AI coding agent with docs-mcp tools, runs it against a prompt, and evaluates assertions on the output. Validates the full stack — from search quality to how well a real model uses the tools.
Supports multiple agent providers:
| Provider | Flag | Backend | Prerequisites |
|----------|------|---------|---------------|
| Claude | --provider claude | @anthropic-ai/claude-agent-sdk | ANTHROPIC_API_KEY |
| OpenAI Codex | --provider openai | codex exec --json (CLI spawn) | OPENAI_API_KEY + codex on PATH |
| Auto (default) | --provider auto | Detected from env | Whichever key is set |
# Claude (default when ANTHROPIC_API_KEY is set)
docs-mcp-eval agent-eval --suite acmeauth
# OpenAI Codex
docs-mcp-eval agent-eval --suite dub-go --provider openai
# Custom scenario file
docs-mcp-eval agent-eval --scenarios ./my-scenarios.json --docs-dir ./my-docs
Supports contains, not_contains, matches, and script assertions; per-scenario docs sources (local path or git clone); auto-built index caching; and trend comparison against prior runs.
See docs/agent-eval.md for the full agent eval specification.
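To make the assertion types concrete, a custom scenarios file might look roughly like this (field names are hypothetical; docs/agent-eval.md defines the real schema):

```json
{
  "scenarios": [
    {
      "name": "acmeauth-token-refresh",
      "prompt": "Add token refresh to the AcmeAuth client, using the docs.",
      "docs": { "path": "./docs" },
      "assertions": [
        { "type": "contains", "value": "refreshToken" },
        { "type": "not_contains", "value": "TODO" }
      ]
    }
  ]
}
```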
