# benchmark — Unified Agent Evaluation
This package exposes a consistent MCP workflow for benchmark-driven agent evaluation.
GAIA is the first built-in adapter, but the tool surface is generic so other benchmarks can plug in later without changing client behavior.
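To make the adapter idea concrete, here is one hypothetical shape such a plug-in contract could take. None of these names come from this package; the sketch is purely illustrative of how a second benchmark could slot in behind the same six tools.

```typescript
// Hypothetical adapter contract -- illustrative only. The package does not
// document an internal adapter interface; every name here is an assumption.
interface Question {
  id: string;
  prompt: string;
  assetIds: string[]; // each readable through benchmark_get_asset
}

type GradeResult =
  | { done: false; next: Question }
  | { done: true; score: number };

interface BenchmarkAdapter {
  // Suites surfaced by benchmark_list_challenges.
  listChallenges(): Promise<{ challengeId: string; version: string }[]>;
  // First question of a new attempt (benchmark_start_challenge).
  start(challengeId: string): Promise<Question>;
  // Grade one answer and advance (benchmark_submit_solution).
  grade(attemptId: string, answer: string): Promise<GradeResult>;
}
```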
## Tools
| Tool | Purpose |
|------|---------|
| `benchmark_list_challenges` | List available benchmark suites with version and asset metadata |
| `benchmark_start_challenge` | Start an attempt and return the first question |
| `benchmark_submit_solution` | Grade one answer and return the next question or final score |
| `benchmark_get_asset` | Read an attached benchmark asset by `asset_id` |
| `benchmark_get_attempt` | Inspect attempt status and the current question |
| `benchmark_cancel_attempt` | Cancel an active attempt |
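As a concrete starting point, the sketch below connects an MCP client to the server over stdio and calls the first tool in the table. It uses the official TypeScript SDK (`@modelcontextprotocol/sdk`); the tool name is from the table above, but the empty argument object and the printed payload shape are assumptions.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the server the same way the Claude Desktop config below does,
// but in mock mode so no dataset file is required.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@fre4x/benchmark"],
  env: { BENCHMARK_MOCK: "true" },
});

const client = new Client({ name: "benchmark-demo", version: "0.0.0" });
await client.connect(transport);

// List the available suites (version and asset metadata per the table).
const challenges = await client.callTool({
  name: "benchmark_list_challenges",
  arguments: {},
});
console.log(JSON.stringify(challenges, null, 2));
```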
## Workflow
- Call `benchmark_list_challenges`
- Pick a `challenge_id`
- Call `benchmark_start_challenge`
- If the question has assets, call `benchmark_get_asset`
- Call `benchmark_submit_solution`
- Repeat until `done: true`
Each response includes machine-readable guidance for the most likely next tool call.
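Assuming the `client` from the earlier sketch is still connected, the whole loop could look roughly like this. The tool names match the table, but the argument keys and parsed field names (`attempt_id`, `question`, `done`, `score`) are assumptions about the payloads, and `solveSomehow` is a hypothetical stand-in for your agent.

```typescript
// Hypothetical agent call; replace with your own reasoning loop.
declare function solveSomehow(question: unknown): Promise<string>;

// Start an attempt. The challenge_id value and the payload field names
// below are assumptions, not documented API.
const started = await client.callTool({
  name: "benchmark_start_challenge",
  arguments: { challenge_id: "gaia" },
});
let state = JSON.parse((started.content as any)[0].text);

while (!state.done) {
  const answer = await solveSomehow(state.question);
  const result = await client.callTool({
    name: "benchmark_submit_solution",
    arguments: { attempt_id: state.attempt_id, answer },
  });
  state = JSON.parse((result.content as any)[0].text);
}
console.log("final score:", state.score);
```

If a question references assets, a `benchmark_get_asset` call with the advertised `asset_id` would slot in before the agent call.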
## Mock Mode
Run without any external dataset file:
```bash
MOCK=true npx @fre4x/benchmark
```

## Optional Environment

```bash
BENCHMARK_GAIA_DATA_FILE=/absolute/path/to/gaia-challenges.json
BENCHMARK_STATE_DIR=/absolute/path/to/store-attempt-json
BENCHMARK_MOCK=true
```

- `BENCHMARK_GAIA_DATA_FILE`: Optional JSON file with GAIA-compatible normalized challenge definitions
- `BENCHMARK_STATE_DIR`: Where attempt state is persisted
- `BENCHMARK_MOCK`: Alternate mock-mode flag
## Claude Desktop
```json
{
  "mcpServers": {
    "benchmark": {
      "command": "npx",
      "args": ["-y", "@fre4x/benchmark"],
      "env": {
        "BENCHMARK_GAIA_DATA_FILE": "/absolute/path/to/gaia-challenges.json"
      }
    }
  }
}
```

## Development
```bash
npm install
npm run build -w @fre4x/benchmark
npm run typecheck -w @fre4x/benchmark
npm test -w @fre4x/benchmark
MOCK=true npm run inspector -w @fre4x/benchmark
```