# benchmark — Unified Agent Evaluation
This package exposes a consistent MCP workflow for benchmark-driven agent evaluation.
GAIA is the first built-in adapter, but the tool surface is generic so other benchmarks can plug in later without changing client behavior.
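To make the adapter idea concrete, here is one hypothetical shape such a plug-in contract could take. None of these names come from this package; the sketch is purely illustrative of how a second benchmark could slot in behind the same six tools.

```typescript
// Hypothetical adapter contract -- illustrative only. The package does not
// document an internal adapter interface; every name here is an assumption.
interface Question {
  id: string;
  prompt: string;
  assetIds: string[]; // each readable through benchmark_get_asset
}

type GradeResult =
  | { done: false; next: Question }
  | { done: true; score: number };

interface BenchmarkAdapter {
  // Suites surfaced by benchmark_list_challenges.
  listChallenges(): Promise<{ challengeId: string; version: string }[]>;
  // First question of a new attempt (benchmark_start_challenge).
  start(challengeId: string): Promise<Question>;
  // Grade one answer and advance (benchmark_submit_solution).
  grade(attemptId: string, answer: string): Promise<GradeResult>;
}
```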
## Tools
| Tool | Purpose |
|------|---------|
| `benchmark_list_challenges` | List available benchmark suites with version and asset metadata |
| `benchmark_start_challenge` | Start an attempt and return the first question |
| `benchmark_submit_solution` | Grade one answer and return the next question or final score |
| `benchmark_get_asset` | Read an attached benchmark asset by `asset_id` |
| `benchmark_get_attempt` | Inspect attempt status and the current question |
| `benchmark_cancel_attempt` | Cancel an active attempt |
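As a concrete starting point, the sketch below connects an MCP client to the server over stdio and calls the first tool in the table. It uses the official TypeScript SDK (`@modelcontextprotocol/sdk`); the tool name is from the table above, but the empty argument object and the printed payload shape are assumptions.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the server the same way the Claude Desktop config below does,
// but in mock mode so no dataset file is required.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@fre4x/benchmark"],
  env: { BENCHMARK_MOCK: "true" },
});

const client = new Client({ name: "benchmark-demo", version: "0.0.0" });
await client.connect(transport);

// List the available suites (version and asset metadata per the table).
const challenges = await client.callTool({
  name: "benchmark_list_challenges",
  arguments: {},
});
console.log(JSON.stringify(challenges, null, 2));
```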
## Workflow
- Call `benchmark_list_challenges`
- Pick a `challenge_id`
- Call `benchmark_start_challenge`
- If the question has assets, call `benchmark_get_asset`
- Call `benchmark_submit_solution`
- Repeat until `done: true`
Each response includes machine-readable guidance for the most likely next tool call.
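Assuming the `client` from the earlier sketch is still connected, the whole loop could look roughly like this. The tool names match the table, but the argument keys and parsed field names (`attempt_id`, `question`, `done`, `score`) are assumptions about the payloads, and `solveSomehow` is a hypothetical stand-in for your agent.

```typescript
// Hypothetical agent call; replace with your own reasoning loop.
declare function solveSomehow(question: unknown): Promise<string>;

// Start an attempt. The challenge_id value and the payload field names
// below are assumptions, not documented API.
const started = await client.callTool({
  name: "benchmark_start_challenge",
  arguments: { challenge_id: "gaia" },
});
let state = JSON.parse((started.content as any)[0].text);

while (!state.done) {
  const answer = await solveSomehow(state.question);
  const result = await client.callTool({
    name: "benchmark_submit_solution",
    arguments: { attempt_id: state.attempt_id, answer },
  });
  state = JSON.parse((result.content as any)[0].text);
}
console.log("final score:", state.score);
```

If a question references assets, a `benchmark_get_asset` call with the advertised `asset_id` would slot in before the agent call.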
## Mock Mode
Run without any external dataset file:
```bash
MOCK=true npx @fre4x/benchmark
```

## Optional Environment

```bash
BENCHMARK_GAIA_DATA_FILE=/absolute/path/to/gaia-challenges.json
BENCHMARK_STATE_DIR=/absolute/path/to/store-attempt-json
BENCHMARK_MOCK=true
```

- `BENCHMARK_GAIA_DATA_FILE`: Optional JSON file with GAIA-compatible normalized challenge definitions
- `BENCHMARK_STATE_DIR`: Where attempt state is persisted
- `BENCHMARK_MOCK`: Alternate mock-mode flag
## Claude Desktop
```json
{
  "mcpServers": {
    "benchmark": {
      "command": "npx",
      "args": ["-y", "@fre4x/benchmark"],
      "env": {
        "BENCHMARK_GAIA_DATA_FILE": "/absolute/path/to/gaia-challenges.json"
      }
    }
  }
}
```

## Development
```bash
npm install
npm run build -w @fre4x/benchmark
npm run typecheck -w @fre4x/benchmark
npm test -w @fre4x/benchmark
MOCK=true npm run inspector -w @fre4x/benchmark
```