@reaatech/agent-eval-harness-mcp-server

v0.1.2

Published

23 days ago

Three-layer MCP tool server (judge, suite, gate) for agent-eval-harness

reaatech

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Three-layer MCP (Model Context Protocol) server exposing evaluation tools. Provides 13 tools across three layers — atomic judge operations, orchestrated suite runs, and CI gate operations — all accessible via MCP stdio transport for integration with AI coding agents like Claude Desktop.

Installation

npm install @reaatech/agent-eval-harness-mcp-server

Feature Overview

13 MCP tools — covering the full evaluation lifecycle from atomic judgment to CI gate checking
Three-layer architecture — eval.judge.* (5 fast, stateless atomic ops), eval.suite.* (5 orchestrated longer-running ops), eval.gate.* (3 blocking CI gate ops)
Stdio transport — standard MCP protocol over stdin/stdout, no HTTP server required
Auto-discovery — agents can list available tools and their input/output schemas at connection
In-memory state — session-scoped run storage with no external database dependency
JSON Schema tool definitions — each tool declares its input shape for type-safe agent invocation

Quick Start

import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer();
await server.run(); // Connects via stdio; ready for MCP clients

Configure tool layers individually:

import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer({
  name: 'my-eval-server',
  enableJudgeTools: true,
  enableSuiteTools: true,
  enableGateTools: false, // gate ops disabled
});

API Reference

Server

| Export | Type | Description |
|--------|------|-------------|
| EvalHarnessMCPServer | Class | MCP server instance wrapping @modelcontextprotocol/sdk |
| createMCPServer(config?) | async (config?: Partial<MCPServerConfig>) => Promise<EvalHarnessMCPServer> | Create and start server in one call |

EvalHarnessMCPServer methods:

| Method | Description |
|--------|-------------|
| run() | Connect and start listening on stdio transport |
| getServer() | Access underlying MCP Server instance |
| close() | Gracefully close the server connection |

Configuration

MCPServerConfig

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| name | string | 'agent-eval-harness' | Server name reported to MCP clients |
| version | string | package.version | Server version |
| enableJudgeTools | boolean | true | Register eval.judge.* tools |
| enableSuiteTools | boolean | true | Register eval.suite.* tools |
| enableGateTools | boolean | true | Register eval.gate.* tools |
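
The layer flags decide which tool groups a client sees at connection time. The following is a rough sketch of how a partial config could be merged with the documented defaults and mapped to the tool names from the Tool Reference. It is not the package's actual implementation: `resolveConfig` and `enabledTools` are hypothetical helper names, and the `version` default is a placeholder, since the real default comes from the package's own version.

```typescript
// Illustrative only: mirrors the documented MCPServerConfig fields and defaults.
interface MCPServerConfig {
  name: string;
  version: string;
  enableJudgeTools: boolean;
  enableSuiteTools: boolean;
  enableGateTools: boolean;
}

// Placeholder version (assumption); the real default is the package version.
const DEFAULTS: MCPServerConfig = {
  name: 'agent-eval-harness',
  version: '0.0.0',
  enableJudgeTools: true,
  enableSuiteTools: true,
  enableGateTools: true,
};

// Tool names as listed in the Tool Reference, grouped by layer.
const TOOLS = {
  judge: [
    'eval.judge.faithfulness',
    'eval.judge.relevance',
    'eval.judge.tool_correctness',
    'eval.judge.cost_check',
    'eval.judge.latency_check',
  ],
  suite: [
    'eval.suite.run',
    'eval.suite.status',
    'eval.suite.results',
    'eval.suite.compare',
    'eval.suite.baseline',
  ],
  gate: ['eval.gate.run', 'eval.gate.config', 'eval.gate.diff'],
} as const;

// Hypothetical helper: fill in any omitted fields with the defaults.
function resolveConfig(partial: Partial<MCPServerConfig> = {}): MCPServerConfig {
  return { ...DEFAULTS, ...partial };
}

// Hypothetical helper: list the tool names an enabled layer set would expose.
function enabledTools(partial: Partial<MCPServerConfig> = {}): string[] {
  const cfg = resolveConfig(partial);
  return [
    ...(cfg.enableJudgeTools ? TOOLS.judge : []),
    ...(cfg.enableSuiteTools ? TOOLS.suite : []),
    ...(cfg.enableGateTools ? TOOLS.gate : []),
  ];
}
```

With the Quick Start config (`enableGateTools: false`), this sketch would list the ten judge and suite tools and omit the three gate tools.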

Tool Reference

Layer 1 — eval.judge.* (Atomic Operations)

Fast, stateless operations designed for mid-task self-evaluation by agents.

| Tool | Input | Output | Description |
|------|-------|--------|-------------|
| eval.judge.faithfulness | { context: string, response: string } | { score, explanation, confidence } | Score response faithfulness to context |
| eval.judge.relevance | { intent: string, response: string } | { score, explanation, confidence } | Score response relevance to intent |
| eval.judge.tool_correctness | { expected_tool: string, actual_tool: string, arguments?: object, result?: object } | { score, explanation, confidence } | Validate tool selection and arguments |
| eval.judge.cost_check | { trajectory: object, budget: number } | { within_budget, cost, budget, usage_percentage } | Verify cost within budget |
| eval.judge.latency_check | { trajectory: object, sla: number } | { within_sla, p99_ms, p50_ms, p90_ms, total_ms } | Verify latency within SLA |
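
MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request that carries the tool `name` and its `arguments`. As a minimal sketch of what such a message might look like for `eval.judge.faithfulness`, the snippet below builds the envelope by hand. `buildToolCall` is a hypothetical helper and the sample context and response strings are invented for illustration; in practice an MCP client library would normally construct this message.

```typescript
// Shape of an MCP tools/call request (JSON-RPC 2.0 envelope).
interface ToolCallRequest<A> {
  jsonrpc: '2.0';
  id: number;
  method: 'tools/call';
  params: { name: string; arguments: A };
}

// Input shape taken from the eval.judge.faithfulness row above.
interface FaithfulnessInput {
  context: string;
  response: string;
}

let nextId = 1;

// Hypothetical helper: wrap a tool name and its arguments in a request.
function buildToolCall<A>(name: string, args: A): ToolCallRequest<A> {
  return {
    jsonrpc: '2.0',
    id: nextId++,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

// Invented sample data, for illustration only.
const faithfulnessRequest = buildToolCall<FaithfulnessInput>(
  'eval.judge.faithfulness',
  {
    context: 'The invoice total is 42 EUR.',
    response: 'Your invoice comes to 42 EUR.',
  },
);

// Serialized form, as it would be written to the server's stdin.
const wireMessage = JSON.stringify(faithfulnessRequest);
```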

Layer 2 — eval.suite.* (Orchestrated Runs)

Stateful operations for eval-driven development. In-memory storage per session.

| Tool | Input | Output | Description |
|------|-------|--------|-------------|
| eval.suite.run | { trajectories: object[], config?: { metrics?, judge_model?, budget_limit? } } | { run_id, status, total_trajectories, completed, failed, duration_ms } | Execute evaluation suite |
| eval.suite.status | { run_id: string } | { run_id, status, progress, completed, total, started_at, ended_at } | Get run progress |
| eval.suite.results | { run_id: string, format?: 'json' \| 'summary' } | Aggregated results or summary | Retrieve evaluation results |
| eval.suite.compare | { baseline_run: string, candidate_run: string } | { score_diff, verdict, regressions, improvements, key_findings } | Compare two runs |
| eval.suite.baseline | { run_id: string, name?: string } | { baseline_id, name, set_at } | Set baseline for regression |
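
The suite tools are meant to be chained: start a run, check its status, then fetch results. The sketch below shows one way a client might do that. Several things here are assumptions rather than documented behavior: `callTool` is an injected function standing in for whatever MCP client is in use, `runSuiteAndWait` is a hypothetical helper, and the status strings `'completed'` and `'failed'` are guesses, since the table above does not list the possible `status` values. If `eval.suite.run` only returns once the run has finished, the first status check succeeds immediately and the polling loop is effectively skipped.

```typescript
type ToolResult = Record<string, unknown>;

// Stand-in for an MCP client's tool invocation (assumption).
type CallTool = (name: string, args: Record<string, unknown>) => Promise<ToolResult>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run a suite, poll until it finishes, return a summary.
async function runSuiteAndWait(
  callTool: CallTool,
  trajectories: object[],
  pollMs = 500,
  maxPolls = 120,
): Promise<ToolResult> {
  const started = await callTool('eval.suite.run', { trajectories });
  const runId = String(started.run_id);

  for (let i = 0; i < maxPolls; i++) {
    const status = await callTool('eval.suite.status', { run_id: runId });
    // 'completed' and 'failed' are assumed status values.
    if (status.status === 'completed') {
      return callTool('eval.suite.results', { run_id: runId, format: 'summary' });
    }
    if (status.status === 'failed') {
      throw new Error(`Suite run ${runId} failed`);
    }
    await sleep(pollMs);
  }
  throw new Error(`Suite run ${runId} did not finish after ${maxPolls} polls`);
}
```

Because run state is held in memory per session, all three calls need to go through the same server connection.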

Layer 3 — eval.gate.* (CI Gates)

Blocking, opinionated operations for CI/CD pipelines. In-memory gate storage per session.

| Tool | Input | Output | Description |
|------|-------|--------|-------------|
| eval.gate.run | { run_id?: string, gate_config?: string, results?: object, comparison?: object } | { passed, total_gates, passed_gates, failed_gates, results, exit_code } | Run CI-style pass/fail gate |
| eval.gate.config | { action: 'get' \| 'set' \| 'list', config?: object[], preset?: 'standard' \| 'strict' \| 'lenient' } | { gates } or { success, gates_loaded } | Get/set/list gate configuration |
| eval.gate.diff | { baseline: object, candidate: object, metrics?: string[] } | { score_diff, metric_diffs, regressions, improvements, verdict } | Detailed diff from baseline |
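
In a CI job, the `eval.gate.run` output typically needs to be turned into a process exit status and a one-line log message. The sketch below shows one possible way to do that. The field types in `GateRunOutput` are assumptions (the table above gives only field names), `summarizeGate` is a hypothetical helper, and the fallback to exit code 1 is a defensive choice made here, not documented behavior.

```typescript
// Field names from the eval.gate.run row; the types are assumed.
interface GateRunOutput {
  passed: boolean;
  total_gates: number;
  passed_gates: number;
  failed_gates: number;
  results: unknown;
  exit_code: number;
}

// Hypothetical helper: derive a CI exit code and a short log line.
function summarizeGate(out: GateRunOutput): { exitCode: number; message: string } {
  // Never report success for a failed gate, even if exit_code is 0.
  const exitCode = out.passed ? 0 : out.exit_code !== 0 ? out.exit_code : 1;
  const verdict = out.passed ? 'PASSED' : 'FAILED';
  const message =
    `Gate ${verdict}: ${out.passed_gates}/${out.total_gates} gates passed, ` +
    `${out.failed_gates} failed`;
  return { exitCode, message };
}
```

A CI script could then log `message` and assign `exitCode` to `process.exitCode` so the pipeline step fails when any gate fails.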

Related Packages

License

MIT
