# MCP Evaluation Harness
A command-line interface (CLI) tool designed to benchmark and evaluate the tool-use capabilities of Large Language Models (LLMs) using the Model Context Protocol (MCP).
## Overview
The MCP Evaluation Harness provides a standardized, repeatable method for assessing how effectively different LLMs interact with external tools and data sources exposed via MCP servers. The tool orchestrates interactions between specified LLMs and MCP servers, executing predefined scenarios and measuring both qualitative correctness and quantitative performance.
## Features
- Multi-LLM Support: Currently supports Google Gemini and Ollama with extensible architecture for additional providers
- Judge-Based Evaluation: Uses separate LLM judges to evaluate agent performance with custom rubrics
- MCP Integration: Connects to MCP servers via Streamable HTTP transport
- YAML Configuration: Easy-to-define evaluation scenarios and configurations
- Comprehensive Metrics: Collects both qualitative (correctness) and quantitative (latency, tokens) metrics
- Multiple Output Formats: JSON, Markdown, and CSV report generation
- Rich Reporting: Formatted console reports with PASS/FAIL status and detailed reasoning
- Error Handling: Graceful handling of network failures, API errors, and configuration issues
## Installation

### Prerequisites

- Node.js 18.0.0 or higher

### Install from npm
```bash
# Install globally
npm install -g modelcontextprotocol-eval

# Or use npx directly
npx modelcontextprotocol-eval --help
```

### Build from Source
```bash
# Clone the repository
git clone <repository-url>
cd mcp-eval

# Install dependencies
pnpm install

# Build the project
pnpm build

# Link the CLI globally (optional)
cd packages/cli
npm link
```

## Configuration
Create a YAML configuration file that defines your LLMs, MCP servers, and evaluation scenarios:
```yaml
# Define the Language Models with their roles
llms:
  - name: "Gemini-Flash-Agent"
    provider: "gemini"
    model: "gemini-2.5-flash"
    role: "agent" # Can be 'agent' or 'judge'
    auth:
      api_key_env: "GOOGLE_API_KEY"
  - name: "Ollama-Agent"
    provider: "ollama"
    model: "llama3.1"
    role: "agent"
    auth:
      api_key_env: "OLLAMA_API_KEY" # Not used but required
  - name: "Gemini-Pro-Judge"
    provider: "gemini"
    model: "gemini-2.5-pro"
    role: "judge"
    auth:
      api_key_env: "GOOGLE_API_KEY"

# Define the MCP servers that provide the tools
mcp_servers:
  - name: "cp-labs-mcp-server"
    uri: "http://localhost:3000/mcp"

# Define the evaluation scenarios
scenarios:
  - name: "MCP server integration test"
    llm_agent: "Gemini-Flash-Agent"  # Reference an LLM by name
    llm_judge: "Gemini-Pro-Judge"    # Reference a judge by name
    mcp_server: "cp-labs-mcp-server" # Reference an MCP server
    system_prompt: "You are a helpful assistant. Use available tools when needed."
    prompt: "What are the latest 5 CVEs?"
    # Required evaluation rubric for the judge LLM
    llm_judge_rubric: |
      Analyze the conversation transcript.
      CRITERIA:
      1. Tool Usage: Did the agent correctly use the cve-list tool?
      2. Output Quality: Is the response clear and informative?
      Respond with JSON:
      - "result": "PASS" or "FAIL"
      - "reason": explanation
      - "score": 1-5 rating
  - name: "Answer from own knowledge"
    llm_agent: "Gemini-Flash-Agent"
    llm_judge: "Gemini-Pro-Judge"
    system_prompt: "You are a helpful general knowledge assistant."
    prompt: "What is the capital of France?"
    llm_judge_rubric: |
      Evaluate the response to a knowledge question.
      CRITERIA:
      1. Correctness: Is the answer "Paris"?
      2. Tool Abstinence: Did the agent avoid using tools?
      Respond with JSON:
      - "result": "PASS" or "FAIL"
      - "reason": explanation
      - "score": 1-5 rating
```

## Configuration Reference
### LLM Configuration

- `name`: Display name for the LLM
- `provider`: Provider type (supports `"gemini"` and `"ollama"`)
- `model`: Model identifier (e.g., `"gemini-2.5-flash"`, `"gemini-2.5-pro"` for Gemini; `"llama3.1"`, `"qwen2.5"` for Ollama)
- `role`: Either `"agent"` (performs tasks) or `"judge"` (evaluates performance)
- `auth.api_key_env`: Environment variable containing the API key (not used by Ollama but required by schema)
### MCP Server Configuration

- `name`: Display name for the MCP server
- `uri`: HTTP endpoint URL for the MCP server
### Scenario Configuration

- `name`: Display name for the scenario
- `llm_agent`: Name of the LLM to use as the agent
- `llm_judge`: Name of the LLM to use as the judge
- `mcp_server`: (Optional) Name of the MCP server to connect to
- `system_prompt`: (Optional) System prompt to set LLM behavior
- `prompt`: User prompt that initiates the interaction
- `llm_judge_rubric`: Evaluation criteria and instructions for the judge LLM
## Environment Variables

You can set API keys in two ways:

**Option 1: Using a `.env` file (Recommended)**

Create a `.env` file in the project root:
```bash
# .env
GOOGLE_API_KEY=your_actual_google_api_key_here

# For Ollama provider
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_API_KEY=dummy # Not used but required by schema
```

**Option 2: Using environment variables**
```bash
# For Google Gemini
export GOOGLE_API_KEY="your-gemini-api-key"

# For Ollama (base URL, defaults to http://localhost:11434)
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_API_KEY="dummy" # Not used but required by schema

# Add other provider keys as needed
```

**Important:**

- The `api_key_env` field in your configuration should contain the name of the environment variable (e.g., `"GOOGLE_API_KEY"`), not the actual API key value (see the lookup sketch below)
- Never commit actual API keys to version control
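This indirection simply means the harness reads the key from `process.env` at run time using the variable name you supplied. A minimal sketch of that lookup, assuming a hypothetical helper name (`resolveApiKey` is not part of the package API):

```typescript
// Minimal sketch: resolve an API key from the variable name given in auth.api_key_env.
// resolveApiKey is an illustrative helper, not an export of modelcontextprotocol-eval.
function resolveApiKey(apiKeyEnv: string): string {
  const key = process.env[apiKeyEnv];
  if (!key) {
    throw new Error(`Environment variable ${apiKeyEnv} is not set`);
  }
  return key;
}

// Example: a config entry with auth.api_key_env: "GOOGLE_API_KEY"
const geminiKey = resolveApiKey("GOOGLE_API_KEY");
```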
## Usage

### Basic Usage
```bash
# Run evaluation with a configuration file
modelcontextprotocol-eval evaluate --config config.yaml

# Or use npx
npx modelcontextprotocol-eval evaluate --config config.yaml

# Enable verbose logging for debugging
modelcontextprotocol-eval evaluate --config config.yaml --verbose

# Save results to a file (auto-detects format from extension)
modelcontextprotocol-eval evaluate --config config.yaml --output results.json
modelcontextprotocol-eval evaluate --config config.yaml --output results.csv
modelcontextprotocol-eval evaluate --config config.yaml --output results.md

# Specify format explicitly
modelcontextprotocol-eval evaluate --config config.yaml --output results.txt --format json

# Short form with verbose mode
modelcontextprotocol-eval evaluate -c config.yaml -o results.json -v
```

### Output Formats
The tool supports multiple output formats for saving evaluation results:
- JSON (`.json`) - Complete structured data (a rough sketch of one record follows this list), including:
  - Full conversation transcripts (LLM responses, tool calls, results)
  - Detailed metrics (tokens, latency, tool usage)
  - Judge evaluations with scores and reasoning
  - Timestamps and metadata
- Markdown (`.md`) - Human-readable reports featuring:
  - Executive summary with visual indicators (✅/❌)
  - Detailed scenario breakdowns
  - Full LLM response content
  - Tool interaction details
  - Judge evaluation explanations
- CSV (`.csv`) - Response summaries with counts and truncated content (see File Output Details below)
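For orientation, one scenario record in a JSON report contains roughly the fields below. This is an illustrative sketch based on the contents listed above; the field names are assumptions, not the package's exact schema.

```typescript
// Illustrative shape of a single scenario result in a JSON report.
// Field names are assumptions inferred from the documented report contents.
interface ScenarioResultSketch {
  scenario: string;                  // scenario name from the YAML config
  result: "PASS" | "FAIL";           // judge verdict
  reason: string;                    // judge explanation
  score: number;                     // 1-5 rating from the rubric
  latencyMs: number;                 // end-to-end latency
  tokens: { input: number; output: number };
  transcript: Array<{
    role: "user" | "assistant" | "tool";
    content: string;                 // LLM responses, tool calls, and tool results
  }>;
  timestamp: string;                 // ISO 8601 timestamp
}
```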
## Response Display Features

### Judge-Based Evaluation System
The harness uses a two-LLM system:
- Agent LLM: Performs the actual task (tool usage, reasoning)
- Judge LLM: Evaluates the agent's performance using custom rubrics
This approach provides:
- Objective Assessment: Separate LLM removes bias from self-evaluation
- Custom Rubrics: Tailored evaluation criteria for each scenario
- Detailed Feedback: Rich explanations and scoring for performance analysis
- Scalable Evaluation: Consistent judging across multiple scenarios
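The example rubrics above ask the judge to respond with JSON containing `result`, `reason`, and `score`, so a judge verdict can be modelled and parsed roughly as follows (a sketch; `JudgeVerdict` and `parseVerdict` are illustrative names, not part of the package API):

```typescript
// Sketch of the judge verdict that the example rubrics request.
// JudgeVerdict and parseVerdict are illustrative names, not exports of the package.
interface JudgeVerdict {
  result: "PASS" | "FAIL";
  reason: string;
  score: number; // 1-5 rating
}

function parseVerdict(judgeOutput: string): JudgeVerdict {
  // Judge models sometimes wrap the JSON in extra prose, so extract the outermost object first.
  const start = judgeOutput.indexOf("{");
  const end = judgeOutput.lastIndexOf("}");
  const verdict = JSON.parse(judgeOutput.slice(start, end + 1)) as JudgeVerdict;
  if (verdict.result !== "PASS" && verdict.result !== "FAIL") {
    throw new Error(`Unexpected judge result: ${verdict.result}`);
  }
  return verdict;
}
```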
### Response Capture and Display

The evaluation harness captures and displays comprehensive response information:

#### Console Output
- Scenario Details: Each scenario shows detailed information including status, latency, and token usage
- LLM Responses: Actual content returned by the LLM, truncated for readability
- MCP Server Responses: Tool calls, arguments, and results from MCP servers
- Final Response: The complete final response from the LLM
- Color-coded Output: Green for PASS, red for FAIL, with emoji indicators
#### File Output Details
- JSON Format: Complete response data including full LLM responses, MCP tool results, and metadata
- CSV Format: Response summaries with counts and truncated content
- Markdown Format: Beautifully formatted reports with detailed response sections, including:
  - Full LLM response content
  - MCP server tool calls with arguments and results
  - Final responses for each scenario
  - Visual indicators (✅/❌) for easy scanning
### Example Output
```text
🚀 Starting MCP evaluation...
📋 Loading configuration from: /path/to/config.yaml
✅ Configuration loaded successfully
   - LLMs: 1
   - MCP Servers: 1
   - Scenarios: 2
🔧 Initializing evaluation orchestrator...
✅ Orchestrator initialized successfully
🏃 Running evaluation scenarios...
✅ Evaluation completed. 2 scenarios processed

===== Evaluation Summary =====
┌───────────────────────────────┬──────────────────┬───────────────────────┬────────┬──────────────┬───────────────────────┐
│ Scenario                      │ LLM              │ MCP Server            │ Result │ Latency (ms) │ Reason                │
├───────────────────────────────┼──────────────────┼───────────────────────┼────────┼──────────────┼───────────────────────┤
│ List files in root directory │ Gemini 2.5 Flash │ My Remote Tool Server │ PASS   │ 1245         │ All expectations met. │
│ Answer from own knowledge    │ Gemini 2.5 Flash │ My Remote Tool Server │ PASS   │ 832          │ All expectations met. │
└───────────────────────────────┴──────────────────┴───────────────────────┴────────┴──────────────┴───────────────────────┘

📊 Summary:
   - Total scenarios: 2
   - Passed: 2
   - Failed: 0
   - Success rate: 100.0%
   - Average latency: 1038ms
```

## Architecture
The MCP Evaluation Harness is built as a monorepo with two main packages:
### Core Package (@mcp-eval/core)
- Configuration Parser: YAML parsing and validation
- LLM Providers: Extensible interface for different LLM APIs
- MCP Client: Connection and tool discovery for MCP servers
- Orchestration Engine: Multi-turn conversation management
- Metrics Collection: Qualitative and quantitative evaluation
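As a rough sketch of what the MCP client layer does, the snippet below connects to an MCP server over Streamable HTTP and lists its tools using the official `@modelcontextprotocol/sdk` TypeScript SDK. It is illustrative only; the actual wiring inside `@mcp-eval/core` may differ.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// Sketch: connect to an MCP server over Streamable HTTP and discover its tools.
// Uses the official TypeScript SDK; the internals of @mcp-eval/core may differ.
async function discoverTools(uri: string): Promise<void> {
  const client = new Client({ name: "mcp-eval-sketch", version: "0.0.1" });
  const transport = new StreamableHTTPClientTransport(new URL(uri));
  await client.connect(transport);

  const { tools } = await client.listTools();
  console.log(tools.map((tool) => tool.name));

  // A tool call the agent might request, e.g. the cve-list tool from the example config:
  // const result = await client.callTool({ name: "cve-list", arguments: { limit: 5 } });

  await client.close();
}

discoverTools("http://localhost:3000/mcp").catch(console.error);
```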
### CLI Package (@mcp-eval/cli)
- Commander.js Interface: Command-line argument parsing
- Report Generation: Formatted table output with colors
- Error Handling: User-friendly error messages
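For orientation, the option surface of the `evaluate` command shown under Usage maps onto Commander.js roughly like this. This is a sketch, not the package's actual source:

```typescript
import { Command } from "commander";

// Sketch of how the evaluate command's options could be wired with Commander.js.
// Illustrative only; the real @mcp-eval/cli source may be organized differently.
const program = new Command();

program
  .name("modelcontextprotocol-eval")
  .description("MCP evaluation harness");

program
  .command("evaluate")
  .requiredOption("-c, --config <path>", "path to the YAML configuration file")
  .option("-o, --output <path>", "write results to a file (format inferred from extension)")
  .option("--format <format>", "output format: json, csv, or md")
  .option("-v, --verbose", "enable verbose debug logging")
  .action(async (options) => {
    // Load the config, run the orchestrator, and generate reports here.
    console.log(`Running evaluation with config: ${options.config}`);
  });

program.parse();
```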
## Supported LLM Providers

- Google Gemini: Full support via the `@google/genai` SDK
- Ollama: Full support via the `ollama` SDK with OpenAI-compatible API
  - Requires an Ollama server running locally (default: http://localhost:11434)
  - Supports all Ollama models including Llama, Qwen, Mistral, etc.
  - Tool calling supported via OpenAI-compatible function calling (illustrated below)
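To make "OpenAI-compatible function calling" concrete, a standalone request against Ollama's OpenAI-compatible chat endpoint looks roughly like the sketch below (plain `fetch`; the harness itself goes through the `ollama` SDK, and the `cve-list` tool definition here is just the example tool referenced in the sample config):

```typescript
// Sketch: tool calling against Ollama's OpenAI-compatible chat endpoint.
// Illustrative only; the harness uses the ollama SDK rather than raw fetch.
async function sketchOllamaToolCall(): Promise<void> {
  const response = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.1",
      messages: [{ role: "user", content: "What are the latest 5 CVEs?" }],
      tools: [
        {
          type: "function",
          function: {
            name: "cve-list", // example tool name taken from the sample configuration's rubric
            description: "List recent CVEs",
            parameters: {
              type: "object",
              properties: { limit: { type: "number" } },
            },
          },
        },
      ],
    }),
  });

  const data = await response.json();
  // If the model decides to call a tool, the call appears here in OpenAI format.
  console.log(data.choices[0].message.tool_calls);
}

sketchOllamaToolCall().catch(console.error);
```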
### Adding New Providers

To add support for a new LLM provider:

1. Implement the `LLMProvider` interface in `packages/core/src/llm-providers/` (see the sketch below)
2. Add the provider to the factory in `packages/core/src/llm-providers/factory.ts`
3. Update the configuration validation to recognize the new provider
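The shape of `LLMProvider` is not documented in this README, so the skeleton below is only a guess at what an implementation might look like; the interface, method names, and types are assumptions.

```typescript
// Hypothetical skeleton of a new provider. The real LLMProvider interface in
// packages/core/src/llm-providers/ may declare different method names and types.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the tool's arguments
}

interface LLMProvider {
  chat(messages: ChatMessage[], tools?: ToolDefinition[]): Promise<ChatMessage>;
}

class MyProvider implements LLMProvider {
  constructor(private model: string, private apiKey: string) {}

  async chat(messages: ChatMessage[], tools?: ToolDefinition[]): Promise<ChatMessage> {
    // Call the new provider's API here, forwarding the conversation and tool
    // definitions, then map the response back into the harness's message format.
    return { role: "assistant", content: "..." };
  }
}
```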
## Development

### Project Structure
```text
mcp-eval/
├── packages/
│   ├── cli/                     # CLI interface
│   │   ├── src/
│   │   │   ├── cli.ts           # Main CLI entry point
│   │   │   ├── commands/        # CLI commands
│   │   │   └── utils/           # Report generation
│   │   └── package.json
│   └── core/                    # Core evaluation engine
│       ├── src/
│       │   ├── types.ts         # Type definitions
│       │   ├── config-parser.ts
│       │   ├── mcp-client.ts
│       │   ├── orchestrator.ts
│       │   └── llm-providers/
│       └── package.json
├── package.json                 # Root package configuration
└── pnpm-workspace.yaml          # pnpm workspace configuration
```

### Building
```bash
# Build all packages
pnpm build

# Build specific package
pnpm --filter @mcp-eval/core build
pnpm --filter @mcp-eval/cli build
```

### Testing
```bash
# Run all tests
pnpm test

# Run tests for specific package
pnpm --filter @mcp-eval/core test
```

### Development Mode
```bash
# Run CLI in development mode - always use full absolute paths for config files
pnpm --filter @mcp-eval/cli run dev evaluate --config /full/path/to/config.yaml

# Test with Gemini provider (requires GOOGLE_API_KEY)
pnpm --filter @mcp-eval/cli run dev evaluate --config /Users/$(whoami)/path/to/example-config.yaml

# Test with Ollama provider (requires Ollama server running)
pnpm --filter @mcp-eval/cli run dev evaluate --config /Users/$(whoami)/path/to/example-ollama-template.yaml

# Alternative: use the evaluate script from root directory
cd /path/to/mcp-eval
pnpm evaluate --config example-config.yaml
```

**Important Notes:**
- Always use full absolute paths for configuration files to avoid "file not found" errors
- Ensure all required services are running (Ollama server, MCP servers) before evaluation
- Check that required API keys are set in your `.env` file
## Troubleshooting

### Common Issues
**API Key Not Found**

- Ensure the environment variable specified in `auth.api_key_env` is set
- Check that the API key is valid and has the necessary permissions

**MCP Server Connection Failed**

- Verify the MCP server URI is correct and accessible
- Ensure the MCP server supports the Streamable HTTP transport

**Configuration Validation Error**

- Check the YAML syntax is valid
- Ensure all required fields are present
- Verify field types match the expected schema

**Ollama Connection Issues**

- Ensure the Ollama server is running: `ollama serve`
- Verify the model is available: `ollama list`
- Pull required models: `ollama pull llama3.1`
- Check the `OLLAMA_BASE_URL` environment variable (should be `http://localhost:11434`, not `http://localhost:11434/v1`)

**MCP Server Connection Issues**

- The example configurations reference `cp-labs-mcp-server`, which may not be running
- For scenarios without MCP servers, remove the `mcp_server` field from the scenario
- Ensure any referenced MCP servers are running and accessible at the specified URI
### Logging and Debug Mode

The tool provides two logging levels:

**Default Mode (Minimal Output):**
- Shows essential progress indicators and results
- Clean output suitable for production use
- Displays summary table and final statistics
```bash
modelcontextprotocol-eval evaluate --config config.yaml
```

**Verbose Mode (Debug Information):**
- Shows detailed debug information
- MCP server connection logs
- Full LLM request/response data
- Tool call details and configuration info
```bash
modelcontextprotocol-eval evaluate --config config.yaml --verbose
# or short form
modelcontextprotocol-eval evaluate -c config.yaml -v
```

## Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Run the test suite
- Submit a pull request
## Roadmap

- [ ] Add cache feature to save tokens
- [ ] Add OpenAI support

## License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
## Changelog

### Version 1.0.2
- Added centralized logging system with configurable verbosity
- Added `--verbose` flag for detailed debug output
- Improved production output by reducing log noise by default
- Clean, minimal output for production use with optional verbose mode
### Version 1.0.1
- Fixed npm publishing configuration
- Added proper binary execution support
- Updated package metadata for public release
### Version 1.0.0
- Initial release
- Support for Google Gemini LLM provider
- MCP Streamable HTTP transport support
- YAML configuration system
- Console report generation
- Comprehensive error handling
