# MCP Evaluation Harness
A command-line interface (CLI) tool designed to benchmark and evaluate the tool-use capabilities of Large Language Models (LLMs) using the Model Context Protocol (MCP).
## Overview
The MCP Evaluation Harness provides a standardized, repeatable method for assessing how effectively different LLMs interact with external tools and data sources exposed via MCP servers. The tool orchestrates interactions between specified LLMs and MCP servers, executing predefined scenarios and measuring both qualitative correctness and quantitative performance.
## Features
- Multi-LLM Support: Currently supports Google Gemini and Ollama with extensible architecture for additional providers
- Judge-Based Evaluation: Uses separate LLM judges to evaluate agent performance with custom rubrics
- MCP Integration: Connects to MCP servers via Streamable HTTP transport
- YAML Configuration: Easy-to-define evaluation scenarios and configurations
- Comprehensive Metrics: Collects both qualitative (correctness) and quantitative (latency, tokens) metrics
- Multiple Output Formats: JSON, Markdown, and CSV report generation
- Rich Reporting: Formatted console reports with PASS/FAIL status and detailed reasoning
- Error Handling: Graceful handling of network failures, API errors, and configuration issues
## Installation

### Prerequisites

- Node.js 18.0.0 or higher

### Install from npm
```bash
# Install globally
npm install -g modelcontextprotocol-eval

# Or use npx directly
npx modelcontextprotocol-eval --help
```

### Build from Source
```bash
# Clone the repository
git clone <repository-url>
cd mcp-eval

# Install dependencies
pnpm install

# Build the project
pnpm build

# Link the CLI globally (optional)
cd packages/cli
npm link
```

## Configuration
Create a YAML configuration file that defines your LLMs, MCP servers, and evaluation scenarios:
```yaml
# Define the Language Models with their roles
llms:
  - name: "Gemini-Flash-Agent"
    provider: "gemini"
    model: "gemini-2.5-flash"
    role: "agent" # Can be 'agent' or 'judge'
    auth:
      api_key_env: "GOOGLE_API_KEY"
  - name: "Ollama-Agent"
    provider: "ollama"
    model: "llama3.1"
    role: "agent"
    auth:
      api_key_env: "OLLAMA_API_KEY" # Not used but required
  - name: "Gemini-Pro-Judge"
    provider: "gemini"
    model: "gemini-2.5-pro"
    role: "judge"
    auth:
      api_key_env: "GOOGLE_API_KEY"

# Define the MCP servers that provide the tools
mcp_servers:
  - name: "cp-labs-mcp-server"
    uri: "http://localhost:3000/mcp"

# Define the evaluation scenarios
scenarios:
  - name: "MCP server integration test"
    llm_agent: "Gemini-Flash-Agent"  # Reference an LLM by name
    llm_judge: "Gemini-Pro-Judge"    # Reference a judge by name
    mcp_server: "cp-labs-mcp-server" # Reference an MCP server
    system_prompt: "You are a helpful assistant. Use available tools when needed."
    prompt: "What are the latest 5 CVEs?"
    # Required evaluation rubric for the judge LLM
    llm_judge_rubric: |
      Analyze the conversation transcript.
      CRITERIA:
      1. Tool Usage: Did the agent correctly use the cve-list tool?
      2. Output Quality: Is the response clear and informative?
      Respond with JSON:
      - "result": "PASS" or "FAIL"
      - "reason": explanation
      - "score": 1-5 rating
  - name: "Answer from own knowledge"
    llm_agent: "Gemini-Flash-Agent"
    llm_judge: "Gemini-Pro-Judge"
    system_prompt: "You are a helpful general knowledge assistant."
    prompt: "What is the capital of France?"
    llm_judge_rubric: |
      Evaluate the response to a knowledge question.
      CRITERIA:
      1. Correctness: Is the answer "Paris"?
      2. Tool Abstinence: Did the agent avoid using tools?
      Respond with JSON:
      - "result": "PASS" or "FAIL"
      - "reason": explanation
      - "score": 1-5 rating
```

## Configuration Reference
### LLM Configuration

- `name`: Display name for the LLM
- `provider`: Provider type (supports `"gemini"` and `"ollama"`)
- `model`: Model identifier (e.g., `"gemini-2.5-flash"`, `"gemini-2.5-pro"` for Gemini; `"llama3.1"`, `"qwen2.5"` for Ollama)
- `role`: Either `"agent"` (performs tasks) or `"judge"` (evaluates performance)
- `auth.api_key_env`: Environment variable containing the API key (not used by Ollama but required by schema)
### MCP Server Configuration

- `name`: Display name for the MCP server
- `uri`: HTTP endpoint URL for the MCP server
### Scenario Configuration

- `name`: Display name for the scenario
- `llm_agent`: Name of the LLM to use as the agent
- `llm_judge`: Name of the LLM to use as the judge
- `mcp_server`: (Optional) Name of the MCP server to connect to
- `system_prompt`: (Optional) System prompt to set LLM behavior
- `prompt`: User prompt that initiates the interaction
- `llm_judge_rubric`: Evaluation criteria and instructions for the judge LLM
## Environment Variables

You can set API keys in two ways:

**Option 1: Using a `.env` file (Recommended)**

Create a `.env` file in the project root:
```bash
# .env
GOOGLE_API_KEY=your_actual_google_api_key_here

# For Ollama provider
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_API_KEY=dummy # Not used but required by schema
```

**Option 2: Using environment variables**
```bash
# For Google Gemini
export GOOGLE_API_KEY="your-gemini-api-key"

# For Ollama (base URL, defaults to http://localhost:11434)
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_API_KEY="dummy" # Not used but required by schema

# Add other provider keys as needed
```

**Important:**

- The `api_key_env` field in your configuration should contain the name of the environment variable (e.g., `"GOOGLE_API_KEY"`), not the actual API key value (see the lookup sketch below)
- Never commit actual API keys to version control
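This indirection simply means the harness reads the key from `process.env` at run time using the variable name you supplied. A minimal sketch of that lookup, assuming a hypothetical helper name (`resolveApiKey` is not part of the package API):

```typescript
// Minimal sketch: resolve an API key from the variable name given in auth.api_key_env.
// resolveApiKey is an illustrative helper, not an export of modelcontextprotocol-eval.
function resolveApiKey(apiKeyEnv: string): string {
  const key = process.env[apiKeyEnv];
  if (!key) {
    throw new Error(`Environment variable ${apiKeyEnv} is not set`);
  }
  return key;
}

// Example: a config entry with auth.api_key_env: "GOOGLE_API_KEY"
const geminiKey = resolveApiKey("GOOGLE_API_KEY");
```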
## Usage

### Basic Usage
```bash
# Run evaluation with a configuration file
modelcontextprotocol-eval evaluate --config config.yaml

# Or use npx
npx modelcontextprotocol-eval evaluate --config config.yaml

# Enable verbose logging for debugging
modelcontextprotocol-eval evaluate --config config.yaml --verbose

# Save results to a file (auto-detects format from extension)
modelcontextprotocol-eval evaluate --config config.yaml --output results.json
modelcontextprotocol-eval evaluate --config config.yaml --output results.csv
modelcontextprotocol-eval evaluate --config config.yaml --output results.md

# Specify format explicitly
modelcontextprotocol-eval evaluate --config config.yaml --output results.txt --format json

# Short form with verbose mode
modelcontextprotocol-eval evaluate -c config.yaml -o results.json -v
```

### Output Formats
The tool supports multiple output formats for saving evaluation results:
- JSON (`.json`) - Complete structured data (a rough sketch of one record follows this list), including:
  - Full conversation transcripts (LLM responses, tool calls, results)
  - Detailed metrics (tokens, latency, tool usage)
  - Judge evaluations with scores and reasoning
  - Timestamps and metadata
- Markdown (`.md`) - Human-readable reports featuring:
  - Executive summary with visual indicators (✅/❌)
  - Detailed scenario breakdowns
  - Full LLM response content
  - Tool interaction details
  - Judge evaluation explanations
- CSV (`.csv`) - Response summaries with counts and truncated content (see File Output Details below)
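For orientation, one scenario record in a JSON report contains roughly the fields below. This is an illustrative sketch based on the contents listed above; the field names are assumptions, not the package's exact schema.

```typescript
// Illustrative shape of a single scenario result in a JSON report.
// Field names are assumptions inferred from the documented report contents.
interface ScenarioResultSketch {
  scenario: string;                  // scenario name from the YAML config
  result: "PASS" | "FAIL";           // judge verdict
  reason: string;                    // judge explanation
  score: number;                     // 1-5 rating from the rubric
  latencyMs: number;                 // end-to-end latency
  tokens: { input: number; output: number };
  transcript: Array<{
    role: "user" | "assistant" | "tool";
    content: string;                 // LLM responses, tool calls, and tool results
  }>;
  timestamp: string;                 // ISO 8601 timestamp
}
```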
## Response Display Features

### Judge-Based Evaluation System
The harness uses a two-LLM system:
- Agent LLM: Performs the actual task (tool usage, reasoning)
- Judge LLM: Evaluates the agent's performance using custom rubrics
This approach provides:
- Objective Assessment: Separate LLM removes bias from self-evaluation
- Custom Rubrics: Tailored evaluation criteria for each scenario
- Detailed Feedback: Rich explanations and scoring for performance analysis
- Scalable Evaluation: Consistent judging across multiple scenarios
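The example rubrics above ask the judge to respond with JSON containing `result`, `reason`, and `score`, so a judge verdict can be modelled and parsed roughly as follows (a sketch; `JudgeVerdict` and `parseVerdict` are illustrative names, not part of the package API):

```typescript
// Sketch of the judge verdict that the example rubrics request.
// JudgeVerdict and parseVerdict are illustrative names, not exports of the package.
interface JudgeVerdict {
  result: "PASS" | "FAIL";
  reason: string;
  score: number; // 1-5 rating
}

function parseVerdict(judgeOutput: string): JudgeVerdict {
  // Judge models sometimes wrap the JSON in extra prose, so extract the outermost object first.
  const start = judgeOutput.indexOf("{");
  const end = judgeOutput.lastIndexOf("}");
  const verdict = JSON.parse(judgeOutput.slice(start, end + 1)) as JudgeVerdict;
  if (verdict.result !== "PASS" && verdict.result !== "FAIL") {
    throw new Error(`Unexpected judge result: ${verdict.result}`);
  }
  return verdict;
}
```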
### Response Capture and Display

The evaluation harness captures and displays comprehensive response information:

#### Console Output
- Scenario Details: Each scenario shows detailed information including status, latency, and token usage
- LLM Responses: Actual content returned by the LLM, truncated for readability
- MCP Server Responses: Tool calls, arguments, and results from MCP servers
- Final Response: The complete final response from the LLM
- Color-coded Output: Green for PASS, red for FAIL, with emoji indicators
#### File Output Details
- JSON Format: Complete response data including full LLM responses, MCP tool results, and metadata
- CSV Format: Response summaries with counts and truncated content
- Markdown Format: Beautifully formatted reports with detailed response sections, including:
  - Full LLM response content
  - MCP server tool calls with arguments and results
  - Final responses for each scenario
  - Visual indicators (✅/❌) for easy scanning
### Example Output
```text
🚀 Starting MCP evaluation...
📋 Loading configuration from: /path/to/config.yaml
✅ Configuration loaded successfully
   - LLMs: 1
   - MCP Servers: 1
   - Scenarios: 2
🔧 Initializing evaluation orchestrator...
✅ Orchestrator initialized successfully
🏃 Running evaluation scenarios...
✅ Evaluation completed. 2 scenarios processed

===== Evaluation Summary =====
┌───────────────────────────────┬──────────────────┬───────────────────────┬────────┬──────────────┬───────────────────────┐
│ Scenario                      │ LLM              │ MCP Server            │ Result │ Latency (ms) │ Reason                │
├───────────────────────────────┼──────────────────┼───────────────────────┼────────┼──────────────┼───────────────────────┤
│ List files in root directory │ Gemini 2.5 Flash │ My Remote Tool Server │ PASS   │ 1245         │ All expectations met. │
│ Answer from own knowledge    │ Gemini 2.5 Flash │ My Remote Tool Server │ PASS   │ 832          │ All expectations met. │
└───────────────────────────────┴──────────────────┴───────────────────────┴────────┴──────────────┴───────────────────────┘

📊 Summary:
   - Total scenarios: 2
   - Passed: 2
   - Failed: 0
   - Success rate: 100.0%
   - Average latency: 1038ms
```

## Architecture
The MCP Evaluation Harness is built as a monorepo with two main packages:
### Core Package (@mcp-eval/core)
- Configuration Parser: YAML parsing and validation
- LLM Providers: Extensible interface for different LLM APIs
- MCP Client: Connection and tool discovery for MCP servers
- Orchestration Engine: Multi-turn conversation management
- Metrics Collection: Qualitative and quantitative evaluation
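As a rough sketch of what the MCP client layer does, the snippet below connects to an MCP server over Streamable HTTP and lists its tools using the official `@modelcontextprotocol/sdk` TypeScript SDK. It is illustrative only; the actual wiring inside `@mcp-eval/core` may differ.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// Sketch: connect to an MCP server over Streamable HTTP and discover its tools.
// Uses the official TypeScript SDK; the internals of @mcp-eval/core may differ.
async function discoverTools(uri: string): Promise<void> {
  const client = new Client({ name: "mcp-eval-sketch", version: "0.0.1" });
  const transport = new StreamableHTTPClientTransport(new URL(uri));
  await client.connect(transport);

  const { tools } = await client.listTools();
  console.log(tools.map((tool) => tool.name));

  // A tool call the agent might request, e.g. the cve-list tool from the example config:
  // const result = await client.callTool({ name: "cve-list", arguments: { limit: 5 } });

  await client.close();
}

discoverTools("http://localhost:3000/mcp").catch(console.error);
```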
### CLI Package (@mcp-eval/cli)
- Commander.js Interface: Command-line argument parsing
- Report Generation: Formatted table output with colors
- Error Handling: User-friendly error messages
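For orientation, the option surface of the `evaluate` command shown under Usage maps onto Commander.js roughly like this. This is a sketch, not the package's actual source:

```typescript
import { Command } from "commander";

// Sketch of how the evaluate command's options could be wired with Commander.js.
// Illustrative only; the real @mcp-eval/cli source may be organized differently.
const program = new Command();

program
  .name("modelcontextprotocol-eval")
  .description("MCP evaluation harness");

program
  .command("evaluate")
  .requiredOption("-c, --config <path>", "path to the YAML configuration file")
  .option("-o, --output <path>", "write results to a file (format inferred from extension)")
  .option("--format <format>", "output format: json, csv, or md")
  .option("-v, --verbose", "enable verbose debug logging")
  .action(async (options) => {
    // Load the config, run the orchestrator, and generate reports here.
    console.log(`Running evaluation with config: ${options.config}`);
  });

program.parse();
```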
## Supported LLM Providers

- Google Gemini: Full support via the `@google/genai` SDK
- Ollama: Full support via the `ollama` SDK with OpenAI-compatible API
  - Requires an Ollama server running locally (default: http://localhost:11434)
  - Supports all Ollama models including Llama, Qwen, Mistral, etc.
  - Tool calling supported via OpenAI-compatible function calling (illustrated below)
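To make "OpenAI-compatible function calling" concrete, a standalone request against Ollama's OpenAI-compatible chat endpoint looks roughly like the sketch below (plain `fetch`; the harness itself goes through the `ollama` SDK, and the `cve-list` tool definition here is just the example tool referenced in the sample config):

```typescript
// Sketch: tool calling against Ollama's OpenAI-compatible chat endpoint.
// Illustrative only; the harness uses the ollama SDK rather than raw fetch.
async function sketchOllamaToolCall(): Promise<void> {
  const response = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.1",
      messages: [{ role: "user", content: "What are the latest 5 CVEs?" }],
      tools: [
        {
          type: "function",
          function: {
            name: "cve-list", // example tool name taken from the sample configuration's rubric
            description: "List recent CVEs",
            parameters: {
              type: "object",
              properties: { limit: { type: "number" } },
            },
          },
        },
      ],
    }),
  });

  const data = await response.json();
  // If the model decides to call a tool, the call appears here in OpenAI format.
  console.log(data.choices[0].message.tool_calls);
}

sketchOllamaToolCall().catch(console.error);
```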
### Adding New Providers

To add support for a new LLM provider:

1. Implement the `LLMProvider` interface in `packages/core/src/llm-providers/` (see the sketch below)
2. Add the provider to the factory in `packages/core/src/llm-providers/factory.ts`
3. Update the configuration validation to recognize the new provider
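The shape of `LLMProvider` is not documented in this README, so the skeleton below is only a guess at what an implementation might look like; the interface, method names, and types are assumptions.

```typescript
// Hypothetical skeleton of a new provider. The real LLMProvider interface in
// packages/core/src/llm-providers/ may declare different method names and types.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the tool's arguments
}

interface LLMProvider {
  chat(messages: ChatMessage[], tools?: ToolDefinition[]): Promise<ChatMessage>;
}

class MyProvider implements LLMProvider {
  constructor(private model: string, private apiKey: string) {}

  async chat(messages: ChatMessage[], tools?: ToolDefinition[]): Promise<ChatMessage> {
    // Call the new provider's API here, forwarding the conversation and tool
    // definitions, then map the response back into the harness's message format.
    return { role: "assistant", content: "..." };
  }
}
```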
## Development

### Project Structure
```text
mcp-eval/
├── packages/
│   ├── cli/                     # CLI interface
│   │   ├── src/
│   │   │   ├── cli.ts           # Main CLI entry point
│   │   │   ├── commands/        # CLI commands
│   │   │   └── utils/           # Report generation
│   │   └── package.json
│   └── core/                    # Core evaluation engine
│       ├── src/
│       │   ├── types.ts         # Type definitions
│       │   ├── config-parser.ts
│       │   ├── mcp-client.ts
│       │   ├── orchestrator.ts
│       │   └── llm-providers/
│       └── package.json
├── package.json                 # Root package configuration
└── pnpm-workspace.yaml          # pnpm workspace configuration
```

### Building
```bash
# Build all packages
pnpm build

# Build specific package
pnpm --filter @mcp-eval/core build
pnpm --filter @mcp-eval/cli build
```

### Testing
```bash
# Run all tests
pnpm test

# Run tests for specific package
pnpm --filter @mcp-eval/core test
```

### Development Mode
```bash
# Run CLI in development mode - always use full absolute paths for config files
pnpm --filter @mcp-eval/cli run dev evaluate --config /full/path/to/config.yaml

# Test with Gemini provider (requires GOOGLE_API_KEY)
pnpm --filter @mcp-eval/cli run dev evaluate --config /Users/$(whoami)/path/to/example-config.yaml

# Test with Ollama provider (requires Ollama server running)
pnpm --filter @mcp-eval/cli run dev evaluate --config /Users/$(whoami)/path/to/example-ollama-template.yaml

# Alternative: use the evaluate script from root directory
cd /path/to/mcp-eval
pnpm evaluate --config example-config.yaml
```

**Important Notes:**
- Always use full absolute paths for configuration files to avoid "file not found" errors
- Ensure all required services are running (Ollama server, MCP servers) before evaluation
- Check that required API keys are set in your `.env` file
## Troubleshooting

### Common Issues
**API Key Not Found**

- Ensure the environment variable specified in `auth.api_key_env` is set
- Check that the API key is valid and has the necessary permissions

**MCP Server Connection Failed**

- Verify the MCP server URI is correct and accessible
- Ensure the MCP server supports the Streamable HTTP transport

**Configuration Validation Error**

- Check the YAML syntax is valid
- Ensure all required fields are present
- Verify field types match the expected schema

**Ollama Connection Issues**

- Ensure the Ollama server is running: `ollama serve`
- Verify the model is available: `ollama list`
- Pull required models: `ollama pull llama3.1`
- Check the `OLLAMA_BASE_URL` environment variable (should be `http://localhost:11434`, not `http://localhost:11434/v1`)

**MCP Server Connection Issues**

- The example configurations reference `cp-labs-mcp-server`, which may not be running
- For scenarios without MCP servers, remove the `mcp_server` field from the scenario
- Ensure any referenced MCP servers are running and accessible at the specified URI
### Logging and Debug Mode

The tool provides two logging levels:

**Default Mode (Minimal Output):**
- Shows essential progress indicators and results
- Clean output suitable for production use
- Displays summary table and final statistics
```bash
modelcontextprotocol-eval evaluate --config config.yaml
```

**Verbose Mode (Debug Information):**
- Shows detailed debug information
- MCP server connection logs
- Full LLM request/response data
- Tool call details and configuration info
```bash
modelcontextprotocol-eval evaluate --config config.yaml --verbose
# or short form
modelcontextprotocol-eval evaluate -c config.yaml -v
```

## Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Run the test suite
- Submit a pull request
## Roadmap

- [ ] Add cache feature to save tokens
- [ ] Add OpenAI support

## License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
## Changelog

### Version 1.0.2
- Added centralized logging system with configurable verbosity
- Added `--verbose` flag for detailed debug output
- Improved production output by reducing log noise by default
- Clean, minimal output for production use with optional verbose mode
### Version 1.0.1
- Fixed npm publishing configuration
- Added proper binary execution support
- Updated package metadata for public release
### Version 1.0.0
- Initial release
- Support for Google Gemini LLM provider
- MCP Streamable HTTP transport support
- YAML configuration system
- Console report generation
- Comprehensive error handling
