modelcontextprotocol-eval v1.0.2

Model Context Protocol (MCP) evaluation framework and tools

MCP Evaluation Harness

A command-line interface (CLI) tool designed to benchmark and evaluate the tool-use capabilities of Large Language Models (LLMs) using the Model Context Protocol (MCP).

Overview

The MCP Evaluation Harness provides a standardized, repeatable method for assessing how effectively different LLMs interact with external tools and data sources exposed via MCP servers. The tool orchestrates interactions between specified LLMs and MCP servers, executing predefined scenarios and measuring both qualitative correctness and quantitative performance.

Features

  • Multi-LLM Support: Currently supports Google Gemini and Ollama, with an extensible architecture for additional providers
  • Judge-Based Evaluation: Uses separate LLM judges to evaluate agent performance with custom rubrics
  • MCP Integration: Connects to MCP servers via Streamable HTTP transport
  • YAML Configuration: Easy-to-define evaluation scenarios and configurations
  • Comprehensive Metrics: Collects both qualitative (correctness) and quantitative (latency, tokens) metrics
  • Multiple Output Formats: JSON, Markdown, and CSV report generation
  • Rich Reporting: Formatted console reports with PASS/FAIL status and detailed reasoning
  • Error Handling: Graceful handling of network failures, API errors, and configuration issues

Installation

Prerequisites

  • Node.js 18.0.0 or higher

Install from npm

# Install globally
npm install -g modelcontextprotocol-eval

# Or use npx directly
npx modelcontextprotocol-eval --help

Build from Source

# Clone the repository
git clone <repository-url>
cd mcp-eval

# Install dependencies
pnpm install

# Build the project
pnpm build

# Link the CLI globally (optional)
cd packages/cli
npm link

Configuration

Create a YAML configuration file that defines your LLMs, MCP servers, and evaluation scenarios:

# Define the Language Models with their roles
llms:
  - name: "Gemini-Flash-Agent"
    provider: "gemini"
    model: "gemini-2.5-flash"
    role: "agent" # Can be 'agent' or 'judge'
    auth:
      api_key_env: "GOOGLE_API_KEY"

  - name: "Ollama-Agent"
    provider: "ollama"
    model: "llama3.1"
    role: "agent"
    auth:
      api_key_env: "OLLAMA_API_KEY" # Not used but required

  - name: "Gemini-Pro-Judge"
    provider: "gemini"
    model: "gemini-2.5-pro"
    role: "judge"
    auth:
      api_key_env: "GOOGLE_API_KEY"

# Define the MCP Servers that provide the tools
mcp_servers:
  - name: "cp-labs-mcp-server"
    uri: "http://localhost:3000/mcp"

# Define the evaluation scenarios
scenarios:
  - name: "MCP server integration test"
    llm_agent: "Gemini-Flash-Agent" # Reference an LLM by name
    llm_judge: "Gemini-Pro-Judge" # Reference a judge by name
    mcp_server: "cp-labs-mcp-server" # Reference an MCP server
    system_prompt: "You are a helpful assistant. Use available tools when needed."
    prompt: "What are the latest 5 CVEs?"

    # Required evaluation rubric for the judge LLM
    llm_judge_rubric: |
      Analyze the conversation transcript.
      CRITERIA:
      1. Tool Usage: Did the agent correctly use the cve-list tool?
      2. Output Quality: Is the response clear and informative?

      Respond with JSON:
      - "result": "PASS" or "FAIL"
      - "reason": explanation
      - "score": 1-5 rating

  - name: "Answer from own knowledge"
    llm_agent: "Gemini-Flash-Agent"
    llm_judge: "Gemini-Pro-Judge"
    system_prompt: "You are a helpful general knowledge assistant."
    prompt: "What is the capital of France?"

    llm_judge_rubric: |
      Evaluate the response to a knowledge question.
      CRITERIA:
      1. Correctness: Is the answer "Paris"?
      2. Tool Abstinence: Did the agent avoid using tools?

      Respond with JSON:
      - "result": "PASS" or "FAIL"
      - "reason": explanation
      - "score": 1-5 rating

Configuration Reference

LLM Configuration

  • name: Display name for the LLM
  • provider: Provider type (supports "gemini" and "ollama")
  • model: Model identifier (e.g., "gemini-2.5-flash", "gemini-2.5-pro" for Gemini; "llama3.1", "qwen2.5" for Ollama)
  • role: Either "agent" (performs tasks) or "judge" (evaluates performance)
  • auth.api_key_env: Environment variable containing the API key (not used by Ollama but required by schema)

MCP Server Configuration

  • name: Display name for the MCP server
  • uri: HTTP endpoint URL for the MCP server
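
To sanity-check that a configured uri is reachable before running an evaluation, a standalone probe using the official MCP TypeScript SDK might look roughly like the sketch below. This is illustrative only; the harness's own client lives in packages/core/src/mcp-client.ts and may differ.

// probe-mcp-server.ts - illustrative connectivity check, not part of the harness
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

async function probe(uri: string): Promise<void> {
  const client = new Client({ name: "mcp-eval-probe", version: "0.0.1" });
  const transport = new StreamableHTTPClientTransport(new URL(uri));
  await client.connect(transport);            // fails fast if the server is unreachable
  const { tools } = await client.listTools(); // discover the tools the server exposes
  console.log(`Connected. ${tools.length} tool(s) available:`, tools.map((t) => t.name));
  await client.close();
}

probe("http://localhost:3000/mcp").catch((err) => {
  console.error("MCP server probe failed:", err);
  process.exit(1);
});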

Scenario Configuration

  • name: Display name for the scenario
  • llm_agent: Name of the LLM to use as the agent
  • llm_judge: Name of the LLM to use as the judge
  • mcp_server: (Optional) Name of the MCP server to connect to
  • system_prompt: (Optional) System prompt to set LLM behavior
  • prompt: User prompt that initiates the interaction
  • llm_judge_rubric: Evaluation criteria and instructions for the judge LLM
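
Putting the reference together, a minimal scenario entry that omits the optional mcp_server and system_prompt fields could look like this (the LLM names refer to the earlier configuration example):

# Minimal scenario: only the required fields
scenarios:
  - name: "Minimal scenario"
    llm_agent: "Gemini-Flash-Agent"
    llm_judge: "Gemini-Pro-Judge"
    prompt: "What is 2 + 2?"
    llm_judge_rubric: |
      PASS if the answer is 4, otherwise FAIL.
      Respond with JSON containing "result", "reason", and "score".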

Environment Variables

You can set API keys in two ways:

Option 1: Using a .env file (Recommended)

Create a .env file in the project root:

# .env
GOOGLE_API_KEY=your_actual_google_api_key_here

# For Ollama provider
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_API_KEY=dummy  # Not used but required by schema

Option 2: Using environment variables

# For Google Gemini
export GOOGLE_API_KEY="your-gemini-api-key"

# For Ollama (base URL, defaults to http://localhost:11434)
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_API_KEY="dummy" # Not used but required by schema

# Add other provider keys as needed

Important:

  • The api_key_env field in your configuration should contain the name of the environment variable (e.g., "GOOGLE_API_KEY"), not the actual API key value
  • Never commit actual API keys to version control
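
Conceptually, the key is resolved at runtime by reading the environment variable whose name appears in the configuration. A simplified TypeScript illustration (not the harness's actual implementation):

// Illustrative only: auth.api_key_env holds the *name* of the variable,
// and the secret itself is read from the environment at runtime.
const apiKeyEnv = "GOOGLE_API_KEY";      // value of auth.api_key_env from config.yaml
const apiKey = process.env[apiKeyEnv];   // the actual key never appears in the config file
if (!apiKey) {
  throw new Error(`Environment variable ${apiKeyEnv} is not set`);
}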

Usage

Basic Usage

# Run evaluation with a configuration file
modelcontextprotocol-eval evaluate --config config.yaml

# Or use npx
npx modelcontextprotocol-eval evaluate --config config.yaml

# Enable verbose logging for debugging
modelcontextprotocol-eval evaluate --config config.yaml --verbose

# Save results to a file (auto-detects format from extension)
modelcontextprotocol-eval evaluate --config config.yaml --output results.json
modelcontextprotocol-eval evaluate --config config.yaml --output results.csv
modelcontextprotocol-eval evaluate --config config.yaml --output results.md

# Specify format explicitly
modelcontextprotocol-eval evaluate --config config.yaml --output results.txt --format json

# Short form with verbose mode
modelcontextprotocol-eval evaluate -c config.yaml -o results.json -v

Output Formats

The tool supports multiple output formats for saving evaluation results:

  • JSON (.json) - Complete structured data (illustrated after this list) including:
    • Full conversation transcripts (LLM responses, tool calls, results)
    • Detailed metrics (tokens, latency, tool usage)
    • Judge evaluations with scores and reasoning
    • Timestamps and metadata
  • CSV (.csv) - Spreadsheet-friendly summaries with per-scenario results, counts, and truncated response content
  • Markdown (.md) - Human-readable reports featuring:
    • Executive summary with visual indicators (✅/❌)
    • Detailed scenario breakdowns
    • Full LLM response content
    • Tool interaction details
    • Judge evaluation explanations
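
As a rough illustration of the JSON report's shape, a single scenario entry might resemble the snippet below. The field names here are hypothetical; inspect an actual results.json for the real schema.

{
  "scenario": "MCP server integration test",
  "agent": "Gemini-Flash-Agent",
  "judge": "Gemini-Pro-Judge",
  "result": "PASS",
  "score": 5,
  "reason": "The agent called the cve-list tool and summarized the results clearly.",
  "metrics": { "latency_ms": 1245, "total_tokens": 982, "tool_calls": 1 },
  "transcript": ["..."],
  "timestamp": "2025-01-01T00:00:00Z"
}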

Response Display Features

Judge-Based Evaluation System

The harness uses a two-LLM system:

  • Agent LLM: Performs the actual task (tool usage, reasoning)
  • Judge LLM: Evaluates the agent's performance using custom rubrics

This approach provides:

  • Objective Assessment: Separate LLM removes bias from self-evaluation
  • Custom Rubrics: Tailored evaluation criteria for each scenario
  • Detailed Feedback: Rich explanations and scoring for performance analysis
  • Scalable Evaluation: Consistent judging across multiple scenarios
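
In code terms, the judge step conceptually combines the scenario's rubric with the agent's transcript and parses the JSON verdict the judge returns. A minimal sketch with hypothetical helper names, not the harness's internal API:

// Hypothetical sketch of the judge step; names and types are illustrative.
interface JudgeVerdict {
  result: "PASS" | "FAIL";
  reason: string;
  score: number; // 1-5, as requested by the rubric
}

async function judgeScenario(
  generate: (prompt: string) => Promise<string>, // assumed judge-LLM call
  rubric: string,
  transcript: string
): Promise<JudgeVerdict> {
  // The judge sees the full conversation plus the scenario's rubric
  const prompt = `${rubric}\n\nTRANSCRIPT:\n${transcript}\n\nRespond with JSON only.`;
  const raw = await generate(prompt);
  return JSON.parse(raw) as JudgeVerdict;
}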

Response Capture and Display

The evaluation harness captures and displays comprehensive response information:

Console Output

  • Scenario Details: Each scenario shows detailed information including status, latency, and token usage
  • LLM Responses: Actual content returned by the LLM, truncated for readability
  • MCP Server Responses: Tool calls, arguments, and results from MCP servers
  • Final Response: The complete final response from the LLM
  • Color-coded Output: Green for PASS, red for FAIL, with emoji indicators

File Output Details

  • JSON Format: Complete response data including full LLM responses, MCP tool results, and metadata
  • CSV Format: Response summaries with counts and truncated content
  • Markdown Format: Beautifully formatted reports with detailed response sections, including:
    • Full LLM response content
    • MCP server tool calls with arguments and results
    • Final responses for each scenario
    • Visual indicators (✅/❌) for easy scanning

Example Output

🚀 Starting MCP evaluation...

📋 Loading configuration from: /path/to/config.yaml
✅ Configuration loaded successfully
   - LLMs: 1
   - MCP Servers: 1
   - Scenarios: 2

🔧 Initializing evaluation orchestrator...
✅ Orchestrator initialized successfully

🏃 Running evaluation scenarios...
✅ Evaluation completed. 2 scenarios processed

===== Evaluation Summary =====

┌──────────────────────────────────┬─────────────────────┬─────────────────────────┬────────┬───────────────┬──────────────────────────────────────────────────┐
│ Scenario                         │ LLM                 │ MCP Server              │ Result │ Latency (ms)  │ Reason                                           │
├──────────────────────────────────┼─────────────────────┼─────────────────────────┼────────┼───────────────┼──────────────────────────────────────────────────┤
│ List files in root directory     │ Gemini 2.5 Flash    │ My Remote Tool Server   │ PASS   │ 1245          │ All expectations met.                            │
│ Answer from own knowledge        │ Gemini 2.5 Flash    │ My Remote Tool Server   │ PASS   │ 832           │ All expectations met.                            │
└──────────────────────────────────┴─────────────────────┴─────────────────────────┴────────┴───────────────┴──────────────────────────────────────────────────┘

📊 Summary:
   - Total scenarios: 2
   - Passed: 2
   - Failed: 0
   - Success rate: 100.0%
   - Average latency: 1038ms

Architecture

The MCP Evaluation Harness is built as a monorepo with two main packages:

Core Package (@mcp-eval/core)

  • Configuration Parser: YAML parsing and validation
  • LLM Providers: Extensible interface for different LLM APIs
  • MCP Client: Connection and tool discovery for MCP servers
  • Orchestration Engine: Multi-turn conversation management (see the sketch after this list)
  • Metrics Collection: Qualitative and quantitative evaluation
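
The orchestrator's multi-turn loop can be pictured roughly as follows: ask the agent LLM, execute any requested MCP tool, feed the result back, and stop when the agent produces a final answer. The sketch below uses assumed callLLM/callTool hooks and is not the actual orchestrator.ts code.

// Illustrative agent loop with assumed callLLM/callTool hooks.
interface ToolCall { name: string; args: Record<string, unknown>; }
interface AgentTurn { text: string; toolCall?: ToolCall; }

async function runAgentLoop(
  callLLM: (history: string[]) => Promise<AgentTurn>,
  callTool: (call: ToolCall) => Promise<string>,
  userPrompt: string,
  maxTurns = 5
): Promise<string[]> {
  const history: string[] = [`user: ${userPrompt}`];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await callLLM(history);
    history.push(`assistant: ${reply.text}`);
    if (!reply.toolCall) break;                              // final answer reached
    const toolResult = await callTool(reply.toolCall);
    history.push(`tool(${reply.toolCall.name}): ${toolResult}`);
  }
  return history;                                            // transcript handed to the judge
}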

CLI Package (@mcp-eval/cli)

  • Commander.js Interface: Command-line argument parsing
  • Report Generation: Formatted table output with colors
  • Error Handling: User-friendly error messages

Supported LLM Providers

  • Google Gemini: Full support via @google/genai SDK
  • Ollama: Full support via ollama SDK with OpenAI-compatible API
    • Requires Ollama server running locally (default: http://localhost:11434)
    • Supports all Ollama models including Llama, Qwen, Mistral, etc.
    • Tool calling supported via OpenAI-compatible function calling

Adding New Providers

To add support for a new LLM provider:

  1. Implement the LLMProvider interface in packages/core/src/llm-providers/
  2. Add the provider to the factory in packages/core/src/llm-providers/factory.ts
  3. Update the configuration validation to recognize the new provider
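
A new provider implementation might look roughly like the sketch below. The LLMProvider shape and method names shown here are assumptions for illustration; check the actual interface in packages/core/src/llm-providers/ before implementing.

// packages/core/src/llm-providers/my-provider.ts (illustrative sketch)
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Assumed shape of the LLMProvider interface for this sketch.
interface LLMProvider {
  name: string;
  chat(messages: ChatMessage[]): Promise<string>;
}

export class MyProvider implements LLMProvider {
  name = "my-provider";

  constructor(private apiKey: string, private model: string) {}

  async chat(messages: ChatMessage[]): Promise<string> {
    // Call the provider's HTTP API (placeholder URL) and return the assistant text.
    const response = await fetch("https://api.example.com/v1/chat", {
      method: "POST",
      headers: { Authorization: `Bearer ${this.apiKey}`, "Content-Type": "application/json" },
      body: JSON.stringify({ model: this.model, messages }),
    });
    const data = (await response.json()) as { text: string };
    return data.text;
  }
}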

Development

Project Structure

mcp-eval/
├── packages/
│   ├── cli/                 # CLI interface
│   │   ├── src/
│   │   │   ├── cli.ts       # Main CLI entry point
│   │   │   ├── commands/    # CLI commands
│   │   │   └── utils/       # Report generation
│   │   └── package.json
│   └── core/               # Core evaluation engine
│       ├── src/
│       │   ├── types.ts     # Type definitions
│       │   ├── config-parser.ts
│       │   ├── mcp-client.ts
│       │   ├── orchestrator.ts
│       │   └── llm-providers/
│       └── package.json
├── package.json            # Root package configuration
└── pnpm-workspace.yaml     # pnpm workspace configuration

Building

# Build all packages
pnpm build

# Build specific package
pnpm --filter @mcp-eval/core build
pnpm --filter @mcp-eval/cli build

Testing

# Run all tests
pnpm test

# Run tests for specific package
pnpm --filter @mcp-eval/core test

Development Mode

# Run CLI in development mode - always use full absolute paths for config files
pnpm --filter @mcp-eval/cli run dev evaluate --config /full/path/to/config.yaml

# Test with Gemini provider (requires GOOGLE_API_KEY)
pnpm --filter @mcp-eval/cli run dev evaluate --config /Users/$(whoami)/path/to/example-config.yaml

# Test with Ollama provider (requires Ollama server running)
pnpm --filter @mcp-eval/cli run dev evaluate --config /Users/$(whoami)/path/to/example-ollama-template.yaml

# Alternative: use the evaluate script from root directory
cd /path/to/mcp-eval
pnpm evaluate --config example-config.yaml

Important Notes:

  • Always use full absolute paths for configuration files to avoid "file not found" errors
  • Ensure all required services are running (Ollama server, MCP servers) before evaluation
  • Check that required API keys are set in your .env file

Troubleshooting

Common Issues

  1. API Key Not Found

    • Ensure the environment variable specified in auth.api_key_env is set
    • Check that the API key is valid and has the necessary permissions
  2. MCP Server Connection Failed

    • Verify the MCP server URI is correct and accessible
    • Ensure the MCP server supports the Streamable HTTP transport
  3. Configuration Validation Error

    • Check the YAML syntax is valid
    • Ensure all required fields are present
    • Verify field types match the expected schema
  4. Ollama Connection Issues

    • Ensure Ollama server is running: ollama serve
    • Verify the model is available: ollama list
    • Pull required models: ollama pull llama3.1
    • Check OLLAMA_BASE_URL environment variable (should be http://localhost:11434, not http://localhost:11434/v1)
  5. MCP Server Connection Issues

    • The example configurations reference cp-labs-mcp-server, which may not be running in your environment
    • For scenarios without MCP servers, remove the mcp_server field from the scenario
    • Ensure any referenced MCP servers are running and accessible at the specified URI

Logging and Debug Mode

The tool provides two logging levels:

Default Mode (Minimal Output):

  • Shows essential progress indicators and results
  • Clean output suitable for production use
  • Displays summary table and final statistics

modelcontextprotocol-eval evaluate --config config.yaml

Verbose Mode (Debug Information):

  • Shows detailed debug information
  • MCP server connection logs
  • Full LLM request/response data
  • Tool call details and configuration info

modelcontextprotocol-eval evaluate --config config.yaml --verbose
# or short form
modelcontextprotocol-eval evaluate -c config.yaml -v

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Run the test suite
  6. Submit a pull request

Roadmap

  • [ ] Add a caching feature to save tokens
  • [ ] Add OpenAI support

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Changelog

Version 1.0.2

  • Added centralized logging system with configurable verbosity
  • Added --verbose flag for detailed debug output
  • Improved production output by reducing log noise by default
  • Clean, minimal output for production use with optional verbose mode

Version 1.0.1

  • Fixed npm publishing configuration
  • Added proper binary execution support
  • Updated package metadata for public release

Version 1.0.0

  • Initial release
  • Support for Google Gemini LLM provider
  • MCP Streamable HTTP transport support
  • YAML configuration system
  • Console report generation
  • Comprehensive error handling