
mcp-evals v2.0.1

GitHub Action for evaluating MCP server tool calls using LLM-based scoring

Downloads: 75,891

MCP Evals

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring, with built-in observability support. It helps ensure that your MCP server's tools work correctly, perform well, and stay fully observable through integrated monitoring and metrics.

Installation

As a Node.js Package

npm install mcp-evals

As a GitHub Action

Add the following to your workflow file:

name: Run MCP Evaluations
on:
  pull_request:
    types: [opened, synchronize, reopened]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          
      - name: Install dependencies
        run: npm install
        
      - name: Run MCP Evaluations
        uses: mclenhard/[email protected]
        with:
          evals_path: 'src/evals/evals.ts'    # Can also use .yaml files
          server_path: 'src/index.ts'
          openai_api_key: ${{ secrets.OPENAI_API_KEY }}
          model: 'gpt-4'  # Optional, defaults to gpt-4

Usage -- Evals

1. Create Your Evaluation File

You can create evaluation configurations in either TypeScript or YAML format.

Option A: TypeScript Configuration

Create a file (e.g., evals.ts) that exports your evaluation configuration:

import { EvalConfig, EvalFunction, grade } from 'mcp-evals';
import { openai } from '@ai-sdk/openai';

const weatherEval: EvalFunction = {
  name: 'Weather Tool Evaluation',
  description: 'Evaluates the accuracy and completeness of weather information retrieval',
  run: async () => {
    const result = await grade(openai('gpt-4'), 'What is the weather in New York?');
    return JSON.parse(result);
  },
};

const config: EvalConfig = {
  model: openai('gpt-4'),
  evals: [weatherEval],
};

export default config;

export const evals = [
  weatherEval,
  // add other evals here
];

Option B: YAML Configuration

For simpler configuration, you can use YAML format (e.g., evals.yaml):

# Model configuration
model:
  provider: openai     # 'openai' or 'anthropic'
  name: gpt-4o        # Model name
  # api_key: sk-...   # Optional, uses OPENAI_API_KEY env var by default

# List of evaluations to run
evals:
  - name: weather_query_basic
    description: Test basic weather information retrieval
    prompt: "What is the current weather in San Francisco?"
    expected_result: "Should return current weather data for San Francisco including temperature, conditions, etc."

  - name: weather_forecast
    description: Test weather forecast functionality
    prompt: "Can you give me the 3-day weather forecast for Seattle?"
    expected_result: "Should return a multi-day forecast for Seattle"

  - name: invalid_location
    description: Test handling of invalid location requests
    prompt: "What's the weather in Atlantis?"
    expected_result: "Should handle invalid location gracefully with appropriate error message"

2. Run the Evaluations

As a Node.js Package

You can run the evaluations using the CLI with either TypeScript or YAML files:

# Using TypeScript configuration
npx mcp-eval path/to/your/evals.ts path/to/your/server.ts

# Using YAML configuration
npx mcp-eval path/to/your/evals.yaml path/to/your/server.ts

As a GitHub Action

The action will automatically:

  1. Run your evaluations
  2. Post the results as a comment on the PR
  3. Update the comment if the PR is updated

Evaluation Results

Each evaluation returns an object with the following structure:

interface EvalResult {
  accuracy: number;        // Score from 1-5
  completeness: number;    // Score from 1-5
  relevance: number;       // Score from 1-5
  clarity: number;         // Score from 1-5
  reasoning: number;       // Score from 1-5
  overall_comments: string; // Summary of strengths and weaknesses
}
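
If you consume these scores programmatically (for example, to fail CI when quality drops below a bar), a minimal sketch might look like the following. The threshold and the way the results array is collected are assumptions for illustration, not part of mcp-evals.

// Assumes EvalResult is exported by mcp-evals; otherwise copy the interface above.
import type { EvalResult } from 'mcp-evals';

// Hypothetical quality gate: throw if any dimension scores below a chosen minimum.
const MIN_SCORE = 3;

function assertQuality(results: EvalResult[]): void {
  for (const result of results) {
    const scores = [
      result.accuracy,
      result.completeness,
      result.relevance,
      result.clarity,
      result.reasoning,
    ];
    if (scores.some((score) => score < MIN_SCORE)) {
      throw new Error(`Eval below threshold: ${result.overall_comments}`);
    }
  }
}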

Configuration

Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key (required for OpenAI models)
  • ANTHROPIC_API_KEY: Your Anthropic API key (required for Anthropic models)

[!NOTE] If you're using this GitHub Action with open source software, enable data sharing in the OpenAI billing dashboard to claim 2.5 million free GPT-4o mini tokens per day, making this Action effectively free to use.

Evaluation Configuration

TypeScript Configuration

The EvalConfig interface requires:

  • model: The language model to use for evaluation (e.g., GPT-4)
  • evals: Array of evaluation functions to run

Each evaluation function must implement:

  • name: Name of the evaluation
  • description: Description of what the evaluation tests
  • run: Async function that takes a model and returns an EvalResult (a minimal sketch follows below)
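
As a point of reference, an evaluation function matching that shape might look like the sketch below. It assumes the grade helper shown earlier; the name, description, and prompt are placeholders and should be replaced with ones that exercise your own tool.

import { grade, EvalFunction } from 'mcp-evals';

// Minimal sketch of an EvalFunction whose run receives the configured model.
// The name, description, and prompt are placeholders for illustration.
const exampleEval: EvalFunction = {
  name: 'example_tool_eval',
  description: 'Checks that the example tool answers a basic prompt sensibly',
  run: async (model) => {
    const result = await grade(model, 'What is the weather in New York?');
    return JSON.parse(result);
  },
};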

YAML Configuration

YAML configuration files support:

Model Configuration:

  • provider: Either 'openai' or 'anthropic'
  • name: Model name (e.g., 'gpt-4o', 'claude-3-opus-20240229')
  • api_key: Optional API key (uses environment variables by default)

Evaluation Configuration:

  • name: Name of the evaluation (required)
  • description: Description of what the evaluation tests (required)
  • prompt: The prompt to send to your MCP server (required)
  • expected_result: Optional description of expected behavior

Supported File Extensions: .yaml, .yml

Usage -- Monitoring

Note: The metrics functionality is still in alpha. Features and APIs may change, and breaking changes are possible.

  1. Add the following to your application before you initialize the MCP server:

import { metrics } from 'mcp-evals';

metrics.initialize(9090, { enableTracing: true, otelEndpoint: 'http://localhost:4318/v1/traces' });

  2. Start the monitoring stack:

docker-compose up -d

  3. Run your MCP server; it will automatically connect to the monitoring stack.
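
For context, the initialization call goes before your server is constructed so instrumentation is active for every tool call. The sketch below is an assumed setup, not something mcp-evals prescribes: the McpServer import path and the server name/version are placeholders from the @modelcontextprotocol/sdk.

import { metrics } from 'mcp-evals';
// Placeholder server import; adjust the path to match your SDK version.
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';

// Start the metrics endpoint (and optional tracing) before the server exists,
// using the same port and OTLP endpoint as the steps above.
metrics.initialize(9090, {
  enableTracing: true,
  otelEndpoint: 'http://localhost:4318/v1/traces',
});

// Hypothetical server setup; register your tools and transport as usual.
const server = new McpServer({ name: 'example-server', version: '1.0.0' });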

Accessing the Dashboards

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000 (username: admin, password: admin)
  • Jaeger UI: http://localhost:16686

Metrics Available

  • Tool Calls: Number of tool calls by tool name
  • Tool Errors: Number of errors by tool name
  • Tool Latency: Distribution of latency times by tool name

License

MIT