# ai-eval-cli
Vendor-neutral AI evaluation and regression testing CLI. Test prompts, agents, MCP servers, RAG pipelines, and JSON outputs — then catch regressions in CI.
## Why ai-eval?
- Vendor-neutral — Works with Anthropic, OpenAI, Ollama, and any LLM
- Regression detection — Save baselines, compare runs, fail CI on quality drops
- 19 assertion types — String, JSON Schema, LLM-rubric, agent tool calls, latency, cost
- Stability controls — Multi-run aggregation, confidence bands, seed support
- Japanese eval support — Keigo checks, tone consistency (coming soon)
- MCP & n8n ready — Test MCP servers and n8n workflows natively
## Quick Start

```bash
# Install
npm install -g ai-eval-cli

# Initialize a project
ai-eval init

# Edit ai-eval.config.yaml, then run
ai-eval run

# Save a baseline
ai-eval run --save-baseline

# Compare against baseline (exits 1 on regression)
ai-eval run --compare latest
```

## Config Example
version: "1"
description: "My AI eval suite"
providers:
- id: anthropic:claude-sonnet-4-20250514
config:
temperature: 0
max_tokens: 1024
defaults:
timeout_ms: 30000
max_concurrency: 3
suites:
- name: "cs-quality"
type: prompt
tests:
- description: "Polite CS response"
input: "My order hasn't arrived"
prompt: "You are a CS agent. Respond to: {{input}}"
assert:
- type: contains
value: "sorry"
- type: llm-rubric
value: "Response is empathetic and provides next steps"
threshold: 0.8
- type: latency
max_ms: 3000Assertion Types
| Category | Types | Status |
|----------|-------|--------|
| String | `contains`, `not-contains`, `equals`, `regex`, `starts-with` | Implemented |
| JSON | `is-json`, `json-schema`, `json-path` | Implemented (`json-path`: stub) |
| LLM | `llm-rubric`, `similar`, `factuality` | `llm-rubric` implemented |
| Agent | `tool-called`, `tool-args-match`, `no-tool-called`, `tool-sequence` | Implemented (`tool-sequence`: stub) |
| Performance | `latency`, `cost` | Implemented |
| MCP | `mcp-response` | Coming soon |
| Japanese | `keigo-check`, `tone-consistency` | Coming soon |
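Each entry above is used as a `type` in an `assert` list, as in the config example. A minimal sketch of two more assertion shapes follows; the exact fields for `json-schema` and `tool-called` are assumptions extrapolated from that config format, so check the generated `ai-eval.config.yaml` for the authoritative shape:

```yaml
assert:
  # Structural check: output must parse as JSON and satisfy this schema
  # (the inline-schema format shown here is an assumption)
  - type: json-schema
    value:
      type: object
      required: [status]
  # Agent check: the run must include a call to this tool
  # (tool name is a hypothetical example)
  - type: tool-called
    value: "lookup_order"
```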
## CI Integration

```yaml
# GitHub Actions
- name: Run AI Evals
  run: npx ai-eval-cli run --compare latest --format json
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Exit codes:

- `0` — All tests passed, no regressions
- `1` — Test failures or regressions detected
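For context, the step above slots into an ordinary workflow. Everything around the eval step in this sketch (trigger, checkout, Node setup) is standard GitHub Actions boilerplate and an assumption about your repo, not part of ai-eval-cli:

```yaml
name: ai-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Exit code 1 on failures or regressions fails the job automatically
      - name: Run AI Evals
        run: npx ai-eval-cli run --compare latest --format json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```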
## Output Formats

```bash
ai-eval run                    # Console (default, colored)
ai-eval run --format json      # JSON (for CI parsing)
ai-eval run --format markdown  # Markdown (for PR comments)
```

## Stability Controls
Handle LLM output stochasticity:
```yaml
stability:
  runs_per_test: 3
  score_aggregation: median     # median | mean | worst | best
  binary_aggregation: majority  # majority | all_pass | any_pass
  confidence_band: 0.15
  temperature_override: 0
```
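For example, with `runs_per_test: 3` and `score_aggregation: median`, rubric scores of 0.60, 0.85, and 0.90 aggregate to 0.85 (under `worst`, the same runs would report 0.60), and with `binary_aggregation: majority` a test that passes in 2 of 3 runs counts as a pass.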
## Providers

| Provider | Config ID | API Key Env Var |
|----------|-----------|-----------------|
| Anthropic | `anthropic:claude-sonnet-4-20250514` | `ANTHROPIC_API_KEY` |
| OpenAI | `openai:gpt-4o` | `OPENAI_API_KEY` |
| Ollama | `ollama:llama3.3:70b` | (none, local) |
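To compare models, the `providers` list from the config example can plausibly hold several entries side by side; the plural key suggests this, but it is an assumption rather than something stated above, so treat this as a sketch:

```yaml
providers:
  - id: anthropic:claude-sonnet-4-20250514
    config:
      temperature: 0
  - id: openai:gpt-4o
    config:
      temperature: 0
  - id: ollama:llama3.3:70b  # local model; no API key required
```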
## License
MIT
