# ai-eval-cli
Vendor-neutral AI evaluation and regression testing CLI. Test prompts, agents, MCP servers, RAG pipelines, and JSON outputs — then catch regressions in CI.
## Why ai-eval?
- Vendor-neutral — Works with Anthropic, OpenAI, Ollama, and any LLM
- Regression detection — Save baselines, compare runs, fail CI on quality drops
- 19 assertion types — String, JSON Schema, LLM-rubric, agent tool calls, latency, cost
- Stability controls — Multi-run aggregation, confidence bands, seed support
- Japanese eval support — Keigo checks, tone consistency (coming soon)
- MCP & n8n ready — Test MCP servers and n8n workflows natively
## Quick Start

```bash
# Install
npm install -g ai-eval-cli

# Initialize a project
ai-eval init

# Edit ai-eval.config.yaml, then run
ai-eval run

# Save a baseline
ai-eval run --save-baseline

# Compare against baseline (exits 1 on regression)
ai-eval run --compare latest
```

## Config Example
version: "1"
description: "My AI eval suite"
providers:
- id: anthropic:claude-sonnet-4-20250514
config:
temperature: 0
max_tokens: 1024
defaults:
timeout_ms: 30000
max_concurrency: 3
suites:
- name: "cs-quality"
type: prompt
tests:
- description: "Polite CS response"
input: "My order hasn't arrived"
prompt: "You are a CS agent. Respond to: {{input}}"
assert:
- type: contains
value: "sorry"
- type: llm-rubric
value: "Response is empathetic and provides next steps"
threshold: 0.8
- type: latency
max_ms: 3000Assertion Types
| Category | Types | Status |
|----------|-------|--------|
| String | `contains`, `not-contains`, `equals`, `regex`, `starts-with` | Implemented |
| JSON | `is-json`, `json-schema`, `json-path` | Implemented (`json-path`: stub) |
| LLM | `llm-rubric`, `similar`, `factuality` | `llm-rubric` implemented |
| Agent | `tool-called`, `tool-args-match`, `no-tool-called`, `tool-sequence` | Implemented (`tool-sequence`: stub) |
| Performance | `latency`, `cost` | Implemented |
| MCP | `mcp-response` | Coming soon |
| Japanese | `keigo-check`, `tone-consistency` | Coming soon |
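Each entry above is used as a `type` in an `assert` list, as in the config example. A minimal sketch of two more assertion shapes follows; the exact fields for `json-schema` and `tool-called` are assumptions extrapolated from that config format, so check the generated `ai-eval.config.yaml` for the authoritative shape:

```yaml
assert:
  # Structural check: output must parse as JSON and satisfy this schema
  # (the inline-schema format shown here is an assumption)
  - type: json-schema
    value:
      type: object
      required: [status]
  # Agent check: the run must include a call to this tool
  # (tool name is a hypothetical example)
  - type: tool-called
    value: "lookup_order"
```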
## CI Integration

```yaml
# GitHub Actions
- name: Run AI Evals
  run: npx ai-eval-cli run --compare latest --format json
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Exit codes:

- `0` — All tests passed, no regressions
- `1` — Test failures or regressions detected
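For context, the step above slots into an ordinary workflow. Everything around the eval step in this sketch (trigger, checkout, Node setup) is standard GitHub Actions boilerplate and an assumption about your repo, not part of ai-eval-cli:

```yaml
name: ai-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Exit code 1 on failures or regressions fails the job automatically
      - name: Run AI Evals
        run: npx ai-eval-cli run --compare latest --format json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```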
## Output Formats

```bash
ai-eval run                    # Console (default, colored)
ai-eval run --format json      # JSON (for CI parsing)
ai-eval run --format markdown  # Markdown (for PR comments)
```

## Stability Controls
Handle LLM output stochasticity:
```yaml
stability:
  runs_per_test: 3
  score_aggregation: median     # median | mean | worst | best
  binary_aggregation: majority  # majority | all_pass | any_pass
  confidence_band: 0.15
  temperature_override: 0
```
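For example, with `runs_per_test: 3` and `score_aggregation: median`, rubric scores of 0.60, 0.85, and 0.90 aggregate to 0.85 (under `worst`, the same runs would report 0.60), and with `binary_aggregation: majority` a test that passes in 2 of 3 runs counts as a pass.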
## Providers

| Provider | Config ID | API Key Env Var |
|----------|-----------|-----------------|
| Anthropic | `anthropic:claude-sonnet-4-20250514` | `ANTHROPIC_API_KEY` |
| OpenAI | `openai:gpt-4o` | `OPENAI_API_KEY` |
| Ollama | `ollama:llama3.3:70b` | (none, local) |
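To compare models, the `providers` list from the config example can plausibly hold several entries side by side; the plural key suggests this, but it is an assumption rather than something stated above, so treat this as a sketch:

```yaml
providers:
  - id: anthropic:claude-sonnet-4-20250514
    config:
      temperature: 0
  - id: openai:gpt-4o
    config:
      temperature: 0
  - id: ollama:llama3.3:70b  # local model; no API key required
```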
## License
MIT
