# EvalView — Proof that your agent still works.
You changed a prompt. Swapped a model. Updated a tool. Did anything break? Run EvalView. Know for sure.
```bash
pip install evalview && evalview demo   # No API key needed
```

## 🔍 What EvalView Catches
| Status | What it means | What you do |
|--------|--------------|-------------|
| ✅ PASSED | Agent behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Agent is calling different tools | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output quality shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
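The status table above can be read as a small decision function. Here is a minimal sketch of that logic in Python — the threshold, function names, and check order are illustrative assumptions, not EvalView's actual implementation:

```python
REGRESSION_DROP = 10  # hypothetical threshold: a score drop this large counts as a regression


def tool_diff(baseline_tools, current_tools):
    """Summarize tool changes as +added / -removed, like the CLI output."""
    added = sorted(set(current_tools) - set(baseline_tools))
    removed = sorted(set(baseline_tools) - set(current_tools))
    return [f"+{t}" for t in added] + [f"-{t}" for t in removed]


def classify(baseline_tools, current_tools, baseline_score, current_score):
    """Map a baseline/current comparison to one of the four statuses."""
    if baseline_score - current_score >= REGRESSION_DROP:
        return "REGRESSION"
    if set(current_tools) != set(baseline_tools):
        return "TOOLS_CHANGED"
    if current_score != baseline_score:
        return "OUTPUT_CHANGED"
    return "PASSED"
```

For example, `classify(["calculator"], ["web_search"], 85, 84)` would report `TOOLS_CHANGED`, and `tool_diff` would show `+web_search, -calculator`.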
## 🤔 How It Works

**Simple workflow (recommended):**

```bash
# 1. Your agent works correctly
evalview snapshot   # 📸 Save current behavior as baseline

# 2. You change something (prompt, model, tools)
evalview check      # 🔍 Detect regressions automatically

# 3. EvalView tells you exactly what changed
# → ✅ All clean! No regressions detected.
# → ⚠️ TOOLS_CHANGED: +web_search, -calculator
# → ❌ REGRESSION: score 85 → 71
```

**Advanced workflow (more control):**

```bash
evalview run --save-golden   # Save specific result as baseline
evalview run --diff          # Compare with custom options
```

That's it. Deterministic proof, no LLM-as-judge required, no API keys needed.
## 🎯 New: Habit-Forming Regression Detection

EvalView now tracks your progress and celebrates wins:

```bash
evalview check
# 🔍 Comparing against your baseline...
# ✨ All clean! No regressions detected.
# 🎯 5 clean checks in a row! You're on a roll.
```

**Features:**
- 🔥 Streak tracking — Celebrate consecutive clean checks (3, 5, 10, 25+ milestones)
- 📊 Health score — See your project's stability at a glance
- 🔔 Smart recaps — "Since last time" summaries to stay in context
- 📈 Progress visualization — Track improvement over time
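The streak milestones listed above (3, 5, 10, 25+) suggest a simple trigger rule. This is a guess at the feature's shape, not EvalView's actual code — the `MILESTONES` values come from the bullet list, and the "every 25 after that" rule is an assumption:

```python
MILESTONES = (3, 5, 10, 25)  # milestones named in the feature list above


def milestone_message(streak: int):
    """Return a celebration line when a clean-check streak hits a milestone,
    otherwise None. Illustrative sketch only."""
    if streak in MILESTONES or (streak > 25 and streak % 25 == 0):
        return f"🎯 {streak} clean checks in a row! You're on a roll."
    return None
```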
## 🎨 Multi-Reference Goldens (for non-deterministic agents)

Some agents produce valid variations. Save up to 5 golden variants per test:

```bash
# Save multiple acceptable behaviors
evalview snapshot --variant variant1
evalview snapshot --variant variant2

# EvalView compares against ALL variants, passes if ANY match
evalview check
# ✅ Matched variant 2/3
```

Perfect for LLM-based agents with creative variation.
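The "passes if ANY match" rule is easy to picture as a first-match scan over the saved variants. A minimal sketch, assuming a text-similarity comparison — the `difflib` matcher and the 0.9 threshold are illustrative stand-ins for whatever EvalView actually uses:

```python
import difflib

MATCH_THRESHOLD = 0.9  # hypothetical similarity cutoff


def matches_any(output: str, variants):
    """Return the 1-based index of the first golden variant the output
    matches, or None if no variant matches. Sketch only."""
    for i, golden in enumerate(variants, start=1):
        ratio = difflib.SequenceMatcher(None, output.strip(), golden.strip()).ratio()
        if ratio >= MATCH_THRESHOLD:
            return i
    return None
```

A matched index of 2 out of 3 variants would correspond to the `✅ Matched variant 2/3` line shown above.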
## 🚀 Quick Start

**Install EvalView:**

```bash
pip install evalview
```

**Try the demo (zero setup, no API key):**

```bash
evalview demo
```

**Set up a working example in 2 minutes:**

```bash
evalview quickstart
```

**Want LLM-as-judge scoring too?**

```bash
export OPENAI_API_KEY='your-key'
evalview run
```

**Prefer local/free evaluation?**

```bash
evalview run --judge-provider ollama --judge-model llama3.2
```
## 🆕 New in v0.2.9: Claude Code MCP Server

If you're using Claude Code, this is the biggest upgrade in recent releases:

- Run EvalView checks inline from Claude Code via MCP tools
- Generate tests from natural language (`create_test`)
- Capture baselines and detect regressions without leaving the editor/conversation
👉 Jump to Claude Code Integration (MCP)
## 💡 Why EvalView?
- 🔄 Automatic regression detection — Know instantly when your agent breaks
- 📸 Golden baseline diffing — Save known-good behavior, compare every change
- 🔑 Works without API keys — Deterministic scoring, no LLM-as-judge needed
- 💸 Free & open source — No vendor lock-in, no SaaS pricing
- 🏠 Works offline — Use Ollama for fully local evaluation
| | Observability (LangSmith) | Benchmarks (Braintrust) | EvalView |
|---|:---:|:---:|:---:|
| Answers | "What did my agent do?" | "How good is my agent?" | "Did my agent change?" |
| Detects regressions | ❌ | ⚠️ Manual | ✅ Automatic |
| Golden baseline diffing | ❌ | ❌ | ✅ |
| Works without API keys | ❌ | ❌ | ✅ |
| Free & open source | ❌ | ❌ | ✅ |
| Works offline (Ollama) | ❌ | ⚠️ Some | ✅ |
Use observability tools to see what happened. Use EvalView to prove it didn't break.
## 🧭 Explore & Learn

### 💬 Interactive Chat

Talk to your tests. Debug failures. Compare runs.

```bash
evalview chat
```

```text
You: run the calculator test
🤖 Running calculator test...
   ✅ Passed (score: 92.5)

You: compare to yesterday
🤖 Score: 92.5 → 87.2 (-5.3)
   Tools: +1 added (validator)
   Cost: $0.003 → $0.005 (+67%)
```

Slash commands: `/run`, `/test`, `/compare`, `/traces`, `/skill`, `/adapters`
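The relative deltas in the chat recap (e.g. `$0.003 → $0.005 (+67%)`) are plain percent changes. A one-liner showing the arithmetic — the function name and formatting are illustrative, not EvalView's API:

```python
def pct_change(old: float, new: float) -> str:
    """Format a relative change the way the chat recap shows it, e.g. +67%."""
    return f"{(new - old) / old * 100:+.0f}%"
```

For instance, $0.003 → $0.005 is a 0.002 increase on 0.003, i.e. roughly +67%.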
### 🏋️ EvalView Gym

Practice agent eval patterns with guided exercises.

```bash
evalview gym
```

## ⚡ Supported Agents & Frameworks

| Agent | E2E Testing | Trace Capture |
|-------|:-----------:|:-------------:|
| Claude Code | ✅ | ✅ |
| OpenAI Codex | ✅ | ✅ |
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Custom (any CLI/API) | ✅ | ✅ |
Also works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API
## 🔧 Automate It

### GitHub Actions

```bash
evalview init --ci   # Generates workflow file
```

Or add manually:

```yaml
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/[email protected]
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check          # Use new check command
          fail-on: 'REGRESSION'   # Block PRs on regressions
          json: true              # Structured output for CI
```

Or use the CLI directly:

```yaml
- run: evalview check --fail-on REGRESSION --json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

PRs with regressions get blocked. Add a PR comment showing exactly what changed:

```yaml
- run: evalview ci comment
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

## 🤖 Claude Code Integration (MCP)
Test your agent without leaving the conversation. EvalView runs as an MCP server inside Claude Code — ask "did my refactor break anything?" and get the answer inline.
### Setup (3 steps, one-time)

```bash
# 1. Install
pip install evalview

# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# 3. Make Claude Code proactive (auto-checks after every edit)
cp CLAUDE.md.example CLAUDE.md
```

### What you get

Four tools Claude Code can call on your behalf:
| Tool | What it does |
|------|-------------|
| `create_test` | Generate a test case from natural language — no YAML needed |
| `run_snapshot` | Capture current agent behavior as the golden baseline |
| `run_check` | Detect regressions vs baseline, returns structured JSON diff |
| `list_tests` | Show all golden baselines with scores and timestamps |
### How it works in practice

```text
You: Add a test for my weather agent
Claude: [create_test] ✅ Created tests/weather-lookup.yaml
        [run_snapshot] 📸 Baseline captured — regression detection active.

You: Refactor the weather tool to use async
Claude: [makes code changes]
        [run_check] ✨ All clean! No regressions detected.

You: Switch to a different weather API
Claude: [makes code changes]
        [run_check] ⚠️ TOOLS_CHANGED: weather_api → open_meteo
        Output similarity: 94% — review the diff?
```

No YAML. No terminal switching. No context loss.
### Manual server start (advanced)

```bash
evalview mcp serve                        # Uses tests/ by default
evalview mcp serve --test-path my_tests/  # Custom test directory
```

## 📦 Features
| Feature | Description | Docs |
|---------|-------------|------|
| 📸 Snapshot/Check Workflow | Simple snapshot → check commands for regression detection | → |
| 🤖 Claude Code MCP | Run checks inline in Claude Code — no terminal switching | ↑ |
| 🔥 Streak Tracking | Habit-forming celebrations for consecutive clean checks | → |
| 🎨 Multi-Reference Goldens | Save up to 5 variants per test for non-deterministic agents | → |
| 💬 Chat Mode | AI assistant: /run, /test, /compare | → |
| 🏷️ Tool Categories | Match by intent, not exact tool names | → |
| 📊 Statistical Mode | Handle flaky LLMs with --runs N and pass@k | → |
| 💰 Cost & Latency | Automatic threshold enforcement | → |
| 📈 HTML Reports | Interactive Plotly charts | → |
| 🧪 Test Generation | Generate 1000 tests from 1 | → |
| 🏗️ Suite Types | Separate capability vs regression tests | → |
| 🎯 Difficulty Levels | Filter by --difficulty hard, benchmark by tier | → |
| 🔬 Behavior Coverage | Track tasks, tools, paths tested | → |
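The Statistical Mode row above (`--runs N` and pass@k) refers to the standard unbiased pass@k estimator popularized by the HumanEval paper. A sketch of the arithmetic — whether EvalView computes exactly this variant is an assumption:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled runs
    passes, given c passes observed out of n total runs."""
    if n - c < k:
        # Fewer failures than draws: at least one pass is guaranteed.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with 10 runs of a flaky agent where 5 pass, `pass_at_k(10, 5, 3)` gives the chance a fresh batch of 3 runs contains at least one pass.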
## 🔬 Advanced: Skills Testing
Test that your agent's code actually works — not just that the output looks right. Best for teams maintaining SKILL.md workflows for Claude Code or Codex.
```yaml
tests:
  - name: creates-working-api
    input: "Create an express server with /health endpoint"
    expected:
      files_created: ["index.js", "package.json"]
      build_must_pass:
        - "npm install"
        - "npm run lint"
      smoke_tests:
        - command: "node index.js"
          background: true
          health_check: "http://localhost:3000/health"
          expected_status: 200
          timeout: 10
      no_sudo: true
      git_clean: true
```

```bash
evalview skill test tests.yaml --agent claude-code
evalview skill test tests.yaml --agent codex
evalview skill test tests.yaml --agent langgraph
```

| Check | What it catches |
|-------|-----------------|
| `build_must_pass` | Code that doesn't compile, missing dependencies |
| `smoke_tests` | Runtime crashes, wrong ports, failed health checks |
| `git_clean` | Uncommitted files, dirty working directory |
| `no_sudo` | Privilege escalation attempts |
| `max_tokens` | Cost blowouts, verbose outputs |
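A guard like `no_sudo` can be as simple as scanning the tokenized command for a `sudo` invocation. This is an illustrative guess at the check, not EvalView's actual implementation:

```python
import shlex


def violates_no_sudo(command: str) -> bool:
    """Flag shell commands that invoke sudo anywhere in the line.
    Sketch only; real sandboxing needs more than token matching."""
    return "sudo" in shlex.split(command)
```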
## 📚 Documentation

| | |
|---|---|
| Getting Started | CLI Reference |
| Golden Traces | CI/CD Integration |
| Tool Categories | Statistical Mode |
| Chat Mode | Evaluation Metrics |
| Skills Testing | Debugging |
| FAQ | |
Guides: Testing LangGraph in CI • Detecting Hallucinations
## 📂 Examples

| Framework | Link |
|-----------|------|
| Claude Code (E2E) | examples/agent-test/ |
| LangGraph | examples/langgraph/ |
| CrewAI | examples/crewai/ |
| Anthropic Claude | examples/anthropic/ |
| Dify | examples/dify/ |
| Ollama (Local) | examples/ollama/ |

Node.js? See `@evalview/node`
## 🗺️ Roadmap

**Shipped:** Golden traces • Snapshot/check workflow • Streak tracking & celebrations • Multi-reference goldens • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build & smoke tests • Health checks • Safety guards (no_sudo, git_clean) • Claude Code & Codex adapters • Opus 4.6 cost tracking • MCP servers • HTML reports • Interactive chat mode • EvalView Gym

**Coming:** Agent Teams trace analysis • Multi-turn conversations • Grounded hallucination detection • Error compounding metrics • Container isolation
## 🤝 Get Help & Contributing
- Questions? GitHub Discussions
- Bugs? GitHub Issues
- Want setup help? Email [email protected] — happy to help configure your first tests
- Contributing? See CONTRIBUTING.md
**License:** Apache 2.0
## ⭐ Thank You for the Support!
🌟 Don't miss out on future updates! Star the repo and be the first to know about new features.
EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.
