
evalview

v0.2.5

Published

> You changed a prompt. Swapped a model. Updated a tool.
> Did anything break? **Run EvalView. Know for sure.**

Readme

EvalView — Proof that your agent still works.

You changed a prompt. Swapped a model. Updated a tool. Did anything break? Run EvalView. Know for sure.

```shell
pip install evalview && evalview demo   # No API key needed
```

🔍 What EvalView Catches

| Status | What it means | What you do |
|--------|---------------|-------------|
| ✅ PASSED | Agent behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Agent is calling different tools | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output quality shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |


🤔 How It Works

Simple workflow (recommended):

```shell
# 1. Your agent works correctly
evalview snapshot                 # 📸 Save current behavior as baseline

# 2. You change something (prompt, model, tools)
evalview check                    # 🔍 Detect regressions automatically

# 3. EvalView tells you exactly what changed
#    → ✅ All clean! No regressions detected.
#    → ⚠️ TOOLS_CHANGED: +web_search, -calculator
#    → ❌ REGRESSION: score 85 → 71
```

Advanced workflow (more control):

```shell
evalview run --save-golden        # Save specific result as baseline
evalview run --diff               # Compare with custom options
```

That's it. Deterministic proof, no LLM-as-judge required, no API keys needed.
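The deterministic comparison can be pictured as a small pure function over saved traces: compare the new run's score, tool sequence, and output against the baseline, and classify the result. This is an illustrative sketch of the idea only, not EvalView's actual internals; the field names and the 10-point regression threshold are assumptions for the example.

```python
# Conceptual sketch of deterministic baseline diffing (illustrative,
# not EvalView's real implementation). A "run" here is a dict with
# 'tools' (tool names in call order), 'score' (0-100), and 'output'.

def classify_run(baseline, current, score_drop_threshold=10):
    """Classify a new run against a saved baseline deterministically."""
    # A significant score drop is a regression regardless of tools.
    if current["score"] <= baseline["score"] - score_drop_threshold:
        return "REGRESSION"
    # Same score range, but the agent called different tools.
    if current["tools"] != baseline["tools"]:
        return "TOOLS_CHANGED"
    # Same tools, but the final output shifted.
    if current.get("output") != baseline.get("output"):
        return "OUTPUT_CHANGED"
    return "PASSED"

baseline = {"tools": ["calculator"], "score": 85, "output": "4"}
print(classify_run(baseline, {"tools": ["calculator"], "score": 86, "output": "4"}))
# -> PASSED
print(classify_run(baseline, {"tools": ["web_search"], "score": 84, "output": "4"}))
# -> TOOLS_CHANGED
print(classify_run(baseline, {"tools": ["calculator"], "score": 71, "output": "4"}))
# -> REGRESSION
```

Because every branch is plain equality and arithmetic, the same two traces always classify the same way, which is what makes the check usable as a CI gate without an LLM judge.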

🎯 New: Habit-Forming Regression Detection

EvalView now tracks your progress and celebrates wins:

```shell
evalview check
# 🔍 Comparing against your baseline...
# ✨ All clean! No regressions detected.
# 🎯 5 clean checks in a row! You're on a roll.
```

Features:

  • 🔥 Streak tracking — Celebrate consecutive clean checks (3, 5, 10, 25+ milestones)
  • 📊 Health score — See your project's stability at a glance
  • 🔔 Smart recaps — "Since last time" summaries to stay in context
  • 📈 Progress visualization — Track improvement over time

🎨 Multi-Reference Goldens (for non-deterministic agents)

Some agents produce valid variations. Save up to 5 golden variants per test:

```shell
# Save multiple acceptable behaviors
evalview snapshot --variant variant1
evalview snapshot --variant variant2

# EvalView compares against ALL variants, passes if ANY match
evalview check
# ✅ Matched variant 2/3
```

Perfect for LLM-based agents with creative variation.
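The pass-if-any-match rule is simple to state in code. The sketch below uses exact tool-sequence equality as the match criterion, which is an assumption made for illustration; it is not EvalView's actual matching logic.

```python
# Illustrative any-variant matching: a run passes if its tool
# sequence matches ANY saved golden variant. Exact-sequence
# equality here is an assumption for the example.

def check_against_variants(run_tools, variants):
    """Return (passed, matched_index) for a run vs saved variants."""
    for i, variant in enumerate(variants, start=1):
        if run_tools == variant:
            return True, i  # first matching variant wins
    return False, None

variants = [
    ["search", "summarize"],
    ["summarize"],
    ["search", "rank", "summarize"],
]
print(check_against_variants(["summarize"], variants))
# -> (True, 2)
```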


🚀 Quick Start

  1. Install EvalView

    pip install evalview
  2. Try the demo (zero setup, no API key)

    evalview demo
  3. Set up a working example in 2 minutes

    evalview quickstart
  4. Want LLM-as-judge scoring too?

    export OPENAI_API_KEY='your-key'
    evalview run
  5. Prefer local/free evaluation?

    evalview run --judge-provider ollama --judge-model llama3.2

Full getting started guide →


🆕 New in v0.2.9: Claude Code MCP Server

If you're using Claude Code, this is the biggest upgrade in recent releases:

  • Run EvalView checks inline from Claude Code via MCP tools
  • Generate tests from natural language (create_test)
  • Capture baselines and detect regressions without leaving the editor/conversation

👉 Jump to Claude Code Integration (MCP)


💡 Why EvalView?

  • 🔄 Automatic regression detection — Know instantly when your agent breaks
  • 📸 Golden baseline diffing — Save known-good behavior, compare every change
  • 🔑 Works without API keys — Deterministic scoring, no LLM-as-judge needed
  • 💸 Free & open source — No vendor lock-in, no SaaS pricing
  • 🏠 Works offline — Use Ollama for fully local evaluation

| | Observability (LangSmith) | Benchmarks (Braintrust) | EvalView |
|---|:---:|:---:|:---:|
| Answers | "What did my agent do?" | "How good is my agent?" | "Did my agent change?" |
| Detects regressions | ❌ | ⚠️ Manual | ✅ Automatic |
| Golden baseline diffing | ❌ | ❌ | ✅ |
| Works without API keys | ❌ | ❌ | ✅ |
| Free & open source | ❌ | ❌ | ✅ |
| Works offline (Ollama) | ❌ | ⚠️ Some | ✅ |

Use observability tools to see what happened. Use EvalView to prove it didn't break.


🧭 Explore & Learn

💬 Interactive Chat

Talk to your tests. Debug failures. Compare runs.

```text
evalview chat
You: run the calculator test
🤖 Running calculator test...
✅ Passed (score: 92.5)

You: compare to yesterday
🤖 Score: 92.5 → 87.2 (-5.3)
   Tools: +1 added (validator)
   Cost: $0.003 → $0.005 (+67%)
```

Slash commands: /run, /test, /compare, /traces, /skill, /adapters

Chat mode docs →

🏋️ EvalView Gym

Practice agent eval patterns with guided exercises.

```shell
evalview gym
```

⚡ Supported Agents & Frameworks

| Agent | E2E Testing | Trace Capture |
|-------|:-----------:|:-------------:|
| Claude Code | ✅ | ✅ |
| OpenAI Codex | ✅ | ✅ |
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Custom (any CLI/API) | ✅ | ✅ |

Also works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API
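For the custom CLI/API row, an adapter ultimately boils down to invoking the agent and capturing its output so it can be compared against a baseline. A minimal sketch in Python, where the agent command is a placeholder you would replace with your own invocation (this is not EvalView's adapter API):

```python
import subprocess

def run_cli_agent(command, prompt):
    """Invoke a CLI-based agent and capture its output for evaluation.

    `command` is the agent invocation as an argv list, e.g. a
    hypothetical ["my-agent", "--json"]; the prompt is appended as the
    final argument. Illustrative only.
    """
    result = subprocess.run(
        command + [prompt],
        capture_output=True,  # grab stdout/stderr instead of printing
        text=True,            # decode bytes to str
        timeout=120,          # don't hang forever on a stuck agent
    )
    return {"output": result.stdout.strip(), "exit_code": result.returncode}

# Example with a stand-in "agent" (echo) so the sketch is runnable:
trace = run_cli_agent(["echo"], "What is 2+2?")
print(trace["exit_code"])
# -> 0
```

The returned dict is the kind of trace a snapshot/check workflow would persist and diff on later runs.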

Compatibility details →


🔧 Automate It

GitHub Actions

```shell
evalview init --ci    # Generates workflow file
```

Or add manually:

```yaml
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/[email protected]
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check                   # Use new check command
          fail-on: 'REGRESSION'            # Block PRs on regressions
          json: true                       # Structured output for CI
```

Or use the CLI directly:

```yaml
      - run: evalview check --fail-on REGRESSION --json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

PRs with regressions get blocked. Add a PR comment showing exactly what changed:

```yaml
      - run: evalview ci comment
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

Full CI/CD setup →


🤖 Claude Code Integration (MCP)

Test your agent without leaving the conversation. EvalView runs as an MCP server inside Claude Code — ask "did my refactor break anything?" and get the answer inline.

Setup (3 steps, one-time)

```shell
# 1. Install
pip install evalview

# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# 3. Make Claude Code proactive (auto-checks after every edit)
cp CLAUDE.md.example CLAUDE.md
```

What you get

4 tools Claude Code can call on your behalf:

| Tool | What it does |
|------|--------------|
| create_test | Generate a test case from natural language — no YAML needed |
| run_snapshot | Capture current agent behavior as the golden baseline |
| run_check | Detect regressions vs baseline, returns structured JSON diff |
| list_tests | Show all golden baselines with scores and timestamps |

How it works in practice

```text
You: Add a test for my weather agent
Claude: [create_test] ✅ Created tests/weather-lookup.yaml
        [run_snapshot] 📸 Baseline captured — regression detection active.

You: Refactor the weather tool to use async
Claude: [makes code changes]
        [run_check] ✨ All clean! No regressions detected.

You: Switch to a different weather API
Claude: [makes code changes]
        [run_check] ⚠️ TOOLS_CHANGED: weather_api → open_meteo
                   Output similarity: 94% — review the diff?
```

No YAML. No terminal switching. No context loss.

Manual server start (advanced)

```shell
evalview mcp serve                        # Uses tests/ by default
evalview mcp serve --test-path my_tests/  # Custom test directory
```

📦 Features

| Feature | Description | Docs |
|---------|-------------|------|
| 📸 Snapshot/Check Workflow | Simple snapshot → check commands for regression detection | |
| 🤖 Claude Code MCP | Run checks inline in Claude Code — no terminal switching | |
| 🔥 Streak Tracking | Habit-forming celebrations for consecutive clean checks | |
| 🎨 Multi-Reference Goldens | Save up to 5 variants per test for non-deterministic agents | |
| 💬 Chat Mode | AI assistant: /run, /test, /compare | |
| 🏷️ Tool Categories | Match by intent, not exact tool names | |
| 📊 Statistical Mode | Handle flaky LLMs with --runs N and pass@k | |
| 💰 Cost & Latency | Automatic threshold enforcement | |
| 📈 HTML Reports | Interactive Plotly charts | |
| 🧪 Test Generation | Generate 1000 tests from 1 | |
| 🏗️ Suite Types | Separate capability vs regression tests | |
| 🎯 Difficulty Levels | Filter by --difficulty hard, benchmark by tier | |
| 🔬 Behavior Coverage | Track tasks, tools, paths tested | |


🔬 Advanced: Skills Testing

Test that your agent's code actually works — not just that the output looks right. Best for teams maintaining SKILL.md workflows for Claude Code or Codex.

```yaml
tests:
  - name: creates-working-api
    input: "Create an express server with /health endpoint"
    expected:
      files_created: ["index.js", "package.json"]
      build_must_pass:
        - "npm install"
        - "npm run lint"
      smoke_tests:
        - command: "node index.js"
          background: true
          health_check: "http://localhost:3000/health"
          expected_status: 200
          timeout: 10
      no_sudo: true
      git_clean: true
```

```shell
evalview skill test tests.yaml --agent claude-code
evalview skill test tests.yaml --agent codex
evalview skill test tests.yaml --agent langgraph
```

| Check | What it catches |
|-------|-----------------|
| build_must_pass | Code that doesn't compile, missing dependencies |
| smoke_tests | Runtime crashes, wrong ports, failed health checks |
| git_clean | Uncommitted files, dirty working directory |
| no_sudo | Privilege escalation attempts |
| max_tokens | Cost blowouts, verbose outputs |
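The smoke_tests check (start the server in the background, poll its health endpoint until it answers, then shut it down) can be sketched in plain Python. This mirrors the semantics of the config above, not EvalView's implementation:

```python
import subprocess
import time
import urllib.request

def smoke_test(command, health_url, timeout=10):
    """Start a process, poll health_url until it returns HTTP 200 or
    the timeout expires, then terminate the process. Returns True on a
    successful health check. Illustrative sketch only."""
    proc = subprocess.Popen(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.time() + timeout
    try:
        while time.time() < deadline:
            try:
                with urllib.request.urlopen(health_url, timeout=1) as resp:
                    if resp.status == 200:
                        return True
            except OSError:
                # Server not up yet (connection refused, etc.); retry.
                time.sleep(0.5)
        return False  # never became healthy within the timeout
    finally:
        proc.terminate()
        proc.wait(timeout=5)
```

A real harness would also capture the server's logs on failure and enforce the configured expected_status rather than hard-coding 200.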

Skills testing docs →


📚 Documentation

| | |
|---|---|
| Getting Started | CLI Reference |
| Golden Traces | CI/CD Integration |
| Tool Categories | Statistical Mode |
| Chat Mode | Evaluation Metrics |
| Skills Testing | Debugging |
| FAQ | |

Guides: Testing LangGraph in CI • Detecting Hallucinations


📂 Examples

| Framework | Link |
|-----------|------|
| Claude Code (E2E) | examples/agent-test/ |
| LangGraph | examples/langgraph/ |
| CrewAI | examples/crewai/ |
| Anthropic Claude | examples/anthropic/ |
| Dify | examples/dify/ |
| Ollama (Local) | examples/ollama/ |

Node.js? See @evalview/node


🗺️ Roadmap

Shipped: Golden traces • Snapshot/check workflow • Streak tracking & celebrations • Multi-reference goldens • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build & smoke tests • Health checks • Safety guards (no_sudo, git_clean) • Claude Code & Codex adapters • Opus 4.6 cost tracking • MCP servers • HTML reports • Interactive chat mode • EvalView Gym

Coming: Agent Teams trace analysis • Multi-turn conversations • Grounded hallucination detection • Error compounding metrics • Container isolation

Vote on features →


🤝 Get Help & Contributing

License: Apache 2.0


⭐ Thank You for the Support!

Star History Chart

🌟 Don't miss out on future updates! Star the repo and be the first to know about new features.



EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.