claude-eval

v1.1.6

Published

4 months ago

Simple evaluation tool for Claude Code. PASS/FAIL testing with LLM-as-a-judge simplified approach.

0High
0Medium
0Low

maelcaldas

claude ai evaluation llm judge anthropic cli testing

claude-eval

An evaluation tool for Claude Code, using a LLM-as-a-judge simplified approach.

When changing Claude Code contexts and models:

How can you know if the change work as expected?
How can you know if the change will NOT break something else?

This tool solves those problems by enabling Eval-driven development for Claude Code.

Claude Code Eval Demo

No complex scoring or ranking — just clear PASSED ✅ / FAILED ❌ results for your evaluation criteria.

It's like TDD for AI.

Prerequisites

Node.js 18+ or Bun
Claude Code installed and configured in your project

Installation

Global Installation (Recommended)

For regular use, install claude-eval globally:

# Using npm
npm install -g claude-eval

# Using bun
bun install -g claude-eval

After global installation, you can use the claude-eval command directly and access the update functionality:

claude-eval --version
claude-eval update

One-time Usage

If you prefer not to install globally, you can run evaluations directly with npx:

npx claude-eval evals/*.yaml

Usage

# Single evaluation
claude-eval evals/say-dont-know-clear-way.yaml

# Multiple evaluations (batch)
claude-eval evals/*.yaml

# Custom concurrency
claude-eval evals/*.yaml --concurrency=3

# Check for updates
claude-eval update

# Show help
claude-eval --help

Evaluation File Format

Evaluation files are YAML documents with the following structure:


prompt: >
  What is the weather for today?

expected_behavior:
  - Just say you don't know in a clear way.
  - Don't give user alternatives.
  - Don't recommend user to research for the answer elsewhere.

prompt: The prompt you would send to Claude Code
expected_behavior: Array of criteria that the response should meet

How It Works

Parse YAML: Loads and validates the evaluation specification
Query Claude: Executes the prompt on Sonnet model, on plan mode
Judge Response: Evaluate the response with Haiku model
Format Results: Displays results with ✅/❌ indicators and summary

Contributing

We welcome contributions! Please:

Open an issue to discuss major changes before starting
Follow existing code style and patterns in the codebase
Add tests for new features and bug fixes
Update documentation as needed
Keep it simple - this tool is intentionally minimal and focused

For bug reports and feature requests, please use the GitHub Issues page.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

claude-eval

Prerequisites

Installation

Global Installation (Recommended)

One-time Usage

Usage

Evaluation File Format

How It Works

Contributing