whale-eval
v1.0.3
Enterprise-grade eval system for Whale Code agent quality
Standalone project — zero runtime dependency on Whale Code. Uses the Anthropic SDK directly to run agent trials, grade outputs, and track quality over time.
Quick Start
# Install
npm install
# Validate all task definitions
npx tsx bin/whale-eval.js run --dry-run
# Run regression suite
npx tsx bin/whale-eval.js run regression
# Run a single task
npx tsx bin/whale-eval.js run regression/compaction-loop
# Run capability suite with 5 trials per task
npx tsx bin/whale-eval.js run capability --trials 5
# List all suites and tasks
npx tsx bin/whale-eval.js list
# Export results as JSON
npx tsx bin/whale-eval.js run regression --output json --output-file results.json

Environment Variables
# Required
ANTHROPIC_API_KEY=sk-ant-... # Anthropic API key for agent trials + LLM graders
# Optional — Supabase persistence
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_SERVICE_KEY=eyJ...
# Optional — override defaults
EVAL_MODEL=claude-sonnet-4-6 # Default model for trials
EVAL_MAX_TURNS=25 # Default max agent turns
EVAL_TIMEOUT_MS=300000 # Default trial timeout (5 min)

Architecture
WhaleEval/
├── bin/whale-eval.js # CLI entry point
├── src/
│ ├── types.ts # All interfaces
│ ├── task-loader.ts # YAML → TaskDefinition
│ ├── runner.ts # Orchestrator: suite → tasks → trials
│ ├── trial-executor.ts # Single trial: env setup → agent loop → grading
│ ├── events.ts # Lightweight eval event emitter
│ ├── transcript-recorder.ts # Event → structured transcript + metrics
│ ├── graders/
│ │ ├── index.ts # Registry + factory
│ │ ├── code-grader.ts # TestRunner, FileState, OutputRegex
│ │ ├── llm-grader.ts # LLMRubric, LLMAssertion (Haiku-as-judge)
│ │ ├── tool-call-grader.ts
│ │ └── composite-grader.ts
│ ├── metrics/
│ │ ├── pass-at-k.ts # pass@k, pass^k (Chen et al. 2021)
│ │ ├── aggregator.ts # Suite/task metric rollup
│ │ └── cost-tracker.ts # Per-trial token/cost estimation
│ ├── storage/
│ │ └── supabase-store.ts # Eval persistence
│ └── reporters/
│ ├── console-reporter.ts
│ ├── json-reporter.ts
│ └── github-reporter.ts # PR comment formatting
├── evals/
│ ├── suites/
│ │ ├── regression/ # Must stay 100% — derived from production bugs
│ │ └── capability/ # Frontier abilities — baseline tracking
│ └── fixtures/ # Isolated test environments per task
├── __tests__/ # 46 unit tests
└── migrations/ # Supabase schema

How It Works
Trial Execution
Each trial runs the agent in an isolated temp directory using the Anthropic SDK directly:
- mkdtemp() → isolated workspace
- Copy fixture files → workspace
- Run setup commands
- Agent loop: client.messages.create() → tool execution → repeat until end_turn
- Grade output with code-based + LLM-as-judge graders
- Cleanup workspace
The agent gets 6 tools: read_file, write_file, edit_file, list_directory, run_command, search_content. All paths are sandboxed to the trial workspace.
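The sandboxing described above can be sketched as a path-resolution guard applied before every tool call touches the filesystem. This is an illustrative sketch, not whale-eval's actual implementation; `resolveSandboxed` is a hypothetical name:

```typescript
import * as path from "node:path";

// Resolve a tool-requested path and refuse anything that escapes the
// trial workspace (hypothetical helper; the real code may differ).
function resolveSandboxed(workspace: string, requested: string): string {
  const root = path.resolve(workspace);
  const resolved = path.resolve(root, requested);
  // Reject escapes such as "../../etc/passwd" while allowing the root itself.
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`Path escapes trial workspace: ${requested}`);
  }
  return resolved;
}
```

The prefix check runs on the fully resolved path, so `..` segments and absolute paths are both caught by the same rule.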
Grader Hierarchy
Following Anthropic's recommended hierarchy:
| Tier | Grader | Speed | Use When |
|------|--------|-------|----------|
| 1 | TestRunner | Fast | Exit code check (npm test, pytest) |
| 1 | FileState | Fast | File exists/contains/matches assertions |
| 1 | OutputRegex | Fast | Agent output pattern matching |
| 1 | ToolCalls | Fast | Tool usage pattern verification |
| 2 | LLMRubric | Slow | Open-ended quality scoring (0-100) |
| 2 | LLMAssertion | Slow | Binary yes/no semantic checks |
| — | Composite | — | Combine graders: weighted or all-must-pass |
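The Composite row combines sub-grader results in one of two modes. A minimal sketch of that fold; the `GraderResult` shape and `combine` function are illustrative assumptions, not whale-eval's actual types:

```typescript
// Hypothetical result shape for a single grader (assumed, not the real API).
interface GraderResult {
  passed: boolean;
  score: number; // normalized to 0..1
}

// Combine sub-grader results either as a weighted average or as
// all-must-pass, mirroring the Composite row in the table above.
function combine(
  results: GraderResult[],
  mode: "weighted" | "all_must_pass",
  weights?: number[],
  passThreshold = 0.5,
): GraderResult {
  if (mode === "all_must_pass") {
    const passed = results.every((r) => r.passed);
    return { passed, score: passed ? 1 : 0 };
  }
  const w = weights ?? results.map(() => 1);
  const total = w.reduce((a, b) => a + b, 0);
  const score = results.reduce((acc, r, i) => acc + r.score * w[i], 0) / total;
  return { passed: score >= passThreshold, score };
}
```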
Metrics
- pass@k — probability at least 1 of k trials passes (capability ceiling)
- pass^k — probability all k trials pass (reliability measure)
- Both use the unbiased estimator 1 - C(n-c, k) / C(n, k) per Chen et al. 2021, where n is the number of trials and c the number that passed
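The estimator is usually computed as a running product rather than with raw binomials, which avoids large factorials. A sketch under that standard formulation (the function name is assumed):

```typescript
// pass@k per Chen et al. 2021: 1 - C(n-c, k) / C(n, k), where n is the
// number of trials run and c the number that passed. The binomial ratio
// telescopes into a product of c terms, so no factorials are needed.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // fewer than k failures: every k-sample passes
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= (i - k) / i;
  }
  return 1 - prod;
}
```

For example, 5 passes out of 10 trials gives pass@1 = 0.5, matching the raw pass rate, while c = 0 yields pass@k = 0 because the loop body never runs.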
Suite Types
| Type | Trials | Threshold | Purpose |
|------|--------|-----------|---------|
| Regression | 1 | 100% | From production bugs — must never regress |
| Capability | 3-5 | Varies | Frontier abilities — track improvement over time |
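The Threshold column amounts to a simple gate over the suite's aggregated results; a hypothetical sketch (not whale-eval's actual code):

```typescript
// Gate a suite run: regression suites pass only at threshold 1.0
// (every task passes); capability suites use a lower, varying threshold.
function suiteGate(passed: number, total: number, threshold: number): boolean {
  return total > 0 && passed / total >= threshold;
}
```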
Task YAML Format
Suite Definition
suite:
  name: "regression"
  description: "Must-pass regression tests from production bugs"
  eval_type: "regression"
  config:
    model: "claude-sonnet-4-6"
    trials_per_task: 1
    pass_threshold: 1.0
    max_parallel_trials: 4
    timeout_ms: 300000
    max_turns: 25

Task Definition
task:
  id: "compaction-loop"
  description: "Agent handles context compaction without infinite loop"
  prompt: |
    Read these 5 large JSON files and summarize the patterns.
  graders:
    - type: output_regex
      patterns:
        - pattern: "analysis|summary"
          match: true
        - pattern: "error|stuck"
          match: false
    - type: tool_calls
      assertions:
        - tool: "read_file"
          min_calls: 3
    - type: llm_rubric
      model: "haiku"
      rubric: "Did the agent produce a coherent summary?"
      pass_threshold: 0.6
  bidirectional:
    should_not:
      - "Agent should NOT enter an infinite loop"
  fixture_path: "fixtures/large-json-files"
  timeout_ms: 180000
  tags: ["compaction", "context-management"]
  added_from: "production-bug-2026-02-15"

CI/CD
The GitHub Actions workflow (.github/workflows/eval.yml) runs:
| Trigger | Suite | Gate |
|---------|-------|------|
| Nightly (6 AM UTC) | regression | Fail if < 100% pass rate |
| Manual dispatch | configurable | Advisory |
Results appear in GitHub Step Summary with per-task pass/fail tables.
Development
# Run tests
npm test
# Type check
npx tsc --noEmit
# Add a new regression task
# 1. Create evals/suites/regression/my-bug.yaml
# 2. Create fixture in evals/fixtures/my-bug/
# 3. Validate: npx tsx bin/whale-eval.js run --dry-run
# 4. Test: npx tsx bin/whale-eval.js run regression/my-bug

Design Principles
- Grade outcomes, not paths — graders check final state, not how the agent got there
- Bidirectional testing — verify both what should happen AND what should not
- Code graders first — deterministic, fast, trustworthy. LLM judges only when needed
- Isolated environments — every trial gets a fresh temp directory
- Zero coupling — no imports from Whale Code. Uses Anthropic SDK directly
- Reference solutions — every task must be provably solvable
