whale-eval
v1.0.3
Enterprise-grade eval system for Whale Code agent quality
Standalone project — zero runtime dependency on Whale Code. Uses the Anthropic SDK directly to run agent trials, grade outputs, and track quality over time.
Quick Start
# Install
npm install
# Validate all task definitions
npx tsx bin/whale-eval.js run --dry-run
# Run regression suite
npx tsx bin/whale-eval.js run regression
# Run a single task
npx tsx bin/whale-eval.js run regression/compaction-loop
# Run capability suite with 5 trials per task
npx tsx bin/whale-eval.js run capability --trials 5
# List all suites and tasks
npx tsx bin/whale-eval.js list
# Export results as JSON
npx tsx bin/whale-eval.js run regression --output json --output-file results.json

Environment Variables
# Required
ANTHROPIC_API_KEY=sk-ant-... # Anthropic API key for agent trials + LLM graders
# Optional — Supabase persistence
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_SERVICE_KEY=eyJ...
# Optional — override defaults
EVAL_MODEL=claude-sonnet-4-6 # Default model for trials
EVAL_MAX_TURNS=25 # Default max agent turns
EVAL_TIMEOUT_MS=300000 # Default trial timeout (5 min)

Architecture
WhaleEval/
├── bin/whale-eval.js # CLI entry point
├── src/
│ ├── types.ts # All interfaces
│ ├── task-loader.ts # YAML → TaskDefinition
│ ├── runner.ts # Orchestrator: suite → tasks → trials
│ ├── trial-executor.ts # Single trial: env setup → agent loop → grading
│ ├── events.ts # Lightweight eval event emitter
│ ├── transcript-recorder.ts # Event → structured transcript + metrics
│ ├── graders/
│ │ ├── index.ts # Registry + factory
│ │ ├── code-grader.ts # TestRunner, FileState, OutputRegex
│ │ ├── llm-grader.ts # LLMRubric, LLMAssertion (Haiku-as-judge)
│ │ ├── tool-call-grader.ts
│ │ └── composite-grader.ts
│ ├── metrics/
│ │ ├── pass-at-k.ts # pass@k, pass^k (Chen et al. 2021)
│ │ ├── aggregator.ts # Suite/task metric rollup
│ │ └── cost-tracker.ts # Per-trial token/cost estimation
│ ├── storage/
│ │ └── supabase-store.ts # Eval persistence
│ └── reporters/
│ ├── console-reporter.ts
│ ├── json-reporter.ts
│ └── github-reporter.ts # PR comment formatting
├── evals/
│ ├── suites/
│ │ ├── regression/ # Must stay 100% — derived from production bugs
│ │ └── capability/ # Frontier abilities — baseline tracking
│ └── fixtures/ # Isolated test environments per task
├── __tests__/ # 46 unit tests
└── migrations/ # Supabase schema

How It Works
Trial Execution
Each trial runs the agent in an isolated temp directory using the Anthropic SDK directly:
- mkdtemp() → isolated workspace
- Copy fixture files → workspace
- Run setup commands
- Agent loop: client.messages.create() → tool execution → repeat until end_turn
- Grade output with code-based + LLM-as-judge graders
- Cleanup workspace
The agent gets 6 tools: read_file, write_file, edit_file, list_directory, run_command, search_content. All paths are sandboxed to the trial workspace.
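The sandboxing described above can be sketched as a path-resolution guard applied before every tool call touches the filesystem. This is an illustrative sketch, not whale-eval's actual implementation; `resolveSandboxed` is a hypothetical name:

```typescript
import * as path from "node:path";

// Resolve a tool-requested path and refuse anything that escapes the
// trial workspace (hypothetical helper; the real code may differ).
function resolveSandboxed(workspace: string, requested: string): string {
  const root = path.resolve(workspace);
  const resolved = path.resolve(root, requested);
  // Reject escapes such as "../../etc/passwd" while allowing the root itself.
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`Path escapes trial workspace: ${requested}`);
  }
  return resolved;
}
```

The prefix check runs on the fully resolved path, so `..` segments and absolute paths are both caught by the same rule.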
Grader Hierarchy
Following Anthropic's recommended hierarchy:
| Tier | Grader | Speed | Use When |
|------|--------|-------|----------|
| 1 | TestRunner | Fast | Exit code check (npm test, pytest) |
| 1 | FileState | Fast | File exists/contains/matches assertions |
| 1 | OutputRegex | Fast | Agent output pattern matching |
| 1 | ToolCalls | Fast | Tool usage pattern verification |
| 2 | LLMRubric | Slow | Open-ended quality scoring (0-100) |
| 2 | LLMAssertion | Slow | Binary yes/no semantic checks |
| — | Composite | — | Combine graders: weighted or all-must-pass |
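The Composite row combines sub-grader results in one of two modes. A minimal sketch of that fold; the `GraderResult` shape and `combine` function are illustrative assumptions, not whale-eval's actual types:

```typescript
// Hypothetical result shape for a single grader (assumed, not the real API).
interface GraderResult {
  passed: boolean;
  score: number; // normalized to 0..1
}

// Combine sub-grader results either as a weighted average or as
// all-must-pass, mirroring the Composite row in the table above.
function combine(
  results: GraderResult[],
  mode: "weighted" | "all_must_pass",
  weights?: number[],
  passThreshold = 0.5,
): GraderResult {
  if (mode === "all_must_pass") {
    const passed = results.every((r) => r.passed);
    return { passed, score: passed ? 1 : 0 };
  }
  const w = weights ?? results.map(() => 1);
  const total = w.reduce((a, b) => a + b, 0);
  const score = results.reduce((acc, r, i) => acc + r.score * w[i], 0) / total;
  return { passed: score >= passThreshold, score };
}
```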
Metrics
- pass@k — probability at least 1 of k trials passes (capability ceiling)
- pass^k — probability all k trials pass (reliability measure)
- Both use the unbiased estimator 1 - C(n-c, k) / C(n, k) per Chen et al. 2021, where n is the number of trials and c the number that passed
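The estimator is usually computed as a running product rather than with raw binomials, which avoids large factorials. A sketch under that standard formulation (the function name is assumed):

```typescript
// pass@k per Chen et al. 2021: 1 - C(n-c, k) / C(n, k), where n is the
// number of trials run and c the number that passed. The binomial ratio
// telescopes into a product of c terms, so no factorials are needed.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // fewer than k failures: every k-sample passes
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= (i - k) / i;
  }
  return 1 - prod;
}
```

For example, 5 passes out of 10 trials gives pass@1 = 0.5, matching the raw pass rate, while c = 0 yields pass@k = 0 because the loop body never runs.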
Suite Types
| Type | Trials | Threshold | Purpose |
|------|--------|-----------|---------|
| Regression | 1 | 100% | From production bugs — must never regress |
| Capability | 3-5 | Varies | Frontier abilities — track improvement over time |
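The Threshold column amounts to a simple gate over the suite's aggregated results; a hypothetical sketch (not whale-eval's actual code):

```typescript
// Gate a suite run: regression suites pass only at threshold 1.0
// (every task passes); capability suites use a lower, varying threshold.
function suiteGate(passed: number, total: number, threshold: number): boolean {
  return total > 0 && passed / total >= threshold;
}
```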
Task YAML Format
Suite Definition
suite:
  name: "regression"
  description: "Must-pass regression tests from production bugs"
  eval_type: "regression"
  config:
    model: "claude-sonnet-4-6"
    trials_per_task: 1
    pass_threshold: 1.0
    max_parallel_trials: 4
    timeout_ms: 300000
    max_turns: 25

Task Definition
task:
  id: "compaction-loop"
  description: "Agent handles context compaction without infinite loop"
  prompt: |
    Read these 5 large JSON files and summarize the patterns.
  graders:
    - type: output_regex
      patterns:
        - pattern: "analysis|summary"
          match: true
        - pattern: "error|stuck"
          match: false
    - type: tool_calls
      assertions:
        - tool: "read_file"
          min_calls: 3
    - type: llm_rubric
      model: "haiku"
      rubric: "Did the agent produce a coherent summary?"
      pass_threshold: 0.6
  bidirectional:
    should_not:
      - "Agent should NOT enter an infinite loop"
  fixture_path: "fixtures/large-json-files"
  timeout_ms: 180000
  tags: ["compaction", "context-management"]
  added_from: "production-bug-2026-02-15"

CI/CD
The GitHub Actions workflow (.github/workflows/eval.yml) runs:
| Trigger | Suite | Gate |
|---------|-------|------|
| Nightly (6 AM UTC) | regression | Fail if < 100% pass rate |
| Manual dispatch | configurable | Advisory |
Results appear in GitHub Step Summary with per-task pass/fail tables.
Development
# Run tests
npm test
# Type check
npx tsc --noEmit
# Add a new regression task
# 1. Create evals/suites/regression/my-bug.yaml
# 2. Create fixture in evals/fixtures/my-bug/
# 3. Validate: npx tsx bin/whale-eval.js run --dry-run
# 4. Test: npx tsx bin/whale-eval.js run regression/my-bug

Design Principles
- Grade outcomes, not paths — graders check final state, not how the agent got there
- Bidirectional testing — verify both what should happen AND what should not
- Code graders first — deterministic, fast, trustworthy. LLM judges only when needed
- Isolated environments — every trial gets a fresh temp directory
- Zero coupling — no imports from Whale Code. Uses Anthropic SDK directly
- Reference solutions — every task must be provably solvable
