whale-eval

v1.0.3

Enterprise-grade eval system for Whale Code agent quality

Standalone project — zero runtime dependency on Whale Code. Uses the Anthropic SDK directly to run agent trials, grade outputs, and track quality over time.

Quick Start

# Install
npm install

# Validate all task definitions
npx tsx bin/whale-eval.js run --dry-run

# Run regression suite
npx tsx bin/whale-eval.js run regression

# Run a single task
npx tsx bin/whale-eval.js run regression/compaction-loop

# Run capability suite with 5 trials per task
npx tsx bin/whale-eval.js run capability --trials 5

# List all suites and tasks
npx tsx bin/whale-eval.js list

# Export results as JSON
npx tsx bin/whale-eval.js run regression --output json --output-file results.json

Environment Variables

# Required
ANTHROPIC_API_KEY=sk-ant-...     # Anthropic API key for agent trials + LLM graders

# Optional — Supabase persistence
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_SERVICE_KEY=eyJ...

# Optional — override defaults
EVAL_MODEL=claude-sonnet-4-6    # Default model for trials
EVAL_MAX_TURNS=25               # Default max agent turns
EVAL_TIMEOUT_MS=300000          # Default trial timeout (5 min)
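Resolving these variables with their documented defaults might look like the following sketch (`resolveConfig` is an illustrative name, not part of the whale-eval API):

```typescript
interface EvalConfig {
  model: string;
  maxTurns: number;
  timeoutMs: number;
}

// Resolve eval settings from an env-like record, falling back to the
// documented defaults. Pass process.env in real use.
function resolveConfig(env: Record<string, string | undefined>): EvalConfig {
  return {
    model: env.EVAL_MODEL ?? "claude-sonnet-4-6",
    maxTurns: Number(env.EVAL_MAX_TURNS ?? 25),
    timeoutMs: Number(env.EVAL_TIMEOUT_MS ?? 300_000),
  };
}
```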

Architecture

WhaleEval/
├── bin/whale-eval.js          # CLI entry point
├── src/
│   ├── types.ts               # All interfaces
│   ├── task-loader.ts         # YAML → TaskDefinition
│   ├── runner.ts              # Orchestrator: suite → tasks → trials
│   ├── trial-executor.ts      # Single trial: env setup → agent loop → grading
│   ├── events.ts              # Lightweight eval event emitter
│   ├── transcript-recorder.ts # Event → structured transcript + metrics
│   ├── graders/
│   │   ├── index.ts           # Registry + factory
│   │   ├── code-grader.ts     # TestRunner, FileState, OutputRegex
│   │   ├── llm-grader.ts      # LLMRubric, LLMAssertion (Haiku-as-judge)
│   │   ├── tool-call-grader.ts
│   │   └── composite-grader.ts
│   ├── metrics/
│   │   ├── pass-at-k.ts       # pass@k, pass^k (Chen et al. 2021)
│   │   ├── aggregator.ts      # Suite/task metric rollup
│   │   └── cost-tracker.ts    # Per-trial token/cost estimation
│   ├── storage/
│   │   └── supabase-store.ts  # Eval persistence
│   └── reporters/
│       ├── console-reporter.ts
│       ├── json-reporter.ts
│       └── github-reporter.ts # PR comment formatting
├── evals/
│   ├── suites/
│   │   ├── regression/        # Must stay 100% — derived from production bugs
│   │   └── capability/        # Frontier abilities — baseline tracking
│   └── fixtures/              # Isolated test environments per task
├── __tests__/                 # 46 unit tests
└── migrations/                # Supabase schema

How It Works

Trial Execution

Each trial runs the agent in an isolated temp directory using the Anthropic SDK directly:

  1. mkdtemp() → isolated workspace
  2. Copy fixture files → workspace
  3. Run setup commands
  4. Agent loop: client.messages.create() → tool execution → repeat until end_turn
  5. Grade output with code-based + LLM-as-judge graders
  6. Cleanup workspace
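Steps 1, 2, and 6 can be sketched with Node's `fs` APIs (a simplified illustration; the real trial-executor also runs setup commands and the agent loop inside the callback, and `withWorkspace` is an illustrative name):

```typescript
import { mkdtemp, cp, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Create an isolated workspace, seed it from the fixture, run the trial,
// and always clean up afterward.
async function withWorkspace<T>(
  fixturePath: string,
  run: (workspace: string) => Promise<T>,
): Promise<T> {
  const workspace = await mkdtemp(join(tmpdir(), "whale-eval-")); // step 1
  try {
    await cp(fixturePath, workspace, { recursive: true });        // step 2
    return await run(workspace);                                  // steps 3-5
  } finally {
    await rm(workspace, { recursive: true, force: true });        // step 6
  }
}
```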

The agent gets 6 tools: read_file, write_file, edit_file, list_directory, run_command, search_content. All paths are sandboxed to the trial workspace.
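The sandboxing can be pictured as a resolve-and-prefix check before any tool touches the filesystem (an illustrative sketch, not the actual implementation):

```typescript
import { resolve, sep } from "node:path";

// Reject any tool path that resolves outside the trial workspace,
// including "../" traversal and absolute paths.
function sandboxPath(workspace: string, requested: string): string {
  const root = resolve(workspace);
  const full = resolve(root, requested);
  if (full !== root && !full.startsWith(root + sep)) {
    throw new Error(`Path escapes workspace: ${requested}`);
  }
  return full;
}
```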

Grader Hierarchy

Following Anthropic's recommended hierarchy:

| Tier | Grader | Speed | Use When |
|------|--------|-------|----------|
| 1 | TestRunner | Fast | Exit code check (npm test, pytest) |
| 1 | FileState | Fast | File exists/contains/matches assertions |
| 1 | OutputRegex | Fast | Agent output pattern matching |
| 1 | ToolCalls | Fast | Tool usage pattern verification |
| 2 | LLMRubric | Slow | Open-ended quality scoring (0-100) |
| 2 | LLMAssertion | Slow | Binary yes/no semantic checks |
| — | Composite | — | Combine graders: weighted or all-must-pass |
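As an illustration of a Tier-1 grader, an OutputRegex check reduces to a few lines (a sketch mirroring the `output_regex` YAML shape shown later; the function name is illustrative):

```typescript
interface RegexAssertion {
  pattern: string; // regular expression source
  match: boolean;  // true = output must match, false = must NOT match
}

// Pass only if every positive pattern matches the agent output
// and every negative pattern does not.
function gradeOutputRegex(output: string, assertions: RegexAssertion[]): boolean {
  return assertions.every(
    ({ pattern, match }) => new RegExp(pattern, "i").test(output) === match,
  );
}
```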

Metrics

  • pass@k — probability that at least one of k trials passes (capability ceiling)
  • pass^k — probability that all k trials pass (reliability measure)
  • pass@k uses the unbiased estimator 1 - C(n-c, k) / C(n, k) from Chen et al. 2021, where n is the number of trials and c the number that passed; the corresponding unbiased estimator for pass^k is C(c, k) / C(n, k)
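The pass@k estimator can be sketched as a running product, which avoids overflow from computing the binomial coefficients directly (standard Chen et al. 2021 formulation; the function name is illustrative):

```typescript
// Unbiased pass@k: with c of n trials passing,
// pass@k = 1 - C(n-c, k) / C(n, k)
//        = 1 - Π_{i=0..k-1} (n - c - i) / (n - i)
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // fewer than k failures: some pass is guaranteed
  let prod = 1;
  for (let i = 0; i < k; i++) {
    prod *= (n - c - i) / (n - i);
  }
  return 1 - prod;
}
```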

Suite Types

| Type | Trials | Threshold | Purpose |
|------|--------|-----------|---------|
| Regression | 1 | 100% | From production bugs — must never regress |
| Capability | 3-5 | Varies | Frontier abilities — track improvement over time |

Task YAML Format

Suite Definition

suite:
  name: "regression"
  description: "Must-pass regression tests from production bugs"
  eval_type: "regression"
  config:
    model: "claude-sonnet-4-6"
    trials_per_task: 1
    pass_threshold: 1.0
    max_parallel_trials: 4
    timeout_ms: 300000
    max_turns: 25

Task Definition

task:
  id: "compaction-loop"
  description: "Agent handles context compaction without infinite loop"
  prompt: |
    Read these 5 large JSON files and summarize the patterns.

  graders:
    - type: output_regex
      patterns:
        - pattern: "analysis|summary"
          match: true
        - pattern: "error|stuck"
          match: false
    - type: tool_calls
      assertions:
        - tool: "read_file"
          min_calls: 3
    - type: llm_rubric
      model: "haiku"
      rubric: "Did the agent produce a coherent summary?"
      pass_threshold: 0.6

  bidirectional:
    should_not:
      - "Agent should NOT enter an infinite loop"

  fixture_path: "fixtures/large-json-files"
  timeout_ms: 180000
  tags: ["compaction", "context-management"]
  added_from: "production-bug-2026-02-15"
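The YAML above maps naturally onto a typed shape along these lines (an illustrative sketch, not the actual interfaces in src/types.ts; field names mirror the YAML keys):

```typescript
interface GraderConfig {
  type: string;            // "output_regex" | "tool_calls" | "llm_rubric" | ...
  [key: string]: unknown;  // grader-specific options (patterns, rubric, ...)
}

interface TaskDefinition {
  id: string;
  description: string;
  prompt: string;
  graders: GraderConfig[];
  bidirectional?: { should_not: string[] };
  fixture_path: string;
  timeout_ms?: number;
  tags?: string[];
  added_from?: string;
}
```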

CI/CD

The GitHub Actions workflow (.github/workflows/eval.yml) runs:

| Trigger | Suite | Gate |
|---------|-------|------|
| Nightly (6 AM UTC) | regression | Fail if < 100% pass rate |
| Manual dispatch | configurable | Advisory |

Results appear in GitHub Step Summary with per-task pass/fail tables.

Development

# Run tests
npm test

# Type check
npx tsc --noEmit

# Add a new regression task
# 1. Create evals/suites/regression/my-bug.yaml
# 2. Create fixture in evals/fixtures/my-bug/
# 3. Validate: npx tsx bin/whale-eval.js run --dry-run
# 4. Test: npx tsx bin/whale-eval.js run regression/my-bug

Design Principles

  1. Grade outcomes, not paths — graders check final state, not how the agent got there
  2. Bidirectional testing — verify both what should happen AND what should not
  3. Code graders first — deterministic, fast, trustworthy. LLM judges only when needed
  4. Isolated environments — every trial gets a fresh temp directory
  5. Zero coupling — no imports from Whale Code. Uses Anthropic SDK directly
  6. Reference solutions — every task must be provably solvable