eval-bench

v0.21.1

Benchmark Claude Code plugins by A/B comparing plugin versions with LLM-judged evaluation prompts.

Runs a fixed set of prompts against two versions of your plugin (baseline vs current), invokes the real claude CLI so skills, MCP servers, subagents, slash commands, and hooks actually load, grades each output with a configurable judge (local Ollama, Anthropic, OpenAI, or any OpenAI-compatible endpoint), and produces a side-by-side comparison.

How it works

sequenceDiagram
    actor User
    participant eb as eb CLI
    participant git
    participant claude as claude CLI
    participant judge as Judge (Ollama/Anthropic/OpenAI)
    participant snap as snapshot.json

    User->>eb: eb run --baseline <ref>
    eb->>git: worktree add baseline + current
    git-->>eb: two plugin dirs

    Note over eb,claude: Run phase (parallel)
    loop prompt × {baseline, current} × samples
        eb->>claude: spawn with plugin dir + prompt
        claude-->>eb: stdout
    end
    Note right of eb: with --baseline-from <name>, baseline runs<br/>are reused from a saved snapshot.<br/>--current-from <name> does the same<br/>for the current side.

    Note over eb,judge: Judge phase (parallel)
    loop each run
        eb->>judge: POST {prompt, output, rubric}
        judge-->>eb: score 0–5 + rationale
    end

    eb->>snap: write runs + judgments + stats
    eb-->>User: eb compare → markdown / HTML

The provider (claude CLI) and the judge (HTTP API) are independent: the judge never sees claude, only the prompt, the captured output, and your rubric.
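To get a feel for what a run costs, note that the loops in the diagram imply one claude spawn and one judge call per (prompt, variant, sample) combination. A minimal sketch of that arithmetic, assuming a hypothetical suite of 5 prompts and 3 samples per prompt (both numbers are illustrative, not eval-bench defaults):

```shell
# Hypothetical suite size; substitute your own counts.
prompts=5
samples=3

# Full A/B run: every prompt runs on both variants, `samples` times each.
full_runs=$(( prompts * 2 * samples ))

# With one side reused from a saved snapshot (e.g. --baseline-from),
# only the other variant is actually executed and judged.
reused_runs=$(( prompts * 1 * samples ))

echo "full A/B:        $full_runs claude spawns, $full_runs judge calls"
echo "one side reused: $reused_runs claude spawns, $reused_runs judge calls"
```

Reusing one side from a snapshot halves both the claude and the judge work, which is what the rolling-baseline workflow relies on.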

Install

npm i -g eval-bench
# or
npx eval-bench --help

Requires:

  • Node 20+
  • claude CLI on PATH
  • Your plugin in a git repo (required for baseline checkout via git worktree)
  • A judge: either Ollama installed locally, or an API key for Anthropic/OpenAI

Note: You don't need a full plugin structure. If you only have standalone skills/*.md or agents/*.md files and no .claude-plugin/plugin.json, eval-bench automatically creates a temporary minimal plugin manifest for you.
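The requirements above can be checked before the first run with a small preflight script. This is a sketch, not part of eval-bench: the helper names `have` and `node_major` are made up for illustration, and it only reports what it finds rather than aborting.

```shell
# Report whether a command is on PATH without aborting the script.
have() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1 found"
  else
    echo "missing: $1 not on PATH"
  fi
}

# Extract the major version from a string shaped like "v20.11.0".
node_major() {
  printf '%s\n' "$1" | sed 's/^v//' | cut -d. -f1
}

have claude
have git
have node

# Node 20+ is required; only check the version if node is installed.
if command -v node >/dev/null 2>&1; then
  major=$(node_major "$(node --version)")
  if [ "$major" -ge 20 ]; then
    echo "ok: Node $major"
  else
    echo "too old: Node $major (need 20+)"
  fi
fi

# The baseline checkout uses git worktree, so the plugin must live in a git repo.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  echo "ok: inside a git work tree"
else
  echo "missing: not inside a git work tree"
fi
```

Run it from the plugin directory so the git check looks at the right repository. The judge requirement (Ollama or an API key) is not checked here because it depends on which judge you configure.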

Quickstart

cd my-claude-plugin

# scaffold .eval-bench/ (config, prompts template, snapshots dir)
eb init

# write your eval prompts and rubrics
$EDITOR .eval-bench/prompts.yaml

# freeze a reference snapshot at a known-good ref. This becomes the
# starting point for the rolling-baseline workflow below.
eb eval --ref v1.0.0 --save-as v1-baseline

From here you'll typically use one of three workflows. Pick by what you're trying to do.

Workflow A — rolling baseline (the common case)

You changed something in the plugin and want to know if it regressed quality vs the last accepted snapshot. The previous snapshot's current runs become this snapshot's baseline — zero claude calls and zero judge calls for that side, so you only pay for the current ref:

sequenceDiagram
    actor You
    participant tree as Plugin (working tree)
    participant eb as eb CLI
    participant prev as v1-baseline (snapshot)
    participant claude as claude CLI
    participant judge as Judge (Ollama/Anthropic/OpenAI)
    participant snap as wip (snapshot)

    You->>tree: edit your plugin
    You->>eb: eb run --baseline-from v1-baseline --save-as wip

    Note over eb,prev: The previous snapshot's "current" runs become<br/>this snapshot's "baseline". Zero claude calls,<br/>zero judge calls for that side.
    eb->>prev: load cached runs + judgments
    prev-->>eb: re-stamped as baseline rows

    Note over eb,claude: Each "spawn" = a fresh `claude -p <prompt>` subprocess.<br/>It loads the plugin source from your working tree —<br/>uncommitted changes are included.
    loop prompt × samples
        eb->>claude: spawn `claude -p <prompt>`
        claude-->>eb: stdout
        eb->>judge: prompt + output + rubric
        judge-->>eb: score + rationale
    end

    eb->>snap: write baseline + current + summary
    eb-->>You: baseline mean 4.20 (n=15)<br/>current  mean 4.45 (n=15)<br/>Δ +0.25
    You->>eb: eb view wip
    eb-->>You: side-by-side HTML

# Make a change in your plugin, then:
eb run --baseline-from v1-baseline --save-as wip
eb view wip

Δ is current.mean − baseline.mean. Scores are 0–5 from the judge; means are arithmetic averages across all (prompt × samples) runs. Once wip looks good, it becomes the next iteration's --baseline-from argument — there's no separate "promote" command.
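As a worked example of that arithmetic, the sketch below computes both means and Δ by hand. The score lists are invented for illustration and are not real eval-bench output; the `mean` helper is likewise hypothetical.

```shell
# Hypothetical judge scores, one per (prompt x sample) run.
baseline_scores="4 4 5 4 4"
current_scores="5 4 5 4 4"

# Arithmetic mean of a space-separated list of scores, to two decimals.
mean() {
  printf '%s\n' "$1" | tr ' ' '\n' | awk '{ sum += $1 } END { printf "%.2f", sum / NR }'
}

baseline_mean=$(mean "$baseline_scores")
current_mean=$(mean "$current_scores")

# Delta is current.mean minus baseline.mean; "%+.2f" forces an explicit sign.
delta=$(awk -v c="$current_mean" -v b="$baseline_mean" 'BEGIN { printf "%+.2f", c - b }')

echo "baseline mean $baseline_mean (n=5)"
echo "current  mean $current_mean (n=5)"
echo "delta $delta"
```

A positive Δ means the current side scored higher on average; a negative Δ is a regression.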

Workflow B.1 — iterate on one prompt or rubric

Use this when one of your committed prompts regressed (or never scored well) and you want a tight fix-and-rerun loop on just that prompt — without paying for the full matrix on every iteration. Pair --only with --no-save so iterating doesn't pile up directories under your configured snapshots.dir:

sequenceDiagram
    actor You
    participant prompts as prompts.yaml
    participant eb as eb CLI
    participant prev as v1-baseline (snapshot)
    participant claude as claude CLI
    participant judge as Judge
    participant tmp as tempdir (/tmp/eb-ephemeral-…)

    You->>prompts: edit one prompt + rubric
    You->>eb: eb run --baseline-from v1-baseline --only id --no-save

    Note over eb,prev: --only filters cached baseline to just this prompt
    eb->>prev: load cached baseline runs for id (one per sample)
    prev-->>eb: cached runs + judgments

    Note over eb,tmp: --no-save reroutes snapshots dir to a fresh tempdir<br/>(your configured snapshots.dir stays untouched)
    loop samples
        eb->>claude: spawn `claude -p <prompt>` (working-tree plugin)
        claude-->>eb: stdout
        eb->>judge: prompt + output + rubric
        judge-->>eb: score + rationale
    end
    eb->>tmp: write snapshot.json + view.html + per-row outputs
    eb-->>You: per-row score + rationale to stdout<br/>+ "View HTML" / "View CLI" copy-paste hints

    Note over You: Read rationale or open view.html.<br/>Skill bad? → fix skill / agent / hook.<br/>Rubric off? → edit prompts.yaml.
    You->>eb: same command, again
    Note over eb,tmp: Fresh tempdir per run; OS reclaims /tmp on reboot.

eb run --baseline-from v1-baseline --only find-user-by-email --no-save

Each row's score + rationale prints to stdout so you can read why a row scored what it did without opening the HTML. To see the actual model output, open the view.html path the CLI prints at the end (or paste the eb view <name> --snapshot-dir <tempdir> line). When you're happy, run once with --save-as <name> to capture the new state for workflow A.

Workflow B.2 — throwaway rubric, no commit

Use this when you want to try a prompt + rubric without committing it to prompts.yaml — exploring a new test path, sanity-checking whether your rubric actually scores answers the way you intended, or sketching before deciding what's worth keeping. --prompt-inline reads one prompt + rubric interactively from your terminal:

sequenceDiagram
    actor You
    participant tty as Terminal
    participant eb as eb CLI
    participant claude as claude CLI
    participant judge as Judge

    You->>eb: eb run --prompt-inline
    eb->>tty: Step 1/3 · prompt id
    You->>tty: kebab-case id (or enter for default)
    eb->>tty: Step 2/3 · prompt body
    You->>tty: paste prompt body, then "."
    eb->>tty: Step 3/3 · rubric (with example template)
    You->>tty: type rubric, then "."

    Note over eb: Current-side only, no baseline.<br/>Snapshot lands in a tempdir; configured snapshots.dir is untouched.
    loop samples
        eb->>claude: spawn `claude -p <prompt>` (working-tree plugin)
        claude-->>eb: stdout
        eb->>judge: prompt + output + rubric
        judge-->>eb: score + rationale
    end
    eb-->>You: score + rationale to stdout<br/>+ tempdir path / view.html path / view CLI hint

    Note over You: Tweak skill or rubric.<br/>Up-arrow → run again.

eb run --prompt-inline

The interactive flow shows a working rubric template inline (sub-criteria with point caps, plus a penalty line) so you don't have to read docs/rubrics.md before sketching one. After the run, the CLI prints the path to view.html and an eb view --snapshot-dir <tempdir> form so you can inspect the actual model outputs, not just the judge's summary.

Other recipes

# CI gating — A/B two refs in one shot, fail if regression > threshold
eb run --baseline-from v1-baseline --save-as wip \
       --compare v1-baseline --fail-on-regression 0.5

# already have an `eb eval` snapshot at HEAD? promote it to a dual-variant
# snapshot by stitching it against the saved baseline — no fresh claude runs,
# both `eb compare` and `eb view` work
eb run --baseline-from v1-baseline --current-from wip --save-as wip-vs-v1

# a few rows failed yesterday (judge timeout, quota)? re-run only those
eb run --baseline main --save-as baseline --retry-failed

# changed the judge in eval-bench.yaml? re-score cached Claude outputs
# without re-running Claude — answers "did the new judge change the verdict?"
eb run --save-as wip --rejudge

# override the wrapping judge prompt for one run — useful for prompt-engineering
# the judge itself before committing to a new template in eval-bench.yaml
eb run --save-as wip --judge-template "$(cat my-judge.txt)"
# (or persist it: set judge.template in .eval-bench/eval-bench.yaml)

# diagnose a slow / stuck judge — writes a per-invocation debug log under
# .eval-bench/snapshots/<name>/debug-<ts>.log with full HTTP bodies and
# Ollama timing fields, plus a colorized stderr mirror
eb run --baseline main --save-as baseline --debug

See docs/judges.md#customizing-the-judge-prompt for the verbatim default judge template, the placeholder rules, and worked examples.

Full walkthrough: docs/quickstart.md.
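For the CI gating recipe above, a thin wrapper can turn the command's exit status into a readable verdict in the job log. This sketch assumes that `eb run ... --fail-on-regression` exits non-zero when the regression exceeds the threshold; the recipe implies this, but the exact exit codes are not documented here. The `gate` function name is hypothetical.

```shell
# Run a gating command and translate its exit status into a CI verdict.
# ASSUMPTION: the wrapped command exits non-zero on regression.
gate() {
  if "$@"; then
    echo "eval gate: pass"
  else
    code=$?
    echo "eval gate: fail (exit $code)" >&2
    return "$code"
  fi
}

# Example invocation (commented out so the sketch has no side effects):
# gate eb run --baseline-from v1-baseline --save-as wip \
#      --compare v1-baseline --fail-on-regression 0.5
```

Because `gate` returns the wrapped command's own exit code, the CI job still fails when the command does.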

License

MIT.