
@tlahey/agent-eval

v0.2.0-alpha

AI coding agent evaluation framework with Vitest-like DX

Features

  • Everything is a Variant — Unified API for single runs and A/B experiments.
  • Stability Analysis — Automated multiple iterations per variant to measure consistency.
  • Isolated Parallel Execution — Support for Docker and macOS sandbox-exec to run multiple agents simultaneously.
  • Procedural Command Validation — Deterministic check of required CLI commands (build, test, lint) without LLM guesswork.
  • Zero Magic Philosophy — Explicit runner selection per test for total budget and execution control.
  • Analytical Explorer — Hierarchical tree view with analytical metrics and agent rankings.
  • LLM-as-a-Judge — Structured evaluation via Anthropic, OpenAI, Ollama, or GitHub Models.
  • Visual Dashboard — React dashboard with charts, diff viewer, and delta analysis for experiments.
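The "no LLM guesswork" claim above comes down to plain exit codes. As an illustration only (a hypothetical helper, not the library's actual implementation), a deterministic check of required CLI commands could be sketched like this:

```typescript
// Hypothetical sketch of procedural command validation:
// a command "passes" iff it exits with status 0 — no LLM involved.
import { spawnSync } from "node:child_process";

export function commandsPass(commands: string[], cwd = process.cwd()): boolean {
  return commands.every((cmd) => {
    // shell: true so compound commands like "pnpm run build" resolve via PATH
    const result = spawnSync(cmd, { shell: true, cwd, stdio: "ignore" });
    return result.status === 0;
  });
}
```

For example, `commandsPass(["pnpm run build", "pnpm test"])` would report success only if every command exits cleanly, which is the kind of yes/no signal a judge model cannot fake.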

Quick Start

Prerequisites

  • Node.js ≥ 22 (required for node:sqlite)
  • pnpm ≥ 10

Install

pnpm add -D @tlahey/agent-eval

Configure

AgentEval uses a registry model: define your resources once, then reference them by ID in tests.

// agenteval.config.ts
import { defineConfig } from "@tlahey/agent-eval";
import { CliModel, AnthropicModel, OpenAIModel } from "@tlahey/agent-eval/llm";
import { DockerEnvironment } from "@tlahey/agent-eval/environment";

export default defineConfig({
  // Library of available technical resources
  runners: [
    { id: "copilot", model: new CliModel({ command: 'gh copilot suggest "{{prompt}}"' }) },
    { id: "sonnet", model: new AnthropicModel({ model: "claude-3-5-sonnet-latest" }) },
  ],
  judge: {
    model: new OpenAIModel({ model: "gpt-4o" }),
  },
  // Collect 3 runs per variant to compute stability metrics
  runs: 3,
  // Enable parallel execution via Docker (optional)
  environment: new DockerEnvironment({ image: "node:22" }),
});
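With `runs: 3`, each variant executes multiple times so consistency can be measured. The README doesn't specify the exact formula, but conceptually a stability metric over per-run judge scores might look like this illustrative sketch (the function name and shape are assumptions):

```typescript
// Illustrative only: stability as mean and population variance of the
// per-run scores collected for one variant (e.g. 3 scores when runs: 3).
export function stability(scores: number[]): { mean: number; variance: number } {
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;
  return { mean, variance };
}
```

A variant scoring `[1, 1, 1]` has zero variance (perfectly stable), while `[0.2, 0.9, 0.4]` has the same general quality range but high variance — the kind of inconsistency the dashboard surfaces.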

Write a test (Baseline)

Every test requires an explicit variant array. Use requiredCommands for procedural verification.

// evals/banner.eval.ts
import { test, expect } from "@tlahey/agent-eval";

test("Add a Close button", [{ name: "Baseline", runner: "sonnet" }], async ({ ctx }) => {
  ctx.prompt("Add a Close button to the Banner component");

  ctx.addTask({
    name: "Check component",
    action: ({ exec }) => exec('grep -q "aria-label" src/components/Banner.tsx'),
    criteria: "Banner should contain 'aria-label' for accessibility",
  });

  await expect(ctx).toPassJudge({
    criteria: "Uses a proper close button, accessibility is respected.",
    requiredCommands: ["pnpm run build"], // procedural validation
    expectedFiles: ["src/components/Banner.tsx"],
  });
});

A/B Testing (Experiments)

Compare models or prompt engineering strategies by adding more variants. The dashboard will automatically show deltas and stability (variance) between variants.

test(
  "Refactor Logic",
  [
    { name: "Direct", runner: "sonnet" },
    {
      name: "Expert Persona",
      runner: "sonnet",
      enrichPrompt: "Act as a Senior Engineer. Mission: {{prompt}}",
    },
    { name: "GPT-4o", runner: "gpt4" },
  ],
  async ({ ctx }) => {
    ctx.prompt("Refactor the auth middleware to use JWT.");
    await expect(ctx).toPassJudge({
      criteria: "Logic is secure and idiomatic.",
      requiredCommands: ["pnpm test"],
    });
  },
);

For example, you can compare different models against each other, or a model against itself with different skills or prompt strategies.
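Both the `CliModel` command and `enrichPrompt` use a `{{prompt}}` placeholder. The substitution rules aren't documented here, but a minimal interpolation could be sketched as follows (a hypothetical helper, not the library's API):

```typescript
// Hypothetical sketch of the "{{prompt}}" substitution used by
// CliModel command templates and enrichPrompt strings.
export function interpolatePrompt(template: string, prompt: string): string {
  // split/join replaces every occurrence without regex-escaping concerns
  return template.split("{{prompt}}").join(prompt);
}

// "Expert Persona" variant from the example above:
const enriched = interpolatePrompt(
  "Act as a Senior Engineer. Mission: {{prompt}}",
  "Refactor the auth middleware to use JWT.",
);
```

Here `enriched` becomes `"Act as a Senior Engineer. Mission: Refactor the auth middleware to use JWT."` — the same base prompt, wrapped in the variant's persona framing.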


Real World Examples

Check out our Example Target App for complete scenarios.


License

ISC