# @tlahey/agent-eval

v0.2.0-alpha

AI coding agent evaluation framework with Vitest-like DX
## Features
- Everything is a Variant — Unified API for single runs and A/B experiments.
- Stability Analysis — Automated multiple iterations per variant to measure consistency.
- Isolated Parallel Execution — Support for Docker and macOS sandbox-exec to run multiple agents simultaneously.
- Procedural Command Validation — Deterministic check of required CLI commands (build, test, lint) without LLM guesswork.
- Zero Magic Philosophy — Explicit runner selection per test for total budget and execution control.
- Analytical Explorer — Hierarchical tree view with analytical metrics and agent rankings.
- LLM-as-a-Judge — Structured evaluation via Anthropic, OpenAI, Ollama, or GitHub Models.
- Visual Dashboard — React dashboard with charts, diff viewer, and delta analysis for experiments.
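As an illustration of the judge options, the evaluation judge can also point at a local backend instead of a cloud provider. The sketch below is an assumption: the `OllamaModel` class name and its options are inferred from the provider list above, not a confirmed API.

```typescript
// agenteval.config.ts (judge excerpt) — HYPOTHETICAL: the OllamaModel
// import and its option names are assumptions based on the supported
// provider list (Anthropic, OpenAI, Ollama, GitHub Models).
import { defineConfig } from "@tlahey/agent-eval";
import { OllamaModel } from "@tlahey/agent-eval/llm";

export default defineConfig({
  runners: [], // register your runners here as shown in Quick Start
  judge: {
    // Structured evaluation against a locally running Ollama instance
    model: new OllamaModel({ model: "llama3.1" }),
  },
});
```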
## Quick Start

### Prerequisites

- Node.js ≥ 22 (required for `node:sqlite`)
- pnpm ≥ 10

### Install

```shell
pnpm add -D @tlahey/agent-eval
```

### Configure
AgentEval uses a registry model: define your resources once, then reference them by ID in tests.
```typescript
// agenteval.config.ts
import { defineConfig } from "@tlahey/agent-eval";
import { AnthropicModel, CliModel, OpenAIModel } from "@tlahey/agent-eval/llm";
import { DockerEnvironment } from "@tlahey/agent-eval/environment";

export default defineConfig({
  // Library of available technical resources
  runners: [
    { id: "copilot", model: new CliModel({ command: 'gh copilot suggest "{{prompt}}"' }) },
    { id: "sonnet", model: new AnthropicModel({ model: "claude-3-5-sonnet-latest" }) },
  ],
  judge: {
    model: new OpenAIModel({ model: "gpt-4o" }),
  },
  // Collect 3 runs per variant to compute stability metrics
  runs: 3,
  // Enable parallel execution via Docker (optional)
  environment: new DockerEnvironment({ image: "node:22" }),
});
```

### Write a test (Baseline)
Every test requires an explicit variant array. Use `requiredCommands` for procedural verification.
```typescript
// evals/banner.eval.ts
import { test, expect } from "@tlahey/agent-eval";

test("Add a Close button", [{ name: "Baseline", runner: "sonnet" }], async ({ ctx }) => {
  ctx.prompt("Add a Close button to the Banner component");

  ctx.addTask({
    name: "Check component",
    action: ({ exec }) => exec('grep -q "aria-label" src/components/Banner.tsx'),
    criteria: "Banner should contain 'aria-label' for accessibility",
  });

  await expect(ctx).toPassJudge({
    criteria: "Uses a proper close button, accessibility is respected.",
    requiredCommands: ["pnpm run build"], // procedural validation
    expectedFiles: ["src/components/Banner.tsx"],
  });
});
```

### A/B Testing (Experiments)
Compare models or prompt engineering strategies by adding more variants. The dashboard will automatically show deltas and stability (variance) between variants.
```typescript
test(
  "Refactor Logic",
  [
    { name: "Direct", runner: "sonnet" },
    {
      name: "Expert Persona",
      runner: "sonnet",
      enrichPrompt: "Act as a Senior Engineer. Mission: {{prompt}}",
    },
    // Assumes a "gpt4" runner is registered in agenteval.config.ts
    { name: "GPT-4o", runner: "gpt4" },
  ],
  async ({ ctx }) => {
    ctx.prompt("Refactor the auth middleware to use JWT.");
    await expect(ctx).toPassJudge({
      criteria: "Logic is secure and idiomatic.",
      requiredCommands: ["pnpm test"],
    });
  },
);
```

For example, you can compare different models, or pit a model against itself with different prompting strategies or skills.
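The "model against itself" comparison can be sketched with two variants on the same runner that differ only in prompt enrichment. The scenario below is illustrative (the eval name, prompts, and criteria are made up); it uses only the API shown above.

```typescript
// evals/self-compare.eval.ts — illustrative scenario: same runner twice,
// the only difference between variants is the enrichPrompt wrapper.
import { test, expect } from "@tlahey/agent-eval";

test(
  "Fix flaky date parsing",
  [
    { name: "Plain", runner: "sonnet" },
    {
      name: "Test-First Persona",
      runner: "sonnet",
      enrichPrompt: "You always write a regression test first. Mission: {{prompt}}",
    },
  ],
  async ({ ctx }) => {
    ctx.prompt("parseDate() fails for ISO strings with timezone offsets; fix it.");
    await expect(ctx).toPassJudge({
      criteria: "Bug is fixed and covered by a regression test.",
      requiredCommands: ["pnpm test"],
    });
  },
);
```

With `runs` set in the config, the dashboard then reports the stability delta between the two variants of the same model.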
## Real World Examples

Check out our Example Target App for complete scenarios.

## License

ISC
