# @tlahey/agent-eval

v0.2.0-alpha

AI coding agent evaluation framework with Vitest-like DX
## Features
- Everything is a Variant — Unified API for single runs and A/B experiments.
- Stability Analysis — Automated multiple iterations per variant to measure consistency.
- Isolated Parallel Execution — Support for Docker and macOS sandbox-exec to run multiple agents simultaneously.
- Procedural Command Validation — Deterministic check of required CLI commands (build, test, lint) without LLM guesswork.
- Zero Magic Philosophy — Explicit runner selection per test for total budget and execution control.
- Analytical Explorer — Hierarchical tree view with analytical metrics and agent rankings.
- LLM-as-a-Judge — Structured evaluation via Anthropic, OpenAI, Ollama, or GitHub Models.
- Visual Dashboard — React dashboard with charts, diff viewer, and delta analysis for experiments.
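As an illustration of the judge options, the evaluation judge can also point at a local backend instead of a cloud provider. The sketch below is an assumption: the `OllamaModel` class name and its options are inferred from the provider list above, not a confirmed API.

```typescript
// agenteval.config.ts (judge excerpt) — HYPOTHETICAL: the OllamaModel
// import and its option names are assumptions based on the supported
// provider list (Anthropic, OpenAI, Ollama, GitHub Models).
import { defineConfig } from "@tlahey/agent-eval";
import { OllamaModel } from "@tlahey/agent-eval/llm";

export default defineConfig({
  runners: [], // register your runners here as shown in Quick Start
  judge: {
    // Structured evaluation against a locally running Ollama instance
    model: new OllamaModel({ model: "llama3.1" }),
  },
});
```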
## Quick Start

### Prerequisites

- Node.js ≥ 22 (required for `node:sqlite`)
- pnpm ≥ 10

### Install

```shell
pnpm add -D @tlahey/agent-eval
```

### Configure
AgentEval uses a registry model: define your resources once, then reference them by ID in tests.
```typescript
// agenteval.config.ts
import { defineConfig } from "@tlahey/agent-eval";
import { AnthropicModel, CliModel, OpenAIModel } from "@tlahey/agent-eval/llm";
import { DockerEnvironment } from "@tlahey/agent-eval/environment";

export default defineConfig({
  // Library of available technical resources
  runners: [
    { id: "copilot", model: new CliModel({ command: 'gh copilot suggest "{{prompt}}"' }) },
    { id: "sonnet", model: new AnthropicModel({ model: "claude-3-5-sonnet-latest" }) },
  ],
  judge: {
    model: new OpenAIModel({ model: "gpt-4o" }),
  },
  // Collect 3 runs per variant to compute stability metrics
  runs: 3,
  // Enable parallel execution via Docker (optional)
  environment: new DockerEnvironment({ image: "node:22" }),
});
```

### Write a test (Baseline)
Every test requires an explicit variant array. Use `requiredCommands` for procedural verification.
```typescript
// evals/banner.eval.ts
import { test, expect } from "@tlahey/agent-eval";

test("Add a Close button", [{ name: "Baseline", runner: "sonnet" }], async ({ ctx }) => {
  ctx.prompt("Add a Close button to the Banner component");

  ctx.addTask({
    name: "Check component",
    action: ({ exec }) => exec('grep -q "aria-label" src/components/Banner.tsx'),
    criteria: "Banner should contain 'aria-label' for accessibility",
  });

  await expect(ctx).toPassJudge({
    criteria: "Uses a proper close button, accessibility is respected.",
    requiredCommands: ["pnpm run build"], // procedural validation
    expectedFiles: ["src/components/Banner.tsx"],
  });
});
```

### A/B Testing (Experiments)
Compare models or prompt engineering strategies by adding more variants. The dashboard will automatically show deltas and stability (variance) between variants.
```typescript
test(
  "Refactor Logic",
  [
    { name: "Direct", runner: "sonnet" },
    {
      name: "Expert Persona",
      runner: "sonnet",
      enrichPrompt: "Act as a Senior Engineer. Mission: {{prompt}}",
    },
    // Assumes a "gpt4" runner is registered in agenteval.config.ts
    { name: "GPT-4o", runner: "gpt4" },
  ],
  async ({ ctx }) => {
    ctx.prompt("Refactor the auth middleware to use JWT.");
    await expect(ctx).toPassJudge({
      criteria: "Logic is secure and idiomatic.",
      requiredCommands: ["pnpm test"],
    });
  },
);
```

For example, you can compare different models, or pit a model against itself with different prompting strategies or skills.
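The "model against itself" comparison can be sketched with two variants on the same runner that differ only in prompt enrichment. The scenario below is illustrative (the eval name, prompts, and criteria are made up); it uses only the API shown above.

```typescript
// evals/self-compare.eval.ts — illustrative scenario: same runner twice,
// the only difference between variants is the enrichPrompt wrapper.
import { test, expect } from "@tlahey/agent-eval";

test(
  "Fix flaky date parsing",
  [
    { name: "Plain", runner: "sonnet" },
    {
      name: "Test-First Persona",
      runner: "sonnet",
      enrichPrompt: "You always write a regression test first. Mission: {{prompt}}",
    },
  ],
  async ({ ctx }) => {
    ctx.prompt("parseDate() fails for ISO strings with timezone offsets; fix it.");
    await expect(ctx).toPassJudge({
      criteria: "Bug is fixed and covered by a regression test.",
      requiredCommands: ["pnpm test"],
    });
  },
);
```

With `runs` set in the config, the dashboard then reports the stability delta between the two variants of the same model.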
## Real World Examples

Check out our Example Target App for complete scenarios.

## License

ISC
