@razroo/iso-eval

v0.4.0

Behavioral eval runner for AI coding agents — snapshot a workspace, hand it to a runner with a task prompt, score the resulting filesystem/git state.
`agentmd` lints prompt structure, `isolint` lints prompt prose, and `iso-harness` fans out the compiled source into every harness file layout. None of them answer the next question: did the agent actually do the task? That's what `@razroo/iso-eval` scores.
You give it a suite of tasks — each with a baseline workspace, a prompt, and a set of checks — and it snapshots the workspace per trial, hands it to a runner, then verifies the resulting filesystem / command state against your checks.
Built-in runners today:
- `fake` — deterministic CI/offline runner that executes `$ ...` lines from the prompt as shell in the snapshotted workspace.
- `codex` — real-agent runner that shells out to `codex exec` in the per-trial workspace and captures the final assistant message.
- `claude-code` — real-agent runner that shells out to `claude -p` in the per-trial workspace.
- `cursor` — real-agent runner that shells out to `cursor-agent --print` in the per-trial workspace.
- `opencode` — real-agent runner that shells out to `opencode run` in the per-trial workspace.
The library API still accepts any `RunnerFn`, so you can plug in other harnesses without waiting on a packaged runner.
Install
```sh
npm install -D @razroo/iso-eval
```

Suite shape
```yaml
# eval.yml
suite: refactor-basic
runner: fake # fake | codex | claude-code | cursor | opencode
timeoutMs: 120000
harness:
  source: ../dist # optional: stage generated harness files into each trial
tasks:
  - id: write-greeting
    prompt: tasks/write-greeting.md # path (relative to eval.yml) or inline
    workspace: workspace/ # baseline dir, copied per-trial into tmpdir
    trials: 1
    checks:
      - { type: file_exists, path: greeting.txt }
      - { type: file_contains, path: greeting.txt, value: "hello" }
      - { type: file_not_contains, path: greeting.txt, value: "TODO" }
      - { type: command, run: "test -f greeting.txt", expectExit: 0 }
```

Supported checks
| type | asserts |
| --------------------- | ---------------------------------------------------------------- |
| `command` | shell command exits with `expectExit` (default 0); optional stdout `contains`/`matches` |
| `file_exists` | file at `path` exists in the workspace |
| `file_contains` | file at `path` contains the literal substring `value` |
| `file_not_contains` | file at `path` does NOT contain `value` |
| `file_matches` | file at `path` matches the regex `matches` |
| `llm_judge` | a user-supplied `JudgeFn` answers yes to `prompt` against runner stdout/stderr |
| `agentmd_adherence` | per-rule pass rate from `agentmd test` meets `minPassRate`; optional `ruleId` filter |
agentmd_adherence
```yaml
- type: agentmd_adherence
  promptFile: ../agent.md   # path to agentmd source (relative to eval.yml)
  fixtures: ../fixtures.yml # path to agentmd fixture file
  ruleId: H3                # optional — score only this rule
  minPassRate: 0.9          # required — pass rate floor in [0, 1]
  via: claude-code          # optional — default claude-code (api | claude-code | fake)
  model: claude-haiku-4-5   # optional — forwarded as --model
  timeoutMs: 180000         # optional — subprocess timeout
```

Shells out to the agentmd CLI (bundled as a runtime dependency) via
`agentmd test <promptFile> --fixtures <fixtures> --format json`, parses the per-rule check outcomes, computes the pass rate for `ruleId` (or overall if omitted), and fails the check when the rate is below `minPassRate`. Tests can inject a fake subprocess runner via the library API (`AgentmdSpawnFn`) so CI doesn't need an API key.
CLI
```sh
iso-eval run examples/suites/echo-basic/eval.yml
iso-eval plan examples/suites/echo-basic/eval.yml
iso-eval run eval.yml --filter write-greeting --concurrency 2 --json
iso-eval run eval.yml --runner claude-code --harness-source ../dist
iso-eval run eval.yml --runner cursor --harness-source ../dist
iso-eval run eval.yml --runner opencode --harness-source ../dist
iso-eval run eval.yml --keep-workspaces # skip tmpdir cleanup for debugging
```

`run` exits 0 on all-pass, 1 on any failure, 2 on invalid invocation.
`--runner` and `--harness-source` let you replay the same suite through a different packaged harness without rewriting the suite's checks.
Real runners and harness staging
Set `runner:` in YAML, or override it at the CLI with `--runner`. `harness.source` is optional; when present, iso-eval stages the generated harness files you want the runner to see into each snapshotted workspace.
codex
```yaml
suite: refactor-basic
runner: codex
timeoutMs: 180000
harness:
  source: ../dist
```

Accepted harness.source shapes:

- a project directory containing `AGENTS.md` and/or `.codex/`
- a direct `AGENTS.md` path
- a direct `.codex/config.toml` path
claude-code
Accepted harness.source shapes:

- a project directory containing `CLAUDE.md`, `.claude/`, and/or `.mcp.json`
- a direct `CLAUDE.md` path
- a direct `.claude/` path
- a direct `.claude/settings.json` path
- a direct `.mcp.json` path

The runner shells out to `claude -p --no-session-persistence` and passes `.mcp.json` through `--mcp-config` when present.
opencode
Accepted harness.source shapes:

- a project directory containing `AGENTS.md`, `opencode.json`, and/or `.opencode/`
- a direct `AGENTS.md` path
- a direct `opencode.json` path
- a direct `.opencode/` path

The runner shells out to `opencode run --dir <workspace>` and defaults to `--pure` so each trial stays self-contained.
cursor
Accepted harness.source shapes:

- a project directory containing `.cursor/`, `AGENTS.md`, and/or `CLAUDE.md`
- a direct `.cursor/` path
- a direct `.cursor/rules/` path
- a direct `.cursor/rules/*.mdc` path
- a direct `.cursor/mcp.json` path
- a direct `AGENTS.md` path
- a direct `CLAUDE.md` path

The runner shells out to `cursor-agent --print --output-format text --workspace <workspace>` and stages any Cursor harness files you exported with iso-harness into the per-trial workspace first.
This lets one suite exported from iso-trace be replayed across the
packaged runners with the same task prompt and checks.
Library API
```ts
import { loadSuite, run, formatReport, fakeRunner } from "@razroo/iso-eval";

const suite = loadSuite("./eval.yml");
const report = await run(suite, {
  runner: fakeRunner,
  concurrency: 2,
  onTaskComplete: (t) => console.log(t.id, t.passed ? "✓" : "✗"),
});

console.log(formatReport(report));
process.exit(report.passed ? 0 : 1);
```

Bring your own runner
The YAML `runner:` field selects from shipped runners; the library accepts any `RunnerFn`:
```ts
import type { RunnerFn } from "@razroo/iso-eval";

const myRunner: RunnerFn = async ({ workspaceDir, taskPrompt, timeoutMs, harnessSource }) => {
  // spawn your agent (claude -p / codex exec / …) with cwd = workspaceDir
  // optionally stage files from harnessSource before invoking it
  // return { exitCode, stdout, stderr, durationMs }
};
```

Bring your own judge (for llm_judge checks)
```ts
import type { JudgeFn } from "@razroo/iso-eval";

const judge: JudgeFn = async (prompt, output) => {
  // call your model; return true if the rule was followed
};

await run(suite, { runner: fakeRunner, judge });
```

How this fits the rest of the pipeline
```
agent.md → agentmd lint → agentmd render → isolint lint → iso-harness build
                                                                │
                                                                ▼
                                                  project w/ CLAUDE.md etc.
                                                                │  iso-eval run
                                                                ▼
                                                      per-task pass / fail
```

`@razroo/agentmd` measures per-rule adherence on text output (input string → output string → check). `@razroo/iso-eval` measures task success on a real workspace (snapshot dir → agent acts → filesystem state → check).
The two compose: an iso-eval suite can include `llm_judge` checks that reuse the same judge convention (yes = rule followed), plus `agentmd_adherence` checks that fold a fixture-level adherence score into the task report.
License
MIT — see LICENSE.
