agent-regression-lab
v0.7.1
Regression testing for AI agents — catch prompt and behavior changes before they ship.
Agent Regression Lab
Agent Regression Lab is the local-first regression spine for agent engineering teams.
It gives teams a repeatable way to define expected agent behavior in YAML, replay it against deterministic tool surfaces or live HTTP agents, store traces and scores locally, and compare candidate behavior against known baselines over time.
This is a local-first alpha for early technical teams. It is strongest when used across one workflow spine:
- debug a single scenario while building
- validate a branch with a suite before merge
- run curated golden suites before release
- keep incident-derived scenarios as engineering memory
Who It Is For
- teams shipping prompt, model, tool, workflow, and memory changes
- engineers who need repeatable before/after evidence instead of vibes
- teams validating live HTTP agents as well as deterministic local scenarios
- researchers and technical operators who want local control before adopting heavier hosted infrastructure
Why Teams Use It
- catch regressions before merge or release
- debug subtle behavioral changes with full traces
- compare model, prompt, tool, and workflow changes against a known baseline
- build a portfolio of golden workflows, historical regressions, and ugly edge cases
- preserve engineering memory so old failures do not quietly return
What It Supports Today
- YAML scenarios under scenarios/
- deterministic built-in tools plus custom tools from agentlab.config.yaml
- named agents from agentlab.config.yaml
- built-in mock, openai, external_process, and http agent modes
- type: conversation multi-turn dialog scenarios for HTTP agents
- SQLite-backed local run history under artifacts/agentlab.db
- CLI commands to list, run, show, compare, and launch the UI
- local web UI for run inspection, run comparison, and suite batch comparison
Workflow Spine
Use this as the default product story:
- debug locally with one scenario
- validate a branch with a suite
- run curated golden suites before release
- keep incident-derived scenarios as permanent regression assets
Start Here
If your agent runs as an HTTP service:
- use provider: http
- start with arl-test
- read docs/agents.md and docs/scenarios.md
If you are validating coding-agent changes:
- start with the coding scenarios under scenarios/coding/
- read docs/coding-agents.md
- use deterministic tool-loop runs first, then compare before/after behavior
If you want pre-merge regression checks in CI:
- use suite_definitions
- start with .github/workflows/agentlab-pre-merge.yml
- run agentlab run --suite-def pre_merge --agent mock-default
First 10 Minutes
The fastest path for new users is the installed CLI.
Path A: npm install
npm install -g agent-regression-lab
agentlab run --demo
agentlab init
agentlab list scenarios
agentlab run support.generated-happy-path --agent mock-default
agentlab approve @last
agentlab init writes agentlab.config.yaml, starter scenarios under scenarios/, fixture stubs under fixtures/, and .gitignore coverage for artifacts/.
Add more starter coverage any time:
agentlab generate --domain support --count 5 --agent mock-default
Use shorthands instead of copying UUIDs:
agentlab show @last
agentlab compare @prev @last
agentlab compare --baseline support.generated-happy-path @last
Path B: local development
npm install
npm run check
npm test
npm run build
npm link
agentlab --help
Try the zero-config demo from either path:
agentlab run --demo
This runs a 2-phase narrative demo: baseline run → simulated prompt change → regression caught.
Launch the local UI:
agentlab ui
The UI starts on http://127.0.0.1:4173.
Run a suite and compare two suite batches:
agentlab run --suite support --agent mock-default
agentlab run --suite support --agent mock-default
agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
run --suite prints a Suite batch: id at the end. That is the id used by compare --suite.
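The batch id can be captured in a script instead of being copied by hand. The following is a minimal sketch, not part of the tool itself: the helper names extract_batch_id and run_suite_batch are hypothetical, and it assumes the id is the last whitespace-separated field on a line containing "Suite batch:". Adjust the pattern if the real output differs.

```shell
#!/usr/bin/env bash
# Read command output on stdin and print the suite batch id.
# ASSUMPTION: the id is the last field on the line containing "Suite batch:".
# Exits non-zero if no such line was seen.
extract_batch_id() {
  awk '
    /Suite batch:/ { id = $NF }
    END { if (id == "") exit 1; print id }
  '
}

# Hypothetical wrapper: run a suite and print only its batch id.
# $1 = suite id, $2 = agent name
run_suite_batch() {
  agentlab run --suite "$1" --agent "$2" | extract_batch_id
}

# Usage (requires agentlab on PATH):
#   baseline=$(run_suite_batch support mock-default)
#   candidate=$(run_suite_batch support mock-default)
#   agentlab compare --suite "$baseline" "$candidate"
```

Because the wrapper swallows the normal run output, it suits scripted runs better than interactive debugging.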
Install
Installed CLI
After the package is published:
npm install -g agent-regression-lab
agentlab --help
You can also use:
npx agent-regression-lab --help
Local Development Install
From this repo:
npm install
npm run build
npm link
agentlab --help
Repo-Local Dev Mode
If you do not want to link the package yet:
npm run start -- --help
npm run start -- run support.refund-correct-order --agent mock-default
CLI
Supported command surface:
agentlab init [project-name]
agentlab generate [--agent <name>] [--domain support|coding|research|ops|general] [--count <n>]
agentlab run --demo
agentlab run <scenario-id> [--agent <name>]
agentlab run --suite <suite-id> [--agent <name>]
agentlab run --suite-def <name> [--agent <name>]
agentlab run <scenario-id> [--variant-set <name>]
agentlab show <run-id|@last|@prev>
agentlab approve <run-id|@last|@prev>
agentlab compare <baseline-run-id|@prev> <candidate-run-id|@last>
agentlab compare --baseline <scenario-id> <candidate-run-id|@last>
agentlab compare --suite <baseline-batch-id> <candidate-batch-id>
agentlab ui
agentlab version
agentlab help
The CLI operates on the current working directory. Run it from the root of a project that contains scenarios/, fixtures/, and optional agentlab.config.yaml.
Canonical Workflow
Use this as the default mental model:
- list scenarios
- run one scenario or one suite
- note the run id or suite batch id
- inspect the run in CLI or UI
- compare two runs or two suite batches
- extend the setup with a named agent or custom tools from repo-local files or installed packages when needed
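The single-scenario version of the steps above can be sketched as one shell function. It uses only commands from the supported command surface; the function name regression_loop is hypothetical. It also assumes that after two consecutive runs @prev refers to the first and @last to the second, which matches the argument order shown for agentlab compare but is not spelled out here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list, run twice, inspect, and compare one scenario.
# $1 = scenario id, $2 = agent name (defaults to mock-default)
regression_loop() {
  scenario="$1"
  agent="${2:-mock-default}"
  agentlab list scenarios &&
  agentlab run "$scenario" --agent "$agent" &&   # baseline run
  agentlab show @last &&                         # inspect the baseline
  agentlab run "$scenario" --agent "$agent" &&   # candidate run, after your change
  agentlab compare @prev @last                   # ASSUMPTION: @prev is the baseline
}

# Usage (requires agentlab on PATH, run from the project root):
#   regression_loop support.generated-happy-path mock-default
```

In practice the change under test (prompt, model, tool) would be applied between the two runs; the function runs them back to back only to keep the sketch short.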
Canonical Live HTTP Fixture
arl-test/ is the canonical live HTTP regression fixture in this repo.
Use it to verify the production-like HTTP path end to end:
cd arl-test
npm start
node ../dist/index.js list scenarios
node ../dist/index.js run order-tracking-in-transit --agent support-agent
The arl-test scenarios are intended to behave like a real internal-team regression fixture, not just a toy demo.
Config And Extension Points
agentlab.config.yaml is the public extension point for:
- named agents
- custom tools from repo-local files or installed npm packages
Supported agent providers:
- mock
- openai
- external_process
- http (point at a running HTTP service for multi-turn conversation testing)
Working sample assets already live in this repo:
- external agents: custom_agents/node_agent.mjs, custom_agents/python_agent.py
- custom tool: user_tools/findDuplicateCharge.ts
- package-style tool examples: examples/support-tools, examples/coding-tools
- sample config: agentlab.config.yaml
Local Data And Artifacts
By default the product writes local state under artifacts/.
Important paths:
- SQLite DB: artifacts/agentlab.db
- per-run trace output: artifacts/<run-id>/trace.json
- local UI assets at runtime: served from packaged dist/ui-assets or built into artifacts/ui/ in repo mode
If you delete artifacts/, you remove stored run history and generated local outputs.
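Before deleting artifacts/, it can help to see which runs have traces on disk. The sketch below relies only on the artifacts/<run-id>/trace.json layout described above; the function name list_run_ids is hypothetical, and it does not read the SQLite DB or the trace contents.

```shell
#!/usr/bin/env bash
# Print the run id of every directory under the artifacts root that
# contains a trace.json. $1 = artifacts root (defaults to ./artifacts)
list_run_ids() {
  for trace in "${1:-artifacts}"/*/trace.json; do
    [ -f "$trace" ] || continue          # glob matched nothing, or not a file
    basename "$(dirname "$trace")"       # directory name is the run id
  done
}

# Usage:
#   list_run_ids            # scans ./artifacts
#   list_run_ids /some/other/artifacts
```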
Determinism
The benchmark is designed to be deterministic enough for repeated local evaluation:
- built-in tools read from local fixtures
- scenarios declare fixed tool allowlists and evaluator rules
- scoring is rule-based
- suite comparison is based on stored local runs and suite batch ids
Agent behavior can still vary depending on the provider path. The built-in mock path is the most deterministic path for smoke tests and baseline examples.
Limitations
- this is a local-first alpha, not a hosted platform
- the published package/example ecosystem is still small
- external agents integrate through the local stdin/stdout protocol only
- the UI is intentionally minimal and optimized for debugging
- run history lives in a single local SQLite DB, so running live verifications sequentially is still the safest path when they share the same artifacts DB
- the benchmark is broader than before, but still small compared to a mature benchmark product
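For the shared-DB limitation above, one way to keep live verification sequential is a plain loop that starts each run only after the previous one has finished. This is a sketch under assumptions: run_sequentially is a hypothetical helper, and it stops early only if agentlab run itself exits non-zero. Whether a failing score produces a non-zero exit code is not stated here, so check that before relying on it as a gate.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run scenarios one at a time against one agent so
# that only a single process writes to the local artifacts DB at once.
# $1 = agent name, remaining arguments = scenario ids
run_sequentially() {
  agent="$1"
  shift
  for scenario in "$@"; do
    # ASSUMPTION: a non-zero exit status means the run could not complete.
    agentlab run "$scenario" --agent "$agent" || return 1
  done
}

# Usage (requires agentlab on PATH):
#   run_sequentially support-agent order-tracking-in-transit
```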
Next Docs
- scenario authoring: docs/scenarios.md
- golden suites: docs/golden-suites.md
- integrations and live services: docs/integrations-and-live-services.md
- memory and stateful agents: docs/memory-and-stateful-agents.md
- custom tools: docs/tools.md
- named agents and external-process protocol: docs/agents.md
- common failure modes: docs/troubleshooting.md
