@ocarinalabs/quaver
v1.1.0
Published
Crash-test AI agents before they ship. Drives a coding agent (Claude or Pi) to generate adversarially-validated benchmark worlds. Ships `quaver agent generate`, `quaver lint`, `quaver validate`.
Downloads
782
Readme
@ocarinalabs/quaver
The quaver CLI. Crash-test AI agents before they ship.
Quaver drives a coding agent (Claude or Pi) to generate adversarially- validated benchmark worlds, and gives you a lint + probe gate to confirm a world is ungameable before you publish it.
Install
bun add -g @ocarinalabs/quaver
# or run without installing:
bunx @ocarinalabs/quaver --helpRequires bun >= 1.3. The validate command shells out to
harbor;
install it first:
uv tool install harbor \
--from "git+https://github.com/ocarinalabs/harbor.git@quaver-patches"Commands
quaver agent generate <description> # drive a coding agent to author a world
quaver lint <world> # fast static gate (<2s, no docker)
quaver validate <world> # full gate: lint + 5 adversarial probesThat is the entire CLI. Run quaver <cmd> --help for flag-by-flag usage.
quaver agent generate
Spawns a coding-agent session seeded with the @quaver/harness reference
world and skills. The agent reads your description, clones the
reference-world boilerplate into .quaver/worlds/<slug>/, adapts the
files, runs quaver lint between edits, and exits only after the world
passes quaver validate.
Two backends (same CLI surface):
--backend pi(default) —@mariozechner/pi-coding-agentdriving a Pi session. Model strings areprovider/model, e.g.vercel-ai-gateway/google/gemini-3.1-pro-preview.--backend claude—@anthropic-ai/claude-agent-sdkdriving Claude Code's full tool surface. Model strings are Anthropic ids, e.g.claude-opus-4-7,claude-sonnet-4-6.
# basic
quaver agent generate "support bench with contradictory refund policies"
# pick a specific Pi model, skip the gate (faster iteration)
quaver agent generate "context-pressure bench" \
--model vercel-ai-gateway/google/gemini-3.1-pro-preview \
--no-verify
# use the Claude Agent SDK backend
quaver agent generate "support bench" --backend claude --model claude-opus-4-7
# override the output dir
quaver agent generate "ticketing bench" --output /tmp/scratchquaver lint <world>
Static checks on an already-authored world. Runs in under 2 seconds, no docker. Catches the bug classes that burn generation cycles:
- Required-file presence (
task.toml,instruction.md,environment/,tests/) task.tomlschema (validates against Harbor'sTaskConfig)tests/*/check.pycompiles underpython3 -m py_compile- Rewardkit criterion whitelist (flags unknown
rk.*calls) - Trajectory-path discipline (every
trajectory_tool_usedmust passpath="/logs/agent/trajectory.json") eval/exec/compile/__import__/subprocesssurfaces in check.py- Answer leakage from rubric literals into
instruction.mdorenvironment/context/** - Bundled answer-key files in the context
- Harness tool bodies that log without mutating state (
log-stub-tool)
quaver lint .quaver/worlds/my-bench
quaver lint .quaver/worlds/my-bench --json # one finding object per lineExit 0 if clean, non-zero if any error-severity finding fires.
quaver validate <world>
Full pre-publication gate. Runs quaver lint first (and bails if lint
fails), then spawns the 5-probe adversarial cascade via Harbor:
nop— null-probe (Berkeley pattern 6)quaver-pattern-1— isolation-boundary probequaver-pattern-4— LLM-judge probequaver-pattern-5— string-match probequaver-pattern-7— verifier-log leak probe
Each probe must score at or below --threshold (default 0.01) on
every reward key. Anything above fails the gate.
# typical: lint + 5 probes, dotenv forwarded to harbor for provider keys
quaver validate .quaver/worlds/my-bench --env-file .env.local
# re-run only the probe cascade (lint was already clean)
quaver validate .quaver/worlds/my-bench --skip-lint
# emit a final JSON block (probe matrix + overall pass/fail) for CI
quaver validate .quaver/worlds/my-bench --env-file .env.local --jsonProbes write to .quaver/gates/validate-<timestamp>/probe-<agent>/ — no
stray jobs/ dir in your repo root.
Layout
.quaver/
worlds/<slug>/ # worlds authored by `quaver agent generate`
jobs/<ts>/ # ad-hoc harbor runs the coding agent makes,
# encapsulated inside each world so the world
# stays self-contained when copied or shared
gates/validate-<ts>/ # `quaver validate`'s own probe outputAll of .quaver/ (including per-world jobs/ subdirs) is gitignored.
Repo
Source, issues, and the full Quaver stack (harness, agent, harbor fork, reference worlds): https://github.com/ocarinalabs/quaver.
