@ocarinalabs/quaver

v1.1.0

Published

12 days ago

Crash-test AI agents before they ship. Drives a coding agent (Claude or Pi) to generate adversarially-validated benchmark worlds. Ships `quaver agent generate`, `quaver lint`, `quaver validate`.


fawxyz

@ocarinalabs/quaver

The quaver CLI. Crash-test AI agents before they ship.

Quaver drives a coding agent (Claude or Pi) to generate adversarially-validated benchmark worlds, and gives you a lint + probe gate to confirm a world is ungameable before you publish it.

Install

bun add -g @ocarinalabs/quaver
# or run without installing:
bunx @ocarinalabs/quaver --help

Requires `bun` >= 1.3. The `validate` command shells out to `harbor`; install it first:

uv tool install harbor \
  --from "git+https://github.com/ocarinalabs/harbor.git@quaver-patches"

Commands

quaver agent generate <description>   # drive a coding agent to author a world
quaver lint <world>                   # fast static gate (<2s, no docker)
quaver validate <world>               # full gate: lint + 5 adversarial probes

That is the entire CLI. Run `quaver <cmd> --help` for flag-by-flag usage.

`quaver agent generate`

Spawns a coding-agent session seeded with the `@quaver/harness` reference world and skills. The agent reads your description, clones the reference-world boilerplate into `.quaver/worlds/<slug>/`, adapts the files, runs `quaver lint` between edits, and exits only after the world passes `quaver validate` (unless you skip the gate with `--no-verify`).

Two backends (same CLI surface):

- `--backend pi` (default): `@mariozechner/pi-coding-agent` driving a Pi session. Model strings are `provider/model`, e.g. `vercel-ai-gateway/google/gemini-3.1-pro-preview`.
- `--backend claude`: `@anthropic-ai/claude-agent-sdk` driving Claude Code's full tool surface. Model strings are Anthropic ids, e.g. `claude-opus-4-7`, `claude-sonnet-4-6`.

# basic
quaver agent generate "support bench with contradictory refund policies"

# pick a specific Pi model, skip the gate (faster iteration)
quaver agent generate "context-pressure bench" \
  --model vercel-ai-gateway/google/gemini-3.1-pro-preview \
  --no-verify

# use the Claude Agent SDK backend
quaver agent generate "support bench" --backend claude --model claude-opus-4-7

# override the output dir
quaver agent generate "ticketing bench" --output /tmp/scratch

`quaver lint <world>`

Static checks on an already-authored world. Runs in under 2 seconds, no docker. Catches the bug classes that burn generation cycles:

- Required-file presence (`task.toml`, `instruction.md`, `environment/`, `tests/`)
- `task.toml` schema (validates against Harbor's `TaskConfig`)
- `tests/*/check.py` compiles under `python3 -m py_compile`
- Rewardkit criterion whitelist (flags unknown `rk.*` calls)
- Trajectory-path discipline (every `trajectory_tool_used` must pass `path="/logs/agent/trajectory.json"`)
- `eval`/`exec`/`compile`/`__import__`/`subprocess` surfaces in `check.py`
- Answer leakage from rubric literals into `instruction.md` or `environment/context/**`
- Bundled answer-key files in the context
- Harness tool bodies that log without mutating state (`log-stub-tool`)
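To give a feel for what a check like the `eval`/`exec`/`compile`/`__import__`/`subprocess` scan involves, here is a minimal sketch using Python's standard `ast` module. It is not Quaver's implementation: the function name `find_risky_surfaces` and the two deny-lists are hypothetical, and the real rule may catch more (or different) patterns.

```python
import ast

# Hypothetical deny-lists for this sketch; Quaver's real rule set may differ.
RISKY_CALLS = {"eval", "exec", "compile", "__import__"}
RISKY_MODULES = {"subprocess"}


def find_risky_surfaces(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, name) pairs for risky calls and imports in source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Direct calls such as eval("...") or __import__("os").
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in RISKY_CALLS:
                findings.append((node.lineno, node.func.id))
        # `import subprocess` or `import subprocess.something`.
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in RISKY_MODULES:
                    findings.append((node.lineno, alias.name))
        # `from subprocess import run`.
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in RISKY_MODULES:
                findings.append((node.lineno, node.module))
    return sorted(findings)
```

Working on the syntax tree rather than on raw text avoids false positives from comments and string literals, which matters when the file under inspection is a verifier that may legitimately mention these names in docstrings.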

quaver lint .quaver/worlds/my-bench
quaver lint .quaver/worlds/my-bench --json   # one finding object per line

Exits 0 if clean, non-zero if any error-severity finding fires.
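If you want to post-process `--json` output yourself, a sketch along these lines may help. The only thing taken from the text above is that the output is one finding object per line; the `severity` and `rule` field names and the `"error"` value are assumptions made for illustration, so check the real output before relying on them.

```python
import json


def count_errors(jsonl_output: str) -> int:
    """Count findings whose (assumed) `severity` field equals "error"."""
    errors = 0
    for line in jsonl_output.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between findings
        finding = json.loads(line)
        # `severity` is a hypothetical field name, not confirmed by the docs.
        if finding.get("severity") == "error":
            errors += 1
    return errors
```

A wrapper script could treat a non-zero count the same way the CLI treats its own exit code, though for plain CI gating the exit code alone is usually enough.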

`quaver validate <world>`

Full pre-publication gate. Runs `quaver lint` first (and bails if lint fails), then spawns the 5-probe adversarial cascade via Harbor:

- `nop`: null-probe (Berkeley pattern 6)
- `quaver-pattern-1`: isolation-boundary probe
- `quaver-pattern-4`: LLM-judge probe
- `quaver-pattern-5`: string-match probe
- `quaver-pattern-7`: verifier-log leak probe

Each probe must score at or below `--threshold` (default 0.01) on every reward key. Anything above fails the gate.
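The pass rule above can be sketched in a few lines. This is an illustration of the comparison only, not Quaver's code: the nested `{probe: {reward_key: score}}` shape, the function names, and the `reward` key used in the usage example are all hypothetical.

```python
def failing_keys(
    probe_rewards: dict[str, dict[str, float]], threshold: float = 0.01
) -> list[tuple[str, str]]:
    """Return sorted (probe, reward_key) pairs that score above threshold."""
    return sorted(
        (probe, key)
        for probe, rewards in probe_rewards.items()
        for key, score in rewards.items()
        if score > threshold  # "at or below" passes, so only strict excess fails
    )


def gate_passes(
    probe_rewards: dict[str, dict[str, float]], threshold: float = 0.01
) -> bool:
    """True only if every reward key of every probe is at or below threshold."""
    return not failing_keys(probe_rewards, threshold)
```

For example, `gate_passes({"nop": {"reward": 0.0}, "quaver-pattern-5": {"reward": 0.5}})` is false because one probe earned a reward above the threshold, which means the world can be gamed by that attack pattern.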

# typical: lint + 5 probes, dotenv forwarded to harbor for provider keys
quaver validate .quaver/worlds/my-bench --env-file .env.local

# re-run only the probe cascade (lint was already clean)
quaver validate .quaver/worlds/my-bench --skip-lint

# emit a final JSON block (probe matrix + overall pass/fail) for CI
quaver validate .quaver/worlds/my-bench --env-file .env.local --json

Probes write to `.quaver/gates/validate-<timestamp>/probe-<agent>/`, so no stray `jobs/` dir lands in your repo root.

Layout

.quaver/
  worlds/<slug>/          # worlds authored by `quaver agent generate`
    jobs/<ts>/            # ad-hoc harbor runs the coding agent makes,
                          # encapsulated inside each world so the world
                          # stays self-contained when copied or shared
  gates/validate-<ts>/    # `quaver validate`'s own probe output

All of `.quaver/` (including per-world `jobs/` subdirs) is gitignored.
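Given that layout, a script can locate the probe output of the latest `quaver validate` run. The sketch below is hypothetical: the helper name `latest_probe_dirs` is made up, and because the timestamp format in `validate-<ts>` is not documented here, it picks the newest run by directory modification time rather than by parsing the name.

```python
from pathlib import Path


def latest_probe_dirs(root: Path) -> list[Path]:
    """Return the probe-* dirs of the newest validate-* run under root/.quaver/gates."""
    gates = root / ".quaver" / "gates"
    runs = [p for p in gates.glob("validate-*") if p.is_dir()]
    if not runs:
        return []  # no validate run yet, or .quaver/ is missing
    # Assumption: newest modification time identifies the latest run.
    latest = max(runs, key=lambda p: p.stat().st_mtime)
    return sorted(p for p in latest.glob("probe-*") if p.is_dir())
```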

Repo

Source, issues, and the full Quaver stack (harness, agent, harbor fork, reference worlds): https://github.com/ocarinalabs/quaver.
