soundcheck-cli
v2.3.0
Voice-agent test & tuning harness — Playwright + an LLM judge + a synthetic caller, for speech. Runs on a single Deepgram key.
🎙️ Soundcheck — the test & tuning harness for voice agents
The thing a unit test can't do: hear whether your voice agent's speech is actually correct. Soundcheck drives a synthetic caller through a real spoken conversation with your agent, records the call, transcribes it with its own STT — the oracle — and checks what the agent actually said, not what the model claimed in text. One Deepgram key, zero runtime dependencies.
Why
Web apps have Playwright; real-time voice agents have had nothing like it. A conventional test can't hear that your agent said "star star confirmed," spoke a 24-hour time, read a confirmation number as "four thousand four hundred seventeen," talked over the caller, or lost the reservation halfway through the call.
That blind spot is sharpest for a coding agent. It can read logs, run unit tests, and inspect API responses, but it cannot hear — so it converges on green text tests while the spoken agent is still wrong, with no signal to fix. Soundcheck is built to close that loop. It drives a synthetic caller through a real conversation, transcribes the call with its own STT oracle, and returns the failure as machine-readable evidence: which invariant broke, what the oracle actually heard, a trace-driven diagnosis with a fix hint, and a command to reproduce it (run --json). A coding agent reads that and patches the agent; the deterministic gates decide pass/fail with no model in their path; you review the result.
So Soundcheck is useful two ways:
- For a coding agent building voice autonomously. The build → test → ship loop that works for code finally works for speech: author scenarios from the agent's own tools → run over real audio → read the --json diagnosis → tune the prompt until a held-out set passes → grow the suite from what it discovers. Most voice-testing tools assume a human watching a dashboard; Soundcheck is built for the agent doing the work, with you as the reviewer.
- As a test harness for voice you drive yourself. Playwright, but for what your agent says: write a scenario, run it against your real agent over audio, gate it in CI, catch the spoken bug before you ship.
Getting Started
Prerequisites
- Node 22.6+: Soundcheck runs TypeScript natively, so there's no build step for development.
- A Deepgram API key: the only credential (free at deepgram.com). Live runs need it; offline --replay runs don't.
- ffmpeg (optional): only for --mp3 compact reports; everything else works without it.
1. Install
The published CLI — to run Soundcheck against your own agent from any directory:
npm install -g soundcheck-cli # puts the `soundcheck` command on your PATH (Node 22.6+)

Or install from source, to run the bundled examples (the offline demo below) or hack on Soundcheck itself:
git clone https://github.com/darrenapfel/Soundcheck.git && cd Soundcheck
npm install # devDeps only — zero runtime dependencies
npm link # puts the `soundcheck` command on your PATH

Set your key once. Soundcheck checks, in order: the DEEPGRAM_API_KEY env var → a ./.env in the current directory → ~/.config/soundcheck/.env (a user-global fallback, so soundcheck works from any directory):
mkdir -p ~/.config/soundcheck && printf 'DEEPGRAM_API_KEY=dg_...\n' > ~/.config/soundcheck/.env

On a source checkout without npm link? Prefix any command with npm run soundcheck -- …. Point --aut at your own agent's .ts config; it lives in your project, so Node strips its types normally. TypeScript consumers also want npm i -D @types/node.
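The key lookup order described above (env var, then ./.env, then ~/.config/soundcheck/.env) can be sketched as a small first-hit-wins resolver. This is an illustration only, not Soundcheck's actual code: `resolveDeepgramKey`, `parseEnvValue`, and the injected `ReadFile` reader are hypothetical names, and the choice to fall through to the next file when a .env exists but lacks the key is an assumption.

```typescript
// Hypothetical reader: returns file text, or undefined when the file is absent.
type ReadFile = (path: string) => string | undefined;

// Extract KEY=value from .env-style text, skipping blank lines and # comments.
function parseEnvValue(text: string, key: string): string | undefined {
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq === -1 || line.slice(0, eq).trim() !== key) continue;
    const value = line.slice(eq + 1).trim();
    if (value !== "") return value;
  }
  return undefined;
}

// First hit wins: env var, then ./.env, then ~/.config/soundcheck/.env.
// Assumption: a file that exists but lacks the key falls through to the next one.
function resolveDeepgramKey(
  env: Record<string, string | undefined>,
  cwd: string,
  home: string,
  readFile: ReadFile,
): string | undefined {
  const key = "DEEPGRAM_API_KEY";
  const fromEnv = env[key];
  if (fromEnv) return fromEnv;
  for (const path of [`${cwd}/.env`, `${home}/.config/soundcheck/.env`]) {
    const text = readFile(path);
    const value = text === undefined ? undefined : parseEnvValue(text, key);
    if (value !== undefined) return value;
  }
  return undefined;
}

// In-memory stand-in for the filesystem so the precedence is easy to see.
const fakeFiles: Record<string, string> = {
  "/work/app/.env": "# project key\nDEEPGRAM_API_KEY=project-key\n",
  "/home/dev/.config/soundcheck/.env": "DEEPGRAM_API_KEY=global-key\n",
};
const readFake: ReadFile = (path) => fakeFiles[path];

console.log(resolveDeepgramKey({ DEEPGRAM_API_KEY: "env-key" }, "/work/app", "/home/dev", readFake)); // env-key
console.log(resolveDeepgramKey({}, "/work/app", "/home/dev", readFake)); // project-key
console.log(resolveDeepgramKey({}, "/tmp", "/home/dev", readFake)); // global-key
```

In real use the reader would wrap a filesystem call; injecting it here keeps the sketch runnable without touching your home directory.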
2. Install the coding-agent skill
Teach your coding agent how to drive Soundcheck — when to use it, every command, the 10 gates, the scenario schema, and an end-to-end tutorial (SKILL.md):
soundcheck install-skill # Claude Code + any other agent (Codex/Gemini) already on your machine
soundcheck install-skill --all # force all three; --claude-only for just Claude; --link to symlink
# fresh clone, before `npm link`: npm run skill:install

The skill also auto-loads for anyone using Claude Code inside this repo (.claude/skills/soundcheck/).
3. Run the basics
Try it offline first — no key, no network, instant (it replays recorded calls and runs the gates). Run this from a source checkout, where the example agent and its recorded calls live:
soundcheck run scenarios --aut examples/tabletalk/grounded.ts --replay

Then the live loop against your own agent (needs your Deepgram key):
soundcheck author --spec ./my-agent.ts --out scenarios # 1. draft a scenario suite from the agent's tools + prompt
soundcheck run scenarios --aut ./my-agent.ts # 2. drive it live, gate it, write a report (exit ≠ 0 on any gate failure)
open runs/report-*.html # 3. hear the call + read what the oracle (STT) heard
soundcheck tune --agent ./my-agent.ts --fixer "claude -p" # 4. tune the prompt until the held-out set goes green

Driving Soundcheck from a coding agent or CI? Add --json to run: instead of the HTML report it emits the machine-readable contract on stdout, with per-scenario gate results, what the oracle heard, the trace-driven diagnosis (evidence + a fix hint), and a reproduce command. Use soundcheck run … --json | jq .summary, or --json <file> to write it alongside the human output.
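A coding agent consuming the --json output would typically flatten it into a list of failing gates to work on. The sketch below shows that step under heavy assumptions: the real schema is not reproduced in this README, so apart from the top-level `summary` key every field name here (`scenarios`, `gates`, `oracleHeard`, `fixHint`, `reproduce`) and the sample payload are hypothetical. Check docs/COMMANDS.md for the actual contract.

```typescript
// Hypothetical report shape. Only the top-level `summary` key is named in this README;
// every other field name below is an assumption made for illustration.
interface GateResult { gate: string; pass: boolean; evidence?: string; fixHint?: string }
interface ScenarioResult { id: string; oracleHeard: string; reproduce: string; gates: GateResult[] }
interface RunReport { summary: { passed: number; failed: number }; scenarios: ScenarioResult[] }
interface WorkItem { scenario: string; gate: string; evidence: string; fixHint: string; reproduce: string }

// Flatten a report into one work item per failing gate.
function failingGates(report: RunReport): WorkItem[] {
  const items: WorkItem[] = [];
  for (const s of report.scenarios) {
    for (const g of s.gates) {
      if (g.pass) continue;
      items.push({
        scenario: s.id,
        gate: g.gate,
        evidence: g.evidence ?? s.oracleHeard, // fall back to the raw oracle transcript
        fixHint: g.fixHint ?? "",
        reproduce: s.reproduce,
      });
    }
  }
  return items;
}

// Made-up sample payload standing in for what would arrive on stdin.
const sampleJson = JSON.stringify({
  summary: { passed: 0, failed: 1 },
  scenarios: [
    {
      id: "book-modify-confirm",
      oracleHeard: "your booking is star star confirmed",
      reproduce: "soundcheck run scenarios --aut ./my-agent.ts --only book-modify-confirm --json",
      gates: [
        { gate: "no_spoken_symbols", pass: false, evidence: "heard: star star confirmed", fixHint: "do not emit markdown in spoken replies" },
        { gate: "required_tool", pass: true },
      ],
    },
  ],
});

const work = failingGates(JSON.parse(sampleJson) as RunReport);
for (const w of work) console.log(`${w.scenario} / ${w.gate}: ${w.evidence}`);
```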
Full reference: docs/COMMANDS.md · the gates: docs/GATES.md · the end-to-end walkthrough: docs/TUTORIAL.md.
4. Self-improving tuning — with Codex
tune hands each failure's trace-driven evidence to a coding-agent fixer, which rewrites the system prompt, and keeps the edit only if a held-out set the fixer never saw improves (the Goodhart guard). examples/tune-demo/ ships two reference fixers: a deterministic rule-based one, and codex-fixer.sh — the Codex CLI (gpt-5.5), run read-only so it can't touch your files:
# needs a live DEEPGRAM_API_KEY and `codex login`
soundcheck tune \
--agent examples/tabletalk/bare.ts \
--train examples/self-improving-loop/scenarios/book-this-saturday-regression.json \
--heldout examples/self-improving-loop/heldout-book-sunday.json \
--fixer examples/tune-demo/codex-fixer.sh --max 1

In a real run, the bare agent (its prompt lacks a date anchor, so it hallucinates the year) went 0/1 → 1/1 on both the trained and the unseen held-out call. Codex rewrote the prompt with a general date resolver, and the edit was kept only because it generalized. Details: examples/tune-demo/README.md. Any stdin→stdout coding agent works as a --fixer (claude -p, a script, etc.).
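The held-out acceptance rule is the heart of the Goodhart guard, and it is small enough to sketch. The code below is a simplified illustration, not the tune implementation: `tuneWithHeldout`, the `Evaluate` and `Fixer` callbacks, and the stub prompts are all hypothetical, and a real run would evaluate by driving live calls rather than inspecting the prompt string.

```typescript
// Pass counts for one evaluation of a prompt.
interface Score { train: number; heldout: number }

// Hypothetical callbacks standing in for a live run and a fixer process.
type Evaluate = (prompt: string) => Score;
type Fixer = (prompt: string, trainEvidence: string) => string;

// The fixer only ever sees evidence from the training scenarios.
// An edit is kept only when the held-out score strictly improves.
function tuneWithHeldout(
  prompt: string,
  trainEvidence: string,
  evaluate: Evaluate,
  fixer: Fixer,
  maxRounds: number,
): { prompt: string; score: Score; accepted: number } {
  let best = prompt;
  let bestScore = evaluate(best);
  let accepted = 0;
  for (let round = 0; round < maxRounds; round++) {
    const candidate = fixer(best, trainEvidence);
    const score = evaluate(candidate);
    if (score.heldout > bestScore.heldout) {
      best = candidate;
      bestScore = score;
      accepted++;
    } // otherwise the candidate is discarded, even if the train score rose
  }
  return { prompt: best, score: bestScore, accepted };
}

// Stub evaluator: a date anchor fixes both sets; a memorized answer fixes only train.
const evaluateStub: Evaluate = (p) => {
  const anchored = p.includes("Today is 2026-05-29");
  const memorized = p.includes("Saturday is 2026-05-30");
  return { train: anchored || memorized ? 1 : 0, heldout: anchored ? 1 : 0 };
};
const overfitFixer: Fixer = (p) => `${p} Saturday is 2026-05-30.`;
const generalFixer: Fixer = (p) => `${p} Today is 2026-05-29; resolve relative dates from it.`;

const basePrompt = "You are a restaurant booking agent.";
const evidence = "grounding failed: booked date has the wrong year";
const overfit = tuneWithHeldout(basePrompt, evidence, evaluateStub, overfitFixer, 1);
const general = tuneWithHeldout(basePrompt, evidence, evaluateStub, generalFixer, 1);
console.log(overfit.accepted, overfit.score); // 0 { train: 0, heldout: 0 }
console.log(general.accepted, general.score); // 1 { train: 1, heldout: 1 }
```

The overfit edit would have turned the training scenario green, which is exactly why the decision is made on the set the fixer never saw.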
▶ Prefer to listen? The sample gallery has real recorded calls — each domain's agent handled by a polite, an impatient, and a hostile caller (all pass), plus Soundcheck catching two deliberately-broken agents. Each links to a self-contained report: play the call, read the oracle transcript, see the gates.
No agent of your own yet? Five bundled example domains show the same gates working everywhere:
| Domain | Folder | What it exercises | Runs |
|---|---|---|---|
| Restaurant booking | examples/tabletalk/ | spoken symbols, ISO/grounded dates, read-back — bare/hardened/grounded | ✅ offline replay — grounded full; bare/hardened via --only book-modify-confirm |
| IT support | examples/support/ | verify-before-reset, never-delete — bare/grounded/insecure | ✅ offline replay (cassettes) |
| Healthcare clinic | examples/healthcare/ | verify-before-PHI, never-prescribe, grounded dates | live (goal-driven) |
| Bank card services | examples/banking/ | verify-before-any-action, never-wire, clean spoken money | live (goal-driven) |
| Airline rebooking | examples/travel/ | lookup-before-rebook, "tomorrow" grounded, integer bag counts | live (goal-driven) |
Offline, no key — replay the recorded ladders: watch the gates pass on the clean agent and catch the planted bugs on the broken one (each command below works as written):
soundcheck run scenarios --aut examples/tabletalk/grounded.ts --replay # ✅ all pass
soundcheck run scenarios --aut examples/tabletalk/bare.ts --replay --only book-modify-confirm # 🚩 catches STAR STAR + ungrounded date
soundcheck run examples/support/scenarios --aut examples/support/grounded.ts --replay # ✅ (skips the goal-driven demo)
soundcheck run examples/support/scenarios --aut examples/support/insecure.ts --replay --only frustrated-reset # 🚩 catches reset-before-verify + forbidden delete

The healthcare, banking, and travel suites (and support's adversarial-discovery) are goal-driven and live-only: an LLM improvises the caller, so they can't be replayed from a cassette. Run them live with your key, e.g. soundcheck run examples/healthcare/scenarios --aut examples/healthcare/grounded.ts. (--replay skips them and says so; a replay that would run nothing fails closed.)
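Two ideas from the replay commands above are worth seeing in miniature: a stochastic live call becomes deterministic once it is frozen into a cassette, and a replay with nothing to run should fail rather than pass vacuously. The sketch below is a toy model under stated assumptions: the `Cassette` shape, `record`, `replay`, and `replaySuite` are hypothetical and do not reflect Soundcheck's real cassette format.

```typescript
// Hypothetical cassette: a versioned, serializable record of one call.
interface Cassette { version: number; scenario: string; oracleTranscript: string[] }
type TranscriptGate = (transcript: string[]) => boolean;

// Stand-in for a live call: the wording varies from run to run.
function liveCall(): string[] {
  const variants = [
    "Your table is booked for May thirtieth.",
    "You're all set for Saturday, May thirtieth.",
  ];
  return [variants[Math.floor(Math.random() * variants.length)]];
}

// Record once: freeze whatever the live call produced into JSON.
function record(scenario: string): string {
  const cassette: Cassette = { version: 1, scenario, oracleTranscript: liveCall() };
  return JSON.stringify(cassette);
}

// Replay: no agent, no network, just the gate over the frozen transcript.
function replay(cassetteJson: string, gate: TranscriptGate): boolean {
  const cassette = JSON.parse(cassetteJson) as Cassette;
  if (cassette.version !== 1) throw new Error(`unsupported cassette version ${cassette.version}`);
  return gate(cassette.oracleTranscript);
}

// Fail closed: an empty suite is a failure, not a vacuous pass.
function replaySuite(cassettes: string[], gate: TranscriptGate): { ok: boolean; ran: number } {
  if (cassettes.length === 0) return { ok: false, ran: 0 };
  let ok = true;
  for (const c of cassettes) {
    if (!replay(c, gate)) ok = false;
  }
  return { ok, ran: cassettes.length };
}

// Naive example gate: no turn may contain the spoken word "star".
const noSpokenStar: TranscriptGate = (t) => t.every((turn) => !/\bstar\b/i.test(turn));

const demoCassette = record("book-modify-confirm");
const firstVerdict = replay(demoCassette, noSpokenStar);
const secondVerdict = replay(demoCassette, noSpokenStar);
console.log(firstVerdict === secondVerdict); // same cassette, same verdict
console.log(replaySuite([], noSpokenStar)); // { ok: false, ran: 0 }
```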
The loop: Scenario → Trace → Assess → Refine
Soundcheck is organized as a closed loop. Each step has a command; the deterministic checks gate a merge with no model in their path, so the loop runs unattended between your reviews.
| Step | What it is | Command |
|---|---|---|
| Scenario | a portable fixture — a caller goal/persona + declarative checks — drafted from your agent's own tools; runs unchanged across versions and implementations | author · scenarios/*.json |
| Trace | the flight recorder — the real call (caller + agent mixed at true timing) + the oracle STT + per-turn text + tool calls + timings, saved as a replayable cassette | run (records) / --replay (offline) |
| Assess | deterministic gates (hard pass/fail) + an advisory LLM judge (the fuzzy stuff) + latency | run · bakeoff |
| Refine | a coding agent reads a trace-driven diagnosis, patches the agent, re-runs — keeping an edit only if a held-out set improves (Goodhart guard) | tune --fixer |
| Test the tests | the judge is calibrated against a Golden Set and given a trust verdict before anything leans on it; a drift guard fails CI if it regresses | calibrate |
| Grow the suite | a failure a red-team caller discovers is frozen into a permanent, replayable regression — the suite grows itself | run --promote-failures |
What it checks — declarative, domain-agnostic gates
You declare a scenario's invariants; the registry enforces them deterministically against what was actually spoken:
"assert": [
"no_spoken_symbols", // never speak markup/symbols aloud
"no_spoken_cardinal_ids", // say IDs/SSN/ZIP digit-by-digit, not "four thousand…"
{ "tool_args_match_schema": "bookAppointment" }, // type/required/format/enum/pattern
{ "tool_sequence": ["verifyIdentity", "before", "accessRecord"] }, // ordering invariants
{ "spoken_matches_tool": { "field": "date", "tool": "bookAppointment" } }, // say what you did
{ "spoken_consistent_with_tool": { "field": "date", "tool": "bookAppointment", "now": "2026-05-29" } }, // don't verbally cave to a date you never booked
{ "required_tool": "scheduleCallback" },
{ "forbidden_tool": "chargeCard" },
{ "grounding": { "tool": "bookAppointment", "field": "date", "now": "2026-05-29", "expected": "2026-05-30" } }, // resolve relative dates
{ "latency": { "ttfb_ms": { "max": 2000 } } }
]

The same registry tests a restaurant booker, an IT-support bot, a healthcare scheduler, or a finance IVR: any speech-to-speech (STS) agent. Adding a gate is a function plus one registry entry.
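To make "a function plus one registry entry" concrete, here is a minimal sketch of how such a registry could be shaped. It is not Soundcheck's source: the `Trace`, `Verdict`, and `Gate` types, the `runGates` helper, and the gate bodies are assumptions for illustration. The gate names and assertion shapes are taken from the example above, but the matching logic (for instance the naive word list for spoken symbols) is deliberately simplistic.

```typescript
// Hypothetical trace: what the oracle heard per agent turn, plus tool calls in order.
interface ToolCall { name: string; args: Record<string, unknown> }
interface Trace { agentSpeech: string[]; toolCalls: ToolCall[] }
interface Verdict { pass: boolean; evidence: string }
type Gate = (trace: Trace, param: unknown) => Verdict;
type Assertion = string | Record<string, unknown>;

const passed: Verdict = { pass: true, evidence: "" };

// One entry per gate. Adding a gate means adding one function here.
const registry: Record<string, Gate> = {
  // Naive check: flags symbol names that should never be read aloud.
  no_spoken_symbols: (trace) => {
    const hit = trace.agentSpeech.find((turn) => /\b(star|asterisk|hashtag|backtick|underscore)\b/i.test(turn));
    return hit === undefined ? passed : { pass: false, evidence: `oracle heard: "${hit}"` };
  },
  forbidden_tool: (trace, param) => {
    const called = trace.toolCalls.some((c) => c.name === param);
    return called ? { pass: false, evidence: `${String(param)} was called` } : passed;
  },
  // Param shape: [first, "before", later]. Assumption: never calling `later` passes.
  tool_sequence: (trace, param) => {
    const [first, , later] = param as [string, string, string];
    const names = trace.toolCalls.map((c) => c.name);
    const firstAt = names.indexOf(first);
    const laterAt = names.indexOf(later);
    const violated = laterAt !== -1 && (firstAt === -1 || firstAt > laterAt);
    return violated ? { pass: false, evidence: `${later} ran before ${first}` } : passed;
  },
};

// Evaluate each declared assertion; an unknown gate name fails closed.
function runGates(trace: Trace, asserts: Assertion[]): Record<string, Verdict> {
  const results: Record<string, Verdict> = {};
  for (const a of asserts) {
    const name = typeof a === "string" ? a : (Object.keys(a)[0] ?? "");
    const param = typeof a === "string" ? undefined : a[name];
    const gate: Gate | undefined = registry[name];
    results[name] = gate ? gate(trace, param) : { pass: false, evidence: "unknown gate" };
  }
  return results;
}

const brokenTrace: Trace = {
  agentSpeech: ["Your booking is star star confirmed star star."],
  toolCalls: [{ name: "accessRecord", args: {} }, { name: "verifyIdentity", args: {} }],
};
const gateResults = runGates(brokenTrace, [
  "no_spoken_symbols",
  { tool_sequence: ["verifyIdentity", "before", "accessRecord"] },
  { forbidden_tool: "chargeCard" },
]);
console.log(gateResults);
```

Because each gate is a pure function of the trace, the same verdict comes out whether the trace was just recorded or replayed from a cassette.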
Capabilities
Scenario — Declarative, portable fixtures via the AUTAdapter abstraction. Autonomous, domain-agnostic authoring: author --spec generates one scenario per tool from any agent's tools + prompt (destructive tools skipped, identity-gated tools get a proactive caller).
Trace — Real-time mixed recording of the whole call, played back in the report. Oracle self-validation: Soundcheck runs its own STT over the recording and shows "what it heard" on every live run (this is what caught our own bugs). A first-class, versioned Trace persisted as a record/replay cassette so gates and judges run offline, without re-running the agent.
Assess — A composable gate registry (above). An advisory LLM judge (rubric + findings, panel aggregation) for the fuzzy dimensions — it informs, it never hard-gates. Judge alignment: calibrate scores the judge against a no-human Golden Set, reports a trust verdict, corroborates with a stronger reference model, and pins a drift guard in CI.
Refine — A trace-driven tuning loop: tune produces a root-cause diagnosis per failing gate, feeds it to a pluggable --fixer, and keeps an edit only if a held-out set improves. Demonstrated generalizing a date-grounding fix to an unseen relative date.
Voice-native — Faithful turn-taking (endpoints on real end-of-speech). Barge-in (the caller cuts in; the agent's interruption is captured faithfully). Reactive caller (Evaline) that improvises toward a goal. Adversarial discovery: an adversarial Evaline red-teams the agent, improvising attacks that surface failures nobody scripted (it drove a deliberately-insecure agent into reset-before-verify + account-deletion, oracle-confirmed). A/B & vendor bake-off: bakeoff runs one suite against two configs and diffs the per-gate results (plus, with --judge, the advisory judge).
Self-improving — A discovered failure is promoted into a permanent regression (run --promote-failures) and tune refines the agent against the grown suite. End-to-end in examples/self-improving-loop/; the closure is pinned offline in test/regress.test.ts.
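The date-grounding check that recurs in these capabilities comes down to resolving a relative phrase against a fixed `now` and comparing it with the date the agent actually booked. A minimal sketch follows, using the `now` of 2026-05-29 (a Friday) from the gate example. The helper names and the rule that a weekday phrase means the nearest occurrence on or after `now` are assumptions, not Soundcheck's documented behavior.

```typescript
const WEEKDAYS = ["sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday"];

// Add whole days to a YYYY-MM-DD date, working in UTC to avoid timezone drift.
function addDays(isoDate: string, days: number): string {
  const d = new Date(`${isoDate}T00:00:00Z`);
  d.setUTCDate(d.getUTCDate() + days);
  return d.toISOString().slice(0, 10);
}

// Resolve "today", "tomorrow", or a weekday name (optionally prefixed with "this").
// Assumption: a weekday resolves to its nearest occurrence on or after `nowIso`.
function resolveRelativeDate(phrase: string, nowIso: string): string | undefined {
  const p = phrase.trim().toLowerCase();
  if (p === "today") return nowIso;
  if (p === "tomorrow") return addDays(nowIso, 1);
  const target = WEEKDAYS.indexOf(p.replace(/^this /, ""));
  if (target === -1) return undefined;
  const today = new Date(`${nowIso}T00:00:00Z`).getUTCDay();
  return addDays(nowIso, (target - today + 7) % 7);
}

// Grounding check: the date passed to the tool must equal the resolved date.
function checkGrounding(
  bookedDate: string,
  phrase: string,
  nowIso: string,
): { pass: boolean; expected: string | undefined } {
  const expected = resolveRelativeDate(phrase, nowIso);
  return { pass: expected !== undefined && bookedDate === expected, expected };
}

const now = "2026-05-29";
console.log(checkGrounding("2026-05-30", "this Saturday", now)); // { pass: true, expected: '2026-05-30' }
console.log(checkGrounding("2024-05-30", "this Saturday", now)); // { pass: false, expected: '2026-05-30' }
console.log(resolveRelativeDate("Sunday", now)); // 2026-05-31
```

The second call models the failure described for the bare agent: the right month and day, the wrong year.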
Autonomous — minimal human in the loop
Built to be driven by a coding agent and supervised, not babysat:
- Deterministic gates own the merge-gating verdicts — no model in that path.
- The LLM judge is advisory and calibrated — it informs; its trust is measured and reported before anything relies on it.
- Determinism via record/replay — a stochastic live call becomes a reproducible CI fixture.
- The oracle is the arbiter — every claim about a run is checkable against Soundcheck's own STT of the real audio, so neither the agent nor a reviewer takes a pass/fail on faith. You review the result, not every call.
Soundcheck verifies Soundcheck
The harness proves itself with its own tools — the only honest way to ask others to trust it:
- The oracle transcribes every live recording; the report shows it.
- End-to-end self-test (test/self-test.test.ts): the generic gates catch deliberately-regressed builds (a buggy mock + insecure/bare example agents) and pass correct ones, with a coverage contract that every core safety gate family is shown catching a real regression. Offline, in CI.
- Self-evaluation: Evaline-as-AUT, with a deliberately-broken-Evaline fixture that must fail.
- Judge calibration against a labeled corpus, with a drift guard.
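Judge calibration with a drift guard, as listed above, can be pictured as two small computations: an agreement rate against the labeled cases, and a comparison with a pinned baseline. The sketch below is illustrative only. The `GoldenCase` shape, the function names, and the 0.05 tolerance are assumptions; the real metric and thresholds are described in docs/CALIBRATION.md.

```typescript
type Label = "pass" | "fail";
interface GoldenCase { id: string; expected: Label }

// Fraction of golden cases where the judge's label matches the expected label.
function agreement(golden: GoldenCase[], judged: Record<string, Label>): number {
  if (golden.length === 0) throw new Error("empty golden set");
  let hits = 0;
  for (const c of golden) {
    if (judged[c.id] === c.expected) hits++;
  }
  return hits / golden.length;
}

// Hypothetical drift guard: fail when agreement drops more than `tolerance`
// below the pinned baseline. The default tolerance is an arbitrary example value.
function driftGuard(current: number, baseline: number, tolerance = 0.05): { ok: boolean; message: string } {
  const ok = current >= baseline - tolerance;
  const message = ok
    ? `agreement ${current.toFixed(2)} is within ${tolerance} of baseline ${baseline.toFixed(2)}`
    : `agreement ${current.toFixed(2)} regressed from baseline ${baseline.toFixed(2)}`;
  return { ok, message };
}

const goldenSet: GoldenCase[] = [
  { id: "polite-booking", expected: "pass" },
  { id: "talks-over-caller", expected: "fail" },
  { id: "reads-back-date", expected: "pass" },
  { id: "loses-reservation", expected: "fail" },
];
// Made-up judge output: it misses one failure.
const judgeLabels: Record<string, Label> = {
  "polite-booking": "pass",
  "talks-over-caller": "fail",
  "reads-back-date": "pass",
  "loses-reservation": "pass",
};

const rate = agreement(goldenSet, judgeLabels);
console.log(rate); // 0.75
console.log(driftGuard(rate, 1.0).message); // agreement 0.75 regressed from baseline 1.00
```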
One key, zero deps
Default + CI operation needs only DEEPGRAM_API_KEY — caller brain (the Voice Agent's think LLM), voice (TTS), transcription/oracle (STT), and the judge all run on Deepgram. No OpenAI/Anthropic key. (The optional openai-realtime reference adapter reads OPENAI_API_KEY only if a developer wires it; CI never touches it.) Zero runtime dependencies (Node 22 native TypeScript). MIT licensed. See docs/ARCHITECTURE.md.
Use it in your repo's CI
Soundcheck ships a reusable composite GitHub Action (action.yml) — zero runtime deps, so it needs nothing but Node 22. Replay recorded cassettes as an offline merge gate:
# .github/workflows/voice.yml in your agent's repo
jobs:
  soundcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4 # your scenarios/ + fixtures/cassettes/ + agent.ts
      - uses: darrenapfel/Soundcheck@v2 # pin the released major tag (v2 -> latest v2.x; or pin @v2.0.0)
        with:
          aut: agent.ts # your agent-under-test config (required)
          scenarios: scenarios # dir of scenario .json files
          cassette-dir: fixtures/cassettes # recorded cassettes (the --replay default)
          # `args` defaults to --replay (offline, deterministic, no key). For a LIVE run instead:
          # args: ""
          # deepgram-api-key: ${{ secrets.DEEPGRAM_API_KEY }}

Capability status
Everything below is shipped and oracle/test-verified.
| Capability | Status |
|---|---|
| Real-time recorder + oracle self-validation | ✅ Shipped |
| Faithful turn-taking + barge-in | ✅ Shipped |
| Reactive goal-driven caller (Evaline) + adversarial red-teamer | ✅ Shipped |
| Declarative, domain-agnostic gate registry | ✅ Shipped |
| Record/replay cassettes (first-class versioned Trace) | ✅ Shipped |
| Advisory LLM judge + panel | ✅ Shipped |
| Judge alignment loop (trust + cross-model + drift guard) | ✅ Shipped |
| Autonomous, domain-agnostic authoring | ✅ Shipped |
| Trace-driven Refine (red-green tuning) | ✅ Shipped |
| Self-improving loop: discover → promote → refine (run --promote-failures) | ✅ Shipped (examples/self-improving-loop/) |
| A/B & vendor bake-off | ✅ Shipped |
| End-to-end Soundcheck-tests-Soundcheck CI proof | ✅ Shipped |
| Five example domains (restaurant, IT-support, healthcare, banking, travel) | ✅ Shipped |
| Regression from production traffic (vs. a synthetic caller) | 🚧 Future |
| Online / production monitoring | 🚧 Future |
| Standalone STT / TTS validators | 🚧 Out of scope by design |
Soundcheck is a pre-ship harness today: sourcing discovered failures from real production traffic and online monitoring of live calls are separate surfaces, deliberately deferred. Standalone STT/TTS validators are out of scope by design — evaluate them after STS is great, not bundled in.
Docs
- 🧭 docs/GUIDE.md: start here (how it works + the workflow, end to end)
- 🛠️ docs/COMMANDS.md: every command and flag (run, author, tune, bakeoff, calibrate, validate, install-skill)
- 🚦 docs/GATES.md: the 10 gates, what each asserts and when to declare it
- 🎓 docs/TUTORIAL.md: zero to a well-tuned agent, step by step
- 📖 docs/ABOUT.md: what Soundcheck is, both uses, in one page
- 📐 docs/ARCHITECTURE.md: system design
- 🧪 docs/TESTING.md: how we earn trust (record/replay, self-evaluation, calibration)
- ⚖️ docs/CALIBRATION.md: live judge agreement numbers
- ⚠️ docs/LIMITATIONS.md: honest limits
- 🤝 CONTRIBUTING.md: add an adapter / scenario / gate
MIT licensed. Built for the agents-test-agents era — for voice.
