@dogfood-lab/dogfood-swarm

v1.3.2

Published

3 days ago

10-phase parallel-agent protocol runner for testing-os. SQLite-backed control plane, durable receipts, domain-aware orchestration. Three R's recovery contract: revalidate / rewind / redrive.

0High
0Medium
0Low

mikefrilot

testing-os dogfood-lab swarm agent-protocol audit recovery rewind redrive revalidate

@dogfood-lab/dogfood-swarm

10-phase parallel-agent protocol runner for testing-os. SQLite-backed control plane, durable receipts, domain-aware orchestration. Three R's recovery contract: revalidate / rewind / redrive.

Part of the testing-os monorepo — the operating system for testing in the AI era.

The swarm CLI runs parallel-agent audits against a codebase. Each wave dispatches multiple agents under exclusive file ownership, collects their outputs through the verifier, and persists durable receipts to a SQLite control plane. The wave-level and agent-level state machines surface every status transition as an auditable event; the recovery verbs (revalidate, rewind, redrive) handle wave failures lawfully without raw SQL surgery.

Install

npm install -g @dogfood-lab/dogfood-swarm

Binary: swarm. Requires Node ≥ 22.

Quick start

# Initialize a swarm run — detects domains, records a save-point
swarm init <repo-path>

# Review the detected domain draft, then freeze it (dispatch refuses
# to run until the domain map is frozen)
swarm domains <run-id> --freeze

# Dispatch a wave for a named phase (NOT a wave number — phase names below)
swarm dispatch <run-id> <phase>

# (Agents execute externally — e.g., parallel Claude sessions — and write
#  their outputs to swarms/<run-id>/wave-N/<domain>/output.json)

# Collect outputs through the verifier — one --domain per dispatched agent
swarm collect <run-id> \
  --domain=backend:swarms/<run-id>/wave-N/backend/output.json \
  --domain=tests:swarms/<run-id>/wave-N/tests/output.json

# Inspect current wave + agent state
swarm status <run-id>

# Inspect wave transition history (full audit chain)
swarm history <wave-id>

# Generate per-wave receipt artifact
swarm receipt <run-id>

# Advance to the next phase once gates pass
swarm advance <run-id>

<phase> is a named phase, not a wave number. The valid values are: health-audit-a, health-audit-b, health-audit-c, stage-d-audit, feature-audit (audit phases) and health-amend-a, health-amend-b, health-amend-c, stage-d-amend, feature-execute (amend phases). Run swarm dispatch --help for the same list.

Recovery — the Three R's

| Verb | When to use | Behavior | |---|---|---| | swarm revalidate | Agents wrote invalid_output — schema mismatch, validator rejection | Repairs in place; transitions agent_runs out of BLOCKED status via override (with operator --reason); wave-level rollback if all 4 agents repaired | | swarm rewind | Wave needs full restart from a save-point; tree state needs reset | Restores tree via git reset --hard <tag>; lawfully aborts orphaned in-flight runs to terminal aborted_for_rewind; preserves audit chain (append-only) | | swarm redrive | Some agents failed, others completed; want to resume the failing tail without re-running completed work | Same wave_id, completed receipts preserved byte-identical, only failed/pending agents made re-dispatchable |

All three recovery verbs share the same operator-safety contract:

Dry-run by default — --apply required to mutate
--reason "<text>" required, non-empty — recorded in wave_state_events / agent_state_events with a verb-specific prefix (revalidate: / rewind: / redrive:)
Zero raw SQL on agent_runs.status or waves.status — every state mutation routes through transitionAgent / transitionWave; static-scan guard test (Pattern #10) blocks regressions

Example session:

# Failed wave needs schema-mismatch repair (re-supply the agent's output path)
swarm revalidate <run-id> --reason "wave-2 schema mismatch corrected" \
  --domain=backend:swarms/<run-id>/wave-2/backend/output.json --apply

# Wedged wave — restart from save-point tag
swarm rewind <save-point-tag> --reason "rolling back wedged amend wave" --apply

# Transient infra failure — resume only the failed agents
swarm redrive <wave-id> --reason "GitHub API outage retry" --apply

# Audit the full transition chain for any wave
swarm history <wave-id>

Exit codes

The verbs designed to gate CI propagate a machine-readable exit code, not just human-readable stdout. Wire these into a workflow step or a && chain and the gate fails closed:

| Verb | Exit code contract | |---|---| | swarm verify | 0 only when the verdict is pass; 1 for every other verdict (fail, skip, no_tests, tool_missing). Each non-pass verdict is "not a verified pass" — see Verify verdicts — so the machine signal matches the human one and a CI && chain fails closed (a no_tests or tool_missing never reads as success). | | swarm verify-fixed | 0 clean / 1 threshold exceeded (regressed + claimed-but-still-present > --threshold, default 0) / 2 audit pipeline broken | | swarm verify-recurring | 0 / 1 / 2 (same 3-way contract as verify-fixed) | | swarm verify-unverified | 0 / 1 / 2 (same 3-way contract) | | swarm verify-approved | 0 / 1 / 2 — exit 2 (broken finding anchor) is the pre-amend gate that blocks subsequent swarm dispatch of an amend phase | | swarm findings | 0 clean / 1 findings present / 2 audit pipeline broken | | swarm persist --ingest | 0 when the dogfood ingest succeeded (or was a --dry-run); 1 when the ingest failed. A bare swarm persist with no --ingest exits 0. |

Any command also exits 1 on a structured operator error (the typed code / message / Next: envelope). Exit 2 is reserved for the "pipeline broken" case on the verbs above so a CI gate can tell findings/regressions exist (1) apart from the audit itself could not run (2).

Troubleshooting — when a wave fails

Every command emits its stage transitions as NDJSON on stderr, so the first move in an incident is to capture that forensic stream and read it back:

swarm collect <run-id> \
  --domain=backend:swarms/<run-id>/wave-N/backend/output.json 2>collect.ndjson
grep '"stage"' collect.ndjson   # the ordered chain of what happened, with codes

Then map the symptom to the recovery verb:

| Symptom | What it means | Recovery | |---|---|---| | collect failed mid-upsert (COLLECT_UPSERT_FAILED) | One agent's output failed validation or the merge transaction aborted; the wave is failed. | swarm revalidate <run-id> --reason "..." --domain=name:path --apply — re-runs the same validators on the re-supplied output, and on pass flips the wave back to collected in one transaction. | | Wave stuck in dispatched — never reached collected | Agents didn't all finish, or the run was interrupted before collect. | swarm resume <run-id> to re-dispatch the incomplete agents; or swarm redrive <wave-id> --reason "..." --apply to resume only the failed/unstarted tail while preserving completed receipts byte-identical. | | Agents BLOCKED (invalid_output / ownership_violation) | Schema mismatch, or an agent wrote outside its frozen domain. | invalid_output → swarm revalidate. ownership_violation → extend the domain via swarm domains --unfreeze … --edit … --freeze, then swarm revalidate. | | Wave wedged — tree state needs a full reset | The working tree drifted and the wave must restart from a save-point. | swarm rewind <save-point-tag> --reason "..." --apply — git reset --hard <tag> plus lawful abort of orphaned in-flight runs, audit chain preserved. |

All recovery verbs are dry-run by default — run them without --apply first to preview the transitions, then add --apply. Every error carries a typed code and a Next: hint; the full table is in the handbook.

📖 Deeper incident docs: Recovery · Error codes

State machines

Two parallel state machines:

Agent runs (lib/state-machine.js): pending → dispatched → running → complete | failed | timed_out | invalid_output | ownership_violation | aborted_for_rewind
Waves (lib/wave-state-machine.js): dispatched → collected → verified → advanced | failed | aborted_for_rewind

Discipline:

Terminal statuses (complete, advanced, aborted_for_rewind) cannot be transitioned out of — not even with override=true.
BLOCKED statuses (failed, invalid_output, ownership_violation) require explicit override=true + non-empty reason to transition out.
Every transition lands in wave_state_events / agent_state_events atomically with the underlying status mutation, inside the same SQLite transaction.

Verify verdicts

swarm verify <run-id> runs the build-verification adapter and prints Verification: <VERDICT>. Only pass advances the wave to verified — the other four verdicts are deliberately distinct so a no-op never masquerades as a clean pass:

| Verdict | Means | Advances the wave? | |---|---|---| | pass | Every required step ran and passed. | Yes | | fail | A required step ran and failed — the code is broken. | No | | skip | No required steps ran (every step was optional or filtered away). Nothing was verified. | No | | no_tests | The repo has no test script; npm test --if-present ran zero tests. Not a verified pass — supply a real test command via a step override or pick an explicit --adapter. | No | | tool_missing | A required tool (e.g. npm, npx) is absent from PATH, so verification could not run in this environment. Not a failure of the code under test — install the tool or run on a host that has it. | No |

no_tests and tool_missing exist precisely so the wave gate stays honest: it refuses to advance without positive evidence, but it does not falsely report FAIL when the cause is a missing test script or a missing build tool rather than a real regression.

Control plane

SQLite-backed. Each swarm run gets swarms/<run-id>/control-plane.db:

| Table | Purpose | |---|---| | waves | Wave records (status, phase, wave_number, run_id, snapshot, serial_verify_required) | | agent_runs | Per-agent dispatch records (status, domain, output_path, verification_skipped) | | wave_state_events | Append-only wave-status audit log (from_status, to_status, reason, created_at) | | agent_state_events | Append-only agent-status audit log (mirror shape of wave_state_events) | | findings | Findings derived from agent outputs | | domain_events | Domain-map mutation audit log (unfreeze / edit / freeze) |

Read via swarm status, swarm history, swarm receipt. Never via raw SQL in scripts — the state-machine helpers are the supported interface and the audit chain depends on going through them.

Environment variables

Three environment variables are part of the scriptable surface — they are honored on every invocation:

| Variable | Accepted values | Effect | |---|---|---| | SWARM_DB | a filesystem path | Overrides the control-plane DB path. Unset → the default swarms/<run-id>/control-plane.db. Point this at a non-default DB to run against an alternate control plane. | | DOGFOOD_FINDINGS_FORMAT | raw | human | json | Forces the swarm findings output format, overriding both the --format flag and TTY auto-detection. raw → markdown, human → text, json → JSON. | | DOGFOOD_LOG_HUMAN | 0 | 1 | Controls the human-readable companion banner printed alongside the NDJSON stage stream on stderr. 0 → never emit the banner (deterministic machine-readable stderr for CI), 1 → always emit it. Unset → emit only when stderr is a TTY. |

Stage transitions are emitted as NDJSON on stderr — one JSON object per line, greppable — while stdout carries the command's parse target. Set DOGFOOD_LOG_HUMAN=0 when you want a clean, machine-parseable stderr stream (e.g. swarm collect ... 2>collect.ndjson).

10-phase protocol

| Phase | Purpose | |---|---| | 1–4 (Health Pass) | Audit → Review → Amend → Repeat. Three stages: A (bug/security fix), B (proactive health), C (humanization), D (visual/presentation truth). Closes at 0 CRIT / 0 HIGH. | | 5–8 (Feature Pass) | Feature audit → user review → execution → repeat. Production-readiness focus. | | 9 | Final test pass — comprehensive validation across the whole system. | | 10 | Full Treatment — shipcheck, README finalize + translations, landing page, handbook, repo-knowledge DB entry, deploy + verify. |

Each wave produces a manifest (swarms/<run-id>/manifest.json) and per-wave receipts (swarms/<run-id>/wave-N/receipt.md) for durable audit. A swarm is not complete until Phase 10 finishes.

Domain ownership

Agents in a wave have exclusive file ownership scoped to their domain (typical domains: backend, bridge, tests, ci-tooling, frontend, docs). The frozen domain map at dispatch time is the canonical authority; the agent prompt is derived from the frozen state so dispatch + agent + verifier all consume the same shape.

Cross-domain mutation surfaces at collect time as status ownership_violation (BLOCKED). Recovery options:

swarm domains --unfreeze --reason "..." → --edit <domain> --globs "..." → --freeze — to legitimately extend a domain's scope (recorded in domain_events)
swarm revalidate — if the agent's files_changed self-report turns out to match the original frozen scope after coordinator review

Save-point discipline

The repo's git tags + commits ARE the save points. Commits on main are the durable mechanism for "I can roll back to here." swarm rewind accepts any git tag matching swarm-save-* by default (--force-arbitrary-ref opts into any ref). Rewind is dry-run safe; verified by an explicit HEAD-guard test that confirms the actual repo's HEAD is unchanged after the rewind test suite runs (cordoned-test discipline).

Docs

📖 Full handbook: https://dogfood-lab.github.io/testing-os/handbook/

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@dogfood-lab/dogfood-swarm

Install

Quick start

Recovery — the Three R's

Exit codes

Troubleshooting — when a wave fails

State machines

Verify verdicts

Control plane

Environment variables

10-phase protocol

Domain ownership

Save-point discipline

Docs

License