quality-playbook

v1.5.8

Quality engineering for AI-driven development — a skill for AI coding agents that finds the bugs review misses.

Quality Playbook

Version: 1.5.8 | Author: Andrew Stellman | License: Apache 2.0

Find the bugs that code review misses

Most AI code review can only find structural issues: null dereferences, resource leaks, race conditions. That catches about 65% of real defects. The other 35% are intent violations: bugs that can only be found if you know what the code is supposed to do. Examples include a function that silently returns null instead of throwing, a duplicate-key check that passes when the first value is null, and a sanitization step that runs after the branch decision it was supposed to guard. These bugs look correct to any reviewer that doesn't know the spec.
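To make the duplicate-key case concrete, here is a minimal, hypothetical Python sketch. The function names are illustrative and do not come from the playbook. The first version has no structural defect, yet it violates the intent "report every repeated key" because a stored None is indistinguishable from a key that was never seen.

```python
def find_duplicate_keys_buggy(pairs):
    """Intended to return every key that appears more than once."""
    seen = {}
    duplicates = []
    for key, value in pairs:
        # Intent violation: a stored None looks the same as "never seen",
        # so a repeat is missed whenever the first value was None.
        if seen.get(key) is not None:
            duplicates.append(key)
        seen[key] = value
    return duplicates


def find_duplicate_keys(pairs):
    """Track key membership directly, independent of the stored value."""
    seen = set()
    duplicates = []
    for key, _value in pairs:
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


pairs = [("id", None), ("id", 7), ("name", "a")]
print(find_duplicate_keys_buggy(pairs))  # prints []
print(find_duplicate_keys(pairs))        # prints ['id']
```

A reviewer who only checks structure sees two well-formed loops. Only the requirement "a key is a duplicate regardless of its value" separates them.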

The playbook closes that gap. It reads your codebase, derives behavioral requirements from every source it can find (code, docs, specs, comments, defensive patterns, community documentation), and uses those requirements to drive review. The result is a quality system grounded in intent, not just structure. For a deeper look at this problem, see the O'Reilly Radar article AI Is Writing Our Code Faster Than We Can Verify It.

How to install the Quality Playbook

The fastest way is to let your AI coding tool do it.

1. Clone this repo somewhere on your machine, for example: git clone https://github.com/andrewstellman/quality-playbook ~/quality-playbook. One clone installs into any number of projects.
2. Open your target project in Claude Code, Cursor, GitHub Copilot, Windsurf, Continue, or another AI coding tool.
3. Ask the AI to install it. Something like:
"Install the Quality Playbook into this project from ~/quality-playbook."
The agent reads AGENTS.md, figures out which install location your tool uses, and runs the installer. Done.

Prefer to install by hand or use the script directly? See Step 1 of the walkthrough for the script invocation and Step 3 for the manual cp recipes.

Prerequisite: Python 3.10 or later on your PATH. The Quality Playbook (QPB) raised its runtime floor from 3.9 to 3.10 in v1.5.7 089i, so adopters must have 3.10+ available (the test suite uses 3.10-only features such as unittest.TestCase.assertNoLogs).
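If you want to confirm the prerequisite before installing, a small check along these lines may help. It is a sketch, not part of the playbook; the helper name meets_python_floor is made up for illustration.

```python
import sys

MINIMUM_PYTHON = (3, 10)  # runtime floor stated in the prerequisite


def meets_python_floor(version_info=None, floor=MINIMUM_PYTHON):
    """Return True when the (major, minor) version is at or above the floor."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(floor)


if __name__ == "__main__":
    status = "ok" if meets_python_floor() else "too old for the Quality Playbook"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Run it with the same interpreter your AI tool will find on PATH, since that is the one the installer and validator will use.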

The more documentation you give it, the better it finds bugs. The playbook reads written specs, design docs, GitHub or Jira issues from real users, chat history, and post-mortems — then derives what your code is supposed to do from those sources. Without documentation it still runs (from the source tree alone), but bug recall drops materially. See Step 2: Provide documentation (strongly recommended) for what to gather and the best ways to gather it.

Gather it in one step. Copy references/DOC_GATHERING_PROMPT.md, open your project in Claude Code, Codex, Copilot, Cursor, Windsurf (or any capable AI tool), paste it in, and run it — it confirms your project, then crawls its docs, issues, and advisories into reference_docs/ for you. See Step 2 for details.

How to run the Quality Playbook

Open your project in your AI coding tool (Claude Code, Cursor, GitHub Copilot, Windsurf, Continue, etc.) and tell the agent:

"Run the Quality Playbook on this project."

That one line is all you need — once the skill is installed, the agent auto-discovers it; you don't have to open, read, or point at SKILL.md or any other file. The agent runs all six phases — explore, generate requirements + tests + protocols, code review, spec audit, reconcile findings, verify — and drops the results into a quality/ folder in your project.

A full six-phase run takes a while and uses a lot of tokens. To split it up across sessions (e.g., for daily token-budget management), tell the agent to run a subset:

"Run phases 1 to 3 of the Quality Playbook on this project."

Then later:

"Continue the Quality Playbook from phase 4."

When the run finishes, the quality/ folder contains:

quality/
├── BUGS.md                  ← consolidated bug report with spec basis (start here)
├── REQUIREMENTS.md          ← behavioral requirements derived from your code + docs
├── EXPLORATION.md           ← Phase 1 findings — patterns explored, files tagged
├── QUALITY.md               ← quality constitution for your codebase
├── CONTRACTS.md             ← extracted behavioral contracts
├── COVERAGE_MATRIX.md       ← contract-to-requirement traceability
├── COMPLETENESS_REPORT.md   ← final gate report with post-reconciliation verdict
├── PROGRESS.md              ← phase checkpoint log + cumulative bug tracker
├── test_functional.py       ← functional tests traced to requirements
├── test_regression.py       ← regression tests for confirmed bugs
├── writeups/                ← per-bug detailed writeups with patches (BUG-NNN.md)
├── patches/                 ← fix and regression-test patches
├── code_reviews/            ← three-pass code review output
├── spec_audits/             ← Council of Three auditor reports + triage
└── results/                 ← TDD red/green logs, integration results, gate log

Start with BUGS.md for the headline findings. Then read REQUIREMENTS.md to see what the playbook learned your code is supposed to do — including requirements derived from issues and docs that you may not have realized were there. The gap between what REQUIREMENTS.md says and what your code actually does is exactly the bug surface the playbook is built to find.
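As a quick sanity check after a run, you could confirm that the top-level reports listed above are present. The sketch below is an assumption-laden illustration, not a playbook tool: the helper name is hypothetical, and it only looks at the Markdown reports because the test file names may differ by project language.

```python
from pathlib import Path

# Top-level report files named in the quality/ listing above.
EXPECTED_FILES = [
    "BUGS.md",
    "REQUIREMENTS.md",
    "EXPLORATION.md",
    "QUALITY.md",
    "CONTRACTS.md",
    "COVERAGE_MATRIX.md",
    "COMPLETENESS_REPORT.md",
    "PROGRESS.md",
]


def missing_artifacts(quality_dir):
    """Return the expected report files that are absent from quality_dir."""
    root = Path(quality_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, missing_artifacts("quality") returns an empty list when every report exists. A partial run (say, phases 1 to 3 only) would be expected to leave some later-phase reports missing.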

Need help? Just ask your AI

The rest of this README has detailed instructions for installing and running the playbook — commands, prompts, screenshots, the whole walkthrough. But the easiest way to get started is to skip the documentation entirely: download one file, upload it to your favorite AI chatbot, and ask it for help.

The file is ai_context/TOOLKIT.md. It's a single Markdown document that explains everything about the Quality Playbook in a format designed for AI assistants to read and answer questions from.

Open a chat in whatever AI tool you use — Claude, ChatGPT, Cursor, GitHub Copilot, Gemini — attach TOOLKIT.md, and tell it:

"Read TOOLKIT.md. Now you're an expert in the Quality Playbook."

Then ask it anything: How do I set this up? What does Phase 3 actually do? How does it find bugs that structural code review misses? What's the difference between gap and adversarial iteration? Why did my run only find one bug? Your AI assistant will walk you through setup, running, interpreting results, and improving your next run.

Here's what that conversation looks like in ChatGPT — it works the same in any other AI tool.

If you'd rather read the docs yourself, the rest of this README has the same information at higher resolution.

How to use the Quality Playbook to find bugs in your code

Step 1: Install the skill

The playbook ships as a complete bundle of 50 files (SKILL.md, quality_gate.py, references/, phase_prompts/, agents/, and 13 bin/*.py modules — see bin/install_skill.py::_bundle_files() for the authoritative list, or the Step 3 manual recipe below) that need to land in a directory your AI coding tool reads as a skill. The recommended path is to have your AI tool do the install for you.

Recommended: have your AI tool install it. Open a chat with Claude Code, Cursor, GitHub Copilot, or another AI coding assistant inside your target repo. Ask it:

"Read AGENTS.md from the Quality Playbook repo and follow the install procedure to set up the skill in this project."

The AI agent reads AGENTS.md, runs python3 -m bin.install_skill against the target, parses the structured output, and reports back. This is the default mode the install path is designed for.

Alternative: run the script directly. From your local QPB clone:

python3 -m bin.install_skill --into /path/to/target-repo --ai-tool cursor   # canonical: name the AI tool
python3 -m bin.install_skill --into /path/to/target-repo                    # auto-detect via marker dir
python3 -m bin.install_skill --target /path/to/install-root                 # literal install path
python3 -m bin.install_skill --verbose                                      # human-readable output

--ai-tool <name> is the canonical way to invoke the installer when you know which tool will use the project. Accepted values are cursor, claude, copilot (alias github), continue, codex, windsurf, cline, and aider: the full 8-tool set the installer supports. The script creates the marker directory if it doesn't exist and installs into that tool's canonical subdirectory (.cursor/skills/quality-playbook/, .claude/skills/quality-playbook/, .github/skills/quality-playbook/, .continue/skills/quality-playbook/, .codex/skills/quality-playbook/, .windsurf/skills/quality-playbook/, .cline/skills/quality-playbook/, or .aider/skills/quality-playbook/).

Bare --into <target-repo> falls back to auto-detecting from a marker directory inside the target, which only works if the target has been opened by your AI tool at least once. Codex, Windsurf, Cline, and Aider don't pre-create a project marker directory (nor do Cursor and Copilot before first project open), so bare --into auto-detection won't find them. In the recommended flow (the "How to install" section above) you don't have to worry about this: the AI agent doing the install self-identifies its own tool and passes the matching --ai-tool itself, which installs to the canonical subdirectory and creates the marker dir whether or not it exists yet. You only pass --ai-tool <tool> yourself when you run the installer directly, with no agent in the loop.

--target <path> treats the path as the literal install root and writes the skill files directly there, which is useful for operators with a non-standard install location. --target is mutually exclusive with both --into and --ai-tool.
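The tool-to-directory mapping described above can be sketched as a small lookup. This is an illustration of the documented layout only, not the code in bin/install_skill.py; the function and constant names are hypothetical.

```python
from pathlib import PurePosixPath

# Marker directory per supported tool, as listed in the text above.
MARKER_DIRS = {
    "cursor": ".cursor",
    "claude": ".claude",
    "copilot": ".github",
    "continue": ".continue",
    "codex": ".codex",
    "windsurf": ".windsurf",
    "cline": ".cline",
    "aider": ".aider",
}
ALIASES = {"github": "copilot"}  # documented alias for copilot


def canonical_install_dir(target_repo, ai_tool):
    """Return <target>/<marker>/skills/quality-playbook for a tool name."""
    name = ALIASES.get(ai_tool.lower(), ai_tool.lower())
    if name not in MARKER_DIRS:
        raise ValueError(f"unsupported AI tool: {ai_tool!r}")
    return PurePosixPath(target_repo) / MARKER_DIRS[name] / "skills" / "quality-playbook"


print(canonical_install_dir("/path/to/target-repo", "cursor"))
# prints /path/to/target-repo/.cursor/skills/quality-playbook
```

PurePosixPath keeps the example deterministic across platforms; a real installer would presumably use the native path type.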

Alternative: install via pip or npm (no clone needed). If you'd rather not clone the QPB repo, install from a package manager. The Quality Playbook ships as an application / scaffolder that copies the skill into your project — not a library you import:

# pip / uvx / pipx (Python):
uvx quality-playbook install --into /path/to/target-repo --ai-tool <tool>        # one-shot, no global install
pipx run quality-playbook install --into /path/to/target-repo --ai-tool <tool>
pip install quality-playbook && quality-playbook install --into /path/to/target-repo --ai-tool <tool>

# npx (Node):
npx quality-playbook init --ai-tool=<tool>                                        # e.g. --ai-tool=claude

Both channels run the same Python installer (Python 3.10+ is still required at runtime — the npm package is a thin Node shim, not a reimplementation), route the skill into the tool's canonical directory, and support the same --ai-tool self-identification described above. The channel sets QPB_CHANNEL (pip / npm) so the Phase-0 validator's remediation hints are channel-aware; neither channel ships compiled .pyc artifacts.

Already manually copied SKILL.md to your skills directory? Skip this step. The manual install paths described in Step 3 below continue to work — bin/install_skill.py is additive, not a replacement.

What the install does: copies the full skill bundle (50 files: SKILL.md, quality_gate.py, references/, phase_prompts/, agents/, and 13 bin/*.py modules — see bin/install_skill.py::_bundle_files() for the authoritative list) into the chosen install location. Runs a smoke check at the end (verifies quality_gate.py is loadable Python, SKILL.md parses with the expected frontmatter, references/exploration_patterns.md loads). Reports any failures in the structured output. Re-installs preserve operator-edited files as <file>.operator-backup-<UTC-timestamp> so your local edits aren't silently overwritten.
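To show what the <file>.operator-backup-<UTC-timestamp> naming might look like in practice, here is a hedged sketch. The timestamp format below is an assumption chosen for illustration; the installer's actual format is not stated here and may differ. The helper name is also hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def operator_backup_path(path, now=None):
    """Build <file>.operator-backup-<UTC-timestamp> next to the original file."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Assumption: compact ISO-like UTC stamp; the real installer may use another format.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(path)
    return path.with_name(f"{path.name}.operator-backup-{stamp}")
```

The point of the pattern is that a re-install never overwrites an edited file in place: the edited copy is renamed first, so it can be diffed against the fresh bundle afterwards.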

Step 2: Provide documentation (strongly recommended)

The playbook produces better requirements, fewer false positives, and more specific bugs when it has written documentation to work from.

Where to find documentation worth providing. The single biggest leverage is issue trackers — GitHub issues, Jira tickets, Linear, Shortcut. Bug reports and feature requests written by real users tell you what they expect the code to do, which is usually not fully captured in any spec you've written. Other high-value sources, in rough order of leverage:

Issue trackers — GitHub Issues, Jira, Linear, Shortcut. Filter for bug and feature-request; user words capture intent.
Project specs and design docs — RFCs, API contracts, architecture decision records (ADRs). Authoritative when they exist.
Post-mortems and incident retrospectives — capture intent that wasn't in the spec when the spec was written.
Chat history — Slack channels, Microsoft Teams, Discord. Especially design discussions, triage threads, and on-call rotation handoffs.
AI chat logs — Claude / ChatGPT / Cursor conversations where you reasoned through behavior.
Public standards you cite — RFCs, W3C specs, vendor API docs.

Tools that help gather these into plaintext. Two open agent-driven tools fit this use case well:

Cowork — Anthropic's desktop tool for non-developers; can connect to GitHub, Jira, Slack, Google Drive, Notion, and similar sources via MCP connectors, search across them, and export results to files. Good fit if you're already in the Anthropic ecosystem and want a graphical workflow.
OpenClaw — open-source AI agent that runs as a local gateway connecting LLMs to your messaging platforms (Slack, Teams, Discord, IRC, plus 20+ others). Uses the same SKILL.md-based skills system QPB does, so you can give it tooling and ask it to traverse your channels and export the relevant threads. Good fit if your project's intent lives in chat history and you want self-hosted, open-source tooling.

The easiest way: the guided gathering prompt. Copy references/DOC_GATHERING_PROMPT.md (or fetch it raw from https://raw.githubusercontent.com/andrewstellman/quality-playbook/refs/heads/main/references/DOC_GATHERING_PROMPT.md), paste it into any of the tools above, and run it — it only needs a project name to start. With QPB installed, you can also just ask your AI tool to gather docs for a project and it follows the same protocol. It identifies the project, proposes a source plan you can narrow or extend (including internal Jira/Confluence/Slack via your connectors), and writes well-structured files into reference_docs/ (with cite/ for authoritative specs). It grounds itself in the playbook first, so it gathers the intent and invariants QPB checks against rather than generic docs.

Or a quick one-liner if you just want something fast:

"Search [GitHub issues / Jira / Slack #project-channel / your-doc-source] for everything related to this codebase. Export to Markdown files in reference_docs/. Prioritize user-reported bugs and feature requests — those tell us what users expected that we may not have documented."

After the playbook runs, read quality/REQUIREMENTS.md to see what it actually learned from those sources. The requirements there are what the documentation says your code is supposed to do — which is frequently not what you thought it did. That gap is the bug surface the playbook finds.

File format. Plaintext only — .txt and .md. Convert other formats first:

pdftotext spec.pdf spec.txt
pandoc -t plain spec.docx -o spec.txt
lynx -dump https://example.org/spec.html > spec.txt

Where to put documentation in your target repo:

reference_docs/
├── claude-chat-2026-03-15.md    ← AI chat logs, design notes (Tier 4 context)
├── design-notes.md              ← exploratory writeups, retrospectives
├── incident-2026-02-retro.md    ← post-mortems, lessons learned
└── cite/
    ├── my-project-spec.md       ← your project's own spec (citable)
    └── rfc7807.txt              ← external standards you cite (citable)

Top-level reference_docs/ holds Tier 4 context — chat logs, design notes, retrospectives, any exploratory material. The playbook reads these into Phase 1 as background but does not byte-verify quotes from them.

reference_docs/cite/ holds citable material — specs, RFCs, API contracts, published standards. Every file here produces a FORMAL_DOC record with a mechanical citation excerpt that quality_gate.py byte-verifies. If you cite it in a BUG or REQ, the gate checks the quote matches the bytes on disk.
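To illustrate what "byte-verifies" means, the sketch below checks that a quoted excerpt occurs verbatim in a file on disk. It is a simplified illustration under two assumptions (UTF-8 encoding and exact substring matching); it is not the logic in quality_gate.py, and the function name is hypothetical.

```python
from pathlib import Path


def quote_matches_bytes(cite_file, quote):
    """True when the quote's UTF-8 bytes occur verbatim in the file on disk."""
    data = Path(cite_file).read_bytes()
    return quote.encode("utf-8") in data
```

Because the comparison is on bytes, a paraphrase, a changed capital letter, or a "smart" quote substituted for a straight one would all fail the check. That is why cite/ files should be stable plaintext.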

You do not need a sidecar file, a frontmatter header, or any metadata. Placement in cite/ is the flag that says "this is citable." (Optional: the first non-blank line of a cite/ file may carry  or # qpb-tier: 2 to mark it as Tier 2. Absent marker defaults to Tier 1.)
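The placement rules above can be summarized as a small classifier. This is a sketch of the documented convention only: the function name is hypothetical, paths are assumed to be relative to the repo root, and only the "# qpb-tier: 2" marker form is handled.

```python
from pathlib import PurePosixPath


def evidence_tier(relative_path, first_nonblank_line=""):
    """Classify a reference_docs/ file by placement and optional tier marker."""
    parts = PurePosixPath(relative_path).parts
    if not parts or parts[0] != "reference_docs":
        raise ValueError(f"not under reference_docs/: {relative_path!r}")
    if len(parts) >= 3 and parts[1] == "cite":
        # Citable material: Tier 1 unless the first non-blank line marks Tier 2.
        if first_nonblank_line.strip() == "# qpb-tier: 2":
            return 2
        return 1
    # Everything else under reference_docs/ is background context.
    return 4


print(evidence_tier("reference_docs/cite/rfc7807.txt"))   # prints 1
print(evidence_tier("reference_docs/design-notes.md"))    # prints 4
```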

If you have no documentation at all, the playbook still runs. It will operate from the source tree alone (Tier 3 evidence) and produce Tier 5 inferred requirements. The results are weaker but valid.

What does not belong in reference_docs:

Binary or formatted files (PDF, DOCX, HTML) — convert first, commit plaintext
Code excerpts — the source tree is already Tier 3 authority
Test fixtures or sample data — these are project artifacts, not documentation
Anything private or sensitive that should not be read by an LLM — reference_docs/ contents are loaded into Phase 1 prompts
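Before a run, you might scan reference_docs/ for files that break the plaintext-only rule. The following is a minimal sketch with a hypothetical helper name; it flags anything that is not .txt or .md and skips dotfiles.

```python
from pathlib import Path

PLAINTEXT_SUFFIXES = {".txt", ".md"}


def non_plaintext_files(reference_docs):
    """List files under reference_docs/ that are not .txt or .md."""
    root = Path(reference_docs)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
        and not p.name.startswith(".")
        and p.suffix.lower() not in PLAINTEXT_SUFFIXES
    )
```

Anything this returns is a candidate for conversion with pdftotext, pandoc, or lynx as shown earlier. It cannot detect the other exclusions (code excerpts, fixtures, sensitive material), which still need a human look.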

Step 3: Install the skill (manual flow — fallback)

If you prefer to do the install by hand instead of using bin/install_skill.py from Step 1, copy the skill files into your project directly:

Claude Code:

mkdir -p .claude/skills/quality-playbook/references
mkdir -p .claude/skills/quality-playbook/phase_prompts
mkdir -p .claude/skills/quality-playbook/agents
mkdir -p .claude/skills/quality-playbook/bin
cp SKILL.md .claude/skills/quality-playbook/SKILL.md
cp .github/skills/quality_gate/quality_gate.py .claude/skills/quality-playbook/quality_gate.py
cp references/* .claude/skills/quality-playbook/references/
cp phase_prompts/*.md .claude/skills/quality-playbook/phase_prompts/
# v1.5.6: agents/*.md needed by README Step 4's `claude --agent agents/...` invocation.
cp agents/*.md .claude/skills/quality-playbook/agents/
# v1.5.7 089 (F1/A-29): the full bin/ closure SKILL.md + phase_prompts
# hard-reference. MIRRORED from install_skill.py::_bundle_files() and
# pinned by test_install_skill_bundle_completeness (drift recreates
# the A-26 ship-blocker via this doc-sanctioned manual path).
cp bin/__init__.py                          .claude/skills/quality-playbook/bin/__init__.py
cp bin/_purpose.py                          .claude/skills/quality-playbook/bin/_purpose.py
cp bin/archive_lib.py                       .claude/skills/quality-playbook/bin/archive_lib.py
cp bin/benchmark_lib.py                     .claude/skills/quality-playbook/bin/benchmark_lib.py
cp bin/citation_verifier.py                 .claude/skills/quality-playbook/bin/citation_verifier.py
cp bin/council_config.py                    .claude/skills/quality-playbook/bin/council_config.py
cp bin/council_semantic_check.py            .claude/skills/quality-playbook/bin/council_semantic_check.py
cp bin/migrate_v1_5_0_layout.py             .claude/skills/quality-playbook/bin/migrate_v1_5_0_layout.py
cp bin/qpb_config.py                        .claude/skills/quality-playbook/bin/qpb_config.py
cp bin/quality_playbook.py                  .claude/skills/quality-playbook/bin/quality_playbook.py
cp bin/reference_docs_ingest.py             .claude/skills/quality-playbook/bin/reference_docs_ingest.py
cp bin/role_map.py                          .claude/skills/quality-playbook/bin/role_map.py
cp bin/run_state_lib.py                     .claude/skills/quality-playbook/bin/run_state_lib.py
cp bin/validate_phase_artifacts.py          .claude/skills/quality-playbook/bin/validate_phase_artifacts.py
cp bin/qpb_validate.py                      .claude/skills/quality-playbook/bin/qpb_validate.py
cp bin/qpb_phase.py                         .claude/skills/quality-playbook/bin/qpb_phase.py
# v1.5.2: single reference_docs/ tree at the target repo root.
# No README ships — cite/ contents are adopter-provided plaintext.
mkdir -p reference_docs reference_docs/cite
# v1.5.7: the quality/RUN_INDEX.md sentinel for the gitignore negation
# rule (without it run_playbook.py's pre-flight aborts "Required
# sentinel files missing"; install_skill.py creates it too).
mkdir -p quality
echo "# Run Index" > quality/RUN_INDEX.md
# Optional: append the suggested .gitignore rules for adopters (keeps bulk
# archived runs + reference_docs content out of version control while tracking
# the top-level RUN_INDEX.md).
cat skill-template.gitignore >> .gitignore

GitHub Copilot (flat layout):

mkdir -p .github/skills/references
mkdir -p .github/skills/phase_prompts
mkdir -p .github/skills/agents
mkdir -p .github/skills/bin
cp SKILL.md .github/skills/SKILL.md
cp .github/skills/quality_gate/quality_gate.py .github/skills/quality_gate.py
cp references/* .github/skills/references/
cp phase_prompts/*.md .github/skills/phase_prompts/
# v1.5.6: agents/*.md needed by README Step 4's `claude --agent agents/...` invocation.
cp agents/*.md .github/skills/agents/
# v1.5.7 089 (F1/A-29): the full bin/ closure SKILL.md + phase_prompts
# hard-reference. MIRRORED from install_skill.py::_bundle_files() and
# pinned by test_install_skill_bundle_completeness (drift recreates
# the A-26 ship-blocker via this doc-sanctioned manual path).
cp bin/__init__.py                          .github/skills/bin/__init__.py
cp bin/_purpose.py                          .github/skills/bin/_purpose.py
cp bin/archive_lib.py                       .github/skills/bin/archive_lib.py
cp bin/benchmark_lib.py                     .github/skills/bin/benchmark_lib.py
cp bin/citation_verifier.py                 .github/skills/bin/citation_verifier.py
cp bin/council_config.py                    .github/skills/bin/council_config.py
cp bin/council_semantic_check.py            .github/skills/bin/council_semantic_check.py
cp bin/migrate_v1_5_0_layout.py             .github/skills/bin/migrate_v1_5_0_layout.py
cp bin/qpb_config.py                        .github/skills/bin/qpb_config.py
cp bin/quality_playbook.py                  .github/skills/bin/quality_playbook.py
cp bin/reference_docs_ingest.py             .github/skills/bin/reference_docs_ingest.py
cp bin/role_map.py                          .github/skills/bin/role_map.py
cp bin/run_state_lib.py                     .github/skills/bin/run_state_lib.py
cp bin/validate_phase_artifacts.py          .github/skills/bin/validate_phase_artifacts.py
cp bin/qpb_validate.py                      .github/skills/bin/qpb_validate.py
cp bin/qpb_phase.py                         .github/skills/bin/qpb_phase.py
# v1.5.2: single reference_docs/ tree at the target repo root.
mkdir -p reference_docs reference_docs/cite
# v1.5.7: the quality/RUN_INDEX.md sentinel for the gitignore negation
# rule (without it run_playbook.py's pre-flight aborts "Required
# sentinel files missing"; install_skill.py creates it too).
mkdir -p quality
echo "# Run Index" > quality/RUN_INDEX.md
cat skill-template.gitignore >> .gitignore

GitHub Copilot (nested layout):

mkdir -p .github/skills/quality-playbook/references
mkdir -p .github/skills/quality-playbook/phase_prompts
mkdir -p .github/skills/quality-playbook/agents
mkdir -p .github/skills/quality-playbook/bin
cp SKILL.md .github/skills/quality-playbook/SKILL.md
cp .github/skills/quality_gate/quality_gate.py .github/skills/quality-playbook/quality_gate.py
cp references/* .github/skills/quality-playbook/references/
cp phase_prompts/*.md .github/skills/quality-playbook/phase_prompts/
# v1.5.6: agents/*.md needed by README Step 4's `claude --agent agents/...` invocation.
cp agents/*.md .github/skills/quality-playbook/agents/
# v1.5.7 089 (F1/A-29): the full bin/ closure SKILL.md + phase_prompts
# hard-reference. MIRRORED from install_skill.py::_bundle_files() and
# pinned by test_install_skill_bundle_completeness (drift recreates
# the A-26 ship-blocker via this doc-sanctioned manual path).
cp bin/__init__.py                          .github/skills/quality-playbook/bin/__init__.py
cp bin/_purpose.py                          .github/skills/quality-playbook/bin/_purpose.py
cp bin/archive_lib.py                       .github/skills/quality-playbook/bin/archive_lib.py
cp bin/benchmark_lib.py                     .github/skills/quality-playbook/bin/benchmark_lib.py
cp bin/citation_verifier.py                 .github/skills/quality-playbook/bin/citation_verifier.py
cp bin/council_config.py                    .github/skills/quality-playbook/bin/council_config.py
cp bin/council_semantic_check.py            .github/skills/quality-playbook/bin/council_semantic_check.py
cp bin/migrate_v1_5_0_layout.py             .github/skills/quality-playbook/bin/migrate_v1_5_0_layout.py
cp bin/qpb_config.py                        .github/skills/quality-playbook/bin/qpb_config.py
cp bin/quality_playbook.py                  .github/skills/quality-playbook/bin/quality_playbook.py
cp bin/reference_docs_ingest.py             .github/skills/quality-playbook/bin/reference_docs_ingest.py
cp bin/role_map.py                          .github/skills/quality-playbook/bin/role_map.py
cp bin/run_state_lib.py                     .github/skills/quality-playbook/bin/run_state_lib.py
cp bin/validate_phase_artifacts.py          .github/skills/quality-playbook/bin/validate_phase_artifacts.py
cp bin/qpb_validate.py                      .github/skills/quality-playbook/bin/qpb_validate.py
cp bin/qpb_phase.py                         .github/skills/quality-playbook/bin/qpb_phase.py
# v1.5.2: single reference_docs/ tree at the target repo root.
mkdir -p reference_docs reference_docs/cite
# v1.5.7: the quality/RUN_INDEX.md sentinel for the gitignore negation
# rule (without it run_playbook.py's pre-flight aborts "Required
# sentinel files missing"; install_skill.py creates it too).
mkdir -p quality
echo "# Run Index" > quality/RUN_INDEX.md
cat skill-template.gitignore >> .gitignore

Cursor, Windsurf, other tools: Use any of the locations above, or put the full skill bundle (50 files: SKILL.md, quality_gate.py, references/, phase_prompts/, agents/, and 13 bin/*.py modules — see bin/install_skill.py::_bundle_files() for the authoritative list, or the Step 3 manual recipe above) in your project root. The runner, gate, and orchestrator agents check all ten documented install layouts in order — repo-root SKILL.md plus the canonical <marker>/skills/quality-playbook/ subdirectory for each of the 8 supported tools (.claude, .github, .cursor, .continue, .codex, .windsurf, .cline, .aider), with .github/skills/ also accepted for the flat Copilot layout. The simplest path for any of these tools is still python3 -m bin.install_skill --ai-tool <tool>, which writes to the right subdirectory automatically.
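The ten layouts can be enumerated as follows. This is a sketch for orientation only: the names are hypothetical, and the exact lookup order used by the runner, gate, and orchestrator agents is an assumption here beyond what the paragraph above states.

```python
from pathlib import Path

# Marker directories for the 8 supported tools, in the order listed above.
TOOL_MARKERS = [".claude", ".github", ".cursor", ".continue",
                ".codex", ".windsurf", ".cline", ".aider"]


def candidate_skill_paths(repo):
    """Return the ten documented SKILL.md locations for a repo."""
    root = Path(repo)
    candidates = [root / "SKILL.md"]  # repo-root layout
    candidates += [
        root / marker / "skills" / "quality-playbook" / "SKILL.md"
        for marker in TOOL_MARKERS
    ]
    candidates.append(root / ".github" / "skills" / "SKILL.md")  # flat Copilot layout
    return candidates


def find_installed_skill(repo):
    """Return the first existing SKILL.md among the candidates, or None."""
    for path in candidate_skill_paths(repo):
        if path.is_file():
            return path
    return None
```

A None result would suggest the skill is not installed in any documented location, in which case running the installer with --ai-tool is the simplest fix.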

OpenAI Codex CLI: v1.5.3 adds the standalone codex CLI (codex-cli 0.125+) as a third runner alongside claude and copilot. No separate skill-install layout — codex runs the playbook from any of the locations above. To use it via bin/run_playbook.py, pass --codex (see Step 4 + the "Running everything autonomously" section below).

Step 4: Run the playbook

Claude Code: Open Claude Code in your project directory and say: "Run the QPB install validator against this project (the qpb_validate.py entry point inside your QPB installation). For a clone-based install, the command is python <path-to-your-QPB-clone>/bin/qpb_validate.py <this-project-absolute-path> (substitute <path-to-your-QPB-clone> with your QPB clone path and <this-project-absolute-path> with this project's absolute path). Paste the complete structured output — every event= line including the run-nonce — into chat. Do not proceed past Phase 0 until event=validation_complete status=ok; if status=remediable, run each event=remediation_suggestion's command verbatim (for a missing install the validator emits the platform-correct install command, e.g. python <path-to-your-QPB-clone>/bin/install_skill.py --into <this-project-absolute-path> --ai-tool claude — run it from your QPB clone) and re-run the validator. Then run the playbook including all four iteration strategies (the agent auto-discovers the installed skill). Execute Phases 1-5 yourself in this session — do not delegate execution to a sub-agent; Phase 6 verification uses a fresh-context auditor sub-agent per the skill's A-13-hybrid contract." (The validator is the mandatory Phase 0 single source of truth — without a clean status=ok the artifact-contract validators and the Phase 6 gate are not at canonical locations; see AGENTS.md "Mode A entry sequence".)

Add --dangerously-skip-permissions when launching claude to skip file-write approval prompts during execution.

(For automated batch invocation — headless CI, scripted runs — use the orchestrator agent file via claude --agent agents/quality-playbook.agent.md. The orchestrator-agent path spawns sub-agents per phase and hides per-step output from operator chat, which is appropriate for unattended automation but NOT for interactive sessions where the operator monitors output. See agents/quality-playbook.agent.md's "When to use this file" header for the full constraint.)

GitHub Copilot: Open the chat panel in VS Code, IntelliJ, or any IDE with Copilot support and say: "Run the QPB install validator against this project (the qpb_validate.py entry point inside your QPB installation). For a clone-based install, the command is python <path-to-your-QPB-clone>/bin/qpb_validate.py <this-project-absolute-path>. Paste the complete structured output (every event= line) into chat. Do not proceed past Phase 0 until event=validation_complete status=ok; if status=remediable, run each event=remediation_suggestion command verbatim (the validator emits the platform-correct --ai-tool copilot install, run from the QPB clone) and re-run the validator. Then run the quality playbook on this project (the agent auto-discovers the installed skill)." For the CLI, install the standalone copilot CLI (preferred — brew install copilot-cli on macOS, winget install GitHub.Copilot on Windows, or curl -fsSL https://gh.io/copilot-install | bash on Linux; npm: npm install -g @github/copilot) and invoke it with copilot -p "<prompt>" --allow-all. The deprecated gh copilot extension (gh extension install github/gh-copilot, then gh copilot -p "<prompt>" --yolo) still works during GitHub's grace period — QPB auto-detects which CLI is on PATH and routes accordingly via bin/copilot_resolver.py (v1.5.7 089f). (The validator is the mandatory Phase 0 — see AGENTS.md "Mode A entry sequence".)
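The "which CLI is on PATH" routing can be pictured with the sketch below. It is not the code in bin/copilot_resolver.py; it only illustrates the preference order described above, using the two invocations quoted in the text. The function name is hypothetical.

```python
import shutil


def resolve_copilot_command(prompt, which=shutil.which):
    """Prefer the standalone copilot CLI; fall back to the gh extension."""
    if which("copilot"):
        return ["copilot", "-p", prompt, "--allow-all"]
    # Note: finding gh does not prove the gh-copilot extension is installed.
    if which("gh"):
        return ["gh", "copilot", "-p", prompt, "--yolo"]
    return None
```

The which parameter exists so the lookup can be replaced in tests; in normal use it defaults to shutil.which and the returned list can be passed to subprocess.run.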

OpenAI Codex CLI:

python3 -m bin.run_playbook --codex ./my-project

This invokes codex exec --full-auto (sandboxed automatic execution; the codex equivalent of the Copilot CLI's --allow-all / --yolo) for each playbook phase. Codex picks its model from ~/.codex/config.toml unless you pass --model gpt-5-codex (or another model name in your codex config).

Cursor: Open Composer (Cmd+I / Ctrl+I) and say: "Run the QPB install validator against this project (the qpb_validate.py entry point inside your QPB installation). For a clone-based install, the command is python <path-to-your-QPB-clone>/bin/qpb_validate.py <this-project-absolute-path>. Paste the complete structured output (every event= line) into chat. Do not proceed past Phase 0 until event=validation_complete status=ok; if status=remediable, run each event=remediation_suggestion command verbatim (the validator emits the platform-correct --ai-tool cursor install, run from the QPB clone) and re-run the validator. Then run the quality playbook on this project (the agent auto-discovers the installed skill)." (The validator is the mandatory Phase 0 — see AGENTS.md "Mode A entry sequence".)

Windsurf: Open Cascade and say: "Run the QPB install validator against this project (the qpb_validate.py entry point inside your QPB installation). For a clone-based install, the command is python <path-to-your-QPB-clone>/bin/qpb_validate.py <this-project-absolute-path>. Paste the complete structured output (every event= line) into chat. Do not proceed past Phase 0 until event=validation_complete status=ok; if status=remediable, run each event=remediation_suggestion command verbatim (the validator emits the platform-correct --ai-tool windsurf install, run from the QPB clone) and re-run the validator. Then run the quality playbook on this project (the agent auto-discovers the installed skill)." (The validator is the mandatory Phase 0 — see AGENTS.md "Mode A entry sequence".)
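The prompts above all hinge on reading the validator's event= lines. The sketch below shows one way such output could be inspected, assuming each line is a series of space-separated key=value tokens. That line format is an assumption made for illustration, as are the function names; check the validator's real output before relying on it.

```python
import shlex


def parse_event_line(line):
    """Split one line into a dict of key=value fields (assumed format)."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def validation_status(output):
    """Return the status of the validation_complete event, or None if absent."""
    for line in output.splitlines():
        fields = parse_event_line(line)
        if fields.get("event") == "validation_complete":
            return fields.get("status")
    return None
```

Under that assumption, a result of "ok" means Phase 0 is clear, "remediable" means the suggested commands should be run and the validator repeated, and None means the run never reached completion.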

The playbook runs in six phases. Each phase gets its own context window — this is what lets it do deep analysis instead of running out of context on large codebases. After each phase, say "keep going" to continue.

After Phase 1, the playbook reports candidate bugs and tells you what to say next.

Phase 5 confirms every bug with TDD red-green verification and generates fix patches.

The final summary shows all confirmed bugs with regression tests, patches, and writeups.

The six phases: Explore (read code + docs, find candidates) → Generate (requirements, tests, protocols) → Code Review (three-pass: structural, requirement verification, cross-requirement consistency) → Spec Audit (three independent auditors check code against requirements) → Reconciliation (every bug tracked, regression-tested, TDD-verified) → Verify (45 self-check benchmarks). The full cycle takes 15-90 minutes depending on project size and works with any language.

Step 5: Run iterations

After the baseline, the playbook suggests iteration strategies that find different classes of bugs, typically adding 40-60% more confirmed bugs on top of the baseline. Say "Run the next iteration using the gap strategy" to start, then follow the suggested order: gap → unfiltered → parity → adversarial.

Running everything autonomously

To run the full baseline and all four iterations without manual intervention:

Claude Code:

claude --agent agents/quality-playbook-claude.agent.md --dangerously-skip-permissions -p \
  "Run the full quality playbook with all iterations. Run each phase as a separate
   sub-agent, then run all four iteration strategies (gap, unfiltered, parity,
   adversarial) in sequence, each as a separate sub-agent. Do not stop between
   phases or iterations — run everything end to end."

To capture the output to a log file, add 2>&1 | tee playbook-run.log to the end.

Via bin/run_playbook.py (any runner): the Python orchestrator at bin/run_playbook.py accepts a runner-selection flag — pick one of --claude / --copilot (default) / --codex. Example: python3 -m bin.run_playbook --codex ./my-project runs all six phases via codex exec --full-auto. Use --model <name> to override the runner's default model (codex picks from ~/.codex/config.toml when no --model is passed).

The Claude Code command above uses the orchestrator agent (quality-playbook-claude.agent.md), which spawns a separate sub-agent for each of the six phases and each of the four iteration strategies. Each sub-agent gets its own context window, communicates with the others through files on disk (quality/PROGRESS.md, quality/BUGS.md, etc.), and exits when its phase is complete. The orchestrator reads the results and launches the next sub-agent.

Three things in the prompt matter:

"Run each phase as a separate sub-agent" — this is the most important part. Each phase needs the full context window for deep analysis. If the agent tries to run multiple phases in a single context, it runs out of room partway through Phase 3 on most projects, producing shallow analysis and fewer bugs. Separate sub-agents mean each phase gets ~200K tokens of context for investigation.

"All four iteration strategies in sequence" — iterations re-explore the codebase with different approaches: gap (areas the baseline missed), unfiltered (pure domain-driven exploration without structural constraints), parity (compare parallel code paths), and adversarial (challenge prior dismissals). Each strategy finds a different class of bug. Running all four typically adds 40-60% more confirmed bugs on top of the baseline.

"Do not stop between phases or iterations" — by default, the playbook pauses after each phase and waits for the user to say "keep going." This is useful when you want to review intermediate results, but for an autonomous run you want it to continue through all ten sub-agents (six phases + four iterations) without interruption.

The full autonomous run takes 60-180 minutes depending on codebase size and model. Add --model sonnet or --model opus to choose a specific model.
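To make the "ten sub-agents" count concrete, here is a small sketch of the run order the autonomous prompt asks for. It only enumerates the steps named in this README; the function name and label format are hypothetical, and the real orchestrator may track its plan differently.

```python
from typing import List

# Phase and strategy names as listed in this README.
PHASES = ["Explore", "Generate", "Code Review", "Spec Audit", "Reconciliation", "Verify"]
ITERATIONS = ["gap", "unfiltered", "parity", "adversarial"]


def build_run_plan(include_iterations: bool = True) -> List[str]:
    """Return one label per sub-agent, in the order they are expected to run."""
    plan = [f"phase {n}: {name}" for n, name in enumerate(PHASES, start=1)]
    if include_iterations:
        plan += [f"iteration: {strategy}" for strategy in ITERATIONS]
    return plan


for step in build_run_plan():
    print(step)
```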

Step 6: Fix bugs, then recheck

After fixing the bugs from BUGS.md, say "recheck" to verify your fixes. Recheck mode reads the existing bug report, checks each bug against the current source (reverse-applying patches, inspecting cited lines), and reports which bugs are fixed vs. still open. Takes 2-10 minutes instead of re-running the full pipeline.
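One way to approximate the reverse-apply check by hand is git's dry-run mode: if a fix patch can be reverse-applied cleanly, its changes are present in the working tree. This is a sketch under that assumption, not the playbook's recheck implementation. The function names are hypothetical, and the patch path must be absolute or relative to the repository root.

```python
import subprocess
from pathlib import Path
from typing import List


def reverse_check_command(patch: Path) -> List[str]:
    """Build a dry-run command asking git whether the patch can be undone."""
    return ["git", "apply", "--reverse", "--check", str(patch)]


def fix_is_present(repo: Path, patch: Path) -> bool:
    """True when the patch reverse-applies cleanly against the tree in repo."""
    result = subprocess.run(
        reverse_check_command(patch),
        cwd=repo,
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    # git apply --check exits 0 only if the (reversed) patch would apply.
    return result.returncode == 0
```

Because --check never modifies files, this is safe to run against a working tree with uncommitted changes.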

Running in CI

For headless / CI usage where python3 -m bin.run_playbook may be invoked from a non-interactive context, see docs/CI_INTEGRATION.md for the operator-side configuration steps.

Non-interactive host-CLI invocation (auto-approval flag). Each supported host CLI needs its auto-approval flag (--yolo / --dangerously-skip-permissions / --full-auto) for non-interactive runs — omitting it makes the CLI silently deny filesystem ops and cascade into a failed (or fabricated) run. See the Canonical adopter invocations table in AGENTS.md for the exact interactive vs non-interactive command per host CLI (Claude Code, the GitHub Copilot CLI — new standalone copilot and the deprecated gh copilot extension during the grace period per v1.5.7 089f, codex CLI, codex desktop).
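A CI job can fail fast when the auto-approval flag is missing instead of discovering it through a denied filesystem operation. The sketch below uses only the flags quoted in this README; the runner labels are hypothetical, and the Canonical adopter invocations table in AGENTS.md remains the authoritative source for full commands.

```python
from typing import Dict, List

# Flags as quoted in this README; dictionary keys are illustrative labels.
AUTO_APPROVAL_FLAGS: Dict[str, str] = {
    "claude": "--dangerously-skip-permissions",
    "copilot": "--allow-all",   # standalone copilot CLI
    "gh-copilot": "--yolo",     # deprecated gh copilot extension
    "codex": "--full-auto",     # used with codex exec
}


def has_auto_approval(runner: str, argv: List[str]) -> bool:
    """Check that the command line carries the runner's auto-approval flag."""
    return AUTO_APPROVAL_FLAGS[runner] in argv


def require_auto_approval(runner: str, argv: List[str]) -> None:
    """Raise before launching a non-interactive run that would be denied."""
    if not has_auto_approval(runner, argv):
        flag = AUTO_APPROVAL_FLAGS[runner]
        raise ValueError(f"{runner} run is missing {flag}")


print(has_auto_approval("codex", ["codex", "exec", "--full-auto"]))
```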

Known limitations

Phase validator-invocation contracts are prose-enforced. Phase 1, Phase 2, Phase 5, and Phase 6 each require the agent to invoke validate_phase_artifacts (Phase 1/2/5) or quality_gate.py plus the fresh-context auditor (Phase 6) at the phase boundary and to quote the verbatim verdict line. This is currently mandated only in prose, in phase_prompts/*.md and the per-phase reference guides: agents are required to comply, but the requirement is not mechanically enforced. Empirically:

Phase 6: codex desktop performs in-session verification with explicit disclosure rather than dispatching the mandated fresh-context sub-agent (observed 2026-05-18). Claude Code (via the Task tool) and Copilot CLI Mode B dispatch the sub-agent correctly. (Copilot CLI was the deprecated gh copilot extension at the time of observation; it has since been superseded by the standalone copilot CLI per v1.5.7 089f.)
Phase 1: codex desktop reported Phase 1 PASS while producing an EXPLORATION.md the validator would have FAILed (observed 2026-05-18 self-bootstrap). Either the validator was not invoked, or its FAIL verdict was ignored.

Phase 2 and Phase 5 have the same structural shape and likely fail the same way under the same conditions, though they have not surfaced empirically yet.

Operators reviewing phase verdicts should check for verbatim RESULT: VALIDATION PASSED (phase N) lines (Phase 1/2/5) or fresh-context framing in the auditor verdict (Phase 6). If absent, do not treat the verdict as load-bearing.
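The check for verbatim verdict lines is easy to script against a saved agent transcript. This sketch assumes the phase number appears as digits in the quoted line and that the transcript is plain text; the function names are hypothetical. It covers only the Phase 1/2/5 line format, not the Phase 6 auditor framing.

```python
import re
from typing import Set

# Matches the verbatim verdict line quoted in this README.
VERDICT_RE = re.compile(r"RESULT: VALIDATION PASSED \(phase (\d+)\)")


def phases_with_verbatim_pass(transcript: str) -> Set[int]:
    """Collect every phase number that has a verbatim PASSED verdict line."""
    return {int(match.group(1)) for match in VERDICT_RE.finditer(transcript)}


def verdict_is_load_bearing(transcript: str, phase: int) -> bool:
    """A phase verdict counts only if the verbatim line is present."""
    return phase in phases_with_verbatim_pass(transcript)


transcript = "\n".join([
    "Phase 1 PASS",                         # paraphrase, not verbatim
    "RESULT: VALIDATION PASSED (phase 2)",  # verbatim verdict line
])
print(verdict_is_load_bearing(transcript, 1))
print(verdict_is_load_bearing(transcript, 2))
```

A paraphrased "Phase 1 PASS" is deliberately rejected, which is the failure mode described for the 2026-05-18 observation.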

Structural enforcement is tracked for v1.6.x — see docs/design/QPB_v1.6.x_Phase6_Structural_Enforcement_Proposal.md (filename retains the historical Phase6 suffix; content covers all phase-boundary validator contracts via Slice 0 for Phase 1/2/5 subprocess attestation and Slices 1+2 for Phase 6 subprocess verifier + witness-signing).

Running the playbook: phases, iterations, and macros

bin/run_playbook.py exposes three invocation modes:

Mode 1 — Single baseline run (default):

python3 -m bin.run_playbook ./my-project

Runs Phase 1 through Phase 6 in sequence on one target.

Mode 2 — Explicit iteration list:

python3 -m bin.run_playbook --iterations gap,unfiltered,parity,adversarial ./my-project

Runs baseline + the listed iteration strategies in order. Early-stop is disabled when --iterations is explicit — every strategy in the list runs regardless of prior yields.

Mode 3 — --full-run macro:

python3 -m bin.run_playbook --full-run ./my-project

Equivalent to baseline + all four iteration strategies (gap, unfiltered, parity, adversarial) in order, with early-stop enabled. If yields drop below the threshold, remaining iterations are skipped.

Use Mode 2 when you want to force all four strategies to run even if early-stop would trigger. Use Mode 3 for unattended runs where you're happy to save budget on clearly-exhausted cycles.
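The difference between Mode 2 and Mode 3 comes down to whether early-stop is active. The sketch below illustrates that behavior only. The yield metric (net-new bugs per iteration) and the min_yield threshold are assumptions made for illustration; the README does not state the actual threshold, and run_one is a hypothetical stand-in for launching an iteration.

```python
from typing import Callable, List, Sequence

ALL_STRATEGIES = ("gap", "unfiltered", "parity", "adversarial")


def run_iterations(
    strategies: Sequence[str],
    run_one: Callable[[str], int],
    early_stop: bool,
    min_yield: int = 1,  # hypothetical threshold, not the playbook's value
) -> List[str]:
    """Run strategies in order; with early_stop, quit after a low-yield one."""
    completed: List[str] = []
    for strategy in strategies:
        found = run_one(strategy)
        completed.append(strategy)
        if early_stop and found < min_yield:
            break  # remaining iterations are skipped
    return completed


# Made-up yields for illustration only.
fake_yields = {"gap": 3, "unfiltered": 0, "parity": 2, "adversarial": 1}

mode2 = run_iterations(ALL_STRATEGIES, fake_yields.__getitem__, early_stop=False)
mode3 = run_iterations(ALL_STRATEGIES, fake_yields.__getitem__, early_stop=True)
print(mode2)
print(mode3)
```

With these made-up yields, the explicit list still reaches parity and adversarial, while the early-stop run ends after unfiltered. That is the trade-off between forcing coverage and saving budget.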

Rate limits and run budgets

GitHub Copilot GPT-5.4: Copilot enforces a 54-hour cooldown on ~15M-token prompts. Plan benchmark re-runs accordingly — the casbin-1.5.1 incident locked out GPT-5.4 for two days mid-release.
Claude Code plan budget: a full run of the playbook on a 50K-LOC project typically consumes ~30% of a Sonnet-family monthly budget. Budget surges during Phase 4 (Spec Audit, three parallel auditors) and Phase 5 (TDD red-green verification on many bugs).
Reference-doc scaling: the playbook reads all of reference_docs/ into Phase 1 context. Keep it under ~2M tokens to avoid context-budget pressure on downstream phases. For very large specs, curate the excerpts that are actually cited rather than dumping full RFCs.
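Before a run, you can get a rough sense of whether reference_docs/ fits the ~2M-token guidance. This sketch uses a characters-divided-by-four heuristic, which is an assumption and not how any particular model tokenizes text, so treat the result as an order-of-magnitude estimate. The function names are hypothetical.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4        # rough heuristic (assumption), varies by model
BUDGET_TOKENS = 2_000_000  # the ~2M guidance from this README


def estimate_tokens(root: Path) -> int:
    """Estimate tokens for every file under root from its character count."""
    total_chars = 0
    for path in root.rglob("*"):
        if path.is_file():
            # errors="replace" keeps binary or oddly encoded files from crashing.
            total_chars += len(path.read_text(encoding="utf-8", errors="replace"))
    return total_chars // CHARS_PER_TOKEN


def over_budget(root: Path, budget: int = BUDGET_TOKENS) -> bool:
    """True when the estimate exceeds the budget."""
    return estimate_tokens(root) > budget
```

For example, over_budget(Path("reference_docs")) run from the project root gives a quick yes/no before Phase 1 starts.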

Why phases?

The playbook runs each phase in a separate context window on purpose. A single-session approach runs out of context partway through Phase 3 on most projects, which means shallow analysis and missed bugs. The phase-by-phase design gives each phase the full context budget for deep investigation. The tradeoff is saying "keep going" a few times; alternatively, use the autonomous mode above to skip the manual steps entirely.

What the playbook produces

The playbook generates these files:

| Artifact | Location | What it does |
|----------|----------|-------------|
| REQUIREMENTS.md | quality/ | Behavioral requirements derived from code, docs, and community sources via a five-phase pipeline. This is the foundation -- without requirements, review is limited to structural bugs. |
| QUALITY.md | quality/ | Quality constitution defining what "correct" means for this specific project, with fitness-to-purpose scenarios and coverage theater prevention. |
| test_functional.* | quality/ | Functional tests in the project's native language, traced to requirements rather than generated from source code. |
| RUN_CODE_REVIEW.md | quality/ | Three-pass protocol: structural review, requirement verification, cross-requirement consistency. Each pass finds bugs the others can't. |
| RUN_SPEC_AUDIT.md | quality/ | Council of Three: three independent AI models audit the code against requirements. Different models have different blind spots, and the triage uses confidence weighting, not majority vote. |
| RUN_INTEGRATION_TESTS.md | quality/ | End-to-end test protocol grounded in use cases, with a traceability column mapping each test to the user outcome it validates. |
| RUN_TDD_TESTS.md | quality/ | Red-green TDD verification protocol: for each confirmed bug, prove the regression test fails on unpatched code and passes with the fix. |
| BUGS.md | quality/ | Consolidated bug report with spec basis, severity, reproduction steps, and patch references for every confirmed finding. |
| AGENTS.md | project root | Bootstrap file so every future AI session inherits the full quality infrastructure. |
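After a run, a quick presence check against the artifact list above can catch an incomplete baseline. This is a minimal sketch that only tests whether the files exist at the listed locations; it says nothing about their content, and it is not a substitute for quality_gate.py. The function name is hypothetical.

```python
from pathlib import Path
from typing import List

# Fixed-name artifacts expected under quality/, per the table above.
QUALITY_ARTIFACTS = [
    "REQUIREMENTS.md",
    "QUALITY.md",
    "RUN_CODE_REVIEW.md",
    "RUN_SPEC_AUDIT.md",
    "RUN_INTEGRATION_TESTS.md",
    "RUN_TDD_TESTS.md",
    "BUGS.md",
]


def missing_artifacts(project_root: Path) -> List[str]:
    """List the expected artifacts that are not present in the project."""
    quality = project_root / "quality"
    missing = [
        f"quality/{name}"
        for name in QUALITY_ARTIFACTS
        if not (quality / name).is_file()
    ]
    # The functional test file uses the project's native language extension.
    if not any(quality.glob("test_functional.*")):
        missing.append("quality/test_functional.*")
    if not (project_root / "AGENTS.md").is_file():
        missing.append("AGENTS.md")
    return missing
```

An empty list from missing_artifacts(Path(".")) means every artifact in the table is at least present.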

How it works

The playbook's value comes from requirement derivation. AI code reviewers are bottlenecked by the same thing human reviewers are: if you don't know what the code is supposed to do, you can only find structural issues. The playbook's main job is figuring out intent, then using that intent to drive every downstream artifact.

Phase 1: Explore. The AI reads source files, tests, config, specs, and commit history. If you provide community documentation (GitHub issues, user guides, API docs, forum discussions), it reads those too. The goal is to understand not just what the code does, but what it's supposed to do.

Phase 2: Generate. A five-phase pipeline extracts behavioral contracts from the codebase, derives testable requirements, verifies coverage, checks completeness, and adds a narrative layer with validated use cases. The pipeline also generates functional tests, review protocols, a TDD verification protocol, and the quality constitution.

Phase 3: Code review. A three-pass code review runs against HEAD: structural review with anti-hallucination guardrails, requirement verification checking each requirement against the code, and cross-requirement consistency checking whether requirements contradict each other. About 65% of findings come from Pass 1, 35% from Passes 2 and 3. Each confirmed bug gets a regression test.

Phase 4: Spec audit. Three independent AI models audit the code against the requirements. The triage process uses verification probes -- targeted checks that ask "is this actually true?" -- rather than dismissing single-model findings. As of v1.3.17, verification probes must produce executable test assertions (not just prose reasoning) to confirm or reject findings, which prevents the triage from hallucinating code compliance. The most valuable findings are often the ones only one model catches.

Phase 5: Reconciliation. Post-review reconciliation closes the loop: every bug from code review and spec audit is tracked, regression-tested or explicitly exempted, and the completeness report is finalized with one authoritative verdict.

Phase 6: Verify. 45 self-check benchmarks validate the generated artifacts against internal consistency rules -- requirement counts match across all surfaces, no stale text remains, every finding has a closure status, and triage probes include executable evidence.

The gate ends with one of three verdicts (v1.5.7):

GATE PASSED — the review completed and every audit record is in place. Nothing to do.
GATE PASSED WITH CLEANUP NEEDED — the bug findings are real, reviewed, and stand on their own; only the audit trail is incomplete (a manifest record missing a field, a per-bug challenge record absent, a cross-site pattern tag not applied). This is not a failure — the review is done; only the paperwork needs filling in. Ask your AI assistant to complete the audit records without changing any findings.
GATE FAILED — a substantive problem: the review didn't complete, specs are missing, the mechanical verifier never ran, or a verdict was fabricated. Fix the listed issues before treating the run as trustworthy.

The split exists so you can tell "your code is broken in N ways" apart from "your audit trail is incomplete in N ways" — earlier versions reported both as a flat GATE FAILED — N checks, and honest record-keeping-incomplete runs (which had found real, TDD-verified bugs) looked identical to runs where the review never happened.
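The three-way split can be summarized as a small decision rule. This sketch assumes the gate's findings have already been sorted into substantive failures and audit-trail gaps; how quality_gate.py actually classifies individual checks is not described here, so the two counters are hypothetical inputs.

```python
def gate_verdict(substantive_failures: int, audit_gaps: int) -> str:
    """Map two finding counts onto the three v1.5.7 gate verdicts."""
    if substantive_failures > 0:
        # Review incomplete, specs missing, verifier never ran, and so on.
        return "GATE FAILED"
    if audit_gaps > 0:
        # Findings stand on their own; only the paperwork is incomplete.
        return "GATE PASSED WITH CLEANUP NEEDED"
    return "GATE PASSED"


print(gate_verdict(substantive_failures=0, audit_gaps=3))
```

Substantive failures take precedence, so a run with both kinds of problem is still reported as GATE FAILED.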

Why documentation matters

Adding community documentation to the pipeline produces measurably better results. In a controlled experiment across multiple repositories, documentation-enriched runs found more bugs, different bugs, and higher-confidence bugs than code-only baselines. The documentation gives auditors spec language to check against, turning "this code looks odd" into "this code contradicts the documented behavior."

Roadmap

The Quality Playbook is developed in a two-half arc. The v1.5.x series is the QC half — the quality-control infrastructure for finding bugs and validating skill prose. The v1.6+ series is the QI half — quality-improvement built on top of that infrastructure: better requirements review, statistical control over the development process, and eventually multi-operator workflows. Each version below has a brief description, a tag (most recent for that minor version), and links to its design and implementation-plan documents.

v1.8 — Cross-operator workflow (future). Multiple QPB operators sharing calibration data, lever-pull history, and benchmark results across sites. Lets a team adopt the playbook and accumulate evidence collectively rather than each operator running a private cycle. Design forthcoming.
v1.7 — Statistical process control machinery. Statistical process control for both the improvement loop (multi-cycle calibration data with control charts on lever-pull deltas) and the SDLC itself (defect-rate trending, recurrence-class detection, process-change drivers). Includes multi-cell calibration cycles — multiple lever pulls in parallel using cell.json's structured output instead of one at a time — and cross-version trend tracking — recall trajectories per benchmark per release, with control limits inferred from accumulated history. Both are next iterations of QPB's own development process; the SPC framework's first proof point is the QPB development workflow itself. Design at docs/design/QPB_v1.7.0_Design.md, spec at docs/design/QPB_v1.7.0_Implementation_Plan.md.
v1.6 — Requirements review and management UX. Operator-facing system for reviewing and managing the requirements QPB derives from a target. The UX walks the operator through each requirement (Wiegers quality attributes — clarity, completeness, consistency, testability, necessity, feasibility, verifiability), surfaces evidence from formal docs, informal sources (chat archives, design notes), and exploration findings, and helps validate or refine the REQ set. Includes targeted playbook runs that check specific requirements against the code — e.g., re-derive REQ-007 against the updated source, verify a logging requirement against bin/audit_log.py, compare the current REQ-set against a prior run for drift detection. Closes the QI loop: defect data from review sessions feeds back into Phase 1/2 prompt-tuning calibration cycles. Design at docs/design/QPB_v1.6.0_Design.md, spec at docs/design/QPB_v1.6.0_Implementation_Plan.md, feature proposal at docs/design/QPB_v1.6.x_Requirements_Review_Proposal.md.
v1.5.6 — Adopter-facing distribution + Pattern 7 displacement-recovery cycle. Shipped turnkey install/distribution (bin/install_skill.py, AGENTS-driven setup, multi-environment auto-detection), code-only-mode documentation/instrumentation for empty reference_docs/, and adopter-grade AI orchestration patterns documentation; the Pattern 7 displacement-recovery cycle also shipped with a documented revert, keeping the budget cap at 3-5. Tag v1.5.6. Design at docs/design/QPB_v1.5.6_Design.md, spec at docs/design/QPB_v1.5.6_Implementation_Plan.md.
v1.5.5 — Autonomous improvement-loop infrastructure. Run-state instrumentation (quality/run_state.jsonl, quality/PROGRESS.md), phase-boundary cross-validation (catches the failure mode where a phase reports "complete" with empty artifacts), Phase 5 source-edit guardrail, calibration-cycle orchestrator template, four matplotlib visualization charts, plus seven v1.5.4 self-audit defect fixes and four inherited regression-replay test failures cleared. Tag: in flight (HEAD on the 1.5.5 branch; not yet tagged). Design at docs/design/QPB_v1.5.5_Design.md, spec at docs/design/QPB_v1.5.5_Implementation_Plan.md.
v1.5.4 — Skill-as-code via AI-driven file role tagging + Pattern 7. Phase 1 produces quality/exploration_role_map.json with one record per in-scope file (role tag: skill-prose / skill-tool / code / test / docs / etc.); replaces v1.5.3's mechanical Code/Skill/Hybrid classifier whose LOC denominator was getting polluted by playbook artifacts shipped into benchmark targets. Pipeline activation reads the role map (always-Hybrid downstream). Pattern 7 — Composition and Mount-Context Awareness — added as the seventh exploration pattern. First calibration cycle measured +0.20 recall on chi-1.3.45 with documented displacement asterisk. Tag v1.5.4. Design at docs/design/QPB_v1.5.4_Design.md, spec at docs/design/QPB_v1.5.4_Implementation_Plan.md.
v1.5.3 — Four-pass skill-derivation pipeline + project-type classifier. Extends the v1.5.0 divergence model to AI-skill targets where SKILL.md prose IS the spec. Phase 0 classifier (bin/classify_project.py) tags each target as Code / Skill / Hybrid. Four-pass derivation pipeline: Pass A naive coverage, Pass B mechanical citation extraction with Jaccard pre-filter (~93× speedup), Pass C formal REQ + UC production, Pass D coverage audit with structured Council inbox. Curated REQUIREMENTS.md comparable to the Haiku reference (~65 unique REQ definitions). Cross-target validation against five code targets and three pure-skill targets. Tag v1.5.3. Design at docs/design/QPB_v1.5.3_Design.md, spec at docs/design/QPB_v1.5.3_Implementation_Plan.md.
v1.5.2 — Council review hardening + cardinality gate. Two nine-panelist Council-of-Three reviews cleared the release. New _finalize_iteration helper runs quality_gate.py as a subprocess after each iteration and writes structured PROGRESS.md output. Cardinality gate hardening: citation excerpts byte-equal verified against the producer's extract_excerpt output, strict boolean type checks, body-prose vs. tier-marker disambiguation. Citation verifier hardening — citation-stale detection now runs end-to-end. Phase 6 verdict-mapping guard so a fail finalizer no longer demotes to partial because the gate log contains "warn." Tag v1.5.2. Design at docs/design/QPB_v1.5.2_Design.md, spec at docs/design/QPB_v1.5.2_Implementation_Plan.md.
v1.5.1 — Phase 5 writeup hydration. Phase 5 prompt carries a MANDATORY HYDRATION STEP: a BUGS.md → writeup field map, a worked BUG-004 example, and a per-writeup confirmation checklist forbidding empty backticks, empty diff fences, and angle-bracket placeholders. quality_gate.py's check_writeups fails on any of five template-sentinel strings, or on ```diff fences containing no +/- lines. Case-insensitive diff-fence detection so mixed-case fences don't slip past the inline-fix-diff check. Tag v1.5.1 (https://github.com/andrewstellman/quality-playbook/releases/tag/v1.5.1). Design at docs/design/QPB_v1.5.1_Design.md, spec at docs/design/QPB_v1.5.1_Implementation_Plan.md.
v1.5.0 — Divergence model + consolidated quality/ layout. Introduces the divergence framing: a defect is a divergence between documented intent and code implementation, not a judgment about whether the code is "good." Bootstrap artifacts tracked in git as project history (quality/runs/, quality/control_prompts/). Foundation for the v1.5.x quality-control arc. Tag v1.5.0. Design at docs/design/QPB_v1.5.0_Design.md, spec at docs/design/QPB_v1.5.0_Implementation_Plan.md.
v1.4 — Six-phase architecture + iteration strategies + TDD red-green. Playbook splits into six phases (Explore, Generate, Review, Audit, Reconcile, Verify), each running in its own context window with exit gates verifying prerequisites and artifact completeness. Four iteration strategies (gap, unfiltered, parity, adversarial) consistently add 40-60% more confirmed bugs on top of the baseline. Every confirmed bug requires a regression-test patch, a red-phase log proving the test fails on unpatched code, and a green-phase log proving the fix resolves it. Mechanical quality gate (quality_gate.py) validates artifact completeness as the final Phase 6 step. Validated against Express.js, Gson, virtio. Tag v1.4.6 (most recent v1.4.x). Design at docs/design/QPB_v1.4_Design.md. No standalone implementation plan — design contains the work breakdown.
v1.3 — Mechanical verification + iterative convergence. Mechanical artifacts with integrity check: extraction commands (awk/grep) produce per-function evidence files, append themselves to quality/mechanical/verify.sh, and Phase 6 re-runs the script and diffs against saved files (catches the failure mode where the model executes the right command but writes fabricated output). Contradiction gate compares executed evidence (mechanical artifacts, regression-test results, TDD red-phase failures) against prose artifacts; if they contradict, the executed result wins. Self-contained iterative convergence: Phase 0 builds a seed list from prior runs, mechanically re-checks each seed; runs iterate up to 5 times until net-new bugs = 0. Tag v1.3.50 (most recent v1.3.x). Design across multiple incremental files: docs/design/QPB_v1.3.0_Design.md, docs/design/QPB_v1.3.7_Design.md, docs/design/QPB_v1.3.21_Design.md, docs/design/QPB_v1.3.35_Design.md, docs/design/QPB_v1.3.50_Design.md, and others — each captures the design state at that increment.
v1.2 — Initial public release. First tagged version of the playbook with the inspection-style workflow (deskcheck → walkthrough → inspection) and the bug-finding-as-divergence-detection methodology. Tag v1.2.16 (most recent v1.2.x). Design at docs/design/QPB_v1.2.15_Design.md.

What's new in v1.5.8

v1.5.8 makes Windows a first-class supported platform for both Mode A (claude) and Mode B (codex via run_playbook), closes the cp1252-on-Windows hazard surface at all three sites where Python's system-locale default codec was eating data, formalizes the Worker self-Council protocol as load-bearing development methodology, graduates the AUDIT-table invariant test pattern to a standard mechanism after three confirmed reuses, and lands the v2 blind CVE benchmark methodology under Security Research/CVE_BENCHMARK_METHODOLOGY_v2.md.
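The cp1252 hazard mentioned above is easy to reproduce on any platform by naming the codec explicitly. The sketch below is a standalone illustration, not QPB code: it shows a character that cp1252 cannot encode, UTF-8 bytes that turn into mojibake or fail outright when read as cp1252, and a round trip that survives once encoding="utf-8" and errors="replace" are explicit. The helper names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def can_encode_cp1252(text: str) -> bool:
    """False when writing this text through a cp1252 stream would raise."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


def read_utf8_bytes_as_cp1252(text: str) -> Optional[str]:
    """Simulate reading a UTF-8 log with the cp1252 default.

    Returns the corrupted string, or None if decoding raises.
    """
    try:
        return text.encode("utf-8").decode("cp1252")
    except UnicodeDecodeError:
        return None


def roundtrip_utf8(text: str) -> str:
    """Write and read a log file with the codec spelled out on both sides."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "run.log"
        log.write_text(text, encoding="utf-8")
        return log.read_text(encoding="utf-8", errors="replace")


GE, LEFT_ARROW, EM_DASH = "\u2265", "\u2190", "\u2014"

print(can_encode_cp1252(GE))                  # cp1252 has no slot for U+2265
print(read_utf8_bytes_as_cp1252(EM_DASH))     # three wrong characters
print(read_utf8_bytes_as_cp1252(LEFT_ARROW))  # byte 0x90 is undefined
print(roundtrip_utf8(GE + LEFT_ARROW + EM_DASH) == GE + LEFT_ARROW + EM_DASH)
```

The same two keyword arguments apply to subprocess.run when text mode is used, which is why new subprocess.run and open call sites are the ones singled out for review.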

Windows harness compatibility. The 180 chain (10 followups) makes the harness fully cross-platform: psutil for process management (replaces POSIX-only os.kill / os.killpg — also fixes a latent Windows tree-kill-orphans-descendants bug), CREATE_NO_WINDOW instead of DETACHED_PROCESS so background spawns don't flash console windows, windows-curses automatically pulled in via the new bin/harness/requirements.txt (sys_platform=='win32' marker), signal.SIGHUP / signal.SIGKILL lazy-resolved (don't exist on Windows), git core.longpaths=true for MAX_PATH headroom. Install harness deps once via python3 -m pip install -r bin/harness/requirements.txt. Windows codex Mode B verified end-to-end with gpt-5.5 / gpt-5.4-mini.
The cp1252-on-Windows hazard surface, closed. On Windows, Python's default codec for stdout/stderr (pre-185), log file reads (pre-189), and subprocess stdin writes (pre-190) was cp1252 — which silently corrupts or hard-crashes on common high-bit characters (em-dash, ≥, ←, emoji verdict markers). v1.5.8 closes all three sites with explicit encoding="utf-8" + errors="replace", AND each site landed with an AUDIT-table invariant test that prevents the same defect class from regressing at a new site. Future PR reviewers reference Section O ("Windows cp1252 hazard surface") in the design doc before approving any new subprocess.run / open(text=True) site.
include_iterations opt-in plan-row field (harness plans). Per-row boolean (default false) — when true, the Mode A launch prompt drops the "Do not run the iteration strategies" exclusion clause so QPB runs all 4 iteration strategies (gap/unfiltered/parity/adversarial) per its documented default. Empirical caveat: in a 2026-06-03 blind-CVE benchmark A/B test, iterations made detection worse in 2/2 directly-comparable rows because the adversarial pass over-dismisses real findings when the model's call-graph reasoning is shallow. Default include_iterations: false is the recommended setting for security-targeted plans.
kill <harness-run> cancels PENDING runs + new CANCELLED terminal state. Previously kill only SIGKILL'd RUNNING rows; PENDING rows would silently re-launch when a pool slot freed. Now kill <harness-run> stops EVERYTHING in the plan: RUNNING gets SIGKILL, PENDING gets transitioned to the new CANCELLED terminal state via the new cancel_pending_run helper. The collector and _try_acquire_pool_slot both treat CANCELLED as terminal. Status / TUI render CANCELLED rows in their own section with a C column.
The pre-186 ABANDONED_STARVED 3600s PENDING-run deadline is REMOVED. For sequential pool=1 plans with long-running rows (e.g., 7 × 45min security runs), the deadline killed runs before the pool could free a slot for them. Replaced with operator-visible signals: status shows pending Nh Mm waiting time + collector heartbeat-age health, and qpb_harness force-run <run-NN> (CLI) + E keybinding (TUI) explicitly launch a PENDING row out of pool when the operator decides the wait is wrong.
Worker self-Council protocol (ai_context/DEVELOPMENT_PROCESS.md). Formalization of the Parallel-Agent Council flavor with stricter discipline: 3 panelist charters in parallel via the implementing AI's Task tool, each Write-to-file artifact at Reviews/v<NNN>_self_council/panelist_<X>_<charter>.md, synthesis to synthesis.md, FIX-REQUIRED iterates in-branch BEFORE filing the v1 review-request. Has demonstrably caught ship-blocker