agent-booster-pack-proof
v2.0.1
Published
Proof-first Pi extension that uses a red-green-refactor cycle when behavior should be specified in tests first, with built-in parsing for popular test frameworks. Renamed from pi-proof; old name deprecated.
Maintainers
Readme
agent-booster-pack-proof
A proof-first extension for Pi, the terminal coding agent.
It nudges the agent into a red-green-refactor loop when the next change needs a test. It stays out of the way for docs, config, and exploration.
Renamed from
pi-proof. The old npm name is deprecated; migrate toagent-booster-pack-prooffor ongoing updates. This package is one of four sibling packages in Agent Booster Pack; install the meta-packageagent-booster-packfor the full bundle.
Install
Install Pi:
npm install -g @mariozechner/pi-coding-agent
piInstall agent-booster-pack-proof:
pi install npm:agent-booster-pack-proofIf Pi is already running, run /reload.
Use it
Ask the agent to change behavior:
Fix the off-by-one error in paginationThe agent decides if proof mode fits. If it does, the agent writes a failing test, makes it pass, refactors, and finishes.
Toggle by hand:
/proofWhen proof mode helps
Reach for it when:
- A bug has a clear failing case.
- A feature adds or changes observable behavior.
- A business rule needs to be locked down before code.
Skip it when:
- You are editing docs, config, manifests, or lockfiles.
- You are scaffolding plumbing.
- You are exploring and the behavior is not settled.
By default the extension is advisory. It tells the agent that proof mode is available. The agent decides. Once on, the loop is strict.
Why this works
Tests give the agent ground truth. Without it, the agent guesses.
The research backs this up. TDFlow (2025) found that human-written acceptance criteria improve agent accuracy by 12–46 points. AlphaCodium (2024) raised GPT-4 accuracy from 19% to 44% with a test-execute-fix loop. Reflexion (NeurIPS 2023) hit 91% on HumanEval, up from 80%.
Tests document, too. They show the next reader, human or agent, how the system actually behaves.
Without test discipline, agents tend to:
- Implement before specifying behavior.
- Change too much at once.
- Mix features with refactors.
- Declare success from plausibility, not proof.
How it works
Three phases. Each phase tells the agent what is allowed.
stateDiagram-v2
OFF --> SPECIFYING : /proof or proof_start
SPECIFYING --> IMPLEMENTING : test fails
IMPLEMENTING --> REFACTORING : tests pass
REFACTORING --> SPECIFYING : new turn
REFACTORING --> OFF : /proof or proof_doneSPECIFYING. The agent writes a failing test. Production write and edit calls are blocked. Test files and config files pass through. A failing test advances to IMPLEMENTING. If the test fails because a module cannot be imported, the agent gets a one-shot allowance to create a minimal stub so the test can load — the allowance clears after the next run.
IMPLEMENTING. The agent writes the smallest code that makes the test pass. A passing test advances to REFACTORING.
REFACTORING. The agent restructures. Failing tests tell the agent to revert. No new behavior here.
A new turn — not proof_done — closes the cycle and returns to SPECIFYING. The cycle counter ticks then.
Phase transitions ride on test results. The extension runs tests after every file write and parses the output. SPECIFYING only advances after it sees a test file written or a manual test run; unrelated failures do not push the phase forward.
Some files skip the loop: configs, lockfiles, docs, scaffolding. The extension recognizes them by path.
Test integration
The extension finds your test command from what it sees in the project:
| Detected | Runs |
|----------|------|
| package.json with test script | npm test |
| Cargo.toml | cargo test |
| go.mod | go test ./... |
| pytest.ini or pyproject.toml | pytest |
If it cannot tell, it asks once.
It recognizes test files by name: *.test.*, *.spec.*, *_test.*, *_spec.*, plus files under __tests__/ or test/.
It parses output from:
| Language | Frameworks | |----------|-----------| | JS/TS | Jest, Vitest, Mocha, Bun, AVA | | Python | pytest, unittest | | Go | go test | | Rust | cargo test | | Ruby | RSpec, Minitest | | Java/Kotlin | Gradle; JUnit/Maven (summary) | | C# | dotnet test | | Swift | XCTest, Swift Testing | | PHP | PHPUnit, Pest | | Elixir | ExUnit | | Universal | TAP |
When per-test lines aren't found, the parser falls back to summary regex. Parsed results show in the tool result and in the HUD.
HUD
When proof mode is on, a widget shows:
- The phase, color-coded.
- The cycle count.
- Passed, failed, duration.
- Up to seven test results, with an overflow indicator.
It updates after each test run.
Tools
| Interface | What it does |
|-----------|--------------|
| proof_start | Agent tool. Enters proof mode. |
| proof_done | Agent tool. Exits proof mode. |
| /proof | Slash command. Manual toggle. |
The legacy tdd_start, tdd_done, and /tdd still work.
Limits
This extension enforces the loop, not the quality of the tests.
- Shallow user stories give shallow confidence.
- Proof mode is opt-in per task. The extension does not force it on every change.
- Only SPECIFYING blocks writes. IMPLEMENTING and REFACTORING steer through prompts.
- Shell-based production writes during SPECIFYING are warned, not blocked.
- The import-only stub allowance lets SPECIFYING produce a minimal production stub when the test file cannot load.
- A new turn closes the cycle, not
proof_done. A long turn can stay in REFACTORING across many writes. - No state between sessions.
- No LLM review. The extension trusts the test runner.
Development
git clone [email protected]:kreek/agent-booster-pack.git
cd agent-booster-pack/agent-booster-pack-proof
npm install
npm run install-hooks
npm testThe pre-commit hook runs biome check --staged.
src/
index.ts Extension entry, phase machine, HUD, tools
parsers.ts Test output parsers (13 frameworks)
test/
parsers.test.ts Parser testsTo add a parser, append a TestLineParser to defaultParsers in src/parsers.ts. For development installs from a local checkout, npm run install-ext symlinks the repo into ~/.pi/agent/extensions/agent-booster-pack-proof.
Eval
The extension ships with an eval harness built on pi-do-eval. It runs Pi with agent-booster-pack-proof loaded against small coding projects and scores proof-first compliance, test quality, and correctness.
cd eval
npm install
npm run eval -- list # list trials, variants, suites
npm run eval -- run small # fast regression
npm run eval -- run --trial temp-api --variant typescript-vitest
npm run view # http://localhost:3333
npm run eval -- regress small # compare against previous runsmall is for day-to-day changes. full is for releases.
Suites run serially. --concurrency opts into parallel runs, but the harness refuses values above 1 when the worker or judge provider is subscription-backed.
License
MIT

