
agentic-sdlc-wizard v1.23.0

SDLC enforcement for Claude Code — hooks, skills, and wizard setup in one command

Downloads: 548

Claude Code SDLC Wizard

A self-evolving Software Development Life Cycle (SDLC) enforcement system for AI coding agents. Makes Claude plan before coding, test before shipping, and ask when uncertain. Measures itself getting better over time.

Install

Requires Claude Code (Anthropic's CLI for Claude).

Run from your terminal or from inside Claude Code (! prefix):

npx agentic-sdlc-wizard init

Then start (or restart) Claude Code — type /exit then claude to reload hooks. Setup auto-invokes on first prompt — Claude reads the wizard doc, scans your project, and generates bespoke CLAUDE.md, SDLC.md, TESTING.md, and ARCHITECTURE.md. No manual commands needed.

From GitHub (no npm needed):

npx github:BaseInfinity/agentic-ai-sdlc-wizard init

Manual: Download CLAUDE_CODE_SDLC_WIZARD.md to your project and tell Claude: "Run the SDLC wizard setup."

Verify the installed files at any time:

npx agentic-sdlc-wizard check         # Human-readable
npx agentic-sdlc-wizard check --json  # Machine-readable (CI-friendly)

Reports MATCH / CUSTOMIZED / MISSING / DRIFT for every installed file. Exits non-zero on MISSING or DRIFT — use in CI to catch setup regressions.
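The exit-code contract above can be expressed as a small CI gate. A minimal sketch, assuming a per-file status map (the report dict is an example shape, not the tool's real JSON output):

```python
# Hypothetical CI gate mirroring the documented exit behavior:
# MATCH and CUSTOMIZED pass; MISSING or DRIFT fails the build.
# The report dict below is an assumed example, not the tool's real JSON.
report = {
    ".claude/hooks/sdlc.sh": "MATCH",
    ".claude/settings.json": "CUSTOMIZED",
    "SDLC.md": "DRIFT",
}

failing = [f for f, status in report.items() if status in ("MISSING", "DRIFT")]
for f in failing:
    print(f"setup regression: {f} is {report[f]}")

exit_code = 1 if failing else 0  # non-zero fails the CI job
```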

Check for content updates: Tell Claude: "Check if the SDLC wizard has updates" — it reads CHANGELOG.md, shows what's new, and offers to apply changes.

Why Use This

You want Claude Code to follow engineering discipline automatically:

  • Plan before coding (not guess-and-check)
  • Write tests first (TDD enforced via hooks)
  • State confidence (LOW = ask user, don't guess)
  • Track work visibly (TaskCreate)
  • Self-review before presenting
  • Prove it's better (use native features unless you prove custom wins)

The wizard auto-detects your stack (package.json, test framework, deployment targets) and generates bespoke hooks + skills + docs. CI validates the generated assets; cross-stack setup-path E2E is on the roadmap.

What This Actually Is

Five layers working together:

Layer 5: SELF-IMPROVEMENT
  Weekly/monthly workflows detect changes, test them
  statistically, create PRs. Baselines evolve organically.

Layer 4: STATISTICAL VALIDATION
  E2E scoring with 95% CI (5 trials, t-distribution).
  SDP normalizes for model quality. CUSUM catches drift.

Layer 3: SCORING ENGINE
  7 criteria, 10/11 points. Claude evaluates Claude.
  Before/after wizard A/B comparison in CI.

Layer 2: ENFORCEMENT
  Hooks fire every interaction (~100 tokens).
  PreToolUse reminds Claude to write tests first.

Layer 1: PHILOSOPHY
  The wizard document. KISS. TDD. Confidence levels.
  Copy it, run setup, get a bespoke SDLC.
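Layer 4's interval math is standard: with 5 trials you have 4 degrees of freedom, so the 95% bound comes from the t-distribution rather than the normal. A minimal sketch with made-up trial scores:

```python
from math import sqrt
from statistics import mean, stdev

# Illustrative only: five hypothetical trial scores on the 0-10 scale.
scores = [7.5, 8.0, 8.5, 7.0, 8.0]

m = mean(scores)                         # sample mean
sem = stdev(scores) / sqrt(len(scores))  # standard error of the mean
t_crit = 2.776                           # two-sided 95% t value, df = 4
half = t_crit * sem                      # half-width of the interval
ci = (m - half, m + half)
print(f"mean {m:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

With five trials the t multiplier (2.776) is noticeably wider than the normal's 1.96, which is exactly why a small-sample CI is the honest choice here.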

What Makes This Different

| Capability | What It Does |
|---|---|
| E2E scoring in CI | Every PR gets an automated SDLC compliance score (0-10) — measures whether Claude actually planned, tested, and reviewed |
| Before/after A/B testing | Compares wizard changes against a baseline with 95% confidence intervals to prove improvements aren't noise |
| SDP normalization | Separates "the model had a bad day" from "our SDLC broke" by cross-referencing external benchmarks |
| CUSUM drift detection | Catches gradual quality decay over time — borrowed from manufacturing quality control |
| Pre-tool TDD hooks | Before source edits, a hook reminds Claude to write tests first. CI scoring checks whether it actually followed TDD |
| Self-evolving loop | Weekly/monthly external research + local CI shepherd loop — you approve, the system gets better |
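The CUSUM row can be illustrated in a few lines: a one-sided CUSUM accumulates small shortfalls below a target until they cross a threshold, catching slow decay that no single run reveals. The parameters here are illustrative, not the project's actual tuning:

```python
def cusum_decay(scores, target, k=0.25, h=1.0):
    """One-sided CUSUM for downward drift: accumulate shortfalls below
    the target (minus slack k) and alarm when the sum exceeds h.
    k and h are illustrative values, not this project's real tuning."""
    s = 0.0
    for i, x in enumerate(scores):
        s = max(0.0, s + (target - x) - k)
        if s > h:
            return i  # index of the run where drift is flagged
    return None  # no drift detected

# Stable scores, then a slow decay that per-run numbers alone would hide:
print(cusum_decay([8.1, 7.9, 8.0, 7.4, 7.3, 7.2, 7.1], target=8.0))  # → 5
```

The slack k absorbs normal run-to-run noise, so only a sustained shift accumulates toward the alarm threshold h.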

How It Works

Think Iron Man: Jarvis is nothing without Tony Stark. Tony Stark is still Tony Stark. But together? They make Iron Man. This SDLC is your suit - you build it over time, improve it for your needs, and it makes you both better.

The dream: Mold an ever-evolving SDLC to your needs. Replace my components with native Claude Code features as they ship — and one day, delete this repo entirely because Claude Code has them all built in. That's the goal.

WIZARD FILE (CLAUDE_CODE_SDLC_WIZARD.md)
  - Setup guide, used once
  - Lives on GitHub, fetched when needed
        |
        | generates
        v
GENERATED FILES (in your repo)
  - .claude/hooks/*.sh
  - .claude/skills/*/SKILL.md
  - .claude/settings.json
  - CLAUDE.md, SDLC.md, TESTING.md, ARCHITECTURE.md
        |
        | validated by
        v
CI/CD PIPELINE
  - E2E: simulate SDLC task -> score 0-10
  - Before/after: main vs PR wizard
  - Statistical: 5x trials, 95% CI
  - Model-aware: SDP adjusts for external conditions

Self-Evolving System

| Cadence | Source | Action |
|---------|--------|--------|
| Weekly | Claude Code releases | PR with analysis + E2E test |
| Weekly | Community (Reddit, HN) | Issue digest |
| Monthly | Deep research, papers | Trend report |

Every update: regression tested -> AI reviewed -> human approved.

E2E Scoring

Like evaluating adherence to the scientific method, we measure process compliance:

| Criterion | Points | Type |
|-----------|--------|------|
| TodoWrite/TaskCreate | 1 | Deterministic |
| Confidence stated | 1 | Deterministic |
| Plan mode | 2 | AI-judge |
| TDD RED | 2 | Deterministic |
| TDD GREEN | 2 | AI-judge |
| Self-review | 1 | AI-judge |
| Clean code | 1 | AI-judge |

40% deterministic + 60% AI-judged. 5 trials handle variance.
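The 40/60 split follows directly from the rubric: the deterministic criteria sum to 4 points and the AI-judged ones to 6, out of 10. A quick check (the dict encoding is my own sketch of the table above):

```python
# Rubric as listed in the scoring table; the dict encoding is a sketch.
RUBRIC = {
    "TodoWrite/TaskCreate": (1, "deterministic"),
    "Confidence stated":    (1, "deterministic"),
    "Plan mode":            (2, "ai-judge"),
    "TDD RED":              (2, "deterministic"),
    "TDD GREEN":            (2, "ai-judge"),
    "Self-review":          (1, "ai-judge"),
    "Clean code":           (1, "ai-judge"),
}

det = sum(p for p, kind in RUBRIC.values() if kind == "deterministic")
ai = sum(p for p, kind in RUBRIC.values() if kind == "ai-judge")
total = det + ai
print(det, ai, total)  # 4 deterministic + 6 AI-judged = 10, the 40/60 split
```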

Model-Adjusted Scoring (SDP)

| Metric | Meaning |
|--------|---------|
| Raw | Actual score (Layer 2: SDLC compliance) |
| SDP | Adjusted for model conditions |
| Robustness | How well SDLC holds up vs model changes |

  • Robustness < 1.0 = SDLC is resilient (good!)
  • Robustness > 1.0 = SDLC is sensitive (investigate)

Tests Are The Building Blocks

Tests aren't just validation; they're the foundation everything else builds on.

  • Tests >= App Code - Critique tests as hard as (or harder than) the implementation
  • Tests prove correctness - Without them, you're just hoping
  • Tests enable fearless change - Refactor confidently

Official Plugin Integration

| Plugin | Purpose | Scope |
|--------|---------|-------|
| claude-md-management | Required - CLAUDE.md maintenance | CLAUDE.md only |
| claude-code-setup | Recommends automations | Recommendations |
| code-review | Local self-review and PR review (optional) | Local + PRs |

Prove It's Better

Don't reinvent the wheel. Use native/built-in features UNLESS you prove your custom version is better. If you can't prove it, delete yours.

  1. Test the native solution — measure quality, speed, reliability
  2. Test your custom solution — same scenario, same metrics
  3. Compare side-by-side
  4. Native >= custom? Use native. Delete yours.
  5. Custom > native? Keep yours. Document WHY. Re-evaluate when native improves.
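Steps 4 and 5 reduce to a tie-breaking rule: native wins unless custom is strictly better. A toy sketch of that decision:

```python
def choose(native_score: float, custom_score: float) -> str:
    """Steps 4-5 as a rule: ties go to native, so a custom version
    survives only when strictly better on the measured metric."""
    return "custom" if custom_score > native_score else "native"

print(choose(8.0, 8.0))  # native wins ties: delete the custom version
print(choose(7.0, 8.5))  # custom strictly better: keep it, document why
```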

This applies to everything: native commands vs custom skills, framework utilities vs hand-rolled code, library functions vs custom implementations.

How This Compares

This isn't the only Claude Code SDLC tool. Here's an honest comparison:

| Aspect | SDLC Wizard | everything-claude-code | claude-sdlc |
|--------|-------------|------------------------|-------------|
| Focus | SDLC enforcement + measurement | Agent performance optimization | Plugin marketplace |
| Hooks | 3 (SDLC, TDD, instructions) | 12+ (dev blocker, prettier, etc.) | Webhook watcher |
| Skills | 4 (/sdlc, /setup, /update, /feedback) | 80+ domain-specific | 13 slash commands |
| Evaluation | 95% CI, CUSUM, SDP, Tier 1/2 | Configuration testing | skilltest framework |
| CI Shepherd | Local CI fix loop | No | No |
| Auto-updates | Weekly CC + community scan | No | No |
| Install | npx agentic-sdlc-wizard init | npm install | npm install |
| Philosophy | Lightweight, prove-it-or-delete | Scale and optimization | Documentation-first |

Our unique strengths: Statistical rigor (CUSUM + 95% CI), SDP scoring (model quality vs SDLC compliance), CI shepherd loop, Prove-It A/B pipeline, comprehensive automated test suite, dogfooding enforcement.

Where others are stronger: everything-claude-code has broader language/framework coverage. claude-sdlc has webhook-driven automation. Both have npm distribution.

The spirit: Open source — we learn from each other. See COMPETITIVE_AUDIT.md for details.

Documentation

| Document | What It Covers |
|----------|----------------|
| ARCHITECTURE.md | System design, 5-layer diagram, data flows, file structure |
| CI_CD.md | All 4 workflows, E2E scoring, tier system, SDP, integrity checks |
| SDLC.md | Version tracking, enforcement rules, SDLC configuration |
| TESTING.md | Testing philosophy, test diamond, TDD approach |
| CHANGELOG.md | Version history, what changed and when |
| CONTRIBUTING.md | How to contribute, evaluation methodology |

Contributing

PRs welcome. See CONTRIBUTING.md for evaluation methodology and testing.