wicked-testing

v0.9.0

Published

2 days ago

Standalone QE library for AI coding CLIs — 5-core testing surface (plan/authoring/execution/review/insight), SQLite ledger with fixed-SQL oracle, 3-agent acceptance pipeline, bus+brain optional integration.

Downloads

1,749

0High
0Medium
0Low

mike.parcewski

ai testing qe quality-engineering acceptance-testing skills claude-code gemini-cli cursor codex kiro sqlite test-oracle wicked-garden

           _      _            _       _            _   _             
 __      _(_) ___| | _____  __| |     | |_ ___  ___| |_(_)_ __   __ _ 
 \ \ /\ / / |/ __| |/ / _ \/ _` |_____| __/ _ \/ __| __| | '_ \ / _` |
  \ V  V /| | (__|   <  __/ (_| |_____| ||  __/\__ \ |_| | | | | (_| |
   \_/\_/ |_|\___|_|\_\___|\__,_|      \__\___||___/\__|_|_| |_|\__, |
                                                                 |___/

48 skills: 40 specialist skills running in isolated forked contexts, coordinated by 8 workflow skills. A 3-agent acceptance pipeline that eliminates self-grading.

npx wicked-testing install

Works with Claude Code, Antigravity, Cursor, Codex, Kiro, Copilot, OpenCode, and Pi.

Published to npm as wicked-testing v0.8.0 · site at wt.wickedagile.com. The wicked-qe rename is pending — the wicked-qe bin alias already ships.

The Problem

When you ask an AI agent to test its own work, it grades its own homework. Self-reported PASS rates on agentic test runs sit 80%+ above human-reviewed rates. The agent that wrote the code also runs the tests and evaluates the results — there is no independence at any layer.

The industry answer has been scripted test frameworks: Playwright, pytest, k6, axe-core. But those only run what you already thought to test. They don't tell you what to test, whether the tests are any good, whether the results mean anything, or why the suite keeps failing intermittently on CI.

wicked-testing gives your AI CLI a complete QE team — from planning through execution through judgment — with enforced separation between the agent that runs tests and the agent that evaluates them.

What You Get

claude plugins marketplace add mikeparcewski/wicked-testing
claude plugins install wicked-testing

Then:

# Generate a shift-left test strategy from your codebase
/wicked-testing:plan src/auth/ --project auth-service

# Run the 3-agent acceptance pipeline with enforced reviewer isolation
/wicked-testing:acceptance-testing scenarios/login-positive.md

# Ask plain-English questions about your test history
/wicked-testing:insight "what was the last verdict for the login scenario?"

Under the hood: a project-local SQLite ledger, 40 specialist skills routed by 8 workflow skills — every specialist runs in an isolated forked context (context: fork) — and a public event contract for wicked-garden integration.

48 Skills, One Namespace

Everything is a skill. 8 workflow skills are the user-invokable entry points (plan, authoring, execution, acceptance-testing, review, insight, setup, update). The other 40 are specialist skills marked context: fork — the workflow skills dispatch them into isolated contexts with no shared conversation history.

Tier-1 Specialist Skills — Public Contract

The 15 Tier-1 specialist skills form the stable integration surface. wicked-garden and other consumers depend only on these. Dispatch names are unchanged from the earlier agent-based layout: wicked-testing:<name>.

| Skill | Invoked By | What It Does | |-------|-----------|--------------| | test-strategist | plan | Maps codebase to test scenarios — positive, negative, edge cases | | testability-reviewer | plan | Blocks designs that will be hard to test before a line is written | | requirements-quality-analyst | plan | Applies SMART+T to acceptance criteria — ready-for-design or needs-iteration | | risk-assessor | plan | Scores risks by likelihood × impact, produces a mitigation matrix | | test-designer | authoring | Full write→execute→analyze→verdict loop from a scenario file | | test-automation-engineer | authoring | Generates test code in the project's detected framework | | contract-testing-engineer | authoring | Consumer-driven contract tests (Pact-style), breaking-change detection | | code-analyzer | authoring | Static quality + testability signals, ship/fix/refactor verdict | | acceptance-test-writer | execution | Evidence-gated test plan — every step declares expected evidence and an assertion | | acceptance-test-executor | execution | Executes plan mechanically, captures artifacts, makes no judgment | | acceptance-test-reviewer | review | Reads cold evidence only (advisory allowed-tools: Read) — never sees executor context | | scenario-executor | execution | Runs a scenario markdown file step-by-step | | semantic-reviewer | review | Gap Report per AC: aligned / divergent / missing | | production-quality-engineer | insight | Post-deploy health: healthy / degraded / unhealthy + next action | | test-oracle | insight | Plain-English questions → 12 named parameterized SQL queries. No ad-hoc SQL. |

Tier-2 Specialist Skills — Internal

25 domain specialists routed by the workflow skills, each in its own forked context. Never break downstream consumers because they are not part of the public contract.

| Specialist | Domain | |-----------|--------| | integration-test-engineer | Real-service wiring tests — testcontainers, docker compose, no mocks | | ui-component-test-engineer | React/Vue/Svelte component tests — RTL, user-event, role queries | | e2e-orchestrator | Full journey Playwright tests — multi-context, API seeding, journey diagrams | | visual-regression-engineer | Playwright + pixelmatch baselines, dynamic region masking | | a11y-test-engineer | axe-core / pa11y, WCAG 2.1 AA, keyboard flows, focus management | | load-performance-engineer | k6 / locust / hey, P95/P99 assertions, bottleneck diagnosis | | chaos-test-engineer | Toxiproxy / Chaos Mesh failure injection, blast radius, hypothesis→verdict | | fuzz-property-engineer | Hypothesis (Python) / fast-check (TS) / AFL++ — round-trip and invariant testing | | mutation-test-engineer | Stryker / Mutmut / Pitest kill-rate analysis, surviving-mutant triage | | localization-test-engineer | Pseudolocalization, RTL layout, CLDR pluralization, string-length overflow | | data-quality-tester | great_expectations / dbt-test suites, migration forward+rollback, drift detection | | observability-test-engineer | Structured log field assertions, OTel span coverage, cardinality hazard detection, PII-in-signals | | flaky-test-hunter | Reproduce, root-cause, quarantine — never "add retry" as a fix | | test-data-manager | factory_boy / fishery factories, referential consistency, PII-scrubbed snapshots | | exploratory-tester | Charter-driven sessions, SFDIPOT heuristics, ranked findings, follow-up charters | | coverage-archaeologist | lcov + git-blame-age + call-graph ranked by impact × exposure — top-N not 500 | | security-test-engineer | SAST/DAST/secrets + authz/authn (JWT, IDOR, CSRF) — OWASP ASVS traceability | | ai-feature-test-engineer | Prompt-injection library, hallucination drift, output-drift, judge ≠ SUT isolation | | test-impact-analyzer | git diff + call-graph + ledger coverage → ranked "which tests for this diff" | | release-readiness-engineer | Aggregates verdicts / flakes / risk / coverage / prod-SLO → GO/CONDITIONAL/NO-GO | | iac-test-engineer | terraform/checkov/tflint/opa/kyverno/helm — plan-not-clean as verdict signal | | compliance-test-engineer | SOC2/HIPAA/GDPR/PCI control mapping → auditor-ready control-evidence + coverage CSV | | snapshot-hygiene-auditor | Rot / over-broad / rubber-stamp / dead detection across .snap/.golden/cassettes | | test-code-quality-auditor | Test-code smells (assertion-free, tautological, hardcoded sleeps, shared-state) | | incident-to-scenario-synthesizer | Stack trace → new scenario with incident-linkage + pending-review task |

The 3-Agent Acceptance Pipeline

The acceptance-testing skill (/wicked-testing:acceptance-testing) eliminates the self-grading problem with enforced role separation. Writer, Executor, and Reviewer are three separate skills, each dispatched into an isolated forked context:

Writer ──→ Test Plan ──→ Executor ──→ Evidence ──→ [context.md] ──→ Reviewer ──→ Verdict
                                                    (cold brain
                                                     knowledge)

Writer (allowed-tools: Read, Grep, Glob, Skill) — reads scenario + code, optionally queries wicked-brain for prior flaky patterns and tool quirks, produces an evidence-gated plan where every step declares expected evidence and an assertion. Cannot execute or write state.
Executor (allowed-tools: Read, Write, Bash) — follows the plan mechanically. Captures stdout, stderr, exit codes, and file artifacts. Optionally emits wicked.testrun.* events via wicked-bus. Makes no judgment about results.
Reviewer (allowed-tools: Read) — reads cold evidence files only. Never sees the executor's context, reasoning, or stdout. Evaluates assertions against artifacts. Cannot execute.

Cold context injection: before dispatching Reviewer, the orchestrator may materialize a context.md in the evidence directory with non-prejudicial domain knowledge from wicked-brain (WCAG thresholds, tool quirks). Prior verdicts, pass/fail rates, and anything run-specific are strictly excluded — if Reviewer sees prejudicial content it returns INCONCLUSIVE with CONTEXT_CONTAMINATION.

Reviewer isolation is enforced three ways on Claude Code: allowed-tools frontmatter, context: fork (each role runs as a separate forked skill with no shared conversation history), and evidence-only dispatch (the Reviewer receives only file paths — never the executor's transcript). Advisory on other CLIs. The separation is what makes the verdict trustworthy.

Workflow Skills

The user-invokable entry points. There is no separate command layer — each of these is a skill, and the familiar /wicked-testing:<name> invocations trigger them directly.

| Skill | Invoke | Description | |-------|--------|-------------| | setup | /wicked-testing:setup | Initialize for this project — detect CLI tools, create config | | plan | /wicked-testing:plan | Shift-left test strategy from code or feature description | | authoring | /wicked-testing:authoring | Author scenario files and test code | | execution | /wicked-testing:execution | Run a scenario and capture evidence | | acceptance-testing | /wicked-testing:acceptance-testing | Full 3-agent pipeline: Writer → Executor → Reviewer | | review | /wicked-testing:review | Evaluate captured evidence | | insight | /wicked-testing:insight | Domain health, run history, oracle queries, stats, reports | | update | /wicked-testing:update | Check for and install wicked-testing updates across detected CLIs |

Most workflow skills support --json for machine-readable output. Exceptions are /wicked-testing:review and /wicked-testing:authoring, which are narrative-heavy and don't emit a single envelope.

Storage and Integration

.wicked-testing/
  config.json              Project configuration + detected capabilities
  wicked-testing.db        SQLite ledger (WAL mode, 7 tables)
  evidence/
    <run-id>/              <run-id> is the canonical run UUID
      manifest.json        PUBLIC contract — read by consumers (see schemas/evidence.json)
      evidence.json        Executor summary (internal)
      step-N.json          Per-step evidence (internal)
      context.md           Optional reviewer context (non-prejudicial; excluded from manifest artifacts)
  projects/{id}.json       Canonical JSON (dual-write with SQLite)
  strategies/{id}.json
  scenarios/{id}.json
  runs/{id}.json           Canonical run records (flat, never collides with evidence/<run-id>/)
  verdicts/{id}.json
  tasks/{id}.json

Dual-write: every record writes JSON first (fsync'd), then SQLite. On SQLite failure the store degrades to JSON-only. Oracle and task operations require SQLite; every other skill continues in JSON-only mode.

Provenance: all records carry run_id, scenario_id, agent, verdict, and timestamps — every verdict traces back to the evidence that produced it.

wicked-bus integration

When wicked-bus is on PATH, wicked-testing emits on every significant action. Event names match the public catalog in docs/INTEGRATION.md §4.

| Event | When | |-------|------| | wicked.test.strategy.generated | Test strategy record created | | wicked.scenario.authored | Scenario record created or updated | | wicked.testrun.started | Run row written with status: running | | wicked.test.run.completed | Run row updated with finished_at (any terminal status) | | wicked.test.verdict.created | Reviewer writes a verdict | | wicked.evidence.captured | evidence/<run-id>/manifest.json written |

Emission is fire-and-forget: if wicked-bus is absent or the spawn fails, wicked-testing continues without error. See lib/bus-emit.mjs.

wicked-brain integration

When wicked-brain is present, wicked-testing writes memories on high-signal events: persistent FAIL patterns, flaky test discoveries, coverage gaps found by the archaeologist. If absent, no-op.

Install

npx wicked-testing install

Detects which AI CLIs are present via identity markers (config.json, settings.json, plugins/, etc.), copies all 48 skills (all specialists run with context: fork) into each CLI's home-relative skill directory (~/.claude/skills/, ~/.gemini/antigravity-cli/plugins/wicked-testing/skills/, ~/.codex/skills/, ~/.cursor/skills/, ~/.kiro/skills/), installs lifecycle hooks (JSON hooks for Claude Code, Antigravity, Codex, Cursor, Kiro, and Copilot; TypeScript plugins for OpenCode and Pi), runs a bootstrap self-test, and prints a per-target isolation-tier note (hard-enforced on Claude Code, advisory on everyone else). Idempotent — safe to run multiple times.

Claude Code users: npx wicked-testing install is the preferred path and drops everything where Claude Code's skill resolver already looks. A .claude-plugin/marketplace.json also ships so you can register via the plugin-system install if you prefer:

claude plugins marketplace add mikeparcewski/wicked-testing
claude plugins install wicked-testing

Both paths work; pick one.

# Install for a specific CLI only
npx wicked-testing install --cli=claude

# Install to a custom path
npx wicked-testing install --path=~/.claude

# Skip the SQLite bootstrap self-test (faster; skips integrity check)
npx wicked-testing install --skip-self-test

# Force reinstall even if the installed version matches
npx wicked-testing install --force

# Override CLI detection (useful when identity markers don't match)
npx wicked-testing install --assume-cli=gemini

# Check what's installed
npx wicked-testing status

# Verify the installation is healthy
npx wicked-testing doctor

# Check installed version satisfies a semver spec (exit 0/1/2)
npx wicked-testing check --require=^0.2.0

# Update to the latest version
npx wicked-testing update

# Remove
npx wicked-testing uninstall

Scenario Format

Scenarios are self-contained markdown files — the executable unit for the acceptance pipeline.

---
name: api-health-check
description: Validate the health endpoint returns 200 with expected JSON
category: api
tools:
  required: [curl]
timeout: 30
---

## Steps

### Step 1: HTTP 200 response (curl)

```bash
curl -sf https://api.example.com/health

Expect: Exit code 0, JSON response with status: ok


See [SCENARIO-FORMAT.md](SCENARIO-FORMAT.md) for the full spec. Working examples are in [scenarios/examples/](scenarios/examples/) — start with `smoke-test-execution.md` to verify the pipeline fires real commands.

---

## Documentation

| Doc | For |
|-----|-----|
| [ARCHITECTURE.md](ARCHITECTURE.md) | Library maintainers — component diagram, design decisions |
| [HOW-IT-WORKS.md](HOW-IT-WORKS.md) | Walkthrough of the full E2E pipeline |
| [SCENARIO-FORMAT.md](SCENARIO-FORMAT.md) | Scenario file spec with examples |
| [DATA-DOMAIN.md](DATA-DOMAIN.md) | 7-table SQLite schema and DomainStore API |
| [docs/INTEGRATION.md](docs/INTEGRATION.md) | Consumers integrating with wicked-testing |
| [docs/EVIDENCE.md](docs/EVIDENCE.md) | Evidence manifest schema |
| [docs/NAMESPACE.md](docs/NAMESPACE.md) | Naming rules for the skill namespace |
| [docs/STANDALONE.md](docs/STANDALONE.md) | Using without wicked-garden |
| [docs/WICKED-GARDEN.md](docs/WICKED-GARDEN.md) | wicked-garden integration guide |

---

## Requirements

- Node.js ≥ 20
- One of: Claude Code, Antigravity, Codex, Cursor, Kiro, Copilot, OpenCode, or Pi
- `better-sqlite3` — installed via `npm install`, pre-built binaries for macOS x64/arm64, Linux x64/arm64, Windows x64. On unsupported platforms, `npm install` falls back to `node-gyp rebuild` which requires a C++ toolchain.

Windows (Git Bash / WSL) is fully supported.

## License

MIT