testbotpro

v0.3.0

Published

4 days ago

Adversarial testing suite: a separate pass that writes unit + E2E tests from an independent, spec-only perspective to catch bugs the author/planner missed.

timothyjordan

testing adversarial test-generation mutation-testing property-based-testing llm agent claude

testbotpro

An adversarial testing suite. A separate pass, optionally on a separate model, that writes unit and E2E tests from a genuinely independent perspective, deliberately probing both what your code should do and what it should not do. It exists to catch the bugs nobody thought of during planning, the ones a same-agent "write code, write tests that pass" loop bakes in instead of catching.

Status: the unit path, property mode, the staged triage classifier, the E2E adversary (HTTP and Playwright), and the mutation-scored benchmark are all shipped and verified. Filling the published leaderboard with pinned-model numbers is the remaining polish step.

Why this is different

Mainstream LLM test generators write tests against the implementation and keep only the tests that pass. That bakes the author's (possibly buggy) behavior into the test oracle. A 2024 study found such filtered tools caught 0% of injected bugs while validating 60-68% of them into the suite as "correct."

testbotpro inverts that. The oracle comes from the spec, never from the code under test. The test-writer works black-box: it sees the public signatures, types, and a normalized spec, never the implementation bodies. A test that fails is the interesting signal, not garbage to be discarded.
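To make the contrast concrete, here is a minimal sketch, assuming a hypothetical `clamp` unit whose spec says the result must lie in `[lo, hi]`. The buggy implementation, the two check helpers, and every name below are illustrative assumptions, not testbotpro's own code.

```typescript
// Hypothetical unit under test. Spec: clamp(x, lo, hi) returns x limited to [lo, hi].
// Planted bug: the upper bound is never enforced.
function clampBuggy(x: number, lo: number, hi: number): number {
  return x < lo ? lo : x;
}

type Check = { name: string; pass: boolean };

// Implementation-derived oracle: record whatever the code returns, then assert it.
// By construction this passes, so the bug is validated into the suite as "correct".
function implementationDerivedCheck(impl: typeof clampBuggy): Check {
  const observed = impl(15, 0, 10); // 15 on the buggy code
  return {
    name: "clamp(15, 0, 10) equals the observed value",
    pass: impl(15, 0, 10) === observed,
  };
}

// Spec-derived oracle: the expected value comes from the spec alone.
function specDerivedCheck(impl: typeof clampBuggy): Check {
  return { name: "clamp(15, 0, 10) equals 10", pass: impl(15, 0, 10) === 10 };
}

const baseline = implementationDerivedCheck(clampBuggy);
const adversarial = specDerivedCheck(clampBuggy);
console.log(baseline.pass, adversarial.pass); // true false
```

The failing spec-derived check is the useful one: it is the only signal that the upper bound is broken.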

Core principles

Spec-only black-box generation is the foundation. The test-writer cannot read the implementation.
The oracle is never derived from the code under test. (See above; this is the whole point.)
Generation modes, all spec-only: example-based adversarial tests; seeded property/metamorphic tests (fast-check) whose invariants are structurally incapable of encoding implementation behavior; and end-to-end tests against a running app, covering HTTP-level authorization/negative invariants ("user A cannot read user B's data") and UI-observable Playwright tests.
A failing test is a candidate bug, surfaced with its assertion diff and the spec claim it violates.
Triage is first-class and staged. Stage 1 is mechanical: fail-then-explain, reproduction (kill flakes), spec-fidelity checks, and broken-test demotion. Stage 2 is a model classifier (spec-violation / spec-ambiguity / test-wrong / environment) that sees only spec-side evidence. Confidence composes reproduction × fidelity × classifier × claim authority, and only reproduced spec-violations of high-authority claims can block a CI gate.
Prove it on a benchmark. Every claim is measured on known-buggy code: bug-catch rate, noise/precision, and spec-fidelity. A mutation mode (benchmark --mutation) generates many labeled mutants per source (Stryker by default, a zero-dependency built-in engine as a fallback) and reports the suite's mutation score, so the leaderboard rests on a large labeled set rather than a handful of hand-authored bugs.
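The property mode in the principles above relies on fast-check. As a dependency-free sketch of the same idea, the block below hand-rolls a small seeded generator in place of fast-check (so there is no shrinking) and checks two spec-level invariants of a hypothetical `sortAscending` unit: the output is ordered, and reversing the input does not change the output (a metamorphic relation). Both implementations and all helper names are assumptions for illustration.

```typescript
// Small seeded PRNG (mulberry32-style) so a failure can be replayed from its seed.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical units. The invariants below never look inside them.
const sortCorrect = (xs: number[]): number[] => [...xs].sort((a, b) => a - b);
// Planted bug: the default sort compares elements as strings.
const sortBuggy = (xs: number[]): number[] => [...xs].sort();

function randomArray(rand: () => number): number[] {
  const length = Math.floor(rand() * 8);
  return Array.from({ length }, () => Math.floor(rand() * 200) - 100);
}

// Returns the first input that violates an invariant, or null if all runs hold.
function findCounterexample(
  sort: (xs: number[]) => number[],
  seed: number,
  runs = 200,
): number[] | null {
  const rand = seededRandom(seed);
  for (let i = 0; i < runs; i++) {
    const input = randomArray(rand);
    const out = sort(input);
    // Invariant 1: each element is >= its predecessor.
    const ordered = out.every((v, j) => j === 0 || (out[j - 1] as number) <= v);
    // Invariant 2 (metamorphic): reversing the input must not change the output.
    const reversedOut = sort([...input].reverse());
    const same =
      out.length === reversedOut.length && out.every((v, j) => v === reversedOut[j]);
    if (!ordered || !same) return input;
  }
  return null;
}

console.log(findCounterexample(sortCorrect, 42)); // null
console.log(findCounterexample(sortBuggy, 42));   // a failing input array
```

Neither invariant mentions an expected output for a specific input, which is why a property of this shape cannot copy the implementation's behavior into the oracle.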

Install

Requires Node >= 22.

npx testbotpro --help

Or add it to a project:

pnpm add -D testbotpro

The local tool is free and needs no API key or account. A key is required only when a model provider fills the tests (the headless CI path, at one provider call per test, and the benchmark), never for the in-agent flow below.

Install the skill

The fastest way to set up testbotpro is one command. It installs the CLI globally, then installs the agent skill into your coding agents:

npx testbotpro install

That runs npm install -g testbotpro and then the skill installer (the interactive picker below). Pass -y for a non-interactive run, or any of the skill flags (--link, --project, --agent <name>).

If the CLI is already on your PATH, install or manage just the skill directly. It needs no network: it copies the SKILL.md bundled with the CLI, so the installed skill always matches the version you ran. It is idempotent (installs if missing, updates if present), auto-detects which agents you have configured, and shows a checklist before writing.

npx -y testbotpro skill              # interactive picker, then install for the agents you choose
npx -y testbotpro skill --link       # one shared copy in .agents/skills, symlinked from each agent
npx -y testbotpro skill --project    # guided install into the current repo, so collaborators share it
npx -y testbotpro skill --check      # report whether your copy is out of date (exit 1 on drift)
npx -y testbotpro skill uninstall    # remove it from every agent and the shared dir

Supported agents: Claude Code, Cursor, GitHub Copilot, Gemini CLI, Codex, Windsurf, Antigravity, Cline, OpenCode, Roo Code, and Zed. Use --agent <name> to target one, -y to skip the picker, or --target <dir> to write to an explicit directory. Run testbotpro skill --help for the full flag list.

Quickstart (no key, in-agent)

Inside a coding agent (Claude Code and similar), the model that writes the tests is the agent you already pay for, so no key is needed. The CLI does the deterministic work; the agent fills the tests in between.

# 1. Plan: build spec + façade requests (no key)
testbotpro plan --unit clamp --impl src/clamp.ts --spec specs/clamp.md --modes adversarial,property

# 2. Generate: the agent dispatches one clean subagent per request
testbotpro requests            # list the requests for the agent to fill

# 3. Judge: run the tests against the real code, triage, report (no key)
testbotpro judge

# 4. Classify + report: the agent fills verdicts, confidence composes (no key)
testbotpro classify --list
testbotpro report

Findings land in .testbotpro/work/findings.{md,json,sarif}. See SKILL.md for the full agent flow and harness/claude/README.md for the exact subagent dispatch.

CLI reference

| command | needs a key | what it does |
|---|---|---|
| plan | no | Build spec + façade generation requests into the work dir. --modes adversarial,property,e2e-http,e2e-browser. |
| requests | no | List planned generation requests for an agent/subagent to fill. |
| generate | API providers only | Fill the planned tests with a model provider (anthropic / openai / command). |
| judge | no | Run the tests against the implementation, triage, emit findings. |
| classify | no with --list | List classification requests for the agent (--list), or run a provider classifier. |
| report | no | Merge verdicts, recompose confidence, rewrite the reports. |
| run | yes | plan -> generate(provider) -> judge in one shot, for CI/headless. |
| capabilities | no | Print harness capabilities used to choose the generation path. |
| benchmark | yes | Run the adversarial-vs-baseline benchmark and render a leaderboard. --mutation, --engine, --max-mutants. |
| install | no | Install testbotpro globally (npm install -g) and then install the agent skill, in one step. Accepts the skill flags. |
| skill | no | Install, update, or uninstall the agent skill for your coding agents (idempotent). --link, --project, --check, --agent, uninstall. |

Leaderboard

The benchmark measures every claim on known-buggy code; generation never sees the implementation. testbotpro benchmark renders the two views below. Numbers are from a pinned run on the 13-unit ts-mini corpus, provider anthropic, model claude-opus-4-8, measured 2026-06-11.

Bug-catch (testbotpro benchmark --corpus ts-mini):

| mode | items | bugs caught | bug-catch rate | mean spec-fidelity |
|------|-------|-------------|----------------|--------------------|
| adversarial | 13 | 13 | 100% | 99% |
| property | 13 | 13 | 100% | 99% |
| baseline | 13 | 0 | 0% | 62% |

The story: spec-only adversarial and property tests caught the injected bug in every unit while staying faithful to the correct code (99% of their tests pass on the correct implementation). The baseline (the conventional "write tests against the code, keep the ones that pass" approach) caught none of the bugs, and its 62% spec-fidelity is that pathology made visible: more than a third of its tests fail on the correct code because they baked the bug into the oracle.

Mutation score (testbotpro benchmark --corpus ts-mini --mutation):

| mode | items | mean mutation score | mean spec-fidelity |
|------|-------|---------------------|--------------------|
| adversarial | 13 | 95% | 99% |
| property | 13 | 88% | 99% |
| baseline | 13 | 96% | 100% |

Each correct source is mutated into many labeled variants (Stryker by default, a zero-dependency built-in engine as a fallback); a mutant is "killed" when a test that passes on the correct code fails on it. The spec-only adversarial suite kills 95% of mutants while staying faithful to the correct code. Unlike the bug-catch view, the baseline scores well here: in mutation mode it is generated from the correct source, so there is no planted bug for it to encode. That is the point of having both views: mutation score measures absolute thoroughness, while the bug-catch table measures the thing that matters in practice, catching a real bug the author already shipped.
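The two columns can be sketched in a few lines, assuming spec-fidelity is the share of tests that pass on the correct code and using the "killed" definition above. The `clamp` unit, the hand-written mutants (stand-ins for engine-generated ones), and the tiny suite are all hypothetical; the real benchmark's bookkeeping may differ.

```typescript
type Impl = (x: number, lo: number, hi: number) => number;
type Test = { name: string; run: (impl: Impl) => boolean };

const correct: Impl = (x, lo, hi) => Math.min(Math.max(x, lo), hi);

// Hand-written stand-ins for the labeled mutants an engine would generate.
const mutants: Impl[] = [
  (x, lo, hi) => Math.max(Math.max(x, lo), hi),     // min replaced by max
  (x, lo, hi) => Math.min(Math.min(x, lo), hi),     // max replaced by min
  (x, lo) => Math.max(x, lo),                       // upper bound dropped
  (x, lo, hi) => Math.min(Math.max(x, lo), hi) + 0, // equivalent mutant
];

const suite: Test[] = [
  { name: "below range maps to lo", run: (f) => f(-5, 0, 10) === 0 },
  { name: "inside range is unchanged", run: (f) => f(5, 0, 10) === 5 },
  { name: "wrong expectation", run: (f) => f(15, 0, 10) === 15 }, // fails on correct code
];

// Only tests that pass on the correct code are allowed to kill mutants.
const faithful = suite.filter((t) => t.run(correct));
const specFidelity = faithful.length / suite.length;

// A mutant is killed when at least one faithful test fails on it.
const killed = mutants.filter((m) => faithful.some((t) => !t.run(m)));
const mutationScore = killed.length / mutants.length;

console.log(specFidelity.toFixed(2), mutationScore.toFixed(2)); // 0.67 0.50
```

In this toy suite the "upper bound dropped" mutant survives because no faithful test exercises a value above `hi`, which is the kind of gap a mutation score is meant to expose.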

Reproduce with testbotpro benchmark --corpus ts-mini (add --mutation for the second table) and an ANTHROPIC_API_KEY (about $2-3 per view at the Opus tier). The run records the provider and model in benchmark/results/*-results.json.

CI (the one keyed path)

In CI there is no agent, so a provider fills the tests. The build blocks only on reproduced, high-confidence spec-violations, so the gate does not cry wolf. See harness/ci/README.md and the composite action in action.yml.

testbotpro run --targets testbotpro.targets.json --provider anthropic --gate-confidence 0.8

The provider is model-agnostic: anthropic, an OpenAI-compatible endpoint (openai), or a shell-out to any model CLI (command). The key lives only here.
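The gating rule (block only on reproduced, high-confidence spec-violations of high-authority claims) can be sketched as below. This is a guess at the shape, not the tool's implementation: it assumes each confidence factor lies in [0, 1], that composition is a plain product, and that "high-authority" means a hypothetical cutoff of 0.9. The `Finding` fields are invented for illustration.

```typescript
type Verdict = "spec-violation" | "spec-ambiguity" | "test-wrong" | "environment";

// Hypothetical finding shape; the real findings.json schema may differ.
interface Finding {
  verdict: Verdict;
  reproduced: boolean;    // Stage 1: the failure reproduced on re-run
  reproduction: number;   // each factor is assumed to lie in [0, 1]
  fidelity: number;
  classifier: number;
  claimAuthority: number;
}

// Assumed composition: a plain product of the four factors.
function composeConfidence(f: Finding): number {
  return f.reproduction * f.fidelity * f.classifier * f.claimAuthority;
}

// Assumed cutoff for a "high-authority" claim.
const HIGH_AUTHORITY = 0.9;

function blocksBuild(f: Finding, gateConfidence: number): boolean {
  return (
    f.verdict === "spec-violation" &&
    f.reproduced &&
    f.claimAuthority >= HIGH_AUTHORITY &&
    composeConfidence(f) >= gateConfidence
  );
}

const strong: Finding = {
  verdict: "spec-violation", reproduced: true,
  reproduction: 1, fidelity: 1, classifier: 0.9, claimAuthority: 1,
};
const flaky: Finding = { ...strong, reproduced: false, reproduction: 0.5 };
const ambiguous: Finding = { ...strong, verdict: "spec-ambiguity" };

console.log(blocksBuild(strong, 0.8), blocksBuild(flaky, 0.8), blocksBuild(ambiguous, 0.8));
// true false false
```

With a product, one weak factor pulls the whole score under the threshold, which matches the intent that the gate should not cry wolf.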

Development

Requires Node >= 22 and pnpm.

pnpm install
pnpm test:unit      # the tool's own unit tests
pnpm typecheck
pnpm lint
pnpm benchmark      # the adversarial-vs-baseline thesis demo (needs ANTHROPIC_API_KEY)

License

Apache-2.0, with an all-permissive dependency graph. Mutation testing uses Stryker (Apache-2.0) by default, with a zero-dependency built-in operator engine as a fallback so the default works with no extra install. Engines are pluggable through a generic interface, so you can point testbotpro at any engine you install yourself; testbotpro never bundles a copyleft-licensed tool.
