breakit

v1.2.1

Published

18 days ago

AI-assisted exploratory testing CLI for web apps. Drives a real browser to surface UX issues, functional bugs, and console errors, with replay-verified findings and CI/SARIF output.

breakit

AI-assisted exploratory testing for web applications. A command-line tool that drives a real browser through your app to surface UX issues, functional bugs, broken or confusing flows, console errors, and benign security signals — with reproduction steps, evidence, and replay-based verification.

breakit is built for QA and test automation engineers. It complements your scripted suites, CI checks, and manual exploratory sessions; it does not replace them.

npx breakit test https://staging.your-app.com

Why use breakit

Scripted end-to-end tests verify the paths you already thought of. Exploratory testing finds the ones you didn't — but it's manual, hard to repeat, and easy to skip under deadline. breakit automates the exploratory pass:

Finds issues outside your test plan. It interacts with forms, navigation, and flows the way different real users would, and reports what breaks or confuses — not just assertions you wrote in advance.
Every finding is reproducible. Candidate issues are replayed in a clean browser session before being trusted, and each is graded by confidence. You get reproduction steps, screenshots, and console/network evidence, not a wall of speculation.
Built for CI. Stable JSON and SARIF output, a configurable severity gate, and a non-zero exit code on failure let you wire it into a pipeline as a quality signal.
Safe by default. Destructive actions are blocked, secrets are redacted, and production targets trigger an explicit warning; see Safety and Privacy below.

It runs on a free LLM tier by default, so trying it against a staging site costs nothing.

Where it fits in your QA workflow

breakit occupies the gap between scripted regression tests and manual exploratory testing:

Pre-merge / PR checks — run a short pass against a preview deployment and gate on verified high/critical findings.
Nightly against staging — a longer pass with more strategies to catch regressions and UX drift the scripted suite doesn't cover.
Before a release — an exploratory sweep to surface confusing flows and edge cases while there's still time to fix them.
Alongside manual exploratory sessions — use the report as a starting checklist so testers spend their time confirming and digging, not rediscovering.

It is a complementary signal. Treat its findings as leads to triage, the same way you would treat a report from a new exploratory tester.

Install

npm install -g breakit

Or run on demand with npx:

npx breakit test https://staging.your-app.com

breakit drives a real Chromium instance via Playwright. Install the browser once:

npx playwright install chromium

breakit doctor verifies that Chromium, Node, and your API key are ready.

Requirements: Node.js 20 or newer, and an API key for one supported LLM provider.

Configure a provider

breakit uses Google Gemini by default (free tier, no cost to evaluate). Anthropic Claude is supported as an alternative.

export GEMINI_API_KEY=...            # default provider
# or
export ANTHROPIC_API_KEY=sk-ant-...  # use with --provider anthropic

You can also pass the key directly with --key. Non-secret defaults can live in a config file that you commit; breakit init writes a starter one.

Exploratory strategies

breakit explores using a set of named strategies (referred to internally as personas). Each is an LLM-driven agent with a defined goal, risk level, and focus: for example, probing form validation, navigating with assistive-technology constraints, or stress-testing a multi-step flow under impatient interaction. Strategies are goal-directed rather than random clicking: each one pursues its goal and stops when it has covered its focus area or exhausted its action budget. The goals are fixed, but the actions the LLM chooses can differ between runs.
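The stopping rule described above can be pictured as a loop bounded by two budgets. The Python sketch below is illustrative only and is not breakit's implementation: run_with_budgets, next_action, and the stop-reason strings are hypothetical names, and the clock is injectable so the behaviour can be checked deterministically.

```python
import time


def run_with_budgets(next_action, action_budget, time_budget_s, clock=time.monotonic):
    """Perform actions until the strategy is done or a budget runs out.

    next_action is a hypothetical callable that returns the next action,
    or None once the strategy has covered its focus area.
    """
    deadline = clock() + time_budget_s
    performed = []
    while len(performed) < action_budget:
        if clock() >= deadline:
            return performed, "time_budget_reached"
        action = next_action()
        if action is None:
            return performed, "focus_covered"
        performed.append(action)
    return performed, "action_budget_exhausted"


# Made-up action list standing in for LLM-selected browser actions.
planned = iter(["open /signup", "submit empty form", "submit long input"])
actions, reason = run_with_budgets(
    lambda: next(planned, None), action_budget=2, time_budget_s=300
)
print(actions, reason)
```

With an action budget of 2, the loop stops after two actions even though a third was planned.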

Strategies are grouped into packs:

QA pack (5) — functional and input-handling focus: form/input abuse, low-tech-literacy navigation, mobile viewport, impatient interaction, first-time comprehension.
UX pack (20) — friction, trust, and accessibility focus: pricing/trust evaluation, privacy review, enterprise evaluation, screen-reader navigation, cancellation flows, and more.
All (25) — QA and UX strategies interleaved for balanced coverage.

Select with --pack qa (default), --pack ux, or --pack all. List the available strategies and their focus areas with breakit personas, and run a single one with --persona <id>.

Verification and confidence

A report is only useful if you can trust it. breakit grades every finding and replays it before claiming it is real.

Replay verification. Each candidate finding is re-executed in a fresh browser session. A finding is labeled verified only when that replay actually reproduced it. If replay is skipped, times out, or fails, the finding is reported at a lower confidence — never as verified.
Confidence levels (highest to lowest): verified → high → plausible → unverified. Confidence is derived from evidence (replay outcome, screenshots, console and network errors, and how many strategies independently reported it), not asserted.
Deduplication and scoring. Findings are deduplicated across strategies and scored by priority so the report leads with what matters.

Triage verified and high first; treat plausible and unverified as leads for manual confirmation.
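The triage advice above can be expressed as a small sort-and-split step. This is a minimal Python sketch, assuming each finding is a dictionary with hypothetical title and confidence fields; the actual report schema is not documented here, and the sample findings are made up.

```python
# Confidence labels from highest to lowest, as listed above.
CONFIDENCE_ORDER = ["verified", "high", "plausible", "unverified"]


def triage(findings):
    """Split findings into (act_first, leads), each sorted by confidence."""
    rank = {label: i for i, label in enumerate(CONFIDENCE_ORDER)}
    ordered = sorted(findings, key=lambda f: rank[f["confidence"]])
    act_first = [f for f in ordered if f["confidence"] in ("verified", "high")]
    leads = [f for f in ordered if f["confidence"] in ("plausible", "unverified")]
    return act_first, leads


# Made-up findings for illustration only.
findings = [
    {"title": "Signup form accepts empty email", "confidence": "plausible"},
    {"title": "Checkout button throws console error", "confidence": "verified"},
    {"title": "Pricing page link returns 404", "confidence": "high"},
]
act_first, leads = triage(findings)
print([f["title"] for f in act_first])
print([f["title"] for f in leads])
```

The first list is what to act on immediately; the second is the queue for manual confirmation.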

Usage

breakit test <url> [options]

| Flag | Description | Default |
|------|-------------|---------|
| -k, --key <key> | API key (or set the provider env var) | env var |
| --provider <provider> | gemini (free) or anthropic | gemini |
| -m, --model <model> | Model override | provider default |
| --mode <mode> | browser (full) or snapshot (faster, cheaper) | browser |
| --pack <pack> | qa, ux, or all | qa |
| --persona <id> | Run a single strategy by id (see breakit personas) | - |
| -n, --personas <count> | Number of strategies to run (1–50) | 5 |
| -d, --description <text> | Brief description of the app under test | - |
| -e, --email <email> | Login email for authenticated testing | - |
| -p, --password <password> | Login password | - |
| -t, --time-budget <seconds> | Maximum run time (60–900) | 300 |
| -o, --output <dir> | Output directory for reports | ./breakit-output |
| --severity-threshold <level> | Exit non-zero if verified findings are at or above this severity (off, low, medium, high, critical) | off |
| --sarif | Also emit a SARIF report | - |
| --compare <path> | A previous run's JSON for before/after comparison | - |
| --allow-destructive-actions | Permit state-changing requests and uploads (see Safety) | off |
| --config <path> | Path to a config file | auto-detect |

Additional commands: breakit personas (list strategies), breakit doctor (check setup), breakit init (write a starter config).

Examples

# Short functional pass: 3 strategies, 2-minute budget
breakit test https://staging.my-app.com -t 120 -n 3

# UX and accessibility pass using the faster snapshot engine
breakit test https://staging.my-app.com --mode snapshot --pack ux

# Authenticated run (QA_EMAIL and QA_PASSWORD are environment variables you set beforehand)
breakit test https://staging.my-app.com -e "$QA_EMAIL" -p "$QA_PASSWORD"

# Run a single strategy for a focused investigation
breakit test https://staging.my-app.com --persona form_abuser

CI usage

breakit is designed to run as a pipeline quality gate. Use --severity-threshold to fail the build only on verified findings at or above a chosen severity, and --sarif to surface findings in your code-scanning UI.

GitHub Actions:

- name: Install Chromium
  run: npx playwright install --with-deps chromium

- name: Exploratory test pass
  run: npx breakit test "$PREVIEW_URL" --severity-threshold high --sarif
  env:
    GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

- name: Upload findings to code scanning
  if: always()
  uses: github/codeql-action/upload-sarif@v3
  with:
    # Reports are written with timestamped, host-prefixed names
    # (e.g. staging-my-app-com_2026-06-02T14-30-00.sarif); point at the directory.
    sarif_file: breakit-output

The process exits with a non-zero status when verified findings meet the threshold, and the machine-readable JSON reports in breakit-output/ can be archived or parsed by downstream steps. Because runs are LLM-driven and vary between executions, treat the gate as a high-signal warning rather than a deterministic pass/fail oracle, and keep the threshold conservative (high or critical) to avoid noisy builds.
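The gate behaviour described above amounts to a severity comparison restricted to verified findings. The following Python sketch shows that logic under stated assumptions: gate_exit_code and the severity and confidence field names are hypothetical, and the real tool may weigh findings differently.

```python
# Severity levels accepted by --severity-threshold, lowest to highest ("off" disables the gate).
SEVERITIES = ["low", "medium", "high", "critical"]


def gate_exit_code(findings, threshold):
    """Return 1 if any verified finding is at or above the threshold, else 0."""
    if threshold == "off":
        return 0
    minimum = SEVERITIES.index(threshold)
    for finding in findings:
        is_verified = finding["confidence"] == "verified"
        if is_verified and SEVERITIES.index(finding["severity"]) >= minimum:
            return 1
    return 0


# Made-up findings for illustration only.
sample = [
    {"severity": "critical", "confidence": "plausible"},  # not verified, so it does not gate
    {"severity": "medium", "confidence": "verified"},
]
print(gate_exit_code(sample, "high"))
print(gate_exit_code(sample, "medium"))
```

In this sketch an unverified critical finding never fails the build, which mirrors the advice to keep the gate tied to replay-verified results.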

Report output

Reports are written to ./breakit-output/ as Markdown (human review), JSON (CI and tooling), and HTML (shareable). SARIF is added with --sarif. When an output directory is set, a structured per-action log is also written as JSONL.

Each report includes the target URL, run ID, ISO 8601 start and end times, duration, strategies used, total actions, pages and forms discovered, and findings grouped by severity — each with its confidence and verification status, reproduction steps, and captured console errors. A Limitations section records anything the run could not cover, such as a crashed strategy or incomplete verification, so the report is never presented as more complete than it is.
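A downstream CI step usually needs the most recent report rather than the whole directory. The Python sketch below picks it by file name, assuming JSON reports follow the same timestamped, host-prefixed naming that the CI example shows for SARIF; latest_report is a hypothetical helper, not part of breakit.

```python
def latest_report(filenames, extension=".json"):
    """Return the newest report name, or None if nothing matches.

    Assumes names look like <host>_<YYYY-MM-DDTHH-MM-SS><extension>, so the
    timestamp part sorts chronologically as plain text.
    """
    matching = [name for name in filenames if name.endswith(extension)]
    if not matching:
        return None
    return max(matching, key=lambda name: name[: -len(extension)].rsplit("_", 1)[-1])


# Made-up directory listing for illustration only.
names = [
    "staging-my-app-com_2026-06-01T09-00-00.json",
    "staging-my-app-com_2026-06-02T14-30-00.json",
    "staging-my-app-com_2026-06-02T14-30-00.sarif",
]
print(latest_report(names))
print(latest_report(names, extension=".sarif"))
```

Passing a plain list of names keeps the helper easy to test; in a pipeline the list could come from os.listdir on the output directory.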

Safety

No destructive actions by default. State-changing API requests (DELETE, PUT, PATCH) and file uploads are blocked unless you explicitly pass --allow-destructive-actions. Enable that flag only against an environment you own and can reset.
breakit will not, in its default mode, complete purchases, delete data, change passwords, or send real emails.
Test against staging. Running against production sends real traffic and may trigger monitoring, alerts, or rate limits. breakit warns when the target does not look like a staging or other non-production environment.
Action and time budgets are enforced, and the run aborts cleanly when the time budget is reached.
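The default blocking rule above can be read as a simple request policy. This Python sketch illustrates that policy only; is_request_allowed and its parameters are hypothetical, and breakit's own checks may cover more cases than the three methods and file uploads named here.

```python
# HTTP methods the Safety section lists as blocked by default.
BLOCKED_METHODS = {"DELETE", "PUT", "PATCH"}


def is_request_allowed(method, is_file_upload=False, allow_destructive=False):
    """Decide whether a request may proceed under the default-safe policy."""
    if allow_destructive:
        # Corresponds to passing --allow-destructive-actions.
        return True
    if is_file_upload:
        return False
    return method.upper() not in BLOCKED_METHODS


print(is_request_allowed("GET"))
print(is_request_allowed("delete"))
print(is_request_allowed("DELETE", allow_destructive=True))
```

The method is upper-cased before the lookup so that differently cased method names are treated the same way.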

Privacy

Secrets are redacted before anything is written to disk or sent to the LLM provider: credential headers (Authorization, Cookie, Set-Cookie, API keys), and secret-shaped values in console output, response bodies, and tool results.
Page content is sent to your LLM provider. Because the model drives exploration, page content, accessibility trees, and redacted network metadata from the target are transmitted to the provider you configure. Do not point breakit at pages containing data you are not permitted to share with that provider.
Output is sensitive. Evidence and report files contain DOM snapshots and network metadata from the target site. Store the output directory accordingly and keep it out of version control.
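To make the redaction idea concrete, here is a minimal Python sketch of header and text scrubbing. It is an assumption-laden illustration rather than breakit's redaction code: the x-api-key header name, the bearer-token pattern, and the [REDACTED] marker are all hypothetical choices.

```python
import re

REDACTED = "[REDACTED]"

# Credential headers named in the Privacy section; "x-api-key" is an assumed example of an API key header.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}

# One hypothetical "secret-shaped" pattern: a bearer token inside free text.
BEARER_TOKEN = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")


def redact_headers(headers):
    """Return a copy of headers with credential values replaced."""
    return {
        name: (REDACTED if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }


def redact_text(text):
    """Replace bearer tokens in console output or response bodies."""
    return BEARER_TOKEN.sub(REDACTED, text)


print(redact_headers({"Authorization": "Bearer abc123", "Accept": "text/html"}))
print(redact_text("auth failed for Bearer abc.def-123"))
```

A real redactor would need many more patterns; the point is that scrubbing happens before data is written to disk or sent to the provider.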

What this is not

Not a replacement for QA or scripted tests. It is an additional, automated exploratory signal that feeds your existing process.
Not a penetration test or security audit. Its security checks are limited, benign reflection probes that flag candidates for manual review. It does not prove exploitability and does not attempt real attacks.
Not deterministic or exhaustive. Because exploration is LLM-driven, coverage and findings vary between runs. It will not find every issue, and it is not a substitute for thorough test design.

Troubleshooting

"Browser not found" or Playwright launch errors — run npx playwright install chromium, then breakit doctor.
"Missing API key" — set GEMINI_API_KEY (or ANTHROPIC_API_KEY with --provider anthropic), or pass --key.
Rate limits on the free tier — reduce --personas, use --mode snapshot, shorten --time-budget, or switch providers.
Noisy CI runs — raise --severity-threshold to high or critical so only verified, high-impact findings fail the build.

How it works

Discovery — crawls the target and maps its pages and forms.
Execution — each strategy explores the app, with the LLM selecting actions from a fixed, sandboxed browser tool set (no shell, filesystem, or arbitrary code execution).
Verification — candidate findings are replayed in a clean session to confirm they reproduce and to filter false positives.
Reporting — findings are deduplicated, scored, graded by confidence, and written with severity, reproduction steps, evidence, and limitations.
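The deduplication and scoring in the Reporting step can be sketched as a merge followed by a sort. The Python below is a simplified illustration, not breakit's scoring: the url, summary, severity, and strategy fields are hypothetical, the merge key is an assumption, and apart from form_abuser the strategy ids are invented.

```python
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def dedupe_and_rank(findings):
    """Merge findings that share a page and summary, then sort them."""
    merged = {}
    for finding in findings:
        key = (finding["url"], finding["summary"])
        entry = merged.setdefault(key, {
            "url": finding["url"],
            "summary": finding["summary"],
            "severity": finding["severity"],
            "strategies": set(),
        })
        entry["strategies"].add(finding["strategy"])
        # Keep the most severe rating seen for this issue.
        if SEVERITY_RANK[finding["severity"]] < SEVERITY_RANK[entry["severity"]]:
            entry["severity"] = finding["severity"]
    # Most severe first; ties go to the issue reported by more strategies.
    return sorted(
        merged.values(),
        key=lambda e: (SEVERITY_RANK[e["severity"]], -len(e["strategies"])),
    )


# Made-up raw findings for illustration only.
raw = [
    {"url": "/signup", "summary": "Empty email accepted", "severity": "medium", "strategy": "form_abuser"},
    {"url": "/signup", "summary": "Empty email accepted", "severity": "high", "strategy": "strategy_b"},
    {"url": "/pricing", "summary": "Plan link returns 404", "severity": "high", "strategy": "strategy_c"},
]
ranked = dedupe_and_rank(raw)
print([(e["url"], e["severity"], len(e["strategies"])) for e in ranked])
```

Counting how many strategies reported the same issue is also one of the evidence signals mentioned under Verification and confidence.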
