testloop

v0.13.0

Published

14 days ago

Drop-in verifier for AI-written web code: crawls your app like a real user, catches errors and dead UI, and flags new code that never executed (i.e. not wired in).

Downloads

2,481

Vulnerabilities: 0 High, 0 Medium, 0 Low

wemiki89

testing e2e playwright ai claude coverage qa


A drop-in verifier for AI-written web code. It launches your app, uses it like a real user, and reads your code like a reviewer — all deterministic, no AI judging anything. Built for people orchestrating AI builders (and vibe coders) who keep hitting the same wall: "Claude says it's done, but is it actually wired in, does it work, and did it break anything?"

The two foundations no unit test gives you:

A generic browser crawl (Playwright): visit every same-origin page, click every button, fill every form, and record console errors, uncaught exceptions, failed network requests, blank pages, dead clicks (elements that do literally nothing), and missing loading states. During click-testing the network is slowed (~400ms of latency via CDP), so any click that hits the network without instant feedback shows up as a frozen screen.
A coverage-vs-git-diff intersection — full-stack: the browser records what frontend code ran, and the app's Node server is launched with NODE_V8_COVERAGE so backend execution is tracked too. Then it checks whether the lines you (or Claude) just changed actually executed while the site was used. Code neither the browser nor the server loaded is, by definition, not wired in. Non-user-facing code is never tested "on its own" — it's tested through whatever on the site it affects.

What it checks (all deterministic, no AI)

Wired-in (coverage vs diff) · dead clicks · missing loading states · console errors & crashes · failed requests · blank pages · secrets exposed in the browser · stack traces shown to users · silent failures & fallbacks (static) · hallucinated imports (static) · responsiveness (phone/tablet/desktop, scored /10) · design consistency · form-field hygiene · stale dev server · slow pages · placeholder/leftover text · user flows (the spec) · autonomous login/signup/wizard-walking · page classification & confusing-UI · intent review (new flows to confirm).

Install

npm install -g testloop
npx playwright install chromium          # on Linux/VPS: --with-deps chromium

Runs fully headless — built for VPS use as much as local. On a server, set notifyUrl in testloop.json (e.g. an ntfy.sh topic) for a phone push when any run finishes; if the app already runs under pm2/systemd, set url and nothing extra gets launched.

Commands

testloop ui          # the easy way: dashboard in your browser — run, settings, login, results, journey map
testloop             # run one check (auto-detects how to start the app), print the report
testloop code        # static scan only: silent failures/fallbacks + hallucinated imports (no browser)
testloop watch       # re-run the full check on every file change, whoever wrote it
testloop baseline    # record current problems as known issues (only NEW problems fail after)
testloop init        # install the Claude Code hook + slash command + flows + guide
testloop status      # hook state, config, and last run's results
testloop remove      # uninstall the hook

Common flags: --url http://localhost:3000 (app already running), --cmd "npm run dev" (explicit start), --against main (diff vs a branch), --json report.json, --headed (watch the browser), --root <dir>. Exit code 0 = clean, 1 = findings.

✗ Dead UI elements — clicking did nothing (1):
  - [/settings] <button.export-btn> "Export report"

✗ Changed code the browser NEVER LOADED (1 file(s)) — likely not wired into the UI:
  - src/components/ExportDialog.tsx (84 new line(s), new file)

Developing testloop itself? Clone the repo, then npm install && npm link in the checkout to put the testloop command on your PATH.

Closing the loop with Claude Code

cd /path/to/your/project
testloop init

This installs the full drop-in kit:

A Stop hook in .claude/settings.json — every time Claude finishes a turn that changed frontend code, the checker launches the app, crawls it, runs the wiring check, and on failure blocks the stop and feeds the findings back to Claude to fix. Works in any Claude Code session in the project — including builder sessions dispatched by an orchestrator into separate terminals: each builder gets checked (and corrected) before it can declare done and close.
A /testloop slash command — the control panel inside Claude Code: /testloop (dashboard), /testloop run, /testloop on|off, /testloop history, /testloop settings.
testloop.json — settings file ("enabled": false pauses automatic runs).
testloop-flows/ — plain-text flow files (the spec; builders add one per feature).
A CLAUDE.md section — the builder contract: consult FEATURE-MAP.md before building, ship a flow per feature, no silent fallbacks, never bypass testloop.
TESTLOOP-GUIDE.md — a how-to dropped into the project covering all of this.

The verification itself is 100% deterministic — AI is only a consumer of the report.

User flows: AI writes the intent, the bot proves it

Plain-text flow files in testloop-flows/ are replayed through the real UI before every crawl:

goto /signup
fill "Email" with [email protected]
click "Create account"
expect request POST /api/signup
expect url /onboarding
expect text "Welcome"

The full step vocabulary: goto, reload, click "text", fill "label" with value, press Key, expect text / expect no text, expect url, expect request METHOD /url-part (the click really called the backend, successfully), expect api /path contains "text" (session-aware fetch), run <command> (escape hatch for DB checks etc. — nonzero exit fails), wait ms.
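To make the vocabulary concrete, here is a minimal sketch of how such flow lines could be parsed into typed steps. It is not testloop's actual parser: the `FlowStep` shapes, the `FLOW_RULES` table, and the choice to skip blank lines are all assumptions made for illustration.

```typescript
// Hypothetical step shapes; field names are illustrative, not testloop's internals.
type FlowStep =
  | { kind: "goto"; target: string }
  | { kind: "reload" }
  | { kind: "click"; text: string }
  | { kind: "fill"; label: string; value: string }
  | { kind: "press"; key: string }
  | { kind: "expectText"; text: string; negated: boolean }
  | { kind: "expectUrl"; url: string }
  | { kind: "expectRequest"; method: string; urlPart: string }
  | { kind: "expectApi"; path: string; contains: string }
  | { kind: "run"; command: string }
  | { kind: "wait"; ms: number };

// Each rule pairs a pattern with a builder that receives the captured groups.
const FLOW_RULES: ReadonlyArray<{ re: RegExp; build: (g: string[]) => FlowStep }> = [
  { re: /^reload$/, build: () => ({ kind: "reload" }) },
  { re: /^goto\s+(\S+)$/, build: (g) => ({ kind: "goto", target: g[0] ?? "" }) },
  { re: /^click\s+"(.+)"$/, build: (g) => ({ kind: "click", text: g[0] ?? "" }) },
  { re: /^fill\s+"(.+?)"\s+with\s+(.+)$/, build: (g) => ({ kind: "fill", label: g[0] ?? "", value: g[1] ?? "" }) },
  { re: /^press\s+(\S+)$/, build: (g) => ({ kind: "press", key: g[0] ?? "" }) },
  { re: /^expect\s+no\s+text\s+"(.+)"$/, build: (g) => ({ kind: "expectText", text: g[0] ?? "", negated: true }) },
  { re: /^expect\s+text\s+"(.+)"$/, build: (g) => ({ kind: "expectText", text: g[0] ?? "", negated: false }) },
  { re: /^expect\s+url\s+(\S+)$/, build: (g) => ({ kind: "expectUrl", url: g[0] ?? "" }) },
  { re: /^expect\s+request\s+([A-Z]+)\s+(\S+)$/, build: (g) => ({ kind: "expectRequest", method: g[0] ?? "", urlPart: g[1] ?? "" }) },
  { re: /^expect\s+api\s+(\S+)\s+contains\s+"(.+)"$/, build: (g) => ({ kind: "expectApi", path: g[0] ?? "", contains: g[1] ?? "" }) },
  { re: /^run\s+(.+)$/, build: (g) => ({ kind: "run", command: g[0] ?? "" }) },
  { re: /^wait\s+(\d+)$/, build: (g) => ({ kind: "wait", ms: Number(g[0] ?? "0") }) },
];

// Parse one line, or throw so a typo in a flow file fails loudly.
function parseFlowLine(line: string): FlowStep {
  const text = line.trim();
  for (const rule of FLOW_RULES) {
    const m = rule.re.exec(text);
    if (m) return rule.build(m.slice(1));
  }
  throw new Error(`Unrecognized flow step: ${text}`);
}

// Parse a whole flow file; blank lines are skipped (an assumption of this sketch).
function parseFlow(source: string): FlowStep[] {
  return source
    .split(/\r?\n/)
    .filter((l) => l.trim() !== "")
    .map((l) => parseFlowLine(l));
}
```

Rejecting unknown lines instead of ignoring them keeps a misspelled step from silently weakening the spec.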

This is how "what is it supposed to do" gets encoded deterministically. The key pattern for database/persistence effects: create the thing through the UI, then reload and use expect text to confirm it's still there. Data that survives a reload came from the backend, so no DB access is needed.

init adds a section to the project's CLAUDE.md instructing builder Claudes to ship a flow file with every user-facing feature. The runner can only interact with what's visibly on screen, so a passing flow is mechanical proof the feature is wired in. If Claude wrote the code but never wired it up, the flow fails with the message 'could not find anything clickable matching "..." — is it wired into the UI?' and the Stop hook makes Claude fix it. Flows run in filename order (00-login.txt first), failures capture a screenshot, and the session a flow creates (e.g. logged in) carries into the free crawl, which is also how auth walls are handled.

The dashboard — built for vibe coders

testloop ui opens a local control panel in your browser (127.0.0.1 only): one button to test your app, settings as forms, login setup (including a one-click "Set it up for me" that generates a test account the robot registers itself), live output, problem list, screenshot gallery, run history, a flow editor, and the site map. No terminal knowledge needed beyond the one command.

The Journey view draws the robot's path through your site as a line of stops — green for clean pages, amber for warnings, red with an ✗ count where problems live, ✓/✗ marks for the features tested on each page, and a pulsing NOT WIRED IN stop at the end for anything that was built but is unreachable. Fix things and watch the line turn green.

👁 View my app opens your site through a local preview mirror with a testloop status bar on top (last result, dashboard link). The bar exists only at the preview address — your real site is never modified, so it can never leak to users on a live deployment.

Automatic login (and signup) — zero config

You don't have to set up login at all: if the robot finds a login page on your site and no credentials are configured, it creates its own test account (saved to testloop.json), registers it through your sign-up page like a real new user, and signs in from then on. Opt out with "login": false.

To use specific credentials instead, give it just a test email + password ("login": { "user": ..., "password": ..., "autoSignup": true }). The robot finds your login page by weighted signals — a single visible password field, "forgot password" text, button wording (log in, sign in, send code…), URL hints — fills it, submits, and verifies success (your expect text, or heuristics: form gone / page changed / no error text). With autoSignup, if the account doesn't exist it registers one through your sign-up page like a real new user. The logged-in session carries into every flow and the whole crawl.
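As a rough picture of what "weighted signals" can mean, the sketch below scores a page from the signals named above. Every weight, the threshold, and the `LoginSignals` shape are invented for illustration; the README does not state the real values.

```typescript
// Signals the text mentions; how they are collected from the page is out of scope here.
interface LoginSignals {
  visiblePasswordFields: number;
  hasForgotPasswordText: boolean;
  submitButtonText: string;
  url: string;
}

// Made-up threshold for this sketch only.
const LOGIN_THRESHOLD = 4;

// Add up evidence; each weight is an assumption, not testloop's tuning.
function scoreLoginPage(s: LoginSignals): number {
  let score = 0;
  if (s.visiblePasswordFields === 1) score += 3; // exactly one password field
  if (s.hasForgotPasswordText) score += 2;
  if (/\b(log ?in|sign ?in|send code)\b/i.test(s.submitButtonText)) score += 2;
  if (/\/(login|signin|sign-in|auth)\b/i.test(s.url)) score += 1; // URL hint
  return score;
}

function looksLikeLoginPage(s: LoginSignals): boolean {
  return scoreLoginPage(s) >= LOGIN_THRESHOLD;
}
```

No single signal decides the outcome, so a login form at an unusual URL can still be recognized from its fields and wording.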

Verification walls (email/SMS codes) are detected and reported with the fix options — robots can't receive codes, so either disable verification in dev mode or pre-create/seed a verified test account.

Success detection is self-learning: on its first successful login the bot grabs its own indicator from the landing page (the headline it saw) and uses it as the exact success check from then on — no "expect" text to type.

Security & hygiene scans

Every run also scans deterministically for: secrets exposed in the browser (Stripe/AWS/Google/GitHub/Slack keys, private keys, JWTs in served JS — a hard failure you cannot baseline away), internal errors / stack traces shown to users, form-field problems (password fields that aren't masked, email/phone fields with no validation), an unknown-URL probe (friendly 404 vs leaked error), placeholder/leftover text (lorem ipsum, TODO), and slow-page outliers.
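A secret scan of served JavaScript can be approximated with pattern matching. The sketch below uses widely known token prefixes; the exact patterns testloop applies are not documented here, so treat `SECRET_PATTERNS` and the `SecretFinding` shape as assumptions.

```typescript
interface SecretFinding {
  kind: string;
  index: number; // character offset of the first match
}

// Commonly cited token shapes; a real scanner would carry more patterns than this.
const SECRET_PATTERNS: ReadonlyArray<{ kind: string; pattern: RegExp }> = [
  { kind: "aws-access-key-id", pattern: /\bAKIA[0-9A-Z]{16}\b/ },
  { kind: "github-token", pattern: /\bghp_[A-Za-z0-9]{36}\b/ },
  { kind: "stripe-live-secret-key", pattern: /\bsk_live_[0-9A-Za-z]{24,}/ },
  { kind: "google-api-key", pattern: /\bAIza[0-9A-Za-z_-]{35}/ },
  { kind: "slack-token", pattern: /\bxox[baprs]-[0-9A-Za-z-]{10,}/ },
  { kind: "private-key", pattern: /-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----/ },
  { kind: "jwt", pattern: /\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}/ },
];

// Report the first occurrence of each kind found in a served script.
function scanForSecrets(servedJs: string): SecretFinding[] {
  const findings: SecretFinding[] = [];
  for (const { kind, pattern } of SECRET_PATTERNS) {
    const m = pattern.exec(servedJs);
    if (m) findings.push({ kind, index: m.index });
  }
  return findings;
}
```

Pattern scans like this produce some false positives (a JWT-shaped test fixture, for example), which is the price of staying deterministic.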

Design consistency

testloop reads the computed styles of every page (body/heading fonts, button colors and corner style, link color, page background) into a per-page fingerprint, finds your site's dominant design, and flags pages that don't match the rest of the site — catches a new page built in an old or different style without judging taste. Needs ≥3 pages to establish a norm; a warning, not a failure; shown per-page in the dashboard journey.
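The "dominant design" idea can be sketched as a majority vote over per-page fingerprints. The version below compares fingerprints by exact equality, which is a simplification chosen for this example; the `StyleFingerprint` field names are assumptions derived from the properties listed above.

```typescript
interface StyleFingerprint {
  bodyFont: string;
  headingFont: string;
  buttonColor: string;
  buttonRadius: string;
  linkColor: string;
  background: string;
}

// Collapse a fingerprint into one comparable string.
function fingerprintKey(fp: StyleFingerprint): string {
  return [fp.bodyFont, fp.headingFont, fp.buttonColor, fp.buttonRadius, fp.linkColor, fp.background].join("|");
}

// Return the paths of pages whose fingerprint differs from the most common one.
function findStyleOutliers(pages: ReadonlyArray<{ path: string; style: StyleFingerprint }>): string[] {
  if (pages.length < 3) return []; // too few pages to establish a norm
  const counts = new Map<string, number>();
  for (const p of pages) {
    const key = fingerprintKey(p.style);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let dominant = "";
  let best = 0;
  counts.forEach((count, key) => {
    if (count > best) {
      best = count;
      dominant = key; // on a tie, the fingerprint seen first wins
    }
  });
  return pages.filter((p) => fingerprintKey(p.style) !== dominant).map((p) => p.path);
}
```

A production check would likely compare property by property so it can say which style drifted, but the majority-vote core stays the same.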

Code scan — silent failures & hallucinated imports

testloop code (or the ⚡ Scan my code button in the dashboard, or automatically on the changed files during every check) reads your source — no browser — for the failure classes a running page can't reveal:

Silent fallbacks — the worst case: an error swallowed behind a default value (catch { return null }, .catch(() => [])), so a broken main feature looks like it works. You can't stop AI from adding fallbacks, so testloop logs every one in a dedicated dashboard panel for you to review — the body-aware detector flags catch blocks and .catch() handlers that swallow without logging, rethrowing, or surfacing the error (a catch that logs or rethrows is correctly ignored).
Hallucinated / missing imports — importing a package that isn't in package.json (AI inventing library names). This is a hard failure — the app is literally broken.
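The missing-import check lends itself to a short sketch: collect bare import specifiers and compare them against the declared dependencies. This regex-based version is an approximation written for this README, not testloop's detector; `NODE_BUILTINS` is a deliberately partial list, and path aliases such as "@/components" would need extra handling.

```typescript
// Partial list for illustration; Node ships many more built-in modules.
const NODE_BUILTINS: ReadonlyArray<string> = [
  "fs", "path", "http", "https", "crypto", "os", "url", "util", "child_process", "events", "stream",
];

// Map an import specifier to the package that must be installed, or null if none is needed.
function packageNameOf(specifier: string): string | null {
  if (specifier.startsWith(".") || specifier.startsWith("/")) return null; // local file
  if (specifier.startsWith("node:")) return null; // explicit built-in
  const parts = specifier.split("/");
  const first = parts[0] ?? "";
  if (first.startsWith("@")) {
    // Scoped package: keep "@scope/name", drop any deeper subpath.
    return parts.length > 1 ? `${first}/${parts[1]}` : null;
  }
  return first;
}

// List packages that are imported in `source` but absent from `declared`.
function findMissingImports(source: string, declared: ReadonlyArray<string>): string[] {
  const importRe = /(?:\bfrom\s*|\bimport\s*\(\s*|\brequire\s*\(\s*|\bimport\s+)["']([^"']+)["']/g;
  const missing: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = importRe.exec(source)) !== null) {
    const pkg = packageNameOf(m[1] ?? "");
    if (pkg === null || NODE_BUILTINS.indexOf(pkg) !== -1) continue;
    if (declared.indexOf(pkg) === -1 && missing.indexOf(pkg) === -1) missing.push(pkg);
  }
  return missing;
}
```

In practice `declared` would be the union of the dependency sections in package.json.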

Responsiveness (phone / tablet / desktop)

Every page is re-measured at phone (390px), tablet/iPad (834px), and desktop sizes and scored /10. It catches what breaks on small screens deterministically — content wider than the screen (sideways scrolling), content taller than the screen that won't scroll (the bottom is unreachable — the classic iPad trap), elements cut off past the edge, tap targets under 44px, text under 12px, and a missing viewport meta tag. No CSS is applied to your site — testloop just resizes the browser window and reads the layout your styles produce. Warnings, not failures; per-page scores show in the dashboard journey.
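Once a page has been measured at a given viewport, most of these checks reduce to simple comparisons. The sketch below covers a subset; the `PageMeasurement` shape is hypothetical, and the /10 scoring is left out because its formula is not described here.

```typescript
// Hypothetical measurement record for one page at one viewport size.
interface PageMeasurement {
  viewportWidth: number; // e.g. 390 for phone, 834 for tablet
  viewportHeight: number;
  contentWidth: number; // e.g. the document's scroll width
  contentHeight: number; // e.g. the document's scroll height
  canScrollVertically: boolean;
  hasViewportMeta: boolean;
  tapTargets: ReadonlyArray<{ label: string; width: number; height: number }>;
  textSizesPx: ReadonlyArray<number>;
}

const MIN_TAP_TARGET_PX = 44; // threshold from the text above
const MIN_TEXT_PX = 12; // threshold from the text above

function findResponsiveIssues(m: PageMeasurement): string[] {
  const issues: string[] = [];
  if (!m.hasViewportMeta) issues.push("missing viewport meta tag");
  if (m.contentWidth > m.viewportWidth) issues.push("content wider than the screen (sideways scrolling)");
  if (m.contentHeight > m.viewportHeight && !m.canScrollVertically) {
    issues.push("content taller than the screen but the page will not scroll");
  }
  for (const t of m.tapTargets) {
    if (t.width < MIN_TAP_TARGET_PX || t.height < MIN_TAP_TARGET_PX) {
      issues.push(`tap target under ${MIN_TAP_TARGET_PX}px: ${t.label}`);
    }
  }
  if (m.textSizesPx.some((size) => size < MIN_TEXT_PX)) issues.push(`text under ${MIN_TEXT_PX}px`);
  return issues;
}
```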

Stale-server detection

When you point testloop at an already-running server (url mode — common when you have many terminals/dev servers going), it checks whether that server is actually serving your latest code: for each changed file it fetches the served version and compares against disk. If the server is serving an old build, you get "your dev server is serving STALE CODE — restart it" instead of silently testing 10-iterations-old code. (A server testloop launches itself is always fresh, so this only runs in url mode.)

FEATURE-MAP.md — stop rebuilding what exists

After every run testloop writes FEATURE-MAP.md: every page and its type, every named feature (flow) and the pages it touches, and recently wired-in code. The CLAUDE.md contract tells builder Claudes to consult it before building, so they extend existing features instead of rebuilding them or replacing them with simpler versions.

Visual self-review — Claude looks at what it built

testloop saves a screenshot of every page, and Claude Code can see images. So when the hook reports back, it hands Claude the screenshot of the feature it just built and tells it to look: is it laid out right, intentional, on-brand, nothing visually broken? The deterministic checks catch "broken"; Claude's eyes on the screenshot catch "looks wrong", the one thing pixels-and-coverage can't judge. It's bundled into failure feedback for free, and on a clean run it fires only when a new feature shipped (so it doesn't nag on every edit). Disable with "visualReview": false. Claude reviews its own output; the deterministic gate still decides pass/fail.

Intent review — the one thing it can't judge, made easy

testloop can't know whether a feature is what you meant — that needs a human. So instead it surfaces ★ New since last run: flows that ran for the first time, with the pages they touched. Flows are plain-English specs, so a five-second glance confirms intent. The dashboard shows this as its own card; nothing is gated on it.

Autonomous wizard walking

Multi-step flows (onboarding, setup) are detected by signals — a "step 2 of 3" counter, progress chrome, a single continue button — and walked to completion without any instructions: fill what's on screen, press continue, repeat. Completion is recognized from the signals vanishing plus congratulations wording; a wizard where clicking continue changes nothing is reported as a hard failure: "users cannot finish this flow." No flow files, no AI, no Claude involvement — pure crawl autonomy.
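The walk described above is essentially a loop with a "did anything change?" guard. The sketch below expresses it against a hypothetical `WizardPage` interface so it can run without a browser; the interface, the `maxSteps` cap, and the result names are assumptions, not testloop's API.

```typescript
// Hypothetical view of the page; a real crawler would back this with browser calls.
interface WizardPage {
  snapshot(): string; // stable description of what is currently on screen
  looksFinished(): boolean; // e.g. step signals gone plus congratulations wording
  fillVisibleFields(): void;
  pressContinue(): void;
}

type WizardResult = "completed" | "stuck" | "gave-up";

// Detect a "step 2 of 3" style counter in page text.
function parseStepCounter(text: string): { current: number; total: number } | null {
  const m = /step\s+(\d+)\s+of\s+(\d+)/i.exec(text);
  if (!m) return null;
  return { current: Number(m[1]), total: Number(m[2]) };
}

// Fill, continue, repeat; report "stuck" when pressing continue changes nothing.
function walkWizard(page: WizardPage, maxSteps: number = 20): WizardResult {
  for (let i = 0; i < maxSteps; i++) {
    if (page.looksFinished()) return "completed";
    page.fillVisibleFields();
    const before = page.snapshot();
    page.pressContinue();
    if (!page.looksFinished() && page.snapshot() === before) return "stuck";
  }
  return page.looksFinished() ? "completed" : "gave-up";
}
```

Taking the snapshot after filling but before pressing continue isolates the effect of the button itself, which is the behavior the hard failure is about.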

Page classification, semantic navigation, and "confusing UI"

The same weighted-signal trick classifies every crawled page (login, signup, dashboard, settings, checkout, pricing, search, content, form, error) into .testloop/sitemap.json. Wrong guesses are correctable in the dashboard, and corrections feed back into auto-login. Flows can then navigate by meaning — goto settings — instead of hardcoded paths. And when a page doesn't classify clearly, that's reported as a warning: if a deterministic classifier can't tell what a page is, humans may struggle too.

Adopting on an existing (messy) project

Run testloop baseline once: every current problem is recorded as a known issue, and from then on only NEW problems fail runs (suppressed counts are still reported). Keys are normalized so timestamps/ids don't defeat matching. Re-run baseline whenever you intentionally accept the current state.
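Key normalization is what lets a baseline survive changing ids and timestamps. A minimal sketch of the idea, assuming findings are plain message strings (testloop's real key format is not shown in this README):

```typescript
// Replace volatile fragments with placeholders so equivalent findings share a key.
function normalizeFindingKey(message: string): string {
  return message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>")
    .replace(/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?/g, "<time>")
    .replace(/\d+/g, "<n>");
}

// Keep only findings whose normalized key is not already in the baseline.
function findNewFindings(current: ReadonlyArray<string>, baseline: ReadonlyArray<string>): string[] {
  const known = baseline.map((b) => normalizeFindingKey(b));
  return current.filter((f) => known.indexOf(normalizeFindingKey(f)) === -1);
}
```

The order of the replacements matters: UUIDs and timestamps are rewritten before the generic digit rule would break them apart.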

Warnings vs failures

Failures block Claude via the hook; warnings only inform you in the report. Warning-level checks include:

Visual changes — every page screenshot is pixel-diffed against the last green run (a clean run auto-becomes the new baseline); diff images saved alongside.
Contrast problems — axe-core's color-contrast pass catches text humans can't read (white-on-white etc). "a11y": "off" disables.
Console warnings — captured alongside errors (errors fail, warnings don't).

Where results live

Terminal report on every manual run, plus a desktop notification (macOS) when any run finishes — no sitting and waiting.
.testloop/last-report.json — full machine-readable report of the latest run (orchestrators can gate on this).
.testloop/history/ — the last 50 runs, timestamped.
.testloop/screenshots/ — a screenshot of every page the robot user visited, so you can see what it saw without opening the app.
testloop status — hook state, config, last result, and recent-run trend in one glance.

Settings (testloop.json)

Everything is optional — testloop runs with an empty config. Explicit CLI flags always win over the file.

| Key | What |
|---|---|
| enabled | false pauses automatic hook runs without uninstalling |
| cmd / url | how to start the app, or where it's already running (auto-detected if omitted) |
| login | test credentials, or false to disable auto-login (see Automatic login) |
| setup | command run before each check to reset/seed data (repeatable runs) |
| budget | time limit per run, seconds (default 150); the real bound on coverage |
| notify / notifyUrl | finish notifications: desktop (macOS) and/or a POST URL (ntfy.sh for phone) |
| slowNetwork | ms of latency injected during click-testing to expose missing loading states (default 400) |
| minNewLines | percent of a changed file's new lines that must execute (default 50) |
| responsive / style / a11y / codeScan | toggle the phone/tablet, design-consistency, contrast, and static-code checks |
| docs / docsMode | flag when code changes but kept docs don't ("warn" or "fail") |
| flowsDir | where flow files live (default testloop-flows) |
| against | git ref the wiring/diff checks compare against (default HEAD) |

The generated TESTLOOP-GUIDE.md (from testloop init) explains each in plain language.
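For orientation, here is a typed sketch of a config object using keys from the table. The value types are inferred from the descriptions and are assumptions; keys whose value format is not stated (notify, responsive, style, codeScan) are left out. The credentials and the seed script name are placeholders.

```typescript
// Inferred shape of testloop.json; every field is optional, matching "runs with an empty config".
interface TestloopConfig {
  enabled?: boolean;
  cmd?: string;
  url?: string;
  login?: false | { user: string; password: string; autoSignup?: boolean };
  setup?: string;
  budget?: number; // seconds
  notifyUrl?: string;
  slowNetwork?: number; // milliseconds
  minNewLines?: number; // percent
  a11y?: string; // "off" disables the contrast pass
  visualReview?: boolean;
  docs?: string | string[];
  docsMode?: "warn" | "fail";
  flowsDir?: string;
  against?: string;
}

const exampleConfig: TestloopConfig = {
  url: "http://localhost:3000",
  login: { user: "qa@example.test", password: "not-a-real-password", autoSignup: true },
  setup: "npm run seed", // hypothetical script name
  budget: 150,
  slowNetwork: 400,
  minNewLines: 50,
  docs: "docs",
  docsMode: "warn",
  flowsDir: "testloop-flows",
  against: "main",
};

// Print the JSON that would go into testloop.json.
console.log(JSON.stringify(exampleConfig, null, 2));
```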

CI

examples/github-action.yml is a drop-in GitHub Actions workflow that runs the same verification on every PR (--against origin/main), uploading the report and screenshots as artifacts — closes the "builder bypassed the local hook" hole.

How the wiring check buckets results

NEVER LOADED: the file was never loaded during the run. Either nothing imports it (not wired in) or it's server-side code that coverage couldn't track, such as a non-Node backend or an app tested via --url (see Limitations).
Loaded but NEVER EXECUTED: the module was imported but none of the new lines ran during the crawl.
Executed: at least some new lines ran while the robot user was clicking around.

Coverage is mapped back to source files two ways: direct URL→file mapping (Vite-style dev servers serve sources at their real paths) and source-map decoding (bundlers, Next.js).
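Once coverage has been mapped to source lines, the bucketing itself is a small set intersection between the diff's new lines and the executed lines. A sketch under that assumption follows; the `WiringResult` shape is illustrative, and how `executedPercent` interacts with the minNewLines setting is not specified here.

```typescript
type WiringBucket = "NEVER LOADED" | "Loaded but NEVER EXECUTED" | "Executed";

interface WiringResult {
  bucket: WiringBucket;
  executedPercent: number; // share of the file's new lines that ran, 0 to 100
}

// executedLines is undefined when the file never appeared in any coverage data.
function bucketChangedFile(
  newLines: ReadonlyArray<number>,
  executedLines: ReadonlyArray<number> | undefined
): WiringResult {
  if (executedLines === undefined) return { bucket: "NEVER LOADED", executedPercent: 0 };
  const executed = executedLines;
  const ran = newLines.filter((line) => executed.indexOf(line) !== -1);
  const executedPercent =
    newLines.length === 0 ? 0 : Math.round((ran.length / newLines.length) * 100);
  const bucket: WiringBucket = ran.length > 0 ? "Executed" : "Loaded but NEVER EXECUTED";
  return { bucket, executedPercent };
}
```

For example, a file with new lines 2 to 5 whose coverage shows lines 1, 2, 3, and 10 as executed lands in Executed with half of its new lines covered.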

Limitations (honest ones)

Intent: testloop proves a feature works, never that it's what you meant — that needs a human. The "★ New since last run" review and the plain-English flow files make that check a five-second glance, but it's still yours to make. Likewise it can't judge whether an answer is correct (only a flow with an exact expect catches a wrong total) or whether copy/aesthetics are good (only consistency).
Email/SMS verification codes: a robot can't receive them. testloop detects the verification wall and tells you the fixes (disable verification in dev mode, or pre-create/seed a verified account) rather than failing mysteriously.
Non-Node backends: server coverage works by launching the app with NODE_V8_COVERAGE, so Express/Next/Fastify etc. are tracked; Python/Go/Rails backends (or apps tested via --url) aren't — their files land in NEVER LOADED with an "unverified" caveat.
Fake elements: a "button" made from a bare <div> with no role, or an "input" from a contenteditable div, is invisible to the bot — and to screen readers and keyboards, so it's a real accessibility bug regardless.
Destructive buttons: anything labeled logout/delete/remove/reset is skipped on purpose.
Data-dependent UI: an empty database can make features invisible to the crawl — use the setup seed command.

Auth is handled automatically (see Automatic login), and multi-step flows are walked autonomously (see Autonomous wizard walking) — neither is a limitation anymore.

Watch mode

testloop watch re-runs the full check on every file change — no matter what wrote it (a Claude builder, another tool, or you). Runs are debounced and serialized; its own report writes are ignored so it never self-triggers.

Documentation drift (opt-in)

If you keep docs in the project, set "docs": "docs" (a path or array of paths) in testloop.json: when code changes but nothing under those paths does, it's flagged. "docsMode": "fail" makes builders update the docs before they can finish; the default "warn" just tells you.
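The drift rule reduces to a check over the list of changed files. In this sketch, "code" means any changed file outside the configured docs paths, which is an assumption; the function name and return values are also illustrative.

```typescript
// True when `file` is `dir` itself or lives somewhere beneath it.
function isUnderPath(file: string, dir: string): boolean {
  const prefix = dir.endsWith("/") ? dir : dir + "/";
  return file === dir || file.startsWith(prefix);
}

// "ok" when docs kept up (or nothing but docs changed), otherwise the configured severity.
function checkDocsDrift(
  changedFiles: ReadonlyArray<string>,
  docs: string | ReadonlyArray<string>,
  docsMode: "warn" | "fail" = "warn"
): "ok" | "warn" | "fail" {
  const docPaths: ReadonlyArray<string> = typeof docs === "string" ? [docs] : docs;
  const isDoc = (f: string): boolean => docPaths.some((d) => isUnderPath(f, d));
  const codeChanged = changedFiles.some((f) => !isDoc(f));
  const docsChanged = changedFiles.some((f) => isDoc(f));
  return codeChanged && !docsChanged ? docsMode : "ok";
}
```

Matching on a trailing slash keeps a path like docs-old/notes.md from counting as a change under docs.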

Roadmap

DECISIONS.md convention — capture the why from your AI conversations so builders stop re-deriving (or contradicting) past decisions
Production-monitoring mode — testloop pointed at a live URL on a schedule, read-only (no form submits)
Cross-browser runs (Playwright already ships Firefox + WebKit)
Static "wiring linter" (orphan exports, unregistered routes) for instant pre-crawl feedback
