visual-agent
v0.1.7
AI-driven browser testing tool with live headed browser, video evidence, and per-session artifacts.
# visual-agent
AI-driven browser testing tool. Spawns a real Chromium window, runs an inner LLM that drives it (snapshot → click/fill/screenshot → assert → finish), records video + Playwright trace + transcript per session, and exposes both a CLI and a TypeScript API. Built to be called by another AI agent — but usable by humans too.
## In action
A real run: search Amazon for a wireless mouse, open the first product, add it to the cart, verify the cart count.
If the video doesn't play (e.g. on the npm package page), download it here.
The viewer gives you a Playwright-trace-style replay for every session — synced filmstrip, action list, screenshot/video panel, network log, console, and comments:

## Install
Published on npm: visual-agent.
Requires Bun ≥ 1.2 (the runtime is Bun-native; the bin is a small node-compatible shim that re-execs into bun and prints a clear error if bun is missing from PATH).
```sh
# Global install — exposes `visual-agent` and `va` on PATH
npm install -g visual-agent
# or: bun add -g visual-agent

# First-time setup (interactive, writes .env)
visual-agent configure
```

The postinstall hook downloads Playwright Chromium (~150MB). If that's a problem, set `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1` and install the browser yourself with `bunx playwright install chromium`.
LLM providers for --agent runs:
- `anthropic` (default when available) — runs through the official `@anthropic-ai/claude-agent-sdk`. Auth is your `claude login` (Claude Pro/Max subscription) — no API key, no per-token billing, just your subscription quota. The SDK handles credentials, refresh, and rate-limit retry internally.
- `openai` — any OpenAI-compatible endpoint: OpenAI itself, OpenRouter, local LM Studio / vLLM, etc. Pay-per-token. Used as a fallback when no Claude login is set up, or explicitly via `--llm openai` / `LLM_PROVIDER=openai`.
From source (for contributors):
```sh
git clone https://github.com/<you>/visual-agent && cd visual-agent
bun install
bun link  # exposes the local checkout as `visual-agent` / `va` on PATH
```

`visual-agent configure --check` prints the current state without prompting; `--missing-only` only fills in keys that aren't set yet (safe in scripts).
Default — Claude subscription (free; counts against your Pro/Max quota):
```sh
claude login                         # one-time
visual-agent run ... --agent --json  # auto-detects, runs through Claude Agent SDK
```

OpenAI / OpenAI-compatible (OpenRouter, local, etc.):
```sh
LLM_PROVIDER=openai
LLM_API_KEY=sk-...                          # OpenAI / OpenRouter key
LLM_BASE_URL=https://openrouter.ai/api/v1   # omit for OpenAI itself; or http://localhost:1234/v1 etc.
LLM_MODEL=gpt-4o-mini                       # or anthropic/claude-sonnet-4 via OpenRouter, qwen/qwen-vl-max, ...

visual-agent run ... --agent --json
```

Or pass `--llm openai` / `--llm anthropic` per invocation to override the auto-detected default.
## Quick start
```sh
# Autonomous run (parseable JSON to stdout)
visual-agent run \
  --url https://example.com \
  --goal "verify the page loads and the heading reads 'Example Domain'" \
  --agent --json

# Confirm the verdict — fast transcript audit
visual-agent verify <session-id>

# Or rigorous (audit + headless replay) before locking the flow as a Playwright spec
visual-agent strict-verify <session-id>

# Browse sessions, video, transcript, network, comments, memory
visual-agent viewer --open
```

Run `visual-agent help` for the human reference, `visual-agent help --json` for the agent-readable manifest.
## Commands
| Command | What it does |
|---|---|
| `run` | Run a session — autonomous (`--agent --goal`) or interactive REPL. |
| `verify <id>` | Fast audit: re-walks the transcript and checks the recorded verdict against observed asserts/finish. ~50ms, no browser launched. Catches transcript-internal hallucinations. |
| `strict-verify <id>` | Slow + rigorous: runs `verify` first, then re-executes the deterministic part of the transcript on a fresh headless browser. ~30–60s. Use before locking a flow with `codegen` or on high-stakes runs. |
| `replay <id>` | Re-execute supported deterministic actions on a fresh browser. Translates refs by element similarity; strict mode fails on unsupported mutating actions. |
| `list` | List sessions with filters (`--passed`, `--failed`, `--running`, `--host`, `--since`, `--limit`, `--json`). |
| `codegen <id>` | Emit a Playwright `.spec.ts` from a run, including the start URL and warnings for skipped/unstable pieces. |
| `clean` | Delete sessions matching filters (`--older-than 7d`, `--keep-last 50`, `--passed`/`--failed`/`--running`, `--dry-run`). |
| `export <id>` | Bundle a session (video / trace / transcript / screenshots / comments / side-channels) into a single `.zip` for sharing. |
| `configure` | Interactive `.env` setup for `LLM_*` env vars. |
| `viewer` | Local web app for replay (`--open`, `--port`, `--session <id>` to deep-link). Detects an already-running instance. |
| `mcp` | MCP stdio server. Wire it in as a tool source for Claude Code or any MCP client. |
| `memory rebuild` | Aggregate addressed comments into `memory/<host>.md`. (Also auto-runs when you mark a comment "addressed" in the viewer.) |
| `help [cmd]` | Human help. `help --json` returns a structured manifest. |
## Run flags
```
--url <url>               starting URL (required)
--goal <text>             test objective (required for --agent)
--agent                   autonomous mode
--json                    single-line JSON to stdout, no chatter
--max-steps <n>           cap on agent tool rounds (default 30)
--timeout <dur>           wall-clock budget (e.g. 30m, 24h, 7d, or raw ms)
--max-tokens <n>          cap on cumulative LLM tokens
--action-timeout <ms>     per-action Playwright timeout (default 5000)
--llm <provider>          'anthropic' (default if claude login is set up) or 'openai' (env LLM_PROVIDER)
--llm-api-key <key>       override LLM_API_KEY (openai provider only — anthropic uses your claude login)
--llm-base-url <url>      override LLM_BASE_URL (openai provider only)
--llm-model <id>          override LLM_MODEL (defaults: gpt-4o-mini / claude-haiku-4-5)
--live-feedback           poll comments.json mid-run
--headless                run Chromium headless (CI)
--auth-state-load <path>  load Playwright storageState before the run
--auth-state-save <path>  write storageState after the run
--mock <path>             JSON file of network mock rules (fulfill / abort)
--start <cmd>             spawn a dev-server before the run; killed on shutdown
--wait-for <url|port>     poll until reachable before the agent starts
--start-timeout <dur>     --wait-for budget (default 60s)
--start-stdio inherit     forward dev-server stdio to ours (default: muted)
```

## What you get per run
`.e2e-sessions/<id>/`:

| File | What |
|---|---|
| `video.webm` | Headed-browser recording with the agent's animated cursor |
| `trace.zip` | Playwright trace (`bunx playwright show-trace <path>`) |
| `transcript.json` | Final action log: every tool call, args, result, errors, healed refs (incl. iframe-prefixed `f0.eN` refs) |
| `transcript.ndjson` | Append-only stream the viewer tails live |
| `step-NNN.png` | Screenshots — auto-captured after every side-effecting action (click, fill, navigate, scroll, …) plus explicit `screenshot` / `assert_screenshot` tool calls. The viewer's filmstrip uses these to show per-step visual state. |
| `console.json` | Console + pageerror + dialog events captured during the run |
| `network.json` | Per-request: method, status, headers, payload, response body, timing |
| `result.json` | `{ id, passed, summary, durationMs, artifacts }` |
| `comments.json` | Human feedback on the verdict / video time / transcript step / pinned screenshot |
| `downloads/` + `downloads.json` | Files saved by the page during the run + side-channel index |
`.e2e-sessions/_baselines/` holds visual-regression baselines (`assert_screenshot`).
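Under the hood, `assert_screenshot` boils down to a per-pixel comparison against the stored baseline (the real tool uses pngjs + pixelmatch). A minimal sketch of the idea — the tolerance handling here is illustrative, not the actual algorithm:

```typescript
// Illustrative only — visual-agent uses pngjs + pixelmatch internally.
// Counts RGBA pixels whose summed RGB channel delta exceeds a tolerance.
function diffPixels(a: Uint8Array, b: Uint8Array, tolerance = 0): number {
  if (a.length !== b.length) throw new Error("baseline/screenshot size mismatch");
  let mismatched = 0;
  for (let i = 0; i < a.length; i += 4) { // RGBA stride: 4 bytes per pixel
    const delta =
      Math.abs(a[i] - b[i]) +         // R
      Math.abs(a[i + 1] - b[i + 1]) + // G
      Math.abs(a[i + 2] - b[i + 2]);  // B
    if (delta > tolerance) mismatched++;
  }
  return mismatched; // > 0 → assert_screenshot fails and writes a .diff.png
}
```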
## Writing goals
The agent treats your goal as a checklist. Keep it narrow and observably verifiable or it will keep inventing steps.
Bad: "Test the todo app" — open-ended, the agent will enumerate Scenario 1, 2, 3… and may not stop.
Good: "Add 'Buy milk' to the todo list and verify it appears" — one outcome, one check, one finish.
If you genuinely want a suite, run separate sessions per scenario. They're cheap and parallelizable, and verify / strict-verify give you a clean pass/fail per case.
## Workflows
Smoke test in CI (cheap audit only — instant):
```sh
visual-agent run --url $URL --goal "..." --agent --headless --json --max-steps 10 > out.json
ID=$(jq -r .id out.json)
visual-agent verify $ID || exit 1
```

`verify` audits the saved finish/assert entries — fast, no browser launched. It catches transcript-internal hallucinations (PASS without asserts, missing finish, asserts that returned false). For high-stakes runs (release smokes, prod checks), or before locking a flow as a Playwright spec, follow up with `strict-verify`, which adds a fresh headless replay on top:
```sh
visual-agent strict-verify $ID || exit 1
```

`replay` is still transcript-based, so flows that depend on time, external state, unsupported browser side effects, or unstable generated refs should be locked with `codegen` and reviewed before CI.
Login once, reuse the auth state:
```sh
visual-agent run --url https://app.local/login \
  --goal "log in as [email protected] / hunter2, verify URL contains /dashboard" \
  --agent --auth-state-save .auth/admin.json --json

visual-agent run --url https://app.local/orders \
  --goal "verify the orders table has rows" \
  --agent --auth-state-load .auth/admin.json --json
```

Discover with the agent, lock with Playwright (use `strict-verify` before `codegen` — it proves the flow re-runs cleanly):
```sh
SID=$(visual-agent run --url $URL --goal "..." --agent --json | jq -r .id)
visual-agent strict-verify $SID && visual-agent codegen $SID --out tests/discovered.spec.ts
bunx playwright test tests/discovered.spec.ts
```

`codegen` starts from the recorded startUrl unless the transcript begins with an explicit navigation. It emits comments for skipped side-channel checks or unstable locators; treat those warnings as required review before relying on the spec as a regression test.
Find prior runs:
```sh
visual-agent list --failed --since 2026-04-29 --json
visual-agent list --host app.local --passed --limit 5
```

Spin up the app under test:
```sh
visual-agent run \
  --start "bun dev" --wait-for http://localhost:3000 \
  --url http://localhost:3000 --goal "..." --agent --json
# dev-server is killed automatically on shutdown
```

Share a session for a bug report:
```sh
visual-agent export 20260501-... --out /tmp/repro.zip
# unzip -l /tmp/repro.zip → video, trace, transcript, screenshots, comments, console/network logs
```

Stub flaky third-party APIs:
```sh
cat > /tmp/mocks.json <<'JSON'
[
  { "url": "**/api/billing", "status": 200, "body": { "plan": "pro" } },
  { "url": "/analytics\\.example\\.com/", "abort": true }
]
JSON

visual-agent run --url ... --mock /tmp/mocks.json --agent --json
```

## Programmatic API
```ts
import { runTest, openSession } from "visual-agent";

// High-level: hand off a goal, get a verdict + evidence.
const result = await runTest({
  goal: "log in and verify the dashboard loads",
  url: "https://staging.app.com/login",
});

// Low-level: drive the browser yourself, no inner LLM.
const s = await openSession({ url: "https://example.com" });
const snap = await s.snapshot();
await s.click(snap.elements[0].ref);
await s.assertUrl("/dashboard");
const r = await s.finish({ passed: true, summary: "manual run" });
```

## Use from Claude Code
Install the skill so Claude knows when and how to invoke this tool:
```sh
./skill/install.sh
```

The skill (`skill/SKILL.md`) covers when to use it, how to write goals, the `verify` guardrail against hallucinated PASS, and the `codegen` lock-in step.
## MCP
Alternative to the skill: register visual-agent as an MCP server so any MCP-aware client (Claude Code, etc.) gets run_test, verify_session, strict_verify_session, list_sessions, codegen, rebuild_memory natively without shelling out.
In .claude/mcp.json (or your client's equivalent):
```json
{
  "mcpServers": {
    "visual-agent": {
      "command": "visual-agent",
      "args": ["mcp"]
    }
  }
}
```

## Viewer
visual-agent viewer --open launches a Chrome devtools-style replay UI at http://localhost:4567 (or --port/VA_PORT). Pass --session <id> (with or without --open) to deep-link straight to a specific session. Already-running detection: a second invocation prints the URL (and re-opens it on --open) instead of erroring.
- Trace-viewer-style layout: filmstrip on top, action list on the left, tabbed center (Screenshot / Video / Console / Network / Downloads / Detail), comments on the right.
- Network panel: sortable table, URL filter, errors-only checkbox, waterfall, click-to-expand request/response detail (headers / payload / body / timing) on a right side panel with × close.
- Console panel: same toolbar pattern (text filter + level select), click-to-expand detail.
- Downloads panel: every captured file linked to download.
- Filmstrip: one slot per visible action (action count matches list count); hovering syncs both ways with the action list.
- Live mode: sessions still running show a pulsing LIVE badge and stream new actions in via SSE.
- Per-session header has download links for `trace.zip` and a generated `export.zip`.
## Feedback loop and memory
The viewer lets you attach comments to artifacts — verdict, video timestamps, transcript steps, or pinned points on screenshots. Comments marked addressed are aggregated by visual-agent memory rebuild into memory/<host>.md per host, and the agent prepends that file to its system prompt on future runs against the same host. So corrections compound across sessions.
--live-feedback makes the agent poll comments mid-run and incorporate them on the next round, no restart required.
Live element selection: with the headed browser focused, Cmd/Ctrl+Shift+S (or select in the REPL) enters a point-and-click mode — hover highlights elements, click opens an inline comment popover, and the selection is queued as feedback for the next agent round. Stored in comments.json as kind: "selection" with the captured ref + element description.
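For illustration, a selection comment might look roughly like this in `comments.json` — only `kind: "selection"`, the captured ref, and an element description are documented; the exact field names below are assumptions:

```json
{
  "kind": "selection",
  "ref": "e12",
  "element": "button \"Add to cart\"",
  "text": "use this button, not the header link"
}
```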
## Environment variables
```
LLM_PROVIDER             'anthropic' (auto-detected default when claude login is set up) or 'openai'
LLM_API_KEY              API key for the openai provider (or OPENAI_API_KEY)
LLM_BASE_URL             OpenAI-compatible endpoint URL (openai provider only)
LLM_MODEL                default model id — openai: gpt-4o-mini; anthropic: claude-haiku-4-5
                         (override with ANTHROPIC_MODEL or --llm-model)
ANTHROPIC_MODEL          override the anthropic default model (e.g. claude-sonnet-4-5)
CLAUDE_CREDENTIALS_FILE  override the Claude credentials path (mostly for testing; the SDK reads
                         the macOS keychain or ~/.claude/.credentials.json by default)
VA_ACTION_TIMEOUT        per-action Playwright timeout in ms (default 5000)
VA_SESSIONS              sessions root directory (default ./.e2e-sessions)
VA_MEMORY                memory root directory (default ./memory)
VA_BASELINES             visual-regression baselines root
VA_PORT                  default viewer port
VA_NETWORK_BODY_BUDGET   max bytes of stored HTTP response bodies per session (default 32MB)
```

## How it works
- LLM providers: `anthropic` runs through the official `@anthropic-ai/claude-agent-sdk` — every browser primitive is registered as an in-process SDK MCP tool, the SDK drives its own loop, and auth comes from the user's `claude login`. `openai` uses standard chat-completion tool calls against any OpenAI-compatible endpoint. The provider is auto-selected; the two paths use different message shapes internally but write the same uniform transcripts.
- Browser: Playwright Chromium, with video + tracing + console + network capture.
- Visual cursor: a fake cursor element animates to each target before the synthetic click — visible in the recording.
- Glow + banner + input lock: a transparent overlay swallows user clicks while the agent runs (cosmetic; not OS-level).
- Refs: `snapshot()` walks the DOM, tags every visible interactable with `data-va-ref="eN"`, and returns `{ref, tag, role, name, text, ...}`. Agents pick refs by semantics; the next snapshot wipes them.
- Self-heal: stale refs auto-resnap and try to match the same logical element by tag/role/name/text similarity.
- Multimodal: `screenshot` returns a PNG forwarded to the model as an `image_url` content block.
- Visual regression: `assert_screenshot` compares against `.e2e-sessions/_baselines/<name>.png` via pngjs + pixelmatch and writes a `.diff.png` on mismatch.
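The self-heal step can be pictured as a scoring pass over the fresh snapshot. A hypothetical sketch of that idea — the weights and threshold below are invented for illustration; the real matcher is internal to visual-agent:

```typescript
// Hypothetical sketch of ref healing: pick the fresh element most similar
// to the stale one by tag/role/name/text, or give up below a threshold.
type El = { ref: string; tag: string; role: string; name: string; text: string };

function healRef(stale: El, fresh: El[]): El | undefined {
  const score = (c: El) =>
    (c.tag === stale.tag ? 2 : 0) +
    (c.role === stale.role ? 2 : 0) +
    (c.name === stale.name ? 3 : 0) +
    (c.text === stale.text ? 1 : 0);
  let best: El | undefined;
  let bestScore = 3; // require more than a coincidental partial match (made-up threshold)
  for (const c of fresh) {
    const s = score(c);
    if (s > bestScore) { best = c; bestScore = s; }
  }
  return best; // undefined → the action fails instead of clicking the wrong element
}
```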
## Security
Local tool, designed for trusted environments. Some specifics:
- The viewer binds to `127.0.0.1` only and gates every request behind a per-launch token (set as a `SameSite=Strict; HttpOnly` cookie on first visit). The token is in the URL printed at startup — anyone who can read your terminal can read the sessions.
- `--start <cmd>` runs the supplied string through `sh -c`. Don't pass untrusted input to it (e.g. don't template it with values from a goal file you didn't write). The dev-server child gets your full environment.
- `network.json` captures request and response bodies, including `Authorization` headers and cookies. Treat session directories as sensitive — don't commit them, and don't paste them into bug reports without redacting.
- The viewer is not hardened for multi-tenant or network exposure. Don't put it behind a public reverse proxy.
If you find a security issue, please email [email protected] rather than opening a public issue.
## License
MIT — see LICENSE.
