reflex-browser

v0.4.0

Published

8 hours ago

Token-efficient browser automation for AI agents. Local MCP server + CLI: compressed page snapshots, batched actions, forensic failures.

nitaiaharoni1

Reflex

Token-efficient browser automation CLI built for AI agents. Wraps Playwright's engine with an agent-native interface: distilled page snapshots, change deltas instead of page dumps, stable element refs, batched guarded actions, forensic failure reports, visual auditing, and agent-free replayable flows.

Reflex is designed for an AI agent in the loop. The cost model it optimizes is the agent's, not the tool's: model round trips (each one costs seconds of inference) and tokens entering context. It deliberately trades raw tool-side wall clock for fewer turns, smaller payloads, and failure reports an agent can act on in one turn. If no agent is involved, plain Playwright or a recorded Reflex flow (reflex flow run, zero tokens) is the right tool.

Design rationale: see docs/REFLEX-DESIGN.md.

Install

No global install is needed. Every configuration below launches Reflex through npx with reflex-browser@latest, so it always runs the newest published version.

On first run, Reflex uses your installed Google Chrome if it is available. If Chrome is not present, it automatically downloads a Chromium browser (~170MB, one-time), which is then cached and reused.

After installing, have your agent run the reflex_auth tool with action: "login" to sign in. This opens a browser window to the Reflex site, authenticates you, and saves a key locally. No manual API-key copying required. (For CI or headless environments, set REFLEX_API_KEY=rfx_live_... in the environment instead.)

Claude Desktop

Add Reflex to claude_desktop_config.json (Settings, Developer, Edit Config):

{
  "mcpServers": {
    "reflex": {
      "command": "npx",
      "args": ["-y", "reflex-browser@latest", "mcp"]
    }
  }
}

Restart Claude Desktop fully (quit from the menu bar, not just the window), then ask Claude to run reflex_auth with action: "login" to authenticate.

Claude Code

claude mcp add reflex -- npx -y reflex-browser@latest mcp

Then in a new session ask Claude to run reflex_auth with action: "login".

Cursor

Add to ~/.cursor/mcp.json (global) or .cursor/mcp.json (per-project):

{
  "mcpServers": {
    "reflex": {
      "command": "npx",
      "args": ["-y", "reflex-browser@latest", "mcp"]
    }
  }
}

Then ask the agent to run reflex_auth with action: "login".

On Windows, wrap with cmd /c: "command": "cmd", "args": ["/c", "npx", "-y", "reflex-browser@latest", "mcp"].

Everything runs on your machine: the server, the browser, your logins. Page content never leaves it. The server exposes 9 tools (reflex_open, reflex_act, reflex_read, reflex_extract, reflex_check, reflex_shot, reflex_tab, reflex_flow, reflex_auth), whose definitions total under 1,000 tokens; a 10th reflex_eval is available behind REFLEX_MCP_EVAL=1. The CLI and MCP server are one package and one version, so they can never skew. macOS is fully supported; Windows is beta.

To run from a local checkout instead:

git clone <this repo> && cd reflex
npm install
ln -s "$PWD/bin/reflex.js" /usr/local/bin/reflex   # optional, for CLI use

Point your MCP client at the local checkout instead of npx:

{
  "mcpServers": {
    "reflex": {
      "command": "node",
      "args": ["/absolute/path/to/reflex/bin/reflex-browser.js", "mcp"]
    }
  }
}

A persistent daemon keeps the browser warm between calls; the first call starts it automatically.

Why

Playwright MCP historically sent the model a full accessibility snapshot after every single action: 10K-16K tokens per snapshot and one LLM round trip per click. (Playwright MCP has been phasing this out since v0.0.51, December 2025: incremental snapshots by default, then file-linked snapshots; current actions return short summaries plus a snapshot file path. One round trip per action remains, and a full browser_snapshot look is still ~30K tokens inline.) Reflex sends a compressed semantic outline once, then only what changed, executes whole flows in one call, and checks expectations inside the batch.

Benchmark (same pages, same browser; the aria snapshot is what a full Playwright MCP browser_snapshot look returns):

| page | aria snapshot | reflex snapshot | reduction |
|---|---|---|---|
| Hacker News | ~10,395 tok | ~2,063 tok | 5.0x |
| Wikipedia article | ~15,990 tok | ~4,046 tok | 4.0x |
| GitHub repo page | ~11,971 tok | ~3,312 tok | 3.6x |

Across three real multi-step flows (same tasks, same verified end conditions), the latest Playwright MCP used as a pure MCP client took 14 calls and ~56K tokens of cumulative context; Reflex took 6 calls and ~13K tokens. Across the head-to-head's six attempted flows, Reflex completed 5 versus Playwright MCP's 3. Reproduce with node bench/headtohead.js.
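The reduction column and the flow totals can be re-derived from the published figures. The sketch below only redoes that arithmetic; it does not run the benchmark, and the rounded ~56K and ~13K totals are used as given.

```javascript
// Figures copied from the benchmark table and the three-flow totals above.
const pages = [
  { page: "Hacker News", aria: 10395, reflex: 2063 },
  { page: "Wikipedia article", aria: 15990, reflex: 4046 },
  { page: "GitHub repo page", aria: 11971, reflex: 3312 },
];

// Reduction factor = baseline / reduced, formatted to one decimal place.
const reduction = (baseline, reduced) => (baseline / reduced).toFixed(1) + "x";

const rows = pages.map((p) => ({
  page: p.page,
  reduction: reduction(p.aria, p.reflex),
}));

// Three-flow totals (approximate in the source: ~56K vs ~13K tokens, 14 vs 6 calls).
const flowTokens = reduction(56000, 13000);
const flowCalls = reduction(14, 6);

for (const r of rows) console.log(`${r.page}: ${r.reduction}`);
console.log(`flow tokens: ${flowTokens}, flow calls: ${flowCalls}`);
```

For the numbers that come from actual runs, use node bench/headtohead.js.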

Full benchmark report with 12-site coverage, real-world flow comparisons (including failures), delta measurements, and honesty disclosures (including a dated addendum on Playwright MCP's newer file-based snapshot behavior): bench/BENCHMARKS.md.

Four-way head-to-head against Playwright MCP 0.0.76, Playwright CLI, and Vercel's agent-browser (same flows, no LLM, failures verbatim): bench/HEADTOHEAD.md. Short version: Reflex has the cheapest page "look" everywhere tested (26x on giant pages), fewest turns with verification included, and the best failure reports; Playwright CLI has the lowest token floor on pre-planned flows; agent-browser has the fastest raw execution.

Usage

reflex open https://app.example.com        # distilled snapshot with [refs]
reflex act '[{"fill":["Email","user@example.com"]},{"fill":["Password","secret"]},
             {"click":"Sign in"},{"expect":{"visible":"Dashboard"}}]'
reflex read                                 # delta since last look, or "(page unchanged)"
reflex check                                # visual audit: overlap, contrast, clipped text, broken images
reflex extract a3f2                         # full table/list data (big output goes to a file)
reflex shot --region a3f2                   # screenshot crop to file

Targets are either refs from snapshots (a3f2) or semantic names ("Sign in", a label, a placeholder, an id). Resolution prefers exact accessible-name matches, falls back to prefix and substring, and reports candidates when ambiguous.
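A minimal sketch of that resolution order, for illustration only. It is not Reflex's resolver: the function name, the element shape ({ ref, name }), the result shape, and the case-insensitive comparison are all assumptions.

```javascript
// Hypothetical sketch of tiered target resolution: ref, then exact name,
// then prefix, then substring. Not the actual Reflex implementation.
function resolveTarget(query, elements) {
  // A snapshot ref such as "a3f2" wins outright.
  const byRef = elements.find((el) => el.ref === query);
  if (byRef) return { ok: true, element: byRef };

  const q = query.trim().toLowerCase(); // case-insensitivity is an assumption
  const tiers = [
    (name) => name === q,          // exact accessible-name match
    (name) => name.startsWith(q),  // prefix fallback
    (name) => name.includes(q),    // substring fallback
  ];

  for (const matches of tiers) {
    const hits = elements.filter((el) => matches(el.name.trim().toLowerCase()));
    if (hits.length === 1) return { ok: true, element: hits[0] };
    // More than one hit in a tier: report candidates instead of guessing.
    if (hits.length > 1) return { ok: false, reason: "ambiguous", candidates: hits };
  }
  return { ok: false, reason: "not_found", candidates: [] };
}

const demoElements = [
  { ref: "a3f2", name: "Sign in" },
  { ref: "b71c", name: "Sign in with Google" },
];
console.log(resolveTarget("Sign in", demoElements)); // exact match on the first element
console.log(resolveTarget("Sign", demoElements));    // ambiguous prefix, two candidates
```

Stopping at the first tier that produces any hit keeps an exact match from being shadowed by looser ones, which is why "Sign in" resolves cleanly even though "Sign in with Google" also contains it.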

reflex help prints the full command reference (kept short on purpose: it fits in an agent's context for pennies).

Batched actions with guards

reflex act runs many steps in one call and stops at the first failure with a forensic report: failed step, reason, near-miss candidates, recent console errors and failed requests, a screenshot path, and the page changes that happened before the failure. One report is enough to repair the flow in a single agent turn.

Step types: goto, click, dblclick, fill, type, press, select, check, uncheck, hover, focus, scroll, wait, expect, shot, eval.
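The stop-at-first-failure behavior can be sketched as a loop over steps. This is an illustration, not Reflex's code: runBatch, the execute callback, and the report fields are hypothetical, and the real forensic report carries more (candidates, console errors, failed requests, a screenshot path, the pre-failure delta).

```javascript
// Hypothetical sketch: run steps in order, stop at the first failure,
// and return one report describing what failed and what had already run.
function runBatch(steps, execute) {
  const completed = [];
  for (let i = 0; i < steps.length; i++) {
    try {
      execute(steps[i]); // caller-supplied executor; throws on failure
      completed.push(i);
    } catch (err) {
      return {
        ok: false,
        failedStep: i,
        step: steps[i],
        reason: err.message,
        completed, // indexes of steps that ran before the failure
      };
    }
  }
  return { ok: true, completed };
}

// Demo executor: pretends that no element is named "Missing".
const demoExecute = (step) => {
  if (step.click === "Missing") throw new Error("no element named Missing");
};

const demoReport = runBatch(
  [{ fill: ["Email", "user@example.com"] }, { click: "Missing" }, { click: "Sign in" }],
  demoExecute
);
console.log(demoReport);
```

Steps after the failing one never run, so the report describes the page as it was at the moment of failure.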

expect assertions: {"visible": t}, {"hidden": t}, {"text": s} (case-sensitive), {"url_contains": s}, {"value": v, "target": t}.
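To make the five assertion forms concrete, here is a sketch that evaluates them against a plain in-memory page model. The model ({ url, text, elements }) and checkExpect are hypothetical. Treating a missing element as hidden, and {"text": s} as "page text contains s", are assumptions rather than documented behavior.

```javascript
// Hypothetical evaluator for the expect forms listed above, run against
// a simplified page model instead of a live browser.
function checkExpect(expectation, page) {
  const find = (target) => page.elements.find((el) => el.name === target);

  if ("visible" in expectation) {
    const el = find(expectation.visible);
    return Boolean(el && el.visible);
  }
  if ("hidden" in expectation) {
    const el = find(expectation.hidden);
    return !el || !el.visible; // assumption: an absent element counts as hidden
  }
  if ("text" in expectation) {
    return page.text.includes(expectation.text); // case-sensitive, as documented
  }
  if ("url_contains" in expectation) {
    return page.url.includes(expectation.url_contains);
  }
  if ("value" in expectation) {
    const el = find(expectation.target);
    return Boolean(el) && el.value === expectation.value;
  }
  throw new Error("unknown expect form: " + JSON.stringify(expectation));
}

const demoPage = {
  url: "https://app.example.com/dashboard",
  text: "Welcome to your Dashboard",
  elements: [
    { name: "Dashboard", visible: true, value: "" },
    { name: "Email", visible: false, value: "user@example.com" },
  ],
};
console.log(checkExpect({ visible: "Dashboard" }, demoPage));
console.log(checkExpect({ text: "dashboard" }, demoPage)); // lowercase "d" does not match
```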

Flows: replay without an agent

reflex flow save login --params email      # records this session's steps
# edit ~/.reflex/flows/login.json: replace recorded values with {{email}}
reflex flow run login --param email=user@example.com

Replay is deterministic (no LLM): semantic targets re-resolve against the live page, with the recorded CSS path as a self-healing fallback. Flow composition (a step {"run": "login"}) is planned for a future version.
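The {{email}} placeholder step amounts to a template substitution over the recorded steps. A minimal sketch, assuming the flow file holds a steps array shaped like the reflex act JSON (the on-disk format is not documented here, and applyParams is a hypothetical helper):

```javascript
// Hypothetical sketch: replace {{name}} placeholders anywhere inside a
// recorded steps structure. The flow file layout is an assumption.
function applyParams(value, params) {
  if (typeof value === "string") {
    // \w+ deliberately does not match "env:VAR", so {{env:...}} secrets
    // are left for a separate substitution pass.
    return value.replace(/\{\{(\w+)\}\}/g, (match, name) => {
      if (!Object.prototype.hasOwnProperty.call(params, name)) {
        throw new Error(`missing flow parameter: ${name}`);
      }
      return String(params[name]);
    });
  }
  if (Array.isArray(value)) return value.map((v) => applyParams(v, params));
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, applyParams(v, params)])
    );
  }
  return value; // numbers, booleans, null pass through unchanged
}

const recordedSteps = [{ fill: ["Email", "{{email}}"] }, { click: "Sign in" }];
console.log(JSON.stringify(applyParams(recordedSteps, { email: "user@example.com" })));
```

Failing loudly on a missing parameter keeps a literal "{{email}}" from being typed into a form during replay.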

Auth state

reflex state save myapp                    # cookies + localStorage
reflex open https://myapp.com --state myapp

Profile mode (stay signed in)

Reflex runs a persistent browser profile by default, so a site you sign into once in the Reflex window stays signed in across every later run (the persistent profile also defeats Google's "this browser may not be secure" automation block, so "Continue with Google" works). The profile is per-session, so parallel agents never share one.

reflex profile status          # is it on, and where is it stored
reflex profile list            # Chrome profiles you can seed from
reflex profile use "Work"      # pick the source Chrome profile
reflex profile reset           # wipe saved logins; re-seeds on next open
REFLEX_PROFILE=off reflex ...  # opt out: a clean throwaway browser each run

On first run it seeds from your real Chrome profile. macOS: cookies and passwords are not copied (they are Keychain-encrypted and an automation-launched Chrome cannot decrypt a copy), so only bookmarks are seeded; sign into each site once and it persists. Windows/Linux: the full login set carries over (the encryption key travels in Local State). Requires Google Chrome installed; without it, Reflex falls back to a clean browser.

Tabs, dialogs, secrets, parallel sessions

Popups and target=_blank links are followed automatically; responses note [switched to new tab]. Manage with reflex tabs, reflex tab <n>, reflex tab <n> --close.
JS dialogs (alert/confirm/prompt) are auto-dismissed and reported in the console feed, so they can never deadlock a session.
Secrets: write {{env:VAR}} in any value; the CLI substitutes it from the environment, so credentials stay out of step JSON and agent context.
Parallel agents: pass session: "name" on any MCP tool call (or REFLEX_SESSION=<name> reflex ... from the CLI) and that call runs in a fully isolated daemon and browser. Sessions spawn on demand, cost nothing extra (credits are per tool call, never per session), idle ones exit on their own, and reflex_session lists or stops them. Two agents, ten agents: no shared profile, no tab fights, no snapshot crosstalk.
Concurrent clients on one shared session are safe too: browser ops are serialized, and each client keeps its own delta baseline, so one agent's actions never make another's read claim "(page unchanged)" about a page it has not seen.
The daemon authenticates its clients via a per-instance token in ~/.reflex/daemon.json (mode 0600).
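For readers unfamiliar with the per-instance token pattern in the last point, here is a sketch using Node's standard library. The JSON layout ({ token }), the helper names, and the constant-time comparison are assumptions. Only the file permissions (mode 0600) come from the description above, and the demo writes to a temporary directory rather than ~/.reflex.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical sketch: create a random token and store it so that only
// the owning user can read it.
function writeTokenFile(file) {
  const token = crypto.randomBytes(32).toString("hex");
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, JSON.stringify({ token }), { mode: 0o600 });
  // The mode option only applies when the file is created, so set it
  // explicitly in case the file already existed with looser permissions.
  fs.chmodSync(file, 0o600);
  return token;
}

// Compare a presented token with the stored one in constant time.
function tokensMatch(expected, presented) {
  const a = Buffer.from(expected);
  const b = Buffer.from(String(presented));
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "reflex-demo-"));
const demoFile = path.join(demoDir, "daemon.json");
const demoToken = writeTokenFile(demoFile);
console.log("mode:", (fs.statSync(demoFile).mode & 0o777).toString(8));
console.log("match:", tokensMatch(demoToken, demoToken));
```

On Windows the POSIX mode bits are not enforced the same way, so the printed mode differs there.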

How it works

agent ──(CLI / MCP)──> bin/reflex.js ──(local HTTP)──> src/daemon.js ──(CDP)──> your Chrome
                                                            │
                                                  src/pagefns.js (injected:
                                                  distill, resolve, check, extract)

open returns a distilled snapshot: noise stripped, wrappers collapsed, repetition folded, stable content-hash refs.
act runs a whole batch of steps in one call, with expect guards checked inside the batch; event-driven settle detection between steps.
Subsequent reads return only the delta since the last look.
Failures return a forensic bundle (failed step, candidates, console/network errors, screenshot, pre-failure delta) sized for a one-turn repair.
flow save / flow run replay recorded tasks deterministically with zero model tokens and self-healing target resolution.
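The delta idea in the third point can be illustrated with a line-level comparison of two snapshots. This is a deliberately simplified sketch, not the Reflex algorithm: it ignores line order and duplicates, and the "+ " / "- " output format is an assumption.

```javascript
// Simplified sketch of a snapshot delta: report only lines that were
// added or removed since the previous look.
function lineDelta(previous, current) {
  const before = new Set(previous.split("\n"));
  const after = new Set(current.split("\n"));

  const removed = [...before].filter((line) => !after.has(line));
  const added = [...after].filter((line) => !before.has(line));

  if (removed.length === 0 && added.length === 0) return "(page unchanged)";
  return [
    ...removed.map((line) => "- " + line),
    ...added.map((line) => "+ " + line),
  ].join("\n");
}

const firstLook = 'heading "Login"\nbutton "Sign in"';
const secondLook = 'heading "Dashboard"\nbutton "Sign out"';
console.log(lineDelta(firstLook, firstLook));
console.log(lineDelta(firstLook, secondLook));
```

Keeping one baseline per client, as described under concurrent clients, would mean storing a separate previous snapshot for each caller.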

Direction

Reflex is heading toward a packaged local MCP server with credit-based access: sign up, get an API key and free starter credits, one-click install into Claude Desktop or any MCP client. Everything runs on the user's machine (their agent, their browser, their logins; page content never leaves it); the hosted credits server handles only key validation, credit balance, and payments. Each tool call costs one credit, which rewards Reflex's batched design: a whole flow is 2 calls where Playwright MCP needs about 5 (measured across the same three flows: 6 calls and ~13K tokens vs 14 calls and ~56K), so a task costs cents while saving the user most of their model-context budget. Launch pricing: 30 free credits (14-day expiry), then packs at $9/150, $19/500, $49/1,800; paid credits never expire. A cloud library of maintained replayable flows is a possible later layer.
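The "a task costs cents" claim follows from the launch pack prices. A quick worked calculation, using only the prices listed above and the stated one credit per tool call:

```javascript
// Launch packs as listed above: price in USD and number of credits.
const packs = [
  { priceUsd: 9, credits: 150 },
  { priceUsd: 19, credits: 500 },
  { priceUsd: 49, credits: 1800 },
];

// Cost of one credit, in cents.
const centsPerCredit = (pack) => (pack.priceUsd * 100) / pack.credits;

// Cost of a flow that takes `calls` tool calls (one credit each), in cents.
const flowCostCents = (pack, calls) => centsPerCredit(pack) * calls;

for (const pack of packs) {
  console.log(
    `$${pack.priceUsd}/${pack.credits}: ` +
      `${centsPerCredit(pack).toFixed(1)} cents per credit, ` +
      `${flowCostCents(pack, 2).toFixed(1)} cents for a 2-call flow`
  );
}
```

A 2-call flow therefore costs between roughly 5 and 12 cents, depending on the pack.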

Implementation plans for the MCP server, credits backend, website, and launch live in docs/plan/ (index).

How it stays small

Distillation: scripts, styling wrappers, and hidden nodes are stripped; a button nested 14 divs deep is just a button. Repeated structures fold: a 200-row table is one line with an extract handle, a 30-item list shows 3 items and a count.
Deltas: after the first read, responses show only added/removed/changed lines. An unchanged page costs 91 characters.
Stable refs: element ids are content hashes (role + name + ordinal), so they survive re-renders and let you act across turns without re-reading.
Artifacts to disk: screenshots and large extracts return file paths, never inline payloads.
Always-on telemetry: console errors and failed requests are attached to responses once, the first time they appear.
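Two of the techniques above, content-hash refs and folding of repeated structures, are easy to sketch. Neither function is Reflex's implementation. The hash algorithm (SHA-1), the 4-character ref length (inferred from examples like a3f2), the field separator, and the fold summary format are all assumptions.

```javascript
const crypto = require("node:crypto");

// Hypothetical stable ref: hash of role + name + ordinal, so the same
// element gets the same ref after a re-render.
function refFor(role, name, ordinal) {
  return crypto
    .createHash("sha1")
    .update(`${role}\u0000${name}\u0000${ordinal}`)
    .digest("hex")
    .slice(0, 4);
}

// Hypothetical fold: keep the first few items and summarize the rest.
function foldList(items, keep = 3) {
  if (items.length <= keep) return items.slice();
  return [
    ...items.slice(0, keep),
    `... (${items.length - keep} more, ${items.length} total)`,
  ];
}

console.log(refFor("button", "Sign in", 0));
const thirtyItems = Array.from({ length: 30 }, (_, i) => `item ${i + 1}`);
console.log(foldList(thirtyItems));
```

Because the ref depends only on role, name, and ordinal, it does not change when surrounding markup is re-rendered. It does change if the element's accessible name changes.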

Layout

Four zones (product at the root, cloud apps alongside, evidence and docs grouped):

PRODUCT (npm package, ships via dist/)
  bin/                 CLI client + reflex-browser dispatcher (mcp subcommand)
  src/                 daemon, pagefns, MCP server, credits client, client transport
  test/                smoke, engine, mcp suites + hostile fixtures
  mcpb/                Claude Desktop one-click bundle
  scripts/             dist build (esbuild)

CLOUD (separate package.json each; own deploy)
  server/              credits API (Cloudflare Workers + D1)
  web/                 marketing site, docs, dashboard (Next.js + Clerk)

EVIDENCE
  bench/               head-to-head vs Playwright MCP / CLI / agent-browser (npm run bench)

DOCS
  docs/REFLEX-DESIGN.md       engine design rationale
  docs/design/impeccable.md   website brand + UX principles
  docs/plan/                  product roadmap (00-OVERVIEW is the index)

npm test runs all three suites. State lives in ~/.reflex/ (daemon.json, flows/, states/, artifacts/, daemon.log).

Limits

DOM-first: canvas/WebGL apps need the screenshot escape hatch (reflex shot).
Open shadow DOM is traversed (web components are visible and actionable); closed shadow roots are unreachable from page JS and cannot be seen.
iframes appear as markers (§ iframe "checkout") but their content is not traversed yet.
Visual tier 0 only: no baseline pixel-diff yet (planned, see design doc section 6).
Flow parameterization is manual (edit the JSON); recording captures values verbatim. Use flow save --last N to record just the clean tail of a session.