@mindstudio-ai/browser-agent
v0.1.42
Browser-side agent for MindStudio dev previews — captures logs, provides DOM snapshots, and enables remote interaction.
Browser-side agent for MindStudio dev previews. Injected into app preview pages via the dev tunnel proxy. Captures browser events, provides DOM snapshots, enables remote interaction by AI agents, and supports user annotations for visual feedback.
How it works
The dev tunnel proxy injects a <script src> tag into every HTML response (default source: the ngrok dev URL; fallback: the latest build on unpkg). The script runs inside the app's preview (either in the MindStudio IDE iframe or in a standalone tab) and communicates with the tunnel via HTTP endpoints on the proxy.
AI Agent ──stdin──▶ Tunnel ──queue──▶ Proxy endpoint
                                          │
Browser agent ◀──GET /commands────────────┘
Browser agent ──POST /results──▶ Proxy ──stdout──▶ AI Agent
Browser agent ──POST /logs────▶ Proxy ──file──▶ .logs/browser.ndjson
Frontend ──postMessage──▶ Browser agent (notes mode, screenshots)
Features
Log capture (always active)
Captures browser events and POSTs them to /__mindstudio_dev__/logs, which the tunnel writes to .logs/browser.ndjson:
- Console -- overrides console.log/info/warn/error/debug, calls originals through
- JS errors -- window.addEventListener('error') with message, stack, source, line, column
- Unhandled rejections -- window.addEventListener('unhandledrejection')
- Network requests -- monkey-patches fetch and XMLHttpRequest to log all requests (method, URL, status, duration, response body for failures)
- Click interactions -- capture-phase click listener with accessible element descriptions
Log entries are batched and flushed every 2 seconds, or immediately on errors. Uses navigator.sendBeacon on page unload.
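The batching rule above can be sketched as a small buffer. This is only an illustration under stated assumptions: `LogEntry`, `LogBuffer`, and the `send` callback are hypothetical names, not the package's actual exports, and the entry fields are guesses at a plausible shape.

```typescript
// Hypothetical entry shape; the real log schema is not documented here.
interface LogEntry {
  type: "console" | "error" | "network" | "click";
  level: "log" | "info" | "warn" | "error" | "debug";
  message: string;
  timestamp: number;
}

// Hypothetical buffer: queues entries and hands them to `send` in batches.
class LogBuffer {
  private pending: LogEntry[] = [];
  private readonly send: (batch: LogEntry[]) => void;

  constructor(send: (batch: LogEntry[]) => void) {
    this.send = send;
  }

  // Queue an entry; error-level entries skip the timer and flush at once.
  add(entry: LogEntry): void {
    this.pending.push(entry);
    if (entry.level === "error") this.flush();
  }

  // Send everything queued so far as one batch, then clear the queue.
  flush(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.send(batch);
  }

  size(): number {
    return this.pending.length;
  }
}

// In the browser, `send` would POST the batch to /__mindstudio_dev__/logs,
// and a timer would drive the periodic flush, for example:
//   const buffer = new LogBuffer(postBatch);
//   setInterval(() => buffer.flush(), 2000);
```

Injecting `send` keeps the buffer independent of the transport, so the same queue could be drained by `fetch` during normal operation and by `navigator.sendBeacon` on unload.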
All monkey-patches are guarded against stacking on HMR/reload (checked via __ms_patched flags on the patched objects).
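A patch guard of this kind might look like the sketch below. The `__ms_patched` flag name comes from the text above; the helper name `patchLog`, the `Logger` type, and the choice to store the flag directly on the patched object are assumptions made for illustration.

```typescript
// Stand-in for `console`; using a plain type keeps the sketch testable.
type Logger = {
  log: (...args: unknown[]) => void;
  __ms_patched?: boolean;
};

// Hypothetical helper: wraps `target.log` once and refuses to wrap it again.
function patchLog(target: Logger, onLog: (args: unknown[]) => void): boolean {
  if (target.__ms_patched) return false; // already wrapped: do not stack
  const original = target.log;
  target.log = (...args: unknown[]): void => {
    onLog(args); // record the call for the log pipeline
    original.apply(target, args); // then call the original through
  };
  target.__ms_patched = true;
  return true;
}
```

Without the flag, every HMR cycle that re-runs the agent would add another wrapper, and each console call would be logged once per wrapper.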
DOM snapshots
Compact, token-efficient accessibility-tree-style representation of the page. Designed for AI agent consumption (~200-400 tokens for a typical page).
navigation "Generate Collection" [ref=e1]
button "Generate" [ref=e2]
button "Collection" [ref=e3]
textbox [value=""] [placeholder="enter a topic..."] [ref=e4]
button "Generate" [disabled] [ref=e5]
paragraph "5 · 7 · 5"
Key design decisions:
- Semantic roles and accessible names, not CSS classes -- handles styled-components/CSS-in-JS apps where class names are generated hashes
- Transparent element collapsing -- generic <div>/<span> wrappers without roles disappear from the tree, children float up to the nearest semantic ancestor
- Cursor-interactive detection -- elements with cursor: pointer or onclick are included even if they're generic divs
- Block/inline spacing -- text from block-level children gets spaces between them (fixes concatenated text from nested components)
- Network idle wait -- takeSnapshot() waits for all fetch/XHR requests to settle (200ms quiet period, 5s max) before walking the DOM
- Stable refs -- interactive elements get [ref=eN] identifiers for command targeting
- Form state -- shows [value="..."], [placeholder="..."], [disabled], [checked], [open]
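To make the collapsing and ref-assignment rules concrete, here is a minimal sketch that renders a plain object tree instead of a real DOM. `SketchNode` and `renderSnapshot` are hypothetical names, and the two-space indentation per nesting level is an assumption; the actual walker presumably derives roles, names, and attributes from live elements.

```typescript
// Hypothetical, simplified stand-in for a DOM element.
interface SketchNode {
  role?: string; // semantic role; undefined for generic wrappers
  name?: string; // accessible name
  interactive?: boolean; // interactive nodes receive a [ref=eN]
  attrs?: string[]; // form state, e.g. 'disabled' or 'value=""'
  children?: SketchNode[];
}

function renderSnapshot(root: SketchNode): string {
  const lines: string[] = [];
  let nextRef = 1;

  const visit = (node: SketchNode, depth: number): void => {
    let childDepth = depth;
    if (node.role !== undefined) {
      const parts: string[] = [node.role];
      if (node.name !== undefined) parts.push(JSON.stringify(node.name));
      for (const attr of node.attrs ?? []) parts.push(`[${attr}]`);
      if (node.interactive) parts.push(`[ref=e${nextRef++}]`);
      lines.push("  ".repeat(depth) + parts.join(" "));
      childDepth = depth + 1;
    }
    // A role-less wrapper emits no line, so its children surface at the
    // wrapper's own depth, under the nearest semantic ancestor.
    for (const child of node.children ?? []) visit(child, childDepth);
  };

  visit(root, 0);
  return lines.join("\n");
}
```

Given a wrapper containing a `navigation` node whose button sits inside another role-less wrapper, the button is rendered as a direct child of the navigation line, and refs are numbered in document order.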
Command channel (iframe mode only)
When the page URL contains ?mode=iframe, the agent polls GET /__mindstudio_dev__/commands every 100ms for commands from the AI agent. This ensures only the preview iframe in the MindStudio IDE responds to commands, not standalone browser tabs.
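The iframe-mode gate could be sketched as follows. The 100ms interval and the `mode=iframe` parameter come from the text above; `isIframeMode`, `startPolling`, and the `pollOnce` callback are hypothetical names, and checking the parsed query parameter (rather than a substring match) is an assumption.

```typescript
const POLL_INTERVAL_MS = 100; // polling interval stated in the README

// True only when the query string carries mode=iframe.
function isIframeMode(href: string): boolean {
  return new URL(href).searchParams.get("mode") === "iframe";
}

// Hypothetical starter: returns a stop function, or null when the page is
// not the IDE preview iframe and therefore must not consume commands.
function startPolling(href: string, pollOnce: () => void): (() => void) | null {
  if (!isIframeMode(href)) return null;
  const timer = setInterval(pollOnce, POLL_INTERVAL_MS);
  return () => clearInterval(timer);
}
```

In the browser, `pollOnce` would issue the GET to /__mindstudio_dev__/commands and hand any returned command to the executor.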
The AI agent sends commands via the tunnel's stdin:
{"action": "browser", "steps": [{"command": "click", "text": "Generate"}]}
The result comes back on the tunnel's stdout with a snapshot, logs captured during execution, and step results:
{"event": "browser-completed", "steps": [...], "snapshot": "...", "logs": [...], "duration": 250}
Commands execute sequentially with a visible animated cursor. Execution stops on first error.
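The stop-on-first-error behavior can be illustrated with a short synchronous sketch. The `command`, `text`, and `ref` fields mirror the JSON and targeting fields shown in this README; `StepResult`, `runSteps`, and the `runOne` callback are hypothetical, and the real executor is presumably asynchronous.

```typescript
interface Step {
  command: string;
  text?: string;
  ref?: string;
}

// Hypothetical per-step result shape.
interface StepResult {
  command: string;
  ok: boolean;
  error?: string;
}

// Runs steps in order; the first failure is recorded and ends the run.
function runSteps(steps: Step[], runOne: (step: Step) => void): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    try {
      runOne(step);
      results.push({ command: step.command, ok: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ command: step.command, ok: false, error: message });
      break; // later steps are not attempted
    }
  }
  return results;
}
```

Stopping early means the returned list ends at the failing step, so the caller can see exactly how far the sequence got before deciding what to send next.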
Available commands:
| Command | Description |
|---------|-------------|
| snapshot | Returns the compact DOM accessibility tree |
| click | Clicks an element (full pointer/mouse/click event sequence for React/Vue/Svelte) |
| type | Types text into an input/textarea (character-by-character with native value setter for React) |
| select | Selects an option from a <select> element |
| wait | Waits for an element to appear in the DOM (polls with timeout) |
| evaluate | Runs arbitrary JavaScript and returns the result (auto-wraps with return, handles async) |
Element targeting (for click, type, select, wait):
| Field | Example | Description |
|-------|---------|-------------|
| ref | "e5" | Ref from the last snapshot (most reliable) |
| text | "Create Board" | Match by accessible name or visible text |
| role + text | "button" + "Submit" | Match by ARIA role and name |
| label | "Board name" | Find input by its associated label text |
| selector | "#my-id" | CSS selector fallback |
Error messages include what IS on the page so the agent can self-correct (e.g., No button "Submit" found. Visible buttons: "Generate", "Collection").
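A sketch of how targeting and self-correcting errors could fit together, covering only the `ref`, `role`, and `text` fields. `PageElement`, `Target`, and `resolveTarget` are hypothetical names, and the element list stands in for whatever the last snapshot recorded.

```typescript
// Hypothetical record of an element captured by the last snapshot.
interface PageElement {
  ref: string;
  role: string;
  name: string;
}

interface Target {
  ref?: string;
  role?: string;
  text?: string;
}

function resolveTarget(elements: PageElement[], target: Target): PageElement {
  // A ref is the most specific handle, so it is checked first.
  if (target.ref !== undefined) {
    const byRef = elements.find((el) => el.ref === target.ref);
    if (byRef) return byRef;
    throw new Error(`No element with ref "${target.ref}" found.`);
  }
  const match = elements.find(
    (el) =>
      (target.role === undefined || el.role === target.role) &&
      (target.text === undefined || el.name === target.text),
  );
  if (match) return match;
  // List what is on the page so the caller can pick a valid target.
  const role = target.role ?? "element";
  const visible = elements
    .filter((el) => target.role === undefined || el.role === target.role)
    .map((el) => `"${el.name}"`)
    .join(", ");
  throw new Error(`No ${role} "${target.text ?? ""}" found. Visible ${role}s: ${visible}`);
}
```

Looking up a button named "Submit" on a page that only has "Generate" and "Collection" buttons produces the message quoted above, which gives the agent enough context to retry with a valid name.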
Screenshots
Screenshots are captured via SnapDOM (@zumer/snapdom) and can be triggered two ways:
- Via tunnel stdin -- {"action": "screenshot"} captures the viewport, uploads to S3 via the platform, and returns a CDN URL.
- Via postMessage -- the frontend sends notes-screenshot to capture with annotations (see Notes below).
Visible cursor
A Figma-style animated cursor (#DD2590 pink with "Remy" name tag) shows the AI agent's actions in real time:
- Appears from a random viewport edge on first action
- Glides smoothly to target elements (450ms ease)
- Click animation with ripple effect
- Fades out after 1.5s of inactivity
- Reappears at last known position for subsequent actions
- Only renders in iframe mode (?mode=iframe)
Annotation notes (postMessage API)
Users can add ephemeral visual annotations to the preview for AI feedback. Controlled by the frontend via postMessage.
Frontend → iframe messages (channel: 'mindstudio-browser-agent'):
| Command | Purpose |
|---------|---------|
| notes-enter | Enter notes mode (overlay, custom cursor, click/drag to annotate) |
| notes-exit | Exit notes mode, remove all notes |
| notes-screenshot | Capture screenshot including annotations, return base64 |
| notes-cursor-hide | Hide the notes cursor (call when mouse leaves iframe) |
Iframe → frontend responses:
| Command | Payload | Purpose |
|---------|---------|---------|
| screenshot-result | { image: string } or { error: string } | Base64 PNG screenshot |
Notes are pink (#DD2590) rounded bubbles with inline-editable text. Pin notes (click) have a dot at the click point. Area notes (drag) have a dashed border around the selected region. Notes support select → edit → move → delete lifecycle with Enter to confirm, Escape to cancel, and a × delete button.
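On the receiving side, incoming messages would need to be filtered before acting on them. The sketch below assumes an envelope of the form `{ channel, command }`; the channel string and command names come from the tables above, but the exact field names and the `isNotesMessage` helper are assumptions.

```typescript
const CHANNEL = "mindstudio-browser-agent";

const NOTES_COMMANDS = [
  "notes-enter",
  "notes-exit",
  "notes-screenshot",
  "notes-cursor-hide",
] as const;

type NotesCommand = (typeof NOTES_COMMANDS)[number];

// Assumed envelope shape; the real message fields may differ.
interface NotesMessage {
  channel: typeof CHANNEL;
  command: NotesCommand;
}

// Type guard: accept only well-formed messages on the agent's channel.
function isNotesMessage(data: unknown): data is NotesMessage {
  if (typeof data !== "object" || data === null) return false;
  const record = data as Record<string, unknown>;
  const command = record["command"];
  return (
    record["channel"] === CHANNEL &&
    typeof command === "string" &&
    (NOTES_COMMANDS as readonly string[]).indexOf(command) !== -1
  );
}

// Browser-only usage, assuming `event` is a MessageEvent:
//   window.addEventListener("message", (event) => {
//     if (isNotesMessage(event.data)) { /* dispatch on event.data.command */ }
//   });
```

Filtering on the channel first matters because a preview page can receive unrelated `message` events from the app itself or from other embedded frames.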
Development
npm install
npm run build # build dist/index.js (single IIFE, minified)
npm run dev # watch mode + local HTTP server on port 8787
npm run serve # serve dist/ on port 8787 (no watch)
The dev tunnel proxy defaults to loading the script from https://seankoji-msba.ngrok.io/index.js. Point ngrok at port 8787 to serve your local dev build to remote sandboxes. Falls back to https://unpkg.com/@mindstudio-ai/browser-agent/dist/index.js when no URL is configured.
Architecture
src/
  index.ts -- entry point, idempotency guard, init all modules
  transport.ts -- log entry buffer, batched POST to proxy, capture mode
  network-idle.ts -- tracks in-flight requests for snapshot idle wait
  utils.ts -- serialization, element description, sleep
  capture/
    console.ts -- console.* override (patch-guarded)
    errors.ts -- error + unhandledrejection listeners
    network.ts -- fetch monkey-patch (patch-guarded, logs + idle tracking)
    xhr.ts -- XMLHttpRequest monkey-patch (patch-guarded, logs + idle tracking)
    interactions.ts -- click listener (patch-guarded)
  snapshot/
    walker.ts -- DOM walker, takeSnapshot(), describeTarget()
    roles.ts -- implicit ARIA role mapping, cursor-interactive detection
    name.ts -- accessible name computation
  commands/
    poller.ts -- polls proxy for commands (iframe mode only)
    executor.ts -- dispatches steps, captures logs, appends snapshot
    actions.ts -- click, type, select, wait, evaluate implementations
    resolve.ts -- element resolution (ref, text, role, label, selector)
    screenshot.ts -- SnapDOM viewport capture
  cursor/
    cursor.ts -- animated Figma-style cursor with ripple
  notes/
    constants.ts -- shared color constant
    messages.ts -- postMessage handler (idempotent listener)
    notes-mode.ts -- enter/exit lifecycle, screenshot orchestration
    note-layer.ts -- overlay, pointer events, state machine
    note-element.ts -- DOM creation for pin and area notes
Proxy endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| /__mindstudio_dev__/logs | POST | Receive browser log entries |
| /__mindstudio_dev__/commands | GET | Poll for pending commands |
| /__mindstudio_dev__/results | POST | Return command execution results |
