@mindstudio-ai/browser-agent

v0.1.66

Published

21 days ago

Browser-side agent for MindStudio dev previews — captures logs, provides DOM snapshots, and enables remote interaction.

Downloads

816


markof94

sthielen


Browser-side agent for MindStudio dev previews. Injected into app preview pages via the dev tunnel proxy. Captures browser events, provides DOM snapshots, enables remote interaction by AI agents, and supports user annotations for visual feedback.

How it works

The dev tunnel proxy injects a <script src> tag for this package into every HTML response (by default pointing at an ngrok dev URL, with the latest build on unpkg as the fallback). The script runs inside the app's preview, either in the MindStudio IDE iframe or in a standalone tab, and communicates with the tunnel through HTTP endpoints on the proxy.

AI Agent ──stdin──▶ Tunnel ──queue──▶ Proxy endpoint
                                          │
Browser agent ◀──GET /commands────────────┘
Browser agent ──POST /results──▶ Proxy ──stdout──▶ AI Agent
Browser agent ──POST /logs────▶ Proxy ──file──▶ .logs/browser.ndjson

Frontend ──postMessage──▶ Browser agent (notes mode, screenshots)
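
The injection step described above could be sketched roughly as follows. This is an illustration only, not the tunnel's actual code: `injectAgentScript` and `FALLBACK_SRC` are hypothetical names, and the choice to place the tag right after the opening `<head>` is an assumption (so the patches would install before app code runs).

```typescript
// Hypothetical helper, not the tunnel's real implementation.
// The unpkg URL is the fallback named in the Development section.
const FALLBACK_SRC =
  "https://unpkg.com/@mindstudio-ai/browser-agent/dist/index.js";

function injectAgentScript(html: string, src: string = FALLBACK_SRC): string {
  // Assumes `src` contains no double quotes.
  const tag = `<script src="${src}"></script>`;

  // Do not inject twice if the response already carries the tag.
  if (html.includes(tag)) return html;

  // Match <head> or <head ...attrs>, but not <header>.
  const headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(html)) {
    // A replacer function avoids `$` in the URL being read as a pattern.
    return html.replace(headOpen, (match) => match + tag);
  }

  // No <head> found: prepend so the script still loads first.
  return tag + html;
}
```

Usage: `injectAgentScript(responseBody, configuredUrl)` would be called only for responses whose content type is HTML.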

Features

Log capture (always active)

Captures browser events and POSTs them to /__mindstudio_dev__/logs, which the tunnel writes to .logs/browser.ndjson:

Console -- overrides console.log/info/warn/error/debug, then calls through to the originals
JS errors -- window.addEventListener('error') with message, stack, source, line, column
Unhandled rejections -- window.addEventListener('unhandledrejection')
Network requests -- monkey-patches fetch and XMLHttpRequest to log all requests (method, URL, status, duration, response body for failures)
Click interactions -- capture-phase click listener with accessible element descriptions

Log entries are batched and flushed every 2 seconds, or immediately on errors. On page unload, the agent uses navigator.sendBeacon.
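
The batching behavior can be illustrated with a small buffer. This is a minimal sketch under stated assumptions, not the code in transport.ts: the `LogBuffer` class, the `LogEntry` shape, and the injected `send` callback are all hypothetical. In the browser, `send` would presumably POST to /__mindstudio_dev__/logs.

```typescript
type LogLevel = "log" | "info" | "warn" | "error" | "debug";

// Hypothetical entry shape; the real log format is not documented here.
interface LogEntry {
  level: LogLevel;
  message: string;
  timestamp: number;
}

type Sender = (batch: LogEntry[]) => void;

class LogBuffer {
  private entries: LogEntry[] = [];
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly send: Sender;
  private readonly flushIntervalMs: number;

  constructor(send: Sender, flushIntervalMs = 2000) {
    this.send = send;
    this.flushIntervalMs = flushIntervalMs;
  }

  // Start the periodic flush (every 2 seconds by default).
  start(): void {
    if (this.timer === undefined) {
      this.timer = setInterval(() => this.flush(), this.flushIntervalMs);
    }
  }

  // Stop the timer and send whatever is still buffered.
  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
    this.flush();
  }

  // Buffer an entry; errors trigger an immediate flush.
  push(entry: LogEntry): void {
    this.entries.push(entry);
    if (entry.level === "error") this.flush();
  }

  // Send the current batch, if any, and reset the buffer.
  flush(): void {
    if (this.entries.length === 0) return;
    const batch = this.entries;
    this.entries = [];
    this.send(batch);
  }
}
```

Injecting `send` keeps the buffer independent of the transport, so the same logic could be paired with fetch during normal operation and with navigator.sendBeacon on unload.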

All monkey-patches are guarded against stacking on HMR/reload (checked via __ms_patched flags on the patched objects).
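
A patch guard of this kind might look like the sketch below. Only the `__ms_patched` flag name comes from the text above; `patchConsoleLog`, `PatchableConsole`, and the `onLog` callback are hypothetical names chosen for illustration, and only `log` is patched to keep the example short.

```typescript
type ConsoleMethod = (...args: unknown[]) => void;

// Minimal console-like shape so the sketch can run outside a browser.
interface PatchableConsole {
  log: ConsoleMethod;
  __ms_patched?: boolean;
}

// Wraps target.log once. Returns false if the target was already patched,
// which is what prevents wrappers from stacking on HMR or reload.
function patchConsoleLog(
  target: PatchableConsole,
  onLog: (args: unknown[]) => void,
): boolean {
  if (target.__ms_patched) return false;

  const original = target.log;
  target.log = (...args: unknown[]) => {
    onLog(args);                  // capture for the log transport
    original.apply(target, args); // then call through to the original
  };
  target.__ms_patched = true;
  return true;
}
```

Without the flag, every re-execution of the script would wrap the previous wrapper, and each console call would be captured once per reload.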

DOM snapshots

Compact, token-efficient accessibility-tree-style representation of the page. Designed for AI agent consumption (~200-400 tokens for a typical page).

navigation "Generate Collection" [ref=e1]
  button "Generate" [ref=e2]
  button "Collection" [ref=e3]
textbox [value=""] [placeholder="enter a topic..."] [ref=e4]
button "Generate" [disabled] [ref=e5]
paragraph "5 · 7 · 5"
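
Output in this shape could be produced by a walker along the following lines. This is a simplified sketch over a plain data structure rather than the real DOM, and it is not the code in walker.ts: `SnapNode` and `renderSnapshot` are hypothetical names. Nodes without a role stand in for generic wrappers, which emit no line, so their children appear under the nearest ancestor that has a role.

```typescript
// Hypothetical, simplified stand-in for a DOM element.
interface SnapNode {
  role?: string;         // undefined for generic wrappers (<div>/<span>)
  name?: string;         // accessible name, if any
  attrs?: string[];      // e.g. ['disabled'] or ['value=""']
  interactive?: boolean; // interactive nodes receive a [ref=eN] marker
  children?: SnapNode[];
}

function renderSnapshot(root: SnapNode): string {
  const lines: string[] = [];
  let nextRef = 1;

  const walk = (node: SnapNode, depth: number): void => {
    let childDepth = depth;

    if (node.role !== undefined) {
      let line = "  ".repeat(depth) + node.role;
      if (node.name !== undefined) line += ` "${node.name}"`;
      for (const attr of node.attrs ?? []) line += ` [${attr}]`;
      if (node.interactive) line += ` [ref=e${nextRef++}]`;
      lines.push(line);
      childDepth = depth + 1;
    }
    // Role-less wrappers fall through here without emitting a line,
    // so their children keep the wrapper's depth.
    for (const child of node.children ?? []) walk(child, childDepth);
  };

  walk(root, 0);
  return lines.join("\n");
}
```

Refs are assigned in document order during the walk, so a given ref is only meaningful against the snapshot it came from.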

Key design decisions:

Semantic roles and accessible names, not CSS classes -- handles styled-components/CSS-in-JS apps where class names are generated hashes
Transparent element collapsing -- generic <div>/<span> wrappers without roles disappear from the tree, children float up to the nearest semantic ancestor
Cursor-interactive detection -- elements with cursor: pointer or onclick are included even if they're generic divs
Block/inline spacing -- text from block-level children gets spaces between them (fixes concatenated text from nested components)
Network idle wait -- takeSnapshot() waits for all fetch/XHR requests to settle (200ms quiet period, 5s max) before walking the DOM
Stable refs -- interactive elements get [ref=eN] identifiers for command targeting
Form state -- shows [value="..."], [placeholder="..."], [disabled], [checked], [open]
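
The network idle wait from the list above can be sketched as a small in-flight counter. This is an assumption-laden illustration, not the code in network-idle.ts: `NetworkIdleTracker`, its method names, and the polling interval are hypothetical. Only the 200ms quiet period and the 5s cap come from the text. The clock is injectable so the idle rule can be exercised without real timers.

```typescript
class NetworkIdleTracker {
  private inFlight = 0;
  private lastActivity: number;
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
    this.lastActivity = now();
  }

  // Called by the fetch/XHR patches when a request starts.
  requestStarted(): void {
    this.inFlight++;
    this.lastActivity = this.now();
  }

  // Called when a request settles (success or failure).
  requestFinished(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
    this.lastActivity = this.now();
  }

  // Idle means: nothing in flight and no activity for `quietMs`.
  isIdle(quietMs = 200): boolean {
    return this.inFlight === 0 && this.now() - this.lastActivity >= quietMs;
  }

  // Resolves true once idle, or false when `maxWaitMs` elapses first.
  async waitForIdle(quietMs = 200, maxWaitMs = 5000, pollMs = 25): Promise<boolean> {
    const deadline = this.now() + maxWaitMs;
    while (this.now() < deadline) {
      if (this.isIdle(quietMs)) return true;
      await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
    }
    return false;
  }
}
```

A snapshot routine would presumably await `waitForIdle()` and then walk the DOM regardless of the result, so a page with a permanently open request still yields a snapshot after the cap.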

Command channel (iframe mode only)

When the page URL contains ?mode=iframe, the agent polls GET /__mindstudio_dev__/commands every 100ms for commands from the AI agent. This ensures only the preview iframe in the MindStudio IDE responds to commands, not standalone browser tabs.
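
The iframe-mode gate and the polling loop might be wired together as in the sketch below. It is illustrative only: `isIframeMode`, `startPolling`, and the injected `poll` callback are hypothetical names, and chaining setTimeout after each poll (rather than using setInterval) is an assumption made so that slow polls never overlap.

```typescript
const POLL_INTERVAL_MS = 100;

// True only when the page URL carries ?mode=iframe.
function isIframeMode(href: string): boolean {
  return new URL(href).searchParams.get("mode") === "iframe";
}

// Starts polling if the page is in iframe mode.
// Returns a stop function, or null when polling is not enabled.
function startPolling(
  href: string,
  poll: () => Promise<void>,
): (() => void) | null {
  if (!isIframeMode(href)) return null;

  let stopped = false;
  const tick = async (): Promise<void> => {
    if (stopped) return;
    try {
      await poll();
    } catch {
      // Ignore transient failures; the next tick retries.
    }
    if (!stopped) setTimeout(tick, POLL_INTERVAL_MS);
  };

  void tick();
  return () => {
    stopped = true;
  };
}
```

In the browser, `poll` would presumably fetch /__mindstudio_dev__/commands and hand any returned commands to the executor, with `href` taken from `window.location.href`.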

The AI agent sends commands via the tunnel's stdin:

{"action": "browser", "steps": [{"command": "click", "text": "Generate"}]}

The result comes back on the tunnel's stdout with a snapshot, logs captured during execution, and step results:

{"event": "browser-completed", "steps": [...], "snapshot": "...", "logs": [...], "duration": 250}

Commands execute sequentially with a visible animated cursor. Execution stops on first error.
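
Sequential execution that stops on the first error could be sketched like this. The `Step` shape follows the stdin example above, but `StepResult`, `StepHandler`, and `runSteps` are hypothetical: the real per-step result format is elided in the stdout example and is not reconstructed here.

```typescript
// A step as sent on stdin: a command name plus targeting fields.
interface Step {
  command: string;
  [key: string]: unknown;
}

// Hypothetical per-step result shape, for illustration only.
interface StepResult {
  command: string;
  ok: boolean;
  result?: unknown;
  error?: string;
}

type StepHandler = (step: Step) => Promise<unknown> | unknown;

async function runSteps(
  steps: Step[],
  handlers: Record<string, StepHandler>,
): Promise<StepResult[]> {
  const results: StepResult[] = [];

  for (const step of steps) {
    const handler = handlers[step.command];
    try {
      if (!handler) throw new Error(`Unknown command: ${step.command}`);
      // Await each step so the next one sees its effects.
      const result = await handler(step);
      results.push({ command: step.command, ok: true, result });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ command: step.command, ok: false, error: message });
      break; // stop on first error; later steps are not attempted
    }
  }
  return results;
}
```

Recording the failed step before breaking means the caller can see exactly which step failed and why, while the absence of later entries shows they never ran.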

Available commands:

| Command | Description |
|---------|-------------|
| snapshot | Returns the compact DOM accessibility tree |
| click | Clicks an element (full pointer/mouse/click event sequence for React/Vue/Svelte) |
| type | Types text into an input/textarea (character-by-character with native value setter for React) |
| select | Selects an option from a <select> element |
| wait | Waits for an element to appear in the DOM (polls with timeout) |
| evaluate | Runs arbitrary JavaScript and returns the result (auto-wraps with return, handles async) |

Element targeting (for click, type, select, wait):

| Field | Example | Description |
|-------|---------|-------------|
| ref | "e5" | Ref from the last snapshot (most reliable) |
| text | "Create Board" | Match by accessible name or visible text |
| role + text | "button" + "Submit" | Match by ARIA role and name |
| label | "Board name" | Find input by its associated label text |
| selector | "#my-id" | CSS selector fallback |

Error messages include what IS on the page so the agent can self-correct (e.g., No button "Submit" found. Visible buttons: "Generate", "Collection").
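
Target resolution with a self-correcting error message could be sketched as below. This is a simplified illustration over an in-memory list, not the code in resolve.ts: `Target`, `PageElement`, `resolveTarget`, and `describeMiss` are hypothetical, the matching order is an assumption, and the selector field is omitted because it needs a real DOM.

```typescript
interface Target {
  ref?: string;
  text?: string;
  role?: string;
  label?: string;
}

// Hypothetical flattened view of an element from the last snapshot.
interface PageElement {
  ref: string;
  role: string;
  name: string;
  label?: string;
}

// Builds a message that lists what is actually on the page.
function describeMiss(target: Target, elements: PageElement[]): string {
  const role = target.role ?? "element";
  const wanted = target.ref ?? target.label ?? target.text ?? "";
  const visible = elements
    .filter((el) => target.role === undefined || el.role === target.role)
    .map((el) => `"${el.name}"`);
  const hint = visible.length > 0 ? ` Visible ${role}s: ${visible.join(", ")}` : "";
  return `No ${role} "${wanted}" found.${hint}`;
}

function resolveTarget(target: Target, elements: PageElement[]): PageElement {
  let match: PageElement | undefined;

  if (target.ref !== undefined) {
    match = elements.find((el) => el.ref === target.ref);
  } else if (target.label !== undefined) {
    match = elements.find((el) => el.label === target.label);
  } else if (target.text !== undefined) {
    // With a role, both role and name must match; otherwise name alone.
    match = elements.find(
      (el) =>
        el.name === target.text &&
        (target.role === undefined || el.role === target.role),
    );
  }

  if (match) return match;
  throw new Error(describeMiss(target, elements));
}
```

Listing the visible candidates in the error gives the calling agent enough context to retry with a valid target without requesting a fresh snapshot first.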

Screenshots

Screenshots are captured via SnapDOM (@zumer/snapdom) and can be triggered two ways:

Via tunnel stdin -- {"action": "screenshot"} captures the viewport, uploads to S3 via the platform, and returns a CDN URL.
Via postMessage -- the frontend sends notes-screenshot to capture with annotations (see Notes below).

Visible cursor

A Figma-style animated cursor (#DD2590 pink with "Remy" name tag) shows the AI agent's actions in real time:

Appears from a random viewport edge on first action
Glides smoothly to target elements (450ms ease)
Click animation with ripple effect
Fades out after 1.5s of inactivity
Reappears at last known position for subsequent actions
Only renders in iframe mode (?mode=iframe)

Annotation notes (postMessage API)

Users can add ephemeral visual annotations to the preview for AI feedback. Controlled by the frontend via postMessage.

Frontend → iframe messages (channel: 'mindstudio-browser-agent'):

| Command | Purpose |
|---------|---------|
| notes-enter | Enter notes mode (overlay, custom cursor, click/drag to annotate) |
| notes-exit | Exit notes mode, remove all notes |
| notes-screenshot | Capture screenshot including annotations, return base64 |
| notes-cursor-hide | Hide the notes cursor (call when mouse leaves iframe) |

Iframe → frontend responses:

| Command | Payload | Purpose |
|---------|---------|---------|
| screenshot-result | { image: string } or { error: string } | Base64 PNG screenshot |

Notes are pink (#DD2590) rounded bubbles with inline-editable text. Pin notes (click) have a dot at the click point. Area notes (drag) have a dashed border around the selected region. Notes support select → edit → move → delete lifecycle with Enter to confirm, Escape to cancel, and a × delete button.
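
Validating incoming notes messages might look like the sketch below. The channel name and the four command names come from the table above, but the message envelope is an assumption: the field names `channel` and `command`, as well as `parseNotesMessage` itself, are hypothetical and may not match the real wire format.

```typescript
const CHANNEL = "mindstudio-browser-agent";

const NOTES_COMMANDS = [
  "notes-enter",
  "notes-exit",
  "notes-screenshot",
  "notes-cursor-hide",
] as const;

type NotesCommand = (typeof NOTES_COMMANDS)[number];

// Assumed envelope; the real field names are not documented here.
interface NotesMessage {
  channel: typeof CHANNEL;
  command: NotesCommand;
}

// Returns a typed message, or null for anything not addressed to the agent.
function parseNotesMessage(data: unknown): NotesMessage | null {
  if (typeof data !== "object" || data === null) return null;

  const msg = data as Record<string, unknown>;
  if (msg.channel !== CHANNEL) return null;

  const command = NOTES_COMMANDS.find((c) => c === msg.command);
  return command ? { channel: CHANNEL, command } : null;
}
```

A window `message` listener would pass `event.data` through this check first, since postMessage traffic from other scripts on the page arrives on the same listener.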

Development

npm install
npm run build    # build dist/index.js (single IIFE, minified)
npm run dev      # watch mode + local HTTP server on port 8787
npm run serve    # serve dist/ on port 8787 (no watch)

The dev tunnel proxy defaults to loading the script from https://seankoji-msba.ngrok.io/index.js. Point ngrok at port 8787 to serve your local dev build to remote sandboxes. When no URL is configured, the proxy falls back to https://unpkg.com/@mindstudio-ai/browser-agent/dist/index.js.

Architecture

src/
  index.ts              -- entry point, idempotency guard, init all modules
  transport.ts          -- log entry buffer, batched POST to proxy, capture mode
  network-idle.ts       -- tracks in-flight requests for snapshot idle wait
  utils.ts              -- serialization, element description, sleep
  capture/
    console.ts          -- console.* override (patch-guarded)
    errors.ts           -- error + unhandledrejection listeners
    network.ts          -- fetch monkey-patch (patch-guarded, logs + idle tracking)
    xhr.ts              -- XMLHttpRequest monkey-patch (patch-guarded, logs + idle tracking)
    interactions.ts     -- click listener (patch-guarded)
  snapshot/
    walker.ts           -- DOM walker, takeSnapshot(), describeTarget()
    roles.ts            -- implicit ARIA role mapping, cursor-interactive detection
    name.ts             -- accessible name computation
  commands/
    poller.ts           -- polls proxy for commands (iframe mode only)
    executor.ts         -- dispatches steps, captures logs, appends snapshot
    actions.ts          -- click, type, select, wait, evaluate implementations
    resolve.ts          -- element resolution (ref, text, role, label, selector)
    screenshot.ts       -- SnapDOM viewport capture
  cursor/
    cursor.ts           -- animated Figma-style cursor with ripple
  notes/
    constants.ts        -- shared color constant
    messages.ts         -- postMessage handler (idempotent listener)
    notes-mode.ts       -- enter/exit lifecycle, screenshot orchestration
    note-layer.ts       -- overlay, pointer events, state machine
    note-element.ts     -- DOM creation for pin and area notes

Proxy endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /__mindstudio_dev__/logs | POST | Receive browser log entries |
| /__mindstudio_dev__/commands | GET | Poll for pending commands |
| /__mindstudio_dev__/results | POST | Return command execution results |
