browsepilot

v0.1.0

Published

a month ago

MCP server for AI browser automation. Semantic page snapshots, annotated screenshots, and full browser control — built for Claude, GPT, and any MCP client.

0High
0Medium
0Low

lobsterman-lobster

mcp browser automation ai claude playwright cdp accessibility screenshot web-agent

🧭 BrowsePilot

MCP server for AI browser automation. Semantic page snapshots, annotated screenshots, and full browser control — built for Claude, GPT, and any MCP client.

BrowsePilot gives AI models a deep understanding of web pages through three-tree merge snapshots — combining accessibility trees, DOM structure, and runtime state into a single, AI-readable page model. No other browser MCP server does this.

Quick Start

npx browsepilot

That's it. Add it to Claude Desktop:

{
  "mcpServers": {
    "browsepilot": {
      "command": "npx",
      "args": ["browsepilot"]
    }
  }
}

Connect to Your Existing Chrome

Keep your logins, cookies, and extensions. Launch Chrome with remote debugging:

# macOS
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222

# Linux
google-chrome --remote-debugging-port=9222

# Windows
chrome.exe --remote-debugging-port=9222

Then configure:

{
  "mcpServers": {
    "browsepilot": {
      "command": "npx",
      "args": ["browsepilot", "--connect-existing", "--cdp-port", "9222"]
    }
  }
}

What Makes This Different

Three-Tree Merge Snapshots

Every other browser MCP server gives you either a raw accessibility tree or a screenshot. BrowsePilot merges three data sources via CDP into a unified page model:

Accessibility tree — semantic roles, names, values
DOM snapshot — visibility, bounding boxes, paint order, computed styles
Runtime evaluation — scroll position, shadow DOM, scrollable containers, page type

The result is a structured, categorized view of the page:

Page: Google Search [search] | https://www.google.com/
Scroll: 0.0 pages above | 3.2 pages below
Stats: 47 links | 12 interactive | 3 inputs | 340 total
[Start of page]

── INPUTS ──
  e1  textbox  "Search"  = ""
* e5  option   "Suggested: weather"

── ACTIONS ──
  e2  button   "Google Search"
  e3  button   "I'm Feeling Lucky"

── LINKS ──
  e10  link  "Gmail"
  e11  link  "Images"

── SCROLL CONTAINERS ──
  sc1  div#results — 4200px tall, at top, 3400px remaining
[End of page]

* = element appeared since last snapshot (change detection)
Elements organized by category (INPUTS, ACTIONS, LINKS, etc.)
Scroll position awareness (pages above/below)
Page type detection (auth, form, search, data-table, article, dashboard)
Shadow DOM and scrollable container awareness

Annotated Screenshots

Visual context with numbered bounding boxes on interactive elements, color-coded by type:

🔵 Blue = inputs
🟢 Green = actions/buttons
🟠 Orange = links

Resilient Actions

Click, type, and other actions use multi-strategy fallback chains:

Playwright click → scroll into view → partial name match → CDP click
If one strategy fails, the next one tries automatically

Tools (22)

| Tool | Description | |------|-------------| | browser_snapshot | Semantic page snapshot with element refs | | browser_screenshot | Annotated screenshot with bounding boxes | | browser_navigate | Navigate to URL | | browser_click | Click element by ref | | browser_type | Type text into element | | browser_press | Press keyboard key/shortcut | | browser_scroll | Scroll page or element into view | | browser_select | Select dropdown option | | browser_hover | Hover over element | | browser_fill_form | Fill multiple form fields | | browser_evaluate | Run JavaScript on page | | browser_extract | Extract structured data (tables, lists, articles, links) | | browser_read_text | Read text content | | browser_force_click | JavaScript click (for covered/shadow DOM elements) | | browser_wait | Wait for conditions | | browser_tab_new | Open new tab | | browser_tab_switch | Switch tabs | | browser_tab_list | List open tabs | | browser_tab_close | Close tab | | browser_go_back | Navigate back | | browser_go_forward | Navigate forward | | browser_scroll_container | Scroll embedded containers |

Options

npx browsepilot [options]

--headless            Run Chrome in headless mode
--connect-existing    Connect to already-running Chrome
--cdp-port <port>     CDP port (default: random)
--user-data-dir <dir> Chrome profile directory
-v, --version         Show version
-h, --help            Show help

Requirements

Node.js 18+
Chrome, Brave, Edge, or Chromium installed (auto-detected)
Optional: canvas npm package for annotated screenshots

How It Works

MCP Client (Claude Desktop, Cursor, etc.)
  → stdio transport
    → BrowsePilot MCP Server
      → Chrome via Playwright + raw CDP
        → Three-tree merge snapshot
        → Annotated screenshots
        → All browser actions

BrowsePilot doesn't run its own AI — it's a pure tool layer. Your MCP client's model makes the decisions, BrowsePilot executes them.

License

MIT