@hanzili/chrome-browser-agent

v0.1.0

Published

3 months ago

hanzili

chrome-extension browser-automation cdp devtools-protocol ai-agent accessibility-tree web-automation

Chrome Browser Agent

Browser automation toolkit for Chrome extensions using CDP (Chrome DevTools Protocol). Powers AI agents that interact with web pages.

Features

CDP Integration: Full Chrome DevTools Protocol support for reliable browser automation
Accessibility Tree: Semantic page representation for AI navigation (not CSS selectors)
Reference-based Targeting: Elements are tracked by refs (ref_1, ref_2) that survive page changes
Tool Definitions: Ready-to-use tool schemas for LLM tool calling (Claude, GPT, etc.)
Screenshot Support: High-quality screenshots with DPR scaling

Installation

npm install @hanzili/chrome-browser-agent

Setup

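1. Add to your manifest.json

Content script paths are resolved relative to the extension root, so the package files must ship inside the packaged extension (or be copied into it, with the paths below adjusted to match).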

{
  "permissions": ["debugger", "scripting", "tabs", "activeTab"],
  "host_permissions": ["<all_urls>"],
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": [
        "node_modules/@hanzili/chrome-browser-agent/src/content/accessibility-tree.js",
        "node_modules/@hanzili/chrome-browser-agent/src/content/content.js"
      ]
    }
  ]
}

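2. Use it in your service worker

Extension service workers cannot resolve bare module specifiers on their own, so the import below assumes the service worker is built with a bundler.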

import {
  cdpHelper,
  executeTool,
  TOOL_DEFINITIONS
} from '@hanzili/chrome-browser-agent';

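// Pass TOOL_DEFINITIONS to your LLM.
// callLLM, messages, and currentTabId are placeholders for your own code.
const response = await callLLM(messages, { tools: TOOL_DEFINITIONS });

// Execute each tool call returned by the LLM
for (const toolUse of response.tool_calls) {
  const result = await executeTool(toolUse.name, toolUse.input, {
    tabId: currentTabId,
    // Forwards { type, payload } messages to the content script in the tab
    sendToContent: (tabId, type, payload) => chrome.tabs.sendMessage(tabId, { type, payload })
  });
  // Return `result` to the LLM as the outcome of this tool call
}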
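
The loop above stops at the first tool that throws, and it does not keep the results. Below is a minimal sketch of a loop that records every outcome so it can be sent back to the model. It assumes each tool call carries `name` and `input` fields, as in the snippet above. `runToolCalls` and the shape of its return value are hypothetical and not part of the package.

```javascript
// Hypothetical helper: runs tool calls in order and records each outcome.
// `execute` is expected to behave like executeTool(name, input, context).
async function runToolCalls(toolCalls, execute, context) {
  const outcomes = [];
  for (const call of toolCalls) {
    try {
      const result = await execute(call.name, call.input, context);
      outcomes.push({ name: call.name, ok: true, result });
    } catch (err) {
      // Record the failure and continue, so the model can see which call failed.
      const message = err && err.message ? err.message : String(err);
      outcomes.push({ name: call.name, ok: false, error: message });
    }
  }
  return outcomes;
}
```

In the service worker this could be called as `runToolCalls(response.tool_calls, executeTool, { tabId: currentTabId, sendToContent })`.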

Core Concepts

Accessibility Tree

Instead of fragile CSS selectors, this toolkit uses an accessibility tree representation:

button "Submit Application" [ref_1]
textbox "Email" [ref_2] placeholder="Enter email"
combobox "Country" [ref_3]
  option "United States" value="us"
  option "Canada" value="ca" (selected)

The LLM sees semantic roles and can reference elements by ref_1, ref_2, etc.
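
To make the format concrete, here is a small sketch that parses lines like the ones above into role, name, and ref. The line grammar is inferred only from this example, so the package's real output may contain forms this pattern does not handle. `parseTreeLine` and `collectRefs` are hypothetical names, not package exports.

```javascript
// Pattern for one line of the text format shown above:
//   <indent><role> "<name>" [ref_N] ...
// Inferred from the example only; trailing attributes are ignored.
const TREE_LINE = /^(\s*)(\S+)\s+"([^"]*)"(?:\s+\[(ref_\d+)\])?/;

// Returns { depth, role, name, ref } or null if the line does not match.
function parseTreeLine(line) {
  const match = TREE_LINE.exec(line);
  if (!match) return null;
  const [, indent, role, name, ref] = match;
  // Assumes two spaces of indentation per nesting level, as in the example.
  return { depth: indent.length / 2, role, name, ref: ref || null };
}

// Lists every ref that appears in a multi-line tree, in document order.
function collectRefs(treeText) {
  return treeText
    .split('\n')
    .map(parseTreeLine)
    .filter((node) => node && node.ref)
    .map((node) => node.ref);
}
```

A check like `collectRefs(tree).includes(ref)` can reject a ref the model made up before any tool is run.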

Available Tools

| Tool | Description |
|------|-------------|
| read_page | Get accessibility tree of current page |
| computer | Click, type, scroll, screenshot |
| form_input | Fill form fields by reference |
| navigate | Go to URL, back, forward |
| find | Natural language element search |
| file_upload | Upload files to inputs |
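
Models sometimes ask for a tool that does not exist. The sketch below separates tool calls into known and unknown names before anything is executed. The list of names is copied from the table above. `splitToolCalls` is a hypothetical helper, and the package may perform its own validation.

```javascript
// Tool names as listed in the table above.
const TOOL_NAMES = ['read_page', 'computer', 'form_input', 'navigate', 'find', 'file_upload'];

// Hypothetical helper: partitions tool calls by whether their name is listed.
function splitToolCalls(toolCalls) {
  const known = [];
  const unknown = [];
  for (const call of toolCalls) {
    (TOOL_NAMES.includes(call.name) ? known : unknown).push(call);
  }
  return { known, unknown };
}
```

Unknown calls can be reported back to the model as errors instead of being passed to `executeTool`.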

API Reference

cdpHelper

// Attach debugger to tab
await cdpHelper.attachDebugger(tabId);

// Take screenshot
const base64 = await cdpHelper.takeScreenshot(tabId);

// Click at coordinates
await cdpHelper.click(tabId, x, y);

// Type text
await cdpHelper.type(tabId, "Hello world");
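
For readers unfamiliar with CDP, the sketch below shows roughly what a coordinate click looks like at the protocol level. It is not the package's implementation. `Input.dispatchMouseEvent` is a standard CDP method, while `buildClickCommands`, `clickAt`, and the injected `send` function are hypothetical names used for illustration.

```javascript
// Builds the CDP commands for a single left click at (x, y):
// move the pointer, press the button, release the button.
function buildClickCommands(x, y) {
  const button = { x, y, button: 'left', clickCount: 1 };
  return [
    { method: 'Input.dispatchMouseEvent', params: { type: 'mouseMoved', x, y } },
    { method: 'Input.dispatchMouseEvent', params: { ...button, type: 'mousePressed' } },
    { method: 'Input.dispatchMouseEvent', params: { ...button, type: 'mouseReleased' } }
  ];
}

// Sends the commands one at a time. `send` is injected so the sketch also
// runs outside an extension; in a service worker, after attaching the
// debugger to the tab, it could be:
//   (method, params) => chrome.debugger.sendCommand({ tabId }, method, params)
async function clickAt(send, x, y) {
  for (const { method, params } of buildClickCommands(x, y)) {
    await send(method, params);
  }
}
```

Injecting `send` keeps the command sequence testable without a live browser tab.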

executeTool

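// Arguments: tool name, tool input, execution context.
// The tool name and input must match an entry in TOOL_DEFINITIONS.
const result = await executeTool('click', { ref: 'ref_1' }, {
  tabId,          // ID of the tab to act on
  sendToContent,  // (tabId, type, payload) => forwards a message to the content script
  cdpHelper
});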

Content Scripts

The content scripts must be injected into pages. They provide:

accessibility-tree.js: Generates the semantic tree, manages element refs
content.js: Handles messages from service worker (form fill, click, etc.)
agent-visual-indicator.js: Shows visual feedback during automation
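
The service worker reaches these scripts with `{ type, payload }` messages, as in the `sendToContent` callback under Setup. The sketch below shows one way a content script could route such messages. The message type names and the `createMessageRouter` helper are hypothetical, and the package's actual message protocol is not documented here.

```javascript
// Hypothetical router: maps a message's `type` to a handler function and
// wraps the outcome in { ok, result } or { ok, error }.
function createMessageRouter(handlers) {
  return function route(message) {
    const type = message ? message.type : undefined;
    const isKnown = Object.prototype.hasOwnProperty.call(handlers, type);
    if (!isKnown || typeof handlers[type] !== 'function') {
      return { ok: false, error: `Unknown message type: ${type}` };
    }
    try {
      return { ok: true, result: handlers[type](message.payload) };
    } catch (err) {
      return { ok: false, error: err && err.message ? err.message : String(err) };
    }
  };
}

// In a content script the router could be registered like this:
// const route = createMessageRouter({ /* type: handler, ... */ });
// chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
//   sendResponse(route(message));
// });
```

Returning a uniform `{ ok, ... }` object lets the service worker treat every response the same way, whether the handler succeeded or failed.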

License

MIT