browser-autopilot

v0.5.7

Published

3 months ago

Autonomous browser agent — LLM-driven CDP automation with X11 stealth fallback, CAPTCHA solving, and extensible skills


eigengajesh

browser automation agent cdp chrome llm ai web-agent browser-use autonomous scraping captcha

browser-autopilot

Autonomous browser automation for real Chrome, built for local or self-hosted execution.

It uses:

raw Chrome DevTools Protocol for fast, structured browser control
local X11 mouse and keyboard events when the task needs literal screen-level interaction
the Vercel AI SDK for multi-step model-driven reasoning and tool use

The browser always runs on your machine or your infrastructure. The package sends screenshots and page state to your model provider, but the browser itself never runs on the provider's machines.

What It Is

browser-autopilot is a TypeScript package for browser agents that need more than a scraper and less than a full remote browser service.

It is designed for:

authenticated browser workflows
long multi-step tasks
sites where DOM-first automation is useful most of the time
cases where you still want a local fallback for literal mouse and keyboard control

It is not built on Playwright or Puppeteer. The package talks to Chrome over raw CDP and can fall back to X11-level input on Linux when needed.
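For readers unfamiliar with "raw CDP": the protocol is a stream of JSON messages over a WebSocket, where each command carries an `id`, a `method`, and `params`, and Chrome answers with the same `id`. The sketch below illustrates only that framing. `CdpFramer` is a hypothetical class written for this example and is not this package's client; `Page.navigate` and `Runtime.evaluate` are real CDP methods. Events (messages without an `id`) are left out for brevity.

```typescript
// Hypothetical illustration of CDP message framing; not browser-autopilot's client.
type CdpCommand = { id: number; method: string; params: Record<string, unknown> };
type CdpResponse = {
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
};

class CdpFramer {
  private nextId = 1;
  private pending = new Map<number, string>();

  // Build the JSON text that would be sent over the DevTools WebSocket.
  encode(method: string, params: Record<string, unknown> = {}): string {
    const command: CdpCommand = { id: this.nextId++, method, params };
    this.pending.set(command.id, method);
    return JSON.stringify(command);
  }

  // Match a raw reply to the command that produced it, using the shared id.
  decode(raw: string): { method: string; result: unknown } {
    const response = JSON.parse(raw) as CdpResponse;
    const method = this.pending.get(response.id);
    if (method === undefined) {
      throw new Error(`Unexpected response id ${response.id}`);
    }
    this.pending.delete(response.id);
    if (response.error) {
      throw new Error(`${method} failed: ${response.error.message}`);
    }
    return { method, result: response.result };
  }
}

const framer = new CdpFramer();
const navigateMessage = framer.encode("Page.navigate", { url: "https://example.com" });
const evaluateMessage = framer.encode("Runtime.evaluate", { expression: "document.title" });
```

A real client would write these strings to the WebSocket URL that Chrome exposes on its remote debugging port and feed each incoming frame to `decode`.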

Core Model

There are two execution modes:

CDP mode: fast, structured control through Chrome DevTools Protocol. The model sees a screenshot plus an indexed DOM-like view and can call browser tools like click, input, navigate, extract, evaluate, upload_file, scroll, and done.
X11 mode: local OS-level control of the actual browser window through the X Window System. The model works from screenshots and issues actions like CLICK, DOUBLE_CLICK, MOVE, DRAG, SCROLL, TYPE, KEYPRESS, WAIT, and DONE.

orchestrate(...) combines them:

orchestrate({ credentials, loginUrl, successUrlContains, task })
  │
  ├─ Cached session? ───────────────→ run task in CDP mode
  ├─ CDP login works? ──────────────→ run task in CDP mode
  └─ CDP looks blocked or unusable? → relaunch and continue through X11

The CDP agent can also explicitly request a handoff by calling switch_to_x11.
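The fallback order in the diagram can be read as a small decision function. The following is a simplified model for illustration only: `LoginProbe`, `ExecutionPath`, and `chooseExecutionPath` are hypothetical names, and the real orchestrate(...) logic may weigh more signals than these three flags.

```typescript
// Hypothetical model of the fallback order; names and fields are assumptions.
type LoginProbe = {
  hasCachedSession: boolean;
  cdpLoginSucceeded: boolean;
  // True when the CDP agent asked for a handoff, e.g. by calling switch_to_x11.
  agentRequestedX11?: boolean;
};

type ExecutionPath = "cdp-cached" | "cdp-login" | "x11-fallback";

function chooseExecutionPath(probe: LoginProbe): ExecutionPath {
  // An explicit handoff request wins over everything else.
  if (probe.agentRequestedX11) return "x11-fallback";
  // A cached session skips login entirely.
  if (probe.hasCachedSession) return "cdp-cached";
  // Otherwise a successful CDP login keeps the task in CDP mode.
  if (probe.cdpLoginSucceeded) return "cdp-login";
  // CDP looked blocked or unusable: relaunch and continue through X11.
  return "x11-fallback";
}
```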

What X11 Means

In this package, X11 means local OS-level browser control using the machine's real windowing and input stack, not DOM or CDP APIs.

That is implemented with Linux/X11 utilities such as:

xdotool for mouse and keyboard events
xclip for real clipboard paste
ImageMagick import for screenshots
Xvfb and openbox for headful containerized environments

This is useful when the page is hostile to DOM automation, when the DOM is not trustworthy enough, or when the task truly needs screen-level interaction.
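To make "screen-level interaction" concrete, here is a minimal sketch of how high-level actions might map onto xdotool command lines. The `ScreenAction` type and `xdotoolArgs` helper are hypothetical and only build argument arrays; nothing is executed, and the package's actual src/x11 code may differ. The xdotool subcommands used (`mousemove`, `click`, `type --delay`, `key`, `click --repeat`) are real; buttons 4 and 5 are the conventional X11 scroll-up and scroll-down buttons.

```typescript
// Hypothetical helper that only builds xdotool argv arrays; nothing runs here.
type ScreenAction =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "key"; key: string }
  | { kind: "scroll"; direction: "up" | "down"; amount: number };

function xdotoolArgs(action: ScreenAction): string[] {
  switch (action.kind) {
    case "click":
      // Move the pointer, then press the left button (button 1).
      return ["mousemove", String(action.x), String(action.y), "click", "1"];
    case "type":
      // A per-keystroke delay in milliseconds; 50 is an arbitrary choice.
      return ["type", "--delay", "50", action.text];
    case "key":
      // Key names follow X keysym conventions, e.g. "Return" or "ctrl+v".
      return ["key", action.key];
    case "scroll":
      // Wheel events are delivered as clicks on button 4 (up) or 5 (down).
      return [
        "click",
        "--repeat",
        String(action.amount),
        action.direction === "up" ? "4" : "5",
      ];
  }
}
```

On a machine with a valid DISPLAY, an array like this could be passed to `spawnSync("xdotool", args)` from `node:child_process`.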

AI SDK Integration

browser-autopilot already uses the Vercel AI SDK internally:

generateText(...) drives the step loop
gateway(model) resolves the model
browser tools and your custom tools use standard AI SDK tool(...) definitions

That means you can extend the agent with normal AI SDK tools and pass them into runAgent(...) or orchestrate(...).

import { tool } from "ai";
import { z } from "zod";
import { CDPBrowser, runAgent } from "browser-autopilot";

const browser = new CDPBrowser();
await browser.connect();

const result = await runAgent({
  browser,
  task: "Log in and fetch the latest invoice PDF",
  extraTools: {
    get_2fa_code: tool({
      description: "Return the latest 2FA code",
      inputSchema: z.object({}),
      execute: async () => "123456",
    }),
  },
});

Today the package is wired to the AI SDK gateway path, so you configure the model by name and provide AI_GATEWAY_API_KEY.
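The get_2fa_code tool in the example above returns a hard-coded "123456". If the site uses time-based one-time passwords, a custom tool could compute the code from a seed instead. The sketch below follows RFC 6238 with the common defaults (HMAC-SHA1, 30-second step, 6 digits), which are assumptions about the target site. `base32Decode` and `generateTotp` are hypothetical helpers written for this example; how the package itself consumes totpKey is not shown in this README.

```typescript
import { createHmac } from "node:crypto";

const BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

// Decode an RFC 4648 base32 string, the usual encoding for TOTP seeds.
function base32Decode(input: string): Buffer {
  const clean = input.toUpperCase().replace(/[\s=]/g, "");
  const bytes: number[] = [];
  let buffer = 0;
  let bitCount = 0;
  for (const char of clean) {
    const index = BASE32_ALPHABET.indexOf(char);
    if (index === -1) throw new Error(`Invalid base32 character: ${char}`);
    buffer = (buffer << 5) | index;
    bitCount += 5;
    if (bitCount >= 8) {
      bitCount -= 8;
      bytes.push((buffer >>> bitCount) & 0xff);
      buffer &= (1 << bitCount) - 1; // keep only the bits not yet emitted
    }
  }
  return Buffer.from(bytes);
}

// Compute a TOTP code for the given time (milliseconds since the Unix epoch).
function generateTotp(
  seedBase32: string,
  nowMs: number,
  digits = 6,
  stepSeconds = 30,
): string {
  const counter = Math.floor(nowMs / 1000 / stepSeconds);
  // The counter is hashed as an 8-byte big-endian integer.
  const counterBytes = Buffer.alloc(8);
  counterBytes.writeUInt32BE(Math.floor(counter / 0x100000000), 0);
  counterBytes.writeUInt32BE(counter % 0x100000000, 4);
  const digest = createHmac("sha1", base32Decode(seedBase32))
    .update(counterBytes)
    .digest();
  // Dynamic truncation: the low nibble of the last byte selects a 4-byte window.
  const offset = digest[digest.length - 1] & 0x0f;
  const binary = digest.readUInt32BE(offset) & 0x7fffffff;
  return (binary % 10 ** digits).toString().padStart(digits, "0");
}
```

With the RFC 6238 test seed (`GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ`, the base32 form of `12345678901234567890`) and Unix time 59, this returns `287082`. Inside a tool, the execute function could then be something like `async () => generateTotp(process.env.MY_TOTP_KEY ?? "", Date.now())`, which keeps the seed out of the task text.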

Installation

npm install browser-autopilot ai zod

Environment:

export AI_GATEWAY_API_KEY=...
export AGENT_MODEL=claude-sonnet-4-6

For local Chrome/CDP use, you also need a Chrome or Chromium binary available.

For X11 mode on Linux, you need:

xdotool
xclip
ImageMagick import
an X11 display or an Xvfb-based container/VM

Quick Start

1. Login plus task orchestration

Use orchestrate(...) when you want the package to handle:

cached-session reuse
a quick CDP login attempt
fallback to X11 when needed
running the actual task afterward

import { orchestrate } from "browser-autopilot";

const { result, success, loginMethod } = await orchestrate({
  credentials: {
    username: process.env.MY_USER ?? "",
    password: process.env.MY_PASS ?? "",
    email: process.env.MY_EMAIL ?? "",
    totpKey: process.env.MY_TOTP_KEY ?? "",
  },
  loginUrl: "https://x.com/login",
  successUrlContains: "/home",
  task: "Open settings and tell me which account email is configured.",
});

console.log({ success, loginMethod, result });

2. CDP-only task execution

Use runAgent(...) when:

you are already logged in
the task does not need auth
you want direct control over browser lifecycle

import { CDPBrowser, runAgent } from "browser-autopilot";

const browser = new CDPBrowser();
await browser.connect();

const { result, success } = await runAgent({
  browser,
  task: "Go to wikipedia.org and find the population of Tokyo.",
});

console.log({ success, result });
await browser.disconnect();

3. Direct X11 control

Use X11Agent directly when you want a pure screenshot-and-actions loop on Linux/X11.

import { X11Agent } from "browser-autopilot";
import * as chrome from "browser-autopilot/x11/chrome";

chrome.launch("https://example.com/login", "example-profile");

const agent = new X11Agent();
const result = await agent.runDetailed({
  systemPrompt: `
You are controlling a Chrome browser with local X11 mouse and keyboard actions.
Log in, then say ACTION: DONE Logged in successfully.
`,
  successCheck: () => chrome.pageUrlContains("/dashboard"),
});

console.log(result);

Supported Systems

| Environment | CDP mode | X11 mode | Notes |
|---|---|---|---|
| Linux desktop with X11 | Supported | Supported | Best native environment for the full stack |
| Linux VM/container with Xvfb | Supported | Supported | Recommended for cloud/self-hosted deployments |
| macOS | Supported | Not supported natively | Use CDP mode locally; use Docker/Linux for X11 fallback |
| Windows | Partial | Not supported natively | Chrome path detection exists, but the X11 stack does not |
| Serverless functions | Poor fit | Poor fit | Headful Chrome + long-lived sessions are usually the wrong shape |

Linux

Linux is the primary target for the full feature set.

Use Linux if you want:

orchestrate(...) with reliable X11 fallback
switch_to_x11
Docker/Xvfb deployment
noVNC live viewing

If your desktop runs Wayland, the X11 path may require XWayland or an X11 session. The X11 tools in this repo are not written against native Wayland APIs.
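Before relying on X11 fallback, it can help to check the environment up front. The following is a rough preflight sketch, not something the package is known to export: `checkX11` is a hypothetical helper, and it relies only on the standard `DISPLAY` and `XDG_SESSION_TYPE` environment variable conventions.

```typescript
// Hypothetical preflight check; not part of browser-autopilot's public API.
type X11Check = { available: boolean; reason: string };

function checkX11(
  platform: string,
  env: Record<string, string | undefined>,
): X11Check {
  if (platform !== "linux") {
    return { available: false, reason: `X11 mode needs Linux, got ${platform}` };
  }
  if (!env.DISPLAY) {
    return {
      available: false,
      reason: "DISPLAY is not set; start Xvfb or an X11 session",
    };
  }
  if (env.XDG_SESSION_TYPE === "wayland") {
    // DISPLAY under Wayland usually points at XWayland, which may limit input tools.
    return {
      available: true,
      reason: "Wayland session; X11 tools will go through XWayland",
    };
  }
  return { available: true, reason: "X11 display found" };
}
```

In a Node.js process this would be called as `checkX11(process.platform, process.env)`.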

macOS

macOS is a good fit for CDP-only use:

local development
authenticated flows that succeed without X11 fallback
browser tasks driven through runAgent(...)

Native macOS is not a full X11 target for this package. src/x11/input.ts depends on Linux/X11 tools like xdotool, xclip, and ImageMagick window capture.

Practical guidance on macOS:

Use runAgent(...) locally for CDP-first tasks.
Use orchestrate(...) only if you are comfortable with the fact that X11 fallback is not available natively.
If you need the full stack, run the Docker image or a Linux VM.

Windows

The Chrome launch helper knows common Windows Chrome paths, but the broader package is not a native Windows-first stack.

Practical guidance on Windows:

treat local CDP usage as experimental
do not expect native X11 fallback
use a Linux VM or container for production use

Local Execution Model

This package is local or self-hosted by design.

What runs locally:

Chrome
CDP client
X11 input execution
screenshots
file uploads and downloads
browser profiles and cookies

What goes to the model provider:

task text
browser state text
screenshots
tool-call context and results

So if you use OpenAI, Anthropic, or another provider through AI SDK Gateway, the reasoning is remote but the browser execution stays on your machine or your own servers.

Cloud Deployment

The best cloud shape is a long-lived Linux container or VM with a headful browser.

Good fits:

Docker on a VM
Kubernetes workloads with persistent storage
self-hosted Linux boxes
isolated agent workers or TEEs that can run a browser session for minutes at a time

Poor fits:

stateless serverless functions
environments where headful Chrome cannot start
platforms without persistent disk for browser profiles

Docker

The repo includes a Docker image that sets up:

Google Chrome
Xvfb
openbox
xdotool
xclip
ImageMagick
optional noVNC viewer

Build:

docker build -f docker/Dockerfile -t browser-autopilot .

Run:

docker run --rm \
  -e AI_GATEWAY_API_KEY=$AI_GATEWAY_API_KEY \
  -e LOGIN_URL=https://x.com/login \
  -e SUCCESS_URL=/home \
  -e TWITTER_USER=myuser \
  -e TWITTER_PASS=mypass \
  -e TWITTER_EMAIL=myuser@example.com \
  -e TWITTER_TOTP_KEY=ABCDEF123456 \
  -e AGENT_TASK="Open settings and summarize what you find." \
  -v browser-autopilot-data:/data \
  browser-autopilot

Notes:

The Docker image is linux/amd64 today.
Persist /data if you want browser sessions and outputs to survive across runs.
Set ENABLE_VIEWER=1 if you want the noVNC viewer for debugging.

Cloud Architecture Guidance

For production-ish deployments:

prefer one browser session per worker
persist the browser profile directory
keep timezone, locale, and proxy geography aligned
use a real Linux/X11 stack if you depend on X11 fallback
treat the browser as stateful infrastructure, not a short-lived lambda

Browser Tools

The CDP agent exposes a broad tool surface, including:

navigate
click
click_at
input
type_text
send_keys
scroll
find_text
switch_tab
new_tab
close_tab
upload_file
click_and_upload
paste_content
paste_image
extract
evaluate
handle_dialog
wait
save_page_snapshot
save_element_html
shell
solve_captcha
inject_captcha_token
solve_datadome
done

The X11 agent supports local screen-level actions such as:

CLICK x y
DOUBLE_CLICK x y
MOVE x y
DRAG x1 y1 x2 y2
SCROLL up|down amount
TYPE text
KEYPRESS key
WAIT seconds
SCREENSHOT
DONE result
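As an illustration of how a screenshot-and-actions loop might consume these commands, here is a small parser sketch. It assumes each model reply contains a line of the form `ACTION: NAME args`, based on the `ACTION: DONE ...` example in the Quick Start prompt; the exact grammar the package accepts is not documented here, and `X11Action`, `parseNumbers`, and `parseAction` are hypothetical names.

```typescript
// Hypothetical parser for "ACTION: ..." lines; the real grammar may differ.
type X11Action =
  | { type: "CLICK" | "DOUBLE_CLICK" | "MOVE"; x: number; y: number }
  | { type: "DRAG"; x1: number; y1: number; x2: number; y2: number }
  | { type: "SCROLL"; direction: "up" | "down"; amount: number }
  | { type: "TYPE"; text: string }
  | { type: "KEYPRESS"; key: string }
  | { type: "WAIT"; seconds: number }
  | { type: "SCREENSHOT" }
  | { type: "DONE"; result: string };

// Parse exactly `count` whitespace-separated finite numbers, or return null.
function parseNumbers(text: string, count: number): number[] | null {
  const parts = text.split(/\s+/).filter(Boolean);
  if (parts.length !== count) return null;
  const values = parts.map(Number);
  return values.every((value) => Number.isFinite(value)) ? values : null;
}

function parseAction(line: string): X11Action | null {
  const match = /^ACTION:\s*([A-Z_]+)\s*(.*)$/.exec(line.trim());
  if (!match) return null;
  const name = match[1];
  const rest = match[2].trim();
  switch (name) {
    case "CLICK":
    case "DOUBLE_CLICK":
    case "MOVE": {
      const n = parseNumbers(rest, 2);
      return n ? { type: name, x: n[0], y: n[1] } : null;
    }
    case "DRAG": {
      const n = parseNumbers(rest, 4);
      return n ? { type: "DRAG", x1: n[0], y1: n[1], x2: n[2], y2: n[3] } : null;
    }
    case "SCROLL": {
      const [direction, amount] = rest.split(/\s+/);
      const n = parseNumbers(amount ?? "", 1);
      if ((direction !== "up" && direction !== "down") || !n) return null;
      return { type: "SCROLL", direction, amount: n[0] };
    }
    case "TYPE":
      return rest ? { type: "TYPE", text: rest } : null;
    case "KEYPRESS":
      return rest ? { type: "KEYPRESS", key: rest } : null;
    case "WAIT": {
      const n = parseNumbers(rest, 1);
      return n ? { type: "WAIT", seconds: n[0] } : null;
    }
    case "SCREENSHOT":
      return { type: "SCREENSHOT" };
    case "DONE":
      return { type: "DONE", result: rest };
    default:
      return null; // unknown action names are rejected rather than guessed
  }
}
```

Returning `null` for malformed input lets the caller re-prompt the model instead of sending a half-parsed action to the input layer.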

Sensitive Data and Custom Tools

The package supports sensitiveData masking so secrets already present in prompts can be redacted in the model-facing task text. For high-value credentials or payment details, prefer explicit AI SDK tools over dumping everything directly into the task.

That pattern looks like:

pass non-sensitive workflow context in task
expose just-in-time secrets through extraTools
let the agent request them only when needed
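To illustrate the masking idea, the sketch below replaces known secret values in a task string with named placeholders before the text is sent to the model. This is a conceptual example only: `maskSecrets` and the `<secret:name>` placeholder format are assumptions, and the package's actual sensitiveData option may take a different shape.

```typescript
// Hypothetical masking helper; the package's sensitiveData option may differ.
function maskSecrets(text: string, secrets: Record<string, string>): string {
  const entries = Object.entries(secrets)
    // Empty values would match everywhere, so skip them.
    .filter(([, value]) => value.length > 0)
    // Replace longer secrets first so a short secret that is a substring
    // of a longer one does not leave a partial value behind.
    .sort(([, a], [, b]) => b.length - a.length);

  let masked = text;
  for (const [name, value] of entries) {
    // split/join replaces every occurrence without regex escaping concerns.
    masked = masked.split(value).join(`<secret:${name}>`);
  }
  return masked;
}

const maskedTask = maskSecrets("Log in as alice with password hunter2", {
  password: "hunter2",
});
```

Masking only protects values that are already known up front; secrets fetched later are better served by the extraTools pattern described above.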

Project Structure

src/
  agent/       Step-based CDP agent loop and tool definitions
  browser/     Raw CDP client and DOM indexing
  captcha/     CAPTCHA solving helpers
  viewer/      Optional live viewer for Xvfb environments
  x11/         X11 agent, local input primitives, Chrome launch helpers
  orchestrator.ts
  config.ts
  index.ts
docker/
  Dockerfile
  entrypoint.sh
tests/
docs/

Important Constraints

Use headful Chrome, not headless Chrome.
X11 fallback is a Linux/X11 feature, not a cross-platform abstraction.
If you depend on stealthy login fallback, deploy on Linux.
If you only need structured browser automation, CDP mode is the simpler path.
The package currently chooses models through AI SDK Gateway, so configure AI_GATEWAY_API_KEY and AGENT_MODEL.

Environment Variables

| Variable | Required | Description |
|---|---|---|
| AI_GATEWAY_API_KEY | Yes | API key for AI SDK Gateway |
| AGENT_MODEL | No | Model name, defaults to claude-sonnet-4-6 |
| LOGIN_URL | CLI only | Login URL for the top-level entrypoint |
| SUCCESS_URL | CLI only | Post-login URL substring for the top-level entrypoint |
| TWITTER_USER | Optional | Username used by the default CLI entrypoint |
| TWITTER_PASS | Optional | Password used by the default CLI entrypoint |
| TWITTER_EMAIL | Optional | Email used by the default CLI entrypoint |
| TWITTER_TOTP_KEY | Optional | TOTP seed used by the default CLI entrypoint |
| PROXY_HOST | No | SOCKS5 proxy host |
| PROXY_PORT | No | SOCKS5 proxy port |
| PROXY_USER | No | Proxy username |
| PROXY_PASS | No | Proxy password |
| CAPSOLVER_KEY | No | Capsolver API key |
| TWOCAPTCHA_KEY | No | 2Captcha API key |
| CDP_PORT | No | Chrome remote debugging port, defaults to 9222 |
| PROFILE_DIR | No | Browser profile directory |
| DATA_DIR | No | Data/output directory |
| AGENT_TASK | No | Task used by the top-level CLI entrypoint |
| MAX_STEPS | No | Max agent steps, defaults to 80 |
| ENABLE_VIEWER | No | Enable the noVNC viewer in Docker |
| CHROME_PATH | No | Override Chrome binary path |
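If you wrap the package in your own service, you may want to validate these variables before starting a run. The sketch below reads a handful of them with the defaults listed in the table. `AgentConfig`, `parsePositiveInt`, and `loadConfig` are hypothetical names for this example and do not describe the package's own config.ts; treating `ENABLE_VIEWER=1` as "on" follows the Docker notes.

```typescript
// Hypothetical config loader using the documented variable names and defaults.
type AgentConfig = {
  gatewayApiKey: string;
  model: string;
  cdpPort: number;
  maxSteps: number;
  enableViewer: boolean;
};

function parsePositiveInt(
  raw: string | undefined,
  fallback: number,
  name: string,
): number {
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Record<string, string | undefined>): AgentConfig {
  const gatewayApiKey = env.AI_GATEWAY_API_KEY;
  if (!gatewayApiKey) {
    // The only variable the table marks as required.
    throw new Error("AI_GATEWAY_API_KEY is required");
  }
  return {
    gatewayApiKey,
    model: env.AGENT_MODEL || "claude-sonnet-4-6",
    cdpPort: parsePositiveInt(env.CDP_PORT, 9222, "CDP_PORT"),
    maxSteps: parsePositiveInt(env.MAX_STEPS, 80, "MAX_STEPS"),
    enableViewer: env.ENABLE_VIEWER === "1",
  };
}
```

Taking the environment as a parameter, rather than reading `process.env` directly, keeps the function easy to test; in production you would call `loadConfig(process.env)`.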

Development

npm install
npm test
npm run build

Architecture notes live in docs/architecture.md.
