browser-autopilot
v0.5.7
Autonomous browser agent — LLM-driven CDP automation with X11 stealth fallback, CAPTCHA solving, and extensible skills
browser-autopilot
Autonomous browser automation for real Chrome, built for local or self-hosted execution.
It uses:
- raw Chrome DevTools Protocol for fast, structured browser control
- local X11 mouse and keyboard events when the task needs literal screen-level interaction
- the Vercel AI SDK for multi-step model-driven reasoning and tool use
The browser always runs on your machine or your infrastructure. The package sends screenshots and page state to your model provider, but the browser itself never runs on the provider's machines.
What It Is
browser-autopilot is a TypeScript package for browser agents that need more than a scraper and less than a full remote browser service.
It is designed for:
- authenticated browser workflows
- long multi-step tasks
- sites where DOM-first automation is useful most of the time
- cases where you still want a local fallback for literal mouse and keyboard control
It is not built on Playwright or Puppeteer. The package talks to Chrome over raw CDP and can fall back to X11-level input on Linux when needed.
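To make "raw CDP" concrete: CDP commands are JSON messages carrying an `id`, a `method`, and `params`, sent over Chrome's remote debugging WebSocket. The sketch below only builds such messages and opens no connection. `CdpCommandBuilder` is a hypothetical helper for illustration and is not part of this package's API; the method names (`Page.navigate`, `Input.dispatchMouseEvent`) are standard CDP.

```typescript
// Hypothetical helper: builds CDP command payloads without opening a socket.
type CdpCommand = { id: number; method: string; params: Record<string, unknown> };

class CdpCommandBuilder {
  private nextId = 1;

  // Each command gets a unique id so a response can be matched to its request.
  command(method: string, params: Record<string, unknown> = {}): CdpCommand {
    return { id: this.nextId++, method, params };
  }

  navigate(url: string): CdpCommand {
    return this.command("Page.navigate", { url });
  }

  // A left click is a mousePressed event followed by mouseReleased at the same point.
  click(x: number, y: number): CdpCommand[] {
    const base = { x, y, button: "left", clickCount: 1 };
    return [
      this.command("Input.dispatchMouseEvent", { type: "mousePressed", ...base }),
      this.command("Input.dispatchMouseEvent", { type: "mouseReleased", ...base }),
    ];
  }
}

const cdp = new CdpCommandBuilder();
const messages = [cdp.navigate("https://example.com"), ...cdp.click(120, 240)];
for (const message of messages) {
  console.log(JSON.stringify(message)); // the text that would be written to the WebSocket
}
```

A real client would also keep a map from `id` to a pending promise so that each response resolves the right caller.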
Core Model
There are two execution modes:
- CDP mode: Fast, structured control through Chrome DevTools Protocol. The model sees a screenshot plus an indexed DOM-like view and can call browser tools like click, input, navigate, extract, evaluate, upload_file, scroll, and done.
- X11 mode: Local OS-level control of the actual browser window through the X Window System. The model works from screenshots and issues actions like CLICK, DOUBLE_CLICK, MOVE, DRAG, SCROLL, TYPE, KEYPRESS, WAIT, and DONE.
orchestrate(...) combines them:
orchestrate({ credentials, loginUrl, successUrlContains, task })
│
├─ Cached session? ───────────────→ run task in CDP mode
├─ CDP login works? ──────────────→ run task in CDP mode
└─ CDP looks blocked or unusable? → relaunch and continue through X11
The CDP agent can also explicitly request a handoff by calling switch_to_x11.
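The decision flow in the diagram can be sketched as a small pure function. This is an illustration of the routing logic only, not the package's implementation: `LoginProbe`, `Route`, and `chooseRoute` are hypothetical names, and giving the explicit handoff first priority is an assumption.

```typescript
// Hypothetical types and names; they mirror the diagram, not the package's internals.
type LoginProbe = {
  hasCachedSession: boolean;
  cdpLoginSucceeded: boolean;
  agentRequestedX11: boolean; // the CDP agent called switch_to_x11
};

type Route = { mode: "cdp" | "x11"; reason: string };

function chooseRoute(probe: LoginProbe): Route {
  // Assumption: an explicit handoff request wins over everything else.
  if (probe.agentRequestedX11) return { mode: "x11", reason: "explicit handoff" };
  if (probe.hasCachedSession) return { mode: "cdp", reason: "cached session" };
  if (probe.cdpLoginSucceeded) return { mode: "cdp", reason: "cdp login" };
  // Neither a cached session nor a working CDP login: fall back to X11.
  return { mode: "x11", reason: "cdp blocked or unusable" };
}

const blocked = chooseRoute({
  hasCachedSession: false,
  cdpLoginSucceeded: false,
  agentRequestedX11: false,
});
console.log(blocked);
```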
What X11 Means
In this package, X11 means local OS-level browser control using the machine's real windowing and input stack, not DOM or CDP APIs.
That is implemented with Linux/X11 utilities such as:
- xdotool for mouse and keyboard events
- xclip for real clipboard paste
- ImageMagick import for screenshots
- Xvfb and openbox for headful containerized environments
This is useful when the page is hostile to DOM automation, when the DOM is not trustworthy enough, or when the task truly needs screen-level interaction.
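As a rough illustration of what screen-level input looks like, the sketch below maps a few actions to `xdotool` argument lists. `ScreenAction` and `toXdotoolArgs` are hypothetical names, and the 50 ms typing delay is an arbitrary choice; how the package actually invokes `xdotool` may differ.

```typescript
// Hypothetical action type; the package's internal representation is not shown in this README.
type ScreenAction =
  | { kind: "click"; x: number; y: number }
  | { kind: "double_click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "keypress"; key: string };

// Build the argument list for a single xdotool invocation.
function toXdotoolArgs(action: ScreenAction): string[] {
  switch (action.kind) {
    case "click":
      // Move the pointer, then press and release button 1 (left).
      return ["mousemove", String(action.x), String(action.y), "click", "1"];
    case "double_click":
      return ["mousemove", String(action.x), String(action.y), "click", "--repeat", "2", "1"];
    case "type":
      // --delay is the pause between keystrokes in milliseconds.
      return ["type", "--delay", "50", action.text];
    case "keypress":
      return ["key", action.key];
  }
}

const clickArgs = toXdotoolArgs({ kind: "click", x: 640, y: 360 });
console.log(["xdotool", ...clickArgs].join(" "));
```

The resulting array could be passed to Node's `child_process.execFile("xdotool", args)`, which avoids shell quoting problems with typed text.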
AI SDK Integration
browser-autopilot already uses the Vercel AI SDK internally:
- generateText(...) drives the step loop
- gateway(model) resolves the model
- browser tools and your custom tools use standard AI SDK tool(...) definitions
That means you can extend the agent with normal AI SDK tools and pass them into runAgent(...) or orchestrate(...).
import { tool } from "ai";
import { z } from "zod";
import { CDPBrowser, runAgent } from "browser-autopilot";

// Connect to Chrome over CDP.
const browser = new CDPBrowser();
await browser.connect();

const result = await runAgent({
  browser,
  task: "Log in and fetch the latest invoice PDF",
  extraTools: {
    // Custom tool the agent can call when the site asks for a 2FA code.
    get_2fa_code: tool({
      description: "Return the latest 2FA code",
      inputSchema: z.object({}),
      // Placeholder value; replace with a real lookup.
      execute: async () => "123456",
    }),
  },
});

Today the package is wired to the AI SDK gateway path, so you configure the model by name and provide AI_GATEWAY_API_KEY.
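The get_2fa_code tool in the example above returns a hardcoded value. If your second factor is a TOTP seed, a tool like that could compute the code instead. The following is a minimal sketch of RFC 6238 TOTP, assuming a base32-encoded seed, HMAC-SHA1, 6 digits, and a 30-second step; `base32Decode` and `totp` are illustrative helpers, not functions exported by this package.

```typescript
import { createHmac } from "node:crypto";

// Decode an RFC 4648 base32 string (the usual encoding for TOTP seeds).
function base32Decode(input: string): Buffer {
  const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";
  const bytes: number[] = [];
  let pending = 0; // bit accumulator
  let bitCount = 0; // number of unread bits in the accumulator
  for (const ch of input.toUpperCase().replace(/[\s=]/g, "")) {
    const value = alphabet.indexOf(ch);
    if (value === -1) throw new Error(`Invalid base32 character: ${ch}`);
    pending = ((pending << 5) | value) & 0xfff; // at most 12 bits are ever unread
    bitCount += 5;
    if (bitCount >= 8) {
      bitCount -= 8;
      bytes.push((pending >> bitCount) & 0xff);
    }
  }
  return Buffer.from(bytes);
}

// Compute a TOTP code for the given time (milliseconds since the Unix epoch).
function totp(seedBase32: string, nowMs: number = Date.now(), digits = 6, stepSeconds = 30): string {
  const counter = Buffer.alloc(8);
  counter.writeBigUInt64BE(BigInt(Math.floor(nowMs / 1000 / stepSeconds)));
  const digest = createHmac("sha1", base32Decode(seedBase32)).update(counter).digest();
  // Dynamic truncation: the low nibble of the last byte selects a 4-byte window.
  const offset = digest[digest.length - 1] & 0x0f;
  const binary = digest.readUInt32BE(offset) & 0x7fffffff;
  return String(binary % 10 ** digits).padStart(digits, "0");
}

// Seed is the base32 form of the RFC 6238 test key "12345678901234567890".
const rfcSeed = "GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ";
console.log(totp(rfcSeed, 59_000));
```

Inside a custom tool, the `execute` function could return `totp(seed)` with the seed read from an environment variable, so the code is computed only when the agent asks for it.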
Installation
npm install browser-autopilot ai zod
Environment:
export AI_GATEWAY_API_KEY=...
export AGENT_MODEL=claude-sonnet-4-6
For local Chrome/CDP use, you also need a Chrome or Chromium binary available.
For X11 mode on Linux, you need:
- xdotool
- xclip
- ImageMagick import
- an X11 display or an Xvfb-based container/VM
Quick Start
1. Login plus task orchestration
Use orchestrate(...) when you want the package to handle:
- cached-session reuse
- a quick CDP login attempt
- fallback to X11 when needed
- running the actual task afterward
import { orchestrate } from "browser-autopilot";

const { result, success, loginMethod } = await orchestrate({
  // Credentials are read from the environment; unset values become empty strings.
  credentials: {
    username: process.env.MY_USER ?? "",
    password: process.env.MY_PASS ?? "",
    email: process.env.MY_EMAIL ?? "",
    totpKey: process.env.MY_TOTP_KEY ?? "",
  },
  loginUrl: "https://x.com/login",
  // Substring expected in the URL after a successful login.
  successUrlContains: "/home",
  task: "Open settings and tell me which account email is configured.",
});

console.log({ success, loginMethod, result });

2. CDP-only task execution
Use runAgent(...) when:
- you are already logged in
- the task does not need auth
- you want direct control over browser lifecycle
import { CDPBrowser, runAgent } from "browser-autopilot";

// You own the browser lifecycle here: connect before the task, disconnect after.
const browser = new CDPBrowser();
await browser.connect();

const { result, success } = await runAgent({
  browser,
  task: "Go to wikipedia.org and find the population of Tokyo.",
});

console.log({ success, result });
await browser.disconnect();

3. Direct X11 control
Use X11Agent directly when you want a pure screenshot-and-actions loop on Linux/X11.
import { X11Agent } from "browser-autopilot";
import * as chrome from "browser-autopilot/x11/chrome";

// Launch Chrome on the login page with a named profile.
chrome.launch("https://example.com/login", "example-profile");

const agent = new X11Agent();
const result = await agent.runDetailed({
  systemPrompt: `
You are controlling a Chrome browser with local X11 mouse and keyboard actions.
Log in, then say ACTION: DONE Logged in successfully.
`,
  // The run counts as successful once the page URL contains "/dashboard".
  successCheck: () => chrome.pageUrlContains("/dashboard"),
});

console.log(result);

Supported Systems
| Environment | CDP mode | X11 mode | Notes |
|---|---|---|---|
| Linux desktop with X11 | Supported | Supported | Best native environment for the full stack |
| Linux VM/container with Xvfb | Supported | Supported | Recommended for cloud/self-hosted deployments |
| macOS | Supported | Not supported natively | Use CDP mode locally; use Docker/Linux for X11 fallback |
| Windows | Partial | Not supported natively | Chrome path detection exists, but the X11 stack does not |
| Serverless functions | Poor fit | Poor fit | Headful Chrome + long-lived sessions are usually the wrong shape |
Linux
Linux is the primary target for the full feature set.
Use Linux if you want:
- orchestrate(...) with reliable X11 fallback
- switch_to_x11
- Docker/Xvfb deployment
- noVNC live viewing
If your desktop runs Wayland, the X11 path may require XWayland or an X11 session. The X11 tools in this repo are not written against native Wayland APIs.
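Before relying on the X11 path, it can help to check the environment up front. The sketch below is one possible pre-flight check based on `process.platform` and the standard `DISPLAY` and `XDG_SESSION_TYPE` environment variables. `detectX11Support` is a hypothetical helper; the package may perform different checks or none at all.

```typescript
type X11Support = { available: boolean; note: string };

// Hypothetical pre-flight check; not part of the package's API.
function detectX11Support(
  platform: string,
  env: Record<string, string | undefined>,
): X11Support {
  if (platform !== "linux") {
    return { available: false, note: "X11 fallback is Linux-only; use Docker or a Linux VM" };
  }
  if (!env.DISPLAY) {
    return { available: false, note: "DISPLAY is not set; start an X server or Xvfb" };
  }
  if (env.XDG_SESSION_TYPE === "wayland") {
    // X11 tools can still work here, but only against XWayland windows.
    return { available: true, note: "Wayland session; X11 tools will go through XWayland" };
  }
  return { available: true, note: "X11 display available" };
}

console.log(detectX11Support(process.platform, process.env));
```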
macOS
macOS is a good fit for CDP-only use:
- local development
- authenticated flows that succeed without X11 fallback
- browser tasks driven through runAgent(...)
Native macOS is not a full X11 target for this package. src/x11/input.ts depends on Linux/X11 tools like xdotool, xclip, and ImageMagick window capture.
Practical guidance on macOS:
- Use runAgent(...) locally for CDP-first tasks.
- Use orchestrate(...) only if you are comfortable with the fact that X11 fallback is not available natively.
- If you need the full stack, run the Docker image or a Linux VM.
Windows
The Chrome launch helper knows common Windows Chrome paths, but the broader package is not a native Windows-first stack.
Practical guidance on Windows:
- treat local CDP usage as experimental
- do not expect native X11 fallback
- use a Linux VM or container for production use
Local Execution Model
This package is local or self-hosted by design.
What runs locally:
- Chrome
- CDP client
- X11 input execution
- screenshots
- file uploads and downloads
- browser profiles and cookies
What goes to the model provider:
- task text
- browser state text
- screenshots
- tool-call context and results
So if you use OpenAI, Anthropic, or another provider through AI SDK Gateway, the reasoning is remote but the browser execution stays on your machine or your own servers.
Cloud Deployment
The best cloud shape is a long-lived Linux container or VM with a headful browser.
Good fits:
- Docker on a VM
- Kubernetes workloads with persistent storage
- self-hosted Linux boxes
- isolated agent workers or TEEs that can run a browser session for minutes at a time
Poor fits:
- stateless serverless functions
- environments where headful Chrome cannot start
- platforms without persistent disk for browser profiles
Docker
The repo includes a Docker image that sets up:
- Google Chrome
- Xvfb
- openbox
- xdotool
- xclip
- ImageMagick
- optional noVNC viewer
Build:
docker build -f docker/Dockerfile -t browser-autopilot .
Run:
docker run --rm \
-e AI_GATEWAY_API_KEY=$AI_GATEWAY_API_KEY \
-e LOGIN_URL=https://x.com/login \
-e SUCCESS_URL=/home \
-e TWITTER_USER=myuser \
-e TWITTER_PASS=mypass \
-e TWITTER_EMAIL=... \
-e TWITTER_TOTP_KEY=ABCDEF123456 \
-e AGENT_TASK="Open settings and summarize what you find." \
-v browser-autopilot-data:/data \
browser-autopilot
Notes:
- The Docker image is linux/amd64 today.
- Persist /data if you want browser sessions and outputs to survive across runs.
- Set ENABLE_VIEWER=1 if you want the noVNC viewer for debugging.
Cloud Architecture Guidance
For production-ish deployments:
- prefer one browser session per worker
- persist the browser profile directory
- keep timezone, locale, and proxy geography aligned
- use a real Linux/X11 stack if you depend on X11 fallback
- treat the browser as stateful infrastructure, not a short-lived lambda
Browser Tools
The CDP agent exposes a broad tool surface, including:
- navigate
- click
- click_at
- input
- type_text
- send_keys
- scroll
- find_text
- switch_tab
- new_tab
- close_tab
- upload_file
- click_and_upload
- paste_content
- paste_image
- extract
- evaluate
- handle_dialog
- wait
- save_page_snapshot
- save_element_html
- shell
- solve_captcha
- inject_captcha_token
- solve_datadome
- done
The X11 agent supports local screen-level actions such as:
- CLICK x y
- DOUBLE_CLICK x y
- MOVE x y
- DRAG x1 y1 x2 y2
- SCROLL up|down amount
- TYPE text
- KEYPRESS key
- WAIT seconds
- SCREENSHOT
- DONE result
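The X11 action grammar above is simple enough to parse with a small function. The sketch below is one way to do it, assuming each action arrives on a single line and may carry an `ACTION:` prefix (as in the X11Agent prompt example). `X11Action` and `parseAction` are hypothetical names, not the package's parser.

```typescript
type X11Action =
  | { type: "CLICK" | "DOUBLE_CLICK" | "MOVE"; x: number; y: number }
  | { type: "DRAG"; x1: number; y1: number; x2: number; y2: number }
  | { type: "SCROLL"; direction: "up" | "down"; amount: number }
  | { type: "TYPE"; text: string }
  | { type: "KEYPRESS"; key: string }
  | { type: "WAIT"; seconds: number }
  | { type: "SCREENSHOT" }
  | { type: "DONE"; result: string };

// Returns null for anything that does not match the grammar.
function parseAction(line: string): X11Action | null {
  const body = line.trim().replace(/^ACTION:\s*/, "");
  const [verb = "", ...rest] = body.split(/\s+/);
  const tail = body.slice(verb.length).trim(); // free text, inner spacing preserved
  const nums = rest.map(Number);
  const allNumbers = (count: number) => rest.length === count && nums.every(Number.isFinite);

  switch (verb) {
    case "CLICK":
    case "DOUBLE_CLICK":
    case "MOVE":
      return allNumbers(2) ? { type: verb, x: nums[0], y: nums[1] } : null;
    case "DRAG":
      return allNumbers(4) ? { type: "DRAG", x1: nums[0], y1: nums[1], x2: nums[2], y2: nums[3] } : null;
    case "SCROLL": {
      const direction = rest[0];
      const valid = rest.length === 2 && (direction === "up" || direction === "down") && Number.isFinite(nums[1]);
      return valid ? { type: "SCROLL", direction: direction as "up" | "down", amount: nums[1] } : null;
    }
    case "TYPE":
      return tail ? { type: "TYPE", text: tail } : null;
    case "KEYPRESS":
      return rest.length === 1 ? { type: "KEYPRESS", key: rest[0] } : null;
    case "WAIT":
      return allNumbers(1) ? { type: "WAIT", seconds: nums[0] } : null;
    case "SCREENSHOT":
      return rest.length === 0 ? { type: "SCREENSHOT" } : null;
    case "DONE":
      return { type: "DONE", result: tail };
    default:
      return null;
  }
}

const doneAction = parseAction("ACTION: DONE Logged in successfully.");
console.log(doneAction);
```

Returning `null` for malformed lines lets the caller decide whether to re-prompt the model or skip the step.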
Sensitive Data and Custom Tools
The package supports sensitiveData masking so secrets already present in prompts can be redacted in the model-facing task text. For high-value credentials or payment details, prefer explicit AI SDK tools over dumping everything directly into the task.
That pattern looks like:
- pass non-sensitive workflow context in
task - expose just-in-time secrets through
extraTools - let the agent request them only when needed
Project Structure
src/
  agent/           Step-based CDP agent loop and tool definitions
  browser/         Raw CDP client and DOM indexing
  captcha/         CAPTCHA solving helpers
  viewer/          Optional live viewer for Xvfb environments
  x11/             X11 agent, local input primitives, Chrome launch helpers
  orchestrator.ts
  config.ts
  index.ts
docker/
  Dockerfile
  entrypoint.sh
tests/
docs/
Important Constraints
- Use headful Chrome, not headless Chrome.
- X11 fallback is a Linux/X11 feature, not a cross-platform abstraction.
- If you depend on stealthy login fallback, deploy on Linux.
- If you only need structured browser automation, CDP mode is the simpler path.
- The package currently chooses models through AI SDK Gateway, so configure AI_GATEWAY_API_KEY and AGENT_MODEL.
Environment Variables
| Variable | Required | Description |
|---|---|---|
| AI_GATEWAY_API_KEY | Yes | API key for AI SDK Gateway |
| AGENT_MODEL | No | Model name, defaults to claude-sonnet-4-6 |
| LOGIN_URL | CLI only | Login URL for the top-level entrypoint |
| SUCCESS_URL | CLI only | Post-login URL substring for the top-level entrypoint |
| TWITTER_USER | Optional | Username used by the default CLI entrypoint |
| TWITTER_PASS | Optional | Password used by the default CLI entrypoint |
| TWITTER_EMAIL | Optional | Email used by the default CLI entrypoint |
| TWITTER_TOTP_KEY | Optional | TOTP seed used by the default CLI entrypoint |
| PROXY_HOST | No | SOCKS5 proxy host |
| PROXY_PORT | No | SOCKS5 proxy port |
| PROXY_USER | No | Proxy username |
| PROXY_PASS | No | Proxy password |
| CAPSOLVER_KEY | No | Capsolver API key |
| TWOCAPTCHA_KEY | No | 2Captcha API key |
| CDP_PORT | No | Chrome remote debugging port, defaults to 9222 |
| PROFILE_DIR | No | Browser profile directory |
| DATA_DIR | No | Data/output directory |
| AGENT_TASK | No | Task used by the top-level CLI entrypoint |
| MAX_STEPS | No | Max agent steps, defaults to 80 |
| ENABLE_VIEWER | No | Enable the noVNC viewer in Docker |
| CHROME_PATH | No | Override Chrome binary path |
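The defaults in the table can be read as a small config loader. The sketch below shows one way to apply them, using only variables and defaults listed above. `AgentConfig` and `loadConfig` are hypothetical names, and treating `ENABLE_VIEWER=1` as the only enabling value is an assumption; the package's own config.ts may behave differently.

```typescript
type AgentConfig = {
  apiKey: string;
  model: string;
  cdpPort: number;
  maxSteps: number;
  enableViewer: boolean;
};

// Hypothetical loader that applies the documented defaults.
function loadConfig(env: Record<string, string | undefined>): AgentConfig {
  const apiKey = env.AI_GATEWAY_API_KEY;
  if (!apiKey) throw new Error("AI_GATEWAY_API_KEY is required");

  // Parse an integer variable, falling back when it is unset or not a number.
  const intOr = (raw: string | undefined, fallback: number): number => {
    const parsed = Number.parseInt(raw ?? "", 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    apiKey,
    model: env.AGENT_MODEL || "claude-sonnet-4-6",
    cdpPort: intOr(env.CDP_PORT, 9222),
    maxSteps: intOr(env.MAX_STEPS, 80),
    enableViewer: env.ENABLE_VIEWER === "1", // assumption: only "1" enables the viewer
  };
}

const config = loadConfig({ AI_GATEWAY_API_KEY: "test-key", MAX_STEPS: "40" });
console.log(config);
```

In an application you would call `loadConfig(process.env)` once at startup so that a missing API key fails fast.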
Development
npm install
npm test
npm run build
Architecture notes live in docs/architecture.md.
