browser-autopilot

v0.5.7

Autonomous browser agent — LLM-driven CDP automation with X11 stealth fallback, CAPTCHA solving, and extensible skills

Autonomous browser automation for real Chrome, built for local or self-hosted execution.

It uses:

  • raw Chrome DevTools Protocol for fast, structured browser control
  • local X11 mouse and keyboard events when the task needs literal screen-level interaction
  • the Vercel AI SDK for multi-step model-driven reasoning and tool use

The browser always runs on your machine or your infrastructure. The package sends screenshots and state to your model provider, but the browser itself never runs on the provider's machines.

What It Is

browser-autopilot is a TypeScript package for browser agents that need more than a scraper and less than a full remote browser service.

It is designed for:

  • authenticated browser workflows
  • long multi-step tasks
  • sites where DOM-first automation is useful most of the time
  • cases where you still want a local fallback for literal mouse and keyboard control

It is not built on Playwright or Puppeteer. The package talks to Chrome over raw CDP and can fall back to X11-level input on Linux when needed.
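
As a rough illustration of what "raw CDP" means at the wire level: each command sent to Chrome is a JSON object with an `id`, a `method`, and optional `params`, and Chrome replies with a message carrying the same `id`. The sketch below only builds and matches those messages. `CdpCommandTracker` is a hypothetical helper written for this example and is not part of browser-autopilot; `Page.navigate` and `Runtime.evaluate` are standard protocol methods.

```typescript
// JSON framing of a CDP command and its response.
interface CdpCommand {
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

interface CdpResponse {
  id: number;
  result?: Record<string, unknown>;
  error?: { code: number; message: string };
}

// Hypothetical helper: hands out increasing ids and remembers which
// method each id belongs to, so replies can be matched to commands.
class CdpCommandTracker {
  private nextId = 1;
  private pending = new Map<number, string>();

  // Build the JSON string that would be written to the DevTools WebSocket.
  encode(method: string, params?: Record<string, unknown>): string {
    const command: CdpCommand = { id: this.nextId++, method, params };
    this.pending.set(command.id, method);
    return JSON.stringify(command);
  }

  // Match an incoming message to its command. Returns undefined for
  // unknown ids and for events, which carry no id at all.
  resolve(raw: string): { method: string; response: CdpResponse } | undefined {
    const response = JSON.parse(raw) as CdpResponse;
    const method = this.pending.get(response.id);
    if (method === undefined) return undefined;
    this.pending.delete(response.id);
    return { method, response };
  }
}

const tracker = new CdpCommandTracker();
const navigateMessage = tracker.encode("Page.navigate", { url: "https://example.com" });
const evaluateMessage = tracker.encode("Runtime.evaluate", {
  expression: "document.title",
  returnByValue: true,
});
```

In a real client these strings would be sent over the WebSocket URL that Chrome exposes on its remote debugging port.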

Core Model

There are two execution modes:

  • CDP mode: Fast, structured control through Chrome DevTools Protocol. The model sees a screenshot plus an indexed DOM-like view and can call browser tools like click, input, navigate, extract, evaluate, upload_file, scroll, and done.

  • X11 mode: Local OS-level control of the actual browser window through the X Window System. The model works from screenshots and issues actions like CLICK, DOUBLE_CLICK, MOVE, DRAG, SCROLL, TYPE, KEYPRESS, WAIT, and DONE.

orchestrate(...) combines them:

orchestrate({ credentials, loginUrl, successUrlContains, task })
  │
  ├─ Cached session? ───────────────→ run task in CDP mode
  ├─ CDP login works? ──────────────→ run task in CDP mode
  └─ CDP looks blocked or unusable? → relaunch and continue through X11

The CDP agent can also explicitly request a handoff by calling switch_to_x11.
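
The decision flow in the diagram above can be sketched as a small pure function. This is a simplified model of the three branches, not the package's actual orchestrator code; `LoginSignals`, `ExecutionPath`, and `chooseExecutionPath` are names invented for the example.

```typescript
// Hypothetical labels for the three branches in the orchestrate(...) diagram.
type ExecutionPath = "cdp-cached-session" | "cdp-login" | "x11-fallback";

// Hypothetical summary of what the orchestrator has observed so far.
interface LoginSignals {
  hasCachedSession: boolean; // a stored session is still valid
  cdpLoginSucceeded: boolean; // the quick CDP login attempt reached the success URL
}

// Mirror the diagram top to bottom: the first matching branch wins.
function chooseExecutionPath(signals: LoginSignals): ExecutionPath {
  if (signals.hasCachedSession) return "cdp-cached-session";
  if (signals.cdpLoginSucceeded) return "cdp-login";
  // CDP looks blocked or unusable: relaunch and continue through X11.
  return "x11-fallback";
}

const cachedPath = chooseExecutionPath({ hasCachedSession: true, cdpLoginSucceeded: false });
const blockedPath = chooseExecutionPath({ hasCachedSession: false, cdpLoginSucceeded: false });
```

The explicit switch_to_x11 handoff is a separate trigger that the agent raises mid-task, so it is left out of this login-time sketch.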

What X11 Means

In this package, X11 means local OS-level browser control using the machine's real windowing and input stack, not DOM or CDP APIs.

That is implemented with Linux/X11 utilities such as:

  • xdotool for mouse and keyboard events
  • xclip for real clipboard paste
  • ImageMagick import for screenshots
  • Xvfb and openbox for headful containerized environments

This is useful when the page is hostile to DOM automation, when the DOM is not trustworthy enough, or when the task truly needs screen-level interaction.
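
To make the xdotool layer more concrete, here is a minimal sketch of how screen-level actions could be translated into xdotool command-line arguments. It only builds the argument arrays and does not run anything. `ScreenAction` and `toXdotoolArgs` are hypothetical names for this example; how src/x11/input.ts actually invokes xdotool is not shown in this README.

```typescript
// Hypothetical action shapes for this example only.
type ScreenAction =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "key"; key: string };

// Translate one action into the argument list for the xdotool binary.
function toXdotoolArgs(action: ScreenAction): string[] {
  switch (action.kind) {
    case "click":
      // Move the pointer, then press the left button (button 1).
      return ["mousemove", String(action.x), String(action.y), "click", "1"];
    case "type":
      // "--" ends option parsing so text starting with "-" is typed literally.
      return ["type", "--", action.text];
    case "key":
      // Key names follow X keysym naming, e.g. "Return" or "ctrl+l".
      return ["key", action.key];
  }
}

const clickArgs = toXdotoolArgs({ kind: "click", x: 640, y: 360 });
const typeArgs = toXdotoolArgs({ kind: "type", text: "-hello" });
```

On a Linux/X11 host, an array like this could be passed to `execFileSync("xdotool", args)` from `node:child_process`, which avoids shell quoting issues.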

AI SDK Integration

browser-autopilot already uses the Vercel AI SDK internally:

  • generateText(...) drives the step loop
  • gateway(model) resolves the model
  • browser tools and your custom tools use standard AI SDK tool(...) definitions

That means you can extend the agent with normal AI SDK tools and pass them into runAgent(...) or orchestrate(...).

import { tool } from "ai";
import { z } from "zod";
import { CDPBrowser, runAgent } from "browser-autopilot";

const browser = new CDPBrowser();
await browser.connect();

const result = await runAgent({
  browser,
  task: "Log in and fetch the latest invoice PDF",
  extraTools: {
    get_2fa_code: tool({
      description: "Return the latest 2FA code",
      inputSchema: z.object({}),
      execute: async () => "123456",
    }),
  },
});

Today the package is wired to the AI SDK gateway path, so you configure the model by name and provide AI_GATEWAY_API_KEY.
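
The get_2fa_code tool in the example above returns a hard-coded string. One way such a tool might produce a real code is to derive it from a TOTP seed using the standard RFC 6238 algorithm (HMAC-SHA1, 30-second steps). The sketch below uses only Node's built-in crypto module; `base32Decode` and `totp` are hypothetical helpers written for this example, and nothing here describes how browser-autopilot handles `totpKey` internally.

```typescript
import { createHmac } from "node:crypto";

const BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

// Decode an RFC 4648 base32 string (the usual encoding for TOTP seeds).
function base32Decode(input: string): Buffer {
  const clean = input.toUpperCase().replace(/[\s=]/g, "");
  const bytes: number[] = [];
  let buffer = 0; // bits not yet emitted as a full byte
  let bitCount = 0;
  for (let i = 0; i < clean.length; i++) {
    const index = BASE32_ALPHABET.indexOf(clean.charAt(i));
    if (index === -1) throw new Error(`Invalid base32 character at position ${i}`);
    buffer = (buffer << 5) | index;
    bitCount += 5;
    if (bitCount >= 8) {
      bitCount -= 8;
      bytes.push((buffer >>> bitCount) & 0xff);
      buffer &= (1 << bitCount) - 1; // keep only the leftover bits
    }
  }
  return Buffer.from(bytes);
}

// RFC 6238 TOTP with HMAC-SHA1 and RFC 4226 dynamic truncation.
function totp(seedBase32: string, unixSeconds: number, digits = 6, stepSeconds = 30): string {
  const counter = Math.floor(unixSeconds / stepSeconds);
  // The counter is hashed as an 8-byte big-endian integer.
  const message = Buffer.alloc(8);
  message.writeUInt32BE(Math.floor(counter / 0x100000000), 0);
  message.writeUInt32BE(counter >>> 0, 4);
  const digest = createHmac("sha1", base32Decode(seedBase32)).update(message).digest();
  // The low nibble of the last byte selects a 4-byte window in the digest.
  const offset = digest.readUInt8(digest.length - 1) & 0x0f;
  const truncated = digest.readUInt32BE(offset) & 0x7fffffff;
  return String(truncated % Math.pow(10, digits)).padStart(digits, "0");
}
```

Inside a tool's `execute` function this could be called as `totp(seed, Date.now() / 1000)`, so the seed stays in your process and only the short-lived code reaches the model.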

Installation

npm install browser-autopilot ai zod

Environment:

export AI_GATEWAY_API_KEY=...
export AGENT_MODEL=claude-sonnet-4-6

For local Chrome/CDP use, you also need a Chrome or Chromium binary available.

For X11 mode on Linux, you need:

  • xdotool
  • xclip
  • ImageMagick import
  • an X11 display or an Xvfb-based container/VM

Quick Start

1. Login plus task orchestration

Use orchestrate(...) when you want the package to handle:

  • cached-session reuse
  • a quick CDP login attempt
  • fallback to X11 when needed
  • running the actual task afterward

import { orchestrate } from "browser-autopilot";

const { result, success, loginMethod } = await orchestrate({
  credentials: {
    username: process.env.MY_USER ?? "",
    password: process.env.MY_PASS ?? "",
    email: process.env.MY_EMAIL ?? "",
    totpKey: process.env.MY_TOTP_KEY ?? "",
  },
  loginUrl: "https://x.com/login",
  successUrlContains: "/home",
  task: "Open settings and tell me which account email is configured.",
});

console.log({ success, loginMethod, result });

2. CDP-only task execution

Use runAgent(...) when:

  • you are already logged in
  • the task does not need auth
  • you want direct control over browser lifecycle

import { CDPBrowser, runAgent } from "browser-autopilot";

const browser = new CDPBrowser();
await browser.connect();

const { result, success } = await runAgent({
  browser,
  task: "Go to wikipedia.org and find the population of Tokyo.",
});

console.log({ success, result });
await browser.disconnect();

3. Direct X11 control

Use X11Agent directly when you want a pure screenshot-and-actions loop on Linux/X11.

import { X11Agent } from "browser-autopilot";
import * as chrome from "browser-autopilot/x11/chrome";

chrome.launch("https://example.com/login", "example-profile");

const agent = new X11Agent();
const result = await agent.runDetailed({
  systemPrompt: `
You are controlling a Chrome browser with local X11 mouse and keyboard actions.
Log in, then say ACTION: DONE Logged in successfully.
`,
  successCheck: () => chrome.pageUrlContains("/dashboard"),
});

console.log(result);

Supported Systems

| Environment | CDP mode | X11 mode | Notes |
|---|---|---|---|
| Linux desktop with X11 | Supported | Supported | Best native environment for the full stack |
| Linux VM/container with Xvfb | Supported | Supported | Recommended for cloud/self-hosted deployments |
| macOS | Supported | Not supported natively | Use CDP mode locally; use Docker/Linux for X11 fallback |
| Windows | Partial | Not supported natively | Chrome path detection exists, but the X11 stack does not |
| Serverless functions | Poor fit | Poor fit | Headful Chrome + long-lived sessions are usually the wrong shape |

Linux

Linux is the primary target for the full feature set.

Use Linux if you want:

  • orchestrate(...) with reliable X11 fallback
  • switch_to_x11
  • Docker/Xvfb deployment
  • noVNC live viewing

If your desktop runs Wayland, the X11 path may require XWayland or an X11 session. The X11 tools in this repo are not written against native Wayland APIs.

macOS

macOS is a good fit for CDP-only use:

  • local development
  • authenticated flows that succeed without X11 fallback
  • browser tasks driven through runAgent(...)

Native macOS is not a full X11 target for this package. src/x11/input.ts depends on Linux/X11 tools like xdotool, xclip, and ImageMagick window capture.

Practical guidance on macOS:

  • Use runAgent(...) locally for CDP-first tasks.
  • Use orchestrate(...) only if you can accept that X11 fallback is not available natively.
  • If you need the full stack, run the Docker image or a Linux VM.

Windows

The Chrome launch helper knows common Windows Chrome paths, but the package as a whole does not target Windows natively.

Practical guidance on Windows:

  • treat local CDP usage as experimental
  • do not expect native X11 fallback
  • use a Linux VM or container for production use

Local Execution Model

This package is local or self-hosted by design.

What runs locally:

  • Chrome
  • CDP client
  • X11 input execution
  • screenshots
  • file uploads and downloads
  • browser profiles and cookies

What goes to the model provider:

  • task text
  • browser state text
  • screenshots
  • tool-call context and results

So if you use OpenAI, Anthropic, or another provider through AI SDK Gateway, the reasoning is remote but the browser execution stays on your machine or your own servers.

Cloud Deployment

The best cloud shape is a long-lived Linux container or VM with a headful browser.

Good fits:

  • Docker on a VM
  • Kubernetes workloads with persistent storage
  • self-hosted Linux boxes
  • isolated agent workers or TEEs that can run a browser session for minutes at a time

Poor fits:

  • stateless serverless functions
  • environments where headful Chrome cannot start
  • platforms without persistent disk for browser profiles

Docker

The repo includes a Docker image that sets up:

  • Google Chrome
  • Xvfb
  • openbox
  • xdotool
  • xclip
  • ImageMagick
  • optional noVNC viewer

Build:

docker build -f docker/Dockerfile -t browser-autopilot .

Run:

docker run --rm \
  -e AI_GATEWAY_API_KEY=$AI_GATEWAY_API_KEY \
  -e LOGIN_URL=https://x.com/login \
  -e SUCCESS_URL=/home \
  -e TWITTER_USER=myuser \
  -e TWITTER_PASS=mypass \
  -e TWITTER_EMAIL=user@example.com \
  -e TWITTER_TOTP_KEY=ABCDEF123456 \
  -e AGENT_TASK="Open settings and summarize what you find." \
  -v browser-autopilot-data:/data \
  browser-autopilot

Notes:

  • The Docker image is linux/amd64 today.
  • Persist /data if you want browser sessions and outputs to survive across runs.
  • Set ENABLE_VIEWER=1 if you want the noVNC viewer for debugging.

Cloud Architecture Guidance

For production-ish deployments:

  • prefer one browser session per worker
  • persist the browser profile directory
  • keep timezone, locale, and proxy geography aligned
  • use a real Linux/X11 stack if you depend on X11 fallback
  • treat the browser as stateful infrastructure, not a short-lived lambda

Browser Tools

The CDP agent exposes a broad tool surface, including:

  • navigate
  • click
  • click_at
  • input
  • type_text
  • send_keys
  • scroll
  • find_text
  • switch_tab
  • new_tab
  • close_tab
  • upload_file
  • click_and_upload
  • paste_content
  • paste_image
  • extract
  • evaluate
  • handle_dialog
  • wait
  • save_page_snapshot
  • save_element_html
  • shell
  • solve_captcha
  • inject_captcha_token
  • solve_datadome
  • done

The X11 agent supports local screen-level actions such as:

  • CLICK x y
  • DOUBLE_CLICK x y
  • MOVE x y
  • DRAG x1 y1 x2 y2
  • SCROLL up|down amount
  • TYPE text
  • KEYPRESS key
  • WAIT seconds
  • SCREENSHOT
  • DONE result
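
As an illustration of the X11 action grammar listed above, the following sketch parses one action line into a typed object. It assumes the optional `ACTION:` prefix seen in the Quick Start prompt and rejects lines with the wrong number of arguments. `X11Action` and `parseX11Action` are hypothetical names; the package's real parser may differ.

```typescript
// Hypothetical typed form of the actions listed above.
type X11Action =
  | { type: "CLICK" | "DOUBLE_CLICK" | "MOVE"; x: number; y: number }
  | { type: "DRAG"; x1: number; y1: number; x2: number; y2: number }
  | { type: "SCROLL"; direction: "up" | "down"; amount: number }
  | { type: "TYPE"; text: string }
  | { type: "KEYPRESS"; key: string }
  | { type: "WAIT"; seconds: number }
  | { type: "SCREENSHOT" }
  | { type: "DONE"; result: string };

function parseX11Action(line: string): X11Action | null {
  // Strip the optional "ACTION:" prefix, then split name from arguments.
  const body = line.trim().replace(/^ACTION:\s*/, "");
  const match = /^([A-Z_]+)(?:\s+([\s\S]*))?$/.exec(body);
  if (!match) return null;
  const name = match[1] ?? "";
  const rest = match[2] ?? "";
  const parts = rest.length > 0 ? rest.split(/\s+/) : [];
  const [first = "", second = ""] = parts;
  const [a = NaN, b = NaN, c = NaN, d = NaN] = parts.map(Number);
  const finite = (...values: number[]) => values.every((v) => Number.isFinite(v));

  switch (name) {
    case "CLICK":
    case "DOUBLE_CLICK":
    case "MOVE":
      if (parts.length !== 2 || !finite(a, b)) return null;
      return { type: name as "CLICK" | "DOUBLE_CLICK" | "MOVE", x: a, y: b };
    case "DRAG":
      if (parts.length !== 4 || !finite(a, b, c, d)) return null;
      return { type: "DRAG", x1: a, y1: b, x2: c, y2: d };
    case "SCROLL":
      if (parts.length !== 2 || (first !== "up" && first !== "down")) return null;
      if (!finite(Number(second))) return null;
      return { type: "SCROLL", direction: first, amount: Number(second) };
    case "TYPE":
      // Keep the text verbatim, including inner spaces.
      return rest.length > 0 ? { type: "TYPE", text: rest } : null;
    case "KEYPRESS":
      return parts.length === 1 ? { type: "KEYPRESS", key: first } : null;
    case "WAIT":
      return parts.length === 1 && finite(a) ? { type: "WAIT", seconds: a } : null;
    case "SCREENSHOT":
      return parts.length === 0 ? { type: "SCREENSHOT" } : null;
    case "DONE":
      return { type: "DONE", result: rest };
    default:
      return null;
  }
}
```

Returning `null` for malformed lines lets a caller ask the model to retry instead of executing a half-parsed action.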

Sensitive Data and Custom Tools

The package supports sensitiveData masking so secrets already present in prompts can be redacted in the model-facing task text. For high-value credentials or payment details, prefer explicit AI SDK tools over dumping everything directly into the task.

That pattern looks like:

  • pass non-sensitive workflow context in task
  • expose just-in-time secrets through extraTools
  • let the agent request them only when needed
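
The masking idea can be sketched as a plain string substitution applied before text is sent to the model. This is an assumption about the general technique only: the README does not document the actual shape of the sensitiveData option, and `maskSensitive` and the `<secret:name>` placeholder format are invented for this example.

```typescript
// Hypothetical helper: replace each secret value in the text with a
// named placeholder so the model sees the label, not the value.
function maskSensitive(text: string, sensitiveData: Record<string, string>): string {
  // Replace longer secrets first so a secret that contains a shorter
  // one is masked as a whole. Empty values are skipped.
  const entries = Object.entries(sensitiveData)
    .filter(([, value]) => value.length > 0)
    .sort((left, right) => right[1].length - left[1].length);

  let masked = text;
  for (const [name, value] of entries) {
    // split/join replaces every occurrence without regex escaping.
    masked = masked.split(value).join(`<secret:${name}>`);
  }
  return masked;
}

const maskedTask = maskSensitive("Log in as alice with password hunter2, then confirm hunter2 again.", {
  password: "hunter2",
});
```

Masking only hides values that already appear in the text, which is why fetching secrets on demand through extraTools is the stronger pattern for high-value credentials.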

Project Structure

src/
  agent/       Step-based CDP agent loop and tool definitions
  browser/     Raw CDP client and DOM indexing
  captcha/     CAPTCHA solving helpers
  viewer/      Optional live viewer for Xvfb environments
  x11/         X11 agent, local input primitives, Chrome launch helpers
  orchestrator.ts
  config.ts
  index.ts
docker/
  Dockerfile
  entrypoint.sh
tests/
docs/

Important Constraints

  • Use headful Chrome, not headless Chrome.
  • X11 fallback is a Linux/X11 feature, not a cross-platform abstraction.
  • If you depend on stealthy login fallback, deploy on Linux.
  • If you only need structured browser automation, CDP mode is the simpler path.
  • The package currently chooses models through AI SDK Gateway, so configure AI_GATEWAY_API_KEY and AGENT_MODEL.

Environment Variables

| Variable | Required | Description |
|---|---|---|
| AI_GATEWAY_API_KEY | Yes | API key for AI SDK Gateway |
| AGENT_MODEL | No | Model name, defaults to claude-sonnet-4-6 |
| LOGIN_URL | CLI only | Login URL for the top-level entrypoint |
| SUCCESS_URL | CLI only | Post-login URL substring for the top-level entrypoint |
| TWITTER_USER | Optional | Username used by the default CLI entrypoint |
| TWITTER_PASS | Optional | Password used by the default CLI entrypoint |
| TWITTER_EMAIL | Optional | Email used by the default CLI entrypoint |
| TWITTER_TOTP_KEY | Optional | TOTP seed used by the default CLI entrypoint |
| PROXY_HOST | No | SOCKS5 proxy host |
| PROXY_PORT | No | SOCKS5 proxy port |
| PROXY_USER | No | Proxy username |
| PROXY_PASS | No | Proxy password |
| CAPSOLVER_KEY | No | Capsolver API key |
| TWOCAPTCHA_KEY | No | 2Captcha API key |
| CDP_PORT | No | Chrome remote debugging port, defaults to 9222 |
| PROFILE_DIR | No | Browser profile directory |
| DATA_DIR | No | Data/output directory |
| AGENT_TASK | No | Task used by the top-level CLI entrypoint |
| MAX_STEPS | No | Max agent steps, defaults to 80 |
| ENABLE_VIEWER | No | Enable the noVNC viewer in Docker |
| CHROME_PATH | No | Override Chrome binary path |
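
For readers wiring these variables into their own scripts, here is a minimal sketch of reading a few of them with the defaults stated in the table. It is not the package's config.ts; `AgentConfig`, `parsePositiveInt`, and `loadConfig` are hypothetical names, and only a subset of the variables is covered.

```typescript
// Hypothetical config shape covering a subset of the documented variables.
interface AgentConfig {
  apiKey: string;
  model: string;
  cdpPort: number;
  maxSteps: number;
  enableViewer: boolean;
}

// Parse an optional positive integer, falling back to a default when unset.
function parsePositiveInt(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Build the config from an env-like record; defaults follow the table above.
function loadConfig(env: Record<string, string | undefined>): AgentConfig {
  const apiKey = env["AI_GATEWAY_API_KEY"];
  if (!apiKey) throw new Error("AI_GATEWAY_API_KEY is required");
  return {
    apiKey,
    model: env["AGENT_MODEL"] || "claude-sonnet-4-6",
    cdpPort: parsePositiveInt(env["CDP_PORT"], 9222, "CDP_PORT"),
    maxSteps: parsePositiveInt(env["MAX_STEPS"], 80, "MAX_STEPS"),
    enableViewer: env["ENABLE_VIEWER"] === "1",
  };
}

const defaultConfig = loadConfig({ AI_GATEWAY_API_KEY: "example-key" });
```

In a Node process the same function could be called as `loadConfig(process.env)`; taking the record as a parameter keeps it easy to test.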

Development

npm install
npm test
npm run build

Architecture notes live in docs/architecture.md.