# betterbrowse

v0.7.0

Zero-dependency browser automation via Chrome DevTools Protocol with ARIA accessibility snapshots — 10-100x cheaper than vision-based approaches.
## Why?
Most browser automation agents use screenshots + vision models. That's expensive and slow. betterbrowse uses ARIA accessibility snapshots instead — a text representation of the page that any LLM can understand. This means:
- 10-100x cheaper — text tokens vs image tokens
- Works with any text model — no vision model required
- Faster — no image encoding/decoding overhead
- More reliable — structured data vs pixel interpretation
- Video recording — record browser sessions as MP4 (via CDP screencast + ffmpeg)
## Install

Project (library):

```sh
npm install @mylesiyabor/betterbrowse
```

Global (CLI — easy for agents):

```sh
npm install -g @mylesiyabor/betterbrowse
```

Then use the CLI from any terminal or agent (see below).

Requires Node.js >= 20.10.0 and Chrome/Chromium installed locally.
## CLI (easy for agents)

The simplest way for agents to use betterbrowse is the CLI. Install globally, then:

Snapshot only (no API key) — get the ARIA snapshot of a page on stdout:

```sh
betterbrowse https://example.com
```

Agent mode (uses OpenAI; set `OPENAI_API_KEY`) — complete a task and print the result to stdout:

```sh
betterbrowse https://news.ycombinator.com "What is the top story title?"
betterbrowse https://example.com "Click the first link" --no-headless
```

| Option | Description |
|--------|-------------|
| `betterbrowse <url>` | Print the ARIA snapshot of the page |
| `betterbrowse <url> "<task>"` | Run the browser agent; result to stdout |
| `betterbrowse search "<query>"` | Search the web (multi-provider, free) |
| `--model <name>` | OpenAI model (default: `gpt-4o-mini`) |
| `--no-headless` | Show the browser window |
| `--record` | Record the session as video (MP4 if ffmpeg is installed) |
| `--record-dir <dir>` | Directory for recording output (default: cwd or temp) |
| `--json` | Output search results as JSON |
| `--deep` | Visit top results and extract page content |
| `--max <n>` | Max search results (default: 5) |
| `-v, --version` | Print version |
| `-h, --help` | Show help |
Agents can capture stdout for the snapshot or the task result. No extra dependencies — agent mode calls the OpenAI API with `fetch`.

Video recording: Use `--record` (and optionally `--record-dir ./out`). The browser session is captured via CDP screencast; if ffmpeg is installed, frames are stitched into `recording.mp4`. The output path is printed to stderr so stdout stays clean for the result.
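As a sketch of the stdout-capture pattern, an agent script can shell out to the CLI with Node's `child_process`. The `betterbrowseArgs` helper below is hypothetical glue, not part of the package; it assumes a global install so `betterbrowse` is on the PATH.

```js
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

// Build argv for a snapshot (url only) or agent-mode (url + task) call.
function betterbrowseArgs(url, task, opts = {}) {
  const args = [url];
  if (task) args.push(task);              // agent mode: requires OPENAI_API_KEY
  if (opts.record) args.push('--record');
  if (opts.recordDir) args.push('--record-dir', opts.recordDir);
  return args;
}

// stdout carries the snapshot (or task result); stderr carries logs
// and the recording path, so the two streams can be handled separately.
async function snapshot(url) {
  const { stdout } = await run('betterbrowse', betterbrowseArgs(url));
  return stdout;
}
```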
## Quick Start (library)

### Browser Class (Tool Harness)

```js
import { Browser } from '@mylesiyabor/betterbrowse';

const browser = new Browser({ headless: true });
await browser.launch();
await browser.navigate('https://example.com');

// Get ARIA snapshot — structured text representation of the page
const snapshot = await browser.getSnapshot();
console.log(snapshot);
// - heading "Example Domain" [ref=e1]
// - text "This domain is for use in illustrative examples..."
// - link "More information..." [ref=e2]

// Interact using refs from the snapshot
await browser.clickRef('e2');

// Take a screenshot
const png = await browser.screenshot(); // base64

await browser.close();
```

### Agent (LLM-Driven Loop)
```js
import { browseWeb } from '@mylesiyabor/betterbrowse';

const result = await browseWeb('https://news.ycombinator.com', 'Find the top story title', {
  chat: async (messages, { tools, maxTokens }) => {
    // Wire up your LLM here — OpenAI, Anthropic, Google, etc.
    const response = await yourLLM.chat(messages, { tools, maxTokens });
    return {
      content: response.text,
      toolCalls: response.toolCalls, // [{ name, arguments, id }]
      usage: { input: response.inputTokens, output: response.outputTokens },
    };
  },
});

console.log(result.result); // "The top story is: ..."
console.log(result.usage);  // { inputTokens, outputTokens, modelCalls }
console.log(result.steps);  // [{ step, action, ref, text, result }, ...]
```

## API

### Browser

```ts
new Browser({ headless?: boolean, useProfile?: boolean, port?: number })
```

Extends EventEmitter. Events: `launch`, `navigate`, `action`, `snapshot`, `close`, `error`.
| Method | Description |
|---|---|
| `launch()` | Start Chrome and connect via CDP |
| `navigate(url)` | Navigate to a URL |
| `getSnapshot()` | Get an optimized ARIA snapshot |
| `getRawSnapshot()` | Get the raw snapshot + refMap |
| `clickRef(ref)` | Click an element by ref (e.g. `"e5"`) |
| `fillRef(ref, text)` | Type into an input by ref |
| `hover(ref)` | Mouse hover by ref |
| `selectOption(ref, value)` | Select a dropdown option by ref |
| `waitForSelector(selector, timeout?)` | Wait for a CSS selector |
| `screenshot()` | Capture a PNG (base64) |
| `extractText()` | Get all visible text |
| `evaluate(expr)` | Run JS in the page |
| `close()` | Close the browser |
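Since `Browser` extends EventEmitter, agent harnesses can observe activity with the standard Node event API. The sketch below uses a plain `EventEmitter` as a stand-in so it runs without Chrome; the event payload shapes shown are illustrative assumptions, only the event names come from the list above.

```js
import { EventEmitter } from 'node:events';

// Stand-in for `new Browser(...)` — same subscription pattern, no Chrome.
const browser = new EventEmitter();

const log = [];
browser.on('navigate', (url) => log.push(`navigate ${url}`));
browser.on('action', (a) => log.push(`action ${a}`));
browser.on('error', (err) => log.push(`error ${err.message}`));

// In real use, betterbrowse emits these as the agent drives the page.
browser.emit('navigate', 'https://example.com');
browser.emit('action', 'clickRef e2');
```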
### browseWeb(url, task, opts)

LLM-driven browser agent. Returns `{ result, usage, steps, recording }`.

Required option: `chat` — an async function matching:

```ts
(messages, { tools, maxTokens }) => Promise<{ content, toolCalls?, usage? }>
```

Optional: `record: true` — record the session; `recordDir: string` — output directory. When recording, the returned object includes `recording: { video, frameDir, frameCount, frames }` (MP4 path in `video` if ffmpeg is installed).
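As a sketch, the `chat` contract can be wired to OpenAI's chat completions API with `fetch`. The endpoint payload and the `toChatResult` mapper below are assumptions based on OpenAI's public API, not part of betterbrowse; only the `{ content, toolCalls, usage }` return shape comes from the contract above.

```js
// Map an OpenAI chat completions response onto the shape browseWeb expects:
// { content, toolCalls: [{ name, arguments, id }], usage: { input, output } }.
function toChatResult(completion) {
  const msg = completion.choices[0].message;
  return {
    content: msg.content ?? '',
    toolCalls: (msg.tool_calls ?? []).map((c) => ({
      id: c.id,
      name: c.function.name,
      arguments: JSON.parse(c.function.arguments),
    })),
    usage: {
      input: completion.usage?.prompt_tokens ?? 0,
      output: completion.usage?.completion_tokens ?? 0,
    },
  };
}

async function chat(messages, { tools, maxTokens }) {
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: 'gpt-4o-mini', messages, tools, max_tokens: maxTokens }),
  });
  return toChatResult(await res.json());
}
```

Pass `chat` into `browseWeb()` exactly as in the Quick Start.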
### Snapshot Utilities

```js
import { optimizeAll, computeDiff, analyzeWaste } from '@mylesiyabor/betterbrowse';

// Optimize a raw ARIA snapshot
const optimized = optimizeAll(rawSnapshot, { maxItems: 10 });

// Compute diff between two snapshots
const diff = computeDiff(prevSnapshot, currSnapshot, prevUrl, currUrl);

// Analyze snapshot waste
const report = analyzeWaste(rawSnapshot);
```

## How ARIA Snapshots Work

Instead of screenshots, we fetch the browser's accessibility tree via CDP and convert it to a compact text format:
```
- heading "Search Results" [ref=e1]
- textbox "Search query" [ref=e2]
- button "Search" [ref=e3]
- list
  - listitem
    - link "First Result" [ref=e4]
  - listitem
    - link "Second Result" [ref=e5]
```

Interactive elements get `[ref=eXX]` tags. The agent uses these refs to click, fill, hover, and select — no pixel coordinates needed.
The snapshot optimizer pipeline strips chrome (headers/footers), deduplicates links, compresses long names, and truncates lists — reducing token count by 60-90%.
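The kind of reductions involved can be illustrated with a toy pass over snapshot lines. `shrinkSnapshot` below is a hypothetical simplification for illustration, not the package's actual optimizer pipeline.

```js
// Toy pass: drop duplicate links and truncate long runs of listitems.
function shrinkSnapshot(lines, { maxItems = 10 } = {}) {
  const seenLinks = new Set();
  const out = [];
  let listItems = 0;
  for (const line of lines) {
    const t = line.trim();
    if (t.startsWith('- link')) {
      if (seenLinks.has(t)) continue;     // dedupe repeated links
      seenLinks.add(t);
    }
    if (t.startsWith('- listitem')) {
      listItems += 1;
      if (listItems > maxItems) continue; // truncate the list tail
    }
    out.push(line);
  }
  return out;
}
```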
## Video recording

You can record browser sessions as video (CLI or library).

- CLI: Run `betterbrowse <url> "<task>" --record` or `betterbrowse <url> --record`. Use `--record-dir <dir>` to choose where the file is saved. The session is captured via Chrome DevTools screencast; if ffmpeg is on your PATH, frames are stitched into `recording.mp4` in that directory. The path is printed to stderr.
- Library: Pass `record: true` and optionally `recordDir: './recordings'` to `browseWeb()`. The return value includes `recording: { video, frameDir, frameCount, frames }` (or `recording: null` if not recording). Frames are always saved; `video` is set only when ffmpeg is available.
## Use in agents (global install)

Install betterbrowse globally so any agent (Cursor, MCP, scripts) can run it as a CLI:

```sh
npm install -g @mylesiyabor/betterbrowse
```

- CLI (recommended): Run `betterbrowse <url>` or `betterbrowse <url> "<task>"`. The result/snapshot goes to stdout — easy for agents to capture. For task mode, set `OPENAI_API_KEY`.
- As a library: In your agent code, `import { Browser, browseWeb } from '@mylesiyabor/betterbrowse'` and call the API (e.g. with your own `chat` function).
- Project-local: Run `npm install @mylesiyabor/betterbrowse` in your project and use the CLI via `npx betterbrowse` or import the module.
## License

MIT
