browserdd

v0.0.16

Published

5 days ago

Production-grade AI agent framework for browser automation — inspired by browser-use

hert4

ai agent web-automation browser-automation llm gpt-4 claude playwright puppeteer dom-analysis web-scraping

Browserdd

✨ Highlights

Smart DOM Extraction — TreeWalker + paint order filtering + bounding box deduplication (no arbitrary limits!)
Hierarchical Agents — Planner → BrowserNav architecture with automatic task decomposition
Loop Detection — Detects stuck states and provides recovery nudges
Vision Support — Screenshots sent to vision-capable models for better context
Navigation Awareness — Automatically navigates to correct pages before executing tasks
LLM Agnostic — OpenAI, Anthropic, Google, local models, or any OpenAI-compatible API

📦 Installation

npm install browserdd
# or
pnpm add browserdd

Quick Start

import { WebAgent } from 'browserdd';
import { chromium } from 'playwright';

// Launch browser
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://example.com');

// Create agent
const agent = new WebAgent({
  llm: {
    provider: 'openai',
    model: 'gpt-4o',
    apiKey: process.env.OPENAI_API_KEY,
  },
  browser: { page },
});

// Execute task
const result = await agent.execute('Click the login button and fill email with "[email protected]"');

console.log(result.success);  // true
console.log(result.summary);  // "Completed in 3 steps"

Architecture

┌───────────────────────────────────────────────────────────┐
│                         WebAgent                          │
│                      (Orchestrator)                       │
└───────────────────────────────────┬───────────────────────┘
                                    │
       ┌────────────────────────────┼────────────────┐
       ▼                            ▼                ▼
┌─────────────┐              ┌─────────────┐  ┌─────────────┐
│   Planner   │◄────────────►│ BrowserNav  │  │    Loop     │
│    Agent    │  continuous  │    Agent    │  │  Detector   │
└──────┬──────┘   feedback   └──────┬──────┘  └─────────────┘
       │                   ┌────────┴────────┐
       ▼                   ▼                 ▼
┌─────────────┐       ┌─────────┐       ┌──────────┐
│  TaskPlan   │       │   DOM   │       │  Action  │
│ (subtasks)  │       │Distiller│       │ Executor │
└─────────────┘       └─────────┘       └──────────┘

How It Works

PlannerAgent receives user task → extracts values → generates specific subtasks with values
BrowserNavigationAgent executes each subtask step-by-step
After each subtask: Planner reviews progress, can skip completed subtasks or stop early
On failure: Planner provides recovery strategy (retry, alternative, ask user)
DOMDistiller extracts interactive elements with smart filtering
LoopDetector monitors for stuck states and provides recovery hints
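
The control flow described above can be sketched roughly as follows. This is an illustration under assumptions, not Browserdd's internals: `PlannerLike`, `NavAgentLike`, `TaskOutcome`, and `runTask` are hypothetical names, the planner's review step is modeled as returning the subtasks still worth running (an empty list stops early), and only a single retry is attempted on failure.

```typescript
// Hypothetical shapes for illustration only; not Browserdd's real types.
interface Subtask { id: number; description: string }
interface SubtaskResult { success: boolean; steps: number }
interface TaskOutcome { success: boolean; totalSteps: number }
type Recovery = 'retry' | 'alternative' | 'ask_user';

interface PlannerLike {
  plan(task: string): Promise<Subtask[]>;
  // Returns the subtasks still worth running; [] means "stop early".
  review(done: Subtask, remaining: Subtask[]): Promise<Subtask[]>;
  recover(failed: Subtask): Promise<Recovery>;
}

interface NavAgentLike {
  run(subtask: Subtask): Promise<SubtaskResult>;
}

async function runTask(
  task: string,
  planner: PlannerLike,
  nav: NavAgentLike,
): Promise<TaskOutcome> {
  let queue = await planner.plan(task);
  let totalSteps = 0;

  while (queue.length > 0) {
    const [subtask, ...rest] = queue;
    let result = await nav.run(subtask);
    totalSteps += result.steps;

    if (!result.success) {
      // On failure, ask the planner for a recovery strategy.
      const strategy = await planner.recover(subtask);
      // Simplification: only 'retry' is handled; other strategies end the run.
      if (strategy !== 'retry') return { success: false, totalSteps };
      result = await nav.run(subtask);
      totalSteps += result.steps;
      if (!result.success) return { success: false, totalSteps };
    }

    // After each subtask the planner may drop completed work or stop early.
    queue = await planner.review(subtask, rest);
  }
  return { success: true, totalSteps };
}
```

Keeping the planner and the navigation agent behind small interfaces like these makes the loop easy to exercise with fakes before wiring in a real LLM.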

Smart DOM Extraction

Unlike simple approaches that cap the number of extracted elements, we use techniques inspired by browser-use:

4-Phase Filtering Pipeline

| Phase | Technique | Purpose |
|-------|-----------|---------|
| 1 | TreeWalker | Prune hidden subtrees early |
| 2 | ClickableElementDetector | 11 signals to identify interactive elements |
| 3 | Paint Order Filtering | Multi-point occlusion via elementFromPoint() |
| 4 | Bounding Box Filtering | Remove children inside clickable parents |
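
Phase 4 can be pictured as a containment check on bounding boxes. The sketch below is an assumption about how such a filter might look, not the library's implementation: `Box`, `Candidate`, `contains`, and `dropNestedClickables` are hypothetical names, and plain rectangle fields stand in for values that real code would read from the page.

```typescript
// Hypothetical shapes; every candidate is assumed to be clickable already.
interface Box { x: number; y: number; width: number; height: number }
interface Candidate { id: string; parentId: string | null; box: Box }

// True when `inner` lies entirely within `outer`.
function contains(outer: Box, inner: Box): boolean {
  return (
    inner.x >= outer.x &&
    inner.y >= outer.y &&
    inner.x + inner.width <= outer.x + outer.width &&
    inner.y + inner.height <= outer.y + outer.height
  );
}

// Drop candidates whose box sits inside the box of a clickable ancestor.
function dropNestedClickables(candidates: Candidate[]): Candidate[] {
  const byId = new Map<string, Candidate>();
  for (const c of candidates) byId.set(c.id, c);

  return candidates.filter((c) => {
    let parentId = c.parentId;
    // Walk up the (assumed acyclic) ancestor chain.
    while (parentId !== null) {
      const ancestor = byId.get(parentId);
      if (!ancestor) break;
      if (contains(ancestor.box, c.box)) return false;
      parentId = ancestor.parentId;
    }
    return true;
  });
}
```

A child that overflows its parent, such as a dropdown rendered below a button, survives this filter because its box is not fully contained.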

Clickable Detection Signals

// 11 signals to detect interactive elements
- Native interactive tags (button, a, input, select, textarea, etc.)
- Role attributes (role="button", role="link", etc.)
- Cursor style (pointer, text)
- Event handlers (onclick, @click, v-on:click, ng-click)
- ContentEditable elements
- Tab-focusable elements (tabindex)
- Data attributes (data-action, data-toggle, etc.)
- Search-related patterns (searchbox, magnify, etc.)
- Label wrapper detection (for frameworks like Ant Design)
- Icon-sized elements with interactive hints
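
A few of these signals can be checked with very little code. The following is a minimal sketch, assuming a hypothetical `ElementInfo` descriptor instead of a live DOM node; `clickableSignals` and the signal names it returns are invented for the example, and it covers only seven of the signals listed above (search patterns, label wrappers, and icon-size hints are left out). Treating only `tabindex` values of 0 or more as tab-focusable is also an assumption.

```typescript
// Hypothetical, simplified element descriptor (not a live DOM node).
interface ElementInfo {
  tag: string;
  attributes: Record<string, string>;
  cursor?: string; // computed cursor style, if known
}

const INTERACTIVE_TAGS = ['button', 'a', 'input', 'select', 'textarea'];
const HANDLER_ATTRS = ['onclick', '@click', 'v-on:click', 'ng-click'];
const DATA_ATTRS = ['data-action', 'data-toggle'];

// Returns the names of the signals that fired; an empty array means
// nothing marked the element as interactive.
function clickableSignals(el: ElementInfo): string[] {
  const attrs = el.attributes;
  const has = (key: string): boolean => key in attrs;
  const signals: string[] = [];

  if (INTERACTIVE_TAGS.indexOf(el.tag.toLowerCase()) !== -1) signals.push('native-tag');
  if (attrs['role'] === 'button' || attrs['role'] === 'link') signals.push('role');
  if (el.cursor === 'pointer' || el.cursor === 'text') signals.push('cursor');
  if (HANDLER_ATTRS.some(has)) signals.push('event-handler');
  if (has('contenteditable') && attrs['contenteditable'] !== 'false') {
    signals.push('contenteditable');
  }
  // Assumption: negative tabindex is focusable by script but not by Tab.
  if (has('tabindex') && Number(attrs['tabindex']) >= 0) signals.push('tabindex');
  if (DATA_ATTRS.some(has)) signals.push('data-attribute');
  return signals;
}
```

Returning the list of fired signals, not a bare boolean, makes it easier to debug why an element was or was not picked up.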

Distillation Modes

| Mode | Token Reduction | Use Case |
|------|----------------|----------|
| TEXT_ONLY | ~95% | Reading content, extracting data |
| INPUT_FIELDS | ~90% | Form filling, data entry |
| ALL_FIELDS | ~80% | Complex navigation, clicking |
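
To get a feel for what these reductions mean for a prompt budget, here is a small back-of-the-envelope helper. It is only a sketch: `estimateDistilledTokens` is a hypothetical function, and the ratios are the approximate figures from the table above, so real pages will vary.

```typescript
type DistillationMode = 'TEXT_ONLY' | 'INPUT_FIELDS' | 'ALL_FIELDS';

// Approximate reductions copied from the table above.
const APPROX_REDUCTION: Record<DistillationMode, number> = {
  TEXT_ONLY: 0.95,
  INPUT_FIELDS: 0.9,
  ALL_FIELDS: 0.8,
};

// Rough number of tokens left after distilling a page of `rawTokens` tokens.
function estimateDistilledTokens(rawTokens: number, mode: DistillationMode): number {
  return Math.round(rawTokens * (1 - APPROX_REDUCTION[mode]));
}
```

For a page that would cost about 40,000 tokens raw, this estimates roughly 2,000 tokens in TEXT_ONLY mode, 4,000 in INPUT_FIELDS, and 8,000 in ALL_FIELDS.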

Navigation Awareness

The agent automatically handles page navigation:

// User is on /chat page, but task requires /feed page
const result = await agent.execute('Post a new message saying "Hello World"');

// Agent automatically:
// 1. Detects current page doesn't have post input
// 2. Finds navigation link to correct page
// 3. Navigates first, then performs the task
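
The three steps above amount to a small decision before each task. The sketch below illustrates that decision under assumptions: `PageSnapshot`, `NextMove`, and `decideNextMove` are hypothetical names, and the target path is assumed to be known already (for example from site-specific rules), which the real agent may determine differently.

```typescript
// Hypothetical snapshot of what the current page offers.
interface PageSnapshot {
  path: string;
  hasTargetInput: boolean; // e.g. a post textarea was found on the page
  links: { text: string; href: string }[]; // visible navigation links
}

type NextMove =
  | { kind: 'perform_task' }
  | { kind: 'navigate'; href: string }
  | { kind: 'give_up' };

function decideNextMove(page: PageSnapshot, targetPath: string): NextMove {
  // 1. If the page already has what the task needs, act right away.
  if (page.hasTargetInput) return { kind: 'perform_task' };
  // 2. Otherwise look for a link that leads to the page we need.
  const link = page.links.find((l) => l.href === targetPath);
  // 3. Navigate first, or report that no route was found.
  return link ? { kind: 'navigate', href: link.href } : { kind: 'give_up' };
}
```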

Loop Detection & Recovery

Built-in detection for common stuck patterns:

Action Repetition — Same action repeated 3+ times
Page Oscillation — Toggling between 2 pages
Stagnant State — No DOM changes after actions

When detected, the agent receives a "nudge" to try alternative approaches.
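
The three patterns can be detected from a short history of steps. This is a minimal sketch, not the library's detector: `StepRecord`, `StuckPattern`, and `detectStuck` are hypothetical, the DOM state is assumed to be summarized as a hash string per step, and turning a detected pattern into a nudge message is left out.

```typescript
// Hypothetical record of one agent step.
interface StepRecord { action: string; url: string; domHash: string }
type StuckPattern = 'action_repetition' | 'page_oscillation' | 'stagnant_state';

function allSame(values: string[]): boolean {
  return values.every((v) => v === values[0]);
}

// Inspects the most recent steps and reports the first pattern that matches.
function detectStuck(steps: StepRecord[], minRepeats = 3): StuckPattern | null {
  const recent = steps.slice(-minRepeats);
  const enoughSteps = recent.length === minRepeats;

  // Same action repeated `minRepeats` times in a row.
  if (enoughSteps && allSame(recent.map((s) => s.action))) {
    return 'action_repetition';
  }

  // Toggling between two pages: A, B, A, B.
  const urls = steps.slice(-4).map((s) => s.url);
  if (urls.length === 4 && urls[0] === urls[2] && urls[1] === urls[3] && urls[0] !== urls[1]) {
    return 'page_oscillation';
  }

  // Different actions, but the DOM never changed.
  if (enoughSteps && allSame(recent.map((s) => s.domHash))) {
    return 'stagnant_state';
  }
  return null;
}
```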

Configuration

const agent = new WebAgent({
  // LLM Configuration
  llm: {
    provider: 'openai',
    model: 'gpt-4o',
    apiKey: process.env.OPENAI_API_KEY,
    baseUrl: 'https://api.openai.com/v1',  // or custom endpoint
    temperature: 0.7,
    maxTokens: 4096,
  },

  // Browser (Playwright page)
  browser: {
    page: playwrightPage,
  },

  // Execution limits
  maxStepsPerSubtask: 8,
  maxSubtasksPerTask: 20,

  // Features
  screenshotOnAction: true,  // Send screenshots to vision models
  debug: false,

  // Custom prompts (optional)
  prompts: {
    planner: 'Your custom planner prompt...',
    browserNav: 'Your custom browser nav prompt...',
    siteContext: 'Site-specific rules and context...',  // See Site Context below
  },
});

Site Context

The library is designed to be site-agnostic. All site-specific rules should be provided via siteContext:

const siteContext = `
## Site Rules

### Page Detection
- /feed → Post creation page
- /chat → Messaging page
- /profile → Profile editing page

### Element Hints
- Post button: Look for "Post" or submit button after textarea
- Send button: Paper-plane icon or "Send" text
- Save button: "Save" or "Update" after form fields

### Success Indicators
- Toast/snackbar with "success", "saved", "posted"
- Form closes after submission
- New content appears in feed/chat
`;

const agent = new WebAgent({
  llm: { ... },
  prompts: {
    siteContext,  // Appended to Planner and Navigator prompts
  },
});

What to Include in Site Context

| Category | Examples |
|----------|----------|
| Page Structure | URL patterns, page purposes |
| Element Hints | How to identify key buttons/inputs |
| Success Indicators | Toast patterns, success messages |
| Error Recovery | What to do when actions fail |
| Localization | Language-specific patterns |
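
If site rules are kept as structured data, the `siteContext` string can be generated from them. The helper below is a hypothetical convenience, not part of the library: `SiteSection` and `buildSiteContext` are invented names, and the output simply mirrors the heading-and-bullets layout of the example string shown earlier.

```typescript
// Hypothetical structure for site rules; the agent only receives the final string.
interface SiteSection { heading: string; rules: string[] }

function buildSiteContext(sections: SiteSection[]): string {
  const lines: string[] = ['## Site Rules'];
  for (const section of sections) {
    // Blank line, then a sub-heading, then one bullet per rule.
    lines.push('', `### ${section.heading}`);
    for (const rule of section.rules) lines.push(`- ${rule}`);
  }
  return lines.join('\n');
}

// Usage: one section per category from the table above.
const exampleContext = buildSiteContext([
  { heading: 'Page Detection', rules: ['/feed is the post creation page'] },
  { heading: 'Success Indicators', rules: ['Toast with "saved"', 'Form closes after submission'] },
]);
```

The resulting string can then be passed as `prompts.siteContext` in the same way as the hand-written template.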


Events

agent.on('task:start', ({ taskId, task }) => {
  console.log(`Starting: ${task}`);
});

agent.on('task:plan', ({ taskId, plan }) => {
  console.log(`Plan: ${plan.subtasks.length} subtasks`);
});

agent.on('subtask:start', ({ subtask }) => {
  console.log(`Subtask: ${subtask.description}`);
});

agent.on('subtask:complete', ({ result }) => {
  console.log(`Result: ${result.success ? '✓' : '✗'}`);
});

agent.on('task:complete', ({ result }) => {
  console.log(`Done! Steps: ${result.totalSteps}`);
});

Testing

# Unit tests
npm test

# Integration tests (requires OpenAI-compatible API)
export WEB_AGENT_OPENAI_BASE_URL="https://api.openai.com/v1"
export WEB_AGENT_OPENAI_API_KEY="sk-..."
npm run test:run

Examples

Form Filling

await agent.execute('Fill the signup form with email "[email protected]" and password "Secret123"');

E-commerce

await agent.execute('Search for "wireless headphones", sort by price low to high, add first result to cart');

Social Media

await agent.execute('Go to profile, change my display name to "John Doe", and save');

Data Extraction

const context = await agent.getContext('text_only');
// Use with your own LLM for extraction

License

MIT © Hert4