agent-element-locator

v0.1.1

Published

3 months ago

Multi-strategy browser element locator: DOM → A11y → Vision fallback chain for AI agent automation


syed_abbas_hasan_zaidi

browser-automation element-locator playwright accessibility vision ai-agent fallback-chain dom a11y testing

Multi-strategy browser element locator for AI agent automation. Finds elements using a three-tier fallback chain: DOM (~5ms) → A11y (~150ms) → Vision (~2500ms). The timings are approximate.

Install

npm install agent-element-locator playwright

Quick Start

import { ElementLocator } from "agent-element-locator";
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");

// Create a locator instance — wires up all three strategies automatically
const locator = ElementLocator.create({
  page,
  sessionId: "my-session-001",
});

// Find by description — tries DOM selectors, then fuzzy A11y matching,
// then falls back to Claude vision if needed
const result = await locator.locate({
  description: "the login button",
});

await result.handle.click();
await browser.close();

That's it. The SDK tries the fastest strategy first and only escalates when needed.

User Guide

How the Fallback Chain Works

Every locate() call runs through up to three strategies in order. The chain stops at the first strategy that finds the element above its confidence threshold.

locate({ description: "Submit" })
        |
        v
+-------------------+     +-------------------+     +-------------------+
| 1. DomStrategy    |     | 2. A11yStrategy   |     | 3. VisionStrategy |
|    ~5ms           |     |    ~150ms         |     |    ~2500ms        |
|                   |     |                   |     |                   |
| Tries in order:   |     | Collects all      |     | Takes full-page   |
|  - cssSelector    |     | interactive       |     | screenshot, sends |
|  - xpath          |     | elements, scores  |     | to Claude with    |
|  - testId         |     | each against your |     | your description. |
|  - ariaLabel      |     | description using |     | Claude returns a  |
|  - ariaRole+text  |     | Jaro-Winkler      |     | bounding box,     |
|  - text           |     | fuzzy matching.   |     | resolved back to  |
|                   |     |                   |     | a DOM element via |
| Confidence: 1.0   |     | Threshold: 0.80   |     | elementFromPoint. |
| (exact match)     |     |                   |     | Threshold: 0.70   |
+-------------------+     +-------------------+     +-------------------+
        |                         |                         |
        +-- null? continue -------+-- null? continue -------+-- null? throw
                                                               ElementNotFoundError
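
The control flow in the diagram can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the `Strategy`, `Candidate`, and `ChainResult` shapes and the `runChain` function are hypothetical names chosen for this example, and the real `FallbackChain` may differ.

```typescript
// Hypothetical shapes for illustration only; the package's real interfaces may differ.
interface Candidate {
  selector: string;
  confidence: number; // 0.0 to 1.0
}

interface Strategy {
  name: "dom" | "a11y" | "vision";
  threshold: number; // minimum confidence required to accept a candidate
  find(description: string): Promise<Candidate | null>;
}

interface ChainResult {
  strategy: Strategy["name"];
  candidate: Candidate;
  attempted: string[]; // names of every strategy that was tried, in order
}

// Try each strategy in order and stop at the first candidate that meets
// that strategy's own confidence threshold.
async function runChain(
  strategies: Strategy[],
  description: string,
): Promise<ChainResult> {
  const attempted: string[] = [];
  for (const strategy of strategies) {
    attempted.push(strategy.name);
    const candidate = await strategy.find(description);
    if (candidate !== null && candidate.confidence >= strategy.threshold) {
      return { strategy: strategy.name, candidate, attempted };
    }
  }
  // Mirrors the "null? throw" branch at the end of the diagram.
  throw new Error(`Element not found after trying: ${attempted.join(", ")}`);
}
```

Because each strategy carries its own threshold, a low-confidence result from an early strategy is treated the same as no result, and the chain moves on.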

Why this order?

DOM is nearly instant and exact — if you have a CSS selector or test ID, it finds the element in milliseconds with 100% confidence.
A11y survives CSS class renames and DOM restructuring because it matches on semantic labels (aria-label, visible text), not structure. Costs ~150ms.
Vision is the last resort. It captures a screenshot, sends it to Claude (or GPT-4V), and asks the model to find the element visually. Slower and costs money, but works when everything else fails — including post-deploy DOM restructures, dynamically loaded elements, and pages with no accessible labels.
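
To make the A11y scoring step more concrete, here is a self-contained sketch of standard Jaro-Winkler similarity and a threshold-based picker. It is an illustration of the general technique only: the `jaro`, `jaroWinkler`, and `bestMatch` names, the lowercase normalization, and the way the 0.80 threshold is applied are assumptions, not the package's actual code.

```typescript
// Standard Jaro similarity between two strings (0.0 to 1.0).
function jaro(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  // Characters count as matching only within this sliding window.
  const window = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array<boolean>(a.length).fill(false);
  const bMatched = new Array<boolean>(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(b.length - 1, i + window);
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that appear in a different order.
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) outOfOrder++;
    k++;
  }
  const t = outOfOrder / 2;
  return (matches / a.length + matches / b.length + (matches - t) / matches) / 3;
}

// Jaro-Winkler boosts the score for a shared prefix of up to 4 characters.
function jaroWinkler(a: string, b: string, scaling = 0.1): number {
  const j = jaro(a, b);
  const maxPrefix = Math.min(4, a.length, b.length);
  let prefix = 0;
  while (prefix < maxPrefix && a[prefix] === b[prefix]) prefix++;
  return j + prefix * scaling * (1 - j);
}

// Hypothetical picker: best-scoring label at or above the threshold, else null.
function bestMatch(
  description: string,
  labels: string[],
  threshold = 0.8,
): { label: string; score: number } | null {
  let best: { label: string; score: number } | null = null;
  for (const label of labels) {
    const score = jaroWinkler(description.toLowerCase(), label.toLowerCase());
    if (score >= threshold && (best === null || score > best.score)) {
      best = { label, score };
    }
  }
  return best;
}
```

For the textbook pair "MARTHA" and "MARHTA", `jaroWinkler` returns roughly 0.961, which would clear a 0.80 threshold despite the transposed letters.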

Providing Targets

A target describes what you want to find. All fields are optional, so provide as many or as few as you have. The more hints you supply, the faster and more accurate the lookup tends to be.

interface LocatorTarget {
  // Structural selectors (DOM strategy — fastest)
  cssSelector?: string;   // "#login-btn", ".nav > a:first-child"
  xpath?: string;         // "//button[@type='submit']"
  testId?: string;        // Value of data-testid attribute

  // Semantic hints (A11y strategy — resilient to DOM changes)
  ariaLabel?: string;     // aria-label attribute value
  ariaRole?: string;      // "button", "textbox", "link"
  text?: string;          // Visible text content

  // Natural language (A11y fuzzy match + Vision strategy)
  description?: string;   // "the blue submit button in the header"
}

Examples by use case:

// You have a stable test ID (best case — instant)
await locator.locate({ testId: "checkout-button" });

// You know the visible text
await locator.locate({ text: "Sign Up", ariaRole: "button" });

// You have a CSS selector that might go stale
await locator.locate({
  cssSelector: "#old-selector",
  description: "the main search input",  // fallback for A11y/Vision
});

// You only have a natural-language description (slowest but most resilient)
await locator.locate({
  description: "the plus button to increase the number of adult guests",
});

Tip: Always include a description alongside structural selectors. If the selector goes stale after a deploy, the A11y and Vision strategies can still find the element using the description as a fallback.

Configuration Options

const locator = ElementLocator.create({
  // Required
  page: playwrightPage,          // The Playwright Page to operate on
  sessionId: "uuid-v4-here",     // Groups trajectory logs for this session

  // Optional — sensible defaults for all of these
  timeout: 5000,                 // Per-locate timeout in ms (default: 5000)
  visionServiceUrl: "https://...", // Override vision service URL
  anthropicApiKey: "sk-ant-...",   // Override Anthropic API key
  logTrajectories: true,           // Enable/disable trajectory logging
});

Environment variable resolution (no code needed):

| Setting | Env Variable | Default |
|---|---|---|
| Vision service URL | VISION_SERVICE_URL | Hosted service on Railway |
| Anthropic API key | ANTHROPIC_API_KEY | None (Vision strategy disabled) |

The recommended setup: set ANTHROPIC_API_KEY in your environment and let the SDK read it automatically. No need to pass it in code.
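
The resolution order described above can be pictured as follows. This is a sketch under assumptions: the `resolveVisionConfig` function, its return shape, and the precedence of explicit options over environment variables are inferred from the "Override" comments in the options block, not taken from the package source. The environment is passed in as a plain record so the example stays self-contained.

```typescript
// Hypothetical types for illustration; field names mirror the documented options.
interface VisionOptions {
  visionServiceUrl?: string;
  anthropicApiKey?: string;
}

interface ResolvedVisionConfig {
  visionServiceUrl: string;
  anthropicApiKey: string | undefined;
  visionEnabled: boolean; // false when no API key is available
}

type Env = Record<string, string | undefined>;

// Assumed precedence: explicit option, then environment variable, then default.
// Empty strings are treated as "not set".
function resolveVisionConfig(
  options: VisionOptions,
  env: Env,
  defaultUrl: string,
): ResolvedVisionConfig {
  const visionServiceUrl =
    options.visionServiceUrl || env["VISION_SERVICE_URL"] || defaultUrl;
  const anthropicApiKey =
    options.anthropicApiKey || env["ANTHROPIC_API_KEY"] || undefined;
  return {
    visionServiceUrl,
    anthropicApiKey,
    visionEnabled: anthropicApiKey !== undefined,
  };
}
```

In a Node.js program you would pass `process.env` as the second argument; `defaultUrl` stands in for the hosted service URL.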

Setting Up the Vision Strategy

The Vision strategy is optional. Without an API key, the SDK uses DOM and A11y only (which handle most cases). To enable Vision:

Step 1: Set your Anthropic API key

export ANTHROPIC_API_KEY=sk-ant-your-key-here

Step 2: Create the locator (key is read automatically)

const locator = ElementLocator.create({
  page,
  sessionId: "my-session",
});
// Vision strategy is now active as the third fallback

How BYOK (Bring Your Own Key) works:

Your API key is forwarded to the vision service via the X-Anthropic-Key HTTP header. The vision service uses your key for the Claude API call and does not store it. The key never touches disk on the service side.
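
As a rough picture of what a BYOK request looks like on the wire, the sketch below builds the pieces of an HTTP call that carries the key in the X-Anthropic-Key header. Only the header name comes from this README. The `/locate` path, the payload shape, and the `buildVisionRequest` helper are hypothetical and exist purely for illustration.

```typescript
// Hypothetical request container; compatible with fetch(url, init) on Node.js 18+.
interface VisionRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

function buildVisionRequest(
  baseUrl: string,
  apiKey: string,
  payload: unknown,
): VisionRequest {
  return {
    // "/locate" is an assumed endpoint path, not confirmed by the README.
    url: `${baseUrl.replace(/\/+$/, "")}/locate`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // The key travels per request in a header, so the service does not
        // need to hold a key of its own.
        "X-Anthropic-Key": apiKey,
      },
      body: JSON.stringify(payload),
    },
  };
}
```

Sending the key per request is what lets a shared hosted service work without storing any caller's credentials.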

Self-hosting the vision service:

The vision service is a Python FastAPI application in services/vision-service/. To self-host:

cd services/vision-service
pip install -r requirements.txt
uvicorn vision_service.main:app --host 0.0.0.0 --port 8765

Then point the SDK at it:

const locator = ElementLocator.create({
  page,
  sessionId: "my-session",
  visionServiceUrl: "http://localhost:8765",
});

Convenience Methods

The SDK provides shorthand methods for common actions:

// locate() — find the element, get back a handle
const result = await locator.locate({ description: "the search button" });
console.log(result.strategy);    // "dom", "a11y", or "vision"
console.log(result.confidence);  // 0.0 to 1.0
await result.handle.click();

// click() — locate + click in one call
await locator.click({ description: "the search button" });

// fill() — locate + type into an input
await locator.fill({ description: "the email input" }, "user@example.com");

Custom Chains

For advanced use cases, build your own strategy chain:

import {
  FallbackChain,
  DomStrategy,
  A11yStrategy,
  VisionStrategy,
  VisionClient,
} from "agent-element-locator";

// DOM + A11y only (no Vision cost, no API key needed)
const cheapChain = new FallbackChain([
  new DomStrategy(),
  new A11yStrategy(),
]);

// Vision-only (for canvas UIs, cross-origin iframes, PDF viewers)
const client = new VisionClient("https://your-service.com", "sk-ant-...");
const visionOnly = new FallbackChain([new VisionStrategy(client)]);

// Skip A11y, go straight from DOM to Vision
const domThenVision = new FallbackChain([
  new DomStrategy(),
  new VisionStrategy(client),
]);

// Use a custom chain directly
const { element } = await cheapChain.locate(
  { description: "the login button" },
  { page, timeout: 5000 }
);

When to use Vision-only:

Canvas-rendered UIs (Figma, Google Sheets cells) — no DOM nodes exist
Cross-origin iframes — Playwright can't query inside them
PDF viewers — clickable regions have no DOM representation
External screenshots — no live Playwright page available

Trajectory Logging

Every locate() call logs a trajectory record (JSONL format) to the vision service. This captures what each strategy tried, how long it took, and which one succeeded — useful for debugging and analyzing agent runs.

// Enabled by default. Disable for unit tests:
const locator = ElementLocator.create({
  page,
  sessionId: "test-session",
  logTrajectories: false,
});

Trajectory records include:

Target description and page URL
Each strategy attempted, duration, and outcome
The winning strategy, selector used, and confidence score
Timestamps for the full locate() call
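
One way to picture a trajectory record is as a typed object serialized to one JSON line. The field names below are hypothetical: they follow the list above, but the actual schema used by the SDK and the vision service is not documented here.

```typescript
type StrategyName = "dom" | "a11y" | "vision";

// Hypothetical schema derived from the bullet list above.
interface StrategyAttempt {
  strategy: StrategyName;
  durationMs: number;
  outcome: "found" | "not_found" | "error";
}

interface TrajectoryRecord {
  sessionId: string;
  description: string;
  pageUrl: string;
  attempts: StrategyAttempt[];
  winner: { strategy: StrategyName; selector: string; confidence: number } | null;
  startedAt: string; // ISO 8601 timestamp
  finishedAt: string; // ISO 8601 timestamp
}

// JSONL: one JSON document per line, each line terminated by a newline.
function toJsonl(records: TrajectoryRecord[]): string {
  return records.map((record) => JSON.stringify(record) + "\n").join("");
}

// Example analysis: how often each strategy produced the winning result.
function winsByStrategy(records: TrajectoryRecord[]): Record<StrategyName, number> {
  const wins: Record<StrategyName, number> = { dom: 0, a11y: 0, vision: 0 };
  for (const record of records) {
    if (record.winner !== null) wins[record.winner.strategy]++;
  }
  return wins;
}
```

A tally like `winsByStrategy` is the kind of analysis these logs make possible, for example spotting pages where Vision wins unusually often.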

Error Handling

import {
  ElementLocator,
  ElementNotFoundError,
  VisionServiceError,
} from "agent-element-locator";

try {
  const result = await locator.locate({ description: "nonexistent button" });
} catch (error) {
  if (error instanceof ElementNotFoundError) {
    // All three strategies failed to find the element.
    // error.target contains the original LocatorTarget for debugging.
    console.log("Not found:", error.target);
  } else if (error instanceof VisionServiceError) {
    // Vision service is unreachable or returned an error.
    // DOM and A11y strategies are unaffected.
    console.log("Vision service issue:", error.message);
    console.log("Cause:", error.cause);
  } else {
    // Unexpected error: rethrow rather than silently swallowing it.
    throw error;
  }
}

Common error scenarios:

| Error | Cause | Fix |
|---|---|---|
| ElementNotFoundError | No strategy found the element | Check your target description. Is the element visible on the page? Try adding more hints (testId, ariaLabel). |
| VisionServiceError: Cannot reach vision service | Vision service is down or URL is wrong | Check VISION_SERVICE_URL. The SDK still works without Vision; DOM and A11y strategies handle most cases. |
| VisionServiceError: 503 | Vision provider (Claude) is unavailable | Transient issue. Retry or rely on DOM/A11y strategies. |
| Timeout | Strategy took longer than the configured timeout | Increase timeout in options. Vision strategy needs 5-10s on cold starts. |
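
For the transient cases in the table, a small retry wrapper with exponential backoff is one option. The `retryTransient` helper below is a generic sketch and not part of the SDK. The caller decides what counts as transient, and the attempt count and delays are arbitrary example values.

```typescript
// Generic retry helper (hypothetical, not provided by the package).
// Retries only when isTransient(error) is true, doubling the delay each time.
async function retryTransient<T>(
  operation: () => Promise<T>,
  isTransient: (error: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Give up immediately on non-transient errors or on the final attempt.
      if (!isTransient(error) || attempt === maxAttempts) break;
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

With the SDK, a call might look like `retryTransient(() => locator.locate(target), (e) => e instanceof VisionServiceError)`, which retries Vision service failures but lets ElementNotFoundError surface immediately.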

API Reference

`ElementLocator.create(options): ElementLocator`

Factory method. Creates a fully wired locator with all three strategies.

`locator.locate(target): Promise<LocateResult>`

Find an element on the page. Returns { handle, strategy, confidence }. Throws ElementNotFoundError if all strategies fail.

`locator.click(target): Promise<LocateResult>`

Locate + click in one call. Returns the same result as locate().

`locator.fill(target, value): Promise<LocateResult>`

Locate + fill a text input. Returns the same result as locate().

`FallbackChain`

const chain = new FallbackChain([strategy1, strategy2, ...]);
const { element, attempts, totalDurationMs } = await chain.locate(target, context);

Low-level chain for custom strategy combinations.

`VisionClient`

const client = new VisionClient(baseUrl, anthropicApiKey?);
const isUp = await client.healthCheck();
const response = await client.locate(request);

HTTP adapter to the Python vision service.

Types

// What you pass in
interface LocatorTarget {
  description?: string;
  cssSelector?: string;
  xpath?: string;
  testId?: string;
  ariaLabel?: string;
  ariaRole?: string;
  text?: string;
}

// What you get back
interface LocateResult {
  handle: ElementHandle;
  strategy: "dom" | "a11y" | "vision";
  confidence: number;
}

// Errors
class ElementNotFoundError extends Error {
  target: LocatorTarget;
}

class VisionServiceError extends Error {
  cause?: unknown;
}

Help / Troubleshooting

"Element not found" but I can see it on the page

Check your description. Be specific about location: "the blue Submit button in the header" beats "submit button."
Add structural hints. If you have a data-testid or aria-label, include them — DOM strategy is instant and exact.
Increase timeout. Vision strategy typically needs 2-5 seconds (5-10s on cold starts), which can exceed the default timeout of 5000 ms. Set timeout: 10000 or higher.
Check element visibility. Is the element inside an iframe, behind a modal, or below the fold? fullPage: true screenshots capture below-the-fold content, but modals and iframes can block elementFromPoint().
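
The visibility point above is easier to reason about with the coordinate math written out. A full-page screenshot uses page-relative coordinates, while `document.elementFromPoint()` expects viewport-relative ones, so a bounding box from the screenshot has to be shifted by the scroll offset first. The helpers below are a hypothetical sketch of that conversion, assuming screenshot pixels map 1:1 to CSS pixels (device scale factor of 1). How the SDK handles this internally is not documented here.

```typescript
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

// Center of a bounding box, in the same coordinate space as the box.
function boxCenter(box: Box): Point {
  return { x: box.x + box.width / 2, y: box.y + box.height / 2 };
}

// Convert a page-relative point to a viewport-relative point
// by subtracting the current scroll offset.
function toViewportPoint(pagePoint: Point, scroll: Point): Point {
  return { x: pagePoint.x - scroll.x, y: pagePoint.y - scroll.y };
}

// A point outside the viewport cannot be hit-tested until the page scrolls to it.
function isInsideViewport(
  point: Point,
  viewport: { width: number; height: number },
): boolean {
  return (
    point.x >= 0 &&
    point.y >= 0 &&
    point.x < viewport.width &&
    point.y < viewport.height
  );
}
```

If `isInsideViewport` is false for the converted point, the element is below or above the fold and the page would need to scroll before a hit test can succeed. A modal covering the point would still be returned instead of the intended element.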

Vision strategy returns low confidence / wrong element

Be spatially specific in your description: "the plus button in the Adults row" not "the plus button."
Provide role hints: ariaRole: "button" tells Claude what type of element to look for.
Check for duplicates: If there are multiple similar elements (e.g., several "+" buttons), mention the nearby label or section in your description.

Vision service not reachable

Check the URL: Verify VISION_SERVICE_URL is correct. Default is the hosted Railway instance.
Check API key: Set ANTHROPIC_API_KEY in your environment. Without it, Vision strategy is effectively disabled (the service has no key to call Claude with).

Test the health endpoint:

curl https://locator-sdk-production.up.railway.app/health

The SDK still works without Vision. DOM and A11y strategies handle the majority of cases. Vision is only needed for ambiguous or structurally changed pages.

Performance tips

| Scenario | Recommendation |
|---|---|
| You have stable test IDs | Use testId; DOM strategy finds it in ~5ms |
| Test IDs might go stale | Add description as a fallback for A11y/Vision |
| Speed matters, accuracy is fine | Use a DOM+A11y-only chain (skip Vision) |
| Canvas or iframe targets | Use a Vision-only chain |
| Running in CI | Set timeout: 15000 to handle cold starts |

Getting help

Issues: github.com/SyedAbbasHasanZaidi/browser-agent-locater-/issues
Source: github.com/SyedAbbasHasanZaidi/browser-agent-locater-

Requirements

Node.js >= 18
Playwright >= 1.44
(Optional) Anthropic API key for Vision strategy

License

MIT
