@atomicbotai/computer-use

v0.1.12

Published

8 days ago

Framework-agnostic desktop computer-use tool for AI agents: screenshot, mouse, keyboard, OCR, overlay

0High
0Medium
0Low

dtorey

alexkalina

@atomicbotai/computer-use

Framework-agnostic desktop automation for AI agents.

Give your AI agent eyes and hands — screenshot the screen, click, type, scroll, drag, read text via OCR, and more. Works with any tool-calling LLM framework, MCP server, or custom pipeline.

Features

| | Feature | Description | |---|---|---| | 📸 | Screenshot | Capture any display with automatic downscaling and grid overlay | | 📸 | Screenshot Full | Full-resolution capture with OCR text anchors | | 🖱️ | Click / Double / Triple | Precise clicking with coordinate mapping from screenshot space | | ⌨️ | Type & Press | Text input and keyboard shortcuts (modifiers, combos) | | 📜 | Scroll | Directional scrolling at any position | | 🔍 | OCR | Built-in text recognition (macOS Vision, Windows Media OCR) | | 🎯 | Coord Mapping | Automatic screenshot → screen coordinate translation | | 🛡️ | Guardrails | Prevents misclicks in dock/launcher and submit zones | | 🔒 | Session Lock | File-based lock prevents concurrent agent sessions | | 🟢 | Overlay | Native "agent active" indicator (macOS Swift, Windows PowerShell) | | 🐛 | Debug Artifacts | Save screenshots, OCR results, and tool outputs per action | | 📋 | Clipboard | Read/write clipboard contents | | 🚀 | Open App | Launch or switch apps by name (cross-platform) | | ↔️ | Drag & Drop | Native drag support via platform scripts |

Quick Start

npm install @atomicbotai/computer-use

import { createComputerTool } from "@atomicbotai/computer-use";

const tool = createComputerTool({
  overlay: { enabled: true, color: "00BFFF", label: "My Agent" },
  debugArtifacts: true,
});

// Take a screenshot
const screenshot = await tool.execute("call-1", { action: "screenshot" });

// Click at coordinates from the screenshot
await tool.execute("call-2", { action: "click", x: 512, y: 384 });

// Type text
await tool.execute("call-3", { action: "type", text: "Hello, world!" });

// Press a keyboard shortcut
await tool.execute("call-4", { action: "press", text: "cmd+s" });

Use with any LLM tool-calling framework

The returned tool object has a standard shape that plugs into any agent loop:

const tool = createComputerTool();

// tool.name        → "computer"
// tool.description → detailed prompt for the model
// tool.schema      → JSON Schema (TypeBox) for tool parameters
// tool.execute     → (toolCallId, args, signal?) => Promise<ToolResult>

Supported Actions

| Action | Parameters | Description | |--------|-----------|-------------| | screenshot | display_index? | Capture screen (auto-downscaled with grid overlay) | | screenshot_full | display_index? | Full-resolution capture + OCR text anchors | | click | x, y, button? | Single click at coordinates | | double_click | x, y, button? | Double click | | triple_click | x, y, button? | Triple click (select line) | | type | text | Type literal text as-is | | press | text | Keyboard shortcut (cmd+v, enter, tab) | | submit_input | — | Press Enter to submit current input | | scroll | x?, y?, direction, amount? | Scroll in any direction | | cursor_position | — | Get current cursor position | | mouse_move | x, y | Move cursor to position | | drag | x, y, to_x, to_y | Drag from one point to another | | wait | duration? | Pause execution (max 30s) | | hold_key | text, duration? | Hold a key combo (max 10s) | | display_list | — | List connected displays | | read_clipboard | — | Read clipboard contents | | write_clipboard | text | Write text to clipboard | | open_app | app_name | Launch application by name | | switch_app | app_name | Switch to application by name |

OCR: Built-in Text Recognition

One of the most powerful features of the library — zero-dependency OCR that uses native platform APIs. No Tesseract, no cloud services, no API keys, no extra binaries. It just works.

Why This Matters for AI Agents

Screenshots alone are not enough. When an LLM looks at a downscaled screenshot, small text becomes unreadable — button labels, menu items, chat messages, form fields all blur together. The agent guesses coordinates and misses.

OCR solves this by extracting text with pixel-precise coordinates directly from the screenshot. The agent gets:

Anchor points — "Send" at (1450, 890) — exact clickable coordinates for every text element
Bounding boxes — left=1400, top=875, width=100, height=30 — the full region of each element
Confidence scores — so the agent knows which detections to trust
Layout structure — elements ordered top-to-bottom, left-to-right, matching how a human reads the screen

This turns a blurry screenshot into a structured map of the UI.

How It Works

When you call screenshot_full, the library:

Takes a full-resolution screenshot (no downscaling)
Runs OCR via the native platform engine (no external deps)
Selects the most useful text anchors (deduped, sorted by reading order)
Builds a layout map with coordinates in screenshot-image space
Generates a prompt hint for the LLM with anchor coordinates

The result returned to the model looks like:

Full-resolution screenshot captured (2560x1600) with grid overlay.
OCR anchors: "Inbox" at (120, 45); "Compose" at (85, 130); "Search mail" at (450, 45).
OCR layout: [e1 "Inbox" center=(120, 45) box=(80, 30, 80x30)] [e2 "Compose" center=(85, 130) box=(40, 115, 90x30)]

The agent can now click "Compose" at (85, 130) instead of guessing where the button might be.

Platform Engines

| Platform | Engine | How it runs | |----------|--------|-------------| | macOS | Apple Vision framework | Native Swift script via xcrun swift — no Xcode required | | Windows | Windows.Media.Ocr | PowerShell script using built-in UWP OCR API | | Linux | — | Not available yet (null adapter, graceful no-op) |

Both engines run locally and offline. No data leaves the machine. No API keys needed.

Standalone OCR Usage

The OCR subsystem is available as a separate import for use outside the computer tool:

import {
  recognizeText,
  buildOcrLayout,
  summarizeOcr,
  createOcrAdapter,
  createMacOsVisionOcrAdapter,
  createWindowsMediaOcrAdapter,
  createNullOcrAdapter,
} from "@atomicbotai/computer-use/ocr";
import type {
  OcrResult,
  OcrLine,
  OcrBoundingBox,
  OcrAnchorPoint,
  OcrAdapter,
  OcrLayout,
  OcrSummary,
} from "@atomicbotai/computer-use/ocr";

// Recognize text from any image
const result = await recognizeText({
  imagePath: "./screenshot.png",
  imageWidth: 2560,
  imageHeight: 1600,
});

// result.lines → array of detected text lines:
// [
//   {
//     text: "Compose",
//     confidence: 0.98,
//     bbox: { left: 40, top: 115, width: 90, height: 30 },
//     center: { x: 85, y: 130 }
//   },
//   ...
// ]

// Build a structured layout for LLM consumption
const layout = buildOcrLayout(result);
// layout.elements → top elements with IDs, sorted by reading order
// layout.promptHint → ready-to-use prompt string

// Or get a compact summary with anchor points
const summary = summarizeOcr(result);
// summary.text → 'OCR anchors: "Inbox" at (120, 45); "Compose" at (85, 130)'
// summary.anchors → [{ text: "Inbox", x: 120, y: 45, confidence: 0.99 }, ...]

Custom OCR Adapters

You can create your own OCR adapter (e.g. Tesseract for Linux) by implementing the OcrAdapter interface:

import type { OcrAdapter, RecognizeTextParams, OcrResult } from "@atomicbotai/computer-use/ocr";

const myAdapter: OcrAdapter = {
  async recognizeText(params: RecognizeTextParams): Promise<OcrResult | null> {
    // params.imagePath — path to the screenshot image
    // params.imageWidth / imageHeight — image dimensions
    // params.signal — AbortSignal for cancellation
    //
    // Return OcrResult with lines array, or null on failure
    return {
      engine: "my-custom-engine",
      imageWidth: params.imageWidth,
      imageHeight: params.imageHeight,
      lines: [
        {
          text: "Detected text",
          confidence: 0.95,
          bbox: { left: 100, top: 200, width: 150, height: 25 },
          center: { x: 175, y: 212 },
        },
      ],
    };
  },
};

Configuration

type ComputerToolOptions = {
  overlay?: {
    enabled?: boolean;  // Show "agent active" overlay (default: true)
    color?: string;     // Hex color without # (default: "00BFFF")
    label?: string;     // Overlay label (default: "Atomic bot")
  };
  debugArtifacts?: boolean;    // Save debug output per action
  debugArtifactsDir?: string;  // Custom debug output directory
  lockDir?: string;            // Custom session lock directory
};

Advanced: Subpath Imports

The library exposes its internal subsystems as separate entry points for advanced use cases.

Individual Actions

import { executeScreenshot, executeClick, executeType } from "@atomicbotai/computer-use/actions";

// Build a custom tool with only the actions you need
const result = await executeScreenshot({
  displayIndex: 0,
  captureSource: "screenshot_full",
  disableDownscale: true,
});

OCR (Standalone)

import {
  recognizeText,
  buildOcrLayout,
  summarizeOcr,
  createOcrAdapter,
} from "@atomicbotai/computer-use/ocr";
import type { OcrResult, OcrAdapter } from "@atomicbotai/computer-use/ocr";

const ocrResult = await recognizeText({
  imagePath: "./screenshot.png",
  imageWidth: 1920,
  imageHeight: 1080,
});

const layout = buildOcrLayout(ocrResult);
const summary = summarizeOcr(ocrResult);

Overlay

import {
  configureOverlay,
  showOverlay,
  forceHideOverlay,
} from "@atomicbotai/computer-use/overlay";

configureOverlay({ enabled: true, color: "FF5500", label: "My Bot" });
await showOverlay();
// ... do work ...
await forceHideOverlay();

Session Lock

import {
  configureLockDir,
  tryAcquire,
  releaseLock,
  isLockHeldLocally,
} from "@atomicbotai/computer-use/session-lock";

const result = await tryAcquire("my-session-id");
if (result.kind === "blocked") {
  console.log(`Blocked by session: ${result.by}`);
} else {
  // ... do work ...
  await releaseLock();
}

Coordinate Mapping

import { mapToScreen, storeCoordMap } from "@atomicbotai/computer-use/coord-mapping";

// Store the coord map string from a screenshot result
storeCoordMap("0,0,1920,1080,1024,768");

// Map screenshot coordinates to screen points
const screenPoint = mapToScreen(512, 384);
// → { x: 960, y: 540 }

Platform Support

| Feature | macOS | Windows | Linux | |---------|:-----:|:-------:|:-----:| | Screenshot | ✅ | ✅ | ✅ | | Click / Type / Press | ✅ | ✅ | ✅ | | Scroll | ✅ | ✅ | ✅ | | OCR | ✅ Vision | ✅ Media OCR | — | | Overlay | ✅ Swift | ✅ PowerShell | — | | Drag (native) | ✅ Swift | ✅ PowerShell | ✅ fallback | | Click animation | ✅ Swift | ✅ PowerShell | — | | Clipboard | ✅ pbcopy | ✅ PowerShell | ✅ xclip |

Architecture

┌─────────────────────────────────────────────────┐
│              createComputerTool()                │
│  ┌───────────────────────────────────────────┐   │
│  │            Action Router                  │   │
│  │  screenshot · click · type · press · ...  │   │
│  └──────────────┬────────────────────────────┘   │
│     ┌───────────┼───────────────┐                │
│     ▼           ▼               ▼                │
│  ┌──────┐  ┌─────────┐  ┌────────────┐          │
│  │ OCR  │  │ Overlay  │  │ Guardrails │          │
│  └──┬───┘  └────┬────┘  └────────────┘          │
│     │           │                                │
│  ┌──┴───────────┴──────────────────────┐         │
│  │    usecomputer (native bridge)      │         │
│  └──┬──────────┬───────────────┬───────┘         │
│     ▼          ▼               ▼                 │
│  macOS      Windows          Linux               │
│  (Swift)    (PowerShell)     (X11/xdotool)       │
└─────────────────────────────────────────────────┘

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@atomicbotai/computer-use

Features

Quick Start

Use with any LLM tool-calling framework

Supported Actions

OCR: Built-in Text Recognition

Why This Matters for AI Agents

How It Works

Platform Engines

Standalone OCR Usage

Custom OCR Adapters

Configuration

Advanced: Subpath Imports

Individual Actions

OCR (Standalone)

Overlay

Session Lock

Coordinate Mapping

Platform Support

Architecture

License