@mseep/clawdcursor

v1.5.5

Published

9 days ago

Local MCP server that gives any AI agent safe cross-OS desktop control — the fallback execution layer for when APIs, CLIs, and direct integrations aren't available. Works with any tool-calling model (Claude, GPT, Gemini, Llama) on Windows, macOS, and Linu

What it is

Clawd Cursor is a local MCP server that gives any tool-calling agent — Claude Code, Cursor, Windsurf, OpenClaw, the Claude Agent SDK, or your own loop — safe control of the real desktop. It clicks, types, reads the screen, opens apps, and drives any GUI the way a human would: native apps, the browser, even a canvas.

Most "let an agent use the computer" tools take a screenshot and feed it to a vision model — slow, expensive, and brittle. Clawd Cursor reads the accessibility tree first (structured text, near-free, no vision model), falls back to OCR, and only reaches for pixels as a last resort. The result is cheaper, faster, private, and — uniquely — it checks that each action actually did what it claimed.

If a human can do it on a screen, your agent can too. No API, no integration, no problem — only the right sequence of reads, clicks, keys, and waits. Use it as the last-mile fallback: native API exists? Use it. CLI? Use it. Clawd Cursor is for the click, the legacy app, the GUI with no public surface.

Why it's different

The desktop-agent space is crowded. The closest install-and-go peers are Windows-MCP and Terminator (desktop MCP servers); browser-only tools (browser-use, Playwright MCP) are adjacent; and OmniParser / UI-TARS are vision-centric parsing approaches you'd build an agent around, not products you install. Here's the honest comparison across those approaches — what Clawd Cursor does that the popular options don't:

| | Clawd Cursor | browser-use | Playwright MCP | OmniParser / UI-TARS | computer-use | |------------------------------------------|:-----------------------:|:-----------:|:--------------:|:--------------------:|:------------:| | Any desktop app, not just the web | ✅ | web only | web only | ✅ | ✅ | | Cross-OS (Windows + macOS + Linux) | ✅ | — | — | varies | sandbox | | Perception without a vision model | ✅ a11y → OCR → vision | DOM | a11y tree | ❌ vision-centric | ❌ vision | | Verifies its own actions (deviation) | ✅ | — | — | — | — | | Single safety chokepoint (allow/confirm/block) | ✅ | — | — | — | — | | Any model / vendor | ✅ | ✅ | not an agent | model-specific | Claude only | | MCP-native (one config, any host) | ✅ | library | test framework | — | tool-use API | | Local-only, no cloud required | ✅ | ✅ | ✅ | needs a model | screens → cloud |

Three things here are genuinely rare:

Cheapest-tier-first perception, fully local. Accessibility tree (free) → OCR (cheap) → screenshot (expensive — the only tier that puts pixels in the model's context; "screenshot" and "vision" are the same step). The agent climbs only when it must, so token cost tracks task difficulty — and with a local model, nothing leaves the machine. Vision-centric agents (OmniParser, UI-TARS) need a screenshot in the model for every observation.
It verifies. Pass expect on a consequential action and Clawd Cursor re-checks the live screen (with a short settle window for async UIs) and reports a DEVIATION instead of a hollow "success." A completed task can't be marked done on evidence that was already true before it acted.
One safety gate. Every call — from an editor over stdio, an external agent over HTTP, or the built-in loop — routes through a single safety.evaluate() chokepoint (allow / confirm / block) before it touches the desktop. The agent cannot bypass it.

Plus: an on-screen "desktop control in progress" banner with a blinking red dot whenever an agent is driving — double-click it to stop. A human at the machine always knows, and always has a kill switch.

60-second Quickstart

Install (any OS):

npm i -g clawdcursor

# Windows (PowerShell)
powershell -c "irm https://clawdcursor.com/install.ps1 | iex"

# macOS / Linux
curl -fsSL https://clawdcursor.com/install.sh | bash

Set up — this is the whole thing for the common case (your agent drives over MCP):

clawdcursor consent --accept   # one-time consent (required) — also registers the
                               # clawdcursor *skill* in your agents (Claude Code,
                               # OpenClaw, Codex, Cursor); re-run: clawdcursor register-skill
clawdcursor grant              # macOS only — approve Accessibility + Screen Recording

Wire it into your editor (Claude Code, Cursor, Windsurf, Zed):

// ~/.claude/settings.json  (or your editor's MCP config)
{
  "mcpServers": {
    "clawdcursor": { "command": "clawdcursor", "args": ["mcp", "--compact"] }
  }
}

That's it. Ask your agent to "open Outlook and reply to the latest email from Sarah" and watch it run.

You never run clawdcursor mcp yourself — the editor spawns it over stdio on demand. clawdcursor doctor is not part of MCP setup; it only configures the built-in LLM for the autonomous agent daemon. On macOS, Accessibility is required (the primary control path); Screen Recording is optional (only the vision fallback needs it).

Editor permission allowlists: use the server-level wildcard "mcp__clawdcursor" rather than per-tool entries — it covers every tool and survives tool renames across versions.

Or install the Claude Code plugin (no hand-edited config)

If you use Claude Code, you can skip the manual mcpServers block above. This repo ships a plugin (.claude-plugin/plugin.json) that registers the MCP server and bundles the usage skill in one step. It launches the server with npx -y clawdcursor mcp --compact, so there's nothing to install first — npx fetches clawdcursor on demand (or uses your global install if you have one), and because it resolves the package's bin (never a hard-coded dist/ path) it can't be broken by an entry-point change on upgrade.

# load the plugin for one session straight from a checkout…
claude --plugin-dir /path/to/clawdcursor
# …or add this repo to a plugin marketplace for a persistent install.

# one-time desktop-control consent (npx fetches the bin if you don't have it):
npx -y clawdcursor consent --accept

Requires Node.js 20+ (for npx, which ships with Node). The first launch downloads clawdcursor into npx's cache; later launches reuse it — no global install and no PATH shim to resolve.

The engine

The perception + verification core (the UI State Compiler, since v1.5.0):

compile_ui fuses the accessibility tree and OCR into one confidence-scored map of the screen, every element tagged with a stable el_NN id. Act on an element by {element_id, snapshot_id} instead of pixels — near-free in tokens, and it survives DPI, resize, and layout shifts. find_button / find_field locate a target by meaning and hand you the id.
Reactive verification. expect on an action → Clawd Cursor confirms the outcome on the live screen and returns a DEVIATION when the UI didn't obey.
Cross-platform parity. The compiler, secure-field redaction, and coordinate handling run on Windows, macOS, and Linux; the external-agent (MCP) surface resolves el_NN refs through the safety gate and discloses when it attached to your existing browser.

Set-of-Mark-style element IDs and a11y/OCR fusion aren't new ideas on their own — what's rare is doing them locally, a11y-first (no vision model required), with a built-in verification gate and one safety chokepoint, across three operating systems, behind a single MCP config.

See the changelog for the full release history (latest: v1.5.2 — perception reliability, honest verification, the control banner).

How it works

Where the brain lives decides how you run it. Both modes can run side-by-side.

| Brain lives… | Mode | Command | What you call | |---|---|---|---| | In your editor (Claude Code, Cursor, Windsurf, Zed) | Direct tools | clawdcursor mcp | Each tool, via stdio MCP | | In a headless agent with its own LLM (OpenClaw, Agent SDK, your loop) | Direct tools | clawdcursor agent --no-llm | Same, over HTTP MCP | | Inside Clawd Cursor itself (scheduled / "submit and walk away") | Thin agent loop | clawdcursor agent + doctor-configured LLM | task / submit_task | | External brain that delegates sub-tasks to the built-in loop | Direct + delegation | clawdcursor agent + your client | task({instruction:…}) to hand off |

The loop

Read the a11y tree (cheap) → act on named targets → verify from fresh observations → escalate perception only when needed (OCR → screenshot, the one tier that sends pixels to the model). Sparse a11y tree? system.detect_webview switches Electron/WebView2 apps to browser.* over CDP. Canvas-only (Paint, Figma, games)? Screenshot + coordinate click.

flowchart TB
    task["User task"] --> loop["Agent LLM loop<br/>plans · chooses tools · verifies"]
    loop --> observe{"Cheapest observation<br/>that answers the question"}

    observe -- "obs·a11y — free" --> a11y["A11y tree<br/>(structured text + el_NN handles)"]
    observe -- "obs·ocr — cheap" --> ocr["OCR (OS-level, no vision LLM)"]
    observe -- "obs·dom — medium" --> dom["Browser DOM (CDP)"]
    observe -- "obs·vision — expensive" --> vision["Screenshot (image into context)"]

    a11y --> act
    ocr --> act
    dom --> act
    vision --> act

    act["Act<br/>click/type/key/drag · invoke/set_value · open_app · batch"] --> safety
    safety["Single safety gate<br/>safety.evaluate() → allow / confirm / block"] -- allowed --> tools["Tool registry<br/>98 granular + 7 compound"]
    safety -- needs user --> confirm["Human confirmation"] --> tools
    safety -- denied --> blocked["blocked"]

    tools --> desktop["Real desktop"]
    desktop --> verify{"expect → does state match?"}
    verify -- pass --> done["done"]
    verify -- "DEVIATION" --> loop

    classDef agentNode fill:#dbeafe,stroke:#2563eb,color:#0f172a;
    classDef gate fill:#ede9fe,stroke:#7c3aed,color:#0f172a;
    classDef obsNode fill:#fef9c3,stroke:#ca8a04,color:#0f172a;
    classDef actNode fill:#ffedd5,stroke:#ea580c,color:#0f172a;
    classDef stop fill:#fee2e2,stroke:#dc2626,color:#0f172a;
    class loop,verify agentNode;
    class safety,confirm,tools gate;
    class observe,a11y,ocr,dom,vision obsNode;
    class act actNode;
    class blocked stop;

batch for deterministic stretches. When the next N steps are known, collapse them into one call — each step still routes through the safety gate; on any guard miss or error the batch halts with a per-step trace.

Task delegation. With an LLM configured on the daemon, an external agent can hand off at any point: task({"instruction":"…"}). The built-in loop takes the wheel and reports back — offload grunt work to a cheaper model without burning your own context.

The toolbox

Two catalogs ship side-by-side. The toolbox is 7 compound tools, each with an action enum covering ~10–20 verbs (~1,500 tokens total — about 12× smaller than granular, the computer_20250124 shape editor hosts already know). The granular surface is the 98 underlying primitives, one schema per verb (for runtimes that need top-level tools, or for debugging). Both run through the same safety.evaluate() chokepoint; the full catalog is always visible via MCP tools/list.

| Toolbox | Actions | |---|---| | computer | screenshot, click, double_click, right_click, triple_click, hover, move, scroll, scroll_horizontal, drag, drag_path, type, key, wait | | accessibility | read_tree, find, get_element, focused, invoke, focus, set_value, get_value, expand, collapse, toggle, select, state, list_children, wait_for, compile_ui, find_button, find_field, smart_click, smart_type, smart_read | | window | list, active, focus, maximize, minimize, restore, close, resize, list_displays, screen_size, open_app, open_file, open_url, switch_tab, navigate | | system | clipboard_read, clipboard_write, system_time, ocr, undo, shortcuts_list, shortcuts_run, delegate, detect_webview, relaunch_with_cdp, system_prompt, build_uri, open_uri, open_app, open_file, open_url, detect_app, app_guide, learn_app | | browser | connect, page_context, read_text, click, type, select_option, evaluate, wait_for, list_tabs, switch_tab, scroll | | task | run (default; bounded-sync — waits up to timeouts, returns {status:"running"} + progress if longer, re-call to keep waiting), status, abort. Delegates to the built-in loop. Requires clawdcursor agent with an LLM. | | batch | {steps:[…]} — collapse N calls into one round-trip; each step {name, arguments, expect?}, re-perceived and safety-gated, halts with a trace on any miss. |

computer({ action: "key", combo: "mod+s" })          // Cmd+S / Ctrl+S, resolved per-OS
accessibility({ action: "invoke", name: "Send" })    // click by name, not pixels
window({ action: "open_app", name: "Outlook" })
task({ instruction: "open Notepad and type hello" }) // hand off to the thin loop

Cheapest-tier-first perception

Every observation has a cost. Start at the cheapest rung that works; climb only when it fails. The live log (CLAWD_LOG=pretty, default on a TTY) shows the ladder in real time via per-call badges.

| Tier | Badge | Cost | Source | When | |---|---|---|---|---| | T1 structured | obs·a11y | ~free | accessibility.*, window.*, browser.read_text, clipboard | Default. Text + bounds, no image, no vision LLM. | | T2 OCR | obs·ocr | cheap | system.ocr, smart_read / smart_click / smart_type | A11y tree empty/sparse. OS-level text out, no image bytes. | | T3 DOM | obs·dom | medium | browser.read_text / page_context (CDP) | WebView / Electron / Chrome content. | | T4 screenshot (vision) | obs·vision | expensive | computer.screenshot | The only tier that puts pixels in the model's context. Canvas-only apps or spatial reasoning. Last resort. |

Acting tools log act. Watching obs·a11y → act → obs·a11y on a normal turn — and the rare climb to obs·vision — is the whole efficiency model, visible.

Transports

One protocol — MCP — two transports, same catalog and JSON-RPC envelope. Both stateless; no session handshake.

| Transport | When | Client config | |---|---|---| | stdio MCP | Editor hosts. Tools appear on demand — no daemon. | {"command":"clawdcursor","args":["mcp","--compact"]} | | HTTP MCP | Headless agents, daemons, orchestration, Agent SDK. POST JSON-RPC to http://127.0.0.1:3847/mcp. | Run clawdcursor agent. Bearer token at ~/.clawdcursor/token. |

# HTTP MCP — list tools
curl -s -X POST http://127.0.0.1:3847/mcp \
  -H "Authorization: Bearer $(cat ~/.clawdcursor/token)" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

Platform support

Platform code lives behind a single PlatformAdapter interface (src/platform/{windows,macos,linux}.ts + wayland-backend.ts). Business logic never reads process.platform.

| Platform | UI Automation | OCR | Browser (CDP) | Input | |---|---|---|---|---| | Windows 10/11 (x64 / ARM64) | UIA via PowerShell bridge | Windows.Media.Ocr | Chrome / Edge | nut-js | | macOS 12+ (Intel / Apple Silicon) | JXA + System Events (TCC-safe) | Apple Vision | Chrome / Edge | nut-js + System Events | | Linux X11 | AT-SPI via python3-gi | Tesseract | Chrome / Edge | nut-js | | Linux Wayland | AT-SPI via python3-gi | Tesseract | Chrome / Edge | ydotool / wtype |

Windows — no setup; the PowerShell bridge spawns on demand.
macOS — first run needs Accessibility (required) + Screen Recording (optional); clawdcursor grant walks the dialogs. Retina/HiDPI handled in-adapter — don't pre-scale coordinates.
Linux X11 — apt install tesseract-ocr python3-gi gir1.2-atspi-2.0.
Linux Wayland — same, plus ydotool + ydotoold (preferred) or wtype (keyboard only).

Safety & privacy

| Tier | Actions | Behavior | |---|---|---| | Allow | Reading, opening apps, navigation, typing into non-sensitive fields, minimize | Executes immediately | | Confirm | Sends, deletes, purchases, transfers, close-window/quit-app & show-desktop key combos, sensitive apps | Pauses for approval (batch({allowConfirm:true}) to authorize) | | Block | Ctrl+Alt+Del, lock / log-out / force-quit / shutdown key sequences | Refused outright (no path) |

Network isolation. Binds to 127.0.0.1. Verify: netstat -an | findstr 3847 (Windows) / | grep 3847 (Unix).
Bearer-token auth on every HTTP request (~/.clawdcursor/token).
Sensitive-app policy. Email, banking, password managers, private messaging auto-elevate to Confirm.
No telemetry by default. Nothing phones home. Screenshots stay in RAM; with a local model nothing leaves the machine; with a cloud provider, screenshots go only to the endpoint you configured. clawdcursor report is opt-in and previews exactly what it sends.
Prompt-injection defense. Screen text is returned inside <untrusted-screen-content> tags — data, never instructions.
Log privacy. Logs redact password-field values (AXSecureTextField, UIA IsPassword=true).

See SECURITY.md for private vulnerability reporting.

Architecture

| Directory | What lives here | |---|---| | src/core/ | Thin agent loop (runAgent), sense layer (a11y / snapshot / fingerprint / UI compiler), reactive verification, focus guard, safety gate. | | src/tools/ | 98 granular tools + 7 compound aggregators + batch, playbooks, registry, dispatch. | | src/platform/ | PlatformAdapter + Windows / macOS / Linux / Wayland, OCR engine, CDP driver, URI handler. | | src/llm/ | Provider clients (Claude, GPT, Gemini, Llama, Kimi, Ollama, …), credentials, model config. | | src/surface/ | CLI, MCP server (stdio + HTTP), dashboard, doctor, onboarding, control banner. |

The PlatformAdapter is the only thing platform code talks to; safety.evaluate() is the only way tools execute. Those two seams are the whole point.

CLI

For humans diagnosing an install. Agents connect via MCP.

clawdcursor consent         Manage desktop-control consent (--accept / --revoke / --status)
clawdcursor grant           Grant macOS permissions (interactive, macOS only)
clawdcursor doctor          Configure the AI provider for `agent` mode (+ diagnostics)
clawdcursor status          Readiness check (consent, permissions, AI config)
clawdcursor mcp             stdio MCP server — editor hosts spawn this; you don't
clawdcursor agent           Daemon: HTTP MCP on :3847, optional built-in thin loop
clawdcursor agent --no-llm  Daemon, tool surface only (no built-in brain)
clawdcursor stop            Stop every running mode
clawdcursor uninstall       Remove all config and data

Options:  --port <n> (default 3847) · --compact · --no-banner · --provider <name> · --accept

Development

git clone https://github.com/AmrDab/clawdcursor.git && cd clawdcursor
npm install
npm run build       # tsc + postbuild  →  dist/surface/cli.js
npm test            # vitest (1,000+ tests)
npm run lint        # eslint
npm link            # global `clawdcursor` shim (Admin shell on Windows)

Tests run on Node 20 & 22 against Ubuntu, macOS, and Windows in CI, plus a coverage ratchet, a perf tripwire, and an npm audit gate.

Tech: TypeScript · Node 20+ · nut-js · Playwright · sharp · Express · Model Context Protocol SDK · Zod · commander.

Contributing

PRs welcome — see CONTRIBUTING.md for the dev loop, branch conventions, and the test matrix every change clears. Bugs and features in issues; private security reports via SECURITY.md.

License

MIT — see LICENSE.

Acknowledgments

Built on the Model Context Protocol SDK, nut-js, Playwright, the Anthropic computer_20250124 tool shape, and the AT-SPI / UIA / AX trees that make app-agnostic GUI automation possible at all.