@vortex-os/computer-use

v0.7.4

Published

a day ago

dydan77

@vortex-os/computer-use

Read-only screen perception for VortEX agents, exposed as an MCP server. It lets an agent see what is on screen — read a window's structure, capture a region as an image, and watch for on-screen changes — without ever moving the mouse or typing. It layers on @vortex-os/base but also works standalone.

Status: Windows-first, read-only. Mouse/keyboard control is intentionally out of scope — this package only perceives (and speaks). macOS/Linux backends are not yet implemented.

What it is

An MCP (Model Context Protocol) server that exposes eleven tools over stdio (nine perception, plus beep/speak for output):

| Tool | What it does | Cost |
|---|---|---|
| probe | Reports whether this environment can perceive the screen (displays, DPI, capture latency). Never captures real screen content. | ~0 |
| read_ui | Reads the active/target window as a structured accessibility tree (UI Automation): element roles, coordinates, text. No image. | ~0 image tokens |
| classify_activity | Classifies the on-screen activity (game / dev / media / browsing / productivity) so a companion can branch its help. | metadata |
| capture_screen | Pixel capture (PNG) for what structure can't reach: canvases, games, remote desktops. Target by window, region, monitor, or cursor box. | image |
| watch_capture | Captures N frames at an interval in one process; with changeOnly, keeps only changed frames. | image(s) |
| poll_change | One non-blocking "did it change?" probe; returns a change percentage and (optionally) an image. Poll it on an interval to watch without blocking. | metadata, image optional |
| start_watch | Watch a fixed target in the background (non-blocking) with a built-in noise filter that keeps only meaningful, settled changes. Works on video/games that change every frame. | runs in background |
| get_events | Collect the buffered changes a start_watch has accumulated, batched (a few looks for a long watch); each event carries the settled frame. | metadata + image(s) |
| stop_watch | Stop a background watch and discard its buffer. | none |
| beep | A system beep, to get the user's attention while they look elsewhere. | none |
| speak | Speaks a short utterance locally (built-in Windows voice, or the optional Supertonic neural voice). | none |

The design favors structure first, pixels as fallback: read_ui is cheap and precise for ordinary apps; capture_screen is for content that has no accessibility tree (games, custom canvases).

Watching for changes (the noise filter + event buffer)

poll_change is the manual primitive — you poll it on a loop. start_watch is the hands-off version: it runs a background loop with a noise filter and buffers what matters, so you can watch a busy screen without drowning in frames.

The problem it solves: on video, games, or scrolling, the screen changes every frame, so raw change-detection fires constantly on the ripples of one activity. The filter combines debounce (wait for motion to settle, then capture a clean frame — quality) with cooldown (at most one event per N seconds — frequency) and hysteresis (ambient jitter never even wakes it), plus an anti-starvation maxWait so a continuously-moving screen still yields periodic snapshots instead of going silent.
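
The way these four mechanisms interact can be sketched as a small state machine. This is an illustrative sketch only, not the package's implementation: the option names match the ones documented here and the 8% activityThreshold is the documented default, but every other default value and the `createNoiseFilter` / `feed` API are assumptions made for the example.

```javascript
// Hypothetical sketch of a debounce + cooldown + hysteresis + maxWait filter.
// Only the option names and the 8% activityThreshold default come from this README;
// the other defaults and the feed() API are assumptions.
function createNoiseFilter(options = {}) {
  const cfg = {
    activityThreshold: 8, // % change that wakes the filter (documented default)
    quietThreshold: 2,    // % change treated as "quiet" (assumed value)
    debounceQuietMs: 500, // how long it must stay quiet to count as settled (assumed)
    cooldownMs: 5000,     // minimum spacing between events (assumed)
    maxWaitMs: 10000,     // anti-starvation snapshot interval (assumed)
    ...options,
  };
  let active = false;
  let activeSince = 0;
  let quietSince = null;
  let lastEventAt = -Infinity;

  // Cooldown: drop the event if the previous one was too recent.
  function emit(reason, nowMs) {
    if (nowMs - lastEventAt < cfg.cooldownMs) return null;
    lastEventAt = nowMs;
    return { reason, atMs: nowMs };
  }

  // Feed one change measurement; returns an event object or null.
  return function feed(changePct, nowMs) {
    if (!active) {
      // Hysteresis: ambient jitter below activityThreshold never wakes the filter.
      if (changePct >= cfg.activityThreshold) {
        active = true;
        activeSince = nowMs;
        quietSince = null;
      }
      return null;
    }
    if (changePct <= cfg.quietThreshold) {
      // Debounce: wait for the quiet period before capturing a settled frame.
      if (quietSince === null) quietSince = nowMs;
      if (nowMs - quietSince >= cfg.debounceQuietMs) {
        active = false;
        return emit("settled", nowMs);
      }
    } else {
      quietSince = null; // motion resumed, restart the quiet timer
    }
    // maxWait: a screen that never settles still yields periodic snapshots.
    if (nowMs - activeSince >= cfg.maxWaitMs) {
      activeSince = nowMs;
      return emit("maxWait", nowMs);
    }
    return null;
  };
}
```

In this sketch a scene cut followed by half a second of quiet produces one "settled" event, while a screen that keeps moving produces a "maxWait" event every maxWaitMs instead of going silent.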

start_watch { region|window|monitor, watchId? }   -> returns immediately, watch runs in the background
   ... do other things ...
get_events  { watchId? }                           -> the settled changes so far (frames + metadata), batched
stop_watch  { watchId? }                           -> end it (omit watchId to stop all)

Calibration. The defaults assume a meaningful target: a playing video jitters ~2.5–4% frame-to-frame and a scene cut jumps ~16% — so activityThreshold (default 8%) sits between them, ignoring the jitter and reporting the cut. Because the change metric is whole-frame, a tiny local change (a clock, a toast) is a small fraction of the frame; target the region where the change happens so it reads as a large change. Tune activityThreshold / quietThreshold / debounceQuietMs / cooldownMs / maxWaitMs per call. The buffer is memory-only (no screen history on disk) with count, byte, and 5-minute TTL caps; watches auto-stop after 30 minutes.
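
A quick worked example of why the target region matters. It assumes the change metric is simply the share of changed pixels in the watched area, which is a simplification for illustration; the clock size and screen size are made-up numbers, and `changePercent` is a hypothetical helper, not part of the package.

```javascript
// Assumption: change is measured as the percentage of pixels that changed
// within the watched area. The real metric may differ.
function changePercent(changedPixels, regionWidth, regionHeight) {
  return (changedPixels / (regionWidth * regionHeight)) * 100;
}

const ACTIVITY_THRESHOLD = 8; // documented default, in percent

// Hypothetical 200x50 clock widget that repaints completely.
const clockPixels = 200 * 50;

// Watching the whole 1920x1080 monitor: the clock is lost in the frame.
const fullFramePct = changePercent(clockPixels, 1920, 1080); // about 0.48

// Watching a 400x100 region around the clock: the same repaint is a large change.
const tightRegionPct = changePercent(clockPixels, 400, 100); // 25

console.log(fullFramePct >= ACTIVITY_THRESHOLD);   // false
console.log(tightRegionPct >= ACTIVITY_THRESHOLD); // true
```

The same repaint goes from well below the threshold to well above it purely by narrowing the watched region.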

Reflex alerts — sub-second voice, no cloud round-trip

get_events is the brain path (the agent looks, judges, and replies — seconds, because it makes a cloud LLM call). For things you need to hear the instant they happen, start_watch takes triggers that fire locally, with no LLM in the loop, so the alert reaches you in well under a second:

start_watch {
  window: "MyGame",
  triggers: [
    { action: "beep",  threshold: 20 },                          // a sound
    { action: "say",   threshold: 20, say: "적 출현" },           // speak a FIXED phrase (Korean TTS; means "Enemy spotted")
    { action: "ocr",   threshold: 12, dwellMs: 700 }             // OCR the region and read the text aloud
  ]
}

A trigger fires the moment the watched region changes past its threshold (with hysteresis + a per-trigger cooldownMs). The fast alert is the reflex; the agent's judged commentary still follows on the next get_events. Speaking uses the built-in Windows voice (System.Speech); reading text uses the built-in offline OCR (no install, no GPU).
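
As a rough mental model of threshold, hysteresis, and per-trigger cooldown, here is a hedged sketch. The trigger fields `action`, `threshold`, and `cooldownMs` are the ones named above; the re-arm rule (half the threshold), the 3-second default cooldown, and the `createTriggerEvaluator` helper are assumptions for illustration, and the sketch ignores `dwellMs`.

```javascript
// Hypothetical edge-triggered evaluation of reflex triggers.
// Assumptions: a trigger re-arms once change falls below half its threshold,
// and the default cooldown is 3000 ms. dwellMs is not modelled here.
function createTriggerEvaluator(triggers, defaultCooldownMs = 3000) {
  const state = triggers.map(() => ({ armed: true, lastFiredAt: -Infinity }));

  // Returns the list of actions that fire for this measurement.
  return function evaluate(changePct, nowMs) {
    const fired = [];
    triggers.forEach((trigger, i) => {
      const s = state[i];
      const cooldownMs = trigger.cooldownMs ?? defaultCooldownMs;
      // Hysteresis: only re-arm after the change has clearly dropped away.
      if (changePct < trigger.threshold * 0.5) s.armed = true;
      const cooledDown = nowMs - s.lastFiredAt >= cooldownMs;
      if (s.armed && changePct >= trigger.threshold && cooledDown) {
        s.armed = false;
        s.lastFiredAt = nowMs;
        fired.push(trigger.action);
      }
    });
    return fired;
  };
}
```

With the beep (threshold 20) and ocr (threshold 12) triggers from the example above, a 15% change fires only ocr, and a sustained 25% change fires beep once rather than on every frame.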

Safety (this matters). OCR text is screen content, so it is untrusted. It is never spoken raw: it gets a spoken provenance prefix (화면 글자: …, Korean for "screen text: …"), control/secret-token shaping, and a global speech budget (capped utterances and seconds per minute, no overlapping speech, auto-mute on sustained noise) so an on-screen string can never voice fake instructions or flood you. A fixed say phrase is the safest action (you author the words; a trigger only controls when). If the voice/OCR engine isn't available, triggers degrade quietly (a beep still works).
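
The three layers described above (prefix, shaping, budget) could look roughly like the following. This is a sketch under stated assumptions, not the package's code: the prefix string is the documented one, but the shaping rules (strip control characters, mask long token-like runs, cap the length), the budget of six utterances per minute, and all function names are hypothetical.

```javascript
// Hypothetical shaping of untrusted OCR text before it is spoken.
function shapeUntrustedText(text, maxChars = 120) {
  return text
    .replace(/[\u0000-\u001f\u007f]/g, " ")          // drop control characters
    .replace(/[A-Za-z0-9_-]{24,}/g, "[redacted]")    // mask long token-like runs (assumed heuristic)
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, maxChars);
}

// Hypothetical sliding-window budget: at most N utterances per 60 seconds.
function createSpeechBudget(maxUtterancesPerMinute = 6) {
  const spokenAt = [];
  return function tryAcquire(nowMs) {
    while (spokenAt.length && nowMs - spokenAt[0] >= 60000) spokenAt.shift();
    if (spokenAt.length >= maxUtterancesPerMinute) return false;
    spokenAt.push(nowMs);
    return true;
  };
}

// Returns the utterance to speak, or null if it should stay silent.
function prepareOcrUtterance(ocrText, tryAcquire, nowMs) {
  const shaped = shapeUntrustedText(ocrText);
  if (!shaped || !tryAcquire(nowMs)) return null;
  return `화면 글자: ${shaped}`; // documented provenance prefix ("screen text: ...")
}
```

The point of the prefix is that whatever the screen says is always framed as quoted screen content, never as the companion's own words.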

Optional: higher-quality neural voice (Supertonic) + audio ducking

The default voice is the built-in Windows one (System.Speech / Heami) — zero install, but robotic. For a much more natural voice, install Supertonic 3 (Supertone): an on-device ONNX neural TTS (Korean + 30 more languages; code MIT, weights OpenRAIL-M — commercial use OK). One-time model download (~380 MB), then fully offline and fast (~0.5 s/sentence on CPU):

node scripts/fetch-supertonic.mjs       # downloads models to ~/.vortex/computer-use/supertonic-3
npm i onnxruntime-node                   # the runtime (optionalDependency)

Once the models are present, the speak path uses them automatically (engine: "auto") and falls back to Heami if anything is missing — it never goes mute.

Audio ducking. While the companion speaks, other apps' audio (game / music / video) is briefly lowered per-app and restored exactly when it finishes, so the voice stands out. On by default. DRM-protected audio (e.g. Netflix) cannot be ducked — that protected path bypasses Windows volume control; normal app/game audio ducks fine.

Configure in your instance-root computer-use.config.json (tts section — see Privacy & redaction for placement) or via env (env wins). Defaults shown:

{ "tts": { "engine": "auto", "voice": "F1", "speed": 1.05, "duck": true, "duckFactor": 0.3 } }

engine: auto | supertonic | heami
voice: F1..F5 / M1..M5 (Supertonic only; the built-in Windows voice picks by system language)
speed: rate multiplier (~1.0 = normal, higher = faster; clamped to 0.5..2.0, applied to both the neural and built-in voices)
duckFactor: 0..1, clamped (other apps drop to this fraction; lower = quieter)
Env: VORTEX_CU_TTS_ENGINE / VORTEX_CU_TTS_VOICE / VORTEX_CU_TTS_SPEED / VORTEX_CU_DUCK=off / VORTEX_CU_DUCK_FACTOR
Restart the server after changing.
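
To make the precedence and clamping concrete, here is a minimal sketch of how file config, env overrides, and clamps could combine. The defaults, env variable names, and clamp ranges are the documented ones; the `resolveTtsConfig` function and its fallback behaviour for non-numeric input are assumptions, not the package's actual loader.

```javascript
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical resolver: defaults < config file "tts" section < env (env wins).
function resolveTtsConfig(fileTts = {}, env = {}) {
  const defaults = { engine: "auto", voice: "F1", speed: 1.05, duck: true, duckFactor: 0.3 };
  const merged = { ...defaults, ...fileTts };

  if (env.VORTEX_CU_TTS_ENGINE) merged.engine = env.VORTEX_CU_TTS_ENGINE;
  if (env.VORTEX_CU_TTS_VOICE) merged.voice = env.VORTEX_CU_TTS_VOICE;
  if (env.VORTEX_CU_TTS_SPEED) merged.speed = Number(env.VORTEX_CU_TTS_SPEED);
  if (env.VORTEX_CU_DUCK === "off") merged.duck = false;
  if (env.VORTEX_CU_DUCK_FACTOR) merged.duckFactor = Number(env.VORTEX_CU_DUCK_FACTOR);

  // Assumption: non-numeric values fall back to the defaults.
  if (!Number.isFinite(merged.speed)) merged.speed = defaults.speed;
  if (!Number.isFinite(merged.duckFactor)) merged.duckFactor = defaults.duckFactor;

  merged.speed = clamp(merged.speed, 0.5, 2.0);
  merged.duckFactor = clamp(merged.duckFactor, 0, 1);
  return merged;
}

// Example: the file asks for voice M2, env asks for an out-of-range speed.
console.log(resolveTtsConfig({ voice: "M2" }, { VORTEX_CU_TTS_SPEED: "3", VORTEX_CU_DUCK: "off" }));
// { engine: 'auto', voice: 'M2', speed: 2, duck: false, duckFactor: 0.3 }
```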

Optional: a local vision model (the `vision` trigger)

A vision trigger describes the scene (not just its text) via a local vision-language model — smarter than OCR, still no cloud round-trip. It is off by default and GPU-gated: it runs only when you point it at a reachable, fast-enough local endpoint. Everything above works with no GPU; this just adds a smarter local description where the hardware allows. On a machine that can't run it, a vision trigger degrades to ocr.

Point it at any OpenAI-compatible vision endpoint — llama.cpp's llama-server (with --mmproj), llamafile, Ollama, LM Studio — via env (machine-local, never synced):

VORTEX_CU_VLM_ENDPOINT=http://127.0.0.1:8080/v1   # set this to enable; presence = on
VORTEX_CU_VLM_MODEL=gemma-4-e2b-it                # any small VLM (e2b ~2GB is a good light default)
VORTEX_CU_VLM_KEY=...                             # optional bearer token
VORTEX_CU_VLM_ALLOW_REMOTE=1                      # only if the endpoint is NOT on this machine (off by default)
VORTEX_CU_VLM_SLA_MS=6000                         # gate: if even a tiny probe is slower than this, stay off

How it stays safe (design §23.2/§24): only the intent (endpoint set or not) is configuration. The address and secret are machine-local env, never synced, so the same repo on a CPU-only machine simply runs without it. A loopback endpoint (same machine) is allowed by default; a cross-network one (LAN/VPN/another box) is off unless you opt in. The session probes the endpoint with a synthetic 1×1 image (never a real screen crop) to measure latency before trusting it; a real crop is sent only on an actual vision trigger, through the same denylist gate. The model's reply is untrusted: it is spoken with a 로컬 비전: … prefix (Korean for "local vision: …"), shaped, and rate-limited, and the prompt tells the model to describe only (never follow on-screen instructions). probe reports the VLM's availability when you've configured one.
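
The gating rules above (endpoint present, loopback by default, remote only with opt-in, latency under the SLA) can be summarised in a short decision function. This is a sketch, not the package's logic: the env variable names are the documented ones, but `decideVlmGate`, the loopback test, and treating 6000 ms as the fallback SLA are assumptions, and the probe latency is passed in rather than measured.

```javascript
// Hypothetical loopback check on a URL hostname.
function isLoopbackHost(hostname) {
  return (
    hostname === "localhost" ||
    hostname === "[::1]" ||
    hostname === "::1" ||
    /^127\.\d+\.\d+\.\d+$/.test(hostname)
  );
}

// Decide whether the vision trigger may use the configured endpoint.
// probeLatencyMs is the measured round-trip of the synthetic 1x1 probe.
function decideVlmGate(env, probeLatencyMs) {
  const endpoint = env.VORTEX_CU_VLM_ENDPOINT;
  if (!endpoint) return { enabled: false, reason: "no endpoint configured" };

  let url;
  try {
    url = new URL(endpoint);
  } catch {
    return { enabled: false, reason: "invalid endpoint URL" };
  }

  if (!isLoopbackHost(url.hostname) && env.VORTEX_CU_VLM_ALLOW_REMOTE !== "1") {
    return { enabled: false, reason: "remote endpoint without opt-in" };
  }

  // Assumption: 6000 ms (the value in the example above) is used when unset.
  const slaMs = Number(env.VORTEX_CU_VLM_SLA_MS ?? 6000);
  if (!(probeLatencyMs <= slaMs)) {
    return { enabled: false, reason: "probe slower than SLA" };
  }
  return { enabled: true, reason: "ok" };
}
```

When the gate returns `enabled: false`, the documented behaviour is that a vision trigger degrades to ocr.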

Adaptive companion — classify the activity, branch the help

classify_activity lets the companion work out what you're doing from a single metadata read and adapt, instead of being told. One read-only call returns:

{ "class": "GAME", "process": "eldenring", "title": "...", "notificationState": "BUSY",
  "interruptible": false, "canvas": true, "uiaCount": 1, "fullscreen": true,
  "profile": { "proactive": true, "cadenceSec": 30, "mode": "periodic" }, "needsChangeRate": true }

It combines cheap signals — foreground process + window title, the Windows interruptibility state (SHQueryUserNotificationState), UIA element count (a near-empty tree on a screen-filling window = a GPU game/video canvas), and whether it fills the screen — into a class: GAME · DEV · MEDIA · BROWSING · PRODUCTIVITY · UNKNOWN, each with a help profile.

The profiles branch the behavior (full design in docs/adaptive-companion.md):

| Class | Default | Speaks when | Cadence |
|---|---|---|---|
| Strategy / sim game | proactive | quiet stretch (your turn) + risk/opportunity | ~30 s during active play |
| Fast-action game | break-gated | a menu/pause/death screen opens | one cue per break; says up front it can't coach mid-fight |
| Software dev | silent | an error/failed build/stack trace appears | event-driven, not periodic |
| Media / browsing / docs | silent | only on request | never proactive |

For a GAME, needsChangeRate tells the agent to take a couple of poll_change reads to split fast-action (too fast to coach — break-gated only) from strategy (coachable, periodic). Honesty is built in: it never pretends to coach a game it can't follow, and it won't talk over media. The interruptibility state gates every utterance, on top of the global speech budget. Explicit user requests ("tell me when X happens", "be quieter") layer on as reflex triggers / cadence overrides.
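
Two of the heuristics described in this section, the canvas check and the fast-action vs strategy split, are simple enough to sketch. Both functions are hypothetical illustrations: the cutoff values (`uiaCanvasMax = 3`, a 10% mean change rate) are invented for the example, and the real classifier combines more signals than this.

```javascript
// A near-empty UIA tree on a screen-filling window suggests a GPU canvas.
// uiaCanvasMax default here is an assumed value, not the package's.
function looksLikeCanvas(uiaCount, fullscreen, uiaCanvasMax = 3) {
  return fullscreen && uiaCount <= uiaCanvasMax;
}

// Split a GAME into fast-action vs strategy from a few poll_change readings.
// changeSamples are change percentages; fastActionPct is an assumed cutoff.
function classifyGamePace(changeSamples, fastActionPct = 10) {
  if (changeSamples.length === 0) return "unknown";
  const mean = changeSamples.reduce((sum, pct) => sum + pct, 0) / changeSamples.length;
  return mean >= fastActionPct ? "fast-action" : "strategy";
}

console.log(looksLikeCanvas(1, true));       // true  (matches the sample output above)
console.log(classifyGamePace([14, 18, 12])); // "fast-action": break-gated cues only
console.log(classifyGamePace([1, 3, 2]));    // "strategy": periodic coaching
```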

Tune it in your instance-root computer-use.config.json (companion section): uiaCanvasMax (the canvas cutoff) and per-class profiles (e.g. GAME.cadenceSec: 20 for chattier coaching). Env: VORTEX_CU_UIA_CANVAS_MAX.

What it is NOT

Not control. No clicking, typing, or app automation. Perception only.
Not real-time for judgment. Reflex triggers deliver a sub-second beep / fixed phrase / OCR readout, but anything the agent has to think about (a judged message) is seconds-scale — it makes a cloud call. Good for alerts, translation, and watching-alongside; not for reflex-speed decisions.
Not comprehensive secret protection. See Privacy & redaction below — the denylist is the real control; field-level masking is best-effort and does not catch plaintext secrets sitting in arbitrary windows.
Not cross-platform yet. Windows only (for now).

Install

npm i @vortex-os/computer-use

Peer dependency: @vortex-os/base (>=0.3.0 <1.0.0). The MCP SDK (@modelcontextprotocol/sdk) is an optional dependency, loaded only when the server runs. No native build step.

Register the MCP server

The package ships a vortex-mcp-computer-use bin that launches the stdio server. Register it with your agent host. For Claude Code, add it to .mcp.json:

{
  "mcpServers": {
    "vortex-computer-use": {
      "command": "npx",
      "args": ["vortex-mcp-computer-use"]
    }
  }
}

Do not name the server computer-use: some hosts reserve that exact name and will silently skip a server registered under it. Use a different name such as vortex-computer-use. MCP servers load at session start, so restart the agent after adding it.

Privacy & redaction

Whatever you point this at is sent to your AI model. Two controls reduce accidental exposure; both run in the backend before any pixels or text reach the model:

Denylist (the primary control). List window titles or process names that must never be captured. If a listed window is visible anywhere inside a capture region, the whole capture is refused ({ "redacted": true } — no image, no text). This is the reliable defense against accidentally capturing a password manager or banking window during a watch.
Password-field masking. In read_ui, fields the OS reports as password inputs are dropped (no value, no text, children not traversed).

Copy computer-use.config.example.json to computer-use.config.json in your instance root (the folder you launch the agent from — i.e. next to .mcp.json), or point VORTEX_CU_CONFIG at an explicit path, to configure the denylist; or set VORTEX_CU_DENY_TITLES / VORTEX_CU_DENY_PROCS (JSON arrays). Do not put it inside node_modules — that is wiped on every reinstall. The denylist is read once at startup — restart the server after changing it.
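
The denylist rule ("if a listed window is visible anywhere inside the capture region, refuse the whole capture") amounts to a rectangle-overlap test. The sketch below is illustrative only: the env variable names and the `{ redacted: true }` refusal are documented, while the window object shape, the case-insensitive substring match on titles, and the function names are assumptions.

```javascript
// Parse an env value that is documented to be a JSON array; anything else yields [].
function parseJsonArrayEnv(value) {
  if (!value) return [];
  try {
    const parsed = JSON.parse(value);
    return Array.isArray(parsed) ? parsed.map(String) : [];
  } catch {
    return [];
  }
}

// Axis-aligned rectangle overlap; rects are { x, y, width, height }.
function rectsOverlap(a, b) {
  return (
    a.x < b.x + b.width && b.x < a.x + a.width &&
    a.y < b.y + b.height && b.y < a.y + a.height
  );
}

// visibleWindows: hypothetical list of { title, process, bounds }.
function checkCapture(region, visibleWindows, env) {
  const denyTitles = parseJsonArrayEnv(env.VORTEX_CU_DENY_TITLES).map((s) => s.toLowerCase());
  const denyProcs = parseJsonArrayEnv(env.VORTEX_CU_DENY_PROCS).map((s) => s.toLowerCase());

  const blocked = visibleWindows.some((win) => {
    if (!rectsOverlap(region, win.bounds)) return false;
    const title = win.title.toLowerCase();
    return (
      denyTitles.some((t) => title.includes(t)) || // assumed: substring match
      denyProcs.includes(win.process.toLowerCase())
    );
  });
  // Refuse the whole capture: no image, no text.
  return { redacted: blocked };
}
```

Note that the check is on overlap, not containment: a denylisted window that only partly intrudes into the region still blocks the capture.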

Honest limits. This is not comprehensive secret-scanning. A plaintext token shown in a text editor or terminal (not a password field, not a denylisted window) will still be captured. Pixel-level password masking is intentionally out of scope. Capture images are volatile — held only long enough to send, then deleted; they are never written to disk persistently.

Audit

Each perception call appends one metadata line (timestamp, tool, output size, a keyed HMAC of the output, and an HMAC of the window title) to a daily JSONL log under your user-local app data (%LOCALAPPDATA%\vortex-computer-use\audit\) — outside the synced instance data. No raw images and no plaintext window titles are stored. If the audit key can't be set up, perception still works and a warning is printed.

Verify

npm run verify        # node scripts/verify.mjs — needs a desktop session; captures the real screen
npm run test:filter   # node scripts/test-noise-filter.mjs — pure unit tests, no screen needed
npm run test:speech   # node scripts/test-speech-safety.mjs — pure unit tests (provenance/shaping/budget)
npm run test:vlm      # node scripts/test-vlm.mjs — pure unit tests (config/trust-tier/protocol)

verify exercises every tool plus the redaction/audit gate (denylist blocking across all capture modes, no over-block, no title leak, audit written with no plaintext). The test:* scripts check the noise-filter, speech-safety, and VLM-protocol logic deterministically (no screen/audio/network). Three live harnesses drive the real screen: node scripts/verify-watch.mjs (background watch: settle → event → frame, denylist blindness, cleanup), node scripts/verify-reflex.mjs (reflex triggers → local speech, rendered to WAV so it stays silent), and node scripts/verify-vlm.mjs (the vision path against a mock local endpoint: synthetic-probe, real-crop only on a trigger, degrade-to-OCR).
