@vortex-os/computer-use
v0.7.4
Read-only screen perception for VortEX agents, exposed as an MCP server. It lets an agent see what is on screen — read a window's structure, capture a region as an image, and watch for on-screen changes — without ever moving the mouse or typing. It layers on @vortex-os/base but also works standalone.
Status: Windows-first, read-only. Mouse/keyboard control is intentionally out of scope — this package only perceives (and speaks). macOS/Linux backends are not yet implemented.
What it is
An MCP (Model Context Protocol) server that exposes eleven tools over stdio (nine perception, plus beep/speak for output):
| Tool | What it does | Cost |
|---|---|---|
| probe | Reports whether this environment can perceive the screen (displays, DPI, capture latency). Never captures real screen content. | ~0 |
| read_ui | Reads the active/target window as a structured accessibility tree (UI Automation): element roles, coordinates, text. No image. | ~0 image tokens |
| classify_activity | Classifies the on-screen activity (game / dev / media / browsing / productivity) so a companion can branch its help. | metadata |
| capture_screen | Pixel capture (PNG) for what structure can't reach — canvases, games, remote desktops. Target by window, region, monitor, or cursor box. | image |
| watch_capture | Captures N frames at an interval in one process; with changeOnly, keeps only changed frames. | image(s) |
| poll_change | One non-blocking "did it change?" probe; returns a change percentage and (optionally) an image. Poll it on an interval to watch without blocking. | metadata, image optional |
| start_watch | Watch a fixed target in the background (non-blocking) with a built-in noise filter that keeps only meaningful, settled changes. Works on video/games that change every frame. | runs in background |
| get_events | Collect the buffered changes a start_watch has accumulated — batched (a few looks for a long watch); each event carries the settled frame. | metadata + image(s) |
| stop_watch | Stop a background watch and discard its buffer. | — |
| beep | A system beep, to get the user's attention while they look elsewhere. | — |
| speak | Speaks a short utterance locally (built-in Windows voice, or the optional Supertonic neural voice). | — |
The design favors structure first, pixels as fallback: read_ui is cheap and precise for ordinary apps; capture_screen is for content that has no accessibility tree (games, custom canvases).
Watching for changes (the noise filter + event buffer)
poll_change is the manual primitive — you poll it on a loop. start_watch is the hands-off version: it runs a background loop with a noise filter and buffers what matters, so you can watch a busy screen without drowning in frames.
The problem it solves: on video, games, or scrolling, the screen changes every frame, so raw change-detection fires constantly on the ripples of one activity. The filter combines debounce (wait for motion to settle, then capture a clean frame — quality) with cooldown (at most one event per N seconds — frequency) and hysteresis (ambient jitter never even wakes it), plus an anti-starvation maxWait so a continuously-moving screen still yields periodic snapshots instead of going silent.
start_watch { region|window|monitor, watchId? } -> returns immediately, watch runs in the background
... do other things ...
get_events { watchId? } -> the settled changes so far (frames + metadata), batched
stop_watch { watchId? } -> end it (omit watchId to stop all)
Calibration. The defaults assume a meaningful target: a playing video jitters ~2.5–4% frame-to-frame and a scene cut jumps ~16%, so activityThreshold (default 8%) sits between them, ignoring the jitter and reporting the cut. Because the change metric is whole-frame, a tiny local change (a clock, a toast) is a small fraction of the frame; target the region where the change happens so it reads as a large change. Tune activityThreshold / quietThreshold / debounceQuietMs / cooldownMs / maxWaitMs per call. The buffer is memory-only (no screen history on disk) with count, byte, and 5-minute TTL caps; watches auto-stop after 30 minutes.
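The interplay of debounce, cooldown, hysteresis, and maxWait can be sketched as a small state machine fed with one change percentage per sample. This is an illustration under stated assumptions, not the package's implementation: the class name `NoiseFilter`, the `feed` method, the event shape, and every default except `activityThreshold` (8%) are hypothetical.

```javascript
// Hypothetical sketch of a debounce + cooldown + hysteresis + maxWait filter.
class NoiseFilter {
  constructor(options = {}) {
    this.opts = {
      activityThreshold: 8,  // % change that wakes the filter (documented default)
      quietThreshold: 2,     // % change counted as "settled" (assumed value)
      debounceQuietMs: 1000, // quiet time required before capturing (assumed value)
      cooldownMs: 5000,      // minimum gap between events (assumed value)
      maxWaitMs: 15000,      // anti-starvation snapshot interval (assumed value)
      ...options,
    };
    this.active = false;
    this.activeSince = 0;
    this.quietSince = null;
    this.lastEventAt = null;
  }

  // Feed one sample; returns an event object or null.
  feed(changePct, nowMs) {
    const o = this.opts;
    if (!this.active) {
      // Hysteresis: jitter below activityThreshold never wakes the filter.
      if (changePct >= o.activityThreshold) {
        this.active = true;
        this.activeSince = nowMs;
        this.quietSince = null;
      }
      return null;
    }
    if (changePct <= o.quietThreshold) {
      // Debounce: wait until the screen has stayed quiet long enough.
      if (this.quietSince === null) this.quietSince = nowMs;
      if (nowMs - this.quietSince >= o.debounceQuietMs) {
        this.active = false;
        return this.emit("settled", nowMs);
      }
      return null;
    }
    // Still moving: reset the quiet timer and check the anti-starvation limit.
    this.quietSince = null;
    if (nowMs - this.activeSince >= o.maxWaitMs) {
      this.activeSince = nowMs;
      return this.emit("maxWait", nowMs);
    }
    return null;
  }

  // Cooldown: at most one event per cooldownMs.
  emit(reason, nowMs) {
    if (this.lastEventAt !== null && nowMs - this.lastEventAt < this.opts.cooldownMs) {
      return null;
    }
    this.lastEventAt = nowMs;
    return { reason, at: nowMs };
  }
}
```

In this sketch a second settle that arrives inside the cooldown window is dropped rather than queued; a real implementation might prefer to defer it.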
Reflex alerts — sub-second voice, no cloud round-trip
get_events is the brain path (the agent looks, judges, and replies — seconds, because it makes a cloud LLM call). For things you need to hear the instant they happen, start_watch takes triggers that fire locally, with no LLM in the loop, so the alert reaches you in well under a second:
start_watch {
window: "MyGame",
triggers: [
{ action: "beep", threshold: 20 }, // a sound
{ action: "say", threshold: 20, say: "적 출현" }, // speak a FIXED phrase (Korean TTS; the phrase means "enemy spotted")
{ action: "ocr", threshold: 12, dwellMs: 700 } // OCR the region and read the text aloud
]
}
A trigger fires the moment the watched region changes past its threshold (with hysteresis + a per-trigger cooldownMs). The fast alert is the reflex; the agent's judged commentary still follows on the next get_events. Speaking uses the built-in Windows voice (System.Speech); reading text uses the built-in offline OCR (no install, no GPU).
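One way to picture threshold, hysteresis, and per-trigger cooldown working together is an edge-triggered check that must re-arm before it can fire again. The sketch below is only an assumption about how such logic could look: `createTriggerState`, `evaluateTriggers`, the `rearmRatio` rule, and the 3000 ms default cooldown are hypothetical, and `dwellMs` is not modeled.

```javascript
// Hypothetical per-trigger state: armed flag plus the last firing time.
function createTriggerState(trigger) {
  return { trigger, armed: true, lastFiredAt: null };
}

// Returns the list of actions that fire for one change reading.
function evaluateTriggers(states, changePct, nowMs, defaults = { cooldownMs: 3000, rearmRatio: 0.5 }) {
  const fired = [];
  for (const s of states) {
    const { threshold } = s.trigger;
    const cooldownMs = s.trigger.cooldownMs ?? defaults.cooldownMs;

    // Hysteresis (assumed rule): re-arm only once the change has dropped
    // well below the threshold, so hovering near it cannot re-fire.
    if (!s.armed && changePct < threshold * defaults.rearmRatio) s.armed = true;

    const cooledDown = s.lastFiredAt === null || nowMs - s.lastFiredAt >= cooldownMs;
    if (s.armed && cooledDown && changePct >= threshold) {
      s.armed = false;
      s.lastFiredAt = nowMs;
      fired.push(s.trigger.action);
    }
  }
  return fired;
}
```

Keeping the state per trigger lets a low-threshold OCR readout and a high-threshold beep fire independently from the same change reading.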
Safety (this matters). OCR text is screen content and therefore untrusted. It is never spoken raw: it gets a spoken provenance prefix (화면 글자: …, Korean for "Screen text: …"), control/secret-token shaping, and a global speech budget (capped utterances and seconds per minute, no overlapping speech, auto-mute on sustained noise), so an on-screen string can never voice fake instructions or flood you. A fixed say phrase is the safest action (you author the words; a trigger only controls when). If the voice/OCR engine isn't available, triggers degrade quietly (a beep still works).
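The provenance prefix, shaping, and budget can be illustrated with two small pieces. This is a sketch, not the package's code: `shapeUntrustedText`, `SpeechBudget`, the token-matching regex, the 120-character cap, and the per-minute limits are all assumptions chosen for the example. Overlap prevention and auto-mute are not modeled.

```javascript
const PROVENANCE_PREFIX = "화면 글자: "; // prefix quoted in the README

// Shape untrusted OCR text before it is spoken (heuristics are assumed).
function shapeUntrustedText(raw, maxChars = 120) {
  const cleaned = raw
    .replace(/[\u0000-\u001F\u007F]/g, " ")         // strip control characters
    .replace(/[A-Za-z0-9_-]{24,}/g, "[redacted]")   // mask long token-like runs
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, maxChars);
  return PROVENANCE_PREFIX + cleaned;
}

// Sliding one-minute budget on utterance count and spoken seconds.
class SpeechBudget {
  constructor({ maxUtterancesPerMin = 6, maxSecondsPerMin = 20 } = {}) {
    this.maxUtterances = maxUtterancesPerMin;
    this.maxSeconds = maxSecondsPerMin;
    this.history = [];
  }

  // Returns true and records the utterance if it fits the budget.
  tryReserve(durationSec, nowMs) {
    this.history = this.history.filter((h) => nowMs - h.at < 60000);
    const usedSec = this.history.reduce((sum, h) => sum + h.durationSec, 0);
    if (this.history.length >= this.maxUtterances) return false;
    if (usedSec + durationSec > this.maxSeconds) return false;
    this.history.push({ at: nowMs, durationSec });
    return true;
  }
}
```

A caller would shape the text first, estimate its spoken duration, and speak only when `tryReserve` returns true.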
Optional: higher-quality neural voice (Supertonic) + audio ducking
The default voice is the built-in Windows one (System.Speech / Heami) — zero install, but robotic. For a much more natural voice, install Supertonic 3 (Supertone): an on-device ONNX neural TTS (Korean + 30 more languages; code MIT, weights OpenRAIL-M — commercial use OK). One-time model download (~380 MB), then fully offline and fast (~0.5 s/sentence on CPU):
node scripts/fetch-supertonic.mjs # downloads models to ~/.vortex/computer-use/supertonic-3
npm i onnxruntime-node # the runtime (optionalDependency)
Once the models are present, the speak path uses them automatically (engine: "auto") and falls back to Heami if anything is missing, so it never goes mute.
Audio ducking. While the companion speaks, other apps' audio (game / music / video) is briefly lowered per-app and restored exactly when it finishes, so the voice stands out. On by default. DRM-protected audio (e.g. Netflix) cannot be ducked — that protected path bypasses Windows volume control; normal app/game audio ducks fine.
Configure in your instance-root computer-use.config.json (tts section — see Privacy & redaction for placement) or via env (env wins). Defaults shown:
{ "tts": { "engine": "auto", "voice": "F1", "speed": 1.05, "duck": true, "duckFactor": 0.3 } }
- engine: auto | supertonic | heami.
- voice: F1..F5 / M1..M5 (Supertonic only; the built-in Windows voice picks by system language).
- speed: rate multiplier (~1.0 = normal, higher = faster; clamped to 0.5..2.0, applied to both the neural and built-in voices).
- duckFactor: 0..1 (clamped); other apps' audio drops to this fraction, so lower = quieter.
Env: VORTEX_CU_TTS_ENGINE / VORTEX_CU_TTS_VOICE / VORTEX_CU_TTS_SPEED / VORTEX_CU_DUCK=off / VORTEX_CU_DUCK_FACTOR. Restart the server after changing.
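The precedence (defaults, then config file, then env) and the clamping rules can be sketched as a single resolver. The function name `resolveTtsConfig` and its handling of invalid values (falling back to the default) are assumptions; only the keys, env variable names, defaults, and clamp ranges come from the text above.

```javascript
const TTS_DEFAULTS = { engine: "auto", voice: "F1", speed: 1.05, duck: true, duckFactor: 0.3 };
const clamp = (value, min, max) => Math.min(max, Math.max(min, value));

// fileConfig: parsed computer-use.config.json; env: e.g. process.env.
function resolveTtsConfig(fileConfig = {}, env = {}) {
  const cfg = { ...TTS_DEFAULTS, ...(fileConfig.tts ?? {}) };

  // Env wins over the config file.
  if (env.VORTEX_CU_TTS_ENGINE) cfg.engine = env.VORTEX_CU_TTS_ENGINE;
  if (env.VORTEX_CU_TTS_VOICE) cfg.voice = env.VORTEX_CU_TTS_VOICE;
  if (env.VORTEX_CU_TTS_SPEED) cfg.speed = Number(env.VORTEX_CU_TTS_SPEED);
  if (env.VORTEX_CU_DUCK === "off") cfg.duck = false;
  if (env.VORTEX_CU_DUCK_FACTOR) cfg.duckFactor = Number(env.VORTEX_CU_DUCK_FACTOR);

  // Assumed behavior: unknown or non-numeric values fall back to defaults.
  if (!["auto", "supertonic", "heami"].includes(cfg.engine)) cfg.engine = TTS_DEFAULTS.engine;
  cfg.speed = clamp(Number.isFinite(cfg.speed) ? cfg.speed : TTS_DEFAULTS.speed, 0.5, 2.0);
  cfg.duckFactor = clamp(Number.isFinite(cfg.duckFactor) ? cfg.duckFactor : TTS_DEFAULTS.duckFactor, 0, 1);
  return cfg;
}
```

For example, a file value of `speed: 3` resolves to 2.0, and `VORTEX_CU_DUCK=off` disables ducking regardless of the file.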
Optional: a local vision model (the vision trigger)
A vision trigger describes the scene (not just its text) via a local vision-language model — smarter than OCR, still no cloud round-trip. It is off by default and GPU-gated: it runs only when you point it at a reachable, fast-enough local endpoint. Everything above works with no GPU; this just adds a smarter local description where the hardware allows. On a machine that can't run it, a vision trigger degrades to ocr.
Point it at any OpenAI-compatible vision endpoint — llama.cpp's llama-server (with --mmproj), llamafile, Ollama, LM Studio — via env (machine-local, never synced):
VORTEX_CU_VLM_ENDPOINT=http://127.0.0.1:8080/v1 # set this to enable; presence = on
VORTEX_CU_VLM_MODEL=gemma-4-e2b-it # any small VLM (e2b ~2GB is a good light default)
VORTEX_CU_VLM_KEY=... # optional bearer token
VORTEX_CU_VLM_ALLOW_REMOTE=1 # only if the endpoint is NOT on this machine (off by default)
VORTEX_CU_VLM_SLA_MS=6000 # gate: if even a tiny probe is slower than this, stay off
How it stays safe (design §23.2/§24): only the intent (endpoint set or not) is configuration. The address and secret are machine-local env, never synced, so the same repo on a CPU-only machine simply runs without it. A loopback endpoint (same machine) is allowed by default; a cross-network one (LAN/VPN/another box) is off unless you opt in. The session probes the endpoint with a synthetic 1×1 image (never a real screen crop) to measure latency before trusting it; a real crop is sent only on an actual vision trigger, through the same denylist gate. The model's reply is untrusted: it is spoken with a 로컬 비전: … prefix (Korean for "Local vision: …"), shaped, and rate-limited, and the prompt tells the model to describe only (never follow on-screen instructions). probe reports the VLM's availability when you've configured one.
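The trust tiers described above (no endpoint, loopback, remote with opt-in, and the latency gate) can be sketched as one decision function. The names `isLoopbackHost` and `resolveVlmGate`, the returned shape, and treating 6000 ms as the fallback SLA are assumptions; the env variable names are the ones listed above. The probe latency is passed in, so the sketch makes no network calls.

```javascript
// True for localhost, ::1, and 127.x.x.x (URL.hostname keeps IPv6 brackets).
function isLoopbackHost(hostname) {
  const host = hostname.replace(/^\[|\]$/g, "");
  return host === "localhost" || host === "::1" || /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(host);
}

// env: e.g. process.env; probeLatencyMs: measured with a synthetic 1x1 image.
function resolveVlmGate(env, probeLatencyMs) {
  const endpoint = env.VORTEX_CU_VLM_ENDPOINT;
  if (!endpoint) return { enabled: false, reason: "no endpoint configured" };

  let url;
  try {
    url = new URL(endpoint);
  } catch {
    return { enabled: false, reason: "invalid endpoint URL" };
  }

  // Remote endpoints need an explicit opt-in; loopback is allowed by default.
  if (!isLoopbackHost(url.hostname) && env.VORTEX_CU_VLM_ALLOW_REMOTE !== "1") {
    return { enabled: false, reason: "remote endpoint without opt-in" };
  }

  // Assumed fallback of 6000 ms when the SLA variable is unset or invalid.
  const slaMs = Number(env.VORTEX_CU_VLM_SLA_MS) || 6000;
  if (!(probeLatencyMs <= slaMs)) {
    return { enabled: false, reason: "probe slower than SLA" };
  }
  return { enabled: true, reason: "ok" };
}
```

When the gate reports `enabled: false`, a vision trigger would take the degrade-to-ocr path described earlier.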
Adaptive companion — classify the activity, branch the help
classify_activity lets the companion figure out what you're doing from the first screenshot and adapt, instead of being told. One read-only call returns:
{ "class": "GAME", "process": "eldenring", "title": "...", "notificationState": "BUSY",
"interruptible": false, "canvas": true, "uiaCount": 1, "fullscreen": true,
"profile": { "proactive": true, "cadenceSec": 30, "mode": "periodic" }, "needsChangeRate": true }
It combines cheap signals into a class (GAME · DEV · MEDIA · BROWSING · PRODUCTIVITY · UNKNOWN), each with a help profile. The signals are: the foreground process and window title; the Windows interruptibility state (SHQueryUserNotificationState); the UIA element count (a near-empty tree on a screen-filling window = a GPU game/video canvas); and whether the window fills the screen.
The profiles branch the behavior (full design in docs/adaptive-companion.md):
| Class | Default | Speaks when | Cadence |
|---|---|---|---|
| Strategy / sim game | proactive | quiet stretch (your turn) + risk/opportunity | ~30 s during active play |
| Fast-action game | break-gated | a menu/pause/death screen opens | one cue per break; says up front it can't coach mid-fight |
| Software dev | silent | an error/failed build/stack trace appears | event-driven, not periodic |
| Media / browsing / docs | silent | only on request | never proactive |
For a GAME, needsChangeRate tells the agent to take a couple of poll_change reads to split fast-action (too fast to coach — break-gated only) from strategy (coachable, periodic). Honesty is built in: it never pretends to coach a game it can't follow, and it won't talk over media. The interruptibility state gates every utterance, on top of the global speech budget. Explicit user requests ("tell me when X happens", "be quieter") layer on as reflex triggers / cadence overrides.
Tune it in your instance-root computer-use.config.json (companion section): uiaCanvasMax (the canvas cutoff) and per-class profiles (e.g. GAME.cadenceSec: 20 for chattier coaching). Env: VORTEX_CU_UIA_CANVAS_MAX.
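The canvas heuristic and the per-class profile override can be pictured with a few lines. Everything named here is an assumption for illustration: `looksLikeCanvas`, `resolveProfile`, `classifyGamePace`, a `uiaCanvasMax` of 3, the `profiles` key inside the companion section, and the 10% pace cutoff. Only the GAME profile values are taken from the sample output above.

```javascript
// Canvas heuristic: a screen-filling window with a near-empty UIA tree.
function looksLikeCanvas({ uiaCount, fullscreen }, uiaCanvasMax = 3) {
  return fullscreen && uiaCount <= uiaCanvasMax;
}

// Merge a class's base profile with overrides from the companion config.
function resolveProfile(className, baseProfiles, companionConfig = {}) {
  const base = baseProfiles[className] ?? {};
  const override = companionConfig.profiles?.[className] ?? {};
  return { ...base, ...override };
}

// Split fast-action from strategy using a few poll_change percentages.
// The cutoff is a made-up value; real play would need calibration.
function classifyGamePace(changePcts, fastCutoffPct = 10) {
  const mean = changePcts.reduce((sum, p) => sum + p, 0) / changePcts.length;
  return mean >= fastCutoffPct ? "fast-action" : "strategy";
}

// GAME values copied from the README's sample classify_activity output.
const baseProfiles = {
  GAME: { proactive: true, cadenceSec: 30, mode: "periodic" },
};
const profile = resolveProfile("GAME", baseProfiles, { profiles: { GAME: { cadenceSec: 20 } } });
```

Here `profile` keeps the periodic mode but uses the chattier 20-second cadence from the override.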
What it is NOT
- Not control. No clicking, typing, or app automation. Perception only.
- Not real-time for judgment. Reflex triggers deliver a sub-second beep / fixed phrase / OCR readout, but anything the agent has to think about (a judged message) is seconds-scale because it makes a cloud call. Good for alerts, translation, and watching-alongside; not for reflex-speed decisions.
- Not comprehensive secret protection. See Privacy & redaction below: the denylist is the real control; field-level masking is best-effort and does not catch plaintext secrets sitting in arbitrary windows.
- Not cross-platform yet. Windows only (for now).
Install
npm i @vortex-os/computer-use
Peer dependency: @vortex-os/base (>=0.3.0 <1.0.0). The MCP SDK (@modelcontextprotocol/sdk) is an optional dependency, loaded only when the server runs. No native build step.
Register the MCP server
The package ships a vortex-mcp-computer-use bin that launches the stdio server. Register it with your agent host. For Claude Code, add it to .mcp.json:
{
"mcpServers": {
"vortex-computer-use": {
"command": "npx",
"args": ["vortex-mcp-computer-use"]
}
}
}
Use a server name other than the reserved computer-use (e.g. vortex-computer-use); some hosts reserve computer-use and will silently skip a server with that exact name. MCP servers load at session start, so restart the agent after adding it.
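Once registered, the host agent normally calls the tools itself. For a script that drives the server directly, the start_watch / get_events / stop_watch lifecycle could be wrapped as below. This is a sketch under assumptions: `withWatch` and the injected `callTool(name, args)` function are hypothetical, the watchId format is made up, and the result shapes are whatever the server returns. With the MCP SDK's client, `callTool` could wrap `client.callTool({ name, arguments: args })`.

```javascript
// Runs `work` while a background watch is active, then collects events.
// The finally block stops the watch even if `work` throws.
async function withWatch(callTool, target, work) {
  const watchId = `watch-${Date.now()}`; // hypothetical id format
  await callTool("start_watch", { ...target, watchId });
  try {
    await work(watchId);
    // Collect before stopping: stop_watch discards the buffer.
    return await callTool("get_events", { watchId });
  } finally {
    await callTool("stop_watch", { watchId });
  }
}
```

Collecting events before the stop matters because stop_watch discards the buffer.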
Privacy & redaction
Whatever you point this at is sent to your AI model. Two controls reduce accidental exposure; both run in the backend before any pixels or text reach the model:
- Denylist (the primary control). List window titles or process names that must never be captured. If a listed window is visible anywhere inside a capture region, the whole capture is refused ({ "redacted": true }: no image, no text). This is the reliable defense against accidentally capturing a password manager or banking window during a watch.
- Password-field masking. In read_ui, fields the OS reports as password inputs are dropped (no value, no text, children not traversed).
Copy computer-use.config.example.json to computer-use.config.json in your instance root (the folder you launch the agent from — i.e. next to .mcp.json), or point VORTEX_CU_CONFIG at an explicit path, to configure the denylist; or set VORTEX_CU_DENY_TITLES / VORTEX_CU_DENY_PROCS (JSON arrays). Do not put it inside node_modules — that is wiped on every reinstall. The denylist is read once at startup — restart the server after changing it.
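The denylist rule (refuse the whole capture if a listed window overlaps the region) can be sketched as a rectangle-intersection check. The names `parseDenyList`, `rectsIntersect`, and `gateCapture`, the window object shape, and the matching rules (case-insensitive substring for titles, exact match for process names) are assumptions; only the JSON-array env format and the `{ "redacted": true }` result come from the text above.

```javascript
// Parse a JSON-array env value such as VORTEX_CU_DENY_TITLES.
// Invalid input throws, so a broken denylist fails loudly instead of silently allowing captures.
function parseDenyList(jsonText) {
  if (!jsonText) return [];
  const parsed = JSON.parse(jsonText);
  if (!Array.isArray(parsed)) throw new TypeError("denylist must be a JSON array");
  return parsed.map((entry) => String(entry).toLowerCase());
}

// True when two { x, y, width, height } rectangles overlap.
function rectsIntersect(a, b) {
  return a.x < b.x + b.width && b.x < a.x + a.width &&
         a.y < b.y + b.height && b.y < a.y + a.height;
}

// Refuse the capture if any denylisted window is visible inside the region.
function gateCapture(region, visibleWindows, deny) {
  const blocked = visibleWindows.some((w) =>
    rectsIntersect(region, w.bounds) &&
    (deny.titles.some((t) => w.title.toLowerCase().includes(t)) ||
      deny.procs.includes(w.process.toLowerCase())));
  return { redacted: blocked };
}
```

Checking overlap rather than the capture target alone is what protects a region or monitor capture that merely happens to include a sensitive window.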
Honest limits. This is not comprehensive secret-scanning. A plaintext token shown in a text editor or terminal (not a password field, not a denylisted window) will still be captured. Pixel-level password masking is intentionally out of scope. Capture images are volatile — held only long enough to send, then deleted; they are never written to disk persistently.
Audit
Each perception call appends one metadata line (timestamp, tool, output size, a keyed HMAC of the output, and an HMAC of the window title) to a daily JSONL log under your user-local app data (%LOCALAPPDATA%\vortex-computer-use\audit\) — outside the synced instance data. No raw images and no plaintext window titles are stored. If the audit key can't be set up, perception still works and a warning is printed.
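A metadata-only audit line of this kind could be built with Node's `crypto` module as sketched below. The field names, the choice of SHA-256, and the function names `hmacHex` and `buildAuditLine` are assumptions; the text above only states which pieces of metadata are recorded and that titles are stored as HMACs rather than plaintext.

```javascript
import { createHmac } from "node:crypto";

// Keyed HMAC as a hex string (SHA-256 is an assumed algorithm choice).
function hmacHex(key, data) {
  return createHmac("sha256", key).update(data).digest("hex");
}

// Build one JSONL audit line; no raw output or plaintext title is included.
function buildAuditLine({ key, tool, output, windowTitle, now = new Date() }) {
  const record = {
    ts: now.toISOString(),
    tool,
    bytes: Buffer.byteLength(output),
    outputHmac: hmacHex(key, output),
    titleHmac: hmacHex(key, windowTitle),
  };
  return JSON.stringify(record) + "\n";
}
```

Because the HMAC is keyed, two entries for the same window can be correlated by someone holding the key, while the log alone does not reveal the title.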
Verify
npm run verify # node scripts/verify.mjs: needs a desktop session; captures the real screen
npm run test:filter # node scripts/test-noise-filter.mjs: pure unit tests, no screen needed
npm run test:speech # node scripts/test-speech-safety.mjs: pure unit tests (provenance/shaping/budget)
npm run test:vlm # node scripts/test-vlm.mjs: pure unit tests (config/trust-tier/protocol)
verify exercises every tool plus the redaction/audit gate (denylist blocking across all capture modes, no over-block, no title leak, audit written with no plaintext). The test:* scripts check the noise-filter, speech-safety, and VLM-protocol logic deterministically (no screen/audio/network). Three live harnesses drive the real screen:
- node scripts/verify-watch.mjs: the background watch (settle → event → frame, denylist blindness, cleanup).
- node scripts/verify-reflex.mjs: reflex triggers → local speech, rendered to WAV so it stays silent.
- node scripts/verify-vlm.mjs: the vision path against a mock local endpoint (synthetic probe, real crop only on a trigger, degrade-to-OCR).
