@vortex-os/computer-use

v0.7.4

Add-on — read-only screen perception (structured UIA tree + pixel fallback + noise-filtered background watch with an event buffer + sub-second reflex alerts: beep / fixed-phrase / OCR or optional local-VLM description spoken locally, optional higher-quali

Read-only screen perception for VortEX agents, exposed as an MCP server. It lets an agent see what is on screen — read a window's structure, capture a region as an image, and watch for on-screen changes — without ever moving the mouse or typing. It layers on @vortex-os/base but also works standalone.

Status: Windows-first, read-only. Mouse/keyboard control is intentionally out of scope — this package only perceives (and speaks). macOS/Linux backends are not yet implemented.

What it is

An MCP (Model Context Protocol) server that exposes eleven tools over stdio (nine perception, plus beep/speak for output):

| Tool | What it does | Cost |
|---|---|---|
| probe | Reports whether this environment can perceive the screen (displays, DPI, capture latency). Never captures real screen content. | ~0 |
| read_ui | Reads the active/target window as a structured accessibility tree (UI Automation): element roles, coordinates, text. No image. | ~0 image tokens |
| classify_activity | Classifies the on-screen activity (game / dev / media / browsing / productivity) so a companion can branch its help. | metadata |
| capture_screen | Pixel capture (PNG) for what structure can't reach: canvases, games, remote desktops. Target by window, region, monitor, or cursor box. | image |
| watch_capture | Captures N frames at an interval in one process; with changeOnly, keeps only changed frames. | image(s) |
| poll_change | One non-blocking "did it change?" probe; returns a change percentage and (optionally) an image. Poll it on an interval to watch without blocking. | metadata, image optional |
| start_watch | Watch a fixed target in the background (non-blocking) with a built-in noise filter that keeps only meaningful, settled changes. Works on video/games that change every frame. | runs in background |
| get_events | Collect the buffered changes a start_watch has accumulated, batched (a few looks for a long watch); each event carries the settled frame. | metadata + image(s) |
| stop_watch | Stop a background watch and discard its buffer. | none |
| beep | A system beep, to get the user's attention while they look elsewhere. | none |
| speak | Speaks a short utterance locally (built-in Windows voice, or the optional Supertonic neural voice). | none |

The design favors structure first, pixels as fallback: read_ui is cheap and precise for ordinary apps; capture_screen is for content that has no accessibility tree (games, custom canvases).

Watching for changes (the noise filter + event buffer)

poll_change is the manual primitive — you poll it on a loop. start_watch is the hands-off version: it runs a background loop with a noise filter and buffers what matters, so you can watch a busy screen without drowning in frames.

The problem it solves: on video, games, or scrolling, the screen changes every frame, so raw change-detection fires constantly on the ripples of one activity. The filter combines debounce (wait for motion to settle, then capture a clean frame — quality) with cooldown (at most one event per N seconds — frequency) and hysteresis (ambient jitter never even wakes it), plus an anti-starvation maxWait so a continuously-moving screen still yields periodic snapshots instead of going silent.

start_watch { region|window|monitor, watchId? }   -> returns immediately, watch runs in the background
   ... do other things ...
get_events  { watchId? }                           -> the settled changes so far (frames + metadata), batched
stop_watch  { watchId? }                           -> end it (omit watchId to stop all)
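
The filter described above can be pictured as a small state machine. The following is only a sketch under stated assumptions, not the package's implementation: the option names come from this README, while createNoiseFilter, the sample-by-sample interface, and the exact ordering of the checks are hypothetical.

```javascript
// Hypothetical sketch of a debounce + cooldown + hysteresis filter.
// Option names mirror the README; the state machine itself is an assumption.
function createNoiseFilter({ activityThreshold, quietThreshold, debounceQuietMs, cooldownMs, maxWaitMs }) {
  let active = false;     // true once a change has crossed activityThreshold
  let activeSince = 0;    // when the current burst of motion started
  let quietSince = null;  // when the change rate first dropped to quietThreshold
  let lastEventAt = null; // time of the last emitted event, used for the cooldown

  function emit(now, reason) {
    // Cooldown: at most one event per cooldownMs.
    if (lastEventAt !== null && now - lastEventAt < cooldownMs) return null;
    lastEventAt = now;
    return { at: now, reason };
  }

  // Feed one sample: `now` in ms, `changePct` is the whole-frame change
  // against the previous frame. Returns an event object or null.
  return function sample(now, changePct) {
    if (!active) {
      // Hysteresis: jitter below activityThreshold never wakes the filter.
      if (changePct < activityThreshold) return null;
      active = true;
      activeSince = now;
      quietSince = null;
      return null;
    }
    if (changePct <= quietThreshold) {
      // Debounce: wait until the screen has been quiet for debounceQuietMs.
      if (quietSince === null) quietSince = now;
      if (now - quietSince >= debounceQuietMs) {
        active = false;
        return emit(now, "settled");
      }
    } else {
      quietSince = null; // still moving, restart the debounce window
    }
    // Anti-starvation: a screen that never settles still yields a snapshot.
    if (now - activeSince >= maxWaitMs) {
      activeSince = now;
      return emit(now, "maxWait");
    }
    return null;
  };
}
```

In this sketch a "settled" event is where a clean frame would be captured, and a "maxWait" event is the periodic snapshot for a screen that keeps moving.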

Calibration. The defaults assume a meaningful target: a playing video jitters ~2.5–4% frame-to-frame and a scene cut jumps ~16% — so activityThreshold (default 8%) sits between them, ignoring the jitter and reporting the cut. Because the change metric is whole-frame, a tiny local change (a clock, a toast) is a small fraction of the frame; target the region where the change happens so it reads as a large change. Tune activityThreshold / quietThreshold / debounceQuietMs / cooldownMs / maxWaitMs per call. The buffer is memory-only (no screen history on disk) with count, byte, and 5-minute TTL caps; watches auto-stop after 30 minutes.
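
A quick worked example of why region targeting matters. The numbers are illustrative assumptions: a hypothetical 200×50 clock widget on a 1920×1080 monitor, with every pixel of the widget changing (an upper bound). changePercent is a helper made up for this sketch, not part of the package.

```javascript
// Share of a capture area covered by a changed rectangle, in percent.
function changePercent(changedW, changedH, areaW, areaH) {
  return ((changedW * changedH) / (areaW * areaH)) * 100;
}

// Watching the whole monitor: the clock is a sliver of the frame.
const wholeFrame = changePercent(200, 50, 1920, 1080); // about 0.48 %

// Watching a tight 240x80 region around the clock instead.
const tightRegion = changePercent(200, 50, 240, 80); // about 52 %

console.log(wholeFrame.toFixed(2), tightRegion.toFixed(2));
```

Against the default activityThreshold of 8%, the first case would never register and the second clearly would.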

Reflex alerts — sub-second voice, no cloud round-trip

get_events is the brain path (the agent looks, judges, and replies — seconds, because it makes a cloud LLM call). For things you need to hear the instant they happen, start_watch takes triggers that fire locally, with no LLM in the loop, so the alert reaches you in well under a second:

start_watch {
  window: "MyGame",
  triggers: [
    { action: "beep",  threshold: 20 },                          // a sound
    { action: "say",   threshold: 20, say: "적 출현" },           // speak a FIXED phrase (Korean TTS; the sample means "enemy spotted")
    { action: "ocr",   threshold: 12, dwellMs: 700 }             // OCR the region and read the text aloud
  ]
}

A trigger fires the moment the watched region changes past its threshold (with hysteresis + a per-trigger cooldownMs). The fast alert is the reflex; the agent's judged commentary still follows on the next get_events. Speaking uses the built-in Windows voice (System.Speech); reading text uses the built-in offline OCR (no install, no GPU).

Safety (this matters). OCR text is screen content, so it is untrusted. It is never spoken raw: it gets a spoken provenance prefix (화면 글자: …, Korean for "Screen text: …"), control/secret-token shaping, and a global speech budget (capped utterances and seconds per minute, no overlapping speech, auto-mute on sustained noise), so an on-screen string can never voice fake instructions or flood you. A fixed say phrase is the safest action: you author the words, and a trigger only controls when they are spoken. If the voice/OCR engine isn't available, triggers degrade quietly (a beep still works).
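
To make the shaping and budget ideas concrete, here is a minimal sketch. Only the prefix string is taken from this README; shapeForSpeech, createSpeechBudget, the 24-character token rule, and the 120-character cap are all assumptions for illustration. It covers the per-minute utterance cap only, not the seconds cap, overlap prevention, or auto-mute.

```javascript
const SPEECH_PREFIX = "화면 글자: "; // provenance prefix quoted in the README

// Shape untrusted screen text before it is spoken. The rules are illustrative.
function shapeForSpeech(raw, maxChars = 120) {
  const cleaned = raw
    .replace(/[\u0000-\u001f\u007f]/g, " ")      // control characters become spaces
    .replace(/[A-Za-z0-9_-]{24,}/g, "[redacted]") // long token-like runs are masked
    .replace(/\s+/g, " ")                         // collapse whitespace
    .trim()
    .slice(0, maxChars);                          // cap the utterance length
  return SPEECH_PREFIX + cleaned;
}

// Sliding one-minute budget: at most `maxUtterances` per 60 seconds.
function createSpeechBudget(maxUtterances) {
  const times = [];
  return function allow(now) {
    while (times.length > 0 && now - times[0] >= 60000) times.shift();
    if (times.length >= maxUtterances) return false;
    times.push(now);
    return true;
  };
}
```

A caller would check the budget first and speak the shaped text only when allow(Date.now()) returns true.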

Optional: higher-quality neural voice (Supertonic) + audio ducking

The default voice is the built-in Windows one (System.Speech / Heami) — zero install, but robotic. For a much more natural voice, install Supertonic 3 (Supertone): an on-device ONNX neural TTS (Korean + 30 more languages; code MIT, weights OpenRAIL-M — commercial use OK). One-time model download (~380 MB), then fully offline and fast (~0.5 s/sentence on CPU):

node scripts/fetch-supertonic.mjs       # downloads models to ~/.vortex/computer-use/supertonic-3
npm i onnxruntime-node                   # the runtime (optionalDependency)

Once the models are present, the speak path uses them automatically (engine: "auto") and falls back to Heami if anything is missing — it never goes mute.

Audio ducking. While the companion speaks, other apps' audio (game / music / video) is briefly lowered per-app and restored exactly when it finishes, so the voice stands out. On by default. DRM-protected audio (e.g. Netflix) cannot be ducked — that protected path bypasses Windows volume control; normal app/game audio ducks fine.

Configure in your instance-root computer-use.config.json (tts section — see Privacy & redaction for placement) or via env (env wins). Defaults shown:

{ "tts": { "engine": "auto", "voice": "F1", "speed": 1.05, "duck": true, "duckFactor": 0.3 } }

engine auto|supertonic|heami · voice F1..F5/M1..M5 (Supertonic only; the built-in Windows voice picks by system language) · speed rate multiplier (~1.0 = normal, higher = faster; clamped 0.5..2.0, applied to both the neural and built-in voices) · duckFactor 0..1 (clamped) (others drop to this fraction; lower = quieter). Env: VORTEX_CU_TTS_ENGINE / VORTEX_CU_TTS_VOICE / VORTEX_CU_TTS_SPEED / VORTEX_CU_DUCK=off / VORTEX_CU_DUCK_FACTOR. Restart the server after changing.
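
The precedence and clamping rules above could be resolved roughly as follows. This is a sketch, not the package's loader: resolveTtsConfig is a hypothetical helper, and how the real server treats invalid values is an assumption (here they fall back to the defaults).

```javascript
const TTS_DEFAULTS = { engine: "auto", voice: "F1", speed: 1.05, duck: true, duckFactor: 0.3 };

const clamp = (x, lo, hi) => Math.min(hi, Math.max(lo, x));

// fileTts: the "tts" section of computer-use.config.json; env: e.g. process.env.
function resolveTtsConfig(fileTts = {}, env = {}) {
  const cfg = { ...TTS_DEFAULTS, ...fileTts };

  // Env wins over the config file.
  if (env.VORTEX_CU_TTS_ENGINE) cfg.engine = env.VORTEX_CU_TTS_ENGINE;
  if (env.VORTEX_CU_TTS_VOICE) cfg.voice = env.VORTEX_CU_TTS_VOICE;
  if (env.VORTEX_CU_TTS_SPEED) cfg.speed = Number(env.VORTEX_CU_TTS_SPEED);
  if (env.VORTEX_CU_DUCK === "off") cfg.duck = false;
  if (env.VORTEX_CU_DUCK_FACTOR) cfg.duckFactor = Number(env.VORTEX_CU_DUCK_FACTOR);

  // Assumption: non-numeric values fall back to the defaults.
  if (!Number.isFinite(cfg.speed)) cfg.speed = TTS_DEFAULTS.speed;
  if (!Number.isFinite(cfg.duckFactor)) cfg.duckFactor = TTS_DEFAULTS.duckFactor;

  cfg.speed = clamp(cfg.speed, 0.5, 2.0);       // documented clamp range
  cfg.duckFactor = clamp(cfg.duckFactor, 0, 1); // documented clamp range
  return cfg;
}
```

For example, resolveTtsConfig({ speed: 3 }, {}) yields a speed of 2, and setting VORTEX_CU_DUCK=off disables ducking regardless of the file.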

Optional: a local vision model (the vision trigger)

A vision trigger describes the scene (not just its text) via a local vision-language model — smarter than OCR, still no cloud round-trip. It is off by default and GPU-gated: it runs only when you point it at a reachable, fast-enough local endpoint. Everything above works with no GPU; this just adds a smarter local description where the hardware allows. On a machine that can't run it, a vision trigger degrades to ocr.

Point it at any OpenAI-compatible vision endpoint — llama.cpp's llama-server (with --mmproj), llamafile, Ollama, LM Studio — via env (machine-local, never synced):

VORTEX_CU_VLM_ENDPOINT=http://127.0.0.1:8080/v1   # set this to enable; presence = on
VORTEX_CU_VLM_MODEL=gemma-4-e2b-it                # any small VLM (e2b ~2GB is a good light default)
VORTEX_CU_VLM_KEY=...                             # optional bearer token
VORTEX_CU_VLM_ALLOW_REMOTE=1                      # only if the endpoint is NOT on this machine (off by default)
VORTEX_CU_VLM_SLA_MS=6000                         # gate: if even a tiny probe is slower than this, stay off

How it stays safe (design §23.2/§24): only the intent (endpoint set or not) is configuration. The address and secret are machine-local env, never synced, so the same repo on a CPU-only machine simply runs without it. A loopback endpoint (same machine) is allowed by default; a cross-network one (LAN/VPN/another box) is off unless you opt in. The session probes the endpoint with a synthetic 1×1 image (never a real screen crop) to measure latency before trusting it; a real crop is sent only on an actual vision trigger, through the same denylist gate. The model's reply is untrusted: it is spoken with a 로컬 비전: … prefix (Korean for "Local vision: …"), shaped, and rate-limited, and the prompt tells the model to describe only (never follow on-screen instructions). probe reports the VLM's availability when you've configured one.
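
The trust tiers and the latency gate can be summarized in a few lines. This is a sketch under assumptions, not the package's code: isLoopback and vlmDecision are hypothetical names, the loopback test is simplified, and the 6000 ms fallback is taken from the example value above rather than a documented default.

```javascript
// True when the endpoint's host is this machine (simplified loopback check).
function isLoopback(endpoint) {
  const host = new URL(endpoint).hostname;
  return host === "localhost" || host === "[::1]" || /^127\.\d+\.\d+\.\d+$/.test(host);
}

// env: e.g. process.env; probeLatencyMs: measured with a synthetic 1x1 image.
function vlmDecision(env, probeLatencyMs) {
  const endpoint = env.VORTEX_CU_VLM_ENDPOINT;
  if (!endpoint) return { enabled: false, reason: "no endpoint configured" };

  // Remote endpoints need an explicit opt-in.
  if (!isLoopback(endpoint) && env.VORTEX_CU_VLM_ALLOW_REMOTE !== "1") {
    return { enabled: false, reason: "remote endpoint without opt-in" };
  }

  // Assumed fallback of 6000 ms when the variable is unset or invalid.
  const slaMs = Number(env.VORTEX_CU_VLM_SLA_MS) || 6000;
  if (probeLatencyMs > slaMs) return { enabled: false, reason: "probe slower than SLA" };

  return { enabled: true, reason: "ok" };
}
```

When the decision is disabled, a vision trigger would fall back to ocr as described earlier.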

Adaptive companion — classify the activity, branch the help

classify_activity lets the companion figure out what you're doing from the first screenshot and adapt, instead of being told. One read-only call returns:

{ "class": "GAME", "process": "eldenring", "title": "...", "notificationState": "BUSY",
  "interruptible": false, "canvas": true, "uiaCount": 1, "fullscreen": true,
  "profile": { "proactive": true, "cadenceSec": 30, "mode": "periodic" }, "needsChangeRate": true }

It combines cheap signals — foreground process + window title, the Windows interruptibility state (SHQueryUserNotificationState), UIA element count (a near-empty tree on a screen-filling window = a GPU game/video canvas), and whether it fills the screen — into a class: GAME · DEV · MEDIA · BROWSING · PRODUCTIVITY · UNKNOWN, each with a help profile.
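
As an illustration of the canvas signal only, the check might look like the sketch below. isCanvasLike is a hypothetical helper, and the cutoff of 5 used in the example is a made-up value; this README does not state the default for uiaCanvasMax.

```javascript
// A near-empty UIA tree on a screen-filling window suggests a GPU canvas
// (game or video). `signals` uses the field names from the sample result.
function isCanvasLike(signals, uiaCanvasMax) {
  return signals.fullscreen === true && signals.uiaCount <= uiaCanvasMax;
}

// Example with an assumed cutoff of 5 elements.
console.log(isCanvasLike({ uiaCount: 1, fullscreen: true }, 5));   // true
console.log(isCanvasLike({ uiaCount: 240, fullscreen: true }, 5)); // false
```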

The profiles branch the behavior (full design in docs/adaptive-companion.md):

| Class | Default | Speaks when | Cadence |
|---|---|---|---|
| Strategy / sim game | proactive | quiet stretch (your turn) + risk/opportunity | ~30 s during active play |
| Fast-action game | break-gated | a menu/pause/death screen opens | one cue per break; says up front it can't coach mid-fight |
| Software dev | silent | an error/failed build/stack trace appears | event-driven, not periodic |
| Media / browsing / docs | silent | only on request | never proactive |

For a GAME, needsChangeRate tells the agent to take a couple of poll_change reads to split fast-action (too fast to coach — break-gated only) from strategy (coachable, periodic). Honesty is built in: it never pretends to coach a game it can't follow, and it won't talk over media. The interruptibility state gates every utterance, on top of the global speech budget. Explicit user requests ("tell me when X happens", "be quieter") layer on as reflex triggers / cadence overrides.
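
One way an agent might act on needsChangeRate is to average a few poll_change percentages and compare them with a cutoff. This is a hypothetical sketch: splitGameKind and the 15% cutoff are assumptions, and the package may use a different rule.

```javascript
// changePercents: change percentages from a few poll_change reads.
// fastCutoffPct: assumed boundary between coachable and too-fast play.
function splitGameKind(changePercents, fastCutoffPct) {
  if (changePercents.length === 0) {
    throw new RangeError("need at least one poll_change reading");
  }
  const mean = changePercents.reduce((sum, pct) => sum + pct, 0) / changePercents.length;
  return mean >= fastCutoffPct ? "fast-action" : "strategy";
}

console.log(splitGameKind([30, 42, 35], 15)); // "fast-action"
console.log(splitGameKind([1, 0.5, 3], 15));  // "strategy"
```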

Tune it in your instance-root computer-use.config.json (companion section): uiaCanvasMax (the canvas cutoff) and per-class profiles (e.g. GAME.cadenceSec: 20 for chattier coaching). Env: VORTEX_CU_UIA_CANVAS_MAX.

What it is NOT

  • Not control. No clicking, typing, or app automation. Perception only.
  • Not real-time for judgment. Reflex triggers deliver a sub-second beep / fixed phrase / OCR readout, but anything the agent has to think about (a judged message) is seconds-scale — it makes a cloud call. Good for alerts, translation, and watching-alongside; not for reflex-speed decisions.
  • Not comprehensive secret protection. See Privacy & redaction below — the denylist is the real control; field-level masking is best-effort and does not catch plaintext secrets sitting in arbitrary windows.
  • Not cross-platform yet. Windows only (for now).

Install

npm i @vortex-os/computer-use

Peer dependency: @vortex-os/base (>=0.3.0 <1.0.0). The MCP SDK (@modelcontextprotocol/sdk) is an optional dependency, loaded only when the server runs. No native build step.

Register the MCP server

The package ships a vortex-mcp-computer-use bin that launches the stdio server. Register it with your agent host. For Claude Code, add it to .mcp.json:

{
  "mcpServers": {
    "vortex-computer-use": {
      "command": "npx",
      "args": ["vortex-mcp-computer-use"]
    }
  }
}

Use a server name other than the reserved computer-use (e.g. vortex-computer-use) — some hosts reserve computer-use and will silently skip a server with that exact name. MCP servers load at session start, so restart the agent after adding it.

Privacy & redaction

Whatever you point this at is sent to your AI model. Two controls reduce accidental exposure; both run in the backend before any pixels or text reach the model:

  1. Denylist (the primary control). List window titles or process names that must never be captured. If a listed window is visible anywhere inside a capture region, the whole capture is refused ({ "redacted": true } — no image, no text). This is the reliable defense against accidentally capturing a password manager or banking window during a watch.
  2. Password-field masking. In read_ui, fields the OS reports as password inputs are dropped (no value, no text, children not traversed).

Copy computer-use.config.example.json to computer-use.config.json in your instance root (the folder you launch the agent from — i.e. next to .mcp.json), or point VORTEX_CU_CONFIG at an explicit path, to configure the denylist; or set VORTEX_CU_DENY_TITLES / VORTEX_CU_DENY_PROCS (JSON arrays). Do not put it inside node_modules — that is wiped on every reinstall. The denylist is read once at startup — restart the server after changing it.
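
The denylist rule (refuse the whole capture if a listed window overlaps the region) can be sketched like this. The env variable names and the { "redacted": true } result come from this README; the helper names, the window object shape, and the case-insensitive substring matching are assumptions, and occlusion is ignored.

```javascript
// Axis-aligned rectangle overlap test; rects are { x, y, width, height }.
function rectsIntersect(a, b) {
  return a.x < b.x + b.width && b.x < a.x + a.width &&
         a.y < b.y + b.height && b.y < a.y + a.height;
}

// Parse a JSON-array env value. Malformed input throws instead of
// silently producing an empty (allow-everything) list.
function parseDenyList(value) {
  if (!value) return [];
  const list = JSON.parse(value);
  if (!Array.isArray(list)) throw new TypeError("expected a JSON array");
  return list.map((entry) => String(entry).toLowerCase());
}

// visibleWindows: [{ title, process, bounds }], a hypothetical shape.
function checkCapture(region, visibleWindows, env) {
  const titles = parseDenyList(env.VORTEX_CU_DENY_TITLES);
  const procs = parseDenyList(env.VORTEX_CU_DENY_PROCS);
  const blocked = visibleWindows.some((w) =>
    rectsIntersect(region, w.bounds) &&
    (titles.some((t) => w.title.toLowerCase().includes(t)) ||
      procs.includes(w.process.toLowerCase())));
  return { redacted: blocked };
}
```

Note that the check runs on the capture region, not just the target window, which is what lets it catch a denylisted window that drifts into view during a watch.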

Honest limits. This is not comprehensive secret-scanning. A plaintext token shown in a text editor or terminal (not a password field, not a denylisted window) will still be captured. Pixel-level password masking is intentionally out of scope. Capture images are volatile — held only long enough to send, then deleted; they are never written to disk persistently.

Audit

Each perception call appends one metadata line (timestamp, tool, output size, a keyed HMAC of the output, and an HMAC of the window title) to a daily JSONL log under your user-local app data (%LOCALAPPDATA%\vortex-computer-use\audit\) — outside the synced instance data. No raw images and no plaintext window titles are stored. If the audit key can't be set up, perception still works and a warning is printed.
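
For a sense of what one audit line could contain, here is a sketch using Node's built-in crypto module. The field names, the use of SHA-256, and the auditLine helper are assumptions; only the list of recorded items (timestamp, tool, output size, HMAC of the output, HMAC of the window title) comes from this README.

```javascript
const { createHmac } = require("node:crypto");

// Build one JSONL audit record. `key` is the audit key; how the real
// package creates and stores it is not covered here.
function auditLine({ tool, output, windowTitle }, key, now = new Date()) {
  const hmac = (data) => createHmac("sha256", key).update(data).digest("hex");
  return JSON.stringify({
    ts: now.toISOString(),
    tool,
    bytes: Buffer.byteLength(output), // output size only, never the content
    outputHmac: hmac(output),
    titleHmac: hmac(windowTitle),     // the title is never stored in plaintext
  });
}
```

Because only keyed hashes are written, two lines can be compared for equality of title or output without the log revealing either.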

Verify

npm run verify        # node scripts/verify.mjs — needs a desktop session; captures the real screen
npm run test:filter   # node scripts/test-noise-filter.mjs — pure unit tests, no screen needed
npm run test:speech   # node scripts/test-speech-safety.mjs — pure unit tests (provenance/shaping/budget)
npm run test:vlm      # node scripts/test-vlm.mjs — pure unit tests (config/trust-tier/protocol)

verify exercises every tool plus the redaction/audit gate (denylist blocking across all capture modes, no over-block, no title leak, audit written with no plaintext). The test:* scripts check the noise-filter, speech-safety, and VLM-protocol logic deterministically (no screen/audio/network). Three live harnesses drive the real screen: node scripts/verify-watch.mjs (background watch: settle → event → frame, denylist blindness, cleanup), node scripts/verify-reflex.mjs (reflex triggers → local speech, rendered to WAV so it stays silent), and node scripts/verify-vlm.mjs (the vision path against a mock local endpoint: synthetic-probe, real-crop only on a trigger, degrade-to-OCR).