stelo
v12.0.4
Published
The computer-use engine. Give any AI full autonomous control of any desktop — mouse, keyboard, screen, audio, vision, OCR, RTC — all native Rust, zero npm runtime dependencies, sub-millisecond.
Maintainers
Keywords
Readme
Stelo
The Computer-Use Engine
npm License: Apache-2.0 Windows macOS Linux
Give any AI full autonomous control of any desktop.
Mouse . Keyboard . Screen . Vision . OCR . Audio . Mic . Speaker . Voice Isolation . RTC . Clipboard . Window . Process . Recording . Hooks . Template Matching . Gestures . Workflows
Every operation is native Rust. Sub-millisecond. Zero dependencies. Zero compromise.
What is Stelo?
Stelo is a complete computer-use engine. It gives AI agents -- or any program -- the ability to see, click, type, hear, speak, and stream an entire desktop, all through a single npm install with no native compiler required.
Everything runs in Rust through napi-rs. There is no Python subprocess, no Electron wrapper, no puppeteer shim. When you call screen.capture(), you get raw pixels from the GPU in under 10ms. When you call mouse.moveSmoothly(), a physics-modeled cursor path executes on a native thread while your event loop stays free. When you call ocr.recognize(), the OS-native OCR engine returns word-level bounding boxes with no external model.
v12.0.0 — Wayland-native Linux, zero runtime npm dependencies, hardened supply chain. 200+ native functions. 20 modules.
npm install steloimport {
mouse, keyboard, screen, vision, ocr, agent, browser,
appWindow, recorder, safety, clipboard, process,
stream, audioStream, micStream, speaker, voiceIsolation,
hooks, template, rtc, workflow
} from 'stelo';Node.js >= 20 required. Prebuilt binaries for Windows, macOS, and Linux -- no compiler needed.
150+ Native Functions Across 19 Modules
| Module | What It Does | Key Functions |
| ------------------ | ---------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| mouse | Full cursor control with physics-modeled movement | move, moveSmoothly, moveHumanized, click, drag, scroll, gesture* |
| keyboard | Typing, hotkeys, sequences, batch input | type, typeHumanized, hotkey, sequence, typeBatch, waitForKey |
| screen | GPU-accelerated capture, pixel analysis | capture, capturePng, captureJpeg, getPixelColor, findColor, getAllDisplays |
| vision | Screen diffing, change detection, perceptual hashing | diff, waitForChange, waitForStable, analyzeGrid, perceptualHash, findColorClusters |
| ocr | Native OS OCR with word-level positions | recognize, findText, clickText, waitForText, waitForTextAndClick |
| agent | Batch execution, verification, parallel control | executeBatch, clickAndVerify, sequence(), parallel(), createCursorStream |
| appWindow | Window discovery, focus, geometry, capture | getActive, find, focus, resize, capture, captureJpeg, sendClick, sendKey |
| recorder | Record and replay input sequences | start, stop, replay, sequence().build() |
| safety | Failsafe, rate limiting, emergency stop | enableFailsafe, setRateLimit, emergencyStop, reset |
| clipboard | Text, HTML, and image clipboard access | readText, writeText, readHtml, readImage, formats |
| process | Process listing, launching, killing | list, find, launch, kill, isRunning, foregroundPid |
| browser | Cross-platform URL navigation | openUrl, normalizeBrowserUrl, findBrowserWindow |
| stream | Real-time screen capture at 1-120 FPS | start, getLatestFrame, waitForFrame, getStats |
| audioStream | System audio loopback capture (WASAPI) | start, getFormat, getLatestChunk, drainChunks |
| micStream | Microphone capture | start, getFormat, getLatestChunk, drainChunks |
| speaker | Audio playback (PCM16 mono) | start, write, clear, getStats |
| voiceIsolation | Neural voice isolation with enrollment | enroll, enable, setAggression, getStats |
| hooks | Global keyboard/mouse event listeners | start, drainEvents, eventCount |
| template | Template matching via NCC | findOnScreen, findAllOnScreen, waitFor |
| rtc | QUIC-based real-time communication | create, connect, sendScreenStart, sendAudioStart, channelOpen |
| workflow | Declarative automation with retry/branching | create, step, condition, loop, run |
Quick Start
Open a URL in the Browser
import { openUrl } from 'stelo';
// Windows, macOS, Linux — CDP → keyboard → OS launch (start / open / xdg-open)
await openUrl('https://www.youtube.com');On linuxserver Chromium + labwc, bake --remote-debugging-port=9222 so CDP opens tabs visible in Selkies. Do not use STELO_SANDBOX=1 on Wayland desktops.
See the Screen
import { screen, ocr } from 'stelo';
// Capture as PNG bytes (ready to send to any vision model)
const png = screen.capturePng();
// Or capture a region as JPEG with quality control
const jpeg = screen.captureJpeg({ quality: 80, x: 0, y: 0, width: 800, height: 600 });
// Read all text on screen with word-level bounding boxes
const result = ocr.recognize();
console.log(result.text);
result.lines.forEach(line => {
line.words.forEach(word => {
console.log(`"${word.text}" at (${word.x}, ${word.y}) confidence=${word.confidence}`);
});
});
// Find and click text directly
ocr.clickText('Submit');Control the Mouse
import { mouse } from 'stelo';
// Instant teleport
mouse.move(500, 400);
// Physics-modeled smooth movement (Bezier + arc-length parameterization)
await mouse.moveSmoothly(800, 600, { duration: 500, curve: 'bezier' });
// Human-indistinguishable movement (Wind Mouse algorithm)
await mouse.moveHumanized(300, 200, { jitter: 0.5, overshoot: true });
// Click, drag, scroll
mouse.click();
mouse.drag(100, 100, 500, 500);
mouse.scroll(3, 'down');
// Complex gestures
mouse.gestureCircle(400, 400, 100, 2000);
mouse.gestureSwipe(0, 500, 1920, 500, 300);
mouse.gestureSpiral(960, 540, 200, 10, 3, 3000);Type and Use Shortcuts
import { keyboard } from 'stelo';
// Instant typing (single SendInput call, 10-100x faster than per-key)
keyboard.typeBatch('Hello World');
// Human-like typing with randomized delays
await keyboard.typeHumanized('Natural typing speed', { minDelay: 40, maxDelay: 120 });
// Hotkeys and sequences
keyboard.hotkey('ctrl', 'shift', 'p');
keyboard.sequence('ctrl+a | ctrl+c | alt+tab | ctrl+v', 50);
// Convenience shortcuts
keyboard.selectAll();
keyboard.copy();
keyboard.paste();
keyboard.save();See What Changed
import { vision, screen, mouse } from 'stelo';
// Take reference, act, verify
const before = screen.capture();
mouse.click(500, 300);
const diff = vision.diff(before);
console.log(`${diff.changePercentage}% of screen changed`);
// Wait for animations to finish
await vision.waitForStable(0.1, 200, 5000);
// Perceptual hash for fast similarity checks
const hash1 = vision.perceptualHash();
// ... do something ...
const hash2 = vision.perceptualHash();
const distance = vision.hashDistance(hash1, hash2);
// distance < 5 = nearly identical, > 20 = major changeAI Agent Loop (Observe -> Act -> Verify -> Repeat)
import { agent, screen, ocr, vision } from 'stelo';
// The core agent loop
while (true) {
// 1. Observe
const screenshot = screen.captureJpeg({ quality: 70 });
const text = ocr.recognize();
// 2. Decide (send to your LLM -- Gemini, Claude, GPT, local model)
const action = await yourLLM.decide(screenshot, text);
// 3. Act with verification
const result = agent.clickAndVerify(action.x, action.y, {
minChangePercent: 0.5,
timeoutMs: 2000,
});
// 4. Verify the action worked
if (!result.verified) continue;
// 5. Wait for stability before next action
await vision.waitForStable(0.1, 200, 5000);
}Core Capabilities
Native Resolution Everywhere
Stelo forces Per-Monitor DPI Awareness V2 on Windows automatically. Screenshot coordinates = mouse coordinates = vision model coordinates. Zero drift, zero scaling math, single coordinate space end-to-end.
import { screen, mouse } from 'stelo';
const { width, height } = screen.getSize(); // True native resolution
const pixel = screen.getPixelColor(960, 540); // Exact pixel, no DPI scaling
mouse.move(960, 540); // Lands exactly where expectedParallel Action Execution
True OS-level parallelism on native threads. Not async pretend.
import { agent } from 'stelo';
const result = agent.parallelExecute([
{
groupId: 'mouse-work',
actions: [
{ type: 'mouseMoveSmooth', x: 500, y: 300, duration: 600 },
{ type: 'mouseClick' },
]
},
{
groupId: 'keyboard-work',
startDelayMs: 300,
actions: [
{ type: 'type', text: 'search query' },
{ type: 'keyPress', key: 'Enter' },
]
}
]);
console.log(`All done in ${result.totalDurationMs}ms`);Action Batching
Execute complex sequences atomically with minimal latency:
const result = agent.executeBatch([
{ type: 'mouseClickAt', x: 500, y: 300 },
{ type: 'waitForStable', stableMs: 200 },
{ type: 'type', text: 'hello world' },
{ type: 'keyPress', key: 'Enter' },
{ type: 'waitForChange', timeout: 3000 },
], { stopOnError: true });
// Or use the fluent builder
agent.sequence()
.clickAt(500, 300)
.waitForStable()
.type('hello world')
.press('Enter')
.waitForChange({ timeout: 3000 })
.execute();Real-Time Screen Streaming
Continuous screen capture at up to 120 FPS with cursor tracking:
import { stream } from 'stelo';
stream.start({ fps: 60, scale: 0.5 });
const frame = stream.getLatestFrame();
// frame.data = raw RGBA pixels
// frame.cursorX, frame.cursorY = cursor position at capture time
// frame.measuredFps = actual FPS being achieved
const next = stream.waitForFrame(100); // Block until next frame
stream.stop();Audio Pipeline
Full audio I/O: system loopback capture, microphone capture, speaker playback, and neural voice isolation.
import { audioStream, micStream, speaker, voiceIsolation } from 'stelo';
// Capture system audio
audioStream.start();
const format = audioStream.getFormat();
const chunks = audioStream.drainChunks();
// Capture microphone
micStream.start();
const micChunk = micStream.getLatestChunk();
// Voice isolation -- enroll your voice, suppress everything else
voiceIsolation.enroll();
voiceIsolation.enrollFinish();
voiceIsolation.enable();
voiceIsolation.setAggression(0.7);
// Play audio
speaker.start({ inputSampleRate: 24000 });
speaker.write(pcm16Buffer);Window Control
Complete window management including background window capture and input:
import { appWindow } from 'stelo';
const wins = appWindow.find('Chrome');
appWindow.focus(wins[0].id);
// Capture background window (no focus required)
const capture = appWindow.capture(wins[0].id);
const jpeg = appWindow.captureJpeg(wins[0].id, 80);
// Send input to background window
appWindow.sendClick(wins[0].id, 100, 200);
appWindow.sendKey(wins[0].id, 0x0D);
// Window geometry
appWindow.resize(wins[0].id, 1920, 1080);
appWindow.center(wins[0].id);
appWindow.setAlwaysOnTop(wins[0].id, true);
appWindow.setOpacity(wins[0].id, 0.8);
const win = await appWindow.waitFor('MyApp', 10000);Template Matching
Find images on screen using normalized cross-correlation:
import { template, mouse } from 'stelo';
const match = template.findOnScreen(buttonRgbaData, buttonWidth, buttonHeight, {
threshold: 0.85,
});
if (match) {
mouse.click(match.x + match.width / 2, match.y + match.height / 2);
}
const result = await template.waitFor(iconData, iconWidth, iconHeight, {
threshold: 0.8,
timeoutMs: 10000,
});Real-Time Communication (RTC)
QUIC-based peer-to-peer streaming with <50ms latency:
import { rtc } from 'stelo';
// Host
rtc.create({ port: 9000 });
const offer = rtc.getOffer();
// Remote
rtc.create();
rtc.connect(offer, 5000);
// Stream screen, audio, mic
rtc.sendScreenStart({ fps: 60, scale: 0.75 });
rtc.sendAudioStart();
rtc.sendMicStart();
// Data channels
const ch = rtc.channelOpen('control');
rtc.channelSend(ch, Buffer.from(JSON.stringify({ action: 'click', x: 100, y: 200 })));Global Input Hooks
Listen to all keyboard and mouse events system-wide:
import { hooks } from 'stelo';
hooks.start({
captureKeyboard: true,
captureMouseClicks: true,
captureMouseMove: false,
captureMouseWheel: true,
bufferSize: 1000,
});
setInterval(() => {
const events = hooks.drainEvents();
for (const e of events) {
if (e.eventType === 'key_down') {
console.log(`Key pressed: ${e.keyName}`);
}
}
}, 100);Clipboard
Full clipboard access -- text, HTML, and raw images:
import { clipboard } from 'stelo';
clipboard.writeText('Hello');
const text = clipboard.readText();
clipboard.writeHtml('<b>Bold</b>');
const html = clipboard.readHtml();
const image = clipboard.readImage();
const formats = clipboard.formats();
clipboard.clear();Process Management
import { process } from 'stelo';
const all = process.list();
const chromes = process.find('chrome');
const running = process.isRunning(1234);
const pid = process.launch('notepad.exe');
const exitCode = process.launchAndWait('cmd.exe', ['/c', 'dir']);
process.kill(pid);Safety and Failsafe
import { safety } from 'stelo';
safety.enableFailsafe(10);
safety.setRateLimit(200);
safety.emergencyStop();
console.log(safety.isStopped()); // true
safety.reset();Workflow Engine
Declarative automation with built-in retry, conditional branching, and error recovery:
import { workflow, ocr, keyboard, vision, safety, screen } from 'stelo';
const loginFlow = workflow.create('login')
.step('wait-for-page', async () => {
await ocr.waitForText('Sign In', 10000);
})
.step('enter-credentials', async (ctx) => {
ocr.clickText('Email');
await keyboard.typeHumanized(ctx.get('email'));
keyboard.press('tab');
await keyboard.typeHumanized(ctx.get('password'));
})
.step('submit', async () => {
ocr.clickText('Sign In');
await vision.waitForStable(0.1, 500, 10000);
})
.retry({ maxAttempts: 3, delayMs: 1000 })
.onError(async (error, ctx) => {
screen.saveScreenshot(`error-${Date.now()}.png`);
safety.emergencyStop();
});
await loginFlow.run({ email: '[email protected]', password: '***' });Architecture
+------------------------------------------------------+
| Your Application |
| (AI Agent / Automation Script / RPA) |
+------------------------------------------------------+
| Stelo TypeScript API |
| mouse . keyboard . screen . vision . ocr . agent |
| window . recorder . safety . clipboard . process |
| stream . audio . mic . speaker . voice . hooks |
| template . rtc . workflow |
+------------------------------------------------------+
| Stelo Rust Core (napi-rs) |
| +----------+ +----------+ +----------+ +----------+ |
| | Input | | Screen | | Audio | | Network | |
| | Mouse | | Capture | | WASAPI | | QUIC/RTC | |
| | Keyboard | | Vision | | Mic | | LZ4+Delta| |
| | Hooks | | OCR | | Speaker | | Channels | |
| | Gestures | | Template | | Voice | | H.264 | |
| +----------+ +----------+ +----------+ +----------+ |
| +----------+ +----------+ +----------+ +----------+ |
| | Window | | Process | |Clipboard | | Recorder | |
| | Manage | | Control | | Text/HTML| | Record | |
| | Capture | | Launch | | Image | | Replay | |
| | Send I/O | | Kill | | Formats | | Sequence | |
| +----------+ +----------+ +----------+ +----------+ |
+------------------------------------------------------+
| Platform Abstraction Layer |
| Win32 API . Core Graphics . X11/XCB/XTest |
| WASAPI . Core Audio . ALSA/PulseAudio |
| WinRT OCR . Vision.framework . Tesseract |
+------------------------------------------------------+Supported Platforms
| OS | Arch | Package |
| ------- | --------------------- | ------------------------ |
| Windows | x64 | @yuta123133/win32-x64-msvc |
| Windows | arm64 | @yuta123133/win32-arm64-msvc|
| macOS | x64 (Intel) | @yuta123133/darwin-x64 |
| macOS | arm64 (Apple Silicon) | @yuta123133/darwin-arm64 |
| Linux | x64 (glibc) | @yuta123133/linux-x64-gnu |
| Linux | x64 (musl) | @yuta123133/linux-x64-musl |
| Linux | arm64 (glibc) | @yuta123133/linux-arm64-gnu |
| Linux | arm64 (musl) | @yuta123133/linux-arm64-musl|
Platform packages are auto-installed via optionalDependencies as @yuta123133/* (your npm user scope — no org required; avoids squatted unscoped names).
Linux sandboxes: npm run rebuild:linux (Docker + Ubuntu 22.04) builds both @yuta123133/linux-x64-gnu and @yuta123133/linux-x64-musl into npm/linux-x64-*.
Performance
| Operation | Latency | Notes | | --------------------------- | ------------ | --------------------------------------- | | Screen capture (full 1080p) | 5-10ms | DXGI/DWM on Windows, IOSurface on macOS | | Screen capture (full 4K) | 10-25ms | Native GPU path | | Mouse move (instant) | <0.1ms | Single SendInput call | | Mouse move (smooth, 500ms) | ~500ms | Non-blocking, event loop free | | Keyboard type batch | <0.5ms | Entire string in one SendInput | | OCR (full screen) | 50-200ms | WinRT OCR engine | | Vision diff (1080p) | 2-5ms | SIMD-optimized in Rust | | Template match (1080p) | 10-50ms | NCC with grayscale optimization | | Perceptual hash | 1-3ms | 64-bit DCT hash | | Action batch (10 actions) | 5-20ms | Sequential in native thread | | RTC frame transfer | <10ms | QUIC + LZ4 delta compression | | Audio chunk capture | ~10ms | WASAPI exclusive mode |
FAQ
Does Stelo require a C++ compiler?
No. Prebuilt binaries are published for all supported platforms. Building from source requires Rust (not C++).
How does humanized movement work?
Stelo implements the Wind Mouse algorithm -- a physics-inspired model that simulates human hand tremor, acceleration, and overshoot. Combined with cubic Bezier curves and arc-length parameterization, mouse paths are indistinguishable from real user input.
What about screen scaling / HiDPI?
Stelo forces Per-Monitor DPI Awareness V2 on Windows. All coordinates and captures use true native resolution -- no DPI virtualization.
Is it safe for production?
Yes. The safety module provides failsafe (mouse-to-corner abort), rate limiting, and emergency stop. Always enable failsafe.
Does it work headless / in CI?
On Linux, use a headless Wayland compositor (e.g. Weston with --backend=headless). Windows and macOS require a desktop session.
What about Wayland?
Linux builds are Wayland-native (wlr-screencopy, uinput, wlroots window APIs). X11/Xvfb is not supported.
Is the npm package safe to install?
Stelo has no runtime npm dependencies — only dev tools to build from source. Install with npm ci --ignore-scripts (recommended). See SECURITY.md for lockfile verification, audit gates, and provenance.
Stelo Sandbox (Selkies viewport for companies)
Give customers a browser desktop plus Stelo automation without building WebRTC yourself. Default stack: Ubuntu 26.04 + GNOME 50 + Selkies viewport.
npm run sandbox:build # first time: build Ubuntu 26 / GNOME 50 image
npm run sandbox:up # http://localhost:8080 (GNOME in browser)
npm run sandbox:urlimport { sandbox } from 'stelo/sandbox';
const { viewportUrl } = sandbox.start({ password: 'change-me' });Full setup, profiles, firewall/TURN, and security notes: docs/SELKIES-SANDBOX.md
How is this different from cloud computer-use services?
Cloud services charge monthly fees to remote-control a browser. Stelo is a local SDK that controls your entire desktop at native speed. You own it. You run it. No API key required.
License
Apache-2.0 (c) Stelo Contributors
