npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

windows-use

v0.3.2

Published

Let big LLMs delegate Windows/browser automation to small LLMs via sessions

Downloads

323

Readme


Why?

When AI agents operate Windows or browsers, every screenshot eats 1000+ tokens. A single "open Chrome and search something" task can burn through 10K–50K tokens of your expensive model's context — just on screenshots and tool calls.

That's like hiring a CEO to move the mouse.

How it works

windows-use introduces a "big model directs, small model executes" architecture:

| | Without windows-use | With windows-use | |---|---|---| | Who clicks? | Claude / GPT-4o (expensive) | Qwen, GPT-4o-mini, DeepSeek (cheap) | | Context cost per task | 10K–50K tokens of screenshots | ~200 tokens (text summary) | | What big model sees | Raw screenshots + coordinates | Clean text report + optional images | | Cost | $$$ | ¢ |

Big Model                    windows-use                    Small Model
   │                              │                              │
   ├─ create_session ────────►    │                              │
   │                              │                              │
   ├─ send_instruction ──────►    │  ── tools + instruction ──► │
   │                              │     screenshot → analyze     │
   │                              │     → click → verify → ...   │
   │                              │  ◄── report ──────────────── │
   │  ◄── text + images ──────   │                              │
   │                              │                              │
   ├─ done_session ──────────►    │  cleanup                     │

Your big model just says "open Notepad and type Hello" — the small model handles all the screenshots, clicking, and verification autonomously, then reports back a concise summary.

Designed for simplicity

You don't need to be an expert. If you can run npm install, you can use this.

Most computer-use tools ask you to set up Docker, configure sandboxes, manage virtual displays, or fight with permissions. windows-use does none of that.

  • One dependencynpm install -g windows-use. That's it. No Docker, no Python environments, no system-level permissions.
  • One config — Run windows-use init, paste your API key and endpoint, done. Works with any OpenAI-compatible API.
  • Your real Chrome — No headless puppeteer, no sandboxed browser. It connects to your actual Chrome via CDP — with your cookies, logins, extensions, and bookmarks intact. Just launch Chrome with --remote-debugging-port=9222.
  • No admin / root needed — Runs entirely in user space. No elevated privileges, no security prompts.
  • Share config in one linewindows-use init --export gives you a base64 string. Send it to a teammate, they run windows-use init <string>, instant setup.
# That's literally the entire setup:
npm install -g windows-use
windows-use init
windows-use "Open Notepad and type Hello World"

Key Features

  • Save 90% context — Keep your expensive model's context window for reasoning, not screenshots
  • Any OpenAI-compatible model — Qwen, DeepSeek, Ollama, vLLM, GPT-4o-mini, or any local model
  • 16 built-in tools — Screen capture with coordinate grid, mouse, keyboard, browser automation, file I/O
  • Your real Chrome — Uses your existing cookies, login state, and extensions (no webdriver detection)
  • MCP server — Drop-in integration with Claude Desktop, VS Code, and any MCP client
  • Rich reports — Text + embedded screenshots, so the big model sees exactly what it needs
  • CLI + REPL + API — Use from terminal, interactively, or programmatically
  • Zero config headache — No Docker, no sandbox, no permissions. Just an API key.

Install

npm install -g windows-use

Quick Start

# Interactive setup — saves config to ~/.windows-use.json
windows-use init

# Export config as a shareable base64 string
windows-use init --export

# Import config from a base64 string
windows-use init eyJiYXNlVVJMIjoiaHR0cHM6Ly...

You'll be prompted for:

  • Base URL — any OpenAI-compatible endpoint (Qwen, DeepSeek, Ollama, vLLM, etc.)
  • API Key
  • Model name — e.g. qwen3.5-flash, gpt-4o-mini

Usage

CLI Mode

# Single task
windows-use "Open Notepad and type Hello World"

# Interactive REPL session
windows-use
> Open Chrome and go to github.com
> Find the trending repositories
> exit

# With explicit config flags
windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4o-mini "Take a screenshot"

CLI shows real-time step-by-step progress:

> Take a screenshot of the desktop
[windows-use] Running...

  [step 1] 🔧 screenshot
  [step 1] 📎 Screenshot captured (img_1)
  [step 1] 💭 I can see the Windows desktop with...
  [step 2] 🔧 report { status: 'completed', content: '...' }

✅ [completed]
Here is the current desktop:
   📸 desktop: http://127.0.0.1:54321/s1.jpg

MCP Server Mode

windows-use --mcp

Exposes 3 tools over MCP (stdio transport):

| Tool | Description | |------|-------------| | create_session | Create a new agent session. Returns session_id. | | send_instruction | Send a task to the agent. Returns rich report with text + images. | | done_session | Terminate a session and free resources. |

MCP Client Configuration

Claude Desktop

Edit claude_desktop_config.json:

{
  "mcpServers": {
    "windows-use": {
      "command": "npx",
      "args": ["-y", "windows-use", "--mcp"],
      "env": {
        "WINDOWS_USE_API_KEY": "sk-xxx",
        "WINDOWS_USE_BASE_URL": "https://your-api.com/v1",
        "WINDOWS_USE_MODEL": "qwen3.5-flash"
      }
    }
  }
}

VS Code (Claude Code / Copilot Chat)

Add to .vscode/settings.json or global settings:

{
  "mcp": {
    "servers": {
      "windows-use": {
        "command": "npx",
        "args": ["-y", "windows-use", "--mcp"],
        "env": {
          "WINDOWS_USE_API_KEY": "sk-xxx",
          "WINDOWS_USE_BASE_URL": "https://your-api.com/v1",
          "WINDOWS_USE_MODEL": "qwen3.5-flash"
        }
      }
    }
  }
}

If you've run windows-use init, the config is saved in ~/.windows-use.json and you can omit the env block entirely.

Configuration

Config priority: CLI flags > environment variables > ~/.windows-use.json > defaults

| Option | CLI Flag | Env Var | Default | |--------|----------|---------|---------| | API Key | --api-key | WINDOWS_USE_API_KEY | — (required) | | Base URL | --base-url | WINDOWS_USE_BASE_URL | — (required) | | Model | --model | WINDOWS_USE_MODEL | — (required) | | CDP URL | --cdp-url | WINDOWS_USE_CDP_URL | http://localhost:9222 | | Max Steps | --max-steps | WINDOWS_USE_MAX_STEPS | 50 | | Max Rounds | --max-rounds | WINDOWS_USE_MAX_ROUNDS | 20 | | Timeout | — | WINDOWS_USE_TIMEOUT_MS | 300000 (5 min) |

  • Max Steps — tool-calling rounds per instruction (how many actions the small model can take for one task)
  • Max Rounds — instruction rounds per session (how many send_instruction calls before the session expires)

Browser Setup

Browser tools connect to your real Chrome via CDP. Start Chrome with remote debugging:

# Windows
chrome.exe --remote-debugging-port=9222

# macOS
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222

Uses your existing cookies, login state, and extensions — no webdriver detection flags.

Small Model Tools

The small model agent has access to 16 tools:

Screen & Input

| Tool | Description | |------|-------------| | screenshot | Full-screen capture with coordinate grid overlay (auto-scaled to logical resolution, grid coordinates match mouse_click) | | mouse_click(x, y, button) | Click at screen coordinates | | mouse_move(x, y) | Move mouse without clicking | | mouse_scroll(direction, amount) | Scroll up/down | | keyboard_type(text) | Type text character by character | | keyboard_press(keys) | Key combos like ["Ctrl", "C"], ["Alt", "F4"] | | run_command(command) | Execute shell command (PowerShell on Windows) |

Browser

| Tool | Description | |------|-------------| | browser_navigate(url) | Open a URL | | browser_click(selector) | Click element by CSS selector or text=... | | browser_type(selector, text) | Type into input field | | browser_screenshot(fullPage?) | Page screenshot (JPEG quality 70) | | browser_content | Get visible text content of page | | browser_scroll(direction, amount) | Scroll page |

File & Image

| Tool | Description | |------|-------------| | file_read(path) | Read file contents | | file_write(path, content) | Write file | | use_local_image(path) | Load a local image and get a screenshot ID for embedding in reports |

Control

| Tool | Description | |------|-------------| | report(status, content) | Submit a rich report and stop execution |

Rich Reports

Each screenshot tool returns an ID (e.g. img_1, img_2). The report tool supports a rich document format — mix text with embedded screenshots using [Image:img_X] markers:

report({
  status: "completed",
  content: "Here's what I found:\n[Image:img_2]\nThe page shows search results.\n[Image:img_3]\nI also checked the sidebar."
})

When delivered to the user (CLI) or big model (MCP), these markers expand into actual multimodal image content.

Programmatic API

import { loadConfig, SessionRegistry } from 'windows-use';

const config = loadConfig({
  apiKey: 'sk-xxx',
  baseURL: 'https://api.example.com/v1',
  model: 'qwen3.5-flash',
});

const registry = new SessionRegistry();
const session = registry.create(config);

// Real-time step events
session.runner.setOnStep((event) => {
  if (event.type === 'tool_call') console.log(`Step ${event.step}: ${event.name}`);
  if (event.type === 'thinking') console.log(`Step ${event.step}: ${event.content}`);
});

const result = await session.runner.run('Open calculator and compute 2+2');
console.log(result.status);  // 'completed' | 'blocked' | 'need_guidance'
console.log(result.content); // Rich text with [Image:img_X] markers

await registry.destroyAll();

Architecture

src/
├── cli.ts                      # CLI entry (single task + interactive REPL)
├── index.ts                    # Public API exports
├── config/                     # Zod config schema + env/file loader
├── mcp/
│   ├── server.ts               # MCP stdio transport
│   ├── tools.ts                # 3 MCP tools (create/send/done)
│   └── session-registry.ts     # Session lifecycle + timeout
├── agent/
│   ├── runner.ts               # Tool-calling loop + step events
│   ├── llm-client.ts           # OpenAI SDK wrapper (any compatible endpoint)
│   ├── context-manager.ts      # Full message history
│   └── system-prompt.ts        # Small model system prompt
└── tools/
    ├── windows/                # screenshot (with grid), mouse, keyboard, command
    ├── browser/                # navigate, click, type, screenshot, content, scroll
    ├── file/                   # read, write, use_local_image
    └── control/                # report

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

MIT