claude-screen-mcp

v0.4.0

Published

a month ago

MCP server that lets Claude see your screen — fills the Anthropic computer-use macOS-only gap for Windows + Linux. OCR + smart vision-diff included.

lin1211

mcp model-context-protocol claude screenshot ocr computer-use anthropic

claude-screen-mcp

Let Claude see your screen. A cross-platform MCP server for Windows + macOS + Linux with OCR and smart vision-diff. Zero native runtime deps.

Anthropic's official computer-use MCP for Claude Code is macOS-only today. This server fills the gap for Windows + Linux — and adds two things the official one doesn't have:

🔍 OCR so Claude can read screen text without spending vision tokens
📊 Smart vision-diff so 24/7 monitoring stays economical (skip frames that didn't change)

Quick start

# from source (until npm publish)
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build

# register with Claude Code
claude mcp add screen -- node "$(pwd)/dist/index.js"

# restart Claude Code, then ask:
# "Take a screenshot and tell me what's on my screen."
# "OCR my screen and tell me if there's an error message anywhere."
# "Watch my screen and ping me when the build finishes."

Tools (10 total)

| Tool | Since | What it does |
|---|---|---|
| screenshot | v0.1 | Capture full display, auto-resize for vision-token efficiency |
| screenshot_region | v0.1 | Capture an (x, y, w, h) region; way cheaper than full |
| list_displays | v0.1 | Enumerate connected monitors |
| list_windows | v0.1 | List visible windows with optional title filter |
| read_screen_text | v0.2 | OCR full screen or region (10-100× cheaper than vision) |
| find_text_on_screen | v0.2 | Search OCR'd text, return matching lines + bboxes |
| screenshot_if_changed | v0.3 | Capture only when perceptual hash distance ≥ threshold |
| get_screen_diff | v0.3 | Distance-only diff; no image returned |
| wait_for_change | v0.4 | Long-poll until the screen changes, then return one keyframe |
| record_screen | v0.4 | Capture N seconds at low fps and return deduplicated keyframes |

All 10 tools work the same way on Windows (PowerShell + System.Drawing), macOS (screencapture + osascript), and Linux (grim / scrot / import + wmctrl).
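
The "perceptual hash distance ≥ threshold" check behind screenshot_if_changed and get_screen_diff can be sketched roughly as follows. This is a minimal illustration of a difference hash (dHash) plus Hamming distance, not the server's actual code: the 9x8 grid size, the function names, and the assumption that the caller has already downscaled the frame to grayscale are all hypothetical.

```typescript
// Difference hash (dHash): compare each pixel to its right-hand neighbour.
// ASSUMPTION: `gray` holds 9x8 = 72 luminance samples in row-major order,
// already downscaled by the caller. The grid size is the conventional dHash
// choice, not something the README specifies.
function dHash(gray: number[]): boolean[] {
  const width = 9;
  const height = 8;
  if (gray.length !== width * height) {
    throw new Error("expected " + width * height + " samples, got " + gray.length);
  }
  const bits: boolean[] = [];
  for (let row = 0; row < height; row++) {
    for (let col = 0; col < width - 1; col++) {
      const left = gray[row * width + col];
      const right = gray[row * width + col + 1];
      bits.push(left > right); // one bit per horizontal gradient sign
    }
  }
  return bits; // 64 bits
}

// Number of bit positions in which two hashes differ.
function hammingDistance(a: boolean[], b: boolean[]): number {
  if (a.length !== b.length) throw new Error("hash length mismatch");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

// A frame counts as "changed" when the distance reaches the threshold.
function hasChanged(prev: boolean[], next: boolean[], threshold: number): boolean {
  return hammingDistance(prev, next) >= threshold;
}
```

With a 64-bit hash, a threshold such as the 12 used in the use cases below would mean that roughly a fifth of the gradient bits must flip before a new frame is returned.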

Use cases

1. Debug what you see — "Why is my React app not rendering? Look at the screen." → screenshot → Claude sees the error overlay → suggests fix.

2. Find something specific without burning vision tokens — "Is there an error message anywhere on my screen?" → find_text_on_screen("error") returns matching line + bbox → Claude calls screenshot_region on just that bbox.

3. Watch-while-task — "Ping me when this build finishes." → wait_for_change(timeoutMs=300000, threshold=12) — server blocks until the screen actually changes (or 5 min elapses), so the model only spends a turn when something happens. For longer watches, loop screenshot_if_changed(threshold=12) every 30s.

4. Show me what just happened — "I saw something flash by, replay the last 15 seconds." → record_screen(durationMs=15000, targetFps=2, maxFrames=6) returns up to 6 deduplicated keyframes covering that period in a single tool result — like rewinding a clip without storing video.

5. Read what's on screen, not look at it — "What does the current GitHub PR description say?" → read_screen_text returns plain text → 10-100× fewer tokens than vision.
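
The text-first flow in use case 2 (search OCR output, then capture only the matching area) might look like this on the client side. It is a sketch under assumptions: the OcrLine shape, the bbox field names, and the padding value are hypothetical and not taken from the server's real output schema.

```typescript
// ASSUMPTION: OCR results arrive as lines with pixel bounding boxes.
// These type and field names are illustrative only.
interface BBox { x: number; y: number; w: number; h: number }
interface OcrLine { text: string; bbox: BBox }

// Case-insensitive substring search over OCR'd lines.
function findLines(lines: OcrLine[], query: string): OcrLine[] {
  const needle = query.toLowerCase();
  return lines.filter((line) => line.text.toLowerCase().indexOf(needle) !== -1);
}

// Grow a bbox by `padding` pixels on every side, clamped to the screen,
// so the follow-up region capture includes some surrounding context.
function padRegion(bbox: BBox, padding: number, screenW: number, screenH: number): BBox {
  const x = Math.max(0, bbox.x - padding);
  const y = Math.max(0, bbox.y - padding);
  const right = Math.min(screenW, bbox.x + bbox.w + padding);
  const bottom = Math.min(screenH, bbox.y + bbox.h + padding);
  return { x, y, w: right - x, h: bottom - y };
}
```

The padded box would then be passed as the region for screenshot_region, so vision tokens are spent only on the area around the match.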

Why this exists

Anthropic's official Claude Code computer-use MCP server (v2.1.85+) is macOS-only as of May 2026. Windows and Linux users have no first-party way to give Claude vision into their desktop.

This project fills the gap with three deliberate constraints:

Zero native runtime deps — uses each OS's built-in screenshot tooling (PowerShell + System.Drawing on Win, screencapture on Mac, grim/scrot/import on Linux). No node-gyp, no postinstall flakiness, no platform-specific binaries to bundle.
Single responsibility — only screen capture (read-only). Keyboard / mouse control belongs in a separate server (different threat model). This means it can be safely autostarted in any Claude session without granting input control.
Token-aware by design — auto-resize to maxEdge=1600, JPEG/WebP support, region capture, OCR (skip vision entirely for text), and perceptual-hash diff (skip frames that didn't change).
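
As an illustration of the auto-resize constraint, the target dimensions for a given capture could be computed as below. This is a sketch, not the project's implementation; the function name and the rounding behaviour are assumptions, and only the maxEdge=1600 default comes from the text above.

```typescript
// Scale (width, height) down so the longest edge is at most `maxEdge`,
// preserving aspect ratio. Images that already fit are left untouched.
function fitToMaxEdge(
  width: number,
  height: number,
  maxEdge: number = 1600
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) return { width, height };
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```

Under this sketch a 3840×2160 capture becomes 1600×900, while a 1280×720 capture passes through unchanged.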

Quality bar

Every release was reviewed by 3 specialized agents (code quality + silent-failure-hunter + security auditor) before tagging. Across v0.1 → v0.3, the audits caught 16 P0 issues that were fixed before any tag was pushed:

v0.1: PowerShell -EncodedCommand BOM / Mac+Linux list_displays returning fake data / tool errors swallowing stderr / displayId argument injection / region OOM / output byte caps
v0.2: SCREEN_MCP_OCR_LANGS supply-chain injection (allowlist enforcement) / OCR worker timeout (was unbounded) / no-match token bomb / structured OCR diagnostics / SIGTERM handler
v0.3: cache size cap + LRU + 24h stale TTL / dHash channel assert (silent monitoring failure prevention) / cross-tool cache pollution fix / CompareResult.reason to distinguish first-call from real change
v0.4: Windows window-title mojibake (PowerShell OEM codepage → UTF-8) / Tesseract v6+ output schema (blocks: true required for line bboxes; without it find_text_on_screen silently returned 0 matches) / get_screen_diff misleading above_threshold reason / two new tools (wait_for_change, record_screen) for real-time-ish workflows

See the commit log for the full audit trail.

Configuration

Environment variables:

| Var | Default | Purpose |
|---|---|---|
| SCREEN_MCP_LOG_LEVEL | info | debug / info / warn / error. Logs go to stderr. |
| SCREEN_MCP_OCR_LANGS | eng+chi_sim | Plus-separated tesseract codes. Allowlist enforced to prevent supply-chain attacks. Allowed: eng, chi_sim, chi_tra, jpn, kor, fra, deu, spa, rus, ita, por, ara, nld, tur, vie, tha, hin, ben, ukr. |

First OCR call downloads ~40 MB of language models from cdn.jsdelivr.net. Subsequent calls reuse the cached worker.
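
Allowlist enforcement for SCREEN_MCP_OCR_LANGS could be sketched as follows. The allowed codes and the default are copied from the table above; the function name and the choice to throw on the first invalid value are assumptions, since the README does not say how the server reacts to a rejected code.

```typescript
// Allowed tesseract language codes, as listed in the configuration table.
const ALLOWED_LANGS: string[] = [
  "eng", "chi_sim", "chi_tra", "jpn", "kor", "fra", "deu", "spa", "rus", "ita",
  "por", "ara", "nld", "tur", "vie", "tha", "hin", "ben", "ukr",
];

// Parse a plus-separated value such as "eng+chi_sim" and reject anything
// outside the allowlist, so the variable cannot be used to request
// arbitrary files from the model CDN.
// ASSUMPTION: rejecting with an exception is illustrative behaviour only.
function parseOcrLangs(raw: string | undefined, fallback: string = "eng+chi_sim"): string[] {
  const value = raw === undefined || raw.trim() === "" ? fallback : raw;
  const codes = value.split("+").map((code) => code.trim());
  const rejected = codes.filter((code) => ALLOWED_LANGS.indexOf(code) === -1);
  if (rejected.length > 0) {
    throw new Error("unsupported OCR language code(s): " + rejected.join(", "));
  }
  // Drop duplicates while keeping the original order.
  return codes.filter((code, index) => codes.indexOf(code) === index);
}
```

In a Node process this would typically be called as parseOcrLangs(process.env.SCREEN_MCP_OCR_LANGS).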

Platform support

| Platform | Capture | Region | Displays | Window listing | OCR | Vision-diff |
|---|---|---|---|---|---|---|
| Windows ≥ 10 | ✅ tested | ✅ | ✅ multi-display | ✅ | ✅ | ✅ |
| macOS ≥ 11 | ✅ code | ✅ | 🟡 stub (single only) | ✅ | ✅ | ✅ |
| Linux (X11 + Wayland) | ✅ code | ✅ | 🟡 stub (single only) | 🟡 needs wmctrl | ✅ | ✅ |

Windows is the maintainer's primary platform and has end-to-end test coverage. macOS / Linux paths are written and CI-built but not yet end-to-end tested by the maintainer — PRs and issue reports very welcome.

Security & privacy

The server runs entirely locally. No screenshot data leaves your machine via this server. (Whatever LLM client connects controls where the image goes — that's the API call you authorized when registering the connector.)
OCR text is untrusted input. Anything visible on your screen (notifications, web pages, chat windows, ads) gets passed to the LLM as a tool result. A malicious actor controlling something on your screen could embed prompt-injection content. Tool descriptions and output delimiters (<screen_ocr>...</screen_ocr>) flag this clearly so downstream models can be guided to treat the text as untrusted.
Use screenshot_region when you don't need the whole screen.
Use read_screen_text instead of screenshot when you only need text — vastly fewer tokens and you're not exposing other windows that happen to be open.
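
One way to apply the <screen_ocr> delimiters is sketched below. The README only states that OCR output is delimited; the step that neutralizes delimiter-like strings inside the text is an assumption added here for illustration, and the function name is hypothetical.

```typescript
// Wrap untrusted OCR text in <screen_ocr> delimiters.
// ASSUMPTION: stripping embedded delimiter tags is an illustrative hardening
// step; the README does not say whether the server does this.
function wrapOcrText(text: string): string {
  // Replace any opening or closing tag found in the screen text so that
  // on-screen content cannot close the block early and pose as instructions.
  const sanitized = text.replace(/<\/?screen_ocr>/gi, "[screen_ocr tag removed]");
  return "<screen_ocr>\n" + sanitized + "\n</screen_ocr>";
}
```

Delimiters reduce ambiguity about where screen text starts and ends, but they do not make the content safe; the model still has to be instructed to treat everything inside the block as data.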

Development

git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build
node tests/e2e-wire.mjs    # spawn server + drive JSON-RPC + verify all 8 tools
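
For readers who want to drive the server by hand, the messages such a test sends can be sketched like this. The tools/call method and the newline-delimited stdio framing follow the MCP specification; the argument keys (x, y, w, h) are guessed from the tools table, and the helper names are hypothetical. This is not the contents of tests/e2e-wire.mjs.

```typescript
// Minimal JSON-RPC 2.0 request shape used by MCP.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: { [key: string]: unknown };
}

let nextRequestId = 1;

// Build a tools/call request for a named tool.
// ASSUMPTION: the argument object's key names are illustrative.
function toolCall(name: string, args: { [key: string]: unknown }): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id: nextRequestId++,
    method: "tools/call",
    params: { name: name, arguments: args },
  };
}

// The stdio transport sends one JSON message per line.
function frameMessage(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}
```

A real session also needs the initialize handshake before any tools/call request; after that, writing frameMessage(toolCall("screenshot_region", { x: 0, y: 0, w: 800, h: 600 })) to the server's stdin should produce a result message on its stdout.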

Roadmap

v0.5 — screenshot_window(title) precisely scoped to a window's bounds; macOS multi-display enumeration via system_profiler; Linux multi-display via xrandr / wlr-randr; optional vendored tesseract models (SCREEN_MCP_OCR_LANG_PATH) for offline / air-gapped use
v1.0 — first-class MCPB bundle for one-click install via Claude Desktop

Why "real-time video" isn't a tool

MCP is request-response, and each tool call costs an LLM turn (~1–3 s end-to-end). At that latency, 24 fps streaming is not achievable. Three substitutes cover the real use cases:

wait_for_change — like a human watching the screen and only saying something when it changes
record_screen — like rewinding a short clip with the boring frames cut out
screenshot_if_changed in a loop — for sustained polling under your own pacing
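
The "boring frames cut out" idea behind record_screen can be sketched as a keyframe selection pass over per-frame hashes. This is an assumed strategy rather than the server's confirmed algorithm: each frame is compared against the last kept keyframe, and the function names and hash representation (arrays of bits) are hypothetical.

```typescript
// Count differing bit positions between two equal-length hashes.
function countDifferingBits(a: boolean[], b: boolean[]): number {
  if (a.length !== b.length) throw new Error("hash length mismatch");
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) count++;
  }
  return count;
}

// Return the indices of frames worth keeping: the first frame, then every
// frame that differs from the previously kept one by at least `threshold`
// bits, stopping once `maxFrames` keyframes have been collected.
function pickKeyframes(hashes: boolean[][], threshold: number, maxFrames: number): number[] {
  const kept: number[] = [];
  for (let i = 0; i < hashes.length && kept.length < maxFrames; i++) {
    if (kept.length === 0) {
      kept.push(i);
      continue;
    }
    const lastKept = hashes[kept[kept.length - 1]];
    if (countDifferingBits(lastKept, hashes[i]) >= threshold) kept.push(i);
  }
  return kept;
}
```

Comparing against the last kept keyframe, rather than the immediately preceding frame, means slow gradual drift still eventually produces a new keyframe.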

Contributing

PRs especially welcome for:

macOS multi-display enumeration (system_profiler SPDisplaysDataType -json parsing)
Linux per-output capture (grim -o, scrot --screen)
screenshot_window (planned for v0.5)
Performance regressions if you find any

See CONTRIBUTING.md (TODO).

License

MIT — see LICENSE.

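TL;DR

Let Claude see your screen. An MCP server for Windows, macOS, and Linux with zero native dependencies.

It fills the gap left by Anthropic's official computer-use MCP being macOS-only, and adds OCR (10-100× fewer tokens than vision) plus smart vision-diff (which keeps 24/7 monitoring affordable in tokens).

10 tools (screenshot / region / list displays / list windows / OCR / find text / capture-if-changed / diff / wait for change / record), consistent across platforms. Every release passed a joint review by 3 agents (code quality + silent failure + security), which fixed 16 P0 issues in total before shipping.

git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp && npm install && npm run build
claude mcp add screen -- node "$(pwd)/dist/index.js"
# Restart Claude Code, then say "Take a screenshot and show me"

Chinese OCR is enabled by default (eng+chi_sim); no extra configuration is needed.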
