claude-screen-mcp
v0.4.0
Published
MCP server that lets Claude see your screen — fills the Anthropic computer-use macOS-only gap for Windows + Linux. OCR + smart vision-diff included.
Maintainers
Readme
claude-screen-mcp
Let Claude see your screen. A cross-platform MCP server for Windows + macOS + Linux with OCR and smart vision-diff. Zero native runtime deps.
Anthropic's official computer-use MCP for Claude Code is macOS-only today. This server fills the gap for Windows + Linux — and adds two things the official one doesn't have:
- 🔍 OCR so Claude can read screen text without spending vision tokens
- 📊 Smart vision-diff so 24/7 monitoring stays economical (skip frames that didn't change)
Quick start
# from source (until npm publish)
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build
# register with Claude Code
claude mcp add screen -- node "$(pwd)/dist/index.js"
# restart Claude Code, then ask:
# "Take a screenshot and tell me what's on my screen."
# "OCR my screen and tell me if there's an error message anywhere."
# "Watch my screen and ping me when the build finishes."Tools (10 total)
| Tool | Since | What it does |
|---|---|---|
| screenshot | v0.1 | Capture full display, auto-resize for vision-token efficiency |
| screenshot_region | v0.1 | Capture an (x, y, w, h) region — way cheaper than full |
| list_displays | v0.1 | Enumerate connected monitors |
| list_windows | v0.1 | List visible windows with optional title filter |
| read_screen_text | v0.2 | OCR full screen or region (10-100× cheaper than vision) |
| find_text_on_screen | v0.2 | Search OCR'd text, return matching lines + bboxes |
| screenshot_if_changed | v0.3 | Capture only when perceptual hash distance ≥ threshold |
| get_screen_diff | v0.3 | Distance-only diff — no image returned |
| wait_for_change | v0.4 | Long-poll until the screen changes, then return one keyframe |
| record_screen | v0.4 | Capture N seconds at low fps and return deduplicated keyframes |
All 8 tools work the same way on Windows (PowerShell + System.Drawing), macOS (screencapture + osascript), and Linux (grim / scrot / import + wmctrl).
Use cases
1. Debug what you see — "Why is my React app not rendering? Look at the screen."
→ screenshot → Claude sees the error overlay → suggests fix.
2. Find something specific without burning vision tokens — "Is there an error message anywhere on my screen?"
→ find_text_on_screen("error") returns matching line + bbox → Claude calls screenshot_region on just that bbox.
3. Watch-while-task — "Ping me when this build finishes."
→ wait_for_change(timeoutMs=300000, threshold=12) — server blocks until the screen actually changes (or 5 min elapses), so the model only spends a turn when something happens. For longer watches, loop screenshot_if_changed(threshold=12) every 30s.
4. Show me what just happened — "I saw something flash by, replay the last 15 seconds."
→ record_screen(durationMs=15000, targetFps=2, maxFrames=6) returns up to 6 deduplicated keyframes covering that period in a single tool result — like rewinding a clip without storing video.
5. Read what's on screen, not look at it — "What does the current GitHub PR description say?"
→ read_screen_text returns plain text → 10-100× fewer tokens than vision.
Why this exists
Anthropic's official Claude Code computer-use MCP server (v2.1.85+) is macOS-only as of May 2026. Windows and Linux users have no first-party way to give Claude vision into their desktop.
This project fills the gap with three deliberate constraints:
- Zero native runtime deps — uses each OS's built-in screenshot tooling (PowerShell + System.Drawing on Win,
screencaptureon Mac,grim/scrot/importon Linux). Nonode-gyp, no postinstall flakiness, no platform-specific binaries to bundle. - Single responsibility — only screen capture (read-only). Keyboard / mouse control belongs in a separate server (different threat model). This means it can be safely autostarted in any Claude session without granting input control.
- Token-aware by design — auto-resize to
maxEdge=1600, JPEG/WebP support, region capture, OCR (skip vision entirely for text), and perceptual-hash diff (skip frames that didn't change).
Quality bar
Every release was reviewed by 3 specialized agents (code quality + silent-failure-hunter + security auditor) before tagging. Across v0.1 → v0.3, the audits caught 16 P0 issues that were fixed before any tag was pushed:
- v0.1: PowerShell
-EncodedCommandBOM / Mac+Linuxlist_displaysreturning fake data / tool errors swallowing stderr /displayIdargument injection / region OOM / output byte caps - v0.2:
SCREEN_MCP_OCR_LANGSsupply-chain injection (allowlist enforcement) / OCR worker timeout (was unbounded) / no-match token bomb / structured OCR diagnostics / SIGTERM handler - v0.3: cache size cap + LRU + 24h stale TTL / dHash channel assert (silent monitoring failure prevention) / cross-tool cache pollution fix /
CompareResult.reasonto distinguish first-call from real change - v0.4: Windows window-title mojibake (PowerShell OEM codepage → UTF-8) / Tesseract v6+ output schema (
blocks: truerequired for line bboxes; without itfind_text_on_screensilently returned 0 matches) /get_screen_diffmisleadingabove_thresholdreason / two new tools (wait_for_change,record_screen) for real-time-ish workflows
See the commit log for the full audit trail.
Configuration
Environment variables:
| Var | Default | Purpose |
|---|---|---|
| SCREEN_MCP_LOG_LEVEL | info | debug / info / warn / error. Logs go to stderr. |
| SCREEN_MCP_OCR_LANGS | eng+chi_sim | Plus-separated tesseract codes. Allowlist enforced to prevent supply-chain attacks. Allowed: eng, chi_sim, chi_tra, jpn, kor, fra, deu, spa, rus, ita, por, ara, nld, tur, vie, tha, hin, ben, ukr. |
First OCR call downloads ~40 MB of language models from cdn.jsdelivr.net. Subsequent calls reuse the cached worker.
Platform support
| Platform | Capture | Region | Displays | Windows | OCR | Vision-diff |
|---|---|---|---|---|---|---|
| Windows ≥ 10 | ✅ tested | ✅ | ✅ multi-display | ✅ | ✅ | ✅ |
| macOS ≥ 11 | ✅ code | ✅ | 🟡 stub (single only) | ✅ | ✅ | ✅ |
| Linux (X11 + Wayland) | ✅ code | ✅ | 🟡 stub (single only) | 🟡 needs wmctrl | ✅ | ✅ |
Windows is the maintainer's primary platform and has end-to-end test coverage. macOS / Linux paths are written and CI-built but not yet end-to-end tested by the maintainer — PRs and issue reports very welcome.
Security & privacy
- The server runs entirely locally. No screenshot data leaves your machine via this server. (Whatever LLM client connects controls where the image goes — that's the API call you authorized when registering the connector.)
- OCR text is untrusted input. Anything visible on your screen — notifications, web pages, chat windows, ads — gets passed to the LLM as a tool result. A malicious actor controlling something on your screen could embed prompt-injection content. Tool descriptions and output delimiters (
<screen_ocr>...</screen_ocr>) flag this clearly so downstream models can be guided to distrust. - Use
screenshot_regionwhen you don't need the whole screen. - Use
read_screen_textinstead ofscreenshotwhen you only need text — vastly fewer tokens and you're not exposing other windows that happen to be open.
Development
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp
npm install
npm run build
node tests/e2e-wire.mjs # spawn server + drive JSON-RPC + verify all 8 toolsRoadmap
- v0.5 —
screenshot_window(title)precisely scoped to a window's bounds; macOS multi-display enumeration viasystem_profiler; Linux multi-display viaxrandr/wlr-randr; optional vendored tesseract models (SCREEN_MCP_OCR_LANG_PATH) for offline / air-gapped use - v1.0 — first-class MCPB bundle for one-click install via Claude Desktop
Why "real-time video" isn't a tool
MCP is request-response and each tool call costs an LLM turn (~1–3 s end-to-end). 24 fps streaming is physically impossible at that latency. Three substitutes cover the real use cases:
wait_for_change— like a human watching the screen and only saying something when it changesrecord_screen— like rewinding a short clip with the boring frames cut outscreenshot_if_changedin a loop — for sustained polling under your own pacing
Contributing
PRs especially welcome for:
- macOS multi-display enumeration (
system_profiler SPDisplaysDataType -jsonparsing) - Linux per-output capture (
grim -o,scrot --screen) screenshot_windowfor v0.4- Performance regressions if you find any
See CONTRIBUTING.md (TODO).
License
MIT — see LICENSE.
中文 TL;DR
让 Claude 看到你的屏幕。MCP server,跨 Win/Mac/Linux,零原生依赖。
填补 Anthropic 官方 computer-use MCP 仅 macOS 的空白,外加 OCR(省 vision token 10-100x)和智能 vision-diff(让 24/7 监测在 token 经济上可行)。
8 个 tool(截屏 / 区域 / 列显示器 / 列窗口 / OCR / 找文字 / 智能截屏 / 看变化),跨平台一致。每个 release 都过了 3 agent 联合审核(代码质量 + silent failure + security),共修了 16 个 P0 才发出去。
git clone https://github.com/lfzds4399-cpu/claude-screen-mcp
cd claude-screen-mcp && npm install && npm run build
claude mcp add screen -- node "$(pwd)/dist/index.js"
# 重启 Claude Code,然后说"截一张屏幕给我看"中文 OCR 默认开启(eng+chi_sim),无需额外配置。
