native-devtools-mcp v0.7.1
MCP server for native app testing — screenshot, OCR, click, type, find_text, template matching. macOS, Windows & Android.
native-devtools-mcp
native-devtools-mcp is a Model Context Protocol (MCP) server for computer use on macOS, Windows, and Android. It gives AI agents and MCP clients direct control over native desktop apps, Chrome/Electron browsers, and Android devices through screenshots, OCR, accessibility-based text lookup, input simulation, window management, Chrome DevTools Protocol (CDP), and ADB.
Use it when browser-only automation is not enough: Electron apps (Signal, Discord, VS Code), Chrome browser automation, system dialogs, desktop tools, native app testing, and Android device workflows. It works with Claude Desktop, Claude Code, Cursor, and other MCP-compatible clients.
Useful for MCP-based computer use, desktop automation, browser automation, UI automation, native app testing, e2e testing, RPA, screen reading, mouse and keyboard control, Chrome DevTools Protocol automation, and Android device automation.
npx -y native-devtools-mcp

Core capabilities
- Screenshots, OCR, and accessibility-first find_text
- click, type_text, scroll, launch_app, quit_app, and window management
- element_at_point for inspecting accessible UI elements at screen coordinates
- load_image + find_image for non-text UI elements such as icons and custom controls
- Chrome/Electron automation via CDP: snapshots, click, fill, navigate, type, and tab management
- Android screenshots, text lookup, input, and app control over ADB
- Local execution: screenshots and input stay on the machine
For AI agents: Read AGENTS.md for tool definitions, workflow patterns, and machine-readable usage guidance.
Features • Installation • Getting Started • Recipes • Security & Trust • For AI Agents • Chrome/Electron (CDP) • Android
🚀 Features
- 👀 Computer Vision: Capture screenshots of screens, windows, or specific regions. Includes built-in OCR (text recognition) to "read" the screen.
- 🖱️ Input Simulation: Click, drag, scroll, and type text naturally. Supports global coordinates and window-relative actions.
- 🪟 Window Management: List open windows, find applications, and bring them to focus.
- 🧩 Template Matching: Find non-text UI elements (icons, shapes) using load_image + find_image, returning precise click coordinates.
- 🔒 Local & Private: 100% local execution. No screenshots or data are ever sent to external servers.
- 📱 Android Support: Connect to Android devices over ADB for screenshots, input simulation, UI element search, and app management — all from the same MCP server.
- 🔍 Hover Tracking: Track cursor hover transitions across UI elements in real-time. Configurable dwell threshold filters pass-through noise — designed for LLMs observing user navigation patterns.
- 🌐 Browser Automation (CDP): Connect to Chrome/Electron apps via Chrome DevTools Protocol. Take accessibility tree snapshots, click elements by UID, evaluate JavaScript, and manage tabs — all without a separate Node.js server.
- 🔌 Dual-Mode Interaction:
- Visual/Native: Works with any app via screenshots & coordinates (Universal).
- AppDebugKit: Deep integration for supported apps to inspect the UI tree (DOM-like structure).
- CDP: Connect to Chrome/Electron via --remote-debugging-port for DOM-level element targeting and JS evaluation.
🤖 For AI Agents (LLMs)
This MCP server is designed to be highly discoverable and usable by AI models (Claude, Gemini, GPT).
- 📄 Read AGENTS.md: A compact, token-optimized technical reference designed specifically for ingestion by LLMs. It contains intent definitions, schema examples, and reasoning patterns.
Core Capabilities for System Prompts:
- take_screenshot: The "eyes". Returns images + layout metadata + text locations (OCR).
- click / type_text: The "hands". Interacts with the system based on visual feedback.
- find_text: A shortcut to find text on screen and get its coordinates immediately. Uses the platform accessibility API (macOS Accessibility / Windows UI Automation) for precise element-level matching, with OCR fallback.
- element_at_point: Inspect the accessibility element at given screen coordinates — returns name, role, label, value, bounds, pid, and app_name. Note: privacy-focused Electron apps (e.g. Signal) may restrict their AX tree, returning only a container — use take_screenshot with OCR as a fallback.
- load_image / find_image: Template matching for non-text UI elements (icons, shapes), returning screen coordinates for clicking.
- start_hover_tracking / get_hover_events / stop_hover_tracking: Track cursor hover transitions across UI elements. Configurable dwell threshold filters pass-throughs.
- start_recording / stop_recording: Record the frontmost app's window at ~5 fps as timestamped JPEG frames. Automatically follows app switches.
- launch_app / quit_app: Launch apps with optional CLI args, or gracefully/forcefully quit them.
- cdp_connect / cdp_take_snapshot / cdp_click / cdp_fill / cdp_navigate: Connect to Chrome or Electron apps via CDP for DOM-level automation — snapshots, clicking, typing, navigation, and tab management without a separate Node.js server.
📦 Installation
The install steps are identical on macOS and Windows.
Option 1: Run with npx (no install needed)
npx -y native-devtools-mcp

Option 2: Global install
npm install -g native-devtools-mcp

Option 3: Build from source (Rust)
Using the build script (clones, builds, and runs setup):
curl -fsSL https://raw.githubusercontent.com/sh3ll3x3c/native-devtools-mcp/master/scripts/build-from-source.sh | bash

Or manually:
git clone https://github.com/sh3ll3x3c/native-devtools-mcp
cd native-devtools-mcp
cargo build --release
# Binary: ./target/release/native-devtools-mcp

🏁 Getting Started
After installing, run the setup wizard:
npx native-devtools-mcp setup

This will:
- Check permissions (macOS) — verifies Accessibility and Screen Recording, opens System Settings if needed
- Detect your MCP clients — finds Claude Desktop, Claude Code, Cursor
- Write the configuration — generates the correct JSON config and offers to write it for you
Then restart your MCP client and you're ready to go.
Claude Desktop on macOS requires the signed app bundle (Gatekeeper blocks npx). Download NativeDevtools-X.X.X.dmg from GitHub Releases, drag it to /Applications, then run setup — it will detect the app and configure Claude Desktop to use it.
VS Code, Windsurf, and other clients: setup doesn't auto-detect these yet. Run setup for the permission checks, then see the manual configuration below for the JSON config snippet.
Claude Code tip: To avoid approving every tool call (clicks, screenshots), add this to .claude/settings.local.json:

{ "permissions": { "allow": ["mcp__native-devtools__*"] } }
📚 Recipes and Examples
- Recipes and Examples Index
- Claude Desktop Setup
- Claude Code Setup
- Cursor Setup
- End-to-End Desktop Flow
- Native App Click Flow
- OCR Fallback and Element Inspection
- Template Matching Flow
- Android Quickstart
macOS — Claude Desktop
Config file: ~/Library/Application Support/Claude/claude_desktop_config.json
{
"mcpServers": {
"native-devtools": {
"command": "/Applications/NativeDevtools.app/Contents/MacOS/native-devtools-mcp"
}
}
}

Windows — Claude Desktop
Config file: %APPDATA%\Claude\claude_desktop_config.json
Claude Code, Cursor, and other MCP clients
{
"mcpServers": {
"native-devtools": {
"command": "npx",
"args": ["-y", "native-devtools-mcp"]
}
}
}

Requires Node.js 18+.
🔐 Security & Trust
This tool requires Accessibility and Screen Recording permissions — that's a lot of trust. Here's how to verify it deserves it.
Verify your binary
native-devtools-mcp verify

Computes the SHA-256 hash of the running binary and checks it against the official checksums published on the GitHub Releases page. If the hash matches, you're running an unmodified official build.
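Conceptually, verification is just a file-hash comparison. Below is a minimal Python sketch of the same idea — not the actual (Rust) implementation; the expected_hashes dict stands in for the checksums file fetched from GitHub Releases:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 hex digest of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(binary_path: str, expected_hashes: dict[str, str]) -> bool:
    """Return True if the binary's digest matches a published checksum."""
    return sha256_of(binary_path) in expected_hashes.values()
```

The chunked read keeps memory use constant even for large binaries.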
Build from source
Don't trust pre-built binaries? Build it yourself:
curl -fsSL https://raw.githubusercontent.com/sh3ll3x3c/native-devtools-mcp/master/scripts/build-from-source.sh | bash

The script clones the repo, optionally opens it for review before building, compiles the release binary, and runs setup. See scripts/build-from-source.sh.
Audit the code
SECURITY_AUDIT.md documents exactly which permissions are used, where in the source code, and includes an LLM audit prompt you can paste into any AI model to perform an independent security review.
What this server does NOT do
- No unsolicited network access — the server never phones home. Network is only used when the MCP client explicitly invokes app_connect (WebSocket to a local debug server) or when you run the verify subcommand (fetches checksums from GitHub)
- No file scanning — does not read or index your files. The only file reads are load_image (reads a path the MCP client explicitly provides) and short-lived temp files for screenshots (deleted immediately after capture)
- No background persistence — exits when the MCP client disconnects
- No data exfiltration — screenshots are returned to the MCP client via stdout, never stored or transmitted elsewhere
🔍 Two Approaches to Interaction
We provide two ways for agents to interact, allowing them to choose the best tool for the job.
1. The "Visual" Approach (Universal)
Best for: 99% of apps (Electron, Qt, Games, Browsers).
- How it works: The agent takes a screenshot, analyzes it visually (or uses OCR), and clicks at coordinates.
- Tools: take_screenshot, find_text, click, type_text (plus load_image / find_image for icons and shapes).
- Example: "Click the button that looks like a gear icon." → use find_image with a gear template.
2. The "Structural" Approach (AppDebugKit)
Best for: Apps specifically instrumented with our AppDebugKit library (mostly for developers testing their own apps).
- How it works: The agent connects to a debug port and queries the UI tree (like HTML DOM).
- Tools: app_connect, app_query, app_click.
- Example: app_click(element_id="submit-button").
🧩 Template Matching (find_image)
Use find_image when the target is not text (icons, toggles, custom controls) and OCR or find_text cannot identify it.
Typical flow:
take_screenshot(app_name="MyApp") → screenshot_id
load_image(path="/path/to/icon.png") → template_id
find_image(screenshot_id="...", template_id="...") → matches with screen_x / screen_y
click(x=..., y=...)
Fast vs Accurate:
- fast (default): uses downscaling and early-exit for speed.
- accurate: uses full-resolution, wider scale search, and smaller stride for thorough matching.
Optional inputs like mask_id, search_region, scales, and rotations can improve precision and performance.
🌐 Browser Automation (CDP)
Connect to Chrome or Electron apps via the Chrome DevTools Protocol for DOM-level automation — more reliable than coordinate-based clicking for web content.
# Launch Chrome with remote debugging
launch_app(app_name="Google Chrome", args=["--remote-debugging-port=9222", "--user-data-dir=/tmp/chrome-profile"])
# Connect and automate
cdp_connect(port=9222)
cdp_navigate(url="https://example.com")
cdp_take_snapshot() # accessibility tree with element UIDs
cdp_fill(uid="10", value="search query")
cdp_press_key(key="Enter")
cdp_wait_for(text=["Results"])

16 CDP tools — click, hover, fill, type, press key, navigate, handle dialogs, manage tabs, evaluate JS, and more. Works with Chrome 136+, Chromium, and Electron apps (Signal, Discord, VS Code, Slack). See AGENTS.md for the full tool reference.
Chrome 136+ note: Requires --user-data-dir=<path> alongside --remote-debugging-port (Chrome silently ignores the debug port with the default profile). Electron apps only need --remote-debugging-port.
📱 Android Support
Android support is built-in. The MCP server communicates with Android devices over ADB (USB or Wi-Fi), providing screenshots, input simulation, UI element search, and app management.
Prerequisites
- ADB installed on the host machine (brew install android-platform-tools on macOS, or install via the Android SDK)
- USB debugging enabled on the Android device (Settings > Developer options > USB debugging)
- ADB server running — starts automatically when you run adb devices
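For orientation, android_list_devices amounts to parsing the output of adb devices, which lists one serial and state per line after a header. A rough Python sketch (parse_adb_devices is an illustrative helper, not part of the server):

```python
def parse_adb_devices(output: str) -> list[str]:
    """Extract serials of devices in the 'device' state from `adb devices` output."""
    serials = []
    for line in output.strip().splitlines()[1:]:  # skip the "List of devices attached" header
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "device":  # skip 'offline' / 'unauthorized' entries
            serials.append(parts[0])
    return serials
```

Devices stuck in the unauthorized state usually mean the "Allow USB debugging?" prompt on the phone hasn't been accepted yet.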
Android tools
All Android tools are prefixed with android_ and appear dynamically after connecting to a device:
| Tool | Description |
|------|-------------|
| android_list_devices | List all ADB-connected devices (always available) |
| android_connect | Connect to a device by serial number |
| android_disconnect | Disconnect from the current device |
| android_screenshot | Capture the device screen |
| android_find_text | Find UI elements by text (via uiautomator) |
| android_click | Tap at screen coordinates |
| android_swipe | Swipe between two points |
| android_type_text | Type text on the device |
| android_press_key | Press a key (e.g., KEYCODE_HOME, KEYCODE_BACK) |
| android_launch_app | Launch an app by package name |
| android_list_apps | List installed packages |
| android_get_display_info | Get screen resolution and density |
| android_get_current_activity | Get the current foreground activity |
Typical workflow
android_list_devices → find your device serial
android_connect(serial="...") → connect (unlocks android_* tools)
android_screenshot → see what's on screen
android_find_text(text="OK") → locate a button
android_click(x=..., y=...) → tap it

Known issues
MIUI / HyperOS (Xiaomi, Redmi, POCO devices): Input injection (android_click, android_type_text, android_press_key, android_swipe) and android_find_text (via uiautomator) require an additional security toggle: Settings > Developer options > USB debugging (Security settings) — enable this toggle. MIUI may require you to sign in with a Mi account to enable it.

Without this, you'll see INJECT_EVENTS permission errors for input tools and could not get idle state errors for android_find_text. Screenshot and device info tools work without this toggle.
Wireless ADB: To connect without a USB cable, first connect via USB and run:
adb tcpip 5555
adb connect <phone-ip>:5555

Then use the <phone-ip>:5555 serial in android_connect.
Smoke tests
Smoke tests verify all Android tools against a real connected device. They are #[ignore]d by default and must be run explicitly:
cargo test --test android_smoke_tests -- --ignored --test-threads=1

Tests must run sequentially (--test-threads=1) since they share a single physical device. The device must be unlocked and awake.
🏗️ Architecture
graph TD
Client[Claude / LLM Client] <-->|JSON-RPC 2.0| Server[native-devtools-mcp]
Server -->|Direct API| Sys[System APIs]
Server -->|WebSocket| Debug[AppDebugKit]
Server -->|ADB Protocol| Android[Android Device]
subgraph "Your Machine"
Sys -->|Screen/OCR| macOS[CoreGraphics / Vision]
Sys -->|Input| Win[Win32 / SendInput]
Sys -->|Text Search| UIA[UI Automation]
Debug -.->|Inspect| App[Target App]
end
subgraph "Android Device (USB/Wi-Fi)"
Android -->|screencap| Screen[Screenshots]
Android -->|input| Input[Tap / Swipe / Type]
Android -->|uiautomator| UITree[UI Hierarchy]
end

| OS | Feature | API Used |
|----|---------|----------|
| macOS | Screenshots | screencapture (CLI) |
| | Input | CGEvent (CoreGraphics) |
| | Text Search (find_text) | Accessibility API (primary), Vision OCR (fallback) |
| | Element Inspection (element_at_point) | AXUIElementCopyElementAtPosition + AX tree walk fallback (Accessibility API) |
| | Hover Tracking (start_hover_tracking) | CGEvent cursor + Accessibility API polling |
| | Screen Recording (start_recording) | CGWindowListCreateImage at configurable fps |
| | OCR | VNRecognizeTextRequest (Vision Framework) |
| Windows | Screenshots | BitBlt (GDI) |
| | Input | SendInput (Win32) |
| | Text Search (find_text) | UI Automation (primary), WinRT OCR (fallback) |
| | Element Inspection (element_at_point) | IUIAutomation::ElementFromPoint (UI Automation) |
| | Hover Tracking (start_hover_tracking) | GetCursorPos + UI Automation polling |
| | Screen Recording (start_recording) | BitBlt (GDI) at configurable fps |
| | OCR | Windows.Media.Ocr (WinRT) |
| Android | Screenshots | screencap / ADB framebuffer |
| | Input | adb shell input (tap, swipe, text, keyevent) |
| | Text Search (find_text) | uiautomator dump (accessibility tree) |
| | Device Communication | adb_client crate (native Rust ADB protocol) |
Screenshot Coordinate Precision
Screenshots include metadata for accurate coordinate conversion:
- screenshot_origin_x/y: Screen-space origin of the captured area (in points)
- screenshot_scale: Display scale factor (e.g., 2.0 for Retina displays)
- screenshot_pixel_width/height: Actual pixel dimensions of the image
- screenshot_window_id: Window ID (for window captures)
Coordinate conversion:
screen_x = screenshot_origin_x + (pixel_x / screenshot_scale)
screen_y = screenshot_origin_y + (pixel_y / screenshot_scale)

Implementation notes:
- Window captures (macOS): Uses screencapture -o, which excludes the window shadow. The captured image dimensions match kCGWindowBounds × scale exactly, ensuring click coordinates derived from screenshots land on intended UI elements.
- Region captures: Origin coordinates are aligned to integers to match the actual captured area.
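The conversion formulas can be checked with a small worked example. The numbers here are illustrative: a 2x (Retina) capture whose top-left corner sits at screen point (100, 50), with a match found at pixel (400, 300):

```python
def pixel_to_screen(pixel_x: float, pixel_y: float,
                    origin_x: float, origin_y: float,
                    scale: float) -> tuple[float, float]:
    """Map pixel coordinates inside a screenshot to global screen points."""
    return (origin_x + pixel_x / scale, origin_y + pixel_y / scale)

# 400 px / 2.0 = 200 pt offset from origin_x 100 → screen_x 300; likewise screen_y 200.
print(pixel_to_screen(400, 300, origin_x=100, origin_y=50, scale=2.0))  # (300.0, 200.0)
```

Dividing by the scale first, then adding the origin, is what keeps clicks accurate on high-DPI displays.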
⚠️ Operational Safety
- Hands Off: When the agent is "driving" (clicking/typing), do not move your mouse or type.
- Why? Real hardware inputs can conflict with the simulated ones, causing clicks to land in the wrong place.
- Focus Matters: Ensure the window you want the agent to use is visible. If a popup steals focus, the agent might type into the wrong window unless it checks first.
🪟 Windows Notes
Works out of the box on Windows 10/11.
- Uses standard Win32 APIs (GDI, SendInput).
- find_text uses UI Automation (UIA) as the primary search mechanism, querying the accessibility tree for element names. This is the same accessibility-first approach used on macOS (with the Accessibility API). Falls back to OCR automatically when UIA finds no matches.
- OCR uses the built-in Windows Media OCR engine (offline).
- Note: Cannot interact with "Run as Administrator" windows unless the MCP server itself is also running as Administrator.
- Screen Recording Performance: Screen recording uses GDI/BitBlt at configurable fps (default 5). For higher fps requirements or game capture scenarios, DXGI Desktop Duplication API would provide hardware-accelerated capture — this is a planned future upgrade.
📜 License
MIT © sh3ll3x3c
