gemini-cua

v1.0.5, published 4 months ago

Model Context Protocol server for Gemini computer use functionality with browser automation using Playwright



Gemini CUA - Computer Use Automation

A Model Context Protocol (MCP) server that provides computer use and browser automation capabilities based on Google's Gemini computer use preview functionality. This server enables AI assistants to control web browsers and interact with web pages through natural language commands.

Features

This MCP server provides tools for browser automation and computer interaction with support for both local and remote environments:

Supported Environments

Playwright (Local): Run browser automation locally with Chromium
Browserbase (Remote): Use Browserbase's cloud browser infrastructure

Capabilities

Browser Control: Open web browser, navigate to URLs, go back/forward
Mouse Interactions: Click, hover, drag and drop at specific coordinates
Keyboard Actions: Type text, execute key combinations
Page Navigation: Scroll, search, navigate to specific URLs
State Capture: Take screenshots and get current page state with base64 encoding
Session Management: Automatic session creation and cleanup (Browserbase)
Waiting: Built-in delays for page loading

Installation

From npm

npm install -g gemini-cua

From source

git clone https://github.com/snakecased/gemini-cua-mcp.git
cd gemini-cua-mcp
npm install
npm run build

Usage

As MCP Server (Recommended)

Add to your MCP client configuration (e.g., Claude Desktop, Cline):

{
  "mcpServers": {
    "gemini-computer-use": {
      "command": "gemini-cua"
    }
  }
}

Or, if the package is installed locally in a project rather than globally, launch it through npx:

{
  "mcpServers": {
    "gemini-computer-use": {
      "command": "npx",
      "args": ["gemini-cua"]
    }
  }
}
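Many MCP clients, Claude Desktop among them, also accept an `env` object on each server entry, which is a convenient place for the variables described under Environment Configuration. The following is a minimal sketch that generates such an entry; `buildClientConfig` and `McpServerEntry` are hypothetical names introduced here for illustration, and whether your client honors `env` is an assumption to check against its documentation.

```typescript
// Shape of one entry under "mcpServers". The env field is an assumption:
// it is commonly accepted by MCP clients, but verify it for your client.
interface McpServerEntry {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

// Hypothetical helper: builds a client config that launches gemini-cua via
// npx and forwards the given environment variables to the server process.
function buildClientConfig(
  env: Record<string, string>
): { mcpServers: Record<string, McpServerEntry> } {
  const entry: McpServerEntry = { command: "npx", args: ["gemini-cua"] };
  // Only emit an env block when there is something to forward.
  if (Object.keys(env).length > 0) {
    entry.env = { ...env };
  }
  return { mcpServers: { "gemini-computer-use": entry } };
}

// Placeholder credentials, as in the shell examples of this README.
const config = buildClientConfig({
  COMPUTER_USE_ENV: "browserbase",
  BROWSERBASE_API_KEY: "your_api_key_here",
  BROWSERBASE_PROJECT_ID: "your_project_id_here",
});

// Print JSON that can be pasted into the client configuration file.
console.log(JSON.stringify(config, null, 2));
```

Keeping the credentials in the client configuration can help when the client starts the server as a child process that does not inherit variables exported in your shell.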

Direct Usage

# If installed globally
gemini-cua

# If installed locally
npx gemini-cua

# From source
npm run start

Requirements

Node.js 18+
Chromium browser (automatically installed via Playwright)
For Browserbase: API key and Project ID from browserbase.com

Environment Configuration

Environment Variables

Set the following environment variables to choose your browser environment:

# Use local Playwright (default)
export COMPUTER_USE_ENV=playwright

# Use Browserbase remote browsers
export COMPUTER_USE_ENV=browserbase
export BROWSERBASE_API_KEY=your_api_key_here
export BROWSERBASE_PROJECT_ID=your_project_id_here
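The selection logic implied by these variables can be sketched as follows. This is not the server's actual code: `resolveBrowserEnv` and the `BrowserEnv` type are hypothetical, and the fail-fast behavior on missing Browserbase credentials is an assumption about what a sensible implementation might do.

```typescript
// Hypothetical result type: which backend to use, plus its credentials.
type BrowserEnv =
  | { kind: "playwright" }
  | { kind: "browserbase"; apiKey: string; projectId: string };

// Hypothetical helper: maps environment variables to a backend choice.
// Takes a plain record so it can be exercised without a real environment.
function resolveBrowserEnv(
  env: Record<string, string | undefined>
): BrowserEnv {
  // Unset or empty falls back to local Playwright, the documented default.
  const kind = env.COMPUTER_USE_ENV || "playwright";

  if (kind === "playwright") {
    return { kind: "playwright" };
  }

  if (kind === "browserbase") {
    const apiKey = env.BROWSERBASE_API_KEY;
    const projectId = env.BROWSERBASE_PROJECT_ID;
    // Assumed behavior: refuse to start rather than fail on first use.
    if (!apiKey || !projectId) {
      throw new Error(
        "COMPUTER_USE_ENV=browserbase requires BROWSERBASE_API_KEY and BROWSERBASE_PROJECT_ID"
      );
    }
    return { kind: "browserbase", apiKey, projectId };
  }

  throw new Error(`Unsupported COMPUTER_USE_ENV: ${kind}`);
}

console.log(resolveBrowserEnv({}).kind);
console.log(
  resolveBrowserEnv({
    COMPUTER_USE_ENV: "browserbase",
    BROWSERBASE_API_KEY: "your_api_key_here",
    BROWSERBASE_PROJECT_ID: "your_project_id_here",
  }).kind
);
```

In a Node.js process the same function could be called as `resolveBrowserEnv(process.env)`.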

Browserbase Setup

Sign up at browserbase.com
Create a project and get your API key and Project ID
Set the environment variables as shown above

Available Tools

Browser Management

open_web_browser: Launch browser and navigate to URL
navigate: Go to specific URL
go_back: Navigate to previous page
go_forward: Navigate to next page
search: Open search engine with optional query

Mouse Interactions

click_at(x, y): Click at coordinates
hover_at(x, y): Hover at coordinates
drag_and_drop(from_x, from_y, to_x, to_y): Drag and drop

Keyboard Actions

type_text(text): Type text at cursor
key_combination(keys): Execute keyboard shortcuts

Page Control

scroll_document(direction, amount): Scroll up/down
screen_size(): Get viewport dimensions
current_state(): Get screenshot and URL
wait_5_seconds(): Wait for page loading
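On the wire, an MCP client invokes each of these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such requests for three of the tools listed above. The `toolCall` helper is hypothetical, and the argument types (numeric `x` and `y`, string `text`) are assumptions inferred from the signatures in this README, not confirmed by the server's schema.

```typescript
// Shape of an MCP tools/call request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Hypothetical helper: wraps a tool name and its arguments in a request
// with a fresh, monotonically increasing id.
function toolCall(
  name: string,
  args: Record<string, unknown> = {}
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// A short interaction: click a field, type into it, then capture state.
// The coordinates and text are made-up example values.
const steps: ToolCallRequest[] = [
  toolCall("click_at", { x: 200, y: 150 }),
  toolCall("type_text", { text: "hello" }),
  toolCall("current_state"),
];

for (const step of steps) {
  console.log(JSON.stringify(step));
}
```

In normal use the MCP client (for example Claude Desktop or Cline) constructs these messages itself; the sketch only makes the request shape visible for debugging or for writing a custom client.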

Dependencies

@modelcontextprotocol/sdk: MCP protocol implementation
playwright: Browser automation framework

Development

npm run dev     # Run in development mode
npm run watch   # Watch for changes
npm run build   # Build for production

License

Apache-2.0, matching the license of the original Google Gemini computer use preview.