@ajmalaksar/multimodal-mcp

v0.1.5

Published

3 days ago

MCP server and CLI exposing Google's multimodal models (Gemini AI Studio and Vertex AI) for image, video, and audio analysis plus image generation — for Claude Code and other agents.


ajmalaksar

mcp gemini vertex nano-banana multimodal image-generation claude-code

multimodal-mcp

Google's multimodal models (Gemini AI Studio and Vertex AI) as an MCP server and a CLI. Give Claude Code — or any agent that can't see images or generate them — the ability to read images, audio, and video and to create high-quality images with the Nano Banana / Gemini image models.

Tools: list_models, generate_text, analyze_audio, analyze_image, analyze_video, generate_image, get_usage_stats — on both the Gemini and Vertex backends. See Roadmap.

Why

Coding agents are great at text but often can't see a screenshot, read a design, or produce an image asset. Point them at this MCP and they can: describe/transcribe local media, OCR a screenshot, critique a design, and generate or edit images (text-to-image and image-to-image) — all through a small, typed tool surface. The same thing is available as a CLI for your own terminal.

Install

From npm:

npm install -g @ajmalaksar/multimodal-mcp   # or: pnpm add -g
multimodal-mcp --help

From source:

git clone https://github.com/ajmalaksar25/multimodal-mcp
cd multimodal-mcp
pnpm install
pnpm run build
node dist/cli.js --help

Credentials

Easiest — guided setup:

npx -y @ajmalaksar/multimodal-mcp setup

This opens the AI Studio key page in your browser, takes your pasted key, verifies that it works, stores it encrypted, and prints the line that registers the MCP with Claude Code. Run multimodal-mcp doctor at any time to see what's configured and re-test connectivity. Getting a free Gemini key from Google is the one manual step and cannot be skipped; the Vertex backend avoids the key entirely by authenticating with gcloud auth application-default login.

Or pick a method manually — environment variables always win over the stored file:

Option A — login (encrypted at rest):

multimodal-mcp login        # prompts for backend + key, stores it encrypted

Credentials are written AES-256-GCM encrypted to ~/.multimodal-mcp/credentials.enc. By default they're sealed with a local key file (~/.multimodal-mcp/key, mode 600). That prevents accidental leaks (commits, syncs, plaintext config), but because the key sits beside the data, it is leak prevention rather than protection against someone who can already read your home directory. For genuine at-rest confidentiality, and a secret you can carry across machines, set a passphrase first; the key is then never written to disk and only the passphrase decrypts:

export MULTIMODAL_MCP_PASSPHRASE='…'
multimodal-mcp login

On Windows the 600 file mode is only weakly enforced, so prefer passphrase mode there — and whenever credentials travel between machines (the local key file doesn't, and shouldn't, travel). Passphrase keys are derived with scrypt (N=2¹⁶); the on-disk format is versioned, so existing stores keep working across upgrades.
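To make the passphrase mode concrete, here is a minimal sketch of scrypt key derivation (N=2¹⁶) feeding AES-256-GCM, using only node:crypto. It is not the package's actual code: the SealedBlob layout, the seal/unseal names, the salt and IV sizes, and the r and p parameters are all assumptions chosen for illustration.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

// Hypothetical on-disk layout; the real credentials.enc format may differ.
interface SealedBlob {
  version: 1;
  salt: string; // base64
  iv: string;   // base64
  tag: string;  // base64 GCM authentication tag
  data: string; // base64 ciphertext
}

// Derive a 32-byte key with scrypt. N = 2^16 comes from the README;
// r = 8 and p = 1 are assumed. maxmem is raised because Node's default
// (32 MiB) is below the roughly 64 MiB this N and r require.
function deriveKey(passphrase: string, salt: Buffer): Buffer {
  return scryptSync(passphrase, salt, 32, {
    N: 2 ** 16,
    r: 8,
    p: 1,
    maxmem: 128 * 1024 * 1024,
  });
}

function seal(plaintext: string, passphrase: string): SealedBlob {
  const salt = randomBytes(16);
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", deriveKey(passphrase, salt), iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    version: 1,
    salt: salt.toString("base64"),
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    data: data.toString("base64"),
  };
}

function unseal(blob: SealedBlob, passphrase: string): string {
  const key = deriveKey(passphrase, Buffer.from(blob.salt, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(blob.iv, "base64"));
  decipher.setAuthTag(Buffer.from(blob.tag, "base64"));
  // final() throws if the passphrase is wrong or the data was tampered with.
  const plain = Buffer.concat([
    decipher.update(Buffer.from(blob.data, "base64")),
    decipher.final(),
  ]);
  return plain.toString("utf8");
}
```

Because GCM is authenticated, a wrong passphrase surfaces as an exception from unseal rather than as garbage output, which is what lets a command such as doctor report a bad passphrase cleanly.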

Option B — environment variables / .env.local:

cp .env.example .env.local   # then fill in GEMINI_API_KEY (gemini) or GCP_PROJECT_ID (vertex)
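The precedence stated earlier (environment variables always win over the stored file) could be resolved along these lines. This is only a sketch: the StoredCredentials shape and the resolveGeminiKey name are hypothetical, and treating an empty variable as unset is an assumption.

```typescript
// Hypothetical shape of the decrypted credentials store.
interface StoredCredentials {
  geminiApiKey?: string;
  gcpProjectId?: string;
}

type Env = Record<string, string | undefined>;

// Environment first, stored file second. An empty or whitespace-only
// variable is treated as unset (an assumption, not documented behavior).
function resolveGeminiKey(env: Env, stored?: StoredCredentials): string | undefined {
  const fromEnv = env["GEMINI_API_KEY"]?.trim();
  if (fromEnv) return fromEnv;
  return stored?.geminiApiKey;
}
```

In a real process the first argument would be process.env; passing it in explicitly keeps the function easy to test.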

Trust boundary: the analyze tools read, and generate_image writes to, any path the server's OS user can access — output_dir and input_images are not sandboxed. Run the server as a user with appropriately scoped permissions and treat tool arguments from untrusted callers accordingly.

Use from Claude Code

Add to your MCP config (~/.claude/mcp.json or a project .mcp.json):

{
  "mcpServers": {
    "multimodal": {
      "command": "multimodal-mcp",
      "args": ["serve"],
      "env": { "GEMINI_API_KEY": "..." }
    }
  }
}

From a source checkout, use "command": "node" with "args": ["/path/to/multimodal-mcp/dist/server.js"]. If you ran multimodal-mcp login, drop GEMINI_API_KEY from the env block; if you used a passphrase, set MULTIMODAL_MCP_PASSPHRASE there instead.

CLI

multimodal-mcp setup            # guided first-run (open key page, paste, verify, store)
multimodal-mcp doctor           # show configured creds + test connectivity
multimodal-mcp models [--capability c] [--provider p] [--include-paid] [--json]
multimodal-mcp text "explain this error" [--model m] [--backend gemini|vertex]
multimodal-mcp analyze ./screenshot.png [--prompt "what's broken?"] [--backend …]
multimodal-mcp generate "a red bicycle, studio lighting" --out ./assets --aspect 16:9
multimodal-mcp generate "make the sky purple" --ref ./photo.png            # image-to-image
multimodal-mcp generate "logo, transparent bg" --model nano-banana-pro-preview --confirm-paid
multimodal-mcp stats [--since 2026-06-01] [--until 2026-06-30] [--by-day] [--json]
multimodal-mcp dashboard --open                            # live usage web UI on localhost
multimodal-mcp serve            # run the MCP server over stdio
multimodal-mcp login | logout   # manage stored credentials

analyze picks image/audio/video by file extension. generate saves files and prints their absolute paths. Repeat --ref to pass multiple reference images. stats summarizes local usage.
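Dispatch by file extension can be sketched as a simple lookup. The extension lists below are copied from the Tools section; the detectMediaKind name and the case-insensitive matching are assumptions, and the real CLI may accept more spellings (for example jpeg) than this sketch does.

```typescript
import { extname } from "node:path";

type MediaKind = "image" | "audio" | "video";

// Extensions as listed per tool in this README.
const EXTENSIONS: Record<MediaKind, readonly string[]> = {
  image: ["jpg", "png", "webp", "gif", "heic", "heif"],
  audio: ["mp3", "ogg", "wav", "flac", "aiff", "aac", "m4a"],
  video: ["mp4", "mov", "webm", "mpeg", "avi", "wmv", "flv", "3gp"],
};

// Returns the media kind for a path, or undefined when the extension
// is missing or not in any list.
function detectMediaKind(filePath: string): MediaKind | undefined {
  const ext = extname(filePath).slice(1).toLowerCase();
  for (const kind of Object.keys(EXTENSIONS) as MediaKind[]) {
    if (EXTENSIONS[kind].includes(ext)) return kind;
  }
  return undefined;
}
```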

Tools

| Tool | Purpose |
|---|---|
| list_models | List models in the registry, filterable by capability, provider, include_paid. |
| generate_text | Generate text from a Gemini model. Defaults to a cheap text-capable model; accepts model, system, temperature, max_output_tokens, backend. |
| analyze_audio | Transcribe or analyze a local audio file (mp3, ogg, wav, flac, aiff, aac, m4a). Files ≤18 MB go inline; larger files upload via the Gemini Files API and are polled until ACTIVE. Override the default transcription prompt as needed. |
| analyze_image | Describe or analyze a local image (jpg, png, webp, gif, heic, heif). Default prompt is a thorough description; override for OCR, design critique, screenshot debugging, chart reading, etc. |
| analyze_video | Describe or analyze a local video (mp4, mov, webm, mpeg, avi, wmv, flv, 3gp), with a 15-min processing-poll timeout. Default prompt is a scene-by-scene description with [MM:SS] timestamps + summary. |
| generate_image | Generate or edit images (Nano Banana family). Text-to-image by default; pass input_images for image-to-image editing/composition, plus optional aspect_ratio / image_size (best on the Pro models). Cheap default gemini-2.5-flash-image; paid models require confirm_paid: true. Saves to ~/.multimodal-mcp/images/ (or output_dir) and returns absolute paths. |
| get_usage_stats | Summarize locally-logged usage: call counts, token totals, and estimated cost (paid-tier rates) grouped by model, tool, and backend. Optional since / until ISO date filters. Reads ~/.multimodal-mcp/usage.jsonl. |
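The inline-versus-upload rule described for analyze_audio (files ≤18 MB go inline, larger ones go through the Files API) amounts to a size check before the request is built. A minimal sketch follows; the function names are hypothetical, and reading "18 MB" as 18 × 1024 × 1024 bytes is an assumption.

```typescript
import { statSync } from "node:fs";

// Assumption: "18 MB" means 18 MiB. The real threshold may be 18 * 10^6 bytes.
const INLINE_LIMIT_BYTES = 18 * 1024 * 1024;

type Transport = "inline" | "files-api";

// Files at or below the limit are sent inline; anything larger is uploaded.
function chooseTransport(sizeBytes: number, limit: number = INLINE_LIMIT_BYTES): Transport {
  return sizeBytes <= limit ? "inline" : "files-api";
}

// Convenience wrapper that reads the size from disk.
function transportForFile(filePath: string): Transport {
  return chooseTransport(statSync(filePath).size);
}
```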

Every model-invoking tool also accepts a backend arg (gemini or vertex), and logs one usage record per call (see Usage analytics).

Backends

Each tool runs against one of two Google backends, chosen per call with backend or globally via MULTIMODAL_GOOGLE_BACKEND:

gemini (default) — Google AI Studio, authenticated with GEMINI_API_KEY.
vertex — Vertex AI on Google Cloud, using GCP_PROJECT_ID (+ optional GCP_LOCATION, default global) and Application Default Credentials (gcloud auth application-default login, or GOOGLE_APPLICATION_CREDENTIALS pointing at a service-account JSON).

The same Gemini model ids work on both; choose vertex to bill and quota through your Google Cloud project.
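Backend selection can be pictured as a small resolver. The sketch below assumes that a per-call backend argument takes precedence over MULTIMODAL_GOOGLE_BACKEND, and that an unrecognized value is rejected; neither detail is confirmed by this README, and resolveBackend is a hypothetical name.

```typescript
type Backend = "gemini" | "vertex";

function isBackend(value: string): value is Backend {
  return value === "gemini" || value === "vertex";
}

// Assumed order: per-call argument, then MULTIMODAL_GOOGLE_BACKEND,
// then the documented default ("gemini").
function resolveBackend(
  perCall: string | undefined,
  env: Record<string, string | undefined>,
): Backend {
  const candidate = perCall ?? env["MULTIMODAL_GOOGLE_BACKEND"];
  if (candidate === undefined || candidate === "") return "gemini";
  if (!isBackend(candidate)) {
    throw new Error(`Unknown backend "${candidate}" (expected gemini or vertex)`);
  }
  return candidate;
}
```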

Usage analytics

Every model-invoking call appends one JSON line to ~/.multimodal-mcp/usage.jsonl — timestamp, tool, model, backend, token counts, and an estimated cost (at paid-tier rates; absent for models with no published price). No prompt text, file paths, or credentials are ever recorded. View it with multimodal-mcp stats (or the get_usage_stats tool):

multimodal-mcp stats                          # totals + per-model / per-tool / per-backend
multimodal-mcp stats --by-day                  # add a per-day breakdown
multimodal-mcp stats --since 2026-06-01 --json
multimodal-mcp dashboard --open                # live web UI (localhost, auto-refreshes)

dashboard serves a single self-contained page (no framework/build/CDN — just node:http) on 127.0.0.1 that visualizes the same data over time; it stays up until Ctrl+C.

Set MULTIMODAL_MCP_NO_USAGE_LOG=1 to turn logging off. The file is append-only and safe to delete at any time to reset stats. Costs are estimates only — see the per-model prices in src/models/registry.ts (also shown in multimodal-mcp models).
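Since the log is plain JSON Lines, you can also aggregate it yourself. The sketch below groups records by model. The field names in UsageRecord are hypothetical (the README lists what is recorded but not the exact keys), so check a line of your own usage.jsonl before relying on them.

```typescript
// Hypothetical record shape; the actual key names may differ.
interface UsageRecord {
  timestamp: string;
  tool: string;
  model: string;
  backend: string;
  inputTokens: number;
  outputTokens: number;
  estimatedCostUsd?: number; // absent for models with no published price
}

interface ModelTotals {
  calls: number;
  tokens: number;
  costUsd: number;
}

// Parse JSONL text and accumulate per-model totals. Blank lines are skipped.
function summarizeByModel(jsonl: string): Map<string, ModelTotals> {
  const totals = new Map<string, ModelTotals>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const rec = JSON.parse(line) as UsageRecord;
    const entry = totals.get(rec.model) ?? { calls: 0, tokens: 0, costUsd: 0 };
    entry.calls += 1;
    entry.tokens += rec.inputTokens + rec.outputTokens;
    entry.costUsd += rec.estimatedCostUsd ?? 0;
    totals.set(rec.model, entry);
  }
  return totals;
}
```

To run it against the real file, read ~/.multimodal-mcp/usage.jsonl (or the directory set by MULTIMODAL_MCP_HOME) as UTF-8 text and pass the contents in.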

Budget guard: set DAILY_BUDGET_USD and any model-invoking tool/CLI call is refused once today's estimated spend (UTC) reaches it — a lightweight cap built on the same cost estimates. Unset by default (no limit).
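The budget check described above reduces to summing today's estimated cost in UTC and comparing it with the cap. A minimal sketch, assuming ISO 8601 timestamps and a hypothetical estimatedCostUsd field; treating an unparseable DAILY_BUDGET_USD as "no limit" is also an assumption.

```typescript
// Hypothetical subset of a usage record.
interface SpendRecord {
  timestamp: string; // ISO 8601
  estimatedCostUsd?: number;
}

// Sum the estimated cost of records whose UTC date matches `now`.
function spentTodayUsd(records: SpendRecord[], now: Date): number {
  const today = now.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return records
    .filter((r) => new Date(r.timestamp).toISOString().slice(0, 10) === today)
    .reduce((sum, r) => sum + (r.estimatedCostUsd ?? 0), 0);
}

// True when a call should be refused: the cap is set and today's spend
// has reached it. Unset, empty, or non-numeric values mean no limit here.
function isOverBudget(records: SpendRecord[], now: Date, budgetEnv: string | undefined): boolean {
  if (budgetEnv === undefined || budgetEnv.trim() === "") return false;
  const budget = Number(budgetEnv);
  if (!Number.isFinite(budget)) return false;
  return spentTodayUsd(records, now) >= budget;
}
```

Note the comparison is "reaches", not "exceeds": a spend exactly equal to the cap already blocks the next call, matching the wording above.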

Environment

| Variable | Backend | Notes |
|---|---|---|
| GEMINI_API_KEY | gemini | https://aistudio.google.com/app/apikey |
| GCP_PROJECT_ID | vertex | Your Google Cloud project id |
| GCP_LOCATION | vertex | Region; defaults to global |
| GOOGLE_APPLICATION_CREDENTIALS | vertex | Service-account JSON path, or use gcloud auth application-default login |
| MULTIMODAL_GOOGLE_BACKEND | both | Default backend: gemini (default) or vertex |
| MULTIMODAL_MCP_PASSPHRASE | both | Passphrase to encrypt/decrypt the stored credentials (optional) |
| MULTIMODAL_MCP_HOME | both | Config dir for credentials + usage log (default ~/.multimodal-mcp) |
| MULTIMODAL_MCP_OUTPUT_DIR | - | Root for generated images (default ~/.multimodal-mcp) |
| MULTIMODAL_MCP_NO_USAGE_LOG | - | Set to 1 to disable local usage logging |
| DAILY_BUDGET_USD | - | Refuse model calls once today's estimated spend (UTC) reaches this |

Roadmap

| Milestone | Scope | Status |
|---|---|---|
| M0 | Scaffold + list_models + generate_text | ✓ |
| M1a | analyze_audio (inline + Files API for large files) | ✓ |
| M1b | analyze_image (inline + Files API for large files) | ✓ |
| M1c | analyze_video (sync, short clips, Files API for long ones) | ✓ |
| M2 | Async job runner + long video / YouTube analysis | |
| M3 | generate_image (Nano Banana family) + paid-gating | ✓ |
| M3+ | image-to-image / reference (input_images), aspect ratio & size | ✓ |
| M3+ | budget enforcement (DAILY_BUDGET_USD) | ✓ |
| M4 | Vertex AI backend (Gemini models) | ✓ |
| M4+ | Vertex Imagen / Veo via predict API | |
| M5 | OpenRouter as a provider | |
| M6− | Usage logging + stats CLI / get_usage_stats tool + budget guard + web dashboard | ✓ |
| M6 | Ink TUI (config, models, jobs, usage) | |
| M7 | Public release (npm) | |

Development

pnpm run dev         # run the MCP server via tsx (stdio)
pnpm test            # vitest
pnpm run typecheck   # tsc --noEmit
pnpm run build       # tsc → dist/

scripts/smoke.ts and scripts/smoke-image.ts are live checks that hit the real API — run with pnpm tsx scripts/smoke.ts and a configured key.

License

MIT — see LICENSE.