@mindfullabai/ai-vision-cli
v1.0.0
ai-vision-cli
AI-powered image and video analysis CLI via Google Gemini. Built as an AI-native CLI that any agent (Claude Code, etc.) can discover and use autonomously.
Why not an MCP?
MCP tools work, but they add ~300-500 tokens of overhead per call (JSON-RPC handshake, schema loading, response wrapping). This CLI outputs plain text to stdout, saving ~90% context when used by AI agents. It's also faster — no protocol negotiation, just a direct API call.
Install
npm install -g @mindfullabai/ai-vision-cli
Setup
1. Get a Google AI Studio API key (free)
Go to aistudio.google.com/apikey and create a key.
export GOOGLE_API_KEY="your-key-here"
# Add to ~/.zshrc or ~/.bashrc to persist
2. Setup Claude Code integration (optional)
ai-vision setup-claude
This command:
- Installs the skill in ~/.claude/skills/ai-vision-cli/
- Adds Bash(ai-vision:*) permission to Claude Code settings
- Verifies your API key
To remove: ai-vision setup-claude --uninstall
Usage
Analyze an image
ai-vision analyze-image ./screenshot.png --prompt "Describe this UI"
ai-vision ai ./photo.jpg --prompt "What objects are in this image?"
Analyze a video
ai-vision analyze-video ./demo.mp4 --prompt "Summarize this video"
ai-vision av "https://youtube.com/watch?v=xyz" --prompt "What topics are covered?"
ai-vision av ./recording.mp4 --prompt "What happens?" --start 1m30s --end 3m
Detect objects
ai-vision detect-objects ./page.png --prompt "Find all interactive elements"
ai-vision do ./screenshot.png --prompt "Find buttons" --output ./annotated.png
ai-vision do ./ui.png --prompt "Locate the search bar" --json
Returns a text summary with element positions + saves an annotated image with bounding boxes. Web-aware: auto-detects HTML elements on webpage screenshots.
Compare images
ai-vision compare before.png after.png --prompt "What changed?"
ai-vision cmp v1.png v2.png v3.png --prompt "How did the design evolve?"
Common options
| Option | Description | Default |
|--------|-------------|---------|
| --prompt | What to analyze (required) | - |
| --model | Gemini model | gemini-2.5-flash-lite (image), gemini-2.5-flash (video) |
| --max-tokens | Max output tokens | 1000 (image), 2000 (video) |
| --temperature | Response randomness (0.0-2.0) | 0.8 |
| --json | Output JSON with metadata | false |
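When scripting the CLI, the options in this table can be assembled programmatically rather than by string concatenation. A minimal Python sketch — the helper name `build_cmd` and its signature are this sketch's own, not part of the CLI:

```python
import shlex

def build_cmd(command, inputs, prompt, model=None, max_tokens=None,
              temperature=None, as_json=False):
    """Assemble an ai-vision argv list from the options in the table above."""
    argv = ["ai-vision", command, *inputs, "--prompt", prompt]
    if model is not None:
        argv += ["--model", model]
    if max_tokens is not None:
        argv += ["--max-tokens", str(max_tokens)]
    if temperature is not None:
        argv += ["--temperature", str(temperature)]
    if as_json:
        argv.append("--json")
    return argv

if __name__ == "__main__":
    argv = build_cmd("analyze-image", ["./screenshot.png"],
                     "Describe this UI", temperature=0.2, as_json=True)
    print(shlex.join(argv))
    # Actually running it needs the CLI installed and GOOGLE_API_KEY set:
    # subprocess.run(argv, capture_output=True, text=True)
```

Passing an argv list (rather than a shell string) avoids quoting bugs when prompts contain spaces or quotes.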
AI Agent Integration
This CLI is designed to be self-describing for AI agents. Any agent that can run shell commands can discover and use it:
# Agent runs --help, sees the hint:
ai-vision --help
# → "AI Agent? Run: ai-vision --ai"
# Agent gets full JSON schema:
ai-vision --ai
# Or just a brief overview:
ai-vision --ai brief
# Or usage examples:
ai-vision --ai examples
# Or schema for a specific command:
ai-vision --ai analyze-image
The --ai flag outputs structured JSON with:
- Command names, descriptions, and when to use each one
- Full parameter schemas (types, required, defaults)
- Concrete examples
- No API key required
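An agent-side consumer of this output might look like the following Python sketch. The field names (`commands`, `when_to_use`, `name`) are assumptions about the schema shape for illustration; an agent should inspect the actual `--ai` output:

```python
import json
import subprocess

def load_tool_schema(tool="ai-vision"):
    """Run `<tool> --ai` and parse the self-description it prints to stdout."""
    out = subprocess.run([tool, "--ai"], capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def pick_command(schema, keyword):
    """Return the first command whose when_to_use hint mentions the keyword.
    The 'commands' / 'when_to_use' / 'name' keys are assumed, not documented."""
    for cmd in schema.get("commands", []):
        if keyword.lower() in cmd.get("when_to_use", "").lower():
            return cmd.get("name")
    return None

# Example against a hand-written schema in the assumed shape:
sample = {"commands": [
    {"name": "analyze-video", "when_to_use": "Use for video files or YouTube URLs"},
]}
print(pick_command(sample, "video"))  # analyze-video
```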
The --ai pattern
This CLI implements a pattern we call AI-native CLIs: tools that describe themselves in a machine-readable format via a simple --ai flag. Any CLI can adopt this pattern to become instantly usable by AI agents without external documentation.
The discovery flow for an AI agent:
1. tool --help → sees "AI Agent? Run with --ai"
2. tool --ai brief → quick overview (few tokens)
3. tool --ai → full JSON schema with when_to_use, parameters, examples
4. Agent now knows exactly when and how to use the tool
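Adopting the pattern in another CLI takes little code. A minimal Python sketch of a hypothetical `greet` tool that self-describes via --ai — the schema fields mirror the flow above but are illustrative, not a specification:

```python
import argparse
import json

# Machine-readable self-description, emitted on --ai.
AI_SCHEMA = {
    "name": "greet",
    "description": "Print a greeting for a name.",
    "commands": [{
        "name": "greet",
        "when_to_use": "When the user wants a greeting printed to stdout.",
        "parameters": {"name": {"type": "string", "required": False}},
        "examples": ["greet --name Ada"],
    }],
}

def describe(mode="full"):
    """Return the full JSON schema, or a one-liner for `--ai brief`."""
    if mode == "brief":
        return AI_SCHEMA["description"]
    return json.dumps(AI_SCHEMA, indent=2)

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="greet. AI Agent? Run: greet --ai")
    parser.add_argument("--ai", nargs="?", const="full", choices=["full", "brief"])
    parser.add_argument("--name", default="world")
    args = parser.parse_args(argv)
    if args.ai:
        print(describe(args.ai))
        return
    print(f"Hello, {args.name}!")

if __name__ == "__main__":
    main(["--ai", "brief"])
```

The `--help` epilog points agents at `--ai`, and the schema itself needs no API key or network call — matching the discovery flow described above.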
How it works
Under the hood, ai-vision calls Google Gemini models via the @google/genai SDK:
- Image analysis: gemini-2.5-flash-lite (fast, cheap)
- Video analysis: gemini-2.5-flash (handles temporal understanding)
- Object detection: Gemini with structured JSON output + imagescript for annotation rendering
- Thinking disabled: thinkingBudget: 0 for flash models to avoid wasting tokens
Supports local files, URLs, base64 data URLs, and YouTube URLs (video).
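A caller that routes inputs by source type could use a heuristic like this Python sketch — the bucketing logic is this sketch's own guess, not how the CLI itself classifies inputs:

```python
from urllib.parse import urlparse

_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}

def classify_source(src):
    """Bucket an input into the four source types the CLI accepts:
    'data-url', 'youtube', 'url', or 'file'. Hostname/scheme checks
    are an illustrative heuristic only."""
    if src.startswith("data:"):
        return "data-url"
    parsed = urlparse(src)
    if parsed.scheme in ("http", "https"):
        return "youtube" if parsed.hostname in _YOUTUBE_HOSTS else "url"
    return "file"

print(classify_source("https://youtube.com/watch?v=xyz"))  # youtube
```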
License
MIT
