@mindfullabai/ai-vision-cli
v1.0.0
ai-vision-cli
AI-powered image and video analysis CLI via Google Gemini. Built as an AI-native CLI that any agent (Claude Code, etc.) can discover and use autonomously.
Why not an MCP?
MCP tools work, but they add ~300-500 tokens of overhead per call (JSON-RPC handshake, schema loading, response wrapping). This CLI outputs plain text to stdout, saving ~90% context when used by AI agents. It's also faster — no protocol negotiation, just a direct API call.
Install
npm install -g @mindfullabai/ai-vision-cli
Setup
1. Get a Google AI Studio API key (free)
Go to aistudio.google.com/apikey and create a key.
export GOOGLE_API_KEY="your-key-here"
# Add to ~/.zshrc or ~/.bashrc to persist
2. Setup Claude Code integration (optional)
ai-vision setup-claude
This command:
- Installs the skill in ~/.claude/skills/ai-vision-cli/
- Adds Bash(ai-vision:*) permission to Claude Code settings
- Verifies your API key
To remove: ai-vision setup-claude --uninstall
Usage
Analyze an image
ai-vision analyze-image ./screenshot.png --prompt "Describe this UI"
ai-vision ai ./photo.jpg --prompt "What objects are in this image?"
Analyze a video
ai-vision analyze-video ./demo.mp4 --prompt "Summarize this video"
ai-vision av "https://youtube.com/watch?v=xyz" --prompt "What topics are covered?"
ai-vision av ./recording.mp4 --prompt "What happens?" --start 1m30s --end 3m
Detect objects
ai-vision detect-objects ./page.png --prompt "Find all interactive elements"
ai-vision do ./screenshot.png --prompt "Find buttons" --output ./annotated.png
ai-vision do ./ui.png --prompt "Locate the search bar" --json
Returns a text summary with element positions + saves an annotated image with bounding boxes. Web-aware: auto-detects HTML elements on webpage screenshots.
Compare images
ai-vision compare before.png after.png --prompt "What changed?"
ai-vision cmp v1.png v2.png v3.png --prompt "How did the design evolve?"
Common options
| Option | Description | Default |
|--------|-------------|---------|
| --prompt | What to analyze (required) | - |
| --model | Gemini model | gemini-2.5-flash-lite (image), gemini-2.5-flash (video) |
| --max-tokens | Max output tokens | 1000 (image), 2000 (video) |
| --temperature | Response randomness (0.0-2.0) | 0.8 |
| --json | Output JSON with metadata | false |
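When scripting the CLI, the options in this table can be assembled programmatically rather than by string concatenation. A minimal Python sketch — the helper name `build_cmd` and its signature are this sketch's own, not part of the CLI:

```python
import shlex

def build_cmd(command, inputs, prompt, model=None, max_tokens=None,
              temperature=None, as_json=False):
    """Assemble an ai-vision argv list from the options in the table above."""
    argv = ["ai-vision", command, *inputs, "--prompt", prompt]
    if model is not None:
        argv += ["--model", model]
    if max_tokens is not None:
        argv += ["--max-tokens", str(max_tokens)]
    if temperature is not None:
        argv += ["--temperature", str(temperature)]
    if as_json:
        argv.append("--json")
    return argv

if __name__ == "__main__":
    argv = build_cmd("analyze-image", ["./screenshot.png"],
                     "Describe this UI", temperature=0.2, as_json=True)
    print(shlex.join(argv))
    # Actually running it needs the CLI installed and GOOGLE_API_KEY set:
    # subprocess.run(argv, capture_output=True, text=True)
```

Passing an argv list (rather than a shell string) avoids quoting bugs when prompts contain spaces or quotes.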
AI Agent Integration
This CLI is designed to be self-describing for AI agents. Any agent that can run shell commands can discover and use it:
# Agent runs --help, sees the hint:
ai-vision --help
# → "AI Agent? Run: ai-vision --ai"
# Agent gets full JSON schema:
ai-vision --ai
# Or just a brief overview:
ai-vision --ai brief
# Or usage examples:
ai-vision --ai examples
# Or schema for a specific command:
ai-vision --ai analyze-image
The --ai flag outputs structured JSON with:
- Command names, descriptions, and when to use each one
- Full parameter schemas (types, required, defaults)
- Concrete examples
- No API key required
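An agent-side consumer of this output might look like the following Python sketch. The field names (`commands`, `when_to_use`, `name`) are assumptions about the schema shape for illustration; an agent should inspect the actual `--ai` output:

```python
import json
import subprocess

def load_tool_schema(tool="ai-vision"):
    """Run `<tool> --ai` and parse the self-description it prints to stdout."""
    out = subprocess.run([tool, "--ai"], capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def pick_command(schema, keyword):
    """Return the first command whose when_to_use hint mentions the keyword.
    The 'commands' / 'when_to_use' / 'name' keys are assumed, not documented."""
    for cmd in schema.get("commands", []):
        if keyword.lower() in cmd.get("when_to_use", "").lower():
            return cmd.get("name")
    return None

# Example against a hand-written schema in the assumed shape:
sample = {"commands": [
    {"name": "analyze-video", "when_to_use": "Use for video files or YouTube URLs"},
]}
print(pick_command(sample, "video"))  # analyze-video
```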
The --ai pattern
This CLI implements a pattern we call AI-native CLIs: tools that describe themselves in a machine-readable format via a simple --ai flag. Any CLI can adopt this pattern to become instantly usable by AI agents without external documentation.
The discovery flow for an AI agent:
1. tool --help → sees "AI Agent? Run with --ai"
2. tool --ai brief → quick overview (few tokens)
3. tool --ai → full JSON schema with when_to_use, parameters, examples
4. Agent now knows exactly when and how to use the tool
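Adopting the pattern in another CLI takes little code. A minimal Python sketch of a hypothetical `greet` tool that self-describes via --ai — the schema fields mirror the flow above but are illustrative, not a specification:

```python
import argparse
import json

# Machine-readable self-description, emitted on --ai.
AI_SCHEMA = {
    "name": "greet",
    "description": "Print a greeting for a name.",
    "commands": [{
        "name": "greet",
        "when_to_use": "When the user wants a greeting printed to stdout.",
        "parameters": {"name": {"type": "string", "required": False}},
        "examples": ["greet --name Ada"],
    }],
}

def describe(mode="full"):
    """Return the full JSON schema, or a one-liner for `--ai brief`."""
    if mode == "brief":
        return AI_SCHEMA["description"]
    return json.dumps(AI_SCHEMA, indent=2)

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="greet. AI Agent? Run: greet --ai")
    parser.add_argument("--ai", nargs="?", const="full", choices=["full", "brief"])
    parser.add_argument("--name", default="world")
    args = parser.parse_args(argv)
    if args.ai:
        print(describe(args.ai))
        return
    print(f"Hello, {args.name}!")

if __name__ == "__main__":
    main(["--ai", "brief"])
```

The `--help` epilog points agents at `--ai`, and the schema itself needs no API key or network call — matching the discovery flow described above.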
How it works
Under the hood, ai-vision calls Google Gemini models via the @google/genai SDK:
- Image analysis: gemini-2.5-flash-lite (fast, cheap)
- Video analysis: gemini-2.5-flash (handles temporal understanding)
- Object detection: Gemini with structured JSON output + imagescript for annotation rendering
- Thinking disabled: thinkingBudget: 0 for flash models to avoid wasting tokens
Supports local files, URLs, base64 data URLs, and YouTube URLs (video).
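A caller that routes inputs by source type could use a heuristic like this Python sketch — the bucketing logic is this sketch's own guess, not how the CLI itself classifies inputs:

```python
from urllib.parse import urlparse

_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}

def classify_source(src):
    """Bucket an input into the four source types the CLI accepts:
    'data-url', 'youtube', 'url', or 'file'. Hostname/scheme checks
    are an illustrative heuristic only."""
    if src.startswith("data:"):
        return "data-url"
    parsed = urlparse(src)
    if parsed.scheme in ("http", "https"):
        return "youtube" if parsed.hostname in _YOUTUBE_HOSTS else "url"
    return "file"

print(classify_source("https://youtube.com/watch?v=xyz"))  # youtube
```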
License
MIT
