mcp-video-analyzer

v0.2.5

Published

2 months ago

MCP server for video analysis — extracts transcripts, key frames, OCR text, and metadata from video URLs. Supports Loom and direct video files.

0High
0Medium
0Low

guimatheus92

mcp model-context-protocol video video-analysis loom transcript frames ocr screen-recording claude ai

mcp-video-analyzer

Featured in awesome-mcp-servers.

MCP server for video analysis — extracts transcripts, key frames, and metadata from video URLs. Supports Loom, direct video files (.mp4, .webm), and more.

No existing video MCP combines transcripts + visual frames + metadata in one tool. This one does.

Installation

Prerequisites

Node.js 18+ — required to run the server via npx
yt-dlp (optional) — enables frame extraction via ffmpeg. Install with pip install yt-dlp
Chrome/Chromium (optional) — fallback for frame extraction if yt-dlp is unavailable

Without yt-dlp or Chrome, the server still works — you'll get transcripts, metadata, and comments, just no frames.

Claude Code (CLI)

claude mcp add video-analyzer -- npx mcp-video-analyzer@latest

Then restart Claude Code or start a new conversation.

VS Code / Cursor

Add to your MCP settings file:

VS Code: File → Preferences → Settings → search "MCP" or edit ~/.vscode/mcp.json / %APPDATA%\Code\User\mcp.json (Windows)
Cursor: Settings → MCP Servers → Add

{
  "servers": {
    "mcp-video-analyzer": {
      "type": "stdio",
      "command": "npx",
      "args": ["mcp-video-analyzer@latest"]
    }
  }
}

Then reload the window (Ctrl+Shift+P → "Developer: Reload Window").

Claude Desktop

Add to your Claude Desktop config file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "video-analyzer": {
      "command": "npx",
      "args": ["mcp-video-analyzer@latest"]
    }
  }
}

Then restart Claude Desktop.

Verify it works

Once installed, ask your AI assistant:

Analyze this video: https://www.loom.com/share/bdebdfe44b294225ac718bad241a94fe

If the server is connected, it will automatically call the analyze_video tool.

Tools

`analyze_video` — Full video analysis

Extracts everything from a video URL in one call:

> Analyze this video: https://www.loom.com/share/abc123...

Returns:

Transcript with timestamps and speakers
Key frames extracted via scene-change detection (automatically deduplicated)
OCR text extracted from frames (code, error messages, UI text visible on screen)
Annotated timeline merging transcript + frames + OCR into a unified "what happened when" view
Metadata (title, duration, platform)
Comments from viewers
Chapters and AI summary (when available)

The AI will automatically call this tool when it sees a video URL — no need to ask.

Options:

detail — analysis depth: "brief" (metadata + truncated transcript, no frames), "standard" (default), "detailed" (dense sampling, more frames)
fields — array of specific fields to return, e.g. ["metadata", "transcript"]. Available: metadata, transcript, frames, comments, chapters, ocrResults, timeline, aiSummary
maxFrames (1-60, default depends on detail level) — cap on extracted frames
threshold (0.0-1.0, default 0.1) — scene-change sensitivity
forceRefresh — bypass cache and re-analyze
skipFrames — skip frame extraction for transcript-only analysis

`get_transcript` — Transcript only

> Get the transcript from this video

Quick transcript extraction. Falls back to Whisper transcription when no native transcript is available.

`get_metadata` — Metadata only

> What's this video about?

Returns metadata, comments, chapters, and AI summary without downloading the video.

`get_frames` — Frames only

> Extract frames from this video with dense sampling

Two modes:

Scene-change detection (default) — captures visual transitions
Dense sampling (dense: true) — 1 frame/sec for full coverage

`analyze_moment` — Deep-dive on a time range

> Analyze what happens between 1:30 and 2:00 in this video

Combines burst frame extraction + filtered transcript + OCR + annotated timeline for a focused segment. Use when you need to understand exactly what happens at a specific moment.

`get_frame_at` — Single frame at a timestamp

> Show me the frame at 1:23 in this video

The AI reads the transcript, spots a critical moment, and requests the exact frame to see what's on screen.

`get_frame_burst` — N frames in a time range

> Show me 10 frames between 0:15 and 0:17 of this video

For motion, vibration, animations, or fast scrolling — burst mode captures N frames in a narrow window so the AI can see frame-by-frame changes.

Detail Levels

| Level | Frames | Transcript | OCR | Timeline | Use case | |-------|--------|-----------|-----|----------|----------| | brief | None | First 10 entries | No | No | Quick check — what's this video about? | | standard | Up to 20 (scene-change) | Full | Yes | Yes | Default — full analysis | | detailed | Up to 60 (1fps dense) | Full | Yes | Yes | Deep analysis — every second captured |

Caching

Results are cached in memory for 10 minutes. Subsequent calls with the same URL and options return instantly. Use forceRefresh: true to bypass the cache.

Supported Platforms

| Platform | Transcript | Metadata | Comments | Frames | Auth | |----------|:----------:|:--------:|:--------:|:------:|:----:| | Loom | Yes | Yes | Yes | Yes | None | | Direct URL (.mp4, .webm) | No | Duration only | No | Yes | None |

Frame Extraction Strategies

Frame extraction uses a two-strategy fallback chain — no single dependency is required:

| Strategy | How it works | Speed | Requirements | |----------|-------------|-------|-------------| | yt-dlp + ffmpeg (primary) | Downloads video, extracts frames via scene detection | Fast, precise | yt-dlp (pip install yt-dlp) | | Browser (fallback) | Opens video in headless Chrome, seeks to timestamps, takes screenshots | Slower, no download needed | Chrome or Chromium installed |

The fallback is automatic — if yt-dlp is not available, the server tries browser-based extraction via puppeteer-core. If neither is available, analysis still returns transcript + metadata + comments, just no frames.

Post-Processing Pipeline

After frame extraction, the pipeline automatically applies:

| Step | What it does | Why | |------|-------------|-----| | Frame deduplication | Removes near-identical consecutive frames using perceptual hashing (dHash + Hamming distance) | Screencasts often have long static moments — dedup removes redundant frames, saving tokens | | OCR | Extracts text visible on screen from each frame (via tesseract.js) | Captures code, error messages, terminal output, UI text that the transcript doesn't cover | | Annotated timeline | Merges transcript timestamps + frame timestamps + OCR text into a single chronological view | Gives the AI a unified "what was said, what changed visually, and what text appeared" at each moment |

The OCR step requires tesseract.js (included as a dependency). If it fails to load, analysis continues without OCR — no frames or transcript are lost.

Complementary Tools

Chrome DevTools MCP

For live web debugging alongside video analysis, pair this server with the Chrome DevTools MCP:

claude mcp add chrome-devtools npx @anthropic-ai/mcp-devtools@latest

When to use each:

| Scenario | Tool | |----------|------| | Bug report recorded as a Loom video | mcp-video-analyzer — extract transcript, frames, and error text from the recording | | Live debugging a web page | Chrome DevTools MCP — inspect DOM, console, network, take screenshots | | Video shows UI issue, need to reproduce it | Use both: analyze the video first, then open the page in Chrome DevTools to reproduce |

The two MCPs complement each other: video analyzer understands recorded content, DevTools interacts with live pages.

Example Output

The examples/loom-demo/ folder contains real outputs from analyzing a public Loom video (Boost In-App Demo Video, 2:55).

| File | What it shows | |------|--------------| | metadata.json | Title, duration, platform | | transcript.json | 42 timestamped entries with speaker IDs | | timeline.json | Unified chronological view (transcript + frames merged) | | moment-transcript-0m30s-0m45s.json | Filtered transcript for analyze_moment (0:30–0:45) | | full-analysis.json | Complete analyze_video output |

Frame images (19 total in examples/loom-demo/frames/):

scene_*.jpg — scene-change detection (key visual transitions)
dense_*.jpg — 1fps dense sampling (every 10th frame saved as sample)
burst_*.jpg — burst extraction for moment analysis (0:30–0:45)

Regenerate after changes: npx tsx examples/generate.ts — requires yt-dlp + network access.

Development

# Install dependencies
npm install

# Run all checks (format, lint, typecheck, knip, tests)
npm run check

# Build
npm run build

# Run E2E tests (requires network)
npm run test:e2e

# Open MCP Inspector for manual testing
npm run inspect

Architecture

src/
├── index.ts                    # Entry point (shebang + stdio)
├── server.ts                   # FastMCP server + tool registration
├── tools/                      # MCP tool definitions (7 tools)
│   ├── analyze-video.ts        # Full analysis with detail levels + caching
│   ├── analyze-moment.ts       # Deep-dive on a time range
│   ├── get-transcript.ts       # Transcript-only with Whisper fallback
│   ├── get-metadata.ts         # Metadata + comments + chapters
│   ├── get-frames.ts           # Frames-only (scene-change or dense)
│   ├── get-frame-at.ts         # Single frame at timestamp
│   └── get-frame-burst.ts      # N frames in a time range
├── adapters/                   # Platform-specific logic
│   ├── adapter.interface.ts    # IVideoAdapter interface + registry
│   ├── loom.adapter.ts         # Loom: authless GraphQL
│   └── direct.adapter.ts       # Direct URL: any mp4/webm link
├── processors/                 # Shared processing
│   ├── frame-extractor.ts      # ffmpeg scene detection + dense + burst extraction
│   ├── browser-frame-extractor.ts # Headless Chrome fallback for frames
│   ├── audio-transcriber.ts    # Whisper fallback (HF transformers → CLI → OpenAI)
│   ├── image-optimizer.ts      # sharp resize/compress
│   ├── frame-dedup.ts          # Perceptual dedup (dHash + Hamming distance)
│   ├── frame-ocr.ts            # OCR text extraction (tesseract.js)
│   └── annotated-timeline.ts   # Unified timeline (transcript + frames + OCR)
├── config/
│   └── detail-levels.ts        # brief / standard / detailed config
├── utils/
│   ├── cache.ts                # In-memory TTL cache with LRU eviction
│   ├── field-filter.ts         # Selective field filtering for responses
│   ├── url-detector.ts         # Platform detection from URL
│   ├── vtt-parser.ts           # WebVTT → transcript entries
│   └── temp-files.ts           # Temp directory management
└── types.ts                    # Shared TypeScript interfaces

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

mcp-video-analyzer

Installation

Prerequisites

Claude Code (CLI)

VS Code / Cursor

Claude Desktop

Verify it works

Tools

analyze_video — Full video analysis

get_transcript — Transcript only

get_metadata — Metadata only

get_frames — Frames only

analyze_moment — Deep-dive on a time range

get_frame_at — Single frame at a timestamp

get_frame_burst — N frames in a time range

Detail Levels

Caching

Supported Platforms

Frame Extraction Strategies

Post-Processing Pipeline

Complementary Tools

Chrome DevTools MCP

Example Output

Development

Architecture

License

`analyze_video` — Full video analysis

`get_transcript` — Transcript only

`get_metadata` — Metadata only

`get_frames` — Frames only

`analyze_moment` — Deep-dive on a time range

`get_frame_at` — Single frame at a timestamp

`get_frame_burst` — N frames in a time range