framecap
v0.2.1
YouTube videos → structured markdown with visual frame captures
Takes a YouTube URL and outputs a clean markdown document with a structured transcript (chapter headers, speaker labels, paragraph breaks) and frame captures at key moments — embedded as images in the markdown.
Why
YouTube videos contain valuable knowledge, but it's trapped in a format you can't search, reference, or link to. Transcripts alone miss the visual context. framecap gives you both: a readable document with visual bookmarks.
Install
# Prerequisites
brew install yt-dlp ffmpeg
# Install framecap
npm install -g framecap
Usage
# Single video
framecap https://youtube.com/watch?v=abc123
# Custom output directory
framecap https://youtube.com/watch?v=abc -o ./notes/
# Capture frames at specific timestamps
framecap https://youtube.com/watch?v=abc --capture-at 1:30,5:00,12:45
# Hint speaker names for interviews
framecap https://youtube.com/watch?v=abc --speakers "Lex Fridman,Andrej Karpathy"
# Skip LLM structuring (free mode — raw transcript + frames only)
framecap https://youtube.com/watch?v=abc --no-structure
# Obsidian-compatible output (wikilink image syntax)
framecap https://youtube.com/watch?v=abc --format obsidian
# Preview plan and cost (fetches metadata + transcript, skips video download and LLM)
framecap https://youtube.com/watch?v=abc --dry-run
Output
./how-karpathy-builds-software.md
./frames/how-karpathy-builds-software/
├── frame-0001-00m00s.jpg
├── frame-0002-01m45s.jpg
├── frame-0003-05m30s.jpg
└── ...
The markdown file includes:
- YAML frontmatter — title, channel, URL, duration, upload date, auto-generated tags
- Structured transcript — organized by chapters (from video description), with speaker labels and natural paragraph breaks
- Embedded frames — images at key moments, with timestamps and captions
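
The frame filenames in the tree above appear to follow a `frame-<index>-<MM>m<SS>s.jpg` pattern. The sketch below reproduces that naming and shows the two image embed styles selected by `--format`. It assumes a zero-padded 4-digit index and a minutes field that keeps counting past 59 (the examples do not show a video longer than an hour). `frameFileName` and `embedFrame` are hypothetical helpers for illustration, not part of framecap's API.

```typescript
// Hypothetical helpers: naming rules are inferred from the example tree above.
type OutputFormat = "markdown" | "obsidian";

// Left-pad a non-negative integer with zeros to the given width.
function pad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) {
    text = "0" + text;
  }
  return text;
}

// Build a name such as frame-0002-01m45s.jpg from a 1-based index and an offset in seconds.
function frameFileName(index: number, seconds: number): string {
  const total = Math.floor(seconds);
  const minutes = Math.floor(total / 60);
  const rest = total % 60;
  return `frame-${pad(index, 4)}-${pad(minutes, 2)}m${pad(rest, 2)}s.jpg`;
}

// Standard markdown image syntax versus an Obsidian wikilink embed.
function embedFrame(path: string, caption: string, format: OutputFormat): string {
  return format === "obsidian" ? `![[${path}]]` : `![${caption}](${path})`;
}

console.log(frameFileName(2, 105)); // frame-0002-01m45s.jpg
console.log(embedFrame("frames/demo/frame-0002-01m45s.jpg", "Demo", "obsidian"));
```

Note that the wikilink form has no caption slot, so in this sketch the caption is dropped for Obsidian output; how framecap itself places captions is not specified here.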
Options
| Flag | Default | Description |
|---|---|---|
| -o, --output | ./ | Output directory |
| --interval | auto | Force fixed-interval frame capture (seconds) |
| --max-frames | 50 | Maximum frames to extract |
| --dedup-threshold | 0.85 | Frame similarity filter (0.0-1.0) |
| --no-dedup | off | Keep all frames |
| --format | markdown | markdown or obsidian (wikilinks) |
| --capture-at | — | Capture at specific timestamps (e.g. 1:30,5:00) |
| --speakers | auto | Comma-separated speaker names |
| --no-structure | off | Skip LLM pass (free mode) |
| --no-frames | off | Transcript only |
| --language | en | Transcript language |
| --keep-video | off | Retain downloaded video file |
| --cookies-from-browser | — | Use cookies from browser (chrome, firefox, edge) |
| --model | claude-sonnet-latest | LLM model for structuring |
| --dry-run | off | Preview plan and cost (skips video download and LLM) |
| --quiet | off | Suppress all output except the final path (for piping) |
| -v, --verbose | off | Detailed logging |
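
To make the `--capture-at` value format concrete, here is a minimal sketch of how a comma-separated timestamp list such as `1:30,5:00,12:45` could be turned into second offsets. The function names are hypothetical, and two behaviours are assumptions rather than documented features: accepting `H:MM:SS` in addition to `M:SS`, and sorting the result.

```typescript
// Hypothetical parser for a --capture-at value; not framecap's actual code.

// Convert "M:SS" (or, by assumption, "H:MM:SS") to a number of seconds.
function parseTimestamp(text: string): number {
  const parts = text.trim().split(":");
  if (parts.length < 2 || parts.length > 3) {
    throw new Error(`Expected M:SS or H:MM:SS, got "${text}"`);
  }
  let seconds = 0;
  for (const part of parts) {
    if (!/^\d+$/.test(part)) {
      throw new Error(`Invalid timestamp component "${part}" in "${text}"`);
    }
    // Each further component is worth 1/60 of the previous one.
    seconds = seconds * 60 + parseInt(part, 10);
  }
  return seconds;
}

// Split the comma-separated list and return offsets in ascending order.
function parseCaptureAt(value: string): number[] {
  return value
    .split(",")
    .map((item) => parseTimestamp(item))
    .sort((a, b) => a - b);
}

console.log(parseCaptureAt("1:30,5:00,12:45")); // [ 90, 300, 765 ]
```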
Requirements
- Node.js 18+
- yt-dlp — video/transcript download
- ffmpeg — frame extraction
- Anthropic API key (optional; needed only for transcript structuring, set via ANTHROPIC_API_KEY)
How It Works
- Fetch metadata — yt-dlp gets title, channel, duration, chapters, description
- Extract transcript — yt-dlp pulls auto/manual captions, parses VTT
- Capture frames — ffmpeg extracts frames at intervals or chapter boundaries
- Deduplicate frames — removes visually similar frames (configurable threshold)
- Structure transcript (optional) — LLM adds chapter headers, speaker labels, paragraph breaks. All words stay verbatim — only whitespace and labels are added.
- Assemble markdown — combines metadata, structured transcript, and frame references into the output file
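
The "all words stay verbatim" guarantee in the structuring step can be checked mechanically. The sketch below tests that every word of the raw transcript appears, in order, in the structured output, so the structured text only adds tokens (headers, speaker labels) and whitespace. This is an illustration of the invariant under that assumption; `preservesWords` is a hypothetical name, and whether framecap runs such a check internally is not stated here.

```typescript
// Hypothetical check of the verbatim invariant; not framecap's own code.

// Split text into whitespace-separated words.
function splitWords(text: string): string[] {
  return text.split(/\s+/).filter((word) => word.length > 0);
}

// True when the raw words form an in-order subsequence of the structured words.
function preservesWords(raw: string, structured: string): boolean {
  const rawWords = splitWords(raw);
  const outWords = splitWords(structured);
  let matched = 0;
  for (const word of outWords) {
    if (matched < rawWords.length && word === rawWords[matched]) {
      matched++;
    }
  }
  return matched === rawWords.length;
}

const raw = "welcome back today we talk about tokenizers";
const structured =
  "## Intro\n\n**Speaker A:** welcome back\n\ntoday we talk about tokenizers";

console.log(preservesWords(raw, structured)); // true
console.log(preservesWords(raw, "welcome back today we discuss tokenizers")); // false
```

The check only confirms that no transcript word was dropped, changed, or reordered; it does not verify that the added tokens are limited to headers and labels.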
Cost
The LLM structuring pass is the only part that costs money (requires Anthropic API key):
| Video Length | Approximate Cost (Sonnet) |
|---|---|
| 15 minutes | ~$0.05–0.08 |
| 1 hour | ~$0.20–0.35 |
| 2 hours | ~$0.40–0.70 |
Use --no-structure for completely free operation (raw transcript + frames).
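
For a quick ballpark at other video lengths, the figures above can be scaled linearly from the 1-hour row. This is only a rough sketch: the per-hour range is taken from the table, but linear scaling is an assumption and `estimateCostUsd` is a hypothetical helper, not how framecap computes its own estimate.

```typescript
// Rough estimate only: scales the 1-hour row of the cost table linearly.
// Linear scaling is an assumption; real cost depends on transcript length and model pricing.
const COST_PER_HOUR_USD = { low: 0.2, high: 0.35 };

function estimateCostUsd(durationMinutes: number): { low: number; high: number } {
  const hours = durationMinutes / 60;
  return {
    low: COST_PER_HOUR_USD.low * hours,
    high: COST_PER_HOUR_USD.high * hours,
  };
}

const twoHours = estimateCostUsd(120);
console.log(`~$${twoHours.low.toFixed(2)}–${twoHours.high.toFixed(2)}`); // ~$0.40–0.70
```

For a 15-minute video this gives roughly $0.05 to $0.09, slightly above the table's upper bound, so treat the result as a ballpark. The `--dry-run` flag remains the documented way to preview the cost for a specific video.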
License
MIT
