m2md
Turn images into structured, searchable markdown with AI vision.
I have 20 years of visual references (design inspiration, screenshots, diagrams, mood boards) scattered across folders. I use Obsidian for note-taking and wanted my vault to work for media too, not just text. The problem: images are invisible to search, to AI context windows, to grep. You can't find "that minimalist Japanese packaging with the kraft paper texture" in a folder of PNGs.
m2md fixes that. One command turns any image into a structured .md sidecar with AI-generated descriptions, extracted text, typed metadata, and tags. Ready for full-text search, vector search, and LLM retrieval.
npm install -g media2md
Quick start
# Describe a screenshot → writes screenshot.md next to it
m2md screenshot.png
# Batch an entire directory
m2md ./assets/
# Print to stdout (pipe to clipboard, another tool, etc.)
m2md screenshot.png --stdout | pbcopy
# Describe an image from a URL
m2md https://example.com/photo.png
Features
- AI-powered image descriptions via Claude or OpenAI vision
- Text extraction (OCR) from screenshots, documents, diagrams
- YAML frontmatter with 25+ structured fields (type, style, mood, era, typography, palette, references, etc.)
- Sidecar .md files next to images — makes directories greppable
- Provider tiers — --tier fast for cheap/quick, --tier quality for best results
- URL support — pass image URLs directly, or screenshot web pages via Playwright
- Watch mode — auto-process new/changed images in a directory
- Custom instructions (--prompt) and focus directives (--note)
- 4 built-in templates (default, minimal, alt-text, detailed) plus custom templates
- Content-hash caching — skip unchanged files automatically
- Cost estimation before processing (--estimate, --dry-run)
- Batch processing with concurrency control
- MCP server for AI agent integration (Claude Desktop, etc.)
- Programmatic API for use as a library
How the analysis works
m2md doesn't just ask an AI to "describe this image." It sends a detailed structured prompt that produces consistent, searchable metadata across every image you process. Here's what that means in practice.
Structured schema (25+ fields)
Every image produces the same set of fields, making your entire collection queryable:
| Field group | Fields | What they capture |
|-------------|--------|-------------------|
| Classification | type, category, medium | Visual form of the image (photo, illustration, painting, sketch, diagram, screenshot, render-3d, etc.) and what discipline it belongs to (ui-design, packaging, photography). Type describes how the image looks, not the subject — a photo of a logo is type "photo", not "logo". |
| Aesthetics | style, mood, palette, composition | Visual treatment (minimalist, brutalist, mid-century), emotional register (calm, dramatic, serene), material-driven color names (kraft-brown, slate-blue, bone-white), and layout structure (rule-of-thirds, grid, layered) |
| Archival context | era, artifact, typography, script, cultural_influence | Time period evoked (1970s, contemporary), designed object depicted (poster, packaging-box, website), typeface details (futura, sans-serif, letterpress), writing systems (latin, kanji, hangul), and aesthetic lineage (scandinavian-functionalism, japanese-wabi-sabi) |
| Discovery | tags, references, search_phrases, dimensions | Searchable keywords, design movement references (Bauhaus, Dieter Rams), natural language search phrases, and analytical axes explaining why the image is reference-worthy |
| Content | subject, description, extracted_text, visual_elements, use_case, color_hex | One-line summary, 4-sentence structured description, OCR text, literal visible objects, designer use cases, and sampled hex colors |
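For example, a sidecar's frontmatter might look like this (abridged and illustrative, not actual m2md output; the real schema emits the full field set above):

---
type: photo
category: packaging-design
medium: product-photography
style: minimalist, japanese
mood: calm, understated
palette: kraft-brown, bone-white, ink-black
era: contemporary
artifact: packaging-box
tags: kraft-paper, letterpress, embossing, monochrome
references: japanese-wabi-sabi
subject: kraft paper box with letterpress logo
---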
Quality constraints
The prompt enforces specific rules so output stays useful at scale:
- Tags are capped at 6-8 and drawn from a seed vocabulary of 81 canonical terms across 6 groups (materials, techniques, finishes, effects, photography, production). Formation rules enforce consistency: always hyphenated, always singular, specific over generic, material-first compounds. The model can invent new tags following the same rules (see the sketch after this list).
- Style and mood cannot share terms — style is visual treatment, mood is emotional register. No "bold" in both.
- Palette uses evocative names only — never generic "white" or "black." Always material-driven: bone-white, chalk-white, ink-black, obsidian-black.
- Search phrases are capped at 8-10 and must be meaningfully distinct (different angles: literal, conceptual, stylistic, use-case).
- References push for 3-5 specific entries — art-historical context, named designers, cultural movements. "none" only for purely abstract content.
- Dimensions must be non-overlapping — each axis illuminates a genuinely distinct analytical lens.
- Subject lines attribute colors correctly to objects — "cognac leather sofa, mustard side table" not the reverse. When the 80-character limit forces compression, the model drops the color rather than misattributing it.
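A minimal sketch of the tag formation rules as a validator (hypothetical TypeScript, not m2md's actual code):

function isWellFormedTag(tag: string): boolean {
  // lowercase words joined by single hyphens, e.g. "kraft-paper"
  if (!/^[a-z]+(-[a-z]+)*$/.test(tag)) return false;
  // crude singularity check: reject plain plurals like "textures",
  // while allowing double-s words like "glass"
  const last = tag.split("-").pop()!;
  return !(last.endsWith("s") && !last.endsWith("ss"));
}

isWellFormedTag("kraft-paper"); // true
isWellFormedTag("Textures");    // false: uppercase and plural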
Controlled vocabulary
The schema uses a three-tier vocabulary system:
Tier 1 — Strict enums (type, category): closed vocabulary, auto-corrected in code. Unknown values are replaced with the nearest match or "other" with a warning. Categories cover 29 creative disciplines from branding and ui-design to ceramics, textile-design, and street-art.
Tier 2 — Normalized tags with seed vocabulary (tags): 81 canonical terms the model prefers, plus formation rules that prevent fragmentation. The model can extend the vocabulary following the same rules (hyphenated, singular, material-first). Suggested fields (style, mood, medium, composition) work similarly — 100+ suggested terms, extensible when needed.
Tier 3 — Freeform (dimensions, search_phrases, description): unconstrained fields that capture what controlled vocabularies can't.
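As an illustration, Tier 1 auto-correction could be as simple as an edit-distance match with an "other" fallback (hypothetical sketch; m2md's real matching logic may differ):

function editDistance(a: string, b: string): number {
  // classic Levenshtein dynamic program
  const dp: number[][] = Array.from({ length: a.length + 1 }, () => new Array<number>(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) dp[i][0] = i;
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
  return dp[a.length][b.length];
}

function correctEnum(value: string, allowed: string[]): string {
  const v = value.toLowerCase().trim();
  if (allowed.includes(v)) return v;
  const [best, dist] = allowed
    .map((a) => [a, editDistance(v, a)] as const)
    .sort((x, y) => x[1] - y[1])[0];
  if (dist <= 2) return best; // close enough: auto-correct silently
  console.warn(`unknown value "${value}", replaced with "other"`);
  return "other";
}

correctEnum("ui design", ["ui-design", "packaging", "photography"]); // "ui-design"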
You can extend any vocabulary per-project via config:
{
"taxonomy": {
"styles": ["sneaker-culture", "streetwear"],
"categories": ["sneaker-design"],
"tags": ["sneaker-silhouette", "midsole-tooling"]
}
}
Usage
Single file
m2md screenshot.png # writes screenshot.md next to it
m2md screenshot.png -o ./docs/ # writes to docs/screenshot.md
m2md screenshot.png --stdout # print to stdout
Batch
m2md ./assets/ # .md sidecar next to each image
m2md ./assets/ -r # recursive
m2md ./assets/ -r -o ./docs/ # recursive, custom output dir
m2md ./assets/*.png # glob
Tiers
Quick presets instead of picking provider + model:
m2md screenshot.png --tier fast # gpt-4o-mini — quick + cheap
m2md screenshot.png --tier quality # claude-sonnet — best results (default behavior)
| Tier | Provider | Model | Best for |
|------|----------|-------|----------|
| fast | OpenAI | gpt-4o-mini | Quick passes, large batches, drafts |
| quality | Anthropic | claude-sonnet | Final output, detailed descriptions |
Explicit --provider / --model flags always override --tier.
Providers
By default m2md uses Anthropic's Claude. Switch to OpenAI with --provider:
m2md screenshot.png # Anthropic Claude (default)
m2md screenshot.png --provider openai # OpenAI GPT-4o
m2md screenshot.png --provider openai -m gpt-4o-mini # specific model
URLs
Pass image URLs directly — m2md downloads and processes them:
m2md https://example.com/screenshot.png # image URL → download + describe
m2md https://example.com/landing-page # non-image URL → screenshot via Playwright
m2md screenshot.png https://example.com/photo.jpg # mix local files + URLs
For non-image URLs, m2md takes a full-page screenshot using Playwright (optional dependency):
npm install playwright
npx playwright install chromium # downloads the Chromium browser binary
Watch mode
Auto-process new and changed images in a directory:
m2md watch ./assets/ # watch for new/changed images
m2md watch ./assets/ --tier fast # watch with fast tier
m2md watch ./assets/ -o ./docs/ # custom output directory
m2md watch ./assets/ -p "List all product names" # watch with custom instructions
On startup, existing images without a .md sidecar are processed. Then m2md watches for new/changed files and processes them automatically. Press Ctrl+C to stop.
Compare mode
Compare two or more images in a single API call:
m2md compare before.png after.png # compare two images
m2md compare v1.png v2.png v3.png # compare multiple versions
m2md compare a.png b.png -n "focus on typography" # with focus directive
m2md compare a.png b.png -o comparison.md # write to file
Outputs structured markdown with Summary, Similarities, Differences, and Verdict sections. Images are labeled A, B, C, etc.
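The result looks roughly like this (section headings as described above; body content schematic):

# Comparison

## Summary
One-paragraph overview of what images A and B share and where they diverge.

## Similarities
- ...

## Differences
- A: ... / B: ...

## Verdict
Which image works better for the stated focus, and why.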
Custom instructions (--prompt)
Tell the model what to do differently. Instructions are appended to the built-in system prompt, so you get the structured output format plus your custom behavior:
m2md screenshot.png -p "List all visible product names and prices"
m2md ./assets/ -p "Identify every UI component and its state"
m2md photo.jpg -p "Describe from a security auditor's perspective"
Focus directives (--note)
A lightweight nudge that's additive — it tells the model to pay extra attention to something without changing the analysis instructions. Combine with --prompt or use on its own:
m2md refs/*.png -n "watercolor technique, color palette, brushwork"
m2md hero.png -n "dark mode, spacing tokens"
m2md ./illos/ -p "art critique" -n "line weight and hatching"
Templates
m2md screenshot.png # default (frontmatter + description + text)
m2md screenshot.png --template minimal # description + source link
m2md screenshot.png --template alt-text # just a description string
m2md screenshot.png --template detailed # full metadata table + image embed
m2md screenshot.png --template ./my.md # custom template file
m2md screenshot.png --no-frontmatter # strip YAML frontmatter from output
Template variables available in custom templates:
| Variable | Description |
|----------|-------------|
| {{type}} | Visual form (photo, illustration, painting, sketch, diagram, chart, screenshot, render-3d, etc.) |
| {{category}} | Content category (ui-design, photography, packaging-design, etc.) |
| {{style}} | Visual style (minimalist, brutalist, mid-century, bauhaus, wabi-sabi, etc.) |
| {{mood}} | Mood/tone (calm, energetic, serene, dramatic, etc.) |
| {{medium}} | Medium (screen-capture, product-photography, letterpress, risograph, etc.) |
| {{composition}} | Composition (centered, grid, rule-of-thirds, golden-ratio, etc.) |
| {{palette}} | Material-driven color names (kraft-brown, slate-blue, bone-white, etc.) |
| {{subject}} | One-line summary (max 80 chars) |
| {{description}} | 4-sentence structured description |
| {{extractedText}} | All visible text, grouped by context |
| {{colors}} | Dominant colors (same as palette) |
| {{tags}} | 6-8 searchable keywords (materials, techniques, proper nouns) |
| {{visualElements}} | Literal visible objects (5-15 items) |
| {{references}} | Design movements, named styles, artist/designer references |
| {{useCase}} | Designer reference use cases |
| {{colorHex}} / {{colorHexYaml}} | 3-5 hex color values sampled from the image |
| {{era}} | Time period the design evokes (mid-century, 1970s, contemporary, etc.) |
| {{artifact}} | Designed object type (poster, packaging-box, website, album-cover, etc.) |
| {{typography}} | Typeface names, classifications, techniques |
| {{script}} | Writing systems / languages visible (latin, kanji, hangul, etc.) |
| {{culturalInfluence}} | Aesthetic lineages (japanese-wabi-sabi, scandinavian-functionalism, etc.) |
| {{searchPhrases}} / {{searchPhrasesYaml}} | 8-10 natural language search phrases |
| {{dimensions}} / {{dimensionsYaml}} | 2-5 reference-worthiness axes |
| {{filename}} | Original filename |
| {{basename}} | Filename without extension |
| {{format}} | File format (PNG, JPEG, WebP, GIF) |
| {{dimensionsPx}} | Width x Height |
| {{width}} / {{height}} | Image dimensions |
| {{sizeHuman}} / {{sizeBytes}} | File size |
| {{sha256}} | Content hash |
| {{processedDate}} / {{datetime}} | Processing timestamp |
| {{model}} | AI model used |
| {{note}} | Focus directive |
| {{sourcePath}} | Relative path to source file |
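For instance, a compact custom template could combine a handful of these (variable selection illustrative):

# {{subject}}

{{description}}

Tags: {{tags}}
Palette: {{palette}}

{{extractedText}}

Source: {{sourcePath}} ({{dimensionsPx}}, {{sizeHuman}})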
Caching
Results are cached by content hash. Re-running on a directory only processes new or changed files.
m2md ./assets/ # second run skips unchanged files
m2md ./assets/ --no-cache # force re-processing
m2md cache status # show cache stats
m2md cache clear # clear all cached results
Cache location (in order of precedence):
1. M2MD_CACHE_DIR environment variable
2. $XDG_CACHE_HOME/m2md (if XDG_CACHE_HOME is set)
3. ~/.cache/m2md (default)
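Conceptually, the cache key is just the image's content hash, so the lookup is independent of filename or location. A sketch (hypothetical layout, not m2md's actual cache format):

import { createHash } from "node:crypto";
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hash the image bytes: renaming or moving a file keeps its cache entry,
// while changing a single pixel invalidates it.
function cacheKey(imagePath: string): string {
  return createHash("sha256").update(readFileSync(imagePath)).digest("hex");
}

function isCached(cacheDir: string, imagePath: string): boolean {
  return existsSync(join(cacheDir, `${cacheKey(imagePath)}.json`));
}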
Cost estimation
Preview what a batch will cost before calling the API:
m2md ./assets/ --estimate # show token/cost estimate
m2md ./assets/ --dry-run # list files with cache status + estimates
Both work without an API key so you can preview before committing.
Custom filenames
Control the output .md filename with --name:
m2md screenshot.png --name "{date}-{filename}" # 2026-02-18-screenshot.md
m2md ./assets/ --name "{type}-{filename}" # screenshot-hero.md, photo-team.md
m2md shot.png --name "{date}-{subject}" # 2026-02-18-dashboard-with-charts.md
| Placeholder | Description |
|-------------|-------------|
| {filename} | Original filename without extension |
| {date} | Processing date (YYYY-MM-DD) |
| {type} | AI-detected image type (screenshot, photo, diagram, etc.) |
| {subject} | AI-generated subject line, slugified |
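Expansion is straightforward string substitution; a hypothetical sketch mirroring the table above:

function expandName(
  pattern: string,
  ctx: { filename: string; type: string; subject: string },
): string {
  const date = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
  const slug = ctx.subject
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-|-$/g, ""); // slugified subject line
  return (
    pattern
      .replace("{filename}", ctx.filename)
      .replace("{date}", date)
      .replace("{type}", ctx.type)
      .replace("{subject}", slug) + ".md"
  );
}

expandName("{date}-{subject}", {
  filename: "shot",
  type: "screenshot",
  subject: "Dashboard with charts",
}); // e.g. "2026-02-18-dashboard-with-charts.md"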
Other flags
m2md screenshot.png -m claude-sonnet-4-5-20250929 # specific model
m2md ./assets/ --concurrency 10 # parallel API calls (default: 5)
m2md screenshot.png --no-frontmatter # strip YAML frontmatter
m2md screenshot.png -v # verbose output (tokens, cost, timing)
Configuration
Drop a config file in any directory to set defaults for that project. This way you don't have to pass the same flags every time — just run m2md ./assets/ and it picks up your settings.
Create an m2md.config.json (or .m2mdrc, .m2mdrc.json, .m2mdrc.yaml) in your project root:
{
"tier": "quality",
"note": "focus on typography, color palette, layout grid",
"output": "./docs",
"recursive": true
}
All options are optional — only set what you want to override:
| Key | What it does | Default |
|-----|-------------|---------|
| provider | AI provider (anthropic, openai) | anthropic |
| model | AI model to use | provider default |
| tier | Preset tier (fast, quality) | none |
| prompt | Custom instructions for the model | none |
| note | Focus directive (additive nudge) | none |
| template | Output template | default |
| output | Output directory for .md files | next to image |
| name | Output filename pattern ({filename}, {date}, {type}, {subject}) | none |
| noFrontmatter | Strip YAML frontmatter from output | false |
| recursive | Scan directories recursively | false |
| cache | Cache results by content hash | true |
| concurrency | Max parallel API calls | 5 |
Precedence: CLI flags > --tier > config file > defaults.
You can also put config under an "m2md" key in package.json.
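For example:

{
  "name": "my-project",
  "m2md": {
    "tier": "fast",
    "recursive": true
  }
}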
Setup
# 1. Install
npm install -g media2md
# 2. Set your API key (one or both)
export ANTHROPIC_API_KEY="sk-ant-..." # for Anthropic / --tier quality (default)
export OPENAI_API_KEY="sk-..." # for OpenAI / --tier fast
# 3. Verify
m2md setup
MCP server
m2md includes an MCP server (m2md-mcp) that exposes a describe_image tool over stdio. This lets AI agents — like Claude Desktop or any MCP client — analyze images directly.
The server auto-detects API keys from your shell profile (~/.zshrc, ~/.bashrc, ~/.zprofile, ~/.bash_profile, ~/.profile), so you typically don't need to pass them explicitly. This is especially useful for GUI apps like Claude Desktop that don't inherit shell environment variables.
Claude Desktop / Claude Code
Add to your MCP config (claude_desktop_config.json or .claude/settings.json):
{
"mcpServers": {
"m2md": {
"command": "m2md-mcp"
}
}
}
If API key auto-detection doesn't work (e.g. keys are in a secrets manager), pass them explicitly:
{
"mcpServers": {
"m2md": {
"command": "m2md-mcp",
"env": {
"ANTHROPIC_API_KEY": "sk-ant-..."
}
}
}
}
Tool: describe_image
| Parameter | Required | Description |
|-----------|----------|-------------|
| filePath | Yes | Absolute path to the image file |
| provider | No | anthropic or openai |
| model | No | AI model to use |
| prompt | No | Custom instructions for the model |
| note | No | Focus directive |
| template | No | default, minimal, alt-text, detailed |
Returns the rendered markdown as text content. Shares the same cache as the CLI.
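Over stdio this is a standard MCP tools/call request; for example (path and arguments illustrative):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "describe_image",
    "arguments": {
      "filePath": "/absolute/path/to/screenshot.png",
      "template": "alt-text"
    }
  }
}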
Supported formats
PNG, JPEG, WebP, GIF
Size limits per image:
| Provider | Max size |
|----------|----------|
| Anthropic | 5 MB |
| OpenAI | 20 MB |
If both API keys are set, m2md automatically routes oversized files to the other provider. Otherwise, use --provider openai for larger files or resize the image first.
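The routing decision is simple in concept; a hypothetical sketch using the limits from the table above:

import { statSync } from "node:fs";

const MAX_BYTES = { anthropic: 5 * 1024 * 1024, openai: 20 * 1024 * 1024 };

// Prefer the requested provider, but fall back to the other one when the
// file exceeds its limit and both API keys are available.
function pickProvider(
  imagePath: string,
  preferred: "anthropic" | "openai",
  haveBothKeys: boolean,
): "anthropic" | "openai" {
  const size = statSync(imagePath).size;
  if (size <= MAX_BYTES[preferred]) return preferred;
  const other = preferred === "anthropic" ? "openai" : "anthropic";
  if (haveBothKeys && size <= MAX_BYTES[other]) return other;
  throw new Error(`${imagePath} exceeds the ${preferred} size limit; resize it first`);
}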
Programmatic API
import { processFile, AnthropicProvider } from "media2md";
const result = await processFile("screenshot.png", {
provider: new AnthropicProvider(),
});
result.markdown; // rendered markdown string
result.description; // 4-sentence structured description
result.extractedText; // extracted text
result.type; // "screenshot", "photo", "diagram", etc.
result.category; // "ui-design", "photography", etc.
result.style; // "minimalist, brutalist"
result.mood; // "calm, warm"
result.tags; // comma-separated keywords
result.palette; // material-driven color names
result.era; // "mid-century, contemporary"
result.artifact; // "poster", "website", etc.
result.typography; // "futura, sans-serif"
result.script; // "latin, english"
result.culturalInfluence; // "scandinavian-functionalism"
result.references; // "Bauhaus, Dieter Rams"
result.searchPhrases; // newline-separated search phrases
result.metadata; // { width, height, format, sizeHuman, sha256, ... }
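Batching follows naturally from the same API; a sketch that writes sidecars for a directory (assumes an ESM context for top-level await; error handling omitted):

import { readdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { processFile, AnthropicProvider } from "media2md";

const provider = new AnthropicProvider();
for (const file of readdirSync("./assets")) {
  if (!/\.(png|jpe?g|webp|gif)$/i.test(file)) continue; // supported formats
  const result = await processFile(join("./assets", file), { provider });
  // write the .md sidecar next to the image, mirroring the CLI default
  writeFileSync(join("./assets", file.replace(/\.[^.]+$/, ".md")), result.markdown);
}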
License
MIT
