video-maker-mcp

v0.1.12

Published

8 hours ago

linkzhao.org

Local MCP storyboard video renderer for Codex-driven short video generation.

video-maker-mcp does not run its own LLM. Codex writes the script and storyboard, then this MCP server generates an AI/stock mixed asset set: Doubao Seedream for cover and conceptual images, plus Pexels/Pixabay for stock photos and stock video B-roll. It validates assets, generates Doubao/Volcengine narration, and renders a 9:16 MP4 with ffmpeg. The requested duration is treated as a target; final video length follows the generated narration audio so speech is not cut off or padded with silence.

Install With Codex

Paste this into Codex:

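Please install video-maker-mcp for me. Run:
npx -y video-maker-mcp@latest install --host codex

If ffmpeg is reported as missing, install ffmpeg as prompted, then rerun:
npx -y video-maker-mcp@latest doctor

If the image configuration is reported as missing, run:
npx -y video-maker-mcp@latest configure-image

If the stock media configuration is reported as missing, run:
npx -y video-maker-mcp@latest configure-stock

If the TTS configuration is reported as missing, run:
npx -y video-maker-mcp@latest configure-tts

Finally, confirm that Codex can see the video-maker MCP server.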

Manual commands:

npx -y video-maker-mcp@latest install --host codex
npx -y video-maker-mcp@latest configure-image
npx -y video-maker-mcp@latest configure-stock
npx -y video-maker-mcp@latest configure-tts
npx -y video-maker-mcp@latest doctor

The installer configures Codex with:

codex mcp add video-maker -- npx -y video-maker-mcp@latest serve

It does not silently install system packages and does not bundle ffmpeg-static.

Requirements

Node.js 20+
Codex CLI
ffmpeg on PATH
Doubao Seedream image generation credentials
Pexels or Pixabay stock media credentials
Doubao/Volcengine TTS credentials

ffmpeg install hints:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install -y ffmpeg

# Windows
winget install Gyan.FFmpeg

Asset Environment

Video asset generation is AI/stock mixed by default. Configure Seedream and at least one stock provider before creating videos:

npx -y video-maker-mcp@latest configure-image

The command prompts for the API key locally, does not require pasting the key into an LLM chat, and writes the configuration to ~/.video-maker/.env. By default it does not generate a test image, to avoid spending image-generation quota. To run a smoke test:

npx -y video-maker-mcp@latest configure-image --test

It writes:

VIDEO_MAKER_IMAGE_API_KEY=your-ark-api-key
VIDEO_MAKER_IMAGE_MODEL=doubao-seedream-5-0-260128
VIDEO_MAKER_IMAGE_ENDPOINT=https://ark.cn-beijing.volces.com/api/v3/images/generations
VIDEO_MAKER_IMAGE_SIZE=2K

The image API key can also be supplied as ARK_API_KEY or VOLCENGINE_API_KEY, but the VIDEO_MAKER_IMAGE_API_KEY name is preferred for this tool.

Stock media uses Pexels and/or Pixabay:

npx -y video-maker-mcp@latest configure-stock

# At least one is required for the default AI/stock mixed flow.
VIDEO_MAKER_PEXELS_API_KEY=your-pexels-api-key
VIDEO_MAKER_PIXABAY_API_KEY=your-pixabay-api-key

# Optional. Defaults to pexels, then falls back to pixabay if both are set.
VIDEO_MAKER_STOCK_PROVIDER=pexels

PEXELS_API_KEY and PIXABAY_API_KEY are also accepted, but VIDEO_MAKER_* names are preferred for this tool. Pexels API requests use the Authorization header; Pixabay requests use the key query parameter. Downloaded media is stored locally under the project assets/ directory before rendering.
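As a rough illustration of the two authentication styles described above, the sketch below builds (but does not send) a portrait photo search request for each provider. The function names `stockApiKey` and `buildPhotoSearch` are hypothetical, and the endpoints and `orientation` values follow the public Pexels and Pixabay API documentation; the exact requests video-maker-mcp sends are not confirmed by this README.

```typescript
type StockProvider = "pexels" | "pixabay";
type EnvVars = Record<string, string | undefined>;

interface StockRequest {
  url: string;
  headers: Record<string, string>;
}

// Prefer the VIDEO_MAKER_* name, then the unprefixed alias.
function stockApiKey(env: EnvVars, provider: StockProvider): string | undefined {
  return provider === "pexels"
    ? env["VIDEO_MAKER_PEXELS_API_KEY"] || env["PEXELS_API_KEY"]
    : env["VIDEO_MAKER_PIXABAY_API_KEY"] || env["PIXABAY_API_KEY"];
}

// Build a portrait photo search request without sending it.
// Assumption: public photo search endpoints; the tool may use others.
function buildPhotoSearch(
  provider: StockProvider,
  apiKey: string,
  query: string,
): StockRequest {
  const q = encodeURIComponent(query);
  if (provider === "pexels") {
    // Pexels: the key travels in the Authorization header.
    return {
      url: `https://api.pexels.com/v1/search?query=${q}&orientation=portrait`,
      headers: { Authorization: apiKey },
    };
  }
  // Pixabay: the key travels in the `key` query parameter.
  return {
    url: `https://pixabay.com/api/?key=${encodeURIComponent(apiKey)}&q=${q}&orientation=vertical`,
    headers: {},
  };
}
```

Because the Pixabay key ends up in the URL, it is worth keeping such URLs out of logs.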

Quick image smoke test:

npm run build
node dist/cli/index.js image-test --prompt "A 9:16 vertical cover for a tech short video, realistic photographic look, no text, no watermark."
open ~/Downloads/video-maker/seedream-image-test.png

image-test --out accepts either a directory or a .png file:

node dist/cli/index.js image-test --out ~/Downloads/video-maker
node dist/cli/index.js image-test --out ~/Downloads/my-seedream-test.png

TTS Environment

Narrated video generation requires Doubao/Volcengine TTS credentials before generate_audio can work.

Recommended setup:

npx -y video-maker-mcp@latest configure-tts

The command prompts for the API key locally, does not require pasting the key into an LLM chat, writes the configuration to ~/.video-maker/.env, and runs a short MP3 smoke test.

It writes:

VIDEO_MAKER_TTS_API_KEY=your-api-key
VIDEO_MAKER_TTS_VOICE_ID=zh_female_vv_uranus_bigtts
VIDEO_MAKER_TTS_RESOURCE_ID=seed-tts-2.0
VIDEO_MAKER_TTS_ENDPOINT=https://openspeech.bytedance.com/api/v3/tts/unidirectional

For local development, you can also put these values in a project-root .env file. The CLI loads the project .env first, then fills missing values from ~/.video-maker/.env. To load another env file, set VIDEO_MAKER_ENV_FILE=/path/to/file.
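The load order above (project .env first, then ~/.video-maker/.env for anything still missing) amounts to a first-source-wins merge. The sketch below shows that idea on already-read file contents. `parseEnvFile` and `mergeEnvSources` are hypothetical names, the parser handles only plain KEY=value lines, and how the real CLI treats variables already set in the process environment is not covered here.

```typescript
type EnvMap = Record<string, string>;

// Parse simple KEY=value lines; blank lines and # comments are skipped.
function parseEnvFile(text: string): EnvMap {
  const out: EnvMap = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=", ignore the line
    out[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return out;
}

// Earlier sources win; later sources only fill keys that are still missing.
function mergeEnvSources(...sources: EnvMap[]): EnvMap {
  const merged: EnvMap = {};
  for (const source of sources) {
    for (const [key, value] of Object.entries(source)) {
      if (!(key in merged)) merged[key] = value;
    }
  }
  return merged;
}
```

Calling `mergeEnvSources(projectEnv, homeEnv)` keeps a project-level voice ID override while still picking up the API key stored in the home directory file.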

The default TTS integration uses the current Doubao/Volcengine V3 HTTP unidirectional API with X-Api-Key. This matches the new console API Key page.

Legacy V1 AppID + Access Token is still accepted as a fallback when VIDEO_MAKER_TTS_API_KEY is not set, but new accounts should use VIDEO_MAKER_TTS_API_KEY.

Optional:

VIDEO_MAKER_WORKSPACE=$HOME/.video-maker
VIDEO_MAKER_EXPORT_DIR=$HOME/Downloads/video-maker

Quick TTS smoke test:

npm run build
node dist/cli/index.js tts-test --text "你好，这是一段豆包语音测试。"
open ~/Downloads/video-maker/doubao-tts-test.mp3

tts-test --out accepts either a directory or an .mp3 file:

node dist/cli/index.js tts-test --out ~/Downloads/video-maker
node dist/cli/index.js tts-test --out ~/Downloads/my-voice-test.mp3

Codex Workflow

Once installed, ask Codex:

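Use video-maker to turn the script below into a roughly 60-second vertical video with Chinese narration. First check the environment, then create the project and generate a storyboard plan with shots. Assets must mix AI images, stock photos, and stock video: use AI for the cover and conceptual visuals, stock video for real-world environments and B-roll, and stock photos where appropriate for backgrounds, people, and business scenes. Then generate the assets, verify them, generate the narration audio, render the MP4, and export the final file to ~/Downloads/video-maker.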

The MCP workflow is:

check_environment
create_video_project
get_video_plan_schema
save_video_plan
list_required_assets
generate_assets
Optional fallback: assert_image_generation_contract + host image generation
Optional fallback bridge: save_generated_image
verify_assets
Optional: create_asset_contact_sheet
generate_audio
render_video
Optional: export_video

Important asset contract:

Default: call generate_assets. The MCP server generates an AI/stock mixed asset set and writes files directly to the required project asset paths.
generate_assets uses Doubao Seedream for AI image assets, Pexels/Pixabay for stock photo assets, and Pexels/Pixabay stock video for .mp4 assets. The MCP default is concurrency: 2; use concurrency: 1 if the image provider rate-limits.
Do not treat stock as an optional fallback. save_video_plan rejects plans that do not include AI images, stock photos, and stock videos in the required mix.
Every actual visual unit must use an explicit mediaSource; do not use auto.
Minimum mix: at least 25% AI image visual units, at least 25% stock_photo image visual units, and at least 15% stock_video visual units.
For stock video shots, set mediaType: "video" and use an .mp4 assetPath, for example assets/scene_001_shot_02.mp4.
For AI images or stock photos, use image paths such as .png, .jpg, or .webp.
Provide concise English stockQuery values for stock shots, for example "city traffic night", "business meeting close up", or "factory production line".
Host image generation is only a fallback for failed AI image assets listed in generate_assets.fallbackAllowedAssetPaths, and only when the user explicitly agrees.
Host fallback must never be used for stock_photo or stock_video paths. Those must come from Pexels/Pixabay through generate_assets.
Do not call ad-hoc external image URLs/APIs from the host to bypass generate_assets.
Host fallback requires generate_image_to_file(prompt, absolutePath): call a real image model and save the generated bitmap directly to the requested local file path.
Fallback bridge: if the host image tool exposes a real generated image as a temporary local file or base64/data URL, call save_generated_image(projectId, assetPath, sourcePath|imageBase64) to let MCP persist it to the required asset path.
If the host can only show generated images in chat and cannot expose a local file path or bytes/base64, stop. Do not continue with fallback graphics.
list_required_assets returns the asset paths the tool will use, including assets/cover.png, scene image paths such as assets/scene_001.png, and stock video paths such as assets/scene_001_shot_02.mp4.
The cover must be a dedicated, content-rich 9:16 cover image generated for the video. It should not be a first-frame extraction.
render_video uses the dedicated cover as a short opening segment when assets/cover.png is available. It does not compose title text, color blocks, or poster typography over the cover; the cover artwork should be finished by the AI image model itself.
Cover prompt guidance: make the AI-generated cover cinematic, premium, stable, and design-led. Prefer a strong central subject, layered depth, controlled color, high-end lighting, mobile-thumbnail readability, and elegant negative space. Avoid readable text, fake letters, logos, UI screenshots, infographic templates, collage grids, and generic stock-photo composition.
Do not substitute SVGs, chart screenshots, CSS drawings, HTML/canvas/sharp scripted graphics, preview strips, first-frame extractions, or manually created placeholder graphics.
verify_assets rejects missing files, unreadable files, wrong aspect ratios, tiny files, suspicious placeholder-sized images, and PNG files with same-name SVG sources such as scene_001.svg.
create_asset_contact_sheet creates output/asset_contact_sheet.png from the exact files that render_video will use. Video assets are represented by their first frame. Check this if the preview does not match the intended generated assets.
If you intentionally use very small stylized images, override the size threshold with VIDEO_MAKER_MIN_ASSET_BYTES.
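The minimum mix rule in the contract above can be checked before calling save_video_plan. The sketch below is one possible reading of it, assuming every shot counts as one visual unit and shares are compared without rounding. `checkSourceMix` and the `VisualUnit` shape are hypothetical, and the server's own validation may count units differently (for example, whether the cover is included).

```typescript
type MediaSource = "ai" | "stock_photo" | "stock_video";

interface VisualUnit {
  id: string;
  mediaSource: MediaSource;
}

// Minimum shares from the asset contract: 25% AI, 25% stock photo, 15% stock video.
const MIN_SHARE: Record<MediaSource, number> = {
  ai: 0.25,
  stock_photo: 0.25,
  stock_video: 0.15,
};

// Returns a list of problems; an empty list means the mix passes this sketch.
function checkSourceMix(units: VisualUnit[]): string[] {
  if (units.length === 0) return ["plan has no visual units"];
  const counts: Record<MediaSource, number> = { ai: 0, stock_photo: 0, stock_video: 0 };
  for (const unit of units) counts[unit.mediaSource] += 1;

  const problems: string[] = [];
  for (const source of Object.keys(MIN_SHARE) as MediaSource[]) {
    const share = counts[source] / units.length;
    if (share < MIN_SHARE[source]) {
      const minPercent = Math.round(MIN_SHARE[source] * 100);
      problems.push(`${source}: ${counts[source]}/${units.length} is below ${minPercent}%`);
    }
  }
  return problems;
}
```

For a 20-shot plan, this reading requires at least 5 AI images, 5 stock photos, and 3 stock videos.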

save_generated_image accepts exactly one source:

{
  "projectId": "vid_...",
  "assetPath": "assets/scene_001.png",
  "sourcePath": "/tmp/host-generated-image.png"
}

or:

{
  "projectId": "vid_...",
  "assetPath": "assets/scene_001.png",
  "imageBase64": "data:image/png;base64,..."
}

It only writes to asset paths returned by list_required_assets, rejects SVG input, normalizes the bitmap to the target format, and immediately runs asset validation.
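A host-side wrapper can enforce the exactly-one-source rule before the call reaches the server. The sketch below is illustrative only: `pickImageSource` is a hypothetical helper, the SVG check looks only at the file extension or data URL MIME type, and the real tool may validate differently.

```typescript
interface SaveGeneratedImageArgs {
  projectId: string;
  assetPath: string;
  sourcePath?: string;
  imageBase64?: string;
}

type ImageSource =
  | { kind: "file"; path: string }
  | { kind: "base64"; data: string };

function pickImageSource(args: SaveGeneratedImageArgs): ImageSource {
  const path = args.sourcePath ?? "";
  const base64 = args.imageBase64 ?? "";
  // Exactly one of the two must be present: both or neither is an error.
  if ((path !== "") === (base64 !== "")) {
    throw new Error("Provide exactly one of sourcePath or imageBase64");
  }
  if (path !== "") {
    // Assumption: extension check only; the server may inspect file contents.
    if (path.toLowerCase().endsWith(".svg")) throw new Error("SVG input is rejected");
    return { kind: "file", path };
  }
  if (/^data:image\/svg\+xml/i.test(base64)) throw new Error("SVG input is rejected");
  // Strip an optional data URL prefix such as "data:image/png;base64,".
  return { kind: "base64", data: base64.replace(/^data:[^;,]+;base64,/, "") };
}
```

Failing early in the host keeps an ambiguous request from overwriting a required asset path with the wrong bytes.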

Visual Timeline

For richer videos, prefer scenes[].shots[]. A scene is a narration block; a shot is a visual cut inside that block.

Recommended density:

30 seconds: 10-18 shots
60 seconds: 18-28 shots
90 seconds: 28-40 shots

Shot example:

{
  "id": "shot_01",
  "durationWeight": 1,
  "visualPrompt": "vertical editorial photo, tense courtroom and AI search interface atmosphere, no readable text",
  "assetPath": "assets/scene_001_shot_01.png",
  "mediaSource": "ai",
  "mediaType": "image",
  "motion": "pan_left",
  "transition": "smoothleft",
  "overlay": {
    "badge": "CASE",
    "headline": "责任边界",
    "metric": "30s",
    "position": "top"
  }
}

Mixed stock video example:

{
  "id": "shot_02",
  "durationWeight": 1,
  "visualPrompt": "real handheld vertical business district B-roll, commuters and glass buildings, documentary tone",
  "assetPath": "assets/scene_001_shot_02.mp4",
  "mediaSource": "stock_video",
  "mediaType": "video",
  "stockQuery": "business district commuters vertical video",
  "motion": "still",
  "transition": "smoothleft",
  "overlay": {
    "badge": "B-ROLL",
    "headline": "真实现场",
    "position": "top"
  }
}

Recommended source mix:

Cover: AI image.
Conceptual or impossible visuals: AI image.
Real-world motion, atmosphere, city, factory, product use, crowd, nature, technology B-roll: stock video.
Contextual stills, business scenes, object/background shots: stock photo.
30 seconds: target 10-18 shots, usually 3-6 AI image shots, 3-7 stock video shots, 2-5 stock photo shots.
60 seconds: target 18-28 shots, usually 5-10 AI image shots, 6-12 stock video shots, 4-8 stock photo shots.

If Seedream returns 429 or another temporary provider error, generate_assets retries AI image requests with exponential backoff and still attempts all stock photo/video downloads. It returns failed and fallbackAllowedAssetPaths instead of abandoning the whole mixed asset job.
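The retry behavior described above can be pictured as a loop that keeps going after individual failures and reports them at the end. The sketch below is a simplified model, not the tool's implementation: the 1 second base delay, 30 second cap, and 4 attempts are assumed values, every error is treated as retryable, and assets are processed one at a time rather than with the MCP default concurrency of 2.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

interface AssetJobResult {
  succeeded: string[];
  failed: string[];
}

// `generateOne` and `sleep` are injected so the loop stays testable.
async function generateAllWithRetry(
  assetPaths: string[],
  generateOne: (assetPath: string) => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 4,
): Promise<AssetJobResult> {
  const result: AssetJobResult = { succeeded: [], failed: [] };
  for (const assetPath of assetPaths) {
    let done = false;
    for (let attempt = 0; attempt < maxAttempts && !done; attempt++) {
      try {
        await generateOne(assetPath);
        done = true;
      } catch (err) {
        // Simplification: a real client would retry only 429 or temporary errors.
        if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
      }
    }
    // A failed asset is recorded; the remaining assets are still attempted.
    (done ? result.succeeded : result.failed).push(assetPath);
  }
  return result;
}
```

In a real host, `sleep` would typically wrap `setTimeout` in a promise; the `failed` list plays the role of the paths a caller would compare against fallbackAllowedAssetPaths.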

Supported motion values:

push_in, pull_back, pan_left, pan_right, tilt_up, tilt_down, still

Supported transition values:

fade, wipeleft, wiperight, slideleft, slideright, smoothleft, none

The renderer applies Ken Burns-style pan/zoom over the full shot duration, then uses ffmpeg xfade between segments. Overlay text is burned through ASS subtitles, not generated into the image.
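Chaining ffmpeg xfade across many segments mostly comes down to computing each transition's offset. The sketch below builds a filter graph string under the assumption that every transition has the same type and duration and that each segment duration is the full length of that input clip. `buildXfadeChain` is a hypothetical helper, not the renderer's code, and it does not model the none transition.

```typescript
// Build an ffmpeg filter graph that crossfades N video inputs in order.
// The k-th transition starts at (sum of the first k clip lengths) - k * fadeSec,
// because each earlier transition overlaps two clips by fadeSec.
function buildXfadeChain(durations: number[], transition: string, fadeSec: number): string {
  const parts: string[] = [];
  let elapsed = 0;
  let prevLabel = "[0:v]";
  durations.slice(0, -1).forEach((segmentSec, i) => {
    elapsed += segmentSec;
    const offset = elapsed - (i + 1) * fadeSec;
    const outLabel = `[v${i + 1}]`;
    parts.push(
      `${prevLabel}[${i + 1}:v]xfade=transition=${transition}` +
        `:duration=${fadeSec.toFixed(3)}:offset=${offset.toFixed(3)}${outLabel}`,
    );
    prevLabel = outLabel;
  });
  return parts.join(";");
}
```

With three 3-second clips and a 0.5-second fade, the transitions start at 2.5 s and 5.0 s, and the joined output is 8 seconds long.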

Timing Model

The video is audio-driven:

durationSec in video_plan.json is used as a relative weight for each scene.
If a scene has shots[], each durationWeight divides that scene's allocated time across faster visual cuts.
After generate_audio, render_video reads the real narration duration with ffprobe.
The dedicated cover opening, shot durations, and subtitle timing are scaled to fill the audio duration.
The final MP4 duration is set to the audio duration, so narration is not cut off and the video does not continue after speech ends.
Duration requests are soft targets unless the user explicitly asks for an exact duration. generate_audio returns a durationPolicy; if durationPolicy.isAcceptable is true, do not rewrite or regenerate narration only to get closer to the target duration.
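The weighting described above can be worked through with a small calculation: scene durationSec values split the real audio length proportionally, and durationWeight values split each scene's share across its shots. The sketch below uses hypothetical type and function names and leaves out the cover opening and transition overlap, which the renderer also fits into the audio duration.

```typescript
interface ShotPlan {
  id: string;
  durationWeight: number;
}

interface ScenePlan {
  id: string;
  durationSec: number; // treated as a relative weight, not an exact length
  shots?: ShotPlan[];
}

interface TimedSegment {
  id: string;
  start: number;
  duration: number;
}

function allocateTimeline(scenes: ScenePlan[], audioDurationSec: number): TimedSegment[] {
  const totalSceneWeight = scenes.reduce((sum, s) => sum + s.durationSec, 0);
  if (totalSceneWeight <= 0) throw new Error("scene weights must sum to a positive value");

  const segments: TimedSegment[] = [];
  let cursor = 0;
  for (const scene of scenes) {
    const sceneDuration = (scene.durationSec / totalSceneWeight) * audioDurationSec;
    // A scene without shots is treated as one shot covering the whole scene.
    const shots =
      scene.shots && scene.shots.length > 0
        ? scene.shots
        : [{ id: scene.id, durationWeight: 1 }];
    const totalShotWeight = shots.reduce((sum, s) => sum + s.durationWeight, 0);
    if (totalShotWeight <= 0) throw new Error(`shot weights in ${scene.id} must be positive`);
    for (const shot of shots) {
      const duration = (shot.durationWeight / totalShotWeight) * sceneDuration;
      segments.push({ id: shot.id, start: cursor, duration });
      cursor += duration;
    }
  }
  return segments;
}
```

For example, scenes planned at 20 s and 40 s with 66 s of real narration receive 22 s and 44 s, so a plan written for 60 seconds stretches to the audio instead of cutting speech off.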

Subtitle Alignment

Subtitles are generated from provider timestamps, not from rough scene durations:

generate_audio asks Doubao/Volcengine TTS for subtitle timestamps.
Timestamped words are saved to audio/alignment.json.
render_video refuses to render if the project has audio but no alignment file.
output/subtitles.srt is built from the returned word timings, grouped into readable caption cues.
output/subtitles.ass is also generated and burned into the MP4. It uses dynamic font sizing, two-line Chinese wrapping, and shot overlay text. Cover title composition is intentionally not added by the renderer.
If the selected TTS model or voice does not return timestamps, generate_audio fails instead of falling back to estimated subtitles.
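To make the word-timing-to-caption step concrete, the sketch below groups timestamped words into cues and formats them as SRT. The `TimedWord` shape, the 16-character cue limit, and the function names are assumptions; the actual structure of audio/alignment.json and the tool's grouping rules are not documented here. Words are joined without spaces, which suits Chinese text but not English.

```typescript
interface TimedWord {
  text: string;
  startMs: number;
  endMs: number;
}

interface Cue {
  text: string;
  startMs: number;
  endMs: number;
}

// Append words to the current cue until it would exceed maxChars.
function groupIntoCues(words: TimedWord[], maxChars = 16): Cue[] {
  const cues: Cue[] = [];
  for (const word of words) {
    const last = cues[cues.length - 1];
    if (last !== undefined && last.text.length + word.text.length <= maxChars) {
      last.text += word.text;
      last.endMs = word.endMs;
    } else {
      cues.push({ text: word.text, startMs: word.startMs, endMs: word.endMs });
    }
  }
  return cues;
}

// SRT timestamps use HH:MM:SS,mmm.
function srtTime(ms: number): string {
  const total = Math.max(0, Math.round(ms));
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const h = Math.floor(total / 3600000);
  const m = Math.floor((total % 3600000) / 60000);
  const s = Math.floor((total % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(total % 1000, 3)}`;
}

function toSrt(cues: Cue[]): string {
  return cues
    .map((cue, i) => `${i + 1}\n${srtTime(cue.startMs)} --> ${srtTime(cue.endMs)}\n${cue.text}\n`)
    .join("\n");
}
```

Because each cue inherits its start and end from real word timings, captions stay aligned even when the narration runs faster or slower than the planned scene durations.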

Output

Projects are stored under:

~/.video-maker/projects/<projectId>

Final files:

output/final.mp4
output/subtitles.srt
output/subtitles.ass
assets/cover.png
manifest.json
video_plan.json

For user-facing delivery, ask Codex to pass outputDir to render_video or call export_video after rendering. If no export directory is provided, export_video copies files to:

~/Downloads/video-maker/<projectId>

When available, export_video also copies the dedicated cover to:

cover.png

Local Development

npm install
npm run typecheck
npm run build
node dist/cli/index.js doctor
