voxflow
v1.8.8

AI audio content creation CLI — stories, podcasts, narration, dubbing, transcription, translation, and video translation with TTS.
Quick Start
```bash
# Synthesize a single sentence
npx voxflow say "你好世界"

# Output as MP3 (smaller file size)
npx voxflow say "你好世界" --format mp3

# Generate a story with TTS narration
npx voxflow story --topic "三只小猪"

# Dub a video from SRT subtitles
npx voxflow dub --srt subtitles.srt --video input.mp4 --output dubbed.mp4

# Transcribe audio to subtitles (SRT)
npx voxflow asr --input recording.mp3

# Translate SRT subtitles to another language
npx voxflow translate --srt subtitles.srt --to en

# End-to-end video translation (ASR → translate → dub → merge)
npx voxflow video-translate --input video.mp4 --to en

# One-command build + local delivery (for Skill/agent orchestration)
npx voxflow publish --video input.mp4 --audio narration.wav --publish local

# Browse available voices
npx voxflow voices --search "温柔"
```

A browser window will open for login on first use. After that, your token is cached automatically.
Install
```bash
npm install -g voxflow
```

Commands
voxflow say <text> / voxflow synthesize <text>
Synthesize a single text snippet to audio.
```bash
voxflow say "你好世界"
voxflow say "你好世界" --format mp3
voxflow synthesize "Welcome" --voice v-male-Bk7vD3xP --format mp3
voxflow say "快速测试" --speed 1.5 --volume 0.8 --pitch 2
```

| Flag | Default | Description |
|------|---------|-------------|
| <text> | (required) | Text to synthesize (positional or --text) |
| --voice <id> | v-female-R2s4N9qJ | TTS voice ID |
| --format <fmt> | pcm | Output format: pcm (WAV), wav, mp3 |
| --speed <n> | 1.0 | Speed 0.5-2.0 |
| --volume <n> | 1.0 | Volume 0.1-2.0 |
| --pitch <n> | 0 | Pitch -12 to 12 |
| --output <path> | ./tts-<timestamp>.wav | Output file path |
voxflow narrate [options]
Narrate a document, text, or script to multi-segment audio.
```bash
voxflow narrate --input article.txt
voxflow narrate --input article.txt --format mp3
voxflow narrate --input readme.md --voice v-male-Bk7vD3xP
voxflow narrate --text "第一段。第二段。第三段。"
voxflow narrate --script narration-script.json
echo "Hello world" | voxflow narrate
```

| Flag | Default | Description |
|------|---------|-------------|
| --input <file> | | Input .txt or .md file |
| --text <text> | | Inline text to narrate |
| --script <file> | | JSON script with per-segment voice control |
| --voice <id> | v-female-R2s4N9qJ | Default voice ID |
| --format <fmt> | pcm | Output format: pcm (WAV), wav, mp3 |
| --speed <n> | 1.0 | Speed 0.5-2.0 |
| --silence <sec> | 0.8 | Silence between segments 0-5.0 |
| --output <path> | ./narration-<timestamp>.wav | Output file path |
Script JSON format (per-segment voice/speed control):
```json
{
  "segments": [
    { "text": "第一段内容", "voiceId": "v-female-R2s4N9qJ", "speed": 1.0 },
    { "text": "第二段内容", "voiceId": "v-male-Bk7vD3xP", "speed": 0.8 }
  ],
  "silence": 1.0,
  "output": "my-narration.wav"
}
```

voxflow voices [options]
Browse and filter available TTS voices (no login required).
```bash
voxflow voices
voxflow voices --search "温柔" --gender female
voxflow voices --language en --extended
voxflow voices --json
```

| Flag | Default | Description |
|------|---------|-------------|
| --search <query> | | Search by name, tone, style |
| --gender <m\|f> | | Filter by gender |
| --language <code> | | Filter by language: zh, en, etc. |
| --extended | false | Include extended voice library (380+) |
| --json | false | Output raw JSON |
voxflow story [options]
Generate a story with AI and synthesize TTS audio.
```bash
voxflow story --topic "小红帽的故事"
voxflow story --topic "太空探险" --paragraphs 8 --speed 0.8
```

| Flag | Default | Description |
|------|---------|-------------|
| --topic <text> | Children's story | Story prompt |
| --voice <id> | v-female-R2s4N9qJ | TTS voice ID |
| --output <path> | ./story-<timestamp>.wav | Output WAV file |
| --paragraphs <n> | 5 | Number of paragraphs (1-20) |
| --speed <n> | 1.0 | Speed (0.5-2.0) |
| --silence <sec> | 0.8 | Silence between paragraphs (0-5.0) |
voxflow podcast [options]
Generate a multi-speaker podcast dialogue with AI script generation and multi-voice TTS.
```bash
# Quick start — AI generates script + synthesizes audio
voxflow podcast --topic "AI in healthcare"

# Use a template with colloquial control
voxflow podcast --topic "tech news" --template news --colloquial high --speakers 3

# English podcast
voxflow podcast --topic "AI ethics debate" --language en --template discussion

# Generate script only (no TTS), export as JSON
voxflow podcast --topic "量子计算入门" --format json --no-tts

# Synthesize from a previously exported .podcast.json
voxflow podcast --input my-podcast.podcast.json --output final.wav

# Legacy engine (lower quota cost)
voxflow podcast --topic "AI趋势" --engine legacy --exchanges 10
```

| Flag | Default | Description |
|------|---------|-------------|
| --topic <text> | tech trends | Podcast topic or prompt |
| --engine <type> | auto (→ ai-sdk) | auto, legacy, or ai-sdk |
| --template <name> | interview | interview, discussion, news, story, tutorial |
| --colloquial <lvl> | medium | Conversational tone: low, medium, high |
| --speakers <n> | 2 | Speaker count: 1, 2, or 3 |
| --language <code> | zh-CN | zh-CN, en, ja |
| --format json | — | Also output .podcast.json alongside audio |
| --input <file> | — | Load .podcast.json for synthesis (skip LLM) |
| --no-tts | false | Generate script only, skip TTS synthesis |
| --length <len> | medium | short, medium, long |
| --exchanges <n> | 8 | Number of exchanges, 2-30 (legacy engine) |
| --style <style> | — | Legacy: dialogue style (maps to --template) |
| --voice <id> | — | Override TTS voice for all speakers |
| --bgm <file> | — | Background music file to mix in |
| --ducking <n> | 0.2 | BGM volume ducking (0-1.0) |
| --output <path> | ./podcast-<ts>.wav | Output file path |
| --speed <n> | 1.0 | TTS speed (0.5-2.0) |
| --silence <sec> | 0.5 | Silence between segments (0-5.0) |
Two-step workflow (recommended for editing):
1. `voxflow podcast --topic "..." --format json --no-tts` → generates a `.podcast.json`
2. Edit the JSON (speakers, dialogue, voice mapping)
3. `voxflow podcast --input edited.podcast.json` → synthesizes audio
voxflow dub [options]
Dub audio from SRT subtitles with timeline-precise TTS synthesis. Supports multi-speaker voice mapping, dynamic speed compensation, video merge, and background music mixing.
```bash
# Basic: generate dubbed audio from SRT
voxflow dub --srt subtitles.srt

# Dub and merge into video
voxflow dub --srt subtitles.srt --video input.mp4 --output dubbed.mp4

# Multi-speaker with voice mapping
voxflow dub --srt subtitles.srt --voices speakers.json --speed-auto

# Add background music with ducking
voxflow dub --srt subtitles.srt --bgm music.mp3 --ducking 0.3

# Patch a single caption without full rebuild
voxflow dub --srt subtitles.srt --patch 5 --output dub-existing.wav
```

| Flag | Default | Description |
|------|---------|-------------|
| --srt <file> | (required) | SRT subtitle file |
| --video <file> | | Video file — merge dubbed audio into video |
| --voice <id> | v-female-R2s4N9qJ | Default TTS voice ID |
| --voices <file> | | JSON speaker-to-voiceId map for multi-speaker dubbing |
| --speed <n> | 1.0 | TTS speed 0.5-2.0 |
| --speed-auto | false | Auto-adjust speed when audio overflows timeslot |
| --bgm <file> | | Background music file to mix in |
| --ducking <n> | 0.2 | BGM volume ducking 0-1.0 (lower = quieter BGM) |
| --patch <id> | | Re-synthesize a single caption by ID (patch mode) |
| --output <path> | ./dub-<timestamp>.wav | Output file path (.wav or .mp4 with --video) |
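The `--speed-auto` flag above is documented only as "auto-adjust speed when audio overflows timeslot". As an illustration of what such compensation typically looks like (this is a sketch, not necessarily the CLI's exact algorithm), the speed factor is just the overflow ratio, clamped to the documented range:

```python
def compensated_speed(audio_sec: float, slot_sec: float,
                      base: float = 1.0, max_speed: float = 2.0) -> float:
    """Speed up TTS audio just enough to fit its subtitle timeslot.

    Illustrative only: shows the simplest rule consistent with the
    --speed-auto description, clamped to the documented 0.5-2.0 range.
    """
    if audio_sec <= slot_sec:
        return base                       # already fits, keep base speed
    needed = base * audio_sec / slot_sec  # factor that makes it fit exactly
    return min(needed, max_speed)         # never exceed the allowed maximum
```

Note that once the clamp at 2.0 kicks in, the audio still overflows; trimming or rewriting the caption is the only remaining fix.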
SRT format with speaker tags (optional [Speaker: xxx] extension):
```
1
00:00:01,000 --> 00:00:03,500
[Speaker: Alice]
Hello, welcome to the show!

2
00:00:04,000 --> 00:00:06,500
[Speaker: Bob]
Thanks for having me.
```

Voice mapping JSON (speakers.json):

```json
{
  "Alice": "v-female-R2s4N9qJ",
  "Bob": "v-male-Bk7vD3xP"
}
```

Requires `ffmpeg` in PATH for --video, --bgm, and --speed-auto features.
voxflow asr [options] / voxflow transcribe [options]
Transcribe audio or video files to text. Supports cloud ASR (Tencent Cloud, 3 modes) and local Whisper (offline, no quota).
```bash
# Transcribe with auto engine detection (local Whisper if available, else cloud)
voxflow asr --input recording.mp3

# Force local Whisper (no login needed, no quota used)
voxflow asr --input recording.mp3 --engine local

# Use a larger Whisper model for better accuracy
voxflow asr --input meeting.wav --engine local --model small

# Cloud ASR with speaker diarization
voxflow asr --input meeting.wav --engine cloud --speakers --speaker-number 3

# Transcribe video file, output plain text
voxflow asr --input video.mp4 --format txt

# Remote URL (cloud only)
voxflow asr --url https://example.com/audio.wav --mode flash

# Record from microphone (cloud only)
voxflow asr --mic --format txt
```

| Flag | Default | Description |
|------|---------|-------------|
| --input <file> | | Local audio or video file |
| --url <url> | | Remote audio URL (cloud only) |
| --mic | | Record from microphone (cloud only, requires sox) |
| --engine <type> | auto | Engine: auto, local, cloud |
| --model <name> | base | Whisper model: tiny, base, small, medium, large |
| --mode <type> | auto | Cloud mode: auto, sentence, flash, file |
| --lang <model> | 16k_zh | Language: 16k_zh, 16k_en, 16k_zh_en, 16k_ja, 16k_ko |
| --format <fmt> | srt | Output: srt, txt, json |
| --output <path> | <input>.<format> | Output file path |
| --speakers | false | Enable speaker diarization (cloud only) |
| --speaker-number <n> | | Expected speakers (with --speakers) |
| --task-id <id> | | Resume async task polling (cloud only) |
Engine selection:
- auto — uses local Whisper if `nodejs-whisper` is installed, otherwise falls back to cloud
- local — local Whisper via whisper.cpp (no login, no quota, offline capable)
- cloud — Tencent Cloud ASR (requires login, uses quota)
Local Whisper setup (optional):
```bash
npm install -g nodejs-whisper
# Model downloads automatically on first use (~142 MB for base)
```

Requires `ffmpeg` in PATH for audio extraction from video files.
voxflow translate [options]
Translate SRT subtitles, plain text, or text files using LLM-powered batch translation.
```bash
# Translate SRT file (Chinese → English)
voxflow translate --srt subtitles.srt --to en

# Translate with timing realignment for target language
voxflow translate --srt subtitles.srt --to en --realign

# Translate a text file
voxflow translate --input article.txt --to ja --output article-ja.txt

# Translate inline text
voxflow translate --text "你好世界" --to en

# Auto-detect source language
voxflow translate --srt movie.srt --to ko
```

| Flag | Default | Description |
|------|---------|-------------|
| --srt <file> | | SRT subtitle file to translate |
| --input <file> | | Plain text / markdown file to translate |
| --text <string> | | Inline text to translate |
| --from <lang> | auto-detect | Source language: zh, en, ja, ko, fr, de, es, etc. |
| --to <lang> | (required) | Target language code |
| --realign | false | Adjust subtitle timing for target language length differences |
| --batch-size <n> | 10 | Captions per translation batch (1-20) |
| --output <path> | <input>-<lang>.srt | Output file path |
Supported languages: zh, en, ja, ko, fr, de, es, pt, ru, ar, th, vi, id, and more.
Cost: 1 quota per batch (~10 captions). A 100-caption SRT costs ~10 quota.
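Since each batch is one quota-charged translation call, the total scales with the ceiling of captions over batch size. A quick sketch of that arithmetic:

```python
from math import ceil

def translate_batches(captions: int, batch_size: int = 10) -> int:
    """Number of quota-charged translation batches for an SRT file.

    batch_size mirrors the --batch-size default (10 captions per batch).
    """
    return ceil(captions / batch_size)
```

So 100 captions yield 10 batches, and 101 captions tip over into an 11th; lowering --batch-size improves per-batch context focus but raises the batch count.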
voxflow video-translate [options]
End-to-end video translation: extracts audio, transcribes, translates subtitles, dubs with TTS, and merges back into video.
```bash
# Translate Chinese video to English
voxflow video-translate --input video.mp4 --to en

# Specify source language
voxflow video-translate --input video.mp4 --from zh --to ja

# Keep intermediate files (SRT, audio) for debugging
voxflow video-translate --input video.mp4 --to en --keep-intermediates

# Custom voice and speed
voxflow video-translate --input video.mp4 --to en --voice v-male-Bk7vD3xP --speed 0.9
```

| Flag | Default | Description |
|------|---------|-------------|
| --input <file> | (required) | Input video file |
| --from <lang> | auto-detect | Source language code |
| --to <lang> | (required) | Target language code |
| --voice <id> | v-female-R2s4N9qJ | TTS voice ID for dubbing |
| --voices <file> | | Voice mapping JSON for multi-speaker |
| --realign | false | Adjust subtitle timing for target language |
| --speed <n> | 1.0 | TTS speed (0.5-2.0) |
| --batch-size <n> | 10 | Translation batch size |
| --keep-intermediates | false | Keep temp files (SRT, audio) |
| --output <path> | <input>-<lang>.mp4 | Output MP4 path |
| --asr-mode <mode> | auto | Override ASR mode: auto, sentence, flash, file |
| --asr-lang <model> | auto | Override ASR language model: 16k_zh, 16k_en, 16k_ja, 16k_ko |
Pipeline: Video → FFmpeg extract audio → ASR transcribe → LLM translate → TTS dub → FFmpeg merge → Output MP4
Cost: ~3-N quota (1 ASR + 1+ translate batches + 1 per TTS caption)
Requires `ffmpeg` in PATH.
voxflow publish [options]
Single command for final deliverables. Designed for agent skills and automation orchestration:
- Build final MP4 (translate+dub / dub / merge)
- Deliver to local directory or via webhook
- Return structured JSON output for downstream processing
Note: --platform is a metadata tag only — it does NOT upload to any platform. Use --publish webhook to integrate with your own distribution service.
```bash
# Mode A: video-translate + local delivery
voxflow publish --input video.mp4 --to en --publish local

# Mode B: dub existing subtitles into video
voxflow publish --srt subtitles.srt --video input.mp4 --publish local

# Mode C: merge existing audio into video
voxflow publish --video input.mp4 --audio narration.mp3 --publish local

# Deliver via webhook (e.g. custom distribution service)
voxflow publish --input video.mp4 --to ja \
  --publish webhook \
  --publish-webhook https://publisher.example.com/hook \
  --json
```

| Flag | Default | Description |
|------|---------|-------------|
| --input <video> | | Mode A: source video for translate+dub (requires --to) |
| --to <lang> | | Target language for Mode A |
| --from <lang> | auto | Source language for Mode A |
| --srt <file> | | Mode B: SRT subtitle file (requires --video) |
| --video <file> | | Mode B/Mode C video file |
| --audio <file> | | Mode C: external narration audio |
| --voice <id> | v-female-R2s4N9qJ | TTS voice for Mode A/B |
| --voices <file> | | Multi-speaker voice mapping JSON |
| --output <path> | auto | Final MP4 output path |
| --publish <target> | local | local \| webhook \| none |
| --publish-dir <dir> | ./published | Local publish directory |
| --publish-webhook <url> | | Webhook URL for distribution service |
| --platform <name> | generic | Platform metadata tag (not an actual upload target) |
| --title <text> | filename | Title metadata |
| --json | false | Print machine-readable JSON result |
voxflow login / logout / status / dashboard
```bash
voxflow login      # Open browser to login via email OTP
voxflow logout     # Clear cached token
voxflow status     # Show login status and token info
voxflow dashboard  # Open Web dashboard in browser
```

voxflow add <name> (experimental — Day-1 MVP)
Pull a curated flow / voice recipe / CLI preset from the official registry into your current project.
```bash
voxflow add --list                # Browse all items in the registry
voxflow add dub-anime-jp-zh      # Pull a preset (resolves to voxflow/dub-anime-jp-zh)
voxflow add chico/my-recipe      # Explicit author for community items
voxflow add foo --force          # Overwrite existing local files
voxflow add foo --registry <url> # Use a different registry (e.g. enterprise private)
```

After add, files land under presets/<name>/ (preset), recipes/<name>/ (voice-recipe), or flows/<name>/ (flow). The CLI prints a "Try it:" hint with the exact command to use the just-installed item.
Day-1 MVP scope: no dependsOn cascading, no ETag cache, no private-registry token. Coming in Phase 2 — see docs/product/cli-registry.md.
Authentication
voxflow uses browser-based email OTP login (Supabase):
- CLI starts a temporary local HTTP server
- Opens your browser to the login page
- You enter your email and verification code
- Browser redirects back to the CLI with your token
- Token is cached at ~/.config/voxflow/token.json
Quota
- Free tier: 10,000 quota per month (1 basic TTS = 100 quota)
- say / synthesize: 100 quota per call
- narrate: 100 quota per segment
- story: ~600-800 quota (1 LLM + N TTS)
- podcast (ai-sdk): ~5,000-10,000 quota (script) + 100/segment (TTS)
- podcast (legacy): ~200 quota (script) + 100/segment (TTS)
- dub: 100 quota per SRT caption
- asr (cloud): 100 quota per recognition
- asr (local): free (no quota)
- translate: 100 quota per batch (~10 captions)
- video-translate: ~300-N quota (ASR + translate + TTS)
- voices: free (no quota)
- Quota resets monthly
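Combining the per-command rates above, a video-translate job's cost is one ASR call, one translate call per caption batch, and one TTS call per dubbed caption. A rough estimator (illustrative arithmetic from the list above, not an official meter):

```python
from math import ceil

def estimate_video_translate_quota(captions: int, batch_size: int = 10,
                                   unit: int = 100) -> int:
    """Rough quota estimate for a video-translate run.

    unit is the per-call quota from the list above (100); batch_size
    mirrors the translate --batch-size default.
    """
    asr = 1                                  # one cloud recognition
    translate = ceil(captions / batch_size)  # one charged call per batch
    tts = captions                           # one TTS call per caption
    return (asr + translate + tts) * unit
```

For a 20-caption video this comes to (1 + 2 + 20) x 100 = 2,300 quota, so TTS dominates the bill on caption-heavy videos.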
Requirements
- Node.js >= 18.0.0
- `ffmpeg` recommended — needed by most audio/video features:
| Command | Without FFmpeg | With FFmpeg |
|---------|---------------|-------------|
| say / synthesize | Full support | Full support |
| narrate | Full support | Full support |
| story / podcast | Full support | Full support |
| voices | Full support | Full support |
| dub --srt file.srt | Audio output only | Audio output only |
| dub --video / --bgm / --speed-auto | Not available | Full support |
| asr --input file.wav (16kHz mono) | Works (cloud) | Works (cloud + local) |
| asr --input file.mp3 / video | Not available | Full support |
| asr --engine local | Not available | Full support |
| translate | Full support | Full support |
| video-translate | Not available | Full support |
Install FFmpeg:
```bash
# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt install ffmpeg

# Windows — download from https://ffmpeg.org/download.html
```

Optional dependencies:

- nodejs-whisper — for local Whisper ASR without cloud API (npm install -g nodejs-whisper)
- sox — for microphone recording (asr --mic)
Claude Code / AI Agent Integration
The voxflow CLI is designed to be called by AI agents (Claude Code, Cursor, etc.) as the unified execution layer. No API keys or Python scripts needed — all auth goes through voxflow login (JWT).
Skill documentation: See cli/skills/podcast/SKILL.md for the full podcast skill reference.
Plugin install (Claude Code / Cursor / Codex)
The CLI ships agent plugin manifests so it can be installed as a first-class plugin in any major AI coding environment. Each manifest points to the shared cli/skills/ directory.
```bash
# Claude Code — local try (plugin root is cli/, manifest at cli/.claude-plugin/plugin.json)
claude --plugin-dir cli
# Verify the manifest is valid:
claude plugin tag cli --dry-run

# Codex — sparse install from GitHub (plugin metadata + skills only)
codex plugin marketplace add VoxFlowStudio/FlowStudio \
  --sparse cli/.codex-plugin --sparse cli/skills

# Cursor — sideload from a cloned repo (Settings → Plugins → Load unpacked → cli/)
```

Note: plugin install only ships the agent-side manifest and skill docs. To actually run voxflow commands, install the CLI separately: npm install -g voxflow.
Manifests live at cli/.claude-plugin/plugin.json, cli/.cursor-plugin/plugin.json, cli/.codex-plugin/plugin.json. Claude Code discovers skills/ relative to the plugin root (cli/), so the Claude manifest omits the skills field and relies on the folder-name convention.
Typical agent workflow:
```bash
# 1. Login (one-time)
voxflow login

# 2. Generate script only
voxflow podcast --topic "Your topic" --format json --no-tts

# 3. Agent edits the .podcast.json as needed

# 4. Synthesize from edited script
voxflow podcast --input edited.podcast.json --output final.wav
```

CI/non-interactive environments: set the VOXFLOW_TOKEN env var to skip browser login.
License
UNLICENSED - All rights reserved.
