
voicesmith-mcp

v1.0.18


Local AI voice for coding assistants — TTS & STT via MCP. Kokoro ONNX + faster-whisper, fully offline.

Readme

VoiceSmith MCP

Local AI voice for coding assistants. Gives your AI a real voice (text-to-speech) and ears (speech-to-text) via the Model Context Protocol (MCP). Fully offline — no cloud APIs, no data leaves your machine.

What You Get

  • 54 distinct voices via Kokoro ONNX (local TTS, ~300MB model)
  • Speech-to-text via faster-whisper (local STT, ~150MB model)
  • Voice activity detection via Silero VAD (local, 2MB)
  • Multi-session support — run multiple Claude Code sessions, each with its own voice (single session for Cursor/Codex)
  • Works with Claude Code, Cursor, and Codex

Quick Start

npx voicesmith-mcp install

The installer will:

  1. Check system dependencies (Python 3.11+, espeak-ng, mpv)
  2. Set up a Python virtual environment with all packages
  3. Download TTS and STT models
  4. Configure your IDE's MCP settings
  5. Let you pick a voice
  6. Inject voice behavior rules so the AI knows how to speak

Restart your IDE session after installing. The AI will greet you by voice on the first response.
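Step 1 of the installer can also be reproduced by hand. A minimal sketch of such a dependency check (function names are illustrative, not the actual installer code):

```python
# Hypothetical sketch of the installer's dependency check (step 1).
# Function names are illustrative, not the real installer's code.
import shutil
import sys

def check_deps(commands):
    """Return the subset of commands not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]

def python_ok(minimum=(3, 11)):
    """True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum

missing = check_deps(["espeak-ng", "mpv"])
if missing or not python_ok():
    print("Missing commands:", ", ".join(missing) or "none")
    print("Python 3.11+ detected:", python_ok())
```

If anything is reported missing, install it with your package manager (for example, Homebrew on macOS) before re-running the installer.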

Usage

[!NOTE] Everything works out of the box. After installing, just start a session — the AI speaks automatically. No configuration needed. The installer sets up voice behavior rules that teach the AI when and how to use its voice.

What the AI does automatically:

| Moment | What happens |
|--------|--------------|
| You give it a task | Gets to work (speaks only when clarifying approach) |
| It finishes work | Speaks a summary of what was done |
| It has a question | Asks out loud, then listens for your voice response |
| Voice tools unavailable | Falls back to text silently |


Changing Voices Mid-Session

Ask the AI to switch voices at any time:

"Switch to Nova"

If the voice is available, the AI switches immediately. If it's occupied by another session, the AI will tell you and show available alternatives.

Browse all 54 voices:

"Show me the available voices"

Or preview them in a terminal: npx voicesmith-mcp voices


Voice Persistence

[!TIP] When you switch voices, the choice is saved automatically. Next time you start or resume a session, the AI uses the same voice — no need to switch again.


Muting

In a meeting or shared space? Just ask:

"Mute the voice"

The AI continues working normally — it just won't play audio. Say "unmute" when you're ready.

Alternative Install

If you don't have Node.js or prefer a shell script:

```shell
git clone https://github.com/shshalom/voicesmith-mcp.git
cd voicesmith-mcp
./install.sh
```

The script accepts the same flags as the npx installer: --claude, --cursor, --codex, --all.

MCP Tools

Once installed, your AI assistant has access to these tools:

| Tool | Description |
|------|-------------|
| speak | Synthesize and play speech for a named agent |
| listen | Open the mic, record speech, return transcribed text |
| speak_then_listen | Speak a question, then immediately listen for the answer |
| set_voice | Change the voice for an agent name |
| get_voice_registry | See which voices are assigned and available |
| list_voices | Browse all 54 Kokoro voices |
| mute / unmute | Silence or resume voice output |
| stop | Stop playback or cancel an active recording |
| status | Server health and session info |
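Under the hood, the IDE invokes these tools with standard MCP JSON-RPC `tools/call` requests over stdio. The argument names below are assumptions inferred from the tool descriptions, not the server's documented schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "speak",
    "arguments": {
      "agent": "Eric",
      "text": "Build finished. Two tests need attention."
    }
  }
}
```

You never write these requests yourself; the IDE's MCP client issues them when the AI decides to speak.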

How It Works

The MCP server runs as a local process alongside your IDE. It communicates over stdio (the MCP protocol). All processing happens on your machine:

  • TTS: Kokoro ONNX — fast neural TTS, 54 voices, no GPU needed
  • STT: faster-whisper — OpenAI Whisper running locally via CTranslate2
  • VAD: Silero VAD — voice activity detection for clean recordings
  • Audio: mpv for playback; CoreAudio via native app bundle on macOS (sounddevice fallback on Linux)
  • Media ducking: Auto-pauses Apple Music, Spotify, and browser audio during speech (macOS)

Multi-Session

Claude Code: Full multi-session support. Multiple Claude Code sessions can run simultaneously, each with its own voice. Session identity is tracked via Claude's session_id — resuming a session reclaims the same voice, and multiple terminals sharing the same session share the same voice. Orphaned servers are detected and cleaned up automatically.
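The session-to-voice bookkeeping can be pictured as a small registry keyed by session_id. This is an illustrative sketch of the behavior described above, not the server's actual data structures:

```python
# Illustrative sketch of a session-keyed voice registry.
# The real server's data structures and behavior may differ.
class VoiceRegistry:
    def __init__(self, voices):
        self.voices = list(voices)    # all known voice ids
        self.assignments = {}         # session_id -> voice

    def claim(self, session_id, preferred=None):
        """Resuming reclaims the prior voice; otherwise grant a free one."""
        if session_id in self.assignments:
            return self.assignments[session_id]
        taken = set(self.assignments.values())
        candidates = ([preferred] if preferred else []) + self.voices
        for voice in candidates:
            if voice not in taken:
                self.assignments[session_id] = voice
                return voice
        return None

    def release(self, session_id):
        """Called when an orphaned server is cleaned up."""
        self.assignments.pop(session_id, None)
```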

Cursor / Codex: Single session only. Cursor runs one MCP server per config (shared across tabs), and Codex has no multi-session hooks. Voice works normally — just no multi-session coordination.

Cross-session audio is serialized via flock to prevent overlapping playback.
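The flock-based serialization amounts to taking an exclusive lock around playback. A minimal sketch, assuming a hypothetical lock-file path (the server's actual path and playback code may differ):

```python
# Sketch of serializing playback across sessions with flock.
# The lock-file path and function are illustrative, not the server's code.
import fcntl
import subprocess

LOCK_PATH = "/tmp/voicesmith-audio.lock"  # hypothetical location

def play_serialized(audio_path):
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)      # blocks until other playback ends
        try:
            subprocess.run(["mpv", "--no-video", audio_path], check=False)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)  # also released when fd closes
```

Because flock is advisory and per open file description, every session that opens the same lock file queues behind whichever one is currently speaking.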

Configuration

Config lives at ~/.local/share/voicesmith-mcp/config.json. Key settings:

```json
{
  "main_agent": "Eric",
  "tts": {
    "default_voice": "am_eric",
    "audio_player": "mpv",
    "duck_media": true
  },
  "stt": {
    "model_size": "base",
    "language": "en",
    "vad_threshold": 0.3,
    "nudge_on_timeout": false
  }
}
```

| Setting | Description | Default |
|---------|-------------|---------|
| tts.duck_media | Auto-pause music/browser audio during speech (macOS) | true |
| stt.nudge_on_timeout | Speak "I didn't catch that" when listen times out | false |
| stt.vad_threshold | Voice detection sensitivity (lower = more sensitive) | 0.3 |

Re-run npx voicesmith-mcp install to change your voice or update settings. Existing configuration is preserved — only new defaults are added.
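The "only new defaults are added" behavior is a standard defaults merge: user values win, and missing keys are filled in from the shipped defaults. A sketch of that logic (illustrative, not the installer's actual code):

```python
# Illustrative deep merge: existing user settings win, and keys the
# user has not set are filled from the shipped defaults.
def merge_defaults(user, defaults):
    out = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(defaults.get(key), dict):
            out[key] = merge_defaults(value, defaults[key])
        else:
            out[key] = value
    return out

defaults = {"stt": {"vad_threshold": 0.3, "nudge_on_timeout": False}}
user = {"stt": {"vad_threshold": 0.2}}
merged = merge_defaults(user, defaults)
# merged["stt"] == {"vad_threshold": 0.2, "nudge_on_timeout": False}
```

This is why a customized vad_threshold survives upgrades while newly introduced settings still appear with sensible defaults.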

Requirements

  • Python 3.11+ (3.11 or 3.12 recommended)
  • macOS (primary platform) or Linux (partial support)
  • espeak-ng — phoneme backend for Kokoro
  • mpv — audio playback
  • ~500MB disk space for models

[!WARNING] Windows is not supported yet. The server uses Unix-specific features (file locking, audio commands, process detection). Windows support is planned — see TODO for details.

Supported IDEs

| IDE | Config Location | Rules Location | Multi-Session |
|-----|-----------------|----------------|---------------|
| Claude Code | ~/.claude.json | ~/.claude/CLAUDE.md | Yes (via session_id) |
| Cursor | ~/.cursor/mcp.json | ~/.cursor/rules/voicesmith.mdc | No (single server) |
| Codex | ~/.codex/mcp.json | ~/.codex/AGENTS.md | No (single session) |

Troubleshooting

The AI can't hear me (listen returns empty or times out)

Check microphone permissions. On macOS, VoiceSmith uses a native app bundle (VoiceSmithMCP.app) for mic access. The first time it records, macOS should show a permission dialog for the app. If it didn't:

  1. Open System Settings > Privacy & Security > Microphone
  2. Look for VoiceSmithMCP and make sure it's enabled
  3. If it's not listed, the LaunchAgent may not be running — try reinstalling: npx voicesmith-mcp install

[!IMPORTANT] If the server detects silent audio (all zeros for ~320ms), it returns an error pointing you to the microphone permission settings. This usually means macOS TCC denied mic access.
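The silent-audio check boils down to scanning a short window of 16-bit PCM for any nonzero sample. A sketch of that idea, assuming 16 kHz mono capture (the server's actual window size and names may differ):

```python
# Sketch of the silent-capture check: if every sample in roughly 320 ms
# of 16-bit PCM is zero, assume the mic was blocked (TCC denial).
# Sample rate and names are assumptions, not the server's constants.
import array

SAMPLE_RATE = 16000                          # assumed capture rate
WINDOW_SAMPLES = SAMPLE_RATE * 320 // 1000   # ~320 ms of samples

def looks_silent(pcm_bytes):
    samples = array.array("h")               # signed 16-bit little-endian
    samples.frombytes(pcm_bytes[: WINDOW_SAMPLES * 2])
    return all(s == 0 for s in samples)
```

Real ambient noise virtually never produces exact zeros, so an all-zero window is a reliable signal that the OS handed the process a muted stream rather than the microphone.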

Check your audio input device. If an external mic is selected but not connected, the server opens it but gets silence:

  • Open System Settings > Sound > Input and verify the correct mic is selected
  • Or ask the AI: "What's the server status?" — check that stt.loaded and vad.loaded are both true

Another app is using the mic. Apps like Zoom, Teams, or FaceTime can hold exclusive mic access. Close them and try again.

Voice too quiet for VAD. The voice activity detector might not pick up soft speech. You can lower the sensitivity threshold in ~/.local/share/voicesmith-mcp/config.json:

```json
{
  "stt": {
    "vad_threshold": 0.2
  }
}
```

Lower values = more sensitive. Default is 0.3. Restart the session after changing.
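If you prefer to script the change, the edit is a plain JSON read-modify-write. A sketch, using the config path from the Configuration section (adjust if yours differs):

```python
# Sketch: lower the VAD threshold in config.json programmatically.
# The path follows the Configuration section; adjust if yours differs.
import json
from pathlib import Path

def set_vad_threshold(value, path="~/.local/share/voicesmith-mcp/config.json"):
    config_path = Path(path).expanduser()
    config = json.loads(config_path.read_text())
    config.setdefault("stt", {})["vad_threshold"] = value
    config_path.write_text(json.dumps(config, indent=2))
    return config

# set_vad_threshold(0.2)  # then restart the session
```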

The AI doesn't speak

  • Check that espeak-ng and mpv are installed: which espeak-ng mpv
  • Check the AI's status: ask "What's your voice status?"
  • If muted, say "Unmute"

The AI speaks with the wrong voice

This can happen when another session is holding your preferred voice name. Ask the AI: "Switch to Eric" — it will either switch or tell you what's available.

Uninstall

npx voicesmith-mcp uninstall

Removes all files, models, MCP config entries, and voice rules cleanly.

License

Apache 2.0