@sylweriusz/mcp-kokoro-voice

v1.2.0

Published

5 months ago

MCP Kokoro Voice - Local-first voice synthesis with Kokoro TTS and macOS fallback for AI agents.


mcp kokoro-voice kokoro-tts local-first voice-synthesis tts speech macos-fallback security-hardened english-only model-context-protocol ai-agents zero-cloud privacy-first

🎵 MCP Nexus Voice v5.0

Local-First Voice Synthesis with Multiple Engine Support

AI agents can express themselves with natural voice synthesis and zero cloud dependencies. The server supports XTTS2, Kokoro TTS, and a macOS system voice fallback.

✨ Architecture

Multi-Engine Support with Intelligent Fallback:

🎤 XTTS2 - High-quality voice synthesis with custom voice support (optional, configured via VOICE_CHANNEL)
🎌 Kokoro TTS - Local high-quality synthesis with bf_isabella (female English voice)
🍎 macOS Fallback - Automatic fallback to system 'say' command with Zoe (Premium) voice
🔒 Security Hardened - Command injection protection, input validation, queue limits
⚡ Dual Queue System - Sequential synthesis (1 concurrent worker) with sequential playback
🛡️ Production Ready - Mutex protection, DoS prevention, proper error handling, waiting queue with 30s timeout

Zero cloud dependencies. Complete privacy. Enterprise security.

🚀 Quick Start

Prerequisites

Node.js 18+ - Required for MCP server
macOS - Required for audio playback (afplay) and fallback voice (say)
Kokoro TTS Server (Optional) - If running, it provides high-quality voice synthesis. Otherwise the MCP server falls back to the macOS system voice

Installation Methods

Method 1: From Distribution Package (USB Drive)

Creating the Package:

# Build distribution package
npm run package

# Output files in dist-packages/:
# - mcp-kokoro-voice-v1.0.0.zip (~927 KB)
# - install.sh

Installing on Target Mac:

Copy both files from dist-packages/ to USB drive
On target Mac, navigate to USB drive location
Run installer:

chmod +x install.sh
./install.sh

Follow on-screen instructions to configure Claude Desktop
Add configuration to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "speech": {
      "command": "node",
      "args": [
        "/Users/YOUR_USERNAME/.mcp-servers/mcp-kokoro-voice-v1.0.0/dist/index.js"
      ],
      "env": {
        "KOKORO_API_URL": "http://localhost:8880",
        "KOKORO_MODEL": "mlx-community/Kokoro-82M-8bit"
      }
    }
  }
}

Restart Claude Desktop
Verify with /mcp command

Method 2: NPM Installation (Recommended)

# Install MCP Kokoro Voice globally
npm install -g @sylweriusz/mcp-kokoro-voice

Claude Desktop Configuration (for NPM installation)

Add this to: ~/Library/Application Support/Claude/claude_desktop_config.json

Option 1: Kokoro TTS (Default)

{
  "mcpServers": {
    "speech": {
      "command": "npx",
      "args": ["-y", "@sylweriusz/mcp-kokoro-voice"],
      "env": {
        "VOICE_CHANNEL": "KOKORO",
        "KOKORO_API_URL": "http://localhost:8880"
      }
    }
  }
}

Option 2: XTTS2 with Kokoro Fallback

{
  "mcpServers": {
    "speech": {
      "command": "npx",
      "args": ["-y", "@sylweriusz/mcp-kokoro-voice"],
      "env": {
        "VOICE_CHANNEL": "XTTS2",
        "XTTS2_API_URL": "http://your-xtts2-server:5002",
        "XTTS2_MALE_VOICE": "patrick_stewart",
        "XTTS2_FEMALE_VOICE": "langusta",
        "XTTS2_SPEED": "1.0",
        "XTTS2_PITCH": "1.0",
        "KOKORO_API_URL": "http://localhost:8880"
      }
    }
  }
}

Configuration Notes:

VOICE_CHANNEL: Selects the primary engine, XTTS2 or KOKORO (optional, defaults to KOKORO)
KOKORO_API_URL: Kokoro server address (optional, defaults to http://localhost:8880)
XTTS2_API_URL: XTTS2 server address (required when VOICE_CHANNEL is XTTS2)
XTTS2_MALE_VOICE and XTTS2_FEMALE_VOICE: Custom voice names (optional)
XTTS2_SPEED: Speech tempo - 1.0=normal, 1.2=20% faster, 0.8=20% slower (optional, defaults to 1.0)
XTTS2_PITCH: Voice pitch/frequency - 1.0=normal, 1.1=10% higher, 0.9=10% lower (optional, defaults to 1.0)
If the primary engine is unavailable, the server automatically falls back to the next available engine
Restart Claude Desktop after configuration changes
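
The defaults listed in the notes above can be sketched as a small loader. This is a minimal illustration, not the package's actual code: loadVoiceConfig, parseFactor, and the VoiceConfig shape are hypothetical names, and the environment is passed in as a plain object so the sketch stays self-contained.

```typescript
// Hypothetical config loader; names and shapes are illustrative assumptions.
type VoiceChannel = "XTTS2" | "KOKORO";

interface VoiceConfig {
  channel: VoiceChannel;
  kokoroApiUrl: string;
  xtts2ApiUrl?: string;
  xtts2Speed: number;
  xtts2Pitch: number;
}

type Env = Record<string, string | undefined>;

// Parse a positive multiplier, using the fallback when the value is unset or invalid.
function parseFactor(raw: string | undefined, fallback: number): number {
  const value = Number(raw);
  return raw !== undefined && isFinite(value) && value > 0 ? value : fallback;
}

function loadVoiceConfig(env: Env): VoiceConfig {
  // Anything other than "XTTS2" is treated as the default KOKORO channel.
  const channel: VoiceChannel = env.VOICE_CHANNEL === "XTTS2" ? "XTTS2" : "KOKORO";
  if (channel === "XTTS2" && !env.XTTS2_API_URL) {
    throw new Error("XTTS2_API_URL is required when VOICE_CHANNEL=XTTS2");
  }
  return {
    channel,
    kokoroApiUrl: env.KOKORO_API_URL ?? "http://localhost:8880",
    xtts2ApiUrl: env.XTTS2_API_URL,
    xtts2Speed: parseFactor(env.XTTS2_SPEED, 1.0),
    xtts2Pitch: parseFactor(env.XTTS2_PITCH, 1.0),
  };
}
```

In a Node.js process the function would typically be called as loadVoiceConfig(process.env). An empty environment yields the KOKORO channel with http://localhost:8880 and both multipliers at 1.0.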

Verification

# Test installation
npx @sylweriusz/mcp-kokoro-voice

# Should show:
# 🎵 MCP Nexus Voice v5.0 ready
# 🎌 Kokoro TTS: Available (or Unavailable if fallback active)

🎮 Usage

Basic Voice Expression

With XTTS2 (Multi-language):

// English with default settings
say({
  text: "Hello! I'm excited to help you today!",
  language: "en"
})

// Polish with specific voice
say({
  text: "Dzień dobry! Jak się masz?",
  language: "pl",
  voice: "narrator"
})

// Spanish with custom voice
say({
  text: "¡Hola! ¿Cómo estás?",
  language: "es",
  voice: "langusta"
})

With Kokoro (English only):

// Voice synthesis with automatic fallback
// - Kokoro TTS available: Uses bf_isabella (female English)
// - Kokoro unavailable: Falls back to macOS Zoe (Premium)
say("Hello! I'm excited to help you today!")
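
The say(...) calls above are shorthand for an MCP tool invocation. As a rough sketch, the message an MCP client sends is a JSON-RPC tools/call request like the one built below. It assumes the tool is registered under the name "say" and accepts the text, language, and voice arguments shown in the examples; buildSayRequest and the interfaces are hypothetical helpers for illustration.

```typescript
// Arguments as they appear in the usage examples; language and voice are optional.
interface SayArguments {
  text: string;
  language?: string;
  voice?: string;
}

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: SayArguments };
}

// Build the JSON-RPC message an MCP client would send for a say(...) call.
// The tool name "say" is an assumption taken from the examples.
function buildSayRequest(id: number, args: SayArguments): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "say", arguments: args },
  };
}

const exampleRequest = buildSayRequest(1, {
  text: "Dzień dobry! Jak się masz?",
  language: "pl",
  voice: "narrator",
});
```

Serializing exampleRequest with JSON.stringify gives the payload that travels over the MCP transport. Claude Desktop builds this message itself, so the sketch is mainly useful when testing the server from a custom client.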

Voice Quality Tuning (XTTS2)

Speed and pitch are configured globally via environment variables and apply to all synthesis:

XTTS2_SPEED: Controls speaking tempo (how fast the voice talks)
XTTS2_PITCH: Controls voice frequency/octave (how high/low the voice sounds)

Example configurations:

# Natural female voice (recommended starting point)
XTTS2_SPEED=1.04  # Slightly faster for better flow
XTTS2_PITCH=1.10  # Higher pitch for feminine tone

# Natural male voice
XTTS2_SPEED=1.0   # Normal tempo
XTTS2_PITCH=0.95  # Slightly lower for masculine tone

# Fast-paced narration
XTTS2_SPEED=1.2   # 20% faster
XTTS2_PITCH=1.0   # Normal pitch

# Slow, deliberate speech
XTTS2_SPEED=0.85  # 15% slower
XTTS2_PITCH=1.0   # Normal pitch

📝 Text Preprocessing Requirements (CRITICAL)

For optimal synthesis quality, preprocess text according to these guidelines:

1. Language & Translation

On the Kokoro channel (English only), always provide English text
Translate non-English text to natural, conversational English; XTTS2 accepts other languages through the language parameter

2. TTS Optimization (Critical for Quality)

// Expand abbreviations
say("Doctor Smith called Mister Johnson")  // Not: Dr. Smith called Mr. Johnson

// Convert numbers to words
say("one hundred twenty-three")  // Not: 123
say("twenty twenty-four")        // Not: 2024

// Spell out currency
say("fifty dollars")         // Not: $50
say("twenty-five euros")     // Not: €25

// Expand dates
say("January first")         // Not: Jan 1st
say("December twenty-fifth") // Not: 12/25

// Convert times
say("three thirty PM")       // Not: 3:30 PM
say("two PM")               // Not: 14:00

// Spell out symbols
say("and")     // Not: &
say("percent") // Not: %
say("at")      // Not: @

// Handle acronyms with dots for pronunciation
say("N.A.S.A.")  // Not: NASA
say("F.B.I.")    // Not: FBI

3. Speech Flow Optimization

// Use natural punctuation for speech pauses
say("Welcome to our platform. Let's get started with your project.")

// Break long sentences into speakable segments
say("First, we'll analyze the data. Then, we'll generate the report.")

// Ensure text flows naturally when spoken aloud
say("The meeting is scheduled for tomorrow at nine AM in conference room B.")

Example Transformation

// ❌ Poor: Raw text with abbreviations and symbols
"Dr. Smith earned $1,000 on Jan 1st @ 3:30 PM (approx. 50%)"

// ✅ Good: Preprocessed for optimal synthesis
say("Doctor Smith earned one thousand dollars on January first at three thirty PM, approximately fifty percent")
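
Part of this preprocessing can be automated. The sketch below covers only a few abbreviations and symbols from the guidelines; numbers, currency, dates, and times are left untouched and still need separate handling. The rule table and the preprocessForTts name are illustrative assumptions, not part of the package.

```typescript
// A rule pairs a pattern with its spoken replacement.
type Rule = [RegExp, string];

// Illustrative rules only; extend the table for your own content.
const TTS_RULES: Rule[] = [
  // Abbreviations
  [/\bDr\./g, "Doctor"],
  [/\bMr\./g, "Mister"],
  [/\bapprox\./g, "approximately"],
  // Symbols (surrounding whitespace is normalized by the replacement)
  [/\s*&\s*/g, " and "],
  [/\s*@\s*/g, " at "],
  [/\s*%/g, " percent"],
];

function preprocessForTts(text: string): string {
  let result = text;
  for (let i = 0; i < TTS_RULES.length; i++) {
    result = result.replace(TTS_RULES[i][0], TTS_RULES[i][1]);
  }
  // Collapse repeated whitespace left over from the replacements.
  return result.replace(/\s+/g, " ").trim();
}

const spoken = preprocessForTts("Dr. Smith & Mr. Johnson agreed @ 50%");
// spoken === "Doctor Smith and Mister Johnson agreed at 50 percent"
```

Note that the digits in "50" survive unchanged, so a number-to-words step would still be needed before passing the result to say.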

🛠️ Environment Setup

Environment Variables

# Voice Channel Selection (XTTS2 or KOKORO)
export VOICE_CHANNEL=XTTS2  # or KOKORO (default)

# XTTS2 Configuration (when using XTTS2 channel)
export XTTS2_API_URL=http://your-xtts2-server:5002
export XTTS2_MALE_VOICE=patrick_stewart
export XTTS2_FEMALE_VOICE=langusta

# Optional: Adjust speed and pitch for optimal voice quality
export XTTS2_SPEED=1.0   # Speech tempo: 1.0=normal, 1.04=4% faster, 1.2=20% faster
export XTTS2_PITCH=1.0   # Voice pitch: 1.0=normal, 1.05=5% higher, 1.1=10% higher

# Example: If voice sounds too deep/slow, try these values:
# export XTTS2_SPEED=1.04
# export XTTS2_PITCH=1.10

# Kokoro Configuration
export KOKORO_API_URL=http://localhost:8880  # Optional, defaults to this value
export KOKORO_MODEL=mlx-community/Kokoro-82M-8bit  # Optional

Note: The system automatically falls back through available engines: XTTS2 → Kokoro → macOS system voice.

📊 Engine Status

The system shows real-time engine availability at startup:

When using XTTS2:

🎵 MCP Nexus Voice v5.0 ready
🎯 Active Channel: XTTS2
🎤 XTTS2 TTS: Available
🎌 Kokoro TTS (fallback): Available

When using Kokoro (default):

🎵 MCP Nexus Voice v5.0 ready
🎯 Active Channel: KOKORO
🎌 Kokoro TTS: Available
🎤 XTTS2 TTS (alternative): Available

When primary engine is unavailable, synthesis automatically falls back to the next available engine.
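
The fallback behavior can be pictured as walking an ordered list of engines and taking the first one that is reachable. The sketch below is an assumption about the ordering, based on the XTTS2 → Kokoro → macOS chain described in this README; it assumes the KOKORO channel falls back directly to macOS, and that the macOS voice is always available. The function names are hypothetical.

```typescript
type Engine = "XTTS2" | "KOKORO" | "MACOS";

// Preferred order for each configured channel (assumed, see the note above).
function fallbackOrder(channel: "XTTS2" | "KOKORO"): Engine[] {
  return channel === "XTTS2" ? ["XTTS2", "KOKORO", "MACOS"] : ["KOKORO", "MACOS"];
}

// Return the first engine in the fallback order that is currently available.
function pickEngine(channel: "XTTS2" | "KOKORO", available: Engine[]): Engine {
  const order = fallbackOrder(channel);
  for (let i = 0; i < order.length; i++) {
    if (available.indexOf(order[i]) !== -1) {
      return order[i];
    }
  }
  // The macOS `say` command is treated as the last resort that is always present.
  return "MACOS";
}
```

For example, pickEngine("XTTS2", ["KOKORO"]) selects Kokoro when the XTTS2 server is down, and pickEngine("XTTS2", []) ends at the macOS voice.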

🔒 Security Features

Command Injection Protection - Secure execFile usage prevents malicious commands
Input Validation - Text length limits, path traversal prevention
Queue Limits - DoS protection with configurable synthesis and playback limits
Mutex Protection - Race condition prevention in queue operations
Secure File Handling - Temporary file cleanup and path validation
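
To illustrate why execFile helps against command injection, the sketch below validates the text and builds an argument array for the macOS say command. Because each array element reaches the process as a single argument and no shell is involved, characters such as ; or $( ) are spoken rather than executed. The helper names, the 1000-character limit taken from the troubleshooting section, and the rejection of text starting with "-" are assumptions for illustration, not the package's actual checks.

```typescript
const MAX_TEXT_LENGTH = 1000; // limit quoted under "Text Not Playing"

// Reject input that is empty, too long, or could be mistaken for a command-line option.
function validateText(text: string): string {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    throw new Error("Text is empty");
  }
  if (trimmed.length > MAX_TEXT_LENGTH) {
    throw new Error("Text exceeds " + MAX_TEXT_LENGTH + " characters");
  }
  if (trimmed.charAt(0) === "-") {
    throw new Error("Text must not start with '-'");
  }
  return trimmed;
}

// Build the argument vector for `say -v <voice> <text>`.
function buildSayArgs(text: string, voice: string): string[] {
  return ["-v", voice, validateText(text)];
}

// In Node.js the arguments would be passed without a shell, for example:
//   import { execFile } from "node:child_process";
//   execFile("say", buildSayArgs(userText, "Zoe (Premium)"), callback);
```

By contrast, interpolating the text into a string for exec would hand it to a shell, which is the pattern this design avoids.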

🛠️ Troubleshooting

Common Issues

XTTS2 TTS Not Available

If you're using VOICE_CHANNEL=XTTS2 and XTTS2 is unavailable, the system automatically falls back to Kokoro or macOS voice.

To enable XTTS2:

# Check if XTTS2 server is running
curl http://your-xtts2-server:5002/speakers
# Should return: JSON array of available voices

# Check environment variables
echo $VOICE_CHANNEL  # Should be: XTTS2
echo $XTTS2_API_URL  # Should be: http://your-xtts2-server:5002

# Verify server status in logs
tail -f ~/Library/Logs/Claude/mcp-server-speech.log

Kokoro TTS Not Available (Fallback Active)

This is not an error - the system automatically uses macOS system voice.

To enable Kokoro TTS for better quality:

# Check if Kokoro server is running
curl -I http://localhost:8880/tts
# Should return: HTTP/1.1 405 Method Not Allowed (endpoint exists)

# Check environment variable (if set)
echo $KOKORO_API_URL
# Should show: http://localhost:8880 or empty (uses default)

# Verify server status in logs
tail -f ~/Library/Logs/Claude/mcp-server-speech.log

XTTS2 Audio Speed/Pitch Adjustment

If XTTS2 audio sounds too slow or fast, or the pitch is too high or low, adjust XTTS2_SPEED and XTTS2_PITCH:

# In .env or Claude Desktop config (set one value per variable)

# Speed (tempo of speech)
XTTS2_SPEED=1.0   # Normal speed
XTTS2_SPEED=1.04  # 4% faster (subtle)
XTTS2_SPEED=1.2   # 20% faster
XTTS2_SPEED=0.8   # 20% slower

# Pitch (voice frequency/octave)
XTTS2_PITCH=1.0   # Normal pitch
XTTS2_PITCH=1.05  # 5% higher (subtle)
XTTS2_PITCH=1.10  # 10% higher (noticeable)
XTTS2_PITCH=0.9   # 10% lower (deeper voice)

Common Adjustments:

Voice sounds too low/deep: XTTS2_PITCH=1.05 to 1.10
Voice speaks too slow: XTTS2_SPEED=1.04 to 1.20
Voice sounds robotic: Try XTTS2_SPEED=0.95 for more natural tempo

Example Configuration:

{
  "env": {
    "VOICE_CHANNEL": "XTTS2",
    "XTTS2_API_URL": "http://your-server:5002",
    "XTTS2_SPEED": "1.04",
    "XTTS2_PITCH": "1.10"
  }
}

Note: These parameters are applied via ffmpeg post-processing (XTTS2 API doesn't reliably support native speed/pitch adjustment). Requires ffmpeg to be installed.
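
One common way to change pitch and tempo independently with ffmpeg is to chain the asetrate, aresample, and atempo audio filters. The sketch below builds such a filter string; it is an assumption about how the post-processing could work, not the package's actual filter chain, and the 24000 Hz sample rate is a placeholder for whatever rate the synthesized audio uses.

```typescript
// Build an ffmpeg -af filter chain for the given speed and pitch multipliers.
function buildAudioFilter(speed: number, pitch: number, sampleRate: number): string {
  // asetrate reinterprets the samples at a new rate, which shifts pitch
  // and tempo together by the pitch factor.
  const shiftedRate = Math.round(sampleRate * pitch);
  // atempo changes tempo only, so dividing by pitch leaves a net tempo of `speed`.
  const tempo = (speed / pitch).toFixed(4);
  // aresample converts the audio back to the original sample rate.
  return "asetrate=" + shiftedRate + ",aresample=" + sampleRate + ",atempo=" + tempo;
}

// Usage sketch: ffmpeg -i input.wav -af "<filter>" output.wav
const exampleFilter = buildAudioFilter(1.04, 1.1, 24000);
// exampleFilter === "asetrate=26400,aresample=24000,atempo=0.9455"
```

With XTTS2_SPEED=1.04 and XTTS2_PITCH=1.10 the tempo correction comes out below 1.0, because the pitch shift alone already speeds the audio up by 10%.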

Text Not Playing

Ensure text is preprocessed according to guidelines (abbreviations expanded, etc.)
Check text length (max 1000 characters)
Verify audio system is working: afplay /System/Library/Sounds/Glass.aiff

Queue Full Errors

System has protective limits: max 10 synthesis tasks (1 concurrent worker), max 20 playback tasks
Requests exceeding synthesis queue enter waiting queue with 30-second timeout
Sequential processing prevents Kokoro TTS server overload
Wait for current tasks to complete or restart the server
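
The interaction between the synthesis limit and the waiting queue can be sketched as follows. This is a simplified model under stated assumptions: the SynthesisQueue class is hypothetical, timestamps are passed in explicitly to keep the example deterministic, and expired waiters are discarded lazily when a slot frees up, whereas a real implementation would more likely use timers.

```typescript
interface Waiter {
  id: string;
  enqueuedAt: number; // milliseconds
}

class SynthesisQueue {
  private active: string[] = [];
  private waiting: Waiter[] = [];

  constructor(private maxActive = 10, private waitTimeoutMs = 30000) {}

  // Accept the task if the synthesis queue has room, otherwise park it.
  submit(id: string, now: number): "queued" | "waiting" {
    if (this.active.length < this.maxActive) {
      this.active.push(id);
      return "queued";
    }
    this.waiting.push({ id: id, enqueuedAt: now });
    return "waiting";
  }

  // Called when the single worker finishes a task. Promotes waiters into the
  // freed slot and returns the ids of waiters that exceeded the timeout.
  complete(now: number): string[] {
    this.active.shift();
    const expired: string[] = [];
    while (this.waiting.length > 0 && this.active.length < this.maxActive) {
      const next = this.waiting.shift() as Waiter;
      if (now - next.enqueuedAt > this.waitTimeoutMs) {
        expired.push(next.id);
      } else {
        this.active.push(next.id);
      }
    }
    return expired;
  }

  activeCount(): number {
    return this.active.length;
  }
}
```

With the defaults, the eleventh request is parked instead of rejected, and it is dropped if no slot opens within 30 seconds.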

Debug Mode

# Enable MCP debugging
export MCP_TIMEOUT=10000
export PYTHONUNBUFFERED=1

# Run with debug output
claude --mcp-debug

📄 License

MIT License - see LICENSE file for details.

🎵 Express yourself vocally - the local way!