
audio-transcription-mcp

v0.7.1

Published

MCP server for real-time audio transcription using OpenAI Whisper


Audio Transcription MCP Server

Real-time audio transcription using OpenAI Whisper. Capture and transcribe system audio (meetings, videos, music) automatically with AI assistance through Cursor or Claude Desktop.

✨ Features

  • 🎤 Real-time transcription - Captures and transcribes audio as it plays
  • 🔄 Zero installation - Use with npx, no global install needed
  • 🤖 AI-powered - Uses OpenAI's Whisper API for accurate transcription
  • 📝 Timestamped transcripts - Every entry is timestamped in markdown format
  • 🔒 Session isolation - Each session gets its own unique transcript file
  • 🔇 Smart silence detection - Automatically pauses when no audio detected
  • 🎯 Automated setup - One command sets up audio routing
  • 🧪 Built-in testing - Verify your setup before starting

🚀 Quick Start (5 Minutes)

Step 1: Run Automated Setup

The setup script installs everything you need and guides you through configuration:

npx audio-transcription-mcp setup

What this does:

  • ✅ Installs Homebrew (if needed)
  • ✅ Installs ffmpeg for audio processing
  • ✅ Installs BlackHole virtual audio driver
  • ✅ Guides you through creating a Multi-Output Device (or does it automatically!)
  • ✅ Takes 5 minutes, mostly automated

First time? The script will walk you through everything with clear instructions. Don't worry if it asks for your Mac password - that's normal for installing software!

Step 2: Test Your Setup

Verify everything works before using it:

npx audio-transcription-mcp test

This captures 5 seconds of audio and shows you if it's working correctly.

Step 3: Configure Your AI Assistant

Add to your Cursor or Claude Desktop config:

For Cursor, edit ~/.cursor/config.json:

{
  "mcpServers": {
    "audio-transcription": {
      "command": "npx",
      "args": ["-y", "audio-transcription-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here",
        "INPUT_DEVICE_NAME": "BlackHole"
      }
    }
  }
}

Then restart Cursor and ask:

"Start transcribing audio"

For Claude Desktop, edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "audio-transcription": {
      "command": "npx",
      "args": ["-y", "audio-transcription-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here",
        "INPUT_DEVICE_NAME": "BlackHole",
        "OUTFILE_DIR": "/Users/yourname/Documents/Transcripts"
      },
      "allowedDirectories": [
        "/Users/yourname/Documents/Transcripts"
      ]
    }
  }
}

Important:

  1. Create the directory: mkdir -p ~/Documents/Transcripts
  2. Replace yourname with your actual username
  3. Restart Claude Desktop

Then ask:

"Start transcribing audio"

Step 4: Set System Output

Go to System Settings > Sound > Output and select "Multi-Output Device".

This routes audio to both your speakers (so you can hear) and BlackHole (for transcription).

Step 5: Start Transcribing!

In Cursor or Claude Desktop, just ask:

"Start transcribing audio"

Your AI assistant will start capturing and transcribing audio in real-time!


📖 What You Need

  • A Mac - BlackHole and the setup script are macOS-only
  • Node.js - so you can run the npx commands
  • An OpenAI API key - for the Whisper transcription calls
  • About 5 minutes for the automated setup

🎯 Use Cases

  • Meeting transcription - Zoom, Google Meet, Teams calls
  • Content creation - Transcribe videos, podcasts, or music
  • Accessibility - Real-time captions for any audio
  • Note-taking - Automatic transcripts of lectures or presentations
  • Research - Transcribe interviews or focus groups

🔧 Troubleshooting

Audio Not Being Captured

Problem: Test shows silent or very low audio levels

Solution:

  1. Check System Settings > Sound > Output is set to "Multi-Output Device"
  2. Open Audio MIDI Setup and verify both outputs are checked:
    • ☑ Built-in Output
    • ☑ BlackHole 2ch
  3. Play some audio and run npx audio-transcription-mcp test again

BlackHole Not Showing Up

Problem: BlackHole doesn't appear in device list

Solution: Restart your Mac. Audio drivers require a restart to be recognized by the system.

Setup Script Fails

Problem: Automated setup doesn't work

Solution: The script will fall back to manual mode with clear instructions. This is normal on first run if accessibility permissions aren't granted. Just follow the 4-step guide shown.

Want to Start Over?

If you need to remove everything and start fresh:

# Uninstall BlackHole and ffmpeg
brew uninstall blackhole-2ch ffmpeg

# Delete Multi-Output Device
# 1. Open Audio MIDI Setup
# 2. Select "Multi-Output Device" in left sidebar
# 3. Press Delete key

# Then run setup again
npx audio-transcription-mcp setup


🛠️ Advanced Usage

Standalone CLI Mode

You can use this as a standalone CLI without MCP:

# Start transcription (saves to meeting_transcript.md)
npx audio-transcription-mcp start

# Press Ctrl+C to stop

Configure via .env file:

OPENAI_API_KEY=sk-your-key-here
INPUT_DEVICE_NAME=BlackHole
CHUNK_SECONDS=8
OUTFILE=meeting_transcript.md

MCP Server Tools

When used with Cursor or Claude Desktop, these tools are available:

  • start_transcription - Start capturing and transcribing audio
  • pause_transcription - Pause transcription temporarily
  • resume_transcription - Resume after pause
  • stop_transcription - Stop and get session stats
  • get_status - Check if transcription is running
  • get_transcript - Retrieve current transcript content
  • clear_transcript - Clear and start fresh
  • cleanup_transcript - Delete transcript file

Configuration Options

Environment variables you can customize:

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | Your OpenAI API key |
| INPUT_DEVICE_NAME | BlackHole | Audio input device name |
| CHUNK_SECONDS | 8 | Seconds of audio per chunk |
| MODEL | whisper-1 | OpenAI Whisper model |
| OUTFILE_DIR | process.cwd() | Output directory for transcripts |
| SAMPLE_RATE | 16000 | Audio sample rate (Hz) |
| CHANNELS | 1 | Number of audio channels |

🏗️ How It Works

  1. Audio Routing: Multi-Output Device sends system audio to both your speakers and BlackHole
  2. Capture: ffmpeg captures audio from BlackHole in 8-second chunks
  3. Processing: Audio is converted to WAV format suitable for Whisper API
  4. Transcription: Each chunk is sent to OpenAI Whisper for transcription
  5. Output: Timestamped text is appended to a markdown file in real-time
  6. Silence Detection: Automatically pauses after 32 seconds of silence to save API costs
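The per-chunk flow above can be sketched in TypeScript. This is a simplified illustration, not the package's actual source: the injected `transcribe` callback, the `rmsLevel` helper, and the 0.01 silence threshold are all assumptions standing in for the real Whisper call and level analysis.

```typescript
// Simplified sketch of one pass through the pipeline:
// silence check -> Whisper transcription -> timestamped markdown line.

type Transcribe = (chunk: Int16Array) => Promise<string>;

// Root-mean-square level of a 16-bit PCM chunk, normalized to 0..1.
function rmsLevel(chunk: Int16Array): number {
  let sum = 0;
  for (const s of chunk) sum += (s / 32768) ** 2;
  return Math.sqrt(sum / chunk.length);
}

const SILENCE_THRESHOLD = 0.01; // assumed value, for illustration only

// Returns the markdown line to append, or null when the chunk is silent
// (silent chunks never reach the API, so they cost nothing).
async function processChunk(
  chunk: Int16Array,
  transcribe: Transcribe
): Promise<string | null> {
  if (rmsLevel(chunk) < SILENCE_THRESHOLD) return null;
  const text = await transcribe(chunk);
  return `**[${new Date().toISOString()}]** ${text}`;
}
```

In the real server the `transcribe` callback would wrap the OpenAI Whisper API; injecting it keeps the pipeline testable without network access.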

💰 Costs & Performance

What You're Paying For

You ONLY pay for OpenAI Whisper API calls - everything else runs locally for free!

FREE (runs locally on your machine):

  • Audio capture with ffmpeg
  • Audio processing and buffer management
  • Silence detection and level analysis
  • File operations (writing/reading transcripts)
  • All MCP server operations

💰 PAID (OpenAI API):

  • Only the transcription API calls to OpenAI Whisper
  • $0.006 per minute of audio transcribed
  • Silent chunks are automatically skipped to save money

Actual Costs

With default 8-second chunks:

| Duration | API Calls | Approximate Cost |
|----------|-----------|------------------|
| 1 minute | ~7.5 chunks | $0.006 |
| 1 hour | ~450 chunks | $0.36 |
| 8-hour workday | ~3,600 chunks | $2.88 |

Cost per chunk: ~$0.0008 (less than a tenth of a cent!)
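The table follows directly from Whisper's per-minute price. A tiny helper (hypothetical, not part of the package) reproduces the math, assuming every chunk contains speech:

```typescript
// Whisper pricing: $0.006 per minute of audio sent to the API.
const PRICE_PER_MINUTE = 0.006;

// Number of chunks and approximate dollar cost for a session.
// In practice silent chunks are skipped, so real costs are lower.
function sessionCost(durationSeconds: number, chunkSeconds = 8) {
  return {
    chunks: durationSeconds / chunkSeconds,
    dollars: (durationSeconds / 60) * PRICE_PER_MINUTE,
  };
}

// sessionCost(3600) -> 450 chunks, about $0.36 for one hour
```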

Built-in Cost Savings

The tool includes smart silence detection that saves you money:

  • 🔇 Silent audio chunks are NEVER sent to OpenAI
  • 💰 Automatically tracks cost savings in the debug log
  • ⏸️ Auto-pauses after 32 seconds of silence
  • 📊 View statistics with get_status to see chunks skipped

Example: In a 1-hour meeting with 15 minutes of silence, you save ~$0.09 automatically!
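The auto-pause rule (32 seconds of silence = 4 consecutive 8-second silent chunks) can be sketched as a small counter. This illustrates the behavior only; the class name, default threshold, and structure are assumptions, not the package's internals.

```typescript
// Pause transcription once `maxSilentChunks` silent chunks arrive in a row
// (4 x 8 s = the 32 s threshold described above); any audible chunk resets it.
class SilenceGate {
  private silentStreak = 0;

  constructor(
    private maxSilentChunks = 4,
    private threshold = 0.01 // assumed RMS silence threshold
  ) {}

  // Feed one chunk's normalized RMS level; returns true when it is time to pause.
  observe(level: number): boolean {
    this.silentStreak = level < this.threshold ? this.silentStreak + 1 : 0;
    return this.silentStreak >= this.maxSilentChunks;
  }
}
```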

Performance

  • Memory usage: 50-100 MB per session
  • CPU usage: Minimal (ffmpeg handles audio processing)
  • API latency: 1-3 seconds per chunk
  • Accuracy: 90-95% for clear speech
  • Network: Only during transcription API calls

Cost Optimization Tips

  1. Increase chunk size - Fewer API calls (set CHUNK_SECONDS=15)
  2. Use silence detection - Enabled by default, saves money automatically
  3. Pause when not needed - Use pause_transcription during breaks
  4. Monitor usage - Check OpenAI dashboard for actual costs

Bottom line: Transcription is cheap (~36¢/hour), runs mostly locally, and automatically saves money by skipping silence. You're only charged when actual speech is being transcribed.

🧪 Development & Testing

For contributors and developers:

For MCP Usage (Cursor & Claude Desktop)

📖 See MCP_SETUP.md for complete setup instructions

Just add the npx configuration shown at the top of this README to your config and restart - that's it!

For Standalone CLI (Local Development)

📖 See GETTING_STARTED.md for complete setup instructions

# Install dependencies
npm install
npm run build

# Configure environment
cp env.example .env  # Then add your OpenAI API key

# Run standalone CLI
npm start

📄 License & Contributing

This project is licensed under the MIT License - see the LICENSE file for details.

Contributions are welcome! Please feel free to submit a Pull Request.

Development Resources


Made with ❤️ for transcribing meetings, content, and conversations.

Star ⭐ this repo if you find it useful!