# Voice Node Library
A real-time voice bot library that enables seamless voice-to-voice conversations using speech-to-text (STT), large language models (LLM), and text-to-speech (TTS) technologies.
## Architecture Overview

The library follows a modular, provider-based architecture that allows easy swapping of different AI services:

```
Audio Input → STT Provider → LLM Provider → TTS Provider → Audio Output
     ↓              ↓              ↓              ↓              ↓
 Microphone  →   Deepgram   →  OpenAI GPT  →  OpenAI TTS  →   Speaker
```

## Core Components
### 1. VoiceBot Engine (`src/engine.ts`)
The central orchestrator that:
- Manages the audio processing pipeline
- Coordinates between STT, LLM, and TTS providers
- Handles real-time streaming and performance metrics
- Maintains conversation history
### 2. Provider Interfaces (`src/providers.ts`)
Defines contracts for:
- `STTProvider`: Speech-to-text transcription
- `LLMProvider`: Language model chat completion
- `TTSProvider`: Text-to-speech synthesis
### 3. Provider Implementations

- `DeepgramSTT` (`src/providers/deepgram.ts`): Real-time speech recognition
- `OpenAIChat` (`src/providers/openai-llm.ts`): GPT-based conversation
- `OpenAITTS` (`src/providers/openai-tts.ts`): Neural voice synthesis
### 4. Type System (`src/types.ts`)

Core data structures (a sketch follows the list):

- `TranscriptChunk`: STT output with timing information
- `TokenChunk`: LLM streaming tokens
- `AudioChunk`: PCM audio data with metadata
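As a point of reference, here is a minimal sketch of how these chunk types and the provider contracts from `src/providers.ts` might fit together. The field names and method signatures below are illustrative assumptions, not the library's published API:

```ts
// Sketch only: field names and method signatures are assumptions,
// not the library's actual definitions.
export interface TranscriptChunk {
  text: string;       // transcribed text
  startMs: number;    // utterance start time in milliseconds
  endMs: number;      // utterance end time in milliseconds
  isFinal: boolean;   // false for interim (partial) results
}

export interface TokenChunk {
  token: string;      // a single streamed LLM token
}

export interface AudioChunk {
  pcm: Buffer;        // raw PCM samples
  sampleRate: number; // e.g. 24000 for OpenAI TTS output
  channels: number;   // 1 = mono
}

export interface STTProvider {
  transcribe(audio: AsyncIterable<Buffer>): AsyncIterable<TranscriptChunk>;
}

export interface LLMProvider {
  chat(prompt: string): AsyncIterable<TokenChunk>;
}

export interface TTSProvider {
  synthesize(text: AsyncIterable<string>): AsyncIterable<AudioChunk>;
}
```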
## Getting Started

### Prerequisites
- Node.js 18+ with npm or pnpm
- macOS (for audio playback via `afplay`)
- SoX audio processing library
### Installation

1. Install Node.js dependencies:

   ```bash
   npm install
   # or
   pnpm install
   ```

2. Install SoX (macOS):

   ```bash
   # Using Homebrew
   brew install sox

   # Verify installation
   sox --version
   ```

3. Install SoX (other platforms):

   ```bash
   # Ubuntu/Debian
   sudo apt-get install sox

   # Windows (using Chocolatey)
   choco install sox
   ```

### Environment Variables
Create a `.env` file:

```
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=your_openai_key
```

## Usage
```ts
import { VoiceBot } from "./src/engine";
import { DeepgramSTT } from "./src/providers/deepgram";
import { OpenAIChat } from "./src/providers/openai-llm";
import { OpenAITTS } from "./src/providers/openai-tts";

const bot = new VoiceBot(
  new DeepgramSTT(),
  new OpenAIChat(),
  new OpenAITTS()
);

bot.on("sttChunk", (text) => console.log("Transcribed:", text));
bot.on("llmToken", (token) => process.stdout.write(token));
bot.on("audioChunk", (chunk) => {
  // Handle synthesized audio here, e.g. queue it for playback.
});

await bot.run(audioInputStream);
```

## Running the Voice Bot
**Real-time voice chat (microphone input):**

```bash
npm run chat
# or
npm start
```

This starts a real-time voice conversation using your microphone. Speak naturally and the AI will respond with voice.
**Process WAV files:**

```bash
npm run wav <path-to-wav-file>
# Example:
npm run wav ./audio/test.wav
```

This processes pre-recorded WAV files: the library transcribes them, generates AI responses, and plays back the audio response.
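If you want to drive WAV processing from code instead of the npm script, something like the sketch below should work, assuming `bot.run` accepts any readable audio stream (an assumption based on the usage example above, not documented behavior):

```ts
import fs from "node:fs";
import { VoiceBot } from "./src/engine";
import { DeepgramSTT } from "./src/providers/deepgram";
import { OpenAIChat } from "./src/providers/openai-llm";
import { OpenAITTS } from "./src/providers/openai-tts";

// Assumption: bot.run accepts any Readable audio stream,
// not only a live microphone stream.
const bot = new VoiceBot(new DeepgramSTT(), new OpenAIChat(), new OpenAITTS());
await bot.run(fs.createReadStream("./audio/test.wav"));
```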
## Audio Pipeline Details

### Input Processing
- Audio Capture: 48kHz, 16-bit, mono PCM from microphone
- Real-time Streaming: Chunks sent to Deepgram for live transcription
- Interim Results: Accumulates partial transcriptions for longer inputs
### Output Processing
- Token Streaming: LLM tokens streamed in real-time to TTS
- Text Cleaning: Markdown and formatting removed for natural speech (see the sketch after this list)
- Audio Synthesis: 24kHz PCM output from OpenAI TTS
- Playback: Sequential chunk playback via system audio
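To illustrate the text-cleaning step, a minimal markdown-stripping pass might look like the following; this is a sketch only, and the library's actual cleaning rules may be more thorough:

```ts
// Minimal sketch of markdown stripping for TTS input; illustrative
// only, the library's real cleaning rules may differ.
function cleanForSpeech(text: string): string {
  return text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // keep link text, drop URL
    .replace(/[*_`#>~]+/g, "")               // drop markdown markers
    .replace(/\s+/g, " ")                    // collapse whitespace and newlines
    .trim();
}

// cleanForSpeech("**Bold** and [a link](https://example.com)")
// => "Bold and a link"
```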
## Performance Metrics & Monitoring

The system tracks three core latency metrics (a timing sketch follows the list):

- STT Latency (`stt_duration_ms`): Speech-to-text processing time
- LLM Latency (`llm_duration_ms`): Complete LLM response generation time
- TTS Latency (`tts_duration_ms`): Text-to-speech processing time
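Each metric is a wall-clock span around one pipeline stage. A hedged sketch of the measurement pattern (the library's real instrumentation exports these values via Prometheus, as described in the next section):

```ts
// Sketch: timing one pipeline stage. The library's actual metrics are
// collected and exported via Prometheus (see the dashboard section).
async function timed<T>(label: string, work: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await work();
  } finally {
    console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms`);
  }
}
```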
### Built-in Performance Dashboard

Get real-time metrics with zero setup:

```bash
# Start your voice bot (dashboard included)
npm start

# Open the dashboard
open http://localhost:9464/dashboard
```

Access your metrics at these endpoints (a programmatic check follows the list):

- Performance dashboard: http://localhost:9464/dashboard (HTML interface)
- Raw Prometheus metrics: http://localhost:9464/metrics
- Health check: http://localhost:9464/health
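Because the library targets Node.js 18+, the global `fetch` is available for checking these endpoints programmatically; for example:

```ts
// Quick programmatic check against the built-in metrics server.
const res = await fetch("http://localhost:9464/health");
console.log(res.ok ? "voice bot healthy" : `unhealthy: HTTP ${res.status}`);
```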
The dashboard automatically shows:
- Real-time metric cards with averages and totals
- Color-coded performance indicators (🟢🟡🔴)
- Auto-refresh every 5 seconds
- No external dependencies required
See PROMETHEUS_SETUP.md for detailed dashboard features.
## Configuration Options

### Deepgram STT

```ts
new DeepgramSTT("nova-3") // Model selection
```

- Model: `nova-3` (default), `nova-2`, `whisper`
- Features: Smart formatting, punctuation, VAD events
- Endpointing: 2000 ms for longer prompts
### OpenAI Chat

```ts
new OpenAIChat("gpt-4o-mini") // Model selection
```

- Model: `gpt-4o-mini` (default), `gpt-4`, `gpt-3.5-turbo`
- Temperature: 0.7 for balanced creativity
- Max tokens: 1000 per response
- Speech-friendly responses: automatically generates conversational text optimized for TTS
### Intelligent Speech Optimization

The system uses a dedicated system prompt that instructs the LLM to generate speech-friendly responses (a sketch follows the list):
- No markdown formatting or visual elements
- Natural conversational flow and transitions
- Spoken numbers and symbols ("percent" not "%")
- Bullet points converted to "First,", "Second,", "Another point is"
- Designed for listening, not reading
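As a rough illustration, a system prompt enforcing these rules could look like the sketch below. The library's actual prompt text is not published here, so treat this as an assumption:

```ts
// Illustrative only: the library's real system prompt may differ.
const SPEECH_SYSTEM_PROMPT = `
You are a voice assistant. Your replies are spoken aloud, not read.
Never use markdown, bullets, or other visual formatting.
Say numbers and symbols in words ("percent", not "%").
Replace list structure with spoken transitions such as "First," and
"Another point is".
Keep sentences short and conversational.
`.trim();
```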
### OpenAI TTS

```ts
new OpenAITTS("tts-1")
```

- Model: `tts-1` (default), `tts-1-hd` (higher quality)
- Voice: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
- Format: PCM for low latency
- Streaming: Real-time chunk-based synthesis
## Error Handling

The library implements robust error handling:

- Connection failures are automatically retried (see the sketch after this list)
- Audio processing errors don't interrupt the pipeline
- Graceful degradation when services are unavailable
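A common shape for the connection-retry behavior is exponential backoff. A minimal sketch follows; the attempt count and delays are illustrative assumptions, not the library's defaults:

```ts
// Sketch of retry with exponential backoff. Attempt count and base
// delay are assumptions, not the library's actual defaults.
async function withRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Exponential backoff: 250 ms, 500 ms, 1000 ms, ...
        await new Promise((resolve) => setTimeout(resolve, 250 * 2 ** i));
      }
    }
  }
  throw lastError;
}
```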
## System Requirements

### Audio Dependencies

- macOS: Uses `afplay` for audio playback
- SoX: Required for audio format conversion
- Microphone: Any USB or built-in microphone
### API Dependencies
- Deepgram: Real-time STT service
- OpenAI: GPT models and TTS service
## Troubleshooting

### Environment Variables

```bash
# Error: Missing required environment variables
# Solution: Create a .env file with your API keys
echo "DEEPGRAM_API_KEY=your_key_here" > .env
echo "OPENAI_API_KEY=your_key_here" >> .env
```

### Audio Format Issues

- Unsupported formats: Convert to WAV using `ffmpeg -i input.mp3 output.wav`
- Sample rate mismatch: The system auto-detects, but 48kHz is recommended
- Stereo to mono: `sox input.wav -c 1 output.wav`
## Custom Voice Selection

```ts
const voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];
const tts = new OpenAITTS("tts-1-hd", { voice: "nova" });
```

## Performance Tuning
```ts
// Lower latency: shorter endpointing window, shorter responses
const deepgramFast = new DeepgramSTT("nova-3", { endpointing: 1000 });
const openaiFast = new OpenAIChat("gpt-4o-mini", { max_tokens: 500 });

// Higher quality: longer endpointing window, longer responses
const deepgramHQ = new DeepgramSTT("nova-3", { endpointing: 3000 });
const openaiHQ = new OpenAIChat("gpt-4", { max_tokens: 1500 });
```

## Audio Format Customization
```ts
// Microphone settings
const micConfig = {
  rate: "48000",
  channels: "1",
  encoding: "signed-integer"
};

// Speaker settings
const speakerConfig = {
  sampleRate: 24000,
  channels: 1,
  bitDepth: 16
};
```

## Development
### Scripts

- `npm start` / `npm run chat`: Real-time voice chat with microphone
- `npm run wav <file>`: Process WAV audio files
- `npm run dev`: Development mode with auto-reload
- `npm run build`: Compile TypeScript to JavaScript
### Project Structure

```
src/
├── engine.ts          # Core VoiceBot orchestrator
├── index.ts           # Main entry point
├── providers.ts       # Provider interfaces
├── types.ts           # Core type definitions
├── utils.ts           # Utility functions
├── metrics.ts         # Performance monitoring
├── system-audio.ts    # Audio playback utilities
└── providers/
    ├── deepgram.ts    # Deepgram STT implementation
    ├── openai-llm.ts  # OpenAI Chat implementation
    └── openai-tts.ts  # OpenAI TTS implementation
```