Fonada Voice Assistant
A complete voice assistant pipeline integrating:
- Custom ASR (Automatic Speech Recognition)
- Custom Turn detection with ReplyOnPause handler
- LLM for conversational responses
- Custom Fonada TTS for high-quality voice synthesis
Table of Contents
- Documentation
- Prerequisites
- Setup
- Running the Voice Assistant
- Usage
- Customization
- Integration with FastAPI
- Troubleshooting
- License
- Voice Assistant Monitoring
Documentation
Detailed documentation is available in the docs/ folder:
- 📚 Documentation Index - Complete overview of all available documentation
- 🔧 Dynamic Model Configuration - Guide to managing LLM models dynamically via models.json
- 🤖 Google Gemini Integration - Complete guide to using Google Gemini models with tool calling
- 📊 Voice Assistant Metrics Testing - Comprehensive testing framework for evaluating performance across models and languages
- 📞 Telephony Integration - Asterisk AudioSocket integration with chunk size optimization
- 🔊 RNNoise Audio Processing - Setup and configuration for RNNoise noise reduction
- 🔄 RequestQueueManager - Centralized queue and resource management system for handling request lifecycles and interrupts
- 🎤 AudioRecorder System - Conversation audio recording with speaker identification and WAV export functionality
- ⏰ Scheduler API Integration - LLM-driven tool scheduling with external scheduler API for intelligent timing of tool execution
Prerequisites
- Python 3.8+
- 4 CUDA-capable GPUs
- 50 GB+ disk space
- Microphone and speakers
Setup
Building lmdeploy from source (Optional)
If you want to build lmdeploy from source instead of using the pre-built version:
pip install pybind11
sudo apt install cmake openmpi-bin libopenmpi-dev ninja-build
cd ~/lmdeploy
# Manual CMake build
mkdir -p build
# 120a-real is for RTX 5090
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="120a-real" \
-Dpybind11_DIR=$(python3 -m pybind11 --cmakedir) \
-GNinja
# Build with verbose output
cd build && ninja
# Install the extension to final position and set RPATH
ninja install
# If successful, install the Python package
cd ..
pip install -e .

Installation
- Install the required dependencies (install NeMo from GitHub):
pip install -r requirements.txt
- Run the LLM server:
lmdeploy serve api_server hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --server-port 23333 --quant-policy 4

or
export CUDA_VISIBLE_DEVICES=2
lmdeploy serve api_server sarvamai/sarvam-m \
--server-port 8000 \
--tp 1 \
--backend turbomind \
--quant-policy 4 \
--cache-max-entry-count 0.9
- Run the TTS server from the models/ folder:
export CUDA_VISIBLE_DEVICES=1
lmdeploy serve api_server tts_hindi --server-port 23334 --quant-policy 4

Alternative: Docker deployment for RTX 5090
First, pull the Docker image:
docker pull lmsysorg/sglang:blackwell

Then run the TTS server:
docker run --gpus all -d --restart unless-stopped \
-p 23334:23334 \
--name sglang_blackwell \
-v /home/fonada/voice_assistant/models/tts_hindi:/model \
lmsysorg/sglang:blackwell \
python3 -m sglang.launch_server --model-path /model --host 0.0.0.0 --port 23334

Running the Voice Assistant
Run the assistant with:
export LD_LIBRARY_PATH=/workspace/TensorRT-10.10.0.31/lib:$LD_LIBRARY_PATH
export OPENAI_API_ASR_KEY=
export SARVAM_API_KEY=
export DEEPGRAM_API_KEY=
export OPENAI_API_LLM_KEY=
export GROQ_API_LLM_KEY=
export GEMINI_API_KEY=
python app.py

Adjust the TensorRT path and the API keys above to match your environment. This will start a web server and open a browser interface where you can interact with the voice assistant.
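If the assistant cannot reach the LLM server, you can sanity-check the server directly. A minimal sketch, assuming the requests package is installed and the port from the first lmdeploy command above (lmdeploy's api_server exposes an OpenAI-compatible REST API):
import requests

# List the models served by the OpenAI-compatible endpoint
resp = requests.get("http://localhost:23333/v1/models")
print(resp.status_code, resp.json())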
Usage
- Click the microphone button to start speaking
- The assistant will automatically detect when you've finished speaking
- It will transcribe your speech, generate a response with the configured LLM (for example, Llama 3.1), and speak the response using Fonada TTS
- You can interrupt the assistant by speaking while it's responding
Customization
Voice Selection
To change the voice used by Fonada TTS, modify the options dictionary in the text_to_speech_sync method:
options = {"voice_id": "Ananya"}  # Change to your preferred voice

Available voices: "Rahul", "Vikram", "Arjun", "Dev", "Sanjay", "Jaya", "Meera", "Priya", "Ananya", "Divya"
System Prompt
To change how the LLM responds, customize the system prompt when initializing the VoiceAssistant:
assistant = VoiceAssistant(
llm_model_path=llm_model_path,
tts_model_path=tts_model_path,
system_prompt="You are a helpful voice assistant. Keep your responses short and friendly."
)

Turn Detection Sensitivity
Adjust the turn detection parameters in the create_voice_assistant_stream() function to change how the assistant detects when you've finished speaking:
algo_options=AlgoOptions(
audio_chunk_duration=0.5, # Duration of audio chunks
started_talking_threshold=0.2, # Threshold to detect start of speech
speech_threshold=0.1 # General speech detection threshold
)
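A sketch of how these options can plug into a ReplyOnPause handler, assuming a fastrtc-style API (the handler body and wiring here are illustrative, not the project's exact code):
from fastrtc import AlgoOptions, ReplyOnPause, Stream

def respond(audio):
    # Transcribe the utterance, query the LLM, synthesize speech, yield audio
    ...

stream = Stream(
    handler=ReplyOnPause(
        respond,
        algo_options=AlgoOptions(
            audio_chunk_duration=0.5,
            started_talking_threshold=0.2,
            speech_threshold=0.1,
        ),
    ),
    modality="audio",
    mode="send-receive",
)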
Integration with FastAPI

To integrate the voice assistant with a FastAPI app:
from fastapi import FastAPI
from voice_assistant.app import create_voice_assistant_stream
app = FastAPI()
stream = create_voice_assistant_stream()
stream.mount(app)
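You can then serve the mounted app with uvicorn. A minimal sketch (the host and port are arbitrary choices, not project defaults):
import uvicorn

# Serve the FastAPI app with the voice assistant stream mounted on it
uvicorn.run(app, host="0.0.0.0", port=8000)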
Troubleshooting

Issue: Models fail to load
Solution: Verify the paths to your model files and ensure they are accessible.

Issue: Speech recognition is inaccurate
Solution: Speak clearly and ensure your microphone is properly configured.

Issue: High latency in responses
Solution: Consider using a more powerful GPU or reducing the model parameters.
Issue: High latency in WebSocket audio processing (1-2+ second delays)
Solution: Audio chunk size optimization, described below.
The voice assistant uses VAD (Voice Activity Detection) that requires specific chunk sizes for optimal performance. Mismatched chunk sizes between client and server cause significant accumulation delays.
Root Cause:
- Server VAD requires chunks of VAD_CHUNK_SIZE_SEC * REQUIRED_SAMPLE_RATE samples
- Default: 0.64 * 16000 = 10,240 samples (640ms chunks)
- If the client sends smaller chunks (e.g., 2048 samples = 128ms), the server must accumulate 5+ chunks before processing
- This causes up to 640ms of delay per processing stage (see the arithmetic check below)
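A quick check of the accumulation math (a sketch; the constant names mirror the config keys used below):
REQUIRED_SAMPLE_RATE = 16_000  # Hz
VAD_CHUNK_SIZE_SEC = 0.64      # server-side VAD window

vad_chunk_samples = int(VAD_CHUNK_SIZE_SEC * REQUIRED_SAMPLE_RATE)  # 10,240

client_chunk_samples = 2048    # 128ms client buffers
chunks_needed = -(-vad_chunk_samples // client_chunk_samples)       # ceiling division -> 5
delay_ms = chunks_needed * client_chunk_samples / REQUIRED_SAMPLE_RATE * 1000
print(vad_chunk_samples, chunks_needed, delay_ms)                   # 10240 5 640.0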
Client-Side Optimization:
// ❌ Small chunks cause accumulation delays
const bufferSize = 2048; // 128ms chunks → high latency
// ❌ Invalid: Not a power of 2 (Web Audio API requirement)
// const bufferSize = 10240;
// ✅ Valid power of 2, significant latency reduction
const bufferSize = 8192; // 512ms chunks → low latency
// ✅ Alternative: Eliminates all accumulation delays
// const bufferSize = 16384; // 1024ms chunks → minimal latency

Note: The Web Audio API requires buffer sizes to be powers of 2 between 256 and 16384.
WebSocket Transmission Optimization:
# Send larger chunks aligned with VAD processing
chunk_size = 8192  # Valid power of 2, matches client buffer size (512ms at 16kHz)

Configuration Tuning:
# config.yaml - Optimize for 8192-sample client buffers
VAD_CHUNK_SIZE_SEC: 0.512 # 8192 samples at 16kHz (matches client buffers)
NUM_CONSECUTIVE_NON_SPEECH_CHUNKS_TO_END_SEGMENT: 1 # Improve segment detection
# Alternative: Even lower latency with smaller chunks
# VAD_CHUNK_SIZE_SEC: 0.32 # 5120 samples (requires 1.6 client chunks)

Expected Impact:
- Before optimization: 2+ seconds end-to-end latency
- After optimization: ~1 second end-to-end latency
- Latency reduction: Up to 1 second improvement
Performance Testing: Use the included concurrency test to measure improvements:
python test/test_concurrency.py --max_concurrent 10 --direct

Note: For telephony-specific optimizations and Asterisk AudioSocket integration, see the 📞 Telephony Integration guide.
Sharing Conversation Recordings Across Machines
The conversation_recordings folder can be shared across different machines using NFS (Network File System), which is the recommended approach for Linux environments.
NFS Setup
On the Source Server (Sharing the folder):
- Install NFS server:
# Ubuntu/Debian
sudo apt install nfs-kernel-server
# CentOS/RHEL
sudo yum install nfs-utils
- Create and configure the shared directory:
# Navigate to your voice assistant directory
cd /home/fonada/voice_assistant
# Set proper permissions for the conversation_recordings folder
sudo chown nobody:nogroup conversation_recordings
sudo chmod 755 conversation_recordings
- Configure NFS exports:
# Edit the exports file
sudo nano /etc/exports
# Add this line (replace 192.168.1.100 with your target server's IP):
/home/fonada/voice_assistant/conversation_recordings 192.168.1.100(rw,sync,no_subtree_check,no_root_squash)
- Apply changes and restart NFS:
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server

On the Target Server (Mounting the folder):
- Install NFS client:
# Ubuntu/Debian
sudo apt install nfs-common
# CentOS/RHEL
sudo yum install nfs-utils
- Create mount point and mount the folder:
# Create a local mount point
sudo mkdir -p /home/fonada/voice_assistant/conversation_recordings
# Mount the remote folder (replace 192.168.1.50 with source server's IP)
sudo mount -t nfs 192.168.1.50:/home/fonada/voice_assistant/conversation_recordings /home/fonada/voice_assistant/conversation_recordings
- For permanent mounting, add to /etc/fstab:
echo "192.168.1.50:/home/fonada/voice_assistant/conversation_recordings /home/fonada/voice_assistant/conversation_recordings nfs defaults 0 0" | sudo tee -a /etc/fstabUsage Notes
- Replace the IP addresses (192.168.1.50, 192.168.1.100) with your actual server IPs
- The conversation recordings will be automatically shared and synchronized across all mounted machines
- Ensure proper firewall configuration to allow NFS traffic (port 2049)
- For multiple target servers, add additional lines to /etc/exports on the source server
Troubleshooting NFS Permission Issues
If you encounter permission errors when trying to save conversation recordings on the target machine:
Error Example:
PermissionError: [Errno 13] Permission denied: 'conversation_recordings/...'

Root Cause:
NFS preserves original user IDs (UIDs) from the source server. If the fonada user has different UIDs on source and target machines, permission conflicts occur.
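A quick way to confirm a UID mismatch from the target machine (a sketch using only the standard library):
import os
import pwd

st = os.stat("/home/fonada/voice_assistant/conversation_recordings")
print("directory owner UID:", st.st_uid)
print("local fonada UID:", pwd.getpwnam("fonada").pw_uid)
# If the two UIDs differ, writes over NFS will fail with EACCES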
Solution 1: Configure NFS Export with UID Mapping (Recommended for Multiple Machines)
# On the source server, edit /etc/exports:
sudo nano /etc/exports
# Update the export line to include all_squash and UID mapping:
/home/fonada/voice_assistant/conversation_recordings TARGET_IP(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)
# Apply changes:
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server

Solution 2: Fix Ownership on Source Server (Single Target Machine Only)
# WARNING: This approach only works if all machines have the same UID for fonada user
# On the source server, change ownership to match target machine's fonada user UID
# First check the target machine's fonada UID: id fonada
# Then on source server:
sudo chown -R TARGET_UID:TARGET_UID /home/fonada/voice_assistant/conversation_recordings
# Example: If target machine fonada user is UID 1000:
sudo chown -R 1000:1000 /home/fonada/voice_assistant/conversation_recordings

Note: Solution 2 will break access for other machines with different UIDs. Use Solution 1 for multiple machines.
Verification: After applying either solution, test write permissions:
# On target machine:
cd /home/fonada/voice_assistant/conversation_recordings
mkdir test_write_permission
# Should succeed without permission errors

License
This project uses the same license as the Fonada TTS system.
Voice Assistant Monitoring
This document describes how to set up monitoring for the Voice Assistant application. There are two options available:
Option 1: Streamlit Dashboard (Lightweight)
A lightweight, real-time monitoring dashboard built with Streamlit.
Installation
- Install required packages:
pip install streamlit pandas plotly
- Run the monitoring dashboard:
streamlit run monitor.py

The dashboard will be available at http://localhost:8501 and includes:
- Real-time log viewing
- Request timeline visualization
- Log level distribution
- Filtering by request ID and log level
- Auto-refresh functionality
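A minimal sketch of such a dashboard (illustrative only; the log file name and format are assumptions, not the project's monitor.py):
import streamlit as st

LOG_FILE = "voice_assistant.log"  # assumed log file name

st.title("Voice Assistant Logs")
level = st.selectbox("Log level", ["ALL", "INFO", "WARNING", "ERROR"])

with open(LOG_FILE, encoding="utf-8") as f:
    lines = f.readlines()
if level != "ALL":
    lines = [line for line in lines if level in line]
st.text("".join(lines[-200:]))  # show the most recent entries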
Option 2: Graylog (Enterprise-grade)
A more comprehensive logging and monitoring solution.
Installation
- Install Graylog prerequisites (MongoDB and Elasticsearch):
sudo apt-get install mongodb-org elasticsearch
- Download and install Graylog:
wget https://packages.graylog2.org/repo/packages/graylog-4.0-repository_latest.deb
sudo dpkg -i graylog-4.0-repository_latest.deb
sudo apt-get update
sudo apt-get install graylog-server

Features
Streamlit Dashboard
- Real-time log viewing
- Interactive visualizations
- Request timeline
- Log level distribution
- Filter by request ID and log level
- Auto-refresh capability
- Lightweight and easy to set up
Graylog
- Enterprise-grade log management
- Advanced search capabilities
- Custom dashboards
- Alerts and notifications
- Log retention policies
- Role-based access control
Usage
- Start your voice assistant application:
python app.py
- Choose your preferred monitoring solution:
For Streamlit dashboard:
streamlit run monitor.py

For Graylog:
- Access the Graylog web interface at http://your-server:9000
- Default credentials: admin/admin (change on first login)
Monitoring Metrics
The monitoring solutions track:
- Total number of requests
- Active requests (last 5 minutes)
- Error rates
- Log levels distribution
- Request timelines
- Detailed log messages
Troubleshooting
If you encounter issues:
- Streamlit Dashboard:
- Ensure the log file exists and is readable
- Check if required packages are installed
- Verify the correct Python version
- Graylog:
- Verify MongoDB and Elasticsearch are running
- Check Graylog service status
- Review system logs for errors
TTS Text Normalizer for Indian Context
A comprehensive Python script to normalize text for Text-to-Speech (TTS) training, specifically designed for Indian languages and contexts.
Features
📞 Phone Number Normalization
Converts phone numbers in various formats to spoken digit sequences:
- +919876543210 → plus nine one nine eight seven six five four three two one zero
- 919-876-543-211 → nine one nine eight seven six five four three two one one
- 9876543213 → nine eight seven six five four three two one three
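A minimal sketch of digit-by-digit expansion (the function name mirrors normalize_phone_number below; the exact regex is an assumption):
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def normalize_phone_number(text: str) -> str:
    """Expand phone-number-like tokens into spoken digit sequences."""
    def speak(match: re.Match) -> str:
        number = match.group(0)
        words = ["plus"] if number.startswith("+") else []
        words += [DIGIT_WORDS[int(d)] for d in number if d.isdigit()]
        return " ".join(words)
    # Match 10+ digit runs, optionally prefixed with + and hyphen-separated
    return re.sub(r"\+?\d[\d-]{8,}\d", speak, text)

print(normalize_phone_number("Contact +919876543210"))
# Contact plus nine one nine eight seven six five four three two one zero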
💰 Currency Normalization
Converts Indian currency amounts to spoken form using Indian numbering system:
- ₹8,500 → rupees eight thousand five hundred
- ₹2.5 lakh → rupees two point five lakh
- ₹45 crore → rupees forty five crore
- Rs. 1,00,000 → rupees one lakh
🌡️ Temperature Normalization
Converts temperature readings to spoken form:
- 25°C → twenty five degrees celsius
- 98.6°F → ninety eight point six degrees fahrenheit
📊 Percentage Normalization
Converts percentages to spoken form:
- 85% → eighty five percent
- 12.5% → twelve point five percent
⏰ Time Normalization
Converts time formats to spoken form:
- 8:00 AM → eight o'clock AM
- 2:30 PM → two thirty PM
- 14:30 → fourteen thirty
📅 Date Normalization
Converts ordinal dates to spoken form:
- 31st March → thirty first March
- 1st April → first April
📧 Email & URL Normalization
Converts digital addresses to spoken form:
- [email protected] → priya at gmail dot com
- www.example.com → www dot example dot com
🔢 Number Normalization
Converts numbers using Indian numbering system:
- 1,00,000 → one lakh
- 50,000 → fifty thousand
- 25 → twenty five
📏 Measurement Units
Converts measurement units to spoken form:
- 500 GB → five hundred gigabytes
- 2.5 km → two point five kilometers
- 100 Mbps → one hundred megabits per second
🔤 Abbreviations
Abbreviations are kept as-is (not expanded) to maintain natural pronunciation:
- GST → GST (unchanged)
- EMI → EMI (unchanged)
- SBI → SBI (unchanged)
Usage
Basic Usage
from tts_text_normalizer import TTSTextNormalizer
# Initialize normalizer
normalizer = TTSTextNormalizer()
# Normalize a single sentence
text = "Rajesh Kumar का mobile number +919876543210 है। Amount ₹12,500 pay करना है।"
normalized = normalizer.normalize_text(text)
print(normalized)
# Output: Rajesh Kumar का mobile number plus nine one nine eight seven six five four three two one zero है। Amount rupees twelve thousand five hundred pay करना है।

Batch Processing
sentences = [
"Contact number +919876543210 है।",
"Amount ₹12,500 pay करना है।",
"Meeting 2:30 PM scheduled है।"
]
normalized_batch = normalizer.batch_normalize(sentences)
for original, normalized in zip(sentences, normalized_batch):
print(f"Original: {original}")
print(f"Normalized: {normalized}")
print()

File Processing
# Process a file containing TTS training sentences
normalizer.save_normalized_text('input.txt', 'normalized_output.txt')

Individual Component Testing
# Test specific normalizations
phone = normalizer.normalize_phone_number("+919876543210")
currency = normalizer.normalize_currency("₹8,500")
temp = normalizer.normalize_temperature("25°C")
percentage = normalizer.normalize_percentage("85%")

Installation
No external dependencies required! Uses only Python standard library:
# Clone or download the files
# Run directly with Python 3.6+
python3 tts_text_normalizer.py

Example Script
Run the example to see all features in action:
python3 example_usage.py

Or test specific components:
python3 debug_test.py

Indian Context Features
Indian Numbering System
- Supports lakh (1,00,000) and crore (1,00,00,000) properly
- Handles Indian comma formatting (1,23,456)
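A sketch of how the Indian grouping differs from the Western three-digit system (the helper name is hypothetical):
def format_indian(n: int) -> str:
    """Group digits Indian-style: last three, then pairs (e.g., 12,34,567)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    parts = []
    while len(head) > 2:
        parts.insert(0, head[-2:])
        head = head[:-2]
    if head:
        parts.insert(0, head)
    return ",".join(parts + [tail])

print(format_indian(100000))    # 1,00,000 (one lakh)
print(format_indian(10000000))  # 1,00,00,000 (one crore)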
Currency Formats
- Indian Rupee symbol (₹)
- Common Indian currency expressions
- Decimal handling with paise
Phone Number Formats
- Indian country code (+91)
- Various formatting styles commonly used in India
- Mobile number patterns (10 digits)
Regional Considerations
- Preserves Hindi/Indian language text as-is
- Maintains natural code-switching patterns
- Handles common Indian abbreviations
Supported Input Formats
Phone Numbers
- +919876543210
- 919876543210
- +91-9876543210
- 919-876-543-210
- 9876543210
Currency
- ₹8,500
- ₹2.5 lakh
- ₹45 crore
- Rs. 1,000
- INR 50,000
Temperature
- 25°C
- 98.6°F
Time
- 8:00 AM
- 2:30 PM
- 14:30
Dates
- 31st March
- 1st April
- 25th December
Output Examples
Input: "Dr. Suresh Gupta cardiologist हैं। Emergency contact +919876543210 है। Consultation fees ₹1,500 है।"
Output: "Dr. Suresh Gupta cardiologist हैं। Emergency contact plus nine one nine eight seven six five four three two one zero है। Consultation fees rupees one thousand five hundred है।"
Input: "Property value ₹2.5 crore है। Registration 31st March तक करना है।"
Output: "Property value rupees two point five crore है। Registration thirty first March तक करना है।"
Input: "Meeting 8:00 AM scheduled है। Success rate 95% है।"
Output: "Meeting eight o'clock AM scheduled है। Success rate ninety five percent है।"File Structure
├── tts_text_normalizer.py # Main normalizer class
├── example_usage.py # Comprehensive examples
├── debug_test.py # Debug and testing script
└── README.md # This documentation

Customization
You can easily customize the normalizer by:
- Adding new abbreviations: Modify the abbreviations dictionary
- Changing number words: Update the ones, tens, and indian_units lists
- Adding new patterns: Extend the regex patterns in individual functions
- Custom units: Add new measurement units to the units dictionary
Error Handling
The normalizer includes robust error handling:
- Invalid numbers fall back to original text
- Malformed patterns are preserved as-is
- File processing continues even if individual lines fail
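A sketch of the fallback pattern this implies (the wrapper name is an assumption):
def safe_normalize(normalize_fn, token: str) -> str:
    """Apply a normalization function, falling back to the original token."""
    try:
        return normalize_fn(token)
    except (ValueError, IndexError):
        # Malformed input: preserve the original text rather than failing
        return token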
Performance
- Lightweight: Uses only Python standard library
- Fast: Regex-based pattern matching
- Memory efficient: Processes text line by line for files
- Scalable: Handles large files through streaming
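A sketch of the line-by-line streaming the memory-efficiency claim implies (method names mirror those above; exact signatures are assumptions):
def save_normalized_text_stream(normalizer, input_path: str, output_path: str) -> None:
    """Normalize a file line by line so memory use stays constant."""
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            try:
                dst.write(normalizer.normalize_text(line.rstrip("\n")) + "\n")
            except Exception:
                dst.write(line)  # keep the original line if normalization fails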
Perfect for TTS training data preparation with authentic Indian context and multilingual support!
