Fonada Voice Assistant
A complete voice assistant pipeline integrating:
- Custom ASR (Automatic Speech Recognition)
- Custom Turn detection with ReplyOnPause handler
- LLM for conversational responses
- Custom Fonada TTS for high-quality voice synthesis
Table of Contents
- Documentation
- Prerequisites
- Setup
- Running the Voice Assistant
- Usage
- Customization
- Integration with FastAPI
- Troubleshooting
- License
- Voice Assistant Monitoring
Documentation
Detailed documentation is available in the docs/ folder:
- 📚 Documentation Index - Complete overview of all available documentation
- 🔧 Dynamic Model Configuration - Guide to managing LLM models dynamically via models.json
- 🤖 Google Gemini Integration - Complete guide to using Google Gemini models with tool calling
- 📊 Voice Assistant Metrics Testing - Comprehensive testing framework for evaluating performance across models and languages
- 📞 Telephony Integration - Asterisk AudioSocket integration with chunk size optimization
- 🔊 RNNoise Audio Processing - Setup and configuration for RNNoise noise reduction
- 🔄 RequestQueueManager - Centralized queue and resource management system for handling request lifecycles and interrupts
- 🎤 AudioRecorder System - Conversation audio recording with speaker identification and WAV export functionality
- ⏰ Scheduler API Integration - LLM-driven tool scheduling with external scheduler API for intelligent timing of tool execution
Prerequisites
- Python 3.8+
- 4 CUDA-capable GPUs
- 50 GB+ disk space
- Microphone and speakers
Setup
Building lmdeploy from source (Optional)
If you want to build lmdeploy from source instead of using the pre-built version:
pip install pybind11
sudo apt install cmake openmpi-bin libopenmpi-dev ninja-build
cd ~/lmdeploy
# Manual CMake build
mkdir -p build
# 120a-real is for RTX 5090
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="120a-real" \
-Dpybind11_DIR=$(python3 -m pybind11 --cmakedir) \
-GNinja
# Build with verbose output
cd build && ninja
# Install the extension to final position and set RPATH
ninja install
# If successful, install the Python package
cd ..
pip install -e .

Installation
- Install the required dependencies (install NeMo from GitHub):
pip install -r requirements.txt
- Run the LLM server:
lmdeploy serve api_server hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --server-port 23333 --quant-policy 4

or
export CUDA_VISIBLE_DEVICES=2
lmdeploy serve api_server sarvamai/sarvam-m \
--server-port 8000 \
--tp 1 \
--backend turbomind \
--quant-policy 4 \
--cache-max-entry-count 0.9
- Run the TTS server from the models/ folder:
export CUDA_VISIBLE_DEVICES=1
lmdeploy serve api_server tts_hindi --server-port 23334 --quant-policy 4

Alternative: Docker deployment for RTX 5090
First, pull the Docker image:
docker pull lmsysorg/sglang:blackwell

Then run the TTS server:
docker run --gpus all -d --restart unless-stopped \
-p 23334:23334 \
--name sglang_blackwell \
-v /home/fonada/voice_assistant/models/tts_hindi:/model \
lmsysorg/sglang:blackwell \
python3 -m sglang.launch_server --model-path /model --host 0.0.0.0 --port 23334

Running the Voice Assistant
Run the assistant with:
export LD_LIBRARY_PATH=/workspace/TensorRT-10.10.0.31/lib:$LD_LIBRARY_PATH
export OPENAI_API_ASR_KEY=
export SARVAM_API_KEY=
export DEEPGRAM_API_KEY=
export OPENAI_API_LLM_KEY=
export GROQ_API_LLM_KEY=
export GEMINI_API_KEY=
python app.py

Adjust the TensorRT path and the API keys above to match your environment. This will start a web server and open a browser interface where you can interact with the voice assistant.
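If the assistant cannot reach the LLM server, you can sanity-check the server directly. A minimal sketch, assuming the requests package is installed and the port from the first lmdeploy command above (lmdeploy's api_server exposes an OpenAI-compatible REST API):
import requests

# List the models served by the OpenAI-compatible endpoint
resp = requests.get("http://localhost:23333/v1/models")
print(resp.status_code, resp.json())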
Usage
- Click the microphone button to start speaking
- The assistant will automatically detect when you've finished speaking
- It will transcribe your speech, generate a response with the configured LLM (for example, Llama 3.1), and speak the response using Fonada TTS
- You can interrupt the assistant by speaking while it's responding
Customization
Voice Selection
To change the voice used by Fonada TTS, modify the options dictionary in the text_to_speech_sync method:
options = {"voice_id": "Ananya"}  # Change to your preferred voice

Available voices: "Rahul", "Vikram", "Arjun", "Dev", "Sanjay", "Jaya", "Meera", "Priya", "Ananya", "Divya"
System Prompt
To change how the LLM responds, customize the system prompt when initializing the VoiceAssistant:
assistant = VoiceAssistant(
llm_model_path=llm_model_path,
tts_model_path=tts_model_path,
system_prompt="You are a helpful voice assistant. Keep your responses short and friendly."
)

Turn Detection Sensitivity
Adjust the turn detection parameters in the create_voice_assistant_stream() function to change how the assistant detects when you've finished speaking:
algo_options=AlgoOptions(
audio_chunk_duration=0.5, # Duration of audio chunks
started_talking_threshold=0.2, # Threshold to detect start of speech
speech_threshold=0.1 # General speech detection threshold
)
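A sketch of how these options can plug into a ReplyOnPause handler, assuming a fastrtc-style API (the handler body and wiring here are illustrative, not the project's exact code):
from fastrtc import AlgoOptions, ReplyOnPause, Stream

def respond(audio):
    # Transcribe the utterance, query the LLM, synthesize speech, yield audio
    ...

stream = Stream(
    handler=ReplyOnPause(
        respond,
        algo_options=AlgoOptions(
            audio_chunk_duration=0.5,
            started_talking_threshold=0.2,
            speech_threshold=0.1,
        ),
    ),
    modality="audio",
    mode="send-receive",
)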
Integration with FastAPI

To integrate the voice assistant with a FastAPI app:
from fastapi import FastAPI
from voice_assistant.app import create_voice_assistant_stream
app = FastAPI()
stream = create_voice_assistant_stream()
stream.mount(app)
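You can then serve the mounted app with uvicorn. A minimal sketch (the host and port are arbitrary choices, not project defaults):
import uvicorn

# Serve the FastAPI app with the voice assistant stream mounted on it
uvicorn.run(app, host="0.0.0.0", port=8000)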
Troubleshooting

Issue: Models fail to load
Solution: Verify the paths to your model files and ensure they are accessible.

Issue: Speech recognition is inaccurate
Solution: Speak clearly and ensure your microphone is properly configured.

Issue: High latency in responses
Solution: Consider using a more powerful GPU or reducing the model parameters.
Issue: High latency in WebSocket audio processing (1-2+ second delays)
Solution: Audio chunk size optimization, described below.
The voice assistant uses VAD (Voice Activity Detection) that requires specific chunk sizes for optimal performance. Mismatched chunk sizes between client and server cause significant accumulation delays.
Root Cause:
- Server VAD requires chunks of VAD_CHUNK_SIZE_SEC * REQUIRED_SAMPLE_RATE samples
- Default: 0.64 * 16000 = 10,240 samples (640ms chunks)
- If the client sends smaller chunks (e.g., 2048 samples = 128ms), the server must accumulate 5+ chunks before processing
- This causes up to 640ms of delay per processing stage (see the arithmetic check below)
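A quick check of the accumulation math (a sketch; the constant names mirror the config keys used below):
REQUIRED_SAMPLE_RATE = 16_000  # Hz
VAD_CHUNK_SIZE_SEC = 0.64      # server-side VAD window

vad_chunk_samples = int(VAD_CHUNK_SIZE_SEC * REQUIRED_SAMPLE_RATE)  # 10,240

client_chunk_samples = 2048    # 128ms client buffers
chunks_needed = -(-vad_chunk_samples // client_chunk_samples)       # ceiling division -> 5
delay_ms = chunks_needed * client_chunk_samples / REQUIRED_SAMPLE_RATE * 1000
print(vad_chunk_samples, chunks_needed, delay_ms)                   # 10240 5 640.0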
Client-Side Optimization:
// ❌ Small chunks cause accumulation delays
const bufferSize = 2048; // 128ms chunks → high latency
// ❌ Invalid: Not a power of 2 (Web Audio API requirement)
// const bufferSize = 10240;
// ✅ Valid power of 2, significant latency reduction
const bufferSize = 8192; // 512ms chunks → low latency
// ✅ Alternative: Eliminates all accumulation delays
// const bufferSize = 16384; // 1024ms chunks → minimal latency

Note: The Web Audio API requires buffer sizes to be powers of 2 between 256 and 16384.
WebSocket Transmission Optimization:
# Send larger chunks aligned with VAD processing
chunk_size = 8192  # Valid power of 2, matches client buffer size (512ms at 16kHz)

Configuration Tuning:
# config.yaml - Optimize for 8192-sample client buffers
VAD_CHUNK_SIZE_SEC: 0.512 # 8192 samples at 16kHz (matches client buffers)
NUM_CONSECUTIVE_NON_SPEECH_CHUNKS_TO_END_SEGMENT: 1 # Improve segment detection
# Alternative: Even lower latency with smaller chunks
# VAD_CHUNK_SIZE_SEC: 0.32 # 5120 samples (requires 1.6 client chunks)

Expected Impact:
- Before optimization: 2+ seconds end-to-end latency
- After optimization: ~1 second end-to-end latency
- Latency reduction: Up to 1 second improvement
Performance Testing: Use the included concurrency test to measure improvements:
python test/test_concurrency.py --max_concurrent 10 --direct

Note: For telephony-specific optimizations and Asterisk AudioSocket integration, see the 📞 Telephony Integration guide.
Sharing Conversation Recordings Across Machines
The conversation_recordings folder can be shared across different machines using NFS (Network File System), which is the recommended approach for Linux environments.
NFS Setup
On the Source Server (Sharing the folder):
- Install NFS server:
# Ubuntu/Debian
sudo apt install nfs-kernel-server
# CentOS/RHEL
sudo yum install nfs-utils
- Create and configure the shared directory:
# Navigate to your voice assistant directory
cd /home/fonada/voice_assistant
# Set proper permissions for the conversation_recordings folder
sudo chown nobody:nogroup conversation_recordings
sudo chmod 755 conversation_recordings
- Configure NFS exports:
# Edit the exports file
sudo nano /etc/exports
# Add this line (replace 192.168.1.100 with your target server's IP):
/home/fonada/voice_assistant/conversation_recordings 192.168.1.100(rw,sync,no_subtree_check,no_root_squash)
- Apply changes and restart NFS:
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server

On the Target Server (Mounting the folder):
- Install NFS client:
# Ubuntu/Debian
sudo apt install nfs-common
# CentOS/RHEL
sudo yum install nfs-utils
- Create mount point and mount the folder:
# Create a local mount point
sudo mkdir -p /home/fonada/voice_assistant/conversation_recordings
# Mount the remote folder (replace 192.168.1.50 with source server's IP)
sudo mount -t nfs 192.168.1.50:/home/fonada/voice_assistant/conversation_recordings /home/fonada/voice_assistant/conversation_recordings
- For permanent mounting, add to /etc/fstab:
echo "192.168.1.50:/home/fonada/voice_assistant/conversation_recordings /home/fonada/voice_assistant/conversation_recordings nfs defaults 0 0" | sudo tee -a /etc/fstabUsage Notes
- Replace the IP addresses (192.168.1.50, 192.168.1.100) with your actual server IPs
- The conversation recordings will be automatically shared and synchronized across all mounted machines
- Ensure proper firewall configuration to allow NFS traffic (port 2049)
- For multiple target servers, add additional lines to /etc/exports on the source server
Troubleshooting NFS Permission Issues
If you encounter permission errors when trying to save conversation recordings on the target machine:
Error Example:
PermissionError: [Errno 13] Permission denied: 'conversation_recordings/...'

Root Cause:
NFS preserves original user IDs (UIDs) from the source server. If the fonada user has different UIDs on source and target machines, permission conflicts occur.
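A quick way to confirm a UID mismatch from the target machine (a sketch using only the standard library):
import os
import pwd

st = os.stat("/home/fonada/voice_assistant/conversation_recordings")
print("directory owner UID:", st.st_uid)
print("local fonada UID:", pwd.getpwnam("fonada").pw_uid)
# If the two UIDs differ, writes over NFS will fail with EACCES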
Solution 1: Configure NFS Export with UID Mapping (Recommended for Multiple Machines)
# On the source server, edit /etc/exports:
sudo nano /etc/exports
# Update the export line to include all_squash and UID mapping:
/home/fonada/voice_assistant/conversation_recordings TARGET_IP(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)
# Apply changes:
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server

Solution 2: Fix Ownership on Source Server (Single Target Machine Only)
# WARNING: This approach only works if all machines have the same UID for fonada user
# On the source server, change ownership to match target machine's fonada user UID
# First check the target machine's fonada UID: id fonada
# Then on source server:
sudo chown -R TARGET_UID:TARGET_UID /home/fonada/voice_assistant/conversation_recordings
# Example: If target machine fonada user is UID 1000:
sudo chown -R 1000:1000 /home/fonada/voice_assistant/conversation_recordings

Note: Solution 2 will break access for other machines with different UIDs. Use Solution 1 for multiple machines.
Verification: After applying either solution, test write permissions:
# On target machine:
cd /home/fonada/voice_assistant/conversation_recordings
mkdir test_write_permission
# Should succeed without permission errors

License
This project uses the same license as the Fonada TTS system.
Voice Assistant Monitoring
This document describes how to set up monitoring for the Voice Assistant application. There are two options available:
Option 1: Streamlit Dashboard (Lightweight)
A lightweight, real-time monitoring dashboard built with Streamlit.
Installation
- Install required packages:
pip install streamlit pandas plotly
- Run the monitoring dashboard:
streamlit run monitor.py

The dashboard will be available at http://localhost:8501 and includes:
- Real-time log viewing
- Request timeline visualization
- Log level distribution
- Filtering by request ID and log level
- Auto-refresh functionality
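A minimal sketch of such a dashboard (illustrative only; the log file name and format are assumptions, not the project's monitor.py):
import streamlit as st

LOG_FILE = "voice_assistant.log"  # assumed log file name

st.title("Voice Assistant Logs")
level = st.selectbox("Log level", ["ALL", "INFO", "WARNING", "ERROR"])

with open(LOG_FILE, encoding="utf-8") as f:
    lines = f.readlines()
if level != "ALL":
    lines = [line for line in lines if level in line]
st.text("".join(lines[-200:]))  # show the most recent entries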
Option 2: Graylog (Enterprise-grade)
A more comprehensive logging and monitoring solution.
Installation
- Install Graylog prerequisites (MongoDB and Elasticsearch):
sudo apt-get install mongodb-org elasticsearch
- Download and install Graylog:
wget https://packages.graylog2.org/repo/packages/graylog-4.0-repository_latest.deb
sudo dpkg -i graylog-4.0-repository_latest.deb
sudo apt-get update
sudo apt-get install graylog-server

Features
Streamlit Dashboard
- Real-time log viewing
- Interactive visualizations
- Request timeline
- Log level distribution
- Filter by request ID and log level
- Auto-refresh capability
- Lightweight and easy to set up
Graylog
- Enterprise-grade log management
- Advanced search capabilities
- Custom dashboards
- Alerts and notifications
- Log retention policies
- Role-based access control
Usage
- Start your voice assistant application:
python app.py
- Choose your preferred monitoring solution:
For Streamlit dashboard:
streamlit run monitor.py

For Graylog:
- Access the Graylog web interface at http://your-server:9000
- Default credentials: admin/admin (change on first login)
Monitoring Metrics
The monitoring solutions track:
- Total number of requests
- Active requests (last 5 minutes)
- Error rates
- Log levels distribution
- Request timelines
- Detailed log messages
Troubleshooting
If you encounter issues:
- Streamlit Dashboard:
- Ensure the log file exists and is readable
- Check if required packages are installed
- Verify the correct Python version
- Graylog:
- Verify MongoDB and Elasticsearch are running
- Check Graylog service status
- Review system logs for errors
TTS Text Normalizer for Indian Context
A comprehensive Python script to normalize text for Text-to-Speech (TTS) training, specifically designed for Indian languages and contexts.
Features
📞 Phone Number Normalization
Converts phone numbers in various formats to spoken digit sequences:
- +919876543210 → plus nine one nine eight seven six five four three two one zero
- 919-876-543-211 → nine one nine eight seven six five four three two one one
- 9876543213 → nine eight seven six five four three two one three
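A minimal sketch of digit-by-digit expansion (the function name mirrors normalize_phone_number below; the exact regex is an assumption):
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def normalize_phone_number(text: str) -> str:
    """Expand phone-number-like tokens into spoken digit sequences."""
    def speak(match: re.Match) -> str:
        number = match.group(0)
        words = ["plus"] if number.startswith("+") else []
        words += [DIGIT_WORDS[int(d)] for d in number if d.isdigit()]
        return " ".join(words)
    # Match 10+ digit runs, optionally prefixed with + and hyphen-separated
    return re.sub(r"\+?\d[\d-]{8,}\d", speak, text)

print(normalize_phone_number("Contact +919876543210"))
# Contact plus nine one nine eight seven six five four three two one zero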
💰 Currency Normalization
Converts Indian currency amounts to spoken form using Indian numbering system:
- ₹8,500 → rupees eight thousand five hundred
- ₹2.5 lakh → rupees two point five lakh
- ₹45 crore → rupees forty five crore
- Rs. 1,00,000 → rupees one lakh
🌡️ Temperature Normalization
Converts temperature readings to spoken form:
- 25°C → twenty five degrees celsius
- 98.6°F → ninety eight point six degrees fahrenheit
📊 Percentage Normalization
Converts percentages to spoken form:
- 85% → eighty five percent
- 12.5% → twelve point five percent
⏰ Time Normalization
Converts time formats to spoken form:
- 8:00 AM → eight o'clock AM
- 2:30 PM → two thirty PM
- 14:30 → fourteen thirty
📅 Date Normalization
Converts ordinal dates to spoken form:
- 31st March → thirty first March
- 1st April → first April
📧 Email & URL Normalization
Converts digital addresses to spoken form:
- [email protected] → priya at gmail dot com
- www.example.com → www dot example dot com
🔢 Number Normalization
Converts numbers using Indian numbering system:
- 1,00,000 → one lakh
- 50,000 → fifty thousand
- 25 → twenty five
📏 Measurement Units
Converts measurement units to spoken form:
- 500 GB → five hundred gigabytes
- 2.5 km → two point five kilometers
- 100 Mbps → one hundred megabits per second
🔤 Abbreviations
Abbreviations are kept as-is (not expanded) to maintain natural pronunciation:
- GST → GST (unchanged)
- EMI → EMI (unchanged)
- SBI → SBI (unchanged)
Usage
Basic Usage
from tts_text_normalizer import TTSTextNormalizer
# Initialize normalizer
normalizer = TTSTextNormalizer()
# Normalize a single sentence
text = "Rajesh Kumar का mobile number +919876543210 है। Amount ₹12,500 pay करना है।"
normalized = normalizer.normalize_text(text)
print(normalized)
# Output: Rajesh Kumar का mobile number plus nine one nine eight seven six five four three two one zero है। Amount rupees twelve thousand five hundred pay करना है।

Batch Processing
sentences = [
"Contact number +919876543210 है।",
"Amount ₹12,500 pay करना है।",
"Meeting 2:30 PM scheduled है।"
]
normalized_batch = normalizer.batch_normalize(sentences)
for original, normalized in zip(sentences, normalized_batch):
print(f"Original: {original}")
print(f"Normalized: {normalized}")
print()

File Processing
# Process a file containing TTS training sentences
normalizer.save_normalized_text('input.txt', 'normalized_output.txt')

Individual Component Testing
# Test specific normalizations
phone = normalizer.normalize_phone_number("+919876543210")
currency = normalizer.normalize_currency("₹8,500")
temp = normalizer.normalize_temperature("25°C")
percentage = normalizer.normalize_percentage("85%")

Installation
No external dependencies required! Uses only Python standard library:
# Clone or download the files
# Run directly with Python 3.6+
python3 tts_text_normalizer.py

Example Script
Run the example to see all features in action:
python3 example_usage.py

Or test specific components:
python3 debug_test.py

Indian Context Features
Indian Numbering System
- Supports lakh (1,00,000) and crore (1,00,00,000) properly
- Handles Indian comma formatting (1,23,456)
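A sketch of how the Indian grouping differs from the Western three-digit system (the helper name is hypothetical):
def format_indian(n: int) -> str:
    """Group digits Indian-style: last three, then pairs (e.g., 12,34,567)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    parts = []
    while len(head) > 2:
        parts.insert(0, head[-2:])
        head = head[:-2]
    if head:
        parts.insert(0, head)
    return ",".join(parts + [tail])

print(format_indian(100000))    # 1,00,000 (one lakh)
print(format_indian(10000000))  # 1,00,00,000 (one crore)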
Currency Formats
- Indian Rupee symbol (₹)
- Common Indian currency expressions
- Decimal handling with paise
Phone Number Formats
- Indian country code (+91)
- Various formatting styles commonly used in India
- Mobile number patterns (10 digits)
Regional Considerations
- Preserves Hindi/Indian language text as-is
- Maintains natural code-switching patterns
- Handles common Indian abbreviations
Supported Input Formats
Phone Numbers
- +919876543210
- 919876543210
- +91-9876543210
- 919-876-543-210
- 9876543210
Currency
- ₹8,500
- ₹2.5 lakh
- ₹45 crore
- Rs. 1,000
- INR 50,000
Temperature
- 25°C
- 98.6°F
Time
- 8:00 AM
- 2:30 PM
- 14:30
Dates
- 31st March
- 1st April
- 25th December
Output Examples
Input: "Dr. Suresh Gupta cardiologist हैं। Emergency contact +919876543210 है। Consultation fees ₹1,500 है।"
Output: "Dr. Suresh Gupta cardiologist हैं। Emergency contact plus nine one nine eight seven six five four three two one zero है। Consultation fees rupees one thousand five hundred है।"
Input: "Property value ₹2.5 crore है। Registration 31st March तक करना है।"
Output: "Property value rupees two point five crore है। Registration thirty first March तक करना है।"
Input: "Meeting 8:00 AM scheduled है। Success rate 95% है।"
Output: "Meeting eight o'clock AM scheduled है। Success rate ninety five percent है।"File Structure
├── tts_text_normalizer.py # Main normalizer class
├── example_usage.py # Comprehensive examples
├── debug_test.py # Debug and testing script
└── README.md # This documentation

Customization
You can easily customize the normalizer by:
- Adding new abbreviations: Modify the abbreviations dictionary
- Changing number words: Update the ones, tens, and indian_units lists
- Adding new patterns: Extend the regex patterns in individual functions
- Custom units: Add new measurement units to the units dictionary
Error Handling
The normalizer includes robust error handling:
- Invalid numbers fall back to original text
- Malformed patterns are preserved as-is
- File processing continues even if individual lines fail
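A sketch of the fallback pattern this implies (the wrapper name is an assumption):
def safe_normalize(normalize_fn, token: str) -> str:
    """Apply a normalization function, falling back to the original token."""
    try:
        return normalize_fn(token)
    except (ValueError, IndexError):
        # Malformed input: preserve the original text rather than failing
        return token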
Performance
- Lightweight: Uses only Python standard library
- Fast: Regex-based pattern matching
- Memory efficient: Processes text line by line for files
- Scalable: Handles large files through streaming
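A sketch of the line-by-line streaming the memory-efficiency claim implies (method names mirror those above; exact signatures are assumptions):
def save_normalized_text_stream(normalizer, input_path: str, output_path: str) -> None:
    """Normalize a file line by line so memory use stays constant."""
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            try:
                dst.write(normalizer.normalize_text(line.rstrip("\n")) + "\n")
            except Exception:
                dst.write(line)  # keep the original line if normalization fails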
Perfect for TTS training data preparation with authentic Indian context and multilingual support!
