# 🎤 Listen and Speak Web Component
A custom web component for local, in-browser voice interaction featuring Voice Activity Detection (VAD), Speech-to-Text (STT), and Text-to-Speech (TTS) using transformers.js. No server required — everything runs locally in the browser.
## ✨ Features

- 🔒 **Privacy-First**: All processing happens locally in the browser
- ⚡ **Real-time Processing**: Live audio capture and transcription
- 🤖 **Powered by Transformers.js**: Leverages WebAssembly and WebGPU for optimal performance
- 🎯 **Voice Activity Detection**: Automatically detects speech using Silero VAD
- 🗣️ **Speech Recognition**: Whisper model for accurate transcription
- 🔊 **Speech Synthesis**: Kokoro model for natural-sounding speech
- 📦 **Modular Design**: Easy to integrate into any web project
## 🚀 Installation

### Via NPM

```bash
npm install listen-and-speak
```

### Via CDN

```html
<script type="module" src="https://cdn.jsdelivr.net/npm/listen-and-speak/release/listen-and-speak.js"></script>
```

### Manual Installation

```html
<script type="module" src="path/to/listen-and-speak.js"></script>
```

## 🎮 Basic Usage

```html
<listen-and-speak id="voiceUI"></listen-and-speak>
<script type="module">
  const voiceUI = document.querySelector('#voiceUI');

  // Start listening for speech
  voiceUI.listen();

  // Speak text
  voiceUI.speak('Hello, how can I help you today?');
</script>
```

The web component itself is invisible. For a basic visual UI there is an alternative component with the same usage:

```html
<listen-speak-ui></listen-speak-ui>
```

This demo speaks back whatever speech it detects:

```html
<listen-speak-ui speekback></listen-speak-ui>
```

## 📖 API Reference
### Methods

| Method | Description | Returns |
|---|---|---|
| `listen()` | Starts VAD and begins recording audio | `Promise<void>` |
| `stop()` | Stops recording and VAD | `void` |
| `speak(text)` | Converts text to speech | `Promise<void>` |
| `stopSpeech()` | Stops ongoing speech synthesis | `void` |
| `speakFiller()` | Plays a random speech filler like "Thinking..." | `void` |
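A minimal push-to-talk sketch built on these methods (the `#talk` button is a hypothetical element in your page, and the flow is only illustrative):

```js
const voiceUI = document.querySelector('listen-and-speak');
const talkButton = document.querySelector('#talk'); // hypothetical push-to-talk button

talkButton.addEventListener('mousedown', async () => {
  voiceUI.stopSpeech();   // interrupt any ongoing TTS
  await voiceUI.listen(); // start VAD and recording
});

talkButton.addEventListener('mouseup', () => {
  voiceUI.stop();         // stop recording and VAD
  voiceUI.speakFiller();  // play a filler like "Thinking..." while a reply is prepared
});
```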
### Events

| Event | Description | Event Detail |
|---|---|---|
| `speech-start` | Fired when speech detection begins | `null` |
| `speech-end` | Fired when speech detection ends | `null` |
| `frame` | Fired for each audio frame captured | `{frame: Float32Array}` |
| `progress` | Fired during model loading | `{type: string, progress: number}` |
| `transcription` | Fired when speech is transcribed | `{text: string}` |
| `audio-stream` | Fired when text-to-speech is playing | `{audio: Float32Array, text: string}` |
| `error` | Fired on errors | `{error: string}` |
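For illustration, a small sketch (assuming the event payloads listed above) that meters microphone input from `frame` events and logs the text currently being synthesized:

```js
const voiceUI = document.querySelector('listen-and-speak');

// Rough input level meter computed from captured microphone frames
voiceUI.addEventListener('frame', (ev) => {
  const frame = ev.detail.frame; // Float32Array of audio samples
  const rms = Math.sqrt(frame.reduce((sum, x) => sum + x * x, 0) / frame.length);
  console.log('input level:', rms.toFixed(3));
});

// Show the text currently being spoken while TTS audio streams out
voiceUI.addEventListener('audio-stream', (ev) => {
  console.log('speaking:', ev.detail.text);
});
```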
### Properties

| Property | Type | Description |
|---|---|---|
| `isListening` | `boolean` | Read-only. Whether VAD is active |
| `isSpeaking` | `boolean` | Read-only. Whether TTS is active |
| `modelsLoaded` | `boolean` | Read-only. Whether models are loaded |
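As a sketch, these read-only properties can drive a simple microphone toggle (the `#mic` button is a hypothetical element in your page):

```js
const voiceUI = document.querySelector('listen-and-speak');
const micButton = document.querySelector('#mic'); // hypothetical toggle button

micButton.addEventListener('click', () => {
  if (!voiceUI.modelsLoaded) return; // wait until the models have finished loading
  if (voiceUI.isListening) {
    voiceUI.stop();
  } else {
    voiceUI.listen();
  }
});
```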
## 📝 Advanced Example

```js
import './listen-and-speak.js';
class VoiceAssistant {
  constructor() {
    this.voiceUI = document.createElement('listen-and-speak');
    document.body.appendChild(this.voiceUI);
    this.setupEventListeners();
    this.initialize();
  }

  async initialize() {
    // Wait for models to load
    await this.voiceUI.ready;
    console.log('Voice UI ready!');
  }

  setupEventListeners() {
    this.voiceUI.addEventListener('speech-start', () => {
      console.log('Speech detected');
      this.showListeningIndicator();
    });

    this.voiceUI.addEventListener('speech-end', async () => {
      console.log('Speech ended, processing...');
      this.hideListeningIndicator();
    });

    this.voiceUI.addEventListener('transcription', (ev) => {
      const text = ev.detail.text;
      console.log('Transcription:', text);
      this.processCommand(text);
    });

    this.voiceUI.addEventListener('progress', (ev) => {
      const { type, progress } = ev.detail;
      console.log(`Loading ${type}: ${progress}%`);
    });

    this.voiceUI.addEventListener('error', (ev) => {
      console.error('Voice UI Error:', ev.detail.error);
    });
  }

  async startConversation() {
    await this.voiceUI.listen();
    await this.voiceUI.speak('I am ready. How can I help you?');
  }

  processCommand(text) {
    // Your command processing logic here
    const response = this.generateResponse(text);
    this.voiceUI.speak(response);
  }

  generateResponse(text) {
    // Simple echo for demonstration
    return `You said: ${text}`;
  }

  showListeningIndicator() {
    // Visual feedback for listening state
  }

  hideListeningIndicator() {
    // Hide visual feedback
  }
}
// Initialize the assistant
const assistant = new VoiceAssistant();
assistant.startConversation();
```

## 🧠 Technical Details
### Models Used

- **Voice Activity Detection**: Silero VAD
- **Speech-to-Text**: OpenAI Whisper
- **Text-to-Speech**: Kokoro
### Performance Characteristics

- **Audio Frame Size**: 512 samples (see the sketch after this list)
- **Sample Rate**: 16,000 Hz (16 kHz)
- **Model Loading**: Cached in the browser after the first load
- **Memory Usage**: ~200-400 MB for all models
- **Initial Load Time**: 30-60 seconds (first load only)
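At these settings, each `frame` event carries 512 / 16,000 ≈ 32 ms of audio. A small sketch, assuming the `frame` and `speech-end` events documented above, that tracks how much audio has been captured:

```js
const voiceUI = document.querySelector('listen-and-speak');
const SAMPLE_RATE = 16000; // Hz, per the characteristics above

let capturedSeconds = 0;

voiceUI.addEventListener('frame', (ev) => {
  // Each 512-sample frame covers 512 / 16000 = 0.032 s of audio
  capturedSeconds += ev.detail.frame.length / SAMPLE_RATE;
});

voiceUI.addEventListener('speech-end', () => {
  console.log(`Captured ~${capturedSeconds.toFixed(1)} s of speech`);
  capturedSeconds = 0;
});
```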
## ⚙️ Configuration
You can configure the component via attributes or JavaScript:
```html
<!-- Via attributes -->
<listen-and-speak
  language="en"
  vad-threshold="0.5"
  auto-start="false"
  debug="true">
</listen-and-speak>

<!-- Via JavaScript -->
<script>
  const voiceUI = document.querySelector('listen-and-speak');
  voiceUI.language = 'en';
  voiceUI.vadThreshold = 0.5;
  voiceUI.autoStart = false;
</script>
```

## 🌐 Browser Compatibility
| Browser | Support | Notes |
|---|---|---|
| Chrome 90+ | ✅ Full | Best performance |
| Firefox 88+ | ✅ Full | Good performance |
| Safari 15+ | ⚠️ Partial | Limited WebGPU support |
| Edge 90+ | ✅ Full | Based on Chromium |
**Requirements:**

- Modern browser with WebAssembly and Web Audio API support (see the capability check below)
- WebGPU recommended for best performance (optional)
- 2 GB+ RAM recommended for smooth operation
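A rough capability check along these lines (an illustrative sketch, not part of the component) can gate initialization:

```js
// Illustrative feature detection before creating the component
const hasWasm = typeof WebAssembly === 'object';
const hasWebAudio = typeof (window.AudioContext || window.webkitAudioContext) === 'function';
const hasWebGPU = 'gpu' in navigator; // optional, improves performance when available

if (hasWasm && hasWebAudio) {
  const voiceUI = document.createElement('listen-and-speak');
  document.body.appendChild(voiceUI);
  console.log(hasWebGPU ? 'WebGPU available' : 'Falling back to WebAssembly only');
} else {
  console.warn('This browser cannot run the local voice models.');
}
```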
## 🚨 Limitations

- **First Load Time**: Models are large (100 MB+ each) and take time to download and initialize
- **Memory Intensive**: Requires substantial RAM for all three models
- **Browser Support**: Limited on older browsers and mobile devices
- **Accuracy**: On-device models may have slightly lower accuracy than cloud alternatives
- **Languages**: Supported languages depend on the underlying models
## 🔧 Development

```bash
# Clone the repository
git clone https://github.com/sjovanovic/listen-and-speak.git
# Install dependencies
npm install
# Start development server
npm run dev
# Build for production
npm run build
# Run tests
npm test
```

## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments

- **Transformers.js** for making ML models accessible in the browser
- **Silero VAD** team for voice activity detection
- **OpenAI Whisper** team for speech recognition
- **Kokoro** team for text-to-speech
**Note**: This is a client-side-only solution. For production use, consider implementing fallbacks or hybrid approaches for users with limited device capabilities.
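One possible fallback, sketched here with the browser's built-in Web Speech API rather than anything shipped by this package:

```js
// Illustrative fallback: use the browser's native speech synthesis when the
// local models are unavailable or too heavy for the device.
function speakWithFallback(voiceUI, text) {
  if (voiceUI && voiceUI.modelsLoaded) {
    return voiceUI.speak(text); // local Kokoro TTS
  }
  const utterance = new SpeechSynthesisUtterance(text);
  window.speechSynthesis.speak(utterance); // browser-native TTS
}
```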
