@aid-on/vad
v0.1.3
Silero VAD wrapper for browser - voice activity detection
Why @aid-on/vad?
Building voice-driven browser applications means solving two hard problems: detecting when the user is actually speaking, and filtering out background noise before processing. This package solves both:
- Silero VAD - State-of-the-art speech detection model via ONNX Runtime
- RNNoise suppression - Optional noise reduction pipeline using WebAssembly
- Simple callback API - onSpeechStart, onSpeechEnd, onFrameProcessed, onVADMisfire
- WAV conversion - Built-in audioToWav() for sending captured audio to STT APIs
- Auto CDN versioning - Pinned, tested versions of vad-web and ONNX Runtime loaded from CDN
- Zero configuration - Sensible defaults, start detecting speech in 5 lines of code
Installation
npm install @aid-on/vad

Note: This package is browser-only. It requires WebAssembly support and access to navigator.mediaDevices.getUserMedia.
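Because the package needs both WebAssembly and microphone access, it can be useful to check for them before calling createVAD() and show a fallback message otherwise. The following is a minimal sketch of such a check; `canRunVAD` and `BrowserLike` are hypothetical names for illustration and are not part of @aid-on/vad.

```typescript
// Hypothetical helper (not exported by @aid-on/vad): checks for the two
// browser capabilities named in the installation note.
interface BrowserLike {
  WebAssembly?: unknown;
  navigator?: { mediaDevices?: { getUserMedia?: unknown } };
}

function canRunVAD(env: BrowserLike): boolean {
  // WebAssembly is exposed as a global namespace object when supported.
  const hasWasm = typeof env.WebAssembly === "object" && env.WebAssembly !== null;
  // getUserMedia is only present in secure contexts with media support.
  const hasMic = typeof env.navigator?.mediaDevices?.getUserMedia === "function";
  return hasWasm && hasMic;
}

// In a browser, pass the global object:
//   if (!canRunVAD(globalThis)) { /* show an "unsupported browser" notice */ }
```

Passing the environment in as an argument keeps the check easy to exercise outside a browser.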
Quick Start
import { createVAD } from "@aid-on/vad";

const vad = await createVAD({
  onSpeechStart: () => {
    console.log("User started speaking");
  },
  onSpeechEnd: (audio) => {
    // audio is a Float32Array sampled at 16 kHz
    console.log(`Captured ${audio.length} samples`);
  },
});

vad.start();

API Reference
createVAD(callbacks, config?)
Create a new VAD instance. Requests microphone access, loads the Silero VAD model, and optionally sets up the RNNoise noise suppression pipeline.
import { createVAD } from "@aid-on/vad";

const vad = await createVAD(
  {
    onSpeechStart: () => {
      // User began speaking
      updateUI("listening");
    },
    onSpeechEnd: (audio: Float32Array) => {
      // User stopped speaking
      // audio contains the captured speech at 16kHz mono
      sendToSTT(audio);
    },
    onFrameProcessed: (probability: number) => {
      // Called on each audio frame with speech probability (0-1)
      updateMeter(probability);
    },
    onVADMisfire: () => {
      // Speech was too short (below minSpeechFrames threshold)
      console.log("Too short, ignoring");
    },
  },
  {
    positiveSpeechThreshold: 0.5,
    negativeSpeechThreshold: 0.35,
    minSpeechFrames: 3,
    noiseSuppression: true,
  }
);

Returns: Promise<VADInstance>
VADInstance
The object returned by createVAD().
| Method/Property | Type | Description |
|----------------|------|-------------|
| start() | () => void | Start listening for speech |
| pause() | () => void | Pause listening (retains resources) |
| listening | boolean | Whether VAD is currently listening |
| destroy() | () => void | Stop listening, release microphone, and clean up all resources |
// Lifecycle
vad.start();                // Begin speech detection
console.log(vad.listening); // true
vad.pause();                // Temporarily stop
console.log(vad.listening); // false
vad.start();                // Resume
vad.destroy();              // Fully clean up (cannot restart after this)

audioToWav(samples, sampleRate?)
Convert a Float32Array of audio samples to a WAV Blob. Useful for sending captured speech to STT APIs.
import { createVAD, audioToWav } from "@aid-on/vad";

const vad = await createVAD({
  onSpeechEnd: (audio) => {
    // Convert to WAV for uploading to an STT API
    const wavBlob = audioToWav(audio, 16000);
    const formData = new FormData();
    formData.append("file", wavBlob, "speech.wav");
    fetch("/api/transcribe", {
      method: "POST",
      body: formData,
    });
  },
});

Parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| samples | Float32Array | required | Audio sample data (values between -1 and 1) |
| sampleRate | number | 16000 | Sample rate in Hz |
Returns: Blob with MIME type audio/wav
The output is a standard PCM WAV file: mono, 16-bit, with the specified sample rate.
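To make the output format concrete, here is a rough sketch of how float samples can be packed into a mono, 16-bit PCM WAV file. It is an independent illustration of the format, not the package's actual implementation of audioToWav(); the function name `encodeWavPcm16` is hypothetical, and it returns an ArrayBuffer rather than a Blob so that it runs outside a browser.

```typescript
// Illustrative encoder for a mono, 16-bit PCM WAV file (44-byte header).
// Assumption: this mirrors the format described above, not the library's code.
function encodeWavPcm16(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // audio format: PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * bytesPerSample, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In a browser, the resulting buffer could be wrapped as `new Blob([buffer], { type: "audio/wav" })` to get an object comparable to what audioToWav() returns.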
Configuration
VADConfig
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| positiveSpeechThreshold | number | 0.5 | Probability threshold to detect speech start (0-1) |
| negativeSpeechThreshold | number | 0.35 | Probability threshold to detect speech end (0-1) |
| minSpeechFrames | number | 3 | Minimum frames to count as speech (prevents misfires) |
| preSpeechPadFrames | number | 3 | Number of frames to include before speech start |
| redemptionFrames | number | 8 | Frames to wait before considering speech ended |
| noiseSuppression | boolean | true | Enable RNNoise-based noise suppression |
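The threshold and frame-count options interact, so a small model may help when tuning them. The sketch below is a simplified state machine based only on the option descriptions in the table above; the underlying library may differ in detail, preSpeechPadFrames is ignored, and `segmentFrames` and its types are hypothetical names.

```typescript
// Simplified model of speech segmentation driven by per-frame probabilities.
// Assumption: behavior is inferred from the VADConfig descriptions only.
interface SegmenterConfig {
  positiveSpeechThreshold: number;
  negativeSpeechThreshold: number;
  minSpeechFrames: number;
  redemptionFrames: number;
}

interface SegmentEvent {
  frame: number;
  type: "speechStart" | "speechEnd" | "misfire";
}

function segmentFrames(probabilities: number[], cfg: SegmenterConfig): SegmentEvent[] {
  const events: SegmentEvent[] = [];
  let speaking = false;
  let speechFrames = 0; // frames at or above the positive threshold
  let quietFrames = 0;  // frames below the negative threshold since last speech

  probabilities.forEach((p, frame) => {
    if (p >= cfg.positiveSpeechThreshold) {
      quietFrames = 0;
      speechFrames++;
      if (!speaking) {
        speaking = true;
        events.push({ frame, type: "speechStart" });
      }
    } else if (speaking && p < cfg.negativeSpeechThreshold) {
      quietFrames++;
      if (quietFrames >= cfg.redemptionFrames) {
        // Segment is over: long enough counts as speech, otherwise a misfire.
        const type = speechFrames >= cfg.minSpeechFrames ? "speechEnd" : "misfire";
        events.push({ frame, type });
        speaking = false;
        speechFrames = 0;
        quietFrames = 0;
      }
    }
    // Probabilities between the two thresholds leave the state unchanged.
  });
  return events;
}
```

In this model, the gap between the two thresholds acts as hysteresis: a brief dip in probability does not end the segment, and raising redemptionFrames tolerates longer pauses at the cost of a later onSpeechEnd.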
VADCallbacks
| Callback | Type | Description |
|----------|------|-------------|
| onSpeechStart | () => void | Called when speech is detected |
| onSpeechEnd | (audio: Float32Array) => void | Called when speech ends, with captured audio data |
| onFrameProcessed | (probability: number) => void | Called on each frame with speech probability (0-1) |
| onVADMisfire | () => void | Called when detected speech was too short |
Real-World Example: Voice Chat with STT
import { createVAD, audioToWav } from "@aid-on/vad";

// Create VAD with noise suppression for a voice chat application
const vad = await createVAD(
  {
    onSpeechStart: () => {
      statusIndicator.textContent = "Listening...";
      statusIndicator.classList.add("active");
    },
    onSpeechEnd: async (audio) => {
      statusIndicator.textContent = "Processing...";
      // Convert to WAV and send to STT
      const wavBlob = audioToWav(audio, 16000);
      const formData = new FormData();
      formData.append("file", wavBlob, "speech.wav");
      const response = await fetch("/api/transcribe", {
        method: "POST",
        body: formData,
      });
      const { text } = await response.json();
      chatMessages.append(createMessage(text, "user"));
      statusIndicator.textContent = "Ready";
      statusIndicator.classList.remove("active");
    },
    onFrameProcessed: (probability) => {
      // Update a visual speech probability meter
      meterElement.style.width = `${probability * 100}%`;
    },
    onVADMisfire: () => {
      statusIndicator.textContent = "Ready";
    },
  },
  {
    positiveSpeechThreshold: 0.6, // Slightly higher for noisy environments
    negativeSpeechThreshold: 0.35,
    minSpeechFrames: 5,           // Require longer speech to trigger
    redemptionFrames: 10,         // Wait longer before cutting off
    noiseSuppression: true,       // Enable RNNoise
  }
);

// Start/stop via button
toggleButton.addEventListener("click", () => {
  if (vad.listening) {
    vad.pause();
    toggleButton.textContent = "Start";
  } else {
    vad.start();
    toggleButton.textContent = "Stop";
  }
});

// Cleanup on page unload
window.addEventListener("beforeunload", () => {
  vad.destroy();
});

Architecture
The audio processing pipeline:
Microphone (48kHz)
  |
  +-- [RNNoise] (optional, WebAssembly noise suppression)
  |     480-sample frames at 48kHz
  |
  +-- Silero VAD (ONNX Runtime, speech probability per frame)
        |
        +-- Speech segmentation
              |
              +-- onSpeechStart()
              +-- onSpeechEnd(Float32Array @ 16kHz)
              +-- onVADMisfire()

CDN Dependencies (loaded at runtime):
| Package | Version | Purpose |
|---------|---------|---------|
| @ricky0123/vad-web | 0.0.18 | Silero VAD model and worklet |
| onnxruntime-web | 1.14.0 | ONNX Runtime for WebAssembly inference |
| @shiguredo/rnnoise-wasm | ^2025.1.5 | RNNoise noise suppression (bundled) |
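The pipeline diagram notes that the RNNoise stage works on 480-sample frames at 48kHz, which corresponds to 10 ms of audio per frame. As a rough illustration of that framing step, the sketch below splits a buffer into fixed-size frames and keeps the leftover samples for the next call. The helper names are hypothetical and this is not the package's internal code.

```typescript
// Hypothetical framing helpers illustrating the 480-sample RNNoise frames
// mentioned in the pipeline diagram.
const RNNOISE_FRAME_SIZE = 480;
const MIC_SAMPLE_RATE = 48000;

// Duration of one frame in milliseconds.
function frameDurationMs(frameSize: number, sampleRate: number): number {
  return (frameSize * 1000) / sampleRate;
}

// Split a buffer into complete frames; samples that do not fill a frame
// are returned as a remainder to prepend to the next input chunk.
function splitIntoFrames(
  input: Float32Array,
  frameSize: number = RNNOISE_FRAME_SIZE
): { frames: Float32Array[]; remainder: Float32Array } {
  const frames: Float32Array[] = [];
  let offset = 0;
  while (offset + frameSize <= input.length) {
    // subarray() creates a view, so no samples are copied here.
    frames.push(input.subarray(offset, offset + frameSize));
    offset += frameSize;
  }
  return { frames, remainder: input.subarray(offset) };
}

const msPerFrame = frameDurationMs(RNNOISE_FRAME_SIZE, MIC_SAMPLE_RATE);
```

Since the frames are views into the input buffer, a real pipeline would need to copy them if the input buffer is reused between audio callbacks.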
License
MIT (C) Aid-On
Real-time voice detection for the browser. Hear what matters.
