avr-vad
v1.0.9
A Node.js library for Voice Activity Detection using Silero VAD
Agent Voice Response - AVR VAD - Silero Voice Activity Detection for Node.js
🎤 A Node.js library for Voice Activity Detection using the Silero VAD model.
✨ Features
- 🚀 Based on Silero VAD: Uses the pre-trained Silero ONNX model (v5 and legacy versions) for accurate results
- 🎯 Real-time processing: Supports real-time frame-by-frame processing
- ⚡ Non-real-time processing: Batch processing for audio files and streams
- 🔧 Configurable: Customizable thresholds and parameters for different needs
- 🎵 Audio processing: Includes utilities for resampling and audio manipulation
- 📊 Multiple models: Support for both Silero VAD v5 and legacy models
- 💾 Bundled models: Models are included in the package, no external downloads required
- 📝 TypeScript: Fully typed with TypeScript
🚀 Installation
npm install avr-vad

📖 Quick Start
Real-time Processing
import { RealTimeVAD } from 'avr-vad';
// Initialize the VAD with default options (Silero v5 model)
const vad = await RealTimeVAD.new({
model: 'v5', // or 'legacy'
positiveSpeechThreshold: 0.5,
negativeSpeechThreshold: 0.35,
preSpeechPadFrames: 1,
redemptionFrames: 8,
frameSamples: 1536,
minSpeechFrames: 3
});
// Process audio frames in real-time
const audioFrame = getAudioFrameFromMicrophone(); // Float32Array of 1536 samples at 16kHz
const result = await vad.processFrame(audioFrame);
console.log(`Speech probability: ${result.probability}`);
console.log(`Speech detected: ${result.msg === 'SPEECH_START' || result.msg === 'SPEECH_CONTINUE'}`);
// Clean up when done
vad.destroy();

Non-Real-time Processing
import { NonRealTimeVAD } from 'avr-vad';
// Initialize for batch processing
const vad = await NonRealTimeVAD.new({
model: 'v5',
positiveSpeechThreshold: 0.5,
negativeSpeechThreshold: 0.35
});
// Process entire audio buffer
const audioData = loadAudioData(); // Float32Array at 16kHz
const results = await vad.processAudio(audioData);
// Get speech segments
const speechSegments = vad.getSpeechSegments(results);
console.log(`Found ${speechSegments.length} speech segments`);
speechSegments.forEach((segment, i) => {
console.log(`Segment ${i + 1}: ${segment.start}ms - ${segment.end}ms`);
});
// Clean up
vad.destroy();

⚙️ Configuration
Real-time VAD Options
interface RealTimeVADOptions {
/** Model version to use ('v5' | 'legacy') */
model?: 'v5' | 'legacy';
/** Threshold for detecting speech start */
positiveSpeechThreshold?: number;
/** Threshold for detecting speech end */
negativeSpeechThreshold?: number;
/** Frames to include before speech detection */
preSpeechPadFrames?: number;
/** Frames to wait before ending speech */
redemptionFrames?: number;
/** Number of samples per frame (usually 1536 for 16kHz) */
frameSamples?: number;
/** Minimum frames for valid speech */
minSpeechFrames?: number;
}

Non-Real-time VAD Options
interface NonRealTimeVADOptions {
/** Model version to use ('v5' | 'legacy') */
model?: 'v5' | 'legacy';
/** Threshold for detecting speech start */
positiveSpeechThreshold?: number;
/** Threshold for detecting speech end */
negativeSpeechThreshold?: number;
}

Default Values
// Real-time VAD defaults
const defaultRealTimeOptions = {
model: 'v5',
positiveSpeechThreshold: 0.5,
negativeSpeechThreshold: 0.35,
preSpeechPadFrames: 1,
redemptionFrames: 8,
frameSamples: 1536,
minSpeechFrames: 3
};
// Non-real-time VAD defaults
const defaultNonRealTimeOptions = {
model: 'v5',
positiveSpeechThreshold: 0.5,
negativeSpeechThreshold: 0.35
};

📊 Results and Messages
VAD Messages
The VAD returns different message types to indicate speech state changes:
enum Message {
ERROR = 'ERROR',
SPEECH_START = 'SPEECH_START',
SPEECH_CONTINUE = 'SPEECH_CONTINUE',
SPEECH_END = 'SPEECH_END',
SILENCE = 'SILENCE'
}

Processing Results
interface VADResult {
/** Speech probability (0.0 - 1.0) */
probability: number;
/** Message indicating speech state */
msg: Message;
/** Audio data if speech segment ended */
audio?: Float32Array;
}

Speech Segments
interface SpeechSegment {
/** Start time in milliseconds */
start: number;
/** End time in milliseconds */
end: number;
/** Speech probability for this segment */
probability: number;
}

🔧 Audio Utilities
The library includes various audio processing utilities:
import { utils, Resampler } from 'avr-vad';
// Resample audio to 16kHz (required for VAD)
const resampler = new Resampler({
nativeSampleRate: 44100,
targetSampleRate: 16000,
targetFrameSize: 1536
});
const resampledFrame = resampler.process(audioFrame);
// Other utilities
const frameSize = utils.frameSize; // Get frame size for current sample rate
const audioBuffer = utils.concatArrays([frame1, frame2]); // Concatenate audio arrays

🎯 Advanced Examples
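Splitting Audio into Frames

RealTimeVAD expects fixed-size frames (1536 samples by default), so a longer recording has to be chunked before calling processFrame. A minimal sketch (the helper name splitIntoFrames is our own, not part of the library):

```typescript
// Split a Float32Array into fixed-size frames; a trailing partial
// frame is zero-padded so every chunk matches the expected length.
function splitIntoFrames(audio: Float32Array, frameSamples = 1536): Float32Array[] {
  const frames: Float32Array[] = [];
  for (let offset = 0; offset < audio.length; offset += frameSamples) {
    const frame = new Float32Array(frameSamples); // zero-initialized
    frame.set(audio.subarray(offset, offset + frameSamples));
    frames.push(frame);
  }
  return frames;
}
```

Each frame can then be passed to vad.processFrame(frame) in order.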
Real-time Speech Detection with Callbacks
import { RealTimeVAD, Message } from 'avr-vad';
class SpeechDetector {
private vad!: RealTimeVAD; // assigned in initialize()
private onSpeechStart?: (audio: Float32Array) => void;
private onSpeechEnd?: (audio: Float32Array) => void;
constructor(callbacks: {
onSpeechStart?: (audio: Float32Array) => void;
onSpeechEnd?: (audio: Float32Array) => void;
}) {
this.onSpeechStart = callbacks.onSpeechStart;
this.onSpeechEnd = callbacks.onSpeechEnd;
}
async initialize() {
this.vad = await RealTimeVAD.new({
positiveSpeechThreshold: 0.5,
negativeSpeechThreshold: 0.35,
onSpeechStart: this.onSpeechStart,
onSpeechEnd: this.onSpeechEnd
});
}
async processFrame(audioFrame: Float32Array) {
const result = await this.vad.processFrame(audioFrame);
return result;
}
destroy() {
this.vad?.destroy();
}
}
// Usage
const detector = new SpeechDetector({
onSpeechStart: (audio) => console.log(`Speech started with ${audio.length} samples`),
onSpeechEnd: (audio) => console.log(`Speech ended with ${audio.length} samples`)
});
await detector.initialize();

Batch Processing Audio File
import { NonRealTimeVAD, utils } from 'avr-vad';
import * as fs from 'fs';
async function processAudioFile(filePath: string) {
// Load audio data (you'll need your own audio loading logic)
const audioData = loadWavFile(filePath); // Float32Array at 16kHz
const vad = await NonRealTimeVAD.new({
model: 'v5',
positiveSpeechThreshold: 0.6,
negativeSpeechThreshold: 0.4
});
const results = await vad.processAudio(audioData);
const segments = vad.getSpeechSegments(results);
console.log(`Processing ${filePath}:`);
console.log(`Total audio duration: ${(audioData.length / 16000).toFixed(2)}s`);
console.log(`Speech segments found: ${segments.length}`);
segments.forEach((segment, i) => {
const duration = ((segment.end - segment.start) / 1000).toFixed(2);
console.log(` Segment ${i + 1}: ${segment.start}ms - ${segment.end}ms (${duration}s)`);
});
vad.destroy();
return segments;
}

📝 Development
Requirements
- Node.js >= 16.0.0
- TypeScript >= 5.0.0
Build
npm run build

Test
npm test

Scripts
npm run lint # Run ESLint
npm run clean # Clean build directory
npm run prepare # Build before npm install (automatically run)

📁 Project Structure
avr-vad/
├── src/
│ ├── index.ts # Main exports
│ ├── real-time-vad.ts # Real-time VAD implementation
│ └── common/
│ ├── index.ts # Common exports
│ ├── frame-processor.ts # Core ONNX processing
│ ├── non-real-time-vad.ts # Batch processing VAD
│ ├── utils.ts # Utility functions
│ └── resampler.ts # Audio resampling
├── dist/ # Compiled JavaScript
├── test/ # Test files
├── silero_vad_v5.onnx # Silero VAD v5 model
├── silero_vad_legacy.onnx # Silero VAD legacy model
└── package.json

🔧 Troubleshooting
Audio Format Requirements
The Silero VAD model requires:
- Sample rate: 16kHz
- Channels: Mono (single channel)
- Format: Float32Array with values between -1.0 and 1.0
- Frame size: 1536 samples (96ms at 16kHz)
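Microphone input often arrives as 16-bit signed PCM rather than normalized floats. A minimal conversion sketch (the helper name int16ToFloat32 is ours, not part of the library):

```typescript
// Convert 16-bit signed PCM to the normalized Float32Array the VAD expects.
// Each sample is scaled from [-32768, 32767] into roughly [-1.0, 1.0].
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```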
Model Selection
- v5 model: Latest version with improved accuracy
- legacy model: Original model for compatibility
Use the Resampler utility to convert audio to the required format:
import { Resampler } from 'avr-vad';
const resampler = new Resampler({
nativeSampleRate: 44100, // Your audio sample rate
targetSampleRate: 16000, // Required by VAD
targetFrameSize: 1536 // Required frame size
});

Performance Tips
- Use appropriate thresholds for your use case
- Consider using the legacy model for lower resource usage
- For real-time applications, ensure your audio processing pipeline can handle 16kHz/1536 samples per frame
- Use redemptionFrames to avoid choppy speech detection
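To see why redemptionFrames matters: speech is only declared over after the probability stays below the negative threshold for that many consecutive frames, so brief pauses don't split a segment. A simplified illustration of the idea (not the library's actual internals):

```typescript
// Simplified hangover logic: speech ends only after `redemptionFrames`
// consecutive frames fall below the negative threshold.
function endsAfter(probs: number[], negativeThreshold: number, redemptionFrames: number): boolean {
  let below = 0;
  for (const p of probs) {
    below = p < negativeThreshold ? below + 1 : 0; // reset on any speech frame
    if (below >= redemptionFrames) return true;
  }
  return false;
}
```

With redemptionFrames: 8, a pause of a few low-probability frames does not cut the segment.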
Acknowledgments
- Silero Models for the excellent VAD model
- ONNX Runtime for model inference
- The open source community for supporting libraries
Support & Community
- Website: https://agentvoiceresponse.com - Official website.
- GitHub: https://github.com/agentvoiceresponse - Report issues, contribute code.
- Discord: https://discord.gg/DFTU69Hg74 - Join the community discussion.
- Docker Hub: https://hub.docker.com/u/agentvoiceresponse - Find Docker images.
- NPM: https://www.npmjs.com/~agentvoiceresponse - Browse our packages.
- Wiki: https://wiki.agentvoiceresponse.com/en/home - Project documentation and guides.
Support AVR
AVR is free and open-source. If you find it valuable, consider supporting its development.
License
MIT License - see the LICENSE.md file for details.
