@steelbrain/media-speech-detection-web

v1.2.0

Published

6 months ago

Production-ready speech detection using Silero VAD ONNX model for web browsers

steelbrain

speech vad voice audio detection silero onnx web browser streaming

@steelbrain/media-speech-detection-web

Speech Detection using Silero VAD ONNX model for web browsers.

Installation

npm install @steelbrain/media-speech-detection-web

Modern Bundler Support: This package works with modern bundlers (Webpack 5, Next.js, Vite, etc.). The ONNX model file is detected and bundled automatically, so no manual setup or public folder configuration is required.

Quick Start

import { speechFilter, preloadModel } from '@steelbrain/media-speech-detection-web';
import { ingestAudioStream, RECOMMENDED_AUDIO_CONSTRAINTS } from '@steelbrain/media-ingest-audio';

// Optional: Preload model during app initialization for faster first use
await preloadModel();

// Get microphone access
const mediaStream = await navigator.mediaDevices.getUserMedia({
  audio: RECOMMENDED_AUDIO_CONSTRAINTS
});

// Create 16kHz audio stream
const audioStream = await ingestAudioStream(mediaStream);

// Filter audio to only speech chunks
const vadTransform = speechFilter({
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  threshold: 0.5
});

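// speechProcessor is a WritableStream<Float32Array> supplied by your application
await audioStream
  .pipeThrough(vadTransform)
  .pipeTo(speechProcessor);

// Alternative to the pipeline above (a ReadableStream can only be consumed once):
// events-only (no audio output) using the .tee() pattern
const [processStream, eventsStream] = audioStream.tee();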

// Process audio on one branch
processStream.pipeTo(speechProcessor);

// Handle events on another branch without outputting audio
eventsStream.pipeThrough(speechFilter({
  noEmit: true,  // Don't emit audio chunks
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  onMisfire: () => console.log('⚠️ Short speech segment filtered')
}));
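The snippets above pipe into `speechProcessor`, which stands for any `WritableStream<Float32Array>` your application provides. As a minimal sketch of one possibility, the block below accumulates speech chunks into a single buffer. `SpeechCollector` and `createSpeechProcessor` are hypothetical names chosen for illustration and are not part of this package.

```typescript
// Hypothetical helper, not part of the package: accumulates speech chunks.
class SpeechCollector {
  private chunks: Float32Array[] = [];
  private totalSamples = 0;

  // Store a copy so later reuse of the source buffer cannot alter stored audio.
  write(chunk: Float32Array): void {
    this.chunks.push(chunk.slice());
    this.totalSamples += chunk.length;
  }

  // Concatenate everything received so far into one Float32Array.
  concat(): Float32Array {
    const out = new Float32Array(this.totalSamples);
    let offset = 0;
    for (let i = 0; i < this.chunks.length; i++) {
      out.set(this.chunks[i], offset);
      offset += this.chunks[i].length;
    }
    return out;
  }
}

// Wrap the collector in a WritableStream so it can be used as a pipeTo() target.
function createSpeechProcessor(collector: SpeechCollector): WritableStream<Float32Array> {
  return new WritableStream<Float32Array>({
    write(chunk) {
      collector.write(chunk);
    },
  });
}
```

With this in place, `const collector = new SpeechCollector();` followed by `const speechProcessor = createSpeechProcessor(collector);` would give the Quick Start a concrete sink, and `collector.concat()` returns the speech audio gathered so far.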

API Reference

`preloadModel(): Promise<void>`

Preloads the Silero VAD ONNX model by fetching it into browser cache, eliminating network delay when speech detection is first used.

Usage: await preloadModel() - Call during app initialization for optimal performance.

`speechFilter(options): TransformStream<Float32Array, Float32Array>`

Creates a TransformStream that filters audio, outputting only speech chunks. Use the noEmit option for events-only processing.

Usage: audioStream.pipeThrough(speechFilter(options)).pipeTo(processor)

Configuration Options

interface VADOptions {
  // Event Handlers
  onSpeechStart?: () => void;
  onSpeechEnd?: (speechAudio: Float32Array) => void;
  onMisfire?: () => void;
  onError?: (error: Error) => void;
  onDebugLog?: (message: string) => void;

  // Detection Configuration
  threshold?: number;              // Speech detection threshold (0-1). Default: 0.5
  minSpeechDurationMs?: number;    // Minimum speech duration in ms. Default: 160ms
  redemptionDurationMs?: number;   // Grace period before confirming speech end. Default: 400ms
  lookBackDurationMs?: number;     // Lookback buffer for smooth speech start. Default: 384ms
  
  // Stream Control
  noEmit?: boolean;               // Don't emit chunks, only trigger callbacks. Default: false
}
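Since `onSpeechEnd` receives the speech segment as a `Float32Array`, a handler can inspect it directly. The sketch below assumes the buffer is 16kHz mono, matching the input format described in this README; the helper names `segmentDurationMs` and `peakAmplitude` are illustrative and not part of the package.

```typescript
// Assumption: the callback buffer uses the same 16kHz mono format as the input.
const SEGMENT_SAMPLE_RATE_HZ = 16000;

// Duration of a speech segment in milliseconds.
function segmentDurationMs(speechAudio: Float32Array): number {
  return (speechAudio.length * 1000) / SEGMENT_SAMPLE_RATE_HZ;
}

// Peak absolute amplitude, useful as a quick loudness check.
function peakAmplitude(speechAudio: Float32Array): number {
  let peak = 0;
  for (let i = 0; i < speechAudio.length; i++) {
    peak = Math.max(peak, Math.abs(speechAudio[i]));
  }
  return peak;
}
```

A handler such as `onSpeechEnd: (speechAudio) => console.log(segmentDurationMs(speechAudio), 'ms')` could then log how long each detected utterance lasted.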

Optimal Defaults

The package provides carefully tuned defaults that work well for most use cases:

| Parameter | Default | Purpose |
|-----------|---------|---------|
| threshold | 0.5 | Balanced speech detection |
| minSpeechDurationMs | 160ms | Filters out very short sounds |
| redemptionDurationMs | 400ms | Handles natural speech pauses |
| lookBackDurationMs | 384ms | Captures natural audio context before speech |
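Because the model scores audio in 512-sample windows at 16kHz (32ms each), the millisecond defaults correspond to a number of inference frames. The sketch below shows that arithmetic. `msToFrames` is a hypothetical helper, and rounding up is an assumption; the package may round differently internally.

```typescript
const VAD_SAMPLE_RATE_HZ = 16000; // input sample rate stated in this README
const VAD_FRAME_SAMPLES = 512;    // samples per inference window (32ms)

// Number of whole frames needed to cover a duration.
// Rounding up is an assumption made for this sketch.
function msToFrames(durationMs: number): number {
  return Math.ceil((durationMs * VAD_SAMPLE_RATE_HZ) / 1000 / VAD_FRAME_SAMPLES);
}

console.log(msToFrames(160)); // minSpeechDurationMs  -> 5 frames
console.log(msToFrames(384)); // lookBackDurationMs   -> 12 frames
console.log(msToFrames(400)); // redemptionDurationMs -> 13 frames (12.5 rounded up)
```

The 160ms and 384ms defaults are exact multiples of the 32ms frame length, while 400ms falls between 12 and 13 frames.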

Advanced Usage

Error Handling & Debugging

const vadTransform = speechFilter({
  onSpeechStart: () => console.log('🎤 Speech started'),
  onSpeechEnd: () => console.log('🔇 Speech ended'),
  onError: (error) => console.error('VAD Error:', error),
  onDebugLog: (message) => console.log('VAD Debug:', message),
  threshold: 0.6
});

Real-time Speech Transcription Pipeline

// Preload model during app startup
await preloadModel();

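// Complete pipeline: microphone → VAD → transcription
// (transcriptionTransform and displayResults stand for your own TransformStream and WritableStream)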
await audioStream
  .pipeThrough(speechFilter({
    onSpeechStart: () => showRecordingIndicator(),
    onSpeechEnd: () => hideRecordingIndicator(),
    threshold: 0.5
  }))
  .pipeThrough(transcriptionTransform)
  .pipeTo(displayResults);

Performance Optimization

// Preload model early in your application lifecycle
window.addEventListener('load', async () => {
  try {
    await preloadModel();
    console.log('VAD model preloaded and cached');
  } catch (error) {
    console.warn('Failed to preload VAD model:', error);
  }
});

How It Works

Silero VAD Model: Uses the pre-trained Silero VAD ONNX model to score each audio window for speech
Audio Processing: Processes 16kHz mono audio in 512-sample windows (32ms frames)
State Machine: Tracks speech, intermediate, and silent states to turn per-window probabilities into speech start and end events
Lookback Buffer: Keeps recent audio (lookBackDurationMs) so the start of speech is not clipped
Temporal Smoothing: Uses minSpeechDurationMs and redemptionDurationMs to suppress short false triggers and bridge natural pauses
Web Streams: Built on the Web Streams API, so it composes with pipeThrough() and pipeTo()
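To make the state machine and temporal smoothing more concrete, here is a simplified sketch that turns a sequence of per-window speech probabilities into events. It is not the package's implementation: the exact transitions, the meaning of the intermediate state (treated here as "below threshold, end not yet confirmed"), and all names (`runVadSketch`, `SketchOptions`) are assumptions for illustration.

```typescript
type VadState = 'silent' | 'speech' | 'intermediate';
type VadEvent = 'speechStart' | 'speechEnd' | 'misfire';

interface SketchOptions {
  threshold: number;        // probability at or above this counts as speech
  minSpeechFrames: number;  // segments with fewer speech frames are misfires
  redemptionFrames: number; // consecutive quiet frames needed to end a segment
}

function runVadSketch(probabilities: number[], opts: SketchOptions): VadEvent[] {
  const events: VadEvent[] = [];
  let state: VadState = 'silent';
  let speechFrames = 0; // frames at or above threshold in the current segment
  let quietFrames = 0;  // consecutive sub-threshold frames since speech last seen

  for (const p of probabilities) {
    const isSpeech = p >= opts.threshold;
    if (state === 'silent') {
      if (isSpeech) {
        state = 'speech';
        speechFrames = 1;
        events.push('speechStart');
      }
    } else if (isSpeech) {
      // Speech continued or resumed: cancel any pending end.
      state = 'speech';
      speechFrames += 1;
      quietFrames = 0;
    } else {
      // Below threshold: wait out the redemption period before ending.
      state = 'intermediate';
      quietFrames += 1;
      if (quietFrames >= opts.redemptionFrames) {
        events.push(speechFrames >= opts.minSpeechFrames ? 'speechEnd' : 'misfire');
        state = 'silent';
        speechFrames = 0;
        quietFrames = 0;
      }
    }
  }
  return events;
}
```

In this sketch a brief dip below the threshold does not end the segment, and a segment that never accumulates `minSpeechFrames` speech frames is reported as a misfire instead of a speech end, which mirrors the roles of `redemptionDurationMs`, `minSpeechDurationMs`, and `onMisfire` described above.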

Model Details

Model: Silero VAD v4.0 (MIT License)
Input: 16kHz mono audio, 512 samples per inference (32ms windows)
Output: Speech probability (0-1) per window + internal LSTM state
Model Size: ~2.3MB ONNX format
Performance: <1ms inference time per chunk on modern browsers
Accuracy: Enterprise-grade performance across diverse acoustic conditions
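The model consumes exactly 512 samples per inference, while incoming stream chunks can be any length, so some form of re-framing is needed between the two. The sketch below shows one way to do it, carrying leftover samples over to the next call. Whether the package buffers this way internally is not stated here; `splitIntoFrames` and `FramingResult` are hypothetical names.

```typescript
const FRAME_SIZE = 512; // samples per inference window at 16kHz (32ms)

interface FramingResult {
  frames: Float32Array[];  // complete 512-sample frames, in order
  remainder: Float32Array; // leftover samples to pass in with the next chunk
}

function splitIntoFrames(pending: Float32Array, chunk: Float32Array): FramingResult {
  // Join leftover samples from the previous call with the new chunk.
  const joined = new Float32Array(pending.length + chunk.length);
  joined.set(pending, 0);
  joined.set(chunk, pending.length);

  // Slice off as many complete frames as are available.
  const frames: Float32Array[] = [];
  let offset = 0;
  while (offset + FRAME_SIZE <= joined.length) {
    frames.push(joined.slice(offset, offset + FRAME_SIZE));
    offset += FRAME_SIZE;
  }
  return { frames, remainder: joined.slice(offset) };
}
```

For example, a 1300-sample chunk yields two frames with 276 samples left over, and those 276 samples combine with the next chunk on the following call.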

Credits

This package uses the Silero VAD model developed by Silero Team, licensed under MIT License. The model provides state-of-the-art speech detection with excellent performance across various languages and acoustic conditions.

License

MIT License - See LICENSE file for details.
