@tinkermonkey/utterance-emitter
v0.2.0
Utterance Emitter
Utterance Emitter is a library for recording audio in the browser, performing speaker detection, and emitting chunks of mp3 encoded audio representing utterances for further processing with tools like Whisper.
Installation
```shell
npm install utterance-emitter
```

Usage
First, import the UtteranceEmitter class and create an instance. Then, start recording audio and handle the emitted audio chunks.
```javascript
import { UtteranceEmitter } from 'utterance-emitter';

const emitter = new UtteranceEmitter({
  emitRawAudio: true,
  emitMP3Audio: true,
  onUtterance: (utterance) => {
    if (utterance.mp3) {
      const url = URL.createObjectURL(utterance.mp3);
      const audio = new Audio(url);
      audio.controls = true;
      document.body.appendChild(audio);
    }
    if (utterance.raw) {
      const url = URL.createObjectURL(utterance.raw);
      const audio = new Audio(url);
      audio.controls = true;
      document.body.appendChild(audio);
    }
  },
});

document.getElementById('startButton').addEventListener('click', () => emitter.start());
document.getElementById('stopButton').addEventListener('click', () => emitter.stop());
```

Algorithm for Speaker Detection
The UtteranceEmitter library supports two methods for detecting when someone is speaking:
1. Voice Activity Detection (VAD) - Recommended
This method uses the @ricky0123/vad-web library (Silero VAD) to detect human speech using machine learning. This is significantly more accurate than simple volume analysis, especially in noisy environments or with non-speech background noise (typing, HVAC, etc.).
To use this method, provide a vadConfig object in the configuration.
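For example, a sketch of enabling VAD (the option values shown are the documented defaults; tune them for your environment):

```javascript
import { UtteranceEmitter } from 'utterance-emitter';

const emitter = new UtteranceEmitter({
  vadConfig: {
    positiveSpeechThreshold: 0.5,  // probability above which speech is detected
    negativeSpeechThreshold: 0.35, // probability below which silence is detected
    minSpeechMs: 250,              // ignore speech bursts shorter than this
    redemptionMs: 80,              // grace period after speech ends
  },
  onUtterance: (utterance) => {
    // Handle the utterance
  },
});
```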
2. Amplitude-Based Detection (Fallback)
This is an unsophisticated but lightweight algorithm that works reasonably well in quiet environments:
- Volume Analysis: The audio stream is analyzed in real-time to calculate the average volume level.
- Threshold Comparison: The average volume level is compared against a predefined volume threshold.
- Filtered Signal: A filtered signal is generated based on the threshold comparison.
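The volume-analysis step can be sketched as a pure function. This is illustrative, not the library's internals; it assumes byte samples in the 0-255 range (as produced by a Web Audio `AnalyserNode`), centred on 128, and the default `volumeThreshold` of 7:

```javascript
// Mean absolute deviation of the samples from the 128 midpoint.
function averageVolume(samples) {
  let sum = 0;
  for (const s of samples) sum += Math.abs(s - 128);
  return sum / samples.length;
}

// Raw (unfiltered) speaking signal: is the average volume above the threshold?
function isSpeaking(samples, volumeThreshold = 7) {
  return averageVolume(samples) > volumeThreshold;
}
```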
The library is designed to attempt VAD initialization first and gracefully fall back to amplitude-based detection if VAD fails to load or is disabled.
Detection Process (Common Steps)
Regardless of the detection method:
- Signal Filtering: If the signal (VAD probability or Volume) is above the threshold, it is considered speaking. If it drops below for a short duration (quiet period), it is still considered speaking to bridge pauses.
- Recording Control: The media recorder starts recording when the filtered signal indicates speaking and stops when it indicates silence.
- Utterance Emission: Once recording stops, the audio is processed and emitted as an utterance.
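The signal-filtering step above can be sketched as a small state machine that bridges short quiet gaps so brief pauses do not split an utterance. The names and the frame-count representation of the quiet period are illustrative, not the library's internals:

```javascript
// rawFrames: per-frame booleans (VAD probability or volume above threshold).
// quietPeriodFrames: how many consecutive quiet frames to tolerate before
// the filtered signal drops to "not speaking".
function filterSpeakingSignal(rawFrames, quietPeriodFrames) {
  const filtered = [];
  let quietRun = 0;
  let speaking = false;
  for (const raw of rawFrames) {
    if (raw) {
      speaking = true;
      quietRun = 0;
    } else if (speaking) {
      quietRun += 1;
      if (quietRun > quietPeriodFrames) speaking = false;
    }
    filtered.push(speaking);
  }
  return filtered;
}
```

A short quiet gap (here, two frames) is bridged, while a longer one ends the utterance.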
Optional Charts
UtteranceEmitter can also visualize the audio data using optional charts, which can help in understanding the audio signal and the detection process. The charts are kept deliberately simple and free of dependencies.
The following charts are available and can be enabled individually:
- Waveform Chart: Displays the time-domain representation of the audio signal.
- Frequency Chart: Displays the frequency-domain representation of the audio signal.
- Volume Chart: Displays the average volume level of the audio signal.
- Threshold Signal Chart: Displays the threshold signal used for detecting speech.
- Speaking Signal Chart: Displays the filtered signal indicating when someone is speaking.
To enable these charts, pass the corresponding HTML canvas elements in the charts configuration option:
```javascript
const emitter = new UtteranceEmitter({
  emitRawAudio: true,
  emitMP3Audio: true,
  charts: {
    width: 400,
    height: 100,
    barMargin: 1,
    barWidthNominal: 2.5,
    waveform: document.getElementById("waveform"),
    frequency: document.getElementById("frequency"),
    volume: document.getElementById("volume"),
    threshold: document.getElementById("threshold"),
    speaking: document.getElementById("speaking"),
  },
  onUtterance: (utterance) => {
    // Handle the utterance
  },
});
```

Ensure that the canvas elements are present in your HTML (you can use any selector you like, since you pass in the element reference):
```html
<canvas id="waveform"></canvas>
<canvas id="frequency"></canvas>
<canvas id="volume"></canvas>
<canvas id="threshold"></canvas>
<canvas id="speaking"></canvas>
```

EmitterConfig Options
The EmitterConfig interface provides several options to customize the behavior of the UtteranceEmitter:
- onUtterance: A callback function that is called when an utterance is detected. The callback receives an Utterance object as its argument. It is triggered at the same time as the utterance event; there is no difference, so use whichever you prefer.
- volumeThreshold: The volume threshold at which to start recording. Default is 7.
- preRecordingDuration: The number of milliseconds to keep in a buffer before the volume threshold is reached. Default is 100.
- emitRawAudio: Whether to emit raw audio data. Default is false.
- emitMP3Audio: Whether to emit MP3 audio data. Default is true.
- emitText: Whether to emit text data. Default is false.
- sampleRate: The sample rate to use for audio recording. Default is 44100.
- mp3BitRate: The bit rate in kbps to use for MP3 encoding. Default is 128.
- vadConfig: Optional configuration for VAD (Voice Activity Detection). If provided, VAD is enabled.
  - positiveSpeechThreshold: Probability threshold for voice detection [0-1]. Default 0.5.
  - negativeSpeechThreshold: Probability threshold for silence detection [0-1]. Default 0.35.
  - minSpeechMs: Minimum speech duration in milliseconds. Default 250.
  - redemptionMs: Redemption period in milliseconds after speech ends. Default 80.
  - baseAssetPath: Custom base path for VAD assets.
- enablePerformanceMonitoring: Whether to enable internal performance monitoring (frame timing, etc.). Default is false.
- charts: An optional object to configure the charts. The object can have the following properties:
  - width: The width of the charts. Default is 400.
  - height: The height of the charts. Default is 100.
  - barMargin: The margin between bars in the charts. Default is 1.
  - barWidthNominal: The nominal width of the bars in the charts. Default is 2.5.
  - waveform: An HTML canvas element to display the waveform chart.
  - frequency: An HTML canvas element to display the frequency chart.
  - volume: An HTML canvas element to display the volume chart.
  - threshold: An HTML canvas element to display the threshold signal chart.
  - speaking: An HTML canvas element to display the speaking signal chart.
  - foregroundColor: Color for the chart elements.
  - backgroundColor: Background color for the charts.
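One way to picture how preRecordingDuration can work is a rolling buffer of recent audio chunks, so the moments just before the threshold is crossed are still included in the utterance. This is an illustrative sketch, not the library's internals:

```javascript
// Keeps only the chunks recorded within the last maxDurationMs milliseconds.
class PreRecordingBuffer {
  constructor(maxDurationMs) {
    this.maxDurationMs = maxDurationMs;
    this.chunks = []; // entries of { data, timestampMs }
  }

  push(data, timestampMs) {
    this.chunks.push({ data, timestampMs });
    // Evict chunks older than the rolling window.
    const cutoff = timestampMs - this.maxDurationMs;
    while (this.chunks.length && this.chunks[0].timestampMs < cutoff) {
      this.chunks.shift();
    }
  }

  // Called when recording starts: returns and clears the buffered chunks.
  drain() {
    const out = this.chunks.map((c) => c.data);
    this.chunks = [];
    return out;
  }
}
```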
Events
The UtteranceEmitter emits the following events:
speaking
Emitted when the speaking state changes (both when starting and stopping speaking). Subscribe to this event to be notified in real-time when speaking is detected.
```javascript
emitter.on('speaking', (event) => {
  // event.speaking: boolean - true when speaking starts, false when it stops
  // event.timestamp: number - milliseconds since epoch when the event occurred
  console.log(`Speaking changed to: ${event.speaking} at ${event.timestamp}`);
});
```

utterance
Emitted when a complete utterance has been detected and processed. This occurs after speaking stops and the audio has been processed.
```javascript
emitter.on('utterance', (event) => {
  // event.utterance: {
  //   raw?: Blob - raw audio data (if emitRawAudio is true)
  //   mp3?: Blob - MP3-encoded audio (if emitMP3Audio is true)
  //   text?: string - transcribed text (if emitText is true)
  //   timestamp: number - milliseconds since epoch when the utterance was recorded
  // }
  if (event.utterance.mp3) {
    // Handle MP3 audio...
  }
});
```

You can subscribe to events either using the .on() method as shown above, or by providing an onUtterance callback in the config. The onUtterance callback is equivalent to subscribing to the 'utterance' event, but it receives only the utterance object, not the full event.
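Since the emitted MP3 chunks are intended for tools like Whisper, one possible next step is forwarding them to a transcription service. The sketch below posts an utterance's mp3 blob to OpenAI's /v1/audio/transcriptions endpoint; buildTranscriptionRequest and transcribe are hypothetical helper names, and the API key handling is left to you:

```javascript
// Package an utterance's MP3 blob as a multipart request body.
function buildTranscriptionRequest(mp3Blob) {
  const form = new FormData();
  form.append('file', mp3Blob, 'utterance.mp3');
  form.append('model', 'whisper-1');
  return form;
}

// Send the utterance to the transcription endpoint and return the text.
async function transcribe(mp3Blob, apiKey) {
  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionRequest(mp3Blob),
  });
  const { text } = await res.json();
  return text;
}
```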
