
@ricky0123/vad

v0.2.4

Powerful, user-friendly voice activity detector (VAD) for the browser

Voice Activity Detection for the Browser

This package aims to provide an accurate, user-friendly voice activity detector (VAD) that runs in the browser. It also has limited support for Node. Currently, it runs Silero VAD [1] using ONNX Runtime Web / ONNX Runtime Node.js.

Installation

Script tags

To use the VAD via a script tag in the browser, include the following script tags:

<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@ricky0123/vad/dist/index.browser.js"></script>
<script>
  async function main() {
    const myvad = await vad.MicVAD.new()
    ...
  }
  main()
</script>

Bundler

To use the VAD in a webpack project, run

npm i @ricky0123/vad onnxruntime-web

and add the following to your webpack.config.js:

const CopyPlugin = require("copy-webpack-plugin")

module.exports = {
  // ...
  plugins: [
    // ...
    new CopyPlugin({
      patterns: [
        // ...
        {
          from: "node_modules/@ricky0123/vad/dist/*.worklet.js",
          to: "[name][ext]",
        },
        {
          from: "node_modules/@ricky0123/vad/dist/*.onnx",
          to: "[name][ext]",
        },
        { from: "node_modules/onnxruntime-web/dist/*.wasm", to: "[name][ext]" },
      ],
    }),
  ],
}

With other bundlers, you will need to ensure that the onnxruntime-web wasm files, as well as the worklet file and onnx file from this package, are served alongside your application.
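For bundlers without a copy plugin, one option is to copy the files into your served assets directory as a build step. A rough sketch, assuming a public/ directory is served statically (the destination path is an assumption, not part of this package):

```shell
# Copy the VAD worklet, the onnx model, and the onnxruntime-web wasm
# files into the static assets directory (public/ is an assumption).
mkdir -p public
cp node_modules/@ricky0123/vad/dist/*.worklet.js public/
cp node_modules/@ricky0123/vad/dist/*.onnx public/
cp node_modules/onnxruntime-web/dist/*.wasm public/
```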

Node

For a server-side node project, run

npm i @ricky0123/vad onnxruntime-node

and in your code

const vad = require("@ricky0123/vad/dist/index.node")
const myvad = await vad.NonRealTimeVAD.new()
// ...

Note the weird import and that we install onnxruntime-node instead of onnxruntime-web.

Customizing the behavior of the VAD algorithm

The VAD algorithm works as follows:

  1. Sample rate conversion is performed on input audio so that the processed audio has a sample rate of 16000.
  2. The converted samples are batched into "frames" of size frameSamples samples.
  3. The Silero vad model is run on each frame and produces a number between 0 and 1 indicating the probability that the sample contains speech.
  4. If the algorithm has not detected speech recently, it is in a state of not speaking. Once it encounters a frame with speech probability greater than positiveSpeechThreshold, it switches to a state of speaking. When it then encounters redemptionFrames frames with speech probability less than negativeSpeechThreshold, without having encountered a frame with speech probability greater than positiveSpeechThreshold in between, the speech audio segment is considered to have ended and the algorithm returns to a state of not speaking. Frames with speech probability between negativeSpeechThreshold and positiveSpeechThreshold are effectively ignored.
  5. When the algorithm detects the end of a speech audio segment (i.e. goes from the state of speaking to not speaking), it counts the number of frames with speech probability greater than positiveSpeechThreshold in the audio segment. If the count is less than minSpeechFrames, then the audio segment is considered a false positive. Otherwise, preSpeechPadFrames frames are prepended to the audio segment and the segment is made accessible through the higher-level API.
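The state machine described in the steps above can be sketched in TypeScript as follows. This is a simplified illustration, not the library's internal code: option names mirror the documented parameters, but audio buffering and preSpeechPadFrames padding are omitted.

```typescript
// Hypothetical sketch of the frame-level VAD state machine.
interface StateMachineOptions {
  positiveSpeechThreshold: number
  negativeSpeechThreshold: number
  redemptionFrames: number
  minSpeechFrames: number
}

// Returns a function that consumes one speech probability per frame and
// yields "segment" when a valid speech segment ends, "misfire" when a
// segment ends with too few speech-positive frames, or null otherwise.
function createStateMachine(opts: StateMachineOptions) {
  let speaking = false
  let redemptionCounter = 0
  let speechFrameCount = 0
  return (probability: number): "segment" | "misfire" | null => {
    if (probability >= opts.positiveSpeechThreshold) {
      // Speech-positive frame: (re)enter the speaking state.
      speaking = true
      redemptionCounter = 0
      speechFrameCount += 1
      return null
    }
    if (speaking && probability < opts.negativeSpeechThreshold) {
      // Speech-negative frame while speaking: count toward redemption.
      redemptionCounter += 1
      if (redemptionCounter >= opts.redemptionFrames) {
        const result =
          speechFrameCount >= opts.minSpeechFrames ? "segment" : "misfire"
        speaking = false
        redemptionCounter = 0
        speechFrameCount = 0
        return result
      }
    }
    // Probabilities between the two thresholds are effectively ignored.
    return null
  }
}
```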

The high-level APIs that follow all accept certain common configuration parameters that modify the VAD algorithm.

  • positiveSpeechThreshold: number - determines the threshold over which a probability is considered to indicate the presence of speech.
  • negativeSpeechThreshold: number - determines the threshold under which a probability is considered to indicate the absence of speech.
  • redemptionFrames: number - number of speech-negative frames to wait before ending a speech segment.
  • frameSamples: number - the size of a frame in samples (1536 by default; probably should not be changed).
  • preSpeechPadFrames: number - number of audio frames to prepend to a speech segment.
  • minSpeechFrames: number - minimum number of speech-positive frames for a speech segment.
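As an illustration, these parameters are passed together in a plain options object. The specific values below are made up for the example (apart from frameSamples) and are not the library's defaults:

```typescript
// Illustrative VAD tuning values; not the library defaults.
const vadOptions = {
  positiveSpeechThreshold: 0.85, // enter "speaking" above this probability
  negativeSpeechThreshold: 0.35, // count toward redemption below this
  redemptionFrames: 8,           // speech-negative frames before a segment ends
  frameSamples: 1536,            // default frame size; best left unchanged
  preSpeechPadFrames: 1,         // frames prepended to each segment
  minSpeechFrames: 3,            // fewer speech-positive frames => misfire
}
// e.g. const myvad = await vad.MicVAD.new(vadOptions)
```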

API

NonRealTimeVAD (Node + Browser)

This API can be used if you have a Float32Array of audio samples and would like to extract chunks of speech audio with timestamps. This is useful if you want to run the VAD on audio from a file instead of real-time audio from a microphone.

The API works as follows:

const options: Partial<vad.NonRealTimeVADOptions> = { /* ... */ }
const myvad = await vad.NonRealTimeVAD.new(options)
const audioFileData = ... // get audio (Float32Array) from a file
const nativeSampleRate = ... // sample rate of that audio
for await (const { audio, start, end } of myvad.run(audioFileData, nativeSampleRate)) {
  // do stuff with audio, start, end
}

This API only takes the options that customize the VAD algorithm, discussed above.

The speech segments and timestamps are made accessible through an async iterator. In the example above, audio is a Float32Array of audio samples (of sample rate 16000) of a segment of speech, start is a number indicating the milliseconds since the start of the audio that the speech segment began, and end is a number indicating the milliseconds since the start of the audio that the speech segment ended.
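Because the returned audio is resampled to 16000 Hz, the millisecond timestamps can be mapped back to sample indices with a small helper (hypothetical, not part of the library):

```typescript
// The VAD resamples audio to 16 kHz internally.
const VAD_SAMPLE_RATE = 16000

// Convert a millisecond timestamp from run() into a sample index
// in the 16 kHz audio.
function msToSampleIndex(ms: number): number {
  return Math.round((ms / 1000) * VAD_SAMPLE_RATE)
}
```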

MicVAD (Browser only)

This API is used to run the VAD in real-time on microphone input in a browser. It has a callback-based API. It works as follows:

const myvad = await vad.MicVAD.new({
  onFrameProcessed: (probabilities) => { ... },
  onSpeechStart: () => { ... },
  onVADMisfire: () => { ... },
  onSpeechEnd: (audio) => { ... },
})
myvad.start()
// myvad.pause, myvad.start, ...

It also takes the algorithm-modifying parameters defined above.
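Assuming the audio passed to onSpeechEnd uses the same 16 kHz Float32Array format as the non-real-time API, the duration of a captured segment can be computed like this (hypothetical helper, not part of the library):

```typescript
// Assumption: speech segments are Float32Arrays of 16 kHz samples.
const VAD_SAMPLE_RATE = 16000

// Duration in seconds of a speech segment returned by the VAD.
function segmentDurationSeconds(audio: Float32Array): number {
  return audio.length / VAD_SAMPLE_RATE
}
```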

References

[1] Silero Team. (2021). Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier. GitHub repository, https://github.com/snakers4/silero-vad.