diarization-js

v0.1.0

Published

a month ago

Speaker diarization in JavaScript: ONNX port of the pyannote/speaker-diarization-community-1 pipeline. Runs in the browser (WebGPU/WASM) or Node.


briox

speaker-diarization diarization pyannote onnx onnxruntime webgpu vbx audio speech asr

Speaker diarization in JavaScript. ONNX port of the pyannote/speaker-diarization-community-1 pipeline. Runs in Node.js and the browser via onnxruntime-node / onnxruntime-web (WebGPU + WASM).

Status: alpha. Validated end-to-end at 1.73% DER vs the official pyannote.audio==4.0.4 Python pipeline on a 7 min mono recording.

What it does

Three building blocks, mirroring the Python pipeline:

Segmentation (pyannote/segmentation-3.0, ONNX): slides a 10 s window and outputs per-frame powerset speaker activations.
Embedding (WeSpeaker ResNet34, ONNX): one 256-d speaker embedding per chunk-local-speaker, with overlap exclusion.
Clustering: AHC seed (centroid linkage on L2-normalized embeddings) → VBx (Bayesian HMM clustering, no-HMM variant from pyannote.audio 4.0.4) on PLDA-projected features.

Plus a reconstruction step (sliding-window aggregation, top-k by instantaneous speaker count) and turn-list emission.
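To make "powerset speaker activations" concrete, here is a minimal sketch of decoding one frame. It assumes the class layout described on the upstream pyannote/segmentation-3.0 model card (7 classes: non-speech, three single speakers, three speaker pairs) in the order shown below. `POWERSET_CLASSES` and `powersetToMultilabel` are hypothetical names for illustration, not exports of this package, and the package's own decoding may differ.

```typescript
// Assumed class order (from the upstream model card, not confirmed for this port):
// 0: no speech, 1-3: one speaker, 4-6: two overlapping speakers.
const POWERSET_CLASSES: number[][] = [[], [0], [1], [2], [0, 1], [0, 2], [1, 2]];

// Turn one frame of 7 powerset scores into a 3-slot 0/1 activity vector
// by picking the highest-scoring class and marking its speakers as active.
function powersetToMultilabel(scores: ArrayLike<number>): number[] {
  let best = 0;
  for (let c = 1; c < POWERSET_CLASSES.length; c++) {
    if (scores[c] > scores[best]) best = c;
  }
  const active = [0, 0, 0];
  for (const speaker of POWERSET_CLASSES[best]) active[speaker] = 1;
  return active;
}
```

Summing the resulting vector gives the instantaneous speaker count for that frame, which is the quantity the reconstruction step uses for its top-k selection.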

Install

npm install diarization-js onnxruntime-node     # Node
npm install diarization-js onnxruntime-web      # Browser

You will also need the ONNX + JSON artifacts produced by the export script in scripts/export-models/ (and scripts/export-models-v4/ for the post-vbx_setup PLDA matrices).

Usage (Node)

import { readFileSync } from "node:fs";
import * as ort from "onnxruntime-node";
import { DiarizationPipeline, decodeWav, type OrtRuntime } from "diarization-js";

const pldaJson = JSON.parse(readFileSync("artifacts/plda-params-vbx.json", "utf8"));
const pipeline = await DiarizationPipeline.create({
  ort: ort as unknown as OrtRuntime,
  segmentationModel: "artifacts/segmentation-3.0.onnx",
  embeddingModel: "artifacts/embedding-resnet34.onnx",
  pldaParamsJson: pldaJson,
});

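const wav = readFileSync("meeting.wav");
// A Node Buffer can be a view into a larger ArrayBuffer, so pass only this file's bytes.
const wavBytes = wav.buffer.slice(wav.byteOffset, wav.byteOffset + wav.byteLength) as ArrayBuffer;
const { samples, sampleRate } = decodeWav(wavBytes);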

const { result, metrics } = await pipeline.run(samples, sampleRate);
console.log(`Detected ${result.numSpeakers} speakers, ${result.segments.length} turns`);
console.log(`RTF = ${metrics.rtf.toFixed(3)}`);
for (const turn of result.segments) {
  console.log(`${turn.start.toFixed(2)} - ${turn.end.toFixed(2)}: ${turn.speaker}`);
}
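Once the turn list is available, a common follow-up is total talk time per speaker. The sketch below relies only on the `start`, `end`, and `speaker` fields used in the loop above. The `Turn` interface and `talkTimeBySpeaker` helper are hypothetical, and treating `speaker` as a string is an assumption.

```typescript
// Hypothetical local type mirroring the fields printed in the usage example.
interface Turn {
  start: number;   // seconds
  end: number;     // seconds
  speaker: string; // assumed to be a string label
}

// Sum the duration of every turn, grouped by speaker label.
function talkTimeBySpeaker(turns: Turn[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const turn of turns) {
    const previous = totals.get(turn.speaker) ?? 0;
    totals.set(turn.speaker, previous + (turn.end - turn.start));
  }
  return totals;
}
```

Calling `talkTimeBySpeaker(result.segments)` should work if the segment objects have this shape. Overlapping turns from different speakers are counted once per speaker.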

Usage (browser)

import * as ort from "onnxruntime-web/webgpu";
import { DiarizationPipeline, type OrtRuntime } from "diarization-js";

const pipeline = await DiarizationPipeline.create({
  ort: ort as unknown as OrtRuntime,
  segmentationModel: new Uint8Array(await fetch("/models/segmentation-3.0.onnx").then(r => r.arrayBuffer())),
  embeddingModel: new Uint8Array(await fetch("/models/embedding-resnet34.onnx").then(r => r.arrayBuffer())),
  pldaParamsJson: await fetch("/models/plda-params-vbx.json").then(r => r.json()),
});

See apps/playground/ for a full Vite app reference.
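In the browser, audio usually arrives as a Web Audio `AudioBuffer`, whose `getChannelData(i)` method returns one `Float32Array` per channel. The validation above was done on a mono recording, so a reasonable precaution is to average the channels before calling `pipeline.run`. Whether the pipeline requires mono input is an assumption here, and `downmixToMono` is a hypothetical helper, not part of the package.

```typescript
// Average equal-length channels into a single mono signal.
// Input: one Float32Array per channel, e.g. from AudioBuffer.getChannelData(i).
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i];
  }
  for (let i = 0; i < length; i++) mono[i] /= channels.length;
  return mono;
}
```

The sample rate to pass alongside the mono samples would be the `AudioBuffer`'s own `sampleRate` property.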

Configuration

Defaults match the community-1 config:

| Option | Default | Notes |
|---|---|---|
| windowSec | 10 | Segmentation window in seconds (the model was trained on 10 s) |
| segmentationStep | 0.1 | Step as a fraction of the window: 10% → 1 s hop, so consecutive windows overlap by 9 s |
| embeddingExcludeOverlap | true | Skip frames where multiple speakers are active |
| ahcThreshold | 0.6 | Centroid linkage cut on L2-normalized embeddings |
| vbxFa | 0.07 | VBx scaling factor |
| vbxFb | 0.8 | VBx speaker regularization |
| vbxMaxIters | 20 | VB iterations cap |
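To see how `windowSec` and `segmentationStep` interact, the sketch below lists the window start times for a recording of a given length. With the defaults, consecutive windows start 1 s apart. `chunkStarts` is a hypothetical helper, and how the package handles the trailing partial window (here: one extra window, presumably padded) is an assumption.

```typescript
// Enumerate sliding-window start times in seconds.
// stepRatio is the step expressed as a fraction of the window length.
function chunkStarts(durationSec: number, windowSec = 10, stepRatio = 0.1): number[] {
  const hop = windowSec * stepRatio;
  const starts: number[] = [];
  let i = 0;
  // Full windows that fit entirely inside the recording.
  while (i * hop + windowSec <= durationSec) {
    starts.push(i * hop);
    i++;
  }
  // Assumption: one more window covers any leftover audio at the end.
  const covered = starts.length > 0 ? starts[starts.length - 1] + windowSec : 0;
  if (covered < durationSec) starts.push(i * hop);
  return starts;
}
```

For a 7 min recording this gives several hundred windows, which is why the segmentation model dominates the run time unless the step is increased.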

API

DiarizationPipeline.create(config) → loads ONNX sessions and PLDA params
pipeline.run(waveform, sampleRate, { onProgress }) → { result, metrics }
decodeWav(arrayBuffer) → { samples, sampleRate, numChannels } (Node-side WAV decoder)
pyannoteFbank(samples) → Kaldi-compatible fbank (validated at 1e-4 vs torchaudio.compliance.kaldi)
clusterVbx(features, phi, ahcInit, opts) → low-level VBx, also exported
diarizationErrorRate(reference, hypothesis) → DER metric with Hungarian assignment
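For reference, the diarization error rate is conventionally defined as the fraction of reference speech time that is wrongly attributed. This is the standard definition; whether `diarizationErrorRate` applies a forgiveness collar or skips overlapped speech is not stated here.

```latex
\mathrm{DER} = \frac{T_{\text{miss}} + T_{\text{false alarm}} + T_{\text{confusion}}}{T_{\text{reference speech}}}
```

The confusion term is computed after the optimal one-to-one mapping between reference and hypothesis speakers, which is where the Hungarian assignment comes in.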

License

MIT for this package's source. The ONNX artifacts inherit their upstream Hugging Face model card licenses:

pyannote/segmentation-3.0 — MIT
pyannote/wespeaker-voxceleb-resnet34-LM — CC-BY-4.0 (pyannote wrapper around the WeSpeaker pretrained checkpoint, not fine-tuned)
pyannote/speaker-diarization-community-1 (PLDA matrices) — CC-BY-4.0

Acknowledgements

This is a port. The heavy lifting was done by the pyannote.audio team, the WeSpeaker project, and BUT VBx (Landini et al., 2022). pyannote.audio 4.0.4's utils/vbx.py is the direct reference for the clustering port.
