diarization-js
v0.1.0
Speaker diarization in JavaScript: ONNX port of the pyannote/speaker-diarization-community-1 pipeline. Runs in the browser (WebGPU/WASM) or Node.
Speaker diarization in JavaScript. ONNX port of the
pyannote/speaker-diarization-community-1
pipeline. Runs in Node.js and the browser via onnxruntime-node /
onnxruntime-web (WebGPU + WASM).
Status: alpha. Validated end-to-end at 1.73% DER vs the official
pyannote.audio==4.0.4 Python pipeline on a 7 min mono recording.
What it does
Three building blocks, mirroring the Python pipeline:
- Segmentation (pyannote/segmentation-3.0, ONNX): slides a 10 s window and outputs per-frame powerset speaker activations.
- Embedding (WeSpeaker ResNet34, ONNX): one 256-d speaker embedding per chunk-local-speaker, with overlap exclusion.
- Clustering: AHC seed (centroid linkage on L2-normalized embeddings) → VBx (Bayesian HMM clustering, no-HMM variant from pyannote.audio 4.0.4) on PLDA-projected features.
Plus a reconstruction step (sliding-window aggregation, top-k by instantaneous speaker count) and turn-list emission.
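
The "top-k by instantaneous speaker count" rule can be sketched as follows. This is a minimal illustration, not the package's code: `topKBySpeakerCount` is a hypothetical helper, and it assumes the aggregated activations arrive as one score per frame per global speaker, alongside a per-frame speaker count estimate.

```typescript
// Sketch only: `topKBySpeakerCount` is a hypothetical helper, not the package API.
// activations[t][s]: aggregated score of global speaker s at frame t (assumed layout).
// counts[t]: estimated number of simultaneously active speakers at frame t.
function topKBySpeakerCount(activations: number[][], counts: number[]): boolean[][] {
  return activations.map((frame, t) => {
    // Clamp k to the valid range [0, number of speakers].
    const k = Math.min(Math.max(0, Math.round(counts[t] ?? 0)), frame.length);
    // Rank speaker indices by descending score for this frame.
    const ranked = frame
      .map((score, speaker) => ({ score, speaker }))
      .sort((a, b) => b.score - a.score);
    const active = frame.map(() => false);
    // Only the k highest-scoring speakers are marked active.
    ranked.slice(0, k).forEach(({ speaker }) => {
      active[speaker] = true;
    });
    return active;
  });
}

const activeFrames = topKBySpeakerCount(
  [
    [0.9, 0.1, 0.3], // one speaker expected: speaker 0 wins
    [0.6, 0.7, 0.2], // two speakers expected: speakers 1 and 0
    [0.2, 0.1, 0.1], // silence: nobody is marked active
  ],
  [1, 2, 0],
);
console.log(activeFrames);
```

Runs of consecutive active frames for a given speaker can then be merged into turns.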
Install
npm install diarization-js onnxruntime-node # Node
npm install diarization-js onnxruntime-web # Browser
You will also need the ONNX + JSON artifacts produced by the export script in
scripts/export-models/ (and scripts/export-models-v4/ for the
post-vbx_setup PLDA matrices).
Usage (Node)
import { readFileSync } from "node:fs";
import * as ort from "onnxruntime-node";
import { DiarizationPipeline, decodeWav, type OrtRuntime } from "diarization-js";
const pldaJson = JSON.parse(readFileSync("artifacts/plda-params-vbx.json", "utf8"));
const pipeline = await DiarizationPipeline.create({
ort: ort as unknown as OrtRuntime,
segmentationModel: "artifacts/segmentation-3.0.onnx",
embeddingModel: "artifacts/embedding-resnet34.onnx",
pldaParamsJson: pldaJson,
});
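const wav = readFileSync("meeting.wav");
// A Node Buffer can be a view into a larger pooled ArrayBuffer, so pass only its own bytes.
const { samples, sampleRate } = decodeWav(wav.buffer.slice(wav.byteOffset, wav.byteOffset + wav.byteLength));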
const { result, metrics } = await pipeline.run(samples, sampleRate);
console.log(`Detected ${result.numSpeakers} speakers, ${result.segments.length} turns`);
console.log(`RTF = ${metrics.rtf.toFixed(3)}`);
for (const turn of result.segments) {
console.log(`${turn.start.toFixed(2)} - ${turn.end.toFixed(2)}: ${turn.speaker}`);
}
Usage (browser)
import * as ort from "onnxruntime-web/webgpu";
import { DiarizationPipeline, type OrtRuntime } from "diarization-js";
const pipeline = await DiarizationPipeline.create({
ort: ort as unknown as OrtRuntime,
segmentationModel: new Uint8Array(await fetch("/models/segmentation-3.0.onnx").then(r => r.arrayBuffer())),
embeddingModel: new Uint8Array(await fetch("/models/embedding-resnet34.onnx").then(r => r.arrayBuffer())),
pldaParamsJson: await fetch("/models/plda-params-vbx.json").then(r => r.json()),
});
See apps/playground/ for a full Vite app reference.
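
In either runtime, the turns in `result.segments` can be serialized for downstream tools. Here is a minimal sketch that writes them as RTTM `SPEAKER` records. The `Turn` shape is inferred from the Node example (start and end in seconds, a speaker label assumed to be a string), and `toRttm` and the file ID are hypothetical names, not part of the package.

```typescript
// Assumed turn shape, inferred from the usage example; not taken from the package's types.
interface Turn { start: number; end: number; speaker: string; }

// Hypothetical helper: one RTTM "SPEAKER" record per turn.
// Fields: type, file ID, channel, onset, duration, then <NA> placeholders around the speaker name.
function toRttm(turns: Turn[], fileId: string): string {
  return turns
    .map((t) => {
      const duration = t.end - t.start;
      return `SPEAKER ${fileId} 1 ${t.start.toFixed(3)} ${duration.toFixed(3)} <NA> <NA> ${t.speaker} <NA> <NA>`;
    })
    .join("\n");
}

console.log(toRttm([{ start: 0.5, end: 2.25, speaker: "SPEAKER_00" }], "meeting"));
// SPEAKER meeting 1 0.500 1.750 <NA> <NA> SPEAKER_00 <NA> <NA>
```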
Configuration
Defaults match the community-1 config:
| Option | Default | Notes |
|---|---|---|
| windowSec | 10 | Segmentation window (model was trained on 10 s) |
| segmentationStep | 0.1 | Hop as a fraction of the window: 10% → 1 s hop, 9 s overlap between consecutive windows |
| embeddingExcludeOverlap | true | Skip frames where multiple speakers are active |
| ahcThreshold | 0.6 | Centroid linkage cut on L2-normalized embeddings |
| vbxFa | 0.07 | VBx scaling factor |
| vbxFb | 0.8 | VBx speaker regularization |
| vbxMaxIters | 20 | VB iterations cap |
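
To make the window arithmetic concrete, here is a small sketch of how `windowSec` and `segmentationStep` determine the window start times. `chunkStarts` is a hypothetical helper, and it assumes that only windows lying fully inside the recording are listed; the real pipeline may also add a final padded window.

```typescript
// Hypothetical helper, not part of the package API.
function chunkStarts(durationSec: number, windowSec = 10, segmentationStep = 0.1): number[] {
  const hopSec = windowSec * segmentationStep; // 10 s * 0.1 = 1 s between window starts
  const starts: number[] = [];
  // Assumption: keep only windows that fit entirely inside the recording.
  for (let i = 0; i * hopSec + windowSec <= durationSec + 1e-9; i++) {
    starts.push(i * hopSec);
  }
  return starts;
}

const windowStarts = chunkStarts(7 * 60);
console.log(windowStarts.length); // 411 full windows for a 7 min recording
```

With the defaults, consecutive windows share 9 s of audio, so most frames are covered by up to ten windows before aggregation.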
API
- DiarizationPipeline.create(config) → loads ONNX sessions and PLDA params
- pipeline.run(waveform, sampleRate, { onProgress }) → { result, metrics }
- decodeWav(arrayBuffer) → { samples, sampleRate, numChannels } (Node-side WAV decoder)
- pyannoteFbank(samples) → Kaldi-compatible fbank (validated at 1e-4 vs torchaudio.compliance.kaldi)
- clusterVbx(features, phi, ahcInit, opts) → low-level VBx, also exported
- diarizationErrorRate(reference, hypothesis) → DER metric with Hungarian assignment
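
For intuition about what a DER metric measures, here is a simplified, self-contained sketch. It is not the package's `diarizationErrorRate`: `simpleDer` and `bestOverlap` are hypothetical names, the `Turn` shape is assumed from the usage example, scoring is frame-based at 10 ms with no collar, and an exhaustive search replaces Hungarian assignment (adequate only for a handful of speakers).

```typescript
// Sketch only: `simpleDer` and `bestOverlap` are hypothetical helpers, not the package API.
interface Turn { start: number; end: number; speaker: string; }

// Largest total co-active frame count over all one-to-one speaker mappings.
// Exhaustive search stands in for Hungarian assignment.
function bestOverlap(overlap: number[][], row = 0, used = new Set<number>()): number {
  if (row === overlap.length) return 0;
  let best = bestOverlap(overlap, row + 1, used); // leave this reference speaker unmapped
  overlap[row].forEach((frames, col) => {
    if (used.has(col)) return;
    used.add(col);
    best = Math.max(best, frames + bestOverlap(overlap, row + 1, used));
    used.delete(col);
  });
  return best;
}

function simpleDer(reference: Turn[], hypothesis: Turn[], frameSec = 0.01): number {
  const endSec = Math.max(0, ...reference.map((t) => t.end), ...hypothesis.map((t) => t.end));
  const numFrames = Math.round(endSec / frameSec);
  const refIds = Array.from(new Set(reference.map((t) => t.speaker)));
  const hypIds = Array.from(new Set(hypothesis.map((t) => t.speaker)));

  // grid[f][s] is true when speaker s is active in frame f.
  const rasterize = (turns: Turn[], ids: string[]): boolean[][] => {
    const grid = Array.from({ length: numFrames }, () => ids.map(() => false));
    turns.forEach((t) => {
      const s = ids.indexOf(t.speaker);
      const last = Math.min(numFrames, Math.round(t.end / frameSec));
      for (let f = Math.max(0, Math.round(t.start / frameSec)); f < last; f++) grid[f][s] = true;
    });
    return grid;
  };
  const ref = rasterize(reference, refIds);
  const hyp = rasterize(hypothesis, hypIds);

  let refFrames = 0; // total reference speech, counted once per active speaker
  let maxFrames = 0; // sum over frames of max(#reference speakers, #hypothesis speakers)
  const overlap = refIds.map(() => hypIds.map(() => 0));
  for (let f = 0; f < numFrames; f++) {
    const nRef = ref[f].filter(Boolean).length;
    const nHyp = hyp[f].filter(Boolean).length;
    refFrames += nRef;
    maxFrames += Math.max(nRef, nHyp);
    ref[f].forEach((r, i) => hyp[f].forEach((h, j) => { if (r && h) overlap[i][j] += 1; }));
  }
  // Per frame, miss + false alarm + confusion = max(nRef, nHyp) - correctly mapped speakers.
  // No guard for an empty reference: the division then yields NaN or Infinity.
  return (maxFrames - bestOverlap(overlap)) / refFrames;
}

const der = simpleDer(
  [{ start: 0, end: 5, speaker: "A" }, { start: 5, end: 10, speaker: "B" }],
  [{ start: 0, end: 6, speaker: "X" }, { start: 6, end: 10, speaker: "Y" }],
);
console.log(der.toFixed(3)); // 0.100: 1 s of speaker confusion over 10 s of reference speech
```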
License
MIT for this package's source. The ONNX artifacts inherit their upstream Hugging Face model card licenses:
- pyannote/segmentation-3.0: MIT
- pyannote/wespeaker-voxceleb-resnet34-LM: CC-BY-4.0 (pyannote wrapper around the WeSpeaker pretrained checkpoint, not fine-tuned)
- pyannote/speaker-diarization-community-1 (PLDA matrices): CC-BY-4.0
Acknowledgements
This is a port. The heavy lifting was done by the
pyannote.audio team, the
WeSpeaker project, and BUT VBx
(Landini et al., 2022). pyannote.audio 4.0.4's utils/vbx.py is the direct
reference for the clustering port.
