streaming-sortformer-node
Native Node.js bindings for SortFormer streaming speaker diarization. Provides real-time diarization of up to 4 concurrent speakers via native C/GGML inference, with async, thread-safe operation and CoreML acceleration on Apple Silicon.
Installation
npm install streaming-sortformer-node
Platforms: macOS arm64, macOS x64
Requirements: Node.js >= 18
Quick Start
Offline Diarization
import { Sortformer } from 'streaming-sortformer-node';
const model = await Sortformer.load('./model.gguf');
const audio = new Float32Array(/* 16kHz mono samples */);
const result = await model.diarize(audio, {
threshold: 0.5,
medianFilter: 11
});
console.log(result.rttm);
console.log(`Processed ${result.frameCount} frames`);
console.log(`Model outputs ${result.maxSpeakers} speaker channels`);
model.close();
Streaming (Low-Level)
import { Sortformer } from 'streaming-sortformer-node';
const model = await Sortformer.load('./model.gguf');
const session = model.createStreamingSession({ preset: '2s' });
// Feed audio chunks as they arrive
const result1 = await session.feed(chunk1);
const result2 = await session.feed(chunk2);
// Flush at end of stream
const final = await session.flush();
console.log(`Total frames: ${session.totalFrames}`);
session.close();
model.close();
Streaming (Mid-Level — Boolean Activity Frames)
import { Sortformer } from 'streaming-sortformer-node';
const model = await Sortformer.load('./model.gguf');
const stream = model.createActivityStream({
preset: '2s',
threshold: 0.5,
medianFilter: 11
});
// Feed audio chunks — get boolean "who is speaking" frames
const result = await stream.feed(chunk1);
// result.activity is a Uint8Array of 0s and 1s
// Layout: [spk0_f0, spk1_f0, spk2_f0, spk3_f0, spk0_f1, ...]
for (let i = 0; i < result.frameCount; i++) {
const globalFrame = result.startFrame + i;
const time = globalFrame * 0.08; // 80ms per frame
const speakers = [];
for (let s = 0; s < 4; s++) {
if (result.activity[i * 4 + s] === 1) speakers.push(s);
}
if (speakers.length > 0) {
console.log(`${time.toFixed(2)}s: speakers ${speakers.join(', ')} active`);
}
}
// Flush at end of stream to settle remaining frames
const final = await stream.flush();
stream.close();
model.close();
Streaming (High-Level — Speaker Segments)
import { Sortformer } from 'streaming-sortformer-node';
const model = await Sortformer.load('./model.gguf');
const stream = model.createDiarizeStream({
preset: '2s',
threshold: 0.5,
medianFilter: 11,
onSegment: (segment) => {
console.log(`Speaker ${segment.speaker}: ${segment.start}s - ${segment.start + segment.duration}s`);
}
});
// Feed audio chunks
await stream.feed(chunk1);
await stream.feed(chunk2);
// Flush and close open segments
const final = await stream.flush();
stream.close();
model.close();
API Reference
Sortformer
static async load(modelPath: string, options?: LoadOptions): Promise<Sortformer>
Load a SortFormer model from a GGUF file.
Parameters:
- modelPath (string): Path to the GGUF model file
- options (LoadOptions, optional): Currently empty ({})
Returns: Promise<Sortformer>
Throws: Error if model file not found or invalid
Example:
const model = await Sortformer.load('./model.gguf');
async diarize(audio: Float32Array, options?: DiarizeOptions): Promise<DiarizeResult>
Run offline diarization on complete audio. Processes entire audio at once with no streaming constraints.
Parameters:
- audio (Float32Array): Audio samples at 16kHz mono
- options (DiarizeOptions, optional):
  - threshold?: number — Speaker activity threshold, 0.0-1.0 (default: 0.5)
  - medianFilter?: number — Median filter window size, odd integer >= 1 (default: 11)
Returns: Promise<DiarizeResult>
- rttm: string — RTTM format output
- predictions: Float32Array — Raw per-frame predictions, shape [frameCount, 4]
- frameCount: number — Number of frames
- maxSpeakers: number — Always 4 (model outputs 4 speaker channels)
Throws: Error if model is closed, audio is invalid, or inference fails
Example:
const result = await model.diarize(audio, {
threshold: 0.5,
medianFilter: 11
});
createStreamingSession(options?: StreamingSessionOptions): StreamingSession
Create a streaming session for incremental audio processing. Returns raw frame predictions.
Parameters:
- options (StreamingSessionOptions, optional):
  - preset?: StreamingPreset — Latency preset: 'low' | '2s' | '3s' | '5s' (default: '2s')
Returns: StreamingSession
Throws: Error if model is closed
Example:
const session = model.createStreamingSession({ preset: 'low' });
createActivityStream(options?: ActivityStreamOptions): ActivityStream
Create a mid-level streaming activity stream. Performs threshold → median filter and returns boolean speaker activity frames. Use this when you need clean "who is speaking right now" data without dealing with raw probabilities or segment boundary detection yourself.
Parameters:
- options (ActivityStreamOptions, optional):
  - preset?: StreamingPreset — Latency preset (default: '2s')
  - threshold?: number — Speaker activity threshold (default: 0.5)
  - medianFilter?: number — Median filter window size (default: 11)
  - onFrames?: (predictions: Float32Array, frameCount: number) => void — Callback for raw predictions before processing
Returns: ActivityStream
Throws: Error if model is closed
Example:
const stream = model.createActivityStream({
preset: '2s',
threshold: 0.5,
medianFilter: 11
});
createDiarizeStream(options?: DiarizeStreamOptions): DiarizeStream
Create a high-level streaming diarization stream. Performs threshold → median filter → segment extraction, emitting completed speaker segments via callbacks. Internally wraps an ActivityStream.
Parameters:
- options (DiarizeStreamOptions, optional):
  - preset?: StreamingPreset — Latency preset (default: '2s')
  - threshold?: number — Speaker activity threshold (default: 0.5)
  - medianFilter?: number — Median filter window size (default: 11)
  - onSegment?: (segment: SpeakerSegment) => void — Callback for completed segments
  - onFrames?: (predictions: Float32Array, frameCount: number) => void — Callback for raw frame predictions
Returns: DiarizeStream
Throws: Error if model is closed
Example:
const stream = model.createDiarizeStream({
preset: '2s',
onSegment: (seg) => console.log(`Speaker ${seg.speaker}: ${seg.start}s`)
});
close(): void
Close the model and free native resources. Idempotent.
isClosed(): boolean
Check if the model has been closed.
StreamingSession
async feed(audio: Float32Array): Promise<FeedResult>
Feed audio samples and get predictions for this chunk.
Parameters:
- audio (Float32Array): Audio samples at 16kHz mono
Returns: Promise<FeedResult>
- predictions: Float32Array — Per-frame predictions, shape [frameCount, 4]
- frameCount: number — Number of new frames
Throws: Error if session is closed or audio is invalid
Example:
const result = await session.feed(audioChunk);
console.log(`Got ${result.frameCount} frames`);
async flush(): Promise<FeedResult>
Flush remaining buffered audio at the end of a stream. Call this when the audio stream ends so any audio still held in the internal buffer is processed.
Returns: Promise<FeedResult>
Throws: Error if session is closed
reset(): void
Reset streaming state for a new audio stream. Clears all internal buffers (spkcache, fifo, mel overlap) while keeping the model loaded.
Throws: Error if session is closed
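Example (illustrative sketch; recordingA and recordingB stand in for your own arrays of 16kHz Float32Array chunks):
// Sketch: reuse one session for two unrelated audio streams.
for (const chunk of recordingA) await session.feed(chunk);
await session.flush();
session.reset(); // clears spkcache, fifo, and mel overlap; the model stays loaded
for (const chunk of recordingB) await session.feed(chunk);
await session.flush();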
close(): void
Close the session and free native resources. Idempotent.
get totalFrames: number
Get total frames output so far.
get isClosed: boolean
Check if the session is closed.
ActivityStream
Mid-level API that wraps StreamingSession and returns boolean speaker activity frames after applying threshold and median filter. Each frame tells you definitively which speakers are active — no raw probabilities to interpret.
Added latency: The median filter needs future context to settle predictions. With the default medianFilter: 11, this adds a delay of 5 frames (400ms) on top of the model's inference latency. Frames within this window are "unsettled" and held back until enough future context arrives.
async feed(audio: Float32Array): Promise<ActivityFeedResult>
Feed audio samples, run inference, apply threshold + median filter, and return settled boolean activity frames.
Parameters:
- audio (Float32Array): Audio samples at 16kHz mono
Returns: Promise<ActivityFeedResult>
- activity: Uint8Array — Boolean activity per speaker, flat layout [spk0_f0, spk1_f0, spk2_f0, spk3_f0, spk0_f1, ...]. Values are 0 (silent) or 1 (speaking).
- frameCount: number — Number of settled frames in this result. May be 0 if the engine is still buffering audio or waiting for median filter context.
- startFrame: number — Global frame index of the first frame in this result. Use this to compute timestamps: time = (startFrame + i) * 0.08.
Throws: Error if stream is closed
Example:
const result = await stream.feed(audioChunk);
for (let i = 0; i < result.frameCount; i++) {
const time = (result.startFrame + i) * 0.08;
for (let s = 0; s < 4; s++) {
if (result.activity[i * 4 + s] === 1) {
console.log(`${time.toFixed(2)}s: speaker ${s} is speaking`);
}
}
}
async flush(): Promise<ActivityFlushResult>
Flush remaining buffered audio and settle all pending frames. Call when the audio stream ends. The median filter zero-pads future context to produce final settled frames.
Returns: Promise<ActivityFlushResult>
- activity: Uint8Array — Final boolean activity frames
- frameCount: number — Number of final frames
- startFrame: number — Global frame index of the first frame
Throws: Error if stream is closed
reset(): void
Reset for a new audio stream. Clears internal buffers, median filter state, and frame counter.
Throws: Error if stream is closed
close(): void
Close the stream and free native resources. Idempotent.
get totalFrames: number
Get total frames output by the underlying session so far (including unsettled frames not yet returned).
get isClosed: boolean
Check if the stream is closed.
DiarizeStream
Wraps an ActivityStream internally. Reads boolean activity frames and detects segment boundaries (speaker on→off transitions), emitting SpeakerSegment objects.
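Conceptually, the boundary detection is close to the following sketch (illustrative only, not the actual implementation): keep one open segment per speaker, open it on a 0→1 transition, and emit it on a 1→0 transition.
// Illustrative sketch: turn boolean activity frames into speaker segments.
// openStart[s] is the global start frame of speaker s's open segment, or -1.
const openStart = [-1, -1, -1, -1];
const segments = [];

function consumeActivity(activity, frameCount, startFrame) {
  for (let i = 0; i < frameCount; i++) {
    const frame = startFrame + i;
    for (let s = 0; s < 4; s++) {
      const active = activity[i * 4 + s] === 1;
      if (active && openStart[s] === -1) {
        openStart[s] = frame; // speaker turned on: open a segment
      } else if (!active && openStart[s] !== -1) {
        segments.push({
          speaker: s,
          start: openStart[s] * 0.08,
          duration: (frame - openStart[s]) * 0.08,
        }); // speaker turned off: close the segment
        openStart[s] = -1;
      }
    }
  }
}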
async feed(audio: Float32Array): Promise<DiarizeStreamFeedResult>
Feed audio samples, run inference, and post-process predictions. Returns segments that completed during this call.
Parameters:
- audio (Float32Array): Audio samples at 16kHz mono
Returns: Promise<DiarizeStreamFeedResult>
- frameCount: number — Number of new frames from inference
- segments: SpeakerSegment[] — Segments completed during this feed
Throws: Error if stream is closed
Example:
const result = await stream.feed(audioChunk);
for (const seg of result.segments) {
console.log(`Speaker ${seg.speaker}: ${seg.start}s for ${seg.duration}s`);
}
async flush(): Promise<DiarizeStreamFlushResult>
Flush remaining buffered frames at end of stream. Zero-pads future frames to settle all pending predictions and closes any open segments.
Returns: Promise<DiarizeStreamFlushResult>
- frameCount: number — Number of final frames
- segments: SpeakerSegment[] — All segments that completed during flush
Throws: Error if stream is closed
reset(): void
Reset for new audio stream.
Throws: Error if stream is closed
close(): void
Close the stream and free native resources. Idempotent.
get totalFrames: number
Get total frames processed so far.
get isClosed: boolean
Check if the stream is closed.
Types
interface LoadOptions {}
interface DiarizeOptions {
threshold?: number; // 0.0-1.0, default: 0.5
medianFilter?: number; // odd integer >= 1, default: 11
}
interface DiarizeResult {
rttm: string;
predictions: Float32Array; // shape [frameCount, 4]
frameCount: number;
maxSpeakers: number; // always 4
}
type StreamingPreset = 'low' | '2s' | '3s' | '5s';
interface StreamingSessionOptions {
preset?: StreamingPreset; // default: '2s'
}
interface FeedResult {
predictions: Float32Array; // shape [frameCount, 4]
frameCount: number;
}
interface SpeakerSegment {
speaker: number; // 0-3
start: number; // seconds
duration: number; // seconds
}
interface DiarizeStreamOptions {
preset?: StreamingPreset;
threshold?: number;
medianFilter?: number;
onSegment?: (segment: SpeakerSegment) => void;
onFrames?: (predictions: Float32Array, frameCount: number) => void;
}
interface ActivityStreamOptions {
preset?: StreamingPreset;
threshold?: number; // 0.0-1.0, default: 0.5
medianFilter?: number; // odd integer >= 1, default: 11
onFrames?: (predictions: Float32Array, frameCount: number) => void;
}
interface ActivityFeedResult {
activity: Uint8Array; // flat [spk0_f0, spk1_f0, spk2_f0, spk3_f0, spk0_f1, ...], values 0 or 1
frameCount: number; // number of settled frames
startFrame: number; // global frame index of first frame
}
interface ActivityFlushResult {
activity: Uint8Array; // final settled frames
frameCount: number;
startFrame: number;
}
interface DiarizeStreamFeedResult {
frameCount: number;
segments: SpeakerSegment[];
}
interface DiarizeStreamFlushResult {
frameCount: number;
segments: SpeakerSegment[];
}
How Streaming Works
What feed() returns
feed() returns only the new frames from the chunk you just fed — not the full audio history. Each call's frames tile together sequentially.
const r1 = await session.feed(chunk1); // r1.frameCount = 30 (frames 0-29)
const r2 = await session.feed(chunk2); // r2.frameCount = 30 (frames 30-59)
const r3 = await session.feed(chunk3); // r3.frameCount = 30 (frames 60-89)
Internally, the model uses a speaker cache for full-history context, but it only outputs predictions for the new audio.
What's inside predictions
predictions is a flat Float32Array of raw sigmoid probabilities (0.0–1.0) for 4 speaker channels at each frame. Layout is [spk0_f0, spk1_f0, spk2_f0, spk3_f0, spk0_f1, spk1_f1, ...].
// Read speaker 2's probability at frame 5:
const prob = result.predictions[5 * 4 + 2];
These are raw values — no thresholding or filtering applied. If you want finished speaker segments, use createDiarizeStream() instead, which does threshold → median filter → segment extraction for you.
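If you only need a rough per-frame check, a plain threshold over the raw values is enough (a sketch, with no median filtering):
// Sketch: threshold raw probabilities into an active-speaker list per frame.
const THRESHOLD = 0.5;
for (let f = 0; f < result.frameCount; f++) {
  const active = [];
  for (let s = 0; s < 4; s++) {
    if (result.predictions[f * 4 + s] >= THRESHOLD) active.push(s);
  }
  if (active.length > 0) console.log(`frame ${f}: speakers ${active.join(', ')}`);
}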
Feed non-overlapping audio chunks
You feed sequential, non-overlapping audio chunks. The engine handles mel spectrogram overlap (352 samples) internally.
// Correct: sequential chunks
await session.feed(samples.slice(0, 48000));
await session.feed(samples.slice(48000, 96000));
// Wrong: don't overlap your audio
await session.feed(samples.slice(0, 48000));
await session.feed(samples.slice(47000, 96000)); // ❌ overlapping
Feed sizes and buffering
The model processes audio in internal chunks. If you haven't fed enough audio for a full chunk, the engine buffers it and returns frameCount: 0. Once enough accumulates, it processes and returns frames.
The model needs (chunkLen + rightContext) × 8 × 160 audio samples to produce its first chunk. After that, it needs chunkLen × 8 × 160 samples per chunk (the right context carries over internally).
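The feed sizes in the tables below follow directly from that formula. A small helper makes the arithmetic explicit (a sketch; the chunkLen and rightContext values are taken from the Streaming Presets table further down):
// Sketch: derive per-preset feed sizes from (chunkLen + rightContext) × 8 × 160.
const SAMPLES_PER_FRAME = 8 * 160; // 8x subsampling, 160-sample hop at 16kHz
const PRESETS = {
  low:  { chunkLen: 6,  rightContext: 7 },
  '2s': { chunkLen: 15, rightContext: 10 },
  '3s': { chunkLen: 30, rightContext: 7 },
  '5s': { chunkLen: 55, rightContext: 7 },
};

for (const [name, p] of Object.entries(PRESETS)) {
  const first = (p.chunkLen + p.rightContext) * SAMPLES_PER_FRAME;
  const steady = p.chunkLen * SAMPLES_PER_FRAME;
  console.log(`${name}: ${first} samples for the first chunk, ${steady} per chunk after`);
}
// '3s': 47360 samples for the first chunk, 38400 per chunk after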
Simplest approach — feed the same size every call:
| Preset | Feed size | Samples | First output after |
|--------|-----------|---------|-------------------|
| 'low' | 0.48s | 7,680 | 3rd call (1.44s) |
| '2s' | 1.20s | 19,200 | 2nd call (2.40s) |
| '3s' | 2.40s | 38,400 | 2nd call (4.80s) |
| '5s' | 4.40s | 70,400 | 2nd call (8.80s) |
The first call(s) return frameCount: 0 while the engine accumulates enough data. After that, every call returns exactly one chunk.
// Example: '3s' preset, feeding 38,400 samples each time
const FEED_SIZE = 38_400;
const r1 = await session.feed(audio.slice(0, FEED_SIZE)); // r1.frameCount = 0 (buffering)
const r2 = await session.feed(audio.slice(FEED_SIZE, FEED_SIZE*2)); // r2.frameCount = 30
const r3 = await session.feed(audio.slice(FEED_SIZE*2, FEED_SIZE*3)); // r3.frameCount = 30
Minimum-latency approach — feed a larger first chunk:
| Preset | First call | Subsequent calls | First output after |
|--------|-----------|-----------------|-------------------|
| 'low' | 16,640 samples (1.04s) | 7,680 samples (0.48s) | 1st call (1.04s) |
| '2s' | 32,000 samples (2.00s) | 19,200 samples (1.20s) | 1st call (2.00s) |
| '3s' | 47,360 samples (2.96s) | 38,400 samples (2.40s) | 1st call (2.96s) |
| '5s' | 79,360 samples (4.96s) | 70,400 samples (4.40s) | 1st call (4.96s) |
This gets predictions on the very first call by providing enough audio for the right context up front:
// Example: '3s' preset, minimum-latency approach
const FIRST = 47_360;
const STEADY = 38_400;
const r1 = await session.feed(audio.slice(0, FIRST)); // r1.frameCount = 30
const r2 = await session.feed(audio.slice(FIRST, FIRST + STEADY)); // r2.frameCount = 30
const r3 = await session.feed(audio.slice(FIRST + STEADY, FIRST + STEADY*2)); // r3.frameCount = 30
Or just feed whatever you have — the engine buffers internally regardless of chunk size. You can feed 1,000 samples at a time or 100,000. It will process as many full chunks as possible and buffer the rest.
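For example, feeding tiny fixed-size captures is fine; most calls return frameCount: 0, and a full chunk of frames comes out whenever enough audio has accumulated (sketch):
// Sketch: feed 100ms (1,600-sample) slices straight from a capture loop.
const STEP = 1_600;
let frames = 0;
for (let off = 0; off + STEP <= audio.length; off += STEP) {
  const r = await session.feed(audio.subarray(off, off + STEP));
  frames += r.frameCount; // usually 0, occasionally a full chunk
}
frames += (await session.flush()).frameCount;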
Tracking frame positions
Frames are sequential starting from 0. Use totalFrames to know where you are:
const session = model.createStreamingSession({ preset: '3s' });
const r1 = await session.feed(chunk1);
// session.totalFrames = 0, r1.frameCount = 0 (still buffering)
const r2 = await session.feed(chunk2);
// session.totalFrames = 30, r2.frameCount = 30
// These are global frames 0–29 → time 0.00s to 2.32s
const r3 = await session.feed(chunk3);
// session.totalFrames = 60, r3.frameCount = 30
// These are global frames 30–59 → time 2.40s to 4.72s
totalFrames before a feed() call gives you the starting frame index of whatever predictions come back. Each output frame represents 80ms of audio (8× subsampling at 160-sample hops from 16kHz). Frame i covers the time window [i × 0.08s, (i+1) × 0.08s).
Predictions lag behind your audio
The model needs right context (future frames) to make predictions about the "current" audio, so predictions always lag behind the audio you've fed. For example with the '3s' preset, if you've fed 4.8s of audio, you may only have predictions up to ~2.4s.
This is normal. Call flush() when the audio stream ends — it zero-pads the future context to squeeze out the remaining predictions for the tail of your audio.
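A typical end-of-stream pattern (a sketch; chunks stands in for the Float32Array slices you fed, and each output frame covers 1,280 samples at 16kHz):
// Sketch: flush() recovers the tail predictions held back by right context.
let frames = 0;
let samplesFed = 0;
for (const chunk of chunks) {
  samplesFed += chunk.length;
  frames += (await session.feed(chunk)).frameCount;
}
// here frames * 1280 is typically less than samplesFed (predictions lag the audio)
frames += (await session.flush()).frameCount;
// after flush, the output covers the full audio that was fed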
ActivityStream adds median filter settling delay
ActivityStream (and DiarizeStream, which wraps it) adds an extra delay on top of the model's buffering. The median filter needs future frames to "settle" its output, so it holds back the last floor(medianFilter / 2) frames until the next feed() call.
With the default medianFilter: 11, that's 5 frames held back (400ms). These frames are not lost — they settle when you feed more audio or call flush().
What this means in practice:
| Preset | StreamingSession first output | ActivityStream first output |
|--------|------------------------------|----------------------------|
| 'low' | 6 frames | 1 settled frame (5 held back) |
| '2s' | 15 frames | 10 settled frames (5 held back) |
| '3s' | 30 frames | 25 settled frames (5 held back) |
| '5s' | 55 frames | 50 settled frames (5 held back) |
The first output arrives on the same call — you just get 5 fewer frames. Those 5 frames settle on the next feed() call. In steady state, every call returns the full chunk size because the 5 held-back frames from the previous call are released.
You can reduce the settling delay by lowering medianFilter (e.g., medianFilter: 5 → 2 frames held back = 160ms), at the cost of noisier activity detection.
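The arithmetic behind the held-back frames (a sketch):
// Frames held back by the median filter and the extra latency they add.
const medianFilter = 11;                       // default window size
const heldBack = Math.floor(medianFilter / 2); // 5 frames
const addedLatencyMs = heldBack * 80;          // 400 ms (80 ms per output frame)

// A smaller window trades latency for noisier activity:
// medianFilter: 5 -> heldBack = 2 -> 160 ms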
Streaming Presets
| Preset | Output frames per chunk | Audio per chunk | Right context | Best For |
|--------|------------------------|-----------------|---------------|----------|
| 'low' | 6 (480ms) | 0.48s steady | 7 frames (560ms) | Real-time, minimal delay |
| '2s' | 15 (1.2s) | 1.20s steady | 10 frames (800ms) | Balanced (default) |
| '3s' | 30 (2.4s) | 2.40s steady | 7 frames (560ms) | Higher accuracy |
| '5s' | 55 (4.4s) | 4.40s steady | 7 frames (560ms) | Best accuracy |
Audio Format
Input audio must be:
- Sample rate: 16kHz
- Channels: Mono (single channel)
- Format: Float32Array with values in range [-1.0, 1.0]
Converting from PCM Int16
// From Int16Array
const float32 = new Float32Array(int16Array.length);
for (let i = 0; i < int16Array.length; i++) {
float32[i] = int16Array[i] / 32768.0;
}
// From Buffer (16-bit PCM)
const int16 = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / 2);
const float32 = new Float32Array(int16.length);
for (let i = 0; i < int16.length; i++) {
float32[i] = int16[i] / 32768.0;
}
RTTM Output
The rttm field follows the standard RTTM format:
SPEAKER <filename> 1 <start> <duration> <NA> <NA> speaker_<id> <NA> <NA>
Each line represents a contiguous speech segment for one speaker.
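If you need structured data from this output, a minimal line parser could look like the sketch below (field positions follow the format line above):
// Sketch: parse RTTM lines into { speaker, start, duration } objects.
function parseRttm(rttm) {
  return rttm
    .trim()
    .split('\n')
    .filter((line) => line.startsWith('SPEAKER'))
    .map((line) => {
      const f = line.split(/\s+/);
      // f[3] = start, f[4] = duration, f[7] = speaker label (e.g. "speaker_0")
      return {
        speaker: Number(f[7].replace('speaker_', '')),
        start: Number(f[3]),
        duration: Number(f[4]),
      };
    });
}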
Example:
SPEAKER audio 1 0.000 2.560 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER audio 1 2.560 1.920 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER audio 1 4.480 3.200 <NA> <NA> speaker_0 <NA> <NA>
Thread Safety
Multiple StreamingSession, ActivityStream, or DiarizeStream instances can run concurrently from the same model. Each session has its own backend and graph allocator. All feed() and flush() methods are async and non-blocking.
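For example, two independent streams can be processed in parallel against one loaded model (a sketch; audioA and audioB are arbitrary 16kHz mono Float32Array chunks):
// Sketch: two concurrent streaming sessions sharing one loaded model.
const a = model.createStreamingSession({ preset: '2s' });
const b = model.createStreamingSession({ preset: '2s' });

// feed() is async and each session owns its own state,
// so the two inferences can be in flight at the same time.
const [ra, rb] = await Promise.all([a.feed(audioA), b.feed(audioB)]);

a.close();
b.close();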
CoreML Acceleration
On Apple Silicon Macs, the addon automatically uses CoreML/ANE acceleration if a compiled CoreML model is present alongside the GGUF file.
Setup:
- Convert the model head to CoreML:
python scripts/convert_head_to_coreml.py \
--model model.nemo \
--output model-coreml-head.mlpackage \
--precision fp16
- Compile the CoreML model:
xcrun coremlcompiler compile model-coreml-head.mlpackage .
- Place model-coreml-head.mlmodelc/ in the same directory as model.gguf
The addon will automatically detect and use the CoreML model.
Performance (M3 MacBook Pro)
| Backend | Speed | Memory |
|---------|-------|--------|
| CoreML/ANE | ~110x real-time | ~300 MB |
| CPU (F16) | ~10x real-time | ~380 MB |
Choosing an API
| API | Returns | Use When |
|-----|---------|----------|
| diarize() | RTTM string + raw predictions | Batch processing complete audio files |
| createStreamingSession() | Float32Array probabilities (0.0–1.0) | You need raw model output for custom processing |
| createActivityStream() | Uint8Array booleans (0 or 1) | You need "who is speaking right now" per frame (e.g. live UI indicators) |
| createDiarizeStream() | SpeakerSegment objects | You need finished segments with start/duration (e.g. transcription alignment) |
┌─────────────────────┐
│ StreamingSession │ Raw probabilities
│ (low-level) │ Float32Array
└────────┬────────────┘
│
┌────────▼────────────┐
│ ActivityStream │ Threshold + median filter
│ (mid-level) │ Uint8Array (0 or 1)
└────────┬────────────┘
│
┌────────▼────────────┐
│ DiarizeStream │ Segment boundary detection
│ (high-level) │ SpeakerSegment[]
└─────────────────────┘
Each layer wraps the one above it. DiarizeStream internally uses ActivityStream, which internally uses StreamingSession.
Error Handling
All methods throw on invalid input. Calling any method on a closed model, session, or stream also throws. close() is always idempotent.
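For example (a sketch):
// Sketch: methods on a closed handle throw; close() itself never does.
model.close();
try {
  await model.diarize(audio); // throws: model is already closed
} catch (err) {
  console.error(err.message);
}
model.close(); // safe: close() is idempotent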
License
This project follows the same license as the parent streaming-sortformer-ggml repository.
