streaming-sortformer-node

Native Node.js bindings for SortFormer streaming speaker diarization. Provides real-time speaker diarization for up to 4 concurrent speakers using native C/GGML inference, with async, thread-safe operation and CoreML acceleration on Apple Silicon.

Installation

npm install streaming-sortformer-node

Platforms: macOS arm64, macOS x64
Requirements: Node.js >= 18

Quick Start

Offline Diarization

import { Sortformer } from 'streaming-sortformer-node';

const model = await Sortformer.load('./model.gguf');
const audio = new Float32Array(/* 16kHz mono samples */);

const result = await model.diarize(audio, {
  threshold: 0.5,
  medianFilter: 11
});

console.log(result.rttm);
console.log(`Processed ${result.frameCount} frames`);
console.log(`Model outputs ${result.maxSpeakers} speaker channels`);

model.close();

Streaming (Low-Level)

import { Sortformer } from 'streaming-sortformer-node';

const model = await Sortformer.load('./model.gguf');
const session = model.createStreamingSession({ preset: '2s' });

// Feed audio chunks as they arrive
const result1 = await session.feed(chunk1);
const result2 = await session.feed(chunk2);

// Flush at end of stream
const final = await session.flush();

console.log(`Total frames: ${session.totalFrames}`);

session.close();
model.close();

Streaming (Mid-Level — Boolean Activity Frames)

import { Sortformer } from 'streaming-sortformer-node';

const model = await Sortformer.load('./model.gguf');
const stream = model.createActivityStream({
  preset: '2s',
  threshold: 0.5,
  medianFilter: 11
});

// Feed audio chunks — get boolean "who is speaking" frames
const result = await stream.feed(chunk1);

// result.activity is a Uint8Array of 0s and 1s
// Layout: [spk0_f0, spk1_f0, spk2_f0, spk3_f0, spk0_f1, ...]
for (let i = 0; i < result.frameCount; i++) {
  const globalFrame = result.startFrame + i;
  const time = globalFrame * 0.08; // 80ms per frame
  const speakers = [];
  for (let s = 0; s < 4; s++) {
    if (result.activity[i * 4 + s] === 1) speakers.push(s);
  }
  if (speakers.length > 0) {
    console.log(`${time.toFixed(2)}s: speakers ${speakers.join(', ')} active`);
  }
}

// Flush at end of stream to settle remaining frames
const final = await stream.flush();

stream.close();
model.close();

Streaming (High-Level — Speaker Segments)

import { Sortformer } from 'streaming-sortformer-node';

const model = await Sortformer.load('./model.gguf');
const stream = model.createDiarizeStream({
  preset: '2s',
  threshold: 0.5,
  medianFilter: 11,
  onSegment: (segment) => {
    console.log(`Speaker ${segment.speaker}: ${segment.start}s - ${segment.start + segment.duration}s`);
  }
});

// Feed audio chunks
await stream.feed(chunk1);
await stream.feed(chunk2);

// Flush and close open segments
const final = await stream.flush();

stream.close();
model.close();

API Reference

Sortformer

static async load(modelPath: string, options?: LoadOptions): Promise<Sortformer>

Load a SortFormer model from a GGUF file.

Parameters:

  • modelPath (string): Path to the GGUF model file
  • options (LoadOptions, optional): Currently an empty object ({}); no load options are defined yet

Returns: Promise<Sortformer>

Throws: Error if model file not found or invalid

Example:

const model = await Sortformer.load('./model.gguf');

async diarize(audio: Float32Array, options?: DiarizeOptions): Promise<DiarizeResult>

Run offline diarization on complete audio. Processes the entire audio at once, with no streaming constraints.

Parameters:

  • audio (Float32Array): Audio samples at 16kHz mono
  • options (DiarizeOptions, optional):
    • threshold?: number — Speaker activity threshold, 0.0-1.0 (default: 0.5)
    • medianFilter?: number — Median filter window size, odd integer >= 1 (default: 11)

Returns: Promise<DiarizeResult>

  • rttm: string — RTTM format output
  • predictions: Float32Array — Raw per-frame predictions, shape [frameCount, 4]
  • frameCount: number — Number of frames
  • maxSpeakers: number — Always 4 (model outputs 4 speaker channels)

Throws: Error if model is closed, audio is invalid, or inference fails

Example:

const result = await model.diarize(audio, {
  threshold: 0.5,
  medianFilter: 11
});

createStreamingSession(options?: StreamingSessionOptions): StreamingSession

Create a streaming session for incremental audio processing. Returns raw frame predictions.

Parameters:

  • options (StreamingSessionOptions, optional):
    • preset?: StreamingPreset — Latency preset: 'low' | '2s' | '3s' | '5s' (default: '2s')

Returns: StreamingSession

Throws: Error if model is closed

Example:

const session = model.createStreamingSession({ preset: 'low' });

createActivityStream(options?: ActivityStreamOptions): ActivityStream

Create a mid-level streaming activity stream. Performs threshold → median filter and returns boolean speaker activity frames. Use this when you need clean "who is speaking right now" data without dealing with raw probabilities or segment boundary detection yourself.

Parameters:

  • options (ActivityStreamOptions, optional):
    • preset?: StreamingPreset — Latency preset (default: '2s')
    • threshold?: number — Speaker activity threshold (default: 0.5)
    • medianFilter?: number — Median filter window size (default: 11)
    • onFrames?: (predictions: Float32Array, frameCount: number) => void — Callback for raw predictions before processing

Returns: ActivityStream

Throws: Error if model is closed

Example:

const stream = model.createActivityStream({
  preset: '2s',
  threshold: 0.5,
  medianFilter: 11
});

createDiarizeStream(options?: DiarizeStreamOptions): DiarizeStream

Create a high-level streaming diarization stream. Performs threshold → median filter → segment extraction, emitting completed speaker segments via callbacks. Internally wraps an ActivityStream.

Parameters:

  • options (DiarizeStreamOptions, optional):
    • preset?: StreamingPreset — Latency preset (default: '2s')
    • threshold?: number — Speaker activity threshold (default: 0.5)
    • medianFilter?: number — Median filter window size (default: 11)
    • onSegment?: (segment: SpeakerSegment) => void — Callback for completed segments
    • onFrames?: (predictions: Float32Array, frameCount: number) => void — Callback for raw frame predictions

Returns: DiarizeStream

Throws: Error if model is closed

Example:

const stream = model.createDiarizeStream({
  preset: '2s',
  onSegment: (seg) => console.log(`Speaker ${seg.speaker}: ${seg.start}s`)
});

close(): void

Close the model and free native resources. Idempotent.

isClosed(): boolean

Check if the model has been closed.

StreamingSession

async feed(audio: Float32Array): Promise<FeedResult>

Feed audio samples and get predictions for this chunk.

Parameters:

  • audio (Float32Array): Audio samples at 16kHz mono

Returns: Promise<FeedResult>

  • predictions: Float32Array — Per-frame predictions, shape [frameCount, 4]
  • frameCount: number — Number of new frames

Throws: Error if session is closed or audio is invalid

Example:

const result = await session.feed(audioChunk);
console.log(`Got ${result.frameCount} frames`);

async flush(): Promise<FeedResult>

Flush remaining buffered audio at the end of a stream. Call this when the audio stream ends to process whatever is still buffered.

Returns: Promise<FeedResult>

Throws: Error if session is closed
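
Example (a minimal sketch; the flushed chunk may contain zero frames if nothing was left in the buffer):

const final = await session.flush();
if (final.frameCount > 0) {
  // final.predictions holds the settled frames for the tail of the audio
  console.log(`Flushed ${final.frameCount} trailing frames`);
}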

reset(): void

Reset streaming state for a new audio stream. Clears all internal buffers (spkcache, fifo, mel overlap) while keeping the model loaded.

Throws: Error if session is closed

close(): void

Close the session and free native resources. Idempotent.

get totalFrames: number

Get total frames output so far.

get isClosed: boolean

Check if the session is closed.

ActivityStream

Mid-level API that wraps StreamingSession and returns boolean speaker activity frames after applying threshold and median filter. Each frame tells you definitively which speakers are active — no raw probabilities to interpret.

Added latency: The median filter needs future context to settle predictions. With the default medianFilter: 11, this adds a delay of 5 frames (400ms) on top of the model's inference latency. Frames within this window are "unsettled" and held back until enough future context arrives.
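
For reference, the added delay can be computed directly from medianFilter (a quick sketch of the arithmetic; each output frame covers 80ms):

const medianFilter = 11;
const heldBackFrames = Math.floor(medianFilter / 2); // 5 frames with the default of 11
const addedLatencyMs = heldBackFrames * 80;          // 400ms on top of the model's latency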

async feed(audio: Float32Array): Promise<ActivityFeedResult>

Feed audio samples, run inference, apply threshold + median filter, and return settled boolean activity frames.

Parameters:

  • audio (Float32Array): Audio samples at 16kHz mono

Returns: Promise<ActivityFeedResult>

  • activity: Uint8Array — Boolean activity per speaker, flat layout [spk0_f0, spk1_f0, spk2_f0, spk3_f0, spk0_f1, ...]. Values are 0 (silent) or 1 (speaking).
  • frameCount: number — Number of settled frames in this result. May be 0 if the engine is still buffering audio or waiting for median filter context.
  • startFrame: number — Global frame index of the first frame in this result. Use this to compute timestamps: time = (startFrame + i) * 0.08.

Throws: Error if stream is closed

Example:

const result = await stream.feed(audioChunk);

for (let i = 0; i < result.frameCount; i++) {
  const time = (result.startFrame + i) * 0.08;
  for (let s = 0; s < 4; s++) {
    if (result.activity[i * 4 + s] === 1) {
      console.log(`${time.toFixed(2)}s: speaker ${s} is speaking`);
    }
  }
}

async flush(): Promise<ActivityFlushResult>

Flush remaining buffered audio and settle all pending frames. Call when the audio stream ends. The median filter zero-pads future context to produce final settled frames.

Returns: Promise<ActivityFlushResult>

  • activity: Uint8Array — Final boolean activity frames
  • frameCount: number — Number of final frames
  • startFrame: number — Global frame index of the first frame

Throws: Error if stream is closed

reset(): void

Reset for a new audio stream. Clears internal buffers, median filter state, and frame counter.

Throws: Error if stream is closed

close(): void

Close the stream and free native resources. Idempotent.

get totalFrames: number

Get total frames output by the underlying session so far (including unsettled frames not yet returned).

get isClosed: boolean

Check if the stream is closed.

DiarizeStream

Wraps an ActivityStream internally. Reads boolean activity frames and detects segment boundaries (speaker on→off transitions), emitting SpeakerSegment objects.
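
Conceptually, the boundary detection works like the sketch below. This is an illustration, not the actual internal implementation: it tracks, per speaker, the frame where activity turned on and emits a SpeakerSegment once it turns off again (80ms per frame).

// Hypothetical sketch of turning boolean activity frames into segments
const openSince = [-1, -1, -1, -1]; // frame where each speaker turned on, or -1 if silent

function processFrame(globalFrame, frameActivity, emit) {
  for (let s = 0; s < 4; s++) {
    const active = frameActivity[s] === 1;
    if (active && openSince[s] === -1) {
      openSince[s] = globalFrame;      // speaker turned on
    } else if (!active && openSince[s] !== -1) {
      emit({                           // speaker turned off: segment completed
        speaker: s,
        start: openSince[s] * 0.08,
        duration: (globalFrame - openSince[s]) * 0.08,
      });
      openSince[s] = -1;
    }
  }
}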

async feed(audio: Float32Array): Promise<DiarizeStreamFeedResult>

Feed audio samples, run inference, and post-process predictions. Returns segments that completed during this call.

Parameters:

  • audio (Float32Array): Audio samples at 16kHz mono

Returns: Promise<DiarizeStreamFeedResult>

  • frameCount: number — Number of new frames from inference
  • segments: SpeakerSegment[] — Segments completed during this feed

Throws: Error if stream is closed

Example:

const result = await stream.feed(audioChunk);
for (const seg of result.segments) {
  console.log(`Speaker ${seg.speaker}: ${seg.start}s for ${seg.duration}s`);
}

async flush(): Promise<DiarizeStreamFlushResult>

Flush remaining buffered frames at end of stream. Zero-pads future frames to settle all pending predictions and closes any open segments.

Returns: Promise<DiarizeStreamFlushResult>

  • frameCount: number — Number of final frames
  • segments: SpeakerSegment[] — All segments that completed during flush

Throws: Error if stream is closed
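
Example (a minimal sketch; any segments still open when the stream ends are closed and returned here):

const final = await stream.flush();
for (const seg of final.segments) {
  console.log(`Speaker ${seg.speaker}: ${seg.start.toFixed(2)}s for ${seg.duration.toFixed(2)}s`);
}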

reset(): void

Reset for a new audio stream.

Throws: Error if stream is closed

close(): void

Close the stream and free native resources. Idempotent.

get totalFrames: number

Get total frames processed so far.

get isClosed: boolean

Check if the stream is closed.

Types

interface LoadOptions {}

interface DiarizeOptions {
  threshold?: number;    // 0.0-1.0, default: 0.5
  medianFilter?: number; // odd integer >= 1, default: 11
}

interface DiarizeResult {
  rttm: string;
  predictions: Float32Array; // shape [frameCount, 4]
  frameCount: number;
  maxSpeakers: number;       // always 4
}

type StreamingPreset = 'low' | '2s' | '3s' | '5s';

interface StreamingSessionOptions {
  preset?: StreamingPreset; // default: '2s'
}

interface FeedResult {
  predictions: Float32Array; // shape [frameCount, 4]
  frameCount: number;
}

interface SpeakerSegment {
  speaker: number;   // 0-3
  start: number;     // seconds
  duration: number;  // seconds
}

interface DiarizeStreamOptions {
  preset?: StreamingPreset;
  threshold?: number;
  medianFilter?: number;
  onSegment?: (segment: SpeakerSegment) => void;
  onFrames?: (predictions: Float32Array, frameCount: number) => void;
}

interface ActivityStreamOptions {
  preset?: StreamingPreset;
  threshold?: number;      // 0.0-1.0, default: 0.5
  medianFilter?: number;   // odd integer >= 1, default: 11
  onFrames?: (predictions: Float32Array, frameCount: number) => void;
}

interface ActivityFeedResult {
  activity: Uint8Array;    // flat [spk0_f0, spk1_f0, spk2_f0, spk3_f0, spk0_f1, ...], values 0 or 1
  frameCount: number;      // number of settled frames
  startFrame: number;      // global frame index of first frame
}

interface ActivityFlushResult {
  activity: Uint8Array;    // final settled frames
  frameCount: number;
  startFrame: number;
}

interface DiarizeStreamFeedResult {
  frameCount: number;
  segments: SpeakerSegment[];
}

interface DiarizeStreamFlushResult {
  frameCount: number;
  segments: SpeakerSegment[];
}

How Streaming Works

What feed() returns

feed() returns only the new frames from the chunk you just fed — not the full audio history. Each call's frames tile together sequentially.

const r1 = await session.feed(chunk1); // r1.frameCount = 30 (frames 0-29)
const r2 = await session.feed(chunk2); // r2.frameCount = 30 (frames 30-59)
const r3 = await session.feed(chunk3); // r3.frameCount = 30 (frames 60-89)

Internally, the model uses a speaker cache for full-history context, but it only outputs predictions for the new audio.

What's inside predictions

predictions is a flat Float32Array of raw sigmoid probabilities (0.0–1.0) for 4 speaker channels at each frame. Layout is [spk0_f0, spk1_f0, spk2_f0, spk3_f0, spk0_f1, spk1_f1, ...].

// Read speaker 2's probability at frame 5:
const prob = result.predictions[5 * 4 + 2];

These are raw values — no thresholding or filtering applied. If you want finished speaker segments, use createDiarizeStream() instead, which does threshold → median filter → segment extraction for you.

Feed non-overlapping audio chunks

You feed sequential, non-overlapping audio chunks. The engine handles mel spectrogram overlap (352 samples) internally.

// Correct: sequential chunks
await session.feed(samples.slice(0, 48000));
await session.feed(samples.slice(48000, 96000));

// Wrong: don't overlap your audio
await session.feed(samples.slice(0, 48000));
await session.feed(samples.slice(47000, 96000)); // ❌ overlapping

Feed sizes and buffering

The model processes audio in internal chunks. If you haven't fed enough audio for a full chunk, the engine buffers it and returns frameCount: 0. Once enough accumulates, it processes and returns frames.

The model needs (chunkLen + rightContext) × 8 × 160 audio samples to produce its first chunk. After that, it needs chunkLen × 8 × 160 samples per chunk (the right context carries over internally).
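
As a worked example for the '3s' preset (30 output frames per chunk and 7 frames of right context, per the preset table further down; each output frame corresponds to 8 × 160 = 1,280 input samples):

// '3s' preset: chunkLen = 30 frames, rightContext = 7 frames
const samplesPerFrame = 8 * 160;                 // 1,280 samples = 80ms at 16kHz
const firstChunk  = (30 + 7) * samplesPerFrame;  // 47,360 samples (2.96s)
const steadyChunk = 30 * samplesPerFrame;        // 38,400 samples (2.40s)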

Simplest approach — feed the same size every call:

| Preset | Feed size | Samples | First output after |
|--------|-----------|---------|---------------------|
| 'low'  | 0.48s     | 7,680   | 3rd call (1.44s)    |
| '2s'   | 1.20s     | 19,200  | 2nd call (2.40s)    |
| '3s'   | 2.40s     | 38,400  | 2nd call (4.80s)    |
| '5s'   | 4.40s     | 70,400  | 2nd call (8.80s)    |

The first call(s) return frameCount: 0 while the engine accumulates enough data. After that, every call returns exactly one chunk.

// Example: '3s' preset, feeding 38,400 samples each time
const FEED_SIZE = 38_400;
const r1 = await session.feed(audio.slice(0, FEED_SIZE));       // r1.frameCount = 0  (buffering)
const r2 = await session.feed(audio.slice(FEED_SIZE, FEED_SIZE*2));   // r2.frameCount = 30
const r3 = await session.feed(audio.slice(FEED_SIZE*2, FEED_SIZE*3)); // r3.frameCount = 30

Minimum-latency approach — feed a larger first chunk:

| Preset | First call             | Subsequent calls       | First output after |
|--------|------------------------|------------------------|--------------------|
| 'low'  | 16,640 samples (1.04s) | 7,680 samples (0.48s)  | 1st call (1.04s)   |
| '2s'   | 32,000 samples (2.00s) | 19,200 samples (1.20s) | 1st call (2.00s)   |
| '3s'   | 47,360 samples (2.96s) | 38,400 samples (2.40s) | 1st call (2.96s)   |
| '5s'   | 79,360 samples (4.96s) | 70,400 samples (4.40s) | 1st call (4.96s)   |

This gets predictions on the very first call by providing enough audio for the right context up front:

// Example: '3s' preset, minimum-latency approach
const FIRST = 47_360;
const STEADY = 38_400;
const r1 = await session.feed(audio.slice(0, FIRST));                  // r1.frameCount = 30
const r2 = await session.feed(audio.slice(FIRST, FIRST + STEADY));     // r2.frameCount = 30
const r3 = await session.feed(audio.slice(FIRST + STEADY, FIRST + STEADY*2)); // r3.frameCount = 30

Or just feed whatever you have — the engine buffers internally regardless of chunk size. You can feed 1000 samples at a time or 100,000. It will process as many full chunks as possible and buffer the rest.

Tracking frame positions

Frames are sequential starting from 0. Use totalFrames to know where you are:

const session = model.createStreamingSession({ preset: '3s' });

const r1 = await session.feed(chunk1);
// session.totalFrames = 0, r1.frameCount = 0 (still buffering)

const r2 = await session.feed(chunk2);
// session.totalFrames = 30, r2.frameCount = 30
// These are global frames 0–29 → time 0.00s to 2.32s

const r3 = await session.feed(chunk3);
// session.totalFrames = 60, r3.frameCount = 30
// These are global frames 30–59 → time 2.40s to 4.72s

totalFrames before a feed() call gives you the starting frame index of whatever predictions come back. Each output frame represents 80ms of audio (8× subsampling at 160-sample hops from 16kHz). Frame i covers the time window [i × 0.08s, (i+1) × 0.08s).
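
The 80ms figure follows directly from the hop size and subsampling factor (a quick check of the arithmetic):

// 160-sample hop at 16,000 Hz = 0.01s per mel frame; 8x subsampling = 0.08s per output frame
const frameDurationSec = (160 / 16000) * 8; // 0.08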

Predictions lag behind your audio

The model needs right context (future frames) to make predictions about the "current" audio, so predictions always lag behind the audio you've fed. For example with the '3s' preset, if you've fed 4.8s of audio, you may only have predictions up to ~2.4s.

This is normal. Call flush() when the audio stream ends — it zero-pads the future context to squeeze out the remaining predictions for the tail of your audio.

ActivityStream adds median filter settling delay

ActivityStream (and DiarizeStream, which wraps it) adds an extra delay on top of the model's buffering. The median filter needs future frames to "settle" its output, so it holds back the last floor(medianFilter / 2) frames until the next feed() call.

With the default medianFilter: 11, that's 5 frames held back (400ms). These frames are not lost — they settle when you feed more audio or call flush().

What this means in practice:

| Preset | StreamingSession first output | ActivityStream first output     |
|--------|-------------------------------|---------------------------------|
| 'low'  | 6 frames                      | 1 settled frame (5 held back)   |
| '2s'   | 15 frames                     | 10 settled frames (5 held back) |
| '3s'   | 30 frames                     | 25 settled frames (5 held back) |
| '5s'   | 55 frames                     | 50 settled frames (5 held back) |

The first output arrives on the same call — you just get 5 fewer frames. Those 5 frames settle on the next feed() call. In steady state, every call returns the full chunk size because the 5 held-back frames from the previous call are released.
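
For instance, with the '3s' preset and the default medianFilter: 11, the per-call frame counts look roughly like this (a sketch; chunkA and chunkB are hypothetical Float32Array chunks sized so each feed() supplies one model chunk, using the larger first chunk from the minimum-latency table above):

const r1 = await stream.feed(chunkA); // r1.frameCount = 25 (30 new frames, 5 newest held back)
const r2 = await stream.feed(chunkB); // r2.frameCount = 30 (5 released + 30 new - 5 held back)
const final = await stream.flush();   // remaining held-back frames settle here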

You can reduce the settling delay by lowering medianFilter (e.g., medianFilter: 5 → 2 frames held back = 160ms), at the cost of noisier activity detection.

Streaming Presets

| Preset | Output frames per chunk | Audio per chunk | Right context     | Best For                 |
|--------|-------------------------|-----------------|-------------------|--------------------------|
| 'low'  | 6 (480ms)               | 0.48s steady    | 7 frames (560ms)  | Real-time, minimal delay |
| '2s'   | 15 (1.2s)               | 1.20s steady    | 10 frames (800ms) | Balanced (default)       |
| '3s'   | 30 (2.4s)               | 2.40s steady    | 7 frames (560ms)  | Higher accuracy          |
| '5s'   | 55 (4.4s)               | 4.40s steady    | 7 frames (560ms)  | Best accuracy            |

Audio Format

Input audio must be:

  • Sample rate: 16kHz
  • Channels: Mono (single channel)
  • Format: Float32Array with values in range [-1.0, 1.0]

Converting from PCM Int16

// From Int16Array
const float32 = new Float32Array(int16Array.length);
for (let i = 0; i < int16Array.length; i++) {
  float32[i] = int16Array[i] / 32768.0;
}

// From Buffer (16-bit PCM)
const int16 = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / 2);
const float32 = new Float32Array(int16.length);
for (let i = 0; i < int16.length; i++) {
  float32[i] = int16[i] / 32768.0;
}
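
If your source audio is interleaved stereo, it also needs to be downmixed to mono before feeding it to the model. A minimal sketch, assuming interleaved L/R Float32 samples (stereo is a hypothetical input array); averaging the two channels is the simplest approach:

// From interleaved stereo (L, R, L, R, ...) to mono by averaging the channels
const mono = new Float32Array(stereo.length / 2);
for (let i = 0; i < mono.length; i++) {
  mono[i] = (stereo[i * 2] + stereo[i * 2 + 1]) / 2;
}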

RTTM Output

The rttm field follows the standard RTTM format:

SPEAKER <filename> 1 <start> <duration> <NA> <NA> speaker_<id> <NA> <NA>

Each line represents a contiguous speech segment for one speaker.

Example:

SPEAKER audio 1 0.000 2.560 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER audio 1 2.560 1.920 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER audio 1 4.480 3.200 <NA> <NA> speaker_0 <NA> <NA>

Thread Safety

Multiple StreamingSession, ActivityStream, or DiarizeStream instances can run concurrently from the same model. Each session has its own backend and graph allocator. All feed() and flush() methods are async and non-blocking.
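
For example, two independent streams can be processed in parallel from one loaded model (a minimal sketch; audioA and audioB are hypothetical 16kHz mono Float32Array chunks):

const sessionA = model.createStreamingSession({ preset: '2s' });
const sessionB = model.createStreamingSession({ preset: '2s' });

// feed() is async and non-blocking, so both sessions can run concurrently
const [resA, resB] = await Promise.all([
  sessionA.feed(audioA),
  sessionB.feed(audioB),
]);

sessionA.close();
sessionB.close();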

CoreML Acceleration

On Apple Silicon Macs, the addon automatically uses CoreML/ANE acceleration if a compiled CoreML model is present alongside the GGUF file.

Setup:

  1. Convert the model head to CoreML:

     python scripts/convert_head_to_coreml.py \
       --model model.nemo \
       --output model-coreml-head.mlpackage \
       --precision fp16

  2. Compile the CoreML model:

     xcrun coremlcompiler compile model-coreml-head.mlpackage .

  3. Place model-coreml-head.mlmodelc/ in the same directory as model.gguf

The addon will automatically detect and use the CoreML model.

Performance (M3 MacBook Pro)

| Backend    | Speed           | Memory  |
|------------|-----------------|---------|
| CoreML/ANE | ~110x real-time | ~300 MB |
| CPU (F16)  | ~10x real-time  | ~380 MB |

Choosing an API

| API | Returns | Use When |
|-----|---------|----------|
| diarize() | RTTM string + raw predictions | Batch processing complete audio files |
| createStreamingSession() | Float32Array probabilities (0.0–1.0) | You need raw model output for custom processing |
| createActivityStream() | Uint8Array booleans (0 or 1) | You need "who is speaking right now" per frame (e.g. live UI indicators) |
| createDiarizeStream() | SpeakerSegment objects | You need finished segments with start/duration (e.g. transcription alignment) |

                          ┌──────────────────────┐
                          │  StreamingSession    │  Raw probabilities
                          │  (low-level)         │  Float32Array
                          └──────────┬───────────┘
                                     │
                          ┌──────────▼───────────┐
                          │  ActivityStream      │  Threshold + median filter
                          │  (mid-level)         │  Uint8Array (0 or 1)
                          └──────────┬───────────┘
                                     │
                          ┌──────────▼───────────┐
                          │  DiarizeStream       │  Segment boundary detection
                          │  (high-level)        │  SpeakerSegment[]
                          └──────────────────────┘

Each layer wraps the one above it. DiarizeStream internally uses ActivityStream, which internally uses StreamingSession.

Error Handling

All methods throw on invalid input. Methods on a model, session, or stream throw after close() has been called; close() itself is always idempotent.
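
For example (a minimal sketch of the behavior around close(); audio is a hypothetical Float32Array):

model.close();
model.close();                 // safe: close() is idempotent
console.log(model.isClosed()); // true

try {
  await model.diarize(audio);  // throws because the model has been closed
} catch (err) {
  console.error(err.message);
}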

License

This project follows the same license as the parent streaming-sortformer-ggml repository.