whisper-cpp-node
Node.js bindings for whisper.cpp - fast speech-to-text with GPU acceleration.
Features
- Fast: Native whisper.cpp performance with GPU acceleration
- Cross-platform: macOS (Metal), Windows (Vulkan)
- Core ML: Optional Apple Neural Engine support for 3x+ speedup (macOS)
- OpenVINO: Optional Intel CPU/GPU encoder acceleration (Windows/Linux)
- Streaming VAD: Built-in Silero voice activity detection
- TypeScript: Full type definitions included
- GPU Discovery: Enumerate available GPU devices for multi-GPU selection
- Self-contained: No external dependencies; just install and use
Requirements
macOS:
- macOS 13.3+ (Ventura or later)
- Apple Silicon (M1/M2/M3/M4)
- Node.js 18+
Windows:
- Windows 10/11 (x64)
- Node.js 18+
- Vulkan-capable GPU (optional, for GPU acceleration)
Installation
npm install whisper-cpp-node
# or
pnpm add whisper-cpp-node
The platform-specific binary is automatically installed:
- macOS ARM64: @whisper-cpp-node/darwin-arm64
- Windows x64: @whisper-cpp-node/win32-x64
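To confirm the native addon loaded for your platform, a quick smoke test; the import itself throws if the platform binary failed to install:
// smoke-test.ts - throws at import time if the native addon failed to load
import { createWhisperContext } from "whisper-cpp-node";
console.log(typeof createWhisperContext); // "function"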
Quick Start
File-based transcription
import {
createWhisperContext,
transcribeAsync,
} from "whisper-cpp-node";
// Create a context with your model
const ctx = createWhisperContext({
model: "./models/ggml-base.en.bin",
use_gpu: true,
});
// Transcribe audio file
const result = await transcribeAsync(ctx, {
fname_inp: "./audio.wav",
language: "en",
});
// Result: { segments: [["00:00:00,000", "00:00:02,500", " Hello world"], ...] }
for (const [start, end, text] of result.segments) {
console.log(`[${start} --> ${end}]${text}`);
}
// Clean up
ctx.free();
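Segment timestamps already use the SRT format ("HH:MM:SS,mmm"), so writing a subtitle file takes only a few lines. A minimal sketch reusing result from the example above; writeSrt is an illustrative helper, not part of the package:
import { writeFileSync } from "node:fs";
// Illustrative helper: segments are [start, end, text] tuples
// whose timestamps are already SRT-formatted
function writeSrt(segments: [string, string, string][], path: string) {
  const blocks = segments.map(
    ([start, end, text], i) => `${i + 1}\n${start} --> ${end}\n${text.trim()}\n`
  );
  writeFileSync(path, blocks.join("\n"), "utf8");
}
writeSrt(result.segments, "./audio.srt");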
Buffer-based transcription
import {
createWhisperContext,
transcribeAsync,
} from "whisper-cpp-node";
const ctx = createWhisperContext({
model: "./models/ggml-base.en.bin",
use_gpu: true,
});
// Pass raw PCM audio (16kHz, mono, float32)
const pcmData = new Float32Array(/* your audio samples */);
const result = await transcribeAsync(ctx, {
pcmf32: pcmData,
language: "en",
});
for (const [start, end, text] of result.segments) {
console.log(`[${start} --> ${end}]${text}`);
}
ctx.free();
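If your audio is not already 16kHz mono float32 PCM, one common approach is to decode it with ffmpeg. A sketch assuming ffmpeg is available on your PATH; decodeToPcm is an illustrative helper, not part of the package:
import { execFileSync } from "node:child_process";
// Decode any audio file to raw float32 PCM: 16kHz, mono, little-endian
function decodeToPcm(file: string): Float32Array {
  const raw = execFileSync(
    "ffmpeg",
    ["-i", file, "-f", "f32le", "-ac", "1", "-ar", "16000", "pipe:1"],
    { maxBuffer: 1 << 30 } // allow long recordings
  );
  // Copy into an aligned ArrayBuffer before viewing as Float32Array
  const bytes = raw.buffer.slice(raw.byteOffset, raw.byteOffset + raw.byteLength);
  return new Float32Array(bytes);
}
const pcmData = decodeToPcm("./audio.mp3");
The returned Float32Array can be passed directly as pcmf32.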
Streaming transcription
Get real-time output as audio is processed. The on_new_segment callback fires for each segment as it's generated, while the final callback still receives all segments at completion (backward compatible):
import { createWhisperContext, transcribe } from "whisper-cpp-node";
const ctx = createWhisperContext({
model: "./models/ggml-base.en.bin",
});
transcribe(ctx, {
fname_inp: "./long-audio.wav",
language: "en",
// Called for each segment as it's generated
on_new_segment: (segment) => {
console.log(`[${segment.start}]${segment.text}`);
},
}, (err, result) => {
// Final callback still receives ALL segments at completion
console.log(`Done! ${result.segments.length} segments`);
ctx.free();
});
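Since on_new_segment is listed among the common transcription options (see below), it should also work with the Promise-based API; a sketch combining live per-segment output with the awaited final result, using a context created as above:
const result = await transcribeAsync(ctx, {
  fname_inp: "./long-audio.wav",
  language: "en",
  on_new_segment: (segment) => {
    // Live output while transcription is still running
    console.log(`[${segment.start}]${segment.text}`);
  },
});
console.log(`Done! ${result.segments.length} segments`);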
API
createWhisperContext(options)
Create a persistent context for transcription.
interface WhisperContextOptions {
model: string; // Path to GGML model file (required)
use_gpu?: boolean; // Enable GPU acceleration (default: true)
// Uses Metal on macOS, Vulkan on Windows
use_coreml?: boolean; // Enable Core ML on macOS (default: false)
use_openvino?: boolean; // Enable OpenVINO encoder on Intel (default: false)
openvino_device?: string; // OpenVINO device: 'CPU', 'GPU', 'NPU' (default: 'CPU')
openvino_model_path?: string; // Path to OpenVINO encoder model (auto-derived)
openvino_cache_dir?: string; // Cache dir for compiled OpenVINO models
flash_attn?: boolean; // Enable Flash Attention (default: false)
gpu_device?: number; // GPU device index (default: 0, see getGpuDevices())
dtw?: string; // DTW preset for word timestamps
no_prints?: boolean; // Suppress log output (default: false)
}
transcribeAsync(context, options)
Transcribe audio (Promise-based). Accepts either a file path or PCM buffer.
// File input
interface TranscribeOptionsFile {
fname_inp: string; // Path to audio file
// ... common options
}
// Buffer input
interface TranscribeOptionsBuffer {
pcmf32: Float32Array; // Raw PCM (16kHz, mono, float32, -1.0 to 1.0)
// ... common options
}
// Common options (partial list - see types.ts for full options)
interface TranscribeOptionsBase {
// Language
language?: string; // Language code ('en', 'zh', 'auto')
translate?: boolean; // Translate to English
detect_language?: boolean; // Auto-detect language
// Threading
n_threads?: number; // CPU threads (default: 4)
n_processors?: number; // Parallel processors
// Audio processing
offset_ms?: number; // Start offset in ms
duration_ms?: number; // Duration to process (0 = all)
// Output control
no_timestamps?: boolean; // Disable timestamps
max_len?: number; // Max segment length (chars)
max_tokens?: number; // Max tokens per segment
split_on_word?: boolean; // Split on word boundaries
token_timestamps?: boolean; // Include token-level timestamps
// Sampling
temperature?: number; // Sampling temperature (0.0 = greedy)
beam_size?: number; // Beam search size (-1 = greedy)
best_of?: number; // Best-of-N sampling
// Thresholds
entropy_thold?: number; // Entropy threshold
logprob_thold?: number; // Log probability threshold
no_speech_thold?: number; // No-speech probability threshold
// Context
prompt?: string; // Initial prompt text
no_context?: boolean; // Don't use previous context
// VAD preprocessing
vad?: boolean; // Enable VAD preprocessing
vad_model?: string; // Path to VAD model
vad_threshold?: number; // VAD threshold (0.0-1.0)
vad_min_speech_duration_ms?: number;
vad_min_silence_duration_ms?: number;
vad_speech_pad_ms?: number;
// Callbacks
progress_callback?: (progress: number) => void;
on_new_segment?: (segment: StreamingSegment) => void; // Streaming callback
}
// Streaming segment (passed to on_new_segment callback)
interface StreamingSegment {
start: string; // Start timestamp "HH:MM:SS,mmm"
end: string; // End timestamp
text: string; // Transcribed text
segment_index: number; // 0-based index
is_partial: boolean; // Reserved for future use
tokens?: StreamingToken[]; // Only if token_timestamps enabled
}
// Result
interface TranscribeResult {
segments: TranscriptSegment[];
}
// Segment is a tuple: [start, end, text]
type TranscriptSegment = [string, string, string];
// Example: ["00:00:00,000", "00:00:02,500", " Hello world"]
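A sketch combining several of the common options above: VAD preprocessing (using the Silero model from the Models section), translation to English, and a progress callback. The units of the progress value are not specified here, so the example just logs the raw number:
const result = await transcribeAsync(ctx, {
  fname_inp: "./interview.wav",
  translate: true,  // output English regardless of source language
  n_threads: 8,
  vad: true,        // skip non-speech regions before decoding
  vad_model: "./models/ggml-silero-v6.2.0.bin",
  vad_threshold: 0.5,
  progress_callback: (progress) => console.log("progress:", progress),
});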
getGpuDevices()
Enumerate available GPU backend devices. Returns an array of discrete and integrated GPU devices. Never throws; returns an empty array if no GPUs are available.
import { getGpuDevices, createWhisperContext } from "whisper-cpp-node";
const gpus = getGpuDevices();
for (const gpu of gpus) {
console.log(`[${gpu.index}] ${gpu.description} (${gpu.type}, ${(gpu.memory_total / 1e9).toFixed(1)} GB)`);
}
// Example output:
// [0] NVIDIA GeForce RTX 4050 Laptop GPU (gpu, 6.0 GB)
// [1] AMD Radeon 740M (igpu, 8.0 GB)
// Use a specific GPU for transcription:
const ctx = createWhisperContext({
model: "./models/ggml-base.en.bin",
gpu_device: gpus[0].index,
});
interface GpuDevice {
index: number; // GPU-relative index (matches gpu_device option)
name: string; // Backend device name (e.g., "Vulkan0")
description: string; // Human-readable name (e.g., "NVIDIA GeForce RTX 4050")
type: "gpu" | "igpu"; // Discrete or integrated GPU
memory_free: number; // Free memory in bytes
memory_total: number; // Total memory in bytes
}
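Because getGpuDevices() returns an empty array when no GPU is available, a CPU fallback is easy to express. A sketch that prefers the device with the most free memory:
import { getGpuDevices, createWhisperContext } from "whisper-cpp-node";
const gpus = getGpuDevices();
// Prefer the GPU with the most free memory; fall back to CPU if none found
const best = [...gpus].sort((a, b) => b.memory_free - a.memory_free)[0];
const ctx = createWhisperContext({
  model: "./models/ggml-base.en.bin",
  use_gpu: best !== undefined,
  gpu_device: best?.index ?? 0,
});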
createVadContext(options)
Create a voice activity detection context for streaming audio.
interface VadContextOptions {
model: string; // Path to Silero VAD model
threshold?: number; // Speech threshold (default: 0.5)
n_threads?: number; // Number of threads (default: 1)
no_prints?: boolean; // Suppress log output
}
interface VadContext {
getWindowSamples(): number; // Returns 512 (32ms at 16kHz)
getSampleRate(): number; // Returns 16000
process(samples: Float32Array): number; // Returns probability 0.0-1.0
reset(): void; // Reset LSTM state
free(): void; // Release resources
}
VAD Example
import { createVadContext } from "whisper-cpp-node";
const vad = createVadContext({
model: "./models/ggml-silero-v6.2.0.bin",
threshold: 0.5,
});
const windowSize = vad.getWindowSamples(); // 512 samples
// Process audio in 32ms chunks
function processAudioChunk(samples: Float32Array) {
const probability = vad.process(samples);
if (probability >= 0.5) {
console.log("Speech detected!", probability);
}
}
// Reset when starting new audio stream
vad.reset();
// Clean up when done
vad.free();
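process() takes one window at a time, while capture callbacks usually deliver chunks of arbitrary size. A sketch of a small accumulator that continues the example above, assuming process() expects exactly getWindowSamples() samples per call:
const windowSize = vad.getWindowSamples(); // 512 samples = 32ms at 16kHz
let pending = new Float32Array(0);
function onAudio(chunk: Float32Array) {
  // Append the new chunk to any leftover samples from the last call
  const merged = new Float32Array(pending.length + chunk.length);
  merged.set(pending);
  merged.set(chunk, pending.length);
  // Run VAD over every complete window; keep the remainder for next time
  let offset = 0;
  while (offset + windowSize <= merged.length) {
    const prob = vad.process(merged.subarray(offset, offset + windowSize));
    if (prob >= 0.5) console.log("Speech detected!", prob);
    offset += windowSize;
  }
  pending = merged.subarray(offset);
}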
Core ML Acceleration (macOS)
For 3x+ faster encoding on Apple Silicon:
Generate a Core ML model:
pip install ane_transformers openai-whisper coremltools
./models/generate-coreml-model.sh base.en
Place it next to your GGML model:
models/ggml-base.en.bin
models/ggml-base.en-encoder.mlmodelc/
Enable Core ML:
const ctx = createWhisperContext({
  model: "./models/ggml-base.en.bin",
  use_coreml: true,
});
OpenVINO Acceleration (Intel)
For faster encoder inference on Intel CPUs and GPUs (requires build with OpenVINO support):
Install OpenVINO and convert the model:
pip install openvino openvino-dev
python models/convert-whisper-to-openvino.py --model base.en
The OpenVINO model files are placed next to your GGML model:
models/ggml-base.en.bin
models/ggml-base.en-encoder-openvino.xml
models/ggml-base.en-encoder-openvino.bin
Enable OpenVINO:
const ctx = createWhisperContext({
  model: "./models/ggml-base.en.bin",
  use_openvino: true,
  openvino_device: "CPU", // or "GPU" for Intel iGPU
  openvino_cache_dir: "./openvino_cache", // optional, speeds up init
});
Note: OpenVINO support requires the addon to be built with -DADDON_OPENVINO=ON.
Models
Download models from Hugging Face:
# Base English model (~150MB)
curl -L -o models/ggml-base.en.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin
# Large v3 Turbo quantized (~500MB)
curl -L -o models/ggml-large-v3-turbo-q4_0.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo-q4_0.bin
# Silero VAD model (for streaming VAD)
curl -L -o models/ggml-silero-v6.2.0.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-silero-v6.2.0.bin
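If you prefer to fetch models from Node rather than curl, a minimal sketch using the built-in fetch (Node 18+); ensureModel is an illustrative helper that builds the same Hugging Face URLs shown above:
import { existsSync } from "node:fs";
import { writeFile, mkdir } from "node:fs/promises";
// Download a model once; skip the download if it is already on disk
async function ensureModel(name: string, dir = "./models") {
  const dest = `${dir}/${name}`;
  if (existsSync(dest)) return dest;
  await mkdir(dir, { recursive: true });
  const url = `https://huggingface.co/ggerganov/whisper.cpp/resolve/main/${name}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed: ${res.status}`);
  await writeFile(dest, Buffer.from(await res.arrayBuffer()));
  return dest;
}
const model = await ensureModel("ggml-base.en.bin");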
License
MIT
