# @mynamezxc/mow-speech-to-text

v1.2.1
Zero-Python speech-to-text for Node.js — powered by OpenAI Whisper with built-in speaker diarization, GPU acceleration, anti-hallucination filtering, and smart VAD.
No Python. No Docker. Just `npm install` and go.
## Features

- **3 Backends** — HuggingFace Local (ONNX), HuggingFace Inference API (full-precision), OpenAI API
- **GPU Acceleration** — Auto-detects GPU via ONNX Runtime providers (DirectML on Windows, CUDA on Linux); override with `--device gpu|cpu`
- **Speaker Diarization** — Automatic multi-speaker detection with labeled output (`SPEAKER1`, `SPEAKER2`, ...)
- **Anti-Hallucination Filter** — Removes fabricated content such as repeated phrases, language mismatches, and known noise patterns
- **Smart VAD** — Voice Activity Detection with tunable onset/offset thresholds and minimum segment filtering
- **Confidence Scoring** — Per-segment and global confidence output
- **CLI + REST API** — Transcribe files and folders from the command line, or run as an HTTP server
- **Multi-Language** — Supports 90+ languages via `--language`
## Requirements

- Node.js >= 18
- FFmpeg — Bundled via `ffmpeg-static`; falls back to system FFmpeg if available
- GPU (optional) — Windows: DirectML (any DirectX 12 GPU); Linux: NVIDIA GPU with CUDA 11+
## Installation

```bash
npm install -g @mynamezxc/mow-speech-to-text
```

Or install locally:

```bash
npm install @mynamezxc/mow-speech-to-text
npx mow help
```

## Quick Start
```bash
# Transcribe a file (default model: large-v3)
mow convert "recording.wav" --language en

# Specify language explicitly
mow convert "recording.wav" --language vi

# Force GPU execution
mow convert "recording.wav" --device gpu --language en

# Use HuggingFace Inference API (full-precision, remote)
mow convert "recording.wav" --model openai/whisper-large-v3 --hf-token hf_xxxxx --language en

# Use OpenAI API
mow convert "recording.wav" --model gpt-4o-transcribe --openai-key sk-xxxxx --language en
```

## Backend Architecture
| Backend | Flag | Models | Quality | Requirements |
|---|---|---|---|---|
| HuggingFace Local (default) | (none) | `Xenova/whisper-*` | Good (ONNX quantized) | No API key |
| HuggingFace Inference API | `--hf-token` | `openai/whisper-*` | Very good (full-precision) | HF token |
| OpenAI API | `--openai-key` | `whisper-1`, `gpt-4o-transcribe` | Best | OpenAI API key |
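Which backend runs is determined by which credential flag is present. A rough sketch of that dispatch (function and field names are illustrative, not MOW internals, and the precedence when both keys are supplied is an assumption):

```javascript
// Illustrative sketch of backend selection from CLI flags, per the table
// above. Names and precedence are assumptions, not the package's real API.
function pickBackend(opts = {}) {
  if (opts.openaiKey) {
    // --openai-key: OpenAI API (whisper-1, gpt-4o-transcribe)
    return { backend: "openai-api", model: opts.model ?? "whisper-1" };
  }
  if (opts.hfToken) {
    // --hf-token: HuggingFace Inference API, full-precision models
    return { backend: "hf-inference", model: opts.model ?? "openai/whisper-large-v3" };
  }
  // Default: local ONNX-quantized Xenova models, no API key needed
  return { backend: "hf-local", model: opts.model ?? "Xenova/whisper-large-v3" };
}
```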
## GPU Acceleration

MOW uses `@huggingface/transformers` v3, which bundles ONNX Runtime (`onnxruntime-node`) with built-in GPU support:

| Platform | Provider | GPU Support |
|---|---|---|
| Windows x64/arm64 | DirectML (DML) | Any DirectX 12 GPU (NVIDIA, AMD, Intel) |
| Linux x64 | CUDA | NVIDIA GPU with CUDA 11.8+ |
| macOS | — | CPU only |
GPU detection priority:

- `MOW_DEVICE` env var — set `MOW_DEVICE=gpu` or `MOW_DEVICE=cpu` to override
- Platform detection — Windows → DirectML; Linux x64 with CUDA → CUDA
- `--device` flag — `--device gpu` or `--device cpu`
If GPU initialization fails at model load time, it falls back to CPU automatically.
**DirectML (Windows):** No CUDA Toolkit needed. DirectML uses DirectX 12 — works on any modern GPU from NVIDIA, AMD, or Intel.

**FP32 models on GPU:** DirectML does not support INT8 quantized models. When GPU is active, MOW loads FP32 (full-precision) models automatically.
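The detection rules above can be sketched roughly as follows. This is a simplified, illustrative version: the function name is hypothetical and the real precedence and CUDA detection in MOW may differ.

```javascript
// Rough sketch of device resolution: an explicit --device flag or the
// MOW_DEVICE env var overrides platform detection. Not MOW's actual code.
function resolveDevice({ flag, env = process.env, platform = process.platform, hasCuda = false } = {}) {
  const override = flag ?? env.MOW_DEVICE;
  if (override === "gpu" || override === "cpu") return override;
  if (platform === "win32") return "gpu";            // DirectML: any DX12 GPU
  if (platform === "linux" && hasCuda) return "gpu"; // CUDA 11.8+
  return "cpu";                                      // macOS and everything else
}
```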
```bash
# GPU (auto-detected or explicit)
mow convert "audio.wav" --device gpu --language en

# Force CPU
mow convert "audio.wav" --device cpu --language en

# Override via environment variable
MOW_DEVICE=gpu mow convert "audio.wav"
```

## CLI Reference
### Commands

| Command | Description |
|---|---|
| `mow convert <input> [options]` | Transcribe a file or folder |
| `mow models` | List all supported models |
| `mow serve [port]` | Start the REST API server |
| `mow help` | Show help |
### Options

| Option | Description | Default |
|---|---|---|
| `--model <name>` | Whisper model name or alias | `large-v3` |
| `--language <code>` | Language code (`en`, `vi`, `ja`, ...) | `en` |
| `--device <cpu\|gpu>` | Compute device (DirectML / CUDA) | auto-detected |
| `--output <path>` | Output file or directory path | (same dir as input) |
| `--hf-token <token>` | HuggingFace token (enables Inference API) | — |
| `--openai-key <key>` | OpenAI API key (enables OpenAI backend) | — |
| `--vad-onset <n>` | VAD onset threshold | 0.85 |
| `--vad-offset <n>` | VAD offset threshold | 0.65 |
| `--min-speakers <n>` | Minimum number of speakers | 1 |
| `--max-speakers <n>` | Maximum number of speakers | 3 |
| `--diarization false` | Disable speaker diarization | true |
| `--recursive` | Scan subfolders for folder input | true |
| `--json` | Output `.json` alongside `.txt` | false |
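The `--vad-onset`/`--vad-offset` pair forms a hysteresis: speech begins only when the frame-level speech probability rises above the onset threshold, and ends only once it falls below the lower offset threshold, so brief dips mid-utterance don't split a segment. A simplified sketch of that segmentation under assumed frame sizes (this is not MOW's actual implementation):

```javascript
// Simplified VAD segmentation with onset/offset hysteresis and a
// minimum-duration filter. probs: per-frame speech probabilities;
// frameSec: assumed duration of one frame in seconds.
function vadSegments(probs, { onset = 0.85, offset = 0.65, minDur = 0.8, frameSec = 0.02 } = {}) {
  const segments = [];
  let start = null;
  probs.forEach((p, i) => {
    if (start === null && p >= onset) {
      start = i; // probability rose above onset: speech begins
    } else if (start !== null && p < offset) {
      segments.push({ start: start * frameSec, end: i * frameSec }); // fell below offset: speech ends
      start = null;
    }
  });
  if (start !== null) segments.push({ start: start * frameSec, end: probs.length * frameSec });
  return segments.filter((s) => s.end - s.start >= minDur); // drop sub-minDur noise blips
}
```

Raising the onset makes the detector more conservative about starting a segment; lowering the offset makes it more reluctant to end one.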
### Examples

```bash
# Single file with output path
mow convert "call.wav" --language vi --output "call_transcript.txt"

# Entire folder with recursive scan
mow convert "./recordings/" --output "./transcripts/" --recursive --language en

# Disable diarization for single-speaker audio
mow convert "lecture.wav" --diarization false

# Custom VAD and speaker range
mow convert "meeting.wav" --vad-onset 0.9 --vad-offset 0.7 --min-speakers 2 --max-speakers 5

# Use a lighter model for faster processing
mow convert "note.wav" --model tiny --language en
```

## Sample Output
```text
[00:00:01.170 - 00:00:04.710] SPEAKER1 (85.4%): I need some help with the insurance software.
[00:00:05.760 - 00:00:07.800] SPEAKER2 (72.2%): Sure, what's the issue?
[00:00:09.180 - 00:00:14.230] SPEAKER1 (78.0%): The form won't submit. It keeps showing an error.
[00:00:14.430 - 00:00:16.770] SPEAKER2 (81.3%): Let me take a look at that for you.
[00:00:21.180 - 00:00:24.030] SPEAKER1 (71.3%): Thank you, I appreciate it.
```

## Anti-Hallucination Filter
Whisper models — especially ONNX quantized variants — are prone to hallucination. MOW includes a multi-layered filter that automatically removes fabricated output:
| Filter | Description |
|---|---|
| Pattern Matching | Detects known hallucination phrases ("subscribe", "thank you for watching", URLs, etc.) |
| Language Mismatch | Flags pure-English output when a non-English language is specified (and vice versa) |
| Repetition Detection | Identifies repeated phrases and single-word loops — strong hallucination signals |
| Length Ratio | Rejects segments where text length is disproportionate to audio duration (>25 chars/sec) |
| Short Segments | Skips segments shorter than 0.8s to avoid noise artifacts |
| Tight VAD | Default onset=0.85, offset=0.65 minimizes noise-as-speech false positives |
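Two of the filters above are easy to illustrate in isolation: the length-ratio check and repetition detection. A simplified sketch (the >25 chars/sec threshold comes from the table; the repetition threshold and function name are illustrative, and MOW's real filter is more involved):

```javascript
// Simplified hallucination checks mirroring two rows of the table above.
// Returns true when a segment looks fabricated and should be dropped.
function looksHallucinated(text, durationSec) {
  // Length ratio: more than ~25 characters per second of audio is
  // implausibly dense output for real speech.
  if (durationSec > 0 && text.length / durationSec > 25) return true;
  // Repetition: a few words looping over and over is a classic Whisper
  // hallucination signature. The 0.34 unique-word ratio is illustrative.
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length >= 6 && new Set(words).size / words.length < 0.34) return true;
  return false;
}
```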
## REST API

### Start the Server

```bash
mow serve 3001
```

### Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Health check and engine info |
| GET | `/api/models` | List available models |
| POST | `/api/transcribe` | Transcribe an audio file |
| GET | `/api/docs` | API documentation |
### Request Example
```bash
curl -X POST http://localhost:3001/api/transcribe \
  -F "file=@audio.wav" \
  -F "model=large-v3" \
  -F "language=en" \
  -F "min_speakers=1" \
  -F "max_speakers=3"
```

### Response Example
```json
{
  "text": "SPEAKER1: I need some help with the insurance software...",
  "segments": [
    {
      "speaker": "SPEAKER1",
      "start": 1.17,
      "end": 4.71,
      "text": "I need some help with the insurance software",
      "confidence": 0.854,
      "confidenceSource": "estimated"
    }
  ],
  "confidence": 0.78,
  "confidenceSource": "estimated",
  "asr": {
    "engine": "huggingface-xenova",
    "model": "Xenova/whisper-large-v3",
    "modelId": "Xenova/whisper-large-v3",
    "runtimeDevice": "local-gpu-DmlExecutionProvider",
    "computeType": "gpu-fp16",
    "batchSize": "1"
  }
}
```

## Supported Models
### HuggingFace Local (ONNX — offline)

| Alias | Model ID | Size |
|---|---|---|
| `tiny` | `Xenova/whisper-tiny` | ~39 MB |
| `base` | `Xenova/whisper-base` | ~74 MB |
| `small` | `Xenova/whisper-small` | ~244 MB |
| `medium` | `Xenova/whisper-medium` | ~769 MB |
| `large` | `Xenova/whisper-large` | ~1.5 GB |
| `large-v3` | `Xenova/whisper-large-v3` | ~1.5 GB |
| `turbo` | `Xenova/whisper-large-v3-turbo` | ~809 MB |
### HuggingFace Inference API (requires `--hf-token`)

| Model ID | Notes |
|---|---|
| `openai/whisper-large-v3` | Full-precision, highest quality |
| `openai/whisper-large-v3-turbo` | Faster, near large-v3 quality |
| `openai/whisper-large-v2` | Previous generation |
| `openai/whisper-medium` | Balanced speed/quality |
| `openai/whisper-small` | Lightweight |
### OpenAI API (requires `--openai-key`)

| Model ID | Notes |
|---|---|
| `whisper-1` | OpenAI Whisper API |
| `gpt-4o-mini-transcribe` | GPT-4o mini transcribe |
| `gpt-4o-transcribe` | GPT-4o transcribe (best quality) |
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MOW_DEVICE` | (auto) | Force compute device: `gpu` or `cpu` |
| `WHISPER_LANGUAGE` | `en` | Default language code |
| `WHISPER_VAD_ONSET` | 0.85 | VAD onset threshold |
| `WHISPER_VAD_OFFSET` | 0.65 | VAD offset threshold |
| `DIARIZATION_MIN_SPEAKERS` | 1 | Minimum speakers |
| `DIARIZATION_MAX_SPEAKERS` | 3 | Maximum speakers |
| `DIARIZATION_VAD_ONSET` | 0.85 | Diarization VAD onset |
| `DIARIZATION_VAD_OFFSET` | 0.65 | Diarization VAD offset |
## Notes

- **First run** — Local HuggingFace models are downloaded on first use and cached at `~/.cache/huggingface`. Subsequent runs use the cache with no internet required.
- **Large models** — `large-v3` (~1.5 GB) may require increasing the Node.js heap size: `NODE_OPTIONS=--max-old-space-size=8192`.
- **Confidence** — When the model provides confidence scores directly, those are used. Otherwise, confidence is estimated from signal features (`confidenceSource: "estimated"`).
- **Hallucination** — ONNX quantized models (Xenova) are more prone to hallucination than full-precision models. For best accuracy, use `--hf-token` with `openai/whisper-large-v3` or `--openai-key` with `gpt-4o-transcribe`.
## License

MIT
