# @mynamezxc/mow-speech-to-text

v1.2.1
Zero-Python speech-to-text for Node.js — powered by OpenAI Whisper with built-in speaker diarization, GPU acceleration, anti-hallucination filtering, and smart VAD.
No Python. No Docker. Just `npm install` and go.
## Features

- **3 Backends** — HuggingFace Local (ONNX), HuggingFace Inference API (full-precision), OpenAI API
- **GPU Acceleration** — Auto-detects GPU via ONNX Runtime providers (DirectML on Windows, CUDA on Linux); override with `--device gpu|cpu`
- **Speaker Diarization** — Automatic multi-speaker detection with labeled output (`SPEAKER1`, `SPEAKER2`, ...)
- **Anti-Hallucination Filter** — Removes fabricated content such as repeated phrases, language mismatches, and known noise patterns
- **Smart VAD** — Voice Activity Detection with tunable onset/offset thresholds and minimum segment filtering
- **Confidence Scoring** — Per-segment and global confidence output
- **CLI + REST API** — Transcribe files and folders from the command line, or run as an HTTP server
- **Multi-Language** — Supports 90+ languages via `--language`
## Requirements

- Node.js >= 18
- FFmpeg — Bundled via `ffmpeg-static`; falls back to system FFmpeg if available
- GPU (optional) — Windows: DirectML (any DirectX 12 GPU); Linux: NVIDIA GPU with CUDA 11+
## Installation

```bash
npm install -g @mynamezxc/mow-speech-to-text
```

Or install locally:

```bash
npm install @mynamezxc/mow-speech-to-text
npx mow help
```

## Quick Start
```bash
# Transcribe a file (default model: large-v3)
mow convert "recording.wav" --language en

# Specify language explicitly
mow convert "recording.wav" --language vi

# Force GPU execution
mow convert "recording.wav" --device gpu --language en

# Use HuggingFace Inference API (full-precision, remote)
mow convert "recording.wav" --model openai/whisper-large-v3 --hf-token hf_xxxxx --language en

# Use OpenAI API
mow convert "recording.wav" --model gpt-4o-transcribe --openai-key sk-xxxxx --language en
```

## Backend Architecture
| Backend | Flag | Models | Quality | Requirements |
|---|---|---|---|---|
| HuggingFace Local (default) | (none) | `Xenova/whisper-*` | Good (ONNX quantized) | No API key |
| HuggingFace Inference API | `--hf-token` | `openai/whisper-*` | Very good (full-precision) | HF token |
| OpenAI API | `--openai-key` | `whisper-1`, `gpt-4o-transcribe` | Best | OpenAI API key |
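Which backend runs is determined by which credential flag is present. A rough sketch of that dispatch (function and field names are illustrative, not MOW internals, and the precedence when both keys are supplied is an assumption):

```javascript
// Illustrative sketch of backend selection from CLI flags, per the table
// above. Names and precedence are assumptions, not the package's real API.
function pickBackend(opts = {}) {
  if (opts.openaiKey) {
    // --openai-key: OpenAI API (whisper-1, gpt-4o-transcribe)
    return { backend: "openai-api", model: opts.model ?? "whisper-1" };
  }
  if (opts.hfToken) {
    // --hf-token: HuggingFace Inference API, full-precision models
    return { backend: "hf-inference", model: opts.model ?? "openai/whisper-large-v3" };
  }
  // Default: local ONNX-quantized Xenova models, no API key needed
  return { backend: "hf-local", model: opts.model ?? "Xenova/whisper-large-v3" };
}
```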
## GPU Acceleration

MOW uses `@huggingface/transformers` v3, which bundles ONNX Runtime (`onnxruntime-node`) with built-in GPU support:

| Platform | Provider | GPU Support |
|---|---|---|
| Windows x64/arm64 | DirectML (DML) | Any DirectX 12 GPU (NVIDIA, AMD, Intel) |
| Linux x64 | CUDA | NVIDIA GPU with CUDA 11.8+ |
| macOS | — | CPU only |
GPU detection priority:

- `MOW_DEVICE` env var — set `MOW_DEVICE=gpu` or `MOW_DEVICE=cpu` to override
- Platform detection — Windows → DirectML; Linux x64 with CUDA → CUDA
- `--device` flag — `--device gpu` or `--device cpu`
If GPU initialization fails at model load time, it falls back to CPU automatically.
**DirectML (Windows):** No CUDA Toolkit needed. DirectML uses DirectX 12 — works on any modern GPU from NVIDIA, AMD, or Intel.

**FP32 models on GPU:** DirectML does not support INT8 quantized models. When GPU is active, MOW loads FP32 (full-precision) models automatically.
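The detection rules above can be sketched roughly as follows. This is a simplified, illustrative version: the function name is hypothetical and the real precedence and CUDA detection in MOW may differ.

```javascript
// Rough sketch of device resolution: an explicit --device flag or the
// MOW_DEVICE env var overrides platform detection. Not MOW's actual code.
function resolveDevice({ flag, env = process.env, platform = process.platform, hasCuda = false } = {}) {
  const override = flag ?? env.MOW_DEVICE;
  if (override === "gpu" || override === "cpu") return override;
  if (platform === "win32") return "gpu";            // DirectML: any DX12 GPU
  if (platform === "linux" && hasCuda) return "gpu"; // CUDA 11.8+
  return "cpu";                                      // macOS and everything else
}
```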
```bash
# GPU (auto-detected or explicit)
mow convert "audio.wav" --device gpu --language en

# Force CPU
mow convert "audio.wav" --device cpu --language en

# Override via environment variable
MOW_DEVICE=gpu mow convert "audio.wav"
```

## CLI Reference
### Commands

| Command | Description |
|---|---|
| `mow convert <input> [options]` | Transcribe a file or folder |
| `mow models` | List all supported models |
| `mow serve [port]` | Start the REST API server |
| `mow help` | Show help |
### Options

| Option | Description | Default |
|---|---|---|
| `--model <name>` | Whisper model name or alias | `large-v3` |
| `--language <code>` | Language code (`en`, `vi`, `ja`, ...) | `en` |
| `--device <cpu\|gpu>` | Compute device (DirectML / CUDA) | auto-detected |
| `--output <path>` | Output file or directory path | (same dir as input) |
| `--hf-token <token>` | HuggingFace token (enables Inference API) | — |
| `--openai-key <key>` | OpenAI API key (enables OpenAI backend) | — |
| `--vad-onset <n>` | VAD onset threshold | 0.85 |
| `--vad-offset <n>` | VAD offset threshold | 0.65 |
| `--min-speakers <n>` | Minimum number of speakers | 1 |
| `--max-speakers <n>` | Maximum number of speakers | 3 |
| `--diarization false` | Disable speaker diarization | true |
| `--recursive` | Scan subfolders for folder input | true |
| `--json` | Output `.json` alongside `.txt` | false |
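The `--vad-onset`/`--vad-offset` pair forms a hysteresis: speech begins only when the frame-level speech probability rises above the onset threshold, and ends only once it falls below the lower offset threshold, so brief dips mid-utterance don't split a segment. A simplified sketch of that segmentation under assumed frame sizes (this is not MOW's actual implementation):

```javascript
// Simplified VAD segmentation with onset/offset hysteresis and a
// minimum-duration filter. probs: per-frame speech probabilities;
// frameSec: assumed duration of one frame in seconds.
function vadSegments(probs, { onset = 0.85, offset = 0.65, minDur = 0.8, frameSec = 0.02 } = {}) {
  const segments = [];
  let start = null;
  probs.forEach((p, i) => {
    if (start === null && p >= onset) {
      start = i; // probability rose above onset: speech begins
    } else if (start !== null && p < offset) {
      segments.push({ start: start * frameSec, end: i * frameSec }); // fell below offset: speech ends
      start = null;
    }
  });
  if (start !== null) segments.push({ start: start * frameSec, end: probs.length * frameSec });
  return segments.filter((s) => s.end - s.start >= minDur); // drop sub-minDur noise blips
}
```

Raising the onset makes the detector more conservative about starting a segment; lowering the offset makes it more reluctant to end one.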
### Examples

```bash
# Single file with output path
mow convert "call.wav" --language vi --output "call_transcript.txt"

# Entire folder with recursive scan
mow convert "./recordings/" --output "./transcripts/" --recursive --language en

# Disable diarization for single-speaker audio
mow convert "lecture.wav" --diarization false

# Custom VAD and speaker range
mow convert "meeting.wav" --vad-onset 0.9 --vad-offset 0.7 --min-speakers 2 --max-speakers 5

# Use a lighter model for faster processing
mow convert "note.wav" --model tiny --language en
```

## Sample Output
```text
[00:00:01.170 - 00:00:04.710] SPEAKER1 (85.4%): I need some help with the insurance software.
[00:00:05.760 - 00:00:07.800] SPEAKER2 (72.2%): Sure, what's the issue?
[00:00:09.180 - 00:00:14.230] SPEAKER1 (78.0%): The form won't submit. It keeps showing an error.
[00:00:14.430 - 00:00:16.770] SPEAKER2 (81.3%): Let me take a look at that for you.
[00:00:21.180 - 00:00:24.030] SPEAKER1 (71.3%): Thank you, I appreciate it.
```

## Anti-Hallucination Filter
Whisper models — especially ONNX quantized variants — are prone to hallucination. MOW includes a multi-layered filter that automatically removes fabricated output:
| Filter | Description |
|---|---|
| Pattern Matching | Detects known hallucination phrases ("subscribe", "thank you for watching", URLs, etc.) |
| Language Mismatch | Flags pure-English output when a non-English language is specified (and vice versa) |
| Repetition Detection | Identifies repeated phrases and single-word loops — strong hallucination signals |
| Length Ratio | Rejects segments where text length is disproportionate to audio duration (>25 chars/sec) |
| Short Segments | Skips segments shorter than 0.8s to avoid noise artifacts |
| Tight VAD | Default onset=0.85, offset=0.65 minimizes noise-as-speech false positives |
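Two of the filters above are easy to illustrate in isolation: the length-ratio check and repetition detection. A simplified sketch (the >25 chars/sec threshold comes from the table; the repetition threshold and function name are illustrative, and MOW's real filter is more involved):

```javascript
// Simplified hallucination checks mirroring two rows of the table above.
// Returns true when a segment looks fabricated and should be dropped.
function looksHallucinated(text, durationSec) {
  // Length ratio: more than ~25 characters per second of audio is
  // implausibly dense output for real speech.
  if (durationSec > 0 && text.length / durationSec > 25) return true;
  // Repetition: a few words looping over and over is a classic Whisper
  // hallucination signature. The 0.34 unique-word ratio is illustrative.
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length >= 6 && new Set(words).size / words.length < 0.34) return true;
  return false;
}
```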
## REST API

### Start the Server

```bash
mow serve 3001
```

### Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Health check and engine info |
| GET | `/api/models` | List available models |
| POST | `/api/transcribe` | Transcribe an audio file |
| GET | `/api/docs` | API documentation |
### Request Example
```bash
curl -X POST http://localhost:3001/api/transcribe \
  -F "file=@audio.wav" \
  -F "model=large-v3" \
  -F "language=en" \
  -F "min_speakers=1" \
  -F "max_speakers=3"
```

### Response Example
```json
{
  "text": "SPEAKER1: I need some help with the insurance software...",
  "segments": [
    {
      "speaker": "SPEAKER1",
      "start": 1.17,
      "end": 4.71,
      "text": "I need some help with the insurance software",
      "confidence": 0.854,
      "confidenceSource": "estimated"
    }
  ],
  "confidence": 0.78,
  "confidenceSource": "estimated",
  "asr": {
    "engine": "huggingface-xenova",
    "model": "Xenova/whisper-large-v3",
    "modelId": "Xenova/whisper-large-v3",
    "runtimeDevice": "local-gpu-DmlExecutionProvider",
    "computeType": "gpu-fp16",
    "batchSize": "1"
  }
}
```

## Supported Models
### HuggingFace Local (ONNX — offline)

| Alias | Model ID | Size |
|---|---|---|
| `tiny` | `Xenova/whisper-tiny` | ~39 MB |
| `base` | `Xenova/whisper-base` | ~74 MB |
| `small` | `Xenova/whisper-small` | ~244 MB |
| `medium` | `Xenova/whisper-medium` | ~769 MB |
| `large` | `Xenova/whisper-large` | ~1.5 GB |
| `large-v3` | `Xenova/whisper-large-v3` | ~1.5 GB |
| `turbo` | `Xenova/whisper-large-v3-turbo` | ~809 MB |
### HuggingFace Inference API (requires `--hf-token`)

| Model ID | Notes |
|---|---|
| `openai/whisper-large-v3` | Full-precision, highest quality |
| `openai/whisper-large-v3-turbo` | Faster, near large-v3 quality |
| `openai/whisper-large-v2` | Previous generation |
| `openai/whisper-medium` | Balanced speed/quality |
| `openai/whisper-small` | Lightweight |
### OpenAI API (requires `--openai-key`)

| Model ID | Notes |
|---|---|
| `whisper-1` | OpenAI Whisper API |
| `gpt-4o-mini-transcribe` | GPT-4o mini transcribe |
| `gpt-4o-transcribe` | GPT-4o transcribe (best quality) |
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MOW_DEVICE` | (auto) | Force compute device: `gpu` or `cpu` |
| `WHISPER_LANGUAGE` | `en` | Default language code |
| `WHISPER_VAD_ONSET` | 0.85 | VAD onset threshold |
| `WHISPER_VAD_OFFSET` | 0.65 | VAD offset threshold |
| `DIARIZATION_MIN_SPEAKERS` | 1 | Minimum speakers |
| `DIARIZATION_MAX_SPEAKERS` | 3 | Maximum speakers |
| `DIARIZATION_VAD_ONSET` | 0.85 | Diarization VAD onset |
| `DIARIZATION_VAD_OFFSET` | 0.65 | Diarization VAD offset |
## Notes

- **First run** — Local HuggingFace models are downloaded on first use and cached at `~/.cache/huggingface`. Subsequent runs use the cache with no internet required.
- **Large models** — `large-v3` (~1.5 GB) may require increasing the Node.js heap size: `NODE_OPTIONS=--max-old-space-size=8192`.
- **Confidence** — When the model provides confidence scores directly, those are used. Otherwise, confidence is estimated from signal features (`confidenceSource: "estimated"`).
- **Hallucination** — ONNX quantized models (Xenova) are more prone to hallucination than full-precision models. For best accuracy, use `--hf-token` with `openai/whisper-large-v3` or `--openai-key` with `gpt-4o-transcribe`.
## License

MIT
