onnx-asr-web
v0.1.3

JavaScript ONNX ASR for Node.js and browser
JavaScript ONNX ASR for Node.js and the browser, using `onnxruntime-web`. This package was heavily inspired by the Python istupakov/onnx-asr and aims to be a minimalistic way to achieve state-of-the-art automatic speech recognition with JavaScript.
Features
- Loads models from Hugging Face or local directories
- Autodetects model-type from files
- Supports quantized models
- Works with WAV files/buffers
- Uses Voice Activity Detection (VAD) to do long-form speech-to-text
- Extracts word-level timestamps
- Minimal dependencies (just `onnxruntime-web`)
- Full types
Supported Model Types
- NVIDIA Parakeet, Canary, FastConformer, and Conformer
- OpenAI Whisper
- GigaChat GigaAM
- Kaldi Icefall Zipformer
- T-Tech T-one
- Custom CTC, RNNT, TDT, and Transformer models
Install
```sh
npm install onnx-asr-web
```

`onnxruntime-web` must be 1.24.x or newer. Earlier versions can fail on some models (notably browser VAD graphs).
API Reference
Generated API docs are published in `API.md`. They are emitted from the TypeScript declaration output during `npm run build`, so they stay aligned with the shipped package surface.
Node.js
```js
import {
  loadLocalModel,
  loadHuggingfaceModel,
  loadLocalVadModel,
  loadHuggingfaceVadModel,
} from "onnx-asr-web/node";

const vad = await loadHuggingfaceVadModel("onnx-community/silero-vad", {
  cacheDir: "models",
  quantization: "int8",
});

const local = await loadLocalModel("models/istupakov/parakeet-tdt-0.6b-v3-onnx", {
  quantization: "int8", // default: prefers *.int8.onnx, falls back to *.onnx
  sessionOptions: { executionProviders: ["wasm"] },
  vadModel: vad, // optional: chunks long audio by non-speech
});

const hf = await loadHuggingfaceModel("istupakov/parakeet-tdt-0.6b-v3-onnx", {
  cacheDir: "models",
  quantization: "int8",
  revision: "main",
});
```

`loadHuggingfaceModel()` downloads into `${cacheDir}/${repo_id}` and reuses cached files.
Browser
```js
import {
  configureOrtWeb,
  loadLocalModel,
  loadHuggingfaceModel,
  loadHuggingfaceVadModel,
} from "onnx-asr-web/browser";

configureOrtWeb({ wasmPaths: "/node_modules/onnxruntime-web/dist/" });

const vad = await loadHuggingfaceVadModel("onnx-community/silero-vad");
const modelA = await loadLocalModel("/models/parakeet-tdt-0.6b-v3-onnx/", { vadModel: vad });
const modelB = await loadHuggingfaceModel("istupakov/parakeet-tdt-0.6b-v3-onnx");
```

Transcription
```js
const result = await model.transcribeWavBuffer(await file.arrayBuffer());
console.log(result.text);
console.log(result.words); // [{word, start, end}] in seconds
```

Use `transcribeWavBuffer()` when you have a real WAV file as bytes, for example from:
- a browser file input
- `fs.readFile()` in Node.js
- a downloaded `.wav` asset
Use `transcribeSamples()` when you already have decoded mono PCM samples and know the sample rate:
```js
const result = await model.transcribeSamples(float32Samples, sampleRate);
console.log(result.text);
```

This is usually the right choice when audio is coming from:
- Web Audio API decoding, such as `AudioBuffer.getChannelData(...)`
- microphone capture pipelines that already produce PCM chunks
- custom preprocessing or resampling code
- non-WAV formats that you decode yourself before transcription
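For the Web Audio case above, multi-channel output needs to be reduced to mono before it is handed to the model. A minimal sketch, assuming a hypothetical `downmixToMono` helper (not part of onnx-asr-web):

```javascript
// Sketch: average decoded Web Audio channels into the single mono
// Float32Array that transcribeSamples() consumes. `downmixToMono`
// is an illustrative helper, not a library export.
function downmixToMono(channels) {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i] / channels.length;
  }
  return mono;
}

// In a browser it could be fed from an AudioBuffer:
// const audioBuffer = await audioCtx.decodeAudioData(bytes);
// const channels = [];
// for (let c = 0; c < audioBuffer.numberOfChannels; c++) {
//   channels.push(audioBuffer.getChannelData(c));
// }
// const result = await model.transcribeSamples(downmixToMono(channels), audioBuffer.sampleRate);
```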
`transcribeSamples()` expects normalized mono PCM, typically a `Float32Array` with values in [-1, 1], plus the input sample rate. The library will resample internally when needed.
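To illustrate what "resample internally" means, here is a naive linear-interpolation resampler. This is only a sketch of the concept; it is not onnx-asr-web's actual resampling algorithm, and you normally never need to do this yourself:

```javascript
// Illustration only: naive linear-interpolation resampler mapping
// `samples` at `fromRate` Hz onto `toRate` Hz. Not the library's code.
function resampleLinear(samples, fromRate, toRate) {
  if (fromRate === toRate) return samples;
  const outLength = Math.round((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = (i * fromRate) / toRate; // fractional source index
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    out[i] = samples[i0] + (samples[i1] - samples[i0]) * (pos - i0);
  }
  return out;
}
```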
`transcribeWavBuffer()` is just a convenience wrapper: it decodes the WAV container first and then forwards the decoded samples into `transcribeSamples()`.
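The wrapper's job can be sketched as: read the WAV header for the sample rate and format, then normalize the PCM payload. This standalone sketch handles only the simplest case (16-bit PCM with a canonical 44-byte header) and is not the library's real decoder:

```javascript
// Sketch: decode a canonical 16-bit PCM WAV buffer into the
// { samples, sampleRate } pair transcribeSamples() consumes.
// Assumes the simplest 44-byte-header layout; the library's
// decoder is more thorough.
function decodeSimpleWav(arrayBuffer) {
  const view = new DataView(arrayBuffer);
  const sampleRate = view.getUint32(24, true); // little-endian, per RIFF
  const bitsPerSample = view.getUint16(34, true);
  if (bitsPerSample !== 16) throw new Error("sketch supports 16-bit PCM only");
  const dataBytes = view.getUint32(40, true);
  const samples = new Float32Array(dataBytes / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(44 + i * 2, true) / 32768; // normalize to [-1, 1]
  }
  return { samples, sampleRate };
}
```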
Model Files
`loadLocalModel()` expects `config.json` plus model files referenced by model type:
- TDT (`nemo-conformer-tdt`): `nemo128.onnx`, `encoder-model.onnx`, `decoder_joint-model.onnx`, and `vocab.txt` or `tokens.txt`
- RNNT (`nemo-conformer-rnnt`): `encoder-model.onnx`, `decoder_joint-model.onnx`, and `vocab.txt` or `tokens.txt`
- CTC (`nemo-conformer-ctc`): `model.onnx` and `vocab.txt` or `tokens.txt`
- Canary AED (`nemo-conformer-aed`): `encoder-model.onnx`, `decoder-model.onnx`, and `vocab.txt` or `tokens.txt`
- FastConformer (`nemo-conformer`): prefers the RNNT split (`encoder-model.onnx` + `decoder_joint-model.onnx`) and falls back to CTC `model.onnx`, with `vocab.txt` or `tokens.txt`
- GigaAM (`gigaam`): auto-detects `v2_*`/`v3_*` files, prefers the RNNT triplet (`*_rnnt_encoder`/`decoder`/`joint`) and falls back to CTC (`*_ctc.onnx`), with `v2_vocab.txt`/`v3_vocab.txt`
- Tone CTC (`tone-ctc`): `model.onnx` with vocab from `decoder_params.vocabulary` in `config.json` (or `vocab.json`)
- Whisper ORT (`whisper-ort`): `*_beamsearch.onnx` model, plus `vocab.json` (and optionally `added_tokens.json`)
- Whisper HF (`whisper`): `onnx/encoder_model*.onnx`, `onnx/decoder_model_merged*.onnx`, plus `vocab.json` (and optionally `added_tokens.json`)
- Sherpa transducer (no config): `am-onnx/` (or `am/`) with `encoder.onnx`, `decoder.onnx`, `joiner.onnx`, plus `lang/tokens.txt` (or `tokens.txt`)
- VAD (`onnx-community/silero-vad`): `onnx/model*.onnx` (e.g. `onnx/model_int8.onnx`)
When quantization is enabled (`quantization: "int8"`), `*.int8.onnx` is preferred.
For Node Hugging Face downloads, `*.onnx.data` sidecars are also fetched when present.
In browser mode, models are loaded by URL so ONNX Runtime can fetch sidecars automatically.
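The quantization preference above can be sketched as a tiny resolver. `pickModelFile` is a hypothetical helper illustrating the documented rule, not the package's actual internals:

```javascript
// Sketch of the documented preference: with quantization "int8",
// *.int8.onnx wins when present; otherwise plain *.onnx is used.
// Hypothetical helper, not onnx-asr-web's real resolver.
function pickModelFile(files, base, quantization) {
  if (quantization === "int8" && files.includes(`${base}.int8.onnx`)) {
    return `${base}.int8.onnx`;
  }
  return `${base}.onnx`;
}
```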
When `vadModel` is supplied to `loadLocalModel()` / `loadHuggingfaceModel()`, transcription runs on VAD speech chunks and returns a `segments` array in the output.
Word timestamps are currently provided for NeMo transducer models. Whisper returns transcript text and token IDs; `words` is empty.
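Since `result.words` is a plain `[{word, start, end}]` array with times in seconds, rendering timestamps is straightforward. `formatWords` here is an illustrative helper, not part of the API:

```javascript
// Sketch: pretty-print word-level timestamps from result.words
// ([{word, start, end}], times in seconds). Illustrative helper only.
function formatWords(words) {
  return words
    .map(({ word, start, end }) => `[${start.toFixed(2)}-${end.toFixed(2)}] ${word}`)
    .join("\n");
}
```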
Examples
Node.js CLI
```sh
npm run build
node examples/node/transcribe.mjs --repo-id istupakov/parakeet-tdt-0.6b-v3-onnx --cache-dir models --audio test.wav
```

Browser UI
```sh
npx http-server . # then open /examples/browser/index.html
```

Browser UI (CDN package import)
```sh
npx http-server . # then open /examples/browser-cdn/index.html
```

Testing
Run type/syntax checks:
```sh
npm run check
```

Run integration model tests (requires local model folders under `models/`):
```sh
npm test
```

Build and Publish
Create distributable artifacts:
```sh
npm run build
```

This produces:
- `dist/index.js`
- `dist/node.js`
- `dist/browser.js`
Publish to npm:
```sh
npm publish
```

Contributing
See CONTRIBUTING.md.
