kitten-tts-webgpu v0.1.1
# Kitten TTS WebGPU
Pure WebGPU text-to-speech for the browser. 80M params, sub-second on desktop, ~1.2s on iPhone. No ONNX Runtime, no WASM inference — just 29 compute shaders. 753KB gzipped JS + model weights downloaded at runtime.
Live Demo | npm | Model Card
## Quick Start
```bash
npm install kitten-tts-webgpu
```

```js
import { textToSpeech } from 'kitten-tts-webgpu';

const blob = await textToSpeech("The quick brown fox jumps over the lazy dog.");
const audio = new Audio(URL.createObjectURL(blob));
audio.play();
```

One function. Text in, WAV blob out (16-bit PCM, 24 kHz mono). The model downloads on the first call and is cached for subsequent calls. Full TypeScript types included.
Note: This library requires WebGPU. For server-side rendering frameworks (Next.js, Nuxt), dynamically import on the client side only.
## Size & Performance

### What gets downloaded
| Asset | Size | When |
|-------|------|------|
| JS bundle | 753 KB gzipped (2.9 MB raw) | npm install / bundled into your app |
| Model weights | 24–78 MB (see below) | First textToSpeech() call, cached by browser |
The JS bundle includes the WebGPU engine, 29 compute shaders, and a 234K-word phonemizer dictionary. No WASM binaries, no ONNX Runtime.
### Models
Three Kitten TTS v0.8 sizes, same API:
| Model | Params | Weights | M4 Pro (Chrome) | iPhone 17 Pro Max (Safari) |
|-------|--------|---------|-----------------|----------------------------|
| Mini  | 80M | 78 MB | 1.80s (3.3× RT) | ~1.2s |
| Micro | 40M | 41 MB | 1.05s (6.2× RT) | — |
| Nano  | 15M | 24 MB | 0.93s (7.3× RT) | — |
RT = real-time factor (audio duration ÷ generation time); higher is better. Times are for warm generation (model already loaded on the GPU). The first call adds ~2–4 s for the model download, depending on connection speed.
```js
await textToSpeech("Hello world");                     // Default: nano (fastest, 24 MB)
await textToSpeech("Hello world", { model: 'micro' }); // Balanced (41 MB)
await textToSpeech("Hello world", { model: 'mini' });  // Best quality (78 MB)
```

## Options
```js
const blob = await textToSpeech("Welcome to the future.", {
  voice: "Leo",       // 8 voices: Bella, Luna, Rosie, Kiki, Jasper, Bruno, Hugo, Leo
  speed: 1.2,         // 0.5x – 2.0x
  model: "micro",     // mini | micro | nano
  onProgress: (stage) => console.log(stage), // string: "Initializing WebGPU…", "Downloading…", "Generating speech…", etc.
});
```

## Voices
| Female | Male |
|--------|------|
| Bella | Jasper |
| Luna | Bruno |
| Rosie | Hugo |
| Kiki | Leo |
## Error Handling
```js
// Check for WebGPU support
if (!navigator.gpu) {
  console.log("WebGPU not available — use Chrome 113+, Edge 113+, or Safari 26+");
}

// textToSpeech throws on:
// - No WebGPU support
// - Network error (model download fails)
// - Empty text input
try {
  const blob = await textToSpeech("Hello");
} catch (err) {
  console.error("TTS failed:", err.message);
}
```

## Advanced: Direct Engine Access
For repeated generations or fine-grained control:
```js
import { KittenTTSEngine, textToInputIds, float32ToWav } from 'kitten-tts-webgpu';

const engine = new KittenTTSEngine();
await engine.init();
await engine.loadModel(onnxUrl, voicesUrl);

const { ids } = await textToInputIds("Hello world");
const { waveform } = await engine.generate(ids, "Bella", 1.0);
// waveform: Float32Array of 24kHz PCM samples
const wavBlob = float32ToWav(waveform, 24000);
```

## How It Works
29 hand-written WGSL compute shaders execute the full TTS pipeline on the GPU:

```
Text → Phonemes (234K-word dictionary + espeak rules in pure JS)
     → ALBERT encoder (embedding, multi-head attention, FFN)
     → Duration predictor (LSTM + CNN)
     → Acoustic decoder (LSTM + AdaIN + CNN, style-conditioned)
     → HiFi-GAN vocoder (ConvTranspose1d, Snake activations, iSTFT)
     → 24 kHz WAV
```

## Why not ONNX Runtime Web?
Most browser TTS uses ONNX Runtime Web (~2MB WASM binary + C++ runtime). This project takes a different approach:
- Custom ONNX parser — dequantizes int8/uint8/float16 weights in pure TypeScript, no C++ runtime
- 234K-word phonemizer — espeak-ng rules ported to pure JS (WASM espeak hangs on iOS Safari)
- GPU buffer pooling — reuses buffers across HiFi-GAN iterations, ~130MB peak on mobile
- Dynamic architecture — detects model dimensions from weight shapes, one engine for all 3 sizes
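The buffer-pooling idea above can be sketched generically. This is not the library's actual code, just a minimal illustration of the technique: buffers are keyed by rounded-up size and returned to a free list after each iteration instead of being destroyed and reallocated.

```typescript
type Alloc<B> = (size: number) => B;

// Generic buffer pool sketch. With WebGPU the allocator would be something like
// (size) => device.createBuffer({ size, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC }).
export class BufferPool<B> {
  private free = new Map<number, B[]>();
  allocations = 0; // counts real allocations, for illustration only

  constructor(private alloc: Alloc<B>) {}

  private bucket(size: number): number {
    // Round up to a power of two so differently-sized requests can share buffers.
    return 1 << Math.ceil(Math.log2(Math.max(size, 1)));
  }

  acquire(size: number): B {
    const key = this.bucket(size);
    const list = this.free.get(key);
    if (list && list.length > 0) return list.pop()!; // reuse an idle buffer
    this.allocations++;
    return this.alloc(key);
  }

  release(size: number, buf: B): void {
    const key = this.bucket(size);
    const list = this.free.get(key) ?? [];
    list.push(buf);
    this.free.set(key, list);
  }
}
```

Across many vocoder iterations with similar tensor shapes, nearly every `acquire` hits the free list, which keeps peak GPU memory bounded.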
## Browser Support
| Browser | Status |
|---------|--------|
| Chrome 113+ | ✅ |
| Edge 113+ | ✅ |
| Safari 26+ (macOS/iOS) | ✅ |
| Firefox Nightly | Experimental |
## FAQ
**Max input length?** Recommended under ~500 characters per call. For longer text, split it into sentences.
**Languages?** English only (matches the upstream Kitten TTS model).

**Offline?** Yes, after the model is cached in the browser. No server is needed for inference.

**Self-hosting models?** Pass custom URLs to `KittenTTSEngine.loadModel(onnxUrl, voicesUrl)`.
**Bundle size?** 753 KB gzipped (2.9 MB raw), including the engine, 29 compute shaders, and the 234K-word phonemizer dictionary. Model weights (24–78 MB depending on model size) are downloaded separately at runtime on the first call and cached by the browser.

**Model license?** Kitten TTS models are released under Apache 2.0. Code in this repo is MIT.
## Development
```bash
git clone https://github.com/svenflow/kitten-tts-webgpu.git
cd kitten-tts-webgpu
npm install
npm run dev    # Dev server
npm run build  # Production build
npm test       # Phonemizer tests
```

## Credits
- Kitten TTS models by KittenML (Apache 2.0)
- espeak-ng pronunciation dictionary and letter-to-sound rules (GPL-3.0, bundled as data files)
- phonemizer by Xenova (espeak-ng WASM, used as primary backend on Chrome/Firefox; pure JS fallback on Safari)
## License
MIT
