streamtts
v0.1.5
Chatterbox Turbo TTS in the browser — WebGPU/WASM, voice cloning, chunked model delivery.
Ships the Resemble AI Chatterbox Turbo ONNX model (~350M) via a q4/q4f16 quantized pipeline that runs fully client-side through Transformers.js v4. A companion models branch hosts the model split into ≤99MB chunks so it can be served from GitHub Pages / raw.githubusercontent without hitting the 100MB per-file limit.
On WebGPU, streamtts loads spacekaren/chatterbox-turbo-webgpu instead — the same 350M model with all int64 ops replaced by int32, since WebGPU cannot execute int64 Cast operations. On WASM it falls back to the stock ResembleAI repo.
install
```
npm install streamtts @huggingface/transformers
```

use

```js
import { ChatterboxSDK } from 'streamtts'

const tts = new ChatterboxSDK()

// optional — load model from a chunked mirror instead of HF Hub
await tts.configure({
  modelBasePath: 'https://raw.githubusercontent.com/AnEntrypoint/streamtts/models/',
  allowRemoteModels: false,
})

await tts.load() // auto-detect WebGPU, fall back to WASM
await tts.encodeSpeaker('voice-a', float32AudioData) // from a WAV decoded to mono f32
const { waveform } = await tts.generate('hello', 'voice-a', 0.5) // 24kHz mono float32
```

For long text, use `generateChunked(text, speakerId, exaggeration, onProgress)` — it splits on sentence/paragraph boundaries and stitches the outputs with silence. Call `tts.abort()` to interrupt a chunked run in flight.
model chunks
The models branch is rebuilt by build-model.yml whenever the upstream HF commit changes. Every file over 99MB is split into .part0, .part1, …; a chunks.json manifest at the branch root lists each split file with its total size and part count. The SDK's worker installs a fetch interceptor that transparently reassembles .part* ranges before onnx-runtime sees the bytes.
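For readers who want to host their own mirror, the reassembly step can be sketched as follows. The `chunks.json` field names used here (`path`, `parts`) are assumptions — consult the manifest at the branch root for the real shape.

```javascript
// Hypothetical sketch of .part* reassembly, in the spirit of the SDK's
// fetch interceptor. Manifest field names are assumptions, not the
// SDK's documented schema.
function partUrls(baseUrl, entry) {
  // entry: e.g. { path: 'model.onnx', parts: 3 } → model.onnx.part0 … .part2
  const urls = [];
  for (let i = 0; i < entry.parts; i++) {
    urls.push(`${baseUrl}${entry.path}.part${i}`);
  }
  return urls;
}

function reassemble(buffers) {
  // Concatenate the downloaded parts back into one contiguous byte stream
  // before handing it to onnxruntime.
  const total = buffers.reduce((n, b) => n + b.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const b of buffers) {
    out.set(new Uint8Array(b), offset);
    offset += b.byteLength;
  }
  return out;
}
```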
demo
Live demo app is in this repo (React + Vite). Run npm run dev to bring it up locally; the Vercel preview is at https://transformersjs-chatterbox-demo.vercel.app/.
Features
- Zero-shot voice cloning — Record or upload a 5-10 second voice sample, then generate speech in that voice
- Expressiveness control — Adjust the exaggeration slider (0–1.5) to control how expressive the generated speech sounds
- WebGPU acceleration — Automatically detects and uses WebGPU when available, falls back to WASM
- Offline after first load — Model files (~1.5 GB) are cached by the browser after the initial download
- Web Worker inference — All model computation runs off the main thread for a smooth UI
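The WebGPU-vs-WASM decision described above comes down to probing `navigator.gpu`. A minimal sketch — the SDK's actual detection logic may differ:

```javascript
// Sketch of WebGPU detection with WASM fallback. navigator.gpu and
// requestAdapter() are standard WebGPU API; everything else here is
// illustrative, not streamtts's actual code.
async function detectBackend() {
  if (typeof navigator !== 'undefined' && navigator.gpu) {
    try {
      // requestAdapter() resolves to null when no suitable GPU exists.
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch {
      // Probe failed — fall through to WASM.
    }
  }
  return 'wasm';
}
```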
Demo Modes
Playground
Full-featured TTS explorer. Type text, record a reference voice, adjust expressiveness, and generate speech. Displays real-time performance metrics (inference time, audio duration, real-time factor).
Echo — Voice Message Maker
Create personalized voice message cards in three steps:
- Record your voice (or upload a sample)
- Compose your message — pick a themed card (Birthday, Thank You, Holiday, Congrats, Get Well, Love), write your text, and adjust expressiveness
- Preview & Share — listen to the result and download as a WAV file
VoiceCraft — Dialogue Creator
Build multi-character dialogues with different voices:
- Add characters, each with their own voice sample and color
- Write a script with per-line character assignment and expressiveness control
- Generate all lines sequentially, each using the correct speaker embedding
- View the dialogue as a color-coded timeline
- Export the entire conversation as a single WAV file with natural pauses between lines
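WAV export of this kind can be done with a small hand-rolled encoder. Below is a sketch for 16-bit PCM mono at 24kHz — an illustration of the standard RIFF/WAVE layout, not the demo's actual audio-utils code:

```javascript
// Minimal 16-bit PCM WAV encoder for mono Float32 samples (illustrative).
function encodeWav(samples, sampleRate = 24000) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeStr = (off, s) =>
    [...s].forEach((c, i) => view.setUint8(off + i, c.charCodeAt(0)));
  writeStr(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // RIFF chunk size
  writeStr(8, 'WAVE');
  writeStr(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);     // sample rate
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeStr(36, 'data');
  view.setUint32(40, samples.length * 2, true);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] and scale to int16.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}
```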
Narrator — Story Reader
Turn text into narrated audio with automatic dialogue detection:
- Paste any story or pick from built-in samples (The Fox and the Grapes, The Last Robot, Counting Stars)
- Automatic dialogue detection via regex — identifies quoted speech and attributes characters
- Assign different voices to the narrator and each detected character
- Read-along display with paragraph-level highlighting during playback
- Navigate between paragraphs
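Quoted-speech detection of this kind can be approximated with a single regex. The pattern below is illustrative only — the demo's actual regex is not documented here:

```javascript
// Hypothetical dialogue detector: finds double-quoted spans and, when a
// "said/asked/replied <Name>" attribution follows, picks up the speaker.
// Unattributed quotes fall back to the narrator.
const DIALOGUE_RE = /"([^"]+)"(?:\s*,?\s*(?:said|asked|replied)\s+([A-Z]\w+))?/g;

function detectDialogue(paragraph) {
  const lines = [];
  let m;
  while ((m = DIALOGUE_RE.exec(paragraph)) !== null) {
    lines.push({ quote: m[1], speaker: m[2] ?? 'Narrator' });
  }
  return lines;
}
```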
Getting Started
Prerequisites
- Node.js 18+ (20+ recommended)
- npm 9+
- A modern browser with WebGPU support (Chrome 113+, Edge 113+) for best performance. Falls back to WASM on older browsers.
Installation
```
git clone https://github.com/resemble-ai/transformersjs-chatterbox-demo.git
cd transformersjs-chatterbox-demo
npm install
```

Development

```
npm run dev
```

Open http://localhost:5173 in your browser.
Production Build
```
npm run build
npm run preview  # preview the build locally
```

The built files are in dist/ and can be deployed to any static hosting (Vercel, Netlify, GitHub Pages, etc.).
How It Works
Architecture
```
┌────────────────────────────────────────────────┐
│                  Main Thread                   │
│                                                │
│  React App ──► tts-client.js ──► Web Worker    │
│    (UI)         (RPC bridge)     (Chatterbox)  │
│                                                │
│  Zustand Store ◄── events ◄── Worker messages  │
└────────────────────────────────────────────────┘
```

**Web Worker** (`src/workers/tts.worker.js`) — Loads the Chatterbox ONNX model, handles all inference. The model has 4 ONNX sessions: `embed_tokens`, `speech_encoder`, `language_model` (quantized to q4/q4f16), and `conditional_decoder`.

**RPC Client** (`src/lib/tts-client.js`) — Singleton that provides a promise-based API over the worker's `postMessage` interface. Handles progress events, error propagation, and worker lifecycle.

**React Hooks** — `useTTS()` for model loading/generation, `useAudioRecorder()` for microphone recording with 24kHz resampling, `useAudioPlayer()` for playback with time tracking.

**State** — Zustand store with per-mode slices. Audio buffers are stored as `Float32Array` to avoid serialization overhead.
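The promise-over-`postMessage` pattern reduces to a few lines. A minimal sketch in the spirit of tts-client.js — the real client's message shape is an assumption here:

```javascript
// Hypothetical promise-based RPC bridge over a worker's postMessage.
// Each call gets an id; the worker is assumed to echo { id, result }
// or { id, error } back.
let nextId = 0;
const pending = new Map();

function createClient(worker) {
  worker.onmessage = ({ data }) => {
    const entry = pending.get(data.id);
    if (!entry) return;
    pending.delete(data.id);
    data.error ? entry.reject(new Error(data.error)) : entry.resolve(data.result);
  };
  // call('generate', text, speakerId) → Promise resolving with the result.
  return function call(method, ...args) {
    return new Promise((resolve, reject) => {
      const id = nextId++;
      pending.set(id, { resolve, reject });
      worker.postMessage({ id, method, args });
    });
  };
}
```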
Speaker Caching
Voice embeddings are computed once per speaker via model.encode_speech() and cached in the worker's memory. Subsequent generations with the same voice skip the encoding step entirely.
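A sketch of that cache, with `encodeFn` standing in for the worker's call to `model.encode_speech()` (which only exists inside the worker):

```javascript
// Illustrative per-speaker embedding cache: the expensive encode runs
// once per speakerId; later generations reuse the stored embedding.
const speakerCache = new Map();

async function getSpeakerEmbedding(speakerId, audio, encodeFn) {
  if (!speakerCache.has(speakerId)) {
    speakerCache.set(speakerId, await encodeFn(audio));
  }
  return speakerCache.get(speakerId);
}
```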
Model Details
| Session | Size | Quantization |
|---------|------|--------------|
| Embed Tokens | ~61 MB | fp32 |
| Speech Encoder | ~591 MB | fp32 |
| Language Model | ~353 MB | q4 (WASM) / q4f16 (WebGPU) |
| Conditional Decoder | ~533 MB | fp32 |
The model is loaded from onnx-community/chatterbox-ONNX on Hugging Face and cached by the browser after the first download.
Tech Stack
| Technology | Version | Purpose |
|-----------|---------|---------|
| React | 19 | UI framework |
| Vite | 7 | Build tool & dev server |
| Tailwind CSS | 4 | Styling |
| Zustand | 5 | State management |
| React Router | 7 | Client-side routing |
| Framer Motion | 12 | Page transitions & animations |
| Transformers.js | 4.0.0-next.2 | In-browser ML inference |
Project Structure
```
src/
├── main.jsx                 # Entry point
├── App.jsx                  # Router + layout shell
├── index.css                # Tailwind + custom styles
├── workers/
│   └── tts.worker.js        # Chatterbox model inference
├── lib/
│   ├── tts-client.js        # Promise-based RPC to worker
│   ├── audio-recorder.js    # Mic recording + 24kHz resampling
│   ├── audio-utils.js       # WAV encoding, concat, silence
│   ├── audio-player.js      # AudioContext playback engine
│   └── constants.js         # Model ID, sample rate, tags, templates
├── hooks/
│   ├── useTTS.js            # Model load, generate, speaker encode
│   ├── useAudioRecorder.js  # Record / upload voice samples
│   ├── useAudioPlayer.js    # Play / pause / seek
│   └── useModelStatus.js    # Global model readiness
├── store/
│   └── app-store.js         # Zustand store (model + per-mode state)
└── components/
    ├── layout/              # AppShell, Sidebar, ModeHeader
    ├── shared/              # ModelLoader, VoiceRecorder, AudioPlayer,
    │                        # AudioWaveform, ExaggerationSlider, etc.
    ├── home/                # Landing page with mode cards
    ├── playground/          # TTS feature explorer
    ├── echo/                # Voice message card maker
    ├── voicecraft/          # Multi-character dialogue creator
    └── narrator/            # Story reader with read-along
```

Browser Compatibility
| Browser | WebGPU | WASM Fallback |
|---------|--------|---------------|
| Chrome 113+ | Yes | Yes |
| Edge 113+ | Yes | Yes |
| Firefox | No | Yes |
| Safari 18+ | Partial | Yes |
WebGPU provides significantly faster inference. The app auto-detects availability and falls back gracefully.
Known Limitations
- No paralinguistic tag support — The Transformers.js ONNX port of Chatterbox does not currently support emotion/paralinguistic tags (e.g. `[laugh]`, `[sigh]`). Tags in input text will be ignored or read literally. This may be added in a future Transformers.js release.
- First load is large — The model weighs ~1.5 GB and must be downloaded on first visit. Subsequent visits use the browser cache.
- Audio length — Generation uses `max_new_tokens: 256`, which limits output to roughly 5-10 seconds per call. Longer text should be split into chunks.
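Splitting long text at sentence boundaries — as `generateChunked` does internally — can be sketched like this. The actual boundary rules are not published, so the `., !, ?` rule and the 300-character budget below are assumptions:

```javascript
// Illustrative sentence-level chunker for long TTS input. Sentences are
// grouped greedily until a chunk would exceed maxLen characters.
function splitIntoChunks(text, maxLen = 300) {
  // Match runs ending in ., !, or ? (plus trailing whitespace), or a
  // final unterminated fragment.
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [];
  const chunks = [];
  let current = '';
  for (const s of sentences) {
    if (current && (current + s).length > maxLen) {
      chunks.push(current.trim());
      current = '';
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```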
License
MIT
Acknowledgments
- Chatterbox by Resemble AI — the underlying TTS model
- Transformers.js by Hugging Face — browser-based ML inference
- ONNX Runtime Web — the runtime powering WebGPU/WASM execution
