streamtts
v0.1.5
Chatterbox Turbo TTS in the browser — WebGPU/WASM, voice cloning, chunked model delivery.
Ships the Resemble AI Chatterbox Turbo ONNX model (~350M) via a q4/q4f16 quantized pipeline that runs fully client-side through Transformers.js v4. A companion models branch hosts the model split into ≤99MB chunks so it can be served from GitHub Pages / raw.githubusercontent without hitting the 100MB per-file limit.
On WebGPU, streamtts loads spacekaren/chatterbox-turbo-webgpu instead — the same 350M model with all int64 ops replaced by int32, since WebGPU cannot execute int64 Cast operations. On WASM it falls back to the stock ResembleAI repo.
install
```
npm install streamtts @huggingface/transformers
```

use

```js
import { ChatterboxSDK } from 'streamtts'

const tts = new ChatterboxSDK()

// optional — load model from a chunked mirror instead of HF Hub
await tts.configure({
  modelBasePath: 'https://raw.githubusercontent.com/AnEntrypoint/streamtts/models/',
  allowRemoteModels: false,
})

await tts.load() // auto-detect WebGPU, fall back to WASM
await tts.encodeSpeaker('voice-a', float32AudioData) // from a WAV decoded to mono f32
const { waveform } = await tts.generate('hello', 'voice-a', 0.5) // 24kHz mono float32
```

For long text, use `generateChunked(text, speakerId, exaggeration, onProgress)` — it splits on sentence/paragraph boundaries and stitches the outputs with silence. Call `tts.abort()` to interrupt a chunked run in flight.
model chunks
The models branch is rebuilt by build-model.yml whenever the upstream HF commit changes. Every file over 99MB is split into .part0, .part1, …; a chunks.json manifest at the branch root lists each split file with its total size and part count. The SDK's worker installs a fetch interceptor that transparently reassembles .part* ranges before onnx-runtime sees the bytes.
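For readers who want to host their own mirror, the reassembly step can be sketched as follows. The `chunks.json` field names used here (`path`, `parts`) are assumptions — consult the manifest at the branch root for the real shape.

```javascript
// Hypothetical sketch of .part* reassembly, in the spirit of the SDK's
// fetch interceptor. Manifest field names are assumptions, not the
// SDK's documented schema.
function partUrls(baseUrl, entry) {
  // entry: e.g. { path: 'model.onnx', parts: 3 } → model.onnx.part0 … .part2
  const urls = [];
  for (let i = 0; i < entry.parts; i++) {
    urls.push(`${baseUrl}${entry.path}.part${i}`);
  }
  return urls;
}

function reassemble(buffers) {
  // Concatenate the downloaded parts back into one contiguous byte stream
  // before handing it to onnxruntime.
  const total = buffers.reduce((n, b) => n + b.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const b of buffers) {
    out.set(new Uint8Array(b), offset);
    offset += b.byteLength;
  }
  return out;
}
```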
demo
Live demo app is in this repo (React + Vite). Run npm run dev to bring it up locally; the Vercel preview is at https://transformersjs-chatterbox-demo.vercel.app/.
Features
- Zero-shot voice cloning — Record or upload a 5-10 second voice sample, then generate speech in that voice
- Expressiveness control — Adjust the exaggeration slider (0–1.5) to control how expressive the generated speech sounds
- WebGPU acceleration — Automatically detects and uses WebGPU when available, falls back to WASM
- Offline after first load — Model files (~1.5 GB) are cached by the browser after the initial download
- Web Worker inference — All model computation runs off the main thread for a smooth UI
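The WebGPU-vs-WASM decision described above comes down to probing `navigator.gpu`. A minimal sketch — the SDK's actual detection logic may differ:

```javascript
// Sketch of WebGPU detection with WASM fallback. navigator.gpu and
// requestAdapter() are standard WebGPU API; everything else here is
// illustrative, not streamtts's actual code.
async function detectBackend() {
  if (typeof navigator !== 'undefined' && navigator.gpu) {
    try {
      // requestAdapter() resolves to null when no suitable GPU exists.
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch {
      // Probe failed — fall through to WASM.
    }
  }
  return 'wasm';
}
```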
Demo Modes
Playground
Full-featured TTS explorer. Type text, record a reference voice, adjust expressiveness, and generate speech. Displays real-time performance metrics (inference time, audio duration, real-time factor).
Echo — Voice Message Maker
Create personalized voice message cards in three steps:
- Record your voice (or upload a sample)
- Compose your message — pick a themed card (Birthday, Thank You, Holiday, Congrats, Get Well, Love), write your text, and adjust expressiveness
- Preview & Share — listen to the result and download as a WAV file
VoiceCraft — Dialogue Creator
Build multi-character dialogues with different voices:
- Add characters, each with their own voice sample and color
- Write a script with per-line character assignment and expressiveness control
- Generate all lines sequentially, each using the correct speaker embedding
- View the dialogue as a color-coded timeline
- Export the entire conversation as a single WAV file with natural pauses between lines
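WAV export of this kind can be done with a small hand-rolled encoder. Below is a sketch for 16-bit PCM mono at 24kHz — an illustration of the standard RIFF/WAVE layout, not the demo's actual audio-utils code:

```javascript
// Minimal 16-bit PCM WAV encoder for mono Float32 samples (illustrative).
function encodeWav(samples, sampleRate = 24000) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeStr = (off, s) =>
    [...s].forEach((c, i) => view.setUint8(off + i, c.charCodeAt(0)));
  writeStr(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // RIFF chunk size
  writeStr(8, 'WAVE');
  writeStr(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);     // sample rate
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeStr(36, 'data');
  view.setUint32(40, samples.length * 2, true);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] and scale to int16.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}
```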
Narrator — Story Reader
Turn text into narrated audio with automatic dialogue detection:
- Paste any story or pick from built-in samples (The Fox and the Grapes, The Last Robot, Counting Stars)
- Automatic dialogue detection via regex — identifies quoted speech and attributes characters
- Assign different voices to the narrator and each detected character
- Read-along display with paragraph-level highlighting during playback
- Navigate between paragraphs
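Quoted-speech detection of this kind can be approximated with a single regex. The pattern below is illustrative only — the demo's actual regex is not documented here:

```javascript
// Hypothetical dialogue detector: finds double-quoted spans and, when a
// "said/asked/replied <Name>" attribution follows, picks up the speaker.
// Unattributed quotes fall back to the narrator.
const DIALOGUE_RE = /"([^"]+)"(?:\s*,?\s*(?:said|asked|replied)\s+([A-Z]\w+))?/g;

function detectDialogue(paragraph) {
  const lines = [];
  let m;
  while ((m = DIALOGUE_RE.exec(paragraph)) !== null) {
    lines.push({ quote: m[1], speaker: m[2] ?? 'Narrator' });
  }
  return lines;
}
```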
Getting Started
Prerequisites
- Node.js 18+ (20+ recommended)
- npm 9+
- A modern browser with WebGPU support (Chrome 113+, Edge 113+) for best performance. Falls back to WASM on older browsers.
Installation
```
git clone https://github.com/resemble-ai/transformersjs-chatterbox-demo.git
cd transformersjs-chatterbox-demo
npm install
```

Development

```
npm run dev
```

Open http://localhost:5173 in your browser.
Production Build
```
npm run build
npm run preview  # preview the build locally
```

The built files are in dist/ and can be deployed to any static hosting (Vercel, Netlify, GitHub Pages, etc.).
How It Works
Architecture
```
┌────────────────────────────────────────────────┐
│                  Main Thread                   │
│                                                │
│  React App ──► tts-client.js ──► Web Worker    │
│    (UI)         (RPC bridge)     (Chatterbox)  │
│                                                │
│  Zustand Store ◄── events ◄── Worker messages  │
└────────────────────────────────────────────────┘
```

**Web Worker** (`src/workers/tts.worker.js`) — Loads the Chatterbox ONNX model, handles all inference. The model has 4 ONNX sessions: `embed_tokens`, `speech_encoder`, `language_model` (quantized to q4/q4f16), and `conditional_decoder`.

**RPC Client** (`src/lib/tts-client.js`) — Singleton that provides a promise-based API over the worker's `postMessage` interface. Handles progress events, error propagation, and worker lifecycle.

**React Hooks** — `useTTS()` for model loading/generation, `useAudioRecorder()` for microphone recording with 24kHz resampling, `useAudioPlayer()` for playback with time tracking.

**State** — Zustand store with per-mode slices. Audio buffers are stored as `Float32Array` to avoid serialization overhead.
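The promise-over-`postMessage` pattern reduces to a few lines. A minimal sketch in the spirit of tts-client.js — the real client's message shape is an assumption here:

```javascript
// Hypothetical promise-based RPC bridge over a worker's postMessage.
// Each call gets an id; the worker is assumed to echo { id, result }
// or { id, error } back.
let nextId = 0;
const pending = new Map();

function createClient(worker) {
  worker.onmessage = ({ data }) => {
    const entry = pending.get(data.id);
    if (!entry) return;
    pending.delete(data.id);
    data.error ? entry.reject(new Error(data.error)) : entry.resolve(data.result);
  };
  // call('generate', text, speakerId) → Promise resolving with the result.
  return function call(method, ...args) {
    return new Promise((resolve, reject) => {
      const id = nextId++;
      pending.set(id, { resolve, reject });
      worker.postMessage({ id, method, args });
    });
  };
}
```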
Speaker Caching
Voice embeddings are computed once per speaker via model.encode_speech() and cached in the worker's memory. Subsequent generations with the same voice skip the encoding step entirely.
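A sketch of that cache, with `encodeFn` standing in for the worker's call to `model.encode_speech()` (which only exists inside the worker):

```javascript
// Illustrative per-speaker embedding cache: the expensive encode runs
// once per speakerId; later generations reuse the stored embedding.
const speakerCache = new Map();

async function getSpeakerEmbedding(speakerId, audio, encodeFn) {
  if (!speakerCache.has(speakerId)) {
    speakerCache.set(speakerId, await encodeFn(audio));
  }
  return speakerCache.get(speakerId);
}
```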
Model Details
| Session | Size | Quantization |
|---------|------|--------------|
| Embed Tokens | ~61 MB | fp32 |
| Speech Encoder | ~591 MB | fp32 |
| Language Model | ~353 MB | q4 (WASM) / q4f16 (WebGPU) |
| Conditional Decoder | ~533 MB | fp32 |
The model is loaded from onnx-community/chatterbox-ONNX on Hugging Face and cached by the browser after the first download.
Tech Stack
| Technology | Version | Purpose |
|-----------|---------|---------|
| React | 19 | UI framework |
| Vite | 7 | Build tool & dev server |
| Tailwind CSS | 4 | Styling |
| Zustand | 5 | State management |
| React Router | 7 | Client-side routing |
| Framer Motion | 12 | Page transitions & animations |
| Transformers.js | 4.0.0-next.2 | In-browser ML inference |
Project Structure
```
src/
├── main.jsx                 # Entry point
├── App.jsx                  # Router + layout shell
├── index.css                # Tailwind + custom styles
├── workers/
│   └── tts.worker.js        # Chatterbox model inference
├── lib/
│   ├── tts-client.js        # Promise-based RPC to worker
│   ├── audio-recorder.js    # Mic recording + 24kHz resampling
│   ├── audio-utils.js       # WAV encoding, concat, silence
│   ├── audio-player.js      # AudioContext playback engine
│   └── constants.js         # Model ID, sample rate, tags, templates
├── hooks/
│   ├── useTTS.js            # Model load, generate, speaker encode
│   ├── useAudioRecorder.js  # Record / upload voice samples
│   ├── useAudioPlayer.js    # Play / pause / seek
│   └── useModelStatus.js    # Global model readiness
├── store/
│   └── app-store.js         # Zustand store (model + per-mode state)
└── components/
    ├── layout/              # AppShell, Sidebar, ModeHeader
    ├── shared/              # ModelLoader, VoiceRecorder, AudioPlayer,
    │                        # AudioWaveform, ExaggerationSlider, etc.
    ├── home/                # Landing page with mode cards
    ├── playground/          # TTS feature explorer
    ├── echo/                # Voice message card maker
    ├── voicecraft/          # Multi-character dialogue creator
    └── narrator/            # Story reader with read-along
```

Browser Compatibility
| Browser | WebGPU | WASM Fallback |
|---------|--------|---------------|
| Chrome 113+ | Yes | Yes |
| Edge 113+ | Yes | Yes |
| Firefox | No | Yes |
| Safari 18+ | Partial | Yes |
WebGPU provides significantly faster inference. The app auto-detects availability and falls back gracefully.
Known Limitations
- No paralinguistic tag support — The Transformers.js ONNX port of Chatterbox does not currently support emotion/paralinguistic tags (e.g. `[laugh]`, `[sigh]`). Tags in input text will be ignored or read literally. This may be added in a future Transformers.js release.
- First load is large — The model weighs ~1.5 GB and must be downloaded on first visit. Subsequent visits use the browser cache.
- Audio length — Generation uses `max_new_tokens: 256`, which limits output to roughly 5-10 seconds per call. Longer text should be split into chunks.
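Splitting long text at sentence boundaries — as `generateChunked` does internally — can be sketched like this. The actual boundary rules are not published, so the `., !, ?` rule and the 300-character budget below are assumptions:

```javascript
// Illustrative sentence-level chunker for long TTS input. Sentences are
// grouped greedily until a chunk would exceed maxLen characters.
function splitIntoChunks(text, maxLen = 300) {
  // Match runs ending in ., !, or ? (plus trailing whitespace), or a
  // final unterminated fragment.
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [];
  const chunks = [];
  let current = '';
  for (const s of sentences) {
    if (current && (current + s).length > maxLen) {
      chunks.push(current.trim());
      current = '';
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```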
License
MIT
Acknowledgments
- Chatterbox by Resemble AI — the underlying TTS model
- Transformers.js by Hugging Face — browser-based ML inference
- ONNX Runtime Web — the runtime powering WebGPU/WASM execution
