pocket-tts-js
v0.1.0
Tiny browser API for the Pocket TTS ONNX model — streaming neural text-to-speech with voice cloning, running entirely client-side in a Web Worker so it never blocks your UI.
- 🪶 Tiny package: pure-JS SentencePiece tokenizer (no 4 MB WASM build); `onnxruntime-web` is loaded from a CDN, not bundled.
- 🌍 Per-language bundles: load only the language you need.
- ⚡ Quantized or full precision: INT8 by default; only the variant you choose is downloaded.
- 📥 Downloads only what's used: the voice encoder is fetched only when cloning is enabled; built-in voices only when requested.
- 🧵 Off the main thread: all inference runs in a Web Worker; audio streams out chunk by chunk.
Models are streamed at runtime from vlapky/pocket-tts-onnx on Hugging Face.
▶ Live demo — runs in your browser; the source is in example/.
Install
```sh
npm install pocket-tts-js
```

You also need a bundler that supports the `new Worker(new URL('./worker.js', import.meta.url))` pattern (Vite, webpack 5, Rollup, and Parcel 2 all do).
Quick start
```js
import { PocketTTS, StreamingPlayer } from "pocket-tts-js";

const tts = new PocketTTS({
  language: "english_2026-04", // see PocketTTS.LANGUAGES
  quantized: true,             // INT8 (smaller/faster); set false for full precision
  voiceCloning: true,          // download the encoder so cloneVoice() works
});

await tts.load((p) => {
  if (p.total) console.log(`${p.label}: ${(p.loaded / p.total * 100) | 0}%`);
});

const player = new StreamingPlayer({ sampleRate: tts.sampleRate });
await player.resume(); // call from a user gesture

// Pick a built-in voice…
const voice = await tts.loadVoice("alba");

// …and stream speech. Pass `meta` so playback stays gapless.
await tts.generate("Hello from your browser!", {
  voice,
  onChunk: (audio /* Float32Array @ tts.sampleRate */, meta) => player.play(audio, meta),
});

player.flush(); // release any audio still held by the jitter buffer
```

See Examples for built-in voices, cloning, and cache management.
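The `meta` argument and the jitter buffer are internal to `StreamingPlayer`, and their exact behaviour is not described here. As a rough illustration of what gapless playback typically involves with the Web Audio API, the sketch below starts each chunk exactly where the previous one ends. The names `createScheduler` and `leadTime` are hypothetical, and this is not the library's implementation.

```js
// Hypothetical sketch of gapless chunk scheduling; not StreamingPlayer's actual code.
// Each chunk starts where the previous one ends, or slightly in the future
// if the queue has run dry (an underrun).
function createScheduler(sampleRate, leadTime = 0.05) {
  let nextStart = 0; // context time (seconds) at which the next chunk should begin

  // Pure scheduling step: returns the start time for a chunk of `length` samples.
  function schedule(length, now) {
    const start = Math.max(nextStart, now + leadTime);
    nextStart = start + length / sampleRate;
    return start;
  }

  // Web Audio wrapper: plays a Float32Array chunk on the given AudioContext.
  function play(ctx, chunk) {
    const buffer = ctx.createBuffer(1, chunk.length, sampleRate);
    buffer.copyToChannel(chunk, 0);
    const source = ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(ctx.destination);
    source.start(schedule(chunk.length, ctx.currentTime));
  }

  return { schedule, play };
}
```

Keeping the timing logic in a pure `schedule` function makes it easy to reason about separately from the audio graph: as long as chunks arrive before `nextStart`, consecutive buffers are placed back to back with no gap.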
Examples
English + built-in voice "alba"
Built-in voices come from a voices.bin and need no encoder, so you can skip the
~21 MB cloning model.
```js
import { PocketTTS, StreamingPlayer } from "pocket-tts-js";

const tts = new PocketTTS({
  language: "english_2026-04",
  voiceCloning: false, // built-in voices don't need the encoder
});
await tts.load((p) => p.total && console.log(`${p.label} ${(p.loaded / p.total * 100) | 0}%`));

const player = new StreamingPlayer({ sampleRate: tts.sampleRate });
await player.resume(); // must run inside a user gesture (click/tap)

const alba = await tts.loadVoice("alba"); // any name from tts.predefinedVoices
const metrics = await tts.generate("Hi, I'm Alba, speaking right inside your browser.", {
  voice: alba,
  onChunk: (audio, meta) => player.play(audio, meta),
});
player.flush();
console.log(`done: RTFx ${metrics.rtfx.toFixed(2)}x`);
```

English + voice cloning
Clone a voice from any mono reference clip (file upload, fetch, microphone…).
```js
import { PocketTTS, StreamingPlayer } from "pocket-tts-js";

const tts = new PocketTTS({
  language: "english_2026-04",
  voiceCloning: true, // default; downloads the encoder
});
await tts.load();

const player = new StreamingPlayer({ sampleRate: tts.sampleRate });
await player.resume();

// Decode the reference clip to a mono Float32Array (any sample rate).
// `referenceFile` is a File or Blob, e.g. from an <input type="file">.
const fileBuffer = await referenceFile.arrayBuffer();
const audioCtx = new AudioContext();
const decoded = await audioCtx.decodeAudioData(fileBuffer);
const mono = decoded.getChannelData(0); // first channel only; for stereo input this is the left channel
const myVoice = await tts.cloneVoice(mono, { inputSampleRate: decoded.sampleRate });
await audioCtx.close();

await tts.generate("This sentence is spoken in the cloned voice.", {
  voice: myVoice,
  onChunk: (audio, meta) => player.play(audio, meta),
});
player.flush();
```

Clearing the cached model
Free the persisted models/voices from disk and force a fresh download next time.
```js
import { PocketTTS } from "pocket-tts-js";

// Tear down a running instance first (worker + in-memory ONNX sessions)…
tts.destroy();

// …then delete the on-disk Cache Storage bucket.
await PocketTTS.clearCache();

// Optional: inspect what the browser still stores for this origin.
const est = await PocketTTS.storageEstimate(); // { usage, quota } in bytes, or null
if (est) console.log(`using ${(est.usage / 1e6).toFixed(0)} MB of ${(est.quota / 1e6).toFixed(0)} MB`);
```

Cross-origin isolation (recommended)
`onnxruntime-web` runs multi-threaded when the page is cross-origin isolated. Serve your app with:

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```

Without these headers the library still works, but single-threaded (slower). Hugging Face and jsDelivr both send the CORS/CORP headers required under `require-corp`.
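Whether the headers took effect can be checked at runtime through the standard `crossOriginIsolated` global. The sketch below logs the isolation state and a plausible thread count; `pickThreadCount` is a hypothetical helper, and the selection rule is an assumption rather than the library's documented behaviour (only the `maxThreads` option is described in this README).

```js
// Hypothetical helper: estimate how many WASM threads could be used.
// `crossOriginIsolated` and `navigator.hardwareConcurrency` are standard browser globals;
// the selection rule itself is an assumption, not the library's documented behaviour.
function pickThreadCount(isolated, cores, maxThreads = 8) {
  if (!isolated) return 1; // without isolation, fall back to a single thread
  const available = Number.isFinite(cores) && cores > 0 ? Math.floor(cores) : 1;
  return Math.max(1, Math.min(available, maxThreads));
}

const isolated = globalThis.crossOriginIsolated === true;
const cores = globalThis.navigator?.hardwareConcurrency;
console.log(`cross-origin isolated: ${isolated}, threads: ${pickThreadCount(isolated, cores)}`);
```

If the log reports `false` in production, the two response headers are most likely missing or being stripped by a proxy or CDN in front of the app.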
API
new PocketTTS(options)
| option | default | description |
| -------------- | ---------------------- | ----------- |
| language | "english_2026-04" | Language bundle (PocketTTS.LANGUAGES lists all). |
| quantized | true | INT8 models vs full precision. |
| voiceCloning | true | Download the encoder so cloneVoice() works (~21 MB). |
| modelBaseUrl | HF …/onnx | Base URL of the onnx/ folder. |
| ortBaseUrl | jsDelivr ORT 1.20.0 | Base URL for onnxruntime-web dist files. |
| voicesUrl | null | Explicit URL to a voices.bin (see Built-in voices). |
| maxThreads | 8 | Max WASM threads when cross-origin isolated. |
Methods
- `load(onProgress?) → Promise<BundleInfo>`: download the runtime and the selected models, then initialise.
- `cloneVoice(audio: Float32Array, { inputSampleRate?, name? }) → Promise<string>`: encode a reference clip into a voice reference.
- `loadVoice(name: string) → Promise<string>`: prepare a built-in voice (requires `voices.bin`).
- `generate(text, { voice, onChunk }) → Promise<{ rtfx, genTime, audioDuration }>`: stream synthesis.
- `stop() → Promise<void>`: stop the current generation early.
- `destroy()`: terminate the worker and free resources.
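Because `generate()` delivers audio only through `onChunk`, getting a whole utterance as one buffer means collecting the chunks yourself. The following sketch takes the `tts` instance as a parameter and relies only on the `generate(text, { voice, onChunk })` signature listed above; `synthesizeToBuffer` is a hypothetical name, not part of the package.

```js
// Hypothetical helper: run one generation and return all audio as a single Float32Array.
// Relies only on generate(text, { voice, onChunk }) as listed above.
async function synthesizeToBuffer(tts, text, voice) {
  const chunks = [];
  const metrics = await tts.generate(text, {
    voice,
    // Copy each chunk in case the underlying buffer is reused (an assumption).
    onChunk: (audio) => chunks.push(Float32Array.from(audio)),
  });

  // Concatenate the chunks into one contiguous buffer.
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const samples = new Float32Array(total);
  let offset = 0;
  for (const c of chunks) {
    samples.set(c, offset);
    offset += c.length;
  }
  return { samples, metrics };
}
```

Assuming `chunksToWavBlob` accepts an array of `Float32Array` chunks, the result could then be turned into a file with `chunksToWavBlob([samples], tts.sampleRate)`.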
Helpers
- `StreamingPlayer`: gapless playback of streamed chunks (`play`, `reset`, `stop`, `resume`, `analyser`).
- `chunksToWavBlob(chunks, sampleRate)`: assemble collected chunks into a downloadable WAV.
- `resampleLinear(data, fromRate, toRate)`: simple linear resampler.
- `SentencePieceTokenizer`: the standalone pure-JS Unigram tokenizer.
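To make the idea behind a linear resampler concrete, here is a minimal sketch of the general technique. It is not necessarily how the library's `resampleLinear()` is implemented; `resampleLinearSketch` is a hypothetical stand-in with the same argument order.

```js
// Minimal linear-interpolation resampler; a sketch of the general technique,
// not necessarily how the library's resampleLinear() is implemented.
function resampleLinearSketch(data, fromRate, toRate) {
  if (fromRate === toRate || data.length === 0) return Float32Array.from(data);
  const outLength = Math.max(1, Math.round(data.length * toRate / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.min(Math.floor(pos), data.length - 1);
    const i1 = Math.min(i0 + 1, data.length - 1);
    const frac = pos - i0;
    // Blend the two neighbouring input samples by their distance to `pos`.
    out[i] = data[i0] * (1 - frac) + data[i1] * frac;
  }
  return out;
}
```

Linear interpolation applies no low-pass filter, so downsampling by a large factor can introduce aliasing. For short reference clips that is often an acceptable trade-off for simplicity.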
What gets downloaded
For the chosen language + quantized setting only:
| file | needed for | INT8 size |
| ---- | ---------- | --------- |
| bundle.json, tokenizer.model | always | ~80 KB |
| flow_lm_main, flow_lm_flow, mimi_decoder | always | ~109 MB |
| text_conditioner | always | ~16 MB |
| mimi_encoder | only if voiceCloning | ~21 MB |
| bos_before_voice.npy | only if voiceCloning | <1 KB |
| voices.bin | only when a built-in voice is requested | varies |
Full-precision (quantized: false) variants are larger.
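As a worked example of the table, the sketch below adds up the approximate INT8 sizes for a given configuration. The figures come straight from the table; `voices.bin` is left out because its size varies, and `bos_before_voice.npy` is ignored because it is under 1 KB. The names `INT8_SIZES_MB` and `estimateDownloadMB` are hypothetical.

```js
// Approximate INT8 download sizes in MB, taken from the table above.
// voices.bin is omitted because its size varies per language.
const INT8_SIZES_MB = {
  metadata: 0.08,      // bundle.json + tokenizer.model (~80 KB)
  coreModels: 109,     // flow_lm_main, flow_lm_flow, mimi_decoder
  textConditioner: 16, // text_conditioner
  encoder: 21,         // mimi_encoder, only with voiceCloning
};

function estimateDownloadMB({ voiceCloning = true } = {}) {
  const base =
    INT8_SIZES_MB.metadata + INT8_SIZES_MB.coreModels + INT8_SIZES_MB.textConditioner;
  return voiceCloning ? base + INT8_SIZES_MB.encoder : base;
}

console.log(estimateDownloadMB({ voiceCloning: false }).toFixed(0)); // "125"
console.log(estimateDownloadMB({ voiceCloning: true }).toFixed(0));  // "146"
```

In other words, disabling `voiceCloning` saves roughly 21 MB on a first load of about 146 MB.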
Caching (no re-download every load)
Downloaded assets (models, tokenizer, voices.bin) are persisted in the browser's
Cache Storage by default, so after the first visit later loads read from disk —
no network, works offline. Only ONNX session compilation runs (a few seconds).
```js
new PocketTTS({ cache: true });            // default
new PocketTTS({ cache: false });           // always fetch from network
new PocketTTS({ cacheName: "my-bucket" }); // custom Cache Storage bucket
await PocketTTS.clearCache();              // free the space / force fresh download
await PocketTTS.storageEstimate();         // { usage, quota } in bytes (or null)
```

Progress callbacks include `fromCache: true` when a file is served from the cache.
Caching needs a secure context (https:// or localhost); it silently falls back
to plain network fetches if Cache Storage is unavailable or the quota is exceeded.
Bump cacheName (or call clearCache()) when you publish new model weights.
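The behaviour described above follows the usual cache-first pattern on top of the standard Cache Storage API. The sketch below illustrates that general pattern, including the silent fallback to the network; it is not the library's internal code, and `cachedFetch` with its `storage` and `fetchFn` parameters is a hypothetical helper.

```js
// Hypothetical cache-first fetch built on the standard Cache Storage API.
// This illustrates the general pattern only; the library's internals may differ.
async function cachedFetch(
  url,
  cacheName,
  { storage = globalThis.caches, fetchFn = globalThis.fetch } = {}
) {
  let cache = null;
  try {
    if (storage) cache = await storage.open(cacheName);
  } catch {
    cache = null; // Cache Storage unavailable: fall back to the network
  }

  if (cache) {
    const hit = await cache.match(url);
    if (hit) return { response: hit, fromCache: true };
  }

  const response = await fetchFn(url);
  if (cache && response.ok) {
    try {
      await cache.put(url, response.clone()); // store a copy, return the original
    } catch {
      // Quota exceeded or similar: keep going without caching.
    }
  }
  return { response, fromCache: false };
}
```

Because entries are keyed by bucket name and URL, switching to a new `cacheName` effectively starts from an empty cache, which is why bumping it forces a fresh download of new weights.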
Built-in voices
Each language bundle ships a voices.bin with several ready-made speakers. List
them via tts.predefinedVoices and prepare one with tts.loadVoice(name). The
voices.bin for a language is downloaded lazily, only the first time you request
a built-in voice.
```js
console.log(tts.predefinedVoices); // e.g. ["alba", "azelma", "cosette", …]
const voice = await tts.loadVoice("alba");
```

To serve voices from a different location, point the library at your own file:

```js
new PocketTTS({ voicesUrl: "https://example.com/english_2026-04/voices.bin" });
```

Voice cloning is also available and needs no `voices.bin` at all.
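When the voice name comes from user input or configuration, it can help to validate it against `tts.predefinedVoices` before calling `loadVoice()`. The sketch below assumes `predefinedVoices` is an array of names that is populated once `load()` has finished; `loadBuiltInVoice` is a hypothetical wrapper, not part of the package.

```js
// Hypothetical helper: load a built-in voice by name with a readable error.
// Uses only tts.predefinedVoices and tts.loadVoice(name) as described above.
async function loadBuiltInVoice(tts, name) {
  const available = tts.predefinedVoices ?? [];
  if (!available.includes(name)) {
    throw new Error(`Unknown voice "${name}". Available: ${available.join(", ") || "none"}`);
  }
  return tts.loadVoice(name);
}
```

Checking the name first avoids triggering the lazy `voices.bin` download for a voice that does not exist in the selected language bundle.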
License
The library code is licensed under the MIT License.
The Pocket TTS model weights and the bundled built-in voice assets that this library downloads at runtime are © Kyutai and licensed under CC-BY-4.0 — see kyutai/pocket-tts. Your use of those assets is subject to the CC-BY-4.0 terms (including attribution), independently of this package's MIT license.
