pocket-tts-js

v0.1.0

Published

3 days ago

Tiny browser API for the Pocket TTS ONNX model: streaming neural text-to-speech with voice cloning, running entirely client-side in a Web Worker.

vlapky

tts text-to-speech pocket-tts onnx onnxruntime-web voice-cloning speech-synthesis browser webworker wasm

Tiny browser API for the Pocket TTS ONNX model — streaming neural text-to-speech with voice cloning, running entirely client-side in a Web Worker so it never blocks your UI.

🪶 Tiny package — pure-JS SentencePiece tokenizer (no 4 MB WASM build); onnxruntime-web is loaded from a CDN, not bundled.
🌍 Per-language bundles — load only the language you need.
⚡ Quantized or full precision — INT8 by default; only the variant you choose is downloaded.
📥 Downloads only what's used — the voice encoder is fetched only when cloning is enabled; built-in voices only when requested.
🧵 Off the main thread — all inference runs in a Web Worker; audio streams out chunk by chunk.

Models are streamed at runtime from vlapky/pocket-tts-onnx on Hugging Face.

▶ Live demo — runs in your browser; the source is in example/.

Install

npm install pocket-tts-js

You also need a bundler that supports the `new Worker(new URL('./worker.js', import.meta.url))` pattern. Vite, webpack 5, Rollup, and Parcel 2 all do.

Quick start

import { PocketTTS, StreamingPlayer } from "pocket-tts-js";

const tts = new PocketTTS({
  language: "english_2026-04", // see PocketTTS.LANGUAGES
  quantized: true,             // INT8 (smaller/faster) — set false for full precision
  voiceCloning: true,          // download the encoder so cloneVoice() works
});

await tts.load((p) => {
  if (p.total) console.log(`${p.label}: ${(p.loaded / p.total * 100) | 0}%`);
});

const player = new StreamingPlayer({ sampleRate: tts.sampleRate });
await player.resume(); // call from a user gesture

// Pick a built-in voice…
const voice = await tts.loadVoice("alba");

// …and stream speech. Pass `meta` so playback stays gapless.
await tts.generate("Hello from your browser!", {
  voice,
  onChunk: (audio /* Float32Array @ tts.sampleRate */, meta) => player.play(audio, meta),
});
player.flush(); // release any audio still held by the jitter buffer

See Examples for built-in voices, cloning, and cache management.

Examples

English + built-in voice "alba"

Built-in voices come from a voices.bin and need no encoder, so you can skip the ~21 MB cloning model.

import { PocketTTS, StreamingPlayer } from "pocket-tts-js";

const tts = new PocketTTS({
  language: "english_2026-04",
  voiceCloning: false, // built-in voices don't need the encoder
});
await tts.load((p) => p.total && console.log(`${p.label} ${(p.loaded / p.total * 100) | 0}%`));

const player = new StreamingPlayer({ sampleRate: tts.sampleRate });
await player.resume(); // must run inside a user gesture (click/tap)

const alba = await tts.loadVoice("alba"); // any name from tts.predefinedVoices

const metrics = await tts.generate("Hi, I'm Alba, speaking right inside your browser.", {
  voice: alba,
  onChunk: (audio, meta) => player.play(audio, meta),
});
player.flush();
console.log(`done — RTFx ${metrics.rtfx.toFixed(2)}x`);

English + voice cloning

Clone a voice from any mono reference clip (file upload, fetch, microphone…).

import { PocketTTS, StreamingPlayer } from "pocket-tts-js";

const tts = new PocketTTS({
  language: "english_2026-04",
  voiceCloning: true, // default — downloads the encoder
});
await tts.load();

const player = new StreamingPlayer({ sampleRate: tts.sampleRate });
await player.resume();

// Decode the reference clip to a mono Float32Array (any sample rate)
const fileBuffer = await referenceFile.arrayBuffer();
const audioCtx = new AudioContext();
const decoded = await audioCtx.decodeAudioData(fileBuffer);
const mono = decoded.getChannelData(0);

const myVoice = await tts.cloneVoice(mono, { inputSampleRate: decoded.sampleRate });
await audioCtx.close();

await tts.generate("This sentence is spoken in the cloned voice.", {
  voice: myVoice,
  onChunk: (audio, meta) => player.play(audio, meta),
});
player.flush();
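
The example above keeps only the first channel of the decoded clip. For a stereo recording, averaging the channels may retain more of the signal. The following is a minimal sketch; `downmixToMono` is a hypothetical helper and not part of pocket-tts-js.

```javascript
// Hypothetical helper (not part of pocket-tts-js): average N channels into one.
// `channels` is an array of equal-length Float32Arrays.
function downmixToMono(channels) {
  if (channels.length === 0) throw new Error("no channels to downmix");
  if (channels.length === 1) return channels[0];
  const length = channels[0].length;
  const mono = new Float32Array(length);
  // Sum every channel sample by sample…
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i];
  }
  // …then divide by the channel count to get the average.
  for (let i = 0; i < length; i++) mono[i] /= channels.length;
  return mono;
}

// Browser usage with the `decoded` AudioBuffer from the example above:
//   const channels = Array.from(
//     { length: decoded.numberOfChannels },
//     (_, c) => decoded.getChannelData(c),
//   );
//   const mono = downmixToMono(channels);
```

The result can then be passed to `cloneVoice` in place of `decoded.getChannelData(0)`.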

Clearing the cached model

Free the persisted models/voices from disk and force a fresh download next time.

import { PocketTTS } from "pocket-tts-js";

// `tts` is an existing PocketTTS instance. Tear it down first (worker + in-memory ONNX sessions)…
tts.destroy();

// …then delete the on-disk Cache Storage bucket.
await PocketTTS.clearCache();

// Optional: inspect what the browser still stores for this origin.
const est = await PocketTTS.storageEstimate(); // { usage, quota } in bytes, or null
if (est) console.log(`using ${(est.usage / 1e6).toFixed(0)} MB of ${(est.quota / 1e6).toFixed(0)} MB`);
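
Before triggering a large download, the same estimate can be used to check whether the origin still has room. This is a sketch only: `hasRoomFor` is a hypothetical helper, and the byte count you pass in is your own estimate (for example, derived from the size table under "What gets downloaded").

```javascript
// Hypothetical helper: does a download of `bytesNeeded` likely fit?
// `est` has the { usage, quota } shape shown above, or is null.
// Returns true/false, or null when the browser reported no estimate.
function hasRoomFor(est, bytesNeeded) {
  if (!est || typeof est.usage !== "number" || typeof est.quota !== "number") {
    return null; // unknown: nothing to compare against
  }
  return est.quota - est.usage >= bytesNeeded;
}

// Usage (browser):
//   const est = await PocketTTS.storageEstimate();
//   if (hasRoomFor(est, 150e6) === false) console.warn("storage quota looks tight");
```

Quota figures reported by browsers are estimates, so treat a `true` result as a hint rather than a guarantee.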

Cross-origin isolation (recommended)

onnxruntime-web can run multi-threaded only when the page is cross-origin isolated, because WASM threads need SharedArrayBuffer. Serve your app with:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Without these it still works, just single-threaded (slower). Hugging Face and jsDelivr both send the CORS/CORP headers required under require-corp.
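
If you serve the app from your own Node.js server, the two headers can be attached to every response. The sketch below assumes a response object with a `setHeader(name, value)` method, as on Node's `http.ServerResponse`; the helper names are hypothetical.

```javascript
// The two headers listed above, as a plain object.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Apply them to a Node.js http.ServerResponse (or anything with setHeader).
function applyIsolationHeaders(res) {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}

// In the page, the standard `crossOriginIsolated` global reports whether
// isolation took effect. It is undefined outside browsers, so this returns false there.
function isCrossOriginIsolated() {
  return globalThis.crossOriginIsolated === true;
}

// Usage (Node.js):
//   http.createServer((req, res) => { applyIsolationHeaders(res); /* serve files */ });
```

With Vite's dev server, the same object can be passed as `server.headers` in the config instead.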

API

`new PocketTTS(options)`

| option | default | description |
| -------------- | ---------------------- | ----------- |
| language | "english_2026-04" | Language bundle (PocketTTS.LANGUAGES lists all). |
| quantized | true | INT8 models vs full precision. |
| voiceCloning | true | Download the encoder so cloneVoice() works (~21 MB). |
| modelBaseUrl | HF …/onnx | Base URL of the onnx/ folder. |
| ortBaseUrl | jsDelivr ORT 1.20.0 | Base URL for onnxruntime-web dist files. |
| voicesUrl | null | Explicit URL to a voices.bin (see Built-in voices). |
| maxThreads | 8 | Max WASM threads when cross-origin isolated. |

Methods

load(onProgress?) → Promise<BundleInfo> — download the runtime + selected models and initialise.
cloneVoice(audio: Float32Array, { inputSampleRate?, name? }) → Promise<string> — encode a reference clip into a voice reference.
loadVoice(name: string) → Promise<string> — prepare a built-in voice (requires voices.bin).
generate(text, { voice, onChunk }) → Promise<{ rtfx, genTime, audioDuration }> — stream synthesis.
stop() → Promise<void> — stop the current generation early.
destroy() — terminate the worker and free resources.
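
If you want to keep the streamed audio as well as play it, the chunks passed to `onChunk` can be accumulated as they arrive. The sketch below is illustrative: `createChunkCollector` and `realTimeFactor` are hypothetical helpers, and the formula for RTFx (seconds of audio per second of compute) is the usual meaning of the term, not something this README confirms. The unit of `genTime` is not stated here either, so check it before comparing values.

```javascript
// Hypothetical helper: accumulate streamed Float32Array chunks.
function createChunkCollector(sampleRate) {
  const chunks = [];
  let samples = 0;
  return {
    // Call from your onChunk handler; closes over state, so it can be passed detached.
    push(audio) {
      chunks.push(audio);
      samples += audio.length;
    },
    // Join everything received so far into one Float32Array.
    concat() {
      const out = new Float32Array(samples);
      let offset = 0;
      for (const chunk of chunks) {
        out.set(chunk, offset);
        offset += chunk.length;
      }
      return out;
    },
    // Seconds of audio received so far.
    get audioDuration() {
      return samples / sampleRate;
    },
  };
}

// Assumption: RTFx = audio seconds produced per second of generation time.
function realTimeFactor(audioDurationSec, genTimeSec) {
  return audioDurationSec / genTimeSec;
}

// Usage (browser):
//   const collector = createChunkCollector(tts.sampleRate);
//   await tts.generate(text, {
//     voice,
//     onChunk: (audio, meta) => { collector.push(audio); player.play(audio, meta); },
//   });
```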

Helpers

StreamingPlayer — gapless playback of streamed chunks (play, flush, reset, stop, resume, analyser).
chunksToWavBlob(chunks, sampleRate) — assemble collected chunks into a downloadable WAV.
resampleLinear(data, fromRate, toRate) — simple linear resampler.
SentencePieceTokenizer — the standalone pure-JS Unigram tokenizer.
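
To illustrate what a linear resampler such as `resampleLinear` does, here is an independent sketch of linear interpolation between neighbouring samples. It is not the library's implementation, which may differ in output length and edge handling, and it applies no anti-aliasing filter when downsampling.

```javascript
// Independent sketch of linear-interpolation resampling (not the library's code).
function resampleLinearSketch(data, fromRate, toRate) {
  if (fromRate === toRate || data.length === 0) return Float32Array.from(data);
  const outLength = Math.max(1, Math.round((data.length * toRate) / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    // Clamp both neighbours so the last input sample is held at the tail.
    const left = Math.min(Math.floor(pos), data.length - 1);
    const right = Math.min(left + 1, data.length - 1);
    const frac = pos - left;
    out[i] = data[left] * (1 - frac) + data[right] * frac;
  }
  return out;
}

// Doubling the rate inserts one interpolated sample between each input pair.
const upsampled = resampleLinearSketch(new Float32Array([0, 1, 2, 3]), 2, 4);
console.log(upsampled.length); // 8
```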

What gets downloaded

For the chosen language + quantized setting only:

| file | needed for | INT8 size |
| ---- | ---------- | --------- |
| bundle.json, tokenizer.model | always | ~80 KB |
| flow_lm_main, flow_lm_flow, mimi_decoder | always | ~109 MB |
| text_conditioner | always | ~16 MB |
| mimi_encoder | only if voiceCloning | ~21 MB |
| bos_before_voice.npy | only if voiceCloning | <1 KB |
| voices.bin | only when a built-in voice is requested | varies |

Full-precision (quantized: false) variants are larger.
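
Adding up the INT8 figures from the table gives a rough first-load size. The sketch below only sums the approximate values listed above; it leaves out voices.bin (size varies), bos_before_voice.npy (under 1 KB), and the onnxruntime-web runtime itself, whose size is not given here. The function name is hypothetical.

```javascript
// Approximate INT8 sizes in MB, copied from the table above.
const INT8_SIZES_MB = {
  metadata: 0.08,      // bundle.json + tokenizer.model (~80 KB)
  coreModels: 109,     // flow_lm_main, flow_lm_flow, mimi_decoder
  textConditioner: 16, // text_conditioner
  mimiEncoder: 21,     // mimi_encoder, only if voiceCloning
};

// Sum the files that are always fetched, plus the encoder when cloning is on.
function estimateInt8DownloadMB({ voiceCloning = true } = {}) {
  let total =
    INT8_SIZES_MB.metadata + INT8_SIZES_MB.coreModels + INT8_SIZES_MB.textConditioner;
  if (voiceCloning) total += INT8_SIZES_MB.mimiEncoder;
  return total;
}

console.log(Math.round(estimateInt8DownloadMB({ voiceCloning: true })));  // 146
console.log(Math.round(estimateInt8DownloadMB({ voiceCloning: false }))); // 125
```

So disabling voice cloning saves roughly 21 MB of the initial INT8 download.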

Caching (no re-download every load)

Downloaded assets (models, tokenizer, voices.bin) are persisted in the browser's Cache Storage by default. After the first visit, later loads read these files from disk instead of downloading them again, so they also work offline. Only ONNX session compilation still runs, which takes a few seconds.

new PocketTTS({ cache: true });            // default
new PocketTTS({ cache: false });           // always fetch from network
new PocketTTS({ cacheName: "my-bucket" }); // custom Cache Storage bucket

await PocketTTS.clearCache();              // free the space / force fresh download
await PocketTTS.storageEstimate();         // { usage, quota } in bytes (or null)

Progress callbacks include fromCache: true when a file is served from the cache. Caching needs a secure context (https:// or localhost); it silently falls back to plain network fetches if Cache Storage is unavailable or the quota is exceeded. Bump cacheName (or call clearCache()) when you publish new model weights.
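
The `fromCache` flag makes it possible to show users whether a load is hitting the network. The sketch below aggregates progress events into an overall percentage and a count of cached files. It assumes each event carries `{ label, loaded, total, fromCache }` and that `label` identifies one file; the second point is an assumption, and `createProgressTracker` is a hypothetical helper.

```javascript
// Hypothetical helper: summarise progress events across files.
function createProgressTracker() {
  const files = new Map(); // label -> latest state for that file
  return {
    // Pass this to tts.load(); it closes over `files`, so it can be detached.
    onProgress(p) {
      files.set(p.label, {
        loaded: p.loaded || 0,
        total: p.total || 0,
        fromCache: p.fromCache === true,
      });
    },
    // Overall percentage plus how many files came from Cache Storage.
    summary() {
      let loaded = 0;
      let total = 0;
      let cachedFiles = 0;
      for (const file of files.values()) {
        loaded += file.loaded;
        total += file.total;
        if (file.fromCache) cachedFiles++;
      }
      const percent = total > 0 ? Math.floor((loaded / total) * 100) : 0;
      return { percent, cachedFiles, fileCount: files.size };
    },
  };
}

// Usage (browser):
//   const tracker = createProgressTracker();
//   await tts.load((p) => { tracker.onProgress(p); render(tracker.summary()); });
```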

Built-in voices

Each language bundle ships a voices.bin with several ready-made speakers. List them via tts.predefinedVoices and prepare one with tts.loadVoice(name). The voices.bin for a language is downloaded lazily, only the first time you request a built-in voice.

console.log(tts.predefinedVoices); // e.g. ["alba", "azelma", "cosette", …]
const voice = await tts.loadVoice("alba");
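
Because the available names depend on the language bundle, it can help to choose a voice with a fallback rather than hard-coding one. This sketch assumes `tts.predefinedVoices` is an array of strings, as the output above suggests; `pickVoiceName` is a hypothetical helper.

```javascript
// Hypothetical helper: return the first preferred name that is available,
// otherwise the first available name, otherwise null.
function pickVoiceName(available, preferred) {
  for (const name of preferred) {
    if (available.includes(name)) return name;
  }
  return available.length > 0 ? available[0] : null;
}

// Usage (browser):
//   const name = pickVoiceName(tts.predefinedVoices, ["cosette", "alba"]);
//   if (name) voice = await tts.loadVoice(name);
```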

To serve voices from a different location, point the library at your own file:

new PocketTTS({ voicesUrl: "https://example.com/english_2026-04/voices.bin" });

Voice cloning is also available and needs no voices.bin at all.

License

The library code is licensed under the MIT License.

The Pocket TTS model weights and the bundled built-in voice assets that this library downloads at runtime are © Kyutai and licensed under CC-BY-4.0 — see kyutai/pocket-tts. Your use of those assets is subject to the CC-BY-4.0 terms (including attribution), independently of this package's MIT license.
