@uzen/kokoro-js
v1.2.4
Published
High-quality text-to-speech for the web
Readme
Kokoro TTS
Language: English | 简体中文
@uzen/kokoro-js is a JavaScript runtime wrapper for Kokoro TTS based on Transformers.js. This fork is wired for the v1.1 Chinese model and supports Mandarin plus mixed Chinese/English text through the built-in phonemizer.
Goals
- Add browser-local Mandarin and mixed Chinese/English TTS support to
kokoro.js. - Use Python
kokoroplusmisakiv1.1 Chinese frontend behavior as the reference for Chinese phonemization, normalization, tone sandhi, and erhua handling. - Keep the JavaScript package self-contained and browser-friendly at runtime; Python is used only as a behavior reference, not as a runtime dependency.
- Target practical web inference with the v1.1 Chinese ONNX model, local voice assets, and a WebGPU-first default path.
- Preserve usable English spans in mixed text by routing them through the existing English phonemizer path instead of dropping them.
Current Status
- Chinese phonemization now emits v1.1 Chinese tokenizer symbols directly instead of falling back to English espeak phonemization.
- The implementation covers common Mandarin normalization cases including time expressions, dates, numeric ranges, fractions, negative numbers, percentages, phone numbers, temperatures, measurements, money, and quantified numbers.
- Tone-related handling includes selected neutral-tone words and particles, reduplication,
一/不sandhi, third-tone sandhi, and erhua handling. - Golden corpus tracking is based on Python
misakiv1.1 outputs; the current plan records 164 examples with 143 matches and 21 known gaps. - Remaining gaps mainly come from
Intl.Segmenterversusjieba.possegsegmentation differences, missing POS information in the browser path, and long-tail text normalization cases.
Features
- Runs Kokoro TTS in browser and Node.js through Transformers.js.
- Supports the v1.1 Chinese ONNX model:
onnx-community/Kokoro-82M-v1.1-zh-ONNX. - Supports Mandarin text and mixed Chinese/English text with explicit Chinese phonemization.
- Provides local voice loading through
voicePath; voice.binfiles are not bundled in the npm package. - Supports single-shot generation through
tts.generate()and chunked generation throughtts.stream().
Known Limitations
- Voice assets must be hosted or copied by the application before inference can run.
- The registered English voice set is limited to
af_maple,af_sol, andbf_vale. - Chinese phonemization is implemented in JavaScript and may still differ from Python
misakion long-tail polyphones or text normalization cases. - Browser performance depends on WebGPU availability; use
wasmonly as a fallback whenwebgpuis unavailable. fp32,fp16, andq4f16are recommended;q8andq4are supported by the loader but not recommended for this model.
Install
npm i @uzen/kokoro-jsInstalling this package also installs its runtime dependencies, including @huggingface/transformers. Transformers.js brings in ONNX Runtime packages such as onnxruntime-web as transitive dependencies; applications do not need to install them separately.
Model And Voices
Use the v1.1 Chinese ONNX model:
onnx-community/Kokoro-82M-v1.1-zh-ONNXVoice .bin files are not bundled in the npm package. Download the voice files you need from the model repository and expose them through voicePath.
- Browser/Vite apps: put files such as
zf_001.binunderpublic/kokoro/voices/; the defaultvoicePathis/kokoro/voices. - Node.js: put the files in a local directory and pass that directory as
voicePath, for examplevoicePath: "./voices".
Current registered voices are the v1.1 Chinese voices plus two American English voices and one British English voice:
af_maple, af_sol, bf_vale
zf_001, zf_002, zf_003, zf_004, zf_005, zf_006, zf_007, zf_008,
zf_017, zf_018, zf_019, zf_021, zf_022, zf_023, zf_024, zf_026,
zf_027, zf_028, zf_032, zf_036, zf_038, zf_039, zf_040, zf_042,
zf_043, zf_044, zf_046, zf_047, zf_048, zf_049, zf_051, zf_059,
zf_060, zf_067, zf_070, zf_071, zf_072, zf_073, zf_074, zf_075,
zf_076, zf_077, zf_078, zf_079, zf_083, zf_084, zf_085, zf_086,
zf_087, zf_088, zf_090, zf_092, zf_093, zf_094, zf_099,
zm_009, zm_010, zm_011, zm_012, zm_013, zm_014, zm_015, zm_016,
zm_020, zm_025, zm_029, zm_030, zm_031, zm_033, zm_034, zm_035,
zm_037, zm_041, zm_045, zm_050, zm_052, zm_053, zm_054, zm_055,
zm_056, zm_057, zm_058, zm_061, zm_062, zm_063, zm_064, zm_065,
zm_066, zm_068, zm_069, zm_080, zm_081, zm_082, zm_089, zm_091,
zm_095, zm_096, zm_097, zm_098, zm_100Call tts.list_voices() or read tts.voices at runtime for the authoritative list from the installed package.
Generate Speech
import { KokoroTTS } from "@uzen/kokoro-js";
const model_id = "onnx-community/Kokoro-82M-v1.1-zh-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
dtype: "fp32",
device: "webgpu",
voicePath: "/kokoro/voices",
});
const audio = await tts.generate("你好,欢迎使用 Kokoro 中文语音。", {
voice: "zf_001",
speed: 1,
});
audio.save("audio.wav");Use "fp32" as the default dtype and "webgpu" as the default browser device. Recommended dtype values are "fp32", "fp16", and "q4f16". Other supported quantized variants such as "q8" and "q4" are not recommended for this model. Supported device values are "webgpu", "wasm", "cpu", or null; browser apps should prefer "webgpu" when available and can fall back to "wasm", while Node.js uses "cpu".
Stream Speech
import { KokoroTTS, TextSplitterStream } from "@uzen/kokoro-js";
const model_id = "onnx-community/Kokoro-82M-v1.1-zh-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
dtype: "fp32",
device: "webgpu",
});
const splitter = new TextSplitterStream();
const stream = tts.stream(splitter, { voice: "zf_001" });
(async () => {
let index = 0;
for await (const { text, phonemes, audio } of stream) {
console.log({ text, phonemes });
audio.save(`audio-${index++}.wav`);
}
})();
splitter.push("Kokoro 支持中文和中英 mixed text。它可以分段生成音频。");
splitter.close();tts.stream() also accepts a string directly. By default it uses TextSplitterStream sentence splitting and further splits long chunks at punctuation boundaries.
API Notes
KokoroTTS.from_pretrained(model_id, options)loads the model and tokenizer through Transformers.js.tts.generate(text, { voice, speed })returns a Transformers.jsRawAudioobject.tts.stream(textOrSplitter, options)yields{ text, phonemes, audio }for each generated chunk.voicePathis a base path, not a full file path; the runtime loads${voicePath}/${voice}.binin browsers and resolves the same pattern from the local filesystem in Node.js.- Chinese voices begin with
zf_orzm_and use the Chinese phonemizer path. English voices beginning withaf_orbf_use the English phonemizer path.
