@pocketpalai/react-native-speech

v2.5.1

Published

19 days ago

A high-performance React Native library for text-to-speech on iOS and Android

On-device, multi-engine text-to-speech for React Native. Wraps the OS-native TTS (iOS AVSpeechSynthesizer / Android TextToSpeech) and three neural engines — Kokoro, Supertonic, Kitten — behind a single API, with native audio playback, progress events, and audio-focus handling.

New Architecture only. This library requires React Native's New Architecture, which RN 0.76+ enables by default. For RN 0.68–0.75, see the enable-apps guide.

Preview

[Demo media: Streaming (LLM token stream → TTS), iOS and Android]

[Demo media: One-shot speak, iOS and Android]

Features

Four engines behind one API: OS_NATIVE (platform TTS), KOKORO (high quality, multi-language), SUPERTONIC (fast, lightweight), KITTEN (compact IPA-driven).
License-neutral runner: the library is MIT and ships no model or dictionary data. Consumer apps supply both at runtime. See LICENSES.md.
On-device synthesis: neural TTS runs entirely on-device. The library performs no network I/O during synthesis. Any initial model or dictionary download is performed by the consumer app using its own network stack.
Interruption-aware audio: iOS AVAudioSession and Android AudioFocus are wired through a JS onAudioInterruption event so apps can react to phone calls and other interruptions.
Turbo-module native layer: native audio playback, progress events, and chunk progress for neural engines.
Permissive phonemization: default is phonemize (MIT). Optionally supply a mmap'd EPD1 dict via the NativeDict API for higher accuracy — see PHONEMIZATION.md.
HighlightedText component: highlight spoken text as it synthesizes.
TypeScript: full type definitions; per-engine config is a discriminated union on the engine field.
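
To make the "discriminated union on the engine field" point concrete, here is a minimal sketch of the pattern. The type names and the `requiredFiles` helper are hypothetical, and plain string literals stand in for the library's `TTSEngine` values; only the field names are taken from the quickstart snippets later in this README.

```typescript
// Hypothetical type names; field names mirror this README's quickstart snippets.
type OsNativeSketchConfig = { engine: 'OS_NATIVE' };
type KokoroSketchConfig = {
  engine: 'KOKORO';
  modelPath: string;
  voicesPath: string;
  tokenizerPath: string;
};
type KittenSketchConfig = {
  engine: 'KITTEN';
  modelPath: string;
  voicesPath: string;
  dictPath?: string; // optional EPD1 dict
};
type SketchConfig = OsNativeSketchConfig | KokoroSketchConfig | KittenSketchConfig;

// Checking `engine` narrows the union, so only that engine's fields type-check.
function requiredFiles(config: SketchConfig): string[] {
  switch (config.engine) {
    case 'OS_NATIVE':
      return []; // platform TTS needs no model files
    case 'KOKORO':
      return [config.modelPath, config.voicesPath, config.tokenizerPath];
    case 'KITTEN':
      return config.dictPath
        ? [config.modelPath, config.voicesPath, config.dictPath]
        : [config.modelPath, config.voicesPath];
  }
}
```

With this shape, passing `tokenizerPath` alongside `engine: 'KITTEN'` is a compile-time error rather than a runtime surprise.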

Installation

npm install @pocketpalai/react-native-speech
# or
yarn add @pocketpalai/react-native-speech

iOS:

cd ios && pod install

Expo (bare only — not supported in Expo Go):

npx expo install @pocketpalai/react-native-speech
npx expo prebuild

Neural engines (optional)

The neural engines need onnxruntime-react-native (optional peer):

npm install onnxruntime-react-native

OS-native TTS works without it.

Quickstart

import Speech, {TTSEngine} from '@pocketpalai/react-native-speech';

await Speech.initialize({engine: TTSEngine.OS_NATIVE});
// voiceId is optional for OS_NATIVE — omitted uses the platform default voice.
await Speech.speak('Hello world');

Neural engine quickstarts

The consumer app is responsible for downloading models and passing file paths. See example/src/utils/ for reference model managers.

// Kokoro
await Speech.initialize({
  engine: TTSEngine.KOKORO,
  modelPath: 'file:///.../kokoro.onnx',
  voicesPath: 'file:///.../voices.bin',
  tokenizerPath: 'file:///.../tokenizer.json',
});
await Speech.speak('Hello from Kokoro.', 'af_bella');

// Supertonic (4 ONNX files)
await Speech.initialize({
  engine: TTSEngine.SUPERTONIC,
  durationPredictorPath: 'file:///.../duration_predictor.onnx',
  textEncoderPath: 'file:///.../text_encoder.onnx',
  vectorEstimatorPath: 'file:///.../vector_estimator.onnx',
  vocoderPath: 'file:///.../vocoder.onnx',
  unicodeIndexerPath: 'file:///.../unicode_indexer.json',
  voicesPath: 'file:///.../voices/',
});
await Speech.speak('Hello from Supertonic.', 'F1');

// Supertonic is multilingual on the v3 model (Supertone/supertonic-3):
// 31 languages plus 'na' for language-agnostic synthesis. Pass a code via
// `language` (v1 = en only; v2 = en/ko/es/pt/fr; v3 = all 31 + na). Voices
// are language-agnostic — any voice works with any language.
await Speech.speak('Bonjour le monde.', 'F1', {language: 'fr'});
await Speech.speak('Mixed-language text, just works.', 'M2', {language: 'na'});

// Kitten
await Speech.initialize({
  engine: TTSEngine.KITTEN,
  modelPath: 'file:///.../kitten.onnx',
  voicesPath: 'file:///.../voices.json',
  dictPath: 'file:///.../en-us.bin', // optional EPD1 dict
});
await Speech.speak('Hello from Kitten.', 'expr-voice-2-f');

Full options (execution providers, chunking, phonemizer selection) are documented in USAGE.md.

Streaming input (LLM token streams)

If your app plays a token-by-token LLM response through TTS, use createSpeechStream() instead of calling speak() per sentence. It buffers incoming text and adaptively flushes batches through the underlying engine so playback sounds continuous — the first sentence flushes as soon as it completes (low latency) and subsequent batches are packed up to targetChars characters.

const stream = Speech.createSpeechStream('af_bella', {
  targetChars: 300, // default
  onError: err => console.warn(err),
});

for await (const token of llmTokenStream) {
  stream.append(token); // non-blocking
}

await stream.finalize(); // flushes the tail and resolves when playback ends
// or: await stream.cancel(); // stops and discards

Per-sentence speak() chains produce audible gaps: each call resets the engine's internal synth pipeline, starting a fresh F0 contour and a cold first-chunk inference. The stream avoids this by keeping one continuous synth+play loop alive for the stream's entire lifetime — the next chunk is synthesized while the current one plays, so the only gap is genuine token-rate underrun (LLM slower than playback).
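
The buffering behaviour described in this section can be approximated with a small batcher. This is a simplified sketch, not the library's implementation: the `SentenceBatcher` class, its sentence-boundary regex, and its flush rules are assumptions made for illustration. It shows only the text side (first sentence out immediately, later sentences packed up to `targetChars`); synthesis and playback are left out.

```typescript
// Illustrative model of sentence-aware batching; not the library's code.
class SentenceBatcher {
  private buffer = '';
  private flushedFirst = false;
  private readonly targetChars: number;
  private readonly emit: (batch: string) => void;

  constructor(targetChars: number, emit: (batch: string) => void) {
    this.targetChars = targetChars;
    this.emit = emit;
  }

  append(token: string): void {
    this.buffer += token;
    this.drain(false);
  }

  finalize(): void {
    this.drain(true); // flush whatever tail is left
  }

  private drain(final: boolean): void {
    for (;;) {
      const cut = this.nextCut(final);
      if (cut <= 0) return;
      this.emit(this.buffer.slice(0, cut).trim());
      this.buffer = this.buffer.slice(cut);
      this.flushedFirst = true;
    }
  }

  // End offsets of complete sentences (terminal punctuation plus whitespace).
  private sentenceEnds(): number[] {
    const ends: number[] = [];
    const re = /[.!?]+\s+/g;
    let m: RegExpExecArray | null;
    while ((m = re.exec(this.buffer)) !== null) {
      ends.push(m.index + m[0].length);
    }
    return ends;
  }

  // How many buffered characters to flush now, or 0 to keep waiting.
  private nextCut(final: boolean): number {
    const ends = this.sentenceEnds();
    const first = ends[0];
    if (first !== undefined) {
      // Low latency: the first complete sentence goes out immediately.
      if (!this.flushedFirst) return first;
      if (this.buffer.length >= this.targetChars) {
        // Pack whole sentences that fit; an over-long sentence goes alone.
        const fitting = ends.filter(e => e <= this.targetChars);
        const last = fitting[fitting.length - 1];
        return last !== undefined ? last : first;
      }
    }
    if (final && this.buffer.trim().length > 0) return this.buffer.length;
    return 0;
  }
}
```

Feeding it the tokens `'Hi'`, `' there'`, `'.'`, `' '` emits the first sentence as soon as the trailing space arrives, while later short sentences wait until the buffer reaches `targetChars` or `finalize()` is called.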

You can also track playback position with stream-absolute offsets:

stream.onProgress(event => {
  // event.streamRange is relative to the total text appended so far
  highlightText(event.streamRange.start, event.streamRange.end);
});
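
One way to turn those offsets into something renderable is to split the accumulated text around the range. The sketch below is an assumption-laden illustration: `StreamRangeSketch` and `splitForHighlight` are hypothetical names, and it treats `end` as exclusive, which this README does not specify.

```typescript
// Hypothetical shape: the README only shows `streamRange.start` and `.end`.
type StreamRangeSketch = { start: number; end: number };

// Splits the full appended text into three parts around the spoken range.
// Assumes `end` is exclusive; adjust if the reported end is inclusive.
function splitForHighlight(fullText: string, range: StreamRangeSketch) {
  // Clamp so a stale or out-of-range event cannot produce negative slices.
  const start = Math.max(0, Math.min(range.start, fullText.length));
  const end = Math.max(start, Math.min(range.end, fullText.length));
  return {
    before: fullText.slice(0, start),
    spoken: fullText.slice(start, end),
    after: fullText.slice(end),
  };
}
```

An app would keep its own copy of everything passed to `stream.append()` and render `spoken` with a highlight style.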

Works with all neural engines (Kokoro, Supertonic, Kitten) as well as the OS engine. See the Streaming tab in example/ for a live demo that simulates variable token rates.

Architecture (short)

Speech is the public facade. Speech.initialize(config) dispatches on config.engine and constructs the matching engine.
Each engine implements TTSEngineInterface<TConfig>. Neural engines run ONNX sessions under onnxruntime-react-native and stream PCM to the native audio player.
Native code handles playback, progress events, and OS-level audio focus / session interruptions.
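
The facade-plus-dispatch shape described above can be sketched in a few lines. Everything here is a hypothetical stand-in (`SketchEngine`, `SketchFacade`, the factory table), and `speak` returns a label instead of playing audio so the routing is observable; the real engines implement `TTSEngineInterface<TConfig>` and are considerably richer.

```typescript
// Hypothetical, simplified stand-ins for illustration only.
interface SketchEngine {
  speak(text: string): Promise<string>;
}
type SketchEngineName = 'OS_NATIVE' | 'KOKORO';

// One factory per engine; initialize() picks the entry matching config.engine.
const sketchFactories: Record<SketchEngineName, () => SketchEngine> = {
  OS_NATIVE: () => ({speak: async (text: string) => `os:${text}`}),
  KOKORO: () => ({speak: async (text: string) => `kokoro:${text}`}),
};

class SketchFacade {
  private engine: SketchEngine | null = null;

  initialize(config: {engine: SketchEngineName}): void {
    this.engine = sketchFactories[config.engine]();
  }

  speak(text: string): Promise<string> {
    if (this.engine === null) {
      return Promise.reject(new Error('initialize() must be called first'));
    }
    return this.engine.speak(text); // delegate to whichever engine is active
  }
}
```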

See ARCHITECTURE.md for the full picture, including memory and device requirements.

Model & dictionary downloads

The library ships no model or dictionary assets. Consumer apps fetch them from their own origin (typically Hugging Face) and pass local paths into initialize(). See LICENSES.md for upstream sources and license notes per engine.

Known limitations

The first run of each engine incurs a 200–2000 ms cold start (model load + compilation).
A device with 3 GB+ of RAM is recommended for the neural engines. Low-memory devices should prefer the Kitten nano/micro variants or fall back to OS_NATIVE.
OS TTS interruption handling is limited to what the platform provides — no library-level custom ducking beyond what iOS/Android expose.
Hermes is supported, but has no TextDecoder or WASM — relevant only if you extend the library's text pipeline.
Android 16 KB page sizes (Android 15+): the library's own native_dict.so is 16 KB-aligned, but onnxruntime-react-native (≤ 1.24.3 at time of writing) is not — apps that load a neural engine on a 16 KB-page device will fail with dlopen errors. Workaround: a one-line linker flag added to its CMakeLists.txt via patch-package. See example/patches/onnxruntime-react-native+1.24.3.patch and the postinstall wiring in example/package.json for the full setup. Drop the patch once upstream ships the fix.
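
The memory guidance above could be encoded as a small selection policy. This is only a sketch: `chooseEngine` and its input shape are hypothetical, the app must supply total RAM from whatever device-info source it already uses, and reading "3 GB" as 3 × 10^9 bytes is an assumption.

```typescript
// Illustrative policy only. The 3 GB threshold comes from the limitation
// above; everything else (names, inputs) is hypothetical.
type EngineChoice = 'KOKORO' | 'KITTEN' | 'OS_NATIVE';

const GB = 1e9; // assumption: decimal gigabytes

function chooseEngine(device: {
  totalRamBytes: number;
  hasKokoroFiles: boolean; // model files already downloaded by the app
  hasKittenFiles: boolean;
}): EngineChoice {
  const roomy = device.totalRamBytes >= 3 * GB;
  if (roomy && device.hasKokoroFiles) return 'KOKORO';
  // Low-memory devices: prefer the compact Kitten variants when available.
  if (device.hasKittenFiles) return 'KITTEN';
  return 'OS_NATIVE'; // always available, needs no model files
}
```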

Testing

Mock the module in tests by creating __mocks__/@pocketpalai/react-native-speech.ts:

module.exports = require('@pocketpalai/react-native-speech/jest');

Contributing

See CONTRIBUTING.md.

Credits

Forked from @mhpdev/react-native-speech by Mhpdev. The 1.x line provided the OS-native TTS foundation and the HighlightedText component; 2.0 extended the library into a multi-engine neural platform under a new package name.

Built on top of:

phonemize by hans00 — the MIT G2P library that powers the default phonemizer.
onnxruntime-react-native — Microsoft's ONNX Runtime bindings for RN, which every neural engine uses for inference.
@dr.pogodin/react-native-fs — file I/O for model and dict loading.

Neural model credits (weights are not bundled):

Kokoro-82M by hexgrad (Apache-2.0).
Supertonic by Supertone (code MIT, weights OpenRAIL).
KittenML kitten-tts (Apache-2.0).

Full license details in LICENSES.md.

License

MIT. See LICENSE. For model and third-party data licenses, see LICENSES.md.
