@reaatech/voice-agent-tts
v0.1.0
Provider-agnostic text-to-speech interface with Deepgram Aura, AWS Polly, and Google Cloud Text-to-Speech adapters
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Provider-agnostic text-to-speech interface with five adapter implementations: Deepgram Aura, AWS Polly, Google Cloud Text-to-Speech, ElevenLabs, and Cartesia. Streaming audio output via AsyncIterable<AudioChunk>, cancelable synthesis, and Twilio-ready audio formatting.
Installation
npm install @reaatech/voice-agent-tts
pnpm add @reaatech/voice-agent-tts

Provider SDKs (install only what you use)
The cloud adapters load their provider SDKs lazily and declare them as optional peer dependencies, so you only install the SDK for the provider you actually use. Deepgram needs no extra SDK.
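The lazy-loading behavior described above can be pictured with a small sketch. This illustrates the general pattern rather than the package's actual loader: `loadOptionalPeer` is a hypothetical helper, and the error wording is invented for the example.

```typescript
// Hypothetical helper: load a provider SDK only when it is first needed.
// Not part of @reaatech/voice-agent-tts; shown to illustrate the pattern.
async function loadOptionalPeer(name: string): Promise<unknown> {
  try {
    // Dynamic import defers resolution until this line actually runs,
    // so an SDK that is never used never has to be installed.
    return await import(name);
  } catch {
    // Any load failure is reported as a missing optional dependency.
    throw new Error(
      `Optional peer dependency "${name}" could not be loaded. Try: npm install ${name}`,
    );
  }
}
```

With this shape, an adapter would call the helper inside its first `synthesize()` or `connect()` call, so a missing SDK only surfaces for the provider that needs it.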
# AWS Polly
npm install @aws-sdk/client-polly @aws-sdk/credential-provider-ini
# Google Cloud Text-to-Speech
npm install @google-cloud/text-to-speech

Feature Overview
- Unified TTS interface: TTSProvider with synthesize() returning AsyncIterable<AudioChunk>
- Deepgram Aura adapter: Low-latency HTTP/2 streaming with voice selection and mulaw encoding
- AWS Polly adapter: Neural engine with SSML support, multiple voice IDs, sample rate configuration
- Google Cloud TTS adapter: 220+ voices, speaking rate, pitch, volume control, and SSML gender
- ElevenLabs adapter: Streaming HTTP/2 with ultra-realistic voices (Turbo v2.5, Flash v2.5)
- Cartesia adapter: Ultra-low latency streaming with Sonic model and emotion control
- Cancelable synthesis: cancel() stops in-progress TTS immediately (barge-in support)
- Twilio audio formatting: Automatic mulaw 8kHz conversion via formatAudioForTwilio()
- Silence generation: createSilenceChunk() for injecting pauses between utterances
- Text chunking: chunkTextForStreaming() to split long responses for streaming TTS
- Provider factory: createTTSProvider() for runtime provider selection
Quick Start
import { DeepgramTTSProvider } from '@reaatech/voice-agent-tts';

const tts = new DeepgramTTSProvider();

for await (const chunk of tts.synthesize('Hello, how can I help you today?', {
  provider: 'deepgram',
  apiKey: process.env.DEEPGRAM_API_KEY,
  voice: 'asteria',
  model: 'aura',
  encoding: 'mulaw',
  sampleRate: 8000,
})) {
  // Send chunk.buffer to Twilio Media Stream
  twilioHandler.sendAudio(chunk);
}

API Reference
TTSProvider Interface
interface TTSProvider {
  readonly name: string;
  synthesize(text: string, config: DeepgramTTSConfig | AWSPollyConfig | GoogleCloudTTSConfig): AsyncIterable<AudioChunk>;
  readonly supportsStreaming: boolean;
  readonly firstByteLatencyMs: number | null;
  cancel(): void;
  connect?(config: unknown): Promise<void>;
}

TTSProviderInterface (Static Utilities)
class TTSProviderInterface {
  static formatAudioForTwilio(chunk: AudioChunk): AudioChunk;
  static createSilenceChunk(durationMs: number, sampleRate?: number): AudioChunk;
  static chunkTextForStreaming(text: string, maxChunkSize?: number): string[];
}

| Method | Description |
|--------|-------------|
| formatAudioForTwilio | Converts any audio chunk to mulaw 8kHz for Twilio Media Streams |
| createSilenceChunk | Creates a mulaw silence buffer of specified duration (default 8kHz) |
| chunkTextForStreaming | Splits long text at sentence boundaries for sentence-by-sentence TTS |
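
To make the table more concrete, here is a rough, self-contained sketch of how sentence-boundary chunking and mulaw silence generation could work. It is not the package's implementation: `chunkBySentence` and `muLawSilence` are hypothetical names, the raw `Uint8Array` return value stands in for `AudioChunk`, and the packing rule (greedily joining sentences up to `maxChunkSize`) is an assumption. The fixed fact it relies on is that 8-bit mulaw stores one byte per sample and encodes a zero-amplitude sample as 0xFF.

```typescript
// Hypothetical sketch: split text at sentence boundaries, then pack
// consecutive sentences into chunks no longer than maxChunkSize.
function chunkBySentence(text: string, maxChunkSize = 200): string[] {
  // A "sentence" here is a run of non-terminators followed by any ., !, or ?
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length <= maxChunkSize) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      current = sentence; // an over-long sentence is kept whole, not cut mid-word
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Hypothetical sketch: a buffer of mulaw silence.
// One byte per sample, and 0xFF is the mulaw code for zero amplitude.
function muLawSilence(durationMs: number, sampleRate = 8000): Uint8Array {
  const samples = Math.round((durationMs / 1000) * sampleRate);
  return new Uint8Array(samples).fill(0xff);
}
```

At the default 8 kHz rate, a 500 ms pause works out to 4000 bytes, which is a quick way to sanity-check buffer sizes before sending audio to a telephony stream.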
DeepgramTTSProvider
class DeepgramTTSProvider implements TTSProvider {
  readonly name = 'deepgram';
  readonly supportsStreaming = true;
  constructor(options?: DeepgramTTSOptions);
  getLastFirstByteLatency(): number | null;
}

interface DeepgramTTSOptions {
  apiUrl?: string; // default: 'api.deepgram.com'
  version?: string; // default: 'v1'
}

interface DeepgramTTSConfig extends TTSConfig {
  model?: 'aura';
  voice?: string; // e.g., 'asteria', 'luna', 'stella', 'arcas'
  encoding?: 'mulaw' | 'linear16' | 'pcm';
  sampleRate?: number; // 8000, 16000, 24000, 48000
  container?: 'none' | 'wav';
}

AWSPollyProvider
class AWSPollyProvider extends EventEmitter implements TTSProvider {
  readonly name = 'aws-polly';
  readonly supportsStreaming = true;
  constructor(options?: AWSPollyOptions);
  connect(config: AWSPollyConfig): Promise<void>;
  onError(cb: (error: Error) => void): void;
  close(): Promise<void>;
  isConnected(): boolean;
}

interface AWSPollyOptions {
  region?: string; // default: 'us-east-1'
  defaultVoiceId?: string; // default: 'Joanna'
  defaultEngine?: Engine; // default: NEURAL
}

interface AWSPollyConfig extends TTSConfig {
  region: string;
  voiceId?: string; // Joanna, Matthew, Salli, etc.
  engine?: 'standard' | 'neural';
  languageCode?: string;
  sampleRate?: number; // 8000, 16000, 22050
  textType?: 'text' | 'ssml';
}

GoogleCloudTTSProvider
class GoogleCloudTTSProvider implements TTSProvider {
  readonly name = 'google-cloud-tts';
  readonly supportsStreaming = true;
  constructor(options?: GoogleCloudTTSOptions);
  getLastFirstByteLatency(): number | null;
}

interface GoogleCloudTTSOptions {
  projectId?: string;
  keyFilename?: string;
}

interface GoogleCloudTTSConfig extends TTSConfig {
  projectId: string;
  voiceName?: string; // e.g., 'en-US-Standard-A'
  languageCode?: string; // e.g., 'en-US'
  ssmlGender?: 'MALE' | 'FEMALE' | 'NEUTRAL';
  audioEncoding?: 'MP3' | 'LINEAR16' | 'OGG_OPUS' | 'MULAW' | 'ALAW';
  sampleRateHertz?: number;
  speakingRate?: number; // 0.25–4.0
  pitch?: number; // -20.0–20.0
  volumeGainDb?: number; // -96.0–16.0
}

ElevenLabsProvider
class ElevenLabsProvider implements TTSProvider {
  readonly name = 'elevenlabs';
  readonly supportsStreaming = true;
  constructor(options?: ElevenLabsOptions);
  getLastFirstByteLatency(): number | null;
}

interface ElevenLabsConfig extends TTSConfig {
  modelId?: 'eleven_turbo_v2_5' | 'eleven_flash_v2_5';
  voiceId?: string;
  stability?: number;
  similarityBoost?: number;
  optimizeStreamingLatency?: number;
  outputFormat?: 'mp3_44100' | 'pcm_8000' | 'mulaw_8000';
}

Streaming HTTP/2 adapter for ElevenLabs ultra-realistic voices. Supports latency optimization and multiple output formats.
CartesiaProvider
class CartesiaProvider implements TTSProvider {
  readonly name = 'cartesia';
  readonly supportsStreaming = true;
  constructor(options?: CartesiaOptions);
  getLastFirstByteLatency(): number | null;
}

interface CartesiaConfig extends TTSConfig {
  modelId?: 'sonic' | 'sonic-2';
  voiceId?: string;
  speed?: 'slowest' | 'slow' | 'normal' | 'fast' | 'fastest';
  emotion?: 'anger' | 'positivity' | 'surprise' | 'sadness' | 'curiosity' | 'neutral';
  language?: string;
  outputFormat?: 'raw' | 'wav' | 'mp3';
  sampleRate?: number;
}

Ultra-low latency streaming adapter with Sonic model and emotion control. Sub-100ms P50 latency for real-time use.
Provider Factory
import { createTTSProvider } from '@reaatech/voice-agent-tts';

const tts = createTTSProvider({
  provider: 'deepgram', // 'deepgram' | 'aws-polly' | 'google-cloud-tts' | 'elevenlabs' | 'cartesia'
  config: { provider: 'deepgram', apiKey: '...' },
});

Usage Patterns
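
The patterns in this section share one contract: synthesize() yields audio chunks as an async iterable, and cancel() ends that stream early. The sketch below shows the contract end to end without any provider credentials. Everything in it is an assumption made for illustration: `MockTTS` is a hypothetical stand-in rather than a class from the package, `AudioChunkSketch` guesses at a minimal chunk shape, and checking a flag between chunks is only one plausible way a provider might honor cancel().

```typescript
// Assumed minimal chunk shape; the real AudioChunk type comes from the package.
interface AudioChunkSketch {
  buffer: Uint8Array;
}

// Hypothetical stand-in for a real provider, used only to illustrate the contract.
class MockTTS {
  private cancelled = false;

  async *synthesize(text: string): AsyncGenerator<AudioChunkSketch> {
    this.cancelled = false;
    for (const word of text.split(/\s+/)) {
      if (this.cancelled) return; // stop producing audio once cancel() was called
      await Promise.resolve(); // stand-in for waiting on the network
      yield { buffer: new Uint8Array(word.length) }; // placeholder audio bytes
    }
  }

  cancel(): void {
    this.cancelled = true;
  }
}

// Consume the stream and simulate the caller interrupting after `interruptAfter` chunks.
async function speakWithBargeIn(
  tts: MockTTS,
  text: string,
  interruptAfter: number,
): Promise<{ chunks: number; bytes: number }> {
  let chunks = 0;
  let bytes = 0;
  for await (const chunk of tts.synthesize(text)) {
    chunks += 1;
    bytes += chunk.buffer.length; // a real handler would forward chunk.buffer here
    if (chunks === interruptAfter) tts.cancel();
  }
  return { chunks, bytes };
}
```

The point of the sketch is that the consuming loop needs no special cancellation branch: once cancel() is called, the generator returns and the for await loop ends normally.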
Barge-In (Cancel In-Progress TTS)
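// Start TTS and forward audio as it arrives
const playback = (async () => {
  for await (const chunk of tts.synthesize(text, config)) {
    handler.sendAudio(chunk);
  }
})();
// User interrupts: cancel immediately
tts.cancel();
// The synthesize() generator exits cleanly, ending the loop
await playback;

Sentence-Level Streaming for Low Latency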
import { TTSProviderInterface } from '@reaatech/voice-agent-tts';

const sentences = TTSProviderInterface.chunkTextForStreaming(longText, 200);
for (const sentence of sentences) {
  for await (const chunk of tts.synthesize(sentence, config)) {
    handler.sendAudio(chunk);
  }
}

Silence Between Utterances
import { TTSProviderInterface } from '@reaatech/voice-agent-tts';

// 500ms silence gap
const silence = TTSProviderInterface.createSilenceChunk(500);
handler.sendAudio(silence);

Related Packages
- @reaatech/voice-agent-core — Core types, pipeline, config
- @reaatech/voice-agent-stt — Speech-to-text providers
- @reaatech/voice-agent-telephony — Twilio Media Streams handler
