@reaatech/voice-agent-tts

v0.1.0

Published

23 days ago

Provider-agnostic text-to-speech interface with Deepgram Aura, AWS Polly, and Google Cloud Text-to-Speech adapters



Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Provider-agnostic text-to-speech interface with five adapter implementations: Deepgram Aura, AWS Polly, Google Cloud Text-to-Speech, ElevenLabs, and Cartesia. Streaming audio output via AsyncIterable<AudioChunk>, cancelable synthesis, and Twilio-ready audio formatting.

Installation

npm install @reaatech/voice-agent-tts
pnpm add @reaatech/voice-agent-tts

Provider SDKs (install only what you use)

The cloud adapters load their provider SDKs lazily and declare them as optional peer dependencies, so you only install the SDK for the provider you actually use. Deepgram needs no extra SDK.

# AWS Polly
npm install @aws-sdk/client-polly @aws-sdk/credential-provider-ini

# Google Cloud Text-to-Speech
npm install @google-cloud/text-to-speech
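The lazy loading described above can be pictured with a dynamic `import()` that turns a missing package into an actionable error. This is only a sketch of the general pattern; `loadOptionalSdk` is a hypothetical helper, not a function exported by this package, and the adapters' actual loading code may differ.

```typescript
// Hypothetical helper: load an optional peer dependency only when it is needed.
async function loadOptionalSdk(packageName: string): Promise<unknown> {
  try {
    // Nothing is resolved until the adapter is actually used.
    return await import(packageName);
  } catch {
    // Simplification: every load failure is reported as "not installed".
    throw new Error(
      `Optional peer dependency "${packageName}" could not be loaded. ` +
        `Run: npm install ${packageName}`,
    );
  }
}
```

With this shape, a project that only uses Deepgram never touches the AWS or Google SDKs, and a project that selects Polly without installing its SDK gets an install hint instead of a bare module-resolution error.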

Feature Overview

Unified TTS interface — TTSProvider with synthesize() returning AsyncIterable<AudioChunk>
Deepgram Aura adapter — Low-latency HTTP/2 streaming with voice selection and mulaw encoding
AWS Polly adapter — Neural engine with SSML support, multiple voice IDs, sample rate configuration
Google Cloud TTS adapter — 220+ voices, speaking rate, pitch, volume control, and SSML gender
ElevenLabs adapter — Streaming HTTP/2 with ultra-realistic voices (Turbo v2.5, Flash v2.5)
Cartesia adapter — Ultra-low latency streaming with Sonic model and emotion control
Cancelable synthesis — cancel() stops in-progress TTS immediately (barge-in support)
Twilio audio formatting — Automatic mulaw 8kHz conversion via formatAudioForTwilio()
Silence generation — createSilenceChunk() for injecting pauses between utterances
Text chunking — chunkTextForStreaming() to split long responses for streaming TTS
Provider factory — createTTSProvider() for runtime provider selection

Quick Start

import { DeepgramTTSProvider } from '@reaatech/voice-agent-tts';

const tts = new DeepgramTTSProvider();

for await (const chunk of tts.synthesize('Hello, how can I help you today?', {
  provider: 'deepgram',
  apiKey: process.env.DEEPGRAM_API_KEY,
  voice: 'asteria',
  model: 'aura',
  encoding: 'mulaw',
  sampleRate: 8000,
})) {
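  // Forward each chunk to the Twilio Media Stream. `twilioHandler` is your
  // telephony handler (for example, from @reaatech/voice-agent-telephony).
  twilioHandler.sendAudio(chunk);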
}
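If you are not using a telephony handler, the chunk has to be wrapped in a Twilio Media Streams `media` message yourself. The sketch below shows that framing. It assumes the chunk exposes its bytes as `buffer` (as the comment in the Quick Start suggests); `AudioChunkLike` and `toTwilioMediaMessage` are illustrative names, not part of this package.

```typescript
// Assumed chunk shape, for illustration only.
interface AudioChunkLike {
  buffer: Uint8Array; // mulaw, 8 kHz audio bytes
}

// Outbound "media" message as defined by Twilio Media Streams.
interface TwilioMediaMessage {
  event: 'media';
  streamSid: string;
  media: { payload: string }; // base64-encoded audio
}

function toBase64(bytes: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

function toTwilioMediaMessage(chunk: AudioChunkLike, streamSid: string): string {
  const message: TwilioMediaMessage = {
    event: 'media',
    streamSid,
    media: { payload: toBase64(chunk.buffer) },
  };
  return JSON.stringify(message);
}
```

The returned string is what you would send over the WebSocket that Twilio opened for the call, using the `streamSid` received in the stream's `start` event.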

API Reference

TTSProvider Interface

interface TTSProvider {
  readonly name: string;
  synthesize(text: string, config: DeepgramTTSConfig | AWSPollyConfig | GoogleCloudTTSConfig): AsyncIterable<AudioChunk>;
  readonly supportsStreaming: boolean;
  readonly firstByteLatencyMs: number | null;
  cancel(): void;
  connect?(config: unknown): Promise<void>;
}
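For unit tests it can help to have an in-memory stand-in with the same shape as the interface above. The following is a minimal sketch, not one of the package's adapters: `FakeTTSProvider` and `AudioChunkLike` are hypothetical, the config argument is omitted for brevity, and each word of the input becomes one chunk of placeholder bytes.

```typescript
// Assumed chunk shape, for illustration only.
interface AudioChunkLike {
  buffer: Uint8Array;
}

class FakeTTSProvider {
  readonly name = 'fake';
  readonly supportsStreaming = true;
  private cancelled = false;
  private latency: number | null = null;

  get firstByteLatencyMs(): number | null {
    return this.latency;
  }

  // Yields one chunk per word, with a small delay to imitate network streaming.
  async *synthesize(text: string): AsyncGenerator<AudioChunkLike> {
    this.cancelled = false;
    this.latency = null;
    const startedAt = Date.now();
    for (const word of text.split(/\s+/).filter(Boolean)) {
      await new Promise<void>((resolve) => setTimeout(resolve, 1));
      if (this.cancelled) return; // stop without yielding further audio
      if (this.latency === null) this.latency = Date.now() - startedAt;
      yield { buffer: new Uint8Array(word.length).fill(0xff) };
    }
  }

  cancel(): void {
    this.cancelled = true;
  }
}
```

Because `cancel()` only sets a flag that the generator checks before each yield, a consumer's `for await` loop simply ends after the chunk it is currently handling.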

TTSProviderInterface (Static Utilities)

class TTSProviderInterface {
  static formatAudioForTwilio(chunk: AudioChunk): AudioChunk;
  static createSilenceChunk(durationMs: number, sampleRate?: number): AudioChunk;
  static chunkTextForStreaming(text: string, maxChunkSize?: number): string[];
}

| Method | Description |
|--------|-------------|
| formatAudioForTwilio | Converts any audio chunk to mulaw 8kHz for Twilio Media Streams |
| createSilenceChunk | Creates a mulaw silence buffer of specified duration (default 8kHz) |
| chunkTextForStreaming | Splits long text at sentence boundaries for sentence-by-sentence TTS |
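To make the mulaw conversion less of a black box, here is a sketch of the standard G.711 mu-law companding step applied to 16-bit linear PCM. It is not the package's implementation of `formatAudioForTwilio`; `linearToMulaw` and `encodeMulaw` are illustrative names, and resampling to 8 kHz (which a full conversion also needs) is not shown.

```typescript
const MULAW_BIAS = 0x84; // 132, added before the segment search
const MULAW_CLIP = 32635; // largest magnitude that survives the bias

// Encode one signed 16-bit PCM sample as one mu-law byte.
function linearToMulaw(sample: number): number {
  const sign = (sample >> 8) & 0x80;
  let magnitude = sign !== 0 ? -sample : sample;
  if (magnitude > MULAW_CLIP) magnitude = MULAW_CLIP;
  magnitude += MULAW_BIAS;

  // Find the segment (exponent): position of the highest set bit above bit 7.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;

  // mu-law bytes are stored bit-inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Encode a whole buffer of linear16 samples.
function encodeMulaw(pcm: Int16Array): Uint8Array {
  const out = new Uint8Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = linearToMulaw(pcm[i]);
  }
  return out;
}
```

A linear sample of 0 encodes to 0xFF, which is why a buffer of 0xFF bytes is silence in mulaw audio.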

DeepgramTTSProvider

class DeepgramTTSProvider implements TTSProvider {
  readonly name = 'deepgram';
  readonly supportsStreaming = true;
  constructor(options?: DeepgramTTSOptions);
  getLastFirstByteLatency(): number | null;
}

interface DeepgramTTSOptions {
  apiUrl?: string;   // default: 'api.deepgram.com'
  version?: string;  // default: 'v1'
}

interface DeepgramTTSConfig extends TTSConfig {
  model?: 'aura';
  voice?: string;        // e.g., 'asteria', 'luna', 'stella', 'arcas'
  encoding?: 'mulaw' | 'linear16' | 'pcm';
  sampleRate?: number;   // 8000, 16000, 24000, 48000
  container?: 'none' | 'wav';
}

AWSPollyProvider

class AWSPollyProvider extends EventEmitter implements TTSProvider {
  readonly name = 'aws-polly';
  readonly supportsStreaming = true;
  constructor(options?: AWSPollyOptions);
  connect(config: AWSPollyConfig): Promise<void>;
  onError(cb: (error: Error) => void): void;
  close(): Promise<void>;
  isConnected(): boolean;
}

interface AWSPollyOptions {
  region?: string;          // default: 'us-east-1'
  defaultVoiceId?: string;  // default: 'Joanna'
  defaultEngine?: Engine;   // default: NEURAL
}

interface AWSPollyConfig extends TTSConfig {
  region: string;
  voiceId?: string;          // Joanna, Matthew, Salli, etc.
  engine?: 'standard' | 'neural';
  languageCode?: string;
  sampleRate?: number;       // 8000, 16000, 22050
  textType?: 'text' | 'ssml';
}

GoogleCloudTTSProvider

class GoogleCloudTTSProvider implements TTSProvider {
  readonly name = 'google-cloud-tts';
  readonly supportsStreaming = true;
  constructor(options?: GoogleCloudTTSOptions);
  getLastFirstByteLatency(): number | null;
}

interface GoogleCloudTTSOptions {
  projectId?: string;
  keyFilename?: string;
}

interface GoogleCloudTTSConfig extends TTSConfig {
  projectId: string;
  voiceName?: string;              // e.g., 'en-US-Standard-A'
  languageCode?: string;           // e.g., 'en-US'
  ssmlGender?: 'MALE' | 'FEMALE' | 'NEUTRAL';
  audioEncoding?: 'MP3' | 'LINEAR16' | 'OGG_OPUS' | 'MULAW' | 'ALAW';
  sampleRateHertz?: number;
  speakingRate?: number;           // 0.25–4.0
  pitch?: number;                  // -20.0–20.0
  volumeGainDb?: number;           // -96.0–16.0
}
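The numeric ranges in the comments above are easy to violate when values come from user settings. A small pre-flight check, sketched below, can catch that before a request is sent. `validateProsody` and `ProsodySettings` are hypothetical helpers, not part of this package; the ranges are copied from the `GoogleCloudTTSConfig` comments.

```typescript
interface ProsodySettings {
  speakingRate?: number;
  pitch?: number;
  volumeGainDb?: number;
}

// Inclusive [min, max] ranges, taken from the config comments above.
const PROSODY_RANGES: Record<keyof ProsodySettings, readonly [number, number]> = {
  speakingRate: [0.25, 4.0],
  pitch: [-20.0, 20.0],
  volumeGainDb: [-96.0, 16.0],
};

// Returns a list of human-readable problems; an empty list means the settings are in range.
function validateProsody(settings: ProsodySettings): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(PROSODY_RANGES) as (keyof ProsodySettings)[]) {
    const value = settings[key];
    if (value === undefined) continue; // unset fields fall back to provider defaults
    const [min, max] = PROSODY_RANGES[key];
    if (!Number.isFinite(value) || value < min || value > max) {
      problems.push(`${key} must be between ${min} and ${max}, got ${value}`);
    }
  }
  return problems;
}
```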

ElevenLabsProvider

class ElevenLabsProvider implements TTSProvider {
  readonly name = 'elevenlabs';
  readonly supportsStreaming = true;
  constructor(options?: ElevenLabsOptions);
  getLastFirstByteLatency(): number | null;
}

interface ElevenLabsConfig extends TTSConfig {
  modelId?: 'eleven_turbo_v2_5' | 'eleven_flash_v2_5';
  voiceId?: string;
  stability?: number;
  similarityBoost?: number;
  optimizeStreamingLatency?: number;
  outputFormat?: 'mp3_44100' | 'pcm_8000' | 'mulaw_8000';
}

Streaming HTTP/2 adapter for ElevenLabs voices. Latency optimization is set through optimizeStreamingLatency, and the audio format through outputFormat.

CartesiaProvider

class CartesiaProvider implements TTSProvider {
  readonly name = 'cartesia';
  readonly supportsStreaming = true;
  constructor(options?: CartesiaOptions);
  getLastFirstByteLatency(): number | null;
}

interface CartesiaConfig extends TTSConfig {
  modelId?: 'sonic' | 'sonic-2';
  voiceId?: string;
  speed?: 'slowest' | 'slow' | 'normal' | 'fast' | 'fastest';
  emotion?: 'anger' | 'positivity' | 'surprise' | 'sadness' | 'curiosity' | 'neutral';
  language?: string;
  outputFormat?: 'raw' | 'wav' | 'mp3';
  sampleRate?: number;
}

Low-latency streaming adapter for the Cartesia Sonic models (sonic, sonic-2), with speed and emotion control. Sub-100ms P50 latency for real-time use.

Provider Factory

import { createTTSProvider } from '@reaatech/voice-agent-tts';

const tts = createTTSProvider({
  provider: 'deepgram',             // 'deepgram' | 'aws-polly' | 'google-cloud-tts' | 'elevenlabs' | 'cartesia'
  config: { provider: 'deepgram', apiKey: '...' },
});
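Runtime selection of this kind is usually a lookup from provider name to constructor. The sketch below illustrates the idea with stub objects; it is not the source of `createTTSProvider`, and `createStubProvider` and `NamedProvider` are hypothetical names. The five provider names are the ones listed in the comment above.

```typescript
type ProviderName = 'deepgram' | 'aws-polly' | 'google-cloud-tts' | 'elevenlabs' | 'cartesia';

interface NamedProvider {
  readonly name: ProviderName;
}

const PROVIDER_NAMES: readonly ProviderName[] = [
  'deepgram',
  'aws-polly',
  'google-cloud-tts',
  'elevenlabs',
  'cartesia',
];

// Registry of factories; a real registry would construct the actual adapters.
const registry = new Map<string, () => NamedProvider>();
for (const name of PROVIDER_NAMES) {
  registry.set(name, () => ({ name }));
}

function createStubProvider(name: string): NamedProvider {
  const build = registry.get(name);
  if (!build) {
    throw new Error(
      `Unknown TTS provider "${name}". Expected one of: ${PROVIDER_NAMES.join(', ')}`,
    );
  }
  return build();
}
```

Taking the name as a plain string and failing with the list of valid names is convenient when the provider comes from an environment variable or a config file.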

Usage Patterns

Barge-In (Cancel In-Progress TTS)

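// Start TTS and stream it in the background
const playback = (async () => {
  for await (const chunk of tts.synthesize(text, config)) {
    handler.sendAudio(chunk);
  }
})();

// User interrupts: cancel immediately
tts.cancel();
// The synthesize() generator exits cleanly, so the loop above ends
await playback;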

Sentence-Level Streaming for Low Latency

import { TTSProviderInterface } from '@reaatech/voice-agent-tts';

const sentences = TTSProviderInterface.chunkTextForStreaming(longText, 200);

for (const sentence of sentences) {
  for await (const chunk of tts.synthesize(sentence, config)) {
    handler.sendAudio(chunk);
  }
}
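As a rough picture of what sentence-boundary chunking involves, the sketch below splits on `.`, `!`, and `?` and greedily packs whole sentences into chunks up to a size limit. It is an assumption about the general approach, not the package's `chunkTextForStreaming`; `chunkBySentence` is a hypothetical name, and a single sentence longer than the limit is emitted whole rather than cut mid-sentence.

```typescript
function chunkBySentence(text: string, maxChunkSize = 200): string[] {
  // A sentence is a run of non-terminators plus its punctuation,
  // or a trailing fragment with no terminator at all.
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = '';

  for (const raw of sentences) {
    const sentence = raw.trim();
    if (!sentence) continue;
    if (current && current.length + 1 + sentence.length > maxChunkSize) {
      // Adding this sentence would overflow the chunk, so start a new one.
      chunks.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A smaller limit means the first audio arrives sooner, at the cost of more synthesis requests and possibly less natural prosody across chunk boundaries.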

Silence Between Utterances

import { TTSProviderInterface } from '@reaatech/voice-agent-tts';

// 500ms silence gap
const silence = TTSProviderInterface.createSilenceChunk(500);
handler.sendAudio(silence);
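The size of such a silence buffer follows directly from the format: mulaw uses one byte per sample, so the byte count is duration times sample rate, and silence is the byte 0xFF. The sketch below shows that arithmetic. `createMulawSilence` is a hypothetical stand-in that returns raw bytes, whereas the package's `createSilenceChunk` returns an `AudioChunk`.

```typescript
function createMulawSilence(durationMs: number, sampleRate = 8000): Uint8Array {
  if (!Number.isFinite(durationMs) || durationMs < 0) {
    throw new RangeError('durationMs must be a non-negative number');
  }
  // One byte per sample: 500 ms at 8000 Hz is 4000 bytes.
  const sampleCount = Math.round((durationMs * sampleRate) / 1000);
  // 0xFF is the mu-law encoding of a zero-amplitude sample.
  return new Uint8Array(sampleCount).fill(0xff);
}
```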

Related Packages

@reaatech/voice-agent-core — Core types, pipeline, config
@reaatech/voice-agent-stt — Speech-to-text providers
@reaatech/voice-agent-telephony — Twilio Media Streams handler

License

MIT
