@create-voice-agent/elevenlabs

v0.1.0

Published

11 days ago

ElevenLabs Text-to-Speech integration for voice agents

christian-bromann

@create-voice-agent/elevenlabs 🔊

ElevenLabs Text-to-Speech integration for create-voice-agent.

This package provides high-quality, low-latency voice synthesis using ElevenLabs' streaming TTS API.

Installation

npm install @create-voice-agent/elevenlabs
# or
pnpm add @create-voice-agent/elevenlabs

Quick Start

import { createVoiceAgent } from "create-voice-agent";
import { AssemblyAISpeechToText } from "@create-voice-agent/assemblyai";
import { ElevenLabsTextToSpeech } from "@create-voice-agent/elevenlabs";
import { ChatOpenAI } from "@langchain/openai";

const voiceAgent = createVoiceAgent({
  model: new ChatOpenAI({ model: "gpt-4o" }),
  
  stt: new AssemblyAISpeechToText({ /* ... */ }),
  
  tts: new ElevenLabsTextToSpeech({
    apiKey: process.env.ELEVENLABS_API_KEY!,
    voiceId: process.env.ELEVENLABS_VOICE_ID!,
  }),
});

API Reference

`ElevenLabsTextToSpeech`

Streaming Text-to-Speech model using ElevenLabs' HTTP API.

import { ElevenLabsTextToSpeech } from "@create-voice-agent/elevenlabs";

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
  
  // Optional configuration
  modelId: "eleven_flash_v2_5",
  outputFormat: "pcm_16000",
  optimizeStreamingLatency: 3,
  
  // Voice settings
  voiceSettings: {
    stability: 0.5,
    similarityBoost: 0.75,
    style: 0.3,
    speed: 1.0,
    useSpeakerBoost: true,
  },
  
  // Token batching
  flushDelayMs: 300,
  
  // Callbacks
  onAudioComplete: () => console.log("Finished speaking"),
  onInterrupt: () => console.log("Speech interrupted"),
});

Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| apiKey | string | required | ElevenLabs API key |
| voiceId | string | required | Voice ID to use |
| modelId | string | "eleven_flash_v2_5" | TTS model ID |
| languageCode | string | - | ISO 639-1 language code (e.g., "en", "es") |
| outputFormat | string | "pcm_16000" | Audio output format |
| optimizeStreamingLatency | 0-4 | 3 | Latency optimization level |
| flushDelayMs | number | 300 | Token batching delay (ms) |
| seed | number | - | Seed for deterministic generation |
| previousText | string | - | Context text before current request |
| nextText | string | - | Context text after current request |
| applyTextNormalization | "auto" \| "on" \| "off" | "auto" | Text normalization mode |
| applyLanguageTextNormalization | boolean | false | Language-specific normalization (⚠️ high latency) |

Voice Settings

Fine-tune the generated speech characteristics:

interface ElevenLabsVoiceSettings {
  /** Speech stability (0-1). Lower = more expressive, higher = more consistent */
  stability?: number;
  
  /** Voice similarity (0-1). Higher = closer to reference voice */
  similarityBoost?: number;
  
  /** Enable speaker boost for enhanced clarity */
  useSpeakerBoost?: boolean;
  
  /** Style/expressiveness (0-1). Only for certain models */
  style?: number;
  
  /** Speech speed (0.5-2.0) */
  speed?: number;
}
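
The ranges in the interface comments can be enforced before a settings object reaches the constructor. The following is a minimal sketch, not part of this package: `VoiceSettings` and `clampVoiceSettings` are hypothetical names, and the bounds are copied from the comments above rather than from the library's own validation.

```typescript
// Hypothetical local copy of the settings shape documented above.
interface VoiceSettings {
  stability?: number;
  similarityBoost?: number;
  useSpeakerBoost?: boolean;
  style?: number;
  speed?: number;
}

// Restrict a number to the inclusive range [min, max].
const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Return a copy of the settings with every numeric field forced into
// the range stated in the interface comments. Unset fields stay unset.
function clampVoiceSettings(settings: VoiceSettings): VoiceSettings {
  const result: VoiceSettings = { ...settings };
  if (settings.stability !== undefined) {
    result.stability = clamp(settings.stability, 0, 1);
  }
  if (settings.similarityBoost !== undefined) {
    result.similarityBoost = clamp(settings.similarityBoost, 0, 1);
  }
  if (settings.style !== undefined) {
    result.style = clamp(settings.style, 0, 1);
  }
  if (settings.speed !== undefined) {
    result.speed = clamp(settings.speed, 0.5, 2.0);
  }
  return result;
}
```

The result could then be passed as `voiceSettings`, which is handy when the values come from user input or a config file.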

Example: Expressive Storytelling Voice

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  voiceSettings: {
    stability: 0.3,        // More expressive
    similarityBoost: 0.8,  // Close to reference
    style: 0.6,            // More stylized
    speed: 0.9,            // Slightly slower
  },
});

Example: Consistent Professional Voice

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  voiceSettings: {
    stability: 0.8,        // Very consistent
    similarityBoost: 0.7,
    useSpeakerBoost: true, // Enhanced clarity
    speed: 1.0,
  },
});

Models

| Model ID | Description | Best For |
|----------|-------------|----------|
| eleven_flash_v2_5 | Fastest, lowest latency (default) | Real-time conversations |
| eleven_turbo_v2_5 | Fast with higher quality | Balanced speed/quality |
| eleven_multilingual_v2 | Best multilingual support | Non-English or mixed languages |
| eleven_monolingual_v1 | Original English model | Legacy compatibility |

Output Formats

PCM (Recommended for voice agents)

| Format | Sample Rate | Description |
|--------|-------------|-------------|
| pcm_8000 | 8 kHz | Telephone quality |
| pcm_16000 | 16 kHz | Standard voice (default) |
| pcm_22050 | 22.05 kHz | Higher quality |
| pcm_24000 | 24 kHz | High quality |
| pcm_44100 | 44.1 kHz | CD quality |
| pcm_48000 | 48 kHz | Professional quality |

MP3

| Format | Sample Rate | Bitrate |
|--------|-------------|---------|
| mp3_22050_32 | 22.05 kHz | 32 kbps |
| mp3_44100_64 | 44.1 kHz | 64 kbps |
| mp3_44100_128 | 44.1 kHz | 128 kbps |
| mp3_44100_192 | 44.1 kHz | 192 kbps |

Other Formats

| Format | Description |
|--------|-------------|
| ulaw_8000 | μ-law 8 kHz (telephony) |
| alaw_8000 | A-law 8 kHz (telephony) |
| opus_48000_* | Opus codec (32-192 kbps) |
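
When working with the PCM formats, it is often useful to know how much audio a chunk represents, for example to pace playback or estimate buffer sizes. The sketch below is not part of this package: `pcmSampleRate` and `pcmDurationMs` are hypothetical helpers, and they assume the `pcm_*` formats carry 16-bit signed, mono samples.

```typescript
// Assumption: pcm_* formats are 16-bit (2 bytes per sample), mono.
const BYTES_PER_SAMPLE = 2;

// Extract the sample rate in Hz from a format name such as "pcm_16000".
function pcmSampleRate(format: string): number {
  const match = /^pcm_(\d+)$/.exec(format);
  if (!match) {
    throw new Error(`Not a PCM format: ${format}`);
  }
  return Number(match[1]);
}

// Duration in milliseconds of `byteLength` bytes of audio in the given format.
function pcmDurationMs(byteLength: number, format: string): number {
  const bytesPerSecond = pcmSampleRate(format) * BYTES_PER_SAMPLE;
  return (byteLength * 1000) / bytesPerSecond;
}
```

Under these assumptions, one second of `pcm_16000` audio is 32,000 bytes, so a 3,200-byte chunk holds 100 ms of speech.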

Latency Optimization

Control the trade-off between latency and quality:

| Level | Description | Use Case |
|-------|-------------|----------|
| 0 | No optimization | Highest quality |
| 1 | About 50% of the latency improvement of level 3 | Balanced |
| 2 | About 75% of the latency improvement of level 3 | Lower latency |
| 3 | Maximum optimization (default) | Real-time conversations |
| 4 | Maximum optimization with the text normalizer disabled | Fastest (may mispronounce numbers/dates) |

// For real-time conversations (fastest)
const realtimeTts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  optimizeStreamingLatency: 4,
});

// For pre-recorded content (highest quality)
const highQualityTts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  optimizeStreamingLatency: 0,
});
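
The trade-off in the table can be written down as a small decision helper. This is only one possible policy, sketched under assumptions: `LatencyNeeds` and `chooseLatencyLevel` are hypothetical names, not part of this package, and the rule simply avoids level 4 whenever the text is likely to contain numbers or dates.

```typescript
type LatencyLevel = 0 | 1 | 2 | 3 | 4;

// Hypothetical description of what the caller cares about.
interface LatencyNeeds {
  /** True for live conversations, false for pre-recorded content. */
  realtime: boolean;
  /** True if the spoken text is likely to include numbers or dates. */
  containsNumbersOrDates: boolean;
}

// Pick a level following the table: 0 for highest quality, 3 for
// real-time speech, 4 only when skipping text normalization is safe.
function chooseLatencyLevel(needs: LatencyNeeds): LatencyLevel {
  if (!needs.realtime) {
    return 0;
  }
  return needs.containsNumbersOrDates ? 3 : 4;
}
```

The returned value can be passed as `optimizeStreamingLatency`.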

Token Batching

The TTS model collects incoming text tokens into batches before sending them to ElevenLabs, which gives the synthesizer more context and produces more natural-sounding speech:

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  
  // Wait 300ms after last token before generating speech
  flushDelayMs: 300,
});

Lower values (100-200ms): Faster response, may sound choppy
Higher values (400-500ms): More natural speech, higher latency
Default (300ms): Good balance for most use cases
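
The batching behavior described here resembles a debounce: each new token resets a timer, and the buffered text is sent once no token has arrived for `flushDelayMs`. The sketch below illustrates that idea only. `TokenBatcher` is a hypothetical class, not this package's internal implementation, which may differ (for example by also flushing on sentence boundaries).

```typescript
// Hypothetical debounce-style batcher: buffers tokens and flushes them
// as one string after `flushDelayMs` without a new token.
class TokenBatcher {
  private buffer = "";
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(
    private readonly flushDelayMs: number,
    private readonly onFlush: (text: string) => void,
  ) {}

  // Append a token and restart the inactivity timer.
  push(token: string): void {
    this.buffer += token;
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
    }
    this.timer = setTimeout(() => this.flush(), this.flushDelayMs);
  }

  // Emit the buffered text immediately (also called by the timer).
  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.buffer.length === 0) {
      return;
    }
    const text = this.buffer;
    this.buffer = "";
    this.onFlush(text);
  }
}
```

With a 300 ms delay, tokens streamed a few milliseconds apart by the language model reach the callback as a single phrase rather than word by word, which is why shorter delays can sound choppy.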

Instance Methods

`interrupt()`

Interrupt the current speech generation. Useful for barge-in handling.

// User started speaking - stop the agent
tts.interrupt();

`speak(text: string): ReadableStream<Buffer>`

Generate speech directly without going through the voice pipeline. Returns a ReadableStream of PCM audio buffers.

This is useful for:

Initial greetings when a call starts
System announcements that bypass the agent
One-off speech synthesis outside of conversations

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
});

// Generate and play a greeting
const audioStream = tts.speak("Welcome to our service! How can I help you?");

for await (const chunk of audioStream) {
  // Send to audio output (speakers, WebRTC, etc.)
  audioOutput.write(chunk);
}

The speak() method uses the same voice settings and configuration as the main TTS pipeline, ensuring consistent voice quality.

Callbacks

`onAudioComplete`

Called when speech generation finishes (not interrupted).

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  onAudioComplete: () => {
    console.log("Agent finished speaking");
    // Trigger next action, update UI, etc.
  },
});

`onInterrupt`

Called when speech is interrupted (e.g., by barge-in).

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  onInterrupt: () => {
    console.log("Speech was interrupted");
  },
});

Finding Voice IDs

Using the API

const response = await fetch("https://api.elevenlabs.io/v1/voices", {
  headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY! },
});
const { voices } = await response.json();

for (const voice of voices) {
  console.log(`${voice.name}: ${voice.voice_id}`);
}
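
Once the list has been fetched, a voice can be looked up by name instead of copying IDs by hand. A minimal sketch, assuming only the `name` and `voice_id` fields used in the loop above; `VoiceSummary` and `findVoiceId` are hypothetical names, not part of this package or the ElevenLabs SDK.

```typescript
// Only the two fields used in the example above; the real response
// objects contain more properties.
interface VoiceSummary {
  voice_id: string;
  name: string;
}

// Case-insensitive lookup of a voice ID by display name.
// Returns undefined when no voice matches.
function findVoiceId(voices: VoiceSummary[], name: string): string | undefined {
  const wanted = name.trim().toLowerCase();
  return voices.find((voice) => voice.name.toLowerCase() === wanted)?.voice_id;
}
```

For example, `findVoiceId(voices, "rachel")` could supply the `voiceId` option, with a fallback to a known ID when the result is `undefined`.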

Popular Pre-made Voices

| Voice | ID | Description |
|-------|-----|-------------|
| Rachel | 21m00Tcm4TlvDq8ikWAM | American female, calm |
| Domi | AZnzlk1XvdvUeBnXmlld | American female, strong |
| Bella | EXAVITQu4vr4xnSDxMaL | American female, soft |
| Antoni | ErXwobaYiN019PkySvjV | American male, warm |
| Josh | TxGEqnHWrfWFTfGW9XjX | American male, deep |
| Arnold | VR6AewLTigWG4xSOukaG | American male, crisp |
| Adam | pNInz6obpgDQGcFmaJgB | American male, deep |
| Sam | yoZ06aMxZJJ28mfd3POQ | American male, raspy |

Multilingual Support

For non-English or mixed-language content:

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  modelId: "eleven_multilingual_v2",
  languageCode: "es", // Spanish
});

Supported Languages

The eleven_multilingual_v2 model supports 29 languages including: English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Arabic, Japanese, Korean, Mandarin, and more.

Text Normalization

Control how text is processed before synthesis:

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  
  // "auto" - Let the system decide (default)
  // "on"   - Always normalize (spell out numbers, dates, etc.)
  // "off"  - Skip normalization
  applyTextNormalization: "on",
});

Note: For the eleven_turbo_v2_5 and eleven_flash_v2_5 models, text normalization can only be enabled with an Enterprise plan.

Deterministic Generation

Use seeds for reproducible output:

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: "your-voice-id",
  seed: 12345, // 0 to 4294967295
});

Note: Determinism is not guaranteed. With the same seed and the same parameters, the system makes a best effort to produce the same result.
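
Because the seed must fall in the range given in the comment above (0 to 4294967295, that is, an unsigned 32-bit integer), it can be worth checking a value before passing it in. A small sketch; `isValidSeed` is a hypothetical helper, not part of this package.

```typescript
// Largest accepted seed: 2^32 - 1.
const MAX_SEED = 4294967295;

// True when `seed` is a whole number within [0, MAX_SEED].
function isValidSeed(seed: number): boolean {
  return Number.isInteger(seed) && seed >= 0 && seed <= MAX_SEED;
}
```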

Complete Example

import { createVoiceAgent, createThinkingFillerMiddleware } from "create-voice-agent";
import { AssemblyAISpeechToText } from "@create-voice-agent/assemblyai";
import { ElevenLabsTextToSpeech } from "@create-voice-agent/elevenlabs";
import { ChatOpenAI } from "@langchain/openai";

const tts = new ElevenLabsTextToSpeech({
  apiKey: process.env.ELEVENLABS_API_KEY!,
  voiceId: process.env.ELEVENLABS_VOICE_ID!,
  modelId: "eleven_flash_v2_5",
  outputFormat: "pcm_16000",
  optimizeStreamingLatency: 3,
  
  voiceSettings: {
    stability: 0.5,
    similarityBoost: 0.75,
    useSpeakerBoost: true,
  },
  
  onAudioComplete: () => console.log("Agent finished speaking"),
});

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  onSpeechStart: () => {
    // Barge-in: user started speaking, interrupt the agent
    tts.interrupt();
  },
});

const voiceAgent = createVoiceAgent({
  model: new ChatOpenAI({ model: "gpt-4o" }),
  prompt: "You are a friendly voice assistant. Keep responses concise.",
  
  stt,
  tts,
  
  middleware: [
    createThinkingFillerMiddleware({ thresholdMs: 1000 }),
  ],
});

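// Process audio streams.
// audioInputStream is a placeholder for your incoming audio source
// (for example, microphone or telephony audio); it is not defined here.
const audioOutput = voiceAgent.process(audioInputStream);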

License

MIT