LiveSpeech SDK for TypeScript
A TypeScript/JavaScript SDK for real-time speech-to-speech AI conversations.
Features
- 🎙️ Real-time Voice Conversations - Natural, low-latency voice interactions
- 🌐 Multi-language Support - Korean, English, Japanese, Chinese, and more
- 🔊 Streaming Audio - Send and receive audio in real-time
- ⏹️ Barge-in Support - Interrupt AI mid-speech by talking or programmatically
- 🔄 Auto-reconnection - Automatic recovery from network issues
- 🌐 Browser & Node.js - Works in both environments
Installation
npm install @drawdream/livespeech
Quick Start (5 minutes)
import { LiveSpeechClient } from '@drawdream/livespeech';
const client = new LiveSpeechClient({
region: 'ap-northeast-2',
apiKey: 'your-api-key',
});
// Handle only 4 essential events!
client.setAudioHandler((audioData) => {
audioPlayer.queue(audioData); // PCM16 — use event.sampleRate (24kHz Live, 16kHz Composed)
});
client.on('interrupted', () => {
audioPlayer.clear(); // CRITICAL: Clear buffer on interrupt!
});
client.on('turnComplete', () => {
console.log('AI finished');
});
client.setErrorHandler((error) => {
console.error('Error:', error.message);
});
// Connect and start
await client.connect();
await client.startSession({ prePrompt: 'You are a helpful assistant.' });
// Send audio
client.audioStart();
client.sendAudioChunk(pcmData); // PCM16 @ 16kHz
client.audioEnd();
// Cleanup
await client.endSession();
client.disconnect();
Core API
Everything you need for basic voice conversations.
Methods
| Method | Description |
|--------|-------------|
| connect() | Establish connection |
| disconnect() | Close connection |
| startSession(config) | Start conversation with system prompt |
| endSession() | End conversation |
| sendAudioChunk(data) | Send PCM16 audio (16kHz) |
Events
| Event | Description | Action Required |
|-------|-------------|-----------------|
| audio | AI's audio output | Play audio (PCM16 — check sampleRate) |
| turnComplete | AI finished speaking | Ready for next input |
| interrupted | User barged in | Clear audio buffer! |
| error | Error occurred | Handle/log error |
⚠️ Critical: Handle interrupted
When the user speaks while AI is responding, you must clear your audio buffer:
client.on('interrupted', () => {
audioPlayer.clear(); // Stop buffered audio immediately
audioPlayer.stop();
});
Without this, 2-3 seconds of buffered audio continues playing after the user interrupts.
Audio Format
| Direction | Format | Sample Rate |
|-----------|--------|-------------|
| Input (mic) | PCM16 | 16,000 Hz |
| Output (AI) — Live mode | PCM16 | 24,000 Hz |
| Output (AI) — Composed mode | PCM16 | 16,000 Hz |
Important: The audio event includes a sampleRate field. Always use it to configure your audio decoder rather than hardcoding a rate.
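The audioPlayer used throughout these examples is your own app code, not part of the SDK. A minimal browser playback sketch (illustrative, under assumed wiring): it creates the AudioContext at the event's sample rate, schedules PCM16 chunks back-to-back, and exposes clear() for the interrupted handler.
class PcmPlayer {
  private ctx: AudioContext | null = null;
  private nextTime = 0;
  private playing: AudioBufferSourceNode[] = [];

  queue(pcm16: Uint8Array, sampleRate: number) {
    // Recreate the context when the rate changes (24 kHz Live, 16 kHz Composed).
    if (!this.ctx || this.ctx.sampleRate !== sampleRate) {
      void this.ctx?.close();
      this.ctx = new AudioContext({ sampleRate });
      this.nextTime = 0;
    }
    // PCM16 little-endian bytes -> Float32 samples in [-1, 1].
    const int16 = new Int16Array(pcm16.slice().buffer);
    const float32 = Float32Array.from(int16, (s) => s / 32768);
    const buffer = this.ctx.createBuffer(1, float32.length, sampleRate);
    buffer.copyToChannel(float32, 0);
    const src = this.ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(this.ctx.destination);
    // Schedule each chunk immediately after the previous one.
    this.nextTime = Math.max(this.nextTime, this.ctx.currentTime);
    src.start(this.nextTime);
    this.nextTime += buffer.duration;
    this.playing.push(src);
    src.onended = () => { this.playing = this.playing.filter((s) => s !== src); };
  }

  clear() {
    // Drop everything queued or playing (call this from the interrupted handler).
    for (const src of this.playing) { try { src.stop(); } catch { /* already stopped */ } }
    this.playing = [];
    this.nextTime = 0;
  }
}
Pass each audio event's sampleRate into queue() from your setAudioHandler callback, and call clear() from the interrupted handler as shown above.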
Configuration
const client = new LiveSpeechClient({
region: 'ap-northeast-2', // Required
apiKey: 'your-api-key', // Required
});
await client.startSession({
prePrompt: 'You are a helpful assistant.',
language: 'ko-KR', // Optional: ko-KR, en-US, ja-JP, etc.
});
Composed Mode
Use composed mode for higher accuracy with slightly more latency. It runs a separate STT → LLM → TTS pipeline instead of direct audio-to-audio.
await client.startSession({
prePrompt: 'You are a helpful assistant.',
pipelineMode: 'composed',
language: 'ko-KR',
});
client.audioStart();
// Send/receive audio the same way as live mode
Live vs Composed
| | Live | Composed |
|---|---|---|
| Latency | ~300ms | ~1-2s |
| Pipeline | Direct audio-to-audio (Gemini Live) | STT → LLM → TTS |
| Accuracy | Good | Higher |
| aiSpeaksFirst | ✅ Supported | ❌ Not supported |
| tools (function calling) | ✅ Supported | ❌ Not supported |
| Output sample rate | 24,000 Hz | 16,000 Hz |
| Barge-in | Automatic (Gemini VAD) | Automatic |
Note: All other SDK methods and events work identically in both modes. The only code change is adding pipelineMode: 'composed' to your session config.
Event Correlation (turnId)
In Composed mode, all events include a turnId field (monotonic counter starting from 0). Events sharing the same turnId belong to the same speech turn — use this to match userTranscript, response, audio, and turnComplete events together. In Live mode, turnId is not present.
client.on('userTranscript', (e) => {
console.log(`Turn ${e.turnId}: User said '${e.text}'`);
});
client.on('response', (e) => {
if (e.isFinal) console.log(`Turn ${e.turnId}: AI responded '${e.text}'`);
});
client.on('turnComplete', (e) => {
console.log(`Turn ${e.turnId} complete`);
});
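If you keep per-turn state, a small accumulator keyed by turnId works well. A sketch (the record shape is illustrative, not an SDK type):
type TurnRecord = { user?: string; ai?: string };
const turns = new Map<number, TurnRecord>();

client.on('userTranscript', (e) => {
  turns.set(e.turnId, { ...turns.get(e.turnId), user: e.text });
});
client.on('response', (e) => {
  if (e.isFinal) turns.set(e.turnId, { ...turns.get(e.turnId), ai: e.text });
});
client.on('turnComplete', (e) => {
  console.log('Turn record:', turns.get(e.turnId)); // { user, ai } for this turn
});
Advanced API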
Optional features for power users.
Additional Methods
| Method | Description |
|--------|-------------|
| audioStart() / audioEnd() | Manual audio stream control |
| interrupt() | Explicitly stop AI response (for Stop button) |
| sendSystemMessage(msg) | Inject context during conversation |
| sendToolResponse(id, result) | Reply to function calls |
| updateUserId(userId) | Migrate guest to authenticated user |
Additional Events
| Event | Description |
|-------|-------------|
| connected / disconnected | Connection lifecycle |
| sessionStarted / sessionEnded | Session lifecycle |
| ready | Session ready for audio |
| userTranscript | User's speech transcribed |
| response | AI's response text |
| toolCall | AI wants to call a function |
| reconnecting | Auto-reconnection attempt |
| userIdUpdated | Guest-to-user migration complete |
| sessionWarning | Session nearing duration limit |
| sessionGoodbye | Session about to end |
Explicit Interrupt (Stop Button)
For UI "Stop" buttons or programmatic control:
// User clicks Stop button
client.interrupt();
Note: Voice barge-in works automatically via Gemini's VAD. This method is for explicit control.
System Messages
Inject text context during live sessions (game events, app state, etc.):
// AI responds immediately
client.sendSystemMessage("User completed level 5. Congratulate them!");
// Context only, no response
client.sendSystemMessage({ text: "User is browsing", triggerResponse: false });
Requires an active live session (audioStart() called). Max 500 characters.
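Given the 500-character cap, a small call-site guard can help. An illustrative wrapper (not an SDK API):
// Illustrative helper: enforce the 500-character limit before sending.
function sendContext(text: string, triggerResponse = false) {
  client.sendSystemMessage({ text: text.slice(0, 500), triggerResponse });
}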
Function Calling (Tool Use)
Let AI call functions in your app:
1. Define Tools
const tools = [{
name: 'get_price',
description: 'Gets product price by ID',
parameters: {
type: 'OBJECT',
properties: { productId: { type: 'string' } },
required: ['productId']
}
}];
await client.startSession({
prePrompt: 'You are helpful.',
tools,
});
2. Handle toolCall Events
client.on('toolCall', (event) => {
if (event.name === 'get_price') {
const price = lookupPrice(event.args.productId);
client.sendToolResponse(event.id, { price });
}
});
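Because sendToolResponse takes the call's id, the handler can await asynchronous work before replying. A sketch with a hypothetical fetchPriceFromApi helper:
client.on('toolCall', async (event) => {
  if (event.name === 'get_price') {
    // fetchPriceFromApi is a hypothetical app-side helper, not part of the SDK.
    const price = await fetchPriceFromApi(event.args.productId);
    client.sendToolResponse(event.id, { price });
  }
});
Conversation Memory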
Enable persistent memory across sessions:
const client = new LiveSpeechClient({
region: 'ap-northeast-2',
apiKey: 'your-api-key',
userId: 'user-123', // Enables memory
});
| Mode | Memory |
|------|--------|
| With userId | Permanent (entities, summaries) |
| Without userId | Session only (guest) |
Guest-to-User Migration
// User logs in during session
await client.updateUserId('authenticated-user-123');
// Listen for confirmation
client.on('userIdUpdated', (event) => {
console.log(`Migrated ${event.migratedMessages} messages`);
});
AI Speaks First
AI initiates the conversation:
await client.startSession({
prePrompt: 'Greet the customer warmly.',
aiSpeaksFirst: true,
});
client.audioStart(); // AI speaks immediately
Session Options
| Option | Default | Description |
|--------|---------|-------------|
| prePrompt | - | System prompt |
| language | 'en-US' | Language code |
| outputLanguage | - | TTS voice language override (composed mode only) |
| pipelineMode | 'live' | 'live' (~300ms) or 'composed' (~1-2s) |
| aiSpeaksFirst | false | AI initiates (live mode only) |
| allowHarmCategory | false | Disable safety filters |
| tools | [] | Function definitions |
| sessionDuration | - | Enables session duration limits when provided |
Notes
- Duration checks are disabled by default. They activate only when sessionDuration is provided.
- If only sessionDuration.maxSeconds is provided, enableWarning / enableGoodbye default to false in the SDK.
- Server limits take precedence in production.
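For example, a hedged sketch of opting into duration limits and pairing them with the sessionWarning / sessionGoodbye events (the exact shape of sessionDuration is an assumption based on the notes above):
await client.startSession({
  prePrompt: 'You are a helpful assistant.',
  // Field names taken from the notes above; treat the exact shape as an assumption.
  sessionDuration: {
    maxSeconds: 600,
    enableWarning: true,   // defaults to false if omitted
    enableGoodbye: true,   // defaults to false if omitted
  },
});

client.on('sessionWarning', () => {
  console.log('Session nearing its duration limit'); // e.g. show a countdown in the UI
});
client.on('sessionGoodbye', () => {
  console.log('Session about to end');
});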
Browser Example
import { LiveSpeechClient, float32ToInt16, int16ToUint8 } from '@drawdream/livespeech';
// Capture microphone
const stream = await navigator.mediaDevices.getUserMedia({
audio: { sampleRate: 16000, channelCount: 1 }
});
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
const float32 = e.inputBuffer.getChannelData(0);
const int16 = float32ToInt16(float32);
const pcm = int16ToUint8(int16);
client.sendAudioChunk(pcm);
};
source.connect(processor);
processor.connect(audioContext.destination);
Audio Utilities
import { float32ToInt16, int16ToUint8, wrapPcmInWav } from '@drawdream/livespeech';
const int16 = float32ToInt16(float32Data);
const bytes = int16ToUint8(int16);
const wav = wrapPcmInWav(bytes, { sampleRate: 16000, channels: 1, bitDepth: 16 });
Error Handling
client.on('error', (event) => {
switch (event.code) {
case 'authentication_failed': console.error('Invalid API key'); break;
case 'connection_timeout': console.error('Timed out'); break;
default: console.error(`Error: ${event.message}`);
}
});
client.on('reconnecting', (event) => {
console.log(`Reconnecting ${event.attempt}/${event.maxAttempts}`);
});
Regions
| Region | Code |
|--------|------|
| Seoul (Korea) | ap-northeast-2 |
License
MIT
