LiveSpeech SDK for TypeScript
A TypeScript/JavaScript SDK for real-time speech-to-speech AI conversations.
Features
- 🎙️ Real-time Voice Conversations - Natural, low-latency voice interactions
- 🌐 Multi-language Support - Korean, English, Japanese, Chinese, and more
- 🔊 Streaming Audio - Send and receive audio in real-time
- ⏹️ Barge-in Support - Interrupt AI mid-speech by talking or programmatically
- 🔄 Auto-reconnection - Automatic recovery from network issues
- 🌐 Browser & Node.js - Works in both environments
Installation
npm install @drawdream/livespeech
Quick Start (5 minutes)
import { LiveSpeechClient } from '@drawdream/livespeech';
const client = new LiveSpeechClient({
region: 'ap-northeast-2',
apiKey: 'your-api-key',
});
// Handle only 4 essential events!
client.setAudioHandler((audioData) => {
audioPlayer.queue(audioData); // PCM16 — use event.sampleRate (24kHz Live, 16kHz Composed)
});
client.on('interrupted', () => {
audioPlayer.clear(); // CRITICAL: Clear buffer on interrupt!
});
client.on('turnComplete', () => {
console.log('AI finished');
});
client.setErrorHandler((error) => {
console.error('Error:', error.message);
});
// Connect and start
await client.connect();
await client.startSession({ prePrompt: 'You are a helpful assistant.' });
// Send audio
client.audioStart();
client.sendAudioChunk(pcmData); // PCM16 @ 16kHz
client.audioEnd();
// Cleanup
await client.endSession();
client.disconnect();
Core API
Everything you need for basic voice conversations.
Methods
| Method | Description |
|--------|-------------|
| connect() | Establish connection |
| disconnect() | Close connection |
| startSession(config) | Start conversation with system prompt |
| endSession() | End conversation |
| sendAudioChunk(data) | Send PCM16 audio (16kHz) |
Events
| Event | Description | Action Required |
|-------|-------------|-----------------|
| audio | AI's audio output | Play audio (PCM16 — check sampleRate) |
| turnComplete | AI finished speaking | Ready for next input |
| interrupted | User barged in | Clear audio buffer! |
| error | Error occurred | Handle/log error |
⚠️ Critical: Handle interrupted
When the user speaks while AI is responding, you must clear your audio buffer:
client.on('interrupted', () => {
audioPlayer.clear(); // Stop buffered audio immediately
audioPlayer.stop();
});
Without this, 2-3 seconds of buffered audio continues playing after the user interrupts.
Audio Format
| Direction | Format | Sample Rate |
|-----------|--------|-------------|
| Input (mic) | PCM16 | 16,000 Hz |
| Output (AI) — Live mode | PCM16 | 24,000 Hz |
| Output (AI) — Composed mode | PCM16 | 16,000 Hz |
Important: The audio event includes a sampleRate field. Always use it to configure your audio decoder rather than hardcoding a rate.
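The audioPlayer used throughout these examples is your own app code, not part of the SDK. A minimal browser playback sketch (illustrative, under assumed wiring): it creates the AudioContext at the event's sample rate, schedules PCM16 chunks back-to-back, and exposes clear() for the interrupted handler.
class PcmPlayer {
  private ctx: AudioContext | null = null;
  private nextTime = 0;
  private playing: AudioBufferSourceNode[] = [];

  queue(pcm16: Uint8Array, sampleRate: number) {
    // Recreate the context when the rate changes (24 kHz Live, 16 kHz Composed).
    if (!this.ctx || this.ctx.sampleRate !== sampleRate) {
      void this.ctx?.close();
      this.ctx = new AudioContext({ sampleRate });
      this.nextTime = 0;
    }
    // PCM16 little-endian bytes -> Float32 samples in [-1, 1].
    const int16 = new Int16Array(pcm16.slice().buffer);
    const float32 = Float32Array.from(int16, (s) => s / 32768);
    const buffer = this.ctx.createBuffer(1, float32.length, sampleRate);
    buffer.copyToChannel(float32, 0);
    const src = this.ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(this.ctx.destination);
    // Schedule each chunk immediately after the previous one.
    this.nextTime = Math.max(this.nextTime, this.ctx.currentTime);
    src.start(this.nextTime);
    this.nextTime += buffer.duration;
    this.playing.push(src);
    src.onended = () => { this.playing = this.playing.filter((s) => s !== src); };
  }

  clear() {
    // Drop everything queued or playing (call this from the interrupted handler).
    for (const src of this.playing) { try { src.stop(); } catch { /* already stopped */ } }
    this.playing = [];
    this.nextTime = 0;
  }
}
Pass each audio event's sampleRate into queue() from your setAudioHandler callback, and call clear() from the interrupted handler as shown above.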
Configuration
const client = new LiveSpeechClient({
region: 'ap-northeast-2', // Required
apiKey: 'your-api-key', // Required
});
await client.startSession({
prePrompt: 'You are a helpful assistant.',
language: 'ko-KR', // Optional: ko-KR, en-US, ja-JP, etc.
});
Composed Mode
Use composed mode for higher accuracy with slightly more latency. It runs a separate STT → LLM → TTS pipeline instead of direct audio-to-audio.
await client.startSession({
prePrompt: 'You are a helpful assistant.',
pipelineMode: 'composed',
language: 'ko-KR',
});
client.audioStart();
// Send/receive audio the same way as live mode
Live vs Composed
| | Live | Composed |
|---|---|---|
| Latency | ~300ms | ~1-2s |
| Pipeline | Direct audio-to-audio (Gemini Live) | STT → LLM → TTS |
| Accuracy | Good | Higher |
| aiSpeaksFirst | ✅ Supported | ❌ Not supported |
| tools (function calling) | ✅ Supported | ❌ Not supported |
| Output sample rate | 24,000 Hz | 16,000 Hz |
| Barge-in | Automatic (Gemini VAD) | Automatic |
Note: All other SDK methods and events work identically in both modes. The only code change is adding pipelineMode: 'composed' to your session config.
Event Correlation (turnId)
In Composed mode, all events include a turnId field (monotonic counter starting from 0). Events sharing the same turnId belong to the same speech turn — use this to match userTranscript, response, audio, and turnComplete events together. In Live mode, turnId is not present.
client.on('userTranscript', (e) => {
console.log(`Turn ${e.turnId}: User said '${e.text}'`);
});
client.on('response', (e) => {
if (e.isFinal) console.log(`Turn ${e.turnId}: AI responded '${e.text}'`);
});
client.on('turnComplete', (e) => {
console.log(`Turn ${e.turnId} complete`);
});
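If you keep per-turn state, a small accumulator keyed by turnId works well. A sketch (the record shape is illustrative, not an SDK type):
type TurnRecord = { user?: string; ai?: string };
const turns = new Map<number, TurnRecord>();

client.on('userTranscript', (e) => {
  turns.set(e.turnId, { ...turns.get(e.turnId), user: e.text });
});
client.on('response', (e) => {
  if (e.isFinal) turns.set(e.turnId, { ...turns.get(e.turnId), ai: e.text });
});
client.on('turnComplete', (e) => {
  console.log('Turn record:', turns.get(e.turnId)); // { user, ai } for this turn
});
Advanced API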
Optional features for power users.
Additional Methods
| Method | Description |
|--------|-------------|
| audioStart() / audioEnd() | Manual audio stream control |
| interrupt() | Explicitly stop AI response (for Stop button) |
| sendSystemMessage(msg) | Inject context during conversation |
| sendToolResponse(id, result) | Reply to function calls |
| updateUserId(userId) | Migrate guest to authenticated user |
Additional Events
| Event | Description |
|-------|-------------|
| connected / disconnected | Connection lifecycle |
| sessionStarted / sessionEnded | Session lifecycle |
| ready | Session ready for audio |
| userTranscript | User's speech transcribed |
| response | AI's response text |
| toolCall | AI wants to call a function |
| reconnecting | Auto-reconnection attempt |
| userIdUpdated | Guest-to-user migration complete |
| sessionWarning | Session nearing duration limit |
| sessionGoodbye | Session about to end |
Explicit Interrupt (Stop Button)
For UI "Stop" buttons or programmatic control:
// User clicks Stop button
client.interrupt();
Note: Voice barge-in works automatically via Gemini's VAD. This method is for explicit control.
System Messages
Inject text context during live sessions (game events, app state, etc.):
// AI responds immediately
client.sendSystemMessage("User completed level 5. Congratulate them!");
// Context only, no response
client.sendSystemMessage({ text: "User is browsing", triggerResponse: false });
Requires an active live session (audioStart() called). Max 500 characters.
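Given the 500-character cap, a small call-site guard can help. An illustrative wrapper (not an SDK API):
// Illustrative helper: enforce the 500-character limit before sending.
function sendContext(text: string, triggerResponse = false) {
  client.sendSystemMessage({ text: text.slice(0, 500), triggerResponse });
}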
Function Calling (Tool Use)
Let AI call functions in your app:
1. Define Tools
const tools = [{
name: 'get_price',
description: 'Gets product price by ID',
parameters: {
type: 'OBJECT',
properties: { productId: { type: 'string' } },
required: ['productId']
}
}];
await client.startSession({
prePrompt: 'You are helpful.',
tools,
});
2. Handle toolCall Events
client.on('toolCall', (event) => {
if (event.name === 'get_price') {
const price = lookupPrice(event.args.productId);
client.sendToolResponse(event.id, { price });
}
});
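Because sendToolResponse takes the call's id, the handler can await asynchronous work before replying. A sketch with a hypothetical fetchPriceFromApi helper:
client.on('toolCall', async (event) => {
  if (event.name === 'get_price') {
    // fetchPriceFromApi is a hypothetical app-side helper, not part of the SDK.
    const price = await fetchPriceFromApi(event.args.productId);
    client.sendToolResponse(event.id, { price });
  }
});
Conversation Memory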
Enable persistent memory across sessions:
const client = new LiveSpeechClient({
region: 'ap-northeast-2',
apiKey: 'your-api-key',
userId: 'user-123', // Enables memory
});
| Mode | Memory |
|------|--------|
| With userId | Permanent (entities, summaries) |
| Without userId | Session only (guest) |
Guest-to-User Migration
// User logs in during session
await client.updateUserId('authenticated-user-123');
// Listen for confirmation
client.on('userIdUpdated', (event) => {
console.log(`Migrated ${event.migratedMessages} messages`);
});
AI Speaks First
AI initiates the conversation:
await client.startSession({
prePrompt: 'Greet the customer warmly.',
aiSpeaksFirst: true,
});
client.audioStart(); // AI speaks immediately
Session Options
| Option | Default | Description |
|--------|---------|-------------|
| prePrompt | - | System prompt |
| language | 'en-US' | Language code |
| outputLanguage | - | TTS voice language override (composed mode only) |
| pipelineMode | 'live' | 'live' (~300ms) or 'composed' (~1-2s) |
| aiSpeaksFirst | false | AI initiates (live mode only) |
| allowHarmCategory | false | Disable safety filters |
| tools | [] | Function definitions |
| sessionDuration | - | Enables session duration limits when provided |
Notes
- Duration checks are disabled by default. They activate only when sessionDuration is provided.
- If only sessionDuration.maxSeconds is provided, enableWarning / enableGoodbye default to false in the SDK.
- Server limits take precedence in production.
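For example, a hedged sketch of opting into duration limits and pairing them with the sessionWarning / sessionGoodbye events (the exact shape of sessionDuration is an assumption based on the notes above):
await client.startSession({
  prePrompt: 'You are a helpful assistant.',
  // Field names taken from the notes above; treat the exact shape as an assumption.
  sessionDuration: {
    maxSeconds: 600,
    enableWarning: true,   // defaults to false if omitted
    enableGoodbye: true,   // defaults to false if omitted
  },
});

client.on('sessionWarning', () => {
  console.log('Session nearing its duration limit'); // e.g. show a countdown in the UI
});
client.on('sessionGoodbye', () => {
  console.log('Session about to end');
});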
Browser Example
import { LiveSpeechClient, float32ToInt16, int16ToUint8 } from '@drawdream/livespeech';
// Capture microphone
const stream = await navigator.mediaDevices.getUserMedia({
audio: { sampleRate: 16000, channelCount: 1 }
});
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
const float32 = e.inputBuffer.getChannelData(0);
const int16 = float32ToInt16(float32);
const pcm = int16ToUint8(int16);
client.sendAudioChunk(pcm);
};
source.connect(processor);
processor.connect(audioContext.destination);
Audio Utilities
import { float32ToInt16, int16ToUint8, wrapPcmInWav } from '@drawdream/livespeech';
const int16 = float32ToInt16(float32Data);
const bytes = int16ToUint8(int16);
const wav = wrapPcmInWav(bytes, { sampleRate: 16000, channels: 1, bitDepth: 16 });
Error Handling
client.on('error', (event) => {
switch (event.code) {
case 'authentication_failed': console.error('Invalid API key'); break;
case 'connection_timeout': console.error('Timed out'); break;
default: console.error(`Error: ${event.message}`);
}
});
client.on('reconnecting', (event) => {
console.log(`Reconnecting ${event.attempt}/${event.maxAttempts}`);
});
Regions
| Region | Code |
|--------|------|
| Seoul (Korea) | ap-northeast-2 |
License
MIT
