@sipgate/ai-flow-sdk
Official SDK for sipgate AI Flow - A powerful TypeScript SDK for building AI-powered voice assistants with real-time speech processing capabilities.
Table of Contents
- Installation
- Quick Start
- Core Concepts
- API Reference
- Integration Guides
- Outbound Calls
- Working Without the Assistant Wrapper
Installation
npm install @sipgate/ai-flow-sdk
# or
yarn add @sipgate/ai-flow-sdk
# or
pnpm add @sipgate/ai-flow-sdk
Requirements:
- Node.js >= 22.0.0
- TypeScript 5.x recommended
Quick Start
Basic Assistant
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const assistant = AiFlowAssistant.create({
debug: true,
onSessionStart: async (event) => {
console.log(`Session started for ${event.session.phone_number}`);
return "Hello! How can I help you today?";
},
onUserSpeak: async (event) => {
const userText = event.text;
console.log(`User said: ${userText}`);
// Process user input and return response
return `You said: ${userText}`;
},
onSessionEnd: async (event) => {
console.log(`Session ${event.session.id} ended`);
},
onUserBargeIn: async (event) => {
console.log(`User interrupted with: ${event.text}`);
return "I'm listening, please continue.";
},
});
Core Concepts
Event-Driven Architecture
The SDK uses an event-driven model where your assistant responds to events from the AI Flow service:
- Session Start - Called when a new call session begins
- User Speak - Called when the user says something (after speech-to-text)
- User Barge In - Called when the user interrupts the assistant
- Assistant Speak - Called when your assistant starts speaking (this event may be omitted by some TTS models)
- Assistant Speech Ended - Called when the assistant's speech playback ends
- User Input Timeout - Called when no user speech is detected within the configured timeout period
- Session End - Called when the call ends
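The event list above maps onto a dispatch over the event's `type` field. Below is a minimal sketch using simplified local types (not the SDK's exported ones) that also shows how a plain-string handler return can be normalized into a speak action, which is what the SDK's string-to-speak convenience amounts to:

```typescript
// Simplified local mirrors of the SDK shapes; illustration only, not the SDK's types.
interface SessionInfo { id: string; }

type FlowEvent =
  | { type: "session_start"; session: SessionInfo }
  | { type: "user_speak"; session: SessionInfo; text: string; barged_in?: boolean }
  | { type: "session_end"; session: SessionInfo };

interface SpeakAction { type: "speak"; session_id: string; text: string; }

// Handlers may return a plain string; normalize it to a speak action.
function normalize(result: string | SpeakAction | null, sessionId: string): SpeakAction | null {
  if (result === null) return null;
  if (typeof result === "string") return { type: "speak", session_id: sessionId, text: result };
  return result;
}

function route(event: FlowEvent): SpeakAction | null {
  switch (event.type) {
    case "session_start":
      return normalize("Hello! How can I help you today?", event.session.id);
    case "user_speak":
      return normalize(`You said: ${event.text}`, event.session.id);
    default:
      return null; // session_end needs no spoken response
  }
}
```

The wrapper hides this dispatch behind the handler options, but the same shape applies when working without it (see "Working Without the Assistant Wrapper").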
Response Types
Event handlers can return these response types:
// 1. Simple string (automatically converted to speak action)
return "Hello, how can I help?";
// 2. Action object (for advanced control)
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Hello!",
barge_in: { strategy: BargeInStrategy.MINIMUM_CHARACTERS },
};
// 3. Array of actions (executed in sequence)
return [
{ type: AiFlowActionType.BARGE_IN, session_id: event.session.id },
{ type: AiFlowActionType.SPEAK, session_id: event.session.id, text: "Sorry, let me correct that." },
];
// 4. null/undefined (no response needed)
return null;
API Reference
AiFlowAssistant
The main class for creating AI voice assistants.
AiFlowAssistant.create(options)
Creates a new assistant instance.
Options:
interface AiFlowAssistantOptions {
// Optional API key for authentication
apiKey?: string;
// Enable debug logging
debug?: boolean;
// Event handlers
onSessionStart?: (
event: AiFlowApiEventSessionStart
) => Promise<InvocationResponseType>;
onUserSpeechStarted?: (
event: AiFlowEventUserSpeechStarted
) => Promise<void>; // WebSocket only — no return value expected
onUserSpeak?: (
event: AiFlowApiEventUserSpeak
) => Promise<InvocationResponseType>;
onAssistantSpeak?: (
event: AiFlowApiEventAssistantSpeak
) => Promise<InvocationResponseType>;
onAssistantSpeechEnded?: (
event: AiFlowEventAssistantSpeechEnded
) => Promise<InvocationResponseType>;
onUserInputTimeout?: (
event: AiFlowEventUserInputTimeout
) => Promise<InvocationResponseType>;
onDtmfReceived?: (
event: AiFlowEventDtmfReceived
) => Promise<InvocationResponseType>;
onSessionEnd?: (
event: AiFlowApiEventSessionEnd
) => Promise<InvocationResponseType>;
onUserBargeIn?: (
event: AiFlowEventUserBargeIn
) => Promise<InvocationResponseType>;
}
type InvocationResponseType = AiFlowApiAction | AiFlowApiAction[] | string | null | undefined;
Instance Methods
assistant.express()
Returns an Express.js middleware function for handling webhook requests.
app.post("/webhook", assistant.express());
assistant.ws(websocket)
Returns a WebSocket message handler.
wss.on("connection", (ws) => {
ws.on("message", assistant.ws(ws));
});
assistant.onEvent(event)
Manually process an event (useful for custom integrations).
const action = await assistant.onEvent(event);
Event Types
SessionStart Event
Triggered when a new call session begins.
interface AiFlowApiEventSessionStart {
type: "session_start";
session: {
id: string; // UUID of the session
account_id: string; // Account identifier
phone_number: string; // Phone number for this flow session
direction?: "inbound" | "outbound"; // Call direction
from_phone_number: string; // Phone number of the caller
to_phone_number: string; // Phone number of the callee
};
}
Example:
onSessionStart: async (event) => {
// Log session details
console.log(`${event.session.direction} call from ${event.session.from_phone_number} to ${event.session.to_phone_number}`);
// Return greeting
return "Welcome to our service!";
};
UserSpeechStarted Event
Triggered when the user's speech is first detected — before the full transcript is available. Uses Voice Activity Detection (VAD), typically 20–120 ms after speech onset.
WebSocket only — not delivered to HTTP webhook handlers.
interface AiFlowEventUserSpeechStarted {
type: "user_speech_started";
session: SessionInfo;
}
Fires at most once per speech turn; resets automatically after the corresponding user_speak event. No return value is expected.
Example:
onUserSpeechStarted: async (event) => {
console.log("User started speaking, session:", event.session.id);
// No return value needed
},
UserSpeak Event
Triggered when the user speaks and speech-to-text completes.
interface AiFlowApiEventUserSpeak {
type: "user_speak";
text: string; // Recognized speech text
session: {
id: string;
account_id: string;
phone_number: string;
};
}
Example:
onUserSpeak: async (event) => {
const intent = analyzeIntent(event.text);
if (intent === "help") {
return "I can help you with billing, support, or sales.";
}
return processUserInput(event.text);
};
AssistantSpeak Event
Triggered when the assistant starts speaking. This event may be omitted by some text-to-speech models.
interface AiFlowApiEventAssistantSpeak {
type: "assistant_speak";
text?: string; // Text that was spoken
ssml?: string; // SSML that was used (if applicable)
duration_ms: number; // Duration of speech in milliseconds
speech_started_at: number; // Unix timestamp (ms) when speech started
session: SessionInfo;
}
Example:
onAssistantSpeak: async (event) => {
console.log(`Spoke for ${event.duration_ms}ms`);
// Track conversation metrics
trackMetrics({
sessionId: event.session.id,
duration: event.duration_ms,
text: event.text,
});
};
AssistantSpeechEnded Event
Triggered after the assistant finishes speaking.
interface AiFlowEventAssistantSpeechEnded {
type: "assistant_speech_ended";
session: SessionInfo;
}
Example:
onAssistantSpeechEnded: async (event) => {
console.log(`Finished speaking for session ${event.session.id}`);
// Hangup if needed
};
UserInputTimeout Event
Triggered when no user speech is detected within the configured timeout period after the assistant finishes speaking.
interface AiFlowEventUserInputTimeout {
type: "user_input_timeout";
session: SessionInfo;
}
When Triggered:
- A speak action includes a user_input_timeout_seconds field
- The assistant finishes speaking (the assistant_speech_ended event fires)
- The specified timeout period elapses without any user speech detected
Example:
onUserInputTimeout: async (event) => {
console.log(`User input timeout for session ${event.session.id}`);
// Retry the question
return {
type: "speak",
session_id: event.session.id,
text: "Are you still there? Please say yes or no.",
user_input_timeout_seconds: 5
};
};
Configuring Timeout:
Set user_input_timeout_seconds in the speak action:
onSessionStart: async (event) => {
return {
type: "speak",
session_id: event.session.id,
text: "What is your account number?",
user_input_timeout_seconds: 5 // Wait 5 seconds for response
};
};
Common Use Cases:
// Hangup after multiple timeouts
const timeoutCounts = new Map<string, number>();
onUserInputTimeout: async (event) => {
const sessionId = event.session.id;
const count = (timeoutCounts.get(sessionId) || 0) + 1;
timeoutCounts.set(sessionId, count);
if (count >= 3) {
return {
type: "hangup",
session_id: sessionId
};
}
return {
type: "speak",
session_id: sessionId,
text: `I didn't hear anything. Please respond. Attempt ${count} of 3.`,
user_input_timeout_seconds: 5
};
};
DtmfReceived Event
Triggered when the user presses a key on their phone keypad.
interface AiFlowEventDtmfReceived {
type: "dtmf_received";
digit: string; // The key pressed: "0"–"9", "*", or "#"
session: SessionInfo;
}
Example:
onDtmfReceived: async (event) => {
console.log(`User pressed: ${event.digit}`);
if (event.digit === '1') {
return {
type: 'transfer',
session_id: event.session.id,
transfer_to: '49211100200'
};
}
if (event.digit === '#') {
return { type: 'hangup', session_id: event.session.id };
}
};
Notes:
- All standard DTMF tones are supported: 0–9, *, #
- Each key press triggers a separate event
- Useful for IVR menus, PIN entry, or confirmation flows
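Because each key press arrives as a separate event, PIN entry needs a small amount of per-session state. Here is one possible sketch; the Map-based buffer and the convention of `#` to confirm and `*` to clear are assumptions for illustration, not SDK behavior:

```typescript
// Collect DTMF digits per session until the caller confirms with '#'.
// Returns the finished PIN once '#' arrives, otherwise null.
const pinBuffers = new Map<string, string>();

function collectPinDigit(sessionId: string, digit: string): string | null {
  if (digit === "#") {
    const pin = pinBuffers.get(sessionId) ?? "";
    pinBuffers.delete(sessionId); // reset for the next entry attempt
    return pin;
  }
  if (digit === "*") {
    pinBuffers.delete(sessionId); // '*' clears the current entry
    return null;
  }
  pinBuffers.set(sessionId, (pinBuffers.get(sessionId) ?? "") + digit);
  return null;
}
```

Inside an onDtmfReceived handler, a non-null return value would be the moment to verify the PIN and respond; remember to clear the buffer in onSessionEnd as well.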
SessionEnd Event
Triggered when the call session ends.
interface AiFlowApiEventSessionEnd {
type: "session_end";
session: SessionInfo;
}
Example:
onSessionEnd: async (event) => {
// Save conversation history
await saveConversation(event.session.id);
// Send analytics
await trackSessionEnd(event.session);
};
Barge-In Detection
User interruptions are detected via the barged_in flag in user_speak events:
interface AiFlowEventUserSpeak {
type: "user_speak";
text: string; // Recognized speech text
barged_in?: boolean; // true if user interrupted assistant
session: SessionInfo;
}
When barged_in is true, the user interrupted the assistant mid-speech. The SDK automatically routes these to your onUserBargeIn handler.
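The routing rule can be pictured as a tiny dispatcher: events with barged_in set go to the barge-in handler, all others to the regular one. The types and dispatcher below are a simplified illustration, not the SDK's internals:

```typescript
// Simplified user_speak event shape for illustration.
interface UserSpeakEvent {
  type: "user_speak";
  text: string;
  barged_in?: boolean;
  session: { id: string };
}

interface Handlers {
  onUserSpeak: (e: UserSpeakEvent) => string;
  onUserBargeIn: (e: UserSpeakEvent) => string;
}

// Mirrors the documented rule: barged_in === true routes to onUserBargeIn.
function dispatchUserSpeak(event: UserSpeakEvent, handlers: Handlers): string {
  return event.barged_in ? handlers.onUserBargeIn(event) : handlers.onUserSpeak(event);
}
```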
Action Types
Actions are responses that tell the AI Flow service what to do next.
Speak Action
Speaks text or SSML to the user.
interface AiFlowActionSpeak {
type: "speak";
session_id: string;
// Either text OR ssml (not both)
text?: string; // Plain text to speak
ssml?: string; // SSML markup for advanced control
// Optional configurations
tts?: TtsConfig; // TTS provider settings
barge_in?: BargeInConfig; // Barge-in behavior
}
Examples:
// Simple text
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Hello, how can I help you?",
};
// With SSML
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
ssml: `
<speak version="1.0" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<prosody rate="slow">Please listen carefully.</prosody>
<break time="500ms"/>
Your account balance is <say-as interpret-as="currency">$42.50</say-as>
</voice>
</speak>
`,
};
// With custom TTS provider
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Hello in a different voice",
tts: {
provider: TtsProvider.AZURE,
language: "en-US",
voice: "en-US-JennyNeural",
},
};
Audio Action
Plays pre-recorded audio to the user.
interface AiFlowActionAudio {
type: "audio";
session_id: string;
audio: string; // Base64 encoded WAV (8kHz, mono, 16-bit)
barge_in?: BargeInConfig;
}
Example:
// Play hold music or pre-recorded message
return {
type: AiFlowActionType.AUDIO,
session_id: event.session.id,
audio: base64EncodedWavData,
barge_in: {
strategy: BargeInStrategy.MINIMUM_CHARACTERS,
minimum_characters: 3,
},
};
Hangup Action
Ends the call.
interface AiFlowActionHangup {
type: "hangup";
session_id: string;
}
Example:
onUserSpeak: async (event) => {
if (event.text.toLowerCase().includes("goodbye")) {
return {
type: AiFlowActionType.HANGUP,
session_id: event.session.id,
};
}
};
Transfer Action
Transfers the call to another phone number. Optionally, pass a timeout to
enable transfer fallback — if the target doesn't answer in time (or
rejects/hangs up), the service re-emits session_start with the same
session.id so your agent can handle the call again.
interface AiFlowActionTransfer {
type: "transfer";
session_id: string;
target_phone_number: string; // E.164 format without leading + recommended
caller_id_name: string;
caller_id_number: string;
/**
* Optional transfer timeout in seconds (5–120). When set, a failed transfer
* returns the call to the agent via a new `session_start` event instead of
* ending the call. Omit for legacy "hang up on failure" behavior.
*/
timeout?: number;
}
Example:
// Transfer to sales department with fallback
return {
type: AiFlowActionType.TRANSFER,
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Sales Department",
caller_id_number: "1234567890",
timeout: 30,
};
BargeIn Action
Manually triggers barge-in (interrupts current playback).
interface AiFlowActionBargeIn {
type: "barge_in";
session_id: string;
}
MixAudio Action
Plays a looping background sound (e.g. ambient noise) under the call for the rest of the session — both during the assistant's TTS turns and during silences. Send again with stop: true to remove the loop. The audio must be a base64-encoded WAV (16 kHz, mono, 16-bit PCM) — same format as the audio action. The loop is dropped automatically when the session ends.
interface AiFlowActionMixAudio {
type: "mix_audio";
session_id: string;
/** Base64 WAV (16 kHz, mono, 16-bit PCM). Required unless stop=true. */
audio?: string;
/** Mix volume 0.0–1.0. Defaults to 0.5. */
volume?: number;
/** When true, removes the active background loop. */
stop?: boolean;
}
Example — start an ambient loop:
return {
type: AiFlowActionType.MIX_AUDIO,
session_id: event.session.id,
audio: base64WavStation,
volume: 0.3,
};
Example — stop the ambient loop:
return {
type: AiFlowActionType.MIX_AUDIO,
session_id: event.session.id,
stop: true,
};
TTS Providers
Configure text-to-speech providers for different voices and languages. The SDK supports both Azure Cognitive Services and ElevenLabs for high-quality voice synthesis.
Azure Cognitive Services
Azure provides a wide range of neural voices across many languages and regions.
interface TtsProviderConfigAzure {
provider: TtsProvider.AZURE;
language?: string; // BCP-47 format (e.g., "en-US", "de-DE")
voice?: string; // Voice name (e.g., "en-US-JennyNeural")
}
Examples:
// English (US) - Female
provider: {
provider: TtsProvider.AZURE,
language: "en-US",
voice: "en-US-JennyNeural"
}
// English (GB) - Female
provider: {
provider: TtsProvider.AZURE,
language: "en-GB",
voice: "en-GB-SoniaNeural"
}
// German - Male
provider: {
provider: TtsProvider.AZURE,
language: "de-DE",
voice: "de-DE-ConradNeural"
}
// Spanish - Female
provider: {
provider: TtsProvider.AZURE,
language: "es-ES",
voice: "es-ES-ElviraNeural"
}
Popular Azure Voices:
| Language | Voice Name         | Gender | Description            |
| -------- | ------------------ | ------ | ---------------------- |
| en-US    | en-US-JennyNeural  | Female | Friendly, professional |
| en-US    | en-US-GuyNeural    | Male   | Clear, neutral         |
| en-GB    | en-GB-SoniaNeural  | Female | British, professional  |
| en-GB    | en-GB-RyanNeural   | Male   | British, friendly      |
| de-DE    | de-DE-KatjaNeural  | Female | Professional, clear    |
| de-DE    | de-DE-ConradNeural | Male   | Deep, authoritative    |
Full Voice List: See the Azure TTS documentation for the complete list of 400+ voices in 140+ languages.
ElevenLabs
ElevenLabs provides ultra-realistic AI voices optimized for conversational use cases.
interface TtsProviderConfigElevenLabs {
provider: TtsProvider.ELEVEN_LABS;
voice?: string; // Voice ID (e.g., "21m00Tcm4TlvDq8ikWAM")
}
Examples:
// Using the default voice (sipgate voice — used when voice is omitted)
provider: {
provider: TtsProvider.ELEVEN_LABS
}
// Using a specific voice ID
provider: {
provider: TtsProvider.ELEVEN_LABS,
voice: "21m00Tcm4TlvDq8ikWAM" // Rachel
}
Available ElevenLabs Voices:
The first entry is the default voice used when no voice is specified.
| Voice Name      | ID                   | Description                                                               | Verified Locales                   |
|-----------------|----------------------|---------------------------------------------------------------------------|------------------------------------|
| sipgate voice ⭐ | dSu12TX3MEDQXAarG4s6 | Clean male voice used by sipgate for system announcements. Default.       | de-DE                              |
| Rachel          | 21m00Tcm4TlvDq8ikWAM | Matter-of-fact, personable woman. Great for conversational use cases.     |                                    |
| Sarah           | EXAVITQu4vr4xnSDxMaL | Young adult woman with a confident and warm, mature quality.              | en-US, fr-FR, cmn-CN, hi-IN        |
| Laura           | FGY2WhTYpPnrIDTdsKH5 | Young adult female delivers sunny enthusiasm with quirky attitude.        | en-US, fr-FR, cmn-CN, de-DE        |
| George          | JBFqnCBsd6RMkjVDRZzb | Warm resonance that instantly captivates listeners.                       | en-GB, fr-FR, ja-JP, cs-CZ         |
| Thomas          | GBv7mTt0atIp3Br8iCZE | Soft and subdued male, optimal for narrations or meditations.             | en-US                              |
| Roger           | CwhRBWXzGAHq8TQ4Fs17 | Easy going and perfect for casual conversations.                          | en-US, fr-FR, de-DE, nl-NL         |
| Eric            | cjVigY5qzO86Huf0OWal | Smooth tenor pitch from a man in his 40s - perfect for agentic use cases. | en-US, fr-FR, de-DE, sk-SK         |
| Brian           | nPczCjzI2devNBz1zQrb | Middle-aged man with resonant and comforting tone.                        | en-US, cmn-CN, de-DE, nl-NL        |
| Jessica         | cgSgspJ2msm6clMCkdW9 | Young and playful American female, perfect for trendy content.            | en-US, fr-FR, ja-JP, cmn-CN, de-DE |
| Liam            | TX3LPaxmHKxFdv7VOQHJ | Young adult with energy and warmth - suitable for reels and shorts.       | en-US, de-DE, cs-CZ, pl-PL, tr-TR  |
| Alice           | Xb7hH8MSUJpSbSDYk0k2 | Clear and engaging, friendly British woman suitable for e-learning.       | en-GB, it-IT, fr-FR, ja-JP, pl-PL  |
| Daniel          | onwK4e9ZLuTAKqWW03F9 | Strong voice perfect for professional broadcast or news.                  | en-GB, de-DE, tr-TR                |
| Lily            | pFZP5JQG7iQjIQuC4Bku | Velvety British female delivers news with warmth and clarity.             | it-IT, de-DE, cmn-CN, cs-CZ, nl-NL |
| River           | SAz9YHcvj6GT2YYXdXww | Relaxed, neutral voice ready for narrations or conversational projects.   | en-US, it-IT, fr-FR, cmn-CN        |
| Charlie         | IKne3meq5aSn9XLyUdCD | Young Australian male with confident and energetic voice.                 | en-AU, cmn-CN, fil-PH              |
| Aria            | 9BWtsMINqrJLrRacOk9x | Middle-aged female with African-American accent. Calm with hint of rasp.  | en-US, fr-FR, cmn-CN, tr-TR        |
| Matilda         | XrExE9yKIg1WjnnlVkGX | Professional woman with pleasing alto pitch. Suitable for many use cases. | en-US, it-IT, fr-FR, de-DE         |
| Will            | bIHbv24MWmeRgasZH58o | Conversational and laid back.                                             | en-US, fr-FR, de-DE, cmn-CN, cs-CZ |
| Chris           | iP95p4xoKVk53GoZ742B | Natural and real, down-to-earth voice great across many use-cases.        | en-US, fr-FR, sv-SE, hi-IN         |
| Bill            | pqHfZKP75CvOlQylNhV4 | Friendly and comforting voice ready to narrate stories.                   | en-US, fr-FR, cmn-CN, de-DE, cs-CZ |
Note: 50+ voices available in total. The SDK includes full TypeScript type definitions for all voice IDs and names.
Choosing a TTS Provider
Use Azure when:
- You need support for many languages (140+ languages available)
- You want consistent quality across all locales
- You need specific regional accents or dialects
- Budget is a primary concern
Use ElevenLabs when:
- You need the most natural, human-like voices
- Conversational quality is critical (phone calls, virtual assistants)
- You're primarily working with English or common European languages
- You want voices with distinct personalities
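This guidance can be encoded as a small locale-based helper: prefer ElevenLabs for a short list of well-supported locales and fall back to Azure everywhere else. The locale set below is an assumption to tune per project, not an official support matrix:

```typescript
type TtsConfig =
  | { provider: "azure"; language: string; voice?: string }
  | { provider: "eleven_labs"; voice?: string };

// Locales we consider well covered by ElevenLabs; an assumption, adjust as needed.
const ELEVEN_LABS_LOCALES = new Set(["en-US", "en-GB", "de-DE", "fr-FR"]);

function pickTts(locale: string): TtsConfig {
  if (ELEVEN_LABS_LOCALES.has(locale)) {
    // Omitting `voice` selects the default sipgate voice.
    return { provider: "eleven_labs" };
  }
  // Azure covers 140+ languages, so it is the safe fallback.
  return { provider: "azure", language: locale };
}
```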
Barge-In Configuration
Control how users can interrupt the assistant while speaking.
interface BargeInConfig {
strategy: BargeInStrategy;
minimum_characters?: number; // Default: 3
allow_after_ms?: number; // Delay before allowing interruption
}
Strategies
BargeInStrategy.NONE
Disables barge-in completely. Audio plays fully without interruption.
barge_in: {
strategy: BargeInStrategy.NONE;
}
Use cases:
- Critical information that must be heard
- Legal disclaimers
- Emergency instructions
BargeInStrategy.MANUAL
Allows manual barge-in via API only (no automatic detection).
barge_in: {
strategy: BargeInStrategy.MANUAL;
}
Use cases:
- Custom interruption logic
- Button-triggered interruption
- External event-based interruption
BargeInStrategy.MINIMUM_CHARACTERS
Automatically detects barge-in when user speech exceeds character threshold.
barge_in: {
strategy: BargeInStrategy.MINIMUM_CHARACTERS,
minimum_characters: 5, // Trigger after 5 characters
allow_after_ms: 500 // Wait 500ms before allowing interruption
}
Use cases:
- Natural conversation flow
- Customer service scenarios
- Interactive voice menus
Example with protection period:
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Your account number is 1234567890. Please write this down.",
barge_in: {
strategy: BargeInStrategy.MINIMUM_CHARACTERS,
minimum_characters: 10, // Require substantial speech
allow_after_ms: 2000, // Protect first 2 seconds
},
};
BargeInStrategy.IMMEDIATE ⚡ NEW
Most responsive option - Interrupts immediately when user starts speaking using Voice Activity Detection (VAD).
barge_in: {
strategy: BargeInStrategy.IMMEDIATE,
allow_after_ms: 500 // Optional: protect first 500ms
}
How it works:
- Azure/Deepgram: Uses Voice Activity Detection (triggers before any text is recognized)
- ElevenLabs: Uses first partial transcript
- Latency: 20–100 ms (2–4× faster than MINIMUM_CHARACTERS)
- No text required: Interrupts on voice detection, not transcription
Use cases:
- High-priority conversations requiring instant responsiveness
- Natural dialogue where interruptions should feel seamless
- Customer service where quick response matters
- Urgent or time-sensitive interactions
Example:
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "I can help you with billing, support, or sales. What would you like?",
barge_in: {
strategy: BargeInStrategy.IMMEDIATE,
allow_after_ms: 500, // Protect first 500ms from accidental noise
},
};
Comparison with MINIMUM_CHARACTERS:
| Feature | IMMEDIATE | MINIMUM_CHARACTERS |
|---------------------|----------------------|----------------------------------|
| Trigger | Voice Activity (VAD) | Text recognition (3+ characters) |
| Latency | 20-100ms | 50-200ms |
| User Experience | Instant interruption | Slight delay |
| Accuracy | May trigger on noise | More reliable (text-based) |
Best practices:
- Use allow_after_ms: 500–1000 to prevent accidental interruptions at the start
- Test with real users to find the optimal allow_after_ms value
- Consider network latency in production environments
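One way to keep these trade-offs consistent across a bot is a helper that picks a barge-in config per prompt criticality. The tiers and the concrete values below are suggestions derived from the guidance above, not SDK defaults:

```typescript
interface BargeInConfig {
  strategy: "none" | "immediate" | "minimum_characters";
  minimum_characters?: number;
  allow_after_ms?: number;
}

// "legal": must be heard in full; "conversational": instant interruption;
// "menu": require a short confirmed utterance. Tiers are illustrative.
function bargeInFor(kind: "legal" | "conversational" | "menu"): BargeInConfig {
  if (kind === "legal") return { strategy: "none" };
  if (kind === "conversational") return { strategy: "immediate", allow_after_ms: 500 };
  return { strategy: "minimum_characters", minimum_characters: 5, allow_after_ms: 500 };
}
```

A speak action can then take `barge_in: bargeInFor("conversational")`, so tuning happens in one place.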
Integration Guides
Express.js Integration
Complete example with error handling and logging:
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const app = express();
app.use(express.json());
const assistant = AiFlowAssistant.create({
debug: process.env.NODE_ENV !== "production",
onSessionStart: async (event) => {
return "Welcome! How can I help you today?";
},
onUserSpeak: async (event) => {
// Your conversation logic here
return processUserInput(event.text);
},
onSessionEnd: async (event) => {
await cleanupSession(event.session.id);
},
});
// Webhook endpoint
app.post("/webhook", assistant.express());
// Health check
app.get("/health", (req, res) => {
res.json({ status: "ok" });
});
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
console.log(`AI Flow assistant running on port ${PORT}`);
});
WebSocket Integration
import WebSocket from "ws";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const wss = new WebSocket.Server({
port: 8080,
perMessageDeflate: false,
});
const assistant = AiFlowAssistant.create({
onUserSpeak: async (event) => {
return "Hello from WebSocket!";
},
});
wss.on("connection", (ws, req) => {
console.log("New WebSocket connection");
ws.on("message", assistant.ws(ws));
ws.on("error", (error) => {
console.error("WebSocket error:", error);
});
ws.on("close", () => {
console.log("WebSocket connection closed");
});
});
console.log("WebSocket server listening on port 8080");
Advanced Example: Customer Service Bot
A more complete example demonstrating state management and routing:
import { AiFlowAssistant, BargeInStrategy } from "@sipgate/ai-flow-sdk";
import express from "express";
// Session state management
const sessions = new Map<string, { state: string; data: any }>();
const assistant = AiFlowAssistant.create({
debug: true,
onSessionStart: async (event) => {
// Initialize session state
sessions.set(event.session.id, {
state: "greeting",
data: { attempts: 0 },
});
return {
type: "speak",
session_id: event.session.id,
text: "Welcome to customer support. How can I help you today? You can ask about billing, technical support, or sales.",
barge_in: {
strategy: BargeInStrategy.MINIMUM_CHARACTERS,
minimum_characters: 3,
},
};
},
onUserSpeak: async (event) => {
const session = sessions.get(event.session.id);
if (!session) return null;
const text = event.text.toLowerCase();
// Intent routing
if (text.includes("billing") || text.includes("invoice")) {
return {
type: "transfer",
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Billing Department",
caller_id_number: "1234567890",
};
}
if (text.includes("goodbye") || text.includes("bye")) {
return {
type: "speak",
session_id: event.session.id,
text: "Thank you for calling. Have a great day!",
barge_in: { strategy: BargeInStrategy.NONE }, // Don't allow interruption
};
}
if (text.includes("technical") || text.includes("support")) {
session.state = "technical_support";
return "I'll connect you with our technical support team. Please describe your issue.";
}
// Default response
session.data.attempts++;
if (session.data.attempts > 2) {
return "I'm having trouble understanding. Let me transfer you to a representative.";
}
return "I can help with billing, technical support, or sales. Which would you like?";
},
onUserBargeIn: async (event) => {
console.log(`User interrupted: ${event.text}`);
return "Yes, I'm listening.";
},
onSessionEnd: async (event) => {
// Cleanup session state
sessions.delete(event.session.id);
console.log(`Session ${event.session.id} ended`);
},
});
const app = express();
app.use(express.json());
app.post("/webhook", assistant.express());
app.listen(3000, () => {
console.log("Customer service bot running on port 3000");
});
Working Without the Assistant Wrapper
If you prefer to work directly with the SDK's event and action system without using the AiFlowAssistant wrapper, you can manually handle events and construct actions.
Complete Event Reference
All events extend the base event structure:
interface BaseEvent {
session: {
id: string; // UUID of the session
account_id: string; // Account identifier
phone_number: string; // Phone number for this flow session
direction?: "inbound" | "outbound"; // Call direction
from_phone_number: string; // Phone number of the caller
to_phone_number: string; // Phone number of the callee
};
}
All Event Types
| Event Type | Transport | Description | When Triggered |
|-----------------------|--------------------|-----------------------------|-----------------------------------------------------------------|
| session_start | HTTP + WebSocket | Call session begins | When a new call is initiated |
| user_speech_started | WebSocket only | Speech onset detected | When VAD detects the user starting to speak (before transcript) |
| user_speak | HTTP + WebSocket | User speech detected | After speech-to-text completes (includes barged_in flag) |
| assistant_speak | HTTP + WebSocket | Assistant started speaking | When TTS playback starts (may be omitted by some TTS models) |
| session_end | HTTP + WebSocket | Call session ends | When the call terminates |
| sms_failed | HTTP + WebSocket | SMS delivery failed | After a send_sms action fails — includes reason for handling |
Event Type Definitions
// session_start
interface AiFlowEventSessionStart {
type: "session_start";
session: {
id: string;
account_id: string;
phone_number: string; // Phone number for this flow session
direction?: "inbound" | "outbound"; // Call direction
from_phone_number: string;
to_phone_number: string;
};
}
// user_speech_started (WebSocket only)
interface AiFlowEventUserSpeechStarted {
type: "user_speech_started";
session: SessionInfo;
}
// user_speak
interface AiFlowEventUserSpeak {
type: "user_speak";
text: string; // Recognized speech text
barged_in?: boolean; // true if user interrupted assistant
session: SessionInfo;
}
// assistant_speak
interface AiFlowEventAssistantSpeak {
type: "assistant_speak";
text?: string; // Text that was spoken
ssml?: string; // SSML that was used (if applicable)
duration_ms: number; // Duration of speech in milliseconds
speech_started_at: number; // Unix timestamp (ms) when speech started
session: SessionInfo;
}
// session_end
interface AiFlowEventSessionEnd {
type: "session_end";
session: SessionInfo;
}
// sms_failed - emitted when a send_sms action fails
interface AiFlowEventSmsFailed {
type: "sms_failed";
session: SessionInfo;
recipient: string;
reason:
| "sender_not_allowed"
| "insufficient_balance"
| "no_sms_extension"
| "smsc_unavailable"
| "unknown";
message?: string;
}
Complete Action Reference
All actions require a session_id and type field:
interface BaseAction {
session_id: string; // UUID from the event's session.id
type: string; // Action type identifier
}
All Action Types
| Action Type | Description | Primary Use Case |
|---------------------------|-------------------------------------------------|------------------------------------------------------------|
| speak | Speak text or SSML | Respond to user with synthesized speech |
| audio | Play pre-recorded audio | Play hold music, pre-recorded messages |
| hangup | End the call | Terminate conversation |
| transfer | Transfer to another number | Route to human agent or department |
| barge_in | Manually interrupt playback | Stop current audio immediately |
| configure_transcription | Change STT provider and/or language(s) mid-call | Switch recognition language or provider without hanging up |
| send_sms | Send an SMS from the sipgate account | Deliver confirmation codes, booking summaries, links |
| mix_audio | Loop a background sound mixed into speech | Add ambient noise (train station, office) under the agent |
Action Type Definitions
// speak - Text-to-speech response
interface AiFlowActionSpeak {
type: "speak";
session_id: string;
// Provide either text OR ssml (not both)
text?: string;
ssml?: string;
// Optional configurations
provider?: {
provider: "azure" | "eleven_labs";
language?: string; // e.g., "en-US", "de-DE"
voice?: string; // Provider-specific voice ID/name
};
barge_in?: {
strategy: "none" | "manual" | "minimum_characters" | "immediate";
minimum_characters?: number; // Default: 3 (only for minimum_characters)
allow_after_ms?: number; // Delay before allowing interruption
};
}
// audio - Play pre-recorded audio
interface AiFlowActionAudio {
type: "audio";
session_id: string;
audio: string; // Base64 encoded WAV (8kHz, mono, 16-bit PCM)
barge_in?: {
strategy: "none" | "manual" | "minimum_characters" | "immediate";
minimum_characters?: number; // Only for minimum_characters strategy
allow_after_ms?: number;
};
}
// hangup - End call
interface AiFlowActionHangup {
type: "hangup";
session_id: string;
}
// transfer - Transfer call
interface AiFlowActionTransfer {
type: "transfer";
session_id: string;
target_phone_number: string; // E.164 format without leading + recommended
caller_id_name: string;
caller_id_number: string;
/** Optional transfer timeout (5–120s). Enables transfer fallback. */
timeout?: number;
}
// mix_audio - Loop a background sound under outbound speech
interface AiFlowActionMixAudio {
type: "mix_audio";
session_id: string;
audio?: string; // Base64 WAV (16 kHz, mono, 16-bit PCM); required unless stop=true
volume?: number; // 0.0–1.0, default 0.5
stop?: boolean; // true to remove the active loop
}
// barge_in - Manual interrupt
interface AiFlowActionBargeIn {
type: "barge_in";
session_id: string;
}
// configure_transcription - Change STT provider and/or language(s) mid-call
interface AiFlowActionConfigureTranscription {
type: "configure_transcription";
session_id: string;
provider?: "AZURE" | "DEEPGRAM" | "ELEVEN_LABS"; // omit to keep current provider
languages?: string[]; // BCP-47 codes, 1–4 entries; omit to reset to provider default
custom_vocabulary?: string[]; // words/phrases to boost recognition; max 100 entries, 200 chars each
}
// send_sms - Send an SMS from the sipgate account behind the flow
// Availability: gated behind sipgate support approval (fraud / scam protection).
interface AiFlowActionSendSms {
type: "send_sms";
session_id: string;
phone_number: string; // E.164 digits, without leading '+' preferred (leading '+' accepted and stripped)
message: string; // SMS body, min 1 char
}
configure_transcription notes:
- At least one of provider, languages, or custom_vocabulary should be set; sending none of them is a no-op
- Both provider and languages use full-replace semantics (no merging with the existing configuration)
- Any change requires a brief transcription engine restart (~100–500 ms for a language-only change, ~200–800 ms for a provider switch)
- Multi-language: Azure supports up to 4 simultaneous languages; Deepgram and ElevenLabs use only the first entry
// Switch language mid-call
return {
type: "configure_transcription",
session_id: event.session.id,
languages: ["de-DE"],
};
// Switch provider
return {
type: "configure_transcription",
session_id: event.session.id,
provider: "DEEPGRAM",
};
// Switch both at once
return {
type: "configure_transcription",
session_id: event.session.id,
provider: "DEEPGRAM",
languages: ["en-US"],
};
Custom Vocabulary
Boost recognition of domain-specific terms, product names, or proper nouns by passing a custom_vocabulary array in your configure_transcription action.
assistant.on("session_start", async (event) => {
return [
{
type: "configure_transcription",
custom_vocabulary: ["sipgate", "VoIP", "SIP-Trunk", "Portsplitter"],
},
];
});
Custom vocabulary is merged with client-level vocabulary configured during onboarding. Supported by the Azure, Deepgram, and ElevenLabs providers. Max 100 entries, max 200 characters per entry.
Direct Integration Example
Here's how to handle events and construct actions without the assistant wrapper:
import express from "express";
import { AiFlowEventType, AiFlowActionType } from "@sipgate/ai-flow-sdk";
const app = express();
app.use(express.json());
app.post("/webhook", async (req, res) => {
const event = req.body;
let action = null;
switch (event.type) {
case "session_start":
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Welcome to our service!",
barge_in: {
strategy: "minimum_characters",
minimum_characters: 5,
},
};
break;
case "user_speak":
if (event.barged_in) {
// User interrupted
console.log(`User interrupted with: ${event.text}`);
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "I'm listening, go ahead.",
};
} else if (event.text.toLowerCase().includes("transfer")) {
action = {
type: AiFlowActionType.TRANSFER,
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Support",
caller_id_number: "1234567890",
};
} else if (event.text.toLowerCase().includes("goodbye")) {
action = {
type: AiFlowActionType.HANGUP,
session_id: event.session.id,
};
} else {
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: `You said: ${event.text}`,
};
}
break;
case "assistant_speak":
console.log(`Spoke for ${event.duration_ms}ms`);
// Optional: track metrics, no action needed
break;
case "session_end":
console.log(`Session ${event.session.id} ended`);
// Cleanup logic, no action needed
break;
}
// Return action if one was created
if (action) {
res.json(action);
} else {
res.status(204).send();
}
});
app.listen(3000, () => {
console.log("Webhook server listening on port 3000");
});
Event-Action Flow Diagram
┌─────────────────┐
│ session_start │──> Respond with speak/audio or do nothing
└─────────────────┘
┌─────────────────┐
│ user_speak │──> Respond with speak/audio/transfer/hangup
│ (barged_in?) │ Check barged_in flag for interruptions
└─────────────────┘
┌─────────────────┐
│ assistant_speak │──> Optional: track metrics, trigger next action
└─────────────────┘
┌─────────────────┐
│ session_end │──> Cleanup only, no actions accepted
└─────────────────┘
Validation with Zod
The SDK exports Zod schemas for runtime validation:
import { AiFlowEventSchema, AiFlowActionSchema } from "@sipgate/ai-flow-sdk";
// Validate incoming event
try {
const event = AiFlowEventSchema.parse(req.body);
// event is now type-safe
} catch (error) {
console.error("Invalid event:", error);
}
// Validate outgoing action
try {
const action = AiFlowActionSchema.parse({
type: "speak",
session_id: event.session.id,
text: "Hello!",
});
res.json(action);
} catch (error) {
console.error("Invalid action:", error);
}
Outbound Calls
Access Required: Outbound calls are only available upon request and after a positive review by sipgate support (fraud/spam protection). Contact [email protected] to request access.
Use assistant.call() to initiate an outbound call. Once the recipient answers, the session proceeds exactly like an inbound call — the same event handlers apply.
Setup
Pass your token (and optionally baseUrl) when creating the assistant:
const assistant = AiFlowAssistant.create({
token: process.env.SIPGATE_TOKEN,
onSessionStart: async (event) => {
if (event.session.direction === "outbound") {
return "Hello! This is an automated call. Do you have a moment?";
}
return "Hello! How can I help you today?";
},
onUserSpeak: async (event) => {
return processWithLLM(event.text);
},
});
Initiating a Call
await assistant.call({
aiFlowId: "e3670012-96a3-4ae5-ac42-87abe22015c3",
billingDevice: "e2", // provided by sipgate support during onboarding
toPhoneNumber: "4915790000687", // E.164 format without leading +
});
| Parameter | Type | Description |
|----------------|--------|-----------------------------------------------------|
| aiFlowId | string | ID of the AI flow to use for the call |
| billingDevice | string | Billing device suffix, provided during onboarding |
| toPhoneNumber | string | Target phone number in E.164 format without leading + |
Custom Base URL
const assistant = AiFlowAssistant.create({
token: process.env.SIPGATE_TOKEN,
baseUrl: "https://api.sipgate.com", // default, can be omitted
});
Handling the Session
When the recipient answers, your onSessionStart handler fires with session.direction === "outbound". The direction field is available on the session object of every subsequent event as well.
Troubleshooting
Common Issues
WebSocket Connection Errors
If you encounter WebSocket connection issues:
wss.on("connection", (ws, req) => {
ws.on("error", (error) => {
console.error("WebSocket error:", error);
});
ws.on("close", (code, reason) => {
console.log(`Connection closed: ${code} - ${reason}`);
});
ws.on("message", assistant.ws(ws));
});
Event Validation Errors
Use Zod schemas to validate incoming events:
import { AiFlowEventSchema } from "@sipgate/ai-flow-sdk";
app.post("/webhook", async (req, res) => {
try {
const event = AiFlowEventSchema.parse(req.body);
const action = await assistant.onEvent(event);
if (action) {
res.json(action);
} else {
res.status(204).send();
}
} catch (error) {
console.error("Invalid event:", error);
res.status(400).json({ error: "Invalid event format" });
}
});
Debug Mode
Enable debug logging to see all events and actions:
const assistant = AiFlowAssistant.create({
debug: true, // Logs all events and actions
// ... your handlers
});
Audio Format Issues
When using the audio action, ensure your audio is in the correct format:
- Format: WAV
- Sample Rate: 8kHz
- Channels: Mono
- Bit Depth: 16-bit PCM
- Encoding: Base64
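As an illustrative sanity check (not part of the SDK), you can inspect a WAV buffer's header before sending it. This sketch assumes a canonical RIFF layout with the fmt chunk starting at byte offset 12, which holds for most standard WAV files:

```typescript
import { Buffer } from "node:buffer";

// Illustrative helper: verify a WAV buffer is 8 kHz, mono, 16-bit PCM
// before base64-encoding it for an audio action.
// Assumes the "fmt " chunk sits at byte offset 12 (canonical RIFF layout).
function isValidAiFlowWav(buf: Buffer): boolean {
  if (buf.length < 44) return false; // too short for a standard header
  if (buf.toString("ascii", 0, 4) !== "RIFF") return false;
  if (buf.toString("ascii", 8, 12) !== "WAVE") return false;
  if (buf.toString("ascii", 12, 16) !== "fmt ") return false;
  const audioFormat = buf.readUInt16LE(20);   // 1 = PCM
  const channels = buf.readUInt16LE(22);      // 1 = mono
  const sampleRate = buf.readUInt32LE(24);    // 8000 Hz expected
  const bitsPerSample = buf.readUInt16LE(34); // 16-bit expected
  return (
    audioFormat === 1 && channels === 1 && sampleRate === 8000 && bitsPerSample === 16
  );
}
```

If the check fails, re-encode the file (for example with an audio tool of your choice) rather than sending it as-is.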
// Example: Read an already correctly formatted WAV file and base64-encode it
import fs from "fs";
const audioBuffer = fs.readFileSync("audio.wav");
const base64Audio = audioBuffer.toString("base64");
return {
type: AiFlowActionType.AUDIO,
session_id: event.session.id,
audio: base64Audio,
};
Additional Resources
- Official Documentation: sipgate.de/lp/ai-flow
- Support Email: [email protected]
- GitHub Issues: Report bugs or request features
License
Apache-2.0
Need help? Contact us at [email protected]
