@sipgate/ai-flow-sdk
Official SDK for sipgate AI Flow - A powerful TypeScript SDK for building AI-powered voice assistants with real-time speech processing capabilities.
Table of Contents
- Installation
- Quick Start
- Core Concepts
- API Reference
- Integration Guides
- Outbound Calls
- Working Without the Assistant Wrapper
Installation
npm install @sipgate/ai-flow-sdk
# or
yarn add @sipgate/ai-flow-sdk
# or
pnpm add @sipgate/ai-flow-sdk
Requirements:
- Node.js >= 22.0.0
- TypeScript 5.x recommended
Quick Start
Basic Assistant
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const assistant = AiFlowAssistant.create({
debug: true,
onSessionStart: async (event) => {
console.log(`Session started for ${event.session.phone_number}`);
return "Hello! How can I help you today?";
},
onUserSpeak: async (event) => {
const userText = event.text;
console.log(`User said: ${userText}`);
// Process user input and return response
return `You said: ${userText}`;
},
onSessionEnd: async (event) => {
console.log(`Session ${event.session.id} ended`);
},
onUserBargeIn: async (event) => {
console.log(`User interrupted with: ${event.text}`);
return "I'm listening, please continue.";
},
});
Core Concepts
Event-Driven Architecture
The SDK uses an event-driven model where your assistant responds to events from the AI Flow service:
- Session Start - Called when a new call session begins
- User Speak - Called when the user says something (after speech-to-text)
- User Barge In - Called when the user interrupts the assistant
- Assistant Speak - Called when your assistant starts speaking (this event may be omitted by some TTS models)
- Assistant Speech Ended - Called when the assistant's speech playback ends
- User Input Timeout - Called when no user speech is detected within the configured timeout period
- Session End - Called when the call ends
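The event list above maps onto a dispatch over the event's `type` field. Below is a minimal sketch using simplified local types (not the SDK's exported ones) that also shows how a plain-string handler return can be normalized into a speak action, which is what the SDK's string-to-speak convenience amounts to:

```typescript
// Simplified local mirrors of the SDK shapes; illustration only, not the SDK's types.
interface SessionInfo { id: string; }

type FlowEvent =
  | { type: "session_start"; session: SessionInfo }
  | { type: "user_speak"; session: SessionInfo; text: string; barged_in?: boolean }
  | { type: "session_end"; session: SessionInfo };

interface SpeakAction { type: "speak"; session_id: string; text: string; }

// Handlers may return a plain string; normalize it to a speak action.
function normalize(result: string | SpeakAction | null, sessionId: string): SpeakAction | null {
  if (result === null) return null;
  if (typeof result === "string") return { type: "speak", session_id: sessionId, text: result };
  return result;
}

function route(event: FlowEvent): SpeakAction | null {
  switch (event.type) {
    case "session_start":
      return normalize("Hello! How can I help you today?", event.session.id);
    case "user_speak":
      return normalize(`You said: ${event.text}`, event.session.id);
    default:
      return null; // session_end needs no spoken response
  }
}
```

The wrapper hides this dispatch behind the handler options, but the same shape applies when working without it (see "Working Without the Assistant Wrapper").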
Response Types
Event handlers can return these response types:
// 1. Simple string (automatically converted to speak action)
return "Hello, how can I help?";
// 2. Action object (for advanced control)
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Hello!",
barge_in: { strategy: BargeInStrategy.MINIMUM_CHARACTERS },
};
// 3. Array of actions (executed in sequence)
return [
{ type: AiFlowActionType.BARGE_IN, session_id: event.session.id },
{ type: AiFlowActionType.SPEAK, session_id: event.session.id, text: "Sorry, let me correct that." },
];
// 4. null/undefined (no response needed)
return null;
API Reference
AiFlowAssistant
The main class for creating AI voice assistants.
AiFlowAssistant.create(options)
Creates a new assistant instance.
Options:
interface AiFlowAssistantOptions {
// Optional API key for authentication
apiKey?: string;
// Enable debug logging
debug?: boolean;
// Event handlers
onSessionStart?: (
event: AiFlowApiEventSessionStart
) => Promise<InvocationResponseType>;
onUserSpeechStarted?: (
event: AiFlowEventUserSpeechStarted
) => Promise<void>; // WebSocket only — no return value expected
onUserSpeak?: (
event: AiFlowApiEventUserSpeak
) => Promise<InvocationResponseType>;
onAssistantSpeak?: (
event: AiFlowApiEventAssistantSpeak
) => Promise<InvocationResponseType>;
onAssistantSpeechEnded?: (
event: AiFlowEventAssistantSpeechEnded
) => Promise<InvocationResponseType>;
onUserInputTimeout?: (
event: AiFlowEventUserInputTimeout
) => Promise<InvocationResponseType>;
onDtmfReceived?: (
event: AiFlowEventDtmfReceived
) => Promise<InvocationResponseType>;
onSessionEnd?: (
event: AiFlowApiEventSessionEnd
) => Promise<InvocationResponseType>;
onUserBargeIn?: (
event: AiFlowEventUserBargeIn
) => Promise<InvocationResponseType>;
}
type InvocationResponseType = AiFlowApiAction | AiFlowApiAction[] | string | null | undefined;
Instance Methods
assistant.express()
Returns an Express.js middleware function for handling webhook requests.
app.post("/webhook", assistant.express());
assistant.ws(websocket)
Returns a WebSocket message handler.
wss.on("connection", (ws) => {
ws.on("message", assistant.ws(ws));
});
assistant.onEvent(event)
Manually process an event (useful for custom integrations).
const action = await assistant.onEvent(event);
Event Types
SessionStart Event
Triggered when a new call session begins.
interface AiFlowApiEventSessionStart {
type: "session_start";
session: {
id: string; // UUID of the session
account_id: string; // Account identifier
phone_number: string; // Phone number for this flow session
direction?: "inbound" | "outbound"; // Call direction
from_phone_number: string; // Phone number of the caller
to_phone_number: string; // Phone number of the callee
};
}
Example:
onSessionStart: async (event) => {
// Log session details
console.log(`${event.session.direction} call from ${event.session.from_phone_number} to ${event.session.to_phone_number}`);
// Return greeting
return "Welcome to our service!";
};
UserSpeechStarted Event
Triggered when the user's speech is first detected — before the full transcript is available. Uses Voice Activity Detection (VAD), typically 20–120 ms after speech onset.
WebSocket only — not delivered to HTTP webhook handlers.
interface AiFlowEventUserSpeechStarted {
type: "user_speech_started";
session: SessionInfo;
}
Fires at most once per speech turn; resets automatically after the corresponding user_speak event. No return value is expected.
Example:
onUserSpeechStarted: async (event) => {
console.log("User started speaking, session:", event.session.id);
// No return value needed
},
UserSpeak Event
Triggered when the user speaks and speech-to-text completes.
interface AiFlowApiEventUserSpeak {
type: "user_speak";
text: string; // Recognized speech text
session: {
id: string;
account_id: string;
phone_number: string;
};
}
Example:
onUserSpeak: async (event) => {
const intent = analyzeIntent(event.text);
if (intent === "help") {
return "I can help you with billing, support, or sales.";
}
return processUserInput(event.text);
};
AssistantSpeak Event
Triggered when the assistant starts speaking. This event may be omitted by some text-to-speech models.
interface AiFlowApiEventAssistantSpeak {
type: "assistant_speak";
text?: string; // Text that was spoken
ssml?: string; // SSML that was used (if applicable)
duration_ms: number; // Duration of speech in milliseconds
speech_started_at: number; // Unix timestamp (ms) when speech started
session: SessionInfo;
}
Example:
onAssistantSpeak: async (event) => {
console.log(`Spoke for ${event.duration_ms}ms`);
// Track conversation metrics
trackMetrics({
sessionId: event.session.id,
duration: event.duration_ms,
text: event.text,
});
};
AssistantSpeechEnded Event
Triggered after the assistant finishes speaking.
interface AiFlowEventAssistantSpeechEnded {
type: "assistant_speech_ended";
session: SessionInfo;
}
Example:
onAssistantSpeechEnded: async (event) => {
console.log(`Finished speaking for session ${event.session.id}`);
// Hangup if needed
};
UserInputTimeout Event
Triggered when no user speech is detected within the configured timeout period after the assistant finishes speaking.
interface AiFlowEventUserInputTimeout {
type: "user_input_timeout";
session: SessionInfo;
}
When Triggered:
- A speak action includes a user_input_timeout_seconds field
- The assistant finishes speaking (the assistant_speech_ended event fires)
- The specified timeout period elapses without any user speech detected
Example:
onUserInputTimeout: async (event) => {
console.log(`User input timeout for session ${event.session.id}`);
// Retry the question
return {
type: "speak",
session_id: event.session.id,
text: "Are you still there? Please say yes or no.",
user_input_timeout_seconds: 5
};
};
Configuring Timeout:
Set user_input_timeout_seconds in the speak action:
onSessionStart: async (event) => {
return {
type: "speak",
session_id: event.session.id,
text: "What is your account number?",
user_input_timeout_seconds: 5 // Wait 5 seconds for response
};
};
Common Use Cases:
// Hangup after multiple timeouts
const timeoutCounts = new Map<string, number>();
onUserInputTimeout: async (event) => {
const sessionId = event.session.id;
const count = (timeoutCounts.get(sessionId) || 0) + 1;
timeoutCounts.set(sessionId, count);
if (count >= 3) {
return {
type: "hangup",
session_id: sessionId
};
}
return {
type: "speak",
session_id: sessionId,
text: `I didn't hear anything. Please respond. Attempt ${count} of 3.`,
user_input_timeout_seconds: 5
};
};
DtmfReceived Event
Triggered when the user presses a key on their phone keypad.
interface AiFlowEventDtmfReceived {
type: "dtmf_received";
digit: string; // The key pressed: "0"–"9", "*", or "#"
session: SessionInfo;
}
Example:
onDtmfReceived: async (event) => {
console.log(`User pressed: ${event.digit}`);
if (event.digit === '1') {
return {
type: 'transfer',
session_id: event.session.id,
transfer_to: '49211100200'
};
}
if (event.digit === '#') {
return { type: 'hangup', session_id: event.session.id };
}
};
Notes:
- All standard DTMF tones are supported: 0–9, *, #
- Each key press triggers a separate event
- Useful for IVR menus, PIN entry, or confirmation flows
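Because each key press arrives as a separate event, PIN entry needs a small amount of per-session state. Here is one possible sketch; the Map-based buffer and the convention of `#` to confirm and `*` to clear are assumptions for illustration, not SDK behavior:

```typescript
// Collect DTMF digits per session until the caller confirms with '#'.
// Returns the finished PIN once '#' arrives, otherwise null.
const pinBuffers = new Map<string, string>();

function collectPinDigit(sessionId: string, digit: string): string | null {
  if (digit === "#") {
    const pin = pinBuffers.get(sessionId) ?? "";
    pinBuffers.delete(sessionId); // reset for the next entry attempt
    return pin;
  }
  if (digit === "*") {
    pinBuffers.delete(sessionId); // '*' clears the current entry
    return null;
  }
  pinBuffers.set(sessionId, (pinBuffers.get(sessionId) ?? "") + digit);
  return null;
}
```

Inside an onDtmfReceived handler, a non-null return value would be the moment to verify the PIN and respond; remember to clear the buffer in onSessionEnd as well.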
SessionEnd Event
Triggered when the call session ends.
interface AiFlowApiEventSessionEnd {
type: "session_end";
session: SessionInfo;
}
Example:
onSessionEnd: async (event) => {
// Save conversation history
await saveConversation(event.session.id);
// Send analytics
await trackSessionEnd(event.session);
};
Barge-In Detection
User interruptions are detected via the barged_in flag in user_speak events:
interface AiFlowEventUserSpeak {
type: "user_speak";
text: string; // Recognized speech text
barged_in?: boolean; // true if user interrupted assistant
session: SessionInfo;
}
When barged_in is true, the user interrupted the assistant mid-speech. The SDK automatically routes these to your onUserBargeIn handler.
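The routing rule can be pictured as a tiny dispatcher: events with barged_in set go to the barge-in handler, all others to the regular one. The types and dispatcher below are a simplified illustration, not the SDK's internals:

```typescript
// Simplified user_speak event shape for illustration.
interface UserSpeakEvent {
  type: "user_speak";
  text: string;
  barged_in?: boolean;
  session: { id: string };
}

interface Handlers {
  onUserSpeak: (e: UserSpeakEvent) => string;
  onUserBargeIn: (e: UserSpeakEvent) => string;
}

// Mirrors the documented rule: barged_in === true routes to onUserBargeIn.
function dispatchUserSpeak(event: UserSpeakEvent, handlers: Handlers): string {
  return event.barged_in ? handlers.onUserBargeIn(event) : handlers.onUserSpeak(event);
}
```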
Action Types
Actions are responses that tell the AI Flow service what to do next.
Speak Action
Speaks text or SSML to the user.
interface AiFlowActionSpeak {
type: "speak";
session_id: string;
// Either text OR ssml (not both)
text?: string; // Plain text to speak
ssml?: string; // SSML markup for advanced control
// Optional configurations
tts?: TtsConfig; // TTS provider settings
barge_in?: BargeInConfig; // Barge-in behavior
}
Examples:
// Simple text
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Hello, how can I help you?",
};
// With SSML
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
ssml: `
<speak version="1.0" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<prosody rate="slow">Please listen carefully.</prosody>
<break time="500ms"/>
Your account balance is <say-as interpret-as="currency">$42.50</say-as>
</voice>
</speak>
`,
};
// With custom TTS provider
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Hello in a different voice",
tts: {
provider: TtsProvider.AZURE,
language: "en-US",
voice: "en-US-JennyNeural",
},
};
Audio Action
Plays pre-recorded audio to the user.
interface AiFlowActionAudio {
type: "audio";
session_id: string;
audio: string; // Base64 encoded WAV (8kHz, mono, 16-bit)
barge_in?: BargeInConfig;
}
Example:
// Play hold music or pre-recorded message
return {
type: AiFlowActionType.AUDIO,
session_id: event.session.id,
audio: base64EncodedWavData,
barge_in: {
strategy: BargeInStrategy.MINIMUM_CHARACTERS,
minimum_characters: 3,
},
};
Hangup Action
Ends the call.
interface AiFlowActionHangup {
type: "hangup";
session_id: string;
}
Example:
onUserSpeak: async (event) => {
if (event.text.toLowerCase().includes("goodbye")) {
return {
type: AiFlowActionType.HANGUP,
session_id: event.session.id,
};
}
};
Transfer Action
Transfers the call to another phone number. Optionally, pass a timeout to
enable transfer fallback — if the target doesn't answer in time (or
rejects/hangs up), the service re-emits session_start with the same
session.id so your agent can handle the call again.
interface AiFlowActionTransfer {
type: "transfer";
session_id: string;
target_phone_number: string; // E.164 format without leading + recommended
caller_id_name: string;
caller_id_number: string;
/**
* Optional transfer timeout in seconds (5–120). When set, a failed transfer
* returns the call to the agent via a new `session_start` event instead of
* ending the call. Omit for legacy "hang up on failure" behavior.
*/
timeout?: number;
}
Example:
// Transfer to sales department with fallback
return {
type: AiFlowActionType.TRANSFER,
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Sales Department",
caller_id_number: "1234567890",
timeout: 30,
};
BargeIn Action
Manually triggers barge-in (interrupts current playback).
interface AiFlowActionBargeIn {
type: "barge_in";
session_id: string;
}
MixAudio Action
Plays a looping background sound (e.g. ambient noise) under the call for the rest of the session — both during the assistant's TTS turns and during silences. Send again with stop: true to remove the loop. The audio must be a base64-encoded WAV (16 kHz, mono, 16-bit PCM) — same format as the audio action. The loop is dropped automatically when the session ends.
interface AiFlowActionMixAudio {
type: "mix_audio";
session_id: string;
/** Base64 WAV (16 kHz, mono, 16-bit PCM). Required unless stop=true. */
audio?: string;
/** Mix volume 0.0–1.0. Defaults to 0.5. */
volume?: number;
/** When true, removes the active background loop. */
stop?: boolean;
}
Example — start an ambient loop:
return {
type: AiFlowActionType.MIX_AUDIO,
session_id: event.session.id,
audio: base64WavStation,
volume: 0.3,
};
Example — stop the ambient loop:
return {
type: AiFlowActionType.MIX_AUDIO,
session_id: event.session.id,
stop: true,
};
TTS Providers
Configure text-to-speech providers for different voices and languages. The SDK supports both Azure Cognitive Services and ElevenLabs for high-quality voice synthesis.
Azure Cognitive Services
Azure provides a wide range of neural voices across many languages and regions.
interface TtsProviderConfigAzure {
provider: TtsProvider.AZURE;
language?: string; // BCP-47 format (e.g., "en-US", "de-DE")
voice?: string; // Voice name (e.g., "en-US-JennyNeural")
}
Examples:
// English (US) - Female
provider: {
provider: TtsProvider.AZURE,
language: "en-US",
voice: "en-US-JennyNeural"
}
// English (GB) - Female
provider: {
provider: TtsProvider.AZURE,
language: "en-GB",
voice: "en-GB-SoniaNeural"
}
// German - Male
provider: {
provider: TtsProvider.AZURE,
language: "de-DE",
voice: "de-DE-ConradNeural"
}
// Spanish - Female
provider: {
provider: TtsProvider.AZURE,
language: "es-ES",
voice: "es-ES-ElviraNeural"
}
Popular Azure Voices:
| Language | Voice Name         | Gender | Description            |
| -------- | ------------------ | ------ | ---------------------- |
| en-US    | en-US-JennyNeural  | Female | Friendly, professional |
| en-US    | en-US-GuyNeural    | Male   | Clear, neutral         |
| en-GB    | en-GB-SoniaNeural  | Female | British, professional  |
| en-GB    | en-GB-RyanNeural   | Male   | British, friendly      |
| de-DE    | de-DE-KatjaNeural  | Female | Professional, clear    |
| de-DE    | de-DE-ConradNeural | Male   | Deep, authoritative    |
Full Voice List: See the Azure TTS documentation for the complete list of 400+ voices in 140+ languages.
ElevenLabs
ElevenLabs provides ultra-realistic AI voices optimized for conversational use cases.
interface TtsProviderConfigElevenLabs {
provider: TtsProvider.ELEVEN_LABS;
voice?: string; // Voice ID (e.g., "21m00Tcm4TlvDq8ikWAM")
}
Examples:
// Using the default voice (sipgate voice — used when voice is omitted)
provider: {
provider: TtsProvider.ELEVEN_LABS
}
// Using a specific voice ID
provider: {
provider: TtsProvider.ELEVEN_LABS,
voice: "21m00Tcm4TlvDq8ikWAM" // Rachel
}
Available ElevenLabs Voices:
The first entry is the default voice used when no voice is specified.
| Voice Name      | ID                   | Description                                                               | Verified Locales                   |
|-----------------|----------------------|---------------------------------------------------------------------------|------------------------------------|
| sipgate voice ⭐ | dSu12TX3MEDQXAarG4s6 | Clean male voice used by sipgate for system announcements. Default.       | de-DE                              |
| Rachel          | 21m00Tcm4TlvDq8ikWAM | Matter-of-fact, personable woman. Great for conversational use cases.     |                                    |
| Sarah           | EXAVITQu4vr4xnSDxMaL | Young adult woman with a confident and warm, mature quality.              | en-US, fr-FR, cmn-CN, hi-IN        |
| Laura           | FGY2WhTYpPnrIDTdsKH5 | Young adult female delivers sunny enthusiasm with quirky attitude.        | en-US, fr-FR, cmn-CN, de-DE        |
| George          | JBFqnCBsd6RMkjVDRZzb | Warm resonance that instantly captivates listeners.                       | en-GB, fr-FR, ja-JP, cs-CZ         |
| Thomas          | GBv7mTt0atIp3Br8iCZE | Soft and subdued male, optimal for narrations or meditations.             | en-US                              |
| Roger           | CwhRBWXzGAHq8TQ4Fs17 | Easy going and perfect for casual conversations.                          | en-US, fr-FR, de-DE, nl-NL         |
| Eric            | cjVigY5qzO86Huf0OWal | Smooth tenor pitch from a man in his 40s - perfect for agentic use cases. | en-US, fr-FR, de-DE, sk-SK         |
| Brian           | nPczCjzI2devNBz1zQrb | Middle-aged man with resonant and comforting tone.                        | en-US, cmn-CN, de-DE, nl-NL        |
| Jessica         | cgSgspJ2msm6clMCkdW9 | Young and playful American female, perfect for trendy content.            | en-US, fr-FR, ja-JP, cmn-CN, de-DE |
| Liam            | TX3LPaxmHKxFdv7VOQHJ | Young adult with energy and warmth - suitable for reels and shorts.       | en-US, de-DE, cs-CZ, pl-PL, tr-TR  |
| Alice           | Xb7hH8MSUJpSbSDYk0k2 | Clear and engaging, friendly British woman suitable for e-learning.       | en-GB, it-IT, fr-FR, ja-JP, pl-PL  |
| Daniel          | onwK4e9ZLuTAKqWW03F9 | Strong voice perfect for professional broadcast or news.                  | en-GB, de-DE, tr-TR                |
| Lily            | pFZP5JQG7iQjIQuC4Bku | Velvety British female delivers news with warmth and clarity.             | it-IT, de-DE, cmn-CN, cs-CZ, nl-NL |
| River           | SAz9YHcvj6GT2YYXdXww | Relaxed, neutral voice ready for narrations or conversational projects.   | en-US, it-IT, fr-FR, cmn-CN        |
| Charlie         | IKne3meq5aSn9XLyUdCD | Young Australian male with confident and energetic voice.                 | en-AU, cmn-CN, fil-PH              |
| Aria            | 9BWtsMINqrJLrRacOk9x | Middle-aged female with African-American accent. Calm with hint of rasp.  | en-US, fr-FR, cmn-CN, tr-TR        |
| Matilda         | XrExE9yKIg1WjnnlVkGX | Professional woman with pleasing alto pitch. Suitable for many use cases. | en-US, it-IT, fr-FR, de-DE         |
| Will            | bIHbv24MWmeRgasZH58o | Conversational and laid back.                                             | en-US, fr-FR, de-DE, cmn-CN, cs-CZ |
| Chris           | iP95p4xoKVk53GoZ742B | Natural and real, down-to-earth voice great across many use-cases.        | en-US, fr-FR, sv-SE, hi-IN         |
| Bill            | pqHfZKP75CvOlQylNhV4 | Friendly and comforting voice ready to narrate stories.                   | en-US, fr-FR, cmn-CN, de-DE, cs-CZ |
Note: 50+ voices available in total. The SDK includes full TypeScript type definitions for all voice IDs and names.
Choosing a TTS Provider
Use Azure when:
- You need support for many languages (140+ languages available)
- You want consistent quality across all locales
- You need specific regional accents or dialects
- Budget is a primary concern
Use ElevenLabs when:
- You need the most natural, human-like voices
- Conversational quality is critical (phone calls, virtual assistants)
- You're primarily working with English or common European languages
- You want voices with distinct personalities
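This guidance can be encoded as a small locale-based helper: prefer ElevenLabs for a short list of well-supported locales and fall back to Azure everywhere else. The locale set below is an assumption to tune per project, not an official support matrix:

```typescript
type TtsConfig =
  | { provider: "azure"; language: string; voice?: string }
  | { provider: "eleven_labs"; voice?: string };

// Locales we consider well covered by ElevenLabs; an assumption, adjust as needed.
const ELEVEN_LABS_LOCALES = new Set(["en-US", "en-GB", "de-DE", "fr-FR"]);

function pickTts(locale: string): TtsConfig {
  if (ELEVEN_LABS_LOCALES.has(locale)) {
    // Omitting `voice` selects the default sipgate voice.
    return { provider: "eleven_labs" };
  }
  // Azure covers 140+ languages, so it is the safe fallback.
  return { provider: "azure", language: locale };
}
```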
Barge-In Configuration
Control how users can interrupt the assistant while speaking.
interface BargeInConfig {
strategy: BargeInStrategy;
minimum_characters?: number; // Default: 3
allow_after_ms?: number; // Delay before allowing interruption
}
Strategies
BargeInStrategy.NONE
Disables barge-in completely. Audio plays fully without interruption.
barge_in: {
strategy: BargeInStrategy.NONE;
}
Use cases:
- Critical information that must be heard
- Legal disclaimers
- Emergency instructions
BargeInStrategy.MANUAL
Allows manual barge-in via API only (no automatic detection).
barge_in: {
strategy: BargeInStrategy.MANUAL;
}
Use cases:
- Custom interruption logic
- Button-triggered interruption
- External event-based interruption
BargeInStrategy.MINIMUM_CHARACTERS
Automatically detects barge-in when user speech exceeds character threshold.
barge_in: {
strategy: BargeInStrategy.MINIMUM_CHARACTERS,
minimum_characters: 5, // Trigger after 5 characters
allow_after_ms: 500 // Wait 500ms before allowing interruption
}
Use cases:
- Natural conversation flow
- Customer service scenarios
- Interactive voice menus
Example with protection period:
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Your account number is 1234567890. Please write this down.",
barge_in: {
strategy: BargeInStrategy.MINIMUM_CHARACTERS,
minimum_characters: 10, // Require substantial speech
allow_after_ms: 2000, // Protect first 2 seconds
},
};
BargeInStrategy.IMMEDIATE ⚡ NEW
Most responsive option - Interrupts immediately when user starts speaking using Voice Activity Detection (VAD).
barge_in: {
strategy: BargeInStrategy.IMMEDIATE,
allow_after_ms: 500 // Optional: protect first 500ms
}
How it works:
- Azure/Deepgram: Uses Voice Activity Detection (triggers before any text is recognized)
- ElevenLabs: Uses first partial transcript
- Latency: 20–100 ms (2–4× faster than MINIMUM_CHARACTERS)
- No text required: Interrupts on voice detection, not transcription
Use cases:
- High-priority conversations requiring instant responsiveness
- Natural dialogue where interruptions should feel seamless
- Customer service where quick response matters
- Urgent or time-sensitive interactions
Example:
return {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "I can help you with billing, support, or sales. What would you like?",
barge_in: {
strategy: BargeInStrategy.IMMEDIATE,
allow_after_ms: 500, // Protect first 500ms from accidental noise
},
};
Comparison with MINIMUM_CHARACTERS:
| Feature | IMMEDIATE | MINIMUM_CHARACTERS |
|---------------------|----------------------|----------------------------------|
| Trigger | Voice Activity (VAD) | Text recognition (3+ characters) |
| Latency | 20-100ms | 50-200ms |
| User Experience | Instant interruption | Slight delay |
| Accuracy | May trigger on noise | More reliable (text-based) |
Best practices:
- Use allow_after_ms: 500–1000 to prevent accidental interruptions at the start
- Test with real users to find the optimal allow_after_ms value
- Consider network latency in production environments
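One way to keep these trade-offs consistent across a bot is a helper that picks a barge-in config per prompt criticality. The tiers and the concrete values below are suggestions derived from the guidance above, not SDK defaults:

```typescript
interface BargeInConfig {
  strategy: "none" | "immediate" | "minimum_characters";
  minimum_characters?: number;
  allow_after_ms?: number;
}

// "legal": must be heard in full; "conversational": instant interruption;
// "menu": require a short confirmed utterance. Tiers are illustrative.
function bargeInFor(kind: "legal" | "conversational" | "menu"): BargeInConfig {
  if (kind === "legal") return { strategy: "none" };
  if (kind === "conversational") return { strategy: "immediate", allow_after_ms: 500 };
  return { strategy: "minimum_characters", minimum_characters: 5, allow_after_ms: 500 };
}
```

A speak action can then take `barge_in: bargeInFor("conversational")`, so tuning happens in one place.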
Integration Guides
Express.js Integration
Complete example with error handling and logging:
import express from "express";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const app = express();
app.use(express.json());
const assistant = AiFlowAssistant.create({
debug: process.env.NODE_ENV !== "production",
onSessionStart: async (event) => {
return "Welcome! How can I help you today?";
},
onUserSpeak: async (event) => {
// Your conversation logic here
return processUserInput(event.text);
},
onSessionEnd: async (event) => {
await cleanupSession(event.session.id);
},
});
// Webhook endpoint
app.post("/webhook", assistant.express());
// Health check
app.get("/health", (req, res) => {
res.json({ status: "ok" });
});
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
console.log(`AI Flow assistant running on port ${PORT}`);
});
WebSocket Integration
import WebSocket from "ws";
import { AiFlowAssistant } from "@sipgate/ai-flow-sdk";
const wss = new WebSocket.Server({
port: 8080,
perMessageDeflate: false,
});
const assistant = AiFlowAssistant.create({
onUserSpeak: async (event) => {
return "Hello from WebSocket!";
},
});
wss.on("connection", (ws, req) => {
console.log("New WebSocket connection");
ws.on("message", assistant.ws(ws));
ws.on("error", (error) => {
console.error("WebSocket error:", error);
});
ws.on("close", () => {
console.log("WebSocket connection closed");
});
});
console.log("WebSocket server listening on port 8080");
Advanced Example: Customer Service Bot
A more complete example demonstrating state management and routing:
import { AiFlowAssistant, BargeInStrategy } from "@sipgate/ai-flow-sdk";
import express from "express";
// Session state management
const sessions = new Map<string, { state: string; data: any }>();
const assistant = AiFlowAssistant.create({
debug: true,
onSessionStart: async (event) => {
// Initialize session state
sessions.set(event.session.id, {
state: "greeting",
data: { attempts: 0 },
});
return {
type: "speak",
session_id: event.session.id,
text: "Welcome to customer support. How can I help you today? You can ask about billing, technical support, or sales.",
barge_in: {
strategy: BargeInStrategy.MINIMUM_CHARACTERS,
minimum_characters: 3,
},
};
},
onUserSpeak: async (event) => {
const session = sessions.get(event.session.id);
if (!session) return null;
const text = event.text.toLowerCase();
// Intent routing
if (text.includes("billing") || text.includes("invoice")) {
return {
type: "transfer",
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Billing Department",
caller_id_number: "1234567890",
};
}
if (text.includes("goodbye") || text.includes("bye")) {
return {
type: "speak",
session_id: event.session.id,
text: "Thank you for calling. Have a great day!",
barge_in: { strategy: BargeInStrategy.NONE }, // Don't allow interruption
};
}
if (text.includes("technical") || text.includes("support")) {
session.state = "technical_support";
return "I'll connect you with our technical support team. Please describe your issue.";
}
// Default response
session.data.attempts++;
if (session.data.attempts > 2) {
return "I'm having trouble understanding. Let me transfer you to a representative.";
}
return "I can help with billing, technical support, or sales. Which would you like?";
},
onUserBargeIn: async (event) => {
console.log(`User interrupted: ${event.text}`);
return "Yes, I'm listening.";
},
onSessionEnd: async (event) => {
// Cleanup session state
sessions.delete(event.session.id);
console.log(`Session ${event.session.id} ended`);
},
});
const app = express();
app.use(express.json());
app.post("/webhook", assistant.express());
app.listen(3000, () => {
console.log("Customer service bot running on port 3000");
});
Working Without the Assistant Wrapper
If you prefer to work directly with the SDK's event and action system without using the AiFlowAssistant wrapper, you can manually handle events and construct actions.
Complete Event Reference
All events extend the base event structure:
interface BaseEvent {
session: {
id: string; // UUID of the session
account_id: string; // Account identifier
phone_number: string; // Phone number for this flow session
direction?: "inbound" | "outbound"; // Call direction
from_phone_number: string; // Phone number of the caller
to_phone_number: string; // Phone number of the callee
};
}
All Event Types
| Event Type | Transport | Description | When Triggered |
|-----------------------|--------------------|-----------------------------|-----------------------------------------------------------------|
| session_start | HTTP + WebSocket | Call session begins | When a new call is initiated |
| user_speech_started | WebSocket only | Speech onset detected | When VAD detects the user starting to speak (before transcript) |
| user_speak | HTTP + WebSocket | User speech detected | After speech-to-text completes (includes barged_in flag) |
| assistant_speak | HTTP + WebSocket | Assistant started speaking | When TTS playback starts (may be omitted by some TTS models) |
| session_end | HTTP + WebSocket | Call session ends | When the call terminates |
| sms_failed | HTTP + WebSocket | SMS delivery failed | After a send_sms action fails — includes reason for handling |
Event Type Definitions
// session_start
interface AiFlowEventSessionStart {
type: "session_start";
session: {
id: string;
account_id: string;
phone_number: string; // Phone number for this flow session
direction?: "inbound" | "outbound"; // Call direction
from_phone_number: string;
to_phone_number: string;
};
}
// user_speech_started (WebSocket only)
interface AiFlowEventUserSpeechStarted {
type: "user_speech_started";
session: SessionInfo;
}
// user_speak
interface AiFlowEventUserSpeak {
type: "user_speak";
text: string; // Recognized speech text
barged_in?: boolean; // true if user interrupted assistant
session: SessionInfo;
}
// assistant_speak
interface AiFlowEventAssistantSpeak {
type: "assistant_speak";
text?: string; // Text that was spoken
ssml?: string; // SSML that was used (if applicable)
duration_ms: number; // Duration of speech in milliseconds
speech_started_at: number; // Unix timestamp (ms) when speech started
session: SessionInfo;
}
// session_end
interface AiFlowEventSessionEnd {
type: "session_end";
session: SessionInfo;
}
// sms_failed - emitted when a send_sms action fails
interface AiFlowEventSmsFailed {
type: "sms_failed";
session: SessionInfo;
recipient: string;
reason:
| "sender_not_allowed"
| "insufficient_balance"
| "no_sms_extension"
| "smsc_unavailable"
| "unknown";
message?: string;
}
Complete Action Reference
All actions require a session_id and type field:
interface BaseAction {
session_id: string; // UUID from the event's session.id
type: string; // Action type identifier
}
All Action Types
| Action Type | Description | Primary Use Case |
|---------------------------|-------------------------------------------------|------------------------------------------------------------|
| speak | Speak text or SSML | Respond to user with synthesized speech |
| audio | Play pre-recorded audio | Play hold music, pre-recorded messages |
| hangup | End the call | Terminate conversation |
| transfer | Transfer to another number | Route to human agent or department |
| barge_in | Manually interrupt playback | Stop current audio immediately |
| configure_transcription | Change STT provider and/or language(s) mid-call | Switch recognition language or provider without hanging up |
| send_sms | Send an SMS from the sipgate account | Deliver confirmation codes, booking summaries, links |
| mix_audio | Loop a background sound mixed into speech | Add ambient noise (train station, office) under the agent |
Action Type Definitions
// speak - Text-to-speech response
interface AiFlowActionSpeak {
type: "speak";
session_id: string;
// Provide either text OR ssml (not both)
text?: string;
ssml?: string;
// Optional configurations
provider?: {
provider: "azure" | "eleven_labs";
language?: string; // e.g., "en-US", "de-DE"
voice?: string; // Provider-specific voice ID/name
};
barge_in?: {
strategy: "none" | "manual" | "minimum_characters" | "immediate";
minimum_characters?: number; // Default: 3 (only for minimum_characters)
allow_after_ms?: number; // Delay before allowing interruption
};
}
// audio - Play pre-recorded audio
interface AiFlowActionAudio {
type: "audio";
session_id: string;
audio: string; // Base64 encoded WAV (8kHz, mono, 16-bit PCM)
barge_in?: {
strategy: "none" | "manual" | "minimum_characters" | "immediate";
minimum_characters?: number; // Only for minimum_characters strategy
allow_after_ms?: number;
};
}
// hangup - End call
interface AiFlowActionHangup {
type: "hangup";
session_id: string;
}
// transfer - Transfer call
interface AiFlowActionTransfer {
type: "transfer";
session_id: string;
target_phone_number: string; // E.164 format without leading + recommended
caller_id_name: string;
caller_id_number: string;
/** Optional transfer timeout (5–120s). Enables transfer fallback. */
timeout?: number;
}
// mix_audio - Loop a background sound under outbound speech
interface AiFlowActionMixAudio {
type: "mix_audio";
session_id: string;
audio?: string; // Base64 WAV (16 kHz, mono, 16-bit PCM); required unless stop=true
volume?: number; // 0.0–1.0, default 0.5
stop?: boolean; // true to remove the active loop
}
// barge_in - Manual interrupt
interface AiFlowActionBargeIn {
type: "barge_in";
session_id: string;
}
// configure_transcription - Change STT provider and/or language(s) mid-call
interface AiFlowActionConfigureTranscription {
type: "configure_transcription";
session_id: string;
provider?: "AZURE" | "DEEPGRAM" | "ELEVEN_LABS"; // omit to keep current provider
languages?: string[]; // BCP-47 codes, 1–4 entries; omit to reset to provider default
custom_vocabulary?: string[]; // words/phrases to boost recognition; max 100 entries, 200 chars each
}
// send_sms - Send an SMS from the sipgate account behind the flow
// Availability: gated behind sipgate support approval (fraud / scam protection).
interface AiFlowActionSendSms {
type: "send_sms";
session_id: string;
phone_number: string; // E.164 digits, without leading '+' preferred (leading '+' accepted and stripped)
message: string; // SMS body, min 1 char
}
configure_transcription notes:
- At least one of provider, languages, or custom_vocabulary should be set; sending none of them is a no-op
- Both provider and languages use full-replace semantics (no merging with the existing configuration)
- Any change requires a brief transcription engine restart (~100–500 ms for a language-only change, ~200–800 ms for a provider switch)
- Multi-language: Azure supports up to 4 simultaneous languages; Deepgram and ElevenLabs use only the first entry
// Switch language mid-call
return {
type: "configure_transcription",
session_id: event.session.id,
languages: ["de-DE"],
};
// Switch provider
return {
type: "configure_transcription",
session_id: event.session.id,
provider: "DEEPGRAM",
};
// Switch both at once
return {
type: "configure_transcription",
session_id: event.session.id,
provider: "DEEPGRAM",
languages: ["en-US"],
};
Custom Vocabulary
Boost recognition of domain-specific terms, product names, or proper nouns by passing a custom_vocabulary array in your configure_transcription action.
assistant.on("session_start", async (event) => {
return [
{
type: "configure_transcription",
custom_vocabulary: ["sipgate", "VoIP", "SIP-Trunk", "Portsplitter"],
},
];
});
Custom vocabulary is merged with client-level vocabulary configured during onboarding. Supported by the Azure, Deepgram, and ElevenLabs providers. Max 100 entries, max 200 characters per entry.
Direct Integration Example
Here's how to handle events and construct actions without the assistant wrapper:
import express from "express";
import { AiFlowEventType, AiFlowActionType } from "@sipgate/ai-flow-sdk";
const app = express();
app.use(express.json());
app.post("/webhook", async (req, res) => {
const event = req.body;
let action = null;
switch (event.type) {
case "session_start":
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "Welcome to our service!",
barge_in: {
strategy: "minimum_characters",
minimum_characters: 5,
},
};
break;
case "user_speak":
if (event.barged_in) {
// User interrupted
console.log(`User interrupted with: ${event.text}`);
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: "I'm listening, go ahead.",
};
} else if (event.text.toLowerCase().includes("transfer")) {
action = {
type: AiFlowActionType.TRANSFER,
session_id: event.session.id,
target_phone_number: "1234567890",
caller_id_name: "Support",
caller_id_number: "1234567890",
};
} else if (event.text.toLowerCase().includes("goodbye")) {
action = {
type: AiFlowActionType.HANGUP,
session_id: event.session.id,
};
} else {
action = {
type: AiFlowActionType.SPEAK,
session_id: event.session.id,
text: `You said: ${event.text}`,
};
}
break;
case "assistant_speak":
console.log(`Spoke for ${event.duration_ms}ms`);
// Optional: track metrics, no action needed
break;
case "session_end":
console.log(`Session ${event.session.id} ended`);
// Cleanup logic, no action needed
break;
}
// Return action if one was created
if (action) {
res.json(action);
} else {
res.status(204).send();
}
});
app.listen(3000, () => {
console.log("Webhook server listening on port 3000");
});
Event-Action Flow Diagram
┌─────────────────┐
│ session_start │──> Respond with speak/audio or do nothing
└─────────────────┘
┌─────────────────┐
│ user_speak │──> Respond with speak/audio/transfer/hangup
│ (barged_in?) │ Check barged_in flag for interruptions
└─────────────────┘
┌─────────────────┐
│ assistant_speak │──> Optional: track metrics, trigger next action
└─────────────────┘
┌─────────────────┐
│ session_end │──> Cleanup only, no actions accepted
└─────────────────┘
Validation with Zod
The SDK exports Zod schemas for runtime validation:
import { AiFlowEventSchema, AiFlowActionSchema } from "@sipgate/ai-flow-sdk";
// Validate incoming event
try {
const event = AiFlowEventSchema.parse(req.body);
// event is now type-safe
} catch (error) {
console.error("Invalid event:", error);
}
// Validate outgoing action
try {
const action = AiFlowActionSchema.parse({
type: "speak",
session_id: event.session.id,
text: "Hello!",
});
res.json(action);
} catch (error) {
console.error("Invalid action:", error);
}
Outbound Calls
Access Required: Outbound calls are only available upon request and after a positive review by sipgate support (fraud/spam protection). Contact [email protected] to request access.
Use assistant.call() to initiate an outbound call. Once the recipient answers, the session proceeds exactly like an inbound call — the same event handlers apply.
Setup
Pass your token (and optionally baseUrl) when creating the assistant:
const assistant = AiFlowAssistant.create({
token: process.env.SIPGATE_TOKEN,
onSessionStart: async (event) => {
if (event.session.direction === "outbound") {
return "Hello! This is an automated call. Do you have a moment?";
}
return "Hello! How can I help you today?";
},
onUserSpeak: async (event) => {
return processWithLLM(event.text);
},
});
Initiating a Call
await assistant.call({
aiFlowId: "e3670012-96a3-4ae5-ac42-87abe22015c3",
billingDevice: "e2", // provided by sipgate support during onboarding
toPhoneNumber: "4915790000687", // E.164 format without leading +
});
| Parameter | Type | Description |
|----------------|--------|-----------------------------------------------------|
| aiFlowId | string | ID of the AI flow to use for the call |
| billingDevice | string | Billing device suffix, provided during onboarding |
| toPhoneNumber | string | Target phone number in E.164 format without leading + |
Custom Base URL
const assistant = AiFlowAssistant.create({
token: process.env.SIPGATE_TOKEN,
baseUrl: "https://api.sipgate.com", // default, can be omitted
});
Handling the Session
When the recipient answers, your onSessionStart handler fires with session.direction === "outbound". The direction field is available on the session object of every subsequent event as well.
Troubleshooting
Common Issues
WebSocket Connection Errors
If you encounter WebSocket connection issues:
wss.on("connection", (ws, req) => {
ws.on("error", (error) => {
console.error("WebSocket error:", error);
});
ws.on("close", (code, reason) => {
console.log(`Connection closed: ${code} - ${reason}`);
});
ws.on("message", assistant.ws(ws));
});
Event Validation Errors
Use Zod schemas to validate incoming events:
import { AiFlowEventSchema } from "@sipgate/ai-flow-sdk";
app.post("/webhook", async (req, res) => {
try {
const event = AiFlowEventSchema.parse(req.body);
const action = await assistant.onEvent(event);
if (action) {
res.json(action);
} else {
res.status(204).send();
}
} catch (error) {
console.error("Invalid event:", error);
res.status(400).json({ error: "Invalid event format" });
}
});
Debug Mode
Enable debug logging to see all events and actions:
const assistant = AiFlowAssistant.create({
debug: true, // Logs all events and actions
// ... your handlers
});
Audio Format Issues
When using the audio action, ensure your audio is in the correct format:
- Format: WAV
- Sample Rate: 8kHz
- Channels: Mono
- Bit Depth: 16-bit PCM
- Encoding: Base64
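As an illustrative sanity check (not part of the SDK), you can inspect a WAV buffer's header before sending it. This sketch assumes a canonical RIFF layout with the fmt chunk starting at byte offset 12, which holds for most standard WAV files:

```typescript
import { Buffer } from "node:buffer";

// Illustrative helper: verify a WAV buffer is 8 kHz, mono, 16-bit PCM
// before base64-encoding it for an audio action.
// Assumes the "fmt " chunk sits at byte offset 12 (canonical RIFF layout).
function isValidAiFlowWav(buf: Buffer): boolean {
  if (buf.length < 44) return false; // too short for a standard header
  if (buf.toString("ascii", 0, 4) !== "RIFF") return false;
  if (buf.toString("ascii", 8, 12) !== "WAVE") return false;
  if (buf.toString("ascii", 12, 16) !== "fmt ") return false;
  const audioFormat = buf.readUInt16LE(20);   // 1 = PCM
  const channels = buf.readUInt16LE(22);      // 1 = mono
  const sampleRate = buf.readUInt32LE(24);    // 8000 Hz expected
  const bitsPerSample = buf.readUInt16LE(34); // 16-bit expected
  return (
    audioFormat === 1 && channels === 1 && sampleRate === 8000 && bitsPerSample === 16
  );
}
```

If the check fails, re-encode the file (for example with an audio tool of your choice) rather than sending it as-is.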
// Example: Read an already correctly formatted WAV file and base64-encode it
import fs from "fs";
const audioBuffer = fs.readFileSync("audio.wav");
const base64Audio = audioBuffer.toString("base64");
return {
type: AiFlowActionType.AUDIO,
session_id: event.session.id,
audio: base64Audio,
};
Additional Resources
- Official Documentation: sipgate.de/lp/ai-flow
- Support Email: [email protected]
- GitHub Issues: Report bugs or request features
License
Apache-2.0
Need help? Contact us at [email protected]
