# Voice Agent SDK (TypeScript)
TypeScript SDK for building voice agents on the Voice Gateway.
## Installation
```bash
npm install @eleven-am/voice-agent
```

## Quick Start

```ts
import { VoiceAgent, ConnectionModes } from '@eleven-am/voice-agent';

const agent = new VoiceAgent({
  apiKey: 'sk-voice-xxx',
  gatewayUrl: 'wss://gateway.example.com',
  mode: ConnectionModes.WebSocket,
});

agent
  .onUtterance(async (ctx) => {
    console.log(`User said: ${ctx.text}`);

    // Stream the response in chunks
    ctx.sendDelta('Hello ');
    ctx.sendDelta('World!');
    ctx.done();
  })
  .onInterrupt((sessionId, reason) => {
    console.log(`Interrupted: ${reason}`);
  })
  .onError((error) => {
    console.error('Error:', error);
  });

await agent.connect();
```

## Streaming with LLM

```ts
import { VoiceAgent } from '@eleven-am/voice-agent';
import OpenAI from 'openai';

const openai = new OpenAI();

const agent = new VoiceAgent({
  apiKey: process.env.VOICE_API_KEY!,
  gatewayUrl: process.env.GATEWAY_URL!,
});

agent.onUtterance(async (ctx) => {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: ctx.text }],
    stream: true,
  });

  for await (const chunk of stream) {
    // Stop streaming as soon as the user interrupts
    if (ctx.abortSignal.aborted) break;

    const delta = chunk.choices[0]?.delta?.content;
    if (delta) {
      ctx.sendDelta(delta);
    }
  }

  ctx.done();
});

await agent.connect();
```

## Using Vision (Video Frames)
Agents with vision scope can request video frames from the user's session:
```ts
agent.onUtterance(async (ctx) => {
  // Check if vision context is available
  if (ctx.vision?.available) {
    console.log('Auto-analyzed:', ctx.vision.description);
  }

  // Request raw frames for custom analysis
  const frames = await ctx.requestFrames({
    limit: 5,
    rawBase64: true,
  });

  if (frames.frames) {
    for (const frame of frames.frames) {
      // frame.base64 contains the image data
      // frame.timestamp is when it was captured
    }
  }

  // Or get pre-analyzed descriptions
  const analyzed = await ctx.requestFrames({ limit: 3 });
  if (analyzed.descriptions) {
    console.log('Frame descriptions:', analyzed.descriptions);
  }

  ctx.done("I can see what you're showing me!");
});
```
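The raw frames can be fed to any vision-capable model for custom analysis. A minimal sketch using the OpenAI client from the streaming example (the `gpt-4o` model choice and the JPEG MIME type are illustrative assumptions, not part of this SDK):

```ts
import OpenAI from 'openai';

const openai = new OpenAI();

agent.onUtterance(async (ctx) => {
  // Fetch a few raw frames to analyze ourselves
  const frames = await ctx.requestFrames({ limit: 3, rawBase64: true });

  if (!frames.frames?.length) {
    ctx.done("I can't see anything right now.");
    return;
  }

  // Attach the frames to the utterance as data URLs (assuming JPEG frames)
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: ctx.text },
          ...frames.frames.map((frame) => ({
            type: 'image_url' as const,
            image_url: { url: `data:image/jpeg;base64,${frame.base64}` },
          })),
        ],
      },
    ],
  });

  ctx.done(response.choices[0]?.message?.content ?? '');
});
```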
## Using Memory

Agents with memory scope can query the user's stored facts:

```ts
agent.onUtterance(async (ctx) => {
  // Query memories relevant to the current utterance
  const memories = await ctx.queryMemory({
    query: ctx.text,
    topK: 5,
    threshold: 0.7,
    types: ['preference', 'fact'],
  });

  if (memories.facts && memories.facts.length > 0) {
    const context = memories.facts
      .map((f) => f.content)
      .join('\n');

    // Use memories as context for the LLM
    const response = await generateWithContext(ctx.text, context);
    ctx.done(response);
  } else {
    ctx.done("I don't have any relevant memories about that.");
  }
});
```
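`generateWithContext` above is not part of the SDK. One possible implementation, reusing the OpenAI client from the streaming example (the helper name and prompt wording are illustrative):

```ts
import OpenAI from 'openai';

const openai = new OpenAI();

// Hypothetical helper: answer the utterance using retrieved memories as context.
async function generateWithContext(text: string, context: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: `Known facts about the user:\n${context}` },
      { role: 'user', content: text },
    ],
  });
  return response.choices[0]?.message?.content ?? '';
}
```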
## Handling Interrupts

When the user starts speaking, the gateway sends an interrupt. Use the abort signal to stop processing:

```ts
agent.onUtterance(async (ctx) => {
  for await (const chunk of streamResponse(ctx.text)) {
    // Check before each operation
    if (ctx.abortSignal.aborted) {
      console.log('User interrupted, stopping');
      return;
    }
    ctx.sendDelta(chunk);
  }
  ctx.done();
});

agent.onInterrupt((sessionId, reason) => {
  // reason: "new_user_speech" | "lost_arbitration" | "supersede"
  console.log(`Session ${sessionId} interrupted: ${reason}`);
});
```
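`streamResponse` above stands in for whatever token stream your LLM produces. A sketch built on the OpenAI streaming example (the helper name and generator shape are illustrative):

```ts
import OpenAI from 'openai';

const openai = new OpenAI();

// Hypothetical helper: yield response chunks as the model streams them.
async function* streamResponse(text: string): AsyncGenerator<string> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: text }],
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) yield delta;
  }
}
```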
## Configuration

```ts
interface VoiceAgentConfig {
  apiKey: string;                 // Your API key (sk-voice-xxx)
  gatewayUrl: string;             // Gateway WebSocket/HTTP URL
  mode?: ConnectionMode;          // 'websocket' (default) or 'sse'
  reconnect?: boolean;            // Auto-reconnect on disconnect (default: true)
  reconnectInterval?: number;     // Base reconnect delay in ms (default: 1000)
  maxReconnectAttempts?: number;  // Max reconnect attempts (default: unlimited)
  logger?: Logger;                // Custom logger (default: console)
}
```
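For example, a sketch of an agent that retries up to ten times with a 2-second base delay (the values are illustrative):

```ts
const agent = new VoiceAgent({
  apiKey: process.env.VOICE_API_KEY!,
  gatewayUrl: process.env.GATEWAY_URL!,
  reconnect: true,            // retry automatically on disconnect
  reconnectInterval: 2000,    // base delay of 2s between attempts
  maxReconnectAttempts: 10,   // give up after 10 failed attempts
});
```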
## Context API

The `UtteranceContext` passed to `onUtterance` provides:
| Property | Type | Description |
|----------|------|-------------|
| `text` | `string` | The user's utterance text |
| `isFinal` | `boolean` | Whether this is a final transcript |
| `user` | `UserInfo \| undefined` | User info (if profile/email/location scope) |
| `vision` | `VisionContext \| undefined` | Vision context (if vision scope) |
| `sessionId` | `string` | Current session ID |
| `requestId` | `string` | Current request ID |
| `userId` | `string \| undefined` | User ID |
| `timestamp` | `Date` | When the utterance was received |
| `abortSignal` | `AbortSignal` | Signals when interrupted |

| Method | Description |
|--------|-------------|
| `sendDelta(delta)` | Stream a text chunk to the user |
| `done(finalText?)` | Complete the response |
| `requestFrames(options?)` | Request video frames (async) |
| `queryMemory(options)` | Query user memories (async) |
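A quick sketch tying these together (assuming non-final transcripts can simply be ignored, and that `UserInfo` carries a `name` field; both are illustrative):

```ts
agent.onUtterance(async (ctx) => {
  // Skip partial transcripts; only respond to the final one
  if (!ctx.isFinal) return;

  console.log(`[${ctx.sessionId}] utterance received at ${ctx.timestamp.toISOString()}`);

  // Personalize the reply when the profile scope is granted
  const name = ctx.user?.name ?? 'there';
  ctx.done(`Hi ${name}, you said: ${ctx.text}`);
});
```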
## Connection Modes
### WebSocket (recommended)

Full-duplex communication with lower latency:

```ts
const agent = new VoiceAgent({
  mode: ConnectionModes.WebSocket,
  // ...
});
```

### Server-Sent Events (SSE)
One-way server push, with the agent sending responses via HTTP POST. Works in browser environments:

```ts
const agent = new VoiceAgent({
  mode: ConnectionModes.SSE,
  // ...
});
```

## License
MIT
