# HU SDK (TypeScript)
TypeScript SDK for building voice agents on the Voice Gateway.
## Installation

```bash
npm install @eleven-am/hu-sdk
```

## Quick Start

```typescript
import { VoiceAgent, ConnectionModes } from '@eleven-am/hu-sdk';

const agent = new VoiceAgent({
  apiKey: 'sk-voice-xxx',
  gatewayUrl: 'wss://gateway.example.com',
  mode: ConnectionModes.WebSocket,
});

agent
  .onUtterance(async (ctx) => {
    console.log(`User said: ${ctx.text}`);

    // Stream response
    ctx.sendDelta('Hello ');
    ctx.sendDelta('World!');
    ctx.done();
  })
  .onInterrupt((sessionId, reason) => {
    console.log(`Interrupted: ${reason}`);
  })
  .onError((error) => {
    console.error('Error:', error);
  });

await agent.connect();
```

## Streaming with LLM
```typescript
import { VoiceAgent } from '@eleven-am/hu-sdk';
import OpenAI from 'openai';

const openai = new OpenAI();

const agent = new VoiceAgent({
  apiKey: process.env.VOICE_API_KEY!,
  gatewayUrl: process.env.GATEWAY_URL!,
});

agent.onUtterance(async (ctx) => {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: ctx.text }],
    stream: true,
  });

  for await (const chunk of stream) {
    // Stop generating as soon as the user interrupts
    if (ctx.abortSignal.aborted) break;

    const delta = chunk.choices[0]?.delta?.content;
    if (delta) {
      ctx.sendDelta(delta);
    }
  }

  ctx.done();
});

await agent.connect();
```

## Using Vision (Video Frames)
Agents with vision scope can request video frames from the user's session:
```typescript
agent.onUtterance(async (ctx) => {
  // Check if vision context is available
  if (ctx.vision?.available) {
    console.log('Auto-analyzed:', ctx.vision.description);
  }

  // Request raw frames for custom analysis
  const frames = await ctx.requestFrames({
    limit: 5,
    rawBase64: true,
  });

  if (frames.frames) {
    for (const frame of frames.frames) {
      // frame.base64 contains the image data
      // frame.timestamp is when it was captured
    }
  }

  // Or get pre-analyzed descriptions
  const analyzed = await ctx.requestFrames({ limit: 3 });
  if (analyzed.descriptions) {
    console.log('Frame descriptions:', analyzed.descriptions);
  }

  ctx.done('I can see what you\'re showing me!');
});
```

## Using Memory
Agents with memory scope can query the user's stored facts:
```typescript
agent.onUtterance(async (ctx) => {
  // Query relevant memories
  const memories = await ctx.queryMemory({
    query: ctx.text,
    topK: 5,
    threshold: 0.7,
    types: ['preference', 'fact'],
  });

  if (memories.facts && memories.facts.length > 0) {
    const context = memories.facts
      .map((f) => f.content)
      .join('\n');

    // Use memories as context for the LLM
    // (generateWithContext is a placeholder for your own generation logic)
    const response = await generateWithContext(ctx.text, context);
    ctx.done(response);
  } else {
    ctx.done('I don\'t have any relevant memories about that.');
  }
});
```

## Routing Filters
Agents can register filters to control which utterances are routed to them. Filters are evaluated server-side for efficient routing in multi-agent setups:
```typescript
await agent.connect();

// Register filters after connecting
agent.registerFilters({
  // Match utterances containing these entity types or values
  entities: ['PERSON', 'John'],
  // Match utterances about these topics
  topics: ['weather', 'travel'],
  // Match utterances containing these keywords
  keywords: ['urgent', 'help'],
  // Match specific speakers
  speakers: ['user'],
  // Number of previous utterances to include for context (used with the "filtered" tier)
  includeContext: 5,
  // Data access tier - controls what data the agent receives:
  //   "full": everything (whole conversation stream)
  //   "filtered": matching messages + context window (default)
  //   "summary": just {entities, topics} - no text
  tier: 'filtered',
});
```

Filters can be updated at any time while connected. The gateway will apply the new filters to subsequent utterances.
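For example, an agent could narrow its routing mid-session by calling `registerFilters` again (a minimal sketch; the topic and keyword values below are illustrative):

```typescript
// Later, while still connected: update the routing filters.
// The values here are illustrative only.
agent.registerFilters({
  topics: ['billing'],
  keywords: ['invoice', 'refund'],
  includeContext: 3,
  tier: 'filtered',
});
```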
## Handling Interrupts
When the user starts speaking, the gateway sends an interrupt. Use the abort signal to stop processing:
```typescript
agent.onUtterance(async (ctx) => {
  // streamResponse is a placeholder for your own streaming generation logic
  for await (const chunk of streamResponse(ctx.text)) {
    // Check before each operation
    if (ctx.abortSignal.aborted) {
      console.log('User interrupted, stopping');
      return;
    }
    ctx.sendDelta(chunk);
  }
  ctx.done();
});

agent.onInterrupt((sessionId, reason) => {
  // reason: "new_user_speech" | "lost_arbitration" | "supersede"
  console.log(`Session ${sessionId} interrupted: ${reason}`);
});
```

## Configuration
```typescript
interface VoiceAgentConfig {
  apiKey: string;                 // Your API key (sk-voice-xxx)
  gatewayUrl: string;             // Gateway WebSocket/HTTP URL
  mode?: ConnectionMode;          // 'websocket' (default) or 'sse'
  reconnect?: boolean;            // Auto-reconnect on disconnect (default: true)
  reconnectInterval?: number;     // Base reconnect delay in ms (default: 1000)
  maxReconnectAttempts?: number;  // Max reconnect attempts (default: unlimited)
  logger?: Logger;                // Custom logger (default: console)
}
```
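For example, a configuration that tunes the reconnect behaviour might look like this (a minimal sketch; the URL and retry values are placeholders):

```typescript
import { VoiceAgent, ConnectionModes } from '@eleven-am/hu-sdk';

const agent = new VoiceAgent({
  apiKey: process.env.VOICE_API_KEY!,
  gatewayUrl: 'wss://gateway.example.com',  // placeholder URL
  mode: ConnectionModes.WebSocket,
  reconnect: true,            // retry automatically on disconnect
  reconnectInterval: 2000,    // base delay between retries, in ms
  maxReconnectAttempts: 10,   // stop retrying after 10 attempts
});
```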
## Context API

The `UtteranceContext` provides:
| Property | Type | Description |
|----------|------|-------------|
| text | string | The user's utterance text |
| isFinal | boolean | Whether this is a final transcript |
| user | UserInfo \| undefined | User info (if profile/email/location scope) |
| vision | VisionContext \| undefined | Vision context (if vision scope) |
| entities | EntityInfo[] | Entities extracted from the utterance (NER) |
| topics | string[] | Topics detected in the utterance |
| context | ContextUtterance[] | Previous utterances (if includeContext filter set) |
| sessionId | string | Current session ID |
| requestId | string | Current request ID |
| userId | string \| undefined | User ID |
| timestamp | Date | When the utterance was received |
| abortSignal | AbortSignal | Signals when interrupted |

| Method | Description |
|--------|-------------|
| sendDelta(delta) | Stream a text chunk to the user |
| done(finalText?) | Complete the response |
| requestFrames(options?) | Request video frames (async) |
| queryMemory(options) | Query user memories (async) |
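
As a quick illustration of the read-only properties, a handler can inspect what the gateway extracted before responding (a minimal sketch; the log output is illustrative):

```typescript
agent.onUtterance(async (ctx) => {
  // Metadata extracted by the gateway for this utterance
  console.log('Topics:', ctx.topics);
  console.log('Entities:', JSON.stringify(ctx.entities));

  // Previous utterances are only populated when includeContext is set in the filters
  console.log('Context window size:', ctx.context.length);

  ctx.done(`Handled request ${ctx.requestId}.`);
});
```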
## Connection Modes

### WebSocket (recommended)
Full-duplex communication, lower latency:
```typescript
const agent = new VoiceAgent({
  mode: ConnectionModes.WebSocket,
  // ...
});
```

### Server-Sent Events (SSE)
One-way server push with HTTP POST for sending. Works in browser environments:
```typescript
const agent = new VoiceAgent({
  mode: ConnectionModes.SSE,
  // ...
});
```

## License
MIT
