azure-realtime-webrtc
v0.2.1

TypeScript SDK for Azure OpenAI Realtime API with WebRTC and WebSocket support
azure-realtime-webrtc is the missing npm package for Azure OpenAI's Realtime API. It handles the complex wiring of ephemeral tokens, SDP negotiation, WebRTC data channels, audio streams, and function calling — so you can build voice AI in minutes, not days.
// A working voice assistant in a dozen lines
import { VoiceAssistant } from "azure-realtime-webrtc/sdk";
const assistant = new VoiceAssistant({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider: () => fetch("/api/token", { method: "POST" }).then(r => r.json()).then(d => d.token),
instructions: "You are a helpful assistant.",
voice: "alloy",
});
assistant.on("transcript", (entries) => renderConversation(entries));
assistant.on("stateChange", (state) => updateUI(state)); // "listening" | "thinking" | "speaking"
await assistant.start();

What's Inside
| Entry Point | Purpose |
|-------------|---------|
| azure-realtime-webrtc | Low-level WebRTC & WebSocket client, typed events, audio management |
| azure-realtime-webrtc/sdk | High-level classes: VoiceAssistant, TextChat, ToolAgent |
| azure-realtime-webrtc/streaming | Async iterators, ReadableStreams, Server-Sent Events |
| azure-realtime-webrtc/server | Express middleware: token server, SDP proxy, Entra ID auth |
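For orientation, here are the four entry points side by side — these imports match the examples used later in this readme:

import { RealtimeClient } from "azure-realtime-webrtc";                           // low-level client
import { VoiceAssistant, TextChat, ToolAgent } from "azure-realtime-webrtc/sdk";  // high-level classes
import { transcriptStream, eventStream } from "azure-realtime-webrtc/streaming";  // async iterators
import { createRealtimeMiddleware, createEntraAuth } from "azure-realtime-webrtc/server"; // token server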
Features
| Feature | Details |
|---------|---------|
| WebRTC + WebSocket | Both connection modes with a unified API |
| Full TypeScript | Strict types for all 32+ server events and 11 client events |
| SDK: VoiceAssistant | Complete voice chat with state machine (listening → thinking → speaking) |
| SDK: TextChat | Streaming text chat with message history |
| SDK: ToolAgent | Autonomous multi-step tool calling with execution trace |
| Streaming | for await iterators, Web ReadableStreams, SSE handler |
| Audio-Synced Text | Transcript streams in sync with the AI's voice playback |
| Function Calling | registerTool() with automatic call → execute → respond cycle |
| Express Middleware | Drop-in token server with rate limiting, CORS, input validation |
| Both Auth Methods | API Key and Microsoft Entra ID |
| Zero Runtime Deps | Only express as optional peer dep for the server module |
| Security First | API keys never reach the browser. Ephemeral tokens only. |
Install
npm install azure-realtime-webrtc

For the server module:

npm install azure-realtime-webrtc express

For Entra ID:

npm install azure-realtime-webrtc express @azure/identity

Prerequisites
You need three things from the Azure Portal:
| Value | Where to find it | Example |
|-------|-------------------|---------|
| Resource name | Your Azure OpenAI resource URL: https://<THIS>.openai.azure.com | my-openai-resource |
| API Key | Azure Portal → Your OpenAI resource → Keys and Endpoint | abc123... |
| Deployment name | Azure AI Foundry → Deployments (must be a realtime model) | gpt-4o-realtime-preview |
Your deployment must be a realtime-capable model deployed in East US 2 or Sweden Central.
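In the examples below these values map to environment variables (the same names this readme's Next.js route uses):

// Server-side only — the API key must never be bundled into the browser build
const resource = process.env.AZURE_RESOURCE!;      // e.g. "my-openai-resource"
const deployment = process.env.AZURE_DEPLOYMENT!;  // e.g. "gpt-4o-realtime-preview"
const apiKey = process.env.AZURE_OPENAI_API_KEY!;  // from Keys and Endpoint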
Architecture
Browser Your Server Azure OpenAI
│ │ │
│ POST /api/realtime/token │ │
│─────────────────────────────────>│ POST /client_secrets │
│ │─────────────────────────────>│
│ │ { value: ephemeral_token } │
│ { token: ephemeral_token } │<─────────────────────────────│
│<─────────────────────────────────│ │
│ │
│ WebRTC SDP offer + ephemeral token │
│────────────────────────────────────────────────────────────────>│
│ SDP answer │
│<────────────────────────────────────────────────────────────────│
│ │
│ ◄══════════ Bidirectional Audio (WebRTC media) ═══════════► │
│ ◄══════════ JSON Events (WebRTC data channel) ═══════════► │

Your API key never leaves your server. Only short-lived ephemeral tokens reach the browser.
Quick Start
Step 1: Token Server (Node.js)
import express from "express";
import { createRealtimeMiddleware } from "azure-realtime-webrtc/server";
const app = express();
app.use(createRealtimeMiddleware({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
auth: { type: "api-key", apiKey: process.env.AZURE_OPENAI_API_KEY! },
session: {
instructions: "You are a helpful assistant.",
audio: { output: { voice: "alloy" } },
},
express,
}));
app.listen(3001);

Creates: POST /api/realtime/token · POST /api/realtime/negotiate · GET /api/realtime/health
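A quick way to confirm the routes are mounted — the health response body isn't specified here, so this only checks the status code:

const health = await fetch("http://localhost:3001/api/realtime/health");
console.log(health.status); // expect 200 once the middleware is mounted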
Step 2: Browser Client
import { RealtimeClient } from "azure-realtime-webrtc";
const res = await fetch("/api/realtime/token", { method: "POST" });
const { token } = await res.json();
const client = new RealtimeClient({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
ephemeralToken: token,
webrtcFilter: true,
});
client.on("session.created", () => console.log("Ready!"));
client.on("response.audio_transcript.delta", (e) => console.log(e.delta));
client.on("error", (e) => console.error(e.error.message));
await client.connect(); // mic + audio playback handled automatically
client.addItem({
type: "message", role: "user",
content: [{ type: "input_text", text: "Hello!" }],
});
client.createResponse();

High-Level SDK
Import from azure-realtime-webrtc/sdk. These classes handle all event wiring, state management, and conversation lifecycle for you.
VoiceAssistant
Complete voice chat with automatic state machine: idle → connecting → listening → thinking → speaking → listening …
import { VoiceAssistant } from "azure-realtime-webrtc/sdk";
const assistant = new VoiceAssistant({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider: async () => {
const res = await fetch("/api/realtime/token", { method: "POST" });
return (await res.json()).token;
},
instructions: "You are a travel advisor.",
voice: "coral",
transcriptionModel: "whisper-1",
});
assistant.on("transcript", (entries) => {
// Full conversation — renders both user speech and AI responses
entries.forEach((e) => console.log(`[${e.role}] ${e.text}${e.partial ? "..." : ""}`));
});
assistant.on("stateChange", (state) => updateStatusBadge(state));
await assistant.start(); // connects, requests mic, starts listening
assistant.sendText("Beach destinations in Europe?");
assistant.setMuted(true); // mute mic
assistant.interrupt(); // stop AI mid-sentence
assistant.updateInstructions("Focus on budget options.");
assistant.stop(); // disconnect

Events: transcript · stateChange · userSpeechStarted · userSpeechStopped · assistantAudioStarted · assistantAudioStopped · error · rawEvent
TextChat
Streaming text chat. No mic complexity. Ideal for chatbots.
import { TextChat } from "azure-realtime-webrtc/sdk";
const chat = new TextChat({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider,
instructions: "You are customer support for Acme Corp.",
});
chat.on("message", (msg) => {
if (msg.streaming) updateBubble(msg.id, msg.content);
else finalizeBubble(msg.id, msg.content);
});
chat.on("responseStart", () => showTypingIndicator());
chat.on("responseEnd", () => hideTypingIndicator());
await chat.connect();
chat.send("How do I reset my password?");

Events: message · messages · responseStart · responseEnd · connected · error · rawEvent
ToolAgent
Autonomous agent that handles multi-turn tool calling loops. Send a task, get a result with full execution trace.
import { ToolAgent } from "azure-realtime-webrtc/sdk";
const agent = new ToolAgent({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider,
instructions: "You are a research assistant. Use tools to find information.",
maxToolRounds: 10,
});
agent.registerTool({
definition: {
type: "function", name: "web_search",
description: "Search the web",
parameters: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
},
handler: async (args) => JSON.stringify(await searchWeb(args.query)),
});
agent.on("step", (step) => console.log(`[${step.type}] ${step.content}`));
await agent.connect();
const result = await agent.run("Latest WebRTC developments in 2026");
console.log(result.response); // Final answer
console.log(result.toolCallCount); // How many tool calls were made
console.log(result.steps); // Full execution trace

Events: step · toolCall · toolResult · textDelta · runComplete · connected · error · rawEvent
Streaming
Import from azure-realtime-webrtc/streaming. Works with any client — RealtimeClient, VoiceAssistant, TextChat, or ToolAgent.
Async Iterators
import { transcriptStream, audioStream, eventStream } from "azure-realtime-webrtc/streaming";
// Stream transcript word by word
for await (const chunk of transcriptStream(client)) {
if (chunk.type === "delta") process.stdout.write(chunk.text);
if (chunk.type === "done") console.log(`\n[${chunk.role}] Complete`);
}
// Stream all events
for await (const { type, event } of eventStream(client)) {
console.log(type, event);
}

Web ReadableStreams
import { createTranscriptReadableStream } from "azure-realtime-webrtc/streaming";
const stream = createTranscriptReadableStream(client);
const reader = stream.getReader();
while (true) {
const { value, done } = await reader.read();
if (done) break;
document.getElementById("output").textContent += value.text;
}
// Or pipe to a Response (Next.js streaming route, Cloudflare Worker)
return new Response(stream.pipeThrough(new TransformStream({
transform(chunk, ctrl) { ctrl.enqueue(`data: ${JSON.stringify(chunk)}\n\n`); },
})), { headers: { "Content-Type": "text/event-stream" } });

Server-Sent Events (SSE)
import { createSSEHandler } from "azure-realtime-webrtc/streaming";
// Express endpoint
app.get("/api/stream", async (req, res) => {
const client = new RealtimeClient({ ... });
await client.connect();
createSSEHandler(client, { events: "transcript" })(req, res);
});
// Browser
const source = new EventSource("/api/stream");
source.addEventListener("response.output_audio_transcript.delta", (e) => {
console.log(JSON.parse(e.data).text);
});

Audio-Synced Transcript
Text transcript arrives faster than audio plays. To show text in sync with the AI's voice:
let wordBuffer = [], displayedText = "", dripTimer = null;
client.on("output_audio_buffer.started", () => {
dripTimer = setInterval(() => {
const word = wordBuffer.shift();
if (word) { displayedText += word; render(displayedText); }
}, 285); // ~3.5 words/sec = natural speech pace
});
client.on("output_audio_buffer.stopped", () => {
displayedText += wordBuffer.join(""); wordBuffer = [];
clearInterval(dripTimer); render(displayedText);
});
for await (const chunk of transcriptStream(client)) {
if (chunk.role === "assistant" && chunk.type === "delta") {
wordBuffer.push(chunk.text); // buffered, not shown yet
}
}

Framework Guides
React
import { useRef, useState, useCallback, useEffect } from "react";
import { VoiceAssistant, TranscriptEntry, VoiceAssistantState } from "azure-realtime-webrtc/sdk";
export function useVoiceAssistant() {
const ref = useRef<VoiceAssistant | null>(null);
const [state, setState] = useState<VoiceAssistantState>("idle");
const [transcript, setTranscript] = useState<TranscriptEntry[]>([]);
const start = useCallback(async () => {
const a = new VoiceAssistant({
resource: process.env.NEXT_PUBLIC_AZURE_RESOURCE!,
deployment: process.env.NEXT_PUBLIC_AZURE_DEPLOYMENT!,
tokenProvider: async () => {
const res = await fetch("/api/realtime/token", { method: "POST" });
return (await res.json()).token;
},
instructions: "You are a helpful assistant.",
voice: "alloy",
});
a.on("stateChange", setState);
a.on("transcript", setTranscript);
ref.current = a;
await a.start();
}, []);
const stop = useCallback(() => { ref.current?.stop(); ref.current = null; }, []);
const sendText = useCallback((t: string) => ref.current?.sendText(t), []);
const toggleMute = useCallback(() => {
const a = ref.current;
if (a) a.setMuted(!a.isMuted);
}, []);
useEffect(() => () => { ref.current?.stop(); }, []);
return { state, transcript, start, stop, sendText, toggleMute };
}

Next.js (App Router)
// app/api/realtime/token/route.ts
import { NextResponse } from "next/server";
export async function POST() {
const res = await fetch(
`https://${process.env.AZURE_RESOURCE}.openai.azure.com/openai/v1/realtime/client_secrets`,
{
method: "POST",
headers: { "api-key": process.env.AZURE_OPENAI_API_KEY!, "Content-Type": "application/json" },
body: JSON.stringify({
session: { type: "realtime", model: process.env.AZURE_DEPLOYMENT! },
}),
}
);
const data = await res.json();
return NextResponse.json({ token: data.value });
}

Vue / Nuxt
<script setup lang="ts">
import { ref, onUnmounted } from "vue";
import { VoiceAssistant } from "azure-realtime-webrtc/sdk";
const state = ref("idle");
const transcript = ref([]);
let assistant = null;
async function start() {
assistant = new VoiceAssistant({
resource: import.meta.env.VITE_AZURE_RESOURCE,
deployment: import.meta.env.VITE_AZURE_DEPLOYMENT,
tokenProvider: async () => (await fetch("/api/realtime/token", { method: "POST" }).then(r => r.json())).token,
});
assistant.on("stateChange", (s) => (state.value = s));
assistant.on("transcript", (t) => (transcript.value = t));
await assistant.start();
}
onUnmounted(() => assistant?.stop());
</script>

Angular
import { Injectable, OnDestroy } from "@angular/core";
import { BehaviorSubject } from "rxjs";
import { VoiceAssistant, TranscriptEntry, VoiceAssistantState } from "azure-realtime-webrtc/sdk";
@Injectable({ providedIn: "root" })
export class RealtimeService implements OnDestroy {
private assistant: VoiceAssistant | null = null;
state$ = new BehaviorSubject<VoiceAssistantState>("idle");
transcript$ = new BehaviorSubject<TranscriptEntry[]>([]);
async start() {
this.assistant = new VoiceAssistant({ /* config */ });
this.assistant.on("stateChange", (s) => this.state$.next(s));
this.assistant.on("transcript", (t) => this.transcript$.next(t));
await this.assistant.start();
}
stop() { this.assistant?.stop(); }
ngOnDestroy() { this.stop(); }
}

Vanilla JavaScript
<script type="module">
import { VoiceAssistant } from "https://cdn.jsdelivr.net/npm/azure-realtime-webrtc/dist/sdk.js";
const assistant = new VoiceAssistant({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider: async () => (await fetch("/api/token", { method: "POST" }).then(r => r.json())).token,
});
assistant.on("transcript", (e) => { /* render */ });
await assistant.start();
</script>

Node.js (Server-Only)
import { RealtimeClient } from "azure-realtime-webrtc";
const client = new RealtimeClient({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
mode: "websocket",
auth: { type: "api-key", apiKey: process.env.AZURE_OPENAI_API_KEY! },
});
client.on("response.text.done", (e) => console.log("AI:", e.text));
await client.connect();
client.addItem({ type: "message", role: "user", content: [{ type: "input_text", text: "Explain WebRTC." }] });
client.createResponse();

API Reference
RealtimeClient
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| resource | string | required | Azure resource name |
| deployment | string | required | Model deployment name |
| mode | "webrtc" \| "websocket" | "webrtc" | Connection mode |
| ephemeralToken | string | — | Token from your server (browser) |
| auth | AuthConfig | — | Direct auth (server-side) |
| session | SessionConfig | — | Session config |
| webrtcFilter | boolean | false | Filter data channel events |
| autoMicrophone | boolean | true | Auto-request mic |
| channelTimeout | number | 10000 | Data channel open timeout (ms) |
| Method | Description |
|--------|-------------|
| connect() | Connect to Azure OpenAI Realtime |
| send(event) | Send any client event |
| createResponse() | Trigger model response |
| updateSession(config) | Update session mid-conversation |
| addItem(item) | Add a conversation item |
| registerTool(reg) | Register a function tool with auto-handling |
| setMicrophoneMuted(muted) | Mute/unmute mic (WebRTC only) |
| disconnect() | Disconnect and cleanup |
createRealtimeMiddleware(options)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| resource | string | required | Azure resource name |
| deployment | string | required | Deployment name |
| auth | AuthConfig | required | Server-side auth |
| session | SessionConfig | — | Default session config |
| prefix | string | "/api/realtime" | Route prefix |
| corsOrigin | string | "*" | CORS origin |
| rateLimit | number | 10 | Requests/min/IP |
| express | Express | — | Required in ESM |
createEntraAuth(options?)
import { createEntraAuth } from "azure-realtime-webrtc/server";
const auth = await createEntraAuth({ tenantId: "...", clientId: "..." });

Session Configuration
const session = {
instructions: "You are a helpful assistant.",
audio: {
output: { voice: "alloy", format: "pcm16" },
input: {
format: "pcm16",
transcription: { model: "whisper-1" },
turn_detection: {
type: "server_vad",
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 200,
create_response: true,
},
},
},
modalities: ["audio", "text"],
temperature: 0.8,
max_response_output_tokens: 4096,
tools: [{ type: "function", name: "...", description: "...", parameters: { ... } }],
tool_choice: "auto",
};

Voices: alloy · ash · ballad · coral · echo · sage · shimmer · verse · marin
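Session settings can also be adjusted mid-conversation with updateSession() on a connected RealtimeClient — a minimal sketch reusing the shape above:

// Tighten instructions and sampling without reconnecting
client.updateSession({ instructions: "Answer in one sentence.", temperature: 0.4 });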
Function Calling
client.registerTool({
definition: {
type: "function", name: "get_weather",
description: "Get current weather for a city",
parameters: { type: "object", properties: { city: { type: "string" } }, required: ["city"] },
},
handler: async (args: { city: string }) => {
const data = await fetchWeather(args.city);
return JSON.stringify(data);
},
});
// Model calls get_weather → handler runs → result sent back → model continues

Events Reference
Server Events
| Category | Events |
|----------|--------|
| Session | session.created · session.updated |
| Conversation | conversation.created · conversation.item.created · conversation.item.deleted · conversation.item.truncated |
| User Speech | input_audio_buffer.speech_started · input_audio_buffer.speech_stopped · conversation.item.input_audio_transcription.completed |
| AI Transcript | response.audio_transcript.delta · response.audio_transcript.done · response.output_audio_transcript.delta · response.output_audio_transcript.done |
| AI Text | response.text.delta · response.text.done · response.output_text.delta · response.output_text.done |
| AI Audio | response.audio.delta · response.audio.done · output_audio_buffer.started · output_audio_buffer.stopped |
| Tool Calls | response.function_call_arguments.delta · response.function_call_arguments.done |
| Response | response.created · response.done · response.output_item.added · response.content_part.added |
| Other | error · rate_limits.updated |
Wildcard + Connection
client.on("*", (event) => console.log(event.type)); // all events
client.on("connected", () => { }); // connected
client.on("disconnected", ({ reason }) => { }); // disconnected

Supported Models
| Model | Version |
|-------|---------|
| gpt-4o-mini-realtime-preview | 2024-12-17 |
| gpt-4o-realtime-preview | 2024-12-17 |
| gpt-realtime | 2025-08-28 |
| gpt-realtime-mini | 2025-10-06, 2025-12-15 |
| gpt-realtime-1.5 | 2026-02-23 |
Regions: East US 2 and Sweden Central only.
Security
| Measure | Details |
|---------|---------|
| Token isolation | API keys never reach the browser — only short-lived ephemeral tokens |
| Rate limiting | Built-in per-IP sliding window (configurable) |
| Input validation | SDP offers validated (format + 64KB max). JSON bodies validated. |
| Security headers | Cache-Control: no-store · X-Content-Type-Options: nosniff |
| No eval | All JSON parsed with JSON.parse. No eval() or Function(). |
| CORS | Configurable origin restrictions |
Production Checklist
- [ ] Set corsOrigin to your specific domain(s)
- [ ] Use environment variables for API keys
- [ ] Deploy behind HTTPS
- [ ] Use Entra ID instead of API keys (createEntraAuth(), sketched below)
- [ ] Set up Azure billing alerts
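A sketch of a hardened token server combining the documented middleware options. It assumes the object returned by createEntraAuth() is accepted as the middleware's auth option, and AZURE_TENANT_ID / AZURE_CLIENT_ID are placeholder variable names:

import express from "express";
import { createRealtimeMiddleware, createEntraAuth } from "azure-realtime-webrtc/server";

const app = express();

// Entra ID instead of a raw API key
const auth = await createEntraAuth({
  tenantId: process.env.AZURE_TENANT_ID!,
  clientId: process.env.AZURE_CLIENT_ID!,
});

app.use(createRealtimeMiddleware({
  resource: process.env.AZURE_RESOURCE!,
  deployment: process.env.AZURE_DEPLOYMENT!,
  auth,
  corsOrigin: "https://app.example.com", // lock CORS to your domain
  rateLimit: 30,                         // requests/min/IP
  express,
}));
app.listen(3001);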
Troubleshooting
| Issue | Solution |
|-------|----------|
| Token request 500 | Voice must be nested: audio.output.voice. Transcription: audio.input.transcription.model. Not flat fields. |
| No transcript streaming | Listen to BOTH event names: response.audio_transcript.delta AND response.output_audio_transcript.delta |
| RTCPeerConnection not available | Use mode: "websocket" for server-side |
| Mic permission denied | Check browser permissions. HTTPS required in production. |
| Data channel timeout | Verify deployment is in East US 2 or Sweden Central |
| Auth 401 | Check API key belongs to the correct resource |
| Text arrives before audio | Use the Audio-Synced Transcript pattern |
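For the transcript row in particular, a listener that covers both delta event names:

// Both names appear in the Events Reference above; register one handler for each
const onDelta = (e: { delta: string }) => process.stdout.write(e.delta);
client.on("response.audio_transcript.delta", onDelta);
client.on("response.output_audio_transcript.delta", onDelta);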
Author & Maintainer
Komal Vardhan Lolugu
Lead Product Engineer — Agentic AI & Generative Models

| Platform | Link |
|----------|------|
| Portfolio | komalsrinivas.vercel.app |
| LinkedIn | linkedin.com/in/komalvardhanlolugu |
| GitHub | github.com/komalSrinivasan |
| Medium | komalvardhan.medium.com |
| Topmate | topmate.io/komal_vardhan_lolugu |
For bugs, questions, or collaboration — reach out via LinkedIn or open an issue.
License
MIT
