azure-realtime-webrtc
v0.2.1

TypeScript SDK for Azure OpenAI Realtime API with WebRTC and WebSocket support
azure-realtime-webrtc is the missing npm package for Azure OpenAI's Realtime API. It handles the complex wiring of ephemeral tokens, SDP negotiation, WebRTC data channels, audio streams, and function calling — so you can build voice AI in minutes, not days.
// A working voice assistant in a dozen lines
import { VoiceAssistant } from "azure-realtime-webrtc/sdk";
const assistant = new VoiceAssistant({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider: () => fetch("/api/token", { method: "POST" }).then(r => r.json()).then(d => d.token),
instructions: "You are a helpful assistant.",
voice: "alloy",
});
assistant.on("transcript", (entries) => renderConversation(entries));
assistant.on("stateChange", (state) => updateUI(state)); // "listening" | "thinking" | "speaking"
await assistant.start();

What's Inside
| Entry Point | Purpose |
|-------------|---------|
| azure-realtime-webrtc | Low-level WebRTC & WebSocket client, typed events, audio management |
| azure-realtime-webrtc/sdk | High-level classes: VoiceAssistant, TextChat, ToolAgent |
| azure-realtime-webrtc/streaming | Async iterators, ReadableStreams, Server-Sent Events |
| azure-realtime-webrtc/server | Express middleware: token server, SDP proxy, Entra ID auth |
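For orientation, here are the four entry points side by side — these imports match the examples used later in this readme:

import { RealtimeClient } from "azure-realtime-webrtc";                           // low-level client
import { VoiceAssistant, TextChat, ToolAgent } from "azure-realtime-webrtc/sdk";  // high-level classes
import { transcriptStream, eventStream } from "azure-realtime-webrtc/streaming";  // async iterators
import { createRealtimeMiddleware, createEntraAuth } from "azure-realtime-webrtc/server"; // token server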
Features
| Feature | Details |
|---------|---------|
| WebRTC + WebSocket | Both connection modes with a unified API |
| Full TypeScript | Strict types for all 32+ server events and 11 client events |
| SDK: VoiceAssistant | Complete voice chat with state machine (listening → thinking → speaking) |
| SDK: TextChat | Streaming text chat with message history |
| SDK: ToolAgent | Autonomous multi-step tool calling with execution trace |
| Streaming | for await iterators, Web ReadableStreams, SSE handler |
| Audio-Synced Text | Transcript streams in sync with the AI's voice playback |
| Function Calling | registerTool() with automatic call → execute → respond cycle |
| Express Middleware | Drop-in token server with rate limiting, CORS, input validation |
| Both Auth Methods | API Key and Microsoft Entra ID |
| Zero Runtime Deps | Only express as optional peer dep for the server module |
| Security First | API keys never reach the browser. Ephemeral tokens only. |
Install
npm install azure-realtime-webrtc

For the server module:

npm install azure-realtime-webrtc express

For Entra ID:

npm install azure-realtime-webrtc express @azure/identity

Prerequisites
You need three things from the Azure Portal:
| Value | Where to find it | Example |
|-------|-------------------|---------|
| Resource name | Your Azure OpenAI resource URL: https://<THIS>.openai.azure.com | my-openai-resource |
| API Key | Azure Portal → Your OpenAI resource → Keys and Endpoint | abc123... |
| Deployment name | Azure AI Foundry → Deployments (must be a realtime model) | gpt-4o-realtime-preview |
Your deployment must be a realtime-capable model deployed in East US 2 or Sweden Central.
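In the examples below these values map to environment variables (the same names this readme's Next.js route uses):

// Server-side only — the API key must never be bundled into the browser build
const resource = process.env.AZURE_RESOURCE!;      // e.g. "my-openai-resource"
const deployment = process.env.AZURE_DEPLOYMENT!;  // e.g. "gpt-4o-realtime-preview"
const apiKey = process.env.AZURE_OPENAI_API_KEY!;  // from Keys and Endpoint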
Architecture
Browser Your Server Azure OpenAI
│ │ │
│ POST /api/realtime/token │ │
│─────────────────────────────────>│ POST /client_secrets │
│ │─────────────────────────────>│
│ │ { value: ephemeral_token } │
│ { token: ephemeral_token } │<─────────────────────────────│
│<─────────────────────────────────│ │
│ │
│ WebRTC SDP offer + ephemeral token │
│────────────────────────────────────────────────────────────────>│
│ SDP answer │
│<────────────────────────────────────────────────────────────────│
│ │
│ ◄══════════ Bidirectional Audio (WebRTC media) ═══════════► │
│ ◄══════════ JSON Events (WebRTC data channel) ═══════════► │

Your API key never leaves your server. Only short-lived ephemeral tokens reach the browser.
Quick Start
Step 1: Token Server (Node.js)
import express from "express";
import { createRealtimeMiddleware } from "azure-realtime-webrtc/server";
const app = express();
app.use(createRealtimeMiddleware({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
auth: { type: "api-key", apiKey: process.env.AZURE_OPENAI_API_KEY! },
session: {
instructions: "You are a helpful assistant.",
audio: { output: { voice: "alloy" } },
},
express,
}));
app.listen(3001);

Creates: POST /api/realtime/token · POST /api/realtime/negotiate · GET /api/realtime/health
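A quick way to confirm the routes are mounted — the health response body isn't specified here, so this only checks the status code:

const health = await fetch("http://localhost:3001/api/realtime/health");
console.log(health.status); // expect 200 once the middleware is mounted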
Step 2: Browser Client
import { RealtimeClient } from "azure-realtime-webrtc";
const res = await fetch("/api/realtime/token", { method: "POST" });
const { token } = await res.json();
const client = new RealtimeClient({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
ephemeralToken: token,
webrtcFilter: true,
});
client.on("session.created", () => console.log("Ready!"));
client.on("response.audio_transcript.delta", (e) => console.log(e.delta));
client.on("error", (e) => console.error(e.error.message));
await client.connect(); // mic + audio playback handled automatically
client.addItem({
type: "message", role: "user",
content: [{ type: "input_text", text: "Hello!" }],
});
client.createResponse();

High-Level SDK
Import from azure-realtime-webrtc/sdk. These classes handle all event wiring, state management, and conversation lifecycle for you.
VoiceAssistant
Complete voice chat with automatic state machine: idle → connecting → listening → thinking → speaking → listening …
import { VoiceAssistant } from "azure-realtime-webrtc/sdk";
const assistant = new VoiceAssistant({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider: async () => {
const res = await fetch("/api/realtime/token", { method: "POST" });
return (await res.json()).token;
},
instructions: "You are a travel advisor.",
voice: "coral",
transcriptionModel: "whisper-1",
});
assistant.on("transcript", (entries) => {
// Full conversation — renders both user speech and AI responses
entries.forEach((e) => console.log(`[${e.role}] ${e.text}${e.partial ? "..." : ""}`));
});
assistant.on("stateChange", (state) => updateStatusBadge(state));
await assistant.start(); // connects, requests mic, starts listening
assistant.sendText("Beach destinations in Europe?");
assistant.setMuted(true); // mute mic
assistant.interrupt(); // stop AI mid-sentence
assistant.updateInstructions("Focus on budget options.");
assistant.stop(); // disconnect

Events: transcript · stateChange · userSpeechStarted · userSpeechStopped · assistantAudioStarted · assistantAudioStopped · error · rawEvent
TextChat
Streaming text chat. No mic complexity. Ideal for chatbots.
import { TextChat } from "azure-realtime-webrtc/sdk";
const chat = new TextChat({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider,
instructions: "You are customer support for Acme Corp.",
});
chat.on("message", (msg) => {
if (msg.streaming) updateBubble(msg.id, msg.content);
else finalizeBubble(msg.id, msg.content);
});
chat.on("responseStart", () => showTypingIndicator());
chat.on("responseEnd", () => hideTypingIndicator());
await chat.connect();
chat.send("How do I reset my password?");

Events: message · messages · responseStart · responseEnd · connected · error · rawEvent
ToolAgent
Autonomous agent that handles multi-turn tool calling loops. Send a task, get a result with full execution trace.
import { ToolAgent } from "azure-realtime-webrtc/sdk";
const agent = new ToolAgent({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider,
instructions: "You are a research assistant. Use tools to find information.",
maxToolRounds: 10,
});
agent.registerTool({
definition: {
type: "function", name: "web_search",
description: "Search the web",
parameters: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
},
handler: async (args) => JSON.stringify(await searchWeb(args.query)),
});
agent.on("step", (step) => console.log(`[${step.type}] ${step.content}`));
await agent.connect();
const result = await agent.run("Latest WebRTC developments in 2026");
console.log(result.response); // Final answer
console.log(result.toolCallCount); // How many tool calls were made
console.log(result.steps); // Full execution trace

Events: step · toolCall · toolResult · textDelta · runComplete · connected · error · rawEvent
Streaming
Import from azure-realtime-webrtc/streaming. Works with any client — RealtimeClient, VoiceAssistant, TextChat, or ToolAgent.
Async Iterators
import { transcriptStream, audioStream, eventStream } from "azure-realtime-webrtc/streaming";
// Stream transcript word by word
for await (const chunk of transcriptStream(client)) {
if (chunk.type === "delta") process.stdout.write(chunk.text);
if (chunk.type === "done") console.log(`\n[${chunk.role}] Complete`);
}
// Stream all events
for await (const { type, event } of eventStream(client)) {
console.log(type, event);
}

Web ReadableStreams
import { createTranscriptReadableStream } from "azure-realtime-webrtc/streaming";
const stream = createTranscriptReadableStream(client);
const reader = stream.getReader();
while (true) {
const { value, done } = await reader.read();
if (done) break;
document.getElementById("output").textContent += value.text;
}
// Or pipe to a Response (Next.js streaming route, Cloudflare Worker)
return new Response(stream.pipeThrough(new TransformStream({
transform(chunk, ctrl) { ctrl.enqueue(`data: ${JSON.stringify(chunk)}\n\n`); },
})), { headers: { "Content-Type": "text/event-stream" } });

Server-Sent Events (SSE)
import { createSSEHandler } from "azure-realtime-webrtc/streaming";
// Express endpoint
app.get("/api/stream", async (req, res) => {
const client = new RealtimeClient({ ... });
await client.connect();
createSSEHandler(client, { events: "transcript" })(req, res);
});
// Browser
const source = new EventSource("/api/stream");
source.addEventListener("response.output_audio_transcript.delta", (e) => {
console.log(JSON.parse(e.data).text);
});

Audio-Synced Transcript
Text transcript arrives faster than audio plays. To show text in sync with the AI's voice:
let wordBuffer = [], displayedText = "", dripTimer = null;
client.on("output_audio_buffer.started", () => {
dripTimer = setInterval(() => {
const word = wordBuffer.shift();
if (word) { displayedText += word; render(displayedText); }
}, 285); // ~3.5 words/sec = natural speech pace
});
client.on("output_audio_buffer.stopped", () => {
displayedText += wordBuffer.join(""); wordBuffer = [];
clearInterval(dripTimer); render(displayedText);
});
for await (const chunk of transcriptStream(client)) {
if (chunk.role === "assistant" && chunk.type === "delta") {
wordBuffer.push(chunk.text); // buffered, not shown yet
}
}

Framework Guides
React
import { useRef, useState, useCallback, useEffect } from "react";
import { VoiceAssistant, TranscriptEntry, VoiceAssistantState } from "azure-realtime-webrtc/sdk";
export function useVoiceAssistant() {
const ref = useRef<VoiceAssistant | null>(null);
const [state, setState] = useState<VoiceAssistantState>("idle");
const [transcript, setTranscript] = useState<TranscriptEntry[]>([]);
const start = useCallback(async () => {
const a = new VoiceAssistant({
resource: process.env.NEXT_PUBLIC_AZURE_RESOURCE!,
deployment: process.env.NEXT_PUBLIC_AZURE_DEPLOYMENT!,
tokenProvider: async () => {
const res = await fetch("/api/realtime/token", { method: "POST" });
return (await res.json()).token;
},
instructions: "You are a helpful assistant.",
voice: "alloy",
});
a.on("stateChange", setState);
a.on("transcript", setTranscript);
ref.current = a;
await a.start();
}, []);
const stop = useCallback(() => { ref.current?.stop(); ref.current = null; }, []);
const sendText = useCallback((t: string) => ref.current?.sendText(t), []);
const toggleMute = useCallback(() => {
const a = ref.current;
if (a) a.setMuted(!a.isMuted);
}, []);
useEffect(() => () => { ref.current?.stop(); }, []);
return { state, transcript, start, stop, sendText, toggleMute };
}

Next.js (App Router)
// app/api/realtime/token/route.ts
import { NextResponse } from "next/server";
export async function POST() {
const res = await fetch(
`https://${process.env.AZURE_RESOURCE}.openai.azure.com/openai/v1/realtime/client_secrets`,
{
method: "POST",
headers: { "api-key": process.env.AZURE_OPENAI_API_KEY!, "Content-Type": "application/json" },
body: JSON.stringify({
session: { type: "realtime", model: process.env.AZURE_DEPLOYMENT! },
}),
}
);
const data = await res.json();
return NextResponse.json({ token: data.value });
}

Vue / Nuxt
<script setup lang="ts">
import { ref, onUnmounted } from "vue";
import { VoiceAssistant } from "azure-realtime-webrtc/sdk";
const state = ref("idle");
const transcript = ref([]);
let assistant = null;
async function start() {
assistant = new VoiceAssistant({
resource: import.meta.env.VITE_AZURE_RESOURCE,
deployment: import.meta.env.VITE_AZURE_DEPLOYMENT,
tokenProvider: async () => (await fetch("/api/realtime/token", { method: "POST" }).then(r => r.json())).token,
});
assistant.on("stateChange", (s) => (state.value = s));
assistant.on("transcript", (t) => (transcript.value = t));
await assistant.start();
}
onUnmounted(() => assistant?.stop());
</script>

Angular
import { Injectable, OnDestroy } from "@angular/core";
import { BehaviorSubject } from "rxjs";
import { VoiceAssistant, TranscriptEntry, VoiceAssistantState } from "azure-realtime-webrtc/sdk";
@Injectable({ providedIn: "root" })
export class RealtimeService implements OnDestroy {
private assistant: VoiceAssistant | null = null;
state$ = new BehaviorSubject<VoiceAssistantState>("idle");
transcript$ = new BehaviorSubject<TranscriptEntry[]>([]);
async start() {
this.assistant = new VoiceAssistant({ /* config */ });
this.assistant.on("stateChange", (s) => this.state$.next(s));
this.assistant.on("transcript", (t) => this.transcript$.next(t));
await this.assistant.start();
}
stop() { this.assistant?.stop(); }
ngOnDestroy() { this.stop(); }
}

Vanilla JavaScript
<script type="module">
import { VoiceAssistant } from "https://cdn.jsdelivr.net/npm/azure-realtime-webrtc/dist/sdk.js";
const assistant = new VoiceAssistant({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
tokenProvider: async () => (await fetch("/api/token", { method: "POST" }).then(r => r.json())).token,
});
assistant.on("transcript", (e) => { /* render */ });
await assistant.start();
</script>

Node.js (Server-Only)
import { RealtimeClient } from "azure-realtime-webrtc";
const client = new RealtimeClient({
resource: "my-resource",
deployment: "gpt-4o-realtime-preview",
mode: "websocket",
auth: { type: "api-key", apiKey: process.env.AZURE_OPENAI_API_KEY! },
});
client.on("response.text.done", (e) => console.log("AI:", e.text));
await client.connect();
client.addItem({ type: "message", role: "user", content: [{ type: "input_text", text: "Explain WebRTC." }] });
client.createResponse();

API Reference
RealtimeClient
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| resource | string | required | Azure resource name |
| deployment | string | required | Model deployment name |
| mode | "webrtc" \| "websocket" | "webrtc" | Connection mode |
| ephemeralToken | string | — | Token from your server (browser) |
| auth | AuthConfig | — | Direct auth (server-side) |
| session | SessionConfig | — | Session config |
| webrtcFilter | boolean | false | Filter data channel events |
| autoMicrophone | boolean | true | Auto-request mic |
| channelTimeout | number | 10000 | Data channel open timeout (ms) |
| Method | Description |
|--------|-------------|
| connect() | Connect to Azure OpenAI Realtime |
| send(event) | Send any client event |
| createResponse() | Trigger model response |
| updateSession(config) | Update session mid-conversation |
| addItem(item) | Add a conversation item |
| registerTool(reg) | Register a function tool with auto-handling |
| setMicrophoneMuted(muted) | Mute/unmute mic (WebRTC only) |
| disconnect() | Disconnect and cleanup |
createRealtimeMiddleware(options)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| resource | string | required | Azure resource name |
| deployment | string | required | Deployment name |
| auth | AuthConfig | required | Server-side auth |
| session | SessionConfig | — | Default session config |
| prefix | string | "/api/realtime" | Route prefix |
| corsOrigin | string | "*" | CORS origin |
| rateLimit | number | 10 | Requests/min/IP |
| express | Express | — | Required in ESM |
createEntraAuth(options?)
import { createEntraAuth } from "azure-realtime-webrtc/server";
const auth = await createEntraAuth({ tenantId: "...", clientId: "..." });

Session Configuration
const session = {
instructions: "You are a helpful assistant.",
audio: {
output: { voice: "alloy", format: "pcm16" },
input: {
format: "pcm16",
transcription: { model: "whisper-1" },
turn_detection: {
type: "server_vad",
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 200,
create_response: true,
},
},
},
modalities: ["audio", "text"],
temperature: 0.8,
max_response_output_tokens: 4096,
tools: [{ type: "function", name: "...", description: "...", parameters: { ... } }],
tool_choice: "auto",
};

Voices: alloy · ash · ballad · coral · echo · sage · shimmer · verse · marin
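Session settings can also be adjusted mid-conversation with updateSession() on a connected RealtimeClient — a minimal sketch reusing the shape above:

// Tighten instructions and sampling without reconnecting
client.updateSession({ instructions: "Answer in one sentence.", temperature: 0.4 });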
Function Calling
client.registerTool({
definition: {
type: "function", name: "get_weather",
description: "Get current weather for a city",
parameters: { type: "object", properties: { city: { type: "string" } }, required: ["city"] },
},
handler: async (args: { city: string }) => {
const data = await fetchWeather(args.city);
return JSON.stringify(data);
},
});
// Model calls get_weather → handler runs → result sent back → model continues

Events Reference
Server Events
| Category | Events |
|----------|--------|
| Session | session.created · session.updated |
| Conversation | conversation.created · conversation.item.created · conversation.item.deleted · conversation.item.truncated |
| User Speech | input_audio_buffer.speech_started · input_audio_buffer.speech_stopped · conversation.item.input_audio_transcription.completed |
| AI Transcript | response.audio_transcript.delta · response.audio_transcript.done · response.output_audio_transcript.delta · response.output_audio_transcript.done |
| AI Text | response.text.delta · response.text.done · response.output_text.delta · response.output_text.done |
| AI Audio | response.audio.delta · response.audio.done · output_audio_buffer.started · output_audio_buffer.stopped |
| Tool Calls | response.function_call_arguments.delta · response.function_call_arguments.done |
| Response | response.created · response.done · response.output_item.added · response.content_part.added |
| Other | error · rate_limits.updated |
Wildcard + Connection
client.on("*", (event) => console.log(event.type)); // all events
client.on("connected", () => { }); // connected
client.on("disconnected", ({ reason }) => { }); // disconnected

Supported Models
| Model | Version |
|-------|---------|
| gpt-4o-mini-realtime-preview | 2024-12-17 |
| gpt-4o-realtime-preview | 2024-12-17 |
| gpt-realtime | 2025-08-28 |
| gpt-realtime-mini | 2025-10-06, 2025-12-15 |
| gpt-realtime-1.5 | 2026-02-23 |
Regions: East US 2 and Sweden Central only.
Security
| Measure | Details |
|---------|---------|
| Token isolation | API keys never reach the browser — only short-lived ephemeral tokens |
| Rate limiting | Built-in per-IP sliding window (configurable) |
| Input validation | SDP offers validated (format + 64KB max). JSON bodies validated. |
| Security headers | Cache-Control: no-store · X-Content-Type-Options: nosniff |
| No eval | All JSON parsed with JSON.parse. No eval() or Function(). |
| CORS | Configurable origin restrictions |
Production Checklist
- [ ] Set corsOrigin to your specific domain(s)
- [ ] Use environment variables for API keys
- [ ] Deploy behind HTTPS
- [ ] Use Entra ID instead of API keys (createEntraAuth(), sketched below)
- [ ] Set up Azure billing alerts
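A sketch of a hardened token server combining the documented middleware options. It assumes the object returned by createEntraAuth() is accepted as the middleware's auth option, and AZURE_TENANT_ID / AZURE_CLIENT_ID are placeholder variable names:

import express from "express";
import { createRealtimeMiddleware, createEntraAuth } from "azure-realtime-webrtc/server";

const app = express();

// Entra ID instead of a raw API key
const auth = await createEntraAuth({
  tenantId: process.env.AZURE_TENANT_ID!,
  clientId: process.env.AZURE_CLIENT_ID!,
});

app.use(createRealtimeMiddleware({
  resource: process.env.AZURE_RESOURCE!,
  deployment: process.env.AZURE_DEPLOYMENT!,
  auth,
  corsOrigin: "https://app.example.com", // lock CORS to your domain
  rateLimit: 30,                         // requests/min/IP
  express,
}));
app.listen(3001);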
Troubleshooting
| Issue | Solution |
|-------|----------|
| Token request 500 | Voice must be nested: audio.output.voice. Transcription: audio.input.transcription.model. Not flat fields. |
| No transcript streaming | Listen to BOTH event names: response.audio_transcript.delta AND response.output_audio_transcript.delta |
| RTCPeerConnection not available | Use mode: "websocket" for server-side |
| Mic permission denied | Check browser permissions. HTTPS required in production. |
| Data channel timeout | Verify deployment is in East US 2 or Sweden Central |
| Auth 401 | Check API key belongs to the correct resource |
| Text arrives before audio | Use the Audio-Synced Transcript pattern |
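For the transcript row in particular, a listener that covers both delta event names:

// Both names appear in the Events Reference above; register one handler for each
const onDelta = (e: { delta: string }) => process.stdout.write(e.delta);
client.on("response.audio_transcript.delta", onDelta);
client.on("response.output_audio_transcript.delta", onDelta);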
Author & Maintainer
Komal Vardhan Lolugu
Lead Product Engineer — Agentic AI & Generative Models

| Platform | Link |
|----------|------|
| Portfolio | komalsrinivas.vercel.app |
| LinkedIn | linkedin.com/in/komalvardhanlolugu |
| GitHub | github.com/komalSrinivasan |
| Medium | komalvardhan.medium.com |
| Topmate | topmate.io/komal_vardhan_lolugu |
For bugs, questions, or collaboration — reach out via LinkedIn or open an issue.
License
MIT
