# voice-agent-ai-sdk

Streaming voice/text agent SDK built on the AI SDK, with optional WebSocket transport.
## Features

- Streaming text generation via AI SDK `streamText` with multi-step tool calling.
- Chunked streaming TTS — text is split at sentence boundaries and converted to speech in parallel as the LLM streams, giving low time-to-first-audio.
- Audio transcription via AI SDK `experimental_transcribe` (e.g. Whisper).
- Barge-in / interruption — user speech cancels both the in-flight LLM stream and pending TTS, saving tokens and latency.
- Memory management — a configurable sliding window on conversation history (`maxMessages`, `maxTotalChars`) plus audio input size limits.
- Serial request queue — concurrent `sendText` / audio inputs are queued and processed one at a time, preventing race conditions.
- Graceful lifecycle — `disconnect()` aborts all in-flight work; `destroy()` permanently releases every resource.
- WebSocket transport with a full protocol of stream, tool, and speech lifecycle events.
- Works without WebSocket — call `sendText()` directly for text-only or server-side use.
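The serial request queue described above can be sketched as a promise chain, where each task starts only after the previous one settles. This is an illustrative sketch under assumed semantics, not the package's actual implementation (which also has to coordinate with aborts):

```typescript
// Minimal serial queue: each task starts only after the previous settles.
// Illustrative sketch — the real agent's queue also handles interruption.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Chain onto the tail; run the task whether the previous one
    // resolved or rejected, so one failure does not poison the queue.
    const result = this.tail.then(task, task);
    this.tail = result.catch(() => undefined);
    return result;
  }
}

// Usage: two "requests" submitted concurrently still run one at a time.
const queue = new SerialQueue();
const order: string[] = [];
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

await Promise.all([
  queue.enqueue(async () => { await sleep(20); order.push("first"); }),
  queue.enqueue(async () => { order.push("second"); }),
]);
console.log(order.join(",")); // "first,second"
```

Because each caller gets back the per-task promise (not the shared tail), `sendText`-style APIs can still resolve with their own result while ordering is preserved.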
## Prerequisites
- Node.js 20+
- pnpm
- OpenAI API key
## Setup

Install dependencies:

```bash
pnpm install
```

Configure environment variables in `.env`:

```bash
OPENAI_API_KEY=your_openai_api_key
VOICE_WS_ENDPOINT=ws://localhost:8080
```

`VOICE_WS_ENDPOINT` is optional for text-only usage.
## VoiceAgent usage (as in the demo)

A minimal end-to-end example using AI SDK tools, streaming text, and streaming TTS:
```ts
import "dotenv/config";
import { VoiceAgent } from "./src";
import { tool } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";

const weatherTool = tool({
  description: "Get the weather in a location",
  inputSchema: z.object({ location: z.string() }),
  execute: async ({ location }) => ({ location, temperature: 72, conditions: "sunny" }),
});

const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  transcriptionModel: openai.transcription("whisper-1"),
  speechModel: openai.speech("gpt-4o-mini-tts"),
  instructions: "You are a helpful voice assistant.",
  voice: "alloy",
  speechInstructions: "Speak in a friendly, natural conversational tone.",
  outputFormat: "mp3",
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 180,
    parallelGeneration: true,
    maxParallelRequests: 2,
  },
  // Memory management (new in 0.1.0)
  history: {
    maxMessages: 50, // keep last 50 messages
    maxTotalChars: 100_000, // or trim when total chars exceed 100k
  },
  maxAudioInputSize: 5 * 1024 * 1024, // 5 MB limit
  endpoint: process.env.VOICE_WS_ENDPOINT,
  tools: { getWeather: weatherTool },
});

agent.on("text", ({ role, text }) => {
  const prefix = role === "user" ? "👤" : "🤖";
  console.log(prefix, text);
});

agent.on("chunk:text_delta", ({ text }) => process.stdout.write(text));

agent.on("speech_start", ({ streaming }) => console.log("speech_start", streaming));

agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => {
  console.log("audio_chunk", chunkId, format, uint8Array.length);
});

await agent.sendText("What's the weather in San Francisco?");

if (process.env.VOICE_WS_ENDPOINT) {
  await agent.connect(process.env.VOICE_WS_ENDPOINT);
}
```

## Configuration options
The agent accepts:
| Option | Required | Default | Description |
|---|---|---|---|
| model | yes | — | AI SDK chat model (e.g. openai("gpt-4o")) |
| transcriptionModel | no | — | AI SDK transcription model (e.g. openai.transcription("whisper-1")) |
| speechModel | no | — | AI SDK speech model (e.g. openai.speech("gpt-4o-mini-tts")) |
| instructions | no | "You are a helpful voice assistant." | System prompt |
| stopWhen | no | stepCountIs(5) | Stopping condition for multi-step tool loops |
| tools | no | {} | AI SDK tools map |
| endpoint | no | — | Default WebSocket URL for connect() |
| voice | no | "alloy" | TTS voice |
| speechInstructions | no | — | Style instructions passed to the speech model |
| outputFormat | no | "mp3" | Audio output format (mp3, opus, wav, …) |
| streamingSpeech | no | see below | Streaming TTS chunk tuning |
| history | no | see below | Conversation memory limits |
| maxAudioInputSize | no | 10485760 (10 MB) | Maximum accepted audio input in bytes |
### streamingSpeech
| Key | Default | Description |
|---|---|---|
| minChunkSize | 50 | Min characters before a sentence is sent to TTS |
| maxChunkSize | 200 | Max characters per chunk (force-split at clause boundary) |
| parallelGeneration | true | Start TTS for upcoming chunks while the current one plays |
| maxParallelRequests | 3 | Cap on concurrent TTS requests |
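The interplay of `minChunkSize` and `maxChunkSize` can be pictured with a small splitter: accumulate sentences until the buffer passes the minimum, and force-split anything that overshoots the maximum. This is a hedged sketch, not the package's splitter, which per the table also force-splits at clause boundaries rather than mid-word:

```typescript
// Split text into TTS-sized chunks at sentence boundaries.
// Sketch only: a production splitter must also handle abbreviations,
// decimals, and clause-level (rather than character-level) force splits.
function chunkForTts(
  text: string,
  minChunkSize = 50,
  maxChunkSize = 200,
): string[] {
  const chunks: string[] = [];
  let buffer = "";
  // Naive sentence segmentation on ., !, ? followed by whitespace.
  for (const sentence of text.split(/(?<=[.!?])\s+/)) {
    buffer = buffer ? `${buffer} ${sentence}` : sentence;
    if (buffer.length >= minChunkSize) {
      // Force-split anything that overshoots maxChunkSize.
      while (buffer.length > maxChunkSize) {
        chunks.push(buffer.slice(0, maxChunkSize));
        buffer = buffer.slice(maxChunkSize);
      }
      if (buffer) chunks.push(buffer);
      buffer = "";
    }
  }
  if (buffer) chunks.push(buffer); // flush the trailing remainder
  return chunks;
}
```

Short sentences get merged until they cross `minChunkSize`, which is what keeps TTS requests from being too tiny to sound natural.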
### history
| Key | Default | Description |
|---|---|---|
| maxMessages | 100 | Max messages kept in history (0 = unlimited). Oldest are trimmed in pairs. |
| maxTotalChars | 0 (unlimited) | Max total characters across all messages. Oldest are trimmed when exceeded. |
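The "trimmed in pairs" behavior above can be sketched as a sliding window that evicts the oldest user/assistant pair until both limits hold. A simplified illustration under assumed message shapes; the real agent may treat system messages and tool results differently:

```typescript
interface Msg { role: "user" | "assistant"; content: string; }

// Drop the oldest messages (in user/assistant pairs) until both the
// message-count and total-character limits are satisfied. Sketch only.
function trimHistory(
  history: Msg[],
  maxMessages = 100,  // 0 = unlimited
  maxTotalChars = 0,  // 0 = unlimited
): { history: Msg[]; removedCount: number } {
  const kept = [...history];
  let removed = 0;
  const totalChars = () =>
    kept.reduce((sum, m) => sum + m.content.length, 0);
  const overLimit = () =>
    (maxMessages > 0 && kept.length > maxMessages) ||
    (maxTotalChars > 0 && totalChars() > maxTotalChars);

  while (overLimit() && kept.length > 2) {
    kept.splice(0, 2); // evict the oldest user/assistant pair
    removed += 2;
  }
  return { history: kept, removedCount: removed };
}
```

Evicting in pairs keeps the window starting on a user turn, so the model never sees an assistant reply without the question that prompted it.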
## Methods
| Method | Description |
|---|---|
| sendText(text) | Process text input. Returns a promise with the full assistant response. Requests are queued serially. |
| sendAudio(base64Audio) | Transcribe base64 audio and process the result. |
| sendAudioBuffer(buffer) | Same as above, accepts a raw Buffer / Uint8Array. |
| transcribeAudio(buffer) | Transcribe audio to text without generating a response. |
| generateAndSendSpeechFull(text) | Non-streaming TTS fallback (entire text at once). |
| interruptSpeech(reason?) | Cancel in-flight TTS only (LLM stream keeps running). |
| interruptCurrentResponse(reason?) | Cancel both the LLM stream and TTS. Used for barge-in. |
| connect(url?) / handleSocket(ws) | Establish or attach a WebSocket. Safe to call multiple times. |
| disconnect() | Close the socket and abort all in-flight work. |
| destroy() | Permanently release all resources. The agent cannot be reused. |
| clearHistory() | Clear conversation history. |
| getHistory() / setHistory(msgs) | Read or restore conversation history. |
| registerTools(tools) | Merge additional tools into the agent. |
## Read-only properties
| Property | Type | Description |
|---|---|---|
| connected | boolean | Whether a WebSocket is connected |
| processing | boolean | Whether a request is currently being processed |
| speaking | boolean | Whether audio is currently being generated / sent |
| pendingSpeechChunks | number | Number of queued TTS chunks |
| destroyed | boolean | Whether destroy() has been called |
## Events
| Event | Payload | When |
|---|---|---|
| text | { role, text } | User input received or full assistant response ready |
| chunk:text_delta | { id, text } | Each streaming text token from the LLM |
| chunk:reasoning_delta | { id, text } | Each reasoning token (models that support it) |
| chunk:tool_call | { toolName, toolCallId, input } | Tool invocation detected |
| tool_result | { name, toolCallId, result } | Tool execution finished |
| speech_start | { streaming } | TTS generation begins |
| speech_complete | { streaming } | All TTS chunks sent |
| speech_interrupted | { reason } | Speech was cancelled (barge-in, disconnect, error) |
| speech_chunk_queued | { id, text } | A text chunk entered the TTS queue |
| audio_chunk | { chunkId, data, format, text, uint8Array } | One TTS chunk is ready |
| audio | { data, format, uint8Array } | Full non-streaming TTS audio |
| transcription | { text, language } | Audio transcription result |
| audio_received | { size } | Raw audio input received (before transcription) |
| history_trimmed | { removedCount, reason } | Oldest messages evicted from history |
| connected / disconnected | — | WebSocket lifecycle |
| warning | string | Non-fatal issues (empty input, etc.) |
| error | Error | Errors from LLM, TTS, transcription, or WebSocket |
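The barge-in flow behind `speech_interrupted` can be sketched with an `AbortController`: a new user utterance aborts whatever response is currently streaming. This is an illustrative stand-in for the agent's internals (the class and event wiring here are hypothetical), not the package's code:

```typescript
import { EventEmitter } from "node:events";

// Hypothetical barge-in sketch: incoming user speech aborts the
// in-flight response; the signal would be passed to streamText / TTS.
class BargeInController extends EventEmitter {
  private current: AbortController | null = null;

  startResponse(): AbortSignal {
    this.current = new AbortController();
    return this.current.signal; // hand to the LLM/TTS calls
  }

  onUserSpeech(): void {
    if (this.current && !this.current.signal.aborted) {
      this.current.abort();
      // Mirrors the speech_interrupted payload in the table above.
      this.emit("speech_interrupted", { reason: "barge-in" });
    }
    this.current = null;
  }
}
```

Aborting via a shared signal is what lets one user event cancel both the LLM stream and pending TTS in a single step.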
## Run (text-only check)

This validates LLM + tool + streaming speech without requiring WebSocket:

```bash
pnpm demo
```

Expected logs include `text`, `chunk:text_delta`, tool events, and speech chunk events.
## Run (WebSocket check)

Start the local WS server:

```bash
pnpm ws:server
```

In another terminal, run the demo:

```bash
pnpm demo
```

The demo will:

- run `sendText()` first (text-only sanity check), then
- connect to `VOICE_WS_ENDPOINT` if provided,
- emit streaming protocol messages (`text_delta`, `tool_call`, `audio_chunk`, `response_complete`, etc.).
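On the client side, the protocol messages named above could be modeled as a discriminated union. The field names here are assumptions inferred from the event payload table, not a published wire schema:

```typescript
// Assumed wire shapes for the WS protocol; fields are inferred from
// the event payloads documented above and may differ from the real schema.
type ServerMessage =
  | { type: "text_delta"; text: string }
  | { type: "tool_call"; toolName: string; toolCallId: string }
  | { type: "audio_chunk"; chunkId: number; data: string; format: string }
  | { type: "response_complete" };

function parseServerMessage(raw: string): ServerMessage | null {
  try {
    const msg = JSON.parse(raw);
    // Minimal validation: require a string `type` discriminant.
    return typeof msg?.type === "string" ? (msg as ServerMessage) : null;
  } catch {
    return null; // ignore malformed frames
  }
}
```

Switching on `msg.type` then narrows the payload, so handlers for `text_delta` vs. `audio_chunk` stay type-safe.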
## Browser voice client (HTML)

A simple browser client is available at `example/voice-client.html`.

What it does:

- captures microphone speech using the Web Speech API (speech-to-text)
- sends the transcript to the agent via WebSocket (`type: "transcript"`)
- receives streaming `audio_chunk` messages and plays them in order

How to use:

- Start your agent server/WebSocket endpoint.
- Open `example/voice-client.html` in a browser (Chrome/Edge recommended).
- Connect to `ws://localhost:8080` (or your endpoint), then click Start Mic.
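Playing chunks "in order" needs a reordering step, since parallel TTS can finish out of order. A minimal reordering buffer, shown as pure logic separate from the page's actual audio code (the class name is hypothetical):

```typescript
// Release audio chunks strictly in chunkId order, buffering any that
// arrive early. Sketch of the client-side reordering logic only.
class OrderedChunkQueue<T> {
  private next = 0;
  private pending = new Map<number, T>();

  // Returns every chunk that is now playable, in order.
  push(chunkId: number, chunk: T): T[] {
    this.pending.set(chunkId, chunk);
    const ready: T[] = [];
    while (this.pending.has(this.next)) {
      ready.push(this.pending.get(this.next)!);
      this.pending.delete(this.next);
      this.next += 1;
    }
    return ready;
  }
}
```

Each `audio_chunk` handler would call `push(chunkId, audio)` and feed whatever comes back to the audio element, so playback never skips ahead of a slow chunk.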
## Scripts

- `pnpm build` – build TypeScript
- `pnpm dev` – watch TypeScript
- `pnpm demo` – run the demo client
- `pnpm ws:server` – run the local test WebSocket server
## Notes

- If `VOICE_WS_ENDPOINT` is empty, the WebSocket connect is skipped.
- The sample WS server sends a mock `transcript` message for end-to-end testing.
- Streaming TTS uses chunk queueing and supports interruption (`interrupt`).