voice-agent-ai-sdk

v1.0.1

Voice AI Agent with ai-sdk

Streaming voice/text agent SDK built on AI SDK with optional WebSocket transport.

Features

  • Streaming text generation via AI SDK streamText with multi-step tool calling.
  • Chunked streaming TTS — text is split at sentence boundaries and converted to speech in parallel as the LLM streams, giving low time-to-first-audio.
  • Audio transcription via AI SDK experimental_transcribe (e.g. Whisper).
  • Barge-in / interruption — user speech cancels both the in-flight LLM stream and pending TTS, saving tokens and latency.
  • Memory management — configurable sliding-window on conversation history (maxMessages, maxTotalChars) and audio input size limits.
  • Serial request queue — concurrent sendText / audio inputs are queued and processed one at a time, preventing race conditions.
  • Graceful lifecycle: disconnect() aborts all in-flight work; destroy() permanently releases every resource.
  • WebSocket transport with a full protocol of stream, tool, and speech lifecycle events.
  • Works without WebSocket — call sendText() directly for text-only or server-side use.
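The serial request queue mentioned in the feature list can be pictured as a promise chain, where each request starts only after the previous one has settled. The following is a minimal sketch of that idea, not the package's actual implementation; `SerialQueue` and `fakeRequest` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper (not part of voice-agent-ai-sdk): runs async tasks one at a time.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after everything queued before it has settled.
    const run = this.tail.then(() => task());
    // Swallow rejections on the chain so one failed task does not block later ones.
    this.tail = run.catch(() => undefined);
    return run;
  }
}

const order: string[] = [];
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stand-in for a request such as sendText(); records when it starts and ends.
async function fakeRequest(name: string, ms: number): Promise<string> {
  order.push(`start:${name}`);
  await sleep(ms);
  order.push(`end:${name}`);
  return name;
}

const queue = new SerialQueue();
// Both requests are submitted at once, but the second waits for the first.
const done = Promise.all([
  queue.enqueue(() => fakeRequest("first", 20)),
  queue.enqueue(() => fakeRequest("second", 5)),
]);
```

Even though the second request is shorter, it does not start until the first has finished, which is the property that avoids interleaved responses.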

Prerequisites

  • Node.js 20+
  • pnpm
  • OpenAI API key

Setup

  1. Install dependencies:

    pnpm install

  2. Configure environment variables in .env:

    OPENAI_API_KEY=your_openai_api_key
    VOICE_WS_ENDPOINT=ws://localhost:8080

VOICE_WS_ENDPOINT is optional for text-only usage.

VoiceAgent usage (as in the demo)

Minimal end-to-end example using AI SDK tools, streaming text, and streaming TTS:

import "dotenv/config";
import { VoiceAgent } from "./src";
import { tool } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";

const weatherTool = tool({
   description: "Get the weather in a location",
   inputSchema: z.object({ location: z.string() }),
   execute: async ({ location }) => ({ location, temperature: 72, conditions: "sunny" }),
});

const agent = new VoiceAgent({
   model: openai("gpt-4o"),
   transcriptionModel: openai.transcription("whisper-1"),
   speechModel: openai.speech("gpt-4o-mini-tts"),
   instructions: "You are a helpful voice assistant.",
   voice: "alloy",
   speechInstructions: "Speak in a friendly, natural conversational tone.",
   outputFormat: "mp3",
   streamingSpeech: {
      minChunkSize: 40,
      maxChunkSize: 180,
      parallelGeneration: true,
      maxParallelRequests: 2,
   },
   // Memory management (new in 0.1.0)
   history: {
      maxMessages: 50,       // keep last 50 messages
      maxTotalChars: 100_000, // or trim when total chars exceed 100k
   },
   maxAudioInputSize: 5 * 1024 * 1024, // 5 MB limit
   endpoint: process.env.VOICE_WS_ENDPOINT,
   tools: { getWeather: weatherTool },
});

agent.on("text", ({ role, text }) => {
   const prefix = role === "user" ? "👤" : "🤖";
   console.log(prefix, text);
});

agent.on("chunk:text_delta", ({ text }) => process.stdout.write(text));
agent.on("speech_start", ({ streaming }) => console.log("speech_start", streaming));
agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => {
   console.log("audio_chunk", chunkId, format, uint8Array.length);
});

await agent.sendText("What's the weather in San Francisco?");

if (process.env.VOICE_WS_ENDPOINT) {
   await agent.connect(process.env.VOICE_WS_ENDPOINT);
}

Configuration options

The agent accepts:

| Option | Required | Default | Description |
|---|---|---|---|
| model | yes | - | AI SDK chat model (e.g. openai("gpt-4o")) |
| transcriptionModel | no | - | AI SDK transcription model (e.g. openai.transcription("whisper-1")) |
| speechModel | no | - | AI SDK speech model (e.g. openai.speech("gpt-4o-mini-tts")) |
| instructions | no | "You are a helpful voice assistant." | System prompt |
| stopWhen | no | stepCountIs(5) | Stopping condition for multi-step tool loops |
| tools | no | {} | AI SDK tools map |
| endpoint | no | - | Default WebSocket URL for connect() |
| voice | no | "alloy" | TTS voice |
| speechInstructions | no | - | Style instructions passed to the speech model |
| outputFormat | no | "mp3" | Audio output format (mp3, opus, wav, …) |
| streamingSpeech | no | see below | Streaming TTS chunk tuning |
| history | no | see below | Conversation memory limits |
| maxAudioInputSize | no | 10485760 (10 MB) | Maximum accepted audio input in bytes |
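To get a feel for how maxAudioInputSize relates to base64 audio, the decoded size of a base64 string can be estimated without decoding it. This is a rough sketch under stated assumptions: `base64ByteLength` and `isWithinAudioLimit` are hypothetical helpers, and whether the agent itself compares with "less than" or "less than or equal" is not stated in this README.

```typescript
// Default from the options table: 10485760 bytes (10 MB).
const DEFAULT_MAX_AUDIO_INPUT_SIZE = 10 * 1024 * 1024;

// Hypothetical helper: decoded byte length of a base64 string, computed from its length.
function base64ByteLength(b64: string): number {
  const clean = b64.replace(/\s/g, "");
  if (clean.length === 0) return 0;
  // Each "=" of padding means one fewer decoded byte.
  const padding = clean.endsWith("==") ? 2 : clean.endsWith("=") ? 1 : 0;
  return Math.floor((clean.length * 3) / 4) - padding;
}

// Hypothetical pre-check a caller could run before handing audio to the agent.
// Assumption: input exactly at the limit is accepted.
function isWithinAudioLimit(b64: string, maxBytes: number = DEFAULT_MAX_AUDIO_INPUT_SIZE): boolean {
  return base64ByteLength(b64) <= maxBytes;
}
```

Base64 expands data by roughly a third, so a 10 MB limit on decoded audio corresponds to about 13.3 MB of base64 text.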

streamingSpeech

| Key | Default | Description |
|---|---|---|
| minChunkSize | 50 | Min characters before a sentence is sent to TTS |
| maxChunkSize | 200 | Max characters per chunk (force-split at clause boundary) |
| parallelGeneration | true | Start TTS for upcoming chunks while the current one plays |
| maxParallelRequests | 3 | Cap on concurrent TTS requests |
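The interaction between minChunkSize and maxChunkSize is easier to see in code. The sketch below is one plausible way to split streamed text at sentence boundaries; it is not the package's actual splitter. `SentenceChunker` and `chunkAll` are hypothetical names, and the exact boundary rules (sentence end means `.`, `!` or `?` followed by whitespace; forced split at the last comma, semicolon, or space) are assumptions.

```typescript
interface ChunkOptions {
  minChunkSize: number;
  maxChunkSize: number;
}

// Hypothetical chunker: buffers streamed text and releases sentence-sized chunks.
class SentenceChunker {
  private buffer = "";
  private readonly options: ChunkOptions;

  constructor(options: ChunkOptions = { minChunkSize: 50, maxChunkSize: 200 }) {
    this.options = options;
  }

  // Add a streamed delta; returns any chunks that are now ready for TTS.
  push(delta: string): string[] {
    this.buffer += delta;
    const ready: string[] = [];
    for (;;) {
      const cut = this.findCut();
      if (cut === -1) break;
      const chunk = this.buffer.slice(0, cut).trim();
      this.buffer = this.buffer.slice(cut);
      if (chunk) ready.push(chunk);
    }
    return ready;
  }

  // Call once the LLM stream ends to release whatever text is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }

  private findCut(): number {
    const { minChunkSize, maxChunkSize } = this.options;
    // Prefer the first sentence end that yields at least minChunkSize characters.
    const sentenceEnd = /[.!?](?=\s)/g;
    let match: RegExpExecArray | null;
    while ((match = sentenceEnd.exec(this.buffer)) !== null) {
      const end = match.index + 1;
      if (end > maxChunkSize) break;
      if (end >= minChunkSize) return end;
    }
    if (this.buffer.length <= maxChunkSize) return -1; // wait for more text
    // Too long without a usable sentence end: force a split inside the limit.
    const head = this.buffer.slice(0, maxChunkSize);
    const clause = Math.max(head.lastIndexOf(","), head.lastIndexOf(";"));
    if (clause > 0) return clause + 1;
    const space = head.lastIndexOf(" ");
    return space > 0 ? space + 1 : maxChunkSize;
  }
}

// Convenience wrapper: run a list of deltas through the chunker and flush at the end.
function chunkAll(deltas: string[], options: ChunkOptions): string[] {
  const chunker = new SentenceChunker(options);
  const chunks: string[] = [];
  for (const delta of deltas) chunks.push(...chunker.push(delta));
  chunks.push(...chunker.flush());
  return chunks;
}
```

A smaller minChunkSize lowers time-to-first-audio because the first sentence is released sooner, at the cost of more TTS requests; a larger maxChunkSize reduces forced mid-sentence splits.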

history

| Key | Default | Description |
|---|---|---|
| maxMessages | 100 | Max messages kept in history (0 = unlimited). Oldest are trimmed in pairs. |
| maxTotalChars | 0 (unlimited) | Max total characters across all messages. Oldest are trimmed when exceeded. |
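The two history limits can be read as a sliding window over the message list. The following sketch shows one way such trimming might work; it is an illustration, not the package's code. `trimHistory` and the `Message` shape are hypothetical, and two details are assumptions: the character limit also trims in pairs, and the newest message is always kept.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

interface HistoryLimits {
  maxMessages: number; // 0 = unlimited
  maxTotalChars: number; // 0 = unlimited
}

// Hypothetical trimming routine: drops the oldest messages until both limits hold.
function trimHistory(
  messages: Message[],
  limits: HistoryLimits
): { kept: Message[]; removedCount: number } {
  const kept = [...messages];
  const totalChars = () => kept.reduce((sum, m) => sum + m.content.length, 0);
  const overCount = () => limits.maxMessages > 0 && kept.length > limits.maxMessages;
  const overChars = () => limits.maxTotalChars > 0 && totalChars() > limits.maxTotalChars;

  // Assumption: never drop the newest message, even if it alone exceeds a limit.
  while (kept.length > 1 && (overCount() || overChars())) {
    // Remove a user/assistant pair from the front where possible.
    kept.splice(0, Math.min(2, kept.length - 1));
  }
  return { kept, removedCount: messages.length - kept.length };
}

// Six short messages m0..m5, alternating roles.
const sampleHistory: Message[] = [0, 1, 2, 3, 4, 5].map((i) => ({
  role: i % 2 === 0 ? "user" : "assistant",
  content: `m${i}`,
}));

const byCount = trimHistory(sampleHistory, { maxMessages: 4, maxTotalChars: 0 });
const byChars = trimHistory(sampleHistory, { maxMessages: 0, maxTotalChars: 5 });
```

In this sketch `removedCount` plays the role of the value reported in the history_trimmed event payload.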

Methods

| Method | Description |
|---|---|
| sendText(text) | Process text input. Returns a promise with the full assistant response. Requests are queued serially. |
| sendAudio(base64Audio) | Transcribe base64 audio and process the result. |
| sendAudioBuffer(buffer) | Same as above, accepts a raw Buffer / Uint8Array. |
| transcribeAudio(buffer) | Transcribe audio to text without generating a response. |
| generateAndSendSpeechFull(text) | Non-streaming TTS fallback (entire text at once). |
| interruptSpeech(reason?) | Cancel in-flight TTS only (LLM stream keeps running). |
| interruptCurrentResponse(reason?) | Cancel both the LLM stream and TTS. Used for barge-in. |
| connect(url?) / handleSocket(ws) | Establish or attach a WebSocket. Safe to call multiple times. |
| disconnect() | Close the socket and abort all in-flight work. |
| destroy() | Permanently release all resources. The agent cannot be reused. |
| clearHistory() | Clear conversation history. |
| getHistory() / setHistory(msgs) | Read or restore conversation history. |
| registerTools(tools) | Merge additional tools into the agent. |
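The difference between interruptSpeech() and interruptCurrentResponse() can be modelled with two independent abort signals, one for the LLM stream and one for TTS. This is a conceptual sketch only: `InterruptibleResponse` and `streamTokens` are hypothetical, and the README does not say how the agent implements cancellation internally.

```typescript
// Hypothetical model of one in-flight response with separately cancellable stages.
class InterruptibleResponse {
  readonly llm = new AbortController();
  readonly tts = new AbortController();

  // Mirrors interruptSpeech(): stop audio, keep generating text.
  interruptSpeech(): void {
    this.tts.abort();
  }

  // Mirrors interruptCurrentResponse(): stop both stages (barge-in).
  interruptCurrentResponse(): void {
    this.llm.abort();
    this.tts.abort();
  }
}

// Stand-in for an LLM token stream that checks the signal before each token.
function streamTokens(
  tokens: string[],
  signal: AbortSignal,
  onToken: (token: string) => void
): number {
  let emitted = 0;
  for (const token of tokens) {
    if (signal.aborted) break;
    onToken(token);
    emitted++;
  }
  return emitted;
}

const response = new InterruptibleResponse();
const received: string[] = [];

// The user starts talking after two tokens, so the rest are never generated.
const emittedCount = streamTokens(["The", "weather", "is", "sunny"], response.llm.signal, (token) => {
  received.push(token);
  if (received.length === 2) response.interruptCurrentResponse();
});

// A second response where only speech is interrupted.
const speechOnly = new InterruptibleResponse();
speechOnly.interruptSpeech();
```

Stopping the LLM stream early is what saves tokens on barge-in; stopping only TTS leaves the text response intact for display or history.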

Read-only properties

| Property | Type | Description |
|---|---|---|
| connected | boolean | Whether a WebSocket is connected |
| processing | boolean | Whether a request is currently being processed |
| speaking | boolean | Whether audio is currently being generated / sent |
| pendingSpeechChunks | number | Number of queued TTS chunks |
| destroyed | boolean | Whether destroy() has been called |

Events

| Event | Payload | When |
|---|---|---|
| text | { role, text } | User input received or full assistant response ready |
| chunk:text_delta | { id, text } | Each streaming text token from the LLM |
| chunk:reasoning_delta | { id, text } | Each reasoning token (models that support it) |
| chunk:tool_call | { toolName, toolCallId, input } | Tool invocation detected |
| tool_result | { name, toolCallId, result } | Tool execution finished |
| speech_start | { streaming } | TTS generation begins |
| speech_complete | { streaming } | All TTS chunks sent |
| speech_interrupted | { reason } | Speech was cancelled (barge-in, disconnect, error) |
| speech_chunk_queued | { id, text } | A text chunk entered the TTS queue |
| audio_chunk | { chunkId, data, format, text, uint8Array } | One TTS chunk is ready |
| audio | { data, format, uint8Array } | Full non-streaming TTS audio |
| transcription | { text, language } | Audio transcription result |
| audio_received | { size } | Raw audio input received (before transcription) |
| history_trimmed | { removedCount, reason } | Oldest messages evicted from history |
| connected / disconnected | - | WebSocket lifecycle |
| warning | string | Non-fatal issues (empty input, etc.) |
| error | Error | Errors from LLM, TTS, transcription, or WebSocket |

Run (text-only check)

This validates LLM + tool + streaming speech without requiring WebSocket:

pnpm demo

Expected logs include text, chunk:text_delta, tool events, and speech chunk events.

Run (WebSocket check)

  1. Start the local WS server:

    pnpm ws:server

  2. In another terminal, run the demo:

    pnpm demo

The demo will:

  • run sendText() first (text-only sanity check), then
  • connect to VOICE_WS_ENDPOINT if provided,
  • emit streaming protocol messages (text_delta, tool_call, audio_chunk, response_complete, etc.).
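A client consuming these protocol messages would typically dispatch on the message type. The sketch below illustrates that pattern with a discriminated union. Only the four type names come from this README; every field on the messages (`text`, `toolName`, `input`, `chunkId`, `format`, `data`) is an assumption made for illustration, and `parseMessage` and `summarize` are hypothetical helpers.

```typescript
// Assumed message shapes; only the "type" values appear in the README.
type ServerMessage =
  | { type: "text_delta"; text: string }
  | { type: "tool_call"; toolName: string; input: unknown }
  | { type: "audio_chunk"; chunkId: number; format: string; data: string }
  | { type: "response_complete" };

interface TurnSummary {
  text: string;
  toolCalls: string[];
  audioChunks: number;
  complete: boolean;
}

const KNOWN_TYPES = ["text_delta", "tool_call", "audio_chunk", "response_complete"];

// Parse one raw WebSocket frame; returns null for invalid JSON or unknown types.
function parseMessage(raw: string): ServerMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const type = (value as { type?: unknown }).type;
  if (typeof type !== "string" || KNOWN_TYPES.indexOf(type) === -1) return null;
  return value as ServerMessage;
}

// Fold a sequence of frames into a summary of one assistant turn.
function summarize(rawMessages: string[]): TurnSummary {
  const summary: TurnSummary = { text: "", toolCalls: [], audioChunks: 0, complete: false };
  for (const raw of rawMessages) {
    const msg = parseMessage(raw);
    if (msg === null) continue; // ignore frames this client does not understand
    switch (msg.type) {
      case "text_delta":
        summary.text += msg.text;
        break;
      case "tool_call":
        summary.toolCalls.push(msg.toolName);
        break;
      case "audio_chunk":
        summary.audioChunks += 1;
        break;
      case "response_complete":
        summary.complete = true;
        break;
    }
  }
  return summary;
}
```

Ignoring unknown types rather than throwing keeps a client usable if the server adds message types later.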

Browser voice client (HTML)

A simple browser client is available at example/voice-client.html.

What it does:

  • captures microphone speech using Web Speech API (speech-to-text)
  • sends transcript to the agent via WebSocket (type: "transcript")
  • receives streaming audio_chunk messages and plays them in order
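Because TTS chunks may be generated in parallel, a client that plays them in order needs a small reorder buffer. Here is a minimal sketch of one, assuming chunkId is a sequential number starting at 0 (the README does not specify its type or starting value); `OrderedChunkBuffer` is a hypothetical name, and the actual example/voice-client.html may work differently.

```typescript
// Hypothetical reorder buffer: holds early arrivals until the gap before them is filled.
class OrderedChunkBuffer<T> {
  private pending = new Map<number, T>();
  private nextId = 0; // assumption: ids start at 0 and increase by 1

  // Store a chunk and return every chunk that is now playable, in order.
  add(chunkId: number, chunk: T): T[] {
    this.pending.set(chunkId, chunk);
    const ready: T[] = [];
    while (this.pending.has(this.nextId)) {
      ready.push(this.pending.get(this.nextId) as T);
      this.pending.delete(this.nextId);
      this.nextId++;
    }
    return ready;
  }

  // Drop everything, e.g. when speech is interrupted and a new response begins.
  reset(): void {
    this.pending.clear();
    this.nextId = 0;
  }
}

const playback = new OrderedChunkBuffer<string>();
const firstBatch = playback.add(1, "chunk-1"); // arrives early, held back
const secondBatch = playback.add(0, "chunk-0"); // fills the gap, releases both
const thirdBatch = playback.add(2, "chunk-2"); // already in order
```

In a browser, each released chunk would be decoded and appended to a playback queue; the buffer itself is independent of the audio API used.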

How to use:

  1. Start your agent server/WebSocket endpoint.
  2. Open example/voice-client.html in a browser (Chrome/Edge recommended).
  3. Connect to ws://localhost:8080 (or your endpoint), then click Start Mic.

Scripts

  • pnpm build – build TypeScript
  • pnpm dev – watch TypeScript
  • pnpm demo – run demo client
  • pnpm ws:server – run local test WebSocket server

Notes

  • If VOICE_WS_ENDPOINT is empty, WebSocket connect is skipped.
  • The sample WS server sends a mock transcript message for end-to-end testing.
  • Streaming TTS uses chunk queueing and supports interruption (interrupt).