# @cloudflare/voice

v0.0.3
Voice pipeline for Cloudflare Agents -- STT, TTS, VAD, streaming, and real-time audio over WebSocket.
Experimental. This API is under active development and will break between releases. Pin your version and expect to rewrite when upgrading.
## Install

```sh
npm install @cloudflare/voice
```

## Exports
| Export path | What it provides |
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
| @cloudflare/voice | Server-side mixins (withVoice, withVoiceInput), provider types, Workers AI providers, SFU utilities |
| @cloudflare/voice/react | React hooks (useVoiceAgent, useVoiceInput) |
| @cloudflare/voice/client | Framework-agnostic VoiceClient class |
## Server: full voice agent (`withVoice`)
Adds the complete voice pipeline: audio buffering, VAD, STT, LLM turn handling, streaming TTS, interruption, and conversation persistence.
```ts
import { Agent } from "agents";
import { withVoice, type VoiceTurnContext } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent<Env> {
  async onTurn(transcript: string, context: VoiceTurnContext) {
    // Return a string or AsyncIterable<string> (for streaming TTS)
    return "Hello! I heard you say: " + transcript;
  }
}
```

### Provider properties
| Property | Type | Required | Description |
| -------------- | ---------------------- | -------- | ------------------------------------------ |
| stt | STTProvider | No | Batch speech-to-text (default: Workers AI) |
| tts            | TTSProvider            | No       | Text-to-speech (default: Workers AI)       |
| vad | VADProvider | No | Voice activity detection |
| streamingStt | StreamingSTTProvider | No | Streaming STT for real-time transcripts |
### Lifecycle hooks
| Method | Description |
| -------------------------------- | ---------------------------------------------------------------------------------- |
| onTurn(transcript, context) | Required. Handle a user utterance. Return string or AsyncIterable<string>. |
| onCallStart(connection) | Called when a voice call begins. |
| onCallEnd(connection) | Called when a voice call ends. |
| onInterrupt(connection) | Called when user interrupts playback. |
| beforeCallStart(connection) | Return false to reject a call. |
| onMessage(connection, message) | Handle non-voice WebSocket messages (voice protocol is intercepted automatically). |
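Returning an `AsyncIterable<string>` from `onTurn` is what lets TTS begin before the model has finished generating. A standalone sketch of that streaming shape (no Cloudflare dependencies; the word-by-word split is purely illustrative):

```ts
// Sketch: a streaming turn handler. In a real agent this would be the body
// of onTurn; here it is a standalone async generator so the streaming shape
// is easy to see in isolation.
async function* streamReply(transcript: string): AsyncIterable<string> {
  const words = `You said: ${transcript}`.split(" ");
  for (const word of words) {
    // Each yielded chunk can be handed to TTS as soon as it arrives.
    yield word + " ";
  }
}

// Consume the stream the way the TTS stage would.
async function collect(stream: AsyncIterable<string>): Promise<string> {
  let out = "";
  for await (const chunk of stream) out += chunk;
  return out.trimEnd();
}

collect(streamReply("hello")).then((s) => console.log(s));
// prints "You said: hello"
```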
### Pipeline hooks
| Method | Description |
| ------------------------------------------ | ---------------------------------------------------- |
| beforeTranscribe(audio, connection) | Process audio before STT. Return null to skip. |
| afterTranscribe(transcript, connection) | Process transcript after STT. Return null to skip. |
| beforeSynthesize(text, connection) | Process text before TTS. Return null to skip. |
| afterSynthesize(audio, text, connection) | Process audio after TTS. Return null to skip. |
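As an example of the null-to-skip contract, a `beforeSynthesize`-style filter might strip markdown so the TTS engine does not read formatting characters aloud, and return `null` for empty text to skip synthesis entirely (a hypothetical body, not the library's implementation):

```ts
// Hypothetical beforeSynthesize-style filter: remove common markdown
// characters before TTS, and return null when nothing speakable remains,
// mirroring the hook's "return null to skip" contract.
function stripForSpeech(text: string): string | null {
  const cleaned = text.replace(/[*_`#]/g, "").trim();
  return cleaned.length > 0 ? cleaned : null;
}

console.log(stripForSpeech("**Hello**, _world_")); // "Hello, world"
console.log(stripForSpeech("***"));                // null (skip synthesis)
```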
### Convenience methods

- `speak(connection, text)` -- synthesize and send audio to one connection
- `speakAll(text)` -- synthesize and send audio to all connections
- `forceEndCall(connection)` -- programmatically end a call
- `saveMessage(role, content)` -- persist a message to conversation history
- `getConversationHistory()` -- retrieve conversation history from SQLite
## Server: voice input only (`withVoiceInput`)
STT-only mixin -- no TTS, no LLM. Use when you only need speech-to-text (e.g., dictation, transcription).
```ts
import { Server, type Connection } from "partyserver";
import { withVoiceInput, WorkersAIFluxSTT } from "@cloudflare/voice";

const InputServer = withVoiceInput(Server);

export class VoiceInputAgent extends InputServer<Env> {
  streamingStt = new WorkersAIFluxSTT(this.env.AI);

  onTranscript(text: string, connection: Connection) {
    console.log("User said:", text);
  }
}
```

## Client: React
```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";

function App() {
  const {
    status,            // "idle" | "listening" | "thinking" | "speaking"
    transcript,        // TranscriptMessage[]
    interimTranscript, // string | null (real-time partial transcript)
    metrics,           // VoicePipelineMetrics | null
    audioLevel,        // number (0-1)
    isMuted,           // boolean
    connected,         // boolean
    error,             // string | null
    startCall,         // () => Promise<void>
    endCall,           // () => void
    toggleMute,        // () => void
    sendText,          // (text: string) => void
    sendJSON           // (data: Record<string, unknown>) => void
  } = useVoiceAgent({ agent: "my-agent" });

  return <div>Status: {status}</div>;
}
```

For voice input only:
```ts
import { useVoiceInput } from "@cloudflare/voice/react";

const { transcript, interimTranscript, isListening, start, stop, clear } =
  useVoiceInput({ agent: "VoiceInputAgent" });
```

## Client: vanilla JavaScript
```ts
import { VoiceClient } from "@cloudflare/voice/client";

const client = new VoiceClient({ agent: "my-agent" });
client.addEventListener("statuschange", () => console.log(client.status));
client.connect();
await client.startCall();
```

## Workers AI providers (built-in)
All default providers use Workers AI bindings -- no API keys required:
| Class | Type | Workers AI model |
| ------------------ | ------------- | --------------------------------- |
| WorkersAISTT | Batch STT | @cf/deepgram/nova-3 |
| WorkersAIFluxSTT | Streaming STT | @cf/deepgram/nova-3 (WebSocket) |
| WorkersAITTS | TTS | @cf/deepgram/aura-1 |
| WorkersAIVAD | VAD | @cf/pipecat-ai/smart-turn-v2 |
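The default providers need only a Workers AI binding on the Worker. A minimal `wrangler.toml` fragment (the binding name `AI` is an assumption matching `this.env.AI` in the examples above):

```toml
# wrangler.toml -- Workers AI binding used by the default providers.
[ai]
binding = "AI"
```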
## Third-party providers
| Package | What it provides |
| ------------------------------ | ---------------------------------------- |
| @cloudflare/voice-deepgram | Streaming STT (Deepgram Nova) |
| @cloudflare/voice-elevenlabs | TTS (ElevenLabs) |
| @cloudflare/voice-twilio | Telephony adapter (Twilio Media Streams) |
## Related

- `examples/voice-agent` -- full voice agent example
- `examples/voice-input` -- voice input (dictation) example
- `experimental/voice.md` -- detailed API reference and protocol docs
