@cloudflare/voice
v0.1.3
Voice pipeline for Cloudflare Agents — STT, TTS, VAD, streaming, and SFU utilities
Voice pipeline for Cloudflare Agents -- continuous STT, TTS, streaming, and real-time audio over WebSocket.
Experimental. This API is under active development and will break between releases. Pin your version and expect to rewrite when upgrading.
Install

```sh
npm install @cloudflare/voice
```

Exports
| Export path | What it provides |
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
| @cloudflare/voice | Server-side mixins (withVoice, withVoiceInput), provider types, Workers AI providers, SFU utilities |
| @cloudflare/voice/react | React hooks (useVoiceAgent, useVoiceInput) |
| @cloudflare/voice/client | Framework-agnostic VoiceClient class |
Server: full voice agent (withVoice)
Adds the complete voice pipeline: continuous STT, LLM turn handling, streaming TTS, interruption, and conversation persistence.
```ts
import { Agent } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAITTS,
  type VoiceTurnContext
} from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent<Env> {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    return "Hello! I heard you say: " + transcript;
  }
}
```

Provider properties
| Property | Type | Required | Description |
| ------------- | ------------- | -------- | -------------------------------- |
| transcriber | Transcriber | Yes | Continuous per-call STT provider |
| tts | TTSProvider | Yes | Text-to-speech provider |
Lifecycle hooks
| Method | Description |
| -------------------------------- | ---------------------------------------------------------------------------------- |
| onTurn(transcript, context) | Required. Handle a user utterance. Return string or AsyncIterable<string>. |
| createTranscriber(connection) | Override to create a transcriber dynamically per connection. |
| onCallStart(connection) | Called when a voice call begins. |
| onCallEnd(connection) | Called when a voice call ends. |
| onInterrupt(connection) | Called when user interrupts playback. |
| beforeCallStart(connection) | Return false to reject a call. |
| onMessage(connection, message) | Handle non-voice WebSocket messages (voice protocol is intercepted automatically). |
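As the table notes, onTurn may return either a string or an AsyncIterable&lt;string&gt;, so synthesis can begin before the full reply is generated. A minimal sketch of that streaming shape, written as standalone functions rather than agent methods (the word-by-word chunking and the collect helper are illustrative, not part of the library):

```typescript
// Illustrative only: an onTurn-style handler that streams its reply as an
// AsyncIterable<string>, so each chunk can be synthesized as it arrives.
async function* streamReply(transcript: string): AsyncIterable<string> {
  const words = ("You said: " + transcript).split(" ");
  for (const word of words) {
    yield word + " "; // the pipeline would hand each chunk to TTS
  }
}

// Stand-in for the pipeline's consumer: drain the stream into one string.
async function collect(stream: AsyncIterable<string>): Promise<string> {
  let out = "";
  for await (const chunk of stream) out += chunk;
  return out;
}
```

Returning a plain string remains the simpler option when time-to-first-audio is not a concern.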
Pipeline hooks
| Method | Description |
| ------------------------------------------ | ---------------------------------------------------- |
| afterTranscribe(transcript, connection) | Process transcript after STT. Return null to skip. |
| beforeSynthesize(text, connection) | Process text before TTS. Return null to skip. |
| afterSynthesize(audio, text, connection) | Process audio after TTS. Return null to skip. |
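A sketch of what two of these hooks might do, written as plain functions for clarity (the real hooks are methods that also receive the connection, omitted here, and the markdown-stripping regex is an illustrative assumption):

```typescript
// Illustrative hook logic: drop blank transcripts and strip markdown
// punctuation before TTS. Returning null skips that pipeline stage.
function afterTranscribe(transcript: string): string | null {
  const t = transcript.trim();
  return t.length > 0 ? t : null; // null: ignore this utterance
}

function beforeSynthesize(text: string): string | null {
  const plain = text.replace(/[*_`#]/g, "").trim();
  return plain.length > 0 ? plain : null; // null: skip synthesis
}
```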
Convenience methods
- speak(connection, text) -- synthesize and send audio to one connection
- speakAll(text) -- synthesize and send audio to all connections
- forceEndCall(connection) -- programmatically end a call
- saveMessage(role, content) -- persist a message to conversation history
- getConversationHistory() -- retrieve conversation history from SQLite
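A hedged sketch of how saveMessage and getConversationHistory typically pair up inside a turn handler. The in-memory array below is a stand-in for the agent's SQLite-backed store, so only the call pattern is meaningful:

```typescript
// Illustrative stand-in for the agent's persisted conversation history.
type Role = "user" | "assistant";
interface Message { role: Role; content: string }

const history: Message[] = [];

function saveMessage(role: Role, content: string): void {
  history.push({ role, content });
}

function getConversationHistory(): Message[] {
  return [...history];
}

// A turn handler would persist both sides of the exchange:
function handleTurn(transcript: string): string {
  saveMessage("user", transcript);
  const reply = "You said: " + transcript;
  saveMessage("assistant", reply);
  return reply;
}
```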
Server: voice input only (withVoiceInput)
STT-only mixin -- no TTS, no LLM. Use when you only need speech-to-text (e.g., dictation, transcription).
```ts
import { Agent, type Connection } from "agents";
import { withVoiceInput, WorkersAINova3STT } from "@cloudflare/voice";

const InputAgent = withVoiceInput(Agent);

export class DictationAgent extends InputAgent<Env> {
  transcriber = new WorkersAINova3STT(this.env.AI);

  onTranscript(text: string, connection: Connection) {
    console.log("User said:", text);
  }
}
```

Client: React
```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";

function App() {
  const {
    status,            // "idle" | "listening" | "thinking" | "speaking"
    transcript,        // TranscriptMessage[]
    interimTranscript, // string | null (real-time partial transcript)
    metrics,           // VoicePipelineMetrics | null
    audioLevel,        // number (0-1)
    isMuted,           // boolean
    connected,         // boolean
    error,             // string | null
    startCall,         // () => Promise<void>
    endCall,           // () => void
    toggleMute,        // () => void
    sendText,          // (text: string) => void
    sendJSON           // (data: Record<string, unknown>) => void
  } = useVoiceAgent({ agent: "my-agent" });

  return <div>Status: {status}</div>;
}
```

For voice input only:
```tsx
import { useVoiceInput } from "@cloudflare/voice/react";

const { transcript, interimTranscript, isListening, start, stop, clear } =
  useVoiceInput({ agent: "DictationAgent" });
```

Client: vanilla JavaScript
```ts
import { VoiceClient } from "@cloudflare/voice/client";

const client = new VoiceClient({ agent: "my-agent" });
client.addEventListener("statuschange", () => console.log(client.status));

client.connect();
await client.startCall();
```

Workers AI providers (built-in)
All default providers use Workers AI bindings -- no API keys required:
| Class | Type | Workers AI model | Recommended for |
| ------------------- | -------------- | --------------------- | ---------------- |
| WorkersAIFluxSTT | Continuous STT | @cf/deepgram/flux | withVoice |
| WorkersAINova3STT | Continuous STT | @cf/deepgram/nova-3 | withVoiceInput |
| WorkersAITTS | TTS | @cf/deepgram/aura-1 | Both |
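These classes are constructed from the Worker's AI binding (this.env.AI in the examples above). A minimal wrangler.toml fragment for that binding, assuming the binding name AI used throughout this README:

```toml
# Workers AI binding; the name must match env.AI in the agent code
[ai]
binding = "AI"
```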
Third-party providers
| Package | What it provides |
| ------------------------------ | ---------------------------------------- |
| @cloudflare/voice-deepgram | Continuous STT (Deepgram Nova) |
| @cloudflare/voice-elevenlabs | TTS (ElevenLabs) |
| @cloudflare/voice-twilio | Telephony adapter (Twilio Media Streams) |
Related
- examples/voice-agent -- full voice agent example
- examples/voice-input -- voice input (dictation) example
