# @javierchen/streaming-voice-sdk

v0.1.2
Browser-native, framework-agnostic realtime voice streaming SDK.
Handles microphone capture, WebSocket audio transport, SSE text streaming, and MSE-based MP3 playback — all coordinated by `turnId`.
## Install
```bash
npm install @javierchen/streaming-voice-sdk
```

## Quick Start
```ts
import { StreamingVoiceManager } from '@javierchen/streaming-voice-sdk';

// 1. Place an <audio> element in your HTML
// <audio id="voice-audio"></audio>
const audioEl = document.getElementById('voice-audio') as HTMLAudioElement;

// 2. Create the manager
const voice = new StreamingVoiceManager({
  wsUrl: 'ws://localhost:8525/api/ws/voice',
  sseUrlBuilder: (turnId) =>
    `http://localhost:8525/api/voice/turn/${turnId}/text/stream`,
  mediaElement: audioEl,
});

// 3. Listen to events
voice.onText(({ text }) => {
  // Incremental text deltas from the AI
  document.getElementById('reply')!.textContent += text;
});

voice.onStateChange(({ visibleStatus }) => {
  // 'idle' | 'listening' | 'recognizing' | 'thinking' | 'speaking' | ...
  console.log(visibleStatus);
});

voice.onAsrFinal(({ text }) => {
  // Final speech-to-text result
  console.log('You said:', text);
});

voice.onError(({ code, message }) => {
  console.error(`Error [${code}]: ${message}`);
});

// 4. Connect
await voice.start();
```

## Usage
### Push-to-Talk (mic capture)
```ts
// Mouse down / touch start — begin recording
const { turnId } = await voice.startTurn({
  chatId: 'my-session',
  captureMicrophone: true,
});

// Mouse up / touch end — stop recording, AI starts processing
await voice.stopTurn();
```

### Text-only (no mic)
```ts
const { turnId } = await voice.startTurn({
  chatId: 'my-session',
  transcript: 'Explain vector databases',
  // captureMicrophone defaults to false
});
```

### Manual audio upload
```ts
const { turnId } = await voice.startTurn({
  chatId: 'my-session',
  captureMicrophone: false,
});

// Send audio chunks manually
voice.sendAudioChunk(arrayBuffer);
await voice.stopTurn();
```

### Interrupt AI response

```ts
voice.interrupt();
```

### Cleanup

```ts
// Remove all listeners and close connections
voice.destroy();
```

## React Example
```tsx
import { useEffect, useRef, useState } from 'react';
import {
  StreamingVoiceManager,
  type VisibleVoiceStatus,
} from '@javierchen/streaming-voice-sdk';

export function VoiceChat() {
  const audioRef = useRef<HTMLAudioElement>(null);
  const managerRef = useRef<StreamingVoiceManager | null>(null);
  const [status, setStatus] = useState<VisibleVoiceStatus>('idle');
  const [reply, setReply] = useState('');

  useEffect(() => {
    if (!audioRef.current) return;
    const mgr = new StreamingVoiceManager({
      wsUrl: 'ws://localhost:8525/api/ws/voice',
      sseUrlBuilder: (turnId) =>
        `http://localhost:8525/api/voice/turn/${turnId}/text/stream`,
      mediaElement: audioRef.current,
    });
    mgr.onStateChange((s) => setStatus(s.visibleStatus));
    mgr.onText((e) => setReply((prev) => prev + e.text));
    mgr.start();
    managerRef.current = mgr;
    return () => { mgr.destroy(); };
  }, []);

  const handleTalk = async () => {
    if (!managerRef.current) return;
    if (status === 'idle' || status === 'completed') {
      setReply('');
      await managerRef.current.startTurn({
        chatId: 'react-demo',
        captureMicrophone: true,
      });
    } else {
      managerRef.current.stopTurn();
    }
  };

  return (
    <div>
      <audio ref={audioRef} hidden />
      <p>Status: {status}</p>
      <p>AI: {reply}</p>
      <button onMouseDown={handleTalk} onMouseUp={() => managerRef.current?.stopTurn()}>
        Hold to talk
      </button>
      <button onClick={() => managerRef.current?.interrupt()}>Stop</button>
    </div>
  );
}
```

## API Reference
### Constructor Options
```ts
new StreamingVoiceManager(options: StreamingVoiceManagerOptions)
```

| Option | Type | Default | Description |
|---|---|---|---|
| wsUrl | string | — | WebSocket endpoint URL |
| sseUrlBuilder | (turnId: string) => string | — | Returns the SSE URL for a given turn |
| mediaElement | HTMLMediaElement | — | `<audio>` or `<video>` element for playback |
| mimeCodec | string | 'audio/mpeg' | MIME codec for MSE playback |
| debug | boolean | false | Emit debug events |
| autoPlay | boolean | false | Auto-play audio when buffered |
| reconnectPolicy | ReconnectPolicy | see below | WebSocket reconnect behavior |
| bufferPolicy | BufferPolicy | see below | MSE buffer management |
### Methods
| Method | Returns | Description |
|---|---|---|
| start() | Promise<void> | Connect WebSocket |
| stop() | Promise<void> | Stop active turn and close WS |
| startTurn(options) | Promise<{ turnId }> | Start a new conversation turn |
| stopTurn(turnId?) | Promise<void> | End current turn |
| interrupt() | void | Interrupt AI response |
| sendAudioChunk(chunk) | void | Send raw audio data |
| destroy() | void | Full cleanup |
### `startTurn` Options
| Option | Type | Default | Description |
|---|---|---|---|
| chatId | string | — | Required. Session identifier |
| turnId | string | auto UUID | Custom turn ID |
| transcript | string | — | Pre-supplied text (skip ASR) |
| captureMicrophone | boolean | false | Auto-start mic recording |
| webSearchEnabled | boolean | false | Enable web search |
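
The defaults in this table can be sketched as an options-normalizing step. This is illustrative only — `normalizeTurnOptions` is not part of the SDK, and `randomUUID` merely stands in for whatever ID generator the library actually uses (browsers expose the equivalent `crypto.randomUUID()`):

```ts
import { randomUUID } from 'node:crypto';

interface StartTurnOptions {
  chatId: string;            // required session identifier
  turnId?: string;           // custom turn ID; auto-generated when omitted
  transcript?: string;       // pre-supplied text (skips ASR)
  captureMicrophone?: boolean;
  webSearchEnabled?: boolean;
}

// Apply the documented defaults: auto turnId, mic and web search off.
function normalizeTurnOptions(opts: StartTurnOptions) {
  return {
    ...opts,
    turnId: opts.turnId ?? randomUUID(),
    captureMicrophone: opts.captureMicrophone ?? false,
    webSearchEnabled: opts.webSearchEnabled ?? false,
  };
}

const turn = normalizeTurnOptions({ chatId: 'my-session' });
console.log(turn.captureMicrophone); // false
console.log(turn.webSearchEnabled);  // false
```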
### Events

All `on*` methods return an unsubscribe function `() => void`.
| Method | Payload | Description |
|---|---|---|
| onText(fn) | { turnId, text, sequence } | Incremental text deltas |
| onStateChange(fn) | TurnStateSnapshot | UI state changes |
| onAudioState(fn) | AudioStateSnapshot | Playback state changes |
| onAsrPartial(fn) | { mode: 'partial', text } | Partial speech recognition |
| onAsrFinal(fn) | { mode: 'final', text } | Final speech recognition |
| onError(fn) | { code, message, recoverable } | Error events |
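
The `on*`-returns-unsubscribe contract can be illustrated with a minimal listener registry. This is a sketch of the pattern only, not the SDK's internals:

```ts
type Listener<T> = (payload: T) => void;

// Minimal emitter mirroring the "subscribe returns unsubscribe" contract.
class TinyEmitter<T> {
  private listeners = new Set<Listener<T>>();

  on(fn: Listener<T>): () => void {
    this.listeners.add(fn);
    return () => { this.listeners.delete(fn); }; // unsubscribe
  }

  emit(payload: T): void {
    for (const fn of this.listeners) fn(payload);
  }
}

const textEvents = new TinyEmitter<{ turnId: string; text: string }>();
let received = '';
const off = textEvents.on(({ text }) => { received += text; });

textEvents.emit({ turnId: 't1', text: 'Hel' });
textEvents.emit({ turnId: 't1', text: 'lo' });
off(); // detached: further emits are ignored
textEvents.emit({ turnId: 't1', text: '!' });
console.log(received); // 'Hello'
```

Holding on to the returned function (rather than the listener itself) is what makes cleanup trivial in component teardown, e.g. a React effect's return value.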
### Visible Status (UI binding)

Use `onStateChange` → `visibleStatus` to drive your UI:
| Status | Meaning |
|---|---|
| idle | No active turn |
| listening | Mic is recording |
| recognizing | Speech-to-text in progress |
| thinking | AI is processing |
| speaking | AI audio is playing |
| completed | Turn finished normally |
| interrupted | Turn was interrupted |
| failed | Turn ended with error |
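
One way to bind these statuses is a pure mapping from `visibleStatus` to button state. The status names come from the table above; the labels and the grouping are illustrative, not part of the SDK:

```ts
type VisibleVoiceStatus =
  | 'idle' | 'listening' | 'recognizing' | 'thinking'
  | 'speaking' | 'completed' | 'interrupted' | 'failed';

interface TalkButtonState {
  label: string;
  disabled: boolean;
}

// Pure mapping: a new turn may start only when no turn is active.
function talkButtonState(status: VisibleVoiceStatus): TalkButtonState {
  switch (status) {
    case 'idle':
    case 'completed':
    case 'interrupted':
    case 'failed':
      return { label: 'Hold to talk', disabled: false };
    case 'listening':
      return { label: 'Release to send', disabled: false };
    case 'recognizing':
    case 'thinking':
    case 'speaking':
      return { label: 'AI responding...', disabled: true };
  }
}

console.log(talkButtonState('idle').label);        // 'Hold to talk'
console.log(talkButtonState('thinking').disabled); // true
```

Keeping this as a pure function makes the status-to-UI logic trivially unit-testable, independent of any framework.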
### Reconnect Policy
```ts
{
  enabled: true,          // auto-reconnect on disconnect
  initialDelayMs: 500,    // first retry delay
  maxDelayMs: 10000,      // max retry delay
  factor: 1.8,            // exponential backoff multiplier
  maxAttempts: Infinity,  // retry limit
}
```

### Buffer Policy
```ts
{
  maxQueuedChunks: 256,      // max buffered chunks before dropping the oldest
  pruneBehindSeconds: 30,    // trim buffer this far behind the playhead
  resumeAheadSeconds: 0.35,  // auto-play threshold
}
```

## Architecture
Two independent transports, coordinated by `turnId`:
| Concern | Transport |
|---|---|
| Text deltas | SSE (per-turn EventSource) |
| Audio uplink (mic) | WebSocket binary frames |
| Audio downlink (TTS MP3) | WebSocket binary frames |
| Control signals | WebSocket JSON |
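
Because text and audio arrive on separate transports, `turnId` — together with the `sequence` field on text deltas — is what lets a client stitch a turn back together. The sketch below shows one client-side assembly strategy (not the SDK's implementation): deltas from superseded turns are dropped, and `sequence` ordering is applied defensively before joining:

```ts
interface TextDelta { turnId: string; text: string; sequence: number; }

// Accumulates text deltas for the active turn only. Deltas tagged with a
// stale turnId (e.g. arriving after an interrupt) are ignored, and the
// output is ordered by `sequence` regardless of arrival order.
class TurnTextAssembler {
  private activeTurnId: string | null = null;
  private deltas: TextDelta[] = [];

  beginTurn(turnId: string): void {
    this.activeTurnId = turnId;
    this.deltas = [];
  }

  push(delta: TextDelta): void {
    if (delta.turnId !== this.activeTurnId) return; // stale turn, drop
    this.deltas.push(delta);
  }

  text(): string {
    return [...this.deltas]
      .sort((a, b) => a.sequence - b.sequence)
      .map((d) => d.text)
      .join('');
  }
}

const asm = new TurnTextAssembler();
asm.beginTurn('turn-2');
asm.push({ turnId: 'turn-1', text: 'old ', sequence: 0 });    // ignored
asm.push({ turnId: 'turn-2', text: 'world', sequence: 1 });
asm.push({ turnId: 'turn-2', text: 'Hello ', sequence: 0 });
console.log(asm.text()); // 'Hello world'
```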
## Browser Compatibility
- Requires the `MediaSource` API for audio playback (Chrome, Edge, Firefox, Safari 17+)
- Falls back gracefully to text-only mode if `MediaSource` is unavailable
- Uses `ScriptProcessorNode` for mic capture (broadest compatibility)
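
A client can mirror the SDK's text-only fallback with a small feature check. This helper is illustrative (not exported by the SDK); `MediaSource.isTypeSupported` is the standard MSE detection API:

```ts
// Returns true only when Media Source Extensions can play the given codec.
// The globalThis lookup keeps this safe to call outside a browser (e.g. SSR),
// where MediaSource is undefined and text-only mode should be used.
function supportsMsePlayback(mimeCodec: string = 'audio/mpeg'): boolean {
  const MS = (globalThis as any).MediaSource;
  return !!MS
    && typeof MS.isTypeSupported === 'function'
    && MS.isTypeSupported(mimeCodec);
}

if (!supportsMsePlayback()) {
  console.log('MSE unavailable: running in text-only mode');
}
```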
## License
MIT
