browser-voice

v0.1.0

Published

2 months ago

Browser-first voice capture, playback, and WebSocket transport for realtime apps.


eduardpaul

audio browser pcm16 realtime voice vad webaudio websocket

`browser-voice`

This package is a small TypeScript library for:

capturing microphone audio in the browser
sending audio over plain WebSocket in configurable formats
receiving remote audio over plain WebSocket in configurable formats
playing remote audio with Web Audio controls
exchanging custom JSON/control events alongside audio

It is intentionally simpler than a full realtime media SDK:

no room / participant / publication model
no WebRTC signaling
no SFU assumptions
no vendor-specific backend contract

Status

Experimental, but usable.

The API is designed to stay small and explicit. Expect additive evolution while the transport and browser-compatibility edges continue to harden.

Inspiration

This library is inspired by, and honestly vibe-coded from, LiveKit's WebSocket / voice implementation patterns, especially the browser audio, transport, and recovery ideas in livekit-client.

It is not a port of LiveKit, and it does not depend on LiveKit at runtime.

Attribution and license notes

This package is licensed under Apache-2.0.

Because the implementation is heavily inspired by, and in a few places adapted from, upstream open-source work, the package also includes:

NOTICE
CREDITS.md
THIRD_PARTY_NOTICES.md

These files document upstream attributions and third-party notices, including:

LiveKit client SDK inspiration / Apache-2.0 notice context
ts-debounce MIT attribution for the adapted debounce helper

Why this package exists

Use browser-voice when you want browser voice behavior inspired by larger realtime SDKs without adopting a full room/signaling model.

Typical use cases:

browser client -> custom .NET / Node / Python voice backend over WebSocket
raw PCM16 or JSON audio payloads instead of WebRTC
custom control events mixed with audio on the same socket
Azure / OpenAI / bespoke realtime backends

Features

microphone capture defaults tuned for voice
autoplay-safe AudioContext handling
explicit startAudio() playback unlock flow
optional pre-connect audio buffering
automatic capture recovery for ended / missing microphone tracks
debounced media-device observation
configurable incoming and outgoing audio formats
reconnect backoff for plain WebSocket sessions
backpressure-aware frame queueing
optional capture-side noise suppression processor
voice activity tracking
remote playback analyser for visualizers
remote playback gain / EQ / limiter hooks
custom JSON event send / receive support
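Reconnect backoff for a plain WebSocket is commonly implemented as capped exponential backoff with jitter. The sketch below illustrates that general technique only. `computeBackoffDelay`, `BackoffOptions`, and their fields are hypothetical names and are not the API of the exported BackoffStrategy.

```typescript
// Hypothetical helper, not part of browser-voice: capped exponential backoff
// with "full jitter" for scheduling WebSocket reconnect attempts.
interface BackoffOptions {
  baseDelayMs: number; // delay ceiling for the first retry
  maxDelayMs: number; // upper bound for any retry delay
}

function computeBackoffDelay(
  attempt: number, // 0 for the first retry, 1 for the second, ...
  options: BackoffOptions,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  // Double the ceiling on every attempt, but never exceed maxDelayMs.
  const exponential = options.baseDelayMs * Math.pow(2, Math.max(0, attempt));
  const ceiling = Math.min(options.maxDelayMs, exponential);
  // Full jitter: pick a delay uniformly in [0, ceiling) so that many clients
  // disconnected at the same moment do not all reconnect at the same moment.
  return Math.floor(random() * ceiling);
}
```

A reconnect loop would wait `computeBackoffDelay(attempt, options)` milliseconds before each new connection attempt and reset `attempt` to 0 once a connection succeeds.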

Install

npm install browser-voice tslib

pnpm add browser-voice tslib

Runtime requirements

modern browser with:
- MediaDevices.getUserMedia
- WebSocket
- AudioContext
Node.js >= 20.19.0 for package development, docs generation, and demo tooling

This is a browser-first library. It is not intended to capture or play audio directly in Node.js.
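A page can check for the browser APIs listed above before constructing any library objects. The following is a minimal sketch; `findMissingVoiceApis` and `BrowserLikeGlobal` are hypothetical names and are not exported by browser-voice.

```typescript
// Hypothetical helper, not part of browser-voice: reports which of the
// required browser APIs are missing from a given global object.
interface BrowserLikeGlobal {
  navigator?: { mediaDevices?: { getUserMedia?: unknown } };
  WebSocket?: unknown;
  AudioContext?: unknown;
}

function findMissingVoiceApis(scope: BrowserLikeGlobal): string[] {
  const missing: string[] = [];
  // navigator.mediaDevices is only exposed in secure contexts
  // (HTTPS or localhost), so this check also fails on plain HTTP pages.
  if (typeof scope.navigator?.mediaDevices?.getUserMedia !== 'function') {
    missing.push('MediaDevices.getUserMedia');
  }
  if (typeof scope.WebSocket !== 'function') {
    missing.push('WebSocket');
  }
  if (typeof scope.AudioContext !== 'function') {
    missing.push('AudioContext');
  }
  return missing;
}
```

In a browser this could be called as `findMissingVoiceApis(globalThis as BrowserLikeGlobal)`, showing a fallback message when the returned array is not empty.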

Quick start

import {
  NoiseSuppressionProcessor,
  PcmAudioPlayer,
  VoiceCapture,
  VoiceWebSocket,
} from 'browser-voice';

const capture = new VoiceCapture({
  autoRecover: true,
  preConnectBufferMs: 1500,
  processor: new NoiseSuppressionProcessor(),
  targetSampleRate: 24000,
  targetChannelCount: 1,
});

const player = new PcmAudioPlayer({
  initialBufferMs: 120,
});

const voiceSocket = new VoiceWebSocket({
  url: 'wss://example.com/voice',
  capture,
  player,
  autoReconnect: true,
  incomingAudioFormat: 'raw-pcm16',
  incomingAudioFormatOptions: {
    sampleRate: 24000,
    channels: 1,
  },
  outgoingAudioFormat: 'raw-pcm16',
  onJsonEvent: ({ parsed }) => {
    console.log('control event', parsed);
  },
});

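// Open the WebSocket session.
await voiceSocket.connect();
// Unlock playback explicitly. Browsers generally require a user gesture
// (for example a click handler) before an AudioContext may produce sound.
await player.startAudio();
// Start microphone capture.
await capture.start();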

voiceSocket.sendJsonEvent({
  type: 'ping',
  timestamp: Date.now(),
});

Transport formats

Incoming audio formats

VoiceWebSocket supports:

framed-pcm16
raw-pcm16
json
auto

Outgoing audio formats

VoiceWebSocket supports:

framed-pcm16
raw-pcm16
json

Default framed PCM layout

The default framed-pcm16 binary format is:

4 bytes: sample rate (uint32, little-endian)
2 bytes: channel count (uint16, little-endian)
2 bytes: reserved flags (uint16, little-endian)
remaining bytes: signed PCM16 payload
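For backends that need to produce or consume this layout, the header can be written and read with a DataView. This is a minimal sketch of the layout as described above, not the library's own encoder. The function names are hypothetical, and little-endian byte order for the PCM16 samples is an assumption, since the layout only states endianness for the header fields.

```typescript
// Sketch of the framed-pcm16 layout: 8-byte header followed by PCM16 samples.
// Names are hypothetical; little-endian samples are an assumption.
const FRAME_HEADER_BYTES = 8;

interface PcmFrame {
  sampleRate: number; // uint32
  channels: number; // uint16
  flags: number; // uint16, reserved
  samples: Int16Array;
}

function encodeFramedPcm16(frame: PcmFrame): ArrayBuffer {
  const buffer = new ArrayBuffer(FRAME_HEADER_BYTES + frame.samples.length * 2);
  const view = new DataView(buffer);
  view.setUint32(0, frame.sampleRate, true); // bytes 0-3
  view.setUint16(4, frame.channels, true); // bytes 4-5
  view.setUint16(6, frame.flags, true); // bytes 6-7
  frame.samples.forEach((sample, i) => {
    view.setInt16(FRAME_HEADER_BYTES + i * 2, sample, true);
  });
  return buffer;
}

function decodeFramedPcm16(buffer: ArrayBuffer): PcmFrame {
  const payloadBytes = buffer.byteLength - FRAME_HEADER_BYTES;
  if (payloadBytes < 0 || payloadBytes % 2 !== 0) {
    throw new Error('Not a valid framed-pcm16 message');
  }
  const view = new DataView(buffer);
  const samples = new Int16Array(payloadBytes / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(FRAME_HEADER_BYTES + i * 2, true);
  }
  return {
    sampleRate: view.getUint32(0, true),
    channels: view.getUint16(4, true),
    flags: view.getUint16(6, true),
    samples,
  };
}
```

Reading through a DataView rather than constructing an Int16Array directly over the buffer keeps the byte order explicit and independent of the host platform.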

`raw-pcm16`

Use raw-pcm16 when the backend expects plain binary PCM16 with no custom library header.

Important:

outgoing browser audio is sent as raw PCM16 bytes only
incoming binary audio is interpreted using incomingAudioFormatOptions
text / JSON messages are treated as non-audio and routed to onJsonEvent
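Because a raw-pcm16 message carries no header, the receiver has to supply the sample rate and channel count itself. The sketch below shows how such bytes could be turned into per-channel float data for playback. `decodeRawPcm16` is a hypothetical helper, and both little-endian byte order and interleaved channel layout are assumptions.

```typescript
// Hypothetical helper, not part of browser-voice. Assumes little-endian,
// interleaved PCM16 (L R L R ... for stereo).
interface RawPcmOptions {
  sampleRate: number;
  channels: number;
}

interface DecodedPcm {
  channelData: Float32Array[]; // one array per channel, values in [-1, 1)
  durationMs: number;
}

function decodeRawPcm16(bytes: ArrayBuffer, options: RawPcmOptions): DecodedPcm {
  const view = new DataView(bytes);
  // One frame holds one 2-byte sample per channel; trailing partial frames are ignored.
  const frameCount = Math.floor(bytes.byteLength / (2 * options.channels));
  const channelData: Float32Array[] = [];
  for (let channel = 0; channel < options.channels; channel++) {
    const out = new Float32Array(frameCount);
    for (let frame = 0; frame < frameCount; frame++) {
      const byteOffset = (frame * options.channels + channel) * 2;
      out[frame] = view.getInt16(byteOffset, true) / 32768;
    }
    channelData.push(out);
  }
  return { channelData, durationMs: (frameCount / options.sampleRate) * 1000 };
}
```

The same bytes decoded with a wrong sample rate still play, but at the wrong speed and pitch, which is why the options on both ends must match.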

`json`

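With the json format, audio messages are JSON objects that carry the samples either as a numeric pcm16 array or as a base64 pcm16Base64 string: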

{
  "sampleRate": 24000,
  "channels": 1,
  "pcm16": [100, -200, 300]
}

{
  "sampleRate": 24000,
  "channels": 1,
  "pcm16Base64": "..."
}

Non-audio JSON is routed to onJsonEvent.
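A receiver for these two shapes could look like the sketch below. `parseJsonAudio` is a hypothetical helper, not the library's parser, and it assumes that pcm16Base64 holds little-endian PCM16 bytes. It returns null for anything that is not audio, which is the case a caller would treat as a control event.

```typescript
// Hypothetical helper, not part of browser-voice. Assumes pcm16Base64
// encodes little-endian PCM16 bytes.
interface JsonAudio {
  sampleRate: number;
  channels: number;
  samples: Int16Array;
}

function parseJsonAudio(text: string): JsonAudio | null {
  let message: unknown;
  try {
    message = JSON.parse(text);
  } catch {
    return null; // not JSON at all
  }
  if (typeof message !== 'object' || message === null) return null;
  const { sampleRate, channels, pcm16, pcm16Base64 } = message as Record<string, unknown>;
  if (typeof sampleRate !== 'number' || typeof channels !== 'number') return null;

  if (Array.isArray(pcm16)) {
    return { sampleRate, channels, samples: new Int16Array(pcm16 as number[]) };
  }
  if (typeof pcm16Base64 === 'string') {
    let binary: string;
    try {
      binary = atob(pcm16Base64); // throws on malformed base64
    } catch {
      return null;
    }
    const view = new DataView(new ArrayBuffer(binary.length));
    for (let i = 0; i < binary.length; i++) view.setUint8(i, binary.charCodeAt(i));
    const samples = new Int16Array(Math.floor(binary.length / 2));
    for (let i = 0; i < samples.length; i++) samples[i] = view.getInt16(i * 2, true);
    return { sampleRate, channels, samples };
  }
  return null; // valid JSON, but not an audio payload
}
```

The array form is easy to debug but roughly several times larger on the wire than base64, so the base64 form is usually the better choice for sustained streaming.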

`auto`

auto is for mixed transports. It will:

try framed PCM first for binary audio
accept JSON-wrapped audio
ignore or route non-audio JSON to onJsonEvent
fall back to raw PCM16 when incomingAudioFormatOptions are configured
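The detection order above can be pictured as a small classifier. This is only a sketch of one way such detection could work, not the library's actual logic. `classifyIncoming`, `looksLikeFramedPcm16`, and the header plausibility ranges are all assumptions made for illustration.

```typescript
// Hypothetical sketch of auto-style detection; not the library's implementation.
type MessageKind = 'framed-pcm16' | 'raw-pcm16' | 'json-audio' | 'json-event' | 'unknown';

// Assumed heuristic: a framed message has an 8-byte header with a plausible
// sample rate and channel count, followed by a whole number of 16-bit samples.
function looksLikeFramedPcm16(data: ArrayBuffer): boolean {
  if (data.byteLength < 8 || (data.byteLength - 8) % 2 !== 0) return false;
  const view = new DataView(data);
  const sampleRate = view.getUint32(0, true);
  const channels = view.getUint16(4, true);
  return sampleRate >= 8000 && sampleRate <= 192000 && channels >= 1 && channels <= 8;
}

function classifyIncoming(data: string | ArrayBuffer, hasRawFallbackOptions: boolean): MessageKind {
  if (typeof data !== 'string') {
    // Binary: try framed PCM first, then fall back to raw PCM16 if configured.
    if (looksLikeFramedPcm16(data)) return 'framed-pcm16';
    return hasRawFallbackOptions ? 'raw-pcm16' : 'unknown';
  }
  try {
    const parsed: unknown = JSON.parse(data);
    const isAudio =
      typeof parsed === 'object' &&
      parsed !== null &&
      ('pcm16' in parsed || 'pcm16Base64' in parsed);
    return isAudio ? 'json-audio' : 'json-event';
  } catch {
    return 'unknown'; // plain text that is not JSON
  }
}
```

Any heuristic of this kind can misread raw PCM16 whose first bytes happen to resemble a valid header, so an explicit format is the safer choice when the backend's wire format is known.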

Custom control / data events

You can send custom JSON to the backend independently of the audio format:

voiceSocket.sendJsonEvent({
  type: 'assistant.reset',
  correlationId: '123',
});

You can also send plain text:

voiceSocket.sendTextMessage('ping');

To handle inbound non-audio messages, pass an onJsonEvent callback:

const socket = new VoiceWebSocket({
  // ...
  onJsonEvent: ({ format, parsed, rawText }) => {
    console.log(format, parsed, rawText);
  },
});

Audio behavior notes

Silence and server VAD

If your backend depends on server-side VAD, do not drop silent outgoing frames unless you are also manually committing turns.

skipSilentFrames can reduce bandwidth, but it can also prevent backends like Azure server VAD from detecting end-of-speech.
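To see why dropping silence interferes with server VAD, consider a simple energy gate. The sketch below is illustrative only: `isSilentFrame`, `selectFramesToSend`, and the RMS threshold are hypothetical, and the library's actual skipSilentFrames logic may differ.

```typescript
// Hypothetical energy gate, not the library's skipSilentFrames implementation.
function isSilentFrame(samples: Int16Array, thresholdRms = 0.01): boolean {
  if (samples.length === 0) return true;
  let sumSquares = 0;
  samples.forEach((sample) => {
    const normalized = sample / 32768; // map PCM16 to roughly [-1, 1)
    sumSquares += normalized * normalized;
  });
  return Math.sqrt(sumSquares / samples.length) < thresholdRms;
}

// Returns the frames that would actually be sent over the socket.
function selectFramesToSend(frames: Int16Array[], skipSilent: boolean): Int16Array[] {
  return skipSilent ? frames.filter((frame) => !isSilentFrame(frame)) : frames;
}
```

With the gate enabled, the quiet frames that follow an utterance never reach the backend. A server-side VAD that waits for a run of silence to detect end-of-speech then sees the audio simply stop, which is why a manual turn commit is needed in that configuration.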

Noise suppression

The library uses two layers:

browser-native constraints such as echoCancellation, noiseSuppression, autoGainControl, and voiceIsolation
an optional NoiseSuppressionProcessor that applies lightweight browser-side filtering / gating / compression

The built-in processor is not a full acoustic echo canceller. It complements browser voice processing; it does not replace it.
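The first layer corresponds to audio constraints passed to getUserMedia. Not every browser supports every flag, so one option is to request only the ones the browser reports as supported. This is a sketch under that assumption. `buildVoiceConstraints` is a hypothetical helper and does not describe how VoiceCapture builds its constraints internally.

```typescript
// Hypothetical helper, not part of browser-voice.
type VoiceConstraintName =
  | 'echoCancellation'
  | 'noiseSuppression'
  | 'autoGainControl'
  | 'voiceIsolation';

type VoiceConstraintFlags = Partial<Record<VoiceConstraintName, boolean>>;

// Keeps only the voice-processing flags that the browser reports as supported.
function buildVoiceConstraints(supported: VoiceConstraintFlags): VoiceConstraintFlags {
  const names: VoiceConstraintName[] = [
    'echoCancellation',
    'noiseSuppression',
    'autoGainControl',
    'voiceIsolation',
  ];
  const constraints: VoiceConstraintFlags = {};
  names.forEach((name) => {
    if (supported[name]) constraints[name] = true;
  });
  return constraints;
}

// Browser usage (not executed here):
//   const supported = navigator.mediaDevices.getSupportedConstraints();
//   const audio = buildVoiceConstraints(supported);
//   const stream = await navigator.mediaDevices.getUserMedia({ audio });
```

Even when a flag is requested, the browser treats it as a preference, so the applied settings are best confirmed afterwards with `track.getSettings()`.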

Playback effects

PcmAudioPlayer supports:

setGain() / setVolume()
setEqualizer() for low / mid / high EQ shelves
setLimiter() for a built-in limiter / dynamics-compressor setup
setProcessorChain() for custom AudioNode[]
getAnalyser() for visualizers

Example:

const player = new PcmAudioPlayer();

player.setEqualizer({
  lowDb: 2,
  midDb: -1,
  highDb: 1,
});

player.setLimiter({
  enabled: true,
  threshold: -8,
  ratio: 12,
});

const analyser = player.getAnalyser();
const volume = analyser.getVolume();
const bars = analyser.getFrequencyBands(16);

Demo

Run the local demo server:

pnpm demo

Then open the printed URL in your browser.

The demo includes:

transport format selection
sample rate / channel configuration
mic capture
playback unlock
playback visualizer
gain / EQ / limiter controls
custom JSON event sender
mixed text + binary relay through the demo server

API overview

Main exports:

VoiceCapture
VoiceWebSocket
PcmAudioPlayer
AudioPlaybackManager
PlaybackAnalyser
NoiseSuppressionProcessor
VoiceActivityDetector
observeMediaDevices
BackoffStrategy
debounce

Development

pnpm build
pnpm run docs:api
pnpm lint
pnpm test
pnpm coverage

Generated API docs are written to docs/api/.

Coverage reports are written to coverage/.

If demo behavior changes, also run:

node --check packages/browser-voice/demo/server.mjs
node --check packages/browser-voice/demo/public/app.js

License

Apache-2.0