@craftedxp/voice-js
v0.6.0
JS SDK for embedding a voice agent call in any JS environment — browser, Node.js, Electron. Zero framework dependencies. Drop-in companion to @craftedxp/voice-rn (React Native).
JS SDK for embedding a voice agent call in any JS environment — browser tabs, Node.js processes, Electron apps. Zero framework deps.
Companion to @craftedxp/voice-rn (React Native) and @craftedxp/sdk-node (server-side sk_ SDK).
Internal testing release. API surface may evolve before a stable release. 0.3.2 is a bug fix release: `onStateChange` now fires correctly for state transitions driven by server frames; the callback was silently swallowed since 0.2.0 for `connected → listening`, `agent_turn_start → agent_speaking`, etc. Consumers using only `onTranscript` were unaffected; anyone building UI from `onStateChange` should upgrade. 0.3.1 added Node-consumer ergonomics (`onInterrupt` / `onAgentTurnStart` callbacks, `NodeVoiceClientFactory` return type); those depend on the state-callback path, so 0.3.2 is the minimum recommended. 0.3.0 added client tools: handlers the agent's LLM can call on the consumer's machine. 0.2.0 was a breaking rename + redesign of the previous `@voxline/web@0.1.0`: the singleton-`VoiceClient`-with-`apiKey` pattern is gone in favour of a `configureVoiceClient({ fetchToken })` factory that mirrors `voice-rn` 0.3.x. See Migrating from @voxline/web below.
Install
npm install @craftedxp/voice-js
# Node consumers also need:
npm install ws
ws is declared as an optional peer dependency, needed only in Node / Electron-main. Browsers use the native WebSocket and skip it.
How the integration fits together
The same three-party flow as voice-rn. Your backend mints ct_ tokens with its sk_ API key (via @craftedxp/sdk-node or a raw POST /v1/call-tokens), the SDK calls your fetchToken callback whenever it needs a fresh one, and your client never sees sk_.
┌─────────────────┐        ┌──────────────────┐        ┌─────────────────┐
│  Your web app   │        │   Your backend   │        │ Voissia server  │
│                 │        │                  │        │                 │
│  fetchToken ────┼───────►│  call Voissia  ──┼───────►│  mint ct_       │
│       │         │        │  with sk_        │        │       │         │
│       │◄────────┼────────┼──── ct_ ─────────┼────────┼─── ct_          │
│  startCall(...) ┼────────┼──── WSS /v1/agents/.../call?token=ct_ ─────►│
└─────────────────┘        └──────────────────┘        └─────────────────┘
The sk_ API key never lives in browser code. The SDK has no apiKey option. Pre-0.2 releases had one, which invited anyone following the docs to bake their server credential into client code; that whole class of footgun is gone.
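The backend half of this flow might look roughly like the sketch below. This README names the endpoint (POST /v1/call-tokens) but not its request or response shape, so the Bearer authorization header, the `{ agentId }` JSON body, and the helper name `buildMintRequest` are all assumptions for illustration; check the server documentation (or use @craftedxp/sdk-node) for the real contract.

```typescript
// Hypothetical helper: builds the HTTP request a backend would send to mint a ct_.
// ASSUMPTIONS (not confirmed by this README): Bearer auth, JSON body { agentId }.
interface MintRequest {
  url: string
  method: 'POST'
  headers: Record<string, string>
  body: string
}

function buildMintRequest(apiBase: string, secretKey: string, agentId: string): MintRequest {
  // Guard against accidentally passing a client token instead of the server key.
  if (!secretKey.startsWith('sk_')) {
    throw new Error('expected a server-side sk_ key')
  }
  // Tolerate a trailing slash on apiBase.
  const base = apiBase.replace(/\/+$/, '')
  return {
    url: `${base}/v1/call-tokens`,
    method: 'POST',
    headers: {
      Authorization: `Bearer ${secretKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ agentId }),
  }
}
```

A route handler on your backend would pass these fields to fetch and relay the response to the browser, with sk_ read from a server-side environment variable.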
Quick start (browser)
import { configureVoiceClient } from '@craftedxp/voice-js'

const voice = configureVoiceClient({
  apiBase: 'https://api.your-server.com',
  // SDK calls this whenever it needs a fresh ct_: initial connect
  // and any mid-call token refresh. Your backend handles the mint and
  // forwards the mint response's `transport` + `webrtcGatewayBase` so
  // the SDK can dispatch WS vs WebRTC per the agent's configuration.
  // (New agents default to WebRTC since 2026-05-16. Returning a bare
  // token string is still accepted for back-compat; it always uses WS.)
  fetchToken: async ({ agentId }) => {
    const r = await fetch('/api/voice/mint', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ agentId }),
    })
    const body = await r.json()
    return {
      token: body.token,
      transport: body.transport, // 'ws' | 'webrtc'
      webrtcGatewayBase: body.webrtcGatewayBase, // present when transport=webrtc
    }
  },
  // Optional: applied to every call. Per-call options merge on top.
  defaultMetadata: { surface: 'web', appVersion: '1.4.0' },
})

// Per call (typically inside a click handler so the AudioContext gets
// the user gesture it needs):
const call = await voice.startCall({
  agentId: 'agt_xxx',
  context: { userId: 'usr_123', topic: 'billing' },
  metadata: { sessionId: 'sess_x' },
  bargeIn: true,
  onStateChange: (state) => console.log('state', state),
  onTranscript: (entries) => render(entries),
  onVolume: ({ input, output }) => drawMeters(input, output),
  onError: ({ code, message }) => toast(`${code}: ${message}`),
  onEnd: ({ reason, durationMs }) => log('ended', reason, durationMs),
})

call.mute() // gate mic frames (server still sees wire cadence)
call.unmute()
call.end() // close WS + stop mic + fire onEnd
Quick start (Node / Electron-main)
import { configureVoiceClient } from '@craftedxp/voice-js/node'
import { spawn } from 'child_process'

const voice = configureVoiceClient({
  apiBase: 'https://api.your-server.com',
  fetchToken: async () => mintFromMyBackend(),
})

// Bring your own audio. Example: sox subprocesses for mic + speakers.
// Both ends use raw 16 kHz, mono, 16-bit signed PCM.
const mic = spawn('sox', [
  '-d', // read from the default input device
  '-r', '16000',
  '-c', '1',
  '-b', '16',
  '-e', 'signed',
  '-t', 'raw',
  '-', // write to stdout
])
const spk = spawn('sox', [
  '-t', 'raw',
  '-r', '16000',
  '-c', '1',
  '-b', '16',
  '-e', 'signed',
  '-', // read from stdin
  '-d', // play on the default output device
])

const call = await voice.startCall({
  agentId: 'agt_xxx',
  onAudioChunk: (pcm) => spk.stdin.write(Buffer.from(pcm)),
  onEnd: () => {
    mic.kill()
    spk.stdin.end()
  },
})

mic.stdout.on('data', (chunk) => call.sendAudioChunk(chunk))
The Node bundle has the same configureVoiceClient / startCall shape, plus an extra sendAudioChunk(pcm) method on the call handle and an onAudioChunk(pcm) per-call callback. There is no built-in audio adapter: feed PCM in and out yourself with whatever your host has handy (sox, PortAudio, RTP relay, Electron IPC bridge).
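Capture tools hand you chunks of arbitrary size, so a consumer may want to regroup them into even frames before calling sendAudioChunk. The sketch below is one way to do that. The 20 ms frame length is an assumption (this README does not state a required frame size); the 16 kHz, 16-bit, mono format comes from the sox arguments above, and `createFramer` is a hypothetical helper name.

```typescript
// ASSUMPTION: 20 ms frames. At 16 kHz, 16-bit mono that is 16000 * 2 * 0.02 = 640 bytes.
const FRAME_BYTES = 640

// Returns a push function that buffers incoming bytes and emits fixed-size frames.
function createFramer(frameBytes: number, onFrame: (frame: Uint8Array) => void) {
  let pending = new Uint8Array(0)
  return (chunk: Uint8Array): void => {
    // Append the new chunk to whatever was left over from the previous call.
    const merged = new Uint8Array(pending.length + chunk.length)
    merged.set(pending, 0)
    merged.set(chunk, pending.length)
    let offset = 0
    while (merged.length - offset >= frameBytes) {
      // slice() copies, so each emitted frame is independent of the scratch buffer.
      onFrame(merged.slice(offset, offset + frameBytes))
      offset += frameBytes
    }
    // Keep the incomplete tail for the next push.
    pending = merged.slice(offset)
  }
}
```

Since a Node Buffer is a Uint8Array, the returned function can be passed directly as the mic's 'data' listener, with onFrame forwarding each frame to call.sendAudioChunk.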
API reference
configureVoiceClient(config)
| Field | Type | Notes |
| ----------------- | ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| apiBase | string | Full HTTPS URL of the Voissia server. WS scheme derived: https→wss. Trailing slash optional. |
| fetchToken | (args) => Promise<string \| FetchTokenResult> | Called by the SDK whenever it needs a fresh ct_. Args: { agentId, userId?, context?, metadata? }. Return either a bare ct_ string (always WS, back-compat) or the rich { token, transport, webrtcGatewayBase? } object so the SDK can dispatch WS vs WebRTC per the agent's configuration. |
| defaultMetadata | Record<string, string>? | Applied to every startCall. Per-call merges on top. |
| defaultContext | Record<string, unknown>? | Applied to every startCall. Per-call merges on top. |
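Two behaviours in the table above can be modelled in a few lines. This is a sketch of the documented semantics, not the SDK's internals: the field types of `FetchTokenResult` are inferred from the quick start, and reading "merges on top" as a shallow merge where per-call keys win is an assumption. The helper names are hypothetical.

```typescript
type Transport = 'ws' | 'webrtc'

// Field types inferred from the quick-start example.
interface FetchTokenResult {
  token: string
  transport: Transport
  webrtcGatewayBase?: string
}

// A bare string is treated as a WS token, per the fetchToken row above.
function normalizeTokenResult(result: string | FetchTokenResult): FetchTokenResult {
  return typeof result === 'string' ? { token: result, transport: 'ws' } : result
}

// ASSUMPTION: shallow merge; a per-call key replaces the default with the same name.
function mergeOnTop<T>(
  defaults: Record<string, T> | undefined,
  perCall: Record<string, T> | undefined,
): Record<string, T> {
  return { ...defaults, ...perCall }
}
```

Under that reading, a defaultMetadata of { surface: 'web' } combined with a per-call { surface: 'kiosk' } would yield 'kiosk'.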
Returns a VoiceClientFactory with one method:
factory.startCall(options)
| Field | Type | Notes |
| ------------------ | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| agentId | string | Required. |
| userId | string? | Round-tripped to fetchToken as userId; server uses it for contact memory. |
| context | Record<string, unknown>? | Per-call structured context. Merged on top of defaultContext. Lowered into the agent's system prompt server-side. |
| metadata | Record<string, string>? | Per-call key/value. Merged on top of defaultMetadata. Round-tripped on call.ended webhook. NOT lowered into the prompt. |
| bargeIn | boolean? | Default true. Set false for alarm-style flows where the user shouldn't accidentally interrupt the script. |
| clientTools | ClientToolMap? | Per-call client tools the agent's LLM can invoke. See Client tools section below. Validated synchronously at startCall — bad input throws. |
| token | string? | Test-only escape hatch — pre-minted ct_, bypasses fetchToken. Don't use in production. |
| onStateChange | (state) => void | Fires on every state machine transition. |
| onTranscript | (entries) => void | Fires on every transcript update. |
| onInterrupt | () => void | Server signaled barge-in. Browser bundle auto-flushes built-in playback before this fires. Node consumers should drain their custom playback queue here. |
| onAgentTurnStart | () => void | New agent turn began. Use when you want a precise turn-start anchor without diffing onStateChange. |
| onVolume | ({ input, output }) => void | 0-1 RMS. ~10 Hz cadence. Browser bundle only. |
| onError | (err) => void | Stable code from CallErrorCode; matches voice-rn codes where overlap. |
| onEnd | ({ reason, errorCode?, durationMs }) => void | Fires once when the call ends. |
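Because onError delivers a stable code, an app can route errors with an exhaustive switch. The sketch below copies the codes from the CallErrorCode union listed under Stable types; the grouping into actions and the names `ErrorAction` and `actionFor` are hypothetical, since the SDK does not prescribe how to react to each code.

```typescript
type CallErrorCode =
  | 'missing_credentials'
  | 'forbidden'
  | 'mic_denied'
  | 'mic_start_failed'
  | 'audio_session_failed'
  | 'token_expired'
  | 'token_invalid'
  | 'unauthorized'
  | 'network_unreachable'
  | 'socket_error'
  | 'payment_required'
  | 'not_found'
  | 'silence_timeout'
  | 'server_error'

// Hypothetical app-level reactions; adjust the grouping to your product.
type ErrorAction = 'fix_config' | 'ask_permission' | 'refresh_token' | 'retry' | 'inform_user'

function actionFor(code: CallErrorCode): ErrorAction {
  // No default branch: the compiler flags any code added to the union later.
  switch (code) {
    case 'missing_credentials':
    case 'forbidden':
    case 'unauthorized':
    case 'not_found':
      return 'fix_config'
    case 'mic_denied':
      return 'ask_permission'
    case 'token_expired':
    case 'token_invalid':
      return 'refresh_token'
    case 'mic_start_failed':
    case 'audio_session_failed':
    case 'network_unreachable':
    case 'socket_error':
    case 'server_error':
      return 'retry'
    case 'payment_required':
    case 'silence_timeout':
      return 'inform_user'
  }
}
```

An onError handler could then branch on actionFor(err.code) instead of string-matching the message.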
Resolves to a Call handle:
interface Call {
  readonly state: CallState
  readonly transcript: TranscriptEntry[]
  readonly isMuted: boolean
  end: () => void
  mute: () => void
  unmute: () => void
}
Node consumers get a NodeCall extension with one extra method:
interface NodeCall extends Call {
  sendAudioChunk: (pcm: ArrayBuffer | ArrayBufferView) => boolean
}
Stable types
type CallState =
  | 'idle'
  | 'connecting'
  | 'listening'
  | 'user_speaking'
  | 'agent_speaking'
  | 'ended'
  | 'error'

type CallErrorCode =
  | 'missing_credentials'
  | 'forbidden'
  | 'mic_denied'
  | 'mic_start_failed'
  | 'audio_session_failed'
  | 'token_expired'
  | 'token_invalid'
  | 'unauthorized'
  | 'network_unreachable'
  | 'socket_error'
  | 'payment_required'
  | 'not_found'
  | 'silence_timeout'
  | 'server_error'

type CallEndReason = 'agent_ended' | 'user_hangup' | 'timeout' | 'error'
Animating while the agent is talking
Four optional callbacks on startCall give you everything you need to drive an "agent is talking" animation:
| Callback | What |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------- |
| onStateChange(state) | state === 'agent_speaking' is true while the agent is in its speaking turn. |
| onAgentTurnStart() | Fires the moment the agent starts a turn — a discrete trigger if you want an "activation" cue. |
| onInterrupt() | The user barged in. End the animation early. |
| onVolume({ input, output }) | Real-time amplitude. output is the agent's playback level (≈ what the listener hears); input is your mic. |
The 'agent_speaking' state delivered to onStateChange is protocol-driven: it flips on as soon as the server begins the turn, before the first audio sample reaches the speaker. For a visual that follows what the listener actually hears, drive it from the output value of onVolume.
The usual recipe is a stable gate × live amplitude:
let isAgentTurn = false
let amplitude = 0

const call = await voice.startCall({
  agentId: 'agt_xxx',
  onStateChange: (s) => {
    isAgentTurn = s === 'agent_speaking'
    render()
  },
  onInterrupt: () => {
    isAgentTurn = false
    render()
  },
  onVolume: ({ output }) => {
    amplitude = output
    render()
  },
})

function render() {
  if (isAgentTurn) showPulse({ scale: 1 + amplitude }) // your renderer
  else hidePulse()
}
1:1 calls only. In multi-party rooms (joinRoom) the in-room agent is a silent notetaker and has no speaking turn, so none of these signals fire for it.
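Raw onVolume samples arrive at roughly 10 Hz, which can make a pulse look jittery. One common remedy is exponential smoothing before rendering. The sketch below is an optional add-on to the recipe above, not part of the SDK; `createSmoother` is a hypothetical helper and the alpha value is a starting point to tune by eye.

```typescript
// Exponential smoothing: each sample moves the value a fraction `alpha` of the way
// toward the new reading. Higher alpha reacts faster; lower alpha is calmer.
function createSmoother(alpha: number) {
  let value = 0
  return (sample: number): number => {
    // Clamp defensively to the documented 0-1 range.
    const clamped = Math.min(1, Math.max(0, sample))
    value += alpha * (clamped - value)
    return value
  }
}

// ASSUMPTION: 0.3 is a reasonable first guess for a ~10 Hz cadence.
const smoothOutput = createSmoother(0.3)
```

Inside onVolume, assigning amplitude = smoothOutput(output) instead of the raw value keeps the rest of the recipe unchanged.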
Client tools
You can declare tools the agent's LLM can call on the consumer's machine. The tool's handler runs in your app; the server has no access to it. Useful for surface-only actions (read DOM state, hit a private API, mutate local storage, control the UI).
import { configureVoiceClient, type ClientToolMap } from '@craftedxp/voice-js'

const tools: ClientToolMap = {
  addTodoItem: {
    description: "Add an item to the user's todo list.",
    parameters: {
      type: 'object',
      properties: { text: { type: 'string' } },
      required: ['text'],
    },
    usage: 'Call when the user asks to add or capture a task.',
    handler: async ({ text }) => {
      await myAppApi.addTodo(String(text))
      return `Added "${text}".`
    },
  },
}

const voice = configureVoiceClient({ apiBase: '...', fetchToken: async () => '...' })
const call = await voice.startCall({ agentId: 'agt_xxx', clientTools: tools })
The SDK validates clientTools at startCall (synchronously; malformed input throws), then sends client_tools_register to the server right after connected. When the agent's LLM invokes a registered tool, your handler runs and the SDK posts the result back over the same connection (the WebSocket, or the WebRTC DataChannel since 0.4.1).
Handler return values are stringified (objects via JSON.stringify) before being sent back; thrown errors become { error: ... } frames. The server enforces a per-tool timeout: 10 s by default, or the timeoutMs in your declaration up to a maximum of 30 s.
For the full wire protocol, sequencing, and constraints see docs/sdks.md → Client tools.
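To make the return-value rules concrete, here is a rough model of how a handler result could be turned into what goes back to the agent. It illustrates the behaviour described above and is not the SDK's actual code: the exact frame shape, the use of the error's message as the error string, and the empty-string fallback for undefined are assumptions, and all three function names are hypothetical.

```typescript
type ToolHandler = (args: Record<string, unknown>) => unknown | Promise<unknown>

// Strings pass through unchanged; everything else is JSON-encoded.
function stringifyResult(value: unknown): string {
  if (typeof value === 'string') return value
  // JSON.stringify(undefined) yields undefined, so fall back to an empty string.
  return JSON.stringify(value) ?? ''
}

// ASSUMPTION: the error frame carries the thrown error's message.
function toErrorFrame(err: unknown): { error: string } {
  return { error: err instanceof Error ? err.message : String(err) }
}

// Runs a handler and never rejects: failures are reported as an error frame.
async function runToolHandler(
  handler: ToolHandler,
  args: Record<string, unknown>,
): Promise<{ result: string } | { error: string }> {
  try {
    return { result: stringifyResult(await handler(args)) }
  } catch (err) {
    return toErrorFrame(err)
  }
}
```

The practical takeaway for handler authors is to return a short string or a small JSON-serializable object, and to throw an Error with a message the LLM can act on.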
Migrating from @voxline/web
- import { VoiceClient } from '@voxline/web'
+ import { configureVoiceClient } from '@craftedxp/voice-js'

- const client = new VoiceClient({
-   apiBase: 'https://api.example.com',
-   agentId: 'agt_xxx',
-   apiKey: 'sk_REDACTED', // ⚠️ DO NOT do this in client code
-   variables: { topic: 'billing' },
- })
- client.on('state', (s) => setState(s))
- client.on('transcript', (t) => setTranscript(t))
- client.on('volume', (v) => setVolume(v))
- client.on('error', (e) => setError(e))
- client.on('close', () => setState('ended'))
- await client.connect()
- // …
- client.mute(true)
- client.disconnect()

+ const voice = configureVoiceClient({
+   apiBase: 'https://api.example.com',
+   fetchToken: async ({ agentId }) => {
+     const r = await fetch('/api/voice/mint', {
+       method: 'POST',
+       headers: { 'Content-Type': 'application/json' },
+       body: JSON.stringify({ agentId }),
+     })
+     return (await r.json()).token // your backend uses sk_ to mint ct_
+   },
+ })
+ const call = await voice.startCall({
+   agentId: 'agt_xxx',
+   context: { topic: 'billing' }, // was `variables`
+   onStateChange: (s) => setState(s),
+   onTranscript: (t) => setTranscript(t),
+   onVolume: (v) => setVolume(v),
+   onError: (e) => setError(e),
+   onEnd: ({ reason }) => setState('ended'),
+ })
+ // …
+ call.mute() // toggleable: mute/unmute, no boolean
+ call.end()
Three semantic shifts to be aware of:
- apiKey is gone. The SDK no longer accepts it. Move your token mint to your backend. If you have an sk_ in JS code today, that's a credential leak: rotate it after the migration.
- variables → context. Same purpose; the new name lines up with voice-rn.
- mute(true|false) → mute() / unmute(). Symmetric with voice-rn.
The embed widget (<script src="embed.js" data-token="ct_...">) keeps the same HTML API, but the data-api-key attribute is no longer accepted — mint server-side and inject data-token instead.
Troubleshooting
Agent's last syllable cuts off and plays into the next agent message. Almost always a misfiring barge-in (acoustic echo from a laptop speaker → mic, or a false-positive VAD on background noise). Three quick fixes, in order:
1. Test with headphones. Eliminates acoustic echo. If the symptom disappears, it was echo. Tell production users to wear headphones, or fall back to (3).
2. Check for a phantom user turn between two agent turns in the transcript that contains words the agent just said. That confirms STT is hearing the agent's voice through the mic.
3. Pass bargeIn: false on startCall for non-conversational flows. This adds ?barge=off to the WS URL, and the SDK ignores interrupt events client-side. Tradeoff: the user can't interrupt mid-sentence.
For the full diagnostic walkthrough (including the rarer Gemini-Live
stale-audio-leak case and audio-handling guidance for Node/Electron consumers),
see docs/sdks.md → Audio quality troubleshooting.
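The phantom-turn check in step 2 can be automated while debugging. The sketch below flags user turns that sit between two agent turns and consist mostly of words from the preceding agent turn. The `Entry` shape is an assumption (this README does not show the fields of TranscriptEntry), and the 0.8 overlap threshold and the name `findPhantomTurns` are hypothetical.

```typescript
// ASSUMPTION: a minimal stand-in for TranscriptEntry; map the real fields onto it.
interface Entry {
  role: 'user' | 'agent'
  text: string
}

// Lowercased word tokens, ignoring punctuation.
function words(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? []
}

// Returns the indexes of user turns that look like echoes of the agent's own speech.
function findPhantomTurns(entries: Entry[], threshold = 0.8): number[] {
  const hits: number[] = []
  for (let i = 1; i < entries.length - 1; i++) {
    const prev = entries[i - 1]
    const cur = entries[i]
    const next = entries[i + 1]
    if (!prev || !cur || !next) continue
    if (cur.role !== 'user' || prev.role !== 'agent' || next.role !== 'agent') continue
    const spoken = new Set(words(prev.text))
    const heard = words(cur.text)
    if (heard.length === 0) continue
    // Fraction of the user's words that the agent had just said.
    const overlap = heard.filter((w) => spoken.has(w)).length / heard.length
    if (overlap >= threshold) hits.push(i)
  }
  return hits
}
```

A non-empty result during a speaker-and-mic test is a hint, not proof, that STT is picking up the agent's playback; the headphone test remains the more reliable check.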
Embed widget
For drop-in <script> consumers (landing pages, no-build embeds):
<script
  src="https://your-cdn/embed.js"
  data-token="ct_REDACTED"
  data-agent-id="agt_xxx"
  data-api-base="https://api.your-server.com"
  defer
></script>
Renders a floating call button with a Shadow-DOM transcript panel. Pre-mint the ct_ server-side and inject it into the data-token attribute when you render the page.
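Server-side injection of the token could be sketched as a small template function. The attribute names are the ones shown above; `renderEmbedTag`, `escapeAttr`, and the `EmbedOptions` shape are hypothetical helpers, and the escaping covers only double-quoted HTML attributes.

```typescript
// Escapes the characters that could break out of a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
}

interface EmbedOptions {
  scriptSrc: string
  token: string // a freshly minted ct_, never an sk_
  agentId: string
  apiBase: string
}

// Builds the script tag to place in the server-rendered page.
function renderEmbedTag(o: EmbedOptions): string {
  return [
    '<script',
    `  src="${escapeAttr(o.scriptSrc)}"`,
    `  data-token="${escapeAttr(o.token)}"`,
    `  data-agent-id="${escapeAttr(o.agentId)}"`,
    `  data-api-base="${escapeAttr(o.apiBase)}"`,
    '  defer',
    '></script>',
  ].join('\n')
}
```

Mint the token per page render rather than caching the HTML, so the embedded ct_ is fresh when the visitor clicks the call button.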
Status
- 0.5.4 (current): Screen sharing. `setScreenShareEnabled(on, { audio })` publishes a `screen_share` video track (and, where the browser allows, a `screen_share_audio` track: Chrome captures tab/system audio; macOS Chrome is tab-audio only; Safari/Firefox don't capture share audio). `RoomTrackEvent` gains a `source` field (`camera` / `microphone` / `screen_share` / `screen_share_audio` / `unknown`) so a screen share can render as its own tile instead of replacing the participant's camera. Adds `isScreenShareEnabled()` and `getLocalScreenTrack()`. No 1:1 call-surface change. Drop-in for 0.5.3 consumers (the new `source` field is additive).
- 0.5.3: adds `session.participantId` (this session's own stable `p_…` id). `active.speakers` includes the local participant, so a focus-tile UI rendered you as a remote when you spoke (no remote track → blank tile with the raw id). Filter `participantId` out of `active.speakers` to focus only remote speakers (self-view when none). No 1:1 call-surface change. Drop-in upgrade for 0.5.2 consumers.
- 0.5.2: adds `getRemoteTracks(): RoomTrackEvent[]`. Returns remote tracks already subscribed at call time. A late joiner misses the live `track.subscribed` events for tracks published before it connected (LiveKit delivers them during `connect`, before consumer listeners attach), so it never rendered participants who already had their camera on (e.g. the host). Call `getRemoteTracks()` right after registering `track.subscribed` to backfill them. No 1:1 call-surface change. Drop-in upgrade for 0.5.1 consumers.
- 0.5.1: Room video surface. `RoomSession` gains: `track.subscribed` / `track.unsubscribed` events with payload `{ participantId: string; kind: 'audio' | 'video'; track: RemoteTrack }` (raw livekit-client track; call `track.attach(el)` / `track.detach()`); an `active.speakers` event (`string[]` of participantIds currently speaking, drives active-speaker UI); `setMicEnabled(on: boolean): Promise<void>` / `setCameraEnabled(on: boolean): Promise<void>` (mid-call toggles); `isMicEnabled(): boolean` / `isCameraEnabled(): boolean` (read current state for toggle button UI); `getLocalCameraTrack(): LocalVideoTrack | null` (attach to self-view element). No API changes to the 1:1 call surface (`startCall` / `Call`). Drop-in upgrade for 0.5.0 consumers.
- 0.5.0: Multi-party rooms. `joinRoom({ roomId, joinCode, name })` returns a typed `RoomSession` with `participant.{joined,left}`, `transcript.partial`, `system.message`, `room.ended` events. Adds `livekit-client` as a direct dependency (~250 KB gzipped, tree-shakes if unused). `publishMic()` / `publishCamera()` / `leave()` methods. See `docs/sdks.md` → Multi-party rooms.
- 0.4.2: WebRTC reliability. `onicecandidate` + `onconnectionstatechange` listeners now register before `setLocalDescription` (where ICE gathering starts). Candidates emitted before `callId` is known are buffered and flushed once the answer arrives. Browsers buffer candidates internally so the prior listen-after-`setRemoteDescription` pattern worked in Chrome/Safari, but the explicit ordering is correct per spec and matches the fix shipped in @craftedxp/[email protected]. No API surface change; drop-in upgrade.
- 0.4.1: `client_tools` over the WebRTC DataChannel. Tool register on `connected`, dispatch on `client_tool_call` frames, parity with the WS transport.
- 0.4.0: WebRTC transport support. `fetchToken` may now return `{ token, transport, webrtcGatewayBase? }`; the SDK dispatches WS vs WebRTC accordingly. Backwards-compatible: bare-string returns still always use WS.
- 0.3.2: bug fix. `onStateChange` now fires for state transitions driven by server frames (`connected → listening`, `agent_turn_start → agent_speaking`, etc.). Latent regression since 0.2.0; `onTranscript`-only consumers were unaffected, but anyone deriving UI from `onStateChange` should upgrade. No API changes; drop-in.
- 0.3.1: adds `onInterrupt` / `onAgentTurnStart` callbacks on `StartCallOptions` and a proper `NodeVoiceClientFactory` return type for the Node entry. Backwards-compatible. Use 0.3.2 instead: both new callbacks depend on the state-callback path that 0.3.2 fixes.
- 0.3.0: adds client-tools support. New `clientTools` option on `startCall` accepts a `ClientToolMap` (description, parameters, handler, optional usage/timeoutMs/example). Browser and Node bundles both supported. Backwards-compatible: existing consumers see no change.
- 0.2.0: first `@craftedxp/voice-js` release. Browser + Node dual bundle, `fetchToken` factory, voice-rn 0.3.x parity. Migration path from `@voxline/web@0.1.0` documented above.
- 0.1.0: `@voxline/web`. Singleton `VoiceClient` class, `apiKey` accepted. Retired in 0.2.0; never published to npm so no deprecation window.
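The 0.5.3 note about filtering the local participant out of active.speakers can be sketched as a pure function. Choosing the first remaining id as the focus tile is an assumption (the ordering of active.speakers is not described here), and `pickFocus` is a hypothetical helper name.

```typescript
// Drops the local participant so a focus tile only follows remote speakers.
// Returns null when nobody remote is speaking, meaning "show the self-view".
function pickFocus(activeSpeakers: string[], selfId: string): string | null {
  const remote = activeSpeakers.filter((id) => id !== selfId)
  // ASSUMPTION: the first remote id is the one to focus.
  return remote[0] ?? null
}
```

In an active.speakers listener, the selfId argument would be session.participantId.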
See CONSUMING.md for the full setup walkthrough and DEVELOPING.md for SDK-author iteration.
