@craftedxp/voice-js

v0.6.0

Published

13 days ago

JS SDK for embedding a voice agent call in any JS environment — browser, Node.js, Electron. Zero framework dependencies. Drop-in companion to @craftedxp/voice-rn (React Native).

seekarun

voice voice-ai agent websocket browser node electron craftedxp

@craftedxp/voice-js

JS SDK for embedding a voice agent call in any JS environment — browser tabs, Node.js processes, Electron apps. Zero framework deps.

Companion to @craftedxp/voice-rn (React Native) and @craftedxp/sdk-node (server-side sk_ SDK).

Internal testing release. The API surface may evolve before a stable release. Notes on earlier releases:

0.3.2 is a bug-fix release: onStateChange now fires correctly for state transitions driven by server frames. Since 0.2.0 the callback had been silently swallowed for connected → listening, agent_turn_start → agent_speaking, etc. Consumers using only onTranscript were unaffected; anyone building UI from onStateChange should upgrade.
0.3.1 added Node-consumer ergonomics (onInterrupt / onAgentTurnStart callbacks, NodeVoiceClientFactory return type). Those depend on the state-callback path, so 0.3.2 is the minimum recommended version.
0.3.0 added client tools: handlers the agent's LLM can call on the consumer's machine.
0.2.0 was a breaking rename and redesign of the previous @voxline/web@0.1.0. The singleton-VoiceClient-with-apiKey pattern is gone in favour of a configureVoiceClient({ fetchToken }) factory that mirrors voice-rn 0.3.x. See Migrating from @voxline/web below.

Install

npm install @craftedxp/voice-js
# Node consumers also need:
npm install ws

ws is declared as an OPTIONAL peer — only needed in Node / Electron-main. Browsers use the native WebSocket and skip it.

How the integration fits together

The same three-party flow as voice-rn. Your backend mints ct_ tokens with its sk_ API key (via @craftedxp/sdk-node or a raw POST /v1/call-tokens), the SDK calls your fetchToken callback whenever it needs a fresh one, and your client never sees sk_.

┌─────────────────┐        ┌──────────────────┐        ┌─────────────────┐
│  Your web app   │        │ Your backend     │        │ Voissia server  │
│                 │        │                  │        │                 │
│  fetchToken ────┼───────►│  call Voissia  ──┼───────►│  mint ct_       │
│        │        │        │  with sk_        │        │       │         │
│        │◄───────┼────────┼──── ct_  ────────┼────────┼─── ct_          │
│  startCall(...) ┼────────┼──── WSS /v1/agents/.../call?token=ct_ ─────►│
└─────────────────┘        └──────────────────┘        └─────────────────┘

The sk_ API key never lives in browser code. The SDK has no apiKey option. Pre-0.2 had one, which invited anyone reading the docs to bake their server credential into client code; that whole class of footgun is gone.
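The backend half of this flow can be sketched roughly as follows. This is an illustration, not the official server SDK: the `mintCallToken` name, the `Authorization: Bearer` header, and the `{ agentId }` request body are assumptions. Only the `POST /v1/call-tokens` path comes from this README. The parsed mint response is returned untouched so your route can forward whatever fields it contains. `fetchImpl` is injectable so the function can be exercised without a network.

```javascript
// Hypothetical backend helper: exchanges the secret sk_ key for a short-lived ct_ token.
// ASSUMPTIONS: the Bearer auth header and the { agentId } body are illustrative only.
async function mintCallToken({ apiBase, secretKey, agentId, fetchImpl = fetch }) {
  // Tolerate a trailing slash on apiBase.
  const url = `${apiBase.replace(/\/+$/, '')}/v1/call-tokens`
  const res = await fetchImpl(url, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${secretKey}`,
    },
    body: JSON.stringify({ agentId }),
  })
  if (!res.ok) {
    throw new Error(`call-token mint failed with HTTP ${res.status}`)
  }
  // Return the parsed mint response as-is; the route handler decides what to forward.
  return res.json()
}
```

Your `/api/voice/mint` route would call this with the sk_ key read from server-side configuration and send the result to the browser, so the key itself never leaves the backend.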

Quick start (browser)

import { configureVoiceClient } from '@craftedxp/voice-js'

const voice = configureVoiceClient({
  apiBase: 'https://api.your-server.com',
  // SDK calls this whenever it needs a fresh ct_ — initial connect
  // and any mid-call token refresh. Your backend handles the mint and
  // forwards the mint response's `transport` + `webrtcGatewayBase` so
  // the SDK can dispatch WS vs WebRTC per the agent's configuration.
  // (New agents default to WebRTC since 2026-05-16. The bare-string
  // form below is still accepted for back-compat — it always uses WS.)
  fetchToken: async ({ agentId }) => {
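    const r = await fetch('/api/voice/mint', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ agentId }),
    })
    // Surface mint failures instead of parsing an error page as a token.
    if (!r.ok) throw new Error(`mint failed: HTTP ${r.status}`)
    const body = await r.json()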
    return {
      token: body.token,
      transport: body.transport, // 'ws' | 'webrtc'
      webrtcGatewayBase: body.webrtcGatewayBase, // present when transport=webrtc
    }
  },
  // Optional — applied to every call. Per-call options merge on top.
  defaultMetadata: { surface: 'web', appVersion: '1.4.0' },
})

// Per call (typically inside a click handler so the AudioContext gets
// the user gesture it needs):
const call = await voice.startCall({
  agentId: 'agt_xxx',
  context: { userId: 'usr_123', topic: 'billing' },
  metadata: { sessionId: 'sess_x' },
  bargeIn: true,
  onStateChange: (state) => console.log('state', state),
  onTranscript: (entries) => render(entries),
  onVolume: ({ input, output }) => drawMeters(input, output),
  onError: ({ code, message }) => toast(`${code}: ${message}`),
  onEnd: ({ reason, durationMs }) => log('ended', reason, durationMs),
})

call.mute() // gate mic frames (server still sees wire cadence)
call.unmute()
call.end() // close WS + stop mic + fire onEnd
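Since startCall is asynchronous and usually wired to a button, a second click can arrive before the first call has connected. One possible guard is sketched below. `createCallToggle` is a hypothetical helper, not part of the SDK; it only assumes that the function you pass in resolves to a handle with an `end()` method, as the Call handle does.

```javascript
// Wraps an async start function so repeated clicks cannot open two calls at once.
// `start` is whatever begins a call, e.g. () => voice.startCall({ agentId: 'agt_xxx' }).
function createCallToggle(start) {
  let call = null // active call handle, if any
  let pending = null // in-flight start promise, if any

  return async function toggle() {
    // A start is already in flight: hand back the same promise.
    if (pending) return pending
    // A call is active: this click hangs up.
    if (call) {
      call.end()
      call = null
      return null
    }
    pending = start()
    try {
      call = await pending
      return call
    } finally {
      // Cleared on success and on failure so the next click can retry.
      pending = null
    }
  }
}
```

In a page this would typically be attached with something like `button.addEventListener('click', toggle)`, which also keeps the startCall inside the user gesture the AudioContext needs.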

Quick start (Node / Electron-main)

import { configureVoiceClient } from '@craftedxp/voice-js/node'
import { spawn } from 'child_process'

const voice = configureVoiceClient({
  apiBase: 'https://api.your-server.com',
  fetchToken: async () => mintFromMyBackend(),
})

// Bring your own audio. Example: sox subprocesses for mic + speakers.
const mic = spawn('sox', [
  '-d',
  '-r',
  '16000',
  '-c',
  '1',
  '-b',
  '16',
  '-e',
  'signed',
  '-t',
  'raw',
  '-',
])
const spk = spawn('sox', [
  '-t',
  'raw',
  '-r',
  '16000',
  '-c',
  '1',
  '-b',
  '16',
  '-e',
  'signed',
  '-',
  '-d',
])

const call = await voice.startCall({
  agentId: 'agt_xxx',
  onAudioChunk: (pcm) => spk.stdin.write(Buffer.from(pcm)),
  onEnd: () => {
    mic.kill()
    spk.stdin.end()
  },
})

mic.stdout.on('data', (chunk) => call.sendAudioChunk(chunk))

The Node bundle has the same configureVoiceClient / startCall shape, plus an extra sendAudioChunk(pcm) method on the call handle and an onAudioChunk(pcm) per-call callback. No built-in audio adapter — feed PCM in/out yourself with whatever your host has handy (sox, PortAudio, RTP relay, Electron IPC bridge).
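A subprocess such as sox delivers buffers of arbitrary size. If your pipeline prefers fixed-size frames, a small re-chunker like the sketch below can sit between the mic and sendAudioChunk. The 20 ms frame duration and the helper names are assumptions; this README does not say the server requires any particular frame size. The byte arithmetic follows from the 16 kHz, mono, 16-bit format used in the sox commands above.

```javascript
// 16 kHz, mono, 16-bit signed PCM (the format the sox commands above produce).
const SAMPLE_RATE = 16000
const BYTES_PER_SAMPLE = 2

// Bytes in one frame of the given duration, e.g. 20 ms -> 640 bytes.
function frameBytes(ms) {
  return (SAMPLE_RATE * BYTES_PER_SAMPLE * ms) / 1000
}

// Accumulates arbitrary-sized chunks and emits fixed-size frames via onFrame.
function createFrameChunker(size, onFrame) {
  let pending = Buffer.alloc(0)
  return function push(chunk) {
    pending = Buffer.concat([pending, chunk])
    while (pending.length >= size) {
      onFrame(pending.subarray(0, size))
      pending = pending.subarray(size)
    }
    return pending.length // bytes still waiting for the next frame
  }
}
```

With the quick start above, usage would look like `const push = createFrameChunker(frameBytes(20), (f) => call.sendAudioChunk(f))` followed by `mic.stdout.on('data', push)`.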

API reference

`configureVoiceClient(config)`

| Field | Type | Notes |
| --- | --- | --- |
| apiBase | string | Full HTTPS URL of the Voissia server. WS scheme derived: https→wss. Trailing slash optional. |
| fetchToken | (args) => Promise<string \| FetchTokenResult> | Called by the SDK whenever it needs a fresh ct_. Args: { agentId, userId?, context?, metadata? }. Return either a bare ct_ string (always WS, back-compat) or the rich { token, transport, webrtcGatewayBase? } object so the SDK can dispatch WS vs WebRTC per the agent's configuration. |
| defaultMetadata | Record<string, string>? | Applied to every startCall. Per-call merges on top. |
| defaultContext | Record<string, unknown>? | Applied to every startCall. Per-call merges on top. |
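The two rules in this table that are easiest to get wrong, the scheme derivation and the two accepted fetchToken return shapes, can be pictured with the sketch below. It illustrates the described behaviour and is not the SDK's internal code: both function names are hypothetical, and mapping plain http to ws is an assumption added for local development.

```javascript
// Derives the WebSocket base from apiBase: https -> wss, trailing slashes dropped.
// Mapping http -> ws as well is an assumption, not something this README states.
function deriveWsBase(apiBase) {
  return apiBase.replace(/^http/, 'ws').replace(/\/+$/, '')
}

// Normalizes both accepted fetchToken return shapes into one object.
function normalizeTokenResult(result) {
  if (typeof result === 'string') {
    // Bare ct_ string: the back-compat form, which always uses WS.
    return { token: result, transport: 'ws' }
  }
  return {
    token: result.token,
    transport: result.transport,
    webrtcGatewayBase: result.webrtcGatewayBase,
  }
}
```

Writing your own fetchToken against the rich shape from the start avoids a later change when an agent is switched to WebRTC.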

Returns a VoiceClientFactory with one method:

`factory.startCall(options)`

| Field | Type | Notes |
| --- | --- | --- |
| agentId | string | Required. |
| userId | string? | Round-tripped to fetchToken as userId; server uses it for contact memory. |
| context | Record<string, unknown>? | Per-call structured context. Merged on top of defaultContext. Lowered into the agent's system prompt server-side. |
| metadata | Record<string, string>? | Per-call key/value. Merged on top of defaultMetadata. Round-tripped on call.ended webhook. NOT lowered into the prompt. |
| bargeIn | boolean? | Default true. Set false for alarm-style flows where the user shouldn't accidentally interrupt the script. |
| clientTools | ClientToolMap? | Per-call client tools the agent's LLM can invoke. See Client tools section below. Validated synchronously at startCall; bad input throws. |
| token | string? | Test-only escape hatch: a pre-minted ct_ that bypasses fetchToken. Don't use in production. |
| onStateChange | (state) => void | Fires on every state machine transition. |
| onTranscript | (entries) => void | Fires on every transcript update. |
| onInterrupt | () => void | Server signaled barge-in. Browser bundle auto-flushes built-in playback before this fires. Node consumers should drain their custom playback queue here. |
| onAgentTurnStart | () => void | New agent turn began. Use when you want a precise turn-start anchor without diffing onStateChange. |
| onVolume | ({ input, output }) => void | 0-1 RMS. ~10 Hz cadence. Browser bundle only. |
| onError | (err) => void | Stable code from CallErrorCode; matches voice-rn codes where overlap. |
| onEnd | ({ reason, errorCode?, durationMs }) => void | Fires once when the call ends. |
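"Merged on top" for context and metadata can be read as: per-call keys win, default keys that are not overridden survive. A minimal sketch of that reading follows. `mergeOnTop` is a hypothetical name, and treating the merge as shallow (rather than deep) is an assumption this README does not confirm.

```javascript
// Per-call keys override defaults; untouched default keys are kept.
// ASSUMPTION: a shallow merge. Nested objects would be replaced, not combined.
function mergeOnTop(defaults = {}, perCall = {}) {
  return { ...defaults, ...perCall }
}

const metadata = mergeOnTop(
  { surface: 'web', appVersion: '1.4.0' }, // defaultMetadata
  { sessionId: 'sess_x', surface: 'kiosk' }, // per-call metadata
)
// metadata is { surface: 'kiosk', appVersion: '1.4.0', sessionId: 'sess_x' }
```

If you rely on nested context objects, it is worth checking the actual behaviour rather than trusting this sketch.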

Resolves to a Call handle:

interface Call {
  readonly state: CallState
  readonly transcript: TranscriptEntry[]
  readonly isMuted: boolean
  end: () => void
  mute: () => void
  unmute: () => void
}

Node consumers get a NodeCall extension with one extra method:

interface NodeCall extends Call {
  sendAudioChunk: (pcm: ArrayBuffer | ArrayBufferView) => boolean
}

Stable types

type CallState =
  | 'idle'
  | 'connecting'
  | 'listening'
  | 'user_speaking'
  | 'agent_speaking'
  | 'ended'
  | 'error'

type CallErrorCode =
  | 'missing_credentials'
  | 'forbidden'
  | 'mic_denied'
  | 'mic_start_failed'
  | 'audio_session_failed'
  | 'token_expired'
  | 'token_invalid'
  | 'unauthorized'
  | 'network_unreachable'
  | 'socket_error'
  | 'payment_required'
  | 'not_found'
  | 'silence_timeout'
  | 'server_error'

type CallEndReason = 'agent_ended' | 'user_hangup' | 'timeout' | 'error'
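Because the codes are stable, an onError handler can branch on them instead of parsing messages. The grouping below is one possible policy, offered as a sketch: the action names and which codes count as retryable are this example's assumptions, not guidance from the SDK.

```javascript
// Hypothetical mapping from CallErrorCode to a UI action. The policy is illustrative.
const ERROR_ACTIONS = new Map([
  ['mic_denied', 'ask_permission'],
  ['mic_start_failed', 'check_device'],
  ['audio_session_failed', 'check_device'],
  ['token_expired', 'retry'],
  ['network_unreachable', 'retry'],
  ['socket_error', 'retry'],
  ['server_error', 'retry'],
  ['silence_timeout', 'restart'],
  ['payment_required', 'contact_support'],
])

// Codes not listed above (and any future ones) fall back to a generic message.
function actionForError(code) {
  return ERROR_ACTIONS.get(code) ?? 'show_generic_error'
}
```

Wired into a call, this would look like `onError: ({ code }) => handle(actionForError(code))`, where `handle` is your own UI function.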

Animating while the agent is talking

Four optional callbacks on startCall give you everything you need to drive an "agent is talking" animation:

| Callback | What it tells you |
| --- | --- |
| onStateChange(state) | state === 'agent_speaking' is true while the agent is in its speaking turn. |
| onAgentTurnStart() | Fires the moment the agent starts a turn: a discrete trigger if you want an "activation" cue. |
| onInterrupt() | The user barged in. End the animation early. |
| onVolume({ input, output }) | Real-time amplitude. output is the agent's playback level (≈ what the listener hears); input is your mic. |

The 'agent_speaking' state reported by onStateChange is protocol-driven: it flips on as soon as the server begins the turn, before the first audio sample reaches the speaker. For a visual that follows what the listener actually hears, drive it from the output value passed to onVolume.

The usual recipe is a stable gate × live amplitude:

let isAgentTurn = false
let amplitude = 0

// `voice` is the factory returned by configureVoiceClient.
const call = await voice.startCall({
  agentId: 'agt_xxx',
  onStateChange: (s) => {
    isAgentTurn = s === 'agent_speaking'
    render()
  },
  onInterrupt: () => {
    isAgentTurn = false
    render()
  },
  onVolume: ({ output }) => {
    amplitude = output
    render()
  },
})

function render() {
  if (isAgentTurn)
    showPulse({ scale: 1 + amplitude }) // your renderer
  else hidePulse()
}

1:1 calls only — in multi-party rooms (joinRoom) the in-room agent is a silent notetaker and has no speaking turn, so none of these signals fire for it.
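Since onVolume arrives at roughly 10 Hz, feeding its raw value into a per-frame animation can look steppy. One common remedy is an exponential moving average, sketched below. `createSmoother` and the default alpha are this example's choices, not part of the SDK.

```javascript
// Exponential moving average; alpha near 1 tracks quickly, near 0 smooths heavily.
function createSmoother(alpha = 0.5) {
  let value = 0
  return function next(sample) {
    // onVolume reports 0-1; clamp defensively before blending.
    const clamped = Math.min(1, Math.max(0, sample))
    value += alpha * (clamped - value)
    return value
  }
}
```

In the recipe above this would replace the direct assignment, for example `const smooth = createSmoother(0.4)` and then `amplitude = smooth(output)` inside onVolume.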

Client tools

You can declare tools that the agent's LLM can call on the consumer's machine. Each tool's handler runs in your app; the server has no access to it. This is useful for surface-only actions (reading DOM state, hitting a private API, mutating local storage, controlling the UI).

import { configureVoiceClient, type ClientToolMap } from '@craftedxp/voice-js'

const tools: ClientToolMap = {
  addTodoItem: {
    description: "Add an item to the user's todo list.",
    parameters: {
      type: 'object',
      properties: { text: { type: 'string' } },
      required: ['text'],
    },
    usage: 'Call when the user asks to add or capture a task.',
    handler: async ({ text }) => {
      await myAppApi.addTodo(String(text))
      return `Added "${text}".`
    },
  },
}

const voice = configureVoiceClient({ apiBase: '...', fetchToken: async () => '...' })
const call = await voice.startCall({ agentId: 'agt_xxx', clientTools: tools })

The SDK validates clientTools at startCall (sync, throws on malformed input), then sends client_tools_register to the server right after connected. When the agent's LLM invokes a registered tool, your handler runs and the SDK posts the result back over the same connection (the WebSocket, or the WebRTC DataChannel as of 0.4.1).

Handler return values are stringified (object → JSON.stringify) before being sent back; throws become { error: ... } frames. The server enforces a per-tool timeout: 10 s by default, up to a maximum of 30 s, set via timeoutMs in your declaration.
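To make those rules concrete, here is a sketch of what they imply for a single tool invocation. It is not the SDK's code: the helper names, the `{ result }` / `{ error }` field names, and the handling of an undefined return value are assumptions. The 10 s and 30 s figures are the ones stated above, and the cap is enforced by the server, not by your app.

```javascript
const DEFAULT_TIMEOUT_MS = 10000
const MAX_TIMEOUT_MS = 30000

// Timeout that would apply to a declaration: default when omitted, capped at the max.
function effectiveTimeout(timeoutMs) {
  if (timeoutMs == null) return DEFAULT_TIMEOUT_MS
  return Math.min(timeoutMs, MAX_TIMEOUT_MS)
}

// Runs a handler and shapes the outcome: strings pass through, other values are
// JSON-stringified, and a throw becomes an error payload.
// ASSUMPTION: the field names and the empty string for undefined are illustrative.
async function runToolHandler(handler, args) {
  try {
    const value = await handler(args)
    const result = typeof value === 'string' ? value : JSON.stringify(value) ?? ''
    return { result }
  } catch (err) {
    return { error: err instanceof Error ? err.message : String(err) }
  }
}
```

The practical takeaway is that a handler should return a short string or a small JSON-serializable object, and should throw with a message the agent can relay.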

For the full wire protocol, sequencing, and constraints see docs/sdks.md → Client tools.

Migrating from `@voxline/web`

- import { VoiceClient } from '@voxline/web'
+ import { configureVoiceClient } from '@craftedxp/voice-js'

- const client = new VoiceClient({
-   apiBase: 'https://api.example.com',
-   agentId: 'agt_xxx',
-   apiKey: 'sk_REDACTED',           // ⚠️ DO NOT in client code
-   variables: { topic: 'billing' },
- })
- client.on('state', (s) => setState(s))
- client.on('transcript', (t) => setTranscript(t))
- client.on('volume', (v) => setVolume(v))
- client.on('error', (e) => setError(e))
- client.on('close', () => setState('ended'))
- await client.connect()
- // …
- client.mute(true)
- client.disconnect()

+ const voice = configureVoiceClient({
+   apiBase: 'https://api.example.com',
+   fetchToken: async ({ agentId }) => {
+     const r = await fetch('/api/voice/mint', {
+       method: 'POST',
+       headers: { 'Content-Type': 'application/json' },
+       body: JSON.stringify({ agentId }),
+     })
+     return (await r.json()).token  // your backend uses sk_ to mint ct_
+   },
+ })
+ const call = await voice.startCall({
+   agentId: 'agt_xxx',
+   context: { topic: 'billing' },                    // was `variables`
+   onStateChange: (s) => setState(s),
+   onTranscript: (t) => setTranscript(t),
+   onVolume: (v) => setVolume(v),
+   onError: (e) => setError(e),
+   onEnd: ({ reason }) => setState('ended'),
+ })
+ // …
+ call.mute()                                         // toggleable: mute/unmute, no boolean
+ call.end()

Three semantic shifts to be aware of:

1. apiKey is gone. The SDK no longer accepts it. Move your token mint to your backend. If you have an sk_ in JS code today, that's a credential leak: rotate it after the migration.
2. variables → context. Same purpose; the new name lines up with voice-rn.
3. mute(true|false) → mute() / unmute(). Symmetric with voice-rn.

The embed widget (<script src="embed.js" data-token="ct_...">) keeps the same HTML API, but the data-api-key attribute is no longer accepted — mint server-side and inject data-token instead.

Troubleshooting

Agent's last syllable cuts off and plays into the next agent message. Almost always a misfiring barge-in (acoustic echo from a laptop speaker → mic, or a false-positive VAD on background noise). Three quick fixes, in order:

1. Test with headphones. This eliminates acoustic echo. If the symptom disappears, it was echo. Tell production users to wear headphones, or fall back to (3).
2. Check the transcript for a phantom user turn between two agent turns that contains words the agent just said. That confirms STT is hearing the agent's voice through the mic.
3. Pass bargeIn: false on startCall for non-conversational flows. This adds ?barge=off to the WS URL, and the SDK ignores interrupt events client-side. Tradeoff: the user can't interrupt mid-sentence.
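The phantom-turn check in the second fix can be automated while debugging. The sketch below flags user entries that sit between two agent entries and mostly repeat the preceding agent entry's words. It assumes transcript entries look like `{ role: 'agent' | 'user', text: string }`; the real TranscriptEntry shape is not shown in this README, so adjust the field names. The 0.8 threshold and the ASCII-only tokenizer are likewise arbitrary choices.

```javascript
// Lowercased word list with punctuation stripped (ASCII letters and digits only).
function words(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? []
}

// Returns indexes of user entries that look like an echo of the agent.
// ASSUMPTION: entries are { role, text } objects; this shape is hypothetical.
function findEchoedUserTurns(entries, threshold = 0.8) {
  const hits = []
  for (let i = 1; i < entries.length - 1; i++) {
    const prev = entries[i - 1]
    const cur = entries[i]
    const next = entries[i + 1]
    if (cur.role !== 'user' || prev.role !== 'agent' || next.role !== 'agent') continue
    const spoken = new Set(words(prev.text))
    const heard = words(cur.text)
    if (heard.length === 0) continue
    // Fraction of the user's words that the agent had just said.
    const overlap = heard.filter((w) => spoken.has(w)).length / heard.length
    if (overlap >= threshold) hits.push(i)
  }
  return hits
}
```

Logging the result from onTranscript during a test call without headphones gives a quick signal of whether echo is the cause.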

For the full diagnostic walkthrough (including the rarer Gemini-Live stale-audio-leak case and audio-handling guidance for Node/Electron consumers), see docs/sdks.md → Audio quality troubleshooting.

Embed widget

For drop-in <script> consumers (landing pages, no-build embeds):

<script
  src="https://your-cdn/embed.js"
  data-token="ct_REDACTED"
  data-agent-id="agt_xxx"
  data-api-base="https://api.your-server.com"
  defer
></script>

Renders a floating call button with a Shadow-DOM transcript panel. Pre-mint the ct_ server-side and inject it into the data-token attribute when you render the page.
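Injecting the token at render time might look like the sketch below on the server. `renderEmbedTag` and `escapeAttr` are hypothetical helpers; the attribute names are the ones shown in the snippet above. Escaping matters because the values end up inside HTML attributes.

```javascript
// Escapes a value for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
}

// Builds the embed tag with a freshly minted ct_ token.
function renderEmbedTag({ src, token, agentId, apiBase }) {
  return [
    '<script',
    `  src="${escapeAttr(src)}"`,
    `  data-token="${escapeAttr(token)}"`,
    `  data-agent-id="${escapeAttr(agentId)}"`,
    `  data-api-base="${escapeAttr(apiBase)}"`,
    '  defer',
    '></script>',
  ].join('\n')
}
```

Mint the token per page render rather than caching the rendered HTML, since ct_ tokens are meant to be short-lived.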

Status

0.5.4 (current) — Screen sharing. setScreenShareEnabled(on, { audio }) publishes a screen_share video track (and, where the browser allows, a screen_share_audio track — Chrome captures tab/system audio; macOS Chrome is tab-audio only; Safari/Firefox don't capture share audio). RoomTrackEvent gains a source field (camera / microphone / screen_share / screen_share_audio / unknown) so a screen share can render as its own tile instead of replacing the participant's camera. Adds isScreenShareEnabled() and getLocalScreenTrack(). No 1:1 call-surface change. Drop-in for 0.5.3 consumers (the new source field is additive).
0.5.3 — session.participantId (this session's own stable p_… id). active.speakers includes the local participant, so a focus-tile UI rendered you as a remote when you spoke (no remote track → blank tile with the raw id). Filter participantId out of active.speakers to focus only remote speakers (self-view when none). No 1:1 call-surface change. Drop-in upgrade for 0.5.2 consumers.
0.5.2 — getRemoteTracks(): RoomTrackEvent[]. Returns remote tracks already subscribed at call time. A late joiner misses the live track.subscribed events for tracks published before it connected (LiveKit delivers them during connect, before consumer listeners attach), so it never rendered participants who already had their camera on (e.g. the host). Call getRemoteTracks() right after registering track.subscribed to backfill them. No 1:1 call-surface change. Drop-in upgrade for 0.5.1 consumers.
0.5.1 — Room video surface. RoomSession gains: track.subscribed / track.unsubscribed events with payload { participantId: string; kind: 'audio' | 'video'; track: RemoteTrack } (raw livekit-client track — call track.attach(el) / track.detach()); active.speakers event (string[] of participantIds currently speaking, drives active-speaker UI); setMicEnabled(on: boolean): Promise<void> / setCameraEnabled(on: boolean): Promise<void> (mid-call toggles); isMicEnabled(): boolean / isCameraEnabled(): boolean (read current state for toggle button UI); getLocalCameraTrack(): LocalVideoTrack | null (attach to self-view element). No API changes to the 1:1 call surface (startCall / Call). Drop-in upgrade for 0.5.0 consumers.
0.5.0 — Multi-party rooms. joinRoom({ roomId, joinCode, name }) returns a typed RoomSession with participant.{joined,left}, transcript.partial, system.message, room.ended events. Adds livekit-client as a direct dependency (~250 KB gzipped, tree-shakes if unused). publishMic() / publishCamera() / leave() methods. See docs/sdks.md → Multi-party rooms.
0.4.2 — WebRTC reliability: onicecandidate + onconnectionstatechange listeners now register before setLocalDescription (where ICE gathering starts). Candidates emitted before callId is known are buffered and flushed once the answer arrives. Browsers buffer candidates internally so the prior listen-after-setRemoteDescription pattern worked in Chrome/Safari, but the explicit ordering is correct per spec and matches the fix shipped in @craftedxp/[email protected]. No API surface change — drop-in upgrade.
0.4.1 — client_tools over the WebRTC DataChannel — tool register on connected, dispatch on client_tool_call frames, parity with the WS transport.
0.4.0 — WebRTC transport support. fetchToken may now return { token, transport, webrtcGatewayBase? }; the SDK dispatches WS vs WebRTC accordingly. Backwards-compatible — bare-string returns still always use WS.
0.3.2 — bug fix: onStateChange now fires for state transitions driven by server frames (connected → listening, agent_turn_start → agent_speaking, etc.). Latent regression since 0.2.0; onTranscript-only consumers were unaffected, but anyone deriving UI from onStateChange should upgrade. No API changes — drop-in.
0.3.1 — adds onInterrupt / onAgentTurnStart callbacks on StartCallOptions and NodeVoiceClientFactory proper return type for the Node entry. Backwards-compatible. Use 0.3.2 instead — both new callbacks depend on the state-callback path that 0.3.2 fixes.
0.3.0 — adds client-tools support. New clientTools option on startCall accepts a ClientToolMap (description, parameters, handler, optional usage/timeoutMs/example). Browser and Node bundles both supported. Backwards-compatible — existing consumers see no change.
0.2.0: first @craftedxp/voice-js release. Browser + Node dual bundle, fetchToken factory, voice-rn 0.3.x parity. Migration path from @voxline/web@0.1.0 documented above.
0.1.0 — @voxline/web. Singleton VoiceClient class, apiKey accepted. Retired in 0.2.0; never published to npm so no deprecation window.

See CONSUMING.md for the full setup walkthrough and DEVELOPING.md for SDK-author iteration.
