@craftedxp/voice-rn

v0.7.0

Published

11 days ago

React Native SDK for embedding a voice agent call in any RN app (Expo or bare). iOS + Android.

seekarun

voice voice-ai agent react-native expo deepgram

React Native SDK for embedding a voice agent call in any RN app. iOS + Android, Expo-compatible, bare RN also works.

Internal testing release. API surface may evolve before a stable release. 0.3.1 fixes two Android crash modes in useAgentPlayback (call-end std::bad_alloc race + long-call OOM via SignalsmithStretch accumulation) — see Changelog below. 0.2.0 was a breaking change from 0.1.0 — see Migrating from 0.1.0.

Install

npm install @craftedxp/voice-rn 'react-native-audio-api@^0.10.0'
# or
yarn add @craftedxp/voice-rn 'react-native-audio-api@^0.10.0'

# Optional — only required if any agent is configured for WebRTC transport
npm install react-native-webrtc

react-native-audio-api is a peer dependency. Pin it to ^0.10.0: the 0.12 release changed the AudioRecorder callback shape, which crashes the SDK's PCM conversion path. The SDK ships native patches (iOS + Android) that apply automatically in a postinstall step; look for a line starting with [voice-rn] patched in your install log.

react-native-webrtc is optional. The SDK only loads it when fetchToken returns { transport: 'webrtc' }. If you only call WS-configured agents, you can skip the install. WebRTC consumers must additionally configure native build steps per the react-native-webrtc install guide (Podfile permissions on iOS, Gradle min-SDK + manifest permissions on Android). Expo: dev-build required — WebRTC does not work in Expo Go.

How the integration fits together

┌─────────────────┐        ┌──────────────────┐        ┌─────────────────┐
│  Your RN app    │        │ Your backend     │        │ Voissia server  │
│                 │        │                  │        │                 │
│  configure      │        │  POST /voice-    │        │  POST /v1/call- │
│  (orgId,        │        │  token           │        │  tokens         │
│   apiBase,      │        │  (your auth)     │        │  (sk_ auth)     │
│   fetchToken)   │        │       │          │        │       │         │
│        │        │        │       │          │        │       │         │
│  fetchToken ────┼───────►│   call Voissia ──┼───────►│   mint ct_      │
│        │        │   (1)  │   with sk_       │  (2)   │   with context  │
│        │        │        │       │          │        │       │         │
│        │◄───────┼────────┼───── ct_ ────────┼────────┼─── ct_          │
│        │        │   (4)  │       │          │  (3)   │                 │
│  <AgentCall …>──┼────────┼──── WSS /v1/agents/.../call?token=ct_ ─────►│
│                 │   (5)  │       │          │        │                 │
└─────────────────┘        └──────────────────┘        └─────────────────┘

The sk_ API key never lives in your app. Your own backend holds it, mints short-lived ct_ tokens, and returns them to the SDK. The SDK only ever sees ct_ tokens.
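The backend half of steps (1) to (4) can be sketched as below. This is a minimal sketch, not Voissia's documented contract: the `HttpPost` type and `mintCallToken` helper are hypothetical names, the bearer-style `Authorization` header is an assumption (the diagram only says "sk_ auth"), and the request body is assumed to mirror the arguments `fetchToken` receives.

```typescript
// Minimal fetch-like signature so the sketch can be exercised without a network.
type HttpPost = (
  url: string,
  init: { method: 'POST'; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>

interface MintRequest {
  agentId: string
  userId?: string
  context?: unknown
  metadata?: Record<string, string>
}

// Runs on YOUR backend. The sk_ key comes from server-side config and is
// never sent to the app; only the mint response goes back.
async function mintCallToken(
  post: HttpPost,
  voissiaBase: string,
  secretKey: string, // sk_...
  req: MintRequest,
): Promise<unknown> {
  const res = await post(`${voissiaBase}/v1/call-tokens`, {
    method: 'POST',
    headers: {
      // ASSUMPTION: bearer auth. Check the server docs for the real scheme.
      Authorization: `Bearer ${secretKey}`,
      'Content-Type': 'application/json',
    },
    // ASSUMPTION: the mint endpoint accepts the same fields fetchToken receives.
    body: JSON.stringify(req),
  })
  if (!res.ok) throw new Error(`call-token mint failed: ${res.status}`)
  // Forward the whole body so transport / webrtcGatewayBase can reach the app.
  return res.json()
}
```

Your `/voice-token` route would authenticate the app's own session first, then call a helper like this and return its result unchanged.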

Quick start

Once at app startup

// e.g. mobile/app/_layout.tsx — runs once when the app boots
import { configureVoiceClient } from '@craftedxp/voice-rn'

configureVoiceClient({
  orgId: 'org_abc123',
  apiBase: 'https://api.your-server.com',
  // Called by the SDK whenever it needs a fresh ct_ token. Hit YOUR own
  // backend; your backend hits Voissia's POST /v1/call-tokens with the sk_
  // and returns the response to the app. Never embed sk_ in the app.
  //
  // Return the rich form so the SDK can dispatch WS vs WebRTC per the
  // agent's configuration. New agents default to WebRTC since 2026-05-16.
  // The bare-string form (just `return body.token`) is still supported —
  // it always uses WS. See the Transport section below.
  fetchToken: async ({ agentId, userId, context, metadata }) => {
    const res = await fetch('https://your-backend.com/voice-token', {
      method: 'POST',
      headers: { Authorization: `Bearer ${await yourSessionToken()}` },
      body: JSON.stringify({ agentId, userId, context, metadata }),
    })
    const body = await res.json()
    return {
      token: body.token,
      transport: body.transport, // 'ws' | 'webrtc'
      webrtcGatewayBase: body.webrtcGatewayBase, // present when transport=webrtc
    }
  },
})

Drop-in component

import { AgentCall } from '@craftedxp/voice-rn'

export default function CallScreen() {
  return (
    <AgentCall
      agentId="agent_support"
      userId={currentUser.id}
      context={{
        topic: 'billing',
        recentOrders: [
          /* ... */
        ],
      }}
      metadata={{ deviceId, tier: 'pro' }}
      bargeIn={true}
      onStateChange={(s) => console.log('state', s)}
      onError={(e) => console.warn('err', e.code, e.message)}
      onEnd={(end) => console.log('ended', end.reason, `${end.durationMs}ms`)}
    />
  )
}

Headless hook (build your own UI)

For phone-call-style UIs, custom theming, or anywhere you need fine-grained control:

import { useEffect } from 'react'
import { useVoiceCall } from '@craftedxp/voice-rn'

function PhoneCallScreen() {
  const {
    state,                            // 'idle' | 'connecting' | 'listening' | 'user_speaking' | 'agent_speaking' | 'ended' | 'error'
    transcript,                       // [{ role, text, ... }]
    isMuted,
    inputVolume, outputVolume,        // 0..1, for VU meters
    start, end,
    mute, unmute,
  } = useVoiceCall({
    agentId: 'agent_support',
    userId: currentUser.id,
    context: { topic: 'billing' },
  })

  // Start on mount, hang up on unmount. The cleanup must not return a promise.
  useEffect(() => {
    void start()
    return () => {
      void end()
    }
  }, [])

  return <YourCallUI {...{ state, isMuted, mute, unmute, inputVolume, outputVolume, end }} />
}

Configuration reference

Build time (`configureVoiceClient`)

| Field | Purpose |
| --- | --- |
| orgId | Pinned at build time; identifies the org all calls run against. |
| apiBase | Voissia server URL. WS scheme is auto-derived (https→wss). |
| fetchToken | Async callback the SDK calls to mint a ct_. Return either a bare ct_ string (always WS, kept for back-compat) or the rich { token, transport, webrtcGatewayBase? } object so the SDK dispatches WS vs WebRTC per the agent's configuration. Called on start() and on token_expired mid-call. |
| defaultContext | Optional. Merged underneath every per-call context. Use for app-level fields the agent should always see (locale, app version). |
| defaultMetadata | Optional. Merged underneath every per-call metadata. Use for telemetry fields you always want on the call.ended webhook (build, device class). |
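Two of the behaviours in this table are easy to picture in code. The sketch below is illustrative only: `mergeUnderneath` and `toWsBase` are hypothetical helpers, not SDK exports. It assumes the default/per-call merge is shallow (the table only says "merged underneath") and that http maps to ws the same way https maps to wss.

```typescript
type Json = Record<string, unknown>

// Per-call keys win; defaults fill in anything the call did not set.
// ASSUMPTION: shallow merge. Nested objects are replaced, not combined.
function mergeUnderneath(defaults: Json | undefined, perCall: Json | undefined): Json {
  return { ...(defaults ?? {}), ...(perCall ?? {}) }
}

// https -> wss is stated in the table; http -> ws is an assumption for local dev.
function toWsBase(apiBase: string): string {
  return apiBase.replace(/^http/, 'ws')
}
```

With defaults of `{ locale: 'en-AU', topic: 'general' }` and a per-call context of `{ topic: 'billing' }`, the agent would see the locale from the defaults and the topic from the call.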

Per-call (`<AgentCall>` props / `useVoiceCall(opts)`)

| Field | Purpose |
| --- | --- |
| agentId | The agent to call. Swap at runtime to talk to a different agent. |
| userId | Optional. Round-tripped to the server as contactId for cross-call memory. |
| context | Optional arbitrary JSON. Server lowers it into the agent's system prompt at session open. Use for per-call data the agent should know. |
| metadata | Optional { string: string }. Round-tripped on the call.ended webhook. NOT shown to the agent. ≤1 KB. |
| bargeIn | Default true. Set false to suppress user-interruption (alarm-style flows). |
| clientTools | Optional ClientToolMap. Declare functions the agent's LLM can invoke mid-call; handlers run on the device. Works for both WS and WebRTC transports. See docs/sdks.md → Client tools for the wire-level protocol. |
| onInterrupt | Optional. Fires when the server signals barge-in (after the SDK's built-in playback flush). Use for your own teardown / haptics. |
| onAgentTurnStart | Optional. Precise anchor for the start of an agent turn; fires once per turn before any agent audio is queued. |
| onVolume | Optional. { input: number, output: number } callback at ~10–25 Hz for VU meters. WS-path only; WebRTC consumers use inputVolume / outputVolume on the controller instead. |
| token | Optional escape hatch. Pre-minted ct_; skips fetchToken for this call. Test-only: tokens expire and the SDK can't re-mint without the callback. |
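Because metadata is capped at 1 KB, it can be worth checking the size before starting a call. The helper below is a hypothetical client-side guard, not part of the SDK. It assumes the limit is 1024 bytes measured on the UTF-8 encoding of the serialised JSON; the server may measure differently.

```typescript
// Counts UTF-8 bytes by hand so the sketch does not depend on TextEncoder.
function utf8Length(s: string): number {
  let bytes = 0
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i)
    if (c < 0x80) bytes += 1
    else if (c < 0x800) bytes += 2
    else if (c >= 0xd800 && c <= 0xdbff) {
      const next = s.charCodeAt(i + 1) // NaN past the end, which fails the range check
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4 // valid surrogate pair encodes as one 4-byte sequence
        i++
      } else {
        bytes += 3 // lone surrogate, counted as a 3-byte replacement character
      }
    } else bytes += 3
  }
  return bytes
}

// ASSUMPTION: "1 KB" means 1024 bytes of serialised JSON.
const METADATA_LIMIT_BYTES = 1024

function metadataFits(metadata: Record<string, string>): boolean {
  return utf8Length(JSON.stringify(metadata)) <= METADATA_LIMIT_BYTES
}
```

A call site could log or trim oversized metadata when `metadataFits` returns false, rather than discovering the problem on the webhook side.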

Animating while the agent is talking

Four optional callbacks on useVoiceCall (and <AgentCall> as props) give you everything you need to drive an "agent is talking" animation — same shape and names as @craftedxp/voice-js:

| Callback | What |
| --- | --- |
| onStateChange(state) | state === 'agent_speaking' is true while the agent is in its speaking turn. |
| onAgentTurnStart() | Fires the moment the agent starts a turn; a discrete trigger if you want an "activation" cue. |
| onInterrupt() | The user barged in. End the animation early. |
| onVolume({ input, output }) | Real-time amplitude. output is the agent's playback level (≈ what the listener hears); input is your mic. |

The 'agent_speaking' state reported by onStateChange is protocol-driven: it flips on as soon as the server begins the turn, before the first audio sample reaches the speaker. For a visual that follows what the listener actually hears, drive it from the output value passed to onVolume.

The usual recipe is a stable gate × live amplitude. Feed the amplitude into an Animated.Value so each volume update changes the visual without re-rendering the component:

import { useRef, useState } from 'react'
import { Animated } from 'react-native'
import { useVoiceCall } from '@craftedxp/voice-rn'

function AgentBlob({ agentId }: { agentId: string }) {
  const amplitude = useRef(new Animated.Value(0)).current
  const [speaking, setSpeaking] = useState(false)

  useVoiceCall({
    agentId,
    onStateChange: (s) => setSpeaking(s === 'agent_speaking'),
    onInterrupt: () => setSpeaking(false),
    onVolume: ({ output }) => amplitude.setValue(output),
  })

  if (!speaking) return null
  return (
    <Animated.View
      style={{
        width: 80,
        height: 80,
        borderRadius: 40,
        backgroundColor: '#3d7dd0',
        transform: [
          { scale: amplitude.interpolate({ inputRange: [0, 1], outputRange: [1, 1.6] }) },
        ],
      }}
    />
  )
}

1:1 calls only — in multi-party rooms (useRoom) the in-room agent is a silent notetaker and has no speaking turn, so none of these signals fire for it.

Transport

The SDK supports both WebSocket and WebRTC transports. The agent's configured transport field (set server-side via the dashboard) decides which one is used for a given call — the consumer side doesn't need to think about it.

To enable WebRTC, your backend must forward the mint response's transport and webrtcGatewayBase fields to the app, and your fetchToken callback returns the rich form:

fetchToken: async (args) => {
  const r = await fetch('https://your-backend.com/voice-token', {
    /* ... */
  })
  const body = await r.json()
  // Rich form — let the SDK dispatch. The bare-string form is still
  // supported and always uses WS.
  return {
    token: body.token,
    transport: body.transport, // 'ws' | 'webrtc'
    webrtcGatewayBase: body.webrtcGatewayBase, // present when transport=webrtc
  }
}

WebRTC also requires the react-native-webrtc peer dependency (see Install). If an agent configured for WebRTC is called without that peer installed, the SDK logs a console.warn and falls back to WebSocket. The call still succeeds because the server's WS path accepts agents configured for either transport.
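The dispatch rules in this section can be summarised as a small pure function. This is a sketch of the described behaviour, not the SDK's internal code: `resolveTransport`, `Dispatch`, and the `webrtcPeerInstalled` flag are hypothetical names, and treating a rich result with no transport field as WS is an assumption.

```typescript
type Transport = 'ws' | 'webrtc'

type FetchTokenResult =
  | string
  | { token: string; transport?: Transport; webrtcGatewayBase?: string }

interface Dispatch {
  token: string
  transport: Transport
  webrtcGatewayBase?: string
  warning?: string
}

function resolveTransport(result: FetchTokenResult, webrtcPeerInstalled: boolean): Dispatch {
  // Bare-string form: always WebSocket.
  if (typeof result === 'string') return { token: result, transport: 'ws' }

  // ASSUMPTION: a rich result without a transport field is treated as WS.
  if (result.transport !== 'webrtc') return { token: result.token, transport: 'ws' }

  // WebRTC requested but the optional peer is missing: fall back with a warning.
  if (!webrtcPeerInstalled) {
    return {
      token: result.token,
      transport: 'ws',
      warning: 'react-native-webrtc is not installed; falling back to WebSocket',
    }
  }

  return {
    token: result.token,
    transport: 'webrtc',
    webrtcGatewayBase: result.webrtcGatewayBase,
  }
}
```

Keeping the decision in one place like this makes the fallback easy to unit-test without a device or a live agent.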

iOS permissions

<key>NSMicrophoneUsageDescription</key>
<string>Used for voice calls with the AI agent.</string>

<!-- For calls that should continue when the screen locks -->
<key>UIBackgroundModes</key>
<array>
  <string>audio</string>
  <string>voip</string>
</array>

Expo:

{
  "expo": {
    "ios": {
      "infoPlist": {
        "NSMicrophoneUsageDescription": "...",
        "UIBackgroundModes": ["audio", "voip"]
      }
    }
  }
}

Errors

onError fires for terminal issues with a stable code from this set:

| code | when |
| --- | --- |
| missing_credentials | start() before configureVoiceClient, or fetchToken returned empty |
| mic_denied | OS-level mic permission denied |
| mic_start_failed | Native mic capture threw |
| audio_session_failed | iOS audio session couldn't be configured |
| token_expired | Server rejected ct_ as expired |
| token_invalid | Server rejected ct_ as invalid (revoked, malformed) |
| network_unreachable | WS open failed before any frame |
| payment_required | Org credit balance insufficient |
| forbidden | Token's agentId doesn't match request |
| not_found | Agent deleted between mint and connect |
| silence_timeout | Server detected long silence; ends call |
| socket_error | Generic WS error |
| server_error | Catch-all for 5xx / 1011 / unexpected close |
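Since the codes are stable, an app can map them to its own recovery behaviour in one place. The sketch below copies the codes from the table; the `Recovery` categories and the mapping itself are hypothetical app-side policy, not something the SDK prescribes.

```typescript
type CallErrorCode =
  | 'missing_credentials'
  | 'mic_denied'
  | 'mic_start_failed'
  | 'audio_session_failed'
  | 'token_expired'
  | 'token_invalid'
  | 'network_unreachable'
  | 'payment_required'
  | 'forbidden'
  | 'not_found'
  | 'silence_timeout'
  | 'socket_error'
  | 'server_error'

// Hypothetical app-level recovery buckets.
type Recovery = 'open_settings' | 'retry_later' | 'sign_in_again' | 'contact_support' | 'none'

function recoveryFor(code: CallErrorCode): Recovery {
  switch (code) {
    case 'mic_denied':
      return 'open_settings' // only the user can grant mic access
    case 'mic_start_failed':
    case 'audio_session_failed':
    case 'network_unreachable':
    case 'socket_error':
    case 'server_error':
      return 'retry_later' // likely transient
    case 'missing_credentials':
    case 'token_expired':
    case 'token_invalid':
      return 'sign_in_again' // the app's own session or mint path needs attention
    case 'payment_required':
    case 'forbidden':
    case 'not_found':
      return 'contact_support' // configuration or account issue
    case 'silence_timeout':
      return 'none' // the call simply ended
  }
}
```

Wired into `onError={(e) => handle(recoveryFor(e.code))}`, the exhaustive switch means a new code added to the union becomes a compile error rather than a silent gap.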

End-of-call event

onEnd(end) fires exactly once when the call ends, regardless of cause:

end: {
  reason: 'agent_ended' | 'user_hangup' | 'timeout' | 'error',
  errorCode?: CallErrorCode,    // present iff reason === 'error'
  durationMs: number,
}

Use this to drive CallKit teardown, billing hooks, or fall-back-to-non-voice flow triggers — rather than watching state === 'ended'.
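As an illustration of that pattern, the handler below turns the end event into two decisions. The `CallEnd` shape follows the type above (with `errorCode` loosened to a string to keep the sketch self-contained); `summariseEnd`, the per-second rounding, and the fallback rule are hypothetical app policy.

```typescript
interface CallEnd {
  reason: 'agent_ended' | 'user_hangup' | 'timeout' | 'error'
  errorCode?: string
  durationMs: number
}

interface EndSummary {
  billableSeconds: number
  fallbackToChat: boolean
}

// Hypothetical policy: bill in whole seconds, and offer a text fallback
// whenever the call did not end by choice of the agent or the user.
function summariseEnd(end: CallEnd): EndSummary {
  return {
    billableSeconds: Math.ceil(end.durationMs / 1000),
    fallbackToChat: end.reason === 'error' || end.reason === 'timeout',
  }
}
```

Because onEnd fires exactly once, a function like this can run directly inside the callback without any de-duplication logic.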

Migrating from 0.1.0

| Was (0.1.0) | Now (0.2.0) |
| --- | --- |
| <AgentCall apiBase={url} token={ct} agentId={id} /> | Call configureVoiceClient({ orgId, apiBase, fetchToken }) once at startup, then <AgentCall agentId={id} />. |
| useVoiceCall({ apiBase, token, agentId, variables }) | useVoiceCall({ agentId, userId, context }). SDK calls your fetchToken to get the token. |
| state === 'ended' to detect end | onEnd(({ reason, durationMs }) => ...) |
| variables: { foo: 'bar' } (flat, primitives only) | context: { foo: { nested: 'objects' }, allowed: true } (arbitrary JSON) |
| Mute via raw useMicCapture | mute() / unmute() on the hook return |
| Volume via custom RMS | inputVolume / outputVolume on the hook return |

Android

Validated end-to-end on Pixel 9 Pro XL (Android 16) over speakerphone + Bluetooth headset. See CONSUMING.md for the full Android setup walkthrough — manifest permissions, runtime grant, emulator network gotchas (adb reverse), and known unknowns across OEMs.

The SDK's bundled patch enables hardware Acoustic Echo Cancellation on Android (VOICE_COMMUNICATION input preset → AEC + NS + AGC), so speakerphone calls work without acoustic feedback loops. Barge-in (interrupting the agent mid-sentence) works by default.

For bare RN projects, declare RECORD_AUDIO, MODIFY_AUDIO_SETTINGS, INTERNET, and FOREGROUND_SERVICE_MEDIA_PLAYBACK in AndroidManifest.xml. Expo managed apps: declare these in app.json's android.permissions.

Multi-party video rooms (since 0.6.0)

useRoom + <RoomVideoView> join a multi-party Voissia room (provisioned server-side with client.rooms.create from @craftedxp/sdk-node) and render remote participants' video. A silent AI notetaker transcribes the call in the background.

import { useRoom, RoomVideoView, requestRoomPermissions } from '@craftedxp/voice-rn'
import { useEffect } from 'react'
import { View, Button } from 'react-native'

function Call({
  apiBase,
  roomId,
  joinCode,
  name,
}: {
  apiBase: string
  roomId: string
  joinCode: string
  name: string
}) {
  useEffect(() => {
    void requestRoomPermissions()
  }, [])
  const room = useRoom({ apiBase, roomId, joinCode, name })

  return (
    <View style={{ flex: 1 }}>
      <RoomVideoView track={room.localVideoTrack} mirror style={{ flex: 1 }} />
      {room.participants.map((p) => (
        <RoomVideoView key={p.participantId} track={p.videoTrack} style={{ height: 160 }} />
      ))}
      <Button title="Leave" onPress={() => void room.leave()} />
    </View>
  )
}

Setup (Expo prebuilt or bare RN)

Install the package: npm install @craftedxp/voice-rn

Add the plugin to app.json:

{ "expo": { "plugins": ["@craftedxp/voice-rn"] } }

Prepare a development build: npx expo prebuild --clean && cd ios && pod install
Rooms are not supported in Expo Go; they need a development build for native media.

Surface

| Symbol | What |
| --- | --- |
| useRoom({ apiBase, roomId, joinCode, name }) | reactive state + controls |
| <RoomVideoView track={…} /> | renders a participant's video |
| requestRoomPermissions() | OS camera + mic prompt |
| RoomParticipant / RoomVideoTrack / RoomConnectionState | exported types |

RoomVideoTrack is opaque — pass it to <RoomVideoView>; there are no methods to call on it.

Permissions

iOS picks up NSCameraUsageDescription / NSMicrophoneUsageDescription from the Voissia config plugin (override with ["@craftedxp/voice-rn", { "cameraPermission": "…" }]). The same plugin adds Android CAMERA + RECORD_AUDIO to the manifest; call requestRoomPermissions() once before joining to prompt the user.

Changelog

0.6.0

Multi-party video rooms. The new useRoom hook and <RoomVideoView> component join a Voissia room, publish mic + camera, render remote participants' video, and expose active-speaker state plus mic/camera toggles. A requestRoomPermissions() helper triggers the OS prompt. The WebRTC native module is now bundled (apps stop installing react-native-webrtc); add the @craftedxp/voice-rn Expo config plugin. Expo Go is not supported (development build required).

0.5.0

Feature parity with @craftedxp/voice-js. The RN SDK now supports:

Client tools — declare functions the agent's LLM can invoke mid-call via the new clientTools prop on <AgentCall> / useVoiceCall(opts). Handlers run on the device; results stream back to the LLM over the existing call transport (WS or WebRTC). See docs/sdks.md → Client tools for the wire protocol and gotchas.
onInterrupt — fires when the server signals barge-in. Built-in playback flush still runs first; the callback is for consumers that need their own teardown.
onAgentTurnStart — precise turn anchor without diffing onStateChange.
onVolume — VU-meter callback. Fires at ~10–25 Hz. WebSocket-path only; WebRTC consumers should keep using the controller's inputVolume / outputVolume (no native RMS hooks for RTP audio yet — captured as a phase-07 followup).
token escape hatch — pass a pre-minted ct_ and skip fetchToken. Test-only — tokens expire and the SDK can't re-mint without the callback.
defaultContext / defaultMetadata on configureVoiceClient — applied to every call; per-call values merge on top.

All additions are opt-in. Existing consumers see no change.

0.4.1

WebRTC reliability fix. The icecandidate and connectionstatechange listeners were registered after setRemoteDescription, so any ICE candidates fired during the offer/answer round-trip were silently dropped on iOS — calls would reach setRemoteDescription then never transition to connected. Now both listeners are registered before setLocalDescription (where gathering actually starts), and candidates emitted before callId is known are buffered and flushed once the answer arrives. WS-path callers see no change. Validated end-to-end on a physical iOS device.
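The buffer-then-flush idea behind this fix can be sketched independently of any WebRTC library. `CandidateBuffer` and its `send` callback are hypothetical names used for illustration; the SDK's actual implementation is not shown in this README.

```typescript
class CandidateBuffer<T> {
  private pending: T[] = []
  private callId: string | null = null
  private readonly send: (callId: string, candidate: T) => void

  constructor(send: (callId: string, candidate: T) => void) {
    this.send = send
  }

  // Call from the icecandidate listener, which is attached before
  // setLocalDescription so no early candidate is missed.
  add(candidate: T): void {
    if (this.callId === null) this.pending.push(candidate)
    else this.send(this.callId, candidate)
  }

  // Call once the answer arrives and the callId is known; flushes in order.
  setCallId(callId: string): void {
    this.callId = callId
    const queued = this.pending
    this.pending = []
    for (const candidate of queued) this.send(callId, candidate)
  }
}
```

Candidates gathered during the offer/answer round-trip are held in arrival order and delivered as soon as the call is addressable; later candidates pass straight through.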

0.4.0

WebRTC transport support. fetchToken may now return either a bare string (existing behaviour — always WS) or the rich { token, transport, webrtcGatewayBase? } object the server's mint response carries. When transport === 'webrtc', the SDK uses react-native-webrtc (optional peer dependency) for the call; everything else stays the same — same VoiceCallController surface, same <AgentCall> props, same event callbacks. If a webrtc-configured agent is reached without the peer installed, the SDK logs a console warning and falls back to WebSocket (server accepts both). Existing consumers see no change.

0.3.1

Fix std::bad_alloc SIGABRT on call end (Android). useAgentPlayback now sets a shutdown flag in its unmount cleanup BEFORE closing the AudioContext, and enqueuePcm bails immediately when that flag is set. Previously, late WebSocket audio chunks landing after ctx.close() would lazily resurrect a fresh AudioContext mid-teardown; the half-initialised context's setBuffer then threw std::bad_alloc across the JSI boundary, escaping JSI's C++ marshaling in New Arch release builds and aborting the process via SIGABRT. getCtx() no longer resurrects a closed context. Belt-and-braces try/catch now also wraps createBuffer/setBuffer/connect so any future native throw drops the chunk instead of killing the process.
Fix long-call OOM (Android). react-native-audio-api's setBuffer allocates a per-source SignalsmithStretch (FFT tables, windowing buffers) regardless of whether time-stretching is used. At ~10 chunks/sec from the server, ~600 sources accumulate per minute — Hermes finalisers don't keep up because the JS heap stays small while the native heap balloons. Two fixes ship: (a) onEnded now eagerly calls disconnect() to release the FFT tables immediately rather than waiting for the JS finalizer; (b) belt-and-braces stale-entry drain plus a 64-source hard cap in enqueuePcm (drop-oldest with explicit stop()/disconnect()) defends against onEnded not firing reliably on every Android source instance. With both, sustained long calls (>5 min) survive without hitting Scudo's internal map failure. The user hears an audible gap if the cap evicts an in-flight source — strictly better than the process aborting.
Reported by an external integrator; tombstones from Pixel 9 Pro XL release builds isolated both call sites. No API surface change — patch-level upgrade is safe for any 0.3.x consumer.
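Both fixes boil down to two guards around the set of live playback sources: refuse new work once teardown starts, and never let the set grow past a hard cap. The sketch below models that with a plain interface; `SourcePool` and `PlaybackSource` are hypothetical stand-ins, not the react-native-audio-api types or the SDK's real code.

```typescript
interface PlaybackSource {
  stop(): void
  disconnect(): void
}

class SourcePool {
  private live: PlaybackSource[] = []
  private shutDown = false
  private readonly cap: number

  constructor(cap = 64) {
    this.cap = cap
  }

  get size(): number {
    return this.live.length
  }

  // Returns false when the chunk is dropped because teardown has begun.
  enqueue(source: PlaybackSource): boolean {
    if (this.shutDown) return false
    // Drop-oldest: evict until there is room for the new source.
    while (this.live.length >= this.cap) {
      const oldest = this.live.shift()
      if (oldest) this.release(oldest)
    }
    this.live.push(source)
    return true
  }

  // Call from the source's onEnded handler so native buffers are freed eagerly.
  ended(source: PlaybackSource): void {
    const i = this.live.indexOf(source)
    if (i >= 0) {
      this.live.splice(i, 1)
      source.disconnect()
    }
  }

  // Set the flag BEFORE closing the audio context, as the 0.3.1 fix describes.
  shutdown(): void {
    this.shutDown = true
    for (const s of this.live.splice(0)) this.release(s)
  }

  private release(s: PlaybackSource): void {
    try {
      s.stop()
      s.disconnect()
    } catch {
      // A throw from native code drops the source instead of crashing the app.
    }
  }
}
```

Evicting an in-flight source produces the audible gap mentioned above, which is the accepted trade-off for a bounded native heap.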

0.3.0

Android validated end-to-end. Token mint, mic capture, agent TTS playback, hardware AEC over speakerphone, Bluetooth routing, barge-in, half-duplex fallback. Tested on Pixel 9 Pro XL / Android 16. Drops the previous "iOS-first" claim.
Native patch (Android): react-native-audio-api's recorder now opens with oboe::InputPreset::VoiceCommunication, mapping to Android's VOICE_COMMUNICATION audio source. This enables hardware AEC + NS + AGC on the platform's HAL — required for speakerphone calls without echo loops. Auto-applies at install.
TTS playback fix: useAgentPlayback no longer pins the AudioContext sample rate (Android's react-native-audio-api 0.10.3 silently fails when context rate ≠ device hardware rate). Manual linear-interpolation upsample 16 kHz → device-rate before queueing the buffer (the library doesn't resample buffer-to-context on Android — chipmunk effect without this fix).
Half-duplex echo gate: new opt-in fallback for hardware that lacks AEC. Set bargeIn: false on <AgentCall> props to mute mic frames while the agent is speaking. Default keeps barge-in open.
iOS path verified — no regressions from the playback / echo-gate refactors.
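The linear-interpolation upsample mentioned in the TTS playback fix can be sketched as follows. This is a generic illustration of the technique, not the SDK's code: `upsampleLinear` is a hypothetical name, and it assumes mono samples already converted to a Float32Array.

```typescript
// Resamples mono audio from fromRate to toRate by linear interpolation
// between neighbouring input samples.
function upsampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (input.length === 0 || fromRate === toRate) return input.slice()

  const outLength = Math.round((input.length * toRate) / fromRate)
  const out = new Float32Array(outLength)
  const step = fromRate / toRate // input samples advanced per output sample
  const last = input.length - 1

  for (let i = 0; i < outLength; i++) {
    const pos = i * step
    const i0 = Math.min(Math.floor(pos), last)
    const i1 = Math.min(i0 + 1, last) // clamp so the tail holds the final sample
    const frac = pos - i0
    out[i] = input[i0] * (1 - frac) + input[i1] * frac
  }
  return out
}
```

For example, a 10 ms chunk of 160 samples at 16 kHz becomes 480 samples at a 48 kHz device rate. Linear interpolation is cheap but not band-limited, so it trades some high-frequency accuracy for simplicity.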
