@kognitivedev/voice

v0.2.29

Published

10 days ago

Browser voice agent layer for Kognitive with OpenAI Realtime and Gemini Live runtimes

Downloads

287


vserifsaglam

kognitive voice realtime openai gemini live webrtc websocket

@kognitivedev/voice

Browser voice agents for Kognitive with built-in direct runtimes plus a backend-executed pipeline mode.

Installation

bun add @kognitivedev/voice @kognitivedev/tools

What It Provides

createVoiceAgent()
createVoiceAgentNetwork()
issueRealtimeClientSecret()
createBrowserVoiceSession()
@kognitivedev/voice/telephony for server-side telephony adapters and session management

Runtime Adapters

@kognitivedev/voice ships with three runtime modes:

openai-realtime via WebRTC
gemini-live via WebSocket
kognitive-voice via the backend pipeline runtime, bootstrapped with /session

createBrowserVoiceSession() selects the adapter automatically from prepare.runtime.provider, so the browser code stays the same after prepare().
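The selection described above amounts to a dispatch on the provider string. The following is a minimal sketch of that idea, not the package's internals: `TransportKind` and `pickTransport` are hypothetical names, and the mapping mirrors only the three modes listed in this section.

```typescript
// Hypothetical illustration; these names are not exported by @kognitivedev/voice.
type RuntimeProvider = "openai-realtime" | "gemini-live" | "kognitive-voice";
type TransportKind = "webrtc" | "websocket" | "backend-pipeline";

// Map each provider named in this README to the connection style it uses.
function pickTransport(provider: RuntimeProvider): TransportKind {
  switch (provider) {
    case "openai-realtime":
      return "webrtc";
    case "gemini-live":
      return "websocket";
    case "kognitive-voice":
      return "backend-pipeline";
  }
}
```

Because the switch is exhaustive over the union, adding a fourth provider to `RuntimeProvider` would produce a compile error until the mapping is extended.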

Runtime Differences

OpenAI Realtime supports short-lived client secrets via issueRealtimeClientSecret().
OpenAI Realtime supports live instruction updates and in-session hot handoff when the next specialist stays on the same provider/runtime.
Gemini Live uses credentialsEndpoint, credentials, or getCredentials() instead of OpenAI client secrets.
Gemini Live currently requires reconnects for instruction changes and cross-agent handoff.
Kognitive pipeline mode keeps the browser API the same, but the live STT/LLM/TTS execution runs in the separate Python apps/voice-runtime service.

Quick Start

OpenAI Realtime

import { createVoiceAgent, createBrowserVoiceSession } from "@kognitivedev/voice";
import { createTool } from "@kognitivedev/tools";
import { z } from "zod";

const weatherTool = createTool({
  id: "weather-lookup",
  description: "Look up weather",
  inputSchema: z.object({ city: z.string() }),
  execute: async ({ city }) => ({ city, temperature: 24 }),
});

const voiceAgent = createVoiceAgent({
  name: "voice-assistant",
  instructions: "Help the user in a concise spoken style.",
  runtime: {
    provider: "openai-realtime",
    model: "gpt-realtime",
    voice: "marin",
  },
  tools: [weatherTool],
});

const prepared = await voiceAgent.prepare({
  resourceId: { userId: "user_1" },
});

const session = createBrowserVoiceSession({
  prepare: prepared,
  clientSecretEndpoint: "/api/kognitive/voice/agents/voice-assistant/client-secret",
  toolEndpoint: "/api/kognitive/voice/agents/voice-assistant/tools/execute",
});

await session.connect();
session.sendText("What is the weather in Paris?");

VoiceSessionState.messages uses KognitiveUIMessage[], so existing @kognitivedev/ui tool registrations can render voice tool calls and results without a second registry.

Context Adapters

Voice agents accept contextAdapters through a contract that is structurally compatible with the one in @kognitivedev/agents. Adapters resolve during prepare() and can append session instructions or add session-scoped tools:

import { createCloudKnowledgeBaseContextAdapter } from "@kognitivedev/cloud-knowledge-base";

const voiceAgent = createVoiceAgent({
  name: "voice-assistant",
  instructions: "Help the user in a concise spoken style.",
  contextAdapters: [
    createCloudKnowledgeBaseContextAdapter({
      pipelineId: "support-docs",
      autoRegisterSearchTool: true,
    }),
  ],
});

Because voice prepare() usually happens before the live conversation starts, context adapters are best for session bootstrap context such as account state, tenant configuration, preloaded knowledge, and tool registration. For turn-by-turn retrieval based on the user's spoken request, expose retrieval as a tool so the realtime model can call it during the session.

Gemini Live

import { createVoiceAgent, createBrowserVoiceSession } from "@kognitivedev/voice";

const voiceAgent = createVoiceAgent({
  name: "voice-assistant",
  instructions: "Help the user in a concise spoken style.",
  runtime: {
    provider: "gemini-live",
    model: "gemini-3.1-flash-live-preview",
    voice: "Aoede",
  },
});

const prepared = await voiceAgent.prepare({
  resourceId: { userId: "user_1" },
});

const session = createBrowserVoiceSession({
  prepare: prepared,
  credentialsEndpoint: "/api/kognitive/voice/agents/voice-assistant/credentials",
  toolEndpoint: "/api/kognitive/voice/agents/voice-assistant/tools/execute",
});

await session.connect();
session.sendText("Summarize what I said in one sentence.");

Credentials

Use clientSecretEndpoint or issueRealtimeClientSecret() for OpenAI Realtime sessions.
Use credentialsEndpoint, credentials, or getCredentials() for Gemini Live sessions.
Use sessionEndpoint, session, or getSession() for backend pipeline sessions.
The mounted Kognitive runtime now exposes /session as the canonical bootstrap endpoint. /client-secret and /credentials remain available for direct runtimes.
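One way to keep these rules in a single place is a small lookup from provider to bootstrap route. This is a sketch under assumptions: `bootstrapEndpoint` is a hypothetical helper, and the default base path is copied from the example URLs in this README rather than from any documented mount point.

```typescript
// Hypothetical helper; not part of @kognitivedev/voice.
type RuntimeProvider = "openai-realtime" | "gemini-live" | "kognitive-voice";

// Route suffix each runtime uses to bootstrap a session, per the list above.
const BOOTSTRAP_SUFFIX: Record<RuntimeProvider, string> = {
  "openai-realtime": "client-secret",
  "gemini-live": "credentials",
  "kognitive-voice": "session",
};

function bootstrapEndpoint(
  agentName: string,
  provider: RuntimeProvider,
  basePath = "/api/kognitive/voice/agents", // assumed from the README examples
): string {
  return `${basePath}/${encodeURIComponent(agentName)}/${BOOTSTRAP_SUFFIX[provider]}`;
}
```

For example, `bootstrapEndpoint("voice-assistant", "gemini-live")` yields the same path used as `credentialsEndpoint` in the Gemini Live quick start.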

Pipeline Mode

Use pipeline mode when you want Kognitive to keep a single public voice API while a backend runtime owns the live STT/LLM/TTS execution.

const voiceAgent = createVoiceAgent({
  name: "voice-pipeline",
  instructions: "Help the user in a concise spoken style.",
  runtime: {
    provider: "kognitive-voice",
    mode: "pipeline",
    sessionEndpoint: "/api/kognitive/voice/agents/voice-pipeline/session",
    pipeline: {
      transport: { type: "websocket" },
      stt: { provider: "deepgram", model: "nova-3", language: "en" },
      llm: { provider: "xai", model: "grok-4.1-fast" },
      tts: { provider: "cartesia", model: "sonic-3", voice: "blake" },
      backgroundAudio: { preset: "none" },
    },
  },
});

const prepared = await voiceAgent.prepare({
  resourceId: { userId: "user_1" },
});

const session = createBrowserVoiceSession({
  prepare: prepared,
  sessionEndpoint: "/api/kognitive/voice/agents/voice-pipeline/session",
});

await session.connect();

Telephony

Use @kognitivedev/voice/telephony when the caller is not a browser microphone. This entry point is Node-only and is intended for PSTN, SIP, or provider WebSocket integrations.

The current implementation ships with:

a provider-neutral telephony session registry
normalized call/session types
a Twilio Media Streams adapter
Twilio request-signature validation
TwiML generation for <Connect><Stream>
Twilio bidirectional media message parsing and serialization
mu-law 8k codec helpers for Twilio audio payloads
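For readers unfamiliar with the mu-law item above, here is a minimal sketch of the standard G.711 mu-law conversion between 8-bit codes and 16-bit linear PCM. The function names are hypothetical and this is not the package's codec; it only illustrates what such helpers typically do with Twilio's 8 kHz audio bytes.

```typescript
// Illustrative G.711 mu-law codec; names are hypothetical, not the package's exports.
const MULAW_BIAS = 0x84; // 132, added before the logarithmic segment search
const MULAW_CLIP = 32635; // largest magnitude that can be encoded

// Decode one mu-law byte into a signed 16-bit linear PCM sample.
function muLawDecode(code: number): number {
  const u = ~code & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS;
  return sign ? -magnitude : magnitude;
}

// Encode one signed 16-bit linear PCM sample into a mu-law byte.
function muLawEncode(sample: number): number {
  const sign = sample < 0 ? 0x80 : 0;
  let magnitude = Math.min(Math.abs(sample), MULAW_CLIP) + MULAW_BIAS;
  // Find the segment (exponent) from the highest set bit.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Decode a whole frame of mu-law bytes into PCM16 samples.
function muLawDecodeFrame(bytes: Uint8Array): Int16Array {
  const out = new Int16Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) out[i] = muLawDecode(bytes[i]);
  return out;
}
```

The encoding is lossy, so a decode followed by an encode returns the same byte, but an arbitrary PCM sample generally does not survive a round trip unchanged.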

Install

@kognitivedev/voice/telephony is exported from the same package:

import {
  createVoiceTelephonyService,
  createTwilioInboundCallResponse,
  handleTwilioMediaStreamMessage,
} from "@kognitivedev/voice/telephony";

Session Model

The telephony layer does not replace createVoiceAgent(). It resolves an existing voice agent, calls prepare(), and stores the resulting VoicePrepareResult alongside provider call metadata.

import { createVoiceTelephonyService } from "@kognitivedev/voice/telephony";

const telephony = createVoiceTelephonyService({
  resolveAgent: async ({ agentName }) => {
    if (agentName === "billing") return billingVoiceAgent;
    return conciergeVoiceAgent;
  },
});

Twilio Inbound Call Example

import { createTwilioInboundCallResponse } from "@kognitivedev/voice/telephony";

const response = await createTwilioInboundCallResponse({
  service: telephony,
  authToken: process.env.TWILIO_AUTH_TOKEN!,
  resourceId: { userId: "user_123" },
  buildStreamUrl: (session) =>
    `wss://example.com/api/twilio/media/${session.sessionId}`,
  customParameters: {
    projectId: "project_1",
  },
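}, {
  // The exact public URL Twilio requested; signature validation is computed against it.
  url: "https://example.com/api/twilio/inbound",
  headers: {
    // `req` stands for your framework's incoming request object.
    "x-twilio-signature": req.headers["x-twilio-signature"] as string,
  },
  // Form parameters from the webhook body (hard-coded here for illustration).
  params: {
    CallSid: "CA123",
    AccountSid: "AC123",
    From: "+15550001111",
    To: "+15550002222",
    Direction: "inbound",
  },
});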

return new Response(response.twiml, {
  headers: { "Content-Type": "text/xml" },
});

The generated TwiML uses <Connect><Stream> and injects sessionId, callId, and agentName as Twilio <Parameter> values so the subsequent WebSocket stream can be correlated back to your Kognitive session.
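To make the shape of that TwiML concrete, the sketch below builds a `<Connect><Stream>` document with `<Parameter>` children. `buildConnectStreamTwiml` and `escapeXml` are hypothetical names, and the exact markup the package emits may differ; only the element names come from Twilio's TwiML vocabulary.

```typescript
// Hypothetical builder; the package's own TwiML output may differ in detail.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a TwiML response that connects the call to a WebSocket media stream
// and passes correlation values as custom <Parameter> elements.
function buildConnectStreamTwiml(
  streamUrl: string,
  parameters: Record<string, string>,
): string {
  const parameterXml = Object.keys(parameters)
    .map((name) => `<Parameter name="${escapeXml(name)}" value="${escapeXml(parameters[name])}" />`)
    .join("");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>` +
    `<Response><Connect><Stream url="${escapeXml(streamUrl)}">` +
    parameterXml +
    `</Stream></Connect></Response>`
  );
}
```

Twilio echoes these parameters back in the stream's `start` message, which is what allows the WebSocket handler to look up the matching session.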

Twilio Media Stream Example

import {
  handleTwilioMediaStreamMessage,
  parseTwilioMediaStreamMessage,
} from "@kognitivedev/voice/telephony";

const parsed = parseTwilioMediaStreamMessage(rawMessage);
const event = handleTwilioMediaStreamMessage(telephony, sessionId, parsed);

if (event.type === "call.audio") {
  // event.audio.payload is base64 mu-law 8k
  // forward it into your STT / realtime / pipeline runtime here
}
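For orientation, Twilio Media Streams delivers JSON text frames whose `event` field is `connected`, `start`, `media`, `stop`, or similar. The sketch below narrows a raw frame into a small union. It is not the package's parser: `sketchParseTwilioFrame` and the `TwilioFrame` type are hypothetical, and only the fields used here are read.

```typescript
// Hypothetical illustration of Twilio Media Streams frame shapes.
type TwilioFrame =
  | { event: "connected" }
  | { event: "start"; streamSid: string; callSid: string; customParameters: Record<string, string> }
  | { event: "media"; streamSid: string; track: string; payload: string }
  | { event: "stop"; streamSid: string }
  | { event: "other"; name: string };

function sketchParseTwilioFrame(raw: string): TwilioFrame {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case "connected":
      return { event: "connected" };
    case "start":
      // Custom <Parameter> values from the TwiML arrive under start.customParameters.
      return {
        event: "start",
        streamSid: msg.start.streamSid,
        callSid: msg.start.callSid,
        customParameters: msg.start.customParameters ?? {},
      };
    case "media":
      // media.payload is base64-encoded mu-law audio at 8 kHz.
      return {
        event: "media",
        streamSid: msg.streamSid,
        track: msg.media.track,
        payload: msg.media.payload,
      };
    case "stop":
      return { event: "stop", streamSid: msg.streamSid };
    default:
      return { event: "other", name: String(msg.event) };
  }
}
```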

Twilio Security

The helper validates the X-Twilio-Signature header against the exact webhook URL and form parameters. For WebSocket requests, it also supports the trailing-slash fallback described in Twilio's signature validation guidance.
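As background, Twilio's signature scheme appends the form parameters, sorted by name, to the request URL and signs the result with HMAC-SHA1 keyed by the auth token. The sketch below follows that scheme using Node's `crypto` module. The function names are hypothetical, and the way the trailing-slash fallback is tried here is an assumption for illustration, not the package's implementation.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the expected signature: URL + sorted (name + value) pairs, HMAC-SHA1, base64.
function computeTwilioSignature(
  authToken: string,
  url: string,
  params: Record<string, string>,
): string {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return createHmac("sha1", authToken).update(data, "utf8").digest("base64");
}

// Constant-time comparison that tolerates differing lengths.
function safeEqual(a: string, b: string): boolean {
  const left = Buffer.from(a);
  const right = Buffer.from(b);
  return left.length === right.length && timingSafeEqual(left, right);
}

// Assumed fallback: try the URL as given, then with the trailing slash toggled.
function sketchValidateTwilioSignature(
  authToken: string,
  signature: string,
  url: string,
  params: Record<string, string>,
): boolean {
  const alternate = url.endsWith("/") ? url.slice(0, -1) : `${url}/`;
  return [url, alternate].some((candidate) =>
    safeEqual(computeTwilioSignature(authToken, candidate, params), signature),
  );
}
```

The URL must be the public one Twilio called, so behind a proxy or load balancer it usually has to be reconstructed from forwarded headers rather than read from the local request.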

What This Layer Does Not Do

The telephony entry point does not yet mount framework routes or run the live STT/LLM/TTS bridge for you. It gives you the provider-neutral session model and the first carrier adapter so your app can wire:

webhook ingress
WebSocket upgrades
audio bridging
call-control decisions
tracing/reporting

Multi-Agent Voice Networks

Use createVoiceAgentNetwork() when one live call should move between specialists instead of forcing a single prompt to do everything.

import {
  createVoiceAgent,
  createVoiceAgentNetwork,
  createBrowserVoiceSession,
} from "@kognitivedev/voice";

const supportAgent = createVoiceAgent({
  name: "support",
  instructions: "Handle general support questions.",
});

const billingAgent = createVoiceAgent({
  name: "billing",
  instructions: "Handle refunds, invoices, and payment failures.",
});

const network = createVoiceAgentNetwork({
  name: "support-network",
  agents: {
    support: supportAgent,
    billing: billingAgent,
  },
  defaultAgent: "support",
  maxHops: 5,
});

const prepared = await network.prepare({
  resourceId: { userId: "user_1" },
});

const session = createBrowserVoiceSession({
  prepare: prepared,
  credentialsEndpoint: "/api/kognitive/voice/agents/support-network/credentials",
  handoffDelayMs: 1500,
  handoff: async ({ handoff, currentPrepare }) => ({
    prepare: await network.handoff({
      targetAgentName: handoff.agent,
      currentAgentName: currentPrepare.network?.activeAgentName,
      reason: handoff.reason,
      resourceId: currentPrepare.resourceId,
      metadata: currentPrepare.metadata,
      hopCount: currentPrepare.network?.hopCount,
      sharedState: currentPrepare.network?.sharedState,
      transferState: handoff.transferState,
    }),
  }),
});

Network Behavior

A networked prepare result includes prepare.network with the active specialist, hop count, shared state, and handoff tool metadata.
The active specialist gets an internal handoff_to_agent tool by default.
OpenAI Realtime sessions can hot-swap specialists in-place when the next specialist uses the same provider, transport, and model.
Gemini Live currently emits a clear handoff failure because in-session specialist replacement is not yet supported by the runtime.
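The hot-swap rule above can be read as a simple predicate over the current and next runtime. This sketch uses hypothetical names (`RuntimeIdentity`, `canHotSwap`) and encodes only the conditions stated in this section; the package's own check may consider more.

```typescript
// Hypothetical illustration of the hot-swap rule; not the package's API.
interface RuntimeIdentity {
  provider: string;
  transport: string;
  model: string;
}

// In-place specialist swaps are described only for OpenAI Realtime, and only
// when provider, transport, and model all match between the two specialists.
function canHotSwap(current: RuntimeIdentity, next: RuntimeIdentity): boolean {
  return (
    current.provider === "openai-realtime" &&
    current.provider === next.provider &&
    current.transport === next.transport &&
    current.model === next.model
  );
}
```

When the predicate is false, the text above implies the session either reconnects or reports a handoff failure, depending on the runtime.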

Exports

createVoiceAgent()
createVoiceAgentNetwork()
VoiceContextAdapter
VoiceContextAdapterResult
createBrowserVoiceSession()
issueRealtimeClientSecret()
@kognitivedev/voice/telephony
OPENAI_REALTIME_RUNTIME
GEMINI_LIVE_RUNTIME
OPENAI_REALTIME_CAPABILITIES
GEMINI_LIVE_CAPABILITIES
resolveVoiceRuntimeAdapter()
resolvePreparedVoiceRuntime()
sanitizePreparedVoiceSession()
toSdkRealtimeSessionConfig()
voice session state reducers and telemetry/reporting helpers
telephony service, Twilio Media Streams helpers, and telephony codec utilities

Browser Handoff Resolution

Browser sessions can resolve handoffs in two ways:

Pass handoff(request) directly and return the next VoicePrepareResult.
Pass handoffEndpoint, or rely on auto-derivation from credentialsEndpoint / clientSecretEndpoint, and have the endpoint return JSON like:

{
  "prepare": {
    "...": "VoicePrepareResult"
  }
}

Optionally set handoffDelayMs to delay the in-session swap. This can be a fixed number or a function that computes the delay from the handoff request/result.

The request body includes:

handoff
resourceId
metadata
network
callId
sessionId
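Since handoffDelayMs accepts either a number or a function, a caller-side resolver could look like the sketch below. `HandoffDelay` and `resolveHandoffDelay` are hypothetical names, the shape of the context object is an assumption, and treating a missing or invalid delay as 0 ms is a choice made for this example rather than documented behavior.

```typescript
// Hypothetical resolver for a "number or function" delay option.
type HandoffDelay<C> = number | ((context: C) => number);

function resolveHandoffDelay<C>(delay: HandoffDelay<C> | undefined, context: C): number {
  if (delay === undefined) return 0; // assumed default: swap immediately
  const ms = typeof delay === "function" ? delay(context) : delay;
  // Guard against NaN, Infinity, and negative values.
  return Number.isFinite(ms) && ms > 0 ? ms : 0;
}
```

A function-valued delay makes it possible to, for instance, pause longer when a handoff reason suggests the outgoing specialist should finish a sentence first.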
