@kognitivedev/voice
v0.2.29
Published
Browser voice agent layer for Kognitive with OpenAI Realtime and Gemini Live runtimes
Browser voice agents for Kognitive with built-in direct runtimes plus a backend-executed pipeline mode.
Installation
bun add @kognitivedev/voice @kognitivedev/tools
What It Provides
- createVoiceAgent()
- createVoiceAgentNetwork()
- issueRealtimeClientSecret()
- createBrowserVoiceSession()
- @kognitivedev/voice/telephony for server-side telephony adapters and session management
Runtime Adapters
@kognitivedev/voice ships with three runtime modes:
- openai-realtime via WebRTC
- gemini-live via WebSocket
- kognitive-voice via the backend pipeline runtime, bootstrapped with /session
createBrowserVoiceSession() selects the adapter automatically from prepare.runtime.provider, so the browser code stays the same after prepare().
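The selection step can be pictured as a lookup keyed on the provider string. The sketch below is illustrative only: `transportFor`, `RuntimeProvider`, and `RuntimeTransport` are hypothetical names that are not part of the package, and the mapping simply restates the runtime list above.

```typescript
// Hypothetical illustration; not the package's internal adapter code.
type RuntimeProvider = "openai-realtime" | "gemini-live" | "kognitive-voice";
type RuntimeTransport = "webrtc" | "websocket" | "backend-pipeline";

const TRANSPORT_BY_PROVIDER: Record<RuntimeProvider, RuntimeTransport> = {
  "openai-realtime": "webrtc",
  "gemini-live": "websocket",
  "kognitive-voice": "backend-pipeline",
};

// Resolve the transport for a provider string, failing loudly on unknown values.
function transportFor(provider: string): RuntimeTransport {
  if (!Object.prototype.hasOwnProperty.call(TRANSPORT_BY_PROVIDER, provider)) {
    throw new Error(`Unsupported voice runtime provider: ${provider}`);
  }
  return TRANSPORT_BY_PROVIDER[provider as RuntimeProvider];
}
```

Keeping the mapping in one table means the calling code never branches on the provider itself, which matches the claim that browser code stays the same after prepare().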
Runtime Differences
- OpenAI Realtime supports short-lived client secrets via issueRealtimeClientSecret().
- OpenAI Realtime supports live instruction updates and in-session hot handoff when the next specialist stays on the same provider/runtime.
- Gemini Live uses credentialsEndpoint, credentials, or getCredentials() instead of OpenAI client secrets.
- Gemini Live currently requires reconnects for instruction changes and cross-agent handoff.
- Kognitive pipeline mode keeps the browser API the same, but the live STT/LLM/TTS execution runs in the separate Python apps/voice-runtime service.
Quick Start
OpenAI Realtime
import { createVoiceAgent, createBrowserVoiceSession } from "@kognitivedev/voice";
import { createTool } from "@kognitivedev/tools";
import { z } from "zod";

// Tool the model can call during the conversation; the result here is a fixed stub.
const weatherTool = createTool({
  id: "weather-lookup",
  description: "Look up weather",
  inputSchema: z.object({ city: z.string() }),
  execute: async ({ city }) => ({ city, temperature: 24 }),
});

const voiceAgent = createVoiceAgent({
  name: "voice-assistant",
  instructions: "Help the user in a concise spoken style.",
  runtime: {
    provider: "openai-realtime",
    model: "gpt-realtime",
    voice: "marin",
  },
  tools: [weatherTool],
});

// Resolve the session configuration before connecting.
const prepared = await voiceAgent.prepare({
  resourceId: { userId: "user_1" },
});

const session = createBrowserVoiceSession({
  prepare: prepared,
  clientSecretEndpoint: "/api/kognitive/voice/agents/voice-assistant/client-secret",
  toolEndpoint: "/api/kognitive/voice/agents/voice-assistant/tools/execute",
});

await session.connect();
session.sendText("What is the weather in Paris?");

VoiceSessionState.messages uses KognitiveUIMessage[], so existing @kognitivedev/ui tool registrations can render voice tool calls and results without a second registry.
Context Adapters
Voice agents accept a contextAdapters contract that is structurally compatible with the one in @kognitivedev/agents. Adapters resolve during prepare() and can append session instructions or add session-scoped tools:
import { createCloudKnowledgeBaseContextAdapter } from "@kognitivedev/cloud-knowledge-base";

const voiceAgent = createVoiceAgent({
  name: "voice-assistant",
  instructions: "Help the user in a concise spoken style.",
  contextAdapters: [
    createCloudKnowledgeBaseContextAdapter({
      pipelineId: "support-docs",
      autoRegisterSearchTool: true,
    }),
  ],
});

Because voice prepare() usually happens before the live conversation starts, context adapters are best for session bootstrap context such as account state, tenant configuration, preloaded knowledge, and tool registration. For turn-by-turn retrieval based on the user's spoken request, expose retrieval as a tool so the realtime model can call it during the session.
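To make "append session instructions or add session-scoped tools" concrete, here is a minimal sketch of how adapter output could be folded into a session configuration at prepare time. The shapes `SketchAdapterResult` and `SketchSessionConfig` and the function `applyAdapterResults` are hypothetical; the real VoiceContextAdapterResult type is not shown in this README and may differ.

```typescript
// Hypothetical shapes for illustration; the real VoiceContextAdapterResult may differ.
interface SketchAdapterResult {
  instructions?: string;
  tools?: string[]; // tool ids only, to keep the sketch small
}

interface SketchSessionConfig {
  instructions: string;
  tools: string[];
}

// Fold adapter results into the base config: instructions are appended,
// tools are added once each.
function applyAdapterResults(
  base: SketchSessionConfig,
  results: SketchAdapterResult[],
): SketchSessionConfig {
  const instructions = [base.instructions];
  const tools = base.tools.slice();
  for (const result of results) {
    if (result.instructions) instructions.push(result.instructions);
    for (const tool of result.tools ?? []) {
      if (tools.indexOf(tool) === -1) tools.push(tool); // skip duplicate ids
    }
  }
  return { instructions: instructions.join("\n\n"), tools };
}
```

Because this merge runs once, before the call starts, anything that depends on what the user says later has to arrive through a tool call instead.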
Gemini Live
import { createVoiceAgent, createBrowserVoiceSession } from "@kognitivedev/voice";

const voiceAgent = createVoiceAgent({
  name: "voice-assistant",
  instructions: "Help the user in a concise spoken style.",
  runtime: {
    provider: "gemini-live",
    model: "gemini-3.1-flash-live-preview",
    voice: "Aoede",
  },
});

const prepared = await voiceAgent.prepare({
  resourceId: { userId: "user_1" },
});

const session = createBrowserVoiceSession({
  prepare: prepared,
  credentialsEndpoint: "/api/kognitive/voice/agents/voice-assistant/credentials",
  toolEndpoint: "/api/kognitive/voice/agents/voice-assistant/tools/execute",
});

await session.connect();
session.sendText("Summarize what I said in one sentence.");
Credentials
- Use clientSecretEndpoint or issueRealtimeClientSecret() for OpenAI Realtime sessions.
- Use credentialsEndpoint, credentials, or getCredentials() for Gemini Live sessions.
- Use sessionEndpoint, session, or getSession() for backend pipeline sessions.
- The mounted Kognitive runtime now exposes /session as the canonical bootstrap endpoint. /client-secret and /credentials remain available for direct runtimes.
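The endpoint strings used throughout this README follow one pattern, so a small helper can derive the bootstrap URL from the agent name and provider. This is a sketch under that assumption: `bootstrapEndpoint` and `BootstrapProvider` are hypothetical names, and the route prefix is copied from the examples here rather than taken from any documented constant.

```typescript
// Hypothetical helper; the route layout mirrors the endpoint strings in this README.
type BootstrapProvider = "openai-realtime" | "gemini-live" | "kognitive-voice";

const BOOTSTRAP_SUFFIX: Record<BootstrapProvider, string> = {
  "openai-realtime": "client-secret",
  "gemini-live": "credentials",
  "kognitive-voice": "session",
};

// Build the per-agent bootstrap URL for the given provider.
function bootstrapEndpoint(
  agentName: string,
  provider: BootstrapProvider,
  base: string = "/api/kognitive/voice/agents",
): string {
  return `${base}/${encodeURIComponent(agentName)}/${BOOTSTRAP_SUFFIX[provider]}`;
}
```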
Pipeline Mode
Use pipeline mode when you want Kognitive to keep a single public voice API while a backend runtime owns the live STT/LLM/TTS execution.
const voiceAgent = createVoiceAgent({
  name: "voice-pipeline",
  instructions: "Help the user in a concise spoken style.",
  runtime: {
    provider: "kognitive-voice",
    mode: "pipeline",
    sessionEndpoint: "/api/kognitive/voice/agents/voice-pipeline/session",
    pipeline: {
      transport: { type: "websocket" },
      stt: { provider: "deepgram", model: "nova-3", language: "en" },
      llm: { provider: "xai", model: "grok-4.1-fast" },
      tts: { provider: "cartesia", model: "sonic-3", voice: "blake" },
      backgroundAudio: { preset: "none" },
    },
  },
});

const prepared = await voiceAgent.prepare({
  resourceId: { userId: "user_1" },
});

const session = createBrowserVoiceSession({
  prepare: prepared,
  sessionEndpoint: "/api/kognitive/voice/agents/voice-pipeline/session",
});

await session.connect();
Telephony
Use @kognitivedev/voice/telephony when the caller is not a browser microphone. This entry point is Node-only and is intended for PSTN, SIP, or provider WebSocket integrations.
The current implementation ships with:
- a provider-neutral telephony session registry
- normalized call/session types
- a Twilio Media Streams adapter
- Twilio request-signature validation
- TwiML generation for <Connect><Stream>
- Twilio bidirectional media message parsing and serialization
- mu-law 8k codec helpers for Twilio audio payloads
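For readers unfamiliar with the audio format, the sketch below shows the standard G.711 mu-law expansion from 8-bit samples to 16-bit linear PCM. It is not the package's codec helper, whose names and signatures are not shown in this README; `muLawToPcm16` and `decodeMuLaw` are hypothetical names.

```typescript
// Standard G.711 mu-law expansion of one 8-bit sample to 16-bit linear PCM.
// Hypothetical helper names; not the package's own codec API.
function muLawToPcm16(muLawByte: number): number {
  const u = ~muLawByte & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole payload (already base64-decoded) into PCM samples.
function decodeMuLaw(payload: Uint8Array): Int16Array {
  const pcm = new Int16Array(payload.length);
  for (let i = 0; i < payload.length; i++) {
    pcm[i] = muLawToPcm16(payload[i]);
  }
  return pcm;
}
```

Twilio media payloads arrive base64-encoded, so a caller would decode the base64 string to bytes first and then pass those bytes to a function like `decodeMuLaw`.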
Install
@kognitivedev/voice/telephony is exported from the same package:
import {
  createVoiceTelephonyService,
  createTwilioInboundCallResponse,
  handleTwilioMediaStreamMessage,
} from "@kognitivedev/voice/telephony";
Session Model
The telephony layer does not replace createVoiceAgent(). It resolves an existing voice agent, calls prepare(), and stores the resulting VoicePrepareResult alongside provider call metadata.
import { createVoiceTelephonyService } from "@kognitivedev/voice/telephony";

const telephony = createVoiceTelephonyService({
  resolveAgent: async ({ agentName }) => {
    if (agentName === "billing") return billingVoiceAgent;
    return conciergeVoiceAgent;
  },
});
Twilio Inbound Call Example
import { createTwilioInboundCallResponse } from "@kognitivedev/voice/telephony";

// The code below runs inside your inbound webhook handler, where `req` is the incoming request.
const response = await createTwilioInboundCallResponse({
  service: telephony,
  authToken: process.env.TWILIO_AUTH_TOKEN!,
  resourceId: { userId: "user_123" },
  buildStreamUrl: (session) =>
    `wss://example.com/api/twilio/media/${session.sessionId}`,
  customParameters: {
    projectId: "project_1",
  },
}, {
  url: "https://example.com/api/twilio/inbound",
  headers: {
    "x-twilio-signature": req.headers["x-twilio-signature"] as string,
  },
  // Form parameters posted by Twilio (hard-coded here for illustration).
  params: {
    CallSid: "CA123",
    AccountSid: "AC123",
    From: "+15550001111",
    To: "+15550002222",
    Direction: "inbound",
  },
});

return new Response(response.twiml, {
  headers: { "Content-Type": "text/xml" },
});

The generated TwiML uses <Connect><Stream> and injects sessionId, callId, and agentName as Twilio <Parameter> values so the subsequent WebSocket stream can be correlated back to your Kognitive session.
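As a rough picture of what such a document looks like, the sketch below builds a <Connect><Stream> response with <Parameter> children by hand. It is not the package's generator: `buildStreamTwiml` and `escapeXml` are hypothetical names, and the exact output of createTwilioInboundCallResponse() may differ in whitespace or attribute order.

```typescript
// Escape the five XML special characters so values are safe inside attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical builder: wraps a stream URL and custom parameters in TwiML.
function buildStreamTwiml(streamUrl: string, parameters: Record<string, string>): string {
  const parameterTags = Object.keys(parameters)
    .map((name) => `<Parameter name="${escapeXml(name)}" value="${escapeXml(parameters[name])}"/>`)
    .join("");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>` +
    `<Response><Connect><Stream url="${escapeXml(streamUrl)}">` +
    parameterTags +
    `</Stream></Connect></Response>`
  );
}
```

Twilio echoes <Parameter> values back in the stream's start message, which is what lets the WebSocket handler look up the matching session.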
Twilio Media Stream Example
import {
  handleTwilioMediaStreamMessage,
  parseTwilioMediaStreamMessage,
} from "@kognitivedev/voice/telephony";

const parsed = parseTwilioMediaStreamMessage(rawMessage);
const event = handleTwilioMediaStreamMessage(telephony, sessionId, parsed);

if (event.type === "call.audio") {
  // event.audio.payload is base64 mu-law 8k
  // forward it into your STT / realtime / pipeline runtime here
}
Twilio Security
The helper validates the X-Twilio-Signature header against the exact webhook URL and form parameters. For Twilio WebSocket validation, the helper also supports the documented trailing-slash fallback used by Twilio's signature validation guidance.
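For context, Twilio's webhook signature is an HMAC-SHA1 over the request URL followed by each form parameter name and value, sorted by name, and base64-encoded with the account auth token as the key. The sketch below implements that scheme with Node's crypto module. `computeTwilioSignature` and `isValidTwilioSignature` are hypothetical names rather than the package's API, and the trailing-slash fallback is not reproduced here.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// HMAC-SHA1 over url + sorted(key + value), base64-encoded (Twilio's webhook scheme).
function computeTwilioSignature(
  authToken: string,
  url: string,
  params: Record<string, string>,
): string {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return createHmac("sha1", authToken).update(data, "utf8").digest("base64");
}

// Compare in constant time; lengths must match before timingSafeEqual is called.
function isValidTwilioSignature(
  authToken: string,
  url: string,
  params: Record<string, string>,
  signature: string,
): boolean {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const actual = Buffer.from(signature);
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

The URL must match what Twilio requested byte for byte, including scheme and query string, which is why proxies that rewrite the host or protocol are a common cause of validation failures.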
What This Layer Does Not Do
The telephony entry point does not yet mount framework routes or run the live STT-LLM-TTS bridge for you. It gives you the provider-neutral session model and the first carrier adapter so your app can wire:
- webhook ingress
- WebSocket upgrades
- audio bridging
- call-control decisions
- tracing/reporting
Multi-Agent Voice Networks
Use createVoiceAgentNetwork() when one live call should move between specialists instead of forcing a single prompt to do everything.
import {
  createVoiceAgent,
  createVoiceAgentNetwork,
  createBrowserVoiceSession,
} from "@kognitivedev/voice";

const supportAgent = createVoiceAgent({
  name: "support",
  instructions: "Handle general support questions.",
});

const billingAgent = createVoiceAgent({
  name: "billing",
  instructions: "Handle refunds, invoices, and payment failures.",
});

const network = createVoiceAgentNetwork({
  name: "support-network",
  agents: {
    support: supportAgent,
    billing: billingAgent,
  },
  defaultAgent: "support",
  maxHops: 5,
});

const prepared = await network.prepare({
  resourceId: { userId: "user_1" },
});

const session = createBrowserVoiceSession({
  prepare: prepared,
  credentialsEndpoint: "/api/kognitive/voice/agents/support-network/credentials",
  handoffDelayMs: 1500,
  handoff: async ({ handoff, currentPrepare }) => ({
    prepare: await network.handoff({
      targetAgentName: handoff.agent,
      currentAgentName: currentPrepare.network?.activeAgentName,
      reason: handoff.reason,
      resourceId: currentPrepare.resourceId,
      metadata: currentPrepare.metadata,
      hopCount: currentPrepare.network?.hopCount,
      sharedState: currentPrepare.network?.sharedState,
      transferState: handoff.transferState,
    }),
  }),
});
Network Behavior
- A networked prepare result includes prepare.network with the active specialist, hop count, shared state, and handoff tool metadata.
- The active specialist gets an internal handoff_to_agent tool by default.
- OpenAI Realtime sessions can hot-swap specialists in-place when the next specialist uses the same provider, transport, and model.
- Gemini Live currently emits a clear handoff failure because in-session specialist replacement is not yet supported by the runtime.
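The hot-swap rule above can be read as a simple predicate over the current and next runtime. The sketch below restates it in code; `SketchRuntime` and `canHotSwap` are hypothetical names, and the package's real check may consider more fields than the three listed here.

```typescript
// Hypothetical runtime descriptor; field names are assumptions for this sketch.
interface SketchRuntime {
  provider: string;
  transport: string;
  model: string;
}

// In-place swap is possible only on OpenAI Realtime, and only when
// provider, transport, and model all stay the same.
function canHotSwap(current: SketchRuntime, next: SketchRuntime): boolean {
  const sameRuntime =
    current.provider === next.provider &&
    current.transport === next.transport &&
    current.model === next.model;
  return sameRuntime && current.provider === "openai-realtime";
}
```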
Exports
- createVoiceAgent()
- createVoiceAgentNetwork()
- VoiceContextAdapter
- VoiceContextAdapterResult
- createBrowserVoiceSession()
- issueRealtimeClientSecret()
- @kognitivedev/voice/telephony
- OPENAI_REALTIME_RUNTIME
- GEMINI_LIVE_RUNTIME
- OPENAI_REALTIME_CAPABILITIES
- GEMINI_LIVE_CAPABILITIES
- resolveVoiceRuntimeAdapter()
- resolvePreparedVoiceRuntime()
- sanitizePreparedVoiceSession()
- toSdkRealtimeSessionConfig()
- voice session state reducers and telemetry/reporting helpers
- telephony service, Twilio Media Streams helpers, and telephony codec utilities
Browser Handoff Resolution
Browser sessions can resolve handoffs in two ways:
- Pass handoff(request) directly and return the next VoicePrepareResult.
- Pass handoffEndpoint, or rely on auto-derivation from credentialsEndpoint/clientSecretEndpoint, and return JSON like:
{
  "prepare": {
    "...": "VoicePrepareResult"
  }
}
The request body includes:
- handoff
- resourceId
- metadata
- network
- callId
- sessionId
Optionally set handoffDelayMs to delay the in-session swap. This can be a fixed number or a function that computes the delay from the handoff request/result.
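An option like handoffDelayMs that accepts either a number or a function is usually normalized to a single number before use. The sketch below shows one way to do that; `DelayOption` and `resolveDelayMs` are hypothetical names, and the context argument stands in for whatever handoff request/result object the session actually passes.

```typescript
// Either a fixed delay, a function of some context, or nothing at all.
type DelayOption<T> = number | ((context: T) => number) | undefined;

// Normalize the option to a non-negative number of milliseconds.
function resolveDelayMs<T>(option: DelayOption<T>, context: T): number {
  const raw = typeof option === "function" ? option(context) : (option ?? 0);
  // Treat missing, negative, or non-finite values as "no delay".
  return Number.isFinite(raw) && raw > 0 ? raw : 0;
}
```

A function-valued delay lets a caller wait longer for some transfers than for others, for example based on the handoff reason.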
