# @cogitator-ai/voice
Voice and Realtime agent capabilities for Cogitator.
Two modes: Pipeline (STT -> Agent -> TTS), which works with any LLM, and Realtime (native speech-to-speech) for OpenAI and Gemini.
## Installation
```bash
pnpm add @cogitator-ai/voice

# Required for OpenAI STT/TTS
pnpm add openai

# Optional dependencies
pnpm add onnxruntime-node # Silero VAD (neural network-based)
```

## Features
- Pipeline Mode — STT -> Agent -> TTS, works with any Cogitator agent and LLM backend
- Realtime Mode — Native speech-to-speech via OpenAI Realtime API or Gemini Live API
- 2 STT Providers — OpenAI (gpt-4o-mini-transcribe) and Deepgram (nova-3, real-time streaming)
- 2 TTS Providers — OpenAI (gpt-4o-mini-tts) and ElevenLabs (eleven_flash_v2_5, ~75ms latency)
- 2 VAD Providers — Energy-based (zero deps) and Silero (ONNX neural network)
- WebSocket Transport — Built-in server for browser/mobile clients
- Agent Tools — Drop-in `transcribe_audio` and `speak_text` tools for any Cogitator agent
- Interruption Handling — Barge-in support in both pipeline and realtime modes
- Audio Utilities — PCM/WAV conversion, resampling, RMS calculation
## Quick Start
```ts
import { VoiceAgent, OpenAISTT, OpenAITTS } from '@cogitator-ai/voice';
import { Agent } from '@cogitator-ai/core';
const agent = new Agent({ instructions: 'You are a helpful assistant' });
const voiceAgent = new VoiceAgent({
agent,
mode: 'pipeline',
stt: new OpenAISTT({ apiKey: process.env.OPENAI_API_KEY! }),
tts: new OpenAITTS({ apiKey: process.env.OPENAI_API_KEY! }),
});
await voiceAgent.listen(8080);
```

Connect from any WebSocket client at `ws://localhost:8080/voice` — send binary audio frames, receive binary audio plus JSON control messages.
## STT Providers
| Provider | Default Model | Streaming | Word Timestamps | Notes |
| ------------- | ------------------------ | ------------------- | --------------- | ---------------------------------------------- |
| OpenAISTT | gpt-4o-mini-transcribe | Buffered | Yes | Also supports gpt-4o-transcribe, whisper-1 |
| DeepgramSTT | nova-3 | Real-time WebSocket | Yes | Interim results, endpointing, auto-punctuation |
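Both providers sit behind the same `STTProvider` abstraction that `VoicePipeline` and `VoiceAgent` accept, so one-shot code can stay provider-agnostic. A minimal sketch, assuming the `STTProvider` type named in the Configuration Reference is exported and that both providers expose the buffered `transcribe` method shown below for OpenAISTT:

```ts
import { OpenAISTT, DeepgramSTT, type STTProvider } from '@cogitator-ai/voice';

// Provider-agnostic helper: works with any STTProvider implementation.
async function transcribeClip(stt: STTProvider, audio: Buffer): Promise<string> {
  const result = await stt.transcribe(audio, { language: 'en' });
  return result.text;
}

// Pick whichever provider has a configured key.
const stt: STTProvider = process.env.DEEPGRAM_API_KEY
  ? new DeepgramSTT({ apiKey: process.env.DEEPGRAM_API_KEY })
  : new OpenAISTT({ apiKey: process.env.OPENAI_API_KEY! });
```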
### OpenAI STT
```ts
import { OpenAISTT } from '@cogitator-ai/voice';
import { readFileSync } from 'node:fs';
const stt = new OpenAISTT({
apiKey: process.env.OPENAI_API_KEY!,
model: 'gpt-4o-mini-transcribe',
});
// e.g. a mono WAV clip read from disk (the path is illustrative)
const audioBuffer = readFileSync('speech.wav');
const result = await stt.transcribe(audioBuffer, { language: 'en' });
console.log(result.text);     // full transcript
console.log(result.words);    // word-level timestamps
console.log(result.duration); // audio duration
```

### Deepgram STT
Real-time streaming with interim results:
```ts
import { DeepgramSTT } from '@cogitator-ai/voice';
const stt = new DeepgramSTT({
apiKey: process.env.DEEPGRAM_API_KEY!,
model: 'nova-3',
language: 'en',
});
const stream = stt.createStream({ interimResults: true, endpointing: 500 });
stream.on('partial', (text) => {
console.log('partial:', text);
});
stream.on('final', (result) => {
console.log('final:', result.text);
});
// feed PCM16 audio as it arrives
stream.write(audioChunk1);
stream.write(audioChunk2);
await stream.close();
```
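For a quick local test, a raw PCM16 file can be pushed through the live stream in chunks, reusing the `stream` from above; a minimal sketch, assuming a 16 kHz mono PCM16 recording at the hypothetical path `./speech.raw` (3200 bytes is roughly 100 ms at that rate):

```ts
import { createReadStream } from 'node:fs';

// Feed the recording through the Deepgram stream in ~100 ms chunks.
for await (const chunk of createReadStream('./speech.raw', { highWaterMark: 3200 })) {
  stream.write(chunk as Buffer);
}
await stream.close();
```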
## TTS Providers

| Provider | Default Model | Streaming | Voices | Notes |
| --------------- | ------------------- | --------- | -------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| OpenAITTS | gpt-4o-mini-tts | Yes | alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse, marin, cedar | Also supports tts-1, tts-1-hd. Supports instructions for voice style control |
| ElevenLabsTTS | eleven_flash_v2_5 | Yes | By voice ID | ~75ms latency. Also supports eleven_turbo_v2_5, eleven_multilingual_v2, eleven_v3 |
### OpenAI TTS
```ts
import { OpenAITTS } from '@cogitator-ai/voice';
const tts = new OpenAITTS({
apiKey: process.env.OPENAI_API_KEY!,
model: 'gpt-4o-mini-tts',
voice: 'coral',
});
const audio = await tts.synthesize('Hello, world!', {
speed: 1.0,
format: 'mp3',
instructions: 'Speak in a warm, friendly tone',
});
for await (const chunk of tts.streamSynthesize('Streaming response...')) {
process.stdout.write('.');
}
```
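Continuing from the snippet above, the output can be written to disk; a minimal sketch, assuming `synthesize` resolves to binary data (a `Buffer` or `Uint8Array`) and that streamed chunks are binary as well:

```ts
import { writeFileSync } from 'node:fs';

// One-shot: persist the full mp3 response.
writeFileSync('hello.mp3', audio);

// Streaming: collect the chunks, then persist.
const chunks: Buffer[] = [];
for await (const chunk of tts.streamSynthesize('Streaming response...')) {
  chunks.push(Buffer.from(chunk));
}
writeFileSync('streamed.mp3', Buffer.concat(chunks));
```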
### ElevenLabs TTS

```ts
import { ElevenLabsTTS } from '@cogitator-ai/voice';
const tts = new ElevenLabsTTS({
apiKey: process.env.ELEVENLABS_API_KEY!,
model: 'eleven_flash_v2_5',
voiceId: '21m00Tcm4TlvDq8ikWAM',
});
const audio = await tts.synthesize('Hello!', { format: 'mp3' });
for await (const chunk of tts.streamSynthesize('Streaming...')) {
process.stdout.write('.');
}
```

## VAD Providers
| Provider | Accuracy | Dependencies | Speed | Notes |
| ----------- | -------- | ------------------ | ---------- | ------------------------------------------------- |
| EnergyVAD | Basic | None | Fast | RMS energy threshold, good for quiet environments |
| SileroVAD | High | onnxruntime-node | ~3ms/frame | Neural network, works in noisy environments |
### Energy VAD
Zero-dependency voice activity detection based on audio energy levels:
```ts
import { EnergyVAD } from '@cogitator-ai/voice';
const vad = new EnergyVAD({
threshold: 0.01,
silenceDuration: 500,
sampleRate: 16000,
});
// analyze one frame of Float32 PCM samples
const event = vad.process(float32Samples);
switch (event.type) {
case 'speech_start':
console.log('User started speaking');
break;
case 'speech_end':
console.log(`Speech ended after ${event.duration}ms`);
break;
case 'speech':
console.log(`Speech probability: ${event.probability}`);
break;
case 'silence':
break;
}
```
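In practice the detector is fed fixed-size frames in a loop, and `silenceDuration` decides when a pause becomes a `speech_end`. A minimal sketch, assuming `recording` is a `Float32Array` of 16 kHz samples; the 30 ms frame size is an illustrative choice, not a package requirement:

```ts
const FRAME = 480; // 30 ms at 16 kHz

for (let i = 0; i + FRAME <= recording.length; i += FRAME) {
  const event = vad.process(recording.subarray(i, i + FRAME));
  if (event.type === 'speech_start') console.log('turn started');
  if (event.type === 'speech_end') console.log(`turn ended after ${event.duration}ms`);
}
```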
### Silero VAD

Neural network-based VAD using the Silero ONNX model:
```ts
import { SileroVAD } from '@cogitator-ai/voice';
const vad = new SileroVAD({
modelPath: './silero_vad.onnx',
threshold: 0.5,
silenceDuration: 500,
sampleRate: 16000,
});
await vad.init();
const event = await vad.process(float32Samples);
```

## Pipeline Mode
The pipeline processes audio through a three-stage loop: STT -> Agent -> TTS. It works with any Cogitator agent regardless of the underlying LLM.
### One-shot processing
```ts
import { VoicePipeline, OpenAISTT, OpenAITTS } from '@cogitator-ai/voice';
const pipeline = new VoicePipeline({
stt: new OpenAISTT({ apiKey: process.env.OPENAI_API_KEY! }),
tts: new OpenAITTS({ apiKey: process.env.OPENAI_API_KEY! }),
agent: myAgent,
});
const result = await pipeline.process(audioBuffer);
console.log(result.transcript); // what the user said
console.log(result.response);   // the agent's text reply
// result.audio — synthesized response audio
```

### Streaming sessions
For continuous conversations with VAD-driven turn detection:
```ts
import { VoicePipeline, OpenAISTT, OpenAITTS, EnergyVAD } from '@cogitator-ai/voice';
const pipeline = new VoicePipeline({
stt: new OpenAISTT({ apiKey: process.env.OPENAI_API_KEY! }),
tts: new OpenAITTS({ apiKey: process.env.OPENAI_API_KEY! }),
vad: new EnergyVAD({ threshold: 0.01, silenceDuration: 500 }),
agent: myAgent,
});
const session = pipeline.createSession();
session.on('speech_start', () => {
console.log('User started speaking');
});
session.on('transcript', (text, isFinal) => {
console.log(isFinal ? `Final: ${text}` : `Interim: ${text}`);
});
session.on('agent_response', (text) => {
console.log('Agent:', text);
});
session.on('audio', (chunk) => {
  playAudio(chunk); // app-defined playback
});
// feed captured PCM16 audio into the session
session.pushAudio(pcm16Chunk);
// barge-in: cancel the in-flight response
session.interrupt();
await session.close();
```
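A session is transport-agnostic: anything that produces PCM16 audio can drive it. As one illustration of the glue that `VoiceAgent` (below) automates, a client of the WebSocket transport described later could be bridged to a session by hand; a sketch using only the events shown elsewhere in this README:

```ts
// Bridge one WebSocket client (see "WebSocket Transport" below) to a session.
transport.on('connection', (client) => {
  const session = pipeline.createSession();
  client.on('audio', (chunk) => session.pushAudio(chunk)); // mic frames in
  session.on('audio', (chunk) => client.sendAudio(chunk)); // synthesized frames out
  session.on('transcript', (text, isFinal) => {
    if (isFinal) client.sendMessage({ type: 'transcript', text });
  });
  client.on('close', () => void session.close());
});
```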
## Realtime Mode

Native speech-to-speech without the STT/TTS pipeline: the model consumes and produces audio directly, which lowers latency and makes the conversation flow more natural.
### OpenAI Realtime
```ts
import { RealtimeSession } from '@cogitator-ai/voice';
const session = new RealtimeSession({
provider: 'openai',
apiKey: process.env.OPENAI_API_KEY!,
model: 'gpt-4o-mini-realtime-preview',
voice: 'coral',
instructions: 'You are a helpful assistant.',
tools: [
{
name: 'get_weather',
description: 'Get current weather',
parameters: { type: 'object', properties: { city: { type: 'string' } } },
execute: async (args) => ({ temp: 72, unit: 'F' }),
},
],
});
session.on('connected', () => console.log('Connected'));
session.on('audio', (chunk) => playAudio(chunk));
session.on('transcript', (text, role) => console.log(`${role}: ${text}`));
session.on('speech_start', () => console.log('VAD: speech detected'));
session.on('tool_call', (name, args) => console.log(`Tool: ${name}`, args));
await session.connect();
session.pushAudio(pcm16Chunk);
session.sendText('Hello!');
session.interrupt();
session.close();
```

### Gemini Live
```ts
import { RealtimeSession } from '@cogitator-ai/voice';
const session = new RealtimeSession({
provider: 'gemini',
apiKey: process.env.GOOGLE_API_KEY!,
model: 'gemini-live-2.5-flash-native-audio',
voice: 'Puck',
instructions: 'You are a helpful assistant.',
});
session.on('audio', (chunk) => playAudio(chunk));
session.on('transcript', (text, role) => console.log(`${role}: ${text}`));
await session.connect();
session.pushAudio(pcm16Chunk);
```

## WebSocket Transport
Built-in WebSocket server for connecting browser and mobile clients. Binary frames carry audio; text frames carry JSON control messages.
```ts
import { WebSocketTransport, VoiceClient } from '@cogitator-ai/voice';
const transport = new WebSocketTransport({
path: '/voice',
maxConnections: 100,
});
transport.on('connection', (client: VoiceClient) => {
console.log(`Client connected: ${client.id}`);
client.on('audio', (chunk) => {
// PCM16 audio from client
});
client.on('message', (msg) => {
// JSON control messages
});
client.on('close', () => {
console.log(`Client disconnected: ${client.id}`);
});
client.sendAudio(responseChunk);
client.sendMessage({ type: 'transcript', text: 'Hello' });
});
await transport.listen(8080);
// or attach to an existing HTTP server
transport.attachToServer(httpServer);
await transport.close();
```

### WebSocket Protocol
| Direction | Format | Content |
| ---------------- | ----------- | ----------------------------------------------------------------------------------- |
| Client -> Server | Binary | PCM16 audio frames (16kHz, mono, 16-bit LE) |
| Client -> Server | Text (JSON) | Control messages |
| Server -> Client | Binary | PCM16 or encoded audio response |
| Server -> Client | Text (JSON) | { type: 'transcript' \| 'agent_response' \| 'speech_start' \| 'speech_end', ... } |
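From a browser this protocol maps directly onto the standard WebSocket API. A minimal client sketch: the capture side is omitted, `enqueueForPlayback` is an app-defined placeholder, and the message `type` values are the ones listed above:

```ts
const ws = new WebSocket('ws://localhost:8080/voice');
ws.binaryType = 'arraybuffer';

ws.onmessage = (e) => {
  if (typeof e.data === 'string') {
    // JSON control message from the server
    const msg = JSON.parse(e.data);
    if (msg.type === 'transcript') console.log('heard:', msg.text);
  } else {
    // binary audio response: queue it for playback
    enqueueForPlayback(new Int16Array(e.data));
  }
};

// Send captured audio as 16 kHz mono PCM16 frames.
function sendFrame(pcm16: Int16Array) {
  if (ws.readyState === WebSocket.OPEN) ws.send(pcm16.buffer);
}
```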
## VoiceAgent
High-level class that wires together transport, pipeline/realtime, and session management:
```ts
import { VoiceAgent, OpenAISTT, OpenAITTS, EnergyVAD } from '@cogitator-ai/voice';
const voiceAgent = new VoiceAgent({
agent: myAgent,
mode: 'pipeline',
stt: new OpenAISTT({ apiKey: process.env.OPENAI_API_KEY! }),
tts: new OpenAITTS({ apiKey: process.env.OPENAI_API_KEY! }),
vad: new EnergyVAD(),
transport: { path: '/voice', maxConnections: 50 },
});
voiceAgent.on('session_start', (id) => console.log(`Session started: ${id}`));
voiceAgent.on('session_end', (id) => console.log(`Session ended: ${id}`));
voiceAgent.on('error', (err) => console.error(err));
console.log(`Active sessions: ${voiceAgent.activeSessions}`);
await voiceAgent.listen(8080);
await voiceAgent.close();
```

### Realtime mode with VoiceAgent
```ts
const voiceAgent = new VoiceAgent({
agent: myAgent,
mode: 'realtime',
realtimeProvider: 'openai',
realtimeApiKey: process.env.OPENAI_API_KEY!,
realtimeModel: 'gpt-4o-mini-realtime-preview',
voice: 'coral',
});
await voiceAgent.listen(8080);
```

## Agent Integration
Give any Cogitator agent the ability to transcribe audio or synthesize speech:
```ts
import { Agent, tool } from '@cogitator-ai/core';
import { voiceTools, OpenAISTT, OpenAITTS } from '@cogitator-ai/voice';
import { z } from 'zod';
const [transcribe, speak] = voiceTools({
stt: new OpenAISTT({ apiKey: process.env.OPENAI_API_KEY! }),
tts: new OpenAITTS({ apiKey: process.env.OPENAI_API_KEY! }),
});
const transcribeTool = tool({
name: transcribe.name,
description: transcribe.description,
parameters: z.object({
audioBase64: z.string().describe('Base64-encoded audio data'),
language: z.string().optional().describe('Language code'),
}),
execute: async (params) => transcribe.execute(params),
});
const speakTool = tool({
name: speak.name,
description: speak.description,
parameters: z.object({
text: z.string().describe('Text to convert to speech'),
voice: z.string().optional().describe('Voice to use'),
}),
execute: async (params) => speak.execute(params),
});
const agent = new Agent({
name: 'voice-assistant',
model: 'gpt-4o',
instructions: 'You can transcribe audio and generate speech.',
tools: [transcribeTool, speakTool],
});
```
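The tools can then be exercised through an ordinary run; a hedged sketch, assuming `agent.run` accepts a plain string and resolves to `{ content }` as described in the Configuration Reference (the file path is hypothetical):

```ts
import { readFileSync } from 'node:fs';

// The agent can call transcribe_audio on the embedded clip.
const audioBase64 = readFileSync('question.wav').toString('base64');
const { content } = await agent.run(
  `Transcribe this audio and answer the question in it. Audio (base64): ${audioBase64}`,
);
console.log(content);
```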
## Audio Utilities

Low-level audio conversion functions for working with PCM and WAV data:
```ts
import {
float32ToPcm16,
pcm16ToFloat32,
pcmToWav,
wavToPcm,
resample,
calculateRMS,
} from '@cogitator-ai/voice';
const pcm = float32ToPcm16(float32Samples);
const floats = pcm16ToFloat32(pcmBuffer);
const wav = pcmToWav(pcmBuffer, 16000);
const { samples, sampleRate } = wavToPcm(wavBuffer);
const resampled = resample(float32Samples, 44100, 16000);
const rms = calculateRMS(float32Samples);
```
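These compose into a typical capture-to-transcription path; a minimal sketch, assuming `micFloat32` holds 44.1 kHz mono Float32 samples from your capture layer, and reusing the `stt` provider from the OpenAI STT section:

```ts
// 44.1 kHz mic capture -> 16 kHz PCM16 WAV -> transcription
const at16k = resample(micFloat32, 44100, 16000);
const pcm = float32ToPcm16(at16k);
const wav = pcmToWav(pcm, 16000);

const { text } = await stt.transcribe(wav);
console.log(text);
```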
## Configuration Reference

### VoiceAgentConfig
| Field | Type | Required | Description |
| ------------------ | ---------------------------------------- | ------------- | ------------------------------------ |
| agent | { run(input) => Promise<{ content }> } | Yes | Cogitator agent or compatible object |
| mode | 'pipeline' \| 'realtime' | Yes | Processing mode |
| stt | STTProvider | Pipeline only | Speech-to-text provider |
| tts | TTSProvider | Pipeline only | Text-to-speech provider |
| vad | VADProvider | No | Voice activity detection |
| realtimeProvider | 'openai' \| 'gemini' | Realtime only | Realtime API provider |
| realtimeApiKey | string | Realtime only | API key for realtime provider |
| realtimeModel | string | No | Model override |
| voice | string | No | Voice for TTS or realtime |
| transport | WebSocketTransportConfig | No | Transport options |
### OpenAISTTConfig
| Field | Type | Default | Description |
| --------- | -------- | ------------------------ | ------------------- |
| apiKey | string | — | OpenAI API key |
| model | string | gpt-4o-mini-transcribe | Model ID |
| baseURL | string | — | Custom API base URL |
### DeepgramSTTConfig
| Field | Type | Default | Description |
| ---------- | -------- | -------- | --------------------- |
| apiKey | string | — | Deepgram API key |
| model | string | nova-3 | Model ID |
| language | string | — | Default language code |
### OpenAITTSConfig
| Field | Type | Default | Description |
| --------- | -------- | ----------------- | ------------------- |
| apiKey | string | — | OpenAI API key |
| model | string | gpt-4o-mini-tts | Model ID |
| voice | string | alloy | Default voice |
| baseURL | string | — | Custom API base URL |
### ElevenLabsTTSConfig
| Field | Type | Default | Description |
| --------- | -------- | ---------------------- | ------------------ |
| apiKey | string | — | ElevenLabs API key |
| voiceId | string | 21m00Tcm4TlvDq8ikWAM | Default voice ID |
| model | string | eleven_flash_v2_5 | Model ID |
### EnergyVADConfig
| Field | Type | Default | Description |
| ----------------- | -------- | ------- | ----------------------------------------- |
| threshold | number | 0.01 | RMS energy threshold for speech detection |
| silenceDuration | number | 500 | Silence duration (ms) before speech_end |
| sampleRate | number | 16000 | Audio sample rate in Hz |
### SileroVADConfig
| Field | Type | Default | Description |
| ----------------- | -------- | ------- | ----------------------------------------- |
| modelPath | string | — | Path to silero_vad.onnx model file |
| threshold | number | 0.5 | Speech probability threshold (0-1) |
| silenceDuration | number | 500 | Silence duration (ms) before speech_end |
| sampleRate | number | 16000 | Audio sample rate in Hz |
### WebSocketTransportConfig
| Field | Type | Default | Description |
| ---------------- | -------- | -------- | ------------------------------ |
| path | string | /voice | WebSocket endpoint path |
| maxConnections | number | 100 | Maximum concurrent connections |
## Examples

See `examples/voice/` for runnable examples:
- `01-pipeline-basic.ts` — Basic pipeline with OpenAI STT + TTS
- `02-realtime-openai.ts` — OpenAI Realtime API voice agent
- `03-realtime-gemini.ts` — Gemini Live voice agent
## License
MIT
