██╗  ██╗ ██████╗ ███╗   ██╗██████╗
██║ ██╔╝██╔═══██╗████╗  ██║██╔══██╗
█████╔╝ ██║   ██║██╔██╗ ██║██║  ██║
██╔═██╗ ██║   ██║██║╚██╗██║██║  ██║
██║  ██╗╚██████╔╝██║ ╚████║██████╔╝
╚═╝  ╚═╝ ╚═════╝ ╚═╝  ╚═══╝╚═════╝
██╗   ██╗ ██████╗ ██╗ ██████╗███████╗██╗  ██╗██╗████████╗
██║   ██║██╔═══██╗██║██╔════╝██╔════╝██║ ██╔╝██║╚══██╔══╝
██║   ██║██║   ██║██║██║     █████╗  █████╔╝ ██║   ██║
╚██╗ ██╔╝██║   ██║██║██║     ██╔══╝  ██╔═██╗ ██║   ██║
 ╚████╔╝ ╚██████╔╝██║╚██████╗███████╗██║  ██╗██║   ██║
  ╚═══╝   ╚═════╝ ╚═╝ ╚═════╝╚══════╝╚═╝  ╚═╝╚═╝   ╚═╝
@kond.studio/voicekit
Voice I/O for AI agents — You bring your own LLM
VoiceKit handles STT, TTS, and turn detection. Your AI handles intelligence.
Zero LLM lock-in. Zero markup on your AI calls.
Installation
npm install @kond.studio/voicekit
# or
pnpm add @kond.studio/voicekit
# or
yarn add @kond.studio/voicekit
Why VoiceKit?
Most voice AI platforms (Vapi, Retell, Hume) want to own your entire stack — including your LLM. They proxy your AI calls and charge markup on every token.
VoiceKit is different:
┌─────────────────────────────────────────────────────────────────┐
│ OTHER PLATFORMS │
│ │
│ Your App ──► Platform ──► LLM (their proxy) ──► Platform ──► App│
│ └── 15-30% markup on tokens ──┘ │
│ │
│ Lock-in: Limited LLM choices, migration required to switch │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ VOICEKIT │
│ │
│ Your App ──► VoiceKit ──► Your App ──► YOUR LLM (direct) │
│ │ │ │
│ Voice I/O Your choice │
│ (STT, TTS, Turn) (Claude/GPT/Gemini/...) │
│ │
│ Freedom: Any LLM, switch anytime, pay your provider directly │
└─────────────────────────────────────────────────────────────────┘
Early Access
VoiceKit is currently in early access.
Get your free API key at kond.studio/developers/voicekit/keys.
Free tier includes 10 minutes/month to test.
Quick Start
With Claude (Anthropic)
import { VoiceKit } from '@kond.studio/voicekit';
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();
const voice = new VoiceKit({
apiKey: 'vk_xxxxxxxxxxxx',
locale: 'en',
onTranscript: async (text) => {
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
messages: [{ role: 'user', content: text }],
});
voice.speak(response.content[0].text);
},
});
await voice.start();
With GPT (OpenAI)
import { VoiceKit } from '@kond.studio/voicekit';
import OpenAI from 'openai';
const openai = new OpenAI();
const voice = new VoiceKit({
apiKey: 'vk_xxxxxxxxxxxx',
locale: 'en',
onTranscript: async (text) => {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: text }],
});
voice.speak(response.choices[0].message.content);
},
});
await voice.start();
With Gemini (Google)
import { VoiceKit } from '@kond.studio/voicekit';
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: 'gemini-pro' });
const voice = new VoiceKit({
apiKey: 'vk_xxxxxxxxxxxx',
locale: 'en',
onTranscript: async (text) => {
const result = await model.generateContent(text);
voice.speak(result.response.text());
},
});
await voice.start();
With Mistral
import { VoiceKit } from '@kond.studio/voicekit';
import { Mistral } from '@mistralai/mistralai';
const mistral = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });
const voice = new VoiceKit({
apiKey: 'vk_xxxxxxxxxxxx',
locale: 'en',
onTranscript: async (text) => {
const response = await mistral.chat.complete({
model: 'mistral-large-latest',
messages: [{ role: 'user', content: text }],
});
voice.speak(response.choices[0].message.content);
},
});
await voice.start();
With Ollama (Local LLM)
import { VoiceKit } from '@kond.studio/voicekit';
import { Ollama } from 'ollama';
const ollama = new Ollama();
const voice = new VoiceKit({
apiKey: 'vk_xxxxxxxxxxxx',
locale: 'en',
onTranscript: async (text) => {
const response = await ollama.chat({
model: 'llama3',
messages: [{ role: 'user', content: text }],
});
voice.speak(response.message.content);
},
});
await voice.start();
Two API Keys
VoiceKit uses a simple two-key model:
| Key | Purpose | Who provides it |
|-----|---------|-----------------|
| vk_xxx | Voice I/O (STT, TTS, turn detection) | VoiceKit |
| ANTHROPIC_API_KEY / OPENAI_API_KEY / etc. | Your LLM | You (direct from provider) |
VoiceKit never sees your LLM calls. The onTranscript callback runs entirely in your own code; VoiceKit only handles the voice I/O around it.
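A minimal sketch of how the two keys typically live side by side (the env variable names are just a convention, and the Claude call is borrowed from the Quick Start above):
// .env.local
// VOICEKIT_API_KEY=vk_xxxxxxxxxxxx   <- key #1, issued by VoiceKit
// ANTHROPIC_API_KEY=sk-ant-...       <- key #2, issued by your LLM provider
import { VoiceKit } from '@kond.studio/voicekit';
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const voice = new VoiceKit({
  apiKey: process.env.VOICEKIT_API_KEY, // the only key VoiceKit ever sees
  locale: 'en',
  onTranscript: async (text) => {
    // This request goes straight to Anthropic; VoiceKit is not in the path.
    const response = await anthropic.messages.create({
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1024,
      messages: [{ role: 'user', content: text }],
    });
    voice.speak(response.content[0].text);
  },
});
await voice.start();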
Security Considerations
API Key Storage
DO:
- Store VOICEKIT_API_KEY in environment variables (.env.local)
- Store LLM API keys server-side when possible (see the sketch after this list)
- Use short-lived tokens for client-side use if needed
DON'T:
- Hardcode API keys in source code
- Store API keys in localStorage or cookies
- Log API keys or transcripts in production
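One way to keep the LLM key server-side in a Next.js app; this is a sketch, and the /api/chat route name and { text } request shape are this example's own, not part of VoiceKit:
// app/api/chat/route.ts: runs on the server, so OPENAI_API_KEY never reaches the browser.
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the server environment
export async function POST(req: Request) {
  const { text } = await req.json();
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: text }],
  });
  return Response.json({ reply: completion.choices[0].message.content });
}
The browser-side onTranscript callback can then POST the transcript to this route and speak the { reply } it gets back, keeping only the vk_ key on the client.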
HTTPS
VoiceKit enforces HTTPS in production. HTTP is only allowed for localhost development.
// This will throw in production:
new VoiceKit({ baseUrl: 'http://insecure.example.com' }); // Error!
// OK in development:
new VoiceKit({ baseUrl: 'http://localhost:3000' }); // Works
Data Privacy
- Transcripts are processed in real-time and not stored by VoiceKit
- Audio is streamed to STT provider (Deepgram) and not retained
- LLM calls go directly to your provider — VoiceKit never sees them
React
import { useVoiceKit } from '@kond.studio/voicekit/react';
function VoiceChat() {
const voice = useVoiceKit({
apiKey: process.env.NEXT_PUBLIC_VOICEKIT_API_KEY,
locale: 'en',
onTranscript: async (text) => {
const reply = await myLLM.chat(text); // Your LLM, your choice
voice.speak(reply);
},
});
return (
<button onClick={voice.isActive ? voice.stop : voice.start}>
{voice.isActive ? 'Listening...' : 'Start Voice'}
</button>
);
}
Premium React Components
VoiceKit includes pre-built UI components for voice conversations with:
- Glow effects for listening state
- Ripple effects for speaking state
- Barge-in indicators for interruptions
- Real-time waveform visualization
VoiceStatusIndicator
import { useVoiceKit, VoiceStatusIndicator } from '@kond.studio/voicekit/react';
function VoiceUI() {
const voice = useVoiceKit({ ... });
return (
<VoiceStatusIndicator
state={voice.state}
isActive={voice.isActive}
userSpeaking={voice.userSpeaking}
showGlow={true} // Glow effect for listening state
wasInterrupted={false} // Show interrupted state
/>
);
}
VoiceWaveform
Real-time audio level visualization:
import { useState } from 'react';
import { useVoiceKit, VoiceWaveform } from '@kond.studio/voicekit/react';
function VoiceUI() {
const [audioLevel, setAudioLevel] = useState(0);
const voice = useVoiceKit({
...config,
onAudioLevel: (level) => setAudioLevel(level),
});
return (
<VoiceWaveform
isActive={voice.isActive && voice.userSpeaking}
audioLevel={audioLevel}
color="#00ff88"
size="md" // "sm" | "md" | "lg" | number
barCount={5}
showGlow={true}
/>
);
}
Barge-in (Interruption) Handling
import { useState } from 'react';
import { useVoiceKit } from '@kond.studio/voicekit/react';
function VoiceUI() {
const [interrupted, setInterrupted] = useState(false);
const voice = useVoiceKit({
...config,
onInterruption: (context) => {
// Called when user speaks while AI is talking
console.log('User interrupted:', context?.interruptedText);
setInterrupted(true);
setTimeout(() => setInterrupted(false), 3000);
},
});
return (
<>
{interrupted && <div className="badge">Interrupted!</div>}
{/* ... */}
</>
);
}
VAD Meter
import { useState } from 'react';
import { useVoiceKit, AnimatedVADMeter } from '@kond.studio/voicekit/react';
function VoiceUI() {
const [audioLevel, setAudioLevel] = useState(0);
const voice = useVoiceKit({
...config,
onAudioLevel: (level) => setAudioLevel(level),
});
return (
<AnimatedVADMeter
level={audioLevel}
isActive={voice.isActive}
barCount={10}
activeColor="#00ff88"
/>
);
}
What's Included
VoiceKit handles the hard parts of voice:
| Feature | Description |
|---------|-------------|
| Speech-to-Text | Optimized STT with streaming transcription |
| Text-to-Speech | Natural voices with gapless audio queue |
| Turn Detection | ML-powered end-of-utterance detection with local ONNX or cloud |
| Local ML | ONNX inference in browser (~25-50ms) for desktop devices |
| VAD | Local voice activity detection |
| Barge-in | Natural user interruption support |
| Backchannels | Filters "mh", "yeah", "ok" (no LLM call) |
| 9-state FSM | Battle-tested conversation orchestration |
Languages
SUPPORTED LOCALES
─────────────────
├─ "en" English
├─ "fr" French
└─ "multi"  Multilingual (auto-detects EN/FR/ES/DE/IT/PT/JA/NL/RU/HI)
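The locale is chosen at construction time; a minimal sketch using the multilingual model (the rest of the config is as in the Quick Start):
import { VoiceKit } from '@kond.studio/voicekit';
// 'multi' auto-detects the spoken language among the supported locales.
const voice = new VoiceKit({
  apiKey: 'vk_xxxxxxxxxxxx',
  locale: 'multi',
  onTranscript: async (text) => {
    // Your LLM call, exactly as in the Quick Start examples.
  },
});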
Pricing
| Plan | Minutes/month | Price | Overage |
|------|---------------|-------|---------|
| Free | 10 | $0/mo | Blocked |
| Starter | 500 | $49/mo | $0.08/min |
| Pro | 3000 | $99/mo | $0.05/min |
Note: This is just VoiceKit pricing. Your LLM costs are separate and go directly to your provider.
vs Competition
| LLM comparison | VoiceKit | Vapi.ai | Retell.ai | Hume.ai |
|----------------|----------|---------|-----------|---------|
| LLM choice | Any | Limited (their proxy) | Limited (their proxy) | EVI only |
| LLM markup | 0% (pay direct) | ~15-30% on tokens | ~10-25% on tokens | Included |
| Local LLM (Ollama) | Yes | No | No | No |
| Switch LLM | 1 line change | Migration | Migration | N/A |
| See prompts? | Never | Yes (proxy) | Yes (proxy) | Yes |
Turn Detection Modes
VoiceKit uses intelligent turn detection to know when you've finished speaking:
| Mode | Best For | Latency |
|------|----------|---------|
| auto (default) | Most apps — auto-selects based on device | ~25-200ms |
| local | Desktop apps wanting lowest latency | ~25-50ms |
| cloud | Mobile apps, consistent behavior | ~100-200ms |
| heuristic | Offline, testing | ~1-5ms |
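Each mode can be requested explicitly through the turnDetection option; a sketch for forcing the cloud detector (the 'cloud' value is inferred from the mode names in this table, mirroring the local and heuristic examples further down):
// Force the cloud detector, e.g. for consistent behavior across mobile devices.
const voice = new VoiceKit({
  apiKey: 'vk_xxx',
  turnDetection: { type: 'cloud' },
  onTranscript: async (text) => {
    // Your LLM call
  },
});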
Auto Mode (Default)
Device capable (desktop 4GB+ RAM, WebAssembly, IndexedDB)?
├── YES → Local ONNX (~25-50ms)
└── NO  → Cloud API (~100-200ms)
// Force local ONNX for lowest latency
const voice = new VoiceKit({
apiKey: 'vk_xxx',
turnDetection: { type: 'local' },
onTranscript: ...
});
// Force heuristic for offline use
const voice = new VoiceKit({
apiKey: 'vk_xxx',
turnDetection: { type: 'heuristic' },
onTranscript: ...
});
Use Cases
- Voice chatbots — Customer support, virtual assistants
- Educational apps — AI tutors, language learning
- Accessibility — Voice interfaces for visually impaired
- Gaming — NPCs that talk with players
- Autonomous agents — Agents that report back vocally
Documentation
Full documentation: kond.studio/developers/voicekit/docs
License
MIT — see LICENSE
Contributing
VoiceKit is extracted from KOND, a personal AI companion.
Issues and PRs welcome on GitHub.
┌──────────────────────────────────────────────────────────────────────────┐
│ │
│ [ ^_^ ] │
│ │
│ @kond.studio/voicekit — built with care │
│ │
│ Voice I/O for AI agents. You bring your own LLM. │
│ │
│ 2025 │
│ │
└──────────────────────────────────────────────────────────────────────────┘