@metaventionsai/voice-nexus

v1.0.1

Published

2 months ago

Universal multi-provider voice architecture - seamless routing between STT, reasoning, and TTS providers

metaventionsai.com

voice speech-to-text text-to-speech stt tts ai llm multi-provider gemini claude elevenlabs whisper

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           VOICE NEXUS ORCHESTRATOR                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                         INPUT LAYER (STT)                            │   │
│  │                                                                      │   │
│  │   🎤 Audio ──→ [STT Provider] ──→ Transcribed Text                  │   │
│  │                      │                                               │   │
│  │         ┌────────────┼────────────┐                                 │   │
│  │         │            │            │                                 │   │
│  │     Gemini      Whisper      Browser                                │   │
│  │     (Live)      (Batch)      (Fallback)                             │   │
│  │     ~200ms      ~500ms       ~300ms                                 │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                     │                                       │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                      COMPLEXITY ROUTER                               │   │
│  │                                                                      │   │
│  │   Text ──→ [Signal Extraction] ──→ [Tier Selection] ──→ Provider    │   │
│  │                     │                      │                         │   │
│  │         ┌───────────┴───────────┐   ┌─────┴─────┐                   │   │
│  │         │ Code:      +0.25      │   │   FAST    │ → Sonnet          │   │
│  │         │ Reasoning: +0.20      │   │   <0.2    │   (ELITE)         │   │
│  │         │ Creative:  +0.15      │   ├───────────┤                   │   │
│  │         │ Navigation: -0.30     │   │ BALANCED  │ → Opus            │   │
│  │         │ Question:   -0.10     │   │  0.2-0.5  │   (ELITE)         │   │
│  │         └───────────────────────┘   ├───────────┤                   │   │
│  │                                     │   DEEP    │ → Opus            │   │
│  │                                     │   >0.5    │   (ELITE)         │   │
│  │                                     └───────────┘                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                     │                                       │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                    KNOWLEDGE LAYER (Optional)                        │   │
│  │                                                                      │   │
│  │   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │   │
│  │   │   Semantic   │  │   Research   │  │    Agent     │             │   │
│  │   │   Search     │  │   Findings   │  │   Expertise  │             │   │
│  │   │              │  │   Injection  │  │   Context    │             │   │
│  │   └──────────────┘  └──────────────┘  └──────────────┘             │   │
│  │                              │                                       │   │
│  │              [Enriched Context → Reasoning]                         │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                     │                                       │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                    REASONING LAYER                                   │   │
│  │                                                                      │   │
│  │   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │   │
│  │   │   CLAUDE     │  │   GEMINI     │  │    GROK      │             │   │
│  │   │  Sonnet/Opus │  │  Flash/Pro   │  │  (Creative)  │             │   │
│  │   │              │  │              │  │              │             │   │
│  │   │ • Reasoning  │  │ • Realtime   │  │ • Chat       │             │   │
│  │   │ • Code       │  │ • Grounding  │  │ • Creative   │             │   │
│  │   │ • Analysis   │  │ • Speed      │  │ • Humor      │             │   │
│  │   └──────────────┘  └──────────────┘  └──────────────┘             │   │
│  │                              │                                       │   │
│  │              [Response Text]                                        │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                     │                                       │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                      OUTPUT LAYER (TTS)                              │   │
│  │                                                                      │   │
│  │   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │   │
│  │   │  ELEVENLABS  │  │   GEMINI     │  │   BROWSER    │             │   │
│  │   │  (Premium)   │  │  (Native)    │  │  (Fallback)  │             │   │
│  │   │              │  │              │  │              │             │   │
│  │   │ • Emotional  │  │ • Fast       │  │ • Free       │             │   │
│  │   │ • 9 Voices   │  │ • Integrated │  │ • Universal  │             │   │
│  │   │ • Streaming  │  │ • Low-latency│  │ • Basic      │             │   │
│  │   └──────────────┘  └──────────────┘  └──────────────┘             │   │
│  │                              │                                       │   │
│  │              [Audio Output → 🔊]                                    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Features

3 Voice Modes: Realtime, Turn-based, Hybrid
Provider Agnostic: Works with any STT, LLM, or TTS service
ELITE Complexity Routing: Auto-selects reasoning tier (Opus-first)
Knowledge Injection: Enrich queries with external knowledge
Real-time Events: Callbacks for all processing stages
ElevenLabs Integration: 9 premium voices with emotional range

Installation

npm install @metaventionsai/voice-nexus
# or
yarn add @metaventionsai/voice-nexus
# or
pnpm add @metaventionsai/voice-nexus

Quick Start

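import Anthropic from '@anthropic-ai/sdk';
import { createVoiceNexus, type ReasoningProvider } from '@metaventionsai/voice-nexus';

// The client reads ANTHROPIC_API_KEY from the environment by default
const anthropic = new Anthropic();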

// 1. Define your reasoning provider
const claudeProvider: ReasoningProvider = {
    name: 'claude',
    models: {
        fast: 'claude-sonnet-4-20250514',      // ELITE: Sonnet even for fast
        balanced: 'claude-opus-4-20250514',    // ELITE: Opus for balanced
        deep: 'claude-opus-4-20250514'         // ELITE: Opus for deep
    },
    isAvailable: () => !!process.env.ANTHROPIC_API_KEY,
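    generate: async (prompt, config) => {
        // Resolve the model once so the request and the result always agree
        const model = config.model || 'claude-opus-4-20250514';
        const response = await anthropic.messages.create({
            model,
            messages: [{ role: 'user', content: prompt }],
            max_tokens: 4096
        });
        // Content blocks are a union type; keep only the text blocks
        const text = response.content
            .map((block) => (block.type === 'text' ? block.text : ''))
            .join('');
        return { text, model };
    }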
};

// 2. Create Voice Nexus instance
const nexus = createVoiceNexus({
    config: {
        mode: 'turn-based',
        knowledgeInjection: false,
        providers: {
            reasoning: claudeProvider
        }
    },
    events: {
        onTranscriptUpdate: (t) => console.log(`[${t.role}] ${t.text}`),
        onComplexityAnalyzed: (c) => console.log(`Complexity: ${c.score.toFixed(2)} → ${c.tier}`)
    }
});

// 3. Process text input
const response = await nexus.processTextInput('How do I architect a distributed system?');
console.log(response?.text);

ELITE TIER Configuration

| Setting | ELITE Value | Standard Value | Description |
|---------|-------------|----------------|-------------|
| Fast Threshold | 0.2 | 0.3 | Lower = more orchestration |
| Deep Threshold | 0.5 | 0.7 | Lower = more Opus usage |
| Fast Model | Sonnet | Flash | Quality even for simple |
| Balanced Model | Opus | Sonnet | Opus by default |
| Deep Model | Opus | Opus | Maximum reasoning |
| TTS Provider | ElevenLabs | Gemini | Premium audio always |

Use Standard Tier

import { createVoiceNexus, STANDARD_THRESHOLDS } from '@metaventionsai/voice-nexus';

const nexus = createVoiceNexus({
    config: {
        mode: 'turn-based',
        thresholds: STANDARD_THRESHOLDS  // Standard tier: 0.3 fast / 0.7 deep, per the table above
    }
});

Voice Modes

| Mode | Flow | Latency | Use Case |
|------|------|---------|----------|
| Realtime | Streaming STT → Fast LLM → Streaming TTS | ~500ms | Live conversation |
| Turn-based | Complete STT → LLM → TTS pipeline | ~2-5s | High quality responses |
| Hybrid | Auto-switches based on complexity | Variable | Best of both |

Mode Selection

// Realtime: Fast responses, live feel
nexus.setMode('realtime');

// Turn-based: Higher quality, full processing
nexus.setMode('turn-based');

// Hybrid: System decides based on query
nexus.setMode('hybrid');
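
How hybrid mode chooses between the two flows is not spelled out here. The following is a minimal sketch of one plausible rule, assuming the choice is made by comparing the complexity score against a cutoff. The names `resolveFlow` and `HYBRID_CUTOFF` are hypothetical and not part of the package, and reusing the ELITE fast threshold (0.2) as the cutoff is also an assumption.

```typescript
// Hypothetical sketch: not the package's actual hybrid-mode logic.
type VoiceMode = 'realtime' | 'turn-based' | 'hybrid';
type Flow = 'realtime' | 'turn-based';

// Assumed cutoff: reuse the ELITE fast threshold (0.2) from the configuration table.
const HYBRID_CUTOFF = 0.2;

// Resolve the concrete flow for one query.
// Explicit modes pass through unchanged; hybrid picks a flow by complexity score.
function resolveFlow(
    mode: VoiceMode,
    complexityScore: number,
    cutoff: number = HYBRID_CUTOFF
): Flow {
    if (mode !== 'hybrid') {
        return mode;
    }
    return complexityScore < cutoff ? 'realtime' : 'turn-based';
}

console.log(resolveFlow('hybrid', 0.05));
console.log(resolveFlow('hybrid', 0.55));
```

Under this assumption, simple navigation-style queries stay on the low-latency streaming path, while anything scoring at or above the cutoff goes through the full turn-based pipeline.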

Complexity Router

Signal Extraction

| Signal | Pattern | Impact |
|--------|---------|--------|
| Token count | Length of query | +0.0 to +0.25 |
| Code indicators | implement, debug, function | +0.25 |
| Reasoning | analyze, compare, why | +0.20 |
| Creative | brainstorm, imagine | +0.15 |
| Navigation | go to, open, show | -0.30 |
| Question | what is, how do | -0.10 |
| Domain | architecture, distributed | +0.30 |
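
To make the table concrete, here is an illustrative scorer that sums the listed weights. It is a sketch only: the regular expressions, the linear length scaling (saturating at 50 words), the clamp to the 0..1 range, and the names `RULES`, `lengthSignal`, and `scoreComplexity` are all assumptions, so it will not necessarily reproduce the scores returned by the package's own `analyzeComplexity`.

```typescript
// Illustrative scorer built from the signal table. The regexes, the length
// scaling, and the 0..1 clamp are assumptions, not the package's real rules.
interface SignalRule {
    name: string;
    pattern: RegExp;
    weight: number;
}

const RULES: SignalRule[] = [
    { name: 'code', pattern: /\b(implement|debug|function)\b/i, weight: 0.25 },
    { name: 'reasoning', pattern: /\b(analyze|compare|why)\b/i, weight: 0.20 },
    { name: 'creative', pattern: /\b(brainstorm|imagine)\b/i, weight: 0.15 },
    { name: 'navigation', pattern: /\b(go to|open|show)\b/i, weight: -0.30 },
    { name: 'question', pattern: /\b(what is|how do)\b/i, weight: -0.10 },
    { name: 'domain', pattern: /\b(architecture|distributed)\b/i, weight: 0.30 },
];

// Assumed length signal: grows linearly and saturates at +0.25 after 50 words.
function lengthSignal(text: string): number {
    const words = text.trim().split(/\s+/).filter(Boolean).length;
    return Math.min(words / 50, 1) * 0.25;
}

// Sum the length signal and every matching rule weight, then clamp to [0, 1].
function scoreComplexity(text: string): { score: number; matched: string[] } {
    const matched = RULES.filter((rule) => rule.pattern.test(text));
    const raw = matched.reduce((sum, rule) => sum + rule.weight, lengthSignal(text));
    const score = Math.min(1, Math.max(0, raw));
    return { score, matched: matched.map((rule) => rule.name) };
}

console.log(scoreComplexity('open settings'));
console.log(scoreComplexity('Analyze this distributed architecture'));
```

Each rule contributes at most once per query, so repeating a keyword does not inflate the score. Negative weights let short navigation commands fall to the bottom of the range.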

Tier Selection (ELITE)

| Tier | Complexity | Model | TTS |
|------|------------|-------|-----|
| Fast | <0.2 | Claude Sonnet | ElevenLabs |
| Balanced | 0.2-0.5 | Claude Opus | ElevenLabs |
| Deep | >0.5 | Claude Opus | ElevenLabs |
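
The mapping from score to tier can be sketched as a pair of comparisons. This is a minimal illustration, assuming the boundaries behave as the table reads (below the fast threshold is Fast, above the deep threshold is Deep, everything else is Balanced). The names `selectTier`, `eliteThresholds`, and `standardThresholds` are local to this sketch and are not the package's exports; the numeric values come from the ELITE TIER Configuration table.

```typescript
type Tier = 'fast' | 'balanced' | 'deep';

interface Thresholds {
    fast: number; // scores below this go to the fast tier
    deep: number; // scores above this go to the deep tier
}

// Values from the configuration table; the constant names are local to this sketch.
const eliteThresholds: Thresholds = { fast: 0.2, deep: 0.5 };
const standardThresholds: Thresholds = { fast: 0.3, deep: 0.7 };

// Map a complexity score to a tier. Scores exactly on a boundary are
// treated as balanced, which is an assumption about the edge cases.
function selectTier(score: number, thresholds: Thresholds = eliteThresholds): Tier {
    if (score < thresholds.fast) return 'fast';
    if (score > thresholds.deep) return 'deep';
    return 'balanced';
}

console.log(selectTier(0.55));                     // ELITE thresholds
console.log(selectTier(0.55, standardThresholds)); // Standard thresholds
```

The same score can land in different tiers depending on the thresholds, which is the practical effect of the lower ELITE values: more queries reach the Opus-backed tiers.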

Usage

import { analyzeComplexity, ELITE_THRESHOLDS } from '@metaventionsai/voice-nexus';

const result = analyzeComplexity('Design a microservices architecture');
// {
//   score: 0.55,
//   tier: 'deep',
//   signals: { hasCodeIndicators: false, hasReasoningIndicators: true, ... },
//   recommendedProvider: { reasoning: 'claude-opus', tts: 'elevenlabs' }
// }

Provider Interfaces

STT Provider

interface STTProvider {
    readonly name: string;
    readonly supportsStreaming: boolean;
    transcribe(audio: Blob): Promise<string>;
    startStreaming?(onPartial: (text: string) => void): Promise<void>;
    stopStreaming?(): Promise<string>;
    isAvailable(): boolean;
}

Reasoning Provider

interface ReasoningProvider {
    readonly name: string;
    readonly models: { fast: string; balanced: string; deep: string };
    generate(prompt: string, config: ReasoningConfig): Promise<ReasoningResult>;
    isAvailable(): boolean;
}

TTS Provider

interface TTSProvider {
    readonly name: string;
    readonly supportsStreaming: boolean;
    readonly voices: VoiceConfig[];
    synthesize(text: string, voice: string, settings?: TTSSettings): Promise<ArrayBuffer>;
    synthesizeStream?(text: string, voice: string, onChunk: (chunk: ArrayBuffer) => void): Promise<void>;
    getVoiceForAgent(agentName: string): string;
    isAvailable(): boolean;
}
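
All three interfaces expose `name` and `isAvailable()`, which is enough to build a fallback chain like the one in the architecture diagram (premium provider first, browser fallback last). The helper below is a hypothetical sketch; `NamedProvider` and `pickAvailable` are not part of the package, and the package may resolve fallbacks differently.

```typescript
// Minimal shape shared by the STT, reasoning, and TTS interfaces above.
interface NamedProvider {
    readonly name: string;
    isAvailable(): boolean;
}

// Return the first provider that reports itself available, in priority order.
// Hypothetical helper: the package may resolve fallbacks differently.
function pickAvailable<T extends NamedProvider>(candidates: T[]): T | undefined {
    return candidates.find((provider) => provider.isAvailable());
}

// Stub providers in the order shown in the diagram's TTS layer.
const ttsCandidates: NamedProvider[] = [
    { name: 'elevenlabs', isAvailable: () => false }, // e.g. API key missing
    { name: 'gemini', isAvailable: () => false },
    { name: 'browser', isAvailable: () => true },     // always-on fallback
];

const chosen = pickAvailable(ttsCandidates);
console.log(chosen?.name);
```

Because the helper is generic over `T`, passing an array of full `TTSProvider` objects returns a `TTSProvider`, so `synthesize` remains callable on the result without a cast.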

Event Callbacks

const nexus = createVoiceNexus({
    config: { mode: 'turn-based', knowledgeInjection: false },
    events: {
        // Transcript updates
        onTranscriptUpdate: (transcript) => {
            console.log(`[${transcript.role}] ${transcript.text}`);
        },

        // Live streaming (realtime mode)
        onPartialTranscript: (partial) => {
            updateUI(partial.text);
        },

        // Processing state
        onProcessingStart: () => setLoading(true),
        onProcessingEnd: () => setLoading(false),

        // Complexity analysis
        onComplexityAnalyzed: (result) => {
            console.log(`Complexity: ${result.score.toFixed(2)} → ${result.tier}`);
            console.log(`Provider: ${result.recommendedProvider.reasoning}`);
        },

        // Provider changes
        onProviderSwitch: (providers) => {
            console.log('Switched to:', providers);
        },

        // Knowledge injection
        onKnowledgeInjected: (context) => {
            console.log(`Injected ${context.searchResults.length} results`);
        },

        // Errors
        onError: (error) => {
            console.error('Voice Nexus error:', error);
        },

        // Full state changes
        onStateChange: (state) => {
            updateUIState(state);
        }
    }
});

Knowledge Injection

Enrich queries with external knowledge:

const knowledgeInjector: KnowledgeInjector = {
    isAvailable: () => true,
    injectContext: async (query, expertise) => {
        // searchKnowledgeBase stands in for your own retrieval function
        const results = await searchKnowledgeBase(query);
        return {
            searchResults: results,
            injectedPrompt: `Based on this context:\n${results.map(r => r.content).join('\n')}`
        };
    }
};

const nexus = createVoiceNexus({
    config: {
        mode: 'turn-based',
        knowledgeInjection: true,
        agent: {
            id: 'research-agent',
            name: 'Dr. Ira',
            expertise: ['AI', 'systems design', 'research']
        }
    },
    knowledgeInjector
});
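
How the injected context is merged with the user's query before it reaches the reasoning provider is not shown. One simple approach, sketched below under that caveat, is to concatenate the agent's expertise, a capped list of snippets, and the question. The helper `buildEnrichedPrompt`, the `SearchResult` shape, and the `maxResults` cap are assumptions made for illustration.

```typescript
interface SearchResult {
    content: string;
}

// Hypothetical helper: merges agent expertise, retrieved snippets, and the
// query into one prompt. The cap keeps the prompt from growing without bound.
function buildEnrichedPrompt(
    query: string,
    results: SearchResult[],
    expertise: string[] = [],
    maxResults = 3
): string {
    const snippets = results
        .slice(0, maxResults)
        .map((result, index) => `${index + 1}. ${result.content}`);

    const parts: string[] = [];
    if (expertise.length > 0) {
        parts.push(`Areas of expertise: ${expertise.join(', ')}`);
    }
    if (snippets.length > 0) {
        parts.push(`Based on this context:\n${snippets.join('\n')}`);
    }
    parts.push(`Question: ${query}`);
    return parts.join('\n\n');
}

console.log(
    buildEnrichedPrompt(
        'How should I shard this database?',
        [{ content: 'Hash sharding spreads load evenly.' }],
        ['systems design']
    )
);
```

When no results are found, the function degrades to the bare question, so a failed lookup does not leave an empty context header in the prompt.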

Multi-Provider Setup

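import { createVoiceNexus } from '@metaventionsai/voice-nexus';

// whisperSTT, claudeProvider, elevenLabsTTS, and myKnowledgeSource are your own
// implementations of the provider and knowledge-injector interfaces described above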

const nexus = createVoiceNexus({
    config: {
        mode: 'hybrid',
        knowledgeInjection: true,
        providers: {
            stt: whisperSTT,           // OpenAI Whisper
            reasoning: claudeProvider,  // Anthropic Claude
            tts: elevenLabsTTS         // ElevenLabs
        },
        agent: {
            id: 'assistant',
            name: 'Nova',
            expertise: ['coding', 'research']
        }
    },
    events: {
        onTranscriptUpdate: updateChat,
        onComplexityAnalyzed: showComplexity
    },
    knowledgeInjector: myKnowledgeSource
});

// Start voice session
await nexus.start();

// Or process directly
const response = await nexus.processTextInput('Analyze this architecture...');

ElevenLabs Voice Mapping

ELITE tier always uses ElevenLabs with agent-specific voices:

| Agent | Voice | Character |
|-------|-------|-----------|
| Dr. Ira | Clyde | Deep, authoritative |
| Mike | Liam | Narrative, American |
| Caleb | Drew | News anchor |
| Paramdeep | Charlie | Casual |
| Bilal | Fin | Energetic |
| Noah | Dorothy | Gentle |
| Helen | Domi | Expressive |
| Perri | Rachel | Clear |
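
A custom `TTSProvider` needs a `getVoiceForAgent` implementation, and the table above maps directly onto a lookup. The sketch below assumes a default voice for agents that are not listed; the choice of `Rachel` as that default is an assumption, and the table gives voice names only, so any provider-specific voice IDs would have to be supplied separately.

```typescript
// Agent-to-voice table from above, keyed by agent name.
const AGENT_VOICES = new Map<string, string>([
    ['Dr. Ira', 'Clyde'],
    ['Mike', 'Liam'],
    ['Caleb', 'Drew'],
    ['Paramdeep', 'Charlie'],
    ['Bilal', 'Fin'],
    ['Noah', 'Dorothy'],
    ['Helen', 'Domi'],
    ['Perri', 'Rachel'],
]);

// Assumed default for agents that are not in the table.
const DEFAULT_VOICE = 'Rachel';

// Look up the voice name for an agent, falling back to the default.
function getVoiceForAgent(agentName: string): string {
    return AGENT_VOICES.get(agentName.trim()) ?? DEFAULT_VOICE;
}

console.log(getVoiceForAgent('Dr. Ira'));
console.log(getVoiceForAgent('Unlisted Agent'));
```

A `Map` is used rather than a plain object so that agent names such as `constructor` cannot collide with inherited object properties.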

State Management

// Get current state
const state = nexus.getState();
// {
//   mode: 'turn-based',
//   isActive: true,
//   isProcessing: false,
//   currentProvider: { stt: 'whisper', reasoning: 'claude', tts: 'elevenlabs' },
//   transcripts: [...],
//   lastComplexityScore: 0.65,
//   error: null
// }

// Get transcripts
const transcripts = nexus.getTranscripts();

// Clear history
nexus.clearTranscripts();

// Change mode
nexus.setMode('realtime');

// Stop session
await nexus.stop();

Related Packages

@metaventionsai/cpb-core - CPB precision orchestration
ResearchGravity - Research framework