@omote/core (v0.10.6)
Client-side AI inference for real-time lip sync, speech recognition, and avatar animation — runs entirely in browser via WebGPU and WASM.
Features
- Lip Sync (A2E) — Audio to 52 ARKit blendshapes via LAM, with automatic WebGPU/WASM platform detection
- PlaybackPipeline — TTS audio playback to lip sync with ExpressionProfile scaling, gapless scheduling
- Speech Recognition — SenseVoice ASR (ONNX), 15x faster than Whisper, progressive transcription
- Voice Activity Detection — Silero VAD with Worker and main-thread modes
- Text-to-Speech — Kokoro TTS (82M q8, offline) with TTSBackend interface for custom engines
- CharacterController — Renderer-agnostic avatar composition (compositor + gaze + life layer)
- TTSPlayback — Composes TTSBackend + PlaybackPipeline for text → lip sync
- TTSSpeaker — High-level speak(text) with abort, queueing, and LLM streaming
- SpeechListener — Mic → VAD → ASR orchestration with adaptive silence detection
- createTTSPlayer() — Factory composing Kokoro TTS + TTSSpeaker for zero-config playback
- VoiceOrchestrator — Full conversational agent loop with local TTS support (cloud or offline)
- configureModelUrls() — Self-host model files from your own CDN
- Animation Graph — State machine (idle/listening/thinking/speaking) with emotion blending
- Emotion Controller — Preset-based emotion system with smooth transitions
- Model Caching — IndexedDB with versioning, LRU eviction, and quota monitoring
- Microphone Capture — Browser noise suppression, echo cancellation, AGC
- Logging & Telemetry — Structured logging (6 levels) and OpenTelemetry-compatible tracing
- Offline Ready — No cloud dependencies, works entirely without internet
- WebGPU + WASM — WebGPU-first with automatic WASM fallback
Installation
```sh
npm install @omote/core
```

Peer dependency: `onnxruntime-web` is included — no additional installs needed.
Quick Start
PlaybackPipeline (TTS Lip Sync)
The most common use case: feed TTS audio chunks and get back 52 ARKit blendshape frames at render rate.
```ts
import { PlaybackPipeline, createA2E } from '@omote/core';

// 1. Create A2E backend (auto-detects GPU vs CPU, fetches from HF CDN, 192MB fp16)
const lam = createA2E();
await lam.load();

// 2. Create pipeline with expression profile
const pipeline = new PlaybackPipeline({
  lam,
  sampleRate: 16000,
  profile: { mouth: 1.0, jaw: 1.0, brows: 0.6, eyes: 0.0, cheeks: 0.5, nose: 0.3, tongue: 0.5 },
});

// 3. Listen for blendshape frames
pipeline.on('frame', (frame) => {
  applyToAvatar(frame.blendshapes); // ExpressionProfile-scaled, 52 ARKit weights
});

// 4. Feed TTS audio and play
pipeline.start();
pipeline.feedBuffer(ttsAudioChunk); // Uint8Array PCM16
pipeline.end(); // flush remaining audio
```

API Reference
A2E (Audio to Expression)
Factory API (Recommended)
Auto-detects platform: Chrome/Edge/Android use WebGPU, Safari/iOS use WASM CPU fallback.
```ts
import { createA2E } from '@omote/core';

const a2e = createA2E(); // auto-detects: WebGPU on Chrome/Edge, WASM on Safari/iOS/Firefox
await a2e.load();

const { blendshapes } = await a2e.infer(audioSamples); // Float32Array (16kHz)
// → 52 ARKit blendshape weights
```

Custom Configuration
```ts
import { createA2E, ARKIT_BLENDSHAPES } from '@omote/core';

const a2e = createA2E({ backend: 'wasm' }); // force WASM for testing
await a2e.load();

const { blendshapes } = await a2e.infer(audioSamples);
const jawOpen = blendshapes[ARKIT_BLENDSHAPES.indexOf('jawOpen')];
```

PlaybackPipeline
End-to-end TTS playback with lip sync inference, audio scheduling, and ExpressionProfile scaling.
```ts
import { PlaybackPipeline } from '@omote/core';

const pipeline = new PlaybackPipeline({
  lam, // A2E backend from createA2E()
  sampleRate: 16000,
  profile: { mouth: 1.0, jaw: 1.0, brows: 0.6, eyes: 0.0, cheeks: 0.5, nose: 0.3, tongue: 0.5 },
});

pipeline.on('frame', (frame) => {
  // frame.blendshapes — ExpressionProfile-scaled
  // frame.rawBlendshapes — unscaled original values
  applyToAvatar(frame.blendshapes);
});

pipeline.start();
pipeline.feedBuffer(chunk); // feed TTS audio (Uint8Array PCM16)
pipeline.end(); // flush final partial chunk
```

A2EProcessor
Engine-agnostic audio-to-blendshapes processor for custom integrations. Supports pull mode (timestamped frames for TTS) and push mode (drip-feed for live mic).
```ts
import { A2EProcessor } from '@omote/core';

const processor = new A2EProcessor({ backend: lam, chunkSize: 16000 });

// Pull mode: timestamp audio for later retrieval
processor.pushAudio(samples, audioContext.currentTime + delay);
const frame = processor.getFrameForTime(audioContext.currentTime);
```

Speech Recognition (SenseVoice)
SenseVoice ASR — 15x faster than Whisper, with progressive transcription and emotion detection.
```ts
import { createSenseVoice } from '@omote/core';

const asr = createSenseVoice(); // auto-detects platform, fetches from HF CDN
await asr.load();

const { text, emotion, language } = await asr.transcribe(audioSamples);
```

Platform-Aware ASR
```ts
import { shouldUseNativeASR, SafariSpeechRecognition, createSenseVoice } from '@omote/core';

const asr = shouldUseNativeASR()
  ? new SafariSpeechRecognition({ language: 'en-US' })
  : createSenseVoice();
```

Voice Activity Detection (Silero VAD)
```ts
import { createSileroVAD } from '@omote/core';

const vad = createSileroVAD({
  threshold: 0.5,
  // useWorker: true  // force off-main-thread
  // useWorker: false // force main thread
});
await vad.load();

const { isSpeech, probability } = await vad.process(audioSamples);
```

Animation Graph
State machine for avatar animation states with emotion blending and audio energy.
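The examples below feed a 0–1 audio energy level into the graph. The bundled AudioEnergyAnalyzer handles this for you, but if you need a quick stand-in, a minimal RMS estimate can be hand-rolled (a sketch, not part of the SDK):

```ts
// Minimal RMS loudness estimate in [0, 1], suitable for graph.setAudioEnergy().
// Sketch only: the bundled AudioEnergyAnalyzer presumably adds smoothing.
function rmsEnergy(samples: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) {
    sum += samples[i] * samples[i];
  }
  return Math.min(1, Math.sqrt(sum / samples.length));
}
```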
```ts
import { AnimationGraph, AudioEnergyAnalyzer, EmphasisDetector } from '@omote/core';

const graph = new AnimationGraph();
graph.on('state.change', ({ from, to, trigger }) => {
  console.log(`${from} → ${to}`);
});
graph.on('output.update', (output) => applyToAvatar(output));

// State transitions
graph.trigger('user_speech_start'); // idle → listening
graph.trigger('transcript_ready');  // listening → thinking
graph.trigger('ai_audio_start');    // thinking → speaking
graph.trigger('ai_audio_end');      // speaking → idle

// Blend emotion and audio energy into the output
graph.setEmotion('happy', 0.8);
graph.setAudioEnergy(0.7);
graph.update(deltaTime); // call each frame
```

States: idle → listening → thinking → speaking → idle
Emotion Controller
```ts
import { EmotionController, EmotionPresets } from '@omote/core';

const controller = new EmotionController();
controller.setPreset('happy');
controller.transitionTo({ joy: 0.8 }, 500); // 500ms smooth transition

// In the animation loop
controller.update();
const current = controller.emotion;
```

Presets: neutral, happy, sad, angry, surprised, scared, disgusted, excited, tired, playful, pained, contemplative
Model Caching
IndexedDB-based caching with versioning, LRU eviction, and storage quota monitoring.
```ts
import { getModelCache, fetchWithCache, preloadModels, configureCacheLimit } from '@omote/core';

// Fetch with automatic caching
const data = await fetchWithCache('/models/model.onnx');

// Versioned caching for model updates
const versioned = await fetchWithCache('/models/model.onnx', {
  version: '1.0.0',
  validateStale: true,
});

// Cache quota monitoring
configureCacheLimit({
  maxSizeBytes: 500 * 1024 * 1024, // 500MB limit
  onQuotaWarning: (info) => console.warn(`Storage ${info.percentUsed}% used`),
});

// Cache stats
const cache = getModelCache();
const stats = await cache.getStats(); // { totalSize, modelCount, models }
```

Microphone Capture
```ts
import { MicrophoneCapture } from '@omote/core';

const mic = new MicrophoneCapture({
  sampleRate: 16000,
  bufferSize: 4096,
});

mic.on('audio', ({ samples }) => {
  // Process 16kHz Float32Array samples
});

await mic.start();
```

Logging
```ts
import { configureLogging, createLogger } from '@omote/core';

configureLogging({ level: 'debug', format: 'pretty' });

const logger = createLogger('MyModule');
logger.info('Model loaded', { backend: 'webgpu', loadTimeMs: 1234 });
```

Telemetry
OpenTelemetry-compatible tracing and metrics.
```ts
import { configureTelemetry, getTelemetry } from '@omote/core';

configureTelemetry({
  enabled: true,
  serviceName: 'my-app',
  exporter: 'console', // or 'otlp' for production
});

const telemetry = getTelemetry();
const span = telemetry.startSpan('custom-operation');
// ... do work
span.end();
```

Text-to-Speech (Kokoro TTS)
```ts
import { createKokoroTTS } from '@omote/core';

const tts = createKokoroTTS({ defaultVoice: 'af_heart' });
await tts.load();

const audio = await tts.synthesize('Hello world!');
// audio: Float32Array @ 24kHz
```

Kokoro auto-detects the platform: mixed-fp16 WebGPU model (156MB) on Chrome/Edge, q8 WASM model (92MB) on Safari/iOS/Firefox.
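Note that Kokoro emits Float32 samples at 24kHz, while the lip-sync pipeline examples earlier consume 16kHz PCM16 bytes. A naive conversion helper (an illustrative sketch, not part of the SDK; it uses simple linear-interpolation resampling, and a proper resampler will sound better):

```ts
// Downsample Float32 audio by linear interpolation and pack it as
// little-endian PCM16 bytes. Sketch only; no anti-aliasing filter.
function toPCM16(samples: Float32Array, fromRate: number, toRate: number): Uint8Array {
  const outLen = Math.floor((samples.length * toRate) / fromRate);
  const out = new Int16Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const pos = (i * fromRate) / toRate;        // fractional source index
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    const s = samples[i0] + (samples[i1] - samples[i0]) * (pos - i0);
    out[i] = Math.round(Math.max(-1, Math.min(1, s)) * 32767); // clamp, then scale
  }
  return new Uint8Array(out.buffer);
}
```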
Eager Load & Warmup
Use eagerLoad to preload models at construction time:
```ts
const tts = createKokoroTTS({ eagerLoad: true }); // starts loading immediately
```

Use warmup() to prime the AudioContext for the iOS/Safari autoplay policy. Call it from a user gesture handler:

```ts
button.onclick = async () => {
  await avatar.warmup(); // primes AudioContext
  await avatar.connectVoice({ ... });
};
```

Observability
The SDK includes built-in OpenTelemetry-compatible tracing and metrics:
```ts
import { configureTelemetry, getTelemetry, MetricNames } from '@omote/core';

configureTelemetry({
  enabled: true,
  serviceName: 'my-app',
  exporter: 'console', // or OTLPExporter for production
});
```

All inference calls, model loads, cache operations, and voice turns are automatically instrumented.
Models
All models default to the HuggingFace CDN and are auto-downloaded on first use. Self-host with configureModelUrls():
```ts
import { configureModelUrls } from '@omote/core';

configureModelUrls({
  lam: 'https://your-cdn.com/models/lam.onnx',
  lamData: 'https://your-cdn.com/models/lam.onnx.data',
  senseVoice: 'https://your-cdn.com/models/sensevoice.onnx',
  sileroVad: 'https://your-cdn.com/models/silero_vad.onnx',
});
```

| Model | HuggingFace Repo | Size |
|-------|------------------|------|
| LAM A2E | omote-ai/lam-a2e | lam.onnx (230KB) + lam.onnx.data (192MB) |
| SenseVoice | omote-ai/sensevoice-asr | 228MB |
| Silero VAD | deepghs/silero-vad-onnx | ~2MB |
| Kokoro TTS (WASM) | onnx-community/Kokoro-82M-v1.0-ONNX | 92MB q8 |
| Kokoro TTS (WebGPU) | omote-ai/kokoro-tts | 156MB mixed-fp16 |
Browser Compatibility
WebGPU-first with automatic WASM fallback.
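The fallback decision can also be made by hand. Below is a sketch of a WebGPU probe; this is an assumption about roughly what isWebGPUAvailable() checks, not the SDK's actual implementation:

```ts
// Manual WebGPU probe: true only if the WebGPU API exists and an adapter
// can actually be acquired (some browsers expose navigator.gpu but return
// no adapter). Resolves false in non-browser environments.
async function probeWebGPU(): Promise<boolean> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return false; // no WebGPU API at all (e.g. Safari iOS)
  try {
    const adapter = await gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```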
| Browser | WebGPU | WASM | Recommended |
|---------|--------|------|-------------|
| Chrome 113+ (Desktop) | Yes | Yes | WebGPU |
| Chrome 113+ (Android) | Yes | Yes | WebGPU |
| Edge 113+ | Yes | Yes | WebGPU |
| Firefox 130+ | Flag only | Yes | WASM |
| Safari 18+ (macOS) | Limited | Yes | WASM |
| Safari (iOS) | No | Yes | WASM |
```ts
import { isWebGPUAvailable } from '@omote/core';

const webgpu = await isWebGPUAvailable();
```

iOS Notes
All iOS browsers use WebKit under the hood. The SDK handles three platform constraints automatically:
- WASM binary selection — iOS crashes with the default JSEP/ASYNCIFY WASM binary. The SDK imports `onnxruntime-web/wasm` (non-JSEP) on iOS/Safari.
- A2E model routing — `createA2E()` routes all platforms through `A2EInference` via `UnifiedInferenceWorker`: WebGPU on Chrome/Edge, WASM on Safari/iOS/Firefox.
- Worker memory — Multiple Workers each load their own ORT WASM runtime, exceeding iOS tab memory (~1.5GB). The SDK defaults to main-thread inference on iOS.
Consumer requirement: COEP/COOP headers must be skipped for iOS to avoid triggering SharedArrayBuffer (which forces threaded WASM with 4GB shared memory — crashes iOS). Desktop should keep COEP/COOP for multi-threaded performance.
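One way to apply this conditionally is to branch on the User-Agent when choosing response headers. A framework-agnostic sketch (the function name and UA regex are illustrative; note that iPadOS can masquerade as macOS, so server-side UA sniffing is heuristic at best):

```ts
// Decide which cross-origin isolation headers to send for a request.
// iOS clients get none (SharedArrayBuffer would crash the tab); everyone
// else gets COOP/COEP to enable multi-threaded WASM.
function crossOriginHeaders(userAgent: string): Record<string, string> {
  const isIOS = /iPhone|iPad|iPod/.test(userAgent);
  if (isIOS) return {}; // skip COOP/COEP on iOS
  return {
    'Cross-Origin-Opener-Policy': 'same-origin',
    'Cross-Origin-Embedder-Policy': 'require-corp',
  };
}
```

Wire this into whatever server or middleware you use by setting the returned headers on every response that serves the app shell.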
| Feature | iOS Status | Notes |
|---------|------------|-------|
| Silero VAD | Works | 0.9ms latency |
| SenseVoice ASR | Works | WASM, ~200ms |
| A2E Lip Sync | Works | A2EInference (WASM) via createA2E(), ~45ms |
License
MIT
