@auditionhub/voice-detector

v0.1.4

Published 7 months ago

Real-time line completion detection for browser apps (Web Speech default, Vosk-WASM fallback)

craigdbaker

speech vad stt web-speech vosk browser react

Real-time line completion detection for spoken lines in the browser. Detect when a user has finished a line using pause duration, fuzzy suffix matching, or AI-powered semantic verification. Optimized for low latency with support for AssemblyAI streaming transcription and Gemini semantic analysis.

Install

npm install @auditionhub/voice-detector

🔒 Security Best Practices

Important: API keys should never be exposed to the browser! For production applications:

AssemblyAI: Generate temporary tokens server-side (see example below)
Gemini: Use geminiProxyUrl to proxy API calls through your server
See Remix Example for a complete secure implementation

// ✅ GOOD: Server generates tokens, client uses proxy
const detector = new LineDetector({
	assemblyAiToken: await fetchTokenFromServer(), // your own helper, not part of this package
	geminiProxyUrl: '/api/gemini-proxy',
})

// ❌ BAD: Direct API keys in browser
const insecureDetector = new LineDetector({
	assemblyAiApiKey: 'secret-key', // ⚠️ Exposed to browser!
	geminiApiKey: 'secret-key', // ⚠️ Exposed to browser!
})
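
The snippet above calls fetchTokenFromServer() without defining it. A minimal sketch of such a helper might look like the following. The helper name comes from the example, but the /api/assemblyai-token route, the { token } response shape, and the FetchLike type are assumptions made for illustration; they are not part of this package.

```typescript
// Minimal structural type so the helper can be exercised with a stub in tests.
type FetchLike = (
	url: string,
	init?: { method?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>

// Hypothetical helper: asks your own backend for a short-lived AssemblyAI token.
// The route and the `{ token: string }` payload are assumptions, not a documented contract.
async function fetchTokenFromServer(
	url = '/api/assemblyai-token',
	fetchImpl: FetchLike = fetch,
): Promise<string> {
	const res = await fetchImpl(url, { method: 'POST' })
	if (!res.ok) {
		throw new Error(`Token request failed with status ${res.status}`)
	}
	const data = (await res.json()) as { token?: unknown }
	if (typeof data.token !== 'string' || data.token.length === 0) {
		throw new Error('Token response did not contain a token')
	}
	return data.token
}
```

Because temporary tokens expire, it is probably safest to request a fresh one each time a detector is created rather than caching it on the client.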

Basic usage

Simple Web Speech (Default)

import { LineDetector } from '@auditionhub/voice-detector'

const detector = new LineDetector({
	pauseMs: 800,
	suffixWords: 3,
	useWebSpeech: true,
})

await detector.init()

detector.on('lineComplete', (ev) => {
	console.log('Line complete:', ev.reason, ev.elapsedMs)
})

await detector.startLine({ text: 'I think we should head back now.' })

With AssemblyAI + Gemini (Recommended)

import { LineDetector } from '@auditionhub/voice-detector'

const detector = new LineDetector({
	assemblyAiToken: 'your-temporary-token', // Generate server-side for browser
	geminiApiKey: 'your-gemini-key', // local testing only; in the browser use geminiProxyUrl instead
	semanticThreshold: 0.75,
	enableSemanticMatching: true,
	pauseMs: 800,
})

await detector.init()

detector.on('semanticMatch', (ev) => {
	console.log('Match probability:', ev.probability)
	console.log('Reason:', ev.reason)
})

detector.on('lineComplete', (ev) => {
	console.log('Line complete:', ev.reason, ev.elapsedMs)
	console.log('Semantic score:', ev.semanticProbability)
})

await detector.startLine({ text: 'To be or not to be, that is the question.' })

Configuration

Core Options

pauseMs: number (default 800) – silence duration to trigger completion
suffixWords: number (default 3) – trailing words to compare for suffix match
suffixMaxDist: number (default 2) – max edit distance for suffix match
baseWPS: number (default 2.7) – initial words/sec for timing estimate
maxMultiplier: number (default 2.0) – timeout multiplier cap
useWebSpeech: boolean (default true) – prefer Web Speech API locally
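
To make the options above more concrete, here is a rough sketch of how a fuzzy suffix match and a timeout estimate could be computed from them. This is not the library's actual implementation: the helper names (normalizeWords, editDistance, suffixMatches, timeoutMs) are hypothetical, the suffix comparison assumes a character-level Levenshtein distance over the last suffixWords words, and the timeout assumes expected speaking time (word count / baseWPS) scaled by maxMultiplier.

```typescript
// Lowercase, strip punctuation, and split into words.
function normalizeWords(text: string): string[] {
	return text
		.toLowerCase()
		.replace(/[^a-z0-9'\s]/g, ' ')
		.split(/\s+/)
		.filter((w) => w.length > 0)
}

// Classic Levenshtein edit distance between two strings (two-row DP).
function editDistance(a: string, b: string): number {
	let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j)
	for (let i = 1; i <= a.length; i++) {
		const curr: number[] = [i]
		for (let j = 1; j <= b.length; j++) {
			const cost = a[i - 1] === b[j - 1] ? 0 : 1
			curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
		}
		prev = curr
	}
	return prev[b.length]
}

// Assumed reading of suffixWords / suffixMaxDist: compare the last N words of
// the transcript with the last N words of the target line.
function suffixMatches(
	target: string,
	transcript: string,
	suffixWords = 3,
	suffixMaxDist = 2,
): boolean {
	const want = normalizeWords(target).slice(-suffixWords).join(' ')
	const got = normalizeWords(transcript).slice(-suffixWords).join(' ')
	if (want.length === 0 || got.length === 0) return false
	return editDistance(want, got) <= suffixMaxDist
}

// Assumed reading of baseWPS / maxMultiplier: expected speaking time for the
// line, scaled by the multiplier cap, in milliseconds.
function timeoutMs(target: string, baseWPS = 2.7, maxMultiplier = 2.0): number {
	const words = normalizeWords(target).length
	return Math.round((words / baseWPS) * 1000 * maxMultiplier)
}
```

Under these assumptions, a seven-word line such as 'I think we should head back now.' would get a timeout of roughly 5.2 seconds with the default values, and a transcript ending in "head back cow" would still count as a suffix match because it is one edit away from "head back now".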

AssemblyAI Options

assemblyAiToken: string (optional) – Temporary token for AssemblyAI real-time streaming (required for browser)
assemblyAiApiKey: string (optional) – API key for AssemblyAI (Node.js only, not for browser)

Important: For browser usage, you MUST use assemblyAiToken instead of assemblyAiApiKey. Generate tokens server-side:

// Server-side (Node.js)
import { AssemblyAI } from 'assemblyai'
const client = new AssemblyAI({ apiKey: 'your-api-key' })
const token = await client.streaming.createTemporaryToken({
	expires_in_seconds: 480,
})
// Return token to client

Get your API key: https://www.assemblyai.com

Gemini Options

geminiApiKey: string (optional) – API key for Gemini 2.5 Flash Lite semantic matching
geminiProxyUrl: string (optional) – Recommended – Server-side proxy URL for Gemini API calls (keeps API key secure)
semanticThreshold: number (default 0.75) – probability threshold for semantic completion (0-1)
enableSemanticMatching: boolean (default true) – enable semantic verification

Security Best Practice: Use geminiProxyUrl instead of geminiApiKey in browser applications to keep your API key secure on the server. See Remix Example for a complete implementation.

Get your key: https://aistudio.google.com
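
As a rough illustration of what a server-side proxy behind geminiProxyUrl has to do, the sketch below builds the upstream request to the Gemini REST API with the key attached on the server. The payload the library sends to the proxy is not documented here, so this sketch simply assumes the proxy has already extracted a prompt string; buildGeminiRequest and UpstreamRequest are hypothetical names, not part of this package.

```typescript
interface UpstreamRequest {
	url: string
	init: { method: 'POST'; headers: Record<string, string>; body: string }
}

// Builds the server-to-Gemini request. The key is attached here, on the server,
// and is sent as a header so it never appears in the URL or in client code.
function buildGeminiRequest(
	prompt: string,
	apiKey: string,
	model = 'gemini-2.5-flash-lite',
): UpstreamRequest {
	if (!apiKey) {
		throw new Error('Missing Gemini API key on the server')
	}
	return {
		url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`,
		init: {
			method: 'POST',
			headers: {
				'Content-Type': 'application/json',
				'x-goog-api-key': apiKey,
			},
			// Standard generateContent body: one user turn with one text part.
			body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
		},
	}
}
```

A route handler would then call something like `fetch(url, init)` with the result and relay the response to the browser. Reading the key from a server environment variable is one common way to keep it out of the client bundle.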

Legacy Options

remoteURL: string | null – optional remote STT (future)

Events

lineStart: { text } – Fired when a new line starts
lineComplete: { reason: 'pause' | 'suffix' | 'semantic', elapsedMs, transcript?, avgWPS?, semanticProbability?, semanticReason? } – Fired when the line is detected as complete
lineTimeout: { reason: 'timeout', elapsedMs, transcript?, avgWPS?, semanticProbability?, semanticReason? } – Fired if no completion is detected before the timeout
transcript: { text, isFinal } – Fired on each transcript update
semanticMatch: { probability, reason } – Fired when semantic match is calculated (Gemini only)
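
For TypeScript users, the payloads listed above can be written down as types. The field names below are transcribed from the event list, but the field types (number, string) are assumptions, and whether the package exports equivalent types is not stated here. describeOutcome is a hypothetical helper showing how the reason field can discriminate between the two outcomes.

```typescript
// Payload shapes transcribed from the event list; field types are assumed.
type CompletionReason = 'pause' | 'suffix' | 'semantic'

interface LineResultBase {
	elapsedMs: number
	transcript?: string
	avgWPS?: number
	semanticProbability?: number
	semanticReason?: string
}

interface LineCompleteEvent extends LineResultBase {
	reason: CompletionReason
}

interface LineTimeoutEvent extends LineResultBase {
	reason: 'timeout'
}

// Hypothetical helper that formats either outcome for logging.
function describeOutcome(ev: LineCompleteEvent | LineTimeoutEvent): string {
	const seconds = (ev.elapsedMs / 1000).toFixed(1)
	const score =
		ev.semanticProbability === undefined
			? ''
			: `, semantic ${Math.round(ev.semanticProbability * 100)}%`
	return ev.reason === 'timeout'
		? `timed out after ${seconds}s${score}`
		: `completed by ${ev.reason} after ${seconds}s${score}`
}
```

The same function could be passed to both detector.on('lineComplete', ...) and detector.on('lineTimeout', ...), assuming the handlers receive payloads of these shapes.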

Speech-to-Text Adapters

The library supports multiple STT backends with automatic fallback:

AssemblyAI (recommended): Real-time streaming transcription with WebSocket
- Best accuracy and latency
- Requires temporary token (for browser) or API key (for Node.js)
- Generate tokens server-side for browser security
- Automatic fallback if unavailable
Web Speech API: Browser-native STT (default fallback)
- No API key required
- Good for simple use cases
- Browser support varies
Vosk-WASM (offline, WIP): Local on-device transcription
- No internet required
- Scaffolding included; full integration pending

Semantic Verification

When geminiApiKey or geminiProxyUrl is provided, the library uses Gemini 2.5 Flash Lite to verify that the transcript matches the meaning of the target line:

Probability-based matching: Returns a 0-1 score for how well the transcript matches the target line
Configurable threshold: Set matching strictness with semanticThreshold (default 0.75)
Paraphrase-friendly: Understands synonyms and alternate phrasings
Low cost: $0.10/1M input tokens
Fast: ~180ms time-to-first-token

Examples

React: examples/react-basic – Complete demo with AssemblyAI + Gemini integration
Remix (client-only): examples/remix

How It Works

VAD Detection: Voice Activity Detection monitors audio for speech/silence boundaries
Transcription: AssemblyAI streams real-time transcription (or Web Speech as fallback)
Semantic Verification: Gemini compares meaning between spoken and target text
Completion: Line completes when:
- Pause detected (configurable duration), OR
- Suffix match found, OR
- Semantic probability crosses threshold (Gemini only)
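
The completion rule above can be sketched as a small pure function. This is an illustration under stated assumptions, not the library's code: the Signals and Thresholds types and the completionReason helper are hypothetical, the order in which the three conditions are checked is a guess, and "crosses threshold" is interpreted as greater than or equal to.

```typescript
type Reason = 'pause' | 'suffix' | 'semantic'

interface Signals {
	silenceMs: number // time since speech was last detected
	suffixMatched: boolean // result of the fuzzy suffix comparison
	semanticProbability?: number // 0-1 score, present only when Gemini is in use
}

interface Thresholds {
	pauseMs: number
	semanticThreshold: number
}

// Returns the first satisfied completion reason, or null if the line is
// still in progress. Defaults mirror the documented option defaults.
function completionReason(
	s: Signals,
	t: Thresholds = { pauseMs: 800, semanticThreshold: 0.75 },
): Reason | null {
	if (
		s.semanticProbability !== undefined &&
		s.semanticProbability >= t.semanticThreshold
	) {
		return 'semantic'
	}
	if (s.suffixMatched) return 'suffix'
	if (s.silenceMs >= t.pauseMs) return 'pause'
	return null
}
```

Keeping the decision in a pure function like this makes it easy to unit-test threshold changes without a microphone or network access.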

Roadmap

[x] AssemblyAI real-time streaming
[x] Gemini semantic verification
[ ] Vosk-WASM offline model + caching
[ ] Web Worker/AudioWorklet offload
[ ] Multi-language support
