voice-tool-call

v0.2.5

Published

3 months ago

Voice-to-tool-call browser library: wake word detection, speech-to-text, LLM intent interpretation, tool execution, and text-to-speech

Downloads

174

mr_nano

voice speech tool-calling wake-word speech-recognition tts text-to-speech web-speech-api browser kokoro gemini-nano

voice-tool-call

Voice-to-tool-call library for the browser. Wake word detection, speech-to-text, LLM intent interpretation, tool execution, and text-to-speech — all running locally.

Microphone → Wake Word → STT → LLM Intent → Tool Execution → TTS Response

Install

npm install voice-tool-call

# Optional: high-quality local TTS (82M param model, runs in-browser)
npm install kokoro-js

Quick Start

import { VoiceToolSystem } from 'voice-tool-call';

const system = new VoiceToolSystem({
  wakeWords: ['hey assistant'],
  tts: 'browser',       // or 'kokoro' for neural TTS
  autoDetect: true,      // auto-detect Chrome LanguageModel API + WebGPU
  autoSpeak: true,       // speak tool results via TTS
});

system.registerTool('setVolume', {
  description: 'Set the audio volume level',
  parameters: { level: 'number' },
  keywords: ['volume', 'louder', 'quieter', 'mute'],
  examples: [
    { input: 'turn it up', arguments: { level: 80 } },
    { input: 'mute', arguments: { level: 0 } },
  ],
  handler: ({ level }) => {
    document.querySelector('video').volume = level / 100;
    return `Volume set to ${level}%.`;
  },
});

system.on('transcript', (t) => console.log('Heard:', t.text));
system.on('executed', (r) => console.log('Result:', r));

system.start();

Say "Hey Assistant, turn it up" and the tool executes.

How It Works

1. Wake Word: Continuous listening via the Web Speech API. Detects your wake phrase, then captures the command.
2. Speech-to-Text: The Web Speech API converts voice to text.
3. Intent Interpretation: Matches the transcript to a registered tool using one of three modes:
   - local: Keyword matching (zero latency, works offline)
   - language-model: Chrome's built-in Gemini Nano (on-device, no API key)
   - api: Any OpenAI-compatible endpoint
4. Tool Execution: Calls the matched tool's handler with the parsed arguments.
5. TTS Response: Speaks the result back via browser speechSynthesis or Kokoro neural TTS.
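
To make the local mode concrete, here is a minimal sketch of keyword matching. It is not the library's actual matcher: the ToolSpec shape and the matchByKeywords function are hypothetical, and the scoring rule (count of keywords found in the transcript) is an assumption chosen for illustration.

```typescript
// Hypothetical shapes for illustration; the library's internal types may differ.
type ToolSpec = { description: string; keywords: string[] };
type Match = { tool: string; score: number } | null;

// Score each tool by how many of its keywords occur in the transcript,
// and return the best-scoring tool (or null when nothing matches).
function matchByKeywords(transcript: string, tools: Record<string, ToolSpec>): Match {
  const text = transcript.toLowerCase();
  let best: Match = null;
  for (const [tool, spec] of Object.entries(tools)) {
    const score = spec.keywords.filter((k) => text.includes(k.toLowerCase())).length;
    if (score > 0 && (best === null || score > best.score)) {
      best = { tool, score };
    }
  }
  return best;
}

const sampleTools: Record<string, ToolSpec> = {
  setVolume: { description: 'Set the audio volume level', keywords: ['volume', 'louder', 'quieter', 'mute'] },
  navigate: { description: 'Navigate to a page', keywords: ['go to', 'navigate', 'open'] },
};

console.log(matchByKeywords('make the volume louder', sampleTools));
console.log(matchByKeywords('what time is it', sampleTools));
```

A matcher like this cannot extract arguments such as the volume level, which is presumably where the examples field and the LLM-backed modes come in.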

Configuration

const system = new VoiceToolSystem({
  // Wake word
  wakeWords: ['hey assistant', 'hey computer'],
  commandTimeout: 10000,
  lang: 'en-US',

  // Intent interpretation
  intent: 'local',         // 'local' | 'language-model' | 'api' | 'llama-cpp' (Node/Bun, see below)
  apiUrl: '',              // for intent: 'api'
  apiKey: '',              // for intent: 'api'

  // Text-to-speech
  tts: 'browser',          // 'browser' | 'kokoro'
  kokoro: {
    voice: 'af_heart',
    dtype: 'q4',           // 'fp32' | 'fp16' | 'q8' | 'q4'
    device: 'wasm',        // 'wasm' | 'webgpu' (auto-detected)
  },

  // Behavior
  autoDetect: true,
  autoSpeak: true,
});

Registering Tools

system.registerTool('toolName', {
  description: 'What this tool does',
  parameters: { param1: 'string', param2: 'number' },

  // For local keyword matching
  keywords: ['keyword1', 'keyword2'],

  // For LLM few-shot matching
  examples: [
    { input: 'user might say this', arguments: { param1: 'value', param2: 42 } },
  ],

  // Return a string to auto-speak it
  handler: ({ param1, param2 }) => {
    return `Done with ${param1}.`;
  },
});
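
Because arguments may come from an LLM, it can be worth checking them against the declared parameters before a handler runs. The sketch below assumes the parameter map uses only the 'string' and 'number' type names shown above; checkArguments is a hypothetical helper, not part of the library.

```typescript
// Hypothetical helper for illustration; not part of the library's documented API.
type ParamType = 'string' | 'number';
type ParamSpec = Record<string, ParamType>;

// Return a list of problems; an empty list means the arguments fit the spec.
function checkArguments(spec: ParamSpec, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [name, expected] of Object.entries(spec)) {
    if (!(name in args)) {
      problems.push(`missing "${name}"`);
    } else if (typeof args[name] !== expected) {
      problems.push(`"${name}" should be ${expected}, got ${typeof args[name]}`);
    }
  }
  return problems;
}

console.log(checkArguments({ level: 'number' }, { level: 80 }));
console.log(checkArguments({ level: 'number' }, { level: '80' }));
```

A handler could call a check like this first and return a spoken error message when the list is not empty.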

Multi-Tool Calls

The LLM interpreter can return multiple tool calls from a single command:

"Make the background blue and play a sound"
→ [setBackgroundColor({ color: "#2563eb" }), playSound({ frequency: 440 })]
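
As a rough illustration of what executing such a list involves, the sketch below runs each call in order and collects the results. The ToolCall shape, the synchronous Handler type, and runToolCalls are assumptions made for this example; the library's executor may order, batch, or report results differently.

```typescript
// Hypothetical shapes for illustration; the library's own types may differ.
type ToolCall = { tool: string; arguments: Record<string, unknown> };
type Handler = (args: Record<string, unknown>) => string;

// Run each call in order; an unknown tool yields a message instead of throwing.
function runToolCalls(calls: ToolCall[], handlers: Record<string, Handler>): string[] {
  return calls.map((call) => {
    const known = Object.prototype.hasOwnProperty.call(handlers, call.tool);
    return known ? handlers[call.tool](call.arguments) : `No tool named "${call.tool}".`;
  });
}

const demoHandlers: Record<string, Handler> = {
  setBackgroundColor: ({ color }) => `Background set to ${String(color)}.`,
  playSound: ({ frequency }) => `Playing ${String(frequency)} Hz.`,
};

const demoResults = runToolCalls(
  [
    { tool: 'setBackgroundColor', arguments: { color: '#2563eb' } },
    { tool: 'playSound', arguments: { frequency: 440 } },
  ],
  demoHandlers,
);
console.log(demoResults);
```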

Conversation Memory

The LLM interpreter remembers recent commands, so corrections work:

"Play a tone at 440 hertz for 500 milliseconds"  → playSound(440, 500)
"I said 500 seconds not milliseconds"            → playSound(440, 500000)
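
One plausible way to implement this kind of memory is a bounded buffer of recent turns that is replayed into the prompt. The sketch below is a guess at the general technique, not the library's implementation; CommandHistory, its limit, and the prompt format are all hypothetical.

```typescript
// Hypothetical sketch of a bounded command history; names and format are assumptions.
type Turn = { input: string; toolCall: string };

class CommandHistory {
  private turns: Turn[] = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Append a turn and drop the oldest one once the limit is exceeded.
  add(turn: Turn): void {
    this.turns.push(turn);
    if (this.turns.length > this.limit) this.turns.shift();
  }

  // Render the remembered turns as text that can be prepended to the next prompt.
  toPrompt(): string {
    return this.turns.map((t) => `User: ${t.input}\nCall: ${t.toolCall}`).join('\n');
  }
}

const history = new CommandHistory(5);
history.add({ input: 'Play a tone at 440 hertz for 500 milliseconds', toolCall: 'playSound(440, 500)' });
console.log(history.toPrompt());
```

With the previous call visible in the prompt, a follow-up such as "I said 500 seconds" gives the model enough to reissue the call with a corrected duration.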

Chat Fallback

system.registerTool('chat', {
  description: 'Respond conversationally when no other tool matches',
  parameters: { message: 'string' },
  examples: [
    { input: 'hello', arguments: { message: 'Hello! How can I help?' } },
  ],
  handler: ({ message }) => message,
});

Events

system.on('wakeword', ({ state }) => {});        // 'idle' | 'listening' | 'activated'
system.on('transcript', ({ text }) => {});        // STT result
system.on('intent', (toolCall) => {});            // parsed tool call(s)
system.on('executed', (results) => {});           // execution results
system.on('response', ({ text }) => {});          // spoken response text
system.on('tts:status', ({ status }) => {});      // 'generating' | 'speaking' | 'done'
system.on('tts:mode', ({ mode }) => {});          // TTS mode changed
system.on('intent:mode', ({ mode }) => {});       // intent mode changed
system.on('loading', ({ module, status }) => {}); // model loading progress
system.on('scene', ({ scene }) => {});            // scene changed
system.on('error', ({ error, source }) => {});    // errors
system.on('state', ({ running }) => {});          // system start/stop
system.on('ready', ({ capabilities }) => {});     // detected capabilities

Methods

// Lifecycle
system.start();                       // Start wake word + auto-detect
system.stop();                        // Stop listening
system.destroy();                     // Stop + cleanup
system.isRunning();                   // Check if running

// Input
system.processText('do something');   // Process text directly (skip voice)
system.pushToTalk();                  // One-shot voice capture (no wake word)

// Tool management
system.registerTool('name', {...});   // Register a tool (see above)
system.registerTools({ a: {...} });   // Register multiple tools
system.registerGlobalTool('name', {...}); // Register tool that persists across scenes
system.unregisterTool('name');        // Remove a tool
system.clearTools();                  // Remove all tools
system.clearTools({ keepGlobal: true }); // Remove non-global tools only
system.getToolDefinitions();          // List registered tools

// TTS
system.speak('Hello');                // Speak via current TTS engine
system.stopSpeaking();
system.preloadKokoro();               // Pre-download Kokoro model

// Context (dynamic state passed to LLM)
system.setContext({ key: 'value' });
system.updateContext({ key: 'v2' });
system.getContext();

// Runtime config
system.setIntentMode('language-model');
system.getIntentMode();
system.setTTSMode('kokoro');
system.getTTSMode();
system.getCapabilities();

// Node-only async loaders (single import, dynamic loading)
const { warmUpWhisper, transcribeFile } = await loadWhisper();
const { createLlamaCppInterpreter } = await loadLlamaCpp();

Scenes

Scenes let you swap tool sets dynamically based on application state — like pages, modes, or steps in a workflow. Global tools persist across all scenes.

// Define scenes with their own tools and context
system.defineScene('dashboard', {
  context: { currentPage: 'dashboard' },
  onEnter: () => console.log('Entered dashboard'),
  onExit: () => console.log('Left dashboard'),
  tools: {
    viewChart: {
      description: 'View a dashboard chart',
      parameters: { chart: 'string' },
      keywords: ['chart', 'view', 'show'],
      examples: [{ input: 'show revenue chart', arguments: { chart: 'revenue' } }],
      handler: ({ chart }) => `Showing ${chart} chart.`,
    },
    exportReport: {
      description: 'Export a report',
      parameters: { format: 'string' },
      keywords: ['export', 'download', 'report'],
      examples: [{ input: 'export as pdf', arguments: { format: 'pdf' } }],
      handler: ({ format }) => `Exported as ${format}.`,
    },
  },
});

system.defineScene('player', {
  context: { currentPage: 'player', nowPlaying: null },
  tools: {
    play: {
      description: 'Play a song',
      parameters: { query: 'string' },
      keywords: ['play', 'listen'],
      examples: [{ input: 'play some jazz', arguments: { query: 'jazz' } }],
      handler: ({ query }) => `Playing "${query}".`,
    },
  },
});

// Global tools persist across all scenes
system.registerGlobalTool('navigate', {
  description: 'Navigate to a page',
  parameters: { page: 'string' },
  keywords: ['go to', 'navigate', 'open'],
  examples: [{ input: 'go to player', arguments: { page: 'player' } }],
  handler: ({ page }) => { system.setScene(page); return `Navigated to ${page}.`; },
});

// Switch scenes — tools swap, global tools stay
system.setScene('dashboard');  // viewChart + exportReport + navigate
system.setScene('player');     // play + navigate

// Query scene state
system.getScene();   // 'player'
system.getScenes();  // ['dashboard', 'player']
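
The tool-swapping behavior can be pictured as a merge of the active scene's tools with the global tools. The following is a simplified model of that idea, assuming a hypothetical SceneRegistry class; it omits context, onEnter/onExit hooks, and handlers, and is not the library's implementation.

```typescript
// Simplified, hypothetical model of scene-based tool sets.
type Tool = { description: string };

class SceneRegistry {
  private globalTools: Record<string, Tool> = {};
  private scenes: Record<string, Record<string, Tool>> = {};
  private current: string | null = null;

  registerGlobalTool(name: string, tool: Tool): void {
    this.globalTools[name] = tool;
  }

  defineScene(name: string, tools: Record<string, Tool>): void {
    this.scenes[name] = tools;
  }

  setScene(name: string): void {
    if (!(name in this.scenes)) throw new Error(`Unknown scene: ${name}`);
    this.current = name;
  }

  // Tools visible right now: the active scene's tools, then the global ones.
  activeToolNames(): string[] {
    const sceneTools = this.current === null ? {} : this.scenes[this.current];
    return Object.keys({ ...sceneTools, ...this.globalTools });
  }
}

const registry = new SceneRegistry();
registry.defineScene('dashboard', {
  viewChart: { description: 'View a dashboard chart' },
  exportReport: { description: 'Export a report' },
});
registry.defineScene('player', { play: { description: 'Play a song' } });
registry.registerGlobalTool('navigate', { description: 'Navigate to a page' });

registry.setScene('dashboard');
console.log(registry.activeToolNames());
registry.setScene('player');
console.log(registry.activeToolNames());
```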

Context

Pass dynamic application state to the LLM for smarter intent resolution:

system.setContext({
  cameras: [
    { id: 'cam1', name: 'lobby' },
    { id: 'cam2', name: 'parking' },
  ],
  activeCamera: 'cam1',
});

"Switch to parking" resolves to cam2 because the LLM sees the camera list in context. Context can also be set per scene via defineScene('name', { context: {...} }).
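
How the context reaches the model is not documented here. A common approach, sketched below under that assumption, is to serialize the state as JSON and place it in the prompt ahead of the transcript. The buildPrompt function and the prompt wording are hypothetical.

```typescript
// Hypothetical prompt builder; the library's real prompt format is not documented here.
function buildPrompt(context: Record<string, unknown>, transcript: string): string {
  return [
    'Application state (JSON):',
    JSON.stringify(context, null, 2),
    `User said: "${transcript}"`,
  ].join('\n');
}

const cameraContext = {
  cameras: [
    { id: 'cam1', name: 'lobby' },
    { id: 'cam2', name: 'parking' },
  ],
  activeCamera: 'cam1',
};

console.log(buildPrompt(cameraContext, 'switch to parking'));
```

Since both the id and the name of each camera appear in the prompt, the model can map the spoken name "parking" to the id a tool expects.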

Capability Detection

import { detectDetailedCapabilities, requestMicrophoneAccess } from 'voice-tool-call';

const caps = await detectDetailedCapabilities();
// caps.speechRecognition.status  — 'available' | 'unsupported-browser'
// caps.languageModel.status      — 'available' | 'needs-flags' | 'downloadable'
// caps.languageModel.instructions — how to enable (if not available)
// caps.microphone.status         — 'granted' | 'denied' | 'prompt'

const granted = await requestMicrophoneAccess();
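
One way to act on these statuses is to derive a startup plan from them. The sketch below works on a plain object with the status fields listed above, so it runs without a browser; the Caps type is inferred from those comments, and planStartup and its notes are hypothetical.

```typescript
// Caps mirrors the status fields listed above; the real return type may carry more.
type Caps = {
  speechRecognition: { status: 'available' | 'unsupported-browser' };
  languageModel: { status: 'available' | 'needs-flags' | 'downloadable'; instructions?: string };
  microphone: { status: 'granted' | 'denied' | 'prompt' };
};

type Plan = { intent: 'local' | 'language-model'; voiceInput: boolean; notes: string[] };

// Pick an intent mode and decide whether voice input is usable.
function planStartup(caps: Caps): Plan {
  const notes: string[] = [];
  const voiceInput =
    caps.speechRecognition.status === 'available' && caps.microphone.status !== 'denied';
  if (!voiceInput) notes.push('Voice input unavailable; fall back to text input.');
  const intent = caps.languageModel.status === 'available' ? 'language-model' : 'local';
  if (intent === 'local' && caps.languageModel.instructions) {
    notes.push(caps.languageModel.instructions);
  }
  return { intent, voiceInput, notes };
}

console.log(
  planStartup({
    speechRecognition: { status: 'available' },
    languageModel: { status: 'needs-flags', instructions: 'Enable the Chrome flags.' },
    microphone: { status: 'prompt' },
  }),
);
```

When voiceInput is false, system.processText can still drive the same tools from typed input.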

Enabling Chrome LanguageModel API (Gemini Nano)

For on-device AI intent matching with no API key:

1. Use Chrome 131 or later.
2. Enable chrome://flags/#optimization-guide-on-device-model
3. Enable chrome://flags/#prompt-api-for-gemini-nano
4. Restart Chrome (the model downloads once, about 1.7GB).

With autoDetect enabled, the library detects the LanguageModel API and switches intent interpretation from keyword matching to the on-device model.

Node.js / Bun

The library works server-side with a local LLM (Metal/CUDA accelerated) instead of Chrome's LanguageModel API.

npm install voice-tool-call node-llama-cpp

import { VoiceToolSystem } from 'voice-tool-call';

const system = new VoiceToolSystem({
  intent: 'llama-cpp',        // Local LLM with Metal/CUDA
  autoSpeak: false,            // No speaker in Node
  autoDetect: false,
  llamaCpp: {
    gpuLayers: -1,             // Offload all layers to GPU
    // model: 'path/to/custom.gguf',  // Optional custom model
  },
});

system.registerTool('deploy', {
  description: 'Deploy the application',
  parameters: { env: 'string' },
  keywords: ['deploy', 'ship'],
  examples: [{ input: 'deploy to staging', arguments: { env: 'staging' } }],
  handler: ({ env }) => `Deployed to ${env}`,
});

await system.start();  // Downloads Qwen2.5-0.5B (~400MB) on first run
const results = await system.processText('deploy to production');

Node-only imports

Node-specific modules (Whisper STT, llama-cpp) are in a separate entry point to keep the browser bundle clean:

// Browser-safe (main entry)
import { VoiceToolSystem } from 'voice-tool-call';

// Node-only (Whisper, llama-cpp, mic recording)
import { warmUpWhisper, transcribeFile, createLlamaCppInterpreter } from 'voice-tool-call/node';

Server with voice UI

The Node server example streams mic audio from a browser and processes everything server-side (Whisper STT + LLM + Kokoro TTS):

bun run demo:server
# Opens browser → mic streams to server → Whisper → LLM → tools → Kokoro → audio back

See examples/node/server.ts for the full implementation.

Advanced: Individual Modules

Import individual pieces for custom pipelines:

// Browser
import {
  WakeWordListener,
  listenForCommand,
  createLocalInterpreter,
  createLanguageModelInterpreter,
  ToolExecutor,
  BrowserTTS,
  KokoroTTSEngine,
  TTSManager,
} from 'voice-tool-call';

// Node/Bun
import {
  createLlamaCppInterpreter,
  warmUpWhisper,
  transcribeFile,
} from 'voice-tool-call/node';

Examples

| Example | Command | Description |
|---|---|---|
| React (browser) | bun dev | Full browser demo with wake word, Kokoro TTS, Chrome AI |
| Node server | bun run demo:server | Server-side voice pipeline with browser mic UI |
| Node CLI | bun run demo:cli | Interactive text REPL with OS tools |

Platform Support

Browser

| Feature | Chrome | Edge | Firefox | Safari |
|---|---|---|---|---|
| Speech Recognition | ✓ | ✓ | ✗ | ✗ |
| LanguageModel API | 131+ | ✗ | ✗ | ✗ |
| WebGPU (Kokoro accel) | 113+ | 113+ | Nightly | Preview |
| Speech Synthesis | ✓ | ✓ | ✓ | ✓ |
| Kokoro TTS (WASM) | ✓ | ✓ | ✓ | ✓ |

Node.js / Bun

| Feature | Support |
|---|---|
| LLM (node-llama-cpp) | Metal (macOS), CUDA (Linux/Windows), Vulkan, CPU |
| Whisper STT | Via @huggingface/transformers |
| Kokoro TTS | CPU (via onnxruntime-node) |

Best experience: Chrome 131+ with LanguageModel flags enabled.

License

MIT
