voice-tool-call

v0.2.5

Voice-to-tool-call library for the browser. Wake word detection, speech-to-text, LLM intent interpretation, tool execution, and text-to-speech — all running locally.

Microphone → Wake Word → STT → LLM Intent → Tool Execution → TTS Response

Install

npm install voice-tool-call

# Optional: high-quality local TTS (82M param model, runs in-browser)
npm install kokoro-js

Quick Start

import { VoiceToolSystem } from 'voice-tool-call';

const system = new VoiceToolSystem({
  wakeWords: ['hey assistant'],
  tts: 'browser',       // or 'kokoro' for neural TTS
  autoDetect: true,      // auto-detect Chrome LanguageModel API + WebGPU
  autoSpeak: true,       // speak tool results via TTS
});

system.registerTool('setVolume', {
  description: 'Set the audio volume level',
  parameters: { level: 'number' },
  keywords: ['volume', 'louder', 'quieter', 'mute'],
  examples: [
    { input: 'turn it up', arguments: { level: 80 } },
    { input: 'mute', arguments: { level: 0 } },
  ],
  handler: ({ level }) => {
    document.querySelector('video').volume = level / 100;
    return `Volume set to ${level}%.`;
  },
});

system.on('transcript', (t) => console.log('Heard:', t.text));
system.on('executed', (r) => console.log('Result:', r));

system.start();

Say "Hey Assistant, turn it up" and the tool executes.

How It Works

  1. Wake Word — Continuous listening via Web Speech API. Detects your wake phrase, then captures the command.
  2. Speech-to-Text — Web Speech API converts voice to text.
  3. Intent Interpretation — Matches the transcript to a registered tool:
    • local — Keyword matching (zero latency, works offline)
    • language-model — Chrome's built-in Gemini Nano (on-device, no API key)
    • api — Any OpenAI-compatible endpoint
  4. Tool Execution — Calls the matched tool's handler with parsed arguments.
  5. TTS Response — Speaks the result back via browser speechSynthesis or Kokoro neural TTS.
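
To make the local mode in step 3 concrete, here is a minimal sketch of keyword-based intent matching. It is an illustration only, not the library's actual implementation: the `matchTool` function and its scoring rule (count how many keywords appear in the transcript) are assumptions.

```javascript
// Hypothetical keyword matcher: picks the tool whose keywords appear
// most often in the transcript. Not the library's real algorithm.
function matchTool(transcript, tools) {
  const text = transcript.toLowerCase();
  let best = null;
  let bestScore = 0;
  for (const [name, tool] of Object.entries(tools)) {
    // Count how many of this tool's keywords occur as substrings.
    const score = (tool.keywords ?? []).filter((k) =>
      text.includes(k.toLowerCase()),
    ).length;
    if (score > bestScore) {
      best = name;
      bestScore = score;
    }
  }
  return best; // null when no keyword matched
}

const tools = {
  setVolume: { keywords: ['volume', 'louder', 'quieter', 'mute'] },
  play: { keywords: ['play', 'listen'] },
};

console.log(matchTool('Please mute the volume', tools)); // setVolume
console.log(matchTool('hello there', tools)); // null
```

Plain substring matching is cheap and works offline, but it cannot extract arguments or handle phrasing that shares no keyword with a tool, which is where the LLM-based modes come in.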

Configuration

const system = new VoiceToolSystem({
  // Wake word
  wakeWords: ['hey assistant', 'hey computer'],
  commandTimeout: 10000,
  lang: 'en-US',

  // Intent interpretation
  intent: 'local',         // 'local' | 'language-model' | 'api' ('llama-cpp' under Node.js / Bun)
  apiUrl: '',              // for intent: 'api'
  apiKey: '',              // for intent: 'api'

  // Text-to-speech
  tts: 'browser',          // 'browser' | 'kokoro'
  kokoro: {
    voice: 'af_heart',
    dtype: 'q4',           // 'fp32' | 'fp16' | 'q8' | 'q4'
    device: 'wasm',        // 'wasm' | 'webgpu' (auto-detected)
  },

  // Behavior
  autoDetect: true,
  autoSpeak: true,
});

Registering Tools

system.registerTool('toolName', {
  description: 'What this tool does',
  parameters: { param1: 'string', param2: 'number' },

  // For local keyword matching
  keywords: ['keyword1', 'keyword2'],

  // For LLM few-shot matching
  examples: [
    { input: 'user might say this', arguments: { param1: 'value', param2: 42 } },
  ],

  // Return a string to auto-speak it
  handler: ({ param1, param2 }) => {
    return `Done with ${param1}.`;
  },
});

Multi-Tool Calls

The LLM interpreter can return multiple tool calls from a single command:

"Make the background blue and play a sound"
→ [setBackgroundColor({ color: "#2563eb" }), playSound({ frequency: 440 })]
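
A rough sketch of how a list of tool calls like the one above could be executed in order. The `executeToolCalls` helper and the `{ name, arguments }` call shape are assumptions made for illustration, and handlers are assumed to be synchronous, as in the examples in this README.

```javascript
// Hypothetical executor: runs each tool call in order and collects
// either a result or an error per call.
function executeToolCalls(calls, handlers) {
  const results = [];
  for (const { name, arguments: args } of calls) {
    const handler = handlers[name];
    if (!handler) {
      results.push({ name, error: `Unknown tool: ${name}` });
      continue;
    }
    try {
      results.push({ name, result: handler(args) });
    } catch (err) {
      // One failing tool should not prevent the remaining calls from running.
      results.push({ name, error: String(err) });
    }
  }
  return results;
}

const handlers = {
  setBackgroundColor: ({ color }) => `Background set to ${color}.`,
  playSound: ({ frequency }) => `Playing ${frequency} Hz.`,
};

const results = executeToolCalls(
  [
    { name: 'setBackgroundColor', arguments: { color: '#2563eb' } },
    { name: 'playSound', arguments: { frequency: 440 } },
  ],
  handlers,
);
console.log(results.map((r) => r.result));
// [ 'Background set to #2563eb.', 'Playing 440 Hz.' ]
```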

Conversation Memory

The LLM interpreter remembers recent commands, so corrections work:

"Play a tone at 440 hertz for 500 milliseconds"  → playSound(440, 500)
"I said 500 seconds not milliseconds"            → playSound(440, 500000)
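
One plausible way to give an interpreter this kind of memory is a small rolling history that is rendered into the prompt. The `CommandHistory` class below is a hypothetical sketch; the README does not describe how the library stores or limits its history.

```javascript
// Hypothetical rolling history of recent commands and the tool calls
// they produced. The default limit of 5 is an arbitrary choice.
class CommandHistory {
  constructor(limit = 5) {
    this.limit = limit;
    this.entries = [];
  }

  add(transcript, toolCalls) {
    this.entries.push({ transcript, toolCalls });
    // Drop the oldest entry once the limit is exceeded.
    if (this.entries.length > this.limit) this.entries.shift();
  }

  // Render the history as plain text that could be prepended to a prompt.
  toPrompt() {
    return this.entries
      .map((e) => `User: ${e.transcript}\nCalls: ${JSON.stringify(e.toolCalls)}`)
      .join('\n');
  }
}

const history = new CommandHistory(2);
history.add('Play a tone at 440 hertz for 500 milliseconds', [
  { name: 'playSound', arguments: { frequency: 440, duration: 500 } },
]);
console.log(history.toPrompt());
```

With the previous call visible in the prompt, a follow-up such as "I said 500 seconds" has something to correct against.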

Chat Fallback

Register a chat tool to handle anything that doesn't match:

system.registerTool('chat', {
  description: 'Respond conversationally when no other tool matches',
  parameters: { message: 'string' },
  examples: [
    { input: 'hello', arguments: { message: 'Hello! How can I help?' } },
  ],
  handler: ({ message }) => message,
});

Events

system.on('wakeword', ({ state }) => {});        // 'idle' | 'listening' | 'activated'
system.on('transcript', ({ text }) => {});        // STT result
system.on('intent', (toolCall) => {});            // parsed tool call(s)
system.on('executed', (results) => {});           // execution results
system.on('response', ({ text }) => {});          // spoken response text
system.on('tts:status', ({ status }) => {});      // 'generating' | 'speaking' | 'done'
system.on('tts:mode', ({ mode }) => {});          // TTS mode changed
system.on('intent:mode', ({ mode }) => {});       // intent mode changed
system.on('loading', ({ module, status }) => {}); // model loading progress
system.on('scene', ({ scene }) => {});            // scene changed
system.on('error', ({ error, source }) => {});    // errors
system.on('state', ({ running }) => {});          // system start/stop
system.on('ready', ({ capabilities }) => {});     // detected capabilities
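
As a usage sketch, a few of these events can be combined into a simple status log. `attachStatusLog` is a hypothetical helper, not part of the library; it only relies on `system.on(event, handler)` and the payload fields listed above.

```javascript
// Hypothetical helper: forwards readable status lines to `log`.
// `system` is any object exposing on(event, handler).
function attachStatusLog(system, log = console.log) {
  system.on('wakeword', ({ state }) => log(`wake word: ${state}`));
  system.on('transcript', ({ text }) => log(`heard: ${text}`));
  system.on('tts:status', ({ status }) => log(`tts: ${status}`));
  system.on('error', ({ error, source }) => log(`error from ${source}: ${error}`));
}
```

Calling `attachStatusLog(system)` before `system.start()` would ensure that early events are not missed.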

Methods

// Lifecycle
system.start();                       // Start wake word + auto-detect
system.stop();                        // Stop listening
system.destroy();                     // Stop + cleanup
system.isRunning();                   // Check if running

// Input
system.processText('do something');   // Process text directly (skip voice)
system.pushToTalk();                  // One-shot voice capture (no wake word)

// Tool management
system.registerTool('name', {...});   // Register a tool (see above)
system.registerTools({ a: {...} });   // Register multiple tools
system.registerGlobalTool('name', {...}); // Register tool that persists across scenes
system.unregisterTool('name');        // Remove a tool
system.clearTools();                  // Remove all tools
system.clearTools({ keepGlobal: true }); // Remove non-global tools only
system.getToolDefinitions();          // List registered tools

// TTS
system.speak('Hello');                // Speak via current TTS engine
system.stopSpeaking();
system.preloadKokoro();               // Pre-download Kokoro model

// Context (dynamic state passed to LLM)
system.setContext({ key: 'value' });
system.updateContext({ key: 'v2' });
system.getContext();

// Runtime config
system.setIntentMode('language-model');
system.getIntentMode();
system.setTTSMode('kokoro');
system.getTTSMode();
system.getCapabilities();

// Node-only async loaders (single import, dynamic loading)
const { warmUpWhisper, transcribeFile } = await loadWhisper();
const { createLlamaCppInterpreter } = await loadLlamaCpp();

Scenes

Scenes let you swap tool sets dynamically based on application state — like pages, modes, or steps in a workflow. Global tools persist across all scenes.

// Define scenes with their own tools and context
system.defineScene('dashboard', {
  context: { currentPage: 'dashboard' },
  onEnter: () => console.log('Entered dashboard'),
  onExit: () => console.log('Left dashboard'),
  tools: {
    viewChart: {
      description: 'View a dashboard chart',
      parameters: { chart: 'string' },
      keywords: ['chart', 'view', 'show'],
      examples: [{ input: 'show revenue chart', arguments: { chart: 'revenue' } }],
      handler: ({ chart }) => `Showing ${chart} chart.`,
    },
    exportReport: {
      description: 'Export a report',
      parameters: { format: 'string' },
      keywords: ['export', 'download', 'report'],
      examples: [{ input: 'export as pdf', arguments: { format: 'pdf' } }],
      handler: ({ format }) => `Exported as ${format}.`,
    },
  },
});

system.defineScene('player', {
  context: { currentPage: 'player', nowPlaying: null },
  tools: {
    play: {
      description: 'Play a song',
      parameters: { query: 'string' },
      keywords: ['play', 'listen'],
      examples: [{ input: 'play some jazz', arguments: { query: 'jazz' } }],
      handler: ({ query }) => `Playing "${query}".`,
    },
  },
});

// Global tools persist across all scenes
system.registerGlobalTool('navigate', {
  description: 'Navigate to a page',
  parameters: { page: 'string' },
  keywords: ['go to', 'navigate', 'open'],
  examples: [{ input: 'go to player', arguments: { page: 'player' } }],
  handler: ({ page }) => { system.setScene(page); return `Navigated to ${page}.`; },
});

// Switch scenes — tools swap, global tools stay
system.setScene('dashboard');  // viewChart + exportReport + navigate
system.setScene('player');     // play + navigate

// Query scene state
system.getScene();   // 'player'
system.getScenes();  // ['dashboard', 'player']
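
The swapping behavior can be pictured with a toy model. The `SceneRegistry` class below is a hypothetical sketch that borrows the method names from this README; it is not the library's implementation and omits context handling.

```javascript
// Toy model of scene switching: the active tool set is the current
// scene's tools plus the global tools.
class SceneRegistry {
  constructor() {
    this.globalTools = {};
    this.scenes = {};
    this.current = null;
  }

  registerGlobalTool(name, tool) {
    this.globalTools[name] = tool;
  }

  defineScene(name, { tools = {}, onEnter, onExit } = {}) {
    this.scenes[name] = { tools, onEnter, onExit };
  }

  setScene(name) {
    if (!(name in this.scenes)) throw new Error(`Unknown scene: ${name}`);
    // Leave the previous scene (if any) before entering the new one.
    this.scenes[this.current]?.onExit?.();
    this.current = name;
    this.scenes[name].onEnter?.();
  }

  activeToolNames() {
    const sceneTools = this.current ? this.scenes[this.current].tools : {};
    return Object.keys({ ...sceneTools, ...this.globalTools });
  }
}

const registry = new SceneRegistry();
registry.registerGlobalTool('navigate', {});
registry.defineScene('dashboard', { tools: { viewChart: {}, exportReport: {} } });
registry.defineScene('player', { tools: { play: {} } });

registry.setScene('dashboard');
console.log(registry.activeToolNames()); // [ 'viewChart', 'exportReport', 'navigate' ]
registry.setScene('player');
console.log(registry.activeToolNames()); // [ 'play', 'navigate' ]
```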

Context

Pass dynamic application state to the LLM for smarter intent resolution:

system.setContext({
  cameras: [
    { id: 'cam1', name: 'lobby' },
    { id: 'cam2', name: 'parking' },
  ],
  activeCamera: 'cam1',
});

"Switch to parking" resolves to cam2 because the LLM sees the camera list in context. Context can also be set per scene via defineScene({ context: {...} }).
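
How the context reaches the model is not documented here. One simple approach, shown purely as an assumption, is to serialize it as JSON into the prompt next to the user's command. `buildPrompt` is a hypothetical function.

```javascript
// Hypothetical prompt builder: embeds the context as JSON so the model
// can map a spoken name such as "parking" to the matching id.
function buildPrompt(transcript, context) {
  return [
    'Application context (JSON):',
    JSON.stringify(context, null, 2),
    `User command: ${transcript}`,
  ].join('\n');
}

const prompt = buildPrompt('Switch to parking', {
  cameras: [
    { id: 'cam1', name: 'lobby' },
    { id: 'cam2', name: 'parking' },
  ],
  activeCamera: 'cam1',
});
console.log(prompt);
```

Keeping the context small matters in practice, since everything in it is sent to the model with each command.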

Capability Detection

import { detectDetailedCapabilities, requestMicrophoneAccess } from 'voice-tool-call';

const caps = await detectDetailedCapabilities();
// caps.speechRecognition.status  — 'available' | 'unsupported-browser'
// caps.languageModel.status      — 'available' | 'needs-flags' | 'downloadable'
// caps.languageModel.instructions — how to enable (if not available)
// caps.microphone.status         — 'granted' | 'denied' | 'prompt'

const granted = await requestMicrophoneAccess();
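
The detected capabilities can then drive configuration. The helpers below are a hypothetical sketch that reads only the status fields listed above; `chooseIntentMode` and `canUseVoice` are illustrative names, not library exports.

```javascript
// Prefer the on-device model only when it is ready to use; otherwise
// fall back to keyword matching.
function chooseIntentMode(caps) {
  return caps.languageModel?.status === 'available' ? 'language-model' : 'local';
}

// Voice input needs speech recognition and a microphone that has not
// been denied ('prompt' still allows asking the user).
function canUseVoice(caps) {
  return (
    caps.speechRecognition?.status === 'available' &&
    caps.microphone?.status !== 'denied'
  );
}

const exampleCaps = {
  speechRecognition: { status: 'available' },
  languageModel: { status: 'needs-flags' },
  microphone: { status: 'prompt' },
};
console.log(chooseIntentMode(exampleCaps)); // local
console.log(canUseVoice(exampleCaps)); // true
```

The result of `chooseIntentMode` could be passed to `system.setIntentMode(...)`, and `canUseVoice` could decide whether to offer `system.processText(...)` as a text-only fallback.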

Enabling Chrome LanguageModel API (Gemini Nano)

For on-device AI intent matching with no API key:

  1. Chrome 131+
  2. Enable chrome://flags/#optimization-guide-on-device-model
  3. Enable chrome://flags/#prompt-api-for-gemini-nano
  4. Restart Chrome (model downloads ~1.7GB, one-time)

With autoDetect enabled, the library detects the LanguageModel API and switches intent interpretation from keyword matching to the on-device model.

Node.js / Bun

The library works server-side with a local LLM (Metal/CUDA accelerated) instead of Chrome's LanguageModel API.

npm install voice-tool-call node-llama-cpp

import { VoiceToolSystem } from 'voice-tool-call';

const system = new VoiceToolSystem({
  intent: 'llama-cpp',        // Local LLM with Metal/CUDA
  autoSpeak: false,            // No speaker in Node
  autoDetect: false,
  llamaCpp: {
    gpuLayers: -1,             // Offload all layers to GPU
    // model: 'path/to/custom.gguf',  // Optional custom model
  },
});

system.registerTool('deploy', {
  description: 'Deploy the application',
  parameters: { env: 'string' },
  keywords: ['deploy', 'ship'],
  examples: [{ input: 'deploy to staging', arguments: { env: 'staging' } }],
  handler: ({ env }) => `Deployed to ${env}`,
});

await system.start();  // Downloads Qwen2.5-0.5B (~400MB) on first run
const results = await system.processText('deploy to production');

Node-only imports

Node-specific modules (Whisper STT, llama-cpp) are in a separate entry point to keep the browser bundle clean:

// Browser-safe (main entry)
import { VoiceToolSystem } from 'voice-tool-call';

// Node-only (Whisper, llama-cpp, mic recording)
import { warmUpWhisper, transcribeFile, createLlamaCppInterpreter } from 'voice-tool-call/node';

Server with voice UI

The Node server example streams mic audio from a browser and processes everything server-side (Whisper STT + LLM + Kokoro TTS):

bun run demo:server
# Opens browser → mic streams to server → Whisper → LLM → tools → Kokoro → audio back

See examples/node/server.ts for the full implementation.

Advanced: Individual Modules

Import individual pieces for custom pipelines:

// Browser
import {
  WakeWordListener,
  listenForCommand,
  createLocalInterpreter,
  createLanguageModelInterpreter,
  ToolExecutor,
  BrowserTTS,
  KokoroTTSEngine,
  TTSManager,
} from 'voice-tool-call';

// Node/Bun
import {
  createLlamaCppInterpreter,
  warmUpWhisper,
  transcribeFile,
} from 'voice-tool-call/node';

Examples

| Example | Command | Description |
|---|---|---|
| React (browser) | bun dev | Full browser demo with wake word, Kokoro TTS, Chrome AI |
| Node server | bun run demo:server | Server-side voice pipeline with browser mic UI |
| Node CLI | bun run demo:cli | Interactive text REPL with OS tools |

Platform Support

Browser

| Feature | Chrome | Edge | Firefox | Safari |
|---|---|---|---|---|
| Speech Recognition | ✓ | ✓ | ✗ | ✗ |
| LanguageModel API | 131+ | ✗ | ✗ | ✗ |
| WebGPU (Kokoro accel) | 113+ | 113+ | Nightly | Preview |
| Speech Synthesis | ✓ | ✓ | ✓ | ✓ |
| Kokoro TTS (WASM) | ✓ | ✓ | ✓ | ✓ |

Node.js / Bun

| Feature | Support |
|---|---|
| LLM (node-llama-cpp) | Metal (macOS), CUDA (Linux/Windows), Vulkan, CPU |
| Whisper STT | Via @huggingface/transformers |
| Kokoro TTS | CPU (via onnxruntime-node) |

Best experience: Chrome 131+ with LanguageModel flags enabled.

License

MIT