@fakhre/voice-agent-sdk

v1.2.3

Site-agnostic AI voice agent for React apps

A site-agnostic AI voice agent SDK that drops into any web app to add voice-driven UI control. The SDK listens to speech, scans the DOM for actionable elements, sends the transcript + snapshot to your backend, and safely executes the returned action — with zero UI rewrites required.


Table of Contents

  1. Features
  2. Install
  3. Quick Start
  4. How the SDK Works
  5. Writing HTML for Best SDK Support
  6. DOM Actions — div, image, SVG & More
  7. Configuration API
  8. Agent Control API
  9. State Reference
  10. Supported Actions
  11. Backend Contract
  12. React Integration
  13. Angular Integration
  14. Vanilla JS / Plain HTML Integration
  15. Mobile — Android APK (WebView)
  16. Mobile — iOS IPA (WKWebView)
  17. LiveKit STT (Server-Side Speech)
  18. Authentication
  19. Multi-Step Action Sequences
  20. Safety & Confirmation System
  21. Language Support
  22. Encryption
  23. Full Exports Reference
  24. Browser Support
  25. Get Started / Subscription

Features

  • Voice Recognition — Continuous speech-to-text via Web Speech API with interim results
  • Wake Word — Optional activation phrase (e.g., "hey assistant")
  • 15 Languages — English, Arabic, Hindi, French, Spanish, Chinese, Japanese, and more
  • DOM Scanning — Automatically detects buttons, links, inputs, images, SVGs, and divs with handlers
  • Smart Execution — Clicks, fills forms, scrolls, navigates, submits, and goes back
  • Multi-Step Actions — Sequential action chains returned by your backend
  • Safety Guards — Confidence thresholds, risk-word confirmation prompts, navigation restrictions
  • Sentence Queue — FCFS queue with automatic mic backpressure (no lost commands)
  • Encryption — Optional ECDH + AES-256-GCM for sensitive endpoints
  • Firebase Auth — Optional Google OAuth with JWT forwarding
  • LiveKit STT — Server-side speech recognition as an alternative to Web Speech API
  • React-First — useVoiceAgent hook and VoiceAgentOverlay component included
  • Framework Agnostic — Works with React, Angular, Vue, Svelte, or plain HTML

Install

npm install @fakhre/voice-agent-sdk

Peer dependencies (React projects only):

npm install "react@>=17" "react-dom@>=17"

Quick Start

React

import { useVoiceAgent, VoiceAgentOverlay } from "@fakhre/voice-agent-sdk";

export default function App() {
  const { agent, state } = useVoiceAgent({
    intentEndpoint: "https://your-backend.com/intent",
  });

  return (
    <>
      <YourApp />
      {agent && <VoiceAgentOverlay agent={agent} state={state} />}
    </>
  );
}

Vanilla JS / Angular / Any Framework

import { createVoiceAgent } from "@fakhre/voice-agent-sdk";

const agent = createVoiceAgent({
  intentEndpoint: "https://your-backend.com/intent",
  onTranscript: (text, isFinal) => console.log("Heard:", text),
  onIntent: (intent) => console.log("Intent:", intent),
  onAction: (intent, ok) => console.log("Executed:", ok),
});

agent.start();

How the SDK Works

User speaks
    │
    ▼
Web Speech API / LiveKit STT
    │  transcript text
    ▼
SDK scans the DOM (buttons, links, inputs, images, SVGs, divs)
    │  DomElementSnapshot[]
    ▼
POST /intent  ──► Your AI backend (GPT / Claude / custom)
    │  IntentResult  { action, targetId, confidence, ... }
    ▼
Safety check  (confidence, risk words, file inputs, permissions)
    │  pass / requires confirmation
    ▼
executeIntent()  ──► click / fill / scroll / navigate / back / submit
    │
    ▼
DOM updated  →  user sees result

The SDK never modifies your HTML. It reads the DOM, asks your backend what to do, then drives clicks and fills programmatically — just like a real user.


Writing HTML for Best SDK Support

The SDK resolves element labels in this priority order:

  1. alt attribute (images)
  2. <title> child of SVG
  3. aria-label attribute
  4. title attribute
  5. aria-labelledby referenced element
  6. Associated <label for="…"> element
  7. placeholder attribute
  8. Visible text content
  9. Nested image alt or SVG <title>
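
The priority order above can be read as a first-match lookup. The sketch below is not the SDK's implementation: `ElementInfo` and `resolveLabel` are hypothetical names, and the element is modelled as a plain object so the logic can run without a DOM.

```typescript
// Hypothetical, DOM-free description of an element (not an SDK type).
interface ElementInfo {
  tag: string;                   // lower-case tag name, e.g. "img", "svg", "button"
  attrs: Record<string, string>; // raw attributes
  svgTitle?: string;             // text of a <title> child (SVG only)
  labelledByText?: string;       // text of the element referenced by aria-labelledby
  labelForText?: string;         // text of an associated <label for="…">
  text?: string;                 // visible text content
  nestedLabel?: string;          // alt of a nested <img> or <title> of a nested <svg>
}

// Returns the first non-empty candidate, following the documented order 1 to 9.
function resolveLabel(el: ElementInfo): string | null {
  const candidates: Array<string | undefined> = [
    el.tag === "img" ? el.attrs["alt"] : undefined,
    el.tag === "svg" ? el.svgTitle : undefined,
    el.attrs["aria-label"],
    el.attrs["title"],
    el.labelledByText,
    el.labelForText,
    el.attrs["placeholder"],
    el.text,
    el.nestedLabel,
  ];
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) return trimmed;
  }
  return null; // nothing usable: the element cannot be targeted by voice
}
```

Under this reading, a button that has both an aria-label and visible text is identified by its aria-label, and a button that wraps an image without alt text resolves to no label at all.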

Rules for Great Voice Support

Always label interactive elements:

<!-- Good — SDK can identify this -->
<button aria-label="Submit order">Place Order</button>

<!-- Also good — visible text is used -->
<button>Place Order</button>

<!-- Bad — no label, SDK cannot target it -->
<button><img src="cart.png" /></button>

Use aria-label on icon-only buttons:

<!-- Good -->
<button aria-label="Close dialog">✕</button>
<button aria-label="Add to cart"><svg>…</svg></button>

<!-- Bad -->
<button>✕</button>

Label all form inputs:

<!-- Good — linked label -->
<label for="email">Email address</label>
<input id="email" type="email" placeholder="user@example.com" />

<!-- Also good — placeholder as fallback -->
<input type="text" placeholder="Search products" />

<!-- Bad — no label, no placeholder -->
<input type="text" />

Name your action divs:

<!-- Good — SDK picks up aria-label on a clickable div -->
<div role="button" aria-label="Open menu" tabindex="0" onclick="openMenu()">
  ☰
</div>

<!-- Also good — text content is used -->
<div onclick="openSettings()">Settings</div>

Use data-testid for reliable targeting:

<!-- data-testid is preferred over nth-of-type selectors -->
<button data-testid="checkout-btn">Checkout</button>
<input data-testid="search-input" type="text" />

Link images to their action:

<!-- Image inside a button — SDK reads button label first, then img alt -->
<button aria-label="View profile">
  <img src="avatar.png" alt="User avatar" />
</button>

<!-- Standalone clickable image -->
<img src="product.jpg" alt="Buy Blue Sneakers" onclick="openProduct()" />

Use semantic HTML where possible:

<!-- Prefer semantic elements — SDK natively understands these -->
<button>          <!-- role: button -->
<a href="…">      <!-- role: link -->
<input>           <!-- role: input -->
<select>          <!-- role: input -->
<textarea>        <!-- role: input -->
<form>            <!-- submittable -->

DOM Actions — div, image, SVG & More

The SDK scans for interactive elements in three phases:

Phase 1 — Semantic / ARIA Elements

Buttons (<button>, [role="button"]), anchors (<a href>), inputs, textareas, selects, contenteditable elements, and anything with tabindex or an onclick attribute.

Phase 2 — Images and SVGs

<img> elements and <svg> elements that have a click handler attached.

<!-- Detected as role: button (has onclick) -->
<img src="close.png" alt="Close panel" onclick="closePanel()" />

<!-- SVG with title — SDK reads the <title> as the label -->
<svg onclick="handleClick()" aria-label="Upload file">
  <title>Upload file</title>
  <path d="…" />
</svg>

Tip: An image or SVG without an onclick (or attached event listener via JS) will not be included in the snapshot unless it has tabindex or role="button".

Phase 3 — Generic Containers

<div>, <span>, <li>, <td>, and similar elements that have visible interaction signals (click handler, tabindex, or role).

<!-- Div detected as clickable -->
<div data-testid="card-item" onclick="selectCard(1)" tabindex="0">
  Product Card
</div>

<!-- List item with role -->
<li role="menuitem" onclick="navigate('/home')">Home</li>

How Click Execution Works on Complex Elements

When the SDK executes a click action, it follows this sequence:

  1. Resolves element via stored CSS selector
  2. File inputs: Instead of clicking the hidden <input type="file">, it clicks the associated <label> (required for browser gesture support)
  3. React elements: Fires a pointer-down → pointer-up → click event chain to trigger React's synthetic event system
  4. SVG / icon wrappers: Falls back to calculating the element's center point and firing a click at those coordinates
  5. Standard elements: Calls .click() directly
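
One way to picture steps 2 to 5 is as a strategy choice made after the element has been resolved. This is a sketch only: `ClickTarget`, its flags, and `chooseClickStrategy` are hypothetical, and how the SDK actually detects a React-managed element or an icon wrapper is not stated in this README.

```typescript
// Hypothetical facts about the already-resolved target element.
interface ClickTarget {
  tag: string;                 // lower-case tag name
  inputType?: string;          // type attribute, for <input> elements
  hasAssociatedLabel: boolean; // a <label> is linked to the element
  managedByReact: boolean;     // assumption: found by some framework heuristic
  isSvgOrIconWrapper: boolean; // an <svg>, or an element wrapping only an icon
}

type ClickStrategy =
  | "click-label"         // step 2: click the <label> instead of the file input
  | "pointer-event-chain" // step 3: pointer-down, pointer-up, click
  | "click-at-center"     // step 4: click at the element's center point
  | "native-click";       // step 5: element.click()

// Checks the cases in the same order as the documented sequence.
function chooseClickStrategy(t: ClickTarget): ClickStrategy {
  if (t.tag === "input" && t.inputType === "file" && t.hasAssociatedLabel) {
    return "click-label";
  }
  if (t.managedByReact) return "pointer-event-chain";
  if (t.isSvgOrIconWrapper) return "click-at-center";
  return "native-click";
}
```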

Fill Execution on Inputs

<!-- Text / email / password -->
<input type="text" placeholder="Name" />
<!-- Voice: "fill name with John" → value set, input + change events fired -->

<!-- Textarea -->
<textarea placeholder="Write a message"></textarea>
<!-- Voice: "write hello world in the message box" -->

<!-- Select -->
<select id="country">
  <option value="us">United States</option>
  <option value="uk">United Kingdom</option>
</select>
<!-- Voice: "select United Kingdom" → matches by text or value -->

<!-- Contenteditable -->
<div contenteditable="true" aria-label="Rich text editor"></div>
<!-- Voice: "type hello in the editor" → uses Selection API -->
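
The "matches by text or value" rule for selects could look like the sketch below. `SelectOption` and `matchOption` are hypothetical names, and the case-insensitive exact comparison is an assumption; the SDK may match more loosely.

```typescript
// Plain-object stand-in for an <option> element.
interface SelectOption {
  value: string;
  text: string;
}

// Returns the first option whose visible text or value equals the spoken phrase,
// ignoring case and surrounding whitespace.
function matchOption(options: SelectOption[], spoken: string): SelectOption | null {
  const wanted = spoken.trim().toLowerCase();
  const found = options.find(
    (o) => o.text.trim().toLowerCase() === wanted || o.value.trim().toLowerCase() === wanted
  );
  return found ?? null;
}

const countries: SelectOption[] = [
  { value: "us", text: "United States" },
  { value: "uk", text: "United Kingdom" },
];
```

With the country list from the example above, both "United Kingdom" and "UK" resolve to the option whose value is "uk", while "France" matches nothing.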

Scroll on Specific Elements

<!-- Give a scrollable div a label so voice can target it -->
<div
  aria-label="Product list"
  style="overflow-y: auto; height: 400px;"
  data-testid="product-list"
>
  …long list…
</div>
<!-- Voice: "scroll down in the product list" -->

Configuration API

createVoiceAgent(options) / useVoiceAgent(options)

{
  // ── Required ─────────────────────────────────────────────────
  intentEndpoint: string;
  // POST endpoint on your backend that resolves intents.

  // ── Speech ───────────────────────────────────────────────────
  wakeWord?: string;
  // Activation phrase, e.g. "hey assistant".
  // Empty string (default) means the agent is always listening.

  language?: string | string[];
  // Single code: "en-US"
  // Array: ["en-US", "ar-SA"] — first is default, rest shown in selector.

  // ── Behaviour ────────────────────────────────────────────────
  debug?: boolean;               // default: false — verbose console logs
  scan?: { max?: number };       // default: { max: 80 } — elements in snapshot

  queue?: { maxSize?: number };
  // default: { maxSize: 5 }
  // When full the mic pauses automatically until a slot opens.

  // ── Permissions ──────────────────────────────────────────────
  permissions?: {
    allowNavigation?: boolean;             // default: false
    allowCrossOriginNavigation?: boolean;  // default: false
    allowFormFill?: boolean;               // default: true
    requireConfirmFor?: string[];
    // default: ["delete","remove","logout","sign out","pay",
    //           "purchase","checkout","unsubscribe"]
    confidenceThreshold?: number;          // default: 0.55 (0–1)
  };

  // ── Auth ─────────────────────────────────────────────────────
  firebaseConfig?: object;
  // Pass your Firebase project config to enable Google OAuth.

  getToken?: () => Promise<string | null>;
  // Custom token provider — sent as "Authorization: Bearer <token>".

  apiKey?: string;
  // Sent as "x-api-key: <key>" on every request.

  // ── LiveKit STT ───────────────────────────────────────────────
  livekit?: {
    serverUrl: string;       // e.g. "wss://your-app.livekit.cloud"
    tokenEndpoint: string;   // Your backend endpoint returning { token: string }
  };

  // ── Callbacks ────────────────────────────────────────────────
  onTranscript?: (text: string, isFinal: boolean) => void;
  onIntent?: (intent: IntentResult) => void;
  onAction?: (intent: IntentResult, ok: boolean) => void;
  onError?: (err: unknown) => void;
}

Full Options Reference

| Option | Type | Default | Description |
|---|---|---|---|
| intentEndpoint | string | required | POST URL for intent resolution |
| wakeWord | string | "" | Activation phrase; empty = always active |
| language | string \| string[] | "en-US" | Recognition language(s) |
| debug | boolean | false | Verbose console output |
| scan.max | number | 80 | Max elements in DOM snapshot |
| queue.maxSize | number | 5 | Sentence queue capacity |
| permissions | Permissions | see below | Safety & navigation rules |
| firebaseConfig | object | — | Firebase config for Google OAuth |
| getToken | () => Promise<string\|null> | — | Custom JWT provider |
| apiKey | string | — | API key sent as x-api-key header |
| livekit | LiveKitConfig | — | Use LiveKit instead of Web Speech API |
| onTranscript | function | — | Called on every speech result |
| onIntent | function | — | Called when intent is resolved |
| onAction | function | — | Called after action executes |
| onError | function | — | Called on errors |

Permissions Reference

| Field | Type | Default | Description |
|---|---|---|---|
| allowNavigation | boolean | false | Allow navigate actions |
| allowCrossOriginNavigation | boolean | false | Allow cross-origin URLs |
| allowFormFill | boolean | true | Allow fill actions |
| requireConfirmFor | string[] | see above | Risky keyword list |
| confidenceThreshold | number | 0.55 | Min confidence before auto-execute |
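
To make the two navigation flags concrete, here is a minimal sketch of how they might combine. `NavigationPermissions` and `isNavigationAllowed` are hypothetical names and the SDK's real check may differ; the only platform API used is the standard `URL` class.

```typescript
// Mirrors the two navigation flags and their documented defaults.
interface NavigationPermissions {
  allowNavigation?: boolean;            // default: false
  allowCrossOriginNavigation?: boolean; // default: false
}

function isNavigationAllowed(
  targetUrl: string,
  currentPageUrl: string,
  permissions: NavigationPermissions = {}
): boolean {
  if (!(permissions.allowNavigation ?? false)) return false;

  let target: URL;
  try {
    // Relative targets such as "/forgot" resolve against the current page.
    target = new URL(targetUrl, currentPageUrl);
  } catch {
    return false; // unparseable URL: refuse rather than guess
  }

  const sameOrigin = target.origin === new URL(currentPageUrl).origin;
  return sameOrigin || (permissions.allowCrossOriginNavigation ?? false);
}
```

With the defaults every navigate action is refused. Setting only `allowNavigation: true` permits same-origin URLs and still refuses other origins.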


Agent Control API

The agent object returned by createVoiceAgent() or useVoiceAgent():

| Method | Returns | Description |
|---|---|---|
| agent.start() | void | Start speech recognition |
| agent.stop() | void | Stop speech recognition |
| agent.toggle() | void | Toggle listening on/off |
| agent.destroy() | void | Stop and release all resources |
| agent.confirm() | void | Confirm a pending action |
| agent.cancel() | void | Cancel a pending action |
| agent.selectSuggestion(index) | void | Pick suggestion by 0-based index |
| agent.setLanguage(lang) | void | Switch recognition language at runtime |
| agent.getState() | VoiceAgentState | Snapshot of current state |
| agent.subscribe(fn) | () => void | Subscribe to state; returns unsubscribe |
| agent.login() | Promise<void> | Firebase Google sign-in popup |
| agent.logout() | Promise<void> | Firebase sign-out |


State Reference

interface VoiceAgentState {
  status:              "idle" | "listening" | "processing" | "error";
  listening:           boolean;
  lastTranscript:      string;         // Most recent final speech text
  lastIntent:          IntentResult | null;
  awaitingConfirm:     boolean;        // true when confirmation dialog is open
  confirmPrompt:       string | null;  // Human-readable confirmation question
  lastError:           string | null;
  activeLanguage:      string;         // Current BCP-47 language code
  user:                User | null;    // Firebase user (if auth enabled)
  sentenceQueue:       string[];       // Pending commands not yet processed
  processingSentence:  string | null;  // Command currently being processed
  queueFull:           boolean;        // true when queue hits maxSize
  suggestions:         string[] | null; // Alternative actions from backend
  awaitingSuggestion:  boolean;        // true when suggestion list is shown
  currentStep:         number | null;  // Step index in multi-action sequence
  totalSteps:          number | null;  // Total steps in sequence
}
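
The `sentenceQueue` and `queueFull` fields reflect a first-come, first-served queue with a fixed capacity. The class below is a hypothetical sketch of that behaviour, not the SDK's internal queue; the name `SentenceQueue` and its methods are assumptions.

```typescript
// FCFS queue with a capacity limit (default 5, matching queue.maxSize).
class SentenceQueue {
  private items: string[] = [];
  private readonly maxSize: number;

  constructor(maxSize: number = 5) {
    this.maxSize = maxSize;
  }

  // True when no slot is free; a caller would pause the microphone here.
  get isFull(): boolean {
    return this.items.length >= this.maxSize;
  }

  // Copy of the pending sentences, oldest first.
  get pending(): string[] {
    return [...this.items];
  }

  // Adds a sentence; returns false when the queue is already full.
  enqueue(sentence: string): boolean {
    if (this.isFull) return false;
    this.items.push(sentence);
    return true;
  }

  // Removes and returns the oldest sentence, or null when empty.
  dequeue(): string | null {
    return this.items.shift() ?? null;
  }
}
```

In this model the microphone is paused while `isFull` is true and resumed after the next `dequeue()`, which is one way to get the "no lost commands" behaviour described under Features.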

Supported Actions

| Action | Requires | Description |
|---|---|---|
| click | targetId | Click any interactive element by its snapshot ID |
| fill | targetId, value | Set value on input / textarea / select / contenteditable |
| scroll | delta (px), optional targetId | Scroll page or a specific container |
| navigate | url, allowNavigation: true | Navigate to a URL |
| back | — | Browser history.back() |
| submit | targetId | Submit a form element |

Multi-Step Sequences

Your backend can return an array of steps:

{
  "action": null,
  "actions": [
    { "action": "fill", "targetId": "el-3", "value": "user@example.com" },
    { "action": "fill", "targetId": "el-4", "value": "password123" },
    { "action": "click", "targetId": "el-5" }
  ],
  "confidence": 0.91,
  "reason": "Log in with provided credentials"
}

Steps execute sequentially with a 300ms pause between each to allow React/Angular re-renders.


Backend Contract

Your backend must expose a POST endpoint at the URL you pass as intentEndpoint.

Request

POST /intent
Content-Type: application/json
x-site-id: yourdomain.com
x-api-key: <your-api-key>           (if apiKey option set)
Authorization: Bearer <token>        (if getToken / firebaseConfig set)

{
  "text": "click the login button",
  "domSnapshot": [
    {
      "id": "el-1",
      "role": "button",
      "label": "Login",
      "selector": "#login-btn"
    },
    {
      "id": "el-2",
      "role": "input",
      "label": "Email address",
      "selector": "input[type='email']",
      "inputType": "email"
    },
    {
      "id": "el-3",
      "role": "link",
      "label": "Forgot password",
      "selector": "a[href='/forgot']",
      "href": "/forgot"
    }
  ],
  "language": "en-US"
}

DomElementSnapshot Fields

| Field | Type | Description |
|---|---|---|
| id | string | SDK-generated ID ("el-1", "el-2", …) |
| role | "button" \| "input" \| "link" | Element type |
| label | string | Human-readable label (from ARIA, text, alt, etc.) |
| selector | string | CSS selector to locate the element |
| href | string \| null | Link destination (links only) |
| inputType | string \| null | Input type (text, email, password, file, …) |

Response — Single Action

{
  "action": "click",
  "targetId": "el-1",
  "value": null,
  "confidence": 0.92,
  "delta": null,
  "url": null,
  "reason": "User wants to click the login button"
}

Response — Multi-Step Sequence

{
  "action": null,
  "actions": [
    { "action": "fill", "targetId": "el-2", "value": "user@example.com" },
    { "action": "click", "targetId": "el-1" }
  ],
  "confidence": 0.88,
  "reason": "Fill email then click login"
}

Response — Suggestions (no match)

Return suggestions when the intent is unclear. The SDK will read them out and wait for the user to pick one verbally ("one", "two", "three") or by tapping in the overlay.

{
  "action": null,
  "suggestions": [
    "Click the Login button",
    "Fill in the email field",
    "Navigate to the sign-up page"
  ],
  "confidence": 0.3,
  "reason": "Ambiguous command"
}
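
If you build your own overlay, a spoken pick has to be turned into the 0-based index that `agent.selectSuggestion(index)` expects. A minimal sketch, assuming English number words only; `parseSuggestionChoice` is a hypothetical helper, not an SDK export.

```typescript
// Spoken or typed choices mapped to 0-based suggestion indexes.
const SPOKEN_CHOICES = new Map<string, number>([
  ["one", 0], ["two", 1], ["three", 2],
  ["1", 0], ["2", 1], ["3", 2],
]);

// Returns the index of the first recognised choice in the transcript,
// or null if none is present or the choice exceeds the list length.
function parseSuggestionChoice(transcript: string, suggestionCount: number): number | null {
  const words = transcript.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  for (const word of words) {
    const index = SPOKEN_CHOICES.get(word);
    if (index !== undefined && index < suggestionCount) return index;
  }
  return null;
}
```

The result can be passed straight to `agent.selectSuggestion()`. A `null` result means the transcript should be treated as a new command rather than a pick.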

IntentResult Fields

| Field | Type | Description |
|---|---|---|
| action | VoiceAgentAction \| null | Action to execute |
| targetId | string \| null | Target element from snapshot |
| value | string \| null | Fill value |
| delta | number \| null | Scroll amount in pixels |
| url | string \| null | Navigate destination |
| confidence | number \| null | 0–1 confidence score |
| reason | string \| null | Human-readable explanation |
| suggestions | string[] \| null | Up to 3 alternative commands |
| actions | ActionStep[] \| null | Multi-step action array |
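
To show the contract end to end, the sketch below resolves a request into either a single click or a suggestions response. The types are re-declared locally from the tables in this section, and the keyword matching in `resolveIntent` is only a placeholder for whatever model your backend calls; it is not part of the SDK.

```typescript
interface DomElementSnapshot {
  id: string;
  role: "button" | "input" | "link";
  label: string;
  selector: string;
  href?: string | null;
  inputType?: string | null;
}

interface IntentRequest {
  text: string;
  domSnapshot: DomElementSnapshot[];
  language: string;
}

interface IntentResult {
  action: "click" | "fill" | "scroll" | "navigate" | "back" | "submit" | null;
  targetId?: string | null;
  value?: string | null;
  delta?: number | null;
  url?: string | null;
  confidence: number | null;
  reason: string | null;
  suggestions?: string[] | null;
}

// Placeholder resolver: a real backend would call a language model here.
function resolveIntent(req: IntentRequest): IntentResult {
  const text = req.text.toLowerCase();
  const match = req.domSnapshot.find(
    (el) => el.label.trim() !== "" && text.includes(el.label.toLowerCase())
  );
  if (match && text.includes("click")) {
    return {
      action: "click", targetId: match.id, value: null, delta: null, url: null,
      confidence: 0.9, reason: `Transcript mentions "${match.label}"`,
    };
  }
  // No clear match: offer up to three suggestions instead of guessing.
  const suggestions = req.domSnapshot.slice(0, 3).map((el) => `Click ${el.label}`);
  return { action: null, suggestions, confidence: 0.3, reason: "Ambiguous command" };
}
```

An HTTP handler would parse the POST body into `IntentRequest`, call a function like this, and return the result as JSON.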


React Integration

useVoiceAgent(options)

import { useVoiceAgent, VoiceAgentOverlay } from "@fakhre/voice-agent-sdk";

function App() {
  const { agent, state } = useVoiceAgent({
    intentEndpoint: "https://api.example.com/intent",
    wakeWord: "hey app",
    language: ["en-US", "fr-FR"],
    permissions: {
      allowNavigation: true,
      requireConfirmFor: ["delete", "pay"],
    },
    onTranscript: (text, isFinal) => {
      if (isFinal) console.log("Final:", text);
    },
  });

  return (
    <div>
      <p>Status: {state.status}</p>
      <p>Heard: {state.lastTranscript}</p>
      {agent && (
        <VoiceAgentOverlay
          agent={agent}
          state={state}
          languages={["en-US", "fr-FR", "ar-SA"]}
          style={{ bottom: "24px", right: "24px" }}
          buttonStyle={{ backgroundColor: "#6200ee" }}
        />
      )}
    </div>
  );
}

VoiceAgentOverlay Props

| Prop | Type | Required | Description |
|---|---|---|---|
| agent | VoiceAgent | Yes | Agent instance |
| state | VoiceAgentState | Yes | Reactive state from hook |
| languages | string[] | No | Language codes to show in selector |
| style | CSSProperties | No | Override container styles |
| buttonStyle | CSSProperties | No | Override mic button styles |
| panelAreaStyle | CSSProperties | No | Override panel area styles |

Controlling the Agent from React

function Controls({ agent }: { agent: VoiceAgent | null }) {
  return (
    <div>
      <button onClick={() => agent?.start()}>Start</button>
      <button onClick={() => agent?.stop()}>Stop</button>
      <button onClick={() => agent?.toggle()}>Toggle</button>
      <button onClick={() => agent?.setLanguage("ar-SA")}>Arabic</button>
    </div>
  );
}

Angular Integration

Angular does not use JSX, so use createVoiceAgent directly and hook into Angular's lifecycle methods.

Installation

npm install @fakhre/voice-agent-sdk

Service (Recommended Pattern)

// voice-agent.service.ts
import { Injectable, OnDestroy } from "@angular/core";
import { BehaviorSubject } from "rxjs";
import { createVoiceAgent, VoiceAgent, VoiceAgentState } from "@fakhre/voice-agent-sdk";

@Injectable({ providedIn: "root" })
export class VoiceAgentService implements OnDestroy {
  // Public so that templates can call agent?.selectSuggestion(i), as the overlay below does.
  agent: VoiceAgent | null = null;
  state$ = new BehaviorSubject<VoiceAgentState | null>(null);

  init(intentEndpoint: string): void {
    this.agent = createVoiceAgent({
      intentEndpoint,
      wakeWord: "hey app",
      language: ["en-US"],
      onTranscript: (text, isFinal) => console.log(text),
      onError: (err) => console.error(err),
    });

    this.agent.subscribe((state) => this.state$.next(state));
    this.agent.start();
  }

  start()  { this.agent?.start(); }
  stop()   { this.agent?.stop(); }
  toggle() { this.agent?.toggle(); }
  confirm(){ this.agent?.confirm(); }
  cancel() { this.agent?.cancel(); }
  setLanguage(lang: string) { this.agent?.setLanguage(lang); }

  ngOnDestroy(): void {
    this.agent?.destroy();
  }
}

Component Usage

// app.component.ts
import { Component, OnInit, OnDestroy } from "@angular/core";
import { VoiceAgentService } from "./voice-agent.service";
import { VoiceAgentState } from "@fakhre/voice-agent-sdk";

@Component({
  selector: "app-root",
  template: `
    <div>
      <p>Status: {{ (voiceService.state$ | async)?.status }}</p>
      <p>Heard: {{ (voiceService.state$ | async)?.lastTranscript }}</p>

      <button (click)="voiceService.toggle()">Toggle Mic</button>

      <div *ngIf="(voiceService.state$ | async)?.awaitingConfirm">
        <p>{{ (voiceService.state$ | async)?.confirmPrompt }}</p>
        <button (click)="voiceService.confirm()">Yes</button>
        <button (click)="voiceService.cancel()">No</button>
      </div>
    </div>
  `,
})
export class AppComponent implements OnInit, OnDestroy {
  constructor(public voiceService: VoiceAgentService) {}

  ngOnInit(): void {
    this.voiceService.init("https://api.example.com/intent");
  }

  ngOnDestroy(): void {
    this.voiceService.ngOnDestroy();
  }
}

Overlay in Angular (Custom Template)

Since the VoiceAgentOverlay is a React component, build your own Angular overlay using the state observable:

<!-- voice-overlay.component.html -->
<div class="voice-overlay" *ngIf="state">
  <button
    class="mic-btn"
    [class.listening]="state.listening"
    [class.processing]="state.status === 'processing'"
    (click)="voiceService.toggle()"
  >
    {{ state.listening ? '🎙️' : '🎤' }}
  </button>

  <div class="panel" *ngIf="state.lastTranscript">
    <p>Heard: {{ state.lastTranscript }}</p>
  </div>

  <div class="confirm-dialog" *ngIf="state.awaitingConfirm">
    <p>{{ state.confirmPrompt }}</p>
    <button (click)="voiceService.confirm()">Yes</button>
    <button (click)="voiceService.cancel()">No</button>
  </div>

  <div class="suggestions" *ngIf="state.awaitingSuggestion">
    <p *ngFor="let s of state.suggestions; let i = index">
      <button (click)="voiceService.agent?.selectSuggestion(i)">{{ i + 1 }}. {{ s }}</button>
    </p>
  </div>
</div>

Vanilla JS / Plain HTML Integration

<!DOCTYPE html>
<html>
  <head>
    <script type="module">
      import { createVoiceAgent } from "https://cdn.jsdelivr.net/npm/@fakhre/voice-agent-sdk/dist/index.js";

      const agent = createVoiceAgent({
        intentEndpoint: "https://api.example.com/intent",
        wakeWord: "hey assistant",
        language: "en-US",
        onTranscript: (text, isFinal) => {
          if (isFinal) document.getElementById("transcript").textContent = text;
        },
        onAction: (intent, ok) => {
          console.log(ok ? "Done!" : "Failed", intent);
        },
      });

      agent.subscribe((state) => {
        document.getElementById("status").textContent = state.status;
      });

      document.getElementById("start-btn").onclick = () => agent.start();
      document.getElementById("stop-btn").onclick = () => agent.stop();
    </script>
  </head>
  <body>
    <p>Status: <span id="status">idle</span></p>
    <p>Heard: <span id="transcript"></span></p>
    <button id="start-btn">Start</button>
    <button id="stop-btn">Stop</button>

    <!-- Your actual app elements — SDK scans and controls these -->
    <button aria-label="Open settings">Settings</button>
    <input type="text" placeholder="Search" />
  </body>
</html>

Mobile — Android APK (WebView)

The SDK runs inside an Android WebView since it targets the web platform. Follow these steps to integrate it in a native Android app.

1. Enable JavaScript and Web Speech API

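// MainActivity.kt
import android.os.Bundle
import android.webkit.PermissionRequest
import android.webkit.WebChromeClient
import android.webkit.WebSettings
import android.webkit.WebView
import androidx.appcompat.app.AppCompatActivity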

class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        val webView = WebView(this)
        val settings: WebSettings = webView.settings

        settings.javaScriptEnabled = true
        settings.domStorageEnabled = true
        settings.mediaPlaybackRequiresUserGesture = false

        // Required for microphone permission in WebView
        webView.webChromeClient = object : WebChromeClient() {
            override fun onPermissionRequest(request: PermissionRequest) {
                runOnUiThread {
                    request.grant(request.resources)
                }
            }
        }

        webView.loadUrl("https://your-voice-app.com")
        setContentView(webView)
    }
}

2. Request Microphone Permission

<!-- AndroidManifest.xml -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />

// Request at runtime (Android 6+)
if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
    != PackageManager.PERMISSION_GRANTED) {
    ActivityCompat.requestPermissions(this, arrayOf(Manifest.permission.RECORD_AUDIO), 100)
}

3. Web Speech API Note

Web Speech API (SpeechRecognition) is supported in Chrome on Android but not in all WebView versions. If targeting a wide Android audience, use the LiveKit STT option instead — it uses LiveKit's server-side speech recognition which works in any WebView:

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  livekit: {
    serverUrl: "wss://your-app.livekit.cloud",
    tokenEndpoint: "https://api.example.com/livekit-token",
  },
});

4. JavaScript Bridge (Optional)

If you need to control the agent from native Android code:

// Inject bridge into WebView
webView.addJavascriptInterface(object : Any() {
    @JavascriptInterface
    fun startAgent() {
        webView.post { webView.evaluateJavascript("window.__voiceAgent?.start()", null) }
    }

    @JavascriptInterface
    fun stopAgent() {
        webView.post { webView.evaluateJavascript("window.__voiceAgent?.stop()", null) }
    }
}, "AndroidBridge")

// In your web app — expose agent globally
const agent = createVoiceAgent({ … });
(window as any).__voiceAgent = agent;

Mobile — iOS IPA (WKWebView)

The SDK runs inside a WKWebView on iOS. Safari's WebKit engine supports the Web Speech API on iOS 14.5+.

1. Setup WKWebView with Microphone

// ViewController.swift
import WebKit

class ViewController: UIViewController, WKUIDelegate {
    var webView: WKWebView!

    override func viewDidLoad() {
        super.viewDidLoad()

        let config = WKWebViewConfiguration()
        config.allowsInlineMediaPlayback = true
        config.mediaTypesRequiringUserActionForPlayback = []

        webView = WKWebView(frame: view.bounds, configuration: config)
        webView.uiDelegate = self
        view.addSubview(webView)

        let url = URL(string: "https://your-voice-app.com")!
        webView.load(URLRequest(url: url))
    }

    // Grant microphone permission from WebView prompt.
    // This WKUIDelegate callback is only available on iOS 15 and later.
    @available(iOS 15.0, *)
    func webView(
        _ webView: WKWebView,
        requestMediaCapturePermissionFor origin: WKSecurityOrigin,
        initiatedByFrame frame: WKFrameInfo,
        type: WKMediaCaptureType,
        decisionHandler: @escaping (WKPermissionDecision) -> Void
    ) {
        decisionHandler(.grant)
    }
}

2. Info.plist Permissions

<!-- Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>Voice agent needs the microphone to hear your commands.</string>
<key>NSSpeechRecognitionUsageDescription</key>
<string>Voice agent uses speech recognition to understand your commands.</string>

3. JavaScript Bridge from Swift (Optional)

// Inject bridge
let script = WKUserScript(
    source: "window.__nativeBridge = true;",
    injectionTime: .atDocumentStart,
    forMainFrameOnly: true
)
webView.configuration.userContentController.addUserScript(script)

// Call from Swift to start agent
webView.evaluateJavaScript("window.__voiceAgent?.start()", completionHandler: nil)

4. WKWebView + LiveKit Recommendation

Web Speech API in WKWebView requires iOS 14.5+ and HTTPS. For maximum compatibility, especially on older iOS versions, use LiveKit STT:

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  livekit: {
    serverUrl: "wss://your-app.livekit.cloud",
    tokenEndpoint: "https://api.example.com/livekit-token",
  },
});

Deployment Requirements (iOS)

  • Hosted app must be served over HTTPS (required for microphone and Speech API)
  • Works in: Safari on iOS, WKWebView on iOS 14.5+
  • Does not work in UIWebView (deprecated)

LiveKit STT (Server-Side Speech)

Use LiveKit as the speech-to-text engine instead of the browser's Web Speech API. This is ideal for:

  • Environments where Web Speech API is unavailable (some WebViews, Firefox, older browsers)
  • Server-controlled language configuration
  • Enterprise / on-premise deployments

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  livekit: {
    serverUrl: "wss://your-app.livekit.cloud",
    tokenEndpoint: "https://api.example.com/livekit-token",
  },
});

Your tokenEndpoint must return:

{ "token": "<livekit-room-token>" }

Note: Language switching via agent.setLanguage() is not available with LiveKit — language is configured server-side in your LiveKit agent.


Authentication

Firebase Google Auth

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  firebaseConfig: {
    apiKey: "…",
    authDomain: "your-app.firebaseapp.com",
    projectId: "your-app",
    // … rest of Firebase config
  },
});

// Trigger login
await agent.login();

// Access user
const state = agent.getState();
console.log(state.user?.email);

// Logout
await agent.logout();

When a Firebase user is signed in, every intent request automatically includes Authorization: Bearer <id-token>.
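
On the backend, the intent endpoint has to read that header before it can verify the token. A minimal sketch, assuming a plain string header value; extractBearerToken is a hypothetical helper name, not something the SDK ships:

```typescript
// Hypothetical backend helper; not part of the SDK.
// Returns the token from "Authorization: Bearer <id-token>", or null if absent or malformed.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match?.[1] ?? null;
}

console.log(extractBearerToken("Bearer abc.def.ghi")); // "abc.def.ghi"
console.log(extractBearerToken("Basic abc"));          // null
```

The extracted string is only a candidate: it still needs to be verified server-side, for example with verifyIdToken from the Firebase Admin SDK, before the request is trusted.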

Custom Token Provider

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  getToken: async () => {
    const res = await fetch("/api/auth/token");
    const { token } = await res.json();
    return token;
  },
});

API Key

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  apiKey: "sk-live-abc123",
  // Sends "x-api-key: sk-live-abc123" on every request
});

Multi-Step Action Sequences

Your backend can return a sequence of actions to execute in order:

{
  "action": null,
  "actions": [
    { "action": "click", "targetId": "el-1" },
    { "action": "fill",  "targetId": "el-2", "value": "John Doe" },
    { "action": "fill",  "targetId": "el-3", "value": "[email protected]" },
    { "action": "click", "targetId": "el-4" }
  ],
  "confidence": 0.94,
  "reason": "Complete the registration form"
}

Behaviour:

  • Steps execute one at a time with a 300ms gap, which gives the DOM and React time to re-render
  • If any step contains a risk keyword, a single confirmation dialog is shown before execution starts
  • state.currentStep and state.totalSteps are updated in real time; use them to drive a custom progress indicator
  • The overlay shows a progress bar with percentage automatically
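
To make the sequencing concrete, here is a small sketch of a runner that behaves as the list above describes: one step at a time, a configurable gap, and a progress callback. It is illustrative only and not the SDK's internal implementation. The names Step, runSteps and onProgress are hypothetical, and the 1-based currentStep is an assumption:

```typescript
// Illustrative stand-in for a step in the "actions" array.
interface Step {
  action: string;
  targetId: string;
  value?: string;
}

interface Progress {
  currentStep: number; // assumed 1-based here
  totalSteps: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runSteps(
  steps: Step[],
  execute: (step: Step) => void | Promise<void>,
  onProgress: (progress: Progress) => void,
  gapMs = 300,
): Promise<void> {
  for (let i = 0; i < steps.length; i++) {
    onProgress({ currentStep: i + 1, totalSteps: steps.length });
    await execute(steps[i]);
    // Pause between steps so the page can re-render; no pause after the last one.
    if (i < steps.length - 1) await sleep(gapMs);
  }
}
```

A progress percentage can then be derived as Math.round((currentStep / totalSteps) * 100).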

Safety & Confirmation System

Automatic Confirmation Triggers

| Trigger | Default Condition |
|---|---|
| Risk keywords | Action label or value contains: delete, remove, logout, sign out, pay, purchase, checkout, unsubscribe |
| Navigation blocked | allowNavigation: false (default) |
| File input | <input type="file"> always requires confirmation |
| Low confidence | confidence < confidenceThreshold (default 0.55) |

Customising Risk Keywords

permissions: {
  requireConfirmFor: ["delete", "cancel subscription", "wipe", "format"],
}
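
The keyword and confidence triggers can be sketched as a single predicate, which is handy for unit-testing a keyword list before shipping it. This is a hedged approximation: case-insensitive substring matching is an assumption, and the README does not say whether requireConfirmFor replaces or extends the defaults (the sketch treats it as a replacement). needsConfirmation and PendingAction are hypothetical names:

```typescript
// Default keywords as listed under Automatic Confirmation Triggers.
const DEFAULT_RISK_KEYWORDS = [
  "delete", "remove", "logout", "sign out",
  "pay", "purchase", "checkout", "unsubscribe",
];

interface PendingAction {
  label: string;      // label of the target element
  value?: string;     // value to be filled, if any
  confidence: number; // confidence returned by the backend
}

function needsConfirmation(
  action: PendingAction,
  requireConfirmFor: string[] = DEFAULT_RISK_KEYWORDS,
  confidenceThreshold = 0.55,
): boolean {
  const text = `${action.label} ${action.value ?? ""}`.toLowerCase();
  // Assumption: a keyword matches anywhere in the label or value, ignoring case.
  const risky = requireConfirmFor.some((kw) => text.includes(kw.toLowerCase()));
  return risky || action.confidence < confidenceThreshold;
}

console.log(needsConfirmation({ label: "Delete account", confidence: 0.9 })); // true
console.log(needsConfirmation({ label: "Save", confidence: 0.9 }));           // false
```

With substring matching, a short keyword such as "pay" also matches labels like "Payment history", so short keywords are worth testing against your real UI labels.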

Voice Confirmation

The user can say:

  • Confirm: "yes", "yeah", "yep", "ok", "okay", "sure", "go ahead", "do it", "confirm", "proceed"
  • Cancel: "no", "nope", "cancel", "stop", "never mind", "don't", "abort"
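
If you handle transcripts yourself (for example in a custom UI that calls agent.confirm() and agent.cancel()), the phrase lists above can be matched with a small classifier. This is a sketch under stated assumptions: whole-phrase matching, English only, and cancel taking priority over confirm. The SDK's own matching rules are not documented here, and classifyReply is a hypothetical name:

```typescript
const CONFIRM_PHRASES = ["yes", "yeah", "yep", "ok", "okay", "sure", "go ahead", "do it", "confirm", "proceed"];
const CANCEL_PHRASES = ["no", "nope", "cancel", "stop", "never mind", "don't", "abort"];

type Reply = "confirm" | "cancel" | "unknown";

function classifyReply(transcript: string): Reply {
  // Lowercase, normalise curly apostrophes, drop other punctuation, collapse whitespace.
  const normalised = transcript
    .toLowerCase()
    .replace(/’/g, "'")
    .replace(/[^a-z' ]+/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  // Pad with spaces so phrases only match on word boundaries ("know" must not match "no").
  const padded = ` ${normalised} `;
  const contains = (phrases: string[]) => phrases.some((p) => padded.includes(` ${p} `));
  // Check cancel first so "no, don't do it" is never treated as a confirmation.
  if (contains(CANCEL_PHRASES)) return "cancel";
  if (contains(CONFIRM_PHRASES)) return "confirm";
  return "unknown";
}

console.log(classifyReply("Yes, go ahead!"));  // "confirm"
console.log(classifyReply("No, don't do it")); // "cancel"
console.log(classifyReply("I know"));          // "unknown"
```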

Programmatic Confirmation

agent.confirm();  // Execute pending action
agent.cancel();   // Discard pending action

Language Support

15 languages are supported out of the box:

| Code | Language | Native Label |
|---|---|---|
| en-US | English (US) | English |
| en-GB | English (UK) | English (UK) |
| ar-SA | Arabic (Saudi Arabia) | العربية (السعودية) |
| ar-EG | Arabic (Egypt) | العربية (مصر) |
| ar-AE | Arabic (UAE) | العربية (الإمارات) |
| hi-IN | Hindi | हिन्दी |
| fr-FR | French | Français |
| de-DE | German | Deutsch |
| es-ES | Spanish | Español |
| zh-CN | Chinese (Simplified) | 中文 (简体) |
| ja-JP | Japanese | 日本語 |
| pt-BR | Portuguese (Brazil) | Português |
| ru-RU | Russian | Русский |
| tr-TR | Turkish | Türkçe |
| ur-PK | Urdu | اردو |

Runtime Language Switching

agent.setLanguage("ar-SA");    // Switch to Arabic
agent.setLanguage("zh-CN");    // Switch to Chinese

Multi-Language Initialisation

const { agent, state } = useVoiceAgent({
  intentEndpoint: "…",
  language: ["en-US", "ar-SA", "hi-IN"],
  // First in array = default language
  // Rest appear in the overlay's language selector
});

Importing Language Data

import { SUPPORTED_LANGUAGES, DEFAULT_LANGUAGES } from "@fakhre/voice-agent-sdk";
// SUPPORTED_LANGUAGES: LanguageOption[]  — all 15 entries
// DEFAULT_LANGUAGES: string[]            — ["en-US","ar-SA","ar-EG","ar-AE"]
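
Before passing a user-configurable language list to the agent, you may want to filter it against the supported codes. The sketch below inlines the 15 codes from the table so that it runs on its own; in an app you would derive them from SUPPORTED_LANGUAGES instead. splitLanguages is a hypothetical helper, and how the SDK treats unsupported codes is not documented here:

```typescript
// Codes copied from the Language Support table.
const SUPPORTED_CODES = [
  "en-US", "en-GB", "ar-SA", "ar-EG", "ar-AE", "hi-IN", "fr-FR", "de-DE",
  "es-ES", "zh-CN", "ja-JP", "pt-BR", "ru-RU", "tr-TR", "ur-PK",
];

// Drop unsupported codes, then split into the default (first) and the rest.
function splitLanguages(requested: string[]): { defaultLanguage: string; others: string[] } {
  const valid = requested.filter((code) => SUPPORTED_CODES.includes(code));
  if (valid.length === 0) throw new Error("No supported language codes given");
  const [defaultLanguage, ...others] = valid;
  return { defaultLanguage, others };
}

console.log(splitLanguages(["en-US", "xx-XX", "ar-SA"]));
// { defaultLanguage: "en-US", others: ["ar-SA"] }
```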

Encryption

The SDK supports optional end-to-end encryption using ECDH (P-256) + AES-256-GCM for sensitive deployments.

When your backend exposes a GET /intent/pubkey endpoint that returns the server's public key, the SDK will:

  1. Generate a P-256 ECDH key pair
  2. Fetch the server's public key
  3. Derive a shared AES-256-GCM key
  4. Encrypt every request body before sending
  5. Decrypt encrypted responses automatically

This happens transparently — no configuration needed on the client side.
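
For backend authors, the key agreement in steps 1, 3, 4 and 5 can be sketched with the standard Web Crypto API. This is not the SDK's code: the wire format (how the public key is encoded, where the IV travels) is not documented here, so the sketch passes key objects in-process and leaves step 2 out. It uses Node's webcrypto export so it runs outside a browser, and the function names are hypothetical:

```typescript
import { webcrypto } from "node:crypto";

// In a browser, the global `crypto` object offers the same API as this Node import.
const { subtle } = webcrypto;

// Step 1: generate a P-256 ECDH key pair.
async function generateKeyPair() {
  return subtle.generateKey({ name: "ECDH", namedCurve: "P-256" }, false, ["deriveKey"]);
}

// Step 3: derive a 256-bit AES-GCM key from our private key and the peer's public key.
async function deriveAesKey(privateKey: webcrypto.CryptoKey, peerPublicKey: webcrypto.CryptoKey) {
  return subtle.deriveKey(
    { name: "ECDH", public: peerPublicKey },
    privateKey,
    { name: "AES-GCM", length: 256 },
    false,
    ["encrypt", "decrypt"],
  );
}

// Step 4: encrypt a JSON body with a fresh 12-byte IV.
async function encryptJson(key: webcrypto.CryptoKey, body: unknown) {
  const iv = webcrypto.getRandomValues(new Uint8Array(12));
  const plaintext = new TextEncoder().encode(JSON.stringify(body));
  const ciphertext = await subtle.encrypt({ name: "AES-GCM", iv }, key, plaintext);
  return { iv, ciphertext };
}

// Step 5: decrypt and parse a payload produced by encryptJson.
async function decryptJson(key: webcrypto.CryptoKey, iv: Uint8Array, ciphertext: ArrayBuffer) {
  const plaintext = await subtle.decrypt({ name: "AES-GCM", iv }, key, ciphertext);
  return JSON.parse(new TextDecoder().decode(plaintext));
}
```

Both sides derive the same AES key: the client from its private key and the server's public key, the server from its private key and the client's public key. The IV must be unique per message and has to be sent alongside the ciphertext.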


Full Exports Reference

// React
export { useVoiceAgent } from "./react/useVoiceAgent";
export { VoiceAgentOverlay } from "./react/VoiceAgentOverlay";

// Core
export { createVoiceAgent } from "./agent/createVoiceAgent";

// Utilities
export { scanDom } from "./dom/scanDom";
export { executeIntent } from "./exec/executeIntent";

// Language data
export { SUPPORTED_LANGUAGES } from "./react/languages";
export { DEFAULT_LANGUAGES } from "./react/languages";

// Types
export type {
  VoiceAgent,
  VoiceAgentOptions,
  VoiceAgentState,
  VoiceAgentAction,
  IntentResult,
  ActionStep,
  DomElementSnapshot,
  DomRole,
  Permissions,
  LiveKitConfig,
} from "./agent/types";

scanDom(options?) — Standalone Utility

Scan the DOM and return a snapshot without starting the full agent. Useful for debugging or building a custom backend request.

import { scanDom } from "@fakhre/voice-agent-sdk";

const snapshot = scanDom({ max: 100 });
console.log(snapshot);
// [{ id: "el-1", role: "button", label: "Login", selector: "#login-btn" }, …]

executeIntent(intent, snapshot, permissions) — Standalone Utility

Execute an intent against the current DOM without the full agent.

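import { scanDom, executeIntent } from "@fakhre/voice-agent-sdk";

// Take a snapshot first so that targetId values refer to the current DOM
const snapshot = scanDom({ max: 100 });

const ok = executeIntent(
  { action: "click", targetId: "el-1", confidence: 0.9 },
  snapshot,
  { allowNavigation: false }
);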

Browser Support

By default, the SDK relies on the browser's Web Speech API for speech recognition. Support by environment:

| Browser | Support | Notes |
|---|---|---|
| Chrome (desktop) | ✅ Full | Recommended |
| Edge (Chromium) | ✅ Full | |
| Safari (macOS 14.1+) | ✅ Full | |
| Safari (iOS 14.5+) | ✅ Full | HTTPS required |
| Firefox | ❌ No | Use LiveKit STT instead |
| Android WebView | ⚠️ Partial | Use LiveKit STT for reliability |
| iOS WKWebView | ✅ iOS 14.5+ | HTTPS required |

For maximum cross-browser and cross-platform support, use the LiveKit STT option.
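
One way to decide at runtime is to feature-detect the Web Speech API and only pass the livekit option when it is missing. The helper below is a hypothetical sketch; whether the SDK performs a similar check internally is not stated in this README. It takes the global object as a parameter so that it can be tested outside a browser:

```typescript
type SttEngine = "web-speech" | "livekit";

// Hypothetical helper: pick an engine based on what the environment exposes.
function pickSttEngine(globalObj: object): SttEngine {
  // Chrome and Safari expose the constructor under a webkit prefix.
  const hasWebSpeech =
    "SpeechRecognition" in globalObj || "webkitSpeechRecognition" in globalObj;
  return hasWebSpeech ? "web-speech" : "livekit";
}

// In a browser you would call pickSttEngine(window).
console.log(pickSttEngine({ webkitSpeechRecognition: class {} })); // "web-speech"
console.log(pickSttEngine({}));                                    // "livekit"
```

The presence of the constructor does not guarantee that recognition works in a given WebView, so for the environments marked Partial in the table, choosing LiveKit outright may be the safer option.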


Get Started / Subscription

Free Tier

Try the SDK with your own backend — no account required. Install the package and point intentEndpoint at your own AI endpoint.

Managed Backend (Hosted Intent API)

Want a hosted AI backend that handles intent resolution, DOM understanding, and multi-language support out of the box? We offer a managed service with:

  • Hosted intent API endpoint
  • AI-powered DOM understanding (no prompt engineering needed)
  • Multi-language support with server-side models
  • Analytics dashboard
  • SLA & priority support

Contact: [email protected]

npm: npmjs.com/package/@fakhre/voice-agent-sdk


License

MIT