@fakhre/voice-agent-sdk

v1.2.3

Published

11 days ago

Site-agnostic AI voice agent for React apps

A site-agnostic AI voice agent SDK that drops into any web app to add voice-driven UI control. The SDK listens to speech, scans the DOM for actionable elements, sends the transcript + snapshot to your backend, and safely executes the returned action — with zero UI rewrites required.

Features

Voice Recognition — Continuous speech-to-text via Web Speech API with interim results
Wake Word — Optional activation phrase (e.g., "hey assistant")
15 Languages — English, Arabic, Hindi, French, Spanish, Chinese, Japanese, and more
DOM Scanning — Automatically detects buttons, links, inputs, images, SVGs, and divs with handlers
Smart Execution — Clicks, fills forms, scrolls, navigates, submits, and goes back
Multi-Step Actions — Sequential action chains returned by your backend
Safety Guards — Confidence thresholds, risk-word confirmation prompts, navigation restrictions
Sentence Queue — FCFS queue with automatic mic backpressure (no lost commands)
Encryption — Optional ECDH + AES-256-GCM for sensitive endpoints
Firebase Auth — Optional Google OAuth with JWT forwarding
LiveKit STT — Server-side speech recognition as an alternative to Web Speech API
React-First — useVoiceAgent hook and VoiceAgentOverlay component included
Framework Agnostic — Works with React, Angular, Vue, Svelte, or plain HTML

Install

npm install @fakhre/voice-agent-sdk

Peer dependencies (React projects only):

npm install "react@>=17" "react-dom@>=17"

Quick Start

React

import { useVoiceAgent, VoiceAgentOverlay } from "@fakhre/voice-agent-sdk";

export default function App() {
  const { agent, state } = useVoiceAgent({
    intentEndpoint: "https://your-backend.com/intent",
  });

  return (
    <>
      <YourApp />
      {agent && <VoiceAgentOverlay agent={agent} state={state} />}
    </>
  );
}

Vanilla JS / Angular / Any Framework

import { createVoiceAgent } from "@fakhre/voice-agent-sdk";

const agent = createVoiceAgent({
  intentEndpoint: "https://your-backend.com/intent",
  onTranscript: (text, isFinal) => console.log("Heard:", text),
  onIntent: (intent) => console.log("Intent:", intent),
  onAction: (intent, ok) => console.log("Executed:", ok),
});

agent.start();

How the SDK Works

User speaks
    │
    ▼
Web Speech API / LiveKit STT
    │  transcript text
    ▼
SDK scans the DOM (buttons, links, inputs, images, SVGs, divs)
    │  DomElementSnapshot[]
    ▼
POST /intent  ──► Your AI backend (GPT / Claude / custom)
    │  IntentResult  { action, targetId, confidence, ... }
    ▼
Safety check  (confidence, risk words, file inputs, permissions)
    │  pass / requires confirmation
    ▼
executeIntent()  ──► click / fill / scroll / navigate / back / submit
    │
    ▼
DOM updated  →  user sees result

The SDK never modifies your HTML. It reads the DOM, asks your backend what to do, then drives clicks and fills programmatically — just like a real user.

Writing HTML for Best SDK Support

The SDK resolves element labels in this priority order:

alt attribute (images)
<title> child of SVG
aria-label attribute
title attribute
aria-labelledby referenced element
Associated <label for="…"> element
placeholder attribute
Visible text content
Nested image alt or SVG <title>
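The priority order above can be sketched as a first-non-empty lookup. This is an illustration only, not the SDK's internal code: the `LabelSources` shape and the `resolveLabel` name are hypothetical, and the sketch works on plain values so it can run without a DOM.

```typescript
// Hypothetical, DOM-free description of the places a label can come from.
interface LabelSources {
  alt?: string;                // alt attribute (images)
  svgTitle?: string;           // <title> child of an SVG
  ariaLabel?: string;          // aria-label attribute
  title?: string;              // title attribute
  ariaLabelledByText?: string; // text of the element referenced by aria-labelledby
  labelForText?: string;       // text of an associated <label for="...">
  placeholder?: string;        // placeholder attribute
  textContent?: string;        // visible text content
  nestedAltOrTitle?: string;   // alt or <title> of a nested image / SVG
}

// Keys listed in the documented priority order.
const LABEL_PRIORITY: (keyof LabelSources)[] = [
  "alt",
  "svgTitle",
  "ariaLabel",
  "title",
  "ariaLabelledByText",
  "labelForText",
  "placeholder",
  "textContent",
  "nestedAltOrTitle",
];

// Returns the first non-empty source (trimmed), or null when nothing is usable.
function resolveLabel(sources: LabelSources): string | null {
  for (const key of LABEL_PRIORITY) {
    const value = sources[key]?.trim();
    if (value) return value;
  }
  return null;
}
```

Under this reading, an element with both an `aria-label` and visible text is identified by its `aria-label`, and an element with none of the sources has no label and cannot be targeted.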

Rules for Great Voice Support

Always label interactive elements:

<!-- Good — SDK can identify this -->
<button aria-label="Submit order">Place Order</button>

<!-- Also good — visible text is used -->
<button>Place Order</button>

<!-- Bad — no label, SDK cannot target it -->
<button><img src="cart.png" /></button>

Use aria-label on icon-only buttons:

<!-- Good -->
<button aria-label="Close dialog">✕</button>
<button aria-label="Add to cart"><svg>…</svg></button>

<!-- Bad -->
<button>✕</button>

Label all form inputs:

<!-- Good — linked label -->
<label for="email">Email address</label>
<input id="email" type="email" placeholder="[email protected]" />

<!-- Also good — placeholder as fallback -->
<input type="text" placeholder="Search products" />

<!-- Bad — no label, no placeholder -->
<input type="text" />

Name your action divs:

<!-- Good — SDK picks up aria-label on a clickable div -->
<div role="button" aria-label="Open menu" tabindex="0" onclick="openMenu()">
  ☰
</div>

<!-- Also good — text content is used -->
<div onclick="openSettings()">Settings</div>

Use data-testid for reliable targeting:

<!-- data-testid is preferred over nth-of-type selectors -->
<button data-testid="checkout-btn">Checkout</button>
<input data-testid="search-input" type="text" />

Link images to their action:

<!-- Image inside a button — SDK reads button label first, then img alt -->
<button aria-label="View profile">
  <img src="avatar.png" alt="User avatar" />
</button>

<!-- Standalone clickable image -->
<img src="product.jpg" alt="Buy Blue Sneakers" onclick="openProduct()" />

Use semantic HTML where possible:

<!-- Prefer semantic elements — SDK natively understands these -->
<button>          <!-- role: button -->
<a href="…">      <!-- role: link -->
<input>           <!-- role: input -->
<select>          <!-- role: input -->
<textarea>        <!-- role: input -->
<form>            <!-- submittable -->

DOM Actions — div, image, SVG & More

The SDK scans for interactive elements in three phases:

Phase 1 — Semantic / ARIA Elements

Buttons (<button>, [role="button"]), anchors (<a href>), inputs, textareas, selects, contenteditable elements, and anything with tabindex or an onclick attribute.

Phase 2 — Images and SVGs

<img> elements and <svg> elements that have a click handler attached.

<!-- Detected as role: button (has onclick) -->
<img src="close.png" alt="Close panel" onclick="closePanel()" />

<!-- SVG with title — SDK reads the <title> as the label -->
<svg onclick="handleClick()" aria-label="Upload file">
  <title>Upload file</title>
  <path d="…" />
</svg>

Tip: An image or SVG without an onclick (or attached event listener via JS) will not be included in the snapshot unless it has tabindex or role="button".

Phase 3 — Generic Containers

<div>, <span>, <li>, <td>, and similar elements that have visible interaction signals (click handler, tabindex, or role).

<!-- Div detected as clickable -->
<div data-testid="card-item" onclick="selectCard(1)" tabindex="0">
  Product Card
</div>

<!-- List item with role -->
<li role="menuitem" onclick="navigate('/home')">Home</li>

How Click Execution Works on Complex Elements

When the SDK executes a click action, it follows this sequence:

Resolves element via stored CSS selector
File inputs: Instead of clicking the hidden <input type="file">, it clicks the associated <label> (required for browser gesture support)
React elements: Fires a pointer-down → pointer-up → click event chain to trigger React's synthetic event system
SVG / icon wrappers: Falls back to calculating the element's center point and firing a click at those coordinates
Standard elements: Calls .click() directly

Fill Execution on Inputs

<!-- Text / email / password -->
<input type="text" placeholder="Name" />
<!-- Voice: "fill name with John" → value set, input + change events fired -->

<!-- Textarea -->
<textarea placeholder="Write a message"></textarea>
<!-- Voice: "write hello world in the message box" -->

<!-- Select -->
<select id="country">
  <option value="us">United States</option>
  <option value="uk">United Kingdom</option>
</select>
<!-- Voice: "select United Kingdom" → matches by text or value -->

<!-- Contenteditable -->
<div contenteditable="true" aria-label="Rich text editor"></div>
<!-- Voice: "type hello in the editor" → uses Selection API -->

Scroll on Specific Elements

<!-- Give a scrollable div a label so voice can target it -->
<div
  aria-label="Product list"
  style="overflow-y: auto; height: 400px;"
  data-testid="product-list"
>
  …long list…
</div>
<!-- Voice: "scroll down in the product list" -->

Configuration API

`createVoiceAgent(options)` / `useVoiceAgent(options)`

{
  // ── Required ─────────────────────────────────────────────────
  intentEndpoint: string;
  // POST endpoint on your backend that resolves intents.

  // ── Speech ───────────────────────────────────────────────────
  wakeWord?: string;
  // Activation phrase, e.g. "hey assistant".
  // Empty string (default) means the agent is always listening.

  language?: string | string[];
  // Single code: "en-US"
  // Array: ["en-US", "ar-SA"] — first is default, rest shown in selector.

  // ── Behaviour ────────────────────────────────────────────────
  debug?: boolean;               // default: false — verbose console logs
  scan?: { max?: number };       // default: { max: 80 } — elements in snapshot

  queue?: { maxSize?: number };
  // default: { maxSize: 5 }
  // When full the mic pauses automatically until a slot opens.

  // ── Permissions ──────────────────────────────────────────────
  permissions?: {
    allowNavigation?: boolean;             // default: false
    allowCrossOriginNavigation?: boolean;  // default: false
    allowFormFill?: boolean;               // default: true
    requireConfirmFor?: string[];
    // default: ["delete","remove","logout","sign out","pay",
    //           "purchase","checkout","unsubscribe"]
    confidenceThreshold?: number;          // default: 0.55 (0–1)
  };

  // ── Auth ─────────────────────────────────────────────────────
  firebaseConfig?: object;
  // Pass your Firebase project config to enable Google OAuth.

  getToken?: () => Promise<string | null>;
  // Custom token provider — sent as "Authorization: Bearer <token>".

  apiKey?: string;
  // Sent as "x-api-key: <key>" on every request.

  // ── LiveKit STT ───────────────────────────────────────────────
  livekit?: {
    serverUrl: string;       // e.g. "wss://your-app.livekit.cloud"
    tokenEndpoint: string;   // Your backend endpoint returning { token: string }
  };

  // ── Callbacks ────────────────────────────────────────────────
  onTranscript?: (text: string, isFinal: boolean) => void;
  onIntent?: (intent: IntentResult) => void;
  onAction?: (intent: IntentResult, ok: boolean) => void;
  onError?: (err: unknown) => void;
}
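As an illustration of the `wakeWord` option, the sketch below gates a transcript on the activation phrase. It assumes the command follows the wake word in the same utterance, which the SDK may or may not require; `extractCommand` is a hypothetical helper, not an SDK export.

```typescript
// Returns the command portion of a transcript, or null when the utterance
// should be ignored. An empty wake word means "always listening".
function extractCommand(transcript: string, wakeWord: string): string | null {
  const text = transcript.trim();
  const wake = wakeWord.trim().toLowerCase();

  if (wake === "") {
    return text === "" ? null : text;
  }

  const index = text.toLowerCase().indexOf(wake);
  if (index === -1) {
    return null; // wake word was not spoken
  }

  // Drop the wake word plus any punctuation or spaces that follow it.
  const rest = text.slice(index + wake.length).replace(/^[\s,.!?:;-]+/, "");
  return rest === "" ? null : rest;
}
```

With `wakeWord: "hey assistant"`, the utterance "Hey assistant, open settings" yields the command "open settings", while "open settings" alone is ignored.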

Full Options Reference

| Option | Type | Default | Description |
|---|---|---|---|
| intentEndpoint | string | required | POST URL for intent resolution |
| wakeWord | string | "" | Activation phrase; empty = always active |
| language | string \| string[] | "en-US" | Recognition language(s) |
| debug | boolean | false | Verbose console output |
| scan.max | number | 80 | Max elements in DOM snapshot |
| queue.maxSize | number | 5 | Sentence queue capacity |
| permissions | Permissions | see below | Safety & navigation rules |
| firebaseConfig | object | — | Firebase config for Google OAuth |
| getToken | () => Promise<string\|null> | — | Custom JWT provider |
| apiKey | string | — | API key sent as x-api-key header |
| livekit | LiveKitConfig | — | Use LiveKit instead of Web Speech API |
| onTranscript | function | — | Called on every speech result |
| onIntent | function | — | Called when intent is resolved |
| onAction | function | — | Called after action executes |
| onError | function | — | Called on errors |

Permissions Reference

| Field | Type | Default | Description |
|---|---|---|---|
| allowNavigation | boolean | false | Allow navigate actions |
| allowCrossOriginNavigation | boolean | false | Allow cross-origin URLs |
| allowFormFill | boolean | true | Allow fill actions |
| requireConfirmFor | string[] | see above | Risky keyword list |
| confidenceThreshold | number | 0.55 | Min confidence before auto-execute |
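To make the defaults concrete, here is a minimal sketch of how a partial `permissions` object could be combined with the documented defaults. `resolvePermissions` and `DEFAULT_PERMISSIONS` are hypothetical names, and the sketch assumes a custom `requireConfirmFor` list replaces the default list rather than extending it; check the SDK's behaviour before relying on that.

```typescript
interface Permissions {
  allowNavigation?: boolean;
  allowCrossOriginNavigation?: boolean;
  allowFormFill?: boolean;
  requireConfirmFor?: string[];
  confidenceThreshold?: number;
}

// Defaults as listed in the Permissions Reference table.
const DEFAULT_PERMISSIONS: Required<Permissions> = {
  allowNavigation: false,
  allowCrossOriginNavigation: false,
  allowFormFill: true,
  requireConfirmFor: [
    "delete", "remove", "logout", "sign out",
    "pay", "purchase", "checkout", "unsubscribe",
  ],
  confidenceThreshold: 0.55,
};

// Fills in every field the caller left out (assumption: arrays are replaced, not merged).
function resolvePermissions(overrides: Permissions = {}): Required<Permissions> {
  return {
    allowNavigation: overrides.allowNavigation ?? DEFAULT_PERMISSIONS.allowNavigation,
    allowCrossOriginNavigation:
      overrides.allowCrossOriginNavigation ?? DEFAULT_PERMISSIONS.allowCrossOriginNavigation,
    allowFormFill: overrides.allowFormFill ?? DEFAULT_PERMISSIONS.allowFormFill,
    requireConfirmFor: [
      ...(overrides.requireConfirmFor ?? DEFAULT_PERMISSIONS.requireConfirmFor),
    ],
    confidenceThreshold:
      overrides.confidenceThreshold ?? DEFAULT_PERMISSIONS.confidenceThreshold,
  };
}
```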

Agent Control API

The agent object returned by createVoiceAgent() or useVoiceAgent():

| Method | Returns | Description |
|---|---|---|
| agent.start() | void | Start speech recognition |
| agent.stop() | void | Stop speech recognition |
| agent.toggle() | void | Toggle listening on/off |
| agent.destroy() | void | Stop and release all resources |
| agent.confirm() | void | Confirm a pending action |
| agent.cancel() | void | Cancel a pending action |
| agent.selectSuggestion(index) | void | Pick suggestion by 0-based index |
| agent.setLanguage(lang) | void | Switch recognition language at runtime |
| agent.getState() | VoiceAgentState | Snapshot of current state |
| agent.subscribe(fn) | () => void | Subscribe to state; returns unsubscribe |
| agent.login() | Promise<void> | Firebase Google sign-in popup |
| agent.logout() | Promise<void> | Firebase sign-out |

State Reference

interface VoiceAgentState {
  status:              "idle" | "listening" | "processing" | "error";
  listening:           boolean;
  lastTranscript:      string;         // Most recent final speech text
  lastIntent:          IntentResult | null;
  awaitingConfirm:     boolean;        // true when confirmation dialog is open
  confirmPrompt:       string | null;  // Human-readable confirmation question
  lastError:           string | null;
  activeLanguage:      string;         // Current BCP-47 language code
  user:                User | null;    // Firebase user (if auth enabled)
  sentenceQueue:       string[];       // Pending commands not yet processed
  processingSentence:  string | null;  // Command currently being processed
  queueFull:           boolean;        // true when queue hits maxSize
  suggestions:         string[] | null; // Alternative actions from backend
  awaitingSuggestion:  boolean;        // true when suggestion list is shown
  currentStep:         number | null;  // Step index in multi-action sequence
  totalSteps:          number | null;  // Total steps in sequence
}
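The `sentenceQueue`, `processingSentence`, and `queueFull` fields describe a first-come-first-served queue with backpressure. The class below is a minimal sketch of that idea, not the SDK's implementation: `SentenceQueue` and its methods are hypothetical, and the assumption is that a full queue tells the caller to pause the microphone instead of dropping the sentence.

```typescript
class SentenceQueue {
  private readonly items: string[] = [];
  private readonly maxSize: number;

  constructor(maxSize: number = 5) {
    this.maxSize = maxSize;
  }

  // True once the queue holds maxSize sentences (mirrors state.queueFull).
  get queueFull(): boolean {
    return this.items.length >= this.maxSize;
  }

  // Adds a sentence at the back. Returns false when the queue is full,
  // signalling the caller to pause the mic until a slot opens.
  enqueue(sentence: string): boolean {
    if (this.queueFull) return false;
    this.items.push(sentence);
    return true;
  }

  // Removes and returns the oldest sentence, or null when empty.
  dequeue(): string | null {
    return this.items.shift() ?? null;
  }

  // Copy of the pending sentences, oldest first (mirrors state.sentenceQueue).
  snapshot(): string[] {
    return [...this.items];
  }
}
```

A consumer would call `dequeue()` each time an intent finishes processing and resume the microphone once `queueFull` returns to false.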

Supported Actions

| Action | Requires | Description |
|---|---|---|
| click | targetId | Click any interactive element by its snapshot ID |
| fill | targetId, value | Set value on input / textarea / select / contenteditable |
| scroll | delta (px), optional targetId | Scroll page or a specific container |
| navigate | url, allowNavigation: true | Navigate to a URL |
| back | — | Browser history.back() |
| submit | targetId | Submit a form element |

Multi-Step Sequences

Your backend can return an array of steps:

{
  "action": null,
  "actions": [
    { "action": "fill", "targetId": "el-3", "value": "[email protected]" },
    { "action": "fill", "targetId": "el-4", "value": "password123" },
    { "action": "click", "targetId": "el-5" }
  ],
  "confidence": 0.91,
  "reason": "Log in with provided credentials"
}

Steps execute sequentially with a 300ms pause between each to allow React/Angular re-renders.

Backend Contract

Your backend must expose a POST endpoint at the URL you pass as intentEndpoint.

Request

POST /intent
Content-Type: application/json
x-site-id: yourdomain.com
x-api-key: <your-api-key>           (if apiKey option set)
Authorization: Bearer <token>        (if getToken / firebaseConfig set)

{
  "text": "click the login button",
  "domSnapshot": [
    {
      "id": "el-1",
      "role": "button",
      "label": "Login",
      "selector": "#login-btn"
    },
    {
      "id": "el-2",
      "role": "input",
      "label": "Email address",
      "selector": "input[type='email']",
      "inputType": "email"
    },
    {
      "id": "el-3",
      "role": "link",
      "label": "Forgot password",
      "selector": "a[href='/forgot']",
      "href": "/forgot"
    }
  ],
  "language": "en-US"
}

DomElementSnapshot Fields

| Field | Type | Description |
|---|---|---|
| id | string | SDK-generated ID ("el-1", "el-2", …) |
| role | "button" \| "input" \| "link" | Element type |
| label | string | Human-readable label (from ARIA, text, alt, etc.) |
| selector | string | CSS selector to locate the element |
| href | string \| null | Link destination (links only) |
| inputType | string \| null | Input type (text, email, password, file, …) |
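For testing a backend without the SDK, the request above can be reproduced by hand. The sketch below assembles the documented headers and JSON body; `buildIntentRequest` is a hypothetical helper, and passing the `x-site-id` value in as a parameter is an assumption (the SDK presumably derives it from the page's domain).

```typescript
interface DomElementSnapshot {
  id: string;
  role: "button" | "input" | "link";
  label: string;
  selector: string;
  href?: string | null;
  inputType?: string | null;
}

interface IntentRequest {
  headers: Record<string, string>;
  body: string;
}

// Builds the headers and JSON body described in the Request section.
function buildIntentRequest(
  text: string,
  domSnapshot: DomElementSnapshot[],
  language: string,
  siteId: string,
  auth: { apiKey?: string; token?: string | null } = {},
): IntentRequest {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    "x-site-id": siteId,
  };
  // Optional headers are only sent when the matching option is configured.
  if (auth.apiKey) headers["x-api-key"] = auth.apiKey;
  if (auth.token) headers["Authorization"] = `Bearer ${auth.token}`;

  return { headers, body: JSON.stringify({ text, domSnapshot, language }) };
}
```

The result can be passed to `fetch(intentEndpoint, { method: "POST", headers, body })` to exercise the endpoint directly.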

Response — Single Action

{
  "action": "click",
  "targetId": "el-1",
  "value": null,
  "confidence": 0.92,
  "delta": null,
  "url": null,
  "reason": "User wants to click the login button"
}
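A backend can be prototyped before any model is wired in. The function below is a deliberately naive stand-in for the AI step: it matches transcript words against snapshot labels and returns a response in the single-action shape shown above. `resolveClickIntent` is a hypothetical name, the word-overlap score is an arbitrary choice for illustration, and the `\W+` split only handles ASCII words.

```typescript
interface SnapshotEntry {
  id: string;
  role: "button" | "input" | "link";
  label: string;
}

interface IntentResult {
  action: "click" | null;
  targetId: string | null;
  value: string | null;
  confidence: number;
  delta: number | null;
  url: string | null;
  reason: string;
}

// Scores each element by the fraction of its label words present in the transcript
// and returns a click on the best match, or a null action when nothing matches.
function resolveClickIntent(text: string, snapshot: SnapshotEntry[]): IntentResult {
  const spoken = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  let best: SnapshotEntry | null = null;
  let bestScore = 0;

  for (const entry of snapshot) {
    const words = entry.label.toLowerCase().split(/\W+/).filter(Boolean);
    if (words.length === 0) continue;
    const hits = words.filter((word) => spoken.has(word)).length;
    const score = hits / words.length;
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }

  const empty = { value: null, delta: null, url: null };
  if (best === null) {
    return { ...empty, action: null, targetId: null, confidence: 0, reason: "No element label matched the transcript" };
  }
  return { ...empty, action: "click", targetId: best.id, confidence: bestScore, reason: `Label "${best.label}" matched the transcript` };
}
```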

Response — Multi-Step Sequence

{
  "action": null,
  "actions": [
    { "action": "fill", "targetId": "el-2", "value": "[email protected]" },
    { "action": "click", "targetId": "el-1" }
  ],
  "confidence": 0.88,
  "reason": "Fill email then click login"
}

Response — Suggestions (no match)

Return suggestions when the intent is unclear. The SDK will read them out and wait for the user to pick one verbally ("one", "two", "three") or by tapping in the overlay.

{
  "action": null,
  "suggestions": [
    "Click the Login button",
    "Fill in the email field",
    "Navigate to the sign-up page"
  ],
  "confidence": 0.3,
  "reason": "Ambiguous command"
}
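The verbal pick can be sketched as a small parser that turns "one", "two", or "three" into the 0-based index expected by `agent.selectSuggestion(index)`. `parseSuggestionChoice` is a hypothetical helper and only covers English number words and digits; how the SDK handles other languages is not covered here.

```typescript
// Spoken or typed numbers mapped to their 1-based position.
const SPOKEN_NUMBERS = new Map<string, number>([
  ["one", 1], ["two", 2], ["three", 3],
  ["1", 1], ["2", 2], ["3", 3],
]);

// Returns the 0-based index of the chosen suggestion, or null when the
// transcript contains no valid choice for a list of the given length.
function parseSuggestionChoice(transcript: string, suggestionCount: number): number | null {
  const words = transcript.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  for (const word of words) {
    const position = SPOKEN_NUMBERS.get(word);
    if (position !== undefined && position <= suggestionCount) {
      return position - 1;
    }
  }
  return null;
}
```

A non-null result can be passed straight to `agent.selectSuggestion(index)`.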

IntentResult Fields

| Field | Type | Description |
|---|---|---|
| action | VoiceAgentAction \| null | Action to execute |
| targetId | string \| null | Target element from snapshot |
| value | string \| null | Fill value |
| delta | number \| null | Scroll amount in pixels |
| url | string \| null | Navigate destination |
| confidence | number \| null | 0–1 confidence score |
| reason | string \| null | Human-readable explanation |
| suggestions | string[] \| null | Up to 3 alternative commands |
| actions | ActionStep[] \| null | Multi-step action array |

React Integration

`useVoiceAgent(options)`

import { useVoiceAgent, VoiceAgentOverlay } from "@fakhre/voice-agent-sdk";

function App() {
  const { agent, state } = useVoiceAgent({
    intentEndpoint: "https://api.example.com/intent",
    wakeWord: "hey app",
    language: ["en-US", "fr-FR"],
    permissions: {
      allowNavigation: true,
      requireConfirmFor: ["delete", "pay"],
    },
    onTranscript: (text, isFinal) => {
      if (isFinal) console.log("Final:", text);
    },
  });

  return (
    <div>
      <p>Status: {state.status}</p>
      <p>Heard: {state.lastTranscript}</p>
      {agent && (
        <VoiceAgentOverlay
          agent={agent}
          state={state}
          languages={["en-US", "fr-FR", "ar-SA"]}
          style={{ bottom: "24px", right: "24px" }}
          buttonStyle={{ backgroundColor: "#6200ee" }}
        />
      )}
    </div>
  );
}

`VoiceAgentOverlay` Props

| Prop | Type | Required | Description |
|---|---|---|---|
| agent | VoiceAgent | Yes | Agent instance |
| state | VoiceAgentState | Yes | Reactive state from hook |
| languages | string[] | No | Language codes to show in selector |
| style | CSSProperties | No | Override container styles |
| buttonStyle | CSSProperties | No | Override mic button styles |
| panelAreaStyle | CSSProperties | No | Override panel area styles |

Controlling the Agent from React

import type { VoiceAgent } from "@fakhre/voice-agent-sdk";

function Controls({ agent }: { agent: VoiceAgent | null }) {
  return (
    <div>
      <button onClick={() => agent?.start()}>Start</button>
      <button onClick={() => agent?.stop()}>Stop</button>
      <button onClick={() => agent?.toggle()}>Toggle</button>
      <button onClick={() => agent?.setLanguage("ar-SA")}>Arabic</button>
    </div>
  );
}

Angular Integration

Angular does not use JSX, so use createVoiceAgent directly and hook into Angular's lifecycle methods.

Installation

npm install @fakhre/voice-agent-sdk

Service (Recommended Pattern)

// voice-agent.service.ts
import { Injectable, OnDestroy } from "@angular/core";
import { BehaviorSubject } from "rxjs";
import { createVoiceAgent, VoiceAgent, VoiceAgentState } from "@fakhre/voice-agent-sdk";

@Injectable({ providedIn: "root" })
export class VoiceAgentService implements OnDestroy {
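  // Public so the overlay template can call voiceService.agent?.selectSuggestion(i);
  // Angular templates cannot access private members.
  agent: VoiceAgent | null = null;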
  state$ = new BehaviorSubject<VoiceAgentState | null>(null);

  init(intentEndpoint: string): void {
    this.agent = createVoiceAgent({
      intentEndpoint,
      wakeWord: "hey app",
      language: ["en-US"],
      onTranscript: (text, isFinal) => console.log(text),
      onError: (err) => console.error(err),
    });

    this.agent.subscribe((state) => this.state$.next(state));
    this.agent.start();
  }

  start()  { this.agent?.start(); }
  stop()   { this.agent?.stop(); }
  toggle() { this.agent?.toggle(); }
  confirm(){ this.agent?.confirm(); }
  cancel() { this.agent?.cancel(); }
  setLanguage(lang: string) { this.agent?.setLanguage(lang); }

  ngOnDestroy(): void {
    this.agent?.destroy();
  }
}

Component Usage

// app.component.ts
import { Component, OnInit, OnDestroy } from "@angular/core";
import { VoiceAgentService } from "./voice-agent.service";
import { VoiceAgentState } from "@fakhre/voice-agent-sdk";

@Component({
  selector: "app-root",
  template: `
    <div>
      <p>Status: {{ (voiceService.state$ | async)?.status }}</p>
      <p>Heard: {{ (voiceService.state$ | async)?.lastTranscript }}</p>

      <button (click)="voiceService.toggle()">Toggle Mic</button>

      <div *ngIf="(voiceService.state$ | async)?.awaitingConfirm">
        <p>{{ (voiceService.state$ | async)?.confirmPrompt }}</p>
        <button (click)="voiceService.confirm()">Yes</button>
        <button (click)="voiceService.cancel()">No</button>
      </div>
    </div>
  `,
})
export class AppComponent implements OnInit, OnDestroy {
  constructor(public voiceService: VoiceAgentService) {}

  ngOnInit(): void {
    this.voiceService.init("https://api.example.com/intent");
  }

  ngOnDestroy(): void {
    this.voiceService.ngOnDestroy();
  }
}

Overlay in Angular (Custom Template)

Since the VoiceAgentOverlay is a React component, build your own Angular overlay using the state observable:

<!-- voice-overlay.component.html -->
<div class="voice-overlay" *ngIf="state">
  <button
    class="mic-btn"
    [class.listening]="state.listening"
    [class.processing]="state.status === 'processing'"
    (click)="voiceService.toggle()"
  >
    {{ state.listening ? '🎙️' : '🎤' }}
  </button>

  <div class="panel" *ngIf="state.lastTranscript">
    <p>Heard: {{ state.lastTranscript }}</p>
  </div>

  <div class="confirm-dialog" *ngIf="state.awaitingConfirm">
    <p>{{ state.confirmPrompt }}</p>
    <button (click)="voiceService.confirm()">Yes</button>
    <button (click)="voiceService.cancel()">No</button>
  </div>

  <div class="suggestions" *ngIf="state.awaitingSuggestion">
    <p *ngFor="let s of state.suggestions; let i = index">
      <button (click)="voiceService.agent?.selectSuggestion(i)">{{ i + 1 }}. {{ s }}</button>
    </p>
  </div>
</div>

Vanilla JS / Plain HTML Integration

<!DOCTYPE html>
<html>
  <head>
    <script type="module">
      import { createVoiceAgent } from "https://cdn.jsdelivr.net/npm/@fakhre/voice-agent-sdk/dist/index.js";

      const agent = createVoiceAgent({
        intentEndpoint: "https://api.example.com/intent",
        wakeWord: "hey assistant",
        language: "en-US",
        onTranscript: (text, isFinal) => {
          if (isFinal) document.getElementById("transcript").textContent = text;
        },
        onAction: (intent, ok) => {
          console.log(ok ? "Done!" : "Failed", intent);
        },
      });

      agent.subscribe((state) => {
        document.getElementById("status").textContent = state.status;
      });

      document.getElementById("start-btn").onclick = () => agent.start();
      document.getElementById("stop-btn").onclick = () => agent.stop();
    </script>
  </head>
  <body>
    <p>Status: <span id="status">idle</span></p>
    <p>Heard: <span id="transcript"></span></p>
    <button id="start-btn">Start</button>
    <button id="stop-btn">Stop</button>

    <!-- Your actual app elements — SDK scans and controls these -->
    <button aria-label="Open settings">Settings</button>
    <input type="text" placeholder="Search" />
  </body>
</html>

Mobile — Android APK (WebView)

The SDK runs inside an Android WebView since it targets the web platform. Follow these steps to integrate it in a native Android app.

1. Enable JavaScript and Web Speech API

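// MainActivity.kt
import android.os.Bundle
import android.webkit.PermissionRequest
import android.webkit.WebChromeClient
import android.webkit.WebSettings
import android.webkit.WebView
import androidx.appcompat.app.AppCompatActivity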

class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        val webView = WebView(this)
        val settings: WebSettings = webView.settings

        settings.javaScriptEnabled = true
        settings.domStorageEnabled = true
        settings.mediaPlaybackRequiresUserGesture = false

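        // Required for microphone permission in WebView.
        // Note: this grants every requested resource; in production, consider
        // granting only PermissionRequest.RESOURCE_AUDIO_CAPTURE for trusted origins.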
        webView.webChromeClient = object : WebChromeClient() {
            override fun onPermissionRequest(request: PermissionRequest) {
                runOnUiThread {
                    request.grant(request.resources)
                }
            }
        }

        webView.loadUrl("https://your-voice-app.com")
        setContentView(webView)
    }
}

2. Request Microphone Permission

<!-- AndroidManifest.xml -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />

// Request at runtime (Android 6+)
if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
    != PackageManager.PERMISSION_GRANTED) {
    ActivityCompat.requestPermissions(this, arrayOf(Manifest.permission.RECORD_AUDIO), 100)
}

3. Web Speech API Note

The Web Speech API (SpeechRecognition) is supported in Chrome on Android but not in all WebView versions. To reach a wide range of Android devices, use the LiveKit STT option instead. It performs speech recognition server-side, so it does not depend on the WebView's own speech support:

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  livekit: {
    serverUrl: "wss://your-app.livekit.cloud",
    tokenEndpoint: "https://api.example.com/livekit-token",
  },
});

4. JavaScript Bridge (Optional)

If you need to control the agent from native Android code:

// Inject bridge into WebView
webView.addJavascriptInterface(object : Any() {
    @JavascriptInterface
    fun startAgent() {
        webView.post { webView.evaluateJavascript("window.__voiceAgent?.start()", null) }
    }

    @JavascriptInterface
    fun stopAgent() {
        webView.post { webView.evaluateJavascript("window.__voiceAgent?.stop()", null) }
    }
}, "AndroidBridge")

// In your web app — expose agent globally
const agent = createVoiceAgent({ … });
(window as any).__voiceAgent = agent;

Mobile — iOS IPA (WKWebView)

The SDK runs inside a WKWebView on iOS. Safari's WebKit engine supports the Web Speech API on iOS 14.5+.

1. Setup WKWebView with Microphone

// ViewController.swift
import UIKit
import WebKit

class ViewController: UIViewController, WKUIDelegate {
    var webView: WKWebView!

    override func viewDidLoad() {
        super.viewDidLoad()

        let config = WKWebViewConfiguration()
        config.allowsInlineMediaPlayback = true
        config.mediaTypesRequiringUserActionForPlayback = []

        webView = WKWebView(frame: view.bounds, configuration: config)
        webView.uiDelegate = self
        view.addSubview(webView)

        let url = URL(string: "https://your-voice-app.com")!
        webView.load(URLRequest(url: url))
    }

    // Grant microphone permission from WebView prompt.
    // This WKUIDelegate method is only available on iOS 15 and later.
    @available(iOS 15.0, *)
    func webView(
        _ webView: WKWebView,
        requestMediaCapturePermissionFor origin: WKSecurityOrigin,
        initiatedByFrame frame: WKFrameInfo,
        type: WKMediaCaptureType,
        decisionHandler: @escaping (WKPermissionDecision) -> Void
    ) {
        decisionHandler(.grant)
    }
}

2. Info.plist Permissions

<!-- Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>Voice agent needs the microphone to hear your commands.</string>
<key>NSSpeechRecognitionUsageDescription</key>
<string>Voice agent uses speech recognition to understand your commands.</string>

3. JavaScript Bridge from Swift (Optional)

// Inject bridge
let script = WKUserScript(
    source: "window.__nativeBridge = true;",
    injectionTime: .atDocumentStart,
    forMainFrameOnly: true
)
webView.configuration.userContentController.addUserScript(script)

// Call from Swift to start agent
webView.evaluateJavaScript("window.__voiceAgent?.start()", completionHandler: nil)

4. WKWebView + LiveKit Recommendation

Web Speech API in WKWebView requires iOS 14.5+ and HTTPS. For maximum compatibility, especially on older iOS versions, use LiveKit STT:

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  livekit: {
    serverUrl: "wss://your-app.livekit.cloud",
    tokenEndpoint: "https://api.example.com/livekit-token",
  },
});

Deployment Requirements (iOS)

Hosted app must be served over HTTPS (required for microphone and Speech API)
Works in: Safari on iOS, WKWebView on iOS 14.5+
Does not work in UIWebView (deprecated)

LiveKit STT (Server-Side Speech)

Use LiveKit as the speech-to-text engine instead of the browser's Web Speech API. This is ideal for:

Environments where Web Speech API is unavailable (some WebViews, Firefox, older browsers)
Server-controlled language configuration
Enterprise / on-premise deployments

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  livekit: {
    serverUrl: "wss://your-app.livekit.cloud",
    tokenEndpoint: "https://api.example.com/livekit-token",
  },
});

Your tokenEndpoint must return:

{ "token": "<livekit-room-token>" }

Note: Language switching via agent.setLanguage() is not available with LiveKit — language is configured server-side in your LiveKit agent.

Authentication

Firebase Google Auth

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  firebaseConfig: {
    apiKey: "…",
    authDomain: "your-app.firebaseapp.com",
    projectId: "your-app",
    // … rest of Firebase config
  },
});

// Trigger login
await agent.login();

// Access user
const state = agent.getState();
console.log(state.user?.email);

// Logout
await agent.logout();

When a Firebase user is signed in, every intent request automatically includes Authorization: Bearer <id-token>.

Custom Token Provider

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  getToken: async () => {
    const res = await fetch("/api/auth/token");
    const { token } = await res.json();
    return token;
  },
});

API Key

const agent = createVoiceAgent({
  intentEndpoint: "https://api.example.com/intent",
  apiKey: "sk-live-abc123",
  // Sends "x-api-key: sk-live-abc123" on every request
});

Multi-Step Action Sequences

Your backend can return a sequence of actions to execute in order:

{
  "action": null,
  "actions": [
    { "action": "click", "targetId": "el-1" },
    { "action": "fill",  "targetId": "el-2", "value": "John Doe" },
    { "action": "fill",  "targetId": "el-3", "value": "[email protected]" },
    { "action": "click", "targetId": "el-4" }
  ],
  "confidence": 0.94,
  "reason": "Complete the registration form"
}

Behaviour:

Steps execute one at a time with a 300ms gap (allows DOM/React re-renders)
If any step contains a risky keyword, a single confirmation dialog is shown before execution starts
state.currentStep and state.totalSteps are updated in real-time — use them to show a progress bar
The overlay shows a progress bar with percentage automatically

Safety & Confirmation System

Automatic Confirmation Triggers

| Trigger | Default Condition |
|---|---|
| Risk keywords | Action label or value contains: delete, remove, logout, sign out, pay, purchase, checkout, unsubscribe |
| Navigation blocked | allowNavigation: false (default) |
| File input | <input type="file"> always requires confirmation |
| Low confidence | confidence < confidenceThreshold (default 0.55) |
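The four triggers can be read as independent checks on a pending action. The sketch below returns which of them fire; it is an illustration under stated assumptions, not the SDK's safety code. `findTriggers`, `PendingAction`, and `SafetyRules` are hypothetical names, keyword matching is a plain case-insensitive substring test, and a missing confidence score is treated as low confidence.

```typescript
type Trigger = "risk-keyword" | "navigation-blocked" | "file-input" | "low-confidence";

interface PendingAction {
  action: "click" | "fill" | "scroll" | "navigate" | "back" | "submit";
  targetLabel?: string;     // label of the target element from the snapshot
  targetInputType?: string; // inputType of the target, when it is an input
  value?: string | null;
  confidence: number | null;
}

interface SafetyRules {
  allowNavigation: boolean;
  requireConfirmFor: string[];
  confidenceThreshold: number;
}

// Returns every trigger from the table that applies to the pending action.
function findTriggers(pending: PendingAction, rules: SafetyRules): Trigger[] {
  const triggers: Trigger[] = [];
  const haystack = `${pending.targetLabel ?? ""} ${pending.value ?? ""}`.toLowerCase();

  if (rules.requireConfirmFor.some((keyword) => haystack.includes(keyword.toLowerCase()))) {
    triggers.push("risk-keyword");
  }
  if (pending.action === "navigate" && !rules.allowNavigation) {
    triggers.push("navigation-blocked");
  }
  if (pending.targetInputType === "file") {
    triggers.push("file-input");
  }
  // Assumption: a null confidence is handled like a low one.
  if (pending.confidence === null || pending.confidence < rules.confidenceThreshold) {
    triggers.push("low-confidence");
  }
  return triggers;
}
```

An empty result means the action can run without a prompt. Because matching is by substring, a keyword such as "pay" also matches labels like "Payment history", so keyword lists are best kept specific.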

Customising Risk Keywords

permissions: {
  requireConfirmFor: ["delete", "cancel subscription", "wipe", "format"],
}

Voice Confirmation

The user can say:

Confirm: "yes", "yeah", "yep", "ok", "okay", "sure", "go ahead", "do it", "confirm", "proceed"
Cancel: "no", "nope", "cancel", "stop", "never mind", "don't", "abort"
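A minimal sketch of how a spoken reply could be classified against the two phrase lists above. `classifyReply` is a hypothetical helper, not an SDK export. It matches whole words only (so "know" does not count as "no"), handles English only, and assumes that cancel wins when a reply contains phrases from both lists.

```typescript
const CONFIRM_PHRASES = ["yes", "yeah", "yep", "ok", "okay", "sure", "go ahead", "do it", "confirm", "proceed"];
const CANCEL_PHRASES = ["no", "nope", "cancel", "stop", "never mind", "don't", "abort"];

// Lowercases, straightens curly apostrophes, and strips punctuation.
function normalise(transcript: string): string {
  return transcript
    .toLowerCase()
    .replace(/\u2019/g, "'")
    .replace(/[^a-z' ]+/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// True when the phrase appears as a whole-word sequence in the normalised text.
function containsPhrase(normalised: string, phrase: string): boolean {
  return ` ${normalised} `.includes(` ${phrase} `);
}

function classifyReply(transcript: string): "confirm" | "cancel" | null {
  const text = normalise(transcript);
  // Cancel is checked first so that mixed replies err on the side of not acting.
  if (CANCEL_PHRASES.some((phrase) => containsPhrase(text, phrase))) return "cancel";
  if (CONFIRM_PHRASES.some((phrase) => containsPhrase(text, phrase))) return "confirm";
  return null;
}
```

A result of "confirm" would map to `agent.confirm()`, "cancel" to `agent.cancel()`, and null would leave the prompt open.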

Programmatic Confirmation

agent.confirm();  // Execute pending action
agent.cancel();   // Discard pending action

Language Support

15 languages are supported out of the box:

| Code | Language | Native Label |
|---|---|---|
| en-US | English (US) | English |
| en-GB | English (UK) | English (UK) |
| ar-SA | Arabic (Saudi Arabia) | العربية (السعودية) |
| ar-EG | Arabic (Egypt) | العربية (مصر) |
| ar-AE | Arabic (UAE) | العربية (الإمارات) |
| hi-IN | Hindi | हिन्दी |
| fr-FR | French | Français |
| de-DE | German | Deutsch |
| es-ES | Spanish | Español |
| zh-CN | Chinese (Simplified) | 中文 (简体) |
| ja-JP | Japanese | 日本語 |
| pt-BR | Portuguese (Brazil) | Português |
| ru-RU | Russian | Русский |
| tr-TR | Turkish | Türkçe |
| ur-PK | Urdu | اردو |

Runtime Language Switching

agent.setLanguage("ar-SA");    // Switch to Arabic
agent.setLanguage("zh-CN");    // Switch to Chinese

Multi-Language Initialisation

const { agent, state } = useVoiceAgent({
  intentEndpoint: "…",
  language: ["en-US", "ar-SA", "hi-IN"],
  // First in array = default language
  // Rest appear in the overlay's language selector
});

Importing Language Data

import { SUPPORTED_LANGUAGES, DEFAULT_LANGUAGES } from "@fakhre/voice-agent-sdk";
// SUPPORTED_LANGUAGES: LanguageOption[]  — all 15 entries
// DEFAULT_LANGUAGES: string[]            — ["en-US","ar-SA","ar-EG","ar-AE"]

Encryption

The SDK supports optional end-to-end encryption using ECDH (P-256) + AES-256-GCM for sensitive deployments.

When your backend exposes a GET /intent/pubkey endpoint that returns the server's public key, the SDK will:

Generate a P-256 ECDH key pair
Fetch the server's public key
Derive a shared AES-256-GCM key
Encrypt every request body before sending
Decrypt encrypted responses automatically

This happens transparently — no configuration needed on the client side.
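
For backend authors who need to implement the other side of this handshake, the five steps above correspond roughly to the following Web Crypto calls. This is a sketch under stated assumptions, not the SDK's actual implementation: the wire format of GET /intent/pubkey, how the IV travels with the ciphertext, and whether the shared secret passes through an extra key-derivation step are not specified here. All function names are hypothetical, and both key pairs are generated locally so the example runs without a server.

```typescript
const subtle = globalThis.crypto.subtle;

// Step 1: generate a P-256 ECDH key pair (each side does this).
async function generateEcdhKeyPair(): Promise<CryptoKeyPair> {
  return subtle.generateKey({ name: "ECDH", namedCurve: "P-256" }, false, ["deriveKey"]);
}

// Step 3: derive a shared AES-256-GCM key from our private key and the peer's public key.
async function deriveAesKey(privateKey: CryptoKey, peerPublicKey: CryptoKey): Promise<CryptoKey> {
  return subtle.deriveKey(
    { name: "ECDH", public: peerPublicKey },
    privateKey,
    { name: "AES-GCM", length: 256 },
    false,
    ["encrypt", "decrypt"],
  );
}

// Step 4: encrypt a JSON body with a fresh 96-bit IV per message.
async function encryptJson(key: CryptoKey, body: unknown) {
  const iv = globalThis.crypto.getRandomValues(new Uint8Array(12));
  const plaintext = new TextEncoder().encode(JSON.stringify(body));
  const ciphertext = await subtle.encrypt({ name: "AES-GCM", iv }, key, plaintext);
  return { iv, ciphertext };
}

type SealedMessage = Awaited<ReturnType<typeof encryptJson>>;

// Step 5: decrypt and parse; throws if the ciphertext was tampered with.
async function decryptJson(key: CryptoKey, sealed: SealedMessage): Promise<unknown> {
  const plaintext = await subtle.decrypt(
    { name: "AES-GCM", iv: sealed.iv },
    key,
    sealed.ciphertext,
  );
  return JSON.parse(new TextDecoder().decode(plaintext));
}
```

Step 2 (fetching the server's public key) is omitted because its response format is not documented in this section; once the key bytes are available, `subtle.importKey` can turn them into a `CryptoKey` suitable for `deriveAesKey`.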

Full Exports Reference

// React
export { useVoiceAgent } from "./react/useVoiceAgent";
export { VoiceAgentOverlay } from "./react/VoiceAgentOverlay";

// Core
export { createVoiceAgent } from "./agent/createVoiceAgent";

// Utilities
export { scanDom } from "./dom/scanDom";
export { executeIntent } from "./exec/executeIntent";

// Language data
export { SUPPORTED_LANGUAGES } from "./react/languages";
export { DEFAULT_LANGUAGES } from "./react/languages";

// Types
export type {
  VoiceAgent,
  VoiceAgentOptions,
  VoiceAgentState,
  VoiceAgentAction,
  IntentResult,
  ActionStep,
  DomElementSnapshot,
  DomRole,
  Permissions,
  LiveKitConfig,
} from "./agent/types";

`scanDom(options?)` — Standalone Utility

Scan the DOM and return a snapshot without starting the full agent. Useful for debugging or building a custom backend request.

import { scanDom } from "@fakhre/voice-agent-sdk";

const snapshot = scanDom({ max: 100 });
console.log(snapshot);
// [{ id: "el-1", role: "button", label: "Login", selector: "#login-btn" }, …]

`executeIntent(intent, snapshot, permissions)` — Standalone Utility

Execute an intent against the current DOM without the full agent.

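import { executeIntent, scanDom } from "@fakhre/voice-agent-sdk";

// Take a fresh snapshot so targetId values match the current DOM
const snapshot = scanDom({ max: 100 });

const ok = executeIntent(
  { action: "click", targetId: "el-1", confidence: 0.9 },
  snapshot,
  { allowNavigation: false }
);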

Browser Support

By default the SDK uses the Web Speech API for speech recognition. Supported environments:

| Browser | Support | Notes |
|---|---|---|
| Chrome (desktop) | ✅ Full | Recommended |
| Edge (Chromium) | ✅ Full | |
| Safari (macOS 14.1+) | ✅ Full | |
| Safari (iOS 14.5+) | ✅ Full | HTTPS required |
| Firefox | ❌ No | Use LiveKit STT instead |
| Android WebView | ⚠️ Partial | Use LiveKit STT for reliability |
| iOS WKWebView | ✅ iOS 14.5+ | HTTPS required |

For maximum cross-browser and cross-platform support, use the LiveKit STT option.
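
One way to choose between the two at runtime is to probe for the Web Speech API constructors before creating the agent. The sketch below shows only the detection step; `SpeechGlobals`, `SttBackend`, and `pickSttBackend` are hypothetical names, and how the result maps onto the SDK's LiveKit options is left to your configuration. The presence of a constructor does not guarantee that recognition works reliably (see the Android WebView row above).

```typescript
// Minimal view of the globals we probe; a browser `window` provides them when supported.
interface SpeechGlobals {
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
}

type SttBackend = "web-speech" | "livekit";

// Returns "web-speech" when either the standard or the webkit-prefixed
// constructor exists, otherwise "livekit".
function pickSttBackend(globals: SpeechGlobals | undefined): SttBackend {
  const hasWebSpeech =
    globals !== undefined &&
    (typeof globals.SpeechRecognition === "function" ||
      typeof globals.webkitSpeechRecognition === "function");
  return hasWebSpeech ? "web-speech" : "livekit";
}
```

In a browser you would pass `window as unknown as SpeechGlobals`; during server-side rendering, pass `undefined` so the check falls through to LiveKit.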

Get Started / Subscription

Free Tier

Try the SDK with your own backend — no account required. Install the package and point intentEndpoint at your own AI endpoint.

Managed Backend (Hosted Intent API)

Want a hosted AI backend that handles intent resolution, DOM understanding, and multi-language support out of the box? We offer a managed service with:

Hosted intent API endpoint
AI-powered DOM understanding (no prompt engineering needed)
Multi-language support with server-side models
Analytics dashboard
SLA & priority support

Contact: [email protected]

npm: npmjs.com/package/@fakhre/voice-agent-sdk

License

MIT

