@fakhre/voice-agent-sdk
v1.2.3
Site-agnostic AI voice agent for React apps
A site-agnostic AI voice agent SDK that drops into any web app to add voice-driven UI control. The SDK listens to speech, scans the DOM for actionable elements, sends the transcript + snapshot to your backend, and safely executes the returned action — with zero UI rewrites required.
Table of Contents
- Features
- Install
- Quick Start
- How the SDK Works
- Writing HTML for Best SDK Support
- DOM Actions — div, image, SVG & More
- Configuration API
- Agent Control API
- State Reference
- Supported Actions
- Backend Contract
- React Integration
- Angular Integration
- Vanilla JS / Plain HTML Integration
- Mobile — Android APK (WebView)
- Mobile — iOS IPA (WKWebView)
- LiveKit STT (Server-Side Speech)
- Authentication
- Multi-Step Action Sequences
- Safety & Confirmation System
- Language Support
- Encryption
- Full Exports Reference
- Browser Support
- Get Started / Subscription
Features
- Voice Recognition — Continuous speech-to-text via Web Speech API with interim results
- Wake Word — Optional activation phrase (e.g., "hey assistant")
- 15 Languages — English, Arabic, Hindi, French, Spanish, Chinese, Japanese, and more
- DOM Scanning — Automatically detects buttons, links, inputs, images, SVGs, and divs with handlers
- Smart Execution — Clicks, fills forms, scrolls, navigates, submits, and goes back
- Multi-Step Actions — Sequential action chains returned by your backend
- Safety Guards — Confidence thresholds, risk-word confirmation prompts, navigation restrictions
- Sentence Queue — FCFS queue with automatic mic backpressure (no lost commands)
- Encryption — Optional ECDH + AES-256-GCM for sensitive endpoints
- Firebase Auth — Optional Google OAuth with JWT forwarding
- LiveKit STT — Server-side speech recognition as an alternative to Web Speech API
- React-First — useVoiceAgent hook and VoiceAgentOverlay component included
- Framework Agnostic — Works with React, Angular, Vue, Svelte, or plain HTML
Install
npm install @fakhre/voice-agent-sdk
Peer dependencies (React projects only):
npm install "react@>=17" "react-dom@>=17"
Quick Start
React
import { useVoiceAgent, VoiceAgentOverlay } from "@fakhre/voice-agent-sdk";
export default function App() {
const { agent, state } = useVoiceAgent({
intentEndpoint: "https://your-backend.com/intent",
});
return (
<>
<YourApp />
{agent && <VoiceAgentOverlay agent={agent} state={state} />}
</>
);
}
Vanilla JS / Angular / Any Framework
import { createVoiceAgent } from "@fakhre/voice-agent-sdk";
const agent = createVoiceAgent({
intentEndpoint: "https://your-backend.com/intent",
onTranscript: (text, isFinal) => console.log("Heard:", text),
onIntent: (intent) => console.log("Intent:", intent),
onAction: (intent, ok) => console.log("Executed:", ok),
});
agent.start();
How the SDK Works
User speaks
│
▼
Web Speech API / LiveKit STT
│ transcript text
▼
SDK scans the DOM (buttons, links, inputs, images, SVGs, divs)
│ DomElementSnapshot[]
▼
POST /intent ──► Your AI backend (GPT / Claude / custom)
│ IntentResult { action, targetId, confidence, ... }
▼
Safety check (confidence, risk words, file inputs, permissions)
│ pass / requires confirmation
▼
executeIntent() ──► click / fill / scroll / navigate / back / submit
│
▼
DOM updated → user sees result
The SDK never modifies your HTML. It reads the DOM, asks your backend what to do, then drives clicks and fills programmatically, just like a real user.
Writing HTML for Best SDK Support
The SDK resolves element labels in this priority order:
- alt attribute (images)
- <title> child of SVG
- aria-label attribute
- title attribute
- aria-labelledby referenced element
- Associated <label for="…"> element
- placeholder attribute
- Visible text content
- Nested image alt or SVG <title>
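The priority order above can be pictured as a first-match lookup. The following is only a sketch under stated assumptions: the LabelSources shape and the resolveLabel function are hypothetical names for illustration, not part of the SDK's exports, and the real scanner reads these values from live DOM nodes.

```typescript
// Hypothetical shape: label candidates already extracted from one element.
interface LabelSources {
  alt?: string;
  svgTitle?: string;
  ariaLabel?: string;
  title?: string;
  ariaLabelledByText?: string;
  labelForText?: string;
  placeholder?: string;
  textContent?: string;
  nestedAltOrSvgTitle?: string;
}

// Keys listed in the documented priority order.
const LABEL_PRIORITY: (keyof LabelSources)[] = [
  "alt",
  "svgTitle",
  "ariaLabel",
  "title",
  "ariaLabelledByText",
  "labelForText",
  "placeholder",
  "textContent",
  "nestedAltOrSvgTitle",
];

// Returns the first non-empty candidate, or null when nothing usable exists.
function resolveLabel(sources: LabelSources): string | null {
  for (const key of LABEL_PRIORITY) {
    const value = sources[key]?.trim();
    if (value) return value;
  }
  return null;
}

// aria-label outranks visible text, so this prints "Submit order".
console.log(resolveLabel({ ariaLabel: "Submit order", textContent: "Place Order" }));
// Whitespace-only text is not a label, so this prints null.
console.log(resolveLabel({ textContent: "   " }));
```

An element for which every candidate is empty resolves to null, which corresponds to the "Bad" cases in the rules that follow.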
Rules for Great Voice Support
Always label interactive elements:
<!-- Good — SDK can identify this -->
<button aria-label="Submit order">Place Order</button>
<!-- Also good — visible text is used -->
<button>Place Order</button>
<!-- Bad — no label, SDK cannot target it -->
<button><img src="cart.png" /></button>
Use aria-label on icon-only buttons:
<!-- Good -->
<button aria-label="Close dialog">✕</button>
<button aria-label="Add to cart"><svg>…</svg></button>
<!-- Bad -->
<button>✕</button>
Label all form inputs:
<!-- Good — linked label -->
<label for="email">Email address</label>
<input id="email" type="email" placeholder="[email protected]" />
<!-- Also good — placeholder as fallback -->
<input type="text" placeholder="Search products" />
<!-- Bad — no label, no placeholder -->
<input type="text" />
Name your action divs:
<!-- Good — SDK picks up aria-label on a clickable div -->
<div role="button" aria-label="Open menu" tabindex="0" onclick="openMenu()">
☰
</div>
<!-- Also good — text content is used -->
<div onclick="openSettings()">Settings</div>
Use data-testid for reliable targeting:
<!-- data-testid is preferred over nth-of-type selectors -->
<button data-testid="checkout-btn">Checkout</button>
<input data-testid="search-input" type="text" />
Link images to their action:
<!-- Image inside a button — SDK reads button label first, then img alt -->
<button aria-label="View profile">
<img src="avatar.png" alt="User avatar" />
</button>
<!-- Standalone clickable image -->
<img src="product.jpg" alt="Buy Blue Sneakers" onclick="openProduct()" />
Use semantic HTML where possible:
<!-- Prefer semantic elements — SDK natively understands these -->
<button> <!-- role: button -->
<a href="…"> <!-- role: link -->
<input> <!-- role: input -->
<select> <!-- role: input -->
<textarea> <!-- role: input -->
<form> <!-- submittable -->
DOM Actions — div, image, SVG & More
The SDK scans for interactive elements in three phases:
Phase 1 — Semantic / ARIA Elements
Buttons (<button>, [role="button"]), anchors (<a href>), inputs, textareas, selects, contenteditable elements, and anything with tabindex or an onclick attribute.
Phase 2 — Images and SVGs
<img> elements and <svg> elements that have a click handler attached.
<!-- Detected as role: button (has onclick) -->
<img src="close.png" alt="Close panel" onclick="closePanel()" />
<!-- SVG with title — SDK reads the <title> as the label -->
<svg onclick="handleClick()" aria-label="Upload file">
<title>Upload file</title>
<path d="…" />
</svg>
Tip: An image or SVG without an onclick (or an event listener attached via JS) will not be included in the snapshot unless it has tabindex or role="button".
Phase 3 — Generic Containers
<div>, <span>, <li>, <td>, and similar elements that have visible interaction signals (click handler, tabindex, or role).
<!-- Div detected as clickable -->
<div data-testid="card-item" onclick="selectCard(1)" tabindex="0">
Product Card
</div>
<!-- List item with role -->
<li role="menuitem" onclick="navigate('/home')">Home</li>
How Click Execution Works on Complex Elements
When the SDK executes a click action, it follows this sequence:
- Resolves the element via its stored CSS selector
- File inputs: Instead of clicking the hidden <input type="file">, it clicks the associated <label> (required for browser gesture support)
- React elements: Fires a pointer-down → pointer-up → click event chain to trigger React's synthetic event system
- SVG / icon wrappers: Falls back to calculating the element's center point and firing a click at those coordinates
- Standard elements: Calls .click() directly
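One way to read the sequence above is as an ordered set of checks, where the first matching case decides how the click is delivered. The sketch below assumes exactly that ordering; ClickTarget, ClickStrategy and pickClickStrategy are hypothetical names, and how the SDK detects each case internally is not documented here.

```typescript
type ClickStrategy = "click-label" | "pointer-chain" | "center-point" | "native-click";

// Hypothetical facts gathered about the resolved element before clicking.
interface ClickTarget {
  isFileInput: boolean;
  isReactManaged: boolean;
  isSvgOrIconWrapper: boolean;
}

// Checks run in the order the cases are listed; the first match wins.
function pickClickStrategy(target: ClickTarget): ClickStrategy {
  if (target.isFileInput) return "click-label";
  if (target.isReactManaged) return "pointer-chain";
  if (target.isSvgOrIconWrapper) return "center-point";
  return "native-click";
}

// A plain button that is neither a file input, React-managed, nor an icon wrapper.
console.log(
  pickClickStrategy({ isFileInput: false, isReactManaged: false, isSvgOrIconWrapper: false })
);
```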
Fill Execution on Inputs
<!-- Text / email / password -->
<input type="text" placeholder="Name" />
<!-- Voice: "fill name with John" → value set, input + change events fired -->
<!-- Textarea -->
<textarea placeholder="Write a message"></textarea>
<!-- Voice: "write hello world in the message box" -->
<!-- Select -->
<select id="country">
<option value="us">United States</option>
<option value="uk">United Kingdom</option>
</select>
<!-- Voice: "select United Kingdom" → matches by text or value -->
<!-- Contenteditable -->
<div contenteditable="true" aria-label="Rich text editor"></div>
<!-- Voice: "type hello in the editor" → uses Selection API -->
Scroll on Specific Elements
<!-- Give a scrollable div a label so voice can target it -->
<div
aria-label="Product list"
style="overflow-y: auto; height: 400px;"
data-testid="product-list"
>
…long list…
</div>
<!-- Voice: "scroll down in the product list" -->
Configuration API
createVoiceAgent(options) / useVoiceAgent(options)
{
// ── Required ─────────────────────────────────────────────────
intentEndpoint: string;
// POST endpoint on your backend that resolves intents.
// ── Speech ───────────────────────────────────────────────────
wakeWord?: string;
// Activation phrase, e.g. "hey assistant".
// Empty string (default) means the agent is always listening.
language?: string | string[];
// Single code: "en-US"
// Array: ["en-US", "ar-SA"] — first is default, rest shown in selector.
// ── Behaviour ────────────────────────────────────────────────
debug?: boolean; // default: false — verbose console logs
scan?: { max?: number }; // default: { max: 80 } — elements in snapshot
queue?: { maxSize?: number };
// default: { maxSize: 5 }
// When full the mic pauses automatically until a slot opens.
// ── Permissions ──────────────────────────────────────────────
permissions?: {
allowNavigation?: boolean; // default: false
allowCrossOriginNavigation?: boolean; // default: false
allowFormFill?: boolean; // default: true
requireConfirmFor?: string[];
// default: ["delete","remove","logout","sign out","pay",
// "purchase","checkout","unsubscribe"]
confidenceThreshold?: number; // default: 0.55 (0–1)
};
// ── Auth ─────────────────────────────────────────────────────
firebaseConfig?: object;
// Pass your Firebase project config to enable Google OAuth.
getToken?: () => Promise<string | null>;
// Custom token provider — sent as "Authorization: Bearer <token>".
apiKey?: string;
// Sent as "x-api-key: <key>" on every request.
// ── LiveKit STT ───────────────────────────────────────────────
livekit?: {
serverUrl: string; // e.g. "wss://your-app.livekit.cloud"
tokenEndpoint: string; // Your backend endpoint returning { token: string }
};
// ── Callbacks ────────────────────────────────────────────────
onTranscript?: (text: string, isFinal: boolean) => void;
onIntent?: (intent: IntentResult) => void;
onAction?: (intent: IntentResult, ok: boolean) => void;
onError?: (err: unknown) => void;
}
Full Options Reference
| Option | Type | Default | Description |
|---|---|---|---|
| intentEndpoint | string | required | POST URL for intent resolution |
| wakeWord | string | "" | Activation phrase; empty = always active |
| language | string \| string[] | "en-US" | Recognition language(s) |
| debug | boolean | false | Verbose console output |
| scan.max | number | 80 | Max elements in DOM snapshot |
| queue.maxSize | number | 5 | Sentence queue capacity |
| permissions | Permissions | see below | Safety & navigation rules |
| firebaseConfig | object | — | Firebase config for Google OAuth |
| getToken | () => Promise<string\|null> | — | Custom JWT provider |
| apiKey | string | — | API key sent as x-api-key header |
| livekit | LiveKitConfig | — | Use LiveKit instead of Web Speech API |
| onTranscript | function | — | Called on every speech result |
| onIntent | function | — | Called when intent is resolved |
| onAction | function | — | Called after action executes |
| onError | function | — | Called on errors |
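To make the queue.maxSize behaviour concrete, here is a minimal sketch of a bounded first-come-first-served queue whose full flag tells the caller when to pause the microphone. SentenceQueue is a hypothetical class written for illustration; it is an assumption about the general technique, not the SDK's internal implementation.

```typescript
// Hypothetical bounded FCFS queue; not the SDK's internal class.
class SentenceQueue {
  private items: string[] = [];
  private readonly maxSize: number;

  constructor(maxSize = 5) {
    this.maxSize = maxSize;
  }

  // True when the caller should pause the microphone.
  get full(): boolean {
    return this.items.length >= this.maxSize;
  }

  // Adds a sentence at the back; returns false if there was no free slot.
  enqueue(sentence: string): boolean {
    if (this.full) return false;
    this.items.push(sentence);
    return true;
  }

  // Removes and returns the oldest sentence, or undefined when empty.
  dequeue(): string | undefined {
    return this.items.shift();
  }
}

const queue = new SentenceQueue(2);
queue.enqueue("open settings");
queue.enqueue("scroll down");
console.log(queue.full);      // true: pause the mic here
console.log(queue.dequeue()); // "open settings"
console.log(queue.full);      // false: safe to resume listening
```

Pausing input while the queue is full, rather than dropping sentences, is what keeps commands from being lost.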
Permissions Reference
| Field | Type | Default | Description |
|---|---|---|---|
| allowNavigation | boolean | false | Allow navigate actions |
| allowCrossOriginNavigation | boolean | false | Allow cross-origin URLs |
| allowFormFill | boolean | true | Allow fill actions |
| requireConfirmFor | string[] | see above | Risky keyword list |
| confidenceThreshold | number | 0.55 | Min confidence before auto-execute |
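As an illustration of how the two navigation flags could combine, the sketch below checks a target URL against the current page's origin. isNavigationAllowed and NavPermissions are hypothetical names, and treating an unparseable URL as blocked is an assumption rather than documented SDK behaviour.

```typescript
// Subset of the permissions table relevant to navigation.
interface NavPermissions {
  allowNavigation?: boolean;
  allowCrossOriginNavigation?: boolean;
}

// Returns true only when the permissions allow navigating to `target`.
function isNavigationAllowed(
  target: string,
  currentHref: string,
  permissions: NavPermissions = {}
): boolean {
  if (!permissions.allowNavigation) return false;
  let url: URL;
  try {
    // Relative targets such as "/cart" resolve against the current page.
    url = new URL(target, currentHref);
  } catch {
    return false; // Assumption: a malformed URL is never followed.
  }
  if (url.origin === new URL(currentHref).origin) return true;
  return permissions.allowCrossOriginNavigation === true;
}

const here = "https://shop.example/home";
console.log(isNavigationAllowed("/cart", here));                                   // false (defaults)
console.log(isNavigationAllowed("/cart", here, { allowNavigation: true }));        // true
console.log(isNavigationAllowed("https://other.example/", here, { allowNavigation: true })); // false
```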
Agent Control API
The agent object returned by createVoiceAgent() or useVoiceAgent():
| Method | Returns | Description |
|---|---|---|
| agent.start() | void | Start speech recognition |
| agent.stop() | void | Stop speech recognition |
| agent.toggle() | void | Toggle listening on/off |
| agent.destroy() | void | Stop and release all resources |
| agent.confirm() | void | Confirm a pending action |
| agent.cancel() | void | Cancel a pending action |
| agent.selectSuggestion(index) | void | Pick suggestion by 0-based index |
| agent.setLanguage(lang) | void | Switch recognition language at runtime |
| agent.getState() | VoiceAgentState | Snapshot of current state |
| agent.subscribe(fn) | () => void | Subscribe to state; returns unsubscribe |
| agent.login() | Promise<void> | Firebase Google sign-in popup |
| agent.logout() | Promise<void> | Firebase sign-out |
State Reference
interface VoiceAgentState {
status: "idle" | "listening" | "processing" | "error";
listening: boolean;
lastTranscript: string; // Most recent final speech text
lastIntent: IntentResult | null;
awaitingConfirm: boolean; // true when confirmation dialog is open
confirmPrompt: string | null; // Human-readable confirmation question
lastError: string | null;
activeLanguage: string; // Current BCP-47 language code
user: User | null; // Firebase user (if auth enabled)
sentenceQueue: string[]; // Pending commands not yet processed
processingSentence: string | null; // Command currently being processed
queueFull: boolean; // true when queue hits maxSize
suggestions: string[] | null; // Alternative actions from backend
awaitingSuggestion: boolean; // true when suggestion list is shown
currentStep: number | null; // Step index in multi-action sequence
totalSteps: number | null; // Total steps in sequence
}
Supported Actions
| Action | Requires | Description |
|---|---|---|
| click | targetId | Click any interactive element by its snapshot ID |
| fill | targetId, value | Set value on input / textarea / select / contenteditable |
| scroll | delta (px), optional targetId | Scroll page or a specific container |
| navigate | url, allowNavigation: true | Navigate to a URL |
| back | — | Browser history.back() |
| submit | targetId | Submit a form element |
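A backend may want to check its own output against the "Requires" column before responding. The following sketch does that with a hypothetical missingFields helper; the Step type is an assumption modelled on the table, not a type exported by the SDK.

```typescript
type VoiceAction = "click" | "fill" | "scroll" | "navigate" | "back" | "submit";

// Hypothetical step shape modelled on the Supported Actions table.
interface Step {
  action: VoiceAction;
  targetId?: string | null;
  value?: string | null;
  delta?: number | null;
  url?: string | null;
}

type StepField = "targetId" | "value" | "delta" | "url";

// Lists the required fields that are absent or null for the step's action.
function missingFields(step: Step): StepField[] {
  const missing: StepField[] = [];
  const need = (field: StepField): void => {
    if (step[field] === undefined || step[field] === null) missing.push(field);
  };
  switch (step.action) {
    case "click":
    case "submit":
      need("targetId");
      break;
    case "fill":
      need("targetId");
      need("value");
      break;
    case "scroll":
      need("delta"); // targetId is optional for scroll
      break;
    case "navigate":
      need("url");
      break;
    case "back":
      break; // no fields required
  }
  return missing;
}

console.log(missingFields({ action: "fill", targetId: "el-3" })); // [ 'value' ]
console.log(missingFields({ action: "back" }));                   // []
```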
Multi-Step Sequences
Your backend can return an array of steps:
{
"action": null,
"actions": [
{ "action": "fill", "targetId": "el-3", "value": "[email protected]" },
{ "action": "fill", "targetId": "el-4", "value": "password123" },
{ "action": "click", "targetId": "el-5" }
],
"confidence": 0.91,
"reason": "Log in with provided credentials"
}
Steps execute sequentially with a 300 ms pause between each to allow React/Angular re-renders.
Backend Contract
Your backend must expose a POST endpoint at the URL you pass as intentEndpoint.
Request
POST /intent
Content-Type: application/json
x-site-id: yourdomain.com
x-api-key: <your-api-key> (if apiKey option set)
Authorization: Bearer <token> (if getToken / firebaseConfig set)
{
"text": "click the login button",
"domSnapshot": [
{
"id": "el-1",
"role": "button",
"label": "Login",
"selector": "#login-btn"
},
{
"id": "el-2",
"role": "input",
"label": "Email address",
"selector": "input[type='email']",
"inputType": "email"
},
{
"id": "el-3",
"role": "link",
"label": "Forgot password",
"selector": "a[href='/forgot']",
"href": "/forgot"
}
],
"language": "en-US"
}
DomElementSnapshot Fields
| Field | Type | Description |
|---|---|---|
| id | string | SDK-generated ID ("el-1", "el-2", …) |
| role | "button" \| "input" \| "link" | Element type |
| label | string | Human-readable label (from ARIA, text, alt, etc.) |
| selector | string | CSS selector to locate the element |
| href | string \| null | Link destination (links only) |
| inputType | string \| null | Input type (text, email, password, file, …) |
Response — Single Action
{
"action": "click",
"targetId": "el-1",
"value": null,
"confidence": 0.92,
"delta": null,
"url": null,
"reason": "User wants to click the login button"
}
Response — Multi-Step Sequence
{
"action": null,
"actions": [
{ "action": "fill", "targetId": "el-2", "value": "[email protected]" },
{ "action": "click", "targetId": "el-1" }
],
"confidence": 0.88,
"reason": "Fill email then click login"
}
Response — Suggestions (no match)
Return suggestions when the intent is unclear. The SDK will read them out and wait for the user to pick one verbally ("one", "two", "three") or by tapping in the overlay.
{
"action": null,
"suggestions": [
"Click the Login button",
"Fill in the email field",
"Navigate to the sign-up page"
],
"confidence": 0.3,
"reason": "Ambiguous command"
}
IntentResult Fields
| Field | Type | Description |
|---|---|---|
| action | VoiceAgentAction \| null | Action to execute |
| targetId | string \| null | Target element from snapshot |
| value | string \| null | Fill value |
| delta | number \| null | Scroll amount in pixels |
| url | string \| null | Navigate destination |
| confidence | number \| null | 0–1 confidence score |
| reason | string \| null | Human-readable explanation |
| suggestions | string[] \| null | Up to 3 alternative commands |
| actions | ActionStep[] \| null | Multi-step action array |
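To show how the request and response shapes fit together, here is a deliberately naive resolver that a backend could start from. It is a sketch only: resolveIntent is a hypothetical function, the types are transcribed from the tables in this section, the confidence values are arbitrary, and a real backend would normally call an AI model instead of matching substrings.

```typescript
// Types transcribed from the request/response tables; the resolver is hypothetical.
interface DomElementSnapshot {
  id: string;
  role: "button" | "input" | "link";
  label: string;
  selector: string;
  href?: string | null;
  inputType?: string | null;
}

interface IntentRequest {
  text: string;
  domSnapshot: DomElementSnapshot[];
  language: string;
}

interface IntentResponse {
  action: "click" | null;
  targetId: string | null;
  confidence: number;
  reason: string;
  suggestions?: string[];
}

// Naive matcher: click the single element whose label appears in the transcript.
function resolveIntent(req: IntentRequest): IntentResponse {
  const text = req.text.toLowerCase();
  const matches = req.domSnapshot.filter(
    (el) => el.label.length > 0 && text.includes(el.label.toLowerCase())
  );
  const [only] = matches;
  if (matches.length === 1 && only) {
    return {
      action: "click",
      targetId: only.id,
      confidence: 0.9, // arbitrary illustrative score
      reason: `Matched label "${only.label}"`,
    };
  }
  // Zero or several matches: offer up to three suggestions instead of guessing.
  const pool = matches.length > 0 ? matches : req.domSnapshot;
  return {
    action: null,
    targetId: null,
    confidence: 0.3,
    reason: "Ambiguous command",
    suggestions: pool.slice(0, 3).map((el) => `Click ${el.label}`),
  };
}

const result = resolveIntent({
  text: "click the login button",
  language: "en-US",
  domSnapshot: [
    { id: "el-1", role: "button", label: "Login", selector: "#login-btn" },
    { id: "el-3", role: "link", label: "Forgot password", selector: "a[href='/forgot']", href: "/forgot" },
  ],
});
console.log(result.targetId); // "el-1"
```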
React Integration
useVoiceAgent(options)
import { useVoiceAgent, VoiceAgentOverlay } from "@fakhre/voice-agent-sdk";
function App() {
const { agent, state } = useVoiceAgent({
intentEndpoint: "https://api.example.com/intent",
wakeWord: "hey app",
language: ["en-US", "fr-FR"],
permissions: {
allowNavigation: true,
requireConfirmFor: ["delete", "pay"],
},
onTranscript: (text, isFinal) => {
if (isFinal) console.log("Final:", text);
},
});
return (
<div>
<p>Status: {state.status}</p>
<p>Heard: {state.lastTranscript}</p>
{agent && (
<VoiceAgentOverlay
agent={agent}
state={state}
languages={["en-US", "fr-FR", "ar-SA"]}
style={{ bottom: "24px", right: "24px" }}
buttonStyle={{ backgroundColor: "#6200ee" }}
/>
)}
</div>
);
}
VoiceAgentOverlay Props
| Prop | Type | Required | Description |
|---|---|---|---|
| agent | VoiceAgent | Yes | Agent instance |
| state | VoiceAgentState | Yes | Reactive state from hook |
| languages | string[] | No | Language codes to show in selector |
| style | CSSProperties | No | Override container styles |
| buttonStyle | CSSProperties | No | Override mic button styles |
| panelAreaStyle | CSSProperties | No | Override panel area styles |
Controlling the Agent from React
import type { VoiceAgent } from "@fakhre/voice-agent-sdk";
function Controls({ agent }: { agent: VoiceAgent | null }) {
return (
<div>
<button onClick={() => agent?.start()}>Start</button>
<button onClick={() => agent?.stop()}>Stop</button>
<button onClick={() => agent?.toggle()}>Toggle</button>
<button onClick={() => agent?.setLanguage("ar-SA")}>Arabic</button>
</div>
);
}
Angular Integration
Angular does not use JSX, so use createVoiceAgent directly and hook into Angular's lifecycle methods.
Installation
npm install @fakhre/voice-agent-sdk
Service (Recommended Pattern)
// voice-agent.service.ts
import { Injectable, OnDestroy } from "@angular/core";
import { BehaviorSubject } from "rxjs";
import { createVoiceAgent, VoiceAgent, VoiceAgentState } from "@fakhre/voice-agent-sdk";
@Injectable({ providedIn: "root" })
export class VoiceAgentService implements OnDestroy {
agent: VoiceAgent | null = null; // public: the overlay template calls voiceService.agent?.selectSuggestion(i)
state$ = new BehaviorSubject<VoiceAgentState | null>(null);
init(intentEndpoint: string): void {
this.agent = createVoiceAgent({
intentEndpoint,
wakeWord: "hey app",
language: ["en-US"],
onTranscript: (text, isFinal) => console.log(text),
onError: (err) => console.error(err),
});
this.agent.subscribe((state) => this.state$.next(state));
this.agent.start();
}
start() { this.agent?.start(); }
stop() { this.agent?.stop(); }
toggle() { this.agent?.toggle(); }
confirm(){ this.agent?.confirm(); }
cancel() { this.agent?.cancel(); }
setLanguage(lang: string) { this.agent?.setLanguage(lang); }
ngOnDestroy(): void {
this.agent?.destroy();
}
}
Component Usage
// app.component.ts
import { Component, OnInit, OnDestroy } from "@angular/core";
import { VoiceAgentService } from "./voice-agent.service";
import { VoiceAgentState } from "@fakhre/voice-agent-sdk";
@Component({
selector: "app-root",
template: `
<div>
<p>Status: {{ (voiceService.state$ | async)?.status }}</p>
<p>Heard: {{ (voiceService.state$ | async)?.lastTranscript }}</p>
<button (click)="voiceService.toggle()">Toggle Mic</button>
<div *ngIf="(voiceService.state$ | async)?.awaitingConfirm">
<p>{{ (voiceService.state$ | async)?.confirmPrompt }}</p>
<button (click)="voiceService.confirm()">Yes</button>
<button (click)="voiceService.cancel()">No</button>
</div>
</div>
`,
})
export class AppComponent implements OnInit, OnDestroy {
constructor(public voiceService: VoiceAgentService) {}
ngOnInit(): void {
this.voiceService.init("https://api.example.com/intent");
}
ngOnDestroy(): void {
this.voiceService.ngOnDestroy();
}
}
Overlay in Angular (Custom Template)
Since the VoiceAgentOverlay is a React component, build your own Angular overlay using the state observable:
<!-- voice-overlay.component.html -->
<div class="voice-overlay" *ngIf="state">
<button
class="mic-btn"
[class.listening]="state.listening"
[class.processing]="state.status === 'processing'"
(click)="voiceService.toggle()"
>
{{ state.listening ? '🎙️' : '🎤' }}
</button>
<div class="panel" *ngIf="state.lastTranscript">
<p>Heard: {{ state.lastTranscript }}</p>
</div>
<div class="confirm-dialog" *ngIf="state.awaitingConfirm">
<p>{{ state.confirmPrompt }}</p>
<button (click)="voiceService.confirm()">Yes</button>
<button (click)="voiceService.cancel()">No</button>
</div>
<div class="suggestions" *ngIf="state.awaitingSuggestion">
<p *ngFor="let s of state.suggestions; let i = index">
<button (click)="voiceService.agent?.selectSuggestion(i)">{{ i + 1 }}. {{ s }}</button>
</p>
</div>
</div>
Vanilla JS / Plain HTML Integration
<!DOCTYPE html>
<html>
<head>
<script type="module">
import { createVoiceAgent } from "https://cdn.jsdelivr.net/npm/@fakhre/voice-agent-sdk/dist/index.js";
const agent = createVoiceAgent({
intentEndpoint: "https://api.example.com/intent",
wakeWord: "hey assistant",
language: "en-US",
onTranscript: (text, isFinal) => {
if (isFinal) document.getElementById("transcript").textContent = text;
},
onAction: (intent, ok) => {
console.log(ok ? "Done!" : "Failed", intent);
},
});
agent.subscribe((state) => {
document.getElementById("status").textContent = state.status;
});
document.getElementById("start-btn").onclick = () => agent.start();
document.getElementById("stop-btn").onclick = () => agent.stop();
</script>
</head>
<body>
<p>Status: <span id="status">idle</span></p>
<p>Heard: <span id="transcript"></span></p>
<button id="start-btn">Start</button>
<button id="stop-btn">Stop</button>
<!-- Your actual app elements — SDK scans and controls these -->
<button aria-label="Open settings">Settings</button>
<input type="text" placeholder="Search" />
</body>
</html>
Mobile — Android APK (WebView)
The SDK runs inside an Android WebView since it targets the web platform. Follow these steps to integrate it in a native Android app.
1. Enable JavaScript and Web Speech API
// MainActivity.kt
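import android.os.Bundle
import android.webkit.PermissionRequest
import android.webkit.WebChromeClient
import android.webkit.WebSettings
import android.webkit.WebView
import androidx.appcompat.app.AppCompatActivity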
class MainActivity : AppCompatActivity() {
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
val webView = WebView(this)
val settings: WebSettings = webView.settings
settings.javaScriptEnabled = true
settings.domStorageEnabled = true
settings.mediaPlaybackRequiresUserGesture = false
// Required for microphone permission in WebView
webView.webChromeClient = object : WebChromeClient() {
override fun onPermissionRequest(request: PermissionRequest) {
runOnUiThread {
request.grant(request.resources)
}
}
}
webView.loadUrl("https://your-voice-app.com")
setContentView(webView)
}
}
2. Request Microphone Permission
<!-- AndroidManifest.xml -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />
// Request at runtime (Android 6+)
if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
!= PackageManager.PERMISSION_GRANTED) {
ActivityCompat.requestPermissions(this, arrayOf(Manifest.permission.RECORD_AUDIO), 100)
}
3. Web Speech API Note
Web Speech API (SpeechRecognition) is supported in Chrome on Android but not in all WebView versions. If targeting a wide Android audience, use the LiveKit STT option instead — it uses LiveKit's server-side speech recognition which works in any WebView:
const agent = createVoiceAgent({
intentEndpoint: "https://api.example.com/intent",
livekit: {
serverUrl: "wss://your-app.livekit.cloud",
tokenEndpoint: "https://api.example.com/livekit-token",
},
});
4. JavaScript Bridge (Optional)
If you need to control the agent from native Android code:
// Inject bridge into WebView
webView.addJavascriptInterface(object : Any() {
@JavascriptInterface
fun startAgent() {
webView.post { webView.evaluateJavascript("window.__voiceAgent?.start()", null) }
}
@JavascriptInterface
fun stopAgent() {
webView.post { webView.evaluateJavascript("window.__voiceAgent?.stop()", null) }
}
}, "AndroidBridge")
// In your web app — expose agent globally
const agent = createVoiceAgent({ … });
(window as any).__voiceAgent = agent;
Mobile — iOS IPA (WKWebView)
The SDK runs inside a WKWebView on iOS. Safari's WebKit engine supports the Web Speech API on iOS 14.5+.
1. Setup WKWebView with Microphone
// ViewController.swift
import UIKit
import WebKit
class ViewController: UIViewController, WKUIDelegate {
var webView: WKWebView!
override func viewDidLoad() {
super.viewDidLoad()
let config = WKWebViewConfiguration()
config.allowsInlineMediaPlayback = true
config.mediaTypesRequiringUserActionForPlayback = []
webView = WKWebView(frame: view.bounds, configuration: config)
webView.uiDelegate = self
view.addSubview(webView)
let url = URL(string: "https://your-voice-app.com")!
webView.load(URLRequest(url: url))
}
// Grant microphone permission from WebView prompt (this delegate method is available on iOS 15+)
func webView(
_ webView: WKWebView,
requestMediaCapturePermissionFor origin: WKSecurityOrigin,
initiatedByFrame frame: WKFrameInfo,
type: WKMediaCaptureType,
decisionHandler: @escaping (WKPermissionDecision) -> Void
) {
decisionHandler(.grant)
}
}
2. Info.plist Permissions
<!-- Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>Voice agent needs the microphone to hear your commands.</string>
<key>NSSpeechRecognitionUsageDescription</key>
<string>Voice agent uses speech recognition to understand your commands.</string>
3. JavaScript Bridge from Swift (Optional)
// Inject bridge
let script = WKUserScript(
source: "window.__nativeBridge = true;",
injectionTime: .atDocumentStart,
forMainFrameOnly: true
)
webView.configuration.userContentController.addUserScript(script)
// Call from Swift to start agent
webView.evaluateJavaScript("window.__voiceAgent?.start()", completionHandler: nil)
4. WKWebView + LiveKit Recommendation
Web Speech API in WKWebView requires iOS 14.5+ and HTTPS. For maximum compatibility, especially on older iOS versions, use LiveKit STT:
const agent = createVoiceAgent({
intentEndpoint: "https://api.example.com/intent",
livekit: {
serverUrl: "wss://your-app.livekit.cloud",
tokenEndpoint: "https://api.example.com/livekit-token",
},
});
Deployment Requirements (iOS)
- Hosted app must be served over HTTPS (required for microphone and Speech API)
- Works in: Safari on iOS, WKWebView on iOS 14.5+
- Does not work in UIWebView (deprecated)
LiveKit STT (Server-Side Speech)
Use LiveKit as the speech-to-text engine instead of the browser's Web Speech API. This is ideal for:
- Environments where Web Speech API is unavailable (some WebViews, Firefox, older browsers)
- Server-controlled language configuration
- Enterprise / on-premise deployments
const agent = createVoiceAgent({
intentEndpoint: "https://api.example.com/intent",
livekit: {
serverUrl: "wss://your-app.livekit.cloud",
tokenEndpoint: "https://api.example.com/livekit-token",
},
});
Your tokenEndpoint must return:
{ "token": "<livekit-room-token>" }
Note: Language switching via agent.setLanguage() is not available with LiveKit; language is configured server-side in your LiveKit agent.
Authentication
Firebase Google Auth
const agent = createVoiceAgent({
intentEndpoint: "https://api.example.com/intent",
firebaseConfig: {
apiKey: "…",
authDomain: "your-app.firebaseapp.com",
projectId: "your-app",
// … rest of Firebase config
},
});
// Trigger login
await agent.login();
// Access user
const state = agent.getState();
console.log(state.user?.email);
// Logout
await agent.logout();
When a Firebase user is signed in, every intent request automatically includes Authorization: Bearer <id-token>.
Custom Token Provider
const agent = createVoiceAgent({
intentEndpoint: "https://api.example.com/intent",
getToken: async () => {
const res = await fetch("/api/auth/token");
const { token } = await res.json();
return token;
},
});
API Key
const agent = createVoiceAgent({
intentEndpoint: "https://api.example.com/intent",
apiKey: "sk-live-abc123",
// Sends "x-api-key: sk-live-abc123" on every request
});
Multi-Step Action Sequences
Your backend can return a sequence of actions to execute in order:
{
"action": null,
"actions": [
{ "action": "click", "targetId": "el-1" },
{ "action": "fill", "targetId": "el-2", "value": "John Doe" },
{ "action": "fill", "targetId": "el-3", "value": "[email protected]" },
{ "action": "click", "targetId": "el-4" }
],
"confidence": 0.94,
"reason": "Complete the registration form"
}
Behaviour:
- Steps execute one at a time with a 300ms gap (allows DOM/React re-renders)
- If any step contains a risky keyword, a single confirmation dialog is shown before execution starts
- state.currentStep and state.totalSteps are updated in real time; use them to show a progress bar
- The overlay shows a progress bar with percentage automatically
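The behaviour listed above can be sketched as a small sequential runner. This is an illustration under assumptions, not the SDK's code: runSequence, its execute callback and its onProgress callback are hypothetical, and stopping at the first failed step is an assumed policy that this document does not specify.

```typescript
// Hypothetical step shape; mirrors the entries of the "actions" array.
interface SequenceStep {
  action: string;
  targetId?: string;
  value?: string;
}

type ExecuteStep = (step: SequenceStep) => Promise<boolean>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs steps one at a time, reporting progress and pausing between steps.
async function runSequence(
  steps: SequenceStep[],
  execute: ExecuteStep,
  onProgress: (currentStep: number, totalSteps: number) => void = () => {},
  gapMs = 300
): Promise<boolean> {
  for (const [index, step] of steps.entries()) {
    onProgress(index + 1, steps.length);
    const ok = await execute(step);
    if (!ok) return false; // Assumption: abort the rest of the sequence on failure.
    const isLast = index === steps.length - 1;
    if (!isLast) await sleep(gapMs); // Give the UI time to re-render.
  }
  return true;
}

// Usage with a fake executor that only logs what it would do.
runSequence(
  [
    { action: "fill", targetId: "el-2", value: "John Doe" },
    { action: "click", targetId: "el-4" },
  ],
  async (step) => {
    console.log(`executing ${step.action} on ${step.targetId}`);
    return true;
  },
  (current, total) => console.log(`step ${current} of ${total}`)
).then((ok) => console.log("sequence finished:", ok));
```

A progress percentage for a custom overlay can be derived from the two numbers passed to onProgress, in the same way as from state.currentStep and state.totalSteps.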
Safety & Confirmation System
Automatic Confirmation Triggers
| Trigger | Default Condition |
|---|---|
| Risk keywords | Action label or value contains: delete, remove, logout, sign out, pay, purchase, checkout, unsubscribe |
| Navigation blocked | allowNavigation: false (default) |
| File input | <input type="file"> always requires confirmation |
| Low confidence | confidence < confidenceThreshold (default 0.55) |
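The triggers in the table can be combined into a single check, sketched below. confirmationReason and SafetyInput are hypothetical names; matching risk keywords by case-insensitive substring and treating a missing confidence score as low are assumptions, since the exact matching rules are not stated here.

```typescript
// Hypothetical view of one pending action and its target element.
interface SafetyInput {
  label: string;
  value?: string | null;
  confidence: number | null;
  inputType?: string | null;
}

const DEFAULT_RISK_WORDS = [
  "delete", "remove", "logout", "sign out",
  "pay", "purchase", "checkout", "unsubscribe",
];

// Returns why confirmation is needed, or null when the action may auto-execute.
function confirmationReason(
  input: SafetyInput,
  riskWords: string[] = DEFAULT_RISK_WORDS,
  confidenceThreshold = 0.55
): string | null {
  if (input.inputType === "file") return "file input";
  if (input.confidence === null || input.confidence < confidenceThreshold) {
    return "low confidence";
  }
  const haystack = `${input.label} ${input.value ?? ""}`.toLowerCase();
  const hit = riskWords.find((word) => haystack.includes(word.toLowerCase()));
  return hit ? `risk keyword: ${hit}` : null;
}

console.log(confirmationReason({ label: "Delete account", confidence: 0.9 })); // "risk keyword: delete"
console.log(confirmationReason({ label: "Login", confidence: 0.92 }));          // null
console.log(confirmationReason({ label: "Login", confidence: 0.4 }));           // "low confidence"
```

Substring matching is broad (for example "pay" also matches "repay"), so a word-boundary match may be preferable depending on your labels.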
Customising Risk Keywords
permissions: {
requireConfirmFor: ["delete", "cancel subscription", "wipe", "format"],
}
Voice Confirmation
The user can say:
- Confirm: "yes", "yeah", "yep", "ok", "okay", "sure", "go ahead", "do it", "confirm", "proceed"
- Cancel: "no", "nope", "cancel", "stop", "never mind", "don't", "abort"
Programmatic Confirmation
agent.confirm(); // Execute pending action
agent.cancel(); // Discard pending action
Language Support
15 languages are supported out of the box:
| Code | Language | Native Label |
|---|---|---|
| en-US | English (US) | English |
| en-GB | English (UK) | English (UK) |
| ar-SA | Arabic (Saudi Arabia) | العربية (السعودية) |
| ar-EG | Arabic (Egypt) | العربية (مصر) |
| ar-AE | Arabic (UAE) | العربية (الإمارات) |
| hi-IN | Hindi | हिन्दी |
| fr-FR | French | Français |
| de-DE | German | Deutsch |
| es-ES | Spanish | Español |
| zh-CN | Chinese (Simplified) | 中文 (简体) |
| ja-JP | Japanese | 日本語 |
| pt-BR | Portuguese (Brazil) | Português |
| ru-RU | Russian | Русский |
| tr-TR | Turkish | Türkçe |
| ur-PK | Urdu | اردو |
Runtime Language Switching
agent.setLanguage("ar-SA"); // Switch to Arabic
agent.setLanguage("zh-CN"); // Switch to Chinese
Multi-Language Initialisation
const { agent, state } = useVoiceAgent({
intentEndpoint: "…",
language: ["en-US", "ar-SA", "hi-IN"],
// First in array = default language
// Rest appear in the overlay's language selector
});
Importing Language Data
import { SUPPORTED_LANGUAGES, DEFAULT_LANGUAGES } from "@fakhre/voice-agent-sdk";
// SUPPORTED_LANGUAGES: LanguageOption[] — all 15 entries
// DEFAULT_LANGUAGES: string[] — ["en-US","ar-SA","ar-EG","ar-AE"]
Encryption
The SDK supports optional end-to-end encryption using ECDH (P-256) + AES-256-GCM for sensitive deployments.
When your backend exposes a GET /intent/pubkey endpoint that returns the server's public key, the SDK will:
- Generate a P-256 ECDH key pair
- Fetch the server's public key
- Derive a shared AES-256-GCM key
- Encrypt every request body before sending
- Decrypt encrypted responses automatically
This happens transparently — no configuration needed on the client side.
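For readers implementing the server side, the sketch below walks through the same steps with the standard Web Crypto API as exposed by Node.js. It is an assumption-laden illustration: the wire format, the key encoding returned by /intent/pubkey, and whether the SDK applies an extra key-derivation step are not documented here, so the helper names (makeKeyPair, deriveAesKey, encryptJson, decryptJson) and the 12-byte random IV are choices made for this example only.

```typescript
import { webcrypto } from "node:crypto";

const { subtle } = webcrypto;
type Key = webcrypto.CryptoKey;

// Step 1: generate a P-256 ECDH key pair.
function makeKeyPair() {
  return subtle.generateKey({ name: "ECDH", namedCurve: "P-256" }, false, ["deriveKey"]);
}

// Step 3: derive a shared AES-256-GCM key from our private key and the peer's public key.
function deriveAesKey(privateKey: Key, peerPublicKey: Key) {
  return subtle.deriveKey(
    { name: "ECDH", public: peerPublicKey },
    privateKey,
    { name: "AES-GCM", length: 256 },
    false,
    ["encrypt", "decrypt"]
  );
}

// Step 4: encrypt a JSON payload with a fresh random IV.
async function encryptJson(key: Key, payload: unknown) {
  const iv = webcrypto.getRandomValues(new Uint8Array(12));
  const data = new TextEncoder().encode(JSON.stringify(payload));
  const ciphertext = await subtle.encrypt({ name: "AES-GCM", iv }, key, data);
  return { iv, ciphertext };
}

// Step 5: decrypt and parse; throws if the ciphertext was tampered with.
async function decryptJson(key: Key, iv: Uint8Array, ciphertext: ArrayBuffer) {
  const plain = await subtle.decrypt({ name: "AES-GCM", iv }, key, ciphertext);
  return JSON.parse(new TextDecoder().decode(plain));
}

// Round trip: both sides derive the same key from opposite halves of the exchange.
async function demo() {
  const client = await makeKeyPair();
  const server = await makeKeyPair(); // Stands in for the key behind /intent/pubkey.
  const clientKey = await deriveAesKey(client.privateKey, server.publicKey);
  const serverKey = await deriveAesKey(server.privateKey, client.publicKey);
  const sealed = await encryptJson(clientKey, { text: "click the login button" });
  return decryptJson(serverKey, sealed.iv, sealed.ciphertext);
}

demo().then((body) => console.log(body.text));
```

Many protocols pass the raw ECDH output through a key-derivation function such as HKDF before use; check what your deployment expects before relying on the direct derivation shown here.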
Full Exports Reference
// React
export { useVoiceAgent } from "./react/useVoiceAgent";
export { VoiceAgentOverlay } from "./react/VoiceAgentOverlay";
// Core
export { createVoiceAgent } from "./agent/createVoiceAgent";
// Utilities
export { scanDom } from "./dom/scanDom";
export { executeIntent } from "./exec/executeIntent";
// Language data
export { SUPPORTED_LANGUAGES } from "./react/languages";
export { DEFAULT_LANGUAGES } from "./react/languages";
// Types
export type {
VoiceAgent,
VoiceAgentOptions,
VoiceAgentState,
VoiceAgentAction,
IntentResult,
ActionStep,
DomElementSnapshot,
DomRole,
Permissions,
LiveKitConfig,
} from "./agent/types";
scanDom(options?) — Standalone Utility
Scan the DOM and return a snapshot without starting the full agent. Useful for debugging or building a custom backend request.
import { scanDom } from "@fakhre/voice-agent-sdk";
const snapshot = scanDom({ max: 100 });
console.log(snapshot);
// [{ id: "el-1", role: "button", label: "Login", selector: "#login-btn" }, …]
executeIntent(intent, snapshot, permissions) — Standalone Utility
Execute an intent against the current DOM without the full agent.
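import { executeIntent, scanDom } from "@fakhre/voice-agent-sdk";
// Take a fresh snapshot so that targetId refers to an element currently on the page
const snapshot = scanDom();
const ok = executeIntent(
{ action: "click", targetId: "el-1", confidence: 0.9 },
snapshot,
{ allowNavigation: false }
);
Browser Support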
Unless LiveKit STT is configured, the SDK requires the browser's Web Speech API. Supported environments:
| Browser | Support | Notes |
|---|---|---|
| Chrome (desktop) | ✅ Full | Recommended |
| Edge (Chromium) | ✅ Full | |
| Safari (macOS 14.1+) | ✅ Full | |
| Safari (iOS 14.5+) | ✅ Full | HTTPS required |
| Firefox | ❌ No | Use LiveKit STT instead |
| Android WebView | ⚠️ Partial | Use LiveKit STT for reliability |
| iOS WKWebView | ✅ iOS 14.5+ | HTTPS required |
For maximum cross-browser and cross-platform support, use the LiveKit STT option.
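An application can decide at startup whether to pass the livekit option by checking for the Web Speech API constructors. The helper below is a sketch for application code, not an SDK export; pickSttEngine is a hypothetical name, and it takes the global object as a parameter so it can be exercised outside a browser.

```typescript
type SttEngine = "livekit" | "web-speech" | "none";

// Chooses a speech engine from what the environment and the app config offer.
function pickSttEngine(globalScope: object, livekitConfigured: boolean): SttEngine {
  if (livekitConfigured) return "livekit";
  // Chromium and Safari expose the constructor with a webkit prefix.
  const hasWebSpeech =
    "SpeechRecognition" in globalScope || "webkitSpeechRecognition" in globalScope;
  return hasWebSpeech ? "web-speech" : "none";
}

// In a browser, pass window; globalThis works in any runtime.
console.log(pickSttEngine(globalThis, false));
```

When the result is "none", configuring LiveKit STT is the route this README recommends.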
Get Started / Subscription
Free Tier
Try the SDK with your own backend — no account required. Install the package and point intentEndpoint at your own AI endpoint.
Managed Backend (Hosted Intent API)
Want a hosted AI backend that handles intent resolution, DOM understanding, and multi-language support out of the box? We offer a managed service with:
- Hosted intent API endpoint
- AI-powered DOM understanding (no prompt engineering needed)
- Multi-language support with server-side models
- Analytics dashboard
- SLA & priority support
Contact: [email protected]
npm: npmjs.com/package/@fakhre/voice-agent-sdk
License
MIT
