# voice-ai-agent

Push-to-talk Voice AI SDK — real-time audio streaming with WebSocket and Gemini AI. Works in React, Next.js, and Vanilla JavaScript.

## Features
- 🎤 Push-to-talk microphone recording via `MediaRecorder`
- 📡 Real-time audio streaming to your backend over WebSocket
- 🔤 Live transcription using the browser's Web Speech API (instant feedback)
- 🤖 AI responses — via backend server or direct Gemini API (your choice)
- 🔁 Auto-reconnect WebSocket with configurable back-off
- 🏗️ Framework-agnostic — use in React, Next.js, or plain JavaScript
- 📦 Dual CJS + ESM build with full TypeScript types
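
The dual CJS + ESM build is usually exposed through a conditional `exports` map in `package.json`. The fragment below is a sketch of how such a map can look for this package's `dist/` layout; the exact field values are assumptions, so check the shipped `package.json`:

```json
{
  "main": "dist/index.js",
  "module": "dist/index.mjs",
  "types": "dist/index.d.ts",
  "exports": {
    ".": {
      "types": "./dist/index.d.ts",
      "import": "./dist/index.mjs",
      "require": "./dist/index.js"
    }
  }
}
```

With a map like this, `require("voice-ai-agent")` resolves to the CJS build while `import` resolves to the ESM build.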
## Installation

```bash
npm install voice-ai-agent
# or
yarn add voice-ai-agent
```

## Quick Start
### Option A — With Backend Server (Recommended)

Pair with `voice-stream-server` or any compatible WebSocket backend.
```js
import { VoiceAgent } from "voice-ai-agent";

const agent = new VoiceAgent({
  websocketUrl: "ws://localhost:3000/audio-stream",
});

agent.on("transcript", (text) => {
  console.log("User said:", text);
});

agent.on("response", (text) => {
  console.log("AI replied:", text);
});

agent.on("error", (err) => {
  console.error("Error:", err.message);
});

// Push-to-talk
document
  .querySelector("#mic-btn")
  .addEventListener("mousedown", () => agent.start());
document
  .querySelector("#mic-btn")
  .addEventListener("mouseup", () => agent.stop());

// Cleanup on page unload
window.addEventListener("beforeunload", () => agent.destroy());
```

### Option B — Client-side Gemini (No backend needed)
```js
import { VoiceAgent } from "voice-ai-agent";

const agent = new VoiceAgent({
  apiKey: "YOUR_GEMINI_API_KEY",
});

agent.on("transcript", (text) => console.log("You:", text));
agent.on("response", (text) => console.log("AI:", text));
```

## React Integration
### Using the `useVoiceAgent` hook (from examples)

Copy `examples/react-example/useVoiceAgent.ts` into your project, or inline the logic:
```tsx
import { useEffect, useRef, useState } from "react";
import { VoiceAgent } from "voice-ai-agent";

function VoiceButton() {
  const agentRef = useRef<VoiceAgent | null>(null);
  const [isRecording, setIsRecording] = useState(false);
  const [transcript, setTranscript] = useState("");
  const [aiResponse, setAiResponse] = useState("");

  useEffect(() => {
    const agent = new VoiceAgent({
      websocketUrl: "ws://localhost:3000/audio-stream",
    });
    agent.on("transcript", setTranscript);
    agent.on("response", setAiResponse);
    agent.on("start", () => setIsRecording(true));
    agent.on("stop", () => setIsRecording(false));
    agentRef.current = agent;
    return () => agent.destroy(); // cleanup on unmount
  }, []);

  return (
    <div>
      <button
        onMouseDown={() => agentRef.current?.start()}
        onMouseUp={() => agentRef.current?.stop()}
      >
        {isRecording ? "🔴 Recording…" : "🎤 Hold to Speak"}
      </button>
      {transcript && <p>You said: {transcript}</p>}
      {aiResponse && <p>AI: {aiResponse}</p>}
    </div>
  );
}
```

## Next.js Integration
```tsx
// app/components/VoiceWidget.tsx
"use client"; // Required — browser APIs

import { useEffect, useRef } from "react";
import { VoiceAgent } from "voice-ai-agent";

export default function VoiceWidget() {
  const agentRef = useRef<VoiceAgent | null>(null);

  useEffect(() => {
    const agent = new VoiceAgent({
      websocketUrl:
        process.env.NEXT_PUBLIC_WS_URL ?? "ws://localhost:3000/audio-stream",
    });
    agent.on("response", (text) => console.log("AI:", text));
    agentRef.current = agent;
    return () => agent.destroy();
  }, []);

  return (
    <button
      onMouseDown={() => agentRef.current?.start()}
      onMouseUp={() => agentRef.current?.stop()}
    >
      Push to Talk
    </button>
  );
}
```

## Vanilla JavaScript
```html
<!-- index.html -->
<script type="module">
  import { VoiceAgent } from "https://cdn.jsdelivr.net/npm/voice-ai-agent/dist/index.mjs";

  const agent = new VoiceAgent({
    websocketUrl: "ws://localhost:3000/audio-stream",
  });

  agent.on(
    "transcript",
    (text) => (document.getElementById("transcript").textContent = text),
  );
  agent.on(
    "response",
    (text) => (document.getElementById("response").textContent = text),
  );

  document
    .getElementById("btn")
    .addEventListener("mousedown", () => agent.start());
  document
    .getElementById("btn")
    .addEventListener("mouseup", () => agent.stop());
</script>

<button id="btn">Hold to Speak</button>
<p id="transcript"></p>
<p id="response"></p>
```

## API Reference
### `new VoiceAgent(config)`
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `websocketUrl` | `string` | — | WebSocket server URL (e.g. `ws://localhost:3000/audio-stream`) |
| `apiKey` | `string` | — | Google Gemini API key for client-side AI mode |
| `lang` | `string` | `"en-US"` | Web Speech API language (BCP-47) |
| `chunkIntervalMs` | `number` | `250` | Audio chunk interval in ms |
| `reconnectDelayMs` | `number` | `3000` | WebSocket reconnect delay in ms |
**Note:** At least one of `websocketUrl` or `apiKey` must be provided.
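
Taken together, a config object looks like the sketch below. The interface here is illustrative only (the SDK exports its own types), and the validation function is a hypothetical restatement of the note, not the SDK's actual code:

```typescript
// Illustrative shape of the options table above (not the SDK's exported type).
interface VoiceAgentConfig {
  websocketUrl?: string;     // WebSocket server URL
  apiKey?: string;           // Gemini API key for client-side mode
  lang?: string;             // default "en-US"
  chunkIntervalMs?: number;  // default 250
  reconnectDelayMs?: number; // default 3000
}

// The "at least one of websocketUrl or apiKey" rule, as a standalone check:
function validateConfig(config: VoiceAgentConfig): void {
  if (!config.websocketUrl && !config.apiKey) {
    throw new Error("VoiceAgent requires websocketUrl or apiKey");
  }
}

validateConfig({ websocketUrl: "ws://localhost:3000/audio-stream" }); // ok
```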
### Methods

| Method | Description |
| --- | --- |
| `agent.start()` | Starts recording (async — requests mic permission) |
| `agent.stop()` | Stops recording and signals the server to process |
| `agent.destroy()` | Full cleanup: stops recording, disconnects WebSocket |
### Events

| Event | Callback | Description |
| --- | --- | --- |
| `"transcript"` | `(text: string)` | Live transcription (interim from browser + final from server) |
| `"response"` | `(text: string)` | AI-generated response text |
| `"error"` | `(err: Error)` | Any error in recording, WebSocket, or AI processing |
| `"start"` | `()` | Recording started |
| `"stop"` | `()` | Recording stopped |
| `"connecting"` | `()` | WebSocket connection attempt started |
| `"connected"` | `()` | WebSocket successfully opened |
| `"disconnected"` | `()` | WebSocket connection closed |
## Individual Modules

You can also import and use individual low-level modules:

```js
import { AudioRecorder, WebSocketClient, GeminiProvider } from "voice-ai-agent";

// Use AudioRecorder standalone
const recorder = new AudioRecorder("en-US", 250);
recorder.on("chunk", (base64) => console.log("chunk:", base64.length));
recorder.on("speechResult", (text) => console.log("heard:", text));
await recorder.start();

// Use WebSocketClient standalone
const ws = new WebSocketClient("ws://localhost:3000/audio-stream");
ws.on("message", (data) => console.log(data));
ws.connect();

// Use GeminiProvider standalone
const gemini = new GeminiProvider({ apiKey: "YOUR_KEY" });
const response = await gemini.generateResponse("What is the weather like?");
```

## Building from Source
```bash
cd voice-ai-agent

# Install dependencies
npm install

# Type-check only (no output)
npm run typecheck

# Build dist/ (CJS + ESM + .d.ts)
npm run build

# Watch mode during development
npm run dev
```

After the build, `dist/` will contain:

```
dist/
├── index.js    ← CommonJS (for Node.js, webpack)
├── index.mjs   ← ESM (for Vite, Rollup, browser modules)
└── index.d.ts  ← TypeScript declarations
```

## Publishing to npm
### Step 1 — Set your package name

Edit `package.json` → `"name"`. Use a scoped name to avoid conflicts:

```json
"name": "@yourname/voice-ai-agent"
```

### Step 2 — Bump the version

```bash
npm version patch   # 1.0.0 → 1.0.1
npm version minor   # 1.0.0 → 1.1.0
npm version major   # 1.0.0 → 2.0.0
```

### Step 3 — Log in to npm

```bash
npm login
```

### Step 4 — Build and publish

```bash
# The build runs automatically via the "prepublishOnly" script
npm publish

# For scoped packages, make it public:
npm publish --access public
```

### Step 5 — Verify

```bash
npm info voice-ai-agent
```

## Local Testing with npm link
Test the SDK in your existing `voice-stream-client` app without publishing:

```bash
# In voice-ai-agent/
npm run build
npm link

# In voice-stream-client/ (or any other project)
npm link voice-ai-agent
```

Then import normally:

```js
import { VoiceAgent } from "voice-ai-agent";
```

## Project Structure
```
voice-ai-agent/
├── src/
│   ├── core/
│   │   └── VoiceAgent.ts        ← Main SDK class
│   ├── audio/
│   │   └── audioRecorder.ts     ← MediaRecorder + Web Speech API
│   ├── streaming/
│   │   └── websocketClient.ts   ← WebSocket with auto-reconnect
│   ├── providers/
│   │   └── geminiProvider.ts    ← Client-side Gemini REST wrapper
│   ├── types.ts                 ← Shared TypeScript interfaces
│   └── index.ts                 ← Public barrel exports
├── examples/
│   └── react-example/
│       ├── VoiceButton.tsx      ← Complete React component
│       └── useVoiceAgent.ts     ← React hook
├── dist/                        ← Generated build output
├── package.json
├── tsconfig.json
└── tsup.config.ts
```

## License
MIT © 2026
