@northpointe/voice-connect
v1.2.2
Voice Connect — standalone OpenClaw voice channel plugin (SPA + STT + TTS)
Voice Connect
A hands-free, voice-first conversational AI interface for OpenClaw. One tap. One orb. Speak directly with your main OpenClaw agent.
🎉 Milestone (2026-05-22): Streaming dispatcher fix, end of the 3-week pendulum. LIVE-VALIDATED. Voice Connect was caught in a three-week back-and-forth between two failure modes: stacked replies (the AI dragging every prior topic into every new turn) and new-session-per-turn amnesia (the AI forgetting the previous turn entirely). Both symptoms had the same root cause and the same architectural fix.
The root cause: since 2026-05-07, the VC streaming path bypassed OpenClaw's native dispatch and talked directly to the Gateway's OpenAI-compat /v1/chat/completions endpoint so per-sentence TTS could start mid-reply. That made VC streaming effectively a stateless OpenAI client: session memory had to be hand-rolled in the plugin via a buildVoiceCompatMessages helper and a "continuity guard" system prompt. Every patch since then swung between re-injecting history aggressively (→ stacking) and not at all (→ amnesia).
The smoking gun: OpenClaw's plugin SDK exposes two reply dispatchers: a buffered one that fires deliver() once at end-of-run, and a streaming one (dispatchReplyWithDispatcher) that fires deliver() per agent block during the run. The SDK's dispatchInboundDirectDmWithRuntime wrapper hard-wires the buffered variant. The plugin was using the wrong dispatcher all along.
The fix: composed dispatchInboundDirectDmWithStreamingDispatcher locally (~30 lines), substituting the streaming dispatcher for the buffered one. Deleted the /v1/chat/completions bypass, buildVoiceCompatMessages, the continuity-guard system prompt, and ~250 lines of compensating helpers. Net code reduction. Per-sentence TTS preserved (streaming dispatcher → deliver() → SSE slicer → SPA sentence chunker, untouched). Native session memory returned to OpenClaw.
Preserved (orthogonal defensive layers): XML isolation envelope (<voice_connect_turn seq="N" priority="newest_active">), monotonic sequence numbers, "only answer THIS turn" instruction, consume-latest gate, staggered turn dedup, queueMode="followup" enforcement, lifecycle markers, turn-end finalization.
Live validation: 7+ consecutive VC rounds with sharp topic switches (cocoa, holidays, music, theology, meta-discussion of memory itself). No stacking. No new-session symptoms. Per-sentence TTS latency preserved. Tested across higher and lower-tier reasoning behavior.
Durable lesson: when the plugin SDK "only does X," check whether it actually exposes a related variant that does Y. The streaming dispatcher was on the shelf for three weeks while we wrote 250 lines of compensating code. A 4-line import swap and a local 30-line wrapper would have solved the problem on day one.
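The difference between the two dispatchers can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: the `Deliver` and `AgentRun` types and the `dispatchBuffered` / `dispatchStreaming` functions are stand-ins invented for illustration and are not the OpenClaw plugin SDK's signatures. The only behavior taken from the text above is that a buffered dispatcher calls `deliver()` once at end-of-run, while a streaming dispatcher calls it once per agent block.

```typescript
// Hypothetical types for illustration only; NOT the OpenClaw plugin SDK API.
type Deliver = (text: string) => void;
type AgentRun = (onBlock: (block: string) => void) => void;

// Buffered: collect every block, then call deliver() once at end-of-run.
function dispatchBuffered(run: AgentRun, deliver: Deliver): void {
  const blocks: string[] = [];
  run((block) => blocks.push(block));
  deliver(blocks.join(" "));
}

// Streaming: call deliver() for each block while the run is still in progress.
function dispatchStreaming(run: AgentRun, deliver: Deliver): void {
  run((block) => deliver(block));
}

// A fake agent run that emits three blocks in order.
const fakeRun: AgentRun = (onBlock) => {
  for (const block of ["Cocoa is ready.", "Holidays are booked.", "Music is queued."]) {
    onBlock(block);
  }
};

const bufferedCalls: string[] = [];
dispatchBuffered(fakeRun, (text) => bufferedCalls.push(text));

const streamingCalls: string[] = [];
dispatchStreaming(fakeRun, (text) => streamingCalls.push(text));
```

With the buffered variant, downstream TTS cannot start until the whole reply exists; with the streaming variant, each block can be handed to the sentence chunker as soon as it arrives.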
🎉 Milestone (2026-05-13): Plugin-owned architecture (LIVE-VALIDATED). Voice Connect is operational as a self-contained OpenClaw plugin package. The legacy external voice-chat-bridge app is gone. The plugin owns the SPA, API, TTS, STT, settings, health, tool markers, and packaging. A tiny packaged helper proxy on port 18790 may still be used for host/Caddy ingress where OpenClaw's public gateway port or body middleware is not suitable.
Live validation confirmed 3+ clean voice conversation rounds with:
- Deepgram Premium STT — real-time Heard text keeping pace with speech
- Vosk lgraph Built-in STT — connects and works reliably
- Fast Mode (WebRTC) — works independently
- Full orb cadence — green listening → yellow thinking → orange working (with heartbeat music on tool calls) → purple speaking → green
- Tool calls — confirmed orange working state during live operations
- Full-turn submit (sentence chunking OFF) — complete user thought submitted after VAD
Milestone (2026-05-14): Final cadence polish — LIVE-VALIDATED. The last voice-loop polish pass fixed the assistant return path and receipt cue semantics:
- Assistant TTS early-start: VC now pipes Gateway streaming deltas back to the SPA so sentence-level TTS can begin before the full agent response is complete.
- Receipt/stall cue semantics: the short audio cue now fires when the AI run actually begins processing (lifecycle:start), not merely when the browser submits audio.
- Barge-in hardening: interrupt thresholds were made more conservative so VC does not mistake its own TTS bleed for a user interruption.
- Tool/orb truthfulness preserved: yellow thinking and orange working may appear during/around purple speaking, but green listening is still reserved for true user-ready state.
Live phone/car testing confirmed that the cue feels natural: the small delay before it plays communicates that the AI agent has received the user's turn.
Milestone (2026-05-15): Active VC turn marker (PREPARED). Voice Connect now wraps submitted transcripts in a structured active-turn block ([VC_TURN_START] ... [VC_TURN_END]) so the model can identify the newest Voice Connect turn when OpenClaw queues or stacks context. This is a prompt-level safety rail, not the final architectural queue fix.
- Marker includes turn id, UTC/local timestamps, priority: newest_active_turn, and an unconditional handling rule
- Goal: answer only the newest active VC turn; do not revisit older queued turns unless explicitly requested
- Side note: revisit this if repeated markers clutter the dashboard transcript during heavier VC usage. A future pass may hide/system-route the metadata while preserving the safety signal.
- Plugin build passed and live extension was synced; manual Gateway restart required to load the plugin route change.
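As a rough illustration of the active-turn block described above, the sketch below wraps a transcript between the two delimiters. Only the `[VC_TURN_START]` / `[VC_TURN_END]` delimiters, the list of fields, and the `priority: newest_active_turn` value come from this README. The field names (`turn_id`, `utc`, `local`, `rule`, `transcript`), their order, the wording of the handling rule, and the `wrapActiveTurn` helper itself are assumptions made for the example.

```typescript
// Hypothetical field layout; the plugin's real marker format may differ.
interface VcTurn {
  turnId: string;
  utc: string; // ISO 8601 UTC timestamp
  local: string; // local wall-clock timestamp
  transcript: string; // finalized STT text for this turn
}

function wrapActiveTurn(turn: VcTurn): string {
  return [
    "[VC_TURN_START]",
    `turn_id: ${turn.turnId}`,
    `utc: ${turn.utc}`,
    `local: ${turn.local}`,
    "priority: newest_active_turn",
    // Assumed wording for the unconditional handling rule.
    "rule: Answer only this turn; do not revisit older queued turns unless explicitly requested.",
    `transcript: ${turn.transcript}`,
    "[VC_TURN_END]",
  ].join("\n");
}

const wrappedTurn = wrapActiveTurn({
  turnId: "vc-0042",
  utc: "2026-05-15T14:03:22Z",
  local: "2026-05-15 09:03:22",
  transcript: "What holidays are coming up?",
});
```

Keeping the metadata in a clearly delimited block is what would later allow a pass to hide or system-route it without touching the transcript text.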
Milestone (2026-05-15): VAD profile simplification — LIVE-VALIDATED. Retired the legacy Auto mode (built for sentence-chunking era) and simplified VAD into three clean presets plus a Custom slider:
- Fast (800ms), Medium (1500ms), Conservative (2700ms) — pure silence-duration timers, no adaptive transcript logic
- Custom — single slider from 400ms to 4000ms with tick marks at preset positions; auto-splits into hangover/confirmation
- Settings persist server-side via voice settings JSON, not session-scoped
- Live validation found Custom at ~1500ms to be an effective default tuning point.
- Auto mode preserved (commented out) for future Hermes reuse
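The preset and Custom behavior above can be sketched as a small resolver. The preset durations and the 400 to 4000 ms Custom range come from this README. The `resolveSilence` function, the type names, and especially the `HANGOVER_FRACTION` value are assumptions: the README says Custom auto-splits into hangover/confirmation but does not state the ratio, so the 2/3 figure is a placeholder. Applying the same split to presets is also an assumption made to keep the example short.

```typescript
type VadPreset = "fast" | "medium" | "conservative";

// Preset silence durations in milliseconds, as listed in the README.
const PRESET_MS: Record<VadPreset, number> = {
  fast: 800,
  medium: 1500,
  conservative: 2700,
};

const CUSTOM_MIN_MS = 400;
const CUSTOM_MAX_MS = 4000;

// ASSUMPTION: the real hangover/confirmation ratio is not documented.
const HANGOVER_FRACTION = 2 / 3;

interface SilenceTimers {
  totalMs: number;
  hangoverMs: number;
  confirmationMs: number;
}

function resolveSilence(profile: VadPreset | { customMs: number }): SilenceTimers {
  const raw = typeof profile === "string" ? PRESET_MS[profile] : profile.customMs;
  // Clamp to the slider range so an out-of-range stored value cannot break VAD.
  const totalMs = Math.min(CUSTOM_MAX_MS, Math.max(CUSTOM_MIN_MS, Math.round(raw)));
  const hangoverMs = Math.round(totalMs * HANGOVER_FRACTION);
  return { totalMs, hangoverMs, confirmationMs: totalMs - hangoverMs };
}

const mediumTimers = resolveSilence("medium");
const clampedHigh = resolveSilence({ customMs: 10000 });
const clampedLow = resolveSilence({ customMs: 100 });
```

Because these are pure silence-duration timers, the resolver needs no transcript input at all, which is the main simplification over the retired Auto mode.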
Milestone (2026-05-16): Mobile safety cues + sacred-green guard — LIVE-VALIDATED. Tightened the orb release logic so green listening stays sacred/user-ready only, even between assistant speech/tool chunks. Added optional subtle nonverbal audio state cues:
- Sent cue — plays when finalized STT leaves the browser toward OpenClaw
- Listening-ready cue — plays only when VC truly returns to green listening
- Both cues are client-side, nonverbal, and separate from the spoken receipt/stall cue
- Live testing confirmed the cue sounds and no green flash after hard refresh
- Default stall phrase pool updated from validation feedback
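One way to picture the sacred-green guard is as a pure function from turn activity to orb color, where green is only reachable when nothing is pending. This is a sketch under stated assumptions, not the plugin's actual logic: the `TurnActivity` fields, the `orbState` function, and the priority of orange over purple are all invented for illustration. The only rule taken from the README is that green listening must never appear between assistant speech or tool chunks.

```typescript
type OrbState = "green-listening" | "yellow-thinking" | "orange-working" | "purple-speaking";

// Hypothetical activity snapshot; the real plugin tracks state differently.
interface TurnActivity {
  runActive: boolean; // agent run has started and is not yet finalized
  toolCallsInFlight: number; // tool calls currently executing
  ttsQueued: number; // sentences queued or playing in TTS
}

function orbState(activity: TurnActivity): OrbState {
  // ASSUMPTION: tool work takes visual priority over speech.
  if (activity.toolCallsInFlight > 0) return "orange-working";
  if (activity.ttsQueued > 0) return "purple-speaking";
  // The run is still open, so more chunks may arrive: stay out of green.
  if (activity.runActive) return "yellow-thinking";
  // Green only when the run is finished and nothing is left to speak.
  return "green-listening";
}

// Gap between two speech chunks: TTS queue is momentarily empty, run still active.
const betweenChunks = orbState({ runActive: true, toolCallsInFlight: 0, ttsQueued: 0 });
const trulyIdle = orbState({ runActive: false, toolCallsInFlight: 0, ttsQueued: 0 });
```

Under this model the listening-ready cue would be tied to the transition into `green-listening`, so it cannot fire during a gap between chunks.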
Milestone (2026-05-07): Per-message independence: bundling fix LIVE-VALIDATED. Each finalized VC utterance now lands as its own discrete dashboard bubble + its own agent turn, even when arriving during an active reply. Achieved entirely inside the plugin via sessionEntry.queueMode = "queue" through the official openclaw/plugin-sdk/session-store-runtime SDK. Zero latency impact.
Current Status — Production-Grade
- Plugin-owned / self-contained package — no external legacy bridge repo, no workspace clone dependency, no manual app server. A minimal packaged helper proxy can be managed by the plugin when host ingress needs it.
- Multi-round orb cadence — green → yellow → orange (tool calls) → purple → green, verified across multiple exchanges.
- Simplified VAD profiles — Fast (800ms), Medium (1500ms), Conservative (2700ms) presets plus Custom slider (400–4000ms). Pure silence-duration timers, no legacy transcript-adaptive logic. Settings saved server-side.
- Low-latency assistant TTS — user STT submits full turns after VAD, while assistant replies stream back as deltas so TTS can speak complete sentences before the full response is finished.
- Receipt/stall cue — short spoken acknowledgment plays when the AI run begins processing the user turn, functioning as an audio typing indicator.
- Subtle browser-side audio cues — tiny nonverbal sounds indicate finalized STT sent from the browser and true green listening-ready state, helping mobile/car use without watching the screen.
- Premium STT — Deepgram, OpenAI, and Google selectable. Real-time Heard text confirmed with Deepgram.
- Built-in STT — Vosk lgraph supported by the package. For packaging, the model is installed separately instead of being shipped inside the npm tarball.
- Premium TTS — ElevenLabs, OpenAI, Google, Cartesia selectable with per-device persistence.
- Plugin API routes (/voice-connect/api/v1/*) serve all endpoints directly through gateway.
- Caddy config points to the supported Voice Connect ingress for the install: either direct gateway when safe, or the plugin-managed helper proxy on 18790 when the public gateway path is blocked by auth/body middleware.
- SPA served by plugin, with spa-dist/ bundled inside the plugin extension directory.
Architecture
Browser → Caddy (HTTPS) → Voice Connect ingress → OpenClaw Gateway → Voice Connect Plugin
│
└─ Direct gateway when safe, OR plugin-managed tiny helper proxy on :18790
Voice Connect Plugin owns:
├── SPA static files (spa-dist/)
├── Inbound transcript route (SSE streaming back to SPA)
├── TTS synthesis routes
├── Cloud STT routes (GET control + POST audio)
├── Vosk managed runtime (lgraph model)
├── Voice settings persistence
├── Tool/lifecycle marker hooks
└── Optional helper proxy script bundled inside the package
All product logic flows through the OpenClaw gateway and plugin routes. The plugin registers HTTP routes via the plugin SDK's registerHttpRoute. The helper proxy, when used, is not the old bridge application: it is a tiny package-owned ingress shim that forwards bytes to the gateway and injects server-side auth. It exists because some host/Caddy paths cannot safely use the dashboard/public gateway port for unauthenticated VC traffic or POST/file bodies.
Important transport note: Control endpoints (start, finalize, close, settings) use GET with query params rather than POST with JSON body. This is because the OpenClaw gateway body-parsing middleware can consume POST bodies before they reach plugin routes when accessed through Caddy reverse-proxy. Binary audio chunks (/stt/cloud/chunk) remain POST with application/octet-stream.
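A client-side sketch of this transport split might look like the following. The `/voice-connect/api/v1` base and the `/stt/cloud/chunk` path appear in this README; the assumption that the chunk path sits under that base, the `/stt/cloud/finalize` path, the `session` parameter name, and the `controlRequest` / `audioChunkRequest` helpers are hypothetical. The functions only describe the request to make, so the sketch runs without a network.

```typescript
const API_BASE = "/voice-connect/api/v1";

interface RequestPlan {
  method: "GET" | "POST";
  url: string;
  headers: Record<string, string>;
}

// Control endpoints carry their arguments in the query string, so there is
// no request body for gateway middleware to consume.
function controlRequest(path: string, params: Array<[string, string]>): RequestPlan {
  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  const url = query.length > 0 ? `${API_BASE}${path}?${query}` : `${API_BASE}${path}`;
  return { method: "GET", url, headers: {} };
}

// Binary audio stays on POST with a raw octet-stream body.
function audioChunkRequest(sessionId: string): RequestPlan {
  return {
    method: "POST",
    url: `${API_BASE}/stt/cloud/chunk?session=${encodeURIComponent(sessionId)}`,
    headers: { "Content-Type": "application/octet-stream" },
  };
}

// Hypothetical control path and parameter name.
const finalizePlan = controlRequest("/stt/cloud/finalize", [["session", "a b"]]);
const chunkPlan = audioChunkRequest("s1");
```

The trade-off is that control arguments are visible in URLs and logs, which is acceptable for session ids and settings flags but is a reason to keep secrets out of these calls.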
Voice cadence note: User-side STT sentence chunking is intentionally off by default; VAD submits the user's full thought as one turn because OpenClaw does not consume partial user-sentence chunks into agent:main:main. Assistant-side TTS is different: the plugin streams assistant deltas back to the SPA, and the browser sentence chunker enqueues complete sentences immediately for low-latency speech.
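The assistant-side half of this note, turning streamed deltas into complete sentences for TTS, can be sketched as a small buffer. This is an illustrative stand-in rather than the SPA's actual chunker: the `SentenceChunker` class and its boundary rule (terminal punctuation followed by whitespace) are assumptions, and a production chunker would need to handle abbreviations, decimals, and similar cases that this one splits incorrectly.

```typescript
class SentenceChunker {
  private buffer = "";

  // Feed one streamed delta; returns any sentences it completed.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Assumed boundary: ., !, or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    // Keep the unfinished tail for the next delta.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call at end-of-run to release whatever text remains.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}

const chunker = new SentenceChunker();
const spoken: string[] = [];
for (const delta of ["Cocoa is ", "ready. Holi", "days are booked! Mus", "ic next."]) {
  for (const sentence of chunker.push(delta)) spoken.push(sentence);
}
for (const sentence of chunker.flush()) spoken.push(sentence);
```

Each sentence returned by `push` can be enqueued for synthesis immediately, which is why speech can begin before the full reply has arrived; `flush` covers the final sentence, which has no trailing whitespace to mark its end.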
Installation (target)
npm install -g @northpointe/voice-connect
bash scripts/install-vosk-model.sh
One block for Caddy:
handle /voice-connect/* {
header Permissions-Policy "microphone=(self)"
reverse_proxy <voice-connect-ingress> {
flush_interval -1
}
}
redir /voice-connect /voice-connect/ 308
# In the current production container, <voice-connect-ingress> is the
# plugin-managed helper proxy reachable from the host at 172.19.0.3:18790.
# On future installs with a safe plugin-public gateway listener, this can be
# the direct gateway listener instead.
No workspace clones and no separate legacy bridge application. Any helper proxy needed for ingress is bundled, versioned, and managed as part of the plugin package.
Remaining Packaging Items
- Vosk auto-bootstrap inside plugin install flow — the repo now supports script-based model install, but a first-run or post-install bootstrap would be even cleaner than a manual script step.
- License infrastructure — Pay-first model with Stripe + a dedicated license server. 2-3 days after packaging ships.
- Upstream OpenClaw feature request — Plugin-owned public listeners would let VC remove the helper proxy entirely on more installs.
