pi-speak-pk
v0.2.5
Voice, wake-word, Telegram, and mobile web remote extensions for Pi.
pi-speak-pk
Voice, wake-word, and remote-control extensions for Pi / pi-mono.
This package turns Pi into a usable voice workstation, not just a text assistant with TTS bolted on. It gives you:
- spoken assistant replies with multiple TTS backends
- the always-listening PK wake phrase flow
- Telegram text and voice turns from your phone
- a local HTTP control API
- a built-in mobile web app at /app/
- a Unified Remote control surface
What To Use
If you just want the shortest path:
- Local desktop voice: use /speak on
- Hands-free on the same machine: use /mono on
- Remote from your phone with the least friction: use /phone on
- Remote from your phone with QR setup: use /pk-remote, then scan the QR from the Android phone
- Remote button grid on Android: use the bundled Unified Remote remote
Install
Install the extension:
pi npm i pi-speak-pk
Reload Pi after install.
Quick Start
1. Make Pi Speak Locally
/speak on
/speak test
/speak status
If you do nothing else, auto provider selection will try available backends in this order:
- legacy via speak11
- elevenlabs
- openai
- edge
If an earlier auto-selected backend fails at synthesis time, Pi now falls through to the next available provider instead of stopping on the first failure.
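The fall-through behavior can be sketched roughly as follows. This is a minimal illustration, not the code in tts.ts: the `Synthesizer` type and `synthesizeWithFallback` function are hypothetical names, and only the idea of trying providers in order and moving on after a synthesis failure comes from the text above.

```typescript
// Hypothetical shape for a TTS backend; not the extension's real interface.
type Synthesizer = {
  id: string;
  available: boolean;
  synthesize: (text: string) => Promise<Uint8Array>;
};

// Try each available backend in order; move on when synthesis throws.
async function synthesizeWithFallback(
  providers: Synthesizer[],
  text: string,
): Promise<{ provider: string; audio: Uint8Array }> {
  const failures: string[] = [];
  for (const p of providers) {
    if (!p.available) continue; // unavailable backends are skipped entirely
    try {
      return { provider: p.id, audio: await p.synthesize(text) };
    } catch (err) {
      // Record the failure and fall through to the next provider.
      failures.push(`${p.id}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed (${failures.join("; ")})`);
}
```

Passing the providers in the documented order (legacy, elevenlabs, openai, edge) would reproduce the auto selection sequence under these assumptions.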
2. Enable The Always-Listening Wake Phrase
/mono on
Say:
PK
Pi will open a short voice-input window, play a short listening cue, and update the mono status so you can tell it is actively listening. Say PK again within the timeout to keep it alive. Default keep-alive is 15 seconds.
Wake matching now has a sensitivity preset. Use PI_SPEAK_WAKE_SENSITIVITY=low|medium|high to make activation stricter or more forgiving. medium is the default.
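As a small illustration of the preset, reading the variable could look like the sketch below. The function name is hypothetical, and falling back to medium for unrecognized values is an assumption; the text above only confirms the three allowed values and the medium default.

```typescript
type WakeSensitivity = "low" | "medium" | "high";

// Read PI_SPEAK_WAKE_SENSITIVITY, falling back to the documented default.
// Treating unknown values as "medium" is an assumption of this sketch.
function parseWakeSensitivity(raw: string | undefined): WakeSensitivity {
  const value = (raw ?? "").trim().toLowerCase();
  return value === "low" || value === "high" ? value : "medium";
}
```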
3. Remote In From Your Phone With Telegram
/phone setup
/phone token <bot-token>
/phone on
/phone code
Then in Telegram:
- Open your bot
- Send /link <code>
- Send text or voice notes to Pi
This is the easiest remote path. It works well when you want reliability more than low latency.
/phone setup prints the running-session setup steps. If the token is not already in the environment, paste it with /phone token <bot-token>; the extension saves it, starts the bridge, and prints the /link code. PI_SPEAK_TELEGRAM_BOT_TOKEN can still point to an existing bot you already control.
4. Remote In From Your Phone With The Built-In Web App
/pk-remote
/remote setup
/remote setup bluetooth
/pk-remote is the shortest path. It starts the remote API if needed, chooses a setup URL in this order, and prints a QR code for the phone setup page:
- PI_SPEAK_PUBLIC_BASE_URL
- detected Tailscale IPv4 address
- detected local LAN IPv4 address
- configured fallback
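The priority order above amounts to a first-match selection. The sketch below shows one way to express it; the interface, field names, and the choice of plain http with the default port for detected IP addresses are assumptions for illustration, not the extension's actual logic.

```typescript
// Candidate sources in the documented priority order.
// Field names are hypothetical; only the ordering comes from the list above.
interface SetupUrlCandidates {
  publicBaseUrl?: string; // PI_SPEAK_PUBLIC_BASE_URL
  tailscaleIpv4?: string; // detected Tailscale IPv4 address
  lanIpv4?: string;       // detected local LAN IPv4 address
  fallback: string;       // configured fallback
}

// Return the first candidate that is present, without a trailing slash.
function chooseSetupBaseUrl(c: SetupUrlCandidates, port: number = 8767): string {
  if (c.publicBaseUrl) return c.publicBaseUrl.replace(/\/+$/, "");
  // Assumption: detected addresses are served over plain http on the given port.
  if (c.tailscaleIpv4) return `http://${c.tailscaleIpv4}:${port}`;
  if (c.lanIpv4) return `http://${c.lanIpv4}:${port}`;
  return c.fallback.replace(/\/+$/, "");
}
```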
Scan the QR from the Android phone to open the setup page. From there you can download the bundled APK, open the native pi-speak://setup link, and save the machine URL, token, profile name, connection mode, and Codex route metadata. If you want the browser app instead, open one of the printed browser URLs:
http://localhost:8767/app/
https://<tailnet-host>/app/
https://<tunnel-domain>/app/
The web app:
- records your microphone in the browser
- sends audio to /v1/turn/voice
- shows the transcript
- plays the returned reply audio
- stores the remote token in the current browser session by default
- can explicitly remember the token on that device if you enable it in Settings
/remote setup prints the same QR and links as /pk-remote. Use /remote setup bluetooth or /pk-remote bluetooth when the phone is paired over Bluetooth networking/PAN.
For real phone use, prefer an HTTPS URL through Tailscale Serve or a tunnel. If the phone is paired over Bluetooth networking/PAN instead, use /remote setup bluetooth; the Android app treats that as a Bluetooth local-link profile and does not require Tailscale.
Optional Windows tray:
/remote tray on
Right-click the tray icon and choose Show setup QR code. Scanning the QR opens the Android app with this computer's Tailscale endpoint, token, and saved machine profile metadata. Set PI_SPEAK_TRAY=1 to start the tray automatically with /remote on.
NPM-installed tray/service path:
npx -p pi-speak-pk pi-speak-tray
Or, after global install:
pi-speak-tray --install-startup
The tray keeps the headless gateway running in the background, restarts it if it exits, and exposes setup, APK download, status, settings, restart, and web remote actions from the tray menu.
Gemini Live Smoke Test
Use this before wiring a real-time session into the phone UI:
set PI_SPEAK_GEMINI_BACKEND=vertex
set PI_SPEAK_VERTEX_API_KEY=<optional-vertex-api-key>
set GOOGLE_CLOUD_PROJECT=<your-gcloud-project>
set GOOGLE_CLOUD_LOCATION=us-central1
gcloud auth application-default login
pi-speak-gemini-live-smoke --model gemini-2.5-flash-native-audio-preview-12-2025 --modality audio
To run the tray/headless gateway through ElevenLabs voice, backed by Vertex AI Gemini text reasoning:
set ELEVENLABS_API_KEY=<your-elevenlabs-key>
set PI_SPEAK_GEMINI_BACKEND=vertex
set PI_SPEAK_VERTEX_API_KEY=<optional-vertex-api-key>
set GOOGLE_CLOUD_PROJECT=<your-gcloud-project>
set GOOGLE_CLOUD_LOCATION=us-central1
gcloud auth application-default login
set AGENT_PROVIDER=elevenlabs
pi-speak-gateway
This is the recommended high-quality voice stack. It uses ElevenLabs for reply audio and Vertex AI for Gemini reasoning so Google Cloud billing/credits apply through your Cloud project. Set PI_SPEAK_ELEVENLABS_MODEL_ID=eleven_multilingual_v2 when quality matters more than credit use.
To run the tray/headless gateway through Gemini Live instead:
set AGENT_PROVIDER=gemini-live
set PI_SPEAK_GEMINI_BACKEND=vertex
set GOOGLE_CLOUD_PROJECT=<your-gcloud-project>
set GOOGLE_CLOUD_LOCATION=us-central1
set PI_SPEAK_GEMINI_LIVE_MODEL=gemini-2.5-flash-native-audio-preview-12-2025
pi-speak-gateway
Optional environment:
- PI_SPEAK_GEMINI_BACKEND=vertex|developer-api selects Vertex AI or direct Gemini Developer API
- PI_SPEAK_VERTEX_API_KEY uses a Vertex AI API key instead of Application Default Credentials
- GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_LOCATION configure Vertex AI
- PI_SPEAK_GEMINI_LIVE_MODEL selects the Live model
- PI_SPEAK_GEMINI_LIVE_MODALITY=audio|text selects response mode
- PI_SPEAK_GEMINI_API_VERSION=v1beta|v1alpha selects the Gemini API version
- PI_SPEAK_ELEVENLABS_MODEL_ID selects the ElevenLabs speech model
- PI_SPEAK_ELEVENLABS_VOICE_ID selects the ElevenLabs voice
Keep Gemini, Vertex, and ElevenLabs credentials server-side. Do not put them in the Android app or browser app.
Main Commands
/speak
Turns spoken replies on, off, or changes the backend.
Common examples:
/speak on
/speak off
/speak stop
/speak status
/speak test
/speak providers
/speak provider edge
/speak provider openai
/speak provider elevenlabs
/speak rewrite on
/speak rewrite off
Behavior:
- Pi still keeps the full on-screen response
- the spoken version can optionally be rewritten for audio clarity
- /speak stop interrupts playback without disabling speech mode
/mono
Controls the wake-word listener.
/mono on
/mono off
/mono status
Behavior:
- waits for the wake phrase PK by default
- activates voice input for a short window
- keeps the existing /mono flow intact with a faster-whisper wake detector
- supports PK <session-name> to route into a named session when the target name is spoken clearly
- keeps short numeric routes deterministic:
  - PK one, PK 1, and PK1 belong to the same 1 family
  - PK two, PK 2, and PK2 belong to the same 2 family
  - 1 stays distinct from 2
  - multi-word names like PK to Google stay literal and are not coerced into 2
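The compact-route rules above can be illustrated with a small normalizer. This is a sketch under assumptions, not the listener's real matcher: the function name is hypothetical, the wake phrase is hard-coded to PK, only the documented words one and two are mapped, and lowercasing literal names is a choice made here for comparison purposes.

```typescript
// Spoken-number words mapped to compact route families.
// Only "one" and "two" are documented; anything else stays literal in this sketch.
const NUMBER_WORDS = new Map<string, string>([
  ["one", "1"],
  ["two", "2"],
]);

// Return the route target that follows the wake phrase, or null if there is none.
function normalizeRoute(transcript: string): string | null {
  // Accept "PK <words>" or the compact digit form "PK1".
  const m = /^pk(?:\s+(.+)|(\d+))?$/i.exec(transcript.trim());
  if (!m) return null;
  const rest = (m[1] ?? m[2] ?? "").trim().toLowerCase();
  if (rest === "") return null;         // bare wake phrase, no route
  if (/^\d+$/.test(rest)) return rest;  // "PK 1" and "PK1"
  return NUMBER_WORDS.get(rest) ?? rest; // "PK one" maps to "1"; other names stay literal
}
```

Because the word map is matched against the whole remainder, a multi-word phrase such as "to Google" never collides with the 2 family.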
/phone
Controls the Telegram bridge.
/phone on
/phone off
/phone status
/phone setup
/phone token <bot-token>
/phone code
/phone unpair
Behavior:
- text messages become Pi turns
- voice notes are transcribed, then sent to Pi
- replies can be delivered as text plus generated audio
/remote
Controls the HTTP API and mobile web app.
/remote on
/remote off
/remote status
/remote token
/remote setup
/remote setup bluetooth
/remote tray on
/remote tray off
/remote tray status
Behavior:
- starts the HTTP server
- serves the mobile app from /app/
- serves the phone setup page from /setup
- serves the bundled Android APK from /download/pi-speak.apk
- exposes remote-control endpoints
- generates a token if one is not already configured
- prints one-step setup URLs for the browser app and native Android app
/sess
Named sessions, wake aliases, and routing summaries for voice control.
/sess
/sess new bugfix
/sess switch bugfix
/sess name active-work
/sess rename bugfix voice-bugfix
/sess wake one
/sess wake clear one
/sess alias add bugfix one
/sess alias remove one
/sess edit bugfix
/sess remove bugfix
/sess confirm remove bugfix
/sess slots
/sess export
/sess ui
/sess ui open
This matters because PK bugfix can route voice input to that named session, while compact routes like PK one / PK1 and PK two / PK2 can stay stable and distinct.
/sess with no args shows the current session, ready sessions, aliases, store path, a compact 1 vs 2 lane summary, and inline state for known sessions.
Use /sess slots when you want the explicit compact-route view for PK one / PK1 and PK two / PK2.
Use /sess ui for inline guidance without opening another terminal. Use /sess ui open only when you explicitly want the older terminal pane; repeat launches reuse the existing pane instead of creating more terminal windows. The pane mirrors the /sess dashboard, refreshes within one second of external mutations, supports focus movement with arrow keys, tab, or j / k, shows the compact PK1/PK2 route lanes plus a focused-session footer, and adds keybindings [r] rename, [a] alias, [x] remove, and [q] quit.
For operator details, see:
- docs/VOICE_SESSION_BRIDGE.md
- docs/SESSION_OPERATIONS.md
- docs/CODEBASE_MAP.md
Architecture
There are six main subsystems:
- index.ts: The extension entrypoint. Registers commands, persists state, owns wake-word routing, and coordinates TTS, STT, Telegram, and HTTP control.
- tts.ts: Multi-provider speech synthesis. Supports legacy, edge, openai, elevenlabs, and auto.
- stt.ts and listener/stt_worker.py: Remote voice transcription for uploaded audio. auto prefers OpenAI when an API key is present, otherwise a warm local faster-whisper worker process.
- listener/listener.py: The always-on two-tier listener:
  - Tier 1: faster-whisper tiny for wake-phrase detection
  - Tier 2: faster-whisper for actual speech transcription
- phone-bridge.ts: Telegram transport for remote text and voice notes.
- control-server.ts: Local HTTP API, audio artifact serving, and the built-in mobile app host.
Remote Paths
Best Overall: Built-In Mobile Web App
Use this when you want:
- browser mic capture
- browser audio playback
- one-tap remote use from Android
- compatibility with Tailscale or an HTTPS tunnel
Start it:
/remote on
Open:
https://<your-url>/app/
Best Zero-Friction Fallback: Telegram
Use this when you want:
- the least setup
- reliable remote turns
- simple text plus voice note interaction
Start it:
/phone on
Best Button Grid: Unified Remote
Use this when you want:
- fast buttons for mono, speak, provider changes, and phone pairing
- a control surface on the phone
Do not use this as your main audio path. It is a controller, not a real voice transport.
Mobile Web App
The mobile app is built into the extension and served from:
/app/
Capabilities:
- record a voice turn with the browser microphone
- send typed fallback text
- request spoken replies on each turn
- autoplay returned audio when the browser allows it
- keep the token in session storage by default
- optionally remember the token on that device
- install as a PWA on Android
Token onboarding options:
- Paste the token in the Settings panel
- Open the app once with:
/app/?token=YOUR_TOKEN
The app will save the token into the current browser session and clean the URL immediately.
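The bootstrap step can be pictured as pulling the token out of the query string and producing a URL without it. The helper below is a hypothetical sketch using the standard URL API; it is not taken from web/remote/index.html.

```typescript
// Pull a one-time ?token= value out of the app URL and return the cleaned URL.
function extractBootstrapToken(href: string): { token: string | null; cleanedUrl: string } {
  const url = new URL(href);
  const token = url.searchParams.get("token");
  url.searchParams.delete("token"); // other query parameters are left untouched
  return { token, cleanedUrl: url.toString() };
}
```

In a browser, the returned token would typically be written to sessionStorage and the cleaned URL applied with history.replaceState so the token does not linger in the address bar; the storage key the app actually uses is not documented here.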
Secure-origin rules:
- localhost works
- HTTPS works
- random plain HTTP hostnames usually will not allow browser microphone access
That is why Tailscale Serve or an HTTPS tunnel is the right remote path.
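As a rough pre-check of the secure-origin rules, a URL can be classified before it is handed to a phone. The function below is a hypothetical approximation; in a real page the authoritative answer is the browser's own window.isSecureContext, and individual browsers may apply additional rules.

```typescript
// Approximate whether a browser is likely to allow microphone capture for this URL.
// This mirrors the rules listed above and is only a heuristic.
function likelyAllowsMic(appUrl: string): boolean {
  const url = new URL(appUrl);
  if (url.protocol === "https:") return true;
  // Loopback hosts are treated as secure even over plain HTTP.
  const host = url.hostname;
  return host === "localhost" || host === "127.0.0.1" || host === "[::1]";
}
```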
Native Android can also use Bluetooth networking/PAN. Pair the phone with the desktop, start /remote on, run /remote setup bluetooth, then open the native setup link or select the built-in Bluetooth / local link profile and adjust the base URL to the desktop Bluetooth adapter IP if needed. Set PI_SPEAK_BLUETOOTH_BASE_URL before launching Pi Speak if you want /remote setup bluetooth to print a known adapter URL instead of the default http://192.168.44.1:8767/.
HTTP API
Start it with:
/remote on
Default bind:
host: 0.0.0.0
port: 8767
Public Routes
These are available before auth because they serve the built-in app:
GET /
GET /app/
GET /app/index.html
GET /app/app.webmanifest
GET /app/sw.js
GET /app/icon.svg
Control Routes
GET /v1/health
GET /v1/status
GET /v1/diagnostics
GET /v1/route
POST /v1/route
GET /v1/mono/status
POST /v1/mono/on
POST /v1/mono/off
GET /v1/speak/status
GET /v1/speak/providers
POST /v1/speak/on
POST /v1/speak/off
POST /v1/speak/stop
POST /v1/speak/test
POST /v1/speak/provider/:provider
POST /v1/speak/rewrite/:onOrOff
GET /v1/phone/status
POST /v1/phone/on
POST /v1/phone/off
POST /v1/phone/code
POST /v1/phone/unpair
GET /v1/turn/text?text=hello&audio=1
POST /v1/turn/text
POST /v1/turn/voice
GET /v1/audio/:id
Auth
Local bypass applies only to true localhost requests:
- localhost
- 127.0.0.1
- ::1
Remote clients must send one of:
- Authorization: Bearer <token>
- X-Pi-Speak-Token: <token>
Query-string token auth is reserved for:
- /app/?token=... bootstrap onboarding
- /v1/audio/:id?token=... reply-audio playback in the browser
Remote control and turn endpoints should use headers, not query-string auth.
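A remote client following the header rule might assemble a text turn like this. The helper and its return type are hypothetical; the route, the Bearer header, and the text and audio body fields are the ones shown elsewhere in this README. The response format is not documented here, so the sketch stops at building the request.

```typescript
interface TextTurnRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build a header-authenticated POST /v1/turn/text request.
function buildTextTurnRequest(
  baseUrl: string,
  token: string,
  text: string,
  audio: boolean = true,
): TextTurnRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/turn/text`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`, // token stays out of the query string
      },
      body: JSON.stringify({ text, audio }),
    },
  };
}
```

The result can be passed straight to fetch as `fetch(req.url, req.init)`.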
Hardening Defaults
The production-oriented defaults are:
- same-origin CORS unless PI_SPEAK_HTTP_ALLOWED_ORIGINS is set
- request body limit for text turns: 64 KB
- request body limit for voice turns: 25 MB
- lightweight in-memory rate limits for non-local traffic
- background cleanup of expired reply-audio artifacts
- authenticated diagnostics at /v1/diagnostics, including a compact summary block for queue state, phone linkage, mono state, current session/target, and active error sources
- queue/backpressure for remote turns so Pi returns a deterministic busy response instead of piling up unlimited work
- synchronous remote turns fail fast when the current Pi session is already mid-turn, instead of hanging the HTTP request against the same active session
- mutating control routes require POST, leaving GET read-only for fetch-safe status endpoints
- outbound provider calls share a default 30s timeout via PI_SPEAK_OUTBOUND_TIMEOUT_MS
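To make the in-memory rate limit item concrete, here is one common way such a limiter is built. The fixed-window strategy and the class name are assumptions; the README does not say which algorithm control-server.ts uses. The window and limit correspond to settings such as PI_SPEAK_HTTP_RATE_LIMIT_WINDOW_MS and PI_SPEAK_HTTP_RATE_LIMIT_CONTROL.

```typescript
// Fixed-window, in-memory rate limiter keyed by client address (illustrative only).
class FixedWindowLimiter {
  private readonly hits = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if it should be rejected.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // First request from this key, or the previous window has elapsed.
      this.hits.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```

With the documented defaults, `new FixedWindowLimiter(20, 60000)` would admit up to 20 control requests per client per minute under this scheme.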
Inspect the active token with:
/remote token
Example Requests
Text turn:
curl -X POST http://127.0.0.1:8767/v1/turn/text ^
-H "Content-Type: application/json" ^
-d "{\"text\":\"Summarize the repo\",\"audio\":true}"
Voice turn:
curl -X POST "https://<your-host>/v1/turn/voice?audio=1" ^
-H "Authorization: Bearer <token>" ^
-H "Content-Type: audio/webm" ^
--data-binary "@voice.webm"
Unified Remote
Bundled remote source:
unified-remote/Pi Speak
Install path:
C:\ProgramData\Unified Remote\Remotes\Custom\Pi Speak
What it is good at:
- toggling mono
- toggling speak
- switching providers
- requesting the Telegram pair code
- sending short text turns
What it is not good at:
- full remote voice capture
- browser-style audio playback
- low-latency conversational audio
Environment Variables
Core
AGENT_PROVIDER=pi|codex|elevenlabs|gemini|gemini-live
CODEX_BIN=codex
PI_BIN=pi
AGENT_MODEL=
PI_SPEAK_EXECUTION_ROUTER_MODE=auto|pi|codex
AGENT_CWD=
AGENT_WORKSPACE=
PI_SPEAK_TTS_PROVIDER=auto|legacy|edge|openai|elevenlabs
PI_SPEAK_REWRITE_ENABLED=true|false
PI_SPEAK_WAKE_PHRASE=PK
PI_SPEAK_MONO_ACTIVITY_TIMEOUT=15
PI_SPEAK_WAKE_SENSITIVITY=low|medium|high
PI_SPEAK_WAKE_FUZZY_ENABLED=true|false # optional override
PI_SPEAK_WAKE_FUZZY_MAX_DISTANCE=0|1|2 # optional override
PI_SPEAK_WAKE_COMPACT_PREFIX_ENABLED=true|false # optional override
If PI_SPEAK_EXECUTION_ROUTER_MODE is unset, explicit AGENT_PROVIDER=pi or AGENT_PROVIDER=codex controls which backend remote turns dispatch to. Set the router mode to auto when you want the conversation router to choose Pi or Codex from the reduced task.
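The interaction between the two variables can be sketched as a small resolver. The type and function names are hypothetical. Two behaviors here are assumptions rather than documented facts: that a router mode of pi or codex pins that backend, and that every other combination is left as "unspecified" because the README does not describe it.

```typescript
type Dispatch =
  | { kind: "router" }                         // conversation router picks Pi or Codex
  | { kind: "fixed"; backend: "pi" | "codex" } // turns always go to one backend
  | { kind: "unspecified" };                   // combination not described in this README

function resolveDispatch(env: Record<string, string | undefined>): Dispatch {
  const mode = env["PI_SPEAK_EXECUTION_ROUTER_MODE"]?.trim().toLowerCase();
  if (mode === "auto") return { kind: "router" };
  // Assumption: an explicit pi/codex router mode pins that backend.
  if (mode === "pi" || mode === "codex") return { kind: "fixed", backend: mode };

  const provider = env["AGENT_PROVIDER"]?.trim().toLowerCase();
  if (!mode && (provider === "pi" || provider === "codex")) {
    // Documented case: router mode unset, explicit AGENT_PROVIDER decides.
    return { kind: "fixed", backend: provider };
  }
  return { kind: "unspecified" };
}
```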
Rewrite
OPENROUTER_API_KEY=...
PI_SPEAK_REWRITE_MODEL=openai/gpt-oss-20b:nitro
PI_SPEAK_OPENROUTER_URL=https://openrouter.ai/api/v1/chat/completions
OpenAI
# Dedicated key for audio TTS (avoids consuming the general LLM key)
PI_SPEAK_OPENAI_KEY=...
# Legacy fallback
VOICE_TOOLS_OPENAI_KEY=...
PI_SPEAK_OPENAI_TTS_MODEL=gpt-4o-mini-tts
PI_SPEAK_OPENAI_VOICE=alloy
PI_SPEAK_REMOTE_OPENAI_STT_MODEL=whisper-1
PI_SPEAK_OPENAI_BASE_URL=https://api.openai.com/v1
ElevenLabs
ELEVENLABS_API_KEY=...
PI_SPEAK_ELEVENLABS_VOICE_ID=pNInz6obpgDQGcFmaJgB
PI_SPEAK_ELEVENLABS_MODEL_ID=eleven_multilingual_v2
Vertex AI Gemini
PI_SPEAK_GEMINI_BACKEND=vertex
PI_SPEAK_VERTEX_API_KEY=<optional-vertex-api-key>
GOOGLE_CLOUD_PROJECT=<your-gcloud-project>
GOOGLE_CLOUD_LOCATION=us-central1
PI_SPEAK_GEMINI_TEXT_MODEL=gemini-2.5-flash
PI_SPEAK_GEMINI_LIVE_MODEL=gemini-2.5-flash-native-audio-preview-12-2025
Run gcloud auth application-default login on the machine hosting the tray/gateway, or set PI_SPEAK_VERTEX_API_KEY to a Vertex AI API key. Enable the Vertex AI API on the Cloud project.
Edge TTS
PI_SPEAK_EDGE_VOICE=en-US-AriaNeural
PI_SPEAK_EDGE_LANG=en-US
PI_SPEAK_EDGE_RATE=1
PI_SPEAK_EDGE_TIMEOUT_MS=15000
Legacy / Local Python
PI_SPEAK_SPEAK11_PATH=...
PI_SPEAK_PYTHON=...
WHISPER_MODEL=tiny
WHISPER_DEVICE=cpu
WHISPER_COMPUTE=int8
PI_SPEAK_REMOTE_WHISPER_MODEL=base
PI_SPEAK_REMOTE_STT_PROVIDER=auto|local|openai
PI_SPEAK_PYTHON and PI_SPEAK_SPEAK11_PATH are the preferred way to override the paths used by local Python audio setups. When they are unset, Pi scans the normal Windows user-site Python*/Scripts locations before falling back to PATH resolution.
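The lookup order for the Python executable can be summarized in a few lines. This is a sketch with hypothetical names: the candidate list stands in for whatever the Windows user-site scan finds, the exists callback stands in for a filesystem check, and returning the bare name python to defer to PATH is an assumption.

```typescript
// Resolve the Python executable: explicit override, then scanned candidates, then PATH.
function resolvePython(
  env: Record<string, string | undefined>,
  scannedCandidates: string[],
  exists: (path: string) => boolean,
): string {
  const override = env["PI_SPEAK_PYTHON"]?.trim();
  if (override) return override; // explicit override always wins
  const found = scannedCandidates.find((candidate) => exists(candidate));
  // Assumption: a bare command name lets the OS resolve it through PATH.
  return found ?? "python";
}
```

Injecting the exists check keeps the function testable without touching the real filesystem; in Node it could be backed by fs.existsSync.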
Telegram
PI_SPEAK_TELEGRAM_BOT_TOKEN=...
TELEGRAM_BOT_TOKEN=...
PI_SPEAK_PHONE_WAIT_TIMEOUT_MS=180000
HTTP Remote
PI_SPEAK_HTTP_HOST=0.0.0.0
PI_SPEAK_HTTP_PORT=8767
PI_SPEAK_HTTP_TOKEN=...
PI_SPEAK_HTTP_AUDIO_TTL_MS=600000
PI_SPEAK_HTTP_AUDIO_CLEANUP_MS=30000
PI_SPEAK_HTTP_ALLOWED_ORIGINS=https://your-tailnet-host,https://your-tunnel-host
PI_SPEAK_HTTP_TIMEOUT_MS=180000
PI_SPEAK_HTTP_TEXT_BODY_LIMIT_BYTES=65536
PI_SPEAK_HTTP_VOICE_BODY_LIMIT_BYTES=26214400
PI_SPEAK_HTTP_RATE_LIMIT_WINDOW_MS=60000
PI_SPEAK_HTTP_RATE_LIMIT_CONTROL=20
PI_SPEAK_HTTP_RATE_LIMIT_VOICE=6
PI_SPEAK_OUTBOUND_TIMEOUT_MS=30000
Troubleshooting
The phone web app opens, but the mic does not work
You are probably not on a secure origin.
Use one of:
- http://localhost:8767/app/
- Tailscale Serve over HTTPS
- Cloudflare Tunnel over HTTPS
/mono on starts, but voice transcription fails
You likely do not have the Python audio stack installed. The local listener depends on:
- numpy
- sounddevice
- faster_whisper
Remote voice turns fail
Check these in order:
- /remote status
- /v1/diagnostics
- /remote token
- PI_SPEAK_REMOTE_STT_PROVIDER
- OpenAI key or local whisper setup
Speech is using the wrong provider
Check:
/speak status
/speak providers
/speak provider edge
Telegram pairing is stuck
Use:
/phone code
/phone unpair
Then link again with the fresh code.
Testing
Run the automated production-readiness checks with:
npm test
Current automated coverage includes:
- non-local auth enforcement
- localhost auth bypass
- body-size rejection
- voice content-type rejection
- rate limiting
- audio artifact expiry
- Telegram link + text-turn handling
- PWA token persistence rules
- remote queue backpressure behavior
- runtime path resolution for local Python / speak11
- explicit listener shutdown signaling with force-kill fallback
Manual Smoke Checklist
Before treating a machine as production-ready, verify:
- /mono on
- local wake phrase: say PK
- /phone on then /phone code, then complete a Telegram text turn and voice-note turn
- /remote on, open /app/, complete a text turn and voice turn, and confirm reply audio playback
- over Tailscale or your HTTPS tunnel, confirm non-local requests fail without the token and succeed with it
For a full phone-focused run sheet with pass/fail capture fields, use docs/REMOTE_VALIDATION_CHECKLIST.md.
For a compact operator worksheet, use docs/REMOTE_VALIDATION_RUN_SHEET.md.
Files You Will Care About
- index.ts
- tts.ts
- stt.ts
- phone-bridge.ts
- control-server.ts
- listener/listener.py
- web/remote/index.html
- docs/CODEBASE_MAP.md
Release Notes
See CHANGELOG.md.
