@p8n.ai/pi-listens
v0.3.1
Pi package for speech-first interaction with pluggable STT/TTS providers, defaulting to Sarvam AI.
Speech-first Pi package with a provider abstraction for speech-to-text (STT) and text-to-speech (TTS). The default bundled provider is Sarvam AI, which gives Pi tools and commands for:
- streaming speech-to-text (STT) with Sarvam Saaras (saaras:v3) over WebSockets
- text-to-speech (TTS) with Sarvam Bulbul (bulbul:v3)
- voice-first clarification loops where the agent speaks a question, listens, transcribes, and continues
- interactive TUI and headless/RPC usage through Pi extension tools and UI fallback
Quick start
pi install npm:@p8n.ai/pi-listens
pi
Inside Pi, run /voice-init to create a global settings file with sensible defaults:
/voice-init
Then open ~/.pi/pi-listens.json and replace the provider credential placeholder with your Sarvam AI API key.
Alternatively, set the provider and key via environment variables:
export PI_LISTENS_PROVIDER="sarvam"
export PI_LISTENS_SARVAM_API_KEY="your-sarvam-api-key"
# legacy aliases still work: SARVAM_API_KEY or SARVAM_API_SUBSCRIPTION_KEY
For local development from this checkout:
npm install
npm run typecheck
pi -e /Users/ravindrabarthwal/Projects/pi-listens
System requirements
Voice provider credentials
The bundled provider is Sarvam AI. Set one of:
export PI_LISTENS_SARVAM_API_KEY="..."
# or legacy aliases
export SARVAM_API_KEY="..."
export SARVAM_API_SUBSCRIPTION_KEY="..."
Sarvam's SDK uses the api-subscription-key auth model internally; this package uses the official sarvamai npm package inside the Sarvam provider module.
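The lookup described above can be sketched as a small function. This is a minimal illustration, not the package's actual code: the helper name resolveSarvamApiKey is hypothetical, and the relative order of the two legacy aliases is an assumption (the README only says that PI_LISTENS_SARVAM_API_KEY is preferred).

```typescript
// Hypothetical helper: pick the Sarvam API key from an environment-like map.
// PI_LISTENS_SARVAM_API_KEY being preferred comes from the README; the order
// of the two legacy aliases relative to each other is an assumption.
type Env = Record<string, string | undefined>;

const SARVAM_KEY_VARS = [
  "PI_LISTENS_SARVAM_API_KEY", // preferred
  "SARVAM_API_KEY", // legacy alias
  "SARVAM_API_SUBSCRIPTION_KEY", // legacy alias
] as const;

function resolveSarvamApiKey(env: Env): string | undefined {
  for (const name of SARVAM_KEY_VARS) {
    const value = env[name]?.trim();
    if (value) return value; // skip unset and blank values
  }
  return undefined; // no credential configured
}
```

In a Node process this would typically be called as resolveSarvamApiKey(process.env).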
Local microphone recorder and audio player
pi-listens records from the local microphone and plays audio locally.
Auto-detected recorders:
- rec from SoX (recommended)
- ffmpeg (avfoundation on macOS, alsa on Linux)
Auto-detected players:
- afplay on macOS
- play from SoX
- ffplay
- aplay
You can override capture/playback with command templates:
export PI_LISTENS_RECORD_COMMAND='rec -q -r {sampleRate} -c 1 -b 16 {path} trim 0 {seconds}'
export PI_LISTENS_STREAM_COMMAND='rec -q -r {sampleRate} -c 1 -b 16 -e signed-integer -t raw -'
export PI_LISTENS_PLAY_COMMAND='afplay {path}'
Template variables are shell-quoted automatically. Recording templates support {path}, {seconds}, and {sampleRate}. Streaming templates write 16-bit mono PCM to stdout and support {sampleRate}.
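To make the template substitution concrete, here is a rough sketch of how placeholders could be expanded with shell quoting. The function names shellQuote and renderCommand are hypothetical, and POSIX single-quote escaping is an assumption; the README states only that variables are quoted automatically, not how.

```typescript
// POSIX single-quote escaping: close the quote, emit an escaped quote, reopen.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Hypothetical renderer: replaces {name} placeholders with quoted values.
// Placeholders that have no matching variable are left untouched.
function renderCommand(
  template: string,
  vars: Record<string, string | number>,
): string {
  return template.replace(/\{(\w+)\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name)
      ? shellQuote(String(vars[name]))
      : match,
  );
}

const playCommand = renderCommand("afplay {path}", { path: "/tmp/my clip.wav" });
// playCommand is: afplay '/tmp/my clip.wav'
```

Quoting every value, including numbers, keeps the renderer simple; a shell treats '16000' and 16000 identically.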
Agent tools
The package registers these tools for Pi's agent:
| Tool | Purpose |
| --- | --- |
| voice_output | Speak short user-facing text via the configured TTS provider and local playback. |
| voice_input | Stream microphone audio to the configured STT provider. |
| voice_ask | Speak a concise question, then listen and transcribe the user's answer. |
| voice_transcribe_file | Transcribe an existing audio file. |
| voice_setup_check | Check provider credentials, recorder, player, and model configuration. |
The extension also injects voice guidance into the system prompt:
- use voice_ask whenever user input is needed in voice-first sessions
- use voice_output only for short spoken status or response snippets
- keep spoken replies to 1-2 short sentences with no headings, hashtags, bullet lists, boilerplate recaps, or full task summaries
- do not speak code blocks, logs, diffs, stack traces, or long explanations
Commands
| Command | Purpose |
| --- | --- |
| /voice-init | Create a global settings file at ~/.pi/pi-listens.json with sensible defaults. Use --overwrite to replace an existing file. |
| /speak <text> | Speak text with the configured TTS provider. |
| /voice-on [--manual] [--no-listen] [seconds] | Start the hands-free voice loop. Auto-listens for the next instruction after each agent turn. --manual disables auto-listen (press Space to listen). |
| /voice-check | Show setup diagnostics and voice-mode status. |
| /voice-chatty | Toggle conversational mode. When on, the agent speaks its responses and thinks out loud. |
Voice panel controls in interactive mode:
- Space: listen now; press again while listening to stop; if Pi is speaking, stops playback first
- A: toggle auto-listen (listen again after each assistant reply)
- Q: close the panel and stop any active listening or speaking
- Click the character: visual sparkle feedback (terminals with mouse reporting)
The character animates to reflect the current state:
| State | Character Color | Pose | Status Bar |
| --- | --- | --- | --- |
| Idle | Teal | Calm standing pose with a subtle blink | voice on |
| Listening | Blue | Alert eyes, ear pose, and incoming wave lines | listening… |
| Speaking | Pink/Magenta | Open-mouth talking frames with music/sound waves | speaking… |
| Agent working | Purple | Focused face with a small terminal/laptop panel | agent working |
| Error | Red | Concerned face with alert marks | Shows error message |
The current implementation uses ANSI/Unicode sprite frames so it works in ordinary terminals. Pi's TUI also has an Image component for Kitty, iTerm2, Ghostty, and WezTerm, so future character packs can experiment with PNG sprites where the terminal supports inline images.
Headless/RPC behavior
Pi extension tools work in interactive TUI and headless/RPC modes.
- The audio capture/playback still happens on the machine running Pi.
- When speech is not recognized, voice_input and voice_ask use Pi's extension UI text fallback if UI is available.
- In RPC mode that fallback becomes an extension_ui_request (input) event, so a client can provide textual input.
- In print/JSON modes, UI fallback is unavailable; the tool returns the empty transcription so the agent can recover.
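The fallback rules above amount to a small decision table. The sketch below encodes them under stated assumptions: the PiMode and EmptyTranscriptAction types and the onEmptyTranscript function are hypothetical names, and treating textFallback: false as "always return the empty transcript" is a guess about what that setting does.

```typescript
type PiMode = "interactive" | "rpc" | "print" | "json";

type EmptyTranscriptAction =
  | { kind: "prompt-in-tui" }
  | { kind: "emit-event"; event: "extension_ui_request" }
  | { kind: "return-empty"; transcript: "" };

// Hypothetical decision function for an unrecognized utterance.
function onEmptyTranscript(mode: PiMode, textFallback: boolean): EmptyTranscriptAction {
  if (textFallback && mode === "interactive") {
    return { kind: "prompt-in-tui" }; // ask for typed input in the TUI
  }
  if (textFallback && mode === "rpc") {
    // the RPC client receives an input request and can answer with text
    return { kind: "emit-event", event: "extension_ui_request" };
  }
  // print/JSON modes have no UI, so the agent gets the empty transcript
  return { kind: "return-empty", transcript: "" };
}
```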
Configuration
Configuration is resolved in this order, with later entries overriding earlier ones:
- defaults
- ~/.pi/agent/pi-listens.json (legacy global path, still supported)
- ~/.pi/pi-listens.json (global user config)
- <project>/.pi/pi-listens.json (project config)
- environment variables
Project config overrides global config, and environment variables override both.
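The resolution order can be illustrated as a layered merge in which later layers win key by key. This is a sketch only: resolveConfig is a hypothetical name, a shallow per-key merge is an assumption, and "project-voice" is a made-up speaker value used to show the override.

```typescript
type VoiceConfig = Record<string, string | number | boolean>;

// Later layers win, key by key; undefined layers (missing files) are skipped.
function resolveConfig(
  ...layers: Array<Partial<VoiceConfig> | undefined>
): VoiceConfig {
  const merged: VoiceConfig = {};
  for (const layer of layers) {
    if (!layer) continue;
    for (const [key, value] of Object.entries(layer)) {
      if (value !== undefined) merged[key] = value;
    }
  }
  return merged;
}

const resolved = resolveConfig(
  { provider: "sarvam", ttsSpeaker: "shubh", recordSeconds: 300 }, // defaults
  undefined, // ~/.pi/agent/pi-listens.json (absent in this example)
  { recordSeconds: 120 }, // ~/.pi/pi-listens.json
  { ttsSpeaker: "project-voice" }, // <project>/.pi/pi-listens.json
  { recordSeconds: 60 }, // environment variables
);
// resolved: provider "sarvam", ttsSpeaker "project-voice", recordSeconds 60
```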
Example config file:
{
"provider": "sarvam",
"sarvamApiKey": "paste-your-sarvam-api-key-here",
"sttModel": "saaras:v3",
"sttMode": "transcribe",
"sttLanguageCode": "unknown",
"translateInputToEnglish": true,
"ttsModel": "bulbul:v3",
"ttsLanguageCode": "en-IN",
"ttsSpeaker": "shubh",
"recordSeconds": 300,
"recordSampleRate": 16000,
"streamChunkMs": 250,
"streamMaxSeconds": 300,
"silenceStartSeconds": 0.2,
"silenceStopSeconds": 3.5,
"silenceThreshold": "1%",
"ttsSampleRate": 24000,
"ttsOutputCodec": "wav",
"textFallback": true,
"conversational": false
}
Supported environment variables:
- PI_LISTENS_PROVIDER (sarvam by default)
- PI_LISTENS_SARVAM_API_KEY (preferred for the bundled Sarvam provider)
- SARVAM_API_KEY / SARVAM_API_SUBSCRIPTION_KEY (legacy aliases for the Sarvam provider key)
- PI_LISTENS_STT_MODEL
- PI_LISTENS_STT_MODE (transcribe, translate, verbatim, translit, codemix)
- PI_LISTENS_STT_LANGUAGE (default unknown)
- PI_LISTENS_TRANSLATE_INPUT_TO_ENGLISH (default true; speak any supported language, send English to the agent)
- PI_LISTENS_TTS_MODEL
- PI_LISTENS_TTS_LANGUAGE (default en-IN)
- PI_LISTENS_TTS_SPEAKER (default shubh)
- PI_LISTENS_TTS_PACE
- PI_LISTENS_TTS_TEMPERATURE
- PI_LISTENS_TTS_SAMPLE_RATE
- PI_LISTENS_TTS_OUTPUT_CODEC (wav, mp3, linear16, mulaw, alaw, opus, flac, aac)
- PI_LISTENS_RECORD_SECONDS (default 300; maximum listen duration for one streaming utterance)
- PI_LISTENS_RECORD_SAMPLE_RATE (default 16000; Sarvam streaming works best with 16kHz mono PCM)
- PI_LISTENS_STREAM_CHUNK_MS (default 250; outgoing WebSocket audio chunk size)
- PI_LISTENS_STREAM_MAX_SECONDS (default 300; default maximum for streaming microphone capture)
- PI_LISTENS_SILENCE_START_SECONDS
- PI_LISTENS_SILENCE_STOP_SECONDS
- PI_LISTENS_SILENCE_THRESHOLD
recordSeconds is the maximum time Pi will keep streaming one utterance. silenceStopSeconds is the quiet pause after which it considers the utterance complete, flushes the WebSocket, and submits the transcript. For example, recordSeconds: 300 and silenceStopSeconds: 3.5 means “let me speak for up to 5 minutes, but submit after 3.5 seconds of silence.”
- PI_LISTENS_RECORD_COMMAND
- PI_LISTENS_PLAY_COMMAND
- PI_LISTENS_AUDIO_DIR
- PI_LISTENS_DELETE_AUDIO
- PI_LISTENS_TEXT_FALLBACK
- PI_LISTENS_CONVERSATIONAL (default false; when true, the agent speaks its responses conversationally)
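The interplay between recordSeconds and silenceStopSeconds described above can be modeled with a toy stop-time calculation. This is an illustrative sketch, not the package's detector: the names UtteranceLimits and stopTime are hypothetical, the threshold is assumed to be a fraction of full scale (0.01 for "1%"), silence is assumed to count only after speech has been heard, and silenceStartSeconds is ignored. The real implementation may well delegate silence detection to the recorder.

```typescript
interface UtteranceLimits {
  recordSeconds: number; // hard cap on one utterance
  silenceStopSeconds: number; // trailing quiet that ends the utterance
  silenceThreshold: number; // fraction of full scale, e.g. 0.01 for "1%"
}

// Returns the elapsed seconds at which capture would stop, given the peak
// level (0..1) of each consecutive audio chunk.
function stopTime(
  levels: number[],
  chunkSeconds: number,
  limits: UtteranceLimits,
): number {
  let elapsed = 0;
  let quiet = 0;
  let heardSpeech = false;
  for (const level of levels) {
    elapsed += chunkSeconds;
    if (level > limits.silenceThreshold) {
      heardSpeech = true;
      quiet = 0; // any sound resets the trailing-silence timer
    } else if (heardSpeech) {
      quiet += chunkSeconds;
    }
    if (heardSpeech && quiet >= limits.silenceStopSeconds) return elapsed;
    if (elapsed >= limits.recordSeconds) return limits.recordSeconds;
  }
  return elapsed; // input ended before either limit was reached
}

const limits: UtteranceLimits = {
  recordSeconds: 300,
  silenceStopSeconds: 3.5,
  silenceThreshold: 0.01,
};
// 2 s of speech followed by silence, in 250 ms chunks
const levels = new Array<number>(8).fill(0.5).concat(new Array<number>(40).fill(0));
const stoppedAt = stopTime(levels, 0.25, limits); // 5.5 (2 s speech + 3.5 s quiet)
```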
Provider architecture
The extension code talks to a generic VoiceProvider interface for microphone transcription, file transcription, file synthesis, and streamed synthesis. Sarvam-specific WebSocket and SDK logic lives in src/providers/sarvam.ts; provider selection and routing live in src/providers/index.ts. To add another STT/TTS provider later, add a provider module that implements the interface and register it in the provider router.
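As a rough picture of that architecture, the sketch below shows one possible shape for the interface and the router. Everything here is hypothetical: the method names, signatures, and the registerProvider/selectProvider helpers are assumptions chosen to match the four capabilities listed above, and the real code in src/providers may look different.

```typescript
// Hypothetical shape; the real VoiceProvider interface may differ.
interface VoiceProvider {
  readonly name: string;
  transcribeMicrophone(pcm: AsyncIterable<Uint8Array>): Promise<string>;
  transcribeFile(path: string): Promise<string>;
  synthesizeToFile(text: string, outPath: string): Promise<void>;
  synthesizeStream(text: string): AsyncIterable<Uint8Array>;
}

// Hypothetical router: maps a provider name to a factory.
const providers = new Map<string, () => VoiceProvider>();

function registerProvider(name: string, factory: () => VoiceProvider): void {
  providers.set(name, factory);
}

function selectProvider(name: string): VoiceProvider {
  const factory = providers.get(name);
  if (!factory) throw new Error(`Unknown voice provider: ${name}`);
  return factory();
}

// A stand-in provider with canned results, for illustration only.
registerProvider("echo", () => ({
  name: "echo",
  async transcribeMicrophone() {
    return "hello";
  },
  async transcribeFile(path) {
    return `transcript of ${path}`;
  },
  async synthesizeToFile() {
    // a real provider would write audio to outPath here
  },
  async *synthesizeStream(text) {
    yield new Uint8Array(text.length); // placeholder bytes, not real audio
  },
}));
```

With a layout like this, adding a provider means writing one module and one registerProvider call, while the tools and commands keep calling selectProvider with the configured name.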
Notes
- The bundled Sarvam provider uses the WebSocket streaming API for microphone input, not the 30-second synchronous REST endpoint.
- Streaming input is sent as 16kHz, 16-bit, mono PCM (pcm_s16le) with saaras:v3 by default.
- macOS may ask for microphone permissions the first time rec or ffmpeg records audio.
- Spoken output is intentionally optimized for concise interaction, not for reading code or full agent responses.
- When conversational mode is enabled, the agent speaks most of its responses, thinks out loud, and uses voice_ask for all clarification. Toggle at runtime with /voice-chatty.
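For a sense of scale, the default streaming format works out to small, fixed-size chunks. The helper below (pcmChunkBytes is a hypothetical name) computes raw PCM sizes from the defaults listed in the configuration section; any encoding or framing the provider adds on the wire is not accounted for.

```typescript
// Bytes of raw PCM covering chunkMs milliseconds of audio.
function pcmChunkBytes(
  sampleRate: number,
  bitsPerSample: number,
  channels: number,
  chunkMs: number,
): number {
  const bytesPerSecond = sampleRate * (bitsPerSample / 8) * channels;
  return Math.round((bytesPerSecond * chunkMs) / 1000);
}

// Defaults: 16 kHz, 16-bit, mono, 250 ms chunks, 300 s maximum.
const chunkBytes = pcmChunkBytes(16000, 16, 1, 250); // 8000 bytes per chunk
const maxBytes = pcmChunkBytes(16000, 16, 1, 300 * 1000); // 9600000 bytes in 300 s
```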
