@darkhorseprojects/pi-vo
v0.2.2
Local low-resource voice extension for Pi featuring sub-6GB VRAM TTS/ASR with 4-bit quantization.
pi-vo
Local voice extension for Pi featuring resident ASR/TTS workers for low-latency speech-to-text and text-to-speech.
Features
- Live Speech-to-Text: Cohere-ASR transcribes microphone input in real time
- Text-to-Speech: OmniVoice generates expressive speech with voice design
- Low latency: Models stay loaded as persistent workers, so no request pays a model-loading cost
- Low VRAM: 4-bit quantization with CPU offload reduces VRAM to ~6GB
- Configurable: Voice personality via voice design parameters
Usage
/v               Warm models and toggle live microphone
ctrl+space       Same as /v (toggle recording)
/v 0-100         Set volume (0-100%)
/v stop          Stop all recording, playback, and workers
/v unload        Stop models and hide the orb
/v stt [model]   Show or switch ASR model
/v tts           Show TTS backend settings
/v i [n]         List/select input devices
/v o [n]         List/select output devices
esc              Stop current speech/playback
enter            Clear transcript preview
Installation
npm install
npm run typecheck
scripts/install.sh
The installer creates .venv-voice, installs PyTorch, OmniVoice, Cohere-ASR, and bitsandbytes, and writes ~/.pi/pi-vo.json.
Voice Cloning
For voice cloning, provide a reference audio file and its transcript. The audio should be 15-30 seconds of expressive speech that matches the style you want:
{
"ttsReferenceAudio": "/path/to/voice-sample.wav",
"ttsVoiceDesign": "female, young adult, moderate pitch"
}
The reference audio must be 16kHz PCM WAV.
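Before pointing ttsReferenceAudio at a file, it can help to confirm that the file meets the constraints above. The following is a minimal sketch using only Python's standard wave module. The function name check_reference_wav is hypothetical and not part of pi-vo. It treats the 16 kHz rate as a hard requirement and the 15-30 second length as a guideline, and it does not check bit depth or channel count because the README does not specify them.

```python
import wave


def check_reference_wav(path, min_seconds=15.0, max_seconds=30.0):
    """Return a list of problems found in a reference WAV; an empty list means OK.

    Hypothetical helper, not part of pi-vo.
    """
    try:
        # wave.open raises wave.Error for malformed or unsupported (non-PCM) data
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if rate != 16000:
        problems.append(f"sample rate is {rate} Hz, expected 16000 Hz")

    duration = frames / rate if rate else 0.0
    if not (min_seconds <= duration <= max_seconds):
        problems.append(
            f"duration is {duration:.1f} s, outside the suggested "
            f"{min_seconds:g}-{max_seconds:g} s range"
        )
    return problems
```

Calling check_reference_wav("/path/to/voice-sample.wav") returns an empty list when the file passes both checks, and otherwise one message per problem.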
Voice Style Guidance
Use ttsVoiceDesign to guide the delivery style (emotional tone, pace, character). Valid attributes are: gender, age, pitch, style, and accent. Separate attributes with a comma and a space:
{
"ttsVoiceDesign": "female, young adult, high pitch, american accent"
}
This parameter works independently or alongside voice cloning to shape how text is delivered.
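If the voice design string is generated by a script rather than typed by hand, a small helper can keep the separator consistent. This is a sketch only. The function build_voice_design is hypothetical, the attribute names come from the list above, and the output order (gender, age, pitch, style, accent) is an assumption taken from the examples, not a documented requirement.

```python
import json

# Attribute names are from the README; the ordering is an assumption
# based on the README's examples, not a documented rule.
ATTRIBUTE_ORDER = ("gender", "age", "pitch", "style", "accent")


def build_voice_design(**attributes):
    """Join the given attribute values with ', ' in a fixed order.

    Hypothetical helper, not part of pi-vo.
    """
    unknown = set(attributes) - set(ATTRIBUTE_ORDER)
    if unknown:
        raise ValueError(f"unknown voice design attributes: {sorted(unknown)}")
    parts = [
        attributes[name].strip()
        for name in ATTRIBUTE_ORDER
        if attributes.get(name, "").strip()
    ]
    return ", ".join(parts)


design = build_voice_design(
    gender="female",
    age="young adult",
    pitch="high pitch",
    accent="american accent",
)
# Emit a fragment that can be merged into ~/.pi/pi-vo.json by hand
print(json.dumps({"ttsVoiceDesign": design}, indent=2))
```

Attributes that are omitted or empty are skipped, so the same helper covers partial designs such as gender and pitch only.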
Configuration
Default config at ~/.pi/pi-vo.json:
{
"voicePython": "/path/to/pi-vo/.venv-voice/bin/python",
"asrModel": "cstr/cohere-transcribe-onnx-int4",
"asrDeviceMap": "cuda:0",
"asrDtype": "float32",
"ttsModel": "k2-fsa/OmniVoice",
"ttsDeviceMap": "cuda:0",
"ttsDtype": "bfloat16",
"ttsLoadIn4bit": true,
"ttsQuantType": "nf4",
"ttsComputeDtype": "bfloat16",
"ttsCpuOffload": true,
"ttsOffloadFolder": "~/.pi/pi-vo-offload",
"ttsReferenceAudio": "",
"ttsVoiceDesign": "",
"ttsNumSteps": 32,
"voiceSpeed": 1.15,
"voiceVolume": 0.85,
"recordSampleRate": 16000,
"audioSampleRate": 24000
}
Environment Variables
PI_VO_VOICE_PYTHON=/path/to/.venv-voice/bin/python
PI_VO_ASR_MODEL=cstr/cohere-transcribe-onnx-int4
PI_VO_ASR_DEVICE_MAP=cuda:0
PI_VO_ASR_DTYPE=float32
PI_VO_TTS_MODEL=k2-fsa/OmniVoice
PI_VO_TTS_DEVICE_MAP=cuda:0
PI_VO_TTS_DTYPE=bfloat16
PI_VO_TTS_LOAD_IN_4BIT=1
PI_VO_TTS_QUANT_TYPE=nf4
PI_VO_TTS_COMPUTE_DTYPE=bfloat16
PI_VO_TTS_CPU_OFFLOAD=1
Requirements
- Node.js 20+
- Python 3.12+
- CUDA-compatible GPU
- PipeWire or PulseAudio
- ~6GB VRAM (with 4-bit quantization and CPU offload)
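The requirements above can be roughly checked before running the installer. The sketch below is not part of pi-vo, and the names parse_node_major and check_requirements are hypothetical. It rests on several assumptions: nvidia-smi on PATH is used as a stand-in for a working CUDA driver, pw-cli and pactl are used as stand-ins for PipeWire and PulseAudio, and the Python version checked is that of the interpreter running the script, which may differ from the one in .venv-voice. It does not measure available VRAM.

```python
import re
import shutil
import sys


def parse_node_major(version_text):
    """Extract the major version from `node --version` output such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    return int(match.group(1)) if match else None


def check_requirements(node_version_text,
                       python_version=sys.version_info[:2],
                       which=shutil.which):
    """Return a list of unmet requirements; an empty list means all checks passed.

    Hypothetical helper. `which` is injectable so the check can be exercised
    without the real tools installed.
    """
    problems = []

    major = parse_node_major(node_version_text)
    if major is None or major < 20:
        problems.append("Node.js 20+ is required")

    if tuple(python_version) < (3, 12):
        problems.append("Python 3.12+ is required")

    # Heuristic: nvidia-smi ships with the NVIDIA driver
    if which("nvidia-smi") is None:
        problems.append("nvidia-smi not found; a CUDA-compatible GPU is required")

    # Either audio stack is acceptable
    if which("pw-cli") is None and which("pactl") is None:
        problems.append("neither pw-cli (PipeWire) nor pactl (PulseAudio) was found")

    return problems
```

In practice the first argument would be the output of `node --version`, for example obtained with subprocess.run(["node", "--version"], capture_output=True, text=True).stdout.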
Inspired by p8n-ai/pi-listens
License
Apache 2.0 - see LICENSE file for details.
