@darkhorseprojects/pi-vo

v0.2.2

Published

2 months ago

Local low-resource voice extension for Pi featuring sub-6GB VRAM TTS/ASR with 4-bit quantization.

pi voice omnivoice qwen3-asr tts asr

pi-vo

Local voice extension for Pi featuring resident ASR/TTS workers for low-latency speech-to-text and text-to-speech.

Features

Live Speech-to-Text: Cohere-ASR transcribes microphone input in real time
Text-to-Speech: OmniVoice generates expressive speech with voice design
Low latency: Models run as persistent workers, so no model is loaded per request
Low VRAM: 4-bit quantization with CPU offload reduces VRAM use to ~6GB
Configurable: Voice personality is set through voice design parameters

Usage

/v              Warm models and toggle live microphone
ctrl+space      Same as /v (toggle recording)
/v 0-100        Set volume (0-100%)
/v stop         Stop all recording, playback, and workers
/v unload       Stop models and hide the orb
/v stt [model]  Show or switch ASR model
/v tts          Show TTS backend settings
/v i [n]        List/select input devices
/v o [n]        List/select output devices
esc             Stop current speech/playback
enter           Clear transcript preview

Installation

npm install
npm run typecheck
scripts/install.sh

The installer creates .venv-voice, installs PyTorch, OmniVoice, Cohere-ASR, and bitsandbytes, and writes ~/.pi/pi-vo.json.

Voice Cloning

For voice cloning, provide a reference audio file and its transcript. The audio should be 15-30 seconds of expressive speech that matches the style you want:

{
  "ttsReferenceAudio": "/path/to/voice-sample.wav",
  "ttsVoiceDesign": "female, young adult, moderate pitch"
}

The reference audio must be a 16 kHz PCM WAV file.
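Before pointing ttsReferenceAudio at a file, it can help to confirm that the file meets the constraints above. The following is a minimal sketch using only Python's standard wave module. The function name check_reference_audio is hypothetical and not part of pi-vo, and treating the 15-30 second guideline as a hard limit is an assumption made for illustration.

```python
import wave


def check_reference_audio(path: str,
                          min_seconds: float = 15.0,
                          max_seconds: float = 30.0) -> list[str]:
    """Return a list of problems; an empty list means the file looks usable.

    Hypothetical helper, not part of pi-vo. The standard `wave` module only
    opens uncompressed PCM WAV files, so a failure to open is reported as
    "not a PCM WAV file".
    """
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
    except (wave.Error, EOFError):
        return ["not a PCM WAV file"]

    problems = []
    if rate != 16000:
        problems.append(f"sample rate is {rate} Hz, expected 16000 Hz")

    duration = frames / rate if rate else 0.0
    if not (min_seconds <= duration <= max_seconds):
        problems.append(
            f"duration is {duration:.1f} s, expected "
            f"{min_seconds:.0f}-{max_seconds:.0f} s"
        )
    return problems
```

Calling check_reference_audio("/path/to/voice-sample.wav") returns an empty list when the sample rate and duration both match, and one message per failed check otherwise.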

Voice Style Guidance

Use ttsVoiceDesign to guide the delivery style (emotional tone, pace, character). Valid attributes are gender, age, pitch, style, and accent. Separate attributes with a comma followed by a space:

{
  "ttsVoiceDesign": "female, young adult, high pitch, american accent"
}

This parameter works independently or alongside voice cloning to shape how text is delivered.
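The comma-and-space format can be assembled programmatically. The sketch below is illustrative only: build_voice_design is a hypothetical helper, the attribute values are free text copied from the examples in this README, and the fixed output order (gender, age, pitch, style, accent) is an assumption, since the README does not say whether order matters.

```python
VALID_ATTRIBUTES = ("gender", "age", "pitch", "style", "accent")


def build_voice_design(**attributes: str) -> str:
    """Join voice design attributes into a ttsVoiceDesign string.

    Hypothetical helper, not part of pi-vo. Unknown attribute names are
    rejected; empty or missing attributes are skipped.
    """
    unknown = sorted(set(attributes) - set(VALID_ATTRIBUTES))
    if unknown:
        raise ValueError(f"unknown voice design attributes: {unknown}")
    # Assumption: emit attributes in the order the README lists them.
    return ", ".join(
        attributes[name] for name in VALID_ATTRIBUTES if attributes.get(name)
    )


design = build_voice_design(
    gender="female",
    age="young adult",
    pitch="high pitch",
    accent="american accent",
)
print(design)  # female, young adult, high pitch, american accent
```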

Configuration

Default config at ~/.pi/pi-vo.json:

{
  "voicePython": "/path/to/pi-vo/.venv-voice/bin/python",
  "asrModel": "cstr/cohere-transcribe-onnx-int4",
  "asrDeviceMap": "cuda:0",
  "asrDtype": "float32",
  "ttsModel": "k2-fsa/OmniVoice",
  "ttsDeviceMap": "cuda:0",
  "ttsDtype": "bfloat16",
  "ttsLoadIn4bit": true,
  "ttsQuantType": "nf4",
  "ttsComputeDtype": "bfloat16",
  "ttsCpuOffload": true,
  "ttsOffloadFolder": "~/.pi/pi-vo-offload",
  "ttsReferenceAudio": "",
  "ttsVoiceDesign": "",
  "ttsNumSteps": 32,
  "voiceSpeed": 1.15,
  "voiceVolume": 0.85,
  "recordSampleRate": 16000,
  "audioSampleRate": 24000
}

Environment Variables

PI_VO_VOICE_PYTHON=/path/to/.venv-voice/bin/python
PI_VO_ASR_MODEL=cstr/cohere-transcribe-onnx-int4
PI_VO_ASR_DEVICE_MAP=cuda:0
PI_VO_ASR_DTYPE=float32
PI_VO_TTS_MODEL=k2-fsa/OmniVoice
PI_VO_TTS_DEVICE_MAP=cuda:0
PI_VO_TTS_DTYPE=bfloat16
PI_VO_TTS_LOAD_IN_4BIT=1
PI_VO_TTS_QUANT_TYPE=nf4
PI_VO_TTS_COMPUTE_DTYPE=bfloat16
PI_VO_TTS_CPU_OFFLOAD=1
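
To show how the config file and the environment variables relate, here is a minimal sketch that loads ~/.pi/pi-vo.json and applies the variables listed above. It is not pi-vo's actual loader. Two points are assumptions that the README does not state: that environment variables take precedence over the file, and that "1" means true while any other value means false. The function name load_config is hypothetical.

```python
import json
import os
from pathlib import Path

# Environment variable -> config key, taken from the lists in this README.
ENV_TO_KEY = {
    "PI_VO_VOICE_PYTHON": "voicePython",
    "PI_VO_ASR_MODEL": "asrModel",
    "PI_VO_ASR_DEVICE_MAP": "asrDeviceMap",
    "PI_VO_ASR_DTYPE": "asrDtype",
    "PI_VO_TTS_MODEL": "ttsModel",
    "PI_VO_TTS_DEVICE_MAP": "ttsDeviceMap",
    "PI_VO_TTS_DTYPE": "ttsDtype",
    "PI_VO_TTS_LOAD_IN_4BIT": "ttsLoadIn4bit",
    "PI_VO_TTS_QUANT_TYPE": "ttsQuantType",
    "PI_VO_TTS_COMPUTE_DTYPE": "ttsComputeDtype",
    "PI_VO_TTS_CPU_OFFLOAD": "ttsCpuOffload",
}
BOOLEAN_KEYS = {"ttsLoadIn4bit", "ttsCpuOffload"}


def load_config(path="~/.pi/pi-vo.json", env=None) -> dict:
    """Read the JSON config, then overlay any PI_VO_* variables that are set.

    Hypothetical helper. Assumption: the environment wins over the file.
    """
    env = os.environ if env is None else env
    config_path = Path(path).expanduser()
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))

    for variable, key in ENV_TO_KEY.items():
        if variable in env:
            value = env[variable]
            # Assumption: "1" is true, anything else is false.
            config[key] = (value == "1") if key in BOOLEAN_KEYS else value
    return config
```

Passing env explicitly, for example load_config(env={"PI_VO_TTS_CPU_OFFLOAD": "0"}), makes the overlay easy to try without changing the shell environment.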

Requirements

Node.js 20+
Python 3.12+
CUDA-compatible GPU
PipeWire or PulseAudio
~6GB VRAM (with 4-bit quantization and CPU offload)

Inspired by p8n-ai/pi-listens

License

Apache 2.0 - see LICENSE file for details.
