@drakulavich/kesha-voice-kit
v1.23.0
Published
Give your tools a voice — speech to text and back, 25 languages, up to ~19× faster than Whisper. On your machine.
Maintainers
Readme
- Transcribe locally — 25 languages, up to ~19x faster than Whisper on Apple Silicon, ~2.5x on CPU
- Speak back — text-to-speech in 9 languages
- Plug into agents — ship voice workflows as CLI commands, an MCP server, an OpenClaw skill, or a Hermes agent
- Small Rust engine — single ~20MB binary, no ffmpeg, no Python, no native Node addons
Quick Start
Runtime: Bun >= 1.3.0 · Platforms: macOS arm64, Linux x64, Windows x64.
# 1. Install Bun (skip if you have it) — Linux & macOS:
curl -fsSL https://bun.sh/install | bash # or: brew install oven-sh/bun/bun
# Windows: powershell -c "irm bun.sh/install.ps1 | iex"
# 2. Install Kesha:
bun add -g @drakulavich/kesha-voice-kit
kesha install # downloads engine + models (explicit — never automatic)
# 3. Transcribe:
kesha audio.ogg # transcript to stdoutPrefer Homebrew, .deb/.rpm, Docker, or Nix? See Other install methods.
Air-gapped or behind a corporate mirror? See docs/model-mirror.md.
Speech-to-text
kesha audio.ogg # transcribe (plain text)
kesha --format transcript audio.ogg # text + language/confidence
kesha --format json audio.ogg # full JSON with lang fields
kesha --json --timestamps audio.ogg # JSON with timestamped segments
kesha --toon audio.ogg # compact LLM-friendly TOON
kesha status # show installed backend infoMultiple files get head-style headers; stdout is the transcript, stderr is errors — pipe-friendly:
$ kesha freedom.ogg tahiti.ogg
=== freedom.ogg ===
Свободу попугаям! Свободу!
=== tahiti.ogg ===
Таити, Таити! Не были мы ни в какой Таити! Нас и тут неплохо кормят.- Long / silence-heavy audio: install VAD (
kesha install --vad); Kesha auto-uses it past 120 s. Without VAD, long audio falls back to fixed ASR chunks. See docs/vad.md. - Speaker diarization (darwin-arm64):
kesha install --diarize, thenkesha --json --vad --speakers meeting.m4astamps each segment with aspeakerid. Linux/Windows return a clear "darwin-arm64 only" error (#199).
Text-to-speech
Kesha speaks back in 9 languages, auto-picking the voice from the text's language. Override with --lang <code> or --voice <id>.
kesha install --tts # opt-in models (~990MB)
kesha say "Hello, world" > hello.wav
kesha say "Привет, мир" > privet.wav # auto-routes by language
kesha say --voice ru-vosk-m02 "Голос в текст." > ru.wavOutput formats (--format, or inferred from the --out extension):
kesha say "Hello" --out hi.wav # WAV (default, uncompressed)
kesha say "Hello" --format ogg-opus --out hi.ogg # OGG/Opus — messenger voice notes
kesha say "Hello" --format flac --out hi.flac # FLAC — lossless, plays in every browser incl. Safari/iOSkesha say --list-voices lists what's installed. Voices, the full catalogue, macOS system voices, SSML, speaking rate (--rate, <prosody>), Russian word stress, and Russian/English abbreviation handling are all in docs/tts.md.
Languages
Speech-to-text spans 25 languages and text-to-speech covers English, Russian, and select multilingual voices — full tables with codes and flags in docs/languages.md. Audio language detection identifies 107 languages.
Performance
Up to ~19x faster than Whisper on Apple Silicon (M2), ~2.5x faster on CPU
Compared against Whisper large-v3-turbo, all engines auto-detecting language:
Full per-file breakdown (Russian + English): BENCHMARK.md.
Other install methods
All of these install the Bun CLI wrapper; engine + models still download explicitly via kesha install.
- Homebrew —
brew install drakulavich/tap/kesha-voice-kit· docs/homebrew.md - Linux packages (
.deb/.rpm, x64) — docs/linux-packages.md - Docker (GHCR image) — docs/docker.md
- Nix (
aarch64-darwin/x86_64-linux) —nix run github:drakulavich/kesha-voice-kit -- install· docs/nix-install.md - Shell completions + manpage —
kesha completions bash|zsh|fishandkesha manpageprint the packaged files to install wherever your shell expects them.
Integrations
- MCP server —
kesha mcpexposes transcribe/synthesize/list tools to any MCP client (Claude, Cursor, Codex, Gemini). Setup: docs/mcp.md. - OpenClaw — give your LLM agent ears. Install & config: docs/openclaw.md.
- Hermes Agent — local STT/TTS through Hermes command providers. Setup: docs/hermes.md.
- Raycast (macOS) — transcribe selected audio & speak the clipboard from the launcher. Source + install:
raycast/. - Programmatic API —
@drakulavich/kesha-voice-kit/corefor use inside a Bun program. See docs/api.md.
More
- Architecture — runtime data flow, the models that ship, the CLI ↔ Rust engine boundary, model pinning, and where tests live.
- Use cases — copy-paste recipes (transcribe a meeting, speak from OpenClaw, run offline, move the cache).
- Product positioning — supported workflows, non-goals, maturity labels, platform matrix.
- Diagnostics:
kesha doctor,kesha support-bundle(redacted.tar.gzfor issues), andkesha logsproduce local, content-free diagnostics — see docs/diagnostic-logs.md. Every failure prints a stableerror [CODE]: …line (docs/errors.md). - Privacy / Local Stats: Stats are off by default and fully local. Opt in with
kesha stats enableto record content-free operational metrics in a local SQLite database — never networked, never storing audio, transcripts, text, or paths. Full commands & lifecycle: docs/local-stats.md.
Contributing
See CONTRIBUTING.md, the Roadmap (Now / Next / Later), and the Decision log (why platform/model choices were made — and reversed). Dev setup: make dev-setup (Bun, Rust, nextest, platform libs).
License
Made with 💛🩵 and 🥤 energy under MIT License
