@drakulavich/kesha-voice-kit

v1.23.0

Published 19 hours ago

Give your tools a voice — speech to text and back, 25 languages, up to ~19× faster than Whisper. On your machine.

Vulnerabilities: 0 high, 0 medium, 0 low

drakulavich

asr speech-to-text transcription voice kesha multilingual apple-silicon coreml bun cli

Transcribe locally — 25 languages, up to ~19x faster than Whisper on Apple Silicon, ~2.5x on CPU
Speak back — text-to-speech in 9 languages
Plug into agents — ship voice workflows as CLI commands, an MCP server, an OpenClaw skill, or a Hermes agent
Small Rust engine — single ~20MB binary, no ffmpeg, no Python, no native Node addons

Quick Start

Runtime: Bun >= 1.3.0 · Platforms: macOS arm64, Linux x64, Windows x64.

# 1. Install Bun (skip if you have it) — Linux & macOS:
curl -fsSL https://bun.sh/install | bash        # or: brew install oven-sh/bun/bun
# Windows: powershell -c "irm bun.sh/install.ps1 | iex"

# 2. Install Kesha:
bun add -g @drakulavich/kesha-voice-kit
kesha install        # downloads engine + models (explicit — never automatic)

# 3. Transcribe:
kesha audio.ogg      # transcript to stdout

Prefer Homebrew, .deb/.rpm, Docker, or Nix? See Other install methods. Air-gapped or behind a corporate mirror? See docs/model-mirror.md.

Speech-to-text

kesha audio.ogg                            # transcribe (plain text)
kesha --format transcript audio.ogg        # text + language/confidence
kesha --format json audio.ogg              # full JSON with lang fields
kesha --json --timestamps audio.ogg        # JSON with timestamped segments
kesha --toon audio.ogg                     # compact LLM-friendly TOON
kesha status                               # show installed backend info

With multiple files, each transcript gets a head-style === filename === header. Transcripts go to stdout and errors to stderr, so the output pipes cleanly:

$ kesha freedom.ogg tahiti.ogg
=== freedom.ogg ===
Свободу попугаям! Свободу!

=== tahiti.ogg ===
Таити, Таити! Не были мы ни в какой Таити! Нас и тут неплохо кормят.
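
Because the transcript is the only thing written to stdout, a batch job can redirect each file's output on its own. The function below is a minimal sketch, not part of Kesha: the name transcribe_each is hypothetical, and it assumes kesha exits with a non-zero status when transcription fails, which the text above does not state.

```shell
# Hypothetical helper: transcribe each audio file into a sibling .txt file.
# Relies on the documented split: transcript on stdout, errors on stderr.
# Assumption: kesha returns a non-zero exit status on failure.
transcribe_each() {
  failed=0
  for audio in "$@"; do
    out="${audio%.*}.txt"            # audio.ogg -> audio.txt
    if kesha "$audio" > "$out"; then
      printf 'ok: %s -> %s\n' "$audio" "$out" >&2
    else
      printf 'failed: %s\n' "$audio" >&2
      rm -f "$out"                   # do not leave an empty or partial transcript
      failed=1
    fi
  done
  return "$failed"
}
```

Calling transcribe_each freedom.ogg tahiti.ogg would leave freedom.txt and tahiti.txt next to the audio, without the === headers, since each file is transcribed in its own invocation.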

Long or silence-heavy audio: install VAD with kesha install --vad; Kesha then uses it automatically for audio longer than 120 s. Without VAD, long audio falls back to fixed ASR chunks. See docs/vad.md.
Speaker diarization (darwin-arm64): kesha install --diarize, then kesha --json --vad --speakers meeting.m4a stamps each segment with a speaker id. Linux/Windows return a clear "darwin-arm64 only" error (#199).

Text-to-speech

Kesha speaks back in 9 languages, auto-picking the voice from the text's language. Override with --lang <code> or --voice <id>.

kesha install --tts                              # opt-in models (~990MB)
kesha say "Hello, world" > hello.wav
kesha say "Привет, мир" > privet.wav             # auto-routes by language
kesha say --voice ru-vosk-m02 "Голос в текст." > ru.wav

Output formats (--format, or inferred from the --out extension):

kesha say "Hello" --out hi.wav                    # WAV (default, uncompressed)
kesha say "Hello" --format ogg-opus --out hi.ogg  # OGG/Opus — messenger voice notes
kesha say "Hello" --format flac --out hi.flac     # FLAC — lossless, plays in every browser incl. Safari/iOS
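
As an illustration of the --format and --out flags shown above, the sketch below renders each non-empty line of a text file as its own OGG/Opus voice note. The function name lines_to_voice_notes and the note-N.ogg naming scheme are made up for this example; only kesha say, --format ogg-opus, and --out come from the commands above.

```shell
# Hypothetical helper: one OGG/Opus voice note per non-empty line of a text file.
# Usage: lines_to_voice_notes script.txt out_dir
# Prints the number of notes written.
lines_to_voice_notes() {
  src=$1
  dir=$2
  mkdir -p "$dir" || return 1
  n=0
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue       # skip blank lines
    n=$((n + 1))
    # --format is given explicitly; the .ogg extension keeps the file name consistent.
    # stdin is redirected so the command cannot consume the rest of the input file.
    kesha say "$line" --format ogg-opus --out "$dir/note-$n.ogg" < /dev/null || return 1
  done < "$src"
  echo "$n"
}
```

Each line goes through the same language auto-routing as any other kesha say call, so a file mixing English and Russian lines needs no extra flags.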

kesha say --list-voices lists what's installed. Voices, the full catalogue, macOS system voices, SSML, speaking rate (--rate, <prosody>), Russian word stress, and Russian/English abbreviation handling are all in docs/tts.md.

Languages

Speech-to-text spans 25 languages. Text-to-speech covers 9 languages: English, Russian, and select multilingual voices. Full tables with codes and flags are in docs/languages.md. Audio language detection identifies 107 languages.

Performance

Up to ~19x faster than Whisper on Apple Silicon (M2), ~2.5x faster on CPU

Compared against Whisper large-v3-turbo, all engines auto-detecting language:

[Figure: Benchmark: openai-whisper vs faster-whisper vs Kesha Voice Kit]

Full per-file breakdown (Russian + English): BENCHMARK.md.

Other install methods

All of these install the Bun CLI wrapper; engine + models still download explicitly via kesha install.

Homebrew — brew install drakulavich/tap/kesha-voice-kit · docs/homebrew.md
Linux packages (.deb/.rpm, x64) — docs/linux-packages.md
Docker (GHCR image) — docs/docker.md
Nix (aarch64-darwin / x86_64-linux) — nix run github:drakulavich/kesha-voice-kit -- install · docs/nix-install.md
Shell completions + manpage — kesha completions bash|zsh|fish and kesha manpage print the packaged files to install wherever your shell expects them.

Integrations

MCP server — kesha mcp exposes transcribe/synthesize/list tools to any MCP client (Claude, Cursor, Codex, Gemini). Setup: docs/mcp.md.
OpenClaw — give your LLM agent ears. Install & config: docs/openclaw.md.
Hermes Agent — local STT/TTS through Hermes command providers. Setup: docs/hermes.md.
Raycast (macOS) — transcribe selected audio & speak the clipboard from the launcher. Source + install: raycast/.
Programmatic API — @drakulavich/kesha-voice-kit/core for use inside a Bun program. See docs/api.md.

Architecture — runtime data flow, the models that ship, the CLI ↔ Rust engine boundary, model pinning, and where tests live.
Use cases — copy-paste recipes (transcribe a meeting, speak from OpenClaw, run offline, move the cache).
Product positioning — supported workflows, non-goals, maturity labels, platform matrix.
Diagnostics: kesha doctor, kesha support-bundle (redacted .tar.gz for issues), and kesha logs produce local, content-free diagnostics — see docs/diagnostic-logs.md. Every failure prints a stable error [CODE]: … line (docs/errors.md).
Privacy / Local Stats: Stats are off by default and fully local. Opt in with kesha stats enable to record content-free operational metrics in a local SQLite database — never networked, never storing audio, transcripts, text, or paths. Full commands & lifecycle: docs/local-stats.md.
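
The stable error [CODE]: … line mentioned under Diagnostics lends itself to scripting. The sketch below pulls the code out of captured stderr. It assumes the line starts with the literal text error [ at the beginning of a line, which is a reading of the format above rather than something docs/errors.md is quoted as guaranteeing; kesha_error_code is a hypothetical name.

```shell
# Hypothetical helper: print CODE from the first "error [CODE]: ..." line on stdin.
# Returns non-zero when no such line is present.
# Assumption: the error line begins with "error [" at the start of a line.
kesha_error_code() {
  sed -n 's/^error \[\([^]]*\)]: .*/\1/p' | {
    IFS= read -r code && printf '%s\n' "$code"
  }
}
```

A possible use: run kesha audio.ogg > out.txt 2> err.log, and on failure call kesha_error_code < err.log to branch on the code instead of matching the human-readable message.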

Contributing

See CONTRIBUTING.md, the Roadmap (Now / Next / Later), and the Decision log (why platform/model choices were made — and reversed). Dev setup: make dev-setup (Bun, Rust, nextest, platform libs).

License

Made with 💛🩵 and 🥤 energy under MIT License
