mrmd-voice

v0.1.0

Published

3 months ago

Voice-to-text for MRMD — recording, transcription, and text routing


maximerivest

mrmd-voice

Voice-to-text for MRMD. Works everywhere: Electron desktop, markco.dev browser, phone (TWA).

Voice is an app-level service, not an editor feature. It captures audio regardless of what's focused (editor, terminal, file picker, rename input) and routes transcribed text to the active element.

Architecture

Core Principle: Never Lose Audio

Every recording is saved as a project asset immediately. Transcription is secondary — if it fails, the audio is safe and can be re-transcribed later. Audio assets are auto-cleaned after 1 hour (configurable). The goal is crash safety, not archiving.

During recording, checkpoint blobs are saved every 30 seconds. On stop, the final complete blob is saved and the checkpoints are removed. Format: WebM/Opus (~12 KB/s).
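The checkpoint flow can be sketched roughly as follows. This is a minimal illustration, not the code in recorder.js: `attachCheckpointing`, `saveAsset`, and `removeAsset` are hypothetical names, and the only real API relied on is `MediaRecorder.start(timeslice)`, which emits a `dataavailable` event once per timeslice and once more on stop.

```javascript
// Sketch only: `saveAsset` and `removeAsset` are hypothetical callbacks supplied
// by the host app; they are not part of the mrmd-voice API described here.
const CHECKPOINT_INTERVAL_MS = 30_000;

function attachCheckpointing(recorder, baseName, { saveAsset, removeAsset }) {
  const chunks = [];
  const checkpointName = `${baseName}.checkpoint.webm`;
  const finalName = `${baseName}.webm`;

  // MediaRecorder emits `dataavailable` once per timeslice and once more on stop.
  recorder.addEventListener('dataavailable', (event) => {
    if (!event.data || event.data.size === 0) return;
    chunks.push(event.data);
    // Persist everything captured so far, so a crash loses at most one interval.
    saveAsset(checkpointName, new Blob(chunks, { type: 'audio/webm' }));
  });

  recorder.addEventListener('stop', () => {
    // Write the complete recording first, then drop the checkpoint.
    saveAsset(finalName, new Blob(chunks, { type: 'audio/webm' }));
    removeAsset(checkpointName);
  });

  return {
    start: () => recorder.start(CHECKPOINT_INTERVAL_MS),
    stop: () => recorder.stop(),
  };
}
```

Saving the final blob before removing the checkpoint means there is never a moment with no copy of the audio on disk, which is the property the principle above asks for.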

Package Structure

mrmd-voice/
├── src/
│   ├── recorder.js          # MediaRecorder wrapper
│   │                        #   - start/stop recording
│   │                        #   - periodic checkpoint saves (crash safety)
│   │                        #   - final blob on stop
│   │
│   ├── transcriber.js        # Transcription backend router
│   │                        #   - tries backends in priority order
│   │                        #   - common interface for all backends
│   │
│   ├── parakeet-client.js    # WebSocket client for Parakeet servers
│   │                        #   - Same protocol as parakeet-stream server
│   │                        #   - Sends int16 PCM audio, receives segments
│   │                        #   - Works direct (ws://) or via proxy (/proxy/)
│   │
│   ├── api-client.js         # REST client for Groq/OpenAI Whisper API
│   │                        #   - POST audio blob, get text back
│   │
│   ├── text-router.js        # Routes transcribed text to focused element
│   │                        #   - CodeMirror → editor.dispatch() insert
│   │                        #   - xterm terminal → write to PTY stdin
│   │                        #   - <input>/<textarea> → set value + input event
│   │                        #   - File picker → set search query
│   │                        #   - Nothing focused → buffer / last known target
│   │
│   └── index.js              # Public API
├── dist/
│   └── mrmd-voice.iife.js    # Browser bundle (loaded by index.html)
├── rollup.config.js
└── package.json
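The tree notes that parakeet-client.js sends int16 PCM audio. Web Audio delivers samples as 32-bit floats in the range [-1, 1], so some conversion step along these lines is presumably involved. The function name `floatToInt16` is an assumption for illustration, and the actual client may resample or chunk the audio differently.

```javascript
// Sketch only: converts Web Audio float samples to 16-bit signed PCM.
// `floatToInt16` is a hypothetical helper name, not a documented export.
function floatToInt16(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative and positive halves have different magnitudes in int16.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

// The resulting buffer can be sent over a WebSocket as binary data,
// e.g. socket.send(floatToInt16(frame).buffer).
```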

Where Each Piece Lives (Across the Ecosystem)

| Component | Package | Why |
|---|---|---|
| Recording, transcription clients, text routing | mrmd-voice/ | Pure browser lib, shared by all deployment modes |
| Shimmer overlay, mic button, Alt+W, rail panel | mrmd-electron/index.html | App shell (served to browser by mrmd-server too) |
| Voice service (Parakeet process lifecycle) | mrmd-electron/src/services/voice-service.js | Node service, same pattern as session services |
| Voice IPC handlers | mrmd-electron/main.js | voice:status, voice:startLocal, voice:stopLocal |
| Voice IPC bridge | mrmd-electron/preload.cjs | electronAPI.voice.* namespace |
| Voice HTTP routes | mrmd-server/src/api/voice.js | HTTP mirror of voice IPC |
| Voice http-shim entries | mrmd-server/static/http-shim.js | Browser compat layer |
| FixTranscriptionPredict | mrmd-ai/ | Already exists |
| Voice settings | Settings service | Already exists, just add voice section |

Transcription Backends

Current providers wired in app settings:

- Parakeet (WebSocket, GPU-friendly)
  - Config: voice.provider = "parakeet"
  - URL: voice.parakeetUrl = "ws://192.168.2.24:8765"
- OpenAI Whisper API
  - Config: voice.provider = "openai"
  - Uses apiKeys.openai
  - Endpoint: https://api.openai.com/v1/audio/transcriptions
- Groq Whisper API
  - Config: voice.provider = "groq"
  - Uses apiKeys.groq
  - Endpoint: https://api.groq.com/openai/v1/audio/transcriptions

If no backend is configured, the audio is still saved and the user sees a non-fatal toast.
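One way to read the provider settings above is as a small lookup that either yields a usable backend or nothing. The sketch below assumes a settings object with `voice` and `apiKeys` keys as named in this section; `resolveBackend` and the shape of its return value are hypothetical and are not taken from transcriber.js.

```javascript
// Endpoints as listed in this README.
const WHISPER_ENDPOINTS = new Map([
  ['openai', 'https://api.openai.com/v1/audio/transcriptions'],
  ['groq', 'https://api.groq.com/openai/v1/audio/transcriptions'],
]);

// Sketch only: returns a backend descriptor, or null when nothing is configured.
// The descriptor shape ({ kind, provider, ... }) is an assumption for illustration.
function resolveBackend(settings) {
  const voice = settings.voice ?? {};
  const apiKeys = settings.apiKeys ?? {};
  const provider = voice.provider;

  if (provider === 'parakeet' && voice.parakeetUrl) {
    return { kind: 'websocket', provider, url: voice.parakeetUrl };
  }
  if (WHISPER_ENDPOINTS.has(provider) && apiKeys[provider]) {
    return {
      kind: 'rest',
      provider,
      endpoint: WHISPER_ENDPOINTS.get(provider),
      apiKey: apiKeys[provider],
    };
  }
  // No usable backend: the caller keeps the saved audio and shows a toast.
  return null;
}
```

Returning `null` rather than throwing keeps the "no backend" case non-fatal, which matches the behavior described above.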

How Phone/Browser Uses Desktop GPU (Tunnel)

The existing Runtime Tunnel already proxies arbitrary HTTP and WebSocket to the desktop Electron app. Voice piggybacks on this:

Phone (markco.dev)
  → mrmd-voice/parakeet-client.js connects via WebSocket
  → ws://server/sync/8765/  (proxy path)
  → RuntimeTunnelClient in mrmd-server
  → WebSocket relay (markco.dev/sync-relay)  
  → RuntimeTunnel provider in mrmd-electron on desktop
  → ws://localhost:8765  (Parakeet server on GPU)

No new tunnel infrastructure needed. Just register the Parakeet port as a tunnel port.
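From the client's point of view, the only difference between the direct and tunneled cases is the WebSocket URL. A rough sketch, assuming the `/sync/<port>/` proxy path shown in the diagram: `parakeetSocketUrl` is a hypothetical helper, and switching to `wss:` on HTTPS pages is an assumption added here because browsers block insecure WebSockets from secure pages.

```javascript
// Sketch only: picks the WebSocket URL for the Parakeet client.
// `parakeetSocketUrl` and its options are illustrative, not a documented API.
function parakeetSocketUrl(pageUrl, { port = 8765, directUrl = null } = {}) {
  // A configured direct URL (e.g. ws://192.168.2.24:8765) wins.
  if (directUrl) return directUrl;

  // Otherwise go through the server's proxy path on the page's own host.
  const page = new URL(pageUrl);
  const scheme = page.protocol === 'https:' ? 'wss:' : 'ws:';
  return `${scheme}//${page.host}/sync/${port}/`;
}

// In a browser this would typically be called with window.location.href.
```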

Text Routing

When transcription completes, text is routed to wherever focus is:

| Focus target | How text is inserted |
|---|---|
| CodeMirror editor | `view.dispatch({ changes: { from: cursor, insert: text } })` |
| xterm.js terminal | Write to PTY stdin (simulates typing: the user sees the text and presses Enter) |
| `<input>` element (file picker, rename, search) | `el.value += text; el.dispatchEvent(new Event('input'))` |
| Nothing focused | Insert at last known editor cursor position |
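The routing rules above amount to a dispatch on the kind of focused target. The sketch below assumes the caller has already classified the focused element into a small descriptor (`{ kind, ... }`), which is a hypothetical shape invented for this example, as is the `writeToPty` callback. The `view.dispatch` call and `view.state.selection.main.head` are standard CodeMirror 6.

```javascript
// Sketch only: `target` is a hypothetical descriptor built by the caller, e.g.
//   { kind: 'codemirror', view } | { kind: 'terminal', writeToPty } | { kind: 'input', element }
function insertIntoEditor(view, text) {
  const cursor = view.state.selection.main.head;
  view.dispatch({ changes: { from: cursor, insert: text } });
}

function routeText(text, target, lastEditorView = null) {
  if (target && target.kind === 'codemirror') {
    insertIntoEditor(target.view, text);
    return 'editor';
  }
  if (target && target.kind === 'terminal') {
    // No trailing newline: the user reviews the text and presses Enter.
    target.writeToPty(text);
    return 'terminal';
  }
  if (target && target.kind === 'input') {
    target.element.value += text;
    // Fire `input` so listeners (search filters, validators) see the change.
    target.element.dispatchEvent(new Event('input', { bubbles: true }));
    return 'input';
  }
  if (lastEditorView) {
    // Nothing focused: fall back to the last known editor cursor.
    insertIntoEditor(lastEditorView, text);
    return 'editor';
  }
  return 'unrouted'; // caller may buffer the text for later
}
```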

UI Elements

- Full-screen shimmer: Animated border glow on the .app container when recording. Unmistakable visual signal.
- Mic button: In the mobile toolbar (alongside search, AI, terminal buttons). On desktop, in the status bar area.
- Alt+W: Toggle recording on/off (same shortcut as existing parakeet-hotkey service).
- Duration timer: Shows recording time while active.
- Rail menu panel: Voice status, recent recordings (last hour), re-transcribe button, backend settings, mic test.

Audio Safety Details

- During recording: checkpoint blob saved every 30 seconds to _assets/voice/
- On stop: final complete blob saved, checkpoints removed
- After transcription: .json sidecar with transcript + metadata
- Cleanup: background job deletes voice assets older than 1 hour
- Format: WebM/Opus (native to MediaRecorder, good compression)

_assets/voice/
├── 2026-02-23T06-31-56.webm       # Audio recording
├── 2026-02-23T06-31-56.json       # { audio, text, backend, duration, timestamp }
└── 2026-02-23T06-31-56.checkpoint.webm  # Only exists during active recording
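The naming scheme and one-hour cleanup rule can be expressed with a few small helpers. This is a sketch under assumptions: the function names are hypothetical, the timestamp is taken in UTC (the README does not say whether file names use UTC or local time), and the sidecar fields simply mirror the `{ audio, text, backend, duration, timestamp }` list above.

```javascript
const MAX_AGE_MS = 60 * 60 * 1000; // 1 hour, the default cleanup window

// Sketch only: derives a file stem such as 2026-02-23T06-31-56 (UTC assumed).
function voiceAssetStem(date) {
  return date.toISOString().slice(0, 19).replace(/:/g, '-');
}

// Builds the .json sidecar contents for a finished transcription.
function buildSidecar(stem, { text, backend, duration }, date) {
  return {
    audio: `${stem}.webm`,
    text,
    backend,
    duration,
    timestamp: date.toISOString(),
  };
}

// True when an asset saved at `savedAtMs` is past the cleanup window.
function isExpired(savedAtMs, nowMs, maxAgeMs = MAX_AGE_MS) {
  return nowMs - savedAtMs > maxAgeMs;
}
```

A cleanup job would then list `_assets/voice/`, check each file's saved time with `isExpired`, and delete the matching `.webm` and `.json` pair together.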

Settings

Voice settings live in the voice section of SettingsService (~/.mrmd/settings.json):

{
  "voice": {
    "shortcut": { "altKey": true, "ctrlKey": false, "metaKey": false, "shiftKey": false, "key": "w" },
    "parakeetUrl": "ws://192.168.2.24:8765"
  }
}

To preseed parakeetUrl from the environment instead of typing it into the Settings panel, set MRMD_PARAKEET_URL at launch:

MRMD_PARAKEET_URL=ws://192.168.2.24:8765 mrmd-electron
# or
MRMD_PARAKEET_URL=ws://192.168.2.24:8765 mrmd-server

The Settings panel still lets users override it at runtime.
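Both settings lend themselves to small pure functions. The sketch below shows one plausible reading: a keyboard event matches the stored shortcut when the key and all four modifier flags agree, and a user-set parakeetUrl takes precedence over the environment preseed. `matchesShortcut` and `resolveParakeetUrl` are hypothetical names, and the precedence order is inferred from the text above rather than from the code.

```javascript
// Sketch only: compares a KeyboardEvent-like object against the stored shortcut.
function matchesShortcut(event, shortcut) {
  return (
    String(event.key).toLowerCase() === String(shortcut.key).toLowerCase() &&
    Boolean(event.altKey) === Boolean(shortcut.altKey) &&
    Boolean(event.ctrlKey) === Boolean(shortcut.ctrlKey) &&
    Boolean(event.metaKey) === Boolean(shortcut.metaKey) &&
    Boolean(event.shiftKey) === Boolean(shortcut.shiftKey)
  );
}

// Sketch only: a value saved in settings wins; the environment variable is the
// fallback preseed. `env` would be process.env in Electron or the server.
function resolveParakeetUrl(settings, env) {
  const fromSettings = settings.voice ? settings.voice.parakeetUrl : undefined;
  return fromSettings || env.MRMD_PARAKEET_URL || null;
}
```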

Implementation Phases

Phase 1 — Core Recording + Parakeet (Current)

- mrmd-voice/ package with recorder, parakeet-client, text-router
- Full-screen shimmer in index.html
- Alt+W shortcut, mic button in mobile toolbar
- Audio safety (save blobs as assets, 1-hour cleanup)
- Direct Parakeet connection (configurable URL in settings)
- Text insertion to editor, inputs, terminal

Phase 2 — Electron Parakeet Management + Tunnel

- voice-service.js in Electron (GPU detection, auto-start Parakeet)
- Register Parakeet port in the tunnel
- Phone/browser → tunnel → desktop Parakeet (zero config)
- electronAPI.voice.* IPC + server HTTP mirror + http-shim entries

Phase 3 — Cloud API + Rail Panel + Polish

- Groq/OpenAI Whisper API client (api-client.js)
- Rail menu voice panel (status, re-transcribe, settings)
- FixTranscriptionPredict auto-cleanup option
- Streaming partial transcription during recording

Development

cd mrmd-voice
npm install
npm run build    # Produces dist/mrmd-voice.iife.js
npm run dev      # Watch mode
