read-docs
v0.3.3
macOS local GPU speech workspace for document reading, dictation, and library analysis.
Maintained by SproutSeeds. Research stewardship: Fractal Research Group (frg.earth).
A macOS-first, local-GPU speech workspace for reading documents, capturing dictation, and keeping the resulting material organized in one local Library.
It streams .pdf, .docx, .txt, and .md files through local neural TTS,
reads highlighted text with Right Command or Command-L, captures Option-key
dictation through local Whisper, and keeps playback continuous by preparing
later chunks in the background.
Why this exists
- Uses free local GPU processing as the primary speech path: strict 4090 Kokoro for text-to-speech and 4090 Whisper for speech-to-text.
- Keeps OpenAI speech as an explicit optional backend that loads only from your environment, macOS Keychain, or ORP secrets when you select it.
- Supports private neural speech through a Tailscale-connected 4090 or a Mac-local Kokoro sidecar, with a local say fallback on macOS.
- Turns reading, dictation, and external app handoffs into durable Library cards.
- Tracks a local Signal Map for STT/TTS word counts, reading completion, and batch analysis.
- Reads with understanding in smart mode instead of spelling every character.
- Prefetches chunks so playback remains continuous.
- Supports pause/resume, active-field dictation insertion, and selectable speech backends.
Platform support
The app experience is macOS-first. The menu-bar app, login agent, global selection hotkey, and right-click Services integration are macOS features.
The document reader engine is still a Python CLI and may work on Linux or Windows with compatible speech dependencies, but the packaged app workflow is supported on macOS.
Quick start: macOS app
Install the npm bootstrapper, then install the managed macOS app agent:
npm install -g read-docs
read-docs install
read-docs status

read-docs install copies the runtime into ~/.doc-reader-managed, prepares its
Python environment, registers a LaunchAgent, starts the menu-bar app, and installs
the Read with Doc Reader Services item for highlighted text.
Current native-wrapper builds require Apple's Command Line Tools because the app bundle is compiled locally during install:
xcode-select --install

Useful app commands:
read-docs start
read-docs open
read-docs dock
read-docs stop
read-docs restart
read-docs status
read-docs uninstall

The installer also creates ~/Applications/Doc Reader.app with the native app
icon and registers it with Launch Services. You can launch it from Applications,
Spotlight, or the Dock. The Dock/menu-bar app opens the web page, which is the
canonical DocReader interface.
Canonical web app
Doc Reader runs as a local web app and can be exposed to your tailnet:
read-docs tailscale
read-docs web-status

By default, the local service listens on http://127.0.0.1:8766. Tailscale
Serve can expose the same page at https://<this-machine>:8766 inside the
tailnet. The web app supports document upload, text reading, Library cards,
play, pause, resume, stop, and voice settings.
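If you script against the local service, a small reachability check is a useful starting point. This is a sketch: the port is the documented default, but any specific API routes would be assumptions, so only a TCP connect is attempted here:

```python
import socket

def web_app_reachable(host="127.0.0.1", port=8766, timeout=1.0):
    """Return True if the local Doc Reader web service accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    state = "up" if web_app_reachable() else "down"
    print(f"Doc Reader web app on 127.0.0.1:8766 is {state}")
```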
Local-only controls:
read-docs web-start
read-docs web-stop
read-docs web

Local neural text-to-speech
Doc Reader runs private neural speech sidecars as the normal app speech path. The strict 4090 backend is the default for app playback and uses the Umbra Tailnet TTS service:
Strict 4090 (Kokoro)

The optional local fallback backend stays on your Mac and tailnet:

4090 Kokoro -> Mac Kokoro -> 4090 Chatterbox -> macOS say

Doc Reader cleans Markdown/code-heavy text and splits long passages before they reach the neural TTS sidecars. Chatterbox is still available as a selectable voice, but strict 4090 mode favors Kokoro for steadier document playback.
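The cleaning-and-splitting step can be pictured with a minimal sketch; this is not the app's actual implementation, and the regexes and word limit are illustrative assumptions:

```python
import re

def clean_for_tts(text):
    """Strip common Markdown/code noise so the TTS engine reads prose, not syntax."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # drop fenced code blocks
    text = re.sub(r"`([^`]*)`", r"\1", text)                 # unwrap inline code
    text = re.sub(r"[#*_>]+", " ", text)                     # remove markdown markers
    return re.sub(r"\s+", " ", text).strip()

def split_passages(text, max_words=60):
    """Split cleaned text into chunks of at most max_words, on sentence-ish boundaries."""
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        if len(current) >= max_words or word.endswith((".", "!", "?")):
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Short chunks like these are what keep the neural sidecars responsive on long documents.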
Set up the 4090 service on the Windows machine reachable as Umbra:
read-docs tts-umbra-install
read-docs tts-umbra-status

Set up the Mac-local Kokoro service:

read-docs tts-mac-start
read-docs tts-mac-status

Run a benchmark and generate sample files:

read-docs tts-bench

Benchmark reports and sample audio are saved under ~/.doc-reader-managed/tts-benchmarks/.
Local 4090 speech-to-text
Doc Reader can also use the Umbra 4090 service for local dictation through Whisper. This path is the default STT path, runs through Tailscale, and does not call an API.
Setup is part of the Umbra service install:
read-docs tts-umbra-install
read-docs tts-umbra-start
read-docs restart

Open the canonical web app and confirm Hold Option for 4090 dictation is on.
Choose the microphone from the Dictation settings if the system default is not
the input you want. Put the cursor in a text field, then hold the Option/Alt key
to record from the selected Mac microphone. Doc Reader shows a small recording
HUD while the key is held, sends the audio to Umbra when the key is released,
inserts the transcription at the cursor, and adds the transcription as a
Dictation card in the web app.
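The hold-to-record round trip can be approximated in a few lines. This is a sketch only: the transcription endpoint path, port, and payload shape shown here are assumptions for illustration, not the app's actual Umbra protocol:

```python
import urllib.request

def transcribe_via_umbra(wav_bytes, host="umbra", port=9000):
    """POST recorded audio to a hypothetical Whisper HTTP endpoint, return the text.

    The /transcribe route and audio/wav body are illustrative assumptions.
    """
    req = urllib.request.Request(
        f"http://{host}:{port}/transcribe",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")
```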
The web app keeps read-aloud cards, dictation cards, and external app readings in one Library. Filter buttons narrow the Library to Readings, Dictations, or Clawdad-origin items. Dictation cards have a copy icon button; clicking it copies the full text and briefly switches the button to a green checkmark. Library retention is unbounded: read-aloud cards, prepared TTS audio metadata, STT dictation cards, and saved dictation recording files are retained until you remove them from the managed app data.
The Library also tracks a local Signal Map. It counts words separately for STT
dictations and TTS/read-aloud material, records completion state for readings,
and writes batch semantic analysis to
~/.doc-reader-managed/library-analysis.json plus per-batch JSON files under
~/.doc-reader-managed/library-analysis-batches/. The analyzer runs
periodically while the web app is open and can be triggered from the Signal Map
button. By default it tries a free local 4090 model endpoint through Ollama at
the Umbra host on port 11434, then falls back to deterministic local
structure if no model is reachable. Configure it with:
export DOC_READER_ANALYSIS_BACKEND=ollama
export DOC_READER_ANALYSIS_URL=http://100.72.151.28:11434
export DOC_READER_ANALYSIS_MODEL=llama3.1:8b

Umbra preloads large-v3 after service start so the first real dictation does
not pay the model-load cost. Short warm dictations should return quickly. macOS
may ask for microphone permission the first time the native app records audio.
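The analyzer's try-Ollama-then-fall-back behavior can be sketched roughly as below. The function name and the fallback output format are illustrative, not the app's real code; the Ollama /api/generate call itself follows the standard Ollama HTTP API:

```python
import json
import os
import urllib.request

def analyze_batch(texts):
    """Try the configured Ollama endpoint; fall back to deterministic local stats."""
    url = os.environ.get("DOC_READER_ANALYSIS_URL", "http://100.72.151.28:11434")
    model = os.environ.get("DOC_READER_ANALYSIS_MODEL", "llama3.1:8b")
    try:
        payload = json.dumps({
            "model": model,
            "prompt": "Summarize these notes:\n" + "\n".join(texts),
            "stream": False,
        }).encode("utf-8")
        req = urllib.request.Request(url.rstrip("/") + "/api/generate", data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=5) as resp:
            return {"backend": "ollama", "summary": json.load(resp)["response"]}
    except OSError:
        # Deterministic fallback: simple local structure, no model required.
        return {"backend": "local", "word_counts": [len(t.split()) for t in texts]}
```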
The web app shows the selected input device, microphone authorization,
Accessibility, and Input Monitoring state. It also shows whether the native app
helper is online; use Start Helper if the web page is up but the hold-Option
listener is not running. If the HUD does not appear while Doc Reader is in the
background, allow Doc Reader.app in macOS Privacy & Security for Input
Monitoring. Accessibility is also required for automatic insertion into the
active text field; without it, Doc Reader copies the transcription to the
clipboard.
Quick start: source checkout
- Create and activate a virtual environment.
- Install dependencies.
- Run the reader.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m doc_reader /path/to/file.pdf --mode smart --style balanced --verbose

npm package
This repository is configured for the public npm package read-docs.
The unscoped doc-reader package name is already owned by another maintainer, so the
official SproutSeeds package uses the available global npm name read-docs.
The npm package is a bootstrapper and control surface for the managed app. Running
read-docs with no arguments shows the available app and CLI commands; it does not
directly run the Python tray from the package install location.
Install globally:
npm install -g read-docs

Pass a document path to use the command-line reader instead of the menu-bar app:
read-docs /path/to/file.pdf --mode smart --style balanced --verbose

Development launch
From the repo root:
./run-doc-reader

This command will:
- Create .venv if needed
- Install/update dependencies when requirements.txt changes
- Launch the menu-bar app directly from the source checkout
Optional fallback TTS engine:
./.venv/bin/python -m pip install pyttsx3

Optional remote text-to-speech
The default app path uses local GPU speech. OpenAI remains available as an
explicit remote speech backend for users who choose it. When
--speech-backend openai is selected, the app checks OPENAI_API_KEY,
Doc Reader's macOS Keychain item, and ORP's local openai-primary Keychain
secret.
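The lookup order for the optional OpenAI backend can be sketched as follows. The Keychain service names below are placeholders, not the app's actual item names; the `security find-generic-password` invocation is the standard macOS CLI for reading Keychain secrets:

```python
import os
import subprocess

def resolve_openai_key():
    """Return the first available key: env var first, then macOS Keychain items."""
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    # Keychain lookups only succeed on macOS; service names here are illustrative.
    for service in ("doc-reader-openai", "openai-primary"):
        try:
            out = subprocess.run(
                ["security", "find-generic-password", "-s", service, "-w"],
                capture_output=True, text=True, check=True,
            )
            return out.stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            continue
    return None
```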
CLI with explicit backend selection:
./run-doc-reader \
--speech-backend openai \
--openai-model gpt-4o-mini-tts \
--openai-voice marin \
--openai-response-format wav

App usage:
- Open the panel from the menu bar icon.
- Choose a document, read clipboard text, or paste text into the reader window.
- Use the Library cards to pause, resume, copy dictation text, and switch between saved readings.
- Choose Strict 4090, 4090 Chatterbox, 4090 Kokoro, Mac Kokoro, OpenAI API, or system speech.
- Store an OpenAI key in the macOS Keychain or ORP secrets only when using the optional remote backend.
- Click Stop Reading from the menu bar item to stop active playback.
macOS menu-bar app
The supported app path is:
read-docs install

For local development, you can also run the Python menu-bar module directly:

python -m doc_reader.tray

What it gives you:
- Native menu-bar app shell (Doc Reader.app) with a formal app icon that opens the canonical web page.
- Applications/Dock launcher at ~/Applications/Doc Reader.app.
- Tailnet web app through read-docs tailscale.
- Open DocReader Page opens http://127.0.0.1:8766.
- Read Clipboard in DocReader posts clipboard text into the web Library.
- Pause/resume and stop controls call the web app.
- Persistent Library cards for documents, pasted text, clipboard text, highlighted text, dictation, and external app handoffs.
- Option/Alt hold-to-record dictation with microphone selection, 4090 Whisper, active-field insertion, and copyable Dictation cards.
- Highlighted-text readback through Right Command, Command-L, or the right-click Services item.
- Signal Map metrics and batch analysis for local reading and dictation material.
- Web settings for strict 4090, Mac-local, optional OpenAI API, and system speech.
- OpenAI API keys are loaded only when OpenAI API is explicitly selected.
- Right-click Services integration sends highlighted text into the web app.
The older PySide tray module remains in the source tree as a development fallback, but the npm app path uses the native macOS wrapper.
Right-click menu (macOS Services)
The native macOS helper can read highlighted text from the keyboard. Highlight
text in any app and tap the right Command key; Doc Reader copies the selection,
restores your clipboard, creates a reading card in the web app, and starts
playback through the selected TTS backend. Control+Command+R is also accepted
as a fallback.
Install a native Services entry as a fallback so highlighted text can also be
read from right-click menus:
read-docs install

If the app agent is already installed and you only need to refresh the Services
entry, run read-docs install-service.
Then in any app:
- Highlight text.
- Right click.
- Choose Services -> Read with Doc Reader.
The Services flow uses macOS text input directly. If an app invokes the Service
but passes empty or partial text, the helper makes a clipboard-preserving copy
attempt and logs the selected-text handoff count to ~/Library/Logs/doc-reader-service.log.
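The clipboard-preserving copy pattern can be sketched generically. The helper's real implementation may use native macOS APIs; here the clipboard accessors are injected so the save/copy/restore sequence is visible (on macOS they could wrap pbpaste/pbcopy):

```python
def read_selection_preserving_clipboard(get_clipboard, set_clipboard, copy_selection):
    """Save the clipboard, trigger a copy of the current selection, then restore.

    get_clipboard/set_clipboard/copy_selection are caller-supplied; on macOS the
    first two could shell out to pbpaste/pbcopy and the third could synthesize Cmd-C.
    """
    saved = get_clipboard()          # remember the user's clipboard
    copy_selection()                 # overwrite it with the current selection
    selected = get_clipboard()       # read the selection back out
    set_clipboard(saved)             # restore the original clipboard contents
    return selected
```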
Remove it later with:
read-docs uninstall-service

Auto-start at login (macOS)
The app agent is registered by:
read-docs install

This installs a managed app copy at ~/.doc-reader-managed and registers a
LaunchAgent. Run read-docs restart after package updates to refresh that managed
copy and restart the app.
Disable later:
read-docs disable-startup

Modes
- --mode smart: Speaks key ideas from each chunk (default).
- --mode full: Speaks cleaned source text.
Detail styles
- --style concise: very short key points
- --style balanced: moderate detail (default)
- --style detailed: more context per chunk
Continuous playback strategy
Pipeline architecture:
- Extract text progressively from the input file.
- Chunk text into early-small then steady-sized segments.
- Prepare speech-ready narration for each chunk.
- Queue prepared chunks while the current chunk is being spoken.
The first chunk target is smaller (--first-chunk-words) so audio starts quickly; later chunks use --chunk-words for steadier flow.
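The early-small-then-steady chunking can be sketched in a few lines; the limits mirror the example values passed to --first-chunk-words and --chunk-words, and the function itself is illustrative rather than the app's actual code:

```python
def chunk_words(words, first_chunk_words=95, chunk_words=240):
    """Yield an early small chunk so audio starts fast, then steady-sized chunks."""
    limit, start = first_chunk_words, 0
    while start < len(words):
        yield words[start:start + limit]
        start += limit
        limit = chunk_words  # after the first chunk, switch to the steady size
```

A larger steady chunk size smooths prosody across sentences, while the small first chunk minimizes time-to-first-audio.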
Useful CLI options
python -m doc_reader file.docx \
--mode smart \
--style detailed \
--speech-backend auto \
--rate 190 \
--voice Samantha \
--first-chunk-words 95 \
--chunk-words 240 \
--queue-size 10

Dry-run without speaking:
python -m doc_reader notes.md --dry-run --verbose

Notes
- .doc (legacy Word) is not supported yet; convert to .docx first.
- PDF quality depends on extractable text in the file (scanned PDFs need OCR first).
Contributing
Contributions are welcome. Please see CONTRIBUTING.md for setup, PR guidelines, and security reporting.
