openclaw-azure-speech

v1.0.0

Published 2 months ago

Azure AI Speech Service integration for OpenClaw — TTS and STT with 400+ neural voices, SSML support, and enterprise SLA.

Vulnerabilities: 0 High, 0 Medium, 0 Low

sawyer0x110

openclaw openclaw-plugin azure speech tts stt text-to-speech speech-to-text neural-voice cognitive-services

Features

TTS (Text-to-Speech): 400+ neural voices across 140+ languages/locales
STT (Speech-to-Text): Short-audio transcription via the REST API, plus real-time streaming via WebSocket
Zero dependencies: Pure fetch for REST calls, native WebSocket for streaming
SSML support: Full SSML override for fine-grained voice control
Voice listing: Browse available voices filtered by locale
Channel-aware output: Automatic Ogg/Opus for Telegram/WhatsApp voice messages, MP3 for other channels
CJK auto-switch: Automatically switches from the default English voice to a Chinese voice when CJK characters dominate the text
Model directive support: Override the voice per reply with [[tts:voice=zh-CN-YunxiNeural]]

Prerequisites

An Azure account with a Speech Service resource
Get your subscription key and region from the Azure Portal → Speech resource → Keys and Endpoint

Installation

openclaw plugins install openclaw-azure-speech

Or link locally for development:

openclaw plugins install -l /path/to/openclaw-azure-speech

Configuration

Minimal setup (TTS only)

Add to your openclaw.json:

{
  plugins: {
    entries: {
      "azure-speech": {
        config: {
          subscriptionKey: "your-azure-speech-key",
          region: "eastasia",
        }
      }
    }
  },
  messages: {
    tts: {
      provider: "azure",  // use Azure as TTS provider
    }
  }
}

Full setup (TTS + STT)

STT requires an additional models.providers.azure entry for OpenClaw's media understanding auth pipeline:

{
  // Plugin config
  plugins: {
    entries: {
      "azure-speech": {
        config: {
          subscriptionKey: "your-azure-speech-key",
          region: "eastasia",
          voice: "zh-CN-XiaoxiaoNeural",  // optional, default auto-detects CJK
          sttLanguage: "zh-CN",            // optional, default zh-CN
        }
      }
    }
  },

  // TTS provider selection
  messages: {
    tts: {
      provider: "azure",
    }
  },

  // STT auth (required for audio transcription)
  models: {
    providers: {
      azure: {
        apiKey: "your-azure-speech-key",
        baseUrl: "https://eastasia.stt.speech.microsoft.com",
        models: []
      }
    }
  },

  // Audio transcription model entry
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "azure", model: "default", language: "zh-CN" }
        ]
      }
    }
  }
}
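
For orientation, the baseUrl above is the host of Azure's short-audio recognition endpoint. The sketch below shows how a request against it could be assembled; it is an illustration, not the plugin's actual code. The helper name `buildSttRequest` and its parameters are hypothetical, while the URL path, the `language` and `format` query parameters, and the header names follow Microsoft's Speech-to-text REST API for short audio.

```typescript
// Hypothetical helper: describes a short-audio STT request without sending it.
interface SttRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
}

function buildSttRequest(
  region: string,
  subscriptionKey: string,
  language: string = "zh-CN",
): SttRequest {
  const base = `https://${region}.stt.speech.microsoft.com`;
  const path = "/speech/recognition/conversation/cognitiveservices/v1";
  const query = `language=${encodeURIComponent(language)}&format=simple`;
  return {
    url: `${base}${path}?${query}`,
    method: "POST",
    headers: {
      "Ocp-Apim-Subscription-Key": subscriptionKey,
      // Assumes 16 kHz mono PCM in a WAV container.
      "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
      Accept: "application/json",
    },
  };
}

const sttRequest = buildSttRequest("eastasia", "your-azure-speech-key");
console.log(sttRequest.url);
```

Passing the URL and headers to fetch with the WAV bytes as the body should return JSON whose DisplayText field holds the transcript when RecognitionStatus is "Success".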

Environment variables (alternative)

You can use environment variables instead of or alongside openclaw.json:

AZURE_SPEECH_KEY=your-key        # subscription key
AZURE_SPEECH_REGION=eastasia     # Azure region
AZURE_SPEECH_VOICE=zh-CN-XiaoxiaoNeural  # optional, default TTS voice
AZURE_SPEECH_STT_LANGUAGE=zh-CN          # optional, default STT language

Config resolution priority

The plugin resolves configuration from multiple sources (highest priority first):

messages.tts.providers.azure (standard OpenClaw TTS provider config)
plugins.entries.azure-speech.config (plugin config)
Environment variables
Built-in defaults
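
The priority order above can be sketched as a layered merge. This is a minimal illustration that assumes values are resolved field by field (the first source that defines a field wins); the plugin may merge differently. The names `resolveConfig`, `fromEnv`, and `DEFAULTS` are hypothetical, and the default values are the ones this README mentions.

```typescript
type ConfigKey = "subscriptionKey" | "region" | "voice" | "sttLanguage";
type AzureSpeechConfig = { [K in ConfigKey]?: string | undefined };

const CONFIG_KEYS: ConfigKey[] = ["subscriptionKey", "region", "voice", "sttLanguage"];

// Built-in defaults as described in this README (assumed to be the complete set).
const DEFAULTS: AzureSpeechConfig = { voice: "en-US-JennyNeural", sttLanguage: "zh-CN" };

// Map the documented environment variables onto the config shape.
function fromEnv(env: Record<string, string | undefined>): AzureSpeechConfig {
  return {
    subscriptionKey: env["AZURE_SPEECH_KEY"],
    region: env["AZURE_SPEECH_REGION"],
    voice: env["AZURE_SPEECH_VOICE"],
    sttLanguage: env["AZURE_SPEECH_STT_LANGUAGE"],
  };
}

// Layers are passed highest priority first; the first non-empty value wins.
function resolveConfig(...layers: AzureSpeechConfig[]): AzureSpeechConfig {
  const resolved: AzureSpeechConfig = {};
  for (const key of CONFIG_KEYS) {
    for (const layer of layers) {
      const value = layer[key];
      if (value !== undefined && value !== "") {
        resolved[key] = value;
        break;
      }
    }
  }
  return resolved;
}

const resolved = resolveConfig(
  { voice: "zh-CN-YunxiNeural" },                          // messages.tts.providers.azure
  { subscriptionKey: "plugin-key", region: "eastasia" },   // plugins.entries.azure-speech.config
  fromEnv({ AZURE_SPEECH_REGION: "westus", AZURE_SPEECH_VOICE: "en-US-GuyNeural" }),
  DEFAULTS,
);
console.log(resolved);
```

In this example the region comes from the plugin config rather than the environment, and the voice comes from the TTS provider config. In a real process, `fromEnv(process.env)` would supply the environment layer.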

TTS directives

When messages.tts.modelOverrides.enabled is true (the default), the model can override TTS settings for an individual reply by emitting directives such as:

[[tts:voice=zh-CN-YunxiNeural]]
[[tts:voiceId=en-US-GuyNeural]]
[[tts:outputFormat=ogg-48khz-16bit-mono-opus]]
[[tts:lang=ja-JP]]
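
To make the directive syntax concrete, here is a rough sketch of how `[[tts:key=value]]` markers could be extracted from a reply. It is not the plugin's or OpenClaw's actual parser: `parseTtsDirectives` is a hypothetical name, and stripping the directives from the spoken text is an assumption.

```typescript
interface ParsedReply {
  text: string;                      // reply with directives removed
  overrides: Record<string, string>; // e.g. { voice: "zh-CN-YunxiNeural" }
}

// Matches [[tts:key=value]] where value contains no whitespace or "]".
const TTS_DIRECTIVE = /\[\[tts:([A-Za-z]+)=([^\]\s]+)\]\]/g;

function parseTtsDirectives(reply: string): ParsedReply {
  const overrides: Record<string, string> = {};
  const text = reply
    .replace(TTS_DIRECTIVE, (_match: string, key: string, value: string) => {
      overrides[key] = value; // a later directive for the same key wins
      return "";
    })
    .replace(/[ \t]{2,}/g, " ") // collapse gaps left by removed directives
    .trim();
  return { text, overrides };
}

const parsed = parseTtsDirectives("[[tts:voice=zh-CN-YunxiNeural]] 你好 [[tts:lang=ja-JP]]");
console.log(parsed.text, parsed.overrides);
```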

Supported output formats

| Format | Use case |
|--------|----------|
| audio-24khz-48kbitrate-mono-mp3 | Default, good for most channels |
| audio-24khz-96kbitrate-mono-mp3 | Higher quality MP3 |
| ogg-48khz-16bit-mono-opus | Voice messages (Telegram, WhatsApp, etc.) |
| riff-24khz-16bit-mono-pcm | WAV/PCM |
| audio-48khz-192kbitrate-mono-mp3 | High-fidelity MP3 |

See Microsoft's Azure text-to-speech REST API documentation for the full list of output formats.
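
As an illustration of how an output format ends up in a synthesis call, the sketch below picks a format per channel and assembles a request for Azure's text-to-speech REST endpoint. The helper names, the channel identifiers, and the User-Agent value are assumptions; the endpoint path and the `X-Microsoft-OutputFormat`, `Content-Type`, and `Ocp-Apim-Subscription-Key` headers follow Microsoft's REST API. The plugin's own logic may differ.

```typescript
// Escape XML-significant characters so reply text cannot break the SSML.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical channel-to-format mapping: Ogg/Opus for voice-message channels.
function pickOutputFormat(channel: string): string {
  const voiceMessageChannels = ["telegram", "whatsapp"];
  return voiceMessageChannels.indexOf(channel.toLowerCase()) !== -1
    ? "ogg-48khz-16bit-mono-opus"
    : "audio-24khz-48kbitrate-mono-mp3";
}

interface TtsRequestOptions {
  region: string;
  subscriptionKey: string;
  voice: string;
  text: string;
  channel: string;
  lang?: string; // xml:lang for the SSML document; defaults to en-US here
}

interface TtsRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildTtsRequest(opts: TtsRequestOptions): TtsRequest {
  const lang = opts.lang ?? "en-US";
  const ssml =
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${lang}">` +
    `<voice name="${opts.voice}">${escapeXml(opts.text)}</voice></speak>`;
  return {
    url: `https://${opts.region}.tts.speech.microsoft.com/cognitiveservices/v1`,
    method: "POST",
    headers: {
      "Ocp-Apim-Subscription-Key": opts.subscriptionKey,
      "Content-Type": "application/ssml+xml",
      "X-Microsoft-OutputFormat": pickOutputFormat(opts.channel),
      "User-Agent": "openclaw-azure-speech", // assumed value
    },
    body: ssml,
  };
}

const ttsRequest = buildTtsRequest({
  region: "eastasia",
  subscriptionKey: "your-azure-speech-key",
  voice: "zh-CN-XiaoxiaoNeural",
  text: "你好",
  channel: "telegram",
  lang: "zh-CN",
});
console.log(ttsRequest.headers["X-Microsoft-OutputFormat"]);
```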

Popular voices

| Voice | Language | Gender | Styles |
|-------|----------|--------|--------|
| zh-CN-XiaoxiaoNeural | Chinese (Mandarin) | Female | cheerful, sad, angry, ... |
| zh-CN-YunxiNeural | Chinese (Mandarin) | Male | narration, cheerful, ... |
| en-US-JennyNeural | English (US) | Female | — |
| en-US-GuyNeural | English (US) | Male | — |
| ja-JP-NanamiNeural | Japanese | Female | — |

Use the listVoices API to browse all 400+ voices.
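
The sketch below shows one way locale filtering could work on top of Azure's voices/list REST endpoint. It assumes that listVoices wraps that endpoint, which this README does not confirm; `voicesListUrl` and `filterVoicesByLocale` are hypothetical helpers, and the interface covers only a subset of the fields Azure returns.

```typescript
// Subset of the fields in each entry of Azure's voices/list response.
interface AzureVoice {
  ShortName: string;
  Locale: string;
  Gender: string;
  StyleList?: string[];
}

// GET this URL with an Ocp-Apim-Subscription-Key header to fetch all voices.
function voicesListUrl(region: string): string {
  return `https://${region}.tts.speech.microsoft.com/cognitiveservices/voices/list`;
}

// Keep voices whose locale matches exactly ("zh-CN") or by language prefix ("zh").
function filterVoicesByLocale(voices: AzureVoice[], locale: string): AzureVoice[] {
  const wanted = locale.toLowerCase();
  return voices.filter((voice) => {
    const actual = voice.Locale.toLowerCase();
    return actual === wanted || actual.indexOf(wanted + "-") === 0;
  });
}

// Hand-written sample data standing in for a real API response.
const sampleVoices: AzureVoice[] = [
  { ShortName: "zh-CN-XiaoxiaoNeural", Locale: "zh-CN", Gender: "Female", StyleList: ["cheerful", "sad"] },
  { ShortName: "zh-CN-YunxiNeural", Locale: "zh-CN", Gender: "Male" },
  { ShortName: "en-US-JennyNeural", Locale: "en-US", Gender: "Female" },
  { ShortName: "ja-JP-NanamiNeural", Locale: "ja-JP", Gender: "Female" },
];

console.log(filterVoicesByLocale(sampleVoices, "zh-CN").map((voice) => voice.ShortName));
```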

CJK auto-switch

When using the default English voice (en-US-JennyNeural), the plugin automatically switches to zh-CN-XiaoxiaoNeural if CJK characters are dominant in the text. This means you don't need to configure a Chinese voice explicitly for Chinese-dominant usage.
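
A minimal sketch of this kind of detection follows. The README does not say how "dominant" is measured, so the threshold here (more than half of the non-whitespace characters) and the character ranges are assumptions, as are the names `cjkRatio` and `pickVoice`.

```typescript
const DEFAULT_VOICE = "en-US-JennyNeural";
const CJK_VOICE = "zh-CN-XiaoxiaoNeural";

// Hiragana/Katakana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;

// Fraction of non-whitespace characters that fall in the CJK ranges above.
function cjkRatio(text: string): number {
  const compact = text.replace(/\s+/g, "");
  if (compact.length === 0) return 0;
  let cjkCount = 0;
  for (let i = 0; i < compact.length; i++) {
    if (CJK_CHAR.test(compact.charAt(i))) cjkCount++;
  }
  return cjkCount / compact.length;
}

function pickVoice(text: string, configuredVoice: string = DEFAULT_VOICE): string {
  // Only auto-switch when the default voice is in use; explicit choices are kept.
  if (configuredVoice !== DEFAULT_VOICE) return configuredVoice;
  return cjkRatio(text) > 0.5 ? CJK_VOICE : DEFAULT_VOICE;
}

console.log(pickVoice("你好，世界"));
console.log(pickVoice("Hello world"));
```

Because the switch only applies to the default voice, a voice set in configuration or through a directive is left untouched in this sketch.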

Architecture

The plugin registers three OpenClaw capabilities:

| Capability | Registration | Purpose |
|-----------|-------------|---------|
| SpeechProvider | api.registerSpeechProvider() | TTS synthesis |
| RealtimeTranscriptionProvider | api.registerRealtimeTranscriptionProvider() | Streaming STT via WebSocket |
| MediaUnderstandingProvider | api.registerMediaUnderstandingProvider() | Audio file transcription (short audio ≤60s) |

All API calls use pure fetch (zero runtime dependencies). The WebSocket STT uses Node.js native WebSocket (requires Node ≥ 22).
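
Since streaming STT depends on the runtime's built-in WebSocket, a feature check like the one below can catch an older Node.js early. This is a hypothetical guard, not part of the plugin; `hasNativeWebSocket` is an illustrative name.

```typescript
// Returns true when the given scope exposes a WebSocket constructor.
function hasNativeWebSocket(scope: object = globalThis): boolean {
  return typeof (scope as { WebSocket?: unknown }).WebSocket === "function";
}

if (!hasNativeWebSocket()) {
  console.warn("Native WebSocket not found; streaming STT needs Node 22 or newer.");
}
```

Checking for the constructor rather than parsing the version string keeps the guard valid on any runtime that provides a global WebSocket.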

Development

git clone https://github.com/sawyer0x110/openclaw-azure-speech
cd openclaw-azure-speech
npm install
npm run build
npm test           # 63 unit tests
npm run typecheck  # TypeScript strict mode

License

MIT
