openclaw-azure-speech
v1.0.0
Azure AI Speech Service integration for OpenClaw — TTS and STT with 400+ neural voices, SSML support, and enterprise SLA.
Features
- TTS (Text-to-Speech): 400+ neural voices across 140+ languages/locales
- STT (Speech-to-Text): Short audio transcription via REST API + realtime streaming via WebSocket
- Zero dependencies: Pure fetch for REST calls, native WebSocket for streaming
- SSML support: Full SSML override for fine-grained voice control
- Voice listing: Browse available voices filtered by locale
- Channel-aware output: Automatic opus/ogg for Telegram/WhatsApp voice messages, MP3 for others
- CJK auto-switch: Automatically uses Chinese voice when CJK text is detected
- Model directive support: Override voice with [[tts:voice=zh-CN-YunxiNeural]]
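As a rough illustration of the directive syntax in the last feature, the sketch below pulls `[[tts:key=value]]` markers out of a reply before synthesis. The function name, the regular expression, and the "last directive wins" rule are assumptions made for this example, not the plugin's actual parser.

```typescript
// Hypothetical helper: extracts [[tts:key=value]] directives from a reply.
// The exact parsing rules here are assumptions for illustration.
interface ParsedReply {
  text: string; // reply text with directives removed
  directives: Record<string, string>; // e.g. { voice: "zh-CN-YunxiNeural" }
}

const DIRECTIVE_PATTERN = /\[\[tts:([A-Za-z]+)=([^\]\s]+)\]\]/g;

function parseTtsDirectives(reply: string): ParsedReply {
  const directives: Record<string, string> = {};
  const text = reply
    .replace(DIRECTIVE_PATTERN, (_match: string, key: string, value: string) => {
      directives[key] = value; // a later directive with the same key wins
      return "";
    })
    .replace(/[ \t]{2,}/g, " ") // collapse gaps left by removed directives
    .trim();
  return { text, directives };
}
```

The text that remains after stripping is what would be sent to synthesis; the collected directives can then override the configured voice, language, or output format for that reply.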
Prerequisites
- An Azure account with a Speech Service resource
- Get your subscription key and region from the Azure Portal → Speech resource → Keys and Endpoint
Installation
openclaw plugins install openclaw-azure-speech
Or link locally for development:
openclaw plugins install -l /path/to/openclaw-azure-speech
Configuration
Minimal setup (TTS only)
Add to your openclaw.json:
{
  plugins: {
    entries: {
      "azure-speech": {
        config: {
          subscriptionKey: "your-azure-speech-key",
          region: "eastasia",
        }
      }
    }
  },
  messages: {
    tts: {
      provider: "azure", // use Azure as TTS provider
    }
  }
}
Full setup (TTS + STT)
STT requires an additional models.providers.azure entry for OpenClaw's media understanding auth pipeline:
{
  // Plugin config
  plugins: {
    entries: {
      "azure-speech": {
        config: {
          subscriptionKey: "your-azure-speech-key",
          region: "eastasia",
          voice: "zh-CN-XiaoxiaoNeural", // optional, default auto-detects CJK
          sttLanguage: "zh-CN", // optional, default zh-CN
        }
      }
    }
  },
  // TTS provider selection
  messages: {
    tts: {
      provider: "azure",
    }
  },
  // STT auth (required for audio transcription)
  models: {
    providers: {
      azure: {
        apiKey: "your-azure-speech-key",
        baseUrl: "https://eastasia.stt.speech.microsoft.com",
        models: []
      }
    }
  },
  // Audio transcription model entry
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "azure", model: "default", language: "zh-CN" }
        ]
      }
    }
  }
}
Environment variables (alternative)
You can use environment variables instead of or alongside openclaw.json:
AZURE_SPEECH_KEY=your-key # subscription key
AZURE_SPEECH_REGION=eastasia # Azure region
AZURE_SPEECH_VOICE=zh-CN-XiaoxiaoNeural # optional, default TTS voice
AZURE_SPEECH_STT_LANGUAGE=zh-CN # optional, default STT language
Config resolution priority
The plugin resolves configuration from multiple sources (highest priority first):
- messages.tts.providers.azure (standard OpenClaw TTS provider config)
- plugins.entries.azure-speech.config (plugin config)
- Environment variables
- Built-in defaults
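The priority order above amounts to "first defined value wins" per field. The following is a minimal sketch of that idea, not the plugin's actual resolver: `resolveConfig` and `configFromEnv` are hypothetical names, and the environment is passed in as a plain object so the example runs anywhere. The default voice and STT language are the ones this README states.

```typescript
// Hypothetical illustration of layered config resolution.
type ConfigKey = "subscriptionKey" | "region" | "voice" | "sttLanguage";
type AzureSpeechConfig = Partial<Record<ConfigKey, string | undefined>>;
type Env = Record<string, string | undefined>;

const CONFIG_KEYS: ConfigKey[] = ["subscriptionKey", "region", "voice", "sttLanguage"];

// Maps the documented environment variables onto config fields.
function configFromEnv(env: Env): AzureSpeechConfig {
  return {
    subscriptionKey: env["AZURE_SPEECH_KEY"],
    region: env["AZURE_SPEECH_REGION"],
    voice: env["AZURE_SPEECH_VOICE"],
    sttLanguage: env["AZURE_SPEECH_STT_LANGUAGE"],
  };
}

// Sources are passed highest priority first; the first non-empty value wins.
function resolveConfig(...sources: AzureSpeechConfig[]): AzureSpeechConfig {
  const resolved: AzureSpeechConfig = {};
  for (const key of CONFIG_KEYS) {
    for (const source of sources) {
      const value = source[key];
      if (value !== undefined && value !== "") {
        resolved[key] = value;
        break;
      }
    }
  }
  return resolved;
}

const exampleConfig = resolveConfig(
  {}, // messages.tts.providers.azure
  { region: "eastasia" }, // plugins.entries.azure-speech.config
  configFromEnv({ AZURE_SPEECH_KEY: "key-from-env", AZURE_SPEECH_REGION: "westus" }),
  { voice: "en-US-JennyNeural", sttLanguage: "zh-CN" }, // built-in defaults
);
```

In this example the region comes from the plugin config (`eastasia`) even though the environment also sets one, while the key falls through to the environment and the voice to the defaults.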
TTS directives
When messages.tts.modelOverrides.enabled is true (default), the model can override TTS settings per-reply:
[[tts:voice=zh-CN-YunxiNeural]]
[[tts:voiceId=en-US-GuyNeural]]
[[tts:outputFormat=ogg-48khz-16bit-mono-opus]]
[[tts:lang=ja-JP]]
Supported output formats
| Format | Use case |
|--------|----------|
| audio-24khz-48kbitrate-mono-mp3 | Default, good for most channels |
| audio-24khz-96kbitrate-mono-mp3 | Higher quality MP3 |
| ogg-48khz-16bit-mono-opus | Voice messages (Telegram, WhatsApp, etc.) |
| riff-24khz-16bit-mono-pcm | WAV/PCM |
| audio-48khz-192kbitrate-mono-mp3 | High-fidelity MP3 |
See Microsoft docs for the full list.
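Channel-aware output can be pictured as a small lookup over the formats in the table. This is only a sketch: the channel identifiers (`"telegram"`, `"whatsapp"`) and the function name are assumptions, and the plugin's real selection logic may differ.

```typescript
// Format strings are taken from the table above; channel names are assumed.
const OPUS_VOICE_FORMAT = "ogg-48khz-16bit-mono-opus";
const DEFAULT_MP3_FORMAT = "audio-24khz-48kbitrate-mono-mp3";
const VOICE_MESSAGE_CHANNELS = new Set<string>(["telegram", "whatsapp"]);

// Hypothetical helper: an explicit override (for example from a
// [[tts:outputFormat=...]] directive) takes precedence over the channel default.
function pickOutputFormat(channel: string, override?: string): string {
  if (override) return override;
  return VOICE_MESSAGE_CHANNELS.has(channel.toLowerCase())
    ? OPUS_VOICE_FORMAT
    : DEFAULT_MP3_FORMAT;
}
```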
Popular voices
| Voice | Language | Gender | Styles |
|-------|----------|--------|--------|
| zh-CN-XiaoxiaoNeural | Chinese (Mandarin) | Female | cheerful, sad, angry, ... |
| zh-CN-YunxiNeural | Chinese (Mandarin) | Male | narration, cheerful, ... |
| en-US-JennyNeural | English (US) | Female | — |
| en-US-GuyNeural | English (US) | Male | — |
| ja-JP-NanamiNeural | Japanese | Female | — |
Use the listVoices API to browse all 400+ voices.
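The plugin's own listVoices signature is not shown in this README. To illustrate what such a lookup involves, the sketch below queries the Azure voices list REST endpoint directly and filters by locale. The helper names are hypothetical, and only a subset of the response fields is typed.

```typescript
// Subset of the fields returned by the Azure voices/list REST endpoint.
interface AzureVoice {
  ShortName: string;
  Locale: string;
  Gender: string;
  StyleList?: string[];
}

// Hypothetical helper: builds the request for the voices list endpoint.
function buildVoicesListRequest(region: string, subscriptionKey: string) {
  return {
    url: `https://${region}.tts.speech.microsoft.com/cognitiveservices/voices/list`,
    headers: { "Ocp-Apim-Subscription-Key": subscriptionKey },
  };
}

// Keeps voices whose locale matches exactly ("zh-CN") or by language prefix ("zh").
function filterVoicesByLocale(voices: AzureVoice[], locale: string): AzureVoice[] {
  const wanted = locale.toLowerCase();
  return voices.filter((voice) => {
    const actual = voice.Locale.toLowerCase();
    return actual === wanted || actual.startsWith(wanted + "-");
  });
}

async function fetchVoices(
  region: string,
  subscriptionKey: string,
  locale?: string,
): Promise<AzureVoice[]> {
  const { url, headers } = buildVoicesListRequest(region, subscriptionKey);
  const response = await fetch(url, { headers });
  if (!response.ok) {
    throw new Error(`Voice list request failed: HTTP ${response.status}`);
  }
  const voices = (await response.json()) as AzureVoice[];
  return locale ? filterVoicesByLocale(voices, locale) : voices;
}
```

Calling `fetchVoices("eastasia", key, "zh-CN")` would return only the Mandarin voices available in that region.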
CJK auto-switch
When using the default English voice (en-US-JennyNeural), the plugin automatically switches to zh-CN-XiaoxiaoNeural if CJK characters are dominant in the text. This means you don't need to configure a Chinese voice explicitly for Chinese-dominant usage.
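The README does not define "dominant", so the sketch below assumes it means more than half of the non-whitespace characters fall in common CJK ranges. The threshold, the character ranges, and the function names are all assumptions for illustration.

```typescript
const DEFAULT_VOICE = "en-US-JennyNeural";
const CHINESE_VOICE = "zh-CN-XiaoxiaoNeural";

// Hiragana/Katakana, CJK Unified Ideographs (plus Extension A), Hangul syllables.
const CJK_CHARS = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g;

// Assumption: "dominant" = CJK share of non-whitespace characters above threshold.
// Lengths are counted in UTF-16 code units, which is close enough for a sketch.
function isCjkDominant(text: string, threshold = 0.5): boolean {
  const stripped = text.replace(/\s+/g, "");
  if (stripped.length === 0) return false;
  const cjkCount = (stripped.match(CJK_CHARS) ?? []).length;
  return cjkCount / stripped.length > threshold;
}

// Only the default English voice is switched; an explicitly chosen voice is kept.
function pickVoice(text: string, configuredVoice: string = DEFAULT_VOICE): string {
  if (configuredVoice === DEFAULT_VOICE && isCjkDominant(text)) {
    return CHINESE_VOICE;
  }
  return configuredVoice;
}
```

With these assumptions, `pickVoice("你好世界")` yields the Chinese voice, while `pickVoice("Hello 你好")` stays on the English default because CJK characters are the minority.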
Architecture
The plugin registers three OpenClaw capabilities:
| Capability | Registration | Purpose |
|-----------|-------------|---------|
| SpeechProvider | api.registerSpeechProvider() | TTS synthesis |
| RealtimeTranscriptionProvider | api.registerRealtimeTranscriptionProvider() | Streaming STT via WebSocket |
| MediaUnderstandingProvider | api.registerMediaUnderstandingProvider() | Audio file transcription (short audio ≤60s) |
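For the short-audio row, the underlying call is a single POST to the Azure short-audio recognition REST endpoint. The sketch below assumes 16 kHz PCM WAV input and uses hypothetical helper names. It does not show how the plugin wires this into registerMediaUnderstandingProvider, which this README does not document.

```typescript
// Hypothetical helper: request details for the short-audio recognition endpoint.
function buildShortAudioRequest(region: string, subscriptionKey: string, language = "zh-CN") {
  const base = `https://${region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1`;
  return {
    url: `${base}?language=${encodeURIComponent(language)}&format=simple`,
    headers: {
      "Ocp-Apim-Subscription-Key": subscriptionKey,
      // Assumption: the caller supplies 16 kHz mono PCM in a WAV container.
      "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
      Accept: "application/json",
    },
  };
}

async function transcribeShortAudio(
  wavAudio: ArrayBuffer,
  region: string,
  subscriptionKey: string,
  language = "zh-CN",
): Promise<string> {
  const { url, headers } = buildShortAudioRequest(region, subscriptionKey, language);
  const response = await fetch(url, { method: "POST", headers, body: wavAudio });
  if (!response.ok) {
    throw new Error(`Transcription request failed: HTTP ${response.status}`);
  }
  const result = (await response.json()) as { RecognitionStatus: string; DisplayText?: string };
  if (result.RecognitionStatus !== "Success") {
    throw new Error(`Recognition did not succeed: ${result.RecognitionStatus}`);
  }
  return result.DisplayText ?? "";
}
```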
All API calls use pure fetch (zero runtime dependencies). The WebSocket STT uses Node.js native WebSocket (requires Node ≥ 22).
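A dependency-free TTS call might look roughly like the sketch below, which wraps text in minimal SSML and posts it to the Azure text-to-speech REST endpoint. The function names and the User-Agent value are placeholders for this example, not the plugin's actual code.

```typescript
// Escapes characters that are special in XML so plain text is safe inside SSML.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Builds a minimal SSML document with a single voice element.
function buildSsml(text: string, voice: string, lang: string): string {
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${escapeXml(lang)}">` +
    `<voice name="${escapeXml(voice)}">${escapeXml(text)}</voice></speak>`
  );
}

// Hypothetical wrapper around the REST call; returns the encoded audio bytes.
async function synthesize(
  text: string,
  region: string,
  subscriptionKey: string,
  voice = "en-US-JennyNeural",
  outputFormat = "audio-24khz-48kbitrate-mono-mp3",
): Promise<ArrayBuffer> {
  const lang = voice.split("-").slice(0, 2).join("-"); // "en-US-JennyNeural" -> "en-US"
  const response = await fetch(`https://${region}.tts.speech.microsoft.com/cognitiveservices/v1`, {
    method: "POST",
    headers: {
      "Ocp-Apim-Subscription-Key": subscriptionKey,
      "Content-Type": "application/ssml+xml",
      "X-Microsoft-OutputFormat": outputFormat,
      "User-Agent": "openclaw-azure-speech-example", // placeholder value
    },
    body: buildSsml(text, voice, lang),
  });
  if (!response.ok) {
    throw new Error(`Synthesis request failed: HTTP ${response.status}`);
  }
  return response.arrayBuffer();
}
```

Escaping the input matters because reply text containing `&` or `<` would otherwise produce invalid SSML and a rejected request.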
Development
git clone https://github.com/sawyer0x110/openclaw-azure-speech
cd openclaw-azure-speech
npm install
npm run build
npm test # 63 unit tests
npm run typecheck # TypeScript strict mode
License
MIT
