minimax-speech-ts
v0.4.1
Type-safe MiniMax TTS client for Node.js — sync & streaming synthesis, voice cloning, voice design. ESM + CJS.
MiniMax TTS SDK for JavaScript / TypeScript
Type-safe MiniMax TTS client for Node.js. Full API coverage — sync and streaming synthesis, voice cloning, voice design, and voice management — with a single runtime dependency. Ships ESM + CJS with complete TypeScript declarations. (Unofficial)
API Reference | npm | GitHub
Features
- Full API coverage: sync, streaming (SSE), async, voice cloning, voice design, voice management
- Zero config: `npm install`, pass your API key, get audio back
- `ReadableStream<Buffer>` streaming: pipe directly to a file, HTTP response, or WebSocket
- Typed error hierarchy: `instanceof` checks for auth, rate-limit, and validation errors
- Client-side validation: catches bad params before the network round-trip
- camelCase in, snake_case on the wire: no manual conversion needed
- Dual output: ESM and CommonJS with `.d.ts` declarations
Quick Start
- Get an API key from platform.minimax.io
- `npm install minimax-speech-ts`
- Run:
import { MiniMaxSpeech } from 'minimax-speech-ts'
import fs from 'node:fs'
const client = new MiniMaxSpeech({
apiKey: process.env.MINIMAX_API_KEY!,
groupId: process.env.MINIMAX_GROUP_ID, // optional
})
const result = await client.synthesize({
text: 'Hello, world!',
model: 'speech-02-hd',
voiceSetting: { voiceId: 'English_expressive_narrator' },
})
await fs.promises.writeFile('output.mp3', result.audio) // → output.mp3
Highlights
Stream audio to a file
const { audio } = await client.synthesizeStream({
text: 'Stream me!',
voiceSetting: { voiceId: 'English_expressive_narrator' },
audioSetting: { format: 'mp3' },
})
const writer = fs.createWriteStream('output.mp3')
for await (const chunk of audio) writer.write(chunk)
writer.end()
Synthesize with emotion
const result = await client.synthesize({
text: 'I am so happy to meet you!',
voiceSetting: { voiceId: 'English_expressive_narrator', emotion: 'happy' },
})
Clone a voice
const file = new Blob([await fs.promises.readFile('sample.mp3')], { type: 'audio/mp3' })
const upload = await client.uploadFile(file, 'voice_clone')
await client.cloneVoice({ fileId: upload.file.fileId, voiceId: 'my-voice' })
Design a voice from a description
const voice = await client.designVoice({
prompt: 'A warm female voice with a slight British accent',
previewText: 'Hello, this is a preview.',
voiceId: 'my-designed-voice',
})
Why this SDK?
Compared to calling the MiniMax API with raw fetch:
- Automatic camelCase ↔ snake_case: write idiomatic JS, the SDK converts for the wire
- Request validation: catches invalid params, emotion/model mismatches, and format conflicts before the network call
- Typed errors: `MiniMaxAuthError`, `MiniMaxRateLimitError`, `MiniMaxValidationError` with `statusCode` and `traceId`
- Streaming handled internally: SSE parsing and hex-to-Buffer decoding are built in
- One dependency: only `eventsource-parser` for SSE; everything else is native Node.js
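To make the camelCase-to-snake_case point concrete, here is a minimal sketch of the kind of key conversion you would otherwise write by hand around raw `fetch`. The helpers `snakeKey` and `toSnakeCase` are hypothetical names for illustration, not the SDK's internals, and the SDK's real conversion may treat some fields differently.

```typescript
type Json = string | number | boolean | null | undefined | Json[] | { [key: string]: Json }

// Convert one camelCase key, e.g. "voiceId" -> "voice_id".
function snakeKey(key: string): string {
  return key.replace(/[A-Z]/g, (ch) => `_${ch.toLowerCase()}`)
}

// Recursively rename object keys; arrays are mapped, primitives pass through.
// Hypothetical helper: a naive version like this renames every key it sees.
function toSnakeCase(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => toSnakeCase(item))
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {}
    for (const [k, v] of Object.entries(value)) out[snakeKey(k)] = toSnakeCase(v)
    return out
  }
  return value
}

const wire = toSnakeCase({
  text: 'Hello!',
  voiceSetting: { voiceId: 'English_expressive_narrator' },
  timbreWeights: [{ voiceId: 'voice-1', weight: 0.5 }],
})
console.log(JSON.stringify(wire))
```

With the SDK none of this is needed: you pass `voiceSetting.voiceId` and the client handles the wire format.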
API
Constructor
new MiniMaxSpeech({
apiKey: string // Required. MiniMax API key.
groupId?: string // Optional. MiniMax group ID, appended as ?GroupId= query param.
apiHost?: string // Optional. Defaults to 'https://api.minimax.io'.
// For reduced TTFA, try 'https://api-uw.minimax.io'.
})
synthesize(request): Promise<SynthesizeResult>
Synchronous text-to-speech. Returns decoded audio as a Buffer.
const result = await client.synthesize({
text: 'Hello!',
model: 'speech-02-hd', // optional, defaults to 'speech-02-hd'
voiceSetting: {
voiceId: 'English_expressive_narrator',
speed: 1.0,
vol: 1.0,
pitch: 0,
emotion: 'happy', // speech-02-*/speech-2.6-*/speech-2.8-* only
},
audioSetting: {
format: 'mp3', // 'mp3' | 'pcm' | 'flac' | 'wav' | 'pcmu_raw' | 'pcmu_wav' | 'opus'
sampleRate: 32000,
bitrate: 128000,
channel: 1,
},
languageBoost: 'English',
voiceModify: {
pitch: 0, // -100 to 100
intensity: 0, // -100 to 100
timbre: 0, // -100 to 100
soundEffects: 'robotic', // optional
},
timbreWeights: [ // mix multiple voices
{ voiceId: 'voice-1', weight: 0.5 },
{ voiceId: 'voice-2', weight: 0.5 },
],
subtitleEnable: false,
subtitleType: 'sentence', // 'sentence' | 'word' ('word_streaming' is streaming-only — use synthesizeStream)
pronunciationDict: { tone: ['处理/(chǔ lǐ)'] },
})
result.audio // Buffer
result.extraInfo // { audioLength, audioSampleRate, audioSize, bitrate, wordCount, usageCharacters, ... }
result.traceId // string
result.subtitleFile // string | undefined
Pass `outputFormat: 'url'` to receive a URL string instead of a decoded buffer:
const result = await client.synthesize({
text: 'Hello!',
outputFormat: 'url',
})
result.audio // string (URL)
synthesizeStream(request): Promise<SynthesizeStreamResult>
Streaming text-to-speech via SSE. Returns { audio, subtitle, extraInfo, traceId } — a ReadableStream<Buffer> of audio chunks plus three promises resolved from the final aggregated chunk: the subtitle file URL (string | undefined), the parsed extraInfo (ExtraInfo | undefined — audio length, size, billable characters, …), and the traceId (string | undefined) for MiniMax support.
Drain `audio` first. `subtitle`, `extraInfo`, and `traceId` only settle once `audio` is being consumed (reading audio is what pumps the underlying SSE source). Awaiting them before reading or cancelling `audio` will hang. Use `Promise.all([drainAudio, extraInfo])` if you need both concurrently. None of them ever reject; they resolve to `undefined` on early end, API error, transport error, or cancellation.
`streamOptions.excludeAggregatedAudio` follows the MiniMax API default (`false`, meaning the final chunk re-includes the full re-concatenated clip). That aggregated audio is never enqueued either way, so `extraInfo`/`traceId` are unaffected by this flag. Pass `{ excludeAggregatedAudio: true }` to skip the redundant re-transmit and save bandwidth.
WAV format is not supported in streaming mode.
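The drain-first rule can be tried without touching the network. The sketch below uses a stand-in stream whose metadata promise settles only after the audio has been fully pulled, then awaits both with `Promise.all`. `drainStream` and `fakeStreamResult` are hypothetical names for illustration; only the shape of the result (`audio` plus an `extraInfo` promise) is taken from the description above.

```typescript
// Read a web ReadableStream to completion and return the concatenated bytes.
async function drainStream(stream: ReadableStream<Uint8Array>): Promise<Uint8Array> {
  const reader = stream.getReader()
  const chunks: Uint8Array[] = []
  let total = 0
  while (true) {
    const result = await reader.read()
    if (result.done) break
    chunks.push(result.value)
    total += result.value.length
  }
  const out = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset)
    offset += chunk.length
  }
  return out
}

// Stand-in for a synthesizeStream() result (hypothetical, for illustration only):
// extraInfo resolves only when the consumer has pulled the stream to its end.
function fakeStreamResult() {
  let resolveInfo: (info: { audioLength: number } | undefined) => void = () => {}
  const extraInfo = new Promise<{ audioLength: number } | undefined>((resolve) => {
    resolveInfo = resolve
  })
  const parts = [new Uint8Array([1, 2]), new Uint8Array([3])]
  const audio = new ReadableStream<Uint8Array>({
    pull(controller) {
      const next = parts.shift()
      if (next) {
        controller.enqueue(next)
      } else {
        controller.close()
        resolveInfo({ audioLength: 3 })
      }
    },
  })
  return { audio, extraInfo }
}

async function main() {
  const { audio, extraInfo } = fakeStreamResult()
  // Start draining first, then wait for the bytes and the metadata together.
  const [bytes, info] = await Promise.all([drainStream(audio), extraInfo])
  console.log(bytes.length, info)
}
main()
```

Awaiting `extraInfo` on its own before calling `drainStream` would never finish here, which mirrors the hang described above. With the real client, the `audio` stream returned by `synthesizeStream` should be usable in place of the stand-in.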
const { audio, subtitle, extraInfo, traceId } = await client.synthesizeStream({
text: 'Hello, streaming world!',
voiceSetting: { voiceId: 'English_expressive_narrator' },
audioSetting: { format: 'mp3' },
streamOptions: { excludeAggregatedAudio: true }, // optional — saves bandwidth
subtitleEnable: true, // optional
subtitleType: 'word_streaming', // 'word_streaming' is streaming-only
})
const writer = fs.createWriteStream('output.mp3')
for await (const chunk of audio) {
writer.write(chunk)
}
writer.end()
const subtitleUrl = await subtitle // undefined unless subtitleEnable was set
const info = await extraInfo // { audioLength, usageCharacters, … } or undefined
const trace = await traceId // undefined if no final chunk arrived
synthesizeAsync(request): Promise<AsyncSynthesizeResult>
Async text-to-speech for long-form content. Submit a task then poll for completion.
Provide either text or textFileId (mutually exclusive). WAV format is not supported.
const task = await client.synthesizeAsync({
text: 'A very long article...',
voiceSetting: { voiceId: 'English_expressive_narrator' },
})
task.taskId // number
task.fileId // number
task.taskToken // string
task.usageCharacters // number
querySynthesizeAsync(taskId): Promise<AsyncSynthesizeQueryResult>
Poll the status of an async synthesis task. On success you get a fileId — use the MiniMax File API to retrieve the audio. The synthesized file is only available for 9 hours after success; retrieve and store it before then.
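Polling can be wrapped in a small loop. The following is a sketch under stated assumptions: `waitForTask` is a hypothetical helper, the `TaskStatus` type only models the two fields documented here (`status` and `fileId`), and the interval and timeout defaults are arbitrary choices, not SDK settings.

```typescript
type TaskStatus = { status: 'processing' | 'success' | 'failed' | 'expired'; fileId?: number }

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Call `query` repeatedly until the task leaves 'processing' or the deadline passes.
// Resolves with the fileId on success; rejects on 'failed', 'expired', or timeout.
async function waitForTask(
  query: () => Promise<TaskStatus>,
  { intervalMs = 2000, timeoutMs = 10 * 60 * 1000 } = {},
): Promise<number> {
  const deadline = Date.now() + timeoutMs
  while (true) {
    const result = await query()
    if (result.status === 'success') {
      if (result.fileId === undefined) throw new Error('Task succeeded without a fileId')
      return result.fileId
    }
    if (result.status !== 'processing') {
      throw new Error(`Task ended with status: ${result.status}`)
    }
    if (Date.now() + intervalMs > deadline) throw new Error('Timed out waiting for task')
    await sleep(intervalMs)
  }
}
```

Assuming a `client` and `task` as in the surrounding examples, usage would look like `const fileId = await waitForTask(() => client.querySynthesizeAsync(task.taskId))`. A two-second interval stays well under the documented 10 QPS limit for this endpoint.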
const status = await client.querySynthesizeAsync(task.taskId)
status.status // 'processing' | 'success' | 'failed' | 'expired'
status.fileId // number (download file ID when status is 'success')
uploadFile(file, purpose, options?): Promise<FileUploadResult>
Upload a file. purpose is one of voice_clone, prompt_audio (audio samples for voice cloning), or t2a_async_input (a text file feeding synthesizeAsync). Accepts a Blob or a ReadableStream<Uint8Array>.
// Blob upload (buffered)
const audioBlob = new Blob([await fs.promises.readFile('voice.mp3')], { type: 'audio/mp3' })
const upload = await client.uploadFile(audioBlob, 'voice_clone')
upload.file.fileId // number
upload.file.bytes // number
upload.file.filename // string
For large files, pass a `ReadableStream<Uint8Array>` to upload without buffering the full payload in memory. The multipart body is assembled with per-chunk backpressure and cancellation propagation, so aborting the request cleanly releases the upstream source.
import { Readable } from 'node:stream'
import { createReadStream } from 'node:fs'
const stream = Readable.toWeb(createReadStream('big-voice.wav')) as ReadableStream<Uint8Array>
const upload = await client.uploadFile(stream, 'voice_clone', {
filename: 'big-voice.wav',
contentType: 'audio/wav', // optional, defaults to 'application/octet-stream'
})
listFiles(request): Promise<ListFilesResult>
List files filtered by purpose (voice_clone, prompt_audio, or t2a_async_input).
const { files } = await client.listFiles({ purpose: 'voice_clone' })
files[0].fileId // number
files[0].filename // string
files[0].bytes // number
retrieveFile(fileId): Promise<RetrieveFileResult>
Retrieve metadata for a single file.
const { file } = await client.retrieveFile(12345)
file.bytes // number
file.purpose // string
file.createdAt // number (unix seconds)
retrieveFileContent(fileId): Promise<Buffer>
Download the file bytes. Useful for fetching async-synthesis output once querySynthesizeAsync returns status: 'success'.
const audio = await client.retrieveFileContent(task.fileId)
await fs.promises.writeFile('output.mp3', audio)
deleteFile(request): Promise<DeleteFileResult>
Delete a file. purpose accepts the upload purposes plus t2a_async (async synthesis output) and video_generation.
await client.deleteFile({ fileId: 12345, purpose: 't2a_async' })
cloneVoice(request): Promise<VoiceCloneResult>
Clone a voice from an uploaded audio file.
const result = await client.cloneVoice({
fileId: upload.file.fileId,
voiceId: 'my-custom-voice', // 8-256 chars, must start with a letter
text: 'Preview text', // optional preview
model: 'speech-02-hd', // required if text is provided
needNoiseReduction: true,
needVolumeNormalization: true,
clonePrompt: { // optional prompt-based cloning
promptAudio: promptFileId,
promptText: 'Transcript of the prompt audio',
},
})
result.demoAudio // URL to preview audio (empty if no text provided)
result.inputSensitive // { type: number } — 0 = normal; 1–7 categorize the safety trigger
result.extraInfo // billing info (audioLength, usageCharacters, …) when text+model preview ran
designVoice(request): Promise<VoiceDesignResult>
Design a new voice from a text description.
const result = await client.designVoice({
prompt: 'A warm female voice with a slight British accent',
previewText: 'Hello, this is a preview of the designed voice.',
voiceId: 'my-designed-voice', // optional, auto-generated if omitted
})
result.voiceId // string
result.trialAudio // hex-encoded preview audio
getVoices(request): Promise<GetVoiceResult>
List available voices.
const voices = await client.getVoices({
voiceType: 'all', // 'system' | 'voice_cloning' | 'voice_generation' | 'all'
})
voices.systemVoice // SystemVoiceInfo[] — built-in voices
voices.voiceCloning // VoiceCloningInfo[] — your cloned voices
voices.voiceGeneration // VoiceGenerationInfo[] (your designed voices)
deleteVoice(request): Promise<DeleteVoiceResult>
Delete a cloned or designed voice.
const result = await client.deleteVoice({
voiceType: 'voice_cloning', // 'voice_cloning' | 'voice_generation'
voiceId: 'my-custom-voice',
})
Error Handling
The library provides a typed error hierarchy:
import {
MiniMaxClientError, // Client-side validation (bad params, before request is sent)
MiniMaxError, // Base class for all API errors
MiniMaxAuthError, // Authentication failures (codes 1004, 2042, 2049)
MiniMaxRateLimitError, // Rate limiting (codes 1002, 1039, 1041, 2045, 2056)
MiniMaxValidationError, // Server-side validation (codes 1008, 1026, 1027, 1042, 1043, 1044, 2013, 2037, 2039, 2048, 20132)
} from 'minimax-speech-ts'
try {
await client.synthesize({ text: 'Hello' })
} catch (e) {
if (e instanceof MiniMaxClientError) {
// Bad parameters — fix your request
console.error(e.message)
} else if (e instanceof MiniMaxAuthError) {
// Invalid API key
} else if (e instanceof MiniMaxRateLimitError) {
// Back off and retry
} else if (e instanceof MiniMaxValidationError) {
// Server rejected the request parameters
console.error(e.statusCode, e.statusMsg, e.traceId)
} else if (e instanceof MiniMaxError) {
// Other API error
console.error(e.statusCode, e.statusMsg)
}
}
Client-side validation catches common mistakes before making a request:
- Missing required fields (`text`, `voiceId`, etc.)
- Emotions with unsupported models (`speech-01-*` doesn't support emotions)
- `fluent` / `whisper` emotions with non-`speech-2.6-*` models
- WAV format in streaming or async mode
- `text` and `textFileId` both provided (mutually exclusive)
- `text` provided without `model` in voice cloning
Models
| Model | Emotions | Notes |
|-------|----------|-------|
| speech-2.8-hd | All except fluent, whisper | Latest HD |
| speech-2.8-turbo | All except fluent, whisper | Latest Turbo |
| speech-2.6-hd | All including fluent, whisper | |
| speech-2.6-turbo | All including fluent, whisper | |
| speech-02-hd | All except fluent, whisper | Default |
| speech-02-turbo | All except fluent, whisper | |
| speech-01-hd | None | |
| speech-01-turbo | None | |
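The table reduces to two rules, which the sketch below encodes. It is an illustration of the table only, not the SDK's validator: `isEmotionAllowed` is a hypothetical helper, and it does not check that the emotion name itself is one the API recognizes.

```typescript
// Emotions that the table restricts to the speech-2.6-* models.
const SPEECH_26_ONLY = new Set(['fluent', 'whisper'])

// Returns true when the table allows `emotion` for `model`.
function isEmotionAllowed(model: string, emotion: string): boolean {
  // speech-01-* models support no emotions at all.
  if (model.startsWith('speech-01-')) return false
  // fluent and whisper are available on speech-2.6-* only.
  if (SPEECH_26_ONLY.has(emotion)) return model.startsWith('speech-2.6-')
  return true
}

console.log(isEmotionAllowed('speech-02-hd', 'happy'))
console.log(isEmotionAllowed('speech-2.8-hd', 'whisper'))
```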
Text Features
The text field supports inline markup beyond plain content:
- Pause control: insert `<#x#>` between text segments to pause for `x` seconds (range `0.01`–`99.99`). Example: `Hello<#0.5#>world`.
- Inline pronunciation: override the pronunciation of a word with Mandarin pinyin (tones 1–5), IPA, or Cantonese jyutping (tones 1–6), wrapped in half-width parentheses immediately after the word:
  - `The word live is pronounced (lɪv) as a verb and (laɪv) as an adjective.`
  - `This is (he2)平, not (huo4)面.`
  - `去街市買啲(sung3)。`
- Interjection tags (`speech-2.8-hd` / `speech-2.8-turbo` only), which embed natural speech sounds: `(laughs)`, `(chuckle)`, `(coughs)`, `(clear-throat)`, `(groans)`, `(breath)`, `(pant)`, `(inhale)`, `(exhale)`, `(gasps)`, `(sniffs)`, `(sighs)`, `(snorts)`, `(burps)`, `(lip-smacking)`, `(humming)`, `(hissing)`, `(emm)`, `(sneezes)`.
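Pause tags are easy to mistype, so a small builder can help. This is a sketch with hypothetical helper names (`pauseTag`, `joinWithPauses`); the range check follows the documented 0.01 to 99.99 seconds, while rounding to two decimals is an assumption made here to keep the tag short.

```typescript
// Build a pause tag such as "<#0.5#>". Rejects values outside the documented range.
function pauseTag(seconds: number): string {
  if (!Number.isFinite(seconds) || seconds < 0.01 || seconds > 99.99) {
    throw new RangeError(`Pause must be between 0.01 and 99.99 seconds, got ${seconds}`)
  }
  // Assumption: round to two decimals and drop trailing zeros (0.50 -> 0.5).
  return `<#${Number(seconds.toFixed(2))}#>`
}

// Join text segments with the same pause between each pair.
function joinWithPauses(segments: string[], seconds: number): string {
  return segments.join(pauseTag(seconds))
}

console.log(joinWithPauses(['Hello', 'world'], 0.5)) // prints "Hello<#0.5#>world"
```

The resulting string goes straight into the `text` field of a synthesis request.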
Rate Limits
The API enforces these limits per account; the SDK surfaces 429-equivalent responses as MiniMaxRateLimitError. Build your own retry/backoff on top.
| Endpoint | Limit |
|----------|-------|
| synthesize / synthesizeStream / voice cloning | 60 RPM |
| designVoice | 20 RPM |
| querySynthesizeAsync | 10 QPS |
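Since retry is left to the caller, here is one possible shape for it: exponential backoff with full jitter. `withBackoff` is a hypothetical helper, not part of the SDK, and the retry count and delay defaults are arbitrary starting points rather than values recommended by MiniMax.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Run `fn`, retrying when `isRetryable(error)` is true, up to `retries` extra attempts.
async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  { retries = 4, baseMs = 1000, maxMs = 30000 } = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (error) {
      // Give up on non-retryable errors or once the retry budget is spent.
      if (attempt >= retries || !isRetryable(error)) throw error
      // Full jitter: wait a random time in [0, min(maxMs, baseMs * 2^attempt)).
      const cap = Math.min(maxMs, baseMs * 2 ** attempt)
      await sleep(Math.random() * cap)
    }
  }
}
```

With the SDK, the predicate would typically be `(e) => e instanceof MiniMaxRateLimitError`, for example `withBackoff(() => client.synthesize({ text: 'Hello' }), (e) => e instanceof MiniMaxRateLimitError)`. Other error types are rethrown immediately, so validation and auth failures are not retried.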
Use Cases
- Voice-over generation — generate narration audio from scripts for videos and podcasts
- Accessibility — add text-to-speech to web and Node.js applications
- Voice cloning — clone a voice from a short audio sample and synthesize new speech
- Voice design — create custom AI voices from text descriptions
- Real-time TTS streaming — stream audio chunks via SSE for chatbots, virtual assistants, and live applications
- Batch audio production — use async synthesis for long-form content like audiobooks and articles
Compatibility
- Node.js >= 18 (uses native `fetch` and `ReadableStream`)
- TypeScript >= 5.0
- Works with any MiniMax API key from platform.minimax.io
Contributing
See CONTRIBUTING.md for development setup and guidelines.
License
MIT
