minimax-speech-ts

v0.4.1

Type-safe MiniMax TTS client for Node.js — sync & streaming synthesis, voice cloning, voice design. ESM + CJS.

Downloads

491

MiniMax TTS SDK for JavaScript / TypeScript

Type-safe MiniMax TTS client for Node.js. Full API coverage — sync and streaming synthesis, voice cloning, voice design, and voice management — with a single runtime dependency. Ships ESM + CJS with complete TypeScript declarations. (Unofficial)

API Reference | npm | GitHub

Features

  • Full API coverage: sync, streaming (SSE), async, voice cloning, voice design, voice management
  • Zero config: npm install, pass your API key, get audio back
  • ReadableStream<Buffer> streaming: pipe directly to a file, HTTP response, or WebSocket
  • Typed error hierarchy: instanceof checks for auth, rate-limit, and validation errors
  • Client-side validation: catches bad params before the network round-trip
  • camelCase in, snake_case on the wire: no manual conversion needed
  • Dual output: ESM and CommonJS with .d.ts declarations

Quick Start

  1. Get an API key from platform.minimax.io
  2. npm install minimax-speech-ts
  3. Run:
import { MiniMaxSpeech } from 'minimax-speech-ts'
import fs from 'node:fs'

const client = new MiniMaxSpeech({
  apiKey: process.env.MINIMAX_API_KEY!,
  groupId: process.env.MINIMAX_GROUP_ID, // optional
})

const result = await client.synthesize({
  text: 'Hello, world!',
  model: 'speech-02-hd',
  voiceSetting: { voiceId: 'English_expressive_narrator' },
})

await fs.promises.writeFile('output.mp3', result.audio) // → output.mp3

Highlights

Stream audio to a file

const { audio } = await client.synthesizeStream({
  text: 'Stream me!',
  voiceSetting: { voiceId: 'English_expressive_narrator' },
  audioSetting: { format: 'mp3' },
})

const writer = fs.createWriteStream('output.mp3')
for await (const chunk of audio) writer.write(chunk)
writer.end()

Synthesize with emotion

const result = await client.synthesize({
  text: 'I am so happy to meet you!',
  voiceSetting: { voiceId: 'English_expressive_narrator', emotion: 'happy' },
})

Clone a voice

const file = new Blob([await fs.promises.readFile('sample.mp3')], { type: 'audio/mp3' })
const upload = await client.uploadFile(file, 'voice_clone')
await client.cloneVoice({ fileId: upload.file.fileId, voiceId: 'my-voice' })

Design a voice from a description

const voice = await client.designVoice({
  prompt: 'A warm female voice with a slight British accent',
  previewText: 'Hello, this is a preview.',
  voiceId: 'my-designed-voice',
})

Why this SDK?

Compared to calling the MiniMax API with raw fetch:

  • Automatic camelCase ↔ snake_case: write idiomatic JS, the SDK converts for the wire
  • Request validation: catches invalid params, emotion/model mismatches, and format conflicts before the network call
  • Typed errors: MiniMaxAuthError, MiniMaxRateLimitError, MiniMaxValidationError with statusCode and traceId
  • Streaming handled internally: SSE parsing and hex-to-Buffer decoding are built in
  • One dependency: only eventsource-parser for SSE; everything else is native Node.js
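
To make the camelCase to snake_case point concrete, here is a minimal sketch of a recursive key converter. It is an illustration only and not the SDK's internal code; the names `Json`, `toSnakeKey`, and `toSnakeCase` are hypothetical. It renames object keys and leaves values (such as voice IDs) untouched.

```typescript
// Illustrative only: a recursive camelCase -> snake_case key converter.
// This is NOT the SDK's implementation; all names here are hypothetical.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

function toSnakeKey(key: string): string {
  // Put "_" before each uppercase letter and lowercase it: voiceId -> voice_id
  return key.replace(/[A-Z]/g, (ch) => `_${ch.toLowerCase()}`)
}

function toSnakeCase(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => toSnakeCase(item))
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {}
    for (const [k, v] of Object.entries(value)) out[toSnakeKey(k)] = toSnakeCase(v)
    return out
  }
  return value // primitives pass through unchanged
}

const wire = toSnakeCase({
  text: 'Hello',
  voiceSetting: { voiceId: 'English_expressive_narrator' },
  timbreWeights: [{ voiceId: 'voice-1', weight: 0.5 }],
})
console.log(JSON.stringify(wire))
// {"text":"Hello","voice_setting":{"voice_id":"English_expressive_narrator"},"timbre_weights":[{"voice_id":"voice-1","weight":0.5}]}
```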

API

Constructor

new MiniMaxSpeech({
  apiKey: string        // Required. MiniMax API key.
  groupId?: string      // Optional. MiniMax group ID, appended as ?GroupId= query param.
  apiHost?: string      // Optional. Defaults to 'https://api.minimax.io'.
                        //           For reduced TTFA, try 'https://api-uw.minimax.io'.
})

synthesize(request): Promise<SynthesizeResult>

Synchronous text-to-speech. Returns decoded audio as a Buffer.

const result = await client.synthesize({
  text: 'Hello!',
  model: 'speech-02-hd',           // optional, defaults to 'speech-02-hd'
  voiceSetting: {
    voiceId: 'English_expressive_narrator',
    speed: 1.0,
    vol: 1.0,
    pitch: 0,
    emotion: 'happy',              // speech-02-*/speech-2.6-*/speech-2.8-* only
  },
  audioSetting: {
    format: 'mp3',                 // 'mp3' | 'pcm' | 'flac' | 'wav' | 'pcmu_raw' | 'pcmu_wav' | 'opus'
    sampleRate: 32000,
    bitrate: 128000,
    channel: 1,
  },
  languageBoost: 'English',
  voiceModify: {
    pitch: 0,                      // -100 to 100
    intensity: 0,                  // -100 to 100
    timbre: 0,                     // -100 to 100
    soundEffects: 'robotic',       // optional
  },
  timbreWeights: [                 // mix multiple voices
    { voiceId: 'voice-1', weight: 0.5 },
    { voiceId: 'voice-2', weight: 0.5 },
  ],
  subtitleEnable: false,
  subtitleType: 'sentence',        // 'sentence' | 'word' ('word_streaming' is streaming-only — use synthesizeStream)
  pronunciationDict: { tone: ['处理/(chǔ lǐ)'] },
})

result.audio        // Buffer
result.extraInfo    // { audioLength, audioSampleRate, audioSize, bitrate, wordCount, usageCharacters, ... }
result.traceId      // string
result.subtitleFile // string | undefined

Pass outputFormat: 'url' to receive a URL string instead of a decoded buffer:

const result = await client.synthesize({
  text: 'Hello!',
  outputFormat: 'url',
})

result.audio // string (URL)

synthesizeStream(request): Promise<SynthesizeStreamResult>

Streaming text-to-speech via SSE. Returns { audio, subtitle, extraInfo, traceId }, where audio is a ReadableStream<Buffer> of audio chunks and the other three are promises resolved from the final aggregated chunk:

  • subtitle: the subtitle file URL (string | undefined)
  • extraInfo: the parsed extraInfo (ExtraInfo | undefined), covering audio length, size, billable characters, …
  • traceId: the trace ID (string | undefined) for MiniMax support

Drain audio first. subtitle, extraInfo, and traceId only settle once audio is being consumed (reading audio is what pumps the underlying SSE source). Awaiting them before reading or cancelling audio will hang. Use Promise.all([drainAudio, extraInfo]) if you need both concurrently. None of them ever reject — they resolve to undefined on early end, API error, transport error, or cancellation.

streamOptions.excludeAggregatedAudio follows the MiniMax API default of false, which means the final chunk re-includes the full re-concatenated clip. That aggregated audio is never enqueued onto the audio stream either way, so extraInfo/traceId are unaffected by this flag. Pass { excludeAggregatedAudio: true } to skip the redundant re-transmit and save bandwidth.

WAV format is not supported in streaming mode.

const { audio, subtitle, extraInfo, traceId } = await client.synthesizeStream({
  text: 'Hello, streaming world!',
  voiceSetting: { voiceId: 'English_expressive_narrator' },
  audioSetting: { format: 'mp3' },
  streamOptions: { excludeAggregatedAudio: true }, // optional — saves bandwidth
  subtitleEnable: true,                   // optional
  subtitleType: 'word_streaming',         // 'word_streaming' is streaming-only
})

const writer = fs.createWriteStream('output.mp3')
for await (const chunk of audio) {
  writer.write(chunk)
}
writer.end()

const subtitleUrl = await subtitle  // undefined unless subtitleEnable was set
const info = await extraInfo        // { audioLength, usageCharacters, … } or undefined
const trace = await traceId         // undefined if no final chunk arrived
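
The "drain audio first" rule can also be satisfied by awaiting the drain and extraInfo together with Promise.all. The sketch below assumes only the result shape described above; `StreamResultLike`, `ExtraInfoLike`, `drainAudio`, and `drainWithInfo` are hypothetical names, not SDK exports.

```typescript
// Hypothetical stand-ins for the shape described above; not SDK types.
type ExtraInfoLike = { audioLength?: number; usageCharacters?: number }
interface StreamResultLike {
  audio: ReadableStream<Uint8Array>
  extraInfo: Promise<ExtraInfoLike | undefined>
}

// Read every chunk and return the total number of bytes seen.
async function drainAudio(audio: ReadableStream<Uint8Array>): Promise<number> {
  const reader = audio.getReader()
  let bytes = 0
  for (;;) {
    const { done, value } = await reader.read()
    if (done) return bytes
    bytes += value.byteLength
  }
}

// Calling drainAudio() starts reading right away, so extraInfo can settle
// while the audio is being consumed; Promise.all then waits for both.
async function drainWithInfo(result: StreamResultLike) {
  const [bytes, info] = await Promise.all([drainAudio(result.audio), result.extraInfo])
  return { bytes, info }
}

// Possible usage with the client (assumes the result is assignable to StreamResultLike):
//   const result = await client.synthesizeStream({ text: 'Hi', voiceSetting: { voiceId: '...' } })
//   const { bytes, info } = await drainWithInfo(result)
```

The key point is ordering: the drain must be started before anything awaits extraInfo on its own, otherwise nothing is pumping the stream.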

synthesizeAsync(request): Promise<AsyncSynthesizeResult>

Async text-to-speech for long-form content. Submit a task then poll for completion.

Provide either text or textFileId (mutually exclusive). WAV format is not supported.

const task = await client.synthesizeAsync({
  text: 'A very long article...',
  voiceSetting: { voiceId: 'English_expressive_narrator' },
})

task.taskId           // number
task.fileId           // number
task.taskToken        // string
task.usageCharacters  // number

querySynthesizeAsync(taskId): Promise<AsyncSynthesizeQueryResult>

Poll the status of an async synthesis task. On success you get a fileId — use the MiniMax File API to retrieve the audio. The synthesized file is only available for 9 hours after success; retrieve and store it before then.

const status = await client.querySynthesizeAsync(task.taskId)

status.status  // 'processing' | 'success' | 'failed' | 'expired'
status.fileId  // number (download file ID when status is 'success')
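
A polling loop around querySynthesizeAsync might look like the sketch below. `waitForTask` and `QueryResultLike` are hypothetical names, and the interval and attempt limit are illustrative defaults rather than SDK settings. A 2-second interval stays far below the 10 QPS limit on this endpoint.

```typescript
type TaskStatus = 'processing' | 'success' | 'failed' | 'expired'
// Hypothetical minimal shape; the real result type has more fields.
interface QueryResultLike { status: TaskStatus; fileId?: number }

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Poll `query` until the task leaves 'processing' or attempts run out.
// Defaults (2 s x 150 polls, about 5 minutes) are assumptions for illustration.
async function waitForTask(
  query: () => Promise<QueryResultLike>,
  { intervalMs = 2000, maxAttempts = 150 } = {},
): Promise<QueryResultLike> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await query()
    if (result.status !== 'processing') return result // success, failed, or expired
    if (attempt < maxAttempts) await sleep(intervalMs)
  }
  throw new Error(`Task still processing after ${maxAttempts} polls`)
}

// Possible usage with the client:
//   const done = await waitForTask(() => client.querySynthesizeAsync(task.taskId))
//   if (done.status !== 'success') throw new Error(`Synthesis ${done.status}`)
```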

uploadFile(file, purpose, options?): Promise<FileUploadResult>

Upload a file. purpose is one of voice_clone, prompt_audio (audio samples for voice cloning), or t2a_async_input (a text file feeding synthesizeAsync). Accepts a Blob or a ReadableStream<Uint8Array>.

// Blob upload (buffered)
const audioBlob = new Blob([await fs.promises.readFile('voice.mp3')], { type: 'audio/mp3' })
const upload = await client.uploadFile(audioBlob, 'voice_clone')

upload.file.fileId    // number
upload.file.bytes     // number
upload.file.filename  // string

For large files, pass a ReadableStream<Uint8Array> to upload without buffering the full payload in memory. The multipart body is assembled with per-chunk backpressure and cancellation propagation, so aborting the request cleanly releases the upstream source.

import { Readable } from 'node:stream'
import { createReadStream } from 'node:fs'

const stream = Readable.toWeb(createReadStream('big-voice.wav')) as ReadableStream<Uint8Array>
const upload = await client.uploadFile(stream, 'voice_clone', {
  filename: 'big-voice.wav',
  contentType: 'audio/wav',  // optional, defaults to 'application/octet-stream'
})

listFiles(request): Promise<ListFilesResult>

List files filtered by purpose (voice_clone, prompt_audio, or t2a_async_input).

const { files } = await client.listFiles({ purpose: 'voice_clone' })
files[0].fileId    // number
files[0].filename  // string
files[0].bytes     // number

retrieveFile(fileId): Promise<RetrieveFileResult>

Retrieve metadata for a single file.

const { file } = await client.retrieveFile(12345)
file.bytes     // number
file.purpose   // string
file.createdAt // number — unix seconds

retrieveFileContent(fileId): Promise<Buffer>

Download the file bytes. Useful for fetching async-synthesis output once querySynthesizeAsync returns status: 'success'.

const audio = await client.retrieveFileContent(task.fileId)
await fs.promises.writeFile('output.mp3', audio)

deleteFile(request): Promise<DeleteFileResult>

Delete a file. purpose accepts the upload purposes plus t2a_async (async synthesis output) and video_generation.

await client.deleteFile({ fileId: 12345, purpose: 't2a_async' })

cloneVoice(request): Promise<VoiceCloneResult>

Clone a voice from an uploaded audio file.

const result = await client.cloneVoice({
  fileId: upload.file.fileId,
  voiceId: 'my-custom-voice',        // 8-256 chars, must start with a letter
  text: 'Preview text',              // optional preview
  model: 'speech-02-hd',             // required if text is provided
  needNoiseReduction: true,
  needVolumeNormalization: true,
  clonePrompt: {                     // optional prompt-based cloning
    promptAudio: promptFileId,
    promptText: 'Transcript of the prompt audio',
  },
})

result.demoAudio       // URL to preview audio (empty if no text provided)
result.inputSensitive  // { type: number } — 0 = normal; 1–7 categorize the safety trigger
result.extraInfo       // billing info (audioLength, usageCharacters, …) when text+model preview ran
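
The SDK validates requests itself, but if you want to check a candidate voiceId early (for example in a form), a rough pre-check could look like this. It covers only the two rules stated in the comment above; `checkVoiceId` is a hypothetical helper, and reading "letter" as ASCII A-Z/a-z is an assumption.

```typescript
// Checks only the two documented rules: length 8-256 and a leading letter.
// "Letter" is assumed to mean ASCII A-Z/a-z; other character rules are not checked.
function checkVoiceId(voiceId: string): string | null {
  if (voiceId.length < 8 || voiceId.length > 256) return 'voiceId must be 8-256 characters long'
  if (!/^[A-Za-z]/.test(voiceId)) return 'voiceId must start with a letter'
  return null // no problem found
}

console.log(checkVoiceId('my-custom-voice'))  // null
console.log(checkVoiceId('my-voice'))         // null (exactly 8 characters)
console.log(checkVoiceId('1st-custom-voice')) // voiceId must start with a letter
```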

designVoice(request): Promise<VoiceDesignResult>

Design a new voice from a text description.

const result = await client.designVoice({
  prompt: 'A warm female voice with a slight British accent',
  previewText: 'Hello, this is a preview of the designed voice.',
  voiceId: 'my-designed-voice',  // optional, auto-generated if omitted
})

result.voiceId     // string
result.trialAudio  // hex-encoded preview audio
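
Since trialAudio is a hex string, it needs decoding before it can be written or played. A minimal sketch using the standard Node.js `Buffer.from(..., 'hex')` call follows; `hexToBuffer` is a hypothetical helper, and the container format of the preview audio is not stated here, so any file extension you pick is an assumption.

```typescript
// Decode a hex string into bytes with the standard Node.js Buffer API.
function hexToBuffer(hex: string): Buffer {
  const buf = Buffer.from(hex, 'hex')
  // Buffer.from stops at the first invalid pair and drops a trailing odd digit,
  // so compare lengths to detect malformed input.
  if (buf.length * 2 !== hex.length) throw new Error('Invalid hex string')
  return buf
}

console.log(hexToBuffer('48656c6c6f').toString('utf8')) // Hello

// Possible usage with the result above (the .mp3 extension is an assumption):
//   await fs.promises.writeFile('preview.mp3', hexToBuffer(result.trialAudio))
```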

getVoices(request): Promise<GetVoiceResult>

List available voices.

const voices = await client.getVoices({
  voiceType: 'all',  // 'system' | 'voice_cloning' | 'voice_generation' | 'all'
})

voices.systemVoice      // SystemVoiceInfo[] — built-in voices
voices.voiceCloning     // VoiceCloningInfo[] — your cloned voices
voices.voiceGeneration  // VoiceGenerationInfo[] — your designed voices

deleteVoice(request): Promise<DeleteVoiceResult>

Delete a cloned or designed voice.

const result = await client.deleteVoice({
  voiceType: 'voice_cloning',  // 'voice_cloning' | 'voice_generation'
  voiceId: 'my-custom-voice',
})

Error Handling

The library provides a typed error hierarchy:

import {
  MiniMaxClientError,      // Client-side validation (bad params, before request is sent)
  MiniMaxError,            // Base class for all API errors
  MiniMaxAuthError,        // Authentication failures (codes 1004, 2042, 2049)
  MiniMaxRateLimitError,   // Rate limiting (codes 1002, 1039, 1041, 2045, 2056)
  MiniMaxValidationError,  // Server-side validation (codes 1008, 1026, 1027, 1042, 1043, 1044, 2013, 2037, 2039, 2048, 20132)
} from 'minimax-speech-ts'

try {
  await client.synthesize({ text: 'Hello' })
} catch (e) {
  if (e instanceof MiniMaxClientError) {
    // Bad parameters — fix your request
    console.error(e.message)
  } else if (e instanceof MiniMaxAuthError) {
    // Invalid API key
  } else if (e instanceof MiniMaxRateLimitError) {
    // Back off and retry
  } else if (e instanceof MiniMaxValidationError) {
    // Server rejected the request parameters
    console.error(e.statusCode, e.statusMsg, e.traceId)
  } else if (e instanceof MiniMaxError) {
    // Other API error
    console.error(e.statusCode, e.statusMsg)
  }
}

Client-side validation catches common mistakes before making a request:

  • Missing required fields (text, voiceId, etc.)
  • Emotions with unsupported models (speech-01-* doesn't support emotions)
  • fluent/whisper emotions with non-speech-2.6-* models
  • WAV format in streaming or async mode
  • text and textFileId both provided (mutually exclusive)
  • text provided without model in voice cloning

Models

| Model | Emotions | Notes |
|-------|----------|-------|
| speech-2.8-hd | All except fluent, whisper | Latest HD |
| speech-2.8-turbo | All except fluent, whisper | Latest Turbo |
| speech-2.6-hd | All including fluent, whisper | |
| speech-2.6-turbo | All including fluent, whisper | |
| speech-02-hd | All except fluent, whisper | Default |
| speech-02-turbo | All except fluent, whisper | |
| speech-01-hd | None | |
| speech-01-turbo | None | |
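
The model table above can be expressed as a small lookup, which is handy when you choose a model at runtime. This is a sketch of the table's rules only, not the SDK's validator; `supportsEmotion` is a hypothetical name, and it does not check whether the emotion name itself is one the API accepts.

```typescript
// Mirrors the model table: speech-01-* has no emotions, and only
// speech-2.6-* accepts 'fluent' and 'whisper'. Not the SDK's own validator.
function supportsEmotion(model: string, emotion: string): boolean {
  if (model.startsWith('speech-01-')) return false // no emotion support at all
  const extended = emotion === 'fluent' || emotion === 'whisper'
  if (extended) return model.startsWith('speech-2.6-') // only 2.6 has these two
  return /^speech-(02|2\.6|2\.8)-/.test(model) // remaining families in the table
}

console.log(supportsEmotion('speech-02-hd', 'happy'))      // true
console.log(supportsEmotion('speech-2.8-hd', 'whisper'))   // false
console.log(supportsEmotion('speech-2.6-turbo', 'fluent')) // true
console.log(supportsEmotion('speech-01-hd', 'happy'))      // false
```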

Text Features

The text field supports inline markup beyond plain content:

  • Pause control — insert <#x#> between text segments to pause for x seconds (range 0.01–99.99). Example: Hello<#0.5#>world.
  • Inline pronunciation — override the pronunciation of a word with Mandarin pinyin (tones 1–5), IPA, or Cantonese jyutping (tones 1–6), wrapped in half-width parentheses immediately after the word:
    • The word live is pronounced (lɪv) as a verb and (laɪv) as an adjective.
    • This is (he2)平, not (huo4)面.
    • 去街市買啲(sung3)。
  • Interjection tags (speech-2.8-hd / speech-2.8-turbo only) — embed natural speech sounds: (laughs), (chuckle), (coughs), (clear-throat), (groans), (breath), (pant), (inhale), (exhale), (gasps), (sniffs), (sighs), (snorts), (burps), (lip-smacking), (humming), (hissing), (emm), (sneezes).
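
Pause markers are easy to mistype, so a tiny helper can build them and enforce the documented 0.01–99.99 range. `pause` and `joinWithPause` are hypothetical helpers, not part of the SDK, and rounding to two decimals is an assumption based on the range's precision.

```typescript
// Build a pause marker such as <#0.5#>. The range comes from the text above.
function pause(seconds: number): string {
  if (!Number.isFinite(seconds) || seconds < 0.01 || seconds > 99.99) {
    throw new RangeError('Pause must be between 0.01 and 99.99 seconds')
  }
  // Round to two decimals (an assumption); Number() drops trailing zeros.
  return `<#${Number(seconds.toFixed(2))}#>`
}

// Join text segments with the same pause between each pair.
function joinWithPause(segments: string[], seconds: number): string {
  return segments.join(pause(seconds))
}

console.log(joinWithPause(['Hello', 'world'], 0.5)) // Hello<#0.5#>world
```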

Rate Limits

The API enforces these limits per account; the SDK surfaces 429-equivalent responses as MiniMaxRateLimitError. Build your own retry/backoff on top.

| Endpoint | Limit |
|----------|-------|
| synthesize / synthesizeStream / voice cloning | 60 RPM |
| designVoice | 20 RPM |
| querySynthesizeAsync | 10 QPS |
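
One way to build retry/backoff on top is a generic wrapper with exponential backoff and jitter, as sketched below. `withBackoff` is a hypothetical helper, and the retry count and delays are illustrative values, not recommendations from MiniMax.

```typescript
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Retry `fn` while `shouldRetry(error)` is true, up to `retries` extra attempts.
// Defaults are illustrative assumptions, not values from the SDK or the API.
async function withBackoff<T>(
  fn: () => Promise<T>,
  shouldRetry: (error: unknown) => boolean,
  { retries = 4, baseMs = 1000, maxMs = 30000 } = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (error) {
      if (attempt >= retries || !shouldRetry(error)) throw error
      // Exponential delay (baseMs, 2x, 4x, ...) capped at maxMs.
      const backoff = Math.min(maxMs, baseMs * 2 ** attempt)
      // Jitter: wait between 50% and 100% of the capped delay.
      await delay(backoff / 2 + Math.random() * (backoff / 2))
    }
  }
}

// Possible usage with the SDK's error class:
//   const result = await withBackoff(
//     () => client.synthesize({ text: 'Hello' }),
//     (e) => e instanceof MiniMaxRateLimitError,
//   )
```

Retrying only on MiniMaxRateLimitError keeps auth and validation failures failing fast, since repeating those requests would not help.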

Use Cases

  • Voice-over generation — generate narration audio from scripts for videos and podcasts
  • Accessibility — add text-to-speech to web and Node.js applications
  • Voice cloning — clone a voice from a short audio sample and synthesize new speech
  • Voice design — create custom AI voices from text descriptions
  • Real-time TTS streaming — stream audio chunks via SSE for chatbots, virtual assistants, and live applications
  • Batch audio production — use async synthesis for long-form content like audiobooks and articles

Compatibility

  • Node.js >= 18 (uses native fetch and ReadableStream)
  • TypeScript >= 5.0
  • Works with any MiniMax API key from platform.minimax.io

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

MIT