@aituber-onair/voice

v0.8.0

Published

10 days ago

Voice synthesis library for AITuber OnAir

shinshin86

aituber vtuber ai voice tts speech-synthesis

AITuber OnAir Voice

@aituber-onair/voice is an independent voice synthesis library that supports multiple TTS (Text-to-Speech) engines. While originally developed for the AITuber OnAir project, it can be used standalone for any voice synthesis needs.

A Japanese version of this README is also available.

This project is published as open-source software and is available as an npm package under the MIT License.

Overview

@aituber-onair/voice is a comprehensive voice synthesis library that provides a unified interface for multiple TTS engines. It specializes in emotion-aware speech synthesis, making it ideal for creating expressive virtual characters, AI assistants, and interactive applications.

Key design principles:

Engine Independence: Switch between TTS engines without changing your code
Emotion Support: Built-in emotion detection and synthesis
Browser Ready: Full support for web audio playback
TypeScript First: Complete type safety and excellent IDE support
Minimal Dependencies: Few external dependencies for maximum compatibility

Installation

Install using npm:

npm install @aituber-onair/voice

Or using yarn:

yarn add @aituber-onair/voice

Or using pnpm:

pnpm add @aituber-onair/voice

Main Features

Multiple TTS Engine Support: Compatible with VOICEVOX, VoicePeak, OpenAI TTS, MiniMax, AivisSpeech, Aivis Cloud, and more
Unified Interface: Single API for all supported TTS engines
Emotion-Aware Synthesis: Automatically detects and applies emotions from text tags like [happy], [sad], etc.
Screenplay Conversion: Transforms text with emotion tags into structured screenplay format
Browser Audio Support: Direct playback in web browsers using HTMLAudioElement
Custom Endpoints: Support for self-hosted TTS servers
Language Detection: Automatic language recognition for multi-language engines
Flexible Configuration: Runtime engine switching and parameter updates

Basic Usage

Simple Text-to-Speech

import { VoiceService, VoiceServiceOptions } from '@aituber-onair/voice';

// Configure the voice service
const options: VoiceServiceOptions = {
  engineType: 'voicevox',
  speaker: '1',
  // Optional: specify custom endpoint
  voicevoxApiUrl: 'http://localhost:50021'
};

// Create voice service instance
const voiceService = new VoiceService(options);

// Speak text
await voiceService.speak({ text: 'Hello, world!' });

Using VoiceEngineAdapter (Recommended)

import { VoiceEngineAdapter, VoiceServiceOptions } from '@aituber-onair/voice';

const options: VoiceServiceOptions = {
  engineType: 'openai',
  speaker: 'alloy',
  apiKey: 'your-openai-api-key',
  onPlay: async (audioBuffer) => {
    // Custom audio playback handler
    console.log('Playing audio...');
  }
};

const voiceAdapter = new VoiceEngineAdapter(options);

// Speak with emotion
await voiceAdapter.speak({ 
  text: '[happy] I am so excited to talk with you!' 
});

Supported TTS Engines

VOICEVOX

High-quality Japanese speech synthesis engine with multiple character voices.

const voiceService = new VoiceService({
  engineType: 'voicevox',
  speaker: '1', // Character ID
  voicevoxApiUrl: 'http://localhost:50021' // Optional custom endpoint
});

VoicePeak

Professional speech synthesis with rich emotional expression.

const voiceService = new VoiceService({
  engineType: 'voicepeak',
  speaker: 'f1',
  voicepeakApiUrl: 'http://localhost:20202',
  voicepeakEmotion: 'happy',
  voicepeakSpeed: 140,
  voicepeakPitch: 20
});

OpenAI TTS

OpenAI's text-to-speech API with multiple voice options.

const voiceService = new VoiceService({
  engineType: 'openai',
  speaker: 'alloy',
  apiKey: 'your-openai-api-key'
});

MiniMax

Multi-language TTS supporting 24 languages with HD quality.

const voiceService = new VoiceService({
  engineType: 'minimax',
  speaker: 'male-qn-qingse',
  apiKey: 'your-minimax-api-key',
  groupId: 'your-group-id', // Required for MiniMax
  endpoint: 'global' // or 'china'
});

Note: MiniMax requires both API key and GroupId for authentication. The GroupId is used for user group management, usage tracking, and billing.

AivisSpeech

AI-powered speech synthesis with natural voice quality.

const voiceService = new VoiceService({
  engineType: 'aivisSpeech',
  speaker: '888753760',
  aivisSpeechApiUrl: 'http://localhost:10101'
});

Aivis Cloud

High-quality cloud-based TTS service with advanced SSML support and streaming capabilities.

const voiceService = new VoiceService({
  engineType: 'aivisCloud',
  speaker: 'unused', // Not used when model UUID is specified
  apiKey: 'your-aivis-cloud-api-key',
  aivisCloudModelUuid: 'a59cb814-0083-4369-8542-f51a29e72af7', // Required
  
  // Optional advanced settings
  aivisCloudSpeakerUuid: 'speaker-uuid', // For multi-speaker models
  aivisCloudStyleId: 0, // Or use aivisCloudStyleName: 'ノーマル'
  aivisCloudUseSSML: true, // Enable SSML tags
  aivisCloudSpeakingRate: 1.0, // 0.5-2.0
  aivisCloudEmotionalIntensity: 1.0, // 0.0-2.0
  aivisCloudOutputFormat: 'mp3', // wav, flac, mp3, aac, opus
  aivisCloudOutputSamplingRate: 44100, // Hz
});

Key Features:

SSML Support: Rich markup for prosody, breaks, aliases, and emotions
Streaming Audio: Real-time audio generation and delivery
Multiple Formats: WAV, FLAC, MP3, AAC, Opus output
Emotion Control: Fine-grained emotional intensity settings
High Quality: Professional-grade voice synthesis

None (Silent Mode)

No audio output - useful for testing or text-only scenarios.

const voiceService = new VoiceService({
  engineType: 'none'
});

Emotion-Aware Speech

The library supports emotion tags in text for more expressive speech:

// Emotion tags are automatically detected and processed
await voiceService.speak({ 
  text: '[happy] Great to see you today!' 
});

await voiceService.speak({ 
  text: '[sad] I will miss you...' 
});

await voiceService.speak({ 
  text: '[angry] This is unacceptable!' 
});

// Supported emotions vary by engine
// Common emotions: happy, sad, angry, surprised, neutral

The emotion system works by:

Extracting emotion tags from the text
Converting text to screenplay format with emotion metadata
Passing emotion information to engines that support it
Falling back gracefully for engines without emotion support
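
The first two steps above can be sketched as follows. This is a minimal illustration, not the package's actual parser: the helper name toScreenplaySketch, the ScreenplaySketch type, and the assumption that a tag appears only at the start of the text are all hypothetical.

```typescript
// Hypothetical shape, loosely mirroring the Screenplay interface in the API Reference.
interface ScreenplaySketch {
  emotion?: string;
  text: string;
}

// Matches one leading tag such as "[happy]" plus surrounding whitespace.
// Assumption: tags contain only ASCII letters and appear at the start of the text.
const EMOTION_TAG = /^\s*\[([a-zA-Z]+)\]\s*/;

// Illustrative helper (not exported by the package): split a leading emotion tag from the text.
function toScreenplaySketch(input: string): ScreenplaySketch {
  const match = EMOTION_TAG.exec(input);
  if (match === null) {
    // No tag: keep the text and attach no emotion metadata.
    return { text: input.trim() };
  }
  const tag = match[1] ?? '';
  return {
    emotion: tag.toLowerCase(),
    text: input.replace(EMOTION_TAG, '').trim(),
  };
}
```

An engine without emotion support could simply ignore the emotion field and synthesize text, which is one way to realize the graceful fallback in the last step.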

Browser Compatibility

The library includes built-in browser audio playback support:

// Option 1: Default browser playback
const voiceService = new VoiceService({
  engineType: 'openai',
  speaker: 'alloy',
  apiKey: 'your-api-key'
  // Audio will play automatically in the browser
});

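// Option 2: Custom audio handling
const voiceService = new VoiceService({
  engineType: 'voicevox',
  speaker: '1',
  onPlay: async (audioBuffer: ArrayBuffer) => {
    // Decode the synthesized audio and play it through the Web Audio API
    const audioContext = new AudioContext();
    const decoded = await audioContext.decodeAudioData(audioBuffer);
    const audioBufferSource = audioContext.createBufferSource();
    audioBufferSource.buffer = decoded;
    audioBufferSource.connect(audioContext.destination);
    audioBufferSource.start();
  }
});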

// Option 3: Specify HTML audio element
const voiceService = new VoiceService({
  engineType: 'voicevox',
  speaker: '1',
  voicevoxApiUrl: 'http://localhost:50021',
  audioElementId: 'my-audio-player' // ID of <audio> element
});

Advanced Configuration

Dynamic Engine Switching

const voiceAdapter = new VoiceEngineAdapter({
  engineType: 'voicevox',
  speaker: '1'
});

// Switch to a different engine at runtime
await voiceAdapter.updateOptions({
  engineType: 'openai',
  speaker: 'nova',
  apiKey: 'your-openai-api-key'
});

Custom Endpoints

// For self-hosted or custom TTS servers
const voiceService = new VoiceService({
  engineType: 'voicevox',
  speaker: '1',
  voicevoxApiUrl: 'https://my-custom-voicevox-server.com'
});

Engine Parameter Overrides

VoiceServiceOptions (see API Reference) exposes a consistent set of overrides for each engine. The example below sets several of them at once, and the engine parameter reference that follows lists the fields per engine so you can find the right property without scanning the entire interface.

const voiceService = new VoiceService({
  engineType: 'voicevox',
  speaker: '1',
  openAiSpeed: 1.15,
  voicevoxSpeedScale: 1.1,
  voicevoxPitchScale: 0.05,
  voicevoxIntonationScale: 1.2,
  voicevoxQueryParameters: { pauseLength: 0.3, outputSamplingRate: 44100 },
  minimaxVoiceSettings: { speed: 1.05, vol: 1.1, pitch: 2 },
  minimaxAudioSettings: { sampleRate: 44100, format: 'mp3' },
  aivisSpeechSpeedScale: 1.05,
  aivisCloudSpeakingRate: 1.1,
  aivisCloudVolume: 1.05,
});

Tip: the React example in packages/voice/examples/react-basic exposes the same controls as collapsible cards with sliders, making it easy to try values before applying them in code.

Engine parameter reference

OpenAI TTS
- openAiModel
- openAiSpeed
VOICEVOX
- Endpoint: voicevoxApiUrl
- Scalars: voicevoxSpeedScale, voicevoxPitchScale, voicevoxIntonationScale, voicevoxVolumeScale
- Timing: voicevoxPrePhonemeLength, voicevoxPostPhonemeLength, voicevoxPauseLength, voicevoxPauseLengthScale
- Output: voicevoxOutputSamplingRate, voicevoxOutputStereo
- Flags: voicevoxEnableKatakanaEnglish, voicevoxEnableInterrogativeUpspeak
- Version: voicevoxCoreVersion
- Low-level overrides: voicevoxQueryParameters
AivisSpeech
- Endpoint: aivisSpeechApiUrl
- Scalars: aivisSpeechSpeedScale, aivisSpeechPitchScale, aivisSpeechIntonationScale, aivisSpeechTempoDynamicsScale, aivisSpeechVolumeScale
- Timing: aivisSpeechPrePhonemeLength, aivisSpeechPostPhonemeLength, aivisSpeechPauseLength, aivisSpeechPauseLengthScale
- Output: aivisSpeechOutputSamplingRate, aivisSpeechOutputStereo
- Low-level overrides: aivisSpeechQueryParameters
Aivis Cloud
- Identity: aivisCloudModelUuid, aivisCloudSpeakerUuid, aivisCloudStyleId, aivisCloudStyleName, aivisCloudUserDictionaryUuid
- Behaviour: aivisCloudUseSSML, aivisCloudLanguage, aivisCloudSpeakingRate, aivisCloudEmotionalIntensity, aivisCloudTempoDynamics, aivisCloudPitch, aivisCloudVolume
- Silence: aivisCloudLeadingSilence, aivisCloudTrailingSilence, aivisCloudLineBreakSilence
- Output: aivisCloudOutputFormat, aivisCloudOutputBitrate, aivisCloudOutputSamplingRate, aivisCloudOutputChannels
- Logging: aivisCloudEnableBillingLogs
VoicePeak
- Endpoint: voicepeakApiUrl
- Emotion: voicepeakEmotion
- Scalars: voicepeakSpeed, voicepeakPitch
MiniMax
- Identity: groupId, endpoint, minimaxModel, minimaxLanguageBoost
- Voice overrides: minimaxVoiceSettings or individual minimaxSpeed, minimaxVolume, minimaxPitch
- Audio overrides: minimaxAudioSettings or individual minimaxSampleRate, minimaxBitrate, minimaxAudioFormat, minimaxAudioChannel

Error Handling

try {
  await voiceService.speak({ text: 'Hello!' });
} catch (error) {
  // `error` is `unknown` in strict TypeScript, so narrow it before reading the message
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('API key')) {
    console.error('Invalid API key');
  } else if (message.includes('network')) {
    console.error('Network error - check your connection');
  } else {
    console.error('TTS error:', error);
  }
}
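
If the same checks are needed in several places, they can be pulled into a small helper. The sketch below is an assumption-laden illustration: classifyTtsError and TtsErrorKind are hypothetical names, and the substrings it looks for are the ones used in the example above, not documented error messages of the package.

```typescript
// Hypothetical error categories for illustration only.
type TtsErrorKind = 'auth' | 'network' | 'unknown';

// Illustrative helper (not part of the package): map a thrown value to a coarse category.
function classifyTtsError(error: unknown): TtsErrorKind {
  // A thrown value is not guaranteed to be an Error, so normalize it to a string first.
  const raw = error instanceof Error ? error.message : String(error);
  const message = raw.toLowerCase();
  if (message.includes('api key')) return 'auth';
  if (message.includes('network')) return 'network';
  return 'unknown';
}
```

Unlike the inline example, this version compares in lower case, so 'API key' and 'api key' fall into the same category. Matching on message text is fragile in general, so treat the categories as a hint rather than a guarantee.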

Engine-Specific Features

VOICEVOX Features

Multiple character voices with unique personalities
Adjustable speech parameters (speed, pitch, intonation)
Local server support for privacy

OpenAI TTS Features

High-quality multilingual support
Multiple voice personalities
Optimized for conversational AI

MiniMax Features

Support for 24 languages with automatic language detection
HD quality audio output
Dual-region endpoints (global/china)
Advanced emotion synthesis

Integration with AITuber OnAir Core

While this package can be used independently, it integrates seamlessly with @aituber-onair/core:

import { AITuberOnAirCore } from '@aituber-onair/core';

const core = new AITuberOnAirCore({
  apiKey: 'your-openai-key',
  voiceOptions: {
    engineType: 'voicevox',
    speaker: '1',
    voicevoxApiUrl: 'http://localhost:50021'
  }
});

// Voice synthesis is handled automatically
await core.processChat('Hello!');

API Reference

VoiceServiceOptions

interface VoiceServiceOptions {
  engineType: VoiceEngineType;
  speaker: string;
  apiKey?: string;
  groupId?: string; // For MiniMax
  endpoint?: 'global' | 'china'; // For MiniMax
  voicevoxApiUrl?: string;
  voicepeakApiUrl?: string;
  voicepeakEmotion?:
    | 'happy'
    | 'fun'
    | 'angry'
    | 'sad'
    | 'neutral'
    | 'surprised';
  voicepeakSpeed?: number; // 50-200 (integer)
  voicepeakPitch?: number; // -300 to 300 (integer)
  aivisSpeechApiUrl?: string;
  onPlay?: (audioBuffer: ArrayBuffer) => Promise<void>;
  onComplete?: () => void;
  audioElementId?: string;
  // Engine-specific override fields are omitted here; see "Engine parameter reference"
}
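
The comments on voicepeakSpeed and voicepeakPitch give integer ranges (50-200 and -300 to 300). Whether the package validates these values itself is not stated here, so a caller may want to normalize user input first. A minimal sketch, with the hypothetical helper names clampInt and normalizeVoicepeak:

```typescript
// Round to the nearest integer, then clamp into [min, max].
function clampInt(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, Math.round(value)));
}

// Illustrative helper (not part of the package): force VoicePeak values into the
// ranges given in the VoiceServiceOptions comments above.
function normalizeVoicepeak(
  speed: number,
  pitch: number,
): { voicepeakSpeed: number; voicepeakPitch: number } {
  return {
    voicepeakSpeed: clampInt(speed, 50, 200),
    voicepeakPitch: clampInt(pitch, -300, 300),
  };
}
```

The returned object can be spread into the options passed to the voice service, for example after reading values from sliders in a UI.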

VoiceEngine Methods

interface VoiceEngine {
  speak(params: SpeakParams): Promise<ArrayBuffer | null>;
  isAvailable(): Promise<boolean>;
  getSpeakers?(): Promise<SpeakerInfo[]>;
  getEngineInfo(): VoiceEngineInfo;
}
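
One possible use of isAvailable() is to probe several engines and pick the first one that responds. The sketch below assumes only the method signature shown above; firstAvailable and AvailabilityCheck are hypothetical names, and how the package constructs engine instances is not covered here.

```typescript
// Minimal subset of the VoiceEngine interface shown above.
interface AvailabilityCheck {
  isAvailable(): Promise<boolean>;
}

// Illustrative helper (not part of the package): return the first candidate whose
// availability probe resolves to true, or null if none does.
async function firstAvailable<T extends AvailabilityCheck>(
  candidates: readonly T[],
): Promise<T | null> {
  for (const candidate of candidates) {
    try {
      if (await candidate.isAvailable()) {
        return candidate;
      }
    } catch {
      // A probe that throws (for example, a local server that is not running)
      // is treated the same as "not available".
    }
  }
  return null;
}
```

Candidates are probed one after another in priority order, which keeps the selection deterministic at the cost of waiting for each probe in turn.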

Screenplay Format

interface Screenplay {
  emotion?: string;
  text: string;
  speechText?: string;
}

Examples

React Integration

See the React example (packages/voice/examples/react-basic) for a complete implementation. A minimal component looks like this:

import { useState } from 'react';
import { VoiceService } from '@aituber-onair/voice';

function VoiceDemo() {
  const [voiceService] = useState(
    () => new VoiceService({
      engineType: 'openai',
      speaker: 'alloy',
      apiKey: 'your-api-key'
    })
  );

  const handleSpeak = async (text: string) => {
    await voiceService.speak({ text });
  };

  return (
    <button onClick={() => handleSpeak('[happy] Hello!')}>
      Speak with emotion
    </button>
  );
}

Node.js Usage

The voice package now fully supports Node.js environments with automatic environment detection:

import { VoiceEngineAdapter } from '@aituber-onair/voice';

const voiceService = new VoiceEngineAdapter({
  engineType: 'openai',
  speaker: 'nova',
  apiKey: process.env.OPENAI_API_KEY
});

// Audio will be played using available Node.js audio libraries
await voiceService.speak({ text: 'Hello from Node.js!' });

Audio Playback in Node.js

For audio playback in Node.js, install one of these optional dependencies:

# Option 1: speaker (native bindings, better quality)
npm install speaker

# Option 2: play-sound (uses system audio player, easier to install)
npm install play-sound

If neither is installed, the package will still work but won't play audio. You can still use the onPlay callback to handle audio data:

import { writeFileSync } from 'node:fs';
import { VoiceEngineAdapter } from '@aituber-onair/voice';

const voiceService = new VoiceEngineAdapter({
  engineType: 'voicevox',
  speaker: '1',
  voicevoxApiUrl: 'http://localhost:50021',
  onPlay: async (audioBuffer) => {
    // Save to file or process audio data
    writeFileSync('output.wav', Buffer.from(audioBuffer));
  }
});

The package automatically detects the environment and uses the appropriate audio player:

Browser: Uses HTMLAudioElement
Node.js: Uses speaker or play-sound if available, otherwise silent
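
Environment detection of this kind is commonly done by checking which globals exist. The following is a generic sketch of that technique, not the package's actual detection logic; detectRuntime, Runtime, and GlobalLike are hypothetical names.

```typescript
type Runtime = 'browser' | 'node' | 'unknown';

// Only the globals this sketch inspects; everything is optional so a plain
// object can stand in for the real global scope in tests.
interface GlobalLike {
  window?: unknown;
  document?: unknown;
  process?: { versions?: { node?: string } };
}

// Illustrative helper (not part of the package): guess the runtime from available globals.
function detectRuntime(env: GlobalLike = globalThis as unknown as GlobalLike): Runtime {
  if (typeof env.window !== 'undefined' && typeof env.document !== 'undefined') {
    return 'browser';
  }
  if (typeof env.process?.versions?.node === 'string') {
    return 'node';
  }
  return 'unknown';
}
```

Passing the environment as a parameter keeps the function easy to test; called with no arguments it inspects the real global object.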

Testing

Run the test suite:

# Run all tests
npm test

# Run tests in watch mode
npm run test:watch

# Generate coverage report
npm run test:coverage

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.