Animalese TTS
Animalese TTS is an Animal Crossing style Voice Synthesis (TTS) engine. It analyzes text to synthesize audio samples corresponding to each phoneme, and applies pitch adjustments and melody variations to generate a cute and unique voice.
Key Features
- Multi-language Support: Supports Korean (separated into onset/nucleus/coda), English, and Japanese (hiragana, katakana) analyzers.
- Cross-environment Support: Works in both browser environments using the Web Audio API and Node.js environments using the file system.
- Real-time Synthesis: Enables real-time, character-by-character synthesis using AsyncGenerator.
- Rich Audio Effects:
  - Base Pitch & Randomness: Adjust the base pitch of the voice and the random variation applied each time a character is spoken.
  - Melody Variation: Create a singing-like effect through sine wave-based pitch changes.
  - Playback Speed Control: Adjust the playback speed independently of the pitch.
  - Punctuation & Space Delay: Set natural pauses for periods, commas, spaces, etc.
Installation and Build
npm install animalese-tts

Or load it directly from a CDN in the browser:

<script type="module">
import {
AnimaleseEngine,
KoreanAnalyzer,
WebSampler,
PitchManager,
WebPlayer
} from 'https://cdn.jsdelivr.net/npm/animalese-tts/+esm'
</script>

Usage (Example)
Browser (Web) Environment
In a browser environment, you use WebSampler to fetch samples via HTTP and WebPlayer to play audio using the Web Audio API.
import {
AnimaleseEngine,
EnglishAnalyzer,
WebSampler,
PitchManager,
WebPlayer
} from 'https://cdn.jsdelivr.net/npm/animalese-tts/+esm'
const player = new WebPlayer()
const engine = new AnimaleseEngine({
analyzer: new EnglishAnalyzer(),
sampler: new WebSampler(
'https://your-server.com/sounds/sprite.wav',
// Auto-slicing based on silence using string[]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
),
effect: new PitchManager({
pitch: 1.5,
speed: 4.0,
})
})
// Wait for the engine to load (loads the sprite file, splits audio based on silence, and decodes it)
await engine.load()
async function speak(text: string) {
const speaker = engine.synthesize(text)
for await (const output of speaker.speak()) {
await player.play(output.buffer)
}
}
speak("Hello! This is a voice test in the browser.")Node.js Environment
In a Node.js environment, you use FileSystemSampler to read samples from the local disk and FilePlayer to play or save the audio.
import {
AnimaleseEngine,
KoreanAnalyzer,
FileSystemSampler,
PitchManager,
FilePlayer
} from 'animalese-tts'
const player = new FilePlayer()
const engine = new AnimaleseEngine({
analyzer: new KoreanAnalyzer(),
sampler: new FileSystemSampler(
'./sounds/sprite.wav',
// You can use an explicit SpriteMap object.
{
'ㄱ': { startMs: 0, durationMs: 100 },
'ㄲ': { startMs: 100, durationMs: 80 },
'ㄴ': { startMs: 180, durationMs: 95 },
// ...
}
),
effect: new PitchManager({
pitch: 0.8,
speed: 3.5,
})
})
await engine.load()
async function speak(text: string) {
const speaker = engine.synthesize(text)
for await (const output of speaker.speak()) {
// FilePlayer exports the buffer to a temporary or specified .wav file/stream
await player.play(output.buffer, './output_folder')
}
}
speak("안녕하세요! Node.js에서의 목소리 테스트입니다.")Detailed Configuration Options (AnimalVoiceConfig)
| Option Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| analyzer | TextAnalyzer | (Required) | Separates text into phonemes. You can select language-specific analyzers. |
| sampler | Sampler | (Required) | Supplies the original audio sample data corresponding to the phonemes. |
| effect | AudioEffect | (Required) | Handles pitch modulation and speed control effects. (Usually PitchManager is used) |
| spaceDelay | number | 0.03 | Silence delay, in seconds, inserted when a space character is encountered. |
| punctuationDelay | number | 0.3 | Silence delay, in seconds, inserted when a punctuation character is encountered. |
| punctuations | string[] | (Default set) | Array of characters to be considered as punctuation marks. |
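For example, the timing options can be passed alongside the three required components when constructing the engine. This is a minimal sketch: the option names come from the table above, while the specific values and the punctuation set are only illustrative.
import { AnimaleseEngine, EnglishAnalyzer, WebSampler, PitchManager } from 'animalese-tts'
const engine = new AnimaleseEngine({
  analyzer: new EnglishAnalyzer(),
  sampler: new WebSampler(
    'https://your-server.com/sounds/sprite.wav',
    ['a', 'b', 'c', 'd', 'e'] // phoneme list shortened for brevity
  ),
  effect: new PitchManager({ pitch: 1.5, speed: 4.0 }),
  spaceDelay: 0.05,                   // pause 50 ms on each space character
  punctuationDelay: 0.4,              // pause 400 ms on punctuation
  punctuations: ['.', ',', '!', '?']  // characters treated as punctuation
})
await engine.load()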
Sampler Configuration: Audio File and Sprite Options
> [!IMPORTANT]
> For the default voice audio samples and SpriteMap sequence data, refer to docs/sounds.
The Sampler's role is to load a single large audio sprite (sprite.wav) and slice it into individual phoneme units. The common parameters injected when initializing WebSampler, FileSystemSampler, and MemorySampler are as follows:
Audio Source (1st Arg):
- WebSampler: URL of the audio sprite file.
- FileSystemSampler: Path to the local sprite.wav file.
- MemorySampler: Memory buffer (ArrayBuffer/Uint8Array) containing the audio data.
sprites (2nd Arg): Specifies how to slice the audio sprite and map it to each phoneme.
Using a SpriteMap Object: Explicitly specifies the start time (startMs) and section length (durationMs) for each phoneme.
const sprites = {
  'a': { startMs: 0, durationMs: 154 },
  'b': { startMs: 154, durationMs: 130 },
  'c': { startMs: 284, durationMs: 160 }
  // ...
};
Using a string[] Array: Automatically splits the audio based on silence sections (auto-slicing) and assigns the resulting clips to phonemes in the order listed in the array.
const sprites = ['a', 'b', 'c', 'd', ...];
- How does it work?: It reads the entire audio sprite, then finds all silence sections that satisfy the silenceThreshold (silence detection threshold) and minSilenceDurationMs (minimum duration recognized as silence) options. Using these silence sections as boundaries, it splits the audio into multiple smaller clips.
- Array Mapping: The resulting clips are mapped 1:1, in order, to the items specified in the array (e.g., ['a', 'b', 'c']).
- Trim Processing: Each sliced phoneme clip additionally undergoes a trim pass to remove any slight leading/trailing silence. This prevents unwanted empty noise during synthesis.
options (3rd Arg - Optional):
- maxRetries: Maximum number of retries when loading the audio file. (Default: 3)
- silenceThreshold: When using the string[] method, audio whose amplitude is below this value (0.0~1.0) is considered silence. It also affects the trim applied to the front and back of the sliced clips. (Default: 0.01)
- minSilenceDurationMs: When using the string[] method, how long (in milliseconds) the amplitude must stay at or below silenceThreshold to be identified as a genuine silence section, i.e. a slice boundary. (Default: 20) If the audio splits into unexpectedly small or odd pieces, a very short silence inside a phoneme sound may have been mistaken for a slice boundary; in that case, try increasing this value (e.g., 50). If adjusting these values does not resolve the issue for a complex sprite, we recommend the SpriteMap approach to designate times explicitly.
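A minimal sketch of these options in use (the URL, phoneme list, and values are illustrative):
import { WebSampler } from 'animalese-tts'
const sampler = new WebSampler(
  'https://your-server.com/sounds/sprite.wav',
  ['a', 'b', 'c', 'd', 'e'],  // auto-slicing based on silence
  {
    maxRetries: 3,            // retry loading the file up to 3 times
    silenceThreshold: 0.02,   // amplitudes below this are treated as silence
    minSilenceDurationMs: 50  // require 50 ms of silence for a slice boundary
  }
)
Raising minSilenceDurationMs is usually the first thing to try when auto-slicing produces too many fragments.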
PitchManager Parameter Options (PitchManagerOptions)
- pitch: The base tone of the voice. Higher is thinner, lower is deeper. (Default: 1.5)
- speed: Speaking speed. Greater than 1.0 is faster, less than 1.0 is slower. (Default: 4.0)
- randomness: Amount of random pitch change applied to each character. (Default: 0.1)
- melodyRate: The rate at which the melody wave changes. Higher values make the pitch rise and fall faster. (Default: 0.05)
- melodyAmplitude: The amplitude of the melody wave. Higher values make the pitch differences more distinct. (Default: 0.1)
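A minimal sketch using every option (the values are illustrative, not recommendations):
import { PitchManager } from 'animalese-tts'
const effect = new PitchManager({
  pitch: 1.2,           // base tone; higher is thinner, lower is deeper
  speed: 4.0,           // speaking speed (default)
  randomness: 0.2,      // per-character random pitch variation
  melodyRate: 0.05,     // how quickly the melody wave rises and falls
  melodyAmplitude: 0.3  // how pronounced the melodic pitch swings are
})
Pass the resulting instance as the effect option of AnimaleseEngine.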
Main Components Structure
Analyzers
Analyzers separate text into the smallest phoneme structures tailored to the specific characteristics of the language.
- KoreanAnalyzer: Precisely separates Korean text into onset, nucleus, and coda (e.g., '강' → ㄱ + ㅏ + ㅇ). Handles double consonants and diphthongs smoothly.
- EnglishAnalyzer: Separates English text into individual letters, ignoring case, and filters out unsupported special characters.
- JapaneseAnalyzer: Analyzes hiragana and katakana and separates them into individual phonemes.
Creating Custom Analyzers
You can easily create your own custom language analyzer by extending the exported base classes: DecomposingAnalyzer or DictionaryAnalyzer.
Using DecomposingAnalyzer (Character-by-character)
Ideal for languages where characters decompose mathematically into phonemes (like Korean).
import { DecomposingAnalyzer, PhonemeToken } from 'animalese-tts';
export class MyCustomAnalyzer extends DecomposingAnalyzer {
protected decompose(char: string): PhonemeToken[] {
// Custom logic to convert a single character to phonemes
return [{ phoneme: char.toLowerCase(), mergeWithNext: false }];
}
}

Using DictionaryAnalyzer (Mapping-based)
Ideal for languages where specific character structures map 1:1 to specific phoneme arrays (like Japanese).
Each string element of the array becomes a single continuous phoneme block.
For example, ['tai'] plays as a single continuous piece (combining t, a, i at once, like one letter), while splitting it into ['ta', 'i'] plays two independent pieces consecutively.
import { DictionaryAnalyzer, PhonemeToken } from 'animalese-tts';
export class MyMappedAnalyzer extends DictionaryAnalyzer {
protected dictionary: Record<string, string[]> = {
// Single character pronunciation
'あ': ['a'],
// Combined pronunciation as one block
'きゃ': ['kya'],
// Two distinct pronunciations merged together consecutively
'大': ['ta', 'i'],
};
// You can optionally override the `analyze` method to add complex language-specific rules (like sokuon, etc).
}

Samplers
A Sampler loads a single audio sprite (.wav) file, decodes it, and slices it into individual phoneme buffer clips.
- FileSystemSampler: Node.js only. Reads a sprite.wav file from the local file system.
- WebSampler: Browser only. Fetches a sprite.wav file via HTTP.
- MemorySampler: Directly accepts an ArrayBuffer or Uint8Array containing the audio data to decode and slice. Ideal for environments where fetch/fs isn't available, or for custom caching strategies.
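A minimal sketch of the MemorySampler path, assuming an ES module context with top-level await (the file path and shortened phoneme list are illustrative):
import { readFile } from 'node:fs/promises'
import { AnimaleseEngine, EnglishAnalyzer, MemorySampler, PitchManager } from 'animalese-tts'
// The sprite can come from any custom pipeline (bundled asset, cache, network, ...)
const data = await readFile('./sounds/sprite.wav') // Uint8Array
const engine = new AnimaleseEngine({
  analyzer: new EnglishAnalyzer(),
  sampler: new MemorySampler(data, ['a', 'b', 'c', 'd', 'e']),
  effect: new PitchManager({ pitch: 1.5, speed: 4.0 })
})
await engine.load()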
Playback Strategies (PlaybackStrategies)
Playback strategies are responsible for delivering the streamed audio data (Float32Array) produced by the AnimaleseEngine's AsyncGenerator to the end user's environment.
- FilePlayer: Synthesizes audio in a Node.js environment and exports it as a .wav file to the local disk.
- WebPlayer: Immediately plays buffers in the browser using the Web Audio API (AudioContext).
Project Structure
- src/core: Core logic including audio decoders, converters, and sample providers
- src/analyzers: Language-specific text analysis algorithms (Korean/English/Japanese)
- src/effects: Audio effect processing such as pitch and speed adjustment
- src/playback: Platform-specific playback strategies (Browser/Node.js)
- src/interfaces.ts: Main interfaces and type definitions
License
Please see the LICENSE file for details regarding the license of this project.
