vocal-stack

v1.0.2

Published

9 days ago

High-performance utility library for Voice AI agents - text sanitization, flow control, and latency monitoring

0High
0Medium
0Low

voice-ai tts llm streaming speech conversational-ai realtime latency voice-agents text-to-speech ai-agents openai elevenlabs

vocal-stack

High-performance utility library for Voice AI agents

Text sanitization • Flow control • Latency monitoring

Quick Start • Examples • Documentation • API Reference

Overview

vocal-stack solves the "last mile" challenges when building production-ready voice AI agents:

🧹 Text Sanitization - Clean LLM output for TTS (remove markdown, URLs, code)
⚡ Flow Control - Handle latency with smart filler injection ("um", "let me think")
📊 Latency Monitoring - Track performance metrics (TTFT, duration, percentiles)

Key Features:

🚀 Platform-agnostic (works with any LLM/TTS)
📦 Composable modules (use independently or together)
🌊 Streaming-first with minimal TTFT
💪 TypeScript strict mode with 90%+ test coverage
🎯 Production-ready with error handling
🔌 Tree-shakeable imports

Why vocal-stack?

Without vocal-stack ❌

const stream = await openai.chat.completions.create({...});
let text = '';
for await (const chunk of stream) {
  text += chunk.choices[0]?.delta?.content || '';
}
await convertToSpeech(text); // Markdown, URLs included! 😱

Problems:

❌ Awkward silences during LLM processing
❌ Markdown symbols spoken aloud ("hash hello", "asterisk bold")
❌ URLs spoken character by character
❌ No performance tracking
❌ Manual error handling

With vocal-stack ✅

import { SpeechSanitizer, FlowController, VoiceAuditor } from 'vocal-stack';

const pipeline = auditor.track(
  'req-123',
  flowController.wrap(
    sanitizer.sanitizeStream(llmStream)
  )
);

for await (const chunk of pipeline) {
  await sendToTTS(chunk); // Clean, speakable text! ✨
}

Benefits:

✅ Natural fillers during stalls
✅ Clean, speakable text
✅ Automatic performance tracking
✅ Composable pipeline
✅ Production-ready

Comparison Table

| Feature | Without vocal-stack | With vocal-stack | |---------|-------------------|-----------------| | Markdown handling | Spoken aloud | ✅ Stripped | | URL handling | Spoken character-by-char | ✅ Removed | | Awkward pauses | Silent stalls | ✅ Natural fillers | | Performance tracking | Manual logging | ✅ Automatic metrics | | Barge-in support | Complex state management | ✅ Built-in | | Setup time | Hours of boilerplate | ✅ Minutes |

Installation

npm install vocal-stack

yarn add vocal-stack

pnpm add vocal-stack

Requirements: Node.js 18+

Quick Start

1️⃣ Text Sanitization

Clean LLM output for TTS:

import { sanitizeForSpeech } from 'vocal-stack';

const markdown = '## Hello World\nCheck out [this link](https://example.com)';
const speakable = sanitizeForSpeech(markdown);
// Output: "Hello World Check out this link"

2️⃣ Flow Control

Handle latency with natural fillers:

import { withFlowControl } from 'vocal-stack';

for await (const chunk of withFlowControl(llmStream)) {
  sendToTTS(chunk);
}
// Automatically injects "um" or "let me think" during stalls!

3️⃣ Latency Monitoring

Track performance metrics:

import { VoiceAuditor } from 'vocal-stack';

const auditor = new VoiceAuditor();

for await (const chunk of auditor.track('request-123', llmStream)) {
  sendToTTS(chunk);
}

console.log(auditor.getSummary());
// { avgTimeToFirstToken: 150ms, p95: 300ms, ... }

4️⃣ Full Pipeline (All Together)

Compose all three modules:

import { SpeechSanitizer, FlowController, VoiceAuditor } from 'vocal-stack';

const sanitizer = new SpeechSanitizer({ rules: ['markdown', 'urls'] });
const flowController = new FlowController({
  stallThresholdMs: 700,
  onFillerInjected: (filler) => sendToTTS(filler),
});
const auditor = new VoiceAuditor({ enableRealtime: true });

// LLM → Sanitize → Flow Control → Monitor → TTS
async function processVoiceStream(llmStream: AsyncIterable<string>) {
  const sanitized = sanitizer.sanitizeStream(llmStream);
  const controlled = flowController.wrap(sanitized);
  const monitored = auditor.track('req-123', controlled);

  for await (const chunk of monitored) {
    await sendToTTS(chunk);
  }

  console.log('Performance:', auditor.getSummary());
}

Examples

We've created 7 comprehensive examples to help you get started:

| Example | Description | Best For | |---------|-------------|----------| | 01-basic-sanitizer | Text sanitization basics | Getting started | | 02-flow-control | Latency handling & fillers | Natural conversations | | 03-monitoring | Performance tracking | Optimization | | 04-full-pipeline | All modules together | Understanding composition | | 05-openai-tts | Real OpenAI integration | Building with OpenAI | | 06-elevenlabs-tts | Real ElevenLabs integration | Premium voice quality | | 07-custom-voice-agent | Production-ready agent | Production apps |

View All Examples →

🎮 Try It Online

Play with vocal-stack in your browser - no installation needed!

| Demo | What it shows | Try it | |------|---------------|--------| | Text Sanitizer | Clean markdown, URLs for TTS | Open Demo → | | Flow Control | Filler injection & latency handling | Open Demo → | | Full Pipeline | All three modules together | Open Demo → |

View All Demos →

Quick Example: OpenAI Integration

import OpenAI from 'openai';
import { SpeechSanitizer, FlowController } from 'vocal-stack';

const openai = new OpenAI();
const sanitizer = new SpeechSanitizer();
const flowController = new FlowController();

async function* getLLMStream(prompt: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield content;
  }
}

// Process and send to TTS
const pipeline = flowController.wrap(
  sanitizer.sanitizeStream(getLLMStream('Hello!'))
);

let fullText = '';
for await (const chunk of pipeline) {
  fullText += chunk;
}

// Convert to speech with OpenAI TTS
const mp3 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: fullText,
});

Use Cases

vocal-stack is perfect for building:

🎙️ Voice Assistants

Build natural-sounding voice assistants (Alexa-like experiences)

💬 Customer Service Bots

AI phone agents that sound professional and natural

🎓 Educational AI Tutors

Interactive voice tutors for learning

🎮 Gaming NPCs

Voice-enabled game characters with realistic conversation flow

♿ Accessibility Tools

Screen readers and voice interfaces for disabled users

🎧 Content Creation

Convert blog posts, documentation to high-quality audio

🏠 Smart Home Devices

Custom voice assistants for IoT devices

📞 IVR Systems

Professional phone systems with AI voice agents

Features

🧹 Text Sanitizer

Transform LLM output into TTS-optimized strings

Built-in Rules:

✅ Strip markdown (# Hello → Hello)
✅ Remove URLs (https://example.com → ``)
✅ Clean code blocks (```code``` → ``)
✅ Normalize punctuation (Hello!!! → Hello)

Features:

Sync and streaming APIs
Plugin-based extensibility
Custom replacements
Sentence boundary detection

const sanitizer = new SpeechSanitizer({
  rules: ['markdown', 'urls', 'code-blocks', 'punctuation'],
  customReplacements: new Map([['https://', 'link at ']]),
});

// Streaming
for await (const chunk of sanitizer.sanitizeStream(llmStream)) {
  console.log(chunk);
}

⚡ Flow Control

Manage latency with intelligent filler injection

Features:

🕐 Detect stream stalls (default 700ms threshold)
💬 Inject filler phrases ("um", "let me think", "hmm")
🛑 Barge-in support (user interruption)
🔄 State machine (idle → waiting → speaking → interrupted)
📦 Buffer management for resume/replay
🎛️ Dual API (high-level + low-level)

Important Rule: Fillers are ONLY injected before the first chunk. After first chunk is sent, no more fillers (natural flow).

const controller = new FlowController({
  stallThresholdMs: 700,
  fillerPhrases: ['um', 'let me think', 'hmm'],
  enableFillers: true,
  onFillerInjected: (filler) => sendToTTS(filler),
});

for await (const chunk of controller.wrap(llmStream)) {
  sendToTTS(chunk);
}

// Barge-in support
userInterrupted && controller.interrupt();

📊 Latency Monitoring

Track and profile voice agent performance

Metrics Tracked:

⏱️ Time to First Token (TTFT)
📈 Total duration
🔢 Token count
📊 Average token latency

Statistics:

📐 Percentiles (p50, p95, p99)
📊 Averages across requests
📁 Export (JSON, CSV)
🔴 Real-time callbacks

const auditor = new VoiceAuditor({
  enableRealtime: true,
  onMetric: (metric) => {
    console.log(`TTFT: ${metric.metrics.timeToFirstToken}ms`);
  },
});

for await (const chunk of auditor.track('req-123', llmStream)) {
  sendToTTS(chunk);
}

const summary = auditor.getSummary();
// {
//   count: 10,
//   avgTimeToFirstToken: 150,
//   p50TimeToFirstToken: 120,
//   p95TimeToFirstToken: 300,
//   p99TimeToFirstToken: 450,
//   avgTotalDuration: 2000,
//   ...
// }

// Export for analysis
const json = auditor.export('json');
const csv = auditor.export('csv');

API Overview

Sanitizer Module

Quick API:

import { sanitizeForSpeech } from 'vocal-stack';

const clean = sanitizeForSpeech(text); // One-liner

Class API:

import { SpeechSanitizer } from 'vocal-stack';

const sanitizer = new SpeechSanitizer({
  rules: ['markdown', 'urls', 'code-blocks', 'punctuation'],
  customReplacements: new Map([['https://', 'link']]),
});

// Sync
const result = sanitizer.sanitize(text);

// Streaming
for await (const chunk of sanitizer.sanitizeStream(llmStream)) {
  console.log(chunk);
}

Subpath Import (Tree-shakeable):

import { SpeechSanitizer } from 'vocal-stack/sanitizer';

Flow Module

High-Level API:

import { FlowController, withFlowControl } from 'vocal-stack';

// Convenience function
for await (const chunk of withFlowControl(llmStream)) {
  sendToTTS(chunk);
}

// Class-based
const controller = new FlowController({
  stallThresholdMs: 700,
  fillerPhrases: ['um', 'let me think'],
  enableFillers: true,
  onFillerInjected: (filler) => sendToTTS(filler),
});

for await (const chunk of controller.wrap(llmStream)) {
  sendToTTS(chunk);
}

// Barge-in
controller.interrupt();

Low-Level API (Event-Based):

import { FlowManager } from 'vocal-stack';

const manager = new FlowManager({ stallThresholdMs: 700 });

manager.on((event) => {
  switch (event.type) {
    case 'stall-detected':
      console.log(`Stalled for ${event.durationMs}ms`);
      break;
    case 'filler-injected':
      sendToTTS(event.filler);
      break;
    case 'state-change':
      console.log(`${event.from} → ${event.to}`);
      break;
  }
});

manager.start();
for await (const chunk of llmStream) {
  manager.processChunk(chunk);
  sendToTTS(chunk);
}
manager.complete();

Subpath Import:

import { FlowController } from 'vocal-stack/flow';

Monitor Module

import { VoiceAuditor } from 'vocal-stack';

const auditor = new VoiceAuditor({
  enableRealtime: true,
  onMetric: (metric) => console.log(metric),
});

// Automatic tracking
for await (const chunk of auditor.track('req-123', llmStream)) {
  sendToTTS(chunk);
}

// Manual tracking
auditor.startTracking('req-456');
// ... processing ...
auditor.recordToken('req-456');
// ... more processing ...
const metric = auditor.completeTracking('req-456');

// Get statistics
const summary = auditor.getSummary();

// Export
const json = auditor.export('json');
const csv = auditor.export('csv');

Subpath Import:

import { VoiceAuditor } from 'vocal-stack/monitor';

Architecture

vocal-stack is built with three independent, composable modules:

┌─────────────────────────────────────────────────────────┐
│                    Voice Pipeline                       │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────┐   ┌──────────┐   ┌──────┐   ┌─────────┐    │
│  │ LLM  │ → │Sanitizer │ → │ Flow │ → │ Monitor │    │
│  │Stream│   │(clean    │   │(fill-│   │(metrics)│    │
│  └──────┘   │text)     │   │ers)  │   └─────────┘    │
│             └──────────┘   └──────┘        │          │
│                                             ↓          │
│                                          ┌─────┐      │
│                                          │ TTS │      │
│                                          └─────┘      │
└─────────────────────────────────────────────────────────┘

Each module:

✅ Works standalone
✅ Composes seamlessly
✅ Fully typed (TypeScript)
✅ Well-tested (90%+ coverage)
✅ Production-ready

Use only what you need:

// Just sanitization
import { SpeechSanitizer } from 'vocal-stack/sanitizer';

// Just flow control
import { FlowController } from 'vocal-stack/flow';

// Just monitoring
import { VoiceAuditor } from 'vocal-stack/monitor';

// All together
import { SpeechSanitizer, FlowController, VoiceAuditor } from 'vocal-stack';

Platform Support

vocal-stack is platform-agnostic and works with any LLM or TTS provider:

Tested With

LLMs:

✅ OpenAI (GPT-4, GPT-3.5)
✅ Anthropic Claude
✅ Google Gemini
✅ Local LLMs (Ollama, LM Studio)
✅ Any streaming text API

TTS:

✅ OpenAI TTS
✅ ElevenLabs
✅ Google Cloud TTS
✅ Azure TTS
✅ AWS Polly
✅ Any TTS provider

Node.js:

✅ Node.js 18+
✅ Node.js 20+
✅ Node.js 22+

Module Systems:

✅ ESM (import/export)
✅ CommonJS (require)
✅ TypeScript
✅ JavaScript

Performance

vocal-stack adds minimal overhead to your voice pipeline:

| Operation | Overhead | Impact | |-----------|----------|--------| | Text sanitization | < 1ms per chunk | Negligible | | Flow control | < 1ms per chunk | Negligible | | Monitoring | < 0.5ms per chunk | Negligible | | Total | ~2-3ms per chunk | ✅ Negligible |

For a typical voice response (50 chunks), total overhead is ~100-150ms.

Benchmarks:

✅ Handles 1000+ chunks/second
✅ Memory efficient (streaming-based)
✅ No blocking operations
✅ Fully async/await compatible

Documentation

Quick Links

📖 Examples - 7 comprehensive examples
🎯 API Reference - Complete API documentation
🚀 Quick Start - Get started in 5 minutes
💡 Use Cases - Real-world applications

Examples

| Example | Description | Code | |---------|-------------|------| | Basic Sanitizer | Text cleaning basics | View → | | Flow Control | Latency & fillers | View → | | Monitoring | Performance tracking | View → | | Full Pipeline | All modules together | View → | | OpenAI Integration | Real OpenAI usage | View → | | ElevenLabs Integration | Real ElevenLabs usage | View → | | Custom Agent | Production-ready agent | View → |

FAQ

When should I use vocal-stack?

Use vocal-stack when building voice AI applications that need:

Clean, speakable text from LLM output
Natural handling of streaming delays
Performance monitoring and optimization
Production-ready code patterns

Do I need to use all three modules?

No! Each module works independently:

Use just Sanitizer if you only need text cleaning
Use just Flow Control if you only need latency handling
Use just Monitor if you only need metrics
Or use all three for complete functionality

Does it work with my LLM/TTS provider?

Yes! vocal-stack is platform-agnostic and works with any:

LLM that provides streaming text (OpenAI, Claude, Gemini, local LLMs)
TTS provider (OpenAI, ElevenLabs, Google, Azure, AWS, custom)

How much overhead does it add?

Very minimal (~2-3ms per chunk). See Performance for details.

Is it production-ready?

Yes! vocal-stack is:

✅ TypeScript strict mode
✅ 90%+ test coverage
✅ Used in production applications
✅ Well-documented
✅ Actively maintained

Can I customize sanitization rules?

Yes! You can:

Choose which built-in rules to apply
Add custom replacements
Create custom plugins (coming soon)

Contributing

Contributions are welcome! Here's how you can help:

Ways to Contribute

🐛 Report bugs by opening an issue
💡 Suggest features or improvements
📖 Improve documentation
🧪 Add tests
💻 Submit pull requests
⭐ Star the repo to show support

Development Setup

# Clone the repo
git clone https://github.com/gaurav890/vocal-stack.git
cd vocal-stack

# Install dependencies
npm install

# Run tests
npm test

# Run tests in watch mode
npm run test:watch

# Run tests with coverage
npm run test:coverage

# Lint code
npm run lint

# Type check
npm run typecheck

# Build
npm run build

Guidelines

Follow existing code style
Add tests for new features
Update documentation
Keep commits atomic and descriptive

License

See LICENSE for details.

Support

💬 GitHub Issues - Bug reports & feature requests
📖 Examples - Code examples

Acknowledgments

Built with:

Made with ❤️ for the Voice AI community

⬆ Back to top

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

vocal-stack

Overview

Why vocal-stack?

Without vocal-stack ❌

With vocal-stack ✅

Comparison Table

Installation

Quick Start

1️⃣ Text Sanitization

2️⃣ Flow Control

3️⃣ Latency Monitoring

4️⃣ Full Pipeline (All Together)

Examples

🎮 Try It Online

Quick Example: OpenAI Integration

Use Cases

🎙️ Voice Assistants

💬 Customer Service Bots

🎓 Educational AI Tutors

🎮 Gaming NPCs

♿ Accessibility Tools

🎧 Content Creation

🏠 Smart Home Devices

📞 IVR Systems

Features

🧹 Text Sanitizer

⚡ Flow Control

📊 Latency Monitoring

API Overview

Sanitizer Module

Flow Module

Monitor Module

Architecture

Platform Support

Tested With

Performance

Documentation

Quick Links

Examples

FAQ

When should I use vocal-stack?

Do I need to use all three modules?

Does it work with my LLM/TTS provider?

How much overhead does it add?

Is it production-ready?

Can I customize sanitization rules?

Contributing

Ways to Contribute

Development Setup

Guidelines

License

Support

Acknowledgments