# @sathsarabandaraj/audio-handler
A unified TypeScript library for browser-based audio management, featuring Speech-to-Text (STT) and Text-to-Speech (TTS) powered by Deepgram AI.
## 🌟 Features
- 🎤 Speech-to-Text: Real-time speech transcription using Deepgram's AI
- 🔊 Text-to-Speech: High-quality voice synthesis with multiple voice options
- 🔌 Provider System: Pluggable architecture for different AI providers
- ⚡️ Event-Driven: React to audio events in real-time
- 🌐 Browser Ready: Works seamlessly in web applications
- 🛡️ TypeScript: Full type safety and IntelliSense support
- 🔄 Smart Fallbacks: Automatic fallback to the Web Speech API when needed (see the sketch after this list)
- 🎯 Easy Integration: Simple API, minimal setup
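The fallback path uses the browser-native Web Speech API. As a rough illustration of what that degraded mode looks like, here is a standalone sketch of the browser API itself (not the library's internal fallback code):

```typescript
// Standalone sketch of browser-native TTS via the Web Speech API.
// This is what the library can fall back to; it is not the library's
// internal implementation.
function speakWithWebSpeech(text: string): void {
  if (!('speechSynthesis' in window)) {
    console.warn('Speech synthesis is not available in this browser');
    return;
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'en-US';
  window.speechSynthesis.speak(utterance);
}
```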
## 📦 Installation

```bash
npm install @sathsarabandaraj/audio-handler
```

Or with pnpm:

```bash
pnpm add @sathsarabandaraj/audio-handler
```

## 🚀 Quick Start
### 1. Initialize AudioManager

```typescript
import AudioManager from '@sathsarabandaraj/audio-handler';

// Create an instance
const audioManager = new AudioManager({
  apiKey: 'your-deepgram-api-key',
  language: 'en-US' // Optional
});

// Initialize
await audioManager.initialize();
```

### 2. Speech-to-Text
```typescript
// Start listening
await audioManager.startListening((result) => {
  console.log('Transcript:', result.text);
  console.log('Is Final:', result.isFinal);
  console.log('Confidence:', result.confidence);
});

// Stop listening
await audioManager.stopListening();
```

### 3. Text-to-Speech
```typescript
// Generate speech
const result = await audioManager.speak('Hello, world!', {
  model: 'aura-asteria-en'
});

// Play the audio
const blob = new Blob([result.audioBuffer], { type: 'audio/mpeg' });
const url = URL.createObjectURL(blob);
const audio = new Audio(url);
audio.play();
```
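Putting the two halves together, here is a minimal round-trip sketch built only from the calls shown above: listen for a final transcript, stop, and speak it back. The `playBuffer` helper is ours, not part of the library.

```typescript
// Minimal echo loop: transcribe one utterance, then speak it back.
// playBuffer is a local helper; everything else is the documented API.
function playBuffer(audioBuffer: ArrayBuffer): void {
  const url = URL.createObjectURL(new Blob([audioBuffer], { type: 'audio/mpeg' }));
  const audio = new Audio(url);
  audio.addEventListener('ended', () => URL.revokeObjectURL(url)); // free the object URL
  audio.play();
}

await audioManager.startListening(async (result) => {
  if (result.isFinal && result.text.trim()) {
    await audioManager.stopListening();
    const tts = await audioManager.speak(`You said: ${result.text}`);
    playBuffer(tts.audioBuffer);
  }
});
```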
## 📚 Complete Examples

We provide three complete, working examples that demonstrate different implementation approaches:
### 1. HTML Example (Vanilla JavaScript)

A single-file HTML application with no build step required.

```bash
cd examples/html-example
python3 -m http.server 8080
# Open http://localhost:8080
```

Features:
- ✅ Single HTML file
- ✅ No framework needed
- ✅ ES6 modules
- ✅ Smart TTS fallback
- ✅ Deferred transcription
### 2. React Example (Component-Based)

A modern React application built with Vite.

```bash
cd examples/react-example
pnpm install
pnpm run dev
# Open http://localhost:5173
```

Features:
- ✅ React 18 + Hooks
- ✅ Component architecture
- ✅ Hot Module Replacement
- ✅ TypeScript-ready
- ✅ Production-ready structure
### 3. Backend Proxy (Node.js/Express)

A secure proxy server that solves CORS issues and protects your API keys.

```bash
cd examples/backend
npm install
cp .env.example .env
# Edit .env and add your API key
npm start
# Server runs on http://localhost:3001
```

Features:
- ✅ Express.js REST API
- ✅ CORS-enabled
- ✅ API key security
- ✅ Audio streaming
- ✅ Health checks
## 🎯 Complete Documentation
For a comprehensive understanding of how everything works together, read our detailed guide:
- Examples Documentation - Complete architecture overview, data flow, and system design
## 📖 API Reference

### AudioManager

#### Constructor

```typescript
new AudioManager(config: AudioManagerConfig)
```

Config Options:
```typescript
interface AudioManagerConfig {
  apiKey: string;    // Required: Deepgram API key
  language?: string; // Optional: Language code (default: 'en-US')
}
```

#### Methods
##### initialize()

Initialize the audio manager. Must be called before using STT/TTS.

```typescript
await audioManager.initialize();
```

##### startListening(callback)
Start speech-to-text transcription.
```typescript
await audioManager.startListening((result) => {
  console.log(result.text);       // Transcribed text
  console.log(result.isFinal);    // Is this a final result?
  console.log(result.confidence); // Confidence score (0-1)
});
```

Result Interface:

```typescript
interface TranscriptionResult {
  text: string;       // Transcribed text
  isFinal: boolean;   // Final result or interim?
  confidence: number; // Confidence score (0-1)
}
```
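Interim results arrive repeatedly while a phrase is still being spoken, so a common pattern is to accumulate only the finalized segments. A small sketch using the callback above:

```typescript
// Collect only finalized segments into a running transcript.
const finalSegments: string[] = [];

await audioManager.startListening((result) => {
  if (result.isFinal) {
    finalSegments.push(result.text);
    console.log('Transcript so far:', finalSegments.join(' '));
  } else {
    console.log('Interim:', result.text); // superseded by a later final result
  }
});
```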
##### stopListening()

Stop speech-to-text transcription.

```typescript
await audioManager.stopListening();
```

##### speak(text, options)
Generate speech from text.
```typescript
const result = await audioManager.speak('Hello, world!', {
  model: 'aura-asteria-en' // Optional: Voice model
});

console.log(result.audioBuffer); // ArrayBuffer
console.log(result.duration);    // Duration in seconds
```

Options:
```typescript
interface TTSOptions {
  model?: string; // Voice model (default: 'aura-asteria-en')
}
```

Result Interface:
```typescript
interface TTSResult {
  audioBuffer: ArrayBuffer; // Audio data
  duration: number;         // Duration in seconds
}
```

#### Available Voice Models
| Model | Gender | Language | Description |
|-------|--------|----------|-------------|
| aura-asteria-en | Female | English | Warm and friendly |
| aura-luna-en | Female | English | Clear and professional |
| aura-stella-en | Female | English | Expressive |
| aura-athena-en | Female | English | Authoritative |
| aura-hera-en | Female | English | Confident |
| aura-orion-en | Male | English | Deep and commanding |
| aura-arcas-en | Male | English | Neutral and clear |
| aura-perseus-en | Male | English | Energetic |
| aura-angus-en | Male | English | Smooth |
| aura-orpheus-en | Male | English | Rich and resonant |
| aura-helios-en | Male | English | Bright |
| aura-zeus-en | Male | English | Powerful |
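To use a different voice, pass its model id to `speak()`:

```typescript
// Switch to a male voice for this utterance
const result = await audioManager.speak('System online.', {
  model: 'aura-orion-en'
});
```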
## 🏗️ Architecture

```
AudioManager (Core)
│
├── DeepgramSTTProvider
│   ├── Microphone Permission
│   ├── AudioContext
│   ├── WebSocket (Live Transcription)
│   └── Event Emitter
│
└── DeepgramTTSProvider
    ├── HTTP Client (REST API)
    ├── Audio Buffer Handler
    └── Duration Calculator
```
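The provider layer is pluggable: each backend implements a common STT/TTS contract behind `AudioManager`. The real interfaces live under `src/core/types.ts` and `src/providers/`; the shape below is an illustrative assumption, not the published API:

```typescript
// Hypothetical provider contracts -- illustrative only. The actual
// interfaces are defined in src/core/types.ts and src/providers/.
// TranscriptionResult, TTSOptions, and TTSResult are the interfaces
// documented in the API reference above.
interface STTProvider {
  start(onResult: (result: TranscriptionResult) => void): Promise<void>;
  stop(): Promise<void>;
}

interface TTSProvider {
  synthesize(text: string, options?: TTSOptions): Promise<TTSResult>;
}
```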
## 🔒 Security Considerations

### API Key Protection

❌ Don't expose API keys in the frontend:

```typescript
// Bad - the API key is visible to anyone who views the source
const audioManager = new AudioManager({
  apiKey: 'abc123...' // Visible in source code!
});
```

✅ Use a backend proxy instead:
1. Set up the backend proxy server
2. The frontend calls your proxy
3. The proxy adds the API key server-side
4. The proxy calls the Deepgram API
### CORS Issues

- STT (WebSocket): works directly from the browser
- TTS (REST): requires a backend proxy due to CORS

See our backend example for a complete proxy implementation.
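For orientation, here is a minimal sketch of such a proxy. It assumes Node 18+ (for the global `fetch`) and Deepgram's `POST /v1/speak` REST endpoint; it is a simplified illustration, not the bundled `examples/backend` server:

```typescript
// Minimal Express TTS proxy sketch (simplified; see examples/backend for
// the full implementation). The Deepgram API key stays server-side.
import express from 'express';
import cors from 'cors';

const app = express();
app.use(cors());
app.use(express.json());

app.post('/api/tts', async (req, res) => {
  const { text, model = 'aura-asteria-en' } = req.body;
  const dgRes = await fetch(
    `https://api.deepgram.com/v1/speak?model=${encodeURIComponent(model)}`,
    {
      method: 'POST',
      headers: {
        Authorization: `Token ${process.env.DEEPGRAM_API_KEY}`, // never sent to the browser
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ text }),
    }
  );
  if (!dgRes.ok) {
    res.status(dgRes.status).send(await dgRes.text());
    return;
  }
  res.set('Content-Type', 'audio/mpeg');
  res.send(Buffer.from(await dgRes.arrayBuffer()));
});

app.listen(3001, () => console.log('Proxy listening on http://localhost:3001'));
```

The browser then calls your proxy instead of Deepgram directly:

```typescript
// Browser side: the API key never leaves the server
const res = await fetch('http://localhost:3001/api/tts', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello from the proxy!' })
});
const audioBuffer = await res.arrayBuffer();
```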
## 🛠️ Development

### Build from Source

```bash
# Clone the repository
git clone https://github.com/sathsarabandaraj/audio-handler.git
cd audio-handler

# Install dependencies
pnpm install

# Build the package
pnpm run build
# Output: dist/index.js
```

### Project Structure
```
audio-handler/
├── src/
│   ├── index.ts                 # Main exports
│   ├── core/
│   │   ├── audio-manager.ts     # Core AudioManager class
│   │   ├── types.ts             # TypeScript interfaces
│   │   └── mic-permission.ts    # Microphone permission handler
│   └── providers/
│       └── deepgram/
│           ├── index.ts         # Provider exports
│           ├── stt.ts           # STT implementation
│           ├── tts.ts           # TTS implementation
│           └── types.ts         # Provider types
├── examples/
│   ├── html-example/            # Vanilla JS example
│   ├── react-example/           # React example
│   └── backend/                 # Express proxy server
├── dist/                        # Build output
├── package.json
├── tsconfig.json
└── rollup.config.js
```

## 🧪 Testing
### Manual Testing

Use the provided examples to test functionality:

```bash
# Test the HTML example
cd examples/html-example
python3 -m http.server 8080

# Test the React example
cd examples/react-example
pnpm run dev

# Test the backend proxy
cd examples/backend
npm start
```

## 🤝 Contributing
Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Commit your changes: `git commit -m 'Add amazing feature'`
4. Push to the branch: `git push origin feature/amazing-feature`
5. Open a Pull Request
## 📝 License
MIT License
Copyright (c) 2025 Sathsara Bandaraj
## 🙏 Acknowledgments
- Deepgram for their excellent AI-powered speech APIs
- Rollup for module bundling
- TypeScript for type safety
## 💬 Support
- 📧 Email: [email protected]
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
## 📊 Browser Compatibility

| Feature | Chrome | Firefox | Safari | Edge |
|---------|--------|---------|--------|------|
| STT (getUserMedia) | ✅ 53+ | ✅ 36+ | ✅ 11+ | ✅ 79+ |
| AudioContext | ✅ 35+ | ✅ 25+ | ✅ 14.1+ | ✅ 79+ |
| WebSocket | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Web Speech API | ✅ 33+ | ✅ 49+ | ✅ 14.1+ | ✅ 79+ |
Recommendation: Use the latest versions of Chrome, Firefox, Edge, or Safari for the best experience.
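If you need to support older browsers, gate initialization on a quick capability probe. A minimal sketch (the helper name is ours, not part of the library):

```typescript
// Quick capability probe before initializing audio features.
// checkAudioSupport is a local helper, not part of the library.
function checkAudioSupport(): { stt: boolean; tts: boolean } {
  return {
    stt: !!navigator.mediaDevices?.getUserMedia && 'WebSocket' in window,
    tts: 'AudioContext' in window || 'webkitAudioContext' in window,
  };
}

const support = checkAudioSupport();
if (!support.stt) {
  console.warn('Live transcription is not supported in this browser');
}
```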
## 📈 Roadmap
- [ ] Additional provider support (Azure, AWS, Google)
- [ ] Language detection
- [ ] Custom vocabulary support
- [ ] Audio recording and export
- [ ] Real-time waveform visualization
- [ ] Voice activity detection
- [ ] Noise cancellation
- [ ] Multiple speaker detection
## 🎉 Getting Started
Ready to add voice capabilities to your app? Start with our examples:
- Quick Test: Try the HTML example - no build required
- Production Ready: Use the React example as a template
- Security: Set up the backend proxy for production
Have questions? Check out our comprehensive documentation!
Built with ❤️ by Sathsara Bandaraj
Star ⭐ this repo if you find it useful!
