voice-to-text-converter
v1.0.0
Voice-to-Text Converter
A modern, lightweight Node.js package for speech-to-text conversion with support for multiple engines and both Node.js and browser environments.
Features
- 🎤 Multiple Input Sources: Microphone, audio files, and streams
- 🔧 Multiple Engines: Web Speech API, Vosk (offline), Google Cloud Speech-to-Text
- 🌐 Cross-Platform: Works in Node.js and browsers
- 📝 TypeScript Support: Full type definitions included
- 🔄 Real-time Processing: Streaming and continuous recognition
- 🛡️ Error Handling: Comprehensive error handling and fallback mechanisms
- 🎯 Simple API: Clean, intuitive interface for developers
Installation
npm install voice-to-text-converter
Optional Dependencies
For offline processing with Vosk:
npm install vosk
For Google Cloud Speech-to-Text:
npm install @google-cloud/speech
For microphone recording in Node.js:
npm install node-record-lpcm16
Quick Start
Node.js
import { VoiceToText, transcribeFromFile, transcribeFromMicrophone } from 'voice-to-text-converter';
// Quick transcription from file
const results = await transcribeFromFile('audio.wav', {
language: 'en-US'
});
console.log(results[0].transcript);
// Quick transcription from microphone
const micResults = await transcribeFromMicrophone({
duration: 5000, // 5 seconds
language: 'en-US'
});
console.log(micResults[0].transcript);
Browser
<script src="https://unpkg.com/voice-to-text-converter/lib/browser.js"></script>
<script>
// Quick transcription from microphone
voiceToText.transcribeFromMicrophone({
duration: 5000,
language: 'en-US'
}).then(results => {
console.log(results[0].transcript);
});
</script>
Usage Examples
Basic Usage
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText({
defaultEngine: { engine: 'vosk', modelPath: './models/vosk-model-en-us' },
defaultRecognitionConfig: {
language: 'en-US',
continuous: true,
interimResults: true
}
});
// Initialize the converter
await voiceToText.initialize();
// Set up event listeners
voiceToText.on('result', (result) => {
console.log(`Transcript: ${result.transcript}`);
console.log(`Confidence: ${result.confidence}`);
console.log(`Is Final: ${result.isFinal}`);
});
voiceToText.on('error', (error) => {
console.error('Recognition error:', error.message);
});
// Start listening from microphone
await voiceToText.fromMicrophone({
duration: 10000 // Record for 10 seconds
});
// Clean up
await voiceToText.cleanup();
File Processing
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText();
await voiceToText.initialize();
// Process single file
const results = await voiceToText.fromFile('speech.wav', {
language: 'en-US',
maxAlternatives: 3
});
results.forEach((result, index) => {
console.log(`Result ${index + 1}: ${result.transcript}`);
console.log(`Confidence: ${result.confidence}`);
});
Stream Processing
import { VoiceToText } from 'voice-to-text-converter';
import fs from 'fs';
const voiceToText = new VoiceToText();
await voiceToText.initialize();
const audioStream = fs.createReadStream('audio.wav');
const results = await voiceToText.fromStream(audioStream, {
language: 'es-ES'
});
console.log('Transcription:', results.map(r => r.transcript).join(' '));
Real-time Recognition
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText({
defaultRecognitionConfig: {
continuous: true,
interimResults: true
}
});
await voiceToText.initialize();
// Handle real-time results
voiceToText.on('result', (result) => {
if (result.isFinal) {
console.log('Final:', result.transcript);
} else {
console.log('Interim:', result.transcript);
}
});
// Start continuous listening
await voiceToText.startListening({
source: 'microphone'
});
// Stop after 30 seconds
setTimeout(async () => {
await voiceToText.stopListening();
}, 30000);
Engine-Specific Usage
Vosk (Offline)
import { VoiceToText, VoskEngine } from 'voice-to-text-converter';
// Download and setup Vosk model first
const modelPath = await VoskEngine.downloadModel('en-US', 'small');
const voiceToText = new VoiceToText({
defaultEngine: {
engine: 'vosk',
modelPath: modelPath
}
});
await voiceToText.initialize();
const results = await voiceToText.fromFile('audio.wav');
Google Cloud Speech-to-Text
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText({
defaultEngine: {
engine: 'google-cloud',
apiKey: 'your-api-key',
projectId: 'your-project-id'
}
});
await voiceToText.initialize();
const results = await voiceToText.fromFile('audio.wav', {
language: 'en-US',
encoding: 'FLAC'
});
Web Speech API (Browser)
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText({
defaultEngine: { engine: 'web-speech' }
});
await voiceToText.initialize();
// Only works in browsers with microphone access
await voiceToText.fromMicrophone({
duration: 5000,
language: 'en-US'
});
API Reference
VoiceToText Class
Constructor
new VoiceToText(options?: VoiceToTextOptions)
Options:
- defaultEngine?: EngineConfig - Default engine configuration
- defaultRecognitionConfig?: SpeechRecognitionConfig - Default recognition settings
- enableFallback?: boolean - Enable automatic engine fallback (default: true)
- enginePriority?: Array<'web-speech' | 'vosk' | 'google-cloud'> - Engine priority order
- debug?: boolean - Enable debug logging (default: false)
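How enableFallback and enginePriority interact can be sketched with a small selection helper. pickEngine below is a hypothetical illustration of the documented semantics, not a function exported by this package:

```javascript
// Hypothetical sketch (NOT part of the package API): pick an engine given a
// preferred priority order, the engines actually available, and the fallback
// flag. This only illustrates the documented semantics of `enginePriority`
// and `enableFallback`.
function pickEngine(priority, available, enableFallback) {
  // Try engines in the caller's preferred order first.
  for (const engine of priority) {
    if (available.includes(engine)) return engine;
  }
  // With fallback disabled, an unavailable preferred engine is an error.
  if (!enableFallback) {
    throw new Error('No engine from the priority list is available');
  }
  // Otherwise fall back to whatever is available, or null if nothing is.
  return available[0] ?? null;
}
```

For example, pickEngine(['vosk', 'google-cloud'], ['google-cloud'], true) selects 'google-cloud', while the same call with an empty intersection and enableFallback: false throws.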
Methods
initialize(): Promise<void>
Initialize the voice-to-text converter and select the best available engine.
fromMicrophone(options?: MicrophoneOptions): Promise<void>
Start speech recognition from microphone input.
Options:
- duration?: number - Recording duration in milliseconds
- deviceId?: string - Specific microphone device ID
- sampleRate?: number - Audio sample rate (default: 16000)
fromFile(filePath: string, config?: SpeechRecognitionConfig): Promise<SpeechRecognitionResult[]>
Process an audio file and return transcription results.
fromStream(stream: NodeJS.ReadableStream, config?: SpeechRecognitionConfig): Promise<SpeechRecognitionResult[]>
Process an audio stream and return transcription results.
startListening(audioConfig: AudioInputConfig, config?: SpeechRecognitionConfig): Promise<void>
Start continuous speech recognition.
stopListening(): Promise<void>
Stop ongoing speech recognition.
abort(): Promise<void>
Abort speech recognition immediately.
switchEngine(engineConfig: EngineConfig): Promise<void>
Switch to a different speech recognition engine.
getCurrentEngine(): EngineInfo | null
Get information about the currently active engine.
cleanup(): Promise<void>
Clean up resources and stop all recognition processes.
Properties
isListening: boolean
Whether the converter is currently listening/recording.
Static Methods
getAvailableEngines(): Array<'web-speech' | 'vosk' | 'google-cloud'>
Get list of available engines in the current environment.
isEngineAvailable(engine: string): boolean
Check if a specific engine is available.
getEngineCapabilities(engine: string): EngineCapabilities
Get capabilities and features of a specific engine.
getBrowserSupport(): BrowserSupport
Get browser compatibility information.
quickTranscribe(source, options): Promise<SpeechRecognitionResult[]>
Quick one-time transcription without managing instance lifecycle.
Events
The VoiceToText class extends EventEmitter and emits the following events:
- start - Recognition started
- end - Recognition ended
- result - Transcription result available
- error - Error occurred
- audiostart - Audio input started
- audioend - Audio input ended
- soundstart - Sound detected
- soundend - Sound ended
- speechstart - Speech detected
- speechend - Speech ended
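Because VoiceToText extends EventEmitter, one helper can wire a logger to every lifecycle event at once. attachLifecycleLogging below is a hypothetical convenience, not a package export; it works with any object exposing an EventEmitter-style on method:

```javascript
// All lifecycle events documented above.
const LIFECYCLE_EVENTS = [
  'start', 'end', 'result', 'error',
  'audiostart', 'audioend', 'soundstart', 'soundend',
  'speechstart', 'speechend',
];

// Hypothetical helper (NOT a package export): attach one log callback to
// every lifecycle event of an EventEmitter-like object, e.g. a VoiceToText
// instance. Returns the emitter for chaining.
function attachLifecycleLogging(emitter, log = console.log) {
  for (const event of LIFECYCLE_EVENTS) {
    emitter.on(event, (payload) => log(event, payload));
  }
  return emitter;
}
```

Usage would be attachLifecycleLogging(voiceToText), after which every state change (speech detected, interim result, error, …) is logged with its event name.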
Types and Interfaces
SpeechRecognitionResult
interface SpeechRecognitionResult {
transcript: string;
confidence: number;
isFinal: boolean;
alternatives?: Array<{
transcript: string;
confidence: number;
}>;
timestamp?: {
start: number;
end: number;
};
}
SpeechRecognitionConfig
interface SpeechRecognitionConfig {
language?: string; // Language code (e.g., 'en-US')
sampleRate?: number; // Audio sample rate
continuous?: boolean; // Enable continuous recognition
interimResults?: boolean; // Return interim results
maxAlternatives?: number; // Maximum alternatives to return
confidenceThreshold?: number; // Confidence threshold (0-1)
phrases?: string[]; // Custom vocabulary
encoding?: 'LINEAR16' | 'FLAC' | 'MULAW' | 'AMR' | 'AMR_WB' | 'OGG_OPUS';
}
EngineConfig
interface EngineConfig {
engine: 'web-speech' | 'vosk' | 'google-cloud';
apiKey?: string; // For cloud services
modelPath?: string; // For offline engines
projectId?: string; // For Google Cloud
endpoint?: string; // Custom endpoint URL
}
Quick Start Functions
transcribeFromFile(filePath: string, options?): Promise<SpeechRecognitionResult[]>
Quick transcription from an audio file.
transcribeFromMicrophone(options?): Promise<SpeechRecognitionResult[]>
Quick transcription from microphone input.
transcribeFromStream(stream: NodeJS.ReadableStream, options?): Promise<SpeechRecognitionResult[]>
Quick transcription from an audio stream.
createVoiceToText(options?): VoiceToText
Factory function to create a VoiceToText instance.
getSystemInfo(): SystemInfo
Get system information and available engines.
Engine Comparison
| Feature | Web Speech API | Vosk | Google Cloud Speech |
|---------|----------------|------|---------------------|
| Environment | Browser only | Node.js + Browser | Node.js + Browser |
| Online/Offline | Online | Offline | Online |
| Accuracy | High | Medium-High | Very High |
| Speed | Fast | Fast | Fast |
| Privacy | Data sent to Google | Fully private | Data sent to Google |
| Cost | Free | Free | Pay per use |
| Languages | 60+ | 20+ | 125+ |
| File Processing | Limited | Yes | Yes |
| Streaming | Yes | Yes | Yes |
| Setup Complexity | None | Model download required | API key required |
When to Use Each Engine
Web Speech API:
- Browser-based applications
- Quick prototyping
- No setup required
- Real-time microphone input
Vosk:
- Privacy-sensitive applications
- Offline processing required
- Cost-sensitive projects
- Edge computing scenarios
Google Cloud Speech:
- High accuracy requirements
- Production applications
- Multiple language support
- Advanced features needed
Configuration
Environment Variables
# Google Cloud Speech (optional)
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
GOOGLE_CLOUD_PROJECT=your-project-id
# OpenAI API (if using Whisper integration)
OPENAI_API_KEY=your-openai-api-key
Vosk Model Setup
- Download a Vosk model from alphacephei.com/vosk/models
- Extract the model to a directory
- Use the model path in your configuration:
const voiceToText = new VoiceToText({
defaultEngine: {
engine: 'vosk',
modelPath: './models/vosk-model-en-us-0.22'
}
});
Google Cloud Setup
- Create a Google Cloud project
- Enable the Speech-to-Text API
- Create a service account and download the JSON key
- Set the environment variable or pass credentials directly:
const voiceToText = new VoiceToText({
defaultEngine: {
engine: 'google-cloud',
apiKey: 'your-api-key',
projectId: 'your-project-id'
}
});
Browser Usage
CDN
<script src="https://unpkg.com/voice-to-text-converter/lib/browser.js"></script>
ES Modules
import { VoiceToText } from 'voice-to-text-converter/browser';
const voiceToText = new VoiceToText();
await voiceToText.initialize();
// Request microphone permission
const hasPermission = await voiceToText.requestMicrophonePermission();
if (hasPermission) {
await voiceToText.fromMicrophone({ duration: 5000 });
}
Browser Compatibility
- Chrome 25+
- Firefox 44+
- Safari 14.1+
- Edge 79+
Note: Web Speech API requires HTTPS in production environments.
Error Handling
import { VoiceToText, SpeechRecognitionError, SpeechRecognitionErrorType } from 'voice-to-text-converter';
const voiceToText = new VoiceToText();
voiceToText.on('error', (error) => {
switch (error.type) {
case SpeechRecognitionErrorType.NO_SPEECH:
console.log('No speech detected');
break;
case SpeechRecognitionErrorType.AUDIO_CAPTURE:
console.log('Microphone access denied');
break;
case SpeechRecognitionErrorType.NETWORK:
console.log('Network error occurred');
break;
case SpeechRecognitionErrorType.NOT_ALLOWED:
console.log('Permission denied');
break;
default:
console.error('Recognition error:', error.message);
}
});
try {
await voiceToText.initialize();
const results = await voiceToText.fromFile('audio.wav');
} catch (error) {
console.error('Failed to process audio:', error.message);
}
Performance Tips
Optimization
- Choose the right engine for your use case
- Set appropriate sample rates (16000 Hz is usually sufficient)
- Use confidence thresholds to filter low-quality results
- Enable interim results only when needed
- Implement proper cleanup to prevent memory leaks
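The confidence-threshold tip can be applied as a simple post-processing step. filterByConfidence below is a hypothetical helper (not a package export) operating on the SpeechRecognitionResult shape from the API reference:

```javascript
// Hypothetical post-processing helper (NOT a package export): keep only the
// final results whose confidence clears a threshold, returning transcripts.
function filterByConfidence(results, threshold = 0.7) {
  return results
    .filter((r) => r.isFinal && r.confidence >= threshold)
    .map((r) => r.transcript);
}

// Example with mock results in the SpeechRecognitionResult shape:
const mock = [
  { transcript: 'hello world', confidence: 0.92, isFinal: true },
  { transcript: 'hullo weld', confidence: 0.41, isFinal: true },
  { transcript: 'hello', confidence: 0.88, isFinal: false }, // interim
];
console.log(filterByConfidence(mock)); // → [ 'hello world' ]
```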
Memory Management
// Always clean up resources
const voiceToText = new VoiceToText();
try {
await voiceToText.initialize();
// ... use the converter
} finally {
await voiceToText.cleanup();
}
// Or use the quick functions for one-time use
const results = await transcribeFromFile('audio.wav');
Batch Processing
// Process multiple files efficiently
const voiceToText = new VoiceToText();
await voiceToText.initialize();
const files = ['audio1.wav', 'audio2.wav', 'audio3.wav'];
const results = await Promise.all(
files.map(file => voiceToText.fromFile(file))
);
await voiceToText.cleanup();
Troubleshooting
Common Issues
"No speech recognition engines are available"
- Cause: No compatible engines are installed or available
- Solution: Install the optional dependencies (vosk, @google-cloud/speech) or use in a browser environment
"Microphone access denied"
- Cause: Browser blocked microphone access
- Solution: Enable microphone permissions in browser settings, ensure HTTPS in production
"Model not found" (Vosk)
- Cause: Vosk model path is incorrect or model not downloaded
- Solution: Download the correct model and verify the path
"Authentication failed" (Google Cloud)
- Cause: Invalid API credentials
- Solution: Verify API key and project ID, check service account permissions
Poor recognition accuracy
- Cause: Low audio quality, wrong language setting, or inappropriate engine
- Solution:
- Improve audio quality (reduce noise, use better microphone)
- Set correct language in configuration
- Try different engines
- Adjust confidence threshold
Debug Mode
Enable debug mode to get detailed logging:
const voiceToText = new VoiceToText({ debug: true });
Testing Audio Setup
import { getSystemInfo, VoiceToText } from 'voice-to-text-converter';
// Check system capabilities
const systemInfo = getSystemInfo();
console.log('Available engines:', systemInfo.availableEngines);
console.log('Platform:', systemInfo.platform);
// Test browser support
if (systemInfo.platform === 'browser') {
const support = VoiceToText.getBrowserSupport();
console.log('Web Speech API:', support.webSpeechAPI);
console.log('Media Recorder:', support.mediaRecorder);
console.log('getUserMedia:', support.getUserMedia);
}
Examples
See the examples/ directory for complete working examples:
- examples/node-basic.js - Basic Node.js usage
- examples/node-advanced.js - Advanced Node.js features
- examples/browser-simple.html - Simple browser implementation
- examples/browser-advanced.html - Advanced browser features
- examples/real-time.js - Real-time speech recognition
- examples/file-processing.js - Batch file processing
Testing
Run the test suite:
npm test
Run tests with coverage:
npm run test:coverage
Run tests in watch mode:
npm run test:watch
Building
Build the package:
npm run build
Build in watch mode:
npm run build:watch
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Development Setup
- Clone the repository:
git clone https://github.com/yourusername/voice-to-text-converter.git
cd voice-to-text-converter
- Install dependencies:
npm install
- Install optional dependencies for testing:
npm install vosk @google-cloud/speech node-record-lpcm16
- Run tests:
npm test
- Build the package:
npm run build
Code Style
This project uses ESLint and TypeScript for code quality. Run linting:
npm run lint
npm run lint:fix
Security Considerations
Privacy
- Web Speech API: Audio data is sent to Google's servers
- Google Cloud Speech: Audio data is sent to Google Cloud (with enterprise privacy controls)
- Vosk: Fully offline, no data transmission
Permissions
- Browser applications require microphone permission
- Ensure HTTPS for production browser deployments
- Validate and sanitize all audio file inputs
Best Practices
- Always request explicit user consent for microphone access
- Implement proper error handling for permission denials
- Use HTTPS in production environments
- Consider data retention policies for transcribed text
- Implement rate limiting for cloud-based engines
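For the rate-limiting recommendation, a sliding-window counter is often enough. The sketch below is a generic pattern, not part of this package's API; the injectable clock is only there to make the logic deterministic to test:

```javascript
// Minimal sliding-window rate limiter (generic pattern, NOT a package API).
// Allows at most `maxCalls` within any `windowMs` span; `now` is injectable
// so the logic can be exercised without a real clock.
class RateLimiter {
  constructor(maxCalls, windowMs, now = () => Date.now()) {
    this.maxCalls = maxCalls;
    this.windowMs = windowMs;
    this.now = now;
    this.timestamps = [];
  }

  // Returns true (and records the call) if a call is allowed right now.
  tryAcquire() {
    const t = this.now();
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter((ts) => t - ts < this.windowMs);
    if (this.timestamps.length >= this.maxCalls) return false;
    this.timestamps.push(t);
    return true;
  }
}

// Usage sketch: gate cloud transcriptions to 10 requests per minute, e.g.
//   const limiter = new RateLimiter(10, 60_000);
//   if (limiter.tryAcquire()) { /* await voiceToText.fromFile(path); */ }
//   else { /* queue the file or wait */ }
```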
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
See CHANGELOG.md for version history and changes.
Support
Acknowledgments
- Vosk - Open source speech recognition toolkit
- Google Cloud Speech-to-Text - Cloud-based speech recognition
- Web Speech API - Browser speech recognition
Made with ❤️ by the Voice-to-Text Converter team
