voice-to-text-converter
v1.0.0
Voice-to-Text Converter
A modern, lightweight Node.js package for speech-to-text conversion with support for multiple engines and both Node.js and browser environments.
Features
- 🎤 Multiple Input Sources: Microphone, audio files, and streams
- 🔧 Multiple Engines: Web Speech API, Vosk (offline), Google Cloud Speech-to-Text
- 🌐 Cross-Platform: Works in Node.js and browsers
- 📝 TypeScript Support: Full type definitions included
- 🔄 Real-time Processing: Streaming and continuous recognition
- 🛡️ Error Handling: Comprehensive error handling and fallback mechanisms
- 🎯 Simple API: Clean, intuitive interface for developers
Installation
npm install voice-to-text-converter
Optional Dependencies
For offline processing with Vosk:
npm install vosk
For Google Cloud Speech-to-Text:
npm install @google-cloud/speech
For microphone recording in Node.js:
npm install node-record-lpcm16
Quick Start
Node.js
import { VoiceToText, transcribeFromFile, transcribeFromMicrophone } from 'voice-to-text-converter';
// Quick transcription from file
const results = await transcribeFromFile('audio.wav', {
language: 'en-US'
});
console.log(results[0].transcript);
// Quick transcription from microphone
const micResults = await transcribeFromMicrophone({
duration: 5000, // 5 seconds
language: 'en-US'
});
console.log(micResults[0].transcript);
Browser
<script src="https://unpkg.com/voice-to-text-converter/lib/browser.js"></script>
<script>
// Quick transcription from microphone
voiceToText.transcribeFromMicrophone({
duration: 5000,
language: 'en-US'
}).then(results => {
console.log(results[0].transcript);
});
</script>
Usage Examples
Basic Usage
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText({
defaultEngine: { engine: 'vosk', modelPath: './models/vosk-model-en-us' },
defaultRecognitionConfig: {
language: 'en-US',
continuous: true,
interimResults: true
}
});
// Initialize the converter
await voiceToText.initialize();
// Set up event listeners
voiceToText.on('result', (result) => {
console.log(`Transcript: ${result.transcript}`);
console.log(`Confidence: ${result.confidence}`);
console.log(`Is Final: ${result.isFinal}`);
});
voiceToText.on('error', (error) => {
console.error('Recognition error:', error.message);
});
// Start listening from microphone
await voiceToText.fromMicrophone({
duration: 10000 // Record for 10 seconds
});
// Clean up
await voiceToText.cleanup();
File Processing
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText();
await voiceToText.initialize();
// Process single file
const results = await voiceToText.fromFile('speech.wav', {
language: 'en-US',
maxAlternatives: 3
});
results.forEach((result, index) => {
console.log(`Result ${index + 1}: ${result.transcript}`);
console.log(`Confidence: ${result.confidence}`);
});
Stream Processing
import { VoiceToText } from 'voice-to-text-converter';
import fs from 'fs';
const voiceToText = new VoiceToText();
await voiceToText.initialize();
const audioStream = fs.createReadStream('audio.wav');
const results = await voiceToText.fromStream(audioStream, {
language: 'es-ES'
});
console.log('Transcription:', results.map(r => r.transcript).join(' '));
Real-time Recognition
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText({
defaultRecognitionConfig: {
continuous: true,
interimResults: true
}
});
await voiceToText.initialize();
// Handle real-time results
voiceToText.on('result', (result) => {
if (result.isFinal) {
console.log('Final:', result.transcript);
} else {
console.log('Interim:', result.transcript);
}
});
// Start continuous listening
await voiceToText.startListening({
source: 'microphone'
});
// Stop after 30 seconds
setTimeout(async () => {
await voiceToText.stopListening();
}, 30000);
Engine-Specific Usage
Vosk (Offline)
import { VoiceToText, VoskEngine } from 'voice-to-text-converter';
// Download and setup Vosk model first
const modelPath = await VoskEngine.downloadModel('en-US', 'small');
const voiceToText = new VoiceToText({
defaultEngine: {
engine: 'vosk',
modelPath: modelPath
}
});
await voiceToText.initialize();
const results = await voiceToText.fromFile('audio.wav');
Google Cloud Speech-to-Text
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText({
defaultEngine: {
engine: 'google-cloud',
apiKey: 'your-api-key',
projectId: 'your-project-id'
}
});
await voiceToText.initialize();
const results = await voiceToText.fromFile('audio.wav', {
language: 'en-US',
encoding: 'FLAC'
});
Web Speech API (Browser)
import { VoiceToText } from 'voice-to-text-converter';
const voiceToText = new VoiceToText({
defaultEngine: { engine: 'web-speech' }
});
await voiceToText.initialize();
// Only works in browsers with microphone access
await voiceToText.fromMicrophone({
duration: 5000,
language: 'en-US'
});
API Reference
VoiceToText Class
Constructor
new VoiceToText(options?: VoiceToTextOptions)
Options:
- defaultEngine?: EngineConfig - Default engine configuration
- defaultRecognitionConfig?: SpeechRecognitionConfig - Default recognition settings
- enableFallback?: boolean - Enable automatic engine fallback (default: true)
- enginePriority?: Array<'web-speech' | 'vosk' | 'google-cloud'> - Engine priority order
- debug?: boolean - Enable debug logging (default: false)
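How enableFallback and enginePriority interact can be sketched with a small selection helper. pickEngine below is a hypothetical illustration of the documented semantics, not a function exported by this package:

```javascript
// Hypothetical sketch (NOT part of the package API): pick an engine given a
// preferred priority order, the engines actually available, and the fallback
// flag. This only illustrates the documented semantics of `enginePriority`
// and `enableFallback`.
function pickEngine(priority, available, enableFallback) {
  // Try engines in the caller's preferred order first.
  for (const engine of priority) {
    if (available.includes(engine)) return engine;
  }
  // With fallback disabled, an unavailable preferred engine is an error.
  if (!enableFallback) {
    throw new Error('No engine from the priority list is available');
  }
  // Otherwise fall back to whatever is available, or null if nothing is.
  return available[0] ?? null;
}
```

For example, pickEngine(['vosk', 'google-cloud'], ['google-cloud'], true) selects 'google-cloud', while the same call with an empty intersection and enableFallback: false throws.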
Methods
initialize(): Promise<void>
Initialize the voice-to-text converter and select the best available engine.
fromMicrophone(options?: MicrophoneOptions): Promise<void>
Start speech recognition from microphone input.
Options:
- duration?: number - Recording duration in milliseconds
- deviceId?: string - Specific microphone device ID
- sampleRate?: number - Audio sample rate (default: 16000)
fromFile(filePath: string, config?: SpeechRecognitionConfig): Promise<SpeechRecognitionResult[]>
Process an audio file and return transcription results.
fromStream(stream: NodeJS.ReadableStream, config?: SpeechRecognitionConfig): Promise<SpeechRecognitionResult[]>
Process an audio stream and return transcription results.
startListening(audioConfig: AudioInputConfig, config?: SpeechRecognitionConfig): Promise<void>
Start continuous speech recognition.
stopListening(): Promise<void>
Stop ongoing speech recognition.
abort(): Promise<void>
Abort speech recognition immediately.
switchEngine(engineConfig: EngineConfig): Promise<void>
Switch to a different speech recognition engine.
getCurrentEngine(): EngineInfo | null
Get information about the currently active engine.
cleanup(): Promise<void>
Clean up resources and stop all recognition processes.
Properties
isListening: boolean
Whether the converter is currently listening/recording.
Static Methods
getAvailableEngines(): Array<'web-speech' | 'vosk' | 'google-cloud'>
Get list of available engines in the current environment.
isEngineAvailable(engine: string): boolean
Check if a specific engine is available.
getEngineCapabilities(engine: string): EngineCapabilities
Get capabilities and features of a specific engine.
getBrowserSupport(): BrowserSupport
Get browser compatibility information.
quickTranscribe(source, options): Promise<SpeechRecognitionResult[]>
Quick one-time transcription without managing instance lifecycle.
Events
The VoiceToText class extends EventEmitter and emits the following events:
- start - Recognition started
- end - Recognition ended
- result - Transcription result available
- error - Error occurred
- audiostart - Audio input started
- audioend - Audio input ended
- soundstart - Sound detected
- soundend - Sound ended
- speechstart - Speech detected
- speechend - Speech ended
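Because VoiceToText extends EventEmitter, one helper can wire a logger to every lifecycle event at once. attachLifecycleLogging below is a hypothetical convenience, not a package export; it works with any object exposing an EventEmitter-style on method:

```javascript
// All lifecycle events documented above.
const LIFECYCLE_EVENTS = [
  'start', 'end', 'result', 'error',
  'audiostart', 'audioend', 'soundstart', 'soundend',
  'speechstart', 'speechend',
];

// Hypothetical helper (NOT a package export): attach one log callback to
// every lifecycle event of an EventEmitter-like object, e.g. a VoiceToText
// instance. Returns the emitter for chaining.
function attachLifecycleLogging(emitter, log = console.log) {
  for (const event of LIFECYCLE_EVENTS) {
    emitter.on(event, (payload) => log(event, payload));
  }
  return emitter;
}
```

Usage would be attachLifecycleLogging(voiceToText), after which every state change (speech detected, interim result, error, …) is logged with its event name.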
Types and Interfaces
SpeechRecognitionResult
interface SpeechRecognitionResult {
transcript: string;
confidence: number;
isFinal: boolean;
alternatives?: Array<{
transcript: string;
confidence: number;
}>;
timestamp?: {
start: number;
end: number;
};
}
SpeechRecognitionConfig
interface SpeechRecognitionConfig {
language?: string; // Language code (e.g., 'en-US')
sampleRate?: number; // Audio sample rate
continuous?: boolean; // Enable continuous recognition
interimResults?: boolean; // Return interim results
maxAlternatives?: number; // Maximum alternatives to return
confidenceThreshold?: number; // Confidence threshold (0-1)
phrases?: string[]; // Custom vocabulary
encoding?: 'LINEAR16' | 'FLAC' | 'MULAW' | 'AMR' | 'AMR_WB' | 'OGG_OPUS';
}
EngineConfig
interface EngineConfig {
engine: 'web-speech' | 'vosk' | 'google-cloud';
apiKey?: string; // For cloud services
modelPath?: string; // For offline engines
projectId?: string; // For Google Cloud
endpoint?: string; // Custom endpoint URL
}
Quick Start Functions
transcribeFromFile(filePath: string, options?): Promise<SpeechRecognitionResult[]>
Quick transcription from an audio file.
transcribeFromMicrophone(options?): Promise<SpeechRecognitionResult[]>
Quick transcription from microphone input.
transcribeFromStream(stream: NodeJS.ReadableStream, options?): Promise<SpeechRecognitionResult[]>
Quick transcription from an audio stream.
createVoiceToText(options?): VoiceToText
Factory function to create a VoiceToText instance.
getSystemInfo(): SystemInfo
Get system information and available engines.
Engine Comparison
| Feature | Web Speech API | Vosk | Google Cloud Speech |
|---------|----------------|------|---------------------|
| Environment | Browser only | Node.js + Browser | Node.js + Browser |
| Online/Offline | Online | Offline | Online |
| Accuracy | High | Medium-High | Very High |
| Speed | Fast | Fast | Fast |
| Privacy | Data sent to Google | Fully private | Data sent to Google |
| Cost | Free | Free | Pay per use |
| Languages | 60+ | 20+ | 125+ |
| File Processing | Limited | Yes | Yes |
| Streaming | Yes | Yes | Yes |
| Setup Complexity | None | Model download required | API key required |
When to Use Each Engine
Web Speech API:
- Browser-based applications
- Quick prototyping
- No setup required
- Real-time microphone input
Vosk:
- Privacy-sensitive applications
- Offline processing required
- Cost-sensitive projects
- Edge computing scenarios
Google Cloud Speech:
- High accuracy requirements
- Production applications
- Multiple language support
- Advanced features needed
Configuration
Environment Variables
# Google Cloud Speech (optional)
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
GOOGLE_CLOUD_PROJECT=your-project-id
# OpenAI API (if using Whisper integration)
OPENAI_API_KEY=your-openai-api-key
Vosk Model Setup
- Download a Vosk model from alphacephei.com/vosk/models
- Extract the model to a directory
- Use the model path in your configuration:
const voiceToText = new VoiceToText({
defaultEngine: {
engine: 'vosk',
modelPath: './models/vosk-model-en-us-0.22'
}
});
Google Cloud Setup
- Create a Google Cloud project
- Enable the Speech-to-Text API
- Create a service account and download the JSON key
- Set the environment variable or pass credentials directly:
const voiceToText = new VoiceToText({
defaultEngine: {
engine: 'google-cloud',
apiKey: 'your-api-key',
projectId: 'your-project-id'
}
});
Browser Usage
CDN
<script src="https://unpkg.com/voice-to-text-converter/lib/browser.js"></script>
ES Modules
import { VoiceToText } from 'voice-to-text-converter/browser';
const voiceToText = new VoiceToText();
await voiceToText.initialize();
// Request microphone permission
const hasPermission = await voiceToText.requestMicrophonePermission();
if (hasPermission) {
await voiceToText.fromMicrophone({ duration: 5000 });
}
Browser Compatibility
- Chrome 25+
- Firefox 44+
- Safari 14.1+
- Edge 79+
Note: Web Speech API requires HTTPS in production environments.
Error Handling
import { VoiceToText, SpeechRecognitionError, SpeechRecognitionErrorType } from 'voice-to-text-converter';
const voiceToText = new VoiceToText();
voiceToText.on('error', (error) => {
switch (error.type) {
case SpeechRecognitionErrorType.NO_SPEECH:
console.log('No speech detected');
break;
case SpeechRecognitionErrorType.AUDIO_CAPTURE:
console.log('Microphone access denied');
break;
case SpeechRecognitionErrorType.NETWORK:
console.log('Network error occurred');
break;
case SpeechRecognitionErrorType.NOT_ALLOWED:
console.log('Permission denied');
break;
default:
console.error('Recognition error:', error.message);
}
});
try {
await voiceToText.initialize();
const results = await voiceToText.fromFile('audio.wav');
} catch (error) {
console.error('Failed to process audio:', error.message);
}
Performance Tips
Optimization
- Choose the right engine for your use case
- Set appropriate sample rates (16000 Hz is usually sufficient)
- Use confidence thresholds to filter low-quality results
- Enable interim results only when needed
- Implement proper cleanup to prevent memory leaks
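The confidence-threshold tip can be applied as a simple post-processing step. filterByConfidence below is a hypothetical helper (not a package export) operating on the SpeechRecognitionResult shape from the API reference:

```javascript
// Hypothetical post-processing helper (NOT a package export): keep only the
// final results whose confidence clears a threshold, returning transcripts.
function filterByConfidence(results, threshold = 0.7) {
  return results
    .filter((r) => r.isFinal && r.confidence >= threshold)
    .map((r) => r.transcript);
}

// Example with mock results in the SpeechRecognitionResult shape:
const mock = [
  { transcript: 'hello world', confidence: 0.92, isFinal: true },
  { transcript: 'hullo weld', confidence: 0.41, isFinal: true },
  { transcript: 'hello', confidence: 0.88, isFinal: false }, // interim
];
console.log(filterByConfidence(mock)); // → [ 'hello world' ]
```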
Memory Management
// Always clean up resources
const voiceToText = new VoiceToText();
try {
await voiceToText.initialize();
// ... use the converter
} finally {
await voiceToText.cleanup();
}
// Or use the quick functions for one-time use
const results = await transcribeFromFile('audio.wav');
Batch Processing
// Process multiple files efficiently
const voiceToText = new VoiceToText();
await voiceToText.initialize();
const files = ['audio1.wav', 'audio2.wav', 'audio3.wav'];
const results = await Promise.all(
files.map(file => voiceToText.fromFile(file))
);
await voiceToText.cleanup();
Troubleshooting
Common Issues
"No speech recognition engines are available"
- Cause: No compatible engines are installed or available
- Solution: Install the optional dependencies (vosk, @google-cloud/speech) or use in a browser environment
"Microphone access denied"
- Cause: Browser blocked microphone access
- Solution: Enable microphone permissions in browser settings, ensure HTTPS in production
"Model not found" (Vosk)
- Cause: Vosk model path is incorrect or model not downloaded
- Solution: Download the correct model and verify the path
"Authentication failed" (Google Cloud)
- Cause: Invalid API credentials
- Solution: Verify API key and project ID, check service account permissions
Poor recognition accuracy
- Cause: Low audio quality, wrong language setting, or inappropriate engine
- Solution:
- Improve audio quality (reduce noise, use better microphone)
- Set correct language in configuration
- Try different engines
- Adjust confidence threshold
Debug Mode
Enable debug mode to get detailed logging:
const voiceToText = new VoiceToText({ debug: true });
Testing Audio Setup
import { getSystemInfo, VoiceToText } from 'voice-to-text-converter';
// Check system capabilities
const systemInfo = getSystemInfo();
console.log('Available engines:', systemInfo.availableEngines);
console.log('Platform:', systemInfo.platform);
// Test browser support
if (systemInfo.platform === 'browser') {
const support = VoiceToText.getBrowserSupport();
console.log('Web Speech API:', support.webSpeechAPI);
console.log('Media Recorder:', support.mediaRecorder);
console.log('getUserMedia:', support.getUserMedia);
}
Examples
See the examples/ directory for complete working examples:
- examples/node-basic.js - Basic Node.js usage
- examples/node-advanced.js - Advanced Node.js features
- examples/browser-simple.html - Simple browser implementation
- examples/browser-advanced.html - Advanced browser features
- examples/real-time.js - Real-time speech recognition
- examples/file-processing.js - Batch file processing
Testing
Run the test suite:
npm test
Run tests with coverage:
npm run test:coverage
Run tests in watch mode:
npm run test:watch
Building
Build the package:
npm run build
Build in watch mode:
npm run build:watch
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Development Setup
- Clone the repository:
git clone https://github.com/yourusername/voice-to-text-converter.git
cd voice-to-text-converter
- Install dependencies:
npm install
- Install optional dependencies for testing:
npm install vosk @google-cloud/speech node-record-lpcm16
- Run tests:
npm test
- Build the package:
npm run build
Code Style
This project uses ESLint and TypeScript for code quality. Run linting:
npm run lint
npm run lint:fix
Security Considerations
Privacy
- Web Speech API: Audio data is sent to Google's servers
- Google Cloud Speech: Audio data is sent to Google Cloud (with enterprise privacy controls)
- Vosk: Fully offline, no data transmission
Permissions
- Browser applications require microphone permission
- Ensure HTTPS for production browser deployments
- Validate and sanitize all audio file inputs
Best Practices
- Always request explicit user consent for microphone access
- Implement proper error handling for permission denials
- Use HTTPS in production environments
- Consider data retention policies for transcribed text
- Implement rate limiting for cloud-based engines
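For the rate-limiting recommendation, a sliding-window counter is often enough. The sketch below is a generic pattern, not part of this package's API; the injectable clock is only there to make the logic deterministic to test:

```javascript
// Minimal sliding-window rate limiter (generic pattern, NOT a package API).
// Allows at most `maxCalls` within any `windowMs` span; `now` is injectable
// so the logic can be exercised without a real clock.
class RateLimiter {
  constructor(maxCalls, windowMs, now = () => Date.now()) {
    this.maxCalls = maxCalls;
    this.windowMs = windowMs;
    this.now = now;
    this.timestamps = [];
  }

  // Returns true (and records the call) if a call is allowed right now.
  tryAcquire() {
    const t = this.now();
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter((ts) => t - ts < this.windowMs);
    if (this.timestamps.length >= this.maxCalls) return false;
    this.timestamps.push(t);
    return true;
  }
}

// Usage sketch: gate cloud transcriptions to 10 requests per minute, e.g.
//   const limiter = new RateLimiter(10, 60_000);
//   if (limiter.tryAcquire()) { /* await voiceToText.fromFile(path); */ }
//   else { /* queue the file or wait */ }
```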
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
See CHANGELOG.md for version history and changes.
Support
Acknowledgments
- Vosk - Open source speech recognition toolkit
- Google Cloud Speech-to-Text - Cloud-based speech recognition
- Web Speech API - Browser speech recognition
Made with ❤️ by the Voice-to-Text Converter team
