@create-voice-agent/openai 🤖

OpenAI Speech-to-Text (Whisper) and Text-to-Speech integration for create-voice-agent.

This package provides both STT and TTS capabilities using OpenAI's Audio API.

Installation

npm install @create-voice-agent/openai
# or
pnpm add @create-voice-agent/openai

Quick Start

import { createVoiceAgent } from "create-voice-agent";
import { OpenAISpeechToText, OpenAITextToSpeech } from "@create-voice-agent/openai";
import { ChatOpenAI } from "@langchain/openai";

const voiceAgent = createVoiceAgent({
  model: new ChatOpenAI({ model: "gpt-4o" }),
  
  stt: new OpenAISpeechToText({
    apiKey: process.env.OPENAI_API_KEY!,
    // Built-in VAD automatically buffers audio until speech ends
    partialTranscripts: true,  // See partial results as user speaks
  }),
  
  tts: new OpenAITextToSpeech({
    apiKey: process.env.OPENAI_API_KEY!,
    voice: "nova",
  }),
});

Speech-to-Text (Whisper)

OpenAISpeechToText

Speech-to-Text using OpenAI's Whisper model with built-in VAD (Voice Activity Detection).

No external VAD required! Unlike raw Whisper, this implementation includes energy-based VAD that automatically buffers audio and detects speech boundaries.

This implementation automatically:

  • Buffers audio until speech ends (using energy-based VAD)
  • Only triggers onSpeechStart when actual speech is detected
  • Provides partial transcriptions as you speak
  • Filters out echo/noise from TTS playback

import { OpenAISpeechToText } from "@create-voice-agent/openai";

const stt = new OpenAISpeechToText({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "whisper-1",
  
  // Callback when speech is detected (for barge-in)
  onSpeechStart: () => console.log("User started speaking..."),
  
  // Enable real-time partial transcriptions
  partialTranscripts: true,
  partialIntervalMs: 1000,
});

STT Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| apiKey | string | required | OpenAI API key |
| model | string | "whisper-1" | Whisper model ID |
| sampleRate | number | 16000 | Input audio sample rate |
| onSpeechStart | () => void | - | Callback when VAD detects speech start |
| minAudioDurationMs | number | 500 | Minimum audio duration to transcribe (filters noise) |
| vadEnergyThreshold | number | 500 | Energy threshold for speech detection |
| vadSilenceFrames | number | 15 | Silence frames (~480ms) before speech end |
| partialTranscripts | boolean | true | Enable partial transcriptions during speech |
| partialIntervalMs | number | 1000 | Interval between partial transcription requests |

Partial Transcriptions

When partialTranscripts is enabled, you'll see real-time transcriptions in the logs as the user speaks:

OpenAI STT [VAD]: Speech started (energy: 1523, threshold: 500)
OpenAI STT [buffering]: 1.00s | 32.0KB | energy: avg=2654, peak=5234
OpenAI STT [partial]: "I would like a" (1.05s)
OpenAI STT [buffering]: 2.00s | 64.0KB | energy: avg=2298, peak=5234
OpenAI STT [partial]: "I would like a turkey sandwich" (2.10s)
OpenAI STT [VAD]: Speech ended | duration: 2.85s | size: 91.2KB
OpenAI STT [transcribed]: "I would like a turkey sandwich with cheese" | audio: 2.85s | latency: 312ms

Note: Partial transcripts use additional Whisper API calls. Set partialTranscripts: false to disable and reduce API costs.
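
For example, to keep a single Whisper call per utterance:

import { OpenAISpeechToText } from "@create-voice-agent/openai";

const stt = new OpenAISpeechToText({
  apiKey: process.env.OPENAI_API_KEY!,
  partialTranscripts: false, // final transcript only - skips the extra Whisper calls
});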

VAD Tuning

The built-in VAD can be tuned for different environments:

// For noisy environments - raise thresholds to ignore background sound
const noisyStt = new OpenAISpeechToText({
  apiKey: process.env.OPENAI_API_KEY!,
  vadEnergyThreshold: 800,      // Higher = less sensitive
  minAudioDurationMs: 700,      // Longer minimum to filter noise
  vadSilenceFrames: 20,         // ~640ms silence before end
});

// For quiet environments - lower thresholds to catch softer speech
const quietStt = new OpenAISpeechToText({
  apiKey: process.env.OPENAI_API_KEY!,
  vadEnergyThreshold: 300,      // Lower = more sensitive
  minAudioDurationMs: 300,      // Shorter minimum
  vadSilenceFrames: 10,         // ~320ms silence before end
});
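
For intuition when picking vadEnergyThreshold: the exact metric is internal to the package, but a mean-absolute-amplitude measure over each 16-bit PCM frame behaves like the energy figures in the logs above. An illustrative sketch (not the package's actual code):

// Illustrative only: mean absolute amplitude of a 16-bit mono PCM frame,
// roughly the kind of "energy" figure the VAD logs report.
function frameEnergy(frame: Buffer): number {
  let sum = 0;
  for (let i = 0; i + 1 < frame.length; i += 2) {
    sum += Math.abs(frame.readInt16LE(i));
  }
  return sum / Math.floor(frame.length / 2);
}

// A frame counts as speech when its energy exceeds the configured threshold.
const isSpeech = (frame: Buffer) => frameEnergy(frame) > 500;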

Text-to-Speech

OpenAITextToSpeech

Text-to-Speech using OpenAI's TTS API.

⚠️ Note: OpenAI TTS outputs MP3 audio, not PCM. You may need additional transforms to convert to PCM for certain playback systems.

import { OpenAITextToSpeech } from "@create-voice-agent/openai";

const tts = new OpenAITextToSpeech({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "tts-1",
  voice: "nova",
  
  // Callbacks
  onAudioComplete: () => console.log("Finished speaking"),
  onInterrupt: () => console.log("Speech interrupted"),
});

TTS Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| apiKey | string | required | OpenAI API key |
| model | string | "tts-1" | TTS model ID |
| voice | OpenAIVoice | "alloy" | Voice to use |

TTS Models

| Model | Description | Use Case |
|-------|-------------|----------|
| tts-1 | Optimized for speed | Real-time applications |
| tts-1-hd | Higher quality | Pre-recorded content |
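
Switching models is a one-line change; for example, higher-quality output for pre-recorded content:

import { OpenAITextToSpeech } from "@create-voice-agent/openai";

const hdTts = new OpenAITextToSpeech({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "tts-1-hd", // higher quality, slightly higher latency
  voice: "nova",
});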

Available Voices

| Voice | Description |
|-------|-------------|
| alloy | Neutral and balanced |
| echo | Warm and conversational |
| fable | Expressive and dramatic |
| onyx | Deep and authoritative |
| nova | Friendly and upbeat |
| shimmer | Soft and gentle |
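
The voice option is typed as OpenAIVoice, presumably a union of the six names above (the exact type lives in the package):

// Assumed shape of the package's OpenAIVoice type - a union of the six voices above.
type OpenAIVoice = "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";

const narratorVoice: OpenAIVoice = "onyx"; // deep and authoritative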

Instance Methods

interrupt()

Interrupt the current speech generation.

// User started speaking - stop the agent
tts.interrupt();

speak(text: string): ReadableStream<Buffer>

Generate speech directly without going through the voice pipeline. Returns a ReadableStream of PCM audio buffers.

This is useful for:

  • Initial greetings when a call starts
  • System announcements that bypass the agent
  • One-off speech synthesis outside of conversations

const tts = new OpenAITextToSpeech({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: "nova",
});

// Generate and play a greeting
const audioStream = tts.speak("Hi there! How can I assist you today?");

for await (const chunk of audioStream) {
  // Send to audio output (speakers, WebRTC, etc.)
  audioOutput.write(chunk);
}

The speak() method uses the same voice and model configuration as the main TTS pipeline. Audio is automatically resampled from OpenAI's native 24kHz to your configured sample rate.
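
The resampling step happens inside the package, but conceptually it is straightforward. An illustrative linear-interpolation resampler for 16-bit mono PCM (not the package's actual implementation):

// Illustrative only: naive linear-interpolation resample of 16-bit mono PCM,
// e.g. from OpenAI's native 24000 Hz down to a 16000 Hz pipeline.
function resamplePcm16(input: Buffer, fromRate: number, toRate: number): Buffer {
  const srcLen = Math.floor(input.length / 2);
  const outLen = Math.floor((srcLen * toRate) / fromRate);
  const out = Buffer.alloc(outLen * 2);
  for (let i = 0; i < outLen; i++) {
    const pos = (i * fromRate) / toRate;   // fractional source index
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, srcLen - 1);
    const frac = pos - i0;
    const sample =
      input.readInt16LE(i0 * 2) * (1 - frac) + input.readInt16LE(i1 * 2) * frac;
    out.writeInt16LE(Math.round(sample), i * 2);
  }
  return out;
}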

Callbacks

onAudioComplete

Called when speech generation finishes.

const tts = new OpenAITextToSpeech({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: "nova",
  onAudioComplete: () => {
    console.log("Agent finished speaking");
  },
});

onInterrupt

Called when speech is interrupted.

const tts = new OpenAITextToSpeech({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: "nova",
  onInterrupt: () => {
    console.log("Speech was interrupted");
  },
});

Audio Format Considerations

STT Input

OpenAI Whisper expects audio in common formats. This integration:

  • Accepts raw PCM (16-bit, mono, 16kHz)
  • Automatically wraps in WAV headers before sending to the API
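
The WAV wrapping is internal, but it amounts to prepending a 44-byte header to the raw samples. An illustrative sketch (not the package's actual code):

// Illustrative only: build a 44-byte WAV header around 16-bit mono PCM.
function pcmToWav(pcm: Buffer, sampleRate = 16000): Buffer {
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);  // file size minus 8
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);              // fmt chunk size
  header.writeUInt16LE(1, 20);               // audio format: PCM
  header.writeUInt16LE(1, 22);               // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 2, 28);  // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);               // block align
  header.writeUInt16LE(16, 34);              // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}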

TTS Output

OpenAI TTS returns MP3 audio. If your pipeline expects PCM:

import { createVoiceMiddleware } from "create-voice-agent";

// You'll need an MP3-to-PCM decoder transform
const mp3DecoderMiddleware = createVoiceMiddleware("MP3Decoder", {
  afterTTS: [new MP3ToPCMTransform()], // Implement or use a library
});
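
A skeleton for such a transform, where decodeMp3Chunk is a hypothetical stand-in for whichever MP3 decoder library you choose:

import { Transform, TransformCallback } from "node:stream";

// Hypothetical stand-in - swap in a real MP3 decoder library here.
// A real decoder must also handle MP3 frames that span chunk boundaries.
declare function decodeMp3Chunk(mp3: Buffer): Buffer;

class MP3ToPCMTransform extends Transform {
  _transform(chunk: Buffer, _enc: string, done: TransformCallback) {
    this.push(decodeMp3Chunk(chunk)); // decode each MP3 chunk to raw PCM
    done();
  }
}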

Complete Example

import { 
  createVoiceAgent, 
  createThinkingFillerMiddleware,
} from "create-voice-agent";
import { OpenAISpeechToText, OpenAITextToSpeech } from "@create-voice-agent/openai";
import { ChatOpenAI } from "@langchain/openai";

const tts = new OpenAITextToSpeech({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "tts-1",
  voice: "nova",
  onAudioComplete: () => console.log("Agent finished speaking"),
});

const stt = new OpenAISpeechToText({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "whisper-1",
  // Built-in VAD handles buffering automatically
  partialTranscripts: true,
  onSpeechStart: () => {
    // Barge-in: interrupt TTS when user starts speaking
    tts.interrupt();
  },
});

const voiceAgent = createVoiceAgent({
  model: new ChatOpenAI({ model: "gpt-4o" }),
  prompt: "You are a helpful voice assistant.",
  
  stt,
  tts,
  
  middleware: [
    createThinkingFillerMiddleware({ thresholdMs: 1500 }),
  ],
});

// Process audio streams
const audioOutput = voiceAgent.process(audioInputStream);

When to Use OpenAI vs Other Providers

Use OpenAI When

  • You want a single API key for STT, TTS, and LLM
  • You need high-accuracy transcription (Whisper is excellent)
  • You want built-in VAD and partial transcription support
  • MP3 output format works for your use case

Consider Alternatives When

  • You need lower-latency streaming STT → Use AssemblyAI
  • You need PCM output or more voice options → Use ElevenLabs
  • You need emotionally expressive voices → Use Hume

License

MIT