@create-voice-agent/assemblyai 🎙️

AssemblyAI Real-Time Speech-to-Text integration for create-voice-agent.

This package provides a streaming STT model using AssemblyAI's v3 Real-Time Transcription API for low-latency, accurate speech recognition.

Installation

npm install @create-voice-agent/assemblyai
# or
pnpm add @create-voice-agent/assemblyai

Quick Start

import { createVoiceAgent } from "create-voice-agent";
import { AssemblyAISpeechToText } from "@create-voice-agent/assemblyai";
import { ElevenLabsTextToSpeech } from "@create-voice-agent/elevenlabs";
import { ChatOpenAI } from "@langchain/openai";

const voiceAgent = createVoiceAgent({
  model: new ChatOpenAI({ model: "gpt-4o" }),
  
  stt: new AssemblyAISpeechToText({
    apiKey: process.env.ASSEMBLYAI_API_KEY!,
  }),
  
  tts: new ElevenLabsTextToSpeech({ /* ... */ }),
});

API Reference

AssemblyAISpeechToText

Real-time streaming Speech-to-Text model using AssemblyAI's WebSocket API.

import { AssemblyAISpeechToText } from "@create-voice-agent/assemblyai";

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  
  // Optional configuration
  sampleRate: 16000,
  encoding: "pcm_s16le",
  speechModel: "universal-streaming-english",
  region: "us",
  formatTurns: true,
  
  // Endpointing configuration
  endOfTurnConfidenceThreshold: 0.4,
  minEndOfTurnSilenceWhenConfident: 400,
  maxTurnSilence: 1280,
  
  // Improve recognition for specific terms
  keytermsPrompt: ["LangChain", "OpenAI", "Anthropic"],
  
  // Callbacks
  onSpeechStart: () => console.log("User started speaking"),
  onTurn: (turn) => console.log("Turn event:", turn),
  onEndOfTurn: (turn) => console.log("Turn ended:", turn.transcript),
});

Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| apiKey | string | required | AssemblyAI API key |
| token | string | - | Temporary auth token (alternative to apiKey for client-side) |
| sampleRate | number | 16000 | Audio sample rate in Hz |
| encoding | "pcm_s16le" \| "pcm_mulaw" | "pcm_s16le" | Audio encoding format |
| speechModel | string | "universal-streaming-english" | Speech recognition model |
| region | "us" \| "eu" | "us" | API endpoint region |
| formatTurns | boolean | true | Return formatted transcripts (punctuation, casing) |
| keytermsPrompt | string[] | - | Words/phrases to boost recognition accuracy (max 100) |
| endOfTurnConfidenceThreshold | number | 0.4 | Confidence threshold (0-1) for end-of-turn detection |
| minEndOfTurnSilenceWhenConfident | number | 400 | Minimum silence (ms) to trigger end of turn when confident |
| maxTurnSilence | number | 1280 | Maximum silence (ms) before end of turn is triggered |
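
For example, telephony audio typically arrives as 8 kHz mu-law. A minimal sketch, assuming your audio source is mono 8 kHz mu-law (the rate is a property of your input, not a package default):

import { AssemblyAISpeechToText } from "@create-voice-agent/assemblyai";

// Telephony-style configuration for 8 kHz mu-law input
const telephonySTT = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  sampleRate: 8000,      // matches the assumed inbound call audio
  encoding: "pcm_mulaw", // common encoding for phone audio
});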

Speech Models

| Model | Languages | Description |
|-------|-----------|-------------|
| universal-streaming-english | English | Lower latency, English-only (default) |
| universal-streaming-multi | EN, ES, FR, DE, IT, PT | Multilingual support |

Regions

| Region | Endpoint | Use Case |
|--------|----------|----------|
| us | streaming.assemblyai.com | Default, US-based |
| eu | streaming.eu.assemblyai.com | EU data residency requirements |
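
As a sketch, a deployment serving multilingual users under EU data residency requirements might combine the two tables above (assuming universal-streaming-multi is available in the eu region):

const euSTT = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  region: "eu",                             // routes to streaming.eu.assemblyai.com
  speechModel: "universal-streaming-multi", // EN, ES, FR, DE, IT, PT
});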

Callbacks

onSpeechStart

Triggered when the user starts speaking. Useful for barge-in detection.

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  onSpeechStart: () => {
    // User started speaking - interrupt agent if needed
    voiceAgent.tts.interrupt();
  },
});

onTurn

Triggered on every turn event (both partial and final transcripts).

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  onTurn: (turn) => {
    console.log(`Turn ${turn.turn_order}: "${turn.transcript}"`);
    console.log(`  Formatted: ${turn.turn_is_formatted}`);
    console.log(`  End of turn: ${turn.end_of_turn}`);
    console.log(`  Confidence: ${turn.end_of_turn_confidence}`);
    console.log(`  Words:`, turn.words);
  },
});

onEndOfTurn

Triggered specifically when end of turn is detected.

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  onEndOfTurn: (turn) => {
    console.log(`User finished: "${turn.transcript}"`);
  },
});

Instance Methods

updateConfiguration(config)

Update endpointing parameters during an active session.

// Switch to live captioning mode (faster endpointing)
stt.updateConfiguration({
  minEndOfTurnSilenceWhenConfident: 560,
  maxTurnSilence: 800,
});

// Switch back to conversational mode
stt.updateConfiguration({
  minEndOfTurnSilenceWhenConfident: 400,
  maxTurnSilence: 1280,
});

forceEndpoint()

Force an immediate end of turn. Useful when you know the user has finished speaking via external signals (e.g., button press, external VAD).

// User pressed "done" button
stt.forceEndpoint();

interrupt()

Interrupt and close the current transcription session.

// Clean up when done
stt.interrupt();
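
In a long-running Node process, a common pattern is to close the session on shutdown. A sketch using standard Node signal handling (not part of this package); isConnected is documented under Instance Properties below:

// Close the transcription session cleanly on Ctrl+C
process.on("SIGINT", () => {
  if (stt.isConnected) {
    stt.interrupt();
  }
  process.exit(0);
});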

Instance Properties

| Property | Type | Description |
|----------|------|-------------|
| provider | "assemblyai" | Provider identifier |
| sessionId | string \| null | Current session ID |
| isConnected | boolean | Whether the WebSocket is connected |
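
These are handy for diagnostics, e.g. (a minimal sketch):

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
});

// Before a streaming session starts, isConnected is false and
// sessionId is presumably null (per the types above)
console.log("Provider:", stt.provider); // "assemblyai"
console.log("Connected:", stt.isConnected);
console.log("Session:", stt.sessionId ?? "none");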

Turn Event Data

The AssemblyAITurnEvent object provides detailed transcription data:

interface AssemblyAITurnEvent {
  /** Integer that increments with each new turn */
  turn_order: number;
  
  /** Whether the text is formatted (punctuation, casing) */
  turn_is_formatted: boolean;
  
  /** Whether this is the end of the current turn */
  end_of_turn: boolean;
  
  /** The transcript text */
  transcript: string;
  
  /** Confidence (0-1) that the turn has finished */
  end_of_turn_confidence: number;
  
  /** Word-level data */
  words: AssemblyAIWord[];
}

interface AssemblyAIWord {
  text: string;
  word_is_final: boolean;
  start: number;      // ms
  end: number;        // ms
  confidence: number; // 0-1
}
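
The word-level timings make simple derived metrics easy. For instance, a rough words-per-minute estimate over a turn (a sketch; it assumes the package exports the AssemblyAITurnEvent type shown above):

import type { AssemblyAITurnEvent } from "@create-voice-agent/assemblyai";

// Rough speaking rate for a turn, using only finalized words
function wordsPerMinute(turn: AssemblyAITurnEvent): number {
  const finalWords = turn.words.filter((w) => w.word_is_final);
  if (finalWords.length < 2) return 0;
  const elapsedMs = finalWords[finalWords.length - 1].end - finalWords[0].start;
  return elapsedMs > 0 ? (finalWords.length / elapsedMs) * 60_000 : 0;
}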

Endpointing Configuration

AssemblyAI's v3 API uses intelligent endpointing to detect when the user has finished speaking. You can tune this behavior:

For Conversational AI (default)

Balanced settings that wait for natural pauses:

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  endOfTurnConfidenceThreshold: 0.4,
  minEndOfTurnSilenceWhenConfident: 400,
  maxTurnSilence: 1280,
});

For Live Captioning

Faster endpointing for real-time display:

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  endOfTurnConfidenceThreshold: 0.3,
  minEndOfTurnSilenceWhenConfident: 560,
  maxTurnSilence: 800,
});

For Dictation

Longer pauses allowed between sentences:

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  endOfTurnConfidenceThreshold: 0.5,
  minEndOfTurnSilenceWhenConfident: 600,
  maxTurnSilence: 2000,
});

Improving Recognition Accuracy

Use keytermsPrompt to boost recognition of specific terms:

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  keytermsPrompt: [
    // Product names
    "LangChain",
    "LangGraph", 
    "LangSmith",
    // Technical terms
    "RAG",
    "embeddings",
    "vector store",
    // Custom vocabulary
    "Acme Corp",
  ],
});

Limits:

  • Maximum 100 terms
  • Terms longer than 50 characters are ignored
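
A small helper (hypothetical, not part of the package) can pre-filter a vocabulary list so it stays within these limits:

// Trims a vocabulary list to the documented keyterms limits
function sanitizeKeyterms(terms: string[]): string[] {
  return terms
    .filter((t) => t.length <= 50) // terms over 50 characters are ignored anyway
    .slice(0, 100);                // at most 100 terms
}

const vocabulary = ["LangChain", "LangGraph", "LangSmith" /* ...more domain terms */];

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  keytermsPrompt: sanitizeKeyterms(vocabulary),
});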

Client-Side Usage with Temporary Tokens

For browser-based applications, generate temporary tokens server-side:

// Server: Generate temporary token
const response = await fetch("https://api.assemblyai.com/v2/realtime/token", {
  method: "POST",
  headers: {
    Authorization: process.env.ASSEMBLYAI_API_KEY!,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ expires_in: 3600 }),
});
const { token } = await response.json();

// Client: Use temporary token
const stt = new AssemblyAISpeechToText({
  apiKey: "", // Not needed when using token
  token: token,
});
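
In practice the token fetch usually lives behind a small server endpoint so the permanent key never reaches the browser. A sketch using Express (the framework choice and route name are assumptions) that wraps the same request shown above:

import express from "express";

const app = express();

// Hands the browser a short-lived token; the API key stays on the server
app.get("/api/stt-token", async (_req, res) => {
  const response = await fetch("https://api.assemblyai.com/v2/realtime/token", {
    method: "POST",
    headers: {
      Authorization: process.env.ASSEMBLYAI_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ expires_in: 3600 }),
  });
  const { token } = await response.json();
  res.json({ token });
});

app.listen(3000);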

Complete Example

import { createVoiceAgent, createThinkingFillerMiddleware } from "create-voice-agent";
import { AssemblyAISpeechToText } from "@create-voice-agent/assemblyai";
import { ElevenLabsTextToSpeech } from "@create-voice-agent/elevenlabs";
import { ChatOpenAI } from "@langchain/openai";

const stt = new AssemblyAISpeechToText({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
  speechModel: "universal-streaming-english",
  
  // Boost recognition for domain terms
  keytermsPrompt: ["LangChain", "voice agent", "AI assistant"],
  
  // Log all transcription events
  onTurn: (turn) => {
    if (!turn.turn_is_formatted) {
      process.stdout.write(`\r[Partial] ${turn.transcript}`);
    }
  },
  onEndOfTurn: (turn) => {
    console.log(`\n[Final] ${turn.transcript}`);
  },
});

const voiceAgent = createVoiceAgent({
  model: new ChatOpenAI({ model: "gpt-4o" }),
  prompt: "You are a helpful voice assistant.",
  
  stt,
  tts: new ElevenLabsTextToSpeech({
    apiKey: process.env.ELEVENLABS_API_KEY!,
    voiceId: process.env.ELEVENLABS_VOICE_ID!,
  }),
  
  middleware: [
    createThinkingFillerMiddleware({ thresholdMs: 1000 }),
  ],
});

// Process audio (audioInputStream is your microphone or telephony audio source)
const audioOutput = voiceAgent.process(audioInputStream);

License

MIT