@gdnaio/react-transcribe-streaming
v0.2.0
A lightweight React hook for real-time speech-to-text using AWS Transcribe Streaming. Captures microphone audio, streams it to AWS Transcribe, and returns a live-updating transcript — all in one hook.
Features
- Real-time transcription — partial results update as you speak, final results accumulate
- Single hook API — useTranscribe() gives you everything: transcript, listening, startListening, stopListening
- Cognito authentication — exchanges a Cognito ID token for temporary AWS credentials automatically, no backend needed
- Multi-language support — 17+ languages with built-in BCP-47 code mapping
- Auto cleanup — microphone and AWS resources are released automatically when you stop or when the component unmounts
- Minimal footprint — just two runtime dependencies (@aws-sdk/client-transcribe-streaming, @aws-sdk/credential-providers), tree-shakeable
- TypeScript-first — full type definitions included out of the box
- React 18+ — works with React 18 and React 19
- Framework agnostic — works with Vite, Next.js, Create React App, and any React setup
Table of Contents
- Installation
- AWS Setup (One-Time)
- Quick Start
- Framework Guides
- API Reference
- Usage Examples
- Supported Languages
- How It Works
- Browser Compatibility
- Error Handling
- Troubleshooting
- License
Installation
npm install @gdnaio/react-transcribe-streaming
# or
pnpm add @gdnaio/react-transcribe-streaming
# or
yarn add @gdnaio/react-transcribe-streaming
Both ESM and CJS builds are included. TypeScript definitions ship with the package — no separate @types/ install needed.
AWS Setup (One-Time)
Before using the hook, you need three things in AWS: a User Pool, an Identity Pool, and an IAM role. You likely already have a User Pool if your app has Cognito auth. The Identity Pool and IAM role are new.
Step 1: Create a Cognito Identity Pool
- Open the Amazon Cognito console
- Click Identity pools in the left sidebar, then Create identity pool
- Give it a name (e.g., my-app-identity-pool)
- Under Authentication providers > Cognito, enter:
  - User Pool ID — your existing User Pool ID (e.g., us-east-1_XXXXXXXXX)
  - App Client ID — the app client ID from your User Pool
- Click Create pool
- AWS will prompt you to create two IAM roles (authenticated and unauthenticated). Accept the defaults
- Note the Identity Pool ID — you'll need it in your app (e.g., us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
Step 2: Add Transcribe Permission to the Authenticated Role
- Open the IAM console
- Find the authenticated role that was created with the Identity Pool (e.g., Cognito_MyAppAuth_Role)
- Click Add permissions > Create inline policy
- Switch to the JSON tab and paste:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "transcribe:StartStreamTranscription",
"Resource": "*"
}
]
}
- Name the policy (e.g., TranscribeStreamingAccess) and save
Step 3: Verify Your User Pool
Ensure your app is already obtaining a Cognito ID token for authenticated users. This package needs the ID token (not the access token) to exchange for temporary AWS credentials.
Common auth libraries that provide this:
- AWS Amplify — fetchAuthSession() returns tokens.idToken
- @gdnaio/cognito-auth — getIdToken() returns the ID token
- amazon-cognito-identity-js — getIdToken().getJwtToken()
- Any OIDC library — the id_token from the token response
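Under the hood, the credential exchange needs a logins map keyed by your User Pool's issuer. The key format below is what Cognito Identity Pools expect; the helper name itself is illustrative, not part of this package:

```typescript
// Build the logins map entry that links a Cognito User Pool ID token
// to an Identity Pool credential exchange. The key format is
// "cognito-idp.<region>.amazonaws.com/<userPoolId>".
function buildLoginsMap(
  region: string,
  userPoolId: string,
  idToken: string
): Record<string, string> {
  const providerName = `cognito-idp.${region}.amazonaws.com/${userPoolId}`
  return { [providerName]: idToken }
}
```

This is the same map the hook passes to fromCognitoIdentityPool when exchanging your ID token for temporary credentials.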
Step 4: HTTPS (Production)
Microphone access (getUserMedia) requires a secure context. Your app must be served over HTTPS in production. localhost is exempt for development.
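If you want to warn users before they click Record, a defensive pre-flight check for the secure-context requirement might look like this (a sketch — the hook handles the failure internally either way):

```typescript
// Returns true only when microphone capture can plausibly work:
// getUserMedia must exist and the page must be a secure context
// (HTTPS or localhost).
function canUseMicrophone(): boolean {
  if (typeof navigator === 'undefined' || !navigator.mediaDevices?.getUserMedia) {
    return false
  }
  // window.isSecureContext is false on plain HTTP (localhost is exempt)
  return typeof window === 'undefined' || window.isSecureContext
}
```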
Quick Start
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
function SpeechInput({ idToken }: { idToken: string }) {
const { transcript, listening, startListening, stopListening } = useTranscribe({
config: {
region: 'us-east-1',
identityPoolId: 'us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
userPoolId: 'us-east-1_XXXXXXXXX',
idToken,
},
languageCode: 'en-US',
})
return (
<div>
<button onClick={listening ? stopListening : startListening}>
{listening ? 'Stop' : 'Start'} Listening
</button>
<p>{transcript || 'Click the button and start speaking...'}</p>
</div>
)
}
Framework Guides
Vite
Vite exposes environment variables via import.meta.env with a VITE_ prefix.
1. Add environment variables to your .env or .env.local:
VITE_AWS_REGION=us-east-1
VITE_AWS_USER_POOL_ID=us-east-1_XXXXXXXXX
VITE_AWS_IDENTITY_POOL_ID=us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2. Create a voice input component:
// src/components/VoiceInput.tsx
import { useState, useEffect } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
import { useAuth } from 'your-auth-library'
export default function VoiceInput() {
const { getIdToken } = useAuth()
const [idToken, setIdToken] = useState('')
useEffect(() => {
getIdToken().then((token) => {
if (token) setIdToken(token)
})
}, [getIdToken])
const { transcript, listening, startListening, stopListening, resetTranscript, isMicrophoneAvailable } =
useTranscribe({
config: {
region: import.meta.env.VITE_AWS_REGION,
identityPoolId: import.meta.env.VITE_AWS_IDENTITY_POOL_ID,
userPoolId: import.meta.env.VITE_AWS_USER_POOL_ID,
idToken,
},
languageCode: 'en-US',
})
return (
<div>
<textarea value={transcript} readOnly rows={4} placeholder="Speak into your microphone..." />
<div>
<button onClick={listening ? stopListening : startListening} disabled={!isMicrophoneAvailable}>
{listening ? 'Stop' : 'Record'}
</button>
<button onClick={resetTranscript}>Clear</button>
</div>
</div>
)
}
3. Use it in your app:
// src/App.tsx
import VoiceInput from './components/VoiceInput'
function App() {
return (
<div>
<h1>Voice Input Demo</h1>
<VoiceInput />
</div>
)
}
Next.js
This package uses browser-only APIs (getUserMedia, AudioContext). In Next.js, you must mark the component as client-side.
1. Add environment variables to your .env.local:
NEXT_PUBLIC_AWS_REGION=us-east-1
NEXT_PUBLIC_AWS_USER_POOL_ID=us-east-1_XXXXXXXXX
NEXT_PUBLIC_AWS_IDENTITY_POOL_ID=us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2. Create a client component (note the 'use client' directive):
// components/VoiceInput.tsx
'use client'
import { useState, useEffect } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
interface VoiceInputProps {
idToken: string
languageCode?: string
}
export default function VoiceInput({ idToken, languageCode = 'en-US' }: VoiceInputProps) {
const { transcript, listening, startListening, stopListening, resetTranscript, isMicrophoneAvailable } =
useTranscribe({
config: {
region: process.env.NEXT_PUBLIC_AWS_REGION!,
identityPoolId: process.env.NEXT_PUBLIC_AWS_IDENTITY_POOL_ID!,
userPoolId: process.env.NEXT_PUBLIC_AWS_USER_POOL_ID!,
idToken,
},
languageCode,
})
return (
<div>
<p>{transcript || 'Click the button and start speaking...'}</p>
<button onClick={listening ? stopListening : startListening} disabled={!isMicrophoneAvailable}>
{listening ? 'Stop Recording' : 'Start Recording'}
</button>
<button onClick={resetTranscript}>Clear</button>
</div>
)
}
3. Use it in a page (App Router):
// app/page.tsx
import dynamic from 'next/dynamic'
// Dynamic import with SSR disabled — the hook uses browser APIs
const VoiceInput = dynamic(() => import('@/components/VoiceInput'), { ssr: false })
export default function Page() {
const idToken = '...' // Get from your auth layer (server component, cookie, etc.)
return (
<main>
<h1>Voice Input</h1>
<VoiceInput idToken={idToken} />
</main>
)
}
Important: If you see ReferenceError: navigator is not defined or AudioContext is not defined, the component is being rendered on the server. Either:
- Add 'use client' at the top of the file, or
- Use dynamic(() => import(...), { ssr: false }) to disable SSR for that component
Create React App
CRA exposes environment variables via process.env with a REACT_APP_ prefix.
1. Add environment variables to your .env:
REACT_APP_AWS_REGION=us-east-1
REACT_APP_AWS_USER_POOL_ID=us-east-1_XXXXXXXXX
REACT_APP_AWS_IDENTITY_POOL_ID=us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2. Create the component:
// src/components/VoiceInput.tsx
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
export default function VoiceInput({ idToken }: { idToken: string }) {
const { transcript, listening, startListening, stopListening, resetTranscript, isMicrophoneAvailable } =
useTranscribe({
config: {
region: process.env.REACT_APP_AWS_REGION!,
identityPoolId: process.env.REACT_APP_AWS_IDENTITY_POOL_ID!,
userPoolId: process.env.REACT_APP_AWS_USER_POOL_ID!,
idToken,
},
languageCode: 'en-US',
})
return (
<div>
<p>{transcript || 'Click the button and start speaking...'}</p>
<button onClick={listening ? stopListening : startListening} disabled={!isMicrophoneAvailable}>
{listening ? 'Stop' : 'Record'}
</button>
<button onClick={resetTranscript}>Clear</button>
</div>
)
}
API Reference
useTranscribe(options)
The main hook. Call it at the top level of your React component.
Parameters
interface UseTranscribeOptions {
config: TranscribeConfig
languageCode?: string // Default: "en-US"
}
interface TranscribeConfig {
region: string // AWS region, e.g. "us-east-1"
identityPoolId: string // Cognito Identity Pool ID
userPoolId: string // Cognito User Pool ID
idToken: string // Cognito ID token from your auth layer
}
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| config.region | string | Yes | AWS region where your Identity Pool and Transcribe are available |
| config.identityPoolId | string | Yes | Cognito Identity Pool ID (e.g., us-east-1:xxxxxxxx-xxxx-...) |
| config.userPoolId | string | Yes | Cognito User Pool ID (e.g., us-east-1_XXXXXXXXX) |
| config.idToken | string | Yes | A valid Cognito ID token for the authenticated user |
| languageCode | string | No | BCP-47 language code. Defaults to "en-US". See Supported Languages |
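Since all four config fields are required strings, a structural sanity check can catch copy-paste mistakes before the hook ever runs. The validator below is a hypothetical helper, not part of the package, and it only checks shapes — it does not verify the values against AWS:

```typescript
interface TranscribeConfig {
  region: string
  identityPoolId: string
  userPoolId: string
  idToken: string
}

// Returns a list of human-readable problems; an empty list means
// the config looks structurally valid.
function validateConfig(config: TranscribeConfig): string[] {
  const problems: string[] = []
  if (!/^[a-z]{2}(-[a-z]+)+-\d$/.test(config.region)) {
    problems.push(`region "${config.region}" does not look like an AWS region`)
  }
  if (!/^[a-z0-9-]+:[0-9a-f-]{36}$/i.test(config.identityPoolId)) {
    problems.push('identityPoolId should look like "<region>:<uuid>"')
  }
  if (!/^[a-z0-9-]+_[A-Za-z0-9]+$/i.test(config.userPoolId)) {
    problems.push('userPoolId should look like "<region>_<id>"')
  }
  if (config.idToken.split('.').length !== 3) {
    problems.push('idToken does not look like a JWT')
  }
  return problems
}
```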
Return Value
interface UseTranscribeReturn {
transcript: string
listening: boolean
isMicrophoneAvailable: boolean
startListening: () => Promise<void>
stopListening: () => Promise<void>
abortListening: () => void
resetTranscript: () => void
}
| Property | Type | Description |
|----------|------|-------------|
| transcript | string | The current transcription text. Updates in real-time with partial (interim) results and accumulates final results. Resets when startListening is called again |
| listening | boolean | true while the microphone is active and audio is being streamed to Transcribe |
| isMicrophoneAvailable | boolean | Starts as true. Becomes false if the user denies microphone permission |
| startListening | () => Promise<void> | Requests mic access, starts audio capture, connects to Transcribe, and begins streaming. Clears any previous transcript. If already listening, this is a no-op |
| stopListening | () => Promise<void> | Gracefully stops audio capture, closes the Transcribe WebSocket stream, and releases the microphone. Safe to call even if not listening |
| abortListening | () => void | Synchronous, fire-and-forget version of stopListening. Useful in cleanup code or event handlers where you can't await |
| resetTranscript | () => void | Clears the transcript to an empty string without stopping the mic |
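The partial-vs-final accumulation described for transcript can be illustrated with a small reducer. This is a sketch of the observable behavior, not the package's actual internals:

```typescript
interface TranscriptEvent {
  text: string
  isPartial: boolean // true for interim results that will be revised
}

// Final results accumulate; the latest partial result is appended on
// top and replaced as it is revised — mirroring how `transcript`
// updates while you speak.
function reduceTranscript(events: TranscriptEvent[]): string {
  let finals = ''
  let partial = ''
  for (const event of events) {
    if (event.isPartial) {
      partial = event.text // each partial replaces the previous one
    } else {
      finals += event.text // finals accumulate permanently
      partial = ''
    }
  }
  return finals + partial
}
```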
Exported Types
import type {
TranscribeConfig,
UseTranscribeOptions,
UseTranscribeReturn,
} from '@gdnaio/react-transcribe-streaming'
Usage Examples
Getting the ID Token
The hook requires a Cognito ID token. Here's how to get it from common auth libraries:
AWS Amplify v6:
import { fetchAuthSession } from 'aws-amplify/auth'
const session = await fetchAuthSession()
const idToken = session.tokens?.idToken?.toString() ?? ''
AWS Amplify v5:
import { Auth } from 'aws-amplify'
const session = await Auth.currentSession()
const idToken = session.getIdToken().getJwtToken()
@gdnaio/cognito-auth:
import { useAuth } from '@gdnaio/cognito-auth'
const { getIdToken } = useAuth()
const idToken = await getIdToken()
amazon-cognito-identity-js:
const cognitoUser = userPool.getCurrentUser()
cognitoUser.getSession((err, session) => {
const idToken = session.getIdToken().getJwtToken()
})
Complete Working Example
A full component with token fetching, error states, and visual feedback:
import { useState, useEffect } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
const CONFIG = {
region: 'us-east-1',
identityPoolId: 'us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
userPoolId: 'us-east-1_XXXXXXXXX',
}
interface Props {
getIdToken: () => Promise<string | null>
language?: string
}
export default function VoiceTranscriber({ getIdToken, language = 'en-US' }: Props) {
const [idToken, setIdToken] = useState('')
useEffect(() => {
getIdToken().then((token) => {
if (token) setIdToken(token)
})
}, [getIdToken])
const {
transcript,
listening,
isMicrophoneAvailable,
startListening,
stopListening,
resetTranscript,
} = useTranscribe({
config: { ...CONFIG, idToken },
languageCode: language,
})
if (!isMicrophoneAvailable) {
return <p>Microphone access was denied. Please allow microphone access and reload the page.</p>
}
return (
<div>
<div style={{ minHeight: 60, padding: 12, border: '1px solid #ccc', borderRadius: 8 }}>
{transcript || <span style={{ color: '#999' }}>Click Record and start speaking...</span>}
</div>
<div style={{ marginTop: 8, display: 'flex', gap: 8 }}>
<button
onClick={listening ? stopListening : startListening}
style={{
padding: '8px 16px',
background: listening ? '#ea4335' : '#1a73e8',
color: 'white',
border: 'none',
borderRadius: 4,
cursor: 'pointer',
}}
>
{listening ? 'Stop Recording' : 'Start Recording'}
</button>
<button onClick={resetTranscript} style={{ padding: '8px 16px' }}>
Clear
</button>
</div>
{listening && (
<p style={{ color: '#ea4335', marginTop: 8 }}>
Listening...
</p>
)}
</div>
)
}
Dynamic Language Switching
import { useState } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
function MultiLanguageInput() {
const [language, setLanguage] = useState('en-US')
const { transcript, listening, startListening, stopListening } = useTranscribe({
config: { /* ... */ },
languageCode: language,
})
return (
<div>
<select value={language} onChange={(e) => setLanguage(e.target.value)}>
<option value="en-US">English</option>
<option value="ar-SA">Arabic</option>
<option value="fr-FR">French</option>
<option value="es-US">Spanish</option>
<option value="hi-IN">Hindi</option>
</select>
<button onClick={listening ? stopListening : startListening}>
{listening ? 'Stop' : 'Record'}
</button>
<p>{transcript}</p>
</div>
)
}
Note: Changing languageCode while actively listening does not restart the stream automatically. Stop and start listening again to switch languages mid-session.
Appending Speech to Existing Text
import { useState, useRef, useEffect } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
function ChatInput() {
const [text, setText] = useState('')
const baseTextRef = useRef(text)
const { transcript, listening, startListening, stopListening } =
useTranscribe({ config: { /* ... */ } })
// When starting, capture the current text as the base
const handleStart = async () => {
baseTextRef.current = text
await startListening()
}
// Append transcript to the base text
useEffect(() => {
if (listening && transcript) {
setText(baseTextRef.current + transcript)
}
}, [transcript, listening])
return (
<div>
<input value={text} onChange={(e) => setText(e.target.value)} />
<button onClick={listening ? stopListening : handleStart}>
{listening ? 'Stop' : 'Mic'}
</button>
</div>
)
}
Supported Languages
The hook includes a built-in language code mapper. Pass any of these BCP-47 codes as languageCode:
| Code | Language | Transcribe Code |
|------|----------|-----------------|
| en-US | English (US) | en-US |
| en-GB | English (UK) | en-GB |
| en-AU | English (Australia) | en-AU |
| ar-SA | Arabic (Saudi Arabia) | ar-SA |
| ar-AE | Arabic (UAE) | ar-AE |
| fr-FR | French (France) | fr-FR |
| fr-CA | French (Canada) | fr-CA |
| es-ES | Spanish (Spain) | es-ES |
| es-US | Spanish (US) | es-US |
| es-LA | Spanish (Latin America) | es-US * |
| de-DE | German | de-DE |
| it-IT | Italian | it-IT |
| pt-BR | Portuguese (Brazil) | pt-BR |
| ja-JP | Japanese | ja-JP |
| ko-KR | Korean | ko-KR |
| zh-CN | Chinese (Mandarin) | zh-CN |
| hi-IN | Hindi | hi-IN |
* es-LA is mapped to es-US since AWS Transcribe doesn't have a dedicated Latin American Spanish code.
Any unlisted code is passed through to Transcribe as-is. See AWS Transcribe Streaming supported languages for the full list of supported codes.
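The mapping behavior described above can be sketched as a simple lookup with pass-through (an illustration of the documented behavior, not the package's source):

```typescript
// Codes that have no dedicated Transcribe language code are remapped;
// everything else passes through to Transcribe unchanged.
const LANGUAGE_OVERRIDES: Record<string, string> = {
  'es-LA': 'es-US', // no Latin American Spanish code in Transcribe
}

function toTranscribeLanguageCode(bcp47: string): string {
  return LANGUAGE_OVERRIDES[bcp47] ?? bcp47
}
```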
How It Works
User clicks "Start"
|
v
getUserMedia() ---- requests microphone access
|
v
AudioContext + ScriptProcessorNode ---- captures raw audio at native sample rate
|
v
Float32 -> Int16 PCM (little-endian) ---- converts to Transcribe-compatible format
|
v
fromCognitoIdentityPool() ---- exchanges Cognito ID token for temporary AWS credentials
|
v
TranscribeStreamingClient.send() ---- opens WebSocket to AWS Transcribe
|
v
Audio chunks streamed as async generator ---- yields { AudioEvent: { AudioChunk } }
|
v
TranscriptResultStream (async iterable) ---- receives partial and final transcript events
|
v
React state updates ---- transcript updates in real-time
- Microphone capture — calls getUserMedia with echo cancellation and noise suppression enabled, creates an AudioContext and ScriptProcessorNode to capture raw audio frames
- PCM encoding — converts Float32 audio samples to 16-bit signed integer PCM in little-endian byte order, the format AWS Transcribe expects
- Credential exchange — uses fromCognitoIdentityPool from @aws-sdk/credential-providers to exchange the Cognito ID token for temporary AWS credentials scoped to transcribe:StartStreamTranscription
- WebSocket streaming — creates a TranscribeStreamingClient and sends a StartStreamTranscriptionCommand with the audio stream as an async generator. Transcribe opens a WebSocket connection under the hood
- Transcript processing — iterates the TranscriptResultStream async iterable. Partial results update the transcript state immediately (giving real-time feedback). Final results are accumulated so the full transcript builds up over the session
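The Float32-to-PCM step above can be written as a pure function. This mirrors the conversion described in the pipeline; the hook's exact implementation may differ:

```typescript
// Convert Web Audio Float32 samples (range -1..1) to 16-bit signed
// little-endian PCM, the input format AWS Transcribe Streaming expects.
function floatTo16BitPCM(samples: Float32Array): Uint8Array {
  const buffer = new ArrayBuffer(samples.length * 2)
  const view = new DataView(buffer)
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the Int16 range
    const s = Math.max(-1, Math.min(1, samples[i]))
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true) // true = little-endian
  }
  return new Uint8Array(buffer)
}
```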
Cleanup and Resource Management
The hook automatically cleans up resources when:
- You call stopListening() or abortListening()
- An error occurs (network drop, credential expiry, etc.)
- The Transcribe stream ends naturally
Cleanup includes:
- Disconnecting the ScriptProcessorNode and AudioContext
- Stopping all microphone MediaStream tracks (the browser's mic indicator turns off)
- Aborting the Transcribe WebSocket stream
- Setting listening to false
Note: The hook does not auto-stop on component unmount. If you need that, call abortListening in a cleanup effect:
useEffect(() => {
  return () => abortListening()
}, [abortListening])
Browser Compatibility
| Browser | Minimum Version | Notes |
|---------|----------------|-------|
| Chrome | 66+ | Full support |
| Firefox | 76+ | Full support |
| Safari | 14.1+ | Full support |
| Edge | 79+ | Full support (Chromium-based) |
| Mobile Chrome | 66+ | Full support |
| Mobile Safari | 14.5+ | Full support |
Requires navigator.mediaDevices.getUserMedia and AudioContext APIs. Both are available in all modern browsers. HTTPS is required in production (localhost is exempt for development).
Error Handling
The hook handles errors internally and never throws. All errors are logged to console.error with a [useTranscribe] prefix.
| Scenario | What Happens |
|----------|-------------|
| Microphone permission denied | isMicrophoneAvailable becomes false. You can use this to show a message or disable the button |
| Microphone already in use | startListening fails silently, error logged. listening stays false |
| Network disconnection | The Transcribe WebSocket closes. listening becomes false. Last transcript is preserved. User can click to restart |
| Expired or invalid ID token | Credential exchange fails, error logged. listening becomes false. Refresh the token and try again |
| AWS service error | Error logged. listening becomes false. Check the console for details |
| Unsupported browser | getUserMedia throws, caught and logged. listening stays false |
Bundle Size
The package itself is small (~6 KB). However, the AWS SDK dependencies add to the bundle:
| Dependency | Approximate Size (gzipped) |
|------------|---------------------------|
| @aws-sdk/client-transcribe-streaming | ~30 KB |
| @aws-sdk/credential-providers | ~20 KB |
| Total addition | ~50 KB gzipped |
The AWS SDK v3 is tree-shakeable — only the Transcribe Streaming client and Cognito credential provider are included in your bundle, not the entire SDK.
Troubleshooting
"Microphone not available" / isMicrophoneAvailable is false
- The user denied microphone permission in the browser
- Fix: Click the lock/camera icon in the browser's address bar, allow microphone access, and reload the page
"NotAllowedError" in console
- Your app is served over HTTP (not HTTPS) in production
- Fix: Serve your app over HTTPS. localhost is exempt for development
"No audio is being captured" / transcript stays empty
- Check that your AWS credentials are valid — look for errors in the browser console
- Verify the Identity Pool ID, User Pool ID, and region are correct
- Ensure the IAM role has transcribe:StartStreamTranscription permission
- Try a different microphone or check system audio settings
"CredentialsProviderError" or "Not authorized"
- The Cognito ID token may be expired. Refresh it before calling startListening
- The Identity Pool may not be linked to your User Pool. Verify the authentication provider in the Identity Pool settings
- The authenticated IAM role may be missing the Transcribe permission
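To catch the expired-token case before calling startListening, you can read the token's exp claim locally. A hedged sketch (Buffer.from is the Node decoding — swap in atob in the browser; this performs no signature verification and is only a hint to refresh):

```typescript
// Returns true when the JWT's `exp` claim is in the past (or the token
// is malformed). Does NOT verify the signature.
function isTokenExpired(idToken: string, skewSeconds = 30): boolean {
  const parts = idToken.split('.')
  if (parts.length !== 3) return true
  try {
    // JWT payloads are base64url-encoded JSON
    const base64 = parts[1].replace(/-/g, '+').replace(/_/g, '/')
    const payload = JSON.parse(Buffer.from(base64, 'base64').toString('utf8'))
    if (typeof payload.exp !== 'number') return true
    // Treat tokens expiring within `skewSeconds` as already expired
    return payload.exp * 1000 < Date.now() + skewSeconds * 1000
  } catch {
    return true
  }
}
```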
Next.js: "navigator is not defined" or "AudioContext is not defined"
- The component is being server-side rendered. Browser APIs don't exist on the server
- Fix: Add 'use client' at the top of the file, or use dynamic(() => import(...), { ssr: false })
Transcript has long delays
- This is usually a network latency issue. AWS Transcribe Streaming requires a stable connection
- Partial results should appear within 200-500ms of speaking. If they don't, check your network connection
- The sample rate is auto-detected from your AudioContext. Higher sample rates (48kHz) produce better quality but more data
Language not recognized correctly
- Ensure you're passing the correct BCP-47 code (e.g., ar-SA, not ar or arabic)
- Some languages require region-specific codes. Check the Supported Languages table
- Changing languageCode while listening does not take effect until you stop and restart
License
MIT
