@gdnaio/react-transcribe-streaming
v0.2.0
A lightweight React hook for real-time speech-to-text using AWS Transcribe Streaming. Captures microphone audio, streams it to AWS Transcribe, and returns a live-updating transcript — all in one hook.
Features
- Real-time transcription — partial results update as you speak, final results accumulate
- Single hook API — useTranscribe() gives you everything: transcript, listening, startListening, stopListening
- Cognito authentication — exchanges a Cognito ID token for temporary AWS credentials automatically, no backend needed
- Multi-language support — 17+ languages with built-in BCP-47 code mapping
- Auto cleanup — microphone and AWS resources are released automatically when you stop or when the component unmounts
- Minimal footprint — just two runtime dependencies (@aws-sdk/client-transcribe-streaming, @aws-sdk/credential-providers), tree-shakeable
- TypeScript-first — full type definitions included out of the box
- React 18+ — works with React 18 and React 19
- Framework agnostic — works with Vite, Next.js, Create React App, and any React setup
Table of Contents
- Installation
- AWS Setup (One-Time)
- Quick Start
- Framework Guides
- API Reference
- Usage Examples
- Supported Languages
- How It Works
- Browser Compatibility
- Error Handling
- Troubleshooting
- License
Installation
npm install @gdnaio/react-transcribe-streaming
# or
pnpm add @gdnaio/react-transcribe-streaming
# or
yarn add @gdnaio/react-transcribe-streaming
Both ESM and CJS builds are included. TypeScript definitions ship with the package — no separate @types/ install needed.
AWS Setup (One-Time)
Before using the hook, you need three things in AWS: a User Pool, an Identity Pool, and an IAM role. You likely already have a User Pool if your app has Cognito auth. The Identity Pool and IAM role are new.
Step 1: Create a Cognito Identity Pool
- Open the Amazon Cognito console
- Click Identity pools in the left sidebar, then Create identity pool
- Give it a name (e.g., my-app-identity-pool)
- Under Authentication providers > Cognito, enter:
  - User Pool ID — your existing User Pool ID (e.g., us-east-1_XXXXXXXXX)
  - App Client ID — the app client ID from your User Pool
- Click Create pool
- AWS will prompt you to create two IAM roles (authenticated and unauthenticated). Accept the defaults
- Note the Identity Pool ID — you'll need it in your app (e.g., us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
Step 2: Add Transcribe Permission to the Authenticated Role
- Open the IAM console
- Find the authenticated role that was created with the Identity Pool (e.g., Cognito_MyAppAuth_Role)
- Click Add permissions > Create inline policy
- Switch to the JSON tab and paste:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "transcribe:StartStreamTranscription",
"Resource": "*"
}
]
}
- Name the policy (e.g., TranscribeStreamingAccess) and save
Step 3: Verify Your User Pool
Ensure your app is already obtaining a Cognito ID token for authenticated users. This package needs the ID token (not the access token) to exchange for temporary AWS credentials.
Common auth libraries that provide this:
- AWS Amplify — fetchAuthSession() returns tokens.idToken
- @gdnaio/cognito-auth — getIdToken() returns the ID token
- amazon-cognito-identity-js — getIdToken().getJwtToken()
- Any OIDC library — the id_token from the token response
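Under the hood, the credential exchange needs a logins map keyed by your User Pool's issuer. The key format below is what Cognito Identity Pools expect; the helper name itself is illustrative, not part of this package:

```typescript
// Build the logins map entry that links a Cognito User Pool ID token
// to an Identity Pool credential exchange. The key format is
// "cognito-idp.<region>.amazonaws.com/<userPoolId>".
function buildLoginsMap(
  region: string,
  userPoolId: string,
  idToken: string
): Record<string, string> {
  const providerName = `cognito-idp.${region}.amazonaws.com/${userPoolId}`
  return { [providerName]: idToken }
}
```

This is the same map the hook passes to fromCognitoIdentityPool when exchanging your ID token for temporary credentials.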
Step 4: HTTPS (Production)
Microphone access (getUserMedia) requires a secure context. Your app must be served over HTTPS in production. localhost is exempt for development.
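If you want to warn users before they click Record, a defensive pre-flight check for the secure-context requirement might look like this (a sketch — the hook handles the failure internally either way):

```typescript
// Returns true only when microphone capture can plausibly work:
// getUserMedia must exist and the page must be a secure context
// (HTTPS or localhost).
function canUseMicrophone(): boolean {
  if (typeof navigator === 'undefined' || !navigator.mediaDevices?.getUserMedia) {
    return false
  }
  // window.isSecureContext is false on plain HTTP (localhost is exempt)
  return typeof window === 'undefined' || window.isSecureContext
}
```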
Quick Start
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
function SpeechInput({ idToken }: { idToken: string }) {
const { transcript, listening, startListening, stopListening } = useTranscribe({
config: {
region: 'us-east-1',
identityPoolId: 'us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
userPoolId: 'us-east-1_XXXXXXXXX',
idToken,
},
languageCode: 'en-US',
})
return (
<div>
<button onClick={listening ? stopListening : startListening}>
{listening ? 'Stop' : 'Start'} Listening
</button>
<p>{transcript || 'Click the button and start speaking...'}</p>
</div>
)
}
Framework Guides
Vite
Vite exposes environment variables via import.meta.env with a VITE_ prefix.
1. Add environment variables to your .env or .env.local:
VITE_AWS_REGION=us-east-1
VITE_AWS_USER_POOL_ID=us-east-1_XXXXXXXXX
VITE_AWS_IDENTITY_POOL_ID=us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2. Create a voice input component:
// src/components/VoiceInput.tsx
import { useState, useEffect } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
import { useAuth } from 'your-auth-library'
export default function VoiceInput() {
const { getIdToken } = useAuth()
const [idToken, setIdToken] = useState('')
useEffect(() => {
getIdToken().then((token) => {
if (token) setIdToken(token)
})
}, [getIdToken])
const { transcript, listening, startListening, stopListening, resetTranscript, isMicrophoneAvailable } =
useTranscribe({
config: {
region: import.meta.env.VITE_AWS_REGION,
identityPoolId: import.meta.env.VITE_AWS_IDENTITY_POOL_ID,
userPoolId: import.meta.env.VITE_AWS_USER_POOL_ID,
idToken,
},
languageCode: 'en-US',
})
return (
<div>
<textarea value={transcript} readOnly rows={4} placeholder="Speak into your microphone..." />
<div>
<button onClick={listening ? stopListening : startListening} disabled={!isMicrophoneAvailable}>
{listening ? 'Stop' : 'Record'}
</button>
<button onClick={resetTranscript}>Clear</button>
</div>
</div>
)
}
3. Use it in your app:
// src/App.tsx
import VoiceInput from './components/VoiceInput'
function App() {
return (
<div>
<h1>Voice Input Demo</h1>
<VoiceInput />
</div>
)
}
Next.js
This package uses browser-only APIs (getUserMedia, AudioContext). In Next.js, you must mark the component as client-side.
1. Add environment variables to your .env.local:
NEXT_PUBLIC_AWS_REGION=us-east-1
NEXT_PUBLIC_AWS_USER_POOL_ID=us-east-1_XXXXXXXXX
NEXT_PUBLIC_AWS_IDENTITY_POOL_ID=us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2. Create a client component (note the 'use client' directive):
// components/VoiceInput.tsx
'use client'
import { useState, useEffect } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
interface VoiceInputProps {
idToken: string
languageCode?: string
}
export default function VoiceInput({ idToken, languageCode = 'en-US' }: VoiceInputProps) {
const { transcript, listening, startListening, stopListening, resetTranscript, isMicrophoneAvailable } =
useTranscribe({
config: {
region: process.env.NEXT_PUBLIC_AWS_REGION!,
identityPoolId: process.env.NEXT_PUBLIC_AWS_IDENTITY_POOL_ID!,
userPoolId: process.env.NEXT_PUBLIC_AWS_USER_POOL_ID!,
idToken,
},
languageCode,
})
return (
<div>
<p>{transcript || 'Click the button and start speaking...'}</p>
<button onClick={listening ? stopListening : startListening} disabled={!isMicrophoneAvailable}>
{listening ? 'Stop Recording' : 'Start Recording'}
</button>
<button onClick={resetTranscript}>Clear</button>
</div>
)
}
3. Use it in a page (App Router):
// app/page.tsx
import dynamic from 'next/dynamic'
// Dynamic import with SSR disabled — the hook uses browser APIs
const VoiceInput = dynamic(() => import('@/components/VoiceInput'), { ssr: false })
export default function Page() {
const idToken = '...' // Get from your auth layer (server component, cookie, etc.)
return (
<main>
<h1>Voice Input</h1>
<VoiceInput idToken={idToken} />
</main>
)
}
Important: If you see ReferenceError: navigator is not defined or AudioContext is not defined, the component is being rendered on the server. Either:
- Add 'use client' at the top of the file, or
- Use dynamic(() => import(...), { ssr: false }) to disable SSR for that component
Create React App
CRA exposes environment variables via process.env with a REACT_APP_ prefix.
1. Add environment variables to your .env:
REACT_APP_AWS_REGION=us-east-1
REACT_APP_AWS_USER_POOL_ID=us-east-1_XXXXXXXXX
REACT_APP_AWS_IDENTITY_POOL_ID=us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2. Create the component:
// src/components/VoiceInput.tsx
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
export default function VoiceInput({ idToken }: { idToken: string }) {
const { transcript, listening, startListening, stopListening, resetTranscript, isMicrophoneAvailable } =
useTranscribe({
config: {
region: process.env.REACT_APP_AWS_REGION!,
identityPoolId: process.env.REACT_APP_AWS_IDENTITY_POOL_ID!,
userPoolId: process.env.REACT_APP_AWS_USER_POOL_ID!,
idToken,
},
languageCode: 'en-US',
})
return (
<div>
<p>{transcript || 'Click the button and start speaking...'}</p>
<button onClick={listening ? stopListening : startListening} disabled={!isMicrophoneAvailable}>
{listening ? 'Stop' : 'Record'}
</button>
<button onClick={resetTranscript}>Clear</button>
</div>
)
}
API Reference
useTranscribe(options)
The main hook. Call it at the top level of your React component.
Parameters
interface UseTranscribeOptions {
config: TranscribeConfig
languageCode?: string // Default: "en-US"
}
interface TranscribeConfig {
region: string // AWS region, e.g. "us-east-1"
identityPoolId: string // Cognito Identity Pool ID
userPoolId: string // Cognito User Pool ID
idToken: string // Cognito ID token from your auth layer
}
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| config.region | string | Yes | AWS region where your Identity Pool and Transcribe are available |
| config.identityPoolId | string | Yes | Cognito Identity Pool ID (e.g., us-east-1:xxxxxxxx-xxxx-...) |
| config.userPoolId | string | Yes | Cognito User Pool ID (e.g., us-east-1_XXXXXXXXX) |
| config.idToken | string | Yes | A valid Cognito ID token for the authenticated user |
| languageCode | string | No | BCP-47 language code. Defaults to "en-US". See Supported Languages |
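Since all four config fields are required strings, a structural sanity check can catch copy-paste mistakes before the hook ever runs. The validator below is a hypothetical helper, not part of the package, and it only checks shapes — it does not verify the values against AWS:

```typescript
interface TranscribeConfig {
  region: string
  identityPoolId: string
  userPoolId: string
  idToken: string
}

// Returns a list of human-readable problems; an empty list means
// the config looks structurally valid.
function validateConfig(config: TranscribeConfig): string[] {
  const problems: string[] = []
  if (!/^[a-z]{2}(-[a-z]+)+-\d$/.test(config.region)) {
    problems.push(`region "${config.region}" does not look like an AWS region`)
  }
  if (!/^[a-z0-9-]+:[0-9a-f-]{36}$/i.test(config.identityPoolId)) {
    problems.push('identityPoolId should look like "<region>:<uuid>"')
  }
  if (!/^[a-z0-9-]+_[A-Za-z0-9]+$/i.test(config.userPoolId)) {
    problems.push('userPoolId should look like "<region>_<id>"')
  }
  if (config.idToken.split('.').length !== 3) {
    problems.push('idToken does not look like a JWT')
  }
  return problems
}
```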
Return Value
interface UseTranscribeReturn {
transcript: string
listening: boolean
isMicrophoneAvailable: boolean
startListening: () => Promise<void>
stopListening: () => Promise<void>
abortListening: () => void
resetTranscript: () => void
}
| Property | Type | Description |
|----------|------|-------------|
| transcript | string | The current transcription text. Updates in real-time with partial (interim) results and accumulates final results. Resets when startListening is called again |
| listening | boolean | true while the microphone is active and audio is being streamed to Transcribe |
| isMicrophoneAvailable | boolean | Starts as true. Becomes false if the user denies microphone permission |
| startListening | () => Promise<void> | Requests mic access, starts audio capture, connects to Transcribe, and begins streaming. Clears any previous transcript. If already listening, this is a no-op |
| stopListening | () => Promise<void> | Gracefully stops audio capture, closes the Transcribe WebSocket stream, and releases the microphone. Safe to call even if not listening |
| abortListening | () => void | Synchronous, fire-and-forget version of stopListening. Useful in cleanup code or event handlers where you can't await |
| resetTranscript | () => void | Clears the transcript to an empty string without stopping the mic |
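The partial-vs-final accumulation described for transcript can be illustrated with a small reducer. This is a sketch of the observable behavior, not the package's actual internals:

```typescript
interface TranscriptEvent {
  text: string
  isPartial: boolean // true for interim results that will be revised
}

// Final results accumulate; the latest partial result is appended on
// top and replaced as it is revised — mirroring how `transcript`
// updates while you speak.
function reduceTranscript(events: TranscriptEvent[]): string {
  let finals = ''
  let partial = ''
  for (const event of events) {
    if (event.isPartial) {
      partial = event.text // each partial replaces the previous one
    } else {
      finals += event.text // finals accumulate permanently
      partial = ''
    }
  }
  return finals + partial
}
```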
Exported Types
import type {
TranscribeConfig,
UseTranscribeOptions,
UseTranscribeReturn,
} from '@gdnaio/react-transcribe-streaming'
Usage Examples
Getting the ID Token
The hook requires a Cognito ID token. Here's how to get it from common auth libraries:
AWS Amplify v6:
import { fetchAuthSession } from 'aws-amplify/auth'
const session = await fetchAuthSession()
const idToken = session.tokens?.idToken?.toString() ?? ''
AWS Amplify v5:
import { Auth } from 'aws-amplify'
const session = await Auth.currentSession()
const idToken = session.getIdToken().getJwtToken()
@gdnaio/cognito-auth:
import { useAuth } from '@gdnaio/cognito-auth'
const { getIdToken } = useAuth()
const idToken = await getIdToken()
amazon-cognito-identity-js:
const cognitoUser = userPool.getCurrentUser()
cognitoUser.getSession((err, session) => {
const idToken = session.getIdToken().getJwtToken()
})
Complete Working Example
A full component with token fetching, error states, and visual feedback:
import { useState, useEffect } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
const CONFIG = {
region: 'us-east-1',
identityPoolId: 'us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
userPoolId: 'us-east-1_XXXXXXXXX',
}
interface Props {
getIdToken: () => Promise<string | null>
language?: string
}
export default function VoiceTranscriber({ getIdToken, language = 'en-US' }: Props) {
const [idToken, setIdToken] = useState('')
useEffect(() => {
getIdToken().then((token) => {
if (token) setIdToken(token)
})
}, [getIdToken])
const {
transcript,
listening,
isMicrophoneAvailable,
startListening,
stopListening,
resetTranscript,
} = useTranscribe({
config: { ...CONFIG, idToken },
languageCode: language,
})
if (!isMicrophoneAvailable) {
return <p>Microphone access was denied. Please allow microphone access and reload the page.</p>
}
return (
<div>
<div style={{ minHeight: 60, padding: 12, border: '1px solid #ccc', borderRadius: 8 }}>
{transcript || <span style={{ color: '#999' }}>Click Record and start speaking...</span>}
</div>
<div style={{ marginTop: 8, display: 'flex', gap: 8 }}>
<button
onClick={listening ? stopListening : startListening}
style={{
padding: '8px 16px',
background: listening ? '#ea4335' : '#1a73e8',
color: 'white',
border: 'none',
borderRadius: 4,
cursor: 'pointer',
}}
>
{listening ? 'Stop Recording' : 'Start Recording'}
</button>
<button onClick={resetTranscript} style={{ padding: '8px 16px' }}>
Clear
</button>
</div>
{listening && (
<p style={{ color: '#ea4335', marginTop: 8 }}>
Listening...
</p>
)}
</div>
)
}
Dynamic Language Switching
import { useState } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
function MultiLanguageInput() {
const [language, setLanguage] = useState('en-US')
const { transcript, listening, startListening, stopListening } = useTranscribe({
config: { /* ... */ },
languageCode: language,
})
return (
<div>
<select value={language} onChange={(e) => setLanguage(e.target.value)}>
<option value="en-US">English</option>
<option value="ar-SA">Arabic</option>
<option value="fr-FR">French</option>
<option value="es-US">Spanish</option>
<option value="hi-IN">Hindi</option>
</select>
<button onClick={listening ? stopListening : startListening}>
{listening ? 'Stop' : 'Record'}
</button>
<p>{transcript}</p>
</div>
)
}
Note: Changing languageCode while actively listening does not restart the stream automatically. Stop and start listening again to switch languages mid-session.
Appending Speech to Existing Text
import { useState, useRef, useEffect } from 'react'
import { useTranscribe } from '@gdnaio/react-transcribe-streaming'
function ChatInput() {
const [text, setText] = useState('')
const baseTextRef = useRef(text)
const { transcript, listening, startListening, stopListening } =
useTranscribe({ config: { /* ... */ } })
// When starting, capture the current text as the base
const handleStart = async () => {
baseTextRef.current = text
await startListening()
}
// Append transcript to the base text
useEffect(() => {
if (listening && transcript) {
setText(baseTextRef.current + transcript)
}
}, [transcript, listening])
return (
<div>
<input value={text} onChange={(e) => setText(e.target.value)} />
<button onClick={listening ? stopListening : handleStart}>
{listening ? 'Stop' : 'Mic'}
</button>
</div>
)
}
Supported Languages
The hook includes a built-in language code mapper. Pass any of these BCP-47 codes as languageCode:
| Code | Language | Transcribe Code |
|------|----------|-----------------|
| en-US | English (US) | en-US |
| en-GB | English (UK) | en-GB |
| en-AU | English (Australia) | en-AU |
| ar-SA | Arabic (Saudi Arabia) | ar-SA |
| ar-AE | Arabic (UAE) | ar-AE |
| fr-FR | French (France) | fr-FR |
| fr-CA | French (Canada) | fr-CA |
| es-ES | Spanish (Spain) | es-ES |
| es-US | Spanish (US) | es-US |
| es-LA | Spanish (Latin America) | es-US * |
| de-DE | German | de-DE |
| it-IT | Italian | it-IT |
| pt-BR | Portuguese (Brazil) | pt-BR |
| ja-JP | Japanese | ja-JP |
| ko-KR | Korean | ko-KR |
| zh-CN | Chinese (Mandarin) | zh-CN |
| hi-IN | Hindi | hi-IN |
* es-LA is mapped to es-US since AWS Transcribe doesn't have a dedicated Latin American Spanish code.
Any unlisted code is passed through to Transcribe as-is. See AWS Transcribe Streaming supported languages for the full list of supported codes.
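The mapping behavior described above can be sketched as a simple lookup with pass-through (an illustration of the documented behavior, not the package's source):

```typescript
// Codes that have no dedicated Transcribe language code are remapped;
// everything else passes through to Transcribe unchanged.
const LANGUAGE_OVERRIDES: Record<string, string> = {
  'es-LA': 'es-US', // no Latin American Spanish code in Transcribe
}

function toTranscribeLanguageCode(bcp47: string): string {
  return LANGUAGE_OVERRIDES[bcp47] ?? bcp47
}
```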
How It Works
User clicks "Start"
|
v
getUserMedia() ---- requests microphone access
|
v
AudioContext + ScriptProcessorNode ---- captures raw audio at native sample rate
|
v
Float32 -> Int16 PCM (little-endian) ---- converts to Transcribe-compatible format
|
v
fromCognitoIdentityPool() ---- exchanges Cognito ID token for temporary AWS credentials
|
v
TranscribeStreamingClient.send() ---- opens WebSocket to AWS Transcribe
|
v
Audio chunks streamed as async generator ---- yields { AudioEvent: { AudioChunk } }
|
v
TranscriptResultStream (async iterable) ---- receives partial and final transcript events
|
v
React state updates ---- transcript updates in real-time
- Microphone capture — calls getUserMedia with echo cancellation and noise suppression enabled, creates an AudioContext and ScriptProcessorNode to capture raw audio frames
- PCM encoding — converts Float32 audio samples to 16-bit signed integer PCM in little-endian byte order, the format AWS Transcribe expects
- Credential exchange — uses fromCognitoIdentityPool from @aws-sdk/credential-providers to exchange the Cognito ID token for temporary AWS credentials scoped to transcribe:StartStreamTranscription
- WebSocket streaming — creates a TranscribeStreamingClient and sends a StartStreamTranscriptionCommand with the audio stream as an async generator. Transcribe opens a WebSocket connection under the hood
- Transcript processing — iterates the TranscriptResultStream async iterable. Partial results update the transcript state immediately (giving real-time feedback). Final results are accumulated so the full transcript builds up over the session
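The Float32-to-PCM step above can be written as a pure function. This mirrors the conversion described in the pipeline; the hook's exact implementation may differ:

```typescript
// Convert Web Audio Float32 samples (range -1..1) to 16-bit signed
// little-endian PCM, the input format AWS Transcribe Streaming expects.
function floatTo16BitPCM(samples: Float32Array): Uint8Array {
  const buffer = new ArrayBuffer(samples.length * 2)
  const view = new DataView(buffer)
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the Int16 range
    const s = Math.max(-1, Math.min(1, samples[i]))
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true) // true = little-endian
  }
  return new Uint8Array(buffer)
}
```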
Cleanup and Resource Management
The hook automatically cleans up resources when:
- You call stopListening() or abortListening()
- An error occurs (network drop, credential expiry, etc.)
- The Transcribe stream ends naturally
Cleanup includes:
- Disconnecting the ScriptProcessorNode and AudioContext
- Stopping all microphone MediaStream tracks (the browser's mic indicator turns off)
- Aborting the Transcribe WebSocket stream
- Setting listening to false
Note: The hook does not auto-stop on component unmount. If you need that, call abortListening in a cleanup effect:
useEffect(() => {
  return () => abortListening()
}, [abortListening])
Browser Compatibility
| Browser | Minimum Version | Notes |
|---------|----------------|-------|
| Chrome | 66+ | Full support |
| Firefox | 76+ | Full support |
| Safari | 14.1+ | Full support |
| Edge | 79+ | Full support (Chromium-based) |
| Mobile Chrome | 66+ | Full support |
| Mobile Safari | 14.5+ | Full support |
Requires navigator.mediaDevices.getUserMedia and AudioContext APIs. Both are available in all modern browsers. HTTPS is required in production (localhost is exempt for development).
Error Handling
The hook handles errors internally and never throws. All errors are logged to console.error with a [useTranscribe] prefix.
| Scenario | What Happens |
|----------|-------------|
| Microphone permission denied | isMicrophoneAvailable becomes false. You can use this to show a message or disable the button |
| Microphone already in use | startListening fails silently, error logged. listening stays false |
| Network disconnection | The Transcribe WebSocket closes. listening becomes false. Last transcript is preserved. User can click to restart |
| Expired or invalid ID token | Credential exchange fails, error logged. listening becomes false. Refresh the token and try again |
| AWS service error | Error logged. listening becomes false. Check the console for details |
| Unsupported browser | getUserMedia throws, caught and logged. listening stays false |
Bundle Size
The package itself is small (~6 KB). However, the AWS SDK dependencies add to the bundle:
| Dependency | Approximate Size (gzipped) |
|------------|---------------------------|
| @aws-sdk/client-transcribe-streaming | ~30 KB |
| @aws-sdk/credential-providers | ~20 KB |
| Total addition | ~50 KB gzipped |
The AWS SDK v3 is tree-shakeable — only the Transcribe Streaming client and Cognito credential provider are included in your bundle, not the entire SDK.
Troubleshooting
"Microphone not available" / isMicrophoneAvailable is false
- The user denied microphone permission in the browser
- Fix: Click the lock/camera icon in the browser's address bar, allow microphone access, and reload the page
"NotAllowedError" in console
- Your app is served over HTTP (not HTTPS) in production
- Fix: Serve your app over HTTPS. localhost is exempt for development
"No audio is being captured" / transcript stays empty
- Check that your AWS credentials are valid — look for errors in the browser console
- Verify the Identity Pool ID, User Pool ID, and region are correct
- Ensure the IAM role has transcribe:StartStreamTranscription permission
- Try a different microphone or check system audio settings
"CredentialsProviderError" or "Not authorized"
- The Cognito ID token may be expired. Refresh it before calling startListening
- The Identity Pool may not be linked to your User Pool. Verify the authentication provider in the Identity Pool settings
- The authenticated IAM role may be missing the Transcribe permission
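To catch the expired-token case before calling startListening, you can read the token's exp claim locally. A hedged sketch (Buffer.from is the Node decoding — swap in atob in the browser; this performs no signature verification and is only a hint to refresh):

```typescript
// Returns true when the JWT's `exp` claim is in the past (or the token
// is malformed). Does NOT verify the signature.
function isTokenExpired(idToken: string, skewSeconds = 30): boolean {
  const parts = idToken.split('.')
  if (parts.length !== 3) return true
  try {
    // JWT payloads are base64url-encoded JSON
    const base64 = parts[1].replace(/-/g, '+').replace(/_/g, '/')
    const payload = JSON.parse(Buffer.from(base64, 'base64').toString('utf8'))
    if (typeof payload.exp !== 'number') return true
    // Treat tokens expiring within `skewSeconds` as already expired
    return payload.exp * 1000 < Date.now() + skewSeconds * 1000
  } catch {
    return true
  }
}
```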
Next.js: "navigator is not defined" or "AudioContext is not defined"
- The component is being server-side rendered. Browser APIs don't exist on the server
- Fix: Add 'use client' at the top of the file, or use dynamic(() => import(...), { ssr: false })
Transcript has long delays
- This is usually a network latency issue. AWS Transcribe Streaming requires a stable connection
- Partial results should appear within 200-500ms of speaking. If they don't, check your network connection
- The sample rate is auto-detected from your AudioContext. Higher sample rates (48kHz) produce better quality but more data
Language not recognized correctly
- Ensure you're passing the correct BCP-47 code (e.g., ar-SA, not ar or arabic)
- Some languages require region-specific codes. Check the Supported Languages table
- Changing languageCode while listening does not take effect until you stop and restart
License
MIT
