@gdnaio/react-polly-text-to-speech
v1.0.2
React hook for text-to-speech using Amazon Polly with secure Cognito-based authentication
No API keys in the browser. No third-party TTS services. Just AWS Polly called securely through Cognito Identity Pool credentials — the same pattern used by @gdnaio/react-transcribe-streaming.
Features
- Single-hook API: `usePollyTextToSpeech` returns speak/stop controls, loading/playing state, and audio data
- Secure by default: uses a Cognito Identity Pool for temporary AWS credentials (no keys in the frontend)
- All Polly engines: Standard, Neural, Long-form, and Generative
- SSML builder: built-in `ssml` utility for pauses, emphasis, prosody, whispering, and more
- Voice catalogue: curated voice map with `getVoicesByLanguage()` and `getVoiceInfo()` helpers
- Configurable output: MP3, OGG, PCM with custom sample rates
- No hidden audio: the hook does NOT create its own `Audio` object; you control playback via your own `<audio>` element and the `ref` callback
- Client caching: the `PollyClient` and AWS credentials are cached at module level across all hook instances; only the very first call triggers Cognito round-trips
- Audio Blob exposed: use `audioUrl` with your own `<audio>` player or download the blob
- TypeScript: full type definitions included
- Lightweight: only depends on `@aws-sdk/client-polly` and `@aws-sdk/credential-providers`
Installation
npm install @gdnaio/react-polly-text-to-speech
Both ESM and CommonJS builds are included. TypeScript declarations ship with the package.
AWS Prerequisites
You need three things (identical setup to @gdnaio/react-transcribe-streaming):
1. Cognito User Pool
Your existing User Pool for authenticating users.
2. Cognito Identity Pool
Link it to your User Pool. This provides temporary AWS credentials to the browser.
3. IAM Role for Authenticated Users
Attach this inline policy to the Identity Pool's authenticated role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["polly:SynthesizeSpeech"],
"Resource": "*"
}
]
}
Tip: If you already have a Transcribe Identity Pool stack, you can add the polly:SynthesizeSpeech permission to the same IAM role and reuse the Identity Pool.
Quick Start
import { usePollyTextToSpeech } from '@gdnaio/react-polly-text-to-speech'
function TextToSpeechButton({ idToken }: { idToken: string }) {
const { speak, stop, ref, loading, playing, error, audioUrl } = usePollyTextToSpeech({
config: {
region: 'us-east-1',
identityPoolId: 'us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
userPoolId: 'us-east-1_XXXXXXXXX',
idToken, // from your auth provider
},
})
return (
<div>
<button onClick={() => speak('Hello! Welcome to our application.')} disabled={loading}>
{loading ? 'Generating...' : 'Speak'}
</button>
{playing && <button onClick={stop}>Stop</button>}
{error && <p style={{ color: 'red' }}>{error}</p>}
{/* IMPORTANT: attach ref so the hook can track play/pause/duration */}
{audioUrl && <audio ref={ref} autoPlay controls src={audioUrl} />}
</div>
)
}
API Reference
usePollyTextToSpeech(options)
Options
{
// Required — AWS Cognito credentials
config: {
region: string // AWS region (e.g. "us-east-1")
identityPoolId: string // Cognito Identity Pool ID
userPoolId: string // Cognito User Pool ID
idToken: string // JWT ID token from your auth provider
},
// Optional — voice settings
voice: {
voiceId?: string // Polly voice (default: "Joanna")
engine?: PollyEngine // "standard" | "neural" | "long-form" | "generative" (default: "neural")
languageCode?: string // Only needed for bilingual voices (e.g. "hi-IN" for Aditi)
},
// Optional — audio output settings
audio: {
format?: PollyOutputFormat // "mp3" | "ogg_vorbis" | "pcm" (default: "mp3")
sampleRate?: PollySampleRate // "8000" | "16000" | "22050" | "24000" | "44100" | "48000"
lexiconNames?: string[] // Custom pronunciation lexicons (max 5)
speechMarkTypes?: PollySpeechMarkType[] // "sentence" | "ssml" | "viseme" | "word"
}
}
Return Value
{
speak: (text: string, textType?: 'text' | 'ssml') => Promise<void>
stop: () => void
ref: (el: HTMLAudioElement | null) => void // callback ref for your <audio> element
loading: boolean // true while API call is in-flight
playing: boolean // true while audio is playing
error: string | null // last error message, or null
audioBlob: Blob | null // raw audio Blob from last synthesis
audioUrl: string | null // Object URL for <audio src> usage
duration: number | null // audio duration in seconds (after metadata loads)
}
Configuring Voices
// Neural voice (default) — natural sounding
const tts = usePollyTextToSpeech({
config,
voice: { voiceId: 'Matthew', engine: 'neural' },
})
// Generative voice — most expressive
const tts = usePollyTextToSpeech({
config,
voice: { voiceId: 'Ruth', engine: 'generative' },
})
// Long-form voice — optimised for articles/stories
const tts = usePollyTextToSpeech({
config,
voice: { voiceId: 'Danielle', engine: 'long-form' },
})
// Spanish voice
const tts = usePollyTextToSpeech({
config,
voice: { voiceId: 'Lupe', engine: 'neural', languageCode: 'es-US' },
})
Using SSML
For fine-grained control over speech output, use the built-in ssml builder:
import { usePollyTextToSpeech, ssml } from '@gdnaio/react-polly-text-to-speech'
function SsmlExample() {
const { speak } = usePollyTextToSpeech({ config })
const handleSpeak = () => {
const text = ssml.speak(
ssml.sentence('Hello there!') +
ssml.pause('500ms') +
ssml.emphasis('This is really important.', 'strong') +
ssml.pause('300ms') +
ssml.prosody('And this part is spoken slowly.', { rate: 'slow' }) +
ssml.pause('200ms') +
ssml.whisper('This is a secret.')
)
speak(text, 'ssml')
}
return <button onClick={handleSpeak}>Speak with SSML</button>
}
SSML Builder Methods
| Method | Description | Example |
|--------|-------------|---------|
| ssml.speak(content) | Wrap in <speak> root | ssml.speak('Hello') |
| ssml.pause(time) | Insert a break | ssml.pause('500ms') |
| ssml.emphasis(text, level) | Emphasise text | ssml.emphasis('wow', 'strong') |
| ssml.prosody(text, opts) | Control rate/pitch/volume | ssml.prosody('slow', { rate: 'slow' }) |
| ssml.paragraph(text) | Paragraph with natural pause | ssml.paragraph('First para.') |
| ssml.sentence(text) | Sentence boundary | ssml.sentence('A sentence.') |
| ssml.sayAs(text, type) | Interpret as date/number/etc | ssml.sayAs('2025', 'cardinal') |
| ssml.phoneme(text, ph) | Phonemic pronunciation | ssml.phoneme('pecan', 'pɪˈkɑːn') |
| ssml.sub(text, alias) | Substitution | ssml.sub('AWS', 'Amazon Web Services') |
| ssml.lang(text, lang) | Switch language mid-speech | ssml.lang('Bonjour', 'fr-FR') |
| ssml.whisper(text) | Whispering voice | ssml.whisper('secret') |
| ssml.amazonEffect(text, name) | Polly-specific effects | ssml.amazonEffect('news', 'drc') |
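When user-supplied text goes into SSML, the XML special characters (`&`, `<`, `>`, and quotes) have to be escaped or Polly will reject the document as invalid SSML. This README does not say whether the `ssml` builder escapes its arguments, so the following is only a minimal sketch of the escaping step; `escapeSsmlText` and `toSsml` are hypothetical helpers, not part of the package. If the builder already escapes its input, applying this as well would double-encode the text.

```typescript
// Hypothetical helpers (not exported by the package): escape the five XML
// special characters so arbitrary user text can sit inside an SSML document.
const XML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&apos;',
}

function escapeSsmlText(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => XML_ESCAPES[ch] ?? ch)
}

// Builds a small SSML document by hand around the escaped user input,
// followed by a pause of `pauseMs` milliseconds.
function toSsml(userText: string, pauseMs = 300): string {
  return `<speak>${escapeSsmlText(userText)}<break time="${pauseMs}ms"/></speak>`
}
```

The resulting string would be passed to `speak(text, 'ssml')` like any other SSML input.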
Voice Discovery
Browse available voices with the included catalogue:
import { getVoicesByLanguage, getVoiceInfo, POLLY_VOICES } from '@gdnaio/react-polly-text-to-speech'
// Get all English (US) voices
const usVoices = getVoicesByLanguage('en-US')
// → [{ voiceId: 'Joanna', name: 'Joanna', gender: 'Female', engines: [...] }, ...]
// Look up a specific voice
const info = getVoiceInfo('Matthew')
// → { voiceId: 'Matthew', name: 'Matthew', gender: 'Male', engines: ['neural', 'standard'] }
// Access the full catalogue
console.log(Object.keys(POLLY_VOICES))
// → ['en-US', 'en-GB', 'en-AU', 'en-IN', 'es-US', 'es-ES', 'fr-FR', ...]
Using ref with Your Own Audio Player
The hook does not play audio internally. It synthesises audio and returns audioUrl — you render the <audio> element and pass the ref callback so the hook can track playing, duration, and stop() state.
function CustomPlayer() {
const { speak, stop, ref, audioUrl, playing, duration, loading } = usePollyTextToSpeech({ config })
return (
<div>
<button onClick={() => speak('Hello world')} disabled={loading}>
Generate Audio
</button>
{playing && <button onClick={stop}>Stop</button>}
{/* ref is required — without it, playing/duration/stop() won't work */}
{audioUrl && <audio ref={ref} autoPlay controls src={audioUrl} />}
{duration !== null && <p>Duration: {duration.toFixed(1)}s</p>}
</div>
)
}
Why ref matters
The ref callback is how the hook connects to the actual <audio> DOM element. Without it:
- `playing` will always be `false`
- `duration` will always be `null`
- `stop()` will have no effect (there's no element to pause)
// WRONG — hook can't track the audio element
{audioUrl && <audio autoPlay controls src={audioUrl} />}
// CORRECT — hook is connected to the element
{audioUrl && <audio ref={ref} autoPlay controls src={audioUrl} />}
Using with Vite
const { speak } = usePollyTextToSpeech({
config: {
region: import.meta.env.VITE_AWS_REGION,
identityPoolId: import.meta.env.VITE_AWS_IDENTITY_POOL_ID,
userPoolId: import.meta.env.VITE_AWS_USER_POOL_ID,
idToken: token, // from your auth hook
},
voice: { voiceId: 'Joanna', engine: 'neural' },
audio: { format: 'mp3' },
})
Using with Next.js
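Since this hook relies on browser APIs (for example URL.createObjectURL), it must run on the client. In the App Router, mark the component with 'use client'; if you still hit server-rendering errors, load the component through a dynamic import with SSR disabled: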
// components/TtsButton.tsx
'use client'
import { usePollyTextToSpeech } from '@gdnaio/react-polly-text-to-speech'
export function TtsButton({ idToken }: { idToken: string }) {
const { speak, ref, audioUrl, loading } = usePollyTextToSpeech({
config: {
region: process.env.NEXT_PUBLIC_AWS_REGION!,
identityPoolId: process.env.NEXT_PUBLIC_AWS_IDENTITY_POOL_ID!,
userPoolId: process.env.NEXT_PUBLIC_AWS_USER_POOL_ID!,
idToken,
},
})
return (
<div>
<button onClick={() => speak('Hello from Next.js')} disabled={loading}>
Speak
</button>
{audioUrl && <audio ref={ref} autoPlay controls src={audioUrl} />}
</div>
)
}
Token Retrieval
The hook needs a Cognito ID token (not access token). Common patterns:
// @gdnaio/cognito-auth
const { getIdToken } = useAuth()
const token = await getIdToken()
// AWS Amplify v6
import { fetchAuthSession } from 'aws-amplify/auth'
const { tokens } = await fetchAuthSession()
const token = tokens?.idToken?.toString()
// amazon-cognito-identity-js
cognitoUser.getSession((err, session) => {
if (err || !session) return // no valid session: surface the error in your app
const token = session.getIdToken().getJwtToken()
})
Audio Output Formats
| Format | MIME Type | Use Case |
|--------|-----------|----------|
| mp3 (default) | audio/mpeg | Best browser compatibility, small file size |
| ogg_vorbis | audio/ogg | Open format, good quality-to-size ratio |
| pcm | audio/pcm | Raw audio for processing pipelines |
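Raw PCM has no container, so an `<audio>` element cannot play it directly. One common workaround is to prepend a 44-byte WAV header. The sketch below assumes Polly's PCM output is 16-bit signed little-endian mono, and `pcmToWav` is a hypothetical helper that is not part of this package.

```typescript
// Hypothetical helper: wraps raw PCM samples in a 44-byte WAV (RIFF) header.
// Assumes 16-bit signed little-endian mono samples.
function pcmToWav(pcm: Uint8Array, sampleRate: number): Uint8Array {
  const numChannels = 1
  const bytesPerSample = 2
  const blockAlign = numChannels * bytesPerSample
  const byteRate = sampleRate * blockAlign
  const wav = new Uint8Array(44 + pcm.length)
  const view = new DataView(wav.buffer)

  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i))
  }

  writeAscii(0, 'RIFF')
  view.setUint32(4, 36 + pcm.length, true) // file size minus the first 8 bytes
  writeAscii(8, 'WAVE')
  writeAscii(12, 'fmt ')
  view.setUint32(16, 16, true) // size of the fmt chunk
  view.setUint16(20, 1, true) // audio format 1 = linear PCM
  view.setUint16(22, numChannels, true)
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, byteRate, true)
  view.setUint16(32, blockAlign, true)
  view.setUint16(34, bytesPerSample * 8, true) // bits per sample
  writeAscii(36, 'data')
  view.setUint32(40, pcm.length, true) // size of the sample data
  wav.set(pcm, 44)
  return wav
}
```

In a browser, the bytes of `audioBlob` (via `await audioBlob.arrayBuffer()`) could be passed through this function and wrapped in a new Blob of type `audio/wav`. The `sampleRate` argument must match the rate requested in the `audio` options.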
Supported Voices
The built-in POLLY_VOICES catalogue includes the following voices. You can pass any valid Polly VoiceId directly to the hook — the catalogue is a convenience helper, not a restriction.
For the full and most up-to-date list of all voices, engines, and languages, see the Amazon Polly Voice List.
English (US) — en-US
| Voice | Gender | Engines |
| --- | --- | --- |
| Joanna | Female | neural, standard, long-form |
| Matthew | Male | neural, standard |
| Ruth | Female | neural, long-form, generative |
| Stephen | Male | neural, long-form, generative |
| Danielle | Female | neural, long-form, generative |
| Gregory | Male | neural, long-form, generative |
| Ivy | Female | neural, standard |
| Kendra | Female | neural, standard |
| Kimberly | Female | neural, standard |
| Salli | Female | neural, standard |
| Joey | Male | neural, standard |
| Justin | Male | neural, standard |
| Kevin | Male | neural, standard |
English (GB) — en-GB
| Voice | Gender | Engines |
| --- | --- | --- |
| Amy | Female | neural, standard |
| Emma | Female | neural, standard |
| Brian | Male | neural, standard |
| Arthur | Male | neural |
English (AU) — en-AU
| Voice | Gender | Engines |
| --- | --- | --- |
| Olivia | Female | neural |
| Nicole | Female | standard |
| Russell | Male | standard |
English (IN) — en-IN
| Voice | Gender | Engines |
| --- | --- | --- |
| Kajal | Female | neural |
| Aditi | Female | standard |
| Raveena | Female | standard |
Spanish (US) — es-US
| Voice | Gender | Engines |
| --- | --- | --- |
| Lupe | Female | neural, standard |
| Pedro | Male | neural |
| Penelope | Female | standard |
| Miguel | Male | standard |
Spanish (ES) — es-ES
| Voice | Gender | Engines |
| --- | --- | --- |
| Lucia | Female | neural, standard |
| Sergio | Male | neural |
| Enrique | Male | standard |
| Conchita | Female | standard |
Spanish (MX) — es-MX
| Voice | Gender | Engines |
| --- | --- | --- |
| Mia | Female | standard |
| Andres | Male | neural |
French (FR) — fr-FR
| Voice | Gender | Engines |
| --- | --- | --- |
| Léa | Female | neural, standard |
| Rémi | Male | neural |
| Mathieu | Male | standard |
| Céline | Female | standard |
French (CA) — fr-CA
| Voice | Gender | Engines |
| --- | --- | --- |
| Gabrielle | Female | neural |
| Chantal | Female | standard |
German — de-DE
| Voice | Gender | Engines |
| --- | --- | --- |
| Vicki | Female | neural, standard |
| Daniel | Male | neural |
| Hans | Male | standard |
| Marlene | Female | standard |
Italian — it-IT
| Voice | Gender | Engines |
| --- | --- | --- |
| Bianca | Female | neural, standard |
| Adriano | Male | neural |
| Carla | Female | standard |
| Giorgio | Male | standard |
Portuguese (BR) — pt-BR
| Voice | Gender | Engines |
| --- | --- | --- |
| Camila | Female | neural, standard |
| Vitória | Female | neural, standard |
| Thiago | Male | neural |
| Ricardo | Male | standard |
Japanese — ja-JP
| Voice | Gender | Engines |
| --- | --- | --- |
| Kazuha | Female | neural, long-form |
| Tomoko | Female | neural, long-form |
| Takumi | Male | neural, standard |
| Mizuki | Female | standard |
Korean — ko-KR
| Voice | Gender | Engines |
| --- | --- | --- |
| Seoyeon | Female | neural, standard |
Chinese (Mandarin) — cmn-CN
| Voice | Gender | Engines |
| --- | --- | --- |
| Zhiyu | Female | neural, standard |
Hindi — hi-IN
| Voice | Gender | Engines |
| --- | --- | --- |
| Kajal | Female | neural |
| Aditi | Female | standard |
Arabic (UAE) — ar-AE
| Voice | Gender | Engines |
| --- | --- | --- |
| Hala | Female | neural |
| Zayd | Male | neural |
Arabic (Standard) — arb
| Voice | Gender | Engines |
| --- | --- | --- |
| Zeina | Female | standard |
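Because not every voice supports every engine, it can help to choose the engine from the voice's own list rather than hard-coding one. The following is a minimal sketch: `pickEngine`, `VoiceEntry`, and `SAMPLE_VOICES` are hypothetical names, the `PollyEngine` union is redeclared locally to mirror the option type shown earlier, and the sample rows are copied from the tables above. In a real app the entry would more likely come from `getVoiceInfo()`.

```typescript
type PollyEngine = 'standard' | 'neural' | 'long-form' | 'generative'

interface VoiceEntry {
  voiceId: string
  engines: PollyEngine[]
}

// A few rows copied from the catalogue tables; illustrative only.
const SAMPLE_VOICES: VoiceEntry[] = [
  { voiceId: 'Matthew', engines: ['neural', 'standard'] },
  { voiceId: 'Ruth', engines: ['neural', 'long-form', 'generative'] },
  { voiceId: 'Nicole', engines: ['standard'] },
]

// Returns the first engine in `preferred` that the voice lists, or null
// when none of the preferred engines is available for that voice.
function pickEngine(voice: VoiceEntry, preferred: PollyEngine[]): PollyEngine | null {
  for (const engine of preferred) {
    if (voice.engines.includes(engine)) return engine
  }
  return null
}
```

With a preference order of `['neural', 'standard']`, a standard-only voice such as Nicole falls back to `'standard'` instead of producing an invalid voice/engine request.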
Browser Support
| Browser | Minimum Version |
|---------|----------------|
| Chrome | 66+ |
| Firefox | 76+ |
| Safari | 14.1+ |
| Edge | 79+ |
Gotchas and Common Mistakes
1. Forgetting ref on the <audio> element
This is the most common mistake. Without ref={ref}, the hook has no connection to the DOM audio element. playing, duration, and stop() will all be broken. Always pass the ref.
2. Multiple components = one shared client
The PollyClient and AWS credentials are cached at module level. If you render multiple usePollyTextToSpeech instances (e.g. one per chat message), they all share the same client. The Cognito GetId + GetCredentialsForIdentity API calls only happen once — every subsequent speak() call goes directly to Polly.
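The module-level caching described here can be pictured as a memoised async factory. The package's actual internals are not shown in this README, so this is only a generic sketch of the pattern; `clientCache`, `getOrCreate`, and the cache key format are assumptions for illustration.

```typescript
// Generic module-level memoisation of an async factory (illustrative only).
// Every caller that uses the same key shares one in-flight or resolved promise.
const clientCache = new Map<string, Promise<unknown>>()

function getOrCreate<T>(key: string, factory: () => Promise<T>): Promise<T> {
  const cached = clientCache.get(key)
  if (cached) return cached as Promise<T>

  const created = factory().catch((err: unknown) => {
    clientCache.delete(key) // do not cache failures, so a later call can retry
    throw err
  })
  clientCache.set(key, created)
  return created
}
```

In this sketch, keying the cache on something like region plus Identity Pool ID means many hook instances trigger the expensive setup only once. A design like this also needs a way to drop the entry when the ID token changes, which is omitted here.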
3. Pass idToken as state, not a stale string
The hook needs a valid Cognito ID token. If you pass an empty string or a stale token, the credential provider will fail. Fetch the token asynchronously and pass it via React state:
// WRONG — token is empty on first render, hook will error
const token = '' // or some stale value
const { speak } = usePollyTextToSpeech({ config: { ...rest, idToken: token } })
// CORRECT — fetch token, set state, then call speak()
const [idToken, setIdToken] = useState('')
useEffect(() => {
getIdToken().then(setIdToken)
}, [])
const { speak } = usePollyTextToSpeech({ config: { ...rest, idToken } })
// Only call speak() after idToken is set
useEffect(() => {
if (idToken) speak('Hello')
}, [idToken])
4. Conditionally rendered <audio> loses the ref
If your <audio> element is inside a conditional ({audioUrl && <audio ref={ref} ... />}), the ref detaches when audioUrl becomes null (e.g. on a new speak() call). This is expected — the hook handles ref attach/detach cleanly.
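To make the attach/detach behaviour concrete, here is a rough sketch of how a callback ref can follow an element that mounts and unmounts. It is not the hook's actual implementation: `createAudioRef` and `AudioLike` are hypothetical, and the structural `AudioLike` type stands in for `HTMLAudioElement` so the sketch also runs outside a browser.

```typescript
// Minimal structural stand-in for HTMLAudioElement (assumption for this sketch).
interface AudioLike {
  addEventListener(type: string, listener: () => void): void
  removeEventListener(type: string, listener: () => void): void
}

// Hypothetical: returns a callback ref that tracks play/pause on whichever
// element is currently attached and cleans up when React passes null.
function createAudioRef(onPlayingChange: (playing: boolean) => void) {
  let current: AudioLike | null = null
  const onPlay = () => onPlayingChange(true)
  const onStop = () => onPlayingChange(false)

  return (el: AudioLike | null): void => {
    if (current) {
      // Detach from the previous element before switching.
      current.removeEventListener('play', onPlay)
      current.removeEventListener('pause', onStop)
      current.removeEventListener('ended', onStop)
      onPlayingChange(false) // the old element is gone, so nothing is playing
    }
    current = el
    if (el) {
      el.addEventListener('play', onPlay)
      el.addEventListener('pause', onStop)
      el.addEventListener('ended', onStop)
    }
  }
}
```

React calls a callback ref with the element on mount and with `null` on unmount, which is why a conditionally rendered `<audio>` can come and go without leaving stale listeners behind.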
5. SSML text must use textType: 'ssml'
If your text contains SSML tags like <break> or <prosody>, you must pass 'ssml' as the second argument to speak(). Otherwise Polly will read the tags as literal text.
// WRONG — tags read out loud as text
speak('<speak>Hello <break time="500ms"/> world</speak>')
// CORRECT
speak('<speak>Hello <break time="500ms"/> world</speak>', 'ssml')
Error Handling
The hook catches errors and exposes them via the error state. It never throws. Common scenarios:
- Invalid credentials: expired token or misconfigured Identity Pool
- Invalid voice/engine combo: e.g. using the `long-form` engine with a voice that doesn't support it
- Text too long: a single `SynthesizeSpeech` request accepts up to 3,000 billed characters (6,000 characters in total once SSML tags are counted)
- Autoplay blocked: some browsers block `audio.play()` without user interaction
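For the text-too-long case, one workaround is to split long plain text into smaller pieces and synthesise them one after another. The sketch below is a hypothetical helper (`splitForSynthesis` is not part of the package) that prefers sentence boundaries and defaults to the 3,000-character limit mentioned above. It counts JavaScript string length, which only approximates how Polly counts characters, and it is meant for plain text rather than SSML.

```typescript
// Hypothetical helper: splits plain text into chunks of at most `maxChars`,
// preferring sentence boundaries. A sentence longer than the limit is hard-split.
function splitForSynthesis(text: string, maxChars = 3000): string[] {
  // Each match is a sentence ending in ., ! or ? (plus trailing whitespace),
  // or the unterminated remainder at the end of the text.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? []
  const chunks: string[] = []
  let current = ''

  for (const sentence of sentences) {
    // Flush the current chunk if this sentence would push it over the limit.
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim())
      current = ''
    }
    // Hard-split any single sentence that is itself longer than the limit.
    let rest = sentence
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars))
      rest = rest.slice(maxChars)
    }
    current += rest
  }

  if (current.trim().length > 0) chunks.push(current.trim())
  return chunks
}
```

Each chunk could then be passed to `speak()` in turn, for example by waiting for the audio element's `ended` event before requesting the next one.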
License
MIT
