@gdnaio/react-polly-text-to-speech

v1.0.2


Maintainers: gdnawill, anandgdna, anvithas


React hook for text-to-speech using Amazon Polly with secure Cognito-based authentication.

No API keys in the browser. No third-party TTS services. Just AWS Polly called securely through Cognito Identity Pool credentials — the same pattern used by @gdnaio/react-transcribe-streaming.

Features

Single-hook API — usePollyTextToSpeech returns speak/stop controls, loading/playing state, and audio data
Secure by default — Uses Cognito Identity Pool for temporary AWS credentials (no keys in frontend)
All Polly engines — Standard, Neural, Long-form, and Generative
SSML builder — Built-in ssml utility for pauses, emphasis, prosody, whispering, and more
Voice catalogue — Curated voice map with getVoicesByLanguage() and getVoiceInfo() helpers
Configurable output — MP3, OGG, PCM with custom sample rates
No hidden audio — The hook does NOT create its own Audio object; you control playback via your own <audio> element and the ref callback
Client caching — The PollyClient and AWS credentials are cached at module level across all hook instances; only the very first call triggers Cognito round-trips
Audio Blob exposed — Use audioUrl with your own <audio> player or download the blob
TypeScript — Full type definitions included
Lightweight — Only depends on @aws-sdk/client-polly and @aws-sdk/credential-providers

Installation

npm install @gdnaio/react-polly-text-to-speech

Both ESM and CommonJS builds are included. TypeScript declarations ship with the package.

AWS Prerequisites

You need three things (identical setup to @gdnaio/react-transcribe-streaming):

1. Cognito User Pool

Your existing User Pool for authenticating users.

2. Cognito Identity Pool

Link it to your User Pool. This provides temporary AWS credentials to the browser.

3. IAM Role for Authenticated Users

Attach this inline policy to the Identity Pool's authenticated role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["polly:SynthesizeSpeech"],
      "Resource": "*"
    }
  ]
}

Tip: If you already have a Transcribe Identity Pool stack, you can add the polly:SynthesizeSpeech permission to the same IAM role and reuse the Identity Pool.

Quick Start

import { usePollyTextToSpeech } from '@gdnaio/react-polly-text-to-speech'

function TextToSpeechButton({ idToken }: { idToken: string }) {
  const { speak, stop, ref, loading, playing, error, audioUrl } = usePollyTextToSpeech({
    config: {
      region: 'us-east-1',
      identityPoolId: 'us-east-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
      userPoolId: 'us-east-1_XXXXXXXXX',
      idToken, // from your auth provider
    },
  })

  return (
    <div>
      <button onClick={() => speak('Hello! Welcome to our application.')} disabled={loading}>
        {loading ? 'Generating...' : 'Speak'}
      </button>
      {playing && <button onClick={stop}>Stop</button>}
      {error && <p style={{ color: 'red' }}>{error}</p>}

      {/* IMPORTANT: attach ref so the hook can track play/pause/duration */}
      {audioUrl && <audio ref={ref} autoPlay controls src={audioUrl} />}
    </div>
  )
}

API Reference

`usePollyTextToSpeech(options)`

Options

{
  // Required — AWS Cognito credentials
  config: {
    region: string            // AWS region (e.g. "us-east-1")
    identityPoolId: string    // Cognito Identity Pool ID
    userPoolId: string        // Cognito User Pool ID
    idToken: string           // JWT ID token from your auth provider
  },

  // Optional — voice settings
  voice: {
    voiceId?: string          // Polly voice (default: "Joanna")
    engine?: PollyEngine      // "standard" | "neural" | "long-form" | "generative" (default: "neural")
    languageCode?: string     // Only needed for bilingual voices (e.g. "hi-IN" for Aditi)
  },

  // Optional — audio output settings
  audio: {
    format?: PollyOutputFormat    // "mp3" | "ogg_vorbis" | "pcm" (default: "mp3")
    sampleRate?: PollySampleRate  // "8000" | "16000" | "22050" | "24000" | "44100" | "48000"
    lexiconNames?: string[]       // Custom pronunciation lexicons (max 5)
    speechMarkTypes?: PollySpeechMarkType[]  // "sentence" | "ssml" | "viseme" | "word"
  }
}
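Not every combination of these settings is accepted by Polly: raw PCM output only supports 8000 and 16000 Hz, and at most five lexicons can be applied. The following is a minimal sketch of a pre-flight check. The type aliases are local copies of the shapes listed above, and `checkAudioSettings` is a hypothetical helper, not something the package exports.

```typescript
// Local copies of the documented option shapes (assumed, not imported from the package).
type PollyOutputFormat = 'mp3' | 'ogg_vorbis' | 'pcm';
type PollySampleRate = '8000' | '16000' | '22050' | '24000' | '44100' | '48000';

interface AudioSettings {
  format?: PollyOutputFormat;
  sampleRate?: PollySampleRate;
  lexiconNames?: string[];
}

// Hypothetical helper: returns a list of problems, empty when the settings look valid.
function checkAudioSettings(audio: AudioSettings): string[] {
  const problems: string[] = [];
  const format = audio.format ?? 'mp3'; // documented default

  // Polly only accepts 8000 or 16000 Hz for raw PCM output.
  if (
    format === 'pcm' &&
    audio.sampleRate !== undefined &&
    audio.sampleRate !== '8000' &&
    audio.sampleRate !== '16000'
  ) {
    problems.push(`pcm does not support sampleRate ${audio.sampleRate}`);
  }

  // The options above document a maximum of 5 lexicons.
  if ((audio.lexiconNames?.length ?? 0) > 5) {
    problems.push('at most 5 lexicons may be applied');
  }

  return problems;
}
```

Running a check like this before calling the hook surfaces a bad combination immediately instead of as an `error` after a network round-trip.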

Return Value

{
  speak: (text: string, textType?: 'text' | 'ssml') => Promise<void>
  stop: () => void
  ref: (el: HTMLAudioElement | null) => void  // callback ref for your <audio> element
  loading: boolean          // true while API call is in-flight
  playing: boolean          // true while audio is playing
  error: string | null      // last error message, or null
  audioBlob: Blob | null    // raw audio Blob from last synthesis
  audioUrl: string | null   // Object URL for <audio src> usage
  duration: number | null   // audio duration in seconds (after metadata loads)
}

Configuring Voices

// Each call below is an independent alternative; the distinct names avoid
// redeclaring the same constant if you paste more than one.

// Neural voice (default): natural sounding
const neuralTts = usePollyTextToSpeech({
  config,
  voice: { voiceId: 'Matthew', engine: 'neural' },
})

// Generative voice: most expressive
const generativeTts = usePollyTextToSpeech({
  config,
  voice: { voiceId: 'Ruth', engine: 'generative' },
})

// Long-form voice: optimised for articles/stories
const longFormTts = usePollyTextToSpeech({
  config,
  voice: { voiceId: 'Danielle', engine: 'long-form' },
})

// Spanish voice
const spanishTts = usePollyTextToSpeech({
  config,
  voice: { voiceId: 'Lupe', engine: 'neural', languageCode: 'es-US' },
})

Using SSML

For fine-grained control over speech output, use the built-in ssml builder:

import { usePollyTextToSpeech, ssml } from '@gdnaio/react-polly-text-to-speech'

function SsmlExample() {
  const { speak } = usePollyTextToSpeech({ config })

  const handleSpeak = () => {
    const text = ssml.speak(
      ssml.sentence('Hello there!') +
      ssml.pause('500ms') +
      ssml.emphasis('This is really important.', 'strong') +
      ssml.pause('300ms') +
      ssml.prosody('And this part is spoken slowly.', { rate: 'slow' }) +
      ssml.pause('200ms') +
      ssml.whisper('This is a secret.')
    )

    speak(text, 'ssml')
  }

  return <button onClick={handleSpeak}>Speak with SSML</button>
}

SSML Builder Methods

| Method | Description | Example |
|--------|-------------|---------|
| ssml.speak(content) | Wrap in <speak> root | ssml.speak('Hello') |
| ssml.pause(time) | Insert a break | ssml.pause('500ms') |
| ssml.emphasis(text, level) | Emphasise text | ssml.emphasis('wow', 'strong') |
| ssml.prosody(text, opts) | Control rate/pitch/volume | ssml.prosody('slow', { rate: 'slow' }) |
| ssml.paragraph(text) | Paragraph with natural pause | ssml.paragraph('First para.') |
| ssml.sentence(text) | Sentence boundary | ssml.sentence('A sentence.') |
| ssml.sayAs(text, type) | Interpret as date/number/etc | ssml.sayAs('2025', 'cardinal') |
| ssml.phoneme(text, ph) | Phonemic pronunciation | ssml.phoneme('pecan', 'pɪˈkɑːn') |
| ssml.sub(text, alias) | Substitution | ssml.sub('AWS', 'Amazon Web Services') |
| ssml.lang(text, lang) | Switch language mid-speech | ssml.lang('Bonjour', 'fr-FR') |
| ssml.whisper(text) | Whispering voice | ssml.whisper('secret') |
| ssml.amazonEffect(text, name) | Polly-specific effects | ssml.amazonEffect('news', 'drc') |
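SSML is XML, so characters such as `&` and `<` in user-supplied text have to be escaped before they are placed inside SSML markup. Whether the `ssml` builder escapes its input is not stated in this README, so the sketch below assumes it does not. `escapeSsml` is a hypothetical helper, and `wrapSpeak` is a local stand-in for `ssml.speak()` so the example runs on its own.

```typescript
// Hypothetical helper: escapes the five characters that are reserved in XML/SSML.
function escapeSsml(text: string): string {
  return text
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Local stand-in for ssml.speak(): wraps content in the <speak> root element.
function wrapSpeak(content: string): string {
  return `<speak>${content}</speak>`;
}

const userText = 'Tom & Jerry <3';
const markup = wrapSpeak(escapeSsml(userText));
console.log(markup); // <speak>Tom &amp; Jerry &lt;3</speak>
```

If the builder turns out to escape text itself, applying a helper like this as well would double-escape, so it is worth checking the generated markup once before relying on either behaviour.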

Voice Discovery

Browse available voices with the included catalogue:

import { getVoicesByLanguage, getVoiceInfo, POLLY_VOICES } from '@gdnaio/react-polly-text-to-speech'

// Get all English (US) voices
const usVoices = getVoicesByLanguage('en-US')
// → [{ voiceId: 'Joanna', name: 'Joanna', gender: 'Female', engines: [...] }, ...]

// Look up a specific voice
const info = getVoiceInfo('Matthew')
// → { voiceId: 'Matthew', name: 'Matthew', gender: 'Male', engines: ['neural', 'standard'] }

// Access the full catalogue
console.log(Object.keys(POLLY_VOICES))
// → ['en-US', 'en-GB', 'en-AU', 'en-IN', 'es-US', 'es-ES', 'fr-FR', ...]
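Because each catalogue entry lists its supported engines, a voice picker can be filtered by engine before anything is sent to Polly. The sketch below assumes entries shaped like the output shown above; the `VoiceInfo` type, the sample data (copied from the en-US table in this README), and `voicesForEngine` are all local to the example.

```typescript
type PollyEngine = 'standard' | 'neural' | 'long-form' | 'generative';

// Assumed entry shape, mirroring the getVoiceInfo() output shown above.
interface VoiceInfo {
  voiceId: string;
  name: string;
  gender: 'Female' | 'Male';
  engines: PollyEngine[];
}

// Sample entries copied from the en-US table in this README.
const sampleVoices: VoiceInfo[] = [
  { voiceId: 'Joanna', name: 'Joanna', gender: 'Female', engines: ['neural', 'standard', 'long-form'] },
  { voiceId: 'Matthew', name: 'Matthew', gender: 'Male', engines: ['neural', 'standard'] },
  { voiceId: 'Ruth', name: 'Ruth', gender: 'Female', engines: ['neural', 'long-form', 'generative'] },
];

// Hypothetical helper: ids of the voices that list the requested engine.
function voicesForEngine(voices: VoiceInfo[], engine: PollyEngine): string[] {
  return voices
    .filter((voice) => voice.engines.indexOf(engine) !== -1)
    .map((voice) => voice.voiceId);
}

console.log(voicesForEngine(sampleVoices, 'generative')); // [ 'Ruth' ]
```

In an application the array would come from `getVoicesByLanguage()` rather than hard-coded sample data.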

Using `ref` with Your Own Audio Player

The hook does not play audio internally. It synthesises audio and returns audioUrl — you render the <audio> element and pass the ref callback so the hook can track playing, duration, and stop() state.

function CustomPlayer() {
  const { speak, stop, ref, audioUrl, playing, duration, loading } = usePollyTextToSpeech({ config })

  return (
    <div>
      <button onClick={() => speak('Hello world')} disabled={loading}>
        Generate Audio
      </button>
      {playing && <button onClick={stop}>Stop</button>}

      {/* ref is required — without it, playing/duration/stop() won't work */}
      {audioUrl && <audio ref={ref} autoPlay controls src={audioUrl} />}

      {duration !== null && <p>Duration: {duration.toFixed(1)}s</p>}
    </div>
  )
}

Why `ref` matters

The ref callback is how the hook connects to the actual <audio> DOM element. Without it:

playing will always be false
duration will always be null
stop() will have no effect (there's no element to pause)

// WRONG — hook can't track the audio element
{audioUrl && <audio autoPlay controls src={audioUrl} />}

// CORRECT — hook is connected to the element
{audioUrl && <audio ref={ref} autoPlay controls src={audioUrl} />}
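To make the dependency on the element concrete, here is a rough sketch of how a callback ref can track playback through DOM events. It is an illustration of the general technique, not the package's actual implementation: `AudioLike` is the minimal slice of `HTMLAudioElement` the sketch relies on, and `createAudioTracker` is a hypothetical name.

```typescript
// Minimal slice of HTMLAudioElement used by this sketch.
interface AudioLike {
  duration: number;
  pause(): void;
  addEventListener(type: string, listener: () => void): void;
  removeEventListener(type: string, listener: () => void): void;
}

interface Tracker {
  ref: (el: AudioLike | null) => void;
  stop: () => void;
  state: { playing: boolean; duration: number | null };
}

// Hypothetical factory: the returned `ref` plays the role of the hook's callback ref.
function createAudioTracker(): Tracker {
  let current: AudioLike | null = null;
  const state: Tracker['state'] = { playing: false, duration: null };

  const onPlay = () => { state.playing = true; };
  const onStop = () => { state.playing = false; };
  const onMeta = () => { if (current) state.duration = current.duration; };

  return {
    state,
    ref(el) {
      // Detach from the previous element and reset the derived state.
      if (current) {
        current.removeEventListener('play', onPlay);
        current.removeEventListener('pause', onStop);
        current.removeEventListener('ended', onStop);
        current.removeEventListener('loadedmetadata', onMeta);
        state.playing = false;
        state.duration = null;
      }
      current = el;
      // Attach to the new element, if there is one.
      if (el) {
        el.addEventListener('play', onPlay);
        el.addEventListener('pause', onStop);
        el.addEventListener('ended', onStop);
        el.addEventListener('loadedmetadata', onMeta);
      }
    },
    stop() {
      // Without an attached element there is nothing to pause.
      current?.pause();
    },
  };
}
```

With no element attached, no listener ever fires and `stop()` has nothing to act on, which matches the three symptoms listed above.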

Using with Vite

const { speak } = usePollyTextToSpeech({
  config: {
    region: import.meta.env.VITE_AWS_REGION,
    identityPoolId: import.meta.env.VITE_AWS_IDENTITY_POOL_ID,
    userPoolId: import.meta.env.VITE_AWS_USER_POOL_ID,
    idToken: token, // from your auth hook
  },
  voice: { voiceId: 'Joanna', engine: 'neural' },
  audio: { format: 'mp3' },
})

Using with Next.js

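This hook relies on browser-only APIs such as URL.createObjectURL, so it must run in a Client Component. Mark the file with 'use client' as shown below. If the component has to be excluded from server rendering entirely, load it with next/dynamic and { ssr: false }.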

// components/TtsButton.tsx
'use client'

import { usePollyTextToSpeech } from '@gdnaio/react-polly-text-to-speech'

export function TtsButton({ idToken }: { idToken: string }) {
  const { speak, ref, audioUrl, loading } = usePollyTextToSpeech({
    config: {
      region: process.env.NEXT_PUBLIC_AWS_REGION!,
      identityPoolId: process.env.NEXT_PUBLIC_AWS_IDENTITY_POOL_ID!,
      userPoolId: process.env.NEXT_PUBLIC_AWS_USER_POOL_ID!,
      idToken,
    },
  })

  return (
    <div>
      <button onClick={() => speak('Hello from Next.js')} disabled={loading}>
        Speak
      </button>
      {audioUrl && <audio ref={ref} autoPlay controls src={audioUrl} />}
    </div>
  )
}

Token Retrieval

The hook needs a Cognito ID token (not access token). Common patterns:

// @gdnaio/cognito-auth
const { getIdToken } = useAuth()
const token = await getIdToken()

// AWS Amplify v6
import { fetchAuthSession } from 'aws-amplify/auth'
const { tokens } = await fetchAuthSession()
const token = tokens?.idToken?.toString()

// amazon-cognito-identity-js
cognitoUser.getSession((err, session) => {
  if (err || !session) return // no valid session: ask the user to sign in again
  const token = session.getIdToken().getJwtToken()
})

Audio Output Formats

| Format | MIME Type | Use Case |
|--------|-----------|----------|
| mp3 (default) | audio/mpeg | Best browser compatibility, small file size |
| ogg_vorbis | audio/ogg | Open format, good quality-to-size ratio |
| pcm | audio/pcm | Raw audio for processing pipelines |
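Raw PCM has no container, so an `<audio>` element typically cannot play it directly. Polly's `pcm` output is 16-bit signed, little-endian, mono, which means it can be made playable by prepending a 44-byte WAV header. The sketch below shows one way to do that; `pcmToWav` is a hypothetical helper, and the `sampleRate` argument must match the rate requested from Polly.

```typescript
// Hypothetical helper: wraps 16-bit mono little-endian PCM bytes in a WAV container.
function pcmToWav(pcm: Uint8Array, sampleRate: number): Uint8Array {
  const channels = 1;
  const bitsPerSample = 16;
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame

  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.byteLength, true); // file size minus the first 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, 'data');
  view.setUint32(40, pcm.byteLength, true); // size of the sample data

  const wav = new Uint8Array(44 + pcm.byteLength);
  wav.set(new Uint8Array(header), 0);
  wav.set(pcm, 44);
  return wav;
}
```

The result can be wrapped in a `Blob` with type `audio/wav` and passed to `URL.createObjectURL()` for playback. For ordinary playback, `mp3` avoids this step altogether.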

Supported Voices

The built-in POLLY_VOICES catalogue includes the following voices. You can pass any valid Polly VoiceId directly to the hook — the catalogue is a convenience helper, not a restriction.

For the full and most up-to-date list of all voices, engines, and languages, see the Amazon Polly Voice List.

English (US) — `en-US`

| Voice | Gender | Engines |
| --- | --- | --- |
| Joanna | Female | neural, standard, long-form |
| Matthew | Male | neural, standard |
| Ruth | Female | neural, long-form, generative |
| Stephen | Male | neural, long-form, generative |
| Danielle | Female | neural, long-form, generative |
| Gregory | Male | neural, long-form, generative |
| Ivy | Female | neural, standard |
| Kendra | Female | neural, standard |
| Kimberly | Female | neural, standard |
| Salli | Female | neural, standard |
| Joey | Male | neural, standard |
| Justin | Male | neural, standard |
| Kevin | Male | neural, standard |

English (GB) — `en-GB`

| Voice | Gender | Engines |
| --- | --- | --- |
| Amy | Female | neural, standard |
| Emma | Female | neural, standard |
| Brian | Male | neural, standard |
| Arthur | Male | neural |

English (AU) — `en-AU`

| Voice | Gender | Engines |
| --- | --- | --- |
| Olivia | Female | neural |
| Nicole | Female | standard |
| Russell | Male | standard |

English (IN) — `en-IN`

| Voice | Gender | Engines |
| --- | --- | --- |
| Kajal | Female | neural |
| Aditi | Female | standard |
| Raveena | Female | standard |

Spanish (US) — `es-US`

| Voice | Gender | Engines |
| --- | --- | --- |
| Lupe | Female | neural, standard |
| Pedro | Male | neural |
| Penelope | Female | standard |
| Miguel | Male | standard |

Spanish (ES) — `es-ES`

| Voice | Gender | Engines |
| --- | --- | --- |
| Lucia | Female | neural, standard |
| Sergio | Male | neural |
| Enrique | Male | standard |
| Conchita | Female | standard |

Spanish (MX) — `es-MX`

| Voice | Gender | Engines |
| --- | --- | --- |
| Mia | Female | standard |
| Andres | Male | neural |

French (FR) — `fr-FR`

| Voice | Gender | Engines |
| --- | --- | --- |
| Léa | Female | neural, standard |
| Rémi | Male | neural |
| Mathieu | Male | standard |
| Céline | Female | standard |

French (CA) — `fr-CA`

| Voice | Gender | Engines |
| --- | --- | --- |
| Gabrielle | Female | neural |
| Chantal | Female | standard |

German — `de-DE`

| Voice | Gender | Engines |
| --- | --- | --- |
| Vicki | Female | neural, standard |
| Daniel | Male | neural |
| Hans | Male | standard |
| Marlene | Female | standard |

Italian — `it-IT`

| Voice | Gender | Engines |
| --- | --- | --- |
| Bianca | Female | neural, standard |
| Adriano | Male | neural |
| Carla | Female | standard |
| Giorgio | Male | standard |

Portuguese (BR) — `pt-BR`

| Voice | Gender | Engines |
| --- | --- | --- |
| Camila | Female | neural, standard |
| Vitória | Female | neural, standard |
| Thiago | Male | neural |
| Ricardo | Male | standard |

Japanese — `ja-JP`

| Voice | Gender | Engines |
| --- | --- | --- |
| Kazuha | Female | neural, long-form |
| Tomoko | Female | neural, long-form |
| Takumi | Male | neural, standard |
| Mizuki | Female | standard |

Korean — `ko-KR`

| Voice | Gender | Engines |
| --- | --- | --- |
| Seoyeon | Female | neural, standard |

Chinese (Mandarin) — `cmn-CN`

| Voice | Gender | Engines |
| --- | --- | --- |
| Zhiyu | Female | neural, standard |

Hindi — `hi-IN`

| Voice | Gender | Engines |
| --- | --- | --- |
| Kajal | Female | neural |
| Aditi | Female | standard |

Arabic (UAE) — `ar-AE`

| Voice | Gender | Engines |
| --- | --- | --- |
| Hala | Female | neural |
| Zayd | Male | neural |

Arabic (Standard) — `arb`

| Voice | Gender | Engines |
| --- | --- | --- |
| Zeina | Female | standard |

Browser Support

| Browser | Minimum Version |
|---------|----------------|
| Chrome | 66+ |
| Firefox | 76+ |
| Safari | 14.1+ |
| Edge | 79+ |

Gotchas and Common Mistakes

1. Forgetting `ref` on the `<audio>` element

This is the most common mistake. Without ref={ref}, the hook has no connection to the DOM audio element. playing, duration, and stop() will all be broken. Always pass the ref.

2. Multiple components = one shared client

The PollyClient and AWS credentials are cached at module level. If you render multiple usePollyTextToSpeech instances (e.g. one per chat message), they all share the same client. The Cognito GetId + GetCredentialsForIdentity API calls only happen once — every subsequent speak() call goes directly to Polly.
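The general pattern behind this behaviour is a cache that lives at module scope rather than inside the hook. The sketch below illustrates the idea only: how the package actually keys or stores its client is not documented here, so the key format and the `getOrCreate` helper are assumptions.

```typescript
// Module-level cache: it outlives any single component or hook instance.
const clientCache = new Map<string, unknown>();

// Hypothetical helper: creates the value on first use, then reuses it.
function getOrCreate<T>(key: string, create: () => T): T {
  if (!clientCache.has(key)) {
    clientCache.set(key, create());
  }
  return clientCache.get(key) as T;
}

// Stand-in for an expensive client constructor; counts how often it runs.
let created = 0;
const makeClient = () => ({ id: ++created });

const first = getOrCreate('us-east-1|pool-1', makeClient);
const second = getOrCreate('us-east-1|pool-1', makeClient);
console.log(first === second, created); // true 1
```

One consequence of module-level state is that it persists until the page reloads, which is worth keeping in mind when a different user signs in within the same session.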

3. Pass `idToken` as state, not a stale string

The hook needs a valid Cognito ID token. If you pass an empty string or a stale token, the credential provider will fail. Fetch the token asynchronously and pass it via React state:

// WRONG — token is empty on first render, hook will error
const token = '' // or some stale value
const { speak } = usePollyTextToSpeech({ config: { ...rest, idToken: token } })

// CORRECT — fetch token, set state, then call speak()
const [idToken, setIdToken] = useState('')

useEffect(() => {
  getIdToken().then(setIdToken)
}, [])

const { speak } = usePollyTextToSpeech({ config: { ...rest, idToken } })

// Only call speak() after idToken is set
useEffect(() => {
  if (idToken) speak('Hello')
}, [idToken])
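A stale token can be detected before it reaches the hook, because a Cognito ID token is a JWT whose payload carries an `exp` claim in seconds since the Unix epoch. The sketch below decodes that claim without verifying the signature, which is enough for a freshness check on the client. `isTokenExpired` and its 30-second clock-skew allowance are assumptions made for this example.

```typescript
// Hypothetical helper: true when the JWT's `exp` claim is in the past or unreadable.
function isTokenExpired(
  jwt: string,
  nowMs: number = Date.now(),
  skewSeconds: number = 30,
): boolean {
  const payloadPart = jwt.split('.')[1];
  if (!payloadPart) return true; // not a three-part JWT

  try {
    // JWT segments are base64url encoded; convert to standard base64 first.
    const base64 = payloadPart.replace(/-/g, '+').replace(/_/g, '/');
    const padded = base64 + '='.repeat((4 - (base64.length % 4)) % 4);
    const payload = JSON.parse(atob(padded)) as { exp?: number };

    if (typeof payload.exp !== 'number') return true;
    // Treat tokens that expire within the skew window as already expired.
    return payload.exp * 1000 <= nowMs + skewSeconds * 1000;
  } catch {
    return true; // malformed payload: treat the token as unusable
  }
}
```

A component could call this before `speak()` and refresh the token through its auth provider when it returns `true`.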

4. Conditionally rendered `<audio>` loses the ref

If your <audio> element is inside a conditional ({audioUrl && <audio ref={ref} ... />}), the ref detaches when audioUrl becomes null (e.g. on a new speak() call). This is expected — the hook handles ref attach/detach cleanly.

5. SSML text must use `textType: 'ssml'`

If your text contains SSML tags like <break> or <prosody>, you must pass 'ssml' as the second argument to speak(). Otherwise Polly will read the tags as literal text.

// WRONG — tags read out loud as text
speak('<speak>Hello <break time="500ms"/> world</speak>')

// CORRECT
speak('<speak>Hello <break time="500ms"/> world</speak>', 'ssml')

Error Handling

The hook catches errors and exposes them via the error state. It never throws. Common scenarios:

Invalid credentials: expired token or misconfigured Identity Pool
Invalid voice/engine combo: e.g. using the long-form engine with a voice that doesn't support it
Text too long: SynthesizeSpeech accepts up to 3,000 billed characters (6,000 characters in total); SSML tags do not count as billed characters
Autoplay blocked: some browsers block audio.play() without user interaction
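For the "text too long" case, longer plain-text input can be split into chunks that each stay under the limit and then synthesised one after another. The following is a minimal sketch that prefers sentence boundaries and falls back to a hard cut. `splitText` is a hypothetical helper, and it measures length in UTF-16 code units, which only approximates how Polly counts characters.

```typescript
// Hypothetical helper: splits plain text into chunks of at most maxChars.
function splitText(text: string, maxChars: number = 3000): string[] {
  // A "sentence" is a run of text ending in . ! or ? plus trailing whitespace;
  // the second alternative catches a final run with no closing punctuation.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = '';

  for (const sentence of sentences) {
    // Flush the current chunk if adding this sentence would exceed the limit.
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = '';
    }
    let rest = sentence;
    // A single sentence longer than the limit is cut at the limit.
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current += rest;
  }

  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

console.log(splitText('One. Two. Three.', 10)); // [ 'One. Two.', 'Three.' ]
```

Each chunk can then be passed to `speak()` in turn, for example by starting the next chunk once the previous one has finished playing. SSML input would need a tag-aware splitter, which this sketch does not attempt.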

License

MIT
