@gravityai-dev/aws-nova
v1.2.3
AWS Nova integration for Gravity workflow system
Multimodal speech generation with AWS Nova Sonic - Support for both text-to-speech and speech-to-speech with real-time streaming.
Features
- 🎤 Speech-to-Speech - Process audio input and generate speech responses
- 🔤 Text-to-Speech - Convert text to natural-sounding speech
- 🌊 Real-time Streaming - Stream audio chunks as they're generated
- 🔧 Tool Integration - Support for function calling and RAG tools
- 🎭 Multiple Voices - 11 different voices with accent support
- 📊 Usage Analytics - Token counting and performance metrics
- 🔄 Redis Integration - Real-time audio publishing to Redis streams
- 🔇 Audio Feedback Prevention - Automatic microphone muting during playback
- 🛡️ Error Recovery - Automatic cleanup events for robust session management
- 🎵 Smooth Playback - Server and client-side audio buffering
Quick Start
Installation
npm install @gravityai-dev/aws-nova
Basic Usage
import { NovaSpeechService } from '@gravityai-dev/aws-nova';
const service = new NovaSpeechService();
// Text-to-speech
const result = await service.generateSpeechStream({
systemPrompt: "You are a helpful assistant.",
voice: "tiffany",
temperature: 0.7,
redisChannel: "audio-stream"
}, metadata, context);
console.log(`Generated ${result.chunk_count} audio chunks`);
Speech-to-Speech
// Process audio input (Base64 encoded PCM)
const result = await service.generateSpeechStream({
audioInput: base64AudioData, // 16kHz, 16-bit, mono PCM
systemPrompt: "Respond conversationally",
voice: "matthew"
}, metadata, context);
console.log("Transcription:", result.transcription);
console.log("Response:", result.assistantResponse);
Available Voices
Choose from 11 different voices with regional accents:
| Voice | Gender | Accent | Description |
|-------|--------|--------|-------------|
| tiffany | Female | American | Default, clear and professional |
| matthew | Male | American | Warm and conversational |
| amy | Female | British | Friendly and approachable |
| ambre | Female | French | Elegant French accent |
| florian | Male | French | Professional French accent |
| beatrice | Female | Italian | Melodic Italian accent |
| lorenzo | Male | Italian | Rich Italian accent |
| greta | Female | German | Clear German accent |
| lennart | Male | German | Distinctive German accent |
| lupe | Female | Spanish | Vibrant Spanish accent |
| carlos | Male | Spanish | Confident Spanish accent |
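Because the voice option is a plain string, a small guard can catch typos before a session starts. The sketch below is an illustration only: `NOVA_VOICES`, `isNovaVoice`, and `resolveVoice` are hypothetical helpers, not exports of this package, and the list of names is copied from the table above.

```typescript
// Hypothetical helper (not part of the package); names copied from the voice table.
const NOVA_VOICES = [
  "tiffany", "matthew", "amy", "ambre", "florian", "beatrice",
  "lorenzo", "greta", "lennart", "lupe", "carlos",
] as const;

type NovaVoice = (typeof NOVA_VOICES)[number];

// Type guard: true only for one of the documented voice names.
function isNovaVoice(value: string): value is NovaVoice {
  return (NOVA_VOICES as readonly string[]).indexOf(value) !== -1;
}

// Normalizes case and whitespace, falling back to the documented default voice.
function resolveVoice(value?: string): NovaVoice {
  const candidate = (value ?? "").trim().toLowerCase();
  return isNovaVoice(candidate) ? candidate : "tiffany";
}
```

Falling back to "tiffany" mirrors the documented default; a stricter caller might prefer to throw on unknown names instead.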
Configuration Options
interface NovaSpeechConfig {
systemPrompt?: string; // System instructions
audioInput?: string; // Base64 PCM audio (16kHz, 16-bit, mono)
voice?: string; // Voice selection (default: "tiffany")
temperature?: number; // Generation creativity (0-1)
topP?: number; // Nucleus sampling (0-1)
redisChannel?: string; // Redis channel for audio streaming
conversationHistory?: Array<{role: string, content: string}>;
toolResponse?: any[]; // Tool/RAG responses
controlSignal?: string; // "START_CALL" | "STOP"
}
Real-time Audio Streaming Architecture
Control Flow vs Data Flow Separation
The Nova Speech integration uses a clean separation between control signals and audio data:
Control Signals (via GraphQL)
- START_CALL - Initiates workflow and Nova session
- END_CALL - Terminates session and cleanup
Data Flow (Direct Redis Streaming)
- Audio chunks stream directly to the Redis Audio-Stream channel
- Bypasses GraphQL for real-time performance
- No buffering overhead - immediate streaming
Streaming Sequence
1. Client → GraphQL: START_CALL
2. Server → Workflow → Nova: Initialize session
3. Nova → Redis: Publishes AUDIO_SESSION_READY
4. Client: Waits for AUDIO_SESSION_READY signal
5. Client → Redis: Streams audio chunks directly
6. Nova → Redis: Publishes audio responses
7. Client → GraphQL: END_CALL (when done)
Audio Channel Format
// Incoming audio to Nova (Audio-Stream channel)
{
chatId: string,
nodeId: `${nodeId}-${workflowId}`, // e.g., "awsnovaspeech1-workflow_123"
workflowId: string,
audioData: string, // Base64 PCM audio
action: "SEND_AUDIO" | "START_AUDIO_SEGMENT" | "END_AUDIO_SEGMENT",
timestamp: number
}
// Outgoing audio from Nova (AI Results channel)
{
audioData: string, // Base64 MP3 audio
format: "mp3",
metadata: {
audioState: AudioState,
chatId: string,
workflowId: string
}
}
Client Implementation
// 1. Start call via GraphQL
const response = await talkToAgent({
message: "START_CALL",
isAudio: true,
metadata: {
action: "START_CALL",
workflowId: "workflow_123",
nodeId: "awsnovaspeech1"
}
});
// 2. Subscribe to audio state updates
redis.subscribe('gravity:output');
redis.on('message', (channel, message) => {
const data = JSON.parse(message);
if (data.metadata?.audioState === 'AUDIO_SESSION_READY') {
// 3. Start streaming audio
startAudioStreaming();
}
});
// 4. Stream audio directly to Redis
function streamAudioChunk(audioData: string) {
redis.publish('Audio-Stream', JSON.stringify({
chatId,
nodeId: `awsnovaspeech1-${workflowId}`,
workflowId,
audioData,
action: "SEND_AUDIO",
timestamp: Date.now()
}));
}
Key Architectural Decisions
Why wait for AUDIO_SESSION_READY?
- Nova needs time to establish bidirectional stream with AWS
- Audio sent before ready signal will be lost
- Ensures reliable audio delivery
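One way to enforce this on the client is a small gate that refuses to forward audio until the ready signal has been seen. This is a minimal sketch under assumptions: `AudioSessionGate` is a hypothetical class, and the `send` callback stands in for whatever actually publishes to Redis.

```typescript
type GateState = "idle" | "starting" | "ready" | "ended";

// Hypothetical client-side gate; not part of the package.
class AudioSessionGate {
  private state: GateState = "idle";

  constructor(private readonly send: (audioData: string) => void) {}

  // Call once START_CALL has been sent over GraphQL.
  markStarting(): void {
    this.state = "starting";
  }

  // Feed every parsed message from the output channel through this method.
  handleOutput(message: { metadata?: { audioState?: string } }): void {
    const ready = message.metadata?.audioState === "AUDIO_SESSION_READY";
    if (this.state === "starting" && ready) {
      this.state = "ready";
    }
  }

  // Call once END_CALL has been sent.
  markEnded(): void {
    this.state = "ended";
  }

  // Returns true if the chunk was forwarded, false if it was dropped.
  trySend(audioData: string): boolean {
    if (this.state !== "ready") return false;
    this.send(audioData);
    return true;
  }
}
```

Dropping early chunks, rather than queueing them, keeps the client consistent with the no-buffering design described in this section.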
Why separate control and data planes?
- GraphQL for control = reliable state management
- Redis for audio = real-time performance
- Clean separation of concerns
Why use composite nodeId-workflowId?
- Supports multiple workflows running simultaneously
- Enables multiple Nova nodes per workflow
- Stable nodeId allows direct routing without discovery
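The composite identifier can be built in one place so that every message on the Audio-Stream channel uses the same shape. The helpers below are a sketch: `compositeNodeId` and `buildAudioMessage` are hypothetical names, while the field names follow the channel format shown earlier.

```typescript
// Message shape taken from the "Incoming audio to Nova" format above.
interface AudioStreamMessage {
  chatId: string;
  nodeId: string; // composite: `${nodeId}-${workflowId}`
  workflowId: string;
  audioData: string; // Base64 PCM audio
  action: "SEND_AUDIO" | "START_AUDIO_SEGMENT" | "END_AUDIO_SEGMENT";
  timestamp: number;
}

// Hypothetical helper: joins the stable node ID with the workflow ID.
function compositeNodeId(nodeId: string, workflowId: string): string {
  return `${nodeId}-${workflowId}`;
}

// Hypothetical helper: builds a SEND_AUDIO message ready for JSON.stringify.
function buildAudioMessage(
  chatId: string,
  nodeId: string,
  workflowId: string,
  audioData: string,
  now: number = Date.now(),
): AudioStreamMessage {
  return {
    chatId,
    nodeId: compositeNodeId(nodeId, workflowId),
    workflowId,
    audioData,
    action: "SEND_AUDIO",
    timestamp: now,
  };
}
```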
Why no buffering?
- Reduces latency for real-time conversation
- Simplifies client implementation
- Nova handles internal buffering as needed
Tool Integration
Support for function calling and RAG tools:
const result = await service.generateSpeechStream({
systemPrompt: "You can search for information when needed.",
toolResponse: [
{ title: "Weather", content: "Sunny, 75°F" }
]
}, metadata, context);
Architecture
Clean, modular architecture with focused components:
├── index.ts # Clean entry point (47 lines)
├── orchestration/
│ └── SessionOrchestrator.ts # Main business logic
├── client/
│ └── BedrockClientFactory.ts # AWS client creation
├── audio/
│ └── AudioBufferManager.ts # Audio buffering & Redis
├── processing/
│ ├── EventParser.ts # Event parsing & validation
│ ├── TextAccumulator.ts # Transcription separation
│ └── UsageStatsCollector.ts # Metrics collection
├── errors/
│ └── AwsErrorHandler.ts # Centralized error handling
└── stream/ # Core streaming components
Requirements
- Node.js 16+
- AWS credentials with Bedrock access
- Redis server (for audio streaming)
Error Handling
The package includes comprehensive error handling:
- ValidationException - Invalid event sequences
- ModelStreamErrorException - Nova model errors
- Timeout Protection - Automatic recovery from timeouts
- Centralized Logging - Detailed error reporting
- Session Recovery - Automatic cleanup events (contentEnd, promptEnd, sessionEnd) on errors
- Startup Cleanup - Sends cleanup sequence on START_CALL to ensure clean state
Advanced Usage
Conversation History
const result = await service.generateSpeechStream({
conversationHistory: [
{ role: "user", content: "Hello" },
{ role: "assistant", content: "Hi there!" }
],
systemPrompt: "Continue the conversation naturally"
}, metadata, context);
Audio Format Requirements
Input Audio:
- Format: Raw PCM (not WAV)
- Sample Rate: 16kHz
- Bit Depth: 16-bit signed little-endian
- Channels: Mono
- Encoding: Base64
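Browser and Web Audio sources usually deliver floating-point samples, so they need converting to this layout before encoding. The following is a minimal sketch, assuming the input is already mono and already sampled at 16 kHz; `floatTo16BitPcm` and `pcmDurationSeconds` are hypothetical helper names, and the Base64 step is left to the caller (for example via Node's `Buffer`).

```typescript
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit

// Converts samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
function floatTo16BitPcm(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * BYTES_PER_SAMPLE);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Negative and positive ranges differ by one step in two's complement.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * BYTES_PER_SAMPLE, Math.round(scaled), true); // true = little-endian
  }
  return out;
}

// Duration of a raw PCM buffer in the format above.
function pcmDurationSeconds(byteLength: number): number {
  return byteLength / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
}
```

One second of audio in this format is 32,000 bytes before Base64 encoding, which is a useful sanity check when debugging input that Nova rejects.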
Output Audio:
- Format: MP3
- Encoding: Base64
- Streaming: Real-time chunks via Redis
Response Format
interface NovaSpeechStats {
estimated: boolean;
total_tokens: number;
inputTokens: number;
outputTokens: number;
chunk_count: number;
textOutput: string; // Full text response
transcription: string; // User's speech transcription
assistantResponse: string; // Assistant's text response
audioOutput?: string; // Base64 MP3 audio
}
Recent Improvements
Audio Feedback Prevention
- Microphone automatically mutes when Nova starts speaking
- Prevents echo loops where Nova hears its own output
- DirectStreamingAudioPlayer notifies immediately on first audio chunk
Session State Management
- Sends full cleanup sequence on START_CALL to ensure clean state
- Prevents Nova from getting stuck in bad state from previous errors
- Follows AWS best practices for error recovery
Audio Streaming Reliability
- Server buffers small 2.5KB chunks into 10KB chunks
- Client waits for 3+ chunks before processing
- Results in smoother playback without gaps
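The server-side buffering step can be pictured as a coalescer that holds small chunks until a size threshold is reached. This is only an illustrative sketch, not the package's AudioBufferManager: `ChunkCoalescer` is a hypothetical class, and it assumes chunks can simply be concatenated byte-wise.

```typescript
// Hypothetical coalescer; the 10 KB default follows the figure quoted above.
class ChunkCoalescer {
  private pending: Uint8Array[] = [];
  private pendingBytes = 0;

  constructor(private readonly targetBytes: number = 10 * 1024) {}

  // Returns a merged chunk once enough bytes are pending, otherwise null.
  push(chunk: Uint8Array): Uint8Array | null {
    this.pending.push(chunk);
    this.pendingBytes += chunk.length;
    return this.pendingBytes >= this.targetBytes ? this.flush() : null;
  }

  // Emits whatever is pending (for example at end of turn); null when empty.
  flush(): Uint8Array | null {
    if (this.pendingBytes === 0) return null;
    const merged = new Uint8Array(this.pendingBytes);
    let offset = 0;
    for (const part of this.pending) {
      merged.set(part, offset);
      offset += part.length;
    }
    this.pending = [];
    this.pendingBytes = 0;
    return merged;
  }
}
```

Calling `flush()` when the turn ends avoids stranding a final partial chunk that never reaches the threshold.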
Publishing Pattern
- Uses platform's gravityPublish function consistently
- Fire-and-forget pattern with error logging
- Avoids hanging on Redis operations
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details.
Implementation Notes
Critical Event Sequence
⚠️ promptEnd MUST be sent AFTER Nova's completionEnd response
1. Send input events (sessionStart → promptStart → content)
2. Wait for Nova response
3. Receive: completionStart → audioOutput → contentEnd → completionEnd
4. On completionEnd: Send promptEnd → sessionEnd → close queue
5. On timeout error: Trigger same completion handler
Token Count Understanding
Speech Tokens vs Text Tokens:
- Text Tokens: ~4 chars/token, standard NLP tokenization
- Speech Tokens: ~150 tokens/second, time-based audio segmentation
- System prompt (364 chars) = 22 text tokens is normal
- Audio input (1s @ 16kHz) = ~150 speech tokens is expected
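The speech-token figure can be turned into a quick estimate from the size of the raw input. This is a rough sketch using the approximate rate quoted above; `estimateSpeechTokens` is a hypothetical helper, and it expects the byte length of the decoded PCM, not of the Base64 string.

```typescript
const PCM_BYTES_PER_SECOND = 16000 * 2; // 16 kHz, 16-bit, mono
const SPEECH_TOKENS_PER_SECOND = 150; // approximate figure from the notes above

// Estimates speech tokens for a raw PCM buffer of the given size.
function estimateSpeechTokens(pcmByteLength: number): number {
  const seconds = pcmByteLength / PCM_BYTES_PER_SECOND;
  return Math.round(seconds * SPEECH_TOKENS_PER_SECOND);
}
```

Treat the result as a sanity check against reported usage rather than as a billing figure.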
Audio Format Requirements
Input Audio (from Polly to Nova):
- Format: Raw PCM (NOT WAV)
- Sample Rate: 16kHz
- Bit Depth: 16-bit signed little-endian
- Channels: Mono
- Encoding: Base64
- Chunking: 24KB chunks work well
AWS Nova Speech Service
This service provides integration with AWS Nova Speech (Sonic) for text-to-speech and speech-to-speech capabilities.
Important Implementation Notes
Handling Nova Speech Completion Events
Nova Speech has been observed to be unreliable in sending completionEnd events. To ensure proper session closure and event handling, the implementation uses the following approach:
1. ContentEnd with END_TURN: When a contentEnd event is received with stopReason: "END_TURN", the system automatically synthesizes a completionEnd event and triggers the completion handler.
2. Timeout Handling: When Nova times out waiting for input events (ValidationException), the system:
   - Creates a synthetic contentEnd event with stopReason: "END_TURN"
   - Processes it through the response processor
   - This automatically triggers the completion flow
This approach ensures reliable session closure regardless of whether Nova sends a completionEnd event or not.
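The synthesis step might look roughly like the following. It is a sketch of the idea, not the package's response processor: the `NovaEvent` shape is simplified, and `withSyntheticCompletion`, `timeoutAsContentEnd`, and the `synthetic` flag are hypothetical.

```typescript
// Simplified event shape for illustration; real events carry more fields.
interface NovaEvent {
  event: { [name: string]: { stopReason?: string; synthetic?: boolean } };
}

// Passes the event through, appending a synthetic completionEnd after an END_TURN contentEnd.
function withSyntheticCompletion(incoming: NovaEvent): NovaEvent[] {
  const contentEnd = incoming.event["contentEnd"];
  if (contentEnd !== undefined && contentEnd.stopReason === "END_TURN") {
    return [incoming, { event: { completionEnd: { synthetic: true } } }];
  }
  return [incoming];
}

// On a timeout, a synthetic contentEnd is fed through the same path.
function timeoutAsContentEnd(): NovaEvent {
  return { event: { contentEnd: { stopReason: "END_TURN", synthetic: true } } };
}
```

Routing the timeout through the same function means there is a single code path that ends a session, which pairs naturally with an idempotent completion handler.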
⚠️ IMPORTANT: Nova Sonic Audio-Only Mode ⚠️
Nova Sonic does NOT require or accept user text prompts when processing audio input.
When using audio input:
- ✅ Send: sessionStart → promptStart → systemPrompt → audioInput → (wait for response)
- ❌ Do NOT send: user text events
- The audio itself IS the user input - no text transcription needed
- Nova will process the audio directly without requiring a text representation
🔧 Tool Use with Nova Speech
Overview
Nova Speech supports tool use (function calling) for enhanced capabilities. Tools must be configured in the promptStart event and Nova will automatically request tools when needed.
Tool Configuration
Tools are configured in the promptStart event:
{
event: {
promptStart: {
promptName: "chat_123",
toolConfiguration: {
tools: [{
toolSpec: {
name: "RAGtool",
description: "Retrieves relevant information from a knowledge base",
inputSchema: {
json: '{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}'
}
}
}],
toolChoice: {
tool: { name: "RAGtool" } // Optional: force tool use
}
}
}
}
}
Tool Response Format
Nova expects tool results in a specific format according to AWS documentation:
{
event: {
toolResult: {
promptName: "chat_123",
contentName: "unique-content-id",
content: '[{"json": <your-tool-result>}]', // Stringified array with json property
status: "success" // or "error"
}
}
}
Implementation Pattern
- Configure tools in promptStart - Define available tools
- Intercept Nova's tool requests - Monitor output events for toolUse
- Send tool results - Use Nova's toolUseId in response
- Continue conversation - Nova processes results and continues
Critical Implementation Details
- ID Matching: The toolUseId from Nova's request MUST match in your response
- Event Timing: Tool results should be sent with a small delay (500ms) to ensure Nova is ready
- Format Requirements: Tool results must be wrapped in [{"json": ...}] format
- No Simulation: Never simulate Nova's tool request events - only respond to actual requests
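The wrapping requirement is easy to get wrong by hand, so it helps to build the event in one function. The sketch below follows the toolResult shape shown in this section; `buildToolResultEvent` is a hypothetical helper, not a package export.

```typescript
// Shape taken from the "Tool Response Format" example.
interface ToolResultEvent {
  event: {
    toolResult: {
      promptName: string;
      contentName: string;
      content: string; // stringified [{ json: ... }] array
      status: "success" | "error";
    };
  };
}

// Hypothetical helper: wraps any JSON-serializable result in the required format.
function buildToolResultEvent(
  promptName: string,
  contentName: string,
  result: unknown,
  status: "success" | "error" = "success",
): ToolResultEvent {
  return {
    event: {
      toolResult: {
        promptName,
        contentName,
        content: JSON.stringify([{ json: result }]),
        status,
      },
    },
  };
}
```

For example, `buildToolResultEvent("chat_123", "def-456", [{ title: "Auto Loan", text: "1.99% - 5% rates" }])` yields a `content` string with the `[{"json": ...}]` structure required here.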
Example Tool Flow
// 1. Nova requests a tool
{
event: {
toolUse: {
toolUseId: "abc-123",
toolName: "RAGtool",
input: { query: "auto loan interest rates" }
}
}
}
// 2. Your response
{
event: {
toolResult: {
promptName: "chat_123",
contentName: "def-456",
content: '[{"json":[{"title":"Auto Loan","text":"1.99% - 5% rates"}]}]',
status: "success"
}
}
}
Common Tool Use Errors
- "Tool Response parsing error" - Wrong format, ensure [{"json": ...}] structure
- "Not expecting a tool result" - toolUseId doesn't match Nova's request
- "ValidationException" - Event structure issues or timing problems
Overview
This service provides a robust interface to AWS Nova Sonic for generating high-quality speech from text or audio inputs. It implements bidirectional streaming with AWS Bedrock Runtime, enabling real-time audio generation and delivery.
Architecture
service/
├── index.ts # Clean entry point (47 lines)
├── orchestration/
│ └── SessionOrchestrator.ts # Main business logic
├── client/
│ └── BedrockClientFactory.ts # AWS client creation
├── config/
│ └── SessionConfigBuilder.ts # Configuration & validation
├── status/
│ └── StatusPublisher.ts # Redis status publishing
├── errors/
│ └── AwsErrorHandler.ts # Centralized error handling
├── events/
│ └── EventMetadataProcessor.ts # Metadata utilities
├── audio/
│ └── AudioBufferManager.ts # Audio buffering & Redis
├── processing/
│ ├── EventParser.ts # Event parsing & validation
│ ├── TextAccumulator.ts # Transcription separation
│ └── UsageStatsCollector.ts # Metrics collection
└── stream/ # Core streaming (reduced complexity)
├── SessionManager.ts # Session lifecycle
├── StreamHandler.ts # Stream orchestration
├── EventQueue.ts # Event delivery
└── responseProcessor.ts # Response handling
Key Components
- NovaSpeechService - Clean entry point that delegates to SessionOrchestrator
- SessionOrchestrator - Manages complete session lifecycle and business logic
- AudioBufferManager - Handles audio buffering, timing, and Redis publishing
- TextAccumulator - Separates user transcription from assistant responses
- AwsErrorHandler - Centralized AWS error handling with timeout recovery
Event Flow
The service follows a strict event sequence for Nova API communication:
graph TD
A[sessionStart] --> B[promptStart]
B --> C[contentStart - SYSTEM]
C --> D[textInput - system prompt]
D --> E[contentEnd]
E --> F[contentStart - AUDIO/USER]
F --> G[audioInput - chunks]
G --> H[contentEnd]
H --> I[Wait for Nova Response]
I --> J[Receive completionStart]
J --> K[Receive audio/text output]
K --> L[Receive contentEnd]
L --> M[Receive completionEnd]
M --> N[Send promptEnd]
N --> O[Send sessionEnd]
O --> P[Close queue]
Configuration
NovaSpeechConfig Interface
interface NovaSpeechConfig {
modelId?: string; // Default: "amazon.nova-sonic-v1:0"
systemPrompt?: string; // System instructions
prompt?: string; // User text input
audioInput?: string; // Base64 encoded audio input
voice?: string; // Voice selection (default: "tiffany")
temperature?: number; // Generation temperature (0-1)
maxTokens?: number; // Maximum tokens to generate
topP?: number; // Top-p sampling parameter
redisChannel?: string; // Redis channel for audio publishing
interactive?: boolean; // Interactive mode flag
}
Available Voices
- tiffany - Default female voice
- matthew - Male voice
- amy - Female voice
- ambre - Female voice (French accent)
- florian - Male voice (French accent)
- beatrice - Female voice (Italian accent)
- lorenzo - Male voice (Italian accent)
- greta - Female voice (German accent)
- lennart - Male voice (German accent)
- lupe - Female voice (Spanish accent)
- carlos - Male voice (Spanish accent)
Usage Example
import { NovaSpeechService } from "./services/aws/novaSpeech";
import { getNodeCredentials } from "./credentials";
const service = new NovaSpeechService();
const config: NovaSpeechConfig = {
systemPrompt: "You are a helpful assistant.",
prompt: "Hello, how can I help you today?",
voice: "tiffany",
temperature: 0.7,
maxTokens: 2000,
redisChannel: "audio-stream-channel",
};
const metadata = {
workflowId: "workflow-123",
executionId: "exec-456",
chatId: "chat-789",
conversationId: "conv-abc",
userId: "user-xyz",
};
const context = {
// credential context
};
try {
const result = await service.generateSpeechStream(config, metadata, context);
console.log("Generated audio with", result.chunk_count, "chunks");
console.log("Text output:", result.textOutput);
} catch (error) {
console.error("Speech generation failed:", error);
}
Audio Streaming
The service streams audio output in real-time:
- Audio is generated in chunks by Nova
- Each chunk is Base64 encoded MP3 data
- Chunks are published to Redis with metadata
- Clients can subscribe to Redis channel for real-time playback
Redis Audio Event Structure
{
audioData: string; // Base64 encoded MP3 chunk
format: "mp3";
textReference: string; // Associated text
sourceType: "NovaSpeech";
chatId: string;
conversationId: string;
userId: string;
providerId: string;
workflowId: string;
workflowRunId: string;
redisChannel: string;
index: number; // Chunk sequence number
}
Error Handling
The service implements comprehensive error handling:
- ValidationException - Invalid event order or missing fields
- ModelStreamErrorException - Nova model streaming errors
- InternalServerException - AWS service errors
- Timeout Protection - 30-second response timeout
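Callers that need to react differently to each case can classify errors by name and message. This is a sketch, not the package's AwsErrorHandler: `classifyNovaError` and the `NovaErrorKind` labels are hypothetical, and the timeout check relies on the message text quoted elsewhere in this document.

```typescript
type NovaErrorKind = "timeout" | "validation" | "model-stream" | "internal" | "unknown";

// Hypothetical classifier based on the exception names listed above.
function classifyNovaError(err: { name?: string; message?: string }): NovaErrorKind {
  const name = err.name ?? "";
  const message = err.message ?? "";
  if (name === "ValidationException") {
    // The input timeout surfaces as a ValidationException with this message.
    return message.indexOf("Timed out waiting for input events") !== -1
      ? "timeout"
      : "validation";
  }
  if (name === "ModelStreamErrorException") return "model-stream";
  if (name === "InternalServerException") return "internal";
  return "unknown";
}
```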
Important Implementation Notes
Session Lifecycle Management
Completion Handling:
- Only completionEnd triggers cleanup - contentEnd is informational only
- Idempotent completion handler - Use flags to prevent duplicate execution
- Error handling for timeouts - ValidationException with "Timed out waiting for input events" triggers same cleanup
- Proper cleanup sequence: promptEnd → sessionEnd → queue.close()
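The flag-based idempotency and the cleanup order can be sketched together. The names here are hypothetical: `SessionSink` stands in for whatever sends events and owns the queue, and `CompletionHandler` is not the package's actual class.

```typescript
// Hypothetical abstraction over the event queue used for cleanup.
interface SessionSink {
  sendEvent(name: "promptEnd" | "sessionEnd"): void;
  closeQueue(): void;
}

class CompletionHandler {
  private completed = false;

  constructor(private readonly sink: SessionSink) {}

  // Runs cleanup exactly once; returns false on any repeated call.
  complete(): boolean {
    if (this.completed) return false;
    this.completed = true; // set before sending so re-entrant calls are ignored
    this.sink.sendEvent("promptEnd");
    this.sink.sendEvent("sessionEnd");
    this.sink.closeQueue();
    return true;
  }
}
```

Both the completionEnd path and the timeout path can call `complete()` without coordinating, since only the first call has any effect.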
Event Ordering
- System prompt MUST be the first content block
- Audio input should follow text content blocks
- All content blocks MUST be properly closed with contentEnd
- Small delays (50ms) between audio chunks prevent overwhelming Nova
- CRITICAL: promptEnd/sessionEnd must come AFTER Nova's completionEnd, not after input
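The pacing rule for audio chunks can be sketched as a split followed by a delayed send loop. This is an assumption-laden illustration: `splitIntoChunks` and `sendPaced` are hypothetical helpers, the `send` callback stands in for the real audioInput event sender, and the chunk size is left to the caller.

```typescript
// Splits raw PCM bytes into fixed-size views; use an even size so 16-bit samples stay intact.
function splitIntoChunks(data: Uint8Array, chunkBytes: number): Uint8Array[] {
  if (chunkBytes <= 0) throw new Error("chunkBytes must be positive");
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < data.length; offset += chunkBytes) {
    chunks.push(data.subarray(offset, offset + chunkBytes));
  }
  return chunks;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Sends chunks in order, pausing between them (50 ms by default, per the note above).
async function sendPaced(
  data: Uint8Array,
  chunkBytes: number,
  send: (chunk: Uint8Array) => Promise<void> | void,
  delayMs: number = 50,
): Promise<number> {
  const chunks = splitIntoChunks(data, chunkBytes);
  for (let i = 0; i < chunks.length; i++) {
    await send(chunks[i]);
    if (i < chunks.length - 1) await sleep(delayMs);
  }
  return chunks.length;
}
```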
Event Structure Requirements
- Event Wrapper Required: All events MUST be wrapped with event property
  - ✅ Correct: { event: { sessionStart: {...} } }
  - ❌ Wrong: { sessionStart: {...} }
- Streaming Format: SDK expects AsyncIterable with { chunk: { bytes: Uint8Array } } format
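Both requirements can be handled by a single encoder, so no call site forgets the wrapper. A minimal sketch, assuming events are serialized as UTF-8 JSON; `encodeEvent` and `toChunkStream` are hypothetical names rather than package exports.

```typescript
interface StreamChunk {
  chunk: { bytes: Uint8Array };
}

// Wraps the payload under `event` and encodes it in the chunk format described above.
function encodeEvent(name: string, payload: object): StreamChunk {
  const wrapped = { event: { [name]: payload } };
  return { chunk: { bytes: new TextEncoder().encode(JSON.stringify(wrapped)) } };
}

// Turns an ordered list of events into an async iterable of encoded chunks.
async function* toChunkStream(
  events: Array<{ name: string; payload: object }>,
): AsyncGenerator<StreamChunk> {
  for (const e of events) {
    yield encodeEvent(e.name, e.payload);
  }
}
```

In a live session the iterable would more likely be backed by a queue that stays open until cleanup; the fixed array here just keeps the sketch short.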
Known Issues and Solutions
"Timed out waiting for input events"
- Cause: Nova didn't receive promptEnd/sessionEnd properly
- Solution: Error handler triggers completion callback on timeout
No output generation
- Cause: Sending promptEnd too early (before Nova's response)
- Solution: Only send promptEnd after receiving completionEnd
Audio Format
- Input: 16kHz, 16-bit, mono PCM (when using audio input)
- Output: MP3 format in Base64 encoding
Performance Considerations
- Uses HTTP/2 for efficient streaming
- Audio chunks are split into 4KB pieces
- Implements connection pooling with max 10 concurrent streams
- 5-minute request/session timeout
Debugging
Enable detailed logging:
import { logger } from "./logger";
logger.level = "debug";
Key log points:
- Event creation and sending
- Response event processing
- Audio chunk publishing
- Session state changes
- Error conditions
Dependencies
- @aws-sdk/client-bedrock-runtime - AWS Bedrock client
- @smithy/node-http-handler - HTTP/2 handler
- pino - Logging
- Redis client (for audio publishing)
Future Improvements
- Add support for streaming text input
- Implement conversation history in prompts
- Add voice cloning capabilities
- Support for multiple languages
- Implement caching for repeated requests
- Add metrics and monitoring
- Support for SSML (Speech Synthesis Markup Language)
