gemini-realtime-stream
v1.0.0
Published
Google Gemini AI real-time streaming with audio processing capabilities
Gemini Real-time Stream MCP Server
A Model Context Protocol (MCP) server that provides real-time streaming capabilities with Google's Gemini AI models, including live audio/video processing, function calling, and bidirectional WebSocket communication.
Features
Core Capabilities
- Real-time Streaming: Bidirectional WebSocket communication with Gemini models
- Live Audio Processing: Real-time audio input/output with voice activity detection
- Live Video Processing: Screen capture and video stream processing
- Function Calling: Dynamic tool discovery and execution with JSON schema validation
- Multimodal Support: Text, image, audio, and video input/output processing
- Session Management: Persistent conversation contexts and state management
Available Tools
start_realtime_session
Initialize a real-time streaming session with the Gemini Live API.
Parameters:
- model (string, optional): Gemini model to use (default: "gemini-2.0-flash-exp")
- voice (string, optional): Voice configuration for audio output
- system_instruction (string, optional): System instructions for the model
- tools (array, optional): Available tools for function calling
send_realtime_message
Send a message to an active real-time session.
Parameters:
- session_id (string): Active session identifier
- content (string): Message content to send
- content_type (string, optional): Content type (default: "text")
stream_audio_input
Stream audio input to the real-time session.
Parameters:
- session_id (string): Active session identifier
- audio_data (string): Base64-encoded audio data
- format (string, optional): Audio format (default: "pcm16")
- sample_rate (number, optional): Sample rate in Hz (default: 16000)
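The audio_data parameter expects base64 text, so raw samples have to be packed before they are sent. The following is a minimal sketch, assuming that "pcm16" means signed 16-bit little-endian samples and that the capture source produces floats in the range -1 to 1. The helpers floatToPcm16 and toBase64 are hypothetical and are not part of this package.

```typescript
// Hypothetical helpers (not part of the server): pack float samples as
// 16-bit little-endian PCM, then base64-encode the resulting bytes.

function floatToPcm16(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp so out-of-range input cannot overflow the 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative full scale is -32768, positive full scale is 32767.
    const v = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
    view.setInt16(i * 2, v, true); // true = little-endian
  }
  return bytes;
}

const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64[b0 >> 2];
    out += B64[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += has1 ? B64[((b1 & 0x0f) << 2) | (b2 >> 6)] : "=";
    out += has2 ? B64[b2 & 0x3f] : "=";
  }
  return out;
}

// 20 ms of silence at 16000 Hz is 320 samples, i.e. 640 bytes of PCM16.
const audioData = toBase64(floatToPcm16(new Float32Array(320)));
```

The resulting string is what would be passed as audio_data. In Node.js, Buffer.from(bytes).toString("base64") does the same encoding; the manual encoder is used here only to keep the sketch free of runtime-specific APIs.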
capture_screen_stream
Capture and stream screen content to the session.
Parameters:
- session_id (string): Active session identifier
- region (object, optional): Screen region to capture
- quality (string, optional): Capture quality ("high", "medium", "low")
get_session_status
Retrieve the current status of a real-time session.
Parameters:
- session_id (string): Session identifier to check
end_realtime_session
Terminate an active real-time streaming session.
Parameters:
- session_id (string): Session identifier to terminate
list_active_sessions
List all currently active real-time sessions.
Parameters: None
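On the wire, an MCP client invokes each of the tools above with a JSON-RPC "tools/call" request. The sketch below only builds that envelope to show how the documented parameters map onto it. buildToolCall and the id counter are hypothetical helpers for illustration; in practice an MCP client library would normally construct and send the request.

```typescript
// JSON-RPC 2.0 envelope for an MCP tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1; // hypothetical request id counter

function buildToolCall(
  name: string,
  args: Record<string, unknown> = {}
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Optional parameters are simply left out of "arguments".
const startRequest = buildToolCall("start_realtime_session", {
  model: "gemini-2.0-flash-exp",
  system_instruction: "You are a helpful AI assistant.",
});

// A tool without parameters is called with an empty arguments object.
const listRequest = buildToolCall("list_active_sessions");

console.log(JSON.stringify(startRequest, null, 2));
```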
Installation
- Install dependencies:
  npm install
- Build the TypeScript code:
  npm run build
- Configure your Gemini API key:
  export GEMINI_API_KEY="your-api-key-here"
Configuration
Add the server to your MCP client configuration:
{
  "mcpServers": {
    "gemini-realtime-stream": {
      "command": "node",
      "args": ["/path/to/gemini-realtime-stream/dist/gemini-realtime-stream.js"],
      "env": {
        "GEMINI_API_KEY": "your-api-key-here"
      }
    }
  }
}
Usage Examples
Basic Real-time Chat
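// The functions in these examples are assumed to be client-side wrappers
// around the MCP tools of the same name (e.g. start_realtime_session).
// Start a new session
const session = await startRealtimeSession({
  model: "gemini-2.0-flash-exp",
  system_instruction: "You are a helpful AI assistant."
});
// Send a message
await sendRealtimeMessage({
  session_id: session.id,
  content: "Hello, how are you today?"
});
Audio Streaming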
// Start session with voice capabilities
const session = await startRealtimeSession({
  model: "gemini-2.0-flash-exp",
  voice: "Aoede"
});
// Stream audio input
await streamAudioInput({
  session_id: session.id,
  audio_data: base64AudioData,
  format: "pcm16",
  sample_rate: 16000
});
Screen Sharing
// Capture and stream screen content
// session is the object returned by startRealtimeSession
await captureScreenStream({
  session_id: session.id,
  region: { x: 0, y: 0, width: 1920, height: 1080 },
  quality: "high"
});
API Reference
Session Management
- Sessions are automatically managed with unique identifiers
- Each session maintains its own conversation context
- Sessions can be terminated manually; otherwise they time out after a period of inactivity
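The session lifecycle described above (unique identifiers, manual termination, inactivity timeout) can be sketched as a small in-memory registry. This is an illustration under stated assumptions, not the server's implementation: SessionRegistry, the counter-based identifiers, and the sweep method are all hypothetical, and the actual timeout value is not documented here.

```typescript
interface Session {
  id: string;
  lastActivity: number; // epoch milliseconds of the last request
}

class SessionRegistry {
  private sessions = new Map<string, Session>();
  private counter = 0;
  private timeoutMs: number;
  private now: () => number;

  // The clock is injectable so the timeout logic can be tested deterministically.
  constructor(timeoutMs: number, now: () => number = () => Date.now()) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  create(): Session {
    this.counter += 1;
    // A counter keeps the sketch deterministic; a real server would more
    // likely use a random identifier such as a UUID.
    const session: Session = { id: `session-${this.counter}`, lastActivity: this.now() };
    this.sessions.set(session.id, session);
    return session;
  }

  // Record activity; returns false for an unknown or expired session.
  touch(id: string): boolean {
    const session = this.sessions.get(id);
    if (!session) return false;
    session.lastActivity = this.now();
    return true;
  }

  // Manual termination.
  end(id: string): boolean {
    return this.sessions.delete(id);
  }

  // Remove sessions idle for longer than timeoutMs; returns the removed ids.
  sweep(): string[] {
    const expired: string[] = [];
    const cutoff = this.now() - this.timeoutMs;
    this.sessions.forEach((session, id) => {
      if (session.lastActivity < cutoff) expired.push(id);
    });
    expired.forEach((id) => this.sessions.delete(id));
    return expired;
  }

  list(): string[] {
    return Array.from(this.sessions.keys());
  }
}
```

A long-running process would typically call sweep on a timer, while each tool call that names a session would call touch first.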
Audio Processing
- Supports PCM16 audio format at various sample rates
- Real-time voice activity detection
- Bidirectional audio streaming (input and output)
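To illustrate what voice activity detection on PCM16 frames involves, here is a minimal energy-threshold sketch. It is a deliberately simple stand-in: the README does not say which detection method the server or the Gemini Live API uses, and rmsLevel, isSpeech, and the 0.02 threshold are assumptions chosen for the example.

```typescript
// Root-mean-square level of a PCM16 frame, normalised to the range 0..1.
function rmsLevel(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] / 32768; // scale 16-bit samples to [-1, 1)
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Treat a frame as speech when its energy reaches the threshold.
// The default threshold is an arbitrary illustrative value.
function isSpeech(frame: Int16Array, threshold: number = 0.02): boolean {
  return rmsLevel(frame) >= threshold;
}

// At 16000 Hz, a 20 ms frame holds 320 samples.
const silentFrame = new Int16Array(320);
const loudFrame = new Int16Array(320).fill(16384);

console.log(isSpeech(silentFrame), isSpeech(loudFrame));
```

Production detectors usually add smoothing across frames so that short pauses inside a sentence are not treated as the end of speech.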
Video Processing
- Screen capture with configurable regions and quality
- Real-time video stream processing
- Support for multiple video formats
Function Calling
- Dynamic tool discovery and registration
- JSON schema validation for tool parameters
- Parallel function execution support
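The three points above can be sketched together: a registry of tools, a parameter check before execution, and concurrent execution of several calls. The sketch makes simplifying assumptions that are labeled in the code: the required list stands in for full JSON schema validation, and ToolDefinition, FunctionCall, and executeCalls are hypothetical names, not the server's API.

```typescript
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

interface ToolDefinition {
  name: string;
  required: string[]; // simplified stand-in for a JSON schema "required" list
  handler: ToolHandler;
}

interface FunctionCall {
  name: string;
  args: Record<string, unknown>;
}

interface FunctionResult {
  name: string;
  ok: boolean;
  value?: unknown;
  error?: string;
}

// Run all requested calls concurrently. Each call resolves to its own
// result, so one failure does not reject the whole batch.
async function executeCalls(
  tools: Map<string, ToolDefinition>,
  calls: FunctionCall[]
): Promise<FunctionResult[]> {
  return Promise.all(
    calls.map(async (call): Promise<FunctionResult> => {
      const tool = tools.get(call.name);
      if (!tool) {
        return { name: call.name, ok: false, error: `unknown tool: ${call.name}` };
      }
      const missing = tool.required.filter((key) => !(key in call.args));
      if (missing.length > 0) {
        return { name: call.name, ok: false, error: `missing: ${missing.join(", ")}` };
      }
      try {
        return { name: call.name, ok: true, value: await tool.handler(call.args) };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { name: call.name, ok: false, error: message };
      }
    })
  );
}

// Hypothetical tool registered for the example.
const tools = new Map<string, ToolDefinition>([
  [
    "add",
    {
      name: "add",
      required: ["a", "b"],
      handler: async (args) => Number(args["a"]) + Number(args["b"]),
    },
  ],
]);
```

Results come back in the same order as the calls, which makes it straightforward to pair each result with the function call that requested it.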
Error Handling
The server handles the following error conditions:
- Invalid session IDs return appropriate error messages
- Network connectivity issues are handled gracefully
- Audio/video processing errors are logged and reported
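As an illustration of the first case, MCP tool results can flag a failure with an isError field next to the usual content array. The sketch below shows what a response to an unknown session ID might look like. The exact messages this server returns are not documented here, so the wording, the getSessionStatus function, and the in-memory activeSessions set are assumptions.

```typescript
// Shape of an MCP tool result carrying text content.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

// Hypothetical in-memory store of active session ids.
const activeSessions = new Set<string>(["session-1"]);

function getSessionStatus(sessionId: string): ToolResult {
  if (!activeSessions.has(sessionId)) {
    // Report the problem in the result instead of throwing, so the
    // client (and the model) can see what went wrong.
    return {
      content: [{ type: "text", text: `Unknown session_id: ${sessionId}` }],
      isError: true,
    };
  }
  return {
    content: [
      { type: "text", text: JSON.stringify({ session_id: sessionId, status: "active" }) },
    ],
  };
}

console.log(getSessionStatus("does-not-exist"));
```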
Security Considerations
- API keys should be stored securely as environment variables
- Screen capture requires appropriate system permissions
- Audio input requires microphone access permissions
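For the API key point, one common pattern is to fail fast at startup when the variable is missing and to avoid printing the full key in logs. This is a small sketch of that pattern, not the server's actual startup code; requireApiKey and redact are hypothetical helpers.

```typescript
// Read the key from an environment-like object and fail fast if absent.
function requireApiKey(env: Record<string, string | undefined>): string {
  const key = env["GEMINI_API_KEY"];
  if (!key || key.trim() === "") {
    throw new Error("GEMINI_API_KEY is not set");
  }
  return key;
}

// Mask all but the last four characters before writing a key to a log.
function redact(key: string): string {
  if (key.length <= 4) return "****";
  return "*".repeat(key.length - 4) + key.slice(-4);
}

const exampleKey = requireApiKey({ GEMINI_API_KEY: "your-api-key-here" });
console.log(`Using API key ${redact(exampleKey)}`);
```

In a Node.js process the function would be called as requireApiKey(process.env).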
Dependencies
- @modelcontextprotocol/sdk: MCP SDK for server implementation
- @google/generative-ai: Google Generative AI SDK
- ws: WebSocket library for real-time communication
- Additional dependencies for audio/video processing
License
This project is licensed under the MIT License.
Contributing
Contributions are welcome! Please read the contributing guidelines before submitting pull requests.
Support
For issues and questions, please use the GitHub issue tracker.
