@r16t/multimodal-mcp

v1.4.0

Multi-provider media generation MCP server

multimodal-mcp

Multi-provider media generation MCP server. Generate images, videos, audio, and transcriptions from text prompts using OpenAI, xAI, Gemini, ElevenLabs, and BFL (FLUX) through a single unified interface.

Features

  • 🎨 Image Generation — Generate images via OpenAI (gpt-image-1), xAI (grok-imagine-image), Gemini (imagen-4), or BFL (FLUX Pro 1.1)
  • ✏️ Image Editing — Edit images via OpenAI, xAI, Gemini, or BFL (FLUX Kontext)
  • 🎬 Video Generation — Generate videos via OpenAI (sora-2), xAI (grok-imagine-video), or Gemini (veo-3.1)
  • 🔊 Audio Generation — Text-to-speech via OpenAI (tts-1), Gemini, or ElevenLabs (Flash v2.5). Sound effects via ElevenLabs
  • 🎙️ Audio Transcription — Speech-to-text via OpenAI (Whisper) or ElevenLabs (Scribe)
  • 🔄 Auto-Discovery — Automatically detects configured providers from environment variables
  • 🎯 Provider Selection — Auto-select a provider or explicitly choose one per request
  • 📁 File Output — Saves all generated media to disk with descriptive filenames

Quick Start

Set the API key for at least one provider. Most users only need one — add more to access additional providers.

# Using OpenAI
claude mcp add multimodal-mcp -e OPENAI_API_KEY=sk-... -- npx -y @r16t/multimodal-mcp@latest

# Or using xAI
# claude mcp add multimodal-mcp -e XAI_API_KEY=xai-... -- npx -y @r16t/multimodal-mcp@latest

# Or using Gemini
# claude mcp add multimodal-mcp -e GEMINI_API_KEY=AIza... -- npx -y @r16t/multimodal-mcp@latest

# Or using ElevenLabs (audio + transcription)
# claude mcp add multimodal-mcp -e ELEVENLABS_API_KEY=xi-... -- npx -y @r16t/multimodal-mcp@latest

# Or using BFL/FLUX (images)
# claude mcp add multimodal-mcp -e BFL_API_KEY=... -- npx -y @r16t/multimodal-mcp@latest

Using a different editor? See the Editor Setup section below for Claude Desktop, Cursor, VS Code, Windsurf, and Cline.

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| OPENAI_API_KEY | At least one provider key | OpenAI API key — enables image, video, audio generation, and transcription via gpt-image-1, sora-2, tts-1, and whisper-1 |
| XAI_API_KEY | At least one provider key | xAI API key — enables image and video generation via grok-imagine-image and grok-imagine-video |
| GEMINI_API_KEY | At least one provider key | Gemini API key — enables image, video, and audio generation via imagen-4, veo-3.1, and gemini-2.5-flash-preview-tts |
| GOOGLE_API_KEY | — | Alias for GEMINI_API_KEY; either name is accepted |
| ELEVENLABS_API_KEY | At least one provider key | ElevenLabs API key — enables audio generation (TTS, sound effects) and transcription via Flash v2.5 and Scribe v1 |
| BFL_API_KEY | At least one provider key | BFL API key — enables image generation and editing via FLUX Pro 1.1 and FLUX Kontext |
| MEDIA_OUTPUT_DIR | No | Directory for saved media files. Defaults to the current working directory |
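As a rough illustration of how the variables above could map to providers, here is a minimal TypeScript sketch. It is not taken from the package's source: PROVIDER_KEYS, detectProviders, and the rule that blank values count as unset are assumptions made for this example. The provider identifiers match the values accepted by the provider tool parameter.

```typescript
// Illustrative sketch only; not the package's actual implementation.
type Provider = "openai" | "xai" | "google" | "elevenlabs" | "bfl";
type Env = Record<string, string | undefined>;

// Environment variables that enable each provider, per the table above.
// GOOGLE_API_KEY is documented as an alias for GEMINI_API_KEY.
const PROVIDER_KEYS: Record<Provider, readonly string[]> = {
  openai: ["OPENAI_API_KEY"],
  xai: ["XAI_API_KEY"],
  google: ["GEMINI_API_KEY", "GOOGLE_API_KEY"],
  elevenlabs: ["ELEVENLABS_API_KEY"],
  bfl: ["BFL_API_KEY"],
};

// Return every provider that has at least one non-blank key set.
// Assumption: blank values are treated the same as missing ones.
function detectProviders(env: Env): Provider[] {
  return (Object.keys(PROVIDER_KEYS) as Provider[]).filter((provider) =>
    PROVIDER_KEYS[provider].some((key) => (env[key] ?? "").trim() !== ""),
  );
}

console.log(detectProviders({ GOOGLE_API_KEY: "AIza...", BFL_API_KEY: "" }));
// prints [ 'google' ]
```

In a Node.js process the same function could be called with process.env. An empty result corresponds to the "No providers configured" case described under Troubleshooting.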

Available Tools

generate_image

Generate an image from a text prompt.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| prompt | string | Yes | Text description of the image to generate |
| provider | string | No | Provider to use: openai, xai, google, bfl. Auto-selects if omitted |
| aspectRatio | string | No | Aspect ratio: 1:1, 16:9, 9:16, 4:3, 3:4 |
| quality | string | No | Quality level: low, standard, high |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly |
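MCP clients normally build tool calls for you, but it can help to see what a call to generate_image might look like on the wire. The sketch below assumes the standard MCP tools/call request shape (JSON-RPC 2.0); the GenerateImageArgs interface and buildGenerateImageRequest helper are hypothetical names that mirror the parameter table above, not part of the package.

```typescript
// Arguments accepted by generate_image, mirroring the parameter table above.
interface GenerateImageArgs {
  prompt: string;
  provider?: "openai" | "xai" | "google" | "bfl";
  aspectRatio?: "1:1" | "16:9" | "9:16" | "4:3" | "3:4";
  quality?: "low" | "standard" | "high";
  outputDirectory?: string;
  providerOptions?: Record<string, unknown>;
}

// Build a JSON-RPC 2.0 "tools/call" request for the generate_image tool.
// Hypothetical helper: an MCP client library would usually do this for you.
function buildGenerateImageRequest(id: number, args: GenerateImageArgs) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "generate_image", arguments: args },
  };
}

// Omitting provider leaves the choice to the server's auto-selection.
const imageRequest = buildGenerateImageRequest(1, {
  prompt: "A lighthouse at dusk, watercolor style",
  aspectRatio: "16:9",
  quality: "high",
});
console.log(JSON.stringify(imageRequest, null, 2));
```

The union types reject values outside the documented sets at compile time, which is useful when a script constructs calls rather than an assistant.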

generate_video

Generate a video from a text prompt. Video generation is asynchronous and may take several minutes.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| prompt | string | Yes | Text description of the video to generate |
| provider | string | No | Provider to use: openai, xai, google. Auto-selects if omitted |
| duration | number | No | Video duration in seconds (provider limits apply) |
| aspectRatio | string | No | Aspect ratio: 16:9, 9:16, 1:1 |
| resolution | string | No | Resolution: 480p, 720p, 1080p |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly |

generate_audio

Generate audio from text. Supports text-to-speech and sound effects. Audio generation is synchronous.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| text | string | Yes | Text to convert to speech, or a description of the sound effect to generate |
| provider | string | No | Provider to use: openai, google, elevenlabs. Auto-selects if omitted |
| voice | string | No | Voice name (provider-specific). OpenAI: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer. Google: Kore, Charon, Fenrir, Aoede, Puck, etc. ElevenLabs: voice ID |
| speed | number | No | Speech speed multiplier (OpenAI only): 0.25 to 4.0 |
| format | string | No | Output format (OpenAI only): mp3, opus, aac, flac, wav, pcm |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly. ElevenLabs: set mode: "sound-effect" for sound effects, model for TTS model selection |
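Several generate_audio parameters only apply to particular providers, so a quick pre-flight check can catch mismatches before a request is sent. The sketch below is an assumption-laden example: GenerateAudioArgs and checkAudioArgs are hypothetical names, and the server itself may simply ignore speed or format for non-OpenAI providers rather than reject them.

```typescript
// Arguments accepted by generate_audio, mirroring the parameter table above.
interface GenerateAudioArgs {
  text: string;
  provider?: "openai" | "google" | "elevenlabs";
  voice?: string;
  speed?: number;
  format?: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";
  outputDirectory?: string;
  providerOptions?: Record<string, unknown>;
}

// Return a list of warnings; an empty list means nothing conflicts with the table.
// Hypothetical helper, not part of the package.
function checkAudioArgs(args: GenerateAudioArgs): string[] {
  const warnings: string[] = [];
  // The documented range is 0.25 to 4.0; the negated form also catches NaN.
  if (args.speed !== undefined && !(args.speed >= 0.25 && args.speed <= 4.0)) {
    warnings.push("speed must be between 0.25 and 4.0");
  }
  // speed and format are documented as OpenAI-only.
  const nonOpenAI = args.provider !== undefined && args.provider !== "openai";
  if (nonOpenAI && args.speed !== undefined) {
    warnings.push("speed is only documented for the openai provider");
  }
  if (nonOpenAI && args.format !== undefined) {
    warnings.push("format is only documented for the openai provider");
  }
  return warnings;
}

// ElevenLabs sound effect, using the documented providerOptions mode.
const soundEffect: GenerateAudioArgs = {
  text: "Heavy rain on a tin roof with distant thunder",
  provider: "elevenlabs",
  providerOptions: { mode: "sound-effect" },
};
console.log(checkAudioArgs(soundEffect)); // prints []
```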

transcribe_audio

Transcribe audio to text (speech-to-text).

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| audioPath | string | Yes | Absolute path to the audio file to transcribe |
| provider | string | No | Provider to use: openai, elevenlabs. Auto-selects if omitted |
| language | string | No | Language code (e.g., en, fr, es) to hint the transcription language |
| providerOptions | object | No | Provider-specific parameters passed through directly |

list_providers

List all configured media generation providers and their capabilities. Takes no parameters.

Provider Capabilities

| Provider | Image | Image Editing | Video | Audio | Transcription | Key Models |
|----------|:-----:|:------------:|:-----:|:-----:|:------------:|------------|
| OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ | gpt-image-1, sora-2, tts-1, whisper-1 |
| xAI | ✅ | ✅ | ✅ | — | — | grok-imagine-image, grok-imagine-video |
| Gemini | ✅ | ✅ | ✅ | ✅ | — | imagen-4, veo-3.1, gemini-2.5-flash-preview-tts |
| ElevenLabs | — | — | — | ✅ | ✅ | eleven_flash_v2_5, scribe_v1 |
| BFL | ✅ | ✅ | — | — | — | flux-pro-1.1, flux-kontext-pro |

Image Aspect Ratios

| Provider | 1:1 | 16:9 | 9:16 | 4:3 | 3:4 |
|----------|:---:|:----:|:----:|:---:|:---:|
| OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ |
| xAI | ✅ | ✅ | ✅ | ✅ | ✅ |
| Gemini | ✅ | ✅ | ✅ | ✅ | ✅ |
| BFL | ✅ | ✅ | ✅ | ✅ | ✅ |

Video Aspect Ratios & Resolutions

| Provider | 16:9 | 9:16 | 1:1 | 480p | 720p | 1080p |
|----------|:----:|:----:|:---:|:----:|:----:|:-----:|
| OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| xAI | ✅ | ✅ | ✅ | — | ✅ | ✅ |
| Gemini | ✅ | ✅ | — | — | ✅ | ✅ |
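When picking a provider by hand, the table above can be treated as a lookup. The following sketch encodes it as data and lists the providers that support a given combination; VIDEO_SUPPORT and providersForVideo are hypothetical names for this example, and the values are copied from the table rather than from the package's code.

```typescript
type VideoProvider = "openai" | "xai" | "google";
type VideoAspectRatio = "16:9" | "9:16" | "1:1";
type VideoResolution = "480p" | "720p" | "1080p";

interface VideoSupport {
  aspectRatios: readonly VideoAspectRatio[];
  resolutions: readonly VideoResolution[];
}

// Transcribed from the "Video Aspect Ratios & Resolutions" table ("google" = Gemini).
const VIDEO_SUPPORT: Record<VideoProvider, VideoSupport> = {
  openai: { aspectRatios: ["16:9", "9:16", "1:1"], resolutions: ["480p", "720p", "1080p"] },
  xai: { aspectRatios: ["16:9", "9:16", "1:1"], resolutions: ["720p", "1080p"] },
  google: { aspectRatios: ["16:9", "9:16"], resolutions: ["720p", "1080p"] },
};

// List the providers that support both the aspect ratio and the resolution.
function providersForVideo(
  aspectRatio: VideoAspectRatio,
  resolution: VideoResolution,
): VideoProvider[] {
  return (Object.keys(VIDEO_SUPPORT) as VideoProvider[]).filter(
    (provider) =>
      VIDEO_SUPPORT[provider].aspectRatios.includes(aspectRatio) &&
      VIDEO_SUPPORT[provider].resolutions.includes(resolution),
  );
}

console.log(providersForVideo("1:1", "480p"));  // prints [ 'openai' ]
console.log(providersForVideo("9:16", "720p")); // prints [ 'openai', 'xai', 'google' ]
```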

Audio Formats

| Provider | mp3 | opus | aac | flac | wav | pcm |
|----------|:---:|:----:|:---:|:----:|:---:|:---:|
| OpenAI | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Gemini | — | — | — | — | ✅ | — |
| ElevenLabs | ✅ | ✅ | — | — | — | ✅ |

Troubleshooting

No providers configured

[config] No provider API keys detected

Set at least one of OPENAI_API_KEY, XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, or BFL_API_KEY in the MCP server's env block.

Provider not available for requested media type

Each provider supports different media types (see Provider Capabilities). If you specify a provider that isn't configured (no API key) or doesn't support the requested media type, you'll receive an error. Omit the provider parameter to auto-select from configured providers.
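The selection rules described above can be sketched roughly as follows. This is an illustration, not the server's code: CAPABILITIES is transcribed from the Provider Capabilities table, selectProvider is a hypothetical name, and picking the first configured provider that supports the media type is an assumption, since the actual auto-selection order is not documented here.

```typescript
type Provider = "openai" | "xai" | "google" | "elevenlabs" | "bfl";
type MediaType = "image" | "imageEdit" | "video" | "audio" | "transcription";

// Transcribed from the Provider Capabilities table ("google" = Gemini).
const CAPABILITIES: Record<Provider, readonly MediaType[]> = {
  openai: ["image", "imageEdit", "video", "audio", "transcription"],
  xai: ["image", "imageEdit", "video"],
  google: ["image", "imageEdit", "video", "audio"],
  elevenlabs: ["audio", "transcription"],
  bfl: ["image", "imageEdit"],
};

// Resolve the provider for a request, or throw with a descriptive error.
function selectProvider(
  mediaType: MediaType,
  configured: readonly Provider[],
  requested?: Provider,
): Provider {
  if (requested !== undefined) {
    if (!configured.includes(requested)) {
      throw new Error(`Provider "${requested}" is not configured (no API key)`);
    }
    if (!CAPABILITIES[requested].includes(mediaType)) {
      throw new Error(`Provider "${requested}" does not support ${mediaType}`);
    }
    return requested;
  }
  // Assumption: auto-selection takes the first configured provider that fits.
  const match = configured.find((p) => CAPABILITIES[p].includes(mediaType));
  if (match === undefined) {
    throw new Error(`No configured provider supports ${mediaType}`);
  }
  return match;
}

console.log(selectProvider("transcription", ["xai", "elevenlabs"])); // prints elevenlabs
```

Both error cases from the paragraph above appear here as separate messages, which makes it easier to tell a missing API key apart from an unsupported media type.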

Video generation timeout

Video generation polls for up to 10 minutes. If your video hasn't completed in that window, the request will fail with a timeout error. Try a shorter duration or a simpler prompt.
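To make the timeout behavior concrete, here is a generic poll-until-done loop with an overall deadline. It is a sketch under stated assumptions: pollUntilDone, JobState, and the polling interval are hypothetical (only the 10-minute limit comes from this README), and the clock and sleep functions are injectable so the loop can be exercised without waiting.

```typescript
// Result of one status check: either finished with a value, or still running.
type JobState<T> = { done: true; result: T } | { done: false };

interface PollOptions {
  timeoutMs: number;  // overall limit; this README states 10 minutes for video
  intervalMs: number; // assumption: the real polling interval is not documented
  now?: () => number;
  sleep?: (ms: number) => Promise<void>;
}

// Call check() repeatedly until it reports done or the deadline would be exceeded.
async function pollUntilDone<T>(
  check: () => Promise<JobState<T>>,
  options: PollOptions,
): Promise<T> {
  const now = options.now ?? (() => Date.now());
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const deadline = now() + options.timeoutMs;

  while (true) {
    const state = await check();
    if (state.done) return state.result;
    // Stop if the next wait would end after the deadline.
    if (now() + options.intervalMs > deadline) {
      throw new Error(`Video generation timed out after ${options.timeoutMs} ms`);
    }
    await sleep(options.intervalMs);
  }
}

// Usage with a simulated job that finishes on the third status check.
let checkCount = 0;
const simulatedCheck = async (): Promise<JobState<string>> => {
  checkCount += 1;
  return checkCount >= 3 ? { done: true, result: "video.mp4" } : { done: false };
};

pollUntilDone(simulatedCheck, { timeoutMs: 10 * 60 * 1000, intervalMs: 10 }).then(
  (file) => console.log("finished:", file),
);
```

A shorter duration or simpler prompt helps because it reduces the time the job needs, not because it changes the deadline.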

xAI image generation returned no data

This indicates the xAI API returned an empty response. Check that your XAI_API_KEY is valid and that your prompt does not violate xAI content policies.

Gemini image/video generation failed: 403

Verify that the Google Cloud project behind your GEMINI_API_KEY has the Generative Language API enabled in Google Cloud Console.

Development

npm run build      # Compile TypeScript to build/
npm test           # Run tests with Vitest
npm run lint       # Lint and auto-fix with ESLint
npm run typecheck  # Type-check without emitting
npm run dev        # Watch mode for TypeScript compilation

Editor Setup

Replace OPENAI_API_KEY with the key variable for your provider of choice (XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, or BFL_API_KEY). You can set multiple keys to enable multiple providers.
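The editor configurations in this section share one server entry and differ only in file location and root key (VS Code uses servers, the others use mcpServers). If you maintain several of these files, a small generator such as the sketch below can keep them in sync; buildMcpConfig and ProviderEnv are hypothetical names for this example, and the placeholder key values are not real credentials.

```typescript
// Environment variables that may be passed to the server (see Environment Variables).
type ProviderEnv = Partial<
  Record<
    | "OPENAI_API_KEY"
    | "XAI_API_KEY"
    | "GEMINI_API_KEY"
    | "ELEVENLABS_API_KEY"
    | "BFL_API_KEY",
    string
  >
>;

// Build the config object for one editor. Hypothetical helper for illustration.
function buildMcpConfig(
  env: ProviderEnv,
  rootKey: "mcpServers" | "servers" = "mcpServers",
) {
  return {
    [rootKey]: {
      "multimodal-mcp": {
        command: "npx",
        args: ["@r16t/multimodal-mcp@latest"],
        env,
      },
    },
  };
}

// Two keys enable two providers: OpenAI and ElevenLabs.
const multiProviderConfig = buildMcpConfig({
  OPENAI_API_KEY: "sk-...",
  ELEVENLABS_API_KEY: "xi-...",
});
console.log(JSON.stringify(multiProviderConfig, null, 2));
```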

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS path):

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Cursor

Add to .cursor/mcp.json in your project root (or ~/.cursor/mcp.json globally):

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

VS Code (GitHub Copilot)

Add to .vscode/mcp.json in your project root:

{
  "servers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Windsurf

Add to ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Cline

Add to ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json (macOS path):

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

License

MIT