agent-media
v0.14.0
Agent-first media toolkit CLI
agent-media
Media processing CLI for AI agents.
- Image: generate, edit, remove-background, upscale, resize, convert, extend, crop
- Video: generate (text-to-video and image-to-video)
- Audio: extract from video, transcribe (with speaker identification)
Installation
Global
npm install -g agent-media@latest
From Source
git clone https://github.com/agntswrm/agent-media
cd agent-media
pnpm install && pnpm build && pnpm link --global
Via bunx / npx
Run directly without installing:
bunx agent-media@latest --help
npx agent-media@latest --help
Skills for AI Agents
Install agent-media skills to your coding agent (Claude Code, Cursor, Codex, etc.):
npx skills add agntswrm/agent-media
This adds media processing skills that your AI agent can use automatically. Available skills:
- agent-media - Overview of all capabilities
- image-generate - Generate images from text
- image-edit - Edit images with text prompts
- image-resize - Resize images
- image-convert - Convert image formats
- image-extend - Extend image canvas with padding
- image-remove-background - Remove backgrounds
- image-crop - Crop images to specified dimensions
- image-upscale - Upscale images with AI super-resolution
- audio-extract - Extract audio from video
- audio-transcribe - Transcribe audio to text
- video-generate - Generate videos from text or images
Quick Start
# generate an image
agent-media image generate --prompt "a robot" --out rob.png
# remove background
agent-media image remove-background --in rob.png --out rob_nobg.png
# edit the image
agent-media image edit --in rob_nobg.png --prompt "the robot is sitting on a bench next to a cat, in the background you can see the Eiffel Tower in Paris" --out rob_cat_paris.png
# generate a video with audio (cat meows, robot speaks!)
agent-media video generate --in rob_cat_paris.png --prompt "the cat meows and the robot says: \"Yes, me too.\"" --audio --out rob_cat_video.mp4
# extract audio from video
agent-media audio extract --in rob_cat_video.mp4 --out rob_cat_audio.mp3
# transcribe the audio
agent-media audio transcribe --in rob_cat_audio.mp3
Requirements
- Node.js >= 18.0.0
- API key from fal.ai, Replicate, Runpod, or AI Gateway for AI features
Local processing (no API key): resize, convert, extend, crop, upscale, audio extract, remove-background, transcribe
Cloud processing (API key required): image generate, image edit, upscale, video generate, remove-background, transcribe
Note: You may see a "mutex lock failed" error when using local remove-background, upscale, or transcribe. Ignore it; the output is correct if the JSON shows "ok": true.
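A calling script can follow this advice by judging success from the JSON result alone. The Python sketch below is not part of agent-media. It assumes the result is the first JSON object in the captured stdout that contains an "ok" field, and that any stray log lines are plain text around it. The function name `result_ok` and the sample noise line in the usage note are hypothetical.

```python
import json


def result_ok(stdout: str) -> bool:
    """Return True if the captured output holds a JSON result with "ok": true.

    Assumption: the result is the first JSON object that has an "ok" key;
    any other text in the output (such as a stray error line) is ignored.
    """
    decoder = json.JSONDecoder()
    start = stdout.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`.
            obj, end = decoder.raw_decode(stdout, start)
        except json.JSONDecodeError:
            # This "{" did not begin valid JSON; try the next one.
            start = stdout.find("{", start + 1)
            continue
        if isinstance(obj, dict) and "ok" in obj:
            return obj["ok"] is True
        start = stdout.find("{", end)
    return False
```

For example, `result_ok('mutex lock failed\n{"ok": true}')` evaluates to True even though the output begins with an error line.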
image
agent-media image resize --in <path> [options]
agent-media image convert --in <path> --format <f>
agent-media image extend --in <path> --padding <px> --color <hex>
agent-media image crop --in <path> --width <px> --height <px>
agent-media image generate --prompt <text>
agent-media image edit --in <paths...> --prompt <text>
agent-media image remove-background --in <path>
agent-media image upscale --in <path>
resize
local
agent-media image resize --in sunset-mountains.jpg --width 800
agent-media image resize --in sunset-mountains.jpg --height 600
agent-media image resize --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --width 800
| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --width <px> | Target width in pixels |
| --height <px> | Target height in pixels |
| --out <path> | Output path, filename or directory (default: ./) |
convert
local
agent-media image convert --in sunset-mountains.png --format webp
agent-media image convert --in sunset-mountains.jpg --format png
agent-media image convert --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.png --format jpg --quality 90
| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --format <f> | Output format: png, jpg, webp (required) |
| --quality <n> | Quality 1-100 for lossy formats (default: 80) |
| --out <path> | Output path, filename or directory (default: ./) |
extend
local
Extend image canvas by adding padding on all sides with a solid background color.
agent-media image extend --in sunset-mountains.jpg --padding 50 --color "#E4ECF8"
agent-media image extend --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.png --padding 100 --color "#FFFFFF"
| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --padding <px> | Padding size in pixels to add on all sides (required) |
| --color <hex> | Background color for extended area (required). Also flattens transparency. |
| --dpi <n> | DPI/density for output image (default: 300) |
| --out <path> | Output path, filename or directory (default: ./) |
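Because the padding is added on every side, the output canvas grows by twice the padding in each dimension. A minimal Python sketch of that arithmetic follows; the helper name `extended_size` is hypothetical and not part of agent-media.

```python
def extended_size(width: int, height: int, padding: int) -> tuple[int, int]:
    """Canvas size after adding `padding` pixels to all four sides."""
    if padding < 0:
        raise ValueError("padding must be non-negative")
    # Left + right add 2 * padding to the width; top + bottom do the same
    # for the height.
    return width + 2 * padding, height + 2 * padding


print(extended_size(1200, 800, 50))  # (1300, 900)
```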
crop
local
Crop an image to specified dimensions around a focal point. The crop region is calculated to center on the focal point while staying within image bounds.
agent-media image crop --in sunset-mountains.jpg --width 800 --height 600
agent-media image crop --in sunset-mountains.jpg --width 800 --height 600 --focus-x 20 --focus-y 30
agent-media image crop --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --width 400 --height 400
| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --width <px> | Width of crop area in pixels (required) |
| --height <px> | Height of crop area in pixels (required) |
| --focus-x <n> | Focal point X position 0-100 (default: 50 = center) |
| --focus-y <n> | Focal point Y position 0-100 (default: 50 = center) |
| --dpi <n> | DPI/density for output image (default: 300) |
| --out <path> | Output path, filename or directory (default: ./) |
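The focal-point behavior described above can be sketched as follows. This is an illustration in Python, not agent-media's actual implementation: the rounding and the handling of a crop larger than the image are assumptions, and `crop_box` is a hypothetical name.

```python
def crop_box(img_w, img_h, crop_w, crop_h, focus_x=50.0, focus_y=50.0):
    """Return (left, top, width, height) of a crop centered on the focal point.

    focus_x / focus_y are percentages (0-100) of the image width / height.
    """
    # Assumption: a crop larger than the image is limited to the image size.
    crop_w = min(crop_w, img_w)
    crop_h = min(crop_h, img_h)

    # Focal point in pixel coordinates.
    cx = img_w * focus_x / 100
    cy = img_h * focus_y / 100

    # Center the crop on the focal point ...
    left = round(cx - crop_w / 2)
    top = round(cy - crop_h / 2)

    # ... then shift it back inside the image bounds if it spills over.
    left = max(0, min(left, img_w - crop_w))
    top = max(0, min(top, img_h - crop_h))
    return left, top, crop_w, crop_h


print(crop_box(1920, 1080, 800, 600))          # (560, 240, 800, 600)
print(crop_box(1920, 1080, 800, 600, 20, 30))  # (0, 24, 800, 600)
```

In the second call the focal point sits near the left edge, so the region is clamped to start at x = 0 instead of extending past the image.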
generate
API key required
agent-media image generate --prompt "a cat wearing a hat"
agent-media image generate --prompt "sunset over mountains" --width 1024 --height 768
| Option | Description |
|--------|-------------|
| --prompt <text> | Text description (required) |
| --width <px> | Width (default: 1280) |
| --height <px> | Height (default: 720) |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (fal, replicate, runpod, ai-gateway) |
| --model <name> | Model override (e.g., fal-ai/flux-2, bfl/flux-2-pro) |
edit
API key required
Edit one or more images using a text prompt (image-to-image). Supports multiple input images for combining styles, subjects, or scenes.
agent-media image edit --in sunset-mountains.jpg --prompt "make the sky more vibrant"
agent-media image edit --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/man-portrait.png --prompt "add sunglasses"
agent-media image edit --in style.jpg content.jpg --prompt "apply the style of the first image to the second"
| Option | Description |
|--------|-------------|
| --in <paths...> | One or more input file paths or URLs (required) |
| --prompt <text> | Text description of the desired edit (required) |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (fal, replicate, runpod, ai-gateway) |
| --model <name> | Model override (e.g., fal-ai/flux-2/edit, google/gemini-3-pro-image) |
remove-background
local or cloud
agent-media image remove-background --in man-portrait.png
agent-media image remove-background --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/man-portrait.png
| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (local, fal, replicate) |
upscale
local or cloud
Upscale an image using AI super-resolution to increase resolution with detail generation.
agent-media image upscale --in sunset-mountains.jpg
agent-media image upscale --in sunset-mountains.jpg --scale 4 --provider fal
agent-media image upscale --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --provider replicate
| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --scale <n> | Scale factor: 2 or 4 (default: 2). Local provider always outputs 4x. |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (local, fal, replicate) |
| --model <name> | Model override |
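The interaction between --scale and the local provider is easy to miss. The Python sketch below only restates the rule from the table; it does not call agent-media. The helper name `upscaled_size` is hypothetical, and it assumes the output dimensions are exactly the input dimensions times the effective factor.

```python
def upscaled_size(width: int, height: int, provider: str, scale: int = 2):
    """Expected output size for a given provider and --scale value."""
    if scale not in (2, 4):
        raise ValueError("scale must be 2 or 4")
    # Per the option table, the local provider always outputs 4x,
    # regardless of the requested scale.
    factor = 4 if provider == "local" else scale
    return width * factor, height * factor


print(upscaled_size(640, 480, "local", scale=2))  # (2560, 1920)
print(upscaled_size(640, 480, "fal", scale=2))    # (1280, 960)
```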
video
# Generate video from text
agent-media video generate --prompt <text>
# Generate video from image (animate an image)
agent-media video generate --in <image> --prompt <text>
generate
API key required
Generate video from a text prompt. Optionally provide an input image to animate it (image-to-video). The prompt describes what should happen in the video.
# Text-to-video
agent-media video generate --prompt "a cat walking through a garden"
# Image-to-video (animate an image)
agent-media video generate --in woman-portrait.png --prompt "person smiles and waves hello"
# With audio/speech generation (runpod)
agent-media video generate --in woman-portrait.png --prompt "The woman says: \"Hello, welcome to our channel!\"" --audio --provider runpod
# With ambient audio (fal)
agent-media video generate --prompt "fireworks in the night sky" --audio --duration 10 --provider fal
# Higher resolution
agent-media video generate --prompt "ocean waves" --resolution 1080p
| Option | Description |
|--------|-------------|
| --prompt <text> | Text description of the video (required) |
| --in <path> | Input image for image-to-video (optional) |
| --duration <s> | Duration in seconds (default: 5 for runpod, 6 for others) |
| --resolution <r> | Resolution: 720p, 1080p (default: 720p) |
| --fps <n> | Frame rate: 25, 50 (default: 25) |
| --audio | Generate audio track (includes speech from quoted text with runpod) |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (fal, replicate, runpod) |
| --model <name> | Model override |
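Prompts with quoted speech need escaped quotes on the shell command line, as in the examples above. When an agent invokes the CLI from a script, passing the arguments as a list avoids that escaping. The following Python sketch builds such a list using only the options documented in the table; the function name `video_generate_args` is hypothetical.

```python
import shlex


def video_generate_args(prompt, image=None, duration=None, audio=False,
                        provider=None):
    """Build the argument list for `agent-media video generate`."""
    args = ["agent-media", "video", "generate", "--prompt", prompt]
    if image is not None:
        args += ["--in", image]          # image-to-video
    if duration is not None:
        args += ["--duration", str(duration)]
    if audio:
        args.append("--audio")
    if provider is not None:
        args += ["--provider", provider]
    return args


args = video_generate_args(
    'The woman says: "Hello, welcome to our channel!"',
    image="woman-portrait.png",
    audio=True,
    provider="runpod",
)
# shlex.join shows the equivalent, correctly quoted shell command.
print(shlex.join(args))
```

The list can be passed directly to `subprocess.run(args, ...)`, so the inner double quotes reach the CLI unchanged.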
audio
# Extract audio from video
agent-media audio extract --in <video>
# Transcribe audio to text
agent-media audio transcribe --in <audio>
extract
local
Extract audio track from a video file.
agent-media audio extract --in woman-greeting.mp4
agent-media audio extract --in woman-greeting.mp4 --format wav
agent-media audio extract --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/woman-greeting.mp4
| Option | Description |
|--------|-------------|
| --in <path> | Input video file path or URL (required) |
| --format <f> | Output format: mp3, wav (default: mp3) |
| --out <path> | Output path, filename or directory (default: ./) |
transcribe
local or cloud (diarization requires cloud)
Transcribe audio to text with timestamps. Speaker identification (diarization) requires a cloud provider.
agent-media audio transcribe --in woman-greeting.mp3
agent-media audio transcribe --in woman-greeting.mp3 --diarize --speakers 2
agent-media audio transcribe --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/woman-greeting.mp3
| Option | Description |
|--------|-------------|
| --in <path> | Input audio file path or URL (required) |
| --diarize | Enable speaker identification (cloud only) |
| --language <code> | Language code (auto-detected if not provided) |
| --speakers <n> | Number of speakers hint |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (local, fal, replicate) |
| --model <name> | Model override |
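Since diarization is cloud only, a caller may want to check the combination of options before invoking the CLI. The Python sketch below is a caller-side pre-flight check, not a description of how agent-media itself reacts to `--diarize` with the local provider; the function and constant names are hypothetical.

```python
# Cloud providers listed for transcribe in the option table.
CLOUD_TRANSCRIBE_PROVIDERS = {"fal", "replicate"}


def check_transcribe_options(provider: str, diarize: bool = False) -> None:
    """Raise ValueError if diarization is requested with a non-cloud provider."""
    if diarize and provider not in CLOUD_TRANSCRIBE_PROVIDERS:
        raise ValueError(
            "--diarize requires a cloud provider (fal or replicate), "
            f"got {provider!r}"
        )


check_transcribe_options("local")                  # fine: plain transcription
check_transcribe_options("fal", diarize=True)      # fine: cloud diarization
```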
Output Format
All commands return JSON to stdout:
{
"ok": true,
"media_type": "image",
"action": "resize",
"provider": "local",
"output_path": "resized_123_abc.png",
"mime": "image/png",
"bytes": 45678
}
On error:
{
"ok": false,
"error": {
"code": "INVALID_INPUT",
"message": "At least one of width or height must be specified"
}
}
Exit code is 0 on success, 1 on error.
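A script that drives the CLI can rely on this contract by parsing stdout and turning the error object into an exception. The Python sketch below assumes stdout contains only the JSON document. The names `AgentMediaError`, `parse_result`, and `run_agent_media`, and the "UNKNOWN" fallback code, are hypothetical and not part of agent-media.

```python
import json
import subprocess


class AgentMediaError(RuntimeError):
    """Raised when the JSON result reports "ok": false."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_result(stdout: str) -> dict:
    """Parse the JSON result; raise AgentMediaError on failure results."""
    result = json.loads(stdout)
    if not result.get("ok"):
        error = result.get("error", {})
        raise AgentMediaError(error.get("code", "UNKNOWN"),
                              error.get("message", ""))
    return result


def run_agent_media(*args: str) -> dict:
    """Run agent-media with the given arguments and return the parsed result."""
    # check=False: a failing command exits with 1 but, per the format above,
    # still describes the error as JSON, which parse_result handles.
    proc = subprocess.run(["agent-media", *args],
                          capture_output=True, text=True, check=False)
    return parse_result(proc.stdout)
```

With agent-media installed, `run_agent_media("image", "resize", "--in", "sunset-mountains.jpg", "--width", "800")["output_path"]` would give the path of the resized file.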
Providers
Default Models
| Provider | resize | convert | extend | crop | image generate | image edit | remove-background | upscale | video generate | transcribe |
|----------|--------|---------|--------|------|----------------|------------|-------------------|---------|----------------|------------|
| local | ✓* | ✓* | ✓* | ✓* | - | - | Xenova/modnet** | Xenova/swin2SR** | - | moonshine-base** |
| fal | - | - | - | - | fal-ai/flux-2 | fal-ai/flux-2/edit | fal-ai/birefnet/v2 | fal-ai/esrgan | fal-ai/ltx-2 | fal-ai/wizper |
| replicate | - | - | - | - | black-forest-labs/flux-2-dev | black-forest-labs/flux-kontext-dev | men1scus/birefnet | nightmareai/real-esrgan | lightricks/ltx-video | whisper-diarization |
| runpod | - | - | - | - | alibaba/wan-2.6 | google/nano-banana-pro-edit | - | - | wan-2.6 | - |
| ai-gateway | - | - | - | - | bfl/flux-2-pro | google/gemini-3-pro-image | - | - | - | - |
* Powered by Sharp for fast image processing
** Powered by Transformers.js for local ML inference (models downloaded on first use)
Use --model <name> to override the default model for any command.
Provider Selection
- Explicit flag (highest priority): --provider fal
- Environment auto-detect: Set FAL_API_KEY, REPLICATE_API_TOKEN, RUNPOD_API_KEY, or AI_GATEWAY_API_KEY to auto-select that provider
- Fallback to local: For resize/convert when no provider specified
- First supporting provider: For generate/remove-background
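The priority order above can be sketched in Python as follows. This is an illustration, not agent-media's code. Two points are assumptions: the order in which the environment variables are checked when several are set, and that auto-detection skips providers that do not support the action. The function name `select_provider` and its parameters are hypothetical.

```python
import os

# Environment variable -> provider, as listed in this README.
# Assumption: checked in this order when more than one is set.
ENV_KEYS = {
    "FAL_API_KEY": "fal",
    "REPLICATE_API_TOKEN": "replicate",
    "RUNPOD_API_KEY": "runpod",
    "AI_GATEWAY_API_KEY": "ai-gateway",
}


def select_provider(action, supported, explicit=None, env=None):
    """Pick a provider for `action` from the ordered list `supported`."""
    env = os.environ if env is None else env

    # 1. An explicit --provider flag always wins.
    if explicit:
        return explicit

    # 2. Auto-detect from API keys present in the environment.
    for var, provider in ENV_KEYS.items():
        if env.get(var) and provider in supported:
            return provider

    # 3. resize/convert fall back to local processing.
    if action in ("resize", "convert"):
        return "local"

    # 4. Otherwise take the first provider that supports the action.
    return supported[0] if supported else None


print(select_provider("generate", ["fal", "replicate"],
                      env={"REPLICATE_API_TOKEN": "r8_example"}))  # replicate
```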
Environment Variables
| Variable | Description | Get Key |
|----------|-------------|---------|
| FAL_API_KEY | fal.ai API key | fal.ai |
| REPLICATE_API_TOKEN | Replicate API token | replicate.com |
| RUNPOD_API_KEY | Runpod API key | runpod.io |
| AI_GATEWAY_API_KEY | AI Gateway API key | vercel.com |
| AGENT_MEDIA_DIR | Output directory (default: current directory) | - |
Roadmap
- [x] Local background removal (zero API keys)
- [x] Local transcription (zero API keys)
- [x] Video generation (text-to-video and image-to-video)
- [ ] Batch processing support
