agent-media

v0.14.0

Published

7 days ago

Agent-first media toolkit CLI

timpietrusky

media image audio video transcribe extract cli agent ai resize convert generate

agent-media

Media processing CLI for AI agents.

Image: generate, edit, remove-background, upscale, resize, convert, extend, crop
Video: generate (text-to-video and image-to-video)
Audio: extract from video, transcribe (with speaker identification)

Installation

Global

npm install -g agent-media@latest

From Source

git clone https://github.com/agntswrm/agent-media
cd agent-media
pnpm install && pnpm build && pnpm link --global

Via bunx / npx

Run directly without installing:

bunx agent-media@latest --help
npx agent-media@latest --help

Skills for AI Agents

Install agent-media skills to your coding agent (Claude Code, Cursor, Codex, etc.):

npx skills add agntswrm/agent-media

This adds media processing skills that your AI agent can use automatically. Available skills:

agent-media - Overview of all capabilities
image-generate - Generate images from text
image-edit - Edit images with text prompts
image-resize - Resize images
image-convert - Convert image formats
image-extend - Extend image canvas with padding
image-remove-background - Remove backgrounds
image-crop - Crop images to specified dimensions
image-upscale - Upscale images with AI super-resolution
audio-extract - Extract audio from video
audio-transcribe - Transcribe audio to text
video-generate - Generate videos from text or images

Quick Start

# generate an image
agent-media image generate --prompt "a robot" --out rob.png

# remove background
agent-media image remove-background --in rob.png --out rob_nobg.png

# edit the image
agent-media image edit --in rob_nobg.png --prompt "the robot is sitting on a bench next to a cat, in the background you can see the Eiffel Tower in Paris" --out rob_cat_paris.png

# generate a video with audio (cat meows, robot speaks!)
agent-media video generate --in rob_cat_paris.png --prompt "the cat meows and the robot says: \"Yes, me too.\"" --audio --out rob_cat_video.mp4

# extract audio from video
agent-media audio extract --in rob_cat_video.mp4 --out rob_cat_audio.mp3

# transcribe the audio
agent-media audio transcribe --in rob_cat_audio.mp3

Requirements

Node.js >= 18.0.0
API key from fal.ai, Replicate, Runpod, or AI Gateway for AI features

Local processing (no API key): resize, convert, extend, crop, upscale, audio extract, remove-background, transcribe

Cloud processing (API key required): image generate, image edit, upscale, video generate, remove-background, transcribe

Note: You may see a "mutex lock failed" error when using local remove-background, upscale, or transcribe. It can be ignored: the output is correct if the JSON shows "ok": true.
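An agent that wants to know up front whether an action needs an API key can encode the two lists above as a lookup. This is a minimal Python sketch: the action names are copied from the lists, while `LOCAL_ACTIONS`, `CLOUD_ACTIONS`, and `needs_api_key` are hypothetical helper names, not part of agent-media.

```python
# Action names copied from the Requirements lists (not an agent-media API).
LOCAL_ACTIONS = {
    "resize", "convert", "extend", "crop", "upscale",
    "audio extract", "remove-background", "transcribe",
}
CLOUD_ACTIONS = {
    "image generate", "image edit", "upscale",
    "video generate", "remove-background", "transcribe",
}


def needs_api_key(action: str) -> bool:
    """Return True when the action is listed only under cloud processing."""
    if action not in LOCAL_ACTIONS | CLOUD_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    # Actions that appear in both lists can run locally without a key.
    return action not in LOCAL_ACTIONS
```

Actions that appear in both lists (upscale, remove-background, transcribe) are treated as not requiring a key, since a local provider exists for them.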

image

agent-media image resize --in <path> [options]
agent-media image convert --in <path> --format <f>
agent-media image extend --in <path> --padding <px> --color <hex>
agent-media image crop --in <path> --width <px> --height <px>
agent-media image generate --prompt <text>
agent-media image edit --in <paths...> --prompt <text>
agent-media image remove-background --in <path>
agent-media image upscale --in <path>

resize

local

agent-media image resize --in sunset-mountains.jpg --width 800
agent-media image resize --in sunset-mountains.jpg --height 600
agent-media image resize --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --width 800

| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --width <px> | Target width in pixels |
| --height <px> | Target height in pixels |
| --out <path> | Output path, filename or directory (default: ./) |

convert

local

agent-media image convert --in sunset-mountains.png --format webp
agent-media image convert --in sunset-mountains.jpg --format png
agent-media image convert --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.png --format jpg --quality 90

| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --format <f> | Output format: png, jpg, webp (required) |
| --quality <n> | Quality 1-100 for lossy formats (default: 80) |
| --out <path> | Output path, filename or directory (default: ./) |

extend

local

Extend image canvas by adding padding on all sides with a solid background color.

agent-media image extend --in sunset-mountains.jpg --padding 50 --color "#E4ECF8"
agent-media image extend --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.png --padding 100 --color "#FFFFFF"

| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --padding <px> | Padding size in pixels to add on all sides (required) |
| --color <hex> | Background color for extended area (required). Also flattens transparency. |
| --dpi <n> | DPI/density for output image (default: 300) |
| --out <path> | Output path, filename or directory (default: ./) |
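Because the padding is added on all sides, each output dimension should grow by twice the padding value. This small Python sketch shows the arithmetic. `extended_size` is a hypothetical helper, and the result is what the description implies rather than something verified against the tool.

```python
from typing import Tuple


def extended_size(width: int, height: int, padding: int) -> Tuple[int, int]:
    """Expected canvas size after adding `padding` pixels on every side."""
    if padding < 0:
        raise ValueError("padding must be non-negative")
    # Left + right padding widen the image; top + bottom padding heighten it.
    return width + 2 * padding, height + 2 * padding
```

For example, an 800x600 input with `--padding 50` would be expected to produce a 900x700 canvas.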

crop

local

Crop an image to specified dimensions around a focal point. The crop region is calculated to center on the focal point while staying within image bounds.

agent-media image crop --in sunset-mountains.jpg --width 800 --height 600
agent-media image crop --in sunset-mountains.jpg --width 800 --height 600 --focus-x 20 --focus-y 30
agent-media image crop --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --width 400 --height 400

| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --width <px> | Width of crop area in pixels (required) |
| --height <px> | Height of crop area in pixels (required) |
| --focus-x <n> | Focal point X position 0-100 (default: 50 = center) |
| --focus-y <n> | Focal point Y position 0-100 (default: 50 = center) |
| --dpi <n> | DPI/density for output image (default: 300) |
| --out <path> | Output path, filename or directory (default: ./) |
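The focal-point behaviour described above (center on the focal point, then stay within the image bounds) can be illustrated with a short calculation. This Python sketch is one plausible reading, not the tool's actual implementation. `crop_region` is a hypothetical name, and the rounding and the handling of crops larger than the image are assumptions.

```python
from typing import Tuple


def crop_region(
    img_w: int, img_h: int,
    crop_w: int, crop_h: int,
    focus_x: float = 50.0, focus_y: float = 50.0,
) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) of a crop centered on the focal point."""
    if not (0 <= focus_x <= 100 and 0 <= focus_y <= 100):
        raise ValueError("focus values must be between 0 and 100")
    if crop_w > img_w or crop_h > img_h:
        # Assumption: a crop larger than the image is rejected.
        raise ValueError("crop area is larger than the image")
    # Convert the percentage focal point to pixel coordinates.
    fx = img_w * focus_x / 100
    fy = img_h * focus_y / 100
    # Center the crop on the focal point...
    left = round(fx - crop_w / 2)
    top = round(fy - crop_h / 2)
    # ...then shift it back inside the image if it overflows an edge.
    left = max(0, min(left, img_w - crop_w))
    top = max(0, min(top, img_h - crop_h))
    return left, top, crop_w, crop_h
```

For a 1920x1080 image, `crop_region(1920, 1080, 800, 600)` gives `(560, 240, 800, 600)`. With `focus_x=20, focus_y=30`, the ideal left edge would be negative, so the region is clamped to `(0, 24, 800, 600)`.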

generate

API key required

agent-media image generate --prompt "a cat wearing a hat"
agent-media image generate --prompt "sunset over mountains" --width 1024 --height 768

| Option | Description |
|--------|-------------|
| --prompt <text> | Text description (required) |
| --width <px> | Width (default: 1280) |
| --height <px> | Height (default: 720) |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (fal, replicate, runpod, ai-gateway) |
| --model <name> | Model override (e.g., fal-ai/flux-2, bfl/flux-2-pro) |

edit

API key required

Edit one or more images using a text prompt (image-to-image). Supports multiple input images for combining styles, subjects, or scenes.

agent-media image edit --in sunset-mountains.jpg --prompt "make the sky more vibrant"
agent-media image edit --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/man-portrait.png --prompt "add sunglasses"
agent-media image edit --in style.jpg content.jpg --prompt "apply the style of the first image to the second"

| Option | Description |
|--------|-------------|
| --in <paths...> | One or more input file paths or URLs (required) |
| --prompt <text> | Text description of the desired edit (required) |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (fal, replicate, runpod, ai-gateway) |
| --model <name> | Model override (e.g., fal-ai/flux-2/edit, google/gemini-3-pro-image) |

remove-background

local or cloud

agent-media image remove-background --in man-portrait.png
agent-media image remove-background --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/man-portrait.png

| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (local, fal, replicate) |

upscale

local or cloud

Upscale an image using AI super-resolution to increase resolution with detail generation.

agent-media image upscale --in sunset-mountains.jpg
agent-media image upscale --in sunset-mountains.jpg --scale 4 --provider fal
agent-media image upscale --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --provider replicate

| Option | Description |
|--------|-------------|
| --in <path> | Input file path or URL (required) |
| --scale <n> | Scale factor: 2 or 4 (default: 2). Local provider always outputs 4x. |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (local, fal, replicate) |
| --model <name> | Model override |
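Since the local provider always outputs 4x regardless of `--scale`, a caller that needs to predict output dimensions has to account for the provider. This is a hedged Python sketch: `effective_scale` and `upscaled_size` are hypothetical helpers, and exact output dimensions may differ by provider or model.

```python
from typing import Tuple


def effective_scale(provider: str, scale: int = 2) -> int:
    """Scale factor to expect, given the documented local-provider behaviour."""
    if scale not in (2, 4):
        raise ValueError("scale must be 2 or 4")
    # Per the options table, the local provider ignores --scale and outputs 4x.
    return 4 if provider == "local" else scale


def upscaled_size(width: int, height: int, provider: str, scale: int = 2) -> Tuple[int, int]:
    """Expected output size for an input of width x height."""
    factor = effective_scale(provider, scale)
    return width * factor, height * factor
```

With these assumptions, a 640x480 input would come out as 2560x1920 from the local provider and as 1280x960 from `fal` at the default scale.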

video

# Generate video from text
agent-media video generate --prompt <text>

# Generate video from image (animate an image)
agent-media video generate --in <image> --prompt <text>

generate

API key required

Generate video from a text prompt. Optionally provide an input image to animate it (image-to-video). The prompt describes what should happen in the video.

# Text-to-video
agent-media video generate --prompt "a cat walking through a garden"

# Image-to-video (animate an image)
agent-media video generate --in woman-portrait.png --prompt "person smiles and waves hello"

# With audio/speech generation (runpod)
agent-media video generate --in woman-portrait.png --prompt "The woman says: \"Hello, welcome to our channel!\"" --audio --provider runpod

# With ambient audio (fal)
agent-media video generate --prompt "fireworks in the night sky" --audio --duration 10 --provider fal

# Higher resolution
agent-media video generate --prompt "ocean waves" --resolution 1080p

| Option | Description |
|--------|-------------|
| --prompt <text> | Text description of the video (required) |
| --in <path> | Input image for image-to-video (optional) |
| --duration <s> | Duration in seconds (default: 5 for runpod, 6 for others) |
| --resolution <r> | Resolution: 720p, 1080p (default: 720p) |
| --fps <n> | Frame rate: 25, 50 (default: 25) |
| --audio | Generate audio track (includes speech from quoted text with runpod) |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (fal, replicate, runpod) |
| --model <name> | Model override |
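When an agent builds this command programmatically, assembling an argument list avoids the shell-quoting problems that prompts with embedded quotes cause. This Python sketch uses only the flags and allowed values from the table above. The helper names `default_duration` and `video_generate_args` are hypothetical, and the validation is an illustration rather than the CLI's own checks.

```python
from typing import List, Optional


def default_duration(provider: Optional[str]) -> int:
    """Documented default: 5 seconds for runpod, 6 for the other providers."""
    return 5 if provider == "runpod" else 6


def video_generate_args(
    prompt: str,
    image: Optional[str] = None,
    provider: Optional[str] = None,
    audio: bool = False,
    duration: Optional[int] = None,
    resolution: str = "720p",
    fps: int = 25,
) -> List[str]:
    """Build an argv list for `agent-media video generate`."""
    if resolution not in ("720p", "1080p"):
        raise ValueError("resolution must be 720p or 1080p")
    if fps not in (25, 50):
        raise ValueError("fps must be 25 or 50")
    if provider is not None and provider not in ("fal", "replicate", "runpod"):
        raise ValueError(f"unsupported provider: {provider!r}")
    args = ["agent-media", "video", "generate", "--prompt", prompt]
    if image is not None:
        args += ["--in", image]  # image-to-video
    if duration is not None:
        args += ["--duration", str(duration)]
    if resolution != "720p":  # only pass non-default values
        args += ["--resolution", resolution]
    if fps != 25:
        args += ["--fps", str(fps)]
    if audio:
        args.append("--audio")
    if provider is not None:
        args += ["--provider", provider]
    return args
```

The resulting list can be passed directly to `subprocess.run(args, capture_output=True, text=True)`, so a prompt such as `The woman says: "Hello"` needs no escaping.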

audio

# Extract audio from video
agent-media audio extract --in <video>

# Transcribe audio to text
agent-media audio transcribe --in <audio>

extract

local

Extract audio track from a video file.

agent-media audio extract --in woman-greeting.mp4
agent-media audio extract --in woman-greeting.mp4 --format wav
agent-media audio extract --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/woman-greeting.mp4

| Option | Description |
|--------|-------------|
| --in <path> | Input video file path or URL (required) |
| --format <f> | Output format: mp3, wav (default: mp3) |
| --out <path> | Output path, filename or directory (default: ./) |

transcribe

local or cloud (diarization requires cloud)

Transcribe audio to text with timestamps. Speaker identification (diarization) requires a cloud provider.

agent-media audio transcribe --in woman-greeting.mp3
agent-media audio transcribe --in woman-greeting.mp3 --diarize --speakers 2
agent-media audio transcribe --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/woman-greeting.mp3

| Option | Description |
|--------|-------------|
| --in <path> | Input audio file path or URL (required) |
| --diarize | Enable speaker identification (cloud only) |
| --language <code> | Language code (auto-detected if not provided) |
| --speakers <n> | Number of speakers hint |
| --out <path> | Output path, filename or directory (default: ./) |
| --provider <name> | Provider (local, fal, replicate) |
| --model <name> | Model override |
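Because diarization is cloud-only, a wrapper can reject the combination of `--diarize` with the local provider before the CLI is invoked. This Python sketch builds the argument list from the flags in the table above. `transcribe_args` is a hypothetical helper, and how the CLI itself reacts to that combination is not stated in this document.

```python
from typing import List, Optional

CLOUD_PROVIDERS = ("fal", "replicate")


def transcribe_args(
    audio: str,
    provider: Optional[str] = None,
    diarize: bool = False,
    speakers: Optional[int] = None,
    language: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `agent-media audio transcribe`."""
    if provider is not None and provider not in ("local",) + CLOUD_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    if diarize and provider == "local":
        # Speaker identification is documented as cloud-only.
        raise ValueError("diarization requires a cloud provider")
    if speakers is not None and speakers < 1:
        raise ValueError("speakers must be at least 1")
    args = ["agent-media", "audio", "transcribe", "--in", audio]
    if diarize:
        args.append("--diarize")
    if speakers is not None:
        args += ["--speakers", str(speakers)]
    if language is not None:
        args += ["--language", language]
    if provider is not None:
        args += ["--provider", provider]
    return args
```

When no provider is given, the sketch passes the request through unchanged and leaves provider selection to the CLI.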

Output Format

All commands return JSON to stdout:

{
  "ok": true,
  "media_type": "image",
  "action": "resize",
  "provider": "local",
  "output_path": "resized_123_abc.png",
  "mime": "image/png",
  "bytes": 45678
}

On error:

{
  "ok": false,
  "error": {
    "code": "INVALID_INPUT",
    "message": "At least one of width or height must be specified"
  }
}

Exit code is 0 on success, 1 on error.
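A caller can rely on the `ok` field to separate results from errors. This Python sketch parses the JSON shapes shown above. `AgentMediaError` and `parse_result` are hypothetical names, and the sketch assumes stdout contains exactly one JSON object.

```python
import json


class AgentMediaError(Exception):
    """Raised when the CLI reports "ok": false."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_result(stdout: str) -> dict:
    """Parse the JSON printed by an agent-media command.

    Returns the result dict on success; raises AgentMediaError otherwise.
    """
    result = json.loads(stdout)
    if not result.get("ok"):
        error = result.get("error") or {}
        raise AgentMediaError(
            error.get("code", "UNKNOWN"),
            error.get("message", ""),
        )
    return result
```

Typical use would be `parse_result(subprocess.run(args, capture_output=True, text=True).stdout)["output_path"]`, which yields the path of the generated file so it can be fed into the next command.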

Providers

Default Models

| Provider | resize | convert | extend | crop | image generate | image edit | remove-background | upscale | video generate | transcribe |
|----------|--------|---------|--------|------|----------------|------------|-------------------|---------|----------------|------------|
| local | ✓* | ✓* | ✓* | ✓* | - | - | Xenova/modnet** | Xenova/swin2SR** | - | moonshine-base** |
| fal | - | - | - | - | fal-ai/flux-2 | fal-ai/flux-2/edit | fal-ai/birefnet/v2 | fal-ai/esrgan | fal-ai/ltx-2 | fal-ai/wizper |
| replicate | - | - | - | - | black-forest-labs/flux-2-dev | black-forest-labs/flux-kontext-dev | men1scus/birefnet | nightmareai/real-esrgan | lightricks/ltx-video | whisper-diarization |
| runpod | - | - | - | - | alibaba/wan-2.6 | google/nano-banana-pro-edit | - | - | wan-2.6 | - |
| ai-gateway | - | - | - | - | bfl/flux-2-pro | google/gemini-3-pro-image | - | - | - | - |

* Powered by Sharp for fast image processing
** Powered by Transformers.js for local ML inference (models downloaded on first use)

Use --model <name> to override the default model for any command.

Provider Selection

1. Explicit flag (highest priority): --provider fal
2. Environment auto-detect: Set FAL_API_KEY, REPLICATE_API_TOKEN, RUNPOD_API_KEY, or AI_GATEWAY_API_KEY to auto-select that provider
3. Fallback to local: For resize/convert when no provider specified
4. First supporting provider: For generate/remove-background
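The selection order above can be sketched as a small resolver. This Python example is one possible reading of the rules, not the tool's actual logic. The support map is taken from the Default Models table, while `resolve_provider`, the order in which environment variables are checked, and the local fallback for every action with local support are assumptions.

```python
import os
from typing import Mapping, Optional

# Providers per action, read off the Default Models table.
SUPPORT = {
    "resize": ["local"],
    "convert": ["local"],
    "extend": ["local"],
    "crop": ["local"],
    "image generate": ["fal", "replicate", "runpod", "ai-gateway"],
    "image edit": ["fal", "replicate", "runpod", "ai-gateway"],
    "remove-background": ["local", "fal", "replicate"],
    "upscale": ["local", "fal", "replicate"],
    "video generate": ["fal", "replicate", "runpod"],
    "transcribe": ["local", "fal", "replicate"],
}

# Assumed check order; the README does not say which key wins if several are set.
ENV_KEYS = [
    ("fal", "FAL_API_KEY"),
    ("replicate", "REPLICATE_API_TOKEN"),
    ("runpod", "RUNPOD_API_KEY"),
    ("ai-gateway", "AI_GATEWAY_API_KEY"),
]


def resolve_provider(
    action: str,
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a provider: explicit flag, then environment keys, then local."""
    env = os.environ if env is None else env
    supported = SUPPORT[action]
    if explicit is not None:
        if explicit not in supported:
            raise ValueError(f"{explicit!r} does not support {action!r}")
        return explicit
    # First provider that has a key set and supports the action.
    for name, variable in ENV_KEYS:
        if env.get(variable) and name in supported:
            return name
    if "local" in supported:
        return "local"
    raise LookupError(f"no configured provider supports {action!r}")
```

Passing `env` explicitly keeps the function easy to test. In normal use it falls back to `os.environ`.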

Environment Variables

| Variable | Description | Get Key |
|----------|-------------|---------|
| FAL_API_KEY | fal.ai API key | fal.ai |
| REPLICATE_API_TOKEN | Replicate API token | replicate.com |
| RUNPOD_API_KEY | Runpod API key | runpod.io |
| AI_GATEWAY_API_KEY | AI Gateway API key | vercel.com |
| AGENT_MEDIA_DIR | Output directory (default: current directory) | - |

Roadmap

[x] Local background removal (zero API keys)
[x] Local transcription (zero API keys)
[x] Video generation (text-to-video and image-to-video)
[ ] Batch processing support
