@scopeful/replicate-models-runner

v1.0.0

Published

16 days ago

Use this skill whenever the user wants to call a Replicate model from code, an agent, or its MCP server. Triggers include any mention of Replicate, replicate.run, predictions.create, replicate-mcp, running Flux / SDXL / Llama on Replicate, polling a predi

Downloads

0High
0Medium
0Low

igorgridel

agent skill coding-agent

name: replicate-models-runner description: Use this skill whenever the user wants to call a Replicate model from code, an agent, or its MCP server. Triggers include any mention of "Replicate", "replicate.run", "predictions.create", "replicate-mcp", running Flux / SDXL / Llama on Replicate, polling a prediction, setting up Replicate webhooks, or asking why a Replicate model is slow or expensive.

Run Replicate models without burning compute

Replicate is a serverless GPU API. Most agents call it wrong: they hard-code replicate.run() against a public model name, hit a cold start every time, then panic-poll the prediction at 100ms intervals. This skill fixes that. Replicate bills per second of GPU time, so every avoided second is real money saved.

When to use Replicate

Use it when:

Calling open-source models (Flux, SDXL, Whisper, Llama, upscalers, depth models) without standing up GPU infra
One billing pane for many models instead of separate accounts at Fal, Together, Modal
Fine-tuning (LoRA training is first-class on Replicate)
Deploying your own model via cog to a private endpoint

Do not reach for Replicate when:

Sub-second latency matters on every call (cold starts hurt; use Fal or a hot deployment)
The user wants ComfyUI graph execution (use ComfyUI Cloud or RunComfy)
Text generation at scale (token-priced APIs from Anthropic, OpenAI, Groq win on price and latency)
The model has a non-commercial license and the user is shipping a paid product without going through Replicate's hosted endpoint (see License gotcha below)

Install

Python (1.0.7): pip install replicate. Node (1.4.0): npm install replicate. Go: go get github.com/replicate/replicate-go. Swift, Elixir, Ruby clients also exist. Set REPLICATE_API_TOKEN=r8_... in env.

MCP server (official, hosted)

Replicate ships its own MCP server. Remote-hosted at mcp.replicate.com (auto-updated with the HTTP API), plus a local stdio version (replicate-mcp on npm):

// claude_desktop_config.json / .cursor/mcp.json / .vscode/mcp.json
{
  "mcpServers": {
    "replicate": {
      "command": "npx",
      "args": ["-y", "replicate-mcp"],
      "env": { "REPLICATE_API_TOKEN": "r8_..." }
    }
  }
}

The MCP exposes the full HTTP API surface: model search, prediction create/get/cancel, deployment management, file uploads. Tool names mirror the API verbs (search_models, create_prediction, get_prediction, cancel_prediction, list_predictions). [VERIFY] exact tool naming.

How calls should be structured

The single most common agent mistake: calling a model by name and hoping for the best. Pin the version.

# Bad: ambient version, output changes silently when the model owner updates
output = replicate.run("black-forest-labs/flux-dev", input={"prompt": "..."})

# Good: pinned version, reproducible across months
output = replicate.run(
    "black-forest-labs/flux-dev:843b6e1c...",  # 64-char version hash
    input={"prompt": "a tabby cat in soft window light"}
)

Get the version hash from the model's "Versions" tab on replicate.com, or query the API: GET /v1/models/{owner}/{name}/versions. Hard-code it in your code. Bump it deliberately, not silently.

replicate.run() is the high-level helper: it creates a prediction, waits for completion, returns the output. For anything longer than ~10 seconds, prefer the low-level predictions.create() so you control polling and don't tie up the calling process.

Predictions lifecycle and polling

Statuses: starting -> processing -> succeeded | failed | canceled.

import replicate, time

pred = replicate.predictions.create(
    version="black-forest-labs/flux-dev:843b6e1c...",
    input={"prompt": "..."}
)
while pred.status not in ("succeeded", "failed", "canceled"):
    time.sleep(2)         # 2s is plenty; faster gets you 429s
    pred.reload()
output_urls = pred.output if pred.status == "succeeded" else None

Three rules: (1) Poll every 2-5 seconds, never tighter; (2) for short predictions, prefer replicate.run() or the Prefer: wait=n header (one request, no polling); (3) for long predictions (video, training, slow upscalers), use webhooks instead of polling.

Webhook pattern for long-running predictions

pred = replicate.predictions.create(
    version="...", input={"prompt": "..."},
    webhook="https://your.app/replicate/hook",
    webhook_events_filter=["completed"]  # also: start, output, logs
)

The webhook POST body is identical to a predictions.get response. Verify the signature before trusting it: validateWebhook() (JS) / replicate.signatures.verify() (Python) using the webhook secret from your account settings. Skipping verification means anyone with your webhook URL can forge completions.

Streaming (LLMs and supported models only)

Llama-family models and a handful of others stream tokens over SSE.

for event in replicate.stream(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "..."}
):
    print(event, end="")  # event.data has the token

Streaming only works when the model declared support. Image and video models don't stream output; they finish then return URLs.

Hardware and cost reference

Replicate bills per second of GPU/CPU time. Pick the smallest tier the model fits on.

| Tier | Approx $/sec | Good for | |------|--------------|----------| | CPU | $0.0001 | Tiny utilities, format converters | | T4 | $0.000225 | SD 1.5, small classifiers, Whisper-small | | A40 | ~$0.000725 | SDXL, mid-size diffusion | | L40S | $0.000975 | Modern diffusion, mid-size LLMs | | A100 80GB | $0.0014 | Flux Dev, Llama 70B, video | | H100 | $0.001525 | Largest models, lowest wall-clock time |

Multi-GPU (4x/8x A100, H100, L40S) needs a committed-spend contract. Rates are confirmed; some hosted models (Flux Schnell, Whisper) are flat per prediction, not per-second. Live USD math at scopeful.org/tools/replicate.

Common gotchas

Output URLs expire after 1 hour (not 24h). Download and rehost immediately. URLs are on replicate.delivery. Web-UI predictions are kept indefinitely; API predictions are not
Cold starts on public models are typically 5-60 seconds the first time per hardware pool. Same model called again within a few minutes stays warm. For predictable warmth, create a deployment with min_instances >= 1. Only worth it above ~1 request/minute, otherwise you're paying for idle GPU
Version pinning is non-optional for production. Model owners can publish new versions that silently change output. Pin the 64-char hash
FileOutput vs URL string. Python SDK 1.0+ returns FileOutput objects. To get strings back, pass use_file_output=False to replicate.run(), or call .url
Failed predictions are still billed. A run that errored still consumed GPU time
Concurrency: default account limit is 600 prediction creates per minute. Burst beyond that returns 429

License gotcha (do not skip)

Replicate hosts models with mixed licenses. The platform itself does not gatekeep, but the model license follows the output. Flux Dev is non-commercial when self-hosted, but Replicate holds a commercial agreement with Black Forest Labs, so images generated through Replicate's hosted endpoint are commercially usable. Pull the same weights to your own GPU and that exemption is gone. Before shipping a paid product, read the license box on the model page. When in doubt, flag it.

What to deliver to the user

When you run a prediction on the user's behalf, return:

The output URLs (or downloaded files, given the 1h expiry)
Prediction ID so they can find it in the Replicate dashboard
metrics.predict_time (actual billed seconds) and a rough cost estimate
The hardware tier the model ran on (visible in the prediction response)

If a prediction fails, return error verbatim and stop. Do not auto-retry without telling the user.

What NOT to do

Don't loop replicate.run() in a tight while-loop; that's a thousand cold starts
Don't hard-code an unpinned owner/model string in production
Don't poll faster than every 2 seconds
Don't download an output URL "later"; it's gone after 1 hour
Don't skip webhook signature verification
Don't quote a USD price without naming the hardware tier; T4 vs H100 is a 7x delta on the same model

Useful follow-ups

For sub-second image generation, point the user to Fal.ai: same Flux models, hot endpoints, no cold start
For ComfyUI graph execution, point to ComfyUI Cloud or RunComfy
For LoRA training on Flux, Replicate's ostris/flux-dev-lora-trainer is the canonical path
For their own private model: build with cog, push to Replicate, create a deployment with min_instances=1
Always cross-check the live billing math at scopeful.org/tools/replicate before quoting prices the live billing math at scopeful.org/tools/replicate before quoting prices