@scopeful/replicate-models-runner
v1.0.0
name: replicate-models-runner
description: Use this skill whenever the user wants to call a Replicate model from code, an agent, or its MCP server. Triggers include any mention of "Replicate", "replicate.run", "predictions.create", "replicate-mcp", running Flux / SDXL / Llama on Replicate, polling a prediction, setting up Replicate webhooks, or asking why a Replicate model is slow or expensive.
Run Replicate models without burning compute
Replicate is a serverless GPU API. Most agents call it wrong: they hard-code replicate.run() against a public model name, hit a cold start every time, then panic-poll the prediction at 100ms intervals. This skill fixes that. Replicate bills per second of GPU time, so every avoided second is real money saved.
When to use Replicate
Use it when:
- Calling open-source models (Flux, SDXL, Whisper, Llama, upscalers, depth models) without standing up GPU infra
- Consolidating many models under one billing pane instead of separate accounts at Fal, Together, Modal
- Fine-tuning (LoRA training is first-class on Replicate)
- Deploying your own model via cog to a private endpoint
Do not reach for Replicate when:
- Sub-second latency matters on every call (cold starts hurt; use Fal or a hot deployment)
- The user wants ComfyUI graph execution (use ComfyUI Cloud or RunComfy)
- Text generation at scale (token-priced APIs from Anthropic, OpenAI, Groq win on price and latency)
- The model has a non-commercial license and the user is shipping a paid product without going through Replicate's hosted endpoint (see License gotcha below)
Install
Python (1.0.7): pip install replicate. Node (1.4.0): npm install replicate. Go: go get github.com/replicate/replicate-go. Swift, Elixir, Ruby clients also exist. Set REPLICATE_API_TOKEN=r8_... in env.
MCP server (official, hosted)
Replicate ships its own MCP server in two forms: a remote-hosted one at mcp.replicate.com (auto-updated with the HTTP API) and a local stdio version (replicate-mcp on npm). Config for the local version:
// claude_desktop_config.json / .cursor/mcp.json / .vscode/mcp.json
{
  "mcpServers": {
    "replicate": {
      "command": "npx",
      "args": ["-y", "replicate-mcp"],
      "env": { "REPLICATE_API_TOKEN": "r8_..." }
    }
  }
}

The MCP exposes the full HTTP API surface: model search, prediction create/get/cancel, deployment management, file uploads. Tool names mirror the API verbs (search_models, create_prediction, get_prediction, cancel_prediction, list_predictions). These exact tool names are unverified; list the server's tools and confirm them before relying on any of them.
How calls should be structured
The single most common agent mistake: calling a model by name and hoping for the best. Pin the version.
import replicate

# Bad: ambient version, output changes silently when the model owner updates
output = replicate.run("black-forest-labs/flux-dev", input={"prompt": "..."})

# Good: pinned version, reproducible across months
output = replicate.run(
    "black-forest-labs/flux-dev:843b6e1c...",  # 64-char version hash
    input={"prompt": "a tabby cat in soft window light"},
)

Get the version hash from the model's "Versions" tab on replicate.com, or query the API: GET /v1/models/{owner}/{name}/versions. Hard-code it. Bump it deliberately, not silently.
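A small guard can enforce the pinning rule before a call goes out. The sketch below is not part of the Replicate SDK: is_pinned and require_pinned are hypothetical helpers, and the pattern assumes a pinned reference has the form owner/name:<64 lowercase hex characters>, as described above.

```python
import re

# Assumption: a pinned reference looks like "owner/name:<64 lowercase hex chars>".
_PINNED_REF = re.compile(r"^[\w.-]+/[\w.-]+:[0-9a-f]{64}$")


def is_pinned(model_ref: str) -> bool:
    """Return True if model_ref carries a full 64-character version hash."""
    return bool(_PINNED_REF.match(model_ref))


def require_pinned(model_ref: str) -> str:
    """Raise early instead of silently running whatever version is current."""
    if not is_pinned(model_ref):
        raise ValueError(f"unpinned model reference: {model_ref!r}")
    return model_ref
```

Wrapping the first argument of replicate.run() in require_pinned(...) turns an accidental unpinned call into an immediate error rather than a silent change in output months later.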
replicate.run() is the high-level helper: it creates a prediction, waits for completion, returns the output. For anything longer than ~10 seconds, prefer the low-level predictions.create() so you control polling and don't tie up the calling process.
Predictions lifecycle and polling
Statuses: starting -> processing -> succeeded | failed | canceled.
import time

import replicate

pred = replicate.predictions.create(
    version="black-forest-labs/flux-dev:843b6e1c...",
    input={"prompt": "..."},
)
while pred.status not in ("succeeded", "failed", "canceled"):
    time.sleep(2)  # 2s is plenty; faster gets you 429s
    pred.reload()
output_urls = pred.output if pred.status == "succeeded" else None

Three rules: (1) Poll every 2-5 seconds, never tighter; (2) for short predictions, prefer replicate.run() or the Prefer: wait=n header (one request, no polling); (3) for long predictions (video, training, slow upscalers), use webhooks instead of polling.
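The first rule can be wrapped in a reusable helper. This is a minimal sketch, not an SDK function: wait_until_done, its parameters, and the 1.5x widening factor are illustrative assumptions. It takes a get_status callable so it can be exercised without network access, starts at a 2 second gap, and widens it up to 5 seconds.

```python
import time
from typing import Callable

TERMINAL = {"succeeded", "failed", "canceled"}


def wait_until_done(
    get_status: Callable[[], str],
    interval: float = 2.0,
    max_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until a terminal status, widening the gap from 2s to 5s."""
    waited = 0.0
    status = get_status()
    while status not in TERMINAL:
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
        # Back off gently; never poll tighter than the starting interval.
        interval = min(interval * 1.5, max_interval)
        status = get_status()
    return status
```

With the SDK, get_status could be a small function that calls pred.reload() and then returns pred.status. The timeout keeps a stuck prediction from tying up the calling process indefinitely.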
Webhook pattern for long-running predictions
pred = replicate.predictions.create(
    version="...",
    input={"prompt": "..."},
    webhook="https://your.app/replicate/hook",
    webhook_events_filter=["completed"],  # also: start, output, logs
)

The webhook POST body is identical to a predictions.get response. Verify the signature before trusting it: validateWebhook() (JS) / replicate.signatures.verify() (Python) using the webhook secret from your account settings. Skipping verification means anyone with your webhook URL can forge completions.
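To make the verification step concrete, here is a standard-library sketch of what such a check typically involves. It is not the SDK helper. It assumes the delivery carries webhook-id, webhook-timestamp, and webhook-signature headers, that the secret is "whsec_" followed by a base64 key, and that the signature is a base64 HMAC-SHA256 over "id.timestamp.body". Confirm those details against Replicate's webhook documentation, and prefer the SDK helper in production.

```python
import base64
import hashlib
import hmac
import time
from typing import Mapping, Optional


def verify_webhook(headers: Mapping[str, str], body: str, secret: str,
                   tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Return True if a signature in the header matches an HMAC of id.timestamp.body."""
    msg_id = headers["webhook-id"]
    timestamp = headers["webhook-timestamp"]
    # Reject stale deliveries to limit replay attacks.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False
    # Assumption: the secret is "whsec_" followed by a base64-encoded key.
    raw = secret[len("whsec_"):] if secret.startswith("whsec_") else secret
    key = base64.b64decode(raw)
    signed = f"{msg_id}.{timestamp}.{body}".encode()
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest())
    # Assumption: the header holds space-separated "v1,<signature>" entries.
    candidates = [p.split(",", 1)[-1] for p in headers["webhook-signature"].split()]
    return any(hmac.compare_digest(expected, c.encode()) for c in candidates)
```

The comparison uses hmac.compare_digest so that timing differences do not leak how much of a forged signature was correct. The body must be the raw request text, not a re-serialized JSON object.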
Streaming (LLMs and supported models only)
Llama-family models and a handful of others stream tokens over SSE.
for event in replicate.stream(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "..."},
):
    print(event, end="")  # event.data has the token

Streaming only works when the model declared support. Image and video models don't stream output; they finish then return URLs.
Hardware and cost reference
Replicate bills per second of GPU/CPU time. Pick the smallest tier the model fits on.
| Tier | Approx $/sec | Good for |
|------|--------------|----------|
| CPU | $0.0001 | Tiny utilities, format converters |
| T4 | $0.000225 | SD 1.5, small classifiers, Whisper-small |
| A40 | ~$0.000725 | SDXL, mid-size diffusion |
| L40S | $0.000975 | Modern diffusion, mid-size LLMs |
| A100 80GB | $0.0014 | Flux Dev, Llama 70B, video |
| H100 | $0.001525 | Largest models, lowest wall-clock time |
Multi-GPU (4x/8x A100, H100, L40S) needs a committed-spend contract. Rates are confirmed; some hosted models (Flux Schnell, Whisper) are flat per prediction, not per-second. Live USD math at scopeful.org/tools/replicate.
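The per-second table translates directly into a rough estimate. The sketch below copies the approximate rates from the table as a snapshot. estimate_cost is a hypothetical helper, and it does not cover models billed flat per prediction.

```python
# Approximate per-second USD rates copied from the table above; a snapshot, not live pricing.
RATE_PER_SEC = {
    "CPU": 0.0001,
    "T4": 0.000225,
    "A40": 0.000725,
    "L40S": 0.000975,
    "A100 80GB": 0.0014,
    "H100": 0.001525,
}


def estimate_cost(predict_time_s: float, tier: str) -> float:
    """Rough USD cost for a per-second-billed run on the given tier."""
    return predict_time_s * RATE_PER_SEC[tier]
```

For example, a 12-second run on A100 80GB comes to roughly $0.0168, and the same wall-clock time on H100 costs about 6.8 times what it would on T4.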
Common gotchas
- Output URLs expire after 1 hour (not 24h). Download and rehost immediately. URLs are on replicate.delivery. Web-UI predictions are kept indefinitely; API predictions are not
- Cold starts on public models are typically 5-60 seconds the first time per hardware pool. Same model called again within a few minutes stays warm. For predictable warmth, create a deployment with min_instances >= 1. Only worth it above ~1 request/minute, otherwise you're paying for idle GPU
- Version pinning is non-optional for production. Model owners can publish new versions that silently change output. Pin the 64-char hash
- FileOutput vs URL string. Python SDK 1.0+ returns FileOutput objects. To get strings back, pass use_file_output=False to replicate.run(), or read .url on the returned object
- Failed predictions are still billed. A run that errored still consumed GPU time
- Concurrency: default account limit is 600 prediction creates per minute. Burst beyond that returns 429
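Because outputs may arrive as strings, FileOutput objects, or lists of either, it can help to normalize them before downloading. This is a sketch under the assumption that FileOutput-style objects expose a .url attribute, as the gotcha above describes. output_to_urls is a hypothetical helper, not an SDK function.

```python
from typing import Any, List


def output_to_urls(output: Any) -> List[str]:
    """Flatten a prediction output into a list of URL strings."""
    if output is None:
        return []
    items = output if isinstance(output, (list, tuple)) else [output]
    urls = []
    for item in items:
        # Plain strings are already URLs; FileOutput-style objects expose .url.
        urls.append(item if isinstance(item, str) else str(item.url))
    return urls
```

Running the result through a download step right away sidesteps the 1-hour expiry.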
License gotcha (do not skip)
Replicate hosts models with mixed licenses. The platform itself does not gatekeep, but the model license follows the output. Flux Dev is non-commercial when self-hosted, but Replicate holds a commercial agreement with Black Forest Labs, so images generated through Replicate's hosted endpoint are commercially usable. Pull the same weights to your own GPU and that exemption is gone. Before shipping a paid product, read the license box on the model page. When in doubt, flag it.
What to deliver to the user
When you run a prediction on the user's behalf, return:
- The output URLs (or downloaded files, given the 1h expiry)
- Prediction ID so they can find it in the Replicate dashboard
- metrics.predict_time (actual billed seconds) and a rough cost estimate
- The hardware tier the model ran on (visible in the prediction response)
If a prediction fails, return error verbatim and stop. Do not auto-retry without telling the user.
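The checklist above can be assembled in one place. The sketch below assumes the prediction response is available as a plain dict with id, status, output, error, and metrics.predict_time keys. The hardware tier and its per-second rate are passed in by the caller, because the source does not name the response field that carries the tier. summarize_prediction is a hypothetical helper.

```python
from typing import Any, Dict


def summarize_prediction(pred: Dict[str, Any], tier: str,
                         rate_per_sec: float) -> Dict[str, Any]:
    """Collect the fields worth reporting back to the user."""
    if pred.get("status") != "succeeded":
        # Surface the error verbatim; retrying is the user's decision.
        return {"id": pred.get("id"), "status": pred.get("status"),
                "error": pred.get("error")}
    predict_time = (pred.get("metrics") or {}).get("predict_time")
    cost = None if predict_time is None else round(predict_time * rate_per_sec, 6)
    return {
        "id": pred["id"],
        "status": "succeeded",
        "output": pred.get("output"),
        "predict_time_s": predict_time,
        "hardware_tier": tier,
        "estimated_cost_usd": cost,
    }
```

Keeping the failure path separate makes it harder to accidentally report a cost or output for a run that errored.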
What NOT to do
- Don't loop replicate.run() in a tight while-loop; that's a thousand cold starts
- Don't hard-code an unpinned owner/model string in production
- Don't poll faster than every 2 seconds
- Don't download an output URL "later"; it's gone after 1 hour
- Don't skip webhook signature verification
- Don't quote a USD price without naming the hardware tier; T4 vs H100 is a 7x delta on the same model
Useful follow-ups
- For sub-second image generation, point the user to Fal.ai: same Flux models, hot endpoints, no cold start
- For ComfyUI graph execution, point to ComfyUI Cloud or RunComfy
- For LoRA training on Flux, Replicate's ostris/flux-dev-lora-trainer is the canonical path
- For their own private model: build with cog, push to Replicate, create a deployment with min_instances=1
- Always cross-check the live billing math at scopeful.org/tools/replicate before quoting prices
