@runtime-judgement/mcp-server
v0.1.0
Published
Runtime Judgement MCP server — verify, attribute, and snapshot from inside Claude Code / Codex CLI / Aider. Three tools that fit the inner-loop verification position from the Sprint 11 capability roadmap.
@runtime-judgement/mcp-server
A Model Context Protocol server that gives Claude Code, Codex CLI, Aider and any other MCP-aware coding agent three Runtime Judgement tools to call mid-session, before commit:
- rj.verify_change: run the snapshot suite against the current pipeline to verify a patch hasn't regressed the locked-in behaviour.
- rj.attribute_trace: ingest a failed trace, attribute the root cause, return the cited cause + L1/L4 verdict + suggested fix.
- rj.suggest_snapshot: lock a verdict in as a regression snapshot so the next rj.verify_change call guards against it.
This is the inner-loop verification position from the Sprint 11 capability roadmap (§7 Demo 6). The coding agent never leaves its session to check the web app — verification happens inside the same prompt cycle.
The first dog-food customer is this repo: the test plan for the package
is to point Claude Code at the runtime-judgement-app codebase, have it
make a patch, and call rj.verify_change against the canonical suite
before committing. Eat the dog food.
One-liner setup
Note: @runtime-judgement/mcp-server is not yet published to npm. Until it is, install from source (see below).
Install from source:
git clone https://github.com/rambo-01/runtime-judgement-app
cd runtime-judgement-app
pnpm install
pnpm --filter @runtime-judgement/mcp-server build
Then point your MCP config at:
{
  "command": "node",
  "args": ["/absolute/path/to/runtime-judgement-app/packages/rj-mcp-server/dist/index.js"]
}
The path-with-spaces fix (10ba427) is already in main, so paths like ~/Claude Code/... work correctly.
Once published to npm, the one-liner will be:
# Works in Claude Code, Cursor (via Claude), Windsurf, Cline, and any MCP-compatible tool
npx @runtime-judgement/mcp-server
The server will prompt for your API key on first run and write it to ~/.rj-config.json.
Or set it directly:
RJ_API_KEY=rj_live_... npx @runtime-judgement/mcp-server
Get your key at https://runtime-judgement.app/app/settings/api-keys
Quick start with Claude Code
1. Get your RJ API token from https://runtime-judgement.app/app/integrations.
2. Add to your ~/.config/claude-code/mcp.json:
{
  "rj": {
    "command": "npx",
    "args": ["-y", "@runtime-judgement/mcp-server"],
    "env": {
      "RJ_API_URL": "https://runtime-judgement-app.vercel.app",
      "RJ_API_KEY": "rj_..."
    }
  }
}
3. Restart Claude Code. The rj MCP server appears in your tool list with three tools: rj.attribute_trace, rj.verify_change, rj.suggest_snapshot.
4. Ask Claude: "Read the trace at ~/Downloads/demo-trace.json and call rj.attribute_trace on it. The user-visible failure surfaced in span lookup_order_status."
5. Claude calls rj.attribute_trace and returns the cited cause + suggested fix as inline Markdown, so there is no need to leave the terminal.
6. After patching, ask Claude: "Now call rj.verify_change against suite 01HZ... to confirm the fix." Claude reports the verdict (pass / regression / drift) so you know whether to commit.
Each tool returns a human_readable Markdown field alongside the
structured JSON, so Claude Code can surface a clean summary inline. The
structured payload is preserved for machine consumers and for the next
agent step.
Available tools
| Tool | Purpose | One-line example |
|---|---|---|
| rj.attribute_trace | Attribute a failed trace to its root cause | rj.attribute_trace({ trace: <otel json>, errorSpanId: "lookup_order_status" }) |
| rj.verify_change | Verify a patch hasn't regressed locked-in behaviour | rj.verify_change({ suiteId: "01HZSUITE..." }) |
| rj.suggest_snapshot | Lock an attribution in as a regression snapshot | rj.suggest_snapshot({ attributionId: "01HZATTR...", name: "tool-args-guard" }) |
Full input/output reference is in the Tool reference section below.
5-minute setup
1. Install
Note: @runtime-judgement/mcp-server is not yet on npm. Until it is published, see the "One-liner setup" section above for the install-from-source path.
Once published, the server can be run via npx:
npx @runtime-judgement/mcp-server
Or installed globally:
pnpm add -g @runtime-judgement/mcp-server
2. Set the environment
# Runtime Judgement — get these from https://runtime-judgement-app.vercel.app/app/settings
export RJ_API_URL="https://runtime-judgement-app.vercel.app"
export RJ_API_KEY="rj_..."
The server inherits whatever env it's launched in. The agent's own config
file (Claude Code: ~/.config/claude-code/mcp_servers.json; Codex CLI:
~/.config/codex/mcp_servers.toml) is the right place to declare these.
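As a rough illustration of how a launcher script might check these two variables before spawning the server, the sketch below reads them from an env-like object. The function name loadRjConfig and its error message are hypothetical, and treating RJ_API_URL as optional is an assumption (the generic host config in this README passes only RJ_API_KEY); the server's real startup logic may differ.

```typescript
// Hypothetical helper, not part of the package: validates the two variables
// named above before a launcher hands them to the server process.
interface RjConfig {
  apiKey: string;
  apiUrl?: string; // ASSUMPTION: optional, since one config in this README omits it
}

function loadRjConfig(env: Record<string, string | undefined>): RjConfig {
  const apiKey = (env["RJ_API_KEY"] ?? "").trim();
  if (apiKey === "") {
    throw new Error("RJ_API_KEY is not set");
  }
  const rawUrl = (env["RJ_API_URL"] ?? "").trim();
  // Strip trailing slashes so later path joins do not produce "//".
  const apiUrl = rawUrl === "" ? undefined : rawUrl.replace(/\/+$/, "");
  return apiUrl === undefined ? { apiKey } : { apiKey, apiUrl };
}
```

In a Node launcher this would typically be called as loadRjConfig(process.env).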
3. Register with your agent
Claude Code
{
  "mcpServers": {
    "runtime-judgement": {
      "command": "npx",
      "args": ["@runtime-judgement/mcp-server"],
      "env": {
        "RJ_API_URL": "https://runtime-judgement-app.vercel.app",
        "RJ_API_KEY": "rj_..."
      }
    }
  }
}
Codex CLI
[mcp_servers.runtime-judgement]
command = "npx"
args = ["@runtime-judgement/mcp-server"]
env = { RJ_API_URL = "https://runtime-judgement-app.vercel.app", RJ_API_KEY = "rj_..." }
Aider
# .aider.conf.yml
mcp_servers:
  - name: runtime-judgement
    command: ["npx", "@runtime-judgement/mcp-server"]
    env:
      RJ_API_URL: https://runtime-judgement-app.vercel.app
      RJ_API_KEY: rj_...
Use without Claude Code (any MCP host)
Any tool that supports the Model Context Protocol can connect to this server. Use the following generic JSON config block and adapt the key names to your host:
{
  "mcpServers": {
    "runtime-judgement": {
      "command": "npx",
      "args": ["-y", "@runtime-judgement/mcp-server"],
      "env": {
        "RJ_API_KEY": "rj_live_..."
      }
    }
  }
}
| Host | Config file location |
|---|---|
| Claude Code | ~/.claude/mcp.json or .claude/mcp.json in the project root |
| Cursor | .cursor/mcp.json in the project root |
| Windsurf | ~/.codeium/windsurf/mcp_config.json |
| Cline | VS Code settings → Cline MCP Servers |
| Any stdio-MCP host | Point command at npx @runtime-judgement/mcp-server and pass RJ_API_KEY via env |
4. Use it
The agent will see three tools in its tool list. Ask:
"Before you commit this patch, run rj.verify_change against suite 01HZ... and tell me the verdict."
The tool will POST to the suite-run endpoint, wait for the result, and return verdict counts + cited spans so the agent can decide whether to proceed.
Tool reference
rj.verify_change
| Arg | Type | Required | Notes |
| -------------- | ---------------- | -------- | ---------------------------------------------------- |
| suiteId | string | yes | Snapshot suite ULID |
| tags | string[] | no | Run only snapshots tagged with these tags |
| perturbation | object | no | Forward-compat hint about what the agent changed |
Returns:
{
  suiteId: string
  suiteName?: string
  verdict: "pass" | "regression" | "drift" | "error" | "empty"
  counts: {
    total: number
    passed: number
    changedIntentional: number
    changedUnexpected: number
    skipped: number
    errored: number
  }
  outcomes: Array<{
    snapshotId: string
    status: "passed" | "changed-intentional" | "changed-unexpected" | "skipped" | "error"
    citedSpanIds?: string[]
    message?: string
  }>
  runIds: string[]
  durationMs?: number
  spendUsd?: number
}
The verdict field is the single-string summary the agent should look at:
- pass: every snapshot is unchanged. Proceed with commit.
- regression: at least one snapshot's verdict changed unexpectedly. Stop and inspect outcomes to see which.
- drift: every change is changed-intentional. The agent should confirm with the user whether the intentional drift is what was wanted.
- error: at least one snapshot failed to replay (judge timeout, pipeline crash, etc.). Surface in agent output as a transient issue.
- empty: suite has no snapshots. Configuration issue, not a verdict.
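The verdict list above maps naturally onto a small decision function. The sketch below is one possible way an agent wrapper might encode it; the action names are hypothetical labels invented for illustration, and only the five verdict strings come from the return type documented here.

```typescript
// The five verdict strings are taken from the rj.verify_change return type.
type Verdict = "pass" | "regression" | "drift" | "error" | "empty";

// Hypothetical action labels, introduced only for this sketch.
type AgentAction =
  | "commit"
  | "inspect-outcomes"
  | "confirm-with-user"
  | "report-transient-issue"
  | "fix-configuration";

// Translate a verdict into the next step described in the list above.
function nextAction(verdict: Verdict): AgentAction {
  switch (verdict) {
    case "pass":
      return "commit";
    case "regression":
      return "inspect-outcomes";
    case "drift":
      return "confirm-with-user";
    case "error":
      return "report-transient-issue";
    case "empty":
      return "fix-configuration";
  }
}
```

Keeping the mapping exhaustive over the union means the compiler flags the function if a new verdict string is ever added to the type.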
rj.attribute_trace
| Arg | Type | Required | Notes |
| ----------------- | --------- | -------- | -------------------------------------------------- |
| trace | object | yes | Raw trace JSON (OTEL gen-ai / LangSmith / custom) |
| errorSpanId | string | yes | Span where the user-visible failure surfaced |
| errorDescription| string | no | Human description (improves judge precision) |
| errorEvidence | string | no | Verbatim quote from the failure |
| pipeline | string | no | Pipeline name override (defaults to q72-k1) |
Returns:
{
  attributionId: string
  traceId: string
  sourceFormat: "otel-genai" | "langsmith" | "custom-json"
  spanCount?: number
  deduped: boolean
  l1: { axis: string; confidence: number }
  l4: { category: string; confidence: number }
  citedSpans: string[]
  explanation: string | null
  suggestedFix: string | null
  cost: { usd?: number } | null
  algoVersions: Record<string, unknown> | null
}
Two-call dance under the hood: ingest the trace via POST /api/traces,
then run the attribution pipeline via POST /api/attributions. Both
calls share the same Bearer token.
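The two-call sequence can be pictured roughly as follows. This is a sketch under assumptions, not the server's actual code: only the two endpoint paths and the shared Bearer token come from this README, while the request bodies, the traceId field in the ingest response, and the FetchLike type are illustrative. The fetch function is injected so the flow can be exercised with a stub.

```typescript
// Minimal fetch-like signature so the sketch does not depend on a global fetch.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// ASSUMPTION: body shapes below are illustrative, not the documented API.
async function attributeTrace(
  fetchFn: FetchLike,
  apiUrl: string,
  apiKey: string,
  args: { trace: object; errorSpanId: string },
): Promise<unknown> {
  // Both calls carry the same Bearer token.
  const headers: Record<string, string> = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };

  // Call 1: ingest the raw trace.
  const ingest = await fetchFn(`${apiUrl}/api/traces`, {
    method: "POST",
    headers,
    body: JSON.stringify({ trace: args.trace }),
  });
  if (!ingest.ok) throw new Error(`trace ingest failed: HTTP ${ingest.status}`);
  const { traceId } = (await ingest.json()) as { traceId: string };

  // Call 2: run the attribution pipeline against the ingested trace.
  const attribution = await fetchFn(`${apiUrl}/api/attributions`, {
    method: "POST",
    headers,
    body: JSON.stringify({ traceId, errorSpanId: args.errorSpanId }),
  });
  if (!attribution.ok) throw new Error(`attribution failed: HTTP ${attribution.status}`);
  return attribution.json();
}
```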
rj.suggest_snapshot
| Arg | Type | Required | Notes |
| -------------- | -------- | -------- | ---------------------------------------------------- |
| attributionId| string | yes | ULID from rj.attribute_trace |
| name | string | yes | Human-readable snapshot name (unique per user) |
| description | string | no | Longer description for the suite UI |
| suiteName | string | no | Add to this suite (lazy-created); else "Unfiled" |
Returns:
{
  snapshotId: string
  name: string
  suiteId?: string
  nextStep: string // human hint pointing the agent at rj.verify_change
}
Useful 409/404 hints in structuredContent:
- hint: "name_conflict" → pick a different name.
- hint: "attribution_not_found" → the attributionId is wrong or belongs to a different user.
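One way a caller might react to these hints is sketched below. The two hint strings come from this README; the planRecovery helper, the Recovery type, and the numeric-suffix naming scheme are hypothetical choices made for illustration.

```typescript
// The two hint values documented above.
type SnapshotHint = "name_conflict" | "attribution_not_found";

// Hypothetical description of what the agent should do next.
type Recovery =
  | { kind: "retry"; name: string }
  | { kind: "reattribute"; reason: string };

// Decide on a follow-up step for a failed rj.suggest_snapshot call.
function planRecovery(hint: SnapshotHint, requestedName: string, attempt: number): Recovery {
  if (hint === "name_conflict") {
    // Snapshot names are unique per user, so retry with a suffixed name.
    return { kind: "retry", name: `${requestedName}-${attempt + 1}` };
  }
  // No rename can help here: the attribution itself has to be obtained again.
  return {
    kind: "reattribute",
    reason: "attributionId is wrong or belongs to a different user",
  };
}
```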
How the loop fits together
┌──────────────────────────────────────┐
│ agent runs your test / observes bug │
└──────────────────────────────────────┘
│
▼
rj.attribute_trace
│
▼
rj.suggest_snapshot
│
▼
┌────────────────────────────────────┐
│ agent writes a patch │
└────────────────────────────────────┘
│
▼
rj.verify_change
│
▼
┌──────────────────────────────────────┐
│ verdict=pass → commit │
│ verdict=regression → fix + re-loop │
│ verdict=drift → confirm w/ user │
└──────────────────────────────────────┘
This is the same loop the human follows on the web app, collapsed into three tool calls the agent can make without leaving its session.
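Expressed as code, the loop might look like the sketch below. The RjTools interface is a hypothetical stand-in for the three MCP tool calls, reduced to the fields the loop needs; the attempt limit and the "stopped" result for the error and empty verdicts are assumptions, since the diagram does not cover them.

```typescript
type Verdict = "pass" | "regression" | "drift" | "error" | "empty";

// Hypothetical interface: each method stands in for one MCP tool call.
interface RjTools {
  attributeTrace(args: { trace: object; errorSpanId: string }): Promise<{ attributionId: string }>;
  suggestSnapshot(args: { attributionId: string; name: string }): Promise<{ suiteId?: string }>;
  verifyChange(args: { suiteId: string }): Promise<{ verdict: Verdict }>;
}

type LoopResult = "commit" | "confirm-with-user" | "stopped";

async function fixLoop(
  tools: RjTools,
  failure: { trace: object; errorSpanId: string },
  snapshotName: string,
  writePatch: () => Promise<void>,
  maxAttempts = 3, // ASSUMPTION: the README does not specify a retry limit
): Promise<LoopResult> {
  // Attribute the failure, then lock the verdict in as a snapshot.
  const { attributionId } = await tools.attributeTrace(failure);
  const { suiteId } = await tools.suggestSnapshot({ attributionId, name: snapshotName });
  if (suiteId === undefined) return "stopped"; // no suite to verify against

  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    await writePatch();
    const { verdict } = await tools.verifyChange({ suiteId });
    if (verdict === "pass") return "commit";
    if (verdict === "drift") return "confirm-with-user";
    if (verdict !== "regression") return "stopped"; // error or empty
    // regression: go round again with another patch
  }
  return "stopped";
}
```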
SDK availability + handling
The server depends on @modelcontextprotocol/sdk for the JSON-RPC
transport. If you're building from source in an environment where the
SDK can't be installed (no internet during build, restricted registry,
etc.), the tool modules under src/tools/* are fully usable as a
library — they have zero SDK imports and can be called directly:
import verifyChange from "@runtime-judgement/mcp-server/tools/verify-change"
const result = await verifyChange.invoke(
  { suiteId: "01HZ..." },
  { env: process.env },
)
The transport layer (src/index.ts) is the only part that imports the
SDK. If you need to run the tools without the SDK, import the tool
modules directly and wire your own transport.
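A custom transport can be as small as a dispatcher that looks a tool up by name and forwards the arguments. The sketch below assumes only the invoke(args, { env }) call shape shown in the example above; the Tool interface, the ToolRequest and ToolResponse envelopes, and the registry keyed by tool name are hypothetical, and a stub would need to be replaced by the real tool modules.

```typescript
// Call shape taken from the invoke example above; everything else is illustrative.
interface ToolContext {
  env: Record<string, string | undefined>;
}
interface Tool {
  invoke(args: unknown, ctx: ToolContext): Promise<unknown>;
}

// Hypothetical request and response envelopes for a home-grown transport.
interface ToolRequest {
  tool: string;
  args: unknown;
}
type ToolResponse = { ok: true; result: unknown } | { ok: false; error: string };

// Route one request to the matching tool and normalise failures.
async function dispatch(
  tools: Record<string, Tool>,
  req: ToolRequest,
  ctx: ToolContext,
): Promise<ToolResponse> {
  const tool = tools[req.tool];
  if (tool === undefined) {
    return { ok: false, error: `unknown tool: ${req.tool}` };
  }
  try {
    return { ok: true, result: await tool.invoke(req.args, ctx) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Whatever carries the requests (a socket, a queue, a test harness) only needs to parse a ToolRequest, call dispatch, and serialise the ToolResponse.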
Development
pnpm install --filter @runtime-judgement/mcp-server
pnpm --filter @runtime-judgement/mcp-server build
pnpm --filter @runtime-judgement/mcp-server test
Output lands in dist/ with .d.ts declarations. The binary
entrypoint is dist/index.js (referenced by the bin field in
package.json).
What's not in v0.1
- No HTTP transport — stdio only. Future work: an HTTP wrapper for hosted deployments.
- No streaming — the snapshot suite run is synchronous up to the 300s RJ function timeout. Long-running suites should be queued and polled (deferred to v0.2).
- No tool-side caching: every rj.verify_change call triggers a fresh run server-side. RJ has subgraph caching (Sprint 5 / migration 0007) that takes care of within-trace deduplication, but cross-call caching is the user's job.
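Since cross-call caching is left to the caller, one possible approach is to memoise verify calls on a key that changes whenever the code under test changes. The sketch below is illustrative only: withVerifyCache and the revision callback (for example, a function returning the current commit hash or a hash of the working tree) are hypothetical and not part of the package.

```typescript
interface VerifyArgs {
  suiteId: string;
  tags?: string[];
}
type VerifyFn<R> = (args: VerifyArgs) => Promise<R>;

// Wrap a verify function so repeated calls for the same revision, suite and
// tag set reuse the first result instead of triggering a fresh run.
function withVerifyCache<R>(verify: VerifyFn<R>, revision: () => string): VerifyFn<R> {
  const cache = new Map<string, Promise<R>>();
  return (args) => {
    // Sort a copy of the tags so their order does not affect the key.
    const tags = [...(args.tags ?? [])].sort();
    const key = JSON.stringify([revision(), args.suiteId, tags]);
    let hit = cache.get(key);
    if (hit === undefined) {
      hit = verify(args);
      cache.set(key, hit);
      // Drop failed runs so a later call can retry.
      hit.catch(() => cache.delete(key));
    }
    return hit;
  };
}
```

Caching the promise rather than the resolved value also collapses concurrent calls for the same key into a single run.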
Dog-food test
The package's own first customer is the runtime-judgement-app repo this package lives inside. The acceptance test for v0.1 is:
- Spawn Claude Code in the repo root with this server registered.
- Have it make a patch to (say) the compressor's heuristic file.
- Have it call rj.verify_change against the canonical regression suite.
- Confirm the verdict matches what pnpm bench reports independently.
If those match, the tool is honest. If they diverge, file a bug — the divergence is the failure attribution.
