@runtime-judgement/mcp-server

v0.1.0

Published

19 days ago

Runtime Judgement MCP server — verify, attribute, and snapshot from inside Claude Code / Codex CLI / Aider. Three tools that fit the inner-loop verification position from the Sprint 11 capability roadmap.

rossamac01

mcp model-context-protocol runtime-judgement claude-code codex aider agents verification attribution snapshot-testing

@runtime-judgement/mcp-server

A Model Context Protocol server that gives Claude Code, Codex CLI, Aider and any other MCP-aware coding agent three Runtime Judgement tools to call mid-session, before commit:

rj.verify_change — run the snapshot suite against the current pipeline to verify a patch hasn't regressed the locked-in behaviour.
rj.attribute_trace — ingest a failed trace, attribute the root cause, return the cited cause + L1/L4 verdict + suggested fix.
rj.suggest_snapshot — lock a verdict in as a regression snapshot so the next rj.verify_change call guards against it.

This is the inner-loop verification position from the Sprint 11 capability roadmap (§7 Demo 6). The coding agent never leaves its session to check the web app — verification happens inside the same prompt cycle.

The first dog-food customer is this repo: the test plan for the package is to point Claude Code at the runtime-judgement-app codebase, have it make a patch, and call rj.verify_change against the canonical suite before committing. Eat the dog food.

One-liner setup

Note: @runtime-judgement/mcp-server is not yet published to npm. Until it is, install from source (see below).

Install from source:

git clone https://github.com/rambo-01/runtime-judgement-app
cd runtime-judgement-app
pnpm install
pnpm --filter @runtime-judgement/mcp-server build

Then point your MCP config at:

{
  "command": "node",
  "args": ["/absolute/path/to/runtime-judgement-app/packages/rj-mcp-server/dist/index.js"]
}

The path-with-spaces fix (10ba427) is already in main, so paths like ~/Claude Code/... work correctly.

Once published to npm, the one-liner will be:

# Works in Claude Code, Cursor (via Claude), Windsurf, Cline, and any MCP-compatible tool
npx @runtime-judgement/mcp-server

The server will prompt for your API key on first run and write it to ~/.rj-config.json.

Or set it directly:

RJ_API_KEY=rj_live_... npx @runtime-judgement/mcp-server

Get your key at https://runtime-judgement.app/app/settings/api-keys

Quick start with Claude Code

Get your RJ API token from https://runtime-judgement.app/app/integrations.

Add to your ~/.config/claude-code/mcp.json:

{
  "rj": {
    "command": "npx",
    "args": ["-y", "@runtime-judgement/mcp-server"],
    "env": {
      "RJ_API_URL": "https://runtime-judgement-app.vercel.app",
      "RJ_API_KEY": "rj_..."
    }
  }
}

Restart Claude Code. The rj MCP server appears in your tool list with three tools: rj.attribute_trace, rj.verify_change, rj.suggest_snapshot.
Ask Claude: "Read the trace at ~/Downloads/demo-trace.json and call rj.attribute_trace on it. The user-visible failure surfaced in span lookup_order_status."
Claude calls rj.attribute_trace and returns the cited cause + suggested fix as inline Markdown — no need to leave the terminal.
After patching, ask Claude: "Now call rj.verify_change against suite 01HZ... to confirm the fix." Claude reports the verdict (pass / regression / drift) so you know whether to commit.

Each tool returns a human_readable Markdown field alongside the structured JSON, so Claude Code can surface a clean summary inline. The structured payload is preserved for machine consumers and for the next agent step.

Available tools

| Tool | Purpose | One-line example |
|---|---|---|
| rj.attribute_trace | Attribute a failed trace to its root cause | rj.attribute_trace({ trace: <otel json>, errorSpanId: "lookup_order_status" }) |
| rj.verify_change | Verify a patch hasn't regressed locked-in behaviour | rj.verify_change({ suiteId: "01HZSUITE..." }) |
| rj.suggest_snapshot | Lock an attribution in as a regression snapshot | rj.suggest_snapshot({ attributionId: "01HZATTR...", name: "tool-args-guard" }) |

Full input/output reference is in the Tool reference section below.

5-minute setup

1. Install

Note: @runtime-judgement/mcp-server is not yet on npm. Until it is published, see the "One-liner setup" section above for the install-from-source path.

Once published, the server can be run via npx:

npx @runtime-judgement/mcp-server

Or installed globally:

pnpm add -g @runtime-judgement/mcp-server

2. Set the environment

# Runtime Judgement — get these from https://runtime-judgement-app.vercel.app/app/settings
export RJ_API_URL="https://runtime-judgement-app.vercel.app"
export RJ_API_KEY="rj_..."

The server inherits whatever env it's launched in. The agent's own config file (Claude Code: ~/.config/claude-code/mcp_servers.json; Codex CLI: ~/.config/codex/mcp_servers.toml) is the right place to declare these.
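
As a rough sketch of how a wrapper script might validate these two variables before launching the server: `loadRjConfig`, `RjConfig`, and the `fallbackUrl` parameter are hypothetical names introduced for illustration, not part of the package. The sketch assumes only that `RJ_API_KEY` must be non-empty and that `RJ_API_URL` may be omitted.

```typescript
// Hypothetical helper: not part of @runtime-judgement/mcp-server.
interface RjConfig {
  apiUrl: string;
  apiKey: string;
}

type Env = Record<string, string | undefined>;

function loadRjConfig(env: Env, fallbackUrl: string): RjConfig {
  const apiKey = (env["RJ_API_KEY"] ?? "").trim();
  if (apiKey === "") {
    throw new Error("RJ_API_KEY is not set");
  }
  // Fall back when RJ_API_URL is missing or blank (assumed behaviour).
  const rawUrl = (env["RJ_API_URL"] ?? "").trim();
  const chosen = rawUrl === "" ? fallbackUrl : rawUrl;
  // Strip trailing slashes so later path joins do not produce "//".
  const apiUrl = chosen.replace(/\/+$/, "");
  return { apiUrl, apiKey };
}
```

In a Node script this would be called as `loadRjConfig(process.env, "https://runtime-judgement-app.vercel.app")`, failing fast with a readable message instead of a 401 on the first tool call.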

3. Register with your agent

Claude Code

{
  "mcpServers": {
    "runtime-judgement": {
      "command": "npx",
      "args": ["@runtime-judgement/mcp-server"],
      "env": {
        "RJ_API_URL": "https://runtime-judgement-app.vercel.app",
        "RJ_API_KEY": "rj_..."
      }
    }
  }
}

Codex CLI

[mcp_servers.runtime-judgement]
command = "npx"
args = ["@runtime-judgement/mcp-server"]
env = { RJ_API_URL = "https://runtime-judgement-app.vercel.app", RJ_API_KEY = "rj_..." }

Aider

# .aider.conf.yml
mcp_servers:
  - name: runtime-judgement
    command: ["npx", "@runtime-judgement/mcp-server"]
    env:
      RJ_API_URL: https://runtime-judgement-app.vercel.app
      RJ_API_KEY: rj_...

Use without Claude Code (any MCP host)

Any tool that supports the Model Context Protocol can connect to this server. Use the following generic JSON config block and adapt the key names to your host:

{
  "mcpServers": {
    "runtime-judgement": {
      "command": "npx",
      "args": ["-y", "@runtime-judgement/mcp-server"],
      "env": {
        "RJ_API_KEY": "rj_live_..."
      }
    }
  }
}

| Host | Config file location |
|---|---|
| Claude Code | ~/.claude/mcp.json or .claude/mcp.json in the project root |
| Cursor | .cursor/mcp.json in the project root |
| Windsurf | ~/.codeium/windsurf/mcp_config.json |
| Cline | VS Code settings → Cline MCP Servers |
| Any stdio-MCP host | Point command at npx @runtime-judgement/mcp-server and pass RJ_API_KEY via env |

4. Use it

The agent will see three tools in its tool list. Ask:

"Before you commit this patch, run rj.verify_change against suite 01HZ... and tell me the verdict."

The tool will POST to the suite-run endpoint, wait for the result, and return verdict counts + cited spans so the agent can decide whether to proceed.

Tool reference

`rj.verify_change`

| Arg | Type | Required | Notes |
| --- | --- | --- | --- |
| suiteId | string | yes | Snapshot suite ULID |
| tags | string[] | no | Run only snapshots tagged with these tags |
| perturbation | object | no | Forward-compat hint about what the agent changed |

Returns:

{
  suiteId: string
  suiteName?: string
  verdict: "pass" | "regression" | "drift" | "error" | "empty"
  counts: {
    total: number
    passed: number
    changedIntentional: number
    changedUnexpected: number
    skipped: number
    errored: number
  }
  outcomes: Array<{
    snapshotId: string
    status: "passed" | "changed-intentional" | "changed-unexpected" | "skipped" | "error"
    citedSpanIds?: string[]
    message?: string
  }>
  runIds: string[]
  durationMs?: number
  spendUsd?: number
}

The verdict field is the single-string summary the agent should look at:

pass — every snapshot is unchanged. Proceed with commit.
regression — at least one snapshot's verdict changed unexpectedly. Stop and inspect outcomes to see which.
drift — every change is changed-intentional. The agent should confirm with the user whether the intentional drift is what was wanted.
error — at least one snapshot failed to replay (judge timeout, pipeline crash, etc.). Surface in agent output as a transient issue.
empty — suite has no snapshots. Configuration issue, not a verdict.
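
The verdict rules above can be sketched as a small decision function on the agent side. The types are trimmed copies of the documented return shape; the `Action` names and the `decide` function are hypothetical and are not defined by the package.

```typescript
// Trimmed copies of the documented return shape (only the fields used here).
type Verdict = "pass" | "regression" | "drift" | "error" | "empty";
type OutcomeStatus =
  | "passed"
  | "changed-intentional"
  | "changed-unexpected"
  | "skipped"
  | "error";

interface Outcome {
  snapshotId: string;
  status: OutcomeStatus;
  citedSpanIds?: string[];
  message?: string;
}

interface VerifyResult {
  verdict: Verdict;
  outcomes: Outcome[];
}

// Hypothetical action names; the package does not define these.
type Action =
  | "commit"
  | "fix-and-rerun"
  | "ask-user"
  | "report-transient"
  | "check-config";

// Map a verdict to the next step, plus the outcomes worth showing the user.
function decide(result: VerifyResult): { action: Action; inspect: Outcome[] } {
  const withStatus = (s: OutcomeStatus) =>
    result.outcomes.filter((o) => o.status === s);
  switch (result.verdict) {
    case "pass":
      return { action: "commit", inspect: [] };
    case "regression":
      return { action: "fix-and-rerun", inspect: withStatus("changed-unexpected") };
    case "drift":
      return { action: "ask-user", inspect: withStatus("changed-intentional") };
    case "error":
      return { action: "report-transient", inspect: withStatus("error") };
    case "empty":
      return { action: "check-config", inspect: [] };
  }
}
```

Returning the filtered outcomes alongside the action keeps the cited span IDs available for whatever message the agent shows next.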

`rj.attribute_trace`

| Arg | Type | Required | Notes |
| --- | --- | --- | --- |
| trace | object | yes | Raw trace JSON (OTEL gen-ai / LangSmith / custom) |
| errorSpanId | string | yes | Span where the user-visible failure surfaced |
| errorDescription | string | no | Human description (improves judge precision) |
| errorEvidence | string | no | Verbatim quote from the failure |
| pipeline | string | no | Pipeline name override (defaults to q72-k1) |

Returns:

{
  attributionId: string
  traceId: string
  sourceFormat: "otel-genai" | "langsmith" | "custom-json"
  spanCount?: number
  deduped: boolean
  l1: { axis: string; confidence: number }
  l4: { category: string; confidence: number }
  citedSpans: string[]
  explanation: string | null
  suggestedFix: string | null
  cost: { usd?: number } | null
  algoVersions: Record<string, unknown> | null
}

Under the hood the tool makes two calls: it ingests the trace via POST /api/traces, then runs the attribution pipeline via POST /api/attributions. Both calls share the same Bearer token.
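
To make the two calls concrete, here is a minimal sketch that builds both request descriptors. The endpoints and the shared Bearer token come from the description above. The JSON body field names (`trace`, `traceId`, `errorSpanId`) and every helper name are assumptions for illustration only, since the real request schema is not documented here.

```typescript
// Sketch only: body field names are assumed, not taken from the real API.
interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function postJson(
  apiUrl: string,
  apiKey: string,
  path: string,
  payload: unknown,
): HttpRequest {
  return {
    url: apiUrl.replace(/\/+$/, "") + path,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  };
}

// Call 1: ingest the raw trace.
function ingestRequest(apiUrl: string, apiKey: string, trace: unknown): HttpRequest {
  return postJson(apiUrl, apiKey, "/api/traces", { trace });
}

// Call 2: run attribution against the trace id returned by call 1.
function attributionRequest(
  apiUrl: string,
  apiKey: string,
  traceId: string,
  errorSpanId: string,
): HttpRequest {
  return postJson(apiUrl, apiKey, "/api/attributions", { traceId, errorSpanId });
}
```

Each descriptor can be sent with `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. Keeping the builders pure makes the ordering and the shared token easy to check without a network.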

`rj.suggest_snapshot`

| Arg | Type | Required | Notes |
| --- | --- | --- | --- |
| attributionId | string | yes | ULID from rj.attribute_trace |
| name | string | yes | Human-readable snapshot name (unique per user) |
| description | string | no | Longer description for the suite UI |
| suiteName | string | no | Add to this suite (lazy-created); else "Unfiled" |

Returns:

{
  snapshotId: string
  name: string
  suiteId?: string
  nextStep: string  // human hint pointing the agent at rj.verify_change
}

Useful 409/404 hints in structuredContent:

hint: "name_conflict" — pick a different name.
hint: "attribution_not_found" — the attributionId is wrong or belongs to a different user.
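
One way an agent wrapper might react to these hints is sketched below. Only the `hint` field and its two values come from the text above. The `Recovery` type, the `recover` function, and the numeric-suffix naming scheme are hypothetical.

```typescript
// Hypothetical shape: only the `hint` field is documented.
interface SnapshotError {
  hint?: string;
}

type Recovery =
  | { kind: "retry-with-name"; name: string }
  | { kind: "re-attribute" }
  | { kind: "give-up" };

function recover(
  err: SnapshotError,
  attemptedName: string,
  attempt: number,
): Recovery {
  switch (err.hint) {
    case "name_conflict":
      // Names are unique per user, so append a numeric suffix and retry.
      return { kind: "retry-with-name", name: `${attemptedName}-${attempt + 1}` };
    case "attribution_not_found":
      // The id is wrong or belongs to another user; obtain a fresh one.
      return { kind: "re-attribute" };
    default:
      // Unknown or missing hint: surface the error rather than guess.
      return { kind: "give-up" };
  }
}
```

For example, `recover({ hint: "name_conflict" }, "tool-args-guard", 1)` proposes `tool-args-guard-2` as the next name to try.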

How the loop fits together

                ┌──────────────────────────────────────┐
                │ agent runs your test / observes bug  │
                └──────────────────────────────────────┘
                                  │
                                  ▼
                          rj.attribute_trace
                                  │
                                  ▼
                         rj.suggest_snapshot
                                  │
                                  ▼
                   ┌────────────────────────────────────┐
                   │ agent writes a patch               │
                   └────────────────────────────────────┘
                                  │
                                  ▼
                           rj.verify_change
                                  │
                                  ▼
                ┌──────────────────────────────────────┐
                │ verdict=pass → commit                │
                │ verdict=regression → fix + re-loop   │
                │ verdict=drift → confirm w/ user      │
                └──────────────────────────────────────┘

This is the same loop the human follows on the web app — collapsed into three tool calls the agent can make without leaving its session.

SDK availability + handling

The server depends on @modelcontextprotocol/sdk for the JSON-RPC transport. If you're building from source in an environment where the SDK can't be installed (no internet during build, restricted registry, etc.), the tool modules under src/tools/* are fully usable as a library — they have zero SDK imports and can be called directly:

import verifyChange from "@runtime-judgement/mcp-server/tools/verify-change"

const result = await verifyChange.invoke(
  { suiteId: "01HZ..." },
  { env: process.env },
)

The transport layer (src/index.ts) is the only part that imports the SDK. If you need to run the tools without the SDK, import the tool modules directly and wire your own transport.
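
Wiring your own transport mostly comes down to routing a tool name to the matching module. A minimal sketch, assuming each module exposes the `invoke(args, ctx)` shape shown in the example above: the `ToolModule` interface, the `Map` registry, and `resolveTool` are hypothetical glue, not exports of the package.

```typescript
// Mirrors the invoke(args, { env }) call shown above; everything else here
// is hypothetical glue code.
interface ToolContext {
  env: Record<string, string | undefined>;
}

interface ToolModule {
  invoke(args: unknown, ctx: ToolContext): Promise<unknown>;
}

type Registry = Map<string, ToolModule>;

// Look up a tool by its public name, failing loudly on a typo.
function resolveTool(registry: Registry, name: string): ToolModule {
  const tool = registry.get(name);
  if (tool === undefined) {
    const known = Array.from(registry.keys()).join(", ");
    throw new Error(`unknown tool "${name}" (known: ${known})`);
  }
  return tool;
}
```

A custom transport would populate the registry with the imported tool modules under names such as `rj.verify_change`, then call `resolveTool(registry, name).invoke(args, { env })` for each incoming request and serialize the awaited result.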

Development

pnpm install --filter @runtime-judgement/mcp-server
pnpm --filter @runtime-judgement/mcp-server build
pnpm --filter @runtime-judgement/mcp-server test

Output lands in dist/ with .d.ts declarations. The binary entrypoint is dist/index.js (referenced by the bin field in package.json).

What's not in v0.1

No HTTP transport — stdio only. Future work: an HTTP wrapper for hosted deployments.
No streaming — the snapshot suite run is synchronous up to the 300s RJ function timeout. Long-running suites should be queued and polled (deferred to v0.2).
No tool-side caching — every rj.verify_change triggers a fresh run server-side. RJ has subgraph caching (Sprint 5 / migration 0007) that takes care of within-trace deduplication, but cross-call cache is the user's job.
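
Since cross-call caching is left to the caller, one possible approach is a small in-memory cache with a time-to-live. This is a sketch under stated assumptions: `TtlCache` and `cacheKey` are hypothetical and nothing like them ships in v0.1, and `treeHash` stands for any fingerprint of the code under test that the caller already has (for instance a commit or working-tree hash).

```typescript
// Hypothetical client-side cache; not part of the package.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be exercised without real waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (this.now() >= hit.expiresAt) {
      // Expired: drop the entry so the caller triggers a fresh run.
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Key on the suite plus a fingerprint of the code under test, so a new patch
// never reuses a verdict computed for an older tree.
function cacheKey(suiteId: string, treeHash: string): string {
  return `${suiteId}:${treeHash}`;
}
```

Including the tree fingerprint in the key matters more than the TTL: a cached `pass` is only meaningful for the exact code it was computed against.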

Dog-food test

The package's own first customer is the runtime-judgement-app repo this package lives inside. The acceptance test for v0.1 is:

Spawn Claude Code in the repo root with this server registered.
Have it make a patch to (say) the compressor's heuristic file.
Have it call rj.verify_change against the canonical regression suite.
Confirm the verdict matches what pnpm bench reports independently.

If those match, the tool is honest. If they diverge, file a bug — the divergence is the failure attribution.
