@mutagent/diagnostics

v0.1.0-alpha.8

Published

9 days ago

mutagent-diagnostics: AI agent diagnostics-on-tap skill for Claude Code, Codex, and other coding runtimes

0High
0Medium
0Low

bcmutagent

mutagent diagnostics ai-agents agent-optimization skill

MUTAGENT Diagnostics

 ███▄ ▄███▓ █    ██ ▄▄▄█████▓ ▄▄▄        ▄████ ▓█████  ███▄    █ ▄▄▄█████▓
▓██▒▀█▀ ██▒ ██  ▓██▒▓  ██▒ ▓▒▒████▄     ██▒ ▀█▒▓█   ▀  ██ ▀█   █ ▓  ██▒ ▓▒
▓██    ▓██░▓██  ▒██░▒ ▓██░ ▒░▒██  ▀█▄  ▒██░▄▄▄░▒███   ▓██  ▀█ ██▒▒ ▓██░ ▒░
▒██    ▒██ ▓▓█  ░██░░ ▓██▓ ░ ░██▄▄▄▄██ ░▓█  ██▓▒▓█  ▄ ▓██▒  ▐▌██▒░ ▓██▓ ░
▒██▒   ░██▒▒▒█████▓   ▒██▒ ░  ▓█   ▓██▒░▒▓███▀▒░▒████▒▒██░   ▓██░  ▒██▒ ░
░ ▒░   ░  ░░▒▓▒ ▒ ▒   ▒ ░░    ▒▒   ▓▒█░ ░▒   ▒ ░░ ▒░ ░░ ▒░   ▒ ▒   ▒ ░░
░  ░      ░░░▒░ ░ ░     ░      ▒   ▒▒ ░  ░   ░  ░ ░  ░░ ░░   ░ ▒░    ░
░      ░    ░░░ ░ ░   ░        ░   ▒   ░ ░   ░    ░      ░   ░ ░   ░
       ░      ░                    ░  ░      ░    ░  ░         ░

Diagnostics-on-Tap for AI agents. Pull evidence from your agent traces, translate user feedback into root-cause records, surface ranked remedies, and apply approved fixes — all from within your existing AI coding session.

What it does

@mutagent/diagnostics is an AI coding agent skill — a self-contained bundle that installs into your existing coding-agent runtime (Claude Code, Codex, Cursor, OpenCode) and gives it the ability to diagnose itself.

Invoke it naturally in chat: "diagnose my agents" or "why did my agent fail last night". The skill:

Tier 0 static scan — free, instant pattern match on your traces to bound token cost
Auto-extracts a rich entity context at ingest (system prompt · tool inventory · input sample) — deterministically, no LLM
Slices trace history into up to 5 parallel analysis clusters
Dispatches N Analyzers that run RCA against each cluster
Classifies each finding on 3 axes: WHAT failed · WHY it failed · WHERE in the stack
Builds a deterministic render input (Step 8.5 enricher) — derives the big-stat row, 24h latency heatmap, and signal census; fail-loud if the input is starved
Renders a gold-standard HTML report — Methodology [INTERNAL] · Overview · one tab per finding · Decisions — with copy-back markdown approval workflow
Spawns a Background apply-worker on an isolated git worktree that opens a PR (or issues idempotent REST mutations for cloud targets) — without touching your checkout

First invocation triggers interactive onboarding to configure your source + target platform. All subsequent invocations run the full diagnostic cycle automatically.

Quick Start

# First-time install — bootstraps config + runs onboarding wizard
$ pnpx @mutagent/diagnostics init

mutagent-diagnostics v0.1.0-alpha
Checking host runtime...  claude-code detected

? Select trace source platform:
  > Langfuse (cloud / self-hosted)
    OpenTelemetry (Jaeger / Tempo / Honeycomb)
    Local JSONL files
    Claude Code session transcripts
    Codex session transcripts

? Select target platform for fix delivery:
  > Claude Code (local agent config)
    Cursor (.cursorrules)
    OpenCode (config)
    Mastra.ai (TypeScript PR)
    Cloud REST (idempotent PUT)

Config written to .mutagent-diagnostics/config.yaml
Agents installed: .claude/agents/diagnostics-analyzer.md
                  .claude/agents/diagnostics-apply-worker.md

Ready. Invoke: "diagnose my agents" in your coding session.

After init, trigger diagnostics naturally from your AI session:

You: diagnose my agents

> Running Tier 0 static scan on 247 traces...
> 3 signal clusters detected (errors: 12, low-score: 8, feedback: 4)
> Dispatching 3 analyzers in parallel...
> RCA complete. 7 findings ranked by severity.
> Opening HTML report...

Platform Support

Source Platforms (trace evidence)

| Platform | Type | Notes | |----------|------|-------| | Langfuse | Cloud / self-hosted | Full filter/search coverage | | OpenTelemetry | Jaeger · Tempo · Honeycomb | OTLP endpoint | | Local JSONL | .jsonl / .ndjson | Filesystem read | | Claude Code | Session transcripts | Auto-detected from ~/.claude/projects/ | | Codex CLI | Session transcripts | Auto-detected from ~/.codex/sessions/ |

Target Platforms (apply fixes to)

| Platform | Class | Delivery | |----------|-------|---------| | Claude Code | local-agent | Edit ~/.claude/agents/*.md via PR | | Codex CLI | local-agent | Edit agent configs via PR | | Cursor | local-agent | Edit .cursorrules via PR | | OpenCode | local-agent | Edit config via PR | | Mastra.ai | local-code-construct | TypeScript PR to agent graph | | Cloud Agent SDK | local-code-construct | TypeScript PR to SDK integration | | Cloud REST | remote | Idempotent PUT mutations |

Architecture

%%{init: {'theme':'dark','themeVariables':{'primaryColor':'#a78bfa','primaryTextColor':'#e2e8f0','primaryBorderColor':'#7c3aed','lineColor':'#06b6d4','edgeLabelBackground':'#1e1b4b','background':'#0f0f23','clusterBkg':'#1e1b4b','clusterBorder':'#4c1d95','titleColor':'#a78bfa'}}}%%
flowchart TD
    INV["Invoke<br/>'diagnose my agents'"]
    DETECT{Config<br/>present?}
    ONB["Onboarding<br/>(8-phase)"]
    PROTO["Load orchestrator-protocol.md<br/>Parent session = orchestrator"]

    TIER0["Tier 0 static scan<br/>scripts/tier0-scan.ts"]
    SLICER["Dynamic-cluster slicer<br/>scripts/slicer.ts<br/>cap = 5"]
    ANL["N Analyzers ≤ 5<br/>assets/agents/diagnostics-analyzer.md"]
    RCA["RCA Layer<br/>WHAT / WHY / WHERE"]
    ENRICH["Step 8.5 — Build Render Input<br/>scripts/enrich/build-render-input.ts<br/>deterministic · fail-loud"]
    REPORT["Gold-standard HTML Report<br/>assets/templates/report.html.tpl<br/>copy-back markdown approval"]
    APPLY["BG apply-worker on worktree<br/>assets/agents/diagnostics-apply-worker.md<br/>PR or REST"]

    classDef purple fill:#4c1d95,stroke:#a78bfa,color:#e2e8f0
    classDef cyan   fill:#0e7490,stroke:#06b6d4,color:#e2e8f0
    classDef dim    fill:#1e293b,stroke:#475569,color:#94a3b8

    INV --> DETECT
    DETECT -->|missing| ONB
    DETECT -->|present| PROTO
    PROTO --> TIER0
    TIER0 --> SLICER
    SLICER --> ANL
    ANL --> RCA
    RCA --> ENRICH
    ENRICH --> REPORT
    REPORT --> APPLY

    class INV,DETECT purple
    class TIER0,SLICER,ANL,RCA,ENRICH cyan
    class ONB,PROTO,REPORT,APPLY dim

Full DAG and component dependency graph: references/reference.md

Features

| Feature | Detail | |---------|--------| | 3-axis failure taxonomy | Every finding classified on (WHAT, WHY, WHERE) — 10 WHAT types · 9 WHY types · 8 WHERE types | | Tier 0 static scan | Cost gate before any LLM call — pattern-match + signal count in milliseconds | | Dynamic cluster slicing | Groups traces by error / feedback / latency signal; window-based fallback when no a-priori signal | | Cap-of-5 fan-out | Never more than 5 parallel analyzers — keeps cost and complexity bounded | | Auto-extracted entity context | Every normalizer derives an EntityContext (system prompt · tool inventory with per-tool stats · input sample) at ingest — deterministic, no LLM; operator never hand-fills it. Fields > 1 KB render as collapsed ExpandableSection (system prompt always collapsed for PII) | | Gold-standard HTML report | 8-tab report — Methodology [INTERNAL] · Overview (entity card · 6-tile big-stat · 24h latency heatmap · signal census · scan funnel) · one tab per finding (taxonomy · evidence · why-chain · assumptions pills · ★-recommended remedies) · Decisions. Built by the deterministic Step 8.5 enricher; renderer is fail-loud on starved input (R1 §9.3) | | HTML report + copy-back | Human-in-the-loop review via rendered HTML; operator pastes approval markdown back into chat | | Structured contract mode | When a target ships a self-diagnosis-contract.yaml, the report becomes a structured 10-category pass/fail/pending scorecard against the declared success criteria | | Branch-hygiene apply | All local fixes land in an isolated git worktree → PR; operator's checkout is never touched | | Stale-target detection | Hash-compare before every write; re-diagnose prompt if drift detected | | Idempotent REST | Cloud-target PUTs include Idempotency-Key; safe to retry on 5xx | | Dual audit trail | Every apply emits pr-body.md + audit.json + audit.md | | Self-diagnostics | Post-session self-RCA (opt-in, default OFF); feeds skill's own transcripts through the same pipeline | | Platform-portable UX | AskUserQuestion on Claude Code; numbered chat-choice fallback on Codex / Cursor / OpenCode | | Platform-native install | pnpx @mutagent/diagnostics init auto-detects runtime and installs agent .md files |

Failure Taxonomy

WHAT  →  wrong-output · missing-output · loop · latency-spike · cost-overshoot
         format-violation · hallucination · user-complaint · low-score · missing-context

WHY   →  prompt-underspec · prompt-overspec · tool-misuse · tool-missing
         context-overflow · provider-limit · data-staleness · handoff-loss · dependency-failure

WHERE →  system-prompt · tool-definition · agent-config · routing-config
         upstream-data · provider-side · harness-side · user-input

Configuration

Config lives at <project>/.mutagent-diagnostics/config.yaml (generated by init). Secrets (API keys) live at <project>/.mutagentrc — gitignored, never committed.

# .mutagent-diagnostics/config.yaml — generated by pnpx @mutagent/diagnostics init

source:
  platform: "langfuse"           # langfuse | otel | local-jsonl | claude-code | codex
  config:
    host: "https://cloud.langfuse.com"
    # Keys: set LANGFUSE_SECRET_KEY + LANGFUSE_PUBLIC_KEY in .mutagentrc

target:
  platform: "local-claude"       # local-claude | local-codex | local-cursor | local-opencode
                                 # local-mastra | local-cloud-agent-sdk | cloud-rest
  config: {}

filters:
  time_window:
    from: "7daysAgo"
    to:   "now"
  score_below: null              # auto-scaled threshold when set
  limit: 100

ask_tool:
  runtime: "claude-code"         # claude-code | chat-multi-choice

self_diagnostics:
  enabled: false                 # default OFF

Full schema with doc strings: references/config.md

Design Principles

These principles are the audit surface for every PR. See references/principles.md for full text and scope annotations.

| # | Principle | Summary | |---|-----------|---------| | PR-001 | Tier 0 before LLM | Static scan runs before any LLM call | | PR-002 | RCA layer mandatory | Raw scores never shown; always WHAT+WHY+WHERE+evidence | | PR-003 | Read-before-write for REST | GET current state, diff, show diff, then PUT | | PR-004 | Branch hygiene | All local applies in isolated worktree → PR; operator checkout untouched | | PR-005 | Cap-of-5 fan-out | Never > 5 parallel analyzers | | PR-006 | Codebase-agnostic placeholders | Templates use {PLACEHOLDER}; no hardcoded paths | | PR-007 | Progressive disclosure | SKILL.md ≤ 500 lines; all detail in references/ | | PR-008 | Hyperlinked external docs | Links to upstream docs, not local copies | | PR-009 | Single source of truth for config | TypeBox schema in scripts/config/schema.ts | | PR-010 | Multi-choice picker UX | AskUserQuestion on Claude Code; chat fallback elsewhere | | PR-011 | Stale-target detection | Hash-compare before every write | | PR-012 | Idempotent retry for REST | Idempotency-Key header; max 2 retries | | PR-013 | Dual-emit audit | PR body + audit.json + audit.md on every apply | | PR-014 | HITL via HTML copy-back | HTML report is primary review surface | | PR-015 | Per-section rationale links | Reference docs link to the FRs they implement | | PR-016 | Adapter contract pattern | Unified interface per role (normalizer, apply-worker) | | PR-017 | Dynamic-cluster slicing | Cluster by signal when available; window-based fallback | | PR-018 | Translation grounds in evidence | Findings must cite trace message index or code line | | PR-019 | Scripts vs agent operations | Explicit Type A / B / C classification per operation | | PR-020 | Recursive whys to origin | No fixed depth; keep asking until isOrigin: true | | PR-021 | CLI install detection | which <cli> at onboarding; prompt install if missing | | PR-022 | Self-Diagnostics protocol | Post-session self-RCA; default OFF | | PR-023 | Clipboard payloads self-contained | Every remedy embeds ActionablePlan (v0.3+) | | PR-024 | Orchestration in parent session | Never delegate the loop to a coordinator sub-agent (DP-A) | | PR-025 | Self-diagnosis == client diagnosis | One RCA engine; only the subject differs (DP-B) | | PR-026 | Findings grounded inline | Read the definition or mark hypothesis-pending-source (DP-C) | | PR-027 | Declare scope; census-pick the primary signal | Disclose what was NOT assessed (DP-D) | | PR-028 | Cheap-first Tier 0 | Broad cheap checks; don't over-index on one signal (DP-E) | | PR-029 | Report IS the product surface | Template-stamp over procedural rendering (DP-F) | | PR-030 | Clipboard payloads carry full decision context | Apply-worker authors the PR without re-reading the session (DP-G, extends PR-023) | | PR-031 | Remedies declare apply-target + change-type + write-access | (DP-H) | | PR-032 | Migrations carry a SWEEP | grep + fix all refs in the same change (DP-I) | | PR-033 | Tag each finding PRODUCT/META/CORE | Self-diagnosis is a discovery lane for PRODUCT/CORE; --audience client NODE-STRIPs internal (DP-J) | | PR-034 | CLI-contract drift watch | Assert the built command string against the source-platform manual (DP-K) | | PR-035 | Fresh runs MUST LLM-deep-read | Caps bound the read, never skip it (promotes R2.1 +D1) | | PR-036 | Fix the MEASUREMENT layer | 5-trace awareness sample discovers what Tier-0 can't measure (promotes R2.2) | | PR-037 | Learn across runs | Approved findings accumulate into a class-memory library, matched first in Tier-0 (promotes R2.3 +D2) | | PR-038 | Make signal selection legible | Tier pie, selection-rule cards, mermaid selection trace (promotes R2.4) | | PR-039 | Prove the sample represents the population | 4-dim coverage proof + confidence grade (promotes R2.5) | | PR-040 | Operator-driven, not skill-driven | Single-arg NL invocation, verbatim brief preserved (promotes R2.6 +D2) | | PR-041 | Cost realism | Distrust skill-side cost tracking; cost cap INACTIVE by default (promotes D1) | | PR-042 | Preserve the verbatim operator invocation across runs | (promotes D2) | | PR-043 | Every methodology decision is recorded in runMeta as an append-only audit row | | | PR-044 | Normalize before analyze | Canonical schema (Step 3.7) before any Tier-0/LLM work | | PR-045 | Remedies carry two distinct rationales | rationale (why-real) vs whyWorks (why-the-fix-works) | | PR-046 | Feedback two-layered + fix-feedback | raw user signal → translated component-level; + fix-feedback on the solution | | PR-047 | No-code-access honesty | explicit assumption when source absent; never fabricate paths | | PR-048 | Trace Hungriness | scale the deep read with the population (tiers 100/250/500/1000) | | PR-049 | Signal detection + selection | one reconciled primary signal drives census · heatmap · funnel | | PR-050 | Inline-JS parse gate | every emitted <script> must parse (new Function); data gates can't catch dead JS | | PR-051 | Signal source-allowlist | only mechanical {loop·latency·cost} + deep-read-evidenced signals; hygiene metrics never emitted | | PR-052 | Remedy linkage | every remedy links to a target + origin; explicit diffStatus when source absent, never fabricate | | PR-053 | Methodology widgets thread into runMeta | codified threading; absence is a loud INTERNAL marker, not a blank |

PR-024..PR-033 were promoted from Wave-3 dogfood candidates (DP-A..DP-J) and operator-LOCKED in the 2026-05-30 Constitutional Review. PR-034 (DP-K) and PR-035..PR-043 were promoted from Wave-6 methodology layer (R2.1–R2.6 + D1/D2) — see references/principles.md for full text.

Project Layout

mutagent-diagnostics/
├── README.md                          ← this file
├── package.json                       ← @mutagent/diagnostics v0.1.0-alpha.0
├── tsconfig.json
├── eslint.config.js
├── docs/index.html                    ← gh-pages standalone artifact
└── .claude/skills/mutagent-diagnostics/
    ├── SKILL.md                       ← agentskills.io-compliant manifest (§0-§9)
    ├── .npmignore                     ← strips internal/ + *.test.ts on publish
    │
    ├── scripts/                       ← Type A: pure scripts (bun runtime)
    │   ├── tier0-scan.ts              ← static scan (free cost gate)
    │   ├── slicer.ts                  ← dynamic-cluster slicing
    │   ├── stale-detector.ts          ← hash compare before apply
    │   ├── cli/
    │   │   ├── init.ts                ← pnpx entrypoint
    │   │   ├── install-agents.ts      ← idempotent agent installer
    │   │   └── run.sh                 ← bun → pnpm → npm fallback selector
    │   ├── config/                    ← schema.ts · load.ts · validate.ts
    │   ├── contract/                  ← types.ts — SelfDiagnosisContract schema (structured-report mode)
    │   ├── fetch/                     ← langfuse.ts · claude-code-transcripts.sh
    │   ├── normalize/                 ← trace.ts · per-platform normalizers · platforms/entity-context.ts (Wave-5 R1.7)
    │   ├── enrich/build-render-input.ts ← Step 8.5 deterministic enricher → RenderInput (Wave-5 R1.4)
    │   ├── report/                    ← render.ts (gold-standard 8-tab HTML) · persist-selections.ts
    │   ├── lint/template-inline-js.ts ← R-007-B: reject TS in HTML scripts
    │   ├── setup/                     ← detect.ts · reconfigure.ts
    │   ├── tier0/                     ← langfuse.ts · claude-code.ts
    │   ├── validate/
    │   └── self-diagnostics/          ← [INTERNAL] probe.ts · dispatch.ts
    │
    ├── assets/
    │   ├── agents/
    │   │   ├── diagnostics-analyzer.md        ← dispatched per cluster (Step 6)
    │   │   └── diagnostics-apply-worker.md    ← spawned at apply gate (Step 11)
    │   ├── templates/
    │   │   ├── report.html.tpl        ← runtime report template (shipped)
    │   │   ├── config.yaml.tpl        ← onboarding config skeleton
    │   │   ├── pr-body.md.tpl         ← apply PR description
    │   │   ├── audit.json.tpl         ← structured audit trail
    │   │   └── audit.md.tpl           ← human-readable audit trail
    │   └── wireframes/                ← picker-UX prompts (onboarding / diagnostics)
    │
    ├── references/                    ← load-on-demand docs (not in SKILL.md)
    │   ├── reference.md               ← entry point + full DAG
    │   ├── principles.md              ← PR-001..PR-053 (audit surface)
    │   ├── operator-feedback-log.md   ← operator feedback on the report shape (Wave-5 R1.6)
    │   ├── config.md                  ← schema with doc strings
    │   ├── workflows/                 ← onboarding · orchestrator-protocol (Step 8.5) · diagnostics · apply · rca · schedule-prep
    │   ├── source-platforms/          ← per-platform fetch + filter examples
    │   ├── target-platforms/          ← per-target apply recipes
    │   └── internal/
    │       └── self-diagnostics.md    ← [INTERNAL] PR-022 playbook
    │
    ├── examples/
    │   └── sample-findings.json       ← example RCA output
    │
    └── internal/                      ← [stripped on publish — see .npmignore]
        └── templates/review/          ← dev review templates (maintainer use only)

For Skill Maintainers

Internal templates (`internal/templates/review/`)

Five reusable HTML templates for design reviews, status dashboards, and skill walkthroughs. These are never shipped to end users — the .npmignore at the skill root strips the entire internal/ directory on publish.

| Template | Use for | |----------|---------| | _brand-shell.html | Starting a new review/dashboard from scratch | | iteration-template.html | Multi-phase approval with per-phase Goal / Scope / Files / Risks | | review-template.html | Iterative spec lock-in with per-section approval checkboxes | | status-template.html | Progress reporting (scroll layout, PR banner, Kanban, timeline) | | skill-overview-template.html | Walking a colleague through SKILL.md §0-§9 visually |

Decision tree for which template to copy: see internal/templates/review/README.md.

Dispatch-gate pattern

The parent coding-agent session IS the orchestrator — no coordinator sub-agent. Sub-agents (diagnostics-analyzer.md, diagnostics-apply-worker.md) are dispatched by the parent session at well-defined steps (Step 6 / Step 11 of orchestrator-protocol.md). Dispatch gates prevent the *afkloop skill from auto-running diagnostics without operator review — any scheduled-diagnostics scenario must go through the HITL copy-back gate first.

Trunk-style PR model

#1047 is the long-lived trunk PR for the v0.1 → v1.0 lifecycle. Milestone commits accumulate on feat/mutagent-diagnostics-skill; squash-merge at v1.0 collapses to one main commit. Do not merge to main without operator sign-off.

Publish checklist

# Verify .npmignore strips internal/ correctly
npm pack --dry-run | grep -v "internal/"

# Lint + typecheck
bun run lint && bun run typecheck

# Test
bun test ./.claude/skills/mutagent-diagnostics/scripts/

Registry target: GitHub Packages (restricted access). Not published to public npmjs.org.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme