@mutagent/diagnostics
v0.1.0-alpha.8
Published
mutagent-diagnostics: AI agent diagnostics-on-tap skill for Claude Code, Codex, and other coding runtimes
Maintainers
Readme
MUTAGENT Diagnostics
███▄ ▄███▓ █ ██ ▄▄▄█████▓ ▄▄▄ ▄████ ▓█████ ███▄ █ ▄▄▄█████▓
▓██▒▀█▀ ██▒ ██ ▓██▒▓ ██▒ ▓▒▒████▄ ██▒ ▀█▒▓█ ▀ ██ ▀█ █ ▓ ██▒ ▓▒
▓██ ▓██░▓██ ▒██░▒ ▓██░ ▒░▒██ ▀█▄ ▒██░▄▄▄░▒███ ▓██ ▀█ ██▒▒ ▓██░ ▒░
▒██ ▒██ ▓▓█ ░██░░ ▓██▓ ░ ░██▄▄▄▄██ ░▓█ ██▓▒▓█ ▄ ▓██▒ ▐▌██▒░ ▓██▓ ░
▒██▒ ░██▒▒▒█████▓ ▒██▒ ░ ▓█ ▓██▒░▒▓███▀▒░▒████▒▒██░ ▓██░ ▒██▒ ░
░ ▒░ ░ ░░▒▓▒ ▒ ▒ ▒ ░░ ▒▒ ▓▒█░ ░▒ ▒ ░░ ▒░ ░░ ▒░ ▒ ▒ ▒ ░░
░ ░ ░░░▒░ ░ ░ ░ ▒ ▒▒ ░ ░ ░ ░ ░ ░░ ░░ ░ ▒░ ░
░ ░ ░░░ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░
░ ░ ░ ░ ░ ░ ░ ░Diagnostics-on-Tap for AI agents. Pull evidence from your agent traces, translate user feedback into root-cause records, surface ranked remedies, and apply approved fixes — all from within your existing AI coding session.
Table of Contents
- What it does
- Quick Start
- Platform Support
- Architecture
- Features
- Configuration
- Design Principles
- Project Layout
- For Skill Maintainers
- License
What it does
@mutagent/diagnostics is an AI coding agent skill — a self-contained bundle that installs
into your existing coding-agent runtime (Claude Code, Codex, Cursor, OpenCode) and gives it
the ability to diagnose itself.
Invoke it naturally in chat: "diagnose my agents" or "why did my agent fail last night".
The skill:
- Tier 0 static scan — free, instant pattern match on your traces to bound token cost
- Auto-extracts a rich entity context at ingest (system prompt · tool inventory · input sample) — deterministically, no LLM
- Slices trace history into up to 5 parallel analysis clusters
- Dispatches N Analyzers that run RCA against each cluster
- Classifies each finding on 3 axes: WHAT failed · WHY it failed · WHERE in the stack
- Builds a deterministic render input (Step 8.5 enricher) — derives the big-stat row, 24h latency heatmap, and signal census; fail-loud if the input is starved
- Renders a gold-standard HTML report — Methodology [INTERNAL] · Overview · one tab per finding · Decisions — with copy-back markdown approval workflow
- Spawns a Background apply-worker on an isolated git worktree that opens a PR (or issues idempotent REST mutations for cloud targets) — without touching your checkout
First invocation triggers interactive onboarding to configure your source + target platform. All subsequent invocations run the full diagnostic cycle automatically.
Quick Start
# First-time install — bootstraps config + runs onboarding wizard
$ pnpx @mutagent/diagnostics init
mutagent-diagnostics v0.1.0-alpha
Checking host runtime... claude-code detected
? Select trace source platform:
> Langfuse (cloud / self-hosted)
OpenTelemetry (Jaeger / Tempo / Honeycomb)
Local JSONL files
Claude Code session transcripts
Codex session transcripts
? Select target platform for fix delivery:
> Claude Code (local agent config)
Cursor (.cursorrules)
OpenCode (config)
Mastra.ai (TypeScript PR)
Cloud REST (idempotent PUT)
Config written to .mutagent-diagnostics/config.yaml
Agents installed: .claude/agents/diagnostics-analyzer.md
.claude/agents/diagnostics-apply-worker.md
Ready. Invoke: "diagnose my agents" in your coding session.After init, trigger diagnostics naturally from your AI session:
You: diagnose my agents
> Running Tier 0 static scan on 247 traces...
> 3 signal clusters detected (errors: 12, low-score: 8, feedback: 4)
> Dispatching 3 analyzers in parallel...
> RCA complete. 7 findings ranked by severity.
> Opening HTML report...Platform Support
Source Platforms (trace evidence)
| Platform | Type | Notes |
|----------|------|-------|
| Langfuse | Cloud / self-hosted | Full filter/search coverage |
| OpenTelemetry | Jaeger · Tempo · Honeycomb | OTLP endpoint |
| Local JSONL | .jsonl / .ndjson | Filesystem read |
| Claude Code | Session transcripts | Auto-detected from ~/.claude/projects/ |
| Codex CLI | Session transcripts | Auto-detected from ~/.codex/sessions/ |
Target Platforms (apply fixes to)
| Platform | Class | Delivery |
|----------|-------|---------|
| Claude Code | local-agent | Edit ~/.claude/agents/*.md via PR |
| Codex CLI | local-agent | Edit agent configs via PR |
| Cursor | local-agent | Edit .cursorrules via PR |
| OpenCode | local-agent | Edit config via PR |
| Mastra.ai | local-code-construct | TypeScript PR to agent graph |
| Cloud Agent SDK | local-code-construct | TypeScript PR to SDK integration |
| Cloud REST | remote | Idempotent PUT mutations |
Architecture
%%{init: {'theme':'dark','themeVariables':{'primaryColor':'#a78bfa','primaryTextColor':'#e2e8f0','primaryBorderColor':'#7c3aed','lineColor':'#06b6d4','edgeLabelBackground':'#1e1b4b','background':'#0f0f23','clusterBkg':'#1e1b4b','clusterBorder':'#4c1d95','titleColor':'#a78bfa'}}}%%
flowchart TD
INV["Invoke<br/>'diagnose my agents'"]
DETECT{Config<br/>present?}
ONB["Onboarding<br/>(8-phase)"]
PROTO["Load orchestrator-protocol.md<br/>Parent session = orchestrator"]
TIER0["Tier 0 static scan<br/>scripts/tier0-scan.ts"]
SLICER["Dynamic-cluster slicer<br/>scripts/slicer.ts<br/>cap = 5"]
ANL["N Analyzers ≤ 5<br/>assets/agents/diagnostics-analyzer.md"]
RCA["RCA Layer<br/>WHAT / WHY / WHERE"]
ENRICH["Step 8.5 — Build Render Input<br/>scripts/enrich/build-render-input.ts<br/>deterministic · fail-loud"]
REPORT["Gold-standard HTML Report<br/>assets/templates/report.html.tpl<br/>copy-back markdown approval"]
APPLY["BG apply-worker on worktree<br/>assets/agents/diagnostics-apply-worker.md<br/>PR or REST"]
classDef purple fill:#4c1d95,stroke:#a78bfa,color:#e2e8f0
classDef cyan fill:#0e7490,stroke:#06b6d4,color:#e2e8f0
classDef dim fill:#1e293b,stroke:#475569,color:#94a3b8
INV --> DETECT
DETECT -->|missing| ONB
DETECT -->|present| PROTO
PROTO --> TIER0
TIER0 --> SLICER
SLICER --> ANL
ANL --> RCA
RCA --> ENRICH
ENRICH --> REPORT
REPORT --> APPLY
class INV,DETECT purple
class TIER0,SLICER,ANL,RCA,ENRICH cyan
class ONB,PROTO,REPORT,APPLY dimFull DAG and component dependency graph: references/reference.md
Features
| Feature | Detail |
|---------|--------|
| 3-axis failure taxonomy | Every finding classified on (WHAT, WHY, WHERE) — 10 WHAT types · 9 WHY types · 8 WHERE types |
| Tier 0 static scan | Cost gate before any LLM call — pattern-match + signal count in milliseconds |
| Dynamic cluster slicing | Groups traces by error / feedback / latency signal; window-based fallback when no a-priori signal |
| Cap-of-5 fan-out | Never more than 5 parallel analyzers — keeps cost and complexity bounded |
| Auto-extracted entity context | Every normalizer derives an EntityContext (system prompt · tool inventory with per-tool stats · input sample) at ingest — deterministic, no LLM; operator never hand-fills it. Fields > 1 KB render as collapsed ExpandableSection (system prompt always collapsed for PII) |
| Gold-standard HTML report | 8-tab report — Methodology [INTERNAL] · Overview (entity card · 6-tile big-stat · 24h latency heatmap · signal census · scan funnel) · one tab per finding (taxonomy · evidence · why-chain · assumptions pills · ★-recommended remedies) · Decisions. Built by the deterministic Step 8.5 enricher; renderer is fail-loud on starved input (R1 §9.3) |
| HTML report + copy-back | Human-in-the-loop review via rendered HTML; operator pastes approval markdown back into chat |
| Structured contract mode | When a target ships a self-diagnosis-contract.yaml, the report becomes a structured 10-category pass/fail/pending scorecard against the declared success criteria |
| Branch-hygiene apply | All local fixes land in an isolated git worktree → PR; operator's checkout is never touched |
| Stale-target detection | Hash-compare before every write; re-diagnose prompt if drift detected |
| Idempotent REST | Cloud-target PUTs include Idempotency-Key; safe to retry on 5xx |
| Dual audit trail | Every apply emits pr-body.md + audit.json + audit.md |
| Self-diagnostics | Post-session self-RCA (opt-in, default OFF); feeds skill's own transcripts through the same pipeline |
| Platform-portable UX | AskUserQuestion on Claude Code; numbered chat-choice fallback on Codex / Cursor / OpenCode |
| Platform-native install | pnpx @mutagent/diagnostics init auto-detects runtime and installs agent .md files |
Failure Taxonomy
WHAT → wrong-output · missing-output · loop · latency-spike · cost-overshoot
format-violation · hallucination · user-complaint · low-score · missing-context
WHY → prompt-underspec · prompt-overspec · tool-misuse · tool-missing
context-overflow · provider-limit · data-staleness · handoff-loss · dependency-failure
WHERE → system-prompt · tool-definition · agent-config · routing-config
upstream-data · provider-side · harness-side · user-inputConfiguration
Config lives at <project>/.mutagent-diagnostics/config.yaml (generated by init).
Secrets (API keys) live at <project>/.mutagentrc — gitignored, never committed.
# .mutagent-diagnostics/config.yaml — generated by pnpx @mutagent/diagnostics init
source:
platform: "langfuse" # langfuse | otel | local-jsonl | claude-code | codex
config:
host: "https://cloud.langfuse.com"
# Keys: set LANGFUSE_SECRET_KEY + LANGFUSE_PUBLIC_KEY in .mutagentrc
target:
platform: "local-claude" # local-claude | local-codex | local-cursor | local-opencode
# local-mastra | local-cloud-agent-sdk | cloud-rest
config: {}
filters:
time_window:
from: "7daysAgo"
to: "now"
score_below: null # auto-scaled threshold when set
limit: 100
ask_tool:
runtime: "claude-code" # claude-code | chat-multi-choice
self_diagnostics:
enabled: false # default OFFFull schema with doc strings: references/config.md
Design Principles
These principles are the audit surface for every PR. See
references/principles.md
for full text and scope annotations.
| # | Principle | Summary |
|---|-----------|---------|
| PR-001 | Tier 0 before LLM | Static scan runs before any LLM call |
| PR-002 | RCA layer mandatory | Raw scores never shown; always WHAT+WHY+WHERE+evidence |
| PR-003 | Read-before-write for REST | GET current state, diff, show diff, then PUT |
| PR-004 | Branch hygiene | All local applies in isolated worktree → PR; operator checkout untouched |
| PR-005 | Cap-of-5 fan-out | Never > 5 parallel analyzers |
| PR-006 | Codebase-agnostic placeholders | Templates use {PLACEHOLDER}; no hardcoded paths |
| PR-007 | Progressive disclosure | SKILL.md ≤ 500 lines; all detail in references/ |
| PR-008 | Hyperlinked external docs | Links to upstream docs, not local copies |
| PR-009 | Single source of truth for config | TypeBox schema in scripts/config/schema.ts |
| PR-010 | Multi-choice picker UX | AskUserQuestion on Claude Code; chat fallback elsewhere |
| PR-011 | Stale-target detection | Hash-compare before every write |
| PR-012 | Idempotent retry for REST | Idempotency-Key header; max 2 retries |
| PR-013 | Dual-emit audit | PR body + audit.json + audit.md on every apply |
| PR-014 | HITL via HTML copy-back | HTML report is primary review surface |
| PR-015 | Per-section rationale links | Reference docs link to the FRs they implement |
| PR-016 | Adapter contract pattern | Unified interface per role (normalizer, apply-worker) |
| PR-017 | Dynamic-cluster slicing | Cluster by signal when available; window-based fallback |
| PR-018 | Translation grounds in evidence | Findings must cite trace message index or code line |
| PR-019 | Scripts vs agent operations | Explicit Type A / B / C classification per operation |
| PR-020 | Recursive whys to origin | No fixed depth; keep asking until isOrigin: true |
| PR-021 | CLI install detection | which <cli> at onboarding; prompt install if missing |
| PR-022 | Self-Diagnostics protocol | Post-session self-RCA; default OFF |
| PR-023 | Clipboard payloads self-contained | Every remedy embeds ActionablePlan (v0.3+) |
| PR-024 | Orchestration in parent session | Never delegate the loop to a coordinator sub-agent (DP-A) |
| PR-025 | Self-diagnosis == client diagnosis | One RCA engine; only the subject differs (DP-B) |
| PR-026 | Findings grounded inline | Read the definition or mark hypothesis-pending-source (DP-C) |
| PR-027 | Declare scope; census-pick the primary signal | Disclose what was NOT assessed (DP-D) |
| PR-028 | Cheap-first Tier 0 | Broad cheap checks; don't over-index on one signal (DP-E) |
| PR-029 | Report IS the product surface | Template-stamp over procedural rendering (DP-F) |
| PR-030 | Clipboard payloads carry full decision context | Apply-worker authors the PR without re-reading the session (DP-G, extends PR-023) |
| PR-031 | Remedies declare apply-target + change-type + write-access | (DP-H) |
| PR-032 | Migrations carry a SWEEP | grep + fix all refs in the same change (DP-I) |
| PR-033 | Tag each finding PRODUCT/META/CORE | Self-diagnosis is a discovery lane for PRODUCT/CORE; --audience client NODE-STRIPs internal (DP-J) |
| PR-034 | CLI-contract drift watch | Assert the built command string against the source-platform manual (DP-K) |
| PR-035 | Fresh runs MUST LLM-deep-read | Caps bound the read, never skip it (promotes R2.1 +D1) |
| PR-036 | Fix the MEASUREMENT layer | 5-trace awareness sample discovers what Tier-0 can't measure (promotes R2.2) |
| PR-037 | Learn across runs | Approved findings accumulate into a class-memory library, matched first in Tier-0 (promotes R2.3 +D2) |
| PR-038 | Make signal selection legible | Tier pie, selection-rule cards, mermaid selection trace (promotes R2.4) |
| PR-039 | Prove the sample represents the population | 4-dim coverage proof + confidence grade (promotes R2.5) |
| PR-040 | Operator-driven, not skill-driven | Single-arg NL invocation, verbatim brief preserved (promotes R2.6 +D2) |
| PR-041 | Cost realism | Distrust skill-side cost tracking; cost cap INACTIVE by default (promotes D1) |
| PR-042 | Preserve the verbatim operator invocation across runs | (promotes D2) |
| PR-043 | Every methodology decision is recorded in runMeta as an append-only audit row | |
| PR-044 | Normalize before analyze | Canonical schema (Step 3.7) before any Tier-0/LLM work |
| PR-045 | Remedies carry two distinct rationales | rationale (why-real) vs whyWorks (why-the-fix-works) |
| PR-046 | Feedback two-layered + fix-feedback | raw user signal → translated component-level; + fix-feedback on the solution |
| PR-047 | No-code-access honesty | explicit assumption when source absent; never fabricate paths |
| PR-048 | Trace Hungriness | scale the deep read with the population (tiers 100/250/500/1000) |
| PR-049 | Signal detection + selection | one reconciled primary signal drives census · heatmap · funnel |
| PR-050 | Inline-JS parse gate | every emitted <script> must parse (new Function); data gates can't catch dead JS |
| PR-051 | Signal source-allowlist | only mechanical {loop·latency·cost} + deep-read-evidenced signals; hygiene metrics never emitted |
| PR-052 | Remedy linkage | every remedy links to a target + origin; explicit diffStatus when source absent, never fabricate |
| PR-053 | Methodology widgets thread into runMeta | codified threading; absence is a loud INTERNAL marker, not a blank |
PR-024..PR-033 were promoted from Wave-3 dogfood candidates (DP-A..DP-J) and operator-LOCKED in the 2026-05-30 Constitutional Review. PR-034 (DP-K) and PR-035..PR-043 were promoted from Wave-6 methodology layer (R2.1–R2.6 + D1/D2) — see
references/principles.mdfor full text.
Project Layout
mutagent-diagnostics/
├── README.md ← this file
├── package.json ← @mutagent/diagnostics v0.1.0-alpha.0
├── tsconfig.json
├── eslint.config.js
├── docs/index.html ← gh-pages standalone artifact
└── .claude/skills/mutagent-diagnostics/
├── SKILL.md ← agentskills.io-compliant manifest (§0-§9)
├── .npmignore ← strips internal/ + *.test.ts on publish
│
├── scripts/ ← Type A: pure scripts (bun runtime)
│ ├── tier0-scan.ts ← static scan (free cost gate)
│ ├── slicer.ts ← dynamic-cluster slicing
│ ├── stale-detector.ts ← hash compare before apply
│ ├── cli/
│ │ ├── init.ts ← pnpx entrypoint
│ │ ├── install-agents.ts ← idempotent agent installer
│ │ └── run.sh ← bun → pnpm → npm fallback selector
│ ├── config/ ← schema.ts · load.ts · validate.ts
│ ├── contract/ ← types.ts — SelfDiagnosisContract schema (structured-report mode)
│ ├── fetch/ ← langfuse.ts · claude-code-transcripts.sh
│ ├── normalize/ ← trace.ts · per-platform normalizers · platforms/entity-context.ts (Wave-5 R1.7)
│ ├── enrich/build-render-input.ts ← Step 8.5 deterministic enricher → RenderInput (Wave-5 R1.4)
│ ├── report/ ← render.ts (gold-standard 8-tab HTML) · persist-selections.ts
│ ├── lint/template-inline-js.ts ← R-007-B: reject TS in HTML scripts
│ ├── setup/ ← detect.ts · reconfigure.ts
│ ├── tier0/ ← langfuse.ts · claude-code.ts
│ ├── validate/
│ └── self-diagnostics/ ← [INTERNAL] probe.ts · dispatch.ts
│
├── assets/
│ ├── agents/
│ │ ├── diagnostics-analyzer.md ← dispatched per cluster (Step 6)
│ │ └── diagnostics-apply-worker.md ← spawned at apply gate (Step 11)
│ ├── templates/
│ │ ├── report.html.tpl ← runtime report template (shipped)
│ │ ├── config.yaml.tpl ← onboarding config skeleton
│ │ ├── pr-body.md.tpl ← apply PR description
│ │ ├── audit.json.tpl ← structured audit trail
│ │ └── audit.md.tpl ← human-readable audit trail
│ └── wireframes/ ← picker-UX prompts (onboarding / diagnostics)
│
├── references/ ← load-on-demand docs (not in SKILL.md)
│ ├── reference.md ← entry point + full DAG
│ ├── principles.md ← PR-001..PR-053 (audit surface)
│ ├── operator-feedback-log.md ← operator feedback on the report shape (Wave-5 R1.6)
│ ├── config.md ← schema with doc strings
│ ├── workflows/ ← onboarding · orchestrator-protocol (Step 8.5) · diagnostics · apply · rca · schedule-prep
│ ├── source-platforms/ ← per-platform fetch + filter examples
│ ├── target-platforms/ ← per-target apply recipes
│ └── internal/
│ └── self-diagnostics.md ← [INTERNAL] PR-022 playbook
│
├── examples/
│ └── sample-findings.json ← example RCA output
│
└── internal/ ← [stripped on publish — see .npmignore]
└── templates/review/ ← dev review templates (maintainer use only)For Skill Maintainers
Internal templates (internal/templates/review/)
Five reusable HTML templates for design reviews, status dashboards, and skill walkthroughs.
These are never shipped to end users — the .npmignore at the skill root strips the
entire internal/ directory on publish.
| Template | Use for |
|----------|---------|
| _brand-shell.html | Starting a new review/dashboard from scratch |
| iteration-template.html | Multi-phase approval with per-phase Goal / Scope / Files / Risks |
| review-template.html | Iterative spec lock-in with per-section approval checkboxes |
| status-template.html | Progress reporting (scroll layout, PR banner, Kanban, timeline) |
| skill-overview-template.html | Walking a colleague through SKILL.md §0-§9 visually |
Decision tree for which template to copy: see internal/templates/review/README.md.
Dispatch-gate pattern
The parent coding-agent session IS the orchestrator — no coordinator sub-agent.
Sub-agents (diagnostics-analyzer.md, diagnostics-apply-worker.md) are dispatched
by the parent session at well-defined steps (Step 6 / Step 11 of orchestrator-protocol.md).
Dispatch gates prevent the *afkloop skill from auto-running diagnostics without operator
review — any scheduled-diagnostics scenario must go through the HITL copy-back gate first.
Trunk-style PR model
#1047 is the long-lived trunk PR for the v0.1 → v1.0 lifecycle.
Milestone commits accumulate on feat/mutagent-diagnostics-skill; squash-merge at v1.0
collapses to one main commit. Do not merge to main without operator sign-off.
Publish checklist
# Verify .npmignore strips internal/ correctly
npm pack --dry-run | grep -v "internal/"
# Lint + typecheck
bun run lint && bun run typecheck
# Test
bun test ./.claude/skills/mutagent-diagnostics/scripts/Registry target: GitHub Packages (restricted access). Not published to public npmjs.org.
License
Proprietary. All rights reserved. See LICENSE.txt for full terms.
@mutagent/diagnostics is published to GitHub Packages (restricted access) —
not available on public npm registries.
