agent-triage
v0.2.0
Diagnose your AI agents in production — extract behavioral policies from prompts, evaluate traces against them, and generate diagnostic reports.
agent-triage
Diagnose your AI agents in production. Extract testable policies from your agent's system prompt, evaluate real traces against them, and generate a diagnostic report that pinpoints exactly what's failing, which agent caused it, and what to fix — in minutes, not days.
Why?
Your agent's system prompt is a behavioral contract. agent-triage turns it into testable policies, audits production traces against every one of them, and shows you exactly where things go wrong — down to the specific step, the specific agent, and the specific policy that was violated.
Quick Start
See it in action (requires an LLM API key):
npx agent-triage demo
Use it on your own agent:
# 1. Extract policies from your agent's system prompt
npx agent-triage init --prompt system-prompt.txt
# 2. Evaluate traces (from JSON, LangSmith, or OpenTelemetry)
npx agent-triage analyze --traces conversations.json --prompt system-prompt.txt
# 3. Open the report in your browser
npx agent-triage view
Or skip the setup entirely. agent-triage can auto-discover agents and extract policies directly from LangSmith traces:
# Zero-config: auto-discovers agents, extracts policies, evaluates everything
npx agent-triage analyze --langsmith my-project
Cost: ~$0.40 for 10 conversations with claude-sonnet-4-6 (default), ~$0.02 with gpt-4o-mini. Use --dry-run to preview before running anything.
Privacy: Traces stay on your machine. Only LLM API calls leave — no telemetry, nothing sent to us.
Installation
# Run directly (no install needed)
npx agent-triage demo
# Or install as a project dependency
npm install agent-triage
Requirements
Set your API key as an environment variable:
export OPENAI_API_KEY=sk-...
# or
export ANTHROPIC_API_KEY=sk-ant-...
The Debugging Workflow
agent-triage is built around a debugging funnel — start cheap and broad, narrow to expensive and deep:
SIGNAL → SCOPE → ISOLATE → DIAGNOSE → FIX → VERIFY
Here's what that looks like in practice:
# 1. Signal: is something wrong? (instant, reads from disk)
agent-triage status
# 2. Scope: what kind of conversations are failing? (zero LLM cost)
agent-triage analyze --langsmith my-project --since 24h --quick
# 3. Isolate: find the worst failure
agent-triage explain --worst
# 4. Diagnose: deep-dive into a specific conversation
agent-triage explain conv_abc123
# 5. Fix the prompt, then verify
agent-triage analyze --langsmith my-project
agent-triage diff before/report.json after/report.json
# 6. Track progress over time
agent-triage history
Every command in this flow builds on the previous one. status tells you if there's a problem. explain --worst tells you what the problem is. diff tells you if your fix worked.
What It Does
agent-triage evaluates production agent traces against behavioral policies extracted from your system prompt and generates a single, self-contained HTML diagnostic report. It tells you what failed, where it started — down to the exact step and the responsible agent — why it happened, what to change, and what that change might break.
The report is designed for fast triage first, then deep forensics when you need it:
1. Verdict & Metrics
Get an at-a-glance pipeline summary (e.g., 15 policies extracted → 8 traces evaluated → 10 failures found) and a clear verdict (e.g., "6 of 15 policies are failing"). A metrics dashboard tracks 12 quality scores (Success, Relevancy, Hallucination, Sentiment, Context Retention, etc.) alongside policy compliance so you can spot regressions and trends quickly.
2. Patterns, Top Offenders, and Recommended Fixes
See where things break at scale: failures grouped by type and subtype (e.g., Hallucination, Missing Handoff, Wrong Routing, Tone Violation) and attributed to root-cause categories — prompt issues, orchestration failures, model limitations, or RAG gaps. The report highlights the most affected traces with summaries and severity badges, and provides a ranked list of concrete recommendations — each with a confidence level and the number of conversations impacted — so you can ship the highest-leverage change first.
3. Step-by-Step Deep Dive
For the most severe failure, the report drills all the way down:
- Exact root-cause step: a color-coded timeline that marks where the failure begins, tags the violated policies directly on the offending steps (e.g., "No fabricated pricing ✕", "Escalate billing ✕"), and attributes the failure to the responsible agent in multi-agent setups.
- Failure cascade: how the initial mistake propagates, from hallucination to user pushback to a missed handoff to the agent doubling down.
- What happened / Impact / Fix: a structured narrative with step references and a concrete recommended change with confidence score.
- Blast-radius preview: which other policies are likely to shift if you apply the fix, with estimated impact percentages — so you don't trade one problem for another.
Every failing trace gets its own expandable diagnosis card with the same structure. Every report includes the exact CLI command used to generate it for reproducibility.
| Metric | What It Measures |
|--------|-----------------|
| Success Score | Did the agent achieve the user's goal? |
| AI Relevancy | Were responses on-topic and useful? |
| Sentiment | How did the user feel during the conversation? |
| Hallucination | Did the agent make claims not in the system prompt? |
| Repetition | Did the agent repeat itself unnecessarily? |
| Consistency | Were responses consistent with each other? |
| Natural Language | Did the agent sound natural and human? |
| Context Retention | Did the agent remember earlier context? |
| Verbosity | Were responses appropriately concise? |
| Task Completion | Were all user requests addressed? |
| Clarity | Were responses clear and easy to understand? |
| Truncation | Were responses cut off mid-sentence? |
Commands
analyze
Evaluate traces against policies and generate a diagnostic report.
# From a JSON file
agent-triage analyze --traces conversations.json --prompt system-prompt.txt
# From LangSmith (zero-config — auto-discovers agents and policies)
agent-triage analyze --langsmith my-project
# From OpenTelemetry export
agent-triage analyze --otel traces.json
# Quick mode: skip diagnosis/fixes, ~60% cheaper
agent-triage analyze --langsmith my-project --quick
# Filter by time and agent
agent-triage analyze --langsmith my-project --since 24h --agent "billing-agent"
Options:
- --quick: skip diagnosis and fix generation (faster, ~60% cheaper)
- --since <duration> / --until <duration>: time window (e.g. 2h, 24h, 7d)
- --agent <name>: filter to a specific agent
- --dry-run: show estimated cost without calling the LLM
- --max-conversations <n>: limit evaluation to N traces
- --format json: output JSON to stdout instead of terminal summary
- --model <model>: use a specific model (default: gpt-4o-mini)
- --provider <provider>: openai, anthropic, or openai-compatible
- --include-prompt: include the system prompt text in the report JSON
- --summary-only: omit trace transcripts from report
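The --since and --until values (2h, 24h, 7d) are relative durations. As a rough illustration of how such a value could map to a cutoff timestamp, here is a minimal TypeScript sketch. The names parseDurationMs and sinceToCutoff are hypothetical and are not agent-triage exports. Only the h and d units that appear in the examples are handled; the CLI itself may accept other units or formats.

```typescript
// Hypothetical helpers for illustration only; not part of agent-triage's API.
type Unit = "h" | "d";

// Milliseconds per supported unit (assumption: only hours and days).
const UNIT_MS: Record<Unit, number> = {
  h: 3_600_000,
  d: 86_400_000,
};

// Convert a string such as "24h" or "7d" into milliseconds.
function parseDurationMs(input: string): number {
  const match = /^(\d+)([hd])$/.exec(input.trim());
  if (match === null) {
    throw new Error(`Unrecognized duration: "${input}"`);
  }
  const amount = Number(match[1]);
  const unit = match[2] as Unit;
  return amount * UNIT_MS[unit];
}

// Compute the earliest timestamp a trace may have to fall inside the window.
function sinceToCutoff(since: string, now: Date = new Date()): Date {
  return new Date(now.getTime() - parseDurationMs(since));
}

console.log(parseDurationMs("24h")); // 86400000
```

Passing the current time as a parameter keeps the cutoff calculation deterministic, which makes the helper easy to test.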
explain
Deep-dive diagnosis of a single conversation — root cause, cascade chain, blast radius, and suggested fix.
# Explain the worst failing conversation from the last report
agent-triage explain --worst
# Explain a specific conversation
agent-triage explain conv_abc123
# Explain from a trace source (if no report exists yet)
agent-triage explain conv_abc123 --langsmith my-project
check
Targeted policy compliance check — faster and cheaper than full analyze (no metrics, no diagnosis).
# Check all policies
agent-triage check --langsmith my-project --since 24h
# Check specific policies
agent-triage check --langsmith my-project --policy escalation-policy --policy tone-policy
# CI gate: exit code 1 if compliance below threshold
agent-triage check --traces conversations.json --threshold 90
status
Instant health check from the last report. Zero LLM cost — reads from disk.
agent-triage status
history
Show compliance trends across analyze runs. Zero LLM cost.
# Show all runs
agent-triage history
# Show last 5 runs
agent-triage history --last 5
# JSON output
agent-triage history --format json
init
Extract testable policies from your agent's system prompt.
agent-triage init --prompt system-prompt.txt
Outputs policies.json, an editable file of behavioral rules your agent should follow. Review and adjust it before running evaluation.
diff
Compare two reports to see what changed after prompt edits.
agent-triage diff before/report.json after/report.json
view
Open the generated HTML report in your default browser.
agent-triage view
demo
Run a full demo with built-in example agents and traces.
agent-triage demo
MCP Server
agent-triage includes an MCP (Model Context Protocol) server, so AI assistants like Claude and Cursor can debug your agents programmatically.
{
  "mcpServers": {
    "agent-triage": {
      "command": "npx",
      "args": ["-y", "agent-triage-mcp"]
    }
  }
}
The MCP server exposes 9 tools that follow the same debugging funnel:
| Tool | Cost | Purpose |
|------|------|---------|
| triage_status | Zero | Health check from last report |
| triage_sample | Zero | Browse conversations with keyword search |
| triage_list_policies | Zero | List loaded policies |
| triage_history | Zero | Compliance trends across runs |
| triage_diff | Zero | Compare two reports |
| triage_check | Moderate | Targeted policy compliance |
| triage_explain | Moderate | Root cause diagnosis |
| triage_init | Moderate | Extract policies from prompt |
| triage_analyze | High | Full evaluation pipeline |
An AI assistant using these tools would naturally: check triage_status to see if there's a problem, use triage_sample with keyword search to find relevant conversations, then triage_explain to diagnose the root cause.
Trace Format
agent-triage accepts traces in three formats:
JSON (recommended)
[
  {
    "id": "conv_001",
    "messages": [
      { "role": "system", "content": "You are a support agent..." },
      { "role": "user", "content": "I need help with my order" },
      { "role": "assistant", "content": "I'd be happy to help!" }
    ]
  }
]
Flexible field mapping is supported: role/sender, content/text/message, and human/ai/bot/agent role variants are all accepted. JSONL format (one conversation per line) also works.
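To make the field mapping concrete, the following TypeScript sketch shows one way such aliases could be folded into the canonical role/content shape from the JSON example. It is an illustration under stated assumptions, not agent-triage's actual loader: normalizeMessage and parseJsonl are hypothetical names, and the choice to map human to user and ai/bot/agent to assistant is an assumption about how the variants are interpreted.

```typescript
// Hypothetical normalizer for illustration only; not agent-triage's loader.
type Role = "system" | "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

// Assumed alias table: which role variants map to which canonical role.
const ROLE_ALIASES: Record<string, Role> = {
  system: "system",
  user: "user",
  human: "user",
  assistant: "assistant",
  ai: "assistant",
  bot: "assistant",
  agent: "assistant",
};

// Accept role/sender and content/text/message, and return the canonical shape.
function normalizeMessage(raw: Record<string, unknown>): Message {
  const rawRole = raw["role"] ?? raw["sender"];
  const rawContent = raw["content"] ?? raw["text"] ?? raw["message"];
  if (typeof rawRole !== "string" || typeof rawContent !== "string") {
    throw new Error("Message needs a string role and string content");
  }
  const role = ROLE_ALIASES[rawRole.toLowerCase()];
  if (role === undefined) {
    throw new Error(`Unknown role: "${rawRole}"`);
  }
  return { role, content: rawContent };
}

// Split JSONL text into one parsed value per non-empty line.
function parseJsonl(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}

console.log(normalizeMessage({ sender: "human", text: "I need help with my order" }));
```

Rejecting unknown roles instead of guessing keeps malformed traces visible early, though a real loader might prefer to skip or warn.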
LangSmith
Point to a LangSmith project and agent-triage will fetch traces automatically. Auto-detects trace-based vs session-based architectures, discovers agents by system prompt, and pushes time filters server-side for efficiency. Requires LANGSMITH_API_KEY.
Note: LangSmith's API is rate-limited, so fetching large projects can take a few minutes. agent-triage throttles requests automatically and shows progress. Use --since/--until to narrow the time window, or --max-conversations to cap the number of traces fetched.
OpenTelemetry
Export OTLP/JSON traces from any OpenTelemetry-instrumented agent. agent-triage follows the GenAI semantic conventions (pinned to v1.36.0).
Configuration
Create agent-triage.config.yaml for persistent settings:
llm:
  provider: openai
  model: gpt-4o-mini
  # apiKey: ${OPENAI_API_KEY}  # resolved from env vars
  maxConcurrency: 5
prompt:
  path: system-prompt.txt
agent:
  name: "My Support Agent"
output:
  dir: .
  maxConversations: 500
Environment variable references (${VAR_NAME}) are automatically resolved in config values. CLI flags take precedence over config file values.
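The ${VAR_NAME} substitution can be pictured as a recursive pass over the parsed config. The TypeScript sketch below is a minimal illustration, not agent-triage's implementation: resolveEnvRefs and resolveConfig are hypothetical names, the environment is passed in explicitly rather than read from the process, and throwing on an unset variable is an assumption (the tool may behave differently).

```typescript
// Hypothetical resolver for illustration only; not agent-triage's code.
type Env = Record<string, string | undefined>;

// Replace every ${VAR_NAME} reference in a single string value.
function resolveEnvRefs(value: string, env: Env): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_whole: string, name: string) => {
      const resolved = env[name];
      if (resolved === undefined) {
        // Assumption: an unset variable is treated as an error.
        throw new Error(`Environment variable ${name} is not set`);
      }
      return resolved;
    },
  );
}

// Walk a parsed config tree and resolve references in every string leaf.
function resolveConfig(node: unknown, env: Env): unknown {
  if (typeof node === "string") {
    return resolveEnvRefs(node, env);
  }
  if (Array.isArray(node)) {
    return node.map((item) => resolveConfig(item, env));
  }
  if (node !== null && typeof node === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(node)) {
      out[key] = resolveConfig(child, env);
    }
    return out;
  }
  return node; // numbers, booleans, and null pass through unchanged
}

const example = resolveConfig(
  { llm: { provider: "openai", apiKey: "${OPENAI_API_KEY}", maxConcurrency: 5 } },
  { OPENAI_API_KEY: "sk-example" },
);
console.log(example);
```

Taking the environment as an argument keeps the function pure, so it can be exercised without touching real credentials.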
Programmatic API
agent-triage can be used as a library:
import {
  readJsonTraces,
  extractPolicies,
  createLlmClient,
  evaluateAll,
  buildHtml,
} from "agent-triage";

const llm = createLlmClient("openai", process.env.OPENAI_API_KEY!, "gpt-4o-mini");
const conversations = await readJsonTraces("./conversations.json");
// ... evaluate, aggregate, generate report
See src/index.ts for all available exports.
How It Compares
| Feature | agent-triage | IntellAgent | DeepEval | Promptfoo |
|---------|:-:|:-:|:-:|:-:|
| Production trace analysis | Yes | No | Partial | Partial |
| Policy extraction from prompts | Yes | No | No | No |
| Multi-connector (JSON, LangSmith, OTel) | Yes | LangGraph only | Custom | Custom |
| Quality metrics (12 built-in) | Yes | Binary pass/fail | Custom | Custom |
| Self-contained HTML report | Yes | No | Dashboard | No |
| Step-level root cause + cascade | Yes | No | No | No |
| Blast-radius warnings | Yes | No | No | No |
| Cross-run diff | Yes | No | No | Yes |
| MCP server for AI assistants | Yes | No | No | No |
| Zero-config LangSmith | Yes | No | No | No |
| CI compliance gates | Yes | No | Yes | Yes |
| No infrastructure required | Yes | Yes | No (server) | Yes |
| License | MIT | MIT | Apache 2.0 | MIT |
Comparison accurate as of February 2026. Open an issue if any entry needs updating.
agent-triage vs. Converra
agent-triage is a standalone diagnostic tool. It gives you a complete picture of what's failing and why.
Converra is an optional next step that automates the fix cycle:
- Tested fix proposals — concrete prompt patches with confidence scores, not directional hints
- Simulation testing — test fixes against personas, scenarios, and complexity levels before deploying
- Regression gating — ensure fixes don't break other policies
- Continuous monitoring — alerts and dashboards for agent health over time
- Team collaboration — shared workspace for reviewing and deploying fixes
License
Contributing
We welcome contributions, especially new trace connectors. See CONTRIBUTING.md for development setup and guidelines.
git clone https://github.com/converra/agent-triage
cd agent-triage
npm install
npm run build
npm test
Built by Converra
