agent-reliability
v0.2.1
Published
End-to-end reliability testing for multi-step AI agent workflows. Tests what DeepEval and Langfuse can't: cascading failures across 10+ step chains.
Maintainers
Keywords
Readme
agent-reliability
End-to-end reliability and security testing for multi-step AI agent workflows.
If each step has 85% success rate, a 10-step workflow succeeds only 19.7% of the time. Existing tools (DeepEval, Langfuse, LangSmith) test individual LLM calls. This tests the entire chain — where one failure cascades through everything.
As of v0.2, it also ships SecurityReviewer: a workflow-level security check suite covering prompt injection, canary leaks, unsafe step code, missing controls, and dependency advisories. See §10.
npm install agent-reliabilityTable of Contents
- The Problem
- Quick Start
- Core Concepts
- Full Usage Guide
- Real-World Examples
- API Reference
- Architecture
- Comparison
The Problem
You built an AI agent with 5 steps:
Classify Intent → Fetch Context → Generate Reply → Safety Check → Send
85% 90% 85% 95% 99%Each step looks great individually. But end-to-end?
0.85 × 0.90 × 0.85 × 0.95 × 0.99 = 61.2%Your agent fails 39% of the time. And it's worse than that — failures correlate. When the LLM misclassifies intent in Step 1, Steps 2-5 all get garbage input. The real E2E rate is even lower.
Nobody measures this. DeepEval tests the LLM call. Langfuse traces the call. Neither tests the chain.
Quick Start
import { WorkflowRunner, ReliabilityAnalyzer } from "agent-reliability";
// 1. Define your agent workflow
const runner = new WorkflowRunner("my-agent");
runner
.addStep({
id: "classify",
name: "Classify Intent",
execute: async (input) => {
// Your real classification logic
const intent = await classifyIntent(input.message);
return { output: { intent, message: input.message } };
},
})
.addStep({
id: "generate",
name: "Generate Reply",
execute: async (input) => {
const reply = await callLLM(input.intent, input.message);
return { output: reply, tokens_used: 150, cost_usd: 0.002 };
},
retry: { max_attempts: 3, backoff_ms: 1000 },
})
.addStep({
id: "safety",
name: "Safety Check",
execute: async (input) => {
if (input.includes("harmful")) throw new Error("Blocked by safety filter");
return { output: input };
},
});
// 2. Run 100 times
const runs = await runner.benchmark(100);
// 3. Analyze
const analyzer = new ReliabilityAnalyzer();
const report = analyzer.analyze("my-agent", runs);
console.log(`E2E Success Rate: ${(report.e2e_success_rate * 100).toFixed(1)}%`);
console.log(`Failure Hotspots:`, report.failure_hotspots);
console.log(`Cascade Map:`, report.cascade_map);Core Concepts
| Concept | What It Means | |---------|---------------| | E2E Success Rate | Fraction of runs where ALL steps completed successfully. The metric that matters. | | Cascade Failure | When Step 2 fails, Steps 3-5 get skipped. One failure kills the whole chain. | | Correlation Gap | Difference between predicted rate (product of step rates) and actual rate. Gap > 10% = failures are correlated. | | Chaos Testing | Inject real-world failures (timeouts, rate limits, malformed output) to find breaking points. | | Reliability Score | Single 0-1 number grading workflow reliability. Empirically calibrated weights. |
Full Usage Guide
1. Define Workflow Steps
Every step needs an id, name, and execute function:
import { WorkflowStep } from "agent-reliability";
const step: WorkflowStep = {
id: "fetch_context",
name: "Fetch Context from DB",
execute: async (input, context) => {
const docs = await db.query(input.query);
return {
output: docs,
latency_ms: 45,
tokens_used: 0,
cost_usd: 0,
};
},
// Optional: validate output before passing to next step
validate: (output) => ({
valid: output.length > 0,
errors: output.length === 0 ? ["No documents found"] : [],
warnings: [],
}),
// Optional: retry on failure
retry: {
max_attempts: 3,
backoff_ms: 1000,
retry_on: ["timeout", "429"], // only retry these errors
},
// Optional: fallback if all retries fail
fallback: async (error, input, context) => {
return { output: [{ content: "Default context", score: 0.5 }] };
},
// Optional: timeout
timeout_ms: 5000,
// Optional: dependencies (for DAG workflows, not just linear chains)
depends_on: ["classify"],
};2. Run and Analyze
import { WorkflowRunner, ReliabilityAnalyzer } from "agent-reliability";
const runner = new WorkflowRunner("customer-support");
runner.addStep(classifyStep);
runner.addStep(fetchStep);
runner.addStep(generateStep);
runner.addStep(safetyStep);
runner.addStep(sendStep);
// Single run
const result = await runner.run({ message: "I need help with billing" });
console.log(result.success); // true or false
console.log(result.failure_point); // "classify" or null
console.log(result.cascade_skipped); // ["fetch", "generate", "safety", "send"]
// Benchmark: 100 runs with different inputs
const runs = await runner.benchmark(100, (i) => ({
message: testMessages[i],
}));
// Analyze
const analyzer = new ReliabilityAnalyzer();
const report = analyzer.analyze("customer-support", runs);
// The metrics nobody else computes:
console.log(report.e2e_success_rate); // 0.62 (62% end-to-end)
console.log(report.step_success_rates); // { classify: 0.85, fetch: 0.92, ... }
console.log(report.predicted_e2e_rate); // 0.71 (predicted if independent)
console.log(report.failure_hotspots); // [{ step_id: "classify", failure_rate: 0.15, ... }]
console.log(report.cascade_map); // { classify: ["fetch", "generate", "safety", "send"] }
// The correlation gap: predicted vs actual
const gap = analyzer.correlationGap(report);
console.log(gap.interpretation);
// "Strong correlation — failures cascade. Fix the root cause step."
// What step reliability do you need for 90% E2E across 5 steps?
const required = analyzer.requiredStepReliability(5, 0.90);
console.log(required); // 0.9791 — each step needs 97.9%!3. Chaos Testing
Inject real-world failures to find how your agent breaks:
import { ChaosInjector } from "agent-reliability";
const chaos = new ChaosInjector({
failure_rate: 0.15, // 15% of steps will fail
seed: 42, // reproducible results
failure_types: {
timeout: 0.20, // network timeout
rate_limit: 0.20, // 429 Too Many Requests
malformed_output: 0.25, // truncated JSON, missing fields
latency_spike: 0.15, // 3-5 second delays
context_overflow: 0.05, // output too long for next step
null_return: 0.05, // step returns null
random_error: 0.10, // ECONNREFUSED, SSL errors, etc.
},
});
// Wrap individual steps
const hardenedStep = chaos.wrap(myStep);
// Or wrap all steps at once
const runner = new WorkflowRunner("chaos-test");
const steps = [classifyStep, fetchStep, generateStep, safetyStep];
for (const step of chaos.wrapAll(steps)) {
runner.addStep(step);
}
// Run and see what breaks
const runs = await runner.benchmark(100);
const report = analyzer.analyze("chaos-test", runs);
// Now you know: under 15% failure injection, your E2E drops to X%4. Real-Time Monitoring
Stream events as they happen — don't wait for the benchmark to finish:
import { RealtimeRunner } from "agent-reliability";
const rt = new RealtimeRunner("production-monitor", steps, {
alert_e2e_threshold: 0.5, // alert when E2E drops below 50%
alert_cascade_size: 3, // alert when 3+ steps get skipped
});
// Subscribe to events
rt.on("step:failed", (event) => {
console.error(`FAILED: ${event.data.step_name} — ${event.data.error}`);
sendSlackAlert(event);
});
rt.on("alert:cascade", (event) => {
console.error(`CASCADE: ${event.data.failure_point} killed ${event.data.count} steps`);
pageOnCall(event);
});
rt.on("alert:e2e_drop", (event) => {
console.error(`E2E DROPPED to ${(event.data.e2e_rate * 100).toFixed(0)}%`);
});
rt.on("report:updated", (event) => {
// Live report updates after every run
updateDashboard(event.data.report);
});
// Run single
await rt.runOnce(userInput);
// Or run concurrent (5 workers)
const results = await rt.runConcurrent(100, 5);
// Live report always available
const report = rt.getReport();5. Scale Testing
Rate-limited testing with rolling time windows:
import { ScaleRunner } from "agent-reliability";
const scale = new ScaleRunner("load-test", steps, {
rate_limit_rps: 10, // 10 runs per second
windows: [300, 900, 3600], // 5min, 15min, 1hr rolling stats
adaptive_threshold: 0.5, // speed up testing when failures spike
adaptive_multiplier: 3, // 3x faster during failure spikes
});
// Subscribe to events
scale.on("alert:e2e_drop", (e) => console.error("E2E dropped!"));
// Run 1000 tests at 10 RPS
const { report, windowed_reports } = await scale.runBatch(1000, 3);
// windowed_reports shows reliability over different time windows:
// [
// { window: "300s", report: { e2e_success_rate: 0.72 } }, // last 5 min
// { window: "900s", report: { e2e_success_rate: 0.68 } }, // last 15 min
// { window: "3600s", report: { e2e_success_rate: 0.65 } }, // last hour
// ]
// Stop mid-run
setTimeout(() => scale.stop(), 30000); // stop after 30s6. Scoring and Grading
Get a single number and actionable grade:
import { ReliabilityScorer } from "agent-reliability";
const scorer = new ReliabilityScorer();
const result = scorer.score(report);
console.log(result.score); // 0.72
console.log(result.grade); // "reliable"
console.log(result.breakdown);
// {
// e2e_component: 0.432, // 60% weight × 0.72 E2E rate
// step_min_component: 0.12, // 15% weight × 0.80 weakest step
// cascade_component: 0.08, // 10% weight × 0.80 cascade score
// latency_component: 0.09, // 10% weight × 0.90 latency score
// cost_component: 0.048, // 5% weight × 0.96 cost score
// }
console.log(result.recommendations);
// [
// "Step 'classify' is only 80% reliable. Add retry/fallback.",
// "3 cascade patterns detected. Fix root-cause steps.",
// ]
// Grade scale:
// 0.0-0.3: unreliable — not production-ready
// 0.3-0.6: fragile — works sometimes, needs hardening
// 0.6-0.8: reliable — production candidate
// 0.8-1.0: robust — production-grade7. Streaming Agents
Test agents that use OpenAI's stream: true:
import { streamingLLMStep, openaiStreamAdapter } from "agent-reliability";
import OpenAI from "openai";
const openai = new OpenAI();
const step = streamingLLMStep({
id: "generate",
name: "Generate Reply",
stream_fn: async function* (input) {
const stream = await openai.chat.completions.create({
model: "gpt-4",
messages: [{ role: "user", content: input.message }],
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content ?? "";
yield { content, done: chunk.choices[0]?.finish_reason === "stop" };
}
},
parse: (text) => JSON.parse(text),
on_chunk: (chunk, accumulated) => {
// Real-time UI update
process.stdout.write(chunk.content);
},
});8. Pre-Built Steps
Common agent patterns as drop-in steps:
import { llmStep, toolStep, ragStep, routerStep, guardrailStep } from "agent-reliability";
// LLM call with automatic retry
const classify = llmStep({
id: "classify",
name: "Classify Intent",
model: "gpt-4",
client: openai,
system_prompt: "Classify the user's intent as: billing, support, sales, other",
format_prompt: (input) => input.message,
parse_response: (text) => ({ intent: text.trim().toLowerCase() }),
});
// Tool call with timeout
const search = toolStep({
id: "search",
name: "Search Knowledge Base",
tool_fn: async (args) => await kb.search(args.query),
extract_args: (input) => ({ query: input.message }),
timeout_ms: 3000,
});
// RAG retrieval with quality validation
const retrieve = ragStep({
id: "retrieve",
name: "Retrieve Context",
retrieve_fn: async (query) => await vectorDB.search(query, { top_k: 5 }),
extract_query: (input) => input.message,
min_docs: 2,
min_score: 0.7,
});
// Conditional routing
const router = routerStep({
id: "route",
name: "Route by Intent",
route_fn: (input) => input.intent,
branches: {
billing: billingStep,
support: supportStep,
sales: salesStep,
},
default_branch: "support",
});
// Safety guardrail
const safety = guardrailStep({
id: "safety",
name: "Content Safety",
checks: [
{ name: "no_pii", check: (input) => !input.match(/\b\d{3}-\d{2}-\d{4}\b/), error_msg: "Contains SSN" },
{ name: "no_harmful", check: (input) => !input.includes("hack"), error_msg: "Contains harmful content" },
{ name: "min_length", check: (input) => input.length >= 10, error_msg: "Response too short" },
],
});9. Reports and Audit Trails
import { ReliabilityReporter } from "agent-reliability";
const reporter = new ReliabilityReporter();
// JSON export (for programmatic consumption)
const json = reporter.toJSON(report);
fs.writeFileSync("reliability-report.json", json);
// HTML dashboard (open in browser)
const html = reporter.toHTML(report, runs);
fs.writeFileSync("reliability-report.html", html);
// JSONL audit trail (for compliance — EU AI Act, SOC2)
const auditLog = runner.getAuditLog();
const jsonl = reporter.auditToJSONL(auditLog);
fs.writeFileSync("audit-trail.jsonl", jsonl);
// Each line:
// {"timestamp":"2026-04-12T...","run_id":"abc","step_id":"classify",
// "actor":"step:Classify Intent","trigger":"execute","outcome":"success",
// "input_hash":"<sha256-hex>","output_hash":"<sha256-hex>"}Security notes for reports:
- Step error messages, step IDs, and workflow IDs are included verbatim in JSON reports and the HTML dashboard. Do not throw errors whose
.messagecontains secrets (API keys, PII, prompts with user data) — catch and sanitize at the step boundary first. - HTML reports escape interpolated values, but the report is still generated from data your steps produce. Treat the HTML output as "trusted only as much as your step inputs are trusted."
input_hash/output_hashuse SHA-256 overJSON.stringify(data). Key ordering in objects is not canonicalized, so two logically-equal objects with different key order will produce different hashes. Use the hashes for change detection, not cryptographic proof of equivalence.
10. Security Review
SecurityReviewer runs security checks against the same workflow you're testing for reliability. It slots alongside WorkflowRunner and ChaosInjector and emits a SecurityReport with findings graded critical | high | medium | low | info.
Scope. This library tests the workflow layer — the chain of steps you define. Model-layer threats (data poisoning, membership inference, model inversion, backdoors) require training access and are out of scope. Infrastructure threats (auth, cloud misconfig, K8s) belong to your deployment config, not here.
import { WorkflowRunner, SecurityReviewer } from "agent-reliability";
const runner = new WorkflowRunner("support-agent");
// ...addStep(...) as usual
const reviewer = new SecurityReviewer(runner, {
baselineInput: { message: "hello" },
canaries: [process.env.TEST_API_KEY!, "[email protected]"],
toolAllowList: {
retrieve: ["kb.search"],
generate: ["openai.chat"],
},
});
const report = await reviewer.run(); // runs all checks
console.log(report.summary); // { critical: 0, high: 1, medium: 3, low: 2, info: 1 }
for (const f of report.findings) {
console.log(`[${f.severity}] ${f.check} · ${f.step_id ?? "-"} — ${f.title}`);
}Run a subset of checks:
await reviewer.run({ checks: ["injection", "canary", "static-code"] });Checks available:
injection— Replays a corpus of 19 prompt-injection and jailbreak payloads against each step. Flags outputs that echo the payload verbatim or contain known marker strings (PWNED,INJECTED,LEAK, etc.). Import and extend the corpus viaINJECTION_PAYLOADS. Customize how payloads are merged into your input shape withinjectInput: (baseline, payload) => ....canary— You declare secret strings that should never appear in any step output (API keys, PII in fixtures, prompt-template internals). The reviewer runs the workflow once with yourbaselineInputand asserts no canary leaks into anyStepRunResult.output. Passing no canaries skips the check with an info-level note.tools— Scans step source for tool-like calls (fetch,exec,axios, filesystem writes) and flags steps without atoolAllowListentry. Honest limitation: a static scan can't see tools invoked through helper modules. For rigorous enforcement, wrap each tool with a runtime call-logger — planned for v0.3.static-code— Regex scan ofstep.execute.toString()foreval(),new Function(),child_process,exec/spawn, dynamicrequire(). These are rare in LLM agent code and almost always a bug.missing-controls— Inspects step config: notimeout_ms? novalidate? noretry? nofallback? Emits a finding per missing control. Good for CI gates — fail the build if any step lacks a timeout.deps— Wrapsnpm audit --jsonin the working directory and folds advisories into the report at matching severity. Requires apackage-lock.jsonandnpmonPATH.
Extending the injection corpus. The corpus ships as INJECTION_PAYLOADS in agent-reliability/payloads. For your own agent, add project-specific payloads (known customer-reported prompt attacks, regulator requirements) and pass them via injectInput. A growing internal corpus is the strongest defense over time — ship it as a test fixture in your repo and re-run on every prompt-template change.
CI example:
// tests/security.ci.ts
import { WorkflowRunner, SecurityReviewer } from "agent-reliability";
import { buildWorkflow } from "../src/agent";
it("has no high-severity security findings", async () => {
const runner = buildWorkflow();
const report = await new SecurityReviewer(runner, {
baselineInput: { message: "hello" },
canaries: [process.env.FIXTURE_API_KEY!],
}).run();
const blocking = report.findings.filter(
(f) => f.severity === "critical" || f.severity === "high",
);
expect(blocking).toEqual([]);
});Real-World Examples
Example 1: Customer Support Agent
const runner = new WorkflowRunner("support-agent");
runner
.addStep(llmStep({ id: "classify", name: "Classify", model: "gpt-4", client: openai, system_prompt: "Classify intent", format_prompt: (i) => i.message }))
.addStep(ragStep({ id: "retrieve", name: "Retrieve", retrieve_fn: kb.search, extract_query: (i) => i }))
.addStep(llmStep({ id: "generate", name: "Generate", model: "gpt-4", client: openai, system_prompt: "Answer using context", format_prompt: (i) => JSON.stringify(i) }))
.addStep(guardrailStep({ id: "safety", name: "Safety", checks: [{ name: "no_pii", check: (i) => !i.match(/\d{3}-\d{2}-\d{4}/), error_msg: "PII detected" }] }));
// Benchmark
const runs = await runner.benchmark(200);
const report = analyzer.analyze("support-agent", runs);
const score = scorer.score(report);
console.log(`Grade: ${score.grade} (${score.score.toFixed(2)})`);Example 2: Code Generation Agent with Chaos
const chaos = new ChaosInjector({ failure_rate: 0.20, seed: 42 });
const runner = new WorkflowRunner("codegen");
runner
.addStep(chaos.wrap(planStep))
.addStep(chaos.wrap(generateStep))
.addStep(chaos.wrap(testStep))
.addStep(chaos.wrap(reviewStep));
const runs = await runner.benchmark(500);
const report = analyzer.analyze("codegen", runs);
console.log(`Under 20% chaos: E2E = ${(report.e2e_success_rate * 100).toFixed(1)}%`);Example 3: Production Monitoring
const rt = new RealtimeRunner("prod-agent", steps);
rt.on("alert:e2e_drop", async (e) => {
await slack.send(`Agent reliability dropped to ${(e.data.e2e_rate * 100).toFixed(0)}%`);
});
rt.on("alert:cascade", async (e) => {
await pagerduty.trigger(`Cascade: ${e.data.failure_point} → ${e.data.count} steps skipped`);
});
// Continuous monitoring
while (true) {
await rt.runOnce(getNextRequest());
}API Reference
| Class | Purpose |
|-------|---------|
| WorkflowRunner | Execute workflows with retry, fallback, timeout, validation |
| ReliabilityAnalyzer | Compute E2E rates, cascade maps, correlation gaps |
| ReliabilityScorer | Single 0-1 score with grade and recommendations |
| ChaosInjector | Inject failures: timeout, 429, malformed, latency, null |
| RealtimeRunner | Live event streaming, concurrent execution, alerts |
| ScaleRunner | Rate-limited batch runs with rolling time windows |
| ReliabilityReporter | JSON, HTML dashboard, JSONL audit trail |
| Step Builder | Pattern |
|-------------|---------|
| llmStep() | LLM call with retry and response parsing |
| toolStep() | External tool/API call with timeout |
| ragStep() | Vector DB retrieval with quality validation |
| routerStep() | Conditional branching |
| guardrailStep() | Safety checks (PII, harmful content, format) |
| streamingLLMStep() | Streaming LLM with chunk buffering |
Architecture
┌─────────────────┐
│ Your Agent │
│ (any framework)│
└────────┬────────┘
│ define steps
┌────────▼────────┐
│ WorkflowRunner │ ←── ChaosInjector (optional)
│ retry/fallback/ │
│ timeout/validate│
└────────┬────────┘
│ run N times
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Run 1 │ │ Run 2 │ │ Run N │
│ step→step│ │ step→step│ │ step→step│
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
└──────────────┼──────────────┘
┌───────▼───────┐
│ Analyzer │
│ E2E rate │
│ cascade map │
│ correlation │
└───────┬───────┘
┌───────▼───────┐
│ Scorer │
│ 0-1 score │
│ grade + recs │
└───────┬───────┘
┌───────▼───────┐
│ Reporter │
│ JSON/HTML/ │
│ JSONL audit │
└───────────────┘Comparison
| Feature | DeepEval | Langfuse | LangSmith | agent-reliability | |---------|----------|----------|-----------|----------------------| | Test individual LLM calls | Yes | Yes | Yes | Yes | | Test multi-step chains | No | No | No | Yes | | E2E success rate | No | No | No | Yes | | Cascade failure detection | No | No | No | Yes | | Predicted vs actual gap | No | No | No | Yes | | Chaos/fault injection | No | No | No | Yes | | Real-time event streaming | No | Yes | Yes | Yes | | Concurrent benchmark | No | No | No | Yes | | Rolling time windows | No | No | No | Yes | | Tamper-evident audit trail | No | No | No | Yes | | Single reliability score | No | No | No | Yes | | Streaming agent support | No | Yes | Yes | Yes | | Pre-built step patterns | No | No | No | Yes | | Open source | Yes | Yes | No | Yes |
Use It Directly in Any AI IDE
One command. The IDE does the rest.
Step 1: Install
npm install agent-reliabilityStep 2: Tell Your IDE
Just paste this prompt into any AI IDE — Cursor, Claude Code, Windsurf, Copilot, Cody, Aider — it works everywhere:
I installed agent-reliability (npm package). Read my agent code and:
1. Wrap each step as a WorkflowStep
2. Run benchmark(100) with ChaosInjector at 15% failure rate
3. Analyze with ReliabilityAnalyzer
4. Score with ReliabilityScorer
5. Show me: E2E rate, failure hotspots, cascade map, grade
6. Generate the HTML report
Use these imports:
import { WorkflowRunner, ReliabilityAnalyzer, ReliabilityScorer, ChaosInjector } from "agent-reliability"That's it. The IDE reads your code, wraps your steps, runs the benchmark, and shows the report.
Cursor
# In Cursor chat (Cmd+L):
@codebase Use agent-reliability to test the agent in src/agent.ts.
Run 100 times with 15% chaos injection. Show cascade failures.Add to .cursor/rules for automatic use:
When the user asks to test agent reliability, use the agent-reliability npm package.
Import WorkflowRunner, ReliabilityAnalyzer, ChaosInjector, ReliabilityScorer.
Wrap each agent step, benchmark 100 runs, report E2E rate and failure hotspots.Claude Code
# In terminal:
claude "Use agent-reliability to test my agent in src/agent.ts.
Benchmark 100 runs, inject 15% chaos, generate HTML report."Windsurf
# In Windsurf chat:
Use agent-reliability to benchmark my agent workflow.
Show the reliability score and recommendations.GitHub Copilot
# In Copilot chat:
/test Use agent-reliability to create a reliability test for my agent.
Include chaos injection and cascade failure detection.Aider
aider "Add a reliability test using agent-reliability package.
Test src/agent.ts with 100 runs and chaos injection."What the IDE Will Generate
The IDE reads your agent code and produces something like this:
import { WorkflowRunner, ReliabilityAnalyzer, ChaosInjector, ReliabilityScorer, ReliabilityReporter } from "agent-reliability";
import { myAgent } from "./src/agent";
import * as fs from "fs";
async function testReliability() {
const chaos = new ChaosInjector({ failure_rate: 0.15, seed: 42 });
const runner = new WorkflowRunner("my-agent");
// IDE auto-wraps your agent's steps
runner.addStep(chaos.wrap({ id: "parse", name: "Parse Input", execute: myAgent.parse }));
runner.addStep(chaos.wrap({ id: "retrieve", name: "Retrieve Docs", execute: myAgent.retrieve }));
runner.addStep(chaos.wrap({ id: "generate", name: "Generate Reply", execute: myAgent.generate }));
runner.addStep(chaos.wrap({ id: "validate", name: "Safety Check", execute: myAgent.validate }));
// Benchmark
const runs = await runner.benchmark(100);
// Analyze
const analyzer = new ReliabilityAnalyzer();
const report = analyzer.analyze("my-agent", runs);
// Score
const scorer = new ReliabilityScorer();
const result = scorer.score(report);
// Print results
console.log(analyzer.formatReport(report));
console.log(`\nScore: ${result.score.toFixed(2)} — ${result.grade.toUpperCase()}`);
for (const rec of result.recommendations) console.log(` → ${rec}`);
// Save HTML report
const reporter = new ReliabilityReporter();
fs.writeFileSync("reliability.html", reporter.toHTML(report, runs));
console.log("\nReport saved: reliability.html");
}
testReliability();As a Jest/Vitest Test
// test/reliability.test.ts — the IDE can generate this for you
import { WorkflowRunner, ReliabilityAnalyzer, ReliabilityScorer, ChaosInjector } from "agent-reliability";
test("agent is production-ready", async () => {
const chaos = new ChaosInjector({ failure_rate: 0.15, seed: 42 });
const runner = new WorkflowRunner("my-agent");
// ... add steps ...
const runs = await runner.benchmark(100);
const report = new ReliabilityAnalyzer().analyze("my-agent", runs);
const score = new ReliabilityScorer().score(report);
expect(score.grade).not.toBe("unreliable");
expect(report.e2e_success_rate).toBeGreaterThan(0.5);
});Run: npx jest test/reliability.test.ts
CI/CD — Block Unreliable Agents from Deploying
# .github/workflows/reliability.yml
name: Agent Reliability Gate
on: [push, pull_request]
jobs:
reliability:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: 20 }
- run: npm install
- run: npx jest test/reliability.test.ts --forceExitIf the agent's reliability score is "unreliable", the CI build fails. No unreliable agents reach production.
License
MIT
