agent-check
Test agent behavior, not model outputs.
agent-check is a testing library for AI agents — inspired by React Testing Library. Instead of asserting on prose or chat completions, you assert on what the agent did: which tools it called, in what order, how much it cost, and whether it respected policy constraints.
import { test, expect } from "bun:test";
import { run, mock } from "agent-check";
test("agent greets the user by name", async () => {
const trace = await run(
async (ctx) => {
const user = await ctx.tools.lookupUser(ctx.input);
return { greeting: `Hello, ${user.name}!` };
},
{
input: { userId: "42" },
mocks: {
lookupUser: mock.fn({ id: "42", name: "Bob" }),
deleteUser: mock.forbidden("Agent must never delete users"),
},
}
);
expect(trace).toConverge();
expect(trace).toHaveCalledTool("lookupUser");
expect(trace).not.toHaveCalledTool("deleteUser");
expect(trace).toHaveCalledToolWith("lookupUser", { userId: "42" });
expect(trace.output).toEqual({ greeting: "Hello, Bob!" });
});
Table of Contents
- Installation
- Quick Start
- Core Concepts
- API Reference
- Matchers
- Baseline Regression System
- Trace I/O
- Recipes
- Types
- Project Structure
Installation
bun install agent-check
agent-check is designed for Bun's test runner. It extends expect with custom matchers automatically via a preload file.
Setup
agent-check auto-registers its matchers when imported. Add the preload to your bunfig.toml:
[test]
preload = ["agent-check/setup"]
Quick Start
import { test, expect } from "bun:test";
import { run, mock } from "agent-check";
test("customer support agent looks up order before responding", async () => {
const trace = await run(
async (ctx) => {
const order = await ctx.tools.getOrder(ctx.input);
ctx.trace.setCost(0.002);
ctx.trace.setTokens({ input: 120, output: 40 });
return { status: order.status, message: `Your order is ${order.status}.` };
},
{
input: { orderId: "ORD-789" },
mocks: {
getOrder: mock.fn({ id: "ORD-789", status: "shipped" }),
cancelOrder: mock.forbidden("Agent must not cancel orders"),
},
}
);
// Did it converge?
expect(trace).toConverge();
expect(trace).toHaveStopReason("converged");
// Did it call the right tools?
expect(trace).toHaveCalledTool("getOrder");
expect(trace).not.toHaveCalledTool("cancelOrder");
expect(trace).toHaveCalledToolWith("getOrder", { orderId: "ORD-789" });
// Was it efficient?
expect(trace).toBeWithinBudget({ maxUsd: 0.01 });
expect(trace).toBeWithinTokens({ maxTotal: 500 });
// Did it produce the right output?
expect(trace.output).toEqual({
status: "shipped",
message: "Your order is shipped.",
});
});
Core Concepts
The run() Function
run() is the entry point. It executes your agent function in a controlled environment, captures everything that happens, and returns a Trace object you can assert on.
Agent Function → run() → TraceBuilder → Trace → expect(trace).toHaveCalledTool(...)
                   ↑           ↑
                injects    accumulates
                mocked     tool calls,
                tools      timing, cost

The agent function receives a RunContext with mocked tools (auto-tracked) and a TraceWriter for manual reporting. The function's return value becomes trace.output. If it throws, trace.converged is false and trace.error captures the exception.
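When the agent function throws, those failure fields are what you assert on instead of the output — a minimal sketch of the error path (the thrown message here is illustrative):
import { test, expect } from "bun:test";
import { run } from "agent-check";

test("a throwing agent yields a non-converged trace", async () => {
  const trace = await run(async () => {
    // Simulate a failure inside the agent
    throw new Error("upstream API unavailable");
  });

  expect(trace).not.toConverge();
  expect(trace).toHaveStopReason("error");
  expect(trace.error!.message).toBe("upstream API unavailable");
});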
RunContext
Your agent function receives a RunContext<TInput, TTools> with three fields:
| Field | Type | Description |
|-------|------|-------------|
| ctx.input | TInput | The input data you passed via options.input |
| ctx.tools | TTools | Auto-tracked mock tools — every call is recorded |
| ctx.trace | TraceWriter | Manual reporting: cost, tokens, turns, metadata |
RunContext supports generics for full type safety. Define your tools as an interface and use it as a type parameter — no casting required:
interface MyTools {
lookupUser: (id: string) => Promise<User>;
sendEmail: (to: string, body: string) => Promise<void>;
}
async function myAgent(ctx: RunContext<MyInput, MyTools>) {
const user = await ctx.tools.lookupUser("42"); // fully typed
await ctx.tools.sendEmail(user.email, "Hello!"); // autocomplete works
}
Traces
A Trace<TInput, TOutput> is a frozen snapshot of everything that happened during the agent run. When your agent is typed, trace.input and trace.output are typed too — no casting needed.
interface Trace<TInput = unknown, TOutput = unknown> {
converged: boolean; // Did the agent finish without error?
stopReason: "converged" | "maxTurns" | "error" | "timeout";
error?: Error; // The error, if it threw
input: TInput; // What was passed in
output: TOutput; // What was returned (or manually set)
toolCalls: readonly ToolCall[]; // Every tool call, in order
turns: readonly Turn[]; // Manually-reported turns
duration: number; // Wall-clock ms
startedAt: number; // Epoch timestamp
endedAt: number; // Epoch timestamp
cost?: number; // USD (manually reported)
tokens?: TokenUsage; // Token counts (manually reported)
metadata: Record<string, unknown>; // Arbitrary key-values
}
Turns
A Turn represents a single iteration of the agent loop — typically one LLM call followed by zero or more tool calls. This mirrors how real agent loops work.
interface Turn {
index: number; // Auto-incremented, starting at 0
label?: string; // Optional developer label
toolCalls: ToolCall[]; // Tool calls made during this turn
response?: string; // Text output from this turn
tokens?: TokenUsage;
duration: number;
startedAt: number;
endedAt: number;
metadata?: Record<string, unknown>;
}
Mocks
Mocks simulate the tools your agent calls. agent-check wraps each mock in a tracking proxy that records tool name, input, output, timing, and errors — automatically.
// Static return value — always returns the same thing
mock.fn({ id: "123", name: "Alice" })
// Dynamic implementation — receives the call arguments
mock.fn((input) => ({ id: input.id, name: "Computed" }))
// Sequence — different value on each call, repeats the last when exhausted
mock.sequence([
{ intent: "question", confidence: 0.95 },
{ message: "Here is your answer.", tokensUsed: 150 },
])
// Forbidden — throws immediately if called
mock.forbidden("Agent should never delete accounts")API Reference
run(agentFn, options?)
Executes an agent function and returns a Trace. Types are inferred from the agent function — if your agent takes RunContext<MyInput, MyTools> and returns Promise<MyOutput>, the trace is automatically typed as Trace<MyInput, MyOutput>.
Parameters:
| Parameter | Type | Description |
|-----------|------|-------------|
| agentFn | (ctx: RunContext<TInput, TTools>) => TOutput \| Promise<TOutput> | Your agent function |
| options.input | TInput | Input data, available as ctx.input |
| options.mocks | Record<string, MockToolFn> | Named mock tools |
| options.timeout | number | Timeout in ms (default: 30000) |
| options.metadata | Record<string, unknown> | Metadata attached to the trace |
Returns: Promise<Trace<TInput, Awaited<TOutput>>>
const trace = await run(
async (ctx) => {
const data = await ctx.tools.fetchData(ctx.input);
return { result: data };
},
{
input: "query",
mocks: { fetchData: mock.fn({ answer: 42 }) },
timeout: 5000,
metadata: { model: "gpt-4" },
}
);
Behavior:
- Each mock in options.mocks is wrapped in a tracking proxy before being exposed as ctx.tools[name]
- If the agent function returns a value, it becomes trace.output
- If the agent function throws, trace.converged is false, trace.stopReason is "error", and trace.error captures it
- If the function exceeds timeout, trace.stopReason is "timeout"
- The trace is frozen (immutable) after run() returns
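Because every ctx.tools call goes through the tracking proxy, you can also drop down to the raw trace.toolCalls records when no matcher fits — a small sketch, assuming a single argument is recorded as the call's input (as in the examples above):
const trace = await run(
  async (ctx) => ctx.tools.fetchData("query"),
  { mocks: { fetchData: mock.fn({ answer: 42 }) } }
);

// Each record carries name, input, output, and timing
const call = trace.toolCalls[0]!;
expect(call.name).toBe("fetchData");
expect(call.input).toBe("query");
expect(call.output).toEqual({ answer: 42 });
expect(call.duration).toBeGreaterThanOrEqual(0);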
mock.fn(valueOrImpl?)
Creates a mock tool function.
Overloads:
// No arguments — returns undefined
mock.fn()
// Static value — always returns this value
mock.fn({ id: "123", name: "Alice" })
mock.fn("hello")
mock.fn([1, 2, 3])
mock.fn(null)
// Function implementation — called with the tool's arguments
mock.fn((input) => ({ id: input.id, name: "Computed" }))
Note: agent-check distinguishes static values from implementations by checking typeof. If you pass a function, it's used as the implementation. If you pass anything else, it's returned as-is.
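One consequence of that typeof rule, sketched here: if a mock should return a function as its value, wrap it in an implementation so it isn't mistaken for one:
const callback = () => "I am the return value, not the implementation";

// mock.fn(callback) would install callback as the implementation,
// so return it from a wrapper instead:
mock.fn(() => callback)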
mock.sequence(values)
Creates a mock that returns a different value on each call. When all values are exhausted, the last value is repeated. This is ideal for tools called multiple times with different expected responses (e.g. an LLM called once for classification, then again for answer generation).
mock.sequence([
{ intent: "question", confidence: 0.95 }, // first call
{ message: "Here is your answer." }, // second call and beyond
])
Requires at least one value — mock.sequence([]) throws.
mock.forbidden(message?)
Creates a mock that throws ForbiddenToolError if called. Use this to assert that an agent never invokes a dangerous or disallowed tool.
mock.forbidden() // Default error message
mock.forbidden("Agent must not delete users") // Custom messageIf a forbidden mock is called:
- The tool call is still recorded in the trace (with the error)
- The error propagates, causing
trace.converged = falseandtrace.stopReason = "error" trace.erroris aForbiddenToolErrorinstance
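Put together, a sketch of a policy test built on that behavior (the tool name is illustrative):
import { test, expect } from "bun:test";
import { run, mock, ForbiddenToolError } from "agent-check";

test("a policy-violating agent does not converge", async () => {
  const trace = await run(
    async (ctx) => ctx.tools.deleteUser({ id: "42" }),
    { mocks: { deleteUser: mock.forbidden("Agent must not delete users") } }
  );

  expect(trace).not.toConverge();
  expect(trace).toHaveStopReason("error");
  expect(trace).toHaveCalledTool("deleteUser"); // the call is still recorded
  expect(trace.error).toBeInstanceOf(ForbiddenToolError);
});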
ForbiddenToolError
A custom error class thrown when a mock.forbidden() tool is called. Exported from the main entry point so you can use it in instanceof checks.
import { ForbiddenToolError } from "agent-check";
if (trace.error instanceof ForbiddenToolError) {
console.log("Agent called a forbidden tool");
}
| Property | Type | Description |
|----------|------|-------------|
| name | "ForbiddenToolError" | Error name for identification |
| message | string | Custom message or default 'Forbidden tool "toolName" was called' |
TraceWriter
The TraceWriter is available as ctx.trace inside your agent function. Use it to report things agent-check can't automatically observe — like LLM API costs, token usage, or logical turns.
| Method | Description |
|--------|-------------|
| addToolCall(call) | Manually record a tool call |
| startTurn(label?, metadata?) | Start a named turn (returns TurnHandle) |
| setOutput(output) | Override the function's return value |
| setCost(usd) | Report cost in USD |
| setTokens({ input, output, total? }) | Report token usage |
| setMetadata(key, value) | Attach arbitrary metadata |
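addToolCall is for calls that don't go through ctx.tools — for example a direct SDK or HTTP call you still want on the trace. A minimal sketch, assuming it accepts the same partial call shape as turn.addToolCall (shown in the next section):
const trace = await run(async (ctx) => {
  const status = 200; // stand-in for the result of a real, un-mocked call
  ctx.trace.addToolCall({ name: "httpGet", input: "/orders/42", output: status });
  return { status };
});

expect(trace).toHaveCalledTool("httpGet"); // manually recorded calls appear on the trace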
const trace = await run(async (ctx) => {
// Report LLM usage
ctx.trace.setCost(0.003);
ctx.trace.setTokens({ input: 150, output: 50 });
ctx.trace.setMetadata("model", "claude-sonnet-4-20250514");
// Override the output
ctx.trace.setOutput({ custom: "output" });
return "this return value is ignored because setOutput was called";
});
TurnHandle
Returned by ctx.trace.startTurn(). Represents a single iteration of the agent loop.
| Method | Description |
|--------|-------------|
| addToolCall(call) | Record a tool call within this turn |
| setResponse(text) | Capture text output from this turn |
| end() | Close the turn (records duration) |
const turn = ctx.trace.startTurn("planning", { model: "gpt-4" });
turn.addToolCall({ name: "think", input: "problem", output: "plan" });
turn.setResponse("I will search for relevant documents first.");
turn.end();
Tool calls added via turn.addToolCall() appear both in the turn's toolCalls array and in the top-level trace.toolCalls. Turn indices auto-increment starting at 0.
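A short sketch of asserting on that relationship:
const trace = await run(async (ctx) => {
  const turn = ctx.trace.startTurn("planning");
  turn.addToolCall({ name: "think", input: "problem", output: "plan" });
  turn.end();
  return "done";
});

expect(trace.turns[0]!.index).toBe(0);
expect(trace.turns[0]!.toolCalls[0]!.name).toBe("think");
expect(trace).toHaveCalledTool("think"); // also visible at the trace level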
Matchers
agent-check extends Bun's expect with custom matchers. They're registered automatically via the preload file.
All matchers work with .not negation:
expect(trace).toHaveCalledTool("search");
expect(trace).not.toHaveCalledTool("deleteAll");Tool Matchers
toHaveCalledTool(toolName)
Asserts that a tool was called at least once.
expect(trace).toHaveCalledTool("lookupUser");
expect(trace).not.toHaveCalledTool("deleteUser");
toHaveCalledToolWith(toolName, expectedInput)
Asserts that a tool was called with matching input. Supports Bun's asymmetric matchers (expect.any(), expect.stringContaining(), etc.).
expect(trace).toHaveCalledToolWith("lookupUser", { userId: "42" });
expect(trace).toHaveCalledToolWith("lookupUser", { userId: expect.any(String) });
toHaveToolCallCount(toolName, count)
Asserts that a specific tool was called exactly count times.
expect(trace).toHaveToolCallCount("search", 2);
expect(trace).toHaveToolCallCount("lookup", 1);toHaveToolCallCount({ max })
Asserts that the total number of tool calls (across all tools) is at most max.
expect(trace).toHaveToolCallCount({ max: 5 });
toHaveToolOrder(expectedOrder)
Asserts that tools were called in the specified order. The order is checked as a subsequence — other tools can appear between the expected ones.
expect(trace).toHaveToolOrder(["lookupUser", "sendEmail"]);
// Passes even if other tools were called between them:
// lookupUser → log → validate → sendEmail ✓
Budget Matchers
toBeWithinBudget({ maxUsd })
Asserts that trace.cost is at most maxUsd. Fails if cost was never set.
expect(trace).toBeWithinBudget({ maxUsd: 0.02 });
toBeWithinTokens({ maxTotal })
Asserts that total token count is at most maxTotal. Fails if tokens were never set.
expect(trace).toBeWithinTokens({ maxTotal: 4000 });
toBeWithinLatency({ maxMs })
Asserts that the trace's wall-clock duration is at most maxMs.
expect(trace).toBeWithinLatency({ maxMs: 3000 });
Structural Matchers
toConverge()
Asserts that the agent finished without error.
expect(trace).toConverge();
expect(failedTrace).not.toConverge();
toHaveTurns(opts?)
With no arguments, asserts that at least one turn was recorded. With { min, max }, asserts the turn count is within range.
expect(trace).toHaveTurns(); // at least 1 turn
expect(trace).toHaveTurns({ min: 2 }); // at least 2
expect(trace).toHaveTurns({ max: 8 }); // at most 8
expect(trace).toHaveTurns({ min: 2, max: 8 }); // between 2 and 8
toHaveStopReason(expected)
Asserts that the trace stopped for a specific reason.
expect(trace).toHaveStopReason("converged");
expect(trace).toHaveStopReason("error");
expect(trace).toHaveStopReason("timeout");Baseline Matchers
toMatchBaseline(baseline)
Asserts that a trace matches a previously captured baseline — the structural "behavioral envelope" of the agent. Detects drift in tool usage, turn count, cost, and stop reason. See Baseline Regression System for details.
const baseline = extractBaseline(referenceTrace);
expect(newTrace).toMatchBaseline(baseline);
Baseline Regression System
The killer feature. Capture a trace's structural invariants and detect drift when prompts change, models upgrade, or tools update.
How It Works
A Baseline captures the structural "shape" of a trace — not exact values, but ranges and invariants:
interface Baseline {
version: 1;
toolSet: string[]; // unique tool names, sorted
toolOrder: string[]; // full tool call sequence
turnCount: { min: number; max: number };
costRange?: { min: number; max: number };
tokenRange?: { min: number; max: number };
outputShape: string[]; // top-level keys of output (if object)
stopReason: string;
}
API
import { extractBaseline, compareBaseline, saveBaseline, loadBaseline, updateBaseline } from "agent-check";
// Extract a baseline from a known-good trace
const baseline = extractBaseline(trace);
// Compare a new trace against the baseline
const diff = compareBaseline(newTrace, baseline);
// diff.pass: boolean
// diff.differences: string[] — human-readable list of what changed
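// A hedged sketch: turn drift into a readable failure outside the matcher
if (!diff.pass) {
  throw new Error(`Baseline drift detected:\n- ${diff.differences.join("\n- ")}`);
}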
// Persist baselines to disk
await saveBaseline(baseline, ".baselines/support-agent.json");
const loaded = await loadBaseline(".baselines/support-agent.json");
// Widen ranges from a new trace (e.g. after accepting a change)
const updated = updateBaseline(existing, newTrace);
Usage in Tests
test("agent behavior matches baseline", async () => {
const baseline = await loadBaseline(".baselines/support-agent.json");
const trace = await run(supportAgent, { input, mocks });
expect(trace).toMatchBaseline(baseline);
});
Trace I/O
Save, load, and debug traces.
saveTrace(trace, path) / loadTrace(path)
Serialize traces to JSON and load them back. Error objects are properly serialized and reconstructed.
import { saveTrace, loadTrace } from "agent-check";
await saveTrace(trace, ".traces/run-123.json");
const loaded = await loadTrace(".traces/run-123.json");
printTrace(trace)
Returns a human-readable summary string for debugging — like RTL's screen.debug().
import { printTrace } from "agent-check";
console.log(printTrace(trace));
Output:
Trace: converged (3 turns, 5 tool calls, 0.002 USD, 1000 tokens, 245ms)
Turn 0 [classify]:
→ llm("Classify...") → {"intent":"question","confidence":0.95}
Turn 1 [gather-context]:
→ lookupCustomer("cust-1") → {"id":"cust-1","name":"Alice"}
→ searchKnowledgeBase("What is...") → [{"title":"Return Policy"}]
Turn 2 [decide]:
→ llm("Answer this...") → {"message":"Here is your answer..."}
→ sendResponse("cust-1","Here is your answer...") → undefined
Output: {"intent":"question","responded":true,"escalated":false}Recipes
Testing Tool Order
Verify that your agent calls tools in the correct sequence:
test("agent searches before responding", async () => {
const trace = await run(
async (ctx) => {
const results = await ctx.tools.search(ctx.input);
const answer = await ctx.tools.summarize(results);
return answer;
},
{
mocks: {
search: mock.fn([{ title: "Result 1" }]),
summarize: mock.fn("Here's what I found..."),
},
input: "What is agent-check?",
}
);
expect(trace).toHaveToolOrder(["search", "summarize"]);
});
Testing Cost Budgets
Ensure your agent stays within cost and token limits:
test("agent stays within budget", async () => {
const trace = await run(async (ctx) => {
ctx.trace.setCost(0.005);
ctx.trace.setTokens({ input: 500, output: 200 });
return "done";
});
expect(trace).toBeWithinBudget({ maxUsd: 0.01 });
expect(trace).toBeWithinTokens({ maxTotal: 1000 });
expect(trace).toBeWithinLatency({ maxMs: 5000 });
});
Testing Policy Compliance
Use mock.forbidden() to verify agents don't call dangerous tools:
test("agent never deletes data", async () => {
const trace = await run(
async (ctx) => {
const user = await ctx.tools.getUser({ id: "42" });
return { name: user.name };
},
{
mocks: {
getUser: mock.fn({ id: "42", name: "Alice" }),
deleteUser: mock.forbidden("Must not delete users"),
dropDatabase: mock.forbidden("Must not drop database"),
},
}
);
expect(trace).toConverge();
expect(trace).not.toHaveCalledTool("deleteUser");
expect(trace).not.toHaveCalledTool("dropDatabase");
});
Multi-Turn Agents
Track logical turns in complex agent flows:
test("agent follows plan-execute-verify pattern", async () => {
const trace = await run(async (ctx) => {
const turn1 = ctx.trace.startTurn("plan");
turn1.addToolCall({ name: "analyze", input: ctx.input, output: "plan" });
turn1.end();
const turn2 = ctx.trace.startTurn("execute");
turn2.addToolCall({ name: "act", input: "plan", output: "result" });
turn2.end();
const turn3 = ctx.trace.startTurn("verify");
turn3.addToolCall({ name: "check", input: "result", output: "ok" });
turn3.end();
return "verified";
});
expect(trace).toConverge();
expect(trace).toHaveTurns({ min: 3, max: 3 });
expect(trace.turns[0]!.label).toBe("plan");
expect(trace.turns[1]!.label).toBe("execute");
expect(trace.turns[2]!.label).toBe("verify");
});
Testing Timeouts
Verify that slow agents are handled correctly:
test("agent times out gracefully", async () => {
const trace = await run(
async () => {
await new Promise((r) => setTimeout(r, 60000));
return "never";
},
{ timeout: 100 }
);
expect(trace).not.toConverge();
expect(trace).toHaveStopReason("timeout");
expect(trace.error!.message).toContain("timed out");
});
Dynamic Mocks
Use function implementations for mocks that need to compute responses:
test("agent handles dynamic responses", async () => {
const trace = await run(
async (ctx) => {
const a = await ctx.tools.increment(1);
const b = await ctx.tools.increment(2);
return { a, b };
},
{
mocks: {
increment: mock.fn((n: number) => n + 1),
},
}
);
expect(trace).toConverge();
expect(trace.output).toEqual({ a: 2, b: 3 });
expect(trace).toHaveToolCallCount("increment", 2);
});
Baseline Regression Testing
Capture a baseline from a known-good run and detect drift in future runs:
import { extractBaseline, saveBaseline, loadBaseline } from "agent-check";
// First time: capture and save baseline
test("capture baseline", async () => {
const trace = await run(supportAgent, { input, mocks: baseMocks() });
const baseline = extractBaseline(trace);
await saveBaseline(baseline, ".baselines/support-agent.json");
});
// Subsequent runs: verify against baseline
test("agent behavior matches baseline", async () => {
const baseline = await loadBaseline(".baselines/support-agent.json");
const trace = await run(supportAgent, { input, mocks: baseMocks() });
expect(trace).toMatchBaseline(baseline);
});
Debugging with printTrace
When a test fails, use printTrace to quickly see what happened:
import { printTrace } from "agent-check";
test("debug a failing agent", async () => {
const trace = await run(supportAgent, { input, mocks: baseMocks() });
// Print trace for debugging
console.log(printTrace(trace));
expect(trace).toConverge();
});
Types
All types are exported from the main entry point:
import type {
Trace,
ToolCall,
Turn,
TokenUsage,
TraceWriter,
TurnHandle,
RunOptions,
RunContext,
AgentFn,
MockToolFn,
Baseline,
BaselineDiff,
} from "agent-check";
ToolCall
interface ToolCall {
name: string; // Tool name
input: unknown; // Arguments passed to the tool
output: unknown; // Return value
error?: Error; // Error thrown, if any
duration: number; // Wall-clock ms
startedAt: number; // Epoch timestamp
endedAt: number; // Epoch timestamp
}
Turn
interface Turn {
index: number; // Auto-incremented, starting at 0
label?: string; // Optional developer label
toolCalls: ToolCall[];
response?: string; // Text output from this turn
tokens?: TokenUsage;
duration: number;
startedAt: number;
endedAt: number;
metadata?: Record<string, unknown>;
}
TokenUsage
interface TokenUsage {
input: number;
output: number;
total?: number; // Computed as input + output if omitted
}
Baseline
interface Baseline {
version: 1;
toolSet: string[]; // unique tool names, sorted
toolOrder: string[]; // full tool call sequence
turnCount: { min: number; max: number };
costRange?: { min: number; max: number };
tokenRange?: { min: number; max: number };
outputShape: string[]; // top-level keys of output (if object)
stopReason: string;
metadata?: Record<string, unknown>;
}
BaselineDiff
interface BaselineDiff {
pass: boolean;
differences: string[]; // human-readable list of what changed
}
Project Structure
src/
  index.ts                 # Public API exports
  types.ts                 # All TypeScript interfaces (with generics)
  run.ts                   # run() — wires mocks, timeout, error handling
  trace-builder.ts         # Mutable accumulator → frozen Trace
  mock.ts                  # mock.fn(), mock.forbidden()
  baseline.ts              # Baseline extraction, comparison, persistence
  trace-io.ts              # Save/load/print traces
  setup.ts                 # expect.extend() preload
  matchers.d.ts            # Declaration merging for bun:test
  matchers/
    index.ts               # Barrel export + allMatchers object
    helpers.ts             # assertIsTrace() guard
    tool-matchers.ts       # toHaveCalledTool, toHaveCalledToolWith, etc.
    budget-matchers.ts     # toBeWithinBudget, toBeWithinTokens, etc.
    structural-matchers.ts # toConverge, toHaveTurns, toHaveStopReason
    baseline-matchers.ts   # toMatchBaseline
scripts/
  build.ts                 # Bun.build() + tsc declaration generation
tests/
  helpers.ts               # buildTrace(), buildToolCall(), buildTurn() factories
  trace-builder.test.ts
  mock.test.ts
  run.test.ts
  baseline.test.ts
  trace-io.test.ts
  matchers/
    tool-matchers.test.ts
    budget-matchers.test.ts
    structural-matchers.test.ts
    baseline-matchers.test.ts
  integration/
    full-flow.test.ts      # End-to-end test of the full API
examples/
  support-agent/           # Multi-turn e-commerce support agent
  rag-pipeline/            # Retrieval-augmented generation pipeline
  code-review-agent/       # Automated code review with security scanning
License
MIT
