agent-vcr v0.2.1, published 2 months ago

Snapshot tests for LLM tool calls — record once, replay deterministically in CI. Catch regressions in your agent wiring before they hit production.

Agent VCR

The Problem

You shipped a support agent. It worked perfectly in testing.

Three weeks later, someone tweaks a prompt — and now it's issuing refunds without verifying the order first. The LLM still sounds right. But the tool call sequence broke.

No test caught it. No eval flagged it. Only a production incident revealed it.

Agent VCR fixes this. It applies the idea behind HTTP cassette recording (VCR) to tool_calls: record the exact sequence of tool calls your agent should make, check it into git, and let CI fail the build if anything drifts.

Without Agent VCR:     prompt change → "it still works" → production incident
With Agent VCR:        prompt change → CI fails → you fix it before merging

Why Agent VCR?

| | Without Agent VCR | With Agent VCR |
|---|---|---|
| CI signal | Manual click-through or flaky live LLM | ✅ Deterministic: same trace every run |
| CI cost | API keys + $$$ per PR | ✅ Zero live-model cost in CI once golden traces exist (recording still calls your API) |
| Regression detection | Hope for the best | ✅ Build fails if tool sequence changes |
| Tool order | Untested | ✅ Guaranteed by golden trace |
| Arg validation | Ad-hoc | ✅ Schema v1 + validate CLI in CI |
| Framework | Re-invent test harness | ✅ Drop-in adapters for OpenAI, Anthropic, Vercel AI, LangChain |

What it tests: Tool dispatch logic, routing graphs, arg shapes, guard rails.

What it doesn't test: Model quality or reasoning. Keep those in evals and staging.

Install

npm install agent-vcr
# pnpm add agent-vcr  /  yarn add agent-vcr

Scaffold a new project in one command:

npx agent-vcr init

Creates traces/example.expected.json, a sample Vitest test, and a record.config.json — ready to run.

API keys and environment variables

Live recording and the SDK adapters need provider credentials in your environment. For OpenAI-compatible recording only, you can instead set apiKey in record.config.json. Never commit real keys: copy .env.example to .env locally and fill in the values, or use export or your CI secret store.

| What you use | Typical env var | Where to get a key |
|--------------|-----------------|-------------------|
| agent-vcr record, recordTrace, OpenAI SDK adapter | OPENAI_API_KEY | OpenAI API keys |
| Anthropic SDK adapter (fromAnthropic) | ANTHROPIC_API_KEY | Anthropic console |
| Vercel AI + @ai-sdk/openai | OPENAI_API_KEY | Same as OpenAI |
| LangChain ChatOpenAI | OPENAI_API_KEY | Same as OpenAI |
| OpenAI-compatible base URL (Groq, Ollama, Azure proxy, …) | OPENAI_API_KEY or apiKey in config | Provider docs; Ollama often needs no real secret |

For non-OpenAI Vercel AI providers (for example @ai-sdk/google), use the environment variable names from the Vercel AI provider docs. Agent VCR does not read those variables directly; your SDK does.

# Node 20+: load .env for the CLI process (from a project with agent-vcr installed)
node --env-file=.env ./node_modules/.bin/agent-vcr record --config record.config.json --out traces/out.json

Quick Start (3 minutes)

Step 1 — Record your agent with a real LLM (once)

Create record.config.json:

{
  "system": "You are a support agent. Always look up the order before refunding.",
  "user": "Refund $10 on order 123",
  "scenario": "refund_happy_path",
  "model": "gpt-4o-mini",
  "tools": [
    {
      "name": "lookup_order",
      "description": "Look up an order by ID",
      "parameters": {
        "type": "object",
        "properties": { "orderId": { "type": "string" } },
        "required": ["orderId"]
      }
    },
    {
      "name": "refund_order",
      "description": "Refund an order",
      "parameters": {
        "type": "object",
        "properties": {
          "orderId": { "type": "string" },
          "amount": { "type": "number" }
        },
        "required": ["orderId", "amount"]
      }
    }
  ],
  "stubs": {
    "lookup_order": { "found": true, "orderId": "123", "balanceCents": 1000 },
    "refund_order": { "ok": true }
  }
}

export OPENAI_API_KEY="…"   # never commit keys; or run with Node 20+: node --env-file=.env …
npx agent-vcr record --config record.config.json --out traces/refund.expected.json

*  Recording — scenario: refund_happy_path, model: gpt-4o-mini
   user: "Refund $10 on order 123"

   1. lookup_order { orderId: "123" }
   2. refund_order { orderId: "123", amount: 10 }

✔ Saved → traces/refund.expected.json (2 calls)

  Commit this file to git. Run your tests with ScriptedLlm to replay deterministically.

That JSON file is now your golden trace. Commit it.

Step 2 — Replay deterministically in CI (no API key needed)

// tests/refund-agent.vcr.test.ts
import { describe, it, expect } from 'vitest'
import {
  ScriptedLlm,
  assistantText,
  assistantWithTools,
  collectToolCalls,
  compareTraces,
  loadTraceFile,
  toolCall,
} from 'agent-vcr'

describe('refund agent — tool trace', () => {
  it('always looks up before refunding (happy path)', async () => {
    const llm = new ScriptedLlm([
      assistantWithTools([toolCall('lookup_order', { orderId: '123' }, 'c1')]),
      assistantWithTools([toolCall('refund_order', { orderId: '123', amount: 10 }, 'c2')]),
      assistantText('Done.'),
    ])

    const recorded = await collectToolCalls({
      system: 'You are a support agent.',
      user: 'Refund $10 on order 123',
      llm,
      async executeTool(name) {
        if (name === 'lookup_order') return { found: true }
        if (name === 'refund_order') return { ok: true }
        throw new Error(`Unknown tool: ${name}`)
      },
    })

    const expected = await loadTraceFile('traces/refund.expected.json')
    const result = compareTraces(expected.calls, recorded)
    expect(result).toEqual({ ok: true })
  })

  it('catches regression when lookup is skipped', async () => {
    const badLlm = new ScriptedLlm([
      // Simulates a prompt change that causes the model to skip lookup
      assistantWithTools([toolCall('refund_order', { orderId: '123', amount: 10 }, 'c1')]),
      assistantText('Done.'),
    ])

    const recorded = await collectToolCalls({
      system: 'You are a support agent.',
      user: 'Refund $10 on order 123',
      llm: badLlm,
      async executeTool(name) {
        if (name === 'refund_order') return { ok: true }
        throw new Error(`Unknown tool: ${name}`)
      },
    })

    const expected = await loadTraceFile('traces/refund.expected.json')
    // The comparison reports a mismatch; in the happy-path test above, this same drift would fail the build
    expect(compareTraces(expected.calls, recorded).ok).toBe(false)
  })
})

npx vitest run   # ✔ passes in CI — no API key, no cost, same result every time

Record Mode

agent-vcr record reads record.config.json (prompts, tool schemas, stubs), calls a real LLM over the OpenAI-compatible HTTP API, and writes the resulting tool calls to a golden trace. It does not import or run your app’s own agent entrypoint — for that, use collectToolCalls in tests with your SDK adapter and executeTool implementation.

# Basic usage (OPENAI_API_KEY in env — export, shell profile, or node --env-file=.env)
agent-vcr record --config record.config.json --out traces/my-flow.expected.json

# Override model
agent-vcr record --config record.config.json --out traces/my-flow.expected.json --model gpt-4o

# Point at any OpenAI-compatible API (Anthropic proxy, Groq, Ollama, Azure, ...)
agent-vcr record --config record.config.json --out traces/my-flow.expected.json \
  --base-url https://api.groq.com/openai/v1 \
  --model llama-3.3-70b-versatile

Record config fields:

| Field | Type | Required | Description |
|---|---|---|---|
| system | string | No | System prompt |
| user | string | Yes | User message |
| tools | ToolDefinition[] | Yes | Tools with JSON Schema parameters |
| stubs | Record<name, result> | No | Tool responses during recording (default: { ok: true }) |
| scenario | string | No | Label written to the trace file |
| model | string | No | Model name (default: gpt-4o-mini) |
| baseURL | string | No | API base URL (default: OpenAI) |
| apiKey | string | No | Falls back to OPENAI_API_KEY env var |
| maxSteps | number | No | Max tool-calling rounds (default: 16) |
| httpFetch | typeof fetch | No | API only: passed to recordTrace() to override global fetch (tests, proxies). Not supported in record.config.json. |
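To make the defaults in the table concrete, here is a minimal sketch of how the optional fields might be resolved. The type and function names (RecordConfig, withDefaults, stubResultFor) are assumptions made for illustration and are not agent-vcr exports. The sketch also reads the stubs default as "a tool with no stub answers { ok: true }", which is one interpretation of the table.

```typescript
// Illustrative only: field names and defaults come from the table above,
// but these types and helpers are hypothetical, not part of agent-vcr.
type ToolDefinition = {
  name: string
  description?: string
  parameters: Record<string, unknown> // JSON Schema for the tool arguments
}

type RecordConfig = {
  system?: string
  user: string
  tools: ToolDefinition[]
  stubs?: Record<string, unknown>
  scenario?: string
  model?: string
  baseURL?: string
  apiKey?: string
  maxSteps?: number
}

// Fill in the documented defaults for the optional fields.
function withDefaults(config: RecordConfig, env: Record<string, string | undefined>) {
  return {
    ...config,
    model: config.model ?? 'gpt-4o-mini',
    maxSteps: config.maxSteps ?? 16,
    apiKey: config.apiKey ?? env['OPENAI_API_KEY'], // config wins, env is the fallback
    stubs: config.stubs ?? {},
  }
}

// Assumption: a tool without an explicit stub answers { ok: true }.
function stubResultFor(stubs: Record<string, unknown>, toolName: string): unknown {
  return Object.prototype.hasOwnProperty.call(stubs, toolName) ? stubs[toolName] : { ok: true }
}

const resolved = withDefaults(
  {
    user: 'Refund $10 on order 123',
    tools: [{ name: 'lookup_order', parameters: { type: 'object' } }],
    stubs: { lookup_order: { found: true } },
  },
  { OPENAI_API_KEY: 'placeholder-key' }, // a real script would pass process.env here
)

const lookupStub = stubResultFor(resolved.stubs, 'lookup_order') // the configured stub
const refundStub = stubResultFor(resolved.stubs, 'refund_order') // falls back to { ok: true }
```

Passing the environment in as a parameter keeps the helper easy to test without touching real credentials.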

Examples

All examples require npm run build first, because they run against dist/:

npm run build && npm run example:all

| Example | Scenario | Key concept |
|---|---|---|
| minimal | Refund agent | Basic record → replay |
| email-router | Route emails by category + priority | Multi-step routing |
| financial-transfer | Balance → limit → transfer | High-stakes sequencing + regression demo |
| rag-validation | Search → rerank → cite | subsequence mode for flexible pipelines |
| ticket-escalation | VIP customer 4-step escalation | Long tool chains |

Run individually:

npm run example:email-router
npm run example:financial-transfer   # includes intentional regression demo
npm run example:rag-validation
npm run example:ticket-escalation

Framework Adapters

Agent VCR works with your existing LLM SDK. Import the adapter for your framework; the adapters add no peer dependency to Agent VCR itself.

OpenAI SDK

import OpenAI from 'openai'
import { collectToolCalls, compareTraces, loadTraceFile } from 'agent-vcr'
import { fromOpenAI } from 'agent-vcr/adapters/openai'

const client = new OpenAI()
const llm = fromOpenAI(client, 'gpt-4o-mini', {
  tools: [/* your OpenAI tool definitions */],
})

const recorded = await collectToolCalls({
  user: 'Refund order 123',
  llm,
  // myDispatch is a placeholder for your app's own tool dispatcher
  executeTool: async (name, args) => myDispatch(name, args),
})

Anthropic SDK

import Anthropic from '@anthropic-ai/sdk'
import { collectToolCalls } from 'agent-vcr'
import { fromAnthropic } from 'agent-vcr/adapters/anthropic'

const client = new Anthropic()
const llm = fromAnthropic(client, 'claude-3-5-sonnet-20241022', {
  tools: [
    {
      name: 'lookup_order',
      description: 'Look up an order by ID',
      input_schema: { type: 'object', properties: { orderId: { type: 'string' } }, required: ['orderId'] },
    },
  ],
})

Vercel AI SDK

import { openai } from '@ai-sdk/openai'
import { collectToolCalls } from 'agent-vcr'
import { fromVercelAI } from 'agent-vcr/adapters/vercel-ai'
import { z } from 'zod'

const llm = fromVercelAI(openai('gpt-4o-mini'), {
  tools: {
    lookup_order: {
      description: 'Look up an order by ID',
      parameters: z.object({ orderId: z.string() }),
    },
  },
})

LangChain.js

import { ChatOpenAI } from '@langchain/openai'
import { collectToolCalls } from 'agent-vcr'
import { fromLangChain } from 'agent-vcr/adapters/langchain'

const chat = new ChatOpenAI({ model: 'gpt-4o-mini' })
const llm = fromLangChain(chat.bindTools([lookupOrderTool, refundOrderTool]))

Any OpenAI-compatible API

# Groq
agent-vcr record --config cfg.json --out traces/out.json --base-url https://api.groq.com/openai/v1 --model llama-3.3-70b-versatile

# Ollama (local)
agent-vcr record --config cfg.json --out traces/out.json --base-url http://localhost:11434/v1 --model llama3.2

How It Works

  ┌─────────────────────────────────────────────────────────────────┐
  │  RECORD (once, with real LLM)          REPLAY (CI, no live LLM)  │
  │                                                                 │
  │  User ──► Real LLM ──► tool_calls      User ──► ScriptedLlm   │
  │              │                                      │           │
  │     config stubs                            executeTool stubs  │
  │              │                                      │           │
  │         save to                          compare to            │
  │   traces/*.expected.json  ◄────────────  traces/*.expected.json│
  │         (git commit)                      fail build on drift  │
  └─────────────────────────────────────────────────────────────────┘

What ScriptedLlm validates: Given these model decisions (scripted), your code still dispatches the right tools with the right arguments.

What it doesn't validate: What the model would actually decide with a live API — that lives in evals, staging, and occasional integration runs.
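The replay idea can be sketched in a few lines. The code below is a simplified stand-in written for illustration, not agent-vcr's implementation: MiniScriptedLlm and collectCalls are hypothetical names, the loop is synchronous for brevity, and tool results are not fed back to the scripted model.

```typescript
// Hypothetical stand-ins for illustration; these are NOT agent-vcr's implementation.
type ToolCall = { name: string; args: Record<string, unknown> }
type Turn = { toolCalls: ToolCall[]; text?: string }

// Replays pre-written turns in order, so every run sees the same "model decisions".
class MiniScriptedLlm {
  private readonly turns: Turn[]
  private index = 0

  constructor(turns: Turn[]) {
    this.turns = turns
  }

  complete(): Turn {
    const turn = this.turns[this.index]
    if (turn === undefined) throw new Error('Script exhausted: no turns left')
    this.index += 1
    return turn
  }
}

// Runs the assistant <-> tool loop and records each call that gets dispatched.
// Kept synchronous for brevity; the real collectToolCalls is awaited.
function collectCalls(
  llm: MiniScriptedLlm,
  executeTool: (name: string, args: Record<string, unknown>) => unknown,
  maxSteps = 16,
): ToolCall[] {
  const recorded: ToolCall[] = []
  for (let step = 0; step < maxSteps; step++) {
    const turn = llm.complete()
    if (turn.toolCalls.length === 0) break // a text-only turn ends the loop
    for (const call of turn.toolCalls) {
      executeTool(call.name, call.args) // your dispatch code is what is under test
      recorded.push({ name: call.name, args: call.args })
    }
  }
  return recorded
}

const script = new MiniScriptedLlm([
  { toolCalls: [{ name: 'lookup_order', args: { orderId: '123' } }] },
  { toolCalls: [{ name: 'refund_order', args: { orderId: '123', amount: 10 } }] },
  { toolCalls: [], text: 'Done.' },
])

const dispatched: string[] = []
const recordedCalls = collectCalls(script, (name) => {
  dispatched.push(name) // stand-in for real tool dispatch
  return { ok: true }
})
```

Because the turns are fixed, any change in recordedCalls between runs has to come from the wiring around the model rather than from the model itself.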

Two comparison modes:

exact — Same length, each call matches name + args. Strictest. Use for deterministic flows like financial operations.
subsequence — Expected calls appear in order inside actual; extra calls allowed. Use for flexible pipelines like RAG where intermediate steps may change.
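The difference between the two modes is easiest to see in code. The following is a minimal sketch of the comparison logic as described above, not the library's compareTraces: the function names, the shape of the failure result, and the reason strings are assumptions chosen for illustration.

```typescript
// Illustrative sketch of the two modes; not agent-vcr's compareTraces.
type ToolCallRecord = { name: string; args: Record<string, unknown> }
type CompareResult = { ok: true } | { ok: false; reason: string; index: number }

// Deep equality for JSON-like values; object key order does not matter.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) return false
  if (Array.isArray(a) !== Array.isArray(b)) return false
  const aObj = a as Record<string, unknown>
  const bObj = b as Record<string, unknown>
  const aKeys = Object.keys(aObj)
  if (aKeys.length !== Object.keys(bObj).length) return false
  return aKeys.every(
    (key) => Object.prototype.hasOwnProperty.call(bObj, key) && deepEqual(aObj[key], bObj[key]),
  )
}

function sameCall(a: ToolCallRecord, b: ToolCallRecord): boolean {
  return a.name === b.name && deepEqual(a.args, b.args)
}

// exact: same length, and every position matches on name + args.
function compareExact(expected: ToolCallRecord[], actual: ToolCallRecord[]): CompareResult {
  const length = Math.max(expected.length, actual.length)
  for (let i = 0; i < length; i++) {
    const want = expected[i]
    const got = actual[i]
    if (want === undefined || got === undefined || !sameCall(want, got)) {
      return { ok: false, reason: `call ${i} differs`, index: i }
    }
  }
  return { ok: true }
}

// subsequence: expected calls must appear in order; extra actual calls are allowed.
function compareSubsequence(expected: ToolCallRecord[], actual: ToolCallRecord[]): CompareResult {
  let next = 0 // index of the next expected call still to be found
  for (const call of actual) {
    const want = expected[next]
    if (want !== undefined && sameCall(want, call)) next += 1
  }
  return next === expected.length
    ? { ok: true }
    : { ok: false, reason: `expected call ${next} not found in order`, index: next }
}

const golden: ToolCallRecord[] = [
  { name: 'lookup_order', args: { orderId: '123' } },
  { name: 'refund_order', args: { orderId: '123', amount: 10 } },
]
const withExtraCall: ToolCallRecord[] = [
  golden[0],
  { name: 'log_event', args: { kind: 'refund' } }, // hypothetical extra tool
  golden[1],
]

const exactResult = compareExact(golden, withExtraCall) // fails: lengths and positions differ
const subsequenceResult = compareSubsequence(golden, withExtraCall) // passes: order is preserved
```

The extra log_event call breaks exact mode at index 1 but is tolerated in subsequence mode, which is why the latter suits pipelines whose intermediate steps may change.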

Trace File Format

Golden traces are plain JSON — human-readable, diffable, language-agnostic:

{
  "version": 1,
  "scenario": "refund_happy_path",
  "calls": [
    { "name": "lookup_order",  "args": { "orderId": "123" } },
    { "name": "refund_order",  "args": { "orderId": "123", "amount": 10 } }
  ]
}

Any language can emit this format and use the CLI to diff it. The schema is validated with Zod on both read and write.
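For tools that emit this format without using the library, a hand-written shape check can catch obvious mistakes before the file reaches the CLI. This is a minimal sketch based only on the JSON example above; the library's own Zod schema may be stricter or looser, and treating scenario as optional is an assumption.

```typescript
// Hand-rolled shape check for illustration; agent-vcr itself validates with Zod.
type TraceFileV1 = {
  version: 1
  scenario?: string // assumed optional here; the real schema may differ
  calls: { name: string; args: Record<string, unknown> }[]
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function isTraceFileV1(raw: unknown): raw is TraceFileV1 {
  if (!isPlainObject(raw) || raw['version'] !== 1) return false
  if (raw['scenario'] !== undefined && typeof raw['scenario'] !== 'string') return false
  const calls = raw['calls']
  if (!Array.isArray(calls)) return false
  return calls.every(
    (call: unknown) =>
      isPlainObject(call) && typeof call['name'] === 'string' && isPlainObject(call['args']),
  )
}

// Emit a trace as pretty-printed JSON so that git diffs stay readable.
const traceText = JSON.stringify(
  {
    version: 1,
    scenario: 'refund_happy_path',
    calls: [
      { name: 'lookup_order', args: { orderId: '123' } },
      { name: 'refund_order', args: { orderId: '123', amount: 10 } },
    ],
  },
  null,
  2,
)

const parsedTrace: unknown = JSON.parse(traceText)
const looksValid = isTraceFileV1(parsedTrace)
```

A check like this is a convenience only; running agent-vcr validate on the emitted file remains the authoritative test.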

CLI

# Scaffold Agent VCR in a new project
agent-vcr init [--dir <path>] [--skip-existing]

# Record a golden trace with a real LLM
agent-vcr record --config <record.config.json> --out <trace.json> [--model <name>] [--base-url <url>]

# Validate a trace file matches schema v1
agent-vcr validate <trace.json>

# Diff expected vs actual (exits 1 on mismatch)
agent-vcr diff <expected.json> <actual.json> [--mode exact|subsequence]

Exit codes:

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Trace mismatch (diff) |
| 2 | Usage error, invalid JSON, or schema error |
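A wrapper script may want to treat a trace mismatch differently from a broken invocation. The sketch below maps the documented exit codes to a small outcome type; interpretExitCode and the outcome names are hypothetical and exist only for this example.

```typescript
// Maps the exit codes from the table above to an outcome a wrapper script can act on.
// The function and the outcome names are illustrative, not part of agent-vcr.
type CliOutcome = 'success' | 'trace-mismatch' | 'usage-or-schema-error' | 'unknown'

function interpretExitCode(code: number | null): CliOutcome {
  switch (code) {
    case 0:
      return 'success'
    case 1:
      return 'trace-mismatch' // a real regression: the golden trace and the actual trace differ
    case 2:
      return 'usage-or-schema-error' // fix the command line or the trace file, not the agent
    default:
      return 'unknown' // null (killed by a signal) or an undocumented code
  }
}

const outcomes = [0, 1, 2].map((code) => interpretExitCode(code))
```

In a Node script, the status field returned by spawnSync from node:child_process could be passed straight into this function, since that field is a number or null.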

Example diff output (mismatch):

✖ Trace mismatch (exact)
────────────────────────────────────────────────────────────
  reason: call 0 differs
  at index: 0
  expected: lookup_order({ orderId: "123" })
  actual:   refund_order({ orderId: "123", amount: 10 })

Full trace comparison:
  #   Expected                        Actual                          Match
  ────────────────────────────────────────────────────────────────────────────
  0   lookup_order(orderId)           refund_order(orderId, amount)   ✖
  1   refund_order(orderId, amount)   (none)                          ✖

Library API

| Export | Description |
|---|---|
| collectToolCalls(options) | Run assistant ↔ tool loop; returns ToolCallRecord[] |
| ScriptedLlm(turns) | Deterministic complete(): returns scripted turns in order |
| toolCall(name, args, id?) | Build a ToolCallSpec for scripted turns |
| assistantWithTools(specs) | Build an assistant turn that calls tools |
| assistantText(content) | Build a final text-only assistant turn |
| compareTraces(expected, actual, options?) | Compare two ToolCallRecord[] lists |
| loadTraceFile(path) | Read + validate a trace JSON file |
| saveTraceFile(path, trace) | Validate + write a trace JSON file |
| recordTrace(config) | Call a real LLM, capture tool calls, return trace (see httpFetch above). Never commit API keys; use OPENAI_API_KEY locally only |
| initProject(options?) | Scaffold traces/ dir + sample test |
| parseTraceFileV1(raw) | Parse raw JSON (throws on invalid) |
| safeParseTraceFileV1(raw) | Parse raw JSON (returns { success, error }) |

GitHub Actions

# This repo’s workflow (see .github/workflows/ci.yml): build, validate every
# examples/*/traces/*.expected.json, then run all package examples.

- uses: actions/checkout@v4
- uses: actions/setup-node@v4
  with:
    node-version: '20'
    cache: npm
    cache-dependency-path: package-lock.json

- run: npm ci
- run: npm run lint
- run: npm test
- run: npm run build

- name: Validate golden traces
  run: |
    for f in examples/*/traces/*.expected.json; do
      node dist/cli.js validate "$f"
    done

- run: npm run example:all

# In your own app, point the glob at your traces (e.g. traces/*.expected.json)
# and use `npx agent-vcr validate` if you depend on the package instead of dist/.

Roadmap

Current release: trace schema v1, compareTraces (exact / subsequence), ScriptedLlm, collectToolCalls, CLI (validate, diff, record, init), adapters (OpenAI, Anthropic, Vercel AI, LangChain), and five runnable examples under examples/.

Planned: arg redaction for secrets in traces; CLI flags --forbidden-tools / --budget; Python client; LangGraph adapter; HTML diff report.

Use Cases

When should I use Agent VCR?

✅ Support agents that must verify before acting (lookup → refund, check → transfer)
✅ Email/ticket routing agents where wrong routing = real business impact
✅ RAG pipelines where "skipping retrieval" = silent hallucination
✅ Financial operations with mandatory compliance steps
✅ Any agent where tool order matters as much as the final answer

When is it a poor fit?

❌ Agents with no structured tools (pure text generation)
❌ Teams that care only about prose quality (use evals instead)
❌ Highly exploratory agents where tool sequence legitimately varies every run (use subsequence mode or evals)

Contributing

Contributions welcome — open an issue or PR on GitHub.

git clone https://github.com/ajayvnkt/agent-vcr.git
cd agent-vcr
npm install
npm run build && npm test && npm run example:all

Please run npm run lint && npm test before submitting. See CONTRIBUTING.md for details.

License

MIT — see LICENSE for details.