@mukundakatta/agentsnap

v0.1.1

agentsnap

Snapshot tests for AI agents. Record an agent run's tool-call trace, diff it against a baseline, fail CI on regressions. Zero runtime dependencies. Drops into any test runner.

npm install --save-dev @mukundakatta/agentsnap

import { record, traceTool, expectSnapshot } from '@mukundakatta/agentsnap';

const search = traceTool('search', async ({ q }) => fetchResults(q));
const summarize = traceTool('summarize', async ({ docs }) => llm(docs));

async function agent(question) {
  const docs = await search({ q: question });
  return summarize({ docs });
}

test('research agent stays on rails', async () => {
  const trace = await record(() => agent('What is RLHF?'));
  await expectSnapshot(trace, '__snapshots__/research.snap.json');
});

First run writes the snapshot. Every run after that diffs against it. If the agent calls a different tool, calls them in a different order, or starts erroring, the test fails with a readable diff. Regenerate with AGENTSNAP_UPDATE=1.

TypeScript types ship in the box (src/index.d.ts) — no @types/agentsnap package needed.

See it in action

git clone https://github.com/MukundaKatta/agentsnap && cd agentsnap
node examples/demo-regression.js

A fake "research agent" gets quietly swapped for one that calls fetch_url instead of search. agentsnap prints the colored diff that would block CI.

Why

Most LLM eval libraries score outputs against expected strings. That misses the actual failure mode of agents in production: they start calling the wrong tools, or call them in the wrong order, or stop calling one entirely. agentsnap captures the trace — the ordered sequence of tool calls, their arguments, and a hash of their results — and treats it like a Jest snapshot. If anything structural changes, your test runner tells you.

Diff statuses

| Status | When | Default action |
|---|---|---|
| PASSED | Bytewise match | green |
| OUTPUT_DRIFT | Tools + args identical, only output text or external result hashes differ | warn (non-failing) |
| TOOLS_REORDERED | Same tool names, different order | fail |
| TOOLS_CHANGED | Different tool names called, or different arguments | fail |
| REGRESSION | New error in the trace, or a tool that used to work now throws | fail |

Override per snapshot via expectSnapshot(trace, path, { failOn: [...] }).
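One plausible reading of those status semantics, as a classifier over two traces. This is a sketch of the decision order, not the shipped `diff()`:

```javascript
// Hypothetical classifier matching the status table: checks run from most
// severe (REGRESSION) to least (OUTPUT_DRIFT). Not the library's source.
function classify(baseline, current) {
  if (JSON.stringify(baseline) === JSON.stringify(current)) return 'PASSED';
  if (!baseline.error && current.error) return 'REGRESSION';
  const names = (t) => t.tools.map((c) => c.name).join('|');
  const calls = (t) => t.tools.map((c) => c.name + JSON.stringify(c.args)).join('|');
  const sortedNames = (t) => t.tools.map((c) => c.name).sort().join('|');
  // Same tools and args → only outputs/result hashes moved.
  if (calls(baseline) === calls(current)) return 'OUTPUT_DRIFT';
  if (names(baseline) !== names(current)) {
    return sortedNames(baseline) === sortedNames(current)
      ? 'TOOLS_REORDERED'   // same multiset of names, different order
      : 'TOOLS_CHANGED';    // different tools called
  }
  return 'TOOLS_CHANGED';   // same tool sequence, different arguments
}
```

The ordering matters: a new error is reported as REGRESSION even when the tool sequence also changed.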

API

record(fn, opts?) → Promise<Trace>

Run fn and capture every traceTool() call inside it (including nested async work). Returns a structured trace.

const trace = await record(
  () => myAgent.run('book SFO'),
  { input: 'book SFO', model: 'claude-sonnet-4-6' }
);

Options:

  • input — what the user/caller sent in. Stored verbatim in the trace.
  • model — model id string. Surfaced in OUTPUT_DRIFT diffs.
  • captureResults — store full tool results in the trace (default false; only the SHA-256 hash is stored to avoid snapshot bloat and PII leaks).

traceTool(name, fn) → wrapped fn

Wraps a tool function. Inside record(), calls are appended to the active trace. Outside record(), it's a transparent pass-through — no overhead, no behavior change.

const search = traceTool('search', async ({ q }) => api.search(q));
const result = await search({ q: 'sfo' }); // works the same as api.search

AsyncLocalStorage powers the recorder, so the wrapped function works correctly across await, Promise.all, timers, and other async boundaries.

expectSnapshot(trace, path, opts?) → Promise<{status, path, changes?}>

  • No file at path → writes the snapshot and returns {status: 'CREATED'}.
  • AGENTSNAP_UPDATE=1 (env) or opts.update: true → overwrites the snapshot.
  • Otherwise → diffs. If the diff status is in opts.failOn (default ['TOOLS_CHANGED', 'TOOLS_REORDERED', 'REGRESSION']), throws an AgentSnapshotMismatch error so the host test runner reports a failure.

diff(baseline, current) → DiffResult

Low-level diff if you want to handle the result yourself instead of throwing.

formatDiff(result, path?) → string

Render a diff result as a colored terminal block. Used internally for the failure message; also exported for custom reporters.

Trace format

{
  "version": 1,
  "model": "claude-sonnet-4-6",
  "input": "Book a flight to SFO",
  "output": "Booked. Confirmation #ABC123.",
  "tools": [
    { "name": "search_flights", "args": { "to": "SFO" }, "result_hash": "sha256:..." },
    { "name": "book_flight",    "args": { "id": "UA123" }, "result_hash": "sha256:..." }
  ],
  "error": null,
  "fingerprint": { "node": "v22.0.0", "agentsnap": "0.1.0" }
}

fingerprint is ignored when diffing (Node version drift shouldn't fail your tests).

Test runners

agentsnap doesn't ship a runner — it just throws on mismatch. Anything that surfaces thrown errors as failures works:

  • node:test — node --test 'test/**/*.test.js'
  • vitest — import { test } from 'vitest', then call as shown above
  • jest — same shape; works with --experimental-vm-modules for ESM
  • playwright / mocha / tap / ava — same story

Recipes

Update all snapshots

AGENTSNAP_UPDATE=1 npm test

Capture full tool results (debugging only)

const trace = await record(fn, { captureResults: true });

Don't commit traces with captureResults enabled if your tools touch real APIs — the snapshot will contain raw responses (potentially PII).

Treat any drift as failure

await expectSnapshot(trace, path, {
  failOn: ['OUTPUT_DRIFT', 'TOOLS_CHANGED', 'TOOLS_REORDERED', 'REGRESSION'],
});

Pair with a real LLM

record() wraps any async function. Whether your tools call a deterministic mock or the live Anthropic SDK, the recording flow is identical. For deterministic snapshots in CI, mock the model and call real tools (or vice versa) depending on what you want to gate.

CLI

@mukundakatta/agentsnap ships an agentsnap binary for diffing/normalizing/updating trace files outside a test runner — handy in CI or for ad-hoc inspection:

# Diff two recorded traces; exits 1 on drift
npx -p @mukundakatta/agentsnap agentsnap diff baseline.json current.json --pretty

# Normalize a trace (strip fingerprint, sort keys) for stable storage
cat trace.json | npx -p @mukundakatta/agentsnap agentsnap normalize - --pretty

# Overwrite a baseline with a new run (after eyeballing the diff)
npx -p @mukundakatta/agentsnap agentsnap update baseline.json current.json

Output is JSON on stdout (use --pretty for indented output). Exit code is 0 when there is no drift, 1 when there is, and 2 on usage errors. Run agentsnap --help for the full subcommand reference.

What this is not

  • Not an eval framework. No scoring, no LLM-judge, no benchmark dataset. Just snapshot-and-diff.
  • Not a tracer for production. This is a test-time tool. For production observability, reach for OpenTelemetry, Langfuse, etc.
  • Not a workflow product. No YAML schema, no cloud upload, no Slack digest. One primitive (plus a small CLI around it), shipped well.

Sibling libraries

Part of the agent reliability stack — all @mukundakatta/* scoped, all zero-dep:

Natural pipeline: fit → guard → snap → vet → cast.

Status

v0.1.0 — initial release. Core API stable, TypeScript types included, 33 unit tests, CI on Node 20/22/24. Adapter packages for the Anthropic SDK, OpenAI SDK, and MCP clients are planned for v0.2 to remove the need for manual traceTool() wrapping.

License

MIT