@swarmclawai/prompt-snapshot

v0.1.0

Published

2 months ago

Vitest-for-prompts. File-based prompt eval runner with snapshots and assertions — no SaaS, no dashboard, CI-native. Built for agents.

waydelyle

prompt prompts prompt-testing prompt-eval eval llm anthropic openai claude snapshot-testing vitest ai-testing agent-cli

prompt-snapshot

Vitest-for-prompts. A file-based prompt eval runner with snapshots and assertions. No dashboard, no SaaS, no sign-up — just .prompt.ts files in your repo and a CLI that runs them in CI.

Why this exists

If you ship any LLM feature, you should be testing your prompts the way you test your code — in CI, on every change. The existing options (Langfuse, DeepEval, Phoenix, Promptfoo) are heavy observability platforms shaped for enterprise teams: SDK init, hosted dashboards, sign-ups, YAML configs, whole new abstractions to learn.

A solo dev or small team wants this instead:

Colocated .prompt.ts files next to the code that uses the prompt.
A TypeScript-native API with types, autocomplete, and Zod schemas.
Snapshot testing — run a prompt, capture the output, fail the build if it drifts.
Deterministic assertions (contains, matchesSchema, …) for the parts that matter.
One CLI, JSON output, stable exit codes — CI-friendly on day one.

That's what prompt-snapshot is.

30-second demo

Create summarizer.prompt.ts:

import { defineSuite, contains, shorterThan } from "@swarmclawai/prompt-snapshot";

export default defineSuite({
  name: "summarizer",
  model: { provider: "anthropic", model: "claude-haiku-4-5", temperature: 0 },
  prompt: ({ doc }: { doc: string }) =>
    `Summarize the following in one sentence.\n\n${doc}`,
  cases: [
    {
      name: "short-doc",
      input: { doc: "The cat sat on the mat. The dog watched from across the room." },
      assertions: [contains("cat", { caseInsensitive: true }), shorterThan(300)],
      snapshot: true,
    },
  ],
});

Run it:

export ANTHROPIC_API_KEY=sk-...
npx @swarmclawai/prompt-snapshot@latest run

The first run writes a snapshot to __snapshots__/summarizer.snap.json. Every run after that fails if the output drifts. To refresh snapshots deliberately, run prompt-snapshot update.

Install

pnpm add -D @swarmclawai/prompt-snapshot
# or
npm i -D @swarmclawai/prompt-snapshot

Commands

| Command | Purpose |
|---|---|
| prompt-snapshot run [path] | Run every suite under a path |
| prompt-snapshot update [path] | Run + rewrite snapshots |
| prompt-snapshot list [path] | Discover suites without running |
| prompt-snapshot help-agents | Print the machine-readable catalog |

Every command accepts --json, which returns a one-line JSON envelope. Exit codes: 0 if all cases passed, 1 if any case failed or a suite would not load, 2 on an internal error.
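A CI wrapper can branch on the three exit codes listed above. The sketch below is one way to do that; describeExit and shouldRetry are hypothetical helper names, not part of the package, and the retry policy is only a suggestion.

```typescript
// Hypothetical helper: maps the documented exit codes to a CI decision.
type ExitMeaning = "passed" | "failed" | "internal-error" | "unknown";

function describeExit(code: number): ExitMeaning {
  switch (code) {
    case 0:
      return "passed"; // all cases passed
    case 1:
      return "failed"; // some cases failed or a suite would not load
    case 2:
      return "internal-error"; // the runner itself hit an error
    default:
      return "unknown"; // not a documented exit code
  }
}

// Assumed policy: only an internal error (2) is worth retrying;
// a genuine failure (1) should fail the build as-is.
function shouldRetry(code: number): boolean {
  return describeExit(code) === "internal-error";
}
```

Keeping the mapping in one place means the CI script never has to compare against bare numbers.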

Assertions

| Assertion | Checks |
|---|---|
| contains(s, { caseInsensitive? }) | output contains the substring |
| notContains(s, { caseInsensitive? }) | output does not contain the substring |
| matches(/regex/) | output matches a regex |
| matchesSchema(zodSchema) | output is JSON (possibly fenced) and matches the schema |
| startsWith(s) | trimmed output starts with the prefix |
| oneOf([...]) | trimmed output equals one of the options |
| longerThan(n) / shorterThan(n) | char length bound |

Combine them freely per case. Each failure is reported with the rule name and a message.
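The matchesSchema row says the output may be JSON wrapped in a code fence. How the library unwraps it is not documented here, so the sketch below shows one plausible approach. extractJson, Summary, and isSummary are hypothetical names, and the hand-written type guard stands in for a Zod schema to keep the example dependency-free.

```typescript
// Hypothetical helper: pull a JSON value out of model output that may be
// wrapped in a Markdown code fence. Not the library's actual implementation.
function extractJson(output: string): unknown {
  const trimmed = output.trim();
  // Match a fence with an optional "json" language tag around the body.
  const fenced = /^`{3}(?:json)?\s*\n([\s\S]*?)\n?`{3}$/i.exec(trimmed);
  const body = fenced ? fenced[1] : trimmed;
  return JSON.parse(body); // throws SyntaxError on invalid JSON
}

// Example target shape; a Zod schema would play this role in a real suite.
interface Summary {
  title: string;
  tags: string[];
}

function isSummary(value: unknown): value is Summary {
  if (typeof value !== "object" || value === null) return false;
  const { title, tags } = value as { title?: unknown; tags?: unknown };
  return (
    typeof title === "string" &&
    Array.isArray(tags) &&
    tags.every((t) => typeof t === "string")
  );
}
```

Unwrapping before validating lets the same schema accept both bare JSON and fenced JSON, which models often emit interchangeably.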

Snapshots

Set snapshot: true on a case (or on the whole suite), and the first run writes a normalized copy of the output to <dir>/__snapshots__/<suite>.snap.json. Subsequent runs diff the new output against the saved copy and fail on drift. Run prompt-snapshot update to rewrite the snapshots.

Normalization trims trailing whitespace and collapses trailing newlines — so whitespace noise doesn't cause false failures, but substantive drift does.
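The write-then-compare flow can be sketched as follows. This is an illustration under assumptions, not the library's code: normalize and checkSnapshot are hypothetical names, and the normalization here (strip trailing whitespace on each line, then drop trailing newlines) is one reading of the description above.

```typescript
// Assumed normalization: strip trailing whitespace on every line, then
// drop trailing newlines. The library's exact rules may differ.
function normalize(output: string): string {
  return output
    .split("\n")
    .map((line) => line.replace(/\s+$/, ""))
    .join("\n")
    .replace(/\n+$/, "");
}

type SnapshotResult = "written" | "match" | "drift";

// `saved` is undefined on the first run, when no snapshot exists yet.
function checkSnapshot(saved: string | undefined, output: string): SnapshotResult {
  if (saved === undefined) return "written";
  return normalize(saved) === normalize(output) ? "match" : "drift";
}
```

Under this sketch, "A cat sat." and "A cat sat. " followed by blank lines compare as a match, while "A dog sat." is reported as drift.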

Providers

| Provider | Env var | Notes |
|---|---|---|
| anthropic | ANTHROPIC_API_KEY | Uses @anthropic-ai/sdk. Defaults: temperature: 0, max_tokens: 1024 |
| openai | OPENAI_API_KEY | Uses openai SDK |
| openai-compatible | varies | Point baseURL at Ollama, LM Studio, vLLM, or any OpenAI-compatible API |
| mock | (none) | Only via programmatic injection (see below) |
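For a local model, the openai-compatible row suggests a model block along these lines. Treat it as a sketch: placing baseURL inside the model object is an assumption based on the demo suite, and the model name llama3.2 is a placeholder. The URL is the default address of Ollama's OpenAI-compatible endpoint.

```typescript
// Assumed shape: mirrors the `model` object from the demo suite, with a
// `baseURL` field added. Check the package's types for the real field names.
const localModel = {
  provider: "openai-compatible",
  model: "llama3.2", // placeholder: any model already pulled into Ollama
  baseURL: "http://localhost:11434/v1", // Ollama's default OpenAI-compatible endpoint
  temperature: 0,
} as const;
```

If the shape matches, this object would take the place of the model field in a defineSuite call.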

Using from Vitest

prompt-snapshot is a library first, a CLI second. Drive it directly from Vitest if you prefer:

import { test, expect } from "vitest";
import { runSuite, createMockProvider } from "@swarmclawai/prompt-snapshot";
import suite from "./summarizer.prompt.js";

test("summarizer suite", async () => {
  const res = await runSuite(suite, {
    provider: () => createMockProvider({ respond: () => "A cat story." }),
  });
  expect(res.failed).toBe(0);
});

The provider factory makes it trivial to inject a mock for unit tests and a real client for integration tests.

Built for coding agents

Every swarmclawai CLI follows the same agent conventions so that Claude Code, Cursor, Cline, Aider, Codex, and others can drive them without guessing:

--json everywhere, one-line envelope on stdout
Stderr for logs, stdout for data
Stable exit codes: 0 / 1 / 2
Non-interactive by default; missing inputs error cleanly
prompt-snapshot help-agents returns the entire command catalog as JSON

See AGENTS.md for the full machine-readable reference.
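On the agent side, the stdout/stderr split means stdout can be parsed directly. The sketch below is a hypothetical consumer, not part of the package: parseEnvelope is an invented name, and because the envelope's fields are not specified in this README, the result is left typed as unknown.

```typescript
// Hypothetical agent-side helper: expects exactly one JSON line on stdout,
// as the conventions above describe. Logs are assumed to be on stderr.
function parseEnvelope(stdout: string): unknown {
  const lines = stdout.split("\n").filter((line) => line.trim() !== "");
  const [first, ...rest] = lines;
  if (first === undefined || rest.length > 0) {
    throw new Error(`expected one JSON line on stdout, got ${lines.length}`);
  }
  return JSON.parse(first); // throws SyntaxError if the line is not JSON
}
```

Failing loudly on extra lines catches the case where log output leaks onto stdout instead of being silently misparsed.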

How it compares

| Feature | prompt-snapshot | Langfuse | DeepEval | Promptfoo | Phoenix |
|---|---|---|---|---|---|
| Zero setup, no dashboard | ✅ | ❌ | ❌ | ✅ | ❌ |
| TypeScript-native API | ✅ | partial | ❌ | partial | ❌ |
| Snapshot testing | ✅ | ❌ | ❌ | ❌ | ❌ |
| Zod-schema assertions | ✅ | ❌ | ❌ | ❌ | ❌ |
| Pluggable providers (incl. local) | ✅ | ✅ | partial | ✅ | ✅ |
| CI-first | ✅ | partial | ✅ | ✅ | partial |

Roadmap

similarTo(reference, { threshold }) — embedding-based similarity
judge(criteria, { model }) — LLM-as-judge (opt-in, cost-aware)
Parallel case execution with configurable concurrency
A thin Vitest reporter plugin
GitHub Action template for running on every PR

Contributing

See CONTRIBUTING.md.

License

MIT