@mutineerjs/tidemark

v0.3.3

Published

25 days ago

mutineer

Tidemark

Snapshot testing for LLM features. Detect prompt, model, and schema drift before production does.

Define your prompts as typed promptFn functions with Zod-validated output, capture behaviour across test cases as committed JSON snapshots, and get automatic drift detection with field-level attribution when the prompt, model, or schema changes underneath them. Ships as a Vitest matcher with no new test runner, no SaaS, and no separate prompt store.

Install

# Vitest
npm install @mutineerjs/tidemark zod vitest

# Jest
npm install @mutineerjs/tidemark zod jest

Quick Start

Define your promptFn in a regular TypeScript module:

// src/classify.ts
import { createPromptFn, AnthropicAdapter } from '@mutineerjs/tidemark';
import * as z from 'zod';

const adapter = new AnthropicAdapter('claude-sonnet-4-5-20250929', {
  apiKey: process.env.ANTHROPIC_API_KEY, // never commit API keys
});

export const classifyFn = createPromptFn({
  name: 'classify',
  prompt: (i) => `Classify this support message: ${i.text}`,
  inputSchema: z.object({ text: z.string() }),
  outputSchema: z.object({
    category: z.enum(['billing', 'tech', 'general']),
    confidence: z.number(),
  }),
  adapter,
});

Then import it in your snapshot test file:

// src/classify.snap.test.ts
import { it } from 'vitest'; // needed unless `globals: true` is set in the Vitest config
import { expectPromptFn } from '@mutineerjs/tidemark/vitest';
import { classifyFn } from './classify';

it('classifyFn matches snapshot', async () => {
  await expectPromptFn(classifyFn).toMatchSnapshot([
    { name: 'billing', input: { text: 'charged twice' } },
    { name: 'general', input: { text: 'update my address' } },
  ]);
});

The first run writes __snapshots__/classify.snap.json next to your test file. Subsequent runs fail if the prompt, schema, or model changes, with attribution telling you exactly which hash changed.

Vitest Configuration

// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    setupFiles: ['@mutineerjs/tidemark/vitest'],
    testTimeout: 30_000,
  },
});

Alternatively, if you already have a vitest.setup.ts for other setup, import it there:

// vitest.setup.ts
import '@mutineerjs/tidemark/vitest';

Snapshot tests make real LLM API calls, so Vitest's default 5 s timeout is too short. The testTimeout: 30_000 above sets it to 30 s — adjust to fit your provider's latency.

If you have a mixed suite (fast unit tests alongside AI snapshot tests), keep a separate config for the snapshot tests and explicitly exclude that directory from the main config:

// vitest.config.ts — unit tests only
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['src/**/*.test.ts'],
    exclude: ['src/snapshots/**', 'node_modules/**'],
  },
});

// vitest.snapshot.config.ts — AI snapshot tests
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['src/snapshots/**/*.test.ts'],
    testTimeout: 30_000,
  },
});

Run them independently:

vitest run                                       # unit tests
vitest run --config vitest.snapshot.config.ts   # snapshot tests

Jest Configuration

// jest.config.ts
export default {
  setupFilesAfterEnv: ['@mutineerjs/tidemark/jest'],
  testTimeout: 30_000,
};

Then import from the Jest sub-package in your tests:

import { expectPromptFn } from '@mutineerjs/tidemark/jest';
import { classifyFn } from './classify';

it('classifyFn matches snapshot', async () => {
  await expectPromptFn(classifyFn).toMatchSnapshot([
    { name: 'billing', input: { text: 'charged twice' } },
    { name: 'general', input: { text: 'update my address' } },
  ]);
});

The Jest adapter works identically to the Vitest adapter: first run writes the snapshot, subsequent runs detect drift. In CI (--ci flag), Jest sets its snapshot mode to none and Tidemark skips all LLM calls, trusting the committed snapshot.

Controlling When Drift Checks Run

LLM snapshot tests make real API calls, which is slow and costs money. Tidemark lets you tune how often the drift check actually fires on subsequent runs using the opts parameter.

Run on a fixed fraction of test executions:

// Run the LLM drift check ~20% of the time
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { sample: 0.2 });

sample takes a probability between 0 and 1. On each run, Tidemark draws a random number in the same range. If the draw falls above sample, the test passes immediately without calling the LLM.

Run 1-in-N times:

// Run the LLM drift check roughly once every 10 test runs
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { every: 10 });

every: N is equivalent to sample: 1/N. Use it when you want to think in terms of frequency rather than probability.

When sampling kicks in: Only the drift check on subsequent runs is gated. First-run baseline writes and CI offline mode are not affected — those paths return before the sampling gate.
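The gating order described above can be sketched as a small decision function. This is an illustration only, not Tidemark's source: the names `decide`, `resolveProbability`, `RunContext`, and `Decision` are hypothetical, and the relative order of the CI and first-run checks is an assumption.

```typescript
// Hypothetical names throughout; this is a sketch, not Tidemark's implementation.
type SamplingOpts = { sample?: number; every?: number };
type RunContext = { hasBaseline: boolean; ciOffline: boolean };
type Decision = 'trust-snapshot' | 'write-baseline' | 'check' | 'skip';

// Convert `every: N` into the equivalent probability 1/N.
function resolveProbability(opts: SamplingOpts): number {
  if (opts.sample !== undefined && opts.every !== undefined) {
    throw new Error('sample and every are mutually exclusive');
  }
  if (opts.every !== undefined) return 1 / opts.every;
  return opts.sample ?? 1; // neither option set: always run the check
}

function decide(
  ctx: RunContext,
  opts: SamplingOpts,
  random: () => number = Math.random, // injectable so tests can be deterministic
): Decision {
  // These two paths return before the sampling gate is consulted.
  if (ctx.ciOffline) return 'trust-snapshot';
  if (!ctx.hasBaseline) return 'write-baseline';

  // Sampling gate: run the drift check only when the draw falls below the probability.
  return random() < resolveProbability(opts) ? 'check' : 'skip';
}
```

Passing the random source as a parameter keeps the gate testable: a fixed draw of 0.05 with `{ every: 10 }` yields `'check'`, while a draw of 0.5 with `{ sample: 0.2 }` yields `'skip'`.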

Adjust the judge threshold:

// Require stricter semantic equivalence (default is 0.85)
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { threshold: 0.95 });

threshold controls how similar a string field must be to the baseline for the LLM judge to call it equivalent. Lower values tolerate more variation; higher values are stricter.
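To make the effect of threshold concrete, here is a minimal sketch of the comparison step. It assumes the judge yields one similarity score per string field and that a score equal to the threshold counts as equivalent; `FieldScores` and `fieldsBelowThreshold` are hypothetical names, not part of the Tidemark API.

```typescript
// Hypothetical: one judge similarity score (0 to 1) per string field.
type FieldScores = Record<string, number>;

// Return the fields whose score falls below the threshold, i.e. the ones
// that would be reported as regressions under the stated assumptions.
function fieldsBelowThreshold(scores: FieldScores, threshold = 0.85): string[] {
  if (threshold < 0 || threshold > 1) {
    throw new RangeError('threshold must be between 0 and 1');
  }
  return Object.entries(scores)
    .filter(([, score]) => score < threshold)
    .map(([field]) => field)
    .sort();
}

const exampleScores: FieldScores = { summary: 0.91, reason: 0.88 };
const atDefault = fieldsBelowThreshold(exampleScores);      // no fields fail at 0.85
const atStrict = fieldsBelowThreshold(exampleScores, 0.9);  // only 'reason' fails at 0.9
```

Raising the threshold from 0.85 to 0.9 turns the same judge scores from a pass into a failure on `reason`, which is the trade-off the option exposes.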

How It Works

promptFn as a typed code artifact. Define each prompt once with createPromptFn(). Zod validates input and output, .describe() annotations on schema fields are injected into the system message automatically, and the result is a plain async function callable anywhere in your codebase.

Snapshots committed to git. .snap.json files live next to your test file in __snapshots__/, use deterministic key order, and are human-readable JSON. They show up in PR diffs like any other code change, so your team reviews LLM behaviour changes the same way they review code changes.
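Deterministic key order is what keeps snapshot diffs free of noise. One common way to get it is to sort object keys recursively before serializing. The sketch below shows that general technique; `sortKeys` and `stableStringify` are hypothetical helpers, and Tidemark's own serializer may work differently.

```typescript
// Recursively rebuild a value with object keys in sorted order.
// Arrays keep their element order; only object keys are reordered.
function sortKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = sortKeys(source[key]);
    }
    return sorted;
  }
  return value; // primitives pass through unchanged
}

// Serialize with two-space indentation and a trailing newline,
// so the file is readable and diffs cleanly in git.
function stableStringify(value: unknown): string {
  return JSON.stringify(sortKeys(value), null, 2) + '\n';
}
```

With this approach, `{ b: 1, a: { d: 2, c: 3 } }` and `{ a: { c: 3, d: 2 }, b: 1 }` serialize to identical text, so a PR diff only shows lines whose values actually changed.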

Drift detection with field attribution. Tidemark hashes the prompt text, the Zod schema _def tree, and the resolved model version from the API response. On a snapshot mismatch, the failure message tells you which of the three changed rather than giving you a blob diff of raw output. For string fields, an LLM-as-judge equivalence check distinguishes semantically equivalent output from an actual regression before reporting a failure.
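The attribution idea can be sketched as comparing three independent fingerprints rather than one combined hash. Everything here is illustrative: `Fingerprint`, `fingerprint`, and `attributeDrift` are hypothetical names, the schema is assumed to be available as a serialized string, and a 32-bit FNV-1a hash stands in for whatever hash Tidemark actually uses so the example has no dependencies.

```typescript
// FNV-1a (32-bit) over UTF-16 code units. A dependency-free stand-in;
// a real tool would more likely use a cryptographic hash such as SHA-256.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

type Fingerprint = { promptHash: string; schemaHash: string; modelHash: string };
type DriftSource = 'prompt' | 'schema' | 'model';

// Hash each component separately so a mismatch can be attributed.
function fingerprint(promptText: string, schemaJson: string, modelVersion: string): Fingerprint {
  return {
    promptHash: fnv1a(promptText),
    schemaHash: fnv1a(schemaJson),
    modelHash: fnv1a(modelVersion),
  };
}

// Report which of the three components differ from the baseline.
function attributeDrift(baseline: Fingerprint, current: Fingerprint): DriftSource[] {
  const changed: DriftSource[] = [];
  if (baseline.promptHash !== current.promptHash) changed.push('prompt');
  if (baseline.schemaHash !== current.schemaHash) changed.push('schema');
  if (baseline.modelHash !== current.modelHash) changed.push('model');
  return changed;
}
```

Keeping the hashes separate is the design point: a single combined hash could say that something changed, but not whether the prompt, the schema, or the model was responsible.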

API Reference (v0.1)

createPromptFn(config) defines a typed prompt function.

| Config field | Type | Description |
|---|---|---|
| name | string | Stable identifier used as the snapshot filename |
| prompt | (input) => string | Function that builds the prompt string from validated input |
| inputSchema | z.ZodType | Zod schema for validating the input object |
| outputSchema | z.ZodType | Zod schema for validating and parsing LLM output |
| adapter | ProviderAdapter | Provider adapter (AnthropicAdapter, OpenAIAdapter) |
| temperature? | number | Sampling temperature (optional) |
| maxRetries? | number | Retry attempts on Zod validation failure (default: 2) |
| tools? | Record<string, z.ZodType> | Tool definitions for function calling (optional) |
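The maxRetries behaviour can be pictured as a retry loop around the provider call. The sketch below is an assumption-laden illustration, not Tidemark's code: `callWithRetries`, `call`, and `parse` are hypothetical, `parse` stands in for JSON parsing plus `outputSchema` validation, and maxRetries is assumed to count retries after the first attempt (so the default of 2 allows up to three calls).

```typescript
// Hypothetical helper: retry the call whenever parsing/validation throws.
async function callWithRetries<T>(
  call: () => Promise<string>,  // stand-in for the provider request
  parse: (raw: string) => T,    // stand-in for schema validation; throws on bad output
  maxRetries = 2,
): Promise<{ output: T; attempts: number }> {
  let lastError: unknown;
  // One initial attempt plus up to `maxRetries` retries.
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const raw = await call();
    try {
      return { output: parse(raw), attempts: attempt };
    } catch (error) {
      lastError = error; // validation failed; loop again if attempts remain
    }
  }
  throw lastError; // every attempt produced invalid output
}
```

Only validation failures are retried in this sketch. An error thrown by `call` itself (for example a network failure) propagates immediately, which keeps provider outages from being masked as schema problems.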

fn(input, options?) calls the function and returns Promise<TidemarkResult<Output>> with shape { output, meta, messages }.

Options: { handlers?: Record<string, Handler>, messages?: ConversationMessage[] }

fn.stream(input, options?) returns a TidemarkStream. Iterate it with for await to receive text chunks, and call .finalOutput() to get the final output once the stream completes.

expectPromptFn(fn).toMatchSnapshot(cases, opts?) is the snapshot matcher. Exported from '@mutineerjs/tidemark/vitest' (Vitest) or '@mutineerjs/tidemark/jest' (Jest).

| Option | Type | Default | Description |
|---|---|---|---|
| threshold | number | 0.85 | LLM judge equivalence threshold for string fields (0–1) |
| sample | number | — | Probability of running the drift check on subsequent runs (0–1) |
| every | number | — | Run drift check 1-in-N times; equivalent to sample: 1/N |

sample and every are mutually exclusive. If both are omitted, the drift check always runs.

mockPromptFn(returnValue) is a test double exported from '@mutineerjs/tidemark/testing'. It returns a PromptFn with zero-value cost and latency metadata.

new AnthropicAdapter(model, { apiKey? }) and new OpenAIAdapter(model, { apiKey? }) are the built-in provider adapters.

Sub-packages:

| Import path | Test runner | Registration |
|---|---|---|
| @mutineerjs/tidemark/vitest | Vitest | Add to setupFiles in vitest.config.ts |
| @mutineerjs/tidemark/jest | Jest | Add to setupFilesAfterEnv in jest.config.ts |
| @mutineerjs/tidemark/testing | Any | Import mockPromptFn in test files |

TidemarkCallMeta fields: inputTokens, outputTokens, estimatedCostUsd, responseTimeMs, rawRequest, rawResponse.

Mocking in Unit Tests

Use mockPromptFn from '@mutineerjs/tidemark/testing' to stub a promptFn in unit tests without making real API calls.

import { mockPromptFn } from '@mutineerjs/tidemark/testing';

const classify = mockPromptFn({ category: 'billing', confidence: 0.95 });
const result = await classify({ text: 'charged twice' });

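// The result has the same { output, meta, messages } shape as a real call:
// result.output.category === 'billing'
// result.output.confidence === 0.95
// result.meta.inputTokens === 0, result.meta.estimatedCostUsd === 0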

The mock returns your returnValue with zero-value cost and latency metadata. The .stream() method is also stubbed, so code that calls fn.stream() will not throw.

Multi-Turn Conversations

Pass result.messages from one call to the next to maintain conversation context.

// First turn
const r1 = await fn({ text: 'hello' });

// Second turn — r1.messages is the full first-turn conversation
const r2 = await fn({ text: 'follow up' }, { messages: r1.messages });

// Third turn: r2.messages carries the first two turns, so r3.messages holds all three
const r3 = await fn({ text: 'one more thing' }, { messages: r2.messages });

messages is an in-memory array and is not persisted across sessions.
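How the history grows from turn to turn can be shown with a small stand-in. This sketch assumes each message has a role and content and that every turn appends both the user input and the assistant reply. `takeTurn` and this `ConversationMessage` shape are hypothetical and only mimic the threading pattern, without calling a model.

```typescript
// Assumed message shape; the real ConversationMessage type may differ.
type ConversationMessage = { role: 'user' | 'assistant'; content: string };

// Hypothetical stand-in for one call: returns a new history with both
// sides of the exchange appended, leaving the input array untouched.
function takeTurn(
  history: ConversationMessage[],
  userText: string,
  reply: string,
): ConversationMessage[] {
  return [
    ...history,
    { role: 'user', content: userText },
    { role: 'assistant', content: reply },
  ];
}

const turn1 = takeTurn([], 'hello', 'hi there');
const turn2 = takeTurn(turn1, 'follow up', 'sure');
const turn3 = takeTurn(turn2, 'one more thing', 'done');
// turn1 has 2 messages, turn2 has 4, turn3 has 6
```

Because each turn returns a fresh array, earlier histories stay valid, which makes it possible to branch a conversation from any previous turn by reusing its messages.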

Benchmark

See BENCHMARK.md for captured output demonstrating prompt hash drift, schema hash drift, and model version drift detection, all running without a real API key via MockAdapter.

License

MIT