@mutineerjs/tidemark
v0.3.3
Tidemark
Snapshot testing for LLM features. Detect prompt, model, and schema drift before production does.
Define your prompts as typed promptFn functions with Zod-validated output, capture behaviour across
test cases as committed JSON snapshots, and get automatic drift detection with field-level attribution
when the prompt, model, or schema changes underneath them. Ships as a Vitest matcher with no new test
runner, no SaaS, and no separate prompt store.
Install
# Vitest
npm install @mutineerjs/tidemark zod vitest
# Jest
npm install @mutineerjs/tidemark zod jest
Quick Start
Define your promptFn in a regular TypeScript module:
// src/classify.ts
import { createPromptFn, AnthropicAdapter } from '@mutineerjs/tidemark';
import * as z from 'zod';
const adapter = new AnthropicAdapter('claude-sonnet-4-5-20250929', {
apiKey: process.env.ANTHROPIC_API_KEY, // never commit API keys
});
export const classifyFn = createPromptFn({
name: 'classify',
prompt: (i) => `Classify this support message: ${i.text}`,
inputSchema: z.object({ text: z.string() }),
outputSchema: z.object({
category: z.enum(['billing', 'tech', 'general']),
confidence: z.number(),
}),
adapter,
});
Then import it in your snapshot test file:
// src/classify.snap.test.ts
import { it } from 'vitest'; // not needed if `globals: true` is set in your Vitest config
import { expectPromptFn } from '@mutineerjs/tidemark/vitest';
import { classifyFn } from './classify';
it('classifyFn matches snapshot', async () => {
await expectPromptFn(classifyFn).toMatchSnapshot([
{ name: 'billing', input: { text: 'charged twice' } },
{ name: 'general', input: { text: 'update my address' } },
]);
});
The first run writes __snapshots__/classify.snap.json next to your test file. Subsequent runs
fail if the prompt, schema, or model changes, with attribution telling you exactly which hash
changed.
Vitest Configuration
Register Tidemark's matchers by adding the entry point to setupFiles in your config — no separate
setup file needed:
// vitest.config.ts
import { defineConfig } from 'vitest/config';
export default defineConfig({
test: {
setupFiles: ['@mutineerjs/tidemark/vitest'],
testTimeout: 30_000,
},
});
Alternatively, if you already have a vitest.setup.ts for other setup, import it there:
// vitest.setup.ts
import '@mutineerjs/tidemark/vitest';
Snapshot tests make real LLM API calls, so Vitest's default 5 s timeout is too short. The
testTimeout: 30_000 above sets it to 30 s; adjust to fit your provider's latency.
If you have a mixed suite (fast unit tests alongside AI snapshot tests), keep a separate config for the snapshot tests and explicitly exclude that directory from the main config:
// vitest.config.ts — unit tests only
import { defineConfig } from 'vitest/config';
export default defineConfig({
test: {
include: ['src/**/*.test.ts'],
exclude: ['src/snapshots/**', 'node_modules/**'],
},
});
// vitest.snapshot.config.ts: AI snapshot tests
import { defineConfig } from 'vitest/config';
export default defineConfig({
test: {
include: ['src/snapshots/**/*.test.ts'],
setupFiles: ['@mutineerjs/tidemark/vitest'], // register the matchers for this config too
testTimeout: 30_000,
},
});
Run them independently:
vitest run # unit tests
vitest run --config vitest.snapshot.config.ts # snapshot tests
Jest Configuration
Register Tidemark's matchers by adding the entry point to setupFilesAfterEnv in your Jest config:
// jest.config.ts
export default {
setupFilesAfterEnv: ['@mutineerjs/tidemark/jest'],
testTimeout: 30_000,
};
Then import from the Jest sub-package in your tests:
import { expectPromptFn } from '@mutineerjs/tidemark/jest';
import { classifyFn } from './classify';
it('classifyFn matches snapshot', async () => {
await expectPromptFn(classifyFn).toMatchSnapshot([
{ name: 'billing', input: { text: 'charged twice' } },
{ name: 'general', input: { text: 'update my address' } },
]);
});
The Jest adapter works identically to the Vitest adapter: first run writes the snapshot, subsequent
runs detect drift. In CI (--ci flag), Jest sets its snapshot mode to none and Tidemark skips
all LLM calls, trusting the committed snapshot.
Controlling When Drift Checks Run
LLM snapshot tests make real API calls, which is slow and costs money. Tidemark lets you tune how
often the drift check actually fires on subsequent runs using the opts parameter.
Run on a fixed fraction of test executions:
// Run the LLM drift check ~20% of the time
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { sample: 0.2 });
sample takes a probability between 0 and 1. On each run, Tidemark draws a random number; if
it falls above sample, the test passes immediately without calling the LLM.
Run 1-in-N times:
// Run the LLM drift check roughly once every 10 test runs
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { every: 10 });
every: N is equivalent to sample: 1/N. Use it when you want to think in terms of frequency
rather than probability.
When sampling kicks in: Only the drift check on subsequent runs is gated. First-run baseline writes and CI offline mode are not affected — those paths return before the sampling gate.
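The sampling gate described above can be sketched roughly as follows. This is an illustration of the documented behaviour, not Tidemark's actual implementation: SamplingOptions, resolveSampleRate, and shouldRunDriftCheck are hypothetical names, and the injectable draw parameter is an assumption added so the gate can be exercised deterministically.

```typescript
// Hypothetical option shape mirroring the documented `sample` and `every` options.
interface SamplingOptions {
  sample?: number; // probability in [0, 1]
  every?: number; // run roughly 1-in-N times
}

// Resolve the options to a single probability. Omitting both means "always run".
function resolveSampleRate(opts: SamplingOptions): number {
  if (opts.sample !== undefined && opts.every !== undefined) {
    throw new Error('sample and every are mutually exclusive');
  }
  if (opts.every !== undefined) {
    if (opts.every < 1) throw new RangeError('every must be >= 1');
    return 1 / opts.every; // every: N behaves like sample: 1/N
  }
  if (opts.sample !== undefined) {
    if (opts.sample < 0 || opts.sample > 1) {
      throw new RangeError('sample must be between 0 and 1');
    }
    return opts.sample;
  }
  return 1;
}

// Returns true when the drift check should call the LLM on this run.
// A draw at or above the rate skips the check, so the test passes immediately.
function shouldRunDriftCheck(
  opts: SamplingOptions,
  draw: () => number = Math.random, // assumption: injectable for deterministic tests
): boolean {
  return draw() < resolveSampleRate(opts);
}
```

With { sample: 0.2 }, a draw of 0.1 runs the check and a draw of 0.5 skips it. Because Math.random() never returns 1, omitting both options always runs the check.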
Adjust the judge threshold:
// Require stricter semantic equivalence (default is 0.85)
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { threshold: 0.95 });
threshold controls how similar a string field must be to the baseline for the LLM judge to call
it equivalent. Lower values tolerate more variation; higher values are stricter.
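To make the effect of threshold concrete, here is a small sketch that assumes the judge produces a similarity score between 0 and 1 and that a field counts as equivalent when the score is at or above the threshold. Both points are assumptions, as are the names Judge, isEquivalent, and wordOverlapJudge. The toy judge is a word-overlap measure standing in for the LLM call.

```typescript
// A judge maps (baseline, candidate) to a similarity score in [0, 1].
// Tidemark uses an LLM for this role; this type is a hypothetical stand-in.
type Judge = (baseline: string, candidate: string) => number;

const DEFAULT_THRESHOLD = 0.85; // the documented default

function isEquivalent(
  baseline: string,
  candidate: string,
  judge: Judge,
  threshold: number = DEFAULT_THRESHOLD,
): boolean {
  if (baseline === candidate) return true; // identical strings need no judge call
  // Assumption: "at or above the threshold" counts as equivalent.
  return judge(baseline, candidate) >= threshold;
}

// Toy judge for illustration only: Jaccard similarity over lowercase words.
const wordOverlapJudge: Judge = (a, b) => {
  const wordsA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wordsB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const shared = Array.from(wordsA).filter((w) => wordsB.has(w)).length;
  const union = new Set(Array.from(wordsA).concat(Array.from(wordsB))).size;
  return union === 0 ? 1 : shared / union;
};
```

A judge score of 0.9 passes at the default threshold of 0.85 but fails at 0.95, which is the trade-off the option exposes.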
How It Works
promptFn as a typed code artifact. Define each prompt once with createPromptFn(). Zod validates
input and output, .describe() annotations on schema fields auto-inject into the system message,
and the function is a plain async function callable anywhere in your codebase.
Snapshots committed to git. .snap.json files live next to your test file in __snapshots__/,
use deterministic key order, and are human-readable JSON. They show up in PR diffs like any other
code change, so your team reviews LLM behaviour changes the same way they review code changes.
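Deterministic key order is what keeps snapshot diffs small and reviewable. One common way to get it is to sort object keys recursively before serializing. The sketch below shows that general technique; it is not taken from Tidemark's source, and Json, sortKeys, and stableStringify are hypothetical names.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively rebuild objects with their keys inserted in sorted order.
// Arrays keep their element order, since order is meaningful there.
function sortKeys(value: Json): Json {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value !== null && typeof value === 'object') {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeys(value[key]);
    }
    return sorted;
  }
  return value;
}

// Serialize with two-space indentation and a trailing newline so the
// file reads cleanly in editors and in PR diffs.
function stableStringify(value: Json): string {
  return JSON.stringify(sortKeys(value), null, 2) + '\n';
}
```

Two objects with the same content but different key insertion order serialize to the same string, so re-running a test does not produce spurious diffs.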
Drift detection with field attribution. Tidemark hashes the prompt text, the Zod schema
_def tree, and the resolved model version from the API response. On a snapshot mismatch, the
failure message tells you which of the three changed rather than giving you a blob diff of raw
output. For string fields, an LLM-as-judge equivalence check distinguishes semantically equivalent
output from an actual regression before reporting a failure.
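The attribution step can be pictured as comparing three stored fingerprints against three freshly computed ones. The sketch below assumes SHA-256 via Node's crypto module and a pre-serialized schema string. The hash algorithm, the Fingerprint field names, and the functions fingerprint and attributeDrift are all assumptions for illustration, not Tidemark's internals.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical record of the three things the README says are hashed.
interface Fingerprint {
  promptHash: string;
  schemaHash: string;
  modelHash: string;
}

type DriftSource = 'prompt' | 'schema' | 'model';

function sha256(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// `schemaJson` stands in for a stable serialization of the Zod schema tree.
function fingerprint(promptText: string, schemaJson: string, modelVersion: string): Fingerprint {
  return {
    promptHash: sha256(promptText),
    schemaHash: sha256(schemaJson),
    modelHash: sha256(modelVersion),
  };
}

// Report which inputs changed, instead of diffing raw output.
function attributeDrift(baseline: Fingerprint, current: Fingerprint): DriftSource[] {
  const changed: DriftSource[] = [];
  if (baseline.promptHash !== current.promptHash) changed.push('prompt');
  if (baseline.schemaHash !== current.schemaHash) changed.push('schema');
  if (baseline.modelHash !== current.modelHash) changed.push('model');
  return changed;
}
```

An empty result means none of the three inputs moved, so any output difference would then fall to the string-field equivalence check rather than to attribution.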
API Reference (v0.1)
createPromptFn(config) defines a typed prompt function.
| Config field | Type | Description |
|---|---|---|
| name | string | Stable identifier used as the snapshot filename |
| prompt | (input) => string | Function that builds the prompt string from validated input |
| inputSchema | z.ZodType | Zod schema for validating the input object |
| outputSchema | z.ZodType | Zod schema for validating and parsing LLM output |
| adapter | ProviderAdapter | Provider adapter (AnthropicAdapter, OpenAIAdapter) |
| temperature? | number | Sampling temperature (optional) |
| maxRetries? | number | Retry attempts on Zod validation failure (default: 2) |
| tools? | Record<string, z.ZodType> | Tool definitions for function calling (optional) |
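The maxRetries row above implies a validate-and-retry loop around each LLM call. A minimal sketch of that pattern follows, assuming that "retry attempts" means extra attempts after the first call and that validation has a safe-parse style result. Validation and callWithValidationRetries are hypothetical names, not part of the library's API.

```typescript
// Hypothetical validation result, loosely modelled on a safe-parse style API.
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

async function callWithValidationRetries<T>(
  call: () => Promise<unknown>, // produces raw, unvalidated output
  validate: (raw: unknown) => Validation<T>,
  maxRetries: number = 2, // the documented default
): Promise<T> {
  let lastError = 'no attempts made';
  // Assumption: one initial attempt plus up to `maxRetries` further attempts.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await call();
    const result = validate(raw);
    if (result.ok) return result.value;
    lastError = result.error;
  }
  throw new Error(`Output failed validation after ${maxRetries + 1} attempts: ${lastError}`);
}
```

Under these assumptions the default of 2 allows at most three calls per input, which is worth keeping in mind when estimating API cost.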
fn(input, options?) calls the function and returns Promise<TidemarkResult<Output>> with shape { output, meta, messages }.
Options: { handlers?: Record<string, Handler>, messages?: ConversationMessage[] }
fn.stream(input, options?) returns a TidemarkStream with for await text chunks and .finalOutput().
expectPromptFn(fn).toMatchSnapshot(cases, opts?) is the snapshot matcher. Exported from
'@mutineerjs/tidemark/vitest' (Vitest) or '@mutineerjs/tidemark/jest' (Jest).
| Option | Type | Default | Description |
|---|---|---|---|
| threshold | number | 0.85 | LLM judge equivalence threshold for string fields (0–1) |
| sample | number | — | Probability of running the drift check on subsequent runs (0–1) |
| every | number | — | Run drift check 1-in-N times; equivalent to sample: 1/N |
sample and every are mutually exclusive. If both are omitted, the drift check always runs.
mockPromptFn(returnValue) is a test double exported from '@mutineerjs/tidemark/testing'. It returns a PromptFn with zero-value cost and latency metadata.
new AnthropicAdapter(model, { apiKey? }) and new OpenAIAdapter(model, { apiKey? }) are the built-in provider adapters.
Sub-packages:
| Import path | Test runner | Registration |
|---|---|---|
| @mutineerjs/tidemark/vitest | Vitest | Add to setupFiles in vitest.config.ts |
| @mutineerjs/tidemark/jest | Jest | Add to setupFilesAfterEnv in jest.config.ts |
| @mutineerjs/tidemark/testing | Any | Import mockPromptFn in test files |
TidemarkCallMeta fields: inputTokens, outputTokens, estimatedCostUsd, responseTimeMs, rawRequest, rawResponse.
Mocking in Unit Tests
Use mockPromptFn from '@mutineerjs/tidemark/testing' to stub a promptFn in unit tests without making real API calls.
import { mockPromptFn } from '@mutineerjs/tidemark/testing';
const classify = mockPromptFn({ category: 'billing', confidence: 0.95 });
const result = await classify({ text: 'charged twice' });
// result.output.category === 'billing'
// result.output.confidence === 0.95
// result.meta.inputTokens === 0, result.meta.estimatedCostUsd === 0
The mock returns your returnValue with zero-value cost and latency metadata. The .stream() method
is also stubbed, so code that calls fn.stream() will not throw.
Multi-Turn Conversations
Pass result.messages from one call to the next to maintain conversation context.
// First turn
const r1 = await fn({ text: 'hello' });
// Second turn — r1.messages is the full first-turn conversation
const r2 = await fn({ text: 'follow up' }, { messages: r1.messages });
// Third turn: r2.messages includes the first two turns
const r3 = await fn({ text: 'one more thing' }, { messages: r2.messages });
messages is an in-memory array and is not persisted across sessions. Pass result.messages to the
next call to maintain context within a session.
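To see how the history grows from turn to turn, the sketch below uses a synchronous stand-in for a prompt function. The { role, content } message shape and the names TurnResult and fakeTurn are assumptions made for illustration. The real function is async and calls a provider.

```typescript
// Assumed message shape; the library's ConversationMessage may differ.
interface ConversationMessage {
  role: 'user' | 'assistant';
  content: string;
}

interface TurnResult {
  output: string;
  messages: ConversationMessage[];
}

// Stand-in for a prompt function: echoes the input and returns the extended history.
function fakeTurn(
  text: string,
  options: { messages?: ConversationMessage[] } = {},
): TurnResult {
  const history = options.messages ?? [];
  const output = `echo: ${text}`;
  return {
    output,
    // Build a new array rather than mutating, so earlier results stay unchanged.
    messages: [
      ...history,
      { role: 'user', content: text },
      { role: 'assistant', content: output },
    ],
  };
}

const t1 = fakeTurn('hello');
const t2 = fakeTurn('follow up', { messages: t1.messages });
const t3 = fakeTurn('one more thing', { messages: t2.messages });
```

Here t1.messages holds one turn, t2.messages holds two, and t3.messages holds all three. Dropping the messages option on any call starts a fresh conversation.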
Benchmark
See BENCHMARK.md for captured output demonstrating prompt hash drift, schema hash
drift, and model version drift detection, all running without a real API key via MockAdapter.
License
MIT
