@mutineerjs/tidemark

v0.3.3

Tidemark

Snapshot testing for LLM features. Detect prompt, model, and schema drift before production does.

Define your prompts as typed promptFn functions with Zod-validated output, capture behaviour across test cases as committed JSON snapshots, and get automatic drift detection with field-level attribution when the prompt, model, or schema changes underneath them. Ships as a matcher for Vitest or Jest, with no new test runner, no SaaS, and no separate prompt store.

Install

# Vitest
npm install @mutineerjs/tidemark zod vitest

# Jest
npm install @mutineerjs/tidemark zod jest

Quick Start

Define your promptFn in a regular TypeScript module:

// src/classify.ts
import { createPromptFn, AnthropicAdapter } from '@mutineerjs/tidemark';
import * as z from 'zod';

const adapter = new AnthropicAdapter('claude-sonnet-4-5-20250929', {
  apiKey: process.env.ANTHROPIC_API_KEY, // never commit API keys
});

export const classifyFn = createPromptFn({
  name: 'classify',
  prompt: (i) => `Classify this support message: ${i.text}`,
  inputSchema: z.object({ text: z.string() }),
  outputSchema: z.object({
    category: z.enum(['billing', 'tech', 'general']),
    confidence: z.number(),
  }),
  adapter,
});

Then import it in your snapshot test file:

// src/classify.snap.test.ts
import { expectPromptFn } from '@mutineerjs/tidemark/vitest';
import { classifyFn } from './classify';

it('classifyFn matches snapshot', async () => {
  await expectPromptFn(classifyFn).toMatchSnapshot([
    { name: 'billing', input: { text: 'charged twice' } },
    { name: 'general', input: { text: 'update my address' } },
  ]);
});

The first run writes __snapshots__/classify.snap.json next to your test file. Subsequent runs fail if the prompt, schema, or model changes, with attribution telling you exactly which hash changed.

Vitest Configuration

Register Tidemark's matchers by adding the entry point to setupFiles in your config — no separate setup file needed:

// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    setupFiles: ['@mutineerjs/tidemark/vitest'],
    testTimeout: 30_000,
  },
});

Alternatively, if you already have a vitest.setup.ts for other setup, import it there:

// vitest.setup.ts
import '@mutineerjs/tidemark/vitest';

Snapshot tests make real LLM API calls, so Vitest's default 5 s timeout is too short. The testTimeout: 30_000 above sets it to 30 s — adjust to fit your provider's latency.

If you have a mixed suite (fast unit tests alongside AI snapshot tests), keep a separate config for the snapshot tests and explicitly exclude that directory from the main config:

// vitest.config.ts: unit tests only
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['src/**/*.test.ts'],
    exclude: ['src/snapshots/**', 'node_modules/**'],
  },
});

// vitest.snapshot.config.ts: AI snapshot tests
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['src/snapshots/**/*.test.ts'],
    testTimeout: 30_000,
  },
});

Run them independently:

vitest run                                       # unit tests
vitest run --config vitest.snapshot.config.ts   # snapshot tests

Jest Configuration

Register Tidemark's matchers by adding the entry point to setupFilesAfterEnv in your Jest config:

// jest.config.ts
export default {
  setupFilesAfterEnv: ['@mutineerjs/tidemark/jest'],
  testTimeout: 30_000,
};

Then import from the Jest sub-package in your tests:

import { expectPromptFn } from '@mutineerjs/tidemark/jest';
import { classifyFn } from './classify';

it('classifyFn matches snapshot', async () => {
  await expectPromptFn(classifyFn).toMatchSnapshot([
    { name: 'billing', input: { text: 'charged twice' } },
    { name: 'general', input: { text: 'update my address' } },
  ]);
});

The Jest adapter works identically to the Vitest adapter: first run writes the snapshot, subsequent runs detect drift. In CI (--ci flag), Jest sets its snapshot mode to none and Tidemark skips all LLM calls, trusting the committed snapshot.

Controlling When Drift Checks Run

LLM snapshot tests make real API calls, which is slow and costs money. Tidemark lets you tune how often the drift check actually fires on subsequent runs using the opts parameter.

Run on a fixed fraction of test executions:

// Run the LLM drift check ~20% of the time
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { sample: 0.2 });

sample takes a probability between 0 and 1. On each run, Tidemark draws a random number — if it falls above sample, the test passes immediately without calling the LLM.

Run 1-in-N times:

// Run the LLM drift check roughly once every 10 test runs
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { every: 10 });

every: N is equivalent to sample: 1/N. Use it when you want to think in terms of frequency rather than probability.

When sampling kicks in: Only the drift check on subsequent runs is gated. First-run baseline writes and CI offline mode are not affected — those paths return before the sampling gate.
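The gating described above can be pictured with a small sketch. This is not Tidemark's source: `resolveProbability`, `decideAction`, and the `Action` labels are hypothetical names, and checking CI offline mode before the first-run case is an assumption. The random source is injected so the behaviour can be checked deterministically.

```typescript
// Hypothetical sketch of the sampling gate; none of these names are Tidemark APIs.
interface SamplingOpts {
  sample?: number; // probability in [0, 1] of running the drift check
  every?: number; // run roughly 1-in-N times
}

// Collapse `sample` / `every` into one probability of running the check.
function resolveProbability(opts: SamplingOpts): number {
  if (opts.sample !== undefined && opts.every !== undefined) {
    throw new Error('sample and every are mutually exclusive');
  }
  if (opts.every !== undefined) {
    if (opts.every < 1) throw new Error('every must be at least 1');
    return 1 / opts.every;
  }
  if (opts.sample !== undefined) {
    if (opts.sample < 0 || opts.sample > 1) {
      throw new Error('sample must be between 0 and 1');
    }
    return opts.sample;
  }
  return 1; // neither option set: always run the drift check
}

type Action = 'trust-snapshot' | 'write-baseline' | 'run-drift-check' | 'skip';

function decideAction(
  state: { hasBaseline: boolean; ciOffline: boolean },
  opts: SamplingOpts = {},
  random: () => number = Math.random,
): Action {
  // These two paths return before the sampling gate (their order is assumed).
  if (state.ciOffline) return 'trust-snapshot';
  if (!state.hasBaseline) return 'write-baseline';
  // Sampling gate: a draw at or above the probability skips the LLM call.
  return random() < resolveProbability(opts) ? 'run-drift-check' : 'skip';
}
```

With a fixed draw of 0.1, `{ sample: 0.2 }` runs the check while `{ sample: 0.05 }` skips it. A missing baseline produces `'write-baseline'` regardless of the sampling options.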

Adjust the judge threshold:

// Require stricter semantic equivalence (default is 0.85)
await expectPromptFn(classifyFn).toMatchSnapshot(cases, { threshold: 0.95 });

threshold controls how similar a string field must be to the baseline for the LLM judge to call it equivalent. Lower values tolerate more variation; higher values are stricter.

How It Works

promptFn as a typed code artifact. Define prompts once as createPromptFn(). Zod validates input and output, .describe() annotations on schema fields auto-inject into the system message, and the function is a plain async function callable anywhere in your codebase.

Snapshots committed to git. .snap.json files live next to your test file in __snapshots__/, use deterministic key order, and are human-readable JSON. They show up in PR diffs like any other code change, so your team reviews LLM behaviour changes the same way they review code changes.
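Deterministic key order is what keeps snapshot diffs small: two runs that produce the same data in a different property order serialize to identical text. A minimal sketch of the idea follows. `sortKeys` and `stableStringify` are hypothetical helpers, not Tidemark's actual serializer, and the two-space indent is an assumption.

```typescript
// Hypothetical helpers illustrating deterministic key order; not Tidemark's serializer.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively rebuild objects with their keys in sorted order.
function sortKeys(value: Json): Json {
  if (Array.isArray(value)) return value.map(sortKeys); // array order is preserved
  if (value !== null && typeof value === 'object') {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeys(value[key]);
    }
    return sorted;
  }
  return value; // primitives pass through unchanged
}

// Pretty-printed, key-sorted JSON with a trailing newline.
function stableStringify(value: Json): string {
  return JSON.stringify(sortKeys(value), null, 2) + '\n';
}
```

Calling `stableStringify` on `{ b: 1, a: 2 }` and on `{ a: 2, b: 1 }` yields the same string, so a reordered but otherwise unchanged output does not show up in a PR diff.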

Drift detection with field attribution. Tidemark hashes the prompt text, the Zod schema _def tree, and the resolved model version from the API response. On a snapshot mismatch, the failure message tells you which of the three changed rather than giving you a blob diff of raw output. For string fields, an LLM-as-judge equivalence check distinguishes semantically equivalent output from an actual regression before reporting a failure.
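The attribution step can be sketched as comparing three stored fingerprints and reporting which ones differ. Everything here is illustrative: `fnv1a` is a small stand-in for whatever hash Tidemark actually uses, the schema is assumed to arrive as an already-serialized string, and `Fingerprint`, `fingerprint`, and `attributeDrift` are hypothetical names.

```typescript
// Hypothetical sketch of drift attribution; not Tidemark's implementation.

// 32-bit FNV-1a over UTF-16 code units, used only as a stand-in hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ('00000000' + (hash >>> 0).toString(16)).slice(-8);
}

interface Fingerprint {
  promptHash: string;
  schemaHash: string;
  modelVersion: string;
}

// `schemaJson` is assumed to be a deterministic serialization of the schema.
function fingerprint(promptText: string, schemaJson: string, modelVersion: string): Fingerprint {
  return { promptHash: fnv1a(promptText), schemaHash: fnv1a(schemaJson), modelVersion };
}

type DriftSource = 'prompt' | 'schema' | 'model';

// Report which of the three inputs changed between baseline and current run.
function attributeDrift(baseline: Fingerprint, current: Fingerprint): DriftSource[] {
  const changed: DriftSource[] = [];
  if (baseline.promptHash !== current.promptHash) changed.push('prompt');
  if (baseline.schemaHash !== current.schemaHash) changed.push('schema');
  if (baseline.modelVersion !== current.modelVersion) changed.push('model');
  return changed;
}
```

An empty result means none of the three inputs changed, which is the situation where a difference in output can only come from the model's own variability.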

API Reference (v0.1)

createPromptFn(config) defines a typed prompt function.

| Config field | Type | Description |
|---|---|---|
| name | string | Stable identifier used as the snapshot filename |
| prompt | (input) => string | Function that builds the prompt string from validated input |
| inputSchema | z.ZodType | Zod schema for validating the input object |
| outputSchema | z.ZodType | Zod schema for validating and parsing LLM output |
| adapter | ProviderAdapter | Provider adapter (AnthropicAdapter, OpenAIAdapter) |
| temperature? | number | Sampling temperature (optional) |
| maxRetries? | number | Retry attempts on Zod validation failure (default: 2) |
| tools? | Record<string, z.ZodType> | Tool definitions for function calling (optional) |

fn(input, options?) calls the function and returns Promise<TidemarkResult<Output>> with shape { output, meta, messages }.

Options: { handlers?: Record<string, Handler>, messages?: ConversationMessage[] }

fn.stream(input, options?) returns a TidemarkStream. Iterate it with for await to receive text chunks, and call .finalOutput() to get the final output.
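Consuming a stream of this shape might look like the sketch below. `TextStreamLike` and `fakeStream` are local stand-ins so the example runs without an API key. The real TidemarkStream may differ, and treating `.finalOutput()` as returning a promise is an assumption.

```typescript
// Stand-in for the stream shape described above; the real TidemarkStream may differ.
interface TextStreamLike<Output> extends AsyncIterable<string> {
  finalOutput(): Promise<Output>; // assumed to be async
}

// Hypothetical fake that yields fixed chunks, useful for exercising consumers.
function fakeStream<Output>(chunks: string[], output: Output): TextStreamLike<Output> {
  return {
    async *[Symbol.asyncIterator]() {
      for (const chunk of chunks) yield chunk;
    },
    finalOutput: async () => output,
  };
}

// Drain the text chunks, then fetch the final output.
async function collect<Output>(
  stream: TextStreamLike<Output>,
): Promise<{ text: string; output: Output }> {
  let text = '';
  for await (const chunk of stream) {
    text += chunk; // a UI would render each chunk as it arrives
  }
  return { text, output: await stream.finalOutput() };
}
```

In application code the argument to `collect` would be the value returned by `fn.stream(input)` rather than the fake.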

expectPromptFn(fn).toMatchSnapshot(cases, opts?) is the snapshot matcher. Exported from '@mutineerjs/tidemark/vitest' (Vitest) or '@mutineerjs/tidemark/jest' (Jest).

| Option | Type | Default | Description |
|---|---|---|---|
| threshold | number | 0.85 | LLM judge equivalence threshold for string fields (0–1) |
| sample | number | — | Probability of running the drift check on subsequent runs (0–1) |
| every | number | — | Run drift check 1-in-N times; equivalent to sample: 1/N |

sample and every are mutually exclusive. If both are omitted, the drift check always runs.

mockPromptFn(returnValue) is a test double exported from '@mutineerjs/tidemark/testing'. It returns a PromptFn with zero-value cost and latency metadata.

new AnthropicAdapter(model, { apiKey? }) and new OpenAIAdapter(model, { apiKey? }) are the built-in provider adapters.

Sub-packages:

| Import path | Test runner | Registration |
|---|---|---|
| @mutineerjs/tidemark/vitest | Vitest | Add to setupFiles in vitest.config.ts |
| @mutineerjs/tidemark/jest | Jest | Add to setupFilesAfterEnv in jest.config.ts |
| @mutineerjs/tidemark/testing | Any | Import mockPromptFn in test files |

TidemarkCallMeta fields: inputTokens, outputTokens, estimatedCostUsd, responseTimeMs, rawRequest, rawResponse.

Mocking in Unit Tests

Use mockPromptFn from '@mutineerjs/tidemark/testing' to stub a promptFn in unit tests without making real API calls.

import { mockPromptFn } from '@mutineerjs/tidemark/testing';

const classify = mockPromptFn({ category: 'billing', confidence: 0.95 });
const result = await classify({ text: 'charged twice' });

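// The result has the same { output, meta, messages } shape as a real call:
// result.output.category === 'billing'
// result.output.confidence === 0.95
// result.meta.inputTokens === 0, result.meta.estimatedCostUsd === 0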

The mock returns your returnValue with zero-value cost and latency metadata. The .stream() method is also stubbed, so code that calls fn.stream() will not throw.
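A stub like this is most useful when the code under test receives its prompt function as a parameter. The sketch below shows that pattern with local stand-ins so it runs on its own: `stubPromptFn`, `PromptFnLike`, and `routeTicket` are hypothetical, and the result shape follows the `{ output, meta, messages }` description in the API reference. In a real suite, mockPromptFn would take the place of `stubPromptFn`.

```typescript
// Local stand-ins so the sketch is self-contained; not Tidemark's types.
interface CallMeta {
  inputTokens: number;
  outputTokens: number;
  estimatedCostUsd: number;
  responseTimeMs: number;
}

interface ResultLike<Output> {
  output: Output;
  meta: CallMeta;
  messages: unknown[];
}

type PromptFnLike<Input, Output> = (input: Input) => Promise<ResultLike<Output>>;

// Hypothetical stub: resolves to a fixed output with zeroed metadata.
function stubPromptFn<Input, Output>(returnValue: Output): PromptFnLike<Input, Output> {
  return async () => ({
    output: returnValue,
    meta: { inputTokens: 0, outputTokens: 0, estimatedCostUsd: 0, responseTimeMs: 0 },
    messages: [],
  });
}

interface Classification {
  category: 'billing' | 'tech' | 'general';
  confidence: number;
}

// Code under test: the prompt function is injected, so tests can swap in a stub.
async function routeTicket(
  classify: PromptFnLike<{ text: string }, Classification>,
  text: string,
): Promise<string> {
  const { output } = await classify({ text });
  // Hypothetical business rule: low-confidence answers go to the general queue.
  return output.confidence >= 0.8 ? output.category : 'general';
}
```

Because `routeTicket` never imports the real prompt function, its routing rule can be tested for both confident and uncertain classifications without any network access.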

Multi-Turn Conversations

Pass result.messages from one call to the next to maintain conversation context.

// First turn
const r1 = await fn({ text: 'hello' });

// Second turn — r1.messages is the full first-turn conversation
const r2 = await fn({ text: 'follow up' }, { messages: r1.messages });

// Third turn: r2.messages carries the first two turns, so r3.messages includes all three
const r3 = await fn({ text: 'one more thing' }, { messages: r2.messages });

messages is an in-memory array and is not persisted across sessions. Pass result.messages to the next call to maintain context within a session.
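When the number of turns is not known in advance, the same threading can be written as a loop. The sketch below uses a local echo function in place of a real promptFn so it runs offline. `Message`, `TurnFn`, `echoFn`, and `runConversation` are hypothetical, and the role/content message shape is an assumption about ConversationMessage.

```typescript
// Assumed message shape; the real ConversationMessage type may differ.
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

type TurnFn = (
  input: { text: string },
  options?: { messages?: Message[] },
) => Promise<{ output: string; messages: Message[] }>;

// Hypothetical stand-in for a promptFn: echoes the input and extends the history.
const echoFn: TurnFn = async (input, options) => {
  const history = options?.messages ?? [];
  const reply = `echo: ${input.text}`;
  return {
    output: reply,
    messages: [
      ...history,
      { role: 'user', content: input.text },
      { role: 'assistant', content: reply },
    ],
  };
};

// Feed each result's messages into the next call to keep context.
async function runConversation(
  fn: TurnFn,
  texts: string[],
): Promise<{ outputs: string[]; messages: Message[] }> {
  let messages: Message[] = [];
  const outputs: string[] = [];
  for (const text of texts) {
    const result = await fn({ text }, { messages });
    outputs.push(result.output);
    messages = result.messages; // carry the full history forward
  }
  return { outputs, messages };
}
```

Since the history lives only in the `messages` variable, anything that should survive a process restart has to be stored by the caller.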

Benchmark

See BENCHMARK.md for captured output demonstrating prompt hash drift, schema hash drift, and model version drift detection, all running without a real API key via MockAdapter.

License

MIT