claudemd-check

v0.1.0

Published

9 days ago

Regression tests for your CLAUDE.md. Probe each rule with real headless agent runs, get per-rule adherence rates, and catch the edit (or model update) that silently broke rule 7.

0High
0Medium
0Low

benmalaga

claude-code claude agents testing eval regression ai instructions adherence cli

claudemd-check

Regression tests for your CLAUDE.md.

Your instructions file says "always use pnpm", "never commit to main", "every new file ends with a license header" — but do you actually know which rules your agent obeys, how often, and whether last week's edit (or a model update) silently broke rule 7?

Every Claude Code user has a CLAUDE.md. Almost nobody tests it. Rules accumulate, contradict, decay — and the only feedback loop is vibes. There are eval harnesses for skills (skillgrade, promptfoo, Anthropic's skill-creator), but the instructions file itself — the thing every session loads — has no test runner.

claudemd-check is that runner. You write scenario probes (a prompt + the deterministic footprint an adherent run leaves behind), it executes each one N times against real headless Claude Code sessions in throwaway workspaces, and reports per-rule adherence rates — diffed against a saved baseline, wired for CI.

claudemd-check v0.1.0 · CLAUDE.md · 2 scenarios × 2 runs · model haiku

verified-trailer — New text files end with VERIFIED
  ✓ run 1 adhered
  ✓ run 2 adhered
  adherence 100% (2/2, required 100%)

no-delicious — Never use the word delicious
  ✗ run 1: pie.txt contains forbidden /delicious/
  ✓ run 2 adhered
  adherence 50% (1/2, required 100%)

vs baseline:
  ▼ no-delicious: 100% → 50%

1/2 scenarios met their adherence bar.

That ▼ 100% → 50% line is the product: the CLAUDE.md edit you made this morning just lost you a rule, and you found out from a red CI check instead of three weeks of subtle damage.

Install

npm install -g claudemd-check     # or npx claudemd-check

Requires Claude Code (claude on your PATH) and Node ≥ 18. Zero npm dependencies.

Cost honesty: every run is a real agent session billed to your Anthropic account/subscription. Write scenarios with --runs 1 --model haiku, then raise runs for statistically meaningful rates. Small scenarios on haiku cost on the order of a cent each.

Writing scenarios

claudemd.test.json, next to the CLAUDE.md it tests (full example):

{
  "runsPerScenario": 3,
  "model": "haiku",
  "allowedTools": ["Write", "Edit", "Read", "Bash"],
  "scenarios": [
    {
      "id": "always-pnpm",
      "rule": "Always use pnpm, never npm",
      "prompt": "add lodash to this project",
      "files": { "package.json": "{\"name\":\"demo\"}" },
      "assert": [
        { "type": "transcript_match", "pattern": "pnpm (add|install)" },
        { "type": "transcript_not_match", "pattern": "\\bnpm (install|i)\\b" },
        { "type": "file_absent", "path": "package-lock.json" }
      ]
    }
  ]
}

Each run gets a fresh temp workspace (inline files and/or a fixture directory, plus the CLAUDE.md under test), so runs never contaminate each other — or your repo.

Assertions (all deterministic — no LLM judges, no flaky grading):

| Type | Checks | | --- | --- | | transcript_match / transcript_not_match | Regex over the agent's own actions — its text and tool calls (commands run, files written). Deliberately excludes the rule text itself and tool results, so "never say X" isn't false-flagged by the agent reading the rule. | | file_exists / file_absent | Workspace state after the run. | | file_contains / file_not_contains | Regex over a produced file. | | command | Any shell command, run in the workspace; exit 0 = pass. The escape hatch that can verify anything. |

CI

- run: npx claudemd-check            # exits 1 if any rule is below its bar
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Save a baseline once (claudemd-check --save-baseline, commit it), and every subsequent run prints per-rule deltas — so a CLAUDE.md edit or a new model release that drops "never touch main" from 100% to 60% fails the build, with the rule named.

Flags

--spec <file> · --claudemd <file> · --runs N · --model <name> · --save-baseline · --baseline <file> · --json

Scope

claudemd-check tests instruction adherence in Claude Code. It does not test skill triggering (use skillgrade or promptfoo for SKILL.md files) and does not grade output quality — it checks the deterministic footprint of rule-following, which is what you can actually regress on. Hooks and slash-command coverage are the roadmap.

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

claudemd-check

Regression tests for your CLAUDE.md.

Install

Writing scenarios

CI

Flags

Scope

License