@safetnsr/model-drift

v0.1.1

Published

3 months ago

safetnsr

ai llm benchmark drift nerf model-quality cli

model-drift

receipts for model nerfs — benchmark any LLM against saved prompt baselines, track quality drift over time.

install

npx @safetnsr/model-drift init --model claude-sonnet-4-6

usage

save a baseline

model-drift init --model gpt-5.4

runs 10 coding prompts against the model. saves scores as your baseline.

check for drift

model-drift check --model gpt-5.4

re-runs the same prompts. diffs against your baseline. shows what regressed.
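a minimal sketch of what diffing against a baseline could look like, assuming each run is stored as a map from prompt id to a numeric score. the `Scores` and `PromptDelta` types, the function names, and the `fetch-retry` id are hypothetical; this is not the tool's actual code.

```typescript
// hypothetical shape: prompt id -> score (assumed 1 = pass, 0 = fail)
type Scores = Record<string, number>;

interface PromptDelta {
  id: string;
  baseline: number;
  current: number;
  delta: number;
}

// compare a fresh run against a saved baseline, prompt by prompt
function diffScores(baseline: Scores, current: Scores): PromptDelta[] {
  return Object.keys(baseline).map((id) => {
    const before = baseline[id];
    // assumption: a prompt missing from the current run counts as 0
    const after = current[id] ?? 0;
    return { id, baseline: before, current: after, delta: after - before };
  });
}

// keep only the prompts whose score dropped
function regressions(deltas: PromptDelta[]): PromptDelta[] {
  return deltas.filter((d) => d.delta < 0);
}

const deltas = diffScores(
  { "binary-search-edge": 1, "fetch-retry": 1 },
  { "binary-search-edge": 0, "fetch-retry": 1 },
);
console.log(regressions(deltas).map((d) => d.id)); // [ 'binary-search-edge' ]
```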

view report

model-drift report --model gpt-5.4

full markdown report with all historical runs.

list baselines

model-drift list

flags

| flag | description |
|------|-------------|
| --model | target model name (required for init/check) |
| --since | compare against baseline from specific date |
| --json | machine-readable JSON output |
| --help | show help |
| --version | show version |

supported models

any model accessible via:

OpenAI API (set OPENAI_API_KEY): gpt-*, o1-*, o3-*
Anthropic API (set ANTHROPIC_API_KEY): claude-*
Google AI (set GOOGLE_API_KEY): gemini-*
any OpenAI-compatible endpoint (set OPENAI_BASE_URL)
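the prefix-to-provider mapping above can be sketched as a small lookup, useful for checking which key a model name needs before a run. `Provider`, `PROVIDERS`, and `providerFor` are hypothetical names; how the tool itself resolves a model is not documented here.

```typescript
// hypothetical mapping inferred from the list above; not the tool's actual code
interface Provider {
  name: string;
  envVar: string;
}

const PROVIDERS: { prefixes: string[]; provider: Provider }[] = [
  { prefixes: ["gpt-", "o1-", "o3-"], provider: { name: "openai", envVar: "OPENAI_API_KEY" } },
  { prefixes: ["claude-"], provider: { name: "anthropic", envVar: "ANTHROPIC_API_KEY" } },
  { prefixes: ["gemini-"], provider: { name: "google", envVar: "GOOGLE_API_KEY" } },
];

// return the provider whose prefix matches the model name, if any
function providerFor(model: string): Provider | undefined {
  const match = PROVIDERS.find((p) =>
    p.prefixes.some((prefix) => model.startsWith(prefix)),
  );
  return match?.provider;
}

console.log(providerFor("claude-sonnet-4-6")?.envVar); // ANTHROPIC_API_KEY
console.log(providerFor("gpt-5.4")?.envVar); // OPENAI_API_KEY
```

a name that matches no prefix returns `undefined` in this sketch. such models presumably go through an OpenAI-compatible endpoint via OPENAI_BASE_URL, but that routing is an assumption.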

agent interface (--json)

{
  "model": "claude-sonnet-4-6",
  "baseline": "2026-03-01",
  "current": "2026-03-08",
  "total": { "baseline": 8, "current": 6, "delta": -2 },
  "prompts": [
    { "id": "binary-search-edge", "baseline": 1, "current": 0, "delta": -1 }
  ]
}
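a sketch of how an agent or CI step might consume this output, assuming the shape shown in the sample. the `DriftReport` interface and `summarize` function are hypothetical, and the real output may carry more fields.

```typescript
// shape inferred from the sample output above (an assumption, not a published schema)
interface DriftReport {
  model: string;
  baseline: string;
  current: string;
  total: { baseline: number; current: number; delta: number };
  prompts: { id: string; baseline: number; current: number; delta: number }[];
}

// list regressed prompt ids and report whether the total score held or improved
function summarize(raw: string): { regressed: string[]; ok: boolean } {
  const report = JSON.parse(raw) as DriftReport;
  const regressed = report.prompts.filter((p) => p.delta < 0).map((p) => p.id);
  return { regressed, ok: report.total.delta >= 0 };
}

const sample = `{
  "model": "claude-sonnet-4-6",
  "baseline": "2026-03-01",
  "current": "2026-03-08",
  "total": { "baseline": 8, "current": 6, "delta": -2 },
  "prompts": [
    { "id": "binary-search-edge", "baseline": 1, "current": 0, "delta": -1 }
  ]
}`;

const summary = summarize(sample);
console.log(summary.regressed, summary.ok);
```

a CI job could fail the build when `ok` is false. whether the CLI also signals drift through its exit code is not stated here.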

prompt suite

model-drift ships with 10 hardcoded prompts covering:

algorithm implementation (binary search, fibonacci, CSV parser)
security (SQL injection fix)
validation (IPv4 regex)
async patterns (fetch with retry)
knowledge (event loop explanation)
refactoring (reduce complexity)
framework-specific (React useEffect deps)
type system (TypeScript generics)

scoring is objective: code execution tests, pattern matching, keyword checks. no LLM-as-judge.
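to illustrate the pattern-matching and keyword style of scoring, here is a minimal sketch. the `Check` type, the sample checks, and the all-or-nothing 0/1 score are assumptions for illustration; the suite's real checks are not shown in this README, and code execution tests are left out.

```typescript
// hypothetical check definitions; not the suite's actual checks
type Check =
  | { kind: "keyword"; words: string[] } // every word must appear, case-insensitive
  | { kind: "pattern"; regex: RegExp }; // the regex must match the response

function passes(response: string, check: Check): boolean {
  if (check.kind === "keyword") {
    const text = response.toLowerCase();
    return check.words.every((w) => text.includes(w.toLowerCase()));
  }
  return check.regex.test(response);
}

// assumption: 1 if every check passes, else 0 (mirrors the 0/1 values in the --json sample)
function score(response: string, checks: Check[]): number {
  return checks.every((c) => passes(response, c)) ? 1 : 0;
}

const eventLoopChecks: Check[] = [
  { kind: "keyword", words: ["call stack", "microtask"] },
  { kind: "pattern", regex: /event loop/i },
];

console.log(
  score("The event loop drains the microtask queue once the call stack is empty.", eventLoopChecks),
); // 1
console.log(score("JavaScript is single-threaded.", eventLoopChecks)); // 0
```

because the checks are deterministic, the same response always gets the same score, which is what makes a diff between two runs meaningful.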

pair with

@safetnsr/vet — audit your AI coding workflow
@safetnsr/pinch — track AI session costs