@safetnsr/model-drift
v0.1.1
receipts for model nerfs — benchmark LLMs against saved prompt baselines, track quality drift
model-drift
receipts for model nerfs — benchmark any LLM against saved prompt baselines, track quality drift over time.
install
npx @safetnsr/model-drift init --model claude-sonnet-4-6
usage
save a baseline
model-drift init --model gpt-5.4
runs 10 coding prompts against the model. saves scores as your baseline.
check for drift
model-drift check --model gpt-5.4
re-runs the same prompts. diffs against your baseline. shows what regressed.
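the diff step can be pictured with a minimal TypeScript sketch. this is not the package's source: the `Scores` map, `diffScores`, `regressions`, and the `sql-injection-fix` id are hypothetical, and treating a missing prompt as a score of 0 is an assumption. only the `id` / `baseline` / `current` / `delta` field names come from the `--json` sample in this README.

```typescript
// Hypothetical types: field names mirror the --json sample in this README,
// but this is an illustration, not the package's actual implementation.
type Scores = Record<string, number>;

interface PromptDelta {
  id: string;
  baseline: number;
  current: number;
  delta: number;
}

// Compare a fresh run against the saved baseline, prompt by prompt.
function diffScores(baseline: Scores, current: Scores): PromptDelta[] {
  return Object.entries(baseline).map(([id, before]) => {
    // Assumption: a prompt missing from the new run counts as a score of 0.
    const after = current[id] ?? 0;
    return { id, baseline: before, current: after, delta: after - before };
  });
}

// A regression is any prompt whose score dropped.
function regressions(deltas: PromptDelta[]): PromptDelta[] {
  return deltas.filter((d) => d.delta < 0);
}

// "sql-injection-fix" is a made-up id used only for this example.
const baselineRun: Scores = { "binary-search-edge": 1, "sql-injection-fix": 1 };
const currentRun: Scores = { "binary-search-edge": 0, "sql-injection-fix": 1 };

const regressed = regressions(diffScores(baselineRun, currentRun));
console.log(regressed.map((d) => d.id));
```

keeping the per-prompt delta, rather than only the total, is what lets a report point at the specific prompt that got worse.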
view report
model-drift report --model gpt-5.4
full markdown report with all historical runs.
list baselines
model-drift list
flags
| flag | description |
|------|-------------|
| --model | target model name (required for init/check) |
| --since | compare against baseline from specific date |
| --json | machine-readable JSON output |
| --help | show help |
| --version | show version |
supported models
any model accessible via:
- OpenAI API (set OPENAI_API_KEY) — gpt-*, o1-*, o3-*
- Anthropic API (set ANTHROPIC_API_KEY) — claude-*
- Google AI (set GOOGLE_API_KEY) — gemini-*
- any OpenAI-compatible endpoint (set OPENAI_BASE_URL)
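one way to read the list above is as a prefix-to-provider lookup. the sketch below is an assumption about how such routing could work, not the CLI's real logic: `resolveProvider`, `PREFIX_ROUTES`, the provider names, and the fallback to an OpenAI-compatible endpoint for unknown model names are all hypothetical. only the prefixes and environment variable names are taken from the list.

```typescript
// Hypothetical routing table built from the list above;
// the real CLI may resolve providers differently.
interface Provider {
  name: string;
  requiredEnv: string; // environment variable that must be set
}

const PREFIX_ROUTES: Array<{ prefixes: string[]; provider: Provider }> = [
  { prefixes: ["gpt-", "o1-", "o3-"], provider: { name: "openai", requiredEnv: "OPENAI_API_KEY" } },
  { prefixes: ["claude-"], provider: { name: "anthropic", requiredEnv: "ANTHROPIC_API_KEY" } },
  { prefixes: ["gemini-"], provider: { name: "google", requiredEnv: "GOOGLE_API_KEY" } },
];

// `env` is passed in (instead of reading process.env) so the function is easy to test.
function resolveProvider(model: string, env: Record<string, string | undefined>): Provider {
  for (const route of PREFIX_ROUTES) {
    if (route.prefixes.some((prefix) => model.startsWith(prefix))) {
      return route.provider;
    }
  }
  // Assumption: unknown names go to an OpenAI-compatible endpoint
  // whenever OPENAI_BASE_URL is set.
  if (env["OPENAI_BASE_URL"]) {
    return { name: "openai-compatible", requiredEnv: "OPENAI_BASE_URL" };
  }
  throw new Error(`no provider known for model "${model}"`);
}

const claude = resolveProvider("claude-sonnet-4-6", {});
const local = resolveProvider("my-local-model", { OPENAI_BASE_URL: "http://localhost:8080/v1" });
console.log(claude.requiredEnv, local.name);
```

the model name `my-local-model` and the localhost URL are placeholders for illustration.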
agent interface (--json)
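for agents or CI scripts that consume this output, here is a minimal TypeScript sketch. it assumes the payload has exactly the shape of the sample in this section; `DriftReport` and `summarize` are hypothetical names, and how the JSON reaches the script (for example, captured from the CLI's stdout when `--json` is passed) is left out.

```typescript
// Shape copied from the sample payload in this section (assumed to be complete).
interface DriftReport {
  model: string;
  baseline: string;
  current: string;
  total: { baseline: number; current: number; delta: number };
  prompts: Array<{ id: string; baseline: number; current: number; delta: number }>;
}

// Summarize a report: `ok` is false when the total score dropped,
// and `regressed` lists the ids of prompts whose score went down.
function summarize(raw: string): { ok: boolean; regressed: string[] } {
  const report = JSON.parse(raw) as DriftReport;
  const regressed = report.prompts.filter((p) => p.delta < 0).map((p) => p.id);
  return { ok: report.total.delta >= 0, regressed };
}

// The sample payload from this README, inlined for the example.
const sample = `{
  "model": "claude-sonnet-4-6",
  "baseline": "2026-03-01",
  "current": "2026-03-08",
  "total": { "baseline": 8, "current": 6, "delta": -2 },
  "prompts": [
    { "id": "binary-search-edge", "baseline": 1, "current": 0, "delta": -1 }
  ]
}`;

const summary = summarize(sample);
console.log(summary.ok, summary.regressed);
```

a CI job could fail the build when `ok` is false and print the `regressed` ids. the sample payload follows.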
{
"model": "claude-sonnet-4-6",
"baseline": "2026-03-01",
"current": "2026-03-08",
"total": { "baseline": 8, "current": 6, "delta": -2 },
"prompts": [
{ "id": "binary-search-edge", "baseline": 1, "current": 0, "delta": -1 }
]
}
prompt suite
model-drift ships with 10 hardcoded prompts covering:
- algorithm implementation (binary search, fibonacci, CSV parser)
- security (SQL injection fix)
- validation (IPv4 regex)
- async patterns (fetch with retry)
- knowledge (event loop explanation)
- refactoring (reduce complexity)
- framework-specific (React useEffect deps)
- type system (TypeScript generics)
scoring is objective: code execution tests, pattern matching, keyword checks. no LLM-as-judge.
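to make "pattern matching, keyword checks" concrete, here is a small TypeScript sketch of deterministic scoring. everything in it is hypothetical: the `Check` type, `hasKeywords`, `matchesPattern`, `score`, and the example checks are not taken from the package, and the all-or-nothing 0/1 score is an assumption based on the values in the `--json` sample. code execution tests are not covered here.

```typescript
// Hypothetical checks; the package's real test cases are not shown in this README.
type Check = (response: string) => boolean;

// Keyword check: every keyword must appear, case-insensitively.
const hasKeywords = (keywords: string[]): Check => (response) =>
  keywords.every((k) => response.toLowerCase().includes(k.toLowerCase()));

// Pattern check: the response must match a regular expression.
const matchesPattern = (pattern: RegExp): Check => (response) => pattern.test(response);

// Assumed binary scoring: 1 only if every check passes, otherwise 0.
function score(response: string, checks: Check[]): number {
  return checks.every((check) => check(response)) ? 1 : 0;
}

// Example checks for an SQL injection fix: the answer should mention
// parameterized queries and contain a placeholder such as ? or $1.
const sqlFixChecks: Check[] = [
  hasKeywords(["parameterized"]),
  matchesPattern(/\?|\$\d+/),
];

const good = score("Use a parameterized query: SELECT * FROM users WHERE id = $1", sqlFixChecks);
const bad = score("Just escape the quotes by hand.", sqlFixChecks);
console.log(good, bad);
```

because every check is a pure function of the response text, the same response always gets the same score, which is what makes run-to-run comparisons meaningful.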
pair with
- @safetnsr/vet — audit your AI coding workflow
- @safetnsr/pinch — track AI session costs
