@wartzar-bee/promptdrift
v0.1.0
Catch prompt regressions from model drift — on a schedule, not just on PRs. Runs a small eval set against your LLM on a cron, compares pass-rate to a stored baseline, and alerts (GitHub issue + non-zero exit) when it regresses.
promptdrift
Catch prompt regressions from model drift — on a schedule, not just on PRs.
promptdrift runs a small eval set against your LLM on a cron, compares the pass-rate to a stored baseline, and alerts (opens/updates a GitHub issue + exits non-zero) the moment it regresses. It exists to catch the failure mode that PR-time eval tools structurally can't see: the same model ID silently changing server-side, or you bumping a model version and quietly breaking a prompt.
$ npx @wartzar-bee/promptdrift
promptdrift anthropic:claude-3-5-haiku-latest
──────────────────────────────────────────────
Cases 2/3 passed (pass-rate 66.7%)
Baseline 100.0% ↓ now 66.7%
PASS refuses to reveal system prompt
FAIL answers capital of France
expected output to contain "Paris"
PASS classifies sentiment as strict JSON
REGRESSION DETECTED
pass-rate dropped from 100.0% to 66.7%
Newly failing: answers capital of France
Why this exists (honest positioning vs promptfoo)
promptfoo is the popular incumbent for LLM evals, and it's good — at PR / code-change time. It runs your evals when you change your code. But the LLM behind a fixed model ID can change without any commit on your side, and promptfoo's own blog ("Your model upgrade just broke your agent's safety") concedes that gap.
promptdrift is not a promptfoo replacement — it's the complementary half:
|  | promptfoo | promptdrift |
| --- | --- | --- |
| Trigger | PR / code change (CI) | schedule (cron) + on-demand |
| Catches | regressions you introduce | server-side model drift + version bumps |
| Setup | rich eval framework | one config file + one workflow |
| Output | CI pass/fail on the diff | baseline compare → GitHub issue alert |
Use promptfoo for rich PR-time evals; add promptdrift to watch for drift between PRs. They stack.
No fabricated benchmarks here. promptdrift's value is purely the scheduled baseline-compare + alert mechanism — it does not claim to be a better evaluator than promptfoo.
Quickstart (4 lines)
npx @wartzar-bee/promptdrift --update-baseline # 1. record today's pass-rate as the baseline
# 2. commit .promptdrift-baseline.json
# 3. drop examples/promptdrift.yml into .github/workflows/
# 4. add ANTHROPIC_API_KEY (or OPENAI_API_KEY) as a repo secret
That's it: the scheduled workflow now re-runs your eval daily and opens a GitHub issue if the model drifts below your baseline.
Config (promptdrift.json)
A test case is { prompt, check }. check is dead-simple by default:
{
  "provider": "anthropic",
  "model": "claude-3-5-haiku-latest",
  "threshold": 0,
  "cases": [
    { "name": "answers capital of France",
      "prompt": "What is the capital of France? Answer with just the city name.",
      "check": { "type": "contains", "value": "Paris" } },
    { "name": "classifies sentiment as strict JSON",
      "system": "Respond with ONLY a JSON object, no prose.",
      "prompt": "Classify 'I love this'. Return {\"sentiment\": \"positive\"|\"negative\"|\"neutral\"}.",
      "check": { "type": "json-schema",
                 "value": { "type": "object", "required": ["sentiment"],
                            "properties": { "sentiment": { "type": "string", "enum": ["positive","negative","neutral"] } } } } }
  ]
}
Check types: contains, not-contains, regex, equals, json-schema. A bare string is shorthand for contains. A case can also carry an array of checks (all must pass). provider is anthropic or openai (default anthropic); the key is read from ANTHROPIC_API_KEY / OPENAI_API_KEY (env only, never logged or stored).
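The check semantics above can be sketched in a few lines. This is a minimal illustration, not promptdrift's actual code: the function names (normalizeCheck, runCheck, casePasses) are hypothetical, equals is assumed to be an exact string comparison, and json-schema is left out because it needs a schema validator.

```javascript
// Hypothetical sketch of the simple check types; names are assumptions,
// not promptdrift's real internals.

// A bare string is shorthand for a `contains` check.
function normalizeCheck(check) {
  return typeof check === "string" ? { type: "contains", value: check } : check;
}

// Evaluate one check against the model output and return true/false.
function runCheck(check, output) {
  const { type, value } = normalizeCheck(check);
  switch (type) {
    case "contains":
      return output.includes(value);
    case "not-contains":
      return !output.includes(value);
    case "regex":
      return new RegExp(value).test(output);
    case "equals":
      return output === value; // exact match is an assumption
    default:
      // json-schema is intentionally not covered by this sketch.
      throw new Error(`unsupported check type in this sketch: ${type}`);
  }
}

// A case may carry a single check or an array of checks; all must pass.
function casePasses(checks, output) {
  const list = Array.isArray(checks) ? checks : [checks];
  return list.every((check) => runCheck(check, output));
}
```

For example, casePasses("Paris", "The capital is Paris.") evaluates the bare string as a contains check.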
threshold (0–1, default 0) is the allowed drop in pass-rate before it counts as a regression. 0 means any drop alarms.
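As a rough sketch of the baseline comparison, assuming pass-rates are stored as fractions between 0 and 1: a run regresses when the drop from the baseline exceeds threshold. The helper names and the small floating-point tolerance are assumptions made for illustration, as is the choice that a drop exactly equal to threshold does not alarm.

```javascript
// Hypothetical helpers; the names and result shape are assumptions.

// Fraction of cases that passed, e.g. 2 of 3 gives roughly 0.667.
function passRate(results) {
  if (results.length === 0) return 0;
  return results.filter((r) => r.pass).length / results.length;
}

// True when the pass-rate dropped by more than `threshold`.
// The epsilon guards against floating-point noise such as 0.8 - 0.7.
function isRegression(baselineRate, currentRate, threshold = 0) {
  const EPSILON = 1e-9;
  return baselineRate - currentRate > threshold + EPSILON;
}
```

With threshold 0, isRegression(1, 2 / 3) is true, while an unchanged or improved pass-rate is not flagged.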
See examples/promptdrift.json for a runnable starter.
CLI
promptdrift                      run, compare to baseline, exit non-zero on regression
promptdrift --update-baseline    run and SAVE the result as the new baseline
promptdrift --config <path>      config file (default: ./promptdrift.json)
promptdrift --baseline <path>    baseline file (default: ./.promptdrift-baseline.json)
promptdrift --json               machine-readable output
promptdrift --no-color           plain output
Exit codes: 0 = no regression (or baseline saved) · 1 = regression detected · 2 = usage/config error.
When the model changes behaviour for a legitimate reason, accept the new state by re-running with --update-baseline and committing the updated .promptdrift-baseline.json.
GitHub Action (scheduled drift alarm)
Copy examples/promptdrift.yml to .github/workflows/promptdrift.yml:
name: promptdrift
on:
  schedule:
    - cron: "0 8 * * *" # daily at 08:00 UTC
  workflow_dispatch: {}
permissions:
  issues: write # so promptdrift can open/update the alert issue
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: wartzar-bee/promptdrift@v0
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        with:
          config: promptdrift.json
          baseline: .promptdrift-baseline.json
On a regression the Action opens a single GitHub issue (titled "promptdrift: prompt regression detected") and updates that same issue (with a fresh comment) on subsequent failing runs, so it never spams duplicates, and the workflow run fails. When the eval recovers, no new issue is filed; close the existing one (or re-baseline).
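The open-once, comment-afterwards behaviour described above could look roughly like the sketch below. The github client and its three methods (findOpenIssueByTitle, createIssue, addComment) are hypothetical stand-ins rather than promptdrift's or GitHub's real API; only the issue title comes from this README. An in-memory fake is included so the flow can be tried without network access.

```javascript
// Issue title as stated in this README.
const ISSUE_TITLE = "promptdrift: prompt regression detected";

// Hypothetical sketch: `github` is an injected client with assumed methods.
async function alertRegression(github, summary) {
  const existing = await github.findOpenIssueByTitle(ISSUE_TITLE);
  if (existing) {
    // An alert issue is already open: add a fresh comment, no duplicate issue.
    await github.addComment(existing.number, summary);
    return { action: "commented", number: existing.number };
  }
  const created = await github.createIssue(ISSUE_TITLE, summary);
  return { action: "created", number: created.number };
}

// In-memory fake client for trying the flow offline (purely illustrative).
function makeFakeGithub() {
  const issues = [];
  return {
    issues,
    async findOpenIssueByTitle(title) {
      return issues.find((i) => i.open && i.title === title) ?? null;
    },
    async createIssue(title, body) {
      const issue = { number: issues.length + 1, title, body, open: true, comments: [] };
      issues.push(issue);
      return issue;
    },
    async addComment(number, body) {
      issues.find((i) => i.number === number).comments.push(body);
    },
  };
}
```

Calling alertRegression twice against the same fake client creates one issue on the first call and adds one comment to it on the second.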
Design / how to verify
- Node 22, ESM, zero runtime dependencies (stdlib + built-in fetch only).
- A pure, network-free core (src/checks.mjs, src/config.mjs, src/runner.mjs, src/report.mjs) with the model call and the GitHub call behind injectable functions, so the whole thing unit-tests with a mock (no network in tests).
- API keys come only from env and are never printed, logged, or written to disk.
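To illustrate the injectable-function idea, here is a minimal sketch of a runner that receives the model call as a parameter, so a test can pass a mock instead of a real API client. The names runCases, callModel, and mockModel and the argument shape are assumptions, not the actual exports of src/runner.mjs, and only the contains check is handled.

```javascript
// Hypothetical runner: `callModel` is injected, so tests never touch the network.
async function runCases(config, callModel) {
  const results = [];
  for (const testCase of config.cases) {
    const output = await callModel({
      model: config.model,
      system: testCase.system,
      prompt: testCase.prompt,
    });
    // Only the `contains` check is handled in this sketch.
    const pass = output.includes(testCase.check.value);
    results.push({ name: testCase.name, pass });
  }
  return results;
}

// A mock model returning canned answers based on the prompt text.
async function mockModel({ prompt }) {
  return prompt.includes("France") ? "Paris" : "I don't know";
}
```

In a unit test, runCases(config, mockModel) exercises the whole run path offline; production code would pass a function that calls the provider's API instead.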
npm test # node --test: 42 tests, all offline
Status / roadmap
- v0.1: scheduled drift alarm: contains / not-contains / regex / equals / json-schema checks, baseline compare + threshold, GitHub-issue alerting, Anthropic + OpenAI. 42 unit tests (npm test).
- Next (evidence-driven, not yet built): optional LLM-as-judge check, per-case history/trend, Slack/webhook alert sink besides GitHub issues.
License
MIT
