@wartzar-bee/promptdrift

v0.1.0

Published

20 days ago

Catch prompt regressions from model drift — on a schedule, not just on PRs. Runs a small eval set against your LLM on a cron, compares pass-rate to a stored baseline, and alerts (GitHub issue + non-zero exit) when it regresses.


wartzar-bee

llm prompt eval evals regression model-drift monitoring anthropic openai cron github-action cli promptfoo

promptdrift

Catch prompt regressions from model drift — on a schedule, not just on PRs.

promptdrift runs a small eval set against your LLM on a cron, compares the pass-rate to a stored baseline, and alerts (opens or updates a GitHub issue and exits non-zero) as soon as a run regresses. It exists to catch the failure mode that PR-time eval tools structurally can't see: the same model ID silently changing server-side, or you bumping a model version and quietly breaking a prompt.

$ npx @wartzar-bee/promptdrift

  promptdrift  anthropic:claude-3-5-haiku-latest
  ──────────────────────────────────────────────
  Cases   2/3 passed   (pass-rate 66.7%)
  Baseline 100.0%  ↓  now 66.7%

  PASS  refuses to reveal system prompt
  FAIL  answers capital of France
        expected output to contain "Paris"
  PASS  classifies sentiment as strict JSON

  REGRESSION DETECTED
  pass-rate dropped from 100.0% to 66.7%
  Newly failing: answers capital of France

Why this exists (honest positioning vs promptfoo)

promptfoo is the popular incumbent for LLM evals, and it's good — at PR / code-change time. It runs your evals when you change your code. But the LLM behind a fixed model ID can change without any commit on your side, and promptfoo's own blog ("Your model upgrade just broke your agent's safety") concedes that gap.

promptdrift is not a promptfoo replacement — it's the complementary half:

| | promptfoo | promptdrift |
| --- | --- | --- |
| Trigger | PR / code change (CI) | schedule (cron) + on-demand |
| Catches | regressions you introduce | server-side model drift + version bumps |
| Setup | rich eval framework | one config file + one workflow |
| Output | CI pass/fail on the diff | baseline compare → GitHub issue alert |

Use promptfoo for rich PR-time evals; add promptdrift to watch for drift between PRs. They stack.

No fabricated benchmarks here. promptdrift's value is purely the scheduled baseline-compare + alert mechanism — it does not claim to be a better evaluator than promptfoo.

Quickstart (4 lines)

npx @wartzar-bee/promptdrift --update-baseline    # 1. record today's pass-rate as the baseline
                                     # 2. commit .promptdrift-baseline.json
                                     # 3. drop examples/promptdrift.yml into .github/workflows/
                                     # 4. add ANTHROPIC_API_KEY (or OPENAI_API_KEY) as a repo secret

That's it — the scheduled workflow now re-runs your eval daily and opens a GitHub issue if the model drifts below your baseline.

Config (`promptdrift.json`)

A test case is { name, prompt, check }, plus an optional system prompt. check is dead-simple by default:

{
  "provider": "anthropic",
  "model": "claude-3-5-haiku-latest",
  "threshold": 0,
  "cases": [
    { "name": "answers capital of France",
      "prompt": "What is the capital of France? Answer with just the city name.",
      "check": { "type": "contains", "value": "Paris" } },

    { "name": "classifies sentiment as strict JSON",
      "system": "Respond with ONLY a JSON object, no prose.",
      "prompt": "Classify 'I love this'. Return {\"sentiment\": \"positive\"|\"negative\"|\"neutral\"}.",
      "check": { "type": "json-schema",
        "value": { "type": "object", "required": ["sentiment"],
          "properties": { "sentiment": { "type": "string", "enum": ["positive","negative","neutral"] } } } } }
  ]
}

Check types: contains, not-contains, regex, equals, json-schema. A bare string is shorthand for contains. A case can also carry an array of checks (all must pass). provider is anthropic or openai (default anthropic); the key is read from ANTHROPIC_API_KEY / OPENAI_API_KEY — env only, never logged or stored.
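To make the check semantics above concrete, here is a minimal sketch of how a case's checks could be evaluated. It is not the package's actual src/checks.mjs: the function names (normalizeCheck, runCheck, casePasses) are hypothetical, exact string equality for equals is an assumption, and json-schema is left out to keep the sketch short.

```javascript
// Hypothetical sketch, not promptdrift's real implementation.

// A bare string is shorthand for a `contains` check.
function normalizeCheck(check) {
  return typeof check === "string" ? { type: "contains", value: check } : check;
}

// Evaluate a single check against the model output; returns true on pass.
function runCheck(check, output) {
  const { type, value } = normalizeCheck(check);
  switch (type) {
    case "contains":
      return output.includes(value);
    case "not-contains":
      return !output.includes(value);
    case "regex":
      return new RegExp(value).test(output);
    case "equals":
      // Assumption: exact match, with no trimming or case folding.
      return output === value;
    default:
      // json-schema is deliberately omitted from this sketch.
      throw new Error(`check type not covered by this sketch: ${type}`);
  }
}

// A case may carry one check or an array of checks; all must pass.
function casePasses(checks, output) {
  const list = Array.isArray(checks) ? checks : [checks];
  return list.every((check) => runCheck(check, output));
}

console.log(casePasses("Paris", "The capital is Paris."));
console.log(casePasses([{ type: "regex", value: "^\\d+$" }, "7"], "42"));
```

The first call uses the bare-string shorthand and passes; the second fails because the output matches the regex but does not contain "7".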

threshold (0–1, default 0) is the allowed drop in pass-rate before it counts as a regression. 0 means any drop alarms.
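The comparison can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the result shape ({ name, pass }), the function names, and the strict `drop > threshold` comparison are all hypothetical, chosen so that a threshold of 0 alarms on any drop but not on an unchanged pass-rate.

```javascript
// Hypothetical sketch of the baseline comparison.

// Fraction of cases that passed (0 for an empty run).
function passRate(results) {
  if (results.length === 0) return 0;
  return results.filter((r) => r.pass).length / results.length;
}

// Compare a current run to the stored baseline.
// Assumption: threshold is an absolute drop in pass-rate (0.1 = 10 points).
function compareToBaseline(baseline, current, threshold = 0) {
  const baselineRate = passRate(baseline);
  const currentRate = passRate(current);
  const drop = baselineRate - currentRate;

  // Cases that passed in the baseline but fail now.
  const passedBefore = new Set(baseline.filter((r) => r.pass).map((r) => r.name));
  const newlyFailing = current
    .filter((r) => !r.pass && passedBefore.has(r.name))
    .map((r) => r.name);

  return { baselineRate, currentRate, regression: drop > threshold, newlyFailing };
}

const baseline = [
  { name: "refuses to reveal system prompt", pass: true },
  { name: "answers capital of France", pass: true },
  { name: "classifies sentiment as strict JSON", pass: true },
];
const current = baseline.map((r) =>
  r.name === "answers capital of France" ? { ...r, pass: false } : r
);

console.log(compareToBaseline(baseline, current, 0));
console.log(compareToBaseline(baseline, current, 0.5).regression);
```

With threshold 0 the drop from 3/3 to 2/3 counts as a regression; with a threshold of 0.5 the same drop is tolerated.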

See examples/promptdrift.json for a runnable starter.

CLI

promptdrift                      run, compare to baseline, exit non-zero on regression
promptdrift --update-baseline    run and SAVE the result as the new baseline
promptdrift --config <path>      config file (default: ./promptdrift.json)
promptdrift --baseline <path>    baseline file (default: ./.promptdrift-baseline.json)
promptdrift --json               machine-readable output
promptdrift --no-color           plain output

Exit codes: 0 = no regression (or baseline saved) · 1 = regression detected · 2 = usage/config error.

When the model changes behaviour for a legitimate reason, accept the new state by re-running with --update-baseline and committing the updated .promptdrift-baseline.json.

GitHub Action (scheduled drift alarm)

Copy examples/promptdrift.yml to .github/workflows/promptdrift.yml:

name: promptdrift
on:
  schedule:
    - cron: "0 8 * * *"   # daily at 08:00 UTC
  workflow_dispatch: {}
permissions:
  issues: write           # so promptdrift can open/update the alert issue
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: wartzar-bee/promptdrift@v0
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        with:
          config: promptdrift.json
          baseline: .promptdrift-baseline.json

On a regression the Action opens a single GitHub issue (titled "promptdrift: prompt regression detected") and the workflow run fails. On subsequent failing runs it adds a fresh comment to that same issue instead of opening a duplicate. When the eval recovers, no new issue is filed; close the existing one (or re-baseline).
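The open-once, comment-afterwards behaviour can be sketched like this. The `github` client and its three methods (listOpenIssues, createIssue, addComment) are hypothetical stand-ins, as is the in-memory fake; a real client would presumably wrap the GitHub REST issue endpoints, but nothing here is taken from the package's source.

```javascript
// Hypothetical sketch of the de-duplicated alert; `github` is an injected client.
const ISSUE_TITLE = "promptdrift: prompt regression detected";

async function alertRegression(github, reportBody) {
  const open = await github.listOpenIssues();
  const existing = open.find((issue) => issue.title === ISSUE_TITLE);
  if (existing) {
    // An alert issue is already open: add a comment instead of a duplicate.
    await github.addComment(existing.number, reportBody);
    return { action: "commented", number: existing.number };
  }
  const created = await github.createIssue(ISSUE_TITLE, reportBody);
  return { action: "opened", number: created.number };
}

// In-memory fake client, so the logic can be exercised without the network.
function makeFakeGithub() {
  const issues = [];
  return {
    issues,
    async listOpenIssues() {
      return issues.filter((i) => i.state === "open");
    },
    async createIssue(title, body) {
      const issue = { number: issues.length + 1, title, body, state: "open", comments: [] };
      issues.push(issue);
      return issue;
    },
    async addComment(number, body) {
      issues.find((i) => i.number === number).comments.push(body);
    },
  };
}
```

Calling alertRegression twice against the same fake opens one issue on the first call and adds one comment on the second. Once that issue is closed, the next regression opens a fresh one.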

Design / how to verify

Node 22, ESM, zero runtime dependencies (stdlib + built-in fetch only).
A pure, network-free core (src/checks.mjs, src/config.mjs, src/runner.mjs, src/report.mjs) with the model call and the GitHub call behind injectable functions — so the whole thing unit-tests with a mock (no network in tests).
API keys come only from env and are never printed, logged, or written to disk.

npm test     # node --test — 42 tests, all offline
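As an illustration of the injectable-function design described above, here is a minimal sketch of a runner that receives the model call as a parameter, so a test can pass a stub instead of hitting the network. The names runCases and makeStubModel are hypothetical and only the contains check is handled; the real src/runner.mjs may look quite different.

```javascript
// Hypothetical sketch of a runner with an injected model call.
async function runCases(cases, callModel) {
  const results = [];
  for (const c of cases) {
    // callModel is injected: a real provider call in production, a stub in tests.
    const output = await callModel({ system: c.system, prompt: c.prompt });
    // Only the `contains` check is handled here, to keep the sketch short.
    results.push({ name: c.name, pass: output.includes(c.check.value) });
  }
  return results;
}

// Stub model: canned replies keyed by prompt, no network involved.
function makeStubModel(replies) {
  return async ({ prompt }) => replies[prompt] ?? "";
}

const cases = [
  { name: "answers capital of France",
    prompt: "What is the capital of France?",
    check: { type: "contains", value: "Paris" } },
];

runCases(cases, makeStubModel({ "What is the capital of France?": "Paris" }))
  .then((results) => console.log(results));
```

Swapping the stub's reply for something without "Paris" flips the case to a failure, which is how a drifted model can be simulated offline.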

Status / roadmap

v0.1: scheduled drift alarm — contains / not-contains / regex / equals / json-schema checks, baseline compare + threshold, GitHub-issue alerting, Anthropic + OpenAI. 42 unit tests (npm test).
Next (evidence-driven, not yet built): optional LLM-as-judge check, per-case history/trend, Slack/webhook alert sink besides GitHub issues.

License

MIT
