@wartzar-bee/promptdrift

v0.1.0

Catch prompt regressions from model drift — on a schedule, not just on PRs. Runs a small eval set against your LLM on a cron, compares pass-rate to a stored baseline, and alerts (GitHub issue + non-zero exit) when it regresses.

promptdrift

Catch prompt regressions from model drift — on a schedule, not just on PRs.

promptdrift runs a small eval set against your LLM on a cron, compares the pass-rate to a stored baseline, and alerts (opens/updates a GitHub issue + exits non-zero) as soon as a scheduled run detects a regression. It exists to catch the failure mode that PR-time eval tools structurally can't see: the same model ID silently changing server-side, or you bumping a model version and quietly breaking a prompt.

$ npx @wartzar-bee/promptdrift

  promptdrift  anthropic:claude-3-5-haiku-latest
  ──────────────────────────────────────────────
  Cases   2/3 passed   (pass-rate 66.7%)
  Baseline 100.0%  ↓  now 66.7%

  PASS  refuses to reveal system prompt
  FAIL  answers capital of France
        expected output to contain "Paris"
  PASS  classifies sentiment as strict JSON

  REGRESSION DETECTED
  pass-rate dropped from 100.0% to 66.7%
  Newly failing: answers capital of France

Why this exists (honest positioning vs promptfoo)

promptfoo is the popular incumbent for LLM evals, and it's good — at PR / code-change time. It runs your evals when you change your code. But the LLM behind a fixed model ID can change without any commit on your side, and promptfoo's own blog ("Your model upgrade just broke your agent's safety") concedes that gap.

promptdrift is not a promptfoo replacement — it's the complementary half:

| | promptfoo | promptdrift |
| --- | --- | --- |
| Trigger | PR / code change (CI) | schedule (cron) + on-demand |
| Catches | regressions you introduce | server-side model drift + version bumps |
| Setup | rich eval framework | one config file + one workflow |
| Output | CI pass/fail on the diff | baseline compare → GitHub issue alert |

Use promptfoo for rich PR-time evals; add promptdrift to watch for drift between PRs. They stack.

No fabricated benchmarks here. promptdrift's value is purely the scheduled baseline-compare + alert mechanism — it does not claim to be a better evaluator than promptfoo.

Quickstart (4 lines)

npx @wartzar-bee/promptdrift --update-baseline    # 1. record today's pass-rate as the baseline
                                     # 2. commit .promptdrift-baseline.json
                                     # 3. drop examples/promptdrift.yml into .github/workflows/
                                     # 4. add ANTHROPIC_API_KEY (or OPENAI_API_KEY) as a repo secret

That's it — the scheduled workflow now re-runs your eval daily and opens a GitHub issue if the model drifts below your baseline.

Config (promptdrift.json)

A test case is { prompt, check }. check is dead-simple by default:

{
  "provider": "anthropic",
  "model": "claude-3-5-haiku-latest",
  "threshold": 0,
  "cases": [
    { "name": "answers capital of France",
      "prompt": "What is the capital of France? Answer with just the city name.",
      "check": { "type": "contains", "value": "Paris" } },

    { "name": "classifies sentiment as strict JSON",
      "system": "Respond with ONLY a JSON object, no prose.",
      "prompt": "Classify 'I love this'. Return {\"sentiment\": \"positive\"|\"negative\"|\"neutral\"}.",
      "check": { "type": "json-schema",
        "value": { "type": "object", "required": ["sentiment"],
          "properties": { "sentiment": { "type": "string", "enum": ["positive","negative","neutral"] } } } } }
  ]
}

Check types: contains, not-contains, regex, equals, json-schema. A bare string is shorthand for contains. A case can also carry an array of checks (all must pass). provider is anthropic or openai (default anthropic); the key is read from the ANTHROPIC_API_KEY / OPENAI_API_KEY env var only, never logged or stored.
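To make the check semantics concrete, here is a minimal sketch of how the documented check types could be evaluated. It is not the package's actual src/checks.mjs: the function names (normalizeCheck, matchesSchema, runCheck, runChecks) are hypothetical, the whitespace trimming in equals is an assumption, and the json-schema branch handles only the small subset used in the config above (type, required, properties, enum).

```javascript
// Hypothetical sketch of check evaluation; not the package's real implementation.

/** A bare string is the documented shorthand for a "contains" check. */
function normalizeCheck(check) {
  return typeof check === "string" ? { type: "contains", value: check } : check;
}

/** Tiny JSON Schema subset (assumption): type object/string, required, properties, enum. */
function matchesSchema(schema, data) {
  const isObject = typeof data === "object" && data !== null && !Array.isArray(data);
  if (schema.type === "object" && !isObject) return false;
  if (schema.type === "string" && typeof data !== "string") return false;
  if (schema.enum && !schema.enum.includes(data)) return false;
  if (isObject) {
    for (const key of schema.required ?? []) {
      if (!(key in data)) return false;
    }
    for (const [key, sub] of Object.entries(schema.properties ?? {})) {
      if (key in data && !matchesSchema(sub, data[key])) return false;
    }
  }
  return true;
}

/** Evaluate one check against the model output; returns true on pass. */
function runCheck(check, output) {
  const { type, value } = normalizeCheck(check);
  switch (type) {
    case "contains":
      return output.includes(value);
    case "not-contains":
      return !output.includes(value);
    case "regex":
      return new RegExp(value).test(output);
    case "equals":
      // Assumption: surrounding whitespace is ignored when comparing.
      return output.trim() === String(value).trim();
    case "json-schema":
      try {
        return matchesSchema(value, JSON.parse(output));
      } catch {
        return false; // output was not valid JSON
      }
    default:
      throw new Error(`unsupported check type: ${type}`);
  }
}

/** A case may carry one check or an array of checks; all must pass. */
function runChecks(checkOrChecks, output) {
  const checks = Array.isArray(checkOrChecks) ? checkOrChecks : [checkOrChecks];
  return checks.every((check) => runCheck(check, output));
}

console.log(runChecks("Paris", "The capital is Paris.")); // true
console.log(runChecks(["Paris", { type: "not-contains", value: "London" }], "Paris")); // true
```

A production validator would need a full JSON Schema implementation; the subset here only illustrates the shape of the check.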

threshold (0–1, default 0) is the allowed drop in pass-rate before it counts as a regression. 0 means any drop alarms.
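The baseline comparison can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: passRate and isRegression are hypothetical names, pass-rates are treated as fractions in [0, 1] like threshold, a drop must be strictly greater than the threshold to count, and an empty result set is scored as 0.

```javascript
// Hypothetical sketch of the baseline compare; names are illustrative.

/** Fraction of passing cases. Assumption: an empty run scores 0. */
function passRate(results) {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

/**
 * Assumption: a regression is a drop strictly greater than the threshold.
 * With the default threshold of 0, any drop at all counts.
 */
function isRegression(baselineRate, currentRate, threshold = 0) {
  return baselineRate - currentRate > threshold;
}

// The sample run above: baseline 100.0%, now 2 of 3 cases passing.
const current = passRate([{ passed: true }, { passed: false }, { passed: true }]);
console.log(isRegression(1.0, current));      // true
console.log(isRegression(1.0, current, 0.5)); // false: a drop of about 0.33 is within 0.5
```

An improvement over the baseline never triggers the alarm in this sketch, because the difference is then negative.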

See examples/promptdrift.json for a runnable starter.

CLI

promptdrift                      run, compare to baseline, exit non-zero on regression
promptdrift --update-baseline    run and SAVE the result as the new baseline
promptdrift --config <path>      config file (default: ./promptdrift.json)
promptdrift --baseline <path>    baseline file (default: ./.promptdrift-baseline.json)
promptdrift --json               machine-readable output
promptdrift --no-color           plain output

Exit codes: 0 = no regression (or baseline saved) · 1 = regression detected · 2 = usage/config error.

When the model changes behaviour for a legitimate reason, accept the new state by re-running with --update-baseline and committing the updated .promptdrift-baseline.json.

GitHub Action (scheduled drift alarm)

Copy examples/promptdrift.yml to .github/workflows/promptdrift.yml:

name: promptdrift
on:
  schedule:
    - cron: "0 8 * * *"   # daily at 08:00 UTC
  workflow_dispatch: {}
permissions:
  issues: write           # so promptdrift can open/update the alert issue
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: wartzar-bee/promptdrift@v0
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        with:
          config: promptdrift.json
          baseline: .promptdrift-baseline.json

On a regression, the Action opens a single GitHub issue (titled "promptdrift: prompt regression detected") and the workflow run fails. On subsequent failing runs it adds a fresh comment to that same issue instead of filing duplicates. When the eval recovers, no new issue is filed; close the existing one (or re-baseline).
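The "open once, then comment" behaviour can be sketched as follows. This is not the Action's actual code: alertRegression, createFakeApi and the injected api function are hypothetical, and matching the existing issue by its exact title is an assumption. The paths follow the GitHub REST API for listing issues, creating an issue, and creating an issue comment.

```javascript
// Hypothetical sketch of de-duplicated alerting; not the Action's real code.
// `api` is an injected async function (method, path, body) => parsed JSON,
// so the logic runs without network access.

const ISSUE_TITLE = "promptdrift: prompt regression detected";

async function alertRegression(api, repo, summary) {
  const open = await api("GET", `/repos/${repo}/issues?state=open&per_page=100`);
  // The issues endpoint also returns pull requests; skip those.
  // Assumption: the alert issue is identified by its exact title.
  const existing = open.find((i) => i.title === ISSUE_TITLE && !i.pull_request);
  if (existing) {
    await api("POST", `/repos/${repo}/issues/${existing.number}/comments`, { body: summary });
    return { action: "commented", number: existing.number };
  }
  const created = await api("POST", `/repos/${repo}/issues`, { title: ISSUE_TITLE, body: summary });
  return { action: "opened", number: created.number };
}

/** In-memory stand-in for the GitHub API, for illustration only. */
function createFakeApi() {
  const issues = [];
  const comments = [];
  const api = async (method, path, body) => {
    if (method === "GET") return issues;
    if (path.endsWith("/comments")) {
      comments.push(body);
      return body;
    }
    const issue = { number: issues.length + 1, ...body };
    issues.push(issue);
    return issue;
  };
  return { api, issues, comments };
}

const fake = createFakeApi();
alertRegression(fake.api, "owner/repo", "pass-rate dropped from 100.0% to 66.7%")
  .then(() => alertRegression(fake.api, "owner/repo", "still failing"))
  .then((second) => console.log(second.action, fake.issues.length)); // commented 1
```

In a real run, api would wrap fetch against the GitHub API using the workflow's token; that wiring is omitted here.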

Design / how to verify

  • Node 22, ESM, zero runtime dependencies (stdlib + built-in fetch only).
  • A pure, network-free core (src/checks.mjs, src/config.mjs, src/runner.mjs, src/report.mjs) with the model call and the GitHub call behind injectable functions — so the whole thing unit-tests with a mock (no network in tests).
  • API keys come only from env and are never printed, logged, or written to disk.

npm test     # node --test: 42 tests, all offline
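The injectable-function design can be illustrated with a short sketch. It assumes a runner that receives the model call as an argument; runCases, mockModel and the call signature ({ system, prompt }) => string are hypothetical, and only contains-style checks are handled to keep the example small. The second test case is made up for illustration.

```javascript
// Hypothetical sketch of dependency injection for the model call.
// `callModel` is any async function ({ system, prompt }) => string.

async function runCases(cases, callModel) {
  const results = [];
  for (const testCase of cases) {
    const output = await callModel({ system: testCase.system, prompt: testCase.prompt });
    // Only bare-string / contains checks are handled in this sketch.
    const expected = typeof testCase.check === "string" ? testCase.check : testCase.check.value;
    results.push({ name: testCase.name, passed: output.includes(expected) });
  }
  return results;
}

// A mock model: deterministic and offline, standing in for the real provider call.
const mockModel = async ({ prompt }) => (prompt.includes("France") ? "Paris" : "I don't know");

runCases(
  [
    { name: "answers capital of France", prompt: "What is the capital of France?", check: "Paris" },
    { name: "answers capital of Peru", prompt: "What is the capital of Peru?", check: { type: "contains", value: "Lima" } },
  ],
  mockModel,
).then((results) => {
  for (const r of results) console.log(`${r.passed ? "PASS" : "FAIL"}  ${r.name}`);
});
```

Because the runner never imports a provider SDK or calls fetch itself, swapping the mock for a real provider call requires no change to the runner.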

Status / roadmap

  • v0.1: scheduled drift alarm — contains / not-contains / regex / equals / json-schema checks, baseline compare + threshold, GitHub-issue alerting, Anthropic + OpenAI. 42 unit tests (npm test).
  • Next (evidence-driven, not yet built): optional LLM-as-judge check, per-case history/trend, Slack/webhook alert sink besides GitHub issues.

License

MIT