@phoenixaihub/mrd

v0.1.0

Published

2 months ago

phoenixaihub

llm eval regression benchmark ai

Model Regression Detector (mrd)

Know when GPT-4 gets dumber before your users do — scheduled benchmarks + alerts.

Problem

The "Is Claude getting worse?" thread has 1,809+ upvotes. Thousands of developers share the same anxiety: model providers ship silent updates that change behavior, and you find out from user complaints — 3-7 days too late.

mrd is a CLI that runs scheduled eval suites against your models, detects statistically significant regressions, and alerts you before your users notice.

Quick Start

npm install -g @phoenixaihub/mrd
mrd init
mrd run

Configuration

mrd.config.yaml:

models:
  - name: gpt-4o
    provider: openai
    model: gpt-4o
  - name: claude-sonnet
    provider: anthropic
    model: claude-sonnet-4-20250514

suites:
  - name: coding-quality
    cases:
      - prompt: "Write a Python function to merge two sorted lists"
        expect:
          method: contains
          value: "def merge"
      - prompt: "Explain the difference between a mutex and a semaphore"
        expect:
          method: llm-judge
          value: "Accurate, mentions counting semaphore vs binary lock"

schedule: "0 */6 * * *"  # every 6 hours

alerts:
  slack:
    webhook: ${SLACK_WEBHOOK_URL}
  webhook:
    url: https://your-api.com/webhook

baseline:
  window: 7        # rolling baseline over last 7 runs
  threshold: 0.15  # alert on >15% regression

store:
  path: ./mrd.db   # SQLite
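
The `baseline` settings above can be read as a rolling comparison. The following is a minimal Python sketch of that idea, not mrd's actual implementation: it assumes `threshold` is a relative drop against the mean of the last `window` scores, which the config comments suggest but do not state outright. The function names `relative_drop` and `should_alert` are hypothetical.

```python
from statistics import mean

def relative_drop(history, latest, window=7):
    """Fractional drop of `latest` versus the mean of the last `window` scores."""
    recent = history[-window:]
    if not recent:
        return 0.0  # no baseline yet, so nothing to compare against
    baseline = mean(recent)
    if baseline == 0:
        return 0.0  # avoid dividing by zero when every baseline run scored 0
    return (baseline - latest) / baseline

def should_alert(history, latest, window=7, threshold=0.15):
    """True when the latest score fell by more than `threshold` (assumed relative)."""
    return relative_drop(history, latest, window) > threshold

# Hypothetical pass rates from eight earlier runs; only the last seven are used.
history = [0.90, 0.92, 0.88, 0.91, 0.90, 0.89, 0.90, 0.91]
print(should_alert(history, 0.72))  # large drop
print(should_alert(history, 0.85))  # small dip
```

With these made-up numbers, a score of 0.72 is roughly 20% below the baseline and would trip the 0.15 threshold, while 0.85 is about 6% below and would not.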

Commands

| Command | Description |
|---------|-------------|
| mrd init | Generate a starter mrd.config.yaml |
| mrd run | Execute all eval suites once |
| mrd run --suite <name> | Run a specific suite |
| mrd status | Show latest results and trends |
| mrd watch | Run on schedule (cron mode) |
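
To make the `schedule` string that `mrd watch` consumes more concrete, here is a small Python sketch of how a cron expression such as `0 */6 * * *` maps to run times under standard cron semantics. It is an illustration only: how mrd parses the schedule internally is not documented here, `expand_field` and `next_run` are hypothetical helpers, and the sketch handles only the minute and hour fields.

```python
from datetime import datetime, timedelta

def expand_field(field, lo, hi):
    """Expand one cron field (e.g. "*/6", "0", "1,15", "9-17") into a set of ints."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            a, b = base.split("-")
            start, end = int(a), int(b)
        else:
            start = int(base)
            end = hi if "/" in part else start
        values.update(range(start, end + 1, step))
    return values

def next_run(expr, after):
    """Next matching minute strictly after `after` (minute and hour fields only)."""
    minute_f, hour_f, dom, month, dow = expr.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles '*' for day, month, and weekday")
    minutes = expand_field(minute_f, 0, 59)
    hours = expand_field(hour_f, 0, 23)
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while t.minute not in minutes or t.hour not in hours:
        t += timedelta(minutes=1)
    return t

print(sorted(expand_field("*/6", 0, 23)))
print(next_run("0 */6 * * *", datetime(2025, 1, 1, 7, 30)))
```

The hour field `*/6` expands to hours 0, 6, 12, and 18, so a run finishing at 07:30 would be followed by one at 12:00.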

Scoring Methods

| Method | Description | Example |
|--------|-------------|---------|
| exact | Exact string match | value: "42" |
| contains | Substring match | value: "def merge" |
| regex | Regular expression | value: "\\d{4}-\\d{2}-\\d{2}" |
| llm-judge | LLM evaluates response quality | value: "Accurate and complete explanation" |
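
The three heuristic methods in the table are simple enough to sketch in a few lines of Python. This is an illustration of the idea rather than mrd's own scorer: the `score` and `pass_rate` functions are hypothetical, trimming whitespace for `exact` is an assumption, and `llm-judge` is left out because it needs a model call.

```python
import re

def score(method, value, response):
    """Return True when `response` satisfies the expectation `value`."""
    if method == "exact":
        # Assumption: surrounding whitespace is ignored for exact matches.
        return response.strip() == value
    if method == "contains":
        return value in response
    if method == "regex":
        return re.search(value, response) is not None
    raise ValueError(f"unsupported method in this sketch: {method}")

def pass_rate(results):
    """Fraction of cases that passed; 0.0 for an empty suite."""
    return sum(results) / len(results) if results else 0.0

results = [
    score("exact", "42", " 42\n"),
    score("contains", "def merge", "def merge(a, b):\n    ..."),
    score("regex", r"\d{4}-\d{2}-\d{2}", "no date here"),
]
print(results, pass_rate(results))
```

A per-suite pass rate like this is one plausible number to track across runs; the actual metric mrd stores may differ.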

How It Works

1. Define eval suites in YAML: prompts plus expected behaviors
2. Run against multiple models on a schedule
3. Score responses via heuristics + LLM-as-judge
4. Detect statistically significant regressions (rolling baseline, not noise)
5. Alert via Slack or webhook (email alerts are on the roadmap): "Claude Opus coding dropped 18% since Tuesday"
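
The detection and alert steps can be sketched as follows. The README does not say which statistical test mrd uses, so this Python example swaps in a plain one-sided z-score against the rolling baseline as a stand-in; `z_score`, `is_regression`, `alert_text`, and the cutoff of -2.0 are all assumptions made for illustration.

```python
from statistics import mean, stdev

def z_score(baseline, latest):
    """Standard deviations between `latest` and the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline runs to estimate spread")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # All baseline runs were identical: any change is infinitely surprising.
        if latest == mu:
            return 0.0
        return float("-inf") if latest < mu else float("inf")
    return (latest - mu) / sigma

def is_regression(baseline, latest, z_cutoff=-2.0):
    """True when `latest` sits below the baseline by more than the cutoff."""
    return z_score(baseline, latest) < z_cutoff

def alert_text(model, suite, baseline, latest):
    """Format a short alert line with the relative drop."""
    drop = (mean(baseline) - latest) / mean(baseline)
    return f"{model} {suite} dropped {drop:.0%} versus rolling baseline"

# Hypothetical pass rates from the last seven runs.
baseline = [0.90, 0.92, 0.88, 0.91, 0.89, 0.90, 0.90]
if is_regression(baseline, 0.74):
    print(alert_text("Claude Opus", "coding", baseline, 0.74))
```

Scaling the drop by the baseline's own spread is what separates a real regression from run-to-run noise: a score of 0.89 against this baseline is within one standard deviation and would not trigger an alert.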

Why mrd vs Alternatives

| Feature | mrd | Braintrust | Promptfoo | Langfuse |
|---------|-----|-----------|-----------|---------|
| Scheduled monitoring | ✅ Built-in | ❌ | ❌ | ❌ |
| Regression detection | ✅ Statistical | ⚠️ Manual | ❌ | ❌ |
| Alerts (Slack/webhook) | ✅ | ⚠️ Paid | ❌ | ⚠️ |
| Setup time | 5 min | 30+ min | 10 min | 20+ min |
| Price | Free (MIT) | $500+/mo | Free | Free tier |
| Multi-provider | ✅ | ✅ | ✅ | ✅ |
| CLI-first | ✅ | ❌ | ✅ | ❌ |

Architecture

Scheduler (cron) → Runner (API calls) → Scorer (judge + heuristic) → Alerts (Slack/webhook)
                          ↕                        ↕
                    SQLite (results + trends) ← Stats Engine (regression detection)
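
As a rough picture of the SQLite leg of this diagram, the Python sketch below stores one score per run and reads back the most recent window for the stats engine. The table name, columns, and helper functions are hypothetical; mrd's real schema in `mrd.db` is not documented here and may look quite different.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a results store with a hypothetical `runs` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               model TEXT NOT NULL,
               suite TEXT NOT NULL,
               score REAL NOT NULL,
               ran_at TEXT NOT NULL
           )"""
    )
    return conn

def record_run(conn, model, suite, score, ran_at):
    """Append one suite result for one model."""
    conn.execute(
        "INSERT INTO runs (model, suite, score, ran_at) VALUES (?, ?, ?, ?)",
        (model, suite, score, ran_at),
    )
    conn.commit()

def recent_scores(conn, model, suite, window=7):
    """Scores from the last `window` runs, oldest first."""
    rows = conn.execute(
        "SELECT score FROM runs WHERE model = ? AND suite = ? "
        "ORDER BY id DESC LIMIT ?",
        (model, suite, window),
    ).fetchall()
    return [row[0] for row in reversed(rows)]

conn = open_store()
for day, value in enumerate([0.90, 0.91, 0.89, 0.92, 0.90, 0.88, 0.91, 0.90, 0.74], start=1):
    record_run(conn, "gpt-4o", "coding-quality", value, f"2025-01-{day:02d}T00:00:00Z")
print(recent_scores(conn, "gpt-4o", "coding-quality"))
```

Keeping raw per-run scores, rather than only aggregates, lets the baseline window be changed later without re-running any evals.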

Roadmap

- [ ] Web dashboard with trend charts
- [ ] Community eval library (shared prompt suites)
- [ ] Email alerts via Resend
- [ ] Cost tracking per eval run
- [ ] Public regression feed (opt-in shared results)
- [ ] mrd compare <date1> <date2> command
- [ ] CI integration (GitHub Actions, GitLab CI)
- [ ] Additional scorers: cosine similarity, structured output validation

License

MIT