@phoenixaihub/mrd
v0.1.0
Model Regression Detector (mrd)
Know when GPT-4 gets dumber before your users do — scheduled benchmarks + alerts.
Problem
The "Is Claude getting worse?" thread has 1,809+ upvotes. Thousands of developers share the same anxiety: model providers ship silent updates that change behavior, and you find out from user complaints — 3-7 days too late.
mrd is a CLI that runs scheduled eval suites against your models, detects statistically significant regressions, and alerts you before your users notice.
Quick Start
npm install -g @phoenixaihub/mrd
mrd init
mrd run
Configuration
mrd.config.yaml:
models:
  - name: gpt-4o
    provider: openai
    model: gpt-4o
  - name: claude-sonnet
    provider: anthropic
    model: claude-sonnet-4-20250514
suites:
  - name: coding-quality
    cases:
      - prompt: "Write a Python function to merge two sorted lists"
        expect:
          method: contains
          value: "def merge"
      - prompt: "Explain the difference between a mutex and a semaphore"
        expect:
          method: llm-judge
          value: "Accurate, mentions counting semaphore vs binary lock"
schedule: "0 */6 * * *" # every 6 hours
alerts:
  slack:
    webhook: ${SLACK_WEBHOOK_URL}
  webhook:
    url: https://your-api.com/webhook
baseline:
  window: 7 # rolling baseline over last 7 runs
  threshold: 0.15 # alert on >15% regression
store:
  path: ./mrd.db # SQLite
Commands
| Command | Description |
|---------|-------------|
| mrd init | Generate a starter mrd.config.yaml |
| mrd run | Execute all eval suites once |
| mrd run --suite <name> | Run a specific suite |
| mrd status | Show latest results and trends |
| mrd watch | Run on schedule (cron mode) |
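`mrd watch` runs suites on the `schedule` cron expression from the config. To make `0 */6 * * *` concrete, here is a rough sketch that expands a single cron field using standard five-field cron semantics. `expandCronField` is a hypothetical helper written for illustration, not part of mrd, and it only handles `*`, step values, single numbers, ranges, and comma lists.

```typescript
// Hypothetical helper (not mrd's code): expands one cron field into the
// values it matches. Supports "*", "*/n", "a", "a-b", "a-b/n", and
// comma-separated lists of these.
function expandCronField(field: string, min: number, max: number): number[] {
  const values: number[] = [];
  for (const part of field.split(",")) {
    const [range, stepText] = part.split("/");
    const step = stepText === undefined ? 1 : parseInt(stepText, 10);
    if (!(step > 0)) {
      throw new Error("invalid step in cron field: " + part);
    }
    let lo = min;
    let hi = max;
    if (range !== "*") {
      const [start, end] = range.split("-");
      lo = parseInt(start, 10);
      hi = end === undefined ? lo : parseInt(end, 10);
    }
    for (let v = lo; v <= hi; v += step) {
      if (values.indexOf(v) === -1) values.push(v);
    }
  }
  return values.sort((a, b) => a - b);
}

// Fields of the schedule from the example config: minute, hour, day, month, weekday.
const cronFields = "0 */6 * * *".split(" ");
const cronMinutes = expandCronField(cronFields[0], 0, 59); // [0]
const cronHours = expandCronField(cronFields[1], 0, 23);   // [0, 6, 12, 18]
```

Under these semantics the example schedule fires at 00:00, 06:00, 12:00, and 18:00 each day, which matches the "every 6 hours" comment in the config.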
Scoring Methods
| Method | Description | Example |
|--------|-------------|---------|
| exact | Exact string match | value: "42" |
| contains | Substring match | value: "def merge" |
| regex | Regular expression | value: "\\d{4}-\\d{2}-\\d{2}" |
| llm-judge | LLM evaluates response quality | value: "Accurate and complete explanation" |
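The three deterministic methods in the table amount to simple string checks. The sketch below is an illustration only, not mrd's actual implementation: the `Expectation` type and `scoreResponse` function are hypothetical names, trimming whitespace for `exact` is an assumption, and `llm-judge` is left out because it requires a model call.

```typescript
// Hypothetical sketch of the deterministic scoring methods; not mrd's real code.
type DeterministicMethod = "exact" | "contains" | "regex";

interface Expectation {
  method: DeterministicMethod;
  value: string;
}

// Returns true when the model response satisfies the expectation.
function scoreResponse(expectation: Expectation, response: string): boolean {
  if (expectation.method === "exact") {
    // Assumption: surrounding whitespace is ignored before comparing.
    return response.trim() === expectation.value;
  }
  if (expectation.method === "contains") {
    return response.indexOf(expectation.value) !== -1;
  }
  // "regex": the value is compiled as a regular expression and searched for.
  return new RegExp(expectation.value).test(response);
}

// The table's examples, applied to made-up responses.
const exactOk = scoreResponse({ method: "exact", value: "42" }, "42\n");
const containsOk = scoreResponse(
  { method: "contains", value: "def merge" },
  "def merge(a, b):\n    return sorted(a + b)"
);
const regexOk = scoreResponse(
  { method: "regex", value: "\\d{4}-\\d{2}-\\d{2}" },
  "Released on 2024-05-13."
);
```

All three example calls evaluate to `true`. A substring check such as `def merge` only confirms that the response has the expected shape, not that the code is correct, which is presumably why the config pairs it with `llm-judge` cases.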
How It Works
- Define eval suites in YAML — prompts + expected behaviors
- Run against multiple models on a schedule
- Score responses via heuristics + LLM-as-judge
- Detect statistically significant regressions (rolling baseline, not noise)
- Alert via Slack or webhook (email alerts are on the roadmap): "Claude Opus coding dropped 18% since Tuesday"
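The detection step can be pictured with the `baseline` settings from the config (`window: 7`, `threshold: 0.15`). The sketch below is a minimal illustration under stated assumptions, not mrd's stats engine: it assumes each run is summarized as a pass rate between 0 and 1, takes the mean of the last `window` runs as the baseline, and flags a relative drop larger than `threshold`. The names `detectRegression`, `RegressionResult`, and `formatAlert` are hypothetical, and the README does not say which statistical test mrd actually applies.

```typescript
// Hypothetical sketch of rolling-baseline regression detection; not mrd's real code.
interface RegressionResult {
  baseline: number;     // mean score over the baseline window
  relativeDrop: number; // (baseline - current) / baseline
  regressed: boolean;   // true when relativeDrop exceeds the threshold
}

function detectRegression(
  previousScores: number[], // oldest first, one pass rate (0..1) per earlier run
  current: number,          // pass rate of the latest run
  window: number,           // baseline.window in the config
  threshold: number         // baseline.threshold in the config
): RegressionResult {
  if (window < 1) throw new Error("window must be at least 1");
  const recent = previousScores.slice(-window);
  if (recent.length === 0) throw new Error("need at least one earlier run");
  const baseline = recent.reduce((sum, s) => sum + s, 0) / recent.length;
  // A zero baseline cannot drop any further, so report no regression.
  const relativeDrop = baseline === 0 ? 0 : (baseline - current) / baseline;
  return { baseline, relativeDrop, regressed: relativeDrop > threshold };
}

// Builds a short alert line; the wording is an assumption modeled on the README's example.
function formatAlert(model: string, suite: string, result: RegressionResult): string {
  const pct = Math.round(result.relativeDrop * 100);
  return `${model} ${suite} dropped ${pct}% against the rolling baseline`;
}

// Seven earlier runs at a 90% pass rate, then a run at 72%.
const pastScores = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9];
const latest = detectRegression(pastScores, 0.72, 7, 0.15);
const alertText = latest.regressed
  ? formatAlert("claude-sonnet", "coding-quality", latest)
  : "";
```

In this example the pass rate falls from 0.90 to 0.72, a relative drop of about 20%, which exceeds the 15% threshold. A plain threshold on the mean does not account for run-to-run variance; a fuller version would compare the drop against the spread of the baseline window before alerting.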
Why mrd vs Alternatives
| Feature | mrd | Braintrust | Promptfoo | Langfuse |
|---------|-----|-----------|-----------|---------|
| Scheduled monitoring | ✅ Built-in | ❌ | ❌ | ❌ |
| Regression detection | ✅ Statistical | ⚠️ Manual | ❌ | ❌ |
| Alerts (Slack/webhook) | ✅ | ⚠️ Paid | ❌ | ⚠️ |
| Setup time | 5 min | 30+ min | 10 min | 20+ min |
| Price | Free (MIT) | $500+/mo | Free | Free tier |
| Multi-provider | ✅ | ✅ | ✅ | ✅ |
| CLI-first | ✅ | ❌ | ✅ | ❌ |
Architecture
Scheduler (cron) → Runner (API calls) → Scorer (judge + heuristic) → Alerts (Slack/webhook)
↕ ↕
SQLite (results + trends) ← Stats Engine (regression detection)
Roadmap
- [ ] Web dashboard with trend charts
- [ ] Community eval library (shared prompt suites)
- [ ] Email alerts via Resend
- [ ] Cost tracking per eval run
- [ ] Public regression feed (opt-in shared results)
- [ ] mrd compare <date1> <date2> command
- [ ] CI integration (GitHub Actions, GitLab CI)
- [ ] Additional scorers: cosine similarity, structured output validation
License
MIT
