ml-playbook
v1.0.0
The ML Playbook
12 rules and 4 skills that make Claude your senior ML engineer
Stop vibe-ML-ing. Start using The ML Playbook.
npx ml-playbook
Getting Started · The 12 Rules · The 4 Skills · Supported Tools
🎯 The Problem
Most AI coding assistants default to:
| ❌ Bad Default | 💡 What Senior ML Engineers Do |
|:-:|:-:|
| Write complex models before baselines | Always build the simplest baseline first |
| Ignore data quality issues | Audit data before touching models |
| Leave everything in notebooks | Write production-ready code from day one |
| Skip evaluation until it's too late | Define metrics before writing code |
| No monitoring, no security | Deploy monitoring from day one |
The ML Playbook encodes these senior engineer habits into skills that automatically activate when you're doing ML work.
⚡ Quick Start
# Install with npx (recommended)
npx ml-playbook
# Or with curl
curl -fsSL https://raw.githubusercontent.com/charlotte-12s/ml-playbook/main/install.sh | bash
# Install only LLM skills
npx ml-playbook --bundle llm-only
# Install only traditional ML skills
npx ml-playbook --bundle traditional-ml
# Install for a specific tool
npx ml-playbook --tool cursor --tool codex --tool gemini
📜 The 12 Golden Rules
These rules override default AI behaviors when working on ML/AI code:
| # | Rule | Anti-Pattern |
|:-:|------|:------------|
| 1 | Build Baseline First: Always implement the simplest model before optimizing | Tuning before baselining |
| 2 | Data > Model > Hyperparams: Check data quality before changing architecture | Reaching for a bigger model |
| 3 | Deploy from Day One: Write all ML code with production in mind | "I'll productionize later" |
| 4 | Metrics Before Code: Define success metrics before writing model code | Coding without an eval |
| 5 | Guard Against Data Leakage: Verify every feature is available at inference time | Using future data in training |
| 6 | Simplicity First: Use rules over ML, linear over deep learning | Defaulting to Transformers |
| 7 | Reproducible Experiments: Record config, seeds, and environment for every run | "Can't reproduce last week's results" |
| 8 | Change One Variable: Make one change per experiment iteration | Changing data + model + params at once |
| 9 | Monitor Before You Optimize: Set up monitoring before tuning | Discovering drift from user complaints |
| 10 | Cost Consciousness: Calculate token count and latency for every LLM call | Using GPT-4 for simple tasks |
| 11 | Eval-Driven LLM Dev: Build eval benchmarks before iterating on LLM apps | Judging by vibes |
| 12 | Security Baseline: Always assume user input is malicious | Injecting user text into prompts |
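As an illustration of the "Reproducible Experiments" rule, here is a minimal sketch of a run record that captures config, seed, and environment. The function name `make_run_record` and the record fields are hypothetical and not part of the playbook. The sketch only seeds Python's standard-library RNG; a project using NumPy or PyTorch would need to seed those separately.

```python
import hashlib
import json
import platform
import random
import sys


def make_run_record(config: dict, seed: int) -> dict:
    """Seed the stdlib RNG and return a record describing this run.

    Hypothetical helper: the field names are illustrative only.
    """
    random.seed(seed)
    # Serialize with sorted keys so key order does not change the hash.
    config_json = json.dumps(config, sort_keys=True)
    config_hash = hashlib.sha256(config_json.encode("utf-8")).hexdigest()[:12]
    return {
        "config": config,
        "config_hash": config_hash,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


record = make_run_record({"model": "logreg", "lr": 0.1}, seed=42)
print(json.dumps(record, indent=2))
```

Writing a record like this next to each run's outputs makes it possible to tell two runs apart by `config_hash` and to re-run one with the same seed.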
🛠️ The 4 Skills
| Skill | Command | Stage | What It Does |
|-------|---------|:-----:|-------------|
| ML Bootstrap | /ml-bootstrap | 🚀 Launch | Problem definition → Data audit → Baseline → Eval framework → Project scaffold |
| ML Debug | /ml-debug | 🔍 Debug | Symptom classification → Root cause analysis → Prioritized fix recommendations |
| ML Ship | /ml-ship | 🚢 Deploy | Readiness check → Packaging → Serving → Testing → Monitoring → Rollback plan |
| LLM Craft | /llm-craft | 🧠 Build | Architecture decision → RAG engineering → Prompt design → Agent design → Eval system |
Skill Detail
ML Bootstrap (/ml-bootstrap) is a 5-step gated process that prevents skipping fundamentals:
- Problem Definition — Classify the problem, define the target, identify constraints
- Data Audit — Volume, distribution, missingness, quality, temporal aspects
- Baseline Strategy — Rule-based → Linear → Dummy baseline
- Evaluation Framework — Primary/secondary metrics, validation strategy, significance
- Project Scaffold — Standard ML directory structure with configs
Includes: Problem definition template · Data audit checklist · Project scaffold with Dockerfile
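The data audit and baseline steps above can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the skill's own template: `audit_missingness` and `majority_baseline_accuracy` are hypothetical helpers, and the toy churn rows are made up for the example.

```python
from collections import Counter


def audit_missingness(rows: list[dict], columns: list[str]) -> dict[str, float]:
    """Return the fraction of rows where each column is None or absent."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


def majority_baseline_accuracy(train_labels: list, test_labels: list) -> float:
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


# Toy data, invented for illustration.
rows = [
    {"age": 31, "plan": "pro", "churned": 0},
    {"age": None, "plan": "free", "churned": 0},
    {"age": 45, "plan": None, "churned": 1},
    {"age": 29, "plan": "free", "churned": 0},
]
print(audit_missingness(rows, ["age", "plan"]))  # {'age': 0.25, 'plan': 0.25}

train_labels = [row["churned"] for row in rows]
holdout_labels = [0, 1, 1, 0]
print(majority_baseline_accuracy(train_labels, holdout_labels))  # 0.5
```

Any later model has to beat the majority-class number on the same hold-out set before added complexity is justified.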
ML Debug (/ml-debug) is a 4-step diagnosis-to-fix workflow:
- Symptom Classification — Non-convergence / Overfitting / Evaluation bug / Instability / Serving bug / Drift
- Root Cause Investigation — Follow the decision tree to pinpoint the cause
- Data Investigation — Leakage scan, distribution check, label quality audit
- Fix Recommendations — Prioritized by impact/effort matrix
Includes: Full diagnosis decision tree (PyTorch + sklearn diagnostic commands) · 30+ common pitfalls catalog
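Two of the checks named above, a leakage scan and an overfitting symptom test, might look like the following sketch. Both helpers are hypothetical and are not the skill's actual diagnostic commands. The 0.10 train/validation gap is an arbitrary threshold chosen for the example.

```python
def find_train_test_overlap(train_rows, test_rows, key_fields):
    """Return the key tuples that appear in both splits.

    A non-empty result means the same entity is in train and test,
    which is one common form of leakage.
    """
    train_keys = {tuple(row[f] for f in key_fields) for row in train_rows}
    test_keys = {tuple(row[f] for f in key_fields) for row in test_rows}
    return train_keys & test_keys


def looks_like_overfitting(train_score, val_score, max_gap=0.10):
    """Flag a run whose training score exceeds validation by more than max_gap.

    max_gap is an assumed threshold; pick one that suits the metric in use.
    """
    return (train_score - val_score) > max_gap


train = [{"user_id": 1, "day": "2024-01-01"}, {"user_id": 2, "day": "2024-01-01"}]
test = [{"user_id": 2, "day": "2024-01-01"}, {"user_id": 3, "day": "2024-01-02"}]
print(find_train_test_overlap(train, test, ["user_id", "day"]))  # {(2, '2024-01-01')}
print(looks_like_overfitting(train_score=0.99, val_score=0.80))  # True
```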
ML Ship (/ml-ship) is a 6-step production readiness pipeline:
- Readiness Check — Performance, latency, reproducibility gates
- Packaging — Model signature, dependency locking, config separation, serialization
- Serving — REST API (FastAPI) / Batch / Triton code generation
- Testing — Unit + Integration + Regression + Load + A/B test design
- Monitoring — Data drift, performance degradation, latency SLA, error rate
- Rollback Plan — Versioning, canary deployment, automatic rollback
Includes: 25-item deployment checklist · Prometheus + Grafana monitoring template
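For the data drift part of the monitoring step, one widely used statistic is the Population Stability Index (PSI). The source does not say which drift metric the skill's template uses, so PSI is swapped in here purely as an example. The binning scheme and the `eps` floor are assumptions of this sketch.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature.

    Bins are equal-width over the reference range; live values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]
print(population_stability_index(reference, reference))  # 0.0
print(population_stability_index(reference, shifted) > 0.25)  # True
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the alert threshold should be tuned per feature.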
LLM Craft (/llm-craft) is a 5-step LLM engineering workflow:
- Architecture Decision — RAG vs Fine-tune vs Agent decision matrix
- RAG Engineering — Chunking strategy → Retrieval optimization → Generation quality → Eval loop
- Prompt Engineering — Template design, injection defense, cost optimization
- Agent Design — Tool definition, planning strategy, error recovery, human-in-the-loop
- Evaluation System — Golden dataset, automated metrics, LLM-as-judge, regression testing
Includes: 4 RAG architecture patterns · 5 prompt templates + injection defense · Full eval framework with cost tracking
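The evaluation step (golden dataset, regression testing, cost tracking) can be approximated with a sketch like the one below. It is not the skill's eval framework: `GoldenCase`, `run_golden_eval`, and `estimate_cost_usd` are hypothetical names, `fake_model` stands in for a real LLM call, and the per-token prices are placeholders rather than any provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    prompt: str
    expected: str


def estimate_cost_usd(input_tokens, output_tokens, usd_per_1k_input, usd_per_1k_output):
    """Cost of one call given token counts and per-1k-token prices (placeholders)."""
    return (input_tokens / 1000) * usd_per_1k_input + (output_tokens / 1000) * usd_per_1k_output


def run_golden_eval(cases, generate):
    """Return the fraction of cases where generate(prompt) matches the expected answer.

    Matching is case-insensitive exact match, the simplest possible metric.
    """
    passed = sum(
        1
        for case in cases
        if generate(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return passed / len(cases)


def fake_model(prompt):
    """Hypothetical stand-in for an LLM call."""
    canned = {"2+2?": "4", "Capital of France?": "paris"}
    return canned.get(prompt, "unknown")


cases = [
    GoldenCase("2+2?", "4"),
    GoldenCase("Capital of France?", "Paris"),
    GoldenCase("Color of sky?", "blue"),
]
print(round(run_golden_eval(cases, fake_model), 3))  # 0.667
print(estimate_cost_usd(2000, 500, usd_per_1k_input=0.5, usd_per_1k_output=1.5))  # 1.75
```

Re-running the same golden set after every prompt or model change turns "judging by vibes" into a pass rate that can regress visibly.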
🔌 Supported Tools
| Tool | Format | Auto-Detected |
|------|--------|:---:|
| Claude Code | .claude/skills/ + SKILL.md | ✅ |
| Cursor | .cursor/rules/ | ✅ |
| Codex CLI | AGENTS.md | ✅ |
| Gemini CLI | GEMINI.md | ✅ |
| GitHub Copilot | .github/copilot-instructions.md | ✅ |
| Windsurf | .windsurfrules | ✅ |
The installer auto-detects which tools you're using and generates the right format.
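The source does not describe how detection works. One plausible approach, sketched here as an assumption, is to look for each tool's marker path from the table in the project root. `TOOL_MARKERS` and `detect_tools` are hypothetical, and using the bare `.claude` directory as the Claude Code marker is a guess.

```python
import tempfile
from pathlib import Path

# Marker paths taken from the table above; treating their presence as
# "tool in use" is an assumption, not the installer's documented logic.
TOOL_MARKERS = {
    "Claude Code": ".claude",
    "Cursor": ".cursor",
    "Codex CLI": "AGENTS.md",
    "Gemini CLI": "GEMINI.md",
    "GitHub Copilot": ".github/copilot-instructions.md",
    "Windsurf": ".windsurfrules",
}


def detect_tools(project_root):
    """Return the tools whose marker path exists under project_root."""
    root = Path(project_root)
    return [tool for tool, marker in TOOL_MARKERS.items() if (root / marker).exists()]


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".cursor").mkdir()
    (Path(tmp) / "GEMINI.md").write_text("")
    print(detect_tools(tmp))  # ['Cursor', 'Gemini CLI']
```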
🧩 How It Works
┌─────────────────────────────────────────────────┐
│                    CLAUDE.md                    │
│         12 Golden Rules (always active)         │
│   Override default AI behavior on ML/AI tasks   │
└──────────────────────┬──────────────────────────┘
                       │ routes to
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ml-bootstrap │ │  ml-debug   │ │   ml-ship   │
│  🚀 Launch  │ │  🔍 Debug   │ │  🚢 Deploy  │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       │               │               │
       ▼               ▼               ▼
  references/     references/     references/
  · templates     · decision tree · checklists
  · checklists    · pitfalls      · monitoring
  · scaffold                      · rollback
                       │
                       ▼
                ┌─────────────┐
                │  llm-craft  │
                │  🧠 Build   │
                └──────┬──────┘
                       ▼
                  references/
                  · RAG patterns
                  · prompt templates
                  · eval framework
- CLAUDE.md activates automatically when you're working on ML/AI code
- The 12 rules modify AI behavior without you asking
- Skill routing triggers the right skill based on your task
- Each skill follows a gated methodology — you can't skip steps
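To make "gated" concrete, here is a minimal sketch of a workflow object that refuses to complete a step out of order. The skills themselves are instruction files for an assistant, so this class is only an analogy. `GatedWorkflow` and the step names are hypothetical.

```python
class GatedWorkflow:
    """Tracks progress through ordered steps and rejects skipped steps."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = []

    def next_step(self):
        """Return the next required step, or None when all are done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step):
        expected = self.next_step()
        if step != expected:
            raise RuntimeError(f"cannot complete {step!r}; next required step is {expected!r}")
        self.completed.append(step)


workflow = GatedWorkflow(["problem_definition", "data_audit", "baseline"])
workflow.complete("problem_definition")
try:
    workflow.complete("baseline")  # skips data_audit
except RuntimeError as err:
    print(err)
print(workflow.next_step())  # data_audit
```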
🤝 Contributing
Contributions are welcome! Areas of particular interest:
- More reference templates for specific ML frameworks
- Translations of the 12 rules into other languages
- Additional skill bundles (e.g., computer vision, NLP, time-series)
- Improvements to the install script for more tools
Please read the existing skill structure before submitting PRs.
The ML Playbook — Because senior ML engineers don't vibe-code models.
