ml-playbook

v1.0.0

Published

17 days ago

The ML Playbook — 12 rules and 4 skills that make Claude your senior ML engineer


zzs12312123

claude-code skills ml machine-learning mlops llm rag agent

The ML Playbook

12 rules and 4 skills that make Claude your senior ML engineer

Stop vibe-ML-ing. Start using The ML Playbook.

npx ml-playbook


🎯 The Problem

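Most AI coding assistants default to the anti-patterns listed next to each rule below, such as tuning before baselining or judging LLM output by vibes.

The ML Playbook encodes the opposite, senior-engineer habits into rules and skills that activate automatically when you're doing ML work.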

⚡ Quick Start

# Install with npx (recommended)
npx ml-playbook

# Or with curl
curl -fsSL https://raw.githubusercontent.com/charlotte-12s/ml-playbook/main/install.sh | bash

# Install only LLM skills
npx ml-playbook --bundle llm-only

# Install only traditional ML skills
npx ml-playbook --bundle traditional-ml

# Install for specific tools (repeat --tool for each one)
npx ml-playbook --tool cursor --tool codex --tool gemini

📜 The 12 Golden Rules

These rules override default AI behaviors when working on ML/AI code:

| # | Rule | Anti-Pattern |
|:-:|------|:------------|
| 1 | Build Baseline First: always implement the simplest model before optimizing | Tuning before baselining |
| 2 | Data > Model > Hyperparams: check data quality before changing architecture | Reaching for a bigger model |
| 3 | Deploy from Day One: write all ML code with production in mind | "I'll productionize later" |
| 4 | Metrics Before Code: define success metrics before writing model code | Coding without an eval |
| 5 | Guard Against Data Leakage: verify every feature is available at inference time | Using future data in training |
| 6 | Simplicity First: use rules over ML, linear over deep learning | Defaulting to Transformers |
| 7 | Reproducible Experiments: record config, seeds, and environment for every run | "Can't reproduce last week's results" |
| 8 | Change One Variable: make one change per experiment iteration | Changing data + model + params at once |
| 9 | Monitor Before You Optimize: set up monitoring before tuning | Discovering drift from user complaints |
| 10 | Cost Consciousness: calculate token count and latency for every LLM call | Using GPT-4 for simple tasks |
| 11 | Eval-Driven LLM Dev: build eval benchmarks before iterating on LLM apps | Judging by vibes |
| 12 | Security Baseline: always assume user input is malicious | Injecting user text into prompts |
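As an illustration of Rule 7 (Reproducible Experiments), here is a minimal sketch of what recording config, seed, and environment for a run could look like. It uses only the Python standard library; the function name `make_run_record` and the record fields are assumptions made for this example, not part of the playbook itself. A real project would likely also seed NumPy or PyTorch and record package versions.

```python
import hashlib
import json
import platform
import random
import sys


def make_run_record(config: dict, seed: int) -> dict:
    """Seed the stdlib RNG and return a JSON-serializable record of the run.

    Hypothetical helper for illustration only.
    """
    # Seed before any sampling so the run can be replayed later.
    random.seed(seed)
    # Hash the key-sorted config so two runs can be compared at a glance.
    config_json = json.dumps(config, sort_keys=True)
    config_hash = hashlib.sha256(config_json.encode("utf-8")).hexdigest()[:12]
    return {
        "config": config,
        "config_hash": config_hash,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


if __name__ == "__main__":
    record = make_run_record({"model": "logreg", "lr": 0.1}, seed=42)
    print(json.dumps(record, indent=2))
```

Writing this record next to each run's metrics makes "can't reproduce last week's results" easier to investigate, because the config hash and seed identify exactly which setup produced a given number.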

🛠️ The 4 Skills

| Skill | Command | Stage | What It Does |
|-------|---------|:-----:|-------------|
| ML Bootstrap | /ml-bootstrap | 🚀 Launch | Problem definition → Data audit → Baseline → Eval framework → Project scaffold |
| ML Debug | /ml-debug | 🔍 Debug | Symptom classification → Root cause analysis → Prioritized fix recommendations |
| ML Ship | /ml-ship | 🚢 Deploy | Readiness check → Packaging → Serving → Testing → Monitoring → Rollback plan |
| LLM Craft | /llm-craft | 🧠 Build | Architecture decision → RAG engineering → Prompt design → Agent design → Eval system |

Skill Detail

ML Bootstrap (/ml-bootstrap) is a 5-step gated process that prevents skipping fundamentals:

1. Problem Definition: classify the problem, define the target, identify constraints
2. Data Audit: volume, distribution, missingness, quality, temporal aspects
3. Baseline Strategy: Rule-based → Linear → Dummy baseline
4. Evaluation Framework: primary/secondary metrics, validation strategy, significance
5. Project Scaffold: standard ML directory structure with configs

Includes: Problem definition template · Data audit checklist · Project scaffold with Dockerfile
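To make the Baseline Strategy step concrete, the following is a minimal sketch of a dummy (majority-class) baseline with an accuracy check, written in plain Python. The function names and the toy labels are assumptions for illustration; the skill's own templates may look different.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return the most frequent label in the training set."""
    if not train_labels:
        raise ValueError("train_labels must not be empty")
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Toy data, invented for this example.
    train = ["ham", "ham", "spam", "ham"]
    test = ["ham", "spam", "ham", "ham"]
    majority = fit_majority_baseline(train)
    preds = [majority] * len(test)  # always predict the majority class
    print(f"baseline={majority!r} accuracy={accuracy(preds, test):.2f}")
```

Any later model has to beat this number on the same split to justify its extra complexity.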

ML Debug (/ml-debug) is a 4-step diagnosis-to-fix workflow:

1. Symptom Classification: Non-convergence / Overfitting / Evaluation bug / Instability / Serving bug / Drift
2. Root Cause Investigation: follow the decision tree to pinpoint the cause
3. Data Investigation: leakage scan, distribution check, label quality audit
4. Fix Recommendations: prioritized by impact/effort matrix

Includes: Full diagnosis decision tree (PyTorch + sklearn diagnostic commands) · 30+ common pitfalls catalog
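One simple form a leakage scan in the Data Investigation step could take is checking whether test rows also appear verbatim in the training set. The sketch below is an assumption about what such a check might look like, not the skill's actual diagnostic commands; it only catches exact duplicates and assumes every row is a sequence of hashable values.

```python
def find_train_test_overlap(train_rows, test_rows):
    """Return the test rows that also appear verbatim in the training set."""
    # Convert rows to tuples so they can be stored in a set for O(1) lookup.
    train_keys = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in train_keys]


def overlap_rate(train_rows, test_rows):
    """Fraction of test rows that are exact duplicates of a training row."""
    if not test_rows:
        return 0.0
    return len(find_train_test_overlap(train_rows, test_rows)) / len(test_rows)


if __name__ == "__main__":
    # Toy rows, invented for this example.
    train = [(1, "a"), (2, "b"), (3, "c")]
    test = [(2, "b"), (4, "d")]
    print(find_train_test_overlap(train, test))
    print(f"overlap rate: {overlap_rate(train, test):.0%}")
```

A non-zero overlap rate does not prove leakage, but it is a cheap signal that the split deserves a closer look before trusting an evaluation score.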

ML Ship (/ml-ship) is a 6-step production readiness pipeline:

1. Readiness Check: performance, latency, reproducibility gates
2. Packaging: model signature, dependency locking, config separation, serialization
3. Serving: REST API (FastAPI) / Batch / Triton code generation
4. Testing: Unit + Integration + Regression + Load + A/B test design
5. Monitoring: data drift, performance degradation, latency SLA, error rate
6. Rollback Plan: versioning, canary deployment, automatic rollback

Includes: 25-item deployment checklist · Prometheus + Grafana monitoring template
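For the data drift part of the Monitoring step, one widely used statistic is the Population Stability Index (PSI). The sketch below is a plain-Python illustration under stated assumptions: the bin edges, the `eps` smoothing value, and the sample data are all chosen for the example, and the playbook's own monitoring template may use a different method.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges.

    `edges` are the interior bin boundaries; values below the first edge and
    at or above the last edge fall into the two outer bins.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges that v has reached or passed.
            counts[sum(v >= e for e in edges)] += 1
        # Floor each fraction at eps so empty bins do not cause log(0).
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    # Toy feature values, invented for this example.
    reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    live = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
    edges = [0.25, 0.5, 0.75]
    print(f"PSI = {psi(reference, live, edges):.3f}")
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift, but the right alert threshold depends on the feature and should be tuned per project.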

LLM Craft (/llm-craft) is a 5-step LLM engineering workflow:

1. Architecture Decision: RAG vs Fine-tune vs Agent decision matrix
2. RAG Engineering: Chunking strategy → Retrieval optimization → Generation quality → Eval loop
3. Prompt Engineering: template design, injection defense, cost optimization
4. Agent Design: tool definition, planning strategy, error recovery, human-in-the-loop
5. Evaluation System: golden dataset, automated metrics, LLM-as-judge, regression testing

Includes: 4 RAG architecture patterns · 5 prompt templates + injection defense · Full eval framework with cost tracking
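The golden dataset and regression testing ideas in the Evaluation System step can be sketched as a small harness. Everything here is an assumption for illustration: the function names, the exact-match metric, the 0.02 tolerance, and the stub model that stands in for a real LLM call. The skill's full eval framework is not reproduced here.

```python
def exact_match_score(model_fn, golden):
    """Fraction of golden examples whose normalized output equals the reference."""
    if not golden:
        raise ValueError("golden dataset must not be empty")

    def normalize(text):
        # Lowercase and collapse whitespace so trivial differences don't count.
        return " ".join(text.lower().split())

    hits = sum(
        normalize(model_fn(ex["input"])) == normalize(ex["expected"])
        for ex in golden
    )
    return hits / len(golden)


def passes_regression_gate(score, baseline_score, tolerance=0.02):
    """True when the score has not dropped more than `tolerance` below baseline."""
    return score >= baseline_score - tolerance


if __name__ == "__main__":
    # Toy golden set, invented for this example.
    golden = [
        {"input": "capital of France?", "expected": "Paris"},
        {"input": "2 + 2?", "expected": "4"},
    ]

    def stub_model(prompt):
        # Stand-in for a real LLM call; returns canned answers.
        return {"capital of France?": " paris ", "2 + 2?": "5"}.get(prompt, "")

    score = exact_match_score(stub_model, golden)
    print(f"score={score:.2f} gate={passes_regression_gate(score, 0.5)}")
```

Running a harness like this on every prompt or model change turns "judging by vibes" into a number that can block a regression.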

🔌 Supported Tools

| Tool | Format | Auto-Detected |
|------|--------|:---:|
| Claude Code | .claude/skills/ + SKILL.md | ✅ |
| Cursor | .cursor/rules/ | ✅ |
| Codex CLI | AGENTS.md | ✅ |
| Gemini CLI | GEMINI.md | ✅ |
| GitHub Copilot | .github/copilot-instructions.md | ✅ |
| Windsurf | .windsurfrules | ✅ |

The installer auto-detects which tools you're using and generates the right format.

🧩 How It Works

┌─────────────────────────────────────────────────┐
│                  CLAUDE.md                        │
│          12 Golden Rules (always active)          │
│   Override default AI behavior on ML/AI tasks    │
└──────────────────────┬──────────────────────────┘
                       │ routes to
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ ml-bootstrap │ │  ml-debug   │ │  ml-ship    │
│  🚀 Launch   │ │  🔍 Debug   │ │  🚢 Deploy  │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       │               │               │
       ▼               ▼               ▼
  references/      references/      references/
  · templates      · decision tree  · checklists
  · checklists     · pitfalls       · monitoring
  · scaffold                        · rollback
                       │
                       ▼
                ┌─────────────┐
                │  llm-craft  │
                │  🧠 Build   │
                └──────┬──────┘
                       ▼
                  references/
                  · RAG patterns
                  · prompt templates
                  · eval framework

1. CLAUDE.md activates automatically when you're working on ML/AI code
2. The 12 rules modify AI behavior without you asking
3. Skill routing triggers the right skill based on your task
4. Each skill follows a gated methodology, so you can't skip steps

🤝 Contributing

Contributions are welcome! Areas of particular interest:

- More reference templates for specific ML frameworks
- Translations of the 12 rules into other languages
- Additional skill bundles (e.g., computer vision, NLP, time-series)
- Improvements to the install script for more tools

Please read the existing skill structure before submitting PRs.

The ML Playbook — Because senior ML engineers don't vibe-code models.

⭐ Star this repo · 🐛 Report Bug · 💡 Request Feature
