@sloflash/ml-training-mcp v0.1.1
MCP server for ML training script development with progressive scaling and intelligent recovery
🚀 ML Training MCP Server
An intelligent MCP server for ML training script development with progressive scaling, automatic recovery, and comprehensive testing.
Key Features
🎯 Single File Focus
- ALL training logic stays in one training.py file
- No multi-file sprawl - everything in one place
- Small, manageable changes (50-100 lines max per delta)
📈 Progressive Scaling
- Start with minimal config (batch_size=1, seq_len=128)
- Gradually scale up to target configuration
- Automatic rollback on failure
- Memory of successful configurations
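The scaling loop above can be sketched as follows. This is a simplified illustration, not the server's actual implementation: `try_config` stands in for launching a short training run, and the doubling schedule is an assumption about the scaling policy.

```python
def scale_up_safely(try_config, start, target):
    """Double batch_size toward the target, rolling back to the
    last configuration that succeeded when a step fails."""
    last_good = None
    config = dict(start)
    while config["batch_size"] <= target["batch_size"]:
        if try_config(config):
            last_good = dict(config)   # remember this working config
            config["batch_size"] *= 2  # scale up gradually
        else:
            return last_good           # automatic rollback on failure
    return last_good
```

For example, if the hardware can only handle batch_size 8, the loop stops there and remembers that configuration rather than crashing at the target.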
🔍 Suspicious Pattern Detection
- Detects loss increasing after 20 steps
- Catches NaN/Inf in gradients
- Identifies memory leaks
- Spots training stalls
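The first two checks in the list can be pictured with a small sketch. The function name and return values are illustrative, not the server's API; the 20-step window comes from the detection rule above.

```python
import math

def detect_suspicious(losses, window=20):
    """Flag NaN/Inf loss immediately, and a rising loss trend
    once at least `window` steps have been observed."""
    if any(not math.isfinite(l) for l in losses):
        return "nan_or_inf"
    if len(losses) >= window and losses[-1] > losses[-window]:
        return "loss_increasing"
    return None
```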
🔄 Smart Recovery
- Analyzes failures intelligently
- Suggests specific fixes (not blind retries)
- Maximum 20 retries with 10-minute timeout
- Different fix attempted each time
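One way to picture "a different fix each time" with a retry cap is a rotating strategy list. The strategy names and ordering here are hypothetical; only the retry limit of 20 comes from the description above.

```python
FIXES = [  # hypothetical ordering of fix strategies
    "reduce_learning_rate",
    "add_gradient_clipping",
    "reduce_batch_size",
    "enable_grad_checkpointing",
]

def next_fix(attempt, max_retries=20):
    """Return a different fix for each attempt, cycling through
    the strategies; give up after max_retries attempts."""
    if attempt >= max_retries:
        return None  # retry budget exhausted
    return FIXES[attempt % len(FIXES)]
```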
🧪 Comprehensive Testing
- Mock scenarios for common failures
- Test before running on real GPUs
- Simulate long-running training
- Validate scaling paths safely
Quick Install (Recommended)
# No installation needed - use npx directly in your project
cd your-ml-project
# Create Claude configuration for THIS project only
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
"mcpServers": {
"ml-training": {
"command": "npx",
"args": ["-y", "@sloflash/ml-training-mcp"],
"env": {
"PROJECT_ROOT": "${workspaceFolder}"
}
}
}
}
EOF
# That's it! The MCP server is now available in THIS project only

Alternative Installation Methods
Method 1: Project-specific installation (for teams)
# Install as dev dependency in your ML project
cd your-ml-project
npm install --save-dev @sloflash/ml-training-mcp
# Configure Claude for this project
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
"mcpServers": {
"ml-training": {
"command": "node",
"args": ["./node_modules/@sloflash/ml-training-mcp/dist/index.js"],
"env": {
"PROJECT_ROOT": "${workspaceFolder}"
}
}
}
}
EOF

Method 2: Global Installation (if you prefer)
# Install globally
npm install -g @sloflash/ml-training-mcp
# Then in your project, configure Claude
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
"mcpServers": {
"ml-training": {
"command": "ml-training-mcp",
"args": [],
"env": {
"PROJECT_ROOT": "${workspaceFolder}"
}
}
}
}
EOF

One-Click Setup Script
Save this as setup-mcp.sh in your ML project:
#!/bin/bash
echo "🚀 Setting up ML Training MCP for this project..."
# Check Node.js
if ! command -v node &> /dev/null; then
echo "❌ Node.js is required. Install from nodejs.org"
exit 1
fi
# Create Claude config
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
"mcpServers": {
"ml-training": {
"command": "npx",
"args": ["-y", "@sloflash/ml-training-mcp"],
"env": {
"PROJECT_ROOT": "${workspaceFolder}",
"SINGLE_FILE_ENFORCEMENT": "true",
"MAX_RETRIES": "20",
"TIMEOUT_MINUTES": "10"
}
}
}
}
EOF
echo "✅ MCP configured for this project only"
echo "📁 Config: .claude/claude_desktop_config.json"
echo "🔄 Restart Claude Desktop to activate"

Then run:
chmod +x setup-mcp.sh
./setup-mcp.sh

Basic Usage
// 1. Initialize training script
await mcp.init_training_script({ model: "meta-llama/Llama-2-7b-hf" });
// 2. Test with minimal configuration
await mcp.test_minimal_config();
// 3. Scale up progressively
await mcp.scale_up_safely();
// 4. Run training with monitoring
await mcp.run_training({
timeout: 600000, // 10 minutes
mock: false // Use real training
});

Available Tools
Initialization
- init_training_script - Create training.py from template
- setup_environment - Initialize uv and .venv
- configure_model - Set HuggingFace model/tokenizer
Execution
- run_training - Execute with timeout and monitoring
- run_mock_training - Test with simulated scenarios
- test_minimal_config - Validate minimal setup
Monitoring
- detect_suspicious_patterns - Check for training issues
- get_training_status - Current status and retries
- monitor_memory - Track GPU memory usage
Recovery
- analyze_failure - Determine failure cause
- suggest_fix - Get intelligent fix suggestions
- rollback_to_working - Revert to last good version
- apply_small_delta - Make controlled changes
Scaling
- scale_up_safely - Increase parameters gradually
- scale_down - Reduce after failure
- find_max_viable - Determine hardware limits
Testing
Run All Tests
npm test

Specific Test Suites
npm run test:suspicious-patterns # Pattern detection
npm run test:scaling # Progressive scaling
npm run test:recovery            # Failure recovery

Integration Testing
npm run test:integration

Mock Scenarios
The server includes predefined scenarios for testing:
- happy_path - Everything works perfectly
- oom_after_warmup - OOM after 100 steps
- gradient_explosion - NaN loss at step 250
- memory_leak - Gradual memory growth
- loss_increase - Loss going up instead of down
- nccl_timeout - Multi-GPU communication failure
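One way to picture the scenario registry is a mapping from names to simulated outcomes. The scenario names and failure steps come from the list above; the dictionary shape and `run_mock` helper are illustrative, not the server's internal format.

```python
# Hypothetical registry shape; names and steps taken from the scenario list.
SCENARIOS = {
    "happy_path":         {"fails": False},
    "oom_after_warmup":   {"fails": True, "step": 100, "error": "CUDA out of memory"},
    "gradient_explosion": {"fails": True, "step": 250, "error": "loss is NaN"},
    "memory_leak":        {"fails": True, "error": "memory grows each step"},
    "loss_increase":      {"fails": True, "error": "loss trending upward"},
    "nccl_timeout":       {"fails": True, "error": "NCCL communication timeout"},
}

def run_mock(name):
    """Simulate a training run for a named scenario."""
    s = SCENARIOS[name]
    return {"success": not s["fails"], **{k: v for k, v in s.items() if k != "fails"}}
```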
Example Workflow
// Start with mock testing
const mockResult = await mcp.run_mock_training({
scenario: 'oom_after_warmup',
config: { batch_size: 4, seq_len: 512 }
});
if (!mockResult.success) {
// Analyze the failure
const analysis = await mcp.analyze_failure({
error: mockResult.error
});
// Get suggested fix
const fix = await mcp.suggest_fix();
// Apply the fix
await mcp.apply_small_delta({
changes: fix.code,
description: fix.description
});
// Retry with smaller config
await mcp.scale_down({ factor: 0.5 });
}

Suspicious Pattern Handling
When the system detects issues:
| Pattern | Detection | Automatic Action |
|---------|-----------|------------------|
| Loss increasing | After 20 steps | Reduce LR by 10x |
| Loss NaN | Immediate | Stop training |
| Memory leak | Continuous growth | Clear cache |
| No progress | 100 steps stalled | Check gradients |
| Gradient explosion | norm > 100 | Add clipping |
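The table's pattern-to-action mapping can be sketched as a simple dispatch. The pattern strings and function name are illustrative; the actions and the 10x learning-rate reduction come from the table.

```python
def automatic_action(pattern, lr):
    """Map a detected pattern to its automatic action from the table."""
    if pattern == "loss_increasing":
        return {"action": "reduce_lr", "new_lr": lr / 10}
    if pattern == "loss_nan":
        return {"action": "stop_training"}
    if pattern == "memory_leak":
        return {"action": "clear_cache"}
    if pattern == "no_progress":
        return {"action": "check_gradients"}
    if pattern == "gradient_explosion":
        return {"action": "add_clipping"}
    return {"action": "continue"}
```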
Configuration
Default configuration in Config class:
batch_size = 1
seq_len = 128
learning_rate = 1e-5
gradient_clip = 1.0
max_steps = 1000

Override via command line:
uv run python training.py --batch_size 4 --seq_len 512

File Structure
ml-training-mcp/
├── src/ # TypeScript source
│ ├── index.ts # MCP server
│ ├── script-manager.ts # Version control
│ ├── suspicious-detector.ts # Pattern detection
│ ├── scaling-orchestrator.ts # Progressive scaling
│ ├── retry-manager.ts # Retry logic
│ └── mock-executor.ts # Test scenarios
├── storage/ # Runtime storage
│ ├── training.py # THE training script
│ ├── training_memory.json # Working memory
│ └── working_versions/ # Checkpoints
├── templates/ # Script templates
│ └── training_template.py
└── test/                        # Test suites

Important Notes
- Single File Rule: Everything stays in training.py
- Environment: Always uses uv with .venv
- Testing First: Always test with mocks before real training
- Small Changes: Maximum 50-100 lines per modification
- Smart Recovery: Analyzes failures instead of retrying blindly
Troubleshooting
OOM Errors
- Reduce batch_size by 50%
- Reduce sequence_length by 50%
- Enable gradient checkpointing
- Use gradient accumulation
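Halving batch_size while doubling gradient accumulation keeps the effective batch (batch_size x accum_steps) unchanged. A pure-arithmetic sketch, assuming your training loop honors an `accum_steps` setting:

```python
def halve_batch(config):
    """Halve batch_size and double gradient accumulation so the
    effective batch (batch_size * accum_steps) stays the same."""
    new = dict(config)
    new["batch_size"] = max(1, config["batch_size"] // 2)
    new["accum_steps"] = config.get("accum_steps", 1) * 2
    return new
```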
Loss Increasing
- Check learning rate (usually too high)
- Verify data shuffling
- Check gradient flow
- Add warmup schedule
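A warmup schedule ramps the learning rate up from near zero so early, noisy gradients don't destabilize training. A minimal linear-warmup sketch; the function name and 100-step default are illustrative:

```python
def lr_with_warmup(step, base_lr=1e-5, warmup_steps=100):
    """Linear warmup from ~0 to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```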
NaN Loss
- Add gradient clipping
- Reduce learning rate significantly
- Check model initialization
- Verify data normalization
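Gradient clipping by global norm, the first fix above, rescales all gradients when their combined L2 norm exceeds a threshold. A minimal sketch on plain floats (real training code would use the framework's built-in clipping, e.g. on tensors):

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Scale gradients down if their global L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads
```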
Contributing
See CLAUDE.md for development guidelines and testing requirements.
License
MIT
