@sloflash/ml-training-mcp v0.1.1
MCP server for ML training script development with progressive scaling and intelligent recovery
🚀 ML Training MCP Server
An intelligent MCP server for ML training script development with progressive scaling, automatic recovery, and comprehensive testing.
Key Features
🎯 Single File Focus
- ALL training logic stays in one training.py file
- No multi-file sprawl - everything in one place
- Small, manageable changes (50-100 lines max per delta)
📈 Progressive Scaling
- Start with minimal config (batch_size=1, seq_len=128)
- Gradually scale up to target configuration
- Automatic rollback on failure
- Memory of successful configurations
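The scaling loop above can be sketched as follows. This is a simplified illustration, not the server's actual implementation: `try_config` stands in for launching a short training run, and the doubling schedule is an assumption about the scaling policy.

```python
def scale_up_safely(try_config, start, target):
    """Double batch_size toward the target, rolling back to the
    last configuration that succeeded when a step fails."""
    last_good = None
    config = dict(start)
    while config["batch_size"] <= target["batch_size"]:
        if try_config(config):
            last_good = dict(config)   # remember this working config
            config["batch_size"] *= 2  # scale up gradually
        else:
            return last_good           # automatic rollback on failure
    return last_good
```

For example, if the hardware can only handle batch_size 8, the loop stops there and remembers that configuration rather than crashing at the target.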
🔍 Suspicious Pattern Detection
- Detects loss increasing after 20 steps
- Catches NaN/Inf in gradients
- Identifies memory leaks
- Spots training stalls
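The first two checks in the list can be pictured with a small sketch. The function name and return values are illustrative, not the server's API; the 20-step window comes from the detection rule above.

```python
import math

def detect_suspicious(losses, window=20):
    """Flag NaN/Inf loss immediately, and a rising loss trend
    once at least `window` steps have been observed."""
    if any(not math.isfinite(l) for l in losses):
        return "nan_or_inf"
    if len(losses) >= window and losses[-1] > losses[-window]:
        return "loss_increasing"
    return None
```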
🔄 Smart Recovery
- Analyzes failures intelligently
- Suggests specific fixes (not blind retries)
- Maximum 20 retries with 10-minute timeout
- Different fix attempted each time
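One way to picture "a different fix each time" with a retry cap is a rotating strategy list. The strategy names and ordering here are hypothetical; only the retry limit of 20 comes from the description above.

```python
FIXES = [  # hypothetical ordering of fix strategies
    "reduce_learning_rate",
    "add_gradient_clipping",
    "reduce_batch_size",
    "enable_grad_checkpointing",
]

def next_fix(attempt, max_retries=20):
    """Return a different fix for each attempt, cycling through
    the strategies; give up after max_retries attempts."""
    if attempt >= max_retries:
        return None  # retry budget exhausted
    return FIXES[attempt % len(FIXES)]
```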
🧪 Comprehensive Testing
- Mock scenarios for common failures
- Test before running on real GPUs
- Simulate long-running training
- Validate scaling paths safely
Quick Install (Recommended)
# No installation needed - use npx directly in your project
cd your-ml-project
# Create Claude configuration for THIS project only
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
"mcpServers": {
"ml-training": {
"command": "npx",
"args": ["-y", "@sloflash/ml-training-mcp"],
"env": {
"PROJECT_ROOT": "${workspaceFolder}"
}
}
}
}
EOF
# That's it! The MCP server is now available in THIS project only

Alternative Installation Methods
Method 1: Project-specific installation (for teams)
# Install as dev dependency in your ML project
cd your-ml-project
npm install --save-dev @sloflash/ml-training-mcp
# Configure Claude for this project
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
"mcpServers": {
"ml-training": {
"command": "node",
"args": ["./node_modules/@sloflash/ml-training-mcp/dist/index.js"],
"env": {
"PROJECT_ROOT": "${workspaceFolder}"
}
}
}
}
EOF

Method 2: Global Installation (if you prefer)
# Install globally
npm install -g @sloflash/ml-training-mcp
# Then in your project, configure Claude
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
"mcpServers": {
"ml-training": {
"command": "ml-training-mcp",
"args": [],
"env": {
"PROJECT_ROOT": "${workspaceFolder}"
}
}
}
}
EOF

One-Click Setup Script
Save this as setup-mcp.sh in your ML project:
#!/bin/bash
echo "🚀 Setting up ML Training MCP for this project..."
# Check Node.js
if ! command -v node &> /dev/null; then
echo "❌ Node.js is required. Install from nodejs.org"
exit 1
fi
# Create Claude config
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
"mcpServers": {
"ml-training": {
"command": "npx",
"args": ["-y", "@sloflash/ml-training-mcp"],
"env": {
"PROJECT_ROOT": "${workspaceFolder}",
"SINGLE_FILE_ENFORCEMENT": "true",
"MAX_RETRIES": "20",
"TIMEOUT_MINUTES": "10"
}
}
}
}
EOF
echo "✅ MCP configured for this project only"
echo "📁 Config: .claude/claude_desktop_config.json"
echo "🔄 Restart Claude Desktop to activate"

Then run:
chmod +x setup-mcp.sh
./setup-mcp.sh

Basic Usage
// 1. Initialize training script
await mcp.init_training_script({ model: "meta-llama/Llama-2-7b-hf" });
// 2. Test with minimal configuration
await mcp.test_minimal_config();
// 3. Scale up progressively
await mcp.scale_up_safely();
// 4. Run training with monitoring
await mcp.run_training({
timeout: 600000, // 10 minutes
mock: false // Use real training
});

Available Tools
Initialization
- init_training_script - Create training.py from template
- setup_environment - Initialize uv and .venv
- configure_model - Set HuggingFace model/tokenizer
Execution
- run_training - Execute with timeout and monitoring
- run_mock_training - Test with simulated scenarios
- test_minimal_config - Validate minimal setup
Monitoring
- detect_suspicious_patterns - Check for training issues
- get_training_status - Current status and retries
- monitor_memory - Track GPU memory usage
Recovery
- analyze_failure - Determine failure cause
- suggest_fix - Get intelligent fix suggestions
- rollback_to_working - Revert to last good version
- apply_small_delta - Make controlled changes
Scaling
- scale_up_safely - Increase parameters gradually
- scale_down - Reduce after failure
- find_max_viable - Determine hardware limits
Testing
Run All Tests
npm test

Specific Test Suites
npm run test:suspicious-patterns # Pattern detection
npm run test:scaling # Progressive scaling
npm run test:recovery            # Failure recovery

Integration Testing
npm run test:integration

Mock Scenarios
The server includes predefined scenarios for testing:
- happy_path - Everything works perfectly
- oom_after_warmup - OOM after 100 steps
- gradient_explosion - NaN loss at step 250
- memory_leak - Gradual memory growth
- loss_increase - Loss going up instead of down
- nccl_timeout - Multi-GPU communication failure
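One way to picture the scenario registry is a mapping from names to simulated outcomes. The scenario names and failure steps come from the list above; the dictionary shape and `run_mock` helper are illustrative, not the server's internal format.

```python
# Hypothetical registry shape; names and steps taken from the scenario list.
SCENARIOS = {
    "happy_path":         {"fails": False},
    "oom_after_warmup":   {"fails": True, "step": 100, "error": "CUDA out of memory"},
    "gradient_explosion": {"fails": True, "step": 250, "error": "loss is NaN"},
    "memory_leak":        {"fails": True, "error": "memory grows each step"},
    "loss_increase":      {"fails": True, "error": "loss trending upward"},
    "nccl_timeout":       {"fails": True, "error": "NCCL communication timeout"},
}

def run_mock(name):
    """Simulate a training run for a named scenario."""
    s = SCENARIOS[name]
    return {"success": not s["fails"], **{k: v for k, v in s.items() if k != "fails"}}
```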
Example Workflow
// Start with mock testing
const mockResult = await mcp.run_mock_training({
scenario: 'oom_after_warmup',
config: { batch_size: 4, seq_len: 512 }
});
if (!mockResult.success) {
// Analyze the failure
const analysis = await mcp.analyze_failure({
error: mockResult.error
});
// Get suggested fix
const fix = await mcp.suggest_fix();
// Apply the fix
await mcp.apply_small_delta({
changes: fix.code,
description: fix.description
});
// Retry with smaller config
await mcp.scale_down({ factor: 0.5 });
}

Suspicious Pattern Handling
When the system detects issues:
| Pattern | Detection | Automatic Action |
|---------|-----------|------------------|
| Loss increasing | After 20 steps | Reduce LR by 10x |
| Loss NaN | Immediate | Stop training |
| Memory leak | Continuous growth | Clear cache |
| No progress | 100 steps stalled | Check gradients |
| Gradient explosion | norm > 100 | Add clipping |
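The table's pattern-to-action mapping can be sketched as a simple dispatch. The pattern strings and function name are illustrative; the actions and the 10x learning-rate reduction come from the table.

```python
def automatic_action(pattern, lr):
    """Map a detected pattern to its automatic action from the table."""
    if pattern == "loss_increasing":
        return {"action": "reduce_lr", "new_lr": lr / 10}
    if pattern == "loss_nan":
        return {"action": "stop_training"}
    if pattern == "memory_leak":
        return {"action": "clear_cache"}
    if pattern == "no_progress":
        return {"action": "check_gradients"}
    if pattern == "gradient_explosion":
        return {"action": "add_clipping"}
    return {"action": "continue"}
```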
Configuration
Default configuration in Config class:
batch_size = 1
seq_len = 128
learning_rate = 1e-5
gradient_clip = 1.0
max_steps = 1000

Override via command line:
uv run python training.py --batch_size 4 --seq_len 512

File Structure
ml-training-mcp/
├── src/ # TypeScript source
│ ├── index.ts # MCP server
│ ├── script-manager.ts # Version control
│ ├── suspicious-detector.ts # Pattern detection
│ ├── scaling-orchestrator.ts # Progressive scaling
│ ├── retry-manager.ts # Retry logic
│ └── mock-executor.ts # Test scenarios
├── storage/ # Runtime storage
│ ├── training.py # THE training script
│ ├── training_memory.json # Working memory
│ └── working_versions/ # Checkpoints
├── templates/ # Script templates
│ └── training_template.py
└── test/                        # Test suites

Important Notes
- Single File Rule: Everything stays in training.py
- Environment: Always uses uv with .venv
- Testing First: Always test with mocks before real training
- Small Changes: Maximum 50-100 lines per modification
- Smart Recovery: Analyzes failures instead of retrying blindly
Troubleshooting
OOM Errors
- Reduce batch_size by 50%
- Reduce sequence_length by 50%
- Enable gradient checkpointing
- Use gradient accumulation
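Halving batch_size while doubling gradient accumulation keeps the effective batch (batch_size x accum_steps) unchanged. A pure-arithmetic sketch, assuming your training loop honors an `accum_steps` setting:

```python
def halve_batch(config):
    """Halve batch_size and double gradient accumulation so the
    effective batch (batch_size * accum_steps) stays the same."""
    new = dict(config)
    new["batch_size"] = max(1, config["batch_size"] // 2)
    new["accum_steps"] = config.get("accum_steps", 1) * 2
    return new
```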
Loss Increasing
- Check learning rate (usually too high)
- Verify data shuffling
- Check gradient flow
- Add warmup schedule
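A warmup schedule ramps the learning rate up from near zero so early, noisy gradients don't destabilize training. A minimal linear-warmup sketch; the function name and 100-step default are illustrative:

```python
def lr_with_warmup(step, base_lr=1e-5, warmup_steps=100):
    """Linear warmup from ~0 to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```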
NaN Loss
- Add gradient clipping
- Reduce learning rate significantly
- Check model initialization
- Verify data normalization
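Gradient clipping by global norm, the first fix above, rescales all gradients when their combined L2 norm exceeds a threshold. A minimal sketch on plain floats (real training code would use the framework's built-in clipping, e.g. on tensors):

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Scale gradients down if their global L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads
```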
Contributing
See CLAUDE.md for development guidelines and testing requirements.
License
MIT
