
@sloflash/ml-training-mcp

v0.1.1

MCP server for ML training script development with progressive scaling and intelligent recovery

Downloads

16

Readme

🚀 ML Training MCP Server

An intelligent MCP server for ML training script development with progressive scaling, automatic recovery, and comprehensive testing.

Key Features

🎯 Single File Focus

  • ALL training logic stays in one training.py file
  • No multi-file sprawl - everything in one place
  • Small, manageable changes (50-100 lines max per delta)

📈 Progressive Scaling

  • Start with minimal config (batch_size=1, seq_len=128)
  • Gradually scale up to target configuration
  • Automatic rollback on failure
  • Memory of successful configurations
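The scaling loop can be sketched in a few lines. This is an illustrative sketch, not the server's actual implementation; `scale_up_safely` and the `try_config` probe callback are hypothetical names:

```python
def scale_up_safely(try_config, start, target, factor=2):
    """Grow a config value (e.g. batch_size) from start toward target.

    try_config(value) -> bool is assumed to run a short training probe;
    on the first failure we stop and keep the last value that worked.
    """
    current, last_good = start, None
    while current <= target:
        if not try_config(current):
            break  # rollback: keep last_good
        last_good = current
        current *= factor
    return last_good

# probe that (hypothetically) fits in memory up to batch_size 8
best = scale_up_safely(lambda bs: bs <= 8, start=1, target=32)
```

Doubling until the probe fails, then keeping the last success, is the same idea the `scale_up_safely` / `rollback_to_working` tools expose.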

🔍 Suspicious Pattern Detection

  • Detects loss increasing after 20 steps
  • Catches NaN/Inf in gradients
  • Identifies memory leaks
  • Spots training stalls
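The checks above amount to simple heuristics over the loss history. A minimal sketch, with invented thresholds and function names (the server's real detector lives in `suspicious-detector.ts`):

```python
import math

def detect_suspicious(loss_history, window=20):
    """Flag common failure patterns in a list of per-step losses.

    Returns a list of issue labels; thresholds are illustrative only.
    """
    issues = []
    if any(math.isnan(l) or math.isinf(l) for l in loss_history):
        issues.append("nan_or_inf_loss")
    if len(loss_history) > window:
        recent = loss_history[-window:]
        if recent[-1] > recent[0]:            # loss trending up
            issues.append("loss_increasing")
        if max(recent) - min(recent) < 1e-6:  # no movement at all
            issues.append("training_stalled")
    return issues
```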

🔄 Smart Recovery

  • Analyzes failures intelligently
  • Suggests specific fixes (not blind retries)
  • Maximum 20 retries with 10-minute timeout
  • A different fix is attempted on each retry

🧪 Comprehensive Testing

  • Mock scenarios for common failures
  • Test before running on real GPUs
  • Simulate long-running training
  • Validate scaling paths safely

Quick Install (Recommended)

# No installation needed - use npx directly in your project
cd your-ml-project

# Create Claude configuration for THIS project only
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "ml-training": {
      "command": "npx",
      "args": ["-y", "@sloflash/ml-training-mcp"],
      "env": {
        "PROJECT_ROOT": "${workspaceFolder}"
      }
    }
  }
}
EOF

# That's it! The MCP server is now available in THIS project only

Alternative Installation Methods

Method 1: Project-specific installation (for teams)

# Install as dev dependency in your ML project
cd your-ml-project
npm install --save-dev @sloflash/ml-training-mcp

# Configure Claude for this project
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "ml-training": {
      "command": "node",
      "args": ["./node_modules/@sloflash/ml-training-mcp/dist/index.js"],
      "env": {
        "PROJECT_ROOT": "${workspaceFolder}"
      }
    }
  }
}
EOF

Method 2: Global Installation (if you prefer)

# Install globally
npm install -g @sloflash/ml-training-mcp

# Then in your project, configure Claude
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "ml-training": {
      "command": "ml-training-mcp",
      "args": [],
      "env": {
        "PROJECT_ROOT": "${workspaceFolder}"
      }
    }
  }
}
EOF

One-Click Setup Script

Save this as setup-mcp.sh in your ML project:

#!/bin/bash

echo "🚀 Setting up ML Training MCP for this project..."

# Check Node.js
if ! command -v node &> /dev/null; then
    echo "❌ Node.js is required. Install from nodejs.org"
    exit 1
fi

# Create Claude config
mkdir -p .claude
cat > .claude/claude_desktop_config.json << 'EOF'
{
  "mcpServers": {
    "ml-training": {
      "command": "npx",
      "args": ["-y", "@sloflash/ml-training-mcp"],
      "env": {
        "PROJECT_ROOT": "${workspaceFolder}",
        "SINGLE_FILE_ENFORCEMENT": "true",
        "MAX_RETRIES": "20",
        "TIMEOUT_MINUTES": "10"
      }
    }
  }
}
EOF

echo "✅ MCP configured for this project only"
echo "📁 Config: .claude/claude_desktop_config.json"
echo "🔄 Restart Claude Desktop to activate"

Then run:

chmod +x setup-mcp.sh
./setup-mcp.sh

Basic Usage

// 1. Initialize training script
await mcp.init_training_script({ model: "meta-llama/Llama-2-7b-hf" });

// 2. Test with minimal configuration
await mcp.test_minimal_config();

// 3. Scale up progressively
await mcp.scale_up_safely();

// 4. Run training with monitoring
await mcp.run_training({ 
  timeout: 600000,  // 10 minutes
  mock: false       // Use real training
});

Available Tools

Initialization

  • init_training_script - Create training.py from template
  • setup_environment - Initialize uv and .venv
  • configure_model - Set HuggingFace model/tokenizer

Execution

  • run_training - Execute with timeout and monitoring
  • run_mock_training - Test with simulated scenarios
  • test_minimal_config - Validate minimal setup

Monitoring

  • detect_suspicious_patterns - Check for training issues
  • get_training_status - Current status and retries
  • monitor_memory - Track GPU memory usage

Recovery

  • analyze_failure - Determine failure cause
  • suggest_fix - Get intelligent fix suggestions
  • rollback_to_working - Revert to last good version
  • apply_small_delta - Make controlled changes

Scaling

  • scale_up_safely - Increase parameters gradually
  • scale_down - Reduce after failure
  • find_max_viable - Determine hardware limits

Testing

Run All Tests

npm test

Specific Test Suites

npm run test:suspicious-patterns  # Pattern detection
npm run test:scaling              # Progressive scaling
npm run test:recovery             # Failure recovery

Integration Testing

npm run test:integration

Mock Scenarios

The server includes predefined scenarios for testing:

  1. happy_path - Everything works perfectly
  2. oom_after_warmup - OOM after 100 steps
  3. gradient_explosion - NaN loss at step 250
  4. memory_leak - Gradual memory growth
  5. loss_increase - Loss going up instead of down
  6. nccl_timeout - Multi-GPU communication failure
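A mock scenario is essentially a canned loss curve. The sketch below fakes three of the scenarios above; the curves are invented for illustration and do not mirror the server's actual mock executor:

```python
def run_mock_training(scenario, steps=300):
    """Return a simulated per-step loss list for a named scenario."""
    if scenario == "happy_path":
        return [1.0 / (s + 1) for s in range(steps)]       # steady decrease
    if scenario == "gradient_explosion":
        return [1.0] * 250 + [float("nan")] * (steps - 250)  # NaN at step 250
    if scenario == "loss_increase":
        return [1.0 + 0.01 * s for s in range(steps)]      # loss going up
    raise ValueError(f"unknown scenario: {scenario}")
```

Driving the pattern detector against curves like these is how failure handling can be validated without touching a GPU.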

Example Workflow

// Start with mock testing
const mockResult = await mcp.run_mock_training({
  scenario: 'oom_after_warmup',
  config: { batch_size: 4, seq_len: 512 }
});

if (!mockResult.success) {
  // Analyze the failure
  const analysis = await mcp.analyze_failure({ 
    error: mockResult.error 
  });
  
  // Get suggested fix
  const fix = await mcp.suggest_fix();
  
  // Apply the fix
  await mcp.apply_small_delta({
    changes: fix.code,
    description: fix.description
  });
  
  // Retry with smaller config
  await mcp.scale_down({ factor: 0.5 });
}

Suspicious Pattern Handling

When the system detects issues:

| Pattern | Detection | Automatic Action |
|---------|-----------|------------------|
| Loss increasing | After 20 steps | Reduce LR by 10x |
| Loss NaN | Immediate | Stop training |
| Memory leak | Continuous growth | Clear cache |
| No progress | 100 steps stalled | Check gradients |
| Gradient explosion | norm > 100 | Add clipping |

Configuration

Default configuration in Config class:

batch_size = 1
seq_len = 128  
learning_rate = 1e-5
gradient_clip = 1.0
max_steps = 1000

Override via command line:

uv run python training.py --batch_size 4 --seq_len 512
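One way the defaults-plus-flags pattern can be wired up is to derive the CLI from the config fields themselves. A sketch under the assumption that `Config` is a plain dataclass (the real class in `training.py` may differ):

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class Config:
    batch_size: int = 1
    seq_len: int = 128
    learning_rate: float = 1e-5
    gradient_clip: float = 1.0
    max_steps: int = 1000

def parse_config(argv=None):
    """Expose every Config default as a --flag of the matching type."""
    parser = argparse.ArgumentParser()
    for f in fields(Config):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return Config(**vars(parser.parse_args(argv)))

cfg = parse_config(["--batch_size", "4", "--seq_len", "512"])
```

Flags not passed on the command line keep their dataclass defaults, so the override example above leaves `learning_rate` untouched.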

File Structure

ml-training-mcp/
├── src/                      # TypeScript source
│   ├── index.ts             # MCP server
│   ├── script-manager.ts    # Version control
│   ├── suspicious-detector.ts # Pattern detection
│   ├── scaling-orchestrator.ts # Progressive scaling
│   ├── retry-manager.ts     # Retry logic
│   └── mock-executor.ts     # Test scenarios
├── storage/                  # Runtime storage
│   ├── training.py          # THE training script
│   ├── training_memory.json # Working memory
│   └── working_versions/    # Checkpoints
├── templates/               # Script templates
│   └── training_template.py
└── test/                    # Test suites

Important Notes

  1. Single File Rule: Everything stays in training.py
  2. Environment: Always uses uv with .venv
  3. Testing First: Always test with mocks before real training
  4. Small Changes: Maximum 50-100 lines per modification
  5. Smart Recovery: Analyzes failures, doesn't blind retry

Troubleshooting

OOM Errors

  • Reduce batch_size by 50%
  • Reduce sequence_length by 50%
  • Enable gradient checkpointing
  • Use gradient accumulation
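Gradient accumulation trades memory for steps: run several small micro-batches, sum their gradients, and step once. The arithmetic is just a ceiling division; this helper name is illustrative:

```python
def accumulation_schedule(target_batch, max_micro_batch):
    """Given a target effective batch size and the largest micro-batch
    that fits in memory, return (micro_batch, accumulation_steps)."""
    micro = min(target_batch, max_micro_batch)
    steps = -(-target_batch // micro)  # ceiling division
    return micro, steps

# target batch of 32 when only micro-batches of 4 fit in memory
micro, steps = accumulation_schedule(32, 4)
```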

Loss Increasing

  • Check learning rate (usually too high)
  • Verify data shuffling
  • Check gradient flow
  • Add warmup schedule

NaN Loss

  • Add gradient clipping
  • Reduce learning rate significantly
  • Check model initialization
  • Verify data normalization
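Gradient clipping rescales the whole gradient vector so its L2 norm never exceeds a cap, which is the idea behind `torch.nn.utils.clip_grad_norm_`. A dependency-free sketch on a flat list of values:

```python
import math

def clip_gradients(grads, max_norm=1.0):
    """Scale gradients so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return grads  # already within the cap
```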

Contributing

See CLAUDE.md for development guidelines and testing requirements.

License

MIT