mcp-slurm
v0.2.0
SLURM MCP server for HPC cluster management
SLURM MCP Server
A Model Context Protocol (MCP) server for managing SLURM (Simple Linux Utility for Resource Management) clusters. This server allows AI assistants to interact with HPC clusters via SSH to submit jobs, check resources, manage queues, and monitor job status.
🚀 Recent Major Improvements
Version 0.2.0 includes significant architectural improvements:
✅ Fixed Critical Issues
- Persistent SSH Connections: Eliminated the performance bottleneck of creating new SSH connections for every tool call
- Proper Tool Registration: Fixed MCP framework integration - tools are now correctly auto-discovered and registered
- Structured Error Handling: Replaced string-based errors with structured JSON responses for better AI interaction
- Input Sanitization: Added comprehensive validation and escaping to prevent command injection vulnerabilities
- Improved Type Safety: Enhanced TypeScript usage throughout the codebase
✅ Enhanced Security
- Command injection protection with input sanitization
- Whitelist-based parameter validation
- Secure temporary file handling
- Validated file paths and job IDs
✅ Better Reliability
- Persistent SSH connection management
- Graceful error handling and recovery
- Structured logging with Winston
- Comprehensive input validation
✅ Improved Performance
- Single persistent SSH connection vs. new connection per operation
- Reduced latency from ~3-5 seconds to milliseconds
- Efficient connection pooling
- Optimized command execution
Features
- Cluster Information: Query node status, partitions, and resource availability
- Job Submission: Submit jobs with customizable parameters including resource requests
- Job Management: Cancel, hold, release, suspend, resume, requeue, and modify jobs
- Script Upload: Upload job scripts to the cluster and optionally submit them
- File Operations: View job outputs, list directories, and manage files
- SSH Connectivity: Secure connection to login nodes with password or key authentication
- Structured Logging: Comprehensive logging for debugging and monitoring
- Input Validation: Robust security against command injection attacks
Quick Start
1. Installation
# Clone the repository
git clone <your-repo>
cd mcp-slurm
# Install dependencies
npm install
# Build the project
npm run build
2. Configuration
Create a .env file in the project root with your cluster connection details:
# Required: Cluster connection details
SLURM_HOST=your-cluster-login-node.example.com
SLURM_USERNAME=your-username
# Authentication (choose one)
SLURM_PASSWORD=your-password
# OR
SLURM_SSH_KEY_PATH=/path/to/your/private/key
# Optional: Connection settings
SLURM_PORT=22
# Optional: Default SLURM parameters
SLURM_DEFAULT_PARTITION=compute
SLURM_DEFAULT_ACCOUNT=your-account
# Optional: Logging configuration
LOG_LEVEL=info
NODE_ENV=development
3. Running the Server
# Start the server
npm start
# Or run in development mode
npm run watch
The server will start and maintain a persistent connection to your SLURM cluster.
Tools Available
1. slurm_info
Get cluster information including nodes, partitions, queues, and job accounting.
Parameters:
- command_type: Type of command (sinfo, squeue, sacct, scontrol)
- detailed: Get detailed output (optional)
- partition: Query specific partition (optional)
- node: Query specific node (optional)
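One way to picture how these parameters could become a SLURM command line is sketched below. This is an illustrative assumption, not the server's actual implementation: `SlurmInfoParams`, `buildInfoArgv`, and the name pattern are hypothetical, and the flags the server really emits may differ (the example response shows `sinfo -N`). Only the long options `--long`, `--partition`, `--nodes`, and `--nodelist` and the `scontrol show` subcommand are taken from SLURM itself.

```typescript
// Hypothetical sketch: type and function names are illustrative, not the server's API.
type InfoCommand = "sinfo" | "squeue" | "sacct" | "scontrol";

interface SlurmInfoParams {
  command_type: InfoCommand;
  detailed?: boolean;
  partition?: string;
  node?: string;
}

const INFO_COMMANDS: readonly string[] = ["sinfo", "squeue", "sacct", "scontrol"];
// Assumed naming rule: letters, digits, underscore, dot, and hyphen only.
const SAFE_NAME = /^[A-Za-z0-9_.-]+$/;

function buildInfoArgv(params: SlurmInfoParams): string[] {
  // Whitelist the command even though the type already restricts it,
  // because tool input arrives as untyped JSON at runtime.
  if (!INFO_COMMANDS.includes(params.command_type)) {
    throw new Error(`Unsupported command_type: ${params.command_type}`);
  }
  for (const value of [params.partition, params.node]) {
    if (value !== undefined && !SAFE_NAME.test(value)) {
      throw new Error(`Invalid partition or node name: ${value}`);
    }
  }

  if (params.command_type === "scontrol") {
    // scontrol uses "show <entity> [name]" instead of filter flags.
    if (params.node !== undefined) return ["scontrol", "show", "node", params.node];
    const showArgv = ["scontrol", "show", "partition"];
    if (params.partition !== undefined) showArgv.push(params.partition);
    return showArgv;
  }

  const argv: string[] = [params.command_type];
  if (params.detailed) argv.push("--long");
  if (params.partition !== undefined) argv.push(`--partition=${params.partition}`);
  if (params.node !== undefined) {
    // sinfo filters nodes with --nodes; squeue and sacct use --nodelist.
    const flag = params.command_type === "sinfo" ? "--nodes" : "--nodelist";
    argv.push(`${flag}=${params.node}`);
  }
  return argv;
}

console.log(buildInfoArgv({ command_type: "sinfo", detailed: true, partition: "compute" }).join(" "));
// prints "sinfo --long --partition=compute"
```

Returning an argument array rather than a single string keeps each value separate until the last moment, which makes it easier to quote every argument consistently before it reaches the remote shell.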
Example Response:
{
  "success": true,
  "command_type": "sinfo",
  "command_executed": "sinfo -N",
  "output": "NODELIST NODES PARTITION STATE\nnode001 1 compute idle\n...",
  "detailed": false
}
2. slurm_submit
Submit jobs to the SLURM scheduler with customizable parameters.
Parameters:
- job_name: Name for the job (required)
- command: Command or script to execute (required)
- partition: Partition to submit to (optional)
- nodes: Number of nodes (optional)
- cpus_per_task: CPUs per task (optional)
- memory: Memory per node (optional)
- time_limit: Time limit (optional)
- account: Account to charge (optional)
- wait: Wait for job completion (optional)
- And many more...
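The mapping from these parameters to an `sbatch` invocation might look roughly like the sketch below. It assumes the command is passed through `--wrap`, which may not be how the server submits jobs; `SubmitParams`, `quoteArg`, `positiveInt`, `buildSbatchCommand`, and `parseJobId` are hypothetical names. The `sbatch` options used (`--job-name`, `--partition`, `--nodes`, `--cpus-per-task`, `--mem`, `--time`, `--account`, `--wait`, `--wrap`) are standard SLURM flags.

```typescript
// Hypothetical sketch: names are illustrative, not the server's API.
interface SubmitParams {
  job_name: string;
  command: string;
  partition?: string;
  nodes?: number;
  cpus_per_task?: number;
  memory?: string;
  time_limit?: string;
  account?: string;
  wait?: boolean;
}

// Quote a value so a POSIX shell passes it through as one literal argument.
function quoteArg(value: string): string {
  return `'${value.replace(/'/g, "'\\''")}'`;
}

function positiveInt(name: string, value: number): number {
  if (!Number.isInteger(value) || value < 1) {
    throw new Error(`${name} must be a positive integer`);
  }
  return value;
}

function buildSbatchCommand(p: SubmitParams): string {
  const args = ["sbatch", `--job-name=${quoteArg(p.job_name)}`];
  if (p.partition !== undefined) args.push(`--partition=${quoteArg(p.partition)}`);
  if (p.nodes !== undefined) args.push(`--nodes=${positiveInt("nodes", p.nodes)}`);
  if (p.cpus_per_task !== undefined) {
    args.push(`--cpus-per-task=${positiveInt("cpus_per_task", p.cpus_per_task)}`);
  }
  if (p.memory !== undefined) args.push(`--mem=${quoteArg(p.memory)}`);
  if (p.time_limit !== undefined) args.push(`--time=${quoteArg(p.time_limit)}`);
  if (p.account !== undefined) args.push(`--account=${quoteArg(p.account)}`);
  if (p.wait) args.push("--wait");
  // --wrap makes sbatch generate a minimal batch script around the command.
  args.push(`--wrap=${quoteArg(p.command)}`);
  return args.join(" ");
}

// Extract the job ID from sbatch's "Submitted batch job <id>" line.
function parseJobId(sbatchOutput: string): string | null {
  const match = /Submitted batch job (\d+)/.exec(sbatchOutput);
  return match?.[1] ?? null;
}

console.log(buildSbatchCommand({ job_name: "test", command: "echo hi", nodes: 2, wait: true }));
// prints "sbatch --job-name='test' --nodes=2 --wait --wrap='echo hi'"
console.log(parseJobId("Submitted batch job 12345"));
// prints "12345"
```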
Example Response:
{
  "success": true,
  "message": "Job submitted successfully",
  "job_id": "12345",
  "sbatch_output": "Submitted batch job 12345",
  "wait_for_completion": false
}
3. slurm_job_control
Control SLURM jobs: cancel, hold, release, suspend, resume, requeue, or modify job parameters.
Parameters:
- job_id: Job ID to control (required)
- action: Action to perform (required)
- reason: Reason for action (optional)
- modify_parameter: Parameter to modify (for modify action)
- modify_value: New value (for modify action)
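As a rough illustration of how each action could map to a SLURM command, consider the sketch below. The names `JobControlParams` and `buildJobControlArgv`, the job ID pattern, and the whitelist of modifiable fields are all assumptions made for the example; the underlying commands (`scancel`, `scontrol hold|release|suspend|resume|requeue`, and `scontrol update JobId=...`) are standard SLURM. The `reason` parameter is left out for brevity.

```typescript
// Hypothetical sketch: names, patterns, and the field whitelist are assumptions.
type JobAction = "cancel" | "hold" | "release" | "suspend" | "resume" | "requeue" | "modify";

interface JobControlParams {
  job_id: string;
  action: JobAction;
  modify_parameter?: string;
  modify_value?: string;
}

// Accepts plain job IDs ("12345") and array task IDs ("12345_7").
const JOB_ID = /^\d+(?:_\d+)?$/;
const MODIFIABLE: readonly string[] = ["TimeLimit", "Partition", "Priority"];
const SAFE_VALUE = /^[A-Za-z0-9_:.,-]+$/;

function buildJobControlArgv(p: JobControlParams): string[] {
  if (!JOB_ID.test(p.job_id)) {
    throw new Error(`Invalid job ID: ${p.job_id}`);
  }
  switch (p.action) {
    case "cancel":
      return ["scancel", p.job_id];
    case "hold":
    case "release":
    case "suspend":
    case "resume":
    case "requeue":
      // These actions share the form "scontrol <action> <job_id>".
      return ["scontrol", p.action, p.job_id];
    case "modify": {
      const field = p.modify_parameter;
      const value = p.modify_value;
      if (field === undefined || !MODIFIABLE.includes(field)) {
        throw new Error(`Parameter is not modifiable: ${String(field)}`);
      }
      if (value === undefined || !SAFE_VALUE.test(value)) {
        throw new Error(`Invalid value for ${field}`);
      }
      return ["scontrol", "update", `JobId=${p.job_id}`, `${field}=${value}`];
    }
    default:
      throw new Error(`Unsupported action: ${String(p.action)}`);
  }
}

console.log(buildJobControlArgv({ job_id: "12345", action: "cancel" }).join(" "));
// prints "scancel 12345"
```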
Example Response:
{
  "success": true,
  "action": "cancel",
  "job_id": "12345",
  "message": "Successfully performed cancel on job 12345",
  "command_executed": "scancel 12345"
}
4. slurm_script
Upload a job script to the cluster and optionally submit it to SLURM.
Parameters:
- script_name: Name for script file (required)
- script_content: Content of the script (required)
- remote_path: Directory to store script (optional)
- submit_immediately: Submit after upload (optional, default: true)
- additional_sbatch_args: Extra sbatch arguments (optional)
- wait: Wait for completion (optional)
Example Response:
{
  "success": true,
  "message": "Script uploaded successfully to /home/user/job.sh",
  "script_path": "/home/user/job.sh",
  "submitted": true,
  "job_id": "12346"
}
5. slurm_files
Manage files on the cluster: list directories, view job outputs, find job output files.
Parameters:
- action: Action to perform (required)
- path: File or directory path (optional)
- job_id: Job ID for finding outputs (optional)
- lines: Number of lines for head/tail (optional)
- pattern: Search pattern (optional)
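The sketch below shows one possible way to turn a file action into a command while rejecting path traversal. It is an assumption-laden illustration: only the `list` action appears in this README, so the `head` and `tail` action names, the default of 20 lines, and the helpers `assertSafePath` and `buildFileArgv` are hypothetical.

```typescript
// Hypothetical sketch: action names other than "list" are assumptions.
type FileAction = "list" | "head" | "tail";

function assertSafePath(path: string): string {
  if (path.length === 0 || path.includes("\0")) {
    throw new Error("Path must be a non-empty string without NUL bytes");
  }
  // Reject any ".." segment outright instead of trying to normalize it away.
  if (path.split("/").includes("..")) {
    throw new Error(`Path traversal is not allowed: ${path}`);
  }
  return path;
}

function buildFileArgv(action: FileAction, path: string, lines: number = 20): string[] {
  const safePath = assertSafePath(path);
  if (!Number.isInteger(lines) || lines < 1) {
    throw new Error("lines must be a positive integer");
  }
  // "--" ends option parsing, so a path starting with "-" is not read as a flag.
  switch (action) {
    case "list":
      return ["ls", "-la", "--", safePath];
    case "head":
      return ["head", "-n", String(lines), "--", safePath];
    case "tail":
      return ["tail", "-n", String(lines), "--", safePath];
    default:
      throw new Error(`Unsupported action: ${String(action)}`);
  }
}

console.log(buildFileArgv("list", "/home/user").join(" "));
// prints "ls -la -- /home/user"
```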
Example Response:
{
  "success": true,
  "action": "list",
  "output": "total 156\ndrwxr-xr-x 2 user group 4096 Jan 15 10:30 scripts\n...",
  "path": "/home/user"
}
Security Features
Input Sanitization
All user inputs are validated and sanitized:
- File paths are validated and escaped to prevent path traversal and shell injection
- Job IDs are validated against allowed patterns
- Command parameters are whitelisted
- Shell arguments are properly escaped
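To make the escaping step concrete, here is a minimal sketch of POSIX single-quote escaping, a common technique for passing untrusted values to a remote shell. The helpers `shellQuote` and `joinCommand` are hypothetical names; the server's own sanitization code may work differently.

```typescript
// Hypothetical helpers; not the server's actual API.

// Wrap a value in single quotes so a POSIX shell treats it as one literal argument.
function shellQuote(value: string): string {
  // Inside single quotes only the quote character itself is special, so each '
  // becomes: close quote, backslash-escaped quote, reopen quote.
  return `'${value.replace(/'/g, "'\\''")}'`;
}

// Quote every element of an argument vector and join it into one command line.
function joinCommand(argv: readonly string[]): string {
  return argv.map(shellQuote).join(" ");
}

console.log(joinCommand(["ls", "-la", "my dir; rm -rf ~"]));
// prints "'ls' '-la' 'my dir; rm -rf ~'"
```

Because the whole value sits inside single quotes, characters such as `;`, `$`, and backticks reach the command as plain text instead of being interpreted by the shell.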
Connection Security
- Persistent SSH connections with proper authentication
- Support for both password and key-based authentication
- Secure temporary file handling
- Connection cleanup on server shutdown
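The idea of a persistent connection with cleanup on shutdown can be sketched as a small wrapper that opens the connection lazily, shares it across calls, and retries after a failed attempt. This is a generic illustration under stated assumptions: `PersistentConnection`, `Closable`, and the `connect` factory are hypothetical, and a real factory would wrap the SSH client's own connect call.

```typescript
// Hypothetical sketch: a generic lazy, shared connection holder.
interface Closable {
  dispose(): void;
}

class PersistentConnection<T extends Closable> {
  private pending: Promise<T> | null = null;
  private readonly connect: () => Promise<T>;

  constructor(connect: () => Promise<T>) {
    this.connect = connect;
  }

  // Return the shared connection, opening it on first use.
  get(): Promise<T> {
    if (this.pending === null) {
      const attempt = this.connect();
      // If connecting fails, forget the attempt so the next call retries.
      attempt.catch(() => {
        if (this.pending === attempt) this.pending = null;
      });
      this.pending = attempt;
    }
    return this.pending;
  }

  // Dispose of the connection, for example from a shutdown handler.
  async close(): Promise<void> {
    const current = this.pending;
    this.pending = null;
    if (current !== null) {
      try {
        (await current).dispose();
      } catch {
        // The connection never opened, so there is nothing to dispose.
      }
    }
  }
}

// Usage with a fake connection that counts how often it is opened.
let timesOpened = 0;
const shared = new PersistentConnection(() => {
  timesOpened += 1;
  return Promise.resolve({ dispose(): void {} });
});
const firstCall = shared.get();
const secondCall = shared.get();
console.log(firstCall === secondCall, timesOpened);
// prints "true 1"
```

Caching the promise rather than the resolved connection means that tool calls arriving while the first connection is still being established wait on the same attempt instead of opening duplicates.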
Error Handling
All tools return structured error responses:
{
  "success": false,
  "error": {
    "code": "SLURM_COMMAND_FAILED",
    "message": "Failed to submit job",
    "details": "sbatch: error: invalid partition name"
  }
}
Common error codes:
- EXECUTION_ERROR: General execution errors
- SLURM_COMMAND_FAILED: SLURM-specific command failures
- COMMAND_INJECTION_DETECTED: Security validation failures
- SSH_CONNECTION_FAILED: Connection issues
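A client consuming these responses could model them as a discriminated union on the `success` field, as in the sketch below. The type and function names (`ToolResult`, `ToolFailure`, `failure`, `summarize`) are hypothetical and chosen for illustration; only the response shape and the four error codes come from this README.

```typescript
// Hypothetical types modelling the documented response shape.
type ErrorCode =
  | "EXECUTION_ERROR"
  | "SLURM_COMMAND_FAILED"
  | "COMMAND_INJECTION_DETECTED"
  | "SSH_CONNECTION_FAILED";

interface ToolFailure {
  success: false;
  error: { code: ErrorCode; message: string; details?: string };
}

type ToolSuccess<T> = { success: true } & T;
type ToolResult<T> = ToolSuccess<T> | ToolFailure;

// Build a structured error; "details" is only included when provided.
function failure(code: ErrorCode, message: string, details?: string): ToolFailure {
  return {
    success: false,
    error: { code, message, ...(details !== undefined ? { details } : {}) },
  };
}

// Checking "success" narrows the union, so each branch sees only its own fields.
function summarize(result: ToolResult<{ job_id: string }>): string {
  if (result.success) return `Job ${result.job_id} submitted`;
  return `${result.error.code}: ${result.error.message}`;
}

const failed = failure("SLURM_COMMAND_FAILED", "Failed to submit job", "sbatch: error: invalid partition name");
console.log(summarize(failed));
// prints "SLURM_COMMAND_FAILED: Failed to submit job"
console.log(summarize({ success: true, job_id: "12345" }));
// prints "Job 12345 submitted"
```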
Logging
The server includes comprehensive logging:
# Set log level
export LOG_LEVEL=debug
# Enable file logging (production)
export NODE_ENV=production
Logs include:
- SSH connection events
- Command executions
- Tool invocations
- Error tracking
- Performance metrics
Configuration Options
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| SLURM_HOST | Cluster hostname | - | Yes |
| SLURM_USERNAME | SSH username | - | Yes |
| SLURM_PASSWORD | SSH password | - | * |
| SLURM_SSH_KEY_PATH | Private key path | - | * |
| SLURM_PORT | SSH port | 22 | No |
| SLURM_DEFAULT_PARTITION | Default partition | - | No |
| SLURM_DEFAULT_ACCOUNT | Default account | - | No |
| LOG_LEVEL | Logging level | info | No |
| NODE_ENV | Environment | development | No |
* Either password or SSH key is required
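The rules in this table, including the requirement that either a password or an SSH key is set, could be checked at startup along the lines of the sketch below. `SlurmConfig`, `requireVar`, and `loadConfig` are hypothetical names, and the server itself may validate its environment differently (for example with Zod). The environment is passed in as a plain object so the function can be exercised without touching the real process environment.

```typescript
// Hypothetical sketch of environment validation; not the server's actual loader.
interface SlurmConfig {
  host: string;
  username: string;
  port: number;
  password: string | undefined;
  sshKeyPath: string | undefined;
  defaultPartition: string | undefined;
  defaultAccount: string | undefined;
  logLevel: string;
}

type Env = Record<string, string | undefined>;

function requireVar(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}

function loadConfig(env: Env): SlurmConfig {
  const port = Number(env["SLURM_PORT"] ?? "22");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error("SLURM_PORT must be an integer between 1 and 65535");
  }
  const password = env["SLURM_PASSWORD"];
  const sshKeyPath = env["SLURM_SSH_KEY_PATH"];
  // At least one authentication method has to be configured.
  if (!password && !sshKeyPath) {
    throw new Error("Set either SLURM_PASSWORD or SLURM_SSH_KEY_PATH");
  }
  return {
    host: requireVar(env, "SLURM_HOST"),
    username: requireVar(env, "SLURM_USERNAME"),
    port,
    password,
    sshKeyPath,
    defaultPartition: env["SLURM_DEFAULT_PARTITION"],
    defaultAccount: env["SLURM_DEFAULT_ACCOUNT"],
    logLevel: env["LOG_LEVEL"] ?? "info",
  };
}

const exampleConfig = loadConfig({
  SLURM_HOST: "your-cluster-login-node.example.com",
  SLURM_USERNAME: "your-username",
  SLURM_SSH_KEY_PATH: "/path/to/your/private/key",
});
console.log(exampleConfig.port, exampleConfig.logLevel);
// prints "22 info"
```

In a Node.js process the same function could be called as `loadConfig(process.env)`.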
Development
Building
npm run build
Testing Configuration
npm run test
Development Mode
npm run watch
Troubleshooting
Common Issues
Connection Failures
- Verify SLURM_HOST is accessible
- Check SSH credentials
- Ensure firewall allows connections
Permission Errors
- Verify SSH key permissions (600)
- Check SLURM account access
- Validate partition permissions
Command Failures
- Check SLURM configuration
- Verify resource availability
- Review job parameters
Debug Mode
Enable debug logging:
export LOG_LEVEL=debug
npm start
Connection Testing
Test your configuration:
npm run test
Architecture
The server uses:
- MCP Framework: For tool registration and client communication
- NodeSSH: For secure SSH connections
- Winston: For structured logging
- Zod: For input validation
- TypeScript: For type safety
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details.
Changelog
v0.2.0 (Latest)
- ✅ Fixed persistent SSH connection management
- ✅ Implemented proper MCP tool registration
- ✅ Added structured error handling
- ✅ Enhanced input sanitization and security
- ✅ Improved TypeScript usage
- ✅ Added comprehensive logging
- ✅ Performance improvements (per-call latency reduced from ~3-5 seconds to milliseconds)
v0.1.0
- Initial release with basic SLURM functionality
