@timemacro/service-guardian

v1.0.15

Published

a month ago

Enterprise Linux service monitor with auto-restart, crash recovery, OOM detection, email alerts, health checks. Alternative to PM2, Supervisor, Monit for systemd services. Monitor MySQL, Nginx, Apache, PostgreSQL, Redis. Zero-downtime production monitorin

derricksiawor


Service Guardian

Enterprise-grade automatic service monitoring, recovery, and alerting system for Linux servers

📦 Installation

⚠️ IMPORTANT: This is a global CLI tool. Always install with the -g flag:

npm install -g @timemacro/service-guardian

Or with sudo if needed:

sudo npm install -g @timemacro/service-guardian

Service Guardian is a production-ready Node.js daemon that monitors your Linux services, automatically recovers from failures, and sends intelligent alerts. Built for system administrators and DevOps teams who need reliable service uptime without manual intervention.

The Problem It Solves

Ever had MySQL crash at 3 AM because the OOM killer stepped in? Or Apache go down during peak traffic? Service Guardian helps keep your critical services running through:

Fast failure detection - Verifies that services actually work, not just that a process exists (checks run every 30 seconds by default)
Smart auto-recovery - Distinguishes crashes from manual stops and restarts only genuine failures
Intelligent alerting - Batched, actionable alerts with system context, not spam

Key Features

🛡️ Core Monitoring

Systemd Integration - Deep integration with systemd for accurate service state detection
Intelligent Failure Analysis - Differentiates between:
- OOM (Out of Memory) kills
- Service crashes
- Manual stops (won't restart these)
- Dependency failures
Parallel Monitoring - Efficiently monitors multiple services simultaneously
Resource-Aware - Monitors CPU, memory, disk usage before taking actions

🔄 Advanced Recovery

Smart Auto-Restart - With exponential backoff to prevent restart loops
Dependency Management - Handles service dependencies and detects circular dependencies
Recovery Actions - Beyond just restart:
- Clear system cache
- Kill memory-intensive processes
- Reload configurations
- Clean zombie processes
- Repair databases
Maintenance Windows - Pause monitoring during planned maintenance

🏥 Health Checks

Beyond Process Monitoring - Tests if services actually work:
- TCP port checks (is MySQL accepting connections?)
- HTTP endpoint checks (is API returning 200?)
- Custom script checks (complex business logic)
- Command checks (simple shell commands)
Failure Thresholds - Alerts only after a configurable number of consecutive failures, which reduces false alarms
User-Friendly Messages - Clear explanations of what's wrong and how to fix it
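
The failure-threshold idea can be sketched as a small per-service counter. This is an illustrative sketch only, not Service Guardian's actual code; `createFailureTracker` and its `threshold` parameter are hypothetical names.

```javascript
// Hypothetical helper: tracks consecutive health-check failures per service.
function createFailureTracker(threshold) {
  const counts = new Map(); // service name -> consecutive failure count

  return {
    // Record one check result; returns true only when an alert should fire.
    record(service, healthy) {
      if (healthy) {
        counts.set(service, 0); // any success resets the streak
        return false;
      }
      const failures = (counts.get(service) || 0) + 1;
      counts.set(service, failures);
      // Fire exactly once, when the streak first reaches the threshold.
      return failures === threshold;
    },
  };
}

const tracker = createFailureTracker(3);
const results = [false, false, true, false, false, false, false];
const fired = results.map((ok) => tracker.record('mysql', ok));
console.log(fired); // [ false, false, false, false, false, true, false ]
```

A single successful check resets the streak, so a brief hiccup followed by recovery never reaches the threshold.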

📧 Intelligent Alerting

Beautiful HTML Emails - Professional, readable alert emails with system context
Alert Aggregation - Batches multiple alerts to reduce email spam
Rate Limiting - Prevents alert storms during major incidents
Cooldown Periods - Won't repeatedly alert for the same issue
Contextual Information - Includes failure analysis, resource usage, recent logs
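
One way to combine a per-issue cooldown with an hourly cap is sketched below. The gate itself is an assumption made for illustration; only the default values (600 seconds, 10 alerts per hour) come from the Configuration Options section, and `createAlertGate` is a hypothetical name.

```javascript
// Hypothetical alert gate: per-issue cooldown plus a sliding one-hour cap.
function createAlertGate({ cooldownSeconds = 600, maxPerHour = 10 } = {}) {
  const lastSent = new Map(); // issue key -> timestamp (ms) of last alert
  let sentTimes = [];         // timestamps (ms) of alerts sent in the last hour

  return function shouldSend(issueKey, nowMs) {
    // Drop entries older than one hour from the sliding window.
    sentTimes = sentTimes.filter((t) => nowMs - t < 3600 * 1000);
    if (sentTimes.length >= maxPerHour) return false; // rate limited

    const previous = lastSent.get(issueKey);
    if (previous !== undefined && nowMs - previous < cooldownSeconds * 1000) {
      return false; // still cooling down for this issue
    }
    lastSent.set(issueKey, nowMs);
    sentTimes.push(nowMs);
    return true;
  };
}

// Small cap so the rate limit is visible in a short example.
const shouldSend = createAlertGate({ cooldownSeconds: 600, maxPerHour: 2 });
const decisions = [
  shouldSend('mysql:down', 0),           // first alert
  shouldSend('mysql:down', 300 * 1000),  // same issue, inside cooldown
  shouldSend('nginx:down', 300 * 1000),  // different issue
  shouldSend('redis:down', 400 * 1000),  // hourly cap of 2 already reached
  shouldSend('mysql:down', 3700 * 1000), // cooldown over, window has room
];
console.log(decisions); // [ true, false, true, false, true ]
```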

📊 Metrics & Reporting

Service Metrics - Track uptime, restart counts, failure patterns
Resource Metrics - Monitor CPU, memory, disk usage over time
Daily Aggregation - Historical data for trend analysis
Health Reports - Summary of all monitored services

🔒 Security

Command Injection Protection - All inputs sanitized and validated
Whitelisted Commands - Only approved system commands can be executed
Path Traversal Prevention - Secure file operations
No Hardcoded Credentials - Everything configurable via environment variables
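
The combination of input validation and command whitelisting might look roughly like the sketch below. The allowed actions, the name pattern, and `buildSystemctlArgs` are all assumptions for illustration; the tool's real whitelist is not documented here.

```javascript
// Hypothetical whitelist of systemctl verbs (not the tool's actual list).
const ALLOWED_ACTIONS = new Set(['status', 'is-active', 'restart', 'show']);

// Conservative pattern for unit names: must start with a letter or digit and
// may contain only characters that need no shell quoting. No spaces or slashes.
const UNIT_NAME = /^[A-Za-z0-9][A-Za-z0-9:_.@-]{0,254}$/;

// Builds an argument vector for systemctl, or throws on unsafe input.
function buildSystemctlArgs(action, service) {
  if (!ALLOWED_ACTIONS.has(action)) {
    throw new Error(`Action not allowed: ${action}`);
  }
  if (!UNIT_NAME.test(service)) {
    throw new Error(`Invalid service name: ${service}`);
  }
  return [action, '--no-pager', service];
}

console.log(buildSystemctlArgs('restart', 'mysql')); // [ 'restart', '--no-pager', 'mysql' ]
```

The resulting array could be passed to Node's `child_process.execFile('systemctl', args, callback)`, which runs the binary directly without a shell, so metacharacters in arguments are never interpreted.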

Installation

Prerequisites

Node.js >= 16.0.0
Linux with systemd (Debian, Ubuntu, RHEL, etc.)
Root or sudo access (for systemctl commands)

Install via npm

# Install globally
npm install -g @timemacro/service-guardian

# Or with sudo if needed
sudo npm install -g @timemacro/service-guardian

Install from source

# Clone the repository (requires access if it is private)
git clone https://github.com/derricksiawor/service-guardian
cd service-guardian
npm install
npm link

Quick Start

1. Install and Check Version

# Install globally
npm install -g @timemacro/service-guardian

# Verify installation
sg --version
sg --help                     # See all available commands

2. Configure Email Alerts (Optional but Recommended)

sg config email               # Interactive email setup

You'll be prompted for SMTP settings:

SMTP Host (e.g., smtp.gmail.com)
SMTP Port (e.g., 587)
Username
Password
From address
To address

3. Add Services to Monitor

# Add a service (auto-restart and alerts are enabled by default)
sg add mysql

# Add multiple services
sg add nginx
sg add postgresql
sg add redis

# Add with custom settings
sg add apache2 --max-restarts 10

# List all monitored services
sg list

4. Monitor Your Services

# The daemon auto-starts when you add services
sg status                     # Check daemon and all services status

# View logs
sg logs                       # Recent logs
sg logs --follow              # Live logs (like tail -f)
sg logs --tail 100            # Last 100 lines

# Manual operations
sg check mysql                # Check specific service
sg restart                    # Restart the daemon
sg test                       # Test all services

Usage

Command Reference

Service Guardian can be invoked using either service-guardian or sg (shorthand). We recommend using sg for convenience.

Quick Information Commands

# Get started quickly
sg                            # Show help and available commands
sg --help                     # Show detailed help
sg --version                  # Show version

# View current state
sg status                     # Show daemon status and all monitored services
sg list                       # List all monitored services
sg info                       # Show system information and configuration

Core Commands

# Daemon Control (auto-starts if not running)
sg start                      # Start monitoring daemon (auto-starts on first command)
sg stop                       # Stop monitoring daemon
sg restart                    # Restart daemon
sg status                     # Show daemon and services status

# Service Management
sg add <service> [options]    # Add service to monitoring
sg remove <service>           # Remove service from monitoring
sg list                       # List all monitored services
sg enable <service>           # Enable monitoring for service
sg disable <service>          # Disable monitoring for service

# Monitoring & Logs
sg logs                       # View recent daemon logs
sg logs --follow              # View logs in real-time (like tail -f)
sg logs --tail 50             # View last 50 log lines
sg check <service>            # Manually check service status
sg test                       # Test monitoring all services

Advanced Features

# Health Checks
sg health add <service> [options]     # Add health check
sg health list                         # List all health checks
sg health remove <service>             # Remove health check
sg health test <service>               # Test health check

# Dependencies
sg deps add <service> <deps...>       # Add service dependencies
sg deps remove <service> <deps...>    # Remove dependencies
sg deps list [service]                # List dependencies
sg deps check                          # Check for circular dependencies

# Maintenance Windows
sg maintenance add [options]          # Schedule maintenance
sg maintenance list                    # List maintenance windows
sg maintenance remove <name>           # Remove maintenance window

# Groups & Tags
sg group create <name>                 # Create service group
sg group add <group> <services...>    # Add services to group
sg group list                          # List all groups
sg tag add <service> <tags...>        # Add tags to service
sg tag list [service]                  # List tags

# Metrics & Reports
sg metrics [service] [options]         # View service metrics
sg report [options]                    # Generate health report

# Configuration
sg config email                        # Configure email settings
sg config show                         # Show configuration
sg config set <key> <value>           # Set config value
sg export [file]                       # Export configuration
sg import <file>                       # Import configuration

Configuration Options

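Configuration is stored in /etc/service-guardian/config.json (or ~/.service-guardian/config.json for non-root users). The // comments in the listing below are explanatory annotations; standard JSON does not allow comments.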

{
  // Monitoring
  "CHECK_INTERVAL": 30,              // Seconds between checks
  "HEALTH_CHECK_INTERVAL": 60,       // Seconds between health checks
  
  // Restart Settings
  "MAX_RESTARTS": 5,                 // Max restart attempts
  "RESTART_DELAY": 10,               // Initial delay (seconds)
  "RESTART_BACKOFF_MULTIPLIER": 2,   // Exponential backoff
  "MAX_RESTART_DELAY": 300,          // Max delay (seconds)
  
  // Alerts
  "ALERT_COOLDOWN": 600,             // Seconds between alerts
  "ALERT_BATCH_INTERVAL": 60,        // Batch window (seconds)
  "MAX_ALERTS_PER_HOUR": 10,         // Rate limiting
  
  // Email Settings (set via sg config email)
  "SMTP_HOST": "smtp.gmail.com",
  "SMTP_PORT": 587,
  "SMTP_USER": "[email protected]",
  "SMTP_PASS": "your-app-password",
  "EMAIL_FROM": "[email protected]",
  "EMAIL_TO": "[email protected]"
}
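
To make the restart settings concrete, the sketch below computes the delay before each attempt from the values above. The exact formula (initial delay multiplied by the backoff factor on each attempt, capped at the maximum) is an assumption about how these settings interact, and `restartDelays` is a hypothetical helper.

```javascript
// Computes the wait (in seconds) before each restart attempt: the initial
// delay grows by the multiplier on every attempt and is capped at the maximum.
function restartDelays({ MAX_RESTARTS, RESTART_DELAY, RESTART_BACKOFF_MULTIPLIER, MAX_RESTART_DELAY }) {
  const delays = [];
  for (let attempt = 0; attempt < MAX_RESTARTS; attempt++) {
    const delay = RESTART_DELAY * RESTART_BACKOFF_MULTIPLIER ** attempt;
    delays.push(Math.min(delay, MAX_RESTART_DELAY));
  }
  return delays;
}

const defaults = {
  MAX_RESTARTS: 5,
  RESTART_DELAY: 10,
  RESTART_BACKOFF_MULTIPLIER: 2,
  MAX_RESTART_DELAY: 300,
};

console.log(restartDelays(defaults));                        // [ 10, 20, 40, 80, 160 ]
console.log(restartDelays({ ...defaults, MAX_RESTARTS: 7 })); // [ 10, 20, 40, 80, 160, 300, 300 ]
```

Under this reading, the default settings never hit the 300-second cap; it only matters when MAX_RESTARTS is raised above 5.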

How It Works

1. Service Monitoring Flow

┌─────────────────┐
│ Cron Scheduler  │ Every 30 seconds
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Check Services  │ Parallel checks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Analyze Status  │ Is service healthy?
└────────┬────────┘
         │
    ┌────┴────┐
    │ Healthy?│── Yes ──▶ Done until the next check
    └────┬────┘
         │ No
         ▼
┌─────────────────┐
│ Failure Analysis│ Why did it fail?
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Recovery Actions│ Try to fix
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Auto-Restart?   │ If enabled
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Send Alert?     │ If enabled & not in cooldown
└─────────────────┘

2. Failure Detection

Service Guardian performs intelligent failure analysis:

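// Not just "is the process running?"
// Illustrative wrapper (the function name is hypothetical); analyzeFailure,
// clearSystemCache, and attemptRestart are assumed to be defined elsewhere.
async function handleInactiveService(service, memory) {
  if (service.isActive) return;

  // Analyze WHY it's not running
  const analysis = await analyzeFailure(service);

  if (analysis.type === 'MANUAL_STOP') {
    // User stopped it, don't restart
    return;
  }

  if (analysis.type === 'OOM_KILL') {
    // Killed by OOM, check memory before restart
    if (memory.usage > 90) { // usage as a percentage
      // Clean up memory first
      await clearSystemCache();
    }
  }

  // Smart restart with backoff
  await attemptRestart(service);
}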
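
How a failure type might be derived from systemd is sketched below. It parses the KEY=VALUE output of `systemctl show <unit>` and applies a simple heuristic. The rules are assumptions, not Service Guardian's actual logic; `MANUAL_STOP` and `OOM_KILL` match the example above, while `CRASH`, `HEALTHY`, and both function names are hypothetical. A service that exits cleanly on its own would also look like a manual stop here.

```javascript
// Parses the KEY=VALUE lines printed by `systemctl show <unit>`.
function parseShowOutput(text) {
  const props = {};
  for (const line of text.split('\n')) {
    const idx = line.indexOf('='); // split on the first '=' only
    if (idx > 0) props[line.slice(0, idx)] = line.slice(idx + 1).trim();
  }
  return props;
}

// Heuristic classification based on the ActiveState and Result properties.
function classifyFailure(props) {
  if (props.ActiveState === 'active') return 'HEALTHY';
  if (props.Result === 'oom-kill') return 'OOM_KILL';
  if (props.ActiveState === 'inactive' && props.Result === 'success') {
    return 'MANUAL_STOP'; // clean stop, e.g. via `systemctl stop`
  }
  return 'CRASH';
}

// Sample text shaped like `systemctl show` output for an OOM-killed unit.
const sample = 'ActiveState=failed\nSubState=failed\nResult=oom-kill\n';
console.log(classifyFailure(parseShowOutput(sample))); // OOM_KILL
```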

3. Health Checks

Beyond process monitoring, health checks verify services actually work:

// TCP Health Check Example
const mysql_health = {
  type: 'tcp',
  host: 'localhost',
  port: 3306,
  timeout: 10,
  interval: 60
};

// Results in user-friendly messages:
// ✅ "mysql is responding on localhost:3306"
// ❌ "mysql is not accepting connections on localhost:3306. 
//     The service may be down or not listening on this port.
//     Suggestion: Verify mysql is running with: systemctl status mysql"
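
A TCP check of this kind can be written with Node's built-in `net` module. The sketch below is a minimal illustration rather than the tool's implementation: `checkTcp` is a hypothetical name, the `name` field is added for the message text, and treating `timeout` as seconds is an assumption about the config object above.

```javascript
const net = require('net');

// Attempts a TCP connection and resolves to { ok, message }. Never rejects.
// `timeout` is interpreted as seconds (an assumption).
function checkTcp({ name, host, port, timeout }) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok, message) => {
      socket.destroy(); // always release the socket
      resolve({ ok, message });
    };
    socket.setTimeout(timeout * 1000);
    socket.once('connect', () =>
      finish(true, `${name} is responding on ${host}:${port}`));
    socket.once('timeout', () =>
      finish(false, `${name} timed out on ${host}:${port}`));
    socket.once('error', () =>
      finish(false, `${name} is not accepting connections on ${host}:${port}`));
  });
}

// Example usage (not executed here):
// checkTcp({ name: 'mysql', host: 'localhost', port: 3306, timeout: 10 })
//   .then((result) => console.log(result.message));
```

A successful connection only shows that something is listening on the port; a protocol-level check (for example, a real MySQL handshake) would be needed to confirm the service itself is healthy.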

4. Alert Aggregation

Intelligent batching reduces email spam:

// Instead of 10 emails in 1 minute:
// "nginx failed"
// "mysql failed"
// "redis failed"
// ...

// You get 1 comprehensive email:
// "3 services need attention:
//  - nginx: Connection refused on port 80
//  - mysql: OOM killed (memory: 95%)
//  - redis: Dependency postgres is down"
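
The batching behavior described above can be sketched as a small collector that is flushed once per batch window. This is an assumption about the approach, not the tool's code; `createAlertBatcher` and the `send` callback are hypothetical.

```javascript
// Hypothetical batcher: collects alerts, then emits one summary per flush.
function createAlertBatcher(send) {
  let pending = [];

  return {
    add(service, reason) {
      pending.push({ service, reason });
    },
    // Intended to be called when the batch window elapses.
    flush() {
      if (pending.length === 0) return null; // nothing to report
      const noun = pending.length === 1 ? 'service needs' : 'services need';
      const lines = pending.map((a) => ` - ${a.service}: ${a.reason}`);
      const body = [`${pending.length} ${noun} attention:`, ...lines].join('\n');
      pending = [];
      send(body);
      return body;
    },
  };
}

const sent = [];
const batcher = createAlertBatcher((body) => sent.push(body));
batcher.add('nginx', 'Connection refused on port 80');
batcher.add('mysql', 'OOM killed (memory: 95%)');
batcher.add('redis', 'Dependency postgres is down');
batcher.flush();
console.log(sent[0]);
```

In a long-running daemon, `flush` could be driven by `setInterval` using the ALERT_BATCH_INTERVAL value.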

Real-World Examples

Example 1: MySQL OOM Protection

# Add MySQL with OOM recovery (auto-restart and alerts enabled by default)
sg add mysql --max-restarts 5

# Add health check to verify it's accepting connections
sg health add mysql --type tcp --port 3306

# Add recovery action to clear cache when memory is high
sg recovery add mysql --type clear-cache --threshold 90

When MySQL gets OOM-killed:

Service Guardian detects the OOM kill (not just "service down")
Checks system memory usage
If memory > 90%, clears system cache first
Restarts MySQL with exponential backoff
Verifies it's accepting connections
Sends detailed alert with memory stats and suggestions

Example 2: Dependent Services

# Setup WordPress stack with dependencies
sg add nginx
sg add php-fpm
sg add mysql

# Define dependencies
sg deps add nginx php-fpm
sg deps add php-fpm mysql

# If MySQL fails, Service Guardian will:
# 1. Restart MySQL first
# 2. Then restart php-fpm (depends on MySQL)
# 3. Then restart nginx (depends on php-fpm)
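
The restart order above is a topological ordering of the dependency graph. The sketch below shows one way to compute it with a depth-first search that also reports circular dependencies. It is an illustration under the assumption that `sg deps add A B` means "A depends on B"; `restartOrder` is a hypothetical name.

```javascript
// deps maps each service to the services it depends on.
// Returns services ordered so every dependency precedes its dependents.
function restartOrder(deps) {
  const order = [];
  const state = new Map(); // service -> 'visiting' | 'done'

  function visit(service, path) {
    if (state.get(service) === 'done') return;
    if (state.get(service) === 'visiting') {
      // We came back to a service that is still on the current path.
      throw new Error(`Circular dependency: ${[...path, service].join(' -> ')}`);
    }
    state.set(service, 'visiting');
    for (const dep of deps[service] || []) visit(dep, [...path, service]);
    state.set(service, 'done');
    order.push(service); // pushed only after all of its dependencies
  }

  for (const service of Object.keys(deps)) visit(service, []);
  return order;
}

const stack = { nginx: ['php-fpm'], 'php-fpm': ['mysql'], mysql: [] };
console.log(restartOrder(stack)); // [ 'mysql', 'php-fpm', 'nginx' ]
```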

Example 3: Maintenance Windows

# Schedule maintenance window for updates
sg maintenance add "Weekly Updates" \
  --days sunday \
  --start 02:00 \
  --duration 2 \
  --services nginx,mysql,redis

# During maintenance:
# - No auto-restarts
# - No alerts
# - Services can be safely updated
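
Deciding whether a service is currently inside a maintenance window comes down to a date comparison. The sketch below assumes that `--duration` is given in hours and that times are local to the server; both are assumptions, as are the `inMaintenance` function and the shape of the window object.

```javascript
const DAY_NAMES = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];

// Returns true if `service` is covered by `window` at local time `now`.
function inMaintenance(window, service, now) {
  if (!window.services.includes(service)) return false;
  const [hours, minutes] = window.start.split(':').map(Number);

  // Check the window starting today and the one that started yesterday,
  // so windows that cross midnight are handled.
  for (const dayOffset of [0, -1]) {
    const start = new Date(now.getFullYear(), now.getMonth(),
      now.getDate() + dayOffset, hours, minutes);
    if (!window.days.includes(DAY_NAMES[start.getDay()])) continue;
    const end = new Date(start.getTime() + window.durationHours * 3600 * 1000);
    if (now >= start && now < end) return true;
  }
  return false;
}

const weeklyUpdates = {
  days: ['sunday'],
  start: '02:00',
  durationHours: 2, // assumed unit for --duration
  services: ['nginx', 'mysql', 'redis'],
};

// 7 January 2024 is a Sunday.
console.log(inMaintenance(weeklyUpdates, 'nginx', new Date(2024, 0, 7, 3, 0))); // true
console.log(inMaintenance(weeklyUpdates, 'nginx', new Date(2024, 0, 7, 4, 0))); // false
```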

Example 4: Custom Health Checks

# Create custom health check script
cat > /etc/service-guardian/health-checks/api-check.sh << 'EOF'
#!/bin/bash
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/api/health)
if [ "$RESPONSE" = "200" ]; then
  echo "API is healthy"
  exit 0
else
  echo "API returned status code: $RESPONSE"
  exit 1
fi
EOF

chmod +x /etc/service-guardian/health-checks/api-check.sh

# Add the health check
sg health add api --type script --script api-check.sh

Architecture

Security Features

Input Validation - All inputs validated with JSON schemas
Command Whitelisting - Only approved system commands
Shell Escape - Prevents command injection
Path Validation - Prevents directory traversal
Secure Execution - Isolated command execution

Performance

Parallel Monitoring - Check multiple services simultaneously
Efficient Resource Usage - Minimal CPU and memory footprint
Optimized Queries - Batch operations where possible
Caching - Reduces repeated system calls

Reliability

Crash Recovery - Daemon automatically recovers from crashes
Data Persistence - Configuration and metrics survive restarts
Atomic Operations - Prevents partial updates
Graceful Shutdown - Cleanly stops all operations

Troubleshooting

Service Guardian won't start

# Check if already running
sg status

# Check logs for errors
sg logs --tail 50

# Verify Node.js version
node --version  # Should be >= 16.0.0

# Check permissions
ls -la /etc/service-guardian/

Services not being monitored

# Verify service is added
sg list

# Check if service exists
systemctl status <service-name>

# Test monitoring manually
sg check <service-name>

# Check dependencies
sg deps check

Not receiving alerts

# Test email configuration
sg config email --test

# Check alert settings
sg config show | grep ALERT

# View recent alerts
sg logs | grep "Alert sent"

# Check cooldown status
sg status --verbose

High memory usage

# Check metrics history
sg metrics --days 7

# Clear old metrics
sg metrics --cleanup

# Reduce check frequency
sg config set CHECK_INTERVAL 60

Development

Running Tests

npm test                 # Run all tests
npm run test:watch      # Watch mode
npm run test:coverage   # Coverage report

Contributing

For contributions, please see CONTRIBUTING.md or open an issue at https://github.com/derricksiawor/service-guardian/issues

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Derrick S. K. Siawor
Website: https://derricksiawor.com

Support

GitHub Issues: github.com/derricksiawor/service-guardian/issues
npm Package: npmjs.com/package/@timemacro/service-guardian

Acknowledgments

Built with enterprise-grade libraries:

Commander.js - CLI interface
Nodemailer - Email alerts
node-cron - Scheduling
Winston - Logging
Chalk - Terminal styling

Stop losing sleep over crashed services. Let Service Guardian keep watch.