gpu-monitor

v0.1.1

Published

5 days ago

CLI that flags idle GPUs and estimates wasted spend from local nvidia-smi data.


gpu nvidia nvidia-smi cuda monitor watcher idle utilization cost spend alert mlops cli billing underutilized orphan fleet auto-stop

gpu-monitor

Watch your GPU boxes for idle spend. Local checks are free. Email alerts use Badgr.

gpu-monitor is a CLI watcher that runs on the GPU machine itself (a rented GPU pod, a cloud GPU instance, a local workstation, or any Linux box with NVIDIA drivers) and alerts when a GPU has been idle long enough to waste money.

This CLI is MIT-licensed open source. You can read every line it runs before you put it on a machine with SSH access, API keys, or customer data.

Hosted alerts, alert history, team features, and automated actions are commercial features provided by Badgr.

Quick Start

One-time local check, no account required:

npx gpu-monitor check --hourly-rate 2.50

Continuous local watcher, no account required:

npx gpu-monitor watch --hourly-rate 2.50

See which processes are holding VRAM:

npx gpu-monitor processes

Continuous watcher with Badgr email alerts:

npx gpu-monitor watch \
  --hourly-rate 2.50 \
  --idle-minutes 30 \
  --email you@example.com \
  --badgr-key "$BADGR_API_KEY"

Email alerts require a Badgr account and API key.

Commands

`check`

Runs one nvidia-smi sample and prints the current GPU state.

npx gpu-monitor check --hourly-rate 2.50

`watch`

Runs in the foreground and takes one nvidia-smi sample every --interval seconds.

npx gpu-monitor watch --hourly-rate 2.50 --interval 60

If a GPU stays idle for at least --idle-minutes, it becomes alert-ready. With --email and --badgr-key, the watcher sends an email alert through Badgr, then waits for the email cooldown before sending another alert for the same GPU.

`processes`

Shows which PIDs are holding VRAM on each GPU.

npx gpu-monitor processes
npx gpu-monitor processes --json

Useful when a GPU shows high memory but low utilization — a stale process is likely holding VRAM without doing any work. See Stale Memory below.

Keeping watch alive after SSH disconnect

watch only works while the process is alive. To keep it running after you disconnect:

tmux (recommended for interactive sessions):

tmux new -s gpu-monitor
npx gpu-monitor watch --hourly-rate 2.50 --email you@example.com --badgr-key "$BADGR_API_KEY"
# Ctrl-b d  to detach
# tmux attach -t gpu-monitor  to reattach

screen:

screen -S gpu-monitor
npx gpu-monitor watch --hourly-rate 2.50 --email you@example.com --badgr-key "$BADGR_API_KEY"
# Ctrl-a d  to detach
# screen -r gpu-monitor  to reattach

nohup (fire-and-forget, no reattach):

nohup npx gpu-monitor watch \
  --hourly-rate 2.50 \
  --email you@example.com \
  --badgr-key "$BADGR_API_KEY" \
  >> gpu-monitor.log 2>&1 &
echo $!   # save the PID to kill later
tail -f gpu-monitor.log

Defaults

interval: 60 seconds
idle-minutes: 30
idle-utilization: 10%
idle-memory-percent: 10%
email cooldown: 60 minutes per GPU
hourly-rate: 1.00

CLI Options

--hourly-rate <usd>             Hourly cost per GPU (default: 1)
--idle-minutes <minutes>        Alert after this much idle time (default: 30)
--interval <seconds>            Watch interval (default: 60)
--idle-utilization <pct>        GPU utilization at or below this is idle (default: 10)
--idle-memory-percent <pct>     Memory use at or below this is idle (default: 10)
--json                          Print JSON report
--email <address>               Send idle alerts through Badgr (watch only)
--badgr-key <key>               Badgr API key for email alerts
--email-cooldown-minutes <min>  Email cooldown per GPU (default: 60)
--help                          Show help

How It Works

check runs this query once; watch runs it every interval:

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,power.draw --format=csv,noheader,nounits
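As a rough illustration of what one row of that query looks like once parsed, here is a minimal Python sketch. It is not the tool's own parser (gpu-monitor is a Node.js CLI); the `GpuSample` class and `parse_gpu_line` function are hypothetical names, and the sample row is made up to match the example output later in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuSample:
    index: int
    name: str
    utilization_pct: float
    memory_used_mb: float
    memory_total_mb: float
    power_draw_w: Optional[float]  # None when the driver reports no numeric value

    @property
    def memory_pct(self) -> float:
        return 100.0 * self.memory_used_mb / self.memory_total_mb


def parse_gpu_line(line: str) -> GpuSample:
    """Parse one row of the csv,noheader,nounits query shown above."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    index, name, util, used, total, power = fields
    try:
        power_w = float(power)
    except ValueError:
        power_w = None  # e.g. "[N/A]" on GPUs without a readable power sensor
    return GpuSample(int(index), name, float(util), float(used), float(total), power_w)


# Illustrative row, not captured from a real machine.
sample = parse_gpu_line("0, NVIDIA A100-SXM4-40GB, 0, 512, 40960, 45.00")
print(sample.name, sample.utilization_pct, sample.memory_pct)
```

Treating an unparseable power value as missing instead of failing is a design choice of this sketch; some GPUs and virtualized environments do not report power draw.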

processes additionally runs per GPU:

nvidia-smi --id=<index> --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader,nounits
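The per-process rows can be summed to get the "held" figure that the processes command prints. The sketch below is an assumption about how such a total could be computed, not the tool's actual code; `parse_compute_apps` and `held_percent` are hypothetical helpers, and the input rows are illustrative.

```python
def parse_compute_apps(output: str) -> list[tuple[int, str, float]]:
    """Parse rows of `pid, process_name, used_gpu_memory` (MiB, no units)."""
    apps = []
    for line in output.splitlines():
        if not line.strip():
            continue  # empty output means no compute processes on this GPU
        fields = line.split(",")
        pid = int(fields[0].strip())
        used_mb = float(fields[-1].strip())
        # Rejoin the middle in case a process name itself contains a comma.
        name = ",".join(fields[1:-1]).strip()
        apps.append((pid, name, used_mb))
    return apps


def held_percent(apps: list[tuple[int, str, float]], memory_total_mb: float) -> float:
    """Share of total VRAM held by the listed processes, in percent."""
    held_mb = sum(used for _, _, used in apps)
    return 100.0 * held_mb / memory_total_mb


# Illustrative rows matching the example output in this README.
apps = parse_compute_apps("12345, python3, 6000\n67890, jupyter, 2000\n")
print(f"{held_percent(apps, 40960):.1f}%")  # prints 19.5%
```

This sketch assumes every row carries a numeric memory value; a real parser would also need to handle rows where the driver reports the value as unavailable.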

A GPU is idle when both are true:

utilization <= idle-utilization threshold
memory % <= idle-memory threshold

The watcher tracks how long each GPU has stayed idle. It alerts only when:

GPU has been idle for >= idle-minutes
AND email cooldown has passed

It does not email every interval. Multi-GPU machines are tracked per GPU index.
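The per-GPU tracking and cooldown described above can be sketched as a small state machine. This is a minimal Python illustration under stated assumptions, not the watcher's real implementation: `IdleTracker` and `observe` are hypothetical names, and the sketch assumes that any non-idle sample resets that GPU's idle clock.

```python
from dataclasses import dataclass, field


@dataclass
class IdleTracker:
    idle_minutes: float = 30.0
    cooldown_minutes: float = 60.0
    # Both maps are keyed by GPU index; values are epoch seconds.
    idle_since: dict[int, float] = field(default_factory=dict)
    last_alert: dict[int, float] = field(default_factory=dict)

    def observe(self, gpu_index: int, is_idle: bool, now: float) -> bool:
        """Record one sample; return True when an alert should be sent."""
        if not is_idle:
            # Assumption: activity resets the idle clock for this GPU.
            self.idle_since.pop(gpu_index, None)
            return False
        started = self.idle_since.setdefault(gpu_index, now)
        if now - started < self.idle_minutes * 60:
            return False  # idle, but not for long enough yet
        last = self.last_alert.get(gpu_index)
        if last is not None and now - last < self.cooldown_minutes * 60:
            return False  # still inside the email cooldown
        self.last_alert[gpu_index] = now
        return True


tracker = IdleTracker()
tracker.observe(0, True, now=0)               # idle clock starts
print(tracker.observe(0, True, now=30 * 60))  # True: idle for 30 minutes
print(tracker.observe(0, True, now=31 * 60))  # False: cooldown still active
```

Passing `now` in explicitly, instead of reading the clock inside the method, keeps the logic easy to test without waiting for real time to pass.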

If nvidia-smi is missing, the CLI exits non-zero and reports that the NVIDIA driver tools are not visible, rather than guessing at GPU state.

The API key is never printed to stdout or stderr.

Stale Memory

A GPU is flagged as stale memory when:

utilization <= idle threshold   (GPU is not computing)
memory > memory threshold       (something is holding VRAM)

This is the classic "forgot to free after training finished" situation. The job finished its work, but the process never exited or released its memory allocation, leaving the GPU occupied without doing any work.
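The idle rule and the stale-memory rule share the same two thresholds, which the following Python sketch makes explicit. The `classify` function is a hypothetical helper written for illustration; its default thresholds are the documented defaults of 10% each.

```python
def classify(utilization_pct: float, memory_pct: float,
             idle_utilization: float = 10.0,
             idle_memory_percent: float = 10.0) -> tuple[str, bool]:
    """Return (status, stale_memory) for one GPU sample."""
    low_util = utilization_pct <= idle_utilization
    low_mem = memory_pct <= idle_memory_percent
    # Idle needs both conditions; otherwise the GPU is reported as active.
    status = "idle" if (low_util and low_mem) else "active"
    # Stale memory: not computing, yet something still holds VRAM.
    stale_memory = low_util and not low_mem
    return status, stale_memory


print(classify(0, 1.25))   # ('idle', False)
print(classify(2, 20.0))   # ('active', True)
print(classify(85, 34.7))  # ('active', False)
```

Note that a stale-memory GPU is not counted as idle under this rule, which is why the example output marks it as active with a warning.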

check and watch print a warning line for stale-memory GPUs:

✓ GPU 0 NVIDIA A100-SXM4-40GB: active
  utilization: 2%
  memory: 8192/40960 MB (20.0%)
  idle duration: 0s
  power draw: 50 W
  ⚠ stale memory: high VRAM, low utilization — a process may be holding memory

Run gpu-monitor processes to see the PID:

GPU processes

GPU 0 NVIDIA A100-SXM4-40GB  8192/40960 MB (20.0%)
  PID 12345     python3                   8000 MB
  held: 8000 MB / 40960 MB (19.5%)
  ⚠ stale memory: low utilization but high VRAM — run `gpu-monitor processes` to see PIDs

Waste Calculation

estimated waste = hourly rate × idle minutes / 60

Example:

$2.50/hr × 30/60 = $1.25 wasted after 30 idle minutes
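The same arithmetic as a small Python function, for anyone who wants to reproduce the estimate in their own scripts. The name `estimated_waste` is hypothetical and not part of the CLI.

```python
def estimated_waste(hourly_rate_usd: float, idle_minutes: float) -> float:
    """Cost of idle time in USD: hourly rate times idle minutes over 60."""
    return hourly_rate_usd * idle_minutes / 60


print(f"${estimated_waste(2.50, 30):.2f}")  # prints $1.25
```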

Output

check / watch

GPU cost watcher

Checked at: 2026-06-09T06:00:00.000Z
GPUs checked: 2
Idle GPUs: 1
Alert-ready GPUs: 1
Hourly rate: $2.50/hr
Idle threshold: 30 min (utilization <= 10%, memory <= 10%)
Estimated waste: $1.25
Next check: 2026-06-09T06:01:00.000Z

⚠ GPU 0 NVIDIA A100-SXM4-40GB: idle
  utilization: 0%
  memory: 512/40960 MB (1.3%)
  idle duration: 30.0m
  power draw: 45 W
  estimated waste: $1.25 after 30 min idle

✓ GPU 1 NVIDIA L40S: active
  utilization: 85%
  memory: 16000/46080 MB (34.7%)
  idle duration: 0s
  power draw: 240 W

processes

GPU processes

GPU 0 NVIDIA A100-SXM4-40GB  8192/40960 MB (20.0%)
  PID 12345     python3                   6000 MB
  PID 67890     jupyter                   2000 MB
  held: 8000 MB / 40960 MB (19.5%)
  ⚠ stale memory: low utilization but high VRAM — run `gpu-monitor processes` to see PIDs

GPU 1 NVIDIA L40S  16000/46080 MB (34.7%)
  no compute processes

Install Globally

npm install -g gpu-monitor
gpu-monitor watch --hourly-rate 2.50

Requirements

Node.js 18+
NVIDIA drivers with nvidia-smi available
A GPU machine for live readings
A Badgr API key for email alerts

GPU cloud containers (RunPod, Vast.ai, Lambda Labs, Modal, CoreWeave)

Plain images like node:20 do not include nvidia-smi. Use a CUDA runtime image and install Node.js in it:

# Example base image
nvcr.io/nvidia/cuda:12.6.0-runtime-ubuntu22.04

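# Then install Node.js (the CUDA runtime image does not ship curl)
apt-get update && apt-get install -y curl ca-certificates
curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt-get install -y nodejs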

Or use a pre-built CUDA + Node image from the provider's image registry.

gpu-monitor tries these paths automatically before giving up:

nvidia-smi
/usr/bin/nvidia-smi
/usr/local/nvidia/bin/nvidia-smi   ← NVIDIA container toolkit default
/usr/local/bin/nvidia-smi

If all paths fail, the tool prints a clear message explaining which image type to use.
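A fallback search like the one above can be sketched in a few lines of Python. This is only an illustration of the idea; the real lookup lives in the Node.js CLI and may differ. `find_nvidia_smi` is a hypothetical function, and the candidate list is copied from the paths listed above.

```python
import os
import shutil

CANDIDATES = [
    "nvidia-smi",                        # resolved through PATH
    "/usr/bin/nvidia-smi",
    "/usr/local/nvidia/bin/nvidia-smi",
    "/usr/local/bin/nvidia-smi",
]


def find_nvidia_smi(candidates=CANDIDATES):
    """Return the first usable nvidia-smi path, or None if none is found."""
    for candidate in candidates:
        if os.sep in candidate:
            # Absolute path: must exist and be executable.
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
        else:
            # Bare command name: look it up on PATH.
            found = shutil.which(candidate)
            if found:
                return found
    return None


path = find_nvidia_smi()
print(path or "nvidia-smi not found; use a CUDA runtime image")
```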

Open Source

gpu-monitor is MIT-licensed. Source is at github.com/michaelmanly/gpu-monitor.

What is open source:

check — one-time GPU status
watch — continuous monitoring with idle detection, waste estimate, power draw, email alerts, cooldown
processes — per-GPU PID list showing which processes hold VRAM
Stale memory detection
All nvidia-smi parsing logic
JSON output

What is commercial (Badgr):

Hosted email alerts and alert history
Team and fleet dashboards
Automated actions and auto-shutdown
Paid monitoring features

Bugs and pull requests welcome on GitHub.
