# gpu-orchestrator

v0.5.0

Cross-platform GPU/CPU tuning, VRAM management, and live monitoring dashboard for local AI runtimes.
## Why this exists
Local AI runtimes (Ollama, llama.cpp, vLLM) ship with zero guidance on tuning for your hardware. Users guess at batch sizes, thread counts, VRAM limits, and backend selection — and get it wrong. This tool inspects your machine, detects your compute stack, and generates correct runtime configs automatically.
## What it does
- Detects hardware — CPU cores, RAM, GPUs, VRAM per device, temperatures
- Probes compute backends — CUDA, ROCm, Vulkan, Metal, CPU
- Recommends backends — AMD → ROCm primary, Vulkan fallback; NVIDIA → CUDA; Apple → Metal
- VRAM soft caps — suggest memory limits per GPU (strict / balanced / lenient policies)
- Model recommendations — curated database of 35+ models, ranked by what fits your VRAM
- Quickstart wizard — zero to running in one command
- Model fit checker — tell it a model size, it tells you if it fits your VRAM
- GPU/CPU split planner — visualize how a model distributes across GPUs + CPU RAM
- Model management — list, load, unload, pull, remove, inspect Ollama models with GPU/CPU split bars
- Profile-based tuning — latency, throughput, balanced, low-power
- Generates runtime configs — env files for Ollama, llama.cpp flags, vLLM args
- Live ASCII dashboard — CPU trend, RAM gauge, GPU table, VRAM bars, temperatures, live inference
- Web GUI dashboard — browser-based live monitoring with Chart.js graphs, health score, model management
- Install guides — platform-aware setup instructions for ROCm, Vulkan, CUDA, Metal, Ollama
- Quick benchmarks — tok/s measurement with history tracking, comparison, trend sparklines
- Cost savings — compare local inference electricity cost vs cloud API pricing
- Health score — A-F system grade with category breakdown
- Temperature monitoring — GPU and CPU temps via nvidia-smi, rocm-smi, systeminformation
- Process viewer — find running AI runtime processes
- Persistent config — save preferences to `~/.gpu-orchestrator/config.json`
## Install

```sh
npm install -g gpu-orchestrator
```

## Commands
### quickstart — Zero to running

```sh
gpu-orchestrator quickstart
```

Interactive 9-step wizard: detect hardware → check Ollama → recommend models → pull → generate config → load → benchmark → done.
### status — Quick overview

```sh
gpu-orchestrator status
```

```text
⚡ GPU Orchestrator — Status

CPU: [██░░░░░░░░░░░░░░░░░░] 4.9% (12 cores)
RAM: [████████░░░░░░░░░░░░] 41.9% (13.1/31.3 GB)
GPU: AMD Radeon RX 5500 (4.0 GB VRAM)
Cap: 3.38 GB usable / 3.98 GB total (balanced)
Temp: CPU 52°C | AMD Radeon RX 5500 61°C
API: VULKAN
Proc: 2 AI process(es) running

● Ollama — 3 models, 1 running
○ llama.cpp
○ vLLM
```

### recommend — What model should I run?
```sh
gpu-orchestrator recommend
gpu-orchestrator recommend --category code
gpu-orchestrator recommend --json
```

Scans your hardware and shows a ranked table of models that fit your VRAM, with medal emojis and install status.
### doctor — Full inspection

```sh
gpu-orchestrator doctor
gpu-orchestrator doctor --json
gpu-orchestrator doctor --vram-policy strict
gpu-orchestrator doctor --export report.json
```

Detailed report: health score (A–F grade), system specs, temperatures, per-GPU VRAM caps, backend detection table, process list, runtime probing, and profile recommendations.
### models — Ollama model management

```sh
gpu-orchestrator models list              # all installed + running status
gpu-orchestrator models running           # loaded models with GPU/CPU split + last bench tok/s
gpu-orchestrator models info llama3:8b    # detailed model specs
gpu-orchestrator models load llama3:8b    # load into memory
gpu-orchestrator models unload llama3:8b  # free memory
gpu-orchestrator models pull llama3:8b    # pull from Ollama library (streaming progress)
gpu-orchestrator models rm llama3:8b      # remove a model
```

### model-fit — Will it run?
```sh
gpu-orchestrator model-fit 7b
gpu-orchestrator model-fit 13b --quant q8
gpu-orchestrator model-fit 3b --quant fp16
```

Presets: 1b 3b 7b 8b 13b 14b 24b 30b 34b 70b 72b — or pass a raw size in GB.
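The arithmetic behind a fit check of this kind can be sketched as follows. The bytes-per-parameter multipliers and the 20% headroom margin here are illustrative assumptions, not the tool's actual values:

```javascript
// Hypothetical fit estimate: parameter count × bytes per weight, plus
// ~20% headroom for KV cache and runtime overhead. Illustrative only.
const BYTES_PER_PARAM = { q4: 0.5, q8: 1.0, fp16: 2.0 };

function estimateVramGb(paramsBillions, quant = "q4") {
  const bytes = paramsBillions * 1e9 * BYTES_PER_PARAM[quant];
  return +((bytes / 2 ** 30) * 1.2).toFixed(1);
}

function fitsInVram(paramsBillions, quant, usableVramGb) {
  return estimateVramGb(paramsBillions, quant) <= usableVramGb;
}

console.log(estimateVramGb(7, "q4"));     // ≈ 3.9 GB
console.log(fitsInVram(13, "q8", 3.38));  // false — 13B at q8 needs ~14.5 GB
```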
### split — GPU/CPU layer split planner

```sh
gpu-orchestrator split 4.5
gpu-orchestrator split 14 --runtime llamacpp
gpu-orchestrator split 8 --policy strict
```

### optimize — Generate runtime config
```sh
gpu-orchestrator optimize --runtime ollama --profile balanced
gpu-orchestrator optimize --runtime llamacpp --profile latency --backend vulkan
gpu-orchestrator optimize --runtime ollama --profile balanced --apply    # show how to apply
gpu-orchestrator optimize --runtime ollama --profile balanced --dry-run
```

| Option | Values |
|---|---|
| `--runtime` | ollama · llamacpp · vllm |
| `--profile` | latency · throughput · balanced · low-power |
| `--vram-cap` | Manual cap in GB |
| `--vram-policy` | strict (25%) · balanced (15%) · lenient (5%) |
| `--backend` | cuda · rocm · vulkan · metal · cpu · auto |
| `--apply` | Show platform-specific apply instructions |
### bench — Runtime benchmark

```sh
gpu-orchestrator bench --model llama3:8b --runs 3
gpu-orchestrator bench history                        # past runs table
gpu-orchestrator bench compare llama3:8b mistral:7b   # side-by-side
gpu-orchestrator bench trend llama3:8b                # ASCII sparkline over time
```

Results auto-save to history. Use `--no-save` to skip.
### savings — Cost calculator

```sh
gpu-orchestrator savings --tokens 100000
gpu-orchestrator savings --tokens 500000 --electricity 0.15
gpu-orchestrator savings --tokens 100000 --json
```

Compares your local electricity cost against GPT-4o, GPT-4o Mini, Claude Sonnet, and Claude Haiku API pricing.
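The comparison is back-of-envelope energy math. A sketch — all figures below (throughput, wattage, API price) are placeholder assumptions, not the prices the command actually uses:

```javascript
// Local cost: inference wall-clock time × GPU power draw × electricity rate.
function localCostUsd(tokens, tokPerSec, wattsDraw, usdPerKwh) {
  const hours = tokens / tokPerSec / 3600;
  return hours * (wattsDraw / 1000) * usdPerKwh;
}

// Cloud cost: flat per-million-token API pricing.
function cloudCostUsd(tokens, usdPerMillionTokens) {
  return (tokens / 1e6) * usdPerMillionTokens;
}

// 100k tokens at 30 tok/s on a 150 W GPU at $0.15/kWh, vs a $10/M-token API:
console.log(localCostUsd(100_000, 30, 150, 0.15).toFixed(4)); // "0.0208"
console.log(cloudCostUsd(100_000, 10));                       // 1
```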
### ps — AI process viewer

```sh
gpu-orchestrator ps
gpu-orchestrator ps --json
```

Shows running Ollama, llama.cpp, and vLLM processes with PID, CPU%, and memory.
### config — Persistent configuration

```sh
gpu-orchestrator config show                          # print current config
gpu-orchestrator config set electricityRate 0.15      # update a value
gpu-orchestrator config set hosts.ollama http://192.168.1.50:11434
gpu-orchestrator config reset                         # restore defaults
gpu-orchestrator config path                          # show config file location
```

### monitor — Live ASCII dashboard (TUI)
```sh
gpu-orchestrator monitor
gpu-orchestrator monitor --runtime all --interval 2000
gpu-orchestrator monitor --new-window
```

Terminal UI panels: CPU load trend (60 points) · memory donut · GPU table · temperature panel · live inference metrics · runtime status · event log. Press `q` to exit.
### web — Web GUI dashboard

```sh
gpu-orchestrator web
gpu-orchestrator web --port 8080 --open
gpu-orchestrator web --host http://192.168.1.50:11434
```

Browser-based live dashboard with:
- DARKSOL gold/dark branded theme
- Chart.js real-time CPU line chart
- Temperature gauges (color-coded)
- Health score badge (A-F)
- GPU/VRAM panels with cap indicators
- Ollama model list with load/unload buttons
- Model pull input field
- Cost savings calculator widget
- Runtime status (Ollama, llama.cpp, vLLM)
- 2-second auto-refresh
### install-guide — Setup help

```sh
gpu-orchestrator install-guide
gpu-orchestrator install-guide rocm
gpu-orchestrator install-guide cuda
gpu-orchestrator install-guide ollama
```

Platform-aware guides for Windows, Linux, and macOS.
## VRAM Soft Caps

Non-aggressive by design — the caps are suggestions only; the tool never kills processes.
| Policy | VRAM Reserve | Use Case |
|---|---|---|
| strict | 25% | Desktop multitasking, stability first |
| balanced | 15% | Default — good performance + headroom |
| lenient | 5% | Max performance, may cause stuttering |
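Deriving a cap from a policy is simple arithmetic. A minimal sketch, assuming the reserve fraction is applied per device (function and constant names are illustrative, not the tool's internal API):

```javascript
// Fraction of VRAM held back for the desktop/compositor, per policy.
const RESERVE = { strict: 0.25, balanced: 0.15, lenient: 0.05 };

function usableVramGb(totalGb, policy = "balanced") {
  const reserve = RESERVE[policy];
  if (reserve === undefined) throw new Error(`unknown policy: ${policy}`);
  return +(totalGb * (1 - reserve)).toFixed(2);
}

// Matches the status example above: a 3.98 GB card under "balanced"
console.log(usableVramGb(3.98));        // 3.38
console.log(usableVramGb(8, "strict")); // 6
```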
## Backend Detection

Auto-probes and recommends:

| Hardware | Primary | Fallback |
|---|---|---|
| NVIDIA + CUDA | CUDA | Vulkan |
| AMD + ROCm | ROCm | Vulkan |
| AMD (no ROCm) | Vulkan | CPU |
| Apple Silicon | Metal | — |
| Intel | Vulkan | CPU |
## Health Score

System health grading across 6 categories:

| Category | Max Points | Criteria |
|---|---|---|
| GPU | 20 | Discrete GPU detected |
| Backend | 20 | GPU compute backend available |
| Runtime | 20 | AI runtime running |
| VRAM | 15 | 8 GB+ = 15, 4 GB+ = 10, 2 GB+ = 5 |
| RAM | 15 | 32 GB+ = 15, 16 GB+ = 10 |
| Cores | 10 | 8+ = 10, 4+ = 5 |
Grades: A (90+), B (75+), C (60+), D (45+), F (<45)
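The scheme in the table can be sketched as a scoring function (field names here are illustrative, not the tool's internal API):

```javascript
// Award the points for the first tier the value reaches, else 0.
const tier = (value, steps) =>
  steps.find(([min]) => value >= min)?.[1] ?? 0;

function healthScore(sys) {
  return (sys.discreteGpu ? 20 : 0)
    + (sys.backendAvailable ? 20 : 0)
    + (sys.runtimeRunning ? 20 : 0)
    + tier(sys.vramGb, [[8, 15], [4, 10], [2, 5]])
    + tier(sys.ramGb,  [[32, 15], [16, 10]])
    + tier(sys.cores,  [[8, 10], [4, 5]]);
}

const grade = (s) =>
  s >= 90 ? "A" : s >= 75 ? "B" : s >= 60 ? "C" : s >= 45 ? "D" : "F";

// The status example machine: 4 GB GPU, Vulkan, Ollama running, 31.3 GB RAM, 12 cores.
const score = healthScore({ discreteGpu: true, backendAvailable: true,
  runtimeRunning: true, vramGb: 4, ramGb: 31.3, cores: 12 });
console.log(score, grade(score)); // 90 "A"
```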
## Supported Runtimes

- Ollama — full integration: model list, load/unload, pull/remove, running status, GPU/CPU split, env generation, benchmarks
- llama.cpp — flag generation (`--threads`, `--n-gpu-layers`, `--tensor-split`, `--batch-size`)
- vLLM — arg generation (`--gpu-memory-utilization`, `--max-num-seqs`)
## Testing

```sh
npm test
```

8 test files, 25+ tests using Node's built-in test runner — all pure unit tests with mock data.
## Roadmap
- [ ] Runtime adapters: text-generation-webui, exllamav2
- [ ] Model-aware tuning (param size × quant → layer allocation)
- [ ] Remote node collector + multi-machine dashboard
- [ ] Thermal guardrails with auto profile switching
- [ ] Webhook alerts on VRAM cap breach
- [ ] Config presets per GPU family
- [ ] SSH tunnel support for remote GPU nodes
## License
MIT
Built with teeth. 🌑
