gpu-orchestrator
Built by DARKSOL 🌑
Cross-platform GPU/CPU tuning, VRAM management, and live monitoring dashboard for local AI runtimes.
Why this exists
Running local AI usually means throwing models at hardware until something works. gpu-orchestrator gives you actual visibility into what your machine is doing: it detects your compute backends, caps your VRAM intelligently, generates tuned runtime configs, and lets you watch everything live in a terminal dashboard. Built for real local inference setups, not cloud demos.
What it does
- Auto-detects compute backends: CUDA, ROCm, Vulkan, Metal, CPU
- Inspects hardware and reports VRAM, RAM, CPU with ASCII bars
- Generates tuned runtime configs for Ollama, llama.cpp, and vLLM
- Live ASCII terminal dashboard with CPU trends, memory gauges, GPU tables, VRAM bars, per-core sparklines, and runtime status
- Quick benchmark runner — tokens/sec avg/min/max across runs
- Non-aggressive VRAM soft caps: suggestions only, never kills your processes
Install
```bash
npm install -g gpu-orchestrator
```
Commands
doctor — Full hardware inspection
```bash
gpu-orchestrator doctor
gpu-orchestrator doctor --json
gpu-orchestrator doctor --vram-policy strict
```
Reports:
- CPU, RAM, GPU hardware
- VRAM soft caps with ASCII bars per GPU
- Compute backends (CUDA / ROCm / Vulkan / Metal / CPU)
- Backend recommendations (e.g. AMD → ROCm primary, Vulkan fallback)
- Runtime status (Ollama, llama.cpp, vLLM — online/offline)
- Profile recommendations
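Under the hood, a report like this boils down to a handful of system probes. The sketch below is not the tool's actual code; it just gathers the same raw facts using the systeminformation package, which the flow section mentions for monitoring:

```ts
// Illustrative only: collect the hardware facts a doctor-style report needs.
import si from 'systeminformation';

async function inspect(): Promise<void> {
  const [cpu, mem, gpus] = await Promise.all([si.cpu(), si.mem(), si.graphics()]);

  console.log(`CPU: ${cpu.manufacturer} ${cpu.brand} (${cpu.physicalCores}c/${cpu.cores}t)`);
  console.log(
    `RAM: ${(mem.total / 2 ** 30).toFixed(1)} GiB total, ` +
    `${(mem.available / 2 ** 30).toFixed(1)} GiB available`
  );

  for (const gpu of gpus.controllers) {
    // systeminformation reports `vram` in MB when the driver exposes it.
    console.log(`GPU: ${gpu.vendor} ${gpu.model} (${gpu.vram ?? '?'} MB VRAM)`);
  }
}

inspect().catch(console.error);
```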
optimize — Generate tuned runtime config
```bash
# Ollama
gpu-orchestrator optimize --runtime ollama --profile balanced
gpu-orchestrator optimize --runtime ollama --profile balanced --vram-cap 3.0

# llama.cpp with Vulkan backend
gpu-orchestrator optimize --runtime llamacpp --profile latency --backend vulkan

# vLLM with custom VRAM cap
gpu-orchestrator optimize --runtime vllm --profile throughput --vram-cap 6.0

# Dry run (preview without writing)
gpu-orchestrator optimize --runtime ollama --profile balanced --dry-run
```
Options:
| Flag | Values | Description |
|---|---|---|
| `--runtime` | `ollama` / `llamacpp` / `vllm` | Target runtime |
| `--profile` | `latency` / `throughput` / `balanced` / `low-power` | Tuning profile |
| `--vram-cap <GB>` | number | Manual VRAM soft cap in GB |
| `--vram-policy` | `strict` / `balanced` / `lenient` | VRAM reserve policy |
| `--backend` | `cuda` / `rocm` / `vulkan` / `metal` / `cpu` | Force backend (default: auto) |
| `--output <path>` | string | Custom output path |
| `--dry-run` | — | Preview without writing |
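The README doesn't spell out exactly what each profile changes, but conceptually a profile is a mapping from intent to runtime flags. The sketch below is purely illustrative, with hypothetical heuristics and values rather than gpu-orchestrator's real output, using llama.cpp flags as the target:

```ts
// Hypothetical profile-to-flags mapping; real output depends on detected
// hardware and the active VRAM policy.
type Profile = 'latency' | 'throughput' | 'balanced' | 'low-power';

function llamaCppFlags(profile: Profile, vramCapGb: number): string[] {
  // Illustrative heuristic: roughly one offloaded layer per ~150 MB of cap.
  const gpuLayers = Math.floor((vramCapGb * 1024) / 150);
  const base = ['--n-gpu-layers', String(gpuLayers)];

  switch (profile) {
    case 'latency':    return [...base, '--batch-size', '256'];  // small batches, fast first token
    case 'throughput': return [...base, '--batch-size', '2048']; // large batches, max tokens/sec
    case 'low-power':  return ['--n-gpu-layers', '0'];           // CPU only, minimal power draw
    default:           return [...base, '--batch-size', '512'];  // balanced
  }
}

console.log(llamaCppFlags('balanced', 6.0).join(' '));
```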
monitor — Live ASCII terminal dashboard
```bash
gpu-orchestrator monitor
gpu-orchestrator monitor --runtime all --interval 2000
```
Dashboard panels:
- CPU load trend line
- Memory donut gauge (color-coded)
- GPU table (model/vendor/VRAM/driver)
- VRAM bar chart
- Per-core CPU sparklines
- Runtime status panel (Ollama / llama.cpp / vLLM online status)
- Event log
Press q to exit.
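Conceptually the dashboard is a poll-and-render loop. Here is a minimal sketch of that shape, assuming the systeminformation and blessed-contrib packages named in the flow section; the real panels and layout are richer:

```ts
// Minimal poll/render loop: sample CPU load every second, plot it, quit on `q`.
import blessed from 'blessed';
import contrib from 'blessed-contrib';
import si from 'systeminformation';

const screen = blessed.screen({ smartCSR: true });
const cpuLine = contrib.line({ label: 'CPU load %', showLegend: false, maxY: 100 });
screen.append(cpuLine); // widgets must be appended before setData
screen.key(['q', 'C-c'], () => process.exit(0));

const history: number[] = [];

setInterval(async () => {
  const load = await si.currentLoad();
  history.push(load.currentLoad);
  if (history.length > 60) history.shift(); // keep a one-minute window at 1 s polls

  cpuLine.setData([{
    title: 'cpu',
    x: history.map((_, i) => String(i)),
    y: history,
  }]);
  screen.render();
}, 1000);
```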
bench — Quick runtime benchmark
```bash
gpu-orchestrator bench --runtime ollama --model lfm2:latest --runs 3
```
Reports avg/min/max tokens per second.
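The arithmetic behind those numbers is straightforward. Below is a sketch of one way to benchmark Ollama over its local HTTP API and report avg/min/max tokens per second; the actual bench command may measure differently, and the endpoint and response fields are Ollama's, not gpu-orchestrator's:

```ts
// Run N non-streaming generations and derive tokens/sec from Ollama's
// eval_count / eval_duration (eval_duration is in nanoseconds).
async function benchOllama(model: string, prompt: string, runs: number) {
  const tps: number[] = [];

  for (let i = 0; i < runs; i++) {
    const res = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      body: JSON.stringify({ model, prompt, stream: false }),
    });
    const body = await res.json() as { eval_count: number; eval_duration: number };
    tps.push(body.eval_count / (body.eval_duration / 1e9));
  }

  return {
    avg: tps.reduce((a, b) => a + b, 0) / tps.length,
    min: Math.min(...tps),
    max: Math.max(...tps),
  };
}

benchOllama('lfm2:latest', 'Explain VRAM soft caps in one sentence.', 3).then(console.log);
```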
VRAM Soft Caps
Non-aggressive by design — suggestions only, never kills processes.
| Policy | Reserve | Use case |
|---|---|---|
| strict | 25% | Safety margin for OS/desktop |
| balanced | 15% | Default — works for most setups |
| lenient | 5% | Max performance, minimal headroom |
Manual override: --vram-cap 3.0 sets a hard GB ceiling.
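For example, on an 8 GB GPU the balanced policy would reserve about 1.2 GB and suggest a cap of roughly 6.8 GB; adding `--vram-cap 3.0` lowers that ceiling to 3 GB regardless of policy.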
Generated env vars per runtime:
- Ollama: `OLLAMA_MAX_VRAM`
- llama.cpp: `--n-gpu-layers` hints + `LLAMA_VRAM_CAP_MB`
- vLLM: `--gpu-memory-utilization`
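Putting the policy table and the runtime knobs together, the cap logic amounts to a reserve fraction plus an optional manual ceiling. The sketch below illustrates that arithmetic and a plausible mapping onto the settings above; the exact units and values gpu-orchestrator writes are assumptions here:

```ts
// Reserve fractions come straight from the policy table; everything else is
// an illustrative mapping, not the tool's exact output.
type Policy = 'strict' | 'balanced' | 'lenient';

const RESERVE: Record<Policy, number> = { strict: 0.25, balanced: 0.15, lenient: 0.05 };

function softCapGb(totalVramGb: number, policy: Policy, manualCapGb?: number): number {
  const policyCap = totalVramGb * (1 - RESERVE[policy]);
  // A manual --vram-cap acts as a hard ceiling on top of the policy-derived cap.
  return manualCapGb !== undefined ? Math.min(policyCap, manualCapGb) : policyCap;
}

function runtimeSettings(totalVramGb: number, capGb: number) {
  return {
    ollama:   { OLLAMA_MAX_VRAM: String(Math.floor(capGb * 1024 ** 3)) }, // bytes (assumed unit)
    llamacpp: { LLAMA_VRAM_CAP_MB: String(Math.floor(capGb * 1024)) },    // MB, plus --n-gpu-layers hints
    vllm:     { gpuMemoryUtilization: +(capGb / totalVramGb).toFixed(2) },// fraction of total VRAM
  };
}

const cap = softCapGb(8, 'balanced'); // 6.8 on an 8 GB card
console.log(cap, runtimeSettings(8, cap));
```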
Backend Detection
Auto-probes in priority order:
| Backend | Probe method |
|---|---|
| CUDA | nvcc / nvidia-smi |
| ROCm | rocminfo / rocm-smi |
| Vulkan | vulkaninfo |
| Metal | macOS native |
| CPU | Always available |
AMD example: ROCm installed → ROCm primary. No ROCm → Vulkan fallback. Neither → CPU.
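In code, that priority order is just a fall-through over the probe commands in the table. A minimal sketch follows; checking the PATH for each probe binary is an assumption, and the real detector may instead parse the tools' output:

```ts
// Walk the backends in priority order and return the first one whose probe
// command is available; CPU is the unconditional fallback.
import { execSync } from 'node:child_process';

type Backend = 'cuda' | 'rocm' | 'vulkan' | 'metal' | 'cpu';

function onPath(cmd: string): boolean {
  const probe = process.platform === 'win32' ? `where ${cmd}` : `command -v ${cmd}`;
  try {
    execSync(probe, { stdio: 'ignore' });
    return true;
  } catch {
    return false;
  }
}

function detectBackend(): Backend {
  if (onPath('nvidia-smi') || onPath('nvcc')) return 'cuda';
  if (onPath('rocminfo') || onPath('rocm-smi')) return 'rocm';
  if (onPath('vulkaninfo')) return 'vulkan';
  if (process.platform === 'darwin') return 'metal'; // Metal is macOS-native
  return 'cpu';                                      // always available
}

console.log(`Primary backend: ${detectBackend()}`);
```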
Architecture / flow
- `doctor` probes hardware and backend availability
- `optimize` maps hardware profile → runtime config → writes env vars or config files
- `monitor` polls system metrics via `systeminformation` and renders them live with `blessed-contrib`
- `bench` runs timed inference loops and reports throughput stats
- All commands respect the VRAM policy before generating any output
Roadmap
- Runtime adapters: text-generation-webui, exllamav2
- Model-aware tuning (param size × quantization → layer allocation)
- Remote node collector + multi-machine dashboard
- Thermal guardrails with auto profile switching
- Config presets per GPU vendor family
- Webhook alerts when VRAM usage nears cap
Security notes
- No telemetry, no phone-home. Runs fully local.
- VRAM caps are advisory — the tool does not kill or modify running processes.
- Generated configs may contain system-specific values; review before deploying to shared machines.
License + links
- License: MIT
- GitHub: https://github.com/darks0l/gpu-orchestrator
- npm: https://www.npmjs.com/package/gpu-orchestrator
