# gpu-orchestrator

v0.5.0

Cross-platform GPU/CPU tuning, VRAM management, and live monitoring dashboard for local AI runtimes.
## Why this exists
Local AI runtimes (Ollama, llama.cpp, vLLM) ship with zero guidance on tuning for your hardware. Users guess at batch sizes, thread counts, VRAM limits, and backend selection — and get it wrong. This tool inspects your machine, detects your compute stack, and generates correct runtime configs automatically.
## What it does
- Detects hardware — CPU cores, RAM, GPUs, VRAM per device, temperatures
- Probes compute backends — CUDA, ROCm, Vulkan, Metal, CPU
- Recommends backends — AMD → ROCm primary, Vulkan fallback; NVIDIA → CUDA; Apple → Metal
- VRAM soft caps — suggest memory limits per GPU (strict / balanced / lenient policies)
- Model recommendations — curated database of 35+ models, ranked by what fits your VRAM
- Quickstart wizard — zero to running in one command
- Model fit checker — tell it a model size, it tells you if it fits your VRAM
- GPU/CPU split planner — visualize how a model distributes across GPUs + CPU RAM
- Model management — list, load, unload, pull, remove, inspect Ollama models with GPU/CPU split bars
- Profile-based tuning — latency, throughput, balanced, low-power
- Generates runtime configs — env files for Ollama, llama.cpp flags, vLLM args
- Live ASCII dashboard — CPU trend, RAM gauge, GPU table, VRAM bars, temperatures, live inference
- Web GUI dashboard — browser-based live monitoring with Chart.js graphs, health score, model management
- Install guides — platform-aware setup instructions for ROCm, Vulkan, CUDA, Metal, Ollama
- Quick benchmarks — tok/s measurement with history tracking, comparison, trend sparklines
- Cost savings — compare local inference electricity cost vs cloud API pricing
- Health score — A-F system grade with category breakdown
- Temperature monitoring — GPU and CPU temps via nvidia-smi, rocm-smi, systeminformation
- Process viewer — find running AI runtime processes
- Persistent config — save preferences to `~/.gpu-orchestrator/config.json`
## Install

```sh
npm install -g gpu-orchestrator
```

## Commands
### quickstart — Zero to running

```sh
gpu-orchestrator quickstart
```

Interactive 9-step wizard: detect hardware → check Ollama → recommend models → pull → generate config → load → benchmark → done.
### status — Quick overview

```sh
gpu-orchestrator status
```

```text
⚡ GPU Orchestrator — Status

CPU: [██░░░░░░░░░░░░░░░░░░] 4.9% (12 cores)
RAM: [████████░░░░░░░░░░░░] 41.9% (13.1/31.3 GB)
GPU: AMD Radeon RX 5500 (4.0 GB VRAM)
Cap: 3.38 GB usable / 3.98 GB total (balanced)
Temp: CPU 52°C | AMD Radeon RX 5500 61°C
API: VULKAN
Proc: 2 AI process(es) running

● Ollama — 3 models, 1 running
○ llama.cpp
○ vLLM
```

### recommend — What model should I run?
```sh
gpu-orchestrator recommend
gpu-orchestrator recommend --category code
gpu-orchestrator recommend --json
```

Scans your hardware and shows a ranked table of models that fit your VRAM, with medal emojis and install status.
### doctor — Full inspection

```sh
gpu-orchestrator doctor
gpu-orchestrator doctor --json
gpu-orchestrator doctor --vram-policy strict
gpu-orchestrator doctor --export report.json
```

Detailed report: health score (A–F grade), system specs, temperatures, per-GPU VRAM caps, backend detection table, process list, runtime probing, and profile recommendations.
### models — Ollama model management

```sh
gpu-orchestrator models list              # all installed + running status
gpu-orchestrator models running           # loaded models with GPU/CPU split + last bench tok/s
gpu-orchestrator models info llama3:8b    # detailed model specs
gpu-orchestrator models load llama3:8b    # load into memory
gpu-orchestrator models unload llama3:8b  # free memory
gpu-orchestrator models pull llama3:8b    # pull from Ollama library (streaming progress)
gpu-orchestrator models rm llama3:8b      # remove a model
```

### model-fit — Will it run?
```sh
gpu-orchestrator model-fit 7b
gpu-orchestrator model-fit 13b --quant q8
gpu-orchestrator model-fit 3b --quant fp16
```

Presets: 1b 3b 7b 8b 13b 14b 24b 30b 34b 70b 72b — or pass a raw size in GB.
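The arithmetic behind a fit check of this kind can be sketched as follows. The bytes-per-parameter multipliers and the 20% headroom margin here are illustrative assumptions, not the tool's actual values:

```javascript
// Hypothetical fit estimate: parameter count × bytes per weight, plus
// ~20% headroom for KV cache and runtime overhead. Illustrative only.
const BYTES_PER_PARAM = { q4: 0.5, q8: 1.0, fp16: 2.0 };

function estimateVramGb(paramsBillions, quant = "q4") {
  const bytes = paramsBillions * 1e9 * BYTES_PER_PARAM[quant];
  return +((bytes / 2 ** 30) * 1.2).toFixed(1);
}

function fitsInVram(paramsBillions, quant, usableVramGb) {
  return estimateVramGb(paramsBillions, quant) <= usableVramGb;
}

console.log(estimateVramGb(7, "q4"));     // ≈ 3.9 GB
console.log(fitsInVram(13, "q8", 3.38));  // false — 13B at q8 needs ~14.5 GB
```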
### split — GPU/CPU layer split planner

```sh
gpu-orchestrator split 4.5
gpu-orchestrator split 14 --runtime llamacpp
gpu-orchestrator split 8 --policy strict
```

### optimize — Generate runtime config
```sh
gpu-orchestrator optimize --runtime ollama --profile balanced
gpu-orchestrator optimize --runtime llamacpp --profile latency --backend vulkan
gpu-orchestrator optimize --runtime ollama --profile balanced --apply    # show how to apply
gpu-orchestrator optimize --runtime ollama --profile balanced --dry-run
```

| Option | Values |
|---|---|
| `--runtime` | ollama · llamacpp · vllm |
| `--profile` | latency · throughput · balanced · low-power |
| `--vram-cap` | Manual cap in GB |
| `--vram-policy` | strict (25%) · balanced (15%) · lenient (5%) |
| `--backend` | cuda · rocm · vulkan · metal · cpu · auto |
| `--apply` | Show platform-specific apply instructions |
### bench — Runtime benchmark

```sh
gpu-orchestrator bench --model llama3:8b --runs 3
gpu-orchestrator bench history                        # past runs table
gpu-orchestrator bench compare llama3:8b mistral:7b   # side-by-side
gpu-orchestrator bench trend llama3:8b                # ASCII sparkline over time
```

Results auto-save to history. Use `--no-save` to skip.
### savings — Cost calculator

```sh
gpu-orchestrator savings --tokens 100000
gpu-orchestrator savings --tokens 500000 --electricity 0.15
gpu-orchestrator savings --tokens 100000 --json
```

Compares your local electricity cost against GPT-4o, GPT-4o Mini, Claude Sonnet, and Claude Haiku API pricing.
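The comparison is back-of-envelope energy math. A sketch — all figures below (throughput, wattage, API price) are placeholder assumptions, not the prices the command actually uses:

```javascript
// Local cost: inference wall-clock time × GPU power draw × electricity rate.
function localCostUsd(tokens, tokPerSec, wattsDraw, usdPerKwh) {
  const hours = tokens / tokPerSec / 3600;
  return hours * (wattsDraw / 1000) * usdPerKwh;
}

// Cloud cost: flat per-million-token API pricing.
function cloudCostUsd(tokens, usdPerMillionTokens) {
  return (tokens / 1e6) * usdPerMillionTokens;
}

// 100k tokens at 30 tok/s on a 150 W GPU at $0.15/kWh, vs a $10/M-token API:
console.log(localCostUsd(100_000, 30, 150, 0.15).toFixed(4)); // "0.0208"
console.log(cloudCostUsd(100_000, 10));                       // 1
```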
### ps — AI process viewer

```sh
gpu-orchestrator ps
gpu-orchestrator ps --json
```

Shows running Ollama, llama.cpp, and vLLM processes with PID, CPU%, and memory.
### config — Persistent configuration

```sh
gpu-orchestrator config show                          # print current config
gpu-orchestrator config set electricityRate 0.15      # update a value
gpu-orchestrator config set hosts.ollama http://192.168.1.50:11434
gpu-orchestrator config reset                         # restore defaults
gpu-orchestrator config path                          # show config file location
```

### monitor — Live ASCII dashboard (TUI)
```sh
gpu-orchestrator monitor
gpu-orchestrator monitor --runtime all --interval 2000
gpu-orchestrator monitor --new-window
```

Terminal UI panels: CPU load trend (60 points) · memory donut · GPU table · temperature panel · live inference metrics · runtime status · event log. Press `q` to exit.
### web — Web GUI dashboard

```sh
gpu-orchestrator web
gpu-orchestrator web --port 8080 --open
gpu-orchestrator web --host http://192.168.1.50:11434
```

Browser-based live dashboard with:
- DARKSOL gold/dark branded theme
- Chart.js real-time CPU line chart
- Temperature gauges (color-coded)
- Health score badge (A-F)
- GPU/VRAM panels with cap indicators
- Ollama model list with load/unload buttons
- Model pull input field
- Cost savings calculator widget
- Runtime status (Ollama, llama.cpp, vLLM)
- 2-second auto-refresh
### install-guide — Setup help

```sh
gpu-orchestrator install-guide
gpu-orchestrator install-guide rocm
gpu-orchestrator install-guide cuda
gpu-orchestrator install-guide ollama
```

Platform-aware guides for Windows, Linux, and macOS.
## VRAM Soft Caps

Non-aggressive by design — the caps are suggestions only; the tool never kills processes.
| Policy | VRAM Reserve | Use Case |
|---|---|---|
| strict | 25% | Desktop multitasking, stability first |
| balanced | 15% | Default — good performance + headroom |
| lenient | 5% | Max performance, may cause stuttering |
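Deriving a cap from a policy is simple arithmetic. A minimal sketch, assuming the reserve fraction is applied per device (function and constant names are illustrative, not the tool's internal API):

```javascript
// Fraction of VRAM held back for the desktop/compositor, per policy.
const RESERVE = { strict: 0.25, balanced: 0.15, lenient: 0.05 };

function usableVramGb(totalGb, policy = "balanced") {
  const reserve = RESERVE[policy];
  if (reserve === undefined) throw new Error(`unknown policy: ${policy}`);
  return +(totalGb * (1 - reserve)).toFixed(2);
}

// Matches the status example above: a 3.98 GB card under "balanced"
console.log(usableVramGb(3.98));        // 3.38
console.log(usableVramGb(8, "strict")); // 6
```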
## Backend Detection

Auto-probes and recommends:

| Hardware | Primary | Fallback |
|---|---|---|
| NVIDIA + CUDA | CUDA | Vulkan |
| AMD + ROCm | ROCm | Vulkan |
| AMD (no ROCm) | Vulkan | CPU |
| Apple Silicon | Metal | — |
| Intel | Vulkan | CPU |
## Health Score

System health grading across 6 categories:

| Category | Max Points | Criteria |
|---|---|---|
| GPU | 20 | Discrete GPU detected |
| Backend | 20 | GPU compute backend available |
| Runtime | 20 | AI runtime running |
| VRAM | 15 | 8 GB+ = 15, 4 GB+ = 10, 2 GB+ = 5 |
| RAM | 15 | 32 GB+ = 15, 16 GB+ = 10 |
| Cores | 10 | 8+ = 10, 4+ = 5 |
Grades: A (90+), B (75+), C (60+), D (45+), F (<45)
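The scheme in the table can be sketched as a scoring function (field names here are illustrative, not the tool's internal API):

```javascript
// Award the points for the first tier the value reaches, else 0.
const tier = (value, steps) =>
  steps.find(([min]) => value >= min)?.[1] ?? 0;

function healthScore(sys) {
  return (sys.discreteGpu ? 20 : 0)
    + (sys.backendAvailable ? 20 : 0)
    + (sys.runtimeRunning ? 20 : 0)
    + tier(sys.vramGb, [[8, 15], [4, 10], [2, 5]])
    + tier(sys.ramGb,  [[32, 15], [16, 10]])
    + tier(sys.cores,  [[8, 10], [4, 5]]);
}

const grade = (s) =>
  s >= 90 ? "A" : s >= 75 ? "B" : s >= 60 ? "C" : s >= 45 ? "D" : "F";

// The status example machine: 4 GB GPU, Vulkan, Ollama running, 31.3 GB RAM, 12 cores.
const score = healthScore({ discreteGpu: true, backendAvailable: true,
  runtimeRunning: true, vramGb: 4, ramGb: 31.3, cores: 12 });
console.log(score, grade(score)); // 90 "A"
```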
## Supported Runtimes

- Ollama — full integration: model list, load/unload, pull/remove, running status, GPU/CPU split, env generation, benchmarks
- llama.cpp — flag generation (`--threads`, `--n-gpu-layers`, `--tensor-split`, `--batch-size`)
- vLLM — arg generation (`--gpu-memory-utilization`, `--max-num-seqs`)
## Testing

```sh
npm test
```

8 test files, 25+ tests using Node's built-in test runner — all pure unit tests with mock data.
## Roadmap
- [ ] Runtime adapters: text-generation-webui, exllamav2
- [ ] Model-aware tuning (param size × quant → layer allocation)
- [ ] Remote node collector + multi-machine dashboard
- [ ] Thermal guardrails with auto profile switching
- [ ] Webhook alerts on VRAM cap breach
- [ ] Config presets per GPU family
- [ ] SSH tunnel support for remote GPU nodes
## License
MIT
Built with teeth. 🌑
