Auxot GPU Worker CLI
Connect your local GPU resources to the Auxot platform. The worker CLI automatically downloads and manages llama.cpp binaries and models based on your GPU key policy.
Requirements
- Node.js 20+
- Valid Auxot GPU key with configured policy
- GPU hardware (NVIDIA CUDA, AMD Vulkan, or Apple Metal) - CPU fallback available with limitations
Installation
```bash
# Run directly with npx (recommended)
npx @auxot/worker-cli --gpu-key YOUR_GPU_KEY

# Or install globally
npm install -g @auxot/worker-cli
worker-cli --gpu-key YOUR_GPU_KEY
```

Quick Start
1. Get Your GPU Key
- Log in to Auxot
- Navigate to Organization Settings → GPU Keys
- Create a new GPU key
- Configure the policy (model, quantization, context size, capabilities)
- Copy the key (format: gpu.xxxxx.yyyyy)
2. Run the Worker
```bash
npx @auxot/worker-cli --gpu-key gpu.xxxxx.yyyyy
```

The worker CLI will:
- ✅ Automatically download the llama.cpp binary (first run only)
- ✅ Automatically download the model specified in your GPU key policy
- ✅ Spawn and manage the llama.cpp process
- ✅ Connect to the Auxot platform
- ✅ Start processing jobs
CLI Options
```
--gpu-key <key>       GPU authentication key (required)
--auxot-url <url>     Auxot platform URL (default: https://auxot.com)
--debug [level]       Enable debug logging (level 1 or 2; default: 1)
--help, -h            Show help message
```

Note: The --llama-url option is no longer needed. The worker CLI manages its own llama.cpp instance on http://127.0.0.1:9002.
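For reference, here is a minimal sketch of how flags like these can be parsed with Node 20's built-in util.parseArgs. It is an illustration only, not the CLI's actual implementation; the optional value of --debug is simplified to a plain string option.

```ts
// Illustrative only: parsing the documented flags with Node's built-in
// util.parseArgs (available in Node 20). Not the worker CLI's real source.
import { parseArgs } from "node:util";

const { values } = parseArgs({
  options: {
    "gpu-key":   { type: "string" },
    "auxot-url": { type: "string", default: "https://auxot.com" },
    // --debug [level] takes an optional level; modeled here as a string.
    debug:       { type: "string" },
    help:        { type: "boolean", short: "h" },
  },
});

if (!values["gpu-key"]) {
  console.error("error: --gpu-key is required");
  process.exit(1);
}
```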
How It Works
- Policy Reception: Worker CLI connects to Auxot and receives the GPU key policy (model, quantization, context size, etc.)
- Model Download: Automatically downloads the required GGUF model from Hugging Face if not already cached
- Binary Download: Downloads the appropriate llama.cpp binary for your platform (first run only)
- Process Management: Spawns llama.cpp with policy parameters and manages its lifecycle
- Capability Discovery: Queries llama.cpp to discover actual model capabilities
- Validation: Validates discovered capabilities against the policy (both client and server-side)
- Job Processing: Listens for agent execution jobs and forwards them to llama.cpp
- Streaming: Streams response tokens back to the platform in real-time
- Crash Recovery: Automatically restarts llama.cpp if it crashes
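The crash-recovery step can be pictured as a small spawn-and-restart loop. The sketch below is an illustration with stated assumptions: the binary path, restart backoff, and error handling are guesses, though --model, --host, and --port are standard llama.cpp server flags and 127.0.0.1:9002 is the address documented above.

```ts
// Sketch of a spawn-and-restart loop; not the worker CLI's actual code.
import { spawn } from "node:child_process";

function startLlama(binaryPath: string, modelPath: string): void {
  const child = spawn(
    binaryPath,
    // --model/--host/--port are standard llama.cpp server flags;
    // 127.0.0.1:9002 matches the address documented above.
    ["--model", modelPath, "--host", "127.0.0.1", "--port", "9002"],
    { stdio: "inherit" },
  );

  child.on("exit", (code) => {
    if (code !== 0) {
      // Hypothetical fixed backoff; the real retry policy is not documented.
      console.warn(`llama.cpp exited with code ${code}; restarting in 3s`);
      setTimeout(() => startLlama(binaryPath, modelPath), 3000);
    }
  });
}
```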
GPU Key Policy
The GPU key policy defines what model and configuration your worker must use:
- Model Name: Which model to load (e.g., "Qwen3-VL-30B-A3B")
- Quantization: Model quantization level (e.g., "Q4_K_S", "F16")
- Context Size: Maximum context window (e.g., 128000)
- Max Parallelism: Maximum parallel jobs (e.g., 2)
- Capabilities: Required capabilities (e.g., ["chat", "vision"])
The worker CLI validates that its discovered capabilities match the policy before accepting jobs.
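As a rough illustration, a capability check of this kind could look like the following. The field names are assumptions inferred from the list above, not Auxot's actual policy schema.

```ts
// Hypothetical policy shapes; field names are inferred from this README,
// not taken from Auxot's actual schema.
interface GpuKeyPolicy {
  modelName: string;         // e.g. "Qwen3-VL-30B-A3B"
  quantization: string;      // e.g. "Q4_K_S"
  contextSize: number;       // e.g. 128000
  maxParallelism: number;    // e.g. 2
  capabilities: string[];    // e.g. ["chat", "vision"]
}

interface Discovered {
  contextSize: number;
  capabilities: string[];
}

// A worker satisfies the policy if it meets the context window and
// supports every required capability.
function satisfiesPolicy(policy: GpuKeyPolicy, found: Discovered): boolean {
  return (
    found.contextSize >= policy.contextSize &&
    policy.capabilities.every((c) => found.capabilities.includes(c))
  );
}
```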
Model Storage
Models are cached in ~/.auxot/models/ (the exact location varies by platform). You can override this location with the AUXOT_MODELS_DIR environment variable.
Binary Storage
The llama.cpp binary is cached in ~/.auxot/llama-server/{platform}-{arch}/. The binary is downloaded once and reused on subsequent runs.
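Here is a sketch of how these two cache paths could be resolved in Node, assuming the {platform}-{arch} placeholders map to process.platform and process.arch (e.g. darwin-arm64, linux-x64). It mirrors the documented layout but is not the CLI's actual code.

```ts
// Resolving the documented cache locations; illustrative only.
import os from "node:os";
import path from "node:path";

// AUXOT_MODELS_DIR (documented above) overrides the default models dir.
const modelsDir =
  process.env.AUXOT_MODELS_DIR ?? path.join(os.homedir(), ".auxot", "models");

// Assumes {platform}-{arch} maps to Node's process.platform/process.arch,
// e.g. "darwin-arm64" or "linux-x64".
const binaryDir = path.join(
  os.homedir(),
  ".auxot",
  "llama-server",
  `${process.platform}-${process.arch}`,
);
```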
GPU Detection
The worker CLI automatically detects your GPU hardware:
- macOS: Metal GPU acceleration (built into binaries)
- Linux: Vulkan GPU acceleration (AMD/NVIDIA) or CPU fallback
- Windows: CUDA 12.4 GPU acceleration (NVIDIA) or CPU fallback
If no GPU is detected, the worker will:
- Download a CPU-only binary and log a warning
- Limit model size to 7B parameters or less (if the policy specifies a larger model, validation will fail; see the sketch below)
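One common way to probe for an NVIDIA GPU is to check whether nvidia-smi runs and lists devices; a minimal sketch follows. This is a generic technique, not necessarily how the worker CLI performs detection.

```ts
// Generic NVIDIA probe: succeed if `nvidia-smi -L` runs and lists GPUs.
// Illustrative; the worker CLI's real detection logic is not documented.
import { execFileSync } from "node:child_process";

function hasNvidiaGpu(): boolean {
  try {
    const out = execFileSync("nvidia-smi", ["-L"], { encoding: "utf8" });
    return out.trim().length > 0;
  } catch {
    return false; // nvidia-smi missing or failed: fall back to Vulkan/CPU
  }
}
```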
GPU ID
The worker CLI generates a stable UUID on first run and stores it in ~/.auxot/gpu-id. This allows Auxot to track individual GPUs across restarts.
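The stable-ID scheme is straightforward to picture: read ~/.auxot/gpu-id if it exists, otherwise generate a UUID and persist it. The sketch below assumes the file holds a bare UUID string; the actual file format is not documented.

```ts
// Sketch of a read-or-create stable ID, assuming ~/.auxot/gpu-id holds
// a bare UUID string (the real file format is not documented).
import fs from "node:fs";
import os from "node:os";
import path from "node:path";
import { randomUUID } from "node:crypto";

function getGpuId(): string {
  const idFile = path.join(os.homedir(), ".auxot", "gpu-id");
  if (fs.existsSync(idFile)) {
    return fs.readFileSync(idFile, "utf8").trim();
  }
  const id = randomUUID();
  fs.mkdirSync(path.dirname(idFile), { recursive: true });
  fs.writeFileSync(idFile, id, "utf8");
  return id;
}
```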
Troubleshooting
Connection Failed
- Verify --auxot-url is correct
- Check network connectivity
- Ensure GPU key is valid and has a configured policy
Policy Validation Failed
- Ensure your GPU key policy is configured in the web UI
- Check that the model specified in the policy exists in the model registry
- Verify your GPU hardware meets the policy requirements (context size, capabilities)
Model Download Failed
- Check internet connectivity (models are downloaded from Hugging Face)
- Verify sufficient disk space (models can be 10GB+)
- Check Hugging Face API rate limits (downloads are resumable)
llama.cpp Crashes
- Check GPU memory (VRAM) - models may not fit in available memory
- Review llama.cpp logs in worker CLI output
- Worker CLI will attempt to auto-restart crashed processes
No Jobs Received
- Verify GPU key belongs to the correct organization
- Check that agents exist in your organization
- Ensure GPU meets minimum context size requirements
- Verify worker is online (check dashboard in web UI)
Support
For issues, questions, or feature requests, please visit:
License
Copyright © 2026 Auxot. All rights reserved.
