
@ruvector/ruvllm-cli v0.1.0

CLI for LLM inference, benchmarking, and model management - run local LLMs with Metal/CUDA acceleration

@ruvector/ruvllm-cli


Command-line interface for local LLM inference and benchmarking - run AI models on your machine with Metal, CUDA, and CPU acceleration.

Features

  • Hardware Acceleration - Metal (macOS), CUDA (NVIDIA), Vulkan, Apple Neural Engine
  • GGUF Support - Load quantized models (Q4, Q5, Q6, Q8) for efficient inference
  • Interactive Chat - Terminal-based chat sessions with conversation history
  • Benchmarking - Measure tokens/second, memory usage, time-to-first-token
  • HTTP Server - OpenAI-compatible API server for integration
  • Model Management - Download, list, and manage models from HuggingFace
  • Streaming Output - Real-time token streaming for responsive UX

Installation

# Install globally
npm install -g @ruvector/ruvllm-cli

# Or run directly with npx
npx @ruvector/ruvllm-cli --help

For full native performance, install the Rust binary:

cargo install ruvllm-cli

Quick Start

Run Inference

# Basic inference
ruvllm run --model ./llama-7b-q4.gguf --prompt "Explain quantum computing"

# With options
ruvllm run \
  --model ./model.gguf \
  --prompt "Write a haiku about Rust" \
  --temperature 0.8 \
  --max-tokens 100 \
  --backend metal

Interactive Chat

# Start chat session
ruvllm chat --model ./model.gguf

# With system prompt
ruvllm chat --model ./model.gguf --system "You are a helpful coding assistant"

Benchmark Performance

# Run benchmark
ruvllm bench --model ./model.gguf --iterations 20

# Compare backends
ruvllm bench --model ./model.gguf --backend metal
ruvllm bench --model ./model.gguf --backend cpu

Start Server

# OpenAI-compatible API server
ruvllm serve --model ./model.gguf --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}'

Model Management

# List available models
ruvllm list

# Download from HuggingFace
ruvllm download TheBloke/Llama-2-7B-GGUF

# Download specific quantization
ruvllm download TheBloke/Llama-2-7B-GGUF --quant q4_k_m

CLI Reference

| Command | Description |
|---------|-------------|
| run | Run inference on a prompt |
| chat | Interactive chat session |
| bench | Benchmark model performance |
| serve | Start HTTP server |
| list | List downloaded models |
| download | Download model from HuggingFace |

Global Options

| Option | Description | Default |
|--------|-------------|---------|
| --model, -m | Path to GGUF model file | - |
| --backend, -b | Acceleration backend (metal, cuda, cpu) | auto |
| --threads, -t | Number of CPU threads | auto |
| --gpu-layers | Layers to offload to GPU | all |
| --context-size | Context window size | 2048 |
| --verbose, -v | Enable verbose logging | false |

Generation Options

| Option | Description | Default |
|--------|-------------|---------|
| --temperature | Sampling temperature (0-2) | 0.7 |
| --top-p | Nucleus sampling threshold | 0.9 |
| --top-k | Top-k sampling | 40 |
| --max-tokens | Maximum tokens to generate | 256 |
| --repeat-penalty | Repetition penalty | 1.1 |

Programmatic Usage

import {
  parseArgs,
  formatBenchmarkTable,
  getAvailableBackends,
  ModelConfig,
  BenchmarkResult,
} from '@ruvector/ruvllm-cli';

// Parse CLI arguments
const args = parseArgs(['--model', './model.gguf', '--temperature', '0.8']);
console.log(args); // { model: './model.gguf', temperature: '0.8' }

// Check available backends
const backends = getAvailableBackends();
console.log('Available:', backends); // ['cpu', 'metal'] on macOS

// Format benchmark results
const results: BenchmarkResult[] = [
  {
    model: 'llama-7b',
    backend: 'metal',
    promptTokens: 50,
    generatedTokens: 100,
    promptTime: 120,
    generationTime: 2500,
    promptTPS: 416.7,
    generationTPS: 40.0,
    memoryUsage: 4200,
    peakMemory: 4800,
  },
];

console.log(formatBenchmarkTable(results));
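
The throughput fields follow directly from the raw counts and timings. Assuming the times are in milliseconds, as in the example above, the tokens-per-second figures can be reproduced from the other fields (a quick sketch, not part of the package API):

// Derive tokens/second from the BenchmarkResult above, assuming millisecond timings.
const tps = (tokens: number, millis: number): number => tokens / (millis / 1000);

const r = results[0];
console.log(tps(r.promptTokens, r.promptTime));        // 50 tokens / 0.12 s  ≈ 416.7
console.log(tps(r.generatedTokens, r.generationTime)); // 100 tokens / 2.5 s  = 40.0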

Performance

Benchmarks on Apple M2 Pro with Q4_K_M quantization:

| Model | Prompt TPS | Gen TPS | Memory |
|-------|------------|---------|--------|
| Llama-2-7B | 450 | 42 | 4.2 GB |
| Mistral-7B | 480 | 45 | 4.1 GB |
| Phi-2 | 820 | 85 | 1.8 GB |
| TinyLlama-1.1B | 1200 | 120 | 0.8 GB |

Configuration

Create ~/.ruvllm/config.json:

{
  "defaultBackend": "metal",
  "modelsDir": "~/.ruvllm/models",
  "cacheDir": "~/.ruvllm/cache",
  "streaming": true,
  "logLevel": "info"
}

Environment Variables

| Variable | Description |
|----------|-------------|
| RUVLLM_MODELS_DIR | Models directory |
| RUVLLM_CACHE_DIR | Cache directory |
| RUVLLM_BACKEND | Default backend |
| RUVLLM_THREADS | CPU threads |
| HF_TOKEN | HuggingFace token for gated models |
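
For example, a Node script could set these when spawning the CLI. A minimal sketch with an illustrative models directory and backend choice:

import { spawn } from 'node:child_process';

// Run `ruvllm list` against a custom models directory on the CPU backend.
spawn('ruvllm', ['list'], {
  env: {
    ...process.env,
    RUVLLM_MODELS_DIR: '/data/models',
    RUVLLM_BACKEND: 'cpu',
  },
  stdio: 'inherit',
});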

Related Packages

Documentation

License

MIT OR Apache-2.0


Part of the RuVector ecosystem - a high-performance vector database with self-learning capabilities.