
@ruvector/ruvllm-cli v0.1.0

CLI for LLM inference, benchmarking, and model management - run local LLMs with Metal/CUDA acceleration

@ruvector/ruvllm-cli


Command-line interface for local LLM inference and benchmarking - run AI models on your machine with Metal, CUDA, and CPU acceleration.

Features

  • Hardware Acceleration - Metal (macOS), CUDA (NVIDIA), Vulkan, Apple Neural Engine
  • GGUF Support - Load quantized models (Q4, Q5, Q6, Q8) for efficient inference
  • Interactive Chat - Terminal-based chat sessions with conversation history
  • Benchmarking - Measure tokens/second, memory usage, time-to-first-token
  • HTTP Server - OpenAI-compatible API server for integration
  • Model Management - Download, list, and manage models from HuggingFace
  • Streaming Output - Real-time token streaming for responsive UX

Installation

# Install globally
npm install -g @ruvector/ruvllm-cli

# Or run directly with npx
npx @ruvector/ruvllm-cli --help

For full native performance, install the Rust binary:

cargo install ruvllm-cli

Quick Start

Run Inference

# Basic inference
ruvllm run --model ./llama-7b-q4.gguf --prompt "Explain quantum computing"

# With options
ruvllm run \
  --model ./model.gguf \
  --prompt "Write a haiku about Rust" \
  --temperature 0.8 \
  --max-tokens 100 \
  --backend metal

Interactive Chat

# Start chat session
ruvllm chat --model ./model.gguf

# With system prompt
ruvllm chat --model ./model.gguf --system "You are a helpful coding assistant"

Benchmark Performance

# Run benchmark
ruvllm bench --model ./model.gguf --iterations 20

# Compare backends
ruvllm bench --model ./model.gguf --backend metal
ruvllm bench --model ./model.gguf --backend cpu

Start Server

# OpenAI-compatible API server
ruvllm serve --model ./model.gguf --port 8080

# Then use with any OpenAI client
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}'

Model Management

# List available models
ruvllm list

# Download from HuggingFace
ruvllm download TheBloke/Llama-2-7B-GGUF

# Download specific quantization
ruvllm download TheBloke/Llama-2-7B-GGUF --quant q4_k_m

CLI Reference

| Command | Description |
|---------|-------------|
| run | Run inference on a prompt |
| chat | Interactive chat session |
| bench | Benchmark model performance |
| serve | Start HTTP server |
| list | List downloaded models |
| download | Download model from HuggingFace |

Global Options

| Option | Description | Default |
|--------|-------------|---------|
| --model, -m | Path to GGUF model file | - |
| --backend, -b | Acceleration backend (metal, cuda, cpu) | auto |
| --threads, -t | Number of CPU threads | auto |
| --gpu-layers | Layers to offload to GPU | all |
| --context-size | Context window size | 2048 |
| --verbose, -v | Enable verbose logging | false |

Generation Options

| Option | Description | Default |
|--------|-------------|---------|
| --temperature | Sampling temperature (0-2) | 0.7 |
| --top-p | Nucleus sampling threshold | 0.9 |
| --top-k | Top-k sampling | 40 |
| --max-tokens | Maximum tokens to generate | 256 |
| --repeat-penalty | Repetition penalty | 1.1 |

Programmatic Usage

import {
  parseArgs,
  formatBenchmarkTable,
  getAvailableBackends,
  ModelConfig,
  BenchmarkResult,
} from '@ruvector/ruvllm-cli';

// Parse CLI arguments
const args = parseArgs(['--model', './model.gguf', '--temperature', '0.8']);
console.log(args); // { model: './model.gguf', temperature: '0.8' }

// Check available backends
const backends = getAvailableBackends();
console.log('Available:', backends); // ['cpu', 'metal'] on macOS

// Format benchmark results
const results: BenchmarkResult[] = [
  {
    model: 'llama-7b',
    backend: 'metal',
    promptTokens: 50,
    generatedTokens: 100,
    promptTime: 120,
    generationTime: 2500,
    promptTPS: 416.7,
    generationTPS: 40.0,
    memoryUsage: 4200,
    peakMemory: 4800,
  },
];

console.log(formatBenchmarkTable(results));
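
The throughput fields follow directly from the raw counts and timings. Assuming the times are in milliseconds, as in the example above, the tokens-per-second figures can be reproduced from the other fields (a quick sketch, not part of the package API):

// Derive tokens/second from the BenchmarkResult above, assuming millisecond timings.
const tps = (tokens: number, millis: number): number => tokens / (millis / 1000);

const r = results[0];
console.log(tps(r.promptTokens, r.promptTime));        // 50 tokens / 0.12 s  ≈ 416.7
console.log(tps(r.generatedTokens, r.generationTime)); // 100 tokens / 2.5 s  = 40.0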

Performance

Benchmarks on Apple M2 Pro with Q4_K_M quantization:

| Model | Prompt TPS | Gen TPS | Memory |
|-------|------------|---------|--------|
| Llama-2-7B | 450 | 42 | 4.2 GB |
| Mistral-7B | 480 | 45 | 4.1 GB |
| Phi-2 | 820 | 85 | 1.8 GB |
| TinyLlama-1.1B | 1200 | 120 | 0.8 GB |

Configuration

Create ~/.ruvllm/config.json:

{
  "defaultBackend": "metal",
  "modelsDir": "~/.ruvllm/models",
  "cacheDir": "~/.ruvllm/cache",
  "streaming": true,
  "logLevel": "info"
}

Environment Variables

| Variable | Description |
|----------|-------------|
| RUVLLM_MODELS_DIR | Models directory |
| RUVLLM_CACHE_DIR | Cache directory |
| RUVLLM_BACKEND | Default backend |
| RUVLLM_THREADS | CPU threads |
| HF_TOKEN | HuggingFace token for gated models |
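
For example, a Node script could set these when spawning the CLI. A minimal sketch with an illustrative models directory and backend choice:

import { spawn } from 'node:child_process';

// Run `ruvllm list` against a custom models directory on the CPU backend.
spawn('ruvllm', ['list'], {
  env: {
    ...process.env,
    RUVLLM_MODELS_DIR: '/data/models',
    RUVLLM_BACKEND: 'cpu',
  },
  stdio: 'inherit',
});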

Related Packages

Documentation

License

MIT OR Apache-2.0


Part of the RuVector ecosystem - a high-performance vector database with self-learning capabilities.