llm-pulse

v1.0.0

Published

21 days ago

Zero-config CLI for monitoring your local LLM hardware, runtimes, and model compatibility

Downloads

492

Vulnerabilities: 0 High, 0 Medium, 0 Low

sumeetjaindelhi

llm gpu vram monitoring hardware ollama llama inference ai cli terminal tui mcp

llm-pulse

Zero-config CLI that tells you what LLMs your PC can run. Scans hardware, finds runtimes, recommends models.

npx llm-pulse

Install

# Run directly (no install)
npx llm-pulse

# Or install globally
npm install -g llm-pulse

Requires Node.js 18+.

Commands

`llm-pulse` / `llm-pulse scan`

Hardware scan + model recommendations.

llm-pulse                            # Full scan (default)
llm-pulse --format json              # JSON output
llm-pulse --category coding --top 3  # Top 3 coding models

| Flag | Description | Default |
|---|---|---|
| `-f, --format` | table, json, or csv | table |
| `-c, --category` | general, coding, reasoning, creative, multilingual | all |
| `-t, --top <n>` | Number of recommendations | 5 |
| `-H, --host <url>` | Ollama API host URL | http://127.0.0.1:11434 |

`llm-pulse check <model>`

"Can I run this model?" verdict with GPU layer-offload guidance when it doesn't fully fit.

llm-pulse check llama3.1:8b          # Check a specific model
llm-pulse check llama3.1:70b         # Overflow case — shows partial-offload tip
llm-pulse check qwen2.5-coder:14b --quant Q4_K_M
llm-pulse check llama3.1:70b --format json

When a model overflows your VRAM, the GPU Layer Offload section tells you how many transformer blocks to put on the GPU (maps to Ollama num_gpu / llama.cpp --n-gpu-layers) with the rest on CPU — e.g. "Put 44 of 80 layers on GPU (~22 GB), rest on CPU". Hidden on Apple Silicon (unified memory) and CPU-only systems.
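The arithmetic behind a partial-offload tip can be sketched as follows. This is an illustration, not llm-pulse's actual implementation: the function name `planLayerOffload`, the equal-size-per-layer simplification, and the 2 GB reserve are all assumptions chosen so that the numbers line up with the quoted example.

```typescript
// Illustrative only: llm-pulse's real offload math is not documented here.
interface OffloadPlan {
  gpuLayers: number;    // value you would pass as num_gpu / --n-gpu-layers
  totalLayers: number;
  gpuWeightsGB: number; // weights that end up resident on the GPU
}

function planLayerOffload(
  weightsGB: number,   // size of the quantized weights
  totalLayers: number, // transformer blocks in the model
  vramGB: number,      // total VRAM on the GPU
  reserveGB = 2        // assumed headroom for KV cache and runtime overhead
): OffloadPlan {
  // Simplification: treat every transformer block as the same size.
  const perLayerGB = weightsGB / totalLayers;
  const budgetGB = Math.max(0, vramGB - reserveGB);
  const gpuLayers = Math.min(totalLayers, Math.floor(budgetGB / perLayerGB));
  return { gpuLayers, totalLayers, gpuWeightsGB: gpuLayers * perLayerGB };
}

// A 70B-class model at roughly 40 GB of weights and 80 layers on a 24 GB GPU.
const plan = planLayerOffload(40, 80, 24);
console.log(
  `Put ${plan.gpuLayers} of ${plan.totalLayers} layers on GPU ` +
    `(~${plan.gpuWeightsGB.toFixed(0)} GB), rest on CPU`
);
// prints "Put 44 of 80 layers on GPU (~22 GB), rest on CPU"
```

The resulting `gpuLayers` value is what you would hand to Ollama as `num_gpu` or to llama.cpp as `--n-gpu-layers`.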

`llm-pulse compare [models...]`

Compare models side-by-side against your hardware — fit level, VRAM needed, and speed estimate per model.

llm-pulse compare llama3.1:8b phi3 qwen2.5-coder:14b
llm-pulse compare --category coding --top 3    # Auto-pick top 3 coding models
llm-pulse compare llama3.1:8b phi3 --quant Q4_K_M

`llm-pulse quant-advice <model>`

Which quantization should you actually pick? Shows a quality-vs-VRAM tradeoff table with the sweet-spot recommendation for your hardware — the largest quant that still fits comfortably.

llm-pulse quant-advice llama3.1:8b       # Sweet-spot pick + full tradeoff table
llm-pulse quant-advice llama3.1:70b      # "Nothing fits" → redirects to check for offload tips
llm-pulse quant-advice qwen2.5-coder:14b --format json

Each row gets a note: "Sweet spot — best quality you can fit", "Smaller — faster, slight quality drop", "Overkill — negligible quality gain", "Too big — overflows VRAM", etc. The recommendation follows the llama.cpp community heuristic: buy the most quality you can afford in VRAM, since gains at the high end are real but diminishing.
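The "largest quant that still fits comfortably" rule can be sketched in a few lines. Everything here is an assumption for illustration: the bits-per-weight figures are rounded approximations for common GGUF quantizations, the 80% headroom factor is arbitrary, and `pickSweetSpot` is a hypothetical name rather than part of the llm-pulse API.

```typescript
interface Quant {
  name: string;
  bitsPerWeight: number; // rounded approximation, not an exact figure
}

// Assumed approximate sizes; real GGUF files vary by model architecture.
const QUANTS: Quant[] = [
  { name: "Q4_K_M", bitsPerWeight: 4.8 },
  { name: "Q5_K_M", bitsPerWeight: 5.7 },
  { name: "Q8_0", bitsPerWeight: 8.5 },
  { name: "F16", bitsPerWeight: 16 },
];

function weightsGB(paramsBillions: number, q: Quant): number {
  // params * bits / 8; the 1e9 params and 1e9 bytes-per-GB factors cancel out.
  return (paramsBillions * q.bitsPerWeight) / 8;
}

// Returns the highest-precision quant whose weights fit within
// vramGB * headroom, or null when nothing fits.
function pickSweetSpot(paramsBillions: number, vramGB: number, headroom = 0.8): Quant | null {
  const budgetGB = vramGB * headroom;
  let best: Quant | null = null;
  for (const q of QUANTS) {
    const fits = weightsGB(paramsBillions, q) <= budgetGB;
    if (fits && (best === null || q.bitsPerWeight > best.bitsPerWeight)) {
      best = q;
    }
  }
  return best;
}

console.log(pickSweetSpot(8, 12)?.name ?? "nothing fits");  // 8B model, 12 GB GPU
console.log(pickSweetSpot(70, 24)?.name ?? "nothing fits"); // 70B model, 24 GB GPU
```

Under these assumptions the 8B model lands on Q8_0, while the 70B model returns null, which corresponds to the "Nothing fits" case that redirects to `check`.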

`llm-pulse optimize <model>`

Recommends tuned Ollama runtime parameters — num_ctx, num_gpu, num_thread, num_batch — for the sweet-spot quantization on your hardware, as a paste-ready Modelfile plus interactive /set parameter lines.

llm-pulse optimize llama3.1:8b                  # Balanced tuned profile + Modelfile
llm-pulse optimize llama3.1:8b --quant Q4_K_M   # Pin a specific quantization
llm-pulse optimize llama3.1:8b --format json

num_thread uses your physical performance cores (skipping efficiency cores and SMT); num_ctx is the largest context whose KV cache fits alongside the weights — conservative, so it won't suggest a size that risks OOM; num_gpu reuses the layer-offload math (omitted on Apple Silicon, where Ollama offloads all layers); num_batch drops to 256 on tight fits to ease the prompt-eval VRAM spike.
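As a rough picture of the paste-ready output, the sketch below renders a set of tuned parameters as an Ollama Modelfile and as interactive `/set parameter` lines. The `TunedParams` shape, both function names, and the sample values are hypothetical; only the `FROM` / `PARAMETER` Modelfile syntax and the `/set parameter` command are taken from Ollama itself.

```typescript
interface TunedParams {
  num_ctx: number;
  num_thread: number;
  num_batch: number;
  num_gpu?: number; // left out on Apple Silicon, per the description above
}

// Collect the parameters in a fixed order, skipping num_gpu when it is unset.
function toPairs(p: TunedParams): Array<[string, number]> {
  const pairs: Array<[string, number]> = [["num_ctx", p.num_ctx]];
  if (p.num_gpu !== undefined) pairs.push(["num_gpu", p.num_gpu]);
  pairs.push(["num_thread", p.num_thread], ["num_batch", p.num_batch]);
  return pairs;
}

function renderModelfile(baseModel: string, p: TunedParams): string {
  const lines = [`FROM ${baseModel}`];
  for (const [key, value] of toPairs(p)) lines.push(`PARAMETER ${key} ${value}`);
  return lines.join("\n");
}

function renderSetLines(p: TunedParams): string[] {
  return toPairs(p).map(([key, value]) => `/set parameter ${key} ${value}`);
}

// Placeholder values, not real llm-pulse output.
const tuned: TunedParams = { num_ctx: 8192, num_thread: 8, num_batch: 256 };
console.log(renderModelfile("llama3.1:8b", tuned));
console.log(renderSetLines(tuned).join("\n"));
```

A Modelfile like this would be saved to disk and built with `ollama create`, while the `/set parameter` lines are typed into an interactive `ollama run` session.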

`llm-pulse context-fit <model>`

"Will this prompt fit in the context window?" — answers using the smaller of the model's native context window and the KV-cache ceiling your hardware can sustain. Returns a yes / tight / no verdict, which ceiling is binding (model vs hardware), and a remedy when it doesn't fit (smaller quant that does, or trim/offload suggestion).

llm-pulse context-fit llama3.1:8b --prompt-tokens 50000                          # will this prompt fit in context?
llm-pulse context-fit llama3.1:8b --prompt-tokens 50000 --response-tokens 1024   # reserve room to generate
llm-pulse context-fit llama3.1:8b --prompt-tokens 50000 --format json
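The "smaller of two ceilings" logic can be sketched with the standard KV-cache size formula. This is an assumption-laden illustration, not the tool's code: `contextFit`, the 90% "tight" threshold, and the free-VRAM inputs are hypothetical, and the cache is assumed to be stored in 16-bit precision. The Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128, 128K native context) comes from the published model configuration.

```typescript
interface ModelShape {
  layers: number;
  kvHeads: number;       // grouped-query attention: fewer KV heads than query heads
  headDim: number;
  nativeContext: number; // tokens
}

const LLAMA31_8B: ModelShape = { layers: 32, kvHeads: 8, headDim: 128, nativeContext: 131072 };

function kvBytesPerToken(m: ModelShape, bytesPerElement = 2): number {
  // Keys and values (factor 2), per layer, per KV head, per head dimension.
  return 2 * m.layers * m.kvHeads * m.headDim * bytesPerElement;
}

type Verdict = "yes" | "tight" | "no";
type Binding = "model" | "hardware";

function contextFit(
  m: ModelShape,
  freeVramBytes: number, // VRAM left for the KV cache after the weights are loaded
  promptTokens: number,
  responseTokens = 0,
  tightRatio = 0.9       // assumed threshold for a "tight" verdict
): { verdict: Verdict; binding: Binding; ceiling: number } {
  const hardwareCeiling = Math.floor(freeVramBytes / kvBytesPerToken(m));
  const ceiling = Math.min(m.nativeContext, hardwareCeiling);
  const binding: Binding = hardwareCeiling < m.nativeContext ? "hardware" : "model";
  const needed = promptTokens + responseTokens;
  const verdict: Verdict =
    needed > ceiling ? "no" : needed > ceiling * tightRatio ? "tight" : "yes";
  return { verdict, binding, ceiling };
}

const GiB = 1024 * 1024 * 1024;
console.log(contextFit(LLAMA31_8B, 4 * GiB, 50000, 1024));
// { verdict: "no", binding: "hardware", ceiling: 32768 }
```

At 128 KiB of cache per token, 4 GiB of spare VRAM sustains 32,768 tokens, so a 50,000-token prompt fails on the hardware ceiling long before the model's native window matters.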

`llm-pulse doctor`

System health check — scores your setup and gives suggestions.

llm-pulse doctor
llm-pulse doctor --format json
llm-pulse doctor --fix --dry-run    # Preview the exact commands --fix would run
llm-pulse doctor --fix              # Auto-fix detected issues

--dry-run prints each planned fix with the exact command (e.g. $ brew install ollama) and changes nothing — review first, then run --fix to apply.

`llm-pulse models`

Browse the model database, filtered for your hardware. With --library, it adds the live ollama.com/library catalog (cached for 24 h) on top of the curated database.

llm-pulse models                      # Curated set (48 models)
llm-pulse models --library            # Full Ollama library (245+ models)
llm-pulse models --refresh            # Force refresh library cache
llm-pulse models --search llama       # Search by name
llm-pulse models --category coding    # Filter by category
llm-pulse models --fits               # Only models that fit your VRAM

`llm-pulse monitor`

Live TUI dashboard — like htop for LLMs. Press Tab to switch views, q to quit.

Overview — CPU/GPU/RAM/VRAM bars with sparklines + smart alerts
Inference — Throughput chart + session stats
GPU — Per-GPU utilization, temperature, VRAM, and power sparklines with peak stats + temperature alerts
VRAM Map — Visual VRAM breakdown (model weights / KV cache / overhead / free)
Models — Browse installed Ollama models; pull new ones or delete, from inside the TUI

llm-pulse monitor

`llm-pulse benchmark`

Quick inference benchmark via Ollama.

llm-pulse benchmark                  # Auto-picks smallest model
llm-pulse benchmark --model phi3     # Specific model
llm-pulse benchmark --rounds 5       # 5 rounds (default: 3)
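For a sense of where a tokens-per-second figure can come from, the sketch below derives it from the timing fields Ollama returns at the end of a `/api/generate` call (`eval_count`, `eval_duration`, and `prompt_eval_duration`, with durations in nanoseconds). Whether llm-pulse uses exactly these fields is an assumption, and `tokensPerSecond` and `summarize` are hypothetical helper names.

```typescript
// Subset of the final /api/generate response; durations are in nanoseconds.
interface OllamaGenerateStats {
  eval_count: number;           // tokens generated
  eval_duration: number;        // time spent generating them
  prompt_eval_duration: number; // time spent processing the prompt
}

function tokensPerSecond(s: OllamaGenerateStats): number {
  const seconds = s.eval_duration / 1e9;
  return seconds > 0 ? s.eval_count / seconds : 0;
}

// Average several benchmark rounds into one summary.
function summarize(rounds: OllamaGenerateStats[]): { meanTokPerSec: number; meanPromptEvalMs: number } {
  if (rounds.length === 0) return { meanTokPerSec: 0, meanPromptEvalMs: 0 };
  let tokSum = 0;
  let promptMsSum = 0;
  for (const r of rounds) {
    tokSum += tokensPerSecond(r);
    promptMsSum += r.prompt_eval_duration / 1e6;
  }
  return {
    meanTokPerSec: tokSum / rounds.length,
    meanPromptEvalMs: promptMsSum / rounds.length,
  };
}

// Synthetic numbers standing in for real responses.
const rounds: OllamaGenerateStats[] = [
  { eval_count: 200, eval_duration: 4e9, prompt_eval_duration: 5e8 },
  { eval_count: 200, eval_duration: 2e9, prompt_eval_duration: 3e8 },
];
console.log(summarize(rounds)); // { meanTokPerSec: 75, meanPromptEvalMs: 400 }
```

Averaging over several rounds smooths out one-off effects such as the model being loaded into memory during the first request.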

`llm-pulse profile`

Run inference with hardware profiling — latency breakdown (TTFT, generation), plus a VRAM and GPU-utilization timeline sampled during the run.

llm-pulse profile                          # Short/medium/long prompt set
llm-pulse profile --model phi3             # Specific model
llm-pulse profile --prompt "Explain DNS"   # Custom prompt
llm-pulse profile --context-size 4096

Programmatic API

import { detectHardware, getRecommendations } from "llm-pulse";

const hardware = await detectHardware();
const recs = getRecommendations(hardware, { category: "coding", top: 3 });

console.log(recs[0].score.model.name);  // "Qwen 2.5 Coder 14B"
console.log(recs[0].score.fitLevel);     // "comfortable"
console.log(recs[0].pullCommand);        // "ollama pull qwen2.5-coder:14b"
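One way to post-process the results, for example to collect pull commands for every model at a given fit level, is sketched below. The `RecommendationLike` interface is a local stand-in that mirrors only the three fields used in the example above; the package's real types may differ. `pullCommandsFor`, the second sample entry, and its fit level are hypothetical.

```typescript
// Local mirror of the fields shown in the example above (assumption:
// the real llm-pulse types may carry more fields or differ in detail).
interface RecommendationLike {
  score: { model: { name: string }; fitLevel: string };
  pullCommand: string;
}

function pullCommandsFor(recs: RecommendationLike[], fitLevel: string): string[] {
  return recs.filter((r) => r.score.fitLevel === fitLevel).map((r) => r.pullCommand);
}

// Hand-written sample data standing in for getRecommendations() output.
const sample: RecommendationLike[] = [
  {
    score: { model: { name: "Qwen 2.5 Coder 14B" }, fitLevel: "comfortable" },
    pullCommand: "ollama pull qwen2.5-coder:14b",
  },
  {
    // Hypothetical entry: both the model and the fit level are made up.
    score: { model: { name: "Example 70B" }, fitLevel: "hypothetical-other-level" },
    pullCommand: "ollama pull example:70b",
  },
];

console.log(pullCommandsFor(sample, "comfortable").join("\n"));
// prints "ollama pull qwen2.5-coder:14b"
```

In real use, the array returned by `getRecommendations` would take the place of `sample`, provided its elements have the shape shown.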

MCP Server

Use llm-pulse as an MCP tool from Claude Code, Cursor, or any MCP-compatible AI assistant. The assistant can scan your hardware, check model compatibility, and snapshot live GPU/VRAM state — all without leaving the chat.

Add to your Claude Code config (~/.claude.json or your project's .mcp.json):

{
  "mcpServers": {
    "llm-pulse": {
      "command": "npx",
      "args": ["-y", "-p", "llm-pulse", "llm-pulse-mcp"]
    }
  }
}

(llm-pulse-mcp is a binary inside the llm-pulse package, so npx needs -p llm-pulse. If you've installed globally with npm install -g llm-pulse, you can use "command": "llm-pulse-mcp" with no args instead.)

Exposed tools:

| Tool | What it does |
|---|---|
| scan | Full hardware scan + ranked model recommendations |
| check | "Can I run this model?" verdict (yes/maybe/no) with best quantization + speed estimate |
| context-fit-check | "Will a prompt of N tokens fit?" verdict (yes/tight/no), which ceiling is binding (model vs hardware), and a remedy |
| recommend | Ranked model list for your hardware, filterable by category |
| doctor | System health score with actionable suggestions |
| models | Browse / search the model database, optionally filtered to models that fit |
| monitor | One-shot live snapshot: CPU/GPU%, VRAM, temp, power, active Ollama model + tok/s |

Supported

Hardware: NVIDIA GPU (full CUDA/VRAM), AMD, Intel, Apple Silicon, any CPU (AVX2/NEON), DDR4/DDR5, NVMe/SSD/HDD

Runtimes: Ollama, llama.cpp, LM Studio

Models: 48 curated + 245+ via live Ollama library catalog (cached 24 h) — across general, coding, reasoning, creative, multilingual — each with Q4/Q5/Q8/F16 quantization variants

Stability

llm-pulse follows semantic versioning. As of 1.0.0, the CLI commands and flags, the table/json/csv output shapes, the programmatic API (detectHardware, getRecommendations), and the 7 MCP tools are considered stable — any breaking change to them bumps the major version.

License

MIT
