
llamacpp-cli

Manage llama.cpp servers like Ollama—but faster. Full control over llama-server with macOS launchctl integration.

CLI tool to manage local llama.cpp servers on macOS. Provides an Ollama-like experience for managing GGUF models and llama-server instances, with significantly faster response times than Ollama.

Status: Beta - Stable for personal use, actively maintained

Features

  • 🚀 Easy server management - Start, stop, and monitor llama.cpp servers
  • 🤖 Model downloads - Pull GGUF models from Hugging Face
  • ⚙️ Smart defaults - Auto-configure threads, context size, and GPU layers based on model size
  • 🔌 Auto port assignment - Automatically find available ports (9000-9999)
  • 📊 Status monitoring - Real-time server status with memory usage and uptime tracking
  • 🪵 Smart logging - Compact one-line request format with optional full JSON details

Why llamacpp-cli?

TL;DR: Much faster response times than Ollama by using llama.cpp's native server directly.

Ollama is great, but it adds a wrapper layer that introduces latency. llamacpp-cli gives you:

  • ⚡️ Faster inference - Direct llama-server means lower overhead and quicker responses
  • 🎛️ Full control - Access all llama-server flags and configuration options
  • 🔧 Transparency - Standard launchctl services, visible in Activity Monitor (see the quick check after this list)
  • 📦 Any GGUF model - Not limited to Ollama's model library
  • 🪶 Lightweight - No daemon overhead, just native macOS services
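
One concrete payoff of the transparency point above: a running server is just an ordinary llama-server process on a normal TCP port, so standard tools can see it (port 9000 here is the first auto-assigned port):

# The server shows up as a plain process (also visible in Activity Monitor)
ps aux | grep llama-server

# ...and listens on a normal TCP port
lsof -i :9000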

Comparison

| Feature | llamacpp-cli | Ollama |
|---------|-------------|--------|
| Response Time | ⚡️ Faster (native) | Slower (wrapper layer) |
| Model Format | Any GGUF from HF | Ollama's library |
| Server Binary | llama.cpp native | Custom wrapper |
| Configuration | Full llama-server flags | Limited options |
| Service Management | macOS launchctl | Custom daemon |
| Resource Usage | Lower overhead | Higher overhead |
| Transparency | Standard Unix tools | Black box |

If you need raw speed and full control, llamacpp-cli is the better choice.

Installation

npm install -g @appkit/llamacpp-cli

Prerequisites

  • macOS (uses launchctl for service management)
  • llama.cpp installed via Homebrew:
    brew install llama.cpp
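
Before moving on, you can sanity-check both prerequisites; which llama-server should print the binary path (typically /opt/homebrew/bin/llama-server on Apple Silicon, as noted under Known Limitations):

# Verify llama.cpp is installed and the server binary is on PATH
brew list --versions llama.cpp
which llama-server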

Quick Start

# Search for models on Hugging Face
llamacpp search "llama 3b"

# Download a model
llamacpp pull bartowski/Llama-3.2-3B-Instruct-GGUF/llama-3.2-3b-instruct-q4_k_m.gguf

# List local models
llamacpp ls

# Create and start a server (auto-assigns port, uses smart defaults)
llamacpp server create llama-3.2-3b-instruct-q4_k_m.gguf

# View running servers
llamacpp ps

# Chat with your model interactively
llamacpp server run llama-3.2-3b

# Or send a single message (non-interactive)
llamacpp server run llama-3.2-3b -m "What is the capital of France?"

# Stop a server
llamacpp server stop llama-3.2-3b

# Start a stopped server
llamacpp server start llama-3.2-3b

# View logs
llamacpp server logs llama-3.2-3b -f

Using Your Server

Once a server is running, it exposes an OpenAI-compatible API:

# Chat completion
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Text completion
curl http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'

# Get embeddings
curl http://localhost:9000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello world"
  }'

# Health check
curl http://localhost:9000/health

The server is fully compatible with OpenAI's API format, so you can use it with any OpenAI-compatible client library.
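
Streaming should also work the way it does against OpenAI's API; a minimal sketch, assuming llama-server's standard support for the stream parameter:

# Stream a chat completion as server-sent events (-N disables curl buffering)
curl -N http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a haiku about Paris."}],
    "stream": true
  }'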

Example Output

Creating a server:

$ llamacpp server create llama-3.2-3b-instruct-q4_k_m.gguf

✓ Server created and started successfully!

  Model:  llama-3.2-3b-instruct-q4_k_m.gguf
  Port:   9000
  Status: Running (PID 12345)

  API endpoint: http://localhost:9000

Viewing running servers:

$ llamacpp ps

┌─────────────────────────┬──────┬────────────┬──────┬──────────┬────────┐
│ SERVER ID               │ PORT │ STATUS     │ PID  │ MEMORY   │ UPTIME │
├─────────────────────────┼──────┼────────────┼──────┼──────────┼────────┤
│ llama-3-2-3b-instruct   │ 9000 │ ✅ RUNNING │ 1234 │ 594.0 MB │ 15m    │
│ qwen2-7b-instruct-q4-k  │ 9001 │ ✅ RUNNING │ 5678 │ 1.2 GB   │ 2h     │
└─────────────────────────┴──────┴────────────┴──────┴──────────┴────────┘

Total: 2 servers (2 running, 0 stopped)

Running interactive chat:

$ llamacpp server run llama-3.2-3b

Connected to llama-3.2-3b-instruct on port 9000

You: What is the capital of France?
Assistant: The capital of France is Paris...

You: exit

One-shot message:

$ llamacpp server run llama-3.2-3b -m "What is the capital of France?"
Assistant: The capital of France is Paris...

Commands

llamacpp ls

List all GGUF models in the ~/models directory.

llamacpp ls

llamacpp search <query> [options]

Search Hugging Face for GGUF models.

# Search for models
llamacpp search "llama 3.2"

# Limit results
llamacpp search "qwen" --limit 10

# Show files for a specific result (by index number)
llamacpp search "llama 3b" --files 1

Options:

  • -l, --limit <number> - Max results to show (default: 20)
  • --files [number] - Show available GGUF files for result # (e.g., --files 1)

Tip: Results are numbered. Use the number with --files to see available quantizations for that model!
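
Putting these together, a typical discovery flow is search, inspect files, then pull (reusing the model from the Quick Start):

# 1. Find candidate models
llamacpp search "llama 3b"

# 2. List GGUF quantizations for result #1
llamacpp search "llama 3b" --files 1

# 3. Download the chosen file
llamacpp pull bartowski/Llama-3.2-3B-Instruct-GGUF/llama-3.2-3b-instruct-q4_k_m.gguf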

llamacpp show <identifier> [options]

Show details about a model or file without downloading.

# Show model info and all GGUF files
llamacpp show bartowski/Llama-3.2-3B-Instruct-GGUF

# Show info for a specific file
llamacpp show bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Or use --file flag
llamacpp show bartowski/Llama-3.2-3B-Instruct-GGUF --file Llama-3.2-3B-Instruct-Q4_K_M.gguf

Options:

  • -f, --file <filename> - Show details for a specific file

Displays: Downloads, likes, license, tags, and available GGUF files

llamacpp pull <identifier> [options]

Download a GGUF model from Hugging Face.

# Option 1: Full path (recommended)
llamacpp pull bartowski/Llama-3.2-3B-Instruct-GGUF/llama-3.2-3b-instruct-q4_k_m.gguf

# Option 2: Repo + --file flag
llamacpp pull bartowski/Llama-3.2-3B-Instruct-GGUF --file llama-3.2-3b-instruct-q4_k_m.gguf

Options:

  • -f, --file <filename> - Specific GGUF file (alternative to path)

llamacpp rm <model>

Delete a model file from ~/models (and stop any associated servers).

llamacpp rm llama-3.2-3b-instruct-q4_k_m.gguf
llamacpp rm llama-3.2  # Partial name matching

llamacpp ps

List all servers with status, memory usage, and uptime.

llamacpp ps

Shows:

  • Server ID and model name
  • Port number
  • Status (running/stopped/crashed)
  • Process ID (PID)
  • Memory usage (RAM consumption)
  • Uptime (how long the server has been running)

Server Management

llamacpp server create <model> [options]

Create and start a new llama-server instance.

llamacpp server create llama-3.2-3b-instruct-q4_k_m.gguf
llamacpp server create llama-3.2-3b-instruct-q4_k_m.gguf --port 8080 --ctx-size 16384 --verbose

Options:

  • -p, --port <number> - Port number (default: auto-assign from 9000)
  • -t, --threads <number> - Thread count (default: half of CPU cores)
  • -c, --ctx-size <number> - Context size (default: based on model size)
  • -g, --gpu-layers <number> - GPU layers (default: 60)
  • -v, --verbose - Enable verbose HTTP logging (detailed request/response info)
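
To confirm which defaults were actually applied, follow up with the show command documented below:

llamacpp server create llama-3.2-3b-instruct-q4_k_m.gguf
llamacpp server show llama-3.2-3b   # displays threads, context size, and GPU layers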

llamacpp server show <identifier>

Show detailed configuration and status information for a server.

llamacpp server show llama-3.2-3b       # By partial name
llamacpp server show 9000               # By port
llamacpp server show llama-3-2-3b       # By server ID

Displays:

  • Server ID, model name, and path
  • Current status (running/stopped/crashed)
  • Port and PID
  • Runtime info (uptime, memory usage)
  • Configuration (threads, context size, GPU layers, verbose logging)
  • Timestamps (created, last started/stopped)
  • System paths (plist file, log files)
  • Quick commands for common next actions

Identifiers: Port number, server ID, partial model name

llamacpp server start <identifier>

Start an existing stopped server.

llamacpp server start llama-3.2-3b       # By partial name
llamacpp server start 9000               # By port
llamacpp server start llama-3-2-3b       # By server ID

Identifiers: Port number, server ID, partial model name, or model filename

llamacpp server run <identifier> [options]

Run an interactive chat session with a model, or send a single message.

# Interactive mode (REPL)
llamacpp server run llama-3.2-3b       # By partial name
llamacpp server run 9000               # By port
llamacpp server run llama-3-2-3b       # By server ID

# One-shot mode (single message and exit)
llamacpp server run llama-3.2-3b -m "What is the capital of France?"
llamacpp server run 9000 --message "Explain quantum computing in simple terms"

Options:

  • -m, --message <text> - Send a single message and exit (non-interactive mode)

Identifiers: Port number, server ID, partial model name, or model filename

In interactive mode, type exit or press Ctrl+C to end the session.

llamacpp server stop <identifier>

Stop a running server by model name, port, or ID.

llamacpp server stop llama-3.2-3b
llamacpp server stop 9000

llamacpp server rm <identifier>

Remove a server configuration and launchctl service (preserves model file).

llamacpp server rm llama-3.2-3b
llamacpp server rm 9000

llamacpp server logs <identifier> [options]

View server logs with smart filtering.

Without --verbose (default):

llamacpp server logs llama-3.2-3b
# Output: 2025-12-09 18:02:23 POST /v1/chat/completions 127.0.0.1 200

With --verbose enabled on the server:

llamacpp server logs llama-3.2-3b
# Output: 2025-12-09 18:02:23 POST /v1/chat/completions 127.0.0.1 200 "What is..." 305 22 1036

More examples:

Full HTTP JSON request/response

llamacpp server logs llama-3.2-3b --http

Follow logs in real-time

llamacpp server logs llama-3.2-3b --follow

Last 100 requests

llamacpp server logs llama-3.2-3b --lines 100

Show only errors

llamacpp server logs llama-3.2-3b --errors

Show all messages (including debug internals)

llamacpp server logs llama-3.2-3b --verbose

Custom filter pattern

llamacpp server logs llama-3.2-3b --filter "error|warning"


Options:

  • -f, --follow - Follow log output in real-time
  • -n, --lines <number> - Number of lines to show (default: 50)
  • --http - Show full HTTP JSON request/response logs
  • --errors - Show only error messages
  • --verbose - Show all messages including debug internals
  • --filter <pattern> - Custom grep pattern for filtering
  • --stdout - Show stdout instead of stderr (rarely needed)

Output Formats:

Non-verbose servers (default):

TIMESTAMP METHOD ENDPOINT IP STATUS

Verbose servers (created with the --verbose flag):

TIMESTAMP METHOD ENDPOINT IP STATUS "MESSAGE..." TOKENS_IN TOKENS_OUT TIME_MS

The compact format shows one line per HTTP request. Verbose servers also include:

  • The user's message (first 50 characters)
  • Token counts (prompt tokens in, completion tokens out)
  • Total response time in milliseconds

Note: To get detailed logs, create your server with the --verbose flag:

llamacpp server create model.gguf --verbose

Use --http to see full request/response JSON, or --verbose to see all internal server logs.
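
These flags can be combined where it makes sense, for example following the log while showing only errors (assuming --follow composes with the filter flags, which the grep-style filtering suggests):

llamacpp server logs llama-3.2-3b --follow --errors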

Configuration

llamacpp-cli stores its configuration in ~/.llamacpp/:

~/.llamacpp/
├── config.json           # Global settings
├── servers/              # Server configurations
│   └── <server-id>.json
└── logs/                 # Server logs
    ├── <server-id>.stdout
    └── <server-id>.stderr
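
Server configurations are plain JSON, so they can be inspected directly; a sketch using the JSON pretty-printer that ships with python3 (the exact fields are an implementation detail and may change):

# List configured servers, then pretty-print one config
ls ~/.llamacpp/servers/
python3 -m json.tool ~/.llamacpp/servers/<server-id>.json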

Smart Defaults

llamacpp-cli automatically configures optimal settings based on model size:

| Model Size | Context Size | Threads | GPU Layers |
|------------|--------------|---------|------------|
| < 1GB | 2048 | Half cores | 60 |
| 1-3GB | 4096 | Half cores | 60 |
| 3-6GB | 8192 | Half cores | 60 |
| > 6GB | 16384 | Half cores | 60 |

All servers include --embeddings and --jinja flags by default.
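
These are defaults, not limits; any of them can be overridden at creation time with the flags documented above (the filename here is illustrative):

llamacpp server create qwen2-7b-instruct-q4_k_m.gguf --ctx-size 32768 --threads 8 --gpu-layers 99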

How It Works

llamacpp-cli uses macOS launchctl to manage llama-server processes:

  1. Creates a launchd plist file in ~/Library/LaunchAgents/
  2. Registers the service with launchctl load
  3. Starts the server with launchctl start
  4. Monitors status via launchctl list and lsof

Services are named com.llama.<model-id> and persist across reboots.
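
Because these are ordinary launchd services, they can be inspected with built-in macOS tools; for example (the plist filename follows the com.llama.<model-id> naming above):

# Confirm the service is registered with launchd
launchctl list | grep com.llama

# Dump the generated plist in readable form
plutil -p ~/Library/LaunchAgents/com.llama.<model-id>.plist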

Known Limitations

  • macOS only - Relies on launchctl for service management (Linux/Windows support planned)
  • Homebrew dependency - Requires llama.cpp installed via brew install llama.cpp
  • ~/models convention - Expects GGUF models in ~/models directory
  • Single binary - Assumes llama-server at /opt/homebrew/bin/llama-server
  • Port range - Auto-assignment limited to 9000-9999 (configurable with --port)

Troubleshooting

Command not found

Make sure the npm global bin directory is in your PATH:

npm config get prefix  # Should be in PATH
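
If that prefix's bin directory is missing from PATH, a sketch of the usual fix (assuming zsh, the default shell on modern macOS):

# Add npm's global bin directory to PATH for future shells
echo 'export PATH="$(npm config get prefix)/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc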

llama-server not found

Install llama.cpp via Homebrew:

brew install llama.cpp

Port already in use

llamacpp-cli will automatically find the next available port. Or specify a custom port:

llamacpp server create model.gguf --port 8080

Server won't start

Check the logs for errors:

llamacpp server logs <identifier> --errors
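
If the CLI itself won't respond, the raw launchd-captured logs can be read straight from the layout shown under Configuration:

# Inspect the last lines of a server's stderr log directly
tail -n 50 ~/.llamacpp/logs/<server-id>.stderr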

Development

# Install dependencies
npm install

# Run in development mode
npm run dev -- ps

# Build for production
npm run build

# Clean build artifacts
npm run clean

Releasing

This project uses commit-and-tag-version for automated releases based on conventional commits.

Commit Message Format:

# Features (bumps minor version)
git commit -m "feat: add interactive chat command"
git commit -m "feat(search): add limit option for search results"

# Bug fixes (bumps patch version)
git commit -m "fix: handle port conflicts correctly"
git commit -m "fix(logs): stream logs without buffering"

# Breaking changes (bumps major version)
git commit -m "feat!: change server command structure"
git commit -m "feat: major refactor

BREAKING CHANGE: server commands now require 'server' prefix"

# Other types (no version bump, hidden in changelog)
git commit -m "chore: update dependencies"
git commit -m "docs: fix typo in README"
git commit -m "test: add unit tests for port manager"

Release Commands:

# Automatic version bump based on commits
npm run release

# Force specific version bump
npm run release:patch  # 1.0.0 → 1.0.1
npm run release:minor  # 1.0.0 → 1.1.0
npm run release:major  # 1.0.0 → 2.0.0

# First release (doesn't bump version, just tags)
npm run release:first

What happens during release:

  1. Analyzes commits since last release
  2. Determines version bump (feat = minor, fix = patch, BREAKING CHANGE = major)
  3. Updates package.json version
  4. Generates/updates CHANGELOG.md
  5. Creates git commit: chore(release): v1.2.3
  6. Creates git tag: v1.2.3
  7. Pushes tags to GitHub
  8. Publishes to npm with --access public
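
To preview all of this without writing anything, commit-and-tag-version supports a dry run (assuming the release script forwards extra flags, which npm does after --):

# Show what the next release would do, without committing, tagging, or publishing
npm run release -- --dry-run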

Contributing

Contributions are welcome! If you'd like to contribute:

  1. Open an issue first for major changes to discuss the approach
  2. Fork the repository
  3. Create a feature branch (git checkout -b feature/amazing-feature)
  4. Make your changes and test with npm run dev
  5. Commit using conventional commits (see Releasing section)
    • feat: for new features
    • fix: for bug fixes
    • docs: for documentation
    • chore: for maintenance
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Tips

  • Use npm run dev -- <command> to test commands without building
  • Check logs with llamacpp server logs <server> --errors when debugging
  • Test launchctl integration with launchctl list | grep com.llama
  • All server configs are in ~/.llamacpp/servers/
  • Test interactive chat with npm run dev -- server run <model>

Acknowledgments

Built on top of the excellent llama.cpp project by Georgi Gerganov and contributors.

License

MIT