@zeiq/locca

v0.8.1

Published

8 days ago

TUI for managing local LLM inference with llama.cpp

0High
0Medium
0Low

perminder-klair

llama.cpp llm tui cli gguf huggingface pi

locca

A TUI around llama.cpp for running, managing, and benchmarking local GGUF models, and for launching the pi coding agent against your local server.

https://github.com/user-attachments/assets/8b451763-bc8a-4707-96f9-9bc78cf6de25

Works on Linux and macOS, against any GPU llama.cpp can target (Vulkan, Metal, CUDA, ROCm) or CPU-only. Defaults are tuned for iGPU-class hardware (q8_0 KV cache, single slot, batch size 1024) so a 7B–9B model with 128k context fits on a 16 GB shared-VRAM iGPU.

Quickstart

npm install -g @zeiq/locca
locca                  # first run launches the setup wizard

The setup wizard:

Asks for your models directory (default ~/.locca/models).
Confirms llama-server is on $PATH. If not, prints the exact install line for your distro (apt, dnf, pacman/AUR, zypper, apk, brew), including the shader compiler packages (glslc, spirv-headers) recent Vulkan builds need.
Sets server defaults (port, ctx, threads, VRAM budget tier).
If the models dir is empty, offers a catalog-aware first model picker. Each curated size shows a fits — 5.6 GB dl, 14.3 GB RAM, 256k ctx (or needs 32 GB+ RAM) hint based on detected hardware, so you can't accidentally pick a 30 GB download that won't run.
Offers to install pi: tries mise → npm → manual hint.

Then it writes ~/.locca/config.json and drops you at the menu.

Run from source

git clone https://github.com/perminder-klair/locca.git
cd locca
npm install
npm run build
npm link              # symlinks `locca` into your PATH

Commands

locca                          # interactive menu (Pi is default)
locca pi [model-pattern]       # launch pi against a local server
locca serve                    # start llama-server with a picked model (detached)
locca switch                   # picker: installed models + curated catalog
locca bench                    # run llama-bench against a model
locca doctor                   # health check: hardware, llama.cpp, server, log, config
locca optimise                 # ask pi to review the deployment and suggest tweaks
locca api                      # print OpenAI-compatible connection info
locca logs                     # tail server log (locca-started servers only)
locca download [user/repo]     # pull a GGUF from HuggingFace
locca search   [query]         # search HuggingFace for GGUF models
locca delete                   # remove a model directory
locca stop                     # stop the running server
locca config                   # view / edit ~/.locca/config.json
locca setup                    # re-run the setup wizard
locca install-llama            # download a prebuilt llama.cpp binary into ~/.locca/bin
locca help                     # full command listing

locca pi qwen fuzzy-matches the first *qwen*.gguf in your models dir.

`locca api`

When a server is running, locca api prints the OpenAI-compatible connection block: base URL, loaded model name, every endpoint (/chat/completions, /completions, /embeddings, /models, plus native /health, /props, /slots, /metrics), and a copy-pasteable curl quick-test.

If the server bound 0.0.0.0 (the default for locca serve), it also lists every reachable LAN and Tailscale URL, probed live so only working ones show up. Handy for pointing a phone or another machine at the same server.

The same output prints automatically after locca serve succeeds.

Server defaults

| Flag | Purpose | |---|---| | --host 0.0.0.0 (serve) / 127.0.0.1 (pi) | LAN access vs loopback only | | --n-gpu-layers 999 | All layers on GPU | | --flash-attn on | Flash attention | | --cache-type-k q8_0 --cache-type-v q8_0 | Quantized KV cache (4× smaller than f16) | | --parallel 1 | Full context to a single slot | | --cache-reuse 256 | KV reuse across multi-turn requests | | --batch-size 1024 | Larger prompt-processing batches (faster on iGPUs) | | --jinja | Proper chat template handling | | --mmproj <file> | Auto-added when an mmproj*.gguf sibling is detected |

Per-model context auto-tuning (ctxForModel() in src/models.ts) picks the largest tier that actually fits:

Catalog hit. When the filename matches a curated entry in src/catalog.ts, locca uses each size's measured KV-cache slope plus detected RAM/VRAM to pick the largest tier from [4k, 8k, 16k, 32k, 64k, 128k, 256k] that fits.
Sideloaded GGUF. Falls back to a name-based regex (MoE/*A3B* → 128k; 3–9B → 128k; 12–14B → 64k; 22–27B → 32k; 30–35B → 64k; other → 32k).
VRAM budget cap. vramBudgetMB in your config caps the result so a small GPU doesn't OOM on the 128k default.

Sampling parameters (temperature, top_p, etc.) are read from GGUF metadata when --jinja is on. Verify what your server is using with curl -s http://localhost:<port>/props | jq '.default_generation_settings.params'.

`locca bench`

Wraps llama-bench -o json and renders a friendlier summary:

  Generation     18.3 tok/s   ≈   14 words/sec    drives perceived speed
  Prompt eval   231.4 tok/s   ≈  178 words/sec    parallel, batched

  Translates to:
    • 200-token reply         10.9 s
    • 2000-token reply        1m 49s
    • 1000-token prompt eval   4.3 s  (time-to-first-token)

Generation rate is what you feel watching output stream; prompt-eval rate sets time-to-first-token on long prompts.

While the bench runs, the spinner shows live stats — elapsed time, CPU load, RAM, and (if nvidia-smi or rocm-smi is on PATH) GPU utilisation and VRAM. The same line shows during the "pi is thinking" wait in locca optimise.

File layout

| Purpose | Path | |---|---| | Binary | wherever npm puts globals (npm prefix -g/bin) | | Config | ~/.locca/config.json | | Server PID | ${XDG_RUNTIME_DIR:-/tmp}/locca-server.pid | | Server log | ${XDG_RUNTIME_DIR:-/tmp}/locca-server.log | | pi provider config | ${PI_CODING_AGENT_DIR:-~/.pi/agent}/models.json | | Models dir | ~/.locca/models (default, configurable) | | Downloaded GGUFs | $modelsDir/<repo>/ |

On Linux, runtime files live in /run/user/$UID/ and are wiped on reboot. That's intentional.

Configuration

locca setup writes ~/.locca/config.json. Edit it via locca config, by hand, or re-run the wizard:

{
  "modelsDir": "/home/you/.locca/models",
  "defaultPort": 8080,
  "defaultCtx": 32768,
  "defaultThreads": 10,
  "llamaServer": "llama-server",
  "llamaCli": "llama-cli",
  "llamaBench": "llama-bench",
  "piSkills": "lazy",
  "piExtensions": true,
  "piContextFiles": false,
  "vramBudgetMB": 16384
}

The interactive editor shows preset pickers for defaultCtx, defaultThreads, and vramBudgetMB, with a Custom… fallback.

If your binaries aren't on $PATH, point them at absolute paths:

{
  "llamaServer": "/home/you/llama.cpp/build/bin/llama-server",
  "llamaCli":    "/home/you/llama.cpp/build/bin/llama-cli",
  "llamaBench":  "/home/you/llama.cpp/build/bin/llama-bench"
}

piSkills is tri-state (default "lazy"):

"lazy" — /skill:<name> slash commands still work, but skill descriptions are stripped from the system prompt to save context on small local models. Implemented via a tiny bundled pi extension.
"on" — pi's default; descriptions are loaded and the model can auto-invoke skills.
"off" — passes --no-skills.

piExtensions (default true) toggles pi's extension discovery, needed for lazy skills mode. piContextFiles (default false) toggles pi's AGENTS.md / CLAUDE.md discovery; off by default so small models aren't blown out by large project instruction files.

vramBudgetMB is optional. It caps the auto-picked context window:

| vramBudgetMB | Auto-ctx ceiling | |---|---| | ≤ 6 GB | 8 192 | | ≤ 8 GB | 16 384 | | ≤ 12 GB | 32 768 | | ≤ 16 GB | 65 536 | | > 16 GB | 131 072 |

It does not override an explicit defaultCtx or a ctx you type into locca serve. locca doctor will detect your GPU's reported VRAM and suggest a value if it's unset.

locca probes defaultPort at startup. If something already responds to /health (a llama-server you started by hand or via a supervisor), locca marks it as attached and uses it instead of spawning a duplicate. serve, stop, and logs short-circuit on attached servers; manage them via whatever started them.

`locca config`

locca config              # interactive picker
locca config list         # print every key + current value
locca config get  <key>
locca config set  <key> <value>
locca config reset <key>  # remove the key, fall back to defaults
locca config path

Empty values clear optional keys (e.g. locca config set vramBudgetMB "" removes the cap).

Dependencies

Required:

node ≥ 20
llama.cpp:
- Arch: sudo pacman -S llama.cpp · yay -S llama.cpp-vulkan-git · yay -S llama.cpp-hip-git
- macOS: brew install llama.cpp
- Debian / Ubuntu / Fedora / openSUSE / Alpine: build from source. locca setup prints the exact apt/dnf/zypper/apk line; full deps reference at .claude/skills/llama-cpp-manage/references/install.md.

Optional:

pi (pi.dev) for the locca pi subcommand. The setup wizard offers to install it, or:

npm install -g @mariozechner/pi-coding-agent
# or
mise use -g npm:@mariozechner/pi-coding-agent

vulkan-tools — vulkaninfo for GPU diagnostics.
rocm-smi-lib — VRAM monitoring on AMD discrete GPUs.
jq — used by diagnose.sh for prettier output.

Updating

npm update -g @zeiq/locca

Or, if installed from source:

cd path/to/locca
git pull
npm install
npm run build

Uninstall

npm uninstall -g @zeiq/locca
rm -rf "$HOME/.locca"                                   # config + models (optional)
rm -rf "$HOME/.pi/agent"                                # pi provider config (optional)
rm -f "${XDG_RUNTIME_DIR:-/tmp}/locca-server."{pid,log}

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme