locca
A TUI around llama.cpp for running,
managing, and benchmarking local GGUF models, and for launching the
pi coding agent against your local server.
https://github.com/user-attachments/assets/8b451763-bc8a-4707-96f9-9bc78cf6de25
Works on Linux and macOS, against any GPU llama.cpp can target (Vulkan, Metal, CUDA, ROCm) or CPU-only. Defaults are tuned for iGPU-class hardware (q8_0 KV cache, single slot, batch size 1024) so a 7B–9B model with 128k context fits on a 16 GB shared-VRAM iGPU.
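A quick back-of-envelope check of that claim, assuming a hypothetical 8B GQA model with 32 layers, 8 KV heads, and head dim 128 (illustrative numbers; check your model card):

```
# K + V caches, ~1 byte per element at q8_0 (vs 2 bytes at f16), 131072-token context
# per-token KV = 2 * 32 layers * 8 KV heads * 128 head_dim = 64 KiB
echo $(( 2 * 32 * 8 * 128 * 131072 / 1024 / 1024 )) MiB   # ≈ 8192 MiB of KV cache
```

Add roughly 5 GB for Q4-quantized 8B weights and the total sits comfortably inside a 16 GB shared-VRAM budget; at f16 the same cache would roughly double and no longer fit.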
Quickstart
```
npm install -g @zeiq/locca
locca   # first run launches the setup wizard
```

The setup wizard:
- Asks for your models directory (default `~/.locca/models`).
- Confirms `llama-server` is on `$PATH`. If not, prints the exact install line for your distro (apt, dnf, pacman/AUR, zypper, apk, brew), including the shader compiler packages (`glslc`, `spirv-headers`) recent Vulkan builds need.
- Sets server defaults (port, ctx, threads, VRAM budget tier).
- If the models dir is empty, offers a catalog-aware first model picker. Each curated size shows a `fits — 5.6 GB dl, 14.3 GB RAM, 256k ctx` (or `needs 32 GB+ RAM`) hint based on detected hardware, so you can't accidentally pick a 30 GB download that won't run.
- Offers to install `pi`: tries `mise` → `npm` → manual hint.
Then it writes ~/.locca/config.json and drops you at the menu.
Run from source
```
git clone https://github.com/perminder-klair/locca.git
cd locca
npm install
npm run build
npm link   # symlinks locca into your PATH
```

Commands
```
locca                        # interactive menu (Pi is default)
locca pi [model-pattern]     # launch pi against a local server
locca serve                  # start llama-server with a picked model (detached)
locca switch                 # picker: installed models + curated catalog
locca bench                  # run llama-bench against a model
locca doctor                 # health check: hardware, llama.cpp, server, log, config
locca optimise               # ask pi to review the deployment and suggest tweaks
locca api                    # print OpenAI-compatible connection info
locca logs                   # tail server log (locca-started servers only)
locca download [user/repo]   # pull a GGUF from HuggingFace
locca search [query]         # search HuggingFace for GGUF models
locca delete                 # remove a model directory
locca stop                   # stop the running server
locca config                 # view / edit ~/.locca/config.json
locca setup                  # re-run the setup wizard
locca install-llama          # download a prebuilt llama.cpp binary into ~/.locca/bin
locca help                   # full command listing
```

`locca pi qwen` fuzzy-matches the first `*qwen*.gguf` in your models dir.
locca api
When a server is running, locca api prints the OpenAI-compatible
connection block: base URL, loaded model name, every endpoint
(/chat/completions, /completions, /embeddings, /models, plus
native /health, /props, /slots, /metrics), and a copy-pasteable
curl quick-test.
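A minimal quick-test against the OpenAI-style endpoint looks something like this (port 8080 is just the sample default; the block locca prints has your actual values filled in):

```
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hi in five words."}]}'
```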
If the server is bound to 0.0.0.0 (the default for locca serve), it also
lists every reachable LAN and Tailscale URL, probed live so only
working ones show up. Handy for pointing a phone or another machine at
the same server.
The same output prints automatically after locca serve succeeds.
Server defaults
| Flag | Purpose |
|---|---|
| --host 0.0.0.0 (serve) / 127.0.0.1 (pi) | LAN access vs loopback only |
| --n-gpu-layers 999 | All layers on GPU |
| --flash-attn on | Flash attention |
| --cache-type-k q8_0 --cache-type-v q8_0 | Quantized KV cache (roughly half the size of f16) |
| --parallel 1 | Full context to a single slot |
| --cache-reuse 256 | KV reuse across multi-turn requests |
| --batch-size 1024 | Larger prompt-processing batches (faster on iGPUs) |
| --jinja | Proper chat template handling |
| --mmproj <file> | Auto-added when an mmproj*.gguf sibling is detected |
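Put together, the spawned command line ends up roughly like the sketch below; the model path, port, and context size are whatever you picked, and the exact flag set can shift between locca releases:

```
llama-server \
  -m ~/.locca/models/<repo>/model-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 -c 131072 \
  --n-gpu-layers 999 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 1 --cache-reuse 256 \
  --batch-size 1024 --jinja
```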
Per-model context auto-tuning (ctxForModel() in src/models.ts) picks
the largest tier that actually fits:
- Catalog hit. When the filename matches a curated entry in `src/catalog.ts`, locca uses each size's measured KV-cache slope plus detected RAM/VRAM to pick the largest tier from `[4k, 8k, 16k, 32k, 64k, 128k, 256k]` that fits.
- Sideloaded GGUF. Falls back to a name-based regex (MoE/`*A3B*` → 128k; 3–9B → 128k; 12–14B → 64k; 22–27B → 32k; 30–35B → 64k; other → 32k).
- VRAM budget cap. `vramBudgetMB` in your config caps the result so a small GPU doesn't OOM on the 128k default.
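To see which context size actually got applied, ask the running server (default port assumed; the JSON layout varies slightly across llama.cpp versions):

```
curl -s http://localhost:8080/props | jq '.default_generation_settings.n_ctx'
```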
Sampling parameters (temperature, top_p, etc.) are read from GGUF
metadata when --jinja is on. Verify what your server is using with
curl -s http://localhost:<port>/props | jq '.default_generation_settings.params'.
locca bench
Wraps llama-bench -o json and renders a friendlier summary:
```
Generation    18.3 tok/s   ≈ 14 words/sec    drives perceived speed
Prompt eval  231.4 tok/s   ≈ 178 words/sec   parallel, batched

Translates to:
  • 200-token reply           10.9 s
  • 2000-token reply          1m 49s
  • 1000-token prompt eval     4.3 s   (time-to-first-token)
```

Generation rate is what you feel watching output stream; prompt-eval rate sets time-to-first-token on long prompts.
While the bench runs, the spinner shows live stats — elapsed time, CPU
load, RAM, and (if nvidia-smi or rocm-smi is on PATH) GPU
utilisation and VRAM. The same line shows during the "pi is thinking"
wait in locca optimise.
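If you'd rather see the raw numbers than the rendered summary, the underlying call is plain llama-bench, roughly (the model path here is hypothetical):

```
llama-bench -m ~/.locca/models/<repo>/model-q4_k_m.gguf -o json
```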
File layout
| Purpose | Path |
|---|---|
| Binary | wherever npm puts globals (`$(npm prefix -g)/bin`) |
| Config | ~/.locca/config.json |
| Server PID | ${XDG_RUNTIME_DIR:-/tmp}/locca-server.pid |
| Server log | ${XDG_RUNTIME_DIR:-/tmp}/locca-server.log |
| pi provider config | ${PI_CODING_AGENT_DIR:-~/.pi/agent}/models.json |
| Models dir | ~/.locca/models (default, configurable) |
| Downloaded GGUFs | $modelsDir/<repo>/ |
On Linux, runtime files live in /run/user/$UID/ and are wiped on
reboot. That's intentional.
Configuration
locca setup writes ~/.locca/config.json. Edit it via locca config,
by hand, or re-run the wizard:
```
{
  "modelsDir": "/home/you/.locca/models",
  "defaultPort": 8080,
  "defaultCtx": 32768,
  "defaultThreads": 10,
  "llamaServer": "llama-server",
  "llamaCli": "llama-cli",
  "llamaBench": "llama-bench",
  "piSkills": "lazy",
  "piExtensions": true,
  "piContextFiles": false,
  "vramBudgetMB": 16384
}
```

The interactive editor shows preset pickers for `defaultCtx`, `defaultThreads`, and `vramBudgetMB`, with a Custom… fallback.
If your binaries aren't on $PATH, point them at absolute paths:
```
{
  "llamaServer": "/home/you/llama.cpp/build/bin/llama-server",
  "llamaCli": "/home/you/llama.cpp/build/bin/llama-cli",
  "llamaBench": "/home/you/llama.cpp/build/bin/llama-bench"
}
```

`piSkills` is tri-state (default `"lazy"`):
- `"lazy"` — `/skill:<name>` slash commands still work, but skill descriptions are stripped from the system prompt to save context on small local models. Implemented via a tiny bundled pi extension.
- `"on"` — pi's default; descriptions are loaded and the model can auto-invoke skills.
- `"off"` — passes `--no-skills`.
piExtensions (default true) toggles pi's extension discovery, needed
for lazy skills mode. piContextFiles (default false) toggles pi's
AGENTS.md / CLAUDE.md discovery; off by default so small models
aren't blown out by large project instruction files.
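All three are ordinary config keys, so they can also be flipped from the command line, e.g. (check `locca config list` for the accepted values):

```
locca config set piSkills off
locca config set piContextFiles true
```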
vramBudgetMB is optional. It caps the auto-picked context window:
| vramBudgetMB | Auto-ctx ceiling |
|---|---|
| ≤ 6 GB | 8 192 |
| ≤ 8 GB | 16 384 |
| ≤ 12 GB | 32 768 |
| ≤ 16 GB | 65 536 |
| > 16 GB | 131 072 |
It does not override an explicit defaultCtx or a ctx you type
into locca serve. locca doctor will detect your GPU's reported VRAM
and suggest a value if it's unset.
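To pin it yourself for, say, a 16 GB iGPU instead of waiting for the doctor suggestion:

```
locca config set vramBudgetMB 16384
```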
locca probes defaultPort at startup. If something already responds to
/health (a llama-server you started by hand or via a supervisor),
locca marks it as attached and uses it instead of spawning a
duplicate. serve, stop, and logs short-circuit on attached
servers; manage them via whatever started them.
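You can reproduce the probe by hand; anything answering on the configured port is treated as attached (8080 is just the sample default, and a ready llama-server typically replies with a small ok-status JSON):

```
curl -s http://localhost:8080/health
```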
locca config
```
locca config                # interactive picker
locca config list           # print every key + current value
locca config get <key>
locca config set <key> <value>
locca config reset <key>    # remove the key, fall back to defaults
locca config path
```

Empty values clear optional keys (e.g. `locca config set vramBudgetMB ""` removes the cap).
Dependencies
Required:
- `node` ≥ 20
- llama.cpp:
  - Arch: `sudo pacman -S llama.cpp` · `yay -S llama.cpp-vulkan-git` · `yay -S llama.cpp-hip-git`
  - macOS: `brew install llama.cpp`
  - Debian / Ubuntu / Fedora / openSUSE / Alpine: build from source. `locca setup` prints the exact `apt`/`dnf`/`zypper`/`apk` line; full deps reference at `.claude/skills/llama-cpp-manage/references/install.md`.
Optional:
- `pi` (pi.dev) for the `locca pi` subcommand. The setup wizard offers to install it, or: `npm install -g @mariozechner/pi-coding-agent` (or `mise use -g npm:@mariozechner/pi-coding-agent`).
- `vulkan-tools` — `vulkaninfo` for GPU diagnostics.
- `rocm-smi-lib` — VRAM monitoring on AMD discrete GPUs.
- `jq` — used by `diagnose.sh` for prettier output.
Updating
```
npm update -g @zeiq/locca
```

Or, if installed from source:

```
cd path/to/locca
git pull
npm install
npm run build
```

Uninstall
```
npm uninstall -g @zeiq/locca
rm -rf "$HOME/.locca"        # config + models (optional)
rm -rf "$HOME/.pi/agent"     # pi provider config (optional)
rm -f "${XDG_RUNTIME_DIR:-/tmp}/locca-server."{pid,log}
```

License
MIT
