locca

A TUI around llama.cpp for running, managing, and benchmarking local GGUF models, and for launching the pi coding agent against your local server.

https://github.com/user-attachments/assets/8b451763-bc8a-4707-96f9-9bc78cf6de25

Works on Linux and macOS, against any GPU llama.cpp can target (Vulkan, Metal, CUDA, ROCm) or CPU-only. Defaults are tuned for iGPU-class hardware (q8_0 KV cache, single slot, batch size 1024) so a 7B–9B model with 128k context fits on a 16 GB shared-VRAM iGPU.

Quickstart

npm install -g @zeiq/locca
locca                  # first run launches the setup wizard

The setup wizard:

  1. Asks for your models directory (default ~/.locca/models).
  2. Confirms llama-server is on $PATH. If not, prints the exact install line for your distro (apt, dnf, pacman/AUR, zypper, apk, brew), including the shader compiler packages (glslc, spirv-headers) recent Vulkan builds need.
  3. Sets server defaults (port, ctx, threads, VRAM budget tier).
  4. If the models dir is empty, offers a catalog-aware first model picker. Each curated size shows a hint based on detected hardware, such as "fits — 5.6 GB dl, 14.3 GB RAM, 256k ctx" or "needs 32 GB+ RAM", so you can't accidentally pick a 30 GB download that won't run.
  5. Offers to install pi: tries mise → npm → manual hint.

Then it writes ~/.locca/config.json and drops you at the menu.

Run from source

git clone https://github.com/perminder-klair/locca.git
cd locca
npm install
npm run build
npm link              # symlinks `locca` into your PATH

Commands

locca                          # interactive menu (Pi is default)
locca pi [model-pattern]       # launch pi against a local server
locca serve                    # start llama-server with a picked model (detached)
locca switch                   # picker: installed models + curated catalog
locca bench                    # run llama-bench against a model
locca doctor                   # health check: hardware, llama.cpp, server, log, config
locca optimise                 # ask pi to review the deployment and suggest tweaks
locca api                      # print OpenAI-compatible connection info
locca logs                     # tail server log (locca-started servers only)
locca download [user/repo]     # pull a GGUF from HuggingFace
locca search   [query]         # search HuggingFace for GGUF models
locca delete                   # remove a model directory
locca stop                     # stop the running server
locca config                   # view / edit ~/.locca/config.json
locca setup                    # re-run the setup wizard
locca install-llama            # download a prebuilt llama.cpp binary into ~/.locca/bin
locca help                     # full command listing

locca pi qwen fuzzy-matches the first *qwen*.gguf in your models dir.
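
A typical first session chains a few of these together; the search query and the repo placeholder below are illustrative, not recommendations:

locca search qwen              # find GGUF repos on HuggingFace
locca download <user/repo>     # pull one into your models dir
locca serve                    # pick the model and start llama-server detached
locca pi qwen                  # fuzzy-matches the first *qwen*.gguf and launches pi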

locca api

When a server is running, locca api prints the OpenAI-compatible connection block: base URL, loaded model name, every endpoint (/chat/completions, /completions, /embeddings, /models, plus native /health, /props, /slots, /metrics), and a copy-pasteable curl quick-test.

If the server is bound to 0.0.0.0 (the default for locca serve), it also lists every reachable LAN and Tailscale URL, probed live so only working ones show up. Handy for pointing a phone or another machine at the same server.

The same output prints automatically after locca serve succeeds.
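
A minimal sketch of that quick-test, assuming the default port 8080 and a server that is already up (locca api prints the exact URL and model name for your setup):

curl -s http://localhost:8080/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi in one sentence."}]}'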

Server defaults

| Flag | Purpose |
|---|---|
| --host 0.0.0.0 (serve) / 127.0.0.1 (pi) | LAN access vs loopback only |
| --n-gpu-layers 999 | All layers on GPU |
| --flash-attn on | Flash attention |
| --cache-type-k q8_0 --cache-type-v q8_0 | Quantized KV cache (4× smaller than f16) |
| --parallel 1 | Full context to a single slot |
| --cache-reuse 256 | KV reuse across multi-turn requests |
| --batch-size 1024 | Larger prompt-processing batches (faster on iGPUs) |
| --jinja | Proper chat template handling |
| --mmproj <file> | Auto-added when an mmproj*.gguf sibling is detected |
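
Put together, the invocation locca serve builds looks roughly like the sketch below; the model path, port, and context size are illustrative, and the real command depends on your config and the auto-tuning described next:

llama-server \
  --model ~/.locca/models/<repo>/<model>.gguf \
  --host 0.0.0.0 --port 8080 --ctx-size 32768 \
  --n-gpu-layers 999 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 1 --cache-reuse 256 --batch-size 1024 \
  --jinja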

Per-model context auto-tuning (ctxForModel() in src/models.ts) picks the largest tier that actually fits:

  • Catalog hit. When the filename matches a curated entry in src/catalog.ts, locca uses each size's measured KV-cache slope plus detected RAM/VRAM to pick the largest tier from [4k, 8k, 16k, 32k, 64k, 128k, 256k] that fits.
  • Sideloaded GGUF. Falls back to a name-based regex (MoE/*A3B* → 128k; 3–9B → 128k; 12–14B → 64k; 22–27B → 32k; 30–35B → 64k; other → 32k).
  • VRAM budget cap. vramBudgetMB in your config caps the result so a small GPU doesn't OOM on the 128k default.

Sampling parameters (temperature, top_p, etc.) are read from GGUF metadata when --jinja is on. Verify what your server is using with curl -s http://localhost:<port>/props | jq '.default_generation_settings.params'.
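
For example, on the default port (jq just filters out the values the paragraph above mentions):

curl -s http://localhost:8080/props \
  | jq '.default_generation_settings.params | {temperature, top_p}'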

locca bench

Wraps llama-bench -o json and renders a friendlier summary:

  Generation     18.3 tok/s   ≈   14 words/sec    drives perceived speed
  Prompt eval   231.4 tok/s   ≈  178 words/sec    parallel, batched

  Translates to:
    • 200-token reply         10.9 s
    • 2000-token reply        1m 49s
    • 1000-token prompt eval   4.3 s  (time-to-first-token)

Generation rate is what you feel watching output stream; prompt-eval rate sets time-to-first-token on long prompts.
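
The reply-time estimates are plain division over those two rates; reproducing the sample numbers above:

echo "200 / 18.3"   | bc -l    # ≈ 10.9 s for a 200-token reply
echo "2000 / 18.3"  | bc -l    # ≈ 109 s, i.e. 1m 49s
echo "1000 / 231.4" | bc -l    # ≈ 4.3 s time-to-first-token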

While the bench runs, the spinner shows live stats — elapsed time, CPU load, RAM, and (if nvidia-smi or rocm-smi is on PATH) GPU utilisation and VRAM. The same line shows during the "pi is thinking" wait in locca optimise.

File layout

| Purpose | Path |
|---|---|
| Binary | wherever npm puts globals (npm prefix -g/bin) |
| Config | ~/.locca/config.json |
| Server PID | ${XDG_RUNTIME_DIR:-/tmp}/locca-server.pid |
| Server log | ${XDG_RUNTIME_DIR:-/tmp}/locca-server.log |
| pi provider config | ${PI_CODING_AGENT_DIR:-~/.pi/agent}/models.json |
| Models dir | ~/.locca/models (default, configurable) |
| Downloaded GGUFs | $modelsDir/<repo>/ |

On Linux, runtime files live in /run/user/$UID/ and are wiped on reboot. That's intentional.
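
To inspect those runtime files directly (locca logs is the supported way to follow the log; paths as in the table above):

cat "${XDG_RUNTIME_DIR:-/tmp}/locca-server.pid"       # PID of the detached llama-server
tail -f "${XDG_RUNTIME_DIR:-/tmp}/locca-server.log"   # raw server log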

Configuration

locca setup writes ~/.locca/config.json. Edit it via locca config, by hand, or re-run the wizard:

{
  "modelsDir": "/home/you/.locca/models",
  "defaultPort": 8080,
  "defaultCtx": 32768,
  "defaultThreads": 10,
  "llamaServer": "llama-server",
  "llamaCli": "llama-cli",
  "llamaBench": "llama-bench",
  "piSkills": "lazy",
  "piExtensions": true,
  "piContextFiles": false,
  "vramBudgetMB": 16384
}

The interactive editor shows preset pickers for defaultCtx, defaultThreads, and vramBudgetMB, with a Custom… fallback.

If your binaries aren't on $PATH, point them at absolute paths:

{
  "llamaServer": "/home/you/llama.cpp/build/bin/llama-server",
  "llamaCli":    "/home/you/llama.cpp/build/bin/llama-cli",
  "llamaBench":  "/home/you/llama.cpp/build/bin/llama-bench"
}

piSkills is tri-state (default "lazy"):

  • "lazy"/skill:<name> slash commands still work, but skill descriptions are stripped from the system prompt to save context on small local models. Implemented via a tiny bundled pi extension.
  • "on" — pi's default; descriptions are loaded and the model can auto-invoke skills.
  • "off" — passes --no-skills.

piExtensions (default true) toggles pi's extension discovery, needed for lazy skills mode. piContextFiles (default false) toggles pi's AGENTS.md / CLAUDE.md discovery; off by default so small models aren't blown out by large project instruction files.
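
Assuming the set subcommand accepts these values as typed (the interactive picker in locca config is the safe route), flipping the pi-related options looks like:

locca config set piSkills off          # pi gets --no-skills
locca config set piContextFiles true   # let pi read AGENTS.md / CLAUDE.md again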

vramBudgetMB is optional. It caps the auto-picked context window:

| vramBudgetMB | Auto-ctx ceiling |
|---|---|
| ≤ 6 GB | 8 192 |
| ≤ 8 GB | 16 384 |
| ≤ 12 GB | 32 768 |
| ≤ 16 GB | 65 536 |
| > 16 GB | 131 072 |

It does not override an explicit defaultCtx or a ctx you type into locca serve. locca doctor will detect your GPU's reported VRAM and suggest a value if it's unset.
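
For example, on an 8 GB card you might let locca doctor confirm the number, then set the cap so auto-ctx tops out at 16 384 per the table above:

locca doctor                          # reports detected VRAM and suggests a value
locca config set vramBudgetMB 8192    # auto-picked context now caps at 16 384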

locca probes defaultPort at startup. If something already responds to /health (a llama-server you started by hand or via a supervisor), locca marks it as attached and uses it instead of spawning a duplicate. serve, stop, and logs short-circuit on attached servers; manage them via whatever started them.
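
The probe is a plain HTTP health check, so you can reproduce it by hand; a ready llama-server answers with a small JSON status (default port shown):

curl -s http://localhost:8080/health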

locca config

locca config              # interactive picker
locca config list         # print every key + current value
locca config get  <key>
locca config set  <key> <value>
locca config reset <key>  # remove the key, fall back to defaults
locca config path

Empty values clear optional keys (e.g. locca config set vramBudgetMB "" removes the cap).

Dependencies

Required:

  • node ≥ 20
  • llama.cpp:
    • Arch: sudo pacman -S llama.cpp · yay -S llama.cpp-vulkan-git · yay -S llama.cpp-hip-git
    • macOS: brew install llama.cpp
    • Debian / Ubuntu / Fedora / openSUSE / Alpine: build from source. locca setup prints the exact apt/dnf/zypper/apk line; full deps reference at .claude/skills/llama-cpp-manage/references/install.md.

Optional:

  • pi (pi.dev) for the locca pi subcommand. The setup wizard offers to install it, or:
    npm install -g @mariozechner/pi-coding-agent
    # or
    mise use -g npm:@mariozechner/pi-coding-agent
  • vulkan-tools — vulkaninfo for GPU diagnostics.
  • rocm-smi-lib — VRAM monitoring on AMD discrete GPUs.
  • jq — used by diagnose.sh for prettier output.
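
A quick pre-flight for the required pieces (locca doctor runs a fuller version of the same checks):

node --version                        # must report v20 or newer
command -v llama-server llama-bench   # should resolve, or set absolute paths in config
locca doctor                          # hardware, llama.cpp, server, log, config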

Updating

npm update -g @zeiq/locca

Or, if installed from source:

cd path/to/locca
git pull
npm install
npm run build

Uninstall

npm uninstall -g @zeiq/locca
rm -rf "$HOME/.locca"                                   # config + models (optional)
rm -rf "$HOME/.pi/agent"                                # pi provider config (optional)
rm -f "${XDG_RUNTIME_DIR:-/tmp}/locca-server."{pid,log}

License

MIT