Termollama

A Linux command line utility for Ollama, the user-friendly Llamacpp wrapper. It displays info about GPU VRAM usage and loaded models, and adds these features:

  • Memory management: load and unload models with different parameters
  • Serve command: equivalent to ollama serve, but with flag options
  • GGUF utilities: extract GGUF files or symlinks from the Ollama model blobs

Install

The nvidia-smi command should be available on the system in order to display GPU info.

Install:

npm i -g termollama
# to update:
npm i -g termollama@latest

Or just run it with npx:

npx termollama

The olm command is now available.
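
To see the available commands and flags (assuming the CLI exposes the usual --help output, which is not shown in this README):

olm --help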

Memory occupation stats

Run the olm command without any arguments to display memory stats.

Note the action bar at the bottom: it stays on screen for 5 seconds and then disappears. It offers these quick-action shortcuts:

  • m → Show a memory chart
  • l → Load models
  • u → Unload models

Watch mode

To monitor the activity in real time:

olm -w

Options

  • -m, --max-model-bars <number>: Set the maximum number of model bars to display. Defaults to OLLAMA_MAX_LOADED_MODELS if set, otherwise 3 × number of GPUs.
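For example, to raise the limit to six model bars (a sketch; the value is only illustrative, and combining the flag with -w is assumed to work the same way):

olm --max-model-bars 6
# or while watching:
olm -w -m 6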

Environment Variables

  • TERMOLLAMA_TEMPS: Set temperature thresholds as comma-separated values (low, mid, high) for color-coding.
    Example:
    export TERMOLLAMA_TEMPS="30,55,75"
  • TERMOLLAMA_POWER: Set power usage threshold percentage for color-coding.
    Example:
    export TERMOLLAMA_POWER="20"
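
A combined sketch, exporting both thresholds before launching the stats view (the values are only illustrative):

export TERMOLLAMA_TEMPS="40,60,80"
export TERMOLLAMA_POWER="25"
olm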

Models

To list all the available models:

olm models
# or
olm m

To search for a model with filters:

olm m stral
+--------------------------------------------------------+--------+---------+----------+
|                         Model                          | Params |  Quant  |   Size   |
+--------------------------------------------------------+--------+---------+----------+
|              devstral:24b-small-2505-q8_0              | 23.6B  |   Q8_0  | 23.3 GiB |
+--------------------------------------------------------+--------+---------+----------+
|                   devstral32k:latest                   | 23.6B  |   Q8_0  | 23.3 GiB |
+--------------------------------------------------------+--------+---------+----------+
|     hf.co/unsloth/Devstral-Small-2507-GGUF:Q8_K_XL     | 23.6B  | unknown | 27.8 GiB |
+--------------------------------------------------------+--------+---------+----------+
|                  mistral-nemo:latest                   | 12.2B  |   Q4_0  | 6.6 GiB  |
+--------------------------------------------------------+--------+---------+----------+
|                  mistral-small:latest                  | 23.6B  |  Q4_K_M | 13.3 GiB |
+--------------------------------------------------------+--------+---------+----------+
|                  mistral-small3.1:24b                  | 24.0B  |  Q4_K_M | 14.4 GiB |
+--------------------------------------------------------+--------+---------+----------+
|                mistral-small3.2:latest                 | 24.0B  |  Q4_K_M | 14.1 GiB |
+--------------------------------------------------------+--------+---------+----------+

Load models

List all the models and select some to load:

olm load
# or
olm l

You can specify optional parameters when loading:

  • --ctx or -c: Set the context window (e.g., 2k, 4k, 8192).
  • --keep-alive or -k: Set the keep alive timeout (e.g., 5m, 2h).
  • --ngl or -n: Number of GPU layers to load.

Examples:

  1. Basic load with search:

    olm l qw

    This searches for models containing "qw" and lets you select from the filtered list.

  2. Load with context and keep alive:

    olm load --ctx 8k --keep-alive 1h mistral

    Search for "mistral" models and load with an 8k context window and a 1 hour keep alive time.

  3. Specify GPU layers:

    olm l --ngl 40 qwen3:30b

    Loads the qwen3:30b model with 40 GPU layers; the remaining layers go to RAM.

Filters can be combined (e.g., olm l qwen3 4b finds models matching both terms). The selected models are loaded into memory, with interactive prompts for any parameters not specified via flags.
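
As a sketch combining a multi-term filter with the flags above (the search terms and values are only illustrative):

olm l --ctx 4k --keep-alive 30m qwen3 4b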

Unload models

To unload models:

olm unload
# or
olm u

Pick the models to unload from the list.

Serve command

A serve command is available, equivalent to ollama serve but with flag options.

olm serve
# or
olm s

Serve command options map directly to environment variables (they are set within the serve process only):

| Option Flag         | Environment Variable      |
|---------------------|---------------------------|
| --flash-attention   | OLLAMA_FLASH_ATTENTION    |
| --kv-4              | OLLAMA_KV_CACHE_TYPE=q4_0 |
| --kv-8              | OLLAMA_KV_CACHE_TYPE=q8_0 |
| --keep-alive        | OLLAMA_KEEP_ALIVE         |
| --ctx               | OLLAMA_CONTEXT_LENGTH     |
| --max-loaded-models | OLLAMA_MAX_LOADED_MODELS  |

Usage

Options of olm serve:

  • Flash attention: use the --flash-attention or -f flag to enable it
  • Q4 KV cache: use --kv-4 or -4 (note: this flag turns flash attention on)
  • Q8 KV cache: use --kv-8 or -8 (note: this flag turns flash attention on)
  • CPU: use the --cpu flag to run on CPU only
  • GPU: provide a list of GPU ids to use: --gpu 0 1 or -g 0 1
  • Keep alive: set the default keep alive time: --keep-alive 1h or -k 1h
  • Context length: set the default context length: --ctx 8192 or -c 8192
  • Max loaded models: max number of models in memory: --max-loaded-models 4 or -m 4
  • Max queue: set the max queue value: --max-queue 50 or -q 50
  • Num parallel: number of parallel requests: --num-parallel 2 or -n 2
  • Port: set the port: --port 11485 or -p 11485
  • Host: set the hostname: --host 192.168.1.8
  • Models registry: set the directory for the models registry: --registry ~/some/path/ollama_models or -r ~/some/path/ollama_models

Key Options:

  • Flash Attention: -f
  • KV Cache:
    • -4 → q4_0 quantization (low memory)
    • -8 → q8_0 quantization (balanced)
  • GPU/CPU:
    • --cpu → Run on CPU only
    • -g 0 1 → Use specific GPUs (e.g., GPUs 0 and 1)
  • Memory Management:
    • -k 15m → Keep alive timeout
    • -c 8192 → Default context length
  • Server Settings:
    • -p 11434 → Port (default 11434)
    • -h 0.0.0.0 → Host address

Examples

olm s -fg 0

Run with flash attention on GPU 0 only

olm s -c 8192 --cpu

Run with a default context window of 8192 using only the CPU

olm s -8k 10m -m 4

Use the q8_0 KV cache (flash attention will be enabled); models stay loaded for ten minutes and a maximum of 4 models can be loaded at the same time

olm s -p 11385 -r ~/some/path/ollama_models

Run on localhost:11385 with a custom models registry directory; use an empty directory to create a new registry
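
A couple of additional combinations, sketched from the remaining documented flags (the values and host are only illustrative):

olm s -4 -n 4 -q 100

Use the q4_0 KV cache (flash attention will be used), allow 4 parallel requests and a request queue of 100

olm s --host 0.0.0.0 -p 11500

Listen on all interfaces on port 11500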

Environment variables info

To show the environment variables used by Ollama:

olm env
# or
olm e

| Variable                 | Description                                      |
|--------------------------|--------------------------------------------------|
| OLLAMA_FLASH_ATTENTION   | Enable flash attention (1 to enable)             |
| OLLAMA_KV_CACHE_TYPE     | Set KV cache quantization (e.g. q4_0, q8_0)      |
| OLLAMA_KEEP_ALIVE        | Default keep alive timeout (e.g. 5m, 2h)         |
| OLLAMA_CONTEXT_LENGTH    | Default context window length (e.g. 4096)        |
| OLLAMA_MAX_LOADED_MODELS | Maximum number of models to load simultaneously  |
| OLLAMA_MAX_QUEUE         | Maximum request queue size                       |
| OLLAMA_NUM_PARALLEL      | Number of parallel requests allowed              |
| OLLAMA_HOST              | Server host address (default localhost)          |
| OLLAMA_MODELS            | Custom models registry directory                 |
| CUDA_VISIBLE_DEVICES     | GPU selection (use -1 to force CPU mode)         |
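
Since these are standard Ollama variables, the same configuration can also be applied to a plain ollama serve; a minimal sketch (values are only illustrative):

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_KEEP_ALIVE=15m
ollama serve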

Instance options

To use a different instance than the default localhost:11434:

  • -u, --use-instance <hostdomain>: Use a specific Ollama instance as the source. Example:

    olm models -u 192.168.1.8:11434

    This command will list the models from the Ollama instance running at 192.168.1.8 on port 11434.

  • -s, --use-https: Use HTTPS protocol to reach the Ollama instance.
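
For example, to query a remote instance over HTTPS (the hostname is hypothetical, and combining -u with -s is assumed to work as described):

olm models -u ollama.example.com -s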

Information about gguf files

Show registries info

To show information about gguf models located in the Ollama internal registries:

olm gguf
# or
olm g

This will display information about models from the Ollama model storage registries. Output:

---------  Registry hf.co/bartowski ---------
hf.co/bartowski
   NousResearch_DeepHermes-3-Llama-3-8B-Preview-GGUF (1 model)
    - Q6_K_L

---------  Registry ollama.com ---------
ollama.com
   deepseek-coder-v2 (1 model)
    - 16b-lite-instruct-q8_0

---------  Registry registry.ollama.ai ---------
registry.ollama.ai
  gemma3 (3 models)
    - 12b
    - 27b
    - 4b-it-q8_0
  ...

Show model info

To show information about a specific model:

olm gguf -m qwen3:0.6b

Output:

Model qwen3:0.6b found in registry registry.ollama.ai
  size: 498.4 MiB
  quant: Q4_K_M
  blob: /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx
  link: ln -s /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx qwen3_0.6b_Q4_K_M.gguf

The link command can be used to create a regular, properly named .gguf symlink to the blob, so the model can be used with Llamacpp and friends.
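
As an illustration, the link line from the output above can be run as-is and the resulting symlink passed to a llama.cpp build (the llama-cli invocation is an assumption about your local install, not part of termollama):

ln -s /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx qwen3_0.6b_Q4_K_M.gguf
llama-cli -m qwen3_0.6b_Q4_K_M.gguf -p "Hello"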

Show template info

To show a model's template:

olm gguf -t qwen3:0.6b

Exfiltrate Model Blob

To exfiltrate a model blob to a gguf file:

olm gguf -x qwen3:0.6b /path/to/destination

This command will copy the model data from its original location to the specified destination, rename it to a .gguf file, and replace the original blob with a symlink pointing to the new file. Use case: moving the model to another storage location. Use at your own risk.
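
To sanity-check the result, you can confirm that the destination .gguf exists and that the original blob is now a symlink (the paths are only illustrative; the blob directory follows the example output above):

ls -lh /path/to/destination
ls -l ~/.ollama/blobs | grep ' -> '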

Copy Model Blob

To only copy a model blob without replacing the original:

olm gguf -c qwen3:0.6b /path/to/destination

This command will perform the same steps as the exfiltrate command but will not replace the original blob with a symlink.