Termollama
A Linux command line utility for Ollama, a user-friendly Llamacpp wrapper. It displays information about GPU VRAM usage and models, and has these additional features:
- Memory management: load and unload models with different parameters
- Serve command: with flag options
- Gguf utilities: extract gguf files or links from the Ollama model blobs
Install
The nvidia-smi command should be available on the system in order to display GPU info.
Install:
npm i -g termollama
# to update:
npm i -g termollama@latest

Or just run it with npx:

npx termollama

The olm command is now available.
Memory occupation stats
Run the olm command without any argument to display memory stats. Output:

Note the action bar at the bottom with quick-action shortcuts: it stays on screen for 5 seconds and then disappears. Available actions:
- m → Show a memory chart
- l → Load models
- u → Unload models
Watch mode
To monitor the activity in real time:
olm -w

Options
- -m, --max-model-bars <number>: Set the maximum number of model bars to display. Defaults to OLLAMA_MAX_LOADED_MODELS if set, otherwise 3 × the number of GPUs.
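For example, to watch activity in real time while showing up to 6 model bars (combining the two flags):

olm -w -m 6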
Environment Variables
TERMOLLAMA_TEMPS: Set temperature thresholds as comma-separated values (low, mid, high) for color-coding.
Example: export TERMOLLAMA_TEMPS="30,55,75"

TERMOLLAMA_POWER: Set power usage threshold percentage for color-coding.
Example: export TERMOLLAMA_POWER="20"
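Both variables can be set together before launching the display, for example:

# temperature thresholds (low, mid, high) used for color-coding
export TERMOLLAMA_TEMPS="30,55,75"
# power usage threshold percentage used for color-coding
export TERMOLLAMA_POWER="20"
olm -w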
Models
To list all the available models:
olm models
# or
olm m

To search for a model with filters:

olm m stral

+------------------------------------------------+--------+---------+----------+
| Model                                          | Params | Quant   | Size     |
+------------------------------------------------+--------+---------+----------+
| devstral:24b-small-2505-q8_0                   | 23.6B  | Q8_0    | 23.3 GiB |
+------------------------------------------------+--------+---------+----------+
| devstral32k:latest                             | 23.6B  | Q8_0    | 23.3 GiB |
+------------------------------------------------+--------+---------+----------+
| hf.co/unsloth/Devstral-Small-2507-GGUF:Q8_K_XL | 23.6B  | unknown | 27.8 GiB |
+------------------------------------------------+--------+---------+----------+
| mistral-nemo:latest                            | 12.2B  | Q4_0    | 6.6 GiB  |
+------------------------------------------------+--------+---------+----------+
| mistral-small:latest                           | 23.6B  | Q4_K_M  | 13.3 GiB |
+------------------------------------------------+--------+---------+----------+
| mistral-small3.1:24b                           | 24.0B  | Q4_K_M  | 14.4 GiB |
+------------------------------------------------+--------+---------+----------+
| mistral-small3.2:latest                        | 24.0B  | Q4_K_M  | 14.1 GiB |
+------------------------------------------------+--------+---------+----------+

Load models
List all the models and select some to load:
olm load
# or
olm l

You can specify optional parameters when loading:
- --ctx or -c: Set the context window (e.g., 2k, 4k, 8192).
- --keep-alive or -k: Set the keep alive timeout (e.g., 5m, 2h).
- --ngl or -n: Number of GPU layers to load.
Examples:
Basic load with search:
olm l qw

This searches for models containing "qw" and lets you select from the filtered list. Example output:

Load with context and keep alive:
olm load --ctx 8k --keep-alive 1h mistral

Searches for "mistral" models and loads the selection with an 8k context window and a 1 hour keep alive time.
Specify GPU layers:
olm l --ngl 40 qwen3:30b

Loads the qwen3:30b model with 40 GPU layers; the rest will go to RAM.
Filters can be combined (e.g., olm l qwen3 4b finds models with both terms). The selected models are loaded into memory with interactive prompts for parameters if not specified via flags.
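Under the hood, loading a model in Ollama amounts to an API request against the running server. A minimal sketch with curl against the default local endpoint (this illustrates the standard Ollama HTTP API, not necessarily the exact request olm sends):

# an empty prompt loads the model into memory without generating anything
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-small:latest",
  "prompt": "",
  "keep_alive": "1h",
  "options": { "num_ctx": 8192 }
}'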
Unload models
To unload models:
olm unload
# or
olm u

Pick the models to unload from the list.
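In the Ollama API, unloading corresponds to a request with keep_alive set to 0 (again a sketch of the underlying API, not necessarily olm's exact call):

# unload the model immediately
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:0.6b",
  "keep_alive": 0
}'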
Serve command
A serve command is available, equivalent to ollama serve but with flag options.
olm serve
# or
olm s

Serve command options map directly to environment variables (they are set within the olm process only):
| Option Flag          | Environment Variable       |
|----------------------|----------------------------|
| --flash-attention    | OLLAMA_FLASH_ATTENTION     |
| --kv-4               | OLLAMA_KV_CACHE_TYPE=q4_0  |
| --kv-8               | OLLAMA_KV_CACHE_TYPE=q8_0  |
| --keep-alive         | OLLAMA_KEEP_ALIVE          |
| --ctx                | OLLAMA_CONTEXT_LENGTH      |
| --max-loaded-models  | OLLAMA_MAX_LOADED_MODELS   |
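For example, the following two invocations should be roughly equivalent (the second sets the variables by hand for a plain ollama serve; note that --kv-8 also turns flash attention on):

# via termollama flags
olm serve --kv-8 --ctx 8192 --keep-alive 30m
# roughly the same with plain ollama and environment variables
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_CONTEXT_LENGTH=8192 OLLAMA_KEEP_ALIVE=30m ollama serve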
Usage
Options of olm serve:
- Flash attention: use the --flash-attention or -f flag to enable
- Q4 KV cache: use --kv-4 or -4 (note: this flag will turn flash attention on)
- Q8 KV cache: use --kv-8 or -8 (note: this flag will turn flash attention on)
- CPU: use the --cpu flag to run only on the CPU
- GPU: provide a list of GPU ids to use: --gpu 0 1 or -g 0 1
- Keep alive: set the default keep alive time: --keep-alive 1h or -k 1h
- Context length: set the default context length: --ctx 8192 or -c 8192
- Max loaded models: max number of models in memory: --max-loaded-models 4 or -m 4
- Max queue: set the max queue value: --max-queue 50 or -q 50
- Num parallel: number of parallel requests: --num-parallel 2 or -n 2
- Port: set the port: --port 11485 or -p 11485
- Host: set the hostname: --host 192.168.1.8
- Models registry: set the directory for the models registry: --registry ~/some/path/ollama_models or -r ~/some/path/ollama_models
Key Options:
- Flash Attention: -f
- KV Cache:
  - -4 → q4_0 quantization (low memory)
  - -8 → q8_0 quantization (balanced)
- GPU/CPU:
  - --cpu → run on CPU only
  - -g 0 1 → use specific GPUs (e.g., GPUs 0 and 1)
- Memory Management:
  - -k 15m → keep alive timeout
  - -c 8192 → default context length
- Server Settings:
  - -p 11434 → port (default 11434)
  - -h 0.0.0.0 → host address
Examples
olm s -fg 0

Run with flash attention on GPU 0 only.

olm s -c 8192 --cpu

Run with a default context window of 8192 using only the CPU.

olm s -8k 10m -m 4

Use the Q8 KV cache (flash attention will be enabled), keep models loaded for ten minutes, and allow a maximum of 4 models in memory at the same time.

olm s -p 11385 -r ~/some/path/ollama_models

Run on localhost:11385 with a custom models registry directory: use an empty directory to create a new registry.
Environment variables info
To show the environment variables used by Ollama:
olm env
# or
olm e

| Variable                 | Description                                      |
|--------------------------|--------------------------------------------------|
| OLLAMA_FLASH_ATTENTION   | Enable flash attention (1 to enable)             |
| OLLAMA_KV_CACHE_TYPE     | Set KV cache quantization (e.g. q4_0, q8_0)      |
| OLLAMA_KEEP_ALIVE        | Default keep alive timeout (e.g. 5m, 2h)         |
| OLLAMA_CONTEXT_LENGTH    | Default context window length (e.g. 4096)        |
| OLLAMA_MAX_LOADED_MODELS | Maximum number of models to load simultaneously  |
| OLLAMA_MAX_QUEUE         | Maximum request queue size                       |
| OLLAMA_NUM_PARALLEL      | Number of parallel requests allowed              |
| OLLAMA_HOST              | Server host address (default localhost)          |
| OLLAMA_MODELS            | Custom models registry directory                 |
| CUDA_VISIBLE_DEVICES     | GPU selection (use -1 to force CPU mode)         |
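These are standard Ollama variables, so they can also be exported in the shell before starting a plain ollama serve, for instance (values are illustrative):

# keep models loaded for 2 hours and listen on all interfaces
export OLLAMA_KEEP_ALIVE=2h
export OLLAMA_HOST=0.0.0.0
ollama serve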
Instance options
To use a different instance than the default localhost:11434:
- -u, --use-instance <hostdomain>: Use a specific Ollama instance as the source. Example:
  olm models -u 192.168.1.8:11434
  This command will list the models from the Ollama instance running at 192.168.1.8 on port 11434.
- -s, --use-https: Use the HTTPS protocol to reach the Ollama instance.
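The two flags can be combined, for example to query an instance served over HTTPS (the hostname is illustrative):

olm models -u ollama.example.com -s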
Information about gguf files
Show registries info
To show information about gguf models located in the Ollama internal registries:
olm gguf
# or
olm g

This will display information about models from the Ollama model storage registries. Output:
--------- Registry hf.co/bartowski ---------
hf.co/bartowski
NousResearch_DeepHermes-3-Llama-3-8B-Preview-GGUF (1 model)
- Q6_K_L
--------- Registry ollama.com ---------
ollama.com
deepseek-coder-v2 (1 model)
- 16b-lite-instruct-q8_0
--------- Registry registry.ollama.ai ---------
registry.ollama.ai
gemma3 (3 models)
- 12b
- 27b
- 4b-it-q8_0
...

Show model info
To show information about a specific model:
olm gguf -m qwen3:0.6b

Output:
Model qwen3:0.6b found in registry registry.ollama.ai
size: 498.4 MiB
quant: Q4_K_M
blob: /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx
link: ln -s /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx qwen3_0.6b_Q4_K_M.gguf

The link command can be used to create a symlink with a regular gguf file name pointing to the blob, so the model can be used with Llamacpp and friends.
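For instance, after creating the symlink the file can be passed to a llama.cpp binary (shown here with llama-cli; the exact binary name and flags depend on your llama.cpp build):

# create the symlink in the current directory
ln -s /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx qwen3_0.6b_Q4_K_M.gguf
# run the model with llama.cpp
llama-cli -m qwen3_0.6b_Q4_K_M.gguf -p "Hello"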
Show template info
To show a model's template:
olm gguf -t qwen3:0.6b

Exfiltrate Model Blob
To exfiltrate a model blob to a gguf file:
olm gguf -x qwen3:0.6b /path/to/destination

This command will copy the model data from its original location to the specified destination, rename it to a .gguf file, and replace the original blob with a symlink pointing to the new file. Use case: moving the model to another storage location. Use at your own risk.
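Roughly, the operation corresponds to these manual steps (a sketch with illustrative paths, not the exact implementation):

# copy the blob to the destination under a regular gguf file name
cp /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx /path/to/destination/qwen3_0.6b_Q4_K_M.gguf
# replace the original blob with a symlink to the new file
ln -sf /path/to/destination/qwen3_0.6b_Q4_K_M.gguf /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx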
Copy Model Blob
To only copy a model blob without replacing the original:
olm gguf -c qwen3:0.6b /path/to/destination

This command will perform the same steps as the exfiltrate command but will not replace the original blob with a symlink.
