@troed/oc-ls-stats
v1.1.0
oc-ls-stats
A TUI plugin for OpenCode that displays live prefill rate (PP) and generation rate (TG) from a llama.cpp-based llama-server.
There are many plugins that show tokens per second, but I wrote this one because, when running a local model, I often did not know what the server was currently doing and kept switching over to a console where I could see its output. This was especially true during prefill (prompt processing), which can take more than a minute with no feedback in the UI from the other plugins I tried.
Another motivation was to display the data of interest, but in a non-intrusive way with no UI elements jumping around. This plugin thus only shows a single numeric value, tokens per second, with an indicator as to whether the model is currently doing prompt processing or inference (token generation).
I'm using the llama-server /slots endpoint to get the needed data, which means if you connect opencode to another provider the plugin will just display "-" since it's not getting any data to display.
Note: As explained in further detail below, the data that's displayed has to be deduced from llama-server's output. Sometimes the plugin might display PP for prompt processing while in reality the model is doing TG. If llama-server's output data is extended in the future, the plugin might be able to tell the two apart more reliably, but I think this is as good as it gets for now.
I made this for my own usage. If you find it useful as well I'm just happy.
/Troed
thanks to Tarquinen for their oc-tps, which I used as a base although I guess most of the code has now been replaced
Installation
opencode plugin @troed/oc-ls-stats@latest --global
Requires opencode 1.3.14 or newer.
TUI plugins are loaded from ~/.config/opencode/tui.json, which after installation should look like this:
{
  "plugin": ["@troed/oc-ls-stats@latest"]
}
Display Format
The plugin renders a single line in the session prompt right slot:
1247 tps (PP) -- during prefill
25 tps (TG) -- during generation
- tps (TG) -- idle
n/a -- unable to reach llama-server
Detection and Calculation
Server Discovery
The plugin discovers the llama-server URL by reading the OpenCode configuration via the TUI API and extracting the baseURL/base_url fields from the options of every provider whose name contains "llama", excluding providers whose name contains "ollama". It falls back to http://localhost:8080 if no matching provider is found.
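The discovery rule can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the `ProviderConfig` shape and the `discoverServerUrls` name are hypothetical, and matching case-insensitively against the provider key when no name is set is an assumption.

```typescript
// Hypothetical, simplified shape of one provider entry in the OpenCode config.
interface ProviderConfig {
  name?: string;
  options?: { baseURL?: string; base_url?: string };
}

const FALLBACK_URL = "http://localhost:8080";

// Collect base URLs from providers whose name contains "llama" but not "ollama".
function discoverServerUrls(providers: Record<string, ProviderConfig>): string[] {
  const urls: string[] = [];
  for (const [key, provider] of Object.entries(providers)) {
    // Assumption: fall back to the config key when the provider has no name.
    const name = (provider.name ?? key).toLowerCase();
    if (!name.includes("llama") || name.includes("ollama")) continue;
    const url = provider.options?.baseURL ?? provider.options?.base_url;
    if (url && !urls.includes(url)) urls.push(url);
  }
  // No matching provider: use the default llama-server address.
  return urls.length > 0 ? urls : [FALLBACK_URL];
}
```

Note that "ollama" itself contains "llama", which is why the exclusion has to be checked explicitly.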
Slot Polling
Every 500ms, the plugin polls GET /slots?model=<model> on each discovered server. The model parameter is required by the /slots endpoint. If the model cannot be discovered from the current route's session, the plugin skips polling.
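A rough sketch of such a polling loop is shown below. The function names and the injected `JsonFetcher` type are hypothetical and chosen for illustration only; the plugin's own HTTP handling may differ.

```typescript
// Hypothetical fetcher type so the sketch does not depend on a specific HTTP client.
type JsonFetcher = (url: string) => Promise<unknown>;

const POLL_INTERVAL_MS = 500;

// Build the /slots URL for one server; the model name is URL-encoded.
function slotsUrl(baseUrl: string, model: string): string {
  return `${baseUrl}/slots?model=${encodeURIComponent(model)}`;
}

// Poll every server once. Servers that cannot be reached yield null.
async function pollOnce(
  servers: string[],
  model: string | undefined,
  fetchJson: JsonFetcher,
): Promise<unknown[]> {
  if (!model) return []; // no model known: skip this poll entirely
  return Promise.all(
    servers.map((server) => fetchJson(slotsUrl(server, model)).catch(() => null)),
  );
}

// Start the polling loop; call the returned function to stop it.
function startPolling(
  servers: string[],
  getModel: () => string | undefined,
  fetchJson: JsonFetcher,
  onResult: (results: unknown[]) => void,
): () => void {
  const timer = setInterval(() => {
    void pollOnce(servers, getModel(), fetchJson).then(onResult);
  }, POLL_INTERVAL_MS);
  return () => clearInterval(timer);
}
```

In a runtime with a global `fetch`, the fetcher could be as simple as `(url) => fetch(url).then((r) => r.json())`.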
State Classification
Each slot is classified as prefill or generation based on the n_decoded counter in next_token[0]. The plugin tracks a per-slot baseline value:
- When a slot first appears as processing, the current n_decoded is recorded as the baseline with hasIncreased = false.
- If n_decoded <= baseline and hasIncreased is false, the slot is classified as prefilling.
- If n_decoded > baseline, hasIncreased is set to true and the slot is classified as generating.
This approach handles the case where n_decoded drops when a new request starts on a reused slot, and prevents generation stalls (where n_decoded plateaus) from being misclassified as prefill.
When no slots are processing, all tracked state for those slots is cleared.
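The classification rules can be sketched like this. The names `classifySlot`, `clearTracking` and `SlotBaseline` are hypothetical. Treating a slot as still generating once hasIncreased is true, even if n_decoded no longer exceeds the baseline, is my reading of the plateau case described above rather than something taken from the plugin source.

```typescript
type SlotPhase = "prefill" | "generation";

interface SlotBaseline {
  baseline: number;      // n_decoded when the slot was first seen processing
  hasIncreased: boolean; // true once n_decoded has exceeded the baseline
}

// Per-slot tracking state, keyed by slot ID.
const tracked = new Map<number, SlotBaseline>();

// Classify one processing slot from its current n_decoded value.
function classifySlot(slotId: number, nDecoded: number): SlotPhase {
  let state = tracked.get(slotId);
  if (!state) {
    // First time this slot is seen processing: record the baseline.
    state = { baseline: nDecoded, hasIncreased: false };
    tracked.set(slotId, state);
  }
  if (nDecoded > state.baseline) state.hasIncreased = true;
  // Assumption: once generation has been observed, a plateau keeps the slot in "generation".
  return state.hasIncreased ? "generation" : "prefill";
}

// Call when no slots are processing.
function clearTracking(): void {
  tracked.clear();
}
```

With a baseline of 10, a later reading of 4 (a reused slot whose counter dropped) is still classified as prefill, and the slot only switches to generation once n_decoded exceeds 10.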
Prefill Rate (PP)
During prefill, the plugin calculates the instantaneous prompt processing rate:
- On first detection of a prefill slot, the current n_prompt_tokens is captured as the baseline.
- On subsequent polls, the delta in n_prompt_tokens is divided by the elapsed time in seconds.
- The rate is updated only when both dt > 0 and delta > 0.
The per-slot n_prompt_tokens field is used instead of the global llamacpp:prompt_tokens_total from /metrics because the global counter includes tokens from all slots, producing inflated values when multiple slots are active simultaneously.
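As a sketch, the rate update could look like the following. The `RateState` shape and the `updateRate` name are hypothetical. Moving the baseline forward after every successful update (so the value reflects the most recent poll interval) is an assumption about what "instantaneous" means here.

```typescript
interface RateSample {
  counter: number; // counter value (for example n_prompt_tokens) at the baseline
  timeMs: number;  // timestamp of the baseline, in milliseconds
}

interface RateState {
  sample?: RateSample; // undefined until the slot is first detected
  rate?: number;       // last computed rate in tokens per second
}

// Returns the next state; the rate only changes when dt > 0 and delta > 0.
function updateRate(state: RateState, counter: number, nowMs: number): RateState {
  if (state.sample === undefined) {
    // First detection: capture the baseline, no rate yet.
    return { sample: { counter, timeMs: nowMs }, rate: state.rate };
  }
  const dt = (nowMs - state.sample.timeMs) / 1000; // elapsed seconds
  const delta = counter - state.sample.counter;    // tokens since baseline
  if (dt > 0 && delta > 0) {
    // Assumption: the baseline advances to the current poll after each update.
    return { sample: { counter, timeMs: nowMs }, rate: delta / dt };
  }
  return state; // keep the previous rate and baseline
}
```

For example, a counter going from 1000 to 1600 over 500 ms gives 600 / 0.5 = 1200 tokens per second.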
Generation Rate (TG)
During generation, the plugin calculates the instantaneous token generation rate:
- On first detection of a generation slot, the current n_decoded is captured as the baseline.
- On subsequent polls (same slot ID), the delta in n_decoded is divided by the elapsed time in seconds.
- The rate is updated only when both dt > 0 and delta > 0.
Slot reuse is tracked via generateSlotId to detect when a new generation starts on a different slot.
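The slot ID check can be sketched as follows. Only the generateSlotId name comes from the description above; the `GenerationTracker` shape and function names are hypothetical. Keeping the last displayed rate until the new slot produces its first delta is an assumption.

```typescript
interface GenerationTracker {
  generateSlotId: number | null; // slot the current baseline belongs to
  baseline: number;              // n_decoded at the last baseline
  baselineMs: number;            // time of the last baseline, in milliseconds
  rate: number | null;           // last computed tokens per second
}

function newTracker(): GenerationTracker {
  return { generateSlotId: null, baseline: 0, baselineMs: 0, rate: null };
}

// Feed one poll result for the slot that is currently generating.
function onGenerationPoll(
  t: GenerationTracker,
  slotId: number,
  nDecoded: number,
  nowMs: number,
): void {
  if (t.generateSlotId !== slotId) {
    // Generation moved to a different slot: take a fresh baseline
    // instead of comparing counters that belong to different slots.
    t.generateSlotId = slotId;
    t.baseline = nDecoded;
    t.baselineMs = nowMs;
    return;
  }
  const dt = (nowMs - t.baselineMs) / 1000;
  const delta = nDecoded - t.baseline;
  if (dt > 0 && delta > 0) {
    t.rate = delta / dt;
    t.baseline = nDecoded;
    t.baselineMs = nowMs;
  }
}
```

Without the slot ID check, switching from a slot with n_decoded = 112 to one with n_decoded = 5 would produce a negative delta and the rate would stay frozen until the new counter overtook the old one.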
Limitations
Progress Percentage
The plugin cannot display prefill progress percentage. The /slots endpoint returns n_prompt_tokens (current prompt size) and n_prompt_tokens_processed (tokens processed), but not the final prompt size (task->n_tokens() from llama.cpp). Progress requires the ratio n_prompt_tokens_processed / task->n_tokens().
What Would Improve Compatibility
The following changes to the /slots endpoint would improve the plugin's functionality:
- Expose final prompt size: Add n_prompt_tokens_total (or n_tokens) to the /slots output, representing task->n_tokens() from llama.cpp. This would enable prefill progress percentage calculation as (n_prompt_tokens_processed / n_prompt_tokens_total) * 100.
- Per-slot metrics endpoints: Currently, the /metrics endpoint provides only global counters (llamacpp:prompt_tokens_total, llamacpp:prompt_tokens_seconds). Per-slot metrics would allow independent rate tracking without relying on slot state classification.
- Slot transition notifications: The plugin polls every 500ms to detect state transitions. A WebSocket or SSE-based notification system for slot state changes would reduce polling overhead and improve detection latency.
- Stall detection: When generation stalls (e.g., due to context window limits), n_decoded remains constant while n_remain stops decreasing. The plugin detects this via zero delta but has no way to distinguish a stall from normal generation. An explicit stalled flag in the slot output would help.
- Model-agnostic slot data: The /slots endpoint requires a model parameter. Returning all slots without model filtering, or supporting * as a wildcard, would simplify discovery when multiple models are loaded.
Source code repo
For known issues, posting new ones, forking or contributing:
https://codeberg.org/troed/oc-ls-stats
Debug Logging
Debug logging is controlled by the DEBUG_ENABLED constant in tui.tsx. When enabled, full slot state data is written to /tmp/oc-ls-stats-debug.log on every poll.
License
Creative Commons Zero (CC0 1.0 Universal)
