@troed/oc-ls-stats

v1.1.0

Published

3 hours ago

A TUI plugin for OpenCode that displays live prefill rate (PP) and generation rate (TG) from a llama.cpp-based llama-server.

troed

oc-ls-stats

There are many plugins that show tokens per second, but I wrote this one because when running a local model I often did not know what the server was currently doing, which meant constantly switching over to a console where I could see its output. This was especially true during prefill (prompt processing), which can take more than a minute and gave no feedback in the UI with the other plugins I tried.

Another motivation was to display the data of interest, but in a non-intrusive way with no UI elements jumping around. This plugin thus only shows a single numeric value, tokens per second, with an indicator as to whether the model is currently doing prompt processing or inference (token generation).

I'm using the llama-server /slots endpoint to get the needed data, which means if you connect opencode to another provider the plugin will just display "-" since it's not getting any data to display.

Note: As explained in further detail below, the data that's displayed has to be deduced from llama-server's output. Sometimes the plugin might display PP for prompt processing while in reality the model is doing TG. If llama-server's output data is extended in the future the plugin might be able to tell the two apart more reliably, but I think this is as good as it gets for now.

I made this for my own usage. If you find it useful as well I'm just happy.

/Troed

Thanks to Tarquinen for their oc-tps, which I used as a base, although I guess most of the code has now been replaced.

Installation

opencode plugin @troed/oc-ls-stats@latest --global

Requires opencode 1.3.14 or newer.

TUI plugins are loaded from ~/.config/opencode/tui.json, which after installation should look like this:

{
  "plugin": ["@troed/oc-ls-stats@latest"]
}

Display Format

The plugin renders a single line in the session prompt right slot:

1247 tps (PP)    -- during prefill
  25 tps (TG)    -- during generation
   - tps (TG)    -- idle
          n/a    -- unable to reach llama-server
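
As an illustration of the fixed-width layout above, the formatting could be sketched like this. The Status type, the WIDTH constant and the formatStatus function are hypothetical names chosen for this example and are not taken from the plugin source; the sketch only assumes that every line is padded to the same 13 characters so nothing in the UI shifts.

```typescript
// Hypothetical status type; names are illustrative, not from the plugin source.
type Status =
  | { kind: "prefill"; tps: number }
  | { kind: "generation"; tps: number }
  | { kind: "idle" }
  | { kind: "unreachable" };

// Width of a full line such as "1247 tps (PP)".
const WIDTH = 13;

function formatStatus(status: Status): string {
  switch (status.kind) {
    case "prefill":
      // Right-align the number in a 4-character column.
      return `${Math.round(status.tps).toString().padStart(4)} tps (PP)`;
    case "generation":
      return `${Math.round(status.tps).toString().padStart(4)} tps (TG)`;
    case "idle":
      // No rate available: show a dash in the number column.
      return `${"-".padStart(4)} tps (TG)`;
    case "unreachable":
      // Right-align "n/a" to the same total width as the other lines.
      return "n/a".padStart(WIDTH);
  }
}
```

Because every branch returns a string of the same length, the right slot keeps a constant width regardless of state.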

Detection and Calculation

Server Discovery

The plugin discovers the llama-server URL by reading the OpenCode configuration via the TUI API and extracting the baseURL/base_url fields from the options of every provider whose name contains "llama", excluding providers whose name contains "ollama". It falls back to http://localhost:8080 if no matching provider is found.
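
A minimal sketch of that filtering rule follows. The ConfigLike shape (a provider map keyed by name, each with an options object) is an assumption made for this example, as are the function name discoverServers and the case-insensitive name match; the real configuration is read through the TUI API, which is not shown here.

```typescript
// Assumed, simplified shape of the relevant configuration (hypothetical).
interface ProviderConfig {
  options?: { baseURL?: string; base_url?: string };
}
interface ConfigLike {
  provider?: Record<string, ProviderConfig>;
}

const DEFAULT_SERVER = "http://localhost:8080";

function discoverServers(config: ConfigLike): string[] {
  const urls: string[] = [];
  for (const [name, provider] of Object.entries(config.provider ?? {})) {
    const lower = name.toLowerCase(); // case-insensitive match is an assumption
    // "ollama" also contains "llama", so it has to be excluded explicitly.
    if (!lower.includes("llama") || lower.includes("ollama")) continue;
    const url = provider.options?.baseURL ?? provider.options?.base_url;
    if (url && !urls.includes(url)) urls.push(url);
  }
  // Fall back to the default address when nothing matched.
  return urls.length > 0 ? urls : [DEFAULT_SERVER];
}
```

For example, a configuration with one provider named "llama.cpp" and one named "ollama" would yield only the first provider's URL.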

Slot Polling

Every 500ms, the plugin polls GET /slots?model=<model> on each discovered server. The model parameter is required by the /slots endpoint. If the model cannot be discovered from the current route's session, the plugin skips polling.
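
The polling step could look roughly like the sketch below. The helper names slotsUrl and pollSlots are hypothetical, and the response is treated as an opaque array because the exact slot schema is not reproduced here. The sketch assumes a runtime with a global fetch.

```typescript
// Build the /slots URL for one server; trailing slashes on the base are dropped.
function slotsUrl(baseUrl: string, model: string): string {
  const base = baseUrl.replace(/\/+$/, "");
  return `${base}/slots?model=${encodeURIComponent(model)}`;
}

// Poll one server once. Returns null when polling is skipped or fails.
async function pollSlots(
  baseUrl: string,
  model: string | undefined,
): Promise<unknown[] | null> {
  if (!model) return null; // model unknown: skip this poll
  try {
    const res = await fetch(slotsUrl(baseUrl, model));
    if (!res.ok) return null;
    const body: unknown = await res.json();
    return Array.isArray(body) ? body : null;
  } catch {
    return null; // server unreachable
  }
}

// Usage (not started here): poll every 500 ms.
// setInterval(() => void pollSlots("http://localhost:8080", currentModel), 500);
```

Returning null for both "skipped" and "failed" keeps the caller simple; a real implementation may want to distinguish the two so it can show n/a only when the server is unreachable.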

State Classification

Each slot is classified as prefill or generation based on the n_decoded counter in next_token[0]. The plugin tracks a per-slot baseline value:

When a slot first appears as processing, the current n_decoded is recorded as the baseline with hasIncreased = false.
If n_decoded <= baseline and hasIncreased is false, the slot is classified as prefilling.
If n_decoded > baseline, hasIncreased is set to true and the slot is classified as generating.

This approach handles the case where n_decoded drops when a new request starts on a reused slot, and prevents generation stalls (where n_decoded plateaus) from being misclassified as prefill.

When no slots are processing, all tracked state for those slots is cleared.
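
The three rules above can be sketched as a small state machine. This is only an illustration of the stated rules: the names SlotBaseline, tracked, classify and clearTracked are hypothetical, and any extra handling the plugin applies when n_decoded drops on a reused slot is not reproduced here.

```typescript
interface SlotBaseline {
  baseline: number;      // n_decoded when the slot was first seen processing
  hasIncreased: boolean; // true once n_decoded has exceeded the baseline
}

type Phase = "prefill" | "generation";

const tracked = new Map<number, SlotBaseline>();

function classify(slotId: number, nDecoded: number): Phase {
  let state = tracked.get(slotId);
  if (!state) {
    // First time this slot is seen processing: record the baseline.
    state = { baseline: nDecoded, hasIncreased: false };
    tracked.set(slotId, state);
  }
  if (nDecoded > state.baseline) state.hasIncreased = true;
  // Once n_decoded has grown, a plateau still counts as generation.
  return state.hasIncreased ? "generation" : "prefill";
}

// Called when no slots are processing.
function clearTracked(): void {
  tracked.clear();
}
```

With this sketch, a slot reporting n_decoded values 10, 10, 11, 11 would be classified as prefill, prefill, generation, generation.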

Prefill Rate (PP)

During prefill, the plugin calculates the instantaneous prompt processing rate:

On first detection of a prefill slot, the current n_prompt_tokens is captured as the baseline.
On subsequent polls, the delta in n_prompt_tokens is divided by the elapsed time in seconds.
The rate is updated only when both dt > 0 and delta > 0.

The per-slot n_prompt_tokens field is used instead of the global llamacpp:prompt_tokens_total from /metrics because the global counter includes tokens from all slots, producing inflated values when multiple slots are active simultaneously.
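
A sketch of the rate calculation described above, under one assumption that the text leaves open: the delta is measured against the previous sample that produced a rate, not against the very first baseline. RateTracker and its members are hypothetical names.

```typescript
class RateTracker {
  private last: { value: number; timeMs: number } | undefined;
  private rate: number | undefined;

  // value: current n_prompt_tokens for the slot; timeMs: poll timestamp in ms.
  update(value: number, timeMs: number): number | undefined {
    if (this.last === undefined) {
      // First detection: capture the baseline, no rate yet.
      this.last = { value, timeMs };
      return this.rate;
    }
    const dt = (timeMs - this.last.timeMs) / 1000; // seconds
    const delta = value - this.last.value;         // tokens
    if (dt > 0 && delta > 0) {
      this.rate = delta / dt;
      this.last = { value, timeMs };
    }
    // Otherwise keep showing the previously computed rate.
    return this.rate;
  }
}
```

For example, going from 0 to 600 prompt tokens between two polls 500 ms apart gives 600 / 0.5 = 1200 tokens per second. A poll with no new tokens leaves the displayed rate unchanged.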

Generation Rate (TG)

During generation, the plugin calculates the instantaneous token generation rate:

On first detection of a generation slot, the current n_decoded is captured as the baseline.
On subsequent polls (same slot ID), the delta in n_decoded is divided by the elapsed time in seconds.
The rate is updated only when both dt > 0 and delta > 0.

Slot reuse is tracked via generateSlotId to detect when a new generation starts on a different slot.
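
The slot-ID check could be sketched as follows. Only the idea of re-baselining when the slot ID changes comes from the description above; the names GenState and updateGeneration are hypothetical, and keeping the old rate on screen until a new one is available is an assumption of this example.

```typescript
interface GenState {
  slotId: number;
  nDecoded: number;
  timeMs: number;
}

let gen: GenState | undefined;
let tgRate: number | undefined;

function updateGeneration(
  slotId: number,
  nDecoded: number,
  timeMs: number,
): number | undefined {
  const prev = gen;
  if (prev === undefined || prev.slotId !== slotId) {
    // First generation slot, or generation moved to another slot: new baseline.
    gen = { slotId, nDecoded, timeMs };
    return tgRate;
  }
  const dt = (timeMs - prev.timeMs) / 1000; // seconds
  const delta = nDecoded - prev.nDecoded;   // tokens
  if (dt > 0 && delta > 0) {
    tgRate = delta / dt;
    gen = { slotId, nDecoded, timeMs };
  }
  return tgRate;
}
```

Without the slot-ID comparison, the n_decoded counters of two different slots would be subtracted from each other and produce a meaningless rate.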

Limitations

Progress Percentage

The plugin cannot display prefill progress percentage. The /slots endpoint returns n_prompt_tokens (current prompt size) and n_prompt_tokens_processed (tokens processed), but not the final prompt size (task->n_tokens() from llama.cpp). Progress requires the ratio n_prompt_tokens_processed / task->n_tokens().

What Would Improve Compatibility

The following changes to the /slots endpoint would improve the plugin's functionality:

Expose final prompt size: Add n_prompt_tokens_total (or n_tokens) to the /slots output, representing task->n_tokens() from llama.cpp. This would enable prefill progress percentage calculation as (n_prompt_tokens_processed / n_prompt_tokens_total) * 100.
Per-slot metrics endpoints: Currently, the /metrics endpoint provides only global counters (llamacpp:prompt_tokens_total, llamacpp:prompt_tokens_seconds). Per-slot metrics would allow independent rate tracking without relying on slot state classification.
Slot transition notifications: The plugin polls every 500ms to detect state transitions. A WebSocket or SSE-based notification system for slot state changes would reduce polling overhead and improve detection latency.
Stall detection: When generation stalls (e.g., due to context window limits), n_decoded remains constant while n_remain stops decreasing. The plugin detects this via zero delta but has no way to distinguish a stall from normal generation. An explicit stalled flag in the slot output would help.
Model-agnostic slot data: The /slots endpoint requires a model parameter. Returning all slots without model filtering, or supporting * as a wildcard, would simplify discovery when multiple models are loaded.

Source code repo

For known issues, posting new ones, forking or contributing:

https://codeberg.org/troed/oc-ls-stats

Debug Logging

Debug logging is controlled by the DEBUG_ENABLED constant in tui.tsx. When enabled, full slot state data is written to /tmp/oc-ls-stats-debug.log on every poll.

License

Creative Commons Zero (CC0 1.0 Universal)