pi-ollama

v0.1.6

Published

4 days ago

Native Ollama provider for pi coding agent — fixes tool calling under streaming

captcanadaman

pi ollama llm coding-agent extension

pi-ollama

Native Ollama provider extension for the pi coding agent.

Talks directly to Ollama's /api/chat endpoint, bypassing the OpenAI-compat shim at /v1/chat/completions that silently drops tool_calls from streamed responses.

Why this exists

I was simply trying to use the pi agent locally with Ollama when this can of worms opened up. It seemed like a good learning opportunity and a great way to get more involved in the community. I'm new to contributing to open source and to build-in-public etiquette, so feedback is genuinely welcome. I hope you find this useful!

Pi ships with an openai-completions adapter that routes Ollama traffic through Ollama's OpenAI-compat shim. The shim has a known streaming bug: tool_calls are dropped from the streamed deltas. Without those tool calls, pi's agent loop stalls on the first tool use — the model produces a tool call, the wire eats it, pi never sees it.

Ollama's native /api/chat endpoint doesn't have this problem. This extension talks to /api/chat directly — routing around the shim, not patching it — so tool calls survive streaming and the agent loop completes through tool use, multi-turn workflows, and reasoning-heavy prompts.
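To make the native path concrete, here is a minimal sketch of a streaming /api/chat request body and of reading tool calls back out of one streamed NDJSON line. The request and chunk shapes follow Ollama's native chat API; the helper names (`buildChatRequest`, `extractToolCalls`) and the tool used to exercise them are hypothetical and are not taken from this extension's source.

```typescript
// Request and chunk shapes follow Ollama's /api/chat API; the helper names
// are hypothetical.
interface ToolDef {
  type: "function";
  function: { name: string; description: string; parameters: Record<string, unknown> };
}

interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  stream: boolean;
  tools?: ToolDef[];
}

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Body for POST <base URL>/api/chat with streaming enabled.
function buildChatRequest(model: string, messages: ChatMessage[], tools: ToolDef[]): ChatRequest {
  const body: ChatRequest = { model, messages, stream: true };
  if (tools.length > 0) body.tools = tools; // leave the field out when no tools are offered
  return body;
}

// Pull tool calls out of one NDJSON line of the streamed response.
// On the native endpoint, arguments arrive as a JSON object, not a string.
function extractToolCalls(line: string): ToolCall[] {
  const chunk = JSON.parse(line) as {
    message?: { tool_calls?: { function: { name: string; arguments: Record<string, unknown> } }[] };
  };
  return (chunk.message?.tool_calls ?? []).map((call) => ({
    name: call.function.name,
    args: call.function.arguments,
  }));
}
```

Because each streamed line is a complete JSON object, a tool call either appears whole in `message.tool_calls` or not at all, so there are no partial deltas to reassemble.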

Other Ollama extensions for pi exist (linked in Related projects below) and they're solid for chat-style use. The architectural difference here is the API path: through the shim vs. around it. If you're using local Ollama specifically for the agentic tool-call workflows pi is designed around, that distinction is the whole point of this extension.

Install

pi install npm:pi-ollama

Or for local development:

git clone https://github.com/CaptCanadaMan/pi-ollama
cd pi-ollama
npm install
pi install /absolute/path/to/pi-ollama

Requires Ollama running locally (default http://localhost:11434) and at least one tool-capable model pulled.

Uninstall

pi uninstall npm:pi-ollama

This removes the on-disk package and the entry from ~/.pi/agent/settings.json. Pi won't auto-restore it on the next launch.

The bare form pi uninstall pi-ollama doesn't work — pi parses bare names as relative local paths rather than npm packages, so the npm: prefix is required for any npm-installed extension.

If you've already manually deleted the package directory (find it with npm root -g), pi will silently reinstall it on the next launch because npm:pi-ollama is still in ~/.pi/agent/settings.json. Run the uninstall command above to clear the settings entry — the disk side is already clean.

Optional cleanup of the model discovery cache:

rm -f ~/.pi/agent/cache/pi-ollama-models.json

Quick start

After installation, launch pi and run:

/ollama-status

You should see something like:

Ollama base URL: http://localhost:11434
✓ Ollama reachable — 3 model(s) registered
  qwen2.5-coder:7b               ctx:131,072  [tools]
  gemma4:26b                     ctx:262,144  [tools, vision, reasoning]
  llama3.1:8b                    ctx:131,072  [tools]

Switch to one of the discovered models and use pi normally — tool calls work end-to-end.

Slash commands

| Command | Description |
|---|---|
| /ollama-status | Show the Ollama base URL, registered models with capability flags, and currently loaded models. |
| /ollama-refresh | Re-discover models from /api/tags + /api/show and re-register the provider. Useful after ollama pull <model>. |
| /ollama-info [model-id] | Show capability details for a model. Omit the argument to pick from a list of currently registered models. |
| /ollama-context | Set the context length (num_ctx) pi-ollama sends to /api/chat. Picker with common presets + custom input. Persists across pi launches. |

Environment variables

| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | localhost:11434 | Ollama server host[:port]. May include or omit protocol. |
| OLLAMA_CONTEXT_LENGTH | unset | Override the num_ctx pi-ollama sends to /api/chat. Matches the env var Ollama itself respects, so a single setting works across tools. Superseded by /ollama-context if used. |
| OLLAMA_NATIVE_DEBUG | unset | Set to 1 to enable per-chunk debug logging. Writes to a file (see below), not stderr, since stderr writes corrupt pi's TUI rendering. |
| OLLAMA_NATIVE_DEBUG_LOG | ~/.pi/agent/cache/pi-ollama-debug.log | Override the default debug log path. |
| OLLAMA_NATIVE_DUMP_DIR | unset | If set, writes paired req-*.json / res-*.ndjson files per request: exact replay artifacts for diagnostics. |
| OLLAMA_NATIVE_GHOST_RETRIES | 2 | Max retries when Ollama returns ghost-token responses (see Reliability below). |
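Since OLLAMA_HOST may include or omit the protocol, the value has to be normalized into a base URL before use. The sketch below shows one way that could work. `resolveBaseUrl` is a hypothetical helper, and the choice to append port 11434 only when no protocol and no port are given is an assumption, not confirmed behavior of this extension.

```typescript
// Hypothetical helper: turn an OLLAMA_HOST value into a base URL.
const DEFAULT_BASE_URL = "http://localhost:11434";

function resolveBaseUrl(hostEnv: string | undefined): string {
  // Trim whitespace and trailing slashes so paths can be appended safely.
  const raw = (hostEnv ?? "").trim().replace(/\/+$/, "");
  if (raw === "") return DEFAULT_BASE_URL;

  // Protocol given: use the value as written.
  if (/^https?:\/\//i.test(raw)) return raw;

  // Bare host or host:port: assume plain HTTP and Ollama's default port.
  // (Bare IPv6 literals without brackets are not handled by this sketch.)
  const hasPort = /:\d+$/.test(raw);
  return `http://${raw}${hasPort ? "" : ":11434"}`;
}
```

Passing the environment value in as an argument, for example `resolveBaseUrl(process.env.OLLAMA_HOST)`, keeps the function easy to test.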

Context length and memory. By default pi-ollama caps num_ctx at 32,768 tokens, even when the model's discovered context window is much larger (some models report 262,144 or more). Without the cap, Ollama would try to allocate enough memory for the full trained context, which exceeds typical hardware budgets. Users on machines with headroom for more can raise the cap via the OLLAMA_CONTEXT_LENGTH env var or /ollama-context slash command. The slash command persists across restarts; the env var is read at startup.
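The precedence described above (persisted /ollama-context value, then OLLAMA_CONTEXT_LENGTH, then the 32,768 default cap) can be sketched as a small pure function. `resolveNumCtx` and its parameter names are hypothetical. Whether the real extension clamps an explicit override to the model's window is not stated, so this sketch leaves overrides untouched.

```typescript
// Default cap taken from the text above; everything else is illustrative.
const DEFAULT_NUM_CTX_CAP = 32768;

function resolveNumCtx(
  modelContext: number,          // context window discovered via /api/show
  envValue: string | undefined,  // raw OLLAMA_CONTEXT_LENGTH value, if any
  persisted: number | undefined, // value saved by /ollama-context, if any
): number {
  // 1. A value chosen through the slash command wins.
  if (persisted !== undefined && persisted > 0) return persisted;

  // 2. Otherwise honor the environment variable when it parses to a positive integer.
  const fromEnv = Number.parseInt(envValue ?? "", 10);
  if (Number.isFinite(fromEnv) && fromEnv > 0) return fromEnv;

  // 3. Otherwise use the model's window, capped at the default.
  return Math.min(modelContext, DEFAULT_NUM_CTX_CAP);
}
```

For a model that reports a 262,144-token window and no overrides, this returns 32,768; a model with an 8,192-token window keeps 8,192.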

Live-tail the debug log from another terminal:

tail -f ~/.pi/agent/cache/pi-ollama-debug.log

How model discovery works

On extension load, the provider:

1. Reads cached models from ~/.pi/agent/cache/pi-ollama-models.json (instant startup, no network).
2. Calls GET /api/tags to list pulled models.
3. For each model, calls POST /api/show to extract:
   - Context window from model_info.*.context_length.
   - Tool support from capabilities array, falling back to family-name heuristics for older Ollama versions.
   - Vision support from capabilities or details.families containing clip.
   - Reasoning/thinking support from capabilities or model-name patterns (r1, deepseek, gemma4, etc.).
4. Caches the result for next startup.
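The /api/show extraction step can be sketched as follows. The response fields (`capabilities`, `details.families`, `model_info`) are part of Ollama's API, but `extractCaps`, the `TOOL_FAMILIES` list, and the exact regular expressions are illustrative assumptions and deliberately loose. The extension's real heuristics may differ.

```typescript
interface ShowResponse {
  capabilities?: string[];
  details?: { family?: string; families?: string[] | null };
  model_info?: Record<string, unknown>;
}

interface ModelCaps {
  contextWindow: number | undefined;
  tools: boolean;
  vision: boolean;
  reasoning: boolean;
}

// Hypothetical fallback lists for Ollama versions without a capabilities array.
const TOOL_FAMILIES = ["llama", "qwen2", "mistral"];
const REASONING_PATTERNS = [/r1/i, /deepseek/i, /gemma4/i];

function extractCaps(modelId: string, show: ShowResponse): ModelCaps {
  const caps = show.capabilities;
  const family = show.details?.family ?? "";
  const families = show.details?.families ?? [];

  // model_info keys are architecture-prefixed, e.g. "llama.context_length".
  let contextWindow: number | undefined;
  for (const [key, value] of Object.entries(show.model_info ?? {})) {
    if (key.endsWith(".context_length") && typeof value === "number") contextWindow = value;
  }

  return {
    contextWindow,
    // Trust the capabilities array when present; otherwise guess from the family name.
    tools: caps ? caps.includes("tools") : TOOL_FAMILIES.some((f) => family.startsWith(f)),
    vision: (caps ?? []).includes("vision") || families.includes("clip"),
    reasoning: (caps ?? []).includes("thinking") || REASONING_PATTERNS.some((p) => p.test(modelId)),
  };
}
```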

If Ollama is unreachable at startup, the cached list is used as a fallback. Run /ollama-refresh once it's available to re-discover.

Thinking control

For thinking-capable models, the provider forwards pi's thinking level to Ollama's think request field: any level set in pi sends think: true; thinking off sends an explicit think: false. The explicit false is load-bearing: Ollama defaults thinking-capable models (the gemma4 family included) to thinking on when the field is omitted. Before this mapping existed, turning thinking off in pi had no effect on the wire, and every turn paid the hidden reasoning-token cost (measured at roughly 8× the generated tokens on a short gemma4:12b answer). Models without thinking support never get the field.
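The three cases in that mapping (true, explicit false, field omitted) fit in a few lines. This is a sketch only: `thinkField` and the `ThinkingLevel` names are hypothetical stand-ins for whatever pi actually passes to the provider.

```typescript
// Hypothetical level names; pi's real thinking levels may differ.
type ThinkingLevel = "off" | "low" | "medium" | "high";

// Returns the fragment to merge into the /api/chat request body.
function thinkField(supportsThinking: boolean, level: ThinkingLevel): { think?: boolean } {
  // Models without thinking support never get the field.
  if (!supportsThinking) return {};
  // "off" must be sent as an explicit false, because an omitted field
  // leaves thinking-capable models with thinking on.
  return { think: level !== "off" };
}
```

The fragment can be spread into the request body, for example `{ model, messages, stream: true, ...thinkField(caps.reasoning, level) }`, where `caps` and `level` are whatever the caller has on hand.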

Reliability features

Ollama's streaming has a few known edge cases. The provider handles them explicitly rather than letting them surface as silent stalls:

Ghost-token retry. Ollama occasionally generates output tokens but streams nothing visible (done:true, eval_count > 0, empty message). The provider reads the first NDJSON line of each attempt, detects this pattern, cancels the connection, and retries, up to OLLAMA_NATIVE_GHOST_RETRIES times (default 2, which yields roughly 99% success at typical failure rates).
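A minimal sketch of the ghost check on the first NDJSON line, plus the arithmetic behind the retry count. The chunk fields (`done`, `eval_count`, `message`) are Ollama's; the function names are hypothetical, and the assumption that attempts fail independently is mine, not the author's.

```typescript
interface FirstChunk {
  done?: boolean;
  eval_count?: number;
  message?: { content?: string; thinking?: string; tool_calls?: unknown[] };
}

// True when the first NDJSON line already reports a finished turn with
// generated tokens but nothing visible attached to it.
function isGhostFirstLine(line: string): boolean {
  const chunk = JSON.parse(line) as FirstChunk;
  const msg = chunk.message;
  const nothingVisible =
    !msg?.content && !msg?.thinking && (msg?.tool_calls ?? []).length === 0;
  return chunk.done === true && (chunk.eval_count ?? 0) > 0 && nothingVisible;
}

// Chance that at least one of (1 + retries) attempts is not a ghost,
// assuming each attempt ghosts independently with probability ghostRate.
function successProbability(ghostRate: number, retries: number): number {
  return 1 - Math.pow(ghostRate, retries + 1);
}
```

With the default of 2 retries and an assumed (not measured) 20% per-attempt ghost rate, `successProbability(0.2, 2)` comes out at about 0.992, which is consistent with the roughly 99% figure quoted above.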

Truncation detection. If the connection closes before any chunk with done:true arrives, the provider surfaces a clear error rather than silently treating the partial response as complete. The error explains this is an Ollama-side reliability issue and prompts a retry.

Empty-response detection. If the connection closes without sending any chunks at all, the provider raises a distinct error pointing at the most likely causes (model failed to load, Ollama crashed, network issue).

Post-stream ghost check. Belt-and-suspenders: if eval_count > 0 but no content, thinking, or tool calls landed in the parsed stream, the provider raises an error rather than reporting a successful empty turn.
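The three end-of-stream checks above (empty response, truncation, post-stream ghost) can be read as one classification over counters gathered while parsing. The sketch below assumes a hypothetical `StreamStats` record and outcome labels; the extension's actual error types and messages are not shown in this README.

```typescript
// Hypothetical counters collected while reading the NDJSON stream.
interface StreamStats {
  chunksSeen: number;    // NDJSON lines parsed before the connection closed
  sawDone: boolean;      // whether any chunk carried done:true
  evalCount: number;     // eval_count reported by Ollama (0 if never reported)
  contentChars: number;  // visible text received
  thinkingChars: number; // thinking text received
  toolCalls: number;     // tool calls received
}

type StreamOutcome = "ok" | "empty-response" | "truncated" | "ghost";

function classifyStreamEnd(s: StreamStats): StreamOutcome {
  // Nothing arrived at all: model failed to load, Ollama crashed, network issue.
  if (s.chunksSeen === 0) return "empty-response";
  // The connection closed before done:true, so the response is partial.
  if (!s.sawDone) return "truncated";
  // Tokens were generated but nothing landed in the parsed stream.
  const nothingLanded = s.contentChars === 0 && s.thinkingChars === 0 && s.toolCalls === 0;
  if (s.evalCount > 0 && nothingLanded) return "ghost";
  return "ok";
}
```

Ordering matters here: the empty-response check runs first so that a connection with zero chunks is not reported as a truncation.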

Swallowed-tool-call detection. Ollama can buffer a tool call server-side, fail to parse it, and end the turn with no tool_calls on the wire: the model announces an action and then nothing happens (issue #3). The guard detects a large gap between generated and streamed tokens (generated ≫ streamed) and raises a retryable error instead of completing silently. It stands down on batched streams (Ollama Cloud emits ~30 tokens per NDJSON chunk vs ~1 locally), where the ratio heuristic previously produced false positives on healthy cloud turns (issue #4).
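One way such a guard could be structured is sketched below. Everything specific here is an assumption made for illustration: the `TurnStats` fields, the use of chunk count as a proxy for streamed tokens, the characters-per-chunk test for batched streams, and both thresholds. The extension's actual heuristic is not described in enough detail to reproduce.

```typescript
// Hypothetical per-turn counters.
interface TurnStats {
  evalCount: number;     // tokens Ollama reports generating
  contentChunks: number; // NDJSON chunks that carried text or thinking
  contentChars: number;  // total characters across those chunks
  toolCalls: number;     // tool calls that reached the wire
}

// Illustrative thresholds, not values taken from the extension.
const GAP_RATIO = 4;                // generated tokens per streamed chunk that counts as a gap
const BATCHED_CHARS_PER_CHUNK = 40; // above this, treat the stream as batched

function looksSwallowed(t: TurnStats): boolean {
  // A tool call arrived, or nothing streamed at all (the ghost check covers that case).
  if (t.toolCalls > 0 || t.contentChunks === 0) return false;

  // Batched streams pack many tokens into each chunk, so the chunk count
  // badly underestimates streamed tokens. Stand down instead of guessing.
  const charsPerChunk = t.contentChars / t.contentChunks;
  if (charsPerChunk >= BATCHED_CHARS_PER_CHUNK) return false;

  // Local streams carry about one token per chunk, so far more generated
  // tokens than chunks suggests output was withheld server-side.
  return t.evalCount >= GAP_RATIO * t.contentChunks;
}
```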

Vision

For vision-capable models, images pass through from both user messages and tool results as base64 images arrays on the wire — a tool that returns a camera frame or screenshot reaches the model directly on its tool message (verified against gemma4). Models without vision never receive image data.
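As a rough illustration of the image pass-through, the sketch below converts a list of content parts into an Ollama message with an `images` array. The `images` field holding base64 strings is Ollama's wire format; the `Part` shape and `toOllamaMessage` are hypothetical and do not reflect pi's internal message types.

```typescript
// Hypothetical content parts; pi's internal message format may differ.
type Part =
  | { type: "text"; text: string }
  | { type: "image"; data: string }; // base64 payload, no data: URL prefix

interface OllamaMessage {
  role: "user" | "tool";
  content: string;
  images?: string[];
}

function toOllamaMessage(role: "user" | "tool", parts: Part[], supportsVision: boolean): OllamaMessage {
  const texts: string[] = [];
  const images: string[] = [];
  for (const part of parts) {
    if (part.type === "text") texts.push(part.text);
    // Image data is dropped entirely for models without vision.
    else if (supportsVision) images.push(part.data);
  }
  const message: OllamaMessage = { role, content: texts.join("\n") };
  if (images.length > 0) message.images = images;
  return message;
}
```

The same function serves both user messages and tool results, which mirrors the behavior described above: a screenshot returned by a tool stays attached to its tool message.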

Compatibility

pi: Tested against @earendil-works/pi-coding-agent v0.75.x. Should work with any version exposing the standard ExtensionAPI (registerProvider with streamSimple, registerCommand with ctx.ui.notify).
Ollama: Requires Ollama with /api/chat support (most versions). /api/ps is used opportunistically and tolerates older versions that don't expose it.
Node: Requires Node 22.19+ (matches pi-coding-agent 0.75.0's minimum).

Architecture (one paragraph)

The extension registers an ollama provider with a custom streamSimple handler. Pi calls streamSimple(model, context, options) for every turn; the handler converts pi's internal message format to Ollama's /api/chat wire format, opens an NDJSON stream, parses chunks into pi's AssistantMessageEventStream events (text deltas, thinking deltas, tool-call bursts, done), and surfaces errors with explanatory messages. No core pi changes required — streamSimple fully replaces the built-in handler for the registered API string.
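The chunk-parsing step of that pipeline might look like the sketch below: buffer incoming data, split on newlines, and map each complete NDJSON line to events. The event names and the `NdjsonEventParser` class are hypothetical. They stand in for pi's AssistantMessageEventStream events, whose exact shapes are not given here.

```typescript
// Hypothetical event shapes, not pi's real event types.
type StreamEvent =
  | { type: "text"; delta: string }
  | { type: "thinking"; delta: string }
  | { type: "tool_call"; name: string; args: Record<string, unknown> }
  | { type: "done" };

interface NativeChunk {
  done?: boolean;
  message?: {
    content?: string;
    thinking?: string;
    tool_calls?: { function: { name: string; arguments: Record<string, unknown> } }[];
  };
}

class NdjsonEventParser {
  private buffer = "";

  // Feed decoded text as it arrives; returns events for every complete line.
  push(data: string): StreamEvent[] {
    this.buffer += data;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for later

    const events: StreamEvent[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      const chunk = JSON.parse(line) as NativeChunk;
      const msg = chunk.message;
      if (msg?.thinking) events.push({ type: "thinking", delta: msg.thinking });
      if (msg?.content) events.push({ type: "text", delta: msg.content });
      for (const call of msg?.tool_calls ?? []) {
        events.push({ type: "tool_call", name: call.function.name, args: call.function.arguments });
      }
      if (chunk.done) events.push({ type: "done" });
    }
    return events;
  }
}
```

Holding back the last, possibly incomplete line matters because network reads do not align with line boundaries, so a JSON object can be split across two reads.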

See src/ for the implementation. Each file has a header comment explaining its role.

Limitations / not yet implemented

Ollama Cloud (https://ollama.com). This extension targets local Ollama. Cloud requires different auth (OLLAMA_API_KEY) and a different base URL — see fgrehm/pi-ollama-cloud if you want cloud-only.
Per-model temperature / top_p defaults. Sampling parameters are passed through from pi's options when set, but there's no extension-level config for default values per model. Open an issue if you need this.
Auto-pull. If you select a model that isn't pulled, you'll get an error from Ollama. The extension doesn't offer to ollama pull it for you.

Related projects

pi-mono — the pi coding agent itself
ollama#12557 — the upstream tool-calling streaming bug this extension routes around
pi-mono#3357 — the open issue requesting an official local-LLM extension
@0xkobold/pi-ollama — alternative extension covering local + cloud via the OpenAI-compat shim
fgrehm/pi-ollama-cloud — cloud-only Ollama extension
