pi-minimax-m3-caching-fix

v0.2.0

Published

10 days ago

MiniMax-M3 on the OpenAI-compatible endpoint with passive caching. Wraps the built-in openai-completions streamSimple driver to clean duplicated and inline <think>…</think> thinking in flight, mirroring the upstream skipThinkingBlock compat flag for pi-ai


frugally3683

pi-package minimax minimax-m3 caching

pi-minimax-m3

A standalone pi extension that fixes two issues with the built-in MiniMax-M3 integration:

Silent over-billing on the Anthropic-compatible endpoint. M3's /anthropic/v1/messages endpoint ignores cache_control markers, so every turn was billed at the full input price ($0.60/Mtok) instead of the cache-read price ($0.12/Mtok). M3 does support passive/automatic prompt caching on its OpenAI-compatible endpoint (/v1/chat/completions).
Duplicated thinking in the response. M3 emits thinking content twice: once in reasoning_content (consumed by pi as a thinking block) and once in content wrapped in <think>…</think> markers (which would otherwise appear inside the visible text).

This extension registers two new providers — minimax-m3-clean and minimax-cn-m3-clean — that route MiniMax-M3 to the OpenAI-compatible endpoint so passive caching works. The thinking cleanup is performed during the stream by wrapping the built-in openai-completions streamSimple driver and rewriting the event stream in flight: duplicated thinking from M3's reasoning_content / reasoning field alternation is suppressed, <think>…</think> spans are filtered out of text deltas (and their inner content is routed to a real thinking block when no reasoning fields were streamed), and text_start is deferred until the first non-whitespace character.

It mirrors the upstream fix in pi-mono@b85b91c9 ("route MiniMax-M3 to openai-completions for passive caching") so users can get the fix on any pi version without waiting for an upstream release.
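The "duplicated thinking is suppressed" step can be illustrated with a small prefix-based merger. This is a minimal sketch, not the extension's code: the class name `ReasoningMergerSketch`, its `startBlock` / `push` methods, and the divergence fallback are all hypothetical, and it assumes a replayed block repeats the already-emitted reasoning from its first character.

```typescript
// Hypothetical sketch: merges several driver thinking blocks into one stream
// and forwards only reasoning that has not been forwarded before.
class ReasoningMergerSketch {
  private emitted = ""; // everything already forwarded downstream
  private current = ""; // text accumulated for the current driver block

  // Call when the driver opens a new thinking block (e.g. after a field switch).
  startBlock(): void {
    this.current = "";
  }

  // Returns the portion of `delta` that is new, or "" while a replay is in progress.
  push(delta: string): string {
    this.current += delta;
    if (this.emitted.startsWith(this.current)) {
      return ""; // still replaying text we already forwarded
    }
    if (this.current.startsWith(this.emitted)) {
      const fresh = this.current.slice(this.emitted.length);
      this.emitted = this.current;
      return fresh; // replay caught up: forward only the tail
    }
    // Simplification: on divergence, treat the delta as new text.
    this.emitted += delta;
    this.current = this.emitted;
    return delta;
  }
}

const merger = new ReasoningMergerSketch();
const forwarded: string[] = [];

merger.startBlock();
forwarded.push(merger.push("Plan: "));
forwarded.push(merger.push("read file."));

// A second block replays the same reasoning, then continues.
merger.startBlock();
forwarded.push(merger.push("Plan: "));
forwarded.push(merger.push("read file. Then edit."));

console.log(JSON.stringify(forwarded.join("")));
```

In this sketch the replayed prefix produces empty strings, so the consumer sees one continuous thinking stream instead of a fresh, truncated block per field switch.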

Install

From npm:

pi install npm:pi-minimax-m3-caching-fix

From a git checkout (latest, or pinned):

pi install git:github.com/rwese/pi-minimax-m3-caching-fix
pi install git:github.com/rwese/pi-minimax-m3-caching-fix@<ref>

For local development from a clone:

git clone https://github.com/rwese/pi-minimax-m3-caching-fix
pi install ./pi-minimax-m3-caching-fix

The extension reuses the env vars you already have for the built-in minimax provider — no new credentials required:

| Provider            | Env var            | Endpoint                    |
| ------------------- | ------------------ | --------------------------- |
| minimax-m3-clean    | MINIMAX_API_KEY    | https://api.minimax.io/v1   |
| minimax-cn-m3-clean | MINIMAX_CN_API_KEY | https://api.minimaxi.com/v1 |

Quickstart (for the impatient)

# 1. Make sure your MiniMax API key is exported
export MINIMAX_API_KEY="sk-..."

# 2. Install the extension
pi install npm:pi-minimax-m3-caching-fix

# 3. Restart any running pi session, then start one
pi

# 4. Inside pi, switch the model
/model
#   pick:  minimax-m3-clean / MiniMax-M3 (clean)

# 5. Verify caching — look at the footer or session log
#    Turn 1: ~99% cache miss (system prompt being written to cache)
#    Turn 2+: ~99% cache read (system prompt being reused)

That's it. No new credentials, no config file, no restart of the upstream minimax provider. Just pick the right model in /model and the rest happens automatically.

Use

Run pi.
Open the model picker with /model.
Pick minimax-m3-clean / MiniMax-M3 (clean) for the global endpoint or minimax-cn-m3-clean / MiniMax-M3 (clean — CN) for the China endpoint.
Send a prompt. The first turn is a cache miss; subsequent turns of the same session show a CH (cache hit rate) in the footer as the system prompt gets reused.

In the session log, the usage object on each assistant message shows the cache reads. For example, a 3-turn session looks like:

| Turn | input | cacheRead | Hit rate |
| ---- | ----- | --------- | -------- |
| 1    | 8932  | 114       | 1%       |
| 2    | 128   | 8946      | 99%      |
| 3    | 128   | 8946      | 99%      |
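The hit rates in the table can be reproduced from the two usage fields. The sketch below assumes that `input` counts only uncached tokens and that the hit rate is `cacheRead / (input + cacheRead)`; both are assumptions that happen to match the numbers shown. The `Usage` interface and helper names are hypothetical, and the prices are the input and cache-read prices quoted earlier in this README.

```typescript
// Hypothetical shape: only the two usage fields shown in the table.
interface Usage {
  input: number;     // assumed: uncached input tokens
  cacheRead: number; // assumed: input tokens served from the cache
}

// USD per million tokens, as quoted earlier in this README.
const INPUT_PRICE = 0.6;
const CACHE_READ_PRICE = 0.12;

// Fraction of input-side tokens that were served from the cache.
function hitRate(u: Usage): number {
  const total = u.input + u.cacheRead;
  return total === 0 ? 0 : u.cacheRead / total;
}

// Input-side cost with cache reads billed at the cache-read price.
function inputCostUsd(u: Usage): number {
  return (u.input * INPUT_PRICE + u.cacheRead * CACHE_READ_PRICE) / 1e6;
}

// Input-side cost if every token were billed at the full input price.
function uncachedCostUsd(u: Usage): number {
  return ((u.input + u.cacheRead) * INPUT_PRICE) / 1e6;
}

const turns: Usage[] = [
  { input: 8932, cacheRead: 114 },
  { input: 128, cacheRead: 8946 },
  { input: 128, cacheRead: 8946 },
];

turns.forEach((u, i) => {
  const rate = Math.round(hitRate(u) * 100);
  console.log(
    `turn ${i + 1}: hit rate ${rate}%, ` +
      `input cost $${inputCostUsd(u).toFixed(6)} ` +
      `vs $${uncachedCostUsd(u).toFixed(6)} without caching`,
  );
});
```

Output tokens are left out on purpose: caching only changes the input side of the bill.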

Why a separate provider (not overriding the built-in)

pi.registerProvider(name, { models }) replaces every model registered for that provider. There are two ways that breaks the built-in integration:

Override minimax with baseUrl only — this lumps M2.x onto the OpenAI-compatible endpoint too, breaking M2.x.
Override minimax with new models — this wipes M2.x from the registry.

So this extension registers new provider names (minimax-m3-clean, minimax-cn-m3-clean) that don't collide with minimax or minimax-cn. Users opt in by switching the model in /model. The built-in minimax / MiniMax-M3 model is still listed — pick the one with "(clean)" in the name.
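The collision problem can be shown with a toy registry. This is not pi's implementation or API: `registerProviderSketch` and the `Registry` type are hypothetical and model only the "replace every model registered for that provider" behaviour described above, and `"M2.x"` is a placeholder for the built-in M2 models.

```typescript
// Toy model of a provider registry: provider name -> list of model names.
type Registry = Map<string, string[]>;

// Replaces the provider's whole model list; it never merges.
function registerProviderSketch(reg: Registry, name: string, models: string[]): void {
  reg.set(name, models);
}

// Stand-in for the built-in state before any extension runs.
const builtIn: Registry = new Map();
registerProviderSketch(builtIn, "minimax", ["M2.x", "MiniMax-M3"]);

// Option A: override the built-in name. The M2.x entry is wiped.
const overridden: Registry = new Map(builtIn);
registerProviderSketch(overridden, "minimax", ["MiniMax-M3"]);

// Option B: register under a new name. The built-in list is untouched.
const sideBySide: Registry = new Map(builtIn);
registerProviderSketch(sideBySide, "minimax-m3-clean", ["MiniMax-M3 (clean)"]);

console.log("override:", overridden.get("minimax"));
console.log("new name:", sideBySide.get("minimax"), sideBySide.get("minimax-m3-clean"));
```

Option B is the approach this extension takes, which is also why two MiniMax-M3 entries end up in /model.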

Limitations

Two MiniMax-M3 entries in /model. The built-in (broken, billing at full input price) and the extension's (clean) both appear. Pick the one with (clean) in the name.
Requires both env vars for both providers to show. pi only lists providers that have auth configured. If you only have MINIMAX_API_KEY, only minimax-m3-clean shows up; set MINIMAX_CN_API_KEY (even to a dummy value) to also see minimax-cn-m3-clean.

How the fix works

The extension does two things:

Routes M3 to /v1/chat/completions by registering the two new providers under a custom api id (the provider name) so the wrapper below only intercepts these models. The model metadata mirrors packages/ai/src/models.generated.ts from the upstream fix: input: ["text", "image"], reasoning: true, cost of $0.60 input / $2.40 output / $0.12 cache read per million tokens, a 1M-token context window, and 512K max output tokens.
Cleans M3's thinking in the stream wrapper. The wrapper sits in front of the built-in openai-completions streamSimple driver and rewrites events as they arrive:
- All driver thinking blocks are merged into ONE thinking block. M3 re-streams the same reasoning when it switches between reasoning_content and reasoning fields, which would otherwise start a new (truncated) thinking block on every field switch. The wrapper dedupes by prefix and emits only the new portion of reasoning.
- A ThinkScanner filters <think>…</think> spans from text deltas in real time and holds back bytes that look like the start of a tag so markers split across deltas are classified correctly. If the model never streamed reasoning fields, the captured inner content is routed to a real thinking block instead of being dropped; otherwise it's a duplicate of the reasoning fields and is discarded.
- text_start is deferred until the first non-whitespace character so empty / whitespace-only text blocks are not rendered.
This is the same effect as the upstream compat.skipThinkingBlock flag, but applied in the stream wrapper because the user's installed @earendil-works/pi-ai (0.79.1) predates that compat field. When a future pi-ai release includes skipThinkingBlock, the wrapper becomes a thin pass-through and can be deleted.
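The tag filtering with hold-back can be sketched as follows. This is a minimal illustration of the technique, not the extension's ThinkScanner: the class name `ThinkScannerSketch`, its `push` / `flush` methods, and the `Scanned` result shape are hypothetical, and it works on string characters rather than bytes. The caller would decide whether captured thinking is routed to a thinking block or discarded as a duplicate.

```typescript
const OPEN = "<think>";
const CLOSE = "</think>";

interface Scanned {
  text: string;     // visible text outside the markers
  thinking: string; // content captured inside the markers
}

// Length of the longest suffix of `data` that is a proper prefix of `tag`.
function partialTagLength(data: string, tag: string): number {
  const max = Math.min(tag.length - 1, data.length);
  for (let n = max; n > 0; n--) {
    if (data.endsWith(tag.slice(0, n))) return n;
  }
  return 0;
}

class ThinkScannerSketch {
  private inside = false; // true while between <think> and </think>
  private held = "";      // trailing characters that might start a tag

  push(delta: string): Scanned {
    const out: Scanned = { text: "", thinking: "" };
    let data = this.held + delta;
    this.held = "";
    while (data.length > 0) {
      const tag = this.inside ? CLOSE : OPEN;
      const idx = data.indexOf(tag);
      if (idx === -1) {
        // No complete tag: emit everything except a possible tag prefix.
        const keep = partialTagLength(data, tag);
        this.emit(out, data.slice(0, data.length - keep));
        this.held = data.slice(data.length - keep);
        break;
      }
      this.emit(out, data.slice(0, idx));
      this.inside = !this.inside;
      data = data.slice(idx + tag.length);
    }
    return out;
  }

  // Call at end of stream: whatever is still held was not a tag after all.
  flush(): Scanned {
    const out: Scanned = { text: "", thinking: "" };
    this.emit(out, this.held);
    this.held = "";
    return out;
  }

  private emit(out: Scanned, chunk: string): void {
    if (this.inside) out.thinking += chunk;
    else out.text += chunk;
  }
}

// Markers split across deltas are still classified correctly.
const scanner = new ThinkScannerSketch();
let visible = "";
let captured = "";
["Hello <th", "ink>secret</thi", "nk> world"].forEach((delta) => {
  const r = scanner.push(delta);
  visible += r.text;
  captured += r.thinking;
});
const rest = scanner.flush();
visible += rest.text;
captured += rest.thinking;
console.log(JSON.stringify(visible), JSON.stringify(captured));
```

Holding back at most one partial tag keeps added latency small: ordinary text is forwarded immediately, and only a trailing fragment such as `<th` waits for the next delta.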

Removing the extension (when upstream ships the fix)

When pi-mono ships a release that includes b85b91c9 (or any release whose models.generated.ts lists MiniMax-M3 with api: "openai-completions" and skipThinkingBlock: true), retire the extension:

pi remove npm:pi-minimax-m3-caching-fix

The built-in minimax / MiniMax-M3 model will then route correctly out of the box.

License

MIT — see LICENSE.

Credits

The in-flight thinking-cleanup wrapper introduced in v0.2.0 (the ThinkScanner, the merged-thinking block, and the deferred text_start) was contributed by Thunder Guardian (Discord: @Thunder Guardian).

Development

npm run check    # tsc --noEmit using the bundled tsconfig.json

The tsconfig.json sets skipLibCheck and moduleResolution: "bundler", so the type check is reproducible without depending on the transitive type packages of the user's installed pi.
