pi-makora-provider
v1.0.0
Makora provider extension for pi - Access DeepSeek V4, GLM 5.1, Kimi K2.6, Llama 3.3, Qwen 3.6, and more through the Makora inference API
🔁 pi-makora-provider
Open-weight models through Makora
DeepSeek V4, Kimi K2.6, GLM 5.1, Qwen 3.6 — with client-side tool call repair for pi.
Models
| Model | ID | Reasoning | Notes |
|-------|----|-----------|-------|
| DeepSeek V4 Flash | deepseek-ai/DeepSeek-V4-Flash | Yes | include_reasoning + chat_template_kwargs.thinking via before_provider_request payload rewrite; returns reasoning field |
| DeepSeek V4 Pro | deepseek-ai/DeepSeek-V4-Pro | Yes | chat_template_kwargs.thinking via before_provider_request payload rewrite; returns reasoning_content field |
| GLM 5.1 FP8 | zai-org/GLM-5.1-FP8 | Yes | enable_thinking via qwen-chat-template; returns reasoning_content field; client-side tool call parsing (vLLM streaming parser bypass) |
| GPT-OSS 120B | openai/gpt-oss-120b | Yes | Reasoning always on |
| Kimi K2.6 NVFP4 | nvidia/Kimi-K2.6-NVFP4 | Yes | Reasoning on by default; client-side tool call parsing (vLLM streaming parser bypass) |
| Kimi K2.7 Code | moonshotai/Kimi-K2.7-Code | Yes | Reasoning on by default; client-side tool call parsing (vLLM streaming parser bypass) |
| Llama 3.3 70B FP8 | amd/Llama-3.3-70B-Instruct-FP8-KV | No | |
| Llama 3.3 70B Instruct | meta-llama/Llama-3.3-70B-Instruct | No | |
| MiniMax M3 MXFP8 | MiniMaxAI/MiniMax-M3-MXFP8 | Yes | Reasoning via chat_template_kwargs.enable_thinking; returns reasoning_content field |
| Qwen 3.6 27B NVFP4 | unsloth/Qwen3.6-27B-NVFP4 | Yes | enable_thinking via qwen-chat-template; client-side tool call parsing (vLLM streaming parser bypass) |
| Qwen 3.6 35B A3B NVFP4 | unsloth/Qwen3.6-35B-A3B-NVFP4 | Yes | enable_thinking via qwen-chat-template; client-side tool call parsing (vLLM streaming parser bypass) |
Installation
Option 1: Using pi install (Recommended)
Install directly from GitHub:
pi install https://github.com/monotykamary/pi-makora-provider
Then set your API key and run pi:
# Recommended: add to auth.json
# See the API Key section under Setup below
# Or set as environment variable
export MAKORA_OPTIMIZE_TOKEN=your-api-key-here
pi
Option 2: Manual Clone
Clone this repository:
git clone https://github.com/monotykamary/pi-makora-provider.git
cd pi-makora-provider
Set your Makora API key:
# Recommended: add to auth.json
# See the API Key section under Setup below
# Or set as environment variable
export MAKORA_OPTIMIZE_TOKEN=your-api-key-here
Run pi with the extension:
pi -e /path/to/pi-makora-provider
Setup
API Key
Add your Makora API key to ~/.pi/agent/auth.json (recommended):
{
"makora": { "type": "api_key", "key": "your-api-key" }
}
Or set it as an environment variable:
export MAKORA_OPTIMIZE_TOKEN=your-api-key
Usage
pi -e /path/to/pi-makora-provider
Then use /model to select from available Makora models.
Model Resolution
Models are discovered from the Makora /v1/models API and stored in models.json. Custom definitions and overrides are layered via patch.json and custom-models.json.
| File | Purpose |
|---|---|
| models.json | Auto-generated from Makora API (model discovery). Regenerated by node scripts/update-models.js — do not edit manually |
| patch.json | Manual overrides (reasoning, compat, notes, limits, etc.) applied on top of models.json |
| custom-models.json | Models not available via the API (e.g. per-slug endpoint models) |
Models are loaded in order: models.json is read first, patch.json is applied on top, and custom-models.json is merged in last.
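The load order can be sketched as follows. This is a minimal illustration, not the extension's actual loader: it assumes models.json and custom-models.json are arrays of model objects, that patch.json is an object keyed by model ID, and that patches are applied as a shallow merge. The names MakoraModelDef, MakoraPatchFile, and resolveModels are hypothetical.

```typescript
// Hypothetical model shape; only a few of the documented fields are listed.
interface MakoraModelDef {
  id: string;
  name?: string;
  reasoning?: boolean;
  maxTokens?: number;
  [key: string]: unknown;
}

// Assumption: patch.json maps a model ID to the fields it overrides.
type MakoraPatchFile = Record<string, Partial<MakoraModelDef>>;

function resolveModels(
  discovered: MakoraModelDef[],
  patches: MakoraPatchFile,
  custom: MakoraModelDef[],
): MakoraModelDef[] {
  const byId = new Map<string, MakoraModelDef>();

  // 1. Start from the auto-generated models.json entries.
  for (const model of discovered) {
    byId.set(model.id, { ...model });
  }

  // 2. Apply patch.json on top (shallow merge, patch fields win).
  //    Patches for IDs that were not discovered are ignored here.
  for (const id of Object.keys(patches)) {
    const base = byId.get(id);
    if (base) {
      byId.set(id, { ...base, ...patches[id], id });
    }
  }

  // 3. Merge custom-models.json; a custom entry replaces a same-ID entry.
  for (const model of custom) {
    byId.set(model.id, { ...model });
  }

  return Array.from(byId.values());
}
```

Keying everything by ID keeps the three steps independent, so regenerating models.json never disturbs the manual overrides.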
Adding Custom Models
Do not edit models.json directly — it is auto-generated from the API. To customize:
- Override an existing model: Add entries to patch.json (reasoning, compat, notes, maxTokens, etc.)
- Add new models not in the API: Add entries to custom-models.json:
[
{
"id": "my-org/my-model",
"name": "My Custom Model",
"reasoning": false,
"input": ["text"],
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
"contextWindow": 131072,
"maxTokens": 16384,
"baseUrl": "https://inference.makora.com/my-model-slug/v1"
}
]
API Notes
- Each model is accessible at https://inference.makora.com/v1/chat/completions (unified endpoint)
- Models with a baseUrl override use their per-slug endpoint instead
- The API is OpenAI-compatible (chat completions format)
- All models are hosted on vLLM
- The developer role is not supported (prompts are silently dropped); supportsDeveloperRole is set to false for all models
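As an illustration of these notes, the sketch below builds a chat completions request against the unified endpoint, switches to a per-slug baseUrl when one is given, and downgrades developer messages to system so they are not silently dropped. buildChatRequest is a hypothetical helper, sending the key as a Bearer token is an assumption based on the API being OpenAI-compatible, and the role downgrade is one possible workaround rather than a description of what the extension does.

```typescript
type ChatRole = "system" | "developer" | "user" | "assistant";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

const UNIFIED_BASE_URL = "https://inference.makora.com/v1";

function buildChatRequest(
  apiKey: string,
  model: string,
  messages: ChatMessage[],
  baseUrl: string = UNIFIED_BASE_URL, // pass a per-slug URL to override
): BuiltRequest {
  // The developer role is not supported, so send those prompts as system.
  const safeMessages = messages.map((m) =>
    m.role === "developer" ? { ...m, role: "system" as const } : m,
  );

  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Assumption: Bearer auth, as in other OpenAI-compatible APIs.
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages: safeMessages }),
    },
  };
}
```

The returned url and init can be passed straight to fetch, with the key read from the MAKORA_OPTIMIZE_TOKEN environment variable.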
vLLM Caveats
These issues are common to all vLLM-hosted providers and affect Makora models:
GLM 5.1 tool calling: vLLM's streaming tool call handling is broken for GLM; the model outputs Zhipu's native <tool_call> XML format as raw text. The message_end hook parses this into toolCall blocks so pi can execute the tools. A context hook then strips tool_calls from assistant messages before follow-up requests, converting them back to <tool_call> text to avoid a ZAI/vLLM server crash (500: 'str object' has no attribute 'items') that occurs when any assistant message contains a tool_calls field. If upstream fixes both the streaming parser and the 500 crash, the message_end hook gracefully skips (existing valid toolCall blocks are preserved), and the context hook's text-stripping is harmless (GLM natively understands <tool_call> format in conversation history).
Kimi K2.6 + Qwen 3.6 tool calling: vLLM's streaming tool call handling is broken or missing for these models. The before_provider_request hook sets tool_choice: "none" and skip_special_tokens: false so the model's tool call tokens pass through as plain text. The message_end hook then re-parses into toolCall blocks:
- Kimi K2.6: Uses <|tool_call_begin|>...<|tool_call_end|> tokens. Makora's vLLM is missing both --enable-auto-tool-choice and --tool-call-parser for this model.
- Qwen 3.6: Uses hermes-style <function=...> XML, sometimes with █ delimiters. Same vLLM flag limitation as Kimi.
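The two halves of the Kimi workaround can be sketched as below. This is an illustration only and not the extension's hook code. The request-side fields come from the description above; the inner layout of each call (functions.NAME:INDEX, then a <|tool_call_argument_begin|> token, then JSON arguments) is an assumption about Kimi's native format, and the names disableServerToolParsing and parseKimiToolCalls are hypothetical.

```typescript
interface ParsedToolCall {
  name: string;
  arguments: unknown;
}

// Request side: ask vLLM to leave the tool-call tokens in the text stream.
function disableServerToolParsing(
  payload: Record<string, unknown>,
): Record<string, unknown> {
  return { ...payload, tool_choice: "none", skip_special_tokens: false };
}

// Response side. Assumed layout of one call:
// <|tool_call_begin|>functions.NAME:INDEX<|tool_call_argument_begin|>{JSON}<|tool_call_end|>
const KIMI_CALL =
  /<\|tool_call_begin\|>([\s\S]*?)<\|tool_call_argument_begin\|>([\s\S]*?)<\|tool_call_end\|>/g;

function parseKimiToolCalls(text: string): {
  content: string;
  toolCalls: ParsedToolCall[];
} {
  const toolCalls: ParsedToolCall[] = [];

  const content = text.replace(
    KIMI_CALL,
    (whole: string, header: string, rawArgs: string): string => {
      // "functions.get_weather:0" becomes "get_weather".
      const name = header
        .trim()
        .replace(/^functions\./, "")
        .replace(/:\d+$/, "");
      try {
        toolCalls.push({ name, arguments: JSON.parse(rawArgs) });
        return ""; // remove the parsed call from the visible text
      } catch {
        return whole; // leave malformed calls in the text untouched
      }
    },
  );

  return { content: content.trim(), toolCalls };
}
```

Leaving unparseable spans in the text is a deliberately conservative choice: a malformed call stays visible instead of being dropped.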
GLM 5.1 CoT leak: On some vLLM builds, disabling reasoning may still leak chain-of-thought into content terminated by a ``` marker. See vllm-project/vllm#31319.
DeepSeek V4 reasoning: The official DeepSeek API uses thinking: { type: "enabled" } which Makora's vLLM silently ignores. The before_provider_request hook rewrites the payload to use vLLM-native params instead:
- DS V4 Pro: chat_template_kwargs: { thinking: true }. Returns reasoning_content.
- DS V4 Flash: include_reasoning: true + chat_template_kwargs: { thinking: true }. include_reasoning alone returns reasoning: null on this vLLM build, so both params are required. Returns reasoning.
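The DeepSeek payload rewrite can be sketched as a pure function. This is a hedged illustration of the mapping described above, not the hook's actual source: rewriteDeepSeekThinking is a hypothetical name, detecting Flash by its model ID suffix is an assumption, and payloads without thinking: { type: "enabled" } are simply left unchanged here.

```typescript
type DeepSeekPayload = Record<string, unknown>;

function rewriteDeepSeekThinking(payload: DeepSeekPayload): DeepSeekPayload {
  const { thinking, ...rest } = payload;

  // Only act on the DeepSeek API-style flag: thinking: { type: "enabled" }.
  const enabled =
    typeof thinking === "object" &&
    thinking !== null &&
    (thinking as { type?: unknown }).type === "enabled";
  if (!enabled) return payload;

  // Keep any chat_template_kwargs the caller already set.
  const existing =
    (rest["chat_template_kwargs"] as DeepSeekPayload | undefined) ?? {};

  const rewritten: DeepSeekPayload = {
    ...rest,
    chat_template_kwargs: { ...existing, thinking: true },
  };

  // Flash needs include_reasoning in addition to the template kwarg.
  // Assumption: the model ID is enough to tell Flash apart from Pro.
  if (String(rest["model"]).endsWith("DeepSeek-V4-Flash")) {
    rewritten["include_reasoning"] = true;
  }

  return rewritten;
}
```

The original thinking key is dropped from the rewritten payload because the server ignores it anyway.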
GLM 5.1 reasoning: Returns reasoning_content (not reasoning). pi's OpenAI completions handler checks reasoning_content first, so this is handled correctly.
MiniMax M3 reasoning: Uses chat_template_kwargs.enable_thinking to toggle thinking (not chat_template_kwargs.thinking like DeepSeek). The before_provider_request hook rewrites the DeepSeek API-style thinking param into vLLM-native chat_template_kwargs: { enable_thinking: true }. Returns reasoning_content field.
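Two small sketches tie these reasoning notes together: reading the reasoning text from a response regardless of which field the model uses, and the MiniMax variant of the payload rewrite. Neither is pi's or the extension's actual code. readReasoning and rewriteMiniMaxThinking are hypothetical names, and mapping any thinking value other than { type: "enabled" } to enable_thinking: false is an assumption.

```typescript
interface ReasoningDelta {
  content?: string | null;
  reasoning_content?: string | null; // DS V4 Pro, GLM 5.1, MiniMax M3
  reasoning?: string | null; // DS V4 Flash
}

// Check reasoning_content first, then fall back to reasoning.
function readReasoning(delta: ReasoningDelta): string {
  return delta.reasoning_content ?? delta.reasoning ?? "";
}

type MiniMaxPayload = Record<string, unknown>;

function rewriteMiniMaxThinking(payload: MiniMaxPayload): MiniMaxPayload {
  const { thinking, ...rest } = payload;
  if (thinking === undefined) return payload; // nothing to rewrite

  const enabled =
    typeof thinking === "object" &&
    thinking !== null &&
    (thinking as { type?: unknown }).type === "enabled";

  const existing =
    (rest["chat_template_kwargs"] as MiniMaxPayload | undefined) ?? {};

  // MiniMax toggles thinking with enable_thinking, not thinking.
  return {
    ...rest,
    chat_template_kwargs: { ...existing, enable_thinking: enabled },
  };
}
```

Using ?? rather than || in readReasoning means an empty reasoning_content string still counts as present and does not fall through to reasoning.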
