pi-makora-provider

v1.0.0

Published

6 days ago

Makora provider extension for pi - Access DeepSeek V4, GLM 5.1, Kimi K2.6, Llama 3.3, Qwen 3.6, and more through the Makora inference API


monotykamary

pi extension provider makora ai llm deepseek glm kimi llama qwen

🔁 pi-makora-provider

Open-weight models through Makora

DeepSeek V4, Kimi K2.6, GLM 5.1, Qwen 3.6 — with client-side tool call repair for pi.

Models

| Model | ID | Reasoning | Notes |
|-------|----|-----------|-------|
| DeepSeek V4 Flash | deepseek-ai/DeepSeek-V4-Flash | Yes | include_reasoning + chat_template_kwargs.thinking via before_provider_request payload rewrite; returns reasoning field |
| DeepSeek V4 Pro | deepseek-ai/DeepSeek-V4-Pro | Yes | chat_template_kwargs.thinking via before_provider_request payload rewrite; returns reasoning_content field |
| GLM 5.1 FP8 | zai-org/GLM-5.1-FP8 | Yes | enable_thinking via qwen-chat-template; returns reasoning_content field; client-side tool call parsing (vLLM streaming parser bypass) |
| GPT-OSS 120B | openai/gpt-oss-120b | Yes | Reasoning always on |
| Kimi K2.6 NVFP4 | nvidia/Kimi-K2.6-NVFP4 | Yes | Reasoning on by default; client-side tool call parsing (vLLM streaming parser bypass) |
| Kimi K2.7 Code | moonshotai/Kimi-K2.7-Code | Yes | Reasoning on by default; client-side tool call parsing (vLLM streaming parser bypass) |
| Llama 3.3 70B FP8 | amd/Llama-3.3-70B-Instruct-FP8-KV | No | |
| Llama 3.3 70B Instruct | meta-llama/Llama-3.3-70B-Instruct | No | |
| MiniMax M3 MXFP8 | MiniMaxAI/MiniMax-M3-MXFP8 | Yes | Reasoning via chat_template_kwargs.enable_thinking; returns reasoning_content field |
| Qwen 3.6 27B NVFP4 | unsloth/Qwen3.6-27B-NVFP4 | Yes | enable_thinking via qwen-chat-template; client-side tool call parsing (vLLM streaming parser bypass) |
| Qwen 3.6 35B A3B NVFP4 | unsloth/Qwen3.6-35B-A3B-NVFP4 | Yes | enable_thinking via qwen-chat-template; client-side tool call parsing (vLLM streaming parser bypass) |

Installation

Option 1: Using `pi install` (Recommended)

Install directly from GitHub:

```
pi install https://github.com/monotykamary/pi-makora-provider
```

Then set your API key and run pi:

```
# Recommended: add to auth.json
# See Authentication section below

# Or set as environment variable
export MAKORA_OPTIMIZE_TOKEN=your-api-key-here

pi
```

Option 2: Manual Clone

Clone this repository:

```
git clone https://github.com/monotykamary/pi-makora-provider.git
cd pi-makora-provider
```

Set your Makora API key:

```
# Recommended: add to auth.json
# See Authentication section below

# Or set as environment variable
export MAKORA_OPTIMIZE_TOKEN=your-api-key-here
```

Run pi with the extension:
```
pi -e /path/to/pi-makora-provider
```

Setup

API Key

Add your Makora API key to ~/.pi/agent/auth.json (recommended):

```
{
  "makora": { "type": "api_key", "key": "your-api-key" }
}
```

Or set it as an environment variable:

```
export MAKORA_OPTIMIZE_TOKEN=your-api-key
```

Usage

```
pi -e /path/to/pi-makora-provider
```

Then use /model to select from available Makora models.

Model Resolution

Models are discovered from the Makora /v1/models API and stored in models.json. Custom definitions and overrides are layered via patch.json and custom-models.json.

| File | Purpose |
|---|---|
| models.json | Auto-generated from Makora API (model discovery). Regenerated by node scripts/update-models.js; do not edit manually |
| patch.json | Manual overrides (reasoning, compat, notes, limits, etc.) applied on top of models.json |
| custom-models.json | Models not available via the API (e.g. per-slug endpoint models) |

Models are loaded in order: models.json is read first, patch.json is applied on top of it, and custom-models.json is merged in last.
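The load order can be sketched roughly as follows. This is an illustration only, not the extension's actual loader: the `ModelDef` and `PatchFile` shapes, the `resolveModels` name, the assumption that patch.json is keyed by model ID, and the choice to let a custom entry replace a discovered one with the same ID are all hypothetical.

```typescript
// Hypothetical model shape: only fields that appear in this README are named.
interface ModelDef {
  id: string;
  reasoning?: boolean;
  maxTokens?: number;
  baseUrl?: string;
  [extra: string]: unknown;
}

// Assumption: patch.json is keyed by model ID; the real file layout may differ.
type PatchFile = Record<string, Partial<ModelDef>>;

function resolveModels(
  discovered: ModelDef[],
  patch: PatchFile,
  custom: ModelDef[],
): ModelDef[] {
  const byId = new Map<string, ModelDef>();

  // Step 1: start from the API-discovered entries (models.json).
  for (const model of discovered) {
    byId.set(model.id, { ...model });
  }

  // Step 2: overlay patch.json fields on models that exist; unknown IDs are ignored.
  for (const id of Object.keys(patch)) {
    const base = byId.get(id);
    if (base) {
      byId.set(id, { ...base, ...patch[id], id });
    }
  }

  // Step 3: merge custom-models.json last (assumed to win on an ID clash).
  for (const model of custom) {
    byId.set(model.id, { ...model });
  }

  return Array.from(byId.values());
}
```

Keeping the three sources separate until this final merge is what lets models.json be regenerated without losing manual overrides.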

Adding Custom Models

Do not edit models.json directly — it is auto-generated from the API. To customize:

- Override an existing model: add entries to patch.json (reasoning, compat, notes, maxTokens, etc.).
- Add new models not in the API: add entries to custom-models.json:

```
[
  {
    "id": "my-org/my-model",
    "name": "My Custom Model",
    "reasoning": false,
    "input": ["text"],
    "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
    "contextWindow": 131072,
    "maxTokens": 16384,
    "baseUrl": "https://inference.makora.com/my-model-slug/v1"
  }
]
```

API Notes

- Each model is accessible at https://inference.makora.com/v1/chat/completions (unified endpoint)
- Models with a baseUrl override use their per-slug endpoint instead
- The API is OpenAI-compatible (chat completions format)
- All models are hosted on vLLM
- The developer role is not supported (prompts are silently dropped); supportsDeveloperRole is set to false for all models
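As a rough sketch of the endpoint rules above, the snippet below picks the unified endpoint unless a model carries a baseUrl, then builds an OpenAI-style chat completions request. The helper names (`chatCompletionsUrl`, `buildChatRequest`) are hypothetical, and two things are assumptions rather than documented behavior: that per-slug endpoints expose the same /chat/completions path, and that the key is sent as a Bearer token, as is conventional for OpenAI-compatible APIs.

```typescript
const UNIFIED_BASE_URL = "https://inference.makora.com/v1";

interface ModelRef {
  id: string;
  baseUrl?: string; // per-slug override, as in custom-models.json
}

// "developer" is deliberately absent: the API notes say that role is unsupported.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function chatCompletionsUrl(model: ModelRef): string {
  // Strip trailing slashes so the joined URL never contains "//chat".
  const base = (model.baseUrl ?? UNIFIED_BASE_URL).replace(/\/+$/, "");
  return `${base}/chat/completions`;
}

function buildChatRequest(model: ModelRef, messages: ChatMessage[], apiKey: string) {
  return {
    url: chatCompletionsUrl(model),
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Assumption: Bearer auth, following the OpenAI convention.
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model: model.id, messages }),
    },
  };
}

// Usage (not executed here):
//   const { url, init } = buildChatRequest(model, messages, apiKey);
//   const response = await fetch(url, init);
```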

vLLM Caveats

These issues are common to all vLLM-hosted providers and affect Makora models:

- GLM 5.1 tool calling: vLLM's streaming tool call handling is broken for GLM, so the model's native Zhipu <tool_call> XML arrives as raw text. The message_end hook parses this into toolCall blocks so pi can execute the tools. A context hook then strips tool_calls from assistant messages before follow-up requests, converting them back to <tool_call> text to avoid a ZAI/vLLM server crash (500: 'str object' has no attribute 'items') that occurs when any assistant message contains a tool_calls field. If upstream fixes both the streaming parser and the 500 crash, the message_end hook gracefully skips (existing valid toolCall blocks are preserved), and the context hook's text-stripping is harmless (GLM natively understands <tool_call> format in conversation history).
- Kimi K2.6 + Qwen 3.6 tool calling: vLLM's streaming tool call handling is broken or missing for these models. The before_provider_request hook sets tool_choice: "none" and skip_special_tokens: false so the model's tool call tokens pass through as plain text. The message_end hook then re-parses into toolCall blocks:
  - Kimi K2.6: Uses <|tool_call_begin|>...<|tool_call_end|> tokens. Makora's vLLM is missing both --enable-auto-tool-choice and --tool-call-parser for this model.
  - Qwen 3.6: Uses hermes-style <function=...> XML, sometimes with █ delimiters. Same vLLM flag limitation as Kimi.
- GLM 5.1 CoT leak: On some vLLM builds, disabling reasoning may still leak chain-of-thought into content terminated by a ``` marker. See vllm-project/vllm#31319.
- DeepSeek V4 reasoning: The official DeepSeek API uses thinking: { type: "enabled" } which Makora's vLLM silently ignores. The before_provider_request hook rewrites the payload to use vLLM-native params instead:
  - DS V4 Pro: chat_template_kwargs: { thinking: true }. Returns reasoning_content.
  - DS V4 Flash: include_reasoning: true + chat_template_kwargs: { thinking: true }. include_reasoning alone returns reasoning: null on this vLLM build, so both params are required. Returns reasoning.
- GLM 5.1 reasoning: Returns reasoning_content (not reasoning). pi's OpenAI completions handler checks reasoning_content first, so this is handled correctly.
- MiniMax M3 reasoning: Uses chat_template_kwargs.enable_thinking to toggle thinking (not chat_template_kwargs.thinking like DeepSeek). The before_provider_request hook rewrites the DeepSeek API-style thinking param into vLLM-native chat_template_kwargs: { enable_thinking: true }. Returns reasoning_content field.
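The reasoning rewrites for DeepSeek V4 and MiniMax M3 could look something like the sketch below. It is not the extension's actual before_provider_request hook: the `rewriteThinking` function, the `RULES` table, and the `isRecord` helper are hypothetical names, and the behavior when thinking is absent (payload left untouched) or not "enabled" (flags set to false) is an assumption, since the notes above only describe the enabled case.

```typescript
type Payload = Record<string, unknown>;

interface ThinkingRule {
  kwarg: "thinking" | "enable_thinking"; // key inside chat_template_kwargs
  includeReasoning: boolean;             // also send include_reasoning
}

// Model IDs come from the Models table; the table structure itself is hypothetical.
const RULES = new Map<string, ThinkingRule>([
  ["deepseek-ai/DeepSeek-V4-Pro", { kwarg: "thinking", includeReasoning: false }],
  ["deepseek-ai/DeepSeek-V4-Flash", { kwarg: "thinking", includeReasoning: true }],
  ["MiniMaxAI/MiniMax-M3-MXFP8", { kwarg: "enable_thinking", includeReasoning: false }],
]);

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function rewriteThinking(payload: Payload): Payload {
  const model = payload["model"];
  const thinking = payload["thinking"];
  // Assumption: payloads without a DeepSeek-style thinking object pass through unchanged.
  if (typeof model !== "string" || !isRecord(thinking)) return payload;

  const rule = RULES.get(model);
  if (!rule) return payload;

  const enabled = thinking["type"] === "enabled";

  // Copy the payload and drop the param that Makora's vLLM silently ignores.
  const rewritten: Payload = { ...payload };
  delete rewritten["thinking"];

  // Preserve any chat_template_kwargs the caller already set.
  const prior = rewritten["chat_template_kwargs"];
  const existing = isRecord(prior) ? prior : {};
  rewritten["chat_template_kwargs"] = { ...existing, [rule.kwarg]: enabled };

  if (rule.includeReasoning) {
    rewritten["include_reasoning"] = enabled;
  }
  return rewritten;
}
```

Driving the rewrite from a per-model table keeps the Pro/Flash difference (one param versus two) in data rather than in branching code.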
