omp-makora-provider
v1.0.1
Makora provider plugin for OMP — Access DeepSeek V4, GLM 5.1, Kimi K2.6, Llama 3.3, Qwen 3.6, and more through the Makora inference API
🔁 omp-makora-provider
Open-weight models through Makora
DeepSeek V4, Kimi K2.6, GLM 5.1 / 5.2, Qwen 3.6 — with client-side tool call repair for OMP / pi.
Models
| Model | ID | Reasoning | Notes |
|-------|----|-----------|-------|
| DeepSeek V4 Flash | deepseek-ai/DeepSeek-V4-Flash | Yes | maxTokens 32768; include_reasoning + chat_template_kwargs.thinking via before_provider_request payload rewrite; returns reasoning field |
| DeepSeek V4 Pro | deepseek-ai/DeepSeek-V4-Pro | Yes | maxTokens 32768; chat_template_kwargs.thinking via before_provider_request payload rewrite; returns reasoning_content field |
| GLM 5.1 FP8 | zai-org/GLM-5.1-FP8 | Yes | maxTokens 16384; enable_thinking via qwen-chat-template; returns reasoning_content field; client-side tool call parsing (vLLM streaming parser bypass) |
| GLM 5.2 FP8 | zai-org/GLM-5.2-FP8 | Yes | maxTokens 16384; enable_thinking via qwen-chat-template; returns reasoning field; native tool calls work in both stream and non-stream (no client-side repair needed) |
| GPT-OSS 120B | openai/gpt-oss-120b | Yes | maxTokens 16384; reasoning always on |
| Kimi K2.6 NVFP4 | nvidia/Kimi-K2.6-NVFP4 | Yes | maxTokens 16384; vision maxImagesPerRequest 5; reasoning on by default; client-side tool call parsing (vLLM streaming parser bypass) |
| Kimi K2.7 Code | moonshotai/Kimi-K2.7-Code | Yes | maxTokens 16384; vision maxImagesPerRequest 5; reasoning on by default; client-side tool call parsing (vLLM streaming parser bypass) |
| Llama 3.3 70B FP8 | amd/Llama-3.3-70B-Instruct-FP8-KV | No | maxTokens 16384; custom per-slug endpoint |
| Llama 3.3 70B Instruct | meta-llama/Llama-3.3-70B-Instruct | No | maxTokens 8192; non-reasoning text-only model |
| MiniMax M3 MXFP8 | MiniMaxAI/MiniMax-M3-MXFP8 | Yes | maxTokens 16384; vision maxImagesPerRequest 5; reasoning via chat_template_kwargs.enable_thinking; returns reasoning_content field |
| Qwen 3.6 27B NVFP4 | unsloth/Qwen3.6-27B-NVFP4 | Yes | maxTokens 16384; enable_thinking via qwen-chat-template; client-side tool call parsing (vLLM streaming parser bypass) |
| Qwen 3.6 35B A3B NVFP4 | unsloth/Qwen3.6-35B-A3B-NVFP4 | Yes | maxTokens 16384; enable_thinking via qwen-chat-template; client-side tool call parsing (vLLM streaming parser bypass) |
Quickstart
Install from npm, then log in once:
# 1. Install the plugin from npm
omp plugin install omp-makora-provider
# 2. Open OMP / pi
omp
# 3. Add your Makora API key
/login makora
# 4. Pick a Makora model
/model makora
That's it. Makora models now appear in /model. No -e flag, no manual clone, no config files.
API key
/login makora prompts for your Makora API key, validates it, and stores it.
If you prefer environment variables:
export MAKORA_OPTIMIZE_TOKEN=your-api-key
Get a key at inference.makora.com.
Install sources
# npm registry (recommended)
omp plugin install omp-makora-provider
# GitHub
omp plugin install https://github.com/ryan-brosas/omp-makora-provider
# Local development
git clone https://github.com/ryan-brosas/omp-makora-provider.git
omp plugin link ./omp-makora-provider
Model Resolution
Models are discovered from the Makora /v1/models API and stored in models.json. Custom definitions and overrides are layered via patch.json and custom-models.json.
| File | Purpose |
|---|---|
| models.json | Auto-generated from Makora API (model discovery). Regenerated by node scripts/update-models.js — do not edit manually |
| patch.json | Manual overrides (reasoning, compat, notes, limits, etc.) applied on top of models.json |
| custom-models.json | Models not available via the API (e.g. per-slug endpoint models) |
Models are loaded in three steps: load models.json → apply patch.json → merge custom-models.json.
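The load order above can be sketched as a small merge function. This is an illustration rather than the plugin's actual loader: the function name resolveModels is hypothetical, and it assumes models.json and custom-models.json are arrays of model objects while patch.json is an object keyed by model ID (the real patch.json layout may differ).

```javascript
// Hypothetical sketch of the three-step model resolution.
// Assumed shapes: `discovered` and `custom` are arrays of model objects,
// `patch` is an object keyed by model id.
function resolveModels(discovered, patch, custom) {
  const byId = new Map();

  // 1. Start from the auto-generated discovery list (models.json).
  for (const model of discovered) {
    byId.set(model.id, { ...model });
  }

  // 2. Apply manual overrides (patch.json) to models that were discovered.
  for (const [id, overrides] of Object.entries(patch)) {
    if (byId.has(id)) {
      byId.set(id, { ...byId.get(id), ...overrides });
    }
  }

  // 3. Merge in models the API does not list (custom-models.json).
  //    A custom entry that reuses an existing id overrides matching fields.
  for (const model of custom) {
    byId.set(model.id, { ...byId.get(model.id), ...model });
  }

  return [...byId.values()];
}

const models = resolveModels(
  [{ id: "zai-org/GLM-5.1-FP8", name: "GLM 5.1 FP8" }],
  { "zai-org/GLM-5.1-FP8": { reasoning: true, maxTokens: 16384 } },
  [{ id: "my-org/my-model", name: "My Custom Model", maxTokens: 16384 }]
);
console.log(models.map((m) => m.id));
// prints [ 'zai-org/GLM-5.1-FP8', 'my-org/my-model' ]
```

The merge here is shallow, so a nested field such as compat or vision in a patch replaces the discovered value wholesale. A real loader may prefer a deep merge for those fields.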
Patch metadata fields
patch.json supports the same model metadata fields consumed by the provider, including reasoning, input, contextWindow, maxTokens, vision, notes, thinkingLevelMap, and compat. Use maxTokens for safe output caps because Makora model discovery does not report max output tokens. Use vision.maxImagesPerRequest for multimodal request limits when a model declares input: ["text", "image"].
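To make the role of maxTokens and vision.maxImagesPerRequest concrete, here is a minimal sketch of how a client could enforce them before sending a request. The helper names applyModelLimits and countImages are hypothetical, and the request is assumed to be an OpenAI-style chat completions payload whose image parts use type "image_url". Whether the provider clamps or rejects over-limit requests is not stated above, so the behaviour shown is an assumption.

```javascript
// Hypothetical sketch: enforce patch.json limits on an OpenAI-style payload.
function applyModelLimits(model, request) {
  const limited = { ...request };

  // Discovery reports no output cap, so maxTokens from patch.json is the ceiling.
  if (typeof model.maxTokens === "number") {
    limited.max_tokens = Math.min(request.max_tokens ?? model.maxTokens, model.maxTokens);
  }

  const acceptsImages = Array.isArray(model.input) && model.input.includes("image");
  const imageCount = countImages(request.messages ?? []);
  if (!acceptsImages && imageCount > 0) {
    throw new Error(`${model.id} is text-only`);
  }

  const maxImages = model.vision?.maxImagesPerRequest;
  if (acceptsImages && typeof maxImages === "number" && imageCount > maxImages) {
    throw new Error(`${model.id} accepts at most ${maxImages} images per request`);
  }
  return limited;
}

// Count image parts across all messages; plain string content has none.
function countImages(messages) {
  let count = 0;
  for (const message of messages) {
    if (!Array.isArray(message.content)) continue;
    count += message.content.filter((part) => part.type === "image_url").length;
  }
  return count;
}

const capped = applyModelLimits(
  { id: "meta-llama/Llama-3.3-70B-Instruct", input: ["text"], maxTokens: 8192 },
  { max_tokens: 50000, messages: [{ role: "user", content: "Hello" }] }
);
console.log(capped.max_tokens); // prints 8192
```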
Adding Custom Models
Do not edit models.json directly — it is auto-generated from the API. To customize:
- Override an existing model: Add entries to patch.json (reasoning, compat, notes, maxTokens, etc.)
- Add new models not in the API: Add entries to custom-models.json:
[
{
"id": "my-org/my-model",
"name": "My Custom Model",
"reasoning": false,
"input": ["text"],
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
"contextWindow": 131072,
"maxTokens": 16384,
"baseUrl": "https://inference.makora.com/my-model-slug/v1"
}
]
API Notes
- Each model is accessible at https://inference.makora.com/v1/chat/completions (unified endpoint)
- Models with a baseUrl override use their per-slug endpoint instead
- The API is OpenAI-compatible (chat completions format)
- All models are hosted on vLLM
- The developer role is not supported (prompts are silently dropped); supportsDeveloperRole is set to false for all models
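The endpoint and role notes above could be handled roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the per-slug URL is assumed to follow the usual OpenAI-compatible pattern of baseUrl plus /chat/completions, and re-labelling developer messages as system is one plausible way to keep their text from being dropped, not necessarily what the plugin does.

```javascript
// Unified base URL taken from the API notes; per-slug models override it via baseUrl.
const UNIFIED_BASE_URL = "https://inference.makora.com/v1";

// Pick the per-slug endpoint when a model defines baseUrl, else the unified one.
function chatCompletionsUrl(model) {
  const base = (model.baseUrl ?? UNIFIED_BASE_URL).replace(/\/+$/, "");
  return `${base}/chat/completions`;
}

// Assumption: sending developer prompts as system messages avoids the silent drop.
function downgradeDeveloperRole(messages) {
  return messages.map((message) =>
    message.role === "developer" ? { ...message, role: "system" } : message
  );
}

console.log(chatCompletionsUrl({ id: "zai-org/GLM-5.1-FP8" }));
// prints https://inference.makora.com/v1/chat/completions
console.log(chatCompletionsUrl({ baseUrl: "https://inference.makora.com/my-model-slug/v1" }));
// prints https://inference.makora.com/my-model-slug/v1/chat/completions
```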
vLLM Caveats
These issues are common to all vLLM-hosted providers and affect Makora models:
GLM 5.1 tool calling: vLLM's streaming tool call handling is broken for GLM. The model outputs Zhipu's native <tool_call> XML format as raw text. The message_end hook parses this into toolCall blocks so OMP / pi can execute the tools. A context hook then strips tool_calls from assistant messages before follow-up requests, converting them back to <tool_call> text to avoid a ZAI/vLLM server crash (500: 'str object' has no attribute 'items') that occurs when any assistant message contains a tool_calls field. If upstream fixes both the streaming parser and the 500 crash, the message_end hook gracefully skips (existing valid toolCall blocks are preserved), and the context hook's text-stripping is harmless (GLM natively understands <tool_call> text).
Kimi K2.6 + Qwen 3.6 tool calling: vLLM's streaming tool call handling is broken or missing for these models. The before_provider_request hook sets tool_choice: "none" and skip_special_tokens: false so the model's tool call tokens pass through as plain text. The message_end hook then re-parses into toolCall blocks:
- Kimi K2.6: Uses <|tool_call_begin|>...<|tool_call_end|> tokens. Makora's vLLM is missing both --enable-auto-tool-choice and --tool-call-parser for this model.
- Qwen 3.6: Uses hermes-style <function=...> XML, sometimes with █ delimiters. Same vLLM flag limitation as Kimi.
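As an illustration of the client-side re-parsing described for Qwen, the sketch below extracts <function=...> calls from plain assistant text. It is not the plugin's parser. The <parameter=KEY>VALUE</parameter> layout, the surrounding <tool_call> wrapper in the sample, the function names, and the shape of the returned toolCall objects are all assumptions made for the example.

```javascript
// Hypothetical sketch: recover tool calls from text shaped like
// <function=NAME><parameter=KEY>VALUE</parameter>...</function>
function parseFunctionTagToolCalls(text) {
  const calls = [];
  const functionPattern = /<function=([^>\s]+)>([\s\S]*?)<\/function>/g;
  for (const [, name, body] of text.matchAll(functionPattern)) {
    const args = {};
    const parameterPattern = /<parameter=([^>\s]+)>([\s\S]*?)<\/parameter>/g;
    for (const [, key, raw] of body.matchAll(parameterPattern)) {
      args[key] = coerce(raw.trim());
    }
    // Assumed block shape; the real toolCall structure may carry other fields.
    calls.push({ type: "toolCall", id: `call_${calls.length}`, name, arguments: args });
  }
  return calls;
}

// Parameter values are untyped text; try JSON first so numbers,
// booleans, and objects survive, and fall back to the raw string.
function coerce(value) {
  try {
    return JSON.parse(value);
  } catch {
    return value;
  }
}

const sample = [
  "<tool_call>",
  "<function=get_weather>",
  "<parameter=city>",
  "Beijing",
  "</parameter>",
  "<parameter=days>",
  "3",
  "</parameter>",
  "</function>",
  "</tool_call>",
].join("\n");
console.log(JSON.stringify(parseFunctionTagToolCalls(sample), null, 2));
```

Text without any <function=...> tag yields an empty array, so a hook built this way can leave ordinary assistant messages untouched.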
GLM 5.1 CoT leak: On some vLLM builds, disabling reasoning may still leak chain-of-thought into content terminated by a ``` marker. See vllm-project/vllm#31319.
DeepSeek V4 reasoning: The official DeepSeek API uses thinking: { type: "enabled" } which Makora's vLLM silently ignores. The before_provider_request hook rewrites the payload to use vLLM-native params instead:
- DS V4 Pro: chat_template_kwargs: { thinking: true }. Returns reasoning_content.
- DS V4 Flash: include_reasoning: true + chat_template_kwargs: { thinking: true }. include_reasoning alone returns reasoning: null on this vLLM build, so both params are required. Returns reasoning.
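The DeepSeek payload rewrite can be pictured with the following sketch. The function name rewriteDeepSeekThinking is hypothetical and this is not the hook's real code. It also assumes that a thinking value other than "enabled" should map to chat_template_kwargs.thinking: false, which the notes above do not state.

```javascript
// Model IDs from the Models table.
const DEEPSEEK_FLASH = "deepseek-ai/DeepSeek-V4-Flash";
const DEEPSEEK_PRO = "deepseek-ai/DeepSeek-V4-Pro";

// Hypothetical sketch: translate the DeepSeek API-style `thinking` field
// into the vLLM-native parameters described above.
function rewriteDeepSeekThinking(payload) {
  const isDeepSeek = payload.model === DEEPSEEK_FLASH || payload.model === DEEPSEEK_PRO;
  if (!isDeepSeek || !payload.thinking) return payload;

  // Drop the field the server ignores and keep everything else.
  const { thinking, ...rest } = payload;
  const enabled = thinking.type === "enabled";
  const rewritten = {
    ...rest,
    chat_template_kwargs: { ...rest.chat_template_kwargs, thinking: enabled },
  };

  // Flash needs include_reasoning as well, otherwise reasoning comes back null.
  if (payload.model === DEEPSEEK_FLASH && enabled) {
    rewritten.include_reasoning = true;
  }
  return rewritten;
}

console.log(
  rewriteDeepSeekThinking({
    model: DEEPSEEK_FLASH,
    messages: [{ role: "user", content: "Hello" }],
    thinking: { type: "enabled" },
  })
);
```

The function returns a new object instead of mutating its input, which keeps the original payload available for logging or retries.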
GLM 5.1 reasoning: Returns reasoning_content (not reasoning). OMP / pi's OpenAI completions handler checks reasoning_content first, so this is handled correctly.
MiniMax M3 reasoning: Uses chat_template_kwargs.enable_thinking to toggle thinking (not chat_template_kwargs.thinking like DeepSeek). The before_provider_request hook rewrites the DeepSeek API-style thinking param into vLLM-native chat_template_kwargs: { enable_thinking: true }. Returns reasoning_content field.
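Because models differ in whether they return reasoning_content or reasoning, a reader of the response has to check both. The sketch below mirrors the order described above (reasoning_content first). The function name extractReasoning and its return shape are hypothetical, and it is not the OMP / pi handler itself.

```javascript
// Hypothetical sketch: read reasoning text from an OpenAI-style message
// or streaming delta, preferring reasoning_content over reasoning.
function extractReasoning(message) {
  for (const field of ["reasoning_content", "reasoning"]) {
    const value = message?.[field];
    if (typeof value === "string" && value.length > 0) {
      return { field, text: value };
    }
  }
  return null; // no reasoning present (or it was null, as with include_reasoning alone)
}

console.log(extractReasoning({ content: "4", reasoning_content: "2 + 2 = 4" }));
// prints { field: 'reasoning_content', text: '2 + 2 = 4' }
console.log(extractReasoning({ content: "4", reasoning: "2 + 2 = 4" }));
// prints { field: 'reasoning', text: '2 + 2 = 4' }
```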
