omp-makora-provider

v1.0.1

Published

2 days ago

Makora provider plugin for OMP — Access DeepSeek V4, GLM 5.1, Kimi K2.6, Llama 3.3, Qwen 3.6, and more through the Makora inference API


ryanjoserbrosas

omp plugin provider makora ai llm deepseek glm kimi llama qwen

🔁 omp-makora-provider

Open-weight models through Makora

DeepSeek V4, Kimi K2.6, GLM 5.1 / 5.2, Qwen 3.6 — with client-side tool call repair for OMP / pi.

Models

| Model | ID | Reasoning | Notes |
|-------|----|-----------|-------|
| DeepSeek V4 Flash | deepseek-ai/DeepSeek-V4-Flash | Yes | maxTokens 32768; include_reasoning + chat_template_kwargs.thinking via before_provider_request payload rewrite; returns reasoning field |
| DeepSeek V4 Pro | deepseek-ai/DeepSeek-V4-Pro | Yes | maxTokens 32768; chat_template_kwargs.thinking via before_provider_request payload rewrite; returns reasoning_content field |
| GLM 5.1 FP8 | zai-org/GLM-5.1-FP8 | Yes | maxTokens 16384; enable_thinking via qwen-chat-template; returns reasoning_content field; client-side tool call parsing (vLLM streaming parser bypass) |
| GLM 5.2 FP8 | zai-org/GLM-5.2-FP8 | Yes | maxTokens 16384; enable_thinking via qwen-chat-template; returns reasoning field; native tool calls work in both stream and non-stream (no client-side repair needed) |
| GPT-OSS 120B | openai/gpt-oss-120b | Yes | maxTokens 16384; reasoning always on |
| Kimi K2.6 NVFP4 | nvidia/Kimi-K2.6-NVFP4 | Yes | maxTokens 16384; vision maxImagesPerRequest 5; reasoning on by default; client-side tool call parsing (vLLM streaming parser bypass) |
| Kimi K2.7 Code | moonshotai/Kimi-K2.7-Code | Yes | maxTokens 16384; vision maxImagesPerRequest 5; reasoning on by default; client-side tool call parsing (vLLM streaming parser bypass) |
| Llama 3.3 70B FP8 | amd/Llama-3.3-70B-Instruct-FP8-KV | No | maxTokens 16384; custom per-slug endpoint |
| Llama 3.3 70B Instruct | meta-llama/Llama-3.3-70B-Instruct | No | maxTokens 8192; non-reasoning text-only model |
| MiniMax M3 MXFP8 | MiniMaxAI/MiniMax-M3-MXFP8 | Yes | maxTokens 16384; vision maxImagesPerRequest 5; reasoning via chat_template_kwargs.enable_thinking; returns reasoning_content field |
| Qwen 3.6 27B NVFP4 | unsloth/Qwen3.6-27B-NVFP4 | Yes | maxTokens 16384; enable_thinking via qwen-chat-template; client-side tool call parsing (vLLM streaming parser bypass) |
| Qwen 3.6 35B A3B NVFP4 | unsloth/Qwen3.6-35B-A3B-NVFP4 | Yes | maxTokens 16384; enable_thinking via qwen-chat-template; client-side tool call parsing (vLLM streaming parser bypass) |

Quickstart

Install from npm, then log in once:

# 1. Install the plugin from npm
omp plugin install omp-makora-provider

# 2. Open OMP / pi
omp

# 3. Add your Makora API key
/login makora

# 4. Pick a Makora model
/model makora

That's it. Makora models now appear in /model. No -e flag, no manual clone, no config files.

API key

/login makora prompts for your Makora API key, validates it, and stores it.

If you prefer environment variables:

export MAKORA_OPTIMIZE_TOKEN=your-api-key

Get a key at inference.makora.com.

Install sources

# npm registry (recommended)
omp plugin install omp-makora-provider

# GitHub
omp plugin install https://github.com/ryan-brosas/omp-makora-provider

# Local development
git clone https://github.com/ryan-brosas/omp-makora-provider.git
omp plugin link ./omp-makora-provider

Model Resolution

Models are discovered from the Makora /v1/models API and stored in models.json. Custom definitions and overrides are layered via patch.json and custom-models.json.

| File | Purpose |
|---|---|
| models.json | Auto-generated from Makora API (model discovery). Regenerated by node scripts/update-models.js; do not edit manually |
| patch.json | Manual overrides (reasoning, compat, notes, limits, etc.) applied on top of models.json |
| custom-models.json | Models not available via the API (e.g. per-slug endpoint models) |

Models are loaded in three steps: read models.json, apply patch.json on top of it, then merge in custom-models.json.
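The load order above can be sketched roughly as follows. This is an illustration, not the plugin's actual loader: the `ModelEntry` and `PatchFile` shapes, the `resolveModels` name, and the idea that patch.json is keyed by model ID are all assumptions made for the example.

```typescript
// Hypothetical entry shape; the real files may carry more fields.
interface ModelEntry {
  id: string;
  name?: string;
  reasoning?: boolean;
  maxTokens?: number;
  contextWindow?: number;
  baseUrl?: string;
}

// Assumption: patch.json maps a model ID to the fields it overrides.
type PatchFile = Record<string, Partial<Omit<ModelEntry, "id">>>;

function resolveModels(
  discovered: ModelEntry[], // contents of models.json
  patch: PatchFile,         // contents of patch.json
  custom: ModelEntry[],     // contents of custom-models.json
): ModelEntry[] {
  const byId = new Map<string, ModelEntry>();

  // Steps 1 and 2: start from the discovered models and apply overrides on top.
  for (const model of discovered) {
    byId.set(model.id, { ...model, ...(patch[model.id] ?? {}) });
  }

  // Step 3: merge custom models.
  // Assumption: a custom entry with an existing ID replaces the discovered one.
  for (const model of custom) {
    byId.set(model.id, { ...model });
  }

  return Array.from(byId.values());
}

const resolved = resolveModels(
  [{ id: "zai-org/GLM-5.1-FP8", name: "GLM 5.1 FP8" }],
  { "zai-org/GLM-5.1-FP8": { reasoning: true, maxTokens: 16384 } },
  [{ id: "my-org/my-model", baseUrl: "https://inference.makora.com/my-model-slug/v1" }],
);
```

Keeping the three sources separate means a regenerated models.json never overwrites manual overrides.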

Patch metadata fields

patch.json supports the same model metadata fields consumed by the provider, including reasoning, input, contextWindow, maxTokens, vision, notes, thinkingLevelMap, and compat. Use maxTokens for safe output caps because Makora model discovery does not report max output tokens. Use vision.maxImagesPerRequest for multimodal request limits when a model declares input: ["text", "image"].

Adding Custom Models

Do not edit models.json directly — it is auto-generated from the API. To customize:

- Override an existing model: add entries to patch.json (reasoning, compat, notes, maxTokens, etc.)
- Add new models not in the API: add entries to custom-models.json:

[
  {
    "id": "my-org/my-model",
    "name": "My Custom Model",
    "reasoning": false,
    "input": ["text"],
    "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
    "contextWindow": 131072,
    "maxTokens": 16384,
    "baseUrl": "https://inference.makora.com/my-model-slug/v1"
  }
]
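Before adding an entry, it can help to sanity-check it. The sketch below mirrors the field names from the JSON example above; the `CustomModel` type, the `checkCustomModel` helper, and every rule it enforces are assumptions for illustration, since the README does not say which fields the provider requires or validates.

```typescript
// Field names follow the JSON example; required vs optional is an assumption.
interface CustomModel {
  id: string;
  name: string;
  reasoning: boolean;
  input: string[];
  cost: { input: number; output: number; cacheRead: number; cacheWrite: number };
  contextWindow: number;
  maxTokens: number;
  baseUrl?: string;
}

// Hypothetical helper: returns a list of problems, empty when the entry looks sane.
function checkCustomModel(model: CustomModel): string[] {
  const problems: string[] = [];
  if (model.id.trim() === "") {
    problems.push("id must not be empty");
  }
  if (model.input.indexOf("text") === -1) {
    problems.push('input should include "text"');
  }
  if (model.maxTokens > model.contextWindow) {
    problems.push("maxTokens exceeds contextWindow");
  }
  // Assumption: per-slug endpoints look like the example URL (https, ending in /v1).
  if (model.baseUrl !== undefined && !/^https:\/\/.+\/v1$/.test(model.baseUrl)) {
    problems.push("baseUrl should be an https URL ending in /v1");
  }
  return problems;
}

const exampleEntry: CustomModel = {
  id: "my-org/my-model",
  name: "My Custom Model",
  reasoning: false,
  input: ["text"],
  cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
  contextWindow: 131072,
  maxTokens: 16384,
  baseUrl: "https://inference.makora.com/my-model-slug/v1",
};
```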

API Notes

- Each model is accessible at https://inference.makora.com/v1/chat/completions (unified endpoint)
- Models with a baseUrl override use their per-slug endpoint instead
- The API is OpenAI-compatible (chat completions format)
- All models are hosted on vLLM
- The developer role is not supported (prompts are silently dropped); supportsDeveloperRole is set to false for all models
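As a rough illustration of these notes, the sketch below picks the endpoint for a model and builds a chat completions request. The helper names are hypothetical, the `Authorization: Bearer` header is assumed from the OpenAI-compatible convention rather than stated in this README, and mapping `developer` messages to `system` is only one possible way to avoid the silent drop, not necessarily what OMP / pi does.

```typescript
type Role = "system" | "developer" | "user" | "assistant" | "tool";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ModelRef {
  id: string;
  baseUrl?: string; // per-slug override, as in custom-models.json
}

const UNIFIED_BASE_URL = "https://inference.makora.com/v1";

// Use the per-slug base URL when present, otherwise the unified endpoint.
function chatCompletionsUrl(model: ModelRef): string {
  const base = (model.baseUrl ?? UNIFIED_BASE_URL).replace(/\/+$/, "");
  return `${base}/chat/completions`;
}

function buildChatRequest(model: ModelRef, messages: ChatMessage[], apiKey: string) {
  // Assumption: rewrite developer messages to system so they are not dropped.
  const safeMessages = messages.map((m): ChatMessage =>
    m.role === "developer" ? { ...m, role: "system" } : m,
  );
  return {
    url: chatCompletionsUrl(model),
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // assumed auth scheme
    },
    body: JSON.stringify({ model: model.id, messages: safeMessages }),
  };
}

const exampleRequest = buildChatRequest(
  { id: "zai-org/GLM-5.2-FP8" },
  [
    { role: "developer", content: "Be brief." },
    { role: "user", content: "Hello" },
  ],
  "your-api-key",
);
```

The returned object can be passed to any HTTP client; nothing is sent by the sketch itself.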

vLLM Caveats

These issues are common to all vLLM-hosted providers and affect Makora models:

GLM 5.1 tool calling: vLLM's streaming tool call handling is broken for GLM — the model outputs Zhipu's native <tool_call> XML format as raw text. The message_end hook parses this into toolCall blocks so OMP / pi can execute the tools. A context hook then strips tool_calls from assistant messages before follow-up requests, converting them back to <tool_call> text to avoid a ZAI/vLLM server crash (500: 'str object' has no attribute 'items') that occurs when any assistant message contains a tool_calls field. If upstream fixes both the streaming parser and the 500 crash, the message_end hook gracefully skips (existing valid toolCall blocks are preserved), and the context hook's text-stripping is harmless (GLM natively understands <tool_call> text).
Kimi K2.6 + Qwen 3.6 tool calling: vLLM's streaming tool call handling is broken or missing for these models. The before_provider_request hook sets tool_choice: "none" and skip_special_tokens: false so the model's tool call tokens pass through as plain text. The message_end hook then re-parses that text into toolCall blocks:
- Kimi K2.6: Uses <|tool_call_begin|>...<|tool_call_end|> tokens. Makora's vLLM is missing both --enable-auto-tool-choice and --tool-call-parser for this model.
- Qwen 3.6: Uses hermes-style <function=...> XML, sometimes with █ delimiters. Same vLLM flag limitation as Kimi.
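To make the re-parsing step concrete, here is a minimal sketch for the Kimi-style tokens. It is not the plugin's parser. The layout inside the markers (`functions.NAME:INDEX`, then `<|tool_call_argument_begin|>`, then a JSON object) is an assumption about the model's output, and `ToolCall` and `parseKimiToolCalls` are hypothetical names.

```typescript
interface ToolCall {
  id: string;
  name: string;
  arguments: Record<string, unknown>;
}

function parseKimiToolCalls(text: string): ToolCall[] {
  // Assumed layout: <|tool_call_begin|>ID<|tool_call_argument_begin|>JSON<|tool_call_end|>
  const pattern =
    /<\|tool_call_begin\|>\s*([\s\S]*?)\s*<\|tool_call_argument_begin\|>\s*([\s\S]*?)\s*<\|tool_call_end\|>/g;
  const calls: ToolCall[] = [];
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(text)) !== null) {
    const id = match[1] ?? "";
    const rawArgs = match[2] ?? "";

    // Assumed ID format "functions.get_weather:0" yields the name "get_weather".
    const name = id.replace(/^functions\./, "").replace(/:\d+$/, "");

    let parsed: unknown;
    try {
      parsed = JSON.parse(rawArgs);
    } catch {
      continue; // skip calls whose arguments are not valid JSON
    }
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      continue; // arguments must be a JSON object
    }
    calls.push({ id, name, arguments: parsed as Record<string, unknown> });
  }
  return calls;
}

const sampleOutput =
  "Let me check." +
  '<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>{"city":"Paris"}<|tool_call_end|>';
const sampleCalls = parseKimiToolCalls(sampleOutput);
```

Malformed calls are skipped rather than thrown so that one bad block does not discard the rest of the message.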
GLM 5.1 CoT leak: On some vLLM builds, disabling reasoning may still leak chain-of-thought into content terminated by a ``` marker. See vllm-project/vllm#31319.
DeepSeek V4 reasoning: The official DeepSeek API uses thinking: { type: "enabled" } which Makora's vLLM silently ignores. The before_provider_request hook rewrites the payload to use vLLM-native params instead:
- DS V4 Pro: chat_template_kwargs: { thinking: true }. Returns reasoning_content.
- DS V4 Flash: include_reasoning: true + chat_template_kwargs: { thinking: true }. include_reasoning alone returns reasoning: null on this vLLM build — both params are required. Returns reasoning.
GLM 5.1 reasoning: Returns reasoning_content (not reasoning). OMP / pi's OpenAI completions handler checks reasoning_content first, so this is handled correctly.
MiniMax M3 reasoning: Uses chat_template_kwargs.enable_thinking to toggle thinking (not chat_template_kwargs.thinking like DeepSeek). The before_provider_request hook rewrites the DeepSeek API-style thinking param into vLLM-native chat_template_kwargs: { enable_thinking: true }. Returns reasoning_content field.
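The reasoning rewrites described above for DeepSeek V4 and MiniMax M3 could look roughly like this. It is a sketch under stated assumptions, not the plugin's before_provider_request hook: `ChatPayload` and `rewriteThinking` are hypothetical names, the model is matched by the IDs from the Models table, and treating any `thinking.type` other than "enabled" as "off" is a guess.

```typescript
interface ChatPayload {
  model: string;
  thinking?: { type: string }; // DeepSeek API-style parameter
  include_reasoning?: boolean;
  chat_template_kwargs?: Record<string, unknown>;
  [key: string]: unknown;
}

const DEEPSEEK_FLASH = "deepseek-ai/DeepSeek-V4-Flash";
const DEEPSEEK_PRO = "deepseek-ai/DeepSeek-V4-Pro";
const MINIMAX_M3 = "MiniMaxAI/MiniMax-M3-MXFP8";

function rewriteThinking(payload: ChatPayload): ChatPayload {
  const { thinking, ...rest } = payload;
  if (thinking === undefined) return payload; // nothing to rewrite

  // Assumption: anything other than "enabled" means thinking is off.
  const enabled = thinking.type === "enabled";
  const kwargs: Record<string, unknown> = { ...(rest.chat_template_kwargs ?? {}) };
  const out: ChatPayload = { ...rest, chat_template_kwargs: kwargs };

  switch (payload.model) {
    case DEEPSEEK_PRO:
      kwargs["thinking"] = enabled;
      break;
    case DEEPSEEK_FLASH:
      // Both parameters are needed on Flash, per the note above.
      kwargs["thinking"] = enabled;
      out.include_reasoning = enabled;
      break;
    case MINIMAX_M3:
      kwargs["enable_thinking"] = enabled;
      break;
    default:
      return payload; // leave other models untouched
  }
  return out;
}

const flashPayload = rewriteThinking({
  model: DEEPSEEK_FLASH,
  thinking: { type: "enabled" },
  messages: [],
});
```

The original payload is never mutated, so a caller can still inspect what was sent before the rewrite.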
