omp-lilac-provider

v1.0.0

Published

3 days ago

Lilac provider plugin for OMP — Access Kimi K2.6, GLM 5.1, Gemma 4, and MiniMax M2.7 models through Lilac's OpenAI-compatible API on idle GPUs

ryanjoserbrosas

omp plugin provider lilac ai llm kimi glm gemma idle-gpu

💜 omp-lilac-provider

Kimi K2.6, GLM 5.1, Gemma 4 & more on idle GPUs via Lilac

An OMP provider plugin for cost-efficient GPU inference.

Access Kimi K2.6, GLM 5.1, MiniMax M2.7, and Gemma 4 models through Lilac's OpenAI-compatible API on idle GPUs.

Features

4 AI Models — Kimi K2.6, GLM 5.1, Gemma 4, and MiniMax M2.7
OpenAI-Compatible API — Just change the base URL and API key
Cost Tracking — Per-model pricing with cache read discounts
Reasoning Models — Chain-of-thought via chat_template_kwargs (all models)
Vision Support — Image input on Kimi K2.6 and Gemma 4
Context Caching — Cache read pricing on Kimi K2.6, GLM 5.1, and MiniMax M2.7
Idle GPU Scheduling — Lilac leverages idle GPU capacity for cost-efficient inference
Live Model Sync — Stale-while-revalidate: serve cached models instantly, hot-swap from the API in the background
Discount Tracking — Fetches subscription discounts from the /status endpoint and applies them to model costs

Quickstart

# 1. Install
omp plugin install omp-lilac-provider

# 2. Add your API key
omp
/login lilac

# 3. Pick a model and go
/model lilac

That's it. Lilac models now appear in /model. No -e flag, no manual clone, no config files.

Models

| Model | Context | Vision | Reasoning | Input $/M | Cache Read $/M | Output $/M |
|-------|---------|--------|-----------|-----------|-----------------|------------|
| Gemma 4 | 262K | ✅ | ✅ | $0.11 | — | $0.35 |
| GLM 5.1 | 203K | ❌ | ✅ | $0.90 | $0.27 | $3.00 |
| Kimi K2.6 | 262K | ✅ | ✅ | $0.70 | $0.20 | $3.50 |
| MiniMax M2.7 | 205K | ❌ | ✅ | $0.30 | $0.06 | $1.20 |

Costs are per million tokens. Prices subject to change — check getlilac.com for current pricing.

Notes:

Gemma 4 has reasoning off by default — OMP enables it when you set a thinking level (Shift+Tab)
Kimi K2.6 and GLM 5.1 have reasoning on by default
Cache read pricing applies to repeated input tokens served from cache on supported models
Gemma 4 does not support cache read pricing
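
As a rough illustration of how the per-million prices and cache read discounts combine, here is a minimal sketch of a cost estimate. The prices are copied from the table above; the function name, the usage field names, and the choice to bill cached tokens at the normal input rate on models without a cache read price are assumptions made for this example, not the plugin's actual code.

```javascript
// Per-million-token prices in USD, copied from the model table.
// cacheRead is null where the table lists no cache read price.
const PRICES = {
  "Gemma 4": { input: 0.11, cacheRead: null, output: 0.35 },
  "GLM 5.1": { input: 0.9, cacheRead: 0.27, output: 3.0 },
  "Kimi K2.6": { input: 0.7, cacheRead: 0.2, output: 3.5 },
  "MiniMax M2.7": { input: 0.3, cacheRead: 0.06, output: 1.2 },
};

// usage = { inputTokens, cachedTokens, outputTokens } (hypothetical field names).
// cachedTokens is the part of inputTokens that was served from cache.
function estimateCostUsd(model, usage) {
  const price = PRICES[model];
  if (!price) throw new Error(`Unknown model: ${model}`);

  const cached = usage.cachedTokens ?? 0;
  const fresh = usage.inputTokens - cached;
  // Assumption: without a cache read price, cached tokens cost the input rate.
  const cacheRate = price.cacheRead ?? price.input;

  const total =
    fresh * price.input + cached * cacheRate + usage.outputTokens * price.output;
  return total / 1e6; // prices are per million tokens
}

const cost = estimateCostUsd("Kimi K2.6", {
  inputTokens: 1_000_000,
  cachedTokens: 400_000,
  outputTokens: 100_000,
});
console.log(cost.toFixed(2));
```

For the Kimi K2.6 call shown, 600K fresh input tokens, 400K cached tokens, and 100K output tokens come to about $0.85 before any subscription discount.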

API key

/login lilac prompts for your Lilac API key, validates it against Lilac's authenticated chat-completions endpoint, and stores it. Or set it explicitly:

export LILAC_API_KEY=your-api-key

Get a key at getlilac.com.

Other install paths

# From GitHub
omp plugin install https://github.com/ryan-brosas/omp-lilac-provider

# Local development
git clone https://github.com/ryan-brosas/omp-lilac-provider.git
omp plugin link ./omp-lilac-provider

Usage

After loading the extension, use the /model command in OMP to select your preferred model:

/model lilac moonshotai/kimi-k2.6

Or start OMP directly with a Lilac model:

omp --provider lilac --model moonshotai/kimi-k2.6

Thinking Mode

All Lilac models support chain-of-thought reasoning via chat_template_kwargs. OMP uses the qwen-chat-template thinking format to send both thinking and enable_thinking keys, which works across all model families:

Kimi K2.6: Honors thinking key (Moonshot template)
GLM 5.1: Honors enable_thinking key (Z.ai template)
Gemma 4: Honors enable_thinking key (Google template)

In OMP, reasoning models automatically use the appropriate thinking format. Use Shift+Tab to control thinking level.
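
To make the two-key approach concrete, the sketch below builds a request body that carries both thinking and enable_thinking under chat_template_kwargs. The helper name is hypothetical, and how OMP maps a thinking level to an on/off value is not described here, so a plain boolean is assumed.

```javascript
// Build the chat_template_kwargs fragment of a request body.
// Both keys carry the same value, so a template that reads either one
// (thinking or enable_thinking) sees the same setting.
function buildThinkingKwargs(enabled) {
  return {
    chat_template_kwargs: { thinking: enabled, enable_thinking: enabled },
  };
}

const thinkingBody = {
  model: "moonshotai/kimi-k2.6",
  messages: [{ role: "user", content: "Summarize this repo in one line." }],
  ...buildThinkingKwargs(true),
};

console.log(JSON.stringify(thinkingBody.chat_template_kwargs));
```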

Vision

Kimi K2.6 and Gemma 4 support image inputs. Pass images in messages and OMP will handle the formatting automatically.

Gemma 4 also supports video by accepting a sequence of frames as images.
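
OMP builds these messages for you, but for readers calling the API directly, the sketch below shows the standard OpenAI chat-completions shape for image input: a content array that mixes text and image_url parts. The helper name is hypothetical, and it assumes Lilac accepts base64 data URLs the way the OpenAI API does. Passing several images in order is one way to represent video frames.

```javascript
// Build a user message with text plus one or more base64-encoded images.
// For video, pass the frames in playback order.
function buildVisionMessage(text, base64Images, mimeType = "image/png") {
  const imageParts = base64Images.map((data) => ({
    type: "image_url",
    image_url: { url: `data:${mimeType};base64,${data}` },
  }));
  return { role: "user", content: [{ type: "text", text }, ...imageParts] };
}

// Placeholder bytes stand in for real image data.
const frame = Buffer.from("not a real image").toString("base64");
const visionMessage = buildVisionMessage("What changes between these frames?", [
  frame,
  frame,
]);
console.log(visionMessage.content.length);
```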

Model Resolution

Models are discovered from the Lilac /v1/models API and stored in models.json. Custom definitions and overrides are layered via patch.json and custom-models.json.

The extension uses a stale-while-revalidate strategy for zero-latency startup:

Serve stale immediately: disk cache → embedded models.json (zero-latency)
Revalidate in background: live API /v1/models → merge with embedded → cache → hot-swap
patch.json + custom-models.json applied on top of whichever source won
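
The startup flow above can be sketched roughly as follows. This is an illustration of the stale-while-revalidate pattern, not the extension's source: every function and option name is hypothetical, the disk cache and network fetch are injected so the sketch stays self-contained, and the patch.json and custom-models.json layering is left out for brevity.

```javascript
// Merge two model lists by id; fields from `incoming` win on conflict.
function mergeById(base, incoming) {
  const byId = new Map(base.map((m) => [m.id, m]));
  for (const m of incoming) byId.set(m.id, { ...byId.get(m.id), ...m });
  return [...byId.values()];
}

function createModelStore({ readDiskCache, embedded, fetchLive, writeDiskCache, onUpdate }) {
  // 1. Serve stale immediately: disk cache if present, else the embedded list.
  let current = readDiskCache() ?? embedded;

  // 2. Revalidate in the background so startup never waits on the network.
  const revalidated = fetchLive()
    .then((live) => {
      current = mergeById(embedded, live);
      writeDiskCache(current);
      onUpdate(current); // hot-swap: tell the host the list changed
    })
    .catch(() => {
      // Network or parse failure: keep serving the stale list.
    });

  return { get: () => current, revalidated };
}
```

Callers read `get()` right away and receive the refreshed list through `onUpdate` once the background fetch settles.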

| File | Purpose |
|---|---|
| models.json | Auto-generated from Lilac API (model discovery). Regenerated by node scripts/update-models.js — do not edit manually |
| patch.json | Manual overrides (reasoning, compat, notes, limits, etc.) applied on top of models.json |
| custom-models.json | Models not available via the API (e.g. per-slug endpoint models) |

Models are loaded in three steps: load models.json, apply the overrides in patch.json, then merge in custom-models.json.
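
A minimal sketch of that layering order is shown below. The shape of patch.json is not documented here, so the example assumes an object keyed by model id. The function name is hypothetical, and the shallow merge is a simplification: a nested field such as compat would be replaced wholesale rather than merged key by key.

```javascript
// models: array loaded from models.json
// patch:  object keyed by model id (assumed shape for patch.json)
// custom: array loaded from custom-models.json
function resolveModels(models, patch, custom) {
  const byId = new Map();
  // Steps 1 and 2: start from models.json and apply any override for the same id.
  for (const m of models) byId.set(m.id, { ...m, ...(patch[m.id] ?? {}) });
  // Step 3: custom models are added last; an entry with a duplicate id replaces the earlier one.
  for (const m of custom) byId.set(m.id, m);
  return [...byId.values()];
}

const resolved = resolveModels(
  [{ id: "moonshotai/kimi-k2.6", reasoning: false }],
  { "moonshotai/kimi-k2.6": { reasoning: true } },
  [{ id: "my-org/my-model", reasoning: false }]
);
console.log(resolved.map((m) => `${m.id}:${m.reasoning}`).join(" "));
```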

Adding Custom Models

To customize:

Override an existing model: Add entries to patch.json (reasoning, compat, notes, maxTokens, etc.)
Add new models not in the API: Add entries to custom-models.json:

[
  {
    "id": "my-org/my-model",
    "name": "My Custom Model",
    "reasoning": false,
    "input": ["text"],
    "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
    "contextWindow": 131072,
    "maxTokens": 16384,
    "baseUrl": "https://api.getlilac.com/my-model-slug/v1"
  }
]

API Notes

Each model is accessible at https://api.getlilac.com/v1/chat/completions (unified endpoint)
The API is OpenAI-compatible (chat completions format)
All models are hosted on vLLM
Lilac serves models via a customized fork of vLLM tuned for idle-GPU scheduling and shared warm endpoints
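
For readers who want to call the unified endpoint outside OMP, here is a hedged sketch using Node's built-in fetch (Node 18+). The endpoint URL and model id come from this README. The Bearer authorization header and the response shape follow the usual OpenAI conventions and are assumed to apply to Lilac as well. The function names are hypothetical.

```javascript
const LILAC_BASE_URL = "https://api.getlilac.com/v1";

// Build the request without sending it, so it can be inspected or logged.
function buildChatRequest(apiKey, model, messages) {
  return {
    url: `${LILAC_BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed OpenAI-style auth
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

// Send the request and return the first choice's text.
async function chat(apiKey, model, messages) {
  const { url, init } = buildChatRequest(apiKey, model, messages);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Lilac API error: HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

// Example (requires a real key and network access):
// chat(process.env.LILAC_API_KEY, "moonshotai/kimi-k2.6",
//      [{ role: "user", content: "Hello" }]).then(console.log);
```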

vLLM Caveats

These issues are common to all vLLM-hosted providers and affect Lilac models:

GLM 5.1 intermittent tool call loss: vLLM's streaming parser intermittently emits finish_reason: "tool_calls" without any delta.tool_calls chunks — even with tool_stream: true (set via zaiToolStream in compat). OMP maps this to stopReason: "toolUse" with zero toolCall blocks, causing an "abrupt stop". The extension's message_end handler converts this to a retryable error that triggers OMP's built-in auto-retry mechanism, so the agent automatically re-prompts and typically succeeds on the next attempt.
GLM 5.1 chain-of-thought leakage: On the current vLLM build, disabling reasoning on GLM 5.1 may still leak chain-of-thought into content, ending with a closing reasoning marker. When reasoning is disabled, post-process the response to discard everything up to and including the first occurrence of that marker. See vllm-project/vllm#31319.
Gemma 4 reasoning parser: vLLM's reasoning parser can fail to populate the reasoning field when special tokens are stripped before the parser runs. Clients that require a clean split should post-process <|channel|>thought ... <|channel|> markers. See vllm-project/vllm#38855.
Gemma 4 structured output: Combining enable_thinking: false with response_format: json_schema can silently disable xgrammar-backed structured output. If you rely on structured output with Gemma 4, leave thinking enabled or validate output client-side. See vllm-project/vllm#39130.
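
Two of these workarounds are simple enough to sketch. The first function flags a finished message that reports a tool-use stop but contains no tool call blocks, which is the condition the extension turns into a retryable error. The second drops leaked reasoning up to a closing marker. Both are illustrative only: the message shape (stopReason, content blocks with type "toolCall") is inferred from the description above, the function names are hypothetical, and the marker is passed in by the caller rather than hard-coded, so check the linked vLLM issue for its exact text.

```javascript
// True when the model claims it stopped to call a tool but sent no tool calls.
function isEmptyToolUseStop(message) {
  const blocks = message.content ?? [];
  const toolCalls = blocks.filter((block) => block.type === "toolCall");
  return message.stopReason === "toolUse" && toolCalls.length === 0;
}

// Discard everything up to and including the first occurrence of `marker`.
// Text without the marker is returned unchanged.
function stripLeakedReasoning(text, marker) {
  const index = text.indexOf(marker);
  if (index === -1) return text;
  return text.slice(index + marker.length).trimStart();
}

const suspicious = { stopReason: "toolUse", content: [{ type: "text", text: "Let me check." }] };
if (isEmptyToolUseStop(suspicious)) {
  console.log("retryable: tool-use stop without tool calls");
}
```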

Compat Settings

Lilac's API is OpenAI-compatible with these specifics:

thinkingFormat: "qwen-chat-template" — All reasoning models. Lilac uses chat_template_kwargs (with thinking and enable_thinking keys) to toggle reasoning. OMP sends both keys for forward compatibility.
maxTokensField: "max_completion_tokens" — All models. Lilac supports max_completion_tokens (preferred for reasoning models as it includes reasoning tokens).
supportsDeveloperRole: true — All models. Lilac's vLLM backend maps the developer role to system.
supportsStore: false — All models. Lilac doesn't support the store parameter.
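
To show what these flags mean for the request body, the sketch below applies three of them to generic request options. The four setting names and values are taken from the list above. The applyCompat helper, its option names, and the fallback that rewrites developer messages to system when a backend lacks developer-role support are assumptions for illustration, not OMP's internal logic.

```javascript
const LILAC_COMPAT = {
  thinkingFormat: "qwen-chat-template",
  maxTokensField: "max_completion_tokens",
  supportsDeveloperRole: true,
  supportsStore: false,
};

// Hypothetical helper: shape generic options into a provider-specific body.
function applyCompat(compat, { model, messages, maxTokens, store }) {
  const body = { model, messages };
  // Put the token limit in whichever field the provider expects.
  if (maxTokens !== undefined) body[compat.maxTokensField] = maxTokens;
  // Only forward `store` to providers that accept it.
  if (compat.supportsStore && store !== undefined) body.store = store;
  // Providers without developer-role support get system messages instead.
  if (!compat.supportsDeveloperRole) {
    body.messages = messages.map((m) =>
      m.role === "developer" ? { ...m, role: "system" } : m
    );
  }
  return body;
}

const compatBody = applyCompat(LILAC_COMPAT, {
  model: "moonshotai/kimi-k2.6",
  messages: [{ role: "developer", content: "Be concise." }],
  maxTokens: 1024,
  store: true,
});
console.log(Object.keys(compatBody).join(","));
```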

Updating Models

Run the update script to fetch the latest models from Lilac's API:

export LILAC_API_KEY=your-api-key
node scripts/update-models.js

This will:

Fetch models from https://api.getlilac.com/v1/models
Convert per-token pricing to per-million-tokens
Preserve existing curated data (pricing, compat) for known models
Apply overrides from patch.json
Update models.json and the README model table
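
The pricing conversion step is a multiplication by one million, but floating-point noise can leave values like 0.7000000000000001 in the output. A minimal sketch of one way to handle that is shown below. The function name is hypothetical, and both the string input format and the six-decimal rounding are assumptions, not a description of what scripts/update-models.js does.

```javascript
// Convert a per-token price (number or numeric string) to USD per million
// tokens, rounded to six decimal places to hide floating-point noise.
function perTokenToPerMillion(perToken) {
  const value = Number(perToken);
  if (!Number.isFinite(value)) throw new TypeError(`Invalid price: ${perToken}`);
  return Math.round(value * 1e12) / 1e6;
}

console.log(perTokenToPerMillion("0.0000007")); // Kimi K2.6 input price per million
```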

A GitHub Actions workflow runs this daily and creates a PR if models have changed.

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| LILAC_API_KEY | No | Your Lilac API key (fallback if not stored via /login) |
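
The fallback order described in the table can be sketched as a small lookup: prefer the key stored by /login, then fall back to the environment variable. The function name and the way the stored key is passed in are hypothetical, since this README does not describe how OMP stores credentials.

```javascript
// Prefer the key saved via /login; otherwise fall back to LILAC_API_KEY.
function resolveApiKey(storedKey, env = process.env) {
  const key = storedKey || env.LILAC_API_KEY;
  if (!key) {
    throw new Error("No Lilac API key found: run /login lilac or set LILAC_API_KEY");
  }
  return key;
}

console.log(resolveApiKey(undefined, { LILAC_API_KEY: "from-env" }));
```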

License

MIT