omp-lilac-provider
v1.0.0
Lilac provider plugin for OMP — Access Kimi K2.6, GLM 5.1, Gemma 4, and MiniMax M2.7 models through Lilac's OpenAI-compatible API on idle GPUs
💜 omp-lilac-provider
Kimi K2.6, GLM 5.1, Gemma 4 & more on idle GPUs via Lilac
An OMP provider plugin for cost-efficient GPU inference.
Access Kimi K2.6, GLM 5.1, MiniMax M2.7, and Gemma 4 models through Lilac's OpenAI-compatible API on idle GPUs.
Features
- 4 AI Models: Kimi K2.6, GLM 5.1, Gemma 4, and MiniMax M2.7
- OpenAI-Compatible API: Just change the base URL and API key
- Cost Tracking: Per-model pricing with cache read discounts
- Reasoning Models: Chain-of-thought via `chat_template_kwargs` (all models)
- Vision Support: Image input on Kimi K2.6 and Gemma 4
- Context Caching: Cache read pricing on Kimi K2.6, GLM 5.1, and MiniMax M2.7 (see the Models table)
- Idle GPU Scheduling: Lilac leverages idle GPU capacity for cost-efficient inference
- Live Model Sync: Stale-while-revalidate (serve cached models instantly, hot-swap from the API in the background)
- Discount Tracking: Fetches subscription discounts from the `/status` endpoint and applies them to model costs
Quickstart
# 1. Install
omp plugin install omp-lilac-provider
# 2. Add your API key
omp
/login lilac
# 3. Pick a model and go
/model lilac
That's it. Lilac models now appear in `/model`. No `-e` flag, no manual clone, no config files.
Models
| Model | Context | Vision | Reasoning | Input $/M | Cache Read $/M | Output $/M |
|-------|---------|--------|-----------|-----------|-----------------|------------|
| Gemma 4 | 262K | ✅ | ✅ | $0.11 | n/a | $0.35 |
| GLM 5.1 | 203K | ❌ | ✅ | $0.90 | $0.27 | $3.00 |
| Kimi K2.6 | 262K | ✅ | ✅ | $0.70 | $0.20 | $3.50 |
| MiniMax M2.7 | 205K | ❌ | ✅ | $0.30 | $0.06 | $1.20 |
Costs are per million tokens. Prices subject to change — check getlilac.com for current pricing.
Notes:
- Gemma 4 has reasoning off by default — OMP enables it when you set a thinking level (Shift+Tab)
- Kimi K2.6 and GLM 5.1 have reasoning on by default
- Cache read pricing applies to repeated input tokens served from cache on supported models
- Gemma 4 does not support cache read pricing
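To make the pricing notes concrete, here is a minimal sketch of how per-million prices and the cache read rate might combine into a request cost. The helper `estimateCost` and the usage field names are hypothetical and not part of the plugin; it also assumes that cached tokens are counted inside the input token total. Prices are taken from the Kimi K2.6 row of the table.

```javascript
// Hypothetical helper, not part of omp-lilac-provider: estimates request cost
// from per-million-token prices like those in the Models table.
function estimateCost(price, usage) {
  // Cached tokens are billed at the cache read rate when the model has one;
  // otherwise they are billed as ordinary input tokens.
  const cacheRate = price.cacheRead ?? price.input;
  // Assumption: usage.inputTokens already includes the cached tokens.
  const freshInput = usage.inputTokens - usage.cachedTokens;
  return (
    (freshInput * price.input +
      usage.cachedTokens * cacheRate +
      usage.outputTokens * price.output) /
    1_000_000
  );
}

// Kimi K2.6 prices from the table ($ per million tokens).
const kimi = { input: 0.7, cacheRead: 0.2, output: 3.5 };
const cost = estimateCost(kimi, {
  inputTokens: 1_000_000,
  cachedTokens: 400_000,
  outputTokens: 100_000,
});
console.log(cost.toFixed(2)); // prints "0.85"
```

The three terms are $0.42 for 600K fresh input tokens, $0.08 for 400K cached tokens, and $0.35 for 100K output tokens. Subscription discounts from `/status` are not modeled here.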
API key
/login lilac prompts for your Lilac API key, validates it against Lilac's authenticated chat-completions endpoint, and stores it. Or set it explicitly:
export LILAC_API_KEY=your-api-key
Get a key at getlilac.com.
Other install paths
# From GitHub
omp plugin install https://github.com/ryan-brosas/omp-lilac-provider
# Local development
git clone https://github.com/ryan-brosas/omp-lilac-provider.git
omp plugin link ./omp-lilac-provider
Usage
After loading the extension, use the /model command in OMP to select your preferred model:
/model lilac moonshotai/kimi-k2.6
Or start OMP directly with a Lilac model:
omp --provider lilac --model moonshotai/kimi-k2.6
Thinking Mode
All Lilac models support chain-of-thought reasoning via chat_template_kwargs. OMP uses the qwen-chat-template thinking format to send both thinking and enable_thinking keys, which works across all model families:
- Kimi K2.6: Honors the `thinking` key (Moonshot template)
- GLM 5.1: Honors the `enable_thinking` key (Z.ai template)
- Gemma 4: Honors the `enable_thinking` key (Google template)
In OMP, reasoning models automatically use the appropriate thinking format. Use Shift+Tab to control thinking level.
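For readers calling the API directly rather than through OMP, the request body might look roughly like the sketch below. The helper `buildChatBody` is hypothetical; only the `chat_template_kwargs` field and its two keys come from the description above, and the exact payload OMP sends may differ.

```javascript
// Sketch of a chat-completions request body that toggles reasoning.
// buildChatBody is a hypothetical helper, not an OMP or Lilac API.
function buildChatBody(model, messages, thinkingEnabled) {
  return {
    model,
    messages,
    // Both keys are sent so each model's template can honor the one it knows.
    chat_template_kwargs: {
      thinking: thinkingEnabled,
      enable_thinking: thinkingEnabled,
    },
  };
}

const body = buildChatBody(
  "moonshotai/kimi-k2.6",
  [{ role: "user", content: "Explain idle-GPU scheduling in one sentence." }],
  true
);
console.log(JSON.stringify(body.chat_template_kwargs));
// prints {"thinking":true,"enable_thinking":true}
```

Sending both keys means the same body works whichever key a given template reads, at the cost of one redundant field per request.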
Vision
Kimi K2.6 and Gemma 4 support image inputs. Pass images in messages and OMP will handle the formatting automatically.
Gemma 4 also supports video by accepting a sequence of frames as images.
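OMP handles image formatting for you, but if you call the API yourself, an image message in the standard OpenAI chat-completions content-part format might be built as follows. The helper `imageMessage` is hypothetical, and whether Lilac accepts base64 data URLs (as opposed to hosted image URLs) is an assumption to verify against Lilac's documentation.

```javascript
// Hypothetical helper: builds one user message carrying text plus an image,
// using the OpenAI chat-completions content-part format.
function imageMessage(text, imageBase64, mimeType = "image/png") {
  return {
    role: "user",
    content: [
      { type: "text", text },
      {
        type: "image_url",
        // Assumption: the endpoint accepts inline base64 data URLs.
        image_url: { url: `data:${mimeType};base64,${imageBase64}` },
      },
    ],
  };
}

// Placeholder bytes stand in for a real image file read from disk.
const encoded = Buffer.from("placeholder-image-bytes").toString("base64");
const message = imageMessage("What is in this picture?", encoded);
console.log(message.content.map((part) => part.type).join(", "));
// prints "text, image_url"
```

For Gemma 4 video input, the same structure could carry several `image_url` parts, one per frame, in order.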
Model Resolution
Models are discovered from the Lilac /v1/models API and stored in models.json. Custom definitions and overrides are layered via patch.json and custom-models.json.
The extension uses a stale-while-revalidate strategy for zero-latency startup:
- Serve stale immediately: disk cache → embedded `models.json` (zero-latency)
- Revalidate in background: live API `/models` → merge with embedded → cache → hot-swap

`patch.json` and `custom-models.json` are applied on top of whichever source won.
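The stale-while-revalidate flow above can be sketched as follows. This is a simplified illustration, not the extension's actual code: `loadModels` and the injected functions are hypothetical names, and the merge-with-embedded and cache-write steps are omitted for brevity.

```javascript
// Minimal stale-while-revalidate sketch. All names here are hypothetical,
// not the plugin's real API.
function loadModels({ readDiskCache, embeddedModels, fetchLiveModels, onRefresh }) {
  // 1. Serve stale immediately: prefer the disk cache, else the embedded list.
  const stale = readDiskCache() ?? embeddedModels;

  // 2. Revalidate in background: fetch live models, then hot-swap via callback.
  const revalidated = fetchLiveModels()
    .then((live) => {
      onRefresh(live);
      return live;
    })
    .catch(() => stale); // on fetch failure, keep serving the stale list

  return { models: stale, revalidated };
}

// Usage with stubbed sources so the sketch runs without network access.
let current;
const result = loadModels({
  readDiskCache: () => null, // pretend there is no disk cache yet
  embeddedModels: [{ id: "moonshotai/kimi-k2.6" }],
  fetchLiveModels: async () => [
    { id: "moonshotai/kimi-k2.6" },
    { id: "my-org/my-model" },
  ],
  onRefresh: (live) => {
    current = live; // hot-swap once the background fetch settles
  },
});
current = result.models; // available synchronously, before the fetch settles
console.log(current.length); // prints 1
```

The caller never waits on the network: the stale list is usable right away, and `onRefresh` replaces it when fresh data arrives.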
| File | Purpose |
|---|---|
| models.json | Auto-generated from Lilac API (model discovery). Regenerated by node scripts/update-models.js — do not edit manually |
| patch.json | Manual overrides (reasoning, compat, notes, limits, etc.) applied on top of models.json |
| custom-models.json | Models not available via the API (e.g. per-slug endpoint models) |
Models are loaded in three steps: load models.json, apply patch.json on top, then merge in custom-models.json.
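The layering order can be illustrated with a short sketch. The shape of patch.json is an assumption here (an object keyed by model id), as is the rule that custom models win on id collisions; `resolveModels` is a hypothetical name, not the extension's function.

```javascript
// Sketch of the three-step layering. Assumptions: patch.json is an object
// keyed by model id, and custom-models.json is an array of model definitions.
function resolveModels(models, patch, customModels) {
  const byId = new Map(models.map((m) => [m.id, { ...m }]));

  // Apply manual overrides only to models that exist in models.json.
  for (const [id, overrides] of Object.entries(patch)) {
    if (byId.has(id)) byId.set(id, { ...byId.get(id), ...overrides });
  }

  // Custom models are merged last (assumed to win on id collisions).
  for (const custom of customModels) {
    byId.set(custom.id, { ...byId.get(custom.id), ...custom });
  }
  return [...byId.values()];
}

const resolved = resolveModels(
  [{ id: "moonshotai/kimi-k2.6", reasoning: false }],
  { "moonshotai/kimi-k2.6": { reasoning: true } },
  [{ id: "my-org/my-model", reasoning: false }]
);
console.log(resolved.map((m) => `${m.id}:${m.reasoning}`).join(", "));
// prints "moonshotai/kimi-k2.6:true, my-org/my-model:false"
```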
Adding Custom Models
To customize:
- Override an existing model: Add entries to `patch.json` (reasoning, compat, notes, maxTokens, etc.)
- Add new models not in the API: Add entries to `custom-models.json`:
[
  {
    "id": "my-org/my-model",
    "name": "My Custom Model",
    "reasoning": false,
    "input": ["text"],
    "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
    "contextWindow": 131072,
    "maxTokens": 16384,
    "baseUrl": "https://api.getlilac.com/my-model-slug/v1"
  }
]
API Notes
- All models are accessible at `https://api.getlilac.com/v1/chat/completions` (unified endpoint)
- The API is OpenAI-compatible (chat completions format)
- All models are hosted on vLLM
- Lilac serves models via a customized fork of vLLM tuned for idle-GPU scheduling and shared warm endpoints
vLLM Caveats
These issues are common to all vLLM-hosted providers and affect Lilac models:
- GLM 5.1 intermittent tool call loss: vLLM's streaming parser intermittently emits `finish_reason: "tool_calls"` without any `delta.tool_calls` chunks, even with `tool_stream: true` (set via `zaiToolStream` in compat). OMP maps this to `stopReason: "toolUse"` with zero toolCall blocks, causing an "abrupt stop". The extension's `message_end` handler converts this to a retryable error that triggers OMP's built-in auto-retry mechanism, so the agent automatically re-prompts and typically succeeds on the next attempt.
- GLM 5.1 chain-of-thought leakage: On the current vLLM build, disabling reasoning on GLM 5.1 may still leak chain-of-thought into `content`, terminated by a marker. When reasoning is disabled, post-process the response to discard text up to and including the first marker. See vllm-project/vllm#31319.
- Gemma 4 reasoning parser: vLLM's reasoning parser can fail to populate the `reasoning` field when special tokens are stripped before the parser runs. Clients that require a clean split should post-process `<|channel|>thought ... <|channel|>` markers. See vllm-project/vllm#38855.
- Gemma 4 structured output: Combining `enable_thinking: false` with `response_format: json_schema` can silently disable xgrammar-backed structured output. If you rely on structured output with Gemma 4, leave thinking enabled or validate output client-side. See vllm-project/vllm#39130.
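The post-processing suggested for GLM 5.1 chain-of-thought leakage could look like the sketch below. The helper `stripLeakedReasoning` is hypothetical, and the marker is passed in as a parameter because the exact string depends on the model's chat template; the `</think>` value in the usage line is an assumed example, not something this README confirms.

```javascript
// Hypothetical post-processing helper for leaked chain-of-thought.
// The marker is a parameter; "</think>" below is an assumed example value.
function stripLeakedReasoning(content, marker) {
  const index = content.indexOf(marker);
  // No marker found: nothing leaked, so return the content untouched.
  if (index === -1) return content;
  // Drop everything up to and including the first marker.
  return content.slice(index + marker.length).trimStart();
}

console.log(
  stripLeakedReasoning("Let me think... </think>\nThe answer is 4.", "</think>")
);
// prints "The answer is 4."
```

Apply this only when reasoning was disabled for the request; with reasoning enabled, the same marker may legitimately separate reasoning from the answer.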
Compat Settings
Lilac's API is OpenAI-compatible with these specifics:
- `thinkingFormat: "qwen-chat-template"`: All reasoning models. Lilac uses `chat_template_kwargs` (with `thinking` and `enable_thinking` keys) to toggle reasoning. OMP sends both keys for forward compatibility.
- `maxTokensField: "max_completion_tokens"`: All models. Lilac supports `max_completion_tokens` (preferred for reasoning models as it includes reasoning tokens).
- `supportsDeveloperRole: true`: All models. Lilac's vLLM backend maps the developer role to system.
- `supportsStore: false`: All models. Lilac doesn't support the `store` parameter.
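As an illustration of what these flags mean for an outgoing request, the sketch below gathers them into one object and applies them to a generic request. How OMP stores and consumes compat settings internally is not described here, so `lilacCompat`, `applyCompat`, and the input request shape are all hypothetical.

```javascript
// The four compat flags listed above, gathered into one object.
const lilacCompat = {
  thinkingFormat: "qwen-chat-template",
  maxTokensField: "max_completion_tokens",
  supportsDeveloperRole: true,
  supportsStore: false,
};

// Hypothetical illustration of how such flags could shape a request body.
function applyCompat(request, compat) {
  const { maxTokens, store, messages, ...rest } = request;
  const body = { ...rest };
  // Send the token limit under the field name the provider expects.
  if (maxTokens !== undefined) body[compat.maxTokensField] = maxTokens;
  // Only forward `store` when the provider supports it.
  if (compat.supportsStore && store !== undefined) body.store = store;
  // Downgrade developer messages to system only when the role is unsupported.
  body.messages = messages.map((m) =>
    m.role === "developer" && !compat.supportsDeveloperRole
      ? { ...m, role: "system" }
      : m
  );
  return body;
}

const sent = applyCompat(
  {
    model: "moonshotai/kimi-k2.6",
    maxTokens: 1024,
    store: true,
    messages: [{ role: "developer", content: "Be brief." }],
  },
  lilacCompat
);
console.log(Object.keys(sent).join(", "));
// prints "model, max_completion_tokens, messages"
```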
Updating Models
Run the update script to fetch the latest models from Lilac's API:
export LILAC_API_KEY=your-api-key
node scripts/update-models.js
This will:
- Fetch models from `https://api.getlilac.com/v1/models`
- Convert per-token pricing to per-million-tokens
- Preserve existing curated data (pricing, compat) for known models
- Apply overrides from `patch.json`
- Update `models.json` and the README model table
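The pricing conversion step is simple arithmetic, sketched below. `toPerMillion` is a hypothetical helper, not the function used in `scripts/update-models.js`, and the per-token input value is derived from the Kimi K2.6 table price rather than taken from a real API response.

```javascript
// Hypothetical conversion helper: turns a per-token price into dollars per
// million tokens, the unit used in the README model table.
function toPerMillion(pricePerToken) {
  // Round to 6 decimals to hide floating-point noise from the multiplication.
  return Number((pricePerToken * 1_000_000).toFixed(6));
}

// $0.0000007 per token corresponds to the $0.70/M Kimi K2.6 input price.
console.log(toPerMillion(0.0000007)); // prints 0.7
```

The rounding step matters because multiplying small binary floating-point values can produce results such as 0.7000000000000001, which would otherwise leak into the generated table.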
A GitHub Actions workflow runs this daily and creates a PR if models have changed.
Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| LILAC_API_KEY | No | Your Lilac API key (fallback if not stored via /login) |
License
MIT
