@theaiinc/leyline

v1.3.3

Published

21 days ago

The ultimate cost-optimizing LLM load balancer & gateway.


stevetran

stevetrantheaiinc

Leyline 🔮

The ultimate cost-optimizing LLM load balancer, semantic router & gateway.

Leyline unifies multiple LLM providers — cloud (Gemini, HuggingFace, OpenAI, OpenRouter, Azure OpenAI), local (Ollama, LM Studio), and custom — into a single API. It handles failover, rate-limit management, and now includes a semantic router that classifies requests by complexity/domain and selects the optimal model tier.

graph TB
    Client["Your App / Agent Pipeline"]
    Leyline["Leyline Router"]
    Cloud["Cloud LLMs - Gemini · HuggingFace · OpenAI · OpenRouter · Azure OpenAI"]
    Local["Local LLMs - Ollama · LM Studio · vLLM"]

    Client -->|/v1/chat/completions| Leyline
    Client -->|/v1/route| Leyline

    Leyline -->|provider failover| Cloud
    Leyline -->|provider failover| Local

    subgraph Leyline["Leyline Router"]
        Router["Router - route / routeStream"]
        Classifier["Classifier - complexity/domain/reasoning"]
        Registry["ModelRegistry - capabilities / billing"]
        Policy["Code Policy - selectModelByRouter"]
        QM["QuotaManager - RPM / RPD limits"]
    end

    Router --> QM
    Router -->|resolveRoute| Classifier
    Router --> Policy
    Policy --> Registry

✨ Key Features

🛡️ Resilient Routing: Automatically falls back to the next provider if one fails or hits rate limits.
🌊 Seamless Streaming: Recovers from mid-stream failures by stitching context transparently.
🧠 Semantic Router: Classifies requests by complexity (simple/medium/complex), domain (chat/coding/planning/workflow/memory/extraction), and reasoning requirement.
🎯 Tiered Model Selection: Applies a deterministic code policy to map classification → model tier (2B/4B/12B), so routing policy can evolve without retraining the classifier.
📦 Configurable Model Registry: Bring your own model variants via JSON — define capabilities, billing class, resource class, provider, and context length externally.
🏠 Local Model Support: Built-in provider for LM Studio / any OpenAI-compatible local endpoint.
📊 React Dashboard: Monitor network status, rate limits, request logs, and API key persistence at /dashboard.
📈 Agent Analytics: Insights into "Most Popular", "Fastest", and "Highest Quality" (Elo-rated) models.
🔍 Model Discovery: Search and filter through thousands of available models from connected providers.
🔁 LiteLLM Azure Adapter: Route Cursor Agent / OpenAI-compatible calls through LiteLLM for Azure Responses API compatibility.
🔐 Tunnel-Safe Auth: Generate a fresh client API key per Cloudflare tunnel session and keep the dashboard localhost-only.
🔌 OpenAI Compatible: Drop-in replacement for OpenAI SDKs (/v1/chat/completions).
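As a rough illustration of the resilient-routing idea in the feature list above, the sketch below tries providers in order and moves on when one throws. The `SketchProvider` interface and `completeWithFailover` function are hypothetical names invented for this example. They are not Leyline's actual `Provider` type or router code, which may behave differently (for instance by consulting quotas before each attempt).

```typescript
// Hypothetical minimal provider shape; Leyline's real Provider interface may differ.
interface SketchProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Try each provider in order and return the first successful result.
async function completeWithFailover(
  providers: SketchProvider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const text = await provider.complete(prompt);
      return { provider: provider.name, text };
    } catch (err) {
      // Record the failure (for example a rate-limit or network error) and try the next one.
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${reason}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

Collecting the per-provider errors makes the final failure easier to debug than rethrowing only the last one.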

📦 Installation

npm install @theaiinc/leyline

🚀 Quick Start

1. Standalone Server

Create a .env file:

# ── Cloud Providers ────────────────────────────────────────
GEMINI_API_KEY=your_key
HF_API_KEY=your_key
OPENAI_API_KEY=your_key
OPENAI_DEFAULT_MODEL=gpt-5.5
OPENROUTER_API_KEY=your_key
AZURE_OPENAI_API_KEY=your_key
AZURE_OPENAI_BASE_URL=https://your-resource.services.ai.azure.com/openai/v1
AZURE_OPENAI_DEFAULT_MODEL=gpt-5.5

# Optional dashboard persistence for runtime API keys
LEYLINE_KEYCHAIN_ENABLED=true
LEYLINE_KEYCHAIN_SERVICE=@theaiinc/leyline

# ── Router / Classifier Model (optional) ───────────────────
# A lightweight model like arch-router-1.5b.gguf
LEYLINE_ROUTER_MODEL=
LEYLINE_OPENAI_BASE_URL=http://localhost:1234/v1

# ── Cursor + Azure via LiteLLM (recommended) ───────────────
LITELLM_ENABLED=true
LITELLM_BASE_URL=http://127.0.0.1:4000/v1
LITELLM_MODEL=gpt-5.5
LITELLM_API_KEY=not-needed
LEYLINE_ROUTER_ENABLED=false
LEYLINE_FIXED_PROVIDER=LiteLLM
LEYLINE_FIXED_MODEL=gpt-5.5

# ── Model Tier Resolution (optional) ───────────────────────
# Maps tier labels to actual model names
LEYLINE_MODEL_2B=google/gemma-4-e2b
LEYLINE_MODEL_4B=qwen3:8b
LEYLINE_MODEL_12B=google/gemma-4-12b

# ── Custom Variant Registry (optional) ─────────────────────
# Full JSON array of ModelVariant objects (overrides defaults)
LEYLINE_CUSTOM_VARIANTS=

Start the LiteLLM Azure adapter in one terminal:

npm run litellm:azure

Run the router in another terminal:

npx @theaiinc/leyline

The API will be available at http://localhost:3000. On startup Leyline also launches a Cloudflare quick tunnel (via cloudflared) and prints a public trycloudflare.com URL — use that when cloud clients block private networks (e.g. "Access to private networks is forbidden").

Set LEYLINE_TUNNEL_ENABLED=false if you do not want the tunnel, or install cloudflared if it is missing.

Client API keys (calling Leyline)

Leyline validates incoming Authorization headers on /v1/chat/completions and /v1/route. When the Cloudflare tunnel is enabled and LEYLINE_CLIENT_API_KEY is unset, Leyline generates a fresh random ll-... client key for that server process and prints it next to the public base URL. This key is separate from provider credentials (Azure OpenAI, OpenAI, Gemini, etc.), which are configured in Leyline itself via .env or the local /dashboard API key panel.

When using the OpenAI SDK against Leyline at http://localhost:3000/v1:

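import OpenAI from 'openai';

// Point the SDK at Leyline; apiKey is the Leyline client key, not a provider key.
const client = new OpenAI({
  baseURL: 'http://localhost:3000/v1',
  apiKey: 'leyline',
});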

Or pass the header directly:

curl -H "Authorization: Bearer leyline" ...

Set LEYLINE_CLIENT_API_KEY to pin a stable expected key, or set LEYLINE_CLIENT_AUTH_ENABLED=false to disable client auth (legacy behavior). The dashboard shows the current public base URL and the generated client key; the key is masked by default and can be revealed or copied with the Show/Copy controls.

Do not pass your Azure or OpenAI provider key to Leyline clients. Configure AZURE_OPENAI_API_KEY (or save it in /dashboard under AzureOpenAI) on the Leyline server instead.
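To make the client-key flow concrete, here is a minimal sketch of building an authenticated request for /v1/route. The `buildRouteRequest` helper and `RouteRequestInit` type are hypothetical names for this example. The `userMessage` body field is taken from the routing sequence diagram in this README, and any other accepted fields are not shown here.

```typescript
// Shape of the options object passed to fetch(); defined locally to stay self-contained.
interface RouteRequestInit {
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

// Build a /v1/route request that carries the Leyline client key as a Bearer token.
function buildRouteRequest(clientKey: string, userMessage: string): RouteRequestInit {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${clientKey}`,
    },
    body: JSON.stringify({ userMessage }),
  };
}

// Usage (requires a running Leyline server):
// const res = await fetch('http://localhost:3000/v1/route',
//   buildRouteRequest('leyline', 'build a todo app with react'));
// console.log(await res.json());
```

The key passed here is the Leyline client key (the pinned LEYLINE_CLIENT_API_KEY or the generated ll-... value), never a provider credential.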

2. Usage as a Library

import {
  Router, ModelRegistry, Classifier,
  GeminiProvider, OpenAIProvider, AzureOpenAIProvider, LMStudioProvider, QuotaManager,
} from '@theaiinc/leyline';

// ── Tiered routing with classifier ────────────────────────

const classifyFn = async (system: string, userMessage: string) => {
  const response = await fetch('http://localhost:1234/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'arch-router-1.5b.gguf',
      messages: [
        { role: 'system', content: system },
        { role: 'user', content: userMessage },
      ],
      max_tokens: 64,
      temperature: 0,
    }),
  });
  const data = await response.json();
  // Optional chaining on choices avoids a TypeError when the endpoint returns an error body.
  return data.choices?.[0]?.message?.content || '';
};

const router = new Router({
  classifier: new Classifier(classifyFn),
  tierConfig: {
    '2b': 'google/gemma-4-e2b',
    '4b': 'qwen3:8b',
    '12b': 'google/gemma-4-12b',
  },
});

// Get a routing decision before making the call
const route = await router.resolveRoute({
  userMessage: 'build a todo app with react',
  chatHistory: [],
});
console.log(route);
// → { classification: { complexity: 'complex', domain: 'coding', reasoning: true },
//     selectedTier: '12b',
//     selectedModel: 'google/gemma-4-12b',
//     selectedProvider: 'openai' }

// ── Provider failover with quota management ────────────────

const qm = new QuotaManager();
qm.setQuota('Gemini', { requestsPerMinute: 10, requestsPerDay: 1000 });

const failoverRouter = new Router({ quotaManager: qm });
failoverRouter.addProvider(new GeminiProvider(process.env.GEMINI_API_KEY));
failoverRouter.addProvider(new OpenAIProvider(process.env.OPENAI_API_KEY));
failoverRouter.addProvider(new AzureOpenAIProvider());
failoverRouter.addProvider(new LMStudioProvider('http://localhost:1234/v1'));

const response = await failoverRouter.route({
  model: 'auto',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(response.choices[0].message.content);

// Streaming with mid-stream failover stitching
for await (const chunk of failoverRouter.routeStream({
  model: 'mistralai/mistral-7b-instruct',
  messages: [{ role: 'user', content: 'Tell me a story.' }],
})) {
  process.stdout.write(chunk.choices[0].delta.content || '');
}
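The `requestsPerMinute` / `requestsPerDay` limits passed to `setQuota` above can be pictured as a sliding-window check. The sketch below is one possible way to track such limits. `SketchQuotaTracker` and `tryAcquire` are hypothetical names, and this is not a description of how Leyline's `QuotaManager` is actually implemented.

```typescript
interface SketchQuota {
  requestsPerMinute: number;
  requestsPerDay: number;
}

// Sliding-window log: keeps request timestamps per provider for the last 24 hours.
class SketchQuotaTracker {
  private quotas = new Map<string, SketchQuota>();
  private timestamps = new Map<string, number[]>();

  setQuota(provider: string, quota: SketchQuota): void {
    this.quotas.set(provider, quota);
  }

  // Returns true and records the request if the provider is under both limits.
  tryAcquire(provider: string, now: number = Date.now()): boolean {
    const quota = this.quotas.get(provider);
    if (!quota) return true; // No quota configured: always allow.

    const dayAgo = now - 86_400_000;
    const minuteAgo = now - 60_000;
    // Drop timestamps older than one day, then count those in the last minute.
    const recent = (this.timestamps.get(provider) ?? []).filter((t) => t > dayAgo);
    const lastMinute = recent.filter((t) => t > minuteAgo).length;

    const allowed =
      lastMinute < quota.requestsPerMinute && recent.length < quota.requestsPerDay;
    if (allowed) recent.push(now);
    this.timestamps.set(provider, recent);
    return allowed;
  }
}
```

A router could call `tryAcquire` before each attempt and skip to the next provider when it returns false.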

🧠 Architecture

Routing Flow

sequenceDiagram
    participant App as Your App
    participant Leyline
    participant Classifier as "Classifier (tiny LLM)"
    participant Policy as "Code Policy"
    participant Registry as ModelRegistry
    participant Provider as "LLM Provider"

    App->>Leyline: POST /v1/route { userMessage }
    Leyline->>Classifier: classifyRequest()
    Classifier-->>Leyline: { complexity, domain, reasoning }
    Leyline->>Policy: selectModelByRouter(classification)
    Policy-->>Leyline: "12b"
    Leyline->>Registry: lookupVariant(null, "big-model")
    Registry-->>Leyline: { provider: "openai", capabilities }
    Leyline-->>App: { selectedTier, selectedModel, selectedProvider }

    App->>Leyline: POST /v1/chat/completions { model }
    Leyline->>Provider: complete / completeStream
    Provider-->>Leyline: response / stream chunks
    Leyline-->>App: response

Classifier Prompt

The classifier uses a lightweight LLM (e.g. arch-router-1.5b.gguf) with a 3-line structured output format:

COMPLEXITY: simple | medium | complex
DOMAIN: chat | coding | planning | workflow | memory | extraction
REASONING: true | false

The output is parsed and fed into the code policy — a deterministic function that maps classification → model tier. This keeps routing policy flexible without retraining the model.
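A minimal sketch of parsing the 3-line format shown above. The function name `parseClassification` and the decision to return `null` on any malformed field are assumptions made for this example; the package's own parser in classifier.ts may be more lenient or structured differently.

```typescript
type Complexity = 'simple' | 'medium' | 'complex';
type Domain = 'chat' | 'coding' | 'planning' | 'workflow' | 'memory' | 'extraction';

interface SketchClassification {
  complexity: Complexity;
  domain: Domain;
  reasoning: boolean;
}

const COMPLEXITIES: readonly string[] = ['simple', 'medium', 'complex'];
const DOMAINS: readonly string[] = [
  'chat', 'coding', 'planning', 'workflow', 'memory', 'extraction',
];

// Returns null when any of the three fields is missing or has an unknown value.
function parseClassification(raw: string): SketchClassification | null {
  const fields = new Map<string, string>();
  for (const line of raw.split(/\r?\n/)) {
    // Match "KEY: value" lines, ignoring surrounding whitespace and letter case.
    const match = /^\s*([A-Za-z]+)\s*:\s*(\S+)\s*$/.exec(line);
    if (match) fields.set(match[1].toUpperCase(), match[2].toLowerCase());
  }

  const complexity = fields.get('COMPLEXITY');
  const domain = fields.get('DOMAIN');
  const reasoning = fields.get('REASONING');

  if (!complexity || !COMPLEXITIES.includes(complexity)) return null;
  if (!domain || !DOMAINS.includes(domain)) return null;
  if (reasoning !== 'true' && reasoning !== 'false') return null;

  return {
    complexity: complexity as Complexity,
    domain: domain as Domain,
    reasoning: reasoning === 'true',
  };
}
```

Returning `null` for unparseable output pairs naturally with a fallback tier in the policy step.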

Code Policy (Default)

graph LR
    C[Classification] --> Policy{selectModelByRouter}
    Policy -->|memory / extraction| 2b
    Policy -->|workflow| 12b
    Policy -->|planning| 12b
    Policy -->|coding + medium/complex| 12b
    Policy -->|reasoning=true| 12b
    Policy -->|simple| 2b
    Policy -->|medium| 4b
    Policy -->|complex| 12b
    Policy -->|null / error| 4b[4b fallback]
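The diagram above can be read as an ordered rule list. The sketch below encodes it under the assumption that rules are evaluated top to bottom and the first match wins; the diagram itself does not state the precedence, so treat the ordering as a guess. `selectTierSketch` is a hypothetical name, not the package's `selectModelByRouter`.

```typescript
type Tier = '2b' | '4b' | '12b';

interface PolicyInput {
  complexity: 'simple' | 'medium' | 'complex';
  domain: 'chat' | 'coding' | 'planning' | 'workflow' | 'memory' | 'extraction';
  reasoning: boolean;
}

// Map a classification to a tier; null stands for a failed or missing classification.
function selectTierSketch(c: PolicyInput | null): Tier {
  if (c === null) return '4b'; // null / error fallback

  // Domain-specific rules (assumed to take precedence over complexity).
  if (c.domain === 'memory' || c.domain === 'extraction') return '2b';
  if (c.domain === 'workflow' || c.domain === 'planning') return '12b';
  if (c.domain === 'coding' && c.complexity !== 'simple') return '12b';

  // Any request flagged as needing reasoning goes to the largest tier.
  if (c.reasoning) return '12b';

  // Otherwise fall back to complexity alone.
  if (c.complexity === 'simple') return '2b';
  if (c.complexity === 'medium') return '4b';
  return '12b';
}
```

With this ordering, the classification from the library example (`complex`, `coding`, `reasoning: true`) maps to `'12b'`, matching the `selectedTier` shown there.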

Package Structure

graph TD
    subgraph src
        index["index.ts (exports + bootstrap)"]
        config["config.ts (env + LeylineConfig type)"]
        server["server.ts (Express: /v1/chat + /v1/route)"]

        subgraph core
            types["types.ts (all interfaces)"]
            router["router.ts (route/routeStream/resolveRoute)"]
            modelRegistry["model-registry.ts (ModelRegistry class)"]
            classifier["classifier.ts (Classifier + ROUTER_PROMPT)"]
            quotaManager["quota-manager.ts (rate-limit tracking)"]
            logger["logger.ts (structured logging)"]
            leaderboard["leaderboard-data.ts (Elo scores)"]
        end

        subgraph providers
            gemini["gemini.ts"]
            huggingface["huggingface.ts"]
            openai["openai.ts"]
            openrouter["openrouter.ts"]
            azure["azure-openai.ts"]
            ollama["ollama.ts"]
            lmstudio["lmstudio.ts"]
        end
    end

    index --> config
    index --> server
    index --> core
    index --> providers
    server --> router
    router --> classifier
    router --> modelRegistry
    router --> quotaManager

🖥️ Dashboard

Access the dashboard at http://localhost:3000/dashboard to view:

Network Status: Real-time quota usage and provider health.
Runtime API Keys: Set or clear cloud provider API keys without editing .env.
Key Persistence: Choose Apple Keychain, browser localStorage, or server memory per provider.
Azure Runtime Settings: Edit Azure OpenAI base URL and model/deployment for the current server process.
Model Explorer: Searchable list of all available models with descriptions and specs.
Leaderboards:
- 🏆 Usage: Your most frequent models.
- ⚡ Latency: Fastest response times.
- 🌟 Quality: Models ranked by LMSYS Elo ratings (GPT-4o, Claude 3.5, etc.).

The dashboard and dashboard APIs are intentionally localhost-only. Cloudflare/proxy requests to /dashboard/* are blocked so the public tunnel exposes only the OpenAI-compatible /v1/* API surface. Use the local dashboard to copy the current tunnel base URL and generated client key.

Dashboard key behavior:

.env keys are treated as explicit startup configuration and take precedence over Keychain during startup.
If a provider has no .env key, Leyline attempts to load a saved dashboard key from Apple Keychain when Keychain is enabled and available.
A missing Keychain item is normal for providers that have not been saved yet; it should show as Missing key, not as a Keychain failure.
Saving a key from the dashboard updates the running provider immediately. Keychain saves persist across server restarts; memory saves last only for the current server process.
Browser localStorage is optional and browser-local. The dashboard stores the key in that browser and re-sends it to the server when the dashboard loads, but the server never returns raw key values.
If the dashboard says Apple Keychain lookup/save/delete failed, check macOS Keychain permissions or set LEYLINE_KEYCHAIN_ENABLED=false to run in memory-only mode. Memory and localStorage remain optional fallbacks.
Blank key fields never clear a key. Use the explicit Clear Key action, which keeps runtime URL/model settings intact.
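The startup precedence described above (.env first, then Keychain when it is enabled and available, otherwise a missing key) can be summarized in a few lines. This is an illustrative sketch only: `resolveStartupKey`, `KeySource`, and the `readKeychain` callback are hypothetical names, and real Keychain access is stubbed out as a function argument.

```typescript
type KeySource = 'env' | 'keychain' | 'missing';

interface ResolvedKey {
  source: KeySource;
  key: string | null;
}

// Decide which provider key to use at startup.
// readKeychain returns null when no item has been saved for the provider.
function resolveStartupKey(
  envKey: string | undefined,
  keychainEnabled: boolean,
  readKeychain: () => string | null,
): ResolvedKey {
  // Explicit .env configuration always wins.
  if (envKey !== undefined && envKey.trim() !== '') {
    return { source: 'env', key: envKey };
  }
  // Only consult the Keychain when it is enabled.
  if (keychainEnabled) {
    const saved = readKeychain();
    if (saved) return { source: 'keychain', key: saved };
  }
  // A missing Keychain item is a normal state, not an error.
  return { source: 'missing', key: null };
}
```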

🛠️ Configuration

Environment Variables

Cloud provider API keys can also be set from /dashboard. Raw keys are accepted only in save/rehydration requests and are never returned by dashboard APIs.

For Cursor Agent + Azure, use LiteLLM as the Azure adapter:

AZURE_OPENAI_BASE_URL=https://your-resource.services.ai.azure.com/openai/v1
AZURE_OPENAI_DEFAULT_MODEL=gpt-5.5
LITELLM_ENABLED=true
LITELLM_BASE_URL=http://127.0.0.1:4000/v1
LITELLM_MODEL=gpt-5.5
LITELLM_API_KEY=not-needed
LEYLINE_ROUTER_ENABLED=false
LEYLINE_FIXED_PROVIDER=LiteLLM
LEYLINE_FIXED_MODEL=gpt-5.5

Then provide the Azure API key, either by pasting it into the local /dashboard under AzureOpenAI or by setting AZURE_OPENAI_API_KEY in .env. Finally, start the adapter with npm run litellm:azure and run npm start.

Router / Classifier

To force one model, set:

LEYLINE_ROUTER_ENABLED=false
LEYLINE_FIXED_PROVIDER=OpenAI
LEYLINE_FIXED_MODEL=gpt-5.5

When fixed model mode is enabled, /v1/chat/completions ignores the request model and sends every request to the configured provider/model. /v1/route returns a fixed routing decision instead of calling the classifier.
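The fixed-model behavior can be pictured as a small override step that runs before any routing. The sketch below assumes fixed mode is active when LEYLINE_ROUTER_ENABLED is "false" and both LEYLINE_FIXED_PROVIDER and LEYLINE_FIXED_MODEL are set; the exact activation rule is an assumption, and `resolveTarget` is a hypothetical helper name.

```typescript
interface FixedModeEnv {
  LEYLINE_ROUTER_ENABLED?: string;
  LEYLINE_FIXED_PROVIDER?: string;
  LEYLINE_FIXED_MODEL?: string;
}

interface ResolvedTarget {
  fixed: boolean;
  provider: string | null; // null means "let the router pick a provider"
  model: string;
}

// In fixed mode the request's model is ignored in favor of the configured pair.
function resolveTarget(env: FixedModeEnv, requestedModel: string): ResolvedTarget {
  const routerDisabled = env.LEYLINE_ROUTER_ENABLED === 'false';
  const provider = env.LEYLINE_FIXED_PROVIDER;
  const model = env.LEYLINE_FIXED_MODEL;

  if (routerDisabled && provider && model) {
    return { fixed: true, provider, model };
  }
  return { fixed: false, provider: null, model: requestedModel };
}
```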

Tier → Model Resolution
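A minimal sketch of how the LEYLINE_MODEL_2B, LEYLINE_MODEL_4B, and LEYLINE_MODEL_12B variables from the .env example could be turned into the `tierConfig` object passed to `Router`. The helper name `tierConfigFromEnv` is hypothetical and not part of the package. Skipping unset tiers is an assumption about sensible behavior, not documented behavior.

```typescript
type TierLabel = '2b' | '4b' | '12b';

// Read tier → model mappings from environment-style key/value pairs.
function tierConfigFromEnv(
  env: Record<string, string | undefined>,
): Partial<Record<TierLabel, string>> {
  const mapping: Array<[TierLabel, string]> = [
    ['2b', 'LEYLINE_MODEL_2B'],
    ['4b', 'LEYLINE_MODEL_4B'],
    ['12b', 'LEYLINE_MODEL_12B'],
  ];

  const config: Partial<Record<TierLabel, string>> = {};
  for (const [tier, variable] of mapping) {
    const value = env[variable]?.trim();
    if (value) config[tier] = value; // Unset or blank variables are skipped.
  }
  return config;
}

// Usage: const tierConfig = tierConfigFromEnv(process.env);
```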

Custom Variants
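LEYLINE_CUSTOM_VARIANTS takes a full JSON array of ModelVariant objects. The sketch below shows one way such a value might be parsed and checked before use. The field names in `SketchVariant` (`name`, `provider`, `capabilities`, `billingClass`, `resourceClass`, `contextLength`) are guesses based on the feature list; consult the exported `ModelVariant` type for the real shape. `parseCustomVariants` is a hypothetical helper.

```typescript
// Hypothetical field names; the exported ModelVariant type is authoritative.
interface SketchVariant {
  name: string;
  provider: string;
  capabilities?: string[];
  billingClass?: string;
  resourceClass?: string;
  contextLength?: number;
}

// Parse the raw env value; an unset or blank value means "no custom variants".
function parseCustomVariants(raw: string | undefined): SketchVariant[] {
  if (raw === undefined || raw.trim() === '') return [];

  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) {
    throw new Error('LEYLINE_CUSTOM_VARIANTS must be a JSON array');
  }

  return parsed.map((item, index) => {
    const v = item as Partial<SketchVariant> | null;
    if (!v || typeof v.name !== 'string' || typeof v.provider !== 'string') {
      throw new Error(`Variant ${index} needs string "name" and "provider" fields`);
    }
    return { ...v, name: v.name, provider: v.provider };
  });
}
```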

Local Models

Prompt Compression (optional)

Leyline can optionally compress prompts and context before sending to LLM providers using @theaiinc/headroom-ai — a library-only fork of chopratejas/headroom (Apache-2.0, original author credited).

Compression is off by default and requires the Python package:

pip install headroom-ai

When enabled, Router.route() and Router.routeStream() automatically compress messages before sending to providers. The compressor spawns headroom-compress (the Python CLI), passing messages as JSON on stdin and receiving compressed messages as JSON on stdout.
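The stdin/stdout exchange described above could look roughly like the following. This is a sketch under several assumptions: that the CLI accepts a bare JSON array of messages and prints a JSON array back, that no extra flags are required, and that falling back to the original messages on any failure is acceptable. `compressMessages` and `ChatMessage` are hypothetical names, and the synchronous `spawnSync` call is used only to keep the example short; a server would more likely use an asynchronous spawn.

```typescript
import { spawnSync } from 'node:child_process';

interface ChatMessage {
  role: string;
  content: string;
}

// Pipe messages through a compressor CLI; return the originals if anything goes wrong.
function compressMessages(
  messages: ChatMessage[],
  command: string = 'headroom-compress',
  args: string[] = [],
): ChatMessage[] {
  const result = spawnSync(command, args, {
    input: JSON.stringify(messages), // messages as JSON on stdin
    encoding: 'utf8',
  });

  // Missing binary (result.error) or non-zero exit: keep the uncompressed messages.
  if (result.error || result.status !== 0) return messages;

  try {
    const parsed: unknown = JSON.parse(result.stdout); // compressed messages on stdout
    return Array.isArray(parsed) ? (parsed as ChatMessage[]) : messages;
  } catch {
    return messages; // Output was not valid JSON.
  }
}
```

Making the command a parameter keeps the function testable without the Python package installed.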

Exported Types

import type {
  ModelVariant, BillingClass, ResourceClass, ApiKeyConfigurableProvider,
  RouterClassification, ClassifyRequest, RouteResult, TierConfig,
  LeylineConfig, RouterModelConfig, QuotaConfig, SingleModelConfig,
  RouterOptions, SingleModelRouterConfig, ClassifyFn,
  Provider, CompletionRequest, CompletionResponse, StreamChunk,
} from '@theaiinc/leyline';

🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.