switchboard-llm
v0.1.1
Intelligent multi-model LLM router. Routes prompts to the optimal provider (Claude, GPT-4o, Gemini, DeepSeek, Groq, Mistral, Together AI, and more) based on task type, cost, and quality.
Intelligent multi-model LLM router. Sends each prompt to the right model — automatically.
Most apps call one LLM for everything. That's like using a bulldozer to plant seeds.
Switchboard classifies each incoming prompt by task type and routes it to the model that wins on that category — not just in marketing, but in benchmarks and real cost. A bug fix goes to Codestral (beats GPT-4o on HumanEval, costs 1/10 as much). A 200-page document goes to Gemini (1M token context). A latency-critical autocomplete goes to Groq (300–800 tok/s on custom LPU silicon). A high-stakes legal summary runs on 4 frontier models simultaneously and takes the majority vote.
This is not a translation layer. This is not a marketplace. It's a router.
Quick Start
```bash
npm install switchboard-llm
```

```ts
import { route } from 'switchboard-llm';

// Auto-classify from prompt content
const result = await route({ prompt: 'Fix the off-by-one error in this sort function' });
// → routes to Codestral automatically

console.log(result.content);   // the answer
console.log(result.cost);      // USD, e.g. 0.000041
console.log(result.latencyMs); // wall-clock ms
```

Routing Table
| Task Type | Primary | Fallback | Rationale |
|-----------|---------|----------|-----------|
| code | Codestral | DeepSeek | Top HumanEval scores at 1/10 GPT-4o cost |
| reasoning | Claude | GPT-4o | Best multi-step analysis and instruction following |
| creative | GPT-4o | Claude | Best copy, voice, and storytelling |
| multimodal | Gemini 1.5 Pro | GPT-4o | 1M token context, native multimodal |
| fast | Groq (Llama 3.3 70B) | GPT-4o mini | 300–800 tok/s on custom LPU hardware |
| research | Gemini 1.5 Pro | Claude | Long-context synthesis and document analysis |
| security | Claude | GPT-4o | Best safety-aware reasoning and threat modeling |
| rag | Cohere Command R+ | Claude | Purpose-built for retrieval-augmented generation |
| search | Perplexity Sonar | GPT-4o | Live web retrieval on every call |
| consensus | Claude + GPT-4o + Gemini + DeepSeek | — | Parallel run, majority vote |
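The table above is, in effect, a lookup from task type to a primary/fallback pair. A minimal sketch of that structure, with illustrative names rather than the library's actual internals:

```ts
// Hypothetical sketch of the routing table as data; not switchboard-llm's real code.
type TaskType = 'code' | 'reasoning' | 'creative' | 'fast';

interface Route {
  primary: string;
  fallback: string;
}

const routingTable: Record<TaskType, Route> = {
  code:      { primary: 'codestral', fallback: 'deepseek' },
  reasoning: { primary: 'claude',    fallback: 'gpt4o' },
  creative:  { primary: 'gpt4o',     fallback: 'claude' },
  fast:      { primary: 'groq',      fallback: 'gpt4o-mini' },
};

// Pick the primary unless it is currently marked unhealthy, then fall back.
function pickProvider(type: TaskType, unhealthy: Set<string> = new Set()): string {
  const route = routingTable[type];
  return unhealthy.has(route.primary) ? route.fallback : route.primary;
}
```

The `unhealthy` set stands in for whatever health signal drives fallback; the self-healing section below describes the real trigger (a rolling success rate).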
Features
Task-aware routing. The classifier reads your prompt and picks the right model before the LLM call. No config required. You can also pass type explicitly if you know what you need.
Groq speed routing. Groq runs Llama 3.3 70B on custom LPU hardware at 300–800 tokens per second — 10–25x faster than standard GPU inference. When latency matters (autocomplete, voice, streaming), type: 'fast' routes there by default.
Self-healing fallback. Switchboard tracks rolling success rates per provider. When a provider's rate drops below 70%, it automatically promotes the fallback for that task type. No manual intervention, no 500s leaking to users.
Swarm mode. Dispatch N independent tasks to N providers in parallel and collect all results simultaneously. Useful for pipelines where multiple LLM calls would otherwise run serially.
Consensus mode. Run 4 frontier models (Claude, GPT-4o, Gemini 1.5 Pro, DeepSeek) in parallel and take the majority vote. Built for high-stakes decisions where a single model hallucinating is unacceptable.
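The vote itself can be pictured as a normalize-and-count over the parallel responses. A hypothetical sketch; the library's actual answer-comparison logic is not shown here:

```ts
// Toy majority-vote sketch, illustrative only.
// Responses are normalized (trimmed, lowercased) so trivially different phrasings can match.
function majorityVote(responses: string[]): string {
  const counts = new Map<string, { count: number; original: string }>();
  for (const r of responses) {
    const key = r.trim().toLowerCase();
    const entry = counts.get(key) ?? { count: 0, original: r };
    entry.count += 1;
    counts.set(key, entry);
  }
  // Return the answer whose normalized form appeared most often.
  let winner = responses[0];
  let best = 0;
  for (const { count, original } of counts.values()) {
    if (count > best) {
      best = count;
      winner = original;
    }
  }
  return winner;
}
```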
OpenAI drop-in proxy. Start the proxy server and change one line in your existing code. Switchboard intercepts the request, routes it intelligently, and returns an OpenAI-compatible response. Zero other code changes.
MCP-native. Runs as a Model Context Protocol tool server, so it works inside Claude Code and any other MCP-compatible agent.
Fully typed. Ships with TypeScript definitions for every public interface. Zero `any`.
Installation
```bash
npm install switchboard-llm
```

Environment Variables
Set the API keys for the providers you want active. Unused providers are simply skipped.
```bash
# Required for defaults — set at least these two:
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

# Optional — unlock additional routing targets:
GEMINI_API_KEY=AIza...
DEEPSEEK_API_KEY=sk-...
GROQ_API_KEY=gsk_...
MISTRAL_API_KEY=...   # enables Codestral (code routing) + Mistral Large
TOGETHER_API_KEY=...
PERPLEXITY_API_KEY=pplx-...
XAI_API_KEY=xai-...
COHERE_API_KEY=...
```

Usage
Auto-detect task type from prompt
```ts
import { route } from 'switchboard-llm';

const result = await route({
  prompt: 'Refactor this Python function to use a generator instead of a list',
});

console.log(result.provider); // "codestral-latest"
console.log(result.content);
console.log(`Cost: $${result.cost.toFixed(6)}`);
console.log(`Latency: ${result.latencyMs}ms`);
```

Explicit task type
```ts
const result = await route({
  prompt: 'Write a cold email for a SaaS product targeting HR teams',
  type: 'creative',
});
// → routes to GPT-4o
```

Force a specific provider
```ts
const result = await route({
  prompt: 'Summarize these meeting notes',
  provider: 'gemini', // bypass routing, always use this provider
});
```

With a system prompt
```ts
const result = await route({
  prompt: userMessage,
  type: 'reasoning',
  system: 'You are a senior software architect. Be concise and opinionated.',
  maxTokens: 2048,
});
```

Swarm mode — parallel dispatch
Dispatch multiple independent tasks simultaneously. Each is independently classified and routed.
```ts
import { swarm } from 'switchboard-llm';

const results = await swarm([
  { id: 'summary', prompt: 'Summarize this 80-page contract', type: 'research' },
  { id: 'rewrite', prompt: 'Rewrite this headline for LinkedIn', type: 'creative' },
  { id: 'fix', prompt: 'Fix the SQL injection in line 42', type: 'code' },
  { id: 'check', prompt: 'Flag any GDPR compliance issues', type: 'security' },
]);

for (const { task, result, error } of results) {
  if (error) console.error(`${task.id} failed:`, error);
  else console.log(`${task.id} → ${result.provider} ($${result.cost.toFixed(6)})`);
}
```

Consensus mode — majority vote
Run 4 frontier models in parallel. The response that appears most often across models wins.
```ts
import { route } from 'switchboard-llm';
import type { ConsensusResult } from 'switchboard-llm';

const result = await route({
  prompt: 'Is this contract clause enforceable under Illinois law?',
  type: 'consensus',
}) as ConsensusResult;

console.log(result.winner.content); // majority-vote answer
console.log(result.all.length);     // number of models that responded
console.log(result.failed);         // any that errored
```

OpenAI drop-in proxy
Start the proxy server:
```bash
npx switchboard-llm proxy --port 4141
```

Change one line in your existing code:
```ts
// Before:
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// After (zero other changes):
const client = new OpenAI({
  apiKey: 'switchboard',
  baseURL: 'http://localhost:4141/v1',
});

// Switchboard intercepts the request, routes intelligently, returns OpenAI-format response.
// Works with LangChain, LlamaIndex, Vercel AI SDK, anything that speaks OpenAI.
const response = await client.chat.completions.create({
  model: 'auto', // Switchboard picks the model; or pass a task type here
  messages: [{ role: 'user', content: 'Fix the off-by-one error in this sort function' }],
});
```

MCP server
Add to your MCP config (e.g., mcp.json for Claude Code):
```json
{
  "mcpServers": {
    "switchboard": {
      "command": "npx",
      "args": ["switchboard-llm", "mcp"]
    }
  }
}
```

Then use the `route_prompt` tool inside Claude Code or any MCP-compatible client:

```ts
route_prompt({ prompt: "...", type: "code" })
```

Create a custom client
For multi-tenant apps or when you need to override provider settings at runtime:
```ts
import { createClient } from 'switchboard-llm';

const client = createClient({
  providers: {
    groq: { maxTokens: 4096, costPer1kInput: 0.00059, costPer1kOutput: 0.00079 },
  },
});

const result = await client.route({ prompt: 'Translate this to Spanish', type: 'fast' });
```

Cost Comparison
The numbers below use real published API prices and a synthetic mixed workload of 1,000 requests distributed across task types.
Workload assumption: 30% code, 20% reasoning, 15% creative, 10% fast, 10% research, 5% security, 5% rag, 5% search. Average 500 input tokens / 400 output tokens per request.
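As a sanity check on the baseline row below, GPT-4o's published prices (roughly $0.0025 per 1K input tokens and $0.01 per 1K output tokens at the time of writing) applied to the assumed 500-in / 400-out request work out as:

```ts
// Back-of-envelope check of the "Always GPT-4o" baseline.
// Prices assumed: $0.0025 / 1K input tokens, $0.01 / 1K output tokens.
const inputTokens = 500;
const outputTokens = 400;
const costPerRequest =
  (inputTokens / 1000) * 0.0025 +  // input cost
  (outputTokens / 1000) * 0.01;    // output cost
console.log(costPerRequest); // ≈ $0.00525 per request, so ≈ $5.25 per 1,000 requests
```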
| Routing Strategy | Avg cost / request | 1K requests | vs. GPT-4o baseline |
|------------------|--------------------|-------------|---------------------|
| Always GPT-4o | ~$0.00525 | ~$5.25 | baseline |
| Switchboard auto-route | ~$0.00112 | ~$1.12 | 79% cheaper |
| Always GPT-4o mini | ~$0.00027 | ~$0.27 | cheaper, but quality drops significantly on complex tasks |
Cost breakdown by task type with smart routing:
| Task | Model Used | Cost / 1K tokens (avg) | vs. GPT-4o |
|------|------------|------------------------|------------|
| code | Codestral | $0.00040 | 92% cheaper |
| fast | Groq Llama 3.3 | $0.00069 | 87% cheaper |
| research | Gemini 1.5 Pro | $0.00313 | 40% cheaper |
| reasoning | Claude Sonnet | $0.00900 | similar |
| consensus | 4 models parallel | $0.02100 | 4x (4 calls, worth it for critical decisions) |
Cost is tracked per-call and available on every result:
```ts
const result = await route({ prompt: '...', type: 'code' });
console.log(`$${result.cost.toFixed(6)}`); // e.g. $0.000041
```

Providers
| ID | Model | Provider | Input / 1K tokens | Strength |
|----|-------|----------|--------------------|----------|
| claude | claude-sonnet-4-6 | Anthropic | $0.003 | Reasoning, safety, instruction following |
| gpt4o | gpt-4o | OpenAI | $0.0025 | Creative, broad competence |
| gpt4o-mini | gpt-4o-mini | OpenAI | $0.00015 | Low-cost fallback |
| gemini | gemini-1.5-pro | Google | $0.00125 | 1M context, multimodal, research |
| gemini-flash | gemini-1.5-flash | Google | $0.000075 | Fast + cheap multimodal |
| deepseek | deepseek-chat | DeepSeek | $0.00027 | Strong code/reasoning at low cost |
| groq | llama-3.3-70b-versatile | Groq | $0.00059 | 300–800 tok/s LPU hardware speed |
| codestral | codestral-latest | Mistral | $0.00020 | Best-in-class code model |
| together | Llama 3.3 70B Turbo | Together AI | $0.00088 | 50+ open models, flexible |
| perplexity | sonar-pro | Perplexity | $0.003 | Live web search on every call |
| xai | grok-2-latest | xAI | $0.002 | Real-time X/Twitter data |
| cohere | command-r-plus | Cohere | $0.0025 | RAG-optimized, grounded generation |
Configuration
Routing rules live in src/config/routing.yaml and are fully overridable. Copy the file and set SWITCHBOARD_CONFIG to point to your version.
```yaml
# routing.yaml — customize routing logic per task type
routing:
  code:
    primary: codestral
    fallback: deepseek

  # Override creative to use xAI Grok for real-time trend awareness:
  creative:
    primary: xai
    fallback: gpt4o

  # Add your own task type:
  internal-docs:
    primary: claude
    fallback: gpt4o
    description: "Internal documentation — prefers verbose, structured output"

# Classifier thresholds (token count heuristics for auto-routing):
classifier:
  simpleThreshold: 50    # prompts under 50 tokens → 'fast' route
  complexThreshold: 500  # prompts over 500 tokens → 'reasoning' route
  defaultRoute: fast
```

```bash
SWITCHBOARD_CONFIG=/path/to/my-routing.yaml node app.js
```

How Auto-Classification Works
The classifier uses a keyword-priority system with token-count heuristics as a secondary signal:

- Keyword scan — checks for strong signals (`def`, `function`, ` ```python `, `SQL`, `bug`, etc.) to detect code; `imagine`, `story`, `email` for creative; etc.
- Length heuristic — very short prompts (< 50 tokens) route to `fast`; very long prompts (> 500 tokens) route to `reasoning` when no stronger signal is found.
- Default — unclassified prompts fall back to `classifier.defaultRoute` (default: `fast`).
You can always override with `type: 'reasoning'` or any other task type.
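A toy version of that priority order (keywords first, then length, then the default) might look like the following. This is an illustration only; the shipped classifier and its full keyword lists are not reproduced here:

```ts
// Toy classifier sketch: keyword signals outrank length heuristics,
// which outrank the default route. Illustrative, not switchboard-llm's real code.
type TaskType = 'code' | 'creative' | 'fast' | 'reasoning';

const CODE_SIGNALS = ['def ', 'function', '```python', 'sql', 'bug'];
const CREATIVE_SIGNALS = ['imagine', 'story', 'email'];

function classify(prompt: string): TaskType {
  const p = prompt.toLowerCase();
  // 1. Keyword scan: strong signals win outright.
  if (CODE_SIGNALS.some((k) => p.includes(k))) return 'code';
  if (CREATIVE_SIGNALS.some((k) => p.includes(k))) return 'creative';
  // 2. Length heuristic: rough token count via whitespace split.
  const tokens = p.split(/\s+/).length;
  if (tokens > 500) return 'reasoning';
  if (tokens < 50) return 'fast';
  // 3. Default route.
  return 'fast';
}
```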
Self-Healing Fallback
Switchboard tracks a rolling window of outcomes per (providerId, taskType) pair. When a provider's success rate for a task type falls below 70%, subsequent requests for that task type automatically route to the fallback provider.
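A minimal sketch of that promotion rule over a fixed-size rolling window. The window size and function names here are assumptions for illustration, not the library's actual implementation:

```ts
// Sketch of a rolling success-rate tracker; illustrative only.
// Assumes a window of the last 20 outcomes per (providerId, taskType) pair.
const WINDOW = 20;
const THRESHOLD = 0.7;

const outcomes = new Map<string, boolean[]>(); // key: `${providerId}/${taskType}`

function record(providerId: string, taskType: string, ok: boolean): void {
  const key = `${providerId}/${taskType}`;
  const window = outcomes.get(key) ?? [];
  window.push(ok);
  if (window.length > WINDOW) window.shift(); // keep only the most recent outcomes
  outcomes.set(key, window);
}

// True when the provider should be demoted and its fallback promoted.
function shouldFallback(providerId: string, taskType: string): boolean {
  const window = outcomes.get(`${providerId}/${taskType}`) ?? [];
  if (window.length === 0) return false; // no data yet, trust the primary
  const successRate = window.filter(Boolean).length / window.length;
  return successRate < THRESHOLD;
}
```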
The tracker is in-memory per process. Stats are available at runtime:
```ts
import { tracker } from 'switchboard-llm';

const stats = tracker.getStats();
// [{ providerId, taskType, calls, successRate, avgLatencyMs, avgCost, totalCost }, ...]

for (const s of stats) {
  console.log(`${s.providerId}/${s.taskType}: ${(s.successRate * 100).toFixed(1)}% success, $${s.totalCost.toFixed(4)} total`);
}
```

CLI
```bash
# Start the OpenAI-compatible proxy
npx switchboard-llm proxy --port 4141

# Start the MCP tool server
npx switchboard-llm mcp

# Route a single prompt (stdout)
npx switchboard-llm route --prompt "Fix the null check on line 42" --type code
```

Contributing
Pull requests are welcome. For large changes, open an issue first.
```bash
git clone https://github.com/your-org/switchboard-llm
cd switchboard-llm
npm install
npm run build
npm test
```

The project follows standard TypeScript conventions. All public APIs require typed interfaces. Tests use Vitest.
Adding a provider:
- Add a config entry to `src/config/routing.yaml`
- If the provider is OpenAI-compatible, set `adapter: openai-compat` — no code needed
- If it needs a custom adapter, add a file in `src/providers/` implementing the `BaseProvider` interface
- Add it to `src/providers/registry.ts`
- Write a test
License
MIT — see LICENSE.
If this is useful, a star on GitHub helps other developers find it.
