frugalroute

v1.1.0

Published

2 months ago

Capability-centric, local-first LLM routing layer — route requests to the cheapest capable model across Ollama, OpenAI, Anthropic, Google, Groq, Mistral, Kimi, and DeepSeek

The Problem

You're burning money on AI and you know it.

Every request goes to the same expensive cloud model — whether it's a trivial FAQ or a complex reasoning task. Your "summarize this email" costs the same as your "analyze this legal contract." Your team hardcodes model: "gpt-4o" because switching models means rewriting code. And when the bill lands, nobody can tell you which requests actually needed that firepower.

The real cost isn't the model. It's the lack of decision-making between your app and the model.

Meanwhile, that M4 MacBook Pro sitting on your desk? It can run an 8B parameter model at 50+ tokens/sec. For free. Right now. For 80% of your prompts, that's more than enough.

But nobody's using it, because wiring up local models, fallback logic, cost tracking, and caching is a month of engineering you'll never get approved.

The Fix

FrugalRoute is one line of config between your app and your models.

# Before: hardcoded, expensive, blind
client = OpenAI(api_key="sk-...")

# After: routed, cached, tracked, learning
client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")

That's it. Same OpenAI SDK. Same code. FrugalRoute intercepts every request and makes a decision:

Can a local model handle this? Run it on Ollama. Cost: $0.
Seen this before? Return it from the semantic cache. Cost: $0. Latency: ~1ms.
Needs more muscle? Escalate to the cloud — but only the cheapest cloud model that's capable enough.
Learn from it. Every cloud call becomes training data. Next time, the local model handles it.

The more you use it, the less you spend.

curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is dependency injection?"}]
  }'

That returned a standard OpenAI response. Your app doesn't know or care which model answered. FrugalRoute picked a local Llama model, skipped the cloud entirely, and logged the cost as $0.00.

How It Works

    Your app                                                    Models
   ┌────────┐         ┌──────────────────────────────┐
   │ OpenAI │──HTTP──▶│         FrugalRoute           │──▶  Ollama     (local, free)
   │  SDK   │         │                               │──▶  OpenAI     (cloud, metered)
   │        │◀──JSON──│  :3100/v1/chat/completions    │──▶  Anthropic  (cloud, metered)
   └────────┘         └──────────────────────────────┘

The Cascade

Every request flows through a priority chain — cheapest first, escalate only when necessary.

  Semantic Cache ──hit?──▶ return instantly ($0)
       │ miss
  Keyword Classifier ──obvious?──▶ route directly (<1ms)
       │ uncertain
  Embedding Classifier ──▶ classify intent (~4ms)
       │
  Local Model (Ollama) ──confident?──▶ return ($0)
       │ low confidence
  Bigger Local Model ──confident?──▶ return ($0)
       │ still low
  Cloud Model (cheapest capable) ──▶ return ($$)
       │
  Collect training pair ──▶ distill into local models

The confidence threshold isn't static — it adapts per capability based on real performance data. Summarization might need 0.7 confidence locally. Code generation might need 0.95. FrugalRoute figures this out from your traffic.
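
A rough sketch of the cascade with per-capability thresholds. This is an illustration, not FrugalRoute's actual code: `Tier`, `cascade`, and the stub models are hypothetical names, the 0.8 default is an assumption, and the two threshold values are the examples from the paragraph above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A model call returns (answer, confidence in [0, 1]). Hypothetical shape.
ModelFn = Callable[[str], Tuple[str, float]]

@dataclass
class Tier:
    name: str
    cost_per_call: float  # illustrative flat cost in dollars
    run: ModelFn

# Per-capability confidence thresholds (example values from the text above).
THRESHOLDS: Dict[str, float] = {"summarization": 0.7, "code_generation": 0.95}

def cascade(prompt: str, capability: str, local_tiers: List[Tier], cloud: Tier):
    """Try local tiers cheapest-first; escalate to cloud if none is confident."""
    threshold = THRESHOLDS.get(capability, 0.8)  # default is an assumption
    for tier in local_tiers:
        answer, confidence = tier.run(prompt)
        if confidence >= threshold:
            return tier.name, answer, tier.cost_per_call
    answer, _ = cloud.run(prompt)
    return cloud.name, answer, cloud.cost_per_call

# Stub models standing in for real inference.
small = Tier("local-small", 0.0, lambda p: ("draft answer", 0.75))
large = Tier("local-large", 0.0, lambda p: ("better answer", 0.90))
cloud = Tier("cloud", 0.002, lambda p: ("cloud answer", 1.0))

summary_route = cascade("Summarize this email", "summarization", [small, large], cloud)
code_route = cascade("Write a parser", "code_generation", [small, large], cloud)
print(summary_route)  # ('local-small', 'draft answer', 0.0)
print(code_route)     # ('cloud', 'cloud answer', 0.002)
```

The same stub models give different routes because the threshold depends on the capability: 0.75 confidence clears the summarization bar but not the code generation one.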

The Flywheel

This is what no other router does.

Every time FrugalRoute escalates to the cloud, it captures the prompt and response as a training pair. When you run the distillation pipeline, your local models absorb the capabilities they used to delegate, so more traffic stays local and cloud spend falls over time.

  Traffic ──▶ Local model fails ──▶ Cloud handles it
                                         │
              Training pair collected ◀───┘
                     │
              Local model fine-tuned
                     │
              Next time: local model handles it ──▶ $0

The integrity layer (based on TruthKeeper research) ensures you never train on stale, contradicted, or low-quality data. Every training pair is dependency-tracked and integrity-verified before it touches your models.
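
To make the idea of filtering training pairs concrete, here is a minimal sketch. The FAQ says pairs are judge-scored and stored locally in SQLite; everything else here is an assumption: the table schema, the `contradicted` flag, the 0.8 score cutoff, and the function names are hypothetical and do not describe how TruthKeeper works internally.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE training_pairs (
           id INTEGER PRIMARY KEY,
           prompt TEXT NOT NULL,
           response TEXT NOT NULL,
           judge_score REAL NOT NULL,
           contradicted INTEGER NOT NULL DEFAULT 0
       )"""
)

def record_pair(prompt, response, judge_score, contradicted=False):
    """Store one cloud prompt/response pair with its quality score."""
    conn.execute(
        "INSERT INTO training_pairs (prompt, response, judge_score, contradicted) "
        "VALUES (?, ?, ?, ?)",
        (prompt, response, judge_score, int(contradicted)),
    )

def verified_pairs(min_score=0.8):
    """Return only pairs that score well and have not been contradicted."""
    rows = conn.execute(
        "SELECT prompt, response FROM training_pairs "
        "WHERE judge_score >= ? AND contradicted = 0 ORDER BY id",
        (min_score,),
    )
    return rows.fetchall()

record_pair("What is DI?", "Dependency injection is ...", 0.93)
record_pair("Latest release?", "v1.0.0", 0.91, contradicted=True)  # stale answer
record_pair("Explain monads", "uh", 0.40)                          # low quality

pairs = verified_pairs()
print(pairs)  # [('What is DI?', 'Dependency injection is ...')]
```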

Who It's For

Startups & Small Teams

You're shipping fast and watching costs. FrugalRoute gives you GPT-4-level output on a ramen budget. Local models handle the bulk — cloud kicks in only when it matters. No infra team required.

You'll love: Zero-config start, auto-learning, cost tracking per feature.

Enterprise & Platform Teams

You need governance, auditability, and vendor independence. FrugalRoute gives you per-key budgets, A/B testing across providers, full request provenance, and Prometheus metrics — without touching a single line of application code.

You'll love: Virtual API keys, guardrails pipeline, budget enforcement, self-hosted deployment.

AI/ML Engineers

You're tired of manually benchmarking models. FrugalRoute profiles your hardware, learns which models excel at what, and auto-adjusts routing weights from real traffic. The distillation pipeline means your local models get smarter over time — automatically.

You'll love: Judge agent, multi-sampling, TruthKeeper integrity, hardware auto-profiling.

Quickstart

bunx frugalroute

Or clone and run:

git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
cp .env.example .env
bun run dev

Pull at least one local model and the embedding model:

ollama pull llama3.2
ollama pull nomic-embed-text

Point any OpenAI client at http://localhost:3100/v1 and set model to "auto".

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")
r = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain monads like I'm five"}]
)
print(r.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:3100/v1", apiKey: "unused" });
const r = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }],
});
console.log(r.choices[0].message.content);

curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Explain monads like I'\''m five"}]}'

require "openai"

client = OpenAI::Client.new(uri_base: "http://localhost:3100/v1", access_token: "unused")
r = client.chat(parameters: {
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }]
})
puts r.dig("choices", 0, "message", "content")

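// Written against github.com/sashabaranov/go-openai; also needs context, fmt, and log.
cfg := openai.DefaultConfig("unused")
cfg.BaseURL = "http://localhost:3100/v1"
client := openai.NewClientWithConfig(cfg)

resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
    Model:    "auto",
    Messages: []openai.ChatCompletionMessage{{Role: "user", Content: "Explain monads like I'm five"}},
})
if err != nil {
    log.Fatal(err) // e.g. FrugalRoute is not running on :3100
}
fmt.Println(resp.Choices[0].Message.Content)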

All clients hit the same endpoint. FrugalRoute picks the model, runs inference, returns OpenAI-shaped JSON.

What's Under the Hood

Routing & Classification

Semantic intent classification via embeddings (nomic-embed-text)
Sub-1ms keyword pre-classifier for obvious cases
Composite scoring with cascade confidence
Capability matching — models declare strengths, requests state needs
Multi-model sampling with judge or majority voting
A/B testing with weighted traffic splits
Sticky sessions for multi-turn conversation consistency
Agent-specific routing strategies
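
As an illustration of the keyword pre-classifier and capability matching, here is a minimal sketch. The registry entries, prices, keyword table, and function names are all hypothetical placeholders, not FrugalRoute's real configuration or API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

@dataclass(frozen=True)
class ModelSpec:
    name: str
    capabilities: FrozenSet[str]  # strengths the model declares
    cost_per_1k_tokens: float     # illustrative prices, not real rates

# Hypothetical registry; names and prices are placeholders.
REGISTRY: List[ModelSpec] = [
    ModelSpec("local-8b", frozenset({"faq", "summarization"}), 0.0),
    ModelSpec("cloud-small", frozenset({"faq", "summarization", "code"}), 0.5),
    ModelSpec("cloud-large", frozenset({"faq", "summarization", "code", "legal_analysis"}), 5.0),
]

# Keyword pre-classifier: obvious phrases map straight to a capability.
KEYWORDS = {"summarize": "summarization", "refactor": "code", "contract": "legal_analysis"}

def classify_by_keyword(prompt: str) -> Optional[str]:
    lowered = prompt.lower()
    for keyword, capability in KEYWORDS.items():
        if keyword in lowered:
            return capability
    return None  # uncertain: a real router would fall through to embeddings

def cheapest_capable(capability: str) -> Optional[ModelSpec]:
    """Pick the lowest-cost model that declares the needed capability."""
    candidates = [m for m in REGISTRY if capability in m.capabilities]
    return min(candidates, key=lambda m: m.cost_per_1k_tokens, default=None)

for prompt in ["Summarize this email", "Analyze this legal contract", "Hello there"]:
    capability = classify_by_keyword(prompt)
    model = cheapest_capable(capability) if capability else None
    print(prompt, "->", model.name if model else "needs embedding classifier")
```

The first prompt lands on the free local model and the second on the most expensive one, because only that model declares `legal_analysis`.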

Performance & Reliability

Two-tier cache: exact-match LRU + vector similarity
PeakEWMA latency tracking — routes around degraded providers
Error-type aware circuit breaker (429 vs 500 vs timeout)
Full SSE streaming with heartbeat keepalive
Graceful shutdown with in-flight request draining
Hardware auto-profiling (Apple Silicon, CUDA, ROCm)
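
The two-tier cache idea can be sketched as below: an exact-match LRU in front of a cosine-similarity lookup. This is a toy under stated assumptions: `TwoTierCache`, the 0.9 default similarity, and the bag-of-words `toy_embed` are hypothetical, and a real deployment would use an embedding model such as nomic-embed-text and a bounded vector index.

```python
import math
from collections import OrderedDict
from typing import Callable, List, Optional, Tuple

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TwoTierCache:
    def __init__(self, embed: Callable[[str], Vector], capacity: int = 128,
                 min_similarity: float = 0.9):
        self.embed = embed
        self.capacity = capacity
        self.min_similarity = min_similarity
        self.exact: "OrderedDict[str, str]" = OrderedDict()  # tier 1: exact-match LRU
        self.vectors: List[Tuple[Vector, str]] = []           # tier 2: unbounded in this sketch

    def get(self, prompt: str) -> Optional[str]:
        if prompt in self.exact:
            self.exact.move_to_end(prompt)  # mark as most recently used
            return self.exact[prompt]
        query = self.embed(prompt)
        best = max(self.vectors, key=lambda item: cosine(query, item[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.min_similarity:
            return best[1]
        return None

    def put(self, prompt: str, response: str) -> None:
        self.exact[prompt] = response
        self.exact.move_to_end(prompt)
        if len(self.exact) > self.capacity:
            self.exact.popitem(last=False)  # evict least recently used
        self.vectors.append((self.embed(prompt), response))

# Toy embedding: word counts over a tiny vocabulary.
VOCAB = ["dependency", "injection", "monads", "explain", "what"]

def toy_embed(text: str) -> Vector:
    words = text.lower().replace("?", "").split()
    return [float(words.count(term)) for term in VOCAB]

cache = TwoTierCache(toy_embed, min_similarity=0.6)  # loose threshold for the toy vectors
cache.put("What is dependency injection?", "DI passes dependencies in from outside.")
exact_hit = cache.get("What is dependency injection?")
similar_hit = cache.get("Explain dependency injection")
miss = cache.get("Explain monads")
print(exact_hit, similar_hit, miss, sep="\n")
```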

Cost & Governance

Real-time cost tracking per request, key, session, and tag
Pre-flight budget enforcement — stops before it spends
Cache-aware pricing in routing decisions
Virtual API keys with independent limits per team
Token bucket rate limiting per key
Windowed budgets with configurable time windows
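
For readers unfamiliar with token bucket rate limiting, a minimal per-key sketch follows. The class, the `allow_request` helper, and the capacity and refill values are hypothetical. The clock is passed in explicitly so the behavior is easy to follow; a real limiter would more likely read `time.monotonic()`.

```python
from typing import Dict

class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets: Dict[str, TokenBucket] = {}

def allow_request(api_key: str, now: float) -> bool:
    """One independent bucket per API key (illustrative limits)."""
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(capacity=2, refill_per_second=1, now=now)
    return buckets[api_key].allow(now)

results = [
    allow_request("team-a", 0.0),  # True: bucket starts with 2 tokens
    allow_request("team-a", 0.0),  # True: second token
    allow_request("team-a", 0.0),  # False: bucket empty
    allow_request("team-b", 0.0),  # True: separate key, separate bucket
    allow_request("team-a", 1.0),  # True: one token refilled after 1 second
]
print(results)  # [True, True, False, True, True]
```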

Learning & Distillation

Routing weights adapt from real success/failure signals
Judge agent for structural quality evaluation
Distillation pipeline: cloud responses train local models
TruthKeeper integrity layer prevents stale training data
Epistemic state tracking (Supported / Hypothesis / Contested)
Conversation compaction for long context management

Operations

Model aliases: fast, smart, cheap — decouple code from models
Prometheus metrics (frugalroute_*)
YAML model config (config/models.yaml)
OpenAPI spec at /openapi.json
One-command calibration tooling

Extensibility

MCP tool registry (MCP + OpenAI + Anthropic tools, unified)
Guardrails pipeline for pre/post content filtering
Provider adapters: Ollama, OpenAI, Anthropic
Plug in new providers by implementing one interface
Bidding/auction system for ambiguous routing decisions

The Competition

Every LLM gateway proxies requests. None of them think about them.

| Feature | liteLLM | OpenRouter | Portkey | Bifrost | FrugalRoute |
|:---|:---:|:---:|:---:|:---:|:---:|
| OpenAI-compatible drop-in | Yes | Yes | Yes | Yes | Yes |
| Routes by capability, not model name | | | | | Yes |
| Local-first (Ollama, Apple Silicon) | | | | | Yes |
| Semantic intent classification | | | | | Yes |
| Confidence-based escalation cascade | | | | | Yes |
| Two-tier semantic cache | | | Simple | | Yes |
| Learns from traffic, self-improves | | | | | Yes |
| Distills cloud into local models | | | | | Yes |
| Hardware auto-profiling | | | | | Yes |
| Budget enforcement per key/session | Partial | | | Partial | Yes |
| A/B testing across models | | | | | Yes |
| MCP tool interoperability | Partial | | | | Yes |
| Self-hosted, no vendor lock-in | Yes | | Yes | Yes | Yes |

liteLLM is a great proxy. It connects 100+ providers behind one API. But it doesn't know what your prompt needs — you still pick the model. No local tier, no caching, no learning.
OpenRouter is a managed marketplace. Not self-hosted. Your data leaves your network.
Portkey has solid reliability features — retries, fallbacks, circuit breaking. But it routes by provider weight, not by prompt intent. No local models. No distillation.
Bifrost is fast (11 µs overhead). But it's a load balancer, not a router. It doesn't understand what your request needs.

They move traffic. FrugalRoute makes decisions.

Configuration

# .env
PORT=3100
OLLAMA_BASE_URL=http://localhost:11434
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EMBEDDING_MODEL=nomic-embed-text
DEFAULT_MAX_COST_PER_REQUEST=0.01

# config/models.yaml
aliases:
  fast: gemma3-4b
  smart: claude-sonnet-4-20250514
  cheap: llama3.2

Full configuration reference: docs/user/configuration.mdx
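
To show how the aliases and the per-request cost cap above might interact, here is a minimal sketch. The alias table and the 0.01 cap mirror the config shown above. The price table, the function names, and the reading of the cap as dollars per request are assumptions made for illustration.

```python
# Mirrors the aliases in config/models.yaml above.
ALIASES = {"fast": "gemma3-4b", "smart": "claude-sonnet-4-20250514", "cheap": "llama3.2"}

# DEFAULT_MAX_COST_PER_REQUEST from .env; treating the unit as dollars is an assumption.
MAX_COST_PER_REQUEST = 0.01

# Hypothetical prices in dollars per 1K tokens; real prices differ.
PRICE_PER_1K = {"gemma3-4b": 0.0, "llama3.2": 0.0, "claude-sonnet-4-20250514": 0.015}

def resolve_model(requested: str) -> str:
    """Map an alias to its model; registered model names pass through unchanged."""
    return ALIASES.get(requested, requested)

def preflight(requested: str, estimated_tokens: int):
    """Estimate cost before calling the model and compare it to the cap."""
    model = resolve_model(requested)
    estimated_cost = PRICE_PER_1K[model] * estimated_tokens / 1000  # KeyError if unpriced
    return model, estimated_cost, estimated_cost <= MAX_COST_PER_REQUEST

print(preflight("cheap", 2000))  # ('llama3.2', 0.0, True)
model, cost, allowed = preflight("smart", 2000)
print(model, allowed)            # claude-sonnet-4-20250514 False
```

Application code keeps asking for `"smart"` or `"cheap"`; swapping the model behind an alias is then a config change rather than a code change.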

Documentation

| Guide | What it covers |
|:---|:---|
| Getting Started | Install, first request, connect existing clients |
| Architecture | Module map, request flow, design principles |
| Routing | Classification, escalation, bidding, weight adjustment |
| Caching | Two-tier semantic cache, adaptive thresholds |
| Cost Management | Estimation, tracking, budget enforcement |
| Configuration | Env vars, routes, models, budgets, thresholds |
| Deployment | Docker, production hardening, hardware profiling |
| Tools & MCP | Tool registry, MCP integration, format conversion |
| Distillation | Training flywheel, TruthKeeper integrity |
| API Reference | Complete HTTP endpoint reference |

FAQ

Ollama is required only for local inference. FrugalRoute uses Ollama as its local model backend. Without it, requests route straight to cloud providers, which still gives you caching, cost tracking, and budget enforcement, but you miss the free local tier.

Supported models include any model Ollama can run (Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, etc.), plus OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5) and Anthropic (Claude Opus, Sonnet, Haiku). Adding a new provider means implementing one adapter interface.

Streaming is supported: full SSE streaming with heartbeat keepalive, compatible with the OpenAI streaming format. Set "stream": true in your request, same as you would with OpenAI directly.
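
If you consume the stream without an SDK, the parsing looks roughly like this. The sketch assumes the usual OpenAI streaming shape (`data: {...}` lines carrying `choices[0].delta.content`, ending with `data: [DONE]`) and assumes the heartbeat arrives as an SSE comment line starting with `:`, which is a common keepalive convention but is not confirmed by this README. `iter_content` and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator

def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(":"):
            continue  # blank separator or SSE comment (assumed keepalive)
        if not line.startswith("data:"):
            continue  # ignore other SSE fields such as "event:" or "id:"
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if choices:
            delta = choices[0].get("delta", {}).get("content")
            if delta:
                yield delta

# Hand-written sample standing in for a real response body.
sample_stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    ": keepalive",
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    "",
    "data: [DONE]",
]

text = "".join(iter_content(sample_stream))
print(text)  # Hello, world
```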

Routing overhead is small: keyword classification adds <1ms, embedding-based classification adds ~4ms, and cache hits return in ~1ms. The routing decision itself is negligible compared to model inference time.

You can pin a specific model. Set model to any registered model name (e.g., "gpt-4o", "llama3.2") instead of "auto". FrugalRoute will route directly to that model while still tracking cost and logging the request. You can also use aliases like "fast", "smart", or "cheap".

FrugalRoute is fully self-hosted. Local model requests never leave your machine. Cloud requests go directly to OpenAI/Anthropic — FrugalRoute never proxies through a third-party service. Training pairs for distillation are stored locally in SQLite.

When a request escalates to a cloud model, the prompt-response pair is captured, quality-scored by a judge agent, and stored locally. Running bun run distill feeds verified pairs into a fine-tuning pipeline for your local models. The TruthKeeper integrity layer ensures only high-quality, non-contradicted data is used. See Distillation docs.

Tool calling is supported. FrugalRoute's MCP tool registry unifies tools across MCP, OpenAI, and Anthropic formats. Tool calls are routed to the correct backend automatically.

Built With

Bun + Hono + Ollama + TypeScript

445 tests. 1,196 assertions. Two production dependencies.

Contributing

git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
bun test           # run all 445 tests
bun run dev        # start dev server with hot reload
bun run lint       # lint with Biome
bun run benchmark  # run hardware benchmarks
bun run calibrate  # calibrate keyword classifier thresholds

PRs welcome. Please run bun run check (lint + tests) before submitting.

License

PolyForm Small Business License 1.0.0 — free for individuals, small businesses (<100 people, <1M EUR revenue), nonprofits, and open source projects.

Commercial license for larger organizations: [email protected]