frugalroute

v1.1.0

Capability-centric, local-first LLM routing layer — route requests to the cheapest capable model across Ollama, OpenAI, Anthropic, Google, Groq, Mistral, Kimi, and DeepSeek

The Problem

You're burning money on AI and you know it.

Every request goes to the same expensive cloud model — whether it's a trivial FAQ or a complex reasoning task. Your "summarize this email" costs the same as your "analyze this legal contract." Your team hardcodes model: "gpt-4o" because switching models means rewriting code. And when the bill lands, nobody can tell you which requests actually needed that firepower.

The real cost isn't the model. It's the lack of decision-making between your app and the model.

Meanwhile, that M4 MacBook Pro sitting on your desk? It can run an 8B parameter model at 50+ tokens/sec. For free. Right now. For 80% of your prompts, that's more than enough.

But nobody's using it, because wiring up local models, fallback logic, cost tracking, and caching is a month of engineering you'll never get approved.

The Fix

FrugalRoute is one line of config between your app and your models.

# Before: hardcoded, expensive, blind
client = OpenAI(api_key="sk-...")

# After: routed, cached, tracked, learning
client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")

That's it. Same OpenAI SDK. Same code. FrugalRoute intercepts every request and makes a decision:

  1. Can a local model handle this? Run it on Ollama. Cost: $0.
  2. Seen this before? Return it from the semantic cache. Cost: $0. Latency: ~1ms.
  3. Needs more muscle? Escalate to the cloud — but only the cheapest cloud model that's capable enough.
  4. Learn from it. Every cloud call becomes training data. Next time, the local model handles it.

The more you use it, the less you spend.

curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is dependency injection?"}]
  }'

That returned a standard OpenAI response. Your app doesn't know or care which model answered. FrugalRoute picked a local Llama model, skipped the cloud entirely, and logged the cost as $0.00.

How It Works

    Your app                                                    Models
   ┌────────┐         ┌──────────────────────────────┐
   │ OpenAI │──HTTP──▶│         FrugalRoute           │──▶  Ollama     (local, free)
   │  SDK   │         │                               │──▶  OpenAI     (cloud, metered)
   │        │◀──JSON──│  :3100/v1/chat/completions    │──▶  Anthropic  (cloud, metered)
   └────────┘         └──────────────────────────────┘

The Cascade

Every request flows through a priority chain — cheapest first, escalate only when necessary.

  Semantic Cache ──hit?──▶ return instantly ($0)
       │ miss
  Keyword Classifier ──obvious?──▶ route directly (<1ms)
       │ uncertain
  Embedding Classifier ──▶ classify intent (~4ms)
       │
  Local Model (Ollama) ──confident?──▶ return ($0)
       │ low confidence
  Bigger Local Model ──confident?──▶ return ($0)
       │ still low
  Cloud Model (cheapest capable) ──▶ return ($$)
       │
  Collect training pair ──▶ distill into local models

The confidence threshold isn't static — it adapts per capability based on real performance data. Summarization might need 0.7 confidence locally. Code generation might need 0.95. FrugalRoute figures this out from your traffic.
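
The cascade above can be sketched as a loop over tiers ordered by cost. This is a minimal illustration under stated assumptions, not FrugalRoute's implementation: `Tier`, `run_cascade`, and the `THRESHOLDS` table are hypothetical names, the models are stubbed as plain functions that return an answer with a confidence score, and the thresholds are fixed here rather than adapted from traffic.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical per-capability thresholds. The 0.7 and 0.95 values echo the
# examples in the text; the default is an assumption.
THRESHOLDS = {"summarization": 0.7, "code_generation": 0.95}
DEFAULT_THRESHOLD = 0.8


@dataclass
class Tier:
    name: str
    cost_per_call: float
    # Stub for a model call: returns (answer, confidence in [0, 1]).
    generate: Callable[[str], Tuple[str, float]]


def run_cascade(prompt: str, capability: str, tiers: List[Tier]) -> Tuple[str, str, float]:
    """Try tiers cheapest-first; return (tier name, answer, total cost)."""
    threshold = THRESHOLDS.get(capability, DEFAULT_THRESHOLD)
    ordered = sorted(tiers, key=lambda t: t.cost_per_call)
    spent = 0.0
    for index, tier in enumerate(ordered):
        answer, confidence = tier.generate(prompt)
        spent += tier.cost_per_call
        is_last = index == len(ordered) - 1
        # Accept a confident answer, or whatever the most expensive tier says.
        if confidence >= threshold or is_last:
            return tier.name, answer, spent
    raise ValueError("no tiers configured")
```

With the same three stub tiers, a summarization request can stop at the first local model while a code generation request falls through to the cloud tier, because only the threshold differs.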

The Flywheel

This is what no other router does.

Every time FrugalRoute escalates to the cloud, it captures the prompt and response as a training pair. When you run the distillation pipeline, your local models absorb the capabilities they used to delegate, so fewer requests of that kind need the cloud and cloud spend decreases.

  Traffic ──▶ Local model fails ──▶ Cloud handles it
                                         │
              Training pair collected ◀───┘
                     │
              Local model fine-tuned
                     │
              Next time: local model handles it ──▶ $0

The integrity layer (based on TruthKeeper research) ensures you never train on stale, contradicted, or low-quality data. Every training pair is dependency-tracked and integrity-verified before it touches your models.
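
As a rough picture of what filtering training pairs could look like, the sketch below keeps only pairs that are fresh, uncontradicted, and scored highly. It is an assumption-laden illustration and does not reproduce TruthKeeper: the `TrainingPair` fields, the 30-day age limit, and the 0.8 quality cutoff are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class TrainingPair:
    prompt: str
    response: str
    quality: float              # hypothetical judge score in [0, 1]
    captured_at: datetime
    contradicted: bool = False  # hypothetical flag: a later pair disagrees


def select_for_distillation(
    pairs: List[TrainingPair],
    now: datetime,
    max_age: timedelta = timedelta(days=30),
    min_quality: float = 0.8,
) -> List[TrainingPair]:
    """Keep only pairs that are fresh, uncontradicted, and high quality."""
    return [
        pair
        for pair in pairs
        if not pair.contradicted
        and pair.quality >= min_quality
        and now - pair.captured_at <= max_age
    ]
```

Passing `now` explicitly keeps the filter deterministic, which makes it easy to test against a fixed set of captured pairs.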


Who It's For

Startups & Small Teams

You're shipping fast and watching costs. FrugalRoute gives you GPT-4-level output on a ramen budget. Local models handle the bulk — cloud kicks in only when it matters. No infra team required.

You'll love: Zero-config start, auto-learning, cost tracking per feature.

Enterprise & Platform Teams

You need governance, auditability, and vendor independence. FrugalRoute gives you per-key budgets, A/B testing across providers, full request provenance, and Prometheus metrics — without touching a single line of application code.

You'll love: Virtual API keys, guardrails pipeline, budget enforcement, self-hosted deployment.

AI/ML Engineers

You're tired of manually benchmarking models. FrugalRoute profiles your hardware, learns which models excel at what, and auto-adjusts routing weights from real traffic. The distillation pipeline means your local models get smarter over time — automatically.

You'll love: Judge agent, multi-sampling, TruthKeeper integrity, hardware auto-profiling.


Quickstart

bunx frugalroute

Or clone and run:

git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
cp .env.example .env
bun run dev

Pull at least one local model and the embedding model:

ollama pull llama3.2
ollama pull nomic-embed-text

Point any OpenAI client at http://localhost:3100/v1 and set model to "auto".

Python:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3100/v1", api_key="unused")
r = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain monads like I'm five"}]
)
print(r.choices[0].message.content)

TypeScript:

import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:3100/v1", apiKey: "unused" });
const r = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }],
});
console.log(r.choices[0].message.content);

curl:

# The apostrophe in "I'm" is written as '\'' so it survives the single-quoted shell string.
curl http://localhost:3100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Explain monads like I'\''m five"}]}'

Ruby:

require "openai"

client = OpenAI::Client.new(uri_base: "http://localhost:3100/v1", access_token: "unused")
r = client.chat(parameters: {
  model: "auto",
  messages: [{ role: "user", content: "Explain monads like I'm five" }]
})
puts r.dig("choices", 0, "message", "content")

Go:

// Assumes the github.com/sashabaranov/go-openai client.
ctx := context.Background()
cfg := openai.DefaultConfig("unused")
cfg.BaseURL = "http://localhost:3100/v1"
client := openai.NewClientWithConfig(cfg)

resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
    Model:    "auto",
    Messages: []openai.ChatCompletionMessage{{Role: "user", Content: "Explain monads like I'm five"}},
})
if err != nil {
    log.Fatal(err)
}
fmt.Println(resp.Choices[0].Message.Content)

All clients hit the same endpoint. FrugalRoute picks the model, runs inference, returns OpenAI-shaped JSON.


What's Under the Hood

Routing & Classification

  • Semantic intent classification via embeddings (nomic-embed-text)
  • Sub-1ms keyword pre-classifier for obvious cases
  • Composite scoring with cascade confidence
  • Capability matching — models declare strengths, requests state needs
  • Multi-model sampling with judge or majority voting
  • A/B testing with weighted traffic splits
  • Sticky sessions for multi-turn conversation consistency
  • Agent-specific routing strategies
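
Capability matching, where models declare strengths and requests state needs, can be illustrated with a small selection function. This is a sketch under assumptions rather than the project's code: the `Model` fields, the capability labels, and the prices are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Model:
    name: str
    capabilities: FrozenSet[str]   # strengths the model declares
    cost_per_1k_tokens: float      # 0.0 for local models


def cheapest_capable(models: List[Model], needs: Set[str]) -> Optional[Model]:
    """Return the cheapest model whose capabilities cover every need."""
    capable = [m for m in models if needs <= m.capabilities]
    return min(capable, key=lambda m: m.cost_per_1k_tokens, default=None)


# Hypothetical registry for illustration only.
MODELS = [
    Model("local-8b", frozenset({"chat", "summarize"}), 0.0),
    Model("cloud-a", frozenset({"chat", "summarize", "legal_reasoning"}), 0.005),
    Model("cloud-b", frozenset({"chat", "legal_reasoning"}), 0.002),
]
```

Here a request that needs only `summarize` resolves to the free local model, while one that needs `legal_reasoning` resolves to the cheaper of the two cloud models.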

Performance & Reliability

  • Two-tier cache: exact-match LRU + vector similarity
  • PeakEWMA latency tracking — routes around degraded providers
  • Error-type aware circuit breaker (429 vs 500 vs timeout)
  • Full SSE streaming with heartbeat keepalive
  • Graceful shutdown with in-flight request draining
  • Hardware auto-profiling (Apple Silicon, CUDA, ROCm)
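
To make the two-tier cache idea concrete, here is a minimal sketch that checks an exact-match LRU first and falls back to a cosine-similarity scan. It is illustrative only: `TwoTierCache`, the 0.92 similarity cutoff, and the linear scan are assumptions, and `embed` is any caller-supplied function standing in for an embedding model such as nomic-embed-text.

```python
import math
from collections import OrderedDict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, embed, capacity=1000, min_similarity=0.92):
        self.embed = embed                # callable: str -> list of floats
        self.capacity = capacity
        self.min_similarity = min_similarity
        self.entries = OrderedDict()      # prompt -> (embedding, response)

    def get(self, prompt):
        # Tier 1: exact match, refreshed as most recently used.
        if prompt in self.entries:
            self.entries.move_to_end(prompt)
            return self.entries[prompt][1]
        # Tier 2: most similar stored prompt, found by a linear scan.
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries.values():
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.min_similarity else None

    def put(self, prompt, response):
        self.entries[prompt] = (self.embed(prompt), response)
        self.entries.move_to_end(prompt)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

A real deployment would likely replace the linear scan with a vector index, but the lookup order (exact match, then similarity above a cutoff, then miss) stays the same.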

Cost & Governance

  • Real-time cost tracking per request, key, session, and tag
  • Pre-flight budget enforcement — stops before it spends
  • Cache-aware pricing in routing decisions
  • Virtual API keys with independent limits per team
  • Token bucket rate limiting per key
  • Windowed budgets with configurable time windows
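
Pre-flight enforcement against a windowed budget might look like the following sketch, which rejects a request before any model is called if its estimated cost would exceed the remaining budget. The class name, its methods, and the explicit `now` timestamps are assumptions made for illustration; one instance per virtual API key would give independent limits.

```python
from collections import deque


class WindowedBudget:
    """Tracks spend over a trailing time window and checks requests up front."""

    def __init__(self, limit_usd, window_seconds):
        self.limit_usd = limit_usd
        self.window_seconds = window_seconds
        self.spend = deque()  # (timestamp, cost) pairs, oldest first

    def _spent(self, now):
        # Drop entries that have aged out of the window, then total the rest.
        while self.spend and now - self.spend[0][0] >= self.window_seconds:
            self.spend.popleft()
        return sum(cost for _, cost in self.spend)

    def allow(self, estimated_cost, now):
        """Pre-flight check: True if the estimate fits the remaining budget."""
        return self._spent(now) + estimated_cost <= self.limit_usd

    def record(self, actual_cost, now):
        """Record what a completed request actually cost."""
        self.spend.append((now, actual_cost))
```

The caller checks `allow` with a cost estimate before dispatching, then calls `record` with the real cost afterwards, so the budget is never exceeded by more than the estimation error of a single request.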

Learning & Distillation

  • Routing weights adapt from real success/failure signals
  • Judge agent for structural quality evaluation
  • Distillation pipeline: cloud responses train local models
  • TruthKeeper integrity layer prevents stale training data
  • Epistemic state tracking (Supported / Hypothesis / Contested)
  • Conversation compaction for long context management
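
One simple way to adapt routing weights from success and failure signals is an exponential moving average per capability and model. The sketch below is an assumption about how such an update could work, not a description of FrugalRoute's scoring; `RoutingWeights`, the 0.1 learning rate, and the 0.5 starting weight are hypothetical.

```python
from collections import defaultdict


class RoutingWeights:
    def __init__(self, learning_rate=0.1, initial=0.5):
        self.learning_rate = learning_rate
        # (capability, model) -> weight in [0, 1]; unseen pairs start at `initial`.
        self.weights = defaultdict(lambda: initial)

    def record(self, capability, model, success):
        """Move the weight a step toward 1.0 on success, toward 0.0 on failure."""
        key = (capability, model)
        target = 1.0 if success else 0.0
        self.weights[key] += self.learning_rate * (target - self.weights[key])
        return self.weights[key]

    def best(self, capability, models):
        """Pick the model with the highest current weight for a capability."""
        return max(models, key=lambda m: self.weights[(capability, m)])
```

Because each update moves only a fraction of the way toward the target, a single failure nudges a model down without discarding its history.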

Operations

  • Model aliases: fast, smart, cheap — decouple code from models
  • Prometheus metrics (frugalroute_*)
  • YAML model config (config/models.yaml)
  • OpenAPI spec at /openapi.json
  • One-command calibration tooling

Extensibility

  • MCP tool registry (MCP + OpenAI + Anthropic tools, unified)
  • Guardrails pipeline for pre/post content filtering
  • Provider adapters: Ollama, OpenAI, Anthropic
  • Plug in new providers by implementing one interface
  • Bidding/auction system for ambiguous routing decisions

The Competition

Every LLM gateway proxies requests. None of them think about them.

| | liteLLM | OpenRouter | Portkey | Bifrost | FrugalRoute |
|:---|:---:|:---:|:---:|:---:|:---:|
| OpenAI-compatible drop-in | Yes | Yes | Yes | Yes | Yes |
| Routes by capability, not model name | | | | | Yes |
| Local-first (Ollama, Apple Silicon) | | | | | Yes |
| Semantic intent classification | | | | | Yes |
| Confidence-based escalation cascade | | | | | Yes |
| Two-tier semantic cache | | | Simple | | Yes |
| Learns from traffic, self-improves | | | | | Yes |
| Distills cloud into local models | | | | | Yes |
| Hardware auto-profiling | | | | | Yes |
| Budget enforcement per key/session | Partial | | | Partial | Yes |
| A/B testing across models | | | | | Yes |
| MCP tool interoperability | Partial | | | | Yes |
| Self-hosted, no vendor lock-in | Yes | | Yes | Yes | Yes |

liteLLM is a great proxy. It connects 100+ providers behind one API. But it doesn't know what your prompt needs — you still pick the model. No local tier, no caching, no learning.

OpenRouter is a managed marketplace. Not self-hosted. Your data leaves your network.

Portkey has solid reliability features — retries, fallbacks, circuit breaking. But it routes by provider weight, not by prompt intent. No local models. No distillation.

Bifrost is fast (11 µs overhead). But it's a load balancer, not a router. It doesn't understand what your request needs.

They move traffic. FrugalRoute makes decisions.


Configuration

# .env
PORT=3100
OLLAMA_BASE_URL=http://localhost:11434
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EMBEDDING_MODEL=nomic-embed-text
DEFAULT_MAX_COST_PER_REQUEST=0.01

# config/models.yaml
aliases:
  fast: gemma3-4b
  smart: claude-sonnet-4-20250514
  cheap: llama3.2

Full configuration reference: docs/user/configuration.mdx
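
To show how aliases decouple application code from concrete model names, here is a small resolution sketch. The alias table mirrors the YAML above; the `resolve_model` function, the `REGISTERED` set, and the error behavior are hypothetical and only illustrate the idea.

```python
from typing import Optional, Set

# Mirrors the aliases in config/models.yaml above.
ALIASES = {
    "fast": "gemma3-4b",
    "smart": "claude-sonnet-4-20250514",
    "cheap": "llama3.2",
}

# Hypothetical set of registered model names.
REGISTERED = {"gemma3-4b", "claude-sonnet-4-20250514", "llama3.2", "gpt-4o"}


def resolve_model(requested: str, registered: Set[str]) -> Optional[str]:
    """Map the request's `model` field to a concrete model name.

    Returns None for "auto", meaning the router should choose.
    """
    if requested == "auto":
        return None
    name = ALIASES.get(requested, requested)
    if name not in registered:
        raise ValueError(f"unknown model or alias: {requested!r}")
    return name
```

Changing what `smart` points to then becomes a config edit, with no change to any client that sends `"model": "smart"`.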


Documentation

| Guide | What it covers |
|:---|:---|
| Getting Started | Install, first request, connect existing clients |
| Architecture | Module map, request flow, design principles |
| Routing | Classification, escalation, bidding, weight adjustment |
| Caching | Two-tier semantic cache, adaptive thresholds |
| Cost Management | Estimation, tracking, budget enforcement |
| Configuration | Env vars, routes, models, budgets, thresholds |
| Deployment | Docker, production hardening, hardware profiling |
| Tools & MCP | Tool registry, MCP integration, format conversion |
| Distillation | Training flywheel, TruthKeeper integrity |
| API Reference | Complete HTTP endpoint reference |


FAQ

Do I need Ollama?

For local inference, yes. FrugalRoute uses Ollama as its local model backend. Without it, requests route straight to cloud providers, which still gives you caching, cost tracking, and budget enforcement, but you miss the free local tier.

Which models are supported?

Any model Ollama can run (Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, etc.), plus OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5) and Anthropic (Claude Opus, Sonnet, Haiku). Adding a new provider is one adapter interface.

Does it support streaming?

Yes. Full SSE streaming with heartbeat keepalive, compatible with the OpenAI streaming format. Set "stream": true in your request, same as you would with OpenAI directly.

How much latency does routing add?

Keyword classification adds <1ms. Embedding-based classification adds ~4ms. Cache hits return in ~1ms. The routing decision itself is negligible compared to model inference time.

Can I force a specific model?

Yes. Set model to any registered model name (e.g., "gpt-4o", "llama3.2") instead of "auto". FrugalRoute will route directly to that model while still tracking cost and logging the request. You can also use aliases like "fast", "smart", or "cheap".

Where does my data go?

FrugalRoute is fully self-hosted. Local model requests never leave your machine. Cloud requests go directly to OpenAI/Anthropic; FrugalRoute never proxies through a third-party service. Training pairs for distillation are stored locally in SQLite.

How does distillation work?

When a request escalates to a cloud model, the prompt-response pair is captured, quality-scored by a judge agent, and stored locally. Running bun run distill feeds verified pairs into a fine-tuning pipeline for your local models. The TruthKeeper integrity layer ensures only high-quality, non-contradicted data is used. See Distillation docs.

What about tool calling and MCP?

Supported. FrugalRoute's MCP tool registry unifies tools across MCP, OpenAI, and Anthropic formats. Tool calls are routed to the correct backend automatically.


Built With

Bun + Hono + Ollama + TypeScript

445 tests. 1,196 assertions. Two production dependencies.


Contributing

git clone https://github.com/SimplyLiz/FrugalRoute && cd FrugalRoute
bun install
bun test           # run all 445 tests
bun run dev        # start dev server with hot reload
bun run lint       # lint with Biome
bun run benchmark  # run hardware benchmarks
bun run calibrate  # calibrate keyword classifier thresholds

PRs welcome. Please run bun run check (lint + tests) before submitting.


License

PolyForm Small Business License 1.0.0: free for individuals, small businesses (<100 people, <1M EUR revenue), nonprofits, and open source projects.

Commercial license for larger organizations: [email protected]