krusch-cascade-router
v1.0.0
Latency-aware LLM router that dynamically cascades between edge and cloud models via logprob inspection.
⚡ Why Krusch Cascade Router?
"LLM routing an LLM is a trap."
Using a third, large LLM to decide which model should answer a query adds significant TTFT (Time To First Token) latency and API cost. krusch-cascade-router avoids this by combining a fast predictive heuristic classifier (<50ms latency) with a reactive, logprob-based speculative cascade. Designed for developers building agentic systems with local AI, it lets you optimize for cost, performance, and reliability without sacrificing capability.
Key Features
- 🚀 Sub-50ms Heuristic Classifier: Evaluates prompt complexity instantly.
- 🧠 Logprob Speculative Execution: Reactively cascades to heavy cloud models if the edge model's confidence drops.
- 🔌 Framework Agnostic: Can be plugged into any Node.js AI architecture.
- 🛡️ Custom Heuristics: Support for customRules to inject your own prompt complexity detection logic.
- 🛑 Native AbortSignal Support: Manage request timeouts natively via ChatOptions.
- 📦 Dual CJS/ESM Support: Works in modern ECMAScript and legacy environments.
🧠 Architecture: How It Works
- Predictive Classifier: Instantly evaluates the prompt's complexity via string heuristics (length, code blocks, complex cognitive verbs, or your customRules). If classified as complex, it routes directly to the heavy cloud model.
- Speculative Cascade: If classified as simple, it streams the fast local edge model. It buffers and inspects the logprobs of the first N tokens. If the confidence (probability) dips below your configured threshold, it silently aborts the stream and falls back to the heavy cloud model.
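The predictive step can be pictured as a plain function over the prompt string. The sketch below is illustrative only and is not the library's internal code: the names looksComplex and pickRoute, the 1500-character cutoff, and the verb list are all assumptions. Only the kinds of signal (length, code blocks, cognitive verbs, customRules) come from the description above.

```javascript
// Hypothetical sketch of a string-heuristic complexity check.
// The cutoff and verb list are assumed values, not the library's defaults.
const MAX_SIMPLE_LENGTH = 1500;
const COGNITIVE_VERBS = /\b(analy[sz]e|architect|design|prove|refactor|optimi[sz]e)\b/i;
const CODE_FENCE = /`{3}/; // three consecutive backticks

function looksComplex(prompt, customRules = []) {
  // User-supplied rules run first; any rule returning true marks the prompt complex.
  if (customRules.some((rule) => rule(prompt))) return true;
  if (prompt.length > MAX_SIMPLE_LENGTH) return true;
  if (CODE_FENCE.test(prompt)) return true;
  return COGNITIVE_VERBS.test(prompt);
}

// Complex prompts go straight to the heavy model; the rest try the edge model first.
function pickRoute(prompt, customRules = []) {
  return looksComplex(prompt, customRules) ? 'heavy' : 'fast';
}
```

Because every check is a synchronous string operation, a classifier of this shape adds no network round trip before routing, which is the point of avoiding an LLM-based router.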
graph TD;
  A[Incoming Prompt] --> B{Heuristic Classifier};
  B -- Complex --> C[Heavy Cloud Model];
  B -- Simple --> D[Local Edge Model];
  D --> E{Evaluate Logprobs first N tokens};
  E -- Confidence >= Threshold --> F[Stream Edge Response];
  E -- Confidence < Threshold --> G[Abort Edge];
  G --> C;
📦 Installation
npm install krusch-cascade-router
Note: Requires Node.js 18+ for native fetch and AbortSignal support.
🚀 Quick Start Guide
import { CascadeRouter } from 'krusch-cascade-router';
// 1. Initialize the router with your edge and cloud models
const router = new CascadeRouter({
  fastModel: {
    url: 'http://localhost:11434/v1/chat/completions',
    model: 'qwen2.5:3b' // Edge node tag resolution
  },
  heavyModel: {
    apiKey: process.env.GEMINI_API_KEY,
    model: 'gemini-2.5-pro',
    provider: 'gemini'
  },
  cascadeThreshold: 0.85, // Abort if average probability of first 5 tokens is < 85%
  tokensToEvaluate: 5
});
// 2. Send a chat request
const response = await router.chat("Write a complex architectural plan...");
// 3. Check where it was routed
console.log(`Routed to: ${response.routedTo}`);
console.log(response.text);
🛠️ Advanced Usage
Custom Heuristic Rules (customRules)
You can inject your own detection logic to fine-tune what goes directly to the cloud model:
const router = new CascadeRouter({
  // ...models config
  customRules: [
    (prompt) => prompt.includes('PostgreSQL'), // Always route DB questions to cloud
    (prompt) => prompt.length > 2000 // Override default length heuristics
  ]
});
Timeouts and AbortSignals
Native integration with AbortSignal for graceful timeout handling:
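const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 10000); // 10s timeout
try {
  const response = await router.chat("Analyze this dataset", {
    signal: controller.signal
  });
  console.log(response.text);
} catch (err) {
  if (err.name === 'AbortError') {
    console.log('Request timed out or was aborted manually.');
  } else {
    throw err; // surface unrelated failures instead of swallowing them
  }
} finally {
  clearTimeout(timer); // avoid leaving a pending timer after the request settles
}
📚 API Reference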
new CascadeRouter(config)
| Property | Type | Description |
|---|---|---|
| fastModel | ModelConfig | Configuration for your fast, local edge model (e.g. Ollama). |
| heavyModel | ModelConfig | Configuration for your heavy cloud fallback (e.g. Gemini, OpenAI). |
| cascadeThreshold | number | Confidence threshold as a probability (0.0 to 1.0). If the average probability of the first tokensToEvaluate tokens falls below this, the router cascades to the heavy model. |
| tokensToEvaluate | number | How many tokens to buffer before making the speculative decision. |
| customRules | Array<(prompt: string) => boolean> | (Optional) Heuristic functions evaluated against the prompt; a rule that returns true sends the prompt directly to the heavy model. |
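To make cascadeThreshold and tokensToEvaluate concrete: a logprob is the natural logarithm of a token's probability, so Math.exp(logprob) recovers the probability. The sketch below assumes the decision uses the arithmetic mean of those probabilities over the first tokensToEvaluate tokens, as the Quick Start comment suggests. The helper names averageProbability and shouldCascade are hypothetical and not part of the package API.

```javascript
// Hypothetical helpers; the averaging rule is an assumption based on the
// "average probability of first 5 tokens" comment in the Quick Start.
function averageProbability(logprobs, tokensToEvaluate) {
  const head = logprobs.slice(0, tokensToEvaluate);
  if (head.length === 0) return 0; // no tokens yet: treat as zero confidence
  const sum = head.reduce((acc, lp) => acc + Math.exp(lp), 0);
  return sum / head.length;
}

function shouldCascade(logprobs, { cascadeThreshold, tokensToEvaluate }) {
  return averageProbability(logprobs, tokensToEvaluate) < cascadeThreshold;
}
```

For example, logprobs of [-0.05, -0.1, -0.02, -0.3, -0.08] average to a probability of about 0.90, which stays above a 0.85 threshold, so under this assumed rule the edge response would be kept.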
router.chat(prompt, options?)
| Parameter | Type | Description |
|---|---|---|
| prompt | string | The user's input prompt. |
| options | ChatOptions | (Optional) Options like { signal: AbortSignal }. |
Returns: Promise<{ text: string, routedTo: 'fast' | 'heavy' }>
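The routedTo field reflects which branch of the cascade produced the text. The following is a minimal sketch of that control flow with both models stubbed out, not the library's implementation. The function cascadeChat, the fastStream async generator yielding { token, logprob }, and the heavy function are hypothetical names introduced for illustration.

```javascript
// Hypothetical sketch of the speculative cascade; both models are stubs.
// fastStream(signal) yields { token, logprob }; heavy() resolves to a string.
async function cascadeChat({ fastStream, heavy, cascadeThreshold, tokensToEvaluate }) {
  const controller = new AbortController();
  const buffered = [];
  let decided = false;

  for await (const chunk of fastStream(controller.signal)) {
    buffered.push(chunk);
    if (!decided && buffered.length >= tokensToEvaluate) {
      const avg =
        buffered.reduce((acc, c) => acc + Math.exp(c.logprob), 0) / buffered.length;
      if (avg < cascadeThreshold) {
        controller.abort(); // stop the edge stream before falling back
        return { text: await heavy(), routedTo: 'heavy' };
      }
      decided = true; // confidence held; keep consuming the edge stream
    }
  }
  // Streams shorter than tokensToEvaluate are accepted as-is in this sketch.
  return { text: buffered.map((c) => c.token).join(''), routedTo: 'fast' };
}
```

The decision is made once, after the first tokensToEvaluate tokens, so a confident edge response pays no extra cost beyond buffering those tokens.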
🤝 Contributing
We welcome contributions! Please follow the established homelab conventions:
- Library code must NEVER use console.warn or console.log directly. Route diagnostics through callback options (onEvent pattern).
- Ensure your AbortSignal listeners use { once: true } to prevent leaks.
- Run tests via npm test before submitting PRs.
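As a rough illustration of the first two conventions, the helper below forwards an abort from a caller's signal to an internal controller. The name forwardAbort and the event shape { type: 'abort-forwarded' } are hypothetical and only meant to show the pattern, not code from this repository.

```javascript
// Hypothetical helper illustrating both conventions; not part of the package API.
function forwardAbort(parentSignal, childController, onEvent = () => {}) {
  if (!parentSignal) return;
  if (parentSignal.aborted) {
    childController.abort(); // already aborted: no listener needed
    return;
  }
  parentSignal.addEventListener(
    'abort',
    () => {
      onEvent({ type: 'abort-forwarded' }); // diagnostics go to the caller, not console
      childController.abort();
    },
    { once: true } // the listener removes itself after it fires
  );
}
```

Passing diagnostics through onEvent leaves the decision to log, count, or ignore them with the application that embeds the router.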
📄 License
MIT License © 2026 kruschdev
