# 🧠 LLM-Runner-Router: The Universal Model Orchestration System
*Where models transcend their formats, engines dance across dimensions, and inference becomes art*
## 🌌 What Is This Sorcery?
LLM-Runner-Router is not just another model loader: it's a full-stack, format-agnostic neural orchestration system that adapts to ANY model format, ANY runtime environment, and ANY deployment scenario. Think of it as the Swiss Army knife of AI inference, but cooler and with more quantum entanglement.
## ✨ Core Superpowers
- 🔮 Universal Format Support: GGUF, ONNX, Safetensors, HuggingFace, and whatever format the future invents
- ⚡ Multi-Engine Madness: WebGPU for speed demons, WASM for universalists, Node for servers, Edge for the distributed
- 🧭 Intelligent Routing: Automatically selects the perfect model based on quality, cost, speed, or pure chaos
- 🚀 Streaming Everything: Real-time token generation that flows like digital poetry
- 💰 Cost Optimization: Because your wallet deserves love too
- 🎯 Zero-Config Magic: Works out of the box, customizable to the quantum level
## 🎮 Quick Start (For The Impatient)
```bash
# Clone the quantum repository
git clone https://github.com/echoaisystems/llm-runner-router
cd llm-runner-router

# Install interdimensional dependencies
npm install

# Launch the neural matrix
npm start
```

## 🎭 Usage Examples (Where Magic Happens)
### Simple Mode - For Mortals
```javascript
import { quick } from 'llm-runner-router';

// Just ask, and ye shall receive
const response = await quick("Explain quantum computing to a goldfish");
console.log(response.text);
```

### Advanced Mode - For Wizards
```javascript
import LLMRouter from 'llm-runner-router';

const router = new LLMRouter({
  strategy: 'quality-first',
  enableQuantumMode: true // (Not actually quantum, but sounds cool)
});

// Load multiple models
await router.load('huggingface:meta-llama/Llama-2-7b');
await router.load('local:./models/mistral-7b.gguf');

// Let the router choose the best model
const response = await router.advanced({
  prompt: "Write a haiku about JavaScript",
  temperature: 0.8,
  maxTokens: 50,
  fallbacks: ['gpt-3.5', 'local-llama']
});
```

### Streaming Mode - For The Real-Time Addicts
```javascript
const stream = router.stream("Tell me a story about a debugging dragon");

for await (const token of stream) {
  process.stdout.write(token);
}
```

### Ensemble Mode - For The Overachievers
```javascript
const result = await router.ensemble([
  { model: 'gpt-4', weight: 0.5 },
  { model: 'claude', weight: 0.3 },
  { model: 'llama', weight: 0.2 }
], "What is the meaning of life?");
// Get wisdom from multiple AI perspectives!
```

## 🏗️ Architecture (For The Curious)
```
┌────────────────────────────────────────────┐
│              Your Application              │
├────────────────────────────────────────────┤
│             LLM-Runner-Router              │
├─────────────┬──────────┬───────────────────┤
│   Router    │ Pipeline │     Registry      │
├─────────────┴──────────┴───────────────────┤
│        Engines (WebGPU, WASM, Node)        │
├────────────────────────────────────────────┤
│     Loaders (GGUF, ONNX, Safetensors)      │
└────────────────────────────────────────────┘
```

## 🎯 Routing Strategies
Choose your destiny (a usage sketch follows the list):
- 🏆 Quality First: Only the finest neural outputs shall pass
- 💵 Cost Optimized: Your accountant will love you
- ⚡ Speed Priority: Gotta go fast!
- ⚖️ Balanced: The zen master approach
- 🎲 Random: Embrace chaos, trust the universe
- 🔄 Round Robin: Everyone gets a turn
- 📊 Least Loaded: Fair distribution of neural labor
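As a minimal illustration, here's how picking a strategy at construction time might look. The `'balanced'` value matches the configuration example below; `'cost-optimized'` is an assumed kebab-case form of the list above, not confirmed by this README:

```javascript
import LLMRouter from 'llm-runner-router';

// 'balanced' appears in the configuration example below; 'cost-optimized'
// is an assumed kebab-case form of the strategy list above.
const zenRouter = new LLMRouter({ strategy: 'balanced' });
const frugalRouter = new LLMRouter({ strategy: 'cost-optimized' });

const response = await zenRouter.advanced({
  prompt: "Summarize Hamlet in one tweet",
  maxTokens: 60
});
console.log(response.text);
```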
## 🛠️ Configuration
```jsonc
{
  "routingStrategy": "balanced",
  "maxModels": 100,
  "enableCaching": true,
  "quantization": "dynamic",
  "preferredEngine": "webgpu",
  "maxTokens": 4096,
  "cosmicAlignment": true // Optional but recommended
}
```

## 📊 Performance Metrics
- Model Load Time: < 500ms ⚡
- First Token: < 100ms 🚀
- Throughput: > 100 tokens/sec 💨
- Memory Usage: < 50% of model size 🧠
- Quantum Entanglement: Yes ✨
## 🔧 Advanced Features
### Custom Model Loaders
```javascript
router.registerLoader('my-format', MyCustomLoader);
```
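This README doesn't spell out the loader contract, so here's a purely hypothetical sketch of what `MyCustomLoader` might look like; the `supports`/`load`/`unload` method names and return shapes are assumptions (see docs/ARCHITECTURE.md for the real interface):

```javascript
// Hypothetical loader sketch — method names and return shapes are
// assumptions, not the documented contract (see docs/ARCHITECTURE.md).
class MyCustomLoader {
  // Tell the router which model sources this loader understands
  supports(source) {
    return source.endsWith('.myformat');
  }

  // Fetch the weights and hand back something the engines can run
  async load(source) {
    const bytes = await fetch(source).then((res) => res.arrayBuffer());
    return { format: 'my-format', weights: bytes };
  }

  // Release resources when the model is evicted from the registry
  async unload(model) {
    model.weights = null;
  }
}
```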
### Cost Optimization

```javascript
// `availableModels` is whatever candidate list you've loaded or registered
const budget = 0.10; // $0.10 per request
const models = router.optimizeForBudget(availableModels, budget);
```

### Quality Scoring
```javascript
const scores = await router.rankModelsByQuality(models, prompt);
```

## 🌐 Deployment Options
- Browser: Full client-side inference with WebGPU
- Node.js: Server-side with native bindings
- Edge: Cloudflare Workers, Deno Deploy (see the sketch after this list)
- Docker: Container-ready out of the box
- Kubernetes: Scale to infinity and beyond
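For the Edge option, a minimal Cloudflare Worker sketch; it assumes the `quick` helper runs unchanged in the Workers runtime, which this README doesn't explicitly confirm:

```javascript
// Hypothetical Cloudflare Worker — assumes `quick` works as-is in the
// Workers runtime (no Node-specific APIs under the hood).
import { quick } from 'llm-runner-router';

export default {
  async fetch(request) {
    const { prompt } = await request.json();
    const response = await quick(prompt);
    return new Response(JSON.stringify({ text: response.text }), {
      headers: { 'Content-Type': 'application/json' }
    });
  }
};
```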
## 🤝 Contributing
We welcome contributions from all dimensions! Whether you're fixing bugs, adding features, or improving documentation, your quantum entanglement with this project is appreciated.
- Fork the repository (in this dimension)
- Create your feature branch (`git checkout -b feature/quantum-enhancement`)
- Commit with meaningful messages (`git commit -m 'Add quantum tunneling support'`)
- Push to your branch (`git push origin feature/quantum-enhancement`)
- Open a Pull Request (and hope it doesn't collapse the wave function)
## 📜 License
MIT License - Because sharing is caring, and AI should be for everyone.
## 🙏 Acknowledgments
- The Quantum Field for probabilistic inspiration
- Coffee for keeping us in a superposition of awake and asleep
- You, for reading this far and joining our neural revolution
## 🚀 What's Next?
- [ ] Actual quantum computing support (when available)
- [ ] Time-travel debugging (work in progress)
- [ ] Telepathic model loading (pending FDA approval)
- [ ] Integration with alien AI systems (awaiting first contact)
Built with 💙 and ☕ by Echo AI Systems
"Because every business deserves an AI brain, and every AI brain deserves a proper orchestration system"
## 📞 Support
- Documentation: docs/ARCHITECTURE.md
- Issues: GitHub Issues
- Email: [email protected]
- Telepathy: Focus really hard on your question
Remember: With great model power comes great computational responsibility. Use wisely! 🧙‍♂️
