claude-hybrid
Seamless Claude/Ollama hybrid solution with intelligent failover, semantic context retrieval, and real-time streaming.
Never stop coding when Claude hits rate limits. Automatically fails over to local Ollama models while preserving your conversation context.
✨ Features
🔄 Intelligent Failover
- Automatic Claude → Ollama switching on rate limits
- Context preservation across providers (95%+ accuracy)
- Zero-downtime continuation of work
- Automatic failback when Claude is available
🧠 Semantic Context Retrieval
- Embeddings-based similarity search (sentence-transformers)
- Composite scoring: similarity + recency + importance
- 95%+ context accuracy (up from 67% with SQL alone)
- Automatic embedding generation
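Under the hood this is the standard sentence-transformers + Faiss pattern. A simplified sketch of the retrieval step (the model name and chunks are placeholders, not the package's actual internals):
from sentence_transformers import SentenceTransformer
import faiss

# Embed stored context chunks and search them by cosine similarity (illustrative only)
model = SentenceTransformer("all-MiniLM-L6-v2")            # placeholder model name
chunks = ["added JWT auth to /login", "refactored the DB pool", "wrote unit tests"]
vectors = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])                # inner product = cosine on normalized vectors
index.add(vectors)
query = model.encode(["continue the JWT work"], normalize_embeddings=True)
scores, ids = index.search(query, 2)                       # top-2 most relevant chunks
print([(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])])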
⚡ Performance Optimized
- Connection pooling (10x faster requests)
- Context caching (500x faster on cache hits)
- Dynamic timeouts (30-300s adaptive)
- Real-time token streaming
💎 Professional UX
- Streaming responses (see tokens as they're generated)
- Shell completions (Bash/Zsh/Fish)
- Progress indicators with elapsed time
- Health monitoring and diagnostics
- Retry logic with exponential backoff
🎯 Multi-Model Support
5 optimized Ollama models included:
- qwen2.5-coder:32b (19GB) - Code specialist
- qwen3:32b (20GB) - General purpose
- codestral:22b (12GB) - Fast code generation
- qwen3:latest (5.2GB) - Quick queries
- llama3.2:latest (2GB) - Fastest responses
🚀 Quick Start
Prerequisites
- Node.js 16+
- Python 3.12+
- Claude Code CLI
- Ollama running locally
Installation
# Global installation (recommended)
npm install -g claude-hybrid
# Or use npx (no global install needed)
npx claude-hybrid setup
Setup
# 1. Initialize database and config
claude-hybrid setup
# 2. Install Python dependencies (required for semantic search)
pip3 install -r scripts/requirements.txt
# Or use a virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r scripts/requirements.txt
# 3. Pull recommended Ollama models (choose based on your VRAM)
ollama pull qwen2.5-coder:32b # 19GB - Best for code
ollama pull llama3.2:latest # 2GB - Fastest
ollama pull qwen3:32b # 20GB - General purpose
Note: Python dependencies are attempted during npm install but may fail due to permissions. If you see warnings about missing dependencies (numpy, faiss-cpu, etc.), run the pip install commands above.
Usage
# Default: Hybrid mode (Claude with Ollama failover)
claude-hybrid "implement user authentication with JWT"
# Force Ollama only (test without Claude)
claude-hybrid --ollama-only "explain quicksort algorithm"
# Specific model
claude-hybrid --model qwen3:32b "write a blog post about AI"
# Disable streaming (wait for complete response)
claude-hybrid --no-stream "generate large code file"
📖 Commands
Core Commands
claude-hybrid "your prompt" # Send prompt (hybrid mode)
claude-hybrid setup # Initialize system
claude-hybrid health [--verbose] [--fix] # System diagnostics
claude-hybrid status # Recent activity
claude-hybrid models # List available Ollama models
claude-hybrid test # Test provider connectivity
Configuration
claude-hybrid mode [hybrid|ollama-only|claude-only]
claude-hybrid config # View all config
claude-hybrid config ollama.model qwen2.5-coder:32b
Database & Maintenance
claude-hybrid query requests -l 10 # Recent requests
claude-hybrid query context -l 5 # Context chunks
claude-hybrid query sessions # Active sessions
claude-hybrid query decisions # Technical decisions
claude-hybrid cleanup --days 30 # Clean old data
claude-hybrid migrate-embeddings # Regenerate semantic index
Shell Completions
# Generate completions for your shell
claude-hybrid completion bash | sudo tee /etc/bash_completion.d/claude-hybrid
claude-hybrid completion zsh > ~/.zsh/completions/_claude-hybrid
claude-hybrid completion fish > ~/.config/fish/completions/claude-hybrid.fish
⚙️ Configuration
Configuration is stored in ~/.claude-hybrid/config.json:
{
"ollama": {
"host": "http://localhost:11434",
"model": "qwen2.5-coder:32b",
"fallback_model": "qwen3:32b",
"streaming": true
},
"context": {
"semantic_search_enabled": true,
"max_chunks": 15,
"cache_enabled": true,
"cache_ttl_seconds": 60
},
"retry": {
"enabled": true,
"max_attempts": 3,
"base_delay": 1.0
},
"mode": "hybrid"
}
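The retry block drives exponential backoff on transient failures. Roughly, max_attempts and base_delay combine like this (the doubling schedule is an assumption for illustration; the exact multiplier is internal):
import time

def with_retry(call, max_attempts=3, base_delay=1.0):
    """Retry a callable with exponential backoff: 1s, 2s, 4s, ... between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# with_retry(lambda: hybrid.run("implement feature X"))   # hybrid from the Programmatic Usage section below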
🎯 Use Cases
1. Continue Coding During Rate Limits
# Working on authentication feature
claude-hybrid "add OAuth2 support to API"
# Claude hits rate limit
# → Automatically switches to Ollama
# → Preserves conversation context
# → Continues implementation seamlessly
2. Cost-Effective Development
# Use free local models for routine tasks
claude-hybrid mode ollama-only
# Quick iterations
claude-hybrid "refactor this function"
claude-hybrid "add error handling"
claude-hybrid "write unit tests"
# Switch back to Claude for complex tasks
claude-hybrid mode hybrid
3. Offline Development
# No internet? No problem!
claude-hybrid mode ollama-only
# Full coding capability with local models
claude-hybrid "implement binary search tree"
claude-hybrid "explain the algorithm"
claude-hybrid "optimize for performance"📊 Performance
| Metric | Before claude-hybrid | After |
|--------|----------------------|-------|
| Rate limit handling | ❌ Stop working | ✅ Seamless failover |
| Context preservation | ❌ Lost | ✅ 95%+ accuracy |
| Request overhead | 50ms | 5ms (10x faster) |
| Context retrieval | 50ms | 0.1ms (500x faster) |
| Failure detection | 600s | 30s (20x faster) |
| User feedback | None | Real-time streaming |
🧠 How It Works
Intelligent Context Management
- Semantic Search: Uses embeddings to find relevant context
- Recency Boost: Prioritizes recent work (3x for <5min, 2x for <15min)
- Importance Scoring: User-defined priority (1-10 scale)
- Composite Ranking:
score = (1.0 × similarity) + (0.3 × recency) + (0.2 × importance)
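In code the ranking is just a weighted sum; a sketch using the weights above (how recency and importance are normalized here is an assumption for illustration):
import time

def recency_boost(created_at: float) -> float:
    # 3x for chunks under 5 minutes old, 2x under 15 minutes, otherwise no boost
    age = time.time() - created_at
    return 3.0 if age < 300 else 2.0 if age < 900 else 1.0

def composite_score(similarity: float, created_at: float, importance: int) -> float:
    # importance is the user-defined 1-10 value, scaled to 0-1 for the weighted sum (assumed scaling)
    return 1.0 * similarity + 0.3 * recency_boost(created_at) + 0.2 * (importance / 10)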
Smart Failover
┌─────────────┐
│ User Prompt │
└──────┬──────┘
│
▼
┌──────────────────┐
│ Try Claude CLI │ ◄─── First choice (best quality)
└────┬────────┬────┘
│ │
Success Rate Limit
│ │
▼ ▼
┌─────────┐ ┌──────────────────┐
│ Return │ │ Failover to │
│ Result │ │ Ollama + Context │
└─────────┘ └────┬─────────────┘
│
▼
┌──────────────┐
│ Return Result│
└──────────────┘
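A minimal Python sketch of that flow (how the Claude CLI is invoked and how a rate limit is detected are simplified assumptions here, not the package's actual logic):
import subprocess
import requests

def run_hybrid(prompt: str, ollama_model: str = "qwen2.5-coder:32b") -> str:
    # 1. Try Claude first (best quality)
    result = subprocess.run(["claude", "-p", prompt], capture_output=True, text=True)
    if result.returncode == 0 and "rate limit" not in result.stderr.lower():
        return result.stdout
    # 2. Failover: hand the prompt to local Ollama (claude-hybrid also injects retrieved context here)
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": ollama_model, "prompt": prompt, "stream": False},
                         timeout=300)
    return resp.json()["response"]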
Dynamic Timeout Calculation
timeout = base(30s) × length_factor × context_factor × complexity_factor × model_factor
timeout = clamp(timeout, 30s, 300s)
# Examples:
"2+2" → 30s (simple, small model)
"implement OAuth2" → 164s (complex, large model)
"comprehensive analysis" → 300s (max)🔧 Advanced Usage
🔧 Advanced Usage
Command Flags
--ollama-only # Skip Claude entirely (same as mode ollama-only)
--claude-only # No Ollama failover (same as mode claude-only)
--model <name> # Specify Ollama model (e.g., qwen2.5-coder:32b)
--no-stream # Disable streaming, wait for complete response
Environment Variables
CLAUDE_HYBRID_DIR # Custom config directory (default: ~/.claude-hybrid)
OLLAMA_HOST # Ollama server URL (default: http://localhost:11434)
Programmatic Usage
from smart_claude import HybridClaude
hybrid = HybridClaude()
response = hybrid.run("implement feature X")
print(response)
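The real-time streaming in the CLI comes from Ollama's streaming endpoint. If you need the same token-by-token output outside claude-hybrid, the raw Ollama API can be consumed like this:
import json
import requests

# Ollama streams one JSON chunk per line; print tokens as they arrive
with requests.post("http://localhost:11434/api/generate",
                   json={"model": "llama3.2:latest", "prompt": "explain quicksort", "stream": True},
                   stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("response", ""), end="", flush=True)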
🐛 Troubleshooting
Ollama Connection Error
# Check if Ollama is running
ollama ps
# Start Ollama if not running
ollama serve
# Test connectivity
claude-hybrid health
Semantic Search Not Working
# Check Python dependencies
source .venv/bin/activate
python -c "import sentence_transformers; import faiss; import onnxruntime; print('✅ OK')"
# Reinstall if needed
pip install -r scripts/requirements.txt
# Regenerate embeddings and Faiss index
claude-hybrid migrate-embeddings --force --verify
Rate Limit Still Occurring
# Check current mode
claude-hybrid mode
# Ensure failover is enabled
claude-hybrid mode hybrid
# Comprehensive health check
claude-hybrid health --verbose
# Test failover manually
claude-hybrid --ollama-only "test prompt"Database Issues
# Check database integrity
sqlite3 ~/.claude-hybrid/hybrid-memory.db "PRAGMA integrity_check;"
# View recent activity
claude-hybrid query sessions -l 5
# Clean old data if corrupted
claude-hybrid cleanup --days 7 --force
📚 Documentation
Project Documentation
- CLAUDE.md - Architecture and development guide
- Complete implementation docs in /docs/hybrid-llm-solution/
🤝 Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Read CLAUDE.md for architecture details
- Add tests for new features (npm test)
- Update documentation (README.md, CLAUDE.md)
- Submit a pull request
Development Setup
git clone https://github.com/dgtise25/claude-hybrid.git
cd claude-hybrid
npm install # Installs deps + creates Python venv
npm link # Symlink for local testing
npm test # Run test suite
📄 License
MIT License - see LICENSE file
🙏 Acknowledgments
- Built with Claude Code
- Powered by Ollama
- Semantic search via sentence-transformers
- Vector similarity with Faiss
🔗 Links
- Repository: https://github.com/dgtise25/claude-hybrid
- Issues: https://github.com/dgtise25/claude-hybrid/issues
- Documentation: CLAUDE.md and /docs
- NPM Package: https://www.npmjs.com/package/claude-hybrid
💡 Pro Tips
Optimize Model Selection
# Code tasks → Use code specialist
claude-hybrid --model qwen2.5-coder:32b "implement API"
# Quick queries → Use fast model
claude-hybrid --model llama3.2:latest "what is X?"
# General purpose → Use qwen3
claude-hybrid --model qwen3:32b "explain concept"
Maximize Context Accuracy
# Store important decisions with high importance
# Context chunks with importance ≥7 are prioritized
# Semantic search automatically finds relevant context
# No manual tagging needed!
Performance Tips
# First query to a model loads it into VRAM (~30-60s)
# Subsequent queries are fast
# Warmup large models:
claude-hybrid --model qwen3:32b "warmup"
# Use caching (60s TTL by default)
# Repeated similar queries use cached context
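Conceptually, the context cache is a plain TTL cache keyed on the lookup; a sketch (retrieve_context stands in for the semantic search step and is not a real function of this package):
import time

def retrieve_context(query: str) -> list[str]:
    ...  # stand-in for the embedding + Faiss lookup described earlier

_cache: dict[str, tuple[float, list[str]]] = {}

def cached_context(query: str, ttl: float = 60.0) -> list[str]:
    hit = _cache.get(query)
    if hit and time.time() - hit[0] < ttl:
        return hit[1]                          # cache hit: ~0.1ms instead of a full search
    chunks = retrieve_context(query)
    _cache[query] = (time.time(), chunks)
    return chunks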
⚠️ Important Performance Note
Initial Ollama prompts with large models (32b parameters) can take 60+ seconds on the first request as the model loads into VRAM. This is expected behavior:
- First request: 60-90 seconds (model loading + inference)
- Subsequent requests: 2-10 seconds (inference only)
- Not a bug: This is how Ollama manages memory-efficient model serving
Tip: Keep Ollama running and use the same model repeatedly to avoid reloading delays.
⭐ Star History
If you find this useful, please star the repository!
Made with ❤️ for developers who refuse to let rate limits slow them down.
