claude-hybrid

Seamless Claude/Ollama hybrid solution with intelligent failover, semantic context retrieval, and real-time streaming.

Never stop coding when Claude hits rate limits. Automatically fails over to local Ollama models while preserving your conversation context.



✨ Features

🔄 Intelligent Failover

  • Automatic Claude → Ollama switching on rate limits
  • Context preservation across providers (95%+ accuracy)
  • Zero-downtime continuation of work
  • Automatic failback when Claude becomes available again

🧠 Semantic Context Retrieval

  • Embeddings-based similarity search (sentence-transformers)
  • Composite scoring: similarity + recency + importance
  • 95%+ context accuracy (up from 67% with SQL alone)
  • Automatic embedding generation

⚡ Performance Optimized

  • Connection pooling (10x faster requests)
  • Context caching (500x faster on cache hits)
  • Dynamic timeouts (30-300s adaptive)
  • Real-time token streaming

💎 Professional UX

  • Streaming responses (see tokens as they're generated)
  • Shell completions (Bash/Zsh/Fish)
  • Progress indicators with elapsed time
  • Health monitoring and diagnostics
  • Retry logic with exponential backoff

🎯 Multi-Model Support

5 optimized Ollama models included:

  • qwen2.5-coder:32b (19GB) - Code specialist
  • qwen3:32b (20GB) - General purpose
  • codestral:22b (12GB) - Fast code generation
  • qwen3:latest (5.2GB) - Quick queries
  • llama3.2:latest (2GB) - Fastest responses

🚀 Quick Start

Prerequisites

  • Node.js and npm
  • Ollama installed and running locally
  • Python 3 with pip (for semantic search)
  • Claude CLI (for Claude access in hybrid mode)

Installation

# Global installation (recommended)
npm install -g claude-hybrid

# Or use npx (no global install needed)
npx claude-hybrid setup

Setup

# 1. Initialize database and config
claude-hybrid setup

# 2. Install Python dependencies (required for semantic search)
pip3 install -r scripts/requirements.txt

# Or use a virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r scripts/requirements.txt

# 3. Pull recommended Ollama models (choose based on your VRAM)
ollama pull qwen2.5-coder:32b  # 19GB - Best for code
ollama pull llama3.2:latest    # 2GB - Fastest
ollama pull qwen3:32b          # 20GB - General purpose

Note: Installation of the Python dependencies is attempted during npm install, but it can fail due to permissions. If you see warnings about missing dependencies (numpy, faiss-cpu, etc.), run the pip install commands above.

Usage

# Default: Hybrid mode (Claude with Ollama failover)
claude-hybrid "implement user authentication with JWT"

# Force Ollama only (test without Claude)
claude-hybrid --ollama-only "explain quicksort algorithm"

# Specific model
claude-hybrid --model qwen3:32b "write a blog post about AI"

# Disable streaming (wait for complete response)
claude-hybrid --no-stream "generate large code file"

📖 Commands

Core Commands

claude-hybrid "your prompt"              # Send prompt (hybrid mode)
claude-hybrid setup                      # Initialize system
claude-hybrid health [--verbose] [--fix] # System diagnostics
claude-hybrid status                     # Recent activity
claude-hybrid models                     # List available Ollama models
claude-hybrid test                       # Test provider connectivity

Configuration

claude-hybrid mode [hybrid|ollama-only|claude-only]
claude-hybrid config                     # View all config
claude-hybrid config ollama.model qwen2.5-coder:32b

Database & Maintenance

claude-hybrid query requests -l 10       # Recent requests
claude-hybrid query context -l 5         # Context chunks
claude-hybrid query sessions             # Active sessions
claude-hybrid query decisions            # Technical decisions
claude-hybrid cleanup --days 30          # Clean old data
claude-hybrid migrate-embeddings         # Regenerate semantic index

Shell Completions

# Generate completions for your shell
claude-hybrid completion bash | sudo tee /etc/bash_completion.d/claude-hybrid
claude-hybrid completion zsh > ~/.zsh/completions/_claude-hybrid
claude-hybrid completion fish > ~/.config/fish/completions/claude-hybrid.fish

⚙️ Configuration

Configuration stored in ~/.claude-hybrid/config.json:

{
  "ollama": {
    "host": "http://localhost:11434",
    "model": "qwen2.5-coder:32b",
    "fallback_model": "qwen3:32b",
    "streaming": true
  },
  "context": {
    "semantic_search_enabled": true,
    "max_chunks": 15,
    "cache_enabled": true,
    "cache_ttl_seconds": 60
  },
  "retry": {
    "enabled": true,
    "max_attempts": 3,
    "base_delay": 1.0
  },
  "mode": "hybrid"
}
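
The retry block corresponds to plain exponential backoff: with max_attempts of 3 and base_delay of 1.0, a failed request is retried after roughly 1s and then 2s. As a minimal sketch of that schedule (assuming simple doubling without jitter; the package's actual retry logic may differ):

import time

def run_with_retry(request, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry `request` with exponential backoff.

    Illustrative sketch of the retry settings above -- `request` is any
    zero-argument callable that raises on failure; claude-hybrid's real
    implementation may add jitter or cap the delay.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))  # waits 1s, then 2s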

🎯 Use Cases

1. Continue Coding During Rate Limits

# Working on authentication feature
claude-hybrid "add OAuth2 support to API"

# Claude hits rate limit
# → Automatically switches to Ollama
# → Preserves conversation context
# → Continues implementation seamlessly

2. Cost-Effective Development

# Use free local models for routine tasks
claude-hybrid mode ollama-only

# Quick iterations
claude-hybrid "refactor this function"
claude-hybrid "add error handling"
claude-hybrid "write unit tests"

# Switch back to Claude for complex tasks
claude-hybrid mode hybrid

3. Offline Development

# No internet? No problem!
claude-hybrid mode ollama-only

# Full coding capability with local models
claude-hybrid "implement binary search tree"
claude-hybrid "explain the algorithm"
claude-hybrid "optimize for performance"

📊 Performance

| Metric               | Before claude-hybrid | After                |
|----------------------|----------------------|----------------------|
| Rate limit handling  | ❌ Stop working      | ✅ Seamless failover |
| Context preservation | ❌ Lost              | ✅ 95%+ accuracy     |
| Request overhead     | 50ms                 | 5ms (10x faster)     |
| Context retrieval    | 50ms                 | 0.1ms (500x faster)  |
| Failure detection    | 600s                 | 30s (20x faster)     |
| User feedback        | None                 | Real-time streaming  |


🧠 How It Works

Intelligent Context Management

  1. Semantic Search: Uses embeddings to find relevant context
  2. Recency Boost: Prioritizes recent work (3x for <5min, 2x for <15min)
  3. Importance Scoring: User-defined priority (1-10 scale)
  4. Composite Ranking: score = (1.0 × similarity) + (0.3 × recency) + (0.2 × importance)
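
As a rough sketch of that composite ranking (assuming cosine similarity over L2-normalized embeddings and the recency/importance weights listed above; the package's actual scorer may differ):

import time
import numpy as np

def rank_chunks(query_vec, chunks, now=None):
    """Order context chunks by the composite score described above.

    Each chunk is assumed to be a dict with 'embedding' (np.ndarray,
    L2-normalized), 'timestamp' (unix seconds), and 'importance' (1-10)
    -- an illustrative schema, not necessarily claude-hybrid's.
    """
    now = now or time.time()
    scored = []
    for chunk in chunks:
        # Cosine similarity reduces to a dot product for unit vectors.
        similarity = float(np.dot(query_vec, chunk["embedding"]))

        # Recency boost: 3x under 5 minutes, 2x under 15 minutes.
        age = now - chunk["timestamp"]
        recency = 3.0 if age < 300 else 2.0 if age < 900 else 1.0

        importance = chunk["importance"] / 10.0  # map the 1-10 scale onto 0-1

        score = 1.0 * similarity + 0.3 * recency + 0.2 * importance
        scored.append((score, chunk))
    return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)]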

Smart Failover

┌─────────────┐
│ User Prompt │
└──────┬──────┘
       │
       ▼
┌──────────────────┐
│  Try Claude CLI  │ ◄─── First choice (best quality)
└────┬────────┬────┘
     │        │
  Success   Rate Limit
     │        │
     ▼        ▼
┌─────────┐ ┌──────────────────┐
│ Return  │ │ Failover to      │
│ Result  │ │ Ollama + Context │
└─────────┘ └────┬─────────────┘
                  │
                  ▼
            ┌──────────────┐
            │ Return Result│
            └──────────────┘
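
In code terms, the flow above is a try/except around the Claude call, with the Ollama path re-attaching preserved context. A minimal Python sketch of the control flow (the names here are illustrative stand-ins, not the package's internals):

class RateLimitError(Exception):
    """Raised when the Claude CLI reports a rate limit."""

def run_hybrid(prompt, claude, ollama, context_store):
    # `claude` and `ollama` are assumed to be callables that take a
    # prompt string and return a response; `context_store.retrieve`
    # returns relevant prior chunks. All three are hypothetical.
    try:
        return claude(prompt)  # first choice: best quality
    except RateLimitError:
        # Failover: prepend preserved context so the local model can
        # continue the conversation where Claude left off.
        context = context_store.retrieve(prompt)
        return ollama(f"{context}\n\n{prompt}")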

Dynamic Timeout Calculation

timeout = base(30s) × length_factor × context_factor × complexity_factor × model_factor
timeout = clamp(timeout, 30s, 300s)

# Examples:
"2+2" → 30s (simple, small model)
"implement OAuth2" → 164s (complex, large model)
"comprehensive analysis" → 300s (max)

🔧 Advanced Usage

Command Flags

--ollama-only          # Skip Claude entirely (same as mode ollama-only)
--claude-only          # No Ollama failover (same as mode claude-only)
--model <name>         # Specify Ollama model (e.g., qwen2.5-coder:32b)
--no-stream            # Disable streaming, wait for complete response

Environment Variables

CLAUDE_HYBRID_DIR      # Custom config directory (default: ~/.claude-hybrid)
OLLAMA_HOST            # Ollama server URL (default: http://localhost:11434)

Programmatic Usage

from smart_claude import HybridClaude

hybrid = HybridClaude()                       # create a hybrid client
response = hybrid.run("implement feature X")  # same failover flow as the CLI
print(response)

🐛 Troubleshooting

Ollama Connection Error

# Check if Ollama is running
ollama ps

# Start Ollama if not running
ollama serve

# Test connectivity
claude-hybrid health

Semantic Search Not Working

# Check Python dependencies
source .venv/bin/activate
python -c "import sentence_transformers; import faiss; import onnxruntime; print('✅ OK')"

# Reinstall if needed
pip install -r scripts/requirements.txt

# Regenerate embeddings and Faiss index
claude-hybrid migrate-embeddings --force --verify

Rate Limit Still Occurring

# Check current mode
claude-hybrid mode

# Ensure failover is enabled
claude-hybrid mode hybrid

# Comprehensive health check
claude-hybrid health --verbose

# Test failover manually
claude-hybrid --ollama-only "test prompt"

Database Issues

# Check database integrity
sqlite3 ~/.claude-hybrid/hybrid-memory.db "PRAGMA integrity_check;"

# View recent activity
claude-hybrid query sessions -l 5

# Clean old data if corrupted
claude-hybrid cleanup --days 7 --force

📚 Documentation

See CLAUDE.md for architecture details and the /docs directory for additional project documentation.


🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Read CLAUDE.md for architecture details
  4. Add tests for new features (npm test)
  5. Update documentation (README.md, CLAUDE.md)
  6. Submit a pull request

Development Setup

git clone https://github.com/dgtise25/claude-hybrid.git
cd claude-hybrid
npm install          # Installs deps + creates Python venv
npm link             # Symlink for local testing
npm test             # Run test suite

📄 License

MIT License - see LICENSE file


🔗 Links

  • Repository: https://github.com/dgtise25/claude-hybrid
  • Issues: https://github.com/dgtise25/claude-hybrid/issues
  • Documentation: CLAUDE.md and /docs
  • NPM Package: https://www.npmjs.com/package/claude-hybrid

💡 Pro Tips

Optimize Model Selection

# Code tasks → Use code specialist
claude-hybrid --model qwen2.5-coder:32b "implement API"

# Quick queries → Use fast model
claude-hybrid --model llama3.2:latest "what is X?"

# General purpose → Use qwen3
claude-hybrid --model qwen3:32b "explain concept"

Maximize Context Accuracy

# Store important decisions with high importance
# Context chunks with importance ≥7 are prioritized

# Semantic search automatically finds relevant context
# No manual tagging needed!

Performance Tips

# First query to a model loads it into VRAM (~30-60s)
# Subsequent queries are fast

# Warmup large models:
claude-hybrid --model qwen3:32b "warmup"

# Use caching (60s TTL by default)
# Repeated similar queries use cached context
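
The context cache mentioned above is, conceptually, a small TTL cache keyed on the query; a minimal sketch of the idea (not the package's actual cache):

import time

class TTLCache:
    """Tiny time-to-live cache illustrating the 60s context cache."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]          # cache hit: skip the expensive lookup
        self._store.pop(key, None)   # expired or never stored
        return None

    def set(self, key, value):
        self._store[key] = (time.time() + self.ttl, value)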

⚠️ Important Performance Note

Initial prompts to large Ollama models (32b parameters) can take 60+ seconds while the model loads into VRAM. This is expected behavior:

  • First request: 60-90 seconds (model loading + inference)
  • Subsequent requests: 2-10 seconds (inference only)
  • Not a bug: This is how Ollama manages memory-efficient model serving

Tip: Keep Ollama running and use the same model repeatedly to avoid reloading delays.


⭐ Star History

If you find this useful, please star the repository!


Made with ❤️ for developers who refuse to let rate limits slow them down.