@intentsolutionsio/ollama-local-ai

v1.0.0

Published

a month ago

Run AI models locally with Ollama - free alternative to OpenAI, Anthropic, and other paid LLM APIs. Zero-cost, privacy-first AI infrastructure.


intentsolutionsio

ollama local-llm free-ai self-hosted llama mistral privacy zero-cost openai-alternative claude-code claude-plugin tonsofskills

Ollama Local AI

Free, self-hosted alternative to OpenAI, Anthropic, and paid LLM APIs

Run powerful AI models locally with zero API costs. Complete privacy, unlimited usage, no subscriptions.

Why Ollama?

💰 Free Forever - No API keys, no subscriptions, no usage limits
🔒 Privacy First - Your data never leaves your machine
⚡ Fast - Local inference, no network latency
🎯 Production Ready - Used by thousands of developers worldwide
🔧 Easy Setup - One command installation

Quick Start

# Install Ollama
/setup-ollama

# Or manually:
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2

# Start using it!
ollama run llama3.2

Available Models

Code Generation

CodeLlama 34B - Best for code generation
Qwen2.5-Coder 32B - Excellent coding assistant
DeepSeek-Coder 33B - Strong code understanding

General Purpose

Llama 3.1 70B - Meta's large general-purpose model
Mistral 7B - Fast and efficient
Mixtral 8x7B - High quality reasoning

Specialized

Phi-3 14B - Microsoft's efficient model
Gemma 27B - Google's open model
Command-R 35B - Cohere's command model

Replace Paid APIs

OpenAI GPT-4 → Llama 3.2

# Before (Paid - $0.03/1K tokens)
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# After (Free)
import ollama
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}]
)

Anthropic Claude → Mistral

# Before (Paid - $0.015/1K tokens)
from anthropic import Anthropic
client = Anthropic(api_key="sk-ant-...")
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Anthropic Messages API
    messages=[{"role": "user", "content": "Hello"}]
)

# After (Free)
import ollama
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Hello"}]
)

System Requirements

Minimum:

8GB RAM for 7B models
16GB RAM for 13B models
32GB RAM for 33B+ models

Recommended:

NVIDIA GPU with 8GB+ VRAM (10x faster)
Apple Silicon (M1/M2/M3) works great
AMD GPUs supported
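The minimum figures above can be turned into a quick sizing check. The sketch below encodes the listed tiers as a lookup; the tier boundaries (for example, treating anything up to 8B parameters as "7B-class"), the 64GB figure for 70B-class models taken from the tables later in this README, and the function names are all assumptions made for illustration, not part of Ollama.

```python
# Rule-of-thumb RAM tiers taken from the list above.
# The upper bound of each tier is an assumption chosen for illustration.
RAM_TIERS = [
    (8, 8),    # up to ~8B parameters  -> 8 GB
    (14, 16),  # up to ~14B parameters -> 16 GB
    (35, 32),  # up to ~35B parameters -> 32 GB
]
LARGE_MODEL_RAM_GB = 64  # 70B-class models (see the memory table in this README)


def min_ram_gb(params_billions: float) -> int:
    """Return the smallest RAM tier (in GB) suggested for a model size."""
    for max_params, ram_gb in RAM_TIERS:
        if params_billions <= max_params:
            return ram_gb
    return LARGE_MODEL_RAM_GB


def max_loaded_models(total_ram_gb: int, params_billions: float) -> int:
    """How many copies of a model fit in RAM under the tiers above."""
    return total_ram_gb // min_ram_gb(params_billions)


if __name__ == "__main__":
    print(min_ram_gb(7), min_ram_gb(13), min_ram_gb(33))
    print(max_loaded_models(32, 7))
```

These numbers are planning estimates only; actual memory use depends on quantization and context length.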

⚠️ Rate Limits & Resource Constraints

No API Limits (Local Deployment)

| Constraint Type | Ollama (Local) | Cloud APIs (OpenAI/Anthropic) |
|----------------|----------------|-------------------------------|
| Daily requests | ∞ Unlimited | Limited by subscription tier |
| Rate limiting | ❌ None | ✅ Yes (RPM/TPM limits) |
| Registration | ❌ Not required | ✅ Email + payment required |
| API keys | ❌ Not needed | ✅ Required for all requests |
| IP tracking | ❌ N/A (local) | ✅ Yes (can get banned) |
| Data privacy | ✅ 100% local | ❌ Sent to cloud servers |

Hardware-Based "Rate Limits"

Unlike API services, Ollama's constraints are hardware-based, not usage-based:

1. Memory Constraints

| Model Size | RAM Required | Max Concurrent Requests | Notes |
|-----------|--------------|------------------------|--------|
| 7B models | 8GB | 1-2 agents | Basic usage |
| 13B models | 16GB | 1-2 agents | Good quality |
| 33B models | 32GB | 1 agent | High quality |
| 70B models | 64GB+ | 1 agent | Best quality, needs GPU |

Multiple Agents on Same Machine:

# With 32GB RAM, you can run:
# - 3-4 agents using 7B models (8GB each)
# - 2 agents using 13B models (16GB each)
# - 1 agent using 33B model (32GB)

# Agent coordination example
import asyncio
from ollama import AsyncClient

async def agent_task(agent_name, model):
    client = AsyncClient()  # async client, so chat() can be awaited
    # Ollama automatically queues requests if busy
    response = await client.chat(
        model=model,
        messages=[{"role": "user", "content": f"Task for {agent_name}"}]
    )
    return response

async def main():
    # Run 3 agents concurrently on 32GB machine
    tasks = [
        agent_task("agent1", "llama3.2"),  # ~8GB
        agent_task("agent2", "mistral"),   # ~4GB
        agent_task("agent3", "codellama"), # ~7GB
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())

2. Disk Space Requirements

| Model | Download Size | Disk Space Required |
|-------|--------------|---------------------|
| Llama 3.1 8B | 4.7 GB | ~5 GB |
| Mistral 7B | 4.1 GB | ~4.5 GB |
| CodeLlama 34B | 19 GB | ~20 GB |
| Llama 3.1 70B | 40 GB | ~45 GB |
| Mixtral 8x7B | 26 GB | ~30 GB |

Multi-Model Strategy:

# Storage planning for 3 agents (approximate sizes for the default tags)
ollama pull llama3.2      # ~2.0 GB (general purpose, 3B)
ollama pull codellama     # ~3.8 GB (code generation, 7B)
ollama pull mistral       # ~4.1 GB (fast responses, 7B)
# Total: ~10 GB disk space

3. Inference Speed "Limits"

| Hardware | Tokens/Second | Realistic Agents | Notes |
|----------|---------------|------------------|--------|
| CPU only | 2-5 tok/s | 1-2 | Slow but free |
| Apple M1/M2 | 15-25 tok/s | 3-5 | Excellent performance |
| NVIDIA RTX 3060 | 30-50 tok/s | 5-8 | Good mid-range GPU |
| NVIDIA RTX 4090 | 80-120 tok/s | 10-15 | High-end GPU |
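To see what these throughput figures mean for response latency, the arithmetic can be sketched as follows. It assumes, as a simplification, that concurrent agents split the hardware's token rate evenly; real scheduling and batching behave differently, and the function name is illustrative only.

```python
def seconds_per_response(tokens: int, tokens_per_second: float, agents: int = 1) -> float:
    """Estimate wall-clock seconds for one response.

    Assumption: `agents` concurrent requests split throughput evenly.
    """
    if tokens_per_second <= 0 or agents < 1:
        raise ValueError("tokens_per_second must be > 0 and agents >= 1")
    per_agent_rate = tokens_per_second / agents
    return tokens / per_agent_rate


if __name__ == "__main__":
    # A 500-token answer on a CPU at 5 tok/s vs. a GPU at 50 tok/s
    print(seconds_per_response(500, 5))      # 100.0
    print(seconds_per_response(500, 50))     # 10.0
    # The same GPU shared by 5 agents at once
    print(seconds_per_response(500, 50, 5))  # 50.0
```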

Agent Best Practice - Queue Management:

# Smart request queuing for single GPU
import threading

import ollama

class LocalLLMCoordinator:
    def __init__(self, max_concurrent=3):
        # A semaphore caps in-flight requests; a bare counter is not thread-safe
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def process_request(self, agent_id, prompt):
        # Block until a slot is free; the slot is released when the call returns
        with self._slots:
            return ollama.chat(
                model='llama3.2',
                messages=[{"role": "user", "content": prompt}]
            )

# Use coordinator for 10 agents on one machine
coordinator = LocalLLMCoordinator(max_concurrent=3)

Registration & Setup Requirements

| Requirement | Status | Details |
|------------|--------|---------|
| Email signup | ❌ Not required | No account needed |
| API key | ❌ Not required | Runs locally |
| Payment method | ❌ Not required | 100% free |
| Terms acceptance | ✅ MIT License | Open source, permissive |
| Installation | ✅ Required | One command: curl -fsSL https://ollama.com/install.sh \| sh |

Agent Strategies for Single Machine

Scenario: 10 Agents on One Machine (32GB RAM, NVIDIA RTX 3060)

# Strategy 1: Shared model pool (most efficient)
from ollama import AsyncClient

class AgentPool:
    def __init__(self):
        self.model = "llama3.2"  # Single model loaded once
        self.client = AsyncClient()  # async client, so chat() can be awaited
        self.cache = {}  # Shared response cache

    async def agent_request(self, agent_id, prompt):
        # Check cache first (avoid redundant inference)
        cache_key = hash(prompt)
        if cache_key in self.cache:
            return self.cache[cache_key]

        # Process request
        response = await self.client.chat(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )

        # Cache for other agents
        self.cache[cache_key] = response
        return response

# 10 agents share one model = ~8GB RAM total
pool = AgentPool()
# Agent is a placeholder for your own class that calls pool.agent_request()
agents = [Agent(id=i, pool=pool) for i in range(10)]
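One detail worth noting for the shared cache: Python's built-in `hash()` is randomized per process for strings, so keys built from it cannot be persisted or shared between processes. A minimal sketch of a stable key, assuming the cache should also distinguish between models (the `cache_key` helper is hypothetical, not part of the ollama package):

```python
import hashlib


def cache_key(model: str, prompt: str) -> str:
    """Build a stable cache key from the model name and prompt text."""
    digest = hashlib.sha256()
    digest.update(model.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(prompt.encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    cache = {}
    key = cache_key("llama3.2", "What is Python?")
    cache[key] = "cached response"
    # The same model and prompt yield the same key in any process
    print(cache_key("llama3.2", "What is Python?") in cache)
```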

Strategy 2: Specialized model per task type

# Allocate different models for different agent types
config = {
    "code_agents": {  # 4 agents
        "model": "codellama",  # 13GB
        "ram": "13GB"
    },
    "chat_agents": {  # 4 agents
        "model": "llama3.2",   # 8GB
        "ram": "8GB"
    },
    "fast_agents": {  # 2 agents
        "model": "mistral",    # 4GB
        "ram": "4GB"
    }
}
# Total: 25GB RAM (fits in 32GB machine)
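The total in the comment above can be checked programmatically before deploying a configuration. The sketch below assumes one loaded model per agent group and a `ram` field formatted like `"13GB"`; the helper names are hypothetical.

```python
import re

# Same shape as the config above (agent counts omitted for brevity).
config = {
    "code_agents": {"model": "codellama", "ram": "13GB"},
    "chat_agents": {"model": "llama3.2", "ram": "8GB"},
    "fast_agents": {"model": "mistral", "ram": "4GB"},
}


def parse_gb(value: str) -> int:
    """Convert a string such as '13GB' to an integer number of gigabytes."""
    match = re.fullmatch(r"\s*(\d+)\s*GB\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized RAM value: {value!r}")
    return int(match.group(1))


def total_ram_gb(cfg: dict) -> int:
    """Sum RAM across groups, assuming one loaded model per group."""
    return sum(parse_gb(group["ram"]) for group in cfg.values())


def fits(cfg: dict, machine_ram_gb: int) -> bool:
    """True if the summed model RAM fits within the machine's RAM."""
    return total_ram_gb(cfg) <= machine_ram_gb


if __name__ == "__main__":
    print(total_ram_gb(config), fits(config, 32))
```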

Strategy 3: Request batching

# Batch multiple agent requests into one inference call
def batch_agent_requests(agent_prompts):
    combined_prompt = "\n\n".join([
        f"[Agent {i}]: {prompt}"
        for i, prompt in enumerate(agent_prompts)
    ])

    response = ollama.chat(
        model='llama3.2',
        messages=[{"role": "user", "content": combined_prompt}]
    )

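    # Parse response for each agent (parse_multi_agent_response is a helper
    # you must supply; it is not part of the ollama package)
    return parse_multi_agent_response(response["message"]["content"])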

# Process 5 agents in one request instead of 5 separate requests
results = batch_agent_requests([
    "What is Python?",
    "What is JavaScript?",
    "What is Rust?",
    "What is Go?",
    "What is TypeScript?"
])
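The batching example relies on a `parse_multi_agent_response` helper that is not part of the ollama package. One possible sketch is shown below. It takes the reply text (that is, `response["message"]["content"]`) and assumes the model echoes the `[Agent N]:` markers at the start of a line, which is not guaranteed; treat it as an illustration rather than a robust parser.

```python
import re

# Matches "[Agent 3]:" at the start of a line and captures the index.
AGENT_MARKER = re.compile(r"^\[Agent (\d+)\]:\s*", flags=re.MULTILINE)


def parse_multi_agent_response(text: str) -> dict:
    """Split model output into {agent_index: answer}.

    Assumes each answer is introduced by an "[Agent N]:" marker.
    """
    answers = {}
    matches = list(AGENT_MARKER.finditer(text))
    for pos, match in enumerate(matches):
        # An answer runs until the next marker, or to the end of the text
        end = matches[pos + 1].start() if pos + 1 < len(matches) else len(text)
        answers[int(match.group(1))] = text[match.end():end].strip()
    return answers
```

If the model drops or renumbers the markers, some agents will have no entry in the result, so callers should check for missing indices and fall back to individual requests.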

When Hardware Becomes the "Rate Limit"

Upgrade paths when local resources aren't enough:

| Your Situation | Solution | Cost |
|----------------|----------|------|
| Need more concurrent agents | Upgrade RAM (16GB → 32GB) | ~$60-150 one-time |
| Slow inference speeds | Add GPU (RTX 3060) | ~$300-400 one-time |
| Multiple machines | Run Ollama on each | $0 (just install) |
| Cloud deployment | Deploy to vast.ai or Runpod | $0.20-0.50/hour |
| Enterprise scale | Self-host on server cluster | $2000-5000 hardware |

Still cheaper than cloud APIs at 1M tokens/month:

OpenAI GPT-4: $30-60/month ongoing
Anthropic Claude: $15-75/month ongoing
Local hardware: One-time cost, unlimited usage

Installation

macOS

brew install ollama
ollama serve

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download from https://ollama.com/download/windows

Docker

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Usage Examples

Chat Interface

ollama run llama3.2
>>> Write a Python function to sort a list

API Server

import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Why is the sky blue?',
    'stream': False  # return one JSON object instead of a stream of lines
})

print(response.json()['response'])
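By default, `/api/generate` streams its reply as newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. A minimal sketch of assembling those fragments is shown below; the `collect_stream` name is hypothetical and the sample lines are made up to show the shape of the data.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the `response` fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)


if __name__ == "__main__":
    # Made-up sample lines in the shape the server streams
    sample = [
        b'{"response": "The sky ", "done": false}',
        b'{"response": "is blue.", "done": false}',
        b'{"response": "", "done": true}',
    ]
    print(collect_stream(sample))
```

With `requests`, pass `stream=True` to `requests.post` and feed `response.iter_lines()` into `collect_stream`.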

Streaming

import ollama

for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)

Performance Comparison

| Model | Speed (tokens/sec) | Quality | Memory |
|-------|-------------------|---------|--------|
| Llama 3.1 8B | 50-100 | Good | 8GB |
| Mistral 7B | 60-120 | Great | 8GB |
| CodeLlama 34B | 20-40 | Excellent | 32GB |
| Llama 3.1 70B | 10-20 | Best | 64GB |

Speeds are with GPU acceleration.

Cost Savings

Replacing OpenAI GPT-4:

Current cost: $0.03/1K input tokens, $0.06/1K output
1M tokens/month = $30-60/month
Ollama cost: $0 ✓

Replacing Anthropic Claude:

Current cost: $0.015/1K input tokens, $0.075/1K output
1M tokens/month = $15-75/month
Ollama cost: $0 ✓
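The monthly figures above follow from simple per-token arithmetic, which can be sketched as below. The break-even helper is an added illustration that compares API spend against a one-time hardware purchase (for example, the GPU prices in the upgrade table earlier in this README); it ignores electricity and assumes a flat per-1K-token price.

```python
def monthly_api_cost(tokens_per_month: int, price_per_1k: float) -> float:
    """API cost in dollars for a month at a flat per-1K-token price."""
    return tokens_per_month / 1000 * price_per_1k


def break_even_months(hardware_cost: float, monthly_cost: float) -> float:
    """Months of API spend needed to equal a one-time hardware purchase."""
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    return hardware_cost / monthly_cost


if __name__ == "__main__":
    low = monthly_api_cost(1_000_000, 0.03)   # all input tokens
    high = monthly_api_cost(1_000_000, 0.06)  # all output tokens
    print(f"GPT-4 at 1M tokens/month: ${low:.0f}-{high:.0f}")
    print(f"Months to recoup a $350 GPU: {break_even_months(350, low):.1f}")
```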

Advanced Configuration

Custom Models

# Create Modelfile
FROM llama3.2
PARAMETER temperature 0.7
SYSTEM You are a helpful coding assistant

# Build custom model
ollama create my-assistant -f Modelfile
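When several agents each need their own system prompt, Modelfiles like the one above can be generated from code. The sketch below renders only the `FROM`, `PARAMETER`, and `SYSTEM` instructions used in this README; the `render_modelfile` helper is hypothetical, and it does not escape triple quotes inside the system prompt.

```python
def render_modelfile(base: str, system: str, **parameters) -> str:
    """Render a minimal Modelfile with FROM, PARAMETER and SYSTEM lines."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    # Triple quotes let the system prompt span multiple lines
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.2",
        "You are a helpful coding assistant",
        temperature=0.7,
    )
    print(text, end="")
```

Writing the returned text to a file named `Modelfile` and running `ollama create my-assistant -f Modelfile` would then build the model as shown above.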

API Integration

// Node.js (ES module, so top-level await is available)
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Hello!' }],
})

Multiple Models

# Pull multiple models
ollama pull llama3.2
ollama pull mistral
ollama pull codellama

# List installed
ollama list

Troubleshooting

Model Too Large

# Use a smaller variant
ollama pull llama3.2:1b  # 1B parameters (~1.3 GB download)

Slow Performance

# Check GPU usage
nvidia-smi  # NVIDIA
system_profiler SPDisplaysDataType  # macOS (lists GPU hardware)
ollama ps  # shows whether a loaded model runs on CPU or GPU

# Use faster model
ollama pull mistral:7b

Memory Issues

# Clear old models
ollama rm unused-model

# Use smaller context (set inside the interactive session)
ollama run llama3.2
>>> /set parameter num_ctx 2048
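For programmatic use, the context window can also be lowered per request through the `options.num_ctx` field of the REST API. The sketch below only builds the request body; the `build_generate_payload` name is hypothetical, and the result would be sent as JSON to `http://localhost:11434/api/generate`.

```python
import json


def build_generate_payload(model: str, prompt: str, num_ctx: int = 2048) -> dict:
    """Build a non-streaming /api/generate body with a reduced context window."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return a single JSON object
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }


if __name__ == "__main__":
    payload = build_generate_payload("llama3.2", "Why is the sky blue?")
    print(json.dumps(payload, indent=2))
```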

Resources

Official Docs: https://ollama.com/docs
Model Library: https://ollama.com/library
GitHub: https://github.com/ollama/ollama
Discord: https://discord.gg/ollama

Related Plugins

local-llm-wrapper - Generic wrapper for all local LLMs
ai-sdk-agents - AI SDK with Ollama support
geepers-agents - 51 agents powered by Ollama

License

MIT License - Free to use commercially and personally