@intentsolutionsio/ollama-local-ai
v1.0.0
Run AI models locally with Ollama - free alternative to OpenAI, Anthropic, and other paid LLM APIs. Zero-cost, privacy-first AI infrastructure.
Ollama Local AI
Free, self-hosted alternative to OpenAI, Anthropic, and paid LLM APIs
Run powerful AI models locally with zero API costs. Complete privacy, unlimited usage, no subscriptions.
Why Ollama?
- 💰 Free Forever - No API keys, no subscriptions, no usage limits
- 🔒 Privacy First - Your data never leaves your machine
- ⚡ Fast - Local inference, no network latency
- 🎯 Production Ready - Used by thousands of developers worldwide
- 🔧 Easy Setup - One command installation
Quick Start
# Install Ollama
/setup-ollama
# Or manually:
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.2
# Start using it!
ollama run llama3.2
Available Models
Code Generation
- CodeLlama 34B - Best for code generation
- Qwen2.5-Coder 32B - Excellent coding assistant
- DeepSeek-Coder 33B - Strong code understanding
General Purpose
- Llama 3.1 70B - Meta's flagship model
- Mistral 7B - Fast and efficient
- Mixtral 8x7B - High quality reasoning
Specialized
- Phi-3 14B - Microsoft's efficient model
- Gemma 27B - Google's open model
- Command-R 35B - Cohere's command model
Replace Paid APIs
OpenAI GPT-4 → Llama 3.2
# Before (Paid - $0.03/1K tokens)
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
# After (Free)
import ollama
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}]
)
Anthropic Claude → Mistral
# Before (Paid - $0.015/1K tokens)
from anthropic import Anthropic
client = Anthropic(api_key="sk-ant-...")
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Hello"}]
)
# After (Free)
import ollama
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Hello"}]
)
System Requirements
Minimum:
- 8GB RAM for 7B models
- 16GB RAM for 13B models
- 32GB RAM for 33B+ models
Recommended:
- NVIDIA GPU with 8GB+ VRAM (roughly 10x faster than CPU-only inference)
- Apple Silicon (M1/M2/M3) works great
- AMD GPUs supported
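The minimum RAM figures above can be read as a simple lookup. The sketch below is illustrative only: `largest_model_tier` and `RAM_TIERS_GB` are hypothetical names, and the thresholds are copied from the list above rather than measured.

```python
# RAM thresholds in GB, copied from the "Minimum" list above (rough guidance only).
RAM_TIERS_GB = [(32, "33B+"), (16, "13B"), (8, "7B")]

def largest_model_tier(ram_gb):
    """Return the largest model class the list suggests for ram_gb, or None."""
    for min_ram, tier in RAM_TIERS_GB:
        if ram_gb >= min_ram:
            return tier
    return None  # below the 8GB minimum

print(largest_model_tier(16))
```

Treat the result as a starting point; actual memory use also depends on quantization and context length.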
⚠️ Rate Limits & Resource Constraints
No API Limits (Local Deployment)
| Constraint Type | Ollama (Local) | Cloud APIs (OpenAI/Anthropic) |
|----------------|----------------|-------------------------------|
| Daily requests | ∞ Unlimited | Limited by subscription tier |
| Rate limiting | ❌ None | ✅ Yes (RPM/TPM limits) |
| Registration | ❌ Not required | ✅ Email + payment required |
| API keys | ❌ Not needed | ✅ Required for all requests |
| IP tracking | ❌ N/A (local) | ✅ Yes (can get banned) |
| Data privacy | ✅ 100% local | ❌ Sent to cloud servers |
Hardware-Based "Rate Limits"
Unlike API services, Ollama's constraints are hardware-based, not usage-based:
1. Memory Constraints
| Model Size | RAM Required | Max Concurrent Requests | Notes |
|-----------|--------------|------------------------|--------|
| 7B models | 8GB | 1-2 agents | Basic usage |
| 13B models | 16GB | 1-2 agents | Good quality |
| 33B models | 32GB | 1 agent | High quality |
| 70B models | 64GB+ | 1 agent | Best quality, needs GPU |
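The per-model RAM figures give a rough ceiling on how many separately loaded models fit in memory. A minimal sketch of that arithmetic, assuming each agent loads its own model; `max_concurrent_agents` and `reserved_gb` are hypothetical names introduced here:

```python
def max_concurrent_agents(total_ram_gb, model_ram_gb, reserved_gb=0):
    """Estimate how many model instances fit in RAM using integer division.

    reserved_gb is memory held back for the OS and other processes.
    """
    usable = total_ram_gb - reserved_gb
    if model_ram_gb <= 0 or usable <= 0:
        return 0
    return int(usable // model_ram_gb)

# 32GB machine, 8GB per 7B model, with and without 4GB held back for the OS
print(max_concurrent_agents(32, 8))
print(max_concurrent_agents(32, 8, reserved_gb=4))
```

Agents that share a single loaded model do not multiply the RAM cost this way, so this estimate only applies when each agent uses a different model.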
Multiple Agents on Same Machine:
# With 32GB RAM, you can run:
# - 3-4 agents using 7B models (8GB each)
# - 2 agents using 13B models (16GB each)
# - 1 agent using 33B model (32GB)
# Agent coordination example
import asyncio
from ollama import AsyncClient

async def agent_task(agent_name, model):
    # AsyncClient is the awaitable client; the plain Client is synchronous
    client = AsyncClient()
    # Ollama automatically queues requests if busy
    response = await client.chat(
        model=model,
        messages=[{"role": "user", "content": f"Task for {agent_name}"}]
    )
    return response

async def main():
    # Run 3 agents concurrently on 32GB machine
    tasks = [
        agent_task("agent1", "llama3.2"),   # ~8GB
        agent_task("agent2", "mistral"),    # ~4GB
        agent_task("agent3", "codellama"),  # ~7GB
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
2. Disk Space Requirements
| Model | Download Size | Disk Space Required |
|-------|--------------|---------------------|
| Llama 3.1 8B | 4.7 GB | ~5 GB |
| Mistral 7B | 4.1 GB | ~4.5 GB |
| CodeLlama 34B | 19 GB | ~20 GB |
| Llama 3.1 70B | 40 GB | ~45 GB |
| Mixtral 8x7B | 26 GB | ~30 GB |
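Before pulling several models it can help to compare their combined download size with the free disk space. The sketch below assumes sizes taken from three rows of the table above; `DOWNLOAD_GB`, `total_download_gb`, `fits_on_disk`, and the 5 GB headroom are all illustrative choices, and `path` should point at whichever volume holds your models.

```python
import shutil

# Download sizes in GB for three rows of the table above.
DOWNLOAD_GB = {"Mistral 7B": 4.1, "CodeLlama 34B": 19, "Mixtral 8x7B": 26}

def total_download_gb(models):
    """Sum the download sizes of the named models."""
    return sum(DOWNLOAD_GB[name] for name in models)

def fits_on_disk(models, path=".", headroom_gb=5):
    """True if free space at path covers the downloads plus some headroom."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_download_gb(models) + headroom_gb

needed = total_download_gb(["Mistral 7B", "CodeLlama 34B"])
print(f"Need about {needed:.1f} GB; fits: {fits_on_disk(['Mistral 7B', 'CodeLlama 34B'])}")
```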
Multi-Model Strategy:
# Storage planning for 3 agents
ollama pull llama3.2 # 4.7 GB (general purpose)
ollama pull codellama # 13 GB (code generation)
ollama pull mistral # 4.1 GB (fast responses)
# Total: ~22 GB disk space
3. Inference Speed "Limits"
| Hardware | Tokens/Second | Realistic Agents | Notes |
|----------|---------------|------------------|--------|
| CPU only | 2-5 tok/s | 1-2 | Slow but free |
| Apple M1/M2 | 15-25 tok/s | 3-5 | Excellent performance |
| NVIDIA RTX 3060 | 30-50 tok/s | 5-8 | Good mid-range GPU |
| NVIDIA RTX 4090 | 80-120 tok/s | 10-15 | High-end GPU |
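To see what these throughput ranges mean for a single reply, divide the expected output length by the token rate. A minimal sketch, assuming a 500-token reply and ignoring prompt processing time; `generation_seconds` is a hypothetical helper:

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Approximate time to generate output_tokens at a steady token rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second

# Low and high ends of two rows from the table above
for label, low, high in [("CPU only", 2, 5), ("NVIDIA RTX 3060", 30, 50)]:
    slowest = generation_seconds(500, low)
    fastest = generation_seconds(500, high)
    print(f"{label}: {fastest:.0f}-{slowest:.0f} s for 500 tokens")
```

When several agents share one GPU their requests are served in turn, so per-agent latency grows with the number of requests waiting.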
Agent Best Practice - Queue Management:
# Smart request queuing for single GPU
import threading
import ollama

class LocalLLMCoordinator:
    def __init__(self, max_concurrent=3):
        self.max_concurrent = max_concurrent
        # A semaphore caps in-flight requests without busy-waiting
        # or racing on a shared counter
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def process_request(self, agent_id, prompt):
        # Blocks while max_concurrent requests are already running
        with self._slots:
            response = ollama.chat(
                model='llama3.2',
                messages=[{"role": "user", "content": prompt}]
            )
            return response

# Use coordinator for 10 agents on one machine
coordinator = LocalLLMCoordinator(max_concurrent=3)
Registration & Setup Requirements
| Requirement | Status | Details |
|------------|--------|---------|
| Email signup | ❌ Not required | No account needed |
| API key | ❌ Not required | Runs locally |
| Payment method | ❌ Not required | 100% free |
| Terms acceptance | ✅ MIT License | Open source, permissive |
| Installation | ✅ Required | One command: curl -fsSL https://ollama.com/install.sh \| sh |
Agent Strategies for Single Machine
Scenario: 10 Agents on One Machine (32GB RAM, NVIDIA RTX 3060)
# Strategy 1: Shared model pool (most efficient)
from ollama import AsyncClient

class AgentPool:
    def __init__(self):
        self.model = "llama3.2"      # Single model loaded once
        self.client = AsyncClient()  # Async client shared by all agents
        self.cache = {}              # Shared response cache

    async def agent_request(self, agent_id, prompt):
        # Check cache first (avoid redundant inference)
        cache_key = hash(prompt)
        if cache_key in self.cache:
            return self.cache[cache_key]
        # Process request
        response = await self.client.chat(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )
        # Cache for other agents
        self.cache[cache_key] = response
        return response

class Agent:
    # Minimal placeholder; substitute your application's own agent class
    def __init__(self, id, pool):
        self.id = id
        self.pool = pool

# 10 agents share one model = ~8GB RAM total
pool = AgentPool()
agents = [Agent(id=i, pool=pool) for i in range(10)]
Strategy 2: Specialized model per task type
# Allocate different models for different agent types
config = {
    "code_agents": {            # 4 agents
        "model": "codellama",   # 13GB
        "ram": "13GB"
    },
    "chat_agents": {            # 4 agents
        "model": "llama3.2",    # 8GB
        "ram": "8GB"
    },
    "fast_agents": {            # 2 agents
        "model": "mistral",     # 4GB
        "ram": "4GB"
    }
}
# Total: 25GB RAM (fits in 32GB machine)
Strategy 3: Request batching
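# Batch multiple agent requests into one inference call
import re
import ollama

def parse_multi_agent_response(response, count):
    # Split the reply on the "[Agent N]:" markers requested in the prompt;
    # agents the model skipped come back as None
    text = response["message"]["content"]
    parts = re.split(r"\[Agent (\d+)\]:", text)
    answers = {int(i): body.strip() for i, body in zip(parts[1::2], parts[2::2])}
    return [answers.get(i) for i in range(count)]

def batch_agent_requests(agent_prompts):
    combined_prompt = "\n\n".join([
        f"[Agent {i}]: {prompt}"
        for i, prompt in enumerate(agent_prompts)
    ])
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {"role": "system", "content": "Answer each prompt separately. "
             "Start each answer with the same [Agent N]: marker."},
            {"role": "user", "content": combined_prompt},
        ]
    )
    # Parse response for each agent
    return parse_multi_agent_response(response, len(agent_prompts))

# Process 5 agents in one request instead of 5 separate requests
results = batch_agent_requests([
    "What is Python?",
    "What is JavaScript?",
    "What is Rust?",
    "What is Go?",
    "What is TypeScript?"
])
When Hardware Becomes the "Rate Limit"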
Upgrade paths when local resources aren't enough:
| Your Situation | Solution | Cost |
|----------------|----------|------|
| Need more concurrent agents | Upgrade RAM (16GB → 32GB) | ~$60-150 one-time |
| Slow inference speeds | Add GPU (RTX 3060) | ~$300-400 one-time |
| Multiple machines | Run Ollama on each | $0 (just install) |
| Cloud deployment | Deploy to vast.ai or Runpod | $0.20-0.50/hour |
| Enterprise scale | Self-host on server cluster | $2000-5000 hardware |
Still cheaper than cloud APIs:
- OpenAI GPT-4: $30-60/month ongoing at 1M tokens/month
- Anthropic Claude: $15-75/month ongoing at 1M tokens/month
- Local hardware: One-time cost, infinite usage
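Whether a one-time purchase pays off depends on how quickly the avoided API bill covers it. The sketch below computes that break-even point under a simplifying assumption that electricity and maintenance are ignored; `break_even_months` is a hypothetical helper, and the sample figures are the RTX 3060 price range and GPT-4 monthly range quoted above.

```python
import math

def break_even_months(hardware_cost, monthly_api_cost):
    """Whole months until a one-time hardware cost is covered by API savings."""
    if monthly_api_cost <= 0:
        raise ValueError("monthly_api_cost must be positive")
    return math.ceil(hardware_cost / monthly_api_cost)

# $300 GPU against a $30/month API bill, and $400 against $60/month
print(break_even_months(300, 30))
print(break_even_months(400, 60))
```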
Installation
macOS
brew install ollama
ollama serve
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download from https://ollama.com/download/windows
Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Usage Examples
Chat Interface
ollama run llama3.2
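>>> Write a Python function to sort a list
API Server
import requests
# /api/generate streams newline-delimited JSON by default;
# 'stream': False returns a single JSON object instead
response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Why is the sky blue?',
    'stream': False
})
print(response.json()['response'])
Streaming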
import ollama
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
Performance Comparison
| Model | Speed (tokens/sec) | Quality | Memory |
|-------|-------------------|---------|--------|
| Llama 3.1 8B | 50-100 | Good | 8GB |
| Mistral 7B | 60-120 | Great | 8GB |
| CodeLlama 34B | 20-40 | Excellent | 32GB |
| Llama 3.1 70B | 10-20 | Best | 64GB |
Speeds assume GPU acceleration.
Cost Savings
Replacing OpenAI GPT-4:
- Current cost: $0.03/1K input tokens, $0.06/1K output
- 1M tokens/month = $30-60/month
- Ollama cost: $0 ✓
Replacing Anthropic Claude:
- Current cost: $0.015/1K input tokens, $0.075/1K output
- 1M tokens/month = $15-75/month
- Ollama cost: $0 ✓
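The monthly ranges above follow directly from the per-1K-token prices. A minimal sketch of that arithmetic, with `monthly_api_cost` as a hypothetical helper and an assumed even split of 500K input and 500K output tokens:

```python
def monthly_api_cost(input_tokens, output_tokens,
                     input_price_per_1k, output_price_per_1k):
    """Cost in dollars for one month of usage at per-1K-token prices."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)

# GPT-4 prices quoted above, 1M tokens split evenly between input and output
cost = monthly_api_cost(500_000, 500_000, 0.03, 0.06)
print(f"${cost:.2f}/month")
```

The lower bound of each range corresponds to 1M input-only tokens and the upper bound to 1M output-only tokens.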
Advanced Configuration
Custom Models
# Create Modelfile
FROM llama3.2
PARAMETER temperature 0.7
SYSTEM You are a helpful coding assistant
# Build custom model
ollama create my-assistant -f Modelfile
API Integration
// Node.js (ES module, e.g. a .mjs file, so top-level await works)
import ollama from 'ollama'
const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Hello!' }],
})
console.log(response.message.content)
Multiple Models
# Pull multiple models
ollama pull llama3.2
ollama pull mistral
ollama pull codellama
# List installed
ollama list
Troubleshooting
Model Too Large
# Use a smaller variant
ollama pull llama3.2:1b # 1B-parameter variant of Llama 3.2
Slow Performance
# Check GPU usage
nvidia-smi # NVIDIA
system_profiler SPDisplaysDataType # macOS
# Use faster model
ollama pull mistral:7b
Memory Issues
# Clear old models
ollama rm unused-model
# Use smaller context (set inside the interactive session)
ollama run llama3.2
>>> /set parameter num_ctx 2048
Resources
- Official Docs: https://ollama.com/docs
- Model Library: https://ollama.com/library
- GitHub: https://github.com/ollama/ollama
- Discord: https://discord.gg/ollama
Related Plugins
- local-llm-wrapper - Generic wrapper for all local LLMs
- ai-sdk-agents - AI SDK with Ollama support
- geepers-agents - 51 agents powered by Ollama
License
MIT License - Free to use commercially and personally
