arbiter-cli

v0.12.1

Published

8 days ago

AI inference optimization — smart routing + context compression saves up to 98% on LLM costs

Vulnerabilities: 0 High, 0 Medium, 0 Low

hamood_vr

llm proxy openai cost optimization arbiter ai routing compression claude gpt context tokens

arbiter-cli

Cut LLM API costs 69% with one line of code. Smart routing proxy that sends each request to the cheapest model capable of handling it.

Quick Start

# Interactive chat (like Claude CLI, but 69% cheaper)
npx arbiter-cli chat

# AI coding agent (reads files, writes code, runs commands)
npx arbiter-cli code "add error handling to utils.py"

# Set up in your project (zero code changes to your app)
npx arbiter-cli init

# Check your savings
npx arbiter-cli stats

What it does

Arbiter routes every LLM request to the cheapest model that can handle it:

Simple questions → Gemini Flash / GPT-4o Mini (95% cheaper)
Medium code tasks → Qwen / Mistral (90% cheaper)
Complex reasoning → Claude Sonnet 4 / GPT-4o (full quality)

You get the same quality. You pay 69% less on average.
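The "cheapest model that can handle it" rule can be sketched as follows. This is a minimal illustration, not Arbiter's actual routing code: the `Model` class, the complexity levels, the model names, and the relative costs are all assumptions chosen to mirror the three tiers listed above.

```python
from dataclasses import dataclass

# Hypothetical complexity levels; Arbiter's real classifier is not shown here.
SIMPLE, MEDIUM, COMPLEX = 0, 1, 2


@dataclass(frozen=True)
class Model:
    name: str
    relative_cost: float  # 1.0 = frontier price (illustrative numbers only)
    max_complexity: int   # hardest request level this model is trusted with


# Illustrative catalogue mirroring the tiers above; not real pricing data.
MODELS = [
    Model("gemini-flash", 0.05, SIMPLE),
    Model("qwen", 0.10, MEDIUM),
    Model("claude-sonnet", 1.00, COMPLEX),
]


def pick_model(complexity: int) -> Model:
    """Return the cheapest model whose capability covers the request."""
    capable = [m for m in MODELS if m.max_complexity >= complexity]
    return min(capable, key=lambda m: m.relative_cost)


def savings_percent(model: Model) -> float:
    """Savings relative to always using the frontier model."""
    return round((1.0 - model.relative_cost) * 100, 1)
```

With these assumed numbers, `pick_model(SIMPLE)` selects the flash-tier model and `pick_model(COMPLEX)` falls through to the frontier model, which saves nothing.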

Setup Options

Option 1: Interactive Chat

npx arbiter-cli chat

Chat like you would in Claude CLI. Each response shows which model was picked and how much you saved.

⚡ Arbiter Chat

› What is the capital of France?
  Paris.
  ↳ gemini-2.5-flash · saved <$0.001 (95%)

› Design a CRDT for collaborative editing
  Here's an approach using operation-based CRDTs...
  ↳ claude-sonnet-4.6 · saved $0.00 (0%) — frontier needed

Option 2: Coding Agent

npx arbiter-cli code "fix the bug in main.py"
npx arbiter-cli code   # interactive mode

Reads files, writes code, runs commands. Simple file operations are routed to cheap models; architecture decisions go to frontier models.

Option 3: Drop-in Proxy (for your existing code)

npx arbiter-cli init

This adds OPENAI_BASE_URL to your .env. Your existing OpenAI SDK code routes through Arbiter automatically — no code changes.

from openai import OpenAI

# Works unchanged — Arbiter routes behind the scenes
client = OpenAI()  # Reads OPENAI_BASE_URL from the environment (load .env first if your app does not already)
response = client.chat.completions.create(
    model="gpt-4o",  # Arbiter overrides intelligently
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
# → Routed to Gemini Flash, saved 95%
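The OpenAI Python SDK reads `OPENAI_BASE_URL` from the process environment; it does not parse `.env` files itself. If your app does not already load `.env` (for example with python-dotenv), a small standard-library loader like the sketch below is enough. The function name `load_env_file` is hypothetical, and this README does not state which URL `init` writes.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Variables that are already set in the environment are left untouched.
    """
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key.strip(), value)
```

Call `load_env_file()` before constructing `OpenAI()` so the client picks up the base URL that `init` added.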

CLI Commands

| Command | Description |
|---------|-------------|
| chat | Interactive chat with smart routing |
| chat --fast | Prefer low-latency models |
| chat --model claude | Force a specific model |
| code | AI coding agent (interactive) |
| code "task" | One-shot coding task |
| init | Add Arbiter to current project |
| status | Check proxy connection |
| stats | View cost savings |

Chat Commands

| Command | Description |
|---------|-------------|
| /stats | Session cost breakdown |
| /model claude | Switch model (claude, gpt4o, flash, haiku, fable, auto) |
| /good or /bad | Rate response (improves routing) |
| /copy | Copy last response to clipboard |
| /save name | Save conversation |
| /load name | Load conversation |
| """ | Start/end multi-line input |
| quit | Exit |

How It Works

Classify: Each request is analyzed for task type (code, reasoning, analysis, creative, etc.) and complexity (simple/medium/complex) in <1ms
Route: A performance matrix picks the cheapest model that meets the quality bar
Quality Gate: If the cheap model's response fails the quality check, the request is transparently retried on a frontier model
Cache: Identical requests return instantly at $0
Compress: Requests routed to non-frontier models get a concise-answer prompt, which reduces output tokens
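The cache and quality-gate steps above can be sketched as a thin wrapper around any model call. This is an illustration under stated assumptions, not Arbiter's implementation: `complete`, `looks_acceptable`, the in-memory dictionary cache, and the model names `cheap-model` and `frontier-model` are hypothetical, and the quality check is a deliberately crude placeholder.

```python
import hashlib
import json
from typing import Callable, Dict, List

Message = Dict[str, str]
# (model_name, messages) -> response text; supplied by the caller.
CallModel = Callable[[str, List[Message]], str]

_cache: Dict[str, str] = {}


def cache_key(messages: List[Message]) -> str:
    """Stable hash so identical requests map to the same cache entry."""
    payload = json.dumps(messages, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def looks_acceptable(text: str) -> bool:
    """Placeholder quality check: non-empty and not an obvious refusal."""
    return bool(text.strip()) and "i cannot" not in text.lower()


def complete(
    messages: List[Message],
    call_model: CallModel,
    cheap: str = "cheap-model",
    frontier: str = "frontier-model",
) -> str:
    key = cache_key(messages)
    if key in _cache:
        return _cache[key]  # identical request: no model call at all
    answer = call_model(cheap, messages)
    if not looks_acceptable(answer):
        # Quality gate: retry once on the frontier model.
        answer = call_model(frontier, messages)
    _cache[key] = answer
    return answer
```

Passing `call_model` in as a parameter keeps the sketch independent of any particular SDK and makes the retry path easy to exercise with a stub.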

Models Available

| Model | Best for | Cost |
|-------|----------|------|
| Claude Sonnet 4 | Complex reasoning, analysis | $$$ |
| Claude Fable 5 | Autonomous coding agents | $$$$ |
| GPT-4o | Complex code, multi-step | $$$ |
| Gemini 2.5 Flash | Simple Q&A, classification | $ |
| GPT-4o Mini | Simple tasks, extraction | $ |
| Qwen 2.5 72B | Code generation, math | $ |
| Llama 3.3 70B | General tasks | $ |
| Mistral Large | Code review, analysis | $$ |
| Claude 3.5 Haiku | Fast responses | $$ |

Requirements

Node.js 18+
An OpenRouter API key (one key, all models)

Set your key:

export OPENROUTER_API_KEY=sk-or-v1-...
# or add to .env in your project directory

Savings Breakdown

From real testing across 90 varied requests:

| Traffic Type | Routed To | Savings |
|-------------|-----------|---------|
| Simple Q&A (40%) | Gemini Flash | 95% |
| Classification (15%) | Gemini Flash | 95% |
| Code tasks (25%) | Qwen / GPT-4o | 50-93% |
| Complex reasoning (10%) | Claude Sonnet 4 | 0% |
| Analysis (10%) | Claude Sonnet 4 | 0% |
| Average | Mixed | 69% |
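As a sanity check on the 69% figure, the blended average can be recomputed from the traffic shares and per-row savings in the table. The sketch below uses only the table's numbers. Because code tasks are given as a 50-93% range, the result is a range rather than a single value.

```python
# (traffic type, share of traffic in %, low savings %, high savings %)
# All values are copied from the savings table.
ROWS = [
    ("Simple Q&A", 40, 95, 95),
    ("Classification", 15, 95, 95),
    ("Code tasks", 25, 50, 93),
    ("Complex reasoning", 10, 0, 0),
    ("Analysis", 10, 0, 0),
]


def blended_savings(rows, column: int) -> float:
    """Traffic-weighted average of the chosen savings column, in percent."""
    return sum(row[1] * row[column] for row in rows) / 100


low = blended_savings(ROWS, 2)
high = blended_savings(ROWS, 3)
print(f"blended savings: {low}% to {high}%")  # blended savings: 64.75% to 75.5%
```

The reported 69% average falls inside this range and corresponds to code tasks saving about 67% on average.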

License

MIT
