Muxa — Universal LLM Proxy

Muxa is a self-hosted proxy that presents Anthropic- and OpenAI-compatible APIs so IDE/CLI tooling (Claude Code, Cursor, Codex, Continue, Copilot, etc.) can run against your choice of provider—cloud or local—without touching client settings.
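
For example, once the proxy is running (see Quick Start below), any OpenAI-compatible client can point at it directly. A minimal curl sketch, assuming the standard OpenAI chat-completions path and a placeholder model name (substitute whatever you configure in .env):

curl http://localhost:8081/v1/chat/completions \
  -H "Authorization: Bearer sk-muxa" \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello"}]}'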

Why Muxa?

  • One URL for many providers – Point every client at http://localhost:8081, then change providers (OpenRouter, Azure, Databricks, Ollama, etc.) centrally via .env.
  • Auto routing & fallback – Send simple prompts to a local Ollama model, heavy workloads to a cloud model, and fall back automatically when a provider fails.
  • Token optimization – Prompt cache, semantic cache, memory injection, and headroom compression operate server-side so all clients enjoy reduced latency/cost.
  • Observability + policy controls – Built-in load shedding, circuit breaker, structured logs, and Prometheus endpoints give production visibility; policy guards enforce workspace/host/git/test rules.
  • Advanced providers – Some tools only speak the OpenAI/Anthropic APIs; Muxa translates that traffic for providers they don’t natively support (OpenRouter, Ollama, MLX).
  • Easy rollouts – Update .env once; every IDE routed through Muxa immediately uses the new provider/policy set.

Quick Start

npm (Recommended)

# Install globally
npm install -g @thelogicatelier/muxa

# Create a .env with your provider key (see docs for all options)
muxa                  # proxy listens on http://localhost:8081
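
For reference, a minimal .env sketch (these two variables match the Docker example below; see the docs for the full option list):

MUXA_PRIMARY_PROVIDER=openai
OPENAI_API_KEY=sk-your-key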

Or run instantly without installing:

npx @thelogicatelier/muxa

From Source

git clone https://github.com/achatt89/muxa.git
cd muxa
npm install
cp .env.example .env  # fill in OPENAI_API_KEY, OPENROUTER_API_KEY, etc.
npm start             # proxy listens on http://localhost:8081

Docker

The example below uses OpenAI, but you can substitute any supported provider (Anthropic, OpenRouter, Ollama, Databricks, Azure, etc.):

docker build -t muxa .
docker run --rm -p 8081:8081 \
  -e MUXA_PRIMARY_PROVIDER=openai \
  -e OPENAI_API_KEY=sk-your-key \
  muxa:latest

Homebrew (macOS)

brew tap thelogicatelier/muxa https://github.com/achatt89/muxa.git
brew install muxa
muxa --help

Multi-Provider Routing Modes

| Mode | Description |
|------|-------------|
| single | All requests go to MUXA_PRIMARY_PROVIDER. Use this when you only want one provider. |
| hybrid | Muxa evaluates request “complexity” (prompt length, tool use, etc.) and routes high-cost calls to MUXA_FALLBACK_PROVIDER. If the primary fails, the fallback is also used. |

Example .env:

MUXA_ROUTING_STRATEGY=hybrid
MUXA_PRIMARY_PROVIDER=openrouter
MUXA_FALLBACK_PROVIDER=anthropic
OPENROUTER_API_KEY=sk-or-...
ANTHROPIC_API_KEY=sk-ant-...

Use single mode if you don’t need fallback.
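
To confirm where requests actually land, you can query the routing diagnostics endpoint listed under Observability & Diagnostics:

curl http://localhost:8081/routing/stats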

Token Optimization & Memory

Enable caches and memory to cut token usage:

MUXA_PROMPT_CACHE_ENABLED=true
MUXA_PROMPT_CACHE_TTL_MS=120000
MUXA_SEMANTIC_CACHE_ENABLED=true
MUXA_SEMANTIC_CACHE_THRESHOLD=0.9
MUXA_MEMORY_ENABLED=true
MUXA_MEMORY_TOPK=3
MUXA_HEADROOM_ENABLED=true
MUXA_HEADROOM_MODE=optimize

  • Prompt cache instantly returns repeated prompts.
  • Semantic cache reuses answers for similar prompts (requires embeddings provider—see docs/embeddings.md).
  • Memory store injects top-K memories into each request.
  • Headroom exposes /metrics/compression and /headroom/* to track savings.

Variable descriptions:

| Variable | Purpose |
|----------|---------|
| MUXA_PROMPT_CACHE_ENABLED | Enable exact-match cache; repeated prompts return instantly. |
| MUXA_PROMPT_CACHE_TTL_MS | Time-to-live (milliseconds) for prompt cache entries. |
| MUXA_SEMANTIC_CACHE_ENABLED | Enable semantic (embeddings-based) cache. Set to true once embeddings are configured. |
| MUXA_SEMANTIC_CACHE_THRESHOLD | Cosine similarity threshold (0-1) for semantic cache hits. |
| MUXA_MEMORY_ENABLED | Enable long-term memory extraction/storage. |
| MUXA_MEMORY_TOPK | Number of memories injected into each request when relevant. |
| MUXA_HEADROOM_ENABLED | Enable headroom sidecar/compression pipeline. |
| MUXA_HEADROOM_MODE | audit (record metrics only) or optimize (mutate/compress history). |
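
To check that the caches are hitting, query the cache metrics endpoints listed under Observability & Diagnostics, for example:

# Semantic cache hit/miss stats
curl http://localhost:8081/metrics/semantic-cache

# Headroom compression savings
curl http://localhost:8081/metrics/compression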

Client Overrides (Cursor, Claude, Codex, Copilot)

  1. Start Muxa (npm or Docker as shown above).
  2. Point clients at Muxa:

| Client | Configuration |
|--------|---------------|
| Claude Code CLI | ANTHROPIC_BASE_URL=http://localhost:8081 ANTHROPIC_API_KEY=sk-muxa claude "Prompt" (or export those vars once before running). |
| Cursor IDE | Settings → Features → Models → Base URL http://localhost:8081/v1, API key sk-muxa, select the model configured in .env. For @Codebase, enable embeddings (see docs/embeddings.md). |
| OpenAI Codex CLI | codex -c model_provider='"muxa"' -c model='"gpt-5.2-codex"' -c 'model_providers.muxa={name="Muxa Proxy",base_url="http://localhost:8081/v1",wire_api="responses",api_key="sk-muxa"}' (no config file change needed). |
| GitHub Copilot | export GITHUB_COPILOT_PROXY_URL=http://localhost:8081/v1, export GITHUB_COPILOT_PROXY_KEY=dummy, then restart the editor (works for VS Code / JetBrains). |
| Cline / Continue / ClawdBot / other OpenAI-compatible tools | Set their custom OpenAI endpoint to http://localhost:8081/v1 with API key sk-muxa and use your desired model name. |

macOS launchctl environment helpers

When you launch IDEs via Spotlight or the Dock, macOS ignores shell exports. Persist API keys for GUI apps (VS Code, Cursor, Claude CLI, etc.) with launchctl:

# Persist environment variables for GUI-launched apps
# No real key is needed; clients just require a dummy value so the OpenAI/Anthropic SDKs don't complain. The actual keys are read from the .env file.
launchctl setenv OPENAI_API_KEY sk-muxa
launchctl setenv ANTHROPIC_API_KEY sk-muxa
launchctl setenv MUXA_BASE_URL http://localhost:8081

# Inspect current values
launchctl getenv OPENAI_API_KEY
launchctl getenv ANTHROPIC_API_KEY

# Remove when rotating credentials
launchctl unsetenv OPENAI_API_KEY
launchctl unsetenv ANTHROPIC_API_KEY

After setting values, quit/reopen the IDE so it inherits the updated environment.

Observability & Diagnostics

  • /dashboard – lightweight HTML dashboard showing health, metrics, routing, compression, and headroom status (auto-refreshing)
  • /health, /health/live, /health/ready – readiness probes
  • /routing/stats, /debug/session, /v1/agents/* – routing + agent diagnostics
  • /metrics, /metrics/prometheus, /metrics/compression, /metrics/semantic-cache – Prometheus-ready metrics, headroom/semantic cache stats
  • /health/headroom, /headroom/status, /headroom/restart, /headroom/logs – headroom lifecycle
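
Since metrics are exposed in Prometheus format, a standard scrape job can collect them. A minimal prometheus.yml sketch (job name and interval are illustrative):

scrape_configs:
  - job_name: muxa                      # illustrative job name
    metrics_path: /metrics/prometheus   # endpoint listed above
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8081"]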

Cost Optimization & Multi-Model Strategy

Muxa’s proxy lives between every IDE and the upstream provider, so optimization happens once and benefits everything:

  1. Caching chain – Prompt cache handles exact repeat prompts instantly; semantic cache (with embeddings) reuses near-duplicates; the memory store injects curated snippets for long-running projects.
  2. Headroom compression – Enable MUXA_HEADROOM_ENABLED=true to shrink chat histories and reduce token spend while keeping context intact—inspect savings via /metrics/compression.
  3. Hybrid routing – Set MUXA_ROUTING_STRATEGY=hybrid with a fallback provider to auto-route tool-heavy or long prompts to a premium model while keeping cheap models for short requests. Model aliases/fallbacks map IDE-only IDs (e.g., gpt-5.3-codex) to real upstream SKUs without touching editor config.
  4. Instrumentation loop – /api/tokens/stats, /metrics/semantic-cache, /routing/stats, and the dashboard show exactly when caches hit or fallback routing triggers, so you can prove savings to the team.

Best results happen when you warm caches (replay tests or common prompts), keep session_id/user identifiers consistent, and leave the proxy running for a day or two so semantic cache + headroom gather enough data. See docs/cost-optimization.md for the full playbook, configuration examples, timelines, and troubleshooting tips.
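
As a quick warm-up, replaying an identical request should make the second call a prompt-cache hit, assuming MUXA_PROMPT_CACHE_ENABLED=true and the standard OpenAI chat-completions path (the model name is a placeholder):

# Run the same request twice; the second should return from the prompt cache
for i in 1 2; do
  curl -s http://localhost:8081/v1/chat/completions \
    -H "Authorization: Bearer sk-muxa" \
    -H "Content-Type: application/json" \
    -d '{"model": "your-model", "messages": [{"role": "user", "content": "Summarize the repo layout"}]}'
done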

For a deep dive into how hybrid routing scores requests and auto-switches between providers, see docs/routing.md.

Embeddings & @Codebase Support

See docs/embeddings.md for Ollama, llama.cpp, OpenRouter, and OpenAI embeddings configuration. Example (Ollama):

ollama pull nomic-embed-text
export MUXA_SEMANTIC_CACHE_ENABLED=true
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_EMBEDDINGS_MODEL=nomic-embed-text
npm start

Testing

npm test                    # 90+ suites covering API/provider/integration/perf
node scripts/endpoint-parity-preflight.js

Docker Compose Example

See docker-compose.example.yml for a sample proxy + Ollama stack.
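
A rough sketch of what such a stack might look like, assuming the environment variables shown earlier (the example file in the repo is authoritative; the provider id "ollama" is an assumption):

services:
  muxa:
    build: .
    ports:
      - "8081:8081"
    environment:
      - MUXA_PRIMARY_PROVIDER=ollama        # assumed provider id; check the example file
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"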

Additional Documentation

Detailed GitBook-style docs live under docs/.

Built by The Logic Atelier