cheap-claude

v1.1.0

Published

9 days ago

57% token cost reduction for Claude Code. Zero quality loss. Plugin + API proxy.


claude claude-code token-optimization cost-reduction anthropic llm prompt-caching proxy mcp plugin

   _____ _                         _____ _                 _      
  / ____| |                       / ____| |               | |     
 | |    | |__   ___  __ _ _ __   | |    | | __ _ _   _  __| | ___ 
 | |    | '_ \ / _ \/ _` | '_ \  | |    | |/ _` | | | |/ _` |/ _ \
 | |____| | | |  __/ (_| | |_) | | |____| | (_| | |_| | (_| |  __/
  \_____|_| |_|\___|\__,_| .__/   \_____|_|\__,_|\__,_|\__,_|\___|
                          | |                                       
                          |_|

70% cost reduction on API calls. 57% on Claude Code sessions. Zero quality loss.

Why This Exists

On April 4, 2026, Anthropic blocked Claude Pro and Max subscribers from using their flat-rate plans with third-party AI agent frameworks — starting with OpenClaw, expanding to all third-party harnesses this month.

The impact is brutal:

- 135,000+ OpenClaw instances were running when the announcement hit
- A $200/month Max subscription was covering $1,000-$5,000 in actual compute, because third-party tools bypass the Prompt Cache optimizations that Anthropic's official tools use to cut costs
- Users are now forced onto pay-as-you-go API billing, where a heavy day of agentic coding can cost $20-50+
- Anthropic is offering a one-time credit and up to 30% off pre-purchased bundles, but that doesn't close the 5x price gap
- Google made a parallel move against Gemini CLI users connecting third-party tools. This is an industry-wide reckoning with the economics of subsidizing agentic compute at flat rates

Cheap Claude exists to close that gap. If you're one of the 135K+ developers who just lost flat-rate pricing, this tool automatically applies the same Prompt Cache optimizations that Anthropic's official tools use — bringing your API costs back down to something sustainable.

Before & After

                        BEFORE                          AFTER
                   (no Cheap Claude)              (with Cheap Claude)

  API session (10 turns):
    Cost:       $0.42/session      →         $0.13/session    (70% saved)
    Cache:           0%            →              90-98%

  Claude Code session (12 turns):
    Cost:       $0.39/session      →         $0.17/session    (57% saved)
    Cache:           0%            →              75-94%

  ┌──────────────────────────────────────────────────────────────┐
  │  Monthly cost (20 devs, Opus 4.6, 5 sessions/day):         │
  │                                                              │
  │  API users:                                                  │
  │  Before:  ████████████████████████████████████████  $1,269   │
  │  After:   ████████████                              $384     │
  │  Saved:                 ░░░░░░░░░░░░░░░░░░░░░░░░░░  $885    │
  │                                                              │
  │  Claude Code users:                                          │
  │  Before:  ████████████████████████████████████████  $1,172   │
  │  After:   ██████████████████                        $446     │
  │  Saved:                     ░░░░░░░░░░░░░░░░░░░░░░  $726    │
  └──────────────────────────────────────────────────────────────┘

Install

Claude Code Plugin (recommended)

# Option A: npm (fastest)
npm install -g cheap-claude
mkdir -p ~/.claude/plugins   # make sure the plugin directory exists
ln -sf "$(npm root -g)/cheap-claude/plugin" ~/.claude/plugins/cheap-claude

# Option B: clone
git clone https://github.com/ajsai47/cheap-claude.git ~/cheap-claude
mkdir -p ~/.claude/plugins   # make sure the plugin directory exists
ln -sf ~/cheap-claude/plugin ~/.claude/plugins/cheap-claude

Restart Claude Code. The plugin activates automatically — you'll see /cheap-stats available and duplicate read warnings in your sessions.

API Proxy (for SDKs, scripts, custom apps)

# Clone + start
git clone https://github.com/ajsai47/cheap-claude.git ~/cheap-claude
cd ~/cheap-claude && npm install
npx tsx src/proxy/server.ts &
export ANTHROPIC_BASE_URL=http://localhost:8082
# Dashboard at http://localhost:8082/dashboard

Commands

Once the plugin is installed, these are available in any Claude Code session:

| Command | What it does |
|---------|-------------|
| /cheap-stats | Show session costs, duplicate reads, saving tips |
| cheap_session_stats | MCP tool: compact cost summary (~50 tokens) |
| cheap_session_details | MCP tool: full breakdown with per-tool stats |
| cheap_dedup_report | MCP tool: list every duplicate file read this session |
| cheap_cost_tip | MCP tool: get personalized cost-saving suggestions |
| cheap_outline | MCP tool: structural file overview (4-8x cheaper than Read) |
| cheap_unfold | MCP tool: read just one function from a file |
| cheap_search | MCP tool: compact grep across directory (max 20 results) |

Example: /cheap-stats

> /cheap-stats

  Cheap Claude — Session Stats
  ─────────────────────────────────────
  Tool calls:        47
  Estimated cost:    $0.14
  File reads:        12 (3 duplicates caught)
  Tokens wasted:     ~4,200 on re-reads

  Tips:
  • You re-read src/index.ts 3 times. Use your existing knowledge.
  • 8 Bash calls — consider using Grep/Glob instead.
  ─────────────────────────────────────

How It Works

Plugin (Claude Code users)

The plugin drops into ~/.claude/plugins/ and activates automatically:

┌─────────────────────────────────────────────────────────┐
│  Claude Code                                            │
│                                                          │
│  ┌──────────┐  CLAUDE.md injects terse + efficiency     │
│  │  Cheap   │  hints every session                      │
│  │  Claude  │                                            │
│  │  Plugin  │  PreToolUse hook warns before re-reading   │
│  │          │  files you already have in context         │
│  │          │                                            │
│  │          │  PostToolUse hook logs every tool call     │
│  │          │  and tracks file read patterns             │
│  │          │                                            │
│  │          │  MCP server provides /cheap-stats and      │
│  │          │  cost tracking tools                       │
│  └──────────┘                                            │
└─────────────────────────────────────────────────────────┘

| Component | File | What it does |
|-----------|------|-------------|
| Context injection | CLAUDE.md | Terse output rules, file-read efficiency hints |
| Read dedup warning | scripts/pre-tool-use.sh | Catches duplicate reads before they happen |
| Tool tracking | scripts/post-tool-use.sh | Logs all tool calls with token estimates |
| Session init | scripts/session-start.sh | Initializes per-session tracking |
| Cost tools | mcp/server.ts | cheap_session_stats, cheap_dedup_report, cheap_cost_tip |
| Stats command | skills/cheap-stats/ | /cheap-stats slash command |
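The read dedup warning comes down to remembering which paths have already been read in the session. The shipped hook is a shell script, so the following is only a minimal TypeScript sketch of the idea; `ReadTracker` and its method names are hypothetical and do not appear in the plugin.

```typescript
// Hypothetical sketch: the real hook lives in scripts/pre-tool-use.sh.
// ReadTracker is an illustrative name, not part of the plugin's API.
class ReadTracker {
  private readonly counts = new Map<string, number>();

  /** Record a read; returns a warning when the path was already read this session. */
  check(path: string): string | null {
    const seen = this.counts.get(path) ?? 0;
    this.counts.set(path, seen + 1);
    return seen > 0
      ? `Already read ${path} ${seen} time(s) this session; reuse the content in context.`
      : null;
  }

  /** Total number of reads beyond the first, summed over all paths. */
  duplicateCount(): number {
    let dupes = 0;
    this.counts.forEach((n) => {
      dupes += n - 1;
    });
    return dupes;
  }
}

const tracker = new ReadTracker();
console.log(tracker.check("src/index.ts")); // first read: null
console.log(tracker.check("src/index.ts")); // second read: warning text
console.log(tracker.duplicateCount());
```

Keying on the path alone is the simplest choice; a stricter version could also compare modification times so that a file edited between reads is not flagged.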

Proxy (API consumers)

Six engines run on every /v1/messages request:

Request  ─→  Model Router  ─→  MCP Optimizer  ─→  History Compressor
                                                          │
api.anthropic.com  ←─  Cache Maximizer  ←─  Output Enforcer  ←─  Result Dedup
         │
         └─→  SSE stream back to client (cost tracked per turn)
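One way to picture the chain above is as a list of functions, each taking a request and returning a possibly modified one. This is a rough sketch under that assumption, not the code in src/proxy/pipeline.ts; `ProxyRequest`, `Engine`, the two toy engines, and the model labels are all hypothetical.

```typescript
// Hypothetical shapes and names, chosen only for illustration.
interface Message { role: "user" | "assistant"; content: string }
interface ProxyRequest { model: string; system: string; messages: Message[] }
type Engine = (req: ProxyRequest) => ProxyRequest;

// Toy Model Router: very short user turns are sent to a cheaper model.
const modelRouter: Engine = (req) => {
  const last = req.messages[req.messages.length - 1];
  const simple =
    last !== undefined && last.role === "user" && last.content.trim().length <= 12;
  return simple ? { ...req, model: "cheap-model" } : req;
};

// Toy Output Enforcer: appends a terse-output hint and leaves token limits alone.
const outputEnforcer: Engine = (req) => ({
  ...req,
  system: `${req.system}\nBe concise.`,
});

// Engines run left to right; each receives the previous engine's output.
const runPipeline = (engines: Engine[], req: ProxyRequest): ProxyRequest =>
  engines.reduce((acc, engine) => engine(acc), req);

const out = runPipeline([modelRouter, outputEnforcer], {
  model: "big-model",
  system: "You are a coding assistant.",
  messages: [{ role: "user", content: "run tests" }],
});
console.log(out.model, JSON.stringify(out.system));
```

Keeping every engine a pure function of the request makes it easy to reorder engines or disable one while measuring its contribution.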

| # | Engine | What it does | Savings |
|---|--------|-------------|---------|
| 1 | Model Router | Routes simple turns ("yes", "run tests") to Haiku | ~5% on Opus |
| 2 | MCP Optimizer | Defers unused tool schemas from turn 1 | 85% tool overhead |
| 3 | History Compressor | Haiku summarizes old turns (keeps last 4 messages) | 50-70% on history |
| 4 | Result Deduplicator | SHA256 dedup of identical file re-reads | 2-5 dupes/session |
| 5 | Output Enforcer | Terse prompt injection (never caps max_tokens) | ~35% output |
| 6 | Cache Maximizer | Adds cache_control breakpoints to prefix | 80-94% cache hits |
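To make the Cache Maximizer row concrete, here is a minimal sketch of adding `cache_control` breakpoints to a Messages API request body. The interfaces cover only the fields the sketch touches, and `addCacheBreakpoints` is a hypothetical helper; the project's actual engine may place breakpoints differently.

```typescript
// Hypothetical types covering only the fields this sketch touches.
interface CacheControl { type: "ephemeral" }
interface TextBlock { type: "text"; text: string; cache_control?: CacheControl }
interface Tool { name: string; input_schema: object; cache_control?: CacheControl }
interface MessagesRequest {
  model: string;
  max_tokens: number;
  system?: string | TextBlock[];
  tools?: Tool[];
  messages: unknown[];
}

const EPHEMERAL: CacheControl = { type: "ephemeral" };

// Returns a copy of the request with breakpoints on the last system block and last tool.
function addCacheBreakpoints(req: MessagesRequest): MessagesRequest {
  // A string system prompt is converted to block form so it can carry cache_control.
  const blocks: TextBlock[] =
    typeof req.system === "string"
      ? [{ type: "text", text: req.system }]
      : (req.system ?? []).map((b) => ({ ...b }));
  if (blocks.length > 0) {
    blocks[blocks.length - 1].cache_control = EPHEMERAL;
  }

  // Copy the tools so the caller's request object is not mutated.
  const tools = req.tools?.map((t) => ({ ...t }));
  if (tools && tools.length > 0) {
    tools[tools.length - 1].cache_control = EPHEMERAL;
  }

  return { ...req, system: blocks.length > 0 ? blocks : req.system, tools };
}

const patched = addCacheBreakpoints({
  model: "example-model", // placeholder, not a real model ID
  max_tokens: 1024,
  system: "You are a coding assistant. (long, stable instructions go here)",
  tools: [{ name: "read_file", input_schema: { type: "object" } }],
  messages: [{ role: "user", content: "hello" }],
});
console.log(JSON.stringify(patched.system));
```

The breakpoints only pay off when the marked prefix stays identical from turn to turn, which is why they sit on the stable parts of the request (tools and system prompt) rather than on the growing message history.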

Dashboard

The proxy serves a live dashboard at http://localhost:8082/dashboard:

┌──────────────────────────────────────────────────┐
│  Cheap Claude                   proxy active     │
│                                                   │
│  Today          Savings          Setup            │
│  ─────────────────────────────────────────────── │
│                                                   │
│  Before:  $5.82 / day                            │
│  Today:   $2.51 / day         ↓ 57%              │
│                                                   │
│  ████████████████████░░░  $3.31 saved today      │
│                                                   │
│  Cache Maximizer      $1.42   42%                │
│  MCP Optimizer        $0.81   24%                │
│  History Compressor   $0.68   21%                │
│  Model Router         $0.22    7%                │
│  Result Deduplicator  $0.12    4%                │
│  Output Enforcer      $0.06    2%                │
└──────────────────────────────────────────────────┘

Pricing Context

Why this matters — the cost difference between cached and uncached:

                    Input         Cache Read       You Save
  Opus 4.6:     $5.00/MTok  →   $0.50/MTok        90%
  Sonnet 4.6:   $3.00/MTok  →   $0.30/MTok        90%
  Haiku 4.5:    $1.00/MTok  →   $0.10/MTok        90%

The Cache Maximizer's job: make as much of every request cacheable as possible. On a typical work turn, about 90% of input tokens are read from cache, and each cached token is billed at a tenth of the normal input rate: 10 cents where uncached input would cost a dollar.
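As a worked example of the pricing table, the sketch below blends the cached and uncached rates for a single request. It uses the Opus 4.6 figures from the table, the function name is hypothetical, and it deliberately ignores any surcharge for writing to the cache and all output tokens.

```typescript
// Blended input cost in USD for one request, given a cache hit rate.
// Simplification: cache-write surcharges and output tokens are not modeled.
function inputCostUSD(
  tokens: number,
  hitRate: number,
  basePerMTok: number,
  cacheReadPerMTok: number,
): number {
  const cached = Math.round(tokens * hitRate);
  const uncached = tokens - cached;
  return (cached * cacheReadPerMTok + uncached * basePerMTok) / 1_000_000;
}

const noCache = inputCostUSD(1_000_000, 0, 5.0, 0.5);
const withCache = inputCostUSD(1_000_000, 0.9, 5.0, 0.5);
console.log(noCache.toFixed(2), withCache.toFixed(2)); // prints "5.00 0.95"
```

The 10% of tokens that miss the cache still cost full price, so a 90% hit rate works out to roughly an 81% cut in input cost rather than a full 90%.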

API Users — What to Expect

The proxy works for any app using the Anthropic SDK. Your savings depend on your system prompt size:

  Your System Prompt          Cache Hit Rate     Savings
  ────────────────────────    ──────────────     ───────
  8K+ tokens (large app)      90-98%             70%     ← verified on 10 turns
  2-8K tokens (medium app)    80-95%             40-60%
  Under the model minimum     0%                 0%      ← too small to cache

  Minimum for caching:
    Haiku 4.5:   2,048 tokens (~8K chars)
    Sonnet/Opus: 1,024 tokens (~4K chars)

If your system prompt is under the minimum, the proxy can't help with caching. The Output Enforcer and Model Router still provide modest savings.
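A quick way to check whether a system prompt clears the minimum is to estimate its token count from its length. This sketch takes the minimums from the list above and the roughly 4 characters per token implied by "2,048 tokens (~8K chars)"; the family labels and function names are hypothetical, and a real tokenizer will give different counts.

```typescript
// Minimums as listed above; keys are illustrative labels, not API model IDs.
const MIN_CACHEABLE_TOKENS: Record<string, number> = {
  haiku: 2048,
  sonnet: 1024,
  opus: 1024,
};

// Rough heuristic only: about 4 characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function isCacheable(family: string, systemPrompt: string): boolean {
  const min = MIN_CACHEABLE_TOKENS[family];
  if (min === undefined) throw new Error(`Unknown model family: ${family}`);
  return estimateTokens(systemPrompt) >= min;
}

const prompt = "x".repeat(5000); // about 1,250 estimated tokens
console.log(isCacheable("sonnet", prompt)); // true
console.log(isCacheable("haiku", prompt));  // false
```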

Verified: 10-turn API chatbot with 8K system prompt

  Turn  1:   0% cache  $0.011  (one-time cache write)
  Turn  2:  98% cache  $0.001  ← 87% cheaper immediately
  Turn  3:  97% cache  $0.002
  ...
  Turn 10:  90% cache  $0.002
  Total:    $0.026 (would have been $0.085 → 70% saved)
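The savings percentages follow from the before and after costs. A small sketch of that arithmetic, using the rounded dollar figures quoted in this README (the helper name is hypothetical):

```typescript
// Percentage saved when cost drops from `before` to `after`.
function savingsPercent(before: number, after: number): number {
  if (before <= 0) throw new RangeError("before must be positive");
  return ((before - after) / before) * 100;
}

const runs = [
  { label: "API session", before: 0.42, after: 0.13 },
  { label: "Claude Code session", before: 0.39, after: 0.17 },
  { label: "10-turn chatbot", before: 0.085, after: 0.026 },
];
for (const r of runs) {
  console.log(`${r.label}: ${savingsPercent(r.before, r.after).toFixed(1)}% saved`);
}
```

Because the inputs are already rounded to cents, the results land within about a point of the headline 70% and 57% figures rather than matching them exactly.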

Best for

- Multi-turn chatbots with stable system prompts → 70% savings
- AI agents with tools and growing history → 57% savings
- RAG apps with large static context + dynamic retrieval → 30-50% savings
- Batch processing with repeated prompts → 50-60% savings

Not helpful for

- One-shot API calls with tiny system prompts
- Apps that already implement cache_control manually
- Streaming-only apps where latency matters more than cost (proxy adds <5ms)

Quick test for your app

# Start the proxy
git clone https://github.com/ajsai47/cheap-claude.git && cd cheap-claude
npm install && npx tsx src/proxy/server.ts &

# Point your app at it
export ANTHROPIC_BASE_URL=http://localhost:8082

# Run your app normally, then check savings
curl http://localhost:8082/stats

The Honest Story

We ran 21+ iterations using an autoagent-style loop. Here's what actually happened:

Iteration   Score    What we tried
─────────   ─────    ─────────────────────────────────────────
Baseline    14.5%    5 engines, nothing tuned
   1        56.9%    ← THE BIG ONE: MCP Optimizer from turn 1
   3        57.9%    Tuned compression thresholds
   8        70.1%    Aggressive file truncation (reverted — loses context)
  15        76.6%    Keep only 1 recent message (reverted — loses context)
  23        79.3%    Cap max_tokens at 30% (reverted — breaks code gen)
  Reset     57.3%    Reverted everything that hurt quality
  Final     62%*     Added model routing (*projected on Opus)
  API test  70%      10-turn chatbot with 8K system prompt (verified)

The jump from 14% to 57% came from one insight: run the MCP Optimizer from turn 1 so the cached prefix is byte-identical across all turns. Everything else was tuning.
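The byte-identical requirement can be illustrated by hashing the part of the request that precedes the conversation. In this sketch, trimming the tool list only from turn 2 changes the prefix, while trimming from turn 1 keeps it stable. The `Turn` shape and `prefixFingerprint` are hypothetical and stand in for whatever the proxy actually serializes.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; a real request carries more fields.
interface ToolDef { name: string; input_schema: object }
interface Turn { system: string; tools: ToolDef[] }

// SHA-256 over the serialized tool definitions and system prompt.
function prefixFingerprint(turn: Turn): string {
  const bytes = JSON.stringify({ tools: turn.tools, system: turn.system });
  return createHash("sha256").update(bytes).digest("hex");
}

const allTools: ToolDef[] = [
  { name: "read_file", input_schema: { type: "object" } },
  { name: "search", input_schema: { type: "object" } },
];
const trimmedTools = allTools.slice(0, 1);
const system = "You are a coding assistant.";

// Optimizing late: turn 1 sends every tool, turn 2 sends the trimmed set.
const late = [allTools, trimmedTools].map((tools) => prefixFingerprint({ system, tools }));
// Optimizing from turn 1: both turns send the same trimmed set.
const early = [trimmedTools, trimmedTools].map((tools) => prefixFingerprint({ system, tools }));

console.log("late optimizer, same prefix:", late[0] === late[1]);     // false
console.log("turn-1 optimizer, same prefix:", early[0] === early[1]); // true
```

A changed fingerprint means the next request cannot reuse the cached prefix and pays full input price again, which is consistent with the large score jump at iteration 1.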

We hit 79% by cheating — truncating file content Claude needs, capping output length, stripping conversation context. It looked great on the eval. It would break real usage. We reverted all of it.

The 70% API number is real — tested on a 10-turn multi-turn chatbot with an 8K-token system prompt. Turns 2-10 all hit 90-98% cache.

57% is the honest number. 62% with model routing on Opus.

Project Structure

cheap-claude/
├── plugin/                        # Claude Code plugin
│   ├── .claude-plugin/plugin.json
│   ├── .mcp.json                  # MCP server registration
│   ├── CLAUDE.md                  # Injected every session
│   ├── hooks/hooks.json           # Lifecycle hooks
│   ├── mcp/server.ts              # Cost tracking tools
│   ├── scripts/                   # Hook implementations
│   └── skills/cheap-stats/        # /cheap-stats command
├── src/proxy/                     # API proxy (6 engines)
│   ├── server.ts                  # HTTP proxy + dashboard
│   ├── pipeline.ts                # Engine orchestrator
│   └── engines/                   # The 6 engines
├── src/dashboard/index.html       # Live cost dashboard
├── src/cli/                       # init + stop
├── test/harness.ts                # 12-turn benchmark
├── eval.sh                        # Automated eval
└── program.md                     # Autoagent config

Data

All data is stored in ~/.claude-cheap/. Nothing is written to your repo, and nothing leaves your machine.

License

MIT

Acknowledgments

ClaudeMem · ccusage · token-optimizer · autoagent