llm-cache-proxy
v2.0.5
Local-only, zero-dependency caching reverse proxy for the Anthropic Messages API.
On an exact-match repeat it replays the byte-identical cached response with no
upstream call, so each hit saves 100% of that call's tokens. Built for rerun / eval /
CI / dev-loop workloads, where the same /v1/messages request recurs.
- One Node file, no dependencies (proxy-a.mjs).
- Starts in <2s, no database, no API key juggling (reads .env).
- Byte-exact SSE replay (streaming + tool_use preserved verbatim).
- 100%-covered zero-dep unit suite (no network, no paid calls) + a live byte-exact fidelity proof.
- Realtime /monitor stream, this-session + all-time stats, log verbosity, a cache-explorer TUI.
- Loopback by default; opt-in network bind gated by an auth token. Optional boot service (systemd / launchd).
- Optional partial caching: strip dynamic fields (timestamps, session IDs, tool results) before hashing → HIT-NORM; or key on only the last N messages → HIT-SUFFIX.
Measured token savings
Side-by-side, 5 identical /v1/messages calls (Haiku) through the proxy, cache ON
vs bypass (cachectl-a.sh off). Measured via bench.py + the proxy ledger:
| Metric | Cache OFF (bypass) | Cache ON |
|---|---|---|
| Hit rate | 0% | 80% |
| Upstream calls (for 5 identical) | 5 | 1 |
| Tokens billed | all 5 calls | 1 call (4 served free) |
| Tokens saved (ledger) | 0 | 296 |
| Warm-call latency | 1.141 s | 0.001 s (~1000× faster) |
Savings ≈ your full-call repeat rate. With N identical calls the cache eliminates (N−1) of them — here 4/5 = 80%. On a rerun/eval/CI suite that re-issues the same prompts, that is a direct ~80%+ cut in tokens and latency on the repeated portion.
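The arithmetic behind that claim is small enough to sketch. The helper below is illustrative only: `expectedSavings` is not part of the package, and it assumes the first call misses, every later identical call hits, and all calls cost the same number of tokens. The 74 tokens per call is chosen so the result lines up with the 296 in the table; the README does not state the real per-call count.

```javascript
// Hypothetical helper, not part of llm-cache-proxy: estimates what an
// exact-match cache saves when the same request is issued `n` times.
// Assumption: the first call misses and every later identical call hits.
function expectedSavings(n, tokensPerCall) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  const upstreamCalls = 1;        // only the first call reaches the API
  const hits = n - upstreamCalls; // the remaining N-1 calls are replayed
  return {
    upstreamCalls,
    hits,
    hitRate: hits / n,            // (N-1)/N
    tokensSaved: hits * tokensPerCall,
  };
}

const savings = expectedSavings(5, 74); // 74 tok/call is an assumed figure
console.log(savings.hitRate);     // 0.8
console.log(savings.tokensSaved); // 296
```

As N grows the hit rate approaches 100%, which is why suites that re-issue the same prompts many times benefit most.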
By default only exact full-call repeats hit. Novel calls (different messages) always
miss the first time they are seen, so interactive, always-different traffic sees little
benefit. This is a rerun/eval/CI optimizer, not a general speedup. (bench.py's own
"saved %" line is a client-side artifact: a cached response still carries usage numbers,
so the SDK can't tell it was free; the proxy ledger and the 0.001 s latency are ground truth.)
Optional partial caching (normalize.json) extends this: HIT-NORM strips volatile
fields before hashing (timestamps in system prompts, changing session IDs), so "same
logical prompt, different run date" also hits. HIT-SUFFIX (gated, higher risk) ignores
conversation history and keys only on the last N messages — see below.
Reproduce:
./cachectl-a.sh on
.venv/bin/python bench.py --identical 5 --varied 0 --model claude-haiku-4-5-20251001 \
--base-url http://localhost:4000
./cachectl-a.sh stats # hit rate + tokens/$ saved
Install
Needs Node ≥ 18 and a real Anthropic key. Pick one:
brew install mithudso/tap/llm-cache-proxy # Homebrew (macOS / Linux)
npm install -g llm-cache-proxy # npm (or run ad hoc: npx llm-cache-proxy <cmd>)
git clone https://github.com/mithudso/llm-cache-proxy.git && cd llm-cache-proxy # from source
Run
First run — just call on; it detects the missing key and prompts:
llm-cache-proxy on # Homebrew / npm install
./cachectl-a.sh on # source install (repo root)
Both prompt for your Anthropic key and write it to a config file immediately. After that:
export ANTHROPIC_BASE_URL=http://localhost:4000 # point Claude Code / SDK at it
export ANTHROPIC_API_KEY=anything # client key ignored; .env key is used
Config file paths:
| Install method | Config file written on first run |
|---|---|
| Homebrew / npm | ~/.llm-cache-a/.env |
| Source (git clone) | <repo-root>/.env |
What the config file looks like (generated by the setup wizard, chmod 600):
ANTHROPIC_API_KEY_REAL=sk-ant-api03-...
CACHE_PORT=4000
CACHE_TTL_SEC=604800
CACHE_MAX_ENTRIES=5000
CACHE_HOST=127.0.0.1
# CACHE_LOG_LEVEL=info # silent|error|info|debug
# CACHE_LOG_FILE=/Users/you/.llm-cache-a/proxy.log # or 'none'
To re-run setup or change the key: llm-cache-proxy setup (Homebrew/npm) or ./cachectl-a.sh setup (source).
Full control surface:
- Homebrew/npm: llm-cache-proxy on | off | restart | stop | stats | setup | validate
- Source: ./cachectl-a.sh on | off | restart | stop | validate | stats | status | monitor | explore | setup | run | install | uninstall
(off = bypass: forwards everything, caches nothing. restart = stop then start. validate = check config files for errors and run liveness checks if the proxy is up.)
validate — config + runtime health check:
$ llm-cache-proxy validate
== llm-cache-proxy validate ==
Config:
✓ ANTHROPIC_API_KEY_REAL — set (sk-ant-api03-****)
✓ CACHE_PORT=4000
✓ CACHE_HOST=127.0.0.1
✓ normalize.json — valid JSON (2 system_strip, 1 message_strip pattern(s); suffix_only=false)
✓ prices.json — not present (built-in haiku/sonnet/opus prices used)
Runtime (proxy at :4000):
✓ /health → 200
✓ /stats → 200 (42 calls, 38 hits, 90.5% hit rate, cache on)
✓ /metrics → 200 (Prometheus format, expected metrics present)
Result: all checks passed ✓
Exits 0 on all-pass, 1 if any error; safe to use in CI or boot scripts.
npm test runs the zero-dep unit suite (43 tests) against a mock upstream (no network, no key, 100% line/function
coverage of proxy-a.mjs); npm run test:fidelity runs the live, paid byte-exact proof. bench.py needs
anthropic (pip install anthropic).
Full guide: USAGE.md (or ./cachectl-a.sh --help) · docs/INSTALL.md — prerequisites, configuration (env vars, per-model pricing), client setup, monitoring, troubleshooting, uninstall.
How it works
Reverse proxy in front of api.anthropic.com. On each request the proxy tries three cache key tiers in order:
| Tier | Key | Label | Active when |
|---|---|---|---|
| 1 — exact | sha256(model + raw body) | HIT | always |
| 2 — normalized | sha256(model + normalized body) | HIT-NORM | normalize.json present |
| 3 — suffix | sha256(model + system_norm + last N messages) | HIT-SUFFIX | suffix_only: true |
Exact HIT → replay stored bytes, zero upstream call. MISS → forward with the real key, tee the response to client + cache (complete 200s only), and write alias files under any active normalized/suffix keys so future requests at those tiers can also hit. Cache + metrics live in ~/.llm-cache-a/ (outside the repo). See docs/ARCHITECTURE.md.
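The tier order in the table can be sketched as a list of candidate keys tried most-specific first. This is a rough illustration rather than the code in proxy-a.mjs: `candidateKeys`, `lookup`, the `normalize` callback and the exact string fed to SHA-256 are all assumptions made for the example, and an in-memory `Map` stands in for the on-disk cache.

```javascript
import { createHash } from "node:crypto";

// Illustrative only: the real key derivation lives in proxy-a.mjs and may
// differ in detail (separators, field order, encoding).
const sha256 = (s) => createHash("sha256").update(s).digest("hex");

// Build the candidate keys for one request, most specific first.
// `normalize` and `suffixTurns` are hypothetical stand-ins for what
// normalize.json enables.
function candidateKeys(model, rawBody, { normalize = null, suffixTurns = 0 } = {}) {
  const keys = [{ label: "HIT", key: sha256(model + rawBody) }];
  if (normalize) {
    const norm = normalize(JSON.parse(rawBody));
    keys.push({ label: "HIT-NORM", key: sha256(model + JSON.stringify(norm)) });
    if (suffixTurns > 0) {
      const tail = norm.messages.slice(-suffixTurns);
      const material = JSON.stringify(norm.system ?? "") + JSON.stringify(tail);
      keys.push({ label: "HIT-SUFFIX", key: sha256(model + material) });
    }
  }
  return keys;
}

// Return the first tier that has a stored entry, or a MISS.
function lookup(cache, keys) {
  for (const { label, key } of keys) {
    if (cache.has(key)) return { label, body: cache.get(key) };
  }
  return { label: "MISS", body: null };
}

// Demo: two requests that differ only in a date inside the system prompt.
const stripDate = (b) => ({ ...b, system: (b.system ?? "").replace(/Current date[^\n]*/g, "") });
const makeBody = (date) =>
  JSON.stringify({
    system: `Current date: ${date}\nBe brief.`,
    messages: [{ role: "user", content: "hi" }],
  });

const cache = new Map();
const first = candidateKeys("m", makeBody("2026-06-24"), { normalize: stripDate });
console.log(lookup(cache, first).label);  // MISS
// On a miss, store the response under every active key (exact + aliases).
for (const { key } of first) cache.set(key, "stored response bytes");

const second = candidateKeys("m", makeBody("2026-06-25"), { normalize: stripDate });
console.log(lookup(cache, second).label); // HIT-NORM
console.log(lookup(cache, first).label);  // HIT
```

Trying the exact key first means a byte-identical repeat never depends on the looser tiers, which keeps the high-confidence path independent of the normalization rules.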
Partial caching (optional)
Create ~/.llm-cache-a/normalize.json to enable normalized and suffix matching:
{
"system_strip": ["Current date[^\\n]*", "Session-ID: [a-f0-9-]+"],
"message_strip": ["<tool_result>[\\s\\S]*?</tool_result>"],
"suffix_only": false,
"suffix_turns": 3
}
- system_strip: regex patterns stripped from the system prompt before hashing. Use for timestamps, dates, session IDs, or any field that changes run-to-run but doesn't affect the response.
- message_strip: same, applied to message content. Use for <tool_result> blocks that carry dynamic values (file listings, timestamps, prices).
- suffix_only: true: also tries a key built from only the last suffix_turns messages. Hits when a new conversation shares the same recent context with an earlier one. Risk: ignores older history, so a "same last 2 messages" hit in a different logical context replays a response that may not be appropriate. Only enable for idempotent, context-independent queries.
What partial caching does not solve: truly interactive sessions where every turn is unique. If the last N messages are always different, no tier hits. The tier-1 exact cache remains the safe, high-confidence path; tiers 2–3 trade some replay confidence for higher hit rates on structured, partially-dynamic workloads.
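To make the strip patterns concrete, here is a minimal sketch of how such a config could be applied before hashing. It is not taken from proxy-a.mjs: `stripAll` and `normalizeBody` are hypothetical names, and the sketch assumes patterns are applied globally, that `system` is a plain string, and that message `content` is a string (array-of-blocks content is passed through untouched).

```javascript
// Illustrative sketch; how proxy-a.mjs applies the patterns is not shown in
// this README. The config mirrors the normalize.json example above.
const config = {
  system_strip: ["Current date[^\\n]*", "Session-ID: [a-f0-9-]+"],
  message_strip: ["<tool_result>[\\s\\S]*?</tool_result>"],
};

// Remove every match of every pattern from `text`.
function stripAll(text, patterns) {
  return patterns.reduce((t, p) => t.replace(new RegExp(p, "g"), ""), text);
}

// Return a copy of the request body with volatile fields removed.
function normalizeBody(body, cfg) {
  return {
    ...body,
    system:
      typeof body.system === "string"
        ? stripAll(body.system, cfg.system_strip ?? [])
        : body.system,
    messages: body.messages.map((m) =>
      typeof m.content === "string"
        ? { ...m, content: stripAll(m.content, cfg.message_strip ?? []) }
        : m
    ),
  };
}

const runOne = {
  system: "Current date: 2026-06-22\nSession-ID: 3f2a-91bc\nAnswer briefly.",
  messages: [{ role: "user", content: "List files. <tool_result>a.txt b.txt</tool_result>" }],
};
const runTwo = {
  system: "Current date: 2026-06-23\nSession-ID: 77de-0c4a\nAnswer briefly.",
  messages: [{ role: "user", content: "List files. <tool_result>a.txt c.txt</tool_result>" }],
};

// Different raw bodies, identical normalized bodies, hence the same key.
const same =
  JSON.stringify(normalizeBody(runOne, config)) ===
  JSON.stringify(normalizeBody(runTwo, config));
console.log(same); // true
```

Note that the second tool result differs in content (c.txt instead of b.txt) and still normalizes to the same body. That is exactly the trade-off described above: anything stripped can no longer influence which response is replayed.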
Logging & monitoring
Every request emits one structured log line on stdout (captured in ~/.llm-cache-a/proxy.log):
HIT claude-haiku-4-5-20251001 +76tok $0.00025 | saved $0.0007 / 228tok hit-rate 33.3%
MISS claude-haiku-4-5-20251001 200 274tok $0.00104 3951ms [cached] | spend $0.0013
Running counters track tokens and dollars saved (cache hits) versus dollars spent (misses), priced per model. They seed from the metrics log on boot, so totals survive a restart. /stats reports this-session (since the process booted) and all-time (seeded + session) figures:
curl localhost:4000/stats # JSON: top-level = all-time; nested .session = this run
curl localhost:4000/metrics # Prometheus: llm_cache_{hits,misses,tokens_saved,usd_saved,...}_total
curl -N localhost:4000/monitor # realtime SSE: one event per served call (HIT/MISS/…)
./cachectl-a.sh stats # pretty-prints this-session + all-time; offline, reads the ledger
./cachectl-a.sh status # process up? accepting calls? cache on/off? last call? errors this run
./cachectl-a.sh monitor # tails /monitor: #seq type model tok $ ms | snippet
Monitor output example:
2026-06-24T14:20:01Z #0001 MISS claude-haiku-4-5-20251001 33tok $0.00008 1054ms | Gold is a chemical element...
2026-06-24T14:20:03Z #0002 HIT claude-haiku-4-5-20251001 33tok $0.00008 1ms | Gold is a chemical element...
Each event includes a monotonic seq counter (per process) and a snippet of the first 80 chars of the response, which makes it easy to confirm at a glance that cache hits are returning the right content.
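If you want to post-process the monitor output, the lines are regular enough to parse. The parser below is a hypothetical sketch inferred only from the two sample lines above; `parseMonitorLine` is not shipped with the package, and other event types or future versions may format lines differently.

```javascript
// Hypothetical parser for the text lines printed by `cachectl-a.sh monitor`.
// Layout assumed from the samples: time #seq TYPE model Ntok $usd Nms | snippet
const MONITOR_LINE =
  /^(\S+)\s+#(\d+)\s+([A-Z-]+)\s+(\S+)\s+(\d+)tok\s+\$([\d.]+)\s+(\d+)ms\s+\|\s?(.*)$/;

function parseMonitorLine(line) {
  const m = MONITOR_LINE.exec(line);
  if (!m) return null; // unrecognized layout: let the caller decide
  const [, time, seq, type, model, tokens, usd, ms, snippet] = m;
  return {
    time,
    seq: Number(seq),
    type,
    model,
    tokens: Number(tokens),
    usd: Number(usd),
    ms: Number(ms),
    snippet,
  };
}

const event = parseMonitorLine(
  "2026-06-24T14:20:03Z #0002 HIT claude-haiku-4-5-20251001 33tok $0.00008 1ms | Gold is a chemical element..."
);
console.log(event.type, event.seq, event.ms); // HIT 2 1
```

Returning `null` for lines that do not match keeps the parser tolerant of banner or error lines mixed into the stream.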
Log verbosity: CACHE_LOG_LEVEL = silent | error | info (default) | debug (CACHE_QUIET=1 == silent).
Logs tee to stdout and a default file (CACHE_LOG_FILE, default ~/.llm-cache-a/proxy.log; none disables).
/metrics drops straight into Prometheus/Grafana. Pricing is matched by model substring (haiku/sonnet/opus); override or extend it with ~/.llm-cache-a/prices.json ({"haiku":[0.8e-6,4e-6]}).
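Substring-based pricing can be sketched in a few lines. This is an illustration under stated assumptions, not the proxy's implementation: `priceFor` and `costUsd` are hypothetical names, each price pair is assumed to be [USD per input token, USD per output token], and the first matching substring is assumed to win. Only the haiku and opus figures appear in this README, so only those are included.

```javascript
// Illustrative price table, same shape as prices.json.
// Assumption: [USD per input token, USD per output token].
const PRICES = {
  haiku: [0.8e-6, 4e-6],
  opus: [15e-6, 75e-6],
};

// Find the first price entry whose name occurs in the model id.
function priceFor(model, prices = PRICES) {
  const name = Object.keys(prices).find((n) => model.includes(n));
  return name ? prices[name] : null; // null: unknown model
}

// Cost of one call given the usage block of a Messages API response.
function costUsd(model, usage, prices = PRICES) {
  const p = priceFor(model, prices);
  if (!p) return null;
  return usage.input_tokens * p[0] + usage.output_tokens * p[1];
}

const cost = costUsd("claude-haiku-4-5-20251001", {
  input_tokens: 1000,
  output_tokens: 500,
});
console.log(cost.toFixed(4)); // 0.0028
```

Because matching is by substring, a dated model id such as claude-haiku-4-5-20251001 needs no table entry of its own; an override file only has to name the family.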
Network access & auth
The proxy injects the real key for any client that reaches it, so it binds loopback (127.0.0.1) by default. To expose it on a LAN, set CACHE_HOST to a reachable address — which then requires CACHE_AUTH_TOKEN: start() refuses a non-loopback bind without one, and once set, every route except /health requires header x-cache-auth: <token>. The setup wizard generates a token automatically when you pick a non-loopback host.
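The two rules in that paragraph (refuse an unauthenticated non-loopback bind, and require the header on every route except /health) can be sketched as follows. The function names and the set of hosts treated as loopback are assumptions for illustration; the real checks live in proxy-a.mjs.

```javascript
// Illustrative sketch of the bind guard and per-request check.
// Assumption: these three host strings count as loopback.
const LOOPBACK = new Set(["127.0.0.1", "::1", "localhost"]);

// Refuse to start on a non-loopback host unless a token is configured.
function assertSafeBind(host, token) {
  if (!LOOPBACK.has(host) && !token) {
    throw new Error(`refusing to bind ${host} without CACHE_AUTH_TOKEN`);
  }
}

// Decide whether a request may proceed. `headers` uses lowercase names,
// which is how Node's http module exposes them.
function isAuthorized(path, headers, token) {
  if (!token) return true;             // no token configured: loopback-only mode
  if (path === "/health") return true; // the one route left open
  return headers["x-cache-auth"] === token;
}

assertSafeBind("127.0.0.1", undefined); // fine: loopback needs no token

let refused = false;
try {
  assertSafeBind("0.0.0.0", undefined);
} catch {
  refused = true;
}
console.log(refused);                                       // true
console.log(isAuthorized("/stats", {}, "s3cret"));          // false
console.log(isAuthorized("/health", {}, "s3cret"));         // true
console.log(isAuthorized("/stats", { "x-cache-auth": "s3cret" }, "s3cret")); // true
```

A hardened version would likely compare tokens in constant time (for example with `crypto.timingSafeEqual`) rather than with `===`; the plain comparison here keeps the sketch dependency-free.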
CACHE_HOST=0.0.0.0 CACHE_AUTH_TOKEN=$(openssl rand -hex 18) ./cachectl-a.sh on
curl -H "x-cache-auth: <token>" http://<host>:4000/v1/messages ...
Run as a service (start on boot, restart on failure)
./cachectl-a.sh install # systemd user unit (Linux) or launchd agent (macOS)
./cachectl-a.sh uninstall # remove it
./cachectl-a.sh run # foreground exec (what the service manager calls)
Linux gets a systemd user unit (EnvironmentFile=.env, Restart=on-failure, enabled at boot via linger);
macOS gets a launchd agent (RunAtLoad + restart-on-failure), which sources .env via a small wrapper.
Two macOS-specific behaviors are handled correctly: cachectl-a.sh on/off/stop unloads the launchd plist before killing the process (preventing EADDRINUSE from KeepAlive restarting too fast), and cachectl-a.sh status falls back to pgrep when the pidfile is stale after a system reboot (launchd restarts give the process a new PID), auto-healing the pidfile in place.
Cache explorer
./cachectl-a.sh explore # interactive TUI: ↑/↓ browse, enter view, d invalidate, q quit
node cache-explorer.mjs --list # non-interactive: one row per entry
node cache-explorer.mjs --view <keyPrefix> # dump an entry's meta + body head
node cache-explorer.mjs --invalidate <keyPrefix> # delete matching entries
CLI (callable / testable routines)
The proxy's core routines run from the shell (no args = start the server), and are exported for tests:
node proxy-a.mjs stats # print the stats JSON
node proxy-a.mjs price claude-opus-4-8 # [15e-6, 75e-6]
node proxy-a.mjs usage '<text>' # extract {input_tokens, output_tokens}
node proxy-a.mjs key <model> <body> # the exact-match cache key
Guardrails
- Only complete 200 responses cached (streaming requires message_stop).
- TTL 7d (CACHE_TTL_SEC), LRU prune at CACHE_MAX_ENTRIES (5000).
- Fail-open: upstream/proxy errors forward to the client; never break a turn.
- Real key lives only in .env (gitignored, chmod 600); never committed.
- Loopback by default (CACHE_HOST=127.0.0.1); exposing it needs CACHE_AUTH_TOKEN.
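The first two guardrails can be sketched roughly as below. The entry shape ({ storedAt, lastUsed }), the function names, and the substring test for the final SSE event are assumptions made for illustration; the proxy's own checks may be stricter.

```javascript
// Illustrative sketch of the cacheability, TTL and LRU rules.
const TTL_SEC = 604800;   // CACHE_TTL_SEC default (7 days)
const MAX_ENTRIES = 5000; // CACHE_MAX_ENTRIES default

// Only complete 200s are stored. A streamed body must contain the final
// message_stop event; without it the stream was cut off part-way.
function isCacheable(status, isStream, bodyText) {
  if (status !== 200) return false;
  return !isStream || bodyText.includes("event: message_stop");
}

// An entry is stale once it is older than the TTL.
function isExpired(entry, nowMs, ttlSec = TTL_SEC) {
  return nowMs - entry.storedAt > ttlSec * 1000;
}

// Evict least-recently-used entries until the cache is back at the cap.
// Returns the number of entries removed.
function pruneLru(entries, max = MAX_ENTRIES) {
  if (entries.size <= max) return 0;
  const oldestFirst = [...entries].sort((a, b) => a[1].lastUsed - b[1].lastUsed);
  const excess = entries.size - max;
  for (const [key] of oldestFirst.slice(0, excess)) entries.delete(key);
  return excess;
}

const entries = new Map([
  ["a", { storedAt: 0, lastUsed: 30 }],
  ["b", { storedAt: 0, lastUsed: 10 }],
  ["c", { storedAt: 0, lastUsed: 20 }],
]);
console.log(pruneLru(entries, 2), [...entries.keys()]); // 1 [ 'a', 'c' ]
console.log(isCacheable(200, true, "event: message_start\ndata: {}\n\n")); // false
```

Refusing to store truncated streams matters more than it may look: a replayed partial response would be served byte-exact forever, until the TTL removed it.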
Config (env, all optional): CACHE_PORT · CACHE_HOST · CACHE_AUTH_TOKEN · CACHE_TTL_SEC · CACHE_MAX_ENTRIES · CACHE_OFF · CACHE_LOG_LEVEL · CACHE_LOG_FILE.
Correctness & concurrency
Two test layers. npm test is a zero-dep node:test suite that drives the proxy against
a local mock upstream — no network, no key, no paid calls — and enforces 100% line + 100%
function coverage of proxy-a.mjs (hit/miss/coalesce/bypass/expired/prune/seed/auth/monitor/
verbosity + byte-exact multi-chunk SSE replay + session-vs-all-time). npm run test:fidelity
is the live, paid proof: test-fidelity.mjs shows byte-exact cold→warm replay against the
real API for streaming SSE, tool_use, streaming + tool_use, and request coalescing
(a burst of N identical concurrent calls makes exactly one upstream call) — 23/23 pass,
re-verified after the refactor.
Concurrency hardening in proxy-a.mjs:
- Async I/O: cache reads/writes/prune use fs/promises, off the event loop.
- Request coalescing: identical in-flight requests share one upstream fetch (no stampede); extras return x-cache: HIT-COALESCED.
- Client-abort guard: a disconnect tears down the upstream call and never crashes the process; all client writes are guarded.
- Throttled prune: entry count tracked in memory; the LRU sweep runs only when the cap is exceeded, not on every write.
A real claude -p agent loop was run through the proxy end to end: correct output,
streaming intact, zero proxy errors. The proxy is transparent to live Claude Code.
One caveat that the live loop made concrete: interactive Claude Code sessions do not get
cross-run cache hits by default. Claude Code's request bodies vary run to run (dynamic
system prompt and context), so two "identical" sessions hash to different keys. Cache wins
come from deterministic, byte-identical repeats (eval suites, CI, scripted SDK calls,
npm test), not from live agent sessions. Partial caching via normalize.json closes part
of this gap: HIT-NORM handles timestamp/session variation in the system prompt; HIT-SUFFIX
handles different conversation prefixes with a shared recent context — but genuinely novel
interactive turns still miss.
Why not LiteLLM
An earlier LiteLLM-based attempt was evaluated and dropped: ~87s import, >120s flaky
startup, the /v1/messages passthrough route bypassed the cache (0% hits), and
master_key+wildcard routing required a Prisma DB. The hand-rolled zero-dependency
Node proxy (proxy-a.mjs) replaced it. Full decision record in
docs/ARCHITECTURE.md.
