# ai-cost
Local-only cost-and-waste instrumentation for LLM-powered apps — with shadow-mode A/B testing, local-first routing, a multi-repo parallel-agent orchestrator, and an MCP server for Claude Desktop / Claude Code.
🚀 v0.5.1 — BoilTheOcean orchestrator released (2026-04-18). Multi-repo parallel-agent runner with per-task model routing, dynamic TABOO escalation, cost ledger + budget gate, bridge driver for data-sovereignty routing, and a strategic-brain layer that learns across waves. Measured: 41-task dogfood rollup on this codebase, mean +44.3% token savings vs the naive Opus baseline. See docs/RELEASE_NOTES_v0.5.0.md and docs/QUICKSTART.md. 30 new boil-specific tests.
v0.4.0 Sprint 1 rollup: router classifier v2 (+6.5 pts vs v1), scorer v2 with seven new detectors, native TLS/mTLS on the bridge, throughput-first dashboard, macOS DMG distribution, SQLite partitioning. Mean ~55% token reduction on the frozen benchmark suite. See docs/RELEASE_NOTES_v0.4.0.md.
Elastic reviewers, start here: docs/ELASTIC_REVIEW.md is the end-to-end review and test plan (15-minute single-machine walkthrough, two-machine bridge, production-deploy checklist, what-to-evaluate checklist).
What it does: wraps your LLM SDK (Anthropic, OpenAI, Google, Ollama, LM Studio, OpenAI-compat) or slots in as an HTTP proxy, records every call, scores it for waste across 9 categories, and — if you turn on shadow mode — runs a cheaper/local path in parallel and grades the output so you can see, per call, what each optimized route would have saved.

Nothing leaves the machine. All data lives in a local JSONL file you can `cat`.
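
Because the store is plain JSONL, any script can audit it with nothing but the standard library. A minimal sketch — the `model` and `costUsd` field names are assumptions, so `cat` the file to confirm the event shape your version writes:

```ts
import { readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Total recorded spend per model from the local event store.
// Field names (model, costUsd) are illustrative, not a documented schema.
const path = join(homedir(), ".ai-cost", "events.jsonl");
const events = readFileSync(path, "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const byModel: Record<string, number> = {};
for (const e of events) {
  const model = e.model ?? "unknown";
  byModel[model] = (byModel[model] ?? 0) + (e.costUsd ?? 0);
}
console.table(byModel);
```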
## Table of contents
- Install
- Claude skills
- Quick start
- Providers
- Shadow mode (A/B the frontier model against a cheaper one)
- Router (classify + downgrade simple tasks)
- HTTP proxy (one env-var drop-in)
- MCP server (Claude Desktop / Claude Code)
- Bridge (multi-machine MCP — local↔frontier handoff)
- Dashboard
- CLI commands
- What it detects (waste categories)
- Configuration
- Privacy
- Extended docs
## Install

```bash
npx -y @sapperjohn/kostai install
```

That one command:

- stamps `ai-cost.config.json`
- auto-applies safe SDK wrapper starter patches when it can
- writes `.kostai/optimizations.md` with the remaining high-impact changes
- preps the local Command Node bridge so the workspace is ready to join the pool

If you want the package installed in `package.json` too:

```bash
npm install @sapperjohn/kostai
# or
pnpm add @sapperjohn/kostai
```

The CLI binary is available as both `kostai` and `ai-cost`.
## Claude skills

This package ships a Claude skill suite under `skills/`:

- `skills/cost-optimization/` — AI Performance, the Adnan-ready cost proof workflow
- `skills/brainofbrains/` — Brain Orchestration
- `skills/elasticjudge/` — Quality Judge
- `skills/surge/` — deliverables tracking

To install the AI Performance skill for Claude Code:

```bash
npm view @sapperjohn/kostai version   # must be 0.5.2 or newer
npm install -g @sapperjohn/kostai@^0.5.2
ln -s "$(npm prefix -g)/lib/node_modules/@sapperjohn/kostai/skills/cost-optimization" \
  "$HOME/.claude/skills/cost-optimization"
```

The skill is local-first: no MCP server is enabled by default, no prompt bodies are shared, and `scripts/feedback.sh` only writes an opt-in aggregate packet.
## Quick start

1. Initialize

```bash
npx kostai install
```

Creates `ai-cost.config.json`, applies safe starter patches, and refreshes the savings plan.

2. Wrap your client

```ts
import Anthropic from "@anthropic-ai/sdk";
import { wrapAnthropic } from "@sapperjohn/kostai";

const client = wrapAnthropic(new Anthropic(), {
  appName: "my-app",
  route: "bugfix-agent",
});

await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Fix the auth bug" }],
});
```

3. Open the dashboard

```bash
npx ai-cost dashboard
```

Eight tabs: Overview, Shadow Mode, Router, Local LLMs, Bridge, Queue, Calls, Trends. Everything runs on http://localhost:3674.
## Providers

```ts
import {
  wrapAnthropic,
  wrapOpenAI,
  wrapGoogle,       // @google/generative-ai
  wrapOllama,       // local Ollama HTTP client
  wrapOpenAICompat, // LM Studio, Kimi, DeepSeek, vLLM, Moonshot
} from "@sapperjohn/kostai";
```

All wrappers use the same `wrap(client, { appName, route, workflow, tags })` shape. Events are persisted whether the call succeeds or fails.
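
For instance, the OpenAI wrapper takes the same second argument as `wrapAnthropic` above (a sketch — the `route`, `workflow`, and `tags` values here are illustrative):

```ts
import OpenAI from "openai";
import { wrapOpenAI } from "@sapperjohn/kostai";

// Same wrap(client, { appName, route, workflow, tags }) shape as every
// other provider wrapper.
const client = wrapOpenAI(new OpenAI(), {
  appName: "my-app",
  route: "summarizer",
  workflow: "nightly-digest",
  tags: ["batch"],
});

// This call is recorded to the local store whether it succeeds or fails.
const r = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Summarize this changelog." }],
});
```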
## Shadow mode

Every shadow run calls both the frontier model and a cheaper/local path in parallel, returns the frontier result to the app, and writes a comparison record with `baselineCostUsd`, `optimizedCostUsd`, `savedUsd`, and a Kimi-2.5 quality score (0–100).
```ts
import { runShadow, evaluateQuality, wrapAnthropic, wrapOllama } from "@sapperjohn/kostai";

const anthropic = wrapAnthropic(new Anthropic());
const ollama = wrapOllama({ baseUrl: "http://localhost:11434" });

const { baselineResult, comparison } = await runShadow({
  ask: userMessage,
  route: "ticket-classifier",
  baseline: async () => {
    const r = await anthropic.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 256,
      messages: [{ role: "user", content: userMessage }],
    });
    return { event: { id: r._ai_cost_event_id, model: r.model, ... }, result: r };
  },
  optimized: async () => {
    const r = await ollama.chat({
      model: "llama3.2",
      messages: [{ role: "user", content: userMessage }],
    });
    return { event: { id: r._ai_cost_event_id, model: r.model, ... }, result: r };
  },
  qualityEvaluator: evaluateQuality,
});

// baselineResult flows back to the caller. The app never sees optimizedResult
// — shadow mode is read-only w.r.t. production.
```

The dashboard's Shadow Mode tab aggregates these as:
- total saved $
- average saved %
- average quality score
- by model pair
- by route
- recent A/B comparisons (click to see the diff)
## Router

A pure function: given a call, it classifies the task, checks the model, and emits one of four decisions with a USD-denominated savings estimate.
```ts
import { routeCall } from "@sapperjohn/kostai";

const decision = routeCall(
  {
    model: "claude-opus-4-7",
    messages: [{ role: "user", content: "Classify this ticket as bug | feature | question." }],
    inputTokens: 120,
    outputTokensEstimate: 20,
  },
  {
    router: {
      enabled: true,
      localProvider: "ollama",
      localModel: "llama3.2",
      cheapApiProvider: "anthropic",
      cheapApiModel: "claude-haiku-4-5",
    },
  },
);

// decision.decision:
//   "local_sufficient"       — route to ollama/llama3.2
//   "cheaper_api_sufficient" — route to anthropic/claude-haiku-4-5
//   "frontier_required"      — keep on claude-opus-4-7
//   "cache_hit"              — (reserved; identical-prompt detection)
// decision.estimatedSavingsUsd, decision.reason, decision.confidenceLevel
```

The Router dashboard tab scans your recent traffic, runs the same classifier offline, and shows the top 20 routable calls sorted by annualized savings.
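
One way to act on the decision in application code — a sketch, not a helper the package ships. The clients and model strings reuse the earlier examples, and the reserved `cache_hit` falls through to the frontier branch:

```ts
import Anthropic from "@anthropic-ai/sdk";
import { routeCall, wrapAnthropic, wrapOllama } from "@sapperjohn/kostai";

const anthropic = wrapAnthropic(new Anthropic());
const ollama = wrapOllama({ baseUrl: "http://localhost:11434" });

const routerConfig = {
  router: {
    enabled: true,
    localProvider: "ollama",
    localModel: "llama3.2",
    cheapApiProvider: "anthropic",
    cheapApiModel: "claude-haiku-4-5",
  },
};

async function dispatch(userMessage: string) {
  const decision = routeCall(
    {
      model: "claude-opus-4-7",
      messages: [{ role: "user", content: userMessage }],
      inputTokens: 120,
      outputTokensEstimate: 20,
    },
    routerConfig,
  );

  switch (decision.decision) {
    case "local_sufficient":
      // Free local inference covers it.
      return ollama.chat({
        model: "llama3.2",
        messages: [{ role: "user", content: userMessage }],
      });
    case "cheaper_api_sufficient":
      // Downgrade within the same provider.
      return anthropic.messages.create({
        model: "claude-haiku-4-5",
        max_tokens: 256,
        messages: [{ role: "user", content: userMessage }],
      });
    default:
      // frontier_required (and the reserved cache_hit) stay on the frontier.
      return anthropic.messages.create({
        model: "claude-opus-4-7",
        max_tokens: 256,
        messages: [{ role: "user", content: userMessage }],
      });
  }
}
```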
## HTTP proxy

One-env-var adoption. The proxy speaks OpenAI's `/v1/chat/completions` shape.

```bash
# Observe — record every call, never modify
npx ai-cost proxy --mode observe --port 4311

# Route — downgrade confidently routable calls
npx ai-cost proxy --mode route --port 4311

# Shadow — always run both paths, log the comparison
npx ai-cost proxy --mode shadow --port 4311
```

In your app:

```bash
OPENAI_BASE_URL=http://localhost:4311/v1
```

That's the entire integration. No code changes.
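
The same integration expressed in code, for SDKs configured with an explicit base URL rather than the env var (a sketch; the official `openai` package also reads `OPENAI_BASE_URL` on its own):

```ts
import OpenAI from "openai";

// The client talks to the local proxy, which records (and, in route/shadow
// modes, optimizes) the call before forwarding it upstream.
const client = new OpenAI({ baseURL: "http://localhost:4311/v1" });

const r = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "ping" }],
});
```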
## MCP server

Expose ai-cost's primitives over the Model Context Protocol (spec 2024-11-05) so Claude Desktop, Claude Code, or any MCP client can call them in-loop.

```bash
# Inspect the exposed tools
npx ai-cost mcp --list

# Run the server on stdio (register in Claude Desktop's config)
npx ai-cost mcp
```

Tools (28 total today — the same set is served over both stdio and the bridge HTTP transport):

- `ai_cost_overview` — totals + waste breakdown
- `ai_cost_top_workflows` — ranked by avoidable spend
- `ai_cost_recommend_route` — run the router against an ask+model
- `ai_cost_record_call` — manual call logging
- `ai_cost_list_comparisons` — recent shadow-mode comparisons
- `ai_cost_ollama_chat` — pass-through to local Ollama, auto-recorded
- `ai_cost_shadow_compare` — run a live A/B from the caller
- `ai_cost_local_status` — detected local runtimes + config
- `ai_cost_anthropic_chat` — direct Anthropic Messages call (no SDK dep), auto-recorded
- `ai_cost_list_peers` — list configured bridge peers + reachability
- `ai_cost_escalate_to_frontier` — local node asks a frontier-role peer to run a prompt
- `ai_cost_delegate_to_local` — frontier node asks a local-role peer to run a prompt; records would-have-cost savings
- `ai_cost_route_cheap_api` — route to a cheap-API peer, or fall back to the frontier peer with a cheap model override
- `ai_cost_handoff` — router-driven smart dispatch across peers
- `ai_cost_preprocess` — distill a prompt locally before escalation
- `ai_cost_preprocess_then_escalate` — local preprocess + frontier escalation in one tool
- `ai_cost_queue_enqueue` — durable async enqueue for bridge work
- `ai_cost_queue_status` — queue counters + worker heartbeat
- `ai_cost_queue_list` — inspect queued/running/done/failed work
- `ai_cost_queue_cancel` — cancel a queued or running task
- `ai_cost_kb_query` — query the cost-reduction KB for prior routing and optimization patterns
- `ai_cost_govspend_lookup` — look up GovSpend agencies, vendors, and opportunities
- `ai_cost_govspend_summary` — summarize GovSpend corpus coverage, duplicates, and open issues
- `ai_cost_agent_dispatch` — run one BoilTheOcean task on this node via a local driver
- `ai_cost_research_brain` — return the latest research-brain recommendations for this workspace
- `ai_cost_research_fleet` — return the latest cross-project research-fleet rollup
- `ai_cost_research_fleet_dispatch` — enqueue research-fleet specialist packets into the durable queue
- `ai_cost_strategic_brain` — return the strategic-brain status and missing calibration/autonomy actions
Claude Desktop config:

```json
{
  "mcpServers": {
    "ai-cost": {
      "command": "npx",
      "args": ["ai-cost", "mcp"]
    }
  }
}
```

## Bridge
The bridge runs the same MCP tools over an authenticated HTTP+SSE transport so two machines can hand work to each other in-loop. Typical setup: a Mac Mini running Ollama as the local node, a MacBook with the Anthropic API key as the frontier node. Either side can call the other.
Priority #1: make Command Node linkup feel as easy as npm install.
This package is the shared install primitive for BrainOfBrains.ai,
KostAI.app, and CommandNodeAI.com: create an MCP connection operated
by a lightweight open-source model, then pool AI + compute resources across
humans and machines.
Fastest pairing flow:

```bash
# On the inviter's machine/repo
npx ai-cost bridge --invite --invite-name PatrickCommandNode

# On the recipient's machine/repo
npx -y @sapperjohn/kostai install --accept ./.kostai/bridge-invite.json
npx ai-cost bridge --doctor
```

That flow writes:

- a shareable invite at `.kostai/bridge-invite.json`
- a human-readable handoff at `.kostai/BRIDGE_SETUP.md`
- a centralized workspace snapshot at `.kostai/command-node-registry.json`
- a pooled peer registry entry at `~/.ai-cost-peers.json`
Rollout brief: docs/MCP_COMMAND_NODE_1CLICK_ROLLOUT_BRIEF_2026-04-20.md
On each machine:

```bash
# Generate a shared secret (run once per pairing)
npx ai-cost bridge --gen-token
# → 64-char hex string. Put the same value in both machines' configs.

# Start the bridge listener
npx ai-cost bridge --listen
# → ai-cost bridge listening at http://0.0.0.0:4319/mcp/v1
#   tools: 28  transport: http+sse  auth: bearer

# Probe configured peers
npx ai-cost bridge --status
# → ✓ macbook  http://10.0.1.42:4319  role=frontier (claude-opus-4-7)
#   ✓ mini     http://10.0.1.50:4319  role=local    (llama3.2, qwen2.5-coder, ...)
```

Config block (added to `ai-cost.config.json`):
```json
{
  "bridge": {
    "listenPort": 4319,
    "listenHost": "0.0.0.0",
    "authToken": "<64-char hex from --gen-token>",
    "probeTimeoutMs": 10000,
    "peers": [
      {
        "name": "macbook",
        "url": "http://10.0.1.42:4319",
        "token": "<the macbook's authToken>",
        "role": "frontier",
        "frontierModel": "claude-opus-4-7"
      }
    ]
  }
}
```

`bridge.probeTimeoutMs` keeps `bridge --status` and `bridge --doctor` snappy. If a real delegation or escalation regularly needs longer than 60s, raise `bridge.rpcTimeoutMs` globally or set `bridge.peers[*].rpcTimeoutMs` only on the slower peer.
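
For example, a sketch with illustrative timeout values — keep the global default and raise only the slow peer:

```json
{
  "bridge": {
    "rpcTimeoutMs": 60000,
    "peers": [
      {
        "name": "mini",
        "url": "http://10.0.1.50:4319",
        "token": "<the mini's authToken>",
        "role": "local",
        "rpcTimeoutMs": 180000
      }
    ]
  }
}
```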
Flows:

- Escalation (local → frontier): the Mac Mini's local agent calls `ai_cost_escalate_to_frontier` with `messages`, `reason`, and an optional `model`. The bridge POSTs to the MacBook peer's `ai_cost_anthropic_chat`, records the call locally with `route="escalation_request"` and `meta.bridge_peer="macbook"`, and returns the response.
- Delegation (frontier → local): the MacBook calls `ai_cost_delegate_to_local` with the same `messages`. The bridge runs it on the Mini's `ai_cost_ollama_chat`, computes `wouldHaveCost` for the frontier model, and stores `meta.delegation_savings_usd` so the dashboard can total cumulative savings.
- Handoff (smart dispatch): `ai_cost_handoff` runs the router against the `messages`; `local_sufficient` → delegate, `frontier_required` → escalate. Force with `force: "local" | "frontier"`.
The dashboard's Bridge tab shows configured peers, reachability, delegation count, savings to date, and recent escalations with peer + reason metadata. Endpoint: `GET /api/bridge`.
Wire formats (all over `POST /mcp/v1/rpc` with `Authorization: Bearer <token>`):

```bash
# Health (no auth)
curl http://10.0.1.42:4319/mcp/v1/health

# JSON-RPC tools/list
curl -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
  http://10.0.1.42:4319/mcp/v1/rpc

# Server→client notifications stream
curl -H "Authorization: Bearer $TOKEN" \
  http://10.0.1.42:4319/mcp/v1/sse
```

See docs/MAC_MINI_HANDOFF.md for the full two-machine setup walk-through.
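
The same endpoint works from a script, calling a tool instead of listing them. `tools/call` with `{ name, arguments }` is the standard MCP JSON-RPC shape; the empty argument object for `ai_cost_local_status` is an assumption here — use `tools/list` to see each tool's input schema:

```ts
// Sketch: invoke one bridge tool over plain fetch.
const res = await fetch("http://10.0.1.42:4319/mcp/v1/rpc", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.BRIDGE_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    jsonrpc: "2.0",
    id: 1,
    method: "tools/call",
    params: { name: "ai_cost_local_status", arguments: {} },
  }),
});
console.log(await res.json());
```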
## Dashboard

```bash
npx ai-cost dashboard
```

Eight tabs:
- Overview — total spend, avoidable spend, shadow-saved, efficiency score, top waste categories, top repeated prompts, budget banner.
- Shadow Mode — A/B comparisons, saved totals, quality scores.
- Router — recent routable calls and annualized savings.
- Local LLMs — detected Ollama/LM Studio runtimes, local vs. cloud spend, configuration.
- Bridge — peer reachability, delegation count, cumulative savings, recent escalations. Reflects this node's `bridge` config block.
- Queue — durable 24h task queue: queued/running/done/failed counts and per-task inspection.
- Calls — searchable/filterable list of every recorded call. Click for full detail.
- Trends — daily spend and daily avoidable spend charts.
Auto-refreshes every 5 seconds. Local-only. No TLS. No login.
## Elastic / Kibana integration

- Generate a Kibana 8.x bundle: `npx ai-cost kibana --output ai-cost.kibana.ndjson`
- Import path: Kibana → Stack Management → Saved Objects → Import → pick the NDJSON
- Required index pattern: `kostai-shadow-*` (override with `--index <pattern>`)
- Optional shipping hint: `npx ai-cost kibana --filebeat filebeat.yml` writes a Filebeat config that tails `~/.ai-cost/events.jsonl` into that index
## CLI commands
| Command | Description |
|---|---|
| npx -y @sapperjohn/kostai install [--accept <path-or-url>] | One-click bootstrap for config, starter patches, savings plan, and optional Command Node linkup |
| npx ai-cost init | Create config file |
| npx ai-cost dashboard | Start local dashboard on port 3674 |
| npx ai-cost scan [--repo <path>] | Detect local LLM runtimes + LLM usage in a repo |
| npx ai-cost mcp [--list] | Start MCP server over stdio |
| npx ai-cost bridge --listen [--port 4319] [--host 0.0.0.0] | Start the HTTP+SSE MCP bridge |
| npx ai-cost bridge --status | Probe configured peers — reachable, models, errors |
| npx ai-cost bridge --gen-token | Generate a 64-char hex shared secret |
| npx ai-cost proxy --mode <observe\|route\|shadow> | Drop-in OpenAI-compat proxy |
| npx ai-cost compare --limit <n> | Summarize shadow-mode comparisons |
| npx ai-cost report --last 7d | Print markdown report |
| npx ai-cost export --format <json\|csv> | Export events |
| npx ai-cost kibana [--output <path>] [--filebeat <path>] [--index <pattern>] | Emit Kibana-ready NDJSON dashboard bundle |
| npx ai-cost doctor | Check configuration |
| npx ai-cost reset [--comparisons-only] | Clear all stored data |
| npx ai-cost features [--json\|--markdown] [--write-readme] | List every cost-reduction technique currently implemented |
## Current capability set

The full list of cost-reduction techniques KostAI implements right now is generated from `src/capabilities/registry.ts` — the same data the CLI prints when you run `npx kostai features`. Refresh this block with `npx kostai features --markdown --write-readme`.
KostAI currently implements 41 cost-reduction techniques across 9 categories.
### Model routing (4)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Rule-based router (rule-based-router) — classifies each call by complexity and downgrades expensive models when a cheaper tier can handle the task. | GA | `router.enabled` in ai-cost.config.json | Routes simple/deterministic work away from frontier pricing. | $1.01 across 8 pairs |
| Trained-classifier router (v2) (router-classifier-v2) — ML short-circuit in front of the rule router; decides routing from prompt features when confidence is high. | Beta | `router.useClassifierV2 = true` | +6.5pt accuracy vs v1 on the frozen bench; reduces misroutes. | $1.01 across 8 pairs |
| Expensive-model gate (expensive-model-gate) — blocks calls from silently reaching a costly model (configurable $/M-token threshold) unless elevation is justified. | GA | `router.expensiveModelThresholdUsdPerMToken` | Keeps a forgotten `model:` string from burning $75/M output. | — |
| Elevation check (elevation-check) — when a higher tier IS required, emits an auditable justification rather than a silent upgrade. | GA | automatic | Makes tier escalations visible and reviewable. | — |
### Context compression (4)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Deterministic prose compressor (prose-compress) — pure-TS rule-based compressor for long system prompts and memory files; byte-exact on code/URLs/headings, idempotent. | GA | `compressProse(text)` from ./core/prose-compress | ~46% input-token reduction on markdown memory files (adapted from caveman). | $1.07 across 6 pairs |
| Tool-result compression (tool-result-compress) — summarizes large tool outputs (shell output, file dumps, API bodies) with a local model before they hit the frontier. | GA | `compressToolResults({ messages })` from ./core/tool-compress | Cuts the dominant input-token source in agent loops. | $1.07 across 6 pairs |
| Local-model pre-processor (local-preprocess) — runs a local model first to summarize history and draft a local attempt; the frontier sees a distilled prompt. | GA | `preprocess({ messages })` or `preprocessThenEscalate(...)` | Shrinks input tokens to the expensive model; the frontier validates rather than generates. | $1.07 across 6 pairs |
| Draft-Verify-Patch (DVP) (draft-verify-patch) — the local model drafts the answer; the frontier either APPROVES, PATCHES, or REWRITES; output tokens collapse on approve. | GA | `draftVerifyPatch({ messages })` from ./core/draft-verify | Targets output-side cost, where frontier pricing is 5x input. | — |
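
A sketch of the deterministic compressor on a markdown memory file. The root re-export of `compressProse` is an assumption — the registry lists it under `./core/prose-compress`:

```ts
import { readFileSync } from "node:fs";
import { compressProse } from "@sapperjohn/kostai"; // assumed re-export

const memory = readFileSync("CLAUDE.md", "utf8");
const compressed = compressProse(memory);

// Byte-exact on code blocks, URLs, and headings; idempotent, so running it
// twice changes nothing further.
console.log(`chars: ${memory.length} -> ${compressed.length}`);
```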
### Caching (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Anthropic prompt caching (anthropic-prompt-cache) — shapes system blocks with cache_control so repeat calls replay cached tokens at ~90% discount. | GA | `cachedSystem(SYSTEM_PROMPT)` from ./providers | ~90% discount on cached input tokens (Anthropic ephemeral cache). | $1.08 across 5 pairs |
| Semantic cache (semantic-cache) — near-duplicate prompt detection via local embeddings; replays a cached answer for prompts above a cosine threshold. | GA | `cacheOrCall({ key, compute })` from ./core/semantic-cache | Published benchmarks report ~73% cost reduction at a 0.95 threshold on agent workloads. | $1.08 across 5 pairs |
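
A combined sketch of both levers. The import paths assume root re-exports, and how `key` feeds the cosine-threshold match is inferred from the registry row above:

```ts
import Anthropic from "@anthropic-ai/sdk";
import { wrapAnthropic, cachedSystem, cacheOrCall } from "@sapperjohn/kostai";

const client = wrapAnthropic(new Anthropic());
const SYSTEM_PROMPT = "You are a terse ticket-triage assistant.";

// Anthropic prompt caching: cachedSystem emits system blocks carrying
// cache_control, so repeat calls replay the cached prefix.
async function triage(ask: string) {
  return client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 256,
    system: cachedSystem(SYSTEM_PROMPT),
    messages: [{ role: "user", content: ask }],
  });
}

// Semantic cache: near-duplicate asks above the cosine threshold replay the
// stored answer instead of invoking compute again.
const answer = await cacheOrCall({
  key: "Triage: login 500s after deploy",
  compute: () => triage("Triage: login 500s after deploy"),
});
```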
### Waste detection (21)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Duplicate block detector (waste-duplicate-block) — flags identical large content blocks repeated within a single call. | GA | automatic | Straight-line token savings when the duplicate is removed. | — |
| Replayed-history detector (waste-replayed-history) — flags long conversation history being replayed for a narrow follow-up ask. | GA | automatic | Signals a candidate for local preprocessing / summarization. | — |
| Repeated-artifact detector (waste-repeated-artifact) — flags the same large block (logs, files, RAG chunks) sent across recent calls. | GA | automatic | Points at cacheable or deduplicatable content across a session. | — |
| Low-relevance large-block detector (waste-low-relevance) — flags large blocks with weak lexical overlap with the current ask. | GA | automatic | Stitches into the context-pruning recommendation. | — |
| Semantic low-relevance detector (waste-semantic-low-relevance) — embedding-based refinement on top of the lexical relevance detector; off by default for stability. | Beta | `score.semanticRelevance = true` | Catches semantically irrelevant blocks the lexical pass misses. | — |
| Oversized-logs detector (waste-oversized-logs) — flags raw log dumps that could be summarized before being sent. | GA | automatic | Summarization typically retains the signal at a fraction of the tokens. | — |
| Oversized code-context detector (waste-oversized-code-context) — flags overly wide code imports (too many files, whole-repo dumps) for the current ask. | GA | automatic | Narrows the search surface the model has to traverse. | — |
| Cacheable-system-prompt detector (waste-cacheable-system-prompt) — flags stable system prompts resent unchanged across calls; a direct candidate for Anthropic prompt caching. | GA | automatic | Recommends the `cachedSystem()` wrapper; unlocks the ~90% discount. | — |
| System-prompt reuse detector (waste-system-prompt-reuse) — flags near-duplicate system prompts that would be cacheable with a small canonicalization step. | GA | automatic | Catches the trailing-whitespace / timestamp cache-miss class. | — |
| Stale-tool-definitions detector (waste-stale-tool-definitions) — flags tool JSON schemas that haven't changed across calls but are resent every turn. | GA | automatic | Typical recovery: move tool definitions into the cache breakpoint. | — |
| Oversized tool-result detector (waste-oversized-tool-result) — flags tool outputs that would compress well before being echoed back into the prompt. | GA | automatic | Pairs with the tool-result compression wrapper. | — |
| Oversized JSON tool-output detector (waste-oversized-json-tool-output) — tight (90%) savings estimate on structured JSON tool outputs; the highest-confidence compression target. | GA | automatic | JSON is compressible with near-zero semantic loss. | — |
| Verbose prose-input detector (waste-verbose-prose-input) — flags input prose with filler/hedging/pleasantries that the prose compressor would shrink. | GA | automatic | Routes the call to the deterministic prose compressor. | — |
| Verbose output-preamble detector (waste-verbose-output-preamble) — flags boilerplate preambles in responses that the model can be instructed to drop. | GA | automatic | Output-side waste; reduces $/out-token directly. | — |
| Language-verbose output detector (waste-language-verbose-output) — language-specific verbosity patterns (e.g., over-explanatory code comments) that could be trimmed. | GA | automatic | Output-side waste specific to code-generation workloads. | — |
| Repeated image-attachments detector (waste-repeated-image-attachments) — flags image inputs resent across calls; a candidate for a cached reference instead of re-upload. | GA | automatic | Image tokens dominate multimodal spend; cache or reference instead. | — |
| Model-overkill detector (waste-model-overkill) — flags frontier-model calls the router would have routed to a cheaper tier. | GA | automatic | Directly monetizable — the saved delta is routable today. | — |
| Model-downshift-opportunity detector (waste-model-downshift-opportunity) — signals calls whose output suggests the task could have run on a cheaper/smaller model. | GA | automatic | Post-hoc evidence that tightens router rules over time. | — |
| DVP-candidate detector (waste-dvp-candidate) — flags calls whose shape (long output, moderate input) makes them Draft-Verify-Patch candidates. | GA | automatic | Feeds the draft-verify wrapper recommendation. | — |
| Unbounded stream-continuation detector (waste-unbounded-stream) — flags streams that ran to max_tokens without a natural stop; likely over-generation. | GA | automatic | Points at max_tokens / stop-sequence tuning. | — |
| Metadata-only oversized detector (waste-metadata-inferred-oversized) — fallback inference from token counts alone when capture mode stripped bodies for privacy. | GA | automatic | Keeps the scorer useful in the strictest capture mode. | — |
### Shadow mode (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Shadow-mode A/B (shadow-mode) — runs a baseline + optimized call in parallel, returns the baseline to the caller, logs the delta for before/after proof. | GA | `shadowMode.enabled = true` OR `proxy --mode shadow` | Generates the comparison ledger that powers `kostai proof`. | — |
| Quality evaluator (quality-evaluator) — grades optimized-vs-baseline outputs (heuristic + optional LLM judge) so savings claims carry a quality signal. | GA | `shadowMode.runQualityEval = true` | Prevents "cheaper but worse" regressions from slipping through. | — |
### Local inference (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Local-LLM routing (local-routing) — routes eligible calls to Ollama / LM Studio / OpenAI-compat local endpoints instead of a paid API. | GA | `providers.ollama.enabled` + router rules, or `proxy --mode route` | Local inference is $0/token; an electricity-only cost ledger is available. | $0.0256 across 1 pair |
| Local-runtime scan (local-runtime-scan) — detects running Ollama / LM Studio / OpenAI-compat servers on this machine and enumerates installed models. | GA | `npx kostai scan` | Surfaces free local compute that would otherwise be ignored. | — |
### Batching & deliberation (1)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| LLM Council (cost-aggressive) (llm-council) — 3-stage drafter/reviewer/chairman pattern with semantic cache, consensus short-circuit, and free-tier drafters. | Beta | `runCouncil({ ... })` from ./core/council | Preserves Karpathy-council quality; six stacked cost wedges collapse the spend. | — |
### Budget & governance (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Budget gate (budget-gate) — per-wave and per-task spend cap that halts dispatch before runaway cost. | GA | boil `budget.enabled` + max USD in ai-cost.config.json | Hard dollar ceiling on orchestrated agent runs. | — |
| Retention-aware ledger (retention-ledger) — every optimized call is tagged with the mechanism that saved the money, so proof output attributes savings per lever. | GA | automatic | Powers the mechanism-breakdown table in `kostai proof`. | — |
### Observability (3)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Executive proof-of-savings (proof-one-pager) — one-page markdown/HTML/JSON proof: saved $, pass-through subscription value, mechanism breakdown, quality signal. | GA | `npx kostai proof [--html path] [--json path] [--rate 0.10] [--last 30d\|90d\|all]` | Turns the ledger into a defensible CIO-grade artifact. | — |
| Local dashboard (dashboard) — throughput-first web dashboard reading the same local JSONL store: trends, waste categories, per-call inspection. | GA | `npx kostai dashboard` | Keeps the user in the loop without leaving the machine. | — |
| Repo-scan optimization plan (optimize-plan) — scans the current repo for LLM call sites and emits a prioritized .kostai/optimizations.md the agent can apply item-by-item. | GA | `npx kostai optimize` | Magic-sentence entrypoint — an agent reads the plan and implements it. | — |
## What it detects

ai-cost runs 9 waste heuristics on every call (was 8; `local_routable` was added for explicit local-downgrade flagging):
| Category | Confidence | Catches |
|---|---|---|
| duplicate_block | High | Same content repeated within a call |
| replayed_history | Medium | Long conversation replay for narrow asks |
| repeated_artifact | High | Same large block sent across recent calls |
| low_relevance_large_block | Low | Large blocks with weak link to the ask |
| oversized_logs | Medium | Raw logs that could be summarized |
| oversized_code_context | Low | Too many code files for the scope |
| cacheable_system_prompt | High | Stable system prompt resent unchanged |
| model_overkill | Low | Frontier model on a task the router flagged simple |
| local_routable | Medium | Call that would execute correctly on local inference |
Every finding carries `estimatedTokens`, `estimatedCostUsd`, and a confidence level. The dashboard's "Top Waste Categories" is a prioritized remediation list.
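
An illustrative TypeScript shape for one finding — the three documented fields plus the category and confidence values from the table above; anything beyond that is an assumption:

```ts
// Hypothetical shape for illustration; check exported events for the
// exact field names your version writes.
interface WasteFinding {
  category:
    | "duplicate_block"
    | "replayed_history"
    | "repeated_artifact"
    | "low_relevance_large_block"
    | "oversized_logs"
    | "oversized_code_context"
    | "cacheable_system_prompt"
    | "model_overkill"
    | "local_routable";
  confidence: "high" | "medium" | "low";
  estimatedTokens: number;
  estimatedCostUsd: number;
}
```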
## Configuration

```json
{
  "appName": "my-app",
  "storeDir": ".ai-cost-data",
  "port": 3674,
  "captureMode": "metadata_only",
  "redactSecrets": true,
  "redactPatterns": [],
  "providers": {
    "anthropic": { "enabled": true },
    "openai": { "enabled": true },
    "google": { "enabled": true },
    "ollama": { "enabled": true, "baseUrl": "http://localhost:11434", "defaultModel": "llama3.2", "powerWatts": 60, "electricityCostPerKwh": 0.15 },
    "lmstudio": { "enabled": true, "baseUrl": "http://localhost:1234/v1", "defaultModel": "lmstudio-community/llama-3.2-3b-instruct" },
    "openaiCompat": { "enabled": false, "baseUrl": "https://api.moonshot.cn/v1", "defaultModel": "kimi-2.5" }
  },
  "thresholds": {
    "largeBlockTokens": 500,
    "logBlockTokens": 300,
    "repeatedHistoryTurns": 6,
    "efficiencyWarnPct": 70
  },
  "shadowMode": {
    "enabled": true,
    "recordSamplePct": 100,
    "optimizedProvider": "ollama",
    "optimizedModel": "llama3.2",
    "runQualityEval": true
  },
  "router": {
    "enabled": false,
    "simpleTaskMaxTokens": 800,
    "maxLocalLatencyMs": 8000,
    "localProvider": "ollama",
    "localModel": "llama3.2",
    "frontierProvider": "anthropic",
    "frontierModel": "claude-opus-4-7",
    "cheapApiProvider": "anthropic",
    "cheapApiModel": "claude-haiku-4-5"
  },
  "evaluator": {
    "enabled": false,
    "provider": "ollama",
    "model": "kimi-2.5"
  },
  "budget": {
    "monthlyUsd": 500,
    "warnAtPct": 80
  }
}
```

## Privacy
- Default capture mode is `metadata_only` — no content is stored, only hashes, token counts, cost, and scores.
- `redacted_body` stores truncated previews with PII patterns scrubbed.
- `full_body` is opt-in for local debugging only.
- No network egress. Everything runs on the local machine.
- No telemetry. No usage reporting to any external service.
## Extended docs

- docs/ELASTIC_REVIEW.md — step-by-step review and test plan for the Elastic team.
- docs/ARCHITECTURE.md — full architecture, data model, extension points.
- docs/RUNBOOK.md — operational guide: ports, logs, launchd persistence, disk management, upgrade, incident cheatsheet.
- docs/TWO_WAY_BRIDGE.md — two-machine local↔frontier bridge walkthrough.
- docs/MAC_MINI_SETUP.md — Mac-Mini-side install for the bridge peer.
- docs/BUSINESS_PLAN.md — pricing, unit economics, go-to-market.
- docs/ELASTIC_STRATEGY.md — the deck-revision strategy doc with per-pair cost multipliers, empirical savings math, and the slide-by-slide proposal.
- docs/OPENCLAW.md — command-node orientation for any Claude Code instance picking up this repo (macmini ↔ macbook routing, Kimi-first model cascade, cowork boundary).
## Known limitations
- Waste estimates are heuristic — likely waste, not certainty.
- Context relevance is estimated via lexical overlap, not semantic.
- Router rules are regex-based by design (auditable). Replaceable by a trained classifier once a comparison corpus exists.
- Only Node.js / TypeScript SDKs are supported today. Other languages adopt via the HTTP proxy.
- Token counts may be estimated when providers don't return usage.
- Store is append-only JSONL; for >10k events/day, a SQLite backend behind the same `EventStore` interface is the next step.
## License
MIT
