# ai-cost
Local-only cost-and-waste instrumentation for LLM-powered apps — with shadow-mode A/B testing, local-first routing, a multi-repo parallel-agent orchestrator, and an MCP server for Claude Desktop / Claude Code.
🚀 v0.5.1 — BoilTheOcean orchestrator released (2026-04-18). Multi-repo parallel-agent runner with per-task model routing, dynamic TABOO escalation, cost ledger + budget gate, bridge driver for data-sovereignty routing, and a strategic-brain layer that learns across waves. Measured: 41-task dogfood rollup on this codebase, mean +44.3% token savings vs the naive Opus baseline. See docs/RELEASE_NOTES_v0.5.0.md and docs/QUICKSTART.md. 30 new boil-specific tests.
v0.4.0 Sprint 1 rollup: router classifier v2 (+6.5 pts vs v1), scorer v2 with seven new detectors, native TLS/mTLS on the bridge, throughput-first dashboard, macOS DMG distribution, SQLite partitioning. Mean ~55% token reduction on the frozen benchmark suite. See docs/RELEASE_NOTES_v0.4.0.md.
Elastic reviewers, start here: docs/ELASTIC_REVIEW.md is the end-to-end review and test plan (15-minute single-machine walkthrough, two-machine bridge, production-deploy checklist, what-to-evaluate checklist).
What it does: wraps your LLM SDK (Anthropic, OpenAI, Google, Ollama, LM Studio, OpenAI-compat) or slots in as an HTTP proxy, records every call, scores it for waste across 9 categories, and — if you turn on shadow mode — runs a cheaper/local path in parallel and grades the output so you can see, per call, what each optimized route would have saved.

Nothing leaves the machine. All data lives in a local JSONL file you can `cat`.
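
Because the store is plain JSONL, any script can audit it with nothing but the standard library. A minimal sketch — the `model` and `costUsd` field names are assumptions, so `cat` the file to confirm the event shape your version writes:

```ts
import { readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Total recorded spend per model from the local event store.
// Field names (model, costUsd) are illustrative, not a documented schema.
const path = join(homedir(), ".ai-cost", "events.jsonl");
const events = readFileSync(path, "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const byModel: Record<string, number> = {};
for (const e of events) {
  const model = e.model ?? "unknown";
  byModel[model] = (byModel[model] ?? 0) + (e.costUsd ?? 0);
}
console.table(byModel);
```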
## Table of contents
- Install
- Claude skills
- Quick start
- Providers
- Shadow mode (A/B the frontier model against a cheaper one)
- Router (classify + downgrade simple tasks)
- HTTP proxy (one env-var drop-in)
- MCP server (Claude Desktop / Claude Code)
- Bridge (multi-machine MCP — local↔frontier handoff)
- Dashboard
- CLI commands
- What it detects (waste categories)
- Configuration
- Privacy
- Extended docs
## Install

```bash
npx -y @sapperjohn/kostai install
```

That one command:

- stamps `ai-cost.config.json`
- auto-applies safe SDK wrapper starter patches when it can
- writes `.kostai/optimizations.md` with the remaining high-impact changes
- preps the local Command Node bridge so the workspace is ready to join the pool

If you want the package installed in `package.json` too:

```bash
npm install @sapperjohn/kostai
# or
pnpm add @sapperjohn/kostai
```

The CLI binary is available as both `kostai` and `ai-cost`.
## Claude skills

This package ships a Claude skill suite under `skills/`:

- `skills/cost-optimization/` — AI Performance, the Adnan-ready cost proof workflow
- `skills/brainofbrains/` — Brain Orchestration
- `skills/elasticjudge/` — Quality Judge
- `skills/surge/` — deliverables tracking

To install the AI Performance skill for Claude Code:

```bash
npm view @sapperjohn/kostai version   # must be 0.5.2 or newer
npm install -g @sapperjohn/kostai@^0.5.2
ln -s "$(npm prefix -g)/lib/node_modules/@sapperjohn/kostai/skills/cost-optimization" \
  "$HOME/.claude/skills/cost-optimization"
```

The skill is local-first: no MCP server is enabled by default, no prompt bodies are shared, and `scripts/feedback.sh` only writes an opt-in aggregate packet.
## Quick start

1. Initialize

```bash
npx kostai install
```

Creates `ai-cost.config.json`, applies safe starter patches, and refreshes the savings plan.

2. Wrap your client

```ts
import Anthropic from "@anthropic-ai/sdk";
import { wrapAnthropic } from "@sapperjohn/kostai";

const client = wrapAnthropic(new Anthropic(), {
  appName: "my-app",
  route: "bugfix-agent",
});

await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Fix the auth bug" }],
});
```

3. Open the dashboard

```bash
npx ai-cost dashboard
```

Eight tabs: Overview, Shadow Mode, Router, Local LLMs, Bridge, Queue, Calls, Trends. Everything runs on http://localhost:3674.
## Providers

```ts
import {
  wrapAnthropic,
  wrapOpenAI,
  wrapGoogle,       // @google/generative-ai
  wrapOllama,       // local Ollama HTTP client
  wrapOpenAICompat, // LM Studio, Kimi, DeepSeek, vLLM, Moonshot
} from "@sapperjohn/kostai";
```

All wrappers use the same `wrap(client, { appName, route, workflow, tags })` shape. Events are persisted whether the call succeeds or fails.
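
For instance, the OpenAI wrapper takes the same second argument as `wrapAnthropic` above (a sketch — the `route`, `workflow`, and `tags` values here are illustrative):

```ts
import OpenAI from "openai";
import { wrapOpenAI } from "@sapperjohn/kostai";

// Same wrap(client, { appName, route, workflow, tags }) shape as every
// other provider wrapper.
const client = wrapOpenAI(new OpenAI(), {
  appName: "my-app",
  route: "summarizer",
  workflow: "nightly-digest",
  tags: ["batch"],
});

// This call is recorded to the local store whether it succeeds or fails.
const r = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Summarize this changelog." }],
});
```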
## Shadow mode

Every shadow run calls both the frontier model and a cheaper/local path in parallel, returns the frontier result to the app, and writes a comparison record with `baselineCostUsd`, `optimizedCostUsd`, `savedUsd`, and a Kimi-2.5 quality score (0–100).
```ts
import { runShadow, evaluateQuality, wrapAnthropic, wrapOllama } from "@sapperjohn/kostai";

const anthropic = wrapAnthropic(new Anthropic());
const ollama = wrapOllama({ baseUrl: "http://localhost:11434" });

const { baselineResult, comparison } = await runShadow({
  ask: userMessage,
  route: "ticket-classifier",
  baseline: async () => {
    const r = await anthropic.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 256,
      messages: [{ role: "user", content: userMessage }],
    });
    return { event: { id: r._ai_cost_event_id, model: r.model, ... }, result: r };
  },
  optimized: async () => {
    const r = await ollama.chat({
      model: "llama3.2",
      messages: [{ role: "user", content: userMessage }],
    });
    return { event: { id: r._ai_cost_event_id, model: r.model, ... }, result: r };
  },
  qualityEvaluator: evaluateQuality,
});

// baselineResult flows back to the caller. The app never sees optimizedResult
// — shadow mode is read-only w.r.t. production.
```

The dashboard's Shadow Mode tab aggregates these as:
- total saved $
- average saved %
- average quality score
- by model pair
- by route
- recent A/B comparisons (click to see the diff)
## Router

A pure function: given a call, it classifies the task, checks the model, and emits one of four decisions with a USD-denominated savings estimate.
```ts
import { routeCall } from "@sapperjohn/kostai";

const decision = routeCall(
  {
    model: "claude-opus-4-7",
    messages: [{ role: "user", content: "Classify this ticket as bug | feature | question." }],
    inputTokens: 120,
    outputTokensEstimate: 20,
  },
  {
    router: {
      enabled: true,
      localProvider: "ollama",
      localModel: "llama3.2",
      cheapApiProvider: "anthropic",
      cheapApiModel: "claude-haiku-4-5",
    },
  },
);

// decision.decision:
//   "local_sufficient"       — route to ollama/llama3.2
//   "cheaper_api_sufficient" — route to anthropic/claude-haiku-4-5
//   "frontier_required"      — keep on claude-opus-4-7
//   "cache_hit"              — (reserved; identical-prompt detection)
// decision.estimatedSavingsUsd, decision.reason, decision.confidenceLevel
```

The Router dashboard tab scans your recent traffic, runs the same classifier offline, and shows the top 20 routable calls sorted by annualized savings.
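
One way to act on the decision in application code — a sketch, not a helper the package ships. The clients and model strings reuse the earlier examples, and the reserved `cache_hit` falls through to the frontier branch:

```ts
import Anthropic from "@anthropic-ai/sdk";
import { routeCall, wrapAnthropic, wrapOllama } from "@sapperjohn/kostai";

const anthropic = wrapAnthropic(new Anthropic());
const ollama = wrapOllama({ baseUrl: "http://localhost:11434" });

const routerConfig = {
  router: {
    enabled: true,
    localProvider: "ollama",
    localModel: "llama3.2",
    cheapApiProvider: "anthropic",
    cheapApiModel: "claude-haiku-4-5",
  },
};

async function dispatch(userMessage: string) {
  const decision = routeCall(
    {
      model: "claude-opus-4-7",
      messages: [{ role: "user", content: userMessage }],
      inputTokens: 120,
      outputTokensEstimate: 20,
    },
    routerConfig,
  );

  switch (decision.decision) {
    case "local_sufficient":
      // Free local inference covers it.
      return ollama.chat({
        model: "llama3.2",
        messages: [{ role: "user", content: userMessage }],
      });
    case "cheaper_api_sufficient":
      // Downgrade within the same provider.
      return anthropic.messages.create({
        model: "claude-haiku-4-5",
        max_tokens: 256,
        messages: [{ role: "user", content: userMessage }],
      });
    default:
      // frontier_required (and the reserved cache_hit) stay on the frontier.
      return anthropic.messages.create({
        model: "claude-opus-4-7",
        max_tokens: 256,
        messages: [{ role: "user", content: userMessage }],
      });
  }
}
```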
## HTTP proxy

One-env-var adoption. The proxy speaks OpenAI's `/v1/chat/completions` shape.

```bash
# Observe — record every call, never modify
npx ai-cost proxy --mode observe --port 4311

# Route — downgrade confidently routable calls
npx ai-cost proxy --mode route --port 4311

# Shadow — always run both paths, log the comparison
npx ai-cost proxy --mode shadow --port 4311
```

In your app:

```bash
OPENAI_BASE_URL=http://localhost:4311/v1
```

That's the entire integration. No code changes.
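
The same integration expressed in code, for SDKs configured with an explicit base URL rather than the env var (a sketch; the official `openai` package also reads `OPENAI_BASE_URL` on its own):

```ts
import OpenAI from "openai";

// The client talks to the local proxy, which records (and, in route/shadow
// modes, optimizes) the call before forwarding it upstream.
const client = new OpenAI({ baseURL: "http://localhost:4311/v1" });

const r = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "ping" }],
});
```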
## MCP server

Expose ai-cost's primitives over the Model Context Protocol (spec 2024-11-05) so Claude Desktop, Claude Code, or any MCP client can call them in-loop.

```bash
# Inspect the exposed tools
npx ai-cost mcp --list

# Run the server on stdio (register in Claude Desktop's config)
npx ai-cost mcp
```

Tools (28 total today — the same set is served over both stdio and the bridge HTTP transport):

- `ai_cost_overview` — totals + waste breakdown
- `ai_cost_top_workflows` — ranked by avoidable spend
- `ai_cost_recommend_route` — run the router against an ask+model
- `ai_cost_record_call` — manual call logging
- `ai_cost_list_comparisons` — recent shadow-mode comparisons
- `ai_cost_ollama_chat` — pass-through to local Ollama, auto-recorded
- `ai_cost_shadow_compare` — run a live A/B from the caller
- `ai_cost_local_status` — detected local runtimes + config
- `ai_cost_anthropic_chat` — direct Anthropic Messages call (no SDK dep), auto-recorded
- `ai_cost_list_peers` — list configured bridge peers + reachability
- `ai_cost_escalate_to_frontier` — local node asks a frontier-role peer to run a prompt
- `ai_cost_delegate_to_local` — frontier node asks a local-role peer to run a prompt; records would-have-cost savings
- `ai_cost_route_cheap_api` — route to a cheap-API peer, or fall back to the frontier peer with a cheap model override
- `ai_cost_handoff` — router-driven smart dispatch across peers
- `ai_cost_preprocess` — distill a prompt locally before escalation
- `ai_cost_preprocess_then_escalate` — local preprocess + frontier escalation in one tool
- `ai_cost_queue_enqueue` — durable async enqueue for bridge work
- `ai_cost_queue_status` — queue counters + worker heartbeat
- `ai_cost_queue_list` — inspect queued/running/done/failed work
- `ai_cost_queue_cancel` — cancel a queued or running task
- `ai_cost_kb_query` — query the cost-reduction KB for prior routing and optimization patterns
- `ai_cost_govspend_lookup` — look up GovSpend agencies, vendors, and opportunities
- `ai_cost_govspend_summary` — summarize GovSpend corpus coverage, duplicates, and open issues
- `ai_cost_agent_dispatch` — run one BoilTheOcean task on this node via a local driver
- `ai_cost_research_brain` — return the latest research-brain recommendations for this workspace
- `ai_cost_research_fleet` — return the latest cross-project research-fleet rollup
- `ai_cost_research_fleet_dispatch` — enqueue research-fleet specialist packets into the durable queue
- `ai_cost_strategic_brain` — return the strategic-brain status and missing calibration/autonomy actions
Claude Desktop config:

```json
{
  "mcpServers": {
    "ai-cost": {
      "command": "npx",
      "args": ["ai-cost", "mcp"]
    }
  }
}
```

## Bridge
The bridge runs the same MCP tools over an authenticated HTTP+SSE transport so two machines can hand work to each other in-loop. Typical setup: a Mac Mini running Ollama as the local node, a MacBook with the Anthropic API key as the frontier node. Either side can call the other.
Priority #1: make Command Node linkup feel as easy as npm install.
This package is the shared install primitive for BrainOfBrains.ai,
KostAI.app, and CommandNodeAI.com: create an MCP connection operated
by a lightweight open-source model, then pool AI + compute resources across
humans and machines.
Fastest pairing flow:

```bash
# On the inviter's machine/repo
npx ai-cost bridge --invite --invite-name PatrickCommandNode

# On the recipient's machine/repo
npx -y @sapperjohn/kostai install --accept ./.kostai/bridge-invite.json
npx ai-cost bridge --doctor
```

That flow writes:

- a shareable invite at `.kostai/bridge-invite.json`
- a human-readable handoff at `.kostai/BRIDGE_SETUP.md`
- a centralized workspace snapshot at `.kostai/command-node-registry.json`
- a pooled peer registry entry at `~/.ai-cost-peers.json`
Rollout brief: docs/MCP_COMMAND_NODE_1CLICK_ROLLOUT_BRIEF_2026-04-20.md
On each machine:

```bash
# Generate a shared secret (run once per pairing)
npx ai-cost bridge --gen-token
# → 64-char hex string. Put the same value in both machines' configs.

# Start the bridge listener
npx ai-cost bridge --listen
# → ai-cost bridge listening at http://0.0.0.0:4319/mcp/v1
#   tools: 28  transport: http+sse  auth: bearer

# Probe configured peers
npx ai-cost bridge --status
# → ✓ macbook  http://10.0.1.42:4319  role=frontier (claude-opus-4-7)
#   ✓ mini     http://10.0.1.50:4319  role=local    (llama3.2, qwen2.5-coder, ...)
```

Config block (added to `ai-cost.config.json`):
```json
{
  "bridge": {
    "listenPort": 4319,
    "listenHost": "0.0.0.0",
    "authToken": "<64-char hex from --gen-token>",
    "probeTimeoutMs": 10000,
    "peers": [
      {
        "name": "macbook",
        "url": "http://10.0.1.42:4319",
        "token": "<the macbook's authToken>",
        "role": "frontier",
        "frontierModel": "claude-opus-4-7"
      }
    ]
  }
}
```

`bridge.probeTimeoutMs` keeps `bridge --status` and `bridge --doctor` snappy. If a real delegation or escalation regularly needs longer than 60s, raise `bridge.rpcTimeoutMs` globally or set `bridge.peers[*].rpcTimeoutMs` only on the slower peer.
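
For example, a sketch with illustrative timeout values — keep the global default and raise only the slow peer:

```json
{
  "bridge": {
    "rpcTimeoutMs": 60000,
    "peers": [
      {
        "name": "mini",
        "url": "http://10.0.1.50:4319",
        "token": "<the mini's authToken>",
        "role": "local",
        "rpcTimeoutMs": 180000
      }
    ]
  }
}
```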
Flows:

- Escalation (local → frontier): the Mac Mini's local agent calls `ai_cost_escalate_to_frontier` with `messages`, `reason`, and an optional `model`. The bridge POSTs to the MacBook peer's `ai_cost_anthropic_chat`, records the call locally with `route="escalation_request"` and `meta.bridge_peer="macbook"`, and returns the response.
- Delegation (frontier → local): the MacBook calls `ai_cost_delegate_to_local` with the same `messages`. The bridge runs it on the Mini's `ai_cost_ollama_chat`, computes `wouldHaveCost` for the frontier model, and stores `meta.delegation_savings_usd` so the dashboard can total cumulative savings.
- Handoff (smart dispatch): `ai_cost_handoff` runs the router against the `messages`; `local_sufficient` → delegate, `frontier_required` → escalate. Force with `force: "local" | "frontier"`.
The dashboard's Bridge tab shows configured peers, reachability, delegation count, savings to date, and recent escalations with peer + reason metadata. Endpoint: `GET /api/bridge`.
Wire formats (all over `POST /mcp/v1/rpc` with `Authorization: Bearer <token>`):

```bash
# Health (no auth)
curl http://10.0.1.42:4319/mcp/v1/health

# JSON-RPC tools/list
curl -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
  http://10.0.1.42:4319/mcp/v1/rpc

# Server→client notifications stream
curl -H "Authorization: Bearer $TOKEN" \
  http://10.0.1.42:4319/mcp/v1/sse
```

See docs/MAC_MINI_HANDOFF.md for the full two-machine setup walk-through.
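
The same endpoint works from a script, calling a tool instead of listing them. `tools/call` with `{ name, arguments }` is the standard MCP JSON-RPC shape; the empty argument object for `ai_cost_local_status` is an assumption here — use `tools/list` to see each tool's input schema:

```ts
// Sketch: invoke one bridge tool over plain fetch.
const res = await fetch("http://10.0.1.42:4319/mcp/v1/rpc", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.BRIDGE_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    jsonrpc: "2.0",
    id: 1,
    method: "tools/call",
    params: { name: "ai_cost_local_status", arguments: {} },
  }),
});
console.log(await res.json());
```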
## Dashboard

```bash
npx ai-cost dashboard
```

Eight tabs:
- Overview — total spend, avoidable spend, shadow-saved, efficiency score, top waste categories, top repeated prompts, budget banner.
- Shadow Mode — A/B comparisons, saved totals, quality scores.
- Router — recent routable calls and annualized savings.
- Local LLMs — detected Ollama/LM Studio runtimes, local vs. cloud spend, configuration.
- Bridge — peer reachability, delegation count, cumulative savings, recent escalations. Reflects this node's `bridge` config block.
- Queue — durable 24h task queue: queued/running/done/failed counts and per-task inspection.
- Calls — searchable/filterable list of every recorded call. Click for full detail.
- Trends — daily spend and daily avoidable spend charts.
Auto-refreshes every 5 seconds. Local-only. No TLS. No login.
## Elastic / Kibana integration

- Generate a Kibana 8.x bundle: `npx ai-cost kibana --output ai-cost.kibana.ndjson`
- Import path: Kibana → Stack Management → Saved Objects → Import → pick the NDJSON
- Required index pattern: `kostai-shadow-*` (override with `--index <pattern>`)
- Optional shipping hint: `npx ai-cost kibana --filebeat filebeat.yml` writes a Filebeat config that tails `~/.ai-cost/events.jsonl` into that index
## CLI commands
| Command | Description |
|---|---|
| npx -y @sapperjohn/kostai install [--accept <path-or-url>] | One-click bootstrap for config, starter patches, savings plan, and optional Command Node linkup |
| npx ai-cost init | Create config file |
| npx ai-cost dashboard | Start local dashboard on port 3674 |
| npx ai-cost scan [--repo <path>] | Detect local LLM runtimes + LLM usage in a repo |
| npx ai-cost mcp [--list] | Start MCP server over stdio |
| npx ai-cost bridge --listen [--port 4319] [--host 0.0.0.0] | Start the HTTP+SSE MCP bridge |
| npx ai-cost bridge --status | Probe configured peers — reachable, models, errors |
| npx ai-cost bridge --gen-token | Generate a 64-char hex shared secret |
| npx ai-cost proxy --mode <observe\|route\|shadow> | Drop-in OpenAI-compat proxy |
| npx ai-cost compare --limit <n> | Summarize shadow-mode comparisons |
| npx ai-cost report --last 7d | Print markdown report |
| npx ai-cost export --format <json\|csv> | Export events |
| npx ai-cost kibana [--output <path>] [--filebeat <path>] [--index <pattern>] | Emit Kibana-ready NDJSON dashboard bundle |
| npx ai-cost doctor | Check configuration |
| npx ai-cost reset [--comparisons-only] | Clear all stored data |
| npx ai-cost features [--json\|--markdown] [--write-readme] | List every cost-reduction technique currently implemented |
## Current capability set

The full list of cost-reduction techniques KostAI implements right now is generated from `src/capabilities/registry.ts` — the same data the CLI prints when you run `npx kostai features`. Refresh this block with `npx kostai features --markdown --write-readme`.
KostAI currently implements 41 cost-reduction techniques across 9 categories.
### Model routing (4)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Rule-based router (rule-based-router) — classifies each call by complexity and downgrades expensive models when a cheaper tier can handle the task. | GA | `router.enabled` in ai-cost.config.json | Routes simple/deterministic work away from frontier pricing. | $1.01 across 8 pairs |
| Trained-classifier router (v2) (router-classifier-v2) — ML short-circuit in front of the rule router; decides routing from prompt features when confidence is high. | Beta | `router.useClassifierV2 = true` | +6.5pt accuracy vs v1 on the frozen bench; reduces misroutes. | $1.01 across 8 pairs |
| Expensive-model gate (expensive-model-gate) — blocks calls from silently reaching a costly model (configurable $/M-token threshold) unless elevation is justified. | GA | `router.expensiveModelThresholdUsdPerMToken` | Keeps a forgotten `model:` string from burning $75/M output. | — |
| Elevation check (elevation-check) — when a higher tier IS required, emits an auditable justification rather than a silent upgrade. | GA | automatic | Makes tier escalations visible and reviewable. | — |
### Context compression (4)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Deterministic prose compressor (prose-compress) — pure-TS rule-based compressor for long system prompts and memory files; byte-exact on code/URLs/headings, idempotent. | GA | `compressProse(text)` from ./core/prose-compress | ~46% input-token reduction on markdown memory files (adapted from caveman). | $1.07 across 6 pairs |
| Tool-result compression (tool-result-compress) — summarizes large tool outputs (shell output, file dumps, API bodies) with a local model before they hit the frontier. | GA | `compressToolResults({ messages })` from ./core/tool-compress | Cuts the dominant input-token source in agent loops. | $1.07 across 6 pairs |
| Local-model pre-processor (local-preprocess) — runs a local model first to summarize history and draft a local attempt; the frontier sees a distilled prompt. | GA | `preprocess({ messages })` or `preprocessThenEscalate(...)` | Shrinks input tokens to the expensive model; the frontier validates rather than generates. | $1.07 across 6 pairs |
| Draft-Verify-Patch (DVP) (draft-verify-patch) — the local model drafts the answer; the frontier either APPROVES, PATCHES, or REWRITES; output tokens collapse on approve. | GA | `draftVerifyPatch({ messages })` from ./core/draft-verify | Targets output-side cost, where frontier pricing is 5x input. | — |
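
A sketch of the deterministic compressor on a markdown memory file. The root re-export of `compressProse` is an assumption — the registry lists it under `./core/prose-compress`:

```ts
import { readFileSync } from "node:fs";
import { compressProse } from "@sapperjohn/kostai"; // assumed re-export

const memory = readFileSync("CLAUDE.md", "utf8");
const compressed = compressProse(memory);

// Byte-exact on code blocks, URLs, and headings; idempotent, so running it
// twice changes nothing further.
console.log(`chars: ${memory.length} -> ${compressed.length}`);
```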
### Caching (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Anthropic prompt caching (anthropic-prompt-cache) — shapes system blocks with cache_control so repeat calls replay cached tokens at ~90% discount. | GA | `cachedSystem(SYSTEM_PROMPT)` from ./providers | ~90% discount on cached input tokens (Anthropic ephemeral cache). | $1.08 across 5 pairs |
| Semantic cache (semantic-cache) — near-duplicate prompt detection via local embeddings; replays a cached answer for prompts above a cosine threshold. | GA | `cacheOrCall({ key, compute })` from ./core/semantic-cache | Published benchmarks report ~73% cost reduction at a 0.95 threshold on agent workloads. | $1.08 across 5 pairs |
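
A combined sketch of both levers. The import paths assume root re-exports, and how `key` feeds the cosine-threshold match is inferred from the registry row above:

```ts
import Anthropic from "@anthropic-ai/sdk";
import { wrapAnthropic, cachedSystem, cacheOrCall } from "@sapperjohn/kostai";

const client = wrapAnthropic(new Anthropic());
const SYSTEM_PROMPT = "You are a terse ticket-triage assistant.";

// Anthropic prompt caching: cachedSystem emits system blocks carrying
// cache_control, so repeat calls replay the cached prefix.
async function triage(ask: string) {
  return client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 256,
    system: cachedSystem(SYSTEM_PROMPT),
    messages: [{ role: "user", content: ask }],
  });
}

// Semantic cache: near-duplicate asks above the cosine threshold replay the
// stored answer instead of invoking compute again.
const answer = await cacheOrCall({
  key: "Triage: login 500s after deploy",
  compute: () => triage("Triage: login 500s after deploy"),
});
```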
### Waste detection (21)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Duplicate block detector (waste-duplicate-block) — flags identical large content blocks repeated within a single call. | GA | automatic | Straight-line token savings when the duplicate is removed. | — |
| Replayed-history detector (waste-replayed-history) — flags long conversation history being replayed for a narrow follow-up ask. | GA | automatic | Signals a candidate for local preprocessing / summarization. | — |
| Repeated-artifact detector (waste-repeated-artifact) — flags the same large block (logs, files, RAG chunks) sent across recent calls. | GA | automatic | Points at cacheable or deduplicatable content across a session. | — |
| Low-relevance large-block detector (waste-low-relevance) — flags large blocks with weak lexical overlap with the current ask. | GA | automatic | Stitches into the context-pruning recommendation. | — |
| Semantic low-relevance detector (waste-semantic-low-relevance) — embedding-based refinement on top of the lexical relevance detector; off by default for stability. | Beta | `score.semanticRelevance = true` | Catches semantically irrelevant blocks the lexical pass misses. | — |
| Oversized-logs detector (waste-oversized-logs) — flags raw log dumps that could be summarized before being sent. | GA | automatic | Summarization typically retains the signal at a fraction of the tokens. | — |
| Oversized code-context detector (waste-oversized-code-context) — flags overly wide code imports (too many files, whole-repo dumps) for the current ask. | GA | automatic | Narrows the search surface the model has to traverse. | — |
| Cacheable-system-prompt detector (waste-cacheable-system-prompt) — flags stable system prompts resent unchanged across calls; a direct candidate for Anthropic prompt caching. | GA | automatic | Recommends the `cachedSystem()` wrapper; unlocks the ~90% discount. | — |
| System-prompt reuse detector (waste-system-prompt-reuse) — flags near-duplicate system prompts that would be cacheable with a small canonicalization step. | GA | automatic | Catches the trailing-whitespace / timestamp cache-miss class. | — |
| Stale-tool-definitions detector (waste-stale-tool-definitions) — flags tool JSON schemas that haven't changed across calls but are resent every turn. | GA | automatic | Typical recovery: move tool definitions into the cache breakpoint. | — |
| Oversized tool-result detector (waste-oversized-tool-result) — flags tool outputs that would compress well before being echoed back into the prompt. | GA | automatic | Pairs with the tool-result compression wrapper. | — |
| Oversized JSON tool-output detector (waste-oversized-json-tool-output) — tight (90%) savings estimate on structured JSON tool outputs; the highest-confidence compression target. | GA | automatic | JSON is compressible with near-zero semantic loss. | — |
| Verbose prose-input detector (waste-verbose-prose-input) — flags input prose with filler/hedging/pleasantries that the prose compressor would shrink. | GA | automatic | Routes the call to the deterministic prose compressor. | — |
| Verbose output-preamble detector (waste-verbose-output-preamble) — flags boilerplate preambles in responses that the model can be instructed to drop. | GA | automatic | Output-side waste; reduces $/out-token directly. | — |
| Language-verbose output detector (waste-language-verbose-output) — language-specific verbosity patterns (e.g., over-explanatory code comments) that could be trimmed. | GA | automatic | Output-side waste specific to code-generation workloads. | — |
| Repeated image-attachments detector (waste-repeated-image-attachments) — flags image inputs resent across calls; a candidate for a cached reference instead of re-upload. | GA | automatic | Image tokens dominate multimodal spend; cache or reference instead. | — |
| Model-overkill detector (waste-model-overkill) — flags frontier-model calls the router would have routed to a cheaper tier. | GA | automatic | Directly monetizable — the saved delta is routable today. | — |
| Model-downshift-opportunity detector (waste-model-downshift-opportunity) — signals calls whose output suggests the task could have run on a cheaper/smaller model. | GA | automatic | Post-hoc evidence that tightens router rules over time. | — |
| DVP-candidate detector (waste-dvp-candidate) — flags calls whose shape (long output, moderate input) makes them Draft-Verify-Patch candidates. | GA | automatic | Feeds the draft-verify wrapper recommendation. | — |
| Unbounded stream-continuation detector (waste-unbounded-stream) — flags streams that ran to max_tokens without a natural stop; likely over-generation. | GA | automatic | Points at max_tokens / stop-sequence tuning. | — |
| Metadata-only oversized detector (waste-metadata-inferred-oversized) — fallback inference from token counts alone when capture mode stripped bodies for privacy. | GA | automatic | Keeps the scorer useful in the strictest capture mode. | — |
### Shadow mode (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Shadow-mode A/B (shadow-mode) — runs a baseline + optimized call in parallel, returns the baseline to the caller, logs the delta for before/after proof. | GA | `shadowMode.enabled = true` OR `proxy --mode shadow` | Generates the comparison ledger that powers `kostai proof`. | — |
| Quality evaluator (quality-evaluator) — grades optimized-vs-baseline outputs (heuristic + optional LLM judge) so savings claims carry a quality signal. | GA | `shadowMode.runQualityEval = true` | Prevents "cheaper but worse" regressions from slipping through. | — |
### Local inference (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Local-LLM routing (local-routing) — routes eligible calls to Ollama / LM Studio / OpenAI-compat local endpoints instead of a paid API. | GA | `providers.ollama.enabled` + router rules, or `proxy --mode route` | Local inference is $0/token; an electricity-only cost ledger is available. | $0.0256 across 1 pair |
| Local-runtime scan (local-runtime-scan) — detects running Ollama / LM Studio / OpenAI-compat servers on this machine and enumerates installed models. | GA | `npx kostai scan` | Surfaces free local compute that would otherwise be ignored. | — |
### Batching & deliberation (1)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| LLM Council (cost-aggressive) (llm-council) — 3-stage drafter/reviewer/chairman pattern with semantic cache, consensus short-circuit, and free-tier drafters. | Beta | `runCouncil({ ... })` from ./core/council | Preserves Karpathy-council quality; six stacked cost wedges collapse the spend. | — |
### Budget & governance (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Budget gate (budget-gate) — per-wave and per-task spend cap that halts dispatch before runaway cost. | GA | boil `budget.enabled` + max USD in ai-cost.config.json | Hard dollar ceiling on orchestrated agent runs. | — |
| Retention-aware ledger (retention-ledger) — every optimized call is tagged with the mechanism that saved the money, so proof output attributes savings per lever. | GA | automatic | Powers the mechanism-breakdown table in `kostai proof`. | — |
### Observability (3)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Executive proof-of-savings (proof-one-pager) — one-page markdown/HTML/JSON proof: saved $, pass-through subscription value, mechanism breakdown, quality signal. | GA | `npx kostai proof [--html path] [--json path] [--rate 0.10] [--last 30d\|90d\|all]` | Turns the ledger into a defensible CIO-grade artifact. | — |
| Local dashboard (dashboard) — throughput-first web dashboard reading the same local JSONL store: trends, waste categories, per-call inspection. | GA | `npx kostai dashboard` | Keeps the user in the loop without leaving the machine. | — |
| Repo-scan optimization plan (optimize-plan) — scans the current repo for LLM call sites and emits a prioritized .kostai/optimizations.md the agent can apply item-by-item. | GA | `npx kostai optimize` | Magic-sentence entrypoint — an agent reads the plan and implements it. | — |
## What it detects

ai-cost runs 9 waste heuristics on every call (was 8; `local_routable` was added for explicit local-downgrade flagging):
| Category | Confidence | Catches |
|---|---|---|
| duplicate_block | High | Same content repeated within a call |
| replayed_history | Medium | Long conversation replay for narrow asks |
| repeated_artifact | High | Same large block sent across recent calls |
| low_relevance_large_block | Low | Large blocks with weak link to the ask |
| oversized_logs | Medium | Raw logs that could be summarized |
| oversized_code_context | Low | Too many code files for the scope |
| cacheable_system_prompt | High | Stable system prompt resent unchanged |
| model_overkill | Low | Frontier model on a task the router flagged simple |
| local_routable | Medium | Call that would execute correctly on local inference |
Every finding carries `estimatedTokens`, `estimatedCostUsd`, and a confidence level. The dashboard's "Top Waste Categories" is a prioritized remediation list.
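
An illustrative TypeScript shape for one finding — the three documented fields plus the category and confidence values from the table above; anything beyond that is an assumption:

```ts
// Hypothetical shape for illustration; check exported events for the
// exact field names your version writes.
interface WasteFinding {
  category:
    | "duplicate_block"
    | "replayed_history"
    | "repeated_artifact"
    | "low_relevance_large_block"
    | "oversized_logs"
    | "oversized_code_context"
    | "cacheable_system_prompt"
    | "model_overkill"
    | "local_routable";
  confidence: "high" | "medium" | "low";
  estimatedTokens: number;
  estimatedCostUsd: number;
}
```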
## Configuration

```json
{
  "appName": "my-app",
  "storeDir": ".ai-cost-data",
  "port": 3674,
  "captureMode": "metadata_only",
  "redactSecrets": true,
  "redactPatterns": [],
  "providers": {
    "anthropic": { "enabled": true },
    "openai": { "enabled": true },
    "google": { "enabled": true },
    "ollama": { "enabled": true, "baseUrl": "http://localhost:11434", "defaultModel": "llama3.2", "powerWatts": 60, "electricityCostPerKwh": 0.15 },
    "lmstudio": { "enabled": true, "baseUrl": "http://localhost:1234/v1", "defaultModel": "lmstudio-community/llama-3.2-3b-instruct" },
    "openaiCompat": { "enabled": false, "baseUrl": "https://api.moonshot.cn/v1", "defaultModel": "kimi-2.5" }
  },
  "thresholds": {
    "largeBlockTokens": 500,
    "logBlockTokens": 300,
    "repeatedHistoryTurns": 6,
    "efficiencyWarnPct": 70
  },
  "shadowMode": {
    "enabled": true,
    "recordSamplePct": 100,
    "optimizedProvider": "ollama",
    "optimizedModel": "llama3.2",
    "runQualityEval": true
  },
  "router": {
    "enabled": false,
    "simpleTaskMaxTokens": 800,
    "maxLocalLatencyMs": 8000,
    "localProvider": "ollama",
    "localModel": "llama3.2",
    "frontierProvider": "anthropic",
    "frontierModel": "claude-opus-4-7",
    "cheapApiProvider": "anthropic",
    "cheapApiModel": "claude-haiku-4-5"
  },
  "evaluator": {
    "enabled": false,
    "provider": "ollama",
    "model": "kimi-2.5"
  },
  "budget": {
    "monthlyUsd": 500,
    "warnAtPct": 80
  }
}
```

## Privacy
- Default capture mode is `metadata_only` — no content is stored, only hashes, token counts, cost, and scores.
- `redacted_body` stores truncated previews with PII patterns scrubbed.
- `full_body` is opt-in for local debugging only.
- No network egress. Everything runs on the local machine.
- No telemetry. No usage reporting to any external service.
## Extended docs

- docs/ELASTIC_REVIEW.md — step-by-step review and test plan for the Elastic team.
- docs/ARCHITECTURE.md — full architecture, data model, extension points.
- docs/RUNBOOK.md — operational guide: ports, logs, launchd persistence, disk management, upgrade, incident cheatsheet.
- docs/TWO_WAY_BRIDGE.md — two-machine local↔frontier bridge walkthrough.
- docs/MAC_MINI_SETUP.md — Mac-Mini-side install for the bridge peer.
- docs/BUSINESS_PLAN.md — pricing, unit economics, go-to-market.
- docs/ELASTIC_STRATEGY.md — the deck-revision strategy doc with per-pair cost multipliers, empirical savings math, and the slide-by-slide proposal.
- docs/OPENCLAW.md — command-node orientation for any Claude Code instance picking up this repo (macmini ↔ macbook routing, Kimi-first model cascade, cowork boundary).
## Known limitations
- Waste estimates are heuristic — likely waste, not certainty.
- Context relevance is estimated via lexical overlap, not semantic.
- Router rules are regex-based by design (auditable). Replaceable by a trained classifier once a comparison corpus exists.
- Only Node.js / TypeScript SDKs are supported today. Other languages adopt via the HTTP proxy.
- Token counts may be estimated when providers don't return usage.
- Store is append-only JSONL; for >10k events/day, a SQLite backend behind the same `EventStore` interface is the next step.
## License
MIT
