

ai-cost

Local-only cost-and-waste instrumentation for LLM-powered apps — with shadow-mode A/B testing, local-first routing, a multi-repo parallel-agent orchestrator, and an MCP server for Claude Desktop / Claude Code.

🚀 v0.5.1 — BoilTheOcean orchestrator released (2026-04-18). Multi-repo parallel-agent runner with per-task model routing, dynamic TABOO escalation, cost ledger + budget gate, bridge driver for data-sovereignty routing, and a strategic-brain layer that learns across waves. Measured: 41-task dogfood rollup on this codebase, mean +44.3% token savings vs naive Opus baseline. See docs/RELEASE_NOTES_v0.5.0.md

v0.4.0 Sprint 1 rollup: router classifier v2 (+6.5 pts vs v1), scorer v2 with seven new detectors, native TLS/mTLS on the bridge, throughput-first dashboard, macOS DMG distribution, SQLite partitioning. Mean ~55% token reduction on the frozen benchmark suite. See docs/RELEASE_NOTES_v0.4.0.md.

Elastic reviewers, start here: docs/ELASTIC_REVIEW.md is the end-to-end review and test plan (15 min single-machine walkthrough, two-machine bridge, production-deploy checklist, what-to-evaluate checklist).

What it does: wraps your LLM SDK (Anthropic, OpenAI, Google, Ollama, LM Studio, OpenAI-compat) or slots in as an HTTP proxy, records every call, scores it for waste across 9 categories, and — if you turn on shadow mode — runs a cheaper/local path in parallel and grades the output so you can see per-call what each optimized route would have saved.

Nothing leaves the machine. All data lives in a local JSONL file you can cat.
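
For example, a quick look at the last few recorded events (this assumes the ~/.ai-cost/events.jsonl path that the Kibana section's Filebeat hint tails; jq is optional):

tail -n 3 ~/.ai-cost/events.jsonl | jq .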


Install

npx -y @sapperjohn/kostai install

That one command:

  • stamps ai-cost.config.json
  • auto-applies safe SDK wrapper starter patches when it can
  • writes .kostai/optimizations.md with the remaining high-impact changes
  • preps the local Command Node bridge so the workspace is ready to join the pool

If you want the package installed in package.json too:

npm install @sapperjohn/kostai
# or
pnpm add @sapperjohn/kostai

The CLI binary is available as both kostai and ai-cost.

Claude skills

This package ships a Claude skill suite under skills/:

  • skills/cost-optimization/ — AI Performance, the Adnan-ready cost proof workflow
  • skills/brainofbrains/ — Brain Orchestration
  • skills/elasticjudge/ — Quality Judge
  • skills/surge/ — deliverables tracking

To install the AI Performance skill for Claude Code:

npm view @sapperjohn/kostai version  # must be 0.5.2 or newer
npm install -g @sapperjohn/kostai@^0.5.2
ln -s "$(npm prefix -g)/lib/node_modules/@sapperjohn/kostai/skills/cost-optimization" \
  "$HOME/.claude/skills/cost-optimization"

The skill is local-first: no MCP server is enabled by default, no prompt bodies are shared, and scripts/feedback.sh only writes an opt-in aggregate packet.

Quick start

1. Initialize

npx kostai install

Creates ai-cost.config.json, applies safe starter patches, and refreshes the savings plan.

2. Wrap your client

import Anthropic from "@anthropic-ai/sdk";
import { wrapAnthropic } from "@sapperjohn/kostai";

const client = wrapAnthropic(new Anthropic(), {
  appName: "my-app",
  route: "bugfix-agent",
});

await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Fix the auth bug" }],
});

3. Open the dashboard

npx ai-cost dashboard

Eight tabs: Overview, Shadow Mode, Router, Local LLMs, Bridge, Queue, Calls, Trends. Everything runs on http://localhost:3674.

Providers

import {
  wrapAnthropic,
  wrapOpenAI,
  wrapGoogle,            // @google/generative-ai
  wrapOllama,            // local Ollama HTTP client
  wrapOpenAICompat,      // LM Studio, Kimi, DeepSeek, vLLM, Moonshot
} from "@sapperjohn/kostai";

All wrappers use the same wrap(client, { appName, route, workflow, tags }) shape. Events are persisted whether the call succeeds or fails.
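
As one more wrapper sketch, here is wrapOpenAICompat pointed at a local LM Studio endpoint. Two assumptions: the wrapper accepts a standard OpenAI SDK instance (consistent with the shared wrap(client, ...) shape above), and the baseUrl/model match the lmstudio defaults from the Configuration section.

import OpenAI from "openai";
import { wrapOpenAICompat } from "@sapperjohn/kostai";

// LM Studio serves an OpenAI-compatible API; the key is unused locally.
const client = wrapOpenAICompat(
  new OpenAI({ baseURL: "http://localhost:1234/v1", apiKey: "lm-studio" }),
  { appName: "my-app", route: "draft-generator", tags: ["local"] },
);

await client.chat.completions.create({
  model: "lmstudio-community/llama-3.2-3b-instruct",
  messages: [{ role: "user", content: "Summarize this changelog." }],
});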

Shadow mode

Every shadow run calls both the frontier model and a cheaper/local path in parallel, returns the frontier result to the app, and writes a comparison record with baselineCostUsd, optimizedCostUsd, savedUsd, and a Kimi-2.5 quality score (0–100).

import { runShadow, evaluateQuality, wrapAnthropic, wrapOllama } from "@sapperjohn/kostai";

const anthropic = wrapAnthropic(new Anthropic());
const ollama = wrapOllama({ baseUrl: "http://localhost:11434" });

const { baselineResult, comparison } = await runShadow({
  ask: userMessage,
  route: "ticket-classifier",
  baseline: async () => {
    const r = await anthropic.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 256,
      messages: [{ role: "user", content: userMessage }],
    });
    return { event: { id: r._ai_cost_event_id, model: r.model, ... }, result: r };
  },
  optimized: async () => {
    const r = await ollama.chat({
      model: "llama3.2",
      messages: [{ role: "user", content: userMessage }],
    });
    return { event: { id: r._ai_cost_event_id, model: r.model, ... }, result: r };
  },
  qualityEvaluator: evaluateQuality,
});

// baselineResult flows back to the caller. The app never sees optimizedResult
// — shadow mode is read-only w.r.t. production.

The dashboard's Shadow Mode tab aggregates these as:

  • total saved $
  • average saved %
  • average quality score
  • by model pair
  • by route
  • recent A/B comparisons (click to see the diff)
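
The same comparison ledger is summarizable from the CLI (see the command table below):

npx ai-cost compare --limit 20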

Router

Pure function. Given a call, it classifies the task, checks the model, and emits one of four decisions with a USD-denominated savings estimate.

import { routeCall } from "@sapperjohn/kostai";

const decision = routeCall(
  {
    model: "claude-opus-4-7",
    messages: [{ role: "user", content: "Classify this ticket as bug | feature | question." }],
    inputTokens: 120,
    outputTokensEstimate: 20,
  },
  {
    router: {
      enabled: true,
      localProvider: "ollama",
      localModel: "llama3.2",
      cheapApiProvider: "anthropic",
      cheapApiModel: "claude-haiku-4-5",
    },
  },
);

// decision.decision:
//   "local_sufficient"        — route to ollama/llama3.2
//   "cheaper_api_sufficient"  — route to anthropic/claude-haiku-4-5
//   "frontier_required"       — keep on claude-opus-4-7
//   "cache_hit"               — (reserved; identical-prompt detection)
// decision.estimatedSavingsUsd, decision.reason, decision.confidenceLevel

The Router dashboard tab scans your recent traffic, runs the same classifier offline, and shows the top 20 routable calls sorted by annualized savings.
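
A minimal dispatch sketch that acts on the decision in application code (the wrapped clients come from the earlier examples; the token counts are placeholders; cache_hit is ignored since it's reserved):

import Anthropic from "@anthropic-ai/sdk";
import { routeCall, wrapAnthropic, wrapOllama } from "@sapperjohn/kostai";

const anthropic = wrapAnthropic(new Anthropic());
const ollama = wrapOllama({ baseUrl: "http://localhost:11434" });

async function dispatch(userMessage: string) {
  const messages = [{ role: "user" as const, content: userMessage }];
  const decision = routeCall(
    { model: "claude-opus-4-7", messages, inputTokens: 120, outputTokensEstimate: 20 },
    {
      router: {
        enabled: true,
        localProvider: "ollama",
        localModel: "llama3.2",
        cheapApiProvider: "anthropic",
        cheapApiModel: "claude-haiku-4-5",
      },
    },
  );
  switch (decision.decision) {
    case "local_sufficient":
      // Free local inference; the wrapper still records the call.
      return ollama.chat({ model: "llama3.2", messages });
    case "cheaper_api_sufficient":
      return anthropic.messages.create({ model: "claude-haiku-4-5", max_tokens: 256, messages });
    default:
      // frontier_required (cache_hit is reserved and not handled here)
      return anthropic.messages.create({ model: "claude-opus-4-7", max_tokens: 256, messages });
  }
}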

HTTP proxy

One env-var adoption. The proxy speaks OpenAI's /v1/chat/completions shape.

# Observe — record every call, never modify
npx ai-cost proxy --mode observe --port 4311

# Route — downgrade confidently routable calls
npx ai-cost proxy --mode route --port 4311

# Shadow — always run both paths, log the comparison
npx ai-cost proxy --mode shadow --port 4311

In your app:

OPENAI_BASE_URL=http://localhost:4311/v1

That's the entire integration. No code changes.
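
If you prefer setting it in code rather than via the environment, the OpenAI SDK accepts the same base URL directly. A sketch (the model name passes through the proxy to whatever upstream you configured):

import OpenAI from "openai";

// Equivalent to exporting OPENAI_BASE_URL: every chat completion
// now flows through the local ai-cost proxy on port 4311.
const openai = new OpenAI({
  baseURL: "http://localhost:4311/v1",
  apiKey: process.env.OPENAI_API_KEY,
});

await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Classify this ticket as bug | feature | question." }],
});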

MCP server

Expose ai-cost's primitives over the Model Context Protocol (spec 2024-11-05) so Claude Desktop, Claude Code, or any MCP client can call them in-loop.

# Inspect the exposed tools
npx ai-cost mcp --list

# Run the server on stdio (register in Claude Desktop's config)
npx ai-cost mcp

Tools (28 total today — same set served over both stdio and the bridge HTTP transport):

  • ai_cost_overview — totals + waste breakdown
  • ai_cost_top_workflows — ranked by avoidable spend
  • ai_cost_recommend_route — run the router against an ask+model
  • ai_cost_record_call — manual call logging
  • ai_cost_list_comparisons — recent shadow-mode comparisons
  • ai_cost_ollama_chat — pass-through to local Ollama, auto-recorded
  • ai_cost_shadow_compare — run a live A/B from the caller
  • ai_cost_local_status — detected local runtimes + config
  • ai_cost_anthropic_chat — direct Anthropic Messages call (no SDK dep), auto-recorded
  • ai_cost_list_peers — list configured bridge peers + reachability
  • ai_cost_escalate_to_frontier — local node asks a frontier-role peer to run a prompt
  • ai_cost_delegate_to_local — frontier node asks a local-role peer to run a prompt; records would-have-cost savings
  • ai_cost_route_cheap_api — route to a cheap-API peer, or fall back to the frontier peer with a cheap model override
  • ai_cost_handoff — router-driven smart dispatch across peers
  • ai_cost_preprocess — distill a prompt locally before escalation
  • ai_cost_preprocess_then_escalate — local preprocess + frontier escalation in one tool
  • ai_cost_queue_enqueue — durable async enqueue for bridge work
  • ai_cost_queue_status — queue counters + worker heartbeat
  • ai_cost_queue_list — inspect queued/running/done/failed work
  • ai_cost_queue_cancel — cancel a queued or running task
  • ai_cost_kb_query — query the cost-reduction KB for prior routing and optimization patterns
  • ai_cost_govspend_lookup — look up GovSpend agencies, vendors, and opportunities
  • ai_cost_govspend_summary — summarize GovSpend corpus coverage, duplicates, and open issues
  • ai_cost_agent_dispatch — run one BoilTheOcean task on this node via a local driver
  • ai_cost_research_brain — return the latest research-brain recommendations for this workspace
  • ai_cost_research_fleet — return the latest cross-project research-fleet rollup
  • ai_cost_research_fleet_dispatch — enqueue research-fleet specialist packets into the durable queue
  • ai_cost_strategic_brain — return the strategic-brain status and missing calibration/autonomy actions

Claude Desktop config:

{
  "mcpServers": {
    "ai-cost": {
      "command": "npx",
      "args": ["ai-cost", "mcp"]
    }
  }
}

Bridge

The bridge runs the same MCP tools over an authenticated HTTP+SSE transport so two machines can hand work to each other in-loop. Typical setup: a Mac Mini running Ollama as the local node, a MacBook with the Anthropic API key as the frontier node. Either side can call the other.

Priority #1: make Command Node linkup feel as easy as npm install. This package is the shared install primitive for BrainOfBrains.ai, KostAI.app, and CommandNodeAI.com: create an MCP connection operated by a lightweight open-source model, then pool AI + compute resources across humans and machines.

Fastest pairing flow:

# On the inviter's machine/repo
npx ai-cost bridge --invite --invite-name PatrickCommandNode

# On the recipient's machine/repo
npx -y @sapperjohn/kostai install --accept ./.kostai/bridge-invite.json
npx ai-cost bridge --doctor

That flow writes:

  • a shareable invite at .kostai/bridge-invite.json
  • a human-readable handoff at .kostai/BRIDGE_SETUP.md
  • a centralized workspace snapshot at .kostai/command-node-registry.json
  • a pooled peer registry entry at ~/.ai-cost-peers.json

Rollout brief: docs/MCP_COMMAND_NODE_1CLICK_ROLLOUT_BRIEF_2026-04-20.md

On each machine:

# Generate a shared secret (run once per pairing)
npx ai-cost bridge --gen-token
# → 64-char hex string. Put the same value in both machines' configs.

# Start the bridge listener
npx ai-cost bridge --listen
# → ai-cost bridge listening at http://0.0.0.0:4319/mcp/v1
#   tools: 28    transport: http+sse    auth: bearer

# Probe configured peers
npx ai-cost bridge --status
# → ✓ macbook  http://10.0.1.42:4319  role=frontier  (claude-opus-4-7)
#   ✓ mini     http://10.0.1.50:4319  role=local     (llama3.2, qwen2.5-coder, ...)

Config block (added to ai-cost.config.json):

{
  "bridge": {
    "listenPort": 4319,
    "listenHost": "0.0.0.0",
    "authToken": "<64-char hex from --gen-token>",
    "probeTimeoutMs": 10000,
    "peers": [
      {
        "name": "macbook",
        "url": "http://10.0.1.42:4319",
        "token": "<the macbook's authToken>",
        "role": "frontier",
        "frontierModel": "claude-opus-4-7"
      }
    ]
  }
}

bridge.probeTimeoutMs keeps bridge --status and bridge --doctor snappy. If a real delegation or escalation regularly needs longer than 60s, raise bridge.rpcTimeoutMs globally or set bridge.peers[*].rpcTimeoutMs only on the slower peer.
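
A sketch of that override in ai-cost.config.json (abbreviated; the peer keeps its token/role fields from the block above):

{
  "bridge": {
    "rpcTimeoutMs": 60000,
    "peers": [
      {
        "name": "mini",
        "url": "http://10.0.1.50:4319",
        "role": "local",
        "rpcTimeoutMs": 180000
      }
    ]
  }
}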

Flows:

  • Escalation (local → frontier): Mac Mini's local agent calls ai_cost_escalate_to_frontier with messages, reason, optional model. The bridge POSTs to the MacBook peer's ai_cost_anthropic_chat, records the call locally with route="escalation_request" and meta.bridge_peer="macbook", and returns the response.
  • Delegation (frontier → local): MacBook calls ai_cost_delegate_to_local with the same messages. The bridge runs it on the Mini's ai_cost_ollama_chat, computes wouldHaveCost for the frontier model, and stores meta.delegation_savings_usd so the dashboard can total cumulative savings.
  • Handoff (smart dispatch): ai_cost_handoff runs the router against the messages; local_sufficient → delegate, frontier_required → escalate. Force with force: "local" | "frontier".

The dashboard's Bridge tab shows configured peers, reachability, delegation count, savings to date, and recent escalations with peer + reason metadata. Endpoint: GET /api/bridge.
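
With the dashboard running on its default port, the same data is fetchable as JSON:

curl http://localhost:3674/api/bridge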

Wire formats (all over POST /mcp/v1/rpc with Authorization: Bearer <token>):

# Health (no auth)
curl http://10.0.1.42:4319/mcp/v1/health

# JSON-RPC tools/list
curl -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
     http://10.0.1.42:4319/mcp/v1/rpc

# Server→client notifications stream
curl -H "Authorization: Bearer $TOKEN" \
     http://10.0.1.42:4319/mcp/v1/sse
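
And a tools/call in the same JSON-RPC shape, here invoking the overview tool (assuming ai_cost_overview needs no arguments):

# JSON-RPC tools/call
curl -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"ai_cost_overview","arguments":{}}}' \
     http://10.0.1.42:4319/mcp/v1/rpc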

See docs/MAC_MINI_HANDOFF.md for the full two-machine setup walk-through.

Dashboard

npx ai-cost dashboard

Eight tabs:

  • Overview — total spend, avoidable spend, shadow-saved, efficiency score, top waste categories, top repeated prompts, budget banner.
  • Shadow Mode — A/B comparisons, saved totals, quality scores.
  • Router — recent routable calls and annualized savings.
  • Local LLMs — detected Ollama/LM Studio runtimes, local vs. cloud spend, configuration.
  • Bridge — peer reachability, delegation count, cumulative savings, recent escalations. Reflects this node's bridge config block.
  • Queue — durable 24h task queue: queued/running/done/failed counts and per-task inspection.
  • Calls — searchable/filterable list of every recorded call. Click for full detail.
  • Trends — daily spend and daily avoidable spend charts.

Auto-refreshes every 5 seconds. Local-only. No TLS. No login.

Elastic / Kibana integration

  • Generate a Kibana 8.x bundle: npx ai-cost kibana --output ai-cost.kibana.ndjson
  • Import path: Kibana → Stack Management → Saved Objects → Import → pick the NDJSON
  • Required index pattern: kostai-shadow-* (override with --index <pattern>)
  • Optional shipping hint: npx ai-cost kibana --filebeat filebeat.yml writes a Filebeat config that tails ~/.ai-cost/events.jsonl into that index

CLI commands

| Command | Description |
|---|---|
| npx -y @sapperjohn/kostai install [--accept <path-or-url>] | One-click bootstrap for config, starter patches, savings plan, and optional Command Node linkup |
| npx ai-cost init | Create config file |
| npx ai-cost dashboard | Start local dashboard on port 3674 |
| npx ai-cost scan [--repo <path>] | Detect local LLM runtimes + LLM usage in a repo |
| npx ai-cost mcp [--list] | Start MCP server over stdio |
| npx ai-cost bridge --listen [--port 4319] [--host 0.0.0.0] | Start the HTTP+SSE MCP bridge |
| npx ai-cost bridge --status | Probe configured peers — reachable, models, errors |
| npx ai-cost bridge --gen-token | Generate a 64-char hex shared secret |
| npx ai-cost proxy --mode <observe\|route\|shadow> | Drop-in OpenAI-compat proxy |
| npx ai-cost compare --limit <n> | Summarize shadow-mode comparisons |
| npx ai-cost report --last 7d | Print markdown report |
| npx ai-cost export --format <json\|csv> | Export events |
| npx ai-cost kibana [--output <path>] [--filebeat <path>] [--index <pattern>] | Emit Kibana-ready NDJSON dashboard bundle |
| npx ai-cost doctor | Check configuration |
| npx ai-cost reset [--comparisons-only] | Clear all stored data |
| npx ai-cost features [--json\|--markdown] [--write-readme] | List every cost-reduction technique currently implemented |

Current capability set

The full list of cost-reduction techniques KostAI implements right now is generated from src/capabilities/registry.ts — the same data the CLI prints when you run npx kostai features. Refresh this block with npx kostai features --markdown --write-readme.

KostAI currently implements 41 cost-reduction techniques across 9 categories.

Model routing (4)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Rule-based router (rule-based-router): Classifies each call by complexity and downgrades expensive models when a cheaper tier can handle the task. | GA | router.enabled in ai-cost.config.json | Routes simple/deterministic work away from frontier pricing. | $1.01 across 8 pairs |
| Trained-classifier router (v2) (router-classifier-v2): ML short-circuit in front of the rule router; decides routing from prompt features when confidence is high. | Beta | router.useClassifierV2 = true | +6.5pt accuracy vs v1 on the frozen bench; reduces misroutes. | $1.01 across 8 pairs |
| Expensive-model gate (expensive-model-gate): Blocks calls from silently reaching a costly model (configurable $/M-token threshold) unless elevation is justified. | GA | router.expensiveModelThresholdUsdPerMToken | Keeps a forgotten model: string from burning $75/M output. | — |
| Elevation check (elevation-check): When a higher tier IS required, emits an auditable justification rather than a silent upgrade. | GA | automatic | Makes tier escalations visible and reviewable. | — |

Context compression (4)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Deterministic prose compressor (prose-compress): Pure-TS rule-based compressor for long system prompts and memory files — byte-exact on code/URLs/headings, idempotent. | GA | compressProse(text) from ./core/prose-compress | ~46% input-token reduction on markdown memory files (adapted from caveman). | $1.07 across 6 pairs |
| Tool-result compression (tool-result-compress): Summarizes large tool outputs (shell output, file dumps, API bodies) with a local model before they hit the frontier. | GA | compressToolResults({ messages }) from ./core/tool-compress | Cuts the dominant input-token source in agent loops. | $1.07 across 6 pairs |
| Local-model pre-processor (local-preprocess): Runs a local model first to summarize history and draft a local attempt; the frontier sees a distilled prompt. | GA | preprocess({ messages }) or preprocessThenEscalate(...) | Shrinks input tokens to the expensive model; frontier validates rather than generates. | $1.07 across 6 pairs |
| Draft-Verify-Patch (DVP) (draft-verify-patch): Local model drafts the answer; frontier either APPROVES, PATCHES, or REWRITES — output tokens collapse on approve. | GA | draftVerifyPatch({ messages }) from ./core/draft-verify | Targets output-side cost where frontier pricing is 5x input. | — |

Caching (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Anthropic prompt caching (anthropic-prompt-cache): Shapes system blocks with cache_control so repeat calls replay cached tokens at ~90% discount. | GA | cachedSystem(SYSTEM_PROMPT) from ./providers | ~90% discount on cached input tokens (Anthropic ephemeral cache). | $1.08 across 5 pairs |
| Semantic cache (semantic-cache): Near-duplicate prompt detection via local embeddings; replays a cached answer for prompts above a cosine threshold. | GA | cacheOrCall({ key, compute }) from ./core/semantic-cache | Published benchmarks report ~73% cost reduction at 0.95 threshold on agent workloads. | $1.08 across 5 pairs |
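
As an illustration of the caching invocation named above, a minimal sketch. Two assumptions: cachedSystem is importable from the package root (the table shows the internal ./providers path), and its return value slots into the Anthropic system field.

import Anthropic from "@anthropic-ai/sdk";
import { cachedSystem, wrapAnthropic } from "@sapperjohn/kostai";

const client = wrapAnthropic(new Anthropic());
const SYSTEM_PROMPT = "You are a support-ticket classifier. Reply with one word.";

// cachedSystem shapes the system blocks with cache_control, so repeat
// calls replay the cached prefix at the ~90% discount noted above.
await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 16,
  system: cachedSystem(SYSTEM_PROMPT),
  messages: [{ role: "user", content: "Ticket: login fails after password reset" }],
});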

Waste detection (21)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Duplicate block detector (waste-duplicate-block): Flags identical large content blocks repeated within a single call. | GA | automatic | Straight-line token savings when the duplicate is removed. | — |
| Replayed-history detector (waste-replayed-history): Flags long conversation history being replayed for a narrow follow-up ask. | GA | automatic | Signals candidate for local preprocessing / summarization. | — |
| Repeated-artifact detector (waste-repeated-artifact): Flags the same large block (logs, files, RAG chunks) sent across recent calls. | GA | automatic | Points at cacheable or deduplicatable content across a session. | — |
| Low-relevance large-block detector (waste-low-relevance): Lexical scoring of large blocks that have weak lexical overlap with the current ask. | GA | automatic | Stitches into the context-pruning recommendation. | — |
| Semantic low-relevance detector (waste-semantic-low-relevance): Embedding-based refinement on top of the lexical relevance detector; off by default for stability. | Beta | score.semanticRelevance = true | Catches semantically irrelevant blocks the lexical pass misses. | — |
| Oversized-logs detector (waste-oversized-logs): Flags raw log dumps that could be summarized before being sent. | GA | automatic | Summarization typically retains the signal at a fraction of tokens. | — |
| Oversized code-context detector (waste-oversized-code-context): Flags overly wide code imports (too many files, whole-repo dumps) for the current ask. | GA | automatic | Narrows the search surface the model has to traverse. | — |
| Cacheable-system-prompt detector (waste-cacheable-system-prompt): Flags stable system prompts resent unchanged across calls — direct candidate for Anthropic prompt caching. | GA | automatic | Recommends the cachedSystem() wrapper; unlocks ~90% discount. | — |
| System-prompt reuse detector (waste-system-prompt-reuse): Flags near-duplicate system prompts — would be cacheable with a small canonicalization step. | GA | automatic | Catches the trailing-whitespace / timestamp cache-miss class. | — |
| Stale-tool-definitions detector (waste-stale-tool-definitions): Flags tool JSON schemas that haven't changed across calls but are resent every turn. | GA | automatic | Typical recovery: move tool definitions into the cache breakpoint. | — |
| Oversized tool-result detector (waste-oversized-tool-result): Flags tool outputs that would compress well before being echoed back into the prompt. | GA | automatic | Pairs with the tool-result compression wrapper. | — |
| Oversized JSON tool-output detector (waste-oversized-json-tool-output): Tight (90%) savings estimate on structured JSON tool outputs — highest-confidence compression target. | GA | automatic | JSON is compressible with near-zero semantic loss. | — |
| Verbose prose-input detector (waste-verbose-prose-input): Flags input prose with filler/hedging/pleasantries that the prose compressor would shrink. | GA | automatic | Routes the call to the deterministic prose compressor. | — |
| Verbose output-preamble detector (waste-verbose-output-preamble): Flags boilerplate preambles in responses that the model can be instructed to drop. | GA | automatic | Output-side waste; reduces $/out-token directly. | — |
| Language-verbose output detector (waste-language-verbose-output): Language-specific verbosity patterns (e.g., over-explanatory code comments) that could be trimmed. | GA | automatic | Output-side waste specific to code-generation workloads. | — |
| Repeated image-attachments detector (waste-repeated-image-attachments): Flags image inputs resent across calls — candidate for a cached reference instead of re-upload. | GA | automatic | Image tokens dominate multimodal spend; cache or reference instead. | — |
| Model-overkill detector (waste-model-overkill): Flags frontier-model calls the router would have routed to a cheaper tier. | GA | automatic | Directly monetizable — the saved delta is routable today. | — |
| Model-downshift-opportunity detector (waste-model-downshift-opportunity): Signals calls whose output suggests the task could have run on a cheaper/smaller model. | GA | automatic | Post-hoc evidence that tightens router rules over time. | — |
| DVP-candidate detector (waste-dvp-candidate): Flags calls whose shape (long output, moderate input) makes them Draft-Verify-Patch candidates. | GA | automatic | Feeds the draft-verify wrapper recommendation. | — |
| Unbounded stream-continuation detector (waste-unbounded-stream): Flags streams that ran to max_tokens without a natural stop — likely over-generation. | GA | automatic | Points at max_tokens / stop-sequence tuning. | — |
| Metadata-only oversized detector (waste-metadata-inferred-oversized): Fallback inference from token counts alone when capture mode stripped bodies for privacy. | GA | automatic | Keeps the scorer useful in the strictest capture mode. | — |

Shadow mode (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Shadow-mode A/B (shadow-mode): Runs a baseline + optimized call in parallel, returns baseline to the caller, logs the delta for before/after proof. | GA | shadowMode.enabled = true OR proxy --mode shadow | Generates the comparison ledger that powers kostai proof. | — |
| Quality evaluator (quality-evaluator): Grades optimized-vs-baseline outputs (heuristic + optional LLM judge) so savings claims carry a quality signal. | GA | shadowMode.runQualityEval = true | Prevents 'cheaper but worse' regressions from slipping through. | — |

Local inference (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Local-LLM routing (local-routing): Routes eligible calls to Ollama / LM Studio / OpenAI-compat local endpoints instead of a paid API. | GA | providers.ollama.enabled + router rules, or proxy --mode route | Local inference is $0/token; electricity-only cost ledger available. | $0.0256 across 1 pair |
| Local-runtime scan (local-runtime-scan): Detects running Ollama / LM Studio / OpenAI-compat servers on this machine and enumerates installed models. | GA | npx kostai scan | Surfaces free local compute that would otherwise be ignored. | — |

Batching & deliberation (1)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| LLM Council (cost-aggressive) (llm-council): 3-stage drafter/reviewer/chairman pattern with semantic-cache, consensus short-circuit, and free-tier drafters. | Beta | runCouncil({ ... }) from ./core/council | Preserves Karpathy council quality; six stacked cost wedges collapse the spend. | — |

Budget & governance (2)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Budget gate (budget-gate): Per-wave and per-task spend cap that halts dispatch before runaway cost. | GA | boil budget.enabled + max USD in ai-cost.config.json | Hard dollar ceiling on orchestrated agent runs. | — |
| Retention-aware ledger (retention-ledger): Every optimized call is tagged with the mechanism that saved the money, so proof output attributes savings per lever. | GA | automatic | Powers the mechanism-breakdown table in kostai proof. | — |

Observability (3)

| Technique | Status | How invoked | Mechanism | Measured impact |
|---|---|---|---|---|
| Executive proof-of-savings (proof-one-pager): One-page markdown/HTML/JSON proof — saved $, pass-through subscription value, mechanism breakdown, quality signal. | GA | npx kostai proof [--html path] [--json path] [--rate 0.10] [--last 30d\|90d\|all] | Turns the ledger into a defensible CIO-grade artifact. | — |
| Local dashboard (dashboard): Throughput-first web dashboard reading the same local JSONL store — trends, waste categories, per-call inspection. | GA | npx kostai dashboard | Keeps the user in the loop without leaving the machine. | — |
| Repo-scan optimization plan (optimize-plan): Scans the current repo for LLM call sites and emits a prioritized .kostai/optimizations.md the agent can apply item-by-item. | GA | npx kostai optimize | Magic-sentence entrypoint — an agent reads the plan and implements it. | — |

What it detects

ai-cost runs 9 waste heuristics on every call (was 8; added local_routable for explicit local-downgrade flagging):

| Category | Confidence | Catches |
|---|---|---|
| duplicate_block | High | Same content repeated within a call |
| replayed_history | Medium | Long conversation replay for narrow asks |
| repeated_artifact | High | Same large block sent across recent calls |
| low_relevance_large_block | Low | Large blocks with weak link to the ask |
| oversized_logs | Medium | Raw logs that could be summarized |
| oversized_code_context | Low | Too many code files for the scope |
| cacheable_system_prompt | High | Stable system prompt resent unchanged |
| model_overkill | Low | Frontier model on a task the router flagged simple |
| local_routable | Medium | Call that would execute correctly on local inference |

Every finding carries estimatedTokens, estimatedCostUsd, and a confidence level. The dashboard's "Top Waste Categories" is a prioritized remediation list.
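
A sketch of a single finding as it might appear in the JSONL store; only category, confidence, estimatedTokens, and estimatedCostUsd are named by the docs, and the values here are illustrative:

{
  "category": "cacheable_system_prompt",
  "confidence": "high",
  "estimatedTokens": 1850,
  "estimatedCostUsd": 0.0277
}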

Configuration

{
  "appName": "my-app",
  "storeDir": ".ai-cost-data",
  "port": 3674,
  "captureMode": "metadata_only",
  "redactSecrets": true,
  "redactPatterns": [],
  "providers": {
    "anthropic":    { "enabled": true },
    "openai":       { "enabled": true },
    "google":       { "enabled": true },
    "ollama":       { "enabled": true, "baseUrl": "http://localhost:11434", "defaultModel": "llama3.2", "powerWatts": 60, "electricityCostPerKwh": 0.15 },
    "lmstudio":     { "enabled": true, "baseUrl": "http://localhost:1234/v1", "defaultModel": "lmstudio-community/llama-3.2-3b-instruct" },
    "openaiCompat": { "enabled": false, "baseUrl": "https://api.moonshot.cn/v1", "defaultModel": "kimi-2.5" }
  },
  "thresholds": {
    "largeBlockTokens": 500,
    "logBlockTokens": 300,
    "repeatedHistoryTurns": 6,
    "efficiencyWarnPct": 70
  },
  "shadowMode": {
    "enabled": true,
    "recordSamplePct": 100,
    "optimizedProvider": "ollama",
    "optimizedModel": "llama3.2",
    "runQualityEval": true
  },
  "router": {
    "enabled": false,
    "simpleTaskMaxTokens": 800,
    "maxLocalLatencyMs": 8000,
    "localProvider": "ollama",
    "localModel": "llama3.2",
    "frontierProvider": "anthropic",
    "frontierModel": "claude-opus-4-7",
    "cheapApiProvider": "anthropic",
    "cheapApiModel": "claude-haiku-4-5"
  },
  "evaluator": {
    "enabled": false,
    "provider": "ollama",
    "model": "kimi-2.5"
  },
  "budget": {
    "monthlyUsd": 500,
    "warnAtPct": 80
  }
}

Privacy

  • Default capture mode is metadata_only — no content is stored, only hashes, token counts, cost, and scores.
  • redacted_body stores truncated previews with PII patterns scrubbed.
  • full_body is opt-in for local debugging only.
  • No network egress. Everything runs on the local machine.
  • No telemetry. No usage reporting to any external service.


Known limitations

  1. Waste estimates are heuristic — likely waste, not certainty.
  2. Context relevance is estimated via lexical overlap, not semantic.
  3. Router rules are regex-based by design (auditable). Replaceable by a trained classifier once a comparison corpus exists.
  4. Only Node.js / TypeScript SDKs are supported today. Other languages adopt via the HTTP proxy.
  5. Token counts may be estimated when providers don't return usage.
  6. Store is append-only JSONL; for > 10k events/day a SQLite backend behind the same EventStore interface is the next step.

License

MIT