@dstyll/cc-clip

v0.1.0

Published

10 days ago

Route coding-agent prompts to the right Claude model tier (haiku/sonnet/opus) based on classified complexity.

dstyll

claude claude-code llm router model-routing token-optimization cost

dstyll CC Clip

Route coding-agent prompts to the right Claude model tier — haiku / sonnet / opus — based on the prompt's classified complexity. Easy work runs cheap, hard work escalates up, and you don't change how you prompt.

It works as a Claude Code UserPromptSubmit hook: every prompt is classified locally (no model tokens, tens of milliseconds), and a directive is injected telling the agent to delegate the task to a model-pinned subagent and relay the result. Classification is pure TypeScript — no Python, no torch, no native modules at runtime.
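As a rough illustration of the hook flow described above, the sketch below turns a hook payload into a routing directive. It assumes the payload arrives as JSON with a `prompt` field and that the text the hook prints is added to the agent's context. The names `HookPayload`, `buildDirective` and `handlePayload`, and the directive wording, are hypothetical and not taken from cc-clip's source.

```typescript
type Tier = "easy" | "medium" | "hard";

// Assumed shape of the UserPromptSubmit payload; only `prompt` is used here.
interface HookPayload {
  prompt: string;
  transcript_path?: string;
}

// Hypothetical directive text: tells the orchestrator to delegate and relay.
function buildDirective(tier: Tier): string {
  return (
    `[cc-clip] This prompt was classified as "${tier}". ` +
    `Delegate the task to the "${tier}" subagent and relay its result.`
  );
}

// Pure function: raw JSON payload in, directive text out.
// The classifier is passed in so the sketch stays independent of any model.
function handlePayload(raw: string, classify: (prompt: string) => Tier): string {
  const payload = JSON.parse(raw) as HookPayload;
  return buildDirective(classify(payload.prompt));
}
```

A real entry point would read the payload from stdin and print the returned directive; keeping the core a pure function makes it easy to test without a running agent.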

Install

npm i -g @dstyll/cc-clip

(You already have Node — Claude Code runs on it.)

Set up a repo

cd your-project
cc-clip init

This is idempotent and:

registers the UserPromptSubmit hook in .claude/settings.json
writes easy / medium / hard subagents to .claude/agents/ (pinned to haiku / sonnet / opus)
adds a managed routing rule to CLAUDE.md

Remove it any time with cc-clip init --remove.
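To show what "idempotent" means for the hook registration step, here is a minimal sketch of merging a hook entry into a settings object so that a second run changes nothing. The settings layout (`hooks.UserPromptSubmit[].hooks[]` with `type` and `command`) is an assumption about the .claude/settings.json format, and `registerHook` is a hypothetical helper, not cc-clip's actual code.

```typescript
interface HookCommand {
  type: "command";
  command: string;
}

interface HookMatcher {
  hooks: HookCommand[];
}

interface Settings {
  hooks?: Record<string, HookMatcher[]>;
  [key: string]: unknown;
}

// Returns a new settings object with the hook registered exactly once.
function registerHook(settings: Settings, command = "cc-clip hook"): Settings {
  const hooks: Record<string, HookMatcher[]> = { ...(settings.hooks ?? {}) };
  const entries: HookMatcher[] = hooks["UserPromptSubmit"] ?? [];

  // Skip the append if any existing entry already runs the same command.
  const present = entries.some((e) => e.hooks.some((h) => h.command === command));
  if (!present) {
    const entry: HookMatcher = { hooks: [{ type: "command", command }] };
    hooks["UserPromptSubmit"] = [...entries, entry];
  }
  return { ...settings, hooks };
}
```

Because the function checks for an existing entry before appending, running it twice yields the same settings as running it once, and unrelated keys in the settings object are preserved.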

Recommended: set a cheap orchestrator

In Claude Code, set your model with /model to Sonnet or Haiku.

cc-clip routes the heavy work to a subagent and has the orchestrator relay the result, so a cheap orchestrator is what turns routing into real savings — hard tasks still escalate up to Opus via the hard subagent. If you run Opus as the orchestrator, savings only appear when post-processing stays minimal (which the relay directive enforces).

How classification works

Three stages, all in-process:

Heuristics — length, code fences, file references, and verb signals (refactor/migrate/architect → hard; rename/typo/format → easy). Obvious prompts short-circuit here instantly.
Static embedding + logistic regression — a model2vec-style quantized word table (mean-pooled) feeds a pretrained softmax head producing per-tier probabilities. No neural inference, no network.
Confidence fallback — if the top probability is below the threshold, the prompt routes to Sonnet (the safe all-rounder).

On a fresh install, before the embedding artifacts are present, the classifier falls back to heuristics plus the safe tier, so it always works, just more conservatively.
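The three stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped classifier: the tokenizer, the `Head` layout (one weight row per tier) and the function names are hypothetical, the heuristics cover only the verb signals listed above, and a heuristic hit simply reports confidence 1. The 0.42 default mirrors the confidenceThreshold in the Config example.

```typescript
type Tier = "easy" | "medium" | "hard";
const TIERS: Tier[] = ["easy", "medium", "hard"];

// Stage 1: verb heuristics. Returns null when the prompt is not obvious.
function heuristicTier(prompt: string): Tier | null {
  const hard = /\b(refactor|migrate|architect)\b/i.test(prompt);
  const easy = /\b(rename|typo|format)\b/i.test(prompt);
  if (hard && !easy) return "hard";
  if (easy && !hard) return "easy";
  return null;
}

// Stage 2a: mean-pool the static word vectors of known tokens.
function meanPool(tokens: string[], table: Map<string, number[]>, dim: number): number[] {
  const sum: number[] = Array.from({ length: dim }, () => 0);
  let known = 0;
  for (const token of tokens) {
    const vec = table.get(token);
    if (!vec) continue;
    for (let i = 0; i < dim; i++) sum[i] += vec[i];
    known++;
  }
  return known === 0 ? sum : sum.map((x) => x / known);
}

// Numerically stable softmax over per-tier logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Hypothetical layout of the logistic-regression head: weights[tier][dim].
interface Head {
  weights: number[][];
  bias: number[];
}

function classify(
  prompt: string,
  table: Map<string, number[]>,
  head: Head,
  dim: number,
  threshold = 0.42,
): { tier: Tier; confidence: number } {
  const shortcut = heuristicTier(prompt);
  if (shortcut) return { tier: shortcut, confidence: 1 };

  // Stage 2b: linear head on the pooled embedding, then softmax.
  const tokens = prompt.toLowerCase().split(/\W+/).filter(Boolean);
  const x = meanPool(tokens, table, dim);
  const logits = head.weights.map((row, k) =>
    row.reduce((acc, w, i) => acc + w * x[i], head.bias[k]),
  );
  const probs = softmax(logits);
  const best = probs.indexOf(Math.max(...probs));
  const confidence = probs[best];

  // Stage 3: below the threshold, fall back to the safe medium tier.
  return confidence < threshold
    ? { tier: "medium", confidence }
    : { tier: TIERS[best], confidence };
}
```

With an empty word table and a zero head, every tier gets probability 1/3, which is below the threshold, so the prompt lands on the medium tier. That matches the conservative behaviour described for a fresh install.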

Context-aware follow-ups

Conversational follow-ups ("now do the same", "make that thread-safe") are ambiguous on their own. When contextAware is on (the default), the hook reads the session transcript and, only for messages a detector flags as ambiguous follow-ups, consults the last couple of turns to resolve the reference. The prior context is blended in under a capped weight that cannot override a confident read of the current message, so it lifts follow-up accuracy without making everything look hard. With no transcript (e.g. cc-clip classify) or with the feature off, behaviour is identical to single-message routing.

On follow-ups where prior session context is available, context-aware routing raises accuracy by about 30 points (0.62 → 0.92, held-out evaluation) with no regression on standalone prompts.
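One way to picture a capped blend is the sketch below. The weighting rule (weight = min(cap, 1 − top current probability)) and the 0.3 cap are assumptions chosen for illustration; the text above only states that the weight is capped and cannot override a confident read, and `blendWithContext` is a hypothetical name.

```typescript
// Blend per-tier probabilities from the current message with those from prior turns.
// The prior's weight shrinks as the current read gets more confident and never exceeds `cap`.
function blendWithContext(current: number[], prior: number[], cap = 0.3): number[] {
  const topCurrent = Math.max(...current);
  const weight = Math.min(cap, 1 - topCurrent);
  return current.map((p, i) => (1 - weight) * p + weight * prior[i]);
}

// Index of the largest probability (0 = easy, 1 = medium, 2 = hard).
function argmax(probs: number[]): number {
  return probs.indexOf(Math.max(...probs));
}
```

For example, an ambiguous current read of [0.34, 0.33, 0.33] blended with a prior of [0.05, 0.05, 0.9] moves to the hard tier, while a confident current read of [0.05, 0.05, 0.9] keeps its tier even when the prior strongly suggests easy.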

Commands

| Command | What it does |
|---------|--------------|
| cc-clip classify [prompt] | Classify a prompt (arg or stdin) → JSON {tier,confidence,scores,reasons} |
| cc-clip hook | The UserPromptSubmit entry point (reads the payload on stdin) |
| cc-clip init [--remove] | Scaffold / remove the hook, subagents, and CLAUDE.md rule |
| cc-clip stats | Tier distribution + estimated cumulative savings |
| cc-clip savings [--since 7d] | Projected-savings total + extrapolated rate (estimate) |
| cc-clip train --data labels.jsonl | Refine the LR head locally from labeled prompts |
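If you consume the output of cc-clip classify from a script, a small validating parser can help. The field names come from the table above, but the field types (a numeric confidence, a map of per-tier scores, a list of reason strings) are assumptions, and `parseClassifyResult` is a hypothetical helper.

```typescript
// Assumed shape of the JSON printed by `cc-clip classify`.
interface ClassifyResult {
  tier: "easy" | "medium" | "hard";
  confidence: number;
  scores: Record<string, number>;
  reasons: string[];
}

// Parse and validate the JSON; throws if the shape is not the assumed one.
function parseClassifyResult(json: string): ClassifyResult {
  const value: unknown = JSON.parse(json);
  if (typeof value !== "object" || value === null) {
    throw new Error("expected a JSON object");
  }
  const r = value as Record<string, unknown>;
  const tier = r["tier"];
  const scores = r["scores"];
  const reasons = r["reasons"];

  const tierOk = tier === "easy" || tier === "medium" || tier === "hard";
  const scoresOk = typeof scores === "object" && scores !== null;
  const reasonsOk = Array.isArray(reasons) && reasons.every((x) => typeof x === "string");
  if (!tierOk || typeof r["confidence"] !== "number" || !scoresOk || !reasonsOk) {
    throw new Error("unexpected classify output shape");
  }
  return value as ClassifyResult;
}
```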

Projected savings

Every routing decision is logged to a local JSONL file (a prompt hash, never the raw text). cc-clip stats and cc-clip savings report how much routing is estimated to save versus running every prompt on your baseline tier.

These are estimates, not billed amounts: at classify time the real response token count is unknown, so savings are computed as baseline_cost − chosen_cost using a configurable per-tier price table and estimated token counts. Override the prices as pricing changes (see Config).
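The baseline_cost − chosen_cost arithmetic can be worked through as below, using the default per-token prices and expected output token counts from the Config example. Charging the baseline and the chosen tier for the same token counts is an assumption of this sketch, and the function names are hypothetical.

```typescript
type Model = "haiku" | "sonnet" | "opus";

interface Price {
  input: number; // USD per input token
  output: number; // USD per output token
}

// Default prices copied from the Config example.
const PRICES: Record<Model, Price> = {
  haiku: { input: 0.0000008, output: 0.000004 },
  sonnet: { input: 0.000003, output: 0.000015 },
  opus: { input: 0.000015, output: 0.000075 },
};

function estimatedCost(model: Model, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  return inputTokens * price.input + outputTokens * price.output;
}

// Savings estimate: what the baseline would have cost minus what the chosen tier costs.
function estimatedSavings(
  baseline: Model,
  chosen: Model,
  inputTokens: number,
  expectedOutputTokens: number,
): number {
  return (
    estimatedCost(baseline, inputTokens, expectedOutputTokens) -
    estimatedCost(chosen, inputTokens, expectedOutputTokens)
  );
}
```

For an easy prompt with 1,000 input tokens and the default 400 expected output tokens, the Opus baseline comes to 0.015 + 0.030 = $0.045 and Haiku to 0.0008 + 0.0016 = $0.0024, so the estimated saving is about $0.0426.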

On the evaluation dataset, routing cuts spend ~20% vs an Opus-only baseline (risk-adjusted, held-out evaluation). With an Opus orchestrator or large per-prompt context, savings are smaller — set a cheap orchestrator (Sonnet or Haiku) to get the full benefit.

Config

Resolved from built-in defaults → ~/.config/cc-clip/config.toml → a project-level ./.cc-clip.toml. Example:

enabled = true
confidenceThreshold = 0.42
baselineModel = "opus"

# context-aware routing for ambiguous follow-ups
contextAware = true
priorTurns = 2
maxContextTokens = 400

[tierModel]
easy = "haiku"
medium = "sonnet"
hard = "opus"

[postProcessing]
easy = "relay"
medium = "relay"
hard = "light-verify"

[expectedOutputTokens]
easy = 400
medium = 1200
hard = 3000

# USD per token (override as list prices change)
[prices.haiku]
input = 0.0000008
output = 0.000004
[prices.sonnet]
input = 0.000003
output = 0.000015
[prices.opus]
input = 0.000015
output = 0.000075
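The layered resolution (built-in defaults, then the user file, then the project file) can be pictured as a merge in which later layers win. Whether cc-clip merges nested tables key by key or replaces them wholesale is not stated above; the sketch assumes a key-by-key deep merge, and `mergeLayers` is a hypothetical helper operating on already-parsed config objects.

```typescript
type Plain = { [key: string]: unknown };

function isPlain(value: unknown): value is Plain {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Merge config layers left to right; later layers override earlier ones.
// Nested tables (e.g. tierModel) are merged key by key rather than replaced.
function mergeLayers(...layers: Plain[]): Plain {
  const out: Plain = {};
  for (const layer of layers) {
    for (const [key, value] of Object.entries(layer)) {
      const previous = out[key];
      out[key] = isPlain(previous) && isPlain(value) ? mergeLayers(previous, value) : value;
    }
  }
  return out;
}
```

Under this assumption, a project file that only sets tierModel.hard leaves the easy and medium tiers at whatever the earlier layers defined.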

Accuracy

On a held-out split the classifier (embedding+LR) reaches ~74% accuracy on the combined dataset of standalone prompts and context-paired follow-ups. Errors skew to the safe direction by design — hard tasks are never silently handed to the cheapest model; uncertain prompts fall back to Sonnet. Accuracy improves as the dataset grows — add labeled rows and re-run the pipeline (scripts/); no code changes needed. See docs/performance.md for full benchmark results.

Development

npm install
npm run typecheck
npm test
npm run build

The bundled model artifacts under artifacts/ are produced by the offline pipeline in scripts/ (Python — never run by end users, only when retraining). See scripts/README.md.

License

MIT