@askalf/brio

v0.0.4

The capability layer for AI workloads — semantic cache, cost-aware tiering, structured cost reporting, policy enforcement. Sits in front of any Anthropic-compat endpoint (dario, api.anthropic.com, OpenRouter, vLLM, Ollama).

Why this exists

You're paying Anthropic per token (or via subscription routed through dario) and watching half the spend go to questions you've already answered. A coding agent reads the same package.json thirty times in a session. A research agent re-fetches the same source on the second turn. A team of five engineers asks the same dependency-version question across the week. Every one of those requests costs the full prompt tokens, even when the answer was identical the last time.

That's the wedge. brio caches the prompt-response pair under a semantic key and serves the cached answer when the same shape recurs, saving tokens, latency, and rate-limit headroom. Cache miss → request flows through to your backend untouched. Cache hit → response comes back in single-digit milliseconds with the cached answer marked so the calling agent knows it's a replay.
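
A quick way to watch this from the outside: send the identical request twice and check the response headers. A minimal sketch, assuming brio on its default port and a standard Anthropic Messages API body; the x-brio-cache header and the "any value through dario" API key are both described elsewhere in this readme, but the model string and prompt here are just placeholders.

REQ='{"model":"claude-opus-4-7","max_tokens":64,"messages":[{"role":"user","content":"What does the semver caret (^) mean?"}]}'

# First call: a miss, forwarded upstream.
curl -s -o /dev/null http://localhost:8765/v1/messages \
  -H "content-type: application/json" -H "anthropic-version: 2023-06-01" \
  -H "x-api-key: brio" -d "$REQ"

# Second call: same shape, so it should come back from the cache.
curl -s -D - -o /dev/null http://localhost:8765/v1/messages \
  -H "content-type: application/json" -H "anthropic-version: 2023-06-01" \
  -H "x-api-key: brio" -d "$REQ" | grep -i x-brio-cache
# → x-brio-cache: hit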

That's v0.1. Around it: cost-aware model tiering (route easy prompts to Haiku, hard ones to Opus, by length + complexity heuristics), structured cost reports (per-conversation, per-user, per-day), and a policy layer (model allowlists, cost caps, PII redaction). Eventually team mode (multi-user auth + per-user quotas + audit log) for org adoption.

Where it sits

   client (Cursor, Aider, Continue,            client (Claude Code,
   custom code, etc.)                          OpenClaw, Hermes, etc.)
        │                                            │
        └─────► http://localhost:8765 ◄──────────────┘
                        │
                       brio    ─── cache (semantic key)
                        │       ─── cost report
                        │       ─── tier (haiku / sonnet / opus)
                        │       ─── policy (allowlists, caps, DLP)
                        │
                        ▼
        ┌─────────────────────────────────────┐
        │  ANY Anthropic-compatible endpoint  │
        │                                     │
        │  - http://localhost:3456 (dario)    │ ← Claude Max via OAuth
        │  - https://api.anthropic.com        │ ← per-token API
        │  - https://openrouter.ai/v1         │ ← OpenRouter
        │  - http://localhost:11434           │ ← Ollama, etc.
        └─────────────────────────────────────┘

brio doesn't replace dario. dario solves "speak Anthropic's wire shape exactly so my Claude Max subscription works outside Claude Code." brio solves "make every backend smarter about cost, latency, and policy." Composing them: clients hit brio, brio caches what it can, the rest flows to dario, dario routes to your subscription. Either layer can run alone; neither requires the other.

60 seconds

# 1. Install.
npm install -g @askalf/brio

# 2. Point brio at whichever backend you want to wrap. Default is dario at :3456.
brio start                                   # wraps dario on localhost:3456
brio start --upstream=https://api.anthropic.com --api-key=$ANTHROPIC_API_KEY
brio start --upstream=https://openrouter.ai/v1 --api-key=$OPENROUTER_API_KEY

# 3. Point your client at brio instead of the backend directly.
export ANTHROPIC_BASE_URL=http://localhost:8765
export ANTHROPIC_API_KEY=brio                # any value when running through dario

# 4. Use whatever client you already use. Everything routes through brio.
claude                                       # Claude Code
cursor                                       # Cursor
aider --model=claude-opus-4-7                # Aider
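
# 5. Optional: sanity-check the wiring (brio doctor is described below).
brio doctor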

Run brio cost after a session. You'll see the cache hit rate, the dollar value of replay traffic, and the per-conversation breakdown.
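
If you'd rather slice the numbers yourself, brio cost --json emits the raw records. A hedged sketch, assuming the JSON output is an array of the per-request records described in the next section (the cacheHit field is documented there; the jq expression is mine):

# Overall cache hit rate, assuming an array of {..., cacheHit, ...} records.
brio cost --json | jq 'map(.cacheHit) | (map(select(.)) | length) / length'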

What v0.1 ships

  • Semantic response cache — every successful request keyed by a hash over {model, system_prompt, messages, tools} (sketched after this list). TTL configurable (default 1 hour). Hits return in single-digit ms with an x-brio-cache: hit header. Disk-backed at ~/.brio/cache/<sha>.json. Verify with brio cache stats.
  • Cache-aware streaming — cache hits replay the original SSE event stream so streaming clients see the same chunks they would have without the cache. No bypass needed for streaming requests.
  • Structured cost reporting — every request records {timestamp, model, inputTokens, outputTokens, cacheHit, latencyMs, conversationId}. brio cost summarizes per-day, per-conversation, per-model. brio cost --json for piping into your own dashboards.
  • Pass-through everything else — non-cacheable requests, tool-call patterns brio doesn't understand yet, anything that touches /v1/files or other side-channels — all forwarded byte-for-byte to the upstream. brio is additive; it shouldn't change what works.
  • brio doctor — health check across upstream reachability, cache directory writability, and a smoke probe to verify the upstream is what you said it was.
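
The keying scheme in the first bullet can be illustrated with plain shell tools: canonicalize the keyed fields, then hash. This is a conceptual sketch of the idea, not brio's actual code; the exact canonicalization and field names (system on the wire vs. system_prompt above) are assumptions.

# Conceptual sketch only. jq -S sorts object keys so that equivalent
# requests serialize to identical bytes before hashing.
jq -cS '{model, system, messages, tools}' request.json | sha256sum
# → conceptually, the <sha> behind ~/.brio/cache/<sha>.json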

What v0.2 will probably add

  • Cost-aware tiering — --tier=auto routes prompts under N tokens to Haiku, complex ones to Opus, with explainable decisions surfaced via brio explain <request-id>.
  • Per-user accounting (single-machine) — a user header / API key from the client lets brio attribute spend per developer when one machine serves a team.
  • Policy file — declarative model allowlists, cost caps, PII regex strip-list, blocked tool patterns. brio policy validate lints the file.

What v1 will look like

  • Team mode — multi-user auth, per-user quotas, signed audit log. Turns brio from a personal middleware into a small ops-deployable service.
  • Federation — multiple brio instances coordinate cache + cost across machines.
  • Hot-reload config — change tier/policy/upstream without restart.

What v0.1 is NOT

  • Not a model proxy that forwards to multiple backends in parallel. brio talks to one upstream at a time. (dario's pool mode handles multi-backend routing at the subscription layer.)
  • Not a vector store / RAG layer. The cache is keyed on the literal request shape, not on semantic similarity of message content. RAG and brio are orthogonal: brio caches the request whether or not it includes RAG-fetched context.
  • Not a guardrails framework with model-level safety reasoning. Policy is rule-based: regex / allowlist / cap. If you need an LLM-as-judge guardrails layer, run that as a separate service in front of brio.
  • Not branded as the askalf commercial product. brio is open-source infrastructure. askalf is something else, and that something else may eventually run brio internally as a component.

Flags you'll reach for

| Flag | Default | Why |
|---|---|---|
| --upstream <url> | http://localhost:3456 | Where requests go on cache miss. Anthropic-compat endpoint. |
| --port <n> | 8765 | brio's listen port. |
| --api-key <k> | — | API key brio sends to upstream when upstream isn't dario. |
| --cache-ttl <ms> | 3600000 (1h) | TTL on cache entries. 0 disables caching. |
| --cache-dir <path> | ~/.brio/cache | Where cache files live. |
| --no-cache | off | Bypass cache for this run. |
| --no-cost | off | Suppress per-request cost line on stderr. |
| --verbose, -v | — | Stream cache hits / misses / forward decisions to stderr. |
| --upstream-format <anthropic\|openai> | auto | Wire format the upstream expects. Auto-detected from URL. |
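
Combining a few of these (flags from the table above; the values are illustrative):

brio start \
  --upstream=https://api.anthropic.com \
  --api-key=$ANTHROPIC_API_KEY \
  --port=9000 \
  --cache-ttl=600000 \
  --verbose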

Every flag mirrors a BRIO_* env var. CLI wins over env.
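
The exact variable names aren't listed here, so this sketch assumes the conventional flag-to-env mapping (upper-snake-case with the BRIO_ prefix):

# Assumed names, derived from the BRIO_* pattern above.
export BRIO_UPSTREAM=https://api.anthropic.com
export BRIO_CACHE_TTL=600000
export BRIO_PORT=9000
brio start --port=8765     # CLI wins over env: listens on 8765, not 9000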

Trust and transparency

| Signal | Status |
|---|---|
| Runtime dependencies | Two — one HTTP framework, one schema validator. Pinned, audited. No hosted services, no telemetry. |
| Credentials | API keys live in env vars or CLI flags; brio never persists them. Cache files store request + response payloads only. |
| Network scope | Whatever upstream you point at, plus the cache TTL clock (no external time service). No other outbound traffic. Verify with lsof -i during a run. |
| Telemetry | None. Zero analytics, tracking, or data collection. Deliberately, not aspirationally. |
| License | MIT |
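
To check the network-scope claim yourself (standard lsof flags; brio is a Node CLI, so the process typically shows up as node):

# -i internet sockets, -P numeric ports, -n numeric hosts.
lsof -i -P -n | grep node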

See DISCLAIMER.md for the full AS IS / no-affiliation / user-responsibility terms.

Relationship to other askalf projects

  • dario — wire-fidelity LLM router. brio's default upstream. Stable maintenance mode (drift watch only); brio is where active feature work lives.
  • hands — computer-use agent. Routes through brio (or dario, or anything Anthropic-compat) like any other client.
  • arnie — IT troubleshooting companion. Same — client of brio.
  • deepdive — local research agent. Same — client of brio.

askalf (the org) is the umbrella. The future commercial chat/agent product, also called askalf, is something else entirely; brio is not it.

Contributing

PRs welcome. Code style matches dario — small TypeScript, pure decision functions, node --test assertions on anything with logic in it. Run npm run build && npm test before submitting.

License

MIT — see LICENSE and DISCLAIMER.md.