llm-hard-cap
v0.2.1
Hard spend limits for OpenAI, Anthropic Claude, Google Gemini, and any LLM API. Track token costs in real time, enforce daily / monthly / per-user USD caps, and stop runaway AI bills before they happen.
npm install llm-hard-cap

llm-hard-cap is a zero-dependency TypeScript library that puts a hard ceiling on what your application can spend on LLM APIs. It supports OpenAI (GPT-4o, GPT-4-turbo, o1, o3-mini), Anthropic Claude (Opus 4.7, Sonnet 4.6, Haiku 4.5), Google Gemini (1.5/2.0/2.5), Mistral, DeepSeek, and any custom model you add.
If you've ever woken up to a $30,000 OpenAI bill from a runaway loop, or shipped an AI feature without per-user limits and learned the hard way that one user can drain your monthly quota in an hour — this is for you.
Why this exists
LLM provider dashboards show you what you spent yesterday. Rate limits stop you at 10,000 RPM, not at $500. So when a bug, retry loop, or one heavy user starts burning tokens, you find out from the bill.
llm-hard-cap enforces spend at the call site, before the request goes out:
- Hard caps in USD, not RPM. daily: 10 means "$10/day, full stop."
- Per-user / per-route scoping. Free users get $0.10/day; paid users get $5; an experimental route gets $1.
- Pre-flight estimate + post-flight reconciliation. Block expensive calls before they hit the API, then record the actual cost from the response.
- Provider-agnostic. Built-in pricing for 25+ models; bring your own for fine-tunes, Bedrock, Ollama, etc.
- Zero dependencies. ~3 KB gzipped. TypeScript-first. Works in Node 18+, Bun, Deno.
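The pre-flight estimate in the list above comes down to simple arithmetic on token counts and per-million-token prices. The following is a minimal sketch of that calculation, assuming the same `{ input, output }` USD-per-1M-tokens shape that the `pricing` option uses. The function names and example prices are hypothetical and are not taken from the library's source.

```typescript
// Price of a model in USD per 1M tokens (illustrative shape).
interface ModelPricing {
  input: number;
  output: number;
}

// Convert token counts into a USD cost.
function costUsd(pricing: ModelPricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * pricing.input + outputTokens * pricing.output) / 1_000_000;
}

// Pre-flight: would this call push the window's spend past its cap?
function wouldExceed(spentUsd: number, projectedUsd: number, limitUsd: number): boolean {
  return spentUsd + projectedUsd > limitUsd;
}

// Hypothetical prices, for illustration only.
const examplePricing: ModelPricing = { input: 1, output: 2 };
const projected = costUsd(examplePricing, 500_000, 250_000); // 0.5 + 0.5 = 1 USD
const blocked = wouldExceed(9.5, projected, 10); // true: 10.5 > 10
```

After the call returns, the same `costUsd` arithmetic can be rerun with the real token counts from the response, which is the reconciliation step.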
Quick start
import OpenAI from "openai";
import { BudgetGuard } from "llm-hard-cap";

const openai = new OpenAI();
const guard = new BudgetGuard({
  limits: { daily: 10, monthly: 200 },
});

const response = await guard.wrap(
  { model: "gpt-4o-mini", estimatedInputTokens: 500, estimatedOutputTokens: 300 },
  () =>
    openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Hello!" }],
    }),
);

If today's spend would push past $10, guard.wrap throws BudgetExceededError before the OpenAI call is made. Actual token usage is recorded automatically after the call returns.
Per-user spend limits in 5 lines
Free vs. paid tiers without writing a quota system:
const guard = new BudgetGuard({ limits: { daily: 100 } }); // global ceiling
const userGuard = guard.for(`user:${userId}`, {
  daily: plan === "pro" ? 5 : 0.1,
});
await userGuard.wrap({ model: "gpt-4o", estimatedInputTokens: 1000 }, callOpenAi);

Each scope is tracked independently. Reset a scope when a user upgrades, or query their current usage:

await guard.usage("user:alice"); // { day: 0.0427, month: 1.13, total: 9.42, requests: 87 }

Three modes of use
1. wrap — protect a single call (recommended)
const result = await guard.wrap(
  { model: "claude-sonnet-4-6", estimatedInputTokens: 800, estimatedOutputTokens: 400 },
  () => anthropic.messages.create({ /* ... */ }),
);

wrap runs a pre-flight estimate, executes the call, and then records actual usage from the response. It auto-detects OpenAI-style (usage.prompt_tokens / usage.completion_tokens) and Anthropic-style (usage.input_tokens / usage.output_tokens) responses.
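To make the auto-detection concrete, here is a rough sketch of how the two usage shapes could be told apart. It is an illustration only: `detectUsage` and `TokenUsage` are hypothetical names, and the library's actual detection logic may differ.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Read token usage from an OpenAI-style or Anthropic-style response body.
// Returns undefined when neither shape is present.
function detectUsage(response: unknown): TokenUsage | undefined {
  if (typeof response !== "object" || response === null) return undefined;
  const usage = (response as { usage?: unknown }).usage;
  if (typeof usage !== "object" || usage === null) return undefined;
  const u = usage as Record<string, unknown>;

  // OpenAI-style: usage.prompt_tokens / usage.completion_tokens
  const prompt = u["prompt_tokens"];
  const completion = u["completion_tokens"];
  if (typeof prompt === "number" && typeof completion === "number") {
    return { inputTokens: prompt, outputTokens: completion };
  }

  // Anthropic-style: usage.input_tokens / usage.output_tokens
  const input = u["input_tokens"];
  const output = u["output_tokens"];
  if (typeof input === "number" && typeof output === "number") {
    return { inputTokens: input, outputTokens: output };
  }

  return undefined;
}
```

A response that matches neither shape yields `undefined`, which is the case where a custom extractor is needed.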
For other providers, pass an extract function:
await guard.wrap(
  { model: "gemini-1.5-pro", estimatedInputTokens: 1000 },
  () => gemini.generateContent({ /* ... */ }),
  (r) => ({
    inputTokens: r.usageMetadata.promptTokenCount,
    outputTokens: r.usageMetadata.candidatesTokenCount,
  }),
);

2. estimate — pre-flight check only
const { projectedUsd } = await guard.estimate({
  model: "gpt-4o",
  estimatedInputTokens: 50_000,
});
// Throws BudgetExceededError if not affordable, otherwise returns the cost.

3. check — record after the fact
When you already have real token counts (custom client, streaming, batch jobs):
await guard.check({
  model: "gpt-4o",
  inputTokens: response.usage.prompt_tokens,
  outputTokens: response.usage.completion_tokens,
});

Limits you can set
new BudgetGuard({
  limits: {
    perRequest: 0.25, // refuse any single call over $0.25
    daily: 10,        // $10 per UTC day
    monthly: 200,     // $200 per UTC month
    total: 1000,      // $1000 lifetime cap (per scope)
  },
  onExceeded: "throw", // or "warn" / "block"
});

Limits are checked in this order: perRequest, daily, monthly, total. The first violation throws BudgetExceededError, which exposes .window, .limitUsd, .currentUsd, .projectedUsd, and .scope.
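The ordering can be pictured as a short loop that returns the first cap that would be broken. This is a sketch, not the library's code: `firstViolation` and its types are hypothetical, and it assumes that `projectedUsd` means "spend already recorded in the window plus the cost of this call" and that `perRequest` compares the call's cost alone against the cap.

```typescript
type BudgetWindow = "perRequest" | "day" | "month" | "total";

interface Limits {
  perRequest?: number;
  daily?: number;
  monthly?: number;
  total?: number;
}

interface Spent {
  day: number;
  month: number;
  total: number;
}

interface Violation {
  window: BudgetWindow;
  limitUsd: number;
  currentUsd: number;
  projectedUsd: number;
}

// Walk the limits in the documented order and report the first one exceeded.
function firstViolation(limits: Limits, spent: Spent, callCostUsd: number): Violation | undefined {
  // Each entry: [window name, cap (if set), spend already recorded in that window].
  const checks: Array<[BudgetWindow, number | undefined, number]> = [
    ["perRequest", limits.perRequest, 0],
    ["day", limits.daily, spent.day],
    ["month", limits.monthly, spent.month],
    ["total", limits.total, spent.total],
  ];
  for (const [windowName, limitUsd, currentUsd] of checks) {
    const projectedUsd = currentUsd + callCostUsd;
    if (limitUsd !== undefined && projectedUsd > limitUsd) {
      return { window: windowName, limitUsd, currentUsd, projectedUsd };
    }
  }
  return undefined; // every configured limit still has room
}
```

Because `perRequest` is checked first, an oversized single call is reported as a `perRequest` violation even when it would also break the daily cap.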
What onExceeded does when a limit is hit
| Mode | check() | estimate() | wrap() |
|-----------|--------------------------------------------------------|-------------------------------|-----------------------------------|
| "throw" | throws BudgetExceededError | throws | throws (call never runs) |
| "block" | returns { recorded: false }, spend not recorded | returns { allowed: false } | throws (call never runs) |
| "warn" | logs, still records the spend (recorded: true) | logs, returns allowed: true | logs, runs the call, records it |
check() resolves to { costUsd, summary, recorded } and estimate() to { projectedUsd, summary, allowed }. In "block" mode, inspect allowed/recorded to decide what to do; wrap() can't return a value without making the call, so it throws instead.
Unknown models
By default an unpriced model name (e.g. a typo) throws UnknownModelError so a mistake can't silently disable the guard. Add the model via pricing, or pass onUnknownModel: "zero" to treat unknown models as free.
Handling rejected calls
import { BudgetExceededError } from "llm-hard-cap";

try {
  await guard.wrap({ model: "gpt-4o", estimatedInputTokens: 1000 }, call);
} catch (err) {
  if (err instanceof BudgetExceededError) {
    return res.status(429).json({
      error: "budget_exceeded",
      window: err.window, // "perRequest" | "day" | "month" | "total"
      limitUsd: err.limitUsd,
      currentUsd: err.currentUsd,
    });
  }
  throw err;
}

Persistence
The default storage is in-memory — fine for short-lived scripts and tests. For real apps:
import { BudgetGuard, FileStorage } from "llm-hard-cap";

const guard = new BudgetGuard({
  limits: { daily: 10 },
  storage: new FileStorage("./.llm-hard-cap.json"),
});

For distributed / multi-process setups, implement the Storage interface against Redis, Postgres, or your existing database:
import type { Storage, SpendEvent, UsageSummary } from "llm-hard-cap";

class RedisStorage implements Storage {
  async record(event: SpendEvent) { /* INCRBYFLOAT keys */ }
  async summary(scope: string): Promise<UsageSummary> {
    /* GET keys, then build the summary */
    throw new Error("not implemented"); // placeholder so the skeleton compiles
  }
  async reset(scope?: string) { /* DEL */ }
}

Supported models out of the box
Pricing is built-in for:
| Provider | Models |
|------------|--------|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4, gpt-3.5-turbo, o1, o1-mini, o3-mini |
| Anthropic | claude-opus-4-7, claude-opus-4-6, claude-sonnet-4-6, claude-sonnet-4, claude-haiku-4-5, plus all Claude 3.x snapshots |
| Google | gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash, gemini-2.5-pro |
| Mistral | mistral-large-latest, mistral-small-latest |
| DeepSeek | deepseek-chat, deepseek-reasoner |
Snapshot IDs like gpt-4o-2024-11-20 fall back via prefix match. Override or extend at any time:
new BudgetGuard({
  limits: { daily: 5 },
  pricing: {
    "my-fine-tune": { input: 0.8, output: 2.4 }, // USD per 1M tokens
  },
});

FAQ
How is this different from OpenAI's usage limits in the dashboard?
OpenAI's caps are organization-wide, settle a day later, and don't tell you who spent what. llm-hard-cap enforces in real time, per scope (user, route, environment), and rejects calls before they leave your server.
Does this work with streaming responses?
Yes. Use estimate before opening the stream, then check once you receive the final usage event (OpenAI emits usage in the last chunk if you pass stream_options: { include_usage: true }).
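A minimal sketch of the second half of that pattern, draining the stream and recording usage from the final chunk. To stay self-contained it uses a stand-in async generator rather than a real OpenAI stream, and a hypothetical `UsageSink` interface that only mirrors the `check` arguments shown earlier. The chunk fields follow the OpenAI chat-completions streaming shape, where the usage-only final chunk has an empty `choices` array.

```typescript
interface StreamChunk {
  choices: Array<{ delta: { content?: string | null } }>;
  usage?: { prompt_tokens: number; completion_tokens: number } | null;
}

// Hypothetical minimal view of the guard: only the check() call used here.
interface UsageSink {
  check(args: { model: string; inputTokens: number; outputTokens: number }): Promise<unknown>;
}

// Drain a stream, collect the text, and record usage from the final chunk.
async function consumeStream(
  model: string,
  stream: AsyncIterable<StreamChunk>,
  sink: UsageSink,
): Promise<string> {
  let text = "";
  let usage: StreamChunk["usage"] = null;
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta.content ?? "";
    if (chunk.usage) usage = chunk.usage;
  }
  // If the stream ended without a usage chunk, nothing is recorded.
  if (usage) {
    await sink.check({
      model,
      inputTokens: usage.prompt_tokens,
      outputTokens: usage.completion_tokens,
    });
  }
  return text;
}

// Stand-in for a provider stream: two content chunks, then a usage-only chunk.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
  yield { choices: [{ delta: { content: "Hel" } }] };
  yield { choices: [{ delta: { content: "lo" } }] };
  yield { choices: [], usage: { prompt_tokens: 12, completion_tokens: 2 } };
}
```

In an application, the `sink` argument would be the guard itself, and the `estimate` call would run before the stream is opened.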
Does this work with prompt caching discounts?
Compute the cached vs. non-cached split yourself. Then either express it through a pricing override, or call calculateCost and record the real counts with check.
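As an illustration of the split, the sketch below prices cached input tokens at a separate rate. Everything here is an assumption for the example: `costWithCache` and `CachedPricing` are hypothetical names (this is not the library's `calculateCost`), the rates are made up, and `inputTokens` is taken to be the total input count including cached tokens, which is not how every provider reports it.

```typescript
interface CachedPricing {
  input: number;       // USD per 1M uncached input tokens
  cachedInput: number; // USD per 1M cached input tokens
  output: number;      // USD per 1M output tokens
}

// Cost of one call when part of the input was served from the prompt cache.
// inputTokens is assumed to include cachedTokens.
function costWithCache(
  pricing: CachedPricing,
  inputTokens: number,
  cachedTokens: number,
  outputTokens: number,
): number {
  const uncachedTokens = inputTokens - cachedTokens;
  return (
    (uncachedTokens * pricing.input +
      cachedTokens * pricing.cachedInput +
      outputTokens * pricing.output) /
    1_000_000
  );
}

// Hypothetical rates: cached input billed at half the normal input price.
const cachedExample = costWithCache(
  { input: 2, cachedInput: 1, output: 4 },
  1_000_000, // total input tokens
  500_000,   // of which cached
  250_000,   // output tokens
); // 1.0 + 0.5 + 1.0 = 2.5 USD
```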
Does it count tokens for me?
No — it expects you to pass token counts (from the API response, or your own pre-flight estimate via tiktoken / @anthropic-ai/tokenizer). This keeps the package zero-dependency and accurate.
What happens on rate limit / 5xx errors?
The wrapped call propagates the error untouched. Nothing is recorded if the response doesn't include usage. This means failed calls don't count against your budget — exactly what you want.
Is it safe for multi-process servers?
The default MemoryStorage is per-process. Use FileStorage for single-host setups or implement Storage against Redis / Postgres for distributed apps.
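Whatever backend is used, a storage implementation has to bucket spend by UTC day and UTC month. The sketch below shows that bucketing with a `Map` standing in for Redis or Postgres. The event and summary shapes are assumptions: the summary fields follow the `{ day, month, total, requests }` output of `guard.usage`, while the event fields (`scope`, `costUsd`, `at`) are invented for the example and may not match the real `SpendEvent` type.

```typescript
// Assumed shapes; the library's SpendEvent and UsageSummary types may differ.
interface SpendEventSketch {
  scope: string;
  costUsd: number;
  at: Date;
}

interface UsageSummarySketch {
  day: number;
  month: number;
  total: number;
  requests: number;
}

class MapStorage {
  private events = new Map<string, SpendEventSketch[]>();

  async record(event: SpendEventSketch): Promise<void> {
    const list = this.events.get(event.scope) ?? [];
    list.push(event);
    this.events.set(event.scope, list);
  }

  // Sum spend for the UTC day and UTC month that contain `now`.
  async summary(scope: string, now: Date = new Date()): Promise<UsageSummarySketch> {
    const dayKey = now.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
    const monthKey = dayKey.slice(0, 7);           // "YYYY-MM" in UTC
    const out: UsageSummarySketch = { day: 0, month: 0, total: 0, requests: 0 };
    for (const e of this.events.get(scope) ?? []) {
      const iso = e.at.toISOString();
      if (iso.startsWith(dayKey)) out.day += e.costUsd;
      if (iso.startsWith(monthKey)) out.month += e.costUsd;
      out.total += e.costUsd;
      out.requests += 1;
    }
    return out;
  }

  // Clear one scope, or everything when no scope is given.
  async reset(scope?: string): Promise<void> {
    if (scope === undefined) this.events.clear();
    else this.events.delete(scope);
  }
}
```

A shared backend would replace the `Map` with atomic increments on per-day and per-month keys so that concurrent processes see the same totals.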
API reference
new BudgetGuard(options)
| Option | Type | Default |
|----------------|-----------------------------------|----------------|
| limits | BudgetLimits | required |
| onExceeded | "throw" \| "warn" \| "block" | "throw" |
| storage | Storage | MemoryStorage|
| pricing | Record<string, ModelPricing> | — |
| onUnknownModel | "throw" \| "zero" | "throw" |
| onSpend | (event: SpendEvent) => void | — |
Methods
- guard.wrap(args, call, extract?): pre-check, run, record. Returns the call's result.
- guard.estimate(args): pre-check only. Throws on violation when onExceeded is "throw".
- guard.check(args): record actual usage. Throws on violation when onExceeded is "throw".
- guard.for(scope, limits?): scoped child guard (per-user / per-route).
- guard.usage(scope?): current { day, month, total, requests }.
- guard.reset(scope?): clear usage for a scope (or all).
Examples
See examples/ for runnable scripts:
- openai.ts: OpenAI chat completion with file-backed storage
- anthropic.ts: Claude with default in-memory tracking
- express-middleware.ts: per-user free/paid tiers
- manual-tracking.ts: custom models / providers
Comparison
| | llm-hard-cap | Provider dashboards | API gateway proxies |
|--|--|--|--|
| Real-time enforcement | ✅ | ❌ (delayed) | ✅ |
| Per-user / per-scope | ✅ | ❌ | partial |
| Zero infrastructure | ✅ | ✅ | ❌ (extra hop) |
| Works across providers | ✅ | one each | ✅ |
| Refuses calls before request | ✅ | ❌ | ✅ |
| Bundle size | < 4 KB | n/a | n/a |
Contributing
Issues and PRs welcome. Pricing updates appreciated — providers change rates often.
License
MIT
