llm-spend-guard
v2.0.4
Enforce real-time token budgets and spending limits for OpenAI, Anthropic Claude, and Google Gemini API calls in Node.js
The Problem
A single runaway loop, an uncapped user session, or one oversized prompt can burn through your entire LLM budget in minutes. None of the official OpenAI, Anthropic, or Gemini SDKs offers a built-in way to set spending limits.
llm-spend-guard wraps your existing LLM SDK calls and enforces token budgets before any request is sent to the API. If a request would exceed your budget, it gets blocked instantly — no money wasted.
Why llm-spend-guard?
- Pre-request blocking — Stops overspending before the API call, not after
- Multi-provider — Single API for OpenAI, Anthropic Claude, and Google Gemini
- Multi-scope budgets — Global, per-user, per-session, and per-route limits
- Zero config — Works with 3 lines of code, no infrastructure needed
- Production-ready — Redis storage, Express/Next.js middleware, TypeScript-first
- Lightweight — tiktoken is the only runtime dependency
Table of Contents
- Why llm-spend-guard?
- How It Works
- Compatible Tech Stacks
- Installation
- Quick Start
- Configuration Options
- Usage By Provider
- Budget Scopes
- How Guarding Works (Request Lifecycle)
- Viewing Reports and Stats
- Alert Callbacks (Monitoring)
- What Happens When Budget Is Exceeded
- Auto Truncation
- Storage Backends
- Framework Integration
- SaaS Per-User Budget Example
- Full API Reference
- Comparison with Alternatives
- Running Tests
- Contributing
- Security
- Support
How It Works
Your Code --> llm-spend-guard --> LLM API (OpenAI / Anthropic / Gemini)
|
|-- 1. Estimates tokens BEFORE the request
|-- 2. Checks all budget scopes (global, user, session, route)
|-- 3. If over budget --> BLOCKS the request (throws BudgetExceededError)
|-- 4. If auto-truncate enabled --> trims prompt to fit
|-- 5. Sends request to LLM API
|-- 6. Records actual token usage from response
|-- 7. Fires alert callbacks at 50%, 80%, 100% thresholds
Key principle: The guard sits between your code and the LLM SDK. It estimates cost before sending, blocks if over budget, and tracks actual usage after the response.
Compatible Tech Stacks
| Category | Supported |
|----------|-----------|
| Runtime | Node.js >= 18, Bun, Deno (with Node compat) |
| Language | TypeScript, JavaScript (CommonJS and ESM) |
| LLM Providers | OpenAI, Anthropic (Claude), Google Gemini |
| Frameworks | Express.js, Next.js, Fastify, Koa, Hono, NestJS, or any Node.js server |
| Storage | In-memory (default), Redis, or any custom adapter |
| Use Cases | REST APIs, SaaS backends, chatbots, AI agents, CLI tools, serverless functions |
Not compatible with: Browser/frontend code (this is a server-side package), Python, or non-Node runtimes without Node compatibility.
Installation
npm install llm-spend-guard
Then install the provider SDK(s) you use:
# Pick one or more
npm install openai # For OpenAI (GPT-4o, GPT-4, etc.)
npm install @anthropic-ai/sdk # For Anthropic (Claude)
npm install @google/generative-ai # For Google Gemini
# Optional: Redis storage
npm install ioredis
Quick Start
import { LLMGuard } from 'llm-spend-guard';
import OpenAI from 'openai';
// 1. Create the guard with your budget
const guard = new LLMGuard({
dailyBudgetTokens: 100_000, // 100K tokens per day
maxTokensPerRequest: 10_000, // No single request can use more than 10K
onBudgetWarning(level, stats) {
console.log(`Budget alert [${level}]: ${stats.percentage.toFixed(1)}% used`);
},
});
// 2. Wrap your existing SDK client
const openai = new OpenAI();
guard.wrapOpenAI(openai);
// 3. Use guard.openai instead of openai directly
const response = await guard.openai.chat({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'What is the meaning of life?' }],
max_tokens: 500,
});
console.log(response.choices[0].message.content);
// 4. Check your budget anytime
const remaining = await guard.getRemainingBudget();
console.log(`Tokens remaining today: ${remaining}`);
That's it. If any request would exceed the budget, it throws BudgetExceededError and the API is never called.
Configuration Options
const guard = new LLMGuard({
// --- Budget Limits ---
dailyBudgetTokens: 100_000, // Max tokens per day (resets at midnight)
globalBudgetTokens: 1_000_000, // Lifetime global cap
userBudgetTokens: 10_000, // Max per user
sessionBudgetTokens: 5_000, // Max per session
maxTokensPerRequest: 10_000, // Max per single request
// --- Behavior ---
autoTruncate: true, // Auto-trim prompts to fit budget
// --- Storage ---
storage: new MemoryStorage(), // Default. Use RedisStorage for production.
// --- Monitoring ---
onBudgetWarning(level, stats) {
// level: 'warning_50' | 'warning_80' | 'exceeded'
// stats: { scope, scopeKey, used, limit, remaining, percentage }
},
});
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| dailyBudgetTokens | number | undefined | Max tokens per day. Auto-resets at midnight. |
| globalBudgetTokens | number | undefined | Lifetime total token cap. |
| userBudgetTokens | number | undefined | Max tokens per unique user. |
| sessionBudgetTokens | number | undefined | Max tokens per session. |
| maxTokensPerRequest | number | undefined | Hard cap on a single request. |
| autoTruncate | boolean | false | Automatically shorten prompts to fit remaining budget. |
| storage | StorageAdapter | MemoryStorage | Where usage data is stored. |
| onBudgetWarning | function | undefined | Called at 50%, 80%, and 100% usage. |
Usage By Provider
OpenAI
import { LLMGuard } from 'llm-spend-guard';
import OpenAI from 'openai';
const guard = new LLMGuard({ dailyBudgetTokens: 100_000 });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
guard.wrapOpenAI(openai);
const res = await guard.openai.chat({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hello!' }],
max_tokens: 500,
});
Anthropic (Claude)
import { LLMGuard } from 'llm-spend-guard';
import Anthropic from '@anthropic-ai/sdk';
const guard = new LLMGuard({ dailyBudgetTokens: 100_000 });
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
guard.wrapAnthropic(anthropic);
const res = await guard.anthropic.chat({
model: 'claude-sonnet-4-20250514',
messages: [{ role: 'user', content: 'Hello!' }],
max_tokens: 500,
system: 'You are a helpful assistant.', // Anthropic system prompt
});
Google Gemini
import { LLMGuard } from 'llm-spend-guard';
import { GoogleGenerativeAI } from '@google/generative-ai';
const guard = new LLMGuard({ dailyBudgetTokens: 100_000 });
const gemini = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
guard.wrapGemini(gemini);
const res = await guard.gemini.chat({
model: 'gemini-1.5-pro',
messages: [{ role: 'user', content: 'Hello!' }],
max_tokens: 500,
});
Budget Scopes
You can enforce budgets at multiple levels simultaneously:
+-------------------+
| Global Budget | <-- total across everything
+-------------------+
/ | \
+--------+ +--------+ +--------+
| User A | | User B | | User C | <-- per-user limit
+--------+ +--------+ +--------+
| |
+---------+ +---------+
| Session | | Session | <-- per-session limit
+---------+ +---------+
|
+---------+
| Route | <-- per-route limit
+---------+
Pass context with every request to activate scopes:
await guard.openai.chat(
{
model: 'gpt-4o',
messages: [...],
max_tokens: 500,
},
{
userId: 'user-123', // activates per-user budget
sessionId: 'sess-abc', // activates per-session budget
route: '/api/chat', // activates per-route budget
}
);
All applicable scopes are checked. If any scope is exceeded, the request is blocked.
How Guarding Works (Request Lifecycle)
Here is exactly what happens on every .chat() call:
Step 1: ESTIMATE
| Count tokens in all messages using tiktoken (OpenAI) or heuristic (others)
| Add max_tokens (expected output) to get total estimated cost
v
Step 2: CHECK BUDGET
| For each active scope (global, daily, user, session, route):
| - Load current usage from storage
| - Compare: estimated tokens vs remaining budget
| - If over budget --> throw BudgetExceededError (request NEVER sent)
v
Step 3: AUTO-TRUNCATE (if enabled)
| If prompt is too large but truncation is on:
| - Keep system message intact
| - Keep most recent messages
| - Drop oldest messages first
| - Truncate text of last message if still too large
v
Step 4: SEND REQUEST
| Forward to actual LLM API (OpenAI/Anthropic/Gemini)
v
Step 5: RECORD USAGE
| Read actual token counts from API response
| Update all scope counters in storage
v
Step 6: FIRE ALERTS
| If any scope crosses 50% --> onBudgetWarning('warning_50', stats)
| If any scope crosses 80% --> onBudgetWarning('warning_80', stats)
| If any scope crosses 100% --> onBudgetWarning('exceeded', stats)
v
Step 7: RETURN RESPONSE
|   Return the original API response to your code
Viewing Reports and Stats
Get Budget Stats
// Global stats (all scopes)
const stats = await guard.getStats();
console.log(stats);
Output:
[
{
"scope": "global",
"scopeKey": "daily",
"used": 45200,
"limit": 100000,
"remaining": 54800,
"percentage": 45.2
}
]
Get Per-User Stats
const userStats = await guard.getStats({ userId: 'user-123' });
console.log(userStats);
Output:
[
{
"scope": "global",
"scopeKey": "daily",
"used": 45200,
"limit": 100000,
"remaining": 54800,
"percentage": 45.2
},
{
"scope": "user",
"scopeKey": "user:user-123",
"used": 8300,
"limit": 10000,
"remaining": 1700,
"percentage": 83.0
}
]
Get Remaining Token Count
const remaining = await guard.getRemainingBudget({ userId: 'user-123' });
console.log(`Tokens left: ${remaining}`);
// Output: "Tokens left: 1700"
This returns the minimum remaining across all active scopes. If the user has 1700 left on their user budget but 54800 left on the daily budget, it returns 1700 (the tightest constraint).
Build a Usage Dashboard Endpoint
app.get('/api/usage', async (req, res) => {
const userId = req.headers['x-user-id'] as string;
const stats = await guard.getStats({ userId });
const remaining = await guard.getRemainingBudget({ userId });
res.json({
budgets: stats.map(s => ({
scope: s.scope,
key: s.scopeKey,
used: s.used,
limit: s.limit,
remaining: s.remaining,
percentUsed: `${s.percentage.toFixed(1)}%`,
})),
totalRemaining: remaining,
});
});
Response:
{
"budgets": [
{
"scope": "global",
"key": "daily",
"used": 45200,
"limit": 100000,
"remaining": 54800,
"percentUsed": "45.2%"
},
{
"scope": "user",
"key": "user:user-123",
"used": 8300,
"limit": 10000,
"remaining": 1700,
"percentUsed": "83.0%"
}
],
"totalRemaining": 1700
}
Reset Budgets
// Reset all budgets
await guard.reset();
// Reset for a specific user
await guard.reset({ userId: 'user-123' });
Alert Callbacks (Monitoring)
Get notified as budgets are consumed:
const guard = new LLMGuard({
dailyBudgetTokens: 100_000,
userBudgetTokens: 10_000,
onBudgetWarning(level, stats) {
switch (level) {
case 'warning_50':
console.log(`[WARN] ${stats.scopeKey} is 50% used (${stats.used}/${stats.limit})`);
break;
case 'warning_80':
console.warn(`[CRITICAL] ${stats.scopeKey} is 80% used!`);
// Send Slack notification, email alert, etc.
break;
case 'exceeded':
console.error(`[EXCEEDED] ${stats.scopeKey} has exceeded the budget!`);
// Page on-call, disable feature flag, etc.
break;
}
},
});
Alert levels fire once per scope per threshold — you won't get spammed with duplicate alerts.
| Level | Fires When | Typical Action |
|-------|-----------|----------------|
| warning_50 | 50% budget consumed | Log it, update dashboard |
| warning_80 | 80% budget consumed | Alert team via Slack/email |
| exceeded | 100% budget consumed | Block requests, page on-call |
What Happens When Budget Is Exceeded
When a request would exceed any budget scope, the guard throws BudgetExceededError:
import { BudgetExceededError } from 'llm-spend-guard';
try {
await guard.openai.chat({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Tell me everything about the universe' }],
max_tokens: 50_000,
});
} catch (err) {
if (err instanceof BudgetExceededError) {
console.log(err.message);
// "Token budget exceeded for global:daily. Used 95000/100000 tokens (95.0%)"
console.log(err.stats);
// {
// scope: 'global',
// scopeKey: 'daily',
// used: 95000,
// limit: 100000,
// remaining: 5000,
// percentage: 95.0
// }
}
}
The LLM API is NEVER called. No money is spent. The request is blocked locally before it leaves your server.
Auto Truncation
When autoTruncate: true, instead of rejecting oversized prompts, the guard intelligently trims them:
const guard = new LLMGuard({
dailyBudgetTokens: 5_000,
autoTruncate: true, // Enable smart truncation
});
Truncation strategy:
- System messages are always preserved
- Most recent messages are kept first
- Oldest messages are dropped
- If the last message is still too large, its text is trimmed with "..." appended
This is useful for chatbots with long conversation histories — the guard keeps the most relevant context while staying within budget.
Storage Backends
In-Memory (Default)
import { LLMGuard, MemoryStorage } from 'llm-spend-guard';
const guard = new LLMGuard({
storage: new MemoryStorage(), // This is the default, no need to specify
dailyBudgetTokens: 100_000,
});
Good for: single-process apps, development, testing. Limitation: data is lost on restart, not shared across processes.
Redis (Production)
import { LLMGuard, RedisStorage } from 'llm-spend-guard';
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL);
const guard = new LLMGuard({
storage: new RedisStorage(redis, 'myapp:budget:'), // optional key prefix
dailyBudgetTokens: 100_000,
});
Good for: production, multi-instance, serverless. Keys auto-expire at midnight (daily reset built-in).
Custom Adapter
Implement the StorageAdapter interface for any backend (PostgreSQL, DynamoDB, file system, etc.):
import { LLMGuard, StorageAdapter, ScopeUsage } from 'llm-spend-guard';
const myStorage: StorageAdapter = {
async get(key: string): Promise<ScopeUsage | null> {
// Read from your database
return db.get(key);
},
async set(key: string, value: ScopeUsage): Promise<void> {
// Write to your database
await db.set(key, value);
},
async increment(key: string, tokens: number): Promise<ScopeUsage> {
// Must atomically increment and return the updated value.
// This read-modify-write is shown for brevity; use a transaction or an
// atomic counter in your database to avoid lost updates under concurrency.
const existing = await this.get(key) ?? { totalTokens: 0, date: new Date().toISOString().slice(0, 10) };
existing.totalTokens += tokens;
await this.set(key, existing);
return existing;
},
async reset(key: string): Promise<void> {
await db.delete(key);
},
};
const guard = new LLMGuard({ storage: myStorage, dailyBudgetTokens: 100_000 });
Framework Integration
Express.js
import express from 'express';
import OpenAI from 'openai';
import { LLMGuard, expressMiddleware, budgetErrorHandler } from 'llm-spend-guard';
const app = express();
app.use(express.json());
const guard = new LLMGuard({
dailyBudgetTokens: 500_000,
userBudgetTokens: 50_000,
maxTokensPerRequest: 10_000,
onBudgetWarning(level, stats) {
console.warn(`[${level}] ${stats.scopeKey}: ${stats.percentage.toFixed(1)}%`);
},
});
const openai = new OpenAI();
guard.wrapOpenAI(openai);
// Middleware auto-extracts userId, sessionId, route from request
// userId from: x-user-id header or req.user.id (passport)
// sessionId from: x-session-id header or req.sessionID (express-session)
// route from: req.path
app.use(expressMiddleware(guard));
app.post('/api/chat', async (req, res, next) => {
try {
const response = await guard.openai.chat(
{
model: 'gpt-4o',
messages: req.body.messages,
max_tokens: 1000,
},
req.llmBudgetContext, // Automatically populated by middleware
);
res.json(response);
} catch (err) {
next(err);
}
});
// Returns HTTP 429 with error details when budget exceeded
app.use(budgetErrorHandler);
app.listen(3000);
When budget is exceeded, the client gets:
HTTP 429 Too Many Requests
{
"error": "Token budget exceeded",
"details": {
"scope": "user",
"scopeKey": "user:user-123",
"used": 48500,
"limit": 50000,
"remaining": 1500,
"percentage": 97.0
}
}
Next.js API Routes
// pages/api/chat.ts (or app/api/chat/route.ts)
import OpenAI from 'openai';
import { LLMGuard, withBudgetGuard } from 'llm-spend-guard';
const guard = new LLMGuard({
dailyBudgetTokens: 200_000,
userBudgetTokens: 20_000,
autoTruncate: true,
});
const openai = new OpenAI();
guard.wrapOpenAI(openai);
async function handler(req: any, res: any) {
const response = await guard.openai.chat(
{
model: 'gpt-4o',
messages: req.body.messages,
max_tokens: 1000,
},
req.llmBudgetContext, // Auto-populated by withBudgetGuard
);
res.status(200).json(response);
}
// Wraps handler with budget enforcement + auto 429 on exceeded
export default withBudgetGuard(guard, handler);
Fastify / Koa / Hono
No built-in middleware for these, but integration is trivial since the guard is framework-agnostic:
// Fastify example
fastify.post('/api/chat', async (request, reply) => {
try {
const response = await guard.openai.chat(
{
model: 'gpt-4o',
messages: request.body.messages,
max_tokens: 1000,
},
{
userId: request.headers['x-user-id'] as string,
sessionId: request.headers['x-session-id'] as string,
route: request.url,
},
);
return response;
} catch (err) {
if (err instanceof BudgetExceededError) {
reply.status(429).send({ error: 'Budget exceeded', details: err.stats });
return;
}
throw err;
}
});
SaaS Per-User Budget Example
For multi-tenant SaaS apps where each user has their own token budget:
import { LLMGuard, RedisStorage } from 'llm-spend-guard';
import Anthropic from '@anthropic-ai/sdk';
import Redis from 'ioredis';
const guard = new LLMGuard({
userBudgetTokens: 10_000, // 10K tokens per user per day
dailyBudgetTokens: 1_000_000, // 1M total across all users
maxTokensPerRequest: 5_000,
autoTruncate: true,
storage: new RedisStorage(new Redis()),
onBudgetWarning(level, stats) {
if (stats.scope === 'user' && level === 'warning_80') {
// Notify user they're running low
notifyUser(stats.scopeKey.replace('user:', ''), {
message: `You've used ${stats.percentage.toFixed(0)}% of your daily AI quota.`,
remaining: stats.remaining,
});
}
},
});
const anthropic = new Anthropic();
guard.wrapAnthropic(anthropic);
// In your API handler:
async function handleChat(userId: string, messages: any[]) {
return guard.anthropic.chat(
{
model: 'claude-sonnet-4-20250514',
messages,
max_tokens: 1000,
},
{ userId },
);
}
Full API Reference
LLMGuard
| Method | Returns | Description |
|--------|---------|-------------|
| new LLMGuard(config) | LLMGuard | Create a guard instance |
| wrapOpenAI(client) | OpenAIProvider | Wrap an OpenAI SDK client |
| wrapAnthropic(client) | AnthropicProvider | Wrap an Anthropic SDK client |
| wrapGemini(client) | GeminiProvider | Wrap a Google Generative AI client |
| guard.openai | OpenAIProvider | Access the wrapped OpenAI provider |
| guard.anthropic | AnthropicProvider | Access the wrapped Anthropic provider |
| guard.gemini | GeminiProvider | Access the wrapped Gemini provider |
| getStats(ctx?) | Promise<BudgetStats[]> | Get usage stats for all applicable scopes |
| getRemainingBudget(ctx?) | Promise<number> | Get minimum remaining tokens across scopes |
| reset(ctx?) | Promise<void> | Reset usage counters |
| getBudgetManager() | BudgetManager | Access the underlying budget manager |
Provider .chat() Method
All providers (OpenAI, Anthropic, Gemini) have the same interface (a usage sketch follows the parameter table):
await guard.openai.chat(params, context?)
| Parameter | Type | Description |
|-----------|------|-------------|
| params.model | string | Model name (e.g. 'gpt-4o', 'claude-sonnet-4-20250514') |
| params.messages | ChatMessage[] | Array of { role, content } messages |
| params.max_tokens | number | Max output tokens (default: 4096) |
| context.userId | string? | User identifier for per-user budgets |
| context.sessionId | string? | Session identifier for per-session budgets |
| context.route | string? | Route/endpoint for per-route budgets |
BudgetStats Object
{
scope: 'global' | 'user' | 'session' | 'route',
scopeKey: string, // e.g. "daily", "user:user-123"
used: number, // tokens consumed
limit: number, // budget cap
remaining: number, // tokens left
percentage: number // 0-100+
}
BudgetExceededError
err.message // Human-readable error string
err.stats // BudgetStats object with full details
err.name     // 'BudgetExceededError'
Exports
// Core
import { LLMGuard, BudgetManager, BudgetExceededError } from 'llm-spend-guard';
// Providers
import { OpenAIProvider, AnthropicProvider, GeminiProvider } from 'llm-spend-guard';
// Storage
import { MemoryStorage, RedisStorage } from 'llm-spend-guard';
// Middleware
import { expressMiddleware, budgetErrorHandler, withBudgetGuard } from 'llm-spend-guard';
// Utilities
import { estimateTokens, estimateMessagesTokens, truncateMessages } from 'llm-spend-guard';
// Types
import type {
GuardConfig, BudgetConfig, BudgetStats, BudgetScope,
AlertLevel, StorageAdapter, ScopeUsage, RequestContext,
ChatMessage, TokenEstimatorFn,
} from 'llm-spend-guard';
Comparison with Alternatives
| Feature | llm-spend-guard | Manual tracking | OpenAI Usage Limits |
|---------|----------------|-----------------|---------------------|
| Pre-request blocking | Yes | No | No (post-hoc only) |
| Multi-provider support | OpenAI + Claude + Gemini | Manual per SDK | OpenAI only |
| Per-user budgets | Built-in | Build yourself | No |
| Per-session / per-route scopes | Built-in | Build yourself | No |
| Auto-truncation | Yes | No | No |
| Express/Next.js middleware | Built-in | Build yourself | No |
| Redis support | Built-in | Build yourself | No |
| Self-hosted | Yes | Yes | No (vendor dashboard) |
Running Tests
git clone <repo-url>
cd llm-spend-guard
npm install
npm test
108 tests (99% coverage) covering:
- Budget overflow and enforcement (global, daily, per-request limits)
- Per-user, per-session, per-route scopes
- Token estimation accuracy (tiktoken + heuristic)
- Context truncation logic (system messages, binary search trimming)
- All provider wrappers — OpenAI, Anthropic, Gemini (mocked, no API keys needed)
- Auto-truncation across all providers
- Alert callback firing and deduplication
- Guard lifecycle (create, wrap, reset)
- Express middleware and Next.js wrapper
- Error handling (BudgetExceededError, budget error handler)
- Storage backends (MemoryStorage, RedisStorage with mock)
Contributing
We welcome contributions! Please read the Contributing Guide before submitting a PR.
Look for issues labeled good first issue to get started.
Security
To report vulnerabilities, please see our Security Policy.
Support
If this package helps you, consider supporting its development.