llm-spend-guard
v2.0.4
Enforce real-time token budgets and spending limits for OpenAI, Anthropic Claude, and Google Gemini API calls in Node.js
The Problem
A single runaway loop, an uncapped user session, or one oversized prompt can burn through your entire LLM budget in minutes. None of the official OpenAI, Anthropic, or Gemini SDKs offers a built-in way to set spending limits.
llm-spend-guard wraps your existing LLM SDK calls and enforces token budgets before any request is sent to the API. If a request would exceed your budget, it gets blocked instantly — no money wasted.
Why llm-spend-guard?
- Pre-request blocking — Stops overspending before the API call, not after
- Multi-provider — Single API for OpenAI, Anthropic Claude, and Google Gemini
- Multi-scope budgets — Global, per-user, per-session, and per-route limits
- Zero config — Works with 3 lines of code, no infrastructure needed
- Production-ready — Redis storage, Express/Next.js middleware, TypeScript-first
- Lightweight — tiktoken is the only runtime dependency
Table of Contents
- Why llm-spend-guard?
- How It Works
- Compatible Tech Stacks
- Installation
- Quick Start
- Configuration Options
- Usage By Provider
- Budget Scopes
- How Guarding Works (Request Lifecycle)
- Viewing Reports and Stats
- Alert Callbacks (Monitoring)
- What Happens When Budget Is Exceeded
- Auto Truncation
- Storage Backends
- Framework Integration
- SaaS Per-User Budget Example
- Full API Reference
- Comparison with Alternatives
- Running Tests
- Contributing
- Security
- Support
How It Works
Your Code --> llm-spend-guard --> LLM API (OpenAI / Anthropic / Gemini)
|
|-- 1. Estimates tokens BEFORE the request
|-- 2. Checks all budget scopes (global, user, session, route)
|-- 3. If over budget --> BLOCKS the request (throws BudgetExceededError)
|-- 4. If auto-truncate enabled --> trims prompt to fit
|-- 5. Sends request to LLM API
|-- 6. Records actual token usage from response
|-- 7. Fires alert callbacks at 50%, 80%, 100% thresholds
Key principle: The guard sits between your code and the LLM SDK. It estimates cost before sending, blocks if over budget, and tracks actual usage after the response.
Compatible Tech Stacks
| Category | Supported |
|----------|-----------|
| Runtime | Node.js >= 18, Bun, Deno (with Node compat) |
| Language | TypeScript, JavaScript (CommonJS and ESM) |
| LLM Providers | OpenAI, Anthropic (Claude), Google Gemini |
| Frameworks | Express.js, Next.js, Fastify, Koa, Hono, NestJS, or any Node.js server |
| Storage | In-memory (default), Redis, or any custom adapter |
| Use Cases | REST APIs, SaaS backends, chatbots, AI agents, CLI tools, serverless functions |
Not compatible with: Browser/frontend code (this is a server-side package), Python, or non-Node runtimes without Node compatibility.
Installation
npm install llm-spend-guard
Then install the provider SDK(s) you use:
# Pick one or more
npm install openai # For OpenAI (GPT-4o, GPT-4, etc.)
npm install @anthropic-ai/sdk # For Anthropic (Claude)
npm install @google/generative-ai # For Google Gemini
# Optional: Redis storage
npm install ioredis
Quick Start
import { LLMGuard } from 'llm-spend-guard';
import OpenAI from 'openai';
// 1. Create the guard with your budget
const guard = new LLMGuard({
dailyBudgetTokens: 100_000, // 100K tokens per day
maxTokensPerRequest: 10_000, // No single request can use more than 10K
onBudgetWarning(level, stats) {
console.log(`Budget alert [${level}]: ${stats.percentage.toFixed(1)}% used`);
},
});
// 2. Wrap your existing SDK client
const openai = new OpenAI();
guard.wrapOpenAI(openai);
// 3. Use guard.openai instead of openai directly
const response = await guard.openai.chat({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'What is the meaning of life?' }],
max_tokens: 500,
});
console.log(response.choices[0].message.content);
// 4. Check your budget anytime
const remaining = await guard.getRemainingBudget();
console.log(`Tokens remaining today: ${remaining}`);
That's it. If any request would exceed the budget, it throws BudgetExceededError and the API is never called.
Configuration Options
const guard = new LLMGuard({
// --- Budget Limits ---
dailyBudgetTokens: 100_000, // Max tokens per day (resets at midnight)
globalBudgetTokens: 1_000_000, // Lifetime global cap
userBudgetTokens: 10_000, // Max per user
sessionBudgetTokens: 5_000, // Max per session
maxTokensPerRequest: 10_000, // Max per single request
// --- Behavior ---
autoTruncate: true, // Auto-trim prompts to fit budget
// --- Storage ---
storage: new MemoryStorage(), // Default. Use RedisStorage for production.
// --- Monitoring ---
onBudgetWarning(level, stats) {
// level: 'warning_50' | 'warning_80' | 'exceeded'
// stats: { scope, scopeKey, used, limit, remaining, percentage }
},
});
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| dailyBudgetTokens | number | undefined | Max tokens per day. Auto-resets at midnight. |
| globalBudgetTokens | number | undefined | Lifetime total token cap. |
| userBudgetTokens | number | undefined | Max tokens per unique user. |
| sessionBudgetTokens | number | undefined | Max tokens per session. |
| maxTokensPerRequest | number | undefined | Hard cap on a single request. |
| autoTruncate | boolean | false | Automatically shorten prompts to fit remaining budget. |
| storage | StorageAdapter | MemoryStorage | Where usage data is stored. |
| onBudgetWarning | function | undefined | Called at 50%, 80%, and 100% usage. |
Usage By Provider
OpenAI
import { LLMGuard } from 'llm-spend-guard';
import OpenAI from 'openai';
const guard = new LLMGuard({ dailyBudgetTokens: 100_000 });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
guard.wrapOpenAI(openai);
const res = await guard.openai.chat({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hello!' }],
max_tokens: 500,
});
Anthropic (Claude)
import { LLMGuard } from 'llm-spend-guard';
import Anthropic from '@anthropic-ai/sdk';
const guard = new LLMGuard({ dailyBudgetTokens: 100_000 });
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
guard.wrapAnthropic(anthropic);
const res = await guard.anthropic.chat({
model: 'claude-sonnet-4-20250514',
messages: [{ role: 'user', content: 'Hello!' }],
max_tokens: 500,
system: 'You are a helpful assistant.', // Anthropic system prompt
});
Google Gemini
import { LLMGuard } from 'llm-spend-guard';
import { GoogleGenerativeAI } from '@google/generative-ai';
const guard = new LLMGuard({ dailyBudgetTokens: 100_000 });
const gemini = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
guard.wrapGemini(gemini);
const res = await guard.gemini.chat({
model: 'gemini-1.5-pro',
messages: [{ role: 'user', content: 'Hello!' }],
max_tokens: 500,
});
Budget Scopes
You can enforce budgets at multiple levels simultaneously:
+-------------------+
| Global Budget | <-- total across everything
+-------------------+
/ | \
+--------+ +--------+ +--------+
| User A | | User B | | User C | <-- per-user limit
+--------+ +--------+ +--------+
| |
+---------+ +---------+
| Session | | Session | <-- per-session limit
+---------+ +---------+
|
+---------+
| Route | <-- per-route limit
+---------+
Pass context with every request to activate scopes:
await guard.openai.chat(
{
model: 'gpt-4o',
messages: [...],
max_tokens: 500,
},
{
userId: 'user-123', // activates per-user budget
sessionId: 'sess-abc', // activates per-session budget
route: '/api/chat', // activates per-route budget
}
);
All applicable scopes are checked. If any scope is exceeded, the request is blocked.
How Guarding Works (Request Lifecycle)
Here is exactly what happens on every .chat() call:
Step 1: ESTIMATE
| Count tokens in all messages using tiktoken (OpenAI) or heuristic (others)
| Add max_tokens (expected output) to get total estimated cost
v
Step 2: CHECK BUDGET
| For each active scope (global, daily, user, session, route):
| - Load current usage from storage
| - Compare: estimated tokens vs remaining budget
| - If over budget --> throw BudgetExceededError (request NEVER sent)
v
Step 3: AUTO-TRUNCATE (if enabled)
| If prompt is too large but truncation is on:
| - Keep system message intact
| - Keep most recent messages
| - Drop oldest messages first
| - Truncate text of last message if still too large
v
Step 4: SEND REQUEST
| Forward to actual LLM API (OpenAI/Anthropic/Gemini)
v
Step 5: RECORD USAGE
| Read actual token counts from API response
| Update all scope counters in storage
v
Step 6: FIRE ALERTS
| If any scope crosses 50% --> onBudgetWarning('warning_50', stats)
| If any scope crosses 80% --> onBudgetWarning('warning_80', stats)
| If any scope crosses 100% --> onBudgetWarning('exceeded', stats)
v
Step 7: RETURN RESPONSE
|   Return the original API response to your code
Viewing Reports and Stats
Get Budget Stats
// Global stats (all scopes)
const stats = await guard.getStats();
console.log(stats);
Output:
[
{
"scope": "global",
"scopeKey": "daily",
"used": 45200,
"limit": 100000,
"remaining": 54800,
"percentage": 45.2
}
]
Get Per-User Stats
const userStats = await guard.getStats({ userId: 'user-123' });
console.log(userStats);
Output:
[
{
"scope": "global",
"scopeKey": "daily",
"used": 45200,
"limit": 100000,
"remaining": 54800,
"percentage": 45.2
},
{
"scope": "user",
"scopeKey": "user:user-123",
"used": 8300,
"limit": 10000,
"remaining": 1700,
"percentage": 83.0
}
]
Get Remaining Token Count
const remaining = await guard.getRemainingBudget({ userId: 'user-123' });
console.log(`Tokens left: ${remaining}`);
// Output: "Tokens left: 1700"
This returns the minimum remaining across all active scopes. If the user has 1700 left on their user budget but 54800 left on the daily budget, it returns 1700 (the tightest constraint).
Build a Usage Dashboard Endpoint
app.get('/api/usage', async (req, res) => {
const userId = req.headers['x-user-id'] as string;
const stats = await guard.getStats({ userId });
const remaining = await guard.getRemainingBudget({ userId });
res.json({
budgets: stats.map(s => ({
scope: s.scope,
key: s.scopeKey,
used: s.used,
limit: s.limit,
remaining: s.remaining,
percentUsed: `${s.percentage.toFixed(1)}%`,
})),
totalRemaining: remaining,
});
});
Response:
{
"budgets": [
{
"scope": "global",
"key": "daily",
"used": 45200,
"limit": 100000,
"remaining": 54800,
"percentUsed": "45.2%"
},
{
"scope": "user",
"key": "user:user-123",
"used": 8300,
"limit": 10000,
"remaining": 1700,
"percentUsed": "83.0%"
}
],
"totalRemaining": 1700
}
Reset Budgets
// Reset all budgets
await guard.reset();
// Reset for a specific user
await guard.reset({ userId: 'user-123' });
Alert Callbacks (Monitoring)
Get notified as budgets are consumed:
const guard = new LLMGuard({
dailyBudgetTokens: 100_000,
userBudgetTokens: 10_000,
onBudgetWarning(level, stats) {
switch (level) {
case 'warning_50':
console.log(`[WARN] ${stats.scopeKey} is 50% used (${stats.used}/${stats.limit})`);
break;
case 'warning_80':
console.warn(`[CRITICAL] ${stats.scopeKey} is 80% used!`);
// Send Slack notification, email alert, etc.
break;
case 'exceeded':
console.error(`[EXCEEDED] ${stats.scopeKey} has exceeded the budget!`);
// Page on-call, disable feature flag, etc.
break;
}
},
});
Alert levels fire once per scope per threshold — you won't get spammed with duplicate alerts.
| Level | Fires When | Typical Action |
|-------|-----------|----------------|
| warning_50 | 50% budget consumed | Log it, update dashboard |
| warning_80 | 80% budget consumed | Alert team via Slack/email |
| exceeded | 100% budget consumed | Block requests, page on-call |
What Happens When Budget Is Exceeded
When a request would exceed any budget scope, the guard throws BudgetExceededError:
import { BudgetExceededError } from 'llm-spend-guard';
try {
await guard.openai.chat({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Tell me everything about the universe' }],
max_tokens: 50_000,
});
} catch (err) {
if (err instanceof BudgetExceededError) {
console.log(err.message);
// "Token budget exceeded for global:daily. Used 95000/100000 tokens (95.0%)"
console.log(err.stats);
// {
// scope: 'global',
// scopeKey: 'daily',
// used: 95000,
// limit: 100000,
// remaining: 5000,
// percentage: 95.0
// }
}
}
The LLM API is NEVER called. No money is spent. The request is blocked locally before it leaves your server.
Auto Truncation
When autoTruncate: true, instead of rejecting oversized prompts, the guard intelligently trims them:
const guard = new LLMGuard({
dailyBudgetTokens: 5_000,
autoTruncate: true, // Enable smart truncation
});
Truncation strategy:
- System messages are always preserved
- Most recent messages are kept first
- Oldest messages are dropped
- If the last message is still too large, its text is trimmed with "..." appended
This is useful for chatbots with long conversation histories — the guard keeps the most relevant context while staying within budget.
Storage Backends
In-Memory (Default)
import { LLMGuard, MemoryStorage } from 'llm-spend-guard';
const guard = new LLMGuard({
storage: new MemoryStorage(), // This is the default, no need to specify
dailyBudgetTokens: 100_000,
});
Good for: single-process apps, development, testing. Limitation: data is lost on restart, not shared across processes.
Redis (Production)
import { LLMGuard, RedisStorage } from 'llm-spend-guard';
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL);
const guard = new LLMGuard({
storage: new RedisStorage(redis, 'myapp:budget:'), // optional key prefix
dailyBudgetTokens: 100_000,
});
Good for: production, multi-instance, serverless. Keys auto-expire at midnight (daily reset built-in).
Custom Adapter
Implement the StorageAdapter interface for any backend (PostgreSQL, DynamoDB, file system, etc.):
import { LLMGuard, StorageAdapter, ScopeUsage } from 'llm-spend-guard';
const myStorage: StorageAdapter = {
async get(key: string): Promise<ScopeUsage | null> {
// Read from your database
return db.get(key);
},
async set(key: string, value: ScopeUsage): Promise<void> {
// Write to your database
await db.set(key, value);
},
async increment(key: string, tokens: number): Promise<ScopeUsage> {
// Must atomically increment and return the updated value.
// This read-modify-write is shown for brevity; use a transaction or an
// atomic counter in your database to avoid lost updates under concurrency.
const existing = await this.get(key) ?? { totalTokens: 0, date: new Date().toISOString().slice(0, 10) };
existing.totalTokens += tokens;
await this.set(key, existing);
return existing;
},
async reset(key: string): Promise<void> {
await db.delete(key);
},
};
const guard = new LLMGuard({ storage: myStorage, dailyBudgetTokens: 100_000 });
Framework Integration
Express.js
import express from 'express';
import OpenAI from 'openai';
import { LLMGuard, expressMiddleware, budgetErrorHandler } from 'llm-spend-guard';
const app = express();
app.use(express.json());
const guard = new LLMGuard({
dailyBudgetTokens: 500_000,
userBudgetTokens: 50_000,
maxTokensPerRequest: 10_000,
onBudgetWarning(level, stats) {
console.warn(`[${level}] ${stats.scopeKey}: ${stats.percentage.toFixed(1)}%`);
},
});
const openai = new OpenAI();
guard.wrapOpenAI(openai);
// Middleware auto-extracts userId, sessionId, route from request
// userId from: x-user-id header or req.user.id (passport)
// sessionId from: x-session-id header or req.sessionID (express-session)
// route from: req.path
app.use(expressMiddleware(guard));
app.post('/api/chat', async (req, res, next) => {
try {
const response = await guard.openai.chat(
{
model: 'gpt-4o',
messages: req.body.messages,
max_tokens: 1000,
},
req.llmBudgetContext, // Automatically populated by middleware
);
res.json(response);
} catch (err) {
next(err);
}
});
// Returns HTTP 429 with error details when budget exceeded
app.use(budgetErrorHandler);
app.listen(3000);
When budget is exceeded, the client gets:
HTTP 429 Too Many Requests
{
"error": "Token budget exceeded",
"details": {
"scope": "user",
"scopeKey": "user:user-123",
"used": 48500,
"limit": 50000,
"remaining": 1500,
"percentage": 97.0
}
}
Next.js API Routes
// pages/api/chat.ts (or app/api/chat/route.ts)
import OpenAI from 'openai';
import { LLMGuard, withBudgetGuard } from 'llm-spend-guard';
const guard = new LLMGuard({
dailyBudgetTokens: 200_000,
userBudgetTokens: 20_000,
autoTruncate: true,
});
const openai = new OpenAI();
guard.wrapOpenAI(openai);
async function handler(req: any, res: any) {
const response = await guard.openai.chat(
{
model: 'gpt-4o',
messages: req.body.messages,
max_tokens: 1000,
},
req.llmBudgetContext, // Auto-populated by withBudgetGuard
);
res.status(200).json(response);
}
// Wraps handler with budget enforcement + auto 429 on exceeded
export default withBudgetGuard(guard, handler);
Fastify / Koa / Hono
No built-in middleware for these, but integration is trivial since the guard is framework-agnostic:
// Fastify example
fastify.post('/api/chat', async (request, reply) => {
try {
const response = await guard.openai.chat(
{
model: 'gpt-4o',
messages: request.body.messages,
max_tokens: 1000,
},
{
userId: request.headers['x-user-id'] as string,
sessionId: request.headers['x-session-id'] as string,
route: request.url,
},
);
return response;
} catch (err) {
if (err instanceof BudgetExceededError) {
reply.status(429).send({ error: 'Budget exceeded', details: err.stats });
return;
}
throw err;
}
});
SaaS Per-User Budget Example
For multi-tenant SaaS apps where each user has their own token budget:
import { LLMGuard, RedisStorage } from 'llm-spend-guard';
import Anthropic from '@anthropic-ai/sdk';
import Redis from 'ioredis';
const guard = new LLMGuard({
userBudgetTokens: 10_000, // 10K tokens per user per day
dailyBudgetTokens: 1_000_000, // 1M total across all users
maxTokensPerRequest: 5_000,
autoTruncate: true,
storage: new RedisStorage(new Redis()),
onBudgetWarning(level, stats) {
if (stats.scope === 'user' && level === 'warning_80') {
// Notify user they're running low
notifyUser(stats.scopeKey.replace('user:', ''), {
message: `You've used ${stats.percentage.toFixed(0)}% of your daily AI quota.`,
remaining: stats.remaining,
});
}
},
});
const anthropic = new Anthropic();
guard.wrapAnthropic(anthropic);
// In your API handler:
async function handleChat(userId: string, messages: any[]) {
return guard.anthropic.chat(
{
model: 'claude-sonnet-4-20250514',
messages,
max_tokens: 1000,
},
{ userId },
);
}
Full API Reference
LLMGuard
| Method | Returns | Description |
|--------|---------|-------------|
| new LLMGuard(config) | LLMGuard | Create a guard instance |
| wrapOpenAI(client) | OpenAIProvider | Wrap an OpenAI SDK client |
| wrapAnthropic(client) | AnthropicProvider | Wrap an Anthropic SDK client |
| wrapGemini(client) | GeminiProvider | Wrap a Google Generative AI client |
| guard.openai | OpenAIProvider | Access the wrapped OpenAI provider |
| guard.anthropic | AnthropicProvider | Access the wrapped Anthropic provider |
| guard.gemini | GeminiProvider | Access the wrapped Gemini provider |
| getStats(ctx?) | Promise<BudgetStats[]> | Get usage stats for all applicable scopes |
| getRemainingBudget(ctx?) | Promise<number> | Get minimum remaining tokens across scopes |
| reset(ctx?) | Promise<void> | Reset usage counters |
| getBudgetManager() | BudgetManager | Access the underlying budget manager |
Provider .chat() Method
All providers (OpenAI, Anthropic, Gemini) have the same interface (a usage sketch follows the parameter table):
await guard.openai.chat(params, context?)
| Parameter | Type | Description |
|-----------|------|-------------|
| params.model | string | Model name (e.g. 'gpt-4o', 'claude-sonnet-4-20250514') |
| params.messages | ChatMessage[] | Array of { role, content } messages |
| params.max_tokens | number | Max output tokens (default: 4096) |
| context.userId | string? | User identifier for per-user budgets |
| context.sessionId | string? | Session identifier for per-session budgets |
| context.route | string? | Route/endpoint for per-route budgets |
BudgetStats Object
{
scope: 'global' | 'user' | 'session' | 'route',
scopeKey: string, // e.g. "daily", "user:user-123"
used: number, // tokens consumed
limit: number, // budget cap
remaining: number, // tokens left
percentage: number // 0-100+
}
BudgetExceededError
err.message // Human-readable error string
err.stats // BudgetStats object with full details
err.name     // 'BudgetExceededError'
Exports
// Core
import { LLMGuard, BudgetManager, BudgetExceededError } from 'llm-spend-guard';
// Providers
import { OpenAIProvider, AnthropicProvider, GeminiProvider } from 'llm-spend-guard';
// Storage
import { MemoryStorage, RedisStorage } from 'llm-spend-guard';
// Middleware
import { expressMiddleware, budgetErrorHandler, withBudgetGuard } from 'llm-spend-guard';
// Utilities
import { estimateTokens, estimateMessagesTokens, truncateMessages } from 'llm-spend-guard';
// Types
import type {
GuardConfig, BudgetConfig, BudgetStats, BudgetScope,
AlertLevel, StorageAdapter, ScopeUsage, RequestContext,
ChatMessage, TokenEstimatorFn,
} from 'llm-spend-guard';
Comparison with Alternatives
| Feature | llm-spend-guard | Manual tracking | OpenAI Usage Limits |
|---------|----------------|-----------------|---------------------|
| Pre-request blocking | Yes | No | No (post-hoc only) |
| Multi-provider support | OpenAI + Claude + Gemini | Manual per SDK | OpenAI only |
| Per-user budgets | Built-in | Build yourself | No |
| Per-session / per-route scopes | Built-in | Build yourself | No |
| Auto-truncation | Yes | No | No |
| Express/Next.js middleware | Built-in | Build yourself | No |
| Redis support | Built-in | Build yourself | No |
| Self-hosted | Yes | Yes | No (vendor dashboard) |
Running Tests
git clone <repo-url>
cd llm-spend-guard
npm install
npm test
108 tests (99% coverage) covering:
- Budget overflow and enforcement (global, daily, per-request limits)
- Per-user, per-session, per-route scopes
- Token estimation accuracy (tiktoken + heuristic)
- Context truncation logic (system messages, binary search trimming)
- All provider wrappers — OpenAI, Anthropic, Gemini (mocked, no API keys needed)
- Auto-truncation across all providers
- Alert callback firing and deduplication
- Guard lifecycle (create, wrap, reset)
- Express middleware and Next.js wrapper
- Error handling (BudgetExceededError, budget error handler)
- Storage backends (MemoryStorage, RedisStorage with mock)
Contributing
We welcome contributions! Please read the Contributing Guide before submitting a PR.
Look for issues labeled good first issue to get started.
Security
To report vulnerabilities, please see our Security Policy.
Support
If this package helps you, consider supporting its development.