throttleai
Lightweight, token-based AI governance for TypeScript
ThrottleAI is a zero-dependency governor for concurrency, rate, and token budgets, with adapters for fetch / OpenAI / tools / Express / Hono.
Every AI application eventually hits a wall: rate limits, blown budgets, noisy tenants, or cascading failures from uncontrolled concurrency. ThrottleAI sits between your code and the model call, enforcing limits with a lease-based protocol that guarantees cleanup — even when things go wrong.
At a glance
| Feature | Description |
|---|---|
| Zero dependencies | Nothing to audit, nothing to break. Pure TypeScript. |
| Lease-based | Acquire before calling, release after. Auto-expire on timeout. No leaked slots. |
| 5 limiters | Concurrency, request rate, token rate, fairness, adaptive tuning — mix and match. |
| 5 adapters | fetch, OpenAI, tool wrapper, Express middleware, Hono middleware — tree-shakeable. |
| 3 presets | quiet(), balanced(), aggressive() — start in seconds, tune later. |
| Observability built in | Structured events, formatted logs, snapshot inspection, stats collector. |
| Test-friendly | Deterministic clock injection, no timers in your test suite. |
| Dual build | ESM + CJS via tsup. Works everywhere Node 18+ runs. |
Install
pnpm add throttleai   # or npm / yarn / bun
60-second quickstart
import { createGovernor, withLease, presets } from "throttleai";
const gov = createGovernor(presets.balanced());
const result = await withLease(
gov,
{ actorId: "user-1", action: "chat" },
async () => await callMyModel(),
);
if (result.granted) {
console.log(result.result);
} else {
console.log("Throttled:", result.decision.recommendation);
}
That's it. The governor enforces concurrency, rate limits, and fairness. Leases auto-expire if you forget to release.
Why ThrottleAI exists
AI applications hit rate limits, blow budgets, and create stampedes. Without governance, a single runaway loop can exhaust your API quota, a noisy tenant can starve everyone else, and a slow upstream can cascade into timeouts across your stack.
ThrottleAI solves this with five composable limiters:
- Concurrency — cap in-flight calls with weighted slots and interactive reserve
- Rate — requests/min and tokens/min with rolling windows
- Fairness — no single actor monopolizes capacity
- Adaptive — auto-tune concurrency based on deny rate and upstream latency
- Leases — acquire before, release after, auto-expire on timeout
You configure what you need and skip the rest. Most apps only need concurrency.
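For example, a minimal concurrency-only setup (a sketch; callMyModel stands in for your own model call, as in the quickstart above):
import { createGovernor, withLease } from "throttleai";
// Only the concurrency limiter is configured; rate, fairness, and adaptive stay off.
const gov = createGovernor({
  concurrency: { maxInFlight: 3 },
});
const result = await withLease(
  gov,
  { actorId: "user-1", action: "chat" },
  async () => callMyModel(), // placeholder for your own model call
);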
Choose your limiter
| Limiter | What it caps | When to use |
|---------|--------------|-------------|
| Concurrency | Simultaneous in-flight calls | Always — this is the most important knob |
| Rate | Requests per minute | When the upstream API has a documented rate limit |
| Token rate | Tokens per minute | When you have a per-minute token budget |
| Fairness | Per-actor share of capacity | Multi-tenant apps where one user shouldn't hog slots |
| Adaptive | Auto-tuned concurrency ceiling | When upstream latency is unpredictable |
Start with concurrency. Add rate only if needed. See the Tuning Cheatsheet for scenario-based guidance.
Presets
import { presets } from "throttleai";
// Single user, CLI tools — 1 call at a time, 10 req/min
createGovernor(presets.quiet());
// SaaS backend — 5 concurrent (2 interactive reserve), 60 req/min, fairness
createGovernor(presets.balanced());
// Batch processing — 20 concurrent, 300 req/min, fairness + adaptive tuning
createGovernor(presets.aggressive());
// Override any field
createGovernor({ ...presets.balanced(), leaseTtlMs: 30_000 });
| Preset | maxInFlight | interactiveReserve | req/min | tok/min | Fairness | Adaptive | Best for |
|--------|:-----------:|:---------:|:-------:|:-------:|:--------:|:--------:|----------|
| quiet() | 1 | 0 | 10 | — | No | No | CLI tools, scripts, single-user |
| balanced() | 5 | 2 | 60 | 100K | Yes | No | SaaS backends, API servers |
| aggressive() | 20 | 5 | 300 | 500K | Yes | Yes | Batch pipelines, high-volume |
Common patterns
Server endpoint: 429 vs queue
// Option A: immediate deny with 429
const result = await withLease(gov, request, fn);
// result.granted === false → respond with 429
// Option B: wait with bounded retries
const result = await withLease(gov, request, fn, {
strategy: "wait-then-deny",
maxAttempts: 3,
maxWaitMs: 5_000,
});
UI interactive vs background
// User-facing chat gets priority
gov.acquire({ actorId: "user", action: "chat", priority: "interactive" });
// Background embedding can wait
gov.acquire({ actorId: "pipeline", action: "embed", priority: "background" });With interactiveReserve: 2, background tasks are blocked when only 2 slots remain, keeping those for interactive requests.
Streaming calls
const decision = gov.acquire({ actorId: "user", action: "stream" });
if (!decision.granted) return;
try {
const stream = await openai.chat.completions.create({ stream: true, ... });
for await (const chunk of stream) {
// process chunk
}
gov.release(decision.leaseId, { outcome: "success" });
} catch (err) {
gov.release(decision.leaseId, { outcome: "error" });
throw err;
}
Acquire once, release once — the lease holds for the entire stream duration.
Weighted calls
// Embedding: cheap (weight 1, the default)
gov.acquire({ actorId: "user", action: "embed" });
// GPT-4 with vision: expensive (weight 4 → consumes 4 concurrency slots)
gov.acquire({
actorId: "user",
action: "vision",
estimate: { weight: 4 },
});
Idempotency
// Same key = same lease (no double-acquire)
const d1 = gov.acquire({ actorId: "user", action: "chat", idempotencyKey: "req-123" });
const d2 = gov.acquire({ actorId: "user", action: "chat", idempotencyKey: "req-123" });
// d1.leaseId === d2.leaseId — only one slot consumed
Observability: see why it throttles
import { createGovernor, formatEvent, formatSnapshot } from "throttleai";
const gov = createGovernor({
...presets.balanced(),
onEvent: (e) => console.log(formatEvent(e)),
// [deny] actor=user-1 action=chat reason=concurrency retryAfterMs=500 — All 5 slots in use...
});
// Point-in-time view
console.log(formatSnapshot(gov.snapshot()));
// concurrency=3/5 rate=12/60 leases=3
Stats collector
import { createGovernor, createStatsCollector } from "throttleai";
const stats = createStatsCollector();
const gov = createGovernor({ ...presets.balanced(), onEvent: stats.handler });
// Periodically check metrics
setInterval(() => {
const s = stats.snapshot();
console.log(`deny rate: ${(s.denyRate * 100).toFixed(1)}%, avg latency: ${s.avgLatencyMs.toFixed(0)}ms`);
}, 10_000);
Configuration
createGovernor({
// Concurrency (optional)
concurrency: {
maxInFlight: 5, // max simultaneous weight
interactiveReserve: 1, // slots reserved for interactive priority
},
// Rate limiting (optional)
rate: {
requestsPerMinute: 60, // request-rate cap
tokensPerMinute: 100_000, // token-rate cap
windowMs: 60_000, // rolling window (default 60s)
},
// Advanced (optional)
fairness: true, // prevent actor monopolization
adaptive: true, // auto-tune concurrency from deny rate + latency
strict: true, // throw on double release / unknown ID (dev mode)
// Lease settings
leaseTtlMs: 60_000, // auto-expire (default 60s)
reaperIntervalMs: 5_000, // sweep interval (default 5s)
// Observability
onEvent: (e) => { /* acquire, deny, release, expire, warn */ },
});
API
createGovernor(config): Governor
Factory function. Returns a Governor instance.
governor.acquire(request): AcquireDecision
Request a lease. Returns:
// Granted
{ granted: true, leaseId: string, expiresAt: number }
// Denied
{ granted: false, reason, retryAfterMs, recommendation, limitsHint? }
Deny reasons: "concurrency" | "rate" | "budget" | "policy"
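For example, a minimal sketch of handling a deny using only the fields shown above:
const decision = gov.acquire({ actorId: "user-1", action: "chat" });
if (!decision.granted) {
  // decision.reason is one of "concurrency" | "rate" | "budget" | "policy"
  console.warn(
    `denied (${decision.reason}), retry in ${decision.retryAfterMs}ms:`,
    decision.recommendation,
  );
}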
governor.release(leaseId, report?): void
Release a lease. Always call this — even on errors.
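For example, a sketch pairing a manual acquire with a finally block so the lease is released even when the call throws (doWork is a placeholder for your own call):
const decision = gov.acquire({ actorId: "user-1", action: "summarize" });
if (decision.granted) {
  let outcome: "success" | "error" = "error";
  try {
    await doWork();       // placeholder for your own model or tool call
    outcome = "success";
  } finally {
    gov.release(decision.leaseId, { outcome }); // runs on success and on thrown errors
  }
}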
withLease(governor, request, fn, options?)
Execute fn under a lease with automatic release.
withLease(gov, request, fn, {
  // strategy: "deny" — default, fail immediately
  // strategy: "wait" — retry with backoff until maxWaitMs
  strategy: "wait-then-deny",  // retry up to maxAttempts
  maxWaitMs: 10_000,           // max total wait (default 10s)
  maxAttempts: 3,              // for "wait-then-deny" (default 3)
  initialBackoffMs: 250,       // starting backoff (default 250ms)
});
governor.snapshot(): GovernorSnapshot
Point-in-time state: concurrency, rate, tokens, last deny.
formatEvent(event): string / formatSnapshot(snap): string
One-line human-readable formatters.
createStatsCollector(): StatsCollector
Zero-dep stats collector. Wire to onEvent for grants, denials, outcomes, latency tracking, and deny-rate calculation.
createTestClock(startMs?): Clock
Deterministic clock for testing. Advances manually — no flaky timers.
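A hypothetical sketch of a deterministic test. The clock config field and the advance() method are assumptions, not confirmed by this README — check the Handbook for the exact wiring.
import { createGovernor, createTestClock, presets } from "throttleai";
// Assumption: the governor accepts an injected clock and the test clock exposes advance(ms).
const clock = createTestClock(0);
const gov = createGovernor({ ...presets.quiet(), clock });
gov.acquire({ actorId: "test", action: "chat" });
clock.advance(60_000); // jump past the lease TTL without real timers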
Status getters
gov.activeLeases // active lease count
gov.concurrencyActive // in-flight weight
gov.concurrencyAvailable // remaining capacity
gov.rateCount // requests in current window
gov.tokenRateCount // tokens in current window
governor.dispose(): void
Stop the TTL reaper. Call on shutdown.
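For example, hooked into a shutdown signal (SIGTERM here is just one common choice):
process.on("SIGTERM", () => {
  gov.dispose(); // stop the TTL reaper before the process exits
});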
Adapters
Tree-shakeable wrappers — import only what you use. No runtime deps.
| Adapter | Import | Auto-reports |
|---------|--------|-------------|
| fetch | throttleai/adapters/fetch | outcome (from HTTP status) + latency |
| OpenAI | throttleai/adapters/openai | outcome + latency + token usage |
| Tool | throttleai/adapters/tools | outcome + latency + custom weight |
| Express | throttleai/adapters/express | outcome (from res.statusCode) + latency |
| Hono | throttleai/adapters/hono | outcome + latency |
All adapters return { ok: true, result, latencyMs } on grant (the fetch adapter exposes the Response as response rather than result, as shown below) and { ok: false, decision } on deny.
fetch
import { wrapFetch } from "throttleai/adapters/fetch";
const throttledFetch = wrapFetch(fetch, { governor: gov });
const r = await throttledFetch("https://api.example.com/v1/chat");
if (r.ok) console.log(r.response.status);
OpenAI-compatible
import { wrapChatCompletions } from "throttleai/adapters/openai";
const chat = wrapChatCompletions(openai.chat.completions.create, { governor: gov });
const r = await chat({ model: "gpt-4", messages });
if (r.ok) console.log(r.result.choices[0].message.content);
Tool call
import { wrapTool } from "throttleai/adapters/tools";
const embed = wrapTool(myEmbedFn, { governor: gov, toolId: "embed", costWeight: 2 });
const r = await embed("hello");
if (r.ok) console.log(r.result);
Express
import { throttleMiddleware } from "throttleai/adapters/express";
app.use("/ai", throttleMiddleware({ governor: gov }));
// 429 + Retry-After header + JSON body on deny
See examples/express-adaptive/ for a full runnable server with adaptive tuning.
Hono
import { throttle } from "throttleai/adapters/hono";
app.use("/ai/*", throttle({ governor: gov }));
// 429 JSON on deny, leaseId stored on context
Docs
| Document | What it covers |
|----------|----------------|
| Handbook | End-to-end usage guide: architecture, patterns, production checklist |
| Tuning cheatsheet | Scenario-based config guide, decision tree, knob reference |
| Troubleshooting | Common issues: always denied, stalls, adaptive oscillation |
| API stability | Public vs internal API surface, versioning policy |
| Release manifest | Release process and artifact details |
| Repo hygiene | Asset policy and history rewrite log |
Tuning quick reference
| You see this | Adjust this |
|---|---|
| reason: "concurrency" | Increase maxInFlight or decrease call duration |
| reason: "rate" | Increase requestsPerMinute / tokensPerMinute |
| reason: "policy" (fairness) | Lower softCapRatio or increase maxInFlight |
| High retryAfterMs | Reduce leaseTtlMs so expired leases free faster |
| Background tasks starved | Increase maxInFlight or reduce interactiveReserve |
| Interactive latency high | Increase interactiveReserve |
| Adaptive shrinks too fast | Lower alpha or raise targetDenyRate |
For deeper guidance, see the Tuning Cheatsheet.
Examples
See examples/ for runnable demos:
- express-adaptive/ — full Express server with adaptive tuning + load generator
- node-basic.ts — burst simulation with snapshot printing
- express-middleware.ts — 429 + retry-after endpoint
- cookbook-adapters.ts — all five adapters in action
- cookbook-burst-snapshot.ts — burst load with governor snapshots
- cookbook-interactive-reserve.ts — interactive vs background priority
- cookbook-express-429.ts — 429 vs queue retry pattern
npx tsx examples/node-basic.ts
Stability
ThrottleAI follows Semantic Versioning. The public API — everything exported from throttleai and throttleai/adapters/* — is stable as of v1.0.0. Breaking changes require a major version bump.
For details on what's considered public vs internal, see API stability. For security reporting, see SECURITY.md.
License
MIT
