throttleai
Lightweight, token-based AI governance for TypeScript
ThrottleAI is a zero-dependency governor for concurrency, rate, and token budgets, with adapters for fetch / OpenAI / tools / Express / Hono.
Every AI application eventually hits a wall: rate limits, blown budgets, noisy tenants, or cascading failures from uncontrolled concurrency. ThrottleAI sits between your code and the model call, enforcing limits with a lease-based protocol that guarantees cleanup — even when things go wrong.
At a glance
| Feature | Description |
|---|---|
| Zero dependencies | Nothing to audit, nothing to break. Pure TypeScript. |
| Lease-based | Acquire before calling, release after. Auto-expire on timeout. No leaked slots. |
| 5 limiters | Concurrency, request rate, token rate, fairness, adaptive tuning — mix and match. |
| 5 adapters | fetch, OpenAI, tool wrapper, Express middleware, Hono middleware — tree-shakeable. |
| 3 presets | quiet(), balanced(), aggressive() — start in seconds, tune later. |
| Observability built in | Structured events, formatted logs, snapshot inspection, stats collector. |
| Test-friendly | Deterministic clock injection, no timers in your test suite. |
| Dual build | ESM + CJS via tsup. Works everywhere Node 18+ runs. |
Install
pnpm add throttleai   # or npm / yarn / bun
60-second quickstart
import { createGovernor, withLease, presets } from "throttleai";
const gov = createGovernor(presets.balanced());
const result = await withLease(
gov,
{ actorId: "user-1", action: "chat" },
async () => await callMyModel(),
);
if (result.granted) {
console.log(result.result);
} else {
console.log("Throttled:", result.decision.recommendation);
}
That's it. The governor enforces concurrency, rate limits, and fairness. Leases auto-expire if you forget to release.
Why ThrottleAI exists
AI applications hit rate limits, blow budgets, and create stampedes. Without governance, a single runaway loop can exhaust your API quota, a noisy tenant can starve everyone else, and a slow upstream can cascade into timeouts across your stack.
ThrottleAI solves this with five composable limiters:
- Concurrency — cap in-flight calls with weighted slots and interactive reserve
- Rate — requests/min and tokens/min with rolling windows
- Fairness — no single actor monopolizes capacity
- Adaptive — auto-tune concurrency based on deny rate and upstream latency
- Leases — acquire before, release after, auto-expire on timeout
You configure what you need and skip the rest. Most apps only need concurrency.
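For example, a minimal concurrency-only setup (a sketch; callMyModel stands in for your own model call, as in the quickstart above):
import { createGovernor, withLease } from "throttleai";
// Only the concurrency limiter is configured; rate, fairness, and adaptive stay off.
const gov = createGovernor({
  concurrency: { maxInFlight: 3 },
});
const result = await withLease(
  gov,
  { actorId: "user-1", action: "chat" },
  async () => callMyModel(), // placeholder for your own model call
);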
Choose your limiter
| Limiter | What it caps | When to use |
|---------|--------------|-------------|
| Concurrency | Simultaneous in-flight calls | Always — this is the most important knob |
| Rate | Requests per minute | When the upstream API has a documented rate limit |
| Token rate | Tokens per minute | When you have a per-minute token budget |
| Fairness | Per-actor share of capacity | Multi-tenant apps where one user shouldn't hog slots |
| Adaptive | Auto-tuned concurrency ceiling | When upstream latency is unpredictable |
Start with concurrency. Add rate only if needed. See the Tuning Cheatsheet for scenario-based guidance.
Presets
import { presets } from "throttleai";
// Single user, CLI tools — 1 call at a time, 10 req/min
createGovernor(presets.quiet());
// SaaS backend — 5 concurrent (2 interactive reserve), 60 req/min, fairness
createGovernor(presets.balanced());
// Batch processing — 20 concurrent, 300 req/min, fairness + adaptive tuning
createGovernor(presets.aggressive());
// Override any field
createGovernor({ ...presets.balanced(), leaseTtlMs: 30_000 });
| Preset | maxInFlight | interactiveReserve | req/min | tok/min | Fairness | Adaptive | Best for |
|--------|:-----------:|:---------:|:-------:|:-------:|:--------:|:--------:|----------|
| quiet() | 1 | 0 | 10 | — | No | No | CLI tools, scripts, single-user |
| balanced() | 5 | 2 | 60 | 100K | Yes | No | SaaS backends, API servers |
| aggressive() | 20 | 5 | 300 | 500K | Yes | Yes | Batch pipelines, high-volume |
Common patterns
Server endpoint: 429 vs queue
// Option A: immediate deny with 429
const result = await withLease(gov, request, fn);
// result.granted === false → respond with 429
// Option B: wait with bounded retries
const result = await withLease(gov, request, fn, {
strategy: "wait-then-deny",
maxAttempts: 3,
maxWaitMs: 5_000,
});
UI interactive vs background
// User-facing chat gets priority
gov.acquire({ actorId: "user", action: "chat", priority: "interactive" });
// Background embedding can wait
gov.acquire({ actorId: "pipeline", action: "embed", priority: "background" });With interactiveReserve: 2, background tasks are blocked when only 2 slots remain, keeping those for interactive requests.
Streaming calls
const decision = gov.acquire({ actorId: "user", action: "stream" });
if (!decision.granted) return;
try {
const stream = await openai.chat.completions.create({ stream: true, ... });
for await (const chunk of stream) {
// process chunk
}
gov.release(decision.leaseId, { outcome: "success" });
} catch (err) {
gov.release(decision.leaseId, { outcome: "error" });
throw err;
}
Acquire once, release once — the lease holds for the entire stream duration.
Weighted calls
// Embedding: cheap (weight 1, the default)
gov.acquire({ actorId: "user", action: "embed" });
// GPT-4 with vision: expensive (weight 4 → consumes 4 concurrency slots)
gov.acquire({
actorId: "user",
action: "vision",
estimate: { weight: 4 },
});
Idempotency
// Same key = same lease (no double-acquire)
const d1 = gov.acquire({ actorId: "user", action: "chat", idempotencyKey: "req-123" });
const d2 = gov.acquire({ actorId: "user", action: "chat", idempotencyKey: "req-123" });
// d1.leaseId === d2.leaseId — only one slot consumed
Observability: see why it throttles
import { createGovernor, formatEvent, formatSnapshot } from "throttleai";
const gov = createGovernor({
...presets.balanced(),
onEvent: (e) => console.log(formatEvent(e)),
// [deny] actor=user-1 action=chat reason=concurrency retryAfterMs=500 — All 5 slots in use...
});
// Point-in-time view
console.log(formatSnapshot(gov.snapshot()));
// concurrency=3/5 rate=12/60 leases=3
Stats collector
import { createGovernor, createStatsCollector } from "throttleai";
const stats = createStatsCollector();
const gov = createGovernor({ ...presets.balanced(), onEvent: stats.handler });
// Periodically check metrics
setInterval(() => {
const s = stats.snapshot();
console.log(`deny rate: ${(s.denyRate * 100).toFixed(1)}%, avg latency: ${s.avgLatencyMs.toFixed(0)}ms`);
}, 10_000);
Configuration
createGovernor({
// Concurrency (optional)
concurrency: {
maxInFlight: 5, // max simultaneous weight
interactiveReserve: 1, // slots reserved for interactive priority
},
// Rate limiting (optional)
rate: {
requestsPerMinute: 60, // request-rate cap
tokensPerMinute: 100_000, // token-rate cap
windowMs: 60_000, // rolling window (default 60s)
},
// Advanced (optional)
fairness: true, // prevent actor monopolization
adaptive: true, // auto-tune concurrency from deny rate + latency
strict: true, // throw on double release / unknown ID (dev mode)
// Lease settings
leaseTtlMs: 60_000, // auto-expire (default 60s)
reaperIntervalMs: 5_000, // sweep interval (default 5s)
// Observability
onEvent: (e) => { /* acquire, deny, release, expire, warn */ },
});
API
createGovernor(config): Governor
Factory function. Returns a Governor instance.
governor.acquire(request): AcquireDecision
Request a lease. Returns:
// Granted
{ granted: true, leaseId: string, expiresAt: number }
// Denied
{ granted: false, reason, retryAfterMs, recommendation, limitsHint? }
Deny reasons: "concurrency" | "rate" | "budget" | "policy"
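For example, a minimal sketch of handling a deny using only the fields shown above:
const decision = gov.acquire({ actorId: "user-1", action: "chat" });
if (!decision.granted) {
  // decision.reason is one of "concurrency" | "rate" | "budget" | "policy"
  console.warn(
    `denied (${decision.reason}), retry in ${decision.retryAfterMs}ms:`,
    decision.recommendation,
  );
}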
governor.release(leaseId, report?): void
Release a lease. Always call this — even on errors.
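For example, a sketch pairing a manual acquire with a finally block so the lease is released even when the call throws (doWork is a placeholder for your own call):
const decision = gov.acquire({ actorId: "user-1", action: "summarize" });
if (decision.granted) {
  let outcome: "success" | "error" = "error";
  try {
    await doWork();       // placeholder for your own model or tool call
    outcome = "success";
  } finally {
    gov.release(decision.leaseId, { outcome }); // runs on success and on thrown errors
  }
}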
withLease(governor, request, fn, options?)
Execute fn under a lease with automatic release.
withLease(gov, request, fn, {
  // strategy: "deny" — default, fail immediately
  // strategy: "wait" — retry with backoff until maxWaitMs
  strategy: "wait-then-deny",  // retry up to maxAttempts
  maxWaitMs: 10_000,           // max total wait (default 10s)
  maxAttempts: 3,              // for "wait-then-deny" (default 3)
  initialBackoffMs: 250,       // starting backoff (default 250ms)
});
governor.snapshot(): GovernorSnapshot
Point-in-time state: concurrency, rate, tokens, last deny.
formatEvent(event): string / formatSnapshot(snap): string
One-line human-readable formatters.
createStatsCollector(): StatsCollector
Zero-dep stats collector. Wire to onEvent for grants, denials, outcomes, latency tracking, and deny-rate calculation.
createTestClock(startMs?): Clock
Deterministic clock for testing. Advances manually — no flaky timers.
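A hypothetical sketch of a deterministic test. The clock config field and the advance() method are assumptions, not confirmed by this README — check the Handbook for the exact wiring.
import { createGovernor, createTestClock, presets } from "throttleai";
// Assumption: the governor accepts an injected clock and the test clock exposes advance(ms).
const clock = createTestClock(0);
const gov = createGovernor({ ...presets.quiet(), clock });
gov.acquire({ actorId: "test", action: "chat" });
clock.advance(60_000); // jump past the lease TTL without real timers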
Status getters
gov.activeLeases // active lease count
gov.concurrencyActive // in-flight weight
gov.concurrencyAvailable // remaining capacity
gov.rateCount // requests in current window
gov.tokenRateCount // tokens in current window
governor.dispose(): void
Stop the TTL reaper. Call on shutdown.
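For example, hooked into a shutdown signal (SIGTERM here is just one common choice):
process.on("SIGTERM", () => {
  gov.dispose(); // stop the TTL reaper before the process exits
});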
Adapters
Tree-shakeable wrappers — import only what you use. No runtime deps.
| Adapter | Import | Auto-reports |
|---------|--------|-------------|
| fetch | throttleai/adapters/fetch | outcome (from HTTP status) + latency |
| OpenAI | throttleai/adapters/openai | outcome + latency + token usage |
| Tool | throttleai/adapters/tools | outcome + latency + custom weight |
| Express | throttleai/adapters/express | outcome (from res.statusCode) + latency |
| Hono | throttleai/adapters/hono | outcome + latency |
All adapters return { ok: true, result, latencyMs } on grant (the fetch adapter exposes the Response as response rather than result, as shown below) and { ok: false, decision } on deny.
fetch
import { wrapFetch } from "throttleai/adapters/fetch";
const throttledFetch = wrapFetch(fetch, { governor: gov });
const r = await throttledFetch("https://api.example.com/v1/chat");
if (r.ok) console.log(r.response.status);
OpenAI-compatible
import { wrapChatCompletions } from "throttleai/adapters/openai";
const chat = wrapChatCompletions(openai.chat.completions.create, { governor: gov });
const r = await chat({ model: "gpt-4", messages });
if (r.ok) console.log(r.result.choices[0].message.content);
Tool call
import { wrapTool } from "throttleai/adapters/tools";
const embed = wrapTool(myEmbedFn, { governor: gov, toolId: "embed", costWeight: 2 });
const r = await embed("hello");
if (r.ok) console.log(r.result);
Express
import { throttleMiddleware } from "throttleai/adapters/express";
app.use("/ai", throttleMiddleware({ governor: gov }));
// 429 + Retry-After header + JSON body on deny
See examples/express-adaptive/ for a full runnable server with adaptive tuning.
Hono
import { throttle } from "throttleai/adapters/hono";
app.use("/ai/*", throttle({ governor: gov }));
// 429 JSON on deny, leaseId stored on context
Docs
| Document | What it covers |
|----------|----------------|
| Handbook | End-to-end usage guide: architecture, patterns, production checklist |
| Tuning cheatsheet | Scenario-based config guide, decision tree, knob reference |
| Troubleshooting | Common issues: always denied, stalls, adaptive oscillation |
| API stability | Public vs internal API surface, versioning policy |
| Release manifest | Release process and artifact details |
| Repo hygiene | Asset policy and history rewrite log |
Tuning quick reference
| You see this | Adjust this |
|---|---|
| reason: "concurrency" | Increase maxInFlight or decrease call duration |
| reason: "rate" | Increase requestsPerMinute / tokensPerMinute |
| reason: "policy" (fairness) | Lower softCapRatio or increase maxInFlight |
| High retryAfterMs | Reduce leaseTtlMs so expired leases free faster |
| Background tasks starved | Increase maxInFlight or reduce interactiveReserve |
| Interactive latency high | Increase interactiveReserve |
| Adaptive shrinks too fast | Lower alpha or raise targetDenyRate |
For deeper guidance, see the Tuning Cheatsheet.
Examples
See examples/ for runnable demos:
- express-adaptive/ — full Express server with adaptive tuning + load generator
- node-basic.ts — burst simulation with snapshot printing
- express-middleware.ts — 429 + retry-after endpoint
- cookbook-adapters.ts — all five adapters in action
- cookbook-burst-snapshot.ts — burst load with governor snapshots
- cookbook-interactive-reserve.ts — interactive vs background priority
- cookbook-express-429.ts — 429 vs queue retry pattern
npx tsx examples/node-basic.ts
Stability
ThrottleAI follows Semantic Versioning. The public API — everything exported from throttleai and throttleai/adapters/* — is stable as of v1.0.0. Breaking changes require a major version bump.
For details on what's considered public vs internal, see API stability. For security reporting, see SECURITY.md.
License
MIT
