throttlekit-server
v0.4.2
gRPC service door for ThrottleKit — run the rate-limiting core as a network service for polyglot clients.
Beyond rate limiting — over the wire. The gRPC service door for ThrottleKit: run the proven core that governs rate, concurrency, and cost — its GALE (provable distributed leasing) and TALE (LLM token-budget escrow) engines — as a network service so polyglot clients (Python, Go, …) get decisions identical to an embedded Node library, without re-implementing any algorithm or touching the raw Lua wire.
One server now carries the whole fleet story: the four distributed features reach any client over the
existing decision RPCs with no client change (federation, fleet token-budget, distributed concurrency,
cross-region fair escrow); a very high-throughput client can lease a slice of the global budget through
the additive Fleet door and spend it locally; every policy is observable through ThrottleKit Lens (a
zero-dependency terminal dashboard) and the read-only Monitor door (gRPC + Prometheus /metrics); and
you can plan a limit change against recorded traffic before you ship it.
Status: 0.4.0. The gRPC decision contract is stable and conformance-tested against the golden vectors (a polyglot client's decisions are byte-identical to the embedded library). The wire evolves additively only: the Monitor and Fleet services were added under throttlekit.v1, machine-gated by buf breaking in CI; the decision messages never change. This server depends only on the throttlekit core's public API; it adds no surface to the core and keeps its zero-runtime-dependency promise intact. Monitoring, the Fleet lease, decision capture, and What-If Replay are opt-in / @experimental.
Why a service (not a port)
The whole ThrottleKit design rests on one invariant: exactly one thing computes a Decision — the
Node core, directly or as Lua-in-Redis. The service exposes that core over gRPC, so a client is a trivial
RPC stub instead of a second rate-limiter to keep in sync. A rate-limit denial is a normal Decision
(allowed: false), never an RPC error; errors are reserved for operational faults (unknown policy →
NOT_FOUND, unsupported op → UNIMPLEMENTED).
This is the door we lead with for non-Node languages: the in-process ~169 ns number doesn't transfer to CPython, so the network-bound service is where the value is.
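A client-side sketch of that contract, under stated assumptions: the `Decision` and `CallResult` shapes and the `classify` helper are hypothetical illustrations, not the generated stub's types. Only the numeric values are fixed facts: 5 and 12 are the standard gRPC status codes for NOT_FOUND and UNIMPLEMENTED.

```typescript
// Hypothetical shapes for illustration; the real types come from throttlekit.proto.
interface Decision {
  allowed: boolean;
  retryAfterMs: number;
}

// Standard gRPC status codes.
const NOT_FOUND = 5;
const UNIMPLEMENTED = 12;

type CallResult =
  | { kind: "decision"; decision: Decision }
  | { kind: "error"; code: number };

type Outcome = "proceed" | "throttled" | "misconfigured" | "operational-fault";

// A denial arrives as a normal Decision; only operational faults arrive as errors.
function classify(result: CallResult): Outcome {
  if (result.kind === "decision") {
    return result.decision.allowed ? "proceed" : "throttled";
  }
  if (result.code === NOT_FOUND || result.code === UNIMPLEMENTED) {
    // Unknown policy or unsupported op: a caller/config bug, retrying will not help.
    return "misconfigured";
  }
  return "operational-fault";
}
```

Keeping "throttled" out of the error path means retry and alerting logic only ever sees genuine faults.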
Run it
throttlekit-server --config .throttlekit.yaml --port 50051

# .throttlekit.yaml
version: 1
limiters:
  api: { strategy: gcra, limit: 100, period: 1m, burst: 20 }
  uploads: { strategy: fixedWindow, limit: 10, period: 1h }

A client sends Check { policy: "api", key: apiKey, cost: 1 } and reads back a Decision.
By default each policy uses an in-process memory store (correct for a single instance). Point every instance at the same Redis to run a coordinated fleet enforcing one shared limit:
throttlekit-server --config .throttlekit.yaml --redis redis://redis:6379

…or back the fleet with Postgres instead: no Redis required, the same shared-store guarantee, and decisions stay bit-identical (the core's pure transform runs inside the store, server-side):

throttlekit-server --config .throttlekit.yaml --postgres-url postgres://user:pass@db:5432/app

…or DynamoDB (--dynamodb-create-table provisions the single-pk table on first run):

throttlekit-server --config .throttlekit.yaml \
  --store dynamodb --dynamodb-table throttlekit --dynamodb-create-table

Backing stores
The server can host any of the core's exact rate stores. The decision always runs in the core, so
every backend yields bit-identical decisions — the store only transports state. Select one with --store,
or let it infer from which URL flag you pass:
| Store | How | Notes |
|---|---|---|
| Memory | default (no flag) | per-policy, in-process — single instance only |
| Redis | --redis <url> | shared fleet store; one atomic Lua round trip |
| Postgres | --postgres-url <url> | shared fleet store, no Redis required; per-key advisory-lock atomicity |
| DynamoDB | --store dynamodb --dynamodb-table <t> | shared fleet store, no Redis required; version-CAS atomicity + native TTL |
DenoKV and Cloudflare (D1 / Durable Objects / Workers KV) are edge-runtime stores — they bind to APIs that don't exist in Node, so they can't back a Node
throttlekit-server. Reach them by running ThrottleKit inside those runtimes, not through this service door.
Two-tier leasing (cut the per-request round trip)
A policy can carry a twoTier block to be served as a two-tier leased limiter: each instance leases a
batch of tokens from the shared L2 (Redis) and then admits locally until the batch runs low — trading one
Redis round trip per batch requests for a bounded, self-healing overshoot (≤ fleet × (batch − 1) per
window, or exactly the limit with windowCoupled). The client reaches it with a plain check —
no new RPC, the core still computes every decision.
version: 1
limiters:
  leased-api:
    strategy: gcra # the same algorithm/fields as a plain policy, enforced at L2
    limit: 1000
    period: 1m
    twoTier: # ← a nested *block* (the config parser does not accept nested flow `{…}`)
      mode: leased # strict | cached-deny | leased
      batch: 50 # tokens leased from L2 per refill
      windowCoupled: true # tie credit lifetime to the L2 window ⇒ per-window overshoot = limit

Without --redis a twoTier policy falls back to a private in-process L2 (single-instance, same as a
plain policy); point the fleet at one Redis to share the budget. peek/forecast aren't offered on a
leased policy (it is consume-only) — they return UNIMPLEMENTED.
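To make the overshoot bound concrete, here is a small arithmetic sketch. `maxOvershoot` is a hypothetical helper, not part of ThrottleKit, and it reads "exactly the limit with windowCoupled" as zero overshoot, which is an interpretation of the text above.

```typescript
// Worst-case admits above the limit per window for a leased two-tier policy,
// using the bound stated above: fleet × (batch − 1), or 0 when windowCoupled.
function maxOvershoot(fleet: number, batch: number, windowCoupled: boolean): number {
  if (!Number.isInteger(fleet) || !Number.isInteger(batch) || fleet < 1 || batch < 1) {
    throw new RangeError("fleet and batch must be positive integers");
  }
  return windowCoupled ? 0 : fleet * (batch - 1);
}

// Four instances leasing 50 tokens per refill against a limit of 1000:
const uncoupledOvershoot = maxOvershoot(4, 50, false); // 4 × 49 = 196
const coupledOvershoot = maxOvershoot(4, 50, true); // 0
```

A smaller batch tightens the bound at the price of more L2 round trips, so the two knobs are worth tuning together.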
Token budgets (the cost axis)
For post-hoc costs you only learn after a request runs — the LLM-gateway problem, where a completion's
token count isn't known until it streams — a policy can be a tokenBudget meter, served via the Debit
RPC. The client debits the actual tokens as they are produced; a debit is admitted while budget
remains, and the meter stops on the token that crosses the limit (per-token debiting overshoots by 0).
version: 1
limiters:
  completions:
    tokenBudget: # ← a block, not a strategy: this policy is a meter, served via Debit
      budget: 100000 # tokens per window, per key
      windowMs: 60000

A client calls Debit { policy: "completions", key: tenant, tokens: n } per chunk. The service keeps one
meter per key (bounded by maxKeys, default 100k). A tokenBudget meter is process-local (each instance
counts independently); for one budget shared across the fleet, use fleetBudget (next section). check on a
token-budget policy — and debit on a rate limiter — return UNIMPLEMENTED.
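The metering rule can be pictured with a toy in-memory meter. This is a sketch under assumptions, not the core's implementation: `ToyTokenMeter` is hypothetical, it uses fixed windows aligned to multiples of `windowMs`, and it assumes a debit is admitted whenever some budget remains at the moment of the call.

```typescript
// Toy meter for illustration only. Assumption: a debit is admitted while any
// budget remains, so only multi-token debits can cross the line.
class ToyTokenMeter {
  private readonly used = new Map<string, { windowStart: number; tokens: number }>();
  private readonly budget: number;
  private readonly windowMs: number;

  constructor(budget: number, windowMs: number) {
    this.budget = budget;
    this.windowMs = windowMs;
  }

  debit(key: string, tokens: number, nowMs: number): boolean {
    const windowStart = Math.floor(nowMs / this.windowMs) * this.windowMs;
    let entry = this.used.get(key);
    if (entry === undefined || entry.windowStart !== windowStart) {
      entry = { windowStart, tokens: 0 }; // a new window starts with a fresh budget
      this.used.set(key, entry);
    }
    if (entry.tokens >= this.budget) return false; // budget exhausted: deny
    entry.tokens += tokens;
    return true;
  }
}
```

With one token per debit and a budget of 3, the first three debits are admitted and the fourth is denied, so the overshoot is 0; debiting in larger chunks would let the final admitted chunk run past the budget under this model.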
Fleet token budgets (one budget across the whole fleet)
A tokenBudget meter counts per instance. To enforce one token budget across every server instance — the
same fleet promise the shared store already gives rate limits and two-tier leasing — use a fleetBudget block.
It is the same cost axis served by the same Debit RPC (no client change, no wire change), but each per-key
counter lives in the shared store (--redis / --postgres / …) and is debited atomically, so the budget holds
no matter how many instances point at it.
version: 1
limiters:
  completions:
    fleetBudget: # like tokenBudget, but ONE budget shared across every instance on the same store
      budget: 1000000 # tokens per window, per key, fleet-wide
      windowMs: 60000

Run two instances against one --redis and a client's Debit { policy: "completions", key: tenant } calls are
metered against a single global budget. Key-semantics (read this): the request key selects which budget
— each distinct key is an independent counter at store key "<prefix>:<key>", the prefix defaulting to the
policy name. Two instances coordinate iff they resolve the same store key, which same-config instances do
automatically; set an explicit prefix only to deliberately share one budget across differently-named policies.
Without a shared store a fleetBudget policy is process-local — identical to tokenBudget — so it is correct on
a single instance and becomes fleet-coordinated the moment you add the store. check on a fleetBudget policy
returns UNIMPLEMENTED (it is a meter, like tokenBudget).
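The key rule above fits in one line of code. `fleetBudgetStoreKey` is a hypothetical helper written for illustration; only the `"<prefix>:<key>"` layout and the default prefix come from the text.

```typescript
// Resolve the shared-store key for a fleetBudget counter: "<prefix>:<key>",
// where the prefix defaults to the policy name.
function fleetBudgetStoreKey(policyName: string, requestKey: string, prefix?: string): string {
  return `${prefix ?? policyName}:${requestKey}`;
}

// Two same-config instances resolve the same key, so they share one counter.
const onInstanceA = fleetBudgetStoreKey("completions", "tenant-42");
const onInstanceB = fleetBudgetStoreKey("completions", "tenant-42");

// An explicit prefix deliberately merges differently-named policies into one budget.
const mergedChat = fleetBudgetStoreKey("chat", "tenant-42", "llm");
const mergedEmbed = fleetBudgetStoreKey("embed", "tenant-42", "llm");
```

Two policies that should stay independent must therefore not share a prefix, or their counters collide.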
Cross-region federation (one global rate limit across regions)
A plain rate-limit policy on a shared store already coordinates a fleet, but each cross-region trip pays the
full store round-trip. A federated block instead enforces one global per-window budget across regions
through a cross-region coordinator (the core's federate()), served over the same Check RPC (no
client change, no wire change). Each instance leases a slice of the global budget from the coordinator, so
the fleet admits at most the strategy's limit per window — regardless of region or instance count.
version: 1
limiters:
  global-api:
    federated: { batch: 16 } # ← cross-region: one global budget, leased per region
    strategy: fixedWindow # MUST be window-coupled (fixedWindow / slidingWindow / fixed-cadence quota)
    limit: 10000
    period: 1m

Run it with --redis (or --postgres) and --region <id> (or TK_REGION; default "default"); the
coordinator lives in that shared store. A client's Check { policy: "global-api", key } is then bound by the
one global budget. Constraints (enforced at load, fail-fast): the strategy must have a discrete window
— gcra / tokenBucket are rejected (a continuous rate has no window boundary to couple to), as is a
calendar-cadence quota; and a coordinator store is required (memory / dynamodb cannot federate). Peek /
Forecast are UNIMPLEMENTED on a federated policy (it is async + window-based). The coordinator's global
budget is the strategy's limit; batch (default 16) trades cross-region round-trips for some unused
capacity under skew — which does not add overshoot under window-coupling, only affects utilization.
Tier-2 fleet leasing (Fleet.Reserve — lease a chunk, spend it locally)
A per-request Check/Debit round trip is the bottleneck for a very high-throughput client. The Fleet
door (throttlekit.v1.Fleet / Reserve) hands such a client a chunk of a federated: policy's global
per-window budget to spend locally, so it round-trips only to refresh — not once per request:
Reserve { policy: "global-api", caller: { domain: "acme" }, wants: 200 }
→ Lease { capacity: 200, expiry_ms, refresh_interval_ms, safe_capacity, retry_after_ms, limit }

The server is the one oracle: it computes the grant size via the policy's federation coordinator (a
partial grant — capacity may be < wants — is legitimate; the grant is window-coupled and discarded at
expiry_ms). The client spends it with the core LeaseSpender (throttlekit/twotier) — a verbatim port of
the leased-L1 spend, pinned byte-for-byte by the golden lease vectors — and surfaces the server's denial
when capacity is 0 (it never invents one). caller.domain selects which budget to lease (a tenant id);
empty leases the policy as a whole.
The door is served automatically whenever a federated: policy is configured, on the same gRPC port. It
is loopback-only by default (handing out budget is a poisoning vector): set --fleet-secret <s> (or
THROTTLEKIT_FLEET_SECRET) to use it from a remote peer (x-fleet-secret metadata, or
authorization: Bearer <s>), paired with TLS. v1 leases the rate axis; Reserve returns UNIMPLEMENTED
for the concurrency axis and NOT_FOUND for a policy that isn't leasable.
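The local-spend idea can be sketched as follows. This is explicitly not the core `LeaseSpender`: `ToyLeaseSpender`, the camel-cased `Lease` subset, and the `"spent" | "refresh"` result are hypothetical, chosen to show that the client either spends granted capacity or goes back to the server, and never fabricates a denial of its own.

```typescript
// Subset of the Lease reply, camel-cased for illustration (hypothetical shape).
interface Lease {
  capacity: number; // tokens granted for local spending
  expiryMs: number; // absolute time after which the grant is discarded
}

type SpendResult = "spent" | "refresh";

// NOT the core LeaseSpender: a toy that shows the local-spend idea only.
class ToyLeaseSpender {
  private remaining: number;
  private readonly expiryMs: number;

  constructor(lease: Lease) {
    this.remaining = lease.capacity;
    this.expiryMs = lease.expiryMs;
  }

  // Spend locally while the grant is live; otherwise ask the caller to go back
  // to the server, which alone decides whether the request is denied.
  trySpend(nowMs: number, cost: number = 1): SpendResult {
    if (nowMs >= this.expiryMs) {
      this.remaining = 0; // window-coupled grant: unused capacity is discarded
      return "refresh";
    }
    if (cost > this.remaining) return "refresh";
    this.remaining -= cost;
    return "spent";
  }
}
```

On "refresh" the caller would issue another Reserve; if that reply carries capacity 0, the denial it surfaces is the server's.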
Cross-region fair escrow (federatedFairEscrow)
federatedFairEscrow is the cross-region face of fairEscrow: the same weighted-fair split of one
per-window budget across tenants, but the budget L is now global across regions. A store-backed region
pool reserves each region a weighted-max-min slice of L (region weight = Σ its active tenants' weights), and
each region splits its slice across its own tenants — so the fleet's total admits stay ≤ L no matter how
many region instances run. Served over the same Check RPC (the request key is the tenant; no client
change, no wire change) — the fourth of four fleet-distributed features reachable over an existing RPC.
version: 1
limiters:
  gateway:
    federatedFairEscrow:
      limit: 100000 # the GLOBAL per-window budget, shared across regions
      windowMs: 60000
      weights: { team-a: 3, team-b: 1 } # per-tenant weights (default 1)

Run it with --redis and --region <id> (or TK_REGION). Every region instance draws from one shared pool
(keyed by the policy name), so N regions admit ≤ L total — never N × L. Constraints (enforced at
load): it needs --redis (the only backend with a cross-region pool today; memory / postgres /
dynamodb error, pointing you at plain fairEscrow: for a single instance). The decision is the core's
federatedWeightedFairEscrow over a RedisRegionFairPool (one oracle). The Fairness view + Cost Room
light up for it exactly like fairEscrow — each showing this region's granted slice. Peek / debit /
admit are UNIMPLEMENTED. Needs throttlekit@^1.4.0.
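For readers unfamiliar with weighted max-min fairness, here is a generic water-filling sketch of the split. It is a textbook version, not the core's `federatedWeightedFairEscrow`: the `Claim` type, its `demand` field, and `weightedMaxMin` are assumptions made for illustration.

```typescript
// Generic weighted max-min (water-filling) split. Illustrative only: the
// `demand` field and this function are assumptions, not the core's API.
interface Claim {
  id: string;
  weight: number; // relative share, e.g. a tenant's entry in `weights`
  demand: number; // how much of the budget this claimant could actually use
}

function weightedMaxMin(limit: number, claims: Claim[]): Map<string, number> {
  const grants = new Map<string, number>();
  claims.forEach((c) => grants.set(c.id, 0));

  let remaining = limit;
  let active = claims.filter((c) => c.weight > 0 && c.demand > 0);

  while (active.length > 0 && remaining > 0) {
    const totalWeight = active.reduce((sum, c) => sum + c.weight, 0);
    const budget = remaining; // freeze this round's budget for the comparisons
    // Claimants asking for no more than their weighted share are fully satisfied.
    const satisfied = active.filter((c) => c.demand * totalWeight <= budget * c.weight);

    if (satisfied.length === 0) {
      // Everyone left wants more than their share: split what remains by weight.
      active.forEach((c) => grants.set(c.id, (budget * c.weight) / totalWeight));
      return grants;
    }
    satisfied.forEach((c) => {
      grants.set(c.id, c.demand);
      remaining -= c.demand;
    });
    active = active.filter((c) => satisfied.indexOf(c) === -1);
  }
  return grants;
}
```

With a limit of 100 and weights 3:1, two saturated tenants receive 75 and 25; if the weight-1 tenant only needs 10, the unused 15 flows to the other tenant, and the grants always sum to at most the limit.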
Concurrency & unified admission (the in-flight axis)
For limiting concurrent work — not a rate, but how many requests are in flight at once — a policy can
carry a concurrency block. It is served by a stateful lifecycle: Admit takes a slot, Release
returns it, Heartbeat renews long holds. The ceiling is the core's adaptive adaptiveConcurrency
(it grows while latency stays low and contracts under load); pin it with minLimit === maxLimit for a
fixed cap. Add a strategy alongside and the policy becomes a unified rate × concurrency admitter —
the core composes the axes and reports which one bound a denial.
version: 1
limiters:
  checkout: # concurrency-only: at most `maxLimit` requests in flight
    concurrency: { minLimit: 4, maxLimit: 200 }
  api: # unified: rate (gcra) AND concurrency, whichever binds first
    strategy: gcra
    limit: 1000
    period: 1m
    burst: 100
    concurrency: { maxLimit: 64 }

A granted Admit returns a lease_id the caller must Release when the work finishes (pass
dropped: true on a timeout/error so the adaptive limit contracts). If a client crashes without
releasing, the server reclaims the slot once the lease TTL (default 2s) lapses without a heartbeat —
the same crash-safety contract the core uses node↔coordinator, one layer out. check/debit on an
admitter (and admit on a rate limiter / meter) return UNIMPLEMENTED. A plain concurrency block is the
in-process authority for one instance's own clients; for one ceiling across the whole fleet, use
distributedConcurrency (next section) — reached by the same Admit/Release/Heartbeat lifecycle.
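The crash-safety contract can be illustrated with a toy slot table. This sketch assumes a fixed ceiling rather than the adaptive one, and `ToySlotTable` is a hypothetical class, not the server's implementation; the 2000 ms default mirrors the lease TTL quoted above.

```typescript
// Toy slot table: a lease that is neither released nor heartbeaten within
// ttlMs is reclaimed. Fixed ceiling only; the real limit is adaptive.
class ToySlotTable {
  private readonly leases = new Map<string, number>(); // lease id -> deadline (ms)
  private nextId = 1;
  private readonly maxInFlight: number;
  private readonly ttlMs: number;

  constructor(maxInFlight: number, ttlMs: number = 2000) {
    this.maxInFlight = maxInFlight;
    this.ttlMs = ttlMs;
  }

  // Drop every lease whose deadline has passed.
  private reclaim(nowMs: number): void {
    this.leases.forEach((deadline, id) => {
      if (deadline <= nowMs) this.leases.delete(id);
    });
  }

  // Returns a lease id, or null when every slot is held.
  admit(nowMs: number): string | null {
    this.reclaim(nowMs);
    if (this.leases.size >= this.maxInFlight) return null;
    const id = `lease-${this.nextId++}`;
    this.leases.set(id, nowMs + this.ttlMs);
    return id;
  }

  // Renews a live lease; returns false if it was already reclaimed.
  heartbeat(id: string, nowMs: number): boolean {
    this.reclaim(nowMs);
    if (!this.leases.has(id)) return false;
    this.leases.set(id, nowMs + this.ttlMs);
    return true;
  }

  release(id: string): boolean {
    return this.leases.delete(id);
  }
}
```

A client that crashes simply stops heartbeating, and its slot becomes available again once the TTL lapses.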
Fleet-coordinated concurrency (distributedConcurrency)
distributedConcurrency is the fleet-shared face of concurrency: the same adaptive in-flight axis, but
the ceiling is held across every instance on a shared store via the core's
distributedAdaptiveConcurrency. Each node heartbeats its locally-inferred limit to a concurrency
coordinator in the shared store; the coordinator folds the fleet's views into one L_global and hands
each node its share — so N instances admit under one global ceiling, not N × the per-instance one. It
carries every concurrency tuning field (forwarded as each node's local guard) plus the coordinator knobs,
and is served over the same Admit RPC (no client change, no wire change).
version: 1
limiters:
  checkout:
    distributedConcurrency:
      minLimit: 4
      maxLimit: 200 # ← ONE in-flight ceiling of 200 across the whole fleet, not 200 per instance
      aggregate: median # how the fleet folds nodes' limits (median | min); default median

Run it with --redis (or --postgres) and a unique --node-id <id> per process (or TK_NODE_ID;
defaults to host#pid) — a node-id collision corrupts the fleet aggregate, so identity is mandatory. A
coordinator store is required (memory / dynamodb cannot coordinate; the policy errors at load). The admit
path stays local and fast — coordination rides an out-of-band heartbeat, not a per-request round-trip —
and a partitioned node self-fences on lease expiry (onCoordinatorOutage: "local-only" trades the global
bound for availability). The two concurrency leases never merge: the server's per-client Admit lease and
the node↔coordinator heartbeat lease run independently. On shutdown the server leave()s the fleet so peers
reclaim its share immediately.
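The `aggregate` knob can be pictured as a fold over the limits the nodes report. This is a sketch only: `foldFleetLimit` is hypothetical, the lower-middle tie-break for an even node count is an assumption, and how the coordinator then divides L_global into per-node shares is not described here and is not modeled.

```typescript
type Aggregate = "median" | "min";

// Fold the nodes' locally inferred limits into one fleet-wide value.
// Assumption: with an even node count, "median" takes the lower middle value.
function foldFleetLimit(nodeLimits: number[], aggregate: Aggregate = "median"): number {
  if (nodeLimits.length === 0) {
    throw new RangeError("at least one node limit is required");
  }
  if (aggregate === "min") {
    return Math.min(...nodeLimits);
  }
  const sorted = nodeLimits.slice().sort((a, b) => a - b);
  const mid = Math.floor((sorted.length - 1) / 2);
  return sorted[mid];
}
```

For reported limits of 40, 200 and 120, `median` yields 120 while `min` yields 40: the median tolerates one pessimistic node, whereas `min` lets the most constrained node set the ceiling for everyone.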
ThrottleKit Lens — watch it live in the terminal
throttlekit-server --config x.yaml --tui opens ThrottleKit Lens, a built-in, zero-dependency live
dashboard right in your terminal, alongside gRPC — no browser, no metrics backend:
throttlekit-server --config .throttlekit.yaml --tui
# → gRPC on :50051 + a live dashboard (q quit · 1-8/Tab switch · ↑↓ scroll · p pause · r what-if · P plan)

It taps every limiter and unified admitter into an in-process hub (synchronous, exception-swallowing, O(1)
— the gRPC decisions are byte-for-byte unchanged) and renders the full ops board plus live binding-axis
attribution: for a unified policy, which of rate / concurrency / cost (or the joint-LP policy lane) is
throttling each key right now. It works for every policy — a plain gcra limiter gets the board and the
"why throttled" attribution by policy + key; the axis lane lights up for unified admitters.
The dashboard is organized into eight views — press 1–8 or Tab / Shift-Tab to switch:
Overview, Latency (avg / p50 / p99 / max admit-path latency), Fairness (per-tenant
weighted-fair-escrow share), Capacity (per-key spendable + refill ETA), Guarantee (concurrency
headroom to each guard's enforced ceiling + self-fence status), Cost Room (per-tenant cost-axis
burn-down for a fairEscrow policy), Replay (deterministic what-if), and Plan (a whole-config
"terraform plan for limits" — see Policy Plans below). Fairness + Cost Room light up for a fairEscrow
(or federatedFairEscrow) policy (served by check, the key being the tenant); Guarantee lights up for any
concurrency policy (the admitter's guard is surfaced to the dashboard).
A TUI owns the terminal, so it is opt-in and needs an interactive TTY (a non-TTY warns and serves without it). For headless / production monitoring, emit OpenTelemetry → Grafana, or read the same operational state programmatically over the Monitor door (next section).
Read it remotely — the Monitor door
The same operational state the dashboard renders is also a read-only gRPC service
(throttlekit.v1.Monitor), so any language can read it remotely — no terminal, no scraping. It runs on the
same port as the rate limiter and is on by default (--monitor off to disable).
rpc GetSnapshot(GetSnapshotRequest) returns (GetSnapshotResponse); // a point-in-time operational snapshot
rpc Watch(WatchRequest) returns (stream WatchResponse); // a live, filtered denial feed

GetSnapshot returns a typed envelope — per-policy allowed/denied/limit/latency + top keys, concurrency
guard health, the recent denial feed — plus a raw_json field carrying the full dashboard snapshot (cost
rooms, per-axis analytics, replay, custom stats) for depth and forward-compatibility. Watch opens a live
denial stream (optionally filtered to one policy), each event the "why, with numbers" of a rejection. The
stream is rate-capped and backpressured server-side — a slow reader drops events, so the feed never grows
server memory or perturbs the control path (it is best-effort observability, not a durable log — use capture
for that). Both are strictly read-only: they never compute, return, or affect a rate-limit decision.
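The "slow reader drops events" behavior is the usual bounded-buffer pattern. The sketch below is a generic illustration, not the server's code: `BoundedFeed` is hypothetical, and dropping the newest event when full (rather than the oldest) is an assumption.

```typescript
// Bounded, best-effort event feed: memory use is capped by `capacity`, and the
// producer never waits on a slow reader. Assumption: when full, new events drop.
class BoundedFeed<T> {
  private readonly buffer: T[] = [];
  private dropped = 0;
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Called on the producer side; O(1) and never blocks.
  offer(event: T): boolean {
    if (this.buffer.length >= this.capacity) {
      this.dropped += 1;
      return false;
    }
    this.buffer.push(event);
    return true;
  }

  // Called by the reader at its own pace; returns up to `max` buffered events.
  drain(max: number): T[] {
    return this.buffer.splice(0, max);
  }

  droppedCount(): number {
    return this.dropped;
  }
}
```

Because `offer` returns immediately either way, a stalled reader costs dropped events rather than memory or decision latency.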
Auth (the snapshot carries traffic keys = PII). The door is loopback-only by default. To read it from
another host, set a secret with --monitor-secret <s> (or THROTTLEKIT_MONITOR_SECRET) and present it in call
metadata (x-monitor-secret: <s>, or authorization: Bearer <s>); pair it with TLS for confidentiality. A
non-loopback call without the secret is rejected with UNAUTHENTICATED. (The door is not exclusive with --tui: it is
served alongside the dashboard and alongside the decision RPCs. It is not served together with capture in
this version.)
Prometheus /metrics + /healthz. For metrics tooling, add --metrics-port <n> to serve a small HTTP
endpoint: GET /metrics renders the live counters in Prometheus exposition format — per-policy
throttlekit_allowed_total / throttlekit_denied_total, the per-axis throttlekit_denied_by_axis_total
(binding-axis attribution), observed ceiling, p50/p99 admit latency, and concurrency-guard health — and
GET /healthz is a 200 liveness probe. These series are aggregate and PII-free (no per-key data — that
lives only on the authed gRPC door), so the endpoint defaults to loopback and needs no auth;
--metrics-host 0.0.0.0 exposes it (with a warning). It needs the telemetry hub, so run with monitoring on.
gRPC health (grpc.health.v1.Health). The standard gRPC health-checking service is served on the same
port as the decision RPCs — always on, no auth (it reports only SERVING / NOT_SERVING, never traffic
data) — so grpc_health_probe, Kubernetes gRPC liveness/readiness probes, and service meshes work out of the
box. Check returns SERVING for the overall server ("") and each served service (throttlekit.v1.RateLimiter,
and throttlekit.v1.Monitor when the Monitor door is on), NOT_FOUND for an unknown one; Watch streams the
current status. (Its proto is the vendored upstream standard, kept outside the additive-only wire/ contract.)
Decision capture (experimental, opt-in, default-OFF)
Record the server's live decision stream to a durable, redacted, AES-256-GCM-encrypted forensic store —
then investigate it out-of-band with a fail-closed, audited CLI. Capture is opt-in and OFF by default
(it records PII); enable it with a top-level capture: block:
capture:
enabled: true # anything but an explicit true is OFF
redaction:
mode: hmac # hmac | per-trace-salt | drop (keys + tenants are redacted at capture)
secretEnv: TK_CAPTURE_HMAC # hmac needs a secret (prefer an env var over inline)
tenant: { from: key-prefix, delimiter: ":" } # derive the tenant; omit ⇒ counts-only (no per-key rows)
durable:
dir: /var/lib/throttlekit/captures
encryptionKeyHexEnv: TK_CAPTURE_KEY # 32-byte (64-hex) AES-256 key — encryption is mandatory
retention: { ttlMs: 86400000, maxScopes: 1000, ringSize: 10000 }
auth:
operatorSecretEnv: TK_CAPTURE_OP # required for the admin CLI (fail-closed without it)

On start the server prints a loud ⚠ capture ON — recording decisions (PII) banner. Capture is a
post-decision tail — it is O(1), synchronous, exception-swallowing, and bounded, so it can never change,
delay, or break a decision, and a key/tenant flood can't exhaust memory. The flush to disk runs off the
decision path.
Admin CLI (out-of-band, not the gRPC port; every action is audited):
throttlekit-server capture list --config policies.yaml # list segments (decrypted metadata)
throttlekit-server capture export --config policies.yaml --id <id> # → a downstream-replayable trace (leaf-rate)
throttlekit-server capture sweep --config policies.yaml # purge past-TTL segments
# credential via THROTTLEKIT_CAPTURE_CREDENTIAL (preferred) or --credential (visible in `ps`)

What it is, and isn't. Captures are a forensic/audit record: live decisions run over a system
clock, so a captured trace is stamped clock:"system" and is replay-refused by the testkit — export
emits the ReplayTrace JSON for downstream replay/what-if with a testkit-capable build (a deterministic
in-server replay mode is a documented follow-on). Only leaf rate-limit policies project to a replayable
trace; admitter/meter/fair-escrow segments are forensic-only. Keys and tenants are redacted at capture
(full HMAC digest, never the raw value); under hmac an operator locates a tenant by hashing its id with the
secret, under per-trace-salt scopes are opaque and re-salt each server run, under drop identity is erased.
With no tenant rule capture drops to counts-only (per-policy tallies, no per-key rows). Tenant
isolation is only as correct as your tenant rule. Capture is wired in the standard (non---tui) serve path.
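Locating a tenant under `hmac` redaction amounts to recomputing the digest. The sketch below uses Node's built-in `crypto` module and assumes HMAC-SHA-256 over the raw tenant id with hex encoding; the text above does not name the hash function, encoding, or exact input, so treat all three as assumptions, and `redactedScope` as a hypothetical helper.

```typescript
import { createHmac } from "node:crypto";

// Assumptions (not stated in this README): SHA-256, hex output, raw tenant id
// as the message. The digest is deterministic for a given secret.
function redactedScope(tenantId: string, secret: string): string {
  return createHmac("sha256", secret).update(tenantId, "utf8").digest("hex");
}

// An operator holding the secret recomputes the digest, then searches the
// captured segments for it instead of the raw tenant id.
const lookupDigest = redactedScope("team-a", "example-secret-not-for-production");
```

Anyone without the secret cannot link a digest back to a tenant by hashing guesses, which is the reason to prefer `hmac` over a plain hash.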
What-If Replay (experimental, opt-in, default-OFF)
Ask "how many requests would this config change have flipped?" against your real traffic, live in the
--tui dashboard's Replay tab. Enable it with a top-level replay: block (opt-in, OFF by default — it
records redacted keys):
replay:
  enabled: true # anything but an explicit true is OFF
  policies: api, search # leaf-rate policies to shadow (comma-separated; omit ⇒ all leaf-rate)
  maxSteps: 50000 # per-policy recording cap = the memory bound
  redaction: { mode: per-trace-salt } # keys are redacted before entering a shadow (default per-trace-salt)
  candidate: # the what-if the `r` key runs
    policy: api
    set: { limit: 200 } # set / scale / swap — the testkit candidate DSL

Run with --tui, open the Replay tab (7), and press r to replay the configured candidate over the
traffic recorded so far. The pane shows the directional allow↔deny flip ledger — e.g. "42 would flip
(0 allow→deny, 42 deny→allow)" — or an honest empty / truncated / refused state, never a faked number.
How it works — and what it isn't. For each shadowed leaf-rate policy the server runs an isolated
shadow of the live arrival stream through a cold, deterministic (ManualClock) copy of the limiter, built
on the published throttlekit/testkit replay primitives. The shadow is a post-decision tail over its
own store, so it can never change, delay, or break a production decision; it stops recording at maxSteps,
so a distinct-key flood can't exhaust memory (the trace is then honestly flagged truncated and the what-if
refuses rather than understating). The flip count is candidate-spec vs the deterministic baseline over this
traffic shape — not a replay of production's exact decisions (a Redis-backed or warm production node
decides differently from the cold shadow). Keys are redacted before they enter a shadow. Replay is a --tui
feature (the what-if is a keybind); configuring replay: without --tui warns. It is distinct from
capture above: capture is the durable, forensic record; replay is the in-memory, deterministic what-if.
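The directional flip ledger is a step-by-step comparison of two decision sequences over the same arrivals. The sketch below is illustrative: `countFlips` and `FlipLedger` are hypothetical names, and each decision is reduced to its `allowed` flag.

```typescript
interface FlipLedger {
  allowToDeny: number; // baseline allowed, candidate would deny
  denyToAllow: number; // baseline denied, candidate would allow
  total: number;
}

// Compare baseline and candidate decisions for the same recorded arrivals.
function countFlips(baseline: boolean[], candidate: boolean[]): FlipLedger {
  if (baseline.length !== candidate.length) {
    throw new RangeError("both replays must cover the same arrivals");
  }
  let allowToDeny = 0;
  let denyToAllow = 0;
  for (let i = 0; i < baseline.length; i++) {
    const before = baseline[i] === true;
    const after = candidate[i] === true;
    if (before && !after) allowToDeny += 1;
    if (!before && after) denyToAllow += 1;
  }
  return { allowToDeny, denyToAllow, total: allowToDeny + denyToAllow };
}
```

Raising a limit should produce only deny→allow flips, as in the "0 allow→deny, 42 deny→allow" example above; any allow→deny entry for such a change is worth investigating.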
Policy Plans — a "terraform plan" for your limits (experimental)
What-If Replay answers "what would this change flip?" for one policy, live. Policy Plans answers it for your whole config, as a CI-gateable artifact: replay your recorded traffic against a candidate config and read the exact per-policy allow↔deny diff before you deploy.
# diff a candidate config against the current one over recorded traffic
throttlekit-server policy plan \
  --config .throttlekit.yaml --candidate candidate.yaml \
  --corpus traffic.json # or --from-capture to read the durable capture store

# gate it in CI: non-zero exit if the change is too big
throttlekit-server policy plan -c current.yaml --candidate candidate.yaml --from-capture \
  --credential "$TK_CAP" --max-allow-deny 0 --require-replayable

The corpus is either a trace JSON file (e.g. assembled from capture export) or the server's durable
capture store (--from-capture, read through the same fail-closed + audited path as capture — every
leaf-rate segment decrypted, projected, and audited). The plan covers leaf-rate policies; every non-rate
axis (cost meter / concurrency / two-tier / escrow / federated / federatedFairEscrow) is reported
not-replayable ("observe live via attribution"), never scored as a fabricated zero. --json emits the
machine-readable Plan; the --max-allow-deny / --max-deny-allow / --max-flips / --max-keys /
--require-replayable gate exits non-zero past the predicted blast radius. The diff baseline is the current
policy cold-replayed over your arrival timing — not a warm-production comparison (a cold replay can't
reproduce those exact decisions).
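The gate logic can be sketched as a threshold check over the plan's totals. Everything here is a hypothetical illustration: `gateExitCode`, `PlanTotals`, and `GateLimits` are not the CLI's internals, the exit code 1 stands in for "non-zero", and `--max-keys` and `--require-replayable` are left out because their exact semantics are not spelled out above.

```typescript
interface PlanTotals {
  allowToDeny: number;
  denyToAllow: number;
}

// Mirrors --max-allow-deny / --max-deny-allow / --max-flips; unset means "no gate".
interface GateLimits {
  maxAllowDeny?: number;
  maxDenyAllow?: number;
  maxFlips?: number;
}

// Returns 0 when the predicted blast radius is within every configured limit,
// and 1 (an assumed stand-in for "non-zero") otherwise.
function gateExitCode(totals: PlanTotals, limits: GateLimits): 0 | 1 {
  const flips = totals.allowToDeny + totals.denyToAllow;
  if (limits.maxAllowDeny !== undefined && totals.allowToDeny > limits.maxAllowDeny) return 1;
  if (limits.maxDenyAllow !== undefined && totals.denyToAllow > limits.maxDenyAllow) return 1;
  if (limits.maxFlips !== undefined && flips > limits.maxFlips) return 1;
  return 0;
}
```

Gating with `maxAllowDeny: 0` alone lets a limit increase through (it only produces deny→allow flips) while blocking any change that would start rejecting traffic that is accepted today.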
You can also run a whole-config plan live in the --tui: start with --plan-candidate <config> (plus an
enabled replay: block for the corpus), open the Plan tab (8), and press P to diff the candidate
against the running config over the shadow-recorded traffic. Built on the published core's throttlekit/policy
(^1.4.0); no wire change.
Embed it (Node)
import { readFileSync } from "node:fs";
import { createRateLimiterServiceFromConfig, serve } from "throttlekit-server";
import { RedisStore } from "throttlekit/redis";

// `client` is your own, already-connected Redis client (its construction is not shown here).
const service = createRateLimiterServiceFromConfig(readFileSync(".throttlekit.yaml", "utf8"), {
  store: new RedisStore({ client }), // shared across the fleet
  fail: "closed",
});

const running = await serve({ service, port: 50051 });

// … on shutdown
await running.close();

The contract
The service answers throttlekit.proto (throttlekit.v1.RateLimiter:
Check / CheckMany / Peek / Forecast for rate, Debit for the cost axis, and the stateful
Admit / Release / Heartbeat lifecycle for concurrency / unified admission). It is conformance-tested
end-to-end against the same golden vectors the wire contract is built from: a live in-process
server + client replays every suite and must reproduce the oracle's decisions field-for-field (test/),
and the admission lifecycle is driven over real gRPC (admit / release / heartbeat / crash-reclaim).
Clients
throttlekit-py is the reference client — point its
ServiceBackend at this server. (It also ships a direct RedisBackend that runs the same vendored Lua
straight against Redis, for when you'd rather skip the hop — proven bit-for-bit against the same golden
vectors.) Any language with gRPC can be a client: load throttlekit.proto and call RateLimiter.
Deploy
# fleet mode (shared Redis) + mTLS
throttlekit-server --config .throttlekit.yaml \
  --redis redis://redis:6379 --redis-prefix prod \
  --tls-cert server.crt --tls-key server.key --tls-ca client-ca.crt \
  --fail closed

| Flag | Effect |
|---|---|
| --store <backend> | pick the backend explicitly: memory \| redis \| postgres \| dynamodb (inferred from the URL flags if omitted) |
| --redis <url> | share one Redis store across instances (one fleet-wide limit); omit for in-process memory |
| --redis-prefix <p> | key prefix for the shared Redis store |
| --postgres-url <url> | back the fleet with a shared Postgres store (no Redis required) |
| --postgres-table <t> | table holding limiter state (default throttlekit) |
| --postgres-prefix <p> | key prefix for the shared Postgres store |
| --dynamodb-table <t> | back the fleet with a DynamoDB table (implies --store dynamodb; no Redis required) |
| --dynamodb-region <r> / --dynamodb-endpoint <url> | AWS region / endpoint override (e.g. http://localhost:8000 for dynamodb-local) |
| --dynamodb-prefix <p> | key prefix for the shared DynamoDB store |
| --dynamodb-create-table | create the single-pk table if absent, then wait for it (dev convenience) |
| --tls-cert + --tls-key | serve TLS |
| --tls-ca <ca> | require + verify client certs ⇒ mTLS |
| --fail open\|closed | store-outage policy (default open) |
| --tui | live terminal dashboard alongside gRPC (interactive TTY only; q to quit); see Watch it live |
Container (build from the repo root so the single-source proto in wire/ is bundled):
docker build -f server/Dockerfile -t throttlekit-server .
docker run -p 50051:50051 -v "$PWD/.throttlekit.yaml:/etc/tk.yaml" \
  throttlekit-server --config /etc/tk.yaml --redis redis://host.docker.internal:6379

Failure modes
| Condition | Behavior |
|---|---|
| Rate limit hit | a normal Decision with allowed:false + retryAfterMs — not an RPC error |
| Unknown policy | gRPC NOT_FOUND |
| Op unsupported by the strategy (peek/forecast) | gRPC UNIMPLEMENTED |
| Store (Redis/Postgres/DynamoDB) outage | resolved by --fail: open admits, closed denies (a synthesized Decision) |
| Service unreachable (transport) | the client's call to make — fail-open or fail-closed in your code; a returned Decision is always authoritative |
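
The store-outage row can be sketched as a wrapper around the store-backed check. This is an illustration under assumptions: `decideWithFailPolicy` is hypothetical, the call is synchronous for brevity (a real store call would be asynchronous), and the `synthesized` flag and the `retryAfterMs: 0` placeholder are invented for the example.

```typescript
type FailMode = "open" | "closed";

// Hypothetical shape; `synthesized` marks a Decision produced without the store.
interface Decision {
  allowed: boolean;
  retryAfterMs: number;
  synthesized: boolean;
}

// Run the store-backed check; if the store throws, synthesize a Decision
// according to the fail mode instead of surfacing an RPC error.
function decideWithFailPolicy(check: () => Decision, fail: FailMode): Decision {
  try {
    return check();
  } catch (err) {
    // Store outage: "open" admits, "closed" denies.
    return { allowed: fail === "open", retryAfterMs: 0, synthesized: true };
  }
}
```

The same open-or-closed choice reappears one layer out when the service itself is unreachable, except that there the client code has to make it.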
Security
The default credentials are insecure (loopback/dev only). Front anything exposed with TLS/mTLS
(flags above, or pass grpc.ServerCredentials to serve({ credentials })) so nothing can poison a
shared budget. The server warns on startup if it binds a non-loopback host without TLS.
