llm-retry-kit
v0.4.0
Resilience toolkit for LLM APIs with retries, streaming, provider fallback, circuit breakers, adaptive hedging, rolling budgets, aborts, and observability.
Small resilience layer for production LLM calls. llm-retry-kit gives you
provider-aware retries, fallback chains, jittered exponential backoff,
Retry-After handling, streaming retries, circuit breakers, adaptive hedged
requests, rolling budget windows, cancellation, timeouts, and observability
hooks without runtime dependencies.
npm install llm-retry-kit
Why llm-retry-kit?
LLM APIs fail in ways that normal API wrappers often do not model well:
- 429 rate limits need backoff, not immediate loops.
- 500, 503, 504, and Anthropic 529 overloaded_error are usually transient and often worth retrying or failing over.
- 400, 401, 403, and request-too-large errors are usually request or credential problems and should not blindly retry or fall back.
- Failed retry attempts can still count toward provider rate limits.
- Production apps need cancellation, budget limits, and logs around every attempt.
This package keeps the core primitive small: you provide the actual SDK call,
and llm-retry-kit manages the reliability policy around it.
Features
- Retry transient LLM failures with exponential backoff and jitter.
- Respect Retry-After headers from provider errors.
- Chain named providers or models with explicit fallback behavior.
- Avoid fallback on non-transient client errors by default.
- Customize retry and fallback decisions with shouldRetry and shouldFallback.
- Track token usage and estimated cost.
- Use custom input/output token pricing through costCalculator.
- Wrap streaming responses with retry-before-first-chunk safety.
- Track partial stream token usage from provider events.
- Skip unhealthy providers with CircuitBreaker.
- Set timeout budgets per provider/model.
- Start hedged requests to reduce tail latency.
- Adapt hedge delays from recent provider latency with AdaptiveHedgeDelay.
- Enforce rolling cost windows with GlobalBudgetTracker.
- Await async stream chunk hooks while protecting the stream from hook errors.
- Pass request meta and payload through every context for logging.
- Abort long calls and retry sleeps with AbortSignal or timeoutMs.
- Observe attempts, retries, success, failure, and budget events.
- Strict TypeScript types.
- ESM package with no runtime dependencies.
Quick Start
import { llmRetry } from 'llm-retry-kit'
import OpenAI from 'openai'
const openai = new OpenAI()
const result = await llmRetry({
fn: async ({ signal }) => {
const response = await openai.chat.completions.create(
{
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: 'Hello!' }],
},
{ signal }
)
return {
data: response.choices[0]?.message.content ?? '',
usage: response.usage
? {
promptTokens: response.usage.prompt_tokens,
completionTokens: response.usage.completion_tokens,
totalTokens: response.usage.total_tokens,
}
: undefined,
}
},
maxRetries: 3,
initialDelayMs: 1000,
maxDelayMs: 30000,
})
console.log(result.data)
console.log(result.provider)
console.log(result.attempts)
console.log(result.totalCostUSD)
Complete Provider Fallback Example
This example tries OpenAI first, then falls back to Anthropic only for transient failures. Client errors like invalid requests or bad credentials stop the chain by default.
import Anthropic from '@anthropic-ai/sdk'
import OpenAI from 'openai'
import { llmRetry } from 'llm-retry-kit'
const openai = new OpenAI()
const anthropic = new Anthropic()
const prompt = 'Summarize the following support ticket...'
const result = await llmRetry({
providers: [
{
name: 'openai:gpt-4o-mini',
maxRetries: 2,
fn: async ({ signal }) => {
const response = await openai.chat.completions.create(
{
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: prompt }],
},
{ signal }
)
return {
data: response.choices[0]?.message.content ?? '',
usage: response.usage
? {
promptTokens: response.usage.prompt_tokens,
completionTokens: response.usage.completion_tokens,
totalTokens: response.usage.total_tokens,
}
: undefined,
}
},
},
{
name: 'anthropic:claude-sonnet',
maxRetries: 1,
fn: async ({ signal }) => {
const response = await anthropic.messages.create(
{
model: 'claude-sonnet-4-6',
max_tokens: 1024,
messages: [{ role: 'user', content: prompt }],
},
{ signal }
)
const text = response.content
.filter((block) => block.type === 'text')
.map((block) => block.text)
.join('')
return {
data: text,
usage: {
promptTokens: response.usage.input_tokens,
completionTokens: response.usage.output_tokens,
totalTokens: response.usage.input_tokens + response.usage.output_tokens,
},
}
},
},
],
timeoutMs: 45_000,
})
console.log({
provider: result.provider,
usedFallback: result.usedFallback,
attempts: result.attempts,
answer: result.data,
})
Streaming
OpenAI and Anthropic both expose streaming APIs, but their event formats and
resume behavior are provider-specific. llm-retry-kit therefore keeps the
stream wrapper provider-agnostic and conservative:
- By default, it retries only if the stream fails before the first chunk.
- After a chunk has been yielded, retrying could duplicate output, so it stops unless you explicitly set retryMode: 'always'.
- Token usage can be tracked from stream events with getChunkUsage.
- Use chunkUsageMode: 'cumulative' for providers that send cumulative usage snapshots during a stream.
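The before-first-chunk rule can be sketched in a few lines. This is only an illustration of the idea, not llm-retry-kit's implementation: streamWithRetry, open, and maxRetries are hypothetical names, and a real wrapper would also wait between attempts.

```typescript
// Sketch only: reopen a stream on failure, but only while nothing has been yielded.
// `streamWithRetry`, `open`, and `maxRetries` are illustrative names, not package API.
async function* streamWithRetry<T>(
  open: () => AsyncIterable<T>,
  maxRetries: number,
): AsyncGenerator<T> {
  for (let attempt = 0; ; attempt++) {
    let yielded = false
    try {
      for await (const chunk of open()) {
        yielded = true
        yield chunk
      }
      return
    } catch (error) {
      // Once output has reached the consumer, a retry could duplicate it,
      // so the error is surfaced instead. The same applies when retries run out.
      if (yielded || attempt >= maxRetries) throw error
    }
  }
}
```

A failure while connecting is retried transparently, while a failure after the first chunk is rethrown to the consumer.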
import OpenAI from 'openai'
import { llmRetryStream } from 'llm-retry-kit'
const openai = new OpenAI()
const result = llmRetryStream({
stream: async ({ signal }) => {
const stream = await openai.responses.create({
model: 'gpt-4o-mini',
input: 'Write a short incident summary.',
stream: true,
}, { signal })
return stream
},
retryMode: 'before-first-chunk',
getChunkUsage: (event) => {
if (!('usage' in event) || !event.usage) return undefined
return {
promptTokens: event.usage.input_tokens ?? 0,
completionTokens: event.usage.output_tokens ?? 0,
totalTokens: event.usage.total_tokens ?? 0,
}
},
chunkUsageMode: 'cumulative',
})
for await (const event of result.stream) {
// Send provider events to your UI, parser, or SSE response.
}
console.log(result.getStats())
Advanced Production Controls
Circuit Breaker
Keep one CircuitBreaker instance per provider/model at application scope. Do
not create the breaker inline inside a request handler; its state must survive
between calls. If the failure threshold is reached inside the time window,
later calls skip that provider until the cooldown expires.
import { CircuitBreaker, llmRetry } from 'llm-retry-kit'
const openaiBreaker = new CircuitBreaker({
failureThreshold: 5,
windowMs: 60_000,
cooldownMs: 120_000,
})
await llmRetry({
providers: [
{
name: 'openai:gpt-4o-mini',
fn: callOpenAI,
circuitBreaker: openaiBreaker,
},
{
name: 'anthropic:claude-sonnet',
fn: callAnthropic,
},
],
})
Per-Provider Timeout
Use global timeoutMs for the whole workflow and provider timeoutMs for a
single attempt.
await llmRetry({
providers: [
{ name: 'openai:fast', fn: callOpenAI, timeoutMs: 3_000, maxRetries: 1 },
{ name: 'anthropic:steady', fn: callAnthropic, timeoutMs: 10_000 },
],
timeoutMs: 30_000,
})
Hedged Requests
Hedging starts the next provider in parallel if the current provider has not
answered after hedgeDelayMs. The first successful response wins and the
slower request is aborted through the context signal.
await llmRetry({
providers: [
{ name: 'primary', fn: callPrimary },
{ name: 'hedge', fn: callBackup },
],
hedgeDelayMs: 750,
})
Hedging is best for latency-sensitive read paths. It can increase provider traffic, so pair it with budget tracking and conservative delay values.
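To make the mechanism concrete, here is a minimal sketch of a two-way hedge, assuming each call accepts an AbortSignal. The names hedge and Call are hypothetical and this is not the package's internal code. It only shows the race and the abort of the loser.

```typescript
// Sketch only: race a primary call against a delayed backup and abort the loser.
// `hedge` and `Call` are illustrative names, not package API.
type Call<T> = (signal: AbortSignal) => Promise<T>

function hedge<T>(primary: Call<T>, backup: Call<T>, delayMs: number): Promise<T> {
  const primaryCtl = new AbortController()
  const backupCtl = new AbortController()

  return new Promise<T>((resolve, reject) => {
    let pending = 2

    const finish = (value: T) => {
      clearTimeout(timer) // the backup never starts if the primary was fast enough
      resolve(value)
      primaryCtl.abort() // stop whichever request is still in flight
      backupCtl.abort()
    }
    const fail = (error: unknown) => {
      // Reject only when both the primary and the backup have failed.
      if (--pending === 0) reject(error)
    }

    // Start the backup only if nothing has finished after delayMs.
    const timer = setTimeout(() => {
      backup(backupCtl.signal).then(finish, fail)
    }, delayMs)

    primary(primaryCtl.signal).then(finish, fail)
  })
}
```

In this sketch a fast primary means the backup is never called, which is why a well-chosen delay keeps the extra traffic small.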
Adaptive Hedging
Use AdaptiveHedgeDelay when fixed hedge delays are too brittle. It records
recent latency samples per provider and uses the configured percentile as the
next hedge delay. Keep the instance at application scope so the latency history
survives between requests.
import { AdaptiveHedgeDelay, llmRetry } from 'llm-retry-kit'
const adaptiveHedge = new AdaptiveHedgeDelay({
sampleSize: 100,
percentile: 0.95,
minSamples: 10,
minDelayMs: 250,
maxDelayMs: 5_000,
})
await llmRetry({
providers: [
{ name: 'openai:gpt-4o-mini', fn: callOpenAI },
{ name: 'anthropic:claude-sonnet', fn: callAnthropic },
],
hedgeDelayStrategy: adaptiveHedge,
})
If there are not enough samples yet, no hedge is fired unless you set
defaultDelayMs.
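The delay calculation can be pictured roughly as below. This is a sketch under assumptions: percentileDelay is a hypothetical helper, and the nearest-rank percentile method is my choice for illustration, since the package's exact method is not documented here. The option names mirror the constructor options shown above.

```typescript
// Sketch only: derive a hedge delay from recent latency samples.
// `percentileDelay` is an illustrative helper; nearest-rank is an assumed method.
interface DelayOptions {
  percentile: number
  minSamples: number
  minDelayMs: number
  maxDelayMs: number
  defaultDelayMs?: number
}

function percentileDelay(samplesMs: number[], opts: DelayOptions): number | undefined {
  // Too little history: fall back to the default, or do not hedge at all.
  if (samplesMs.length < opts.minSamples) return opts.defaultDelayMs

  const sorted = [...samplesMs].sort((a, b) => a - b)
  // Nearest-rank percentile: the smallest sample with at least p of the data at or below it.
  const rank = Math.max(1, Math.ceil(opts.percentile * sorted.length))
  const value = sorted[rank - 1] ?? 0

  // Clamp so that outliers cannot push the delay to extremes.
  return Math.min(opts.maxDelayMs, Math.max(opts.minDelayMs, value))
}
```

Returning undefined when history is short corresponds to the "no hedge is fired" behavior described above.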
Rolling Global Budget
maxCostUSD limits a single retry workflow. GlobalBudgetTracker limits the
total spend across many calls inside a rolling time window. Keep one instance at
application scope.
import { GlobalBudgetTracker, llmRetry } from 'llm-retry-kit'
const globalBudget = new GlobalBudgetTracker({
maxCostUSD: 5,
windowMs: 60_000,
})
await llmRetry({
fn: callModel,
globalBudget,
costCalculator: calculateRealProviderCost,
})
When the rolling window is exhausted, new attempts fail before the provider is called. In-flight non-streaming calls can still finish because final usage is known only after the provider returns. Streaming calls are checked as chunk usage is reported.
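A rolling window of this kind can be sketched as a list of timestamped cost entries. The class below is a hypothetical illustration (RollingBudget and its methods are not the package's API) that shows why the check happens before a call and the recording happens after it.

```typescript
// Sketch only: a rolling cost window. `RollingBudget` is an illustrative class,
// not the GlobalBudgetTracker implementation.
class RollingBudget {
  private entries: { at: number; costUSD: number }[] = []
  private readonly maxCostUSD: number
  private readonly windowMs: number

  constructor(maxCostUSD: number, windowMs: number) {
    this.maxCostUSD = maxCostUSD
    this.windowMs = windowMs
  }

  /** Total spend recorded within the last windowMs. */
  spent(now: number = Date.now()): number {
    // Drop entries that have aged out of the window.
    this.entries = this.entries.filter((e) => now - e.at < this.windowMs)
    return this.entries.reduce((sum, e) => sum + e.costUSD, 0)
  }

  /** Checked before the provider is called. */
  canSpend(now: number = Date.now()): boolean {
    return this.spent(now) < this.maxCostUSD
  }

  /** Recorded after the provider returns, when usage is finally known. */
  record(costUSD: number, now: number = Date.now()): void {
    this.entries.push({ at: now, costUSD })
  }
}
```

Because cost is recorded only after a call returns, a call that started under the limit can push the window over it. That matches the in-flight behavior described above.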
Metadata And Payload Tracking
Attach request metadata once and it flows into provider calls and hooks.
await llmRetry({
fn: callModel,
meta: { requestId: 'req_123', tenant: 'acme' },
payload: { prompt: 'Classify this ticket', userId: 'user_42' },
onAttempt: (context) => {
console.log(context.meta, context.payload)
},
onFailure: (error, context) => {
console.error(context.meta, error)
},
})
Simple Fallback API
For smaller apps, fn plus fallback is still supported.
const result = await llmRetry({
fn: async () => callPrimaryModel(),
fallback: async () => callFallbackModel(),
maxRetries: 2,
})
Configuration
Retry Timing
await llmRetry({
fn: myLLMCall,
maxRetries: 4,
initialDelayMs: 500,
maxDelayMs: 60_000,
})
Retries use exponential backoff with jitter. If the provider exposes a
Retry-After header, that delay is preferred.
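As a rough illustration of how such a delay could be computed, the sketch below uses the "full jitter" variant of exponential backoff. The function retryDelayMs is hypothetical, and both the jitter strategy and the handling of Retry-After (used as-is, without capping) are assumptions rather than documented package behavior.

```typescript
// Sketch only: exponential backoff with full jitter, preferring Retry-After.
// `retryDelayMs` is an illustrative helper, not package API.
function retryDelayMs(
  attempt: number, // 1 for the first retry, 2 for the second, ...
  initialDelayMs: number,
  maxDelayMs: number,
  retryAfterSeconds?: number,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  // A server-provided hint wins over the computed backoff.
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000

  // The ceiling doubles with each retry, capped at maxDelayMs.
  const ceiling = Math.min(maxDelayMs, initialDelayMs * 2 ** (attempt - 1))

  // Full jitter: pick uniformly in [0, ceiling) so clients do not retry in lockstep.
  return Math.floor(random() * ceiling)
}
```

With initialDelayMs: 500 and maxDelayMs: 60_000, the ceilings grow as 500, 1000, 2000, 4000 ms and so on until they reach the cap.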
Timeout And Cancellation
const controller = new AbortController()
const result = await llmRetry({
fn: async ({ signal }) => myLLMCall({ signal }),
signal: controller.signal,
timeoutMs: 30_000,
})
timeoutMs aborts the wrapper and retry sleeps. Passing signal into your SDK
call also lets the underlying request stop when the SDK supports it.
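Aborting a retry sleep means the wait must listen to the same signal as the request. A minimal sketch of such a sleep, using only standard AbortSignal and timer APIs, could look like this. The helper name sleep is hypothetical and the package's own implementation may differ.

```typescript
// Sketch only: a sleep that rejects as soon as the signal aborts.
// `sleep` is an illustrative helper, not package API.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error('Aborted before sleep started'))
      return
    }

    const onAbort = () => {
      clearTimeout(timer) // do not keep the timer alive after cancellation
      reject(new Error('Aborted during sleep'))
    }

    const timer = setTimeout(() => {
      signal?.removeEventListener('abort', onAbort) // avoid leaking listeners
      resolve()
    }, ms)

    signal?.addEventListener('abort', onAbort, { once: true })
  })
}
```

Without this, a cancelled request could still sit in a 30-second backoff wait before the caller sees the abort.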
Budget Tracking
const result = await llmRetry({
fn: myLLMCall,
maxCostUSD: 0.5,
costPer1kTokens: 0.002,
onBudgetExceeded: (spent, limit) => {
console.warn(`Budget exceeded: $${spent.toFixed(4)} / $${limit}`)
},
})
For real provider pricing, prefer costCalculator:
const result = await llmRetry({
fn: myLLMCall,
costCalculator: (usage) => {
const inputCost = usage.promptTokens * 0.00000015
const outputCost = usage.completionTokens * 0.0000006
return inputCost + outputCost
},
})
Budget tracking is based on the usage object returned by your function. A
wrapper cannot know the final cost of an in-flight LLM call before the provider
returns usage, so maxCostUSD is a guard for later attempts and fallback calls.
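The two pricing styles work out as follows for a single response. This is a worked example rather than package code: flatCostUSD and splitCostUSD are hypothetical helpers, and the per-million prices (0.15 and 0.6) are the per-token constants from the costCalculator example above, scaled up.

```typescript
// Sketch only: the arithmetic behind a flat estimate and a split-price calculation.
// `flatCostUSD` and `splitCostUSD` are illustrative helpers, not package API.
interface TokenUsage {
  promptTokens: number
  completionTokens: number
  totalTokens: number
}

// Flat estimate: one blended price per 1,000 tokens.
function flatCostUSD(usage: TokenUsage, costPer1kTokens: number): number {
  return (usage.totalTokens / 1000) * costPer1kTokens
}

// Split pricing: input and output tokens are billed at different rates.
function splitCostUSD(usage: TokenUsage, inputPerMillion: number, outputPerMillion: number): number {
  return (usage.promptTokens * inputPerMillion + usage.completionTokens * outputPerMillion) / 1_000_000
}

const usage: TokenUsage = { promptTokens: 1200, completionTokens: 300, totalTokens: 1500 }

console.log(flatCostUSD(usage, 0.002).toFixed(6)) // "0.003000"
console.log(splitCostUSD(usage, 0.15, 0.6).toFixed(6)) // "0.000360"
```

The flat estimate is roughly eight times higher here, which is why a blended default can trip maxCostUSD much earlier than real pricing would.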
Custom Retry Policy
Use context.defaultShouldRetry to compose with the built-in transient error
detection.
await llmRetry({
fn: myLLMCall,
shouldRetry: (error, context) => {
if (error.message.includes('insufficient quota')) return false
return context.defaultShouldRetry
},
})
By default, llm-retry-kit retries common transient failures such as HTTP
408, 409, 429, 5xx, Anthropic 529, timeout, network, and overload
errors.
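For the HTTP part of that list, the default decision can be approximated by a small status check. The function isTransientStatus below is a hypothetical sketch based only on the status codes named above. The package's real detection also covers timeout, network, and overload errors, which are not modeled here.

```typescript
// Sketch only: classify an HTTP status using the status codes listed above.
// `isTransientStatus` is an illustrative helper, not package API.
function isTransientStatus(status: number): boolean {
  // Request timeout, conflict, and rate limit are treated as retryable.
  if (status === 408 || status === 409 || status === 429) return true
  // Any 5xx, which includes Anthropic's 529 overloaded_error.
  return status >= 500 && status <= 599
}

// Client errors such as bad requests or bad credentials are not retried.
const statuses = [400, 401, 403, 408, 429, 503, 529]
const retryable = statuses.filter(isTransientStatus)
console.log(retryable) // [ 408, 429, 503, 529 ]
```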
Custom Fallback Policy
Fallback is a separate decision from retry. By default, fallback is allowed only after transient failures. If you intentionally want to fall back for a known client-side case, opt in explicitly.
await llmRetry({
providers: [
{ name: 'small-context-model', fn: callSmallModel },
{ name: 'large-context-model', fn: callLargeModel },
],
shouldFallback: (error, context) => {
if (error.message.includes('context length')) {
return context.nextProvider === 'large-context-model'
}
return context.defaultShouldFallback
},
})
Observability
await llmRetry({
fn: myLLMCall,
onAttempt: (context) => {
console.log(`Calling ${context.provider}, attempt ${context.attempt}`)
},
onRetry: (attempt, error, delayMs, context) => {
console.log(`${context.provider} failed: ${error.message}`)
console.log(`Retrying in ${delayMs}ms`)
},
onSuccess: (context) => {
console.log(`Cost so far: $${context.totalCostUSD}`)
},
onFailure: (error, context) => {
console.error(context.meta, error)
},
})
API Reference
llmRetry(options)
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| fn | (context) => Promise<LLMResponse<T>> | optional | Primary LLM call for the simple API. |
| fallback | (context) => Promise<LLMResponse<T>> | optional | Backup LLM call for the simple API. |
| providers | RetryProvider<T>[] | optional | Explicit provider/model chain. |
| maxRetries | number | 3 | Retries after the first attempt. |
| maxCostUSD | number | optional | Maximum tracked cost before later attempts stop. |
| globalBudget | GlobalBudgetTracker | optional | Shared rolling cost budget across calls. |
| costPer1kTokens | number | 0.002 | Simple cost estimate. |
| costCalculator | (usage, context) => number | optional | Custom cost calculation. |
| initialDelayMs | number | 1000 | Initial retry delay. |
| maxDelayMs | number | 30000 | Maximum retry delay. |
| timeoutMs | number | optional | Abort wrapper after this time. |
| hedgeDelayMs | number | optional | Start the next provider after this delay if the current provider is still pending. |
| hedgeDelayStrategy | AdaptiveHedgeDelay | optional | Compute hedge delay from recent provider latency. |
| signal | AbortSignal | optional | External cancellation signal. |
| meta | unknown | optional | User metadata copied into attempt/failure contexts. |
| payload | unknown | optional | Request payload copied into attempt/failure contexts. |
| shouldRetry | (error, context) => boolean \| Promise<boolean> | optional | Override retry decisions. |
| shouldFallback | (error, context) => boolean \| Promise<boolean> | optional | Override provider fallback decisions. |
| onAttempt | (context) => void | optional | Called before each attempt. |
| onRetry | (attempt, error, delayMs, context) => void | optional | Called before retry wait. |
| onSuccess | (context) => void | optional | Called after a successful response. |
| onFailure | (error, context) => void | optional | Called before final failure is thrown. |
| onBudgetExceeded | (spentUSD, limitUSD) => void | optional | Called when budget is exhausted. |
RetryProvider<T>
{
name: string
fn: (context: RetryAttemptContext) => Promise<LLMResponse<T>>
maxRetries?: number
timeoutMs?: number
hedgeDelayMs?: number
hedgeDelayStrategy?: AdaptiveHedgeDelay
circuitBreaker?: CircuitBreaker
costPer1kTokens?: number
costCalculator?: (usage, context) => number
}
llmRetryStream(options)
Returns { stream, getStats }. The request begins when the returned async
iterable is consumed.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| stream | (context) => AsyncIterable<TChunk> \| Promise<AsyncIterable<TChunk>> | optional | Primary stream call for the simple API. |
| fallbackStream | (context) => AsyncIterable<TChunk> \| Promise<AsyncIterable<TChunk>> | optional | Backup stream call. |
| providers | StreamRetryProvider<TChunk>[] | optional | Explicit stream provider chain. |
| retryMode | 'before-first-chunk' \| 'always' \| 'never' | 'before-first-chunk' | Controls whether interrupted streams are retried. |
| getChunkUsage | (chunk, context) => TokenUsage \| undefined | optional | Extract token usage from stream chunks/events. |
| chunkUsageMode | 'delta' \| 'cumulative' | 'delta' | Interpret chunk usage as incremental or cumulative. |
| maxRetries | number | 3 | Retries after the first attempt. |
| timeoutMs | number | optional | Abort the whole stream workflow after this time. |
| globalBudget | GlobalBudgetTracker | optional | Shared rolling cost budget across calls. |
| meta | unknown | optional | User metadata copied into contexts. |
| payload | unknown | optional | Request payload copied into contexts. |
| onChunk | (chunk, context) => void \| Promise<void> | optional | Called for each chunk before it is yielded. |
| onChunkError | (error, chunk, context) => void \| Promise<void> | optional | Called when onChunk fails. |
| onChunkErrorMode | 'ignore' \| 'throw' | 'ignore' | Decide whether onChunk failures should break the stream. |
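
The difference between the two chunkUsageMode values comes down to how per-chunk reports are combined. The helper below is a hypothetical sketch of that distinction (totalUsage is not part of the package): delta reports are summed, while cumulative reports are running totals where only the latest one counts.

```typescript
// Sketch only: combine per-chunk usage reports under each mode.
// `totalUsage` is an illustrative helper, not package API.
interface TokenUsage {
  promptTokens: number
  completionTokens: number
  totalTokens: number
}
type ChunkUsageMode = 'delta' | 'cumulative'

function totalUsage(reports: TokenUsage[], mode: ChunkUsageMode): TokenUsage {
  const zero: TokenUsage = { promptTokens: 0, completionTokens: 0, totalTokens: 0 }

  if (mode === 'cumulative') {
    // Each report is a snapshot of the running total, so the last one wins.
    return reports[reports.length - 1] ?? zero
  }

  // Each report covers only its own chunk, so the reports add up.
  return reports.reduce(
    (sum, r) => ({
      promptTokens: sum.promptTokens + r.promptTokens,
      completionTokens: sum.completionTokens + r.completionTokens,
      totalTokens: sum.totalTokens + r.totalTokens,
    }),
    zero,
  )
}
```

Summing cumulative snapshots as if they were deltas would overcount tokens, and therefore cost, so the mode should match what the provider actually sends.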
CircuitBreaker
new CircuitBreaker({
failureThreshold: 5,
windowMs: 60_000,
cooldownMs: 120_000,
})
snapshot() returns { state, failures, openedAt }, where state is
'closed', 'open', or 'half_open'.
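The three states can be understood through a small state machine. The class below is a hypothetical sketch (SketchBreaker is not the package's CircuitBreaker), and the half-open behavior it shows, where one failed trial reopens the breaker and a success closes it, is a common convention that I am assuming rather than documented behavior.

```typescript
// Sketch only: a sliding-window circuit breaker with three states.
// `SketchBreaker` is an illustrative class, not the package's CircuitBreaker.
type BreakerState = 'closed' | 'open' | 'half_open'

class SketchBreaker {
  private failures: number[] = [] // timestamps of recent failures
  private openedAt: number | null = null
  private readonly failureThreshold: number
  private readonly windowMs: number
  private readonly cooldownMs: number

  constructor(opts: { failureThreshold: number; windowMs: number; cooldownMs: number }) {
    this.failureThreshold = opts.failureThreshold
    this.windowMs = opts.windowMs
    this.cooldownMs = opts.cooldownMs
  }

  state(now: number = Date.now()): BreakerState {
    if (this.openedAt === null) return 'closed'
    // After the cooldown, a trial call is allowed through.
    return now - this.openedAt >= this.cooldownMs ? 'half_open' : 'open'
  }

  recordFailure(now: number = Date.now()): void {
    if (this.state(now) === 'half_open') {
      this.openedAt = now // the trial call failed, so the cooldown restarts
      return
    }
    // Count only failures that fall inside the sliding window.
    this.failures = this.failures.filter((t) => now - t < this.windowMs)
    this.failures.push(now)
    if (this.failures.length >= this.failureThreshold) this.openedAt = now
  }

  recordSuccess(): void {
    this.failures = []
    this.openedAt = null
  }
}
```

The state lives in instance fields, which is why a breaker created inside a request handler would never accumulate enough failures to open.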
AdaptiveHedgeDelay
new AdaptiveHedgeDelay({
sampleSize: 100,
percentile: 0.95,
minSamples: 5,
minDelayMs: 250,
maxDelayMs: 5_000,
defaultDelayMs: 750,
})
snapshot() returns the current sample count and computed delay per provider.
GlobalBudgetTracker
new GlobalBudgetTracker({
maxCostUSD: 5,
windowMs: 60_000,
})
snapshot() returns { spentUSD, limitUSD, windowMs, resetAt, entries }.
RetryResult<T>
{
data: T
attempts: number
provider: string
usedFallback: boolean
totalCostUSD: number
totalTokens: number
}
LLMRetryError
{
name: 'LLMRetryError'
primaryError: Error | null
fallbackError: Error | null
totalCostUSD: number
totalTokens: number
attempts: number
providers: string[]
reason: 'failure' | 'budget_exceeded' | 'aborted'
}
Defaults
| Setting | Default |
| --- | --- |
| maxRetries | 3 |
| initialDelayMs | 1000 |
| maxDelayMs | 30000 |
| costPer1kTokens | 0.002 |
| stream retry mode | before-first-chunk |
| stream chunk hook error mode | ignore |
| fallback on client errors | false |
| fallback on transient errors | true |
| runtime dependencies | none |
Development
npm install
npm run typecheck
npm test
npm run build
npm pack --dry-run
License
MIT
