llm-retry-kit

v0.4.0

Resilience toolkit for LLM APIs with retries, streaming, provider fallback, circuit breakers, adaptive hedging, rolling budgets, aborts, and observability.

Small resilience layer for production LLM calls. llm-retry-kit gives you provider-aware retries, fallback chains, jittered exponential backoff, Retry-After handling, streaming retries, circuit breakers, adaptive hedged requests, rolling budget windows, cancellation, timeouts, and observability hooks without runtime dependencies.

npm install llm-retry-kit

Why llm-retry-kit?

LLM APIs fail in ways that normal API wrappers often do not model well:

  • 429 rate limits need backoff, not immediate loops.
  • 500, 503, 504, and Anthropic 529 overloaded_error are usually transient and often worth retrying or failing over.
  • 400, 401, 403, and request-too-large errors are usually request or credential problems, so they should not be retried or failed over blindly.
  • Failed retry attempts can still count toward provider rate limits.
  • Production apps need cancellation, budget limits, and logs around every attempt.

This package keeps the core primitive small: you provide the actual SDK call, and llm-retry-kit manages the reliability policy around it.

Features

  • Retry transient LLM failures with exponential backoff and jitter.
  • Respect Retry-After headers from provider errors.
  • Chain named providers or models with explicit fallback behavior.
  • Avoid fallback on non-transient client errors by default.
  • Customize retry and fallback decisions with shouldRetry and shouldFallback.
  • Track token usage and estimated cost.
  • Use custom input/output token pricing through costCalculator.
  • Wrap streaming responses with retry-before-first-chunk safety.
  • Track partial stream token usage from provider events.
  • Skip unhealthy providers with CircuitBreaker.
  • Set timeout budgets per provider/model.
  • Start hedged requests to reduce tail latency.
  • Adapt hedge delays from recent provider latency with AdaptiveHedgeDelay.
  • Enforce rolling cost windows with GlobalBudgetTracker.
  • Await async stream chunk hooks while protecting the stream from hook errors.
  • Pass request meta and payload through every context for logging.
  • Abort long calls and retry sleeps with AbortSignal or timeoutMs.
  • Observe attempts, retries, success, failure, and budget events.
  • Strict TypeScript types.
  • ESM package with no runtime dependencies.

Quick Start

import { llmRetry } from 'llm-retry-kit'
import OpenAI from 'openai'

const openai = new OpenAI()

const result = await llmRetry({
  fn: async ({ signal }) => {
    const response = await openai.chat.completions.create(
      {
        model: 'gpt-4o-mini',
        messages: [{ role: 'user', content: 'Hello!' }],
      },
      { signal }
    )

    return {
      data: response.choices[0]?.message.content ?? '',
      usage: response.usage
        ? {
            promptTokens: response.usage.prompt_tokens,
            completionTokens: response.usage.completion_tokens,
            totalTokens: response.usage.total_tokens,
          }
        : undefined,
    }
  },
  maxRetries: 3,
  initialDelayMs: 1000,
  maxDelayMs: 30000,
})

console.log(result.data)
console.log(result.provider)
console.log(result.attempts)
console.log(result.totalCostUSD)

Complete Provider Fallback Example

This example tries OpenAI first, then falls back to Anthropic only for transient failures. Client errors like invalid requests or bad credentials stop the chain by default.

import Anthropic from '@anthropic-ai/sdk'
import OpenAI from 'openai'
import { llmRetry } from 'llm-retry-kit'

const openai = new OpenAI()
const anthropic = new Anthropic()

const prompt = 'Summarize the following support ticket...'

const result = await llmRetry({
  providers: [
    {
      name: 'openai:gpt-4o-mini',
      maxRetries: 2,
      fn: async ({ signal }) => {
        const response = await openai.chat.completions.create(
          {
            model: 'gpt-4o-mini',
            messages: [{ role: 'user', content: prompt }],
          },
          { signal }
        )

        return {
          data: response.choices[0]?.message.content ?? '',
          usage: response.usage
            ? {
                promptTokens: response.usage.prompt_tokens,
                completionTokens: response.usage.completion_tokens,
                totalTokens: response.usage.total_tokens,
              }
            : undefined,
        }
      },
    },
    {
      name: 'anthropic:claude-sonnet',
      maxRetries: 1,
      fn: async ({ signal }) => {
        const response = await anthropic.messages.create(
          {
            model: 'claude-sonnet-4-6',
            max_tokens: 1024,
            messages: [{ role: 'user', content: prompt }],
          },
          { signal }
        )

        const text = response.content
          .filter((block) => block.type === 'text')
          .map((block) => block.text)
          .join('')

        return {
          data: text,
          usage: {
            promptTokens: response.usage.input_tokens,
            completionTokens: response.usage.output_tokens,
            totalTokens: response.usage.input_tokens + response.usage.output_tokens,
          },
        }
      },
    },
  ],
  timeoutMs: 45_000,
})

console.log({
  provider: result.provider,
  usedFallback: result.usedFallback,
  attempts: result.attempts,
  answer: result.data,
})

Streaming

OpenAI and Anthropic both expose streaming APIs, but their event formats and resume behavior are provider-specific. llm-retry-kit therefore keeps the stream wrapper provider-agnostic and conservative:

  • By default, it retries only if the stream fails before the first chunk.
  • After a chunk has been yielded, retrying could duplicate output, so it stops unless you explicitly set retryMode: 'always'.
  • Token usage can be tracked from stream events with getChunkUsage.
  • Use chunkUsageMode: 'cumulative' for providers that send cumulative usage snapshots during a stream.

import OpenAI from 'openai'
import { llmRetryStream } from 'llm-retry-kit'

const openai = new OpenAI()

const result = llmRetryStream({
  stream: async ({ signal }) => {
    const stream = await openai.responses.create({
      model: 'gpt-4o-mini',
      input: 'Write a short incident summary.',
      stream: true,
    }, { signal })

    return stream
  },
  retryMode: 'before-first-chunk',
  getChunkUsage: (event) => {
    if (!('usage' in event) || !event.usage) return undefined

    return {
      promptTokens: event.usage.input_tokens ?? 0,
      completionTokens: event.usage.output_tokens ?? 0,
      totalTokens: event.usage.total_tokens ?? 0,
    }
  },
  chunkUsageMode: 'cumulative',
})

for await (const event of result.stream) {
  // Send provider events to your UI, parser, or SSE response.
}

console.log(result.getStats())
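
To make the difference between the two chunkUsageMode values concrete, here is a minimal sketch of how usage reports could be totaled in each mode. The function name totalStreamUsage is hypothetical and this is not the package's internal code; only the TokenUsage field names and the two mode names come from this README.

```typescript
// Illustrative sketch only: not the package's internal implementation.
interface TokenUsage {
  promptTokens: number
  completionTokens: number
  totalTokens: number
}

type ChunkUsageMode = 'delta' | 'cumulative'

// Hypothetical helper: combines the usage reports seen during one stream.
function totalStreamUsage(reports: TokenUsage[], mode: ChunkUsageMode): TokenUsage {
  const total: TokenUsage = { promptTokens: 0, completionTokens: 0, totalTokens: 0 }

  for (const report of reports) {
    if (mode === 'cumulative') {
      // Each report is a running snapshot, so the latest one replaces the total.
      total.promptTokens = report.promptTokens
      total.completionTokens = report.completionTokens
      total.totalTokens = report.totalTokens
    } else {
      // Each report covers only newly produced tokens, so reports are summed.
      total.promptTokens += report.promptTokens
      total.completionTokens += report.completionTokens
      total.totalTokens += report.totalTokens
    }
  }

  return total
}
```

Picking 'delta' for a provider that actually sends running snapshots would count the same tokens several times, which is why the mode has to match the provider's event format.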

Advanced Production Controls

Circuit Breaker

Keep one CircuitBreaker instance per provider/model at application scope. Do not create the breaker inline inside a request handler; its state must survive between calls. If the failure threshold is reached inside the time window, later calls skip that provider until the cooldown expires.

import { CircuitBreaker, llmRetry } from 'llm-retry-kit'

const openaiBreaker = new CircuitBreaker({
  failureThreshold: 5,
  windowMs: 60_000,
  cooldownMs: 120_000,
})

await llmRetry({
  providers: [
    {
      name: 'openai:gpt-4o-mini',
      fn: callOpenAI,
      circuitBreaker: openaiBreaker,
    },
    {
      name: 'anthropic:claude-sonnet',
      fn: callAnthropic,
    },
  ],
})
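
For readers who want to see why the breaker state has to outlive a single request, the following is a minimal sketch of a sliding-window breaker using the same three option names. The class SketchCircuitBreaker, its method names, and the injected now timestamps are assumptions made for illustration; the package's CircuitBreaker may behave differently in its details.

```typescript
// Illustrative sketch only: not the package's CircuitBreaker.
type BreakerState = 'closed' | 'open' | 'half_open'

interface BreakerOptions {
  failureThreshold: number
  windowMs: number
  cooldownMs: number
}

class SketchCircuitBreaker {
  private readonly options: BreakerOptions
  private failureTimes: number[] = []
  private openedAt: number | null = null

  constructor(options: BreakerOptions) {
    this.options = options
  }

  // `now` is a millisecond timestamp passed in by the caller, which keeps the sketch testable.
  state(now: number): BreakerState {
    if (this.openedAt === null) return 'closed'
    // Once the cooldown has passed, calls are let through again as a trial.
    return now - this.openedAt >= this.options.cooldownMs ? 'half_open' : 'open'
  }

  canRequest(now: number): boolean {
    return this.state(now) !== 'open'
  }

  recordFailure(now: number): void {
    const current = this.state(now)
    if (current === 'open') return
    if (current === 'half_open') {
      // The trial call failed, so the cooldown starts over.
      this.openedAt = now
      return
    }
    // Keep only failures that are still inside the window, then add this one.
    this.failureTimes = this.failureTimes.filter((t) => now - t < this.options.windowMs)
    this.failureTimes.push(now)
    if (this.failureTimes.length >= this.options.failureThreshold) {
      this.openedAt = now
    }
  }

  recordSuccess(): void {
    // A success closes the breaker and clears the failure history.
    this.failureTimes = []
    this.openedAt = null
  }
}
```

Because the failure history lives on the instance, constructing a new breaker per request would reset it every time and the provider would never be skipped.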

Per-Provider Timeout

Use global timeoutMs for the whole workflow and provider timeoutMs for a single attempt.

await llmRetry({
  providers: [
    { name: 'openai:fast', fn: callOpenAI, timeoutMs: 3_000, maxRetries: 1 },
    { name: 'anthropic:steady', fn: callAnthropic, timeoutMs: 10_000 },
  ],
  timeoutMs: 30_000,
})
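
One way to reason about the two limits: an attempt can never outlast whatever remains of the workflow budget, so the effective limit for an attempt is the smaller of the two. The helper below is a hypothetical sketch of that arithmetic, not a function exported by the package, and the exact way the package combines the timers is an assumption.

```typescript
// Illustrative sketch only: attemptLimitMs is a hypothetical helper.
function attemptLimitMs(
  globalTimeoutMs: number | undefined,
  providerTimeoutMs: number | undefined,
  elapsedMs: number
): number | undefined {
  // Time left in the whole workflow, never negative.
  const remainingGlobalMs =
    globalTimeoutMs === undefined ? undefined : Math.max(0, globalTimeoutMs - elapsedMs)

  if (remainingGlobalMs === undefined) return providerTimeoutMs
  if (providerTimeoutMs === undefined) return remainingGlobalMs

  // Both limits apply, so the tighter one wins.
  return Math.min(providerTimeoutMs, remainingGlobalMs)
}
```

With the values above, the second provider gets its full 10 seconds if it starts early, but only 5 seconds if 25 seconds of the 30-second workflow budget are already spent.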

Hedged Requests

Hedging starts the next provider in parallel if the current provider has not answered after hedgeDelayMs. The first successful response wins and the slower request is aborted through the context signal.

await llmRetry({
  providers: [
    { name: 'primary', fn: callPrimary },
    { name: 'hedge', fn: callBackup },
  ],
  hedgeDelayMs: 750,
})

Hedging is best for latency-sensitive read paths. It can increase provider traffic, so pair it with budget tracking and conservative delay values.
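
The timing trade-off can be worked through with a small model. The function simulateHedge below is a hypothetical, pure calculation that assumes both providers eventually succeed; it does not call the package and only illustrates when a hedge starts and which response would win.

```typescript
// Illustrative sketch only: a timing model, not the package's hedging code.
interface HedgeOutcome {
  winner: 'primary' | 'hedge'
  elapsedMs: number
  hedgeStarted: boolean
}

function simulateHedge(
  primaryLatencyMs: number,
  hedgeLatencyMs: number,
  hedgeDelayMs: number
): HedgeOutcome {
  // The primary answered before the hedge delay elapsed, so no second request is sent.
  if (primaryLatencyMs <= hedgeDelayMs) {
    return { winner: 'primary', elapsedMs: primaryLatencyMs, hedgeStarted: false }
  }

  // The hedge starts at hedgeDelayMs and finishes hedgeLatencyMs later.
  const hedgeFinishMs = hedgeDelayMs + hedgeLatencyMs

  if (primaryLatencyMs <= hedgeFinishMs) {
    return { winner: 'primary', elapsedMs: primaryLatencyMs, hedgeStarted: true }
  }
  return { winner: 'hedge', elapsedMs: hedgeFinishMs, hedgeStarted: true }
}
```

For example, with a 750 ms hedge delay, a primary that takes 3000 ms and a backup that takes 900 ms yields a response at 1650 ms instead of 3000 ms, at the cost of one extra provider call.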

Adaptive Hedging

Use AdaptiveHedgeDelay when fixed hedge delays are too brittle. It records recent latency samples per provider and uses the configured percentile as the next hedge delay. Keep the instance at application scope so the latency history survives between requests.

import { AdaptiveHedgeDelay, llmRetry } from 'llm-retry-kit'

const adaptiveHedge = new AdaptiveHedgeDelay({
  sampleSize: 100,
  percentile: 0.95,
  minSamples: 10,
  minDelayMs: 250,
  maxDelayMs: 5_000,
})

await llmRetry({
  providers: [
    { name: 'openai:gpt-4o-mini', fn: callOpenAI },
    { name: 'anthropic:claude-sonnet', fn: callAnthropic },
  ],
  hedgeDelayStrategy: adaptiveHedge,
})

If there are not enough samples yet, no hedge is fired unless you set defaultDelayMs.
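
As a rough picture of what a percentile-based delay means, the sketch below derives a delay from a list of latency samples. The function hedgeDelayFromSamples and the nearest-rank percentile method are assumptions chosen for illustration; the package may compute its percentile differently. The option names mirror the ones shown above.

```typescript
// Illustrative sketch only: not the package's AdaptiveHedgeDelay.
interface HedgeDelayConfig {
  percentile: number
  minSamples: number
  minDelayMs: number
  maxDelayMs: number
  defaultDelayMs?: number
}

// Returns undefined when no hedge should be fired.
function hedgeDelayFromSamples(latenciesMs: number[], config: HedgeDelayConfig): number | undefined {
  // Too little history: use the default if one is configured, otherwise do not hedge.
  if (latenciesMs.length < config.minSamples) return config.defaultDelayMs

  const sorted = latenciesMs.slice().sort((a, b) => a - b)

  // Nearest-rank percentile: the smallest sample with at least `percentile` of samples at or below it.
  const rank = Math.ceil(config.percentile * sorted.length)
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1))

  // Clamp the result into the configured bounds.
  return Math.min(config.maxDelayMs, Math.max(config.minDelayMs, sorted[index]))
}
```

A high percentile such as 0.95 means the hedge fires only for requests that are already slower than almost all recent ones, which keeps the extra traffic small.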

Rolling Global Budget

maxCostUSD limits a single retry workflow. GlobalBudgetTracker limits the total spend across many calls inside a rolling time window. Keep one instance at application scope.

import { GlobalBudgetTracker, llmRetry } from 'llm-retry-kit'

const globalBudget = new GlobalBudgetTracker({
  maxCostUSD: 5,
  windowMs: 60_000,
})

await llmRetry({
  fn: callModel,
  globalBudget,
  costCalculator: calculateRealProviderCost,
})

When the rolling window is exhausted, new attempts fail before the provider is called. In-flight non-streaming calls can still finish because final usage is known only after the provider returns. Streaming calls are checked as chunk usage is reported.
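
The rolling-window behaviour described above can be sketched as a list of timestamped cost entries that age out. The class SketchRollingBudget, its method names, and the injected now timestamps are hypothetical and exist only to illustrate the idea; they are not the GlobalBudgetTracker API.

```typescript
// Illustrative sketch only: not the package's GlobalBudgetTracker.
interface BudgetOptions {
  maxCostUSD: number
  windowMs: number
}

class SketchRollingBudget {
  private readonly options: BudgetOptions
  private entries: { at: number; costUSD: number }[] = []

  constructor(options: BudgetOptions) {
    this.options = options
  }

  // Total spend recorded within the last windowMs milliseconds.
  spentUSD(now: number): number {
    // Drop entries that have aged out of the window.
    this.entries = this.entries.filter((entry) => now - entry.at < this.options.windowMs)
    return this.entries.reduce((sum, entry) => sum + entry.costUSD, 0)
  }

  // Checked before a provider is called.
  canStartAttempt(now: number): boolean {
    return this.spentUSD(now) < this.options.maxCostUSD
  }

  // Called once the provider has reported usage, which is why in-flight calls can overshoot.
  record(costUSD: number, now: number): void {
    this.entries.push({ at: now, costUSD })
  }
}
```

Since cost is recorded only after a call returns, the window can end up slightly above the limit; the guard applies to the next attempt, not the one in flight.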

Metadata And Payload Tracking

Attach request metadata once and it flows into provider calls and hooks.

await llmRetry({
  fn: callModel,
  meta: { requestId: 'req_123', tenant: 'acme' },
  payload: { prompt: 'Classify this ticket', userId: 'user_42' },
  onAttempt: (context) => {
    console.log(context.meta, context.payload)
  },
  onFailure: (error, context) => {
    console.error(context.meta, error)
  },
})

Simple Fallback API

For smaller apps, fn plus fallback is still supported.

const result = await llmRetry({
  fn: async () => callPrimaryModel(),
  fallback: async () => callFallbackModel(),
  maxRetries: 2,
})

Configuration

Retry Timing

await llmRetry({
  fn: myLLMCall,
  maxRetries: 4,
  initialDelayMs: 500,
  maxDelayMs: 60_000,
})

Retries use exponential backoff with jitter. If the provider exposes a Retry-After header, that delay is preferred.
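
As a rough illustration of how these settings interact, the sketch below computes a delay with exponential growth, a cap, and full jitter, and lets a Retry-After value take priority. The function name computeRetryDelayMs, the full-jitter formula, and the injectable random parameter are assumptions made for this example; the package's internal formula may differ.

```typescript
// Illustrative sketch only: not the package's internal implementation.
interface BackoffOptions {
  initialDelayMs: number
  maxDelayMs: number
  // Delay parsed from a Retry-After header, if the provider sent one.
  retryAfterMs?: number
  // Injectable random source so the function can be tested; defaults to Math.random.
  random?: () => number
}

// `retryNumber` is 1 for the first retry, 2 for the second, and so on.
function computeRetryDelayMs(retryNumber: number, options: BackoffOptions): number {
  const random = options.random ?? Math.random

  // A server-provided hint wins over the computed backoff.
  if (options.retryAfterMs !== undefined && options.retryAfterMs >= 0) {
    return options.retryAfterMs
  }

  // Exponential ceiling: initial * 2^(retryNumber - 1), capped at maxDelayMs.
  const ceiling = Math.min(
    options.maxDelayMs,
    options.initialDelayMs * Math.pow(2, retryNumber - 1)
  )

  // Full jitter: pick a uniform value in [0, ceiling).
  return Math.floor(random() * ceiling)
}
```

Jitter spreads retries from many clients over time, so they do not all hit a rate-limited provider at the same instant.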

Timeout And Cancellation

const controller = new AbortController()

const result = await llmRetry({
  fn: async ({ signal }) => myLLMCall({ signal }),
  signal: controller.signal,
  timeoutMs: 30_000,
})

timeoutMs aborts the wrapper and retry sleeps. Passing signal into your SDK call also lets the underlying request stop when the SDK supports it.

Budget Tracking

const result = await llmRetry({
  fn: myLLMCall,
  maxCostUSD: 0.5,
  costPer1kTokens: 0.002,
  onBudgetExceeded: (spent, limit) => {
    console.warn(`Budget exceeded: $${spent.toFixed(4)} / $${limit}`)
  },
})

For real provider pricing, prefer costCalculator:

const result = await llmRetry({
  fn: myLLMCall,
  costCalculator: (usage) => {
    const inputCost = usage.promptTokens * 0.00000015
    const outputCost = usage.completionTokens * 0.0000006
    return inputCost + outputCost
  },
})

Budget tracking is based on the usage object returned by your function. A wrapper cannot know the final cost of an in-flight LLM call before the provider returns usage, so maxCostUSD is a guard for later attempts and fallback calls.
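
To show how far a flat estimate and split pricing can diverge, here is a small worked calculation. Both helper names are hypothetical, and the flat formula (total tokens divided by 1000, times the rate) is an assumption based on the option name costPer1kTokens. The per-million prices are the same values as the per-token numbers in the costCalculator example above.

```typescript
// Illustrative sketch only: hypothetical helpers for comparing the two pricing styles.
interface UsageCounts {
  promptTokens: number
  completionTokens: number
}

// Flat estimate: every token costs the same.
function flatCostUSD(usage: UsageCounts, costPer1kTokens: number): number {
  const totalTokens = usage.promptTokens + usage.completionTokens
  return (totalTokens / 1000) * costPer1kTokens
}

// Split pricing: input and output tokens are billed at different rates per million tokens.
function splitCostUSD(
  usage: UsageCounts,
  inputPerMillionUSD: number,
  outputPerMillionUSD: number
): number {
  const inputCost = usage.promptTokens * inputPerMillionUSD
  const outputCost = usage.completionTokens * outputPerMillionUSD
  return (inputCost + outputCost) / 1_000_000
}

const sampleUsage: UsageCounts = { promptTokens: 1200, completionTokens: 300 }

// 1500 tokens at $0.002 per 1k tokens is about $0.003.
const flatEstimate = flatCostUSD(sampleUsage, 0.002)

// 1200 input tokens at $0.15 per million plus 300 output tokens at $0.60 per million is about $0.00036.
const splitEstimate = splitCostUSD(sampleUsage, 0.15, 0.6)
```

In this example the flat estimate is roughly eight times higher than the split price, so a maxCostUSD guard based on the default rate would trip much earlier than real spend warrants.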

Custom Retry Policy

Use context.defaultShouldRetry to compose with the built-in transient error detection.

await llmRetry({
  fn: myLLMCall,
  shouldRetry: (error, context) => {
    if (error.message.includes('insufficient quota')) return false
    return context.defaultShouldRetry
  },
})

By default, llm-retry-kit retries common transient failures such as HTTP 408, 409, 429, 5xx, Anthropic 529, timeout, network, and overload errors.
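
The status-code part of that default can be pictured as a small classifier. The function below is a hypothetical sketch that covers only HTTP status codes named in this README; it is not the package's detection logic, which also looks at timeout, network, and overload errors.

```typescript
// Illustrative sketch only: covers HTTP status codes, not network or timeout errors.
type RetryDecision = 'retry' | 'do-not-retry'

function classifyHttpStatus(status: number): RetryDecision {
  // 408 (request timeout), 409 (conflict), and 429 (rate limit) are treated as transient.
  if (status === 408 || status === 409 || status === 429) return 'retry'

  // Any 5xx, including Anthropic's 529 overloaded_error, is treated as transient.
  if (status >= 500 && status <= 599) return 'retry'

  // Everything else, such as 400, 401, or 403, points at the request or the credentials.
  return 'do-not-retry'
}
```

A custom shouldRetry only needs to handle the cases where an application disagrees with this kind of default, and can defer to context.defaultShouldRetry for the rest.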

Custom Fallback Policy

Fallback is a separate decision from retry. By default, fallback is allowed only after transient failures. If you intentionally want to fall back for a known client-side case, opt in explicitly.

await llmRetry({
  providers: [
    { name: 'small-context-model', fn: callSmallModel },
    { name: 'large-context-model', fn: callLargeModel },
  ],
  shouldFallback: (error, context) => {
    if (error.message.includes('context length')) {
      return context.nextProvider === 'large-context-model'
    }

    return context.defaultShouldFallback
  },
})

Observability

await llmRetry({
  fn: myLLMCall,
  onAttempt: (context) => {
    console.log(`Calling ${context.provider}, attempt ${context.attempt}`)
  },
  onRetry: (attempt, error, delayMs, context) => {
    console.log(`${context.provider} failed: ${error.message}`)
    console.log(`Retrying in ${delayMs}ms`)
  },
  onSuccess: (context) => {
    console.log(`Cost so far: $${context.totalCostUSD}`)
  },
  onFailure: (error, context) => {
    console.error(context.meta, error)
  },
})

API Reference

llmRetry(options)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| fn | (context) => Promise<LLMResponse<T>> | optional | Primary LLM call for the simple API. |
| fallback | (context) => Promise<LLMResponse<T>> | optional | Backup LLM call for the simple API. |
| providers | RetryProvider<T>[] | optional | Explicit provider/model chain. |
| maxRetries | number | 3 | Retries after the first attempt. |
| maxCostUSD | number | optional | Maximum tracked cost before later attempts stop. |
| globalBudget | GlobalBudgetTracker | optional | Shared rolling cost budget across calls. |
| costPer1kTokens | number | 0.002 | Simple cost estimate. |
| costCalculator | (usage, context) => number | optional | Custom cost calculation. |
| initialDelayMs | number | 1000 | Initial retry delay. |
| maxDelayMs | number | 30000 | Maximum retry delay. |
| timeoutMs | number | optional | Abort wrapper after this time. |
| hedgeDelayMs | number | optional | Start the next provider after this delay if the current provider is still pending. |
| hedgeDelayStrategy | AdaptiveHedgeDelay | optional | Compute hedge delay from recent provider latency. |
| signal | AbortSignal | optional | External cancellation signal. |
| meta | unknown | optional | User metadata copied into attempt/failure contexts. |
| payload | unknown | optional | Request payload copied into attempt/failure contexts. |
| shouldRetry | (error, context) => boolean \| Promise<boolean> | optional | Override retry decisions. |
| shouldFallback | (error, context) => boolean \| Promise<boolean> | optional | Override provider fallback decisions. |
| onAttempt | (context) => void | optional | Called before each attempt. |
| onRetry | (attempt, error, delayMs, context) => void | optional | Called before retry wait. |
| onSuccess | (context) => void | optional | Called after a successful response. |
| onFailure | (error, context) => void | optional | Called before final failure is thrown. |
| onBudgetExceeded | (spentUSD, limitUSD) => void | optional | Called when budget is exhausted. |

RetryProvider<T>

{
  name: string
  fn: (context: RetryAttemptContext) => Promise<LLMResponse<T>>
  maxRetries?: number
  timeoutMs?: number
  hedgeDelayMs?: number
  hedgeDelayStrategy?: AdaptiveHedgeDelay
  circuitBreaker?: CircuitBreaker
  costPer1kTokens?: number
  costCalculator?: (usage, context) => number
}

llmRetryStream(options)

Returns { stream, getStats }. The request begins when the returned async iterable is consumed.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| stream | (context) => AsyncIterable<TChunk> \| Promise<AsyncIterable<TChunk>> | optional | Primary stream call for the simple API. |
| fallbackStream | (context) => AsyncIterable<TChunk> \| Promise<AsyncIterable<TChunk>> | optional | Backup stream call. |
| providers | StreamRetryProvider<TChunk>[] | optional | Explicit stream provider chain. |
| retryMode | 'before-first-chunk' \| 'always' \| 'never' | 'before-first-chunk' | Controls whether interrupted streams are retried. |
| getChunkUsage | (chunk, context) => TokenUsage \| undefined | optional | Extract token usage from stream chunks/events. |
| chunkUsageMode | 'delta' \| 'cumulative' | 'delta' | Interpret chunk usage as incremental or cumulative. |
| maxRetries | number | 3 | Retries after the first attempt. |
| timeoutMs | number | optional | Abort the whole stream workflow after this time. |
| globalBudget | GlobalBudgetTracker | optional | Shared rolling cost budget across calls. |
| meta | unknown | optional | User metadata copied into contexts. |
| payload | unknown | optional | Request payload copied into contexts. |
| onChunk | (chunk, context) => void \| Promise<void> | optional | Called for each chunk before it is yielded. |
| onChunkError | (error, chunk, context) => void \| Promise<void> | optional | Called when onChunk fails. |
| onChunkErrorMode | 'ignore' \| 'throw' | 'ignore' | Decide whether onChunk failures should break the stream. |

CircuitBreaker

new CircuitBreaker({
  failureThreshold: 5,
  windowMs: 60_000,
  cooldownMs: 120_000,
})

snapshot() returns { state, failures, openedAt }, where state is 'closed', 'open', or 'half_open'.

AdaptiveHedgeDelay

new AdaptiveHedgeDelay({
  sampleSize: 100,
  percentile: 0.95,
  minSamples: 5,
  minDelayMs: 250,
  maxDelayMs: 5_000,
  defaultDelayMs: 750,
})

snapshot() returns the current sample count and computed delay per provider.

GlobalBudgetTracker

new GlobalBudgetTracker({
  maxCostUSD: 5,
  windowMs: 60_000,
})

snapshot() returns { spentUSD, limitUSD, windowMs, resetAt, entries }.

RetryResult<T>

{
  data: T
  attempts: number
  provider: string
  usedFallback: boolean
  totalCostUSD: number
  totalTokens: number
}

LLMRetryError

{
  name: 'LLMRetryError'
  primaryError: Error | null
  fallbackError: Error | null
  totalCostUSD: number
  totalTokens: number
  attempts: number
  providers: string[]
  reason: 'failure' | 'budget_exceeded' | 'aborted'
}
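
When handling a thrown error, the reason field is the natural thing to branch on. The sketch below uses a structural check on the fields listed above rather than instanceof, because this README does not show whether an LLMRetryError class is exported. The names LLMRetryErrorLike, isLLMRetryErrorLike, and describeFailure are hypothetical.

```typescript
// Illustrative sketch only: a structural check based on the fields listed above.
interface LLMRetryErrorLike {
  name: 'LLMRetryError'
  reason: 'failure' | 'budget_exceeded' | 'aborted'
  attempts: number
  providers: string[]
  totalCostUSD: number
}

function isLLMRetryErrorLike(value: unknown): value is LLMRetryErrorLike {
  if (typeof value !== 'object' || value === null) return false
  const candidate = value as { name?: unknown; reason?: unknown }
  return candidate.name === 'LLMRetryError' && typeof candidate.reason === 'string'
}

// Turns a caught error into a short message suitable for logs.
function describeFailure(error: unknown): string {
  if (!isLLMRetryErrorLike(error)) return 'unexpected error'

  if (error.reason === 'budget_exceeded') {
    return `budget exhausted after $${error.totalCostUSD.toFixed(4)}`
  }
  if (error.reason === 'aborted') {
    return 'call was aborted'
  }
  return `all providers failed after ${error.attempts} attempts: ${error.providers.join(', ')}`
}
```

In a request handler, describeFailure would typically be called from the catch block around llmRetry, with the three reasons mapped to different HTTP responses or alerts.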

Defaults

| Setting | Default |
| --- | --- |
| maxRetries | 3 |
| initialDelayMs | 1000 |
| maxDelayMs | 30000 |
| costPer1kTokens | 0.002 |
| stream retry mode | before-first-chunk |
| stream chunk hook error mode | ignore |
| fallback on client errors | false |
| fallback on transient errors | true |
| runtime dependencies | none |

Development

npm install
npm run typecheck
npm test
npm run build
npm pack --dry-run

License

MIT