@diabolicallabs/llm-client

v4.1.0

Unified LLM API for Anthropic, OpenAI, Google, and DeepSeek. Streaming, retry/backoff, structured output, token normalization. © Diabolical Labs

Unified LLM API across Anthropic, OpenAI, Google Gemini, DeepSeek, and Perplexity. Single interface for completion, streaming, structured output, and native tool calling. All provider errors are normalized into a consistent LlmError shape. © Diabolical Labs

Status

v1.7.0. All five providers fully implemented. See MIGRATION.md for breaking changes from v0.x.

Highlights:

  • v1.7.0: createClient() is now async. pricing.remoteUrl config option fetches a remote PricingTable on init (stale-while-revalidate cache, 24h default TTL). pricing.cacheTtlMs controls the TTL. Structured pricing_source log on every createClient() with pricing config. Requires @diabolicallabs/llm-pricing@^0.2.0.
  • v1.6.0: LlmAfterCallContext now carries usage?: LlmUsage for all 5 call types. Non-streaming paths mirror response.usage; stream() captures from the terminal chunk; streamStructured() from the done event. The v1.5.0 caveat ("usage not surfaced for streaming in afterCall") is removed. agent-sdk v2.0.0 uses this to complete its architecture migration.
  • v1.5.0 — Pre-call hooks API (hooks?: LlmHooks on createClient). beforeCall for request mutation and short-circuit caching; afterCall for custom logging and observability. Fires on all 5 call types. Cross-reference: @diabolicallabs/agent-sdk uses hooks internally.
  • v1.4.0 — Provider capability matrix (getModelCapabilities()), linked AbortController helper (linkedAbortController()), response IDs on all response types (id + idSource).
  • v1.3.0 — Streaming structured output (streamStructured()) — token streaming + Zod-validated final object. OpenAI, Anthropic, DeepSeek supported; Gemini and Perplexity throw pre-call.
  • v1.2.0 — Configurable retry strategy (exponential/linear/fixed/decorrelated), provider failover via model: string[], Retry-After header support.
  • v1.1.0 — Per-response cost computation via @diabolicallabs/llm-pricing; concurrency pool at @diabolicallabs/llm-client/pool.
  • v1.0.0 — Native tool calling (withTools()), expanded LlmErrorKind taxonomy, OpenAI Responses API migration.

Install

pnpm add @diabolicallabs/llm-client

Public on npmjs.com — no .npmrc config required.

Usage

import { createClient, createClientFromEnv } from '@diabolicallabs/llm-client';

// From explicit config — createClient() is async (v1.7.0+)
const client = await createClient({
  provider: 'anthropic',
  model: 'claude-sonnet-4-6',
  apiKey: process.env.ANTHROPIC_API_KEY!,
});

// From environment variables (also async)
const envClient = await createClientFromEnv('anthropic', 'claude-sonnet-4-6');

// Non-streaming completion
const response = await client.complete([
  { role: 'user', content: 'Hello' },
]);
console.log(response.content, response.usage);

// Streaming
for await (const chunk of client.stream([{ role: 'user', content: 'Hello' }])) {
  process.stdout.write(chunk.token);
}

// Structured output — Zod 4 schema triggers strict native mode automatically
import { z } from 'zod';
const schema = z.object({ name: z.string(), score: z.number() });
const result = await client.structured(messages, schema);
// result.data is typed as { name: string; score: number }
// result.model and result.id are populated (v0.4.0+)

Strict structured outputs (v0.4.0)

Pass a Zod 4 schema to structured() and the toolkit automatically routes to the strictest native path available for each provider. No opt-in flag required.

import { z } from 'zod';
const schema = z.object({
  topic: z.string(),
  bullets: z.array(z.string()),
});

const result = await client.structured(messages, schema);
// result.data    — typed and Zod-validated
// result.model   — model ID used (always present, v0.4.0+)
// result.id      — provider request ID for tracing (OpenAI + Anthropic)
// result.citations — Perplexity citations if any

How detection works

The toolkit checks for Zod 4's internal _zod marker at runtime. If the schema is a Zod 4 instance, it converts to JSON Schema using Zod 4's built-in z.toJSONSchema() and routes to the native path. If the schema is a non-Zod object (for example a plain { parse } object), it falls back to the v0.3.0 system-prompt path. Zod 3 schemas are handled separately; see Zod 3 schemas.
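As a rough illustration of the runtime check described above, the sketch below tests a value for the _zod marker and picks a path. The names isZod4Schema and pickStructuredPath are hypothetical and are not exports of this package; the toolkit's real detection logic may differ, and Zod 3 handling is left out.

```typescript
// Hypothetical helpers for illustration; not part of @diabolicallabs/llm-client.
type StructuredPath = 'native' | 'prompt';

// Returns true when the value carries Zod 4's internal `_zod` marker.
function isZod4Schema(schema: unknown): boolean {
  return typeof schema === 'object' && schema !== null && '_zod' in schema;
}

// Chooses the native path for Zod 4 schemas and the prompt path otherwise.
function pickStructuredPath(schema: unknown): StructuredPath {
  return isZod4Schema(schema) ? 'native' : 'prompt';
}
```

A plain object that only exposes parse() has no _zod property, so this sketch would send it down the prompt path.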

Schema-feature support matrix

| Provider | Native mode | What's enforced | Known limits |
|---|---|---|---|
| OpenAI (gpt-4o, gpt-4o-mini) | text.format: { type: 'json_schema', strict: true } (Responses API, v1.0.0) | Schema structure guaranteed; model cannot produce off-schema output | No format, pattern, or recursive schemas (z.lazy()). Throws at conversion time with clear message. |
| Anthropic | Tool-use with forced tool_choice: { type: 'tool', name: 'extract' } | Model must call the tool; input is pre-parsed JSON | Defense-in-depth schema.parse() still runs |
| Gemini | responseSchema (OpenAPI 3.0) + responseMimeType: 'application/json' | Schema communicated to the model; belt-and-braces fence-strip retained | OBJECT schemas with empty properties: {} auto-receive a _placeholder sentinel (v1.0.0); stripped before Zod parse. |
| DeepSeek | None (prompt-only, API limitation) | System-prompt nudge + schema.parse() | Same as v0.3.0 |
| Perplexity | None (prompt-only, API limitation) | System-prompt nudge + <think> strip + schema.parse() | Same as v0.3.0; citations propagated to structured response |

Prompt-mode escape hatch

If your schema uses a feature unsupported in strict mode (e.g. z.function(), z.lazy()) and you need to keep using it, pass the escape hatch:

const result = await client.structured(messages, schema, {
  providerOptions: { structuredMode: 'prompt' },
});
// Forces the v0.3.0 prompt-only path regardless of schema type

Alternatively, catch the LlmError thrown during schema conversion and inform the user:

try {
  const result = await client.structured(messages, schema);
} catch (err) {
  if (err instanceof LlmError && err.kind === 'unknown') {
    // Schema contains an unrepresentable feature — message names it
    console.error(err.message);
  }
}

Zod 3 schemas

If a Zod 3 schema is passed, the toolkit throws LlmError with a clear "upgrade to Zod 4" message rather than silently falling through to prompt mode. Pass providerOptions.structuredMode = 'prompt' if you cannot upgrade immediately.

Anthropic prompt caching (v0.4.3)

Anthropic charges full input tokens on every call by default. Enable prompt caching to have Anthropic cache the system message block between calls: the first (write) call is billed at 1.25× the normal input price, and every subsequent (read) call within the 5-minute TTL window is billed at 0.10× (a 90% discount).

const result = await client.complete(messages, {
  providerOptions: { promptCache: 'ephemeral' },
});

// result.usage.cacheCreationTokens — tokens written to cache (first call)
// result.usage.cacheReadTokens     — tokens read from cache (subsequent calls)

Works identically on complete(), stream(), and structured() (both strict tool-use and prompt-fallback paths):

// complete()
const completed = await client.complete(messages, { providerOptions: { promptCache: 'ephemeral' } });

// stream()
for await (const chunk of client.stream(messages, { providerOptions: { promptCache: 'ephemeral' } })) {
  process.stdout.write(chunk.token);
}

// structured() with a Zod 4 schema (strict tool-use path)
// Also caches the tool definition as a second cache layer.
const strictResult = await client.structured(messages, zodSchema, {
  providerOptions: { promptCache: 'ephemeral' },
});

// structured() on the prompt-fallback path (non-Zod schema or structuredMode: 'prompt')
const promptResult = await client.structured(messages, narrowSchema, {
  providerOptions: { structuredMode: 'prompt', promptCache: 'ephemeral' },
});

Cache semantics

| Field | Description |
|---|---|
| TTL | 5 minutes (default). Anthropic also offers a 1-hour beta TTL; not yet exposed in the toolkit. |
| Minimum block size | 1024 tokens for Claude Sonnet and Opus models; 2048 tokens for Haiku. Below minimum, the API silently ignores the marker, so callers pay no write surcharge. |
| Write cost | 1.25× normal input token price. |
| Read cost | 0.10× normal input token price. |
| Break-even | ~3 cache reads within the TTL window. |

The toolkit always sends the cache_control marker and lets Anthropic's API enforce minimum block size. No client-side token estimation is performed — simpler, and the API's behavior is authoritative.
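To make the write and read multipliers listed above concrete, here is a small arithmetic sketch. It assumes the whole input block is cacheable and above the minimum size, and it expresses cost relative to the normal input price of one call. The function names are illustrative only.

```typescript
// Multipliers from the cache semantics table, relative to the normal input token price.
const WRITE_MULTIPLIER = 1.25;
const READ_MULTIPLIER = 0.1;

// Relative input cost of one cache write followed by `reads` cache hits.
function costWithCache(reads: number): number {
  return WRITE_MULTIPLIER + READ_MULTIPLIER * reads;
}

// Relative input cost of the same (1 + reads) calls with caching disabled.
function costWithoutCache(reads: number): number {
  return 1 + reads;
}
```

Under these assumptions, one write plus one read costs about 1.35 against 2.00 uncached, and one write plus three reads costs about 1.55 against 4.00. Real savings are smaller when only part of the prompt is cached.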

Usage fields

Cache token counts surface in LlmUsage:

interface LlmUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cacheCreationTokens?: number; // tokens written to cache (Anthropic only)
  cacheReadTokens?: number;     // tokens read from cache (Anthropic only)
}

On a cold call (cache miss): cacheCreationTokens > 0, cacheReadTokens === 0. On a warm call (cache hit within TTL): cacheReadTokens > 0, cacheCreationTokens === 0.
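The cold and warm cases above can be told apart with a small helper. This is a sketch only: CacheUsage is a local copy of the two cache fields, and cacheStatus is a hypothetical name, not a toolkit export.

```typescript
// Local copy of the cache-related usage fields; both are optional in LlmUsage.
interface CacheUsage {
  cacheCreationTokens?: number;
  cacheReadTokens?: number;
}

type CacheStatus = 'hit' | 'miss' | 'none';

// Classifies a call from its cache token counts.
function cacheStatus(usage: CacheUsage): CacheStatus {
  if ((usage.cacheReadTokens ?? 0) > 0) return 'hit';      // warm call
  if ((usage.cacheCreationTokens ?? 0) > 0) return 'miss'; // cold call, cache written
  return 'none'; // caching not requested, or block below the minimum size
}
```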

Provider isolation

providerOptions.promptCache is Anthropic-only. Passing it to an OpenAI, Gemini, DeepSeek, or Perplexity client has no effect — those providers ignore unrecognized providerOptions fields.

OpenAI has implicit automatic prompt caching on some models (no opt-in needed). Perplexity and Gemini caching models are different — if needed, those warrant separate briefs.

Provider universe

| Provider | Status | Env var |
|---|---|---|
| anthropic | Implemented | ANTHROPIC_API_KEY |
| openai | Implemented | OPENAI_API_KEY |
| gemini | Implemented | GOOGLE_AI_API_KEY |
| deepseek | Implemented | DEEPSEEK_API_KEY |
| perplexity | Implemented | PERPLEXITY_API_KEY |

Perplexity — web-grounded responses

The Perplexity provider returns real-time web-grounded answers with source citations. Use it via createClient or createClientFromEnv:

const client = await createClientFromEnv('perplexity', 'sonar');
const response = await client.complete([
  { role: 'user', content: 'What happened in AI this week?' },
]);

// Citations are deduplicated by URL
console.log(response.citations);
// [{ url: 'https://example.com/article' }, { url: 'https://reuters.com/story' }]

Citations

LlmResponse.citations is populated when Perplexity returns source URLs. It is undefined for all other providers.

interface LlmResponse {
  id: string;                           // always present (v1.4.0+)
  idSource: 'provider' | 'synthesized'; // 'synthesized' for Gemini
  content: string;
  model: string;
  usage: LlmUsage;
  latencyMs: number;
  citations?: Array<{
    url: string;
    title?: string;  // Perplexity currently returns URLs only; title is always undefined
  }>;
}

Citations are deduplicated by URL within a single response. They are not available in stream mode — use complete() when you need citations.
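Deduplication by URL can be pictured as below. This is a minimal sketch, not the toolkit's code: dedupeCitations is a hypothetical name, and keeping the first occurrence in original order is an assumption.

```typescript
interface Citation {
  url: string;
  title?: string;
}

// Keeps the first citation seen for each URL and preserves input order.
function dedupeCitations(citations: Citation[]): Citation[] {
  const seen = new Set<string>();
  const unique: Citation[] = [];
  for (const citation of citations) {
    if (!seen.has(citation.url)) {
      seen.add(citation.url);
      unique.push(citation);
    }
  }
  return unique;
}
```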

Search filters via providerOptions

Perplexity supports search-specific parameters. Pass them via the providerOptions escape hatch on any call:

await client.complete(messages, {
  providerOptions: {
    search_recency_filter: 'week',   // 'month' | 'week' | 'day' | 'hour'
    search_domain_filter: ['nytimes.com', 'reuters.com'],  // allowlist
  },
});

providerOptions is Record<string, unknown>. Unknown fields are forwarded to the Perplexity API unchanged, so newly released filters work without a toolkit update. Other providers ignore fields they do not recognize.

Reasoning models

Pass reasoning model IDs as the model string:

const client = await createClientFromEnv('perplexity', 'sonar-reasoning-pro');

Available models (verified 2026-05-08):

| Model | Notes |
|---|---|
| sonar | Lightweight search model. Default. |
| sonar-pro | Advanced search, more citations. |
| sonar-reasoning-pro | Chain-of-thought reasoning. Replaces deprecated sonar-reasoning. |
| sonar-deep-research | Exhaustive research. Perplexity docs indicate async job support; treat as experimental with this toolkit. |

structured() with sonar-reasoning-pro works correctly — reasoning tokens (<think>...</think>) are stripped before JSON parsing.
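The stripping step mentioned above might look roughly like this. It is a sketch under the assumption that reasoning is wrapped in literal <think> tags ahead of the JSON payload; parseAfterStrippingThink is a hypothetical helper, not a toolkit export.

```typescript
// Removes <think>...</think> blocks from the raw model text, then parses the rest as JSON.
function parseAfterStrippingThink(raw: string): unknown {
  // [\s\S] matches across newlines; the lazy quantifier stops at the first closing tag.
  const stripped = raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
  return JSON.parse(stripped);
}
```

The parsed value would still need to pass schema.parse() before it is returned to the caller.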

sonar-deep-research is accepted as a model string. If Perplexity's API returns an incompatible async response shape, the call will throw a clear LlmError. In that case, use sonar-reasoning-pro instead, or wait for a future deep-research-specific brief.

API

createClient(config: LlmClientConfig): Promise<LlmClient>

Creates an LlmClient for the given provider.

createClientFromEnv(provider, model, overrides?): Promise<LlmClient>

Reads the API key from the environment automatically:

  • anthropic → ANTHROPIC_API_KEY
  • openai → OPENAI_API_KEY
  • gemini → GOOGLE_AI_API_KEY
  • deepseek → DEEPSEEK_API_KEY
  • perplexity → PERPLEXITY_API_KEY

LlmClient interface

| Method | Description |
|---|---|
| complete(messages, options?) | Non-streaming completion. Returns LlmResponse (includes citations for Perplexity). |
| stream(messages, options?) | Streaming: async generator of LlmStreamChunk. Final chunk includes usage. Citations unavailable. |
| structured(messages, schema, options?) | Structured output validated against a Zod schema. Returns LlmStructuredResponse<T>. |
| streamStructured(messages, schema, options?) | Token streaming + Zod-validated final object. Returns AsyncGenerator<LlmStreamStructuredEvent<T>>. See Streaming structured output (v1.3.0). |
| withTools(messages, tools, options?) | Native tool calling. Returns LlmToolResponse. See Tool calling. |

All methods accept LlmCallOptions as the options parameter:

interface LlmCallOptions {
  model?: string;
  maxTokens?: number;
  temperature?: number;
  timeoutMs?: number;              // Per-call timeout (ms). Overrides config.timeoutMs.
  signal?: AbortSignal;            // Caller-supplied cancel signal. Never retried.
  streamStallTimeoutMs?: number;   // Per-chunk silence timeout for stream(). Default 30000.
  providerOptions?: Record<string, unknown>;  // Perplexity search filters, etc.
}

Tool calling (v1.0.0)

withTools() enables native function calling on OpenAI, Anthropic, Gemini, and DeepSeek; Perplexity does not support tool calling and throws kind: 'bad_request'. The toolkit handles provider-specific tool shapes, stop-reason mapping, and argument validation internally.

import { z } from 'zod';
import { createClientFromEnv } from '@diabolicallabs/llm-client';

const client = await createClientFromEnv('anthropic', 'claude-sonnet-4-6');

const weatherTool = {
  name: 'get_weather',
  description: 'Get the current weather for a city.',
  inputSchema: z.object({ city: z.string() }),
};

const result = await client.withTools(
  [{ role: 'user', content: 'What is the weather in London?' }],
  [weatherTool]
);

// result.stopReason — 'tool_use' | 'end_turn' | 'max_tokens' | 'content_filter' | ...
// result.toolCalls — array of LlmToolCall (may be empty if model responded with text)
// result.content   — any text the model produced alongside tool calls
// result.model     — model ID used
// result.usage     — normalized token usage

if (result.stopReason === 'tool_use') {
  for (const call of result.toolCalls) {
    console.log(call.toolName, call.arguments); // arguments is validated by inputSchema
    console.log(call.id);         // use as tool_call_id in the follow-up message
    console.log(call.rawArguments); // original JSON string from the model
  }
}

Tool options

interface LlmCallWithToolsOptions extends LlmCallOptions {
  toolChoice?: 'auto' | 'any' | 'none' | { name: string };
  parallelToolCalls?: boolean; // default: true (parallel-enabled)
}

toolChoice:

  • 'auto' (default) — model decides whether and which tools to call.
  • 'any' — model must call at least one tool. Maps to 'required' on OpenAI Responses API; { type: 'any' } on Anthropic.
  • 'none' — model must not call any tool.
  • { name: 'tool_name' } — model must call the named tool.

parallelToolCalls: false — disable parallel tool invocations. Maps to parallel_tool_calls: false on OpenAI and DeepSeek; disable_parallel_tool_use: true on Anthropic tool_choice; ignored on Gemini (no equivalent).
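For the Anthropic side of the mapping described above, a translation could be sketched as follows. The function name is hypothetical and the toolkit's internal mapping may differ. The target shape follows Anthropic's tool_choice object (type auto, any, none, or tool, plus disable_parallel_tool_use).

```typescript
type ToolChoice = 'auto' | 'any' | 'none' | { name: string };

interface AnthropicToolChoice {
  type: 'auto' | 'any' | 'none' | 'tool';
  name?: string;
  disable_parallel_tool_use?: boolean;
}

// Translates the unified options into an Anthropic-style tool_choice object.
function toAnthropicToolChoice(
  choice: ToolChoice = 'auto',
  parallelToolCalls = true,
): AnthropicToolChoice {
  const base: AnthropicToolChoice =
    typeof choice === 'string' ? { type: choice } : { type: 'tool', name: choice.name };
  // The flag is inverted, and it is irrelevant when tools are disabled entirely.
  if (!parallelToolCalls && base.type !== 'none') {
    base.disable_parallel_tool_use = true;
  }
  return base;
}
```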

Provider tool support matrix

| Provider | Tool calling | parallelToolCalls | Named toolChoice | Stop reasons |
|---|---|---|---|---|
| OpenAI | Native (Responses API flat shape) | Supported | Supported | tool_use, end_turn, max_tokens, refusal |
| Anthropic | Native ({ name, description, input_schema }) | Supported (inverse: disable_parallel_tool_use) | Supported | tool_use, end_turn, max_tokens, stop_sequence, pause_turn, refusal |
| Gemini | Native (parametersJsonSchema) | Not applicable (no Gemini equivalent) | Falls back to AUTO | tool_use, end_turn, max_tokens, content_filter, stop_sequence |
| DeepSeek | Native (Chat Completions nested shape) | Supported | Supported | tool_use, end_turn, max_tokens, content_filter |
| Perplexity | Not supported | N/A | N/A | Throws kind:'bad_request' immediately |

Argument validation

Each LlmTool.inputSchema must expose a .parse(data: unknown) method. If the model returns arguments that fail validation, withTools() throws LlmError with kind: 'tool_arguments_invalid', retryable: false. A Zod 4 schema satisfies this interface automatically.
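The validation contract above needs only a .parse() method, which the sketch below exercises with a hand-written schema. ToolArgumentsError is a local stand-in for the package's LlmError, and validateToolArguments is a hypothetical helper; neither is the toolkit's implementation.

```typescript
// Local stand-in for the package's error class, reduced to the fields used here.
class ToolArgumentsError extends Error {
  readonly kind = 'tool_arguments_invalid';
  readonly retryable = false;
}

interface ParseableSchema<T> {
  parse(data: unknown): T;
}

// Parses the model's raw JSON arguments and validates them against the schema.
function validateToolArguments<T>(rawArguments: string, schema: ParseableSchema<T>): T {
  try {
    return schema.parse(JSON.parse(rawArguments));
  } catch (cause) {
    const reason = cause instanceof Error ? cause.message : String(cause);
    throw new ToolArgumentsError(`Tool arguments failed validation: ${reason}`);
  }
}

// A hand-written schema: any object with a .parse() method qualifies.
const citySchema: ParseableSchema<{ city: string }> = {
  parse(data) {
    const candidate = data as { city?: unknown } | null;
    if (typeof candidate === 'object' && candidate !== null && typeof candidate.city === 'string') {
      return { city: candidate.city };
    }
    throw new Error('expected { city: string }');
  },
};
```

Both malformed JSON and schema mismatches end up as the same non-retryable error kind in this sketch.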

Gemini ID synthesis

Gemini does not issue native response IDs or tool call IDs. The toolkit synthesizes UUID v7-style IDs (time-based + random) for LlmToolCall.id and for the response-level id field on all Gemini responses (complete(), structured(), withTools()). These IDs are time-sortable (the first 12 hex characters encode the millisecond timestamp) but not cryptographically random. Use idSource to distinguish synthesized IDs from provider-issued ones.
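A UUID v7-style ID of the kind described above can be sketched as follows. This is an illustration, not the toolkit's generator: synthesizeId and timestampFromId are hypothetical names, and Math.random() is used deliberately to match the "not cryptographically random" caveat.

```typescript
// Returns `length` random hex characters (not cryptographically secure).
function randomHex(length: number): string {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Builds a UUID v7-shaped string: 48-bit ms timestamp, version nibble 7, variant bits 10.
function synthesizeId(nowMs: number = Date.now()): string {
  const ts = nowMs.toString(16).padStart(12, '0'); // 12 hex chars = 48 bits
  const variant = (8 + Math.floor(Math.random() * 4)).toString(16); // 8, 9, a, or b
  return `${ts.slice(0, 8)}-${ts.slice(8)}-7${randomHex(3)}-${variant}${randomHex(3)}-${randomHex(12)}`;
}

// Recovers the millisecond timestamp from the first 12 hex characters.
function timestampFromId(id: string): number {
  return parseInt(id.replace(/-/g, '').slice(0, 12), 16);
}
```

Because the timestamp occupies the leading characters, plain string comparison orders these IDs by creation time at millisecond granularity.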

Streaming structured output (v1.3.0)

streamStructured() combines the typing-progress UX of stream() with the Zod-validated final object of structured(). It emits incremental token events as the model generates output, then validates the accumulated text against the schema before emitting a final done event.

import { z } from 'zod';
import { createClientFromEnv } from '@diabolicallabs/llm-client';

const client = await createClientFromEnv('openai', 'gpt-4o');
const schema = z.object({ summary: z.string(), sentiment: z.enum(['positive', 'negative', 'neutral']) });

for await (const event of client.streamStructured(
  [{ role: 'user', content: 'Analyze this review: "Great product, fast shipping!"' }],
  schema
)) {
  if (event.type === 'token') {
    process.stdout.write(event.token); // show typing progress
  } else if (event.type === 'done') {
    console.log('\nValidated output:', event.data);
    console.log('Usage:', event.usage);
  }
}

Event shape

type LlmStreamStructuredEvent<T> =
  | { type: 'token'; token: string }  // incremental text chunk
  | { type: 'done'; data: T; usage: LlmUsage };  // final, validated result

token events arrive during generation. Exactly one done event arrives at the end. If JSON.parse() or schema.parse() fails, LlmError with kind: 'structured_parse_failed' is thrown instead (no done event).

Provider support matrix for streamStructured()

| Provider | Support | Notes |
|---|---|---|
| OpenAI | Supported | Streams output_text.delta events via Responses API. Zod 4 schemas enable json_schema strict mode; non-Zod schemas use json_object mode. |
| Anthropic | Supported | Uses forced tool-use (extract tool, tool_choice: tool). Streams input_json_delta events, raw JSON fragments that assemble into the final object. |
| DeepSeek | Supported | Streams Chat Completions deltas with response_format: { type: 'json_object' }. Falls back to parseJsonOrThrow if JSON.parse fails (handles chain-of-thought preamble from deepseek-reasoner). |
| Gemini | Not supported | Throws LlmError(kind: 'bad_request') immediately. Gemini does not reliably support simultaneous responseSchema constraints and streaming. Use stream() for tokens or structured() for validation. |
| Perplexity | Not supported | Throws LlmError(kind: 'bad_request') immediately. Search/retrieval models do not return tool-validated JSON. |

Failover and pricing

streamStructured() does not support provider failover — mid-stream model switching would corrupt the token sequence. It always uses the primary model from a model: string[] config.

streamStructured() does not attach cost — cost computation requires final token counts from a complete response object. Use complete() or structured() if you need cost tracking via config.pricing.

AbortSignal, stall detection (streamStallTimeoutMs), and timeout (timeoutMs) all work identically to stream().

Provider capability matrix (v1.4.0)

Query provider capabilities statically — no client instance needed.

import { getModelCapabilities } from '@diabolicallabs/llm-client';

const caps = getModelCapabilities('anthropic', 'claude-opus-4-7');
if (caps === null) {
  throw new Error('Unknown model');
}

console.log(caps.contextWindow);    // 1_000_000
console.log(caps.tools);            // true
console.log(caps.parallelTools);    // true
console.log(caps.promptCache);      // 'ephemeral'
console.log(caps.structuredOutput); // 'tool-use'
console.log(caps.responseIds);      // 'provider'
console.log(caps.streamStructured); // true

Returns null for unknown models — never throws.

ModelCapabilities shape

interface ModelCapabilities {
  contextWindow: number;          // max input tokens
  maxOutputTokens: number;        // max single-response tokens
  streaming: boolean;             // stream() supported
  tools: boolean;                 // withTools() supported
  parallelTools: boolean;         // model can invoke multiple tools per turn
  promptCache: 'ephemeral' | '1h' | null; // Anthropic only; null for all others
  structuredOutput: 'tool-use' | 'json-schema' | 'response-schema' | null;
  responseIds: 'provider' | 'synthesized'; // Gemini = 'synthesized'
  streamStructured: boolean;      // streamStructured() supported
}

Provider capability summary

| Provider | tools | parallelTools | promptCache | structuredOutput | responseIds | streamStructured |
|---|---|---|---|---|---|---|
| Anthropic | true | true | 'ephemeral' | 'tool-use' | 'provider' | true |
| OpenAI | true | true | null | 'json-schema' | 'provider' | true |
| Gemini | true | false | null | 'response-schema' | 'synthesized' | false |
| DeepSeek | true | true | null | 'json-schema' | 'provider' | true |
| Perplexity | false | false | null | null | 'provider' | false |

getModelCapabilities covers all models in @diabolicallabs/llm-pricing's DEFAULT_PRICING_TABLE. The table is versioned at CAPABILITIES_VERSIONED_AT: '2026-05-13' — import it to detect staleness.

Linked AbortController helper (v1.4.0)

linkedAbortController is a utility for fan-out patterns where a root signal cancels all in-flight calls and individual calls have their own per-call timeouts.

import { linkedAbortController } from '@diabolicallabs/llm-client';

const root = new AbortController();

const calls = tasks.map(t => {
  const child = linkedAbortController(root.signal, { timeoutMs: 30_000 });
  return client
    .complete(t.messages, { signal: child.signal })
    .finally(() => child.dispose()); // clean up on completion — prevents listener leak
});

// Cancel all in-flight calls at once
root.abort('shutdown');

// Or wait for all results (some may have individual timeouts)
const results = await Promise.allSettled(calls);

Behaviour

| Scenario | Result |
|---|---|
| Parent aborts | Child aborts immediately, forwarding the parent's abort reason |
| Parent already aborted at call time | Child aborts synchronously, no API call made |
| timeoutMs fires | Child aborts with timeout reason string; independent of the parent signal |
| dispose() called | Parent listener + timer cleared; child NOT aborted |
| abort() called on handle | Child aborts immediately; dispose() called implicitly |

Always call dispose() in a finally block — it removes the parent listener and clears the timer, preventing leaks if the parent fires after the call completes.
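The behaviours listed above can be reproduced with the standard AbortController alone. The following is a sketch of one way such a helper could be built, not the package's source; sketchLinkedAbort is a hypothetical name and the 'timeout' reason string is an assumption.

```typescript
interface LinkedHandle {
  signal: AbortSignal;
  abort: (reason?: unknown) => void;
  dispose: () => void;
}

function sketchLinkedAbort(
  parent: AbortSignal,
  options: { timeoutMs?: number } = {},
): LinkedHandle {
  const child = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Forward the parent's abort reason to the child.
  const onParentAbort = (): void => abort(parent.reason);

  // Removes the parent listener and clears the timer without aborting the child.
  function dispose(): void {
    parent.removeEventListener('abort', onParentAbort);
    if (timer !== undefined) {
      clearTimeout(timer);
      timer = undefined;
    }
  }

  // Aborts the child and cleans up, so dispose() is implied.
  function abort(reason?: unknown): void {
    dispose();
    if (!child.signal.aborted) child.abort(reason);
  }

  if (parent.aborted) {
    abort(parent.reason); // parent already aborted: child aborts synchronously
  } else {
    parent.addEventListener('abort', onParentAbort, { once: true });
    if (options.timeoutMs !== undefined) {
      timer = setTimeout(() => abort('timeout'), options.timeoutMs);
    }
  }

  return { signal: child.signal, abort, dispose };
}
```

Skipping dispose() in this sketch would leave the listener attached to the parent signal for as long as the parent lives, which is the leak the paragraph above warns about.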

API

function linkedAbortController(
  parentSignal: AbortSignal,
  options?: { timeoutMs?: number }
): {
  signal: AbortSignal;     // pass to client.complete(), stream(), etc.
  abort: (reason?: unknown) => void; // abort child immediately
  dispose: () => void;      // clean up without aborting
};

Response IDs everywhere (v1.4.0)

All three response types (LlmResponse, LlmStructuredResponse<T>, LlmToolResponse) now carry id: string and idSource: 'provider' | 'synthesized' on every call.

const response = await client.complete(messages);
console.log(response.id);       // 'msg_abc123' (Anthropic) or synthesized UUID (Gemini)
console.log(response.idSource); // 'provider' | 'synthesized'

ID sources by provider

| Provider | id source | idSource |
|---|---|---|
| Anthropic | response.id (Anthropic message ID) | 'provider' |
| OpenAI | response.id (Responses API) | 'provider' |
| DeepSeek | response.id (Chat Completions) | 'provider' |
| Perplexity | response.id (Chat Completions) | 'provider' |
| Gemini | UUID v7-style synthesized by toolkit | 'synthesized' |

Synthesized IDs are time-sortable (first 12 hex chars encode the millisecond timestamp) — useful for trace correlation without a separate timestamp. Check idSource === 'synthesized' before treating the ID as a durable provider reference.

Migration from v1.3.x: id was previously id? (optional) on LlmStructuredResponse and LlmToolResponse, and absent from LlmResponse. It is now id: string (always present) on all three types. Remove any response.id !== undefined null checks.

Cancellation, timeouts, stall detection

Per-call timeout override

The default timeout is set at client construction via config.timeoutMs (default 30 000 ms). Override it per-call:

const client = await createClient({
  provider: 'anthropic',
  model: 'claude-sonnet-4-6',
  apiKey: process.env.ANTHROPIC_API_KEY!,
  timeoutMs: 30_000, // client default
});

// This call gets 90 seconds — useful for sonar-deep-research or long reasoning
const response = await client.complete(messages, { timeoutMs: 90_000 });

On timeout, LlmError.kind === 'timeout' and retryable === true. Each retry attempt gets a fresh deadline — the timeout resets per attempt, not across the full retry sequence.

Caller AbortSignal

Pass any AbortSignal to cancel an in-flight call immediately:

const ac = new AbortController();

// Cancel on user navigation, request supersede, shutdown, etc.
const responsePromise = client.complete(messages, { signal: ac.signal });

// Cancel before the call returns
ac.abort('user navigated away');

try {
  await responsePromise;
} catch (err) {
  if (err instanceof LlmError && err.kind === 'cancelled') {
    // Gracefully handle the cancellation
  }
}

  • A signal already aborted at call time throws immediately: no SDK call is made, no retry.
  • A mid-call abort propagates to the SDK (Anthropic, OpenAI, DeepSeek, Perplexity) or wins a Promise.race (Gemini). kind === 'cancelled', retryable === false. Never retried.

Stream stall detection

A stream that emits a first chunk and then silently hangs will stall the consumer indefinitely without this feature. streamStallTimeoutMs fires a timer per chunk — if no chunk arrives within the window, the stream is aborted and a kind: 'stream_stall' error surfaces:

try {
  for await (const chunk of client.stream(messages, { streamStallTimeoutMs: 10_000 })) {
    process.stdout.write(chunk.token);
  }
} catch (err) {
  if (err instanceof LlmError && err.kind === 'stream_stall') {
    console.error('stream stalled — retry or fallback');
  }
}

  • Default streamStallTimeoutMs: 30 000 ms (set independently of timeoutMs, so it tolerates reasoning-model think-pauses).
  • The stall timer resets after each chunk arrives, so slow-but-not-stalled streams complete normally.
  • Stall errors are not retried — partial output is unsafe to re-issue. The error surfaces to the caller.

LlmError.kind discriminator (v1.0.0)

// Full taxonomy — all providers emit one of these kinds
type LlmErrorKind =
  | 'rate_limit'           // 429
  | 'server_error'         // 5xx
  | 'auth'                 // 401, 403
  | 'not_found'            // 404
  | 'bad_request'          // 400
  | 'content_filter'       // model refused, safety block
  | 'context_length'       // prompt too long
  | 'tool_arguments_invalid' // withTools() schema validation failure
  | 'structured_parse_failed' // structured() JSON parse or Zod validation failure
  | 'network'              // ECONNRESET, ETIMEDOUT, etc.
  | 'timeout'              // per-call timeout
  | 'stream_stall'         // stream silence exceeded streamStallTimeoutMs
  | 'cancelled'            // AbortSignal fired
  | 'http'                 // residual unclassified 4xx
  | 'unknown';             // catch-all

class LlmError extends Error {
  readonly provider: string;
  readonly statusCode?: number;
  readonly retryable: boolean;
  readonly kind: LlmErrorKind; // always defined in v1.0.0
}

See MIGRATION.md for the full migration table from err.kind === 'http' checks to the new specific kinds.

Gemini cancellation caveat

@google/genai does not accept a per-call AbortSignal. Cancellation uses Promise.race — when the internal controller aborts, we stop awaiting, but the SDK's HTTP request continues in the background until the SDK-level timeout fires. The SDK client is constructed with httpOptions.timeout = configTimeoutMs * 2 as a backstop. This bounds the leaked request to at most 2× the configured timeout. Native signal support will be added when the SDK provides it.

Error handling

All provider errors are normalized into LlmError:

import { LlmError } from '@diabolicallabs/llm-client';

try {
  const response = await client.complete(messages);
} catch (err) {
  if (err instanceof LlmError) {
    console.error(err.provider, err.statusCode, err.retryable, err.kind);
  }
}

Retryable errors (429, 5xx, network failures, timeout) are retried automatically with exponential backoff and full jitter before throwing. Cancelled and stream-stall errors are never retried.
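Exponential backoff with full jitter picks a random delay between zero and an exponentially growing ceiling. The sketch below shows the calculation only; the 500 ms base, the 30 s cap, and the function name are assumptions for illustration and are not the toolkit's documented defaults.

```typescript
// Full jitter: uniform delay in [0, min(cap, base * 2^attempt)).
function fullJitterDelayMs(
  attempt: number,                     // 0 for the first retry
  baseMs = 500,                        // assumed base delay
  capMs = 30_000,                      // assumed upper bound
  random: () => number = Math.random,  // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Spreading retries across the whole window keeps many clients that failed at the same moment from retrying in lockstep.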

DeepSeek model IDs (v1.0.1+)

DeepSeek deprecated the deepseek-chat and deepseek-reasoner identifiers as of 2026. The canonical IDs are:

| Model | API ID | Notes |
|---|---|---|
| V4 Flash | deepseek-v4-flash | General use and reasoning (thinking mode). Canonical default. |
| V4 Pro | deepseek-v4-pro | High-capability tier. Promotional pricing active through 2026-05-31. |

Deprecated aliases — DeepSeek's API still accepts these server-side (they route to V4 variants) but new code should use the canonical IDs:

| Deprecated ID | Now routes to | Change |
|---|---|---|
| deepseek-chat | deepseek-v4-flash non-thinking | Was DeepSeek-V3; now resolves to V4 |
| deepseek-reasoner | deepseek-v4-flash thinking mode | Was DeepSeek-R1; now resolves to V4 thinking |

Usage:

// Canonical V4 Flash (default; replaces deepseek-chat)
const flashClient = createClientFromEnv('deepseek', 'deepseek-v4-flash');

// Canonical V4 Pro
const proClient = createClientFromEnv('deepseek', 'deepseek-v4-pro');
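
When migrating stored configuration, a small lookup built from the deprecated-alias table can rewrite old IDs to canonical ones. This helper is not part of the package. The names below are hypothetical, and the thinking flag is only a way to record the mode each alias implied.

```typescript
// Mapping copied from the deprecated-alias table above.
// `thinking` is a hypothetical field, not a library option.
interface DeepSeekTarget {
  model: 'deepseek-v4-flash' | 'deepseek-v4-pro';
  thinking: boolean;
}

const DEPRECATED_DEEPSEEK_IDS: Record<string, DeepSeekTarget> = {
  'deepseek-chat': { model: 'deepseek-v4-flash', thinking: false },
  'deepseek-reasoner': { model: 'deepseek-v4-flash', thinking: true },
};

// Returns the canonical ID for a deprecated alias; any other ID passes through unchanged.
function canonicalDeepSeekModel(id: string): string {
  return DEPRECATED_DEEPSEEK_IDS[id]?.model ?? id;
}
```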

Per-response cost computation (v1.1.0)

Attach cost?: LlmCost to every response by configuring pricing at client creation time. Requires @diabolicallabs/llm-pricing as an optional peer dep.

pnpm add @diabolicallabs/llm-pricing

import { createClient } from '@diabolicallabs/llm-client';

// createClient() is async (v1.7.0+) — await it
const client = await createClient({
  provider: 'anthropic',
  model: 'claude-sonnet-4-6',
  apiKey: process.env.ANTHROPIC_API_KEY!,
  pricing: { computeOnEveryCall: true },
});

const response = await client.complete(messages);
console.log(response.cost);
// {
//   input:     0.0003,   // USD
//   output:    0.00075,  // USD
//   cacheRead: 0,
//   cacheWrite: 0,
//   total:     0.00105,  // USD
//   currency:  'USD',
//   isPartial: false,    // true for o-series (invisible reasoning tokens) or sonar-deep-research
// }
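
The arithmetic behind such a breakdown is presumably a per-million-token rate multiplied by the token count. The sketch below makes that explicit. The field names inputPer1M and outputPer1M follow the pricing-table entry in the static table override example. The rates and token counts are illustrative values chosen to reproduce the sample output, not verified prices, and cache tokens are left out for brevity.

```typescript
// Illustrative rate shape; field names follow the README's pricing-table entry.
interface ModelRate {
  inputPer1M: number;  // USD per 1,000,000 input tokens
  outputPer1M: number; // USD per 1,000,000 output tokens
}

interface CostBreakdown {
  input: number;
  output: number;
  total: number;
  currency: 'USD';
}

// cost = (tokens / 1,000,000) * rate per 1M tokens, computed separately per direction.
function computeCost(rate: ModelRate, inputTokens: number, outputTokens: number): CostBreakdown {
  const input = (inputTokens / 1e6) * rate.inputPer1M;
  const output = (outputTokens / 1e6) * rate.outputPer1M;
  return { input, output, total: input + output, currency: 'USD' };
}

// Assumed values: 100 input tokens at $3 per 1M and 50 output tokens at $15 per 1M.
const example = computeCost({ inputPer1M: 3, outputPer1M: 15 }, 100, 50);
console.log(example.total.toFixed(5)); // prints "0.00105"
```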

Remote pricing table (v1.7.0)

Set pricing.remoteUrl to fetch the latest prices from a URL on client init, with a stale-while-revalidate cache. No code change or npm release needed when prices change — consumers pick up updates on the next process restart.

const client = await createClient({
  provider: 'anthropic',
  model: 'claude-sonnet-4-6',
  apiKey: process.env.ANTHROPIC_API_KEY!,
  pricing: {
    remoteUrl: 'https://raw.githubusercontent.com/mannism/dlabs-toolkit/main/pricing/table.json',
    cacheTtlMs: 24 * 60 * 60 * 1000, // 24h (default)
    computeOnEveryCall: true,
  },
});

Precedence (highest → lowest):

| pricing.table | pricing.remoteUrl | Result |
|---|---|---|
| set | any | Consumer table always wins; no fetch |
| unset | set | Fetched on init, cached per TTL |
| unset | unset | Bundled DEFAULT_PRICING_TABLE |

On fetch failure (network error, HTTP error, schema validation failure, 5s timeout), the client falls back silently to DEFAULT_PRICING_TABLE and logs a structured warning. Pricing failures never crash LLM calls.

A structured pricing_source log line is emitted on every createClient() with a pricing config:

{ "event": "pricing_source", "source": "remote", "url": "...", "fetchedAt": "..." }

source is one of: "remote" | "cache" | "fallback" | "bundled" | "consumer_override".
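
One way to read the precedence and fallback rules together is as a small decision function over the client's pricing state. The sketch below is an interpretation, not the library's code. PricingState and resolvePricingSource are hypothetical names, and the mapping of a fresh cache to "cache" and a failed fetch to "fallback" is an assumption. It also ignores the background revalidation that a stale-while-revalidate cache would perform.

```typescript
type PricingSource = 'remote' | 'cache' | 'fallback' | 'bundled' | 'consumer_override';

// Hypothetical summary of the inputs that decide where prices come from.
interface PricingState {
  hasTable: boolean;       // pricing.table is set
  hasRemoteUrl: boolean;   // pricing.remoteUrl is set
  cacheIsFresh: boolean;   // assumption: a cached copy younger than cacheTtlMs exists
  fetchSucceeded: boolean; // assumption: outcome of the remote fetch, if one is attempted
}

function resolvePricingSource(s: PricingState): PricingSource {
  if (s.hasTable) return 'consumer_override'; // consumer table always wins, no fetch
  if (!s.hasRemoteUrl) return 'bundled';      // nothing configured: DEFAULT_PRICING_TABLE
  if (s.cacheIsFresh) return 'cache';         // within TTL, no network round trip
  return s.fetchSucceeded ? 'remote' : 'fallback';
}
```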

Static table override

The pricing.table option accepts a custom PricingTable from @diabolicallabs/llm-pricing to override default rates:

import { createClient } from '@diabolicallabs/llm-client';
import { DEFAULT_PRICING_TABLE } from '@diabolicallabs/llm-pricing';

const client = await createClient({
  provider: 'openai',
  model: 'gpt-5.5',
  apiKey: process.env.OPENAI_API_KEY!,
  pricing: {
    computeOnEveryCall: true,
    table: {
      ...DEFAULT_PRICING_TABLE,
      openai: {
        'gpt-5.5': { inputPer1M: 4.5, outputPer1M: 28.0, verifiedAt: '2026-05-14', sourceUrl: 'internal' },
      },
    },
  },
});

stream() does not attach cost — cost requires final token counts from a complete response. Use complete() if you need cost tracking. See @diabolicallabs/llm-pricing for the full pricing table, maintenance plan, and pnpm pricing:verify diagnostic script.

Hooks (v1.5.0+)

Attach beforeCall and afterCall hooks to any createClient() config. Hooks fire for all five call types: complete, stream, structured, withTools, streamStructured.

const client = await createClient({
  provider: 'anthropic',
  model: 'claude-sonnet-4-6',
  apiKey: process.env.ANTHROPIC_API_KEY!,
  hooks: {
    beforeCall: async (ctx) => {
      // ctx.callType, ctx.provider, ctx.model, ctx.messages, ctx.options
    },
    afterCall: async (ctx) => {
      // ctx.request, ctx.response, ctx.usage, ctx.error, ctx.latencyMs
    },
  },
});

beforeCall — request mutation

Return { messages, options } to replace the originals for that call. Subsequent calls use the original config values.

hooks: {
  // PII redaction before the request leaves the process
  beforeCall: async (ctx) => ({
    messages: ctx.messages.map((m) => ({
      ...m,
      content: redactPii(m.content),
    })),
  }),
}
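
The redactPii function in the example above is not provided by the package. A minimal stand-in might look like the sketch below, which assumes message content is a plain string and masks only email addresses and US-style phone numbers. Real redaction usually needs a broader rule set.

```typescript
// Hypothetical redactPii: masks email addresses and US-style phone numbers only.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE = /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g;

function redactPii(text: string): string {
  return text.replace(EMAIL, '[email]').replace(PHONE, '[phone]');
}
```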

beforeCall — short-circuit caching

Return { skip: cachedResponse } to return a pre-built response without executing the provider call. The retry and failover layers do not fire.

hooks: {
  beforeCall: async (ctx) => {
    const cached = await cache.get(cacheKey(ctx.messages));
    if (cached) return { skip: cached };
  },
}

For streaming call types (stream, streamStructured), skip must be an AsyncGenerator matching the call's event shape.
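
For the streaming case, a cached completion can be replayed through an async generator. The sketch below assumes a hypothetical TextChunk event shape. The real event types are defined by the library, so the yielded objects would need to match them.

```typescript
// Hypothetical event shape; substitute the library's actual stream event type.
interface TextChunk {
  type: 'text';
  text: string;
}

// Replays a cached completion as a sequence of fixed-size text chunks.
async function* replayCached(text: string, chunkSize: number = 16): AsyncGenerator<TextChunk> {
  for (let i = 0; i < text.length; i += chunkSize) {
    yield { type: 'text', text: text.slice(i, i + chunkSize) };
  }
}

// Inside a beforeCall hook, the generator would be returned as the skip value, e.g.:
//   if (cached) return { skip: replayCached(cached) };
```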

afterCall — observability

Fires after the call completes (or after generator exhaustion for streams). Errors in afterCall are caught, logged as a structured warning, and dropped — they never crash the call that already returned.

hooks: {
  afterCall: async (ctx) => {
    logger.info({
      callType: ctx.request.callType,
      model: ctx.request.model,
      latencyMs: ctx.latencyMs,
      inputTokens: ctx.usage?.inputTokens,
      outputTokens: ctx.usage?.outputTokens,
      error: ctx.error?.message,
    });
  },
}

ctx.usage (v1.6.0+): Populated for all 5 call types. For non-streaming paths (complete, structured, withTools), ctx.usage mirrors ctx.response.usage. For stream(), usage comes from the terminal chunk; for streamStructured(), from the done event. ctx.usage is undefined only when the call failed before a response was received.

ctx.response: undefined for stream() and streamStructured() — no accumulated response object exists for streaming calls. Read token counts from ctx.usage instead.

Hook contract

| Property | Value |
|---|---|
| Firing frequency | Once per public method invocation, NOT per retry attempt |
| beforeCall error | Propagates as LlmError({ kind: 'bad_request' }) |
| afterCall error | Logged as structured warn, dropped; never propagates |
| ctx.model at beforeCall | Primary (first) model in config array. May differ from response.model if failover fires. |
| ctx.usage (v1.6.0+) | Populated for all 5 call types. undefined only on error paths (call failed before a response). |
| ctx.response on streaming | undefined for stream() and streamStructured(); read token counts from ctx.usage. |
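
The firing-frequency and error rows of the contract can be pictured with a simplified wrapper: hooks run once around the retry loop, and a throwing afterCall is logged and dropped. The sketch is an assumption about the structure, not the library's code. SketchHooks and callWithHooks are hypothetical names, every error is retried immediately without backoff, and the result type is fixed to string for brevity.

```typescript
// Hypothetical, simplified hook shape for this sketch only.
interface SketchHooks {
  beforeCall?: () => Promise<void>;
  afterCall?: (info: { attempts: number; error: Error | null }) => Promise<void>;
}

async function callWithHooks(
  attemptFn: () => Promise<string>,
  hooks: SketchHooks,
  maxAttempts: number = 3,
): Promise<string> {
  await hooks.beforeCall?.(); // fires once; an error here propagates to the caller

  let attempts = 0;
  let value: string | null = null;
  let failure: Error | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    attempts = attempt;
    try {
      value = await attemptFn();
      failure = null;
      break;
    } catch (err) {
      failure = err instanceof Error ? err : new Error(String(err));
    }
  }

  try {
    await hooks.afterCall?.({ attempts, error: failure }); // fires once, after all retries
  } catch (hookErr) {
    console.warn('afterCall hook failed; dropping error', hookErr);
  }

  if (failure !== null || value === null) {
    throw failure ?? new Error('no attempts made');
  }
  return value;
}
```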

When to use hooks vs instrumentClient

Use hooks when you want request-level interception: PII redaction, system prompt injection, cache short-circuit, custom logging. Hooks are configured directly on createClient().

Use @diabolicallabs/agent-sdk when you want ingestion of CallRecord objects to the Agent Spend Dashboard. instrumentClient() internally uses the hooks infrastructure since v1.4.0, but the public API (instrumentClient, CallRecord, AgentSdkConfig) stays the SDK's entry point.

Both compose: instrumentClient() merges its afterCall handler with any hooks already set on the client config.

Token normalization

All providers return LlmUsage in a consistent shape regardless of the underlying API's field names:

interface LlmUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cacheCreationTokens?: number; // Anthropic prompt cache only
  cacheReadTokens?: number;     // Anthropic prompt cache only
}
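
To illustrate what normalization involves, the sketch below maps two raw usage payloads onto LlmUsage. The raw field names are the ones the Anthropic Messages API and the OpenAI Chat Completions API return. The mapping functions are hypothetical, and how the library itself counts cached tokens toward totals is not specified here, so this version simply sums input and output.

```typescript
interface LlmUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cacheCreationTokens?: number;
  cacheReadTokens?: number;
}

// Raw provider payloads, limited to the fields used in this sketch.
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

interface OpenAiUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Hypothetical mapper: Anthropic reports no total, so it is derived here.
function fromAnthropic(u: AnthropicUsage): LlmUsage {
  const usage: LlmUsage = {
    inputTokens: u.input_tokens,
    outputTokens: u.output_tokens,
    totalTokens: u.input_tokens + u.output_tokens,
  };
  if (u.cache_creation_input_tokens != null) usage.cacheCreationTokens = u.cache_creation_input_tokens;
  if (u.cache_read_input_tokens != null) usage.cacheReadTokens = u.cache_read_input_tokens;
  return usage;
}

// Hypothetical mapper: OpenAI field names differ but the total is provided.
function fromOpenAi(u: OpenAiUsage): LlmUsage {
  return {
    inputTokens: u.prompt_tokens,
    outputTokens: u.completion_tokens,
    totalTokens: u.total_tokens,
  };
}
```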