@mzedstudio/llm-cache

v0.1.0 (published 3 months ago)

LLM request/response caching with tiered TTL, time travel, and request normalization for Convex.


raymondui

convex convex-component llm cache openai ai ttl idempotency


A Convex component that caches LLM API request/response pairs with tiered TTL, time travel, and built-in observability. Stop paying for duplicate calls — get instant responses for identical prompts.

Features

Deterministic cache keys — SHA-256 hash of normalized request parameters. Same prompt always hits the same cache entry, regardless of key order or whitespace.
Tiered TTL with auto-promotion — entries start at 24h, promote to 7 days on first hit, and can be pinned permanently.
Time travel — every time a cached request gets a new response, the old one is archived. Query the full response history for any request to see how model output has changed over time.
Request normalization — trims whitespace, lowercases model names, rounds floats, and sorts keys so {model: "GPT-4o", temperature: 0.70001} hits the same cache as {temperature: 0.7, model: "gpt-4o"}.
Flexible invalidation — delete by cache key, model name, model version, tag, or time range.
Query and inspect — filter cached entries by model, tag, or time range. Get hit counts, stats breakdowns, and storage metrics.
Configurable TTLs — set per-model and per-tag TTL overrides. Give ephemeral chat completions a short TTL and expensive embedding calls a long one.
OpenAI-compatible — works with any request format that has messages, model, and optional parameters like temperature, max_tokens, top_p.

Installation

npm install @mzedstudio/llm-cache convex

Setup

1. Register the component

// convex/convex.config.ts
import { defineApp } from "convex/server";
import llmCache from "@mzedstudio/llm-cache/convex.config.js";

const app = defineApp();
app.use(llmCache, { name: "llmCache" });
export default app;

2. Initialize the client

// convex/llm.ts
import { LLMCache } from "@mzedstudio/llm-cache";
import { components } from "./_generated/api";

const cache = new LLMCache(components.llmCache);

Usage

Cache an LLM call

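// convex/llm.ts (continued; `cache` is the LLMCache instance created above)
import OpenAI from "openai";
import { action } from "./_generated/server";
import { v } from "convex/values";

// Reads OPENAI_API_KEY from the Convex deployment's environment variables
const openai = new OpenAI();

export const chat = action({
  args: {
    messages: v.array(
      v.object({
        role: v.union(v.literal("system"), v.literal("user"), v.literal("assistant")),
        content: v.string(),
      }),
    ),
  },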
  handler: async (ctx, args) => {
    const request = {
      messages: args.messages,
      model: "gpt-4o",
      temperature: 0.7,
    };

    // Check cache first
    const cached = await cache.lookup(ctx, { request });
    if (cached) return cached.response;

    // Cache miss — call your LLM provider
    const response = await openai.chat.completions.create(request);

    // Store the response
    await cache.store(ctx, { request, response, tags: ["chat"] });

    return response;
  },
});
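The lookup, call, store sequence above can be wrapped in a small read-through helper. This is a sketch, not part of the package: `CacheLike` is a hypothetical interface that only mirrors the shape of the `lookup` and `store` calls (the real methods also take `ctx`), and `MemoryCache` is a hypothetical in-memory stand-in so the pattern can run without Convex.

```typescript
// Hypothetical interface mirroring the lookup/store call shapes shown above.
// The real LLMCache methods also take a Convex ctx as their first argument.
interface CacheLike {
  lookup(args: { request: Record<string, unknown> }): Promise<{ response: unknown } | null>;
  store(args: { request: Record<string, unknown>; response: unknown; tags?: string[] }): Promise<string>;
}

// Read-through helper: return the cached response on a hit; on a miss, call
// the provider once, store its result, and return it.
async function cachedCall<T>(
  cache: CacheLike,
  request: Record<string, unknown>,
  callProvider: (req: Record<string, unknown>) => Promise<T>,
  tags: string[] = [],
): Promise<T> {
  const hit = await cache.lookup({ request });
  if (hit) return hit.response as T;
  const response = await callProvider(request);
  await cache.store({ request, response, tags });
  return response;
}

// Hypothetical in-memory stand-in keyed by JSON.stringify (no normalization).
class MemoryCache implements CacheLike {
  private entries = new Map<string, unknown>();

  async lookup({ request }: { request: Record<string, unknown> }) {
    const key = JSON.stringify(request);
    return this.entries.has(key) ? { response: this.entries.get(key) } : null;
  }

  async store({ request, response }: { request: Record<string, unknown>; response: unknown }) {
    const key = JSON.stringify(request);
    this.entries.set(key, response);
    return key;
  }
}
```

With the real client, an adapter that binds `ctx`, such as `{ lookup: (a) => cache.lookup(ctx, a), store: (a) => cache.store(ctx, a) }`, could supply the `CacheLike` shape, assuming the returned entry exposes `response` as in the example above.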

Pin important responses

Pinned entries never expire (TTL tier 2):

await cache.store(ctx, {
  request,
  response,
  pin: true,
  tags: ["system-prompt"],
});

Read-only peek (no hit counting)

Use peek from query contexts where mutations aren't available:

import { query } from "./_generated/server";

export const checkCache = query({
  args: { request: v.any() },
  handler: async (ctx, args) => {
    return await cache.peek(ctx, { request: args.request });
  },
});

Filter and browse cached entries

// All GPT-4o entries from the last hour
const entries = await cache.query(ctx, {
  model: "gpt-4o",
  after: Date.now() - 3600000,
});

// All entries tagged "summarize"
const summaries = await cache.query(ctx, { tag: "summarize", limit: 20 });

Time travel

When store() is called with a response that differs from the existing cached response for the same request, the old response is automatically archived. Use history() to retrieve the full timeline:

// Get every response this request has ever produced
const timeline = await cache.history(ctx, { request });

// timeline = [
//   { response: {...}, storedAt: 1706745600000, isCurrent: false },  // oldest
//   { response: {...}, storedAt: 1707000000000, isCurrent: false },
//   { response: {...}, storedAt: 1707350400000, isCurrent: true },   // current
// ]

Each entry includes isCurrent: boolean to distinguish the live cached response from archived ones. Archived entries preserve the model, modelVersion, tags, and metadata they had at the time of archival.

Identical responses are not archived — if the model returns the same output, no duplicate history entry is created.

Invalidate on model upgrade

// Clear all cached responses for a specific model
await cache.invalidate(ctx, { model: "gpt-4o" });

// Or invalidate by model version
await cache.invalidate(ctx, { modelVersion: "gpt-4o-2024-05-13" });

Scheduled cleanup

export const cleanupExpired = action({
  handler: async (ctx) => {
    const result = await cache.cleanup(ctx, { batchSize: 200 });
    console.log(`Deleted ${result.deletedCount} expired entries`);
    // result.hasMore === true means there are more to clean
  },
});

Preview before deleting with dryRun:

const preview = await cache.cleanup(ctx, { dryRun: true });
// preview.keys lists what *would* be deleted, but nothing is removed
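Because each `cleanup` call removes at most one batch, a caller that wants to clear the whole backlog in one run can loop until `hasMore` is false. The helper below is a sketch under assumptions: it takes the batch call as a plain function so it runs without Convex, `CleanupResultLike` is a local mirror of the documented `CleanupResult` fields, and the `maxRounds` cap is a hypothetical safeguard, not a package option.

```typescript
// Local mirror of the documented CleanupResult fields.
interface CleanupResultLike {
  deletedCount: number;
  keys: string[];
  hasMore: boolean;
}

// Call `runBatch` until it reports no more expired entries, or until
// `maxRounds` batches have run. Returns the total number of deleted entries.
async function drainExpired(
  runBatch: () => Promise<CleanupResultLike>,
  maxRounds = 50,
): Promise<number> {
  let total = 0;
  for (let round = 0; round < maxRounds; round++) {
    const result = await runBatch();
    total += result.deletedCount;
    if (!result.hasMore) break;
  }
  return total;
}
```

Inside an action this could be called as `await drainExpired(() => cache.cleanup(ctx, { batchSize: 200 }))`. Avoid combining it with `dryRun: true`: a dry run deletes nothing, so the loop would likely keep going until `maxRounds`. For very large backlogs, scheduling another run may be preferable to a long loop, since Convex actions have an execution time limit.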

Cache stats

const stats = await cache.getStats(ctx);
// {
//   totalEntries: 142,
//   totalHits: 891,
//   entriesByModel: { "gpt-4o": 98, "claude-sonnet-4-5-20250929": 44 },
//   hitsByModel: { "gpt-4o": 720, "claude-sonnet-4-5-20250929": 171 },
//   oldestEntry: 1706745600000,
//   newestEntry: 1707350400000,
// }
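`getStats` returns raw counts; ratios such as the average number of hits per entry can be derived on the client. A sketch, using a local `CacheStatsLike` type that mirrors the documented fields and the sample numbers from the output above (the `hitsPerEntry` helper is hypothetical):

```typescript
// Local mirror of the documented CacheStats fields used here.
interface CacheStatsLike {
  totalEntries: number;
  totalHits: number;
  entriesByModel: Record<string, number>;
  hitsByModel: Record<string, number>;
}

// Average hits per cached entry, overall and per model (0 when there are no entries).
function hitsPerEntry(stats: CacheStatsLike) {
  const byModel: Record<string, number> = {};
  for (const [model, entries] of Object.entries(stats.entriesByModel)) {
    byModel[model] = entries > 0 ? (stats.hitsByModel[model] ?? 0) / entries : 0;
  }
  const overall = stats.totalEntries > 0 ? stats.totalHits / stats.totalEntries : 0;
  return { overall, byModel };
}

const sample: CacheStatsLike = {
  totalEntries: 142,
  totalHits: 891,
  entriesByModel: { "gpt-4o": 98, "claude-sonnet-4-5-20250929": 44 },
  hitsByModel: { "gpt-4o": 720, "claude-sonnet-4-5-20250929": 171 },
};

// overall is about 6.27; "gpt-4o" is about 7.35; the Claude model is about 3.89
const rates = hitsPerEntry(sample);
```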

Configuration

Tune TTLs, toggle normalization, and set per-model or per-tag overrides:

await cache.setConfig(ctx, {
  config: {
    defaultTtlMs: 12 * 60 * 60 * 1000,            // 12 hours (default: 24h)
    promotionTtlMs: 14 * 24 * 60 * 60 * 1000,      // 14 days (default: 7d)
    ttlByModel: {
      "gpt-4o-mini": 1 * 60 * 60 * 1000,           // 1 hour for cheap models
      "gpt-4o": 48 * 60 * 60 * 1000,               // 2 days for expensive models
    },
    ttlByTag: {
      "embedding": 30 * 24 * 60 * 60 * 1000,       // 30 days for embeddings
    },
    normalizeRequests: true,                         // default: true
    maxEntries: 10000,
  },
});

TTL priority order: tag > model > default. When multiple tags match, the longest TTL wins.
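The priority rule can be read as a small resolution function. This is a sketch of the documented behavior (any matching tag override beats a model override, which beats the default; among several matching tags the longest TTL wins), not the package's source. `TtlConfig` and `resolveTtlMs` are hypothetical names.

```typescript
// Hypothetical subset of the documented configuration fields.
interface TtlConfig {
  defaultTtlMs: number;
  ttlByModel: Record<string, number>;
  ttlByTag: Record<string, number>;
}

const has = (obj: Record<string, number>, key: string) =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Resolve the TTL for a new entry: tag > model > default.
function resolveTtlMs(config: TtlConfig, model: string, tags: string[] = []): number {
  const tagTtls = tags.filter((tag) => has(config.ttlByTag, tag)).map((tag) => config.ttlByTag[tag]);
  if (tagTtls.length > 0) return Math.max(...tagTtls); // longest matching tag TTL wins
  if (has(config.ttlByModel, model)) return config.ttlByModel[model];
  return config.defaultTtlMs;
}

// The configuration from the setConfig example above.
const HOUR_MS = 60 * 60 * 1000;
const exampleConfig: TtlConfig = {
  defaultTtlMs: 12 * HOUR_MS,
  ttlByModel: { "gpt-4o-mini": 1 * HOUR_MS, "gpt-4o": 48 * HOUR_MS },
  ttlByTag: { embedding: 30 * 24 * HOUR_MS },
};
```

Under this reading, a `gpt-4o-mini` call tagged `embedding` is kept for 30 days, not 1 hour, because the tag override takes precedence over the model override.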

Use replace: true to overwrite the entire config (unset fields revert to defaults):

await cache.setConfig(ctx, {
  config: { defaultTtlMs: 3600000 },
  replace: true,
});

How It Works

Cache Key Generation

Every request is hashed into a deterministic 64-character hex string (SHA-256). Before hashing, the request is normalized:

| Normalization step | Example |
|---|---|
| Sort object keys | {b:2, a:1} and {a:1, b:2} produce the same key |
| Trim message content | " Hello " matches "Hello" |
| Lowercase model name | "GPT-4o" matches "gpt-4o" |
| Round floats to 2 places | 0.7000001 matches 0.7 |
| Strip null/undefined fields | {model: "x", top_p: undefined} matches {model: "x"} |

Normalization can be disabled by setting normalizeRequests: false in the config if you need exact-match behavior.
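The rules in the table can be approximated in a few lines. This sketch is an assumption about how such a key could be derived, not the package's code: it uses Node's `createHash("sha256")`, treats every non-integer number as a float to round, and trims only the `content` string of each entry in `messages`. The function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Generic rules: drop null/undefined fields, round non-integer numbers to
// 2 decimal places, and rebuild objects with their keys in sorted order.
function normalizeValue(value: unknown): Json | undefined {
  if (value === null || value === undefined) return undefined;
  if (typeof value === "number") {
    return Number.isInteger(value) ? value : Math.round(value * 100) / 100;
  }
  if (typeof value === "string" || typeof value === "boolean") return value;
  if (Array.isArray(value)) {
    // Keep array positions stable: a dropped element becomes null.
    return value.map((item) => normalizeValue(item) ?? null);
  }
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(record).sort()) {
      const normalized = normalizeValue(record[key]);
      if (normalized !== undefined) out[key] = normalized;
    }
    return out;
  }
  return undefined; // functions, symbols and bigints are not valid request values
}

// Request-specific rules: lowercase the model name, trim message content.
function normalizeRequest(request: Record<string, unknown>): Json {
  const copy: Record<string, unknown> = { ...request };
  const model = copy.model;
  if (typeof model === "string") copy.model = model.toLowerCase();
  const messages = copy.messages;
  if (Array.isArray(messages)) {
    copy.messages = messages.map((m) =>
      m !== null && typeof m === "object" && typeof m.content === "string"
        ? { ...m, content: m.content.trim() }
        : m,
    );
  }
  return normalizeValue(copy) ?? null;
}

// SHA-256 over the canonical JSON string, hex-encoded (64 characters).
function cacheKey(request: Record<string, unknown>): string {
  const canonical = JSON.stringify(normalizeRequest(request));
  return createHash("sha256").update(canonical).digest("hex");
}
```

Hashing a canonical JSON string keeps the key deterministic: once keys are sorted and equivalent values collapsed, equivalent requests serialize to identical bytes.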

TTL Tiers

Cache entries automatically promote through three tiers based on access patterns:

| Tier | Name | Default Duration | Trigger |
|---|---|---|---|
| 0 | Default | 24 hours | Entry is first stored |
| 1 | Promoted | 7 days | First cache hit (via lookup) |
| 2 | Pinned | Never expires | store with pin: true |

On each lookup hit:

Tier 0 entries promote to Tier 1 (expiry extends to 7 days from now)
Tier 1 entries refresh their expiry (another 7 days from now)
Tier 2 entries are unaffected (pinned)

Popular requests naturally survive longer without manual intervention.
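The promotion rules can be expressed as a pure function over an entry's tier and expiry. This is a sketch of the documented behavior, not the component's code: `TierState`, `onLookupHit` and `isExpired` are hypothetical, and treating the exact expiry instant as already expired is an assumption.

```typescript
type TtlTier = 0 | 1 | 2;

// Hypothetical subset of a cache entry's fields relevant to expiry.
interface TierState {
  ttlTier: TtlTier;
  expiresAt: number | null; // null = never expires (pinned)
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Apply one lookup hit at time `now`: tier 0 promotes to tier 1, tier 1
// refreshes its expiry, and tier 2 (pinned) is left unchanged.
function onLookupHit(state: TierState, now: number, promotionTtlMs = 7 * DAY_MS): TierState {
  if (state.ttlTier === 2) return state;
  return { ttlTier: 1, expiresAt: now + promotionTtlMs };
}

// Pinned entries never expire; others expire once `now` reaches expiresAt
// (the boundary handling is an assumption of this sketch).
function isExpired(state: TierState, now: number): boolean {
  return state.expiresAt !== null && now >= state.expiresAt;
}
```

For example, an entry stored at time t0 (expiring at t0 + 24h) and hit at t0 + 23h moves to tier 1 and now expires at t0 + 23h + 7 days.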

Response History (Time Travel)

The component maintains two tables:

cachedResponses — the current/active cached response for each unique request
responseHistory — archived responses that were replaced by newer ones

When store() is called for a request that already has a cached entry:

If the new response differs from the existing one, the old response is archived to responseHistory
The active entry is updated with the new response
If the response is identical, no archive entry is created

This gives you a complete audit trail of how model outputs have changed over time for any given request.
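The archive-on-change rule can be modeled with two in-memory maps standing in for cachedResponses and responseHistory. This sketch makes simplifying assumptions: responses are compared by their JSON serialization (which is key-order sensitive; the component may compare differently), an identical store leaves the existing entry untouched, and `HistoryStore` is a hypothetical class that keeps only a subset of the documented `HistoryEntry` fields.

```typescript
interface Snapshot {
  response: unknown;
  storedAt: number;
}

class HistoryStore {
  private current = new Map<string, Snapshot>(); // stands in for cachedResponses
  private archived = new Map<string, Snapshot[]>(); // stands in for responseHistory

  // Store a response; archive the previous one only if the response changed.
  store(cacheKey: string, response: unknown, now: number): void {
    const existing = this.current.get(cacheKey);
    if (existing && JSON.stringify(existing.response) === JSON.stringify(response)) {
      return; // identical response: no history entry is created
    }
    if (existing) {
      const list = this.archived.get(cacheKey) ?? [];
      list.push(existing);
      this.archived.set(cacheKey, list);
    }
    this.current.set(cacheKey, { response, storedAt: now });
  }

  // Oldest-first timeline with an isCurrent flag, like history().
  history(cacheKey: string): Array<Snapshot & { isCurrent: boolean }> {
    const past = (this.archived.get(cacheKey) ?? []).map((s) => ({ ...s, isCurrent: false }));
    const live = this.current.get(cacheKey);
    return live ? [...past, { ...live, isCurrent: true }] : past;
  }
}
```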

Architecture

Your Convex action                     llm-cache component
──────────────────                     ────────────────────
                                       ┌─────────────────────┐
cache.lookup(ctx, { request }) ───────>│ cachedResponses     │
  │                                    │   cacheKey (SHA-256)│
  ├─ hit ──> return cached response    │   request           │
  │   + increment hit count            │   response          │
  │   + promote TTL tier               │   hitCount, ttlTier │
  │                                    │   expiresAt         │
  └─ miss ──> call LLM API             │   model, tags       │
     then:                             └─────────────────────┘
cache.store(ctx, {             ───────>        │
  request, response                            │ (if response changed)
})                                             ▼
                                       ┌─────────────────────┐
cache.history(ctx, { request })───────>│ responseHistory     │
                                       │   cacheKey          │
                                       │   response          │
                                       │   model, tags       │
                                       │   storedAt          │
                                       └─────────────────────┘

API Reference

`LLMCache` class

All methods are accessed through an LLMCache instance:

import { LLMCache } from "@mzedstudio/llm-cache";
const cache = new LLMCache(components.llmCache);

`cache.lookup(ctx, { request, modelVersion? })`

Find a cached response and increment the hit count. Promotes TTL tier on hit.

Context: mutation or action
Returns: CacheEntry | null

const cached = await cache.lookup(ctx, { request });
if (cached) {
  console.log(cached.response, cached.hitCount, cached.ttlTier);
}

`cache.peek(ctx, { request, modelVersion? })`

Read-only cache lookup. Does not increment hit count or promote TTL tier. Safe to use from queries.

Context: any (query, mutation, action)
Returns: CacheEntry | null

`cache.store(ctx, { request, response, tags?, metadata?, pin?, modelVersion? })`

Store an LLM response. If an entry with the same cache key already exists and the response differs, the old response is archived to the history table before the entry is updated.

Context: mutation or action
Returns: string (cache key)

| Parameter | Type | Description |
|---|---|---|
| request | Record<string, unknown> | The LLM request object |
| response | unknown | The LLM response to cache |
| tags | string[] | Optional tags for filtering and TTL overrides |
| metadata | unknown | Optional arbitrary metadata |
| pin | boolean | Pin entry (never expires, TTL tier 2) |
| modelVersion | string | Version string for model-version-based invalidation |

`cache.get(ctx, { cacheKey })`

Direct lookup by cache key. No hit counting.

Context: any
Returns: CacheEntry | null

`cache.query(ctx, { model?, tag?, after?, before?, limit? })`

Filter and list cached entries. Results are ordered newest-first.

Context: any
Returns: CacheEntry[]

| Parameter | Type | Description |
|---|---|---|
| model | string | Filter by model name |
| tag | string | Filter by tag |
| after | number | Exclude entries created before this timestamp |
| before | number | Exclude entries created after this timestamp |
| limit | number | Max results (default 50, max 200) |
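The filter semantics in this table (optional model and tag filters, timestamp bounds, newest-first ordering, a default limit of 50 capped at 200) can be mirrored in memory, for example to unit-test code that post-processes query results. This is a sketch under assumptions: `EntryLike` and its `createdAt` field name are hypothetical, and entries created exactly at `after` or `before` are kept, which is one reading of "exclude entries created before/after".

```typescript
// Hypothetical subset of a cache entry; field names are assumptions.
interface EntryLike {
  model: string;
  tags: string[];
  createdAt: number;
}

interface QueryOptions {
  model?: string;
  tag?: string;
  after?: number;
  before?: number;
  limit?: number;
}

// In-memory mirror of the documented query() filters.
function filterEntries(entries: EntryLike[], opts: QueryOptions = {}): EntryLike[] {
  const limit = Math.min(Math.max(opts.limit ?? 50, 0), 200); // default 50, max 200
  return entries
    .filter((e) => opts.model === undefined || e.model === opts.model)
    .filter((e) => opts.tag === undefined || e.tags.includes(opts.tag))
    .filter((e) => opts.after === undefined || e.createdAt >= opts.after)
    .filter((e) => opts.before === undefined || e.createdAt <= opts.before)
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .slice(0, limit);
}
```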

`cache.history(ctx, { request })`

Get the full response timeline for a request. Returns all archived responses plus the current one, ordered oldest-first.

Context: any
Returns: HistoryEntry[]

Each HistoryEntry contains:

| Field | Type | Description |
|---|---|---|
| cacheKey | string | The cache key for this request |
| request | unknown | The original request |
| response | unknown | The response at this point in time |
| model | string | Model name |
| modelVersion | string? | Model version at time of storage |
| tags | string[]? | Tags at time of storage |
| metadata | unknown? | Metadata at time of storage |
| storedAt | number | Timestamp when this response was stored |
| isCurrent | boolean | true for the active cached response, false for archived |

`cache.invalidate(ctx, { cacheKey?, model?, modelVersion?, tag?, before? })`

Delete matching cache entries. At least one filter parameter is required.

Context: mutation or action
Returns: number (count of deleted entries)

`cache.cleanup(ctx, { batchSize?, dryRun? })`

Remove expired entries in batches.

Context: action
Returns: CleanupResult

| Field | Type | Description |
|---|---|---|
| deletedCount | number | Entries deleted (0 if dryRun) |
| keys | string[] | Cache keys that were (or would be) deleted |
| hasMore | boolean | true if more expired entries remain |

`cache.setConfig(ctx, { config, replace? })`

Update cache configuration. By default, merges with existing config. Pass replace: true to overwrite entirely.

Context: mutation or action

`cache.getConfig(ctx)`

Read current cache configuration.

Context: any
Returns: CacheConfig

`cache.getStats(ctx)`

Get cache statistics.

Context: any
Returns: CacheStats

| Field | Type | Description |
|---|---|---|
| totalEntries | number | Total cached entries |
| totalHits | number | Sum of all hit counts |
| entriesByModel | Record<string, number> | Entry count per model |
| hitsByModel | Record<string, number> | Hit count per model |
| oldestEntry | number? | Oldest entry timestamp |
| newestEntry | number? | Newest entry timestamp |

Configuration options

| Field | Type | Default | Description |
|---|---|---|---|
| defaultTtlMs | number | 86400000 (24h) | Default TTL for new entries |
| promotionTtlMs | number | 604800000 (7d) | TTL after promotion to tier 1 |
| ttlByModel | Record<string, number> | {} | Per-model TTL overrides |
| ttlByTag | Record<string, number> | {} | Per-tag TTL overrides |
| normalizeRequests | boolean | true | Enable request normalization |
| maxEntries | number | none | Max cache entries (informational) |
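The difference between the default merge and `replace: true` can be illustrated with the defaults from this table. This is a sketch of the documented semantics under assumptions: `ConfigLike`, `DEFAULT_CONFIG` and `applyConfigUpdate` are hypothetical, and the merge is assumed to be shallow, so an update containing `ttlByModel` would replace that whole map rather than merge it model by model.

```typescript
interface ConfigLike {
  defaultTtlMs: number;
  promotionTtlMs: number;
  ttlByModel: Record<string, number>;
  ttlByTag: Record<string, number>;
  normalizeRequests: boolean;
  maxEntries?: number;
}

// Defaults from the configuration table.
const DEFAULT_CONFIG: ConfigLike = {
  defaultTtlMs: 86_400_000, // 24h
  promotionTtlMs: 604_800_000, // 7d
  ttlByModel: {},
  ttlByTag: {},
  normalizeRequests: true,
};

// Merge (default): fields missing from the update keep their current values.
// Replace: start from the defaults, so missing fields revert to defaults.
// The shallow spread is an assumption of this sketch.
function applyConfigUpdate(
  current: ConfigLike,
  update: Partial<ConfigLike>,
  replace = false,
): ConfigLike {
  const base = replace ? DEFAULT_CONFIG : current;
  return { ...base, ...update };
}
```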

Exported types

import type {
  CacheEntry,      // Full cache entry with metadata
  HistoryEntry,    // Time travel entry (archived or current)
  CacheConfig,     // Configuration document
  CacheStats,      // Statistics response
  CleanupResult,   // Cleanup action result
  ConfigUpdate,    // Partial config for updates
} from "@mzedstudio/llm-cache";

Validators are also exported for use in Convex function args/returns:

import {
  cacheEntryValidator,
  historyEntryValidator,
  configDocValidator,
  cacheStatsValidator,
  cleanupResultValidator,
  configUpdateValidator,
} from "@mzedstudio/llm-cache";

Example App

The example/ directory contains a full Next.js demo with:

Prompt — send prompts to LLM models, see cache hit/miss status
Explorer — browse and filter all cached entries
History — time travel UI showing the full response timeline for a request
Admin — configure TTLs, invalidate entries, run cleanup
Normalize — see how request normalization maps variants to the same cache key

To run the example:

cd example
npm install
npx convex dev   # in one terminal
npm run dev       # in another

Testing

39 tests covering cache operations, normalization, TTL tiers, time travel, queries, invalidation, cleanup, config, and stats.

pnpm test

To register the component in your own tests with convex-test:

import { test } from "vitest";
import { convexTest } from "convex-test";
import { register } from "@mzedstudio/llm-cache/test";
import schema from "./schema";

const modules = import.meta.glob("./**/*.ts");

test("my test", async () => {
  const t = convexTest(schema, modules);
  register(t, "llmCache");
  // ... test your functions that use the cache
});

License

Apache-2.0
