
deja-llm

A self-hostable, multi-layer semantic caching library for Node.js LLM applications.

The name is a pun on déjà vu — the library recognizes questions it has seen before and answers instantly without calling the LLM again.

Think of it as GPTCache for Node.js. No vendor lock-in, fully self-hostable, built for production.

Why does this exist? The Node.js ecosystem has no proper solution for this. The closest is @upstash/semantic-cache, but it is a fundamentally different concept: it only does semantic similarity matching and is locked to Upstash's hosted infrastructure. deja-llm:

- adds an exact-match layer before the semantic search, so repeated identical queries cost nothing
- caches the embeddings themselves to avoid re-embedding
- is fully self-hostable with your own Redis and Qdrant instances
- returns full observability on every result, including a latency breakdown and estimated cost saved


How it works

Every query passes through two cache layers before falling through to your LLM. You own the LLM call — the library is purely a caching layer.

Query
  │
  ▼
Layer 1 — Redis exact match
  If the exact same conversation was seen before → return instantly, zero cost
  │ miss
  ▼
Layer 2 — Qdrant semantic search
  Embed the conversation, find similar past queries by cosine similarity
  If similarity >= threshold → return cached response
  │ miss
  ▼
Your LLM call
  Call the LLM however you want, then store the response back into the cache

Embeddings are also cached in Redis so the same conversation is never embedded twice.

Every result includes which layer it hit on, similarity score, full latency breakdown, and estimated cost saved.


Install

npm install deja-llm ioredis @qdrant/js-client-rest openai

You also need running Redis and Qdrant instances. The quickest way to get both locally:

docker run -d -p 6379:6379 redis
docker run -d -p 6333:6333 qdrant/qdrant

Usage

import { DejaLLM } from "deja-llm";
import Anthropic from "@anthropic-ai/sdk";

const deja = new DejaLLM({
  redis: { url: "redis://localhost:6379" },
  qdrant: { url: "http://localhost:6333" },
  embedding: { provider: "openai", apiKey: process.env.OPENAI_API_KEY },
});

const anthropic = new Anthropic();
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is the capital of France?" },
];

// Check cache first
const hit = await deja.check(messages);
if (hit) {
  console.log(hit.response); // served from cache
  console.log(hit.layer);    // "exact" | "semantic"
  return;
}

// Cache miss — call the LLM yourself however you want
const res = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  // Anthropic takes the system prompt as a separate parameter,
  // so split it out of the cached message array here
  system: messages.find((m) => m.role === "system")?.content,
  messages: messages.filter((m) => m.role !== "system"),
});
const response = res.content[0].text;

// Store in cache for next time
await deja.store(messages, response);

Works with any LLM — OpenAI, Anthropic, Mistral, local models, anything that returns a string.
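
The same check/store flow works with the OpenAI SDK. A sketch reusing the deja and messages objects from above (gpt-4o-mini is just a placeholder model name):

import OpenAI from "openai";

const openai = new OpenAI();

const hit = await deja.check(messages);
if (hit) {
  console.log(hit.response); // served from cache
} else {
  // Cache miss: OpenAI accepts the system message inside the array, so pass it as-is
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages,
  });
  const response = res.choices[0].message.content ?? "";
  await deja.store(messages, response);
}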

Streaming

check() first, stream on a miss, then store() once the stream is complete:

const hit = await deja.check(messages);
if (hit) return hit.response;

let response = "";
const stream = await anthropic.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: messages.find((m) => m.role === "system")?.content,   // Anthropic takes the system prompt separately
  messages: messages.filter((m) => m.role !== "system"),
});
for await (const chunk of stream) {
  if (chunk.type === "content_block_delta" && chunk.delta.type === "text_delta") response += chunk.delta.text;
}

await deja.store(messages, response);

Result object

check() resolves to a CacheResult on a hit and to a falsy value on a miss; store() always resolves to a CacheResult:

{
  response: string;

  // Which layer answered. false means it was a miss (returned by store()).
  layer: "exact" | "semantic" | false;

  // Only present on a semantic hit
  similarity?: number;

  // Only present on a semantic hit
  match?: { cachedAt: Date };

  latency: {
    exactLookup: number;          // ms
    embeddingCacheLookup: number; // ms
    embedding: number | null;     // null if served from embedding cache
    semanticSearch: number;       // ms
    writeBack: number | null;     // null on cache hit
    total: number;                // ms
  };

  savings: {
    embeddingSkipped: boolean;
    estimatedUSD: number | null;  // embedding cost saved; null if model unknown
  };
}
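
For example, a hit can be logged with its layer, timing, and savings using exactly the fields above:

const hit = await deja.check(messages);
if (hit) {
  console.log(hit.layer);                  // "exact" or "semantic"
  if (hit.layer === "semantic") {
    console.log(hit.similarity);           // e.g. 0.95
    console.log(hit.match?.cachedAt);      // when the matched entry was stored
  }
  console.log(hit.latency.total);          // end-to-end cache lookup time in ms
  console.log(hit.savings.estimatedUSD);   // may be null if the model is unknown
}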

Configuration

Production warning: Always set redis.ttl in production. The Redis exact cache stores the full serialized conversation as the value — without a TTL, the cache grows unbounded as unique conversations accumulate. A recommended pattern is a short TTL on Redis (e.g. 3600s) to catch hot repeated queries, and a longer TTL on Qdrant (e.g. 86400s) for semantic matching over a longer window.

const deja = new DejaLLM({
  redis: {
    url: "redis://localhost:6379",   // default
    ttl: 3600,                       // seconds; strongly recommended in production
    keyPrefix: "deja:",              // default
  },

  qdrant: {
    url: "http://localhost:6333",
    apiKey: "...",                   // for Qdrant Cloud
    collectionName: "my_cache",      // auto-generated from model name if omitted
    ttl: 86400,                      // seconds; omit for no expiry
  },

  embedding: {
    provider: "openai",
    apiKey: "...",
    model: "text-embedding-3-small", // default
  },

  threshold: 0.92,     // semantic similarity threshold, default 0.92
  failSilently: true,  // on cache errors, fall through silently — default true
  logger: console,     // any object with debug/warn/error methods

  hooks: {
    onHit(result) { /* fired on cache hit */ },
    onMiss() { /* fired on cache miss */ },
    onStore(result) { /* fired after store() */ },
  },
});

Bring your own embedding provider

embedding accepts a custom provider instance directly, as long as it implements the interface:

import type { EmbeddingProvider } from "deja-llm";

class MyEmbeddings implements EmbeddingProvider {
  readonly model = "my-model";
  readonly dimensions = 1536;
  async embed(text: string): Promise<number[]> { ... }
}

const deja = new DejaLLM({
  embedding: new MyEmbeddings(),
  // ...
});
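
As an illustration of the interface, here is a provider that wraps OpenAI's text-embedding-3-large model via the official SDK. It is only a sketch of the shape; for standard OpenAI models you would normally just use the built-in "openai" provider:

import OpenAI from "openai";
import { DejaLLM } from "deja-llm";
import type { EmbeddingProvider } from "deja-llm";

class LargeOpenAIEmbeddings implements EmbeddingProvider {
  readonly model = "text-embedding-3-large";
  readonly dimensions = 3072; // native dimension of text-embedding-3-large
  private client = new OpenAI();

  async embed(text: string): Promise<number[]> {
    const res = await this.client.embeddings.create({ model: this.model, input: text });
    return res.data[0].embedding;
  }
}

const deja = new DejaLLM({
  redis: { url: "redis://localhost:6379" },
  qdrant: { url: "http://localhost:6333" },
  embedding: new LargeOpenAIEmbeddings(),
});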

Stats

deja-llm tracks hit/miss counters in memory for the lifetime of the current process. Counters reset on every restart and are not shared across multiple instances.

const snap = deja.stats();
// {
//   requests: 42,
//   hits: { exact: 18, semantic: 11, miss: 13 },
//   hitRate: 69,           // percentage, 0–100
//   estimatedUSDSaved: 0.0031
// }

deja.resetStats(); // reset all counters to zero

stats() is a lightweight convenience for local development and quick sanity checks — not a production metrics solution. For persistent, aggregated observability use the hooks below to push events wherever you want.


Hooks

Hooks let you plug into cache events for logging, metrics, or alerting:

const deja = new DejaLLM({
  // ...
  hooks: {
    onHit(result) {
      // fired on exact or semantic cache hit
      console.log(`Cache hit [${result.layer}] — saved ~$${result.savings.estimatedUSD}`);
    },
    onMiss() {
      // fired when both layers miss
      console.log("Cache miss — falling through to LLM");
    },
    onStore(result) {
      // fired after store() completes
      console.log(`Stored in ${result.latency.writeBack}ms`);
    },
  },
});

The result passed to onHit and onStore is the full CacheResult object described above.
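
One way to turn these events into persistent metrics is to increment Prometheus counters from the hooks. This sketch uses the prom-client package (not part of deja-llm); the counter name is arbitrary:

import client from "prom-client";
import { DejaLLM } from "deja-llm";

const cacheEvents = new client.Counter({
  name: "deja_cache_events_total",
  help: "deja-llm cache hits and misses",
  labelNames: ["outcome"],
});

const deja = new DejaLLM({
  redis: { url: "redis://localhost:6379" },
  qdrant: { url: "http://localhost:6333" },
  embedding: { provider: "openai", apiKey: process.env.OPENAI_API_KEY },
  hooks: {
    onHit(result) { cacheEvents.inc({ outcome: result.layer }); }, // "exact" or "semantic"
    onMiss() { cacheEvents.inc({ outcome: "miss" }); },
  },
});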


Maintenance

Vacuum expired Qdrant points

Qdrant does not expire vectors automatically. Call vacuum() periodically to delete expired points:

const deleted = await deja.vacuum();
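
Scheduling is up to you. A minimal sketch using a plain timer, assuming vacuum() resolves to the number of points removed (as the snippet above suggests); a cron job or background worker works just as well:

// Run vacuum once an hour and log how many expired points were deleted
setInterval(async () => {
  const deleted = await deja.vacuum();
  console.log(`deja-llm vacuum removed ${deleted} expired points`);
}, 60 * 60 * 1000);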

Design decisions

Why full conversation is used for caching

Both the exact hash and the semantic embedding are computed from the entire message array — system prompt, conversation history, and the latest user message. This prevents returning a cached response that was generated under a different system prompt or different context. The trade-off is fewer cache hits compared to embedding only the last user message, but no risk of returning wrong answers for context-dependent follow-up questions.
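
A short sketch of what this means in practice: the same user question under a different system prompt misses the exact layer, and only matches the semantic layer if the two full conversations are similar enough to clear the threshold:

await deja.store(
  [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" },
  ],
  "The capital of France is Paris."
);

// Same question under a different system prompt: the exact layer misses because
// the hash of the full array differs, and the semantic layer only returns the
// entry above if the whole-conversation embeddings still clear the threshold.
const hit = await deja.check([
  { role: "system", content: "Answer only in German." },
  { role: "user", content: "What is the capital of France?" },
]);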

Why embeddings are cached in Redis

Embedding the same conversation twice wastes money. The embedding vector is stored in Redis alongside the exact-match cache, keyed by the same conversation hash. On a Redis hit, the Qdrant search runs without an embedding API call.

Why the Qdrant collection name encodes the model

If you switch embedding models, the existing vectors become incompatible. Encoding the model name and dimensions in the collection name (deja__text_embedding_3_small__1536) means a model change automatically creates a new collection rather than silently searching with mismatched vectors.

Known limitation: ambiguous follow-up questions

Semantic caching works best for self-contained questions. An ambiguous follow-up like "And Germany?" will only match another cached conversation where the full context is semantically similar. This is correct behavior — returning a cached answer from a different context would be wrong. The similarity threshold is the primary safety net.


License

MIT