memflux

v1.0.0

Published

7 days ago

Intelligent AI response caching middleware — reduce costs and latency for LLM-powered applications

ai cache llm gemini openai middleware redis sqlite semantic-cache memflux cost-reduction latency express nodejs typescript

🤖 memflux

Semantic caching layer for OpenAI & Gemini APIs — save up to 70% on AI costs

The Problem

// Without memflux: every call costs money, even when the questions mean the same thing
const model = 'gpt-4o-mini';
const a = await openai.chat.completions.create({ model, messages: [{ role: 'user', content: 'What is TypeScript?' }] });       // $0.002
const b = await openai.chat.completions.create({ model, messages: [{ role: 'user', content: 'Explain TypeScript to me' }] });  // $0.002 (same answer!)
const c = await openai.chat.completions.create({ model, messages: [{ role: 'user', content: 'What is TS?' }] });               // $0.002 (again!)

Regular caches require exact string matches, so they cannot help here: users often phrase the same question in different ways.

The Solution

import OpenAI from 'openai';
import { aiCache } from 'memflux';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ai = aiCache(openai);  // ← one line wraps the client

const a = await ai.chat('What is TypeScript?');      // → API call ($0.002)
const b = await ai.chat('Explain TypeScript to me'); // → CACHE HIT (free ✨)
const c = await ai.chat('What is TS?');              // → CACHE HIT (free ✨)

const stats = ai.getStats();
console.log(`Hit rate: ${stats.hitRate}%`);          // → "Hit rate: 66.7%"
console.log(`Saved: $${stats.estimatedMoneySaved}`); // → "Saved: $0.004"

memflux converts every question into a semantic embedding vector and compares it against the embeddings of cached questions using cosine similarity. When the similarity score is at or above the threshold (0.85 by default), it returns the cached answer without making a completion API call.

Installation

npm install memflux

# Optional: Redis persistence (for multi-process / production)
npm install memflux ioredis

# Optional: SQLite persistence (for single-server deployments)
npm install memflux better-sqlite3

Requirements: Node.js ≥ 18, OpenAI SDK ≥ 4 or @google/generative-ai ≥ 0.1

Quick Start

OpenAI

import OpenAI from 'openai';
import { aiCache } from 'memflux';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const ai = aiCache(openai, {
  similarity: {
    threshold: 0.85,       // 85% semantic match = cache hit
    algorithm: 'cosine',   // cosine | euclidean | dot-product
    topK: 5,               // check top 5 candidates
  },
  ttl: 3_600,              // cache entries expire after 1 hour
  debug: false,            // set to true for verbose logs
});

// Single question
const answer = await ai.chat('What is machine learning?');

// With options
const answer2 = await ai.chat('Summarise this article', {
  model: 'gpt-4o',
  temperature: 0.3,
  systemPrompt: 'You are a helpful assistant.',
  bypassCache: false,      // set to true to force a fresh API call
  ttl: 7200,               // per-request TTL override
});

// Multi-turn conversation
const answer3 = await ai.chatWithMessages([
  { role: 'system', content: 'You are an expert in TypeScript.' },
  { role: 'user',   content: 'What are generics?' },
]);

// Full metadata (cache hit? similarity score? latency?)
const result = await ai.chatDetailed('What is TypeScript?');
console.log(result.cached);            // true / false
console.log(result.similarityScore);   // 0.9234
console.log(result.latencyMs);         // 38 (cache hit) or 2847 (API call)
console.log(result.tokensSaved);       // 512

Google Gemini

import { GoogleGenerativeAI } from '@google/generative-ai';
import { aiCacheGemini } from 'memflux/gemini';

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY!);

const ai = aiCacheGemini(genAI, {
  defaultModel: 'gemini-2.0-flash',
  similarity: { threshold: 0.85, algorithm: 'cosine', topK: 3 },
  ttl: 86_400,
});

const answer = await ai.chat('Explain quantum computing');

Storage Backends

In-Memory (default — zero dependencies)

const ai = aiCache(openai);
// Data is lost on process restart. Perfect for development.

Redis (production — persistent, multi-process)

import { aiCache, createRedisStore } from 'memflux';

const ai = aiCache(openai, {
  store: createRedisStore({
    url: process.env.REDIS_URL,   // default: redis://localhost:6379
    ttl: 86_400,
    keyPrefix: 'myapp:',          // namespace to avoid key collisions
  }),
});

SQLite (single-server persistent)

import { aiCache, createSQLiteStore } from 'memflux';

const ai = aiCache(openai, {
  store: createSQLiteStore({
    path: './cache.db',
    ttl: 7 * 86_400,  // 7 days
  }),
});
// Cache survives process restarts!

Express.js Middleware

Drop-in middleware for any Express route that accepts a question and returns an AI answer:

import express from 'express';
import OpenAI from 'openai';
import { aiCache, aiCacheMiddleware, aiCacheStatsHandler } from 'memflux';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ai = aiCache(openai);
const app = express();
app.use(express.json());

// POST /chat with body: { "message": "your question" }
app.post('/chat', aiCacheMiddleware({
  cache: ai,
  includeStats: true,           // include stats in every response
  onHit:  (req, res) => console.log('CACHE HIT!'),
  onMiss: (req, res) => console.log('Cache miss'),
}));

// GET /cache/stats
app.get('/cache/stats', aiCacheStatsHandler(ai));

Response headers automatically added:

X-Cache: HIT
X-Cache-Hit-Rate: 66.7%
X-Cache-Similarity: 0.9234
X-Cache-Latency-Ms: 38
X-Money-Saved: $0.0042

Fastify Plugin

import Fastify from 'fastify';
import OpenAI from 'openai';
import { aiCache } from 'memflux';
import { aiCacheFastifyPlugin } from 'memflux/middleware';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ai = aiCache(openai);
const app = Fastify({ logger: true });

await app.register(aiCacheFastifyPlugin, {
  cache: ai,
  prefix: '/api',              // registers POST /api/chat, GET /api/cache/stats
  exposeAdminRoutes: true,
});

await app.listen({ port: 3000 });

Configuration Reference

const ai = aiCache(openai, {
  // ── Model ──────────────────────────────────────────────────
  defaultModel: 'gpt-4o-mini',    // Default AI model for completions

  // ── Embedding ──────────────────────────────────────────────
  embedding: {
    model: 'text-embedding-3-small', // Embedding model
    dimensions: 1536,                // Vector dimensions (256–1536)
    batchSize: 100,                  // Texts per batch embedding request
    internalCacheSize: 5_000,        // In-memory embedding cache size
  },

  // ── Similarity ─────────────────────────────────────────────
  similarity: {
    algorithm: 'cosine',   // 'cosine' | 'euclidean' | 'dot-product'
    threshold: 0.85,       // 0.0 = match everything | 1.0 = exact match only
    topK: 5,               // Evaluate top K candidates
  },

  // ── Cache ──────────────────────────────────────────────────
  ttl: 86_400,             // Default TTL in seconds (0 = never expire)
  maxCacheSize: 10_000,    // Max entries before LRU eviction
  namespace: 'memflux',  // Key prefix for Redis/SQLite

  // ── Analytics ─────────────────────────────────────────────
  analytics: {
    enabled: true,
    trackCostSavings: true,
    costPerToken: 0.0000006,         // Adjust to match your model's pricing
    maxRecords: 10_000,
  },

  // ── Store ─────────────────────────────────────────────────
  store: createRedisStore(...),   // Optional custom backing store

  // ── Debug ─────────────────────────────────────────────────
  debug: false,            // Enable verbose stderr logging
});
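
The `ttl` and `maxCacheSize` options describe two independent ways an entry can leave the cache: expiry and LRU eviction. How memflux implements them internally is not shown in this README. The following is a minimal sketch of the general technique built on a `Map`; the class name `LruTtlCache` and the injectable clock are assumptions made for illustration.

```typescript
// Illustrative only: LruTtlCache is a hypothetical name, not part of memflux.
interface Slot<V> {
  value: V;
  expiresAt: number; // epoch milliseconds; Infinity means "never expire"
}

class LruTtlCache<V> {
  private readonly slots = new Map<string, Slot<V>>();
  private readonly maxSize: number;
  private readonly ttlSeconds: number;
  private readonly now: () => number;

  constructor(maxSize: number, ttlSeconds: number, now: () => number = () => Date.now()) {
    this.maxSize = maxSize;
    this.ttlSeconds = ttlSeconds;
    this.now = now;
  }

  get(key: string): V | undefined {
    const slot = this.slots.get(key);
    if (slot === undefined) return undefined;
    if (slot.expiresAt <= this.now()) {
      this.slots.delete(key); // expired: dropped lazily on read
      return undefined;
    }
    // A Map iterates in insertion order, so re-inserting marks the key as most recently used.
    this.slots.delete(key);
    this.slots.set(key, slot);
    return slot.value;
  }

  set(key: string, value: V): void {
    this.slots.delete(key);
    if (this.slots.size >= this.maxSize) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.slots.keys().next().value;
      if (oldest !== undefined) this.slots.delete(oldest);
    }
    // Follows the "0 = never expire" convention from the ttl comment above.
    const expiresAt = this.ttlSeconds === 0 ? Infinity : this.now() + this.ttlSeconds * 1000;
    this.slots.set(key, { value, expiresAt });
  }

  get size(): number {
    return this.slots.size;
  }
}
```

Expired entries are only removed when they are read, which keeps the sketch short; a real store would likely also sweep them in the background or rely on the backend's own expiry.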

Threshold Guide

| Threshold | Behaviour | Use case |
|-----------|-----------|----------|
| 0.70 | Very aggressive caching | High-volume FAQ bots |
| 0.80 | Loose matching | Customer support, search |
| 0.85 | Default (good balance) | General use |
| 0.90 | Strict matching | Legal / medical accuracy |
| 0.95 | Near-exact only | Code generation |

Statistics & Analytics

const stats = ai.getStats();

stats.totalRequests       // 1000
stats.cacheHits           // 650
stats.cacheMisses         // 350
stats.hitRate             // 65.0 (%)
stats.totalTokensSaved    // 97500
stats.estimatedMoneySaved // 0.0585 ($)
stats.averageLatencyMs.hits   // 38 ms
stats.averageLatencyMs.misses // 2847 ms
stats.topQueries          // [{query, hitCount, savedCost}]
stats.since               // '2025-01-01T00:00:00.000Z'

Cost Projection

import { CostCalculator } from 'memflux';

const calc = new CostCalculator();
const savings = calc.projectMonthlySavings(
  10_000,   // daily requests
  0.60,     // expected 60% hit rate
  0.002,    // $0.002 average cost per uncached request
);

console.log(calc.formatSavings(savings));
// Monthly spend (no cache): $600.00
// Monthly spend (with cache): $240.00
// Monthly savings: $360.00 (60%)
// Annual savings: $4320.00
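
The arithmetic behind such a projection is simple enough to check by hand. The sketch below is not memflux's `CostCalculator`; `projectSavings` is a hypothetical helper that assumes a 30-day month and treats every cache hit as free (embedding costs are ignored). It uses different inputs from the example above so it can be read on its own.

```typescript
// Hypothetical helper for illustration; not part of the memflux API.
interface Projection {
  monthlyWithoutCache: number;
  monthlyWithCache: number;
  monthlySavings: number;
  annualSavings: number;
}

function projectSavings(
  dailyRequests: number,
  hitRate: number,         // fraction between 0 and 1
  costPerRequest: number,  // dollars per uncached request
  daysPerMonth: number = 30,
): Projection {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError('hitRate must be between 0 and 1');
  }
  const monthlyWithoutCache = dailyRequests * costPerRequest * daysPerMonth;
  // Assumption: a hit avoids the full request cost.
  const monthlySavings = monthlyWithoutCache * hitRate;
  return {
    monthlyWithoutCache,
    monthlyWithCache: monthlyWithoutCache - monthlySavings,
    monthlySavings,
    annualSavings: monthlySavings * 12,
  };
}

const p = projectSavings(1_000, 0.5, 0.01);
console.log(p.monthlyWithoutCache.toFixed(2)); // "300.00"
console.log(p.monthlyWithCache.toFixed(2));    // "150.00"
console.log(p.annualSavings.toFixed(2));       // "1800.00"
```

Because a cache hit still needs an embedding for the incoming question, real savings will be somewhat lower than this upper bound.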

Cache Management

// Flush all entries + reset statistics
await ai.flush();

// Pre-populate embeddings for anticipated queries (no AI completion calls)
await ai.warmUp([
  'What are your business hours?',
  'How do I reset my password?',
  'What is your refund policy?',
]);

// Direct store access
const size = await ai.store.size();
await ai.store.delete('specific-entry-id');
await ai.store.clear();

How It Works

User question
    │
    ▼
 Normalise text
 (lowercase, trim, remove punctuation)
    │
    ▼
 Embed text via OpenAI text-embedding-3-small
 → 1536-dimensional float vector
    │
    ▼
 Compare against all cached vectors
 using Cosine Similarity
    │
    ├── Score ≥ threshold? ──→ Return cached response (FREE ✨)
    │                           ~38ms avg latency
    │
    └── Score < threshold? ──→ Call AI API (~2000ms)
                                Store response + embedding
                                Return fresh response
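
The flow above can be sketched in a few lines. This illustrates the steps and is not memflux's source: `cachedChat`, `Deps`, and `Entry` are hypothetical names, the embedding and completion calls are injected as plain functions, and the linear scan stands in for whatever search the library actually performs.

```typescript
// Illustrative sketch of the lookup flow; all names here are hypothetical.
type Vector = number[];

interface Entry {
  vector: Vector;
  response: string;
}

interface Deps {
  embed: (text: string) => Promise<Vector>;      // e.g. an embeddings API call
  complete: (text: string) => Promise<string>;   // e.g. a chat completion call
  similarity: (a: Vector, b: Vector) => number;  // e.g. cosine similarity
  threshold: number;
}

// Lowercase, strip common punctuation, collapse whitespace, trim.
function normalise(text: string): string {
  return text
    .toLowerCase()
    .replace(/[.,!?;:'"()[\]{}]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

async function cachedChat(
  question: string,
  entries: Entry[],
  deps: Deps,
): Promise<{ response: string; cached: boolean }> {
  const vector = await deps.embed(normalise(question));

  // Find the most similar cached entry with a linear scan.
  let best: Entry | undefined;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = deps.similarity(vector, entry.vector);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }

  // Hit: the best score clears the threshold, so the completion call is skipped.
  if (best !== undefined && bestScore >= deps.threshold) {
    return { response: best.response, cached: true };
  }

  // Miss: call the model, then store the response with its embedding.
  const response = await deps.complete(question);
  entries.push({ vector, response });
  return { response, cached: false };
}
```

Injecting `embed`, `complete`, and `similarity` keeps the sketch independent of any provider and makes it easy to exercise with stubbed functions.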

Why cosine similarity?

The cosine of the angle between two vectors measures how closely they point in the same direction and ignores how long they are. Embedding models place texts with similar meaning in similar directions, so two phrasings of the same question usually score close to 1 even when they share few words.
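
As a concrete illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a standalone sketch (the name `cosineSimilarity` is used here for illustration and is not claimed to match the library's export); the example calls show that scaling a vector leaves the score unchanged.

```typescript
// Standalone sketch: cos(theta) = (a . b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; return 0 rather than dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const base = [1, 2, 3];
const scaled = [10, 20, 30]; // same direction, ten times the magnitude

console.log(cosineSimilarity(base, scaled));    // approximately 1
console.log(cosineSimilarity([1, 0], [0, 1]));  // 0  (orthogonal)
console.log(cosineSimilarity([1, 0], [-1, 0])); // -1 (opposite direction)
```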

Performance

| Operation | Time | Notes |
|-----------|------|-------|
| Cache hit (memory store) | ~5–50ms | Embedding lookup + similarity search |
| Cache hit (Redis store) | ~10–80ms | Network round-trip included |
| Cache miss | ~1000–5000ms | Full AI API call |
| Embedding generation | ~50–150ms | Cached internally after first call |

Throughput (Apple M2, 10,000 cache entries, 1536 dimensions):

cosine similarity: ~150,000 comparisons/second
cosineSimilarityFast (Float32): ~200,000 comparisons/second
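
This README does not describe how `cosineSimilarityFast` is implemented. One common way to speed up a brute-force scan, sketched here purely as an assumption, is to normalise each vector once when it is stored and keep it in a `Float32Array`; the cosine similarity of two unit vectors is then just their dot product. The helper names `toUnitVector`, `dot`, and `topK` are hypothetical.

```typescript
// Assumed technique, not memflux's confirmed implementation.

// Scale a vector to length 1 and store it compactly as 32-bit floats.
function toUnitVector(values: number[]): Float32Array {
  let sumSq = 0;
  for (let i = 0; i < values.length; i++) sumSq += values[i] * values[i];
  const norm = Math.sqrt(sumSq);
  const out = new Float32Array(values.length);
  if (norm === 0) return out; // zero vector stays all zeros
  for (let i = 0; i < values.length; i++) out[i] = values[i] / norm;
  return out;
}

// For unit vectors this equals the cosine similarity (no square roots per comparison).
function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every cached vector and return the k best matches, highest score first.
function topK(
  query: Float32Array,
  cached: Float32Array[],
  k: number,
): { index: number; score: number }[] {
  return cached
    .map((vector, index) => ({ index, score: dot(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const cachedVectors = [[1, 0], [1, 1], [0, 1]].map(toUnitVector);
const matches = topK(toUnitVector([2, 0.1]), cachedVectors, 2);
console.log(matches.map((m) => m.index)); // [0, 1]
```

The scan is still linear in the number of cached entries, which is why an ANN index becomes attractive for very large caches.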

For very large caches (>100k entries), consider upgrading to a vector database (Pinecone, Weaviate) with ANN indexing for sub-10ms search.

File Structure

memflux/
├── src/
│   ├── index.ts                       ← Public API entry point
│   ├── types/                         ← TypeScript interfaces
│   │   ├── cache.types.ts
│   │   ├── config.types.ts
│   │   └── provider.types.ts
│   ├── config/
│   │   ├── default.config.ts          ← Sensible defaults
│   │   └── config.validator.ts        ← Input validation
│   ├── cache/
│   │   └── cache.manager.ts           ← Core orchestration logic
│   ├── embeddings/
│   │   └── embedding.service.ts       ← OpenAI embeddings + internal LRU cache
│   ├── similarity/
│   │   ├── cosine.similarity.ts       ← Cosine algorithm (standard + fast)
│   │   ├── euclidean.similarity.ts    ← Euclidean distance algorithm
│   │   ├── dot-product.similarity.ts  ← Dot-product algorithm
│   │   └── similarity.engine.ts       ← Algorithm selector + top-K search
│   ├── storage/
│   │   ├── memory.store.ts            ← In-memory LRU store (default)
│   │   ├── redis.store.ts             ← Redis persistent store
│   │   └── sqlite.store.ts            ← SQLite persistent store
│   ├── analytics/
│   │   ├── stats.tracker.ts           ← Hit/miss statistics
│   │   └── cost.calculator.ts         ← Cost projections + model pricing
│   ├── middleware/
│   │   ├── express.middleware.ts      ← Express.js integration
│   │   └── fastify.middleware.ts      ← Fastify plugin
│   ├── adapters/
│   │   └── gemini.adapter.ts          ← Google Gemini support
│   └── utils/
│       ├── hash.utils.ts              ← SHA-256 ID generation
│       ├── logger.ts                  ← Zero-dependency logger
│       └── token.counter.ts           ← Lightweight token estimator
├── tests/
│   ├── unit/                          ← Pure unit tests (no API calls)
│   └── integration/                   ← Tests with mocked OpenAI client
├── examples/                          ← Runnable code examples
├── benchmarks/                        ← Performance benchmarks
└── .github/workflows/                 ← CI/CD pipelines

Supported Models

| Provider | Completion Models | Embedding Model |
|----------|-------------------|-----------------|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo | text-embedding-3-small ✓ |
| Google | gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash | text-embedding-004 ✓ |

FAQ

Q: Does it work across different languages?
A: Yes. The embedding model captures semantic meaning across Arabic, English, French, Spanish, Chinese, etc. A question in Arabic and the same question in English will often produce a cache hit.

Q: What happens if two questions have the same embedding by coincidence?
A: True embedding collisions are astronomically unlikely with 1536-dimensional vectors. In practice, a false positive is almost always a semantically related question rather than a random collision. If related questions need different answers, raise the threshold (see the Threshold Guide).

Q: Can I use it with streaming responses?
A: Currently, memflux buffers the full response before caching. Streaming is on the roadmap.

Q: Is it thread-safe?
A: The in-memory store lives inside a single Node.js process and is not shared between processes. Use the Redis store for multi-process deployments.

Q: Can I bring my own vector store (Pinecone, Weaviate)?
A: Yes. Implement the CacheStore interface and pass it as the store option in the config.
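
The real `CacheStore` interface is not reproduced in this README, so the following is only a guess at what a custom store might look like. `size()`, `delete()`, and `clear()` mirror the calls shown under Cache Management; `set()`, `entries()`, the entry shape, and the names `CacheStoreSketch` and `MapBackedStore` are assumptions. Check the package's type definitions before implementing against it.

```typescript
// ASSUMPTION: illustrative shape only; the actual CacheStore interface may differ.
interface StoredEntry {
  id: string;
  vector: number[];
  response: string;
}

interface CacheStoreSketch {
  set(entry: StoredEntry): Promise<void>;  // assumed
  entries(): Promise<StoredEntry[]>;       // assumed
  size(): Promise<number>;                 // seen under Cache Management
  delete(id: string): Promise<void>;       // seen under Cache Management
  clear(): Promise<void>;                  // seen under Cache Management
}

// A Map-backed implementation; a vector-database store would replace the Map
// operations with calls to that database's client.
class MapBackedStore implements CacheStoreSketch {
  private readonly data = new Map<string, StoredEntry>();

  async set(entry: StoredEntry): Promise<void> {
    this.data.set(entry.id, entry);
  }

  async entries(): Promise<StoredEntry[]> {
    return Array.from(this.data.values());
  }

  async size(): Promise<number> {
    return this.data.size;
  }

  async delete(id: string): Promise<void> {
    this.data.delete(id);
  }

  async clear(): Promise<void> {
    this.data.clear();
  }
}
```

All methods return promises so that the same shape fits both in-process and network-backed stores.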

Contributing

git clone https://github.com/Brah-Timo/memflux.git
cd memflux
npm install
npm test            # run tests
npm run build       # build the package
npm run benchmark   # run benchmarks