memflux
v1.0.0
Intelligent AI response caching middleware — reduce costs and latency for LLM-powered applications
🤖 memflux
Semantic caching layer for OpenAI & Gemini APIs — save up to 70% on AI costs
The Problem
// Without memflux: every call costs money, even for identical questions
const a = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'What is TypeScript?' }] }); // $0.002
const b = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'Explain TypeScript to me' }] }); // $0.002, same answer!
const c = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'What is TS?' }] }); // $0.002, again!
Regular caches require exact string matches. They can't help here because users rarely phrase the same question the same way twice.
The Solution
import OpenAI from 'openai';
import { aiCache } from 'memflux';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ai = aiCache(openai); // ← one line wraps the client
const a = await ai.chat('What is TypeScript?'); // → API call ($0.002)
const b = await ai.chat('Explain TypeScript to me'); // → CACHE HIT (free ✨)
const c = await ai.chat('What is TS?'); // → CACHE HIT (free ✨)
const stats = ai.getStats();
console.log(`Hit rate: ${stats.hitRate}%`); // → "Hit rate: 66.7%"
console.log(`Saved: $${stats.estimatedMoneySaved}`); // → "Saved: $0.004"
memflux converts every question into a semantic embedding vector and compares it against cached questions using cosine similarity. When the similarity is high enough (a score of 0.85 or higher by default), it returns the cached answer immediately, without making a completion API call.
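The lookup described above can be sketched in a few lines. This is a minimal illustration, not memflux's actual implementation: the `Entry` type, the `findCached` helper, and the three-dimensional vectors are all assumptions made for the example (real embeddings have hundreds of dimensions and come from an embedding model).

```typescript
// Sketch only: memflux's real internals are not shown in this README.
type Entry = { question: string; vector: number[]; answer: string };

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the best-scoring cached entry if it clears the threshold, otherwise null (a miss).
function findCached(query: number[], entries: Entry[], threshold = 0.85): Entry | null {
  let best: Entry | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosine(query, entry.vector);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? best : null;
}

// Hand-made 3-dimensional stand-ins for real embeddings.
const cache: Entry[] = [
  { question: 'What is TypeScript?', vector: [0.9, 0.1, 0.0], answer: 'A typed superset of JavaScript.' },
  { question: 'What is Redis?', vector: [0.0, 0.2, 0.9], answer: 'An in-memory data store.' },
];

const rephrased = [0.85, 0.15, 0.05]; // pretend embedding of "Explain TypeScript to me"
const unrelated = [0.1, 0.9, 0.1];    // pretend embedding of an unrelated question

const hit = findCached(rephrased, cache);  // the 'What is TypeScript?' entry
const miss = findCached(unrelated, cache); // null: nothing scores 0.85 or higher
```

On a miss, the caller would request a fresh completion and store the new question vector alongside the answer.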
Installation
npm install memflux
# Optional: Redis persistence (for multi-process / production)
npm install memflux ioredis
# Optional: SQLite persistence (for single-server deployments)
npm install memflux better-sqlite3
Requirements: Node.js ≥ 18, OpenAI SDK ≥ 4 or @google/generative-ai ≥ 0.1
Quick Start
OpenAI
import OpenAI from 'openai';
import { aiCache } from 'memflux';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ai = aiCache(openai, {
similarity: {
threshold: 0.85, // 85% semantic match = cache hit
algorithm: 'cosine', // cosine | euclidean | dot-product
topK: 5, // check top 5 candidates
},
ttl: 3_600, // cache entries expire after 1 hour
debug: false, // set to true for verbose logs
});
// Single question
const answer = await ai.chat('What is machine learning?');
// With options
const answer2 = await ai.chat('Summarise this article', {
model: 'gpt-4o',
temperature: 0.3,
systemPrompt: 'You are a helpful assistant.',
bypassCache: false, // set to true to force a fresh API call
ttl: 7200, // per-request TTL override
});
// Multi-turn conversation
const answer3 = await ai.chatWithMessages([
{ role: 'system', content: 'You are an expert in TypeScript.' },
{ role: 'user', content: 'What are generics?' },
]);
// Full metadata (cache hit? similarity score? latency?)
const result = await ai.chatDetailed('What is TypeScript?');
console.log(result.cached); // true / false
console.log(result.similarityScore); // 0.9234
console.log(result.latencyMs); // 38 (cache hit) or 2847 (API call)
console.log(result.tokensSaved); // 512
Google Gemini
import { GoogleGenerativeAI } from '@google/generative-ai';
import { aiCacheGemini } from 'memflux/gemini';
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY!);
const ai = aiCacheGemini(genAI, {
defaultModel: 'gemini-2.0-flash',
similarity: { threshold: 0.85, algorithm: 'cosine', topK: 3 },
ttl: 86_400,
});
const answer = await ai.chat('Explain quantum computing');
Storage Backends
In-Memory (default — zero dependencies)
const ai = aiCache(openai);
// Data is lost on process restart. Perfect for development.
Redis (production: persistent, multi-process)
import { aiCache, createRedisStore } from 'memflux';
const ai = aiCache(openai, {
store: createRedisStore({
url: process.env.REDIS_URL, // default: redis://localhost:6379
ttl: 86_400,
keyPrefix: 'myapp:', // namespace to avoid key collisions
}),
});
SQLite (single-server persistent)
import { aiCache, createSQLiteStore } from 'memflux';
const ai = aiCache(openai, {
store: createSQLiteStore({
path: './cache.db',
ttl: 7 * 86_400, // 7 days
}),
});
// Cache survives process restarts!
Express.js Middleware
Drop-in middleware for any Express route that accepts a question and returns an AI answer:
import express from 'express';
import OpenAI from 'openai';
import { aiCache, aiCacheMiddleware, aiCacheStatsHandler } from 'memflux';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ai = aiCache(openai);
const app = express();
app.use(express.json());
// POST /chat with body: { "message": "your question" }
app.post('/chat', aiCacheMiddleware({
cache: ai,
includeStats: true, // include stats in every response
onHit: (req, res) => console.log('CACHE HIT!'),
onMiss: (req, res) => console.log('Cache miss'),
}));
// GET /cache/stats
app.get('/cache/stats', aiCacheStatsHandler(ai));
Response headers automatically added:
X-Cache: HIT
X-Cache-Hit-Rate: 66.7%
X-Cache-Similarity: 0.9234
X-Cache-Latency-Ms: 38
X-Money-Saved: $0.0042
Fastify Plugin
import Fastify from 'fastify';
import OpenAI from 'openai';
import { aiCache } from 'memflux';
import { aiCacheFastifyPlugin } from 'memflux/middleware';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ai = aiCache(openai);
const app = Fastify({ logger: true });
await app.register(aiCacheFastifyPlugin, {
cache: ai,
prefix: '/api', // registers POST /api/chat, GET /api/cache/stats
exposeAdminRoutes: true,
});
await app.listen({ port: 3000 });
Configuration Reference
const ai = aiCache(openai, {
// ── Model ──────────────────────────────────────────────────
defaultModel: 'gpt-4o-mini', // Default AI model for completions
// ── Embedding ──────────────────────────────────────────────
embedding: {
model: 'text-embedding-3-small', // Embedding model
dimensions: 1536, // Vector dimensions (256–1536)
batchSize: 100, // Texts per batch embedding request
internalCacheSize: 5_000, // In-memory embedding cache size
},
// ── Similarity ─────────────────────────────────────────────
similarity: {
algorithm: 'cosine', // 'cosine' | 'euclidean' | 'dot-product'
threshold: 0.85, // 0.0 = match everything | 1.0 = exact match only
topK: 5, // Evaluate top K candidates
},
// ── Cache ──────────────────────────────────────────────────
ttl: 86_400, // Default TTL in seconds (0 = never expire)
maxCacheSize: 10_000, // Max entries before LRU eviction
namespace: 'memflux', // Key prefix for Redis/SQLite
// ── Analytics ─────────────────────────────────────────────
analytics: {
enabled: true,
trackCostSavings: true,
costPerToken: 0.0000006, // Adjust to match your model's pricing
maxRecords: 10_000,
},
// ── Store ─────────────────────────────────────────────────
store: createRedisStore(...), // Optional custom backing store
// ── Debug ─────────────────────────────────────────────────
debug: false, // Enable verbose stderr logging
});
Threshold Guide
| Threshold | Behaviour | Use case |
|-----------|-----------|----------|
| 0.70 | Very aggressive caching | High-volume FAQ bots |
| 0.80 | Loose matching | Customer support, search |
| 0.85 | Default — good balance | General use |
| 0.90 | Strict matching | Legal / medical accuracy |
| 0.95 | Near-exact only | Code generation |
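To make the table concrete, the sketch below checks which of the listed thresholds would accept a given similarity score. The helper names are hypothetical and the second score is invented for illustration; only the 0.9234 value is borrowed from the `chatDetailed` example.

```typescript
// Illustrative only: these helpers are not part of memflux.
const thresholds = [0.70, 0.80, 0.85, 0.90, 0.95];

// A lookup is a hit when the best similarity score reaches the threshold.
function isHit(score: number, threshold: number): boolean {
  return score >= threshold;
}

// For one score, list every threshold from the table that would accept it.
function acceptingThresholds(score: number): number[] {
  return thresholds.filter((t) => isHit(score, t));
}

const closeParaphrase = acceptingThresholds(0.9234); // [0.7, 0.8, 0.85, 0.9]
const looseMatch = acceptingThresholds(0.82);        // [0.7, 0.8]
```

A close paraphrase is served from cache at every setting except 0.95, while a looser match is only served at 0.80 and below, which is why stricter thresholds suit answers where small wording differences matter.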
Statistics & Analytics
const stats = ai.getStats();
stats.totalRequests // 1000
stats.cacheHits // 650
stats.cacheMisses // 350
stats.hitRate // 65.0 (%)
stats.totalTokensSaved // 97500
stats.estimatedMoneySaved // 0.0585 ($)
stats.averageLatencyMs.hits // 38 ms
stats.averageLatencyMs.misses // 2847 ms
stats.topQueries // [{query, hitCount, savedCost}]
stats.since // '2025-01-01T00:00:00.000Z'
Cost Projection
import { CostCalculator } from 'memflux';
const calc = new CostCalculator();
const savings = calc.projectMonthlySavings(
10_000, // daily requests
0.60, // expected 60% hit rate
0.002, // $0.002 average cost per uncached request
);
console.log(calc.formatSavings(savings));
// Monthly spend (no cache): $600.00
// Monthly spend (with cache): $240.00
// Monthly savings: $360.00 (60%)
// Annual savings: $4320.00
Cache Management
// Flush all entries + reset statistics
await ai.flush();
// Pre-populate embeddings for anticipated queries (no AI completion calls)
await ai.warmUp([
'What are your business hours?',
'How do I reset my password?',
'What is your refund policy?',
]);
// Direct store access
const size = await ai.store.size();
await ai.store.delete('specific-entry-id');
await ai.store.clear();
How It Works
User question
      │
      ▼
Normalise text
(lowercase, trim, remove punctuation)
      │
      ▼
Embed text via OpenAI text-embedding-3-small
→ 1536-dimensional float vector
      │
      ▼
Compare against all cached vectors
using cosine similarity
      │
      ├── Score ≥ threshold? ──→ Return cached response (FREE ✨)
      │                          ~38ms avg latency
      │
      └── Score < threshold? ──→ Call AI API (~2000ms)
                                 Store response + embedding
                                 Return fresh response
Why cosine similarity?
The cosine of the angle between two vectors measures directional similarity — it's sensitive to what the vectors represent (meaning), not how long they are (magnitude). Two different phrasings of the same question will have very similar embedding directions, giving a high cosine score regardless of how the sentence is written.
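The magnitude-independence claim is easy to check numerically. The functions below are a plain textbook sketch written for this example (they are not memflux's exported API) and assume non-zero vectors of equal length.

```typescript
// Sum of element-wise products of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Cosine of the angle between a and b: dot product divided by both lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

// Straight-line distance between the two vector endpoints.
function euclideanDistance(a: number[], b: number[]): number {
  let sumSq = 0;
  for (let i = 0; i < a.length; i++) sumSq += (a[i] - b[i]) ** 2;
  return Math.sqrt(sumSq);
}

const v = [1, 2, 3];
const scaled = v.map((x) => x * 10); // same direction, ten times the magnitude
const orthogonal = [3, 0, -1];       // perpendicular to v (dot product is 0)

const sameDirection = cosineSimilarity(v, scaled);   // 1: direction is identical
const noOverlap = cosineSimilarity(v, orthogonal);   // 0: no directional overlap
const rawDot = dot(v, scaled);                       // 140, versus dot(v, v) = 14
const distance = euclideanDistance(v, scaled);       // about 33.67
```

Scaling the vector leaves the cosine score at 1, while the raw dot product and the Euclidean distance both change with magnitude.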
Performance
| Operation | Time | Notes |
|-----------|------|-------|
| Cache hit (memory store) | ~5–50ms | Embedding lookup + similarity search |
| Cache hit (Redis store) | ~10–80ms | Network round-trip included |
| Cache miss | ~1000–5000ms | Full AI API call |
| Embedding generation | ~50–150ms | Cached internally after first call |
Throughput (Apple M2, 10,000 cache entries, 1536 dimensions):
- cosine similarity: ~150,000 comparisons/second
- cosineSimilarityFast (Float32): ~200,000 comparisons/second
For very large caches (>100k entries), consider upgrading to a vector database (Pinecone, Weaviate) with ANN indexing for sub-10ms search.
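Before moving to a vector database, a common way to keep brute-force search cheap is to scale every vector to unit length once, at insert time, so that cosine similarity reduces to a plain dot product over `Float32Array`s. The sketch below shows that general technique; it is not the library's `cosineSimilarityFast`, whose implementation is not shown in this README, and the function names are made up for the example.

```typescript
// Scale a vector to unit length once, at insert time.
function normalise(v: number[]): Float32Array {
  const out = Float32Array.from(v);
  let sumSq = 0;
  for (let i = 0; i < out.length; i++) sumSq += out[i] * out[i];
  const len = Math.sqrt(sumSq);
  if (len > 0) {
    for (let i = 0; i < out.length; i++) out[i] /= len;
  }
  return out;
}

// For unit-length vectors the dot product equals the cosine similarity.
function dotUnit(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Brute-force top-K: score every stored vector, sort descending, keep the first K.
function topK(
  query: Float32Array,
  stored: Float32Array[],
  k: number,
): { index: number; score: number }[] {
  return stored
    .map((vector, index) => ({ index, score: dotUnit(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Four toy vectors standing in for cached question embeddings.
const stored = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]].map(normalise);
const results = topK(normalise([2, 0.1, 0]), stored, 2);
// results holds indices 0 and 1, in that order: the two vectors closest in direction to the query
```

This scan is still linear in the number of entries, so it only postpones the point at which an ANN index becomes worthwhile.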
File Structure
memflux/
├── src/
│ ├── index.ts ← Public API entry point
│ ├── types/ ← TypeScript interfaces
│ │ ├── cache.types.ts
│ │ ├── config.types.ts
│ │ └── provider.types.ts
│ ├── config/
│ │ ├── default.config.ts ← Sensible defaults
│ │ └── config.validator.ts ← Input validation
│ ├── cache/
│ │ └── cache.manager.ts ← Core orchestration logic
│ ├── embeddings/
│ │ └── embedding.service.ts ← OpenAI embeddings + internal LRU cache
│ ├── similarity/
│ │ ├── cosine.similarity.ts ← Cosine algorithm (standard + fast)
│ │ ├── euclidean.similarity.ts ← Euclidean distance algorithm
│ │ ├── dot-product.similarity.ts ← Dot-product algorithm
│ │ └── similarity.engine.ts ← Algorithm selector + top-K search
│ ├── storage/
│ │ ├── memory.store.ts ← In-memory LRU store (default)
│ │ ├── redis.store.ts ← Redis persistent store
│ │ └── sqlite.store.ts ← SQLite persistent store
│ ├── analytics/
│ │ ├── stats.tracker.ts ← Hit/miss statistics
│ │ └── cost.calculator.ts ← Cost projections + model pricing
│ ├── middleware/
│ │ ├── express.middleware.ts ← Express.js integration
│ │ └── fastify.middleware.ts ← Fastify plugin
│ ├── adapters/
│ │ └── gemini.adapter.ts ← Google Gemini support
│ └── utils/
│ ├── hash.utils.ts ← SHA-256 ID generation
│ ├── logger.ts ← Zero-dependency logger
│ └── token.counter.ts ← Lightweight token estimator
├── tests/
│ ├── unit/ ← Pure unit tests (no API calls)
│ └── integration/ ← Tests with mocked OpenAI client
├── examples/ ← Runnable code examples
├── benchmarks/ ← Performance benchmarks
└── .github/workflows/ ← CI/CD pipelines
Supported Models
| Provider | Completion Models | Embedding Model |
|----------|-------------------|-----------------|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo | text-embedding-3-small ✓ |
| Google | gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash | text-embedding-004 ✓ |
FAQ
Q: Does it work across different languages?
A: Yes. The embedding model captures semantic meaning across Arabic, English, French, Spanish, Chinese, etc. A question in Arabic and the same question in English will often produce a cache hit.
Q: What happens if two questions have the same embedding by coincidence?
A: Exact embedding collisions between unrelated questions are astronomically unlikely with 1536-dimensional vectors. False positives, when they occur, come from questions that are semantically close but need different answers; raising the threshold reduces them.
Q: Can I use it with streaming responses?
A: Currently, memflux buffers the full response before caching. Streaming is on the roadmap.
Q: Is it thread-safe?
A: The in-memory store is per-process: each Node.js process keeps its own separate cache, so entries are not shared between processes. Use Redis for multi-process deployments.
Q: Can I bring my own vector store (Pinecone, Weaviate)?
A: Yes — implement the CacheStore interface and pass it as store in the config.
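As a rough idea of what a custom store might look like, here is a Map-backed sketch with TTL and least-recently-used eviction. The interface shape is an assumption: this README only shows `size()`, `delete()` and `clear()` on `ai.store`, so `get()`, `set()`, the `StoredEntry` fields and the names `CacheStoreSketch` and `MapStore` are hypothetical. Check the `CacheStore` type exported by memflux for the real method names before adapting it.

```typescript
// ASSUMPTION: a guess at the shape of a store entry, not memflux's real type.
interface StoredEntry {
  id: string;
  vector: number[];
  response: string;
}

// ASSUMPTION: only size(), delete() and clear() appear in this README;
// get() and set() are hypothetical.
interface CacheStoreSketch {
  get(id: string): Promise<StoredEntry | undefined>;
  set(entry: StoredEntry, ttlSeconds?: number): Promise<void>;
  delete(id: string): Promise<boolean>;
  clear(): Promise<void>;
  size(): Promise<number>;
}

class MapStore implements CacheStoreSketch {
  private entries = new Map<string, { entry: StoredEntry; expiresAt: number }>();
  private maxSize: number;
  private now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(maxSize = 10_000, now: () => number = () => Date.now()) {
    this.maxSize = maxSize;
    this.now = now;
  }

  async get(id: string): Promise<StoredEntry | undefined> {
    const record = this.entries.get(id);
    if (!record) return undefined;
    if (record.expiresAt <= this.now()) {
      this.entries.delete(id); // lazily drop expired entries on read
      return undefined;
    }
    // Re-insert so this id becomes the most recently used.
    this.entries.delete(id);
    this.entries.set(id, record);
    return record.entry;
  }

  async set(entry: StoredEntry, ttlSeconds = 0): Promise<void> {
    // ttlSeconds = 0 means "never expire", mirroring the ttl option in the configuration reference.
    const expiresAt = ttlSeconds > 0 ? this.now() + ttlSeconds * 1000 : Infinity;
    this.entries.delete(entry.id);
    this.entries.set(entry.id, { entry, expiresAt });
    if (this.entries.size > this.maxSize) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  async delete(id: string): Promise<boolean> {
    return this.entries.delete(id);
  }

  async clear(): Promise<void> {
    this.entries.clear();
  }

  async size(): Promise<number> {
    return this.entries.size;
  }
}
```

Following the FAQ answer, such a store would be passed as `store` in the config, for example `aiCache(openai, { store: new MapStore() })`, provided it actually satisfies the library's `CacheStore` interface. A Pinecone or Weaviate adapter would replace the Map with calls to that service's client.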
Contributing
git clone https://github.com/Brah-Timo/memflux.git
cd memflux
npm install
npm test # run tests
npm run build # build the package
npm run benchmark # run benchmarks
License
MIT — Copyright © 2026 TIMSoftDZ memflux contributors
