ComposeCache
Adaptive compositional semantic caching for LLM APIs and RAG pipelines.
Why ComposeCache?
Existing semantic caches like GPTCache treat every query atomically. ComposeCache decomposes compositional queries (e.g., "Compare X and Y") into sub-queries, caches each independently, and enables partial hits - saving 50%+ on LLM API costs.
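For intuition, partial reuse works roughly like the sketch below; decompose() and the lookup/generate callbacks here are stand-ins for illustration, not ComposeCache's internal API:
// Illustration only: these helpers are hypothetical, not ComposeCache internals.
type SubQuery = string;
function decompose(query: string): SubQuery[] {
  // e.g. "Compare France and Germany" -> one sub-query per entity (hardcoded for the sketch)
  return ['Summarize key facts about France', 'Summarize key facts about Germany'];
}
async function answerWithPartialHits(
  query: string,
  lookup: (q: SubQuery) => Promise<string | null>, // cache probe per sub-query
  generate: (q: SubQuery) => Promise<string>       // LLM call only on a miss
): Promise<string> {
  const answers = await Promise.all(
    decompose(query).map(async (sq) => (await lookup(sq)) ?? (await generate(sq)))
  );
  // A partial hit means only the missing sub-answers were generated.
  return answers.join('\n\n');
}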
Quick Start
npm install composecache
npx composecache init --db postgres://localhost/myapp
import { ComposeCache } from 'composecache';
const cache = new ComposeCache({
database: process.env.DATABASE_URL,
openaiApiKey: process.env.OPENAI_API_KEY,
safeSemantic: {
safeSemanticMode: true,
minSemanticScore: 0.92,
maxSemanticDrift: 0.08
}
});
const response = await cache.complete({
model: 'gpt-3.5-turbo',
messages: [{ role: 'user', content: 'Compare France and Germany' }],
documents: retrievedDocs // Optional: for RAG
});
console.log(response.content); // The answer
console.log(response.cacheType); // 'exact' | 'semantic' | 'partial' | 'miss'
console.log(response.costSaved); // $ saved
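Continuing the Quick Start example, the returned metadata can be used to track savings across requests; this aggregation logic is only an illustration:
// Illustration only: aggregate the response fields shown above across requests.
let totalCostSaved = 0;
const hitsByType: Record<'exact' | 'semantic' | 'partial' | 'miss', number> = {
  exact: 0, semantic: 0, partial: 0, miss: 0
};
function recordResult(res: { cacheType: 'exact' | 'semantic' | 'partial' | 'miss'; costSaved: number }) {
  hitsByType[res.cacheType] += 1;
  totalCostSaved += res.costSaved;
}
recordResult(response); // `response` comes from cache.complete() above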
Features
- Compositional query decomposition (novel)
- Document-aware cache keys via MinHash
- Uncertainty-gated population (blocks hallucinations)
- Safe semantic mode with strict relevance gating (default ON)
- Drop-in SDK for Node.js and Python
- Works with your own PostgreSQL database
Safe Semantic Mode
ComposeCache now runs semantic acceptance through strict guards by default, so semantic hits are high precision and never replace exact-hash behavior.
Exact hits are unchanged:
- Exact hash match still returns immediately.
- Semantic gates are not evaluated for exact hits.
Semantic and partial reuse now include metadata:
- semanticScore in subQueryHits (0..1)
- hitSourceId in subQueryHits
- acceptanceChecks in subQueryHits
- decisionReason in subQueryHits
Common reasons:
- exact_hit
- semantic_hit
- rejected_entity_mismatch
- rejected_intent_mismatch
- rejected_low_confidence
- miss
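For example, while tuning you can log the per-sub-query decision metadata; this assumes the response exposes subQueryHits with the fields listed above:
// Log why each sub-query was (or was not) served semantically.
// Assumes `response.subQueryHits` carries the metadata fields listed above.
for (const hit of response.subQueryHits ?? []) {
  console.log(
    hit.decisionReason,            // e.g. 'semantic_hit' or 'rejected_low_confidence'
    hit.semanticScore?.toFixed(3), // 0..1 similarity behind the acceptance decision
    hit.hitSourceId ?? 'n/a'       // id of the cached entry that was reused
  );
}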
Default safety policy (enabled unless overridden):
safeSemantic: {
safeSemanticMode: true,
minSemanticTokens: 4,
minSemanticChars: 12,
minSemanticScore: 0.92,
maxSemanticDrift: 0.08,
requireEntityOverlap: true,
requireIntentMatch: true,
shortUtteranceBypass: true,
adaptiveThresholds: true,
semanticBackoffToMiss: true
}
Strict production config example:
const cache = new ComposeCache({
database: process.env.DATABASE_URL!,
openaiApiKey: process.env.OPENAI_API_KEY!,
thresholds: {
query: 0.92,
document: 0.8,
uncertainty: 0.25
},
safeSemantic: {
safeSemanticMode: true,
minSemanticTokens: 5,
minSemanticChars: 16,
minSemanticScore: 0.95,
maxSemanticDrift: 0.05,
requireEntityOverlap: true,
requireIntentMatch: true,
shortUtteranceBypass: true,
adaptiveThresholds: true,
semanticBackoffToMiss: true
}
});
Tuning guidance:
- Higher precision: raise minSemanticScore, lower maxSemanticDrift, and increase the minimum token/char gates.
- Higher recall: lower minSemanticScore slightly and allow a larger maxSemanticDrift (see the example configs below).
- If your domain has dense entities (country names, SKUs, IDs), keep requireEntityOverlap enabled.
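Example configs for the two tuning directions above; the values are illustrative, not recommendations:
// Illustrative values only; tune for your own domain.
const higherPrecision = {
  safeSemantic: {
    safeSemanticMode: true,
    minSemanticTokens: 6,
    minSemanticChars: 20,
    minSemanticScore: 0.96,  // stricter than the 0.92 default
    maxSemanticDrift: 0.04,
    requireEntityOverlap: true,
    requireIntentMatch: true
  }
};
const higherRecall = {
  safeSemantic: {
    safeSemanticMode: true,
    minSemanticScore: 0.90,    // slightly below the 0.92 default
    maxSemanticDrift: 0.10,    // allow a little more drift
    requireEntityOverlap: true // keep entity gating in entity-dense domains
  }
};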
Migration notes:
- Existing config remains valid; all safe semantic settings are optional.
- Default behavior is stricter for semantic reuse, which may reduce semantic hit rate while improving correctness.
- Use stats() to inspect semanticAccepted, semanticRejected, and rejectionReasons while tuning thresholds (see the sketch below).
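A minimal tuning-loop sketch, assuming stats() returns counters named as above; the exact return shape is an assumption here:
// Inspect semantic acceptance counters while tuning thresholds.
// Field names follow the note above; the actual return shape may differ.
const s = await cache.stats();
console.log('semantic accepted:', s.semanticAccepted);
console.log('semantic rejected:', s.semanticRejected);
console.log('rejection reasons:', s.rejectionReasons); // e.g. { rejected_entity_mismatch: 3, ... }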
Architecture
Query Processing Flow
flowchart TD
Q["Incoming query q"] --> C{"Classify: atomic or compositional"}
C -->|atomic| A["Compute SHA-256 key: norm(q) + doc_fingerprint + params_hash"]
C -->|compositional| D["Decompose into sub-queries s1..sk with dependencies"]
A --> P["Probe cache: exact hash first, then semantic plus document check"]
D --> P
P --> H{All hits?}
H -->|yes| R["Return cached response or compose from sub-answers"]
H -->|no or partial| G["Generate missing sub-answers via RAG plus LLM API"]
R --> F["Compose final response"]
G --> F
F --> U["Uncertainty gate: cache only if uncertainty <= threshold"]
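A minimal sketch of the exact-key step from the flow above; the normalization and fingerprinting rules shown here are illustrative, not ComposeCache's exact scheme:
import { createHash } from 'node:crypto';
// Illustrative exact-key construction: SHA-256 over norm(q) + doc_fingerprint + params_hash.
// The real normalization and fingerprinting rules may differ.
function norm(q: string): string {
  return q.trim().toLowerCase().replace(/\s+/g, ' ');
}
function exactKey(query: string, docFingerprint: string, params: object): string {
  const paramsHash = createHash('sha256').update(JSON.stringify(params)).digest('hex');
  return createHash('sha256')
    .update(`${norm(query)}|${docFingerprint}|${paramsHash}`)
    .digest('hex');
}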
System Architecture
flowchart TD
APP["Developer application: Node.js or Python"]
subgraph SDK[ComposeCache middleware SDK npm package]
direction LR
S1[1 Decompose] --> S2[2 Probe] --> S3[3 Resolve] --> S4[4 Compose] --> S5[5 Populate]
end
subgraph MODS[Core modules]
direction LR
E["Embedder all-MiniLM-L6-v2"]
L["Decomposition LLM gpt-4o-mini"]
M["MinHash plus uncertainty estimator"]
end
DB["Developer PostgreSQL plus pgvector: exact keys and semantic vectors"]
API["Upstream LLM API OpenAI or Anthropic"]
APP --> SDK
SDK --> MODS
SDK -->|cache read write| DB
SDK -->|miss only| API
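Document-aware cache keys are built with MinHash (see the MinHash module above); a toy version of the idea, not the library's implementation:
// Toy MinHash fingerprint over character shingles; illustrative only.
function minHashSignature(text: string, numHashes = 16, shingleSize = 5): number[] {
  const shingles = new Set<string>();
  for (let i = 0; i + shingleSize <= text.length; i++) {
    shingles.add(text.slice(i, i + shingleSize));
  }
  const sig = new Array<number>(numHashes).fill(Number.MAX_SAFE_INTEGER);
  for (const sh of shingles) {
    for (let k = 0; k < numHashes; k++) {
      // Simple seeded string hash; real implementations use stronger hash families.
      let h = k * 2654435761;
      for (let i = 0; i < sh.length; i++) {
        h = (h * 31 + sh.charCodeAt(i)) >>> 0;
      }
      if (h < sig[k]) sig[k] = h;
    }
  }
  return sig; // similar documents yield similar signatures
}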
Benchmarks
These synthetic benchmark numbers were collected from a local virtual environment using a deterministic mock LLM latency of about 120 ms per call.
Disclaimer: these values are not production throughput guarantees. They are controlled local measurements intended to validate algorithm behavior and relative improvements only.
Benchmark Setup
- Environment: macOS, Node.js runtime in a local virtual development environment
- Workload: compositional query "Compare GDP of France and Germany"
- Iterations: 10 per scenario
- Command:
node scripts/bench.mjs
Results
| Scenario | Avg Latency (ms) | Mock LLM Calls (10 runs) | Avg Tokens Saved |
| --- | ---: | ---: | ---: |
| No cache baseline | 368.0 | 30 | 0 |
| ComposeCache cold (empty cache) | 146.1 | 13 | 126 |
| ComposeCache warm partial | 145.6 | 12 | 133 |
| ComposeCache warm full | 133.3 | 11 | 140 |
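Relative to the no-cache baseline, the warm full scenario shows roughly 64% lower average latency and roughly 63% fewer mock LLM calls; a quick check from the table:
// Relative improvements derived from the table above (warm full vs no-cache baseline).
const baseline = { avgLatencyMs: 368.0, llmCalls: 30 };
const warmFull = { avgLatencyMs: 133.3, llmCalls: 11 };
const latencyReduction = 1 - warmFull.avgLatencyMs / baseline.avgLatencyMs; // ~0.64
const callReduction = 1 - warmFull.llmCalls / baseline.llmCalls;            // ~0.63
console.log(`${(latencyReduction * 100).toFixed(0)}% lower latency, ${(callReduction * 100).toFixed(0)}% fewer LLM calls`);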
Terminal Output Snapshot
{
"baseline": {
"name": "No cache baseline",
"avgLatencyMs": 368,
"llmCalls": 30
},
"cold": {
"name": "ComposeCache cold (empty cache)",
"avgLatencyMs": 146.1,
"avgTokensSaved": 126,
"llmCalls": 13,
"partialRate": 0
},
"partial": {
"name": "ComposeCache warm partial",
"avgLatencyMs": 145.6,
"avgTokensSaved": 133,
"llmCalls": 12,
"partialRate": 0.1
},
"full": {
"name": "ComposeCache warm full",
"avgLatencyMs": 133.3,
"avgTokensSaved": 140,
"llmCalls": 11,
"partialRate": 0
}
}
License
MIT
