ComposeCache
Adaptive compositional semantic caching for LLM APIs and RAG pipelines.
Why ComposeCache?
Existing semantic caches like GPTCache treat every query atomically. ComposeCache decomposes compositional queries (e.g., "Compare X and Y") into sub-queries, caches each independently, and enables partial hits - saving 50%+ on LLM API costs.
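For intuition, partial reuse works roughly like the sketch below; decompose() and the lookup/generate callbacks here are stand-ins for illustration, not ComposeCache's internal API:
// Illustration only: these helpers are hypothetical, not ComposeCache internals.
type SubQuery = string;
function decompose(query: string): SubQuery[] {
  // e.g. "Compare France and Germany" -> one sub-query per entity (hardcoded for the sketch)
  return ['Summarize key facts about France', 'Summarize key facts about Germany'];
}
async function answerWithPartialHits(
  query: string,
  lookup: (q: SubQuery) => Promise<string | null>, // cache probe per sub-query
  generate: (q: SubQuery) => Promise<string>       // LLM call only on a miss
): Promise<string> {
  const answers = await Promise.all(
    decompose(query).map(async (sq) => (await lookup(sq)) ?? (await generate(sq)))
  );
  // A partial hit means only the missing sub-answers were generated.
  return answers.join('\n\n');
}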
Quick Start
npm install composecache
npx composecache init --db postgres://localhost/myapp
import { ComposeCache } from 'composecache';
const cache = new ComposeCache({
database: process.env.DATABASE_URL,
openaiApiKey: process.env.OPENAI_API_KEY,
safeSemantic: {
safeSemanticMode: true,
minSemanticScore: 0.92,
maxSemanticDrift: 0.08
}
});
const response = await cache.complete({
model: 'gpt-3.5-turbo',
messages: [{ role: 'user', content: 'Compare France and Germany' }],
documents: retrievedDocs // Optional: for RAG
});
console.log(response.content); // The answer
console.log(response.cacheType); // 'exact' | 'semantic' | 'partial' | 'miss'
console.log(response.costSaved); // $ saved
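Continuing the Quick Start example, the returned metadata can be used to track savings across requests; this aggregation logic is only an illustration:
// Illustration only: aggregate the response fields shown above across requests.
let totalCostSaved = 0;
const hitsByType: Record<'exact' | 'semantic' | 'partial' | 'miss', number> = {
  exact: 0, semantic: 0, partial: 0, miss: 0
};
function recordResult(res: { cacheType: 'exact' | 'semantic' | 'partial' | 'miss'; costSaved: number }) {
  hitsByType[res.cacheType] += 1;
  totalCostSaved += res.costSaved;
}
recordResult(response); // `response` comes from cache.complete() above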
Features
- Compositional query decomposition (novel)
- Document-aware cache keys via MinHash
- Uncertainty-gated population (blocks hallucinations)
- Safe semantic mode with strict relevance gating (default ON)
- Drop-in SDK for Node.js and Python
- Works with your own PostgreSQL database
Safe Semantic Mode
ComposeCache now runs semantic acceptance through strict guards by default, so semantic hits are high precision and never replace exact-hash behavior.
Exact hits are unchanged:
- Exact hash match still returns immediately.
- Semantic gates are not evaluated for exact hits.
Semantic and partial reuse now include metadata:
- semanticScore in subQueryHits (0..1)
- hitSourceId in subQueryHits
- acceptanceChecks in subQueryHits
- decisionReason in subQueryHits
Common reasons:
- exact_hit
- semantic_hit
- rejected_entity_mismatch
- rejected_intent_mismatch
- rejected_low_confidence
- miss
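For example, while tuning you can log the per-sub-query decision metadata; this assumes the response exposes subQueryHits with the fields listed above:
// Log why each sub-query was (or was not) served semantically.
// Assumes `response.subQueryHits` carries the metadata fields listed above.
for (const hit of response.subQueryHits ?? []) {
  console.log(
    hit.decisionReason,            // e.g. 'semantic_hit' or 'rejected_low_confidence'
    hit.semanticScore?.toFixed(3), // 0..1 similarity behind the acceptance decision
    hit.hitSourceId ?? 'n/a'       // id of the cached entry that was reused
  );
}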
Default safety policy (enabled unless overridden):
safeSemantic: {
safeSemanticMode: true,
minSemanticTokens: 4,
minSemanticChars: 12,
minSemanticScore: 0.92,
maxSemanticDrift: 0.08,
requireEntityOverlap: true,
requireIntentMatch: true,
shortUtteranceBypass: true,
adaptiveThresholds: true,
semanticBackoffToMiss: true
}
Strict production config example:
const cache = new ComposeCache({
database: process.env.DATABASE_URL!,
openaiApiKey: process.env.OPENAI_API_KEY!,
thresholds: {
query: 0.92,
document: 0.8,
uncertainty: 0.25
},
safeSemantic: {
safeSemanticMode: true,
minSemanticTokens: 5,
minSemanticChars: 16,
minSemanticScore: 0.95,
maxSemanticDrift: 0.05,
requireEntityOverlap: true,
requireIntentMatch: true,
shortUtteranceBypass: true,
adaptiveThresholds: true,
semanticBackoffToMiss: true
}
});
Tuning guidance:
- Higher precision: raise minSemanticScore, lower maxSemanticDrift, and increase the minimum token/char gates.
- Higher recall: lower minSemanticScore slightly and allow a larger maxSemanticDrift (see the example configs below).
- If your domain has dense entities (country names, SKUs, IDs), keep requireEntityOverlap enabled.
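Example configs for the two tuning directions above; the values are illustrative, not recommendations:
// Illustrative values only; tune for your own domain.
const higherPrecision = {
  safeSemantic: {
    safeSemanticMode: true,
    minSemanticTokens: 6,
    minSemanticChars: 20,
    minSemanticScore: 0.96,  // stricter than the 0.92 default
    maxSemanticDrift: 0.04,
    requireEntityOverlap: true,
    requireIntentMatch: true
  }
};
const higherRecall = {
  safeSemantic: {
    safeSemanticMode: true,
    minSemanticScore: 0.90,    // slightly below the 0.92 default
    maxSemanticDrift: 0.10,    // allow a little more drift
    requireEntityOverlap: true // keep entity gating in entity-dense domains
  }
};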
Migration notes:
- Existing config remains valid; all safe semantic settings are optional.
- Default behavior is stricter for semantic reuse, which may reduce semantic hit rate while improving correctness.
- Use stats() to inspect semanticAccepted, semanticRejected, and rejectionReasons while tuning thresholds (see the sketch below).
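A minimal tuning-loop sketch, assuming stats() returns counters named as above; the exact return shape is an assumption here:
// Inspect semantic acceptance counters while tuning thresholds.
// Field names follow the note above; the actual return shape may differ.
const s = await cache.stats();
console.log('semantic accepted:', s.semanticAccepted);
console.log('semantic rejected:', s.semanticRejected);
console.log('rejection reasons:', s.rejectionReasons); // e.g. { rejected_entity_mismatch: 3, ... }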
Architecture
Query Processing Flow
flowchart TD
Q["Incoming query q"] --> C{"Classify: atomic or compositional"}
C -->|atomic| A["Compute SHA-256 key: norm(q) + doc_fingerprint + params_hash"]
C -->|compositional| D["Decompose into sub-queries s1..sk with dependencies"]
A --> P["Probe cache: exact hash first, then semantic plus document check"]
D --> P
P --> H{All hits?}
H -->|yes| R["Return cached response or compose from sub-answers"]
H -->|no or partial| G["Generate missing sub-answers via RAG plus LLM API"]
R --> F["Compose final response"]
G --> F
F --> U["Uncertainty gate: cache only if uncertainty <= threshold"]
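A minimal sketch of the exact-key step from the flow above; the normalization and fingerprinting rules shown here are illustrative, not ComposeCache's exact scheme:
import { createHash } from 'node:crypto';
// Illustrative exact-key construction: SHA-256 over norm(q) + doc_fingerprint + params_hash.
// The real normalization and fingerprinting rules may differ.
function norm(q: string): string {
  return q.trim().toLowerCase().replace(/\s+/g, ' ');
}
function exactKey(query: string, docFingerprint: string, params: object): string {
  const paramsHash = createHash('sha256').update(JSON.stringify(params)).digest('hex');
  return createHash('sha256')
    .update(`${norm(query)}|${docFingerprint}|${paramsHash}`)
    .digest('hex');
}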
System Architecture
flowchart TD
APP["Developer application: Node.js or Python"]
subgraph SDK[ComposeCache middleware SDK npm package]
direction LR
S1[1 Decompose] --> S2[2 Probe] --> S3[3 Resolve] --> S4[4 Compose] --> S5[5 Populate]
end
subgraph MODS[Core modules]
direction LR
E["Embedder all-MiniLM-L6-v2"]
L["Decomposition LLM gpt-4o-mini"]
M["MinHash plus uncertainty estimator"]
end
DB["Developer PostgreSQL plus pgvector: exact keys and semantic vectors"]
API["Upstream LLM API OpenAI or Anthropic"]
APP --> SDK
SDK --> MODS
SDK -->|cache read write| DB
SDK -->|miss only| API
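Document-aware cache keys are built with MinHash (see the MinHash module above); a toy version of the idea, not the library's implementation:
// Toy MinHash fingerprint over character shingles; illustrative only.
function minHashSignature(text: string, numHashes = 16, shingleSize = 5): number[] {
  const shingles = new Set<string>();
  for (let i = 0; i + shingleSize <= text.length; i++) {
    shingles.add(text.slice(i, i + shingleSize));
  }
  const sig = new Array<number>(numHashes).fill(Number.MAX_SAFE_INTEGER);
  for (const sh of shingles) {
    for (let k = 0; k < numHashes; k++) {
      // Simple seeded string hash; real implementations use stronger hash families.
      let h = k * 2654435761;
      for (let i = 0; i < sh.length; i++) {
        h = (h * 31 + sh.charCodeAt(i)) >>> 0;
      }
      if (h < sig[k]) sig[k] = h;
    }
  }
  return sig; // similar documents yield similar signatures
}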
Benchmarks
These synthetic benchmark numbers were collected from a local virtual environment using a deterministic mock LLM latency of about 120 ms per call.
Disclaimer: these values are not production throughput guarantees. They are controlled local measurements intended to validate algorithm behavior and relative improvements only.
Benchmark Setup
- Environment: macOS, Node.js runtime in a local virtual development environment
- Workload: compositional query "Compare GDP of France and Germany"
- Iterations: 10 per scenario
- Command:
node scripts/bench.mjs
Results
| Scenario | Avg Latency (ms) | Mock LLM Calls (10 runs) | Avg Tokens Saved |
| --- | ---: | ---: | ---: |
| No cache baseline | 368.0 | 30 | 0 |
| ComposeCache cold (empty cache) | 146.1 | 13 | 126 |
| ComposeCache warm partial | 145.6 | 12 | 133 |
| ComposeCache warm full | 133.3 | 11 | 140 |
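Relative to the no-cache baseline, the warm full scenario shows roughly 64% lower average latency and roughly 63% fewer mock LLM calls; a quick check from the table:
// Relative improvements derived from the table above (warm full vs no-cache baseline).
const baseline = { avgLatencyMs: 368.0, llmCalls: 30 };
const warmFull = { avgLatencyMs: 133.3, llmCalls: 11 };
const latencyReduction = 1 - warmFull.avgLatencyMs / baseline.avgLatencyMs; // ~0.64
const callReduction = 1 - warmFull.llmCalls / baseline.llmCalls;            // ~0.63
console.log(`${(latencyReduction * 100).toFixed(0)}% lower latency, ${(callReduction * 100).toFixed(0)}% fewer LLM calls`);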
Terminal Output Snapshot
{
"baseline": {
"name": "No cache baseline",
"avgLatencyMs": 368,
"llmCalls": 30
},
"cold": {
"name": "ComposeCache cold (empty cache)",
"avgLatencyMs": 146.1,
"avgTokensSaved": 126,
"llmCalls": 13,
"partialRate": 0
},
"partial": {
"name": "ComposeCache warm partial",
"avgLatencyMs": 145.6,
"avgTokensSaved": 133,
"llmCalls": 12,
"partialRate": 0.1
},
"full": {
"name": "ComposeCache warm full",
"avgLatencyMs": 133.3,
"avgTokensSaved": 140,
"llmCalls": 11,
"partialRate": 0
}
}
License
MIT
