# OptiLLM - Semantic Caching SDK
A lightweight TypeScript SDK for semantic caching of LLM responses using the Qdrant vector database. Reduce costs and improve response times by reusing cached responses for semantically similar queries.
## Features
- 🚀 Semantic Caching: Cache LLM responses based on semantic similarity, not exact matches
- 💰 Cost Reduction: Avoid redundant API calls for similar queries
- ⚡ Fast Retrieval: Vector-based similarity search with Qdrant
- 🔧 Flexible: Support for OpenAI embeddings or local TF-IDF fallback
- 🏢 Multi-tenant: Built-in tenant and user scoping
- ⏰ TTL Support: Automatic expiration of cached entries
- 🧠 Typeahead Suggestions: HTTP and WebSocket APIs for live suggestions as users type
## Quick Start
### 1. Install

```bash
npm install opti-llm
```

### 2. Set up Qdrant

```bash
# Local Qdrant with Docker
docker run -p 6333:6333 qdrant/qdrant

# Or use Qdrant Cloud (free tier available)
```

### 3. Basic Usage (SDK)
```typescript
import { createOptiLLM } from 'opti-llm';
import OpenAI from 'openai';

// Initialize
const optiLLM = createOptiLLM({
  qdrantUrl: 'http://localhost:6333',
  embedding: {
    provider: 'openai', // or 'local' for testing
    apiKey: process.env.OPENAI_API_KEY,
  },
  similarityThreshold: 0.85
});
await optiLLM.init();

// Your LLM client
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Cached LLM calls
const result = await optiLLM.capture(
  {
    prompt: "What is Redis vector search?",
    metadata: {
      provider: 'openai',
      model: 'gpt-4o-mini',
      tenantId: 'org1',
      userId: 'user123'
    },
    policy: {
      maxAge: 3600, // 1 hour TTL
      minSimilarity: 0.8
    }
  },
  async () => {
    // This expensive call only happens on cache miss
    const completion = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: "What is Redis vector search?" }]
    });
    return completion.choices[0].message.content;
  }
);

console.log(result.response);   // LLM response
console.log(result.cached);     // true if from cache
console.log(result.cost_saved); // true if cache hit

// Optional: Suggestions (backend SDK)
const suggestions = await optiLLM.suggest({
  text: 'What is Redis vec',
  tenantId: 'org1',
  limit: 5,
  minSimilarity: 0.7,
});
console.log(suggestions);
```

## Configuration
```typescript
interface OptiLLMConfig {
  qdrantUrl: string;              // Qdrant instance URL
  collectionName?: string;        // Collection name (default: 'llm_cache')
  apiKey?: string;                // Qdrant API key (for cloud)
  embedding?: {
    provider: 'openai' | 'local'; // Embedding provider
    apiKey?: string;              // OpenAI API key
    model?: string;               // Embedding model
  };
  defaultTTL?: number;            // Default TTL in seconds
  similarityThreshold?: number;   // Similarity threshold (0-1)
}
```
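For example, a Qdrant Cloud setup with an explicit collection, default TTL, and embedding model could look like the sketch below. The environment variable names and `text-embedding-3-small` are placeholders chosen for illustration, not SDK defaults.

```typescript
import { createOptiLLM } from 'opti-llm';

// Sketch: Qdrant Cloud configuration using the OptiLLMConfig fields above.
const optiLLM = createOptiLLM({
  qdrantUrl: process.env.QDRANT_URL!,   // e.g. https://your-cluster.region.aws.cloud.qdrant.io
  apiKey: process.env.QDRANT_API_KEY,   // Qdrant Cloud API key
  collectionName: 'llm_cache',          // same as the default collection name
  embedding: {
    provider: 'openai',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'text-embedding-3-small',    // any OpenAI embedding model (assumed here)
  },
  defaultTTL: 3600,                     // cached entries expire after 1 hour
  similarityThreshold: 0.85,            // minimum score to count as a cache hit
});

await optiLLM.init();
```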
## Test App (Demo UI + APIs)

Run the included Express test app:
```bash
cd test-app
npm install

# Setup environment variables
cp env.example .env
# Edit .env with your actual API keys

# Start the server
npm run dev
```

Your `.env` file should contain:
```
OPENAI_API_KEY=your_openai_api_key
QDRANT_URL=https://your-cluster.region.aws.cloud.qdrant.io
QDRANT_API_KEY=your_qdrant_api_key
```

Test endpoints:
```bash
# Chat with caching
curl -X POST http://localhost:3000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is semantic caching?"}'

# Test similar query (should hit cache)
curl -X POST http://localhost:3000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain semantic caching"}'
```

### Endpoints
- `POST /chat` – Cached LLM call (uses the SDK's `capture()`)
  - Body: `{ "prompt": string, "tenantId"?: string, "userId"?: string }`
  - Returns: `{ response, cached, cost_saved, duration_ms }`
- `GET /suggest?q=...&tenantId=...&limit=...` – HTTP suggestions
  - Returns: `{ items: [{ id, prompt, response, score, createdAt, metadata }] }`
- `WS /ws/suggest?tenantId=...` – WebSocket suggestions
  - Send: `{ text: string, limit?: number, minSimilarity?: number }`
  - Receive: `{ items: [...] }`
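For the WebSocket endpoint, a minimal Node client could look like the sketch below. It assumes the test app is running on `localhost:3000` as in the curl examples, uses the `ws` package (not bundled with the test app), and assumes returned items carry the same fields as the HTTP `/suggest` response.

```typescript
import WebSocket from 'ws'; // npm install ws

// Connect to the test app's typeahead suggestion socket for one tenant
const socket = new WebSocket('ws://localhost:3000/ws/suggest?tenantId=org1');

socket.on('open', () => {
  // Send the partial prompt the user has typed so far
  socket.send(JSON.stringify({ text: 'What is semantic cach', limit: 5, minSimilarity: 0.7 }));
});

socket.on('message', (data) => {
  // Receive: { items: [...] } with semantically similar cached prompts
  const { items } = JSON.parse(data.toString());
  for (const item of items) {
    console.log(item.score, item.prompt);
  }
});
```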
## How It Works

- Embedding Generation: Convert prompts to vectors using OpenAI or local embeddings
- Similarity Search: Query Qdrant for semantically similar cached prompts
- Cache Hit/Miss: Return the cached response if similarity > threshold, otherwise call the LLM (see the sketch after this list)
- Storage: Store new LLM responses with metadata and TTL
- Cleanup: Automatic removal of expired entries
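Conceptually, the flow reduces to something like the sketch below. This is not the SDK's actual implementation; it is an illustration using the official `@qdrant/js-client-rest` and OpenAI clients, and the embedding model name and payload fields are assumptions.

```typescript
import OpenAI from 'openai';
import { QdrantClient } from '@qdrant/js-client-rest';
import { randomUUID } from 'node:crypto';

const qdrant = new QdrantClient({ url: 'http://localhost:6333' });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const COLLECTION = 'llm_cache'; // assumes the collection already exists (the SDK's init() handles this)
const THRESHOLD = 0.85;

// 1. Embedding generation (here via OpenAI; the SDK can also fall back to local TF-IDF)
async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({ model: 'text-embedding-3-small', input: text });
  return res.data[0].embedding;
}

async function cachedCall(prompt: string, callLLM: () => Promise<string>): Promise<string> {
  const vector = await embed(prompt);

  // 2. Similarity search: nearest cached prompt above the threshold
  const hits = await qdrant.search(COLLECTION, {
    vector,
    limit: 1,
    score_threshold: THRESHOLD,
  });

  // 3. Cache hit: return the stored response without calling the LLM
  if (hits.length > 0) {
    return hits[0].payload?.response as string;
  }

  // 3-4. Cache miss: call the LLM, then store the response (the SDK also records a TTL
  // and periodically removes expired entries)
  const response = await callLLM();
  await qdrant.upsert(COLLECTION, {
    points: [{ id: randomUUID(), vector, payload: { prompt, response, createdAt: Date.now() } }],
  });
  return response;
}
```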
## Architecture
Based on proven semantic caching patterns from Shuttle.dev's Qdrant guide, adapted for Node.js/TypeScript.
```
┌─────────────────┐
│    Your App     │
│                 │
│  ┌───────────┐  │
│  │  OptiLLM  │  │ ──── Semantic similarity search
│  │    SDK    │  │
│  └─────┬─────┘  │
│        │        │
└────────┼────────┘
         │
    ┌────▼────┐
    │ Qdrant  │ ──── Vector storage & search
    │         │
    └─────────┘
```

## License
MIT
