@neural-tools/semantic-cache
Semantic caching for LLM responses
Intelligent caching for LLM responses using semantic similarity. Save costs and improve response times by reusing similar completions.
Installation
npm install @neural-tools/semantic-cache @neural-tools/vector-db
Features
- Semantic Matching - Finds similar prompts, not just exact matches
- Cost Savings - Reduce API calls to expensive LLMs
- Fast Responses - Instant replies for cached queries
- Configurable - Adjust similarity threshold
- Provider Agnostic - Works with any vector database
- TTL Support - Automatic cache expiration
Quick Start
import { SemanticCache } from '@neural-tools/semantic-cache';
import { VectorDB } from '@neural-tools/vector-db';
// Setup vector database
const vectorDB = new VectorDB({
provider: 'pinecone',
config: {
apiKey: process.env.PINECONE_API_KEY,
environment: 'us-west1-gcp',
indexName: 'llm-cache'
}
});
// Create semantic cache
const cache = new SemanticCache({
vectorDB,
similarityThreshold: 0.9, // 0-1, higher = more similar
ttl: 3600 // Cache lifetime in seconds
});
await cache.initialize();
// Your embedding function
async function embed(text: string): Promise<number[]> {
// Use OpenAI, Anthropic, or any embedding model and return its embedding vector
// (see the "With OpenAI" example below for a concrete implementation)
throw new Error('Provide an embedding implementation');
}
// Check cache before calling LLM
const prompt = "What is the capital of France?";
const embedding = await embed(prompt);
const cached = await cache.get(embedding);
if (cached) {
console.log('Cache hit!', cached.response);
} else {
// Call your LLM
const response = await callLLM(prompt);
// Store in cache
await cache.set(embedding, {
prompt,
response,
model: 'claude-3-opus',
timestamp: Date.now()
});
}
API Reference
Constructor
new SemanticCache(options: SemanticCacheOptions)
interface SemanticCacheOptions {
vectorDB: VectorDB;
similarityThreshold?: number; // Default: 0.9
ttl?: number; // Seconds, default: 3600
namespace?: string;
}
Methods
initialize()
Initialize the cache and vector database connection.
await cache.initialize();
get(embedding)
Retrieve a cached response for similar prompts.
const result = await cache.get(embedding);
if (result) {
console.log(result.response);
console.log(result.similarity); // How similar (0-1)
console.log(result.metadata);
}
set(embedding, data)
Store a response in the cache.
await cache.set(embedding, {
prompt: string;
response: string;
model?: string;
tokens?: number;
timestamp?: number;
metadata?: Record<string, any>;
});
delete(id)
Remove a specific cache entry.
await cache.delete('cache-entry-id');
clear()
Clear all cached entries.
await cache.clear();
stats()
Get cache statistics.
const stats = await cache.stats();
console.log(stats);
// {
// totalEntries: 1234,
// hitRate: 0.75,
// avgSimilarity: 0.92
// }
Usage Examples
With OpenAI
import { SemanticCache } from '@neural-tools/semantic-cache';
import { VectorDB } from '@neural-tools/vector-db';
import OpenAI from 'openai';
const openai = new OpenAI();
const vectorDB = new VectorDB({ /* ... */ });
const cache = new SemanticCache({ vectorDB });
await cache.initialize();
async function completionWithCache(prompt: string) {
// Get embedding
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: prompt
});
const embedding = embeddingResponse.data[0].embedding;
// Check cache
const cached = await cache.get(embedding);
if (cached) {
console.log('Cache hit! Saved API call.');
return cached.response;
}
// Call LLM
const completion = await openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }]
});
const response = completion.choices[0].message.content;
// Cache the response
await cache.set(embedding, {
prompt,
response,
model: 'gpt-4',
tokens: completion.usage?.total_tokens
});
return response;
}
// Use it
const answer = await completionWithCache('Explain quantum computing');
With Anthropic Claude
import Anthropic from '@anthropic-ai/sdk';
import { SemanticCache } from '@neural-tools/semantic-cache';
const anthropic = new Anthropic();
const cache = new SemanticCache({ /* ... */ });
async function claudeWithCache(prompt: string) {
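// getEmbedding is your own embedding helper (for example, the OpenAI embeddings call shown above)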
const embedding = await getEmbedding(prompt);
const cached = await cache.get(embedding);
if (cached) return cached.response;
const message = await anthropic.messages.create({
model: 'claude-3-opus-20240229',
max_tokens: 1024,
messages: [{ role: 'user', content: prompt }]
});
const response = message.content[0].text;
await cache.set(embedding, {
prompt,
response,
model: 'claude-3-opus-20240229'
});
return response;
}
Custom Similarity Threshold
// Strict matching (0.95+)
const strictCache = new SemanticCache({
vectorDB,
similarityThreshold: 0.95
});
// Loose matching (0.80+)
const looseCache = new SemanticCache({
vectorDB,
similarityThreshold: 0.80
});
// Very strict (0.98+) - almost exact matches only
const veryStrictCache = new SemanticCache({
vectorDB,
similarityThreshold: 0.98
});
With TTL (Time-To-Live)
const cache = new SemanticCache({
vectorDB,
ttl: 86400 // 24 hours
});
// Cached responses expire after 24 hours
Namespace for Multiple Models
const gpt4Cache = new SemanticCache({
vectorDB,
namespace: 'gpt-4'
});
const claudeCache = new SemanticCache({
vectorDB,
namespace: 'claude-opus'
});
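// Illustrative pattern (not part of the package API): route each request to the
// cache that matches the model being called
const cachesByModel: Record<string, SemanticCache> = {
'gpt-4': gpt4Cache,
'claude-opus': claudeCache,
};
const cacheFor = (model: string) => cachesByModel[model];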
// Separate caches for different models
Configuration
Similarity Threshold
Controls how similar prompts need to be:
- 0.99 - Nearly identical prompts
- 0.95 - Very similar prompts (recommended for production)
- 0.90 - Similar prompts (good balance)
- 0.85 - Somewhat similar prompts
- 0.80 - Loosely similar prompts
TTL (Time-To-Live)
How long to keep cached responses:
{ ttl: 3600 }   // 1 hour
{ ttl: 86400 }  // 24 hours
{ ttl: 604800 } // 1 week
{ ttl: 0 }      // Never expire
Cost Savings Example
// Without caching
// 1000 requests to GPT-4 @ $0.03 per 1K tokens
// Average 500 tokens per response
// Cost: 1000 * 0.03 * 0.5 = $15
// With semantic caching (75% hit rate)
// 250 requests to GPT-4
// 750 cache hits (free)
// Cost: 250 * 0.03 * 0.5 = $3.75
// Savings: $11.25 (75%)
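// A minimal sketch of the same arithmetic as a reusable helper (hypothetical, not part of the package)
function estimateLLMCost(requests: number, hitRate: number, costPerRequest: number): number {
// Only cache misses reach the LLM; cache hits cost nothing
return requests * (1 - hitRate) * costPerRequest;
}
// $0.03 per 1K tokens * 500 tokens = $0.015 per request
// estimateLLMCost(1000, 0, 0.015)    => $15.00 without caching
// estimateLLMCost(1000, 0.75, 0.015) => $3.75 with a 75% hit rate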
const cache = new SemanticCache({ vectorDB });
// Just add caching, save 75%!
Performance
Typical performance characteristics (a measurement sketch follows the list):
- Cache Hit: 10-50ms (vector lookup)
- Cache Miss: LLM latency + 20-100ms (store)
- Memory: Minimal (vectors stored in vector DB)
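To check the lookup latency in your own environment, you can time a single get call. This is a minimal sketch that assumes the cache and embed function from the Quick Start above and Node 18+ (for the global performance API); the embedding call itself adds latency on top of the vector lookup.
const embedding = await embed('What is the capital of France?');
const start = performance.now();
const cached = await cache.get(embedding);
console.log(`Vector lookup took ${(performance.now() - start).toFixed(1)} ms (hit: ${Boolean(cached)})`);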
Best Practices
1. Choose the Right Threshold
// For FAQ / repetitive queries
{ similarityThreshold: 0.85 }
// For production assistants
{ similarityThreshold: 0.92 }
// For high-accuracy requirements
{ similarityThreshold: 0.97 }
2. Set Appropriate TTL
// Real-time data (weather, news)
{ ttl: 300 } // 5 minutes
// General knowledge
{ ttl: 86400 } // 24 hours
// Static content
{ ttl: 604800 } // 1 week
3. Monitor Hit Rates
const stats = await cache.stats();
console.log(`Hit rate: ${(stats.hitRate * 100).toFixed(1)}%`);
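// For example (illustrative only), flag a hit rate below your target:
if (stats.hitRate < 0.5) {
console.warn('Hit rate below 50% - consider lowering similarityThreshold');
}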
// Adjust threshold if hit rate is too low/high
4. Use Namespaces
// Separate caches by use case
const customerSupport = new SemanticCache({
vectorDB,
namespace: 'customer-support'
});
const codeGen = new SemanticCache({
vectorDB,
namespace: 'code-generation'
});
Dependencies
- @neural-tools/core - Core utilities
- @neural-tools/vector-db - Vector database abstraction
Contributing
Contributions are welcome! See the main repository for guidelines.
License
MIT - See LICENSE.md for details.
