@neural-tools/semantic-cache

Semantic caching for LLM responses

Intelligent caching for LLM responses using semantic similarity. Save costs and improve response times by reusing similar completions.
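
Under the hood, prompts are matched by comparing their embedding vectors rather than their raw text. The sketch below is not this package's internals, just a minimal illustration of the idea: two prompts count as equivalent when the similarity of their embeddings clears a threshold (cosine similarity is the usual choice).

// Illustration only: how semantic similarity between two prompts is typically scored.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

For example, "What is the capital of France?" and "What's France's capital?" embed to nearby vectors, so the second can be answered from the first prompt's cached response.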

Installation

npm install @neural-tools/semantic-cache @neural-tools/vector-db

Features

  • Semantic Matching - Finds similar prompts, not just exact matches
  • Cost Savings - Reduce API calls to expensive LLMs
  • Fast Responses - Instant replies for cached queries
  • Configurable - Adjust similarity threshold
  • Provider Agnostic - Works with any vector database
  • TTL Support - Automatic cache expiration

Quick Start

import { SemanticCache } from '@neural-tools/semantic-cache';
import { VectorDB } from '@neural-tools/vector-db';

// Setup vector database
const vectorDB = new VectorDB({
  provider: 'pinecone',
  config: {
    apiKey: process.env.PINECONE_API_KEY,
    environment: 'us-west1-gcp',
    indexName: 'llm-cache'
  }
});

// Create semantic cache
const cache = new SemanticCache({
  vectorDB,
  similarityThreshold: 0.9,  // 0-1, higher = more similar
  ttl: 3600                   // Cache lifetime in seconds
});

await cache.initialize();

// Your embedding function
async function embed(text: string): Promise<number[]> {
  // Use OpenAI, Anthropic, or any embedding model and return its vector.
  // See the "With OpenAI" example below for a concrete implementation.
  throw new Error('plug in an embedding provider');
}

// Check cache before calling LLM
const prompt = "What is the capital of France?";
const embedding = await embed(prompt);

const cached = await cache.get(embedding);

if (cached) {
  console.log('Cache hit!', cached.response);
} else {
  // Call your LLM
  const response = await callLLM(prompt);

  // Store in cache
  await cache.set(embedding, {
    prompt,
    response,
    model: 'claude-3-opus',
    timestamp: Date.now()
  });
}

API Reference

Constructor

new SemanticCache(options: SemanticCacheOptions)

interface SemanticCacheOptions {
  vectorDB: VectorDB;
  similarityThreshold?: number;  // Default: 0.9
  ttl?: number;                  // Seconds, default: 3600
  namespace?: string;
}

Methods

initialize()

Initialize the cache and vector database connection.

await cache.initialize();

get(embedding)

Retrieve a cached response for similar prompts.

const result = await cache.get(embedding);

if (result) {
  console.log(result.response);
  console.log(result.similarity);  // How similar (0-1)
  console.log(result.metadata);
}

set(embedding, data)

Store a response in the cache.

await cache.set(embedding, {
  prompt,      // string, required
  response,    // string, required
  model,       // string, optional
  tokens,      // number, optional
  timestamp,   // number, optional
  metadata     // Record<string, any>, optional
});

delete(id)

Remove a specific cache entry.

await cache.delete('cache-entry-id');

clear()

Clear all cached entries.

await cache.clear();

stats()

Get cache statistics.

const stats = await cache.stats();
console.log(stats);
// {
//   totalEntries: 1234,
//   hitRate: 0.75,
//   avgSimilarity: 0.92
// }

Usage Examples

With OpenAI

import { SemanticCache } from '@neural-tools/semantic-cache';
import { VectorDB } from '@neural-tools/vector-db';
import OpenAI from 'openai';

const openai = new OpenAI();
const vectorDB = new VectorDB({ /* ... */ });
const cache = new SemanticCache({ vectorDB });

await cache.initialize();

async function completionWithCache(prompt: string) {
  // Get embedding
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: prompt
  });
  const embedding = embeddingResponse.data[0].embedding;

  // Check cache
  const cached = await cache.get(embedding);
  if (cached) {
    console.log('Cache hit! Saved API call.');
    return cached.response;
  }

  // Call LLM
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }]
  });

  const response = completion.choices[0].message.content;

  // Cache the response
  await cache.set(embedding, {
    prompt,
    response,
    model: 'gpt-4',
    tokens: completion.usage?.total_tokens
  });

  return response;
}

// Use it
const answer = await completionWithCache('Explain quantum computing');

With Anthropic Claude

import Anthropic from '@anthropic-ai/sdk';
import { SemanticCache } from '@neural-tools/semantic-cache';

const anthropic = new Anthropic();
const cache = new SemanticCache({ /* ... */ });

async function claudeWithCache(prompt: string) {
  const embedding = await getEmbedding(prompt);

  const cached = await cache.get(embedding);
  if (cached) return cached.response;

  const message = await anthropic.messages.create({
    model: 'claude-3-opus-20240229',
    max_tokens: 1024,
    messages: [{ role: 'user', content: prompt }]
  });

  const response = message.content[0].text;

  await cache.set(embedding, {
    prompt,
    response,
    model: 'claude-3-opus-20240229'
  });

  return response;
}

Custom Similarity Threshold

// Strict matching (0.95+)
const strictCache = new SemanticCache({
  vectorDB,
  similarityThreshold: 0.95
});

// Loose matching (0.80+)
const looseCache = new SemanticCache({
  vectorDB,
  similarityThreshold: 0.80
});

// Very strict (0.98+) - almost exact matches only
const veryStrictCache = new SemanticCache({
  vectorDB,
  similarityThreshold: 0.98
});

With TTL (Time-To-Live)

const cache = new SemanticCache({
  vectorDB,
  ttl: 86400  // 24 hours
});

// Cached responses expire after 24 hours

Namespace for Multiple Models

const gpt4Cache = new SemanticCache({
  vectorDB,
  namespace: 'gpt-4'
});

const claudeCache = new SemanticCache({
  vectorDB,
  namespace: 'claude-opus'
});

// Separate caches for different models

Configuration

Similarity Threshold

Controls how similar prompts need to be:

  • 0.99 - Nearly identical prompts
  • 0.95 - Very similar prompts (recommended for production)
  • 0.90 - Similar prompts (good balance)
  • 0.85 - Somewhat similar prompts
  • 0.80 - Loosely similar prompts
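
A practical way to tune this is to log the similarity reported on each hit (the similarity field returned by get(), shown in the API reference above) and adjust from there. A small sketch:

const result = await cache.get(embedding);

if (result) {
  // Each hit reports how close the match actually was (0-1).
  console.log(`hit at similarity ${result.similarity.toFixed(3)}`);

  if (result.similarity < 0.93) {
    // Borderline match: inspect result.response and result.metadata manually
    // before deciding whether to raise or lower similarityThreshold.
  }
}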

TTL (Time-To-Live)

How long to keep cached responses:

{ ttl: 3600 }     // 1 hour
{ ttl: 86400 }    // 24 hours
{ ttl: 604800 }   // 1 week
{ ttl: 0 }        // Never expire
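
Expired entries behave like cache misses. A short sketch, assuming (per the automatic-expiration feature above) that get() simply stops returning an entry once its TTL has elapsed:

const newsCache = new SemanticCache({ vectorDB, ttl: 300 });  // 5 minutes
await newsCache.initialize();

await newsCache.set(embedding, { prompt, response });

// Within 5 minutes: served from the cache
const fresh = await newsCache.get(embedding);   // -> { response, similarity, ... }

// After 5 minutes: the entry has expired, so this is a miss
const stale = await newsCache.get(embedding);   // -> null, call the LLM again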

Cost Savings Example

// Without caching
// 1000 requests to GPT-4 @ $0.03 per 1K tokens
// Average 500 tokens per response
// Cost: 1000 * 0.03 * 0.5 = $15

// With semantic caching (75% hit rate)
// 250 requests to GPT-4
// 750 cache hits (free)
// Cost: 250 * 0.03 * 0.5 = $3.75
// Savings: $11.25 (75%)

const cache = new SemanticCache({ vectorDB });
// At a 75% hit rate, adding the cache cuts LLM spend by 75%
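
The same arithmetic as a small helper, so you can plug in your own traffic, hit rate, and per-call cost (the $0.015 below is the 500 tokens at $0.03/1K tokens from the comment above; this helper is illustrative, not part of the package API):

function estimateLlmSpend(requests: number, hitRate: number, costPerLlmCall: number): number {
  const llmCalls = requests * (1 - hitRate);   // cache hits cost nothing
  return llmCalls * costPerLlmCall;
}

estimateLlmSpend(1000, 0, 0.015);     // $15.00 with no caching
estimateLlmSpend(1000, 0.75, 0.015);  // $3.75 at a 75% hit rate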

Performance

Typical performance characteristics:

  • Cache Hit: 10-50ms (vector lookup)
  • Cache Miss: LLM latency + 20-100ms (store)
  • Memory: Minimal (vectors stored in vector DB)
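
These figures depend on your vector database, region, and network, so it is worth measuring them in your own deployment; a minimal sketch:

// Time a single lookup to see where your deployment lands.
const start = performance.now();
const hit = await cache.get(embedding);
const elapsedMs = performance.now() - start;

console.log(hit
  ? `cache hit in ${elapsedMs.toFixed(1)}ms`
  : `cache miss in ${elapsedMs.toFixed(1)}ms`);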

Best Practices

1. Choose the Right Threshold

// For FAQ / repetitive queries
{ similarityThreshold: 0.85 }

// For production assistants
{ similarityThreshold: 0.92 }

// For high-accuracy requirements
{ similarityThreshold: 0.97 }

2. Set Appropriate TTL

// Real-time data (weather, news)
{ ttl: 300 }  // 5 minutes

// General knowledge
{ ttl: 86400 }  // 24 hours

// Static content
{ ttl: 604800 }  // 1 week

3. Monitor Hit Rates

const stats = await cache.stats();
console.log(`Hit rate: ${(stats.hitRate * 100).toFixed(1)}%`);

// Adjust threshold if hit rate is too low/high

4. Use Namespaces

// Separate caches by use case
const customerSupport = new SemanticCache({
  vectorDB,
  namespace: 'customer-support'
});

const codeGen = new SemanticCache({
  vectorDB,
  namespace: 'code-generation'
});

Dependencies

  • @neural-tools/vector-db - vector storage backend installed alongside this package (see Installation)

Contributing

Contributions are welcome! See the main repository for guidelines.

License

MIT - See LICENSE.md for details.
