@nano-llm-cache/core

v1.0.3

A semantic cache for LLM API calls using local embeddings and vector similarity

🚀 Nano-LLM-Cache

A Semantic Cache for LLM API Calls - Save money and improve response times by caching based on meaning, not exact matches.

🎯 What is Nano-LLM-Cache?

Nano-LLM-Cache is a TypeScript library that intercepts LLM API calls and returns cached responses based on semantic similarity rather than exact string matching. It uses local embeddings (computed entirely client-side, in the browser) to understand the meaning of prompts.

The Problem

Traditional caches look for exact key matches:

  • ❌ "What is the weather in London?" → Cache MISS
  • ❌ "Tell me the London weather" → Cache MISS (different string!)

The Solution

Nano-LLM-Cache uses vector embeddings to understand meaning:

  • ✅ "What is the weather in London?" → Cache HIT
  • ✅ "Tell me the London weather" → Cache HIT (same meaning!)

✨ Features

  • 🧠 Semantic Understanding: Matches prompts by meaning, not exact text
  • 🔒 Privacy-First: Embeddings run locally - your data never leaves the device
  • Fast & Lightweight: Uses quantized models (~20MB, downloaded once and cached locally)
  • 💾 Persistent Storage: IndexedDB for cross-session caching
  • TTL Support: Configurable time-to-live for cache entries
  • 🔌 Drop-in Replacement: Works as an OpenAI SDK wrapper
  • 📊 Cache Analytics: Built-in statistics and monitoring
  • 🎨 TypeScript: Full type safety and IntelliSense support

📦 Installation

npm install @nano-llm-cache/core

🚀 Quick Start

Basic Usage

import { NanoCache } from '@nano-llm-cache/core';

// Create cache instance
const cache = new NanoCache({
  similarityThreshold: 0.95, // 95% similarity required for cache hit
  maxAge: 60 * 60 * 1000,    // 1 hour TTL
  debug: true                 // Enable logging
});

// Save a response
await cache.save(
  'What is the weather in London?',
  'The weather in London is cloudy with a chance of rain, 15°C.'
);

// Query with similar prompt
const result = await cache.query('Tell me the London weather');

if (result.hit) {
  console.log('Cache HIT!', result.response);
  console.log('Similarity:', result.similarity); // 0.98
} else {
  console.log('Cache MISS - call your LLM API');
}

OpenAI Wrapper (Drop-in Replacement)

import OpenAI from 'openai';
import { NanoCache } from '@nano-llm-cache/core';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const cache = new NanoCache({ similarityThreshold: 0.95 });

// Wrap the OpenAI function
const cachedCreate = cache.createChatWrapper(
  openai.chat.completions.create.bind(openai.chat.completions)
);

// Use it exactly like the original!
const response = await cachedCreate({
  model: 'gpt-4',
  messages: [
    { role: 'user', content: 'How do I center a div?' }
  ]
});

console.log(response.choices[0].message.content);

// Second call with similar question - returns cached response instantly!
const response2 = await cachedCreate({
  model: 'gpt-4',
  messages: [
    { role: 'user', content: 'Best way to align a div to the middle?' }
  ]
});
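For intuition, the wrapper's behavior can be sketched as a generic cache-aside function. This is a hypothetical simplification, not the library's actual implementation; queryFn and saveFn stand in for NanoCache's semantic lookup and save:

```typescript
// Hypothetical sketch of the cache-aside logic behind createChatWrapper.
type ChatFn = (prompt: string) => Promise<string>;

function withCache(
  queryFn: (p: string) => Promise<string | null>, // semantic lookup
  saveFn: (p: string, r: string) => Promise<void>, // store for later hits
  original: ChatFn                                 // the real API call
): ChatFn {
  return async (prompt: string) => {
    const cached = await queryFn(prompt); // check the cache first
    if (cached !== null) return cached;   // hit: skip the API entirely
    const fresh = await original(prompt); // miss: call the LLM
    await saveFn(prompt, fresh);          // save so similar prompts hit
    return fresh;
  };
}
```

The key property is that the wrapped function has the same signature as the original, so callers never need to know a cache is involved.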

📚 API Reference

NanoCache

Constructor

new NanoCache(config?: NanoCacheConfig)

Configuration Options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| similarityThreshold | number | 0.95 | Minimum similarity (0-1) for cache hit |
| maxAge | number | undefined | Max age in ms before entries expire |
| modelName | string | 'Xenova/all-MiniLM-L6-v2' | Embedding model to use |
| debug | boolean | false | Enable debug logging |
| storagePrefix | string | 'nano-llm-cache' | IndexedDB key prefix |

Methods

query(prompt: string): Promise<CacheQueryResult>

Search the cache for a semantically similar prompt.

const result = await cache.query('What is TypeScript?');

if (result.hit) {
  console.log(result.response);    // Cached response
  console.log(result.similarity);  // 0.97
  console.log(result.entry);       // Full cache entry
}

save(prompt: string, response: string, metadata?: object): Promise<void>

Save a prompt-response pair to the cache.

await cache.save(
  'What is TypeScript?',
  'TypeScript is a typed superset of JavaScript.',
  { model: 'gpt-4', timestamp: Date.now() }
);

clear(): Promise<void>

Clear all cache entries.

await cache.clear();

getStats(): Promise<CacheStats>

Get cache statistics.

const stats = await cache.getStats();
console.log(stats.totalEntries);  // 42
console.log(stats.oldestEntry);   // 1707123456789
console.log(stats.newestEntry);   // 1707987654321

preloadModel(): Promise<void>

Preload the embedding model (recommended for better UX).

await cache.preloadModel();

unloadModel(): Promise<void>

Unload the model to free memory.

await cache.unloadModel();

createChatWrapper<T>(originalFn: T): T

Create an OpenAI-compatible wrapper function.

const cachedCreate = cache.createChatWrapper(
  openai.chat.completions.create.bind(openai.chat.completions)
);

🎨 Examples

Example 1: Weather Queries

const cache = new NanoCache({ similarityThreshold: 0.95 });

// Save weather data
await cache.save(
  'What is the weather in London?',
  'Cloudy, 15°C, chance of rain'
);

// These all return the cached response:
await cache.query('Tell me the London weather');        // ✅ HIT
await cache.query('How is the weather in London?');     // ✅ HIT
await cache.query('London weather today');              // ✅ HIT
await cache.query('What is the weather in Paris?');     // ❌ MISS

Example 2: Programming Questions

await cache.save(
  'How do I center a div?',
  'Use flexbox: display: flex; justify-content: center; align-items: center;'
);

// Similar questions hit the cache:
await cache.query('Best way to align a div to the middle?');  // ✅ HIT
await cache.query('Center a div CSS');                        // ✅ HIT
await cache.query('How to make a div centered?');             // ✅ HIT

Example 3: With TTL (Time To Live)

const cache = new NanoCache({
  maxAge: 60 * 60 * 1000  // 1 hour
});

// Weather data expires after 1 hour
await cache.save(
  'Current temperature in NYC',
  '72°F, sunny'
);

// After 1 hour, this will be a cache miss

🧪 Testing

Run the test suite:

npm test

Run tests with UI:

npm run test:ui

Generate coverage report:

npm run test:coverage

🏗️ Building

Build the library:

npm run build

Development mode (watch):

npm run dev

📊 How It Works

1. Vector Embeddings

When you save a prompt, Nano-LLM-Cache converts it into a 384-dimensional vector:

"What is the weather in London?" → [0.12, -0.44, 0.88, ...]
"Tell me the London weather"    → [0.13, -0.43, 0.89, ...]

These vectors are close together in embedding space because the prompts have similar meanings.

2. Cosine Similarity

When querying, we calculate the cosine similarity between vectors:

similarity = dotProduct(vecA, vecB) / (magnitude(vecA) * magnitude(vecB))

A cosine similarity of 0.95 means the two embedding vectors point in nearly the same direction, which in practice means the prompts are very likely paraphrases of each other.
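The formula above translates directly to code. A minimal sketch in TypeScript:

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // accumulate dot product
    magA += a[i] * a[i];  // accumulate squared magnitudes
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
```

Because the embedding model normalizes its output vectors, the denominator is close to 1 and the comparison is effectively just a dot product.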

3. Local Processing

Everything runs locally using WebAssembly:

  • ✅ No API calls for embeddings
  • ✅ No data sent to external servers
  • ✅ Works offline after initial model download
  • ✅ Model cached in browser (~20MB, downloads once)

💡 Use Cases

1. Cost Reduction

LLM APIs charge per token, so repeated near-duplicate questions add up. An illustrative scenario for a million users asking similar questions:

  • Without cache: $50,000+ in API costs
  • With Nano-LLM-Cache: ~$500 (assuming a 99% cache hit rate)

2. Faster Response Times

  • API call: 2-5 seconds
  • Cache hit: <100ms

3. Offline Capability

Once the model is cached, your app works offline for cached queries.

4. Privacy

User prompts are embedded locally - no data leaves the device until the actual LLM call.

⚙️ Configuration Tips

Similarity Threshold

  • 0.99: Very strict - only nearly identical prompts match
  • 0.95: Recommended - catches paraphrases and similar questions
  • 0.90: Looser - may match somewhat related topics
  • 0.85: Very loose - use with caution

Model Selection

Default: Xenova/all-MiniLM-L6-v2 (384 dimensions, ~20MB)

Other options:

  • Xenova/all-MiniLM-L12-v2: Larger, more accurate (~45MB)
  • Xenova/paraphrase-multilingual-MiniLM-L12-v2: Multilingual support

TTL Strategy

// Real-time data (weather, stock prices)
maxAge: 60 * 60 * 1000  // 1 hour

// Static knowledge (programming questions)
maxAge: undefined  // Never expire

// Daily updates (news summaries)
maxAge: 24 * 60 * 60 * 1000  // 24 hours
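Under the hood, a maxAge check amounts to comparing an entry's saved timestamp against the current time. A hypothetical sketch (the library's internal field names may differ):

```typescript
// Returns true if an entry saved at `savedAt` (ms since epoch) has
// outlived maxAge. maxAge === undefined means entries never expire.
function isExpired(
  savedAt: number,
  maxAge: number | undefined,
  now: number = Date.now()
): boolean {
  if (maxAge === undefined) return false; // no TTL configured
  return now - savedAt > maxAge;          // stale if older than maxAge
}
```

Expired entries can then be skipped during lookup even if their similarity score would otherwise qualify as a hit.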

🔧 Advanced Usage

Custom Storage

import { NanoCache } from '@nano-llm-cache/core';

const cache = new NanoCache({
  storagePrefix: 'my-app-cache'  // Separate cache per app
});

Batch Operations

// Save multiple entries
const entries = [
  { prompt: 'Q1', response: 'A1' },
  { prompt: 'Q2', response: 'A2' },
];

for (const { prompt, response } of entries) {
  await cache.save(prompt, response);
}

Cache Warming

// Preload common queries on app startup
async function warmCache() {
  await cache.preloadModel();
  
  const commonQueries = [
    { q: 'How do I...', a: '...' },
    { q: 'What is...', a: '...' },
  ];
  
  for (const { q, a } of commonQueries) {
    await cache.save(q, a);
  }
}

📈 Performance

| Operation | Time |
|-----------|------|
| First query (model load) | ~2-3s |
| Subsequent queries | ~50-100ms |
| Save operation | ~50-100ms |
| Cache hit | <10ms |

Memory Usage:

  • Model: ~20MB (cached in browser)
  • Per entry: ~2-3KB (a 384-dimension Float32 embedding is 384 × 4 bytes ≈ 1.5KB, plus prompt text and metadata)
  • 1000 entries: ~2-3MB

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

MIT © [Your Name]

🙏 Acknowledgments

🔗 Links


Made with ❤️ by developers who hate paying for duplicate LLM calls