opti-llm

v1.0.2

Semantic caching SDK for LLM cost reduction using Qdrant

OptiLLM - Semantic Caching SDK

Based on Qdrant semantic caching patterns

A lightweight TypeScript SDK for semantic caching of LLM responses using Qdrant vector database. Reduce costs and improve response times by caching semantically similar queries.

Features

  • 🚀 Semantic Caching: Cache LLM responses based on semantic similarity, not exact matches
  • 💰 Cost Reduction: Avoid redundant API calls for similar queries
  • Fast Retrieval: Vector-based similarity search with Qdrant
  • 🔧 Flexible: Support for OpenAI embeddings or local TF-IDF fallback (see the sketch after this list)
  • 🏢 Multi-tenant: Built-in tenant and user scoping
  • TTL Support: Automatic expiration of cached entries
  • 🧠 Typeahead Suggestions: HTTP and WebSocket APIs for live suggestions as users type
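
For example, the local TF-IDF fallback can be selected through the embedding config when no OpenAI key is available. A minimal sketch based on the Configuration section below (the stricter threshold here is an assumption, not a package default):

import { createOptiLLM } from 'opti-llm';

// Local TF-IDF embeddings: handy for tests or offline development; no OpenAI key needed.
const cache = createOptiLLM({
  qdrantUrl: 'http://localhost:6333',
  embedding: { provider: 'local' },
  similarityThreshold: 0.9, // local embeddings are coarser, so a stricter threshold may help
});

await cache.init();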

Quick Start

1. Install

npm install opti-llm

2. Setup Qdrant

# Local Qdrant with Docker
docker run -p 6333:6333 qdrant/qdrant

# Or use Qdrant Cloud (free tier available)
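
Optionally, confirm the Qdrant instance is reachable before initializing the SDK. A small sketch using Qdrant's own HTTP API (the /collections endpoint belongs to Qdrant itself, not to this SDK):

// Quick reachability check against the Qdrant HTTP API (Node 18+ provides a global fetch).
const res = await fetch('http://localhost:6333/collections');
if (!res.ok) {
  throw new Error(`Qdrant is not reachable: ${res.status}`);
}
console.log(await res.json()); // lists existing collections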

3. Basic Usage (SDK)

import { createOptiLLM } from 'opti-llm';
import OpenAI from 'openai';

// Initialize
const optiLLM = createOptiLLM({
  qdrantUrl: 'http://localhost:6333',
  embedding: {
    provider: 'openai', // or 'local' for testing
    apiKey: process.env.OPENAI_API_KEY,
  },
  similarityThreshold: 0.85
});

await optiLLM.init();

// Your LLM client
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Cached LLM calls
const result = await optiLLM.capture(
  {
    prompt: "What is Redis vector search?",
    metadata: {
      provider: 'openai',
      model: 'gpt-4o-mini',
      tenantId: 'org1',
      userId: 'user123'
    },
    policy: {
      maxAge: 3600, // 1 hour TTL
      minSimilarity: 0.8
    }
  },
  async () => {
    // This expensive call only happens on cache miss
    const completion = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: "What is Redis vector search?" }]
    });
    return completion.choices[0].message.content;
  }
);

console.log(result.response); // LLM response
console.log(result.cached);   // true if from cache
console.log(result.cost_saved); // true if cache hit

// Optional: Suggestions (backend SDK)
const suggestions = await optiLLM.suggest({
  text: 'What is Redis vec',
  tenantId: 'org1',
  limit: 5,
  minSimilarity: 0.7,
});
console.log(suggestions);
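
Based on the fields read above and the /chat response documented in the test app below, the capture result appears to have roughly this shape. This is an inference from the examples, not the package's published typings:

interface CaptureResult {
  response: string;      // LLM output, served from cache or freshly generated
  cached: boolean;       // true when the response came from the semantic cache
  cost_saved: boolean;   // true when a paid API call was avoided
  duration_ms?: number;  // reported by the test app's /chat endpoint
}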

Configuration

interface OptiLLMConfig {
  qdrantUrl: string;              // Qdrant instance URL
  collectionName?: string;        // Collection name (default: 'llm_cache')
  apiKey?: string;                // Qdrant API key (for cloud)
  embedding?: {
    provider: 'openai' | 'local'; // Embedding provider
    apiKey?: string;              // OpenAI API key
    model?: string;               // Embedding model
  };
  defaultTTL?: number;            // Default TTL in seconds
  similarityThreshold?: number;   // Similarity threshold (0-1)
}
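
As a concrete example, a Qdrant Cloud setup with OpenAI embeddings might look like the following. The cluster URL and embedding model are placeholders, not package defaults:

import { createOptiLLM } from 'opti-llm';

const optiLLM = createOptiLLM({
  qdrantUrl: process.env.QDRANT_URL!,   // e.g. https://your-cluster.region.aws.cloud.qdrant.io
  apiKey: process.env.QDRANT_API_KEY,   // required for Qdrant Cloud
  collectionName: 'llm_cache',          // documented default, shown here explicitly
  embedding: {
    provider: 'openai',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'text-embedding-3-small',    // assumed model; omit to use the SDK's default
  },
  defaultTTL: 3600,                     // cache entries expire after one hour
  similarityThreshold: 0.85,
});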

Test App (Demo UI + APIs)

Run the included Express test app:

cd test-app
npm install

# Setup environment variables
cp env.example .env
# Edit .env with your actual API keys

# Start the server
npm run dev

Your .env file should contain:

OPENAI_API_KEY=your_openai_api_key
QDRANT_URL=https://your-cluster.region.aws.cloud.qdrant.io
QDRANT_API_KEY=your_qdrant_api_key

Test endpoints:

# Chat with caching
curl -X POST http://localhost:3000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is semantic caching?"}'

# Test similar query (should hit cache)
curl -X POST http://localhost:3000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain semantic caching"}'

Endpoints

  • POST /chat — Cached LLM call (uses SDK capture())

    • Body: { "prompt": string, "tenantId"?: string, "userId"?: string }
    • Returns: { response, cached, cost_saved, duration_ms }
  • GET /suggest?q=...&tenantId=...&limit=... — HTTP suggestions

    • Returns: { items: [{ id, prompt, response, score, createdAt, metadata }] }
  • WS /ws/suggest?tenantId=... — WebSocket suggestions (see the client sketch after this list)

    • Send: { text: string, limit?: number, minSimilarity?: number }
    • Receive: { items: [...] }
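
A minimal browser-side client for the WebSocket endpoint could look like this. The host, port, and query values mirror the endpoint description above; treat it as a sketch rather than the test app's actual UI code:

// Connect to the suggestions socket for a given tenant.
const ws = new WebSocket('ws://localhost:3000/ws/suggest?tenantId=org1');

ws.onopen = () => {
  // Send the current input text; the server replies with matching cached prompts.
  ws.send(JSON.stringify({ text: 'What is Redis vec', limit: 5, minSimilarity: 0.7 }));
};

ws.onmessage = (event) => {
  const { items } = JSON.parse(event.data);
  for (const item of items) {
    console.log(item.prompt, item.score);
  }
};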

How It Works

  1. Embedding Generation: Convert prompts to vectors using OpenAI or local embeddings
  2. Similarity Search: Query Qdrant for semantically similar cached prompts
  3. Cache Hit/Miss: Return cached response if similarity > threshold, otherwise call LLM
  4. Storage: Store new LLM responses with metadata and TTL
  5. Cleanup: Automatic removal of expired entries
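
To make these steps concrete, the same flow can be sketched directly against Qdrant and OpenAI. This illustrates the pattern, not the SDK's internal implementation; the collection name, embedding model, and score threshold are assumptions, and TTL cleanup (step 5) is omitted:

import { randomUUID } from 'node:crypto';
import { QdrantClient } from '@qdrant/js-client-rest';
import OpenAI from 'openai';

const qdrant = new QdrantClient({ url: 'http://localhost:6333' });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Assumes an 'llm_cache' collection already exists with the embedding's vector size.
async function cachedCall(prompt: string, callLLM: () => Promise<string>): Promise<string> {
  // 1. Embedding generation: convert the prompt to a vector
  const embedding = (await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: prompt,
  })).data[0].embedding;

  // 2. Similarity search: look for semantically similar cached prompts
  const hits = await qdrant.search('llm_cache', {
    vector: embedding,
    limit: 1,
    score_threshold: 0.85, // only count results above the similarity threshold
  });

  // 3. Cache hit: return the stored response without calling the LLM
  if (hits.length > 0) {
    return hits[0].payload?.response as string;
  }

  // 4. Cache miss: call the LLM, then store the response for future queries
  const response = await callLLM();
  await qdrant.upsert('llm_cache', {
    points: [{ id: randomUUID(), vector: embedding, payload: { prompt, response, createdAt: Date.now() } }],
  });
  return response;
}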

Architecture

Based on proven semantic caching patterns from Shuttle.dev's Qdrant guide, adapted for Node.js/TypeScript.

┌─────────────────┐
│   Your App      │
│                 │
│  ┌───────────┐  │
│  │ OptiLLM   │  │ ──── Semantic similarity search
│  │    SDK    │  │
│  └─────┬─────┘  │
│        │        │
└────────┼────────┘
         │
    ┌────▼────┐
    │ Qdrant  │ ──── Vector storage & search
    │         │
    └─────────┘

License

MIT