
magic-retrieval

v1.0.0

Reusable local-first hybrid retrieval core powered by SQLite, LanceDB, and optional OpenAI embeddings.

magic-retrieval is a local-first hybrid retrieval library for Node.js.

It gives you one reusable indexing and search layer that combines:

  • SQLite for document storage, metadata, facets, full-text search, and embedding cache
  • LanceDB for vector search
  • optional semantic retrieval via OpenAI or a custom embedding provider
  • a simple API for indexing documents and searching them with lexical, vector, or hybrid retrieval

It is designed for application-level search flows such as notes, knowledge bases, email archives, and assistant memory.

Why use it

  • local persistence with no external search service required
  • good default hybrid ranking out of the box
  • lexical-only fallback when embeddings are unavailable
  • deterministic chunking and preview generation
  • facet and time filtering
  • persistent embedding cache to avoid recomputing vectors
  • diagnostics for understanding what a search did internally
  • built-in quality tooling for evaluation, tuning, and benchmarking

Requirements

  • Node.js 20+
  • a writable filesystem location for:
    • the SQLite database file
    • the LanceDB directory
  • optional: OPENAI_API_KEY if you want semantic search via OpenAI

Install

npm install magic-retrieval

What the library stores

Each indexed document can contain:

  • top-level fields like documentId, sourceType, sourceId, title, url
  • body text via bodyText or text
  • optional explicit chunks
  • arbitrary metadata
  • normalized facets for filtering
  • timestamps like sortAt, createdAt, and updatedAt

If you do not provide chunks manually, the library chunks the body text for you.
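As a rough sketch of how deterministic fixed-size chunking with overlap works conceptually (the function below is illustrative only, not the library's actual chunkText() implementation; it just mirrors the chunkSize and chunkOverlap option names):

```typescript
// Illustrative sketch: fixed-size chunking with overlap.
// Same input always produces the same chunks (deterministic).
function chunkTextSketch(text: string, chunkSize = 20, chunkOverlap = 5): string[] {
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap; // how far each chunk advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

With chunkSize 20 and chunkOverlap 5, a 50-character body yields three chunks whose boundaries overlap by 5 characters, which keeps sentences that straddle a boundary retrievable from either side.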

Quick start

import { createMagicRetrievalStore } from "magic-retrieval";

const store = await createMagicRetrievalStore({
  dbPath: "./data/retrieval.sqlite",
  lancedbPath: "./data/lancedb",
  openAIApiKey: process.env.OPENAI_API_KEY,
});

await store.upsertDocuments(
  [
    {
      documentId: "doc-1",
      sourceType: "note",
      sourceId: "workspace-alpha",
      title: "Project overview",
      bodyText: "This document explains the architecture and rollout plan.",
      facets: {
        team: "core",
        visibility: "internal",
      },
      metadata: {
        owner: "platform",
      },
      sortAt: new Date().toISOString(),
    },
    {
      documentId: "doc-2",
      sourceType: "note",
      sourceId: "workspace-alpha",
      title: "Search diagnostics",
      bodyText: "Search diagnostics report candidate counts, timings, and retrieval mode.",
      facets: {
        team: "core",
        topic: "search",
      },
      sortAt: new Date().toISOString(),
    },
  ],
  { replaceAll: true },
);

const results = await store.search("architecture rollout", {
  limit: 5,
  filters: {
    sourceType: "note",
    facets: { team: "core" },
  },
});

console.log(results.map((result) => ({
  documentId: result.documentId,
  title: result.title,
  preview: result.preview,
  score: result.scores.hybrid,
})));

store.close();

Search with diagnostics

If you want observability for ranking and debugging, use searchWithDiagnostics():

const response = await store.searchWithDiagnostics("car engine", {
  limit: 3,
  filters: {
    facets: { category: "transport" },
  },
});

console.log(response.results);
console.log(response.diagnostics);

Diagnostics include:

  • query and limit
  • lexical candidate count
  • vector candidate count
  • fused candidate count
  • returned result count
  • whether embeddings were enabled
  • elapsed time for lexical retrieval, vector retrieval, fusion, and total search
  • the effective lexical/vector candidate multipliers
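To make the list above concrete, a diagnostics payload might look like the sketch below. The field names here are hypothetical, assembled from the bullet list; they are not the library's actual type:

```typescript
// Hypothetical shape of a diagnostics payload; field names are
// illustrative, derived from the documented bullet list above.
interface SearchDiagnosticsSketch {
  query: string;
  limit: number;
  lexicalCandidates: number;
  vectorCandidates: number;
  fusedCandidates: number;
  returnedResults: number;
  embeddingsEnabled: boolean;
  timingsMs: { lexical: number; vector: number; fusion: number; total: number };
  lexicalCandidateMultiplier: number;
  vectorCandidateMultiplier: number;
}

const sample: SearchDiagnosticsSketch = {
  query: "car engine",
  limit: 3,
  lexicalCandidates: 12,
  vectorCandidates: 10,
  fusedCandidates: 15,
  returnedResults: 3,
  embeddingsEnabled: true,
  timingsMs: { lexical: 2, vector: 8, fusion: 1, total: 11 },
  lexicalCandidateMultiplier: 4,
  vectorCandidateMultiplier: 4,
};
```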

Retrieval modes

Lexical-only mode

If you do not provide openAIApiKey and do not inject a custom embeddingProvider, the library still works using SQLite FTS5 only.

This is useful when you want:

  • zero external API dependencies
  • predictable local behavior
  • a fallback mode in development or degraded operation

Hybrid mode

If an embedding provider is available, the library:

  1. fetches lexical candidates from SQLite FTS5
  2. fetches vector candidates from LanceDB
  3. fuses both sets with tuned defaults
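The three steps above can be sketched with reciprocal rank fusion (RRF), one common way to merge a lexical and a vector candidate list; magic-retrieval's actual fusion and its tuned defaults may differ:

```typescript
// Sketch of reciprocal rank fusion over two ranked candidate lists.
// Each list contributes 1 / (k + rank + 1) per document; documents
// appearing in both lists accumulate both contributions.
function rrfFuse(lexical: string[], vector: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const [rank, id] of lexical.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  for (const [rank, id] of vector.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

The appeal of rank-based fusion is that it needs no score normalization: FTS5 bm25 scores and vector distances live on different scales, but ranks are always comparable.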

Custom embedding providers

You can inject your own embedding backend for tests, offline deployments, or non-OpenAI production setups.

import type { EmbeddingProvider } from "magic-retrieval";

class MyEmbeddingProvider implements EmbeddingProvider {
  readonly modelName = "my-model-v1";

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map(() => [0.1, 0.2, 0.3]);
  }
}

const store = await createMagicRetrievalStore({
  dbPath: "./data/retrieval.sqlite",
  lancedbPath: "./data/lancedb",
  embeddingProvider: new MyEmbeddingProvider(),
});

Core API

Store creation

  • createMagicRetrievalStore(options)
  • new MagicRetrievalStore(options) + await store.init()

Store methods

  • store.upsertDocuments(documents, { replaceAll?, refreshVectors? })
  • store.refreshVectorIndex()
  • store.getDocument(documentId)
  • store.search(query, { limit?, filters? })
  • store.searchWithDiagnostics(query, { limit?, filters? })
  • store.lexicalSearch(query, { limit?, filters? })
  • store.vectorSearch(query, { limit?, filters? })
  • store.embedTexts(texts)
  • store.embeddingsEnabled()
  • store.close()

Utility exports

  • chunkText()
  • buildPreview()
  • buildMatchExpression()
  • tokenizeSearchQuery()
  • normalizeWhitespace()
  • truncateText()
  • evaluateQueryResults()
  • evaluateStrategy()
  • precisionAtK()
  • recallAtK()
  • mrrAtK()
  • ndcgAtK()
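For reference, two of these metrics written out from their standard definitions; the exported precisionAtK() and mrrAtK() may use different signatures than this sketch:

```typescript
// Precision@k: fraction of the top-k ranked ids that are relevant.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// MRR@k: reciprocal rank of the first relevant id within the top k,
// or 0 if none of the top k are relevant.
function mrrAtK(ranked: string[], relevant: Set<string>, k: number): number {
  for (let i = 0; i < Math.min(k, ranked.length); i++) {
    if (relevant.has(ranked[i])) return 1 / (i + 1);
  }
  return 0;
}
```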

Important options

MagicRetrievalStoreOptions includes:

  • dbPath: path to the SQLite file
  • lancedbPath: path to the LanceDB directory
  • tableName: LanceDB table name
  • openAIApiKey: OpenAI key for semantic retrieval
  • openAIBaseURL: custom OpenAI-compatible base URL
  • embeddingProvider: injected embedding provider
  • embeddingBatchSize
  • chunkSize
  • chunkOverlap
  • previewSize
  • bodyMaxChars
  • titleMaxChars
  • lexicalCandidateMultiplier
  • vectorCandidateMultiplier
  • ranking: partial override of the ranking config

Defaults are already tuned for a balanced hybrid setup, so you usually should not override ranking until you have corpus-specific evidence.

Filtering

Search filters support:

  • sourceType
  • sourceId
  • documentId
  • since
  • until
  • facets

Example:

const results = await store.search("incident review", {
  limit: 10,
  filters: {
    sourceType: "note",
    sourceId: "team-core",
    since: "2026-01-01T00:00:00.000Z",
    facets: {
      team: "core",
      visibility: "internal",
    },
  },
});

Index management

Replace the full corpus

await store.upsertDocuments(documents, { replaceAll: true });

Skip vector refresh during bulk ingest

await store.upsertDocuments(documents, {
  replaceAll: true,
  refreshVectors: false,
});

await store.refreshVectorIndex();

Result shape

Each search result includes:

  • document identity fields
  • title, url, preview
  • chunkId and chunkIndex
  • sortAt and updatedAt
  • normalized facets
  • original metadata
  • scores.lexical
  • scores.vector
  • scores.hybrid

Development and quality commands

npm run typecheck
npm test
npm run smoke
npm run eval
npm run tune
npm run bench
npm run build

Quality tooling

The package ships with a repeatable quality harness:

  • npm run eval evaluates retrieval quality on bundled corpora
  • npm run tune sweeps ranking profiles and candidate multipliers
  • npm run bench measures indexing and search latency

Included corpora currently cover:

  • generic knowledge retrieval
  • notion-style document retrieval
  • mail-style thread/message retrieval

This gives you a strong regression baseline and a practical way to tune defaults before integrating the library into larger systems.

Notes for production use

  • keep the SQLite file and LanceDB directory on persistent storage
  • re-use one store instance where possible instead of constantly recreating it
  • use replaceAll: false for incremental updates when you are not replacing the entire corpus
  • rely on diagnostics before changing ranking settings
  • if semantic search quality matters, benchmark with your real corpus rather than only synthetic examples
  • expect disk usage to grow noticeably on large corpora, since document text, indexes, cached embeddings, and vector data are all stored in the embedded local databases
  • data persists across process restarts as long as you keep the same SQLite file and LanceDB directory and do not delete or replace them

License

MIT