
@db0-ai/benchmark

v0.3.0

Memory quality benchmarks for db0. Measures retrieval accuracy, answer generation, and end-to-end memory system performance using published academic datasets.

Quick Start

# Recall + feature tests (no API key needed)
npm run bench

# LoCoMo benchmark (requires Gemini API key)
GEMINI_API_KEY=your-key npm run bench:locomo -- --embeddings gemini --samples 1 --queries 200

# LongMemEval benchmark (requires data download + API key)
bash packages/benchmark/scripts/fetch-longmemeval.sh
GEMINI_API_KEY=your-key npm run bench:longmemeval -- --embeddings gemini

Suites

Recall

15-query dataset testing basic memory retrieval: single-hop fact recall, temporal reasoning, multi-hop inference, and unanswerable detection.

npm run bench:recall

Features

db0-specific functional tests: memory superseding, scope isolation, relationship edges, noise filtering, and state branching.

npm run bench:features

LoCoMo

LoCoMo (Long-term Conversational Memory) from Snap Research. 10 conversation samples with ~1986 QA pairs across 5 categories:

| Category | Count | Description |
|---|---|---|
| Single-hop | 282 | Direct fact recall from one turn |
| Multi-hop temporal | 321 | Facts requiring date computation |
| Open-domain | 96 | Inference beyond explicit statements |
| Multi-session | 841 | Facts spread across sessions |
| Unanswerable | 446 | Entity-swap traps with no evidence |

GEMINI_API_KEY=your-key npm run bench:locomo -- --embeddings gemini

LongMemEval-s

LongMemEval (Wu et al., 2025) — a harder benchmark for conversational memory systems. Uses the cleaned variant (MIT license).

  • 500 questions across 6 categories testing 5 core memory abilities
  • ~115k tokens per question (~48 sessions)

| Category | Count | db0 feature tested |
|---|---|---|
| single-session-user | 70 | Chunk retrieval |
| single-session-assistant | 56 | Chunk retrieval |
| single-session-preference | 30 | Scoped memory |
| multi-session | 133 | Multi-session search |
| temporal-reasoning | 133 | Temporal metadata |
| knowledge-update | 78 | Superseding |

bash packages/benchmark/scripts/fetch-longmemeval.sh
GEMINI_API_KEY=your-key npm run bench:longmemeval -- --embeddings gemini

# Limit questions or filter by category
GEMINI_API_KEY=your-key npm run bench:longmemeval -- --embeddings gemini --queries 50
GEMINI_API_KEY=your-key npm run bench:longmemeval -- --embeddings gemini --types knowledge-update,temporal-reasoning

CLI Options

| Flag | Description | Default |
|---|---|---|
| --suite | all, recall, features, locomo, longmemeval | all |
| --embeddings | hash, gemini, openai | hash |
| --ingest | Ingestion mode (see below) | varies by suite |
| --profile | db0 profile name | conversational |
| --rerank | Enable Gemini LLM reranking | disabled |
| --enrich | Enable chunk enrichment (LLM resolves pronouns/references) | disabled |
| --enrich-mode | augment (metadata header) or rewrite (full rewrite) | augment |
| --expand | Enable query expansion (LLM generates reformulations, merged via RRF) | disabled |
| --latent-bridging | Enable latent semantic bridging (second embedding on inferred meaning) | disabled |
| --json | Output JSON report | disabled |
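The --expand flag merges LLM-generated query reformulations via Reciprocal Rank Fusion. As a rough sketch of how RRF combines multiple ranked result lists (illustrative only, not db0's actual implementation; `rrfMerge` and the constant k=60 are assumptions):

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
// to a document's fused score; the constant k damps the influence of
// top ranks so no single reformulation dominates the merge.
function rrfMerge(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score, highest first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// A document ranked well across several reformulations rises to the top:
const merged = rrfMerge([
  ["docA", "docB", "docC"],
  ["docB", "docA", "docD"],
  ["docB", "docC", "docA"],
]);
// merged[0] === "docB" (top-ranked in two of three lists)
```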

Ingestion Modes

| Mode | Description | Best for |
|---|---|---|
| turn | Each conversation turn as a separate memory | Fine-grained retrieval |
| session | Entire session as one memory | Large-context reasoning |
| chunk | Session split into overlapping windows | Balanced retrieval + context |
| extract | Rules-based fact extraction + chunks | Structured fact recall |
| llm-extract | LLM-based fact extraction + chunks | High-precision extraction |
| turn-context | Each turn with surrounding context window | QA benchmarks |
| dual | Sessions + individual turns | Broad + precise matching |
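As a sketch of what the chunk mode's overlapping windows look like (the window and overlap sizes and the `chunkSession` helper are illustrative, not db0's internals):

```typescript
// Split a session's turns into fixed-size windows that overlap, so a
// fact straddling a window boundary still appears whole in at least
// one chunk.
function chunkSession(turns: string[], windowSize = 4, overlap = 2): string[][] {
  const step = windowSize - overlap;
  const chunks: string[][] = [];
  for (let start = 0; start < turns.length; start += step) {
    chunks.push(turns.slice(start, start + windowSize));
    if (start + windowSize >= turns.length) break;
  }
  return chunks;
}

const chunks = chunkSession(["t1", "t2", "t3", "t4", "t5", "t6"]);
// windows: [t1..t4] and [t3..t6]; turns t3 and t4 appear in both
```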

Metrics

| Metric | Description |
|---|---|
| llm_judge | LLM scores answer correctness (binary 0/1) |
| token_f1 | Token-level F1 between generated and expected answer |
| retrieval_coverage | Whether expected answer text appears in any retrieved result |
| hit_rate@K | Binary: is the expected answer in the top-K results? |
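Token-level F1 is the harmonic mean of precision and recall over the tokens shared by the generated and expected answers. A minimal sketch, assuming lowercase whitespace tokenization (db0's actual normalization may differ):

```typescript
// Token F1: precision = overlap / |predicted|, recall = overlap / |expected|,
// F1 = 2PR / (P + R). Overlap is counted as a multiset intersection.
function tokenF1(predicted: string, expected: string): number {
  const pred = predicted.toLowerCase().split(/\s+/).filter(Boolean);
  const exp = expected.toLowerCase().split(/\s+/).filter(Boolean);
  if (pred.length === 0 || exp.length === 0) return 0;
  // Count expected tokens, then consume them as predicted tokens match
  const counts = new Map<string, number>();
  for (const t of exp) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of pred) {
    const c = counts.get(t) ?? 0;
    if (c > 0) {
      overlap++;
      counts.set(t, c - 1);
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / exp.length;
  return (2 * precision * recall) / (precision + recall);
}

// 2 shared tokens out of 5 predicted and 2 expected:
// P = 0.4, R = 1.0, F1 = 4/7 ≈ 0.571
const f1 = tokenF1("moved to Paris in 2021", "Paris 2021");
```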

Results

All results use Gemini embeddings (gemini-embedding-001) and Gemini Flash for LLM judging and answer generation.

LoCoMo

Session ingest, hybrid scoring, 1 sample (199 queries).

| System | LLM Judge |
|---|---|
| ByteRover | 92.2% |
| db0 | 76.9% |
| Mem0 | 66.9% |
| Zep | 21.3% |

| Category | n | LLM Judge |
|---|---|---|
| Multi-hop temporal | 37 | 86.5% |
| Unanswerable | 47 | 85.1% |
| Multi-session | 70 | 84.3% |
| Open-domain | 13 | 61.5% |
| Single-hop | 32 | 43.8% |

LongMemEval-s

Conversational profile, session ingest, hybrid scoring.

| System | LLM Judge |
|---|---|
| db0 | 80.0% |
| Zep | 71.2% |
| Full Context GPT-4o | 63.8% |
| Mem0 | 29% |

The db0 score is based on a 50-question sample with Gemini embeddings; the other systems' scores are as published in the LongMemEval paper.

Programmatic Usage

import { runBenchmark, Db0Adapter, LlmJudgeMetric, TokenF1Metric, loadLoCoMoDataset } from "@db0-ai/benchmark";
import { createSqliteBackend } from "@db0-ai/backends-sqlite";

// In-memory SQLite backend with hybrid scoring and session-level ingestion
const adapter = new Db0Adapter({
  createBackend: () => createSqliteBackend({ dbPath: ":memory:" }),
  embeddingFn: yourEmbeddingFn,
  scoring: "hybrid",
  ingestMode: "session",
});

// Run the first LoCoMo sample, capped at 20 queries
const report = await runBenchmark({
  adapter,
  dataset: loadLoCoMoDataset({ maxSamples: 1 }),
  metrics: [new LlmJudgeMetric({ judgeFn: yourJudgeFn }), new TokenF1Metric()],
  queryLimit: 20,
});

console.log(report.overall);
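In the example above, `yourEmbeddingFn` and `yourJudgeFn` are placeholders you supply. For wiring things up without an API key, a deterministic hash-based embedding stub works (in the spirit of the hash embeddings option; the `hashEmbedding` helper, hashing scheme, and dimensionality here are illustrative, not db0's implementation):

```typescript
// Deterministic toy embedding: hash each token into one of `dim`
// buckets, then L2-normalize. It carries no semantics and is only
// useful for smoke-testing the benchmark wiring end to end.
function hashEmbedding(text: string, dim = 64): number[] {
  const vec = new Array<number>(dim).fill(0);
  for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    // Simple 31-based rolling hash over the token's characters
    let h = 0;
    for (let i = 0; i < token.length; i++) {
      h = (h * 31 + token.charCodeAt(i)) >>> 0;
    }
    vec[h % dim] += 1;
  }
  const norm = Math.hypot(...vec) || 1;
  return vec.map((v) => v / norm);
}

const v = hashEmbedding("hello world");
// v has length 64 and unit L2 norm; identical text always maps to
// an identical vector, so runs are reproducible
```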

License

MIT