@kognitivedev/memory-bench
Benchmark harness for long-term memory systems.
@kognitivedev/memory-bench evaluates memory-backed agents and runtimes against normalized datasets and LongMemEval-style fixtures without coupling the benchmark flow to one database or one app.
Quick Start · Why This Package Exists · How It Works · Adapters · Storage · Outputs · Latest Results
Why This Package Exists
Memory benchmarks are usually the first place where architectural boundaries start to leak:
- benchmark code reaches directly into app tables
- scoring logic gets mixed with storage cleanup
- reports are hard to reproduce
- migration/setup steps become destructive
This package splits those concerns cleanly:
- benchmark core owns loading, execution, official evaluation, and report writing
- integration adapters own app-specific cleanup, ingestion, and answering hooks
- artifact persistence can run through the shared storage abstraction
What You Get
- dataset loaders for normalized JSON and LongMemEval-style fixtures
- official-style LongMemEval judging
- KognitiveMemoryBenchAdapter for memory runtimes built on @kognitivedev/memory
- Markdown, JSON, and JSONL report outputs
- optional benchmark artifact persistence through @kognitivedev/storage
Quick Start
Install the package:
bun add @kognitivedev/memory-bench
Run a benchmark:
import { runMemoryBenchmark } from "@kognitivedev/memory-bench";
const report = await runMemoryBenchmark({
projectId: "project-uuid",
dataset,
adapter,
consolidationMode: "before-question",
});
How It Works
The runner executes this flow for each case:
- reset case state
- ingest each session
- optionally consolidate after each session
- optionally consolidate before the final question
- answer the benchmark question
- run official evaluation
- write reports and benchmark artifacts
That keeps the evaluation loop stable while letting each system supply its own ingestion and retrieval behavior.
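The sketch below illustrates that per-case flow as an adapter experiences it. It is a conceptual outline only, not the package's internals: the interfaces and method names (BenchCase, BenchAdapterSketch, ingestSession, and so on) are illustrative assumptions rather than the published API.
interface BenchSession {
  id: string;
  turns: { role: "user" | "assistant"; content: string }[];
}
interface BenchCase {
  id: string;
  question: string;
  sessions: BenchSession[];
}
// Hypothetical adapter surface, for illustration only.
interface BenchAdapterSketch {
  resetCase(caseId: string): Promise<void>;
  ingestSession(caseId: string, session: BenchSession): Promise<void>;
  consolidate(caseId: string): Promise<void>;
  answer(caseId: string, question: string): Promise<string>;
}
type ConsolidationMode = "never" | "after-session" | "before-question";
async function runCaseSketch(
  benchCase: BenchCase,
  adapter: BenchAdapterSketch,
  mode: ConsolidationMode,
): Promise<{ caseId: string; prediction: string }> {
  // 1. reset case state
  await adapter.resetCase(benchCase.id);
  // 2. ingest each session, optionally consolidating after each one
  for (const session of benchCase.sessions) {
    await adapter.ingestSession(benchCase.id, session);
    if (mode === "after-session") {
      await adapter.consolidate(benchCase.id);
    }
  }
  // 3. optionally consolidate once more before the final question
  if (mode === "before-question") {
    await adapter.consolidate(benchCase.id);
  }
  // 4. answer the benchmark question; judging and report writing happen afterwards
  const prediction = await adapter.answer(benchCase.id, benchCase.question);
  return { caseId: benchCase.id, prediction };
}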
Kognitive Adapter
KognitiveMemoryBenchAdapter is the reusable adapter for Kognitive-style runtimes.
It expects an injected runtime:
import { KognitiveMemoryBenchAdapter } from "@kognitivedev/memory-bench";
const adapter = new KognitiveMemoryBenchAdapter({
runtime: {
processMemoryJob: (userId, projectId, sessionId) =>
memoryService.processMemoryJob(userId, projectId, sessionId),
logConversation: (log) => memoryService.logConversation(log),
getSnapshot: (userId, projectId) =>
memoryService.getSnapshot(userId, projectId),
},
hooks: {
resetCase: async ({ projectId, userId, caseId }) => {
// clear memory and session artifacts for this case
},
persistSession: async ({ projectId, userId, caseId, session }) => {
// optional session persistence for consolidation inputs
},
consolidate: async ({ projectId, userId, caseId }) => {
// optional cross-session consolidation
},
extractTopicMemories: async ({ projectId, userId, caseId, session }) => {
// optional topic-memory extraction
},
buildTopicContext: async ({ projectId, userId, caseId, question }) => {
return "";
},
},
});
This is deliberate. The benchmark package owns the flow. The app owns environment-specific hooks.
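As a usage note, the adapter constructed above plugs straight into the Quick Start call; the project id and consolidation mode here are placeholders, and dataset is whatever one of the package's loaders produced:
import { runMemoryBenchmark } from "@kognitivedev/memory-bench";
// `adapter` is the KognitiveMemoryBenchAdapter instance from the example above.
const report = await runMemoryBenchmark({
  projectId: "project-uuid",
  dataset,
  adapter,
  consolidationMode: "after-session",
});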
Artifact Storage
The runner can persist benchmark artifacts through @kognitivedev/storage.
import { InMemoryStorageBackend } from "@kognitivedev/storage";
import { runMemoryBenchmark } from "@kognitivedev/memory-bench";
const artifactStorage = new InMemoryStorageBackend();
const report = await runMemoryBenchmark({
projectId: "project-uuid",
dataset,
adapter,
artifactStorage,
artifactRunId: "demo-run",
});
Collections written by the runner:
- memory_bench_runs
- memory_bench_case_results
This is the right benchmark boundary: storage-backed persistence without binding the benchmark core to one database implementation.
Outputs
Each benchmark run writes:
- report.json for the full structured report
- predictions.jsonl for official evaluator input
- official-evaluation.json when official evaluation is enabled
- README.md as a human-readable summary
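A run's structured report can be inspected directly after the fact. This is a minimal sketch that only assumes report.json is valid JSON and does not assume any particular field names; the path points at the checked-in sample run referenced at the end of this README:
import { readFile } from "node:fs/promises";
const reportPath = "benchmarks/memory/reports/latest-longmemeval-sample/report.json";
const report = JSON.parse(await readFile(reportPath, "utf8"));
// Print the top-level keys to see what the report contains.
console.log(Object.keys(report));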
Running From This Repo
Backend composition currently lives in apps/backend/scripts/run-memory-benchmark.ts.
From apps/backend:
bun run db:migrate
bun run benchmark:memory -- \
--project <project-id-or-slug> \
--dataset longmemeval \
--input ../../benchmarks/memory/fixtures/longmemeval-sample.jsonUseful flags:
Useful flags:
- --fast forces consolidation=never
- --consolidation never|after-session|before-question
- --concurrency <n>
- --judge-concurrency <n>
- --limit <n>
- --judge-model <model-id>
- --model <model-id>
- --output <dir>
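For example, a small smoke run that exercises a few of these flags might look like this; the project slug, case limit, and output directory are placeholders:
bun run benchmark:memory -- \
  --project demo-project \
  --dataset longmemeval \
  --input ../../benchmarks/memory/fixtures/longmemeval-sample.json \
  --consolidation before-question \
  --limit 2 \
  --output ../../benchmarks/memory/reports/smoke-run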
Publishable Results Guidance
- official-evaluation.json is the primary benchmark artifact.
- predictions.jsonl is the handoff format for external reproduction.
- Official evaluation requires the configured judge model through OpenRouter.
Safe Setup
When you are running benchmarks against a real local database, prefer migrations:
bun run db:migrate
Do not use db:push as the default benchmark setup path when preserving existing data matters.
Latest Results
Latest checked-in smoke run:
| Dataset | Adapter | Model | Consolidation | Cases | Official Accuracy | Avg Latency |
| -------------------- | ------------------ | -------------------- | ----------------- | ----: | ----------------: | ----------: |
| longmemeval-sample | kognitive-direct | x-ai/grok-4.1-fast | before-question | 2 | 1.000 | 438.0 ms |
Report:
- benchmarks/memory/reports/latest-longmemeval-sample/README.md
- benchmarks/memory/reports/latest-longmemeval-sample/report.json
