@kognitivedev/memory-bench
Benchmark harness for long-term memory systems.
@kognitivedev/memory-bench evaluates memory-backed agents and runtimes against normalized datasets and LongMemEval-style fixtures without coupling the benchmark flow to one database or one app.
Quick Start · Why This Package Exists · How It Works · Adapters · Storage · Outputs · Latest Results
Why This Package Exists
Memory benchmarks are usually the first place where architectural boundaries start to leak:
- benchmark code reaches directly into app tables
- scoring logic gets mixed with storage cleanup
- reports are hard to reproduce
- migration/setup steps become destructive
This package splits those concerns cleanly:
- benchmark core owns loading, execution, official evaluation, and report writing
- integration adapters own app-specific cleanup, ingestion, and answering hooks
- artifact persistence can run through the shared storage abstraction
What You Get
- dataset loaders for normalized JSON and LongMemEval-style fixtures
- official-style LongMemEval judging
- KognitiveMemoryBenchAdapter for memory runtimes built on @kognitivedev/memory
- Markdown, JSON, and JSONL report outputs
- optional benchmark artifact persistence through @kognitivedev/storage
Quick Start
Install the package:
bun add @kognitivedev/memory-bench
Run a benchmark:
import { runMemoryBenchmark } from "@kognitivedev/memory-bench";
const report = await runMemoryBenchmark({
projectId: "project-uuid",
dataset,
adapter,
consolidationMode: "before-question",
});
How It Works
The runner executes this flow for each case:
- reset case state
- ingest each session
- optionally consolidate after each session
- optionally consolidate before the final question
- answer the benchmark question
- run official evaluation
- write reports and benchmark artifacts
That keeps the evaluation loop stable while letting each system supply its own ingestion and retrieval behavior.
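The sketch below illustrates that per-case flow as an adapter experiences it. It is a conceptual outline only, not the package's internals: the interfaces and method names (BenchCase, BenchAdapterSketch, ingestSession, and so on) are illustrative assumptions rather than the published API.
interface BenchSession {
  id: string;
  turns: { role: "user" | "assistant"; content: string }[];
}
interface BenchCase {
  id: string;
  question: string;
  sessions: BenchSession[];
}
// Hypothetical adapter surface, for illustration only.
interface BenchAdapterSketch {
  resetCase(caseId: string): Promise<void>;
  ingestSession(caseId: string, session: BenchSession): Promise<void>;
  consolidate(caseId: string): Promise<void>;
  answer(caseId: string, question: string): Promise<string>;
}
type ConsolidationMode = "never" | "after-session" | "before-question";
async function runCaseSketch(
  benchCase: BenchCase,
  adapter: BenchAdapterSketch,
  mode: ConsolidationMode,
): Promise<{ caseId: string; prediction: string }> {
  // 1. reset case state
  await adapter.resetCase(benchCase.id);
  // 2. ingest each session, optionally consolidating after each one
  for (const session of benchCase.sessions) {
    await adapter.ingestSession(benchCase.id, session);
    if (mode === "after-session") {
      await adapter.consolidate(benchCase.id);
    }
  }
  // 3. optionally consolidate once more before the final question
  if (mode === "before-question") {
    await adapter.consolidate(benchCase.id);
  }
  // 4. answer the benchmark question; judging and report writing happen afterwards
  const prediction = await adapter.answer(benchCase.id, benchCase.question);
  return { caseId: benchCase.id, prediction };
}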
Kognitive Adapter
KognitiveMemoryBenchAdapter is the reusable adapter for Kognitive-style runtimes.
It expects an injected runtime:
import { KognitiveMemoryBenchAdapter } from "@kognitivedev/memory-bench";
const adapter = new KognitiveMemoryBenchAdapter({
runtime: {
processMemoryJob: (userId, projectId, sessionId) =>
memoryService.processMemoryJob(userId, projectId, sessionId),
logConversation: (log) => memoryService.logConversation(log),
getSnapshot: (userId, projectId) =>
memoryService.getSnapshot(userId, projectId),
},
hooks: {
resetCase: async ({ projectId, userId, caseId }) => {
// clear memory and session artifacts for this case
},
persistSession: async ({ projectId, userId, caseId, session }) => {
// optional session persistence for consolidation inputs
},
consolidate: async ({ projectId, userId, caseId }) => {
// optional cross-session consolidation
},
extractTopicMemories: async ({ projectId, userId, caseId, session }) => {
// optional topic-memory extraction
},
buildTopicContext: async ({ projectId, userId, caseId, question }) => {
return "";
},
},
});
This is deliberate. The benchmark package owns the flow. The app owns environment-specific hooks.
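As a usage note, the adapter constructed above plugs straight into the Quick Start call; the project id and consolidation mode here are placeholders, and dataset is whatever one of the package's loaders produced:
import { runMemoryBenchmark } from "@kognitivedev/memory-bench";
// `adapter` is the KognitiveMemoryBenchAdapter instance from the example above.
const report = await runMemoryBenchmark({
  projectId: "project-uuid",
  dataset,
  adapter,
  consolidationMode: "after-session",
});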
Artifact Storage
The runner can persist benchmark artifacts through @kognitivedev/storage.
import { InMemoryStorageBackend } from "@kognitivedev/storage";
import { runMemoryBenchmark } from "@kognitivedev/memory-bench";
const artifactStorage = new InMemoryStorageBackend();
const report = await runMemoryBenchmark({
projectId: "project-uuid",
dataset,
adapter,
artifactStorage,
artifactRunId: "demo-run",
});
Collections written by the runner:
- memory_bench_runs
- memory_bench_case_results
This is the right benchmark boundary: storage-backed persistence without binding the benchmark core to one database implementation.
Outputs
Each benchmark run writes:
- report.json for the full structured report
- predictions.jsonl for official evaluator input
- official-evaluation.json when official evaluation is enabled
- README.md as a human-readable summary
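A run's structured report can be inspected directly after the fact. This is a minimal sketch that only assumes report.json is valid JSON and does not assume any particular field names; the path points at the checked-in sample run referenced at the end of this README:
import { readFile } from "node:fs/promises";
const reportPath = "benchmarks/memory/reports/latest-longmemeval-sample/report.json";
const report = JSON.parse(await readFile(reportPath, "utf8"));
// Print the top-level keys to see what the report contains.
console.log(Object.keys(report));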
Running From This Repo
Backend composition currently lives in apps/backend/scripts/run-memory-benchmark.ts.
From apps/backend:
bun run db:migrate
bun run benchmark:memory -- \
--project <project-id-or-slug> \
--dataset longmemeval \
--input ../../benchmarks/memory/fixtures/longmemeval-sample.jsonUseful flags:
Useful flags:
- --fast forces consolidation=never
- --consolidation never|after-session|before-question
- --concurrency <n>
- --judge-concurrency <n>
- --limit <n>
- --judge-model <model-id>
- --model <model-id>
- --output <dir>
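For example, a small smoke run that exercises a few of these flags might look like this; the project slug, case limit, and output directory are placeholders:
bun run benchmark:memory -- \
  --project demo-project \
  --dataset longmemeval \
  --input ../../benchmarks/memory/fixtures/longmemeval-sample.json \
  --consolidation before-question \
  --limit 2 \
  --output ../../benchmarks/memory/reports/smoke-run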
Publishable Results Guidance
- official-evaluation.json is the primary benchmark artifact.
- predictions.jsonl is the handoff format for external reproduction.
- Official evaluation requires the configured judge model through OpenRouter.
Safe Setup
When you are running benchmarks against a real local database, prefer migrations:
bun run db:migrate
Do not use db:push as the default benchmark setup path when preserving existing data matters.
Latest Results
Latest checked-in smoke run:
| Dataset | Adapter | Model | Consolidation | Cases | Official Accuracy | Avg Latency |
| -------------------- | ------------------ | -------------------- | ----------------- | ----: | ----------------: | ----------: |
| longmemeval-sample | kognitive-direct | x-ai/grok-4.1-fast | before-question | 2 | 1.000 | 438.0 ms |
Report:
- benchmarks/memory/reports/latest-longmemeval-sample/README.md
- benchmarks/memory/reports/latest-longmemeval-sample/report.json
