memorybench
v1.0.0
MemoryBench
A pluggable benchmarking framework for evaluating memory and context systems.
Features
- 🔌 Interoperable: mix and match any provider with any benchmark
- 🧩 Bring your own benchmarks: plug in custom datasets and tasks
- ♻️ Checkpointed runs: resume from any pipeline stage (ingest → index → search → answer → evaluate)
- 🆚 Multi‑provider comparison: run the same benchmark across providers side‑by‑side
- 🧪 Judge‑agnostic: swap GPT‑4o, Claude, Gemini, etc. without code changes
- 📊 Structured reports: export run status, failures, and metrics for analysis
- 🖥️ Web UI: inspect runs, questions, and failures interactively in real time
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Benchmarks  │    │  Providers  │    │   Judges    │
│  (LoCoMo,   │    │ (Supermem,  │    │  (GPT-4o,   │
│ LongMem..)  │    │ Mem0, Zep)  │    │  Claude..)  │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       └──────────────────┼──────────────────┘
                          ▼
              ┌───────────────────────┐
              │      MemoryBench      │
              └───────────┬───────────┘
                          ▼
  ┌────────┬─────────┬────────┬──────────┬────────┐
  │ Ingest │ Indexing│ Search │  Answer  │Evaluate│
  └────────┴─────────┴────────┴──────────┴────────┘
Quick Start
bun install
cp .env.example .env.local # Add your API keys
bun run src/index.ts run -p supermemory -b locomo
Configuration
# Providers (at least one)
SUPERMEMORY_API_KEY=
MEM0_API_KEY=
ZEP_API_KEY=
# Judges (at least one)
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=
Commands
| Command | Description |
|---------|-------------|
| run | Full pipeline: ingest → index → search → answer → evaluate → report |
| compare | Run benchmark across multiple providers simultaneously |
| ingest | Ingest benchmark data into provider |
| search | Run search phase only |
| test | Test single question |
| status | Check run progress |
| list-questions | Browse benchmark questions |
| show-failures | Debug failed questions |
| serve | Start web UI |
| help | Show help (help providers, help models, help benchmarks) |
Options
-p, --provider          Memory provider (supermemory, mem0, zep)
-b, --benchmark         Benchmark (locomo, longmemeval, convomem)
-j, --judge             Judge model (gpt-4o, sonnet-4, gemini-2.5-flash, etc.)
-r, --run-id            Run identifier (auto-generated if omitted)
-m, --answering-model   Model for answer generation (default: gpt-4o)
-l, --limit             Limit number of questions
-q, --question-id       Specific question (for test command)
    --force             Clear checkpoint and restart
Examples
# Full run
bun run src/index.ts run -p mem0 -b locomo
# With custom run ID
bun run src/index.ts run -p mem0 -b locomo -r my-test
# Resume existing run
bun run src/index.ts run -r my-test
# Limited questions
bun run src/index.ts run -p supermemory -b locomo -l 10
# Different models
bun run src/index.ts run -p zep -b longmemeval -j sonnet-4 -m gemini-2.5-flash
# Compare multiple providers
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -s 5
# Test single question
bun run src/index.ts test -r my-test -q question_42
# Debug
bun run src/index.ts status -r my-test
bun run src/index.ts show-failures -r my-test
Pipeline
1. INGEST    Load benchmark sessions → Push to provider
2. INDEX     Wait for provider indexing
3. SEARCH    Query provider → Retrieve context
4. ANSWER    Build prompt → Generate answer via LLM
5. EVALUATE  Compare to ground truth → Score via judge
6. REPORT    Aggregate scores → Output accuracy + latency
Each phase checkpoints independently. A failed run resumes from the last successful point.
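The per-phase checkpointing described above can be sketched as a small runner. This is a minimal illustration under stated assumptions, not MemoryBench's actual implementation: the `Checkpoint` shape, `runPipeline`, and the synchronous phase handlers are hypothetical names introduced here (real phases would most likely be asynchronous and persist state to disk).

```typescript
// Hypothetical sketch: phase names follow the pipeline list above;
// Checkpoint, PhaseHandlers, and runPipeline are illustrative only.
const PHASES = ["ingest", "index", "search", "answer", "evaluate", "report"] as const;
type Phase = (typeof PHASES)[number];

interface Checkpoint {
  completed: Phase[]; // phases that finished successfully, in order
}

type PhaseHandlers = Record<Phase, () => void>;

// Runs every phase that is not yet recorded in the checkpoint.
// A phase is recorded only after it succeeds, so a failure leaves the
// checkpoint at the last successful phase and a later call resumes there.
function runPipeline(checkpoint: Checkpoint, handlers: PhaseHandlers): Checkpoint {
  let current = checkpoint;
  for (const phase of PHASES) {
    if (current.completed.indexOf(phase) !== -1) continue; // done in an earlier run
    try {
      handlers[phase]();
    } catch {
      return current; // stop here; earlier progress is preserved
    }
    current = { completed: current.completed.concat(phase) };
  }
  return current;
}
```

Calling `runPipeline` again with the returned checkpoint skips the phases already completed, which mirrors resuming a run by its ID.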
Checkpointing
Runs persist to data/runs/{runId}/:
- checkpoint.json: run state and progress
- results/: search results per question
- report.json: final report
Re-running with the same run ID resumes from the checkpoint. Use --force to clear the checkpoint and restart.
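To make the resume rule concrete, the sketch below derives the run directory paths listed above and decides whether a run starts fresh, resumes, or restarts. `runPaths`, `startMode`, and their parameters are hypothetical names for illustration. Only the `data/runs/{runId}/` layout and the `--force` behaviour come from this README.

```typescript
// Hypothetical helpers; only the directory layout and the --force
// semantics are taken from the README.
interface RunPaths {
  dir: string;
  checkpoint: string; // run state and progress
  results: string;    // search results per question
  report: string;     // final report
}

function runPaths(runId: string): RunPaths {
  const dir = `data/runs/${runId}`;
  return {
    dir,
    checkpoint: `${dir}/checkpoint.json`,
    results: `${dir}/results/`,
    report: `${dir}/report.json`,
  };
}

type StartMode = "fresh" | "resume" | "restart";

// checkpointExists: whether checkpoint.json is already present for this run ID.
// force: whether --force was passed on the command line.
function startMode(checkpointExists: boolean, force: boolean): StartMode {
  if (!checkpointExists) return "fresh"; // nothing to resume from
  return force ? "restart" : "resume";   // --force clears the checkpoint first
}
```

Keeping the decision in a pure function like this makes the resume behaviour easy to test without touching the filesystem.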
Extending
| Component | Guide |
|-----------|-------|
| Add Provider | src/providers/README.md |
| Add Benchmark | src/benchmarks/README.md |
| Add Judge | src/judges/README.md |
| Project Structure | src/README.md |
License
MIT
