agent-memory-benchmark

v2.0.2

The definitive benchmark for agent memory systems. 56 tests across 8 categories.

Agent Memory Benchmark (AMB)

The definitive benchmark for agent memory systems. Two evaluation layers, 56+ tests, provider-agnostic.

Why another benchmark? LoCoMo tests synthetic conversational recall. LongMemEval tests long-context extraction. AMB tests what agents actually need: can the memory system store, retrieve, scope, forget, and maintain consistency across the operations that real agents perform?

Quick Start

# Run against Central Intelligence
npx agent-memory-benchmark --provider central-intelligence --api-key $CI_API_KEY

# Run against Mem0
npx agent-memory-benchmark --provider mem0 --api-key $MEM0_API_KEY

# Run the in-memory baseline
npx agent-memory-benchmark --provider in-memory

# Run against any MCP memory server
npx agent-memory-benchmark --provider mcp --mcp-command "npx your-memory-server"

Two Evaluation Layers

Layer 1: Single-Operation Retrieval (56 tests, 8 categories)

| # | Category | Tests | Weight | What It Measures |
|---|---|---|---|---|
| 1 | Factual Recall | 8 | 15% | Store a fact, retrieve it with a direct query |
| 2 | Semantic Search | 8 | 20% | Retrieve using paraphrased/conceptual queries |
| 3 | Temporal Reasoning | 7 | 15% | Handle "before/after" and "latest" queries |
| 4 | Conflict Resolution | 7 | 10% | When facts contradict, latest should win |
| 5 | Selective Forgetting | 6 | 10% | Deleted memories must not resurface |
| 6 | Cross-Session | 7 | 15% | Context carries over across sessions |
| 7 | Multi-Agent | 6 | 5% | Agent A stores, Agent B retrieves |
| 8 | Cost Efficiency | 7 | 10% | Latency, tokens, API calls per operation |
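Per the Scoring section, each Layer 1 query is graded by deterministic keyword matching against the retrieved results. A minimal sketch of that idea, assuming a hypothetical test-case shape (the field names and `passes` helper are illustrative, not the benchmark's internal API):

```typescript
// Hypothetical shape of a Layer 1 test case (illustrative only).
interface Layer1Test {
  category: string;
  store: string;              // content stored before the query
  query: string;              // retrieval query issued afterwards
  expectedKeywords: string[]; // pass iff every keyword appears in a result
}

// Deterministic pass/fail: do the returned memories contain all keywords?
function passes(test: Layer1Test, results: string[]): boolean {
  const haystack = results.join(" ").toLowerCase();
  return test.expectedKeywords.every((k) => haystack.includes(k.toLowerCase()));
}

const test: Layer1Test = {
  category: "Factual Recall",
  store: "The deploy target is us-east-1.",
  query: "Which region do we deploy to?",
  expectedKeywords: ["us-east-1"],
};
```

This is what "no LLM-as-judge" means in practice: a provider passes or fails on literal retrieval, with nothing fuzzy in between.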

Layer 2: Multi-Step Retrieval (5 scenarios)

| Scenario | What It Tests |
|---|---|
| Preference Application | Retrieve multiple stored preferences to assemble a complete configuration |
| Context Continuity | Retrieve related context from multiple simulated prior sessions |
| Conflict Resolution (Multi-Step) | Handle chains of superseding facts |
| Cross-Agent Handoff | Agent B retrieves context stored by Agent A |
| Redundancy Check | Verify stored facts remain retrievable without re-storing |
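To make "multi-step" concrete, here is an illustrative sketch of the Preference Application scenario: several independently stored preferences must all be retrieved and merged into one configuration. The `Search` type and toy memory below are stand-ins, not the benchmark's actual fixtures:

```typescript
// Stand-in for a memory provider's search call (illustrative only).
type Search = (query: string) => string[];

// Assemble a configuration from multiple independent retrievals --
// the scenario passes only if every preference round-trips.
function assembleConfig(search: Search): Record<string, string> {
  const config: Record<string, string> = {};
  for (const key of ["editor", "theme", "indent"]) {
    const hits = search(`preferred ${key}`);
    if (hits.length > 0) config[key] = hits[0];
  }
  return config;
}

// A toy memory backing the search function.
const memory: Record<string, string> = {
  editor: "vim",
  theme: "dark",
  indent: "2 spaces",
};
const search: Search = (q) => {
  const key = Object.keys(memory).find((k) => q.includes(k));
  return key ? [memory[key]] : [];
};
```

A single dropped retrieval means an incomplete configuration, which is why providers that score well on single-operation recall can still fail here.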

Scores are reported separately (a Layer 1 score and a Layer 2 score); the Layer 1 score is backward-compatible with v1.0.

Scores

Layer 1

Default (3s store delay)

| Provider | Overall | Factual | Semantic | Temporal | Conflict | Forgetting | Cross-Session | Multi-Agent | Cost |
|---|---|---|---|---|---|---|---|---|---|
| Central Intelligence | 90 | 100 | 100 | 86 | 86 | 83 | 86 | 67 | 94 |
| In-Memory Baseline | 55 | 100 | 0 | 43 | 86 | 83 | 57 | 50 | 56 |
| Zep | 11 | 0 | 0 | 14 | 0 | 67 | 0 | — | 19 |
| Mem0 | 7 | 0 | 0 | 14 | 0 | 50 | 0 | 17 | 25 |

Extended delay (10s, --store-delay 10) — async providers at their best

| Provider | Overall | Factual | Semantic | Temporal | Conflict | Forgetting | Cross-Session | Multi-Agent | Cost |
|---|---|---|---|---|---|---|---|---|---|
| Zep | 39 | 75 | 63 | 29 | 0 | 67 | 0 | — | 19 |
| Mem0 | 54 | 100 | 100 | 29 | 29 | 0 | 43 | 17 | 44 |

Note on Zep and Mem0: Both use LLM-based async fact extraction. At the default 3s delay, most memories aren't indexed yet. The 10s table shows scores after allowing more indexing time. For production workloads where memories are stored ahead of retrieval, the 10s scores are more representative.

Layer 2

| Provider | Overall | Preference | Continuity | Conflict | Handoff | Redundancy |
|---|---|---|---|---|---|---|
| Central Intelligence | 60 | FAIL | FAIL | PASS | PASS | PASS |
| In-Memory Baseline | 20 | FAIL | FAIL | FAIL | FAIL | PASS |
| Zep | 0 | FAIL | FAIL | FAIL | FAIL | FAIL |
| Mem0 | 0 | FAIL | FAIL | FAIL | FAIL | FAIL |

Run npx agent-memory-benchmark --provider <name> to add your provider's scores.

CLI Options

--provider <name>         Provider: central-intelligence | mem0 | in-memory | hindsight | zep | mcp
--api-key <key>           API key (or set AMB_API_KEY env var)
--api-url <url>           API base URL override
--store-delay <sec>       Seconds to wait after each store before querying (default: 3)
--categories <list>       Comma-separated category IDs (default: all)
--output <dir>            Output directory (default: ./amb-results)
--verbose                 Show detailed per-query output
--layer <1|2|all>         Which layer to run (default: all)
--no-delay                Skip inter-test delays (for local/in-memory adapters)
--fixtures-dir <dir>      Fixtures directory for Layer 2 scenarios

MCP-specific:
--mcp-command <cmd>       MCP server command (required for --provider mcp)
--mcp-store-tool <name>   Override MCP store tool name
--mcp-search-tool <name>  Override MCP search tool name
--mcp-delete-tool <name>  Override MCP delete tool name

Output

AMB generates files in ./amb-results/:

  • results.json -- Machine-readable results with Layer 1 + Layer 2 scores
  • report.md -- Human-readable report with tables and failure details
  • badge.svg -- Embeddable Layer 1 score badge
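If you want to gate CI on `results.json` yourself rather than relying on the CLI's exit code, a small check can mirror the same rule (Layer 1 score >= 70 passes). The `AmbResults` interface below is a guessed shape, not the documented schema; check your generated `./amb-results/results.json` for the real field names:

```typescript
// Hypothetical results.json shape -- the real schema may differ.
interface AmbResults {
  provider: string;
  layer1: { overall: number };
  layer2: { overall: number };
}

// Mirror the CLI's exit-code rule: pass iff Layer 1 overall >= threshold.
function ciGate(results: AmbResults, threshold = 70): boolean {
  return results.layer1.overall >= threshold;
}
```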

Adding Your Provider

Implement the MemoryAdapter interface:

import { MemoryAdapter, MemoryEntry, StoreOptions, SearchOptions } from "agent-memory-benchmark";

class MyAdapter implements MemoryAdapter {
  name = "My Memory Provider";
  capabilities = { multiAgent: true, scoping: true, temporalDecay: false };

  async initialize(): Promise<void> { /* connect */ }
  async store(content: string, options?: StoreOptions): Promise<MemoryEntry> { /* store */ }
  async search(query: string, options?: SearchOptions): Promise<MemoryEntry[]> { /* search */ }
  async delete(id: string): Promise<boolean> { /* delete */ }
  async cleanup(): Promise<void> { /* cleanup test data */ }
}

Or use the MCP adapter for any MCP-compatible memory server:

npx agent-memory-benchmark --provider mcp --mcp-command "npx your-memory-server"

GitHub Action

Add AMB to your CI:

- uses: AlekseiMarchenko/agent-memory-benchmark/.github/actions/amb@v2
  with:
    provider: your-provider
    api-key: ${{ secrets.PROVIDER_API_KEY }}

Scoring

  • Layer 1: Per-query binary pass/fail based on expected keywords. Per-category: (passed / total) * 100. Overall: weighted average.
  • Layer 2: Per-scenario binary pass/fail. Score: (passed / total) * 100.
  • Scores are separate: Layer 1 and Layer 2 are independent metrics, not blended.
  • Exit code: 0 if Layer 1 score >= 70, 1 otherwise.
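The Layer 1 aggregation can be reproduced directly from the category table above. A sketch, assuming per-category pass counts as input (the function and key names are illustrative):

```typescript
// Category weights from the Layer 1 table (sum to 100%).
const weights: Record<string, number> = {
  factual: 0.15, semantic: 0.20, temporal: 0.15, conflict: 0.10,
  forgetting: 0.10, crossSession: 0.15, multiAgent: 0.05, cost: 0.10,
};

// Per-category score: (passed / total) * 100; overall: weighted average.
function overallScore(
  perCategory: Record<string, { passed: number; total: number }>,
): number {
  let score = 0;
  for (const [cat, { passed, total }] of Object.entries(perCategory)) {
    score += (weights[cat] ?? 0) * (passed / total) * 100;
  }
  return Math.round(score);
}
```

Because the weights sum to 100%, a provider that passes every test in every category scores exactly 100, and failing all 8 Factual Recall tests costs exactly its 15% weight.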

Philosophy

  1. Real-world scenarios -- Every test maps to an actual agent workflow
  2. Provider-agnostic -- Same tests, fair comparison
  3. Deterministic scoring -- No LLM-as-judge, no embedding similarity
  4. Two layers -- Single-operation retrieval + multi-step retrieval scenarios
  5. Open source -- MIT licensed. Add your provider, submit PRs

Contributing

  1. Fork the repo
  2. Add your adapter in contrib/ or src/adapters/
  3. Run the benchmark and include results
  4. Submit a PR

See CONTRIBUTING.md for details.

License

MIT -- Aleksei Marchenko