@provos/memory-mcp-server

v0.1.3

Published

3 months ago

Persistent memory MCP server with semantic search, LLM summarization, and automatic compaction

0High
0Medium
0Low

Memory MCP Server

A persistent memory server for LLM agents using the Model Context Protocol. Provides semantic search, optional LLM summarization, and automatic maintenance — all backed by a single SQLite file.

Why Another Memory Server?

Most memory MCP servers fall into two camps: minimal and infrastructure-heavy. The official @modelcontextprotocol/server-memory stores a knowledge graph in a JSON file with substring search only — useful, but limited as memories grow. Mem0, Hindsight, and doobidoo offer semantic search and smart features, but require Python runtimes, Docker containers, or external databases (PostgreSQL, Qdrant, Neo4j).

This server fills a specific gap: a TypeScript MCP server with semantic search, LLM summarization, and automatic maintenance that runs from a single SQLite file with zero external dependencies.

Single command, single file — npx to start, one .db file to back up. SQLite + sqlite-vec + FTS5 handle storage, vector search, and keyword search in-process.
Works without an LLM, improves with one — Extractive retrieval, cosine-only dedup, and bullet-point formatting work out of the box. Adding a cheap LLM endpoint (Haiku, Ollama, etc.) enables abstractive summarization, direct question answering, contradiction detection, and smarter consolidation.
Token-budget-aware responses — Instead of dumping raw memories into your context window, the server summarizes and packs results to fit a specified token budget. The memory_context tool provides a structured session-start briefing — a capability unique to this server.

For a detailed competitive analysis covering 11 memory systems, see docs/designs/memory-server-comparison.md.

Features

Score-based hybrid fusion — vector similarity (BGE-base) + FTS5 BM25 keyword search, merged via Weaviate-style relativeScoreFusion with min-max normalized scores blended by alpha weighting. Unlike pure rank-based fusion (used by Letta, Zep, mind-mem, LangChain), score-based fusion retains the discriminating power of both retrieval signals — an approach validated by production search engines (Weaviate, Elasticsearch, Qdrant) and research (Bruch et al. 2023)
Cross-encoder reranking — ms-marco-MiniLM re-ranks candidates using raw logits with relative gap filtering, improving precision without aggressive cutoffs
Composite scoring — fusion relevance (incorporating vector + BM25 magnitudes) blended with recency, importance, and access pattern signals
Token-budget-aware retrieval — returns pre-summarized context blocks sized to fit your context window
Direct question answering — format="answer" synthesizes across retrieved memories to answer questions directly, without a separate reader LLM
SQLite-native, zero external dependencies — embeddings and reranking run locally in-process; no Postgres, Neo4j, Redis, or Docker required. One .db file to back up
Works without an LLM, improves with one — extractive retrieval and bullet-point formatting work out of the box; adding a cheap LLM (Haiku, Ollama) enables abstractive summarization, direct answers, and contradiction detection
Automatic maintenance — unused memories decay over time; related memories compact into summaries via three-phase pipeline (consolidation, decay, compaction)
Namespace isolation — multiple agents or projects share one database without cross-contamination

Quick Start

Configure in Claude Desktop / MCP client

{
  "mcpServers": {
    "memory": {
      "command": "npx",
      "args": ["-y", "@provos/memory-mcp-server"],
      "env": {
        "MEMORY_DB_PATH": "~/.local/share/memory-mcp/default.db"
      }
    }
  }
}

With an LLM (recommended)

Adding an LLM enables summarization, direct question answering (format="answer"), and contradiction detection. The LLM client uses the OpenAI SDK, so MEMORY_LLM_BASE_URL must point to an OpenAI-compatible endpoint.

{
  "mcpServers": {
    "memory": {
      "command": "npx",
      "args": ["-y", "@provos/memory-mcp-server"],
      "env": {
        "MEMORY_DB_PATH": "~/.local/share/memory-mcp/default.db",
        "MEMORY_LLM_API_KEY": "your-api-key",
        "MEMORY_LLM_BASE_URL": "https://openrouter.ai/api/v1",
        "MEMORY_LLM_MODEL": "anthropic/claude-haiku-4-5-20251001"
      }
    }
  }
}

With Ollama (fully local)

{
  "mcpServers": {
    "memory": {
      "command": "npx",
      "args": ["-y", "@provos/memory-mcp-server"],
      "env": {
        "MEMORY_DB_PATH": "~/.local/share/memory-mcp/default.db",
        "MEMORY_LLM_BASE_URL": "http://localhost:11434/v1",
        "MEMORY_LLM_MODEL": "llama3.2:3b"
      }
    }
  }
}

Tools

memory_store

Store a memory for later retrieval. Memories are automatically embedded for semantic search.

content    (string, required)   — The memory content. Store one fact per call.
tags       (string[], optional) — For filtering, e.g. ["preference", "project:foo"]
importance (number, optional)   — 0-1 scale, controls decay resistance. Default 0.5.

memory_recall

Retrieve memories relevant to a query. Returns formatted results within a token budget.

query        (string, required)   — Natural language query
token_budget (integer, optional)  — Max tokens in response. Default 500.
tags         (string[], optional) — Filter to memories with ALL these tags
format       (string, optional)   — "summary" (default), "list", "raw", or "answer"

Format modes:

summary — LLM-generated briefing (falls back to extractive clusters without LLM)
list — Bullet list with dates and importance scores
raw — Full JSON with all metadata
answer — LLM answers the query directly by synthesizing across retrieved memories (falls back to list without LLM)

memory_context

Get a session briefing of relevant memories. Call at the start of each conversation.

task         (string, optional)   — Brief description of the current task
token_budget (integer, optional)  — Max tokens for briefing. Default 800.

memory_forget

Remove memories by ID, tag, query match, or timestamp.

ids     (string[], optional)  — Specific memory IDs to forget
tags    (string[], optional)  — Forget all memories with ALL these tags
query   (string, optional)    — Forget top-10 matches
before  (string, optional)    — ISO 8601 timestamp cutoff
confirm (boolean, optional)   — Required for query-based or bulk deletion
dry_run (boolean, optional)   — Preview what would be deleted. Default false.

memory_inspect

View memory statistics or inspect specific memories.

view  (string, optional)    — "stats" (default), "recent", "important", "tags", "export"
ids   (string[], optional)  — Inspect specific memories by ID
limit (integer, optional)   — Max items returned. Default 20.

Architecture

┌──────────────────────────────────────────────────┐
│                  MCP Client                      │
│            (Claude, Cursor, etc.)                │
└──────────────┬───────────────────────────────────┘
               │ stdio (JSON-RPC)
┌──────────────▼───────────────────────────────────┐
│  MCP Server (server.ts)                          │
│  Tool handlers: store, recall, context,          │
│                 forget, inspect                  │
└──────────────┬───────────────────────────────────┘
               │
┌──────────────▼───────────────────────────────────┐
│  Engine (engine-impl.ts)                         │
│  Wires together all subsystems                   │
├──────────────────────────────────────────────────┤
│                                                  │
│  ┌─────────────┐  ┌──────────────────────────┐   │
│  │  Embedder   │  │   Retrieval Pipeline     │   │
│  │ BGE-base    │  │  Vector KNN + FTS5 BM25  │   │
│  │  (local)    │  │  → Score-based fusion     │   │
│  └─────────────┘  │  → Composite scoring     │   │
│  ┌─────────────┐  │  → Cross-encoder rerank  │   │
│  │  Reranker   │  │  → Dedup by embedding    │   │
│  │ ms-marco    │  │  → Token budget packing  │   │
│  │  (local)    │  │  → Format (answer/       │   │
│  └─────────────┘  │     summary/list/raw)    │   │
│  ┌─────────────┐  └──────────────────────────┘   │
│  │  LLM Client │                                 │
│  │  (optional) │  ┌──────────────────────────┐   │
│  │  Haiku etc. │  │   Maintenance            │   │
│  └─────────────┘  │  Phase 0: Consolidation  │   │
│                   │    (batch LLM dedup)     │   │
│                   │  Phase 1: Vitality decay │   │
│                   │  Phase 2: Compaction     │   │
│                   └──────────────────────────┘   │
├──────────────────────────────────────────────────┤
│  SQLite (WAL mode)                               │
│  ┌────────────┬──────────────┬────────────────┐  │
│  │  memories   │ vec_memories │  memories_fts  │  │
│  │  (rows)     │ (768-dim)   │  (BM25 index)  │  │
│  └────────────┴──────────────┴────────────────┘  │
└──────────────────────────────────────────────────┘

Store path

Embed content locally (~5ms)
Check for near-exact duplicates (cosine distance < 0.05) — auto-merge, no LLM
Insert with consolidated = false
Every N stores (default 50), run maintenance which includes batch consolidation

Retrieval pipeline

Candidate generation — parallel vector KNN (50 candidates) + FTS5 BM25 search (50 candidates, Porter stemming, bigram phrases)
Score-based fusion — Weaviate-style relativeScoreFusion: min-max normalize vector similarity and BM25 scores independently to [0,1], then blend with alpha weighting (default 0.5). Candidates from only one source get only that source's weighted contribution
Tag filter — optional intersection filter
Composite scoring — fusion relevance (0.65) + recency (0.15) + importance (0.1) + access patterns (0.1). The fusion score already encodes vector + BM25 magnitudes
Relevance gating — drop candidates with fusion score < 5% of max
Cross-encoder reranking — ms-marco-MiniLM-L-6-v2 re-scores candidates; relative gap filter (5 logit points from best)
Deduplication — remove near-duplicates by embedding cosine similarity
Token budget packing — greedily select memories by score until budget is filled (skip, not break)
Formatting — answer (LLM synthesis), summary (LLM briefing), list (bullets), or raw (JSON)

Maintenance (amortized, no background processes)

Consolidation — finds unconsolidated memories, groups close candidates, makes one batched LLM call to classify duplicates/contradictions/distinct. Also runs on server startup to handle leftovers from previous sessions.
Decay — samples memories and checks vitality (half-life proportional to importance, boosted by access frequency). Below-threshold memories decay.
Compaction — clusters decayed memories and summarizes them via LLM, preserving originals as soft-deleted references.

Graceful LLM degradation

Every LLM-enhanced feature has an extractive fallback:

| Feature | With LLM | Without LLM | |---------|----------|-------------| | Recall formatting | Abstractive summary or direct answer | Extractive bullet list | | Dedup on store | Exact-match heuristic only | Same (LLM dedup deferred to consolidation) | | Consolidation | Batch LLM judgment | Mark all as consolidated | | Compaction | LLM-generated summaries | Skip compaction | | Contradiction detection | LLM classification | Only exact-match merging |

Configuration

All settings are controlled via environment variables:

| Variable | Default | Description | |----------|---------|-------------| | MEMORY_DB_PATH | ~/.local/share/memory-mcp/default.db | SQLite database path | | MEMORY_NAMESPACE | default | Namespace for memory isolation | | MEMORY_EMBEDDING_MODEL | Xenova/bge-base-en-v1.5 | HuggingFace embedding model | | MEMORY_EMBEDDING_DTYPE | q8 | Model quantization (q8, fp16, fp32) | | MEMORY_RERANKER_ENABLED | true | Enable cross-encoder reranking | | MEMORY_RERANKER_MODEL | Xenova/ms-marco-MiniLM-L-6-v2 | HuggingFace reranker model | | MEMORY_LLM_API_KEY | (none) | API key for LLM (enables enhanced features) | | MEMORY_LLM_BASE_URL | (none) | OpenAI-compatible endpoint (must support /v1/chat/completions) | | MEMORY_LLM_MODEL | claude-haiku-4-5-20251001 | LLM model name | | MEMORY_DEFAULT_TOKEN_BUDGET | 500 | Default token budget for recall | | MEMORY_MAINTENANCE_INTERVAL | 50 | Stores between maintenance passes | | MEMORY_DECAY_THRESHOLD | 0.05 | Vitality below which memories decay | | MEMORY_COMPACTION_MIN_GROUP | 10 | Min decayed memories before compaction | | MEMORY_CONSOLIDATION_BATCH_SIZE | 50 | Max memories per consolidation pass |

Agent Integration

The package exports system prompts and tool descriptions for integrating the memory server with LLM agents:

import { MEMORY_SYSTEM_PROMPT } from '@provos/memory-mcp-server/prompts';

// Append to your agent's system prompt
const systemPrompt = basePrompt + '\n\n' + MEMORY_SYSTEM_PROMPT;

For custom setups, use the configurable builder:

import { buildMemorySystemPrompt } from '@provos/memory-mcp-server/prompts';

const prompt = buildMemorySystemPrompt({
  persona: 'Research Assistant',
  additionalInstructions: 'Always tag paper references with "paper:<title>".',
});

The TOOL_DESCRIPTIONS export provides enhanced descriptions used in tool registration:

import { TOOL_DESCRIPTIONS } from '@provos/memory-mcp-server/prompts';
// TOOL_DESCRIPTIONS.memory_store, .memory_recall, .memory_context, etc.

Development

npm install
npm run build          # TypeScript compilation
npm test               # Run unit tests
npm run test:e2e       # Run end-to-end tests (spawns real server)
npm run lint           # ESLint
npm run format         # Prettier
npm run benchmark      # Run retrieval quality benchmark

License

Apache-2.0