@kavinbmittal/lia-memory-engine
v1.0.0
Lia-style context engine for OpenClaw — structured compaction, auto-flush, auto-retrieval
Lia Memory Engine
Lia Memory Engine gives OpenClaw agents the kind of memory that actually works in practice — decisions made three sessions ago surface automatically, context doesn’t silently disappear when conversations get long, and nothing is ever lost.
Two Parts
1. Compaction Upgrade: Structured Memory
OpenClaw’s built-in compaction throws away a lot once the context gets full, so your agents sometimes run around like headless chickens. Lia’s Memory Engine replaces it with a structured approach:
- When the context window is genuinely near capacity (default 80%), the engine compresses the older half of messages into a summary that explicitly preserves decisions, commitments, open questions, Q&A pairs, and preferences
- Token usage is estimated from the live conversation snapshot passed by OpenClaw each turn, so the threshold check is always accurate and compaction only fires when the context is actually full
- This structured summarization is generated via Claude Haiku and is kept in context. You keep everything that matters
- The engine also introduces auto-flush: every message is written to disk immediately, giving you a full transcript to search against, so nothing is truly gone even after compaction
2. QMD Memory Retrieval
QMD is a retrieval framework built by Tobi Lütke. It isn’t a search box, it’s a full retrieval pipeline running entirely on-device:
- BM25 keyword search
- Vector semantic search
- LLM reranking

The impact of this in practice is significant: basic search finds the memory that matches your words; QMD finds the memory that matches your intent. A query like “auth decision” will surface the conversation where you chose Supabase Auth because of RLS, not just any file that mentions authentication.
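Conceptually, combining a keyword ranking and a semantic ranking can be done with something as simple as reciprocal rank fusion. The sketch below is illustrative only; it is not QMD’s actual fusion algorithm, and the constant `k = 60` is a conventional default, not a QMD setting:

```typescript
// Illustrative sketch of hybrid retrieval fusion (not QMD's actual code).
// Combines BM25 and vector rankings with reciprocal rank fusion (RRF);
// the fused top candidates would then go to an LLM reranker.

type Ranked = { id: string; rank: number };

// RRF: score(d) = sum over rankings of 1 / (k + rank(d))
function reciprocalRankFusion(rankings: Ranked[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    for (const { id, rank } of ranking) {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    }
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Doc "c" appears in both rankings, so fusion promotes it above
// documents that only one ranking found.
const bm25: Ranked[] = [{ id: "a", rank: 1 }, { id: "c", rank: 2 }];
const vec: Ranked[] = [{ id: "b", rank: 1 }, { id: "c", rank: 2 }];
const fused = reciprocalRankFusion([bm25, vec]);
```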
Combined, the two make a powerful upgrade: OpenClaw never forgets anything. This is the engine powering Lia, the world’s first AI Chief of Staff.
How it works
Key design principle: OpenClaw owns the conversation. The engine never stores or replaces OpenClaw’s messages. OpenClaw loads conversations from its JSONL session files and passes them to the engine on every turn. The engine reads those messages but never maintains its own copy — it uses a lightweight counter to track what’s been flushed to transcript.
Compaction via Haiku — when context genuinely reaches the threshold (default 80%), the engine takes OpenClaw’s messages, splits at the midpoint, summarizes the older half, and returns the compacted result. OpenClaw replaces its messages with the compacted version. Preserves Q&A pairs, decisions, commitments, open questions, preferences, and emotional context.
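A rough sketch of that flow, with an assumed message shape and a stand-in for the Haiku call (not the plugin’s real API):

```typescript
// Rough sketch of midpoint-split compaction (assumed types, not the real API).
type Msg = { role: "user" | "assistant" | "system"; content: string };

// summarize() stands in for the Claude Haiku call that produces the
// structured summary (decisions, commitments, open questions, etc.).
function compact(messages: Msg[], summarize: (older: Msg[]) => string): Msg[] {
  const mid = Math.floor(messages.length / 2);
  const older = messages.slice(0, mid);  // compressed into the summary
  const recent = messages.slice(mid);    // kept verbatim
  const summary: Msg = {
    role: "system",
    content: `[Compacted context]\n${summarize(older)}`,
  };
  return [summary, ...recent];
}
```

OpenClaw then replaces its message array with the returned value; the engine itself keeps no copy.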
Auto-flush every turn — after each turn, the engine identifies new messages (using a counter, not by diffing arrays) and writes them to memory/daily/YYYY-MM-DD.md. Nothing is ever lost.

Auto-retrieval — before every model run, QMD runs a hybrid search (BM25 + vector + LLM reranking) using the last user message as the query. Relevant past context is injected silently into the system prompt. A 500ms timeout ensures it never blocks.
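The counter-based approach can be sketched in a few lines (assumed types, not the plugin’s real code):

```typescript
// Minimal sketch of counter-based auto-flush tracking (assumed shapes).
type Msg = { role: string; content: string };

class FlushTracker {
  private flushedCount = 0;

  // Returns only the messages that arrived since the last flush.
  // No array diffing: OpenClaw only appends, so comparing lengths is enough.
  takeNew(conversation: Msg[]): Msg[] {
    const fresh = conversation.slice(this.flushedCount);
    this.flushedCount = conversation.length;
    return fresh;
  }
}
```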
After each message is written to the transcript, the index is updated in the background. This means messages are searchable immediately within the same session — not just from the next session onward.
memory_search tool — agents can explicitly search conversation history. Uses full hybrid search with HyDE reranking for maximum quality.
The plugin connects to a local QMD HTTP daemon at localhost:8181. On bootstrap, it checks if the daemon is running — if not, it spawns qmd mcp --http --daemon in the background. The daemon stays alive between sessions, keeping embedding models warm in memory. If the daemon isn’t available (QMD not installed, model not downloaded yet), the plugin falls back to QMD’s CLI BM25 search. If that’s also unavailable, auto-retrieval is silently skipped — the agent still works, just without memory context.
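The degradation order can be expressed as a simple ordered choice; this is a sketch of the decision, with hypothetical names, not the plugin’s actual code:

```typescript
// Sketch of the retrieval fallback chain (hypothetical names).
type Backend = "daemon-hybrid" | "cli-bm25" | "none";

function pickBackend(daemonUp: boolean, cliAvailable: boolean): Backend {
  if (daemonUp) return "daemon-hybrid"; // full BM25 + vector + LLM reranking
  if (cliAvailable) return "cli-bm25";  // keyword-only CLI fallback
  return "none";                        // auto-retrieval silently skipped
}
```

The key property is that the agent never errors out: the worst case is a turn with no memory context.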
Requirements
- Node.js 18+
- OpenClaw v2026.3.x+
- QMD — the on-device search engine that powers memory retrieval
Linux / Railway
QMD installs node-llama-cpp as a dependency, which compiles llama.cpp from C++ source at runtime. On macOS, Xcode command line tools include everything needed and this is invisible. On a fresh Linux container (including Railway), you need three things:
```
apt-get update && apt-get install -y git cmake build-essential
```

- `git` — node-llama-cpp uses `git clone` to pull the llama.cpp source. Without it, the build enters an infinite retry loop with no error message.
- `cmake` — required to compile llama.cpp from source.
- `build-essential` — C/C++ compiler toolchain.
Add this to your Railway Dockerfile or nixpacks.toml before the npm install step.
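For reference, a minimal Dockerfile might look like the following. The base image, file layout, and entry point here are assumptions; adapt them to your stack:

```dockerfile
# Assumed Debian-based Node image; adjust to your setup.
FROM node:20-slim

# git: node-llama-cpp clones the llama.cpp source
# cmake + build-essential: compile llama.cpp at runtime
RUN apt-get update \
    && apt-get install -y git cmake build-essential \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
CMD ["node", "index.js"]
```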
Bun is not supported
QMD must run under Node.js, not Bun. node-llama-cpp ships pre-built native binaries for Node.js only. Under Bun, the native addon crashes silently on model load — no error, no warning, search just returns empty. Everything else (HTTP server, BM25 indexing, SQLite) works fine under Bun, so it looks healthy until you test an actual search query.
If your stack uses Bun, run QMD in a separate process under Node.js.
GPU acceleration
On servers without a GPU (including Railway), set NODE_LLAMA_CPP_GPU=false before starting QMD. Without this, node-llama-cpp tries to compile a Vulkan variant at runtime, which fails and falls back to CPU — but wastes 5+ minutes of build time on every container start.
Cold start
The first search after a fresh deploy takes 10-20s on CPU while QMD loads embedding and reranking models into memory (~2.5GB total). Subsequent searches are fast (<1s). If you're wrapping QMD in a search endpoint, set your timeout to at least 30s to handle the cold start. Don't assume search is broken because the first call is slow.
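If you are fronting QMD with your own endpoint, the generous first-call timeout can be wrapped generically. This is a sketch; `qmdSearch` in the usage line is a placeholder for your actual search call:

```typescript
// Sketch: race a promise against a timeout, resolving to a fallback
// value if the work doesn't finish in time.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timer = new Promise<T>((resolve) =>
    setTimeout(() => resolve(fallback), ms),
  );
  return Promise.race([work, timer]);
}
```

Usage might look like `await withTimeout(qmdSearch(query), 30_000, [])`: 30s accommodates the cold start, while steady-state calls finish well under a second.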
Setup
1. Install QMD
```
npm install -g @tobilu/qmd
```

First run downloads the GGUF embedding model (~400MB). This only happens once.
2. Install the plugin
From npm (recommended):
```
openclaw plugins install @kavinbmittal/lia-memory-engine
```

From source:

```
cd ~/.openclaw/extensions
git clone <this-repo> lia-memory-engine
cd lia-memory-engine
npm install
npm run build
```

When installing from npm, OpenClaw discovers the plugin automatically — skip step 5 (the plugins.load.paths config).
3. Register your memory collection
Point QMD at the directory where Lia writes transcripts. By default this is memory/ inside your agent’s workspace:
```
qmd collection add /path/to/your/workspace/memory --name lia-memory
```

Run this once per workspace. If you’re not sure where your workspace is, check your OpenClaw config — the agent’s working directory is the workspace.
4. Index existing transcripts
```
qmd embed -c lia-memory
```

If you’re starting fresh with no prior transcripts, skip this — the plugin will handle it on bootstrap.
5. Add the Engine to OpenClaw Config
In ~/.openclaw/openclaw.json, add the minimum viable config:
```json
{
  "plugins": {
    "load": {
      "paths": ["~/.openclaw/extensions/lia-memory-engine"]
    },
    "slots": {
      "contextEngine": "lia-memory-engine"
    },
    "entries": {
      "lia-memory-engine": {
        "enabled": true
      }
    }
  }
}
```

The plugins.slots.contextEngine line is required. Without it, the plugin installs and shows enabled: true, but OpenClaw silently falls back to its built-in safeguard compaction. The memory_search tool registers, but none of the engine lifecycle methods fire — no assemble(), no ingest(), no compact(), no auto-flush. There’s no error or warning in older versions (v1.1+ logs a warning).
On first session start, the plugin starts the QMD daemon automatically. Models stay warm across sessions — no loading penalty after the first one.
See Configuration for all available options and recommended session settings.
6. Set Engine Parameters in OpenClaw Config
All options go under plugins.entries.lia-memory-engine.config in openclaw.json:
```json
"lia-memory-engine": {
  "enabled": true,
  "config": {
    "compactionThreshold": 0.80,
    "compactionModel": "anthropic/claude-haiku-4-5",
    "autoRetrieval": false,
    "autoRetrievalTimeoutMs": 500,
    "transcriptRetentionDays": 180
  }
}
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| enabled | boolean | true | Enable/disable the entire plugin |
| compactionThreshold | number | 0.80 | Fraction of context window that triggers compaction (0.1–1.0). Measured against the live conversation snapshot each turn. At this threshold, the engine splits messages at the midpoint and summarizes the older half |
| compactionModel | string | anthropic/claude-haiku-4-5 | Model used for compaction summarization. Must be a fast model — it runs synchronously during compaction |
| autoRetrieval | boolean | false | Automatically search memory files and inject relevant context before every model turn. Uses the last user message as the search query. Disabled by default — breaks prompt cache when enabled |
| autoRetrievalTimeoutMs | number | 500 | Maximum time in ms to wait for auto-retrieval results. Keeps the agent responsive — if QMD doesn’t respond in time, the turn proceeds without memory context |
| transcriptRetentionDays | number | 180 | Days to keep daily transcript files before cleanup. Set higher if you want longer memory recall |
| qmdHost | string | localhost | QMD HTTP daemon hostname |
| qmdPort | number | 8181 | QMD HTTP daemon port |
| qmdCollectionName | string | lia-memory | QMD collection name. Change this if you run multiple agents with separate memory pools |
| enableVectorSearch | boolean | true | Enable vector semantic search + LLM reranking. Requires a ~400MB GGUF model download on first run. When false, only BM25 keyword search is used |
To disable vector search and use BM25 only (no model download required):
```json
{ "enableVectorSearch": false }
```

7. Set OpenClaw Session Reset Config
The plugin handles compaction (in-place summarization, no reset), but OpenClaw’s session reset policy is separate. Without configuring it, sessions may reset unexpectedly and lose context that the plugin has been carefully preserving.
```json
{
  "agents": {
    "defaults": {
      "session": {
        "reset": {
          "mode": "idle",
          "idleMinutes": 10080
        }
      }
    }
  }
}
```

This sets sessions to reset only after 7 days of inactivity (10080 minutes). Since the plugin’s compaction keeps context usable indefinitely, you don’t need aggressive session resets.
Verify it’s working
After setup, confirm the plugin is actually active as the context engine:
1. Check gateway logs on startup. Look for:

   ```
   [lia-memory-engine] Registered as context engine
   ```

   If you see `WARNING: Plugin loaded but not assigned as context engine` instead, the slot assignment is missing — go back to step 5.

2. Send a message, then check the transcript. After your first message in a session, verify that memory/daily/YYYY-MM-DD.md exists in your workspace and contains conversation entries (format: `## HH:MM` with **User:** and **Agent:** sections). If the file doesn’t exist, the engine’s afterTurn() is not being called.

3. Check /status output. Compaction events should show lia-memory-engine as the source. If you see “safeguard mode” or no compaction source, the plugin isn’t slotted.

4. Verify QMD search returns results. After a few messages, ask your agent to use the memory_search tool (e.g. “search your memory for [something you just discussed]”). If it returns actual matches with snippets, retrieval is working end-to-end. If it returns “No results found” despite having transcripts on disk, QMD’s search pipeline is broken — check for the issues below. This is the most important step: without it, you won’t know if memory retrieval silently failed — auto-flush and registration can succeed while search returns nothing.

5. Check for the infinite clone loop (Linux/Docker). If your logs show this repeating endlessly:

   ```
   [node-llama-cpp] Cloning ggml-org/llama.cpp (local bundle) 0%
   [node-llama-cpp] Cloning ggml-org/llama.cpp (GitHub) 0%
   ```

   cmake is not installed. node-llama-cpp needs it to compile llama.cpp from source, and without it enters an infinite retry loop with no error message. See the Linux / Railway section. This won’t happen on macOS (Xcode includes cmake), only on Linux containers.
Architecture
The engine implements OpenClaw's ContextEngine interface with ownsCompaction: true. It never stores messages — OpenClaw's JSONL session files are the source of truth.
| Hook | When | What it does |
|------|------|-------------|
| bootstrap() | Session start | Creates memory dirs, starts QMD daemon |
| assemble() | Before each model run | Passes through OpenClaw's messages, adds QMD auto-retrieval context to system prompt |
| afterTurn() | After each turn | Flushes new messages to transcript (counter-based), checks compaction threshold |
| compact() | When threshold hit | Takes OpenClaw's messages, summarizes older half via Haiku, returns compacted result |
| search() | memory_search tool | Full hybrid QMD search |
| dispose() | Shutdown | Clears session trackers (no message data to lose) |
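Based on the hook table, the surface the engine implements looks roughly like this. This is a sketch inferred from the table, not OpenClaw’s authoritative type definitions; the exact signatures are assumptions:

```typescript
// Sketch of the ContextEngine surface as described by the hook table.
// Hook names come from the table; signatures are assumptions.
interface Msg { role: string; content: string }

interface ContextEngine {
  ownsCompaction: boolean;
  bootstrap(): Promise<void>;                       // session start
  assemble(                                         // before each model run
    messages: Msg[],
    systemPrompt: string,
  ): Promise<{ messages: Msg[]; systemPrompt: string }>;
  afterTurn(messages: Msg[]): Promise<void>;        // flush + threshold check
  compact(messages: Msg[]): Promise<Msg[]>;         // summarize older half
  search(query: string): Promise<string[]>;         // memory_search tool
  dispose(): void;                                  // shutdown
}

// Minimal no-op stub showing the shape:
const stub: ContextEngine = {
  ownsCompaction: true,
  async bootstrap() {},
  async assemble(messages, systemPrompt) { return { messages, systemPrompt }; },
  async afterTurn() {},
  async compact(messages) { return messages; },
  async search() { return []; },
  dispose() {},
};
```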
```
index.ts          Plugin entry point — register(), configSchema, tool registration
src/
  engine.ts       LiaContextEngine — implements ContextEngine interface (stateless, no message storage)
  compact.ts      Compaction logic — midpoint split, Haiku summarization
  auto-flush.ts   Transcript formatting and daily file writes
  search.ts       Search functions — auto-retrieval and memory_search
  qmd-client.ts   QMD HTTP daemon client — hybrid search, daemon lifecycle
  types.ts        Type definitions and config defaults
```

LLM Access
The plugin needs LLM access for compaction. It tries three methods in order:
1. `api.completeSimple()` — if exposed by OpenClaw’s plugin API
2. `@mariozechner/pi-ai` — dynamic import (OpenClaw’s internal LLM router)
3. `@anthropic-ai/sdk` — direct Anthropic SDK (requires `ANTHROPIC_API_KEY` env var)
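The ordered fallback amounts to a generic first-success helper. The sketch below shows the pattern, not the plugin’s actual code; in practice the three provider factories would wrap the methods listed above:

```typescript
// Sketch: try each provider factory in order, return the first that works.
async function firstAvailable<T>(
  providers: Array<() => Promise<T>>,
): Promise<T> {
  let lastError: unknown;
  for (const tryProvider of providers) {
    try {
      return await tryProvider();
    } catch (err) {
      lastError = err; // e.g. API not exposed, module not installed
    }
  }
  throw lastError ?? new Error("no LLM provider available");
}
```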
