zettel-compress
v1.0.0
Deterministic memory engine for LLM apps — compress, search, and inject conversation memory offline, on-device, or at the edge. Zero external calls, zero infrastructure, zero dependencies.
zettel-compress
Memory for LLM apps that runs anywhere your code runs — offline, on-device, at the edge. Compresses conversation history into structured, searchable memory and hands your model exactly the context it needs. Zero external calls. Zero infrastructure. Zero dependencies. 18 kB gzipped.
And the property nothing else in this space has: it's deterministic — the same messages produce byte-identical memory, every time, on every machine. Memory you can snapshot-test, replay in CI, diff in code review, and trust in an air-gapped deployment.
▶ Try it in the playground — the full engine runs client-side in your tab (14 kB gzipped): paste a conversation, watch it become structured memory, ask it questions.
import { compress, recall, injectContext } from 'zettel-compress'
const memory = compress(conversationHistory)
// search memory at question time — BM25 + graph expansion, no embeddings
recall(memory, 'what did we decide about authentication?', { topK: 5 })
// or inject a hard-budgeted memory block into your prompt
injectContext(memory, { maxTokenBudget: 300, format: 'markdown' })

Measured results
Every number below is reproducible: npm run bench (deterministic) and npm run bench:llm (requires an OpenAI key in .env). Datasets: a real assistant conversation (~2.5k tokens) and two public-domain books — Pride and Prejudice (~182k tokens) and On the Origin of Species (~233k tokens) — each with 12 decision facts planted at seeded positions.
Can a real model answer questions from the compressed memory?
QA accuracy with gpt-4o-mini answering from each context (the reply must contain the unique answer token):
| context given to the model | avg tokens | conversation | novel (182k) | science (233k) |
|---|---|---|---|---|
| nothing | — | 0% | 0% | 0% |
| first 300 tokens of the document | ~370 | 8% | 0% | 0% |
| injectContext top-10, markdown | 430–860 | 58% | 0% | 0% |
| injectContext budget-300, markdown | ~290 | 33% | 0% | 0% |
| recall(question) top-5 | 90–200 | 58% | 67% | 75% |
The honest read:
- recall() is the headline. From a 233k-token corpus, ~150 tokens of retrieved memory let the model answer 75% of questions, while the naive same-cost baseline answers 0%. No embeddings, no API, sub-millisecond.
- Static injection is for conversation-scale memory. On a 19-zettel conversation, top-10 injection carries 83% of planted decision signals (58% end-to-end with the model). On a 1,085-zettel book, 10 zettels cannot cover 12 scattered facts; use recall() for archives.
- Use format: 'markdown' when injecting directly into prompts: models read plain quotes better than the compact AAAK lines (33% vs 17% at the same budget in our runs). Use AAAK for storage and round-tripping.
Public benchmark: LoCoMo-10 (very-long-term conversational QA)
Full protocol over all 1,986 questions: 10 multi-session conversations (260k
tokens total) compressed once each, gpt-4o-mini answering from
recallContext(question) top-10 passages only (npm run bench:locomo).
| retrieval mode | overall F1 (cat 1–4) | single-hop | temporal | multi-hop | answer-in-context |
|---|---|---|---|---|---|
| quotes only (pre-0.3) | 5.9 | 8.0 | 1.3 | 5.4 | 9.7% |
| small-to-big (recallContext) | 42.4 | 58.4 | 23.9 | 27.8 | 38.1% |
Adversarial category (446 unanswerable questions): 88.3% correct abstention, 5.2% trapped. Average context: ~1,700 tokens/question, under 1% of the 260k-token corpus (roughly 6.5% of an average ~26k-token conversation). No-context baseline: F1 0.0.
Honest framing: long-context GPT-4-class models with the entire conversation in the prompt score in the 30s–50s F1 band on these categories; LLM-write memory systems (mem0) report ~67 under a more lenient LLM-judge metric. zettel-compress reaches the full-context band at ~1% of the tokens with zero model calls on the memory side. Ceiling analysis: only 45.1% of LoCoMo gold answers appear verbatim anywhere in the conversation (annotators reworded the rest), and BM25 retrieval already surfaces 86% of that ceiling in the top-10 — a measured GloVe-blend spike gained +0.4 points, so the remaining gap is answer rewording and metric strictness, not retrieval ranking.
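To make "metric strictness" concrete, here is a minimal sketch of a token-overlap F1, assuming a SQuAD-style definition. The function name tokenF1 and the tokenizer are illustrative assumptions; the benchmark harness's actual normalization is not shown in this README and may differ.

```typescript
// Hypothetical helper: token-overlap F1 between a model reply and a gold answer.
// Assumption: lowercase alphanumeric tokens, no stemming or stop-word removal.
function tokenF1(prediction: string, gold: string): number {
  const tokens = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? []
  const pred = tokens(prediction)
  const ref = tokens(gold)
  if (pred.length === 0 || ref.length === 0) return pred.length === ref.length ? 1 : 0

  // Count gold tokens so repeated words are only matched as often as they occur.
  const remaining = new Map<string, number>()
  for (const t of ref) remaining.set(t, (remaining.get(t) ?? 0) + 1)

  let overlap = 0
  for (const t of pred) {
    const left = remaining.get(t) ?? 0
    if (left > 0) {
      overlap++
      remaining.set(t, left - 1)
    }
  }
  if (overlap === 0) return 0

  const precision = overlap / pred.length
  const recall = overlap / ref.length
  return (2 * precision * recall) / (precision + recall)
}
```

Under this definition a correct but reworded answer ("ceramics workshop" for a gold "pottery class") scores 0, which is why rewording by annotators caps the achievable F1 regardless of retrieval quality.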
Speed, size, and guarantees
| dataset | input tokens | zettels | compress time | throughput |
|---|---|---|---|---|
| conversation | 2,571 | 19 | 5.3 ms | ~485 tok/ms |
| novel | 182,179 | 1,085 | 757 ms | ~241 tok/ms |
| science | 233,248 | 753 | 525 ms | ~444 tok/ms |
- Compression: injectContext top-10 reduces the 233k-token text to 882 tokens (0.38%); a 300-token budget always lands at or under its ceiling (measured 82–100% utilization, zero overruns across all tiers and datasets).
- Lossless round-trip: decode(encode(result)) reproduces zettels, tunnels, and the entity index exactly, verified by deep equality on all three datasets and a 200-case property test (multi-line quotes, unicode, pipes, snake_case all survive).
- Streaming: CompressStream takes ~0.1 ms per message with bounded memory.
- Entity detection: 100% precision / 100% recall on the labeled benchmark fixture (10 gold entities among changelog/chat noise).
When to use it — and when not to
Use zettel-compress when your constraints look like this:
- Offline / on-device / air-gapped — local-first apps, privacy-bound deployments, anywhere user text must never leave the process
- Edge runtimes — Cloudflare Workers, Vercel Edge, browsers: no vector DB to stand up, no embedding service to call, nothing to operate
- Determinism matters — agent test suites, replayable sessions, auditable memory; byte-identical output is something no LLM- or embedding-based memory can offer even in principle
- No external calls allowed — compliance, latency budgets measured in milliseconds, or simply zero appetite for another vendor dependency
Use something else when:
- You want maximum recall quality and can run infrastructure → embeddings RAG (pgvector + any embedding model) handles paraphrase ("pottery class" ↔ "ceramics workshop") better than lexical matching ever will
- You want managed memory with fact-updating and contradiction handling → mem0 / Zep are good products; they trade API calls, latency, and nondeterminism for higher QA scores
- Your conversations are short → a sliding window of recent messages is simpler and good enough
| | zettel-compress | mem0 / Zep | embeddings RAG |
|---|---|---|---|
| External calls | none, ever | every write | every index + query |
| Infrastructure | none | hosted service | vector store |
| Runs offline / on-device / edge | yes | no | rarely |
| Deterministic / replayable / testable | byte-exact | no | no |
| Lossless text serialization (diffable memory) | yes (AAAK) | no | no |
| Semantic paraphrase matching | no — lexical + graph | yes | yes |
| LoCoMo QA (our measurement / their reported) | 41.6 F1 | ~67 (LLM-judged) | unmeasured here |
The trade is explicit and we publish the numbers on it.
How it works
Text is chunked on paragraph boundaries (overlap snaps to word boundaries; every chunk carries exact source offsets). Each chunk becomes a zettel:
- entities: proper nouns detected by capitalization evidence (sentence-start noise like Added, Please is filtered; chat speaker labels are kept), with pronoun coreference: she/he link to the most recent gender-matching entity, so a person stays attached to the conversation after their first mention
- topics: key terms with CamelCase/ALL-CAPS/hyphenation boosts
- quote: the most information-dense sentence (TextRank blended with decision-word density; falls back gracefully on lowercase chat text)
- weight: importance in [0, 1], rank-normalized with tie-aware midranks (equal raw scores always get equal weights; relative within a result)
- flags: DECISION | ORIGIN | CORE | PIVOT | GENESIS | TECHNICAL
- emotions: 30 states via word-boundary lexicons with negation scope (a useful filtering signal, not sentiment analysis; calibrate expectations accordingly)
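The weight field's "tie-aware midranks" can be illustrated with a small standalone sketch. It is not the library's code: the name midrankWeights, the division by n - 1 to land in [0, 1], and the single-item case returning 1 are assumptions, and the temperature option is ignored here.

```typescript
// Hypothetical helper: rank-normalize raw scores into [0, 1], giving tied scores
// the average of the rank positions they occupy (their midrank).
function midrankWeights(raw: number[]): number[] {
  const n = raw.length
  if (n === 0) return []
  if (n === 1) return [1] // assumption: a lone item gets full weight

  // Indices sorted by ascending score; the index tiebreak keeps the order stable.
  const order = raw.map((_, i) => i).sort((a, b) => raw[a] - raw[b] || a - b)
  const weights = new Array<number>(n).fill(0)

  let start = 0
  while (start < n) {
    // Find the run [start, end) of equal raw scores.
    let end = start + 1
    while (end < n && raw[order[end]] === raw[order[start]]) end++
    // Every member of the run shares the mean of the 0-based ranks it spans.
    const midrank = (start + end - 1) / 2
    for (let k = start; k < end; k++) weights[order[k]] = midrank / (n - 1)
    start = end
  }
  return weights
}
```

Because the result depends only on ordering, weights are relative within one result, which matches the note above that they are not comparable across results without re-normalizing.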
Tunnels link zettels sharing entities/topics above a Jaccard threshold (capped per zettel). recall() runs BM25 over quotes+topics+entities with automatic synonym expansion (move↔relocate, marry↔wedding, job↔career, and city abbreviations like NYC↔"New York") and a date-proximity bonus that boosts zettels whose resolvedDate matches a year/month mentioned in the query. Hits expand one associative hop along tunnels with personalized PageRank.
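As a rough illustration of tunnel building, the sketch below links items whose entity/topic sets exceed a Jaccard threshold and caps links per item. The names (TunnelNode, buildTunnels), the greedy strongest-first cap, and the plain string tiebreak are assumptions for illustration, not the library's implementation.

```typescript
// Hypothetical shapes: keys stands for a zettel's combined entities and topics.
interface TunnelNode { id: string; keys: string[] }
interface Tunnel { a: string; b: string; similarity: number }

// Jaccard similarity: |intersection| / |union|.
function jaccard(x: Set<string>, y: Set<string>): number {
  let shared = 0
  x.forEach(key => { if (y.has(key)) shared++ })
  const union = x.size + y.size - shared
  return union === 0 ? 0 : shared / union
}

// Locale-independent string comparison, so ordering is identical on every machine.
const compareIds = (p: string, q: string): number => (p < q ? -1 : p > q ? 1 : 0)

function buildTunnels(nodes: TunnelNode[], threshold = 0.3, topK = 3): Tunnel[] {
  const sets = nodes.map(n => new Set(n.keys))
  const candidates: Tunnel[] = []
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const similarity = jaccard(sets[i], sets[j])
      if (similarity >= threshold) candidates.push({ a: nodes[i].id, b: nodes[j].id, similarity })
    }
  }
  // Strongest links first; ids break ties deterministically.
  candidates.sort((p, q) => q.similarity - p.similarity || compareIds(p.a, q.a) || compareIds(p.b, q.b))

  // Greedy cap: keep a link only while both endpoints have fewer than topK links.
  const degree = new Map<string, number>()
  const kept: Tunnel[] = []
  for (const t of candidates) {
    const da = degree.get(t.a) ?? 0
    const db = degree.get(t.b) ?? 0
    if (da >= topK || db >= topK) continue
    degree.set(t.a, da + 1)
    degree.set(t.b, db + 1)
    kept.push(t)
  }
  return kept
}
```

The defaults mirror the tunnelThreshold and tunnelTopK values shown in the options listing; the all-pairs loop is the quadratic baseline that candidate generation replaces at larger sizes.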
Documentation
- Getting started — core concepts, the three verbs (compress → recall → inject), persistence, options
- Integration recipes — chatbot memory loop, Vercel AI SDK, Cloudflare Workers + KV, browser/local-first, streams, multi-session, exact token budgets
- Deterministic testing — snapshot-test and replay your agent's memory in CI (the thing only deterministic memory can do)
Install
npm install zettel-compress

Quick start
import { compress, injectContext, recall, recallContext, wakeUp, CompressStream } from 'zettel-compress'
const result = compress(conversationHistory)
// hard token budget — measured output, never exceeds the ceiling
const block = injectContext(result, { maxTokenBudget: 300, format: 'markdown' })
// guarantee decisions survive selection even when ranked low
injectContext(result, { maxZettels: 10, guaranteeFlags: ['DECISION'] })
// diversity-aware selection (maximal marginal relevance)
injectContext(result, { maxZettels: 10, selection: 'mmr' })
// search memory at question time — ranked zettels, or ready-to-inject passages
const hits = recall(result, 'what did we decide about auth?', { topK: 5 })
const context = recallContext(result, 'what did we decide about auth?', { maxTokens: 2000 })
// short narrative of the top moments (top 15% by weight)
const summary = wakeUp(result)
// streaming: compress each message as it arrives, bounded memory
const mem = new CompressStream({ halfLifeTurns: 50, maxZettels: 200 })
mem.push('Alice: the login service keeps timing out')
mem.push('Bob: we decided to rotate tokens hourly')
mem.recall('token decision') // search the live stream
mem.snapshot() // CompressResult at any point, replayable

API
compress(text, options?): CompressResult
compress(text, {
  chunkSize: 800,          // chars per chunk (default 800)
  chunkOverlap: 100,       // overlap, snapped to word boundaries (default 100)
  date: '2026-06-12',      // ISO date for the AAAK header
  title: 'My Session',     // title for the AAAK header
  minEntityFrequency: 1,   // min occurrences to count as entity
  stopWords: ['foo'],      // extra stop words for topic extraction
  temperature: 0.5,        // softmax temperature for weight spread
  tunnelThreshold: 0.3,    // min Jaccard similarity for a tunnel
  tunnelTopK: 3,           // max tunnels per zettel
  dedupe: true,            // merge near-duplicate zettels (default false)
  dedupeThreshold: 0.9,    // token-set Jaccard that counts as duplicate
  verboseLabels: true,     // tunnel labels as Alice+Bob instead of ALC+BOB
  keepSource: true,        // retain normalized input on meta.source for
                           // provenance-expanded recall (default true)
})

Zettels carry exact sourceStart/sourceEnd offsets into meta.source; the offsets serialize in AAAK, but the source text never does (the format stays compact; re-supply the text via recallContext's source option after decoding).
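A minimal sketch of paragraph-boundary chunking with exact source offsets, under stated assumptions: paragraphs are separated by blank lines, whole paragraphs are packed greedily up to chunkSize characters, and overlap is omitted for brevity. The names Chunk, paragraphSpans, and chunkParagraphs are hypothetical; the library's chunker may behave differently.

```typescript
// Hypothetical shape mirroring the sourceStart/sourceEnd offsets described above.
interface Chunk { text: string; sourceStart: number; sourceEnd: number }

// Locate each paragraph as a [start, end) span; blank lines act as separators.
function paragraphSpans(source: string): Array<[number, number]> {
  const spans: Array<[number, number]> = []
  const separator = /\n\s*\n/g
  let start = 0
  let match: RegExpExecArray | null
  while ((match = separator.exec(source)) !== null) {
    if (match.index > start) spans.push([start, match.index])
    start = match.index + match[0].length
  }
  if (start < source.length) spans.push([start, source.length])
  return spans
}

// Pack consecutive paragraphs into chunks of at most chunkSize characters.
// A single paragraph longer than chunkSize becomes its own oversize chunk.
function chunkParagraphs(source: string, chunkSize = 800): Chunk[] {
  const chunks: Chunk[] = []
  const emit = (span: [number, number]): void => {
    chunks.push({ text: source.slice(span[0], span[1]), sourceStart: span[0], sourceEnd: span[1] })
  }
  let open: [number, number] | null = null
  for (const [start, end] of paragraphSpans(source)) {
    if (open !== null && end - open[0] <= chunkSize) {
      open[1] = end // the paragraph still fits, so extend the open chunk
    } else {
      if (open !== null) emit(open)
      open = [start, end]
    }
  }
  if (open !== null) emit(open)
  return chunks
}
```

Keeping offsets rather than copies means any chunk can be re-sliced from the original text later, which is what makes it possible to drop the source from a serialized form and restore passages when the text is supplied again.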
Tunnel building switches to MinHash/LSH candidate generation above 500 zettels — 10,000 zettels link in ~400ms instead of 50M pairwise comparisons, deterministically.
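The idea behind MinHash/LSH candidate generation can be sketched as follows. This is an illustration under assumptions, not the library's code: the FNV-1a hash, the seed-prefix trick for deriving multiple hash functions, and the band/row counts are all chosen for brevity, and the function names are hypothetical.

```typescript
// 32-bit FNV-1a string hash (deterministic, no randomness involved).
function fnv1a(text: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h >>> 0
}

// MinHash signature: for each seeded hash function, keep the minimum hash over all keys.
function signature(keys: string[], numHashes: number): number[] {
  const sig = new Array<number>(numHashes).fill(0xffffffff)
  for (const key of keys) {
    for (let s = 0; s < numHashes; s++) sig[s] = Math.min(sig[s], fnv1a(s + ':' + key))
  }
  return sig
}

// LSH banding: items sharing any full band of their signature become candidate pairs.
function candidatePairs(items: string[][], bands = 8, rows = 2): Array<[number, number]> {
  const buckets = new Map<string, number[]>()
  items.forEach((keys, index) => {
    const sig = signature(keys, bands * rows)
    for (let b = 0; b < bands; b++) {
      const bucket = b + ':' + sig.slice(b * rows, (b + 1) * rows).join(',')
      const members = buckets.get(bucket)
      if (members) members.push(index)
      else buckets.set(bucket, [index])
    }
  })
  const seen = new Set<string>()
  const pairs: Array<[number, number]> = []
  buckets.forEach(members => {
    for (let i = 0; i < members.length; i++) {
      for (let j = i + 1; j < members.length; j++) {
        const id = members[i] + '-' + members[j]
        if (!seen.has(id)) {
          seen.add(id)
          pairs.push([members[i], members[j]])
        }
      }
    }
  })
  return pairs
}
```

Only the candidate pairs then need an exact Jaccard check, so the expensive comparison runs on a small fraction of all pairs. Because the hashes are fixed functions of the input and Map iteration follows insertion order, the candidate list is the same on every run.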
recall(result, query, options?): Zettel[]
Query-time retrieval: BM25 over each zettel's full source chunk (falling back to quote/topics/entities when no source is kept), with built-in synonym expansion (common paraphrase clusters: move/relocate/transfer, job/career/work, marry/wedding/spouse, and city abbreviations) and a date-proximity bonus that promotes zettels whose resolvedDate matches any year/month found in the query. Hits optionally expand one hop along the tunnel graph with personalized PageRank. Options: { topK?: number, hops?: boolean, expandQuery?: boolean, after?: string, before?: string }. Output is deterministic.
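For readers unfamiliar with BM25, here is a compact standalone sketch of the scoring step only. Synonym expansion, the date bonus, and graph expansion are left out, and the tokenizer, the k1/b defaults, and the name bm25Rank are assumptions rather than the library's actual choices.

```typescript
// Assumption: lowercase alphanumeric tokens, no stemming.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? []
}

// Hypothetical helper: rank documents against a query with Okapi BM25.
function bm25Rank(docs: string[], query: string, k1 = 1.2, b = 0.75): Array<{ index: number; score: number }> {
  const tokens = docs.map(tokenize)
  const avgLength = tokens.reduce((sum, t) => sum + t.length, 0) / Math.max(1, docs.length)

  // Document frequency: in how many documents each term appears.
  const docFreq = new Map<string, number>()
  for (const t of tokens) {
    new Set(t).forEach(term => docFreq.set(term, (docFreq.get(term) ?? 0) + 1))
  }

  const queryTerms = Array.from(new Set(tokenize(query)))
  return tokens
    .map((t, index) => {
      let score = 0
      for (const term of queryTerms) {
        const tf = t.filter(x => x === term).length
        if (tf === 0) continue
        const df = docFreq.get(term) ?? 0
        // Rare terms get a higher inverse document frequency.
        const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5))
        // Term frequency saturates via k1; b normalizes for document length.
        score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * t.length) / avgLength))
      }
      return { index, score }
    })
    .filter(hit => hit.score > 0)
    .sort((x, y) => y.score - x.score || x.index - y.index) // index tiebreak keeps output stable
}
```

Note that this purely lexical scorer does not match "decide" against "decided"; closing gaps like that is the job of the expansion steps described above.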
recallContext(result, query, options?): string
Small-to-big retrieval, the recommended way to build LLM context. Ranks on the compact zettel index, then returns the full source passages the hits came from: overlapping spans merge, a token budget admits passages in rank order, and the output assembles in document order so narrative/temporal flow survives. Options: { topK?, hops?, maxTokens?, source? }. Falls back to quotes when no source text is available (e.g. decoded AAAK; pass source to restore it). This is what lifted LoCoMo F1 from 5.9 to 41.6.
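The merge, budget, and reorder steps described above can be sketched in isolation. This is an illustration, not the library's implementation: the Span shape, the name assembleContext, the characters/4 default token estimate, and the choice to skip (rather than stop at) a passage that does not fit are all assumptions.

```typescript
// Hypothetical shape: a hit's source span plus its retrieval rank (0 = best).
interface Span { start: number; end: number; rank: number }

function assembleContext(
  source: string,
  hits: Span[],
  maxTokens: number,
  countTokens: (text: string) => number = t => Math.ceil(t.length / 4),
): string {
  // 1. Merge overlapping spans; a merged passage inherits its best member's rank.
  const byStart = hits.slice().sort((a, b) => a.start - b.start || a.end - b.end)
  const merged: Span[] = []
  for (const span of byStart) {
    const last = merged[merged.length - 1]
    if (last !== undefined && span.start <= last.end) {
      last.end = Math.max(last.end, span.end)
      last.rank = Math.min(last.rank, span.rank)
    } else {
      merged.push({ start: span.start, end: span.end, rank: span.rank })
    }
  }

  // 2. Admit passages in rank order while they fit the token budget.
  let used = 0
  const admitted: Span[] = []
  for (const span of merged.slice().sort((a, b) => a.rank - b.rank || a.start - b.start)) {
    const cost = countTokens(source.slice(span.start, span.end))
    if (used + cost > maxTokens) continue // assumption: skip and try smaller passages
    used += cost
    admitted.push(span)
  }

  // 3. Emit in document order so the narrative reads in sequence.
  return admitted
    .sort((a, b) => a.start - b.start)
    .map(s => source.slice(s.start, s.end))
    .join('\n\n')
}
```

Ranking decides what gets in, while document order decides how it is presented; separating the two is what lets temporal questions see events in the order they happened.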
injectContext(result, options?): string
injectContext(result, {
  maxZettels: 10,                 // top N by 0.7·weight + 0.3·signal-flag bonus
  selection: 'mmr',               // 'weight' (default) | 'mmr' diversity selection
  guaranteeFlags: ['DECISION'],   // always include one zettel per flag if present
  minWeight: 0.5,                 // weight floor
  flags: ['DECISION'],            // filter to flags
  format: 'markdown',             // 'aaak' (default) | 'json' | 'markdown'
  maxTokenBudget: 300,            // hard ceiling: output measured, never exceeded
  countTokens: myTokenizer,       // optional exact counter (e.g. js-tiktoken)
})

Only tunnels and entity-index entries belonging to the selected zettels are emitted.
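A minimal sketch of ranked selection under a hard, measured token budget. The 0.7/0.3 blend comes from the option comment above; everything else is an assumption for illustration: the Item shape, the name selectWithinBudget, the flag bonus being 1 for any non-TECHNICAL flag, the blockquote rendering, and stopping at the first line that would overflow.

```typescript
// Hypothetical shape with just the fields this sketch needs.
interface Item { quote: string; weight: number; flags: string[] }

// Assumption: a rough characters/4 estimate when no exact counter is supplied.
const estimateTokenCount = (text: string): number => Math.ceil(text.length / 4)

function selectWithinBudget(
  items: Item[],
  maxZettels: number,
  maxTokenBudget: number,
  countTokens: (text: string) => number = estimateTokenCount,
): string {
  // Blend weight with a bonus for signal flags; TECHNICAL is treated as metadata only.
  const score = (z: Item): number =>
    0.7 * z.weight + 0.3 * (z.flags.some(f => f !== 'TECHNICAL') ? 1 : 0)

  const ranked = items
    .map((z, i) => ({ z, i }))
    .sort((a, b) => score(b.z) - score(a.z) || a.i - b.i) // index tiebreak for stable output
    .slice(0, maxZettels)

  const lines: string[] = []
  for (const { z } of ranked) {
    // Measure the output as it would actually be emitted, not a per-item estimate.
    const candidate = lines.concat('> ' + z.quote).join('\n')
    if (countTokens(candidate) > maxTokenBudget) break
    lines.push('> ' + z.quote)
  }
  return lines.join('\n')
}
```

Measuring the fully rendered candidate on every step is what turns the budget into a ceiling rather than a target: separators and formatting overhead are counted too.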
CompressStream
Incremental memory for message streams. push(text), snapshot(), recall(query, opts?), size. Options: all of CompressOptions plus halfLifeTurns (recency decay in pushes) and maxZettels (bounded memory via lowest-decayed-weight eviction). With dedupe: true, a re-sent or boilerplate message refreshes the recency of the zettel it duplicates instead of growing the stream — repetition strengthens a memory rather than copying it. Entity codes never change once assigned; replaying the same pushes reproduces a byte-identical snapshot.
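The interplay of half-life decay, duplicate refresh, and eviction can be sketched with a toy store. This is not CompressStream itself: the class name DecayingStore, exact-text duplicate detection, and caller-supplied weights are simplifying assumptions.

```typescript
interface Memory { text: string; weight: number; lastTurn: number }

// Hypothetical toy store: bounded memory with exponential recency decay.
class DecayingStore {
  private turn = 0
  private items: Memory[] = []
  private readonly halfLifeTurns: number
  private readonly maxItems: number

  constructor(halfLifeTurns: number, maxItems: number) {
    this.halfLifeTurns = halfLifeTurns
    this.maxItems = maxItems
  }

  // Weight halves every halfLifeTurns pushes since the item was last seen.
  private decayed(m: Memory): number {
    return m.weight * Math.pow(0.5, (this.turn - m.lastTurn) / this.halfLifeTurns)
  }

  push(text: string, weight: number): void {
    this.turn++
    const duplicate = this.items.find(m => m.text === text)
    if (duplicate) {
      duplicate.lastTurn = this.turn // repetition refreshes recency instead of adding a copy
      return
    }
    this.items.push({ text, weight, lastTurn: this.turn })
    // Evict the item with the lowest decayed weight until back within bounds.
    while (this.items.length > this.maxItems) {
      let weakest = 0
      for (let i = 1; i < this.items.length; i++) {
        if (this.decayed(this.items[i]) < this.decayed(this.items[weakest])) weakest = i
      }
      this.items.splice(weakest, 1)
    }
  }

  texts(): string[] {
    return this.items.map(m => m.text)
  }
}
```

State depends only on the sequence of pushes (a turn counter, not wall-clock time), which is the property that makes replaying the same pushes reproduce the same snapshot.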
wakeUp(result, topPct = 0.15): string
Narrative summary of the top topPct zettels by weight (plus ORIGIN/CORE/GENESIS flags), capped at 5. Never empty on non-empty input.
encode(result): string / decode(aaak, options?): CompressResult
AAAK v2 text serialization — fully lossless: E: lines carry the entity index, quotes/topics/headers are escaped (multi-line quotes, ", |, snake_case topics all survive exactly). decode reads v1 and v2; { strict: true } throws on malformed lines, default mode collects meta.warnings (including header-count mismatches and unknown emotion/flag tokens).
FILE:002|ALC+BOB|2026-06-12|Auth Design|v2
E:ALC=Alice;BOB=Bob
001:ALC+BOB|authentication,security|"We decided to use JWT tokens."|0.91|conviction|DECISION+TECHNICAL
T:001<->002|ALC+BOB

Others
compressMany(texts, options?) · mergeResults(results) (re-normalizes weights onto one scale) · topZettels(result, n) · normalizeWeights(zettels, temperature?) · estimateTokens(text) · encodeZettelLine / encodeTunnelLine · runtime constants ALL_FLAGS, ALL_EMOTIONS.
Integration examples
Vercel AI SDK — budgeted memory block
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { compress, injectContext } from 'zettel-compress'
const memory = compress(messages.map(m => `${m.role}: ${m.content}`).join('\n'))
const block = injectContext(memory, { maxTokenBudget: 300, format: 'markdown' })
const response = await streamText({
  model: openai('gpt-4o'),
  messages: [
    { role: 'system', content: `Relevant past context:\n${block}` },
    ...recentMessages,
  ],
})

Question-time recall: only inject what the question needs
import { compress, recall } from 'zettel-compress'
const memory = compress(fullHistory)
const relevant = recall(memory, userQuestion, { topK: 5 })
const block = relevant.map(z => z.quote).join('\n') // ~100–200 tokens

Cloudflare Workers: persistent compressed memory in KV
import { compress, encode, decode, recall } from 'zettel-compress'
// Assumed binding shape; KVNamespace comes from @cloudflare/workers-types
interface Env { KV: KVNamespace }

export default {
  async fetch(request: Request, env: Env) {
    // request.json() is untyped, so state the expected body shape explicitly
    const { sessionId, message, question } = (await request.json()) as {
      sessionId: string; message?: string; question?: string
    }
    if (question) {
      const stored = await env.KV.get(`memory:${sessionId}`)
      if (!stored) return Response.json([])
      const hits = recall(decode(stored), question, { topK: 5 })
      return Response.json(hits.map(z => z.quote))
    }
    // append to the session log, store the compressed memory alongside it
    const log = ((await env.KV.get(`log:${sessionId}`)) ?? '') + '\n\n' + message
    await env.KV.put(`log:${sessionId}`, log)
    await env.KV.put(`memory:${sessionId}`, encode(compress(log)))
    return new Response('ok')
  },
}

Emotion states detected
conviction, grief, joy, fear, hope, trust, wonder, rage, exhaustion, shame, pride, nostalgia, anxiety, relief, anticipation, frustration, gratitude, loneliness, inspiration, confusion, clarity, guilt, awe, regret, determination, vulnerability, acceptance, resistance, love, loss
Importance flags
| Flag | Triggered by |
|---|---|
| DECISION | "decided", "chose", "committed", "concluded", "agreed to", "going to", "we will", "final decision" |
| ORIGIN | "founded", "originated", "first time ever", "inception", "birth of", "how it began" |
| CORE | "fundamental", "essential", "key principle", "foundation of", "bedrock", "non-negotiable" |
| PIVOT | "turning point", "breakthrough", "changed everything", "transformed", "pivotal", "game changer" |
| GENESIS | "led to", "resulted in", "because of this", "gave rise to", "which caused", "set in motion" |
| TECHNICAL | "architecture", "implement", "deploy", "config", "database", "module", "infrastructure", "stack", "endpoint", "schema" |
All keyword matching is word-boundary anchored with negation-scope handling ("we never decided" does not flag). TECHNICAL is metadata only — it does not affect zettel weight so technical detail chunks don't crowd out decisions and emotional moments in ranked selection.
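Word-boundary matching with a negation scope can be sketched as below. The trigger phrases are taken from the DECISION row of the table; the negator list, the three-word look-back window, and the name hasDecisionFlag are assumptions for illustration, not the library's actual rules.

```typescript
const DECISION_PHRASES = ['decided', 'chose', 'committed', 'concluded', 'agreed to', 'final decision']
// Assumption: a small illustrative negator list.
const NEGATORS = new Set(['not', 'never', 'no', "didn't", "haven't", "won't"])

// Hypothetical helper: true when a trigger phrase appears as whole words
// and no negator occurs within the few words just before it.
function hasDecisionFlag(sentence: string, negationWindow = 3): boolean {
  const text = sentence.toLowerCase()
  for (const phrase of DECISION_PHRASES) {
    // \b anchors stop "decided" from matching inside "undecided".
    const pattern = new RegExp('\\b' + phrase + '\\b', 'g')
    let match: RegExpExecArray | null
    while ((match = pattern.exec(text)) !== null) {
      const wordsBefore = text.slice(0, match.index).match(/[a-z']+/g) ?? []
      const scope = wordsBefore.slice(-negationWindow)
      if (!scope.some(word => NEGATORS.has(word))) return true
    }
  }
  return false
}
```

A fixed look-back window is a deliberately simple notion of scope: it handles "we never decided" but will not understand negation expressed many words earlier or through sentence structure.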
Reproducing the benchmarks
npm run bench # deterministic: performance, compression, budgets,
# round-trip, answer-in-context QA, MRR, entities, streaming
npm run bench:llm   # end-to-end QA with a real model; needs OPENAI_API_KEY in .env

Both harnesses use seeded PRNGs: same machine, same numbers.
License
MIT
