zettel-compress

v1.0.0

Published

4 days ago

Deterministic memory engine for LLM apps — compress, search, and inject conversation memory offline, on-device, or at the edge. Zero external calls, zero infrastructure, zero dependencies.

Downloads

1,397

zettel-compress

Memory for LLM apps that runs anywhere your code runs — offline, on-device, at the edge. Compresses conversation history into structured, searchable memory and hands your model exactly the context it needs. Zero external calls. Zero infrastructure. Zero dependencies. 18 kB gzipped.

And the property nothing else in this space has: it's deterministic — the same messages produce byte-identical memory, every time, on every machine. Memory you can snapshot-test, replay in CI, diff in code review, and trust in an air-gapped deployment.

▶ Try it in the playground — the full engine runs client-side in your tab (14 kB gzipped): paste a conversation, watch it become structured memory, ask it questions.

import { compress, recall, injectContext } from 'zettel-compress'

const memory = compress(conversationHistory)

// search memory at question time — BM25 + graph expansion, no embeddings
recall(memory, 'what did we decide about authentication?', { topK: 5 })

// or inject a hard-budgeted memory block into your prompt
injectContext(memory, { maxTokenBudget: 300, format: 'markdown' })

Measured results

Every number below is reproducible: npm run bench (deterministic) and npm run bench:llm (requires an OpenAI key in .env). Datasets: a real assistant conversation (~2.5k tokens) and two public-domain books — Pride and Prejudice (~182k tokens) and On the Origin of Species (~233k tokens) — each with 12 decision facts planted at seeded positions.

Can a real model answer questions from the compressed memory?

QA accuracy with gpt-4o-mini answering from each context (the reply must contain the unique answer token):

| context given to the model | avg tokens | conversation | novel (182k) | science (233k) |
|---|---|---|---|---|
| nothing | n/a | 0% | 0% | 0% |
| first 300 tokens of the document | ~370 | 8% | 0% | 0% |
| injectContext top-10, markdown | 430–860 | 58% | 0% | 0% |
| injectContext budget-300, markdown | ~290 | 33% | 0% | 0% |
| recall(question) top-5 | 90–200 | 58% | 67% | 75% |

The honest read:

recall() is the headline. From a 233k-token corpus, ~150 tokens of retrieved memory let the model answer 75% of questions — the naive same-cost baseline answers 0%. No embeddings, no API, sub-millisecond.
Static injection is for conversation-scale memory. On a 19-zettel conversation, top-10 injection carries 83% of planted decision signals (58% end-to-end with the model). On a 1,085-zettel book, 10 zettels cannot cover 12 scattered facts — use recall() for archives.
Use format: 'markdown' when injecting directly into prompts — models read plain quotes better than the compact AAAK lines (33% vs 17% at the same budget in our runs). Use AAAK for storage and round-tripping.

Public benchmark: LoCoMo-10 (very-long-term conversational QA)

Full protocol over all 1,986 questions: 10 multi-session conversations (260k tokens total) compressed once each, gpt-4o-mini answering from recallContext(question) top-10 passages only (npm run bench:locomo).

| retrieval mode | overall F1 (cat 1–4) | single-hop | temporal | multi-hop | answer-in-context |
|---|---|---|---|---|---|
| quotes only (pre-0.3) | 5.9 | 8.0 | 1.3 | 5.4 | 9.7% |
| small-to-big (recallContext) | 42.4 | 58.4 | 23.9 | 27.8 | 38.1% |

Adversarial category (446 unanswerable questions): 88.3% correct abstention, 5.2% trapped. Average context: ~1,700 tokens/question — under 1% of the conversation. No-context baseline: F1 0.0.

Honest framing: long-context GPT-4-class models with the entire conversation in the prompt score in the 30s–50s F1 band on these categories; LLM-write memory systems (mem0) report ~67 under a more lenient LLM-judge metric. zettel-compress reaches the full-context band at ~1% of the tokens with zero model calls on the memory side. Ceiling analysis: only 45.1% of LoCoMo gold answers appear verbatim anywhere in the conversation (annotators reworded the rest), and BM25 retrieval already surfaces 86% of that ceiling in the top-10 — a measured GloVe-blend spike gained +0.4 points, so the remaining gap is answer rewording and metric strictness, not retrieval ranking.
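To make "answer rewording and metric strictness" concrete, here is a minimal sketch of a token-overlap F1 score in the SQuAD style. It is an assumption that the benchmark harness scores this way; the exact normalization used by LoCoMo or by `npm run bench:locomo` may differ, and the names `normalizeTokens` and `tokenF1` are hypothetical.

```typescript
// Lowercase, strip punctuation, split on whitespace (ASCII-only for brevity).
function normalizeTokens(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter((token) => token.length > 0)
}

// Harmonic mean of token precision and token recall against the gold answer.
function tokenF1(prediction: string, gold: string): number {
  const predicted = normalizeTokens(prediction)
  const reference = normalizeTokens(gold)
  if (predicted.length === 0 || reference.length === 0) {
    return predicted.length === reference.length ? 1 : 0
  }

  // Count gold tokens, then consume them as predicted tokens match.
  const remaining = new Map<string, number>()
  for (const token of reference) {
    remaining.set(token, (remaining.get(token) ?? 0) + 1)
  }
  let overlap = 0
  for (const token of predicted) {
    const left = remaining.get(token) ?? 0
    if (left > 0) {
      overlap++
      remaining.set(token, left - 1)
    }
  }
  if (overlap === 0) return 0

  const precision = overlap / predicted.length
  const recall = overlap / reference.length
  return (2 * precision * recall) / (precision + recall)
}

// A verbatim answer scores 1; a correct but reworded answer scores far lower.
const verbatimScore = tokenF1('JWT tokens', 'JWT tokens')
const rewordedScore = tokenF1('They chose JSON web tokens', 'JWT tokens')
```

Under this metric the reworded reply shares one token with the gold answer (precision 1/5, recall 1/2), so it scores about 0.29 even though a human would accept it. That is the kind of gap a strict lexical metric leaves on the table.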

Speed, size, and guarantees

| dataset | input tokens | zettels | compress time | throughput |
|---|---|---|---|---|
| conversation | 2,571 | 19 | 5.3 ms | ~485 tok/ms |
| novel | 182,179 | 1,085 | 757 ms | ~241 tok/ms |
| science | 233,248 | 753 | 525 ms | ~444 tok/ms |

Compression: injectContext top-10 reduces the 233k-token text to 882 tokens (0.38%); a 300-token budget always lands ≤ its ceiling (measured 82–100% utilization, zero overruns across all tiers and datasets).
Lossless round-trip: decode(encode(result)) reproduces zettels, tunnels, and the entity index exactly — verified by deep equality on all three datasets and a 200-case property test (multi-line quotes, unicode, pipes, snake_case all survive).
Streaming: CompressStream processes ~0.1 ms/message with bounded memory.
Entity detection: 100% precision / 100% recall on the labeled benchmark fixture (10 gold entities among changelog/chat noise).

When to use it — and when not to

Use zettel-compress when your constraints look like this:

Offline / on-device / air-gapped — local-first apps, privacy-bound deployments, anywhere user text must never leave the process
Edge runtimes — Cloudflare Workers, Vercel Edge, browsers: no vector DB to stand up, no embedding service to call, nothing to operate
Determinism matters — agent test suites, replayable sessions, auditable memory; byte-identical output is something no LLM- or embedding-based memory can offer even in principle
No external calls allowed — compliance, latency budgets measured in milliseconds, or simply zero appetite for another vendor dependency

Use something else when:

You want maximum recall quality and can run infrastructure → embeddings RAG (pgvector + any embedding model) handles paraphrase ("pottery class" ↔ "ceramics workshop") better than lexical matching ever will
You want managed memory with fact-updating and contradiction handling → mem0 / Zep are good products; they trade API calls, latency, and nondeterminism for higher QA scores
Your conversations are short → a sliding window of recent messages is simpler and good enough

| | zettel-compress | mem0 / Zep | embeddings RAG |
|---|---|---|---|
| External calls | none, ever | every write | every index + query |
| Infrastructure | none | hosted service | vector store |
| Runs offline / on-device / edge | yes | no | rarely |
| Deterministic / replayable / testable | byte-exact | no | no |
| Lossless text serialization (diffable memory) | yes (AAAK) | no | no |
| Semantic paraphrase matching | no (lexical + graph) | yes | yes |
| LoCoMo QA (our measurement / their reported) | 41.6 F1 | ~67 (LLM-judged) | unmeasured here |

The trade is explicit and we publish the numbers on it.

How it works

Text is chunked on paragraph boundaries (overlap snaps to word boundaries; every chunk carries exact source offsets). Each chunk becomes a zettel:

entities — proper nouns detected by capitalization evidence (sentence-start noise like Added, Please is filtered; chat speaker labels are kept), with pronoun coreference: she/he link to the most recent gender-matching entity, so a person stays attached to the conversation after their first mention
topics — key terms with CamelCase/ALL-CAPS/hyphenation boosts
quote — the most information-dense sentence (TextRank blended with decision-word density; falls back gracefully on lowercase chat text)
weight — importance in [0, 1], rank-normalized with tie-aware midranks (equal raw scores always get equal weights; relative within a result)
flags — DECISION | ORIGIN | CORE | PIVOT | GENESIS | TECHNICAL
emotions — 30 states via word-boundary lexicons with negation scope (a useful filtering signal, not sentiment analysis — calibrate expectations accordingly)
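As an illustration of the weight field, the sketch below shows one way to rank-normalize raw scores into [0, 1] with tie-aware midranks, so equal raw scores always receive equal weights. It is not the library's implementation: the function name `midrankWeights` is hypothetical, and the softmax `temperature` option mentioned elsewhere in this README is left out.

```typescript
// Map raw scores to [0, 1] by rank. Tied scores share the average (mid) rank
// of their group, so they end up with exactly the same weight.
function midrankWeights(scores: number[]): number[] {
  const n = scores.length
  if (n === 0) return []
  if (n === 1) return [1]

  // Indices ordered by ascending raw score.
  const order = scores.map((_, index) => index).sort((a, b) => scores[a] - scores[b])
  const weights = new Array<number>(n).fill(0)

  let groupStart = 0
  while (groupStart < n) {
    // Extend the group while the next score ties with the group's score.
    let groupEnd = groupStart
    while (groupEnd + 1 < n && scores[order[groupEnd + 1]] === scores[order[groupStart]]) {
      groupEnd++
    }
    const midrank = (groupStart + groupEnd) / 2 // 0-based average rank of the tie group
    for (let k = groupStart; k <= groupEnd; k++) {
      weights[order[k]] = midrank / (n - 1)
    }
    groupStart = groupEnd + 1
  }
  return weights
}

// The two 0.2 scores tie, so they share one weight; the top score maps to 1.
const sampleWeights = midrankWeights([0.2, 0.9, 0.2, 0.5])
```

Because the output depends only on ranks within the input, the weights are relative to the result they were computed for, which matches the "relative within a result" note above.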

Tunnels link zettels sharing entities/topics above a Jaccard threshold (capped per zettel). recall() runs BM25 over each zettel's source chunk, or over quotes+topics+entities when no source text is kept, with automatic synonym expansion (move↔relocate, marry↔wedding, job↔career, and city abbreviations like NYC↔"New York") and a date-proximity bonus that boosts zettels whose resolvedDate matches a year/month mentioned in the query. Hits can then expand one associative hop along tunnels with personalized PageRank.
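The tunnel rule (link two zettels when the Jaccard similarity of their entities and topics clears a threshold, with a cap per zettel) can be sketched as follows. This is an illustrative sketch rather than the library's code: the `MiniZettel` and `Tunnel` shapes, the `buildTunnels` name, and the strongest-first capping order are assumptions, and the default values simply mirror the tunnelThreshold and tunnelTopK values shown in the options listing.

```typescript
interface MiniZettel {
  id: string
  labels: string[] // entities and topics pooled together (an assumption)
}

interface Tunnel {
  from: string
  to: string
  similarity: number
}

// Jaccard similarity: size of the intersection divided by size of the union.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a)
  const setB = new Set(b)
  let shared = 0
  setA.forEach((label) => {
    if (setB.has(label)) shared++
  })
  const union = setA.size + setB.size - shared
  return union === 0 ? 0 : shared / union
}

function compareIds(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0
}

function buildTunnels(zettels: MiniZettel[], threshold = 0.3, topK = 3): Tunnel[] {
  const candidates: Tunnel[] = []
  for (let i = 0; i < zettels.length; i++) {
    for (let j = i + 1; j < zettels.length; j++) {
      const similarity = jaccard(zettels[i].labels, zettels[j].labels)
      if (similarity >= threshold) {
        candidates.push({ from: zettels[i].id, to: zettels[j].id, similarity })
      }
    }
  }

  // Strongest links first; ties break on ids so the output order is deterministic.
  candidates.sort(
    (x, y) => y.similarity - x.similarity || compareIds(x.from, y.from) || compareIds(x.to, y.to),
  )

  // Keep a link only while both endpoints are under the per-zettel cap.
  const degree = new Map<string, number>()
  const kept: Tunnel[] = []
  for (const link of candidates) {
    const fromDegree = degree.get(link.from) ?? 0
    const toDegree = degree.get(link.to) ?? 0
    if (fromDegree >= topK || toDegree >= topK) continue
    degree.set(link.from, fromDegree + 1)
    degree.set(link.to, toDegree + 1)
    kept.push(link)
  }
  return kept
}

const sampleTunnels = buildTunnels([
  { id: '001', labels: ['Alice', 'Bob', 'authentication'] },
  { id: '002', labels: ['Alice', 'Bob', 'security'] },
  { id: '003', labels: ['pottery', 'weekend'] },
])
```

Zettels 001 and 002 share two of four distinct labels (similarity 0.5), so they are linked; 003 shares nothing and stays unlinked.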

Documentation

Getting started — core concepts, the three verbs (compress → recall → inject), persistence, options
Integration recipes — chatbot memory loop, Vercel AI SDK, Cloudflare Workers + KV, browser/local-first, streams, multi-session, exact token budgets
Deterministic testing — snapshot-test and replay your agent's memory in CI (the thing only deterministic memory can do)

Install

npm install zettel-compress

Quick start

import { compress, injectContext, recall, recallContext, wakeUp, CompressStream } from 'zettel-compress'

const result = compress(conversationHistory)

// hard token budget — measured output, never exceeds the ceiling
const block = injectContext(result, { maxTokenBudget: 300, format: 'markdown' })

// guarantee decisions survive selection even when ranked low
injectContext(result, { maxZettels: 10, guaranteeFlags: ['DECISION'] })

// diversity-aware selection (maximal marginal relevance)
injectContext(result, { maxZettels: 10, selection: 'mmr' })

// search memory at question time — ranked zettels, or ready-to-inject passages
const hits = recall(result, 'what did we decide about auth?', { topK: 5 })
const context = recallContext(result, 'what did we decide about auth?', { maxTokens: 2000 })

// short narrative of the top moments (top 15% by weight)
const summary = wakeUp(result)

// streaming: compress each message as it arrives, bounded memory
const mem = new CompressStream({ halfLifeTurns: 50, maxZettels: 200 })
mem.push('Alice: the login service keeps timing out')
mem.push('Bob: we decided to rotate tokens hourly')
mem.recall('token decision')   // search the live stream
mem.snapshot()                 // CompressResult at any point — replayable
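The `selection: 'mmr'` option refers to maximal marginal relevance. As a rough sketch of the general technique (not the library's code), each pick maximizes relevance minus a penalty for similarity to what is already selected. The `Candidate` shape, the use of topic Jaccard overlap as the similarity measure, and the `lambda` trade-off value are all assumptions made for illustration.

```typescript
interface Candidate {
  id: string
  weight: number
  topics: string[]
}

// Jaccard overlap of two topic lists, used here as the redundancy measure.
function topicOverlap(a: string[], b: string[]): number {
  const setA = new Set(a)
  const setB = new Set(b)
  let shared = 0
  setA.forEach((topic) => {
    if (setB.has(topic)) shared++
  })
  const union = setA.size + setB.size - shared
  return union === 0 ? 0 : shared / union
}

// Greedy MMR: score = lambda * relevance - (1 - lambda) * max similarity to chosen.
function selectMmr(candidates: Candidate[], k: number, lambda = 0.7): Candidate[] {
  const pool = candidates.slice()
  const chosen: Candidate[] = []
  while (chosen.length < k && pool.length > 0) {
    let bestIndex = 0
    let bestScore = -Infinity
    for (let i = 0; i < pool.length; i++) {
      let maxSimilarity = 0
      for (const picked of chosen) {
        maxSimilarity = Math.max(maxSimilarity, topicOverlap(pool[i].topics, picked.topics))
      }
      const score = lambda * pool[i].weight - (1 - lambda) * maxSimilarity
      if (score > bestScore) {
        bestScore = score
        bestIndex = i
      }
    }
    chosen.push(pool.splice(bestIndex, 1)[0])
  }
  return chosen
}

const diversePicks = selectMmr(
  [
    { id: 'a', weight: 0.9, topics: ['auth', 'jwt'] },
    { id: 'b', weight: 0.85, topics: ['auth', 'jwt'] },
    { id: 'c', weight: 0.6, topics: ['billing', 'stripe'] },
  ],
  2,
).map((candidate) => candidate.id)
```

Pure weight ranking would pick a and b, which say the same thing twice. With the redundancy penalty the second pick becomes c, trading a little weight for coverage of a second topic.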

API

`compress(text, options?): CompressResult`

compress(text, {
  chunkSize: 800,          // chars per chunk (default 800)
  chunkOverlap: 100,       // overlap, snapped to word boundaries (default 100)
  date: '2026-06-12',      // ISO date for the AAAK header
  title: 'My Session',     // title for the AAAK header
  minEntityFrequency: 1,   // min occurrences to count as entity
  stopWords: ['foo'],      // extra stop words for topic extraction
  temperature: 0.5,        // softmax temperature for weight spread
  tunnelThreshold: 0.3,    // min Jaccard similarity for a tunnel
  tunnelTopK: 3,           // max tunnels per zettel
  dedupe: true,            // merge near-duplicate zettels (default false)
  dedupeThreshold: 0.9,    // token-set Jaccard that counts as duplicate
  verboseLabels: true,     // tunnel labels as Alice+Bob instead of ALC+BBB
  keepSource: true,        // retain normalized input on meta.source for
                           // provenance-expanded recall (default true)
})

Zettels carry exact sourceStart/sourceEnd offsets into meta.source; the offsets serialize in AAAK, the source text never does (the format stays compact — re-supply the text via recallContext's source option after decoding).
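The chunking behaviour described for chunkSize, chunkOverlap, and the source offsets can be sketched as below. This is a minimal illustration under stated assumptions, not the library's chunker: `chunkWithOffsets` and the `Chunk` shape are hypothetical, a paragraph break is taken to be a blank line, and a chunk that cannot end on a paragraph break falls back to the last space.

```typescript
interface Chunk {
  text: string
  sourceStart: number // inclusive offset into the source
  sourceEnd: number // exclusive offset into the source
}

function chunkWithOffsets(source: string, chunkSize = 800, chunkOverlap = 100): Chunk[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize')
  }
  const chunks: Chunk[] = []
  let start = 0
  let previousEnd = 0

  while (start < source.length) {
    const limit = Math.min(start + chunkSize, source.length)
    let end = limit
    if (limit < source.length) {
      // Prefer ending after the last paragraph break in the window, then after
      // the last space, as long as that moves past the previous chunk's end.
      const paragraphBreak = source.lastIndexOf('\n\n', limit - 2)
      const lastSpace = source.lastIndexOf(' ', limit - 1)
      if (paragraphBreak !== -1 && paragraphBreak + 2 > previousEnd) {
        end = paragraphBreak + 2
      } else if (lastSpace !== -1 && lastSpace + 1 > previousEnd) {
        end = lastSpace + 1
      }
    }
    chunks.push({ text: source.slice(start, end), sourceStart: start, sourceEnd: end })
    if (end === source.length) break

    // Step back by the overlap, then move forward to the start of a whole word.
    let next = Math.max(start, end - chunkOverlap)
    while (next > 0 && next < end && !/\s/.test(source[next - 1])) next++
    previousEnd = end
    start = next
  }
  return chunks
}

const sampleText =
  'Alice proposed JWT tokens for auth.\n\n' +
  'Bob agreed and we decided to rotate them hourly.\n\n' +
  'Carol will update the schema next week.'
const sampleChunks = chunkWithOffsets(sampleText, 60, 15)
```

Every chunk satisfies `chunk.text === sampleText.slice(chunk.sourceStart, chunk.sourceEnd)`, which is the property that lets offsets be stored compactly while the source text is supplied separately.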

Tunnel building switches to MinHash/LSH candidate generation above 500 zettels: 10,000 zettels link in ~400 ms, avoiding the ~50M pairwise comparisons of the exhaustive approach, and the result stays deterministic.
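For readers unfamiliar with MinHash/LSH, the sketch below shows the general idea with fixed seeds, so the output is the same on every run. It illustrates the technique only; the hash function, the number of hashes, the band size, and the names `seededHash`, `minhashSignature`, `estimateJaccard`, and `bandKeys` are assumptions and are not taken from the library.

```typescript
// 32-bit FNV-1a hash followed by a seeded avalanche step (murmur3 finalizer).
function seededHash(text: string, seed: number): number {
  let h = 2166136261
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i)
    h = Math.imul(h, 16777619)
  }
  h ^= Math.imul(seed + 1, 0x9e3779b9)
  h ^= h >>> 16
  h = Math.imul(h, 0x85ebca6b)
  h ^= h >>> 13
  h = Math.imul(h, 0xc2b2ae35)
  h ^= h >>> 16
  return h >>> 0
}

// One minimum per hash function; the signature ignores token order.
function minhashSignature(tokens: string[], numHashes = 64): number[] {
  const signature: number[] = []
  for (let seed = 0; seed < numHashes; seed++) {
    let smallest = 0xffffffff
    for (const token of tokens) {
      smallest = Math.min(smallest, seededHash(token, seed))
    }
    signature.push(smallest)
  }
  return signature
}

// The fraction of matching signature slots estimates the Jaccard similarity.
function estimateJaccard(a: number[], b: number[]): number {
  let equal = 0
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) equal++
  }
  return a.length === 0 ? 0 : equal / a.length
}

// LSH banding: two items become candidates when any band key matches,
// so only items sharing a bucket need an exact comparison.
function bandKeys(signature: number[], rowsPerBand = 4): string[] {
  const keys: string[] = []
  for (let i = 0; i < signature.length; i += rowsPerBand) {
    keys.push(i + ':' + signature.slice(i, i + rowsPerBand).join(','))
  }
  return keys
}

const signatureA = minhashSignature(['alice', 'bob', 'auth'])
const signatureB = minhashSignature(['auth', 'alice', 'bob'])
const signatureC = minhashSignature(['pottery', 'weekend', 'saturday'])
```

Grouping items by band key replaces the all-pairs loop with bucket lookups, which is where the savings at 10,000 zettels would come from. No random state is involved, so the same input always yields the same candidates.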

`recall(result, query, options?): Zettel[]`

Query-time retrieval: BM25 over each zettel's full source chunk (falling back to quote/topics/entities when no source is kept), with built-in synonym expansion (common paraphrase clusters: move/relocate/transfer, job/career/work, marry/wedding/spouse, and city abbreviations) and a date-proximity bonus that promotes zettels whose resolvedDate matches any year/month found in the query. Hits optionally expand one hop along the tunnel graph with personalized PageRank. Options: { topK?: number, hops?: boolean, expandQuery?: boolean, after?: string, before?: string }. Output is deterministic.
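For reference, plain Okapi BM25 ranking looks roughly like the sketch below. It is a generic illustration, not the library's scorer: it leaves out synonym expansion, the date-proximity bonus, and tunnel hops, and the tokenizer, the `k1` and `b` values, and the name `bm25Rank` are assumptions.

```typescript
function splitTerms(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((term) => term.length > 0)
}

interface Ranked {
  index: number
  score: number
}

// Okapi BM25 with the usual k1 (term-frequency saturation) and b (length normalization).
function bm25Rank(docs: string[], query: string, k1 = 1.2, b = 0.75): Ranked[] {
  const tokenized = docs.map(splitTerms)
  const totalLength = tokenized.reduce((sum, doc) => sum + doc.length, 0)
  const averageLength = totalLength / Math.max(1, docs.length)

  // Document frequency: in how many documents each term appears.
  const docFrequency = new Map<string, number>()
  for (const doc of tokenized) {
    new Set(doc).forEach((term) => {
      docFrequency.set(term, (docFrequency.get(term) ?? 0) + 1)
    })
  }

  const queryTerms = Array.from(new Set(splitTerms(query)))
  const scored = tokenized.map((doc, index) => {
    let score = 0
    for (const term of queryTerms) {
      const tf = doc.filter((t) => t === term).length
      if (tf === 0) continue
      const n = docFrequency.get(term) ?? 0
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5))
      const lengthNorm = 1 - b + (b * doc.length) / averageLength
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm)
    }
    return { index, score }
  })

  // Highest score first; ties break on document order to stay deterministic.
  return scored.filter((r) => r.score > 0).sort((x, y) => y.score - x.score || x.index - y.index)
}

const rankedDocs = bm25Rank(
  [
    'We decided to use JWT tokens for authentication.',
    'Alice booked a pottery class on Saturday.',
    'The login service keeps timing out.',
  ],
  'what did we decide about authentication?',
)
```

Only the first document matches, through the exact terms "we" and "authentication". The query word "decide" does not match "decided", which shows why purely lexical scoring benefits from the synonym expansion described above.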

`recallContext(result, query, options?): string`

Small-to-big retrieval, the recommended way to build LLM context. It ranks on the compact zettel index, then returns the full source passages the hits came from: overlapping spans merge, a token budget admits passages in rank order, and the output assembles in document order so narrative/temporal flow survives. Options: { topK?, hops?, maxTokens?, source? }. Falls back to quotes when no source text is available (e.g. decoded AAAK; pass source to restore it). This is what lifted LoCoMo F1 from 5.9 to 41.6.
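The three steps named here (merge overlapping spans, admit passages in rank order under a token budget, emit in document order) can be sketched as follows. This is a sketch under assumptions rather than the library's code: `Passage`, `assembleContext`, and the four-characters-per-token estimate are hypothetical, and a hit that would overflow the budget is skipped rather than ending the loop.

```typescript
interface Passage {
  sourceStart: number
  sourceEnd: number
}

// Rough token estimate (about 4 characters per token); swap in a real tokenizer if needed.
function roughTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Sort by position and fuse spans that touch or overlap.
function mergeOverlapping(spans: Passage[]): Passage[] {
  const sorted = spans.slice().sort((a, b) => a.sourceStart - b.sourceStart)
  const merged: Passage[] = []
  for (const span of sorted) {
    const last = merged[merged.length - 1]
    if (last && span.sourceStart <= last.sourceEnd) {
      last.sourceEnd = Math.max(last.sourceEnd, span.sourceEnd)
    } else {
      merged.push({ sourceStart: span.sourceStart, sourceEnd: span.sourceEnd })
    }
  }
  return merged
}

function assembleContext(source: string, rankedHits: Passage[], maxTokens: number): string {
  let admitted: Passage[] = []
  for (const hit of rankedHits) {
    // Try the hit, merged with what is already admitted, against the budget.
    const candidate = mergeOverlapping(admitted.concat([hit]))
    const cost = candidate.reduce(
      (sum, span) => sum + roughTokens(source.slice(span.sourceStart, span.sourceEnd)),
      0,
    )
    if (cost <= maxTokens) admitted = candidate
  }
  // mergeOverlapping returns spans sorted by position, i.e. document order.
  return admitted
    .map((span) => source.slice(span.sourceStart, span.sourceEnd).trim())
    .join('\n\n')
}

function passageOf(source: string, phrase: string): Passage {
  const start = source.indexOf(phrase)
  return { sourceStart: start, sourceEnd: start + phrase.length }
}

const transcript = 'Alice proposed JWT. Bob agreed to rotate tokens hourly. Lunch was fine.'
const rankedHits = [
  passageOf(transcript, 'rotate tokens hourly.'), // best hit, late in the text
  passageOf(transcript, 'Alice proposed JWT.'), // second hit, early in the text
  passageOf(transcript, 'Bob agreed to rotate'), // overlaps the best hit
]
const roomyContext = assembleContext(transcript, rankedHits, 100)
const tightContext = assembleContext(transcript, rankedHits, 6)
```

With a roomy budget the two overlapping hits fuse into one sentence and the output follows the order of the transcript, not the order of the ranking. With a budget of 6 tokens only the best hit fits.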

`injectContext(result, options?): string`

injectContext(result, {
  maxZettels: 10,            // top N by 0.7·weight + 0.3·signal-flag bonus
  selection: 'mmr',          // 'weight' (default) | 'mmr' diversity selection
  guaranteeFlags: ['DECISION'], // always include one zettel per flag if present
  minWeight: 0.5,            // weight floor
  flags: ['DECISION'],       // filter to flags
  format: 'markdown',        // 'aaak' (default) | 'json' | 'markdown'
  maxTokenBudget: 300,       // hard ceiling — output measured, never exceeded
  countTokens: myTokenizer,  // optional exact counter (e.g. js-tiktoken)
})

Only tunnels and entity-index entries belonging to the selected zettels are emitted.
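How these options might combine can be sketched as below: rank by the documented 0.7·weight + 0.3·bonus score, move one zettel per guaranteed flag to the front, then admit zettels while both the count cap and the token ceiling hold. The details are assumptions for illustration: the bonus is taken to be 1 for any flag other than TECHNICAL, `countWords` is a stand-in for a real tokenizer, and `selectForInjection` is a hypothetical name, not the library's function.

```typescript
interface ScoredZettel {
  id: string
  weight: number
  flags: string[]
  quote: string
}

// Assumption: the signal-flag bonus is 1 when any non-TECHNICAL flag is present.
function selectionScore(z: ScoredZettel): number {
  const bonus = z.flags.some((flag) => flag !== 'TECHNICAL') ? 1 : 0
  return 0.7 * z.weight + 0.3 * bonus
}

function selectForInjection(
  zettels: ScoredZettel[],
  maxZettels: number,
  guaranteeFlags: string[],
  maxTokenBudget: number,
  countTokens: (text: string) => number,
): ScoredZettel[] {
  const ranked = zettels
    .slice()
    .sort((a, b) => selectionScore(b) - selectionScore(a) || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))

  // Guaranteed picks go first so the cap and budget cannot squeeze them out.
  const ordered: ScoredZettel[] = []
  for (const flag of guaranteeFlags) {
    const pick = ranked.find((z) => z.flags.indexOf(flag) !== -1)
    if (pick && ordered.indexOf(pick) === -1) ordered.push(pick)
  }
  for (const z of ranked) {
    if (ordered.indexOf(z) === -1) ordered.push(z)
  }

  // Measure before admitting, so the total never exceeds the ceiling.
  const chosen: ScoredZettel[] = []
  let used = 0
  for (const z of ordered) {
    if (chosen.length >= maxZettels) break
    const cost = countTokens(z.quote)
    if (used + cost > maxTokenBudget) continue
    chosen.push(z)
    used += cost
  }
  return chosen
}

function countWords(text: string): number {
  return text.split(/\s+/).filter((word) => word.length > 0).length
}

const pool: ScoredZettel[] = [
  { id: 'a', weight: 0.9, flags: [], quote: 'The login service keeps timing out.' },
  { id: 'b', weight: 0.1, flags: ['DECISION'], quote: 'We decided to rotate tokens hourly.' },
  { id: 'c', weight: 0.8, flags: ['TECHNICAL'], quote: 'The schema lives in the auth module.' },
]
const plainPick = selectForInjection(pool, 2, [], 100, countWords).map((z) => z.id)
const guaranteedPick = selectForInjection(pool, 2, ['DECISION'], 100, countWords).map((z) => z.id)
const budgetedPick = selectForInjection(pool, 2, ['DECISION'], 6, countWords).map((z) => z.id)
```

Without a guarantee the low-weight decision (b) loses to a and c. With `guaranteeFlags: ['DECISION']` it is admitted first, and under a 6-token ceiling it is the only zettel that fits.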

`CompressStream`

Incremental memory for message streams. Members: push(text), snapshot(), recall(query, opts?), size. Options: all of CompressOptions plus halfLifeTurns (recency decay, measured in pushes) and maxZettels (bounded memory via lowest-decayed-weight eviction). With dedupe: true, a re-sent or boilerplate message refreshes the recency of the zettel it duplicates instead of growing the stream, so repetition strengthens a memory rather than copying it. Entity codes never change once assigned; replaying the same pushes reproduces a byte-identical snapshot.
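The interaction of halfLifeTurns and maxZettels can be pictured with the sketch below. It assumes exponential decay (a weight halves every halfLifeTurns pushes) and eviction of the entry with the lowest decayed weight; the `DecayingMemory` class and its members are hypothetical and stand in for, rather than reproduce, CompressStream.

```typescript
interface StreamEntry {
  text: string
  weight: number
  turn: number // push counter value when the entry was added
}

class DecayingMemory {
  private entries: StreamEntry[] = []
  private turn = 0
  private readonly halfLifeTurns: number
  private readonly maxZettels: number

  constructor(halfLifeTurns: number, maxZettels: number) {
    this.halfLifeTurns = halfLifeTurns
    this.maxZettels = maxZettels
  }

  // Weight after recency decay: halves every halfLifeTurns pushes.
  decayedWeight(entry: StreamEntry): number {
    const age = this.turn - entry.turn
    return entry.weight * Math.pow(0.5, age / this.halfLifeTurns)
  }

  push(text: string, weight: number): void {
    this.turn++
    this.entries.push({ text, weight, turn: this.turn })
    // Bounded memory: evict the entry with the lowest decayed weight.
    while (this.entries.length > this.maxZettels) {
      let worst = 0
      for (let i = 1; i < this.entries.length; i++) {
        if (this.decayedWeight(this.entries[i]) < this.decayedWeight(this.entries[worst])) {
          worst = i
        }
      }
      this.entries.splice(worst, 1)
    }
  }

  texts(): string[] {
    return this.entries.map((entry) => entry.text)
  }
}

const boundedMemory = new DecayingMemory(2, 2)
boundedMemory.push('old important', 0.9)
boundedMemory.push('recent minor', 0.5)
boundedMemory.push('newest', 0.4)
const survivingTexts = boundedMemory.texts()
```

After the third push the decayed weights are 0.45, about 0.35, and 0.4, so the middle entry is evicted: an old but important memory can outlive a newer, weaker one. Replaying the same pushes always evicts the same entries, which is what makes a snapshot reproducible.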

`wakeUp(result, topPct = 0.15): string`

Narrative summary of the top topPct zettels by weight (plus ORIGIN/CORE/GENESIS flags), capped at 5. Never empty on non-empty input.

`encode(result): string` / `decode(aaak, options?): CompressResult`

AAAK v2 text serialization — fully lossless: E: lines carry the entity index, quotes/topics/headers are escaped (multi-line quotes, ", |, snake_case topics all survive exactly). decode reads v1 and v2; { strict: true } throws on malformed lines, default mode collects meta.warnings (including header-count mismatches and unknown emotion/flag tokens).

FILE:002|ALC+BOB|2026-06-12|Auth Design|v2
E:ALC=Alice;BOB=Bob
001:ALC+BOB|authentication,security|"We decided to use JWT tokens."|0.91|conviction|DECISION+TECHNICAL
T:001<->002|ALC+BOB

Others

compressMany(texts, options?) · mergeResults(results) (re-normalizes weights onto one scale) · topZettels(result, n) · normalizeWeights(zettels, temperature?) · estimateTokens(text) · encodeZettelLine / encodeTunnelLine · runtime constants ALL_FLAGS, ALL_EMOTIONS.

Integration examples

Vercel AI SDK — budgeted memory block

import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { compress, injectContext } from 'zettel-compress'

const memory = compress(messages.map(m => `${m.role}: ${m.content}`).join('\n'))
const block = injectContext(memory, { maxTokenBudget: 300, format: 'markdown' })

const response = await streamText({
  model: openai('gpt-4o'),
  messages: [
    { role: 'system', content: `Relevant past context:\n${block}` },
    ...recentMessages,
  ],
})

Question-time recall — only inject what the question needs

import { compress, recall } from 'zettel-compress'

const memory = compress(fullHistory)
const relevant = recall(memory, userQuestion, { topK: 5 })
const block = relevant.map(z => z.quote).join('\n')   // ~100–200 tokens

Cloudflare Workers — persistent compressed memory in KV

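import { compress, encode, decode, recall } from 'zettel-compress'

// KVNamespace is provided by @cloudflare/workers-types
interface Env {
  KV: KVNamespace
}

interface MemoryRequest {
  sessionId: string
  message?: string
  question?: string
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { sessionId, message, question } = (await request.json()) as MemoryRequest

    if (question) {
      const stored = await env.KV.get(`memory:${sessionId}`)
      if (!stored) return Response.json([])
      // decoded AAAK carries no source text, so recall ranks on quotes/topics/entities
      const hits = recall(decode(stored), question, { topK: 5 })
      return Response.json(hits.map(z => z.quote))
    }

    if (!message) return new Response('expected "message" or "question"', { status: 400 })

    // append to the session log, store the compressed memory alongside it
    const previous = await env.KV.get(`log:${sessionId}`)
    const log = previous ? `${previous}\n\n${message}` : message
    await env.KV.put(`log:${sessionId}`, log)
    await env.KV.put(`memory:${sessionId}`, encode(compress(log)))
    return new Response('ok')
  },
}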

Emotion states detected

conviction, grief, joy, fear, hope, trust, wonder, rage, exhaustion, shame, pride, nostalgia, anxiety, relief, anticipation, frustration, gratitude, loneliness, inspiration, confusion, clarity, guilt, awe, regret, determination, vulnerability, acceptance, resistance, love, loss

Importance flags

| Flag | Triggered by |
|---|---|
| DECISION | "decided", "chose", "committed", "concluded", "agreed to", "going to", "we will", "final decision" |
| ORIGIN | "founded", "originated", "first time ever", "inception", "birth of", "how it began" |
| CORE | "fundamental", "essential", "key principle", "foundation of", "bedrock", "non-negotiable" |
| PIVOT | "turning point", "breakthrough", "changed everything", "transformed", "pivotal", "game changer" |
| GENESIS | "led to", "resulted in", "because of this", "gave rise to", "which caused", "set in motion" |
| TECHNICAL | "architecture", "implement", "deploy", "config", "database", "module", "infrastructure", "stack", "endpoint", "schema" |

All keyword matching is word-boundary anchored with negation-scope handling ("we never decided" does not flag). TECHNICAL is metadata only — it does not affect zettel weight so technical detail chunks don't crowd out decisions and emotional moments in ranked selection.
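Word-boundary matching with a negation scope can be sketched as below. This is a simplified illustration, not the library's matcher: the negator list, the three-word look-back window, and the name `hasKeyword` are assumptions chosen to reproduce the "we never decided" behaviour described above.

```typescript
// Assumed negators; the library's actual list is not documented here.
const NEGATORS = ['not', 'never', "didn't", "haven't", "hadn't", "won't"]

// True when `phrase` occurs as whole words and no negator appears in the
// `scope` words immediately before it.
function hasKeyword(text: string, phrase: string, scope = 3): boolean {
  const words = text
    .toLowerCase()
    .split(/[^a-z']+/)
    .filter((word) => word.length > 0)
  const target = phrase.toLowerCase().split(' ')

  for (let i = 0; i + target.length <= words.length; i++) {
    // Whole-word comparison, so "undecided" never matches "decided".
    let matches = true
    for (let j = 0; j < target.length; j++) {
      if (words[i + j] !== target[j]) {
        matches = false
        break
      }
    }
    if (!matches) continue

    const preceding = words.slice(Math.max(0, i - scope), i)
    const negated = preceding.some((word) => NEGATORS.indexOf(word) !== -1)
    if (!negated) return true
  }
  return false
}

const flagsDecision = hasKeyword('We decided to use JWT tokens.', 'decided')
const negatedDecision = hasKeyword('We never decided on a database.', 'decided')
const partialWord = hasKeyword('They are undecided.', 'decided')
const multiWord = hasKeyword('In the end we agreed to ship on Friday.', 'agreed to')
```

A fixed look-back window is the simplest form of negation scope. It will miss long-range negation and can misfire on sentences where the negator applies to a different clause, so treat it as a filter heuristic rather than a parser.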

Reproducing the benchmarks

npm run bench       # deterministic: performance, compression, budgets,
                    # round-trip, answer-in-context QA, MRR, entities, streaming
npm run bench:llm   # end-to-end QA with a real model; needs OPENAI_API_KEY in .env

Both harnesses use seeded PRNGs — same machine, same numbers.
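For readers who want to see what a seeded PRNG buys, the sketch below uses mulberry32, a small public-domain generator, to pick positions for planted facts. Which generator and seed the benchmark harnesses use is not stated in this README, so the choice of mulberry32, the seed 42, and the `plantPositions` helper are assumptions for illustration only.

```typescript
// mulberry32: a compact 32-bit seeded generator returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0
  return () => {
    state = (state + 0x6d2b79f5) >>> 0
    let t = state
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Choose `count` offsets in [0, length) for planting facts, in ascending order.
function plantPositions(seed: number, count: number, length: number): number[] {
  const random = mulberry32(seed)
  const positions: number[] = []
  for (let i = 0; i < count; i++) {
    positions.push(Math.floor(random() * length))
  }
  return positions.sort((a, b) => a - b)
}

// Two runs with the same seed pick the same 12 positions in a 182,179-token text.
const firstRun = plantPositions(42, 12, 182179)
const secondRun = plantPositions(42, 12, 182179)
```

Because the generator uses only 32-bit integer arithmetic, the sequence depends on the seed alone and not on `Math.random` or the host, which is what makes a seeded benchmark repeatable.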

License

MIT
