# @grapine.ai/contextprune

v0.1.4

Garbage collection for LLM context windows.
Sits between your application and the LLM API. Analyzes your messages[] array, removes dead weight — stale tool outputs, resolved errors, superseded reasoning — and returns a leaner version. Every API call costs less. The model stays focused on what actually matters.
100% local. No data sent anywhere. No LLM calls during compression.
```bash
npm install @grapine.ai/contextprune
```

## The problem
Long LLM sessions fill up fast:
```
Turn 1  ████░░░░░░░░░░░░░░░░░░░░░░░░░░ 12% 4,100 tokens
Turn 5  ████████████░░░░░░░░░░░░░░░░░░ 38% 12,800 tokens
Turn 10 ████████████████████░░░░░░░░░░ 58% 19,400 tokens
Turn 15 ████████████████████████████░░ 78% 26,100 tokens ← quality degrades here
Turn 20 ██████████████████████████████ 91% 30,600 tokens ← coherence cliff
```

Around 65–75% utilization, model behavior suddenly gets worse — the model loses track of earlier constraints, repeats itself, makes mistakes it wouldn't make with a clean context. Most developers hit this, get confused, and manually clear the context — losing all the good state too.
With contextprune:
```
Turn 1  ████░░░░░░░░░░░░░░░░░░░░░░░░░░ 12% 4,100 tokens  —
Turn 5  ████████████░░░░░░░░░░░░░░░░░░ 38% 12,800 tokens —
Turn 6  ████░░░░░░░░░░░░░░░░░░░░░░░░░░ 11% 3,700 tokens  ← compressed, 71% saved
Turn 10 ██████████░░░░░░░░░░░░░░░░░░░░ 28% 9,500 tokens  —
Turn 11 ████░░░░░░░░░░░░░░░░░░░░░░░░░░ 10% 3,200 tokens  ← compressed, 66% saved
Turn 20 ████████████░░░░░░░░░░░░░░░░░░ 34% 11,600 tokens ← never exceeds 40%
```

## Quick start
```ts
import { ContextPrune } from '@grapine.ai/contextprune';

const cp = new ContextPrune({ model: 'claude-sonnet-4-5' });
const result = await cp.compress(messages);

// result.messages is a drop-in replacement for messages
// result.summary.tokensSaved — tokens recovered
// result.summary.savingsPercent — e.g. 0.47 = 47% saved
```

Only one line changes in your existing code:
```ts
// Before
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5',
  messages, // ← growing unbounded
  max_tokens: 8096,
});

// After
const { messages: lean } = await cp.compress(messages);
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5',
  messages: lean, // ← compressed
  max_tokens: 8096,
});
```

## Installation

```bash
npm install @grapine.ai/contextprune
```

Requires Node 18+. No mandatory peer dependencies — tiktoken is used for token counting when available; otherwise the library falls back to a character-based estimate.
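The character-based fallback can be approximated like this — the 4-characters-per-token ratio is a common rule of thumb for English text and an assumption here, not necessarily the library's exact heuristic:

```typescript
// Rough token estimate when no real tokenizer (tiktoken) is available.
// The 4-chars-per-token ratio is a standard English-text approximation —
// illustrative only, not contextprune's exact formula.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens('hello world')); // 11 chars → 3 tokens
```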
## CLI

No code required. Run directly with npx — no install needed.

### analyze — understand what's in your context
```bash
npx @grapine.ai/contextprune analyze ./session.json
npx @grapine.ai/contextprune analyze ./session.jsonl  # Claude Code session transcripts too
```

```
─── ContextPrune Analysis ──────────────────────────────────────────────────
Model: claude-sonnet-4-5 | Capacity: 200,000 tokens

████████████████░░░░░░░░░░░░░░ 56% used · 112,266 / 200,000 tokens

[SUGGESTED] Context is 56% full. Compression available but not urgent.
Projected savings: 48,100 tokens (43%) → 64,166 tokens after

Classification Breakdown:
  Outdated Tool Result   82 msgs  53,099 tokens ████████████░ 47%
  Chat / Filler          54 msgs  24,446 tokens ████████░░░░░ 22%
  Tool Result (active)   86 msgs  23,528 tokens ████████░░░░░ 21%
  Final Answer            1 msg   11,406 tokens ████░░░░░░░░░ 10%

Compression Strategies:
  Keep                  141 msgs  64,166 tokens
  Remove                 69 msgs  37,814 tokens ← will be dropped
  Trim to Key Output      8 msgs   8,320 tokens ← key output preserved
  Collapse to 1 Line      1 msg    1,966 tokens ← collapsed to marker

Top Token Consumers:
  #32 Final Answer           11,406 tokens  Preserved  no opportunity
  #55 Outdated Tool Result    6,801 tokens  Remove     high opportunity
  #48 Outdated Tool Result    4,992 tokens  Remove     high opportunity
  #61 Tool Result (active)    4,210 tokens  Trim       medium opportunity
```

```bash
# Also print a session brief — a compact handoff prompt for starting a new session
npx @grapine.ai/contextprune analyze ./session.jsonl --brief
```

### compress — compress a messages file
```bash
npx @grapine.ai/contextprune compress ./session.json -o compressed.json
```

```
✔ Compressed 112,266 → 64,166 tokens (43% saved, 48,100 tokens recovered)

Decisions:
  Removed    69 messages (Outdated Tool Result, Chat/Filler)
  Trimmed     8 messages (Tool Result — key output preserved)
  Collapsed   1 message  (Reasoning chain → 1-line marker)
  Kept      141 messages (constraints, active errors, final answers)
```

Output is a standard JSON messages array — drop it straight into an API call:

```ts
const messages = JSON.parse(fs.readFileSync('compressed.json', 'utf-8'));
await anthropic.messages.create({ model: 'claude-sonnet-4-5', messages, max_tokens: 8096 });
```

### watch — live dashboard in your browser
```bash
npx @grapine.ai/contextprune watch
```

Discovers all Claude Code sessions in `~/.claude/projects/` and opens an interactive picker:

```
Select a Claude project to monitor:
› labs/contextprune   #b6c62a11  just now  ● active
  labs/my-app         #a1d3f920  2h ago
  work/api-service    #cc8801ab  1d ago

↑↓ to navigate · Enter to select · Ctrl+C to cancel
```

Opens a browser tab and starts live monitoring. The dashboard updates every time the session file changes.

```bash
# Or point directly at a file
npx @grapine.ai/contextprune watch --follow ~/.claude/projects/my-project/session.jsonl

# Use a different port
npx @grapine.ai/contextprune watch --port 8080
```

## Dashboard
A live browser dashboard that monitors your Claude Code sessions in real time. No configuration — run `npx @grapine.ai/contextprune watch` and it opens automatically.

*(Screenshots: healthy context dashboard; context compression recommendation dashboard.)*
What the dashboard shows:

- **Context Window** — utilization bar with colour-coded status (green → yellow → red). Switches to Compression Suggested / Compress Now badges as context fills up.
- **Session Cost** — cost per API call with input/output/cache breakdown, grouped by calendar day with proportional bars.
- **Classification Breakdown** — how your context is distributed across message types (Outdated Tool Result, Active Tool Result, Chat/Filler, Final Answer, etc.) with token counts and percentages.
- **Compression Strategies** — what contextprune would do right now: Keep / Remove / Trim / Collapse counts.
- **Compression Projection** — before/after utilization bars showing exactly how much would be recovered if you compressed now. Hidden when context is healthy.
- **Top Consumers** — the largest individual messages ranked by token count, with their classification and compression opportunity.
- **Session Brief** — auto-generated handoff prompt that appears at 65%+ utilization. One click copies a compact context summary you can paste into a new session to continue without losing state.
- **Desktop notifications** — opt-in alerts at 65% utilization, then every 5% increment until you compress.
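The colour bands follow the thresholds the library documents as defaults (65% warning, 80% critical). A minimal sketch of that mapping — the `Status` names are my own, chosen for illustration:

```typescript
// Map context utilization to a traffic-light status, using the documented
// default thresholds (0.65 warning, 0.80 critical). Names are illustrative.
type Status = 'healthy' | 'warning' | 'critical';

function utilization(tokensUsed: number, capacity: number): number {
  return tokensUsed / capacity;
}

function status(u: number, warning = 0.65, critical = 0.80): Status {
  if (u >= critical) return 'critical';
  if (u >= warning) return 'warning';
  return 'healthy';
}

console.log(status(utilization(112_266, 200_000))); // 56% → 'healthy'
```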
Push data from your own process (no file watching needed):

```bash
npx @grapine.ai/contextprune watch &
curl -X POST http://localhost:4242/analyze \
  -H 'Content-Type: application/json' \
  -d '{ "messages": [...], "model": "gpt-4o" }'
```

Works with any provider — Anthropic, OpenAI, OpenRouter, Groq, or any messages array you construct yourself.
## Three ways to use it

### 1. compress(messages) — explicit, you decide when
```ts
const result = await cp.compress(messages);

console.log(result.summary.tokensSaved);    // 48100
console.log(result.summary.savingsPercent); // 0.43
console.log(result.messages.length);        // fewer messages
```

Compresses unconditionally every time you call it. Use this when you explicitly decide compression is warranted — after a tool-heavy phase, every N turns, or as part of a LangGraph compress node.
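The "every N turns" pattern can be sketched as a small wrapper. To keep the sketch self-contained, `Compress` stands in for the library's `cp.compress` and is passed in as a parameter:

```typescript
// Hypothetical helper: run a compress function every N turns, and pass
// messages through untouched otherwise. `Compress` stands in for
// cp.compress from the library.
type Msg = { role: string; content: string };
type Compress = (msgs: Msg[]) => Promise<{ messages: Msg[] }>;

function everyNTurns(compress: Compress, n: number) {
  let turn = 0;
  return async (msgs: Msg[]): Promise<Msg[]> => {
    turn += 1;
    if (turn % n !== 0) return msgs; // not a compression turn
    const result = await compress(msgs);
    return result.messages;
  };
}
```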
### 2. watch(client) — automatic, zero changes to call sites

```ts
// Wrap once at startup
const watched = cp.watch(anthropic);

// Use exactly as before — compression fires automatically when context > 65%
const response = await watched.messages.create({
  model: 'claude-sonnet-4-5',
  messages,
  max_tokens: 8096,
});
```

Works with Anthropic, OpenAI, and any OpenAI-compatible provider:
```ts
// OpenRouter
const openrouter = new OpenAI({ baseURL: 'https://openrouter.ai/api/v1', apiKey: '...' });
const watchedOpenRouter = cp.watch(openrouter);
await watchedOpenRouter.chat.completions.create({ model: 'meta-llama/llama-3.3-70b-instruct', messages });

// Groq
const watchedGroq = cp.watch(new Groq());
await watchedGroq.chat.completions.create({ model: 'llama3-70b-8192', messages });
```

### 3. analyze(messages) — read-only inspection
```ts
const analysis = await cp.analyze(messages);

analysis.recommendation.urgency                      // 'none' | 'suggested' | 'recommended' | 'critical'
analysis.recommendation.projectedSavings             // tokens that would be saved
analysis.sessionState.tokenBudget.utilizationPercent // 0.56
analysis.sessionBrief                                // markdown handoff prompt for context continuation
```

Never compresses — use this to build dashboards, gate on urgency, or log opportunities.
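One way to gate compression on urgency — a hypothetical helper written against the documented `analyze()`/`compress()` result shapes; the `Pruner` interface and the urgency ordering are illustrative assumptions:

```typescript
// Hypothetical gating helper: compress only when analyze() reports at
// least `minUrgency`. The Pruner interface mirrors the documented result
// shapes but is an illustration, not the library's exported type.
type Urgency = 'none' | 'suggested' | 'recommended' | 'critical';

interface Pruner {
  analyze(msgs: unknown[]): Promise<{ recommendation: { urgency: Urgency } }>;
  compress(msgs: unknown[]): Promise<{ messages: unknown[] }>;
}

const order: Urgency[] = ['none', 'suggested', 'recommended', 'critical'];

async function compressIfUrgent(
  cp: Pruner,
  msgs: unknown[],
  minUrgency: Urgency = 'recommended',
): Promise<unknown[]> {
  const analysis = await cp.analyze(msgs);
  const urgent = order.indexOf(analysis.recommendation.urgency) >= order.indexOf(minUrgency);
  return urgent ? (await cp.compress(msgs)).messages : msgs;
}
```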
## LangGraph

In a LangGraph agent, `state["messages"]` accumulates every tool result and intermediate step across all graph iterations. By call 20, a typical coding agent has 30–50k tokens of stale tool outputs.

Wrap the client — zero changes inside the graph:
```ts
import { ContextPrune } from '@grapine.ai/contextprune';
import Anthropic from '@anthropic-ai/sdk';

const client = new ContextPrune({ model: 'claude-sonnet-4-5' }).watch(new Anthropic());

// Every node compresses automatically, only when context > 65%
function callModel(state: MessagesState) {
  return client.messages.create({ // ← unchanged
    model: 'claude-sonnet-4-5',
    messages: state.messages,
    max_tokens: 8096,
  });
}
```

Or add a dedicated compress node:
```ts
const cp = new ContextPrune({ model: 'claude-sonnet-4-5' });

async function compressNode(state: MessagesState) {
  const result = await cp.compress(state.messages);
  return { messages: result.messages };
}

builder
  .addNode('compress', compressNode)
  .addEdge('tools', 'compress') // compress after every tool cycle
  .addEdge('compress', 'agent');
```

## When it helps (and when it doesn't)
The core prerequisite: there must be a growing `messages[]` array that gets passed to an LLM repeatedly.
### ✓ It helps: single-agent accumulating loops

```ts
// ReAct / tool-calling loop — context grows with every iteration
const messages: LLMMessage[] = [{ role: 'system', content: systemPrompt }];

while (!done) {
  const response = await llm.invoke(messages);
  messages.push({ role: 'assistant', content: response.content });

  const toolResult = await runTool(response);
  messages.push({ role: 'user', content: toolResult });

  // ← contextprune here: stale tool results removed before the next call
  const { messages: lean } = await cp.compress(messages);
  messages.splice(0, messages.length, ...lean);
}
```

By call 30, a typical agent has accumulated file reads, bash outputs, error traces, and intermediate reasoning that will never be referenced again. Every call pays for all of it. contextprune removes it.
### ✗ It doesn't help: parallel stateless fan-out

```ts
// Each agent call is 2–3 messages built fresh, discarded after
const [strategy, calendar, copy] = await Promise.all([
  orchestrator.invoke([{ role: 'user', content: strategyPrompt }]),
  strategist.invoke([{ role: 'user', content: calendarPrompt }]),
  copywriter.invoke([{ role: 'user', content: copyPrompt }]),
]);
```

Each call is constructed fresh and discarded. There is no accumulating history. Nothing to prune.
The diagnostic question: after N agent calls, is there a single `messages[]` array that is longer than it was at call 1? If yes — contextprune helps. If no — each call starts fresh, and contextprune has no leverage point.
## Compression modes
| Mode | When compression runs | Default for |
|------|----------------------|-------------|
| manual | Always, unconditionally | compress() |
| auto | Only when utilization ≥ warningThreshold | watch() |
| suggest-only | Never — analysis only | analyze() |
```ts
const cp = new ContextPrune({
  model: 'claude-sonnet-4-5',
  options: {
    warningThreshold: 0.65,  // start compressing at 65% full (default)
    criticalThreshold: 0.80, // compress aggressively at 80% (default)
    compressionMode: 'auto', // only compress when needed
  }
});
```

## What gets compressed
| Message type | Strategy | Why |
|---|---|---|
| Outdated Tool Result | Remove | Not referenced in subsequent turns |
| Fixed Error | Remove | Stack trace no longer needed |
| Chain of Thought | Collapse to 1 line | Conclusion already in context |
| Status Update | Collapse to 1 line | Acknowledged, no longer active |
| Tool Result (active) | Trim to key output | Keep answer, drop verbose body |
| Chat / Filler | Remove | Low relevance to current task |
Always preserved: system prompts, user corrections, active errors, session goals, final answers.
The classifier assigns one of 11 types to each message. Classification confidence gates compression aggressiveness — if the classifier is uncertain, the message is always preserved.
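Conceptually, the confidence gate looks like this — the 0.8 threshold and the names are illustrative, not the library's actual values:

```typescript
// Confidence-gated compression: if the classifier is uncertain about a
// message, keep it regardless of the suggested strategy. The 0.8
// threshold is illustrative, not contextprune's real value.
type Decision = 'keep' | 'remove' | 'trim' | 'collapse';

function gate(decision: Decision, confidence: number, threshold = 0.8): Decision {
  return confidence >= threshold ? decision : 'keep';
}

console.log(gate('remove', 0.95)); // confident → 'remove'
console.log(gate('remove', 0.40)); // uncertain → 'keep'
```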
## Supported providers and models
Token budgets are pre-configured for:
| Provider | Models |
|---|---|
| Anthropic | Claude 4.x, Claude 3.x (all variants) |
| OpenAI | GPT-4o, GPT-4.1, GPT-4-turbo, GPT-3.5, o1, o3 series |
| Google | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 1.5 |
| Meta | Llama 3.3 / 3.1 (70B, 8B) |
| Mistral | Mistral Large/Medium/Small, Mixtral, Codestral |
| DeepSeek | DeepSeek Chat, DeepSeek Reasoner |
| Cohere | Command R, Command R+ |
| OpenRouter | All provider/model prefixed names |
| Groq | Llama3, Mixtral, Gemma hosted models |
Any unrecognized model string falls back to a 128k token budget.
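The fallback behaviour can be sketched as a simple lookup — the table entries here are illustrative (the 200k Claude capacity matches the analysis output above; 128k is gpt-4o's published context window):

```typescript
// Token-budget lookup with the documented 128k fallback for unrecognized
// model strings. The entries are a small illustrative subset — the
// library ships a fuller pre-configured list.
const budgets: Record<string, number> = {
  'claude-sonnet-4-5': 200_000,
  'gpt-4o': 128_000,
};

function tokenBudget(model: string): number {
  return budgets[model] ?? 128_000; // unknown model → 128k fallback
}

console.log(tokenBudget('claude-sonnet-4-5')); // 200000
console.log(tokenBudget('some-future-model')); // 128000
```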
