# @grapine.ai/contextprune

v0.1.4

Garbage collection for LLM context windows.
Sits between your application and the LLM API. Analyzes your messages[] array, removes dead weight — stale tool outputs, resolved errors, superseded reasoning — and returns a leaner version. Every API call costs less. The model stays focused on what actually matters.
100% local. No data sent anywhere. No LLM calls during compression.
```bash
npm install @grapine.ai/contextprune
```

## The problem
Long LLM sessions fill up fast:
```
Turn 1  ████░░░░░░░░░░░░░░░░░░░░░░░░░░ 12% 4,100 tokens
Turn 5  ████████████░░░░░░░░░░░░░░░░░░ 38% 12,800 tokens
Turn 10 ████████████████████░░░░░░░░░░ 58% 19,400 tokens
Turn 15 ████████████████████████████░░ 78% 26,100 tokens ← quality degrades here
Turn 20 ██████████████████████████████ 91% 30,600 tokens ← coherence cliff
```

Around 65–75% utilization, model behavior suddenly gets worse — the model loses track of earlier constraints, repeats itself, makes mistakes it wouldn't make with a clean context. Most developers hit this, get confused, and manually clear the context — losing all the good state too.
With contextprune:
```
Turn 1  ████░░░░░░░░░░░░░░░░░░░░░░░░░░ 12% 4,100 tokens  —
Turn 5  ████████████░░░░░░░░░░░░░░░░░░ 38% 12,800 tokens —
Turn 6  ████░░░░░░░░░░░░░░░░░░░░░░░░░░ 11% 3,700 tokens  ← compressed, 71% saved
Turn 10 ██████████░░░░░░░░░░░░░░░░░░░░ 28% 9,500 tokens  —
Turn 11 ████░░░░░░░░░░░░░░░░░░░░░░░░░░ 10% 3,200 tokens  ← compressed, 66% saved
Turn 20 ████████████░░░░░░░░░░░░░░░░░░ 34% 11,600 tokens ← never exceeds 40%
```

## Quick start
```ts
import { ContextPrune } from '@grapine.ai/contextprune';

const cp = new ContextPrune({ model: 'claude-sonnet-4-5' });
const result = await cp.compress(messages);

// result.messages is a drop-in replacement for messages
// result.summary.tokensSaved — tokens recovered
// result.summary.savingsPercent — e.g. 0.47 = 47% saved
```

Only one line changes in your existing code:
```ts
// Before
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5',
  messages, // ← growing unbounded
  max_tokens: 8096,
});

// After
const { messages: lean } = await cp.compress(messages);
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5',
  messages: lean, // ← compressed
  max_tokens: 8096,
});
```

## Installation

```bash
npm install @grapine.ai/contextprune
```

Requires Node 18+. No mandatory peer dependencies — tiktoken is used for token counting when available; otherwise the library falls back to a character-based estimate.
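The character-based fallback can be approximated like this — the 4-characters-per-token ratio is a common rule of thumb for English text and an assumption here, not necessarily the library's exact heuristic:

```typescript
// Rough token estimate when no real tokenizer (tiktoken) is available.
// The 4-chars-per-token ratio is a standard English-text approximation —
// illustrative only, not contextprune's exact formula.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens('hello world')); // 11 chars → 3 tokens
```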
## CLI

No code required. Run directly with npx — no install needed.

### analyze — understand what's in your context
```bash
npx @grapine.ai/contextprune analyze ./session.json
npx @grapine.ai/contextprune analyze ./session.jsonl  # Claude Code session transcripts too
```

```
─── ContextPrune Analysis ──────────────────────────────────────────────────
Model: claude-sonnet-4-5 | Capacity: 200,000 tokens

████████████████░░░░░░░░░░░░░░ 56% used · 112,266 / 200,000 tokens

[SUGGESTED] Context is 56% full. Compression available but not urgent.
Projected savings: 48,100 tokens (43%) → 64,166 tokens after

Classification Breakdown:
  Outdated Tool Result   82 msgs  53,099 tokens ████████████░ 47%
  Chat / Filler          54 msgs  24,446 tokens ████████░░░░░ 22%
  Tool Result (active)   86 msgs  23,528 tokens ████████░░░░░ 21%
  Final Answer            1 msg   11,406 tokens ████░░░░░░░░░ 10%

Compression Strategies:
  Keep                  141 msgs  64,166 tokens
  Remove                 69 msgs  37,814 tokens ← will be dropped
  Trim to Key Output      8 msgs   8,320 tokens ← key output preserved
  Collapse to 1 Line      1 msg    1,966 tokens ← collapsed to marker

Top Token Consumers:
  #32 Final Answer           11,406 tokens  Preserved  no opportunity
  #55 Outdated Tool Result    6,801 tokens  Remove     high opportunity
  #48 Outdated Tool Result    4,992 tokens  Remove     high opportunity
  #61 Tool Result (active)    4,210 tokens  Trim       medium opportunity
```

```bash
# Also print a session brief — a compact handoff prompt for starting a new session
npx @grapine.ai/contextprune analyze ./session.jsonl --brief
```

### compress — compress a messages file
```bash
npx @grapine.ai/contextprune compress ./session.json -o compressed.json
```

```
✔ Compressed 112,266 → 64,166 tokens (43% saved, 48,100 tokens recovered)

Decisions:
  Removed    69 messages (Outdated Tool Result, Chat/Filler)
  Trimmed     8 messages (Tool Result — key output preserved)
  Collapsed   1 message  (Reasoning chain → 1-line marker)
  Kept      141 messages (constraints, active errors, final answers)
```

Output is a standard JSON messages array — drop it straight into an API call:

```ts
const messages = JSON.parse(fs.readFileSync('compressed.json', 'utf-8'));
await anthropic.messages.create({ model: 'claude-sonnet-4-5', messages, max_tokens: 8096 });
```

### watch — live dashboard in your browser
```bash
npx @grapine.ai/contextprune watch
```

Discovers all Claude Code sessions in `~/.claude/projects/` and opens an interactive picker:

```
Select a Claude project to monitor:
› labs/contextprune   #b6c62a11  just now  ● active
  labs/my-app         #a1d3f920  2h ago
  work/api-service    #cc8801ab  1d ago

↑↓ to navigate · Enter to select · Ctrl+C to cancel
```

Opens a browser tab and starts live monitoring. The dashboard updates every time the session file changes.

```bash
# Or point directly at a file
npx @grapine.ai/contextprune watch --follow ~/.claude/projects/my-project/session.jsonl

# Use a different port
npx @grapine.ai/contextprune watch --port 8080
```

## Dashboard
A live browser dashboard that monitors your Claude Code sessions in real time. No configuration — run `npx @grapine.ai/contextprune watch` and it opens automatically.

*(Screenshots: healthy context dashboard; context compression recommendation dashboard.)*
What the dashboard shows:

- **Context Window** — utilization bar with colour-coded status (green → yellow → red). Switches to Compression Suggested / Compress Now badges as context fills up.
- **Session Cost** — cost per API call with input/output/cache breakdown, grouped by calendar day with proportional bars.
- **Classification Breakdown** — how your context is distributed across message types (Outdated Tool Result, Active Tool Result, Chat/Filler, Final Answer, etc.) with token counts and percentages.
- **Compression Strategies** — what contextprune would do right now: Keep / Remove / Trim / Collapse counts.
- **Compression Projection** — before/after utilization bars showing exactly how much would be recovered if you compressed now. Hidden when context is healthy.
- **Top Consumers** — the largest individual messages ranked by token count, with their classification and compression opportunity.
- **Session Brief** — auto-generated handoff prompt that appears at 65%+ utilization. One click copies a compact context summary you can paste into a new session to continue without losing state.
- **Desktop notifications** — opt-in alerts at 65% utilization, then every 5% increment until you compress.
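The colour bands follow the thresholds the library documents as defaults (65% warning, 80% critical). A minimal sketch of that mapping — the `Status` names are my own, chosen for illustration:

```typescript
// Map context utilization to a traffic-light status, using the documented
// default thresholds (0.65 warning, 0.80 critical). Names are illustrative.
type Status = 'healthy' | 'warning' | 'critical';

function utilization(tokensUsed: number, capacity: number): number {
  return tokensUsed / capacity;
}

function status(u: number, warning = 0.65, critical = 0.80): Status {
  if (u >= critical) return 'critical';
  if (u >= warning) return 'warning';
  return 'healthy';
}

console.log(status(utilization(112_266, 200_000))); // 56% → 'healthy'
```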
Push data from your own process (no file watching needed):

```bash
npx @grapine.ai/contextprune watch &
curl -X POST http://localhost:4242/analyze \
  -H 'Content-Type: application/json' \
  -d '{ "messages": [...], "model": "gpt-4o" }'
```

Works with any provider — Anthropic, OpenAI, OpenRouter, Groq, or any messages array you construct yourself.
## Three ways to use it

### 1. compress(messages) — explicit, you decide when
```ts
const result = await cp.compress(messages);

console.log(result.summary.tokensSaved);    // 48100
console.log(result.summary.savingsPercent); // 0.43
console.log(result.messages.length);        // fewer messages
```

Compresses unconditionally every time you call it. Use this when you explicitly decide compression is warranted — after a tool-heavy phase, every N turns, or as part of a LangGraph compress node.
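The "every N turns" pattern can be sketched as a small wrapper. To keep the sketch self-contained, `Compress` stands in for the library's `cp.compress` and is passed in as a parameter:

```typescript
// Hypothetical helper: run a compress function every N turns, and pass
// messages through untouched otherwise. `Compress` stands in for
// cp.compress from the library.
type Msg = { role: string; content: string };
type Compress = (msgs: Msg[]) => Promise<{ messages: Msg[] }>;

function everyNTurns(compress: Compress, n: number) {
  let turn = 0;
  return async (msgs: Msg[]): Promise<Msg[]> => {
    turn += 1;
    if (turn % n !== 0) return msgs; // not a compression turn
    const result = await compress(msgs);
    return result.messages;
  };
}
```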
### 2. watch(client) — automatic, zero changes to call sites

```ts
// Wrap once at startup
const watched = cp.watch(anthropic);

// Use exactly as before — compression fires automatically when context > 65%
const response = await watched.messages.create({
  model: 'claude-sonnet-4-5',
  messages,
  max_tokens: 8096,
});
```

Works with Anthropic, OpenAI, and any OpenAI-compatible provider:
```ts
// OpenRouter
const openrouter = new OpenAI({ baseURL: 'https://openrouter.ai/api/v1', apiKey: '...' });
const watchedOpenRouter = cp.watch(openrouter);
await watchedOpenRouter.chat.completions.create({ model: 'meta-llama/llama-3.3-70b-instruct', messages });

// Groq
const watchedGroq = cp.watch(new Groq());
await watchedGroq.chat.completions.create({ model: 'llama3-70b-8192', messages });
```

### 3. analyze(messages) — read-only inspection
```ts
const analysis = await cp.analyze(messages);

analysis.recommendation.urgency                      // 'none' | 'suggested' | 'recommended' | 'critical'
analysis.recommendation.projectedSavings             // tokens that would be saved
analysis.sessionState.tokenBudget.utilizationPercent // 0.56
analysis.sessionBrief                                // markdown handoff prompt for context continuation
```

Never compresses — use this to build dashboards, gate on urgency, or log opportunities.
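One way to gate compression on urgency — a hypothetical helper written against the documented `analyze()`/`compress()` result shapes; the `Pruner` interface and the urgency ordering are illustrative assumptions:

```typescript
// Hypothetical gating helper: compress only when analyze() reports at
// least `minUrgency`. The Pruner interface mirrors the documented result
// shapes but is an illustration, not the library's exported type.
type Urgency = 'none' | 'suggested' | 'recommended' | 'critical';

interface Pruner {
  analyze(msgs: unknown[]): Promise<{ recommendation: { urgency: Urgency } }>;
  compress(msgs: unknown[]): Promise<{ messages: unknown[] }>;
}

const order: Urgency[] = ['none', 'suggested', 'recommended', 'critical'];

async function compressIfUrgent(
  cp: Pruner,
  msgs: unknown[],
  minUrgency: Urgency = 'recommended',
): Promise<unknown[]> {
  const analysis = await cp.analyze(msgs);
  const urgent = order.indexOf(analysis.recommendation.urgency) >= order.indexOf(minUrgency);
  return urgent ? (await cp.compress(msgs)).messages : msgs;
}
```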
## LangGraph

In a LangGraph agent, `state["messages"]` accumulates every tool result and intermediate step across all graph iterations. By call 20, a typical coding agent has 30–50k tokens of stale tool outputs.

Wrap the client — zero changes inside the graph:
```ts
import { ContextPrune } from '@grapine.ai/contextprune';
import Anthropic from '@anthropic-ai/sdk';

const client = new ContextPrune({ model: 'claude-sonnet-4-5' }).watch(new Anthropic());

// Every node compresses automatically, only when context > 65%
function callModel(state: MessagesState) {
  return client.messages.create({ // ← unchanged
    model: 'claude-sonnet-4-5',
    messages: state.messages,
    max_tokens: 8096,
  });
}
```

Or add a dedicated compress node:
```ts
const cp = new ContextPrune({ model: 'claude-sonnet-4-5' });

async function compressNode(state: MessagesState) {
  const result = await cp.compress(state.messages);
  return { messages: result.messages };
}

builder
  .addNode('compress', compressNode)
  .addEdge('tools', 'compress') // compress after every tool cycle
  .addEdge('compress', 'agent');
```

## When it helps (and when it doesn't)
The core prerequisite: there must be a growing `messages[]` array that gets passed to an LLM repeatedly.
### ✓ It helps: single-agent accumulating loops

```ts
// ReAct / tool-calling loop — context grows with every iteration
const messages: LLMMessage[] = [{ role: 'system', content: systemPrompt }];

while (!done) {
  const response = await llm.invoke(messages);
  messages.push({ role: 'assistant', content: response.content });

  const toolResult = await runTool(response);
  messages.push({ role: 'user', content: toolResult });

  // ← contextprune here: stale tool results removed before the next call
  const { messages: lean } = await cp.compress(messages);
  messages.splice(0, messages.length, ...lean);
}
```

By call 30, a typical agent has accumulated file reads, bash outputs, error traces, and intermediate reasoning that will never be referenced again. Every call pays for all of it. contextprune removes it.
### ✗ It doesn't help: parallel stateless fan-out

```ts
// Each agent call is 2–3 messages built fresh, discarded after
const [strategy, calendar, copy] = await Promise.all([
  orchestrator.invoke([{ role: 'user', content: strategyPrompt }]),
  strategist.invoke([{ role: 'user', content: calendarPrompt }]),
  copywriter.invoke([{ role: 'user', content: copyPrompt }]),
]);
```

Each call is constructed fresh and discarded. There is no accumulating history. Nothing to prune.
The diagnostic question: after N agent calls, is there a single `messages[]` array that is longer than it was at call 1? If yes — contextprune helps. If no — each call starts fresh, and contextprune has no leverage point.
## Compression modes
| Mode | When compression runs | Default for |
|------|----------------------|-------------|
| manual | Always, unconditionally | compress() |
| auto | Only when utilization ≥ warningThreshold | watch() |
| suggest-only | Never — analysis only | analyze() |
```ts
const cp = new ContextPrune({
  model: 'claude-sonnet-4-5',
  options: {
    warningThreshold: 0.65,  // start compressing at 65% full (default)
    criticalThreshold: 0.80, // compress aggressively at 80% (default)
    compressionMode: 'auto', // only compress when needed
  }
});
```

## What gets compressed
| Message type | Strategy | Why |
|---|---|---|
| Outdated Tool Result | Remove | Not referenced in subsequent turns |
| Fixed Error | Remove | Stack trace no longer needed |
| Chain of Thought | Collapse to 1 line | Conclusion already in context |
| Status Update | Collapse to 1 line | Acknowledged, no longer active |
| Tool Result (active) | Trim to key output | Keep answer, drop verbose body |
| Chat / Filler | Remove | Low relevance to current task |
Always preserved: system prompts, user corrections, active errors, session goals, final answers.
The classifier assigns one of 11 types to each message. Classification confidence gates compression aggressiveness — if the classifier is uncertain, the message is always preserved.
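Conceptually, the confidence gate looks like this — the 0.8 threshold and the names are illustrative, not the library's actual values:

```typescript
// Confidence-gated compression: if the classifier is uncertain about a
// message, keep it regardless of the suggested strategy. The 0.8
// threshold is illustrative, not contextprune's real value.
type Decision = 'keep' | 'remove' | 'trim' | 'collapse';

function gate(decision: Decision, confidence: number, threshold = 0.8): Decision {
  return confidence >= threshold ? decision : 'keep';
}

console.log(gate('remove', 0.95)); // confident → 'remove'
console.log(gate('remove', 0.40)); // uncertain → 'keep'
```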
## Supported providers and models
Token budgets are pre-configured for:
| Provider | Models |
|---|---|
| Anthropic | Claude 4.x, Claude 3.x (all variants) |
| OpenAI | GPT-4o, GPT-4.1, GPT-4-turbo, GPT-3.5, o1, o3 series |
| Google | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 1.5 |
| Meta | Llama 3.3 / 3.1 (70B, 8B) |
| Mistral | Mistral Large/Medium/Small, Mixtral, Codestral |
| DeepSeek | DeepSeek Chat, DeepSeek Reasoner |
| Cohere | Command R, Command R+ |
| OpenRouter | All provider/model prefixed names |
| Groq | Llama3, Mixtral, Gemma hosted models |
Any unrecognized model string falls back to a 128k token budget.
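The fallback behaviour can be sketched as a simple lookup — the table entries here are illustrative (the 200k Claude capacity matches the analysis output above; 128k is gpt-4o's published context window):

```typescript
// Token-budget lookup with the documented 128k fallback for unrecognized
// model strings. The entries are a small illustrative subset — the
// library ships a fuller pre-configured list.
const budgets: Record<string, number> = {
  'claude-sonnet-4-5': 200_000,
  'gpt-4o': 128_000,
};

function tokenBudget(model: string): number {
  return budgets[model] ?? 128_000; // unknown model → 128k fallback
}

console.log(tokenBudget('claude-sonnet-4-5')); // 200000
console.log(tokenBudget('some-future-model')); // 128000
```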
