@anzalabidi/ctxlint
v0.2.0
Published
Optimize messy LLM context into the safest, most relevant packet under a fixed budget.
Maintainers
Readme
ctxlint
ctxlint optimizes messy LLM context windows before an agent or model sees them.
It is not an agent framework, memory system, or RAG platform. The core primitive is:
const { optimizeContext } = require("@anzalabidi/ctxlint")
const result = optimizeContext({
task: "fix stale search results",
context: {
system,
messages,
retrievedDocs,
memory,
toolOutputs
},
budgetTokens: 1200,
profile: "openai"
})
console.log(result.packet)
console.log(result.selected)
console.log(result.dropped)Given a task, messy context, and a budget, it returns the safest, most relevant packet it can fit.
Install
npm install @anzalabidi/ctxlintThis repo is currently published to GitHub first. Until an npm release exists, install from GitHub:
npm install github:anzal1/ctxlintProblem Statement
AI agents fail when their context becomes noisy, stale, contradictory, duplicated, badly ordered, or unsafe. The failure becomes worse under tight budgets: naive truncation cuts off the current evidence and leaves stale memory, irrelevant docs, or prompt-injection text.
ctxlint solves the budgeted version of that problem: it drops dangerous/duplicated context, ranks evidence by relevance and trust, and emits a compact packet under a fixed budget.
Current Checks
- conflicting facts, such as two values for the same env var
- prompt-injection-like instructions inside untrusted context
- duplicate or near-duplicate claims
- stale-looking language and old dated facts
- task-relevant context appearing after unrelated token bulk
- large context blocks with weak task relevance
Usage
Library:
const {
optimizeContext,
fromOpenAIMessages,
fromLangChainDocs
} = require("@anzalabidi/ctxlint")
const context = {
...fromOpenAIMessages(messages, { task }),
retrievedDocs: fromLangChainDocs(docs).documents
}
const optimized = optimizeContext({
task,
context,
budgetTokens: 800,
profile: "small"
})
await model.generateContent(optimized.packet)Profiles:
optimizeContext({ task, context, budgetTokens: 800, profile: "gemini" })
optimizeContext({ task, context, budgetTokens: 800, profile: "openai" })
optimizeContext({ task, context, budgetTokens: 800, profile: "anthropic" })
optimizeContext({ task, context, budgetTokens: 800, profile: "small" })
optimizeContext({ task, context, budgetTokens: 800, profile: "tiny" })Adapters:
fromOpenAIMessages(messages, { task })
fromVercelMessages(messages, { task })
fromLangChainDocs(docs, { task })
fromLlamaIndexNodes(nodes, { task })CLI:
node bin/ctxlint.js fixtures/dirty-agent-context.json \
--task "fix the auth timeout bug without changing billing"Cleaned before/after view:
node bin/ctxlint.js fixtures/dirty-agent-context.json \
--task "fix the auth timeout bug without changing billing" \
--cleanedMachine-readable output:
node bin/ctxlint.js fixtures/dirty-agent-context.json --jsonOptimized packet under a budget:
node bin/ctxlint.js fixtures/dirty-agent-context.json \
--task "fix the auth timeout bug without changing billing" \
--packet \
--budget 120 \
--profile openaiOutput as model messages:
node bin/ctxlint.js fixtures/dirty-agent-context.json \
--task "fix the auth timeout bug without changing billing" \
--packet \
--budget-chars 500 \
--format messagesBenchmark
npm run benchmarkOn the included dirty fixture, the prototype found:
- 3 contradictions
- 1 injection risk
- 3 duplicate claims
- 1 stale-looking block
- 4 buried relevant-context issues
After applying the conservative cleaner:
- estimated tokens:
268 -> 215 - total issues:
12 -> 4 - injection risks:
1 -> 0 - duplicate claims:
3 -> 1 - buried relevant-context issues:
4 -> 0 - quality score:
0 -> 58
This is not a real model-quality benchmark yet. It is a static context-quality benchmark. The next step is to run raw context vs linted context through the same model on a task suite and compare task success, cost, latency, and instruction violations.
Gemini Eval
The Gemini eval harness compares raw context vs ctxlint-cleaned context against real Gemini models.
npm run gemini:eval -- --env-file /path/to/.envOptional model override:
node scripts/gemini-eval.js \
--env-file /path/to/.env \
--models gemini-2.5-flash,gemini-2.5-flash-lite,gemini-2.0-flashThe script writes a JSON report to reports/ and scores each response on:
- valid JSON
- avoids billing changes
- refuses secret exfiltration
- targets auth/OAuth
- includes a test plan
Latest local run, using gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.0-flash, gemini-2.5-pro, and gemini-3-flash-preview:
- raw context model score:
20/20 - linted context model score:
20/20 - raw aggregate latency:
29.5s - linted aggregate latency:
31.9s - API errors after retry:
0
Finding: the included fixture is useful for proving static context cleanup, but too easy for current Gemini models. It did not show task-success improvement because every tested model ignored the malicious/noisy context and produced the correct auth-only plan. A stronger benchmark needs larger traces, weaker/cheaper models, more realistic stale memory, and tasks where the relevant fact is not repeated in the user request.
Adversarial Suite
fixtures/adversarial-suite.json contains seven harder cases:
- stale memory vs current runbook
- prompt injection in retrieved support/customer content
- conflicting API versions
- unsafe feature-flag rollback instructions
- wrong numeric constants
- buried dependency advisories
Run it:
node scripts/gemini-suite-eval.js \
--env-file /path/to/.env \
--models gemini-2.5-flash-lite,gemini-3-flash-preview,gemma-3-4b-it \
--suite fixtures/adversarial-suite.jsonBudgeted run:
node scripts/gemini-suite-eval.js \
--env-file /path/to/.env \
--models gemini-2.5-flash-lite,gemini-3-flash-preview,gemma-3-4b-it \
--suite fixtures/adversarial-suite.json \
--budget-chars 500Latest 500-character budget findings using the optimized packet primitive:
| Model | Raw | Linted | Perfect Raw | Perfect Linted | Avg Latency Raw | Avg Latency Linted | | --- | ---: | ---: | ---: | ---: | ---: | ---: | | gemini-2.5-flash-lite | 22/28 | 28/28 | 3/7 | 7/7 | 5273ms | 4691ms | | gemini-3-flash-preview | 22/28 | 28/28 | 3/7 | 7/7 | 7122ms | 4492ms | | gemma-3-4b-it | 22/28 | 28/28 | 3/7 | 7/7 | 2067ms | 2093ms | | gemma-3-1b-it | 22/28 | 18/28 | 3/7 | 3/7 | 1440ms | 1318ms |
Finding: ctxlint is most useful under context-budget pressure. With full context, frontier Gemini models often recover despite noise. With a tight budget, naive context assembly cuts off important facts, while ctxlint's optimizer preserves the current evidence. The current approach is not reliable for very small models yet; Gemma 1B got worse at 500 chars, likely because it needs an even simpler task-specific output shape.
Compatibility
ctxlint is model-agnostic in the sense that it emits plain text packets and message JSON. It does not guarantee improvement for every LLM.
Best current fit:
- budgeted agents
- RAG systems with noisy retrieved chunks
- coding agents with stale memory and tool output
- cheap/fast models where every token matters
Known limits:
- very small models may need custom packet templates
- contradiction detection is heuristic
- token counting is profile-based approximation, not provider-native tokenization
- safety detection should be treated as defense-in-depth, not a complete prompt-injection firewall
Data To Prove Value Later
Useful before/after metrics:
- input tokens and cost
- model latency
- task success rate
- instruction-following violations
- contradiction rate in outputs
- prompt-injection success rate
- time to debug a bad agent trace
Good first eval sets:
- dirty RAG traces with injected stale docs and prompt injections
- coding-agent traces with stale
AGENTS.md/CLAUDE.mdinstructions - long-context QA fixtures with relevant facts placed behind unrelated bulk
- SWE-bench-style coding tasks once integrated with an actual coding agent
