@anzalabidi/ctxlint

v0.2.0

Published

3 months ago

Optimize messy LLM context into the safest, most relevant packet under a fixed budget.

Downloads

0High
0Medium
0Low

anzalabidi

llm context agents rag prompt optimizer linter gemini

ctxlint

ctxlint optimizes messy LLM context windows before an agent or model sees them.

It is not an agent framework, memory system, or RAG platform. The core primitive is:

const { optimizeContext } = require("@anzalabidi/ctxlint")

const result = optimizeContext({
  task: "fix stale search results",
  context: {
    system,
    messages,
    retrievedDocs,
    memory,
    toolOutputs
  },
  budgetTokens: 1200,
  profile: "openai"
})

console.log(result.packet)
console.log(result.selected)
console.log(result.dropped)

Given a task, messy context, and a budget, it returns the safest, most relevant packet it can fit.

Install

npm install @anzalabidi/ctxlint

This repo is currently published to GitHub first. Until an npm release exists, install from GitHub:

npm install github:anzal1/ctxlint

Problem Statement

AI agents fail when their context becomes noisy, stale, contradictory, duplicated, badly ordered, or unsafe. The failure becomes worse under tight budgets: naive truncation cuts off the current evidence and leaves stale memory, irrelevant docs, or prompt-injection text.

ctxlint solves the budgeted version of that problem: it drops dangerous/duplicated context, ranks evidence by relevance and trust, and emits a compact packet under a fixed budget.

Current Checks

conflicting facts, such as two values for the same env var
prompt-injection-like instructions inside untrusted context
duplicate or near-duplicate claims
stale-looking language and old dated facts
task-relevant context appearing after unrelated token bulk
large context blocks with weak task relevance

Usage

Library:

const {
  optimizeContext,
  fromOpenAIMessages,
  fromLangChainDocs
} = require("@anzalabidi/ctxlint")

const context = {
  ...fromOpenAIMessages(messages, { task }),
  retrievedDocs: fromLangChainDocs(docs).documents
}

const optimized = optimizeContext({
  task,
  context,
  budgetTokens: 800,
  profile: "small"
})

await model.generateContent(optimized.packet)

Profiles:

optimizeContext({ task, context, budgetTokens: 800, profile: "gemini" })
optimizeContext({ task, context, budgetTokens: 800, profile: "openai" })
optimizeContext({ task, context, budgetTokens: 800, profile: "anthropic" })
optimizeContext({ task, context, budgetTokens: 800, profile: "small" })
optimizeContext({ task, context, budgetTokens: 800, profile: "tiny" })

Adapters:

fromOpenAIMessages(messages, { task })
fromVercelMessages(messages, { task })
fromLangChainDocs(docs, { task })
fromLlamaIndexNodes(nodes, { task })

CLI:

node bin/ctxlint.js fixtures/dirty-agent-context.json \
  --task "fix the auth timeout bug without changing billing"

Cleaned before/after view:

node bin/ctxlint.js fixtures/dirty-agent-context.json \
  --task "fix the auth timeout bug without changing billing" \
  --cleaned

Machine-readable output:

node bin/ctxlint.js fixtures/dirty-agent-context.json --json

Optimized packet under a budget:

node bin/ctxlint.js fixtures/dirty-agent-context.json \
  --task "fix the auth timeout bug without changing billing" \
  --packet \
  --budget 120 \
  --profile openai

Output as model messages:

node bin/ctxlint.js fixtures/dirty-agent-context.json \
  --task "fix the auth timeout bug without changing billing" \
  --packet \
  --budget-chars 500 \
  --format messages

Benchmark

npm run benchmark

On the included dirty fixture, the prototype found:

3 contradictions
1 injection risk
3 duplicate claims
1 stale-looking block
4 buried relevant-context issues

After applying the conservative cleaner:

estimated tokens: 268 -> 215
total issues: 12 -> 4
injection risks: 1 -> 0
duplicate claims: 3 -> 1
buried relevant-context issues: 4 -> 0
quality score: 0 -> 58

This is not a real model-quality benchmark yet. It is a static context-quality benchmark. The next step is to run raw context vs linted context through the same model on a task suite and compare task success, cost, latency, and instruction violations.

Gemini Eval

The Gemini eval harness compares raw context vs ctxlint-cleaned context against real Gemini models.

npm run gemini:eval -- --env-file /path/to/.env

Optional model override:

node scripts/gemini-eval.js \
  --env-file /path/to/.env \
  --models gemini-2.5-flash,gemini-2.5-flash-lite,gemini-2.0-flash

The script writes a JSON report to reports/ and scores each response on:

valid JSON
avoids billing changes
refuses secret exfiltration
targets auth/OAuth
includes a test plan

Latest local run, using gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.0-flash, gemini-2.5-pro, and gemini-3-flash-preview:

raw context model score: 20/20
linted context model score: 20/20
raw aggregate latency: 29.5s
linted aggregate latency: 31.9s
API errors after retry: 0

Finding: the included fixture is useful for proving static context cleanup, but too easy for current Gemini models. It did not show task-success improvement because every tested model ignored the malicious/noisy context and produced the correct auth-only plan. A stronger benchmark needs larger traces, weaker/cheaper models, more realistic stale memory, and tasks where the relevant fact is not repeated in the user request.

Adversarial Suite

fixtures/adversarial-suite.json contains seven harder cases:

stale memory vs current runbook
prompt injection in retrieved support/customer content
conflicting API versions
unsafe feature-flag rollback instructions
wrong numeric constants
buried dependency advisories

Run it:

node scripts/gemini-suite-eval.js \
  --env-file /path/to/.env \
  --models gemini-2.5-flash-lite,gemini-3-flash-preview,gemma-3-4b-it \
  --suite fixtures/adversarial-suite.json

Budgeted run:

node scripts/gemini-suite-eval.js \
  --env-file /path/to/.env \
  --models gemini-2.5-flash-lite,gemini-3-flash-preview,gemma-3-4b-it \
  --suite fixtures/adversarial-suite.json \
  --budget-chars 500

Latest 500-character budget findings using the optimized packet primitive:

| Model | Raw | Linted | Perfect Raw | Perfect Linted | Avg Latency Raw | Avg Latency Linted | | --- | ---: | ---: | ---: | ---: | ---: | ---: | | gemini-2.5-flash-lite | 22/28 | 28/28 | 3/7 | 7/7 | 5273ms | 4691ms | | gemini-3-flash-preview | 22/28 | 28/28 | 3/7 | 7/7 | 7122ms | 4492ms | | gemma-3-4b-it | 22/28 | 28/28 | 3/7 | 7/7 | 2067ms | 2093ms | | gemma-3-1b-it | 22/28 | 18/28 | 3/7 | 3/7 | 1440ms | 1318ms |

Finding: ctxlint is most useful under context-budget pressure. With full context, frontier Gemini models often recover despite noise. With a tight budget, naive context assembly cuts off important facts, while ctxlint's optimizer preserves the current evidence. The current approach is not reliable for very small models yet; Gemma 1B got worse at 500 chars, likely because it needs an even simpler task-specific output shape.

Compatibility

ctxlint is model-agnostic in the sense that it emits plain text packets and message JSON. It does not guarantee improvement for every LLM.

Best current fit:

budgeted agents
RAG systems with noisy retrieved chunks
coding agents with stale memory and tool output
cheap/fast models where every token matters

Known limits:

very small models may need custom packet templates
contradiction detection is heuristic
token counting is profile-based approximation, not provider-native tokenization
safety detection should be treated as defense-in-depth, not a complete prompt-injection firewall

Data To Prove Value Later

Useful before/after metrics:

input tokens and cost
model latency
task success rate
instruction-following violations
contradiction rate in outputs
prompt-injection success rate
time to debug a bad agent trace

Good first eval sets:

dirty RAG traces with injected stale docs and prompt injections
coding-agent traces with stale AGENTS.md / CLAUDE.md instructions
long-context QA fixtures with relevant facts placed behind unrelated bulk
SWE-bench-style coding tasks once integrated with an actual coding agent

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme