@phoenixaihub/tokenwaste
v1.0.0
tokenwaste
Information-theoretic context selection for AI coding agents.
TF-IDF + AST call graph to compute mutual information between your query and code chunks. Only includes chunks above a configurable information threshold. Targets 50-70x context reduction.
Why?
AI coding agents stuff entire codebases into context windows. Most of it is noise. TokenWaste uses information theory to select only the code that actually matters for your query.
Install
npm install @phoenixaihub/tokenwaste
# Or globally for CLI
npm install -g @phoenixaihub/tokenwaste
CLI Usage
Select relevant context
tokenwaste select ./src --query "authentication middleware" --threshold 0.3
# With token budget
tokenwaste select ./src --query "database connection" --threshold 0.5 --max-tokens 8000
# JSON output for piping
tokenwaste select ./src --query "error handling" --json
Analyze all chunks
tokenwaste analyze ./src --query "API routes" --top 5
Score specific files
tokenwaste score --query "auth" --files src/auth.ts src/middleware.ts
Programmatic API
selectContext(dir, options)
Main entry point. Analyzes a directory, scores all chunks, and returns only those above the threshold.
import { selectContext } from '@phoenixaihub/tokenwaste';
const result = await selectContext('./src', {
query: 'authentication middleware',
threshold: 0.5, // minimum bits of mutual information
maxTokens: 8000, // optional token budget
useCallGraph: true, // include AST call graph analysis
});
console.log(`Selected ${result.selectedChunks}/${result.totalChunks} chunks`);
console.log(`Compression: ${result.compressionRatio}x`);
for (const chunk of result.chunks) {
console.log(`${chunk.file}:${chunk.startLine} — ${chunk.mutualInformation} bits`);
console.log(chunk.content);
}
analyzeChunks(dir, query, options?)
Returns all chunks scored but unfiltered. Useful for exploration.
import { analyzeChunks } from '@phoenixaihub/tokenwaste';
const scored = await analyzeChunks('./src', 'database connection');
for (const chunk of scored.slice(0, 10)) {
console.log(`${chunk.file}: ${chunk.mutualInformation} bits`);
}
scoreRelevance(contents, query)
Score arbitrary code content against a query without reading from disk.
import { scoreRelevance } from '@phoenixaihub/tokenwaste';
const scores = scoreRelevance([
{ id: 'auth.ts', content: 'function authenticate() { ... }' },
{ id: 'utils.ts', content: 'function formatDate() { ... }' },
], 'authentication');
// scores[0].score — mutual information in bits
// scores[0].matchedTerms — which query terms matched
How It Works
- Chunking: Splits source files into ~50-line chunks with overlap
- TF-IDF: Computes term frequency–inverse document frequency between query terms and chunk content
- Call Graph: Extracts function definitions and call relationships using regex-based AST patterns (JS/TS/Python)
- Mutual Information: Combines TF-IDF similarity (converted to bits) with call graph connectivity boosting
- Threshold: Only returns chunks above the configurable MI threshold
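As an illustration of the TF-IDF and mutual-information steps above, here is a minimal, self-contained sketch in TypeScript. The helper names (`tfidfVectors`, `cosine`, `mutualInformation`) are hypothetical, not the package's actual internals:

```typescript
// Illustrative sketch of the TF-IDF scoring and bit conversion described
// above. Helper names are hypothetical; the package's internals may differ.

type Doc = { id: string; content: string };

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z_][a-z0-9_]*/g) ?? [];
}

// Build a TF-IDF weight vector for each document (the query counts as a doc).
function tfidfVectors(docs: Doc[]): Map<string, Map<string, number>> {
  const tokened = docs.map((d) => ({ id: d.id, terms: tokenize(d.content) }));
  const df = new Map<string, number>();
  for (const { terms } of tokened) {
    for (const t of new Set(terms)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  const vectors = new Map<string, Map<string, number>>();
  for (const { id, terms } of tokened) {
    const tf = new Map<string, number>();
    for (const t of terms) tf.set(t, (tf.get(t) ?? 0) + 1);
    const vec = new Map<string, number>();
    for (const [t, f] of tf) {
      const idf = Math.log(1 + docs.length / (df.get(t) ?? 1));
      vec.set(t, (f / terms.length) * idf);
    }
    vectors.set(id, vec);
  }
  return vectors;
}

// Cosine similarity between two sparse term-weight vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { na += w * w; dot += w * (b.get(t) ?? 0); }
  for (const w of b.values()) nb += w * w;
  return na > 0 && nb > 0 ? dot / Math.sqrt(na * nb) : 0;
}

// Combine per the formula below: MI = 0.7 * TF-IDF bits + 0.3 * graph bits.
function mutualInformation(sim: number, connectedRelevantFns: number): number {
  const tfidfBits = -Math.log2(1 - Math.min(sim, 0.999999)); // cap avoids Infinity
  const graphBits = Math.log2(1 + connectedRelevantFns);
  return 0.7 * tfidfBits + 0.3 * graphBits;
}
```

Chunks whose score clears the threshold would then be kept, highest-MI first, until the optional token budget is spent.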
Mutual Information Formula
MI(query; chunk) = 0.7 × TF-IDF_bits + 0.3 × Graph_bits
TF-IDF_bits = -log₂(1 - cosine_similarity)
Graph_bits = log₂(1 + connected_relevant_functions)
For example, a chunk with cosine similarity 0.5 to the query and one connected relevant function scores TF-IDF_bits = -log₂(0.5) = 1 and Graph_bits = log₂(2) = 1, so MI = 0.7 × 1 + 0.3 × 1 = 1.0 bits.
Supported Languages
| Language | TF-IDF | Call Graph |
|------------|--------|------------|
| JavaScript | ✅ | ✅ |
| TypeScript | ✅ | ✅ |
| Python | ✅ | ✅ |
| Go, Rust, Java, etc. | ✅ | ❌ (TF-IDF only) |
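For JavaScript/TypeScript, the regex-based call-graph extraction from the "Call Graph" step can be sketched like this. This is a simplified illustration covering only `function` declarations; the package's actual patterns are assumed to handle more definition styles:

```typescript
// Hypothetical sketch of regex-based call-graph extraction for JS/TS.
// Maps each defined function to the known functions it calls.
type CallGraph = Map<string, Set<string>>;

function extractCallGraph(source: string): CallGraph {
  // Find `function name(` declarations and where each one starts.
  const defRe = /function\s+([A-Za-z_$][\w$]*)\s*\(/g;
  const defs: { name: string; start: number }[] = [];
  let m: RegExpExecArray | null;
  while ((m = defRe.exec(source)) !== null) {
    defs.push({ name: m[1], start: m.index });
  }

  const names = new Set(defs.map((d) => d.name));
  const graph: CallGraph = new Map(defs.map((d) => [d.name, new Set<string>()]));

  // Within each definition's span (up to the next definition), record
  // calls to other known functions. Self-references are skipped, so
  // direct recursion is not captured in this simplified version.
  defs.forEach((d, i) => {
    const body = source.slice(d.start, defs[i + 1]?.start ?? source.length);
    const callRe = /([A-Za-z_$][\w$]*)\s*\(/g;
    let c: RegExpExecArray | null;
    while ((c = callRe.exec(body)) !== null) {
      const callee = c[1];
      if (callee !== d.name && names.has(callee)) graph.get(d.name)!.add(callee);
    }
  });
  return graph;
}
```

Connectivity in this graph is what feeds the Graph_bits term: a chunk whose functions call (or are called by) already-relevant functions gets boosted.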
Configuration
interface SelectContextOptions {
query: string; // Search query (required)
threshold?: number; // MI threshold in bits (default: 0.5)
maxTokens?: number; // Token budget (default: unlimited)
useCallGraph?: boolean; // Use AST analysis (default: true)
maxLines?: number; // Max lines per chunk (default: 50)
overlap?: number; // Chunk overlap lines (default: 5)
tfidfWeight?: number; // TF-IDF weight (default: 0.7)
graphWeight?: number; // Graph weight (default: 0.3)
maxGraphDepth?: number; // Transitive dep depth (default: 3)
}
License
MIT
