@sovereign-labs/improve
v0.1.0
Self-improving verification pipeline. Diagnose failures, generate fix candidates, validate in isolation, reject overfitting. Repair with proof.
Works with any test runner that implements the TestSurface interface. It is not coupled to @sovereign-labs/verify, though that package is the primary consumer.
Pipeline
baseline → bundle → triage → diagnose → generate → validate → rank → holdout → verdict
- Baseline: Run all scenarios, identify dirty (failing) ones
- Bundle: Group violations by root cause into evidence bundles
- Triage: Classify confidence as mechanical (pattern-match), heuristic, or needs_llm
- Diagnose: LLM root-cause analysis (skipped for mechanical triage)
- Generate: LLM produces N fix candidates (search/replace edits)
- Validate: Each candidate tested in isolated subprocess copy
- Rank: Score = improvements - regressions - line penalty
- Holdout: Best candidate re-tested against withheld scenarios (overfitting detection)
- Verdict: accepted, rejected_regression, rejected_overfitting, rejected_no_fix, skipped_*
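The Rank formula above can be sketched in a few lines. This is an illustration rather than the package's implementation: the `CandidateResult` shape, the `rankCandidates` helper, and the penalty weight of 0.01 per changed line are all assumptions made for the example.

```typescript
// Hypothetical shape; the package's real candidate type may differ.
interface CandidateResult {
  id: string;
  improvements: number; // dirty scenarios that pass after the fix
  regressions: number;  // clean scenarios that fail after the fix
  changedLines: number; // size of the edit
}

// Assumed weight. The README only states that a line penalty is subtracted.
const LINE_PENALTY_PER_LINE = 0.01;

// Score = improvements - regressions - line penalty
function score(c: CandidateResult): number {
  return c.improvements - c.regressions - c.changedLines * LINE_PENALTY_PER_LINE;
}

// Highest score first. The input array is not mutated.
function rankCandidates(candidates: CandidateResult[]): CandidateResult[] {
  return [...candidates].sort((a, b) => score(b) - score(a));
}

const ranked = rankCandidates([
  { id: "a", improvements: 3, regressions: 1, changedLines: 10 },
  { id: "b", improvements: 2, regressions: 0, changedLines: 4 },
  { id: "c", improvements: 2, regressions: 0, changedLines: 25 },
]);
console.log(ranked.map((c) => c.id).join(","));
```

With this weight, a small edit that causes no regressions outranks a larger edit that fixes more scenarios but breaks one.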
Install
npm install @sovereign-labs/improve
# or
bun add @sovereign-labs/improve
CLI Usage
# With Gemini
improve --app-dir ./my-package --llm gemini --api-key $GEMINI_KEY
# With local Ollama
improve --app-dir ./my-package --llm ollama --ollama-model qwen3:4b
# With Claude (domain-aware prompts)
improve --app-dir ./my-package --llm claude --api-key $ANTHROPIC_API_KEY
# Dry run (generate candidates, skip validation)
improve --app-dir ./my-package --llm gemini --api-key $GEMINI_KEY --dry-run
# Specific scenario families only
improve --app-dir ./my-package --llm gemini --api-key $GEMINI_KEY --families grounding,constraints
CLI Options
| Option | Description | Default |
|--------|-------------|---------|
| --app-dir <path> | Package directory to improve | . |
| --self-test <script> | Self-test script path (relative to app-dir) | scripts/self-test.ts |
| --llm <provider> | LLM provider: gemini, anthropic, claude, claude-code, ollama, none | none |
| --api-key <key> | API key for cloud providers | — |
| --ollama-model <model> | Ollama model name | qwen3:4b |
| --ollama-host <url> | Ollama host URL | http://localhost:11434 |
| --claude-model <model> | Claude model name | claude-sonnet-4-20250514 |
| --max-candidates <n> | Fix candidates per bundle | 3 |
| --max-lines <n> | Max changed lines per candidate | 30 |
| --families <list> | Comma-separated scenario families | all |
| --dry-run | Generate candidates but skip validation | false |
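As a rough picture of how these flags and defaults might map onto a typed options object, consider the sketch below. The default values are copied from the table, but the `CliOptions` type, the `parseFamilies` and `resolveOptions` helpers, and every field name other than `llm`, `apiKey`, `maxCandidates`, `maxLines`, and `dryRun` are assumptions, not the package's actual API.

```typescript
// Hypothetical options type. Field names not shown under Programmatic Usage
// are assumed for this sketch.
interface CliOptions {
  appDir: string;
  selfTest: string;
  llm: "gemini" | "anthropic" | "claude" | "claude-code" | "ollama" | "none";
  apiKey?: string;
  ollamaModel: string;
  ollamaHost: string;
  claudeModel: string;
  maxCandidates: number;
  maxLines: number;
  families: string[] | "all";
  dryRun: boolean;
}

// Defaults copied from the options table.
const DEFAULTS: CliOptions = {
  appDir: ".",
  selfTest: "scripts/self-test.ts",
  llm: "none",
  ollamaModel: "qwen3:4b",
  ollamaHost: "http://localhost:11434",
  claudeModel: "claude-sonnet-4-20250514",
  maxCandidates: 3,
  maxLines: 30,
  families: "all",
  dryRun: false,
};

// Split a --families value such as "grounding,constraints" into names.
function parseFamilies(value: string | undefined): string[] | "all" {
  if (value === undefined || value.trim() === "") return "all";
  return value
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Later keys win, so explicit overrides replace the defaults.
function resolveOptions(overrides: Partial<CliOptions>): CliOptions {
  return { ...DEFAULTS, ...overrides };
}

const opts = resolveOptions({
  llm: "gemini",
  families: parseFamilies("grounding, constraints"),
});
console.log(opts.maxCandidates, opts.families);
```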
Programmatic Usage
import { runImproveLoop } from '@sovereign-labs/improve';
import type { TestSurface, ImproveConfig } from '@sovereign-labs/improve';
// Implement the TestSurface interface for your test runner
const surface: TestSurface = {
  packageDir: './my-package',
  selfTestScript: 'scripts/self-test.ts',
  async runBaseline(config) {
    // Run your test suite, return LedgerEntry[]
    return myTestRunner.run(config);
  },
};
const config: ImproveConfig = {
  llm: 'gemini',
  apiKey: process.env.GEMINI_API_KEY,
  maxCandidates: 3,
  maxLines: 30,
  dryRun: false,
};
const entries = await runImproveLoop(surface, config);
const accepted = entries.filter(e => e.verdict === 'accepted');
console.log(`${accepted.length} improvements applied`);
Custom LLM Providers
import { createLLMProvider, createClaudeCodeProvider } from '@sovereign-labs/improve';
// Factory: routes to the right provider based on config (an ImproveConfig, as in Programmatic Usage)
const callLLM = createLLMProvider(config);
// Claude Code callback: for interactive Claude Code sessions
const callClaudeCode = createClaudeCodeProvider(async (system, user) => {
  // Your callback sends prompts to Claude Code and returns the response.
  // sendToClaudeCode is a hypothetical helper that you supply.
  const response = await sendToClaudeCode(system, user);
  return { text: response, inputTokens: 0, outputTokens: 0 };
});
Provider Comparison
| Provider | Best For | Notes |
|----------|----------|-------|
| gemini | General use | Gemini 2.5 Flash, fast, cheap |
| anthropic | Standard Claude | Claude Sonnet, good reasoning |
| claude | Domain-aware | Enhanced prompts with architecture context |
| claude-code | Interactive | Callback-based, integrates with Claude Code sessions |
| ollama | Air-gap / free | Local models (qwen3:4b default) |
| none | Testing | Skips LLM diagnosis and fix generation |
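For tests that should not reach a real model, a scripted stand-in with the same callback shape as the Claude Code example (an async function of `system` and `user` that returns `text`, `inputTokens`, and `outputTokens`) may be handy. The sketch below is an assumption-laden illustration: `LLMResponse`, `LLMCall`, and `createScriptedProvider` are names invented here, not package exports.

```typescript
// Local types mirroring the callback shape in the Custom LLM Providers example.
// These names are assumptions, not exports of @sovereign-labs/improve.
interface LLMResponse {
  text: string;
  inputTokens: number;
  outputTokens: number;
}
type LLMCall = (system: string, user: string) => Promise<LLMResponse>;

// Returns canned replies in order and records every prompt it receives.
function createScriptedProvider(replies: string[]) {
  const calls: Array<{ system: string; user: string }> = [];
  let next = 0;
  const callLLM: LLMCall = async (system, user) => {
    calls.push({ system, user });
    if (next >= replies.length) {
      throw new Error("scripted provider ran out of replies");
    }
    const text = replies[next++];
    return { text, inputTokens: 0, outputTokens: 0 };
  };
  return { callLLM, calls };
}

const stub = createScriptedProvider(['{"edits": []}']);
stub.callLLM("system prompt", "user prompt").then((reply) => {
  console.log(reply.text, stub.calls.length);
});
```

Because the callback matches the shape passed to `createClaudeCodeProvider` in the example, a function like `stub.callLLM` could plausibly be supplied there, though that pairing has not been verified against the package.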
Key Concepts
TestSurface Interface
Any test runner can be improved — just implement this interface:
interface TestSurface {
  runBaseline(config: TestSurfaceConfig): Promise<LedgerEntry[]>;
  packageDir: string;
  selfTestScript: string;
}
Overfitting Detection
Scenarios are split into three sets:
- Dirty — Failing scenarios the fix should repair
- Validation — Clean scenarios that must stay clean
- Holdout — Withheld clean scenarios for overfitting detection
A fix that passes validation but regresses the holdout is rejected as overfitting.
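The three-way split and the overfitting check can be illustrated as follows. This is a minimal sketch under stated assumptions: the `Scenario` shape, the rule of withholding every third clean scenario, and the `splitScenarios` and `isOverfit` helpers are hypothetical, and the package may split scenarios differently.

```typescript
// Hypothetical scenario record; the package's LedgerEntry type may differ.
interface Scenario {
  id: string;
  passing: boolean;
}

interface ScenarioSplit {
  dirty: Scenario[];
  validation: Scenario[];
  holdout: Scenario[];
}

// Failing scenarios form the dirty set. Clean scenarios are sorted by id and
// every `holdoutEvery`-th one is withheld, so the split is deterministic.
function splitScenarios(all: Scenario[], holdoutEvery = 3): ScenarioSplit {
  const dirty = all.filter((s) => !s.passing);
  const clean = all
    .filter((s) => s.passing)
    .sort((a, b) => a.id.localeCompare(b.id));
  const isHeldOut = (i: number) => i % holdoutEvery === holdoutEvery - 1;
  return {
    dirty,
    validation: clean.filter((_, i) => !isHeldOut(i)),
    holdout: clean.filter((_, i) => isHeldOut(i)),
  };
}

// Overfitting: validation stays clean but at least one holdout scenario fails.
function isOverfit(
  split: ScenarioSplit,
  passesAfterFix: (id: string) => boolean,
): boolean {
  const validationClean = split.validation.every((s) => passesAfterFix(s.id));
  const holdoutClean = split.holdout.every((s) => passesAfterFix(s.id));
  return validationClean && !holdoutClean;
}

const split = splitScenarios([
  { id: "s1", passing: false },
  { id: "s2", passing: true },
  { id: "s3", passing: true },
  { id: "s4", passing: true },
  { id: "s5", passing: true },
]);
// A fix that breaks only the withheld scenario s4 is flagged.
console.log(isOverfit(split, (id) => id !== "s4"));
```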
Cross-Run Dedup
Fix candidates are SHA-256 hashed. Failed hashes are stored in data/improve-history.json and skipped on subsequent runs. The LLM also receives prior attempt context to avoid repeating strategies.
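One way to picture the dedup step is shown below, using Node's built-in `crypto` module. The `Edit` shape, the canonical JSON form that gets hashed, and the in-memory `FailedHistory` class are assumptions for illustration; the actual layout of data/improve-history.json is not described here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of one search/replace edit in a fix candidate.
interface Edit {
  file: string;
  search: string;
  replace: string;
}

// Hash a canonical JSON form so identical candidates share a digest.
function hashCandidate(edits: Edit[]): string {
  const canonical = JSON.stringify(
    edits.map((e) => [e.file, e.search, e.replace]),
  );
  return createHash("sha256").update(canonical).digest("hex");
}

// In-memory stand-in for the failed-hash list kept in the history file.
class FailedHistory {
  private readonly failed = new Set<string>();

  recordFailure(edits: Edit[]): void {
    this.failed.add(hashCandidate(edits));
  }

  shouldSkip(edits: Edit[]): boolean {
    return this.failed.has(hashCandidate(edits));
  }
}

const history = new FailedHistory();
const attempt: Edit[] = [
  { file: "src/rules.ts", search: "limit = 1", replace: "limit = 2" },
];
history.recordFailure(attempt);
console.log(history.shouldSkip(attempt));
```

Persisting the set between runs would amount to serializing the digests to a file and loading them at startup.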
Bounded Edit Surface
Edits are constrained to a configurable set of files (DEFAULT_BOUNDED_SURFACE). Frozen files (DEFAULT_FROZEN_FILES) are never modified. This prevents the LLM from rewriting core infrastructure to make tests pass.
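A guard of this kind might look like the sketch below. The file lists are placeholders, since the real contents of `DEFAULT_BOUNDED_SURFACE` and `DEFAULT_FROZEN_FILES` are not shown in this README, and `checkEditSurface` is a hypothetical helper that does simple exact-path matching.

```typescript
// Placeholder lists for illustration only.
const EXAMPLE_BOUNDED_SURFACE = ["src/rules.ts", "src/checks.ts"];
const EXAMPLE_FROZEN_FILES = ["src/types.ts"];

interface EditTarget {
  file: string;
}

type GuardResult = { ok: true } | { ok: false; reason: string };

// Normalize separators and a leading "./" so equivalent paths compare equal.
function normalizePath(file: string): string {
  return file.replace(/\\/g, "/").replace(/^\.\//, "");
}

// Reject the whole candidate if any edit touches a frozen file or a file
// outside the allowed surface. Frozen files are checked first.
function checkEditSurface(
  edits: EditTarget[],
  bounded: string[] = EXAMPLE_BOUNDED_SURFACE,
  frozen: string[] = EXAMPLE_FROZEN_FILES,
): GuardResult {
  for (const edit of edits) {
    const file = normalizePath(edit.file);
    if (frozen.includes(file)) {
      return { ok: false, reason: `${file} is frozen` };
    }
    if (!bounded.includes(file)) {
      return { ok: false, reason: `${file} is outside the edit surface` };
    }
  }
  return { ok: true };
}

console.log(checkEditSurface([{ file: "./src/rules.ts" }]).ok);
console.log(checkEditSurface([{ file: "src/types.ts" }]).ok);
```

A production guard would also need to resolve `..` segments and symlinks, which this sketch does not attempt.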
Verdicts
| Verdict | Meaning |
|---------|---------|
| accepted | Fix improves dirty scenarios, no regressions, passes holdout |
| rejected_regression | Fix causes regressions in clean scenarios |
| rejected_overfitting | Fix passes validation but fails holdout |
| rejected_no_fix | No valid fix candidates generated |
| skipped_all_clean | All scenarios already clean |
| skipped_no_llm | Needs LLM but no provider configured |
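The table can be read as a decision procedure. The sketch below is one plausible ordering of the checks, not the package's logic: the `RunSummary` fields and the `decideVerdict` helper are hypothetical, and treating a best candidate with zero improvements as `rejected_no_fix` is an assumption.

```typescript
type Verdict =
  | "accepted"
  | "rejected_regression"
  | "rejected_overfitting"
  | "rejected_no_fix"
  | "skipped_all_clean"
  | "skipped_no_llm";

// Hypothetical summary of one run; all field names are assumptions.
interface RunSummary {
  dirtyCount: number;            // failing scenarios at baseline
  needsLlm: boolean;             // triage could not resolve the bundle mechanically
  llmConfigured: boolean;        // a provider other than "none" is set
  candidateCount: number;        // fix candidates generated
  improvements: number;          // dirty scenarios fixed by the best candidate
  validationRegressions: number; // clean validation scenarios it broke
  holdoutRegressions: number;    // withheld scenarios it broke
}

function decideVerdict(s: RunSummary): Verdict {
  if (s.dirtyCount === 0) return "skipped_all_clean";
  if (s.needsLlm && !s.llmConfigured) return "skipped_no_llm";
  // Assumption: a candidate that fixes nothing counts as no valid fix.
  if (s.candidateCount === 0 || s.improvements === 0) return "rejected_no_fix";
  if (s.validationRegressions > 0) return "rejected_regression";
  if (s.holdoutRegressions > 0) return "rejected_overfitting";
  return "accepted";
}

const base: RunSummary = {
  dirtyCount: 2,
  needsLlm: true,
  llmConfigured: true,
  candidateCount: 3,
  improvements: 2,
  validationRegressions: 0,
  holdoutRegressions: 0,
};
console.log(decideVerdict(base));
console.log(decideVerdict({ ...base, holdoutRegressions: 1 }));
```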
Architecture
improve/
  src/
    index.ts - Public API exports
    cli.ts - CLI entry point
    types.ts - All type definitions (zero external imports)
    improve.ts - Main orchestrator (7-step pipeline)
    triage.ts - Evidence bundling + edit surface guards
    prompts.ts - LLM prompt construction (generic + Claude-aware)
    providers.ts - LLM provider factory (Gemini, Anthropic, Claude, Ollama)
    subprocess.ts - Isolated validation (copy package, apply edits, run tests)
    report.ts - Terminal output formatting
    utils.ts - JSON extraction, hashing, retry logic
Zero coupling: The improve package has zero runtime imports from any Sovereign package. All integration happens at the CLI/harness level via subprocess invocation and the TestSurface interface.
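The isolation idea behind subprocess.ts (copy the package, apply edits to the copy, test the copy) can be sketched with Node's standard library. This is not the package's code: the `Edit` shape and the `applyInIsolation` helper are hypothetical, and the sketch stops before running tests, which would typically be a child process started inside the returned directory.

```typescript
import { cpSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical shape of one search/replace edit.
interface Edit {
  file: string; // path relative to the package root
  search: string;
  replace: string;
}

// Copy packageDir into a fresh temp directory and apply the edits to the
// copy only. Returns the path of the copy; the original is never written.
function applyInIsolation(packageDir: string, edits: Edit[]): string {
  const workDir = mkdtempSync(join(tmpdir(), "improve-sketch-"));
  cpSync(packageDir, workDir, { recursive: true });
  for (const edit of edits) {
    const target = join(workDir, edit.file);
    const before = readFileSync(target, "utf8");
    if (!before.includes(edit.search)) {
      throw new Error(`search text not found in ${edit.file}`);
    }
    // Replaces the first occurrence. The function form keeps "$" sequences
    // in edit.replace from being treated as replacement patterns.
    writeFileSync(target, before.replace(edit.search, () => edit.replace));
  }
  return workDir;
}

// Usage: build a throwaway package, then patch a copy of it.
const pkgDir = mkdtempSync(join(tmpdir(), "improve-pkg-"));
writeFileSync(join(pkgDir, "rule.ts"), "export const limit = 1;\n");
const copyDir = applyInIsolation(pkgDir, [
  { file: "rule.ts", search: "limit = 1", replace: "limit = 2" },
]);
console.log(readFileSync(join(copyDir, "rule.ts"), "utf8").trim());
```

Working on a copy means a rejected candidate leaves no trace in the real package, and several candidates can be evaluated without interfering with one another.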
License
MIT
