@sovereign-labs/improve

v0.1.0

Published

2 months ago

Self-improving verification pipeline. Diagnose failures, generate fix candidates, validate in isolation, reject overfitting. Repair with proof.

vibestarter

ai-agents verification self-improvement auto-repair code-review ci-cd mcp agent-safety llm-testing cursor copilot aider opendevin

Works with any test runner implementing the TestSurface interface — not coupled to @sovereign-labs/verify, though that's the primary consumer.

Pipeline

baseline → bundle → triage → diagnose → generate → validate → rank → holdout → verdict

Baseline — Run all scenarios, identify dirty (failing) ones
Bundle — Group violations by root cause into evidence bundles
Triage — Classify confidence: mechanical (pattern-match), heuristic, or needs_llm
Diagnose — LLM root-cause analysis (skipped for mechanical triage)
Generate — LLM produces N fix candidates (search/replace edits)
Validate — Each candidate tested in isolated subprocess copy
Rank — Score = improvements - regressions - line penalty
Holdout — Best candidate re-tested against withheld scenarios (overfitting detection)
Verdict — accepted, rejected_regression, rejected_overfitting, rejected_no_fix, skipped_*
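
The Rank formula above can be illustrated with a short sketch. This is not the package's implementation: the `CandidateResult` shape, the function names, and the penalty weight of 0.01 per changed line are assumptions chosen for illustration.

```typescript
// Hypothetical shape for a validated candidate; not the package's real type.
interface CandidateResult {
  id: string;
  improvements: number; // dirty scenarios that now pass
  regressions: number;  // clean scenarios that now fail
  changedLines: number; // size of the edit
}

// Assumed weight: each changed line costs 0.01 points.
const LINE_PENALTY_WEIGHT = 0.01;

// Score = improvements - regressions - line penalty.
function scoreCandidate(c: CandidateResult, weight: number = LINE_PENALTY_WEIGHT): number {
  return c.improvements - c.regressions - c.changedLines * weight;
}

// Highest score first; the input array is left untouched.
function rankCandidates(candidates: CandidateResult[]): CandidateResult[] {
  return [...candidates].sort((a, b) => scoreCandidate(b) - scoreCandidate(a));
}
```

With a small per-line weight, the penalty acts mostly as a tie-breaker: two candidates that repair the same scenarios are ordered by how little code they touch.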

Install

npm install @sovereign-labs/improve
# or
bun add @sovereign-labs/improve

CLI Usage

# With Gemini
improve --app-dir ./my-package --llm gemini --api-key $GEMINI_KEY

# With local Ollama
improve --app-dir ./my-package --llm ollama --ollama-model qwen3:4b

# With Claude (domain-aware prompts)
improve --app-dir ./my-package --llm claude --api-key $ANTHROPIC_API_KEY

# Dry run (generate candidates, skip validation)
improve --app-dir ./my-package --llm gemini --api-key $GEMINI_KEY --dry-run

# Specific scenario families only
improve --app-dir ./my-package --llm gemini --api-key $GEMINI_KEY --families grounding,constraints

CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `--app-dir <path>` | Package directory to improve | `.` |
| `--self-test <script>` | Self-test script path (relative to app-dir) | `scripts/self-test.ts` |
| `--llm <provider>` | LLM provider: `gemini`, `anthropic`, `claude`, `claude-code`, `ollama`, `none` | `none` |
| `--api-key <key>` | API key for cloud providers | (none) |
| `--ollama-model <model>` | Ollama model name | `qwen3:4b` |
| `--ollama-host <url>` | Ollama host URL | `http://localhost:11434` |
| `--claude-model <model>` | Claude model name | `claude-sonnet-4-20250514` |
| `--max-candidates <n>` | Fix candidates per bundle | `3` |
| `--max-lines <n>` | Max changed lines per candidate | `30` |
| `--families <list>` | Comma-separated scenario families | all |
| `--dry-run` | Generate candidates but skip validation | `false` |

Programmatic Usage

import { runImproveLoop } from '@sovereign-labs/improve';
import type { TestSurface, ImproveConfig } from '@sovereign-labs/improve';

// Implement the TestSurface interface for your test runner
const surface: TestSurface = {
  packageDir: './my-package',
  selfTestScript: 'scripts/self-test.ts',
  async runBaseline(config) {
    // Run your test suite and return LedgerEntry[] (myTestRunner stands in for your own runner)
    return myTestRunner.run(config);
  },
};

const config: ImproveConfig = {
  llm: 'gemini',
  apiKey: process.env.GEMINI_API_KEY,
  maxCandidates: 3,
  maxLines: 30,
  dryRun: false,
};

const entries = await runImproveLoop(surface, config);
const accepted = entries.filter(e => e.verdict === 'accepted');
console.log(`${accepted.length} improvements applied`);

Custom LLM Providers

import { createLLMProvider, createClaudeCodeProvider } from '@sovereign-labs/improve';

// Factory: routes to the right provider based on config
// (`config` is the ImproveConfig from Programmatic Usage)
const callLLM = createLLMProvider(config);

// Claude Code callback, for interactive Claude Code sessions
const callClaudeCode = createClaudeCodeProvider(async (system, user) => {
  // Send both prompts to Claude Code and capture the reply.
  // `sendToClaudeCode` is a placeholder for your own integration, not part of this package.
  const response = await sendToClaudeCode(system, user);
  return { text: response, inputTokens: 0, outputTokens: 0 };
});

Provider Comparison

| Provider | Best For | Notes |
|----------|----------|-------|
| gemini | General use | Gemini 2.5 Flash, fast, cheap |
| anthropic | Standard Claude | Claude Sonnet, good reasoning |
| claude | Domain-aware | Enhanced prompts with architecture context |
| claude-code | Interactive | Callback-based, integrates with Claude Code sessions |
| ollama | Air-gap / free | Local models (qwen3:4b default) |
| none | Testing | Skips LLM diagnosis and fix generation |

Key Concepts

TestSurface Interface

Any test runner can be improved — just implement this interface:

interface TestSurface {
  runBaseline(config: TestSurfaceConfig): Promise<LedgerEntry[]>;
  packageDir: string;
  selfTestScript: string;
}

Overfitting Detection

Scenarios are split into three sets:

Dirty — Failing scenarios the fix should repair
Validation — Clean scenarios that must stay clean
Holdout — Withheld clean scenarios for overfitting detection

A fix that passes validation but regresses the holdout is rejected as overfitting.
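
The three-way split and the holdout check can be sketched as follows. The split policy shown here (every fourth clean scenario is withheld) and all type and function names are assumptions for illustration; the package's actual split ratio is not documented in this README.

```typescript
// Hypothetical baseline record: scenario id plus whether it passed.
interface ScenarioResult {
  id: string;
  passed: boolean;
}

interface ScenarioSplit {
  dirty: string[];      // failing at baseline; the fix should repair these
  validation: string[]; // clean at baseline; must stay clean
  holdout: string[];    // clean at baseline; withheld until the final check
}

// Assumed policy: every `holdoutEvery`-th clean scenario goes to the holdout set.
function splitScenarios(baseline: ScenarioResult[], holdoutEvery: number = 4): ScenarioSplit {
  const split: ScenarioSplit = { dirty: [], validation: [], holdout: [] };
  let cleanSeen = 0;
  for (const r of baseline) {
    if (!r.passed) {
      split.dirty.push(r.id);
    } else if (++cleanSeen % holdoutEvery === 0) {
      split.holdout.push(r.id);
    } else {
      split.validation.push(r.id);
    }
  }
  return split;
}

// A candidate overfits if validation stays clean but a holdout scenario fails.
// `after` maps scenario id to pass/fail once the candidate has been applied.
function isOverfit(split: ScenarioSplit, after: Map<string, boolean>): boolean {
  const allPass = (ids: string[]): boolean => ids.every((id) => after.get(id) === true);
  return allPass(split.validation) && !allPass(split.holdout);
}
```

Because the holdout scenarios are only evaluated for the best candidate, they cannot influence ranking, which is what makes a holdout failure a useful overfitting signal.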

Cross-Run Dedup

Fix candidates are SHA-256 hashed. Failed hashes are stored in data/improve-history.json and skipped on subsequent runs. The LLM also receives prior attempt context to avoid repeating strategies.
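
A minimal sketch of the hashing and skip logic, using Node's built-in `crypto` module. The `Edit` shape, the canonical serialization, and the function names are assumptions; the layout of `data/improve-history.json` is not documented here, so the failed hashes are passed in as a plain `Set`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical edit shape: one search/replace pair in one file.
interface Edit {
  file: string;
  search: string;
  replace: string;
}

// Hash a candidate's edits with SHA-256.
// Each edit is serialized as a JSON array so field boundaries stay unambiguous.
function hashCandidate(edits: Edit[]): string {
  const canonical = edits
    .map((e) => JSON.stringify([e.file, e.search, e.replace]))
    .join("\n");
  return createHash("sha256").update(canonical).digest("hex");
}

// Keep only candidates whose hash is not among the previously failed hashes.
function filterUntried(candidates: Edit[][], failedHashes: Set<string>): Edit[][] {
  return candidates.filter((c) => !failedHashes.has(hashCandidate(c)));
}
```

In this sketch, a caller would load the failed hashes from the history file before generation and append the hash of each newly rejected candidate afterwards.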

Bounded Edit Surface

Edits are constrained to a configurable set of files (DEFAULT_BOUNDED_SURFACE). Frozen files (DEFAULT_FROZEN_FILES) are never modified. This prevents the LLM from rewriting core infrastructure to make tests pass.
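
The guard can be pictured as a simple path check. The sketch below is an assumption about how such a check might look, not the package's code: it takes the allowed and frozen lists as parameters rather than guessing the contents of `DEFAULT_BOUNDED_SURFACE` and `DEFAULT_FROZEN_FILES`, and the function name is hypothetical.

```typescript
import { posix } from "node:path";

// Returns true only if `file` is inside the allowed surface and not frozen.
// Both lists are assumed to hold normalized, package-relative POSIX paths.
function isEditAllowed(file: string, surface: string[], frozen: string[]): boolean {
  const normalized = posix.normalize(file);
  // Reject absolute paths and anything that escapes the package directory.
  if (posix.isAbsolute(normalized) || normalized === ".." || normalized.startsWith("../")) {
    return false;
  }
  // Frozen files win even if they also appear in the surface list.
  if (frozen.includes(normalized)) return false;
  return surface.includes(normalized);
}
```

Checking the frozen list before the surface list means a misconfigured surface cannot accidentally unlock a frozen file.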

Verdicts

| Verdict | Meaning |
|---------|---------|
| accepted | Fix improves dirty scenarios, no regressions, passes holdout |
| rejected_regression | Fix causes regressions in clean scenarios |
| rejected_overfitting | Fix passes validation but fails holdout |
| rejected_no_fix | No valid fix candidates generated |
| skipped_all_clean | All scenarios already clean |
| skipped_no_llm | Needs LLM but no provider configured |
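
How the four non-skipped verdicts relate to each other can be sketched as a small decision function. The `Outcome` shape, the order of the checks, and treating "no improvement" as `rejected_no_fix` are assumptions made for illustration.

```typescript
type Verdict =
  | "accepted"
  | "rejected_regression"
  | "rejected_overfitting"
  | "rejected_no_fix";

// Hypothetical summary of the best candidate after validation and holdout.
interface Outcome {
  candidates: number;     // valid fix candidates generated
  improvements: number;   // dirty scenarios repaired
  regressions: number;    // clean validation scenarios broken
  holdoutPassed: boolean; // holdout set stayed clean
}

function decideVerdict(o: Outcome): Verdict {
  // Assumption: a candidate that repairs nothing counts as "no fix".
  if (o.candidates === 0 || o.improvements === 0) return "rejected_no_fix";
  if (o.regressions > 0) return "rejected_regression";
  if (!o.holdoutPassed) return "rejected_overfitting";
  return "accepted";
}
```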

Architecture

improve/
  src/
    index.ts        — Public API exports
    cli.ts          — CLI entry point
    types.ts        — All type definitions (zero external imports)
    improve.ts      — Main orchestrator (7-step pipeline)
    triage.ts       — Evidence bundling + edit surface guards
    prompts.ts      — LLM prompt construction (generic + Claude-aware)
    providers.ts    — LLM provider factory (Gemini, Anthropic, Claude, Ollama)
    subprocess.ts   — Isolated validation (copy package, apply edits, run tests)
    report.ts       — Terminal output formatting
    utils.ts        — JSON extraction, hashing, retry logic

Zero coupling: The improve package has zero runtime imports from any Sovereign package. All integration happens at the CLI/harness level via subprocess invocation and the TestSurface interface.
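
The "apply edits" step that `subprocess.ts` performs on the package copy is not detailed in this README. One plausible way to apply a single search/replace edit safely is sketched below; the function name and the rule that the search text must match exactly once are assumptions.

```typescript
// Apply one search/replace edit to file contents held in memory.
// Returns null when `search` is empty, missing, or matches more than once,
// so the caller can discard the candidate instead of guessing.
function applyEdit(content: string, search: string, replace: string): string | null {
  if (search.length === 0) return null;
  const first = content.indexOf(search);
  if (first === -1) return null;
  if (content.indexOf(search, first + 1) !== -1) return null;
  // Slicing avoids the special `$` patterns that String.prototype.replace interprets.
  return content.slice(0, first) + replace + content.slice(first + search.length);
}
```

Requiring a unique match keeps an LLM-generated edit from silently landing in the wrong place when the same snippet appears twice in a file.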

License

MIT
