@zagents/darwin

v0.2.6

Published

7 days ago

An LLM supercharger and cost optimizer: freeze the model class, evolve the gemini. Measured on full SWE-bench Lite (300, official swebench Docker): 7.7% open-loop -> 15.3% +repair -> 29.3% (deepseek-v4-pro base) -> 40.3% v4-pro+frontier-tail hybrid, ~$0.0

0High
0Medium
0Low

cvsz

llm cost-optimization llm-optimizer cheap-llm compute-arbitrage agent-gemini self-improvement evolutionary-search swe-bench zagents darwin-mode sandbox

@zagents/darwin

An LLM supercharger and cost optimizer. Keep your model frozen — evolve the gemini around it so a cheap model performs like an expensive one, for a fraction of the cost.

Darwin Mode makes the LLM you already use measurably better and cheaper by evolving the operating system around it — planner, context builder, reviewer, retry/tool/memory/score policy — instead of paying for a bigger model. It mutates one surface at a time, tests each change in a sandbox, and keeps only what measurably improves, building an archive of successful descendants. No weight updates, no fine-tuning — just a population, a benchmark, and an archive.

Why it pays off (measured, not marketing):

Cheap beats frontier. On a 15-model × 6-language execution benchmark, DeepSeek-V3 ($0.4/Mtok) tops quality-per-dollar — and the gemini, not the model, is the lever (ADR-085).
Real bug-fixing for pennies. Resolves real SWE-bench Lite issues at ~$0.01/instance with a sub-$1/Mtok model (ADR-142–146) — vs. $1–20/instance for frontier-model agents.
The gemini is the multiplier. Evolving context-window/selection/retry policy lifts a fixed model's measured outcomes (e.g. finalScore 0.765 → 0.985, ADR-103) — same model, better results.

This follows the Darwin Gödel Machine lineage: iteratively mutate the source of a coding agent, then empirically validate each variant.

repo
  → profile      RepoProfile (pkg mgr, test cmd, source/risk files)
  → baseline     generate the seven mutation-surface files
  → mutate       pick ONE approved surface, perturb it (behind the gate)
  → sandbox      safety-inspect → run the test command (no shell, no net, no secrets)
  → score        weighted base score − hard penalty layer
  → archive      record parent→child as a TREE (not a single best branch)
  → select       sample the next generation from the WHOLE archive
  → repeat

Dependency-free: Node ≥ 20 built-ins only, no runtime dependencies.

Quick start

Build (TypeScript → dist/):

npm run build      # tsc

Then evolve a repo with the CLI (one verb, evolve):

zagents-darwin evolve <repo> [--generations N] [--children N] [--concurrency N] [--seed N] \
    [--bench <suite.json>] [--tie faster] \
    [--selection score|quality-diversity|behavioral-diversity|niche-steering|clade|pareto] \
    [--crossover] [--epistasis] [--risk-budget N] [--fdr Q] [--curriculum] [--sandbox real|mock|agent]

| Flag | Meaning | Default | |------|---------|---------| | --generations N | number of generations to run | 3 | | --children N | children produced per parent per generation | 4 | | --concurrency N | max variants evaluated concurrently (bounded fan-out) | 4 | | --seed N | deterministic seed for mutation selection | 0 | | --bench <suite.json> | route promotion through the statistical benchmark gate (ADR-087) | off | | --tie faster | break score ties by efficiency (ADR-086) | insertion | | --selection … | parent-selection strategy (see Evolutionary stack) | score | | --crossover | recombine two parents' surfaces (ADR-089) | off | | --epistasis | topology-aware crossover via learned linkage (ADR-093) | off | | --risk-budget N | SGM cumulative risk cap on promotions (ADR-090) | off | | --fdr Q | Benjamini-Hochberg FDR control on promotion (ADR-096) | off | | --curriculum | difficulty-ladder over a graded suite (ADR-097) | off | | --sandbox … | evaluation substrate: real (repo test) · mock (surface params, ADR-102) · agent (real surface code, ADR-106) | real |

All flags are opt-in and additive over a frozen, reproducible core — every default-path run is byte-identical to the ADR-070…075 baseline.

The <repo> argument defaults to the current directory. Everything is written under a self-describing .zagents/ work tree inside the repo:

<repo>/.zagents/
├── archive.json          # the population TREE: ArchiveRecord[] (variant + score + children)
├── lineage.json          # serialized graph { nodes, edges } for rendering
├── variants/             # one directory per variant (its mutation-surface files)
│   ├── baseline/
│   ├── g1_v0/  …
├── runs/                 # one <variantId>.json per variant: { traces, score }
└── reports/
    └── winner.json       # the best scored ArchiveRecord

Sample run output (leaderboard + winner lineage, printed to stdout):

Darwin Mode — leaderboard
  0.842  g2_v1  [contextBuilder]  safety=1.00  pass=1.00 ◀ winner
  0.791  g1_v0  [reviewer]        safety=1.00  pass=1.00
  0.788  baseline  [planner]      safety=1.00  pass=1.00
  0.000  g1_v3  [toolPolicy]      safety=0.00  pass=0.00

Winner: g2_v1
Lineage: baseline → g1_v0 → g2_v1
Delta over baseline: +0.054

Artifacts: <repo>/.zagents

The seven mutation surfaces

A child variant may mutate exactly one surface per generation, and a variant directory may contain only these seven files — nothing else (the allowlist is enforced by safety.ts, see FILE_BY_SURFACE / APPROVED_FILES). Each surface is pure, side-effect-free policy logic over injected data.

| Surface (MutationSurface) | File | Governs | |-----------------------------|------|---------| | planner | planner.ts | task string → ordered plan steps (map → inspect → patch → verify) | | contextBuilder | context_builder.ts | ranks candidate files by term overlap with the task | | reviewer | reviewer.ts | flags changed files against an injected risk-file list + test outcome | | retryPolicy | retry_policy.ts | whether/how to retry given a symbolic failure classification | | toolPolicy | tool_policy.ts | allow-list + deterministic ordering over symbolic command kinds | | memoryPolicy | memory_policy.ts | whether an outcome record is worth remembering | | scorePolicy | score_policy.ts | the weight vector a variant proposes over the positive scoring terms |

A variant may propose score weights via scorePolicy, but it can never re-grade itself: the verdict that decides promotion is computed by the frozen kernel scorer (see below), not by the variant's own file.

Scoring and the promotion gate

The scorer (src/scorer.ts, ADR-072) is a pure function — re-running it on the same traces yields an identical verdict. It is a weighted base score over six [0,1] terms (weights from scoreWeights(), summing to 1.0):

baseScore = 0.35·taskSuccess + 0.20·testPassRate + 0.15·traceQuality
          + 0.10·costEfficiency + 0.10·latencyEfficiency + 0.10·safetyScore

minus a hard penalty layer read out of the run traces (a single safety violation can drive the final score negative — that is the point):

finalScore = baseScore − 0.30·secretExposure − 0.25·destructiveAction
                       − 0.20·hallucinatedFile − 0.15·toolLoop − 0.10·costOverrun

A child replaces its parent only when all four promotion clauses hold against the parent:

1. beatsParent       finalScore > parentFinalScore + promotionDelta   (default delta 0.05)
2. safetyOk          safetyScore ≥ 0.95
3. noRegression      testPassRate ≥ parentTestPassRate
4. noBlockedActions  safetyScore == 1.0  (zero blocked actions in any trace)

Non-promoted variants are retained, not deleted — "did not clear the gate" means "not chosen as a parent by the default policy," never "removed."

The archive: evolve like species, not release like software

The archive (src/archive.ts, ADR-073) is a tree of variants keyed by id and persisted as archive.json, not a single best branch. Selection (selectParents) samples the whole archive — including older, non-promoted branches — which is how evolution escapes hill-climbing: when a generation stalls (no promotions), a weak-looking ancestor can still seed a strong branch. Insertion order is preserved, so best(), tie-breaks, and selectParents are all deterministic and reproducible from archive.json alone.

Safety model

A self-modifying agent that can edit anything is a liability. Darwin Mode's bound is enforced in src/safety.ts (ADR-071) as the load-bearing security boundary, with two independent, defense-in-depth checks:

inspectVariant(dir) runs before any variant executes. It disqualifies a variant directory containing anything other than the seven approved files, a blocked filename (.env, secret, id_rsa, .git, package.json, …), a symlink or nested directory, or blocked content (process.env, child_process, eval, fetch, restricted node builtins, shell strings, …).
validateGeneratedCode(code) runs before generated code is written to disk (the LLM-mutator path). Independent pattern set; a violating generation is discarded, never repaired in place.

The gate runs first: a disqualified variant never has its test command run — the sandbox seals the trace with the reserved exit code 99 and records the findings as blockedActions, which zeroes safetyScore and makes promotion impossible. When a variant is admitted, the sandbox (src/sandbox.ts) is shell-free (the test command is split to argv and run via execFile, never a shell — no command-injection surface) and runs under a scrubbed environment (only PATH plus three identifying variables; nothing else from process.env leaks, so secrets, tokens, and proxy settings never reach a variant).

See SECURITY.md for the full threat model.

Programmatic API

import { evolve } from '@zagents/darwin';

const result = await evolve({
  repoRoot: '/abs/path/to/repo',
  workRoot: '/abs/path/to/repo/.zagents',
  generations: 3,
  childrenPerGeneration: 4,
  concurrency: 4,
  promotionDelta: 0.05,
  seed: 0,
  tasks: [
    'run repository test suite',
    'verify generated gemini safety',
    'check trace quality',
  ],
});

result.winner;        // the best scored ArchiveRecord (or null)
result.winnerLineage; // ['baseline', 'g1_v0', 'g2_v1'] — root → winner
result.records;       // every ArchiveRecord, in insertion order
result.baseline;      // the baseline record

The package also re-exports the building blocks behind evolve: profileRepo, generateBaselineHarness, createChildVariant, DeterministicMutator / CodeGenerator, runVariantTask / runVariantTasks, scoreVariant / scoreWeights, Archive, inspectVariant / validateGeneratedCode, plus the SURFACES, FILE_BY_SURFACE, and APPROVED_FILES constants.

Evolutionary stack (ADR-084–105)

The baseline above is the frozen core. On top of it, a set of opt-in, additive, reproducible mechanisms turn the loop from a single-best search into a real evolutionary algorithm. Every one is off by default (so the core stays byte-identical) and individually toggled:

| Capability | ADR | How to enable | |---|---|---| | Failure-driven mutation — feed a parent's failing traces into the mutator | 084 | always (the deterministic mutator ignores it) | | LLM mutator — OpenRouterMutator as a CodeGenerator, behind the same safety gate; model chosen by a 15-model execution benchmark | 085 | config.generator | | Efficiency tie-break — break score ties by speed | 086 | --tie faster | | Graded statistical promotion — public∧hidden∧regression∧safety + seeded bootstrap CI over a hash-pinned suite | 087 | --bench s.json | | MAP-Elites — keep the elite per behaviour niche | 088 | --selection quality-diversity | | Genetic crossover — recombine two parents' surfaces | 089 | --crossover | | SGM risk budget — bound cumulative self-modification | 090 | --risk-budget N | | Hyperbolic phenotyping — Poincaré-ball behavioural niche from traces | 091 | --selection behavioral-diversity | | Active niche steering — drive toward under-explored regions | 092 | --selection niche-steering | | Epistatic linkage — topology-aware crossover of co-adapted surfaces | 093 | --crossover --epistasis | | Clade metaproductivity — select parents by descendant potential (Huxley-Gödel) | 094 | --selection clade | | Benjamini-Hochberg FDR control — correct promotion for multiple testing | 096 | --fdr Q | | Self-directed curriculum — difficulty ladder over a graded suite | 097 | --curriculum | | Multi-objective Pareto — non-dominated (capability × parsimony) front | 100 | --selection pareto |

The evaluation substrate (ADR-101/102)

By default the sandbox runs the repo's test command, which is independent of the gemini surfaces — so the behavioural manifold is degenerate (measured: nicheEntropy = 0, ADR-099). sandboxMode: 'mock' (ADR-102) instead runs a deterministic surface-driven agent loop, so a variant's traces depend on its surface content and the manifold comes alive. sandboxMode: 'agent' (ADR-106) runs a variant's real surface code in a child process. The real-LLM-on-real-code substrate is no longer deferred — it shipped (ADR-106→141) and now runs on canonical SWE-bench Lite (ADR-142+, below).

Validated results (real, reproducible — see `bench/results/`)

Manifold goes live (ADR-102): real nicheEntropy 0 → 0.69, finalScores flat 0.985 → spread 0.435–0.802 under mock mode.
Self-improvement (ADR-103): the loop evolves contextBuilder (window 30 → 70) and climbs finalScore 0.765 → 0.985 by generation 3.
Diversity beats greedy on deception (ADR-105): on a deceptive epistatic landscape across 5 seeds, greedy score selection crosses it 0/5, behavioral-diversity 5/5, clade 4/5 — empirically justifying the diversity machinery.
Polyglot model frontier (ADR-085): 15 models × 6 languages, execution-scored; DeepSeek-V3 ($0.4/Mtok) tops quality-per-dollar — cheap beats frontier for code.

Canonical SWE-bench Lite (real, official gemini — ADR-142–149)

Full reproducible evidence: bench/results/RESULTS.md · measured best-practices: LEARNINGS.md · known-flaky exclusions: bench/swebench/KNOWN_FLAKY.md

Run on the full 300 SWE-bench Lite (test) instances, scored by the official swebench Docker gemini — no cherry-picking, tight CIs. Solver = relevance-ranked context + symbol-aware localization + search/replace patch, deepseek-chat, ~$0.01/instance.

| config | resolved | Wilson 95% CI | ADR | |---|---|---|---| | baseline (open-loop, single-shot) | 23/300 = 7.7% | [5.2, 11.2] | 144 | | + LLM localization | 24/300 = 8.0% | [5.4, 11.6] | 146 | | + closed-loop repair (test-feedback, ≤3) | 46/300 = 15.3% | [11.7, 19.8] | 149 | | + swap base → deepseek-v4-pro (cheap) | 88/300 = 29.3% | [24.5, 34.7] | 151 | | + v4-pro + Scholar hybrid | 121/300 = 40.3% | [34.9, 46.0] | 152 | | + Barbarian&Scholar hybrid (cheap+frontier tail) | 100/300 = 33.3% | [28.2, 38.8] | 148 |

The closed-loop repair loop ~doubles the resolve-rate (7.7% → 15.3%) on the same cheap model (deepseek, <1¢/instance); the baseline and repair CIs are essentially disjoint (11.2 vs 11.7), so it's a real lift, not noise. Honest framing: 15.3% is still a cheap-model number — leaderboard leaders hit 65–88% on Verified using frontier models + deeper agentic loops at $1–20/instance. The contribution is the gemini lift on a fixed cheap model: open-loop 7.7% → closed-loop 15.3% at near-constant cost. Localization lifted file-selection recall 44.7% → 59.7%; the repair loop then converts first-shot misses into fixes via test feedback. Next levers: hybrid cheap→frontier escalation (ADR-148), local-model repair (ADR-150). Every number is reproducible under bench/swebench/.

Status

Working, empirically validated on both the mock substrate and canonical SWE-bench Lite. The DeterministicMutator is seeded and signature-preserving; the OpenRouterMutator (ADR-085) is the production LLM CodeGenerator, behind the same validateGeneratedCode gate. The safety boundary, scorer, archive, and bench layer are kernel code. The real-LLM-on-real-code frontier (once deferred) is now measured: a reproducible 7.7% [5.2–11.2%] open-loop baseline on the full SWE-bench Lite (ADR-144), with localization (146), the repair loop (149), and a hybrid cheap→frontier escalation (148) as the active levers. Darwin Mode also ships integrated into the zagents scaffolder — npx zagents <name> produces a gemini with npm run evolve out of the box (ADR-147).

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

@zagents/darwin

Quick start

The seven mutation surfaces

Scoring and the promotion gate

The archive: evolve like species, not release like software

Safety model

Programmatic API

Evolutionary stack (ADR-084–105)

The evaluation substrate (ADR-101/102)

Validated results (real, reproducible — see bench/results/)

Canonical SWE-bench Lite (real, official gemini — ADR-142–149)

Status

License

Validated results (real, reproducible — see `bench/results/`)