@zagents/darwin

v0.2.6


An LLM supercharger and cost optimizer: freeze the model class, evolve the harness. Measured on full SWE-bench Lite (300, official swebench Docker): 7.7% open-loop -> 15.3% +repair -> 29.3% (deepseek-v4-pro base) -> 40.3% v4-pro+frontier-tail hybrid, ~$0.0


@zagents/darwin

An LLM supercharger and cost optimizer. Keep your model frozen and evolve the harness around it so a cheap model performs like an expensive one, for a fraction of the cost.

Darwin Mode makes the LLM you already use measurably better and cheaper by evolving the operating system around it — planner, context builder, reviewer, retry/tool/memory/score policy — instead of paying for a bigger model. It mutates one surface at a time, tests each change in a sandbox, and keeps only what measurably improves, building an archive of successful descendants. No weight updates, no fine-tuning — just a population, a benchmark, and an archive.

Why it pays off (measured, not marketing):

  • Cheap beats frontier. On a 15-model × 6-language execution benchmark, DeepSeek-V3 ($0.4/Mtok) tops quality-per-dollar, and the harness, not the model, is the lever (ADR-085).
  • Real bug-fixing for pennies. Resolves real SWE-bench Lite issues at ~$0.01/instance with a sub-$1/Mtok model (ADR-142–146) — vs. $1–20/instance for frontier-model agents.
  • The harness is the multiplier. Evolving context-window/selection/retry policy lifts a fixed model's measured outcomes (e.g. finalScore 0.765 → 0.985, ADR-103): same model, better results.

This follows the Darwin Gödel Machine lineage: iteratively mutate the source of a coding agent, then empirically validate each variant.

repo
  → profile      RepoProfile (pkg mgr, test cmd, source/risk files)
  → baseline     generate the seven mutation-surface files
  → mutate       pick ONE approved surface, perturb it (behind the gate)
  → sandbox      safety-inspect → run the test command (no shell, no net, no secrets)
  → score        weighted base score − hard penalty layer
  → archive      record parent→child as a TREE (not a single best branch)
  → select       sample the next generation from the WHOLE archive
  → repeat

Dependency-free: Node ≥ 20 built-ins only, no runtime dependencies.

Quick start

Build (TypeScript → dist/):

npm run build      # tsc

Then evolve a repo with the CLI (one verb, evolve):

zagents-darwin evolve <repo> [--generations N] [--children N] [--concurrency N] [--seed N] \
    [--bench <suite.json>] [--tie faster] \
    [--selection score|quality-diversity|behavioral-diversity|niche-steering|clade|pareto] \
    [--crossover] [--epistasis] [--risk-budget N] [--fdr Q] [--curriculum] [--sandbox real|mock|agent]

| Flag | Meaning | Default |
|------|---------|---------|
| --generations N | number of generations to run | 3 |
| --children N | children produced per parent per generation | 4 |
| --concurrency N | max variants evaluated concurrently (bounded fan-out) | 4 |
| --seed N | deterministic seed for mutation selection | 0 |
| --bench <suite.json> | route promotion through the statistical benchmark gate (ADR-087) | off |
| --tie faster | break score ties by efficiency (ADR-086) | insertion |
| --selection … | parent-selection strategy (see Evolutionary stack) | score |
| --crossover | recombine two parents' surfaces (ADR-089) | off |
| --epistasis | topology-aware crossover via learned linkage (ADR-093) | off |
| --risk-budget N | SGM cumulative risk cap on promotions (ADR-090) | off |
| --fdr Q | Benjamini-Hochberg FDR control on promotion (ADR-096) | off |
| --curriculum | difficulty-ladder over a graded suite (ADR-097) | off |
| --sandbox … | evaluation substrate: real (repo test) · mock (surface params, ADR-102) · agent (real surface code, ADR-106) | real |

All flags are opt-in and additive over a frozen, reproducible core — every default-path run is byte-identical to the ADR-070…075 baseline.

The <repo> argument defaults to the current directory. Everything is written under a self-describing .zagents/ work tree inside the repo:

<repo>/.zagents/
├── archive.json          # the population TREE: ArchiveRecord[] (variant + score + children)
├── lineage.json          # serialized graph { nodes, edges } for rendering
├── variants/             # one directory per variant (its mutation-surface files)
│   ├── baseline/
│   ├── g1_v0/  …
├── runs/                 # one <variantId>.json per variant: { traces, score }
└── reports/
    └── winner.json       # the best scored ArchiveRecord

Sample run output (leaderboard + winner lineage, printed to stdout):

Darwin Mode — leaderboard
  0.842  g2_v1  [contextBuilder]  safety=1.00  pass=1.00 ◀ winner
  0.791  g1_v0  [reviewer]        safety=1.00  pass=1.00
  0.788  baseline  [planner]      safety=1.00  pass=1.00
  0.000  g1_v3  [toolPolicy]      safety=0.00  pass=0.00

Winner: g2_v1
Lineage: baseline → g1_v0 → g2_v1
Delta over baseline: +0.054

Artifacts: <repo>/.zagents

The seven mutation surfaces

A child variant may mutate exactly one surface per generation, and a variant directory may contain only these seven files — nothing else (the allowlist is enforced by safety.ts, see FILE_BY_SURFACE / APPROVED_FILES). Each surface is pure, side-effect-free policy logic over injected data.

| Surface (MutationSurface) | File | Governs |
|-----------------------------|------|---------|
| planner | planner.ts | task string → ordered plan steps (map → inspect → patch → verify) |
| contextBuilder | context_builder.ts | ranks candidate files by term overlap with the task |
| reviewer | reviewer.ts | flags changed files against an injected risk-file list + test outcome |
| retryPolicy | retry_policy.ts | whether/how to retry given a symbolic failure classification |
| toolPolicy | tool_policy.ts | allow-list + deterministic ordering over symbolic command kinds |
| memoryPolicy | memory_policy.ts | whether an outcome record is worth remembering |
| scorePolicy | score_policy.ts | the weight vector a variant proposes over the positive scoring terms |

A variant may propose score weights via scorePolicy, but it can never re-grade itself: the verdict that decides promotion is computed by the frozen kernel scorer (see below), not by the variant's own file.
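To make "pure, side-effect-free policy logic over injected data" concrete, here is a minimal sketch of what a contextBuilder-style surface could look like: it ranks candidate files by term overlap with the task string. The names `Candidate`, `tokenize`, and `rankFiles` are hypothetical and chosen for illustration; the package's actual context_builder.ts is not shown in this README and may differ.

```typescript
// Hypothetical sketch of a contextBuilder-style surface (not the package's code).
// It is pure: the task and candidates are injected, nothing is read from disk.
interface Candidate {
  path: string;
  text: string;
}

// Lowercase, split on non-alphanumerics, drop 1-char tokens and duplicates.
function tokenize(s: string): string[] {
  const out: string[] = [];
  for (const raw of s.toLowerCase().split(/[^a-z0-9]+/)) {
    if (raw.length > 1 && out.indexOf(raw) === -1) out.push(raw);
  }
  return out;
}

// Rank candidates by how many task terms appear in their path or text.
function rankFiles(task: string, candidates: Candidate[], limit: number): string[] {
  const taskTerms = tokenize(task);
  return candidates
    .map((c, index) => {
      const terms = tokenize(c.path + ' ' + c.text);
      const overlap = taskTerms.filter((t) => terms.indexOf(t) !== -1).length;
      return { path: c.path, overlap, index };
    })
    // Overlap descending; input order breaks ties so the result is deterministic.
    .sort((a, b) => b.overlap - a.overlap || a.index - b.index)
    .slice(0, limit)
    .map((r) => r.path);
}

const ranked = rankFiles(
  'fix retry policy timeout',
  [
    { path: 'src/planner.ts', text: 'plan steps' },
    { path: 'src/retry_policy.ts', text: 'retry on timeout' },
    { path: 'src/reviewer.ts', text: 'policy review' },
  ],
  2,
);
console.log(ranked);
```

A mutation of this surface would then be a small, testable change, for example a different `limit` or tokenizer, which is the kind of single-surface perturbation the loop evaluates.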

Scoring and the promotion gate

The scorer (src/scorer.ts, ADR-072) is a pure function — re-running it on the same traces yields an identical verdict. It is a weighted base score over six [0,1] terms (weights from scoreWeights(), summing to 1.0):

baseScore = 0.35·taskSuccess + 0.20·testPassRate + 0.15·traceQuality
          + 0.10·costEfficiency + 0.10·latencyEfficiency + 0.10·safetyScore

minus a hard penalty layer read out of the run traces (a single safety violation can drive the final score negative — that is the point):

finalScore = baseScore − 0.30·secretExposure − 0.25·destructiveAction
                       − 0.20·hallucinatedFile − 0.15·toolLoop − 0.10·costOverrun
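The two formulas above can be sketched as a pure function. The weights are copied from this section; the type names, the function names, and the assumption that each penalty is an indicator in [0,1] are illustrative and not taken from src/scorer.ts.

```typescript
// Sketch of the scoring formulas above. Names are hypothetical; penalties are
// ASSUMED to be indicators in [0,1] (the README does not say how they are read).
interface ScoreTerms {
  taskSuccess: number;
  testPassRate: number;
  traceQuality: number;
  costEfficiency: number;
  latencyEfficiency: number;
  safetyScore: number;
}

interface Penalties {
  secretExposure: number;
  destructiveAction: number;
  hallucinatedFile: number;
  toolLoop: number;
  costOverrun: number;
}

// Weighted base score over the six positive terms (weights sum to 1.0).
function baseScore(t: ScoreTerms): number {
  return (
    0.35 * t.taskSuccess +
    0.20 * t.testPassRate +
    0.15 * t.traceQuality +
    0.10 * t.costEfficiency +
    0.10 * t.latencyEfficiency +
    0.10 * t.safetyScore
  );
}

// Base score minus the hard penalty layer.
function finalScore(t: ScoreTerms, p: Penalties): number {
  const penalty =
    0.30 * p.secretExposure +
    0.25 * p.destructiveAction +
    0.20 * p.hallucinatedFile +
    0.15 * p.toolLoop +
    0.10 * p.costOverrun;
  return baseScore(t) - penalty;
}

const noPenalty: Penalties = {
  secretExposure: 0, destructiveAction: 0, hallucinatedFile: 0, toolLoop: 0, costOverrun: 0,
};
const clean: ScoreTerms = {
  taskSuccess: 1, testPassRate: 1, traceQuality: 1,
  costEfficiency: 1, latencyEfficiency: 1, safetyScore: 1,
};
// A failing run that also leaked a secret: base 0.175, penalty 0.30.
const leaky: ScoreTerms = {
  taskSuccess: 0, testPassRate: 0, traceQuality: 0.5,
  costEfficiency: 0.5, latencyEfficiency: 0.5, safetyScore: 0,
};

console.log(finalScore(clean, noPenalty).toFixed(3)); // 1.000
console.log(finalScore(leaky, { ...noPenalty, secretExposure: 1 }).toFixed(3)); // -0.125
```

The second call shows how one safety violation on an already weak run pushes the final score below zero.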

A child replaces its parent only when all four promotion clauses hold against the parent:

1. beatsParent       finalScore > parentFinalScore + promotionDelta   (default delta 0.05)
2. safetyOk          safetyScore ≥ 0.95
3. noRegression      testPassRate ≥ parentTestPassRate
4. noBlockedActions  safetyScore == 1.0  (zero blocked actions in any trace)

Non-promoted variants are retained, not deleted — "did not clear the gate" means "not chosen as a parent by the default policy," never "removed."
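The four clauses can be written as a small predicate. This is a sketch under the thresholds stated above; `Verdict`, `promotionClauses`, and `shouldPromote` are hypothetical names, and the example scores are taken from the sample leaderboard earlier in this README.

```typescript
// Sketch of the promotion gate described above (names are illustrative).
interface Verdict {
  finalScore: number;
  safetyScore: number;
  testPassRate: number;
}

function promotionClauses(child: Verdict, parent: Verdict, promotionDelta: number = 0.05) {
  return {
    beatsParent: child.finalScore > parent.finalScore + promotionDelta,
    safetyOk: child.safetyScore >= 0.95,
    noRegression: child.testPassRate >= parent.testPassRate,
    noBlockedActions: child.safetyScore === 1.0,
  };
}

// A child replaces its parent only when all four clauses hold.
function shouldPromote(child: Verdict, parent: Verdict, promotionDelta: number = 0.05): boolean {
  const c = promotionClauses(child, parent, promotionDelta);
  return c.beatsParent && c.safetyOk && c.noRegression && c.noBlockedActions;
}

const baselineVerdict: Verdict = { finalScore: 0.788, safetyScore: 1.0, testPassRate: 1.0 };
const g1v0: Verdict = { finalScore: 0.791, safetyScore: 1.0, testPassRate: 1.0 };
const g2v1: Verdict = { finalScore: 0.842, safetyScore: 1.0, testPassRate: 1.0 };

console.log(shouldPromote(g1v0, baselineVerdict)); // false: +0.003 does not clear the 0.05 delta
console.log(shouldPromote(g2v1, g1v0));            // true: +0.051 clears it
```

Returning the individual clauses alongside the boolean makes it easy to log why a variant was held back while still keeping it in the archive.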

The archive: evolve like species, not release like software

The archive (src/archive.ts, ADR-073) is a tree of variants keyed by id and persisted as archive.json, not a single best branch. Selection (selectParents) samples the whole archive — including older, non-promoted branches — which is how evolution escapes hill-climbing: when a generation stalls (no promotions), a weak-looking ancestor can still seed a strong branch. Insertion order is preserved, so best(), tie-breaks, and selectParents are all deterministic and reproducible from archive.json alone.
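A minimal sketch of a tree-shaped archive with deterministic whole-archive sampling follows. It is not the package's Archive class: `TreeArchive`, `seededRandom`, `sampleParents`, and the score-weighted sampling rule with a small floor are all assumptions made for illustration.

```typescript
// Illustrative tree archive (hypothetical names, not src/archive.ts).
interface ArchiveNode {
  id: string;
  parentId: string | null; // null marks the root (baseline)
  finalScore: number;
}

// Numerical Recipes LCG: a tiny seeded PRNG so sampling is reproducible.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

class TreeArchive {
  private nodes: ArchiveNode[] = []; // insertion order is preserved

  private find(id: string): ArchiveNode | null {
    for (const n of this.nodes) if (n.id === id) return n;
    return null;
  }

  add(node: ArchiveNode): void {
    if (this.find(node.id)) throw new Error('duplicate id: ' + node.id);
    if (node.parentId !== null && !this.find(node.parentId)) {
      throw new Error('unknown parent: ' + node.parentId);
    }
    this.nodes.push(node);
  }

  // Highest finalScore; strict ">" means the earliest-inserted node wins ties.
  best(): ArchiveNode | null {
    let winner: ArchiveNode | null = null;
    for (const n of this.nodes) {
      if (winner === null || n.finalScore > winner.finalScore) winner = n;
    }
    return winner;
  }

  // Walk parent links back to the root, then reverse to get root → id.
  lineage(id: string): string[] {
    const path: string[] = [];
    let cur = this.find(id);
    while (cur !== null) {
      path.push(cur.id);
      cur = cur.parentId === null ? null : this.find(cur.parentId);
    }
    return path.reverse();
  }

  // ASSUMED policy: score-weighted sampling with replacement over EVERY node,
  // promoted or not. The 0.05 floor keeps low scorers selectable.
  sampleParents(count: number, seed: number): ArchiveNode[] {
    if (this.nodes.length === 0) return [];
    const rand = seededRandom(seed);
    const weights = this.nodes.map((n) => Math.max(n.finalScore, 0) + 0.05);
    const total = weights.reduce((a, b) => a + b, 0);
    const picks: ArchiveNode[] = [];
    for (let k = 0; k < count; k++) {
      let r = rand() * total;
      let chosen = this.nodes[this.nodes.length - 1];
      for (let i = 0; i < this.nodes.length; i++) {
        r -= weights[i];
        if (r < 0) { chosen = this.nodes[i]; break; }
      }
      picks.push(chosen);
    }
    return picks;
  }
}

const archive = new TreeArchive();
archive.add({ id: 'baseline', parentId: null, finalScore: 0.788 });
archive.add({ id: 'g1_v0', parentId: 'baseline', finalScore: 0.791 });
archive.add({ id: 'g1_v3', parentId: 'baseline', finalScore: 0.0 });
archive.add({ id: 'g2_v1', parentId: 'g1_v0', finalScore: 0.842 });

console.log(archive.lineage('g2_v1').join(' → '));
console.log(archive.sampleParents(3, 0).map((n) => n.id));
```

Because both the node order and the PRNG state are fully determined by the stored records and the seed, two runs over the same archive pick the same parents.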

Safety model

A self-modifying agent that can edit anything is a liability. Darwin Mode's bound is enforced in src/safety.ts (ADR-071) as the load-bearing security boundary, with two independent, defense-in-depth checks:

  • inspectVariant(dir) runs before any variant executes. It disqualifies a variant directory containing anything other than the seven approved files, a blocked filename (.env, secret, id_rsa, .git, package.json, …), a symlink or nested directory, or blocked content (process.env, child_process, eval, fetch, restricted node builtins, shell strings, …).
  • validateGeneratedCode(code) runs before generated code is written to disk (the LLM-mutator path). Independent pattern set; a violating generation is discarded, never repaired in place.

The gate runs first: a disqualified variant never has its test command run — the sandbox seals the trace with the reserved exit code 99 and records the findings as blockedActions, which zeroes safetyScore and makes promotion impossible. When a variant is admitted, the sandbox (src/sandbox.ts) is shell-free (the test command is split to argv and run via execFile, never a shell — no command-injection surface) and runs under a scrubbed environment (only PATH plus three identifying variables; nothing else from process.env leaks, so secrets, tokens, and proxy settings never reach a variant).
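The gate-before-execution idea can be sketched over an in-memory file map. This is a simplified illustration, not src/safety.ts: the filenames come from the surfaces table, but `inspectFiles`, `gate`, `SealedTrace`, and the specific regular expressions are assumptions, and filesystem checks such as symlink detection are left out.

```typescript
// Simplified sketch of an allowlist + content gate (hypothetical names and patterns).
const APPROVED: string[] = [
  'planner.ts', 'context_builder.ts', 'reviewer.ts', 'retry_policy.ts',
  'tool_policy.ts', 'memory_policy.ts', 'score_policy.ts',
];

// Illustrative subset of the blocked filenames listed above.
const BLOCKED_NAME = /(\.env|secret|id_rsa|\.git|package\.json)/i;

// Illustrative subset of the blocked content listed above.
const BLOCKED_CONTENT: { label: string; pattern: RegExp }[] = [
  { label: 'process.env', pattern: /process\.env/ },
  { label: 'child_process', pattern: /child_process/ },
  { label: 'eval', pattern: /\beval\s*\(/ },
  { label: 'fetch', pattern: /\bfetch\s*\(/ },
];

// Returns one finding per violation; an empty array means the variant is admitted.
function inspectFiles(files: { [name: string]: string }): string[] {
  const findings: string[] = [];
  for (const name of Object.keys(files)) {
    if (BLOCKED_NAME.test(name)) findings.push('blocked filename: ' + name);
    else if (name.indexOf('/') !== -1) findings.push('nested path: ' + name);
    else if (APPROVED.indexOf(name) === -1) findings.push('not on allowlist: ' + name);
    for (const rule of BLOCKED_CONTENT) {
      if (rule.pattern.test(files[name])) {
        findings.push('blocked content (' + rule.label + ') in ' + name);
      }
    }
  }
  return findings;
}

interface SealedTrace {
  exitCode: number;
  blockedActions: string[];
}

// If anything is found, seal the trace with exit code 99 and never run the tests.
function gate(files: { [name: string]: string }): SealedTrace | null {
  const findings = inspectFiles(files);
  return findings.length > 0 ? { exitCode: 99, blockedActions: findings } : null;
}

const cleanVariant: { [name: string]: string } = {};
for (const f of APPROVED) cleanVariant[f] = 'export const noop = (x: number) => x;';

console.log(gate(cleanVariant)); // null: admitted
console.log(gate({ ...cleanVariant, 'planner.ts': 'const k = process.env.KEY;' }));
```

Discarding on the first finding, instead of trying to repair the variant, keeps the gate simple to reason about: a sealed trace carries the findings, and the scorer does the rest.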

See SECURITY.md for the full threat model.

Programmatic API

import { evolve } from '@zagents/darwin';

const result = await evolve({
  repoRoot: '/abs/path/to/repo',
  workRoot: '/abs/path/to/repo/.zagents',
  generations: 3,
  childrenPerGeneration: 4,
  concurrency: 4,
  promotionDelta: 0.05,
  seed: 0,
  tasks: [
    'run repository test suite',
    'verify generated gemini safety',
    'check trace quality',
  ],
});

result.winner;        // the best scored ArchiveRecord (or null)
result.winnerLineage; // ['baseline', 'g1_v0', 'g2_v1'] — root → winner
result.records;       // every ArchiveRecord, in insertion order
result.baseline;      // the baseline record

The package also re-exports the building blocks behind evolve: profileRepo, generateBaselineHarness, createChildVariant, DeterministicMutator / CodeGenerator, runVariantTask / runVariantTasks, scoreVariant / scoreWeights, Archive, inspectVariant / validateGeneratedCode, plus the SURFACES, FILE_BY_SURFACE, and APPROVED_FILES constants.

Evolutionary stack (ADR-084–105)

The baseline above is the frozen core. On top of it, a set of opt-in, additive, reproducible mechanisms turn the loop from a single-best search into a real evolutionary algorithm. Every one is off by default (so the core stays byte-identical) and individually toggled:

| Capability | ADR | How to enable |
|---|---|---|
| Failure-driven mutation: feed a parent's failing traces into the mutator | 084 | always (the deterministic mutator ignores it) |
| LLM mutator: OpenRouterMutator as a CodeGenerator, behind the same safety gate; model chosen by a 15-model execution benchmark | 085 | config.generator |
| Efficiency tie-break: break score ties by speed | 086 | --tie faster |
| Graded statistical promotion: public∧hidden∧regression∧safety + seeded bootstrap CI over a hash-pinned suite | 087 | --bench s.json |
| MAP-Elites: keep the elite per behaviour niche | 088 | --selection quality-diversity |
| Genetic crossover: recombine two parents' surfaces | 089 | --crossover |
| SGM risk budget: bound cumulative self-modification | 090 | --risk-budget N |
| Hyperbolic phenotyping: Poincaré-ball behavioural niche from traces | 091 | --selection behavioral-diversity |
| Active niche steering: drive toward under-explored regions | 092 | --selection niche-steering |
| Epistatic linkage: topology-aware crossover of co-adapted surfaces | 093 | --crossover --epistasis |
| Clade metaproductivity: select parents by descendant potential (Huxley-Gödel) | 094 | --selection clade |
| Benjamini-Hochberg FDR control: correct promotion for multiple testing | 096 | --fdr Q |
| Self-directed curriculum: difficulty ladder over a graded suite | 097 | --curriculum |
| Multi-objective Pareto: non-dominated (capability × parsimony) front | 100 | --selection pareto |
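As one example from the table, the Benjamini-Hochberg step-up procedure behind --fdr Q can be sketched in a few lines. This shows the textbook procedure only; how the package derives a p-value per promotion candidate is not described in this README, so the inputs here are made-up numbers and `benjaminiHochberg` is a hypothetical name.

```typescript
// Textbook Benjamini-Hochberg step-up procedure (illustrative, not the package's code).
// Returns, per input p-value, whether its null hypothesis is rejected at FDR level q.
function benjaminiHochberg(pValues: number[], q: number): boolean[] {
  const m = pValues.length;
  // Sort ascending, remembering original positions; index breaks ties deterministically.
  const order = pValues
    .map((p, i) => ({ p, i }))
    .sort((a, b) => a.p - b.p || a.i - b.i);

  // Find the LARGEST rank k (1-based) with p_(k) <= (k / m) * q.
  let cutoff = -1;
  for (let k = 0; k < m; k++) {
    if (order[k].p <= ((k + 1) / m) * q) cutoff = k;
  }

  // Reject every hypothesis ranked at or below the cutoff.
  const rejected: boolean[] = pValues.map(() => false);
  for (let k = 0; k <= cutoff; k++) rejected[order[k].i] = true;
  return rejected;
}

// Four hypothetical promotion candidates tested in one generation, q = 0.05.
console.log(benjaminiHochberg([0.01, 0.04, 0.03, 0.5], 0.05)); // [ true, false, false, false ]
```

With four candidates, a p-value of 0.04 would pass a naive 0.05 threshold but is not accepted here, which is the multiple-testing correction the table row refers to.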

The evaluation substrate (ADR-101/102)

By default the sandbox runs the repo's test command, which is independent of the harness surfaces, so the behavioural manifold is degenerate (measured: nicheEntropy = 0, ADR-099). sandboxMode: 'mock' (ADR-102) instead runs a deterministic surface-driven agent loop, so a variant's traces depend on its surface content and the manifold comes alive. sandboxMode: 'agent' (ADR-106) runs a variant's real surface code in a child process. The real-LLM-on-real-code substrate is no longer deferred: it shipped (ADR-106→141) and now runs on canonical SWE-bench Lite (ADR-142+, below).

Validated results (real, reproducible — see bench/results/)

  • Manifold goes live (ADR-102): real nicheEntropy 0 → 0.69, finalScores flat 0.985 → spread 0.435–0.802 under mock mode.
  • Self-improvement (ADR-103): the loop evolves contextBuilder (window 30 → 70) and climbs finalScore 0.765 → 0.985 by generation 3.
  • Diversity beats greedy on deception (ADR-105): on a deceptive epistatic landscape across 5 seeds, greedy score selection crosses it 0/5, behavioral-diversity 5/5, clade 4/5 — empirically justifying the diversity machinery.
  • Polyglot model frontier (ADR-085): 15 models × 6 languages, execution-scored; DeepSeek-V3 ($0.4/Mtok) tops quality-per-dollar — cheap beats frontier for code.

Canonical SWE-bench Lite (real, official harness; ADR-142–149)

Full reproducible evidence: bench/results/RESULTS.md · measured best-practices: LEARNINGS.md · known-flaky exclusions: bench/swebench/KNOWN_FLAKY.md

Run on the full 300 SWE-bench Lite (test) instances, scored by the official swebench Docker harness: no cherry-picking, tight CIs. Solver = relevance-ranked context + symbol-aware localization + search/replace patch, deepseek-chat, ~$0.01/instance.

| config | resolved | Wilson 95% CI | ADR |
|---|---|---|---|
| baseline (open-loop, single-shot) | 23/300 = 7.7% | [5.2, 11.2] | 144 |
| + LLM localization | 24/300 = 8.0% | [5.4, 11.6] | 146 |
| + closed-loop repair (test-feedback, ≤3) | 46/300 = 15.3% | [11.7, 19.8] | 149 |
| + swap base → deepseek-v4-pro (cheap) | 88/300 = 29.3% | [24.5, 34.7] | 151 |
| + v4-pro + Scholar hybrid | 121/300 = 40.3% | [34.9, 46.0] | 152 |
| + Barbarian&Scholar hybrid (cheap+frontier tail) | 100/300 = 33.3% | [28.2, 38.8] | 148 |

Closed-loop repair roughly doubles the resolve rate (7.7% → 15.3%) on the same cheap model (deepseek, <1¢/instance). The baseline and repair CIs do not overlap (baseline upper bound 11.2 vs. repair lower bound 11.7), so it's a real lift, not noise. Honest framing: 15.3% is still a cheap-model number; leaderboard leaders hit 65–88% on Verified using frontier models + deeper agentic loops at $1–20/instance. The contribution is the harness lift on a fixed cheap model: open-loop 7.7% → closed-loop 15.3% at near-constant cost. Localization lifted file-selection recall 44.7% → 59.7%; the repair loop then converts first-shot misses into fixes via test feedback. Next levers: hybrid cheap→frontier escalation (ADR-148), local-model repair (ADR-150). Every number is reproducible under bench/swebench/.
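The Wilson 95% intervals in the table can be re-derived from the resolved counts alone. The sketch below uses the standard Wilson score interval with z = 1.96; `wilsonInterval` and `pct` are hypothetical helper names, not part of the package or its bench scripts.

```typescript
// Standard Wilson score interval for a binomial proportion (illustrative helper).
function wilsonInterval(successes: number, n: number, z: number = 1.96): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return [center - half, center + half];
}

// Format a proportion as a percentage with one decimal place.
function pct(x: number): string {
  return (x * 100).toFixed(1);
}

const rows: { label: string; resolved: number }[] = [
  { label: 'baseline (open-loop)', resolved: 23 },
  { label: '+ closed-loop repair', resolved: 46 },
];

for (const row of rows) {
  const [lo, hi] = wilsonInterval(row.resolved, 300);
  console.log(row.label + ': [' + pct(lo) + ', ' + pct(hi) + ']');
}
// baseline (open-loop): [5.2, 11.2]
// + closed-loop repair: [11.7, 19.8]
```

The baseline upper bound (11.2) sits below the repair lower bound (11.7), which is the non-overlap the paragraph above relies on.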

Status

Working, empirically validated on both the mock substrate and canonical SWE-bench Lite. The DeterministicMutator is seeded and signature-preserving; the OpenRouterMutator (ADR-085) is the production LLM CodeGenerator, behind the same validateGeneratedCode gate. The safety boundary, scorer, archive, and bench layer are kernel code. The real-LLM-on-real-code frontier (once deferred) is now measured: a reproducible 7.7% [5.2–11.2%] open-loop baseline on the full SWE-bench Lite (ADR-144), with localization (146), the repair loop (149), and a hybrid cheap→frontier escalation (148) as the active levers. Darwin Mode also ships integrated into the zagents scaffolder: npx zagents <name> produces a harness with npm run evolve out of the box (ADR-147).

License

MIT © rUv. See ADRs 070 · 071 · 072 · 073 · 074 · 075, and the repository.