cwv-harness

v0.2.0

Published

8 days ago

Rigorous experimentation harness for Core Web Vitals optimization, operable by an AI agent in any React/Remix/Next.js repo

0High
0Medium
0Low

this.cunningham

cwv-harness

A rigorous experimentation harness for improving Core Web Vitals in any React / Remix / Next.js (or any buildable+servable) web repo, designed to be operated by an AI agent (or a human) running many experiments over time — with machinery that makes it hard to self-deceive about whether a change is actually an improvement.

Instantiated from the reusable harness template (docs/REUSABLE_HARNESS_PROMPT.md in this workspace), which was itself distilled from a 48-experiment BTC trading research program. The founding domain analysis is PHASE0.md; the methodology is PROTOCOL.md. The short version: lab CWV measurements are noisy, optimization pressure manufactures false wins by default, and the counters are paired interleaved measurement, a frozen protocol, an append-only trial ledger feeding a multiple-testing bar, guardrails on everything the primary metric doesn't see, and a sealed verdict tier (held-out pages + post-ship field data).

Architecture: built once, instantiated thinly per repo

The package (this directory) — measurement engine, statistics, gates, ledgers, CLI, local scheduler, protocol docs, agent operating prompt. Shared by every target repo; never reimplemented per repo.
Per target repo (generated by cwv-harness init) — cwv.config.json (build/ serve commands, page set, field source, budgets), harness/FOUNDING.md (repo-specific Phase-0 answers), harness/CLAUDE.md (agent operating instructions), and that repo's own ledgers and experiment directories. Ledgers are per-repo by design — each site is its own population (see PHASE0.md "One ledger per repo").

Local-first by design: everything — measurement, scheduling, integrity — runs on a developer machine with no cloud infrastructure. Field snapshots are scheduled with cwv-harness schedule install (launchd on macOS, systemd/cron on Linux); integrity is cwv-harness verify-integrity before commits. GitHub workflows for both exist as an opt-in extra (cwv-harness init --github-workflows) but nothing depends on them — real RUM sources usually need auth that CI can't have anyway.

Install into a target repo

# from the target repo root — ALWAYS as a devDependency:
npm i -D cwv-harness   # or yarn add -D / pnpm add -D — lighthouse and
                       # chrome-launcher are bundled as dependencies
npx cwv-harness init          # local bin — resolves to this repo's node_modules
npx cwv-harness doctor        # verifies Chrome, build, serve, field source, scheduler

Chrome/Chromium itself is the one system requirement (chrome-launcher finds an installed browser; it doesn't download one) — cwv-harness doctor checks.

Invocation note: the command is cwv-harness, same as the package — there is deliberately no short cwv bin (an unrelated npm package owns that name, and the mismatch confused operators). Install as a devDependency and invoke via your package runner (npx cwv-harness / yarn cwv-harness / pnpm exec cwv-harness): running unpinned from the remote npx cache means the harness version isn't locked by your repo, so the frozen protocol can drift between sessions — cwv-harness doctor detects and explains this.

init detects the framework (Next/Remix/Vite/CRA, overridable), scaffolds cwv.config.json with TODOs the operator must fill (page set, baseline ref, field source), splits pages into measured/held-out, copies the protocol docs and agent prompt, and points you at cwv-harness schedule install to start the prospective field record immediately.

Serving real production apps

Generic preview-server defaults don't survive contact with a clustered prod server. The config gives each reality a first-class seam instead of forcing everything into one serve.command string:

build.installCommand: null auto-detects the package manager from the lockfile (yarn/pnpm/npm) for the baseline worktree.
serve.envFile — KEY=VALUE file loaded into the env of every harness-run command (install, build, prepare, serve, smoke).
serve.prepare — command run in BOTH the candidate root and the baseline cache worktree after build, for gitignored-but-required runtime artifacts (generated locales, env copies, seed data) so both variants reach runtime parity.
{port}/$PORT plus {port2}/$PORT2 — two guaranteed-free ports per server, so an auxiliary listener (metrics, websockets) doesn't collide when baseline and candidate run concurrently. Derive any further ports from $PORT.
serve.readyTimeoutMs defaults to 240 s — clustered prod servers cold-boot slowly, and a ready server is never slowed by a high ceiling.
smoke.command must be self-serving: it runs after the battery's servers are torn down.

The CLI

| command | tier | what it does | |---|---|---| | cwv-harness init [--github-workflows] | — | instantiate the harness in this repo (workflows opt-in; local-first otherwise) | | cwv-harness doctor | — | verify Chrome/build/serve/field source/scheduler work here | | cwv-harness measure [--pages ...] [--rounds N] [--save runs.jsonl] | exploration | free paired measurement with absolute numbers + deltas; never citable | | cwv-harness profile --page /x [--variant V] [--save lhr.json] | exploration | one throttled run: long tasks, per-script CPU, shipped transfer bytes | | cwv-harness screen --name slug | exploration | scaffold a pre-registered GO/NO-GO analysis doc in harness/analysis/ | | cwv-harness new --name NNNN-slug | — | scaffold an experiment from the template | | cwv-harness run --experiment NNNN | confirmation | the frozen battery; appends a trial to the ledger | | cwv-harness regrade --experiment NNNN --profile P | free | re-grade stored raw runs under another profile (new record, not a replacement) | | cwv-harness leaderboard | — | all ledgered trials, gates, verdicts | | cwv-harness field snapshot [--dry-run] | prospective | append current field p75s (RUM/CrUX) to the field ledger (dry-run: validate without polluting it) | | cwv-harness schedule install [--at HH:MM] \| status \| uninstall | prospective | local daily snapshot scheduling: launchd / systemd user timer / cron | | cwv-harness verdict --experiment NNNN --i-accept-the-cost | verdict | held-out pages / field window; PROMOTE only; ledgered | | cwv-harness verify-integrity | — | append-only ledgers, immutable page partition, unweakened gates, report/ledger consistency | | cwv-harness selftest | — | null-runner battery (A vs A): any detected "edge" is a harness exploit |

How a measurement works

Baseline ref (pinned in config) is built in a cached git worktree; the candidate (working tree) is built alongside. serve.prepare then brings both to runtime parity; orphaned servers from a previous crashed run are reaped first; free ports are allocated.
Both are served concurrently on two local ports.
Each Lighthouse run executes in its own subprocess that exits when done — orchestrator memory stays flat across an arbitrarily long battery (Lighthouse caches parsed sourcemaps in module state; on heavyweight apps this OOMs any in-process loop). Fresh Chrome profile per run. Battery runs compute only the metric audits — the sourcemap-based diagnostics the full performance category drags in (most of the per-run time on heavy apps) are skipped; the trace is captured during load either way, so the measured numbers are unchanged. Diagnostics live in cwv-harness profile, which keeps the full category; cwv-harness run --keep-lhr stores full per-run reports for one-off forensics. Under simulated throttling, the two variants of a round run concurrently (same machine-drift window, half the wall clock); under throttlingMethod: "devtools" they run sequentially (real CPU pinning can't be shared). Rounds are ABBA-interleaved either way. runner.parallelArms: true (opt-in) additionally runs the perturbation arms alongside the main arm — about half the battery wall-clock again, at the cost of pair variance on a busy machine (never bias: pairs stay simultaneous). Quiet machine only; protocol-relevant.
Raw per-round results are written to the experiment's raw/ dir before grading — a crash in the stats loses nothing (cwv-harness regrade recovers).
Statistics are computed on paired differences only: per-page median of per-round %Δ, aggregated as the median across pages, significance by a bootstrap that respects page clustering, deflated by the ledgered trial count (Bonferroni).
The full panel — LCP, TBT (INP lab proxy), CLS, FCP, TTFB, LCP element identity, shipped JS bytes, console errors — is recorded on every run regardless of the experiment's primary metric, and the gate battery (validity / objective profile / ruin) produces PROMOTE or ITERATE.

The cloaking check tolerates request-scoped server state (session UUIDs, trace ids, serialized __remixContext/__NEXT_DATA__ blobs): it first establishes the page's own determinism with two same-UA fetches, then compares exactly when deterministic or structurally (tag sequence) when not. Extra app-specific patterns go in cloaking.stripPatterns.

What's tested vs what `doctor` verifies

The statistical core, ledger append-only enforcement, gates, config validation, and the full battery against a deterministic fake runner are covered by node --test in this package. The Lighthouse/Chrome/serve path cannot be exercised in every environment — that's what cwv-harness doctor validates inside each target repo before the first real measurement.

Changelog

0.2.0 (2026-06-12)

Hardening from the first real campaign (clustered Express/Remix prod app, 16 GB machine):

Lighthouse runs in a subprocess per run — fixes the orchestrator OOM on sourcemap-heavy apps (the in-process module cache grew unbounded). One automatic retry per run for transient Chrome failures.
Raw runs persist before grading; reruns write raw-rerun-N/ instead of overwriting the original.
Server log buffers are detached once ready; orphaned servers from crashed runs are reaped on the next start; ports are allocated free instead of hardcoded, with {port2}/$PORT2 for auxiliary listeners.
Paired runs execute concurrently under simulated throttling (runner.concurrentPairs: "auto") — ~half the battery wall-clock.
Cloaking check rebuilt for real servers: UUID/traceparent/hex-id and framework-state normalization, determinism probe, structural fallback, cloaking.stripPatterns for app-specific state.
Local-first: cwv-harness schedule install|status|uninstall (launchd/systemd/ cron) replaces the GitHub field-snapshot workflow as the default capture path; workflows are opt-in via cwv-harness init --github-workflows. Doctor checks the scheduler is installed and the ledger is still accruing.
serve.envFile, serve.prepare, lockfile-aware install detection, 240 s default ready timeout.
New exploration tools: cwv-harness profile (long tasks, script CPU, shipped transfer bytes — never sourcemap byte sums), cwv-harness screen (pre-registered GO/NO-GO scaffold), cwv-harness measure absolute numbers + --save, cwv-harness field snapshot --dry-run, field.windowDays.
Doctor: live rumCommand validation, npx-cache detection, package-runner- aware install hints.
The command is now cwv-harness, matching the package name; the cwv bin is removed. An unrelated npm package owns the name cwv (remote npx cwv fetched it), and the package-vs-command mismatch confused operators. All docs, templates, prompts, and generated scripts say cwv-harness.
lighthouse and chrome-launcher are now regular dependencies (were optional peers) — one npm i -D cwv-harness install, no resolution failures, and the repo's lockfile pins the measuring instrument's version. Chrome itself remains the only system requirement.
The instrument is part of the frozen protocol: the harness package version is recorded in experiment.json at pre-registration and cwv-harness run refuses on mismatch — upgrading or hot-patching the harness mid-experiment now fails loudly instead of silently changing the measurement (0.1.0-era experiments without the field are exempt).
Battery runs trimmed to metric audits only (onlyAudits): no sourcemap fetching/parsing per run — a large cut in per-run time and memory on sourcemap-heavy apps, with identical measured numbers (the trace is captured before those audits would run). Side effect fixed in passing: errors-in-console is a best-practices audit and never actually ran under onlyCategories: ["performance"] — the console-errors ruin gate saw an empty list; it now genuinely executes. cwv-harness profile keeps the full performance category (and now reports the unused-JS estimate); cwv-harness run --keep-lhr stores full per-run LHRs under raw*/lhr/ for forensics.
runner.parallelArms (opt-in, default false): perturbation arms run concurrently with the main arm — roughly halves battery wall-clock again (≈21 → ≈11 pair-slots), at up to 6 concurrent Chromes. No bias (pairs stay simultaneous; contention is common-mode), but added pair variance — use on a quiet machine. Protocol-relevant: set before pre-registering.
cwv-harness init now also installs a Claude Code skill (.claude/skills/cwv-harness/SKILL.md) — the on-demand operational playbook (commands, flows, sharp edges); harness/CLAUDE.md stays the short always-loaded rules.

Upgrade note: new runner defaults (throttlingMethod, concurrentPairs) change protocolHash — in-flight pre-registered experiments will refuse to run (by design: the measurement process changed). Finish open experiments on 0.1.0 or re-scaffold them under the new protocol.

0.1.0 (2026-06-11)

Initial release, built inside the btc-research workspace and distilled from a 48-experiment trading-research program.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

cwv-harness

Architecture: built once, instantiated thinly per repo

Install into a target repo

Serving real production apps

The CLI

How a measurement works

What's tested vs what doctor verifies

Changelog

0.2.0 (2026-06-12)

0.1.0 (2026-06-11)

What's tested vs what `doctor` verifies