cwv-harness
v0.2.0
Published
Rigorous experimentation harness for Core Web Vitals optimization, operable by an AI agent in any React/Remix/Next.js repo
Readme
cwv-harness
A rigorous experimentation harness for improving Core Web Vitals in any React / Remix / Next.js (or any buildable+servable) web repo, designed to be operated by an AI agent (or a human) running many experiments over time — with machinery that makes it hard to self-deceive about whether a change is actually an improvement.
Instantiated from the reusable harness template
(docs/REUSABLE_HARNESS_PROMPT.md in this workspace), which was itself
distilled from a 48-experiment BTC trading research program. The founding
domain analysis is PHASE0.md; the methodology is
PROTOCOL.md. The short version: lab CWV measurements are
noisy, optimization pressure manufactures false wins by default, and the
counters are paired interleaved measurement, a frozen protocol, an
append-only trial ledger feeding a multiple-testing bar, guardrails on
everything the primary metric doesn't see, and a sealed verdict tier
(held-out pages + post-ship field data).
Architecture: built once, instantiated thinly per repo
- The package (this directory) — measurement engine, statistics, gates, ledgers, CLI, local scheduler, protocol docs, agent operating prompt. Shared by every target repo; never reimplemented per repo.
- Per target repo (generated by
cwv-harness init) —cwv.config.json(build/ serve commands, page set, field source, budgets),harness/FOUNDING.md(repo-specific Phase-0 answers),harness/CLAUDE.md(agent operating instructions), and that repo's own ledgers and experiment directories. Ledgers are per-repo by design — each site is its own population (see PHASE0.md "One ledger per repo").
Local-first by design: everything — measurement, scheduling, integrity —
runs on a developer machine with no cloud infrastructure. Field snapshots
are scheduled with cwv-harness schedule install (launchd on macOS, systemd/cron on
Linux); integrity is cwv-harness verify-integrity before commits. GitHub workflows
for both exist as an opt-in extra (cwv-harness init --github-workflows) but
nothing depends on them — real RUM sources usually need auth that CI can't
have anyway.
Install into a target repo
# from the target repo root — ALWAYS as a devDependency:
npm i -D cwv-harness # or yarn add -D / pnpm add -D — lighthouse and
# chrome-launcher are bundled as dependencies
npx cwv-harness init # local bin — resolves to this repo's node_modules
npx cwv-harness doctor # verifies Chrome, build, serve, field source, schedulerChrome/Chromium itself is the one system requirement (chrome-launcher
finds an installed browser; it doesn't download one) — cwv-harness doctor checks.
Invocation note: the command is
cwv-harness, same as the package — there is deliberately no shortcwvbin (an unrelated npm package owns that name, and the mismatch confused operators). Install as a devDependency and invoke via your package runner (npx cwv-harness/yarn cwv-harness/pnpm exec cwv-harness): running unpinned from the remote npx cache means the harness version isn't locked by your repo, so the frozen protocol can drift between sessions —cwv-harness doctordetects and explains this.
init detects the framework (Next/Remix/Vite/CRA, overridable), scaffolds
cwv.config.json with TODOs the operator must fill (page set, baseline ref,
field source), splits pages into measured/held-out, copies the protocol docs
and agent prompt, and points you at cwv-harness schedule install to start the
prospective field record immediately.
Serving real production apps
Generic preview-server defaults don't survive contact with a clustered prod
server. The config gives each reality a first-class seam instead of forcing
everything into one serve.command string:
build.installCommand: nullauto-detects the package manager from the lockfile (yarn/pnpm/npm) for the baseline worktree.serve.envFile— KEY=VALUE file loaded into the env of every harness-run command (install, build, prepare, serve, smoke).serve.prepare— command run in BOTH the candidate root and the baseline cache worktree after build, for gitignored-but-required runtime artifacts (generated locales, env copies, seed data) so both variants reach runtime parity.{port}/$PORTplus{port2}/$PORT2— two guaranteed-free ports per server, so an auxiliary listener (metrics, websockets) doesn't collide when baseline and candidate run concurrently. Derive any further ports from$PORT.serve.readyTimeoutMsdefaults to 240 s — clustered prod servers cold-boot slowly, and a ready server is never slowed by a high ceiling.smoke.commandmust be self-serving: it runs after the battery's servers are torn down.
The CLI
| command | tier | what it does |
|---|---|---|
| cwv-harness init [--github-workflows] | — | instantiate the harness in this repo (workflows opt-in; local-first otherwise) |
| cwv-harness doctor | — | verify Chrome/build/serve/field source/scheduler work here |
| cwv-harness measure [--pages ...] [--rounds N] [--save runs.jsonl] | exploration | free paired measurement with absolute numbers + deltas; never citable |
| cwv-harness profile --page /x [--variant V] [--save lhr.json] | exploration | one throttled run: long tasks, per-script CPU, shipped transfer bytes |
| cwv-harness screen --name slug | exploration | scaffold a pre-registered GO/NO-GO analysis doc in harness/analysis/ |
| cwv-harness new --name NNNN-slug | — | scaffold an experiment from the template |
| cwv-harness run --experiment NNNN | confirmation | the frozen battery; appends a trial to the ledger |
| cwv-harness regrade --experiment NNNN --profile P | free | re-grade stored raw runs under another profile (new record, not a replacement) |
| cwv-harness leaderboard | — | all ledgered trials, gates, verdicts |
| cwv-harness field snapshot [--dry-run] | prospective | append current field p75s (RUM/CrUX) to the field ledger (dry-run: validate without polluting it) |
| cwv-harness schedule install [--at HH:MM] \| status \| uninstall | prospective | local daily snapshot scheduling: launchd / systemd user timer / cron |
| cwv-harness verdict --experiment NNNN --i-accept-the-cost | verdict | held-out pages / field window; PROMOTE only; ledgered |
| cwv-harness verify-integrity | — | append-only ledgers, immutable page partition, unweakened gates, report/ledger consistency |
| cwv-harness selftest | — | null-runner battery (A vs A): any detected "edge" is a harness exploit |
How a measurement works
- Baseline ref (pinned in config) is built in a cached git worktree;
the candidate (working tree) is built alongside.
serve.preparethen brings both to runtime parity; orphaned servers from a previous crashed run are reaped first; free ports are allocated. - Both are served concurrently on two local ports.
- Each Lighthouse run executes in its own subprocess that exits when
done — orchestrator memory stays flat across an arbitrarily long battery
(Lighthouse caches parsed sourcemaps in module state; on heavyweight
apps this OOMs any in-process loop). Fresh Chrome profile per run.
Battery runs compute only the metric audits — the sourcemap-based
diagnostics the full performance category drags in (most of the per-run
time on heavy apps) are skipped; the trace is captured during load
either way, so the measured numbers are unchanged. Diagnostics live in
cwv-harness profile, which keeps the full category;cwv-harness run --keep-lhrstores full per-run reports for one-off forensics. Under simulated throttling, the two variants of a round run concurrently (same machine-drift window, half the wall clock); underthrottlingMethod: "devtools"they run sequentially (real CPU pinning can't be shared). Rounds are ABBA-interleaved either way.runner.parallelArms: true(opt-in) additionally runs the perturbation arms alongside the main arm — about half the battery wall-clock again, at the cost of pair variance on a busy machine (never bias: pairs stay simultaneous). Quiet machine only; protocol-relevant. - Raw per-round results are written to the experiment's
raw/dir before grading — a crash in the stats loses nothing (cwv-harness regraderecovers). - Statistics are computed on paired differences only: per-page median of per-round %Δ, aggregated as the median across pages, significance by a bootstrap that respects page clustering, deflated by the ledgered trial count (Bonferroni).
- The full panel — LCP, TBT (INP lab proxy), CLS, FCP, TTFB, LCP element identity, shipped JS bytes, console errors — is recorded on every run regardless of the experiment's primary metric, and the gate battery (validity / objective profile / ruin) produces PROMOTE or ITERATE.
The cloaking check tolerates request-scoped server state (session UUIDs,
trace ids, serialized __remixContext/__NEXT_DATA__ blobs): it first
establishes the page's own determinism with two same-UA fetches, then
compares exactly when deterministic or structurally (tag sequence) when
not. Extra app-specific patterns go in cloaking.stripPatterns.
What's tested vs what doctor verifies
The statistical core, ledger append-only enforcement, gates, config
validation, and the full battery against a deterministic fake runner are
covered by node --test in this package. The Lighthouse/Chrome/serve path
cannot be exercised in every environment — that's what cwv-harness doctor
validates inside each target repo before the first real measurement.
Changelog
0.2.0 (2026-06-12)
Hardening from the first real campaign (clustered Express/Remix prod app, 16 GB machine):
- Lighthouse runs in a subprocess per run — fixes the orchestrator OOM on sourcemap-heavy apps (the in-process module cache grew unbounded). One automatic retry per run for transient Chrome failures.
- Raw runs persist before grading; reruns write
raw-rerun-N/instead of overwriting the original. - Server log buffers are detached once ready; orphaned servers from crashed
runs are reaped on the next start; ports are allocated free instead of
hardcoded, with
{port2}/$PORT2for auxiliary listeners. - Paired runs execute concurrently under simulated throttling
(
runner.concurrentPairs: "auto") — ~half the battery wall-clock. - Cloaking check rebuilt for real servers: UUID/traceparent/hex-id and
framework-state normalization, determinism probe, structural fallback,
cloaking.stripPatternsfor app-specific state. - Local-first:
cwv-harness schedule install|status|uninstall(launchd/systemd/ cron) replaces the GitHub field-snapshot workflow as the default capture path; workflows are opt-in viacwv-harness init --github-workflows. Doctor checks the scheduler is installed and the ledger is still accruing. serve.envFile,serve.prepare, lockfile-aware install detection, 240 s default ready timeout.- New exploration tools:
cwv-harness profile(long tasks, script CPU, shipped transfer bytes — never sourcemap byte sums),cwv-harness screen(pre-registered GO/NO-GO scaffold),cwv-harness measureabsolute numbers +--save,cwv-harness field snapshot --dry-run,field.windowDays. - Doctor: live rumCommand validation, npx-cache detection, package-runner- aware install hints.
- The command is now
cwv-harness, matching the package name; thecwvbin is removed. An unrelated npm package owns the namecwv(remotenpx cwvfetched it), and the package-vs-command mismatch confused operators. All docs, templates, prompts, and generated scripts saycwv-harness. - lighthouse and chrome-launcher are now regular dependencies (were
optional peers) — one
npm i -D cwv-harnessinstall, no resolution failures, and the repo's lockfile pins the measuring instrument's version. Chrome itself remains the only system requirement. - The instrument is part of the frozen protocol: the harness package
version is recorded in
experiment.jsonat pre-registration andcwv-harness runrefuses on mismatch — upgrading or hot-patching the harness mid-experiment now fails loudly instead of silently changing the measurement (0.1.0-era experiments without the field are exempt). - Battery runs trimmed to metric audits only (
onlyAudits): no sourcemap fetching/parsing per run — a large cut in per-run time and memory on sourcemap-heavy apps, with identical measured numbers (the trace is captured before those audits would run). Side effect fixed in passing:errors-in-consoleis a best-practices audit and never actually ran underonlyCategories: ["performance"]— the console-errors ruin gate saw an empty list; it now genuinely executes.cwv-harness profilekeeps the full performance category (and now reports the unused-JS estimate);cwv-harness run --keep-lhrstores full per-run LHRs underraw*/lhr/for forensics. runner.parallelArms(opt-in, default false): perturbation arms run concurrently with the main arm — roughly halves battery wall-clock again (≈21 → ≈11 pair-slots), at up to 6 concurrent Chromes. No bias (pairs stay simultaneous; contention is common-mode), but added pair variance — use on a quiet machine. Protocol-relevant: set before pre-registering.cwv-harness initnow also installs a Claude Code skill (.claude/skills/cwv-harness/SKILL.md) — the on-demand operational playbook (commands, flows, sharp edges);harness/CLAUDE.mdstays the short always-loaded rules.
Upgrade note: new runner defaults (throttlingMethod,
concurrentPairs) change protocolHash — in-flight pre-registered
experiments will refuse to run (by design: the measurement process changed).
Finish open experiments on 0.1.0 or re-scaffold them under the new protocol.
0.1.0 (2026-06-11)
Initial release, built inside the btc-research workspace and distilled from a 48-experiment trading-research program.
