lightbringer
v0.3.0
Per-step web performance measurement for Playwright: network / CPU / render / INP / memory / coverage at initialization and between steps. Use the test fixture, or the zero-config CLI that runs a JSON scenario and emits budgets.
lightbringer
Per-step resource measurement for Playwright scenarios.
Responsibility: lightbringer measures the resources a Playwright scenario consumes — at its initialization (the initial load) and between its steps (each interaction / transition) — so you can optimize that resource usage. It breaks each step's cost into network, CPU, and render (plus web-vitals including INP), so the only way to move a number is to change the implementation, not how the test waits.
Lighthouse only measures the initial load; lightbringer covers the whole scenario lifecycle, step by step.
In scope: per-step network / CPU / render / INP / memory load of one scenario, a median gate for regressions, and a trace drilldown to the responsible code. Out of scope (non-goals): a general always-on profiler, cross-scenario / whole-suite analysis, and heap-snapshot leak analysis (the retained-object graph — which object holds what). Memory is measured as a per-step load (heap / buffer / DOM / listener gauges and their per-step delta), not as a retained-graph diff.
[perf] measure initial load and a follow-up navigation
vitals LCP=120 (good) INP=48 (good) CLS=0 (good) TTFB=12 (good)
initial-load 210ms
net busy=160ms 18reqs 7waves 680KB
cpu block=52ms longtasks=1 maxTask=52ms loaf=1/0ms
render style=7/4.5ms layout=7/8.9ms script=56ms paint=8/1.7ms gpu=6ms
mem heap=12.4MB (+2.1MB) arraybufs=3 listeners=84 (+6) docs=+0 domNodes=512
app-work 22ms
...
app spans (performance.measure):
demo-work 20ms net=0ms/0KB cpu=0ms
total network 21 reqs / 681KB
What it measures
Per span (one perf.measure(name, action) region):
- web-vitals (attribution build) — LCP / INP / CLS / TTFB / FCP with attribution. LCP is broken into its sub-parts (TTFB / resource load delay / load duration / render delay) so you can see whether it's server-, resource-, or render-bound, and report.renderBlocking lists the <head> stylesheets / parser-blocking scripts standing between navigation and first paint.
- network (CDP) — request count, transferred KB, busyMs (union of request intervals = how long the network was actually busy), waves (approximate serial-dependency depth of the waterfall), and how many requests were served from cache (fromCacheCount — disk / memory / prefetch / SW, no network fetch). Each request also carries its initiator (the code or parser that issued it); network.byInitiator rolls them up so a deep waterfall points straight at the responsible function (get /App.tsx:244 (6)) — the network-side analogue of the CPU drilldown.
- third-party — the slice of the network served from a registrable domain other than the page's: bytes the app didn't ship and network time it didn't ask for (analytics, tag managers, ad tech, embedded widgets), broken down per domain (network.thirdParty). CPU spent by third-party scripts is attributed by the drilldown, which classifies each CPU-profiler frame by its script URL.
- cpu — long task count, total blocking time, heaviest long task, LoAF.
- interaction (per-step INP) — the worst interaction inside each span, split into input delay / processing / presentation (Event Timing). web-vitals reports one page-global worst INP; this tells you which step was janky and why (e.g. a toggle whose 80 ms is almost all presentation = the repaint after it, not the handler). interactionMs is budgetable.
- frames (animation smoothness) — a rAF probe records frame cadence, so each span reports effective fps, dropped frames (gaps ≥ one 60 Hz frame), and the worst hitch (longestFrameMs). The render metrics say how much paint/GPU work; this says whether it rendered smoothly. droppedFrames / longestFrameMs are budgetable.
- render — style recalc / layout count and time (from CDP Performance.getMetrics cumulative counters), JS execution time; and, with PERF_TRACE=1, Paint count/time and GPU task time (gpuMs) from the trace. The drilldown rolls GPU work up per type (GPUTask / RasterTask / …) so a step that's cheap on the main thread but GPU-bound is visible. recalcStyleMs is the CSS selector-match cost of a step; the report's css profile (selectors × DOM nodes) is the structural cause, and PERF_CSS=1 + the drilldown name the individual costly / wasteful selectors.
- memory — per-step memory load from Performance.getMetrics gauges: on-heap JS used + delta (jsHeapUsedMB / jsHeapDeltaMB), live ArrayBuffer count, retained DOM nodes, event-listener count + delta, and document delta. A delta that stays positive across repeated runs of the same step is the leak signal. Counts (listeners / ArrayBuffers / documents) are the reliable signals; heap bytes are noisy unless you force a GC with PERF_MEM=1 (which measures the deltas after HeapProfiler.collectGarbage, i.e. retained memory only). Note byte-level buffer / GPU memory is not observable via CDP — only the ArrayBuffer count is, so binary / GPU-staging memory shows as a climbing count, not bytes. Repeat a step with perf.measureRepeat(name, action, { times }) and lightbringer reports whether memory climbs monotonically across the repeats (report.trends, flagged ⚠ likely leak) — the reliable leak signal, since repeating averages out the single-step GC noise.
- media — image over-fetch (intrinsic px ≫ rendered px — a 1760×626 logo shown at 128×46 is 187× too big) and large resources shipped near-uncompressed (decoded ≈ encoded), from Resource Timing + the DOM. Reported in report.media.
- coverage (PERF_COV=1) — JS + CSS coverage across the whole scenario: how much of each downloaded chunk / stylesheet the run actually executed. Per-chunk usedPct flags chunks split too coarsely (low usage ⇒ lazy-load / drop candidate), and scripts/coverage.mjs unions the used byte ranges across every scenario to find code no scenario touched (dead code / over-shipping).
All times are unified to epoch ms so spans correlate with network / CPU even across navigations.
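The network item above defines busyMs as the union of request intervals. The sketch below is one way to compute such a union; it is not taken from lightbringer's source, and the RequestInterval shape and the unionBusyMs name are hypothetical.

```typescript
// Hypothetical shape: one request's start and end, both in epoch ms.
interface RequestInterval {
  startMs: number;
  endMs: number;
}

// Sum the length of the union of the intervals, so that time covered by
// several overlapping requests is counted once.
function unionBusyMs(intervals: RequestInterval[]): number {
  const sorted = [...intervals].sort((a, b) => a.startMs - b.startMs);
  let busy = 0;
  let open = false;
  let curStart = 0;
  let curEnd = 0;
  for (const { startMs, endMs } of sorted) {
    if (!open) {
      // First interval opens the first merged block.
      curStart = startMs;
      curEnd = endMs;
      open = true;
    } else if (startMs <= curEnd) {
      // Overlaps (or touches) the current block: extend it.
      curEnd = Math.max(curEnd, endMs);
    } else {
      // Gap: close the current block and start a new one.
      busy += curEnd - curStart;
      curStart = startMs;
      curEnd = endMs;
    }
  }
  if (open) busy += curEnd - curStart;
  return busy;
}

// Two overlapping requests (0..160 ms merged) plus an isolated one (40 ms).
const exampleBusyMs = unionBusyMs([
  { startMs: 0, endMs: 100 },
  { startMs: 50, endMs: 160 },
  { startMs: 300, endMs: 340 },
]);
console.log(exampleBusyMs); // 200
```

Summing the three durations naively would give 250 ms; the union gives 200 ms, which is why busyMs can be much smaller than the sum of request times on a parallel waterfall.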
CLI (no install, no spec)
The fastest way to try it: no dependency to add, no test file to write. Describe
the scenario as JSON and run it with npx / pnpm dlx. Each step becomes one
measured span.
// scenario.json
{
  "url": "http://localhost:5173",
  "viewport": { "width": 1280, "height": 800 },
  "steps": [
    { "name": "initial-load", "goto": "/", "waitFor": "#app" },
    { "name": "open-cart", "click": "text=Cart", "waitFor": "[role=dialog]" },
    { "name": "pan", "drag": ".map", "by": [-240, -160] }
  ]
}
pnpm dlx lightbringer run scenario.json # measure, print the breakdown
pnpm dlx lightbringer run scenario.json --gpu --cov # real GPU + chunk coverage
pnpm dlx lightbringer run scenario.json --repeat 5 --emit-budgets # write budgets from the medians
pnpm dlx lightbringer run scenario.json --repeat 5 --gate # fail if a median exceeds them
--emit-budgets derives lightbringer.budgets.json from the measured medians
(×1.25 headroom) — you never hand-write a number — and --gate fails the run
against it. Step fields: goto, click, fill+text, press, drag+by,
waitFor, wait, and a per-step settle (networkidle (default) / load /
raf / a number of ms). Flags: --repeat N, --out DIR, --gpu, --cpu N,
--net slow-3g|fast-3g|4g, --cov, --mem, --css, --trace, --headed.
The CLI bundles Playwright; the browser binary is the only prerequisite
(npx playwright install chromium). For CI integration or app-code spans, use the
test fixture below instead.
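As a rough illustration of what "derived from the measured medians (×1.25 headroom)" means arithmetically, here is a minimal sketch. The function names and sample values are hypothetical, and any rounding or file layout the real --emit-budgets applies is not shown in this README.

```typescript
// Median of a non-empty list (mean of the two middle values when the length is even).
function medianOf(values: number[]): number {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Assumption: budget = median × headroom, with the 1.25 factor taken from the text above.
function budgetFromRuns(runs: number[], headroom = 1.25): number {
  return medianOf(runs) * headroom;
}

// Five hypothetical scriptMs samples, as if from --repeat 5.
const scriptMsRuns = [78, 84, 80, 91, 79];
const scriptMsBudget = budgetFromRuns(scriptMsRuns);
console.log(scriptMsBudget); // 100 (median 80 × 1.25)
```

The single slow run (91 ms) does not move the budget, because the budget follows the median rather than the mean or the maximum.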
Run an existing Playwright spec (reuse your specs + config)
Already have e2e/*.spec.ts and a playwright.config.ts? Point run at a spec
(anything that isn't .json) and lightbringer runs it through playwright test
with your existing config (webServer / baseURL / projects) — measuring every
navigation and interaction, without editing the spec:
pnpm dlx lightbringer run e2e/checkout.spec.ts
pnpm dlx lightbringer run e2e/ --config playwright.config.ts --repeat 5 --emit-budgets
It works by injecting a Node loader that swaps the spec's @playwright/test test
for the auto-instrumented one (the same as lightbringer/auto), so each
page.goto / getByRole(...).click() becomes a span. --config, --repeat
(→ --repeat-each) and the PERF flags (--cpu/--mem/--cov/--css/--trace/--net)
are forwarded; --emit-budgets / --gate work per spec file. Same caveat as
auto-span: a span is one action's own cost, not "until the next assertion".
Install (test fixture)
pnpm add -D lightbringer @playwright/test
web-vitals is pulled in automatically.
Usage
Use the extended test fixture and measure named steps:
import { test, expect } from "lightbringer";

test("checkout flow", async ({ page, perf }) => {
  await perf.measure("initial-load", async () => {
    await page.goto("https://app.example.com");
    await expect(page.getByRole("heading")).toBeVisible();
  });
  await perf.measure("open-cart", async () => {
    await page.getByRole("button", { name: "Cart" }).click();
    await expect(page.getByRole("dialog")).toBeVisible();
  });
});
Reports are written to perf-results/<title>.run<idx>.json and a summary is
logged. Put your waitFor/expect assertions inside the action so the span
covers "until the operation is done".
Auto-span (measure an existing spec, ~1-line change)
To put numbers on a spec you already have, without wrapping anything in
perf.measure, import test / expect from lightbringer/auto instead of
@playwright/test:
- import { test, expect } from "@playwright/test";
+ import { test, expect } from "lightbringer/auto";
Every navigation and interaction in the spec body — page.goto(...) and Locator
actions like getByRole(...).click(), locator(...).fill(...) — becomes a
measured span automatically (labelled goto /, click #inc, …). The spec body is
otherwise unchanged. The same perf-results/*.json is written and the same PERF_*
flags apply.
Trade-off vs. explicit perf.measure: each auto-span covers one action's own
cost (the action plus a short settle), not "until your next assertion". When you
need the "until settled" window (e.g. goto and wait for the hero to paint as
one span), use the explicit test from lightbringer and perf.measure.
pnpm exec playwright test # measure
PERF_TRACE=1 pnpm exec playwright test # also save a Chrome trace (Paint / GPU / drilldown)
PERF_CPU=4 pnpm exec playwright test # throttle CPU 4x (mid-tier device)
PERF_GPU=1 pnpm exec playwright test # hardware GL (real GPU/paint numbers)
PERF_MEM=1 pnpm exec playwright test # force GC at span boundaries (retained-only memory deltas)
PERF_CSS=1 pnpm exec playwright test # capture per-selector style-recalc match stats (in the trace)
PERF_COV=1 pnpm exec playwright test # record JS+CSS coverage (chunk usage / dead code)
pnpm exec playwright test --repeat-each=5 # multiple runs for median
node node_modules/lightbringer/scripts/median.mjs
CPU & network throttling (find bottlenecks hidden by a fast machine)
A fast dev machine on localhost hides both CPU and network cost.
- PERF_CPU=N slows the CPU N times (Emulation.setCPUThrottlingRate), so React re-render storms surface as long tasks. GPU/GL is not throttled, so it isolates JS/main-thread cost. The test timeout is auto-scaled by N; the global expect() timeout is not (raise it in config or pass an explicit timeout).
- PERF_NET=slow-3g|fast-3g|4g emulates a slower network (Network.emulateNetworkConditions), so payload size and waterfall depth have realistic cost (relevant when validating code-splitting / lazy loading).
Median (kill the noise)
A single run is noisy (JIT / cache / GC). For regression checks and before/after comparisons, run N times and take the median:
pnpm exec playwright test --repeat-each=5
lightbringer-median # writes <slug>.median.json, prints median (p25..p75)
Each number is shown as median (p25..p75) — the IQR band, not min..max, so one
bad run doesn't blow up the reported spread. A metric whose IQR is wide relative to
its median is flagged !noisy: don't trust that median or gate on it without
more runs. On ad-heavy real sites this is common (e.g. cpu.block and recalc counts
swing run-to-run while requestCount stays stable). The budget gate decides on the
median but warns (~) when a budgeted metric is noisy and its IQR straddles the
budget, since the gate could then flip run-to-run.
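The median (p25..p75) summary, the !noisy flag and the "IQR straddles the budget" warning can be sketched as follows. This is an illustration only: the quantile convention (linear interpolation) and the noisyRatio cutoff of 0.5 are assumptions, since the README does not state the values lightbringer uses, and all names are hypothetical.

```typescript
interface MetricSummary {
  median: number;
  p25: number;
  p75: number;
  noisy: boolean;
}

// Linear-interpolation quantile over an already sorted, non-empty array
// (one common convention, assumed here).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// noisyRatio is a hypothetical cutoff for "IQR wide relative to the median".
function summarize(samples: number[], noisyRatio = 0.5): MetricSummary {
  const sorted = [...samples].sort((a, b) => a - b);
  const median = quantile(sorted, 0.5);
  const p25 = quantile(sorted, 0.25);
  const p75 = quantile(sorted, 0.75);
  const noisy = median > 0 && (p75 - p25) / median > noisyRatio;
  return { median, p25, p75, noisy };
}

// The gate decides on the median; it only warns when the metric is noisy
// and the budget falls inside the p25..p75 band.
function gate(s: MetricSummary, budget: number): { violated: boolean; warn: boolean } {
  return {
    violated: s.median > budget,
    warn: s.noisy && s.p25 <= budget && budget <= s.p75,
  };
}

const stable = summarize([18, 18, 19, 18, 40]); // one bad run (40)
const swingy = summarize([10, 52, 12, 48, 30]);
console.log(stable); // { median: 18, p25: 18, p75: 19, noisy: false }
console.log(swingy); // { median: 30, p25: 12, p75: 48, noisy: true }
console.log(gate(swingy, 40)); // { violated: false, warn: true }
```

The stable sample shows why the IQR band is reported instead of min..max: the single 40 ms outlier leaves the band at 18..19. The swingy sample passes a budget of 40 on its median, but the band 12..48 straddles the budget, so the result could flip on another run.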
Drilldown (find the cause)
When a span's CPU is high, capture a trace and aggregate it within the span window to see which subsystem and which functions spend the time:
PERF_TRACE=1 pnpm exec playwright test
lightbringer-drilldown <slug> <spanName>
It prints these views:
- an event-name breakdown (Layout / Paint / FunctionCall / WebGL / v8.parseOnBackground / …)
- function total time (includes children)
- function self time, computed from the V8 CPU profiler with V8 synthetic frames like (idle) / (program) filtered out. This is the view that pinpoints the actual hot function (e.g. a specific React render).
- a GPU rollup (GPUTask / RasterTask)
- network initiators (which code issued the span's requests, straight from the report, so it needs no trace)
With PERF_CSS=1 it also prints CSS selector match cost — per-selector style
recalc stats (disabled-by-default-blink.debug SelectorStats): the slowest
selectors by match time, and the wasteful ones (high match_attempts,
match_count 0 — re-tested against the DOM on every recalc but never matching,
prime candidates to delete or scope). This is the answer to "the DOM and selector
count are large and style recalc is expensive — which selectors?". Note PERF_CSS
instruments every match attempt, so it inflates the recalc time; use it to find
which selectors and read recalcStyleMs from a normal run for the real magnitude.
App spans (withSpan)
To attribute cost to a region of your application code, wrap it with
withSpan. It emits a standard performance.measure (visible in the DevTools
Performance panel too), which lightbringer collects and converts into an
OpenTelemetry-style span, nested inside the operation span by time containment:
import { withSpan } from "lightbringer";
const stats = await withSpan("loadStats", () => fetchStats(id), { id });
Nothing is sent to a server. The toOtelSpans output is where an OTLP exporter
could plug in later.
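"Nested inside the operation span by time containment" can be pictured with the sketch below. It is not lightbringer's implementation; TimedSpan and findParent are hypothetical names, and the rule of picking the tightest containing span is an assumption. It relies on all timestamps sharing one clock, which the README states (epoch ms).

```typescript
// Hypothetical span shape with start and end in epoch ms.
interface TimedSpan {
  name: string;
  startMs: number;
  endMs: number;
}

// Return the shortest operation span whose window fully contains the app span,
// or undefined when no operation span contains it.
function findParent(app: TimedSpan, operations: TimedSpan[]): TimedSpan | undefined {
  let best: TimedSpan | undefined;
  for (const op of operations) {
    const contains = op.startMs <= app.startMs && app.endMs <= op.endMs;
    if (!contains) continue;
    if (!best || op.endMs - op.startMs < best.endMs - best.startMs) best = op;
  }
  return best;
}

const operationSpans: TimedSpan[] = [
  { name: "initial-load", startMs: 0, endMs: 210 },
  { name: "open-cart", startMs: 300, endMs: 380 },
];
const loadStatsSpan: TimedSpan = { name: "loadStats", startMs: 320, endMs: 342 };

console.log(findParent(loadStatsSpan, operationSpans)?.name); // "open-cart"
```

An app span that starts in one operation window and ends after it has no containing parent under this rule and would stay unattached.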
Custom settle
After an action, lightbringer waits for the page to "settle" before closing the
span. The default waits for two animation frames. Override it per call or per
controller for app-specific readiness (e.g. waiting for a map's idle event):
await perf.measure("pan-map", async () => { /* ... */ }, {
  settle: (page) =>
    page.evaluate(() => new Promise<void>((r) => myMap.once("idle", r))),
});
Settle is bounded by PERF_SETTLE_TIMEOUT (default 5000ms). If it times out the
span is flagged capped and its durationMs should not be trusted (read the
network / CPU / render breakdown instead).
Memory leak trend (measureRepeat)
A single step's memory delta can't be told apart from GC noise. To catch a leak, repeat the same operation and watch whether memory climbs every time:
await perf.measureRepeat(
  "toggle-panel",
  async () => {
    await page.getByRole("button", { name: "Toggle" }).click();
  },
  { times: 6 },
);
Each repeat is recorded as a toggle-panel#0..#5 span; lightbringer then reports
whether the heap / listener count / DOM nodes / ArrayBuffer count grow
monotonically across them:
memory trends (across repeated steps):
toggle-panel x6 jsEventListeners 192→212→232→252→272→292 +100 (+20/step) ⚠ likely leak
toggle-panel x6 jsHeapUsedMB 12→21.2→30.4→39.6→48.8→58MB +46MB (+9.2/step) ⚠ likely leak
Run it under PERF_MEM=1 so each repeat's memory is measured after a forced GC
(retained-only) — that's what makes even jsHeapUsedMB resolve into a clean line.
A non-leaking step reports no trend. This stays inside one scenario, so it isn't
the out-of-scope cross-scenario analysis.
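The numbers in the trend lines above (total growth and growth per step) follow from simple arithmetic over the per-repeat samples. The sketch below reproduces them; treating "climbs monotonically" as a strict increase on every repeat is an assumption, and the names are hypothetical.

```typescript
interface TrendResult {
  total: number;      // last sample minus first sample
  perStep: number;    // total divided by the number of steps between samples
  monotonic: boolean; // true only if every sample is larger than the previous one
}

function memoryTrend(samples: number[]): TrendResult {
  let monotonic = samples.length > 1;
  for (let i = 1; i < samples.length; i++) {
    if (samples[i] <= samples[i - 1]) monotonic = false;
  }
  const total = samples.length > 0 ? samples[samples.length - 1] - samples[0] : 0;
  const perStep = samples.length > 1 ? total / (samples.length - 1) : 0;
  return { total, perStep, monotonic };
}

// Listener counts from the sample output above: 6 repeats, 5 steps between them.
const listenerTrend = memoryTrend([192, 212, 232, 252, 272, 292]);
console.log(listenerTrend); // { total: 100, perStep: 20, monotonic: true }

// A count that wobbles without climbing is not flagged as a trend.
const wobbleTrend = memoryTrend([192, 196, 192, 195, 192, 194]);
console.log(wobbleTrend.monotonic); // false
```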
Coverage & chunk-split analysis (PERF_COV)
PERF_COV=1 records JS + CSS coverage across the whole scenario (it doesn't reset
on navigation, so it accrues over every page and interaction). The per-run summary
shows how much of each chunk the scenario used:
coverage (PERF_COV — scenario-wide):
js 25.8% used (547.1/2119.7KB)
22.7% used 716KB unused /assets/vendor-maplibre-….js
17.1% used 286.8KB unused /assets/vendor-turf-….js
css 95.4% used (96.5/101.2KB)
To find code that no scenario in the suite used (dead code / over-shipping),
run the whole suite with PERF_COV=1, then union the per-scenario coverage:
PERF_COV=1 pnpm exec playwright test
node node_modules/lightbringer/scripts/coverage.mjs --min=30
[coverage] union across 7 scenario run(s)
JS 31% used overall (640/2060KB, 1420KB never used)
never used by any scenario (dead-code / over-shipping):
45.2KB /assets/admin-….js
under 30% used (split too coarse / lazy-load candidate):
17.1% used 286.8KB unused /assets/vendor-turf-….js
A byte is "used" if any scenario executed it, so a chunk that stays low after
the whole suite is a real split/lazy-load candidate. Notes: measure a production
build (vite build + vite preview) — a dev server ships unbundled modules, so
chunk analysis is meaningless. A framework vendor chunk (react) sitting at ~25% is
expected and not splittable; the actionable signals are feature libs (e.g. turf
only needed for some geo ops) and your own app chunks. Coverage is Chromium-only.
Budgets (CI regression gate)
Declare an upper bound per span; the build fails when it's exceeded. scriptMs
(CDP ScriptDuration) is the recommended bound because it is accurate to ~1ms.
await perf.measure(
  "open-cart",
  async () => {
    await page.getByRole("button", { name: "Cart" }).click();
    await expect(page.getByRole("dialog")).toBeVisible();
  },
  { budget: { scriptMs: 80, blockingMs: 100 } },
);
Two gates, same declared budget:
- Median gate (recommended for CI): run N times, then median.mjs compares the median to the budget and exits non-zero on violation. Robust against the per-run noise of durationMs / blockingMs.
  pnpm exec playwright test --repeat-each=5
  node node_modules/lightbringer/scripts/median.mjs # exit 1 if any median > budget
- Inline gate (fast local fail): PERF_ASSERT=1 fails the test in teardown on the single run. Best for stable metrics (scriptMs); off by default.
Either way, violations are also printed in the per-run summary (! budget: ...).
Span budget fields: durationMs, scriptMs, blockingMs, encodedKB,
requestCount, waves, busyMs, thirdPartyKB, thirdPartyRequestCount,
layoutCount, recalcStyleMs, recalcStyleCount, nodes, jsHeapUsedMB,
jsHeapDeltaMB, listenersDelta, interactionMs, droppedFrames,
longestFrameMs, paintCount / paintMs / gpuMs (PERF_TRACE only). The memory bounds
(jsHeapDeltaMB especially) are only trustworthy under PERF_MEM=1; prefer the
count-based listenersDelta for a GC-stable gate. For page-global web-vitals,
declare a separate budget once per test:
perf.setVitalsBudget({ LCP: 2500, INP: 200, CLS: 0.1 });
It's gated the same way (median, with noisy warnings).
Regression gate (baseline-relative)
Budgets are absolute bounds you maintain by hand. The other half of "drive the optimization" is catching a relative regression without declaring a number — the PR made this step 35% slower than main. Produce a baseline median set, then the current one, and diff them:
# baseline (e.g. on main)
PERF_OUT_DIR=perf-baseline pnpm exec playwright test --repeat-each=5
PERF_OUT_DIR=perf-baseline node node_modules/lightbringer/scripts/median.mjs
# current (on the PR)
pnpm exec playwright test --repeat-each=5
node node_modules/lightbringer/scripts/median.mjs
# fail if anything got >15% worse
node node_modules/lightbringer/scripts/regress.mjs perf-baseline perf-results --threshold=0.15
[regress] baseline perf-baseline vs current perf-results (gate: +15%)
open-cart
increment-click / render.scriptMs 2.1 → 133.1 (+6238%) ✗
increment-click / cpu.blockingMs 0 → 134 (new) ✗
increment-click / memory.jsHeapUsedMB 23.9 → 50.4 (+111%) ✗
vitals.INP 24 → 152 (+533%) ~
Every tracked metric is lower-is-better, so a regression is an increase past both
the relative gate and an absolute floor (so a 1ms→2ms swing isn't flagged as
+100%). A metric that's noisy on either side (wide IQR) is downgraded to a warning
(~) — the comparison can't be trusted, add runs. Exits non-zero on any hard
regression, so it drops straight into CI alongside the budget gate.
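The decision rule described above (relative gate, absolute floor, noisy downgrade) can be sketched like this. The absFloor value of 5 is purely hypothetical, since the README does not state the floor regress.mjs uses, and a real floor would likely differ per metric and unit. The type and function names are also hypothetical.

```typescript
type Verdict = "ok" | "warn" | "regression";

interface Comparison {
  baseline: number; // baseline median
  current: number;  // current median
  noisy: boolean;   // true if either side has a wide IQR
}

// A change counts only when it exceeds BOTH the relative threshold and the
// absolute floor; a noisy metric is downgraded from regression to warning.
function classifyChange(c: Comparison, threshold = 0.15, absFloor = 5): Verdict {
  const delta = c.current - c.baseline;
  // A metric that was 0 in the baseline has no meaningful percentage ("new").
  const pastRelative = c.baseline === 0 ? delta > 0 : delta / c.baseline > threshold;
  const pastAbsolute = delta > absFloor;
  if (!(pastRelative && pastAbsolute)) return "ok";
  return c.noisy ? "warn" : "regression";
}

// Values taken from the sample output above.
console.log(classifyChange({ baseline: 2.1, current: 133.1, noisy: false })); // "regression"
console.log(classifyChange({ baseline: 0, current: 134, noisy: false }));     // "regression"
console.log(classifyChange({ baseline: 24, current: 152, noisy: true }));     // "warn"
// +100% relative, but only 1 ms absolute: below the floor, so not flagged.
console.log(classifyChange({ baseline: 1, current: 2, noisy: false }));       // "ok"
```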
CI
.github/workflows/perf.yml is a working perf gate —
lightbringer measuring its own fixtures — that doubles as the copy-this template:
- run: pnpm exec playwright install --with-deps chromium
- run: pnpm exec playwright test <your-specs> --repeat-each=5 # median needs N runs
- run: node node_modules/lightbringer/scripts/median.mjs # exits 1 on budget violation
Two things make it reliable in CI:
- Gate on the median, not a single run. --repeat-each=5 + median.mjs absorbs JIT/GC/cache noise and prints median (p25..p75); a metric whose IQR is wide is flagged !noisy and shouldn't gate.
- CI runners have no GPU (SwiftShader), so budget the main-thread metrics (scriptMs, layoutCount, nodes, requestCount, waves, recalcStyleMs) — not gpuMs / paintMs, which are unreliable there.
To add the baseline-relative regress gate (catch "this PR got 15% slower than main" without hand-set budgets), measure both revisions and diff:
- run: pnpm exec playwright test <specs> --repeat-each=5
- run: node scripts/median.mjs # current → perf-results/*.median.json
- name: baseline from main
run: |
git worktree add ../base origin/main
cd ../base && pnpm install --frozen-lockfile
pnpm exec playwright test <specs> --repeat-each=5
PERF_OUT_DIR="$PWD/perf-results" node scripts/median.mjs
- run: node scripts/regress.mjs ../base/perf-results perf-results --threshold=0.15
The bench specs here default to their slow path (to demonstrate each metric), so
the template passes BENCH_FIXED=1 to run the optimized path and stay green —
your own specs won't need that.
Accuracy
Measured against a known busy-loop in a page-owned click handler (see
examples/accuracy.spec.ts):
| metric | accuracy |
| --- | --- |
| render.scriptMs (CDP ScriptDuration) | ±1ms — the most reliable CPU number |
| cpu.blockingMs (Long Tasks API) | exact when it fires, but lossy (see below) |
| durationMs | ground truth + ~15–30ms harness overhead (CDP round-trips + settle) |
| tracing observer effect | negligible on scriptMs |
Things that bite, learned from the accuracy probe:
- Work injected via page.evaluate is invisible to the Long Tasks API and to ScriptDuration (only durationMs sees it). Drive the work from the page's own scripts/events, not from evaluate, or you will measure nothing.
- page.setContent does not run init scripts in this setup, so the in-page collector never initializes and vitals / cpu / render silently vanish (only scriptMs survives via CDP). Always reach the page with page.goto (a data:text/html,... URL works). The harness warns (collectorMissing) when it detects this.
- PerformanceObserver callbacks are async, so a long task at the very end of a span would be missed; the collector drains takeRecords() (flush) before reading, which fixes per-span attribution retroactively.
Caveats
- Default headless Chromium uses SwiftShader (software GL). WebGL / ReadPixels / gpuMs and the cpu.block of GPU-heavy steps balloon far beyond real hardware and can mask real JS cost. Measure with PERF_GPU=1 to use hardware GL (ANGLE Metal on macOS). Verify via the WebGL renderer string; on CI without a real GPU it stays SwiftShader, so don't trust map/canvas GPU/CPU numbers there.
- Dev servers differ from production. A dev server that serves unbundled ES modules inflates request counts and transfer size. Measure a production build for network/bundle decisions. Runtime responsiveness (INP) is build-independent.
- cpu.block is the sum of long tasks. Very short synchronous work, or work that doesn't cross a task boundary, may not register as a long task (it shows in LoAF instead). Use the trace for fine-grained attribution.
- logSummary warns automatically when the WebGL renderer is SwiftShader (software GL → fake GPU numbers) or when uncaught page errors occurred during the run (a broken / stale build makes the measurement invalid). The report carries glRenderer and pageErrors.
- The drilldown's self time comes from the V8 CPU profiler (sampling), so it is approximate at very short durations; the total view and event-name view complement it.
- (net-saturated: busyMs ≈ window) is shown when the network is busy for ~the whole span (continuous loading: ads, polling, long-polling). There, busyMs and durationMs reflect the wait window you chose, not a discrete load cost — read the discrete metrics (cpu.block / script / recalc counts / vitals) and waves / requestCount instead.
- Heavy traces stream to disk. A busy page emits tens-to-hundreds of MB of trace events; the collector streams them straight to <slug>.trace.json and keeps only Paint/GPUTask in memory, so the fixture doesn't buffer + stringify the whole trace (which would OOM). The drilldown script, however, loads the full trace file (JSON.parse) — fine for normal traces, but a multi-GB trace will strain it. examples/stress.spec.ts is the regression fixture for this.
- Per-span request detail is capped at the 20 slowest. requestCount, encodedKB, busyMs, waves, and thirdParty are computed over all requests; only the per-request requests[] list is truncated, so a request-heavy page doesn't bloat every report.
- First/third-party split is by registrable domain (eTLD+1) using a compact built-in suffix set, not the full Public Suffix List. It's correct for common hosts (subdomains of your site count as first-party; *.co.uk etc. handled), but exotic public suffixes may misclassify. The page's own domain (from page.url()) is the first-party anchor; data: / blob: count as first-party. Third-party CPU requires PERF_TRACE=1 (the CPU profiler carries script URLs).
- Memory deltas need a GC to be trustworthy. Without PERF_MEM=1 a span's jsHeapDeltaMB includes the step's own not-yet-collected garbage, so a fixed (non-leaking) step looks the same as a leaking one. PERF_MEM=1 forces HeapProfiler.collectGarbage at both span boundaries so the delta is retained-only — but the GC adds wall time to the span, so it's opt-in and you shouldn't read durationMs from a PERF_MEM run. Even then, on-heap objects can survive a single step's GC, so jsHeapDeltaMB is directional; the counts (listenersDelta, ArrayBuffer count, document delta) are the reliable per-run leak signals. JSHeapUsedSize excludes off-heap buffer bytes (typed arrays / wasm / GPU staging), which is why a leaked 8 MB Float64Array shows only as the ArrayBuffer count going up, not as heap MB.
- Request initiators are best-effort. They come from CDP Network.requestWillBeSent.initiator: a script initiator carries a JS call stack (lightbringer keeps the topmost frame with a URL), a parser initiator points at the referencing document, and preload / other carry no frame. A fetch deep inside a bundled/minified vendor chunk attributes to that chunk's url:line, not your source, unless source maps are applied downstream.
- Media analysis caveats. Image over-fetch uses intrinsic vs rendered pixels, so it works for any <img>, but data: URLs and cross-origin images without Timing-Allow-Origin report 0 KB (they're not in Resource Timing) — the over-fetch ratio is still correct, only the byte figure is missing. The uncompressed-resource check depends on the serving layer: vite preview may not gzip, so it can flag chunks a real CDN would compress — confirm against production hosting.
- Per-span interaction uses Event Timing with a 16 ms durationThreshold, so sub-16 ms (already-responsive) interactions don't appear — absence is good news.
- Frames come from a rAF probe, so dropped frames are measured against a 60 Hz budget (16.7 ms) even though headless Chromium runs unthrottled (~120 fps) — a smooth span simply reports no hitch. The probe pushes one timestamp per frame (negligible), and the summary only prints the line when there's a real hitch.
- Cache hits are detected within the run (memory/disk/SW); a true reload-diff ("what's re-fetched on a second visit") means navigating twice in the scenario.
- Render-blocking counts <head> stylesheets and classic (non-async/defer, non-module) <script src>; a single app CSS bundle showing as 1 blocking sheet is normal — the signal is unexpected extra blocking resources.
- PERF_PORT overrides the fixture dev-server port (default 5173). Set it to a free port when another Vite project is already on 5173 — otherwise Playwright reuses that server and silently measures the wrong app.
Bench fixtures
fixtures/app is a tiny React app (served by Vite) with deliberate, fixable
bottlenecks, grouped by the two halves of the responsibility — initialization
and between steps. Each spec in examples/ measures one and doubles as a
regression fixture for the tool. ?fixed (or BENCH_FIXED=1) toggles the fix.
Initialization (the initial-load span):
| scenario | bottleneck | metric | slow → fixed | fix |
| --- | --- | --- | --- | --- |
| init-eager | expensive work at boot that the first view doesn't need | render.scriptMs | 1038 → 21 ms | don't compute at init (lazy / on demand) |
| init-waterfall | boot fetches awaited one-by-one (fetch-on-render) | network.busyMs | 475 → 169 ms | parallelize the boot fetches |
| init-reflow | a mount layout effect forces a reflow per element | render.layoutCount | 2002 / 402 ms → 3 / 5 ms | batch reads then writes |
| cls | a banner inserted after load pushes content down | vitals.CLS | 0.4 (poor) → 0 | reserve the space up front |
Between steps (per interaction):
| scenario | bottleneck | metric | slow → fixed | fix |
| --- | --- | --- | --- | --- |
| rerender | unrelated heavy list re-renders on click | render.scriptMs | 129 → 1.8 ms | React.memo |
| reflow | write-then-read geometry in a loop (forced sync layout) | render.layoutCount / layoutMs | 2000 / 335 ms → 1 / 1.6 ms | batch reads then writes |
| input | heavy sync work per keystroke | vitals.INP | 64 → 8 ms | useDeferredValue |
| network | four independent requests awaited one-by-one | network.waves / busyMs | 4 waves / 808 ms → 1 / 203 ms | Promise.all |
| nplus1 | list, then one request per item | network.requestCount / waves | 6 / 6 → 2 / 2 | batch endpoint |
| chain | each request depends on the previous result | network.waves | 4 waves / 608 ms → 1 / 156 ms | combined endpoint |
| huge-dom | rendering 30k list items | render.nodes | ~120k nodes / layout 100 ms → ~400 / fast | windowing / pagination |
| paint | animating box-shadow every frame (no layout) | render.paintCount (PERF_TRACE) | 196 paints → 4 | animate transform (compositor-only) |
| thirdparty | analytics / ad / tag-manager scripts from another origin | network.thirdParty (KB / reqs / CPU) | 4 reqs / 265 KB / 70 ms CPU → 0 | drop / defer / self-host the script |
| leak | each click retains objects / buffers / listeners forever | memory.listenersDelta / arrayBuffers (PERF_MEM) | +19 listeners / +30 buffers → ~0 | drop refs, unbind listeners |
| leak-trend | the same leak, repeated 6× via measureRepeat | report.trends (PERF_MEM) | heap +9.2 MB/step monotonic → flat | drop refs, unbind listeners |
| selector-cost | big DOM × many matching complex selectors; toggle restyles all | render.recalcStyleMs (+ PERF_CSS drilldown) | ~22 → 0.5 ms | fewer / flatter / scoped selectors |
| image | a 1600×1600 image rendered in an 80×80 box | report.media.oversized | 400× over-fetch → 1× | serve at display size (or 2× DPR) |
stress is different: it doesn't measure an app bottleneck, it stresses
lightbringer's own data handling — 600 concurrent requests + a 150k-mark trace
(~50 MB / ~200k events). It verifies the collector survives a dataCollected
batch larger than the spread-call argument limit and streams the trace to disk
instead of OOMing. Run it with PERF_TRACE=1.
Production fixture (build-dependent metrics)
Some axes only produce real numbers against a production build — a dev server
ships unbundled modules (no chunks), injects CSS via JS (no render-blocking
<link>), and the bench fixtures use data: images (0 bytes). fixtures/bundle
is a separate Vite project, built and vite preview-served by
playwright.bundle.config.ts, that makes them concrete:
PERF_COV=1 pnpm test:bundle # build → preview → measure with coverage
pnpm coverage # union the per-scenario coverage
It surfaces, with real numbers:
- chunk coverage — vendor-react ~23% used (a big framework chunk the app barely exercises), and a features chunk imported as a namespace and dispatched dynamically, so it's shipped whole but ~25% used (the over-shipping pathology).
- render-blocking — the extracted <link rel=stylesheet> (1 css).
- media over-fetch with real bytes — a generated 1600×1200 PNG shown at 128×96 (≈150× over-fetch, ~46 KB), not a 0-byte data: URL.
- CSS coverage — style.css has matching and non-matching rules, so CSS lands ~25% used.
The PNG is generated (pnpm bundle:gen, gitignored) so no binary is committed.
Notes worth internalizing:
- reflow / init-reflow: scriptMs is only ~5–8 ms, so a CPU-only view misses them — the layout breakdown is what surfaces the cost.
- the network trio is the waterfall fix taxonomy with zero CPU: parallelize independent requests, batch an N+1, combine a dependent chain.
- init-eager: deferring the work to idle would not help — that reschedules it without reducing the resource used at init; the fix is to not do it at init.
- the rerender drilldown's self time points straight at the app's own expensiveValue (with file:line), not a library.
npx playwright test reflow.spec.ts # slow
BENCH_FIXED=1 npx playwright test reflow.spec.ts # fixed
License
MIT
