@mizchi/vlmkit
v0.6.0
VLM-driven frontend toolkit: visual regression (snapshot / diff / regression-watch), markup synthesis from screenshots, design-token / theme / a11y / i18n audits, and a 2-stage VLM+LLM CSS auto-repair loop.
vlmkit
VLM-driven frontend toolkit. Combines:
- Visual regression — pixel + computed-style + a11y tree diff, multi-viewport snapshot, regression watch across runs.
- Markup synthesis — build a component from a target screenshot; detect components in a screenshot; extract design tokens.
- Design audits — token-scale conformance, theme parity, a11y contrast / touch / focus order, i18n stress, cross-browser parity, Web Vitals.
- CSS auto-repair — 2-stage VLM + LLM pipeline that detects a CSS regression and proposes a fix.
Originated as @mizchi/vrt (visual regression only); rebranded to
vlmkit at 0.6.0 to reflect the broader scope. The VRT capabilities
remain a first-class feature group within the kit.
Requires Node 24+.
npm install -g @mizchi/vlmkit
# or, from this repo:
pnpm install && pnpm build
The CLI is organized into verb groups. Run vlmkit <group> --help
for options.
| Group | Subcommands |
|---|---|
| vlmkit diff | html, png, elements, browsers, agent, runs |
| vlmkit check | a11y {contrast,touch,focus}, tokens, theme, perf, drift {component,pages} |
| vlmkit inspect | interact, explore, smoke |
| vlmkit stress | i18n, media |
| vlmkit scan | component, breakpoints |
| vlmkit build | component |
| vlmkit snapshot | [<url>...], approve, fix-prompt, stability, flipbook, report |
| vlmkit migration | compare, blind, subagent |
| vlmkit workflow | init, capture, verify, approve, graph, affected, introspect, spec-verify, expect |
| Standalone | vlmkit watch, vlmkit manifest, vlmkit diff-pr, vlmkit baseline, vlmkit api, vlmkit bench, vlmkit report, vlmkit skill |
The single-token commands from 0.4.x (vrt compare, vrt png-diff,
vrt theme-parity, …) remain as deprecation shims that forward to
the new names and print a one-line hint. See the CHANGELOG
for the full old → new mapping.
Features
- Pixel diff — pixelmatch v7 + heatmap generation
- Computed style diff — getComputedStyle capture including hover/focus states
- A11y tree diff — accessibility snapshot comparison
- CSS challenge bench — automated CSS deletion/recovery with detection rate tracking (96.7%)
- 2-stage AI pipeline — VLM (image → structured diff) + LLM (diff → CSS fix)
- Migration VRT — compare HTML before/after across responsive viewports
- Snapshot — URL-based multi-viewport capture with baseline diff
- Mask — selector-based masking for dynamic content (animations, counters)
- Crater integration — lightweight prescanner via BiDi (1.66x speedup, 0% false positive)
- Markup-assistance toolkit (10+ commands): build from screenshot, theme-parity, i18n stress, a11y contrast / touch / focus-order, media-variant adaptations, cross-browser parity, design-token conformance, interaction sequences.
Quick Start
The examples below assume the vlmkit command is already installed and available on your PATH.
pnpm install
# Run tests
pnpm test
# Compare two HTML files
vlmkit diff html before.html after.html --output reports/
# Render the diff into an agent-friendly Markdown report
vlmkit diff agent reports/diff-report.json > reports/diff.md
# Compare two existing PNG screenshots without Playwright
vlmkit diff png baselines/home.png snapshots/home.png
# Compare two URLs
vlmkit diff html --url http://localhost:3000/ --current-url http://localhost:8080/ \
--output reports/
# Snapshot URLs (creates baseline on first run, diffs on subsequent runs)
vlmkit snapshot http://localhost:3000/ http://localhost:3000/about/ --output snapshots/
# Use explicit labels when URL-derived names are not ideal
vlmkit snapshot http://localhost:3000/issues?severity=critical --label critical-issues
# Fail CI when diffs or new baselines are detected
vlmkit snapshot http://localhost:3000/ --fail-on-diff --fail-on-new-baseline --max-diff-ratio 0.01
# Promote accepted snapshot diffs to the new baseline
vlmkit snapshot approve --output snapshots/
# Load snapshot targets from vrt.config.json
vlmkit snapshot
# Dev inner loop with rich signal output (token-aware + cross-round)
vlmkit diff html baseline.html variant.html --tokens DESIGN.md --output reports/
vlmkit watch baseline.html variant.html --tokens DESIGN.md
# Author approval rules (sub-pixel deviations, intentional design exceptions, etc.)
vlmkit manifest add --selector .hero__body --max-px 2 --reason "AA artifact" --expires 2026-08-15
vlmkit manifest add --a11y-contrast --selector "button" --reason "decorative" --expires 2026-08-15
vlmkit manifest add --from-run .vrt/runs/diff-pr/ # auto-acknowledge sub-pixel deltas
vlmkit manifest list
# CI gate — declare routes in vrt.config.json, pin baselines, gate per PR
vlmkit baseline pin # on main
vlmkit baseline verify # in PR
vlmkit baseline post --pr owner/repo#123 # send summary.md as PR comment
# Legacy internal-dogfood verification loop (vrt's own e2e suite)
vlmkit workflow init
vlmkit workflow capture
vlmkit workflow verify
# Workflow loop with external-project routes/config
vlmkit workflow init --config ./vrt.config.json
vlmkit workflow capture --config ./vrt.config.json
# Prepare a migration diff packet for an external fixer
pkf run migration-subagent-prepare -- --report test-results/migration/migration-report.json --output test-results/migration/subagent-task.md
# Measure success rate from before/after migration reports
pkf run migration-subagent-evaluate -- --before-report test-results/migration/migration-report.json --after-report test-results/migration/migration-report.after.json
# Inspect blind migration scenarios
pkf run migration-blind-list
pkf run migration-blind-show --scenario shadcn-to-luna
pkf run migration-blind-prepare --scenario shadcn-to-luna -- --packet test-results/migration/blind/shadcn-to-luna/task.md
pkf run migration-blind-solo --scenario shadcn-to-luna -- --output test-results/migration/blind/shadcn-to-luna/solo/after-blind.html --report-output-dir test-results/migration/blind/shadcn-to-luna/solo-report
pkf run migration-blind-evaluate --scenario shadcn-to-luna -- --before-report test-results/migration/blind/shadcn-to-luna/migration-report.json --after-report test-results/migration/blind/shadcn-to-luna/solo-report/migration-report.json --rounds 1
# Mask dynamic content
vlmkit snapshot http://localhost:3000/ --mask ".marquee-container,.hero-badge"
# Detect broken baseline renders (e.g. CDN failed to load) — on by default
vlmkit diff html --dir fixtures/migration/tailwind-to-vanilla \
--baseline before.html --variants after.html
# Add --strict-baseline-sanity to exit non-zero when warnings fire,
# or --no-baseline-sanity to skip the check entirely.
# CSS challenge benchmark
pkf run css-bench --fixture page --trials 30
# Fix loop (break CSS → VLM analyze → LLM fix → verify)
pkf run fix-loop --fixture page --seed 42
Task runner & spec gates (pkfire / pkspec)
Tasks live in Taskfile.pkl (typed; replaces the previous bash
justfile). Specs live in
Spec.pkl (Goals + Scenarios) and Test.pkl (smoke implementations).
# List every available task
pkf list
# Run a task (mirrors any old `just <name>` invocation)
pkf run smoke-all
pkf run vrt-test
# Spec gates
pkf run spec-check # pkspec check Spec.pkl Test.pkl
pkf run spec-render # render Spec.pkl → docs/SPEC.md
pkf run spec-run # execute every Test.pkl test
Install pkf / pkspec via nix flake:
nix run git+https://github.com/mizchi/pkfire -- list
nix run git+https://github.com/mizchi/pkspec -- check Spec.pkl Test.pkl
CLI Surface
Diff (compare two things)
vlmkit diff html <baseline> <variant> # HTML/URL pair → multi-viewport diff + report.json
vlmkit diff agent <report.json> # Render report.json as agent-friendly Markdown
vlmkit diff png <baseline.png> <current.png> # Direct PNG pixel diff + heatmap
vlmkit diff elements [options] # Element-level diff with shift isolation
vlmkit diff browsers <html|url> # chromium / firefox / webkit parity
vlmkit diff runs <dir...> # Aggregate multiple VRT runs into one table
Snapshot (URL → baseline + diff)
vlmkit snapshot <url1> [url2] ... # First run: baseline. Subsequent: baseline + diff
vlmkit snapshot approve # Promote *-current.png → *-baseline.png
vlmkit snapshot fix-prompt # Emit a subagent-ready prompt from snapshot-report.json
vlmkit snapshot stability <url...> # Run N iterations and report false-positive rate
vlmkit snapshot flipbook # Diff three-frame (baseline ↔ current ↔ heatmap) HTML flipbooks
vlmkit snapshot report # Render snapshot-report.json as Markdown
Check (gates: a11y / tokens / theme / perf / drift)
vlmkit check a11y contrast <html> # WCAG AA contrast scan
vlmkit check a11y touch <html|url> # Touch target size (WCAG 2.5.5 / 2.5.8)
vlmkit check a11y focus <html|url> # Tab order vs visual order
vlmkit check tokens <html> # radius/spacing/z-index/shadow scale conformance
vlmkit check theme <html> # prefers-color-scheme dark / unthemed components
vlmkit check perf <html|url> # Web Vitals (CLS / LCP / FCP)
vlmkit check drift component <html> --selector .card
vlmkit check drift pages --selector .footer --files A.html B.html C.html
Build / Scan / Inspect / Stress (markup-assistance)
# Build component from a target screenshot, iterate until close.
vlmkit build component <target.png> <current.html>
# signals: bbox + heatmap regions + dominant fill + typography hints
# + spacing-fix table + palette diff + multi-state suspect flags.
# Detect components in a screenshot.
vlmkit scan component <screenshot.png> # Crop to standalone PNGs
vlmkit scan breakpoints <html-file> # Discover responsive breakpoints
# Scripted / exploratory interaction.
vlmkit inspect interact <html|url> --sequence <path.json>
vlmkit inspect explore <html|url> # Auto-discover declared actions and diff each
vlmkit inspect smoke <html|url> # A11y-driven exploratory smoke test
# Stress tests.
vlmkit stress i18n <html> # Text-node inflation overflow detection
vlmkit stress media <html> # forced-colors, reduced-motion, print, RTL, 200% zoom
All emit a self-contained Markdown report under --output-dir. Each
finding includes pasteable hex / px values + a heuristic remediation
hint. See docs/reports/2026-05-13-capability-survey.md for the full
scenario × coverage matrix.
Snapshot labels are query-aware by default, so /issues and /issues?severity=critical no longer share the same baseline name.
Use repeated --label flags to override labels explicitly when needed.
The same --label flag can be used with vlmkit snapshot approve to approve only selected labels.
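As a rough illustration of query-aware labels, the sketch below derives a filesystem-safe label from a path plus query string and lets an explicit label take precedence. The function names `deriveLabel` and `resolveLabel` and the exact sanitization rule are assumptions made for illustration; vlmkit's real naming scheme may differ.

```typescript
// Hypothetical sketch: vlmkit's actual label derivation may differ.
// Turns a URL path + query into a filesystem-safe baseline label.
function deriveLabel(pathWithQuery: string): string {
  const sanitized = pathWithQuery
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse separators such as / ? = &
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return sanitized === "" ? "root" : sanitized;
}

// An explicit label (as passed with --label) wins over the derived one.
function resolveLabel(pathWithQuery: string, explicit?: string): string {
  return explicit ?? deriveLabel(pathWithQuery);
}
```

Under this scheme /issues and /issues?severity=critical map to different labels, which is the property the paragraph above describes.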
Minimal vrt.config.json:
{
"baseUrl": "http://localhost:3000",
"routes": [
"/",
{ "path": "/issues?severity=critical", "label": "critical-issues" }
],
"outputDir": "test-results/snapshots/sample-webapp-2026",
"threshold": 0.1,
"failOnDiff": true,
"maxDiffRatio": 0.01,
"workflow": {
"captureSpec": "./e2e/vrt-capture.spec.ts"
}
}
When vrt.config.json exists in the current directory, vlmkit snapshot loads it automatically. Use --config <path> to point at another file, and pass URLs or flags directly when you want CLI values to override config defaults.
vlmkit workflow init and vlmkit workflow capture also auto-load the same file, reuse baseUrl/routes, and accept workflow.captureSpec or --capture-spec <path> when you want a custom Playwright entrypoint.
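The override behavior can be pictured as a field-by-field merge in which a CLI value, when present, replaces the config default. The field names below are taken from the sample vrt.config.json; the `SnapshotOptions` type and `mergeOptions` function are hypothetical and only sketch the idea.

```typescript
// Field names mirror the vrt.config.json sample; the merge logic is an assumption.
interface SnapshotOptions {
  baseUrl?: string;
  outputDir?: string;
  threshold?: number;
  failOnDiff?: boolean;
  maxDiffRatio?: number;
}

// CLI values win; fields not given on the CLI fall back to the config file.
// `??` is used instead of `||` so an explicit `false` or `0` still overrides.
function mergeOptions(config: SnapshotOptions, cli: SnapshotOptions): SnapshotOptions {
  return {
    baseUrl: cli.baseUrl ?? config.baseUrl,
    outputDir: cli.outputDir ?? config.outputDir,
    threshold: cli.threshold ?? config.threshold,
    failOnDiff: cli.failOnDiff ?? config.failOnDiff,
    maxDiffRatio: cli.maxDiffRatio ?? config.maxDiffRatio,
  };
}
```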
Subagent-ready fix prompt
vlmkit snapshot fix-prompt reads the last snapshot-report.json and emits a structured task list that a coding agent can act on:
# Markdown prompt to stdout (default)
vlmkit snapshot fix-prompt --output test-results/snapshots
# Limit to the 5 worst diffs, write to a file
vlmkit snapshot fix-prompt --output test-results/snapshots --limit 5 --out fix-prompt.md
# Filter by label, minimum diff ratio, and emit JSON for programmatic use
vlmkit snapshot fix-prompt --label home --min-diff 0.01 --format json
The prompt includes per-task URL, viewport, diff ratio (with shift compensation), and relative paths to the baseline / current / heatmap PNGs plus the captured HTML, so a subagent can map the visual regression back to source code.
Measuring false-positive rate
vlmkit snapshot stability captures the same URLs across N iterations against a
baseline locked in on iteration 0, then reports how often comparisons showed a
non-zero diff. Useful for tracking renderer noise, animation leakage, or mask
gaps before turning on --fail-on-diff in CI:
# 3 iterations (default), any non-zero diff counts as a positive
vlmkit snapshot stability http://localhost:3000/ http://localhost:3000/about/
# Fail CI if the overall FP rate exceeds 5%
vlmkit snapshot stability http://localhost:3000/ \
--iterations 5 \
--fail-above-rate 0.05 \
--output test-results/stability
# Only count diffs above 1% as positives (filters out subpixel noise)
vlmkit snapshot stability http://localhost:3000/ --fp-threshold 0.01
The run writes stability-report.json to the output directory with per-URL +
per-viewport FP rate, mean / max diff ratios, and shift-compensated max — well
suited to artifact upload + over-time tracking.
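The arithmetic behind the false-positive rate can be sketched as follows, assuming one diff ratio per comparison against the iteration-0 baseline. The function names are hypothetical; the parameters `fpThreshold` and `failAboveRate` are named after the --fp-threshold and --fail-above-rate flags, and the comparison operators (strictly greater than) are an assumption.

```typescript
// Hypothetical sketch of the false-positive calculation; not vlmkit's own code.
// diffRatios holds one diff ratio per comparison against the locked baseline.
function falsePositiveRate(diffRatios: number[], fpThreshold = 0): number {
  if (diffRatios.length === 0) return 0;
  // With the default threshold of 0, any non-zero diff counts as a positive.
  const positives = diffRatios.filter((ratio) => ratio > fpThreshold).length;
  return positives / diffRatios.length;
}

// Mirrors the idea of --fail-above-rate: fail when the rate exceeds the limit.
function exceedsGate(rate: number, failAboveRate: number): boolean {
  return rate > failAboveRate;
}
```

For example, four comparisons with ratios 0, 0.002, 0.03 and 0 give a rate of 0.5 with the default threshold and 0.25 once diffs at or below 1% are ignored.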
Capture backend (--backend)
By default vlmkit snapshot launches a local Chromium via Playwright. To offload
capture to Cloudflare Browser Run
without installing Playwright browsers in CI, switch the backend:
# Connect via CDP WebSocket; credentials come from env vars
CLOUDFLARE_ACCOUNT_ID=... CLOUDFLARE_API_TOKEN=... \
vlmkit snapshot --backend cloudflare http://localhost:3000/
Resolution order for the backend selector:
- --backend <local|cloudflare> CLI flag
- VRT_CAPTURE_BACKEND env var
- Default local
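The resolution order above can be sketched as a small function. `resolveBackend` and `isBackend` are hypothetical names, and silently skipping an unrecognized value is an assumption; the real CLI may reject it instead.

```typescript
type CaptureBackend = "local" | "cloudflare";

function isBackend(value: string | undefined): value is CaptureBackend {
  return value === "local" || value === "cloudflare";
}

// Order: CLI flag, then the VRT_CAPTURE_BACKEND env var, then "local".
// Assumption: unrecognized values are skipped rather than treated as errors.
function resolveBackend(
  cliFlag: string | undefined,
  env: Record<string, string | undefined>,
): CaptureBackend {
  if (isBackend(cliFlag)) return cliFlag;
  const fromEnv = env["VRT_CAPTURE_BACKEND"];
  if (isBackend(fromEnv)) return fromEnv;
  return "local";
}
```

Passing the environment as a plain record keeps the function easy to test; in a Node script it would typically be called with process.env.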
For the Cloudflare backend, additional env vars are required:
| Variable | Required | Purpose |
|---|---|---|
| CLOUDFLARE_ACCOUNT_ID | yes | Account id for the CDP URL |
| CLOUDFLARE_API_TOKEN | yes | Token with Browser Rendering permissions |
| CLOUDFLARE_BROWSER_RUN_ENDPOINT | no | Override the default WS endpoint |
See examples/vrt-snapshot-cloudflare.workflow.yml for a complete GitHub
Actions template that skips the local Playwright install step.
Visualizing the VRT process — flipbooks + video
The VRT process can be saved as a self-contained HTML "flipbook" (PNGs embedded as base64, vanilla-JS play/pause/scrub controls). One file per scenario, no external assets, opens in any browser, attachable to PRs:
# 1. Fix-loop convergence (or any ordered PNG sequence)
vlmkit snapshot flipbook round-0.png round-1.png round-2.png \
--label "round 0" --label "round 1" --label "round 2" \
--title "Fix-loop convergence" --out fix-loop.html
# 2. Diff three-frame (baseline ↔ current ↔ heatmap) for every regressed entry
vlmkit snapshot flipbook --output test-results/snapshots
# → test-results/snapshots/flipbooks/<label>-<viewport>.html
# 3. Stability iterations as flipbook per (URL, viewport)
vlmkit snapshot stability http://localhost:3000/ \
--iterations 5 --flipbook --output test-results/stability
# → test-results/stability/flipbooks/<label>-<viewport>-stability.html
# 4. WebM recording of a smoke-test session (Playwright recordVideo)
vlmkit inspect smoke --url http://localhost:3000/ --max-actions 20 --record-video videos/
# → videos/<hash>.webm
Common flags: --delay <ms> controls per-frame duration (default 700),
--no-loop stops at the last frame, --no-autoplay opens paused.
Agent-friendly diff summary
When a coding agent is iterating with vlmkit diff html, the natural workflow
(see docs/reports/2026-05-12-dogfood-shadcn-luna.md)
is: read the worst-viewport PNGs side-by-side, then write a CSS patch.
vlmkit diff agent collapses the inputs the agent needs into a single
Markdown blob:
vlmkit diff html --dir fixtures/migration/shadcn-to-luna \
--baseline before.html --variants working.html \
--output test-results/iter1
vlmkit diff agent test-results/iter1/diff-report.json --max-viewports 2
The output contains: a worst-first diff table, category totals across
viewports, fix candidates aggregated by (selector, property) with the
number of viewports each is flagged on, and absolute paths to the
baseline / current / heatmap PNGs for the worst N viewports — all in
one context window.
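To make the fix-candidate aggregation concrete, here is a minimal sketch that groups findings by (selector, property) and counts the distinct viewports each pair is flagged on. The `StyleFinding` and `FixCandidate` shapes are hypothetical; the real diff-report.json schema is not shown in this README.

```typescript
// Hypothetical shapes: the real diff-report.json schema may differ.
interface StyleFinding {
  viewport: string;
  selector: string;
  property: string;
}

interface FixCandidate {
  selector: string;
  property: string;
  viewportCount: number;
}

function aggregateFixCandidates(findings: StyleFinding[]): FixCandidate[] {
  // Key each (selector, property) pair and collect the viewports it appears on.
  const viewportsByKey = new Map<string, Set<string>>();
  for (const finding of findings) {
    const key = JSON.stringify([finding.selector, finding.property]);
    const viewports = viewportsByKey.get(key) ?? new Set<string>();
    viewports.add(finding.viewport);
    viewportsByKey.set(key, viewports);
  }
  const candidates: FixCandidate[] = [];
  viewportsByKey.forEach((viewports, key) => {
    const [selector, property] = JSON.parse(key) as [string, string];
    candidates.push({ selector, property, viewportCount: viewports.size });
  });
  // Most widely flagged candidates first.
  return candidates.sort((a, b) => b.viewportCount - a.viewportCount);
}
```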
Workflow Commands
These commands manage state under the current project root: baselines/, snapshots/, output/, vrt-report.json, expectation.json, and spec.json.
Before running them, start the target app and point VRT_BASE_URL at it when needed.
The built-in capture workflow defaults to http://127.0.0.1:4174.
vlmkit workflow verify itself only compares the PNG and .a11y.json artifacts already present under baselines/ and snapshots/; it does not launch Playwright.
vlmkit workflow init
vlmkit workflow capture
vlmkit workflow verify
vlmkit workflow approve
vlmkit workflow report
vlmkit workflow graph
vlmkit workflow affected
vlmkit workflow introspect
vlmkit workflow spec-verify
vlmkit workflow expect
If vrt.config.json defines routes, the built-in capture spec uses those routes instead of the repo-local defaults.
The PR workflow also runs a deterministic snapshot false-positive check against fixtures/css-challenge using .github/vrt-snapshot-ci.config.json.
It creates baselines once, re-runs the same URLs, and summarizes test-results/snapshots/ci/snapshot-report.json with vlmkit snapshot report.
For migration workflows, vlmkit migration subagent packages the highest-impact diff per variant into a prompt for an external fixer, then compares before/after migration-report.json files to measure resolved/improved success rates.
Blind migration scenarios are declared in fixtures/migration/blind-scenarios.json, including the existing reset-css blind target and a scaffolded shadcn-to-luna/after-blind.html target for reproducible E3 runs. vlmkit migration blind supports list, show, prepare, solo, and evaluate so the blind run can emit a fresh compare report, generate a fixer packet, run a deterministic reference-CSS repair, and check the diff < 1% within 3 rounds contract without hand-assembling paths.
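The "diff < 1% within 3 rounds" contract reduces to a simple predicate over the per-round diff ratios. The sketch below is one reading of that contract; `meetsBlindContract` is a hypothetical name and the evaluate subcommand may compute it differently.

```typescript
// Hypothetical reading of the contract: some round among the first
// `maxRounds` must bring the diff ratio strictly below `target`.
function meetsBlindContract(
  diffRatiosPerRound: number[],
  maxRounds = 3,
  target = 0.01,
): boolean {
  return diffRatiosPerRound.slice(0, maxRounds).some((ratio) => ratio < target);
}
```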
Workflow aliases are kept for ergonomics where they do not collide:
- vlmkit init, vlmkit capture, vlmkit verify, vlmkit approve
- vlmkit graph, vlmkit affected, vlmkit introspect, vlmkit spec-verify, vlmkit expect
vlmkit report remains the detection pattern report, so verification output lives under vlmkit workflow report.
Capture routes for external projects
vlmkit workflow init|capture runs e2e/vrt-capture.spec.ts, which now resolves
its route list from your project rather than hard-coding vrt's own pages.
Drop a vrt.config.json next to your app with a capture block:
{
"baseUrl": "http://localhost:3000",
"capture": {
"routes": [
{ "name": "home", "path": "/", "waitFor": "main" },
{ "name": "about", "path": "/about" },
"/contact"
]
}
}
Each route accepts name (defaults to a sanitized form of path), path, and
an optional waitFor CSS selector. Resolution order:
- VRT_CAPTURE_ROUTES env var (JSON-encoded array)
- --config <path> flag or VRT_CONFIG_PATH env var
- vrt.config.json auto-discovered in the working directory
- Built-in defaults (vlmkit's own UI — useful only when developing vlmkit itself)
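A minimal sketch of route normalization and the resolution order above follows. The route fields (name, path, waitFor) come from the capture block; the sanitization rule, the function names, and the collapsing of both config-file sources into a single `configRoutes` argument are assumptions made for illustration.

```typescript
// Route shape follows the capture block; the normalization rule is an assumption.
interface CaptureRoute {
  name: string;
  path: string;
  waitFor?: string;
}
type RouteInput = string | { name?: string; path: string; waitFor?: string };

function sanitizeName(path: string): string {
  const name = path.replace(/[^a-zA-Z0-9]+/g, "-").replace(/^-+|-+$/g, "");
  return name === "" ? "root" : name;
}

// Accepts either a bare path string or a route object.
function normalizeRoute(input: RouteInput): CaptureRoute {
  const route: { name?: string; path: string; waitFor?: string } =
    typeof input === "string" ? { path: input } : input;
  const normalized: CaptureRoute = {
    name: route.name ?? sanitizeName(route.path),
    path: route.path,
  };
  if (route.waitFor !== undefined) normalized.waitFor = route.waitFor;
  return normalized;
}

// First available source wins: env var JSON, then config routes, then defaults.
function resolveRoutes(
  envJson: string | undefined,
  configRoutes: RouteInput[] | undefined,
  defaults: RouteInput[],
): CaptureRoute[] {
  const fromEnv = envJson ? (JSON.parse(envJson) as RouteInput[]) : undefined;
  return (fromEnv ?? configRoutes ?? defaults).map(normalizeRoute);
}
```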
# External project usage
vlmkit workflow init --config ./vrt.config.json --base-url http://localhost:5173
vlmkit workflow capture --config ./vrt.config.json
vlmkit workflow verify
API Commands
vlmkit api serve [--port 3456] # Start HTTP API server
vlmkit api status [--url http://localhost:3456]
Compatibility aliases:
- vlmkit serve -> vlmkit api serve
- vlmkit status -> vlmkit api status
HTTP API
Start the server:
vlmkit api serve --port 3456
The shared Hono app also exposes a Cloudflare Workers entry point at worker/index.ts.
Available endpoints:
- GET /api/openapi.json — OpenAPI 3.1 spec for the current HTTP surface
- GET /api/status — server version, backends, and capabilities
- POST /api/compare — compare baseline/current HTML or URLs across viewports
- POST /api/compare-renderers — compare Chromium vs Crater rendering
- POST /api/reason — VLM/LLM reasoning pipeline for diff analysis and fixes
- POST /api/smoke-test — random or reasoning-guided a11y smoke test
When running on Workers, /api/status also reports detected R2 / KV / D1 storage bindings.
TypeScript client:
import { VrtClient } from "@mizchi/vlmkit/client";
const client = new VrtClient("http://localhost:3456");
const status = await client.status();
const result = await client.compareHtml(
"<main><button>Before</button></main>",
"<main><button class='primary'>After</button></main>",
);
Install: pnpm add @mizchi/vlmkit
compareUrls(...) is intended for public HTTP(S) targets. The API server rejects localhost and private-network URLs.
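Because the server rejects such targets, a caller may want a cheap pre-check before sending a request. The sketch below is a hypothetical client-side helper, not part of the vlmkit client; it covers only localhost and common private IPv4 ranges, and the server's own rules may be stricter (IPv6 and DNS resolution are out of scope here).

```typescript
// Hypothetical client-side pre-check; the API server applies its own rules.
function isLikelyPrivateHost(hostname: string): boolean {
  const host = hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  const octets = host.split(".").map(Number);
  const isIpv4 =
    octets.length === 4 &&
    octets.every((octet) => Number.isInteger(octet) && octet >= 0 && octet <= 255);
  if (!isIpv4) return false; // not an IPv4 literal, so treat as a public name
  const [a, b] = octets;
  return (
    a === 10 || // 10.0.0.0/8
    a === 127 || // loopback
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) || // 192.168.0.0/16
    (a === 169 && b === 254) // link-local
  );
}
```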
Architecture
HTML (file or URL)
│
├── Pixel diff (pixelmatch v7 → heatmap → diff ratio)
├── Computed style diff (getComputedStyle → property-level changes)
├── A11y tree diff (accessibility snapshot → structural changes)
└── Paint tree diff (Crater BiDi → layout tree comparison)
│
▼
Detection & Classification
│
▼
AI Fix Pipeline (optional)
Stage 1: VLM (cheap) → structured CHANGE report
Stage 2: LLM (accurate) → CSS fix suggestions
│
▼
Dry-run verification → rollback if worse
Environment Variables
| Variable | Purpose | Default |
|----------|---------|---------|
| VRT_LLM_PROVIDER | LLM provider | gemini |
| VRT_LLM_MODEL | LLM model | provider default |
| VRT_VLM_MODEL | VLM model (OpenRouter) | qwen/qwen3-vl-8b-instruct |
| OPENROUTER_API_KEY | OpenRouter API key | — |
| GEMINI_API_KEY | Google AI API key | — |
| ANTHROPIC_API_KEY | Anthropic API key | — |
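The defaults in the table can be read as the following lookup. `ModelSettings` and `readModelSettings` are hypothetical names, and treating an empty string the same as an unset variable is an assumption. API keys are left out because which key is needed depends on the selected provider and model.

```typescript
interface ModelSettings {
  llmProvider: string;
  llmModel: string | undefined; // undefined means "use the provider default"
  vlmModel: string;
}

// Applies the documented defaults; empty strings are treated as unset (assumption).
function readModelSettings(env: Record<string, string | undefined>): ModelSettings {
  return {
    llmProvider: env["VRT_LLM_PROVIDER"] || "gemini",
    llmModel: env["VRT_LLM_MODEL"] || undefined,
    vlmModel: env["VRT_VLM_MODEL"] || "qwen/qwen3-vl-8b-instruct",
  };
}
```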
Project Structure
src/
vrt.ts # Unified public CLI entry point
vrt-command-router.ts # Root command routing + usage text
vrt-cli.ts # Stateful workflow CLI
vrt-client.ts # TypeScript client for the HTTP API
snapshot.ts # URL snapshot + baseline diff
migration-compare.ts # HTML/URL comparison across viewports
css-challenge-bench.ts # CSS deletion/recovery benchmark
fix-loop.ts # AI-powered CSS fix loop
vrt-reasoning-pipeline.ts # 2-stage VLM + LLM pipeline
heatmap.ts # Pixel diff + heatmap generation
mask.ts # Selector-based visibility masking
vlm-client.ts # OpenRouter / Gemini VLM client
llm-client.ts # Multi-provider LLM client
crater-client.ts # Crater BiDi WebSocket client
api-server.ts # Hono API server
fixtures/
css-challenge/ # 9 HTML fixtures for CSS bench
migration/ # Migration comparison fixtures
docs/
knowledge.md # Accumulated experimental findings
reports/ # Dated experiment reports
Agent Skills (APM)
vlmkit ships five coding-agent skills under .claude/skills/. They wrap
the most common workflows as standalone, agent-readable playbooks.
Other repos can install them via APM:
# Install a single skill into the current repo's .claude/skills/
apm install mizchi/vlmkit/.claude/skills/vrt-visual-diff
# Install all five
apm install mizchi/vlmkit/.claude/skills/vrt-visual-diff \
mizchi/vlmkit/.claude/skills/vrt-migration-eval \
mizchi/vlmkit/.claude/skills/vrt-css-fix-loop \
mizchi/vlmkit/.claude/skills/vrt-markup-synth \
mizchi/vlmkit/.claude/skills/vrt-regression-watch
| Skill | Entry workflow | Use when |
|---|---|---|
| vrt-visual-diff | vlmkit diff html → vlmkit diff agent | One-shot "did this CSS edit visibly change something?" |
| vrt-migration-eval | vlmkit migration compare\|blind\|subagent | Framework / CSS-lib / build-system swap audit |
| vrt-css-fix-loop | fix-loop.ts (VLM-driven) | Closed-loop CSS auto-repair benchmark |
| vrt-markup-synth | vlmkit build\|scan\|check\|stress * | Screenshot → HTML/CSS, token / theme / i18n audits |
| vrt-regression-watch | vlmkit diff agent --previous --fail-on-regression | Per-PR or scheduled regression gate |
Each skill assumes the vlmkit CLI is on $PATH (installed from the published
Node package, or built from source) and Node 24+. VLM-using skills
(fix-loop, markup-synth, migration subagent) additionally need
one of OPENROUTER_API_KEY / GEMINI_API_KEY / ANTHROPIC_API_KEY
depending on the model selected via VRT_VLM_MODEL.
License
MIT
