@se-studio/site-check
v1.0.24
Validate SE marketing sites (sitemap, llms.txt) and download markdown files preserving structure
Validate SE marketing sites (sitemap.xml, sitemap-unindexed.xml, llms.txt) and download all markdown files into a local directory, preserving URL path structure.
Usage
npx @se-studio/site-check <baseUrl> [-o dir] [-H "Name: value"] [--vercel-bypass [secret]] [--check-against <devUrl>] [--check-rewrites] [--ignore-missing-child-sitemaps]
Examples
# Local site
npx @se-studio/site-check http://localhost:3015
# Custom output directory
npx @se-studio/site-check http://localhost:3015 -o ./out
# Compare production vs development: use production sitemap, check pages exist on dev
npx @se-studio/site-check https://example.com --check-against http://localhost:3010
# Enable rewrite check (optional; off by default)
npx @se-studio/site-check https://example.com --check-rewrites
# Vercel Deployment Protection: use env secret
VERCEL_AUTOMATION_BYPASS_SECRET=your-secret npx @se-studio/site-check https://preview.vercel.app --vercel-bypass
# Vercel bypass with explicit secret
npx @se-studio/site-check https://preview.vercel.app --vercel-bypass your-secret
# Custom headers
npx @se-studio/site-check https://example.com -H "Authorization: Bearer token" -H "X-Custom: value"
# Sitemap index lists a child that does not exist locally (404) — skip it and use remaining children
npx @se-studio/site-check http://localhost:3016 --ignore-missing-child-sitemaps -o ./out
What it does
Validates the site by fetching:
- sitemap.xml (required): must return 200 and contain <loc> entries. If it is a sitemap index, child sitemaps are fetched and must return valid urlset content (or use --ignore-missing-child-sitemaps to skip child URLs that return 404).
- sitemap-unindexed.xml (optional): if it returns 404, a warning is printed and the run continues. When it exists (returns 200), its page URLs are included in all checks (existence on dev, optional rewrite check, markdown collection).
- llms.txt (optional): if it returns 404, a warning is printed and the run continues; if it returns 200, it is validated.
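The distinction between a sitemap index and a plain urlset can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the names `classifySitemap` and `extractLocs` are hypothetical, and the regex-based parsing assumes simple, well-formed sitemap XML without CDATA sections or entity decoding.

```typescript
// Hypothetical helpers; the README does not show the tool's internals.
type SitemapKind = "index" | "urlset" | "unknown";

// Decide which kind of sitemap document was returned by looking at the root element.
function classifySitemap(xml: string): SitemapKind {
  if (/<sitemapindex[\s>]/i.test(xml)) return "index";
  if (/<urlset[\s>]/i.test(xml)) return "urlset";
  return "unknown";
}

// Collect the text of every <loc> element, trimmed of surrounding whitespace.
// Assumption: entities such as &amp; are left as-is.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    const value = match[1];
    if (value) locs.push(value);
  }
  return locs;
}
```

For an index, each extracted `<loc>` would be a child sitemap to fetch; for a urlset, each one is a page URL.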
Collects markdown URLs from sitemap.xml and sitemap-unindexed.xml (Option B): when either URL returns a sitemap index (e.g. <sitemapindex> with child <sitemap><loc>...</loc></sitemap>), the tool follows those links and collects page URLs from each child sitemap's urlset (one level only). When the response is a normal urlset, every <loc> is treated as a page URL. Each page URL is then converted to a .md URL (path + .md). This ensures all indexed and unindexed pages are checked.
Checks for unexpected rewrites (only when --check-rewrites is set): For each page URL from the sitemaps, the tool sends a HEAD request and reads the x-nextjs-rewritten-path response header. If the header is present and the rewritten path is different from the requested path (after normalisation), the run fails (exit 1). This catches cases where the sitemap lists one URL (e.g. /learning-hub/blog/) but the app rewrites to another (e.g. /articles/blog). Rewrites that only add or remove a locale prefix are allowed when SITE_CHECK_LOCALES is set (see Options). By default the rewrite check is disabled.
Checks and downloads: For each markdown URL, the tool fetches it. If any returns non-2xx, the run fails (exit 1) and reports which URLs are missing or errored. Otherwise it saves each response to the output directory, preserving path structure (e.g. blog/foo.md → ./markdown-export/blog/foo.md, es-US/about.md → ./markdown-export/es-US/about.md).
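The page URL to .md URL to local file mapping described above might look roughly like this. It is a sketch under stated assumptions: `toMarkdownUrl` and `toLocalPath` are hypothetical names, trailing slashes are assumed to be stripped before `.md` is appended, and the handling of the site root (mapped here to `/index.md`) is a guess, since the README does not say.

```typescript
// Hypothetical helpers; not the package's real API.

// Convert a sitemap page URL into its markdown URL (path + ".md").
function toMarkdownUrl(pageUrl: string): string {
  const url = new URL(pageUrl);
  const trimmed = url.pathname.replace(/\/+$/, "");
  // Assumption: the site root maps to /index.md (not stated in the README).
  url.pathname = (trimmed === "" ? "/index" : trimmed) + ".md";
  return url.toString();
}

// Map a markdown URL to a file path under the output directory,
// preserving the URL path structure.
function toLocalPath(markdownUrl: string, outDir: string = "./markdown-export"): string {
  const relative = new URL(markdownUrl).pathname.replace(/^\/+/, "");
  return outDir.replace(/\/+$/, "") + "/" + relative;
}
```

With these assumptions, `https://example.com/es-US/about` becomes `https://example.com/es-US/about.md` and would be saved as `./markdown-export/es-US/about.md`, matching the examples in the text.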
Compare mode (--check-against)
When --check-against <devUrl> is set, the tool runs in compare mode:
- Production (the <baseUrl>) is the source of the URL list: sitemap.xml and sitemap-unindexed.xml are fetched from production. Validation (and llms.txt) applies to production.
- Development (<devUrl>) is the site checked: each production sitemap page URL (including from sitemap-unindexed.xml when present) is mapped to the same path on the development origin. The tool then checks that each of those development URLs returns 2xx (HEAD). If --check-rewrites is set, it also runs the rewrite check against the development site. Markdown collection and download are not performed in compare mode.
Use this to ensure a development or staging build has the same pages as production (e.g. before release). Page title comparison is not yet implemented; a future option may add it.
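As an illustration of compare mode, the mapping from a production page URL to the development origin, followed by a HEAD check, could be sketched like this. The function names are hypothetical. The sketch also assumes Node 18+ (global `fetch`), default redirect following, and that the query string is carried over along with the path; none of that is confirmed by the README.

```typescript
// Hypothetical helpers illustrating compare mode; not the tool's real code.

// Keep the path (and query) from production; take scheme, host, and port from development.
function mapToDevUrl(productionPageUrl: string, devBaseUrl: string): string {
  const page = new URL(productionPageUrl);
  const dev = new URL(devBaseUrl);
  return dev.origin + page.pathname + page.search;
}

// Send a HEAD request and report whether the response status is 2xx.
// Assumption: redirects are followed (the fetch default).
async function existsOnDev(devUrl: string, headers: Record<string, string> = {}): Promise<boolean> {
  const response = await fetch(devUrl, { method: "HEAD", headers });
  return response.ok;
}
```

For example, `mapToDevUrl("https://example.com/blog/foo", "http://localhost:3010")` yields `http://localhost:3010/blog/foo`.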
Options
| Option | Description |
|--------|-------------|
| baseUrl | Base URL of the site (required). In compare mode this is the production URL (sitemap source). Trailing slash is stripped. |
| -o, --out <dir> | Output directory for markdown files (default: ./markdown-export). Ignored in compare mode. |
| -H, --header "Name: value" | Add a request header (repeatable). Applied to both production and development requests in compare mode. |
| --check-against <devUrl> | Compare mode: use production sitemap(s) but check that each page exists on development (same path on <devUrl>). No markdown download. |
| --check-rewrites | Enable checking for unexpected rewrites (x-nextjs-rewritten-path). Default: off. When set, the run fails if any sitemap page URL rewrites to a different path (locale-only rewrites allowed with SITE_CHECK_LOCALES). |
| --ignore-missing-child-sitemaps | When the root sitemap (or sitemap-unindexed.xml) is a sitemap index, any child <loc> that returns 404 is skipped instead of failing validation. Other non-2xx responses still fail. After skipping, at least one page URL must still be collected from the remaining children. |
| --vercel-bypass [secret] | Set x-vercel-protection-bypass for Vercel Deployment Protection. If secret is omitted, uses VERCEL_AUTOMATION_BYPASS_SECRET from the environment. |
Headers from -H override the Vercel bypass header if the same name is used.
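The precedence rule can be pictured as building one header map in which `-H` values are applied last. This is only a sketch: `parseHeader` and `buildHeaders` are hypothetical names, and lower-casing header names for comparison is an assumption based on HTTP header names being case-insensitive.

```typescript
// Hypothetical helpers; the package's real option handling is not shown in this README.

// Split a raw "Name: value" string at the first colon.
function parseHeader(raw: string): [string, string] {
  const idx = raw.indexOf(":");
  if (idx <= 0) {
    throw new Error('Invalid header (expected "Name: value"): ' + raw);
  }
  // Assumption: names are compared case-insensitively, so store them lower-cased.
  return [raw.slice(0, idx).trim().toLowerCase(), raw.slice(idx + 1).trim()];
}

// Build the final request headers: bypass header first, then -H entries on top.
function buildHeaders(rawHeaders: string[], bypassSecret?: string): Record<string, string> {
  const headers: Record<string, string> = {};
  if (bypassSecret) {
    headers["x-vercel-protection-bypass"] = bypassSecret;
  }
  for (const raw of rawHeaders) {
    const [name, value] = parseHeader(raw);
    headers[name] = value; // -H wins over the bypass header when names collide
  }
  return headers;
}
```

Under this model, passing both `--vercel-bypass secret-a` and `-H "x-vercel-protection-bypass: secret-b"` would send `secret-b`.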
Environment
| Variable | Description |
|----------|-------------|
| SITE_CHECK_LOCALES | Comma-separated list of locale path segments (e.g. en,en-gb,de). When set, a rewrite is considered acceptable if the only difference between the requested path and the rewritten path is a leading locale segment. If unset, any rewrite to a different path fails. |
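The locale rule can be sketched as a comparison of the two paths after removing a leading locale segment from each. All function names here are hypothetical, and "normalisation" is assumed to mean stripping trailing slashes only; the tool may normalise differently. Locale matching in this sketch is exact and case-sensitive.

```typescript
// Hypothetical helpers illustrating the SITE_CHECK_LOCALES rule; not the tool's real code.

// Parse "en,en-gb,de" into ["en", "en-gb", "de"]; unset or empty gives [].
function parseLocales(envValue: string | undefined): string[] {
  return (envValue ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Assumption: normalisation only strips trailing slashes.
function normalisePath(path: string): string {
  const trimmed = path.replace(/\/+$/, "");
  return trimmed === "" ? "/" : trimmed;
}

// Remove a leading locale segment, e.g. "/en/about" -> "/about".
function stripLocale(path: string, locales: string[]): string {
  const segments = normalisePath(path).split("/"); // "/en/about" -> ["", "en", "about"]
  const first = segments[1];
  if (first !== undefined && locales.indexOf(first) !== -1) {
    segments.splice(1, 1);
  }
  const rest = segments.join("/");
  return rest === "" ? "/" : rest;
}

// rewritten is the x-nextjs-rewritten-path header value, or null when absent.
function isUnexpectedRewrite(requested: string, rewritten: string | null, locales: string[]): boolean {
  if (rewritten === null) return false;
  if (normalisePath(requested) === normalisePath(rewritten)) return false;
  return stripLocale(requested, locales) !== stripLocale(rewritten, locales);
}
```

With `SITE_CHECK_LOCALES=en,en-gb,de`, a rewrite from `/about` to `/en/about` would pass, while `/learning-hub/blog/` to `/articles/blog` would still fail.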
Exit codes
- 0: Validation passed (sitemap.xml required; llms.txt and sitemap-unindexed.xml optional). If --check-rewrites was set, no unexpected rewrites. Every sitemap page returned 200 for its .md URL and files were saved (or in compare mode: all production sitemap pages exist on development and, if --check-rewrites, no unexpected rewrites).
- 1: Usage error (missing baseUrl), validation failed (sitemap.xml), one or more sitemap pages rewrite to a different path when --check-rewrites was set, one or more markdown URLs returned non-2xx (pages missing or broken), or in compare mode one or more development URLs did not return 2xx. Validation does not fail when llms.txt returns 404.
