SEO Analysis Tool
This project is a crawl-based SEO CLI for auditing websites.
It focuses on technical SEO issues that can be derived from the crawl itself, with optional headless-Chromium rendering (--render), Google CrUX field data (--crux), and Lighthouse audits for a small set of pages. Optional traffic enrichment from Google Search Console (--gsc) and Google Analytics 4 (--ga4) ranks issues by real-world impact.
Every successful audit is auto-persisted to ~/.config/seo-audit/crawls/; seo-audit diff <url> compares the two most recent crawls of a host, and --fail-on <severity> gates CI/cron jobs against regressions vs. the previous persisted crawl. Opt out of persistence with --no-persist or SEO_AUDIT_NO_PERSIST=1.
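For example, a nightly cron or CI job can gate on regressions using the documented flags (the URL is a placeholder):

```bash
# Exit non-zero if high-severity issues increased vs. the previous persisted crawl
seo-audit https://example.com --max-pages 50 --fail-on high

# Inspect what changed between the two most recent persisted crawls of the host
seo-audit diff https://example.com
```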
What It Checks
- missing, short, long, duplicate, and multiple title tags
- missing, short, long, and duplicate meta descriptions
- suspicious metadata values like `[object Object]`, `undefined`, or `null`
- canonical issues, including missing, invalid, cross-host, and URL-mismatch canonicals
- HTTP pages instead of HTTPS
- locale-aware `html lang` checks
- hreflang extraction and validation
- hreflang duplicate values, missing `x-default`, missing self-reference, missing return links, redirecting targets, and locale-target mismatches
- missing and multiple H1 headings
- `noindex` directives
- images without alt text
- missing Open Graph fields, reported field-by-field
- missing or invalid JSON-LD structured data
- low body word count
- pages with no crawlable internal links
- internal links with missing anchor text
- internal links with generic anchor text like `read more` or `click here`
- broken internal links, redirecting internal links, and redirect chains
- weak internal-link support, including pages with only one incoming internal link
- orphan candidates based on the crawled internal-link graph
- sitemap inclusion checks for crawled pages
- sitemap-vs-canonical mismatches
- missing or blocking `robots.txt`
- missing `sitemap.xml`
- missing or empty `llms.txt`
- optional Lighthouse audits for performance, accessibility, best practices, and SEO
- images missing explicit width and height attributes
- images without `loading="lazy"` hints
- images served in legacy formats (jpg/png/gif) instead of webp/avif
- missing or zoom-blocking viewport meta (mobile audit)
- missing Strict-Transport-Security, missing Content-Type, or overly defensive Cache-Control on HTTP responses
- `X-Robots-Tag` noindex/nofollow directives delivered via HTTP response headers
- HTTP `Link: rel="canonical"` mismatches with the HTML `<link rel="canonical">`, multiple/invalid canonical Link values
- HTML responses served without `Content-Encoding` (gzip/br/zstd) compression
- compressed responses missing `Vary: Accept-Encoding` (shared-cache hazard)
- URLs explicitly disallowed by `robots.txt` for the configured user agent
- JSON-LD validation against rich-result requirements: Product, Article (BlogPosting/NewsArticle), FAQPage, BreadcrumbList, Organization, LocalBusiness
- JSON-LD Product offers without `price` or `priceCurrency`
- nested sitemap-index resolution (walks one level of nested sitemaps, capped at 50 children)
- sitemap entries with `lastmod` older than 12 months
- real-user Core Web Vitals from Google CrUX (LCP/INP/CLS p75) when `--crux` is enabled and `CRUX_API_KEY` is set
- optional AI-agent readiness scoring (`--agent-readiness`) covering AI-bot rules, llms.txt depth, `llms-full.txt`, markdown content negotiation, well-known endpoints (`agent-skills`, `api-catalog`, `mcp/server-card`, OAuth discovery), Web Bot Auth, and `Link:` headers
- optional Google Search Console enrichment (`--gsc`) merging clicks/impressions/CTR/avg-position per crawled URL, plus a "Priority issues" summary that ranks high/medium-severity issues by traffic exposure
- optional Google Analytics 4 enrichment (`--ga4`) merging sessions/pageviews/users/engagement-rate per crawled URL; feeds the "Priority issues" summary as a fallback when GSC isn't available
- crawl persistence + diffing: every audit auto-saves to `~/.config/seo-audit/crawls/<host>/<timestamp>.json`; `seo-audit diff <url>` auto-picks the two most recent crawls, and `--fail-on <severity>` gates CI/cron against regressions vs. the previous persisted crawl
- near-duplicate content detection (MinHash, Jaccard ≥ 0.85 over 5-word shingles): clusters of pages with substantially similar body text get a "Content duplicates" summary section + medium-severity `CONTENT_NEAR_DUPLICATE` per-page issue. Skip with `--no-content-dedup`.
- internal link equity (PageRank, damping 0.85, 20 iterations): per-page `pageRank` score plus an "underlinked important pages" highlight for high-content pages with below-median rank — the "your money page gets 1 internal link" insight. Skip with `--no-link-graph`. (A minimal sketch of the iteration follows this list.)
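To make the link-equity check concrete, here is a minimal sketch of the PageRank iteration described above (damping 0.85, 20 iterations). It illustrates the standard algorithm under simplified assumptions, not the tool's actual implementation:

```ts
// Illustrative PageRank over an internal-link adjacency map.
// damping = 0.85 and 20 iterations match the parameters documented above.
function pageRank(
  links: Map<string, string[]>,
  damping = 0.85,
  iterations = 20,
): Map<string, number> {
  const pages = [...links.keys()];
  const n = pages.length;
  let rank = new Map(pages.map((p) => [p, 1 / n] as [string, number]));

  for (let i = 0; i < iterations; i++) {
    const next = new Map(pages.map((p) => [p, (1 - damping) / n] as [string, number]));
    for (const [page, outLinks] of links) {
      // Each page splits its current rank across its outgoing internal links.
      const share = (rank.get(page) ?? 0) / Math.max(outLinks.length, 1);
      for (const target of outLinks) {
        if (next.has(target)) {
          next.set(target, (next.get(target) ?? 0) + damping * share);
        }
      }
    }
    rank = next; // dangling pages simply leak rank in this simplified sketch
  }
  return rank;
}

// A "money page" with a single inbound link ends up far below the median rank.
console.log(
  pageRank(
    new Map([
      ["/", ["/blog", "/pricing"]],
      ["/blog", ["/", "/pricing"]],
      ["/pricing", ["/"]],
    ]),
  ),
);
```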
Install
Run instantly with npx (no install needed):
```bash
npx @davo20019/seo-audit https://example.com
```

Install globally:

```bash
npm install -g @davo20019/seo-audit
seo-audit https://example.com --max-pages 25
```

Add to a project (use it as a library too):

```bash
npm install @davo20019/seo-audit
```

```ts
import { analyzeSite } from "@davo20019/seo-audit";

const report = await analyzeSite("https://example.com", {
  maxPages: 50,
  onProgress: (event) => {
    // Structured progress events for agents, jobs, and custom UIs.
    if (event.phase === "page-complete") {
      console.error(`crawled ${event.crawledPages}/${event.maxPages ?? "all"}`);
    }
  },
});
```

Quick Start (from source)
```bash
git clone https://github.com/davo20019/seo-analysis.git
cd seo-analysis
npm install
npm run dev -- https://example.com
```

Build the CLI:

```bash
npm run build
npm run start -- https://example.com --max-pages 20
```

Write JSON output to a file:

```bash
npm run dev -- https://example.com --json --output report.json
```

Interactive terminal runs show a single updating crawl progress line on stderr.
It is hidden automatically in CI/non-TTY runs and can be disabled with
--no-progress or SEO_AUDIT_NO_PROGRESS=1, so JSON/stdout output remains
machine-readable for agents and scripts.
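In practice that means scripted runs can pipe the report straight into other tools; a small sketch, assuming `jq` is installed (the top-level `.summary` path is an assumption mirroring fields like `summary.priorityIssues` documented below):

```bash
# Progress is already suppressed in non-TTY runs; the env var makes it explicit
SEO_AUDIT_NO_PROGRESS=1 seo-audit https://example.com --json | jq '.summary'
```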
Useful Options
| Flag | Description |
|---|---|
| --output <path> | Write JSON report to a file (default: stdout). |
| --extract <json> | Inline JSON of extraction rules. Mutually exclusive with --extract-file. |
| --extract-file <path> | JSON file of extraction rules. |
| --no-progress | Disable the interactive stderr crawl progress line. |
| --no-persist | Skip persisting the crawl to ~/.config/seo-audit/crawls/. Default: every successful audit is persisted. Set SEO_AUDIT_NO_PERSIST=1 to default the same. |
| --fail-on <severity> | Exit non-zero if issues at <severity> increased. Fresh-audit mode: compares to the previous persisted crawl. Diff mode: compares the two passed report files. One of: high, medium, low. |
```bash
# Faster crawl with retries and sitemap seeding
npm run dev -- https://example.com --max-pages 50 --concurrency 6 --retries 2

# Crawl every URL surfaced by discovered sitemap files
npm run dev -- https://example.com --full-sitemap --concurrency 12

# Sample representative sitemap URLs up to --max-pages
npm run dev -- https://example.com --sample-sitemap --max-pages 25

# Limit the crawl to a site section
npm run dev -- https://example.com --include-path '^/blog'

# Skip utility or archive paths
npm run dev -- https://example.com --exclude-path '/tag/' --exclude-path '/page/[0-9]+'

# Add Lighthouse for a few representative pages
npm run dev -- https://example.com --lighthouse --lighthouse-pages 3

# Override the User-Agent string sent by the crawler
npm run dev -- https://example.com --user-agent "Mozilla/5.0 (compatible; MyCrawler/1.0)"

# Render pages with headless Chromium (Playwright) — needed for SPAs,
# JS-challenge sites (Cloudflare turnstile), and pages whose final DOM
# depends on JS. Slower and heavier than the default static fetch.
npm run dev -- https://example.com --render --max-pages 5
```

Notes on `--render`:
- First `npm install` auto-downloads Chromium (~300MB). To skip (e.g. CI), set `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1` and run `npx playwright install chromium` later (see the sketch after this list).
- Rendering is slower than raw fetch (typical: 2–10s per page). Use `--max-pages` to scope.
- Render uses `--retries` (default 3) with exponential backoff on transient failures (navigation timeouts, ad-script hangs).
- Known limitation: `redirectChain` is not captured for rendered pages in v1. The `finalUrl` is still accurate.
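For CI images that only sometimes render, the documented env var and install command combine like this (a sketch):

```bash
# Skip the ~300MB Chromium download at dependency-install time...
PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 npm install

# ...and fetch the browser only in jobs that actually pass --render
npx playwright install chromium
```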
Custom extractions
Define CSS-selector-based field extractions and pull them off every crawled page — useful for content audits, schema/data validation at scale, migration QA, and competitive teardowns.
```bash
seo-audit https://example.com \
  --extract '{"h1":"h1","price":"[itemprop=price]@content"}' \
  --json --output report.json
```

Or with a config file (recommended for repeatable audits):

```bash
echo '{"h1":"h1","author":"meta[name=author]@content"}' > extractions.json
seo-audit https://example.com --extract-file extractions.json --json
```

Selector grammar (suffix optional):
| Suffix | Returns |
|-----------|-----------------------------------------------|
| (none) | Trimmed text content |
| @attr | Attribute value (e.g. meta[name=author]@content) |
| #html | Inner HTML |
Object form unlocks `all` (multi-match → array) and `required` (emits a
low-severity `EXTRACTION_MISSING_REQUIRED` issue if no match — integrates
with `--fail-on low`):
```json
{
  "h1": "h1",
  "price": "[itemprop=price]@content",
  "intro": "article > p:first-of-type#html",
  "faqQuestions": { "selector": ".faq h3", "all": true },
  "title": { "selector": "title", "required": true }
}
```

Per-page values appear under `pages[].extracted` in the JSON report.
A summary appears at `extractionSummary`. The HTML/text/PDF reports show
match coverage and missing-required counts only — full per-page detail
stays in JSON for downstream tools (jq, spreadsheets, CI gates).
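Downstream, that detail can be sliced with jq; a small sketch (only `pages[].extracted` and `extractionSummary` are documented paths; the `url` field name is an assumption about the page-record shape):

```bash
# Coverage summary
jq '.extractionSummary' report.json

# Per-page extracted fields (the `url` key is assumed)
jq '.pages[] | {url, extracted}' report.json
```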
Log-file analysis (seo-audit logs)
Parse a web-server / CDN access log, verify bot identities via reverse-DNS, and join the results against the most recent persisted crawl. No server access required — point the tool at a log file or pipe one in.
```bash
# Local file
seo-audit logs ./access.log --site https://example.com

# Stdin
zcat cloudflare-logs-*.gz | seo-audit logs - --site https://example.com --format cloudflare

# JSON output for downstream agents / scripts
seo-audit logs ./access.log --site https://example.com --json --output logs.json
```

Findings (when a persisted crawl exists for the host):
- Orphan pages — bot-visited URLs not in your internal link graph.
- Stale priorities — top-PageRank URLs that bots haven't crawled in 30+ days.
- Status mismatches — pages your audit recorded as 200 that bots saw return 4xx/5xx.
Supported formats: Apache/Nginx Combined Log Format, generic JSON
(one object per line), Cloudflare Logpush JSON, Fastly Real-Time JSON.
Auto-detected; override with --format.
Bot verification is on by default and uses reverse-DNS → forward-DNS
suffix matching with an in-memory cache. Disable with --no-verify-bots.
In sandboxed/air-gapped environments without DNS, the tool emits a
LOG_DNS_UNAVAILABLE issue and continues with all hits flagged as
unverified instead of stalling.
Privacy: no raw IPs are written to any output; the DNS cache is in-memory only; no telemetry.
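The same pipeline is exposed to library consumers through `analyzeLogs(input, options)`, documented to accept a path or `Readable` stream; a minimal sketch, where the `site` option name is an assumption mirroring the CLI's `--site` flag:

```ts
import { analyzeLogs } from "@davo20019/seo-audit";

// Accepts a file path or a Readable stream (documented signature).
// Option names beyond the documented signature are assumptions.
const logReport = await analyzeLogs("./access.log", { site: "https://example.com" });

// LogAnalysisReport's exact shape isn't reproduced here; print it to explore.
console.log(JSON.stringify(logReport, null, 2));
```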
```bash
# Query Google's CrUX API for real-user Core Web Vitals (requires CRUX_API_KEY env var)
CRUX_API_KEY=your-google-api-key npm run dev -- https://example.com --crux --max-pages 5

# Score the site for AI-agent readiness (llms.txt depth, AI-bot policy, well-known endpoints)
npm run dev -- https://example.com --agent-readiness --max-pages 25

# Enrich with Google Search Console traffic data (priority-rank issues by impressions)
GOOGLE_APPLICATION_CREDENTIALS=./gsc-service-account.json \
  npm run dev -- https://example.com --gsc --max-pages 100
```

The `--gsc` flag pulls clicks, impressions, CTR, and average position from Search Console for every crawled URL and adds a Priority issues section to the summary — high/medium-severity issues sorted by impressions, so the audit answers "which problem affects pages that actually get traffic?" rather than just "what problems exist?".
Auth uses a Google Cloud service account (no OAuth browser flow, no token caching). The fastest way to set it up is the bundled wizard:
```bash
seo-audit --gsc-setup https://your-site.com
```

The wizard:
- Detects `gcloud` and (if present) creates the service account, downloads the key to `~/.config/seo-audit/gsc-key.json`, and chmods it `600`.
- Falls back to printed step-by-step instructions if `gcloud` isn't available.
- Prints the service-account email to grant in Search Console (Settings → Users and permissions → Add user → Restricted).
- Verifies the credentials by exchanging them for a real access token before declaring success.
After setup, every subsequent run is silent:
```bash
export GOOGLE_APPLICATION_CREDENTIALS=~/.config/seo-audit/gsc-key.json
seo-audit https://your-site.com --gsc
```

If you prefer manual setup or are running in CI, the auth layer also accepts:
- `GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json` (file path)
- `GOOGLE_APPLICATION_CREDENTIALS_JSON='{...inline json...}'` (CI-friendly, one secret)
- `--gsc-service-account-key-file /path/to/key.json` (CLI override)
The same credentials work for the other Google integrations (e.g. `--ga4`).
Optional flags: `--gsc-property` to override property auto-detection (URL-prefix or `sc-domain:example.com`), `--gsc-days` to change the lookback window (default 90).
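In CI, the inline-JSON form keeps the whole credential in one secret; a sketch, where `SERVICE_ACCOUNT_KEY_JSON` is a hypothetical secret name (the env var itself is the documented one):

```bash
# Hypothetical secret name; GOOGLE_APPLICATION_CREDENTIALS_JSON is documented
export GOOGLE_APPLICATION_CREDENTIALS_JSON="$SERVICE_ACCOUNT_KEY_JSON"
seo-audit https://your-site.com --gsc --gsc-days 28 --fail-on high
```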
Google Analytics 4 enrichment (--ga4)
Merges per-page sessions, pageviews, users, and engagement rate from the GA4 Data API into the crawl report. Useful when you want to know which pages get real traffic — not just search impressions — and prioritize technical-issue remediation accordingly.
Setup (one-time, ~30 seconds if `--gsc-setup` has already run):

1. If you haven't already run `seo-audit --gsc-setup`, run it now. It creates a service account and downloads its JSON key. The same SA is reused for GA4 — no second key needed.
2. Open your GA4 property → Admin → Property Access Management.
3. Click + → Add users. Paste the service-account email (visible in `~/.config/seo-audit/gsc-key.json` under `client_email`).
4. Set Direct roles to Viewer. Save.
5. Run:

```bash
GOOGLE_APPLICATION_CREDENTIALS=~/.config/seo-audit/gsc-key.json \
  seo-audit https://your-site.com --gsc --ga4
```
The CLI auto-detects the GA4 property by matching the crawl origin against
each accessible property's web data stream defaultUri. If multiple
properties match (e.g., separate prod/staging streams for the same domain),
the audit fails with the candidate list — pass --ga4-property properties/N
to disambiguate.
Priority issues are ranked by GSC impressions when --gsc is enabled,
with GA4 sessions as a fallback when GSC is unavailable. The HTML report
shows a "via GSC" / "via GA4" badge per row.
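Putting the documented fields together, a `summary.priorityIssues[]` entry might look roughly like this; the values and the `code`/`severity` field names are illustrative assumptions, while `rankedBy`, `rankValue`, and `metrics` are the documented fields:

```json
{
  "code": "TITLE_MISSING",
  "severity": "high",
  "rankedBy": "gsc",
  "rankValue": 12840,
  "metrics": { "impressions": 12840, "clicks": 310, "position": 8.2 }
}
```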
The `--agent-readiness` flag adds an opinionated rubric (inspired by Cloudflare's agent-readiness framework) covering four buckets:
- Discoverability — robots.txt, sitemap.xml, `Link:` HTTP headers (RFC 8288).
- Content accessibility — llms.txt presence and content depth (H1, sections, links, size), `llms-full.txt`, and markdown content negotiation (`Accept: text/markdown`).
- Bot access control — explicit rules for known AI user agents (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider, Applebot-Extended, …), `Content-Signal` directives (`search`, `ai-train`, `ai-input`), and Web Bot Auth (`/.well-known/http-message-signatures-directory`).
- Capabilities & protocols — well-known endpoints (`/.well-known/agent-skills/index.json`, `/.well-known/api-catalog`, `/.well-known/mcp/server-card.json`, OAuth discovery) plus structured-data coverage from an agent perspective (Organization/WebSite on the homepage, Article schema on article-like paths).
Each bucket is scored 0–100 and averaged into a single score (also 0–100). The full breakdown — including which probes succeeded — lands in the JSON, text, and HTML reports as a separate Agent Readiness section.
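To spot-check the capabilities bucket by hand before a full run, you can probe the listed well-known endpoints directly (a sketch; `example.com` is a placeholder):

```bash
# Probe the well-known endpoints the rubric checks; print HTTP status per path
for path in /.well-known/agent-skills/index.json \
            /.well-known/api-catalog \
            /.well-known/mcp/server-card.json; do
  curl -s -o /dev/null -w "%{http_code}  $path\n" "https://example.com$path"
done
```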
```bash
# Generate a polished HTML report (open in any browser, email to a client)
npm run dev -- https://example.com --max-pages 25 --html-report report.html

# Generate a PDF report (uses Playwright/Chromium under the hood)
npm run dev -- https://example.com --max-pages 25 --pdf-report report.pdf

# Generate both at once
npm run dev -- https://example.com --max-pages 25 --html-report report.html --pdf-report report.pdf
```

Diff Reports
Compare two crawl reports to see what changed between runs:
```bash
# Compare last week's crawl to this week's
npm run dev -- diff old-report.json new-report.json

# Output as JSON for piping into another tool
npm run dev -- diff old.json new.json --json --output diff.json

# Fail (non-zero exit) if any high-severity issues were added — useful in CI
npm run dev -- diff old.json new.json --fail-on high
```

The diff highlights:
- New issue codes (didn't appear in old)
- Resolved issue codes (appeared in old, not in new)
- Counts that increased or decreased
- HTTP status changes on common pages
- Pages added or removed from the crawl
GitHub Action
Run SEO audits in CI without writing any glue code:
```yaml
# .github/workflows/seo.yml
name: SEO Audit
on: [pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: davo20019/seo-analysis@v1
        with:
          url: https://staging.mysite.com
          max-pages: 50
          fail-on: high # block PRs that introduce high-severity issues
```

Inputs
| Input | Default | Description |
|---|---|---|
| url | (required) | URL to crawl |
| max-pages | 50 | Maximum pages to crawl |
| concurrency | (CLI default) | Pages fetched in parallel |
| render | false | Use Playwright for JS rendering |
| fail-on | (none) | Fail the workflow if issues at this severity exist (high, medium, low) |
| crux | false | Query Google CrUX for real-user Core Web Vitals |
| crux-api-key | (none) | API key for CrUX (use a repo secret) |
| agent-readiness | false | Score AI-agent readiness (llms.txt depth, AI-bot rules, well-known endpoints) |
| gsc | false | Enrich crawled pages with Google Search Console clicks/impressions/CTR/position |
| gsc-property | (auto-detect) | Override GSC property (URL-prefix or sc-domain:example.com) |
| gsc-days | 90 | Days of GSC history to query |
| gsc-service-account-key | (none) | Service-account JSON for GSC (use a repo secret); the SA email needs Restricted access on the property |
| ga4 | false | Enrich crawled pages with Google Analytics 4 metrics (sessions, pageviews, users, engagement rate) |
| ga4-property | (auto-detect) | Override property auto-detection (e.g. properties/123456789) |
| ga4-days | 90 | Days of GA4 data to query |
| ga4-service-account-key | (none) | Service-account JSON for GA4 (use a repo secret); the SA email needs Viewer role on the property |
| user-agent | (default) | Override the crawler's User-Agent |
| include-paths | (none) | Comma-separated regex; only crawl matching URLs |
| exclude-paths | (none) | Comma-separated regex; skip matching URLs |
| output-json | seo-report.json | Where to write the JSON report |
| output-html | (none) | Optional path for the HTML report |
Outputs
| Output | Description |
|---|---|
| high-issues | Count of high-severity issues |
| medium-issues | Count of medium-severity issues |
| low-issues | Count of low-severity issues |
| pages-crawled | Number of pages successfully crawled |
| report-path | Path to the JSON report (use with actions/upload-artifact) |
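To consume the outputs, give the audit step an `id` and reference it from later steps; a sketch (the `id: audit` name is arbitrary, and `actions/upload-artifact` is the pairing the table above suggests):

```yaml
# Continuing the steps: list from the workflow above
      - uses: davo20019/seo-analysis@v1
        id: audit
        with:
          url: https://staging.mysite.com
          max-pages: 50
      - run: echo "high-severity issues: ${{ steps.audit.outputs.high-issues }}"
      - uses: actions/upload-artifact@v4
        with:
          name: seo-report
          path: ${{ steps.audit.outputs.report-path }}
```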
Keyword Search
Search for specific keywords across a site:
```bash
# Search for keywords
npm run dev -- https://example.com --keyword "seo audit" --keyword "site speed"

# Load keywords from a file (one per line, # comments supported)
npm run dev -- https://example.com --keyword-file keywords.txt

# Combine keyword search with term extraction
npm run dev -- https://example.com --keyword-file keywords.txt --extract-terms --top-terms 30
```

Offline Search
Search pre-downloaded HTML files for faster repeated searches:
```bash
# Download a site with httrack
httrack "https://example.com" -O "./site_backup"

# Search the local copy (no network requests)
npm run dev -- --from-directory ./site_backup --keyword-file keywords.txt
npm run dev -- --from-directory ./site_backup --extract-terms
```

Crawl Behavior
- crawls up to `--max-pages` pages, or the full sitemap with `--full-sitemap`
- fetches pages in parallel with `--concurrency` (run `--help` for the default)
- retries slow or retryable requests with exponential backoff (also applies to `--render` on transient navigation failures)
- deduplicates redirected pages by final URL
- seeds the crawl queue from `sitemap.xml` automatically
- walks one level of nested sitemap-index files (capped at 50 children)
- supports `--include-path` and `--exclude-path` regex filters
- samples representative sitemap URLs with `--sample-sitemap`
- captures response headers per page for `X-Robots-Tag`, HSTS, Cache-Control, and Content-Type checks
- optionally renders pages with headless Chromium via `--render` (Playwright) for SPAs and JS-challenge sites
Notes And Limits
- `--sample-sitemap` is useful when you want representative sitemap coverage quickly, but link-graph findings are less complete because the crawl is intentionally sampled.
- Sitemap reconciliation relies on the set of sitemap URLs the CLI was able to collect. The report marks sitemap coverage as partial when that set was truncated.
- JSON-LD validation covers Google's rich-result requirements for Product, Article (BlogPosting/NewsArticle), FAQPage, BreadcrumbList, Organization, and LocalBusiness. Other schema types are still presence-only.
- `--render` (Playwright/Chromium) handles SPAs and JS-challenge sites (Cloudflare turnstile, JS-rendered DOM), but `redirectChain` is not captured for rendered pages.
Privacy
This tool does not collect telemetry. No analytics, no phone-home, no install tracking. Crawl reports stay on your machine. The only outbound network traffic is:
- HTTP fetches to the URLs you ask the tool to crawl
- Optional: Google's CrUX API when you pass `--crux` (sends an origin string + your API key)
- Optional: Chromium downloads from Microsoft's Playwright CDN on first install
If a future version ever adds opt-in telemetry, it will be exactly that — opt-in, with explicit disclosure.
Persistence
Every successful audit is auto-saved to `~/.config/seo-audit/crawls/<host>/<timestamp>.json`
(mode `0700`, alongside the existing `gsc-key.json`). This enables:
- `seo-audit diff <url>` — compare the two most recent crawls of a host without having to remember file paths.
- `seo-audit <url> --fail-on <severity>` — gate cron / CI runs against regressions vs. the previous persisted crawl.
The persisted JSON contains everything the audit captured, including page
metadata, GSC/GA4 traffic data when enriched (--gsc / --ga4), and the full
body text per page. Treat the directory as you would any other client-data
artifact.
Opt out with --no-persist per run, or SEO_AUDIT_NO_PERSIST=1 (or
SEO_AUDIT_NO_PERSIST=true) in your environment.
Override the location with SEO_AUDIT_CRAWLS_DIR=<path> — useful when the
default $HOME/.config/ doesn't fit (sandboxed CI runners, separate volume, XDG
preferences).
Retention: none. Crawls accumulate forever. At ~5 MB per crawl, a weekly
cadence works out to roughly 260 crawls over 5 years, around 1.3 GB per host
long-term; manageable. Run `rm -rf ~/.config/seo-audit/crawls/<host>/` if you
ever want to reset.
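If you'd rather cap retention yourself, a cron-friendly sketch (the 180-day cutoff is arbitrary; the directory layout is the documented one):

```bash
# Delete persisted crawl JSON older than ~6 months
find ~/.config/seo-audit/crawls -name '*.json' -mtime +180 -delete
```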
CHANGELOG
v0.8.0 — 2026-04-27
- Added: `seo-audit logs <path>` subcommand. Parses Apache/Nginx Combined Log Format, generic JSON, Cloudflare Logpush, and Fastly Real-Time logs. Verifies Googlebot/Bingbot/Applebot/DuckDuckBot/AI bots via reverse-DNS with in-memory caching. Auto-falls-back to unverified mode when DNS is unavailable.
- Added: Joined-to-crawl findings — orphan pages (`LOG_ORPHAN_PAGE`), stale priorities (`LOG_STALE_PRIORITY_PAGE`), status mismatches (`LOG_STATUS_MISMATCH`). All low severity, integrate with `--fail-on`.
- Added: `analyzeLogs(input, options)` library API accepting a path or `Readable` stream. `LogAnalysisReport` shape stable from v0.8.
- Added: Structured `LogAnalysisProgressEvent` callback for long-running log analyses.
- Privacy: no raw IPs in any output, DNS cache in-memory only, no telemetry.
v0.7.0 — 2026-04-26
- Added: custom extraction rules. Define CSS-selector-based field extractions with `--extract '<json>'` or `--extract-file <path>`. Selector grammar: `selector` (text), `selector@attr` (attribute), `selector#html` (inner HTML). Object form supports `all: true` (multi-match) and `required: true` (low-severity `EXTRACTION_MISSING_REQUIRED` issue, integrates with `--fail-on`).
- Added: `AnalyzeOptions.extract` for library consumers, plus `PageReport.extracted` and `SiteReport.extractionSummary` in the JSON report.
- Added: "Custom extractions" summary section in HTML/text/PDF reports. Full per-page detail remains in JSON.
v0.6.0 — 2026-04-26
- Added: interactive crawl progress for CLI runs. The indicator writes to stderr only, is enabled only for TTY sessions, and is disabled in CI/non-TTY contexts. Use `--no-progress` or `SEO_AUDIT_NO_PROGRESS=1` to opt out.
- Added: `AnalyzeOptions.onProgress`, a structured progress callback for library, job-runner, and agent integrations that need status updates without parsing terminal output.
v0.5.0 — 2026-04-26
- Added: near-duplicate content detection. Audits now include a `report.contentDedup` summary with clusters of pages whose body text exceeds 85% Jaccard similarity. Each cluster member gets a new medium-severity `CONTENT_NEAR_DUPLICATE` issue. Skip via `--no-content-dedup`.
- Added: internal link-equity (PageRank) computation. Audits now include a `report.linkGraph` summary with the top 10 pages by PageRank and an "underlinked important pages" list (high content, low rank — the "your money page gets 1 internal link" finding). Each `PageReport` gains a `linkGraph.pageRank` field. Skip via `--no-link-graph`.
- Changed: the HTML report Pages table grows a conditional `PageRank` column when any page has link-graph metrics, sortable alongside the existing Title / Status / Issues columns.
- Note for `--fail-on` cron users: the new `CONTENT_NEAR_DUPLICATE` issue is medium severity and will surface on most e-commerce / programmatic-SEO sites. If you have an existing persisted baseline from v0.4 (or earlier), the first v0.5 run may exit non-zero because the new check wasn't running when the baseline was recorded — the new findings look like a regression to the diff comparison. Recovery: run the audit once with `--no-content-dedup` to refresh a clean baseline, then re-enable it; or raise `--fail-on high`; or pass `--no-content-dedup` permanently to opt out. Users on a fresh install (no prior crawl) are unaffected — the first run skips the regression check entirely with a "No prior crawl found" warning and exits 0.
v0.4.0 — 2026-04-26
- Added: every successful audit is now persisted to `~/.config/seo-audit/crawls/<host>/<timestamp>.json` (mode `0700`). Opt out via `--no-persist` or `SEO_AUDIT_NO_PERSIST=1`.
- Added: `seo-audit diff <url>` auto-picks the two most recent persisted crawls of the host. The existing `seo-audit diff <old.json> <new.json>` form still works for explicit comparisons.
- Added: `SEO_AUDIT_CRAWLS_DIR` env var overrides the default crawls directory location.
- Added: `evaluateFailOn` helper exported from `dist/diff.js` for programmatic consumers.
- Changed: `--fail-on <severity>` now applies in fresh-audit mode too. It compares the new audit against the previous persisted crawl and exits non-zero if issues at the named severity increased. On the first run for a host, prints a "skipping regression check" warning and exits `0`. (Diff-mode behavior is unchanged.)
v0.3.0 — 2026-04-25
- Added: `--ga4` flag and supporting `--ga4-property`, `--ga4-days`, `--ga4-service-account-key-file` options. Enriches each crawled page with GA4 sessions, pageviews, users, and engagement rate; reuses the service-account workflow set up by `--gsc-setup`.
- Added: `ga4`, `ga4-property`, `ga4-days`, and `ga4-service-account-key` inputs on the GitHub Action. The `ga4-service-account-key` input falls back to `gsc-service-account-key` when blank, since the same SA can read both.
- Added: `report.ga4` enrichment-result summary alongside `report.gsc`.
- Added: `page.metrics.ga4` per-page GA4 metrics alongside `page.metrics.gsc`.
- Changed: `summary.priorityIssues[]` entries now carry `rankedBy`, `rankValue`, and `metrics` fields; the GSC-specific `impressions`, `clicks`, and `position` fields are kept populated for one minor cycle (deprecated).
- Changed: `summary.priorityIssues` is now built whenever GSC or GA4 enrichment succeeds (previously only when GSC succeeded).
- Changed: HTML report now renders `<h2>Analytics</h2>` (when GA4 is enabled) and `<h2>Priority issues</h2>` as top-level sections; the priority list previously rendered nested under `<h2>Search Console</h2>`, which hid it on GA4-only audits.
