@geosuite/sitemap-builder

v0.3.2

Crawl a website and emit a valid sitemap.xml — for sites that ship without one or whose CMS pipeline forgot to.

sitemap-builder

A small, zero-runtime-dependency Node CLI that crawls a website and emits a valid sitemap.xml. For sites that ship without one or whose CMS pipeline forgot to.

Created and invented by Matteo Perino (LinkedIn). Maintained by GeoSuite (Matteo Perino).


What this is, and why it exists

A sitemap.xml is the foundational discovery surface for both classical search and the new generation of LLM-mediated search (ChatGPT Search, Perplexity, Gemini, Le Chat, DuckAssist). Without one, crawlers fall back to following links from the homepage, which is unreliable for big sites and silently misses any page that is not reachable through internal links.

Most CMS templates ship with a sitemap. Most custom sites don't. This tool exists for the second case: point it at a URL, get back a sitemap that is ready to publish at <your-domain>/sitemap.xml.

We deliberately did not build a "next-gen", LLM-powered, schema-aware crawler. It crawls. It writes XML. The whole tool is ~250 lines of vanilla Node with no third-party runtime dependencies.


Install

npm install -g @geosuite/sitemap-builder
# or run without installing:
npx @geosuite/sitemap-builder https://example.com

Requires Node 20+.


Use it

# print sitemap to stdout
geosuite-sitemap-builder https://example.com

# write to a file
geosuite-sitemap-builder https://example.com --output sitemap.xml

# bound the crawl
geosuite-sitemap-builder https://example.com \
  --max-pages 300 \
  --max-depth 3 \
  --concurrency 8 \
  --budget-s 90

# dump the page list as JSON instead of XML (handy for piping)
geosuite-sitemap-builder https://example.com --json
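
The --json output can be post-processed by any script. A minimal sketch, assuming the JSON has the shape shown in the AI mode section (pages, hitCap, hitDeadline) and that each page carries url and depth fields; groupUrlsByDepth is a hypothetical helper, not part of the package:

```javascript
// Hypothetical helper: groups page URLs from the --json output by crawl depth.
// Assumes each entry in `pages` has `url` and `depth` fields.
function groupUrlsByDepth(jsonText) {
  const { pages = [] } = JSON.parse(jsonText);
  const byDepth = new Map();
  for (const page of pages) {
    const bucket = byDepth.get(page.depth) ?? [];
    bucket.push(page.url);
    byDepth.set(page.depth, bucket);
  }
  return byDepth;
}

// Inline sample standing in for the CLI output.
const sample = JSON.stringify({
  pages: [
    { url: 'https://example.com/', depth: 0 },
    { url: 'https://example.com/pricing', depth: 1 },
    { url: 'https://example.com/blog', depth: 1 },
  ],
  hitCap: false,
  hitDeadline: false,
});

for (const [depth, urls] of groupUrlsByDepth(sample)) {
  console.log(`depth ${depth}: ${urls.length} page(s)`);
}
```

In real use, the text would come from the CLI's stdout rather than an inline sample.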

Options

| Flag | Default | Notes |
|---|---|---|
| --max-pages | 200 | Hard cap 1000. Crawler stops once reached. |
| --max-depth | 3 | Hard cap 6. BFS depth from the start URL. |
| --concurrency | 6 | Parallel HTTP fetches. Hard cap 16. Respect the host. |
| --timeout-ms | 8000 | Per-page request timeout. |
| --budget-s | 60 | Wall-clock cap. Crawl stops when reached and reports hitDeadline. |
| --output PATH | — | Write XML to a file. Without this, XML goes to stdout. |
| --json | off | Print the page list as JSON instead of XML. |
| --user-agent | geosuite-sitemap-builder/0.1.0 | Override the UA header. |
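
The table does not say what happens when a value exceeds its hard cap. The sketch below assumes values are clamped rather than rejected; LIMITS and clampOption are hypothetical names used for illustration only, with the numbers taken from the table:

```javascript
// Defaults and hard caps copied from the options table.
const LIMITS = {
  maxPages: { fallback: 200, cap: 1000 },
  maxDepth: { fallback: 3, cap: 6 },
  concurrency: { fallback: 6, cap: 16 },
};

// Hypothetical helper: falls back to the default when the value is missing or
// not a positive integer, otherwise limits it to the hard cap.
// Clamping (instead of raising an error) is an assumption of this sketch.
function clampOption(name, value) {
  const { fallback, cap } = LIMITS[name];
  const n = Number.parseInt(value, 10);
  if (!Number.isFinite(n) || n < 1) return fallback;
  return Math.min(n, cap);
}

console.log(clampOption('maxPages', '5000'));    // 1000
console.log(clampOption('maxDepth', undefined)); // 3
console.log(clampOption('concurrency', '8'));    // 8
```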


What it actually does

  1. Starts at the URL you pass.
  2. BFS-crawls same-origin <a href> links only (never wanders off the host).
  3. Drops obvious non-HTML extensions (.png, .css, .pdf, …) so the sitemap doesn't get polluted with assets.
  4. Skips fragment-only links (#section), mailto:, tel:, and javascript:.
  5. Stops at any of three caps (whichever fires first):
    • page count (--max-pages)
    • BFS depth (--max-depth)
    • wall-clock budget (--budget-s)
  6. Renders the discovered URLs as a sitemaps.org-compliant <urlset>.
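
The steps above can be sketched roughly as follows. This is an illustration of the technique, not the tool's source: the extension list is an illustrative subset, fetching is serial (no concurrency), and fetchHtml is an injected, hypothetical function so the sketch runs without a network.

```javascript
// Extensions treated as non-HTML assets (illustrative subset, not the tool's list).
const ASSET_EXT = /\.(png|jpe?g|gif|svg|css|js|pdf|zip|ico|woff2?)$/i;

// Pulls same-origin page links out of raw HTML with a regex.
function extractLinks(html, pageUrl) {
  const origin = new URL(pageUrl).origin;
  const links = new Set();
  for (const [, href] of html.matchAll(/<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi)) {
    if (/^(#|mailto:|tel:|javascript:)/i.test(href)) continue; // step 4
    let url;
    try {
      url = new URL(href, pageUrl);
    } catch {
      continue; // unparseable href
    }
    if (url.origin !== origin) continue;        // step 2: same origin only
    if (ASSET_EXT.test(url.pathname)) continue; // step 3: drop assets
    url.hash = '';
    links.add(url.href);
  }
  return [...links];
}

// Breadth-first crawl that stops at the first cap to fire (step 5).
async function crawl(startUrl, fetchHtml, { maxPages = 200, maxDepth = 3, deadlineMs = Infinity } = {}) {
  const start = new URL(startUrl).href;
  const seen = new Set([start]);
  const queue = [{ url: start, depth: 0 }];
  const pages = [];
  let hitCap = false;
  let hitDeadline = false;
  while (queue.length > 0) {
    if (pages.length >= maxPages) { hitCap = true; break; }
    if (Date.now() >= deadlineMs) { hitDeadline = true; break; }
    const { url, depth } = queue.shift();
    const html = await fetchHtml(url);
    pages.push({ url, depth });
    if (depth >= maxDepth) continue; // do not expand links past the depth cap
    for (const link of extractLinks(html, url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return { pages, hitCap, hitDeadline };
}

// In-memory stand-in for a website.
const site = {
  'https://example.com/': '<a href="/about">About</a> <a href="/logo.png">Logo</a> <a href="#top">Top</a>',
  'https://example.com/about': '<a href="/">Home</a> <a href="https://other.example/">Away</a> <a href="tel:+10000000">Call</a>',
};

crawl('https://example.com', async (url) => site[url] ?? '').then(({ pages }) => {
  console.log(pages.map((p) => p.url));
});
```

A regex is enough here because only href values are needed; it will miss links that are built at runtime by JavaScript.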

The output is intentionally minimal: <loc> plus an optional <lastmod>. We skip <changefreq> and <priority>: the sitemaps.org protocol makes both optional, and Google documents that it ignores them.
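
A minimal sketch of that output format, for illustration only. renderUrlset and escapeXml are hypothetical names and are not the package's renderSitemapXml; the sketch assumes entries of the form { url, lastmod? }:

```javascript
// Escapes the five characters that must be entity-encoded in XML text.
function escapeXml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;' };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Hypothetical renderer: one <url> per entry, <lastmod> only when provided.
function renderUrlset(entries) {
  const body = entries.map(({ url, lastmod }) => {
    const lastmodTag = lastmod ? `\n    <lastmod>${escapeXml(lastmod)}</lastmod>` : '';
    return `  <url>\n    <loc>${escapeXml(url)}</loc>${lastmodTag}\n  </url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...body,
    '</urlset>',
    '',
  ].join('\n');
}

console.log(renderUrlset([
  { url: 'https://example.com/' },
  { url: 'https://example.com/search?q=a&page=2', lastmod: '2026-01-15' },
]));
```

Escaping matters because query strings commonly contain &, which is not valid unescaped inside <loc>.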


What it doesn't do (yet)

  • JavaScript rendering. The crawler is HTTP + regex. Single-page apps whose links only appear after client-side hydration won't be discovered. Build-time pre-rendering or an SSR layer is the right fix.
  • Robots.txt awareness. The tool is meant to be run by site owners against their own domain, where honoring robots.txt could silently strip pages they want to publish. (An opt-in --respect-robots flag is on the roadmap.)
  • <lastmod> accuracy. Today we don't fill <lastmod> from Last-Modified response headers. Coming in 0.2.
  • LLM-powered grouping or summaries. The deterministic 0.1 ships without a network dependency on any model. An opt-in --ai mode is on the roadmap (provide OPENAI_API_KEY or ANTHROPIC_API_KEY to enable).
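
For the <lastmod> item, one possible approach is to convert the Last-Modified response header into the date format sitemaps accept. This is a sketch of the idea only, not the project's planned implementation; lastmodFromHeader is a hypothetical name:

```javascript
// Hypothetical helper: turns a Last-Modified header value into a W3C date
// (YYYY-MM-DD, in UTC), or returns undefined when the header is absent or
// cannot be parsed.
function lastmodFromHeader(headerValue) {
  if (!headerValue) return undefined;
  const parsed = new Date(headerValue);
  if (Number.isNaN(parsed.getTime())) return undefined;
  return parsed.toISOString().slice(0, 10);
}

console.log(lastmodFromHeader('Wed, 21 Oct 2015 07:28:00 GMT')); // 2015-10-21
console.log(lastmodFromHeader('not a date'));                    // undefined
```

Returning undefined for bad input lets the caller omit <lastmod> entirely instead of publishing a wrong date.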

Programmatic API

import { crawlSite, renderSitemapXml } from '@geosuite/sitemap-builder';

const { pages, hitCap, hitDeadline } = await crawlSite('https://example.com', {
  maxPages: 100,
  maxDepth: 2,
  concurrency: 6,
  perPageTimeoutMs: 8000,
  deadlineMs: Date.now() + 30_000,
});

const xml = renderSitemapXml(pages.map((p) => ({ url: p.url })));

Both functions are pure (modulo the obvious network I/O for crawlSite) and have no third-party runtime dependencies.
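
Callers will usually want to act on hitCap and hitDeadline, since either flag means the page list may be incomplete. A minimal sketch, assuming hitCap signals the page cap; buildSitemap is a hypothetical wrapper, and its api parameter stands in for the module's exports so the sketch can be exercised with a stub:

```javascript
// `api` stands in for the module's exports ({ crawlSite, renderSitemapXml });
// in real use, pass the imported functions instead of a stub.
async function buildSitemap(api, startUrl, budgetMs = 30_000) {
  const { pages, hitCap, hitDeadline } = await api.crawlSite(startUrl, {
    maxPages: 100,
    maxDepth: 2,
    deadlineMs: Date.now() + budgetMs,
  });
  const warnings = [];
  if (hitCap) warnings.push('page cap reached; sitemap may be incomplete');
  if (hitDeadline) warnings.push('time budget exhausted; sitemap may be incomplete');
  const xml = api.renderSitemapXml(pages.map((p) => ({ url: p.url })));
  return { xml, warnings };
}

// Stub standing in for the real module, so the sketch runs offline.
const stubApi = {
  crawlSite: async (url) => ({ pages: [{ url }], hitCap: true, hitDeadline: false }),
  renderSitemapXml: (entries) => entries.map((e) => e.url).join('\n'),
};

buildSitemap(stubApi, 'https://example.com/').then(({ xml, warnings }) => {
  console.log(xml);
  warnings.forEach((w) => console.warn(`warning: ${w}`));
});
```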


Test

npm test                  # node --test
npm run lint              # node --check on source files

Tests are pure-function: no network, no fixtures bigger than inline strings.


Contributing

See CONTRIBUTING.md. Issues and PRs welcome — please open an issue first for non-trivial changes so we can discuss scope.


AI mode (opt-in, 0.2+)

When combined with --json, the CLI can ask an LLM to group the discovered pages into open-vocabulary categories ("Blog", "Products", "Docs", whatever the site actually publishes — no closed taxonomy):

export OPENAI_API_KEY=sk-…       # or ANTHROPIC_API_KEY=sk-ant-…
geosuite-sitemap-builder https://example.com --json --ai

The output JSON is the same shape as without --ai, plus a categories field:

{
  "pages": [...],
  "hitCap": false,
  "hitDeadline": false,
  "categories": {
    "Marketing": ["https://example.com/", "https://example.com/pricing"],
    "Blog":      ["https://example.com/blog/post-one", ...],
    "Docs":      ["https://example.com/docs/intro", ...]
  }
}
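
The README does not say whether every page is guaranteed to land in a category, so a consumer may want to check. A minimal sketch, assuming each entry in pages has a url field; findUncategorized is a hypothetical helper:

```javascript
// Hypothetical helper: lists page URLs that appear in no category.
function findUncategorized(result) {
  const assigned = new Set(Object.values(result.categories ?? {}).flat());
  return result.pages.map((p) => p.url).filter((url) => !assigned.has(url));
}

// Inline sample in the shape shown above.
const sampleResult = {
  pages: [
    { url: 'https://example.com/' },
    { url: 'https://example.com/pricing' },
    { url: 'https://example.com/legal' },
  ],
  hitCap: false,
  hitDeadline: false,
  categories: {
    Marketing: ['https://example.com/', 'https://example.com/pricing'],
  },
};

console.log(findUncategorized(sampleResult)); // [ 'https://example.com/legal' ]
```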

We send only {url, title, depth} per page, never the body. A typical 200-page run stays well under a cent on small models (gpt-5-mini, claude-haiku-4-5). The --ai flag has no effect on the XML output: it is ignored unless --json is also passed.

Privacy: enabling --ai sends each page's URL, title, and depth to the corresponding provider's API. Don't turn it on against URLs you wouldn't paste into their UI.
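
The field restriction described in this section can be pictured as a simple projection applied before anything leaves the machine. This is an illustrative sketch, not the tool's code; toAiPayload and the html field on the sample page are hypothetical:

```javascript
// Hypothetical helper: keeps only the three fields the README says are sent.
function toAiPayload(pages) {
  return pages.map(({ url, title, depth }) => ({ url, title, depth }));
}

// Sample crawl record with an extra field that must not be forwarded.
const crawled = [
  { url: 'https://example.com/', title: 'Home', depth: 0, html: '<html>...</html>' },
];

console.log(JSON.stringify(toAiPayload(crawled)));
```

Building the payload from an explicit allow-list means any field added to the crawl record later is excluded by default.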


Related: GeoSuite open-source tools

sitemap-builder is part of a small family of zero-dependency CLIs we maintain to make Generative Engine Optimization (GEO) measurable from the terminal.

The same checks are also surfaced as a hosted product at trygeosuite.it for teams who want history, alerts, and CTAs wired into their content pipeline.


Creator

Created and invented by Matteo Perino (LinkedIn).

Ideated, designed and validated by Matteo Perino. Implementation written with AI assistance, maintained under GeoSuite.


License

MIT © 2026 Matteo Perino and GeoSuite