SiteDex

Site crawler CLI that generates sitemap.xml plus llms.txt/llms-full.txt for LLM-friendly site maps


SiteDex is a Node.js CLI that crawls a public website, gathers the internal pages that respond with HTTP 200, and produces machine-readable artifacts for LLM agents (llms.txt and llms-full.txt) together with a standards-compliant sitemap.xml. Under the hood it uses Crawlee for traversal and Cheerio/Turndown for content extraction, wrapped in a batteries-included CLI with helpful diagnostics and progress reporting.

Features

  • Accurate crawling – Only internal links are followed, duplicate URLs are skipped, and crawl depth is tracked to prevent infinite loops.
  • Robots aware – The CLI fetches and honors robots.txt, including user-agent specific rules and Crawl-delay, with clear messaging when the start URL is blocked.
  • Adaptive throttling – Request pacing automatically combines CLI delay settings with any Crawl-delay directive to stay friendly to target sites.
  • Structured outputs – Generates sitemap.xml, llms.txt, and a richer llms-full.txt (H3 sections with Markdown payloads) ready for LLM ingestion.
  • Markdown transformation – Navigation, scripts, and other boilerplate are stripped before the readable content is converted into Markdown (see the sketch after this list).
  • Optional page detection – Archive, legacy, and deprecated URLs are automatically routed to an Optional section so agents can de-prioritize them.
  • Detailed logging – Crawl progress, statistics, and actionable troubleshooting tips are surfaced directly in the CLI.
  • Comprehensive tests – Unit, integration, and contract suites validate URL utilities, content processing, generators, robots compliance, and file output.
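The Markdown transformation step can be pictured roughly as follows. This is a minimal TypeScript sketch assuming the Cheerio and Turndown libraries named in the overview; the selector list and the htmlToMarkdown helper are illustrative, not SiteDex's actual code.

import * as cheerio from 'cheerio';
import TurndownService from 'turndown';

// Illustrative sketch of the boilerplate-stripping step; selectors are assumptions.
export function htmlToMarkdown(html: string): string {
  const $ = cheerio.load(html);

  // Drop navigation, scripts, and other non-content elements before conversion.
  $('nav, header, footer, aside, script, style, noscript').remove();

  // Prefer the <main> region when present, otherwise fall back to <body>.
  const content = $('main').html() ?? $('body').html() ?? '';

  // Convert the remaining readable HTML into Markdown.
  return new TurndownService({ headingStyle: 'atx' }).turndown(content);
}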

Installation

npm install -g sitedex

Usage

Quick start

sitedex https://example.com

By default the crawl output is written to ./output and includes sitemap.xml plus llms.txt. Add --full if you want llms-full.txt, which embeds full Markdown for each page.

CLI options

| Flag | Description | Default |
| --- | --- | --- |
| --output <dir> | Destination directory for generated files | ./output |
| --delay <ms> | Minimum delay (ms) between requests | 1000 |
| --max-depth <n> | Maximum crawl depth based on link hierarchy (use 0 for unlimited). Depth 0 = start URL, depth 1 = pages linked from the start URL, etc. | 3 |
| --max-pages <n> | Maximum number of pages to visit (use 0 for unlimited) | unlimited |
| --timeout <ms> | Request timeout (ms) passed to Crawlee | 30000 |
| --user-agent <string> | Custom User-Agent string | sitedex/<version> (+https://github.com/chunkai1312/sitedex) |
| --no-robots | Ignore robots.txt rules (use only when you have explicit permission) | disabled |
| --full | Emit llms-full.txt with Markdown snapshots | disabled |
| --silent | Hide progress logs and run in quiet mode | disabled |

Example invocations

# Crawl at most 100 pages two levels deep and include full content
sitedex https://example.com --max-pages 100 --max-depth 2 --full

# Use unlimited mode to crawl entire site
sitedex https://example.com --max-depth 0 --max-pages 0

# Respect a slower schedule to reduce load on the target site
sitedex https://example.com --delay 5000

# Write files to a custom directory with a custom UA
sitedex https://example.com --output ~/Desktop/sitedex-out --user-agent "labs-bot/0.1"

Generated files

sitemap.xml

Created via src/generators/sitemap-generator.ts, this XML document complies with the Sitemap protocol 0.9. URLs are normalized (lower-cased host, fragments removed, query parameters sorted) and capped at 50,000 entries. The generator also warns when no URLs were eligible or when the raw crawl produced more than the allowed maximum.
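As an illustration of that normalization, here is a minimal TypeScript sketch built on the standard WHATWG URL API; the normalizeUrl name and exact rules follow the description above rather than the actual generator source.

// Illustrative only; not the implementation in src/generators/sitemap-generator.ts.
export function normalizeUrl(input: string): string {
  const url = new URL(input);

  url.hostname = url.hostname.toLowerCase(); // lower-case the host
  url.hash = '';                             // strip the fragment

  // Sort query parameters so equivalent URLs compare equal.
  const sorted = [...url.searchParams.entries()].sort(([a], [b]) => a.localeCompare(b));
  url.search = new URLSearchParams(sorted).toString();

  return url.toString();
}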

llms.txt

src/generators/llmstxt-generator.ts emits the compact, link-focused format defined by llmstxt.org. Pages are grouped by their first path segment (with an Optional section for archive/legacy content) and rendered as Markdown bullet lists with titles and descriptions:

# Example Site
> A short description of the target site

## Docs
- [Getting Started](https://example.com/docs/start): Quick intro
- [API Reference](https://example.com/docs/api): REST endpoints
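The grouping described above can be sketched roughly like this; the CrawledPage shape, OPTIONAL_HINTS list, and groupPages helper are assumptions for illustration, not the generator's actual types.

// Hypothetical page shape; the real generator's types may differ.
interface CrawledPage {
  url: string;
  title: string;
  description?: string;
}

// Hint substrings that route a URL to the Optional section (per the feature list above).
const OPTIONAL_HINTS = ['archive', 'legacy', 'deprecated'];

// Group pages by their first path segment, sending archive/legacy/deprecated
// URLs to an "Optional" bucket so agents can de-prioritize them.
export function groupPages(pages: CrawledPage[]): Map<string, CrawledPage[]> {
  const groups = new Map<string, CrawledPage[]>();
  for (const page of pages) {
    const path = new URL(page.url).pathname.toLowerCase();
    const firstSegment = path.split('/').filter(Boolean)[0] ?? 'Home';
    const section = OPTIONAL_HINTS.some((hint) => path.includes(hint)) ? 'Optional' : firstSegment;
    if (!groups.has(section)) groups.set(section, []);
    groups.get(section)!.push(page);
  }
  return groups;
}

Each group would then be written out as a "## Section" heading followed by "- [Title](url): description" bullets, as in the example above.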

llms-full.txt

When the --full flag is provided, src/generators/llmstxt-full-generator.ts writes a richer document. Each page becomes an H3 section with a bold URL, full Markdown converted from HTML, and separator lines (---). This is ideal for downstream LLM agents that need context beyond simple descriptions.
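For illustration only, a single section in that style could be rendered like this; the renderFullSection helper and exact field layout are assumptions rather than the generator's real API.

// Illustrative: one page as an llms-full.txt section in the style described above.
export function renderFullSection(title: string, url: string, markdown: string): string {
  // H3 heading, bold URL, full Markdown body, then a --- separator line.
  return [`### ${title}`, '', `**${url}**`, '', markdown, '', '---', ''].join('\n');
}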

Robots & rate limiting

  • fetchRobotsTxt retrieves /robots.txt, parses it with robots-parser, and stores metadata such as user-agent-specific rules and any Crawl-delay value.
  • Before each request the crawler verifies the URL is allowed; disallowed URLs are counted as skipped.
  • RateLimiter compares the CLI delay with Crawl-delay and always uses the larger value to avoid overwhelming the target site.
  • Friendly error messages point out when the start URL is blocked and explain how to proceed responsibly.

If robots.txt is unreachable the tool logs a warning and assumes crawling is permitted. You can opt out entirely with --no-robots, but use that only when you have explicit permission.
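Putting the robots handling and delay selection together, a hedged TypeScript sketch using the robots-parser package might look like this; the effectiveDelayMs helper and its error handling are illustrative, not SiteDex's fetchRobotsTxt or RateLimiter.

import robotsParser from 'robots-parser';

// Sketch of the behavior described above: fetch robots.txt, check the start URL,
// and pick the larger of the CLI delay and any Crawl-delay directive.
async function effectiveDelayMs(startUrl: string, userAgent: string, cliDelayMs: number): Promise<number> {
  const robotsUrl = new URL('/robots.txt', startUrl).toString();
  const response = await fetch(robotsUrl);

  if (!response.ok) {
    // Unreachable robots.txt: assume crawling is permitted (with a warning in the real CLI).
    return cliDelayMs;
  }

  const robots = robotsParser(robotsUrl, await response.text());

  if (!robots.isAllowed(startUrl, userAgent)) {
    throw new Error(`robots.txt disallows crawling ${startUrl} for ${userAgent}`);
  }

  // Crawl-delay is expressed in seconds; always honor the larger of the two values.
  const crawlDelayMs = (robots.getCrawlDelay(userAgent) ?? 0) * 1000;
  return Math.max(cliDelayMs, crawlDelayMs);
}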

Contributing

Issues and pull requests are welcome. Please run npm run lint and npm test before submitting, and follow the conventional commit style enforced by Commitlint/Husky.

License

MIT © Chun-Kai Wang