SiteDex
Site crawler CLI that generates sitemap.xml plus llms.txt/llms-full.txt for LLM-friendly site maps
SiteDex is a Node.js CLI that crawls a public website, gathers the internal pages that respond with HTTP 200, and produces machine-readable artifacts for LLM agents (llms.txt and llms-full.txt) together with a standards-compliant sitemap.xml. Under the hood it uses Crawlee for traversal and Cheerio/Turndown for content extraction, wrapped in a batteries-included CLI with helpful diagnostics and progress reporting.
Features
- Accurate crawling – Only internal links are followed, URLs are deduplicated, and crawl depth is tracked to prevent infinite loops.
- Robots aware – The CLI fetches and honors robots.txt, including user-agent specific rules and Crawl-delay, with clear messaging when the start URL is blocked.
- Adaptive throttling – Request pacing automatically combines CLI delay settings with any Crawl-delay directive to stay friendly to target sites.
- Structured outputs – Generates sitemap.xml, llms.txt, and a richer llms-full.txt (H3 sections with Markdown payloads) ready for LLM ingestion.
- Markdown transformation – Navigation, scripts, and other boilerplate are stripped before the readable content is converted into Markdown (see the sketch after this list).
- Optional page detection – Archive, legacy, and deprecated URLs are automatically routed to an Optional section so agents can de-prioritize them.
- Detailed logging – Crawl progress, statistics, and actionable troubleshooting tips are surfaced directly in the CLI.
- Comprehensive tests – Unit, integration, and contract suites validate URL utilities, content processing, generators, robots compliance, and file output.
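As a rough illustration of the Markdown transformation step, the clean-up can be pictured as below. The selector list and function name are assumptions for illustration, not the project's actual content processor; only the libraries (Cheerio and Turndown) come from the description above.

```ts
import * as cheerio from 'cheerio';
import TurndownService from 'turndown';

// Hypothetical sketch: strip boilerplate elements with Cheerio, then convert
// the remaining readable HTML to Markdown with Turndown.
export function htmlToMarkdown(html: string): string {
  const $ = cheerio.load(html);
  // Drop navigation, scripts, styles, and similar non-content chrome.
  $('nav, header, footer, aside, script, style, noscript, iframe').remove();
  const readable = $('main').html() ?? $('body').html() ?? '';
  return new TurndownService().turndown(readable);
}
```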
Installation
```bash
npm install -g sitedex
```
Usage
Quick start
```bash
sitedex https://example.com
```
By default the crawl output is written to ./output and includes sitemap.xml plus llms.txt. Add --full if you want llms-full.txt, which embeds full Markdown for each page.
CLI options
| Flag | Description | Default |
| --- | --- | --- |
| --output <dir> | Destination directory for generated files | ./output |
| --delay <ms> | Minimum delay (ms) between requests | 1000 |
| --max-depth <n> | Maximum crawl depth based on link hierarchy (use 0 for unlimited). Depth 0 = start URL, depth 1 = pages linked from start URL, etc. | 3 |
| --max-pages <n> | Maximum number of pages to visit (use 0 for unlimited) | unlimited |
| --timeout <ms> | Request timeout (ms) passed to Crawlee | 30000 |
| --user-agent <string> | Custom User-Agent string | sitedex/<version> (+https://github.com/chunkai1312/sitedex) |
| --no-robots | Ignore robots.txt rules (use only when you have explicit permission) | disabled |
| --full | Emit llms-full.txt with Markdown snapshots | disabled |
| --silent | Hide progress logs and run in quiet mode | disabled |
Example invocations
```bash
# Crawl at most 100 pages two levels deep and include full content
sitedex https://example.com --max-pages 100 --max-depth 2 --full

# Use unlimited mode to crawl the entire site
sitedex https://example.com --max-depth 0 --max-pages 0

# Respect a slower schedule to reduce load on the target site
sitedex https://example.com --delay 5000

# Write files to a custom directory with a custom UA
sitedex https://example.com --output ~/Desktop/sitedex-out --user-agent "labs-bot/0.1"
```
Generated files
sitemap.xml
Created via src/generators/sitemap-generator.ts, this XML document complies with Sitemap Protocol 0.9. URLs are normalized (host lower-casing, fragment removal, sorted query params) and capped at 50 000 entries. The generator also warns when no URLs were eligible, or when the raw crawl produced more than the allowed maximum.
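As a point of reference, the normalization steps listed above map onto the standard WHATWG URL API roughly as in this sketch (an illustration, not the generator's actual source):

```ts
// Illustrative only: lower-case the host, drop the fragment, and sort query
// parameters so that equivalent URLs collapse to a single sitemap entry.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hostname = url.hostname.toLowerCase(); // host lower-casing
  url.hash = '';                             // fragment removal
  url.searchParams.sort();                   // deterministic query-param order
  return url.toString();
}

normalizeUrl('https://Example.com/Docs?b=2&a=1#intro');
// => 'https://example.com/Docs?a=1&b=2'
```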
llms.txt
src/generators/llmstxt-generator.ts emits the compact, link-focused format defined by llmstxt.org. Pages are grouped by their first path segment (with an Optional section for archive/legacy content) and include Markdown bullet lists with titles and descriptions.
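The grouping rule can be pictured with a short sketch; the type and helper names here are assumptions, and the real generator may differ in detail.

```ts
interface CrawledPage { url: string; title: string; description?: string }

// Hypothetical sketch of the sectioning rule: group by first path segment,
// diverting archive/legacy/deprecated URLs into an Optional bucket.
function groupPages(pages: CrawledPage[]): Map<string, CrawledPage[]> {
  const sections = new Map<string, CrawledPage[]>();
  for (const page of pages) {
    const { pathname } = new URL(page.url);
    const firstSegment = pathname.split('/').filter(Boolean)[0] ?? 'Home';
    const isOptional = /(archive|legacy|deprecated)/i.test(pathname);
    const key = isOptional ? 'Optional' : firstSegment;
    if (!sections.has(key)) sections.set(key, []);
    sections.get(key)!.push(page);
  }
  return sections;
}
```

A generated llms.txt then looks like: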
```
# Example Site
> A short description of the target site
## Docs
- [Getting Started](https://example.com/docs/start): Quick intro
- [API Reference](https://example.com/docs/api): REST endpoints
```
llms-full.txt
When the --full flag is provided, src/generators/llmstxt-full-generator.ts writes a richer document. Each page becomes an H3 section with a bold URL, full Markdown converted from HTML, and separator lines (---). This is ideal for downstream LLM agents that need context beyond simple descriptions.
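The section layout can be sketched as follows; the exact label text and spacing are assumptions, and only the H3 heading, bold URL, Markdown body, and --- separator come from the description above.

```ts
// Hypothetical rendering helper for one llms-full.txt section.
function renderFullSection(page: { title: string; url: string; markdown: string }): string {
  return [
    `### ${page.title}`,    // H3 heading per page
    '',
    `**URL:** ${page.url}`, // bold URL line (label is an assumption)
    '',
    page.markdown,          // full Markdown converted from the page's HTML
    '',
    '---',                  // separator between pages
    '',
  ].join('\n');
}
```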
Robots & rate limiting
- fetchRobotsTxt retrieves /robots.txt, parses it with robots-parser, and stores metadata such as user-agent specific rules and the crawl delay.
- Before each request the crawler verifies the URL is allowed; disallowed URLs are counted as skipped.
- RateLimiter compares the CLI delay with Crawl-delay and always uses the larger value to avoid overwhelming the target site.
- Friendly error messages point out when the start URL is blocked and explain how to proceed responsibly.
If robots.txt is unreachable the tool logs a warning and assumes crawling is permitted. You can opt out entirely with --no-robots, but use that only when you have explicit permission.
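Put together, the behavior described above boils down to something like the sketch below. It uses the public robots-parser API, but the function shape, names, and fallback handling are assumptions rather than the project's fetchRobotsTxt/RateLimiter source.

```ts
import robotsParser from 'robots-parser';

// Hypothetical sketch of robots handling and request pacing.
async function planRequestPolicy(startUrl: string, userAgent: string, cliDelayMs: number) {
  const robotsUrl = new URL('/robots.txt', startUrl).href;

  // Unreachable or non-200 robots.txt: warn and assume crawling is permitted.
  const robotsTxt = await fetch(robotsUrl)
    .then((res) => (res.ok ? res.text() : null))
    .catch(() => null);
  const robots = robotsTxt !== null ? robotsParser(robotsUrl, robotsTxt) : null;

  const startUrlAllowed = robots ? robots.isAllowed(startUrl, userAgent) !== false : true;
  const crawlDelaySec = robots?.getCrawlDelay(userAgent) ?? 0;
  // Effective pacing is the larger of --delay and any Crawl-delay directive.
  const minDelayMs = Math.max(cliDelayMs, crawlDelaySec * 1000);

  return { startUrlAllowed, minDelayMs };
}
```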
Contributing
Issues and pull requests are welcome. Please run npm run lint and npm test before submitting, and follow the conventional commit style enforced by Commitlint/Husky.
License
MIT © Chun-Kai Wang
