SiteDex
Site crawler CLI that generates sitemap.xml plus llms.txt/llms-full.txt for LLM-friendly site maps
SiteDex is a Node.js CLI that crawls a public website, gathers the internal pages that respond with HTTP 200, and produces machine-readable artifacts for LLM agents (llms.txt and llms-full.txt) together with a standards-compliant sitemap.xml. Under the hood it uses Crawlee for traversal and Cheerio/Turndown for content extraction, wrapped in a batteries-included CLI with helpful diagnostics and progress reporting.
Features
- Accurate crawling – Only internal links are followed, URLs are deduplicated, and crawl depth is tracked to prevent infinite loops.
- Robots aware – The CLI fetches and honors robots.txt, including user-agent specific rules and Crawl-delay, with clear messaging when the start URL is blocked.
- Adaptive throttling – Request pacing automatically combines CLI delay settings with any Crawl-delay directive to stay friendly to target sites.
- Structured outputs – Generates sitemap.xml, llms.txt, and a richer llms-full.txt (H3 sections with Markdown payloads) ready for LLM ingestion.
- Markdown transformation – Navigation, scripts, and other boilerplate are stripped before the readable content is converted into Markdown (see the sketch after this list).
- Optional page detection – Archive, legacy, and deprecated URLs are automatically routed to an Optional section so agents can de-prioritize them.
- Detailed logging – Crawl progress, statistics, and actionable troubleshooting tips are surfaced directly in the CLI.
- Comprehensive tests – Unit, integration, and contract suites validate URL utilities, content processing, generators, robots compliance, and file output.
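As a rough illustration of the Markdown transformation step, the clean-up can be pictured as below. The selector list and function name are assumptions for illustration, not the project's actual content processor; only the libraries (Cheerio and Turndown) come from the description above.

```ts
import * as cheerio from 'cheerio';
import TurndownService from 'turndown';

// Hypothetical sketch: strip boilerplate elements with Cheerio, then convert
// the remaining readable HTML to Markdown with Turndown.
export function htmlToMarkdown(html: string): string {
  const $ = cheerio.load(html);
  // Drop navigation, scripts, styles, and similar non-content chrome.
  $('nav, header, footer, aside, script, style, noscript, iframe').remove();
  const readable = $('main').html() ?? $('body').html() ?? '';
  return new TurndownService().turndown(readable);
}
```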
Installation
```bash
npm install -g sitedex
```
Usage
Quick start
```bash
sitedex https://example.com
```
By default the crawl output is written to ./output and includes sitemap.xml plus llms.txt. Add --full if you want llms-full.txt, which embeds full Markdown for each page.
CLI options
| Flag | Description | Default |
| --- | --- | --- |
| --output <dir> | Destination directory for generated files | ./output |
| --delay <ms> | Minimum delay (ms) between requests | 1000 |
| --max-depth <n> | Maximum crawl depth based on link hierarchy (use 0 for unlimited). Depth 0 = start URL, depth 1 = pages linked from start URL, etc. | 3 |
| --max-pages <n> | Maximum number of pages to visit (use 0 for unlimited) | unlimited |
| --timeout <ms> | Request timeout (ms) passed to Crawlee | 30000 |
| --user-agent <string> | Custom User-Agent string | sitedex/<version> (+https://github.com/chunkai1312/sitedex) |
| --no-robots | Ignore robots.txt rules (use only when you have explicit permission) | disabled |
| --full | Emit llms-full.txt with Markdown snapshots | disabled |
| --silent | Hide progress logs and run in quiet mode | disabled |
Example invocations
```bash
# Crawl at most 100 pages two levels deep and include full content
sitedex https://example.com --max-pages 100 --max-depth 2 --full

# Use unlimited mode to crawl the entire site
sitedex https://example.com --max-depth 0 --max-pages 0

# Respect a slower schedule to reduce load on the target site
sitedex https://example.com --delay 5000

# Write files to a custom directory with a custom UA
sitedex https://example.com --output ~/Desktop/sitedex-out --user-agent "labs-bot/0.1"
```
Generated files
sitemap.xml
Created via src/generators/sitemap-generator.ts, this XML document complies with Sitemap Protocol 0.9. URLs are normalized (host lower-casing, fragment removal, sorted query params) and capped at 50 000 entries. The generator also warns when no URLs were eligible, or when the raw crawl produced more than the allowed maximum.
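As a point of reference, the normalization steps listed above map onto the standard WHATWG URL API roughly as in this sketch (an illustration, not the generator's actual source):

```ts
// Illustrative only: lower-case the host, drop the fragment, and sort query
// parameters so that equivalent URLs collapse to a single sitemap entry.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hostname = url.hostname.toLowerCase(); // host lower-casing
  url.hash = '';                             // fragment removal
  url.searchParams.sort();                   // deterministic query-param order
  return url.toString();
}

normalizeUrl('https://Example.com/Docs?b=2&a=1#intro');
// => 'https://example.com/Docs?a=1&b=2'
```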
llms.txt
src/generators/llmstxt-generator.ts emits the compact, link-focused format defined by llmstxt.org. Pages are grouped by their first path segment (with an Optional section for archive/legacy content) and include Markdown bullet lists with titles and descriptions.
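The grouping rule can be pictured with a short sketch; the type and helper names here are assumptions, and the real generator may differ in detail.

```ts
interface CrawledPage { url: string; title: string; description?: string }

// Hypothetical sketch of the sectioning rule: group by first path segment,
// diverting archive/legacy/deprecated URLs into an Optional bucket.
function groupPages(pages: CrawledPage[]): Map<string, CrawledPage[]> {
  const sections = new Map<string, CrawledPage[]>();
  for (const page of pages) {
    const { pathname } = new URL(page.url);
    const firstSegment = pathname.split('/').filter(Boolean)[0] ?? 'Home';
    const isOptional = /(archive|legacy|deprecated)/i.test(pathname);
    const key = isOptional ? 'Optional' : firstSegment;
    if (!sections.has(key)) sections.set(key, []);
    sections.get(key)!.push(page);
  }
  return sections;
}
```

A generated llms.txt then looks like: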
```
# Example Site
> A short description of the target site
## Docs
- [Getting Started](https://example.com/docs/start): Quick intro
- [API Reference](https://example.com/docs/api): REST endpoints
```
llms-full.txt
When the --full flag is provided, src/generators/llmstxt-full-generator.ts writes a richer document. Each page becomes an H3 section with a bold URL, full Markdown converted from HTML, and separator lines (---). This is ideal for downstream LLM agents that need context beyond simple descriptions.
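The section layout can be sketched as follows; the exact label text and spacing are assumptions, and only the H3 heading, bold URL, Markdown body, and --- separator come from the description above.

```ts
// Hypothetical rendering helper for one llms-full.txt section.
function renderFullSection(page: { title: string; url: string; markdown: string }): string {
  return [
    `### ${page.title}`,    // H3 heading per page
    '',
    `**URL:** ${page.url}`, // bold URL line (label is an assumption)
    '',
    page.markdown,          // full Markdown converted from the page's HTML
    '',
    '---',                  // separator between pages
    '',
  ].join('\n');
}
```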
Robots & rate limiting
- fetchRobotsTxt retrieves /robots.txt, parses it with robots-parser, and stores metadata such as user-agent specific rules and the crawl delay.
- Before each request the crawler verifies the URL is allowed; disallowed URLs are counted as skipped.
- RateLimiter compares the CLI delay with Crawl-delay and always uses the larger value to avoid overwhelming the target site.
- Friendly error messages point out when the start URL is blocked and explain how to proceed responsibly.
If robots.txt is unreachable the tool logs a warning and assumes crawling is permitted. You can opt out entirely with --no-robots, but use that only when you have explicit permission.
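Put together, the behavior described above boils down to something like the sketch below. It uses the public robots-parser API, but the function shape, names, and fallback handling are assumptions rather than the project's fetchRobotsTxt/RateLimiter source.

```ts
import robotsParser from 'robots-parser';

// Hypothetical sketch of robots handling and request pacing.
async function planRequestPolicy(startUrl: string, userAgent: string, cliDelayMs: number) {
  const robotsUrl = new URL('/robots.txt', startUrl).href;

  // Unreachable or non-200 robots.txt: warn and assume crawling is permitted.
  const robotsTxt = await fetch(robotsUrl)
    .then((res) => (res.ok ? res.text() : null))
    .catch(() => null);
  const robots = robotsTxt !== null ? robotsParser(robotsUrl, robotsTxt) : null;

  const startUrlAllowed = robots ? robots.isAllowed(startUrl, userAgent) !== false : true;
  const crawlDelaySec = robots?.getCrawlDelay(userAgent) ?? 0;
  // Effective pacing is the larger of --delay and any Crawl-delay directive.
  const minDelayMs = Math.max(cliDelayMs, crawlDelaySec * 1000);

  return { startUrlAllowed, minDelayMs };
}
```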
Contributing
Issues and pull requests are welcome. Please run npm run lint and npm test before submitting, and follow the conventional commit style enforced by Commitlint/Husky.
License
MIT © Chun-Kai Wang
