@mdream/crawl
v0.15.3
Mdream Crawl generates comprehensive llms.txt artifacts from a single URL, using mdream to convert HTML to Markdown.
Multi-page website crawler that generates comprehensive llms.txt files by following internal links and processing entire websites using mdream HTML-to-Markdown conversion.
Note: For single-page HTML-to-Markdown conversion, use the mdream binary instead. @mdream/crawl is specifically designed for crawling entire websites with multiple pages.
Installation
npm install @mdream/crawl
Usage
Simply run the command to start the interactive multi-page website crawler:
npx @mdream/crawl
The crawler will automatically discover and follow internal links to crawl entire websites. The interactive interface provides:
- ✨ Beautiful prompts powered by Clack
- 🎯 Step-by-step configuration guidance
- ✅ Input validation and helpful hints
- 📋 Configuration summary before crawling
- 🎉 Clean result display with progress indicators
- 🧹 Automatic cleanup of crawler storage
Programmatic Usage
You can also use @mdream/crawl programmatically in your Node.js applications:
import { crawlAndGenerate, generateLlmsTxt } from '@mdream/crawl'

// Crawl entire websites programmatically
const results = await crawlAndGenerate({
  urls: ['https://docs.example.com'], // Starting URLs for website crawling
  outputDir: './output',
  maxRequestsPerCrawl: 100, // Maximum pages per website
  generateLlmsTxt: true,
  followLinks: true, // Follow internal links to crawl entire site
  maxDepth: 3, // How deep to follow links
  driver: 'http', // or 'playwright' for JS-heavy sites
  verbose: true
})

// Generate llms.txt manually from existing results
await generateLlmsTxt({
  siteName: 'Example Site',
  description: 'Documentation for Example Site',
  results, // the crawl results returned by crawlAndGenerate above
  outputPath: './output/llms.txt'
})
Output
The crawler generates comprehensive output from entire websites:
- Markdown files: One .md file per crawled page with clean markdown content
- llms.txt: Comprehensive site overview file following the llms.txt specification
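To make the relationship between the two outputs concrete, here is a minimal sketch of how llms.txt entries could be assembled from crawled pages. It is not the package's implementation: the PageEntry type and both function names are hypothetical, and the filename rule (every non-alphanumeric character in the URL becomes a hyphen) is only inferred from the example llms.txt output in this section.

```typescript
// Hypothetical shape for a crawled page; not the package's actual result type.
interface PageEntry {
  title: string
  url: string
}

// Assumption inferred from the example output: every non-alphanumeric
// character in the URL is replaced by a hyphen, then ".md" is appended.
function urlToFilename(url: string): string {
  return `${url.replace(/[^a-z0-9]/gi, '-')}.md`
}

// Render a minimal llms.txt body: a site heading, a "Pages" section,
// and one "- [title](file): url" line per page.
function renderLlmsTxt(siteName: string, pages: PageEntry[]): string {
  const lines = pages.map(p => `- [${p.title}](${urlToFilename(p.url)}): ${p.url}`)
  return [`# ${siteName}`, '## Pages', ...lines].join('\n')
}

const llmsTxt = renderLlmsTxt('example.com', [
  { title: 'Example Domain', url: 'https://example.com/' },
  { title: 'About Us', url: 'https://example.com/about' },
])
console.log(llmsTxt)
```

Each entry links the page title to its Markdown file and records the source URL, so a reader of llms.txt can open either the local copy or the original page.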
Example llms.txt output
# example.com
## Pages
- [Example Domain](https---example-com-.md): https://example.com/
- [About Us](https---example-com-about.md): https://example.com/about
Features
- ✅ Multi-Page Website Crawling: Designed specifically for crawling entire websites by following internal links
- ✅ Purely Interactive: No complex command-line options to remember
- ✅ Dual Crawler Support: Fast HTTP crawler (default) + Playwright for JavaScript-heavy sites
- ✅ Smart Link Discovery: Uses mdream's extraction plugin to find and follow internal links
- ✅ Rich Metadata Extraction: Extracts titles, descriptions, keywords, and author info from all pages
- ✅ Comprehensive llms.txt Generation: Creates complete site documentation files
- ✅ Configurable Depth Crawling: Follow links with customizable depth limits (1-10 levels)
- ✅ Clean Markdown Conversion: Powered by mdream's HTML-to-Markdown engine
- ✅ Performance Optimized: HTTP crawler is 5-10x faster than browser-based crawling
- ✅ Beautiful Output: Clean result display with progress indicators
- ✅ Automatic Cleanup: Purges crawler storage after completion
- ✅ TypeScript Support: Full type definitions with excellent IDE support
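The link-following and depth-limit features above can be illustrated with a small breadth-first traversal. This is a sketch under stated assumptions, not the crawler's actual code: it walks a hypothetical in-memory link graph instead of fetching pages, treats a link as internal when it shares the start URL's origin, and counts the start page as depth 0. The names crawlOrder, isInternal, and LinkGraph are invented for illustration.

```typescript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
type LinkGraph = Record<string, string[]>

// Assumption: a link is internal when it shares the start URL's origin.
function isInternal(link: string, origin: string): boolean {
  try {
    return new URL(link).origin === origin
  } catch {
    return false // skip malformed URLs
  }
}

// Breadth-first traversal that stops following links at maxDepth
// and stops visiting pages once maxPages have been collected.
function crawlOrder(graph: LinkGraph, start: string, maxDepth: number, maxPages: number): string[] {
  const origin = new URL(start).origin
  const seen = new Set<string>([start])
  const queue: Array<{ url: string, depth: number }> = [{ url: start, depth: 0 }]
  const visited: string[] = []

  while (queue.length > 0 && visited.length < maxPages) {
    const { url, depth } = queue.shift()!
    visited.push(url)
    if (depth >= maxDepth) continue // do not follow links past the depth limit
    for (const link of graph[url] ?? []) {
      if (isInternal(link, origin) && !seen.has(link)) {
        seen.add(link)
        queue.push({ url: link, depth: depth + 1 })
      }
    }
  }
  return visited
}

const site: LinkGraph = {
  'https://docs.example.com/': [
    'https://docs.example.com/guide',
    'https://other.example.org/', // external, never queued
    'https://docs.example.com/api',
  ],
  'https://docs.example.com/guide': [
    'https://docs.example.com/guide/install',
    'https://docs.example.com/', // already seen
  ],
}
console.log(crawlOrder(site, 'https://docs.example.com/', 1, 100))
```

In this sketch, raising maxDepth from 1 to 2 adds the install page to the result, while lowering maxPages truncates the traversal regardless of depth, which mirrors the roles the maxDepth and maxRequestsPerCrawl options play in the programmatic example.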
Use Cases
Perfect for:
- 📚 Documentation Sites: Crawl entire documentation websites (GitBook, Docusaurus, etc.)
- 🏢 Company Websites: Generate comprehensive site overviews for LLM context
- 📝 Blogs: Process entire blog archives with proper categorization
- 🔗 Multi-Page Resources: Any website where you need all pages, not just one
Not suitable for: Single-page conversions (use the mdream binary instead)
License
MIT
