# @mdream/crawl

> v1.0.3

Multi-page website crawler that generates llms.txt files from a single URL. Follows internal links and converts HTML to Markdown using mdream.
## Setup

```bash
npm install @mdream/crawl
```

For JavaScript-heavy sites that require browser rendering, install the optional Playwright dependencies:

```bash
npm install crawlee playwright
```

## CLI Usage

### Interactive Mode

Run without arguments to start the interactive prompt-based interface:

```bash
npx @mdream/crawl
```

### Direct Mode

Pass arguments directly to skip interactive prompts:

```bash
npx @mdream/crawl -u https://docs.example.com
```

### CLI Options
| Flag | Alias | Description | Default |
|------|-------|-------------|---------|
| --url <url> | -u | Website URL to crawl (supports glob patterns) | Required |
| --output <dir> | -o | Output directory | output |
| --depth <number> | -d | Crawl depth (0 for single page, max 10) | 3 |
| --single-page | | Only process the given URL(s), no crawling. Alias for --depth 0 | |
| --driver <type> | | Crawler driver: http or playwright | http |
| --artifacts <list> | | Comma-separated output formats: llms.txt, llms-full.txt, markdown | all three |
| --origin <url> | | Origin URL for resolving relative paths (overrides auto-detection) | auto-detected |
| --site-name <name> | | Override the auto-extracted site name used in llms.txt | auto-extracted |
| --description <desc> | | Override the auto-extracted site description used in llms.txt | auto-extracted |
| --max-pages <number> | | Maximum pages to crawl | unlimited |
| --crawl-delay <seconds> | | Delay between requests in seconds | from robots.txt or none |
| --exclude <pattern> | | Exclude URLs matching glob patterns (repeatable) | none |
| --skip-sitemap | | Skip sitemap.xml and robots.txt discovery | false |
| --allow-subdomains | | Crawl across subdomains of the same root domain | false |
| --verbose | -v | Enable verbose logging | false |
| --help | -h | Show help message | |
| --version | | Show version number | |
### CLI Examples

```bash
# Basic crawl with specific artifacts
npx @mdream/crawl -u harlanzw.com --artifacts "llms.txt,markdown"

# Shallow crawl (depth 2) with only llms-full.txt output
npx @mdream/crawl --url https://docs.example.com --depth 2 --artifacts "llms-full.txt"

# Exclude admin and API routes
npx @mdream/crawl -u example.com --exclude "*/admin/*" --exclude "*/api/*"

# Single-page mode (no link following)
npx @mdream/crawl -u example.com/pricing --single-page

# Use Playwright for JavaScript-heavy sites
npx @mdream/crawl -u example.com --driver playwright

# Skip sitemap discovery with verbose output
npx @mdream/crawl -u example.com --skip-sitemap --verbose

# Crawl across subdomains (docs.example.com, blog.example.com, etc.)
npx @mdream/crawl -u example.com --allow-subdomains

# Override site metadata
npx @mdream/crawl -u example.com --site-name "My Company" --description "Company documentation"
```

### Glob Patterns
URLs support glob patterns for targeted crawling. When a glob pattern is provided, the crawler uses sitemap discovery to find all matching URLs.
```bash
# Crawl only the /docs/ section
npx @mdream/crawl -u "docs.example.com/docs/**"

# Crawl pages matching a prefix
npx @mdream/crawl -u "example.com/blog/2024*"
```

Patterns are matched against the URL pathname using picomatch syntax. A trailing single `*` (e.g. `/fieldtypes*`) automatically expands to match both the path itself and all subdirectories.
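The trailing-`*` expansion can be pictured with a small sketch. This is a hypothetical illustration of the rule described above, not the package's actual implementation:

```typescript
// Hypothetical sketch: a single trailing "*" (but not "**") expands
// so the pattern also covers everything under matching paths.
function expandTrailingStar(pattern: string): string[] {
  if (pattern.endsWith('*') && !pattern.endsWith('**'))
    return [pattern, `${pattern}/**`] // e.g. /fieldtypes* → also /fieldtypes*/**
  return [pattern]
}
```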
## Programmatic API

### crawlAndGenerate(options, onProgress?)

The main entry point for programmatic use. Returns a `Promise<CrawlResult[]>`.

```ts
import { crawlAndGenerate } from '@mdream/crawl'

const results = await crawlAndGenerate({
  urls: ['https://docs.example.com'],
  outputDir: './output',
})
```

### CrawlOptions
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| urls | string[] | Required | Starting URLs for crawling |
| outputDir | string | Required | Directory to write output files |
| driver | 'http' \| 'playwright' | 'http' | Crawler driver to use |
| maxRequestsPerCrawl | number | Number.MAX_SAFE_INTEGER | Maximum total pages to crawl |
| followLinks | boolean | false | Whether to follow internal links discovered on pages |
| maxDepth | number | 1 | Maximum link-following depth. 0 enables single-page mode |
| generateLlmsTxt | boolean | true | Generate an llms.txt file |
| generateLlmsFullTxt | boolean | false | Generate an llms-full.txt file with full page content |
| generateIndividualMd | boolean | true | Write individual .md files for each page |
| origin | string | auto-detected | Origin URL for resolving relative paths in HTML |
| siteNameOverride | string | auto-extracted | Override the site name in the generated llms.txt |
| descriptionOverride | string | auto-extracted | Override the site description in the generated llms.txt |
| globPatterns | ParsedUrlPattern[] | [] | Pre-parsed URL glob patterns (advanced usage) |
| exclude | string[] | [] | Glob patterns for URLs to exclude |
| crawlDelay | number | from robots.txt | Delay between requests in seconds |
| skipSitemap | boolean | false | Skip sitemap.xml and robots.txt discovery |
| allowSubdomains | boolean | false | Crawl across subdomains of the same root domain (e.g. docs.example.com + blog.example.com). Output files are namespaced by hostname to avoid collisions |
| useChrome | boolean | false | Use system Chrome instead of Playwright's bundled browser (Playwright driver only) |
| chunkSize | number | | Chunk size passed to mdream for markdown conversion |
| verbose | boolean | false | Enable verbose error logging |
| hooks | Partial<CrawlHooks> | | Hook functions for the crawl pipeline (see Hooks) |
| onPage | (page: PageData) => Promise<void> \| void | | Deprecated. Use hooks['crawl:page'] instead. Still works for backwards compatibility |
### CrawlResult

```ts
interface CrawlResult {
  url: string
  title: string
  content: string
  filePath?: string // Set when generateIndividualMd is true
  timestamp: number // Unix timestamp of processing time
  success: boolean
  error?: string // Set when success is false
  metadata?: PageMetadata
  depth?: number // Link-following depth at which this page was found
}

interface PageMetadata {
  title: string
  description?: string
  keywords?: string
  author?: string
  links: string[] // Internal links discovered on the page
}
```

### PageData
The shape passed to the `onPage` callback:

```ts
interface PageData {
  url: string
  html: string // Raw HTML (empty string if content was already markdown)
  title: string
  metadata: PageMetadata
  origin: string
}
```

### Progress Callback
The optional second argument to `crawlAndGenerate` receives progress updates:

```ts
await crawlAndGenerate(options, (progress) => {
  // progress.sitemap.status: 'discovering' | 'processing' | 'completed'
  // progress.sitemap.found: number of sitemap URLs found
  // progress.sitemap.processed: number of URLs after filtering
  // progress.crawling.status: 'starting' | 'processing' | 'completed'
  // progress.crawling.total: total URLs to process
  // progress.crawling.processed: pages completed so far
  // progress.crawling.failed: pages that errored
  // progress.crawling.currentUrl: URL currently being fetched
  // progress.crawling.latency: { total, min, max, count } in ms
  // progress.generation.status: 'idle' | 'generating' | 'completed'
  // progress.generation.current: description of current generation step
})
```

## Examples
### Custom page processing with onPage

```ts
import { crawlAndGenerate } from '@mdream/crawl'

const pages = []
await crawlAndGenerate({
  urls: ['https://docs.example.com'],
  outputDir: './output',
  generateIndividualMd: false,
  generateLlmsTxt: false,
  onPage: (page) => {
    pages.push({
      url: page.url,
      title: page.title,
      description: page.metadata.description,
    })
  },
})
console.log(`Discovered ${pages.length} pages`)
```

### Glob filtering with exclusions
```ts
import { crawlAndGenerate } from '@mdream/crawl'

await crawlAndGenerate({
  urls: ['https://example.com/docs/**'],
  outputDir: './docs-output',
  exclude: ['/docs/deprecated/*', '/docs/internal/*'],
  followLinks: true,
  maxDepth: 2,
})
```

### Crawling across subdomains
```ts
await crawlAndGenerate({
  urls: ['https://example.com'],
  outputDir: './output',
  allowSubdomains: true, // Will also crawl docs.example.com, blog.example.com, etc.
  followLinks: true,
  maxDepth: 2,
})
```

### Single-page mode
Set `maxDepth: 0` to process only the provided URLs without crawling or link following:

```ts
await crawlAndGenerate({
  urls: ['https://example.com/pricing', 'https://example.com/about'],
  outputDir: './output',
  maxDepth: 0,
})
```

## Config File
Create a `mdream.config.ts` (or `.js`, `.mjs`) in your project root to set defaults and register hooks. Loaded via c12.

```ts
import { defineConfig } from '@mdream/crawl'

export default defineConfig({
  exclude: ['*/admin/*', '*/internal/*'],
  driver: 'http',
  maxDepth: 3,
  hooks: {
    'crawl:page': (page) => {
      // Strip branding from all page titles
      page.title = page.title.replace(/ \| My Brand$/, '')
    },
  },
})
```

CLI arguments override config file values. Array options like `exclude` are concatenated (config + CLI).
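The precedence rules can be pictured with a small sketch. This is hypothetical (the real merging is handled by c12 and the CLI layer), but it captures the behavior described above:

```typescript
// Hypothetical sketch of config/CLI merging:
// CLI values win for scalars, array options are concatenated.
function mergeConfig(
  fileConfig: Record<string, unknown>,
  cliArgs: Record<string, unknown>,
): Record<string, unknown> {
  const merged: Record<string, unknown> = { ...fileConfig }
  for (const [key, value] of Object.entries(cliArgs)) {
    const existing = merged[key]
    if (Array.isArray(existing) && Array.isArray(value))
      merged[key] = [...existing, ...value] // e.g. exclude: config + CLI
    else
      merged[key] = value // CLI overrides scalar options
  }
  return merged
}
```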
### Config Options
| Option | Type | Description |
|--------|------|-------------|
| exclude | string[] | Glob patterns for URLs to exclude |
| driver | 'http' \| 'playwright' | Crawler driver |
| maxDepth | number | Maximum crawl depth |
| maxPages | number | Maximum pages to crawl |
| crawlDelay | number | Delay between requests (seconds) |
| skipSitemap | boolean | Skip sitemap discovery |
| allowSubdomains | boolean | Crawl across subdomains |
| verbose | boolean | Enable verbose logging |
| artifacts | string[] | Output formats: llms.txt, llms-full.txt, markdown |
| hooks | object | Hook functions (see below) |
## Hooks

Four hooks let you intercept and transform data at each stage of the crawl pipeline. Hooks receive mutable objects; mutate them in place to transform the output.

### crawl:url

Called before fetching a URL. Set `ctx.skip = true` to skip it entirely (saves the network request).

```ts
defineConfig({
  hooks: {
    'crawl:url': (ctx) => {
      // Skip large asset pages
      if (ctx.url.includes('/assets/') || ctx.url.includes('/downloads/'))
        ctx.skip = true
    },
  },
})
```

### crawl:page
Called after HTML-to-Markdown conversion, before storage. Mutate `page.title` or other fields. This hook replaces the `onPage` callback (which still works for backwards compatibility).

```ts
defineConfig({
  hooks: {
    'crawl:page': (page) => {
      // page.url, page.html, page.title, page.metadata, page.origin
      page.title = page.title.replace(/ - Docs$/, '')
    },
  },
})
```

### crawl:content
Called before markdown is written to disk. Transform the final output content or change the file path.

```ts
defineConfig({
  hooks: {
    'crawl:content': (ctx) => {
      // ctx.url, ctx.title, ctx.content, ctx.filePath
      ctx.content = ctx.content.replace(/CONFIDENTIAL/g, '[REDACTED]')
      ctx.filePath = ctx.filePath.replace('.md', '.mdx')
    },
  },
})
```

### crawl:done
Called after all pages are crawled, before llms.txt generation. Filter or reorder results.

```ts
defineConfig({
  hooks: {
    'crawl:done': (ctx) => {
      // Remove short pages from the final output
      const filtered = ctx.results.filter(r => r.content.length > 100)
      ctx.results.length = 0
      ctx.results.push(...filtered)
    },
  },
})
```

### Programmatic Hooks
Hooks can also be passed directly to `crawlAndGenerate`:

```ts
import { crawlAndGenerate } from '@mdream/crawl'

await crawlAndGenerate({
  urls: ['https://example.com'],
  outputDir: './output',
  hooks: {
    'crawl:page': (page) => {
      page.title = page.title.replace(/ \| Brand$/, '')
    },
    'crawl:done': (ctx) => {
      ctx.results.sort((a, b) => a.url.localeCompare(b.url))
    },
  },
})
```

## Crawl Drivers
### HTTP Driver (default)

Uses ofetch for page fetching with up to 20 concurrent requests.

- Automatic retry (2 retries with 500ms delay)
- 10-second request timeout
- Respects `Retry-After` headers on 429 responses (automatically adjusts crawl delay)
- Detects `text/markdown` content types and skips HTML-to-Markdown conversion
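The `Retry-After` behavior amounts to something like the following hypothetical sketch. It illustrates the adjustment described above; the package's actual driver code may differ:

```typescript
// Hypothetical sketch: on a 429 with a numeric Retry-After header,
// raise the crawl delay to at least what the server asked for.
function adjustedCrawlDelay(
  status: number,
  retryAfterHeader: string | null,
  currentDelaySeconds: number,
): number {
  if (status === 429 && retryAfterHeader) {
    const seconds = Number.parseInt(retryAfterHeader, 10)
    if (!Number.isNaN(seconds))
      return Math.max(currentDelaySeconds, seconds)
  }
  return currentDelaySeconds
}
```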
### Playwright Driver

For sites that require a browser to render content. Requires crawlee and playwright as peer dependencies (see Setup).

```bash
npx @mdream/crawl -u example.com --driver playwright
```

```ts
await crawlAndGenerate({
  urls: ['https://spa-app.example.com'],
  outputDir: './output',
  driver: 'playwright',
})
```

Waits for `networkidle` before extracting content. Automatically detects and uses system Chrome when available, falling back to Playwright's bundled browser.
## Sitemap and Robots.txt Discovery

By default, the crawler performs sitemap discovery before crawling:

- Fetches `robots.txt` to find `Sitemap:` directives and `Crawl-delay` values
- Loads sitemaps referenced in `robots.txt`
- Falls back to `/sitemap.xml`
- Tries common alternatives: `/sitemap_index.xml`, `/sitemaps.xml`, `/sitemap-index.xml`
- Supports sitemap index files (recursively loads child sitemaps)
- Filters discovered URLs against glob patterns and exclusion rules

The home page is always included for metadata extraction (site name, description).

Disable with `--skip-sitemap` or `skipSitemap: true`.
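The fallback order above can be sketched as a small hypothetical helper (`sitemapCandidates` is not part of the package API; it only mirrors the documented order):

```typescript
// Hypothetical sketch of the sitemap fallback order described above.
function sitemapCandidates(origin: string, fromRobotsTxt: string[]): string[] {
  // Sitemaps declared in robots.txt take priority.
  if (fromRobotsTxt.length > 0)
    return fromRobotsTxt
  // Otherwise try the conventional locations in order.
  return [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemaps.xml',
    '/sitemap-index.xml',
  ].map(path => new URL(path, origin).href)
}
```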
## Output Formats

### Individual Markdown Files

One `.md` file per crawled page, written to the output directory preserving the URL path structure. For example, `https://example.com/docs/getting-started` becomes `output/docs/getting-started.md`.
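The mapping can be sketched as a hypothetical helper (`urlToMarkdownPath` is not part of the package API; it only illustrates the path structure described above):

```typescript
// Hypothetical sketch: map a crawled URL to its output .md path.
function urlToMarkdownPath(pageUrl: string, outputDir: string): string {
  const pathname = new URL(pageUrl).pathname.replace(/\/$/, '')
  return pathname === ''
    ? `${outputDir}/index.md` // the site root becomes index.md
    : `${outputDir}${pathname}.md`
}
```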
### llms.txt

A site overview file following the llms.txt specification, listing all crawled pages with titles and links to their markdown files.

```md
# example.com

## Pages

- [Example Domain](index.md): https://example.com/
- [About Us](about.md): https://example.com/about
```

### llms-full.txt

Same structure as llms.txt, but with the full markdown content of every page inlined.