
@mdream/crawl v1.0.3

@mdream/crawl generates comprehensive llms.txt artifacts from a single URL, using mdream to convert HTML to Markdown.

@mdream/crawl

Multi-page website crawler that generates llms.txt files. Follows internal links and converts HTML to Markdown using mdream.

Setup

npm install @mdream/crawl

For JavaScript-heavy sites that require browser rendering, install the optional Playwright dependencies:

npm install crawlee playwright

CLI Usage

Interactive Mode

Run without arguments to start the interactive prompt-based interface:

npx @mdream/crawl

Direct Mode

Pass arguments directly to skip interactive prompts:

npx @mdream/crawl -u https://docs.example.com

CLI Options

| Flag | Alias | Description | Default |
|------|-------|-------------|---------|
| --url <url> | -u | Website URL to crawl (supports glob patterns) | Required |
| --output <dir> | -o | Output directory | output |
| --depth <number> | -d | Crawl depth (0 for single page, max 10) | 3 |
| --single-page | | Only process the given URL(s), no crawling. Alias for --depth 0 | |
| --driver <type> | | Crawler driver: http or playwright | http |
| --artifacts <list> | | Comma-separated output formats: llms.txt, llms-full.txt, markdown | all three |
| --origin <url> | | Origin URL for resolving relative paths (overrides auto-detection) | auto-detected |
| --site-name <name> | | Override the auto-extracted site name used in llms.txt | auto-extracted |
| --description <desc> | | Override the auto-extracted site description used in llms.txt | auto-extracted |
| --max-pages <number> | | Maximum pages to crawl | unlimited |
| --crawl-delay <seconds> | | Delay between requests in seconds | from robots.txt or none |
| --exclude <pattern> | | Exclude URLs matching glob patterns (repeatable) | none |
| --skip-sitemap | | Skip sitemap.xml and robots.txt discovery | false |
| --allow-subdomains | | Crawl across subdomains of the same root domain | false |
| --verbose | -v | Enable verbose logging | false |
| --help | -h | Show help message | |
| --version | | Show version number | |

CLI Examples

# Basic crawl with specific artifacts
npx @mdream/crawl -u harlanzw.com --artifacts "llms.txt,markdown"

# Shallow crawl (depth 2) with only llms-full.txt output
npx @mdream/crawl --url https://docs.example.com --depth 2 --artifacts "llms-full.txt"

# Exclude admin and API routes
npx @mdream/crawl -u example.com --exclude "*/admin/*" --exclude "*/api/*"

# Single page mode (no link following)
npx @mdream/crawl -u example.com/pricing --single-page

# Use Playwright for JavaScript-heavy sites
npx @mdream/crawl -u example.com --driver playwright

# Skip sitemap discovery with verbose output
npx @mdream/crawl -u example.com --skip-sitemap --verbose

# Crawl across subdomains (docs.example.com, blog.example.com, etc.)
npx @mdream/crawl -u example.com --allow-subdomains

# Override site metadata
npx @mdream/crawl -u example.com --site-name "My Company" --description "Company documentation"

Glob Patterns

URLs support glob patterns for targeted crawling. When a glob pattern is provided, the crawler uses sitemap discovery to find all matching URLs.

# Crawl only the /docs/ section
npx @mdream/crawl -u "docs.example.com/docs/**"

# Crawl pages matching a prefix
npx @mdream/crawl -u "example.com/blog/2024*"

Patterns are matched against the URL pathname using picomatch syntax. A trailing single * (e.g. /fieldtypes*) automatically expands to match both the path itself and all subdirectories.
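The trailing-* expansion described above can be illustrated with a small standalone matcher. This is a sketch using a plain RegExp, not the library's actual picomatch-based implementation:

```typescript
// Illustrates how a trailing single * (e.g. /fieldtypes*) matches both
// the path itself and all subdirectories. NOT the library's code.
function expandTrailingStar(pattern: string): RegExp {
  const escape = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  if (pattern.endsWith('*') && !pattern.endsWith('**')) {
    const base = escape(pattern.slice(0, -1))
    // Match the bare path, any same-segment suffix, or any subdirectory
    return new RegExp(`^${base}([^/]*)(/.*)?$`)
  }
  return new RegExp(`^${escape(pattern)}$`)
}

const re = expandTrailingStar('/fieldtypes*')
console.log(re.test('/fieldtypes'))      // true: the path itself
console.log(re.test('/fieldtypes/text')) // true: a subdirectory
console.log(re.test('/other'))           // false
```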

Programmatic API

crawlAndGenerate(options, onProgress?)

The main entry point for programmatic use. Returns a Promise<CrawlResult[]>.

import { crawlAndGenerate } from '@mdream/crawl'

const results = await crawlAndGenerate({
  urls: ['https://docs.example.com'],
  outputDir: './output',
})

CrawlOptions

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| urls | string[] | Required | Starting URLs for crawling |
| outputDir | string | Required | Directory to write output files |
| driver | 'http' \| 'playwright' | 'http' | Crawler driver to use |
| maxRequestsPerCrawl | number | Number.MAX_SAFE_INTEGER | Maximum total pages to crawl |
| followLinks | boolean | false | Whether to follow internal links discovered on pages |
| maxDepth | number | 1 | Maximum link-following depth. 0 enables single-page mode |
| generateLlmsTxt | boolean | true | Generate an llms.txt file |
| generateLlmsFullTxt | boolean | false | Generate an llms-full.txt file with full page content |
| generateIndividualMd | boolean | true | Write individual .md files for each page |
| origin | string | auto-detected | Origin URL for resolving relative paths in HTML |
| siteNameOverride | string | auto-extracted | Override the site name in the generated llms.txt |
| descriptionOverride | string | auto-extracted | Override the site description in the generated llms.txt |
| globPatterns | ParsedUrlPattern[] | [] | Pre-parsed URL glob patterns (advanced usage) |
| exclude | string[] | [] | Glob patterns for URLs to exclude |
| crawlDelay | number | from robots.txt | Delay between requests in seconds |
| skipSitemap | boolean | false | Skip sitemap.xml and robots.txt discovery |
| allowSubdomains | boolean | false | Crawl across subdomains of the same root domain (e.g. docs.example.com + blog.example.com). Output files are namespaced by hostname to avoid collisions |
| useChrome | boolean | false | Use system Chrome instead of Playwright's bundled browser (Playwright driver only) |
| chunkSize | number | | Chunk size passed to mdream for markdown conversion |
| verbose | boolean | false | Enable verbose error logging |
| hooks | Partial<CrawlHooks> | | Hook functions for the crawl pipeline (see Hooks) |
| onPage | (page: PageData) => Promise<void> \| void | | Deprecated. Use hooks['crawl:page'] instead. Still works for backwards compatibility |

CrawlResult

interface CrawlResult {
  url: string
  title: string
  content: string
  filePath?: string // Set when generateIndividualMd is true
  timestamp: number // Unix timestamp of processing time
  success: boolean
  error?: string // Set when success is false
  metadata?: PageMetadata
  depth?: number // Link-following depth at which this page was found
}

interface PageMetadata {
  title: string
  description?: string
  keywords?: string
  author?: string
  links: string[] // Internal links discovered on the page
}
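A common consumer-side task is separating successful results from failures before reporting. The helper below is a sketch typed against a minimal subset of the CrawlResult fields, not part of the library API:

```typescript
// Minimal subset of the CrawlResult fields used below.
interface ResultLike {
  url: string
  success: boolean
  error?: string
}

// Split crawl results into successes and failures (illustrative helper,
// not exported by @mdream/crawl).
function partitionResults<T extends ResultLike>(results: T[]): { ok: T[], failed: T[] } {
  return {
    ok: results.filter(r => r.success),
    failed: results.filter(r => !r.success),
  }
}

const { ok, failed } = partitionResults([
  { url: 'https://example.com/', success: true },
  { url: 'https://example.com/missing', success: false, error: '404' },
])
console.log(ok.length, failed.length) // 1 1
```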

PageData

The shape passed to the onPage callback and to the crawl:page hook:

interface PageData {
  url: string
  html: string // Raw HTML (empty string if content was already markdown)
  title: string
  metadata: PageMetadata
  origin: string
}

Progress Callback

The optional second argument to crawlAndGenerate receives progress updates:

await crawlAndGenerate(options, (progress) => {
  // progress.sitemap.status: 'discovering' | 'processing' | 'completed'
  // progress.sitemap.found: number of sitemap URLs found
  // progress.sitemap.processed: number of URLs after filtering

  // progress.crawling.status: 'starting' | 'processing' | 'completed'
  // progress.crawling.total: total URLs to process
  // progress.crawling.processed: pages completed so far
  // progress.crawling.failed: pages that errored
  // progress.crawling.currentUrl: URL currently being fetched
  // progress.crawling.latency: { total, min, max, count } in ms

  // progress.generation.status: 'idle' | 'generating' | 'completed'
  // progress.generation.current: description of current generation step
})
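A typical use of the callback is rendering a one-line status string. The formatter below is a consumer-side sketch built on the crawling fields documented above; it is not part of @mdream/crawl:

```typescript
// Shape of the crawling portion of the progress object, per the
// documented fields above.
interface CrawlingProgress {
  status: 'starting' | 'processing' | 'completed'
  total: number
  processed: number
  failed: number
  currentUrl?: string
}

// Render a compact status line from a progress update (illustrative).
function formatCrawlProgress(c: CrawlingProgress): string {
  const pct = c.total > 0 ? Math.round((c.processed / c.total) * 100) : 0
  const failures = c.failed ? `, ${c.failed} failed` : ''
  return `[${c.status}] ${c.processed}/${c.total} (${pct}%)${failures}`
}

console.log(formatCrawlProgress({
  status: 'processing', total: 40, processed: 10, failed: 1,
  currentUrl: 'https://example.com/docs',
}))
// [processing] 10/40 (25%), 1 failed
```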

Examples

Custom page processing with onPage

import { crawlAndGenerate } from '@mdream/crawl'

const pages = []

await crawlAndGenerate({
  urls: ['https://docs.example.com'],
  outputDir: './output',
  generateIndividualMd: false,
  generateLlmsTxt: false,
  onPage: (page) => {
    pages.push({
      url: page.url,
      title: page.title,
      description: page.metadata.description,
    })
  },
})

console.log(`Discovered ${pages.length} pages`)

Glob filtering with exclusions

import { crawlAndGenerate } from '@mdream/crawl'

await crawlAndGenerate({
  urls: ['https://example.com/docs/**'],
  outputDir: './docs-output',
  exclude: ['/docs/deprecated/*', '/docs/internal/*'],
  followLinks: true,
  maxDepth: 2,
})

Crawling across subdomains

await crawlAndGenerate({
  urls: ['https://example.com'],
  outputDir: './output',
  allowSubdomains: true, // Will also crawl docs.example.com, blog.example.com, etc.
  followLinks: true,
  maxDepth: 2,
})

Single-page mode

Set maxDepth: 0 to process only the provided URLs without crawling or link following:

await crawlAndGenerate({
  urls: ['https://example.com/pricing', 'https://example.com/about'],
  outputDir: './output',
  maxDepth: 0,
})

Config File

Create an mdream.config.ts (or .js, .mjs) file in your project root to set defaults and register hooks. The file is loaded via c12.

import { defineConfig } from '@mdream/crawl'

export default defineConfig({
  exclude: ['*/admin/*', '*/internal/*'],
  driver: 'http',
  maxDepth: 3,
  hooks: {
    'crawl:page': (page) => {
      // Strip branding from all page titles
      page.title = page.title.replace(/ \| My Brand$/, '')
    },
  },
})

CLI arguments override config file values. Array options like exclude are concatenated (config + CLI).
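The precedence rule above can be sketched as a merge function: CLI scalars win, while array options like exclude are concatenated. This is an illustration of the described behavior, not the library's loader code:

```typescript
// Hypothetical option shape for the illustration.
interface Options {
  driver?: string
  maxDepth?: number
  exclude?: string[]
}

// CLI values override config scalars; exclude arrays are concatenated.
function mergeOptions(config: Options, cli: Options): Options {
  return {
    ...config,
    ...cli, // CLI scalars win
    exclude: [...(config.exclude ?? []), ...(cli.exclude ?? [])],
  }
}

const merged = mergeOptions(
  { driver: 'http', exclude: ['*/admin/*'] },
  { maxDepth: 2, exclude: ['*/api/*'] },
)
console.log(merged.maxDepth, merged.exclude) // 2 [ '*/admin/*', '*/api/*' ]
```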

Config Options

| Option | Type | Description |
|--------|------|-------------|
| exclude | string[] | Glob patterns for URLs to exclude |
| driver | 'http' \| 'playwright' | Crawler driver |
| maxDepth | number | Maximum crawl depth |
| maxPages | number | Maximum pages to crawl |
| crawlDelay | number | Delay between requests (seconds) |
| skipSitemap | boolean | Skip sitemap discovery |
| allowSubdomains | boolean | Crawl across subdomains |
| verbose | boolean | Enable verbose logging |
| artifacts | string[] | Output formats: llms.txt, llms-full.txt, markdown |
| hooks | object | Hook functions (see below) |

Hooks

Four hooks let you intercept and transform data at each stage of the crawl pipeline. Hooks receive mutable objects; mutate them in place to transform the output.

crawl:url

Called before fetching a URL. Set ctx.skip = true to skip it entirely (saves the network request).

defineConfig({
  hooks: {
    'crawl:url': (ctx) => {
      // Skip large asset pages
      if (ctx.url.includes('/assets/') || ctx.url.includes('/downloads/'))
        ctx.skip = true
    },
  },
})

crawl:page

Called after HTML-to-Markdown conversion, before storage. Mutate page.title or other fields. This replaces the onPage callback (which still works for backwards compatibility).

defineConfig({
  hooks: {
    'crawl:page': (page) => {
      // page.url, page.html, page.title, page.metadata, page.origin
      page.title = page.title.replace(/ - Docs$/, '')
    },
  },
})

crawl:content

Called before markdown is written to disk. Transform the final output content or change the file path.

defineConfig({
  hooks: {
    'crawl:content': (ctx) => {
      // ctx.url, ctx.title, ctx.content, ctx.filePath
      ctx.content = ctx.content.replace(/CONFIDENTIAL/g, '[REDACTED]')
      ctx.filePath = ctx.filePath.replace('.md', '.mdx')
    },
  },
})

crawl:done

Called after all pages are crawled, before llms.txt generation. Filter or reorder results.

defineConfig({
  hooks: {
    'crawl:done': (ctx) => {
      // Remove short pages from the final output
      const filtered = ctx.results.filter(r => r.content.length > 100)
      ctx.results.length = 0
      ctx.results.push(...filtered)
    },
  },
})

Programmatic Hooks

Hooks can also be passed directly to crawlAndGenerate:

import { crawlAndGenerate } from '@mdream/crawl'

await crawlAndGenerate({
  urls: ['https://example.com'],
  outputDir: './output',
  hooks: {
    'crawl:page': (page) => {
      page.title = page.title.replace(/ \| Brand$/, '')
    },
    'crawl:done': (ctx) => {
      ctx.results.sort((a, b) => a.url.localeCompare(b.url))
    },
  },
})

Crawl Drivers

HTTP Driver (default)

Uses ofetch for page fetching with up to 20 concurrent requests.

  • Automatic retry (2 retries with 500ms delay)
  • 10 second request timeout
  • Respects Retry-After headers on 429 responses (automatically adjusts crawl delay)
  • Detects text/markdown content types and skips HTML-to-Markdown conversion
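The Retry-After handling mentioned above can be sketched as a parser. Per RFC 9110 the header carries either a number of seconds or an HTTP-date; the function below is illustrative, not the driver's actual code:

```typescript
// Convert a Retry-After header value into a delay in milliseconds.
// Returns null if the value is neither a number nor a parseable date.
function retryAfterMs(headerValue: string, now: Date = new Date()): number | null {
  const seconds = Number(headerValue)
  if (headerValue.trim() !== '' && Number.isFinite(seconds))
    return Math.max(0, seconds * 1000) // delta-seconds form
  const date = new Date(headerValue)
  if (!Number.isNaN(date.getTime()))
    return Math.max(0, date.getTime() - now.getTime()) // HTTP-date form
  return null
}

console.log(retryAfterMs('120')) // 120000
```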

Playwright Driver

For sites that require a browser to render content. Requires crawlee and playwright as peer dependencies (see Setup).

# Via the CLI
npx @mdream/crawl -u example.com --driver playwright

// Or programmatically
await crawlAndGenerate({
  urls: ['https://spa-app.example.com'],
  outputDir: './output',
  driver: 'playwright',
})

Waits for networkidle before extracting content. Automatically detects and uses system Chrome when available, falling back to Playwright's bundled browser.

Sitemap and Robots.txt Discovery

By default, the crawler performs sitemap discovery before crawling:

  1. Fetches robots.txt to find Sitemap: directives and Crawl-delay values
  2. Loads sitemaps referenced in robots.txt
  3. Falls back to /sitemap.xml
  4. Tries common alternatives: /sitemap_index.xml, /sitemaps.xml, /sitemap-index.xml
  5. Supports sitemap index files (recursively loads child sitemaps)
  6. Filters discovered URLs against glob patterns and exclusion rules

The home page is always included for metadata extraction (site name, description).

Disable with --skip-sitemap or skipSitemap: true.
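The fallback order in steps 3 and 4 can be sketched as a candidate list. This is illustrative only; the real discovery also parses robots.txt directives and recurses into sitemap index files:

```typescript
// Build the ordered list of well-known sitemap locations to try
// for a given origin (illustrative, not the crawler's code).
function sitemapCandidates(origin: string): string[] {
  return [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemaps.xml',
    '/sitemap-index.xml',
  ].map(path => new URL(path, origin).href)
}

console.log(sitemapCandidates('https://example.com')[0])
// https://example.com/sitemap.xml
```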

Output Formats

Individual Markdown Files

One .md file per crawled page, written to the output directory preserving the URL path structure. For example, https://example.com/docs/getting-started becomes output/docs/getting-started.md.
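The URL-to-file mapping can be sketched as below. This is an illustration of the example given above; the library's exact handling of query strings, trailing slashes, and collisions may differ:

```typescript
// Map a page URL to a markdown file path that mirrors the URL structure
// (illustrative; root pages map to index.md as in the llms.txt example).
function urlToOutputPath(url: string): string {
  const { pathname } = new URL(url)
  if (pathname === '/' || pathname === '')
    return 'index.md'
  return `${pathname.replace(/^\/|\/$/g, '')}.md`
}

console.log(urlToOutputPath('https://example.com/docs/getting-started'))
// docs/getting-started.md
```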

llms.txt

A site overview file following the llms.txt specification, listing all crawled pages with titles and links to their markdown files.

# example.com

## Pages

- [Example Domain](index.md): https://example.com/
- [About Us](about.md): https://example.com/about

llms-full.txt

Same structure as llms.txt but includes the full markdown content of every page inline.