web-to-markdown-crawler

v1.0.3

Web crawler that converts site pages to markdown, mirroring the URL structure locally

A CLI tool that crawls a website and converts every page to Markdown, mirroring the site's URL structure as a local directory tree. Internal links are rewritten to relative .md paths so the output works as a self-contained document collection.

Features

  • Mirrors URL structure on disk (/docs/intro → docs/intro.md)
  • Rewrites internal links to relative .md paths
  • Extracts <main> / <article> / [role="main"] content before converting
  • Prepends YAML frontmatter (url, crawledAt) to every file
  • Handles redirects — the final URL is used as the canonical path
  • Query-string URLs are disambiguated (/search?q=foo → search-q-foo.md)
  • Produces a nodemap.json graph of every discovered URL and its status
  • Graceful error handling — one bad page never aborts the crawl
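
The URL-to-file mapping described in the list above can be sketched roughly as follows. This is a minimal illustration inferred from the two examples in the feature list, not the crawler's actual code: `urlToOutputPath` is a hypothetical name, and the slug rule for query strings is an assumption.

```typescript
// Hypothetical helper: maps a page URL to an output path relative to the
// output directory. The rules are inferred from the README examples only.
function urlToOutputPath(pageUrl: string): string {
  const { pathname, search } = new URL(pageUrl);

  // Treat "/" and any trailing-slash path as a directory index.
  let path = pathname.endsWith("/") ? `${pathname}index` : pathname;
  path = path.replace(/^\/+/, ""); // drop the leading slash

  // Fold the query string into the file name so that /search?q=foo and
  // /search?q=bar do not collide on disk (assumed slug rule).
  if (search.length > 1) {
    const slug = search
      .slice(1)
      .replace(/[^A-Za-z0-9]+/g, "-")
      .replace(/^-+|-+$/g, "");
    if (slug) path = `${path}-${slug}`;
  }

  return `${path}.md`;
}

console.log(urlToOutputPath("https://example.com/docs/intro"));
console.log(urlToOutputPath("https://example.com/search?q=foo"));
```

The sketch ignores details a real crawler has to settle, such as existing file extensions (`/page.html`) and characters that are not safe in file names.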

Requirements

Installation

git clone https://github.com/leochilds/web-to-markdown-crawler.git
cd web-to-markdown-crawler
bun install

Usage

crawl <url> [options]

Options

| Flag | Default | Description |
|---|---|---|
| -o, --output <dir> | ./output | Output directory |
| -c, --concurrency <n> | 5 | Parallel fetch limit |
| --max-depth <n> | unlimited | Stop following links beyond this depth (0 = start page only) |
| --max-pages <n> | unlimited | Stop after writing this many pages |
| --delay <ms> | none | Delay between requests (polite crawling) |
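
To make the depth and page limits concrete, here is a small sketch of a breadth-first traversal over an in-memory link map. It only illustrates how `--max-depth` and `--max-pages` could interact; `planCrawl`, `CrawlLimits`, and the traversal order are assumptions, and the real crawler fetches pages concurrently rather than walking a prebuilt map.

```typescript
// Hypothetical types and function, for illustration only.
interface CrawlLimits {
  maxDepth?: number; // 0 = start page only
  maxPages?: number; // stop after this many pages
}

function planCrawl(
  start: string,
  links: Record<string, string[]>,
  limits: CrawlLimits = {},
): string[] {
  const maxDepth = limits.maxDepth ?? Infinity;
  const maxPages = limits.maxPages ?? Infinity;

  const visited: string[] = [];
  const seen = new Set<string>([start]);
  const queue: Array<{ url: string; depth: number }> = [{ url: start, depth: 0 }];

  while (queue.length > 0 && visited.length < maxPages) {
    const { url, depth } = queue.shift()!;
    visited.push(url);

    // Pages at the depth limit are still written, but their links are not followed.
    if (depth >= maxDepth) continue;

    for (const next of links[url] ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return visited;
}

const site = { "/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": [] };
console.log(planCrawl("/", site, { maxDepth: 1 }));
```

With `maxDepth: 0` only the start page is visited, which matches the "0 = start page only" note in the table.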

Examples

# Crawl a docs site into ./output
bun run dev https://docs.example.com

# Limit depth and add a polite delay
bun run dev https://docs.example.com --max-depth 3 --delay 500

# Custom output directory with a concurrency limit
bun run dev https://docs.example.com -o ./docs-mirror -c 3 --max-pages 100

Output

output/
  index.md          ← https://example.com/
  about.md          ← https://example.com/about
  docs/
    index.md        ← https://example.com/docs/
    intro.md        ← https://example.com/docs/intro
  nodemap.json      ← full link graph with per-URL status
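
Given a layout like the one above, rewriting an internal link comes down to computing a relative path between two output files. The following is a minimal sketch of that step. `relativeLink` is a hypothetical helper, and the leading `./` on same-directory or downward links is a stylistic assumption rather than documented behaviour.

```typescript
// Hypothetical helper: relative path from one output file to another.
// Both paths are relative to the output root and use "/" separators.
function relativeLink(fromFile: string, toFile: string): string {
  const fromDir = fromFile.split("/").slice(0, -1); // directory of the source file
  const to = toFile.split("/");

  // Count the leading directory segments the two paths share.
  let shared = 0;
  while (
    shared < fromDir.length &&
    shared < to.length - 1 &&
    fromDir[shared] === to[shared]
  ) {
    shared++;
  }

  const up = fromDir.slice(shared).map(() => ".."); // climb out of unshared dirs
  const down = to.slice(shared); // descend to the target file
  const joined = [...up, ...down].join("/");
  return up.length === 0 ? `./${joined}` : joined;
}

// A link from docs/intro.md to the /about page:
console.log(relativeLink("docs/intro.md", "about.md"));
```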

Each .md file begins with YAML frontmatter:

---
url: https://example.com/docs/intro
crawledAt: 2026-04-05T09:00:00.000Z
---
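
A frontmatter block of this shape can be produced with a few lines of string handling. The sketch below assumes a hypothetical `withFrontmatter` helper and writes the values unquoted, as in the example above. URLs containing characters that are special in YAML (such as ` #` or `: `) would need quoting, which this sketch does not handle.

```typescript
// Hypothetical helper: prepend url/crawledAt frontmatter to converted Markdown.
function withFrontmatter(
  url: string,
  markdown: string,
  crawledAt: Date = new Date(),
): string {
  const header = [
    "---",
    `url: ${url}`,
    `crawledAt: ${crawledAt.toISOString()}`, // ISO 8601, UTC
    "---",
  ].join("\n");
  return `${header}\n\n${markdown}`;
}

console.log(
  withFrontmatter(
    "https://example.com/docs/intro",
    "# Intro",
    new Date("2026-04-05T09:00:00.000Z"),
  ),
);
```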

nodemap.json records every URL the crawler encountered (including skipped external links and errors):

{
  "startUrl": "https://example.com/",
  "crawledAt": "2026-04-05T09:00:00.000Z",
  "totalPages": 42,
  "nodes": {
    "https://example.com/": { "status": "success", "outputPath": "output/index.md", "outLinks": [...] },
    "https://external.com/": { "status": "skipped", "outLinks": [] }
  }
}
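
For scripts that consume nodemap.json, the structure above could be typed roughly as follows. The interface names and the `countByStatus` helper are hypothetical, and the field list is taken only from the example, so the real file may contain more fields or other status values.

```typescript
// Hypothetical types inferred from the example; not published by the package.
interface NodeEntry {
  status: string; // "success" and "skipped" appear in the example
  outputPath?: string; // absent for pages that were not written
  outLinks: string[];
}

interface NodeMap {
  startUrl: string;
  crawledAt: string;
  totalPages: number;
  nodes: Record<string, NodeEntry>;
}

// Tally how many URLs ended up in each status.
function countByStatus(map: NodeMap): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const node of Object.values(map.nodes)) {
    counts[node.status] = (counts[node.status] ?? 0) + 1;
  }
  return counts;
}

const sample: NodeMap = {
  startUrl: "https://example.com/",
  crawledAt: "2026-04-05T09:00:00.000Z",
  totalPages: 1,
  nodes: {
    "https://example.com/": { status: "success", outputPath: "output/index.md", outLinks: [] },
    "https://external.com/": { status: "skipped", outLinks: [] },
  },
};
console.log(countByStatus(sample));
```

In practice the map would come from parsing the file, for example with `JSON.parse` on its contents, rather than from an inline literal.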

Development

bun run dev <url>      # run from source
bun run typecheck      # TypeScript type check
bun run test           # run the test suite (77 tests)
bun run build          # compile to dist/

Built with

  • got — HTTP requests with retries and redirect handling
  • cheerio — HTML parsing and link extraction
  • turndown — HTML → Markdown conversion
  • graphjs — directed graph for the link nodemap
  • p-limit — concurrency control

Built with Claude Code