@sandgarden/mdfetch

v1.1.0

CLI tool to convert web pages to clean markdown

Downloads

278

mdfetch

A fast, reliable CLI tool to convert web pages into clean, readable markdown.

Convert any web article into clean markdown with a single command. Uses Mozilla's Readability algorithm to extract the main content and Turndown to convert it to GitHub Flavored Markdown.

Features

  • 🚀 Fast & Reliable - Built with TypeScript, exponential backoff retry logic, and robust error handling
  • 📄 Multiple Output Formats - Get markdown (default), HTML, or plain text
  • 🎯 Smart Content Extraction - Uses Mozilla Readability to extract just the article content
  • 🔗 Absolute URLs - Automatically converts relative image and link paths to absolute URLs
  • 📊 GitHub Flavored Markdown - Full support for tables, code blocks, strikethrough, and task lists
  • ⚙️ Configurable - Customize timeout, retries, and output format
  • 📦 Zero Config - Works out of the box with sensible defaults

Installation

Global Installation (Recommended)

npm install -g @sandgarden/mdfetch

This installs the mdfetch command globally.

Local Installation

npm install @sandgarden/mdfetch

Then use via npx:

npx mdfetch <url>

Usage

Basic Usage

# Output markdown to stdout
mdfetch https://example.com/article

# Save to a file
mdfetch https://example.com/article -o article.md

# Get HTML instead of markdown
mdfetch https://example.com/article --html

# Get plain text
mdfetch https://example.com/article --text

Advanced Options

# Custom timeout (in milliseconds)
mdfetch https://example.com/article --timeout 60000

# Custom retry settings
mdfetch https://example.com/article --retries 5 --retry-delay 2000

# Custom User-Agent header
mdfetch https://example.com/article --user-agent "my-bot/1.0"

# Combine options
mdfetch https://example.com/article -o article.md --timeout 45000

# Force Readability to parse short or borderline pages that would normally be rejected
mdfetch https://example.com/short-note --always-readable

# Append every qualifying link from the raw page as markdown footnotes
mdfetch https://example.com/link-heavy-page --all-links

# Compose both flags: loosen Readability AND include a full link archive
mdfetch https://example.com/tricky-page --always-readable --all-links

By default, mdfetch identifies itself with the User-Agent mdfetch/<version> (+https://github.com/sandgardenhq/mdfetch). Use --user-agent to override it when a site requires a specific string.

All Options

Usage: mdfetch [options] <url>

CLI tool to convert web pages to clean markdown

Arguments:
  url                  URL of the web page to convert

Options:
  -V, --version        output the version number
  -o, --output <file>  Output file path (defaults to stdout)
  --html               Output readable HTML instead of markdown
  --text               Output plain text instead of markdown
  --timeout <ms>       Request timeout in milliseconds (default: "30000")
  --retries <count>    Number of retry attempts (default: "3")
  --retry-delay <ms>   Delay between retries in milliseconds (default: "1000")
  --always-readable    Relax Readability thresholds so short/borderline pages
                       still parse
  --all-links          Extract every qualifying link from the raw page and
                       append as markdown footnotes
  --user-agent <string>  Custom User-Agent header (defaults to mdfetch identifier)
  -h, --help           display help for command

--always-readable and --all-links

These two flags are independent and can be combined.

  • --always-readable — Use this when a page is too short or too lightly structured for Mozilla Readability to accept by default (for example, a brief note, a changelog entry, or a landing page). With the flag set, Readability runs with a relaxed character threshold, and if it still cannot extract an article, mdfetch falls back to the raw <body> HTML with the <title> as the heading. This is best-effort: truly empty pages will still error.

  • --all-links — Use this when you want a full archive of every outbound link on a page, regardless of whether Readability could extract the article. Links are collected from the raw HTML (including <nav>, <footer>, and sidebars, which Readability normally discards), filtered to http(s) only, deduplicated by URL, and appended to the markdown output as numbered footnotes:

    Article body...
    
    ---
    
    [^1]: [Link text](https://example.com/one)
    [^2]: [Another link](https://example.com/two)

    If Readability fails but there are extractable links, mdfetch still returns a minimal document containing the page title and the footnote block rather than erroring out.

  • Composing them: --always-readable --all-links gives you the best-effort article extraction plus the full link archive. Useful for index/hub pages that are mostly links with a small amount of introductory text.
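The filtering, deduplication, and footnote numbering described for --all-links can be sketched roughly as follows. This is a minimal illustration, not mdfetch's actual implementation: the `PageLink` shape and the `formatLinkFootnotes` name are hypothetical, and the sketch assumes links have already been collected from the raw HTML and resolved to absolute URLs.

```typescript
// Hypothetical shape for a link collected from the raw HTML (not mdfetch's own type).
interface PageLink {
  text: string;
  href: string;
}

// Keep http(s) links only, drop repeated URLs, and render numbered markdown footnotes.
function formatLinkFootnotes(links: PageLink[]): string {
  const seen = new Set<string>();
  const lines: string[] = [];
  for (const link of links) {
    let url: URL;
    try {
      url = new URL(link.href);
    } catch {
      continue; // skip anything that is not a parseable absolute URL
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    if (seen.has(url.href)) continue; // deduplicate by URL
    seen.add(url.href);
    lines.push(`[^${lines.length + 1}]: [${link.text}](${url.href})`);
  }
  return lines.join("\n");
}
```

Numbering is assigned after filtering, so skipped links (mailto:, duplicates) leave no gaps in the footnote sequence.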

Examples

Save Article as Markdown

mdfetch https://blog.example.com/great-article -o great-article.md

The output will include a metadata header:

# Article Title

**By:** Author Name
**Source:** Example Blog
**URL:** https://blog.example.com/great-article

---

Article content starts here...
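A header like the one above could be assembled from the extracted metadata along these lines. This is only a sketch: the `ArticleMeta` interface and `renderHeader` function are hypothetical names, and leaving out the "By" and "Source" lines when those fields are missing is an assumption, not documented mdfetch behavior.

```typescript
// Hypothetical metadata shape for illustration; field names are assumptions.
interface ArticleMeta {
  title: string;
  byline?: string;
  siteName?: string;
  url: string;
}

// Build the markdown header block that precedes the article body.
function renderHeader(meta: ArticleMeta): string {
  const lines: string[] = [`# ${meta.title}`, ""];
  if (meta.byline) lines.push(`**By:** ${meta.byline}`);
  if (meta.siteName) lines.push(`**Source:** ${meta.siteName}`);
  lines.push(`**URL:** ${meta.url}`, "", "---", "");
  return lines.join("\n");
}
```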

Extract Just the HTML

mdfetch https://example.com/article --html -o article.html

Get Plain Text for Processing

mdfetch https://example.com/article --text | wc -w

Pipeline Usage

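# Fetch multiple articles (one URL per line in urls.txt).
# Each file is named after the last path segment of its URL, since a full URL is not a valid file name.
while read -r url; do mdfetch "$url" -o "$(basename "$url").md"; done < urls.txt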

# Convert and immediately view
mdfetch https://example.com/article | less

Library Usage

You can also use mdfetch as a library in your Node.js projects:

import { readURL } from '@sandgarden/mdfetch';

// Fetch and convert a URL
const result = await readURL('https://example.com/article');

console.log(result.markdown);     // Markdown version
console.log(result.plainText);    // Plain text version
console.log(result.readableHTML); // Clean HTML version

// Access metadata
console.log(result.title);        // Article title
console.log(result.byline);       // Author
console.log(result.excerpt);      // Summary
console.log(result.publishedTime);// Publication date
console.log(result.length);       // Reading length

// Custom options (stored in a new variable, because `result` is already declared above)
const customResult = await readURL('https://example.com/article', {
  timeout: 60000,
  retries: 5,
  retryDelay: 2000
});

API Documentation

Full API documentation can be generated with TypeDoc:

npm run docs

Then open docs/index.html in your browser.

How It Works

  1. Fetch - Downloads the HTML content with retry logic and timeout protection
  2. Extract - Uses Mozilla's Readability algorithm to extract the main article content
  3. Process - Converts relative URLs to absolute URLs for images and links
  4. Convert - Transforms HTML to clean markdown using Turndown with GFM support
  5. Output - Returns content in all three formats: markdown, HTML, and plain text
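Step 3 (Process) can be illustrated with the standard WHATWG `URL` constructor, which resolves a relative reference against a base URL. The `toAbsoluteUrl` helper below is a hypothetical name used for illustration; how mdfetch walks the DOM and which attributes it rewrites is not shown here.

```typescript
// Resolve a possibly relative href/src against the URL of the page it came from.
// Returns the input unchanged when it cannot be parsed.
function toAbsoluteUrl(value: string, pageUrl: string): string {
  try {
    return new URL(value, pageUrl).href;
  } catch {
    return value;
  }
}

// Example inputs as they might appear in an article at this (illustrative) address.
const page = "https://example.com/blog/post";
console.log(toAbsoluteUrl("/img/a.png", page));
console.log(toAbsoluteUrl("pic.png", page));
console.log(toAbsoluteUrl("https://other.org/x", page));
```

Links that are already absolute pass through unchanged, so the helper can be applied to every link and image without checking first.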

Supported Content

Works best with:

  • Blog posts and articles
  • News articles
  • Documentation pages
  • Medium posts
  • Substack articles
  • Academic papers
  • Technical tutorials

May not work well with:

  • Paywalled content
  • JavaScript-heavy SPAs (requires pre-rendered HTML)
  • Sites with aggressive bot detection

Error Handling

The tool includes robust error handling:

  • Network Errors: Automatic retry with exponential backoff
  • Timeouts: Configurable timeout with graceful cancellation
  • 4xx Errors: No retry (client errors like 404, 403)
  • 5xx Errors: Automatic retry (server errors)
  • Content Errors: Clear error messages when content can't be extracted
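The retry policy above can be sketched as a small wrapper. This is an illustration under stated assumptions rather than mdfetch's actual code: `withRetry` and `isClientError` are hypothetical names, errors are assumed to carry a numeric `status` property for HTTP failures, `retries` is taken to mean extra attempts after the first, and the delay is assumed to double on each retry starting from `retryDelay`.

```typescript
// Assumption: HTTP failures are thrown as errors with a numeric `status` property.
function isClientError(err: unknown): boolean {
  const status = (err as { status?: unknown } | null)?.status;
  return typeof status === "number" && status >= 400 && status < 500;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `attempt` once, then up to `retries` more times with exponential backoff.
// 4xx errors are treated as permanent and rethrown without retrying.
async function withRetry<T>(
  attempt: () => Promise<T>,
  retries = 3,
  retryDelay = 1000,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (isClientError(err) || i >= retries) throw err;
      await sleep(retryDelay * 2 ** i); // 1x, 2x, 4x, ... the base delay
    }
  }
}
```

With the defaults shown in the options table (3 retries, 1000 ms), this sketch would wait 1 s, 2 s, and 4 s between attempts before giving up.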

Development

Setup

git clone https://github.com/sandgardenhq/mdfetch.git
cd mdfetch
npm install

Commands

npm run build      # Compile TypeScript
npm test           # Run tests in watch mode
npm run test:run   # Run tests once
npm run test:coverage  # Run tests with coverage
npm run docs       # Generate API documentation
npm run dev        # Run CLI in development mode

Testing

The project has comprehensive test coverage (90%+ on all metrics):

  • Unit tests for all core functions
  • Integration tests for the CLI
  • Mocked tests for HTTP fetching
  • Edge case tests for error handling

npm test

Project Structure

mdfetch/
├── src/
│   ├── cli.ts           # CLI entry point
│   ├── reader.ts        # Main library function
│   ├── fetcher.ts       # HTTP fetching with retries
│   ├── readable.ts      # Readability extraction
│   ├── types.ts         # TypeScript interfaces
│   └── __tests__/       # Test files
├── dist/                # Compiled JavaScript
├── docs/                # Generated API docs
└── package.json

Dependencies

Runtime

  • @mozilla/readability - Content extraction
  • linkedom - Lightweight DOM implementation
  • turndown - HTML to Markdown conversion
  • turndown-plugin-gfm - GitHub Flavored Markdown support
  • commander - CLI argument parsing

Development

  • typescript - Type checking and compilation
  • vitest - Fast unit testing
  • typedoc - API documentation generation

Development Workflow

This project follows strict TDD (Test-Driven Development):

  1. Write failing tests first (RED)
  2. Write minimal code to pass tests (GREEN)
  3. Refactor while keeping tests green (REFACTOR)
  4. Maintain 90%+ test coverage

See CLAUDE.md for detailed development rules.

Credits


Made with ❤️ and strict TDD practices