@sandgarden/mdfetch

v1.1.0

Published

25 days ago

CLI tool to convert web pages to clean markdown

Downloads

278

Vulnerabilities: 0 High, 0 Medium, 0 Low

britt

cli markdown readability web-scraping content-extraction

mdfetch

A fast, reliable CLI tool to convert web pages into clean, readable markdown.

Convert any web article into clean markdown with a single command. Uses Mozilla's Readability algorithm to extract the main content and Turndown to convert it to GitHub Flavored Markdown.

Features

🚀 Fast & Reliable - Built with TypeScript, exponential backoff retry logic, and robust error handling
📄 Multiple Output Formats - Get markdown (default), HTML, or plain text
🎯 Smart Content Extraction - Uses Mozilla Readability to extract just the article content
🔗 Absolute URLs - Automatically converts relative image and link paths to absolute URLs
📊 GitHub Flavored Markdown - Full support for tables, code blocks, strikethrough, and task lists
⚙️ Configurable - Customize timeout, retries, and output format
📦 Zero Config - Works out of the box with sensible defaults

Installation

Global Installation (Recommended)

npm install -g @sandgarden/mdfetch

This installs the mdfetch command globally.

Local Installation

npm install @sandgarden/mdfetch

Then use via npx:

npx mdfetch <url>

Usage

Basic Usage

# Output markdown to stdout
mdfetch https://example.com/article

# Save to a file
mdfetch https://example.com/article -o article.md

# Get HTML instead of markdown
mdfetch https://example.com/article --html

# Get plain text
mdfetch https://example.com/article --text

Advanced Options

# Custom timeout (in milliseconds)
mdfetch https://example.com/article --timeout 60000

# Custom retry settings
mdfetch https://example.com/article --retries 5 --retry-delay 2000

# Custom User-Agent header
mdfetch https://example.com/article --user-agent "my-bot/1.0"

# Combine options
mdfetch https://example.com/article -o article.md --timeout 45000

# Force Readability to parse short or borderline pages that would normally be rejected
mdfetch https://example.com/short-note --always-readable

# Append every qualifying link from the raw page as markdown footnotes
mdfetch https://example.com/link-heavy-page --all-links

# Compose both flags: loosen Readability AND include a full link archive
mdfetch https://example.com/tricky-page --always-readable --all-links

By default, mdfetch identifies itself with the User-Agent mdfetch/<version> (+https://github.com/sandgardenhq/mdfetch). Use --user-agent to override it when a site requires a specific string.

All Options

Usage: mdfetch [options] <url>

CLI tool to convert web pages to clean markdown

Arguments:
  url                  URL of the web page to convert

Options:
  -V, --version          output the version number
  -o, --output <file>    Output file path (defaults to stdout)
  --html                 Output readable HTML instead of markdown
  --text                 Output plain text instead of markdown
  --timeout <ms>         Request timeout in milliseconds (default: "30000")
  --retries <count>      Number of retry attempts (default: "3")
  --retry-delay <ms>     Delay between retries in milliseconds (default: "1000")
  --always-readable      Relax Readability thresholds so short/borderline pages
                         still parse
  --all-links            Extract every qualifying link from the raw page and
                         append as markdown footnotes
  --user-agent <string>  Custom User-Agent header (defaults to mdfetch identifier)
  -h, --help             display help for command

`--always-readable` and `--all-links`

These two flags are independent and can be combined.

--always-readable — Use this when a page is too short or too lightly structured for Mozilla Readability to accept by default (for example, a brief note, a changelog entry, or a landing page). With the flag set, Readability runs with a relaxed character threshold, and if it still cannot extract an article, mdfetch falls back to the raw <body> HTML with the <title> as the heading. This is best-effort: truly empty pages will still error.
--all-links — Use this when you want a full archive of every outbound link on a page, regardless of whether Readability could extract the article. Links are collected from the raw HTML (including <nav>, <footer>, and sidebars, which Readability normally discards), filtered to http(s) only, deduplicated by URL, and appended to the markdown output as numbered footnotes:
```
Article body...

---

[^1]: [Link text](https://example.com/one)
[^2]: [Another link](https://example.com/two)
```
If Readability fails but there are extractable links, mdfetch still returns a minimal document containing the page title and the footnote block rather than erroring out.
Composing them — --always-readable --all-links gives you the best-effort article extraction plus the full link archive. Useful for index/hub pages that are mostly links with a small amount of introductory text.
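
The link-archive rules described above (http(s) only, deduplicated by URL, numbered footnotes) can be sketched roughly as follows. This is an illustration, not mdfetch's actual implementation: the `PageLink` type and `buildLinkFootnotes` function are hypothetical names, and resolving relative hrefs against the page URL before filtering is an assumption.

```typescript
// Hypothetical helper: mirrors the documented behaviour, not mdfetch's real code.
interface PageLink {
  text: string;
  href: string;
}

function buildLinkFootnotes(links: PageLink[], pageUrl: string): string {
  const seen = new Set<string>();
  const lines: string[] = [];

  for (const link of links) {
    let url: URL;
    try {
      // Assumption: relative hrefs are resolved against the page URL.
      url = new URL(link.href, pageUrl);
    } catch {
      continue; // Skip hrefs that cannot be parsed at all.
    }
    // Keep http(s) only: drops mailto:, javascript:, tel:, and similar schemes.
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    // Deduplicate by the resolved URL; the first occurrence wins.
    if (seen.has(url.href)) continue;
    seen.add(url.href);

    // Fall back to the URL itself when the link has no visible text.
    const label = link.text.trim() || url.href;
    lines.push(`[^${lines.length + 1}]: [${label}](${url.href})`);
  }
  return lines.join("\n");
}
```

Keeping the first occurrence of each URL means the footnote numbers follow the order in which links appear on the page.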

Examples

Save Article as Markdown

mdfetch https://blog.example.com/great-article -o great-article.md

The output will include a metadata header:

# Article Title

**By:** Author Name
**Source:** Example Blog
**URL:** https://blog.example.com/great-article

---

Article content starts here...
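
For readers who post-process the output, the header layout shown above could be reproduced with something like the sketch below. The `ArticleMeta` shape and `renderHeader` function are hypothetical and only mirror the sample; skipping the byline and source lines when they are missing is an assumption about how mdfetch behaves.

```typescript
// Hypothetical shape: field names are illustrative, not mdfetch's own types.
interface ArticleMeta {
  title: string;
  url: string;
  byline?: string;
  siteName?: string;
}

function renderHeader(meta: ArticleMeta): string {
  const lines: string[] = [`# ${meta.title}`, ""];
  // Assumption: optional fields are omitted rather than printed empty.
  if (meta.byline) lines.push(`**By:** ${meta.byline}`);
  if (meta.siteName) lines.push(`**Source:** ${meta.siteName}`);
  // The URL line, a blank line, and the rule that separates header from body.
  lines.push(`**URL:** ${meta.url}`, "", "---", "");
  return lines.join("\n");
}
```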

Extract Just the HTML

mdfetch https://example.com/article --html -o article.html

Get Plain Text for Processing

mdfetch https://example.com/article --text | wc -w

Pipeline Usage

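# Fetch multiple articles (urls.txt holds one URL per line)
# URLs contain slashes, so derive a filesystem-safe name for each output file
while IFS= read -r url; do
  name=$(printf '%s' "$url" | sed -E 's|^https?://||; s|[^A-Za-z0-9._-]+|_|g')
  mdfetch "$url" -o "$name.md"
done < urls.txt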

# Convert and immediately view
mdfetch https://example.com/article | less

Library Usage

You can also use mdfetch as a library in your Node.js projects:

import { readURL } from '@sandgarden/mdfetch';

// Fetch and convert a URL
const result = await readURL('https://example.com/article');

console.log(result.markdown);     // Markdown version
console.log(result.plainText);    // Plain text version
console.log(result.readableHTML); // Clean HTML version

// Access metadata
console.log(result.title);        // Article title
console.log(result.byline);       // Author
console.log(result.excerpt);      // Summary
console.log(result.publishedTime);// Publication date
console.log(result.length);       // Reading length

// Custom options (separate variable name: `result` is already declared above)
const customResult = await readURL('https://example.com/article', {
  timeout: 60000,
  retries: 5,
  retryDelay: 2000
});
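
When converting several URLs from code, it can help to collect failures instead of stopping at the first one. The sketch below is one way to do that. `readAll`, `ArticleLike`, and `BatchResult` are hypothetical names, the reader function is passed in so the example runs without network access, and it assumes `readURL` rejects its promise when a page cannot be fetched or extracted.

```typescript
// Minimal subset of the result fields used here (title and markdown).
interface ArticleLike {
  title: string;
  markdown: string;
}

interface BatchResult {
  url: string;
  article?: ArticleLike;
  error?: string;
}

// `read` is injected; in real use, pass `readURL` from '@sandgarden/mdfetch'.
async function readAll(
  urls: string[],
  read: (url: string) => Promise<ArticleLike>,
): Promise<BatchResult[]> {
  return Promise.all(
    urls.map(async (url): Promise<BatchResult> => {
      try {
        return { url, article: await read(url) };
      } catch (err) {
        // Record the failure and keep going with the remaining URLs.
        return { url, error: err instanceof Error ? err.message : String(err) };
      }
    }),
  );
}
```

Results come back in the same order as the input URLs, so each entry can be matched to its source without extra bookkeeping.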

API Documentation

Generate the full API documentation with TypeDoc:

npm run docs

Then open docs/index.html in your browser.

How It Works

Fetch - Downloads the HTML content with retry logic and timeout protection
Extract - Uses Mozilla's Readability algorithm to extract the main article content
Process - Converts relative URLs to absolute URLs for images and links
Convert - Transforms HTML to clean markdown using Turndown with GFM support
Output - Returns content in all three formats: markdown, HTML, and plain text
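
The Process step can be illustrated with the standard `URL` constructor, which resolves a relative reference against a base URL. This is a sketch of the general technique only; `toAbsolute` is a hypothetical helper, and mdfetch's own code may handle edge cases differently.

```typescript
// Resolve a possibly-relative href or src against the URL of the page it came from.
function toAbsolute(ref: string, pageUrl: string): string {
  try {
    return new URL(ref, pageUrl).href;
  } catch {
    return ref; // Leave values that cannot be parsed untouched.
  }
}

const page = "https://example.com/blog/post";
console.log(toAbsolute("/img/a.png", page));          // https://example.com/img/a.png
console.log(toAbsolute("c.html", page));              // https://example.com/blog/c.html
console.log(toAbsolute("../b.html", page));           // https://example.com/b.html
console.log(toAbsolute("https://other.org/x", page)); // https://other.org/x
```

References that are already absolute pass through unchanged, so the helper can be applied to every link and image without checking first.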

Supported Content

Works best with:

Blog posts and articles
News articles
Documentation pages
Medium posts
Substack articles
Academic papers
Technical tutorials

May not work well with:

Paywalled content
JavaScript-heavy SPAs (requires pre-rendered HTML)
Sites with aggressive bot detection

Error Handling

The tool includes robust error handling:

Network Errors: Automatic retry with exponential backoff
Timeouts: Configurable timeout with graceful cancellation
4xx Errors: No retry (client errors like 404, 403)
5xx Errors: Automatic retry (server errors)
Content Errors: Clear error messages when content can't be extracted
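
The retry policy listed above might look roughly like this in code. It is a sketch under stated assumptions, not mdfetch's fetcher: `fetchWithRetry`, `Fetcher`, and `SimpleResponse` are hypothetical names, `retries` is taken to mean extra attempts after the first, and the delay is assumed to double on each retry starting from `retryDelay`.

```typescript
// Hypothetical types and names; this sketches the policy, not mdfetch's code.
interface SimpleResponse {
  status: number;
  body: string;
}

type Fetcher = (url: string) => Promise<SimpleResponse>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(
  url: string,
  fetcher: Fetcher,
  options: { retries: number; retryDelay: number },
): Promise<SimpleResponse> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= options.retries; attempt++) {
    if (attempt > 0) {
      // Assumed schedule: the delay doubles on each retry (1x, 2x, 4x, ...).
      await sleep(options.retryDelay * Math.pow(2, attempt - 1));
    }
    try {
      const res = await fetcher(url);
      if (res.status >= 400 && res.status < 500) {
        // Client errors (404, 403, ...) will not improve on retry: fail now.
        throw Object.assign(new Error(`HTTP ${res.status}`), { noRetry: true });
      }
      if (res.status >= 500) throw new Error(`HTTP ${res.status}`);
      return res;
    } catch (err) {
      if (typeof err === "object" && err !== null && "noRetry" in err) throw err;
      lastError = err; // Network or 5xx error: retry if attempts remain.
    }
  }
  throw lastError;
}
```

Passing the fetcher in as a parameter keeps the policy separate from the HTTP client, which also makes it easy to exercise with a fake in tests.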

Development

Setup

git clone https://github.com/sandgardenhq/mdfetch.git
cd mdfetch
npm install

Commands

npm run build      # Compile TypeScript
npm test           # Run tests in watch mode
npm run test:run   # Run tests once
npm run test:coverage  # Run tests with coverage
npm run docs       # Generate API documentation
npm run dev        # Run CLI in development mode

Testing

The project has comprehensive test coverage (90%+ on all metrics):

Unit tests for all core functions
Integration tests for the CLI
Mocked tests for HTTP fetching
Edge case tests for error handling

npm test

Project Structure

mdfetch/
├── src/
│   ├── cli.ts           # CLI entry point
│   ├── reader.ts        # Main library function
│   ├── fetcher.ts       # HTTP fetching with retries
│   ├── readable.ts      # Readability extraction
│   ├── types.ts         # TypeScript interfaces
│   └── __tests__/       # Test files
├── dist/                # Compiled JavaScript
├── docs/                # Generated API docs
└── package.json

Dependencies

Runtime

@mozilla/readability - Content extraction
linkedom - Lightweight DOM implementation
turndown - HTML to Markdown conversion
turndown-plugin-gfm - GitHub Flavored Markdown support
commander - CLI argument parsing

Development

typescript - Type checking and compilation
vitest - Fast unit testing
typedoc - API documentation generation

Development Workflow

This project follows strict TDD (Test-Driven Development):

Write failing tests first (RED)
Write minimal code to pass tests (GREEN)
Refactor while keeping tests green (REFACTOR)
Maintain 90%+ test coverage

See CLAUDE.md for detailed development rules.

Credits

Built with Mozilla Readability
Markdown conversion by Turndown
DOM implementation by linkedom

Made with ❤️ and strict TDD practices
