@sandgarden/mdfetch
v1.1.0
CLI tool to convert web pages to clean markdown
mdfetch
A fast, reliable CLI tool to convert web pages into clean, readable markdown.
Convert any web article into clean markdown with a single command. Uses Mozilla's Readability algorithm to extract the main content and Turndown to convert it to GitHub Flavored Markdown.
Features
- 🚀 Fast & Reliable - Built with TypeScript, exponential backoff retry logic, and robust error handling
- 📄 Multiple Output Formats - Get markdown (default), HTML, or plain text
- 🎯 Smart Content Extraction - Uses Mozilla Readability to extract just the article content
- 🔗 Absolute URLs - Automatically converts relative image and link paths to absolute URLs
- 📊 GitHub Flavored Markdown - Full support for tables, code blocks, strikethrough, and task lists
- ⚙️ Configurable - Customize timeout, retries, and output format
- 📦 Zero Config - Works out of the box with sensible defaults
Installation
Global Installation (Recommended)
npm install -g @sandgarden/mdfetch
This installs the mdfetch command globally.
Local Installation
npm install @sandgarden/mdfetch
Then use via npx:
npx mdfetch <url>
Usage
Basic Usage
# Output markdown to stdout
mdfetch https://example.com/article
# Save to a file
mdfetch https://example.com/article -o article.md
# Get HTML instead of markdown
mdfetch https://example.com/article --html
# Get plain text
mdfetch https://example.com/article --text
Advanced Options
# Custom timeout (in milliseconds)
mdfetch https://example.com/article --timeout 60000
# Custom retry settings
mdfetch https://example.com/article --retries 5 --retry-delay 2000
# Custom User-Agent header
mdfetch https://example.com/article --user-agent "my-bot/1.0"
# Combine options
mdfetch https://example.com/article -o article.md --timeout 45000
# Force Readability to parse short or borderline pages that would normally be rejected
mdfetch https://example.com/short-note --always-readable
# Append every qualifying link from the raw page as markdown footnotes
mdfetch https://example.com/link-heavy-page --all-links
# Compose both flags: loosen Readability AND include a full link archive
mdfetch https://example.com/tricky-page --always-readable --all-links
By default, mdfetch identifies itself with the User-Agent mdfetch/<version> (+https://github.com/sandgardenhq/mdfetch). Use --user-agent to override it when a site requires a specific string.
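The default identifier described above can be sketched as a small helper. This is an illustration only: `buildUserAgent`, its parameters, and the `REPO_URL` constant are hypothetical names and not mdfetch's actual internals. Only the string format is taken from the text above.

```typescript
// Hypothetical helper: reproduces the documented default User-Agent format.
const REPO_URL = "https://github.com/sandgardenhq/mdfetch";

function buildUserAgent(version: string, override?: string): string {
  // A non-empty --user-agent value takes precedence over the default identifier.
  if (override !== undefined && override.trim() !== "") {
    return override;
  }
  return `mdfetch/${version} (+${REPO_URL})`;
}

console.log(buildUserAgent("1.1.0"));
// mdfetch/1.1.0 (+https://github.com/sandgardenhq/mdfetch)
console.log(buildUserAgent("1.1.0", "my-bot/1.0"));
// my-bot/1.0
```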
All Options
Usage: mdfetch [options] <url>
CLI tool to convert web pages to clean markdown
Arguments:
  url                     URL of the web page to convert
Options:
  -V, --version           output the version number
  -o, --output <file>     Output file path (defaults to stdout)
  --html                  Output readable HTML instead of markdown
  --text                  Output plain text instead of markdown
  --timeout <ms>          Request timeout in milliseconds (default: "30000")
  --retries <count>       Number of retry attempts (default: "3")
  --retry-delay <ms>      Delay between retries in milliseconds (default: "1000")
  --always-readable       Relax Readability thresholds so short/borderline pages
                          still parse
  --all-links             Extract every qualifying link from the raw page and
                          append as markdown footnotes
  --user-agent <string>   Custom User-Agent header (defaults to mdfetch identifier)
  -h, --help              display help for command
--always-readable and --all-links
These two flags are independent and can be combined.
- --always-readable: Use this when a page is too short or too lightly structured for Mozilla Readability to accept by default (for example, a brief note, a changelog entry, or a landing page). With the flag set, Readability runs with a relaxed character threshold, and if it still cannot extract an article, mdfetch falls back to the raw <body> HTML with the <title> as the heading. This is best-effort: truly empty pages will still error.
- --all-links: Use this when you want a full archive of every outbound link on a page, regardless of whether Readability could extract the article. Links are collected from the raw HTML (including <nav>, <footer>, and sidebars, which Readability normally discards), filtered to http(s) only, deduplicated by URL, and appended to the markdown output as numbered footnotes:
  Article body...
  ---
  [^1]: [Link text](https://example.com/one)
  [^2]: [Another link](https://example.com/two)
  If Readability fails but there are extractable links, mdfetch still returns a minimal document containing the page title and the footnote block rather than erroring out.
Composing them: --always-readable --all-links gives you the best-effort article extraction plus the full link archive. This is useful for index/hub pages that are mostly links with a small amount of introductory text.
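The filter, deduplicate, and number steps described for --all-links can be sketched as a pure function. This is a minimal illustration, not mdfetch's actual code: the `PageLink` shape and the `linksToFootnotes` name are hypothetical, and the sketch assumes the hrefs have already been resolved to absolute URLs and that the first occurrence of a duplicate URL is the one kept.

```typescript
// Hypothetical shape for a link collected from the raw HTML.
interface PageLink {
  text: string;
  href: string;
}

// Keep http(s) links only, drop repeated URLs (first occurrence wins),
// and render the remainder as numbered markdown footnotes.
function linksToFootnotes(links: PageLink[]): string {
  const seen = new Set<string>();
  const lines: string[] = [];
  for (const { text, href } of links) {
    if (!/^https?:\/\//i.test(href)) continue; // skips mailto:, javascript:, bare anchors
    if (seen.has(href)) continue;              // already emitted this URL
    seen.add(href);
    lines.push(`[^${lines.length + 1}]: [${text}](${href})`);
  }
  return lines.join("\n");
}

const footnotes = linksToFootnotes([
  { text: "Link text", href: "https://example.com/one" },
  { text: "Mail us", href: "mailto:hi@example.com" },
  { text: "Duplicate", href: "https://example.com/one" },
  { text: "Another link", href: "https://example.com/two" },
]);
console.log(footnotes);
// [^1]: [Link text](https://example.com/one)
// [^2]: [Another link](https://example.com/two)
```

Numbering from the length of the output list, rather than the input index, keeps the footnote numbers contiguous after filtered and duplicate links are dropped.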
Examples
Save Article as Markdown
mdfetch https://blog.example.com/great-article -o great-article.md
The output will include a metadata header:
# Article Title
**By:** Author Name
**Source:** Example Blog
**URL:** https://blog.example.com/great-article
---
Article content starts here...
Extract Just the HTML
mdfetch https://example.com/article --html -o article.html
Get Plain Text for Processing
mdfetch https://example.com/article --text | wc -w
Pipeline Usage
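# Fetch multiple articles (URLs contain slashes, so derive a safe filename from each one)
while read -r url; do mdfetch "$url" -o "$(printf '%s' "$url" | tr -c 'A-Za-z0-9._-' '_').md"; done < urls.txt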
# Convert and immediately view
mdfetch https://example.com/article | less
Library Usage
You can also use mdfetch as a library in your Node.js projects:
import { readURL } from '@sandgarden/mdfetch';
// Fetch and convert a URL
const result = await readURL('https://example.com/article');
console.log(result.markdown); // Markdown version
console.log(result.plainText); // Plain text version
console.log(result.readableHTML); // Clean HTML version
// Access metadata
console.log(result.title); // Article title
console.log(result.byline); // Author
console.log(result.excerpt); // Summary
console.log(result.publishedTime);// Publication date
console.log(result.length); // Reading length
// Custom options (a separate variable, since `result` is already declared above)
const customResult = await readURL('https://example.com/article', {
  timeout: 60000,
  retries: 5,
  retryDelay: 2000
});
API Documentation
Full API documentation is available by generating TypeDoc:
npm run docs
Then open docs/index.html in your browser.
How It Works
- Fetch - Downloads the HTML content with retry logic and timeout protection
- Extract - Uses Mozilla's Readability algorithm to extract the main article content
- Process - Converts relative URLs to absolute URLs for images and links
- Convert - Transforms HTML to clean markdown using Turndown with GFM support
- Output - Returns content in all three formats: markdown, HTML, and plain text
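The Process step can be illustrated with the standard `URL` constructor, which resolves a relative reference against a base URL. This is a sketch under assumptions: the README does not say how mdfetch performs the conversion internally, and `toAbsoluteUrl` is a hypothetical name.

```typescript
// Resolve a possibly relative href/src against the URL of the fetched page.
// If the reference cannot be parsed, it is returned unchanged.
function toAbsoluteUrl(ref: string, pageUrl: string): string {
  try {
    return new URL(ref, pageUrl).href;
  } catch {
    return ref;
  }
}

const page = "https://blog.example.com/posts/great-article";
console.log(toAbsoluteUrl("/images/cover.png", page));
// https://blog.example.com/images/cover.png
console.log(toAbsoluteUrl("../about", page));
// https://blog.example.com/about
console.log(toAbsoluteUrl("https://cdn.example.com/a.js", page));
// https://cdn.example.com/a.js
```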
Supported Content
Works best with:
- Blog posts and articles
- News articles
- Documentation pages
- Medium posts
- Substack articles
- Academic papers
- Technical tutorials
May not work well with:
- Paywalled content
- JavaScript-heavy SPAs (requires pre-rendered HTML)
- Sites with aggressive bot detection
Error Handling
The tool includes robust error handling:
- Network Errors: Automatic retry with exponential backoff
- Timeouts: Configurable timeout with graceful cancellation
- 4xx Errors: No retry (client errors like 404, 403)
- 5xx Errors: Automatic retry (server errors)
- Content Errors: Clear error messages when content can't be extracted
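The retry rules listed above can be sketched as a small policy plus a generic retry loop. This is an illustration under stated assumptions, not mdfetch's implementation: the function names are hypothetical, and the doubling schedule (base delay times 2 to the power of the attempt number) is one common reading of "exponential backoff" that the README does not confirm.

```typescript
// Retry on network errors (no HTTP status available) and 5xx responses.
// 4xx responses are treated as final.
function shouldRetry(status: number | undefined): boolean {
  if (status === undefined) return true; // network error, no response received
  return status >= 500 && status <= 599;
}

// Assumed schedule: the base delay doubles on each attempt (attempt is 0-based).
function backoffDelay(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

// Generic loop: one initial attempt plus up to `retries` retries.
// `getStatus` extracts an HTTP status from a thrown error, if there is one.
async function withRetry<T>(
  task: () => Promise<T>,
  getStatus: (err: unknown) => number | undefined,
  retries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= retries || !shouldRetry(getStatus(err))) throw err;
      const delay = backoffDelay(attempt, baseDelayMs);
      await new Promise<void>((resolve) => {
        setTimeout(resolve, delay);
      });
    }
  }
}

console.log(shouldRetry(404)); // false
console.log(shouldRetry(503)); // true
console.log([0, 1, 2].map((a) => backoffDelay(a, 1000))); // [ 1000, 2000, 4000 ]
```

With the documented defaults (--retries 3, --retry-delay 1000), this assumed schedule would wait 1 s, 2 s, and 4 s before the three retries.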
Development
Setup
git clone https://github.com/sandgardenhq/mdfetch.git
cd mdfetch
npm install
Commands
npm run build # Compile TypeScript
npm test # Run tests in watch mode
npm run test:run # Run tests once
npm run test:coverage # Run tests with coverage
npm run docs # Generate API documentation
npm run dev # Run CLI in development mode
Testing
The project has comprehensive test coverage (90%+ on all metrics):
- Unit tests for all core functions
- Integration tests for the CLI
- Mocked tests for HTTP fetching
- Edge case tests for error handling
npm test
Project Structure
mdfetch/
├── src/
│   ├── cli.ts        # CLI entry point
│   ├── reader.ts     # Main library function
│   ├── fetcher.ts    # HTTP fetching with retries
│   ├── readable.ts   # Readability extraction
│   ├── types.ts      # TypeScript interfaces
│   └── __tests__/    # Test files
├── dist/             # Compiled JavaScript
├── docs/             # Generated API docs
└── package.json
Dependencies
Runtime
- @mozilla/readability - Content extraction
- linkedom - Lightweight DOM implementation
- turndown - HTML to Markdown conversion
- turndown-plugin-gfm - GitHub Flavored Markdown support
- commander - CLI argument parsing
Development
- typescript - Type checking and compilation
- vitest - Fast unit testing
- typedoc - API documentation generation
Development Workflow
This project follows strict TDD (Test-Driven Development):
- Write failing tests first (RED)
- Write minimal code to pass tests (GREEN)
- Refactor while keeping tests green (REFACTOR)
- Maintain 90%+ test coverage
See CLAUDE.md for detailed development rules.
Credits
- Built with Mozilla Readability
- Markdown conversion by Turndown
- DOM implementation by linkedom
Made with ❤️ and strict TDD practices
