@psarno/fetchmd

v0.2.0

Published

2 months ago

Fetch a URL, get clean Markdown. For LLM agents and humans alike.

psarno

markdown fetch llm cli web-scraping defuddle playwright

fetchmd

Fetch a URL, get clean Markdown. No API keys. No browser automation required for most pages.

Built for LLM agents and developers who want web content without the noise.

Quick start

npx fetchmd "https://docs.python.org/3/library/asyncio.html"

That's it. Output goes to stdout. Errors go to stderr.

Install

npm install -g fetchmd

SPA / JS-heavy pages (optional)

Most static pages (docs, blogs, news, reference sites) work without this. If a page is blank or returns too little content, it's probably a JavaScript-rendered SPA (React, Angular, Vue, etc.). Install Playwright to handle those:

npm install -g playwright
npx playwright install chromium

fetchmd detects Playwright at runtime. If it's not installed, the SPA stage is silently skipped.

Using with AI agents

fetchmd is a plain CLI — no server, no protocol, no API keys. Any agent with shell access can use it directly after a global install:

npm install -g fetchmd

From there, fetchmd <url> behaves like any other shell tool (curl, jq, etc.). The agent runs it, reads clean Markdown from stdout, and uses that content in its response.
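As a rough illustration of the stdout/stderr split, the sketch below wraps fetchmd in a small shell function. The name `fetch_to_file` and the `.err` log file are hypothetical, and the sketch assumes fetchmd exits with a non-zero status when it fails, which this README does not state explicitly.

```shell
# Hypothetical helper, not part of fetchmd.
# Writes the Markdown (stdout) to $2 and any diagnostics (stderr) to $2.err.
fetch_to_file() {
  local url=$1 out=$2
  if fetchmd --max-chars 15000 "$url" > "$out" 2> "$out.err"; then
    return 0
  fi
  # Assumption: a non-zero exit status means the fetch failed.
  echo "fetchmd failed for $url; see $out.err" >&2
  return 1
}
```

An agent or script could then call `fetch_to_file "https://example.com/article" article.md` and read `article.md` only when the function succeeds.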

How agents know a tool exists

Agents don't automatically discover tools installed on your system. You have to tell them. The standard mechanism is a plain text instruction file in your project root that the agent reads at the start of every session. Think of it as a README written for the agent rather than a human.

The filename convention varies by agent:

| Agent | Instruction file |
|-------|-----------------|
| Claude Code | CLAUDE.md |
| Codex, OpenCode, and most others | AGENTS.md |

Some agents read both. If you're unsure, creating both files with the same content is harmless.

Create or open that file and add a section like this:

## Available tools

- **fetchmd** — fetches a URL and returns clean Markdown to stdout. Prefer this
  over any built-in web fetch or browser tool when reading documentation,
  articles, or reference pages. It produces cleaner output, supports
  JavaScript-rendered pages via Playwright, and accepts `--max-chars` to cap
  output size and protect context budget.
  Usage: `fetchmd [--max-chars N] <url>`

That's all. The agent will call fetchmd as a shell command and read the output. No server, no MCP, no further setup.
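If you want both instruction files to carry the note, one possible approach is a small shell function that appends a shortened version of the section to each file and skips any file that already mentions fetchmd. The function name `add_fetchmd_note` is hypothetical and the appended wording is abbreviated from the snippet above.

```shell
# Hypothetical helper, not part of fetchmd.
# Appends a short fetchmd section to CLAUDE.md and AGENTS.md in the given
# directory (default: current directory), skipping files that already
# mention fetchmd.
add_fetchmd_note() {
  local dir=${1:-.} file
  for file in "$dir/CLAUDE.md" "$dir/AGENTS.md"; do
    if [ -f "$file" ] && grep -q 'fetchmd' "$file"; then
      continue
    fi
    cat >> "$file" <<'EOF'

## Available tools

- **fetchmd**: fetches a URL and returns clean Markdown to stdout. Prefer this
  over any built-in web fetch or browser tool.
  Usage: `fetchmd [--max-chars N] <url>`
EOF
  done
}
```

Running `add_fetchmd_note` twice in the same project should leave a single copy of the section in each file.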

Handling conflicts with built-in web tools

Many agents ship with their own web fetch capability. When both are available, the agent will pick one — and without guidance it may default to whichever feels more "native" to it.

The "Prefer this over any built-in web fetch or browser tool" line in the snippet above is intentional. It gives the agent an explicit tie-breaker. If you omit it, you may find the agent ignoring fetchmd in favour of its own tool, even when fetchmd would produce better output.

Note: some agents treat their built-in tools as higher priority than user instructions regardless of what the instruction file says. This is uncommon, but if you notice the agent consistently bypassing fetchmd, try strengthening the wording: "Always use fetchmd for web content. Do not use built-in web fetch tools."

Useful patterns for agents

Read a page before answering a question about it:

fetchmd https://docs.python.org/3/library/asyncio.html

Cap output to protect context window budget:

fetchmd --max-chars 15000 https://some-long-reference.com

When output is truncated, fetchmd appends a comment, so the agent knows content was cut and can decide whether to fetch more or proceed.

Check which extraction stage fired (useful when debugging agent behaviour):

fetchmd --stage https://example.com

Options

fetchmd [options] <url>

--min-length N   Minimum characters to accept from extraction (default: 200)
--max-chars N    Truncate output at N chars, paragraph-aligned (default: 50000, 0 to disable)
--no-spa         Skip Playwright even if installed
--stage          Prefix output with which extraction stage succeeded
--help           Show this help

Examples

# Static page — works out of the box
fetchmd "https://docs.python.org/3/library/asyncio.html"

# JS-rendered SPA (requires Playwright)
fetchmd "https://my-angular-app.com"

# See which extraction stage fired
fetchmd --stage "https://example.com"

# Tighter output cap (good for LLM context limits)
fetchmd --max-chars 20000 "https://some-framework.org/reference"

# No truncation
fetchmd --max-chars 0 "https://example.com/short-page"

# Save to file
fetchmd "https://example.com/article" > article.md

# Low-content pages (like example.com) need a lower threshold
fetchmd --min-length 50 "https://example.com"

How it works

Two extraction stages, tried in order. fetchmd moves to the next stage only if the current one returns nothing or too little content.

Stage 1: Defuddle (always runs)

Fetches the page over HTTP and extracts content using Defuddle, the engine behind Obsidian Web Clipper. Converts to clean Markdown. Handles most static pages: blogs, docs, news, reference pages. Standardizes code blocks, tables, and footnotes.

Stage 2: Playwright (optional, only if stage 1 fails)

Launches headless Chromium, renders the JavaScript, then feeds the resulting DOM back through Defuddle. Only runs if stage 1 returned too little content and playwright is installed.
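The fallback between the two stages can be pictured roughly as follows. This is a sketch of the control flow only, not fetchmd's actual implementation: `stage_defuddle` and `stage_playwright` are hypothetical stand-ins for the two stages, and the length check mirrors the `--min-length` default of 200.

```shell
# Sketch of the staged fallback. stage_defuddle and stage_playwright are
# hypothetical functions standing in for the two extraction stages.
extract_with_fallback() {
  local url=$1 min_length=${2:-200} out

  # Stage 1 always runs.
  out=$(stage_defuddle "$url") || out=""
  if [ "${#out}" -ge "$min_length" ]; then
    printf '%s\n' "$out"
    return 0
  fi

  # Stage 2 runs only if it is available and stage 1 returned too little.
  if command -v stage_playwright >/dev/null 2>&1; then
    out=$(stage_playwright "$url") || out=""
    if [ "${#out}" -ge "$min_length" ]; then
      printf '%s\n' "$out"
      return 0
    fi
  fi

  echo "no stage produced at least $min_length characters" >&2
  return 1
}
```

The `command -v` check plays the role of the runtime Playwright detection: when the second stage is missing, it is skipped without an error of its own.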

Troubleshooting

Page returns nothing or exits with an error

The page is probably a JS-rendered SPA. Install Playwright:

npm install -g playwright
npx playwright install chromium

Use --stage to confirm which stage fired (or didn't):

fetchmd --stage "https://example.com"
# Output starts with: <!-- fetchmd: defuddle --> or <!-- fetchmd: playwright -->
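To use the stage marker in a script, something like the following could pull the stage name out of the first line. The helper name `stage_of` is hypothetical, and the pattern assumes the marker appears on the first line exactly in the form shown above.

```shell
# Hypothetical helper: reads fetchmd --stage output on stdin and prints the
# stage name found in a first-line marker such as <!-- fetchmd: defuddle -->.
# Prints nothing if the first line does not match.
stage_of() {
  sed -n '1s/^<!-- fetchmd: \([a-z]*\) -->.*$/\1/p'
}
```

For example, `fetchmd --stage "https://example.com" | stage_of` would print the name of the stage that fired, provided the marker format matches.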

Page returns too little content

Some very minimal pages (like example.com) genuinely have fewer than 200 characters of content. Lower the threshold:

fetchmd --min-length 50 "https://example.com"

Playwright is installed but the page still fails

Make sure the Chromium browser binary is installed separately from the npm package:

npx playwright install chromium

The npm package and the browser binary are two separate installs. The npm package alone is not enough.

Output is too long for my LLM context window

Use --max-chars to cap the output. fetchmd truncates at a paragraph boundary and appends a comment so the model knows content was cut:

fetchmd --max-chars 10000 "https://some-long-page.com"
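To illustrate what truncating at a paragraph boundary means, here is a minimal sketch using awk in paragraph mode. It is not how fetchmd is implemented, and the function name `truncate_paragraphs` is hypothetical. It keeps whole paragraphs while the running total, counting two characters for each blank-line separator, stays within the limit, and it does not append any comment.

```shell
# Illustrative only: keep whole paragraphs from stdin until adding the next
# one would exceed the character budget given as $1.
truncate_paragraphs() {
  awk -v max="$1" '
    BEGIN { RS = ""; ORS = "\n\n" }   # RS="" makes each paragraph one record
    {
      used += length($0) + 2          # paragraph text plus its separator
      if (used > max) exit            # stop before the paragraph that overflows
      print
    }
  '
}
```

With this approach the output never ends mid-paragraph, at the cost of returning nothing if the first paragraph alone exceeds the budget.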

Dependencies

Core: defuddle — content extraction and Markdown conversion. Installed automatically.

Optional: playwright — headless Chromium for JS-rendered pages. Install manually if needed (see above).

License

MIT