kosyak-fetch-mcp

v1.1.8, published 11 days ago

MCP server for fetching web content as Markdown. Transparent handling of PDFs, Reddit/Medium/Hacker News/Discourse, Cloudflare anti-bot, and YouTube transcripts.

Vulnerabilities: 0 High, 0 Medium, 0 Low

Maintainers: maximpiragov

Keywords: mcp model-context-protocol fetch web-scraping markdown readability pdf youtube-transcript reddit medium hacker-news discourse cloudflare-bypass claude

kosyak-fetch-mcp

Model Context Protocol server for fetching web content as clean Markdown from Claude Code, Claude Desktop, or any MCP client. 3 tools, transparent handling of PDFs + 5 scraper-hostile platforms.

Highlights

Zero-setup content extraction — Mozilla Readability + OpenGraph metadata header (title · author · site · date), strips nav/ads/sidebars by default
PDFs just work — application/pdf auto-detected, text-extracted via pdf-parse
Cloudflare auto-bypass — transparent fallback to a Chrome TLS fingerprint (CycleTLS, bundled Go binary)
Reddit / Medium / Hacker News / Discourse rewriters — URLs of these scraper-hostile platforms are rewritten under the hood to their scraper-friendly counterparts (Atom feed, readmedium.com, Algolia API, .json endpoint)
YouTube transcripts — timestamped captions via yt-dlp; prefers human-written captions and falls back to auto-generated ones
SSRF-hardened — reserved-IP blocklist (CVE-2025-8020 patched), DNS pinning via undici Agent, per-hop redirect validation
LLM-friendly errors — HTTP status hints (404 → "not found", 429 → "rate limited"); when a JSON fetch gets HTML back, the error tells the model to switch tools
Auto-retry on transient 5xx/429 with exponential backoff + Retry-After support
In-memory LRU cache — pagination on a large page doesn't re-fetch
Charset-aware decoding — Latin-1 / Shift-JIS / Windows-1251 pages no longer come back as replacement characters
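The auto-retry item above can be illustrated with a small delay calculation. This is a minimal sketch, not the server's actual code: the names `isTransient` and `retryDelayMs`, the 500 ms base delay, and the 30 s cap are all assumptions. Only the retry-on-5xx/429 rule, the exponential backoff, and the Retry-After support come from the list above.

```typescript
// Hypothetical sketch: names and constants are illustrative, not the package's API.
const BASE_DELAY_MS = 500;  // assumed base delay
const MAX_DELAY_MS = 30000; // assumed upper bound

/** True for the transient statuses the README says are retried (429 and 5xx). */
function isTransient(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

/**
 * Delay before retry number `attempt` (0-based).
 * A numeric Retry-After header (in seconds) takes precedence over the backoff.
 * The HTTP-date form of Retry-After is not handled in this sketch.
 */
function retryDelayMs(attempt: number, retryAfter?: string | null): number {
  if (retryAfter != null && /^\d+$/.test(retryAfter.trim())) {
    return Math.min(Number(retryAfter.trim()) * 1000, MAX_DELAY_MS);
  }
  // Exponential backoff: base, 2x base, 4x base, ... up to the cap.
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}
```

With the default FETCH_MAX_RETRIES of 2, a caller built on this sketch would wait `retryDelayMs(0)` and then `retryDelayMs(1)` before giving up.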

Quick Start

Add the server entry below to ~/.claude.json, or to Claude Desktop's config file: %APPDATA%\Claude\claude_desktop_config.json on Windows, ~/Library/Application Support/Claude/claude_desktop_config.json on macOS.

Windows

{
  "mcpServers": {
    "fetch": {
      "type": "stdio",
      "command": "npx.cmd",
      "args": ["-y", "kosyak-fetch-mcp"],
      "env": {
        "DEFAULT_LIMIT": "",
        "MAX_RESPONSE_BYTES": "",
        "FETCH_TIMEOUT_SECONDS": "",
        "FETCH_MAX_RETRIES": "",
        "FETCH_CACHE_DISABLED": "",
        "FETCH_CACHE_TTL_SECONDS": "",
        "FETCH_CACHE_MAX": "",
        "FETCH_CYCLETLS_DISABLED": "",
        "CYCLETLS_JA3": "",
        "PROXY_URL": ""
      }
    }
  }
}

Claude Desktop on Windows does not spawn MCP servers through a shell, so use npx.cmd (not npx) to avoid spawn npx ENOENT. -y skips the install prompt so first launch doesn't hang.

macOS / Linux

{
  "mcpServers": {
    "fetch": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "kosyak-fetch-mcp"],
      "env": {
        "DEFAULT_LIMIT": "",
        "MAX_RESPONSE_BYTES": "",
        "FETCH_TIMEOUT_SECONDS": "",
        "FETCH_MAX_RETRIES": "",
        "FETCH_CACHE_DISABLED": "",
        "FETCH_CACHE_TTL_SECONDS": "",
        "FETCH_CACHE_MAX": "",
        "FETCH_CYCLETLS_DISABLED": "",
        "CYCLETLS_JA3": "",
        "PROXY_URL": ""
      }
    }
  }
}

Leave unused keys as empty strings — they fall back to defaults documented below. Restart your MCP client after saving.

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| DEFAULT_LIMIT | 5000 | Default max_length (0 = unlimited) |
| MAX_RESPONSE_BYTES | 10485760 | Body size cap (10 MB) |
| FETCH_TIMEOUT_SECONDS | 30 | Per-request HTTP timeout (seconds) |
| FETCH_MAX_RETRIES | 2 | Auto-retries on transient 5xx/429 (0 = disable) |
| FETCH_CACHE_DISABLED | — | Set to 1 to disable response cache |
| FETCH_CACHE_TTL_SECONDS | 300 | Cache TTL (seconds) |
| FETCH_CACHE_MAX | 50 | LRU cache size |
| FETCH_CYCLETLS_DISABLED | — | Set to 1 to disable Cloudflare fallback |
| CYCLETLS_JA3 | Chrome 144 | Override the TLS fingerprint used by the Cloudflare-fallback backend. Default matches bogdanfinn/tls-client's Chrome_144 profile (modern extension set incl. ApplicationSettings 17613, CompressCertificate 27, GREASE ECH 65037, post-quantum X25519MLKEM768); the User-Agent advertised on the same request also says Chrome 144 so the UA ↔ TLS fingerprint pair stays internally consistent. JA3 string format: <version>,<ciphers>,<extensions>,<elliptic-curves>,<ec-point-formats>. |
| PROXY_URL | — | HTTP(S) proxy for all outbound requests |

All variables go in the env block of your MCP client config.
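The "empty string falls back to the default" behaviour can be sketched as follows. This is an illustration under assumptions, not the package's code: `resolveNumber` and `loadConfig` are hypothetical names, and treating a non-numeric value as unset is a guess. The variable names and defaults are taken from the table above.

```typescript
// Hypothetical sketch of env resolution; function names are illustrative only.
type Env = Record<string, string | undefined>;

/** Returns the parsed number, or `fallback` when the value is unset, empty, or not numeric. */
function resolveNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback; // assumption: bad input uses the default
}

/** Builds a config object from an env map, using the documented defaults. */
function loadConfig(env: Env) {
  return {
    defaultLimit: resolveNumber(env["DEFAULT_LIMIT"], 5000),
    maxResponseBytes: resolveNumber(env["MAX_RESPONSE_BYTES"], 10485760),
    timeoutSeconds: resolveNumber(env["FETCH_TIMEOUT_SECONDS"], 30),
    maxRetries: resolveNumber(env["FETCH_MAX_RETRIES"], 2),
    cacheDisabled: env["FETCH_CACHE_DISABLED"] === "1",
    cacheTtlSeconds: resolveNumber(env["FETCH_CACHE_TTL_SECONDS"], 300),
    cacheMax: resolveNumber(env["FETCH_CACHE_MAX"], 50),
    cycleTlsDisabled: env["FETCH_CYCLETLS_DISABLED"] === "1",
  };
}
```

Note that an explicit "0" is kept as 0 rather than replaced by the default, which matters for DEFAULT_LIMIT (0 = unlimited) and FETCH_MAX_RETRIES (0 = disable).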

Tools (3)

`fetch_page`

Fetch any URL and return Markdown. Extracts the main article body by default (Mozilla Readability); set fullpage: true to include navigation, menus, and sidebars — use only for structural queries like "list all links on this docs index" or "extract this nav menu".

| Parameter | Type | Description |
|-----------|------|-------------|
| url | string (required) | URL to fetch |
| headers | object | Custom HTTP headers |
| max_length | number | Max chars to return (default: 5000) |
| start_index | number | Offset to continue from a truncated previous call |
| fullpage | boolean | Return whole page instead of article extraction |

Handles PDFs, Cloudflare-protected pages, Reddit threads, Medium articles, Hacker News items, and Discourse threads transparently. See Transparent platform handling.
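The interplay of max_length and start_index can be pictured as slicing one Markdown string into windows. The sketch below is only a model of that idea: `pageSlice` and the `Page` shape are hypothetical, and it assumes offsets are character indexes into the converted Markdown and that a max_length of 0 means unlimited. How the real tool signals truncation is not specified here.

```typescript
// Hypothetical model of pagination; not the tool's actual response format.
interface Page {
  text: string;
  nextStartIndex: number | null; // null when nothing is left to fetch
}

/** Returns the window [startIndex, startIndex + maxLength) of the document. */
function pageSlice(markdown: string, startIndex = 0, maxLength = 5000): Page {
  const end =
    maxLength === 0
      ? markdown.length // assumed: 0 means "no limit"
      : Math.min(markdown.length, startIndex + maxLength);
  return {
    text: markdown.slice(startIndex, end),
    nextStartIndex: end < markdown.length ? end : null,
  };
}
```

A client would keep calling with `start_index` set to the previous `nextStartIndex` until it comes back as null.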

`fetch_json`

Fetch a JSON endpoint (REST APIs, OpenAPI specs, package registries, manifests) and return the parsed JSON as a compact string. Gives an LLM-actionable hint if the URL returns HTML instead of JSON.

Works directly against package registries:

https://registry.npmjs.org/PACKAGE
https://pypi.org/pypi/PACKAGE/json
https://crates.io/api/v1/crates/NAME
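The "hint when HTML comes back instead of JSON" behaviour might look roughly like this. It is a sketch under assumptions: `parseJsonOrHint`, the `JsonResult` type, the detection heuristic, and the hint wording are all invented for illustration and do not reflect the package's real messages.

```typescript
// Hypothetical sketch; names, heuristic, and hint text are illustrative only.
type JsonResult =
  | { ok: true; json: string }
  | { ok: false; hint: string };

/** Returns compact JSON, or a hint the calling model can act on. */
function parseJsonOrHint(contentType: string | null, body: string): JsonResult {
  const looksHtml =
    (contentType ?? "").toLowerCase().includes("text/html") ||
    /^\s*<(!doctype|html)\b/i.test(body);
  if (looksHtml) {
    return { ok: false, hint: "Response is HTML, not JSON. Try fetch_page for this URL." };
  }
  try {
    // Re-serialising without indentation yields the compact string form.
    return { ok: true, json: JSON.stringify(JSON.parse(body)) };
  } catch {
    return { ok: false, hint: "Response body is not valid JSON." };
  }
}
```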

`fetch_youtube_transcript`

Fetch a YouTube video's transcript as timestamped lines. Prefers human-written captions, falls back to auto-generated. Requires yt-dlp on PATH.

| Parameter | Type | Description |
|-----------|------|-------------|
| url | string (required) | YouTube video URL |
| lang | string | BCP-47 caption language (default: en) |
| max_length, start_index | number | Pagination |

Install yt-dlp:

winget install --id yt-dlp.yt-dlp --source winget   # Windows
brew install yt-dlp                                  # macOS
pipx install yt-dlp                                  # Linux
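The caption preference described for this tool (human-written first, auto-generated as fallback) amounts to a two-step lookup. The sketch below is illustrative only: `CaptionTrack` and `pickTrack` are hypothetical names, and the exact-match comparison on the language tag is an assumption. It does not call yt-dlp.

```typescript
// Hypothetical sketch of the preference order; not the package's code.
interface CaptionTrack {
  lang: string;            // BCP-47 tag, e.g. "en"
  kind: "manual" | "auto"; // human-written vs auto-generated
}

/** Picks a human-written track for `lang` if one exists, else an auto-generated one. */
function pickTrack(tracks: CaptionTrack[], lang = "en"): CaptionTrack | undefined {
  const matches = tracks.filter((t) => t.lang === lang);
  return matches.find((t) => t.kind === "manual") ?? matches.find((t) => t.kind === "auto");
}
```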

Examples

CLI (after npm i -g kosyak-fetch-mcp):

# Article extraction (default)
mcp-fetch page https://example.com/blog/post

# Whole page including nav / menus
mcp-fetch page https://example.com --fullpage

# Package metadata via the registry API
mcp-fetch json https://registry.npmjs.org/undici

# YouTube transcript
mcp-fetch youtube https://www.youtube.com/watch?v=UF8uR6Z6KLc --lang en

# Paginate a large page
mcp-fetch page https://very-long-post.example/ --max-length 10000 --start-index 10000

Transparent platform handling

Some URLs are routed through alternative endpoints to bypass anti-scraper blocks or recover lost structure. The caller never sees this — the LLM passes the original URL, we transparently hit the working source.

| Platform | Rewrite | Why |
|----------|---------|-----|
| PDFs | — | Content-Type sniffed, pdf-parse extracts text + metadata |
| Reddit threads | old.reddit.com/…/.rss | Main site 403s scrapers; RSS exposes post + top-level comments as Atom |
| Medium + publications | readmedium.com proxy | Medium blocks non-browser TLS; readmedium SSRs the article |
| Hacker News /item?id=N | hn.algolia.com/api/v1/items/N | HN's <table>-layout HTML breaks Turndown; Algolia returns clean JSON + nested comments |
| Discourse threads | /t/slug/ID.json | Static HTML on Discourse is a mostly-empty Ember shell; JSON has the full post_stream |
| Cloudflare-protected | Retry via CycleTLS (Chrome JA3) | Node's default TLS fingerprint gets 403/503 from CF; Chrome fingerprint passes |

Discourse communities recognised out of the box: Rust (users/internals), Elixir, PyTorch, HuggingFace, OpenAI, Django, Erlang, freeCodeCamp, and a few more. Add more via PR.
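Two of the rewrites in the table above are simple enough to sketch. This is a hedged illustration, not the package's rewriter: `rewriteUrl` and `DISCOURSE_HOSTS` are hypothetical names, the host set is a two-entry stand-in for the real list, and the Reddit and Medium rewrites are omitted.

```typescript
// Hypothetical sketch; the real host list and matching rules may differ.
const DISCOURSE_HOSTS = new Set(["users.rust-lang.org", "internals.rust-lang.org"]);

/** Returns the scraper-friendly URL for a known platform, or the input unchanged. */
function rewriteUrl(input: string): string {
  const url = new URL(input);

  // Hacker News item page -> Algolia items API
  if (url.hostname === "news.ycombinator.com" && url.pathname === "/item") {
    const id = url.searchParams.get("id");
    if (id !== null && /^\d+$/.test(id)) {
      return `https://hn.algolia.com/api/v1/items/${id}`;
    }
  }

  // Discourse thread /t/slug/ID -> /t/slug/ID.json
  if (DISCOURSE_HOSTS.has(url.hostname)) {
    const match = /^\/t\/([^/]+)\/(\d+)\/?$/.exec(url.pathname);
    if (match) {
      return `${url.origin}/t/${match[1]}/${match[2]}.json`;
    }
  }

  return input; // no rewrite applies
}
```

Because unmatched URLs are returned as-is, a layer like this can sit in front of the normal fetch path without affecting other sites.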

What Cloudflare bypass does NOT help with (needs a real browser): Turnstile captcha, DataDome / PerimeterX JS challenges, cookie-based sessions.

Security

SSRF protections active on every request:

URL validation — rejects non-HTTP(S) schemes, localhost, and direct IP URLs in reserved ranges
Reserved-IP blocklist — full IANA reserved ranges for IPv4 and IPv6, including 224.0.0.0/4 multicast (CVE-2025-8020 patched — was missing in the private-ip package we replaced)
DNS pinning — undici Agent.connect.lookup hook returns the pre-validated IP; the runtime can't re-resolve to something private
Per-hop redirect validation — each redirect target goes through the same URL + IP checks before the next fetch
User-supplied proxies rejected — proxy is server-only via PROXY_URL to prevent SSRF bypass via proxy=http://169.254.169.254/
Credentials scrubbed from error messages — https://user:pass@host is converted to Authorization: Basic before fetch; the password never appears in logs or error output
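The reserved-IP check from the list above can be sketched for IPv4 as a CIDR membership test. This is a minimal illustration with an assumed subset of ranges: `RESERVED_V4`, `ipv4ToInt`, and `isReservedIPv4` are hypothetical names, the list is deliberately incomplete, and IPv6 is not covered.

```typescript
// Illustrative subset of IANA special-purpose IPv4 ranges; not the package's full list.
const RESERVED_V4: ReadonlyArray<[string, number]> = [
  ["0.0.0.0", 8],
  ["10.0.0.0", 8],
  ["127.0.0.0", 8],
  ["169.254.0.0", 16], // link-local, includes 169.254.169.254
  ["172.16.0.0", 12],
  ["192.168.0.0", 16],
  ["224.0.0.0", 4],    // multicast
  ["240.0.0.0", 4],
];

/** Parses a dotted-quad IPv4 address into an unsigned 32-bit number, or null if malformed. */
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

/** True when the address falls in any listed block; unparseable input fails closed. */
function isReservedIPv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true;
  return RESERVED_V4.some(([base, bits]) => {
    const baseInt = ipv4ToInt(base) as number;
    const blockSize = 2 ** (32 - bits); // number of addresses in the block
    return Math.floor(addr / blockSize) === Math.floor(baseInt / blockSize);
  });
}
```

For the check to be meaningful it has to run on the IP that is actually connected to, which is the role the DNS pinning item plays.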

The Cloudflare-fallback path (CycleTLS Go subprocess) cannot pin DNS — a TTL=0 rebinding attack window exists there. Disable the fallback with FETCH_CYCLETLS_DISABLED=1 if you're running in a cloud environment with reachable internal metadata endpoints.

Troubleshooting

spawn npx ENOENT on Windows → use "command": "npx.cmd", not "npx".
Claude Desktop hangs on first use → -y missing from args.
yt-dlp not found after winget install → winget drops it in a WinGet\Packages\… subdir that isn't on PATH by default. Add it to PATH or copy yt-dlp.exe to a dir that already is.

Development

git clone https://github.com/kosyakdev/fetch-mcp.git
cd fetch-mcp
bun install
bun run dev     # watch mode
bun test        # 316 tests
bun run build   # produces dist/

License

MIT. Forked from zcaceres/fetch-mcp and rebuilt around a different tool surface, URL-rewriter layer, PDF support, and Cloudflare auto-bypass.
