kosyak-fetch-mcp
v1.1.8
MCP server for fetching web content as Markdown. Transparent handling of PDFs, Reddit/Medium/Hacker News/Discourse, Cloudflare anti-bot, and YouTube transcripts.
kosyak-fetch-mcp
Model Context Protocol server for fetching web content as clean Markdown from Claude Code, Claude Desktop, or any MCP client. It exposes 3 tools and transparently handles PDFs plus 5 scraper-hostile platforms.
Highlights
- Zero-setup content extraction: Mozilla Readability + OpenGraph metadata header (title · author · site · date), strips nav/ads/sidebars by default
- PDFs just work: application/pdf auto-detected, text extracted via pdf-parse
- Cloudflare auto-bypass: transparent fallback to a Chrome TLS fingerprint (CycleTLS, bundled Go binary)
- Reddit / Medium / Hacker News / Discourse rewriters: URLs of these scraper-hostile platforms are rewritten under the hood to their scraper-friendly counterparts (Atom feed, readmedium.com, Algolia API, .json endpoint)
- YouTube transcripts: timestamped captions via yt-dlp, prefers human-written captions and falls back to auto-generated
- SSRF-hardened: reserved-IP blocklist (CVE-2025-8020 patched), DNS pinning via undici Agent, per-hop redirect validation
- LLM-friendly errors: HTTP status hints (404 → "not found", 429 → "rate limited"); an HTML response to a JSON request tells the model to switch tools
- Auto-retry on transient 5xx/429 with exponential backoff + Retry-After support
- In-memory LRU cache: pagination on a large page doesn't re-fetch
- Charset-aware decoding: Latin-1 / Shift-JIS / Windows-1251 pages no longer come back as replacement characters
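The auto-retry highlight can be illustrated with a small delay calculator. This is a minimal sketch, not the server's actual code: the names retryDelayMs and isTransient, the 500 ms base delay, and the 30 s cap are all assumptions made for illustration.

```typescript
// Hypothetical helper: how long to wait before retry number `attempt` (0-based).
// baseMs and capMs are illustrative values, not the server's real settings.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  baseMs = 500,
  capMs = 30_000,
): number {
  if (retryAfter !== null && retryAfter.trim() !== "") {
    // Retry-After may be a number of seconds...
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return Math.min(seconds * 1000, capMs);
    }
    // ...or an HTTP date.
    const date = Date.parse(retryAfter);
    if (!Number.isNaN(date)) {
      return Math.min(Math.max(date - Date.now(), 0), capMs);
    }
  }
  // No usable header: double the delay on each attempt, up to the cap.
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Only 429 and 5xx responses are treated as transient.
function isTransient(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}
```

A caller would loop up to FETCH_MAX_RETRIES times, sleeping for retryDelayMs(attempt, response.headers.get("retry-after")) whenever isTransient(response.status) holds.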
Quick Start
Add one of the following configs to ~/.claude.json (or to Claude Desktop's %APPDATA%\Claude\claude_desktop_config.json on Windows, ~/Library/Application Support/Claude/claude_desktop_config.json on macOS).
Windows
{
  "mcpServers": {
    "fetch": {
      "type": "stdio",
      "command": "npx.cmd",
      "args": ["-y", "kosyak-fetch-mcp"],
      "env": {
        "DEFAULT_LIMIT": "",
        "MAX_RESPONSE_BYTES": "",
        "FETCH_TIMEOUT_SECONDS": "",
        "FETCH_MAX_RETRIES": "",
        "FETCH_CACHE_DISABLED": "",
        "FETCH_CACHE_TTL_SECONDS": "",
        "FETCH_CACHE_MAX": "",
        "FETCH_CYCLETLS_DISABLED": "",
        "CYCLETLS_JA3": "",
        "PROXY_URL": ""
      }
    }
  }
}
Claude Desktop on Windows does not spawn MCP servers through a shell, so use
npx.cmd (not npx) to avoid spawn npx ENOENT. -y skips the install
prompt so first launch doesn't hang.
macOS / Linux
{
  "mcpServers": {
    "fetch": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "kosyak-fetch-mcp"],
      "env": {
        "DEFAULT_LIMIT": "",
        "MAX_RESPONSE_BYTES": "",
        "FETCH_TIMEOUT_SECONDS": "",
        "FETCH_MAX_RETRIES": "",
        "FETCH_CACHE_DISABLED": "",
        "FETCH_CACHE_TTL_SECONDS": "",
        "FETCH_CACHE_MAX": "",
        "FETCH_CYCLETLS_DISABLED": "",
        "CYCLETLS_JA3": "",
        "PROXY_URL": ""
      }
    }
  }
}
Leave unused keys as empty strings; they fall back to the defaults documented below. Restart your MCP client after saving.
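To make the empty-string fallback concrete, here is a minimal sketch of how such a reader could look. The default values are copied from the Environment Variables table; the function name readNumber and the choice to also fall back on non-numeric or negative input are assumptions, not confirmed behavior of the server.

```typescript
// Defaults copied from the Environment Variables table.
const DEFAULTS = {
  DEFAULT_LIMIT: 5000,
  MAX_RESPONSE_BYTES: 10485760,
  FETCH_TIMEOUT_SECONDS: 30,
  FETCH_MAX_RETRIES: 2,
  FETCH_CACHE_TTL_SECONDS: 300,
  FETCH_CACHE_MAX: 50,
} as const;

type NumericKey = keyof typeof DEFAULTS;

// Hypothetical reader: unset or empty values fall back to the default.
// Falling back on non-numeric or negative input is an assumption of this sketch.
function readNumber(
  env: Record<string, string | undefined>,
  key: NumericKey,
): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return DEFAULTS[key];
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULTS[key];
}
```

In a Node process this would be called as readNumber(process.env, "DEFAULT_LIMIT"). Note that an explicit "0" is kept rather than replaced, since 0 carries meaning for DEFAULT_LIMIT and FETCH_MAX_RETRIES.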
Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| DEFAULT_LIMIT | 5000 | Default max_length (0 = unlimited) |
| MAX_RESPONSE_BYTES | 10485760 | Body size cap (10 MB) |
| FETCH_TIMEOUT_SECONDS | 30 | Per-request HTTP timeout |
| FETCH_MAX_RETRIES | 2 | Auto-retries on transient 5xx/429 (0 = disable) |
| FETCH_CACHE_DISABLED | — | Set to 1 to disable response cache |
| FETCH_CACHE_TTL_SECONDS | 300 | Cache TTL |
| FETCH_CACHE_MAX | 50 | LRU cache size |
| FETCH_CYCLETLS_DISABLED | — | Set to 1 to disable Cloudflare fallback |
| CYCLETLS_JA3 | Chrome 144 (source) | Override the TLS fingerprint used by the Cloudflare-fallback backend. Default matches bogdanfinn/tls-client's Chrome_144 profile (modern extension set incl. ApplicationSettings 17613, CompressCertificate 27, GREASE ECH 65037, post-quantum X25519MLKEM768); the User-Agent advertised on the same request also says Chrome 144 so the UA ↔ TLS fingerprint pair stays internally consistent. JA3 string format: <version>,<ciphers>,<extensions>,<elliptic-curves>,<ec-point-formats>. |
| PROXY_URL | — | HTTP(S) proxy for all outbound requests |
All variables go in the env block of your MCP client config.
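FETCH_CACHE_TTL_SECONDS and FETCH_CACHE_MAX together describe a size-bounded cache whose entries also expire. The sketch below shows one common way to build that on top of a Map; the class name TtlLruCache and its injectable clock are illustrative assumptions, and the server's real cache may be implemented differently.

```typescript
// Minimal TTL-bounded LRU cache. A Map iterates in insertion order,
// so the first key is always the least recently used one.
class TtlLruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

With the documented defaults this would be constructed as new TtlLruCache<string>(50, 300 * 1000), keyed by URL, so a follow-up call with a higher start_index can be served from memory.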
Tools (3)
fetch_page
Fetch any URL and return Markdown. Extracts the main article body by default
(Mozilla Readability); set fullpage: true to include navigation, menus, and
sidebars — use only for structural queries like "list all links on this docs
index" or "extract this nav menu".
| Parameter | Type | Description |
|-----------|------|-------------|
| url | string (required) | URL to fetch |
| headers | object | Custom HTTP headers |
| max_length | number | Max chars to return (default: 5000) |
| start_index | number | Offset to continue from a truncated previous call |
| fullpage | boolean | Return whole page instead of article extraction |
Handles PDFs, Cloudflare-protected pages, Reddit threads, Medium articles, Hacker News items, and Discourse threads transparently. See Transparent platform handling.
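The max_length and start_index parameters amount to slicing the converted Markdown. The following is a minimal sketch of that idea; slicePage and the nextStartIndex field are hypothetical names, and the assumption that lengths are counted in JavaScript string units (UTF-16 code units) is not confirmed by the server's documentation.

```typescript
interface Page {
  text: string;
  nextStartIndex: number | null; // null when nothing is left to read
}

// Hypothetical slicer mirroring max_length / start_index.
// maxLength 0 means "unlimited", matching the DEFAULT_LIMIT description.
function slicePage(content: string, startIndex = 0, maxLength = 5000): Page {
  const end =
    maxLength === 0
      ? content.length
      : Math.min(startIndex + maxLength, content.length);
  const text = content.slice(startIndex, end);
  return { text, nextStartIndex: end < content.length ? end : null };
}
```

A client reading a long page would call again with start_index set to the previous nextStartIndex until it becomes null.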
fetch_json
Fetch a JSON endpoint (REST APIs, OpenAPI specs, package registries, manifests) and return the parsed JSON as a compact string. Gives an LLM-actionable hint if the URL returns HTML instead of JSON.
Works directly against package registries:
- https://registry.npmjs.org/PACKAGE
- https://pypi.org/pypi/PACKAGE/json
- https://crates.io/api/v1/crates/NAME
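The "HTML instead of JSON" hint can be pictured as a small check before parsing. This is a sketch under assumptions: parseJsonBody is a hypothetical name, the detection heuristic is simplified, and the hint wording is illustrative rather than the server's actual message.

```typescript
type JsonResult =
  | { ok: true; json: string }
  | { ok: false; hint: string };

// Hypothetical check; the hint text is illustrative only.
function parseJsonBody(contentType: string | null, body: string): JsonResult {
  const looksLikeHtml =
    (contentType ?? "").toLowerCase().includes("text/html") ||
    /^\s*<(!doctype|html)/i.test(body);
  if (looksLikeHtml) {
    return { ok: false, hint: "URL returned HTML, not JSON. Try fetch_page instead." };
  }
  try {
    // Re-serialize without whitespace to get a compact string.
    return { ok: true, json: JSON.stringify(JSON.parse(body)) };
  } catch {
    return { ok: false, hint: "Response body is not valid JSON." };
  }
}
```

Returning a hint instead of throwing gives the calling model something it can act on, such as retrying the same URL with fetch_page.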
fetch_youtube_transcript
Fetch a YouTube video's transcript as timestamped lines. Prefers human-written
captions, falls back to auto-generated. Requires yt-dlp on PATH.
| Parameter | Type | Description |
|-----------|------|-------------|
| url | string (required) | YouTube video URL |
| lang | string | BCP-47 caption language (default: en) |
| max_length, start_index | number | Pagination |
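The caption preference order (human-written first, auto-generated as a fallback) can be sketched as a simple selection over the available tracks. The CaptionTrack shape and the pickTrack name are hypothetical; the metadata yt-dlp actually reports is richer than this.

```typescript
// Hypothetical shape for a caption track.
interface CaptionTrack {
  lang: string;
  kind: "manual" | "auto";
}

// Prefer a human-written track in the requested language,
// otherwise fall back to an auto-generated one.
function pickTrack(tracks: CaptionTrack[], lang = "en"): CaptionTrack | undefined {
  const matches = tracks.filter((t) => t.lang === lang);
  return (
    matches.find((t) => t.kind === "manual") ??
    matches.find((t) => t.kind === "auto")
  );
}
```

When no track exists in the requested language the function returns undefined, which a caller could turn into an error that lists the languages that are available.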
Install yt-dlp:
winget install --id yt-dlp.yt-dlp --source winget   # Windows
brew install yt-dlp                                 # macOS
pipx install yt-dlp                                 # Linux
Examples
CLI (after npm i -g kosyak-fetch-mcp):
# Article extraction (default)
mcp-fetch page https://example.com/blog/post
# Whole page including nav / menus
mcp-fetch page https://example.com --fullpage
# Package metadata via the registry API
mcp-fetch json https://registry.npmjs.org/undici
# YouTube transcript
mcp-fetch youtube https://www.youtube.com/watch?v=UF8uR6Z6KLc --lang en
# Paginate a large page
mcp-fetch page https://very-long-post.example/ --max-length 10000 --start-index 10000
Transparent platform handling
Some URLs are routed through alternative endpoints to bypass anti-scraper blocks or recover lost structure. The caller never sees this — the LLM passes the original URL, we transparently hit the working source.
| Platform | Rewrite | Why |
|----------|---------|-----|
| PDFs | — | Content-Type sniffed, pdf-parse extracts text + metadata |
| Reddit threads | old.reddit.com/…/.rss | Main site 403s scrapers; RSS exposes post + top-level comments as Atom |
| Medium + publications | readmedium.com proxy | Medium blocks non-browser TLS; readmedium SSRs the article |
| Hacker News /item?id=N | hn.algolia.com/api/v1/items/N | HN's <table>-layout HTML breaks Turndown; Algolia returns clean JSON + nested comments |
| Discourse threads | /t/slug/ID.json | Static HTML on Discourse is a mostly-empty Ember shell; JSON has the full post_stream |
| Cloudflare-protected | Retry via CycleTLS (Chrome JA3) | Node's default TLS fingerprint gets 403/503 from CF; Chrome fingerprint passes |
Discourse communities recognised out of the box: Rust (users/internals), Elixir, PyTorch, HuggingFace, OpenAI, Django, Erlang, freeCodeCamp, and a few more. Add more via PR.
What Cloudflare bypass does NOT help with (needs a real browser): Turnstile captcha, DataDome / PerimeterX JS challenges, cookie-based sessions.
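Two rows of the rewrite table, Hacker News and Discourse, are simple enough to sketch as a pure URL transformation. This is an illustration only: rewriteUrl is a hypothetical name, the Discourse host set is passed in by the caller rather than taken from the server's built-in list, and the Reddit and Medium rewrites are left out because their exact target URLs are not spelled out here.

```typescript
// Hypothetical rewriter covering two rows of the table; host matching is simplified.
function rewriteUrl(input: string, discourseHosts: Set<string>): string {
  const url = new URL(input);

  // Hacker News item page -> Algolia items API.
  if (url.hostname === "news.ycombinator.com" && url.pathname === "/item") {
    const id = url.searchParams.get("id");
    if (id !== null && /^\d+$/.test(id)) {
      return `https://hn.algolia.com/api/v1/items/${id}`;
    }
  }

  // Discourse thread /t/slug/ID -> /t/slug/ID.json.
  if (discourseHosts.has(url.hostname)) {
    const match = /^\/t\/([^/]+)\/(\d+)/.exec(url.pathname);
    if (match !== null) {
      return `${url.origin}/t/${match[1]}/${match[2]}.json`;
    }
  }

  return input; // no rewrite applies
}
```

Because the function returns the input unchanged when nothing matches, it can sit in front of every fetch without special-casing ordinary URLs.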
Security
SSRF protections active on every request:
- URL validation: rejects non-HTTP(S) schemes, localhost, and direct IP URLs in reserved ranges
- Reserved-IP blocklist: full IANA reserved ranges for IPv4 and IPv6, including 224.0.0.0/4 multicast (CVE-2025-8020 patched; it was missing in the private-ip package we replaced)
- DNS pinning: undici Agent.connect.lookup hook returns the pre-validated IP; the runtime can't re-resolve to something private
- Per-hop redirect validation: each redirect target goes through the same URL + IP checks before the next fetch
- User-supplied proxies rejected: proxy is server-only via PROXY_URL to prevent SSRF bypass via proxy=http://169.254.169.254/
- Credentials scrubbed from error messages: https://user:pass@host is converted to Authorization: Basic before fetch; the password never appears in logs or error output
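The reserved-IP blocklist idea can be shown with a short IPv4-only sketch. The range list below is deliberately partial and illustrative; the real check is described as covering the full IANA reserved set for both IPv4 and IPv6. The names RESERVED_V4, ipv4ToInt, and isReservedIPv4 are assumptions made for this example.

```typescript
// Partial, illustrative list of reserved IPv4 ranges as [network, prefix length].
const RESERVED_V4: Array<[string, number]> = [
  ["0.0.0.0", 8],      // "this network"
  ["10.0.0.0", 8],     // private
  ["100.64.0.0", 10],  // carrier-grade NAT
  ["127.0.0.0", 8],    // loopback
  ["169.254.0.0", 16], // link-local, includes cloud metadata endpoints
  ["172.16.0.0", 12],  // private
  ["192.168.0.0", 16], // private
  ["224.0.0.0", 4],    // multicast
  ["240.0.0.0", 4],    // reserved, includes broadcast
];

// Convert a dotted quad to an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

function isReservedIPv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true; // fail closed on anything unparseable
  return RESERVED_V4.some(([network, prefix]) => {
    const base = ipv4ToInt(network);
    if (base === null) return false;
    const size = 2 ** (32 - prefix); // number of addresses in the block
    return addr >= base && addr < base + size;
  });
}
```

For this check to be meaningful it has to run on the resolved address that the connection will actually use, and again on every redirect hop, which is what the DNS pinning and per-hop validation items describe.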
The Cloudflare-fallback path (CycleTLS Go subprocess) cannot pin DNS — a TTL=0
rebinding attack window exists there. Disable the fallback with
FETCH_CYCLETLS_DISABLED=1 if you're running in a cloud environment with
reachable internal metadata endpoints.
Troubleshooting
- spawn npx ENOENT on Windows → use "command": "npx.cmd", not "npx".
- Claude Desktop hangs on first use → -y missing from args.
- yt-dlp not found after winget install → winget drops it in a WinGet\Packages\… subdir that isn't on PATH by default. Add it to PATH or copy yt-dlp.exe to a dir that already is.
Development
git clone https://github.com/kosyakdev/fetch-mcp.git
cd fetch-mcp
bun install
bun run dev # watch mode
bun test # 316 tests
bun run build     # produces dist/
License
MIT. Forked from zcaceres/fetch-mcp and rebuilt around a different tool surface, URL-rewriter layer, PDF support, and Cloudflare auto-bypass.
