html-extractor-mcp

v1.0.0

Published

3 months ago

HTML Extractor MCP server — fetch URLs, extract text/links, call JSON APIs

Downloads

257

0High
0Medium
0Low

jkumonpm

mcp model-context-protocol web-fetcher web-scraper html-extractor

html-extractor-mcp

HTML Extractor MCP server. Fetch URLs, extract text/links, call JSON APIs.

No external dependencies for basic fetching. Optional playwright-cli for SPA/anti-scraping sites.

Install

npm install -g html-extractor-mcp

Usage

Claude Desktop / Cursor / OpenCode

Add to your MCP config:

{
  "mcpServers": {
    "html-extractor": {
      "command": "npx",
      "args": ["-y", "html-extractor-mcp"]
    }
  }
}

Tools

`fetch_url` — Fetch URL

Fetch a webpage and return its HTML content.

Input:

{
  "url": "https://example.com",
  "timeout": 30000,
  "use_browser": false
}

Output:

{
  "url": "https://example.com",
  "status": 200,
  "content_length": 1256,
  "html": "<!doctype html><html>..."
}

`extract_text` — Extract Text

Fetch a webpage and extract plain text content (strips HTML tags, scripts, styles).

Input:

{
  "url": "https://example.com"
}

Output:

{
  "url": "https://example.com",
  "status": 200,
  "text_length": 280,
  "text": "Example Domain This domain is for use in illustrative examples..."
}

`extract_links` — Extract Links

Fetch a webpage and extract all links (href + text).

Input:

{
  "url": "https://example.com"
}

Output:

{
  "url": "https://example.com",
  "status": 200,
  "link_count": 1,
  "links": [
    { "text": "More information...", "href": "https://www.iana.org/domains/example" }
  ]
}

`fetch_json` — Fetch JSON API

Fetch a URL and parse the response as JSON.

Input:

{
  "url": "https://httpbin.org/get"
}

Output:

{
  "url": "https://httpbin.org/get",
  "status": 200,
  "data": {
    "args": {},
    "headers": { "Host": "httpbin.org" },
    "url": "https://httpbin.org/get"
  }
}

Browser Engine (Playwright CLI)

For SPA sites or anti-scraping protection, set use_browser: true:

{
  "url": "https://medium.com",
  "use_browser": true
}

This uses @playwright/cli (installed as a dependency) to render the page in a real browser environment.

Design

| Feature | Why | |---------|-----| | Zero core deps | Only @modelcontextprotocol/sdk and zod | | Native fetch | Node 18+ built-in, no external HTTP library | | Browser fallback | @playwright/cli for SPA/anti-scraping sites | | Text extraction | Strips scripts, styles, HTML tags | | Link extraction | Extracts href + text from anchor tags |

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

html-extractor-mcp

Install

Usage

Claude Desktop / Cursor / OpenCode

Tools

fetch_url — Fetch URL

extract_text — Extract Text

extract_links — Extract Links

fetch_json — Fetch JSON API

Browser Engine (Playwright CLI)

Design

License

`fetch_url` — Fetch URL

`extract_text` — Extract Text

`extract_links` — Extract Links

`fetch_json` — Fetch JSON API