html-extractor-mcp
v1.0.0
HTML Extractor MCP server. Fetch URLs, extract text/links, call JSON APIs.
Basic fetching needs no external HTTP library; it uses Node's built-in fetch. For SPA or anti-scraping sites, pages can optionally be rendered through @playwright/cli.
Install
npm install -g html-extractor-mcp
Usage
Claude Desktop / Cursor / OpenCode
Add to your MCP config:
{
"mcpServers": {
"html-extractor": {
"command": "npx",
"args": ["-y", "html-extractor-mcp"]
}
}
}
Tools
fetch_url — Fetch URL
Fetch a webpage and return its HTML content.
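The package's own implementation is not shown in this README. As a rough sketch only, assuming the tool relies on Node 18+ native fetch as the Design table states, a timeout-bound fetch that returns the same fields as the output below might look like this. The names fetchUrl and fetchImpl are hypothetical, and treating content_length as the character count of the HTML is an assumption.

```javascript
// Hypothetical helper, not the package's actual code: fetch a page with a
// timeout and shape the result like the fetch_url output.
async function fetchUrl(url, timeout = 30000, fetchImpl = fetch) {
  // AbortSignal.timeout aborts the request once `timeout` milliseconds elapse.
  const response = await fetchImpl(url, { signal: AbortSignal.timeout(timeout) });
  const html = await response.text();
  return {
    url,
    status: response.status,
    // Assumption: content_length counts the characters of the returned HTML.
    content_length: html.length,
    html,
  };
}
```

The fetchImpl parameter exists only so the sketch can be exercised without network access; calling fetchUrl("https://example.com") uses the global fetch.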
Input:
{
"url": "https://example.com",
"timeout": 30000,
"use_browser": false
}
Output:
{
"url": "https://example.com",
"status": 200,
"content_length": 1256,
"html": "<!doctype html><html>..."
}
extract_text — Extract Text
Fetch a webpage and extract plain text content (strips HTML tags, scripts, styles).
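The stripping step described above could be sketched with a few regular expressions. This is an illustration under assumptions, not the package's code: the function name extractText is hypothetical, and the entity decoding and whitespace collapsing are additions that the README does not confirm.

```javascript
// Hypothetical helper: reduce an HTML string to plain text.
function extractText(html) {
  return html
    // Drop <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Replace every remaining tag with a space so adjacent words do not fuse.
    .replace(/<[^>]+>/g, " ")
    // Decode a few common entities (assumption; not an exhaustive list).
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")
    // Collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, " ")
    .trim();
}
```

Regex-based stripping is adequate for simple pages but is not a full HTML parser; malformed markup can defeat it.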
Input:
{
"url": "https://example.com"
}
Output:
{
"url": "https://example.com",
"status": 200,
"text_length": 280,
"text": "Example Domain This domain is for use in illustrative examples..."
}
extract_links — Extract Links
Fetch a webpage and extract all links (href + text).
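As a minimal sketch of the href + text extraction, assuming a regex scan over anchor tags, something like the following would produce entries shaped like the links array in the output below. The function name extractLinks is hypothetical, and resolving relative hrefs against the page URL is an assumption; the README does not say whether the tool does this.

```javascript
// Hypothetical helper: collect { text, href } pairs from anchor tags.
function extractLinks(html, baseUrl) {
  const links = [];
  // Matches <a ... href="..."> or href='...' and captures the inner HTML.
  const anchor = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>([\s\S]*?)<\/a\s*>/gi;
  for (const match of html.matchAll(anchor)) {
    let href;
    try {
      // Assumption: relative hrefs are resolved against the page URL.
      href = new URL(match[2], baseUrl).href;
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    // Strip nested tags from the link text and normalize whitespace.
    const text = match[3].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    links.push({ text, href });
  }
  return links;
}
```

Anchors with unquoted href values are not matched by this pattern; a real HTML parser would handle them.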
Input:
{
"url": "https://example.com"
}
Output:
{
"url": "https://example.com",
"status": 200,
"link_count": 1,
"links": [
{ "text": "More information...", "href": "https://www.iana.org/domains/example" }
]
}
fetch_json — Fetch JSON API
Fetch a URL and parse the response as JSON.
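A sketch of the fetch-then-parse step, again assuming native fetch, is shown here. The names fetchJson and fetchImpl are hypothetical, and both the Accept header and the error message wording are assumptions rather than documented behavior.

```javascript
// Hypothetical helper, not the package's actual code: fetch a URL and
// parse the body as JSON, shaped like the fetch_json output.
async function fetchJson(url, fetchImpl = fetch) {
  // Assumption: ask the server for JSON explicitly.
  const response = await fetchImpl(url, { headers: { Accept: "application/json" } });
  const body = await response.text();
  try {
    return { url, status: response.status, data: JSON.parse(body) };
  } catch (err) {
    // Surface a readable error instead of a bare SyntaxError.
    throw new Error(`Response from ${url} is not valid JSON: ${err.message}`);
  }
}
```

Reading the body as text first keeps the raw response available for the error path when an endpoint returns HTML instead of JSON.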
Input:
{
"url": "https://httpbin.org/get"
}
Output:
{
"url": "https://httpbin.org/get",
"status": 200,
"data": {
"args": {},
"headers": { "Host": "httpbin.org" },
"url": "https://httpbin.org/get"
}
}
Browser Engine (Playwright CLI)
For SPA sites or sites with anti-scraping protection, set use_browser to true:
{
"url": "https://medium.com",
"use_browser": true
}
This uses @playwright/cli (installed as a dependency) to render the page in a real browser environment.
Design
| Feature | Details |
|---------|---------|
| Minimal core deps | Only @modelcontextprotocol/sdk and zod |
| Native fetch | Node 18+ built-in, no external HTTP library |
| Browser fallback | @playwright/cli for SPA/anti-scraping sites |
| Text extraction | Strips scripts, styles, HTML tags |
| Link extraction | Extracts href + text from anchor tags |
License
MIT
