html2text-mcp-server

Version 1.0.1, published 2 months ago

A web scraping and content extraction service built on the Model Context Protocol (MCP). The server exposes seven tools for fetching, scraping, and extracting data from web pages, covering both static pages and dynamically rendered content.

Vulnerabilities: 0 high, 0 medium, 0 low

houxuelin112

HTML2Text MCP Server

Features

Fetch raw HTML content from URLs
Scrape static web pages with CSS selectors
Scrape dynamic SPAs using Puppeteer
Extract structured data from HTML
Scrape multiple pages in a batch
Scrape API endpoints
Monitor web pages for changes

Installation

npm install

Configuration

The server supports the following configuration options:

headless: Whether to run the browser in headless mode (default: true)
timeout: Request timeout in milliseconds (default: 30000)
userAgent: Custom User-Agent string
chromeExecutablePath: Path to Chrome executable (optional, defaults to standard location)
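
The options above could be modelled as a typed object with the documented defaults filled in. This is only a sketch: the `ServerConfig` interface and the `resolveConfig` helper are hypothetical names, not part of the package, and how the server actually reads its configuration is not stated here.

```typescript
// Hypothetical shape of the documented configuration options.
interface ServerConfig {
  headless: boolean;             // run the browser headless (documented default: true)
  timeout: number;               // request timeout in ms (documented default: 30000)
  userAgent?: string;            // custom User-Agent string, no documented default
  chromeExecutablePath?: string; // optional path to the Chrome executable
}

const DEFAULT_CONFIG: ServerConfig = { headless: true, timeout: 30000 };

// Merge caller-supplied overrides over the documented defaults.
function resolveConfig(overrides: Partial<ServerConfig> = {}): ServerConfig {
  return { ...DEFAULT_CONFIG, ...overrides };
}

const config = resolveConfig({ timeout: 60000, userAgent: "my-scraper/1.0" });
console.log(config);
```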

Tools

fetchHtml

Fetches raw HTML content from a URL.

Input schema:

url (string, required): The webpage URL
method (string): HTTP method (GET, POST), default: GET
headers (object): Custom request headers
timeout (number): Timeout in milliseconds, default: 10000
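
As an illustration of the schema above, the arguments can be typed and sanity-checked before they are sent. The `FetchHtmlArgs` interface and `checkFetchHtmlArgs` helper are hypothetical, and the restriction to `http:` and `https:` URLs is an assumption rather than documented server behavior.

```typescript
// Hypothetical typing of the fetchHtml input schema.
interface FetchHtmlArgs {
  url: string;                      // required
  method?: "GET" | "POST";          // documented default: GET
  headers?: Record<string, string>; // custom request headers
  timeout?: number;                 // milliseconds, documented default: 10000
}

// Return a list of problems; an empty list means the arguments look usable.
function checkFetchHtmlArgs(args: FetchHtmlArgs): string[] {
  const problems: string[] = [];
  try {
    const parsed = new URL(args.url);
    // Assumption: only web URLs make sense for this tool.
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      problems.push(`unsupported protocol: ${parsed.protocol}`);
    }
  } catch {
    problems.push(`not a valid URL: ${args.url}`);
  }
  if (args.timeout !== undefined && args.timeout <= 0) {
    problems.push("timeout must be a positive number of milliseconds");
  }
  return problems;
}

const fetchArgs: FetchHtmlArgs = {
  url: "https://example.com/",
  headers: { "Accept-Language": "en" },
  timeout: 15000,
};
console.log(checkFetchHtmlArgs(fetchArgs));
```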

scrapeStatic

Scrapes content from static web pages.

Input schema:

url (string, required): The webpage URL
selector (string): CSS selector for content extraction
attribute (string): Attribute to extract (e.g., href, src)
extract (string): Extraction type (text, html, all), default: text
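
Two example argument objects for this schema, one that reads element text and one that reads an attribute. The `ScrapeStaticArgs` interface is a hypothetical typing of the schema, the URLs are placeholders, and how `attribute` interacts with `extract` is not documented, so the second example leaves `extract` unset.

```typescript
// Hypothetical typing of the scrapeStatic input schema.
interface ScrapeStaticArgs {
  url: string;                       // required
  selector?: string;                 // CSS selector for content extraction
  attribute?: string;                // attribute to extract, e.g. href or src
  extract?: "text" | "html" | "all"; // documented default: text
}

// Text content of every <h2> on the page (relies on the default extract: text).
const headings: ScrapeStaticArgs = {
  url: "https://example.com/blog",
  selector: "h2",
};

// The href attribute of every link inside the navigation element.
const navLinks: ScrapeStaticArgs = {
  url: "https://example.com/blog",
  selector: "nav a",
  attribute: "href",
};

console.log(headings, navLinks);
```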

scrapeDynamic

Scrapes content from dynamically rendered web pages (SPAs).

Input schema:

url (string, required): The webpage URL
waitForSelector (string): Element selector to wait for
waitForXPath (string): XPath to wait for
waitForFunction (string): JavaScript function to wait for
waitTimeout (number): Wait timeout in milliseconds, default: 10000
executeScript (string): JavaScript to execute after page load
scroll (boolean): Whether to scroll to bottom, default: false
screenshot (boolean): Whether to capture screenshot, default: false
chromeExecutablePath (string): Path to Chrome executable
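
A typical call for a single-page app might wait for a results element, scroll to trigger lazy loading, and capture a screenshot. The sketch below only shows the argument object; the `ScrapeDynamicArgs` interface is a hypothetical typing of the schema, and the URL and selector are placeholders.

```typescript
// Hypothetical typing of the scrapeDynamic input schema.
interface ScrapeDynamicArgs {
  url: string;                   // required
  waitForSelector?: string;      // element selector to wait for
  waitForXPath?: string;         // XPath to wait for
  waitForFunction?: string;      // JavaScript function to wait for
  waitTimeout?: number;          // milliseconds, documented default: 10000
  executeScript?: string;        // JavaScript to execute after page load
  scroll?: boolean;              // documented default: false
  screenshot?: boolean;          // documented default: false
  chromeExecutablePath?: string; // path to the Chrome executable
}

const spaArgs: ScrapeDynamicArgs = {
  url: "https://example.com/app",
  waitForSelector: "#results .item", // wait until at least one result is rendered
  waitTimeout: 20000,                // allow more time than the 10000 ms default
  scroll: true,
  screenshot: true,
};

console.log(spaArgs);
```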

extractData

Extracts structured data from HTML content.

Input schema:

html (string, required): HTML content to extract from
selectors (object): Selector mapping like {title: 'h1', content: '.article'}
xpaths (object): XPath mapping
jsonLd (boolean): Whether to extract JSON-LD data, default: true
metadata (boolean): Whether to extract metadata, default: true
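
To make the `jsonLd` option concrete, here is one way JSON-LD blocks can be pulled out of an HTML string. This is a sketch of the general technique, not the server's implementation: `extractJsonLd` is a hypothetical helper, and it uses a regular expression where a production extractor would more likely use an HTML parser.

```typescript
// Collect and parse every <script type="application/ld+json"> block in an HTML string.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      results.push(JSON.parse(match[1] ?? ""));
    } catch {
      // Skip blocks that are not valid JSON instead of failing the whole extraction.
    }
  }
  return results;
}

const sampleHtml = `
<html><head>
<script type="application/ld+json">{"@type": "Article", "headline": "Hello"}</script>
</head><body><h1>Hello</h1></body></html>`;

console.log(extractJsonLd(sampleHtml));
```

The same HTML string could be passed to the tool together with a `selectors` mapping such as {title: 'h1'} to pull named fields alongside the JSON-LD data.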

scrapeMultiple

Batch scrapes multiple pages at once.

Input schema:

urls (array, required): List of URLs to scrape
concurrency (number): Number of concurrent requests (1-10), default: 3
strategy (string): Scraping strategy (static, dynamic), default: static
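
The `concurrency` option implies that at most that many pages are in flight at once. A common way to get that behavior is a small pool of worker lanes pulling from a shared index, sketched below. `clampConcurrency`, `mapWithConcurrency`, and `fakeScrape` are hypothetical names, and clamping out-of-range values (rather than rejecting them) is an assumption, not documented server behavior.

```typescript
// Clamp the requested concurrency to the documented 1-10 range (default 3).
function clampConcurrency(requested: number = 3): number {
  return Math.min(10, Math.max(1, Math.floor(requested)));
}

// Run `worker` over `items` with at most `concurrency` calls in flight.
// Results are returned in the same order as the input items.
async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all lanes

  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  const laneCount = Math.min(clampConcurrency(concurrency), items.length);
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(runLane());
  await Promise.all(lanes);
  return results;
}

// Stand-in worker: a real one would fetch and scrape the URL.
const fakeScrape = async (url: string): Promise<string> => `scraped:${url}`;

mapWithConcurrency(["https://example.com/a", "https://example.com/b"], 3, fakeScrape)
  .then((out) => console.log(out));
```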

scrapeApi

Scrapes data from API endpoints.

Input schema:

url (string, required): API endpoint URL
method (string): HTTP method (GET, POST, PUT, DELETE), default: GET
headers (object): Request headers
body (object): Request body (JSON)
params (object): Query parameters
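
The `params` object presumably ends up as a query string on the endpoint URL. The sketch below shows that step with the standard `URL` API; `buildApiUrl` is a hypothetical helper and the endpoint is a placeholder, so treat it as an illustration of the idea rather than the server's code.

```typescript
// Append query parameters to an endpoint URL, keeping any query string already present.
function buildApiUrl(
  url: string,
  params: Record<string, string | number | boolean> = {},
): string {
  const target = new URL(url);
  for (const key of Object.keys(params)) {
    target.searchParams.set(key, String(params[key]));
  }
  return target.toString();
}

console.log(buildApiUrl("https://api.example.com/items", { page: 2, limit: 50 }));
```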

monitorChanges

Monitors changes on web pages.

Input schema:

url (string, required): URL to monitor
selector (string): Selector to monitor for changes
interval (number): Check interval in minutes, default: 5
previousContent (string): Previously captured content for comparison
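
The `previousContent` field suggests a simple compare-with-last-snapshot model: the caller keeps the content from the last check and passes it back on the next one. A minimal sketch of that comparison follows. `detectChange` and `ChangeResult` are hypothetical, and collapsing whitespace before comparing is an assumption made here to avoid false positives from formatting noise.

```typescript
interface ChangeResult {
  changed: boolean;  // true when the content differs from the previous snapshot
  firstRun: boolean; // true when there was no previous snapshot to compare against
}

// Compare the current content against the previously captured content.
function detectChange(
  previousContent: string | undefined,
  currentContent: string,
): ChangeResult {
  if (previousContent === undefined) {
    return { changed: false, firstRun: true };
  }
  // Assumption: whitespace-only differences do not count as a change.
  const normalize = (s: string): string => s.replace(/\s+/g, " ").trim();
  return {
    changed: normalize(previousContent) !== normalize(currentContent),
    firstRun: false,
  };
}

console.log(detectChange("Price: 10 USD", "Price: 12 USD"));
```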

Usage

The server implements the Model Context Protocol, so any MCP-compatible client can connect to it and call the tools described above.
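
At the protocol level, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request carrying the tool name and its arguments. The sketch below builds such a message by hand to show its shape; `buildToolCall` is a hypothetical helper, and in practice an MCP client library would normally construct and send this for you.

```typescript
// Shape of an MCP tools/call request (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Build a tools/call request for one of the tools listed in this README.
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const toolRequest = buildToolCall("scrapeStatic", {
  url: "https://example.com/",
  selector: "h1",
});
console.log(JSON.stringify(toolRequest, null, 2));
```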

License

MIT
