html2text-mcp-server
v1.0.1
HTML2Text MCP Server
A web scraping and content extraction service built with Model Context Protocol (MCP). This server provides various tools for fetching, scraping, and extracting data from web pages, supporting both static and dynamic content.
Features
- Fetch raw HTML content from URLs
- Scrape static web pages with CSS selectors
- Scrape dynamic SPAs using Puppeteer
- Extract structured data from HTML
- Batch-scrape multiple pages
- Scrape API endpoints
- Monitor web pages for changes
Installation
npm install
Configuration
The server supports the following configuration options:
- headless: Whether to run the browser in headless mode (default: true)
- timeout: Request timeout in milliseconds (default: 30000)
- userAgent: Custom User-Agent string
- chromeExecutablePath: Path to the Chrome executable (optional, defaults to the standard location)
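As a rough illustration of how these options and their defaults fit together, the sketch below models them as a TypeScript object. The option names and default values come from the list above; the `ServerConfig` type and the `resolveConfig` helper are hypothetical names used only for this example, and the README does not say how the server itself reads its configuration.

```typescript
// Hypothetical shape for the documented options. Property names come from
// the README; the type and function names are illustrative assumptions.
interface ServerConfig {
  headless: boolean;
  timeout: number; // milliseconds
  userAgent?: string;
  chromeExecutablePath?: string;
}

// Defaults as documented: headless true, timeout 30000 ms.
const DEFAULT_CONFIG: ServerConfig = {
  headless: true,
  timeout: 30000,
};

// Merge caller overrides on top of the documented defaults.
function resolveConfig(overrides: Partial<ServerConfig> = {}): ServerConfig {
  return { ...DEFAULT_CONFIG, ...overrides };
}

const config = resolveConfig({ timeout: 60000, userAgent: "my-crawler/1.0" });
console.log(config);
```

Options that are left out keep their defaults, so only the values that differ need to be supplied.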
Tools
fetchHtml
Fetches raw HTML content from a URL.
Input schema:
- url (string, required): The webpage URL
- method (string): HTTP method (GET, POST), default: GET
- headers (object): Custom request headers
- timeout (number): Timeout in milliseconds, default: 10000
scrapeStatic
Scrapes content from static web pages.
Input schema:
- url (string, required): The webpage URL
- selector (string): CSS selector for content extraction
- attribute (string): Attribute to extract (e.g., href, src)
- extract (string): Extraction type (text, html, all), default: text
scrapeDynamic
Scrapes content from dynamically rendered web pages (SPAs).
Input schema:
- url (string, required): The webpage URL
- waitForSelector (string): Element selector to wait for
- waitForXPath (string): XPath to wait for
- waitForFunction (string): JavaScript function to wait for
- waitTimeout (number): Wait timeout in milliseconds, default: 10000
- executeScript (string): JavaScript to execute after page load
- scroll (boolean): Whether to scroll to bottom, default: false
- screenshot (boolean): Whether to capture screenshot, default: false
- chromeExecutablePath (string): Path to Chrome executable
extractData
Extracts structured data from HTML content.
Input schema:
- html (string, required): HTML content to extract from
- selectors (object): Selector mapping like {title: 'h1', content: '.article'}
- xpaths (object): XPath mapping
- jsonLd (boolean): Whether to extract JSON-LD data, default: true
- metadata (boolean): Whether to extract metadata, default: true
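To make the jsonLd option more concrete, here is a minimal sketch of what extracting JSON-LD from an HTML string can look like. It is not the server's implementation, which this README does not show. The `extractJsonLd` function is a hypothetical helper, and it uses a regular expression only to stay dependency-free; a real extractor would more likely use an HTML parser.

```typescript
// Illustrative only: collects the contents of every
// <script type="application/ld+json"> element and parses them as JSON.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      results.push(JSON.parse(match[1] ?? ""));
    } catch {
      // Skip blocks whose content is not valid JSON.
    }
  }
  return results;
}

const sampleHtml =
  '<head><script type="application/ld+json">' +
  '{"@type":"Article","headline":"Hello"}' +
  "</script></head>";
console.log(extractJsonLd(sampleHtml));
```

Invalid blocks are skipped rather than raising an error, so one malformed script element does not discard the rest of the page's structured data.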
scrapeMultiple
Batch scrapes multiple pages at once.
Input schema:
- urls (array, required): List of URLs to scrape
- concurrency (number): Number of concurrent requests (1-10), default: 3
- strategy (string): Scraping strategy (static, dynamic), default: static
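The concurrency option caps how many pages are scraped at the same time. The sketch below shows one common way to implement such a cap, a small pool of worker "lanes" that pull from a shared queue. It is an assumption about the general technique, not the server's actual code; `clampConcurrency`, `mapWithConcurrency`, and `fakeScrape` are hypothetical names.

```typescript
// Clamp to the documented range 1-10, defaulting to 3.
function clampConcurrency(value: number = 3): number {
  return Math.min(10, Math.max(1, Math.floor(value)));
}

// Run `worker` over `items` with at most `limit` tasks in flight.
// Results keep the same order as `items`.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  const laneCount = Math.min(clampConcurrency(limit), items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}

// Stand-in for a real scrape call so the sketch runs without network access.
const fakeScrape = async (url: string): Promise<string> => `scraped:${url}`;

mapWithConcurrency(
  ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
  2,
  fakeScrape,
).then((pages) => console.log(pages));
```

A higher concurrency finishes a batch sooner but puts more load on the target site, which is presumably why the documented range stops at 10.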
scrapeApi
Scrapes data from API endpoints.
Input schema:
- url (string, required): API endpoint URL
- method (string): HTTP method (GET, POST, PUT, DELETE), default: GET
- headers (object): Request headers
- body (object): Request body (JSON)
- params (object): Query parameters
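The params object is sent as query parameters, separately from the JSON body. The sketch below shows one plausible way a url and params pair could combine into a final request URL. How the server actually serializes params is not documented here, so `buildUrl` and `QueryParams` are hypothetical, and the example host is a placeholder.

```typescript
type QueryParams = Record<string, string | number | boolean>;

// Append percent-encoded query parameters to a URL, respecting any
// query string the URL already carries.
function buildUrl(url: string, params: QueryParams = {}): string {
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(String(params[key]))}`)
    .join("&");
  if (query === "") return url;
  const separator = url.indexOf("?") === -1 ? "?" : "&";
  return url + separator + query;
}

// Example scrapeApi arguments using the documented field names.
const apiArgs = {
  url: "https://api.example.com/items",
  method: "GET",
  params: { page: 2, q: "web scraping" },
};

console.log(buildUrl(apiArgs.url, apiArgs.params));
// → https://api.example.com/items?page=2&q=web%20scraping
```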
monitorChanges
Monitors changes on web pages.
Input schema:
- url (string, required): URL to monitor
- selector (string): Selector to monitor for changes
- interval (number): Check interval in minutes, default: 5
- previousContent (string): Previously captured content for comparison
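Because previousContent is an input, the caller appears to be responsible for keeping the last snapshot and passing it back on the next check. The sketch below illustrates that comparison loop under this assumption. `detectChange` and `ChangeResult` are hypothetical names, and the whitespace normalization is a choice made for this example, not documented server behavior.

```typescript
interface ChangeResult {
  changed: boolean;
  content: string; // normalized snapshot to store for the next check
}

// Compare freshly scraped content with the stored snapshot.
// Whitespace is collapsed so reflowed markup is not reported as a change.
function detectChange(current: string, previous?: string): ChangeResult {
  const normalize = (text: string): string => text.replace(/\s+/g, " ").trim();
  const content = normalize(current);
  const changed = previous !== undefined && normalize(previous) !== content;
  return { changed, content };
}

// Simulate three consecutive checks of the same page.
let snapshot: string | undefined;
for (const pageContent of ["<p>v1</p>", "<p>v1</p>\n", "<p>v2</p>"]) {
  const result = detectChange(pageContent, snapshot);
  console.log(result.changed); // logs false, false, true
  snapshot = result.content;
}
```

The first check has nothing to compare against, so it reports no change and simply establishes the baseline.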
Usage
The server implements the Model Context Protocol and can be integrated with compatible clients.
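For orientation, MCP clients invoke a tool by sending a JSON-RPC 2.0 request with the method tools/call, naming the tool and passing its arguments. The sketch below builds such a message for fetchHtml. In practice an MCP client library usually constructs and transports these messages; the `toolCall` helper and `ToolCallRequest` type are hypothetical, and how this server is launched or connected to is not covered in this README.

```typescript
// Shape of an MCP tool invocation as a JSON-RPC 2.0 request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextRequestId = 1;

// Build a tools/call request for the named tool with the given arguments.
function toolCall(toolName: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextRequestId++,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

const fetchRequest = toolCall("fetchHtml", {
  url: "https://example.com",
  timeout: 10000,
});
console.log(JSON.stringify(fetchRequest, null, 2));
```

The same helper works for any of the tools listed above by changing the tool name and supplying arguments that match its input schema.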
License
MIT
