webscraper-mcp-server v2.0.0
WebScraper MCP Server v2.0 (Playwright Edition)
A Model Context Protocol (MCP) server that provides advanced web scraping and HTML to Markdown conversion using Microsoft Playwright. This version automatically detects and handles JavaScript-rendered pages.
🆕 What's New in v2.0
- 🚀 Microsoft Playwright - Superior JavaScript rendering with automatic fallback
- 🎯 Smart Detection - Automatically switches to JS rendering when needed
- 📸 Screenshots - Capture page screenshots as base64
- ⏱️ Custom Waits - Wait for specific selectors or time periods
- 🔄 Dual Mode - Static scraping for speed, JS rendering for dynamic content
- 📊 Performance Metrics - Track load times and render methods
Features
Core Capabilities
- 🌐 Intelligent Web Scraping: Automatic detection of static vs dynamic pages
- 📝 HTML to Markdown: Clean, well-formatted Markdown conversion
- 🎭 JavaScript Rendering: Full Playwright support for SPA and dynamic content
- 🔗 Link Extraction: Extract all hyperlinks with filtering options
- 🖼️ Image Extraction: Extract images including lazy-loaded ones
- 📦 Batch Processing: Scrape up to 10 URLs simultaneously
- 🎯 Metadata Extraction: Title, description, author, keywords, and more
- ⚙️ Flexible Options: Control timeouts, redirects, content inclusion
- 📊 Multiple Formats: Output in Markdown or JSON
- 📸 Screenshot Capture: Get base64 screenshots of pages
Rendering Modes
Static Mode (Default, Fast)
- Uses Axios + Cheerio
- Suitable for traditional HTML pages
- Fastest performance
JavaScript Mode (Auto-detected or Forced)
- Uses Playwright with Chromium
- Executes JavaScript
- Handles SPAs, lazy loading, dynamic content
- Auto-activates when static mode returns < 50 words
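The fallback rule above can be sketched as follows. The 50-word threshold is stated in this README; the function and constant names are illustrative, not the server's actual code:

```typescript
// Illustrative sketch of the auto-detection rule: if a static
// (Axios + Cheerio) scrape yields fewer than 50 words, retry the
// page with Playwright's JavaScript rendering.
const MIN_STATIC_WORDS = 50;

function wordCount(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

function shouldRetryWithJavaScript(staticContent: string): boolean {
  return wordCount(staticContent) < MIN_STATIC_WORDS;
}
```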
Installation
```bash
# Clone or navigate to the project
cd webscraper-mcp-server-v2

# Install dependencies
npm install

# Install Playwright browsers
npm run install:browsers

# Build the project
npm run build
```
Usage
Running with stdio (Local)
```bash
npm start
```
Running with HTTP (Remote)
```bash
TRANSPORT=http PORT=3000 npm start
```
Available Tools
1. webscraper_scrape_page - Advanced Web Scraping
Automatically detects and handles both static and dynamic pages.
New Parameters:
- use_javascript (boolean): Force JavaScript rendering
- wait_for_selector (string): CSS selector to wait for
- wait_time (number): Additional wait time in milliseconds
- take_screenshot (boolean): Capture page screenshot
Example - Force JavaScript Rendering:
```json
{
  "url": "https://docs.uazapi.com/endpoint/post/instance~init",
  "use_javascript": true,
  "wait_for_selector": ".content",
  "wait_time": 3000,
  "take_screenshot": true
}
```
Example - Auto-Detection:
```json
{
  "url": "https://example.com/spa-app"
}
```
Automatically switches to JavaScript rendering if the static content is insufficient.
2. webscraper_extract_links - Link Extraction
New Parameter:
- use_javascript (boolean): Use JavaScript rendering for dynamic links
Example:
```json
{
  "url": "https://example.com",
  "use_javascript": true,
  "filter_external": true
}
```
3. webscraper_extract_images - Image Extraction
New Parameter:
- use_javascript (boolean): Extract lazy-loaded images
Example:
```json
{
  "url": "https://example.com/gallery",
  "use_javascript": true,
  "limit": 50
}
```
4. webscraper_batch_scrape - Batch Operations
New Parameter:
- use_javascript (boolean): Use JavaScript rendering for all URLs
Example:
```json
{
  "urls": ["https://page1.com", "https://page2.com"],
  "use_javascript": true,
  "timeout": 60000
}
```
Configuration
Environment Variables
- TRANSPORT: Transport type ('stdio' or 'http', default: 'stdio')
- PORT: HTTP server port (default: 3000, only for HTTP transport)
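A minimal sketch of how a server entry point could resolve these variables; defaults match the README (stdio transport, port 3000), while the type and function names are hypothetical:

```typescript
// Hypothetical config resolution, not the server's actual code.
type TransportConfig =
  | { kind: "stdio" }
  | { kind: "http"; port: number };

function resolveTransport(env: Record<string, string | undefined>): TransportConfig {
  // TRANSPORT defaults to 'stdio'; PORT is only consulted for HTTP.
  if ((env.TRANSPORT ?? "stdio") === "http") {
    return { kind: "http", port: Number(env.PORT ?? 3000) };
  }
  return { kind: "stdio" };
}
```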
Client Configuration (Claude Desktop)
```json
{
  "mcpServers": {
    "webscraper": {
      "command": "node",
      "args": ["/path/to/webscraper-mcp-server-v2/dist/index.js"]
    }
  }
}
```
Output Formats
Markdown Format (Enhanced)
```markdown
# Page Title

**URL:** https://example.com
**Render Method:** javascript
**Description:** Page description
**Author:** Author Name
**Word Count:** 1500 | **Status:** 200 | **Load Time:** 2340ms

---

[Page content in Markdown...]
```
JSON Format (Enhanced)
```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "content": "Markdown content...",
  "renderMethod": "javascript",
  "metadata": {
    "description": "Page description",
    "wordCount": 1500,
    "loadTime": 2340,
    "screenshot": "base64..." // if requested
  }
}
```
Performance Comparison
| Feature | Static Mode | JavaScript Mode |
|---------|-------------|-----------------|
| Speed | ~1-3s | ~3-8s |
| JavaScript | ❌ | ✅ |
| SPA Support | ❌ | ✅ |
| Lazy Loading | ❌ | ✅ |
| Resource Usage | Low | Medium |
| Best For | Traditional HTML | Modern Web Apps |
Use Cases
1. Scraping JavaScript-Heavy Sites
```json
// Site with React/Vue/Angular
{
  "url": "https://spa-site.com",
  "use_javascript": true,
  "wait_for_selector": "#root > div",
  "wait_time": 2000
}
```
2. Capturing Visual State
```json
// Get screenshot along with content
{
  "url": "https://example.com/dashboard",
  "use_javascript": true,
  "take_screenshot": true
}
```
3. API Documentation Sites
```json
// e.g. the UAZ API docs
{
  "url": "https://docs.uazapi.com/endpoint/post/instance~init",
  "use_javascript": true,
  "wait_for_selector": ".api-content",
  "response_format": "json"
}
```
4. E-commerce Product Pages
```json
// Lazy-loaded images and dynamic prices
{
  "url": "https://shop.example.com/product/123",
  "use_javascript": true,
  "wait_time": 3000
}
```
Troubleshooting
Playwright Issues
```bash
# Reinstall browsers
npm run install:browsers

# Check Playwright installation
npx playwright --version
```
Low Word Count on Dynamic Sites
Problem: Getting < 50 words from a JavaScript site?
Solution:
- Set use_javascript: true explicitly
- Use wait_for_selector for specific elements
- Increase wait_time if content loads slowly
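Combining these fixes, a request for a slow-loading SPA might look like this (the URL and selector are placeholders):
```json
{
  "url": "https://example.com/slow-spa",
  "use_javascript": true,
  "wait_for_selector": "#app",
  "wait_time": 5000
}
```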
Memory Issues
Problem: Browser consuming too much memory?
Solution:
- The browser instance is reused and shared
- Contexts are closed after each operation
- Consider increasing system resources for heavy usage
Advantages over Puppeteer
✅ Better Performance: Playwright is generally faster
✅ More Reliable: Better handling of modern web apps
✅ Auto-waiting: Smarter element waiting
✅ Multiple Browsers: Can use Chromium, Firefox, or WebKit
✅ Modern APIs: Cleaner, more intuitive API
✅ Active Development: Microsoft-backed, frequent updates
Development
Project Structure
```
webscraper-mcp-server-v2/
├── src/
│   ├── index.ts       # Main entry point
│   ├── types.ts       # TypeScript definitions (enhanced)
│   ├── constants.ts   # Configuration constants
│   ├── schemas/       # Zod validation (updated)
│   ├── services/      # Playwright-based scraping
│   └── tools/         # MCP tool implementations
├── dist/              # Compiled JavaScript
├── package.json       # Dependencies (with Playwright)
└── README.md
```
Building
```bash
npm run build
```
Testing
```bash
# With MCP Inspector
npx @modelcontextprotocol/inspector node dist/index.js
```
Limitations
- Maximum 10 URLs for batch scraping
- Content truncated at 100,000 characters
- Request timeout: 1-120 seconds
- Chromium browser required (~170MB download)
- Supports only HTTP/HTTPS protocols
- Requires publicly accessible URLs
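A client can pre-check requests against these limits before calling the server. The limits below are from this README; the validation function itself is an illustrative sketch, not part of the server:

```typescript
// Illustrative pre-flight checks mirroring the documented limits.
const MAX_BATCH_URLS = 10;     // batch scraping cap
const MIN_TIMEOUT_MS = 1_000;  // request timeout: 1-120 seconds
const MAX_TIMEOUT_MS = 120_000;

function validateRequest(urls: string[], timeoutMs: number): string[] {
  const errors: string[] = [];
  if (urls.length === 0 || urls.length > MAX_BATCH_URLS) {
    errors.push(`expected 1-${MAX_BATCH_URLS} URLs, got ${urls.length}`);
  }
  for (const url of urls) {
    if (!/^https?:\/\//i.test(url)) {
      errors.push(`unsupported protocol: ${url}`); // only HTTP/HTTPS
    }
  }
  if (timeoutMs < MIN_TIMEOUT_MS || timeoutMs > MAX_TIMEOUT_MS) {
    errors.push(`timeout must be ${MIN_TIMEOUT_MS}-${MAX_TIMEOUT_MS} ms`);
  }
  return errors; // empty array: request is within the documented limits
}
```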
Performance Tips
- Use Static Mode When Possible: 3-5x faster for traditional sites
- Batch Related URLs: More efficient than individual calls
- Set Appropriate Timeouts: Longer for slow sites, shorter for fast ones
- Use Selectors Wisely: Wait for specific elements instead of fixed times
- Limit Screenshot Usage: Screenshots increase response size significantly
Comparison with v1.0
| Feature | v1.0 (Cheerio Only) | v2.0 (Playwright) |
|---------|---------------------|-------------------|
| Static HTML | ✅ Fast | ✅ Fast |
| JavaScript | ❌ | ✅ Full Support |
| Auto-Detection | ❌ | ✅ Smart Fallback |
| Screenshots | ❌ | ✅ Base64 Output |
| Lazy Loading | ❌ | ✅ Supported |
| SPAs | ❌ Limited | ✅ Full Support |
License
MIT
Contributing
Contributions welcome! Areas for improvement:
- [ ] Support for other Playwright browsers (Firefox, WebKit)
- [ ] PDF generation from pages
- [ ] Advanced selector strategies
- [ ] Request interception for blocking ads
- [ ] Cookie management
- [ ] Proxy support
Support
For issues or questions, please open an issue on the GitHub repository.
Made with ❤️ using Microsoft Playwright and Model Context Protocol
