webscraper-mcp-server

v2.0.0

MCP server for web scraping and converting web pages to Markdown with Playwright support

Downloads

17

WebScraper MCP Server v2.0 (Playwright Edition)

A Model Context Protocol (MCP) server that provides advanced web scraping and HTML to Markdown conversion using Microsoft Playwright. This version automatically detects and handles JavaScript-rendered pages.

🆕 What's New in v2.0

  • 🚀 Microsoft Playwright - Superior JavaScript rendering with automatic fallback
  • 🎯 Smart Detection - Automatically switches to JS rendering when needed
  • 📸 Screenshots - Capture page screenshots as base64
  • ⏱️ Custom Waits - Wait for specific selectors or time periods
  • 🔄 Dual Mode - Static scraping for speed, JS rendering for dynamic content
  • 📊 Performance Metrics - Track load times and render methods

Features

Core Capabilities

  • 🌐 Intelligent Web Scraping: Automatic detection of static vs dynamic pages
  • 📝 HTML to Markdown: Clean, well-formatted Markdown conversion
  • 🎭 JavaScript Rendering: Full Playwright support for SPA and dynamic content
  • 🔗 Link Extraction: Extract all hyperlinks with filtering options
  • 🖼️ Image Extraction: Extract images including lazy-loaded ones
  • 📦 Batch Processing: Scrape up to 10 URLs simultaneously
  • 🎯 Metadata Extraction: Title, description, author, keywords, and more
  • ⚙️ Flexible Options: Control timeouts, redirects, content inclusion
  • 📊 Multiple Formats: Output in Markdown or JSON
  • 📸 Screenshot Capture: Get base64 screenshots of pages

Rendering Modes

  1. Static Mode (Default, Fast)

    • Uses Axios + Cheerio
    • Suitable for traditional HTML pages
    • Fastest performance
  2. JavaScript Mode (Auto-detected or Forced)

    • Uses Playwright with Chromium
    • Executes JavaScript
    • Handles SPAs, lazy loading, dynamic content
    • Auto-activates when static mode returns < 50 words
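
The auto-activation rule in the last bullet can be sketched as a simple word-count check. This is an illustration under stated assumptions, not the server's actual code: the names `countWords` and `shouldFallbackToJavaScript` are hypothetical, and only the 50-word threshold comes from the description above.

```typescript
// Hypothetical helper names; only the 50-word threshold comes from the README.
const MIN_STATIC_WORDS = 50;

// Count whitespace-separated tokens in already-extracted page text.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Decide whether a static result looks too thin and should be re-rendered
// with JavaScript. A caller can also force JavaScript rendering outright.
function shouldFallbackToJavaScript(
  staticText: string,
  forceJavaScript: boolean = false
): boolean {
  return forceJavaScript || countWords(staticText) < MIN_STATIC_WORDS;
}
```

A page whose static HTML contains only a loading placeholder would fall below the threshold and trigger the JavaScript path, while a long article would stay in static mode.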

Installation

# Clone or navigate to the project
cd webscraper-mcp-server-v2

# Install dependencies
npm install

# Install Playwright browsers
npm run install:browsers

# Build the project
npm run build

Usage

Running with stdio (Local)

npm start

Running with HTTP (Remote)

TRANSPORT=http PORT=3000 npm start

Available Tools

1. webscraper_scrape_page - Advanced Web Scraping

Automatically detects and handles both static and dynamic pages.

New Parameters:

  • use_javascript (boolean): Force JavaScript rendering
  • wait_for_selector (string): CSS selector to wait for
  • wait_time (number): Additional wait time in milliseconds
  • take_screenshot (boolean): Capture page screenshot

Example - Force JavaScript Rendering:

{
  "url": "https://docs.uazapi.com/endpoint/post/instance~init",
  "use_javascript": true,
  "wait_for_selector": ".content",
  "wait_time": 3000,
  "take_screenshot": true
}

Example - Auto-Detection:

{
  "url": "https://example.com/spa-app"
}

When static scraping returns fewer than 50 words, the server switches to JavaScript rendering automatically.

2. webscraper_extract_links - Link Extraction

New Parameter:

  • use_javascript (boolean): Use JavaScript rendering for dynamic links

Example:

{
  "url": "https://example.com",
  "use_javascript": true,
  "filter_external": true
}
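
How `filter_external` decides what counts as external is not documented here. One common approach is to compare each link's hostname with the page's hostname, and the sketch below illustrates that idea. The names `isExternalLink` and `partitionLinks` and the exact comparison rule are assumptions, not the server's implementation.

```typescript
// Assumption: a link is "external" when it is an http(s) link whose hostname
// differs from the page's hostname. Function names are hypothetical.
function isExternalLink(href: string, pageUrl: string): boolean {
  try {
    // Resolve relative hrefs such as "/about" against the page URL.
    const target = new URL(href, pageUrl);
    if (target.protocol !== "http:" && target.protocol !== "https:") {
      return false; // mailto:, javascript:, etc. are not counted as external
    }
    return target.hostname !== new URL(pageUrl).hostname;
  } catch {
    // Unparseable hrefs are treated as not external.
    return false;
  }
}

// Split a list of hrefs into internal and external groups.
function partitionLinks(hrefs: string[], pageUrl: string) {
  const internal: string[] = [];
  const external: string[] = [];
  for (const href of hrefs) {
    (isExternalLink(href, pageUrl) ? external : internal).push(href);
  }
  return { internal, external };
}
```

Note that this rule treats subdomains (for example `blog.example.com` versus `example.com`) as external; a real implementation might compare registrable domains instead.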

3. webscraper_extract_images - Image Extraction

New Parameter:

  • use_javascript (boolean): Extract lazy-loaded images

Example:

{
  "url": "https://example.com/gallery",
  "use_javascript": true,
  "limit": 50
}

4. webscraper_batch_scrape - Batch Operations

New Parameter:

  • use_javascript (boolean): Use JavaScript for all URLs

Example:

{
  "urls": ["https://page1.com", "https://page2.com"],
  "use_javascript": true,
  "timeout": 60000
}
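
Before sending a batch, a client can check the request against the documented limits (at most 10 URLs, HTTP/HTTPS only). The following is a minimal client-side sketch. `validateBatchUrls` is a hypothetical helper, the "at least one URL" rule is an assumption, and the server's own validation may differ in detail.

```typescript
const MAX_BATCH_URLS = 10; // documented batch limit

// Return the URL's protocol (e.g. "https:"), or null if it cannot be parsed.
function getProtocol(url: string): string | null {
  try {
    return new URL(url).protocol;
  } catch {
    return null;
  }
}

// Returns a list of problems; an empty list means the batch looks acceptable.
function validateBatchUrls(urls: string[]): string[] {
  const problems: string[] = [];
  if (urls.length === 0) {
    problems.push("at least one URL is required"); // assumption, see lead-in
  }
  if (urls.length > MAX_BATCH_URLS) {
    problems.push(`too many URLs: ${urls.length} (maximum ${MAX_BATCH_URLS})`);
  }
  for (const url of urls) {
    const protocol = getProtocol(url);
    if (protocol === null) {
      problems.push(`invalid URL: ${url}`);
    } else if (protocol !== "http:" && protocol !== "https:") {
      problems.push(`unsupported protocol: ${url}`);
    }
  }
  return problems;
}
```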

Configuration

Environment Variables

  • TRANSPORT: Transport type ('stdio' or 'http', default: 'stdio')
  • PORT: HTTP server port (default: 3000, only for HTTP transport)
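
As an illustration of how these two variables and their defaults fit together, here is a minimal sketch of a config resolver. The `resolveConfig` function and `ServerConfig` type are hypothetical, and rejecting unknown values with an error is an assumption about behavior that the README does not specify.

```typescript
interface ServerConfig {
  transport: "stdio" | "http";
  port?: number; // only meaningful for the HTTP transport
}

// Hypothetical helper: reads the two documented variables from an env-like
// object and applies the documented defaults ('stdio' and 3000).
function resolveConfig(env: Record<string, string | undefined>): ServerConfig {
  const rawTransport = env["TRANSPORT"] ?? "stdio";
  if (rawTransport === "stdio") {
    return { transport: "stdio" };
  }
  if (rawTransport !== "http") {
    throw new Error(`Unsupported TRANSPORT: ${rawTransport}`);
  }
  const rawPort = env["PORT"] ?? "3000";
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }
  return { transport: "http", port };
}
```

In a Node.js process this would typically be called as `resolveConfig(process.env)`.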

Client Configuration (Claude Desktop)

{
  "mcpServers": {
    "webscraper": {
      "command": "node",
      "args": ["/path/to/webscraper-mcp-server-v2/dist/index.js"]
    }
  }
}

Output Formats

Markdown Format (Enhanced)

# Page Title

**URL:** https://example.com
**Render Method:** javascript

**Description:** Page description
**Author:** Author Name
**Word Count:** 1500 | **Status:** 200 | **Load Time:** 2340ms

---

[Page content in Markdown...]

JSON Format (Enhanced)

{
  "url": "https://example.com",
  "title": "Page Title",
  "content": "Markdown content...",
  "renderMethod": "javascript",
  "metadata": {
    "description": "Page description",
    "wordCount": 1500,
    "loadTime": 2340,
    "screenshot": "base64..."
  }
}

The screenshot field is included only when a screenshot was requested.
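
The two formats carry largely the same information, so a Markdown header can be derived from a JSON result. The sketch below illustrates that mapping. The `ScrapeResult` interface mirrors only the fields shown in the JSON example (so the Author and Status lines are omitted), the `"static"` value for `renderMethod` is an assumption, and `renderMarkdown` is a hypothetical name, not necessarily how the server builds its output.

```typescript
// Mirrors the JSON example; "static" as a renderMethod value is an assumption.
interface ScrapeResult {
  url: string;
  title: string;
  content: string;
  renderMethod: "static" | "javascript";
  metadata: {
    description?: string;
    wordCount: number;
    loadTime: number; // milliseconds
    screenshot?: string; // base64, only when requested
  };
}

// Build a Markdown document in the style of the Markdown format example.
function renderMarkdown(result: ScrapeResult): string {
  const { metadata } = result;
  const lines: string[] = [
    `# ${result.title}`,
    "",
    `**URL:** ${result.url}`,
    `**Render Method:** ${result.renderMethod}`,
    "",
  ];
  if (metadata.description) {
    lines.push(`**Description:** ${metadata.description}`);
  }
  lines.push(`**Word Count:** ${metadata.wordCount} | **Load Time:** ${metadata.loadTime}ms`);
  lines.push("", "---", "", result.content);
  return lines.join("\n");
}
```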

Performance Comparison

| Feature | Static Mode | JavaScript Mode |
|---------|-------------|-----------------|
| Speed | ~1-3s | ~3-8s |
| JavaScript | ❌ | ✅ |
| SPA Support | ❌ | ✅ |
| Lazy Loading | ❌ | ✅ |
| Resource Usage | Low | Medium |
| Best For | Traditional HTML | Modern Web Apps |

Use Cases

1. Scraping JavaScript-Heavy Sites

// Site with React/Vue/Angular
{
  "url": "https://spa-site.com",
  "use_javascript": true,
  "wait_for_selector": "#root > div",
  "wait_time": 2000
}

2. Capturing Visual State

// Get screenshot along with content
{
  "url": "https://example.com/dashboard",
  "use_javascript": true,
  "take_screenshot": true
}
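
The screenshot comes back as a base64 string, which a client has to decode before saving or displaying it. A minimal decoding sketch follows. It assumes the image is a PNG (Playwright's default screenshot type); the README does not state the format, so the signature check is an assumption, and both function names are hypothetical.

```typescript
// Decode a base64 string into raw bytes using the standard atob() global.
function decodeBase64(base64: string): Uint8Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Every PNG file starts with this fixed 8-byte signature.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Check whether the decoded bytes begin with the PNG signature.
function looksLikePng(bytes: Uint8Array): boolean {
  return PNG_SIGNATURE.every((byte, i) => bytes[i] === byte);
}
```

In Node.js, the decoded bytes can then be written to disk, for example with `fs.writeFileSync("page.png", bytes)`.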

3. API Documentation Sites

// Client-rendered API reference page, returned as JSON
{
  "url": "https://docs.uazapi.com/endpoint/post/instance~init",
  "use_javascript": true,
  "wait_for_selector": ".api-content",
  "response_format": "json"
}

4. E-commerce Product Pages

// Lazy-loaded images and dynamic prices
{
  "url": "https://shop.example.com/product/123",
  "use_javascript": true,
  "wait_time": 3000
}

Troubleshooting

Playwright Issues

# Reinstall browsers
npm run install:browsers

# Check Playwright installation
npx playwright --version

Low Word Count on Dynamic Sites

Problem: A JavaScript-rendered site returns fewer than 50 words.

Solution:

  • Set use_javascript: true explicitly
  • Use wait_for_selector for specific elements
  • Increase wait_time if content loads slowly

Memory Issues

Problem: Browser consuming too much memory?

Solution:

  • The server already reuses a single shared browser instance across requests
  • Each browser context is closed after its operation completes
  • Consider increasing system resources for heavy usage
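
The reuse pattern described in this list can be sketched as follows. To keep the example self-contained, it is written against two minimal interfaces instead of importing Playwright; Playwright's `Browser` and `BrowserContext` satisfy them structurally, since `browser.newContext()` and `context.close()` are real Playwright methods. The helper name `createContextRunner` is hypothetical, and this is not the server's actual code.

```typescript
// Minimal shapes that Playwright's BrowserContext and Browser also satisfy.
interface ContextLike {
  close(): Promise<void>;
}
interface BrowserLike<C extends ContextLike> {
  newContext(): Promise<C>;
}

// Returns a function that runs `work` in a fresh context on a shared browser.
// The browser is launched lazily, once; each context is always closed.
function createContextRunner<C extends ContextLike>(
  launch: () => Promise<BrowserLike<C>>
) {
  let browserPromise: Promise<BrowserLike<C>> | undefined;

  return async function withContext<T>(
    work: (context: C) => Promise<T>
  ): Promise<T> {
    if (!browserPromise) {
      browserPromise = launch(); // first call launches; later calls reuse it
    }
    const browser = await browserPromise;
    const context = await browser.newContext();
    try {
      return await work(context);
    } finally {
      await context.close(); // runs even when `work` throws
    }
  };
}
```

With Playwright installed, the runner could be created as `createContextRunner(() => chromium.launch())`. Closing the context in `finally` is what keeps per-request memory from accumulating while the browser process itself stays alive.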

Advantages over Puppeteer

  • Better Performance: Playwright is generally faster
  • More Reliable: Better handling of modern web apps
  • Auto-waiting: Smarter element waiting
  • Multiple Browsers: Can use Chromium, Firefox, or WebKit
  • Modern APIs: Cleaner, more intuitive API
  • Active Development: Microsoft-backed, frequent updates

Development

Project Structure

webscraper-mcp-server-v2/
├── src/
│   ├── index.ts           # Main entry point
│   ├── types.ts           # TypeScript definitions (enhanced)
│   ├── constants.ts       # Configuration constants
│   ├── schemas/           # Zod validation (updated)
│   ├── services/          # Playwright-based scraping
│   └── tools/             # MCP tool implementations
├── dist/                  # Compiled JavaScript
├── package.json           # Dependencies (with Playwright)
└── README.md

Building

npm run build

Testing

# With MCP Inspector
npx @modelcontextprotocol/inspector node dist/index.js

Limitations

  • Maximum 10 URLs for batch scraping
  • Content truncated at 100,000 characters
  • Request timeout: 1-120 seconds
  • Chromium browser required (~170MB download)
  • Supports only HTTP/HTTPS protocols
  • Requires publicly accessible URLs
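
Two of these limits are easy to mirror on the client side. The sketch below clamps a timeout into the documented 1-120 second range and truncates content at 100,000 characters. It assumes the `timeout` parameter is expressed in milliseconds, as the `60000` in the batch example suggests. The helper names are hypothetical, and whether the server clamps or rejects out-of-range values is not stated in this README.

```typescript
// Documented limits; the millisecond unit is an assumption (see lead-in).
const MIN_TIMEOUT_MS = 1000;
const MAX_TIMEOUT_MS = 120000;
const MAX_CONTENT_CHARS = 100000;

// Force a requested timeout into the documented range.
function clampTimeout(timeoutMs: number): number {
  return Math.min(MAX_TIMEOUT_MS, Math.max(MIN_TIMEOUT_MS, timeoutMs));
}

// Cut content at the limit and report whether anything was dropped.
function truncateContent(content: string, limit: number = MAX_CONTENT_CHARS) {
  if (content.length <= limit) {
    return { text: content, truncated: false };
  }
  return { text: content.slice(0, limit), truncated: true };
}
```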

Performance Tips

  1. Use Static Mode When Possible: 3-5x faster for traditional sites
  2. Batch Related URLs: More efficient than individual calls
  3. Set Appropriate Timeouts: Longer for slow sites, shorter for fast ones
  4. Use Selectors Wisely: Wait for specific elements instead of fixed times
  5. Limit Screenshot Usage: Screenshots increase response size significantly

Comparison with v1.0

| Feature | v1.0 (Cheerio Only) | v2.0 (Playwright) |
|---------|---------------------|-------------------|
| Static HTML | ✅ Fast | ✅ Fast |
| JavaScript | ❌ | ✅ Full Support |
| Auto-Detection | ❌ | ✅ Smart Fallback |
| Screenshots | ❌ | ✅ Base64 Output |
| Lazy Loading | ❌ | ✅ Supported |
| SPAs | ❌ Limited | ✅ Full Support |

License

MIT

Contributing

Contributions welcome! Areas for improvement:

  • [ ] Support for other Playwright browsers (Firefox, WebKit)
  • [ ] PDF generation from pages
  • [ ] Advanced selector strategies
  • [ ] Request interception for blocking ads
  • [ ] Cookie management
  • [ ] Proxy support

Support

For issues or questions, please open an issue on the GitHub repository.


Made with ❤️ using Microsoft Playwright and Model Context Protocol