@pulses/scrapling-mcp

v1.0.1

MCP server for web scraping with multiple tiers of fetching (HTTP, Browser, Stealthy)

Scrapling MCP Server

A TypeScript Model Context Protocol (MCP) server for web scraping with three tiers of fetching strategies. It provides six tools, a single-URL and a bulk variant for each tier, so you can match the fetching strategy to how aggressively a site defends against bots.

Features

Three Tiers of Fetching

Tier 1: Simple HTTP (get, bulk_get)

  • Fast HTTP requests using native Node.js fetch with curl-impersonation
  • Good for low-mid protection sites
  • Minimal resource usage
  • Supports retries, redirects, proxies, basic auth, and cookies

Tier 2: Playwright Browser (fetch, bulk_fetch)

  • Full browser automation via Playwright
  • Handles JavaScript-heavy sites that require page rendering
  • Configurable resource blocking for performance
  • Network idle detection and selector waiting

Tier 3: Stealthy Browser (stealthy_fetch, bulk_stealthy_fetch)

  • Advanced anti-bot bypass with stealth measures
  • Navigator.webdriver detection bypassing
  • Canvas fingerprint noise injection
  • WebRTC blocking
  • Cloudflare Turnstile challenge solving
  • Plugin spoofing
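
To make the tier split concrete, here is a minimal sketch of how a client might pick a tool. The tool names come from this README; the `SiteProfile` shape, the `chooseTool` helper, and the decision rules are assumptions for illustration, not part of the package.

```typescript
// Hypothetical helper: tool names are from the README, the rules are an assumption.
type ToolName =
  | "get" | "bulk_get"
  | "fetch" | "bulk_fetch"
  | "stealthy_fetch" | "bulk_stealthy_fetch";

interface SiteProfile {
  needsJavaScript: boolean; // content is rendered client-side
  hasAntiBot: boolean;      // e.g. Cloudflare Turnstile or fingerprint checks
}

function chooseTool(profile: SiteProfile, urlCount = 1): ToolName {
  // Start from the cheapest tier that can plausibly handle the site.
  const single: ToolName = profile.hasAntiBot
    ? "stealthy_fetch"
    : profile.needsJavaScript
      ? "fetch"
      : "get";
  // Every tier has a bulk_ variant for multiple URLs.
  return urlCount > 1 ? (`bulk_${single}` as ToolName) : single;
}
```

For example, `chooseTool({ needsJavaScript: true, hasAntiBot: false }, 3)` selects `bulk_fetch`.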

Tools

1. get (Single URL HTTP Request)

Fast HTTP requests with curl-impersonation for low-mid protection sites.

Parameters:

  • url (string, required): URL to request
  • impersonate (string, default: "chrome"): Browser fingerprint to mimic
  • extraction_type (enum: "markdown"|"html"|"text", default: "markdown"): Content format
  • css_selector (string, nullable): CSS selector for content extraction
  • main_content_only (boolean, default: true): Extract only main body content
  • params (object, nullable): Query string parameters
  • headers (object, nullable): Custom headers
  • cookies (object, nullable): Cookies to send
  • timeout (number, default: 30): Timeout in seconds
  • follow_redirects (boolean, default: true): Follow HTTP redirects
  • max_redirects (number, default: 30): Maximum redirects to follow
  • retries (number, default: 3): Retry attempts on failure
  • retry_delay (number, default: 1): Seconds between retries
  • proxy (string, nullable): Proxy URL (format: http://user:pass@host:port)
  • proxy_auth (object, nullable): {username, password} for proxy
  • auth (object, nullable): {username, password} for basic auth
  • verify (boolean, default: true): Verify HTTPS certificates
  • stealthy_headers (boolean, default: true): Add realistic Chrome headers + Google referer
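
The defaults above can be written down as a typed object, which helps when building arguments on the client side. This is only a sketch: the `GetArgs` interface covers a subset of the parameters, and `GET_DEFAULTS` and `withGetDefaults` are hypothetical names. The server presumably applies the same defaults itself, so the helper is purely for visibility.

```typescript
// Subset of the documented `get` parameters; names and defaults are from the list above.
interface GetArgs {
  url: string;
  impersonate?: string;
  extraction_type?: "markdown" | "html" | "text";
  css_selector?: string | null;
  main_content_only?: boolean;
  timeout?: number;      // seconds
  follow_redirects?: boolean;
  max_redirects?: number;
  retries?: number;
  retry_delay?: number;  // seconds
  verify?: boolean;
  stealthy_headers?: boolean;
}

// Hypothetical client-side mirror of the documented defaults.
const GET_DEFAULTS = {
  impersonate: "chrome",
  extraction_type: "markdown",
  main_content_only: true,
  timeout: 30,
  follow_redirects: true,
  max_redirects: 30,
  retries: 3,
  retry_delay: 1,
  verify: true,
  stealthy_headers: true,
} as const;

// Caller-supplied values win over the defaults.
function withGetDefaults(args: GetArgs) {
  return { ...GET_DEFAULTS, ...args };
}

const resolved = withGetDefaults({ url: "https://example.com", timeout: 10 });
```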

2. bulk_get (Multiple URL HTTP Request)

Same as get, but accepts urls (string[]) instead of a single url.

3. fetch (Single URL Browser Request)

Full browser automation for JavaScript-heavy sites.

Parameters:

  • url (string, required): URL to request
  • extraction_type (enum: "markdown"|"html"|"text", default: "markdown"): Content format
  • css_selector (string, nullable): CSS selector for extraction
  • main_content_only (boolean, default: true): Extract only main content
  • headless (boolean, default: true): Run browser in headless mode
  • disable_resources (boolean, default: false): Block images/fonts/media for speed
  • useragent (string, nullable): Custom user agent
  • cookies (object, nullable): Cookies as {name: value}
  • network_idle (boolean, default: false): Wait for no network activity
  • timeout (number, default: 30000): Timeout in milliseconds
  • wait (number, default: 0): Wait after page load (ms)
  • wait_selector (string, nullable): CSS selector to wait for
  • wait_selector_state (enum: "attached"|"detached"|"hidden"|"visible", default: "attached"): Selector state
  • timezone_id (string, nullable): Browser timezone (e.g., "America/New_York")
  • locale (string, nullable): Browser locale (e.g., "en-US")
  • google_search (boolean, default: true): Set Google referer
  • extra_headers (object, nullable): Additional HTTP headers
  • proxy (string|object, nullable): Proxy configuration
  • real_chrome (boolean, default: false): Use installed Chrome browser
  • cdp_url (string, nullable): Connect to CDP endpoint instead of launching
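
Note that timeout is documented in seconds for get (default: 30) but in milliseconds for fetch (default: 30000). A small sketch of a helper that keeps the units straight; the `Tier` type and `timeoutFor` function are hypothetical and not part of the server.

```typescript
// "http" covers get/bulk_get; "browser" covers fetch and stealthy_fetch and their bulk variants.
type Tier = "http" | "browser";

// Takes a duration in seconds and returns it in the unit the tier expects.
function timeoutFor(tier: Tier, seconds: number): number {
  return tier === "http" ? seconds : seconds * 1000;
}

const getArgs = { url: "https://example.com", timeout: timeoutFor("http", 15) };
const fetchArgs = {
  url: "https://example.com",
  timeout: timeoutFor("browser", 15),
  wait: 500, // extra wait after page load, in milliseconds
};
```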

4. bulk_fetch (Multiple URL Browser Request)

Same as fetch, but accepts urls (string[]) instead of a single url.

5. stealthy_fetch (Single URL Stealth Browser Request)

Browser automation with anti-bot bypass for high-protection sites.

Parameters: All fetch parameters PLUS:

  • solve_cloudflare (boolean, default: false): Solve Cloudflare Turnstile challenges
  • allow_webgl (boolean, default: true): Enable WebGL (some WAFs require this)
  • hide_canvas (boolean, default: false): Add canvas fingerprint noise
  • block_webrtc (boolean, default: false): Block WebRTC for IP leak prevention
  • additional_args (object, nullable): Extra Playwright context settings

6. bulk_stealthy_fetch (Multiple URL Stealth Browser Request)

Same as stealthy_fetch, but accepts urls (string[]) instead of a single url.
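
Since each bulk tool takes the same parameters as its single-URL counterpart with urls in place of url, arguments can be converted mechanically. A minimal sketch, assuming the remaining options apply to every URL in the batch; `toBulkArgs` is a hypothetical helper.

```typescript
type SingleArgs = { url: string } & Record<string, unknown>;
type BulkArgs = { urls: string[] } & Record<string, unknown>;

// Drops `url`, keeps every other option, and attaches the list of URLs.
function toBulkArgs(single: SingleArgs, urls: string[]): BulkArgs {
  const { url: _ignored, ...shared } = single;
  return { ...shared, urls };
}

const bulkArgs = toBulkArgs(
  { url: "https://example.com/a", network_idle: true },
  ["https://example.com/a", "https://example.com/b"],
);
```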

Installation

npm install
npm run build

Usage

Start the Server

npm start

Or in development:

npm run dev

Example MCP Client Usage

// `client` is assumed to be an MCP client that is already connected to this server.

// Tier 1: simple HTTP request
const getResult = await client.request({
  method: "tools/call",
  params: {
    name: "get",
    arguments: {
      url: "https://example.com",
      extraction_type: "markdown",
      main_content_only: true
    }
  }
});

// Tier 2: browser automation, waiting for a selector and for network idle
const fetchResult = await client.request({
  method: "tools/call",
  params: {
    name: "fetch",
    arguments: {
      url: "https://example.com",
      wait_selector: ".content",
      network_idle: true
    }
  }
});

// Tier 3: stealth browser with Cloudflare solving and canvas noise
const stealthResult = await client.request({
  method: "tools/call",
  params: {
    name: "stealthy_fetch",
    arguments: {
      url: "https://example.com",
      solve_cloudflare: true,
      hide_canvas: true
    }
  }
});

Content Extraction

All tools support three extraction formats:

  • markdown: Converts HTML to markdown (default)
  • html: Returns raw HTML
  • text: Plain text with all HTML stripped

CSS Selector Support

All tools support an optional css_selector parameter to extract specific page elements. If the selector matches multiple elements, their contents are concatenated.

Main Content Only

By default, content extraction focuses on <body> content. Set main_content_only: false to include full HTML context.

Performance Tips

  1. Use get for static sites - Fastest option with minimal resource usage
  2. Enable resource blocking - Set disable_resources: true in fetch/stealthy_fetch to skip loading images/fonts/media
  3. Use CSS selectors - More efficient than extracting entire page content
  4. Keep network_idle: false - Pages return sooner; set it to true only when content loads after the initial page load
  5. Leverage bulk operations - bulk_get, bulk_fetch, and bulk_stealthy_fetch process multiple URLs efficiently
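
One way to apply these tips is to start with the cheapest tier and escalate only on failure. The sketch below assumes a `callTool` function that resolves with the extracted content and rejects when a tool call fails; that function, the `TIERS` order, and `scrapeWithEscalation` are illustrative assumptions, not something the server provides.

```typescript
// Hypothetical transport: wraps however your MCP client invokes a tool.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<string>;

// Cheapest tier first, per tip 1.
const TIERS = ["get", "fetch", "stealthy_fetch"] as const;

async function scrapeWithEscalation(
  callTool: CallTool,
  url: string,
): Promise<{ tool: string; content: string }> {
  let lastError: unknown;
  for (const tool of TIERS) {
    // Tip 2: block images/fonts/media on the browser tiers.
    const args = tool === "get" ? { url } : { url, disable_resources: true };
    try {
      return { tool, content: await callTool(tool, args) };
    } catch (error) {
      lastError = error; // remember the failure and try the next tier
    }
  }
  throw lastError;
}
```

Escalating this way keeps browser launches off the hot path for sites that a plain HTTP request can already handle.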

Stealth Features

The stealthy_fetch tool includes:

  • Navigator spoofing: Hides automation indicators (navigator.webdriver)
  • Plugin spoofing: Mimics Chrome plugins
  • Chrome runtime: Defines window.chrome.runtime for compatibility
  • Canvas fingerprint noise: Adds random noise to canvas operations
  • WebRTC blocking: Prevents local IP leak through WebRTC
  • Cloudflare handling: Waits for challenge resolution
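
For readers unfamiliar with navigator spoofing, the following is a generic sketch of the common technique of redefining the webdriver property. It is not taken from this package's source; `hideWebdriver` and `fakeNavigator` are hypothetical names, and a plain object stands in for the browser's navigator so the snippet runs outside a browser. In a real browser the function body would typically run as a page init script.

```typescript
// Generic stealth patch, not this package's implementation:
// redefine `webdriver` so reads return undefined instead of true.
function hideWebdriver(nav: object): void {
  Object.defineProperty(nav, "webdriver", {
    get: () => undefined,
    configurable: true,
  });
}

// Stand-in for the browser's navigator object.
const fakeNavigator: { webdriver?: boolean } = { webdriver: true };
hideWebdriver(fakeNavigator);
```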

Proxy Configuration

Proxies can be specified in two formats:

String format (with optional auth):

http://user:pass@proxy.example.com:8080

Object format:

{
  server: "http://proxy.example.com:8080",
  username: "user",
  password: "pass"
}
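
The two formats carry the same information, so one can be derived from the other. A minimal sketch using the standard URL class; the `ProxyObject` interface mirrors the object format above, and `proxyStringToObject` is a hypothetical helper rather than something the server exposes.

```typescript
interface ProxyObject {
  server: string;
  username?: string;
  password?: string;
}

// Splits a proxy URL into the object format, moving credentials out of the server URL.
function proxyStringToObject(proxy: string): ProxyObject {
  const parsed = new URL(proxy);
  const result: ProxyObject = { server: `${parsed.protocol}//${parsed.host}` };
  // URL keeps credentials percent-encoded, so decode them before use.
  if (parsed.username) result.username = decodeURIComponent(parsed.username);
  if (parsed.password) result.password = decodeURIComponent(parsed.password);
  return result;
}

const proxyObject = proxyStringToObject("http://user:pass@proxy.example.com:8080");
// proxyObject: { server: "http://proxy.example.com:8080", username: "user", password: "pass" }
```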

Error Handling

All tools return error messages when requests fail. Common error scenarios:

  • Network timeouts
  • HTTP errors (4xx, 5xx)
  • Selector wait timeout
  • Cloudflare challenge timeout
  • Invalid proxy configuration

Errors are returned in the response with descriptive messages.
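
On the client side, an error then has to be told apart from scraped content. The sketch below assumes the server follows the usual MCP tool-result shape, with a content array of text parts and an optional isError flag; this README does not confirm that the flag is set, so treat it as an assumption. `ToolResult` and `unwrapText` are hypothetical names.

```typescript
// Assumed result shape, modeled on the MCP tool-call result.
interface ToolResult {
  content: Array<{ type: string; text?: string }>;
  isError?: boolean;
}

// Returns the joined text parts, or throws with the server's message on error.
function unwrapText(result: ToolResult): string {
  const text = result.content
    .filter((part) => part.type === "text" && typeof part.text === "string")
    .map((part) => part.text)
    .join("\n");
  if (result.isError) {
    throw new Error(text || "Tool call failed without a message");
  }
  return text;
}
```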

Dependencies

  • @modelcontextprotocol/sdk: MCP server framework
  • playwright: Browser automation
  • cheerio: HTML parsing for Tier 1 extraction
  • turndown: HTML to markdown conversion
  • zod: Input schema validation

License

MIT