@pulses/scrapling-mcp

v1.0.1

Published

3 months ago

MCP server for web scraping with multiple tiers of fetching (HTTP, Browser, Stealthy)


pulsesai

faisalakhan

mcp model-context-protocol web-scraping playwright http-fetching

Scrapling MCP Server

A TypeScript Model Context Protocol (MCP) server for web scraping with three tiers of fetching strategies. It exposes six tools, two per tier, with each tier offering more resistance to anti-bot measures than the one before it.

Features

Three Tiers of Fetching

Tier 1: Simple HTTP (get, bulk_get)

Fast HTTP requests using native Node.js fetch with curl-impersonation
Suited to sites with low to moderate bot protection
Minimal resource usage
Supports retries, redirects, proxies, basic auth, and cookies

Tier 2: Playwright Browser (fetch, bulk_fetch)

Full browser automation via Playwright
Handles JavaScript-heavy sites that require page rendering
Configurable resource blocking for performance
Network idle detection and selector waiting

Tier 3: Stealthy Browser (stealthy_fetch, bulk_stealthy_fetch)

Advanced anti-bot bypass with stealth measures
navigator.webdriver detection bypass
Canvas fingerprint noise injection
WebRTC blocking
Cloudflare Turnstile challenge solving
Plugin spoofing
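
One way to use the three tiers is to start with the cheapest tool and escalate only when it fails. The following is a minimal sketch under stated assumptions: `ToolCaller`, `ToolOutcome`, and `scrapeWithEscalation` are hypothetical names that are not part of this package, and how a failed call is detected depends on your MCP client.

```typescript
// Hypothetical wrapper types: a function that calls one tool and reports success.
// These are assumptions for illustration, not part of the server's API.
type TierTool = "get" | "fetch" | "stealthy_fetch";

interface ToolOutcome {
  ok: boolean;
  text: string;
}

type ToolCaller = (name: TierTool, args: Record<string, unknown>) => Promise<ToolOutcome>;

// Tiers ordered from cheapest (plain HTTP) to most expensive (stealth browser).
const TIERS: TierTool[] = ["get", "fetch", "stealthy_fetch"];

async function scrapeWithEscalation(
  callTool: ToolCaller,
  url: string,
): Promise<{ tier: TierTool; text: string }> {
  let lastError = "no tier attempted";
  for (const tier of TIERS) {
    try {
      const outcome = await callTool(tier, { url });
      if (outcome.ok) return { tier, text: outcome.text };
      lastError = outcome.text;
    } catch (err) {
      // A thrown error is treated like a failed tier; move on to the next one.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`All tiers failed for ${url}: ${lastError}`);
}
```

Injecting `callTool` keeps the escalation logic independent of any particular MCP client library.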

Tools

1. `get` (Single URL HTTP Request)

Fast HTTP requests with curl-impersonation for sites with low to moderate bot protection.

Parameters:

url (string, required): URL to request
impersonate (string, default: "chrome"): Browser fingerprint to mimic
extraction_type (enum: "markdown"|"html"|"text", default: "markdown"): Content format
css_selector (string, nullable): CSS selector for content extraction
main_content_only (boolean, default: true): Extract only main body content
params (object, nullable): Query string parameters
headers (object, nullable): Custom headers
cookies (object, nullable): Cookies to send
timeout (number, default: 30): Timeout in seconds
follow_redirects (boolean, default: true): Follow HTTP redirects
max_redirects (number, default: 30): Maximum redirects to follow
retries (number, default: 3): Retry attempts on failure
retry_delay (number, default: 1): Seconds between retries
proxy (string, nullable): Proxy URL (format: http://user:pass@host:port)
proxy_auth (object, nullable): {username, password} for proxy
auth (object, nullable): {username, password} for basic auth
verify (boolean, default: true): Verify HTTPS certificates
stealthy_headers (boolean, default: true): Add realistic Chrome headers + Google referer
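
The parameters above map directly onto the `arguments` object of a `tools/call` request. A minimal sketch, assuming a hypothetical `GetArguments` interface and `toToolCall` helper (neither is exported by this package); only a subset of the documented parameters is typed here:

```typescript
// Hypothetical client-side type mirroring a subset of the documented `get` parameters.
interface GetArguments {
  url: string;
  extraction_type?: "markdown" | "html" | "text";
  css_selector?: string | null;
  main_content_only?: boolean;
  headers?: Record<string, string> | null;
  timeout?: number; // seconds for `get`
  retries?: number;
  retry_delay?: number; // seconds between retries
  proxy?: string | null;
}

// Wraps tool arguments in the `tools/call` request shape used in this README.
function toToolCall<A>(name: string, args: A) {
  return { method: "tools/call" as const, params: { name, arguments: args } };
}

// Omitted fields fall back to the server-side defaults listed above.
const getArgs: GetArguments = {
  url: "https://example.com/articles",
  css_selector: "article",
  timeout: 15,
  retries: 5,
  retry_delay: 2,
};

const getRequest = toToolCall("get", getArgs);
```

Typing the arguments on the client side catches misspelled parameter names at compile time instead of at request time.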

2. `bulk_get` (Multiple URL HTTP Request)

Same as get, but accepts urls (string[]) instead of a single url.
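
A short sketch of preparing `bulk_get` arguments from a list of URLs. The helper name `toBulkGetArguments` is hypothetical, and de-duplicating on the client is a design choice made here, not a requirement of the server.

```typescript
// Hypothetical helper: builds `bulk_get` arguments from a URL list plus shared options.
function toBulkGetArguments(
  urls: string[],
  shared: Record<string, unknown> = {},
): Record<string, unknown> & { urls: string[] } {
  // Trim, drop empty entries, and remove duplicates while keeping first-seen order.
  const cleaned = urls.map((u) => u.trim()).filter((u) => u.length > 0);
  const unique = Array.from(new Set(cleaned));
  if (unique.length === 0) throw new Error("bulk_get needs at least one URL");
  return { ...shared, urls: unique };
}

const bulkArgs = toBulkGetArguments(
  ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
  { extraction_type: "text", retries: 2 },
);
```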

3. `fetch` (Single URL Browser Request)

Full browser automation for JavaScript-heavy sites.

Parameters:

url (string, required): URL to request
extraction_type (enum: "markdown"|"html"|"text", default: "markdown"): Content format
css_selector (string, nullable): CSS selector for extraction
main_content_only (boolean, default: true): Extract only main content
headless (boolean, default: true): Run browser in headless mode
disable_resources (boolean, default: false): Block images/fonts/media for speed
useragent (string, nullable): Custom user agent
cookies (object, nullable): Cookies as {name: value}
network_idle (boolean, default: false): Wait for no network activity
timeout (number, default: 30000): Timeout in milliseconds
wait (number, default: 0): Wait after page load (ms)
wait_selector (string, nullable): CSS selector to wait for
wait_selector_state (enum: "attached"|"detached"|"hidden"|"visible", default: "attached"): Selector state
timezone_id (string, nullable): Browser timezone (e.g., "America/New_York")
locale (string, nullable): Browser locale (e.g., "en-US")
google_search (boolean, default: true): Set Google referer
extra_headers (object, nullable): Additional HTTP headers
proxy (string|object, nullable): Proxy configuration
real_chrome (boolean, default: false): Use installed Chrome browser
cdp_url (string, nullable): Connect to CDP endpoint instead of launching
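
Note that `timeout` is documented in seconds for `get` but in milliseconds for `fetch`. A small sketch of a conversion helper that keeps call sites in one unit; `ScraplingTool` and `timeoutFor` are hypothetical names, and the sketch assumes the bulk and stealthy variants use the same unit as the tool they extend, as their descriptions suggest.

```typescript
type ScraplingTool =
  | "get"
  | "bulk_get"
  | "fetch"
  | "bulk_fetch"
  | "stealthy_fetch"
  | "bulk_stealthy_fetch";

// HTTP-tier tools document `timeout` in seconds; browser-tier tools in milliseconds.
const SECONDS_BASED: ReadonlySet<ScraplingTool> = new Set<ScraplingTool>(["get", "bulk_get"]);

// Converts a timeout given in seconds into the unit the chosen tool expects.
function timeoutFor(tool: ScraplingTool, seconds: number): number {
  if (!Number.isFinite(seconds) || seconds <= 0) {
    throw new RangeError("timeout must be a positive number of seconds");
  }
  return SECONDS_BASED.has(tool) ? seconds : Math.round(seconds * 1000);
}
```

For example, `timeoutFor("fetch", 30)` yields the documented `fetch` default of 30000, while `timeoutFor("get", 30)` leaves the value at 30.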

4. `bulk_fetch` (Multiple URL Browser Request)

Same as fetch, but accepts urls (string[]) instead of a single url.

5. `stealthy_fetch` (Single URL Stealth Browser Request)

Browser automation with anti-bot bypass for high-protection sites.

Parameters: All fetch parameters PLUS:

solve_cloudflare (boolean, default: false): Solve Cloudflare Turnstile challenges
allow_webgl (boolean, default: true): Enable WebGL (some WAFs require this)
hide_canvas (boolean, default: false): Add canvas fingerprint noise
block_webrtc (boolean, default: false): Block WebRTC for IP leak prevention
additional_args (object, nullable): Extra Playwright context settings

6. `bulk_stealthy_fetch` (Multiple URL Stealth Browser Request)

Same as stealthy_fetch, but accepts urls (string[]) instead of a single url.

Installation

npm install
npm run build

Usage

Start the Server

npm start

Or in development:

npm run dev

Example MCP Client Usage

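// `client` is assumed to be an already-connected MCP client.
// If it comes from @modelcontextprotocol/sdk, client.request also expects a
// result schema as its second argument; client.callTool({ name, arguments })
// avoids that.

// Tier 1: simple HTTP request
const getResult = await client.request({
  method: "tools/call",
  params: {
    name: "get",
    arguments: {
      url: "https://example.com",
      extraction_type: "markdown",
      main_content_only: true
    }
  }
});

// Tier 2: browser automation, waiting for a selector and for network idle
const fetchResult = await client.request({
  method: "tools/call",
  params: {
    name: "fetch",
    arguments: {
      url: "https://example.com",
      wait_selector: ".content",
      network_idle: true
    }
  }
});

// Tier 3: stealth browser with Cloudflare solving and canvas noise
const stealthResult = await client.request({
  method: "tools/call",
  params: {
    name: "stealthy_fetch",
    arguments: {
      url: "https://example.com",
      solve_cloudflare: true,
      hide_canvas: true
    }
  }
});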

Content Extraction

All tools support three extraction formats:

markdown: Converts HTML to markdown (default)
html: Returns raw HTML
text: Plain text with all HTML stripped

CSS Selector Support

All tools support an optional css_selector parameter to extract specific page elements. If the selector matches multiple elements, their contents are concatenated.

Main Content Only

By default, extraction is limited to the content of <body>. Set main_content_only: false to include the full HTML document.

Performance Tips

Use get for static sites - Fastest option with minimal resource usage
Enable resource blocking - Set disable_resources: true in fetch/stealthy_fetch to skip loading images/fonts/media
Use CSS selectors - More efficient than extracting entire page content
Keep network_idle: false (the default) - Faster page loads; set it to true only when the page needs it
Leverage bulk operations - bulk_get, bulk_fetch, and bulk_stealthy_fetch process multiple URLs efficiently
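
The tips above can be sketched as two small helpers. Both `pickTool` and `leanFetchArguments` are hypothetical names introduced for illustration; the parameter names inside the returned object are the documented ones.

```typescript
// Hypothetical helper: choose the cheapest tier that fits the target site.
function pickTool(site: {
  needsJavaScript: boolean;
  highProtection: boolean;
}): "get" | "fetch" | "stealthy_fetch" {
  if (site.highProtection) return "stealthy_fetch";
  if (site.needsJavaScript) return "fetch";
  return "get"; // static sites: fastest option, minimal resource usage
}

// Hypothetical helper: browser-tier arguments tuned for speed.
function leanFetchArguments(url: string, cssSelector: string) {
  return {
    url,
    css_selector: cssSelector, // extract only the part of the page that is needed
    disable_resources: true, // skip images, fonts, and media
    network_idle: false, // do not wait for the network to go quiet
  };
}
```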

Stealth Features

The stealthy_fetch tool includes:

Navigator spoofing: Hides automation indicators (navigator.webdriver)
Plugin spoofing: Mimics Chrome plugins
Chrome runtime: Defines window.chrome.runtime for compatibility
Canvas fingerprint noise: Adds random noise to canvas operations
WebRTC blocking: Prevents local IP leaks through WebRTC
Cloudflare handling: Waits for challenge resolution

Proxy Configuration

Proxies can be specified in two formats:

String format (with optional auth):

http://user:pass@proxy.example.com:8080

Object format:

{
  server: "http://proxy.example.com:8080",
  username: "user",
  password: "pass"
}
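
The two formats carry the same information, so one can be derived from the other. A minimal sketch, assuming a hypothetical `toProxyObject` helper that is not part of this package; it relies only on the standard `URL` class.

```typescript
interface ProxyObject {
  server: string;
  username?: string;
  password?: string;
}

// Hypothetical helper: converts the string proxy format into the object format.
function toProxyObject(proxy: string): ProxyObject {
  const parsed = new URL(proxy); // throws a TypeError on malformed input
  const result: ProxyObject = { server: `${parsed.protocol}//${parsed.host}` };
  // URL keeps credentials percent-encoded, so decode them before use.
  if (parsed.username) result.username = decodeURIComponent(parsed.username);
  if (parsed.password) result.password = decodeURIComponent(parsed.password);
  return result;
}
```

Credentials containing reserved characters such as `@` must be percent-encoded in the string format (for example `p%40ss`); the helper decodes them back for the object format.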

Error Handling

All tools return error messages when requests fail. Common error scenarios:

Network timeouts
HTTP errors (4xx, 5xx)
Selector wait timeout
Cloudflare challenge timeout
Invalid proxy configuration

Errors are returned in the response with descriptive messages.
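
MCP tool results conventionally carry a `content` array of typed items and an optional `isError` flag. The sketch below shows one way a client might separate success from failure; it assumes this server sets `isError` on failed calls, which the README does not confirm, and `ToolResultLike` and `unwrapText` are hypothetical names.

```typescript
// Subset of the MCP tool-result shape; assumed here, not confirmed for this server.
interface ToolResultLike {
  isError?: boolean;
  content: Array<{ type: string; text?: string }>;
}

// Returns the joined text content, or throws with the server's error message.
function unwrapText(result: ToolResultLike): string {
  const text = result.content
    .filter((item) => item.type === "text" && typeof item.text === "string")
    .map((item) => item.text as string)
    .join("\n");
  if (result.isError) {
    throw new Error(text || "Tool call failed without a message");
  }
  return text;
}
```

If the server instead reports failures only as text inside a successful result, the caller would need to inspect the returned message rather than rely on `isError`.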

Dependencies

@modelcontextprotocol/sdk: MCP server framework
playwright: Browser automation
cheerio: HTML parsing for Tier 1 extraction
turndown: HTML to markdown conversion
zod: Input schema validation

License

MIT
