@pulses/scrapling-mcp
v1.0.1
MCP server for web scraping with multiple tiers of fetching (HTTP, Browser, Stealthy)
Scrapling MCP Server
A TypeScript Model Context Protocol (MCP) server for web scraping with multiple tiers of fetching strategies. This server provides 6 tools for scraping websites with varying levels of protection against anti-bot measures.
Features
Three Tiers of Fetching
Tier 1: Simple HTTP (get, bulk_get)
- Fast HTTP requests using native Node.js fetch with curl-impersonation
- Good for low-mid protection sites
- Minimal resource usage
- Supports retries, redirects, proxies, basic auth, and cookies
Tier 2: Playwright Browser (fetch, bulk_fetch)
- Full browser automation via Playwright
- Handles JavaScript-heavy sites that require page rendering
- Configurable resource blocking for performance
- Network idle detection and selector waiting
Tier 3: Stealthy Browser (stealthy_fetch, bulk_stealthy_fetch)
- Advanced anti-bot bypass with stealth measures
- Navigator.webdriver detection bypassing
- Canvas fingerprint noise injection
- WebRTC blocking
- Cloudflare Turnstile challenge solving
- Plugin spoofing
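Because the tiers are ordered from cheapest to most expensive, a client might try them in sequence and stop at the first success. The following is only a sketch of that idea: `ToolCaller` and `scrapeWithEscalation` are hypothetical names, and `callTool` stands in for however your MCP client invokes a tool. Only the tool names come from this README.

```typescript
// Hypothetical signature: however your MCP client invokes a tool by name.
type ToolCaller = (name: string, args: Record<string, unknown>) => Promise<string>;

// Tool names from this README, ordered cheapest to most expensive.
const TIERS = ["get", "fetch", "stealthy_fetch"] as const;

// Try each tier in order and return the first one that succeeds.
async function scrapeWithEscalation(
  callTool: ToolCaller,
  url: string,
): Promise<{ tier: string; content: string }> {
  let lastError: unknown = new Error("no tiers attempted");
  for (const tier of TIERS) {
    try {
      const content = await callTool(tier, { url });
      return { tier, content };
    } catch (err) {
      lastError = err; // remember the failure and fall through to the next tier
    }
  }
  // Every tier failed: surface the error from the last (most capable) tier.
  throw lastError;
}
```

Escalating only on failure keeps most requests on the lightweight HTTP tier and reserves browser launches for sites that need them.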
Tools
1. get (Single URL HTTP Request)
Fast HTTP requests with curl-impersonation for low-mid protection sites.
Parameters:
- url (string, required): URL to request
- impersonate (string, default: "chrome"): Browser fingerprint to mimic
- extraction_type (enum: "markdown"|"html"|"text", default: "markdown"): Content format
- css_selector (string, nullable): CSS selector for content extraction
- main_content_only (boolean, default: true): Extract only main body content
- params (object, nullable): Query string parameters
- headers (object, nullable): Custom headers
- cookies (object, nullable): Cookies to send
- timeout (number, default: 30): Timeout in seconds
- follow_redirects (boolean, default: true): Follow HTTP redirects
- max_redirects (number, default: 30): Maximum redirects to follow
- retries (number, default: 3): Retry attempts on failure
- retry_delay (number, default: 1): Seconds between retries
- proxy (string, nullable): Proxy URL (format: http://user:pass@host:port)
- proxy_auth (object, nullable): {username, password} for proxy
- auth (object, nullable): {username, password} for basic auth
- verify (boolean, default: true): Verify HTTPS certificates
- stealthy_headers (boolean, default: true): Add realistic Chrome headers + Google referer
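To make the defaults concrete, here is a minimal sketch that types a subset of the get arguments and fills in the documented defaults. The server itself lists zod for schema validation, so this plain TypeScript version is an illustration rather than the server's schema; `GetArgs`, `GET_DEFAULTS`, and `withGetDefaults` are hypothetical names.

```typescript
// Subset of the `get` arguments; names and defaults are taken from the parameter list.
interface GetArgs {
  url: string;
  impersonate?: string;
  extraction_type?: "markdown" | "html" | "text";
  css_selector?: string | null;
  main_content_only?: boolean;
  timeout?: number; // seconds
  retries?: number;
  retry_delay?: number; // seconds
}

const GET_DEFAULTS = {
  impersonate: "chrome",
  extraction_type: "markdown",
  css_selector: null,
  main_content_only: true,
  timeout: 30,
  retries: 3,
  retry_delay: 1,
} as const;

// Hypothetical helper: fill in the documented default for any omitted argument.
function withGetDefaults(args: GetArgs): Required<GetArgs> {
  return { ...GET_DEFAULTS, ...args };
}

const resolved = withGetDefaults({ url: "https://example.com", timeout: 10 });
console.log(resolved.timeout, resolved.extraction_type);
```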
2. bulk_get (Multiple URL HTTP Request)
Same as get but accepts urls (string[]) instead of single url.
3. fetch (Single URL Browser Request)
Full browser automation for JavaScript-heavy sites.
Parameters:
- url (string, required): URL to request
- extraction_type (enum: "markdown"|"html"|"text", default: "markdown"): Content format
- css_selector (string, nullable): CSS selector for extraction
- main_content_only (boolean, default: true): Extract only main content
- headless (boolean, default: true): Run browser in headless mode
- disable_resources (boolean, default: false): Block images/fonts/media for speed
- useragent (string, nullable): Custom user agent
- cookies (object, nullable): Cookies as {name: value}
- network_idle (boolean, default: false): Wait for no network activity
- timeout (number, default: 30000): Timeout in milliseconds
- wait (number, default: 0): Wait after page load (ms)
- wait_selector (string, nullable): CSS selector to wait for
- wait_selector_state (enum: "attached"|"detached"|"hidden"|"visible", default: "attached"): Selector state
- timezone_id (string, nullable): Browser timezone (e.g., "America/New_York")
- locale (string, nullable): Browser locale (e.g., "en-US")
- google_search (boolean, default: true): Set Google referer
- extra_headers (object, nullable): Additional HTTP headers
- proxy (string|object, nullable): Proxy configuration
- real_chrome (boolean, default: false): Use installed Chrome browser
- cdp_url (string, nullable): Connect to CDP endpoint instead of launching
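Note that the timeout units differ between tiers: get documents its timeout in seconds (default 30), while fetch documents it in milliseconds (default 30000). A small helper can guard against mixing them up. This is a sketch; `ToolName` and `timeoutFor` are hypothetical names, and it assumes the bulk and stealthy variants use the same units as the tools they are described as mirroring.

```typescript
type ToolName =
  | "get" | "bulk_get"
  | "fetch" | "bulk_fetch"
  | "stealthy_fetch" | "bulk_stealthy_fetch";

// Per the parameter lists: HTTP tools take seconds, browser tools take milliseconds.
function timeoutFor(tool: ToolName, seconds: number): number {
  const usesSeconds = tool === "get" || tool === "bulk_get";
  return usesSeconds ? seconds : seconds * 1000;
}

console.log(timeoutFor("get", 30));   // 30
console.log(timeoutFor("fetch", 30)); // 30000
```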
4. bulk_fetch (Multiple URL Browser Request)
Same as fetch but accepts urls (string[]) instead of single url.
5. stealthy_fetch (Single URL Stealth Browser Request)
Browser automation with anti-bot bypass for high-protection sites.
Parameters: All fetch parameters PLUS:
- solve_cloudflare (boolean, default: false): Solve Cloudflare Turnstile challenges
- allow_webgl (boolean, default: true): Enable WebGL (some WAFs require this)
- hide_canvas (boolean, default: false): Add canvas fingerprint noise
- block_webrtc (boolean, default: false): Block WebRTC for IP leak prevention
- additional_args (object, nullable): Extra Playwright context settings
6. bulk_stealthy_fetch (Multiple URL Stealth Browser Request)
Same as stealthy_fetch but accepts urls (string[]) instead of single url.
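Since each bulk tool differs from its single-URL counterpart only in taking urls instead of url, a client can derive a bulk call mechanically. The sketch below assumes the "tools/call" params shape (name plus arguments); `BULK_TOOL` and `toBulkCall` are hypothetical names.

```typescript
// Single-URL tool name -> bulk tool name, as listed in the Tools section.
const BULK_TOOL = {
  get: "bulk_get",
  fetch: "bulk_fetch",
  stealthy_fetch: "bulk_stealthy_fetch",
} as const;

type SingleTool = keyof typeof BULK_TOOL;

// Hypothetical helper: build the params for a bulk call from shared arguments.
function toBulkCall(
  tool: SingleTool,
  urls: string[],
  shared: Record<string, unknown> = {},
) {
  return { name: BULK_TOOL[tool], arguments: { ...shared, urls } };
}

const call = toBulkCall(
  "fetch",
  ["https://example.com/a", "https://example.com/b"],
  { network_idle: true },
);
console.log(call.name); // bulk_fetch
```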
Installation
npm install
npm run build
Usage
Start the Server
npm start
Or in development:
npm run dev
Example MCP Client Usage
// Tier 1: simple HTTP request
const httpResult = await client.request({
  method: "tools/call",
  params: {
    name: "get",
    arguments: {
      url: "https://example.com",
      extraction_type: "markdown",
      main_content_only: true
    }
  }
});
// Tier 2: fetch with browser automation
const browserResult = await client.request({
  method: "tools/call",
  params: {
    name: "fetch",
    arguments: {
      url: "https://example.com",
      wait_selector: ".content",
      network_idle: true
    }
  }
});
// Tier 3: bypass Cloudflare
const stealthResult = await client.request({
  method: "tools/call",
  params: {
    name: "stealthy_fetch",
    arguments: {
      url: "https://example.com",
      solve_cloudflare: true,
      hide_canvas: true
    }
  }
});
Content Extraction
All tools support three extraction formats:
- markdown: Converts HTML to markdown (default)
- html: Returns raw HTML
- text: Plain text with all HTML stripped
CSS Selector Support
All tools support optional css_selector parameter to extract specific page elements. If the selector matches multiple elements, they are concatenated.
Main Content Only
By default, content extraction focuses on <body> content. Set main_content_only: false to include full HTML context.
Performance Tips
- Use get for static sites: Fastest option with minimal resource usage
- Enable resource blocking: Set disable_resources: true in fetch/stealthy_fetch to skip loading images/fonts/media
- Use CSS selectors: More efficient than extracting entire page content
- Set network_idle: false: Faster page loads, set to true only if needed
- Leverage bulk operations: bulk_get, bulk_fetch, and bulk_stealthy_fetch process multiple URLs efficiently
Stealth Features
The stealthy_fetch tool includes:
- Navigator spoofing: Hides automation indicators (navigator.webdriver)
- Plugin spoofing: Mimics Chrome plugins
- Chrome runtime: Defines window.chrome.runtime for compatibility
- Canvas fingerprint noise: Adds random noise to canvas operations
- WebRTC blocking: Prevents local IP leak through WebRTC
- Cloudflare handling: Waits for challenge resolution
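To illustrate the first and third items, the sketch below shows the general shape such patches often take. It is not this package's actual code: `NavigatorLike`, `WindowLike`, and `applyStealthPatches` are hypothetical, and the function takes plain objects so it can run outside a browser. In Playwright, logic like this is usually injected into the page with addInitScript so it runs before the site's own scripts.

```typescript
// Minimal shapes so the sketch runs outside a browser; in a page these would be
// the real `navigator` and `window` objects.
interface NavigatorLike { webdriver?: boolean }
interface WindowLike { chrome?: { runtime?: object } }

function applyStealthPatches(nav: NavigatorLike, win: WindowLike): void {
  // Make navigator.webdriver read as undefined instead of true.
  Object.defineProperty(nav, "webdriver", {
    get: () => undefined,
    configurable: true,
  });
  // Define window.chrome.runtime if it is missing, keeping any existing value.
  win.chrome = { ...win.chrome, runtime: win.chrome?.runtime ?? {} };
}

const fakeNavigator: NavigatorLike = { webdriver: true };
const fakeWindow: WindowLike = {};
applyStealthPatches(fakeNavigator, fakeWindow);
console.log(fakeNavigator.webdriver); // undefined
```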
Proxy Configuration
Proxies can be specified in two formats:
String format (with optional auth):
http://user:pass@proxy.example.com:8080
Object format:
{
  server: "http://proxy.example.com:8080",
  username: "user",
  password: "pass"
}
Error Handling
All tools return error messages when requests fail. Common error scenarios:
- Network timeouts
- HTTP errors (4xx, 5xx)
- Selector wait timeout
- Cloudflare challenge timeout
- Invalid proxy configuration
Errors are returned in the response with descriptive messages.
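One of these scenarios, an invalid proxy configuration, can be caught on the client side before a request is sent. The sketch below converts the string format from Proxy Configuration into the object format and throws a descriptive error for malformed input. `ProxyObject` and `normalizeProxy` are hypothetical names; this is not necessarily how the server validates proxies.

```typescript
interface ProxyObject {
  server: string;
  username?: string;
  password?: string;
}

// Hypothetical helper: accept either proxy format and return the object format.
function normalizeProxy(proxy: string | ProxyObject): ProxyObject {
  if (typeof proxy !== "string") return proxy;
  let parsed: URL;
  try {
    parsed = new URL(proxy);
  } catch {
    throw new Error(`Invalid proxy URL: ${proxy}`);
  }
  // Credentials are moved out of the URL into separate fields.
  const result: ProxyObject = { server: `${parsed.protocol}//${parsed.host}` };
  if (parsed.username) result.username = decodeURIComponent(parsed.username);
  if (parsed.password) result.password = decodeURIComponent(parsed.password);
  return result;
}

console.log(normalizeProxy("http://user:pass@proxy.example.com:8080"));
```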
Dependencies
- @modelcontextprotocol/sdk: MCP server framework
- playwright: Browser automation
- cheerio: HTML parsing for Tier 1 extraction
- turndown: HTML to markdown conversion
- zod: Input schema validation
License
MIT
