🕷️ Scrapidou - Web Scraping Server for ChatGPT
Scrapidou is a clean, modular MCP server for web scraping and URL fetching.
⚠️ Disclaimer
This project is independent and unofficial.
- ❌ Not affiliated with any scraping service
- ✅ Educational and practical purpose project
- ✅ Respects robots.txt and rate limiting
- ⚠️ Use responsibly - respect website terms of service
🎯 What is it?
This application allows ChatGPT and other MCP clients to fetch and scrape web content with a clean, modular architecture.
✨ Features
- 🌐 URL Fetching - Retrieve content from any URL with proper headers and redirect handling
- 📄 Flexible Extraction - Control content format (text, html, both), size (maxContentLength), issue detection, and link extraction independently
- 📝 Text Content Extraction - Clean text extraction without HTML tags for LLM consumption
- 🎨 HTML Content Extraction - Full HTML content preservation in full mode (formatting, images, citations)
- 🔍 Issue Detection - Automatically detect paywalls, login requirements, and partial content
- 🔗 Related Links - Extract relevant links (see also, related articles) while filtering ads and navigation
- 🧭 Navigation Links - Extract sidebar/menu links for documentation sites (optional)
- 📊 Metadata Extraction - Extract title, description, author, and publication date
- 🏗️ Modular Architecture - Clean separation of concerns, reusable for future projects
- 🔌 Dual Mode - Works with ChatGPT (Streamable HTTP) and IDEs (stdio)
💬 Usage example
In ChatGPT, simply ask:
"Fetch the content from https://example.com"
Or:
"Extract the main content from https://blog.example.com/article with a 500 character limit"
Or:
"Get the full HTML content from https://docs.example.com/page"
ChatGPT will use the MCP server to fetch, extract, and return the content according to the selected parameters:
- Content format: Choose between text (clean text, default), html (full HTML), or both
- Content size control: Use maxContentLength for quick mapping (500-1000 chars) or leave undefined for complete analysis
- Issue detection: Control with the detectIssues parameter (default: true)
- Link extraction: Configure extractRelatedLinks and extractNavigationLinks independently
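For clients that call the server programmatically rather than through ChatGPT, here is a minimal sketch using the MCP TypeScript SDK over Streamable HTTP. The tool name fetch_url and the parameter names come from this README; the url argument name and the exact input schema are assumptions and may differ.

```typescript
// Sketch: calling the fetch_url tool over Streamable HTTP (illustrative only).
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

const client = new Client({ name: "scrapidou-example", version: "1.0.0" });
await client.connect(
  new StreamableHTTPClientTransport(new URL("https://scrapidou.rankorr.red/mcp"))
);

const result = await client.callTool({
  name: "fetch_url",
  arguments: {
    url: "https://blog.example.com/article", // assumed argument name
    contentFormat: "text",                   // 'text' | 'html' | 'both'
    maxContentLength: 500,                   // quick mapping / preview
    detectIssues: true,
    extractRelatedLinks: true,
    extractNavigationLinks: false,
  },
});

console.log(result);
```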
📖 Use Cases & Content Extraction
What is extracted?
The tool extracts two types of content, and you can choose which one(s) to return:
contentText (Text content)
- What it is: Clean, readable text extracted from the main content of the page
- How it's extracted:
- Uses Mozilla Readability algorithm to identify the main content
- Removes HTML tags, scripts, styles
- Cleans up whitespace and formatting
- Preserves paragraph structure
- Available when: contentFormat: 'text' or contentFormat: 'both' (default: 'text')
- Use case: Perfect for LLM consumption, summarization, analysis
- No size limit: Full content is returned in structuredContent.contentText
contentHTML (HTML content)
- What it is: Full HTML of the main content area (preserves formatting, structure)
- How it's extracted:
- Uses Mozilla Readability to extract the main content HTML
- Preserves HTML structure, images, links, formatting
- Removes navigation, headers, footers, ads
- Available when: contentFormat: 'html' or contentFormat: 'both'
- Use case: Technical analysis, preserving formatting, advanced processing
- Size control: Use maxContentLength to limit extraction (default: no limit - full HTML)
- Note: When contentFormat is 'html' or 'both', the tool automatically extracts HTML internally
Response Structure
All responses follow this structure:
{
// 1. Summary (markdown text visible to user and model)
content: [{
type: 'text',
text: '📄 Content extracted from: https://example.com\n...'
}],
// 2. Structured data (accessible by ChatGPT)
structuredContent: {
type: 'webpage',
url: 'https://example.com',
contentFormat: 'text', // 'text' | 'html' | 'both'
maxContentLength: undefined, // Optional: limit content size (undefined = no limit)
metadata: { title, description, author, publishedDate },
contentText: '...', // Text content (truncated if maxContentLength specified)
contentHTML: '...', // HTML content (if contentFormat: 'html' or 'both')
issues: [{ type: 'paywall', message: '...' }], // Empty array if detectIssues: false
relatedLinks: [{ url, text, type }], // All links (no limit)
navigationLinks: [{ url, text, level }], // All links (no limit)
contentTextLength: 1234, // Original full length
contentTextExtractedLength: 1234, // Actual extracted length
contentTextTruncated: false, // true if truncated
contentHTMLLength: 5678, // Original full length (if HTML present)
contentHTMLExtractedLength: 5678, // Actual extracted length (if HTML present)
contentHTMLTruncated: false // true if truncated (if HTML present)
}
}
Decision Matrix
| Use Case | contentFormat | maxContentLength | detectIssues | extractRelatedLinks | extractNavigationLinks |
|----------|----------------|-------------------|----------------|----------------------|--------------------------|
| Quick mapping/summary | text | 500-1000 | false | false | false |
| Article/blog post | text | undefined (full) | true (default) | true (default) | false |
| Wikipedia page | text | undefined (full) | true (default) | true (default) | false |
| Documentation site | text | undefined (full) | true (default) | false | true |
| Need HTML content | html | undefined (full) | true (default) | true (default) | true (if needed) |
| Need both text & HTML | both | undefined (full) | true (default) | true (default) | true (if needed) |
| Technical analysis | html | undefined (full) | true (default) | false | true |
| Preview/quick read | text | 2000-5000 | true (default) | true (default) | false |
Notes:
- When contentFormat is 'html' or 'both', the tool automatically extracts HTML internally.
- Use maxContentLength for quick mapping/summaries (500-1000 chars) or previews (2000-5000 chars). Leave undefined for complete analysis.
- Set detectIssues: false for faster extraction when you know the content is freely accessible.
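To make the matrix concrete, here are two illustrative fetch_url argument objects corresponding to the "Quick mapping/summary" and "Documentation site" rows. Parameter names are taken from this README; the url argument name is assumed.

```typescript
// Illustrative argument objects for two decision-matrix rows (not exhaustive).

// Quick mapping / summary: small text preview, skip extras for speed.
const quickMapping = {
  url: "https://example.com",        // assumed argument name
  contentFormat: "text",
  maxContentLength: 800,
  detectIssues: false,
  extractRelatedLinks: false,
  extractNavigationLinks: false,
};

// Documentation site: full text plus sidebar/menu links, no related-article links.
const documentationSite = {
  url: "https://docs.example.com/page",
  contentFormat: "text",
  // maxContentLength omitted → full content
  detectIssues: true,
  extractRelatedLinks: false,
  extractNavigationLinks: true,
};
```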
Parameters
contentFormat - Content Type
Controls what type of content is returned:
contentFormat: 'text' (default)
- Returns structuredContent.contentText with text content
- Perfect for LLM analysis, summarization, general understanding
- Available in all modes

contentFormat: 'html'
- Returns structuredContent.contentHTML with HTML content
- Automatically extracts HTML internally
- Preserves formatting, structure, images, links
- Best for technical analysis, preserving document structure

contentFormat: 'both'
- Returns both structuredContent.contentText and structuredContent.contentHTML
- Automatically extracts HTML internally
- Use when you need both formats for different purposes
maxContentLength - Content Size Control
Controls the maximum number of characters to extract (applies to both text and HTML):
maxContentLength: undefined (default - no limit)
- Extracts complete content without any truncation
- Use for complete analysis, deep understanding, or when you need all information
- Best for thorough content analysis
maxContentLength: 500-1000
- Quick mapping or brief summaries
- Use for getting a quick overview of the content
- Good for previews or when you only need the beginning
maxContentLength: 2000-5000
- Detailed previews or extended summaries
- Use when you need more context but not the full content
- Good balance between completeness and token usage
Important: The content is truncated at the specified limit if longer. The response includes:
- contentTextTruncated / contentHTMLTruncated: true if content was truncated
- contentTextLength / contentHTMLLength: Original full length
- contentTextExtractedLength / contentHTMLExtractedLength: Actual extracted length
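As a sketch of how a client could use these flags (reusing the client from the Streamable HTTP example earlier; the field names below match the response structure shown above, while the url argument name is assumed):

```typescript
// Fetch a 1000-character preview, then re-fetch in full only if it was truncated.
const article = "https://example.com/long-article";

const preview = await client.callTool({
  name: "fetch_url",
  arguments: { url: article, contentFormat: "text", maxContentLength: 1000 },
});

// Field names taken from the response structure documented above.
const sc: any = (preview as any).structuredContent;

if (sc?.contentTextTruncated) {
  // Only the first 1000 of sc.contentTextLength characters came back; fetch the full text.
  await client.callTool({
    name: "fetch_url",
    arguments: { url: article, contentFormat: "text" }, // maxContentLength omitted → no limit
  });
}
```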
detectIssues - Issue Detection
Controls whether to detect issues on the page:
detectIssues: true (default)
- Analyzes the page to detect paywalls, login requirements, or partial content
- Use for general use cases when you want to know if there are access issues
- Adds a small processing overhead
detectIssues: false
- Skips issue detection for faster extraction
- Use when you know the content is freely accessible
- Best for quick mapping or when you don't need issue information
Content Extraction Details
Text Content (contentText):
- Extracted using Mozilla Readability algorithm
- Removes HTML tags, scripts, styles, ads
- Preserves paragraph structure
- Cleaned whitespace and formatting
- Available when: contentFormat: 'text' or 'both'
- Size control: Use maxContentLength to limit extraction (default: no limit - full content)
HTML Content (contentHTML):
- Extracted using Mozilla Readability algorithm
- Preserves HTML structure, images, links, formatting
- Removes navigation, headers, footers, ads
- Available when: contentFormat: 'html' or 'both'
- Size control: Use maxContentLength to limit extraction (default: no limit - full HTML)
- Automatic mode: When contentFormat is 'html' or 'both', the tool uses mode: 'full' internally
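For reference, this is roughly the extraction approach described above, using the same libraries the project depends on (jsdom and @mozilla/readability). It is a simplified sketch, not the project's actual contentExtractor.ts.

```typescript
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

// Parse the fetched HTML and let Readability isolate the main article content.
function extractContent(html: string, url: string) {
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) return null;

  return {
    contentText: article.textContent, // clean text for LLM consumption
    contentHTML: article.content,     // main-content HTML (nav, footers, ads removed)
    title: article.title,
  };
}
```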
Important Notes
- No widget: This tool doesn't use widgets, so all content is directly accessible in structuredContent
- Flexible size control: Use maxContentLength for quick mapping (500-1000 chars) or leave undefined for complete analysis
- All links included: Related links and navigation links are returned in full (no limits)
- No _meta complexity: Since there's no widget, we don't need complex _meta structures
- Truncation indicators: When maxContentLength is used, the response includes contentTextTruncated / contentHTMLTruncated flags and length information
🏗️ Architecture: MCP Server
What is an MCP Server?
MCP (Model Context Protocol) servers allow you to extend ChatGPT and other LLMs with:
- Custom tools (call external APIs)
- Real-time data (up-to-date information)
How does it work?
┌─────────────┐         ┌──────────────┐         ┌──────────────┐
│   ChatGPT   │ ◄─────► │  MCP Server  │ ◄─────► │  Target URL  │
│             │  HTTP   │  (Node.js)   │  HTTP   │              │
└─────────────┘         └──────────────┘         └──────────────┘

- ChatGPT connects via Streamable HTTP to /mcp (GET/POST)
- The MCP server fetches data from the target URL
- The results are returned to ChatGPT
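To illustrate the server side of this flow, here is a minimal sketch of how a tool like fetch_url can be registered with the official MCP TypeScript SDK over stdio. It is simplified and hypothetical, not the project's actual servers/stdio.ts (which also runs Readability, issue detection, and link extraction).

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "scrapidou-sketch", version: "0.0.1" });

// Register a bare-bones fetch_url tool: fetch the page and return (optionally truncated) text.
server.tool(
  "fetch_url",
  { url: z.string().url(), maxContentLength: z.number().optional() },
  async ({ url, maxContentLength }) => {
    const res = await fetch(url, { redirect: "follow" });
    const body = await res.text();
    const text = maxContentLength ? body.slice(0, maxContentLength) : body;
    return { content: [{ type: "text", text }] };
  }
);

// IDE clients use stdio; ChatGPT connects through the Streamable HTTP server instead.
await server.connect(new StdioServerTransport());
```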
MCP Protocol
MCP (Model Context Protocol) is an open standard created by Anthropic that allows LLMs to access external data and tools securely. It is used by:
- ChatGPT (via MCP connectors)
- Claude Desktop
- Cursor
- Other MCP clients
🚀 Quick Start
Use with Cursor / Claude Desktop / Warp
The easiest way - Install the npm client that connects to the remote server:
{
"mcpServers": {
"mcp-scrapidou": {
"command": "npx",
"args": ["-y", "@shyzus/mcp-scrapidou"]
}
}
}
Config file locations:
- Cursor: ~/.cursor/mcp.json (macOS/Linux) or %APPDATA%\Cursor\mcp.json (Windows)
- Claude Desktop: ~/Library/Application Support/Claude/claude_desktop_config.json (macOS)
- Warp: In Warp AI settings
Use with ChatGPT
A production server is available and ready to use!
Server URL: https://scrapidou.rankorr.red/mcp
ChatGPT Configuration
- Have a ChatGPT account with a subscription (ChatGPT Plus, Team, or Enterprise)
- Open ChatGPT in your browser → Go to Settings (⚙️)
- Go to "Apps & Connectors"
- Enable developer mode:
- In "Advanced Settings", enable developer mode
- Go back
- Create a new application:
- The "Create" button now appears in the top right
- Click on it
- Fill in the form:
- Name: "Scrapidou" (or another name)
- Image: Add an icon/image (optional)
- Server URL: https://scrapidou.rankorr.red/mcp
- Note: The server uses Streamable HTTP transport (modern MCP standard)
- Authentication: Select "None"
- Click "Create"
- The application is now available in ChatGPT
For developers - Local installation
# 1. Clone the project
git clone https://github.com/Shyzkanza/mcp-fetch-url.git
cd mcp-fetch-url
# 2. Install dependencies
npm install
# 3. Build
npm run build
# 4. Use locally
npx @modelcontextprotocol/inspector node dist/index.js

📂 Project Structure
mcp-fetch-url/
├── src/
│   ├── config.ts                  # Centralized configuration
│   ├── types.ts                   # Shared TypeScript types
│   ├── client/
│   │   └── httpClient.ts          # HTTP client with headers, redirects, timeout
│   ├── tools/
│   │   └── fetchUrl.ts            # MCP tool: fetch_url
│   ├── resources/                 # Templates (future)
│   ├── servers/
│   │   ├── stdio.ts               # stdio server (IDEs)
│   │   └── http.ts                # Streamable HTTP server (ChatGPT)
│   ├── utils/
│   │   ├── errors.ts              # Centralized error handling
│   │   ├── contentExtractor.ts    # Content extraction (Readability + fallback) + text extraction
│   │   ├── issueDetector.ts       # Paywall, login, and partial-content detection
│   │   ├── linkExtractor.ts       # Relevant link extraction (related links)
│   │   └── navigationExtractor.ts # Navigation link extraction (sidebar/menu)
│   ├── index.ts                   # stdio entry point
│   ├── http-server.ts             # HTTP entry point
│   └── http-client.ts             # npm client
├── dist/                          # Compiled code (generated)
├── Dockerfile                     # Multi-stage Docker image
├── docker-compose.yml             # Stack with Traefik labels
├── .nvmrc                         # Node version (20)
├── package.json                   # Server dependencies
├── tsconfig.json                  # TypeScript config
└── README.md                      # This file

🛠️ Available Commands
📖 Full documentation: COMMANDS.md
Quick Reference
# 🌟 Recommended for ChatGPT development (2 terminals)
npm run tunnel # Terminal 1: ngrok (keep running)
npm run dev # Terminal 2: Dev server with hot-reload
# Alternative: All-in-one
npm run dev:tunnel # Dev + ngrok in parallel
# Testing
npm run inspect # Launch MCP Inspector
npm run health # Health check
# Build & Production
npm run build # Compile TypeScript
npm run rebuild # Clean + Build
npm run build:start # Build then start
# Utilities
npm run kill # Kill process on port 3000
npm run kill:tunnel  # Kill ngrok

Cursor Commands
Available via Cmd+Shift+P:
- dev-server - Dev with hot-reload (recommended)
- tunnel-only - Launch ngrok (keep running)
- mcp-inspector - Launch MCP Inspector
- build / rebuild / clean
- kill-server / kill-tunnel
See COMMANDS.md for the complete list.
🔧 Advanced Configuration
Environment variables
Create a .env file:
PORT=3000 # HTTP server port
NODE_ENV=production # Environment
CORS_ORIGIN=*        # CORS origin (default: * in dev, https://chatgpt.com in prod)

🏗️ Architecture Details
This project serves as a template/base for future MCP servers with a clean, modular architecture:
Separation of Concerns
- config.ts: Environment variables, constants, validation
- types.ts: Shared TypeScript interfaces
- client/httpClient.ts: HTTP client abstraction (fetch, headers, redirects, timeout)
- tools/fetchUrl.ts: Business logic (validation, extraction orchestration)
- utils/contentExtractor.ts: Content extraction (Readability + fallback)
- utils/issueDetector.ts: Issue detection (paywall, login, partial content)
- utils/linkExtractor.ts: Related links extraction and filtering
- servers/: MCP implementation (stdio/Streamable HTTP), reuses tools
- utils/errors.ts: Custom error classes, formatting
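As an illustration of the kind of abstraction the HTTP client layer provides, here is a hypothetical sketch (the function name fetchPage, the timeout default, and the user-agent string are invented for the example; the real client/httpClient.ts may differ):

```typescript
// Hypothetical HTTP client helper: custom headers, redirect following, and a timeout.
export interface FetchOptions {
  timeoutMs?: number;
}

export async function fetchPage(url: string, opts: FetchOptions = {}): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), opts.timeoutMs ?? 10_000);
  try {
    const res = await fetch(url, {
      redirect: "follow",
      signal: controller.signal,
      headers: { "User-Agent": "Mozilla/5.0 (compatible; Scrapidou)" },
    });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return await res.text();
  } finally {
    clearTimeout(timer);
  }
}
```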
See CONTEXT.md for detailed architecture documentation.
📚 Resources & Documentation
Project Documentation
- CONTEXT.md - Project memory (status, decisions, changelog)
- COMMANDS.md - All npm scripts and Cursor commands
- OPENAI_APPS_SDK_REFERENCE.md - Complete SDK reference guide
Official documentation
- Model Context Protocol - MCP spec
- MCP SDK TypeScript - Node.js SDK
- ChatGPT Connectors - How to use MCP with ChatGPT
Community
- MCP Servers Repository - Official examples
🐛 Debugging & Troubleshooting
Server won't start
# Check that Node.js is installed (requires Node 20+)
node --version # Must be 20+
# If using nvm, switch to Node 20
nvm use 20 # or nvm install 20
# Check that dependencies are installed
npm install
# Full rebuild
npm run build

Note: This project requires Node.js 20+ due to its dependencies (jsdom, @mozilla/readability). Use the .nvmrc file or nvm use to ensure the correct version.
CORS errors
The server allows all origins in dev. In production, restrict in src/servers/http.ts:
res.setHeader('Access-Control-Allow-Origin', 'https://chatgpt.com');

🚀 Use This Project as a Template
This project is a complete template for creating your own MCP servers with a clean architecture.
To create your own MCP server:
- Duplicate this project
- Implement your tools in src/tools/
- Customize the configuration in src/config.ts
- Deploy!
📝 License
MIT - Use freely for your personal or commercial projects.
🙏 Credits & Attributions
- MCP Protocol - Anthropic
📞 Support
For any questions:
- 📖 Check the MCP documentation
- 💬 Open an issue on GitHub
Have fun with your MCP server! 🕷️✨
