MCP Crawl4AI Server
🚀 A powerful Model Context Protocol (MCP) server that brings advanced web scraping capabilities to Claude Desktop. Built on top of Crawl4AI, this server enables Claude to crawl websites, extract structured data, handle authentication, and process dynamic content.
Features
🔐 Authentication Support
- Login Handling - Automatically log in to protected sites
- Session Management - Maintain authenticated sessions
- Custom Selectors - Configure login form selectors
- Multi-strategy Login - Smart detection of login fields
🕷️ Core Crawling
- Single URL Crawling - Advanced single page scraping with screenshots, PDFs, and browser control
- Batch Crawling - Parallel crawling of multiple URLs with memory management
- Deep Crawling - Recursive crawling with BFS, DFS, and best-first strategies
- Dynamic Content - Handle JavaScript-heavy sites with scrolling and JS execution
📊 Data Extraction
- Structured Extraction - CSS and XPath selectors for precise data extraction
- LLM Extraction - Use GPT-4, Claude, or other models to extract semantic information
- Link Analysis - Extract and preview all internal/external links
- Content Filtering - BM25, LLM-based, and threshold filtering
⚡ Advanced Features
- JavaScript Execution - Run custom JS code during crawling
- Dynamic Loading - Handle infinite scroll and AJAX content
- Caching System - Persistent storage of crawled content
- Memory Management - Adaptive memory control for large-scale crawling
Quick Start
Install with Claude Desktop (One Command!)
claude mcp add crawl4ai --scope user -- npx -y mcp-crawl4ai
That's it! The server will be available in all your Claude Desktop conversations.
Detailed Installation
Method 1: Using claude mcp add (Recommended)
# Install for all projects (recommended)
claude mcp add crawl4ai --scope user -- npx -y mcp-crawl4ai
# Or install for current project only
claude mcp add crawl4ai -- npx -y mcp-crawl4ai
This will automatically:
- Download and configure the MCP server
- Install Python dependencies (if pip is available)
- Set up the server for use across all your projects
Method 2: Manual Configuration
Prerequisites
- Python 3.10 or higher
- Node.js 16+ (for npx)
- Chrome/Chromium browser
Configure Claude Desktop
Find your Claude configuration file:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
- Linux: ~/.config/Claude/claude_desktop_config.json
Edit the configuration file:
{
"mcpServers": {
"crawl4ai": {
"command": "npx",
"args": ["-y", "mcp-crawl4ai"],
"env": {
"OPENAI_API_KEY": "sk-your-key-here"
}
}
}
}
For local development, you can also point directly to the Python server:
{
"mcpServers": {
"crawl4ai-local": {
"command": "python3",
"args": ["/absolute/path/to/mcp-crawl4ai/server.py"],
"env": {
"OPENAI_API_KEY": "sk-your-key-here"
}
}
}
}
Restart Claude Desktop
- Completely quit Claude Desktop (not just close the window)
- Start Claude Desktop again
- Look for the MCP server indicator (🔌) in the bottom-right of the input box
- Click the indicator to verify "mcp-crawl4ai" is connected
Configuration
Environment Variables
Configure API keys and credentials in your Claude Desktop config or .env file:
# Optional: For LLM-based features
OPENAI_API_KEY=your-openai-api-key
# Transport mode (stdio or sse)
TRANSPORT=stdio
# Browser settings
HEADLESS=true
BROWSER_TYPE=chromium
Usage Examples
Basic Crawling
# Crawl a single URL
result = await crawl_url(
url="https://example.com",
screenshot=True,
pdf=False,
wait_for="#content"
)
# Batch crawl multiple URLs
results = await batch_crawl(
urls=["https://example.com", "https://example.org"],
max_concurrent=5
)
Crawling with Authentication
# Crawl a login-protected site
result = await crawl_url(
url="https://theoryofknowledge.net/tok-resources/tok-newsletter-archive/",
username="[email protected]",
password="Trfz998afds#",
wait_for=".content-area" # Wait for content after login
)
# Custom login selectors if defaults don't work
result = await crawl_url(
url="https://example.com/protected",
username="myuser",
password="mypass",
login_url="https://example.com/login", # Optional: specific login page
username_selector="#username", # Custom username field selector
password_selector="#password", # Custom password field selector
submit_selector="#login-button" # Custom submit button selector
)
Deep Crawling
# BFS deep crawl
result = await deep_crawl(
start_url="https://docs.example.com",
max_depth=3,
max_pages=50,
strategy="bfs",
allowed_domains=["docs.example.com"]
)
# Best-first crawling with keyword focus
result = await deep_crawl(
start_url="https://blog.example.com",
strategy="best_first",
keyword_focus=["AI", "machine learning", "neural networks"]
)
Data Extraction
# Extract structured data with CSS selectors
schema = {
"title": "h1.article-title",
"author": "span.author-name",
"date": "time.publish-date",
"content": "div.article-content"
}
result = await extract_structured_data(
url="https://blog.example.com/article",
schema=schema,
extraction_type="json_css"
)
# LLM-based extraction
result = await extract_with_llm(
url="https://example.com/product",
instruction="Extract product name, price, and key features",
model="gpt-4o-mini"
)
Dynamic Content
# Handle infinite scroll
result = await crawl_dynamic_content(
url="https://example.com/feed",
scroll=True,
max_scrolls=10,
scroll_delay=1000
)
# Execute custom JavaScript
result = await crawl_with_js_execution(
url="https://spa.example.com",
js_code="""
document.querySelector('.load-more').click();
await new Promise(r => setTimeout(r, 2000));
""",
wait_for_js="document.querySelectorAll('.item').length > 10"
)
Content Filtering
# BM25 relevance filtering
result = await crawl_with_filter(
url="https://news.example.com",
filter_type="bm25",
query="artificial intelligence breakthrough",
threshold=0.5
)
# Prune low-content sections
result = await crawl_with_filter(
url="https://example.com",
filter_type="pruning",
min_word_threshold=100
)
Available Tools
Core Crawling Tools
- crawl_url - Comprehensive single URL crawling with optional authentication
- crawl_with_auth - Specialized tool for login-protected sites (see the sketch below)
- batch_crawl - Parallel multi-URL crawling
- deep_crawl - Recursive crawling with BFS, DFS, and best-first strategies
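crawl_with_auth is not shown in the usage examples above. A minimal sketch of how it might be called, assuming it mirrors crawl_url's authentication parameters (the parameter names here are assumptions, not documented):
# Hypothetical call; parameters mirror crawl_url's auth options
result = await crawl_with_auth(
    url="https://example.com/dashboard",
    username="[email protected]",
    password="your-password",
    wait_for=".dashboard"  # Wait for post-login content
)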
Extraction Tools
- extract_structured_data - CSS/XPath data extraction
- extract_with_llm - LLM-powered extraction
- extract_links - Link extraction and preview (see the sketch below)
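extract_links has no example above either. A minimal sketch, assuming the tool only needs the target URL (any other parameters are omitted rather than guessed):
# Hypothetical call; assumes the tool takes just the target URL
links = await extract_links(url="https://example.com")
# Expect internal and external links back, per the feature list above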
Advanced Tools
- crawl_with_js_execution - JavaScript execution
- crawl_dynamic_content - Handle dynamic loading
- crawl_with_filter - Content filtering
Data Management
- get_crawled_content - Retrieve stored content (see the sketch below)
- list_crawled_content - List all crawled items
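Neither data management tool appears in the usage examples. A minimal sketch of how they might be used together, assuming list_crawled_content takes no arguments and get_crawled_content is keyed by URL (both are assumptions, not documented behavior):
# Hypothetical calls; argument names are assumptions
items = await list_crawled_content()
content = await get_crawled_content(url="https://example.com")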
Architecture
mcp-crawl4ai/
├── server.py # Main MCP server implementation
├── pyproject.toml # Package configuration
├── README.md # Documentation
├── .env # Environment variables
└── cache/ # Cached crawl results
Performance Tips
- Memory Management: Use the max_concurrent parameter for batch operations (see the example after this list)
- Caching: Enable caching for repeated crawls
- Filtering: Use content filters to reduce data size
- Deep Crawling: Set reasonable max_depth and max_pages limits
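A short sketch applying these tips with the parameters already shown in the usage examples (the values are illustrative, not recommendations from the project):
# Modest concurrency for a large batch
results = await batch_crawl(
    urls=urls,
    max_concurrent=3
)
# Explicit bounds on a recursive crawl
site = await deep_crawl(
    start_url="https://docs.example.com",
    max_depth=2,
    max_pages=25
)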
Troubleshooting
Common Issues
Browser not found: Install Playwright browsers
playwright install chromium
Memory issues: Reduce the max_concurrent value
JavaScript timeout: Increase the js_timeout parameter (see the sketch below)
LLM features not working: Set OPENAI_API_KEY in the environment
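For the JavaScript timeout case, a hedged sketch of where the parameter might go; js_timeout comes from the tip above, but its unit and its placement on crawl_with_js_execution are assumptions:
# Assumed usage: raise the JS timeout for slow single-page apps
result = await crawl_with_js_execution(
    url="https://spa.example.com",
    js_code="document.querySelector('.load-more').click();",
    js_timeout=60000  # assumed to be milliseconds
)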
Development
Running Tests
pytest tests/
Code Formatting
black .
ruff check .
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details
Acknowledgments
- Built on top of Crawl4AI
- Uses the Model Context Protocol
- FastMCP framework for MCP server implementation
