
MCP Crawl4AI Server

🚀 A powerful Model Context Protocol (MCP) server that brings advanced web scraping capabilities to Claude Desktop. Built on top of Crawl4AI, this server enables Claude to crawl websites, extract structured data, handle authentication, and process dynamic content.


Features

🔐 Authentication Support

  • Login Handling - Automatically log in to protected sites
  • Session Management - Maintain authenticated sessions
  • Custom Selectors - Configure login form selectors
  • Multi-strategy Login - Smart detection of login fields

🕷️ Core Crawling

  • Single URL Crawling - Advanced single page scraping with screenshots, PDFs, and browser control
  • Batch Crawling - Parallel crawling of multiple URLs with memory management
  • Deep Crawling - Recursive crawling with BFS, DFS, and best-first strategies
  • Dynamic Content - Handle JavaScript-heavy sites with scrolling and JS execution

📊 Data Extraction

  • Structured Extraction - CSS and XPath selectors for precise data extraction
  • LLM Extraction - Use GPT-4, Claude, or other models to extract semantic information
  • Link Analysis - Extract and preview all internal/external links
  • Content Filtering - BM25, LLM-based, and threshold filtering

⚡ Advanced Features

  • JavaScript Execution - Run custom JS code during crawling
  • Dynamic Loading - Handle infinite scroll and AJAX content
  • Caching System - Persistent storage of crawled content
  • Memory Management - Adaptive memory control for large-scale crawling

Quick Start

Install with Claude Desktop (One Command!)

claude mcp add crawl4ai --scope user -- npx -y mcp-crawl4ai

That's it! The server will be available in all your Claude Desktop conversations.

Detailed Installation

Method 1: Using claude mcp add (Recommended)

# Install for all projects (recommended)
claude mcp add crawl4ai --scope user -- npx -y mcp-crawl4ai

# Or install for current project only
claude mcp add crawl4ai -- npx -y mcp-crawl4ai

This will automatically:

  • Download and configure the MCP server
  • Install Python dependencies (if pip is available)
  • Set up the server for use across all your projects

Method 2: Manual Configuration

Prerequisites

  • Python 3.10 or higher
  • Node.js 16+ (for npx)
  • Chrome/Chromium browser

Configure Claude Desktop

  1. Find your Claude configuration file:

    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
    • Windows: %APPDATA%\Claude\claude_desktop_config.json
    • Linux: ~/.config/Claude/claude_desktop_config.json
  2. Edit the configuration file:

{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["-y", "mcp-crawl4ai"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here"
      }
    }
  }
}

For local development, you can also point directly to the Python server:

{
  "mcpServers": {
    "crawl4ai-local": {
      "command": "python3",
      "args": ["/absolute/path/to/mcp-crawl4ai/server.py"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here"
      }
    }
  }
}

Restart Claude Desktop

  1. Completely quit Claude Desktop (not just close the window)
  2. Start Claude Desktop again
  3. Look for the MCP server indicator (🔌) in the bottom-right of the input box
  4. Click the indicator to verify "mcp-crawl4ai" is connected

Configuration

Environment Variables

Configure API keys and credentials in your Claude Desktop config or .env file:

# Optional: For LLM-based features
OPENAI_API_KEY=your-openai-api-key

# Transport mode (stdio or sse)
TRANSPORT=stdio

# Browser settings
HEADLESS=true
BROWSER_TYPE=chromium
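
If you run the server locally (see the crawl4ai-local example above), here is a minimal sketch of reading these variables at startup, assuming the server relies on python-dotenv; the actual handling in server.py may differ:

# A minimal sketch, assuming the local server loads settings with python-dotenv.
import os
from dotenv import load_dotenv  # assumed dependency: pip install python-dotenv

load_dotenv()  # reads a .env file from the working directory, if present

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")               # optional, enables LLM features
TRANSPORT = os.getenv("TRANSPORT", "stdio")                # "stdio" or "sse"
HEADLESS = os.getenv("HEADLESS", "true").lower() == "true"
BROWSER_TYPE = os.getenv("BROWSER_TYPE", "chromium")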

Usage Examples

Basic Crawling

# Crawl a single URL
result = await crawl_url(
    url="https://example.com",
    screenshot=True,
    pdf=False,
    wait_for="#content"
)

# Batch crawl multiple URLs
results = await batch_crawl(
    urls=["https://example.com", "https://example.org"],
    max_concurrent=5
)

Crawling with Authentication

# Crawl a login-protected site
result = await crawl_url(
    url="https://theoryofknowledge.net/tok-resources/tok-newsletter-archive/",
    username="[email protected]",
    password="Trfz998afds#",
    wait_for=".content-area"  # Wait for content after login
)

# Custom login selectors if defaults don't work
result = await crawl_url(
    url="https://example.com/protected",
    username="myuser",
    password="mypass",
    login_url="https://example.com/login",  # Optional: specific login page
    username_selector="#username",  # Custom username field selector
    password_selector="#password",  # Custom password field selector
    submit_selector="#login-button"  # Custom submit button selector
)

Deep Crawling

# BFS deep crawl
result = await deep_crawl(
    start_url="https://docs.example.com",
    max_depth=3,
    max_pages=50,
    strategy="bfs",
    allowed_domains=["docs.example.com"]
)

# Best-first crawling with keyword focus
result = await deep_crawl(
    start_url="https://blog.example.com",
    strategy="best_first",
    keyword_focus=["AI", "machine learning", "neural networks"]
)

Data Extraction

# Extract structured data with CSS selectors
schema = {
    "title": "h1.article-title",
    "author": "span.author-name",
    "date": "time.publish-date",
    "content": "div.article-content"
}

result = await extract_structured_data(
    url="https://blog.example.com/article",
    schema=schema,
    extraction_type="json_css"
)

# LLM-based extraction
result = await extract_with_llm(
    url="https://example.com/product",
    instruction="Extract product name, price, and key features",
    model="gpt-4o-mini"
)

Dynamic Content

# Handle infinite scroll
result = await crawl_dynamic_content(
    url="https://example.com/feed",
    scroll=True,
    max_scrolls=10,
    scroll_delay=1000
)

# Execute custom JavaScript
result = await crawl_with_js_execution(
    url="https://spa.example.com",
    js_code="""
        document.querySelector('.load-more').click();
        await new Promise(r => setTimeout(r, 2000));
    """,
    wait_for_js="document.querySelectorAll('.item').length > 10"
)

Content Filtering

# BM25 relevance filtering
result = await crawl_with_filter(
    url="https://news.example.com",
    filter_type="bm25",
    query="artificial intelligence breakthrough",
    threshold=0.5
)

# Prune low-content sections
result = await crawl_with_filter(
    url="https://example.com",
    filter_type="pruning",
    min_word_threshold=100
)

Available Tools

Core Crawling Tools

  • crawl_url - Comprehensive single URL crawling with optional authentication
  • crawl_with_auth - Specialized tool for login-protected sites (see the sketch after this list)
  • batch_crawl - Parallel multi-URL crawling
  • deep_crawl - Recursive crawling with strategies
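
crawl_with_auth is not shown in the usage examples above; a hedged sketch, assuming it accepts the same credential parameters as the crawl_url authentication example (check the tool's schema for the exact names):

# A sketch only; parameter names are assumed to match the crawl_url
# authentication example above.
result = await crawl_with_auth(
    url="https://example.com/protected",
    username="[email protected]",
    password="your-password",
    wait_for=".content-area"  # wait for post-login content
)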

Extraction Tools

  • extract_structured_data - CSS/XPath data extraction
  • extract_with_llm - LLM-powered extraction
  • extract_links - Link extraction and preview (see the sketch after this list)
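
extract_links is not demonstrated above; a minimal sketch, assuming it only needs a URL and that the result groups internal and external links:

# A sketch only; the result shape ("internal"/"external" keys) is an assumption.
links = await extract_links(url="https://example.com")
for href in links.get("internal", []):
    print(href)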

Advanced Tools

  • crawl_with_js_execution - JavaScript execution
  • crawl_dynamic_content - Handle dynamic loading
  • crawl_with_filter - Content filtering

Data Management

  • get_crawled_content - Retrieve stored content
  • list_crawled_content - List all crawled items (see the sketch after this list)
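
A hedged sketch of the data-management tools, assuming list_crawled_content takes no required arguments and get_crawled_content is keyed by URL; the exact signatures may differ:

# A sketch only; argument names and result shapes are assumptions.
items = await list_crawled_content()                            # summaries of everything in cache/
cached = await get_crawled_content(url="https://example.com")   # one stored result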

Architecture

mcp-crawl4ai/
├── server.py           # Main MCP server implementation
├── pyproject.toml      # Package configuration
├── README.md           # Documentation
├── .env                # Environment variables
└── cache/              # Cached crawl results

Performance Tips

  1. Memory Management: Use max_concurrent parameter for batch operations
  2. Caching: Enable caching for repeated crawls
  3. Filtering: Use content filters to reduce data size
  4. Deep Crawling: Set reasonable max_depth and max_pages limits

Troubleshooting

Common Issues

  1. Browser not found: Install Playwright browsers

    playwright install chromium
  2. Memory issues: Reduce max_concurrent value

  3. JavaScript timeout: Increase the js_timeout parameter (see the sketch after this list)

  4. LLM features not working: Set OPENAI_API_KEY in environment
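
For issue 3, a hedged sketch of raising the JavaScript timeout, assuming js_timeout is given in milliseconds:

# A sketch only; the unit (milliseconds) and the default value are assumptions.
result = await crawl_with_js_execution(
    url="https://spa.example.com",
    js_code="document.querySelector('.load-more').click();",
    js_timeout=30000  # raise this if injected scripts need more time
)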

Development

Running Tests

pytest tests/

Code Formatting

black .
ruff check .

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details

Acknowledgments