url-content-extractor-mcp

v1.0.5

Published

8 months ago

MCP server for extracting content from URLs with proper citations


systematiccaos

mcp model-context-protocol url content extraction citations web-scraping ai-tools claude openai

URL Content Extractor MCP Server

A Model Context Protocol server that extracts content from URLs and provides properly formatted citations. Perfect for AI assistants that need to access and cite web content.

🚀 Quick Start

Install and Run with uvx/npx

# Run directly with uvx (recommended)
# Note: uvx resolves packages from PyPI; if the package is not published there, use npx instead
uvx url-content-extractor-mcp

# Or with npx
npx url-content-extractor-mcp

Install Globally

npm install -g url-content-extractor-mcp
url-content-extractor-mcp

🔌 MCP Client Configuration

Claude Desktop

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "url-content-extractor": {
      "command": "uvx",
      "args": ["url-content-extractor-mcp"]
    }
  }
}

Continue.dev

Add to your MCP configuration:

{
  "mcpServers": {
    "url-content-extractor": {
      "command": "npx",
      "args": ["url-content-extractor-mcp"]
    }
  }
}

🛠️ Features

Multiple URL processing: Extract content from multiple URLs in one call
Citation formats: APA, MLA, and Simple citation styles
Main-content extraction: Keeps the main content and strips navigation and ads
Domain filtering: Allow/block specific domains for security
Metadata extraction: Title, author, publication date, description
Error handling: Failed URLs are reported alongside the successful ones instead of aborting the call
TypeScript: Full type safety and modern JavaScript features

📖 Usage

The server provides one tool, extract_url_content. The prompts below are examples of what you can ask an assistant that has the server connected.

Single URL

Extract content from: https://example.com/article

Multiple URLs

Compare these articles: https://site1.com/news, https://site2.com/blog
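Behind prompts like these, the MCP client sends a JSON-RPC `tools/call` request to the server. The following is a minimal sketch of how such a request could be built. The tool name comes from this README, but the argument name `urls` and the helper `buildExtractRequest` are assumptions made for illustration; the schema the server returns from `tools/list` is the authoritative source for argument names.

```typescript
// Shape of a JSON-RPC 2.0 request for the MCP `tools/call` method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string;
    arguments: Record<string, unknown>;
  };
}

// ASSUMPTION: the tool takes its URLs under an argument named `urls`.
// This helper is hypothetical and not part of the package.
function buildExtractRequest(id: number, urls: string[]): ToolCallRequest {
  if (urls.length === 0) {
    throw new Error("at least one URL is required");
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "extract_url_content",
      arguments: { urls },
    },
  };
}

const request = buildExtractRequest(1, [
  "https://site1.com/news",
  "https://site2.com/blog",
]);
console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client such as Claude Desktop builds and sends this request itself, so the sketch is mainly useful for debugging or for writing a custom client.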

Example Output

🌐 Web Content Extraction Results

Processed: 1 successful, 0 failed

## 📄 Extracted Content

**📄 Document 1: Breaking News Article**

**Source:** https://example.com/article
**Domain:** example.com
**Author:** Jane Reporter
**Published:** 2024-07-04
**Citation:** Jane Reporter. Breaking News Article. https://example.com/article

**Content:**
[Full article content here...]

## 📖 Citation Summary

1. Jane Reporter. Breaking News Article. https://example.com/article
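The Simple citation in the output above follows an "Author. Title. URL" pattern. The sketch below reproduces that pattern under stated assumptions: the `PageMetadata` type and both function names are hypothetical, and dropping the author segment when no author is known is a guess rather than confirmed behavior. APA and MLA styles are not covered here.

```typescript
// Hypothetical metadata shape; field names are assumptions for illustration.
interface PageMetadata {
  title: string;
  url: string;
  author?: string;
}

// Simple style as shown in the example output: "Author. Title. URL".
// ASSUMPTION: a missing or blank author is simply omitted.
function formatSimpleCitation(meta: PageMetadata): string {
  return [meta.author, meta.title, meta.url]
    .filter((part): part is string => typeof part === "string" && part.trim().length > 0)
    .map((part) => part.trim())
    .join(". ");
}

// Numbered list in the style of the "Citation Summary" section.
function formatCitationSummary(pages: PageMetadata[]): string {
  return pages
    .map((page, index) => `${index + 1}. ${formatSimpleCitation(page)}`)
    .join("\n");
}

console.log(
  formatCitationSummary([
    {
      author: "Jane Reporter",
      title: "Breaking News Article",
      url: "https://example.com/article",
    },
  ]),
);
```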

⚙️ Configuration

The server ships with the following defaults. To change them, modify the source:

Max content length: 15,000 characters
Min content length: 500 characters
Timeout: 15 seconds
Max URLs per call: 5
Citation style: Simple (configurable to APA/MLA)
Blocked domains: localhost, 127.0.0.1, 0.0.0.0
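For orientation, the defaults above could be expressed as a typed configuration object like the one below. This is a sketch only: the property names, the `ExtractorConfig` type, and both helper functions are hypothetical and may not match the identifiers used in the package source. How the server treats over-limit input is also an assumption, flagged in the comments.

```typescript
type CitationStyle = "simple" | "apa" | "mla";

// Hypothetical config shape; the values mirror the defaults listed above.
interface ExtractorConfig {
  maxContentLength: number; // characters
  minContentLength: number; // characters
  timeoutMs: number;
  maxUrlsPerCall: number;
  citationStyle: CitationStyle;
  blockedDomains: string[];
}

const DEFAULT_CONFIG: ExtractorConfig = {
  maxContentLength: 15000,
  minContentLength: 500,
  timeoutMs: 15000,
  maxUrlsPerCall: 5,
  citationStyle: "simple",
  blockedDomains: ["localhost", "127.0.0.1", "0.0.0.0"],
};

// ASSUMPTION: calls with too many URLs are rejected rather than truncated.
function checkUrlCount(urls: string[], config: ExtractorConfig = DEFAULT_CONFIG): void {
  if (urls.length > config.maxUrlsPerCall) {
    throw new Error(`at most ${config.maxUrlsPerCall} URLs per call, got ${urls.length}`);
  }
}

// ASSUMPTION: content shorter than the minimum counts as a failed
// extraction (null), and longer content is cut at the maximum.
function applyLengthLimits(text: string, config: ExtractorConfig = DEFAULT_CONFIG): string | null {
  if (text.length < config.minContentLength) {
    return null;
  }
  return text.slice(0, config.maxContentLength);
}
```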

🔒 Security

Domain filtering blocks requests to localhost, 127.0.0.1, and 0.0.0.0 by default
Request timeouts (15 seconds by default) stop slow requests from hanging the server
Content length limits (15,000 characters by default) bound memory use
JavaScript from scraped pages is never executed
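The domain check can be illustrated with a small hostname filter built on the standard `URL` class. This is a sketch under assumptions, not the package's actual implementation: `BLOCKED_HOSTS` and `isUrlAllowed` are hypothetical names, and restricting schemes to http and https is an added assumption.

```typescript
// Default blocklist from the Configuration section.
const BLOCKED_HOSTS = new Set(["localhost", "127.0.0.1", "0.0.0.0"]);

// Hypothetical helper: returns true when a URL may be fetched.
function isUrlAllowed(rawUrl: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return false; // unparseable input is rejected
  }
  // ASSUMPTION: only http(s) URLs are fetched.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return false;
  }
  // The URL parser lowercases the hostname, so the comparison is case-insensitive.
  return !BLOCKED_HOSTS.has(parsed.hostname);
}

console.log(isUrlAllowed("https://example.com/article"));
console.log(isUrlAllowed("http://localhost:3000/admin"));
```

An exact-match list like this does not cover private address ranges (for example 10.0.0.0/8 or 192.168.0.0/16) or the IPv6 loopback, so deployments exposed to untrusted input may want additional network-level restrictions.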

📋 Requirements

Node.js 18.0.0 or higher
Internet connection for URL fetching

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request
