# Site Cloner MCP Server
This is an MCP (Model Context Protocol) server designed to help LLMs (like Claude) clone websites by providing tools to fetch, analyze, and download website assets.
## Features
- Fetch HTML content from any URL
- Authentication Support: Set cookies and localStorage to access login-protected pages
- Browser Integration: Check and integrate with chrome-devtools or playwright MCP tools for login flows
- Login Detection: Automatically detect login forms and links on web pages
- Extract assets (CSS, JavaScript, images, fonts, etc.) from HTML content
- Download individual assets to a local directory
- Parse CSS files to extract linked assets (fonts, images)
- Create a sitemap of a website
- Analyze page structure and layout
## Requirements
- Node.js 18 or higher
- npm or npx
## Usage

### Option 1: Run with npx (Recommended)

The easiest way to run the MCP server is using npx, without installing anything:

```bash
npx -y site-cloner
```

### Option 2: Run Locally
Clone or download this repository:

```bash
git clone https://github.com/yourusername/site-cloner.git
cd site-cloner
```

Install dependencies:

```bash
npm install
```

Build the TypeScript code:

```bash
npm run build
```

Run the server:

```bash
npm start
# or: node dist/index.js
```

For development with auto-reload:

```bash
npm run dev
```

## Connecting to Cursor
To set up this MCP server in Cursor, you have two options:
### 1. Project-specific configuration
Create a .cursor/mcp.json file in your project root with the following content:
Using npx (recommended):

```json
{
  "mcpServers": {
    "site-cloner": {
      "command": "npx",
      "args": ["-y", "site-cloner"]
    }
  }
}
```

Using local installation:
```json
{
  "mcpServers": {
    "site-cloner": {
      "command": "node",
      "args": ["/path/to/site-cloner/dist/index.js"]
    }
  }
}
```

### 2. Global configuration
To make the MCP server available globally in Cursor, add the following configuration by going to Cursor Settings → MCP → Add new Global MCP Server:
Using npx:

```json
{
  "mcpServers": {
    "site-cloner": {
      "command": "npx",
      "args": ["-y", "site-cloner"]
    }
  }
}
```

## Available Tools
### 1. fetch_page
Fetches the HTML content of a webpage.
Args:
- url: The URL of the webpage to fetch

### 2. extract_assets
Extracts links to assets from HTML content.
Args:
- url: The URL of the webpage (used for resolving relative URLs)
- html_content: The HTML content to parse

### 3. download_asset
Downloads an asset from a URL and saves it to the specified directory.
Smart asset discovery: after downloading a JS/CSS file, the server automatically scans its contents for references to other assets (such as importScripts('/assets/...') or fetch('/assets/...') calls).
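The discovery scan is not exposed as a separate tool, but conceptually it can be sketched as follows; the regex patterns and the function name are illustrative, not the server's actual implementation:

```typescript
// Sketch: scan downloaded JS/CSS text for referenced asset paths.
// The patterns below are illustrative; the real server may use different ones.
function discoverAssets(fileContent: string): string[] {
  const patterns = [
    /importScripts\(\s*['"]([^'"]+)['"]/g, // web worker imports in JS
    /fetch\(\s*['"]([^'"]+)['"]/g,         // fetch() calls with string URLs
    /url\(\s*['"]?([^'")]+)['"]?\s*\)/g,   // url(...) references in CSS
  ];
  const found = new Set<string>(); // Set dedupes repeated references
  for (const pattern of patterns) {
    for (const match of fileContent.matchAll(pattern)) {
      found.add(match[1]);
    }
  }
  return [...found];
}
```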
Args:
- url: The URL of the asset to download
- output_dir: The directory to save the asset to (default: downloaded_site)
Returns:
- success: Whether the download was successful
- saved_to: Path where the file was saved
- discovered_assets: (JS/CSS only) List of additional resources found in the file

### 4. download_assets_batch
Batch downloads multiple assets from a list of URLs. More efficient than calling download_asset multiple times.
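To illustrate why batching helps, a bounded-concurrency downloader might look like the sketch below; downloadBatch and downloadOne are hypothetical names, not the server's internals:

```typescript
// Sketch: download many URLs with bounded concurrency and collect per-URL results.
// `downloadOne` stands in for the server's single-asset download logic.
type DownloadResult = { url: string; success: boolean; error?: string };

async function downloadBatch(
  urls: string[],
  downloadOne: (url: string) => Promise<void>,
  concurrency = 5,
): Promise<DownloadResult[]> {
  const results: DownloadResult[] = [];
  const queue = [...urls];
  // Each worker pulls the next URL off the shared queue until it is empty.
  async function worker(): Promise<void> {
    while (queue.length > 0) {
      const url = queue.shift()!;
      try {
        await downloadOne(url);
        results.push({ url, success: true });
      } catch (err) {
        results.push({ url, success: false, error: String(err) });
      }
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```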
Args:
- urls: Array of asset URLs to download
- base_url: (Optional) Base URL for resolving relative URLs
- output_dir: Directory to save assets (default: downloaded_site)
- recursive: Whether to analyze downloaded JS/CSS for additional resources (default: true)
Returns:
- total: Total number of URLs
- successful: Number of successful downloads
- failed: Number of failed downloads
- results: Array of download results with success status
- discovered_assets: (if recursive=true) Additional resources found in JS/CSS files

### 5. parse_css_for_assets
Parses CSS content to extract URLs of referenced assets like fonts and images.
Args:
- css_url: The URL of the CSS file (used for resolving relative URLs)
- css_content: The CSS content to parse (if not provided, it will be fetched from css_url)

### 6. create_site_map
Creates a sitemap of the website starting from the given URL.
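The crawl behind a tool like this is essentially a breadth-first traversal bounded by max_depth; a minimal sketch, with getLinks standing in for fetching a page and extracting its same-site links (an assumption, not the server's actual code):

```typescript
// Sketch: breadth-first sitemap crawl bounded by max_depth.
// `getLinks` stands in for fetching a page and extracting its links.
function crawlSiteMap(
  startUrl: string,
  getLinks: (url: string) => string[],
  maxDepth = 1,
): string[] {
  const visited = new Set<string>([startUrl]);
  let frontier = [startUrl];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link); // never crawl the same URL twice
          next.push(link);
        }
      }
    }
    frontier = next; // next layer of the BFS
  }
  return [...visited];
}
```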
Args:
- url: The starting URL to crawl
- max_depth: Maximum depth to crawl (default: 1)

### 7. analyze_page_structure
Analyzes the structure of an HTML page and extracts key components.
Args:
- html_content: The HTML content to analyze

### 8. check_browser_mcp_tools
Returns installation and configuration guides for browser MCP tools (chrome-devtools-mcp, playwright-mcp) needed for authentication flows.
Args:
- (no arguments required)

### 9. detect_login_page
Detects if a webpage contains login forms or login links.
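Detection of this kind usually reduces to a few HTML heuristics; a rough sketch (the server's actual checks may differ):

```typescript
// Sketch: heuristics for spotting a login form or login link in raw HTML.
// These two checks are illustrative, not the server's exact logic.
function looksLikeLoginPage(html: string): boolean {
  // A password input is the strongest signal of a login form.
  const hasPasswordField = /<input[^>]*type=["']?password["']?/i.test(html);
  // Anchor text like "Log in" / "Sign in" suggests a login link.
  const hasLoginLink = /<a[^>]*>[^<]*(log\s*in|sign\s*in)[^<]*<\/a>/i.test(html);
  return hasPasswordField || hasLoginLink;
}
```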
Args:
- url: The URL of the webpage to analyze
- html_content: Optional HTML content (if not provided, it will be fetched from url)

### 10. set_auth_credentials
Sets authentication credentials (cookies, localStorage) for a domain to access login-protected pages.
Args:
- domain: The domain to set credentials for (e.g., "example.com")
- cookies: Optional object with cookie name-value pairs
- local_storage: Optional object with localStorage key-value pairs
- session_storage: Optional object with sessionStorage key-value pairs

#### Enhanced fetch_page for Authentication

The fetch_page tool also supports an optional use_auth parameter:

Args:
- url: The URL of the webpage to fetch
- use_auth: Set to true to use saved credentials for this domain

## Authentication Workflow
To clone login-protected websites:

1. Check whether browser tools are needed: call check_browser_mcp_tools to get installation guides.
2. Detect the login page: call detect_login_page with the target URL.
3. Install a browser MCP tool (choose one):
   - chrome-devtools-mcp: `npx -y chrome-devtools-mcp@latest`
   - playwright-mcp: `npx -y @playwright/mcp@latest`
4. Configure it in Cursor by adding both servers to your .cursor/mcp.json:

   ```json
   {
     "mcpServers": {
       "site-cloner": {
         "command": "npx",
         "args": ["-y", "site-cloner"]
       },
       "chrome-devtools": {
         "command": "npx",
         "args": ["-y", "chrome-devtools-mcp@latest"]
       }
     }
   }
   ```

5. Log in using the browser tool:
   - Use chrome-devtools or playwright MCP to navigate to the website
   - Complete the login process manually
   - Extract cookies and localStorage using browser tool commands
6. Set credentials in site-cloner by calling set_auth_credentials with:
   - domain: "example.com"
   - cookies: { "sessionId": "xxx", "token": "yyy" }
   - local_storage: { "userData": "..." }
7. Fetch protected content: call fetch_page with use_auth=true.
Note: Credentials are stored in memory only and expire after 24 hours. They are lost when the server restarts.
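To illustrate what using saved credentials involves, here is a minimal sketch of serializing stored cookies into a Cookie request header, plus the 24-hour expiry check described in the note above (helper names are hypothetical, not the server's API):

```typescript
// Sketch: turn stored name/value cookie pairs into a Cookie request header.
function buildCookieHeader(cookies: Record<string, string>): string {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${encodeURIComponent(value)}`)
    .join("; ");
}

// Credentials expire 24 hours after being set (matching the note above).
const TTL_MS = 24 * 60 * 60 * 1000;
function isExpired(savedAtMs: number, nowMs: number = Date.now()): boolean {
  return nowMs - savedAtMs > TTL_MS;
}
```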
## Example Usage with Claude
- Ask Claude to clone a website: "Please clone the website at example.com"
- Claude will use the available tools to:
- Fetch the HTML content
- Extract assets
- Download necessary files
- Analyze the structure
- Create a local copy of the site
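The asset-extraction step above boils down to collecting src/href attribute values and resolving them against the page URL; a simplified sketch (the attribute pattern is illustrative, not the server's parser):

```typescript
// Sketch: pull asset URLs out of HTML and resolve them against the page URL.
function extractAssetUrls(html: string, pageUrl: string): string[] {
  const attrPattern = /(?:src|href)=["']([^"']+)["']/g;
  const urls = new Set<string>();
  for (const match of html.matchAll(attrPattern)) {
    try {
      // URL resolves relative paths (e.g. "/style.css") against the page URL.
      urls.add(new URL(match[1], pageUrl).toString());
    } catch {
      // skip malformed URL values
    }
  }
  return [...urls];
}
```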
## Troubleshooting

### Server not showing up in Cursor
- Restart Cursor
- Check your configuration file syntax
- Make sure Node.js is installed: `node --version`
- Look at Cursor's MCP logs for errors: open the Output panel and select "Cursor MCP" from the dropdown
- Try running the server manually to see any errors: `npx -y site-cloner`
### Module Not Found Error

If you encounter a "Module not found" error when running locally:

- Make sure you've installed dependencies: `npm install`
- Make sure you've built the project: `npm run build`
- Check that the `dist/` directory exists
### Build Errors

If you get TypeScript build errors:

- Clean the build directory: `rm -rf dist/`
- Rebuild: `npm run build`
## Publishing to npm

To publish this package to npm:

1. Update the version in `package.json`
2. Build the project: `npm run build`
3. Publish: `npm publish`
## Notes
- The server automatically organizes downloaded assets into subdirectories based on content type (html, css, js, images, fonts, videos, other)
- When cloning a site, be mindful of copyright and terms of service restrictions
- Some websites may block automated requests; in that case, you may need to adjust the user agent string in the code
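The content-type-based organization mentioned above can be illustrated with a simple mapping; this is an assumption about the rules, not the server's exact logic:

```typescript
// Sketch: map a response Content-Type to the subdirectory names listed above.
function subdirFor(contentType: string): string {
  const type = contentType.toLowerCase();
  if (type.includes("text/html")) return "html";
  if (type.includes("text/css")) return "css";
  if (type.includes("javascript")) return "js";
  if (type.startsWith("image/")) return "images";
  if (type.startsWith("font/") || type.includes("font-woff")) return "fonts";
  if (type.startsWith("video/")) return "videos";
  return "other"; // anything unrecognized
}
```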
## Development

### Project Structure
```
site-cloner/
├── src/
│   └── index.ts      # Main server code
├── dist/             # Compiled JavaScript (generated)
├── package.json      # Node.js dependencies
├── tsconfig.json     # TypeScript configuration
└── README.md         # This file
```

### Scripts

- `npm run build` - Compile TypeScript to JavaScript
- `npm run watch` - Watch mode for development
- `npm run start` - Run the compiled server
- `npm run dev` - Run with tsx for development (no build needed)
