# Site Cloner MCP Server
This is an MCP (Model Context Protocol) server designed to help LLMs (like Claude) clone websites by providing tools to fetch, analyze, and download website assets.
## Features
- Fetch HTML content from any URL
- Authentication Support: Set cookies and localStorage to access login-protected pages
- Browser Integration: Check and integrate with chrome-devtools or playwright MCP tools for login flows
- Login Detection: Automatically detect login forms and links on web pages
- Extract assets (CSS, JavaScript, images, fonts, etc.) from HTML content
- Download individual assets to a local directory
- Parse CSS files to extract linked assets (fonts, images)
- Create a sitemap of a website
- Analyze page structure and layout
## Requirements
- Node.js 18 or higher
- npm or npx
## Usage

### Option 1: Run with npx (Recommended)

The easiest way to run the MCP server is using npx, without installing anything:

```bash
npx -y site-cloner
```

### Option 2: Run Locally
Clone or download this repository:

```bash
git clone https://github.com/yourusername/site-cloner.git
cd site-cloner
```

Install dependencies:

```bash
npm install
```

Build the TypeScript code:

```bash
npm run build
```

Run the server:

```bash
npm start
# or: node dist/index.js
```

For development with auto-reload:

```bash
npm run dev
```

## Connecting to Cursor
To set up this MCP server in Cursor, you have two options:
### 1. Project-specific configuration
Create a .cursor/mcp.json file in your project root with the following content:
Using npx (recommended):

```json
{
  "mcpServers": {
    "site-cloner": {
      "command": "npx",
      "args": ["-y", "site-cloner"]
    }
  }
}
```

Using local installation:
```json
{
  "mcpServers": {
    "site-cloner": {
      "command": "node",
      "args": ["/path/to/site-cloner/dist/index.js"]
    }
  }
}
```

### 2. Global configuration
To make the MCP server available globally in Cursor, add the following configuration by going to Cursor Settings → MCP → Add new Global MCP Server:
Using npx:

```json
{
  "mcpServers": {
    "site-cloner": {
      "command": "npx",
      "args": ["-y", "site-cloner"]
    }
  }
}
```

## Available Tools
### 1. fetch_page
Fetches the HTML content of a webpage.
Args:
- url: The URL of the webpage to fetch

### 2. extract_assets
Extracts links to assets from HTML content.
Args:
- url: The URL of the webpage (used for resolving relative URLs)
- html_content: The HTML content to parse

### 3. download_asset
Downloads an asset from a URL and saves it to the specified directory.
Smart asset discovery: after downloading a JS/CSS file, the server automatically scans its contents for references to other assets (such as importScripts('/assets/...') or fetch('/assets/...') calls).
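The discovery scan is not exposed as a separate tool, but conceptually it can be sketched as follows; the regex patterns and the function name are illustrative, not the server's actual implementation:

```typescript
// Sketch: scan downloaded JS/CSS text for referenced asset paths.
// The patterns below are illustrative; the real server may use different ones.
function discoverAssets(fileContent: string): string[] {
  const patterns = [
    /importScripts\(\s*['"]([^'"]+)['"]/g, // web worker imports in JS
    /fetch\(\s*['"]([^'"]+)['"]/g,         // fetch() calls with string URLs
    /url\(\s*['"]?([^'")]+)['"]?\s*\)/g,   // url(...) references in CSS
  ];
  const found = new Set<string>(); // Set dedupes repeated references
  for (const pattern of patterns) {
    for (const match of fileContent.matchAll(pattern)) {
      found.add(match[1]);
    }
  }
  return [...found];
}
```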
Args:
- url: The URL of the asset to download
- output_dir: The directory to save the asset to (default: downloaded_site)
Returns:
- success: Whether the download was successful
- saved_to: Path where the file was saved
- discovered_assets: (JS/CSS only) List of additional resources found in the file

### 4. download_assets_batch
Batch downloads multiple assets from a list of URLs. More efficient than calling download_asset multiple times.
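To illustrate why batching helps, a bounded-concurrency downloader might look like the sketch below; downloadBatch and downloadOne are hypothetical names, not the server's internals:

```typescript
// Sketch: download many URLs with bounded concurrency and collect per-URL results.
// `downloadOne` stands in for the server's single-asset download logic.
type DownloadResult = { url: string; success: boolean; error?: string };

async function downloadBatch(
  urls: string[],
  downloadOne: (url: string) => Promise<void>,
  concurrency = 5,
): Promise<DownloadResult[]> {
  const results: DownloadResult[] = [];
  const queue = [...urls];
  // Each worker pulls the next URL off the shared queue until it is empty.
  async function worker(): Promise<void> {
    while (queue.length > 0) {
      const url = queue.shift()!;
      try {
        await downloadOne(url);
        results.push({ url, success: true });
      } catch (err) {
        results.push({ url, success: false, error: String(err) });
      }
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```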
Args:
- urls: Array of asset URLs to download
- base_url: (Optional) Base URL for resolving relative URLs
- output_dir: Directory to save assets (default: downloaded_site)
- recursive: Whether to analyze downloaded JS/CSS for additional resources (default: true)
Returns:
- total: Total number of URLs
- successful: Number of successful downloads
- failed: Number of failed downloads
- results: Array of download results with success status
- discovered_assets: (if recursive=true) Additional resources found in JS/CSS files

### 5. parse_css_for_assets
Parses CSS content to extract URLs of referenced assets like fonts and images.
Args:
- css_url: The URL of the CSS file (used for resolving relative URLs)
- css_content: The CSS content to parse (if not provided, it will be fetched from css_url)

### 6. create_site_map
Creates a sitemap of the website starting from the given URL.
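The crawl behind a tool like this is essentially a breadth-first traversal bounded by max_depth; a minimal sketch, with getLinks standing in for fetching a page and extracting its same-site links (an assumption, not the server's actual code):

```typescript
// Sketch: breadth-first sitemap crawl bounded by max_depth.
// `getLinks` stands in for fetching a page and extracting its links.
function crawlSiteMap(
  startUrl: string,
  getLinks: (url: string) => string[],
  maxDepth = 1,
): string[] {
  const visited = new Set<string>([startUrl]);
  let frontier = [startUrl];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link); // never crawl the same URL twice
          next.push(link);
        }
      }
    }
    frontier = next; // next layer of the BFS
  }
  return [...visited];
}
```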
Args:
- url: The starting URL to crawl
- max_depth: Maximum depth to crawl (default: 1)

### 7. analyze_page_structure
Analyzes the structure of an HTML page and extracts key components.
Args:
- html_content: The HTML content to analyze

### 8. check_browser_mcp_tools
Returns installation and configuration guides for browser MCP tools (chrome-devtools-mcp, playwright-mcp) needed for authentication flows.
Args:
- (no arguments required)

### 9. detect_login_page
Detects if a webpage contains login forms or login links.
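Detection of this kind usually reduces to a few HTML heuristics; a rough sketch (the server's actual checks may differ):

```typescript
// Sketch: heuristics for spotting a login form or login link in raw HTML.
// These two checks are illustrative, not the server's exact logic.
function looksLikeLoginPage(html: string): boolean {
  // A password input is the strongest signal of a login form.
  const hasPasswordField = /<input[^>]*type=["']?password["']?/i.test(html);
  // Anchor text like "Log in" / "Sign in" suggests a login link.
  const hasLoginLink = /<a[^>]*>[^<]*(log\s*in|sign\s*in)[^<]*<\/a>/i.test(html);
  return hasPasswordField || hasLoginLink;
}
```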
Args:
- url: The URL of the webpage to analyze
- html_content: Optional HTML content (if not provided, it will be fetched from url)

### 10. set_auth_credentials
Sets authentication credentials (cookies, localStorage) for a domain to access login-protected pages.
Args:
- domain: The domain to set credentials for (e.g., "example.com")
- cookies: Optional object with cookie name-value pairs
- local_storage: Optional object with localStorage key-value pairs
- session_storage: Optional object with sessionStorage key-value pairs

#### Enhanced fetch_page for Authentication

The fetch_page tool also supports an optional use_auth parameter:

Args:
- url: The URL of the webpage to fetch
- use_auth: Set to true to use saved credentials for this domain

## Authentication Workflow
To clone login-protected websites:

1. Check whether browser tools are needed: call check_browser_mcp_tools to get installation guides.
2. Detect the login page: call detect_login_page with the target URL.
3. Install a browser MCP tool (choose one):
   - chrome-devtools-mcp: `npx -y chrome-devtools-mcp@latest`
   - playwright-mcp: `npx -y @playwright/mcp@latest`
4. Configure it in Cursor by adding both servers to your .cursor/mcp.json:

   ```json
   {
     "mcpServers": {
       "site-cloner": {
         "command": "npx",
         "args": ["-y", "site-cloner"]
       },
       "chrome-devtools": {
         "command": "npx",
         "args": ["-y", "chrome-devtools-mcp@latest"]
       }
     }
   }
   ```

5. Log in using the browser tool:
   - Use chrome-devtools or playwright MCP to navigate to the website
   - Complete the login process manually
   - Extract cookies and localStorage using browser tool commands
6. Set credentials in site-cloner by calling set_auth_credentials with:
   - domain: "example.com"
   - cookies: { "sessionId": "xxx", "token": "yyy" }
   - local_storage: { "userData": "..." }
7. Fetch protected content: call fetch_page with use_auth=true.
Note: Credentials are stored in memory only and expire after 24 hours. They are lost when the server restarts.
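To illustrate what using saved credentials involves, here is a minimal sketch of serializing stored cookies into a Cookie request header, plus the 24-hour expiry check described in the note above (helper names are hypothetical, not the server's API):

```typescript
// Sketch: turn stored name/value cookie pairs into a Cookie request header.
function buildCookieHeader(cookies: Record<string, string>): string {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${encodeURIComponent(value)}`)
    .join("; ");
}

// Credentials expire 24 hours after being set (matching the note above).
const TTL_MS = 24 * 60 * 60 * 1000;
function isExpired(savedAtMs: number, nowMs: number = Date.now()): boolean {
  return nowMs - savedAtMs > TTL_MS;
}
```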
## Example Usage with Claude
- Ask Claude to clone a website: "Please clone the website at example.com"
- Claude will use the available tools to:
- Fetch the HTML content
- Extract assets
- Download necessary files
- Analyze the structure
- Create a local copy of the site
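The asset-extraction step above boils down to collecting src/href attribute values and resolving them against the page URL; a simplified sketch (the attribute pattern is illustrative, not the server's parser):

```typescript
// Sketch: pull asset URLs out of HTML and resolve them against the page URL.
function extractAssetUrls(html: string, pageUrl: string): string[] {
  const attrPattern = /(?:src|href)=["']([^"']+)["']/g;
  const urls = new Set<string>();
  for (const match of html.matchAll(attrPattern)) {
    try {
      // URL resolves relative paths (e.g. "/style.css") against the page URL.
      urls.add(new URL(match[1], pageUrl).toString());
    } catch {
      // skip malformed URL values
    }
  }
  return [...urls];
}
```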
## Troubleshooting

### Server not showing up in Cursor
- Restart Cursor
- Check your configuration file syntax
- Make sure Node.js is installed: `node --version`
- Look at Cursor's MCP logs for errors: open the Output panel and select "Cursor MCP" from the dropdown
- Try running the server manually to see any errors: `npx -y site-cloner`
### Module Not Found Error

If you encounter a "Module not found" error when running locally:

- Make sure you've installed dependencies: `npm install`
- Make sure you've built the project: `npm run build`
- Check that the `dist/` directory exists
### Build Errors

If you get TypeScript build errors:

- Clean the build directory: `rm -rf dist/`
- Rebuild: `npm run build`
## Publishing to npm

To publish this package to npm:

1. Update the version in `package.json`
2. Build the project: `npm run build`
3. Publish: `npm publish`
## Notes
- The server automatically organizes downloaded assets into subdirectories based on content type (html, css, js, images, fonts, videos, other)
- When cloning a site, be mindful of copyright and terms of service restrictions
- Some websites may block automated requests; in that case, you may need to adjust the user agent string in the code
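The content-type-based organization mentioned above can be illustrated with a simple mapping; this is an assumption about the rules, not the server's exact logic:

```typescript
// Sketch: map a response Content-Type to the subdirectory names listed above.
function subdirFor(contentType: string): string {
  const type = contentType.toLowerCase();
  if (type.includes("text/html")) return "html";
  if (type.includes("text/css")) return "css";
  if (type.includes("javascript")) return "js";
  if (type.startsWith("image/")) return "images";
  if (type.startsWith("font/") || type.includes("font-woff")) return "fonts";
  if (type.startsWith("video/")) return "videos";
  return "other"; // anything unrecognized
}
```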
## Development

### Project Structure
```
site-cloner/
├── src/
│   └── index.ts      # Main server code
├── dist/             # Compiled JavaScript (generated)
├── package.json      # Node.js dependencies
├── tsconfig.json     # TypeScript configuration
└── README.md         # This file
```

### Scripts

- `npm run build` - Compile TypeScript to JavaScript
- `npm run watch` - Watch mode for development
- `npm run start` - Run the compiled server
- `npm run dev` - Run with tsx for development (no build needed)
