@imenam/simple-scraper

v1.0.8

Published

6 days ago

MCP server for web scraping and JavaScript execution using Puppeteer


imenam

mcp model-context-protocol scraper puppeteer headless-browser

simple-scraper-mcp

A Model Context Protocol (MCP) server for web scraping and JavaScript execution using a headless browser (Puppeteer). Includes an optional GUI for cookie management.

Features

- scrape_page: Navigate to a URL and return the full rendered HTML of the page
- execute_js: Navigate to a URL and execute custom JavaScript in the page context
- get_page_inputs: Extract all form inputs from a page as a structured JSON object
- get_show_page: Parse a detail/show page and extract key-value blocks and tables as structured JSON
- Interactive sessions: Keep a browser page alive across multiple tool calls to navigate, click, type, run JS, and screenshot the page in any desired state
- screenshot: Capture a screenshot of any page or active session, with inline or file output
- Cookie support: Load Netscape-format cookie files automatically before each request
- Optional GUI: Cookie manager interface, available when integrated with the MCP proxy

Requirements

- Node.js >= 18
- Puppeteer automatically downloads a compatible Chromium browser (~300 MB) on first install

Installation

npx @imenam/simple-scraper

Or install globally:

npm install -g @imenam/simple-scraper
simple-scraper

Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| PUPPETEER_HEADLESS | No | true | Run Chromium in headless mode. Set to false to display the browser window. |
| PUPPETEER_TIMEOUT | No | 30000 | Default timeout in milliseconds for page navigation and waits. |
| COOKIES_DIR | No | - | Absolute path to a folder containing Netscape-format .txt cookie files. All files are loaded and merged automatically before each request. |
| MCP_LOG_DIR | No | .mcp-gui/logs | Absolute path to the directory where log files are written. |
| PROXY_URL | No | - | Base URL of the MCP HTTP Gateway. Required to enable the GUI. |
| PROXY_APP_PATH | No | /simple-scraper-mcp | URL path under which the GUI is registered on the proxy. |
| PROXY_APP_NAME | No | Simple Scraper MCP | Display name shown in the proxy's app list. |
| SCRAPER_MAX_SESSIONS | No | 5 | Maximum number of concurrent interactive sessions. |
| SCRAPER_SESSION_TTL_MS | No | 600000 | Inactivity TTL for sessions in milliseconds (default: 10 minutes). Sessions unused beyond this duration are closed automatically. |
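
The defaults in the table can be summarized as a small lookup function. This is only a sketch: the default values come from the table, but `loadConfig`, the returned property names, and the parsing rules (for example, treating anything other than the string "false" as headless) are assumptions, not the server's actual code.

```javascript
// Sketch only: defaults are taken from the table above; the parsing rules
// and property names are assumptions about how a server could read them.
function loadConfig(env) {
  const toInt = (value, fallback) => {
    const n = Number.parseInt(value ?? '', 10);
    return Number.isFinite(n) ? n : fallback;
  };
  return {
    // Assumption: only the literal string "false" turns headless mode off.
    headless: env.PUPPETEER_HEADLESS !== 'false',
    timeoutMs: toInt(env.PUPPETEER_TIMEOUT, 30000),
    cookiesDir: env.COOKIES_DIR ?? null,
    logDir: env.MCP_LOG_DIR ?? '.mcp-gui/logs',
    proxyUrl: env.PROXY_URL ?? null,
    proxyAppPath: env.PROXY_APP_PATH ?? '/simple-scraper-mcp',
    proxyAppName: env.PROXY_APP_NAME ?? 'Simple Scraper MCP',
    maxSessions: toInt(env.SCRAPER_MAX_SESSIONS, 5),
    sessionTtlMs: toInt(env.SCRAPER_SESSION_TTL_MS, 600000),
    // The GUI is documented as available only when PROXY_URL is set.
    guiEnabled: Boolean(env.PROXY_URL),
  };
}

const config = loadConfig({ PUPPETEER_HEADLESS: 'false', SCRAPER_MAX_SESSIONS: '2' });
console.log(config);
```

In a real process you would pass `process.env` instead of a literal object; the literal keeps the sketch self-contained.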

Configuration

Copy .env.example to .env and configure the variables:

# Puppeteer options (optional)
PUPPETEER_HEADLESS=true
PUPPETEER_TIMEOUT=30000

# Optional: path to a folder containing Netscape-format cookie files (.txt)
# All files in this folder will be loaded automatically before each request.
# COOKIES_DIR=/path/to/cookies

# GUI (optional) — required to enable the cookie manager interface
# PROXY_URL=http://localhost:3000
# PROXY_APP_PATH=/simple-scraper-mcp
# PROXY_APP_NAME=Simple Scraper MCP

Usage with Claude Desktop

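Add the following to your claude_desktop_config.json. The example below sets the Puppeteer, cookie, logging, and GUI variables; SCRAPER_MAX_SESSIONS and SCRAPER_SESSION_TTL_MS can be added to env in the same way: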

{
  "mcpServers": {
    "simple-scraper": {
      "command": "npx",
      "args": ["@imenam/simple-scraper"],
      "env": {
        "PUPPETEER_HEADLESS": "true",
        "PUPPETEER_TIMEOUT": "30000",
        "COOKIES_DIR": "/path/to/your/cookies",
        "MCP_LOG_DIR": "/path/to/your/logs",
        "PROXY_URL": "http://localhost:4500",
        "PROXY_APP_PATH": "/simple-scraper",
        "PROXY_APP_NAME": "Simple Scraper"
      }
    }
  }
}

To load cookies automatically, set COOKIES_DIR to a folder containing .txt files in Netscape cookie format. A minimal configuration:

{
  "mcpServers": {
    "simple-scraper": {
      "command": "npx",
      "args": ["@imenam/simple-scraper"],
      "env": {
        "PUPPETEER_HEADLESS": "true",
        "COOKIES_DIR": "/path/to/your/cookies"
      }
    }
  }
}

Usage with Cursor

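In Cursor, MCP servers are configured in .cursor/mcp.json. You can pass environment variables directly in the config. The example below sets the Puppeteer, cookie, logging, and GUI variables; SCRAPER_MAX_SESSIONS and SCRAPER_SESSION_TTL_MS can be added to env in the same way: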

{
  "mcpServers": {
    "simple-scraper": {
      "command": "npx",
      "args": ["-y", "@imenam/simple-scraper"],
      "env": {
        "PUPPETEER_HEADLESS": "true",
        "PUPPETEER_TIMEOUT": "30000",
        "COOKIES_DIR": "/path/to/your/cookies",
        "MCP_LOG_DIR": "/path/to/your/logs",
        "PROXY_URL": "http://localhost:4500",
        "PROXY_APP_PATH": "/simple-scraper",
        "PROXY_APP_NAME": "Simple Scraper"
      }
    }
  }
}

Note: The -y flag in args avoids the interactive confirmation prompt when using npx.

MCP Tools

`scrape_page`

Navigate to a URL and return the full rendered HTML.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL of the page to scrape |
| wait_for | string | | CSS selector to wait for before capturing HTML |
| timeout | number | | Timeout in ms (default: 30000) |

`execute_js`

Navigate to a URL and execute custom JavaScript in the page context.

The script parameter is executed as the body of a JavaScript function in the browser page context, equivalent to:

new Function(script)();

To return data from the tool, the script must use an explicit return. A bare expression such as document.title will evaluate but the tool will receive undefined.

Example:

return {
  title: document.title,
  url: window.location.href,
  text: document.body.innerText.slice(0, 500)
};

For asynchronous work, return a promise, for example with an async IIFE:

return (async () => {
  const response = await fetch('/api/data');
  return await response.json();
})();
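
The function-body convention can be checked outside the browser, since `new Function` is standard JavaScript. The sketch below runs in plain Node.js, so it uses arithmetic instead of `document`; `runScript` is a hypothetical helper name that only mirrors the equivalence stated earlier.

```javascript
// Mirrors the documented equivalence: the script is the body of a function.
const runScript = (script) => new Function(script)();

// A bare expression is evaluated, but its value is discarded.
const bare = runScript('1 + 1');
// An explicit return hands the value back to the caller.
const explicit = runScript('return 1 + 1');
// An async IIFE returns a promise that the caller can await.
const pending = runScript('return (async () => 42)();');

console.log(bare, explicit); // undefined 2
pending.then((value) => console.log(value)); // 42
```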

Returned objects and arrays are serialized as formatted JSON. Primitive values are returned as text.
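
As a rough illustration of that rule, the sketch below formats a result the way the sentence above describes. The helper name `formatResult` and the two-space indentation are assumptions; the server's exact output format may differ.

```javascript
// Assumed formatting rule: indented JSON for objects and arrays,
// String() for primitives. The real server output may differ in detail.
function formatResult(value) {
  if (value !== null && typeof value === 'object') {
    return JSON.stringify(value, null, 2);
  }
  return String(value);
}

console.log(formatResult({ title: 'Example', links: 3 }));
console.log(formatResult(42));
```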

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL of the page |
| script | string | ✅ | JavaScript function body to execute in the page context. Use return to send a result back to the tool. |
| wait_for | string | | CSS selector to wait for before executing |
| timeout | number | | Timeout in ms (default: 30000) |

`get_page_inputs`

Extract all form inputs from a page as a structured JSON object.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL of the page |
| selector | string | | CSS selector to scope the search (e.g. #my-form) |
| wait_for | string | | CSS selector to wait for before extracting |
| show_hidden | boolean | | Include input[type=hidden] fields (default: false) |
| timeout | number | | Timeout in ms (default: 30000) |

`get_show_page`

Parse a detail page and extract key-value blocks and tables as structured JSON.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL of the page |
| keys_map | object | | Map of HTML label → JS key for field name translation |
| box_selector | string | | CSS selector for section containers (default: .box.box-primary) |
| tables_max_items | number | | Max rows per table (default: 2) |
| wait_for | string | | CSS selector to wait for before extraction |
| timeout | number | | Timeout in ms (default: 30000) |
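
To make keys_map and tables_max_items more concrete, here is a sketch of what label translation and row limiting could look like. `translateKeys`, `limitRows`, and the sample data are all hypothetical; in particular, keeping unmapped labels unchanged is an assumption, not documented behavior.

```javascript
// Hypothetical helper: renames extracted labels using a keys_map-style object.
// Assumption: labels without a mapping are kept unchanged.
function translateKeys(fields, keysMap = {}) {
  return Object.fromEntries(
    Object.entries(fields).map(([label, value]) => [
      Object.hasOwn(keysMap, label) ? keysMap[label] : label,
      value,
    ])
  );
}

// Hypothetical truncation using the documented tables_max_items default of 2.
function limitRows(rows, maxItems = 2) {
  return rows.slice(0, maxItems);
}

// Made-up sample data for illustration only.
const fields = { 'Order number': 'A-1042', 'Customer name': 'Ada', Status: 'Shipped' };
const keysMap = { 'Order number': 'orderNumber', 'Customer name': 'customerName' };

console.log(translateKeys(fields, keysMap));
console.log(limitRows([{ sku: 1 }, { sku: 2 }, { sku: 3 }]));
```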

Interactive Sessions

Interactive sessions let you keep a browser page alive across multiple tool calls, so you can bring the page into the exact state you need before extracting data or taking a screenshot.

Typical workflow

open_session → session_click / session_type / session_evaluate → screenshot / session_html → close_session
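
The workflow above can be sketched as the sequence of MCP `tools/call` requests a client would send. The JSON-RPC envelope follows the MCP specification; the URL, selectors, and typed text are hypothetical, and the session ID is a placeholder because the real value is only known once open_session returns.

```javascript
// Builds an MCP "tools/call" request (JSON-RPC 2.0). The method name and the
// { name, arguments } params shape follow the MCP specification.
let nextId = 1;
function toolCall(name, args) {
  return { jsonrpc: '2.0', id: nextId++, method: 'tools/call', params: { name, arguments: args } };
}

// Placeholder: the real value is returned by open_session at runtime.
const sessionId = '<session_id from open_session>';

// Hypothetical login flow; URL and selectors are examples only.
const workflow = [
  toolCall('open_session', { url: 'https://example.com/login', wait_for: 'form' }),
  toolCall('session_type', { session_id: sessionId, selector: '#username', text: 'demo', clear: true }),
  toolCall('session_click', { session_id: sessionId, selector: 'button[type=submit]' }),
  toolCall('session_wait_for', { session_id: sessionId, selector: '.dashboard' }),
  toolCall('screenshot', { session_id: sessionId, output: 'inline' }),
  toolCall('close_session', { session_id: sessionId }),
];

console.log(JSON.stringify(workflow[0], null, 2));
```

In practice an MCP client such as Claude Desktop or Cursor issues these calls for you; the sketch only shows which tools are called and in what order.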

`open_session`

Open a persistent browser session. Returns a session_id to use with all session_* tools and screenshot.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL to navigate to |
| wait_for | string | | CSS selector to wait for before the session is considered ready |
| timeout | number | | Timeout in ms (default: 30000) |

`close_session`

Close a session and free its resources. Always call this when you are done.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID returned by open_session |

`list_sessions`

List all currently active sessions with their IDs and timestamps. No parameters.
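
The session limit (SCRAPER_MAX_SESSIONS) and inactivity TTL (SCRAPER_SESSION_TTL_MS) can be pictured with a small in-memory registry. This is a hypothetical sketch, not the server's implementation: the class name, ID format, and method names are invented, and a fake clock is injected so the expiry can be observed without waiting.

```javascript
// Hypothetical registry; the server's real bookkeeping is not documented.
class SessionRegistry {
  constructor({ maxSessions = 5, ttlMs = 600000, now = Date.now } = {}) {
    this.maxSessions = maxSessions;
    this.ttlMs = ttlMs;
    this.now = now;
    this.sessions = new Map(); // session_id -> { createdAt, lastUsedAt }
    this.counter = 0;
  }

  // Drops sessions that have been idle for longer than the TTL.
  prune() {
    const cutoff = this.now() - this.ttlMs;
    for (const [id, session] of this.sessions) {
      if (session.lastUsedAt < cutoff) this.sessions.delete(id);
    }
  }

  open() {
    this.prune();
    if (this.sessions.size >= this.maxSessions) {
      throw new Error(`Session limit reached (${this.maxSessions})`);
    }
    const id = `session-${++this.counter}`;
    const timestamp = this.now();
    this.sessions.set(id, { createdAt: timestamp, lastUsedAt: timestamp });
    return id;
  }

  // Any tool call on a session counts as activity and resets its idle timer.
  touch(id) {
    const session = this.sessions.get(id);
    if (!session) throw new Error(`Unknown session: ${id}`);
    session.lastUsedAt = this.now();
  }

  close(id) {
    return this.sessions.delete(id);
  }

  list() {
    this.prune();
    return [...this.sessions].map(([id, s]) => ({ session_id: id, ...s }));
  }
}

// Fake clock so the TTL behavior can be observed without waiting.
let clock = 0;
const registry = new SessionRegistry({ maxSessions: 2, ttlMs: 1000, now: () => clock });
const first = registry.open();
clock = 600;
const second = registry.open();
clock = 1200; // first has been idle for 1200 ms (expired), second for 600 ms
console.log(registry.list().map((s) => s.session_id));
```

Whatever the real implementation looks like, calling close_session explicitly remains the reliable way to free a slot rather than waiting for the TTL.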

`session_goto`

Navigate the session to a new URL without closing it.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| url | string | ✅ | URL to navigate to |
| wait_for | string | | CSS selector to wait for after navigation |
| timeout | number | | Timeout in ms (default: 30000) |

`session_click`

Click an element in the session page.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| selector | string | ✅ | CSS selector of the element to click |
| timeout | number | | Timeout in ms to wait for the element (default: 30000) |

`session_type`

Type text into an input element in the session page.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| selector | string | ✅ | CSS selector of the input element |
| text | string | ✅ | Text to type |
| clear | boolean | | Clear the field before typing (default: false) |
| timeout | number | | Timeout in ms to wait for the element (default: 30000) |

`session_wait_for`

Wait for a CSS selector to appear in the session page.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| selector | string | ✅ | CSS selector to wait for |
| timeout | number | | Timeout in ms (default: 30000) |

`session_evaluate`

Execute JavaScript in the context of the session page. Same conventions as execute_js.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| script | string | ✅ | JavaScript function body. Use return to get a result back. |
| wait_for | string | | CSS selector to wait for before executing |
| timeout | number | | Timeout in ms (default: 30000) |

`session_html`

Return the current full rendered HTML of the session page.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |

`screenshot`

Capture a screenshot of a page. Use session_id to capture an active session in its current state, or url for a one-shot capture.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | | Session ID. If provided, url is ignored and the current page state is captured. |
| url | string | | URL for a one-shot screenshot. Required if session_id is not provided. |
| wait_for | string | | CSS selector to wait for (one-shot mode only) |
| timeout | number | | Timeout in ms (default: 30000) |
| selector | string | | CSS selector of a specific element to capture |
| full_page | boolean | | Capture the full scrollable page height (default: false, ignored when selector is provided) |
| format | png \| jpeg | | Image format (default: png) |
| output | inline \| file \| both | ✅ | inline embeds the image in the response, file saves to disk and returns the path, both does both |
| path | string | | Absolute or relative path for the saved file. Defaults to `./screenshots/screenshot-<timestamp>.<format>` |
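
The rules in the table (either session_id or url, a required output mode, two image formats) can be expressed as a small client-side check. `checkScreenshotArgs` is a hypothetical helper written for illustration; the server performs its own validation, and its error messages are not documented here.

```javascript
// Hypothetical client-side check of screenshot arguments against the table above.
function checkScreenshotArgs(args) {
  const errors = [];
  if (!args.session_id && !args.url) {
    errors.push('either session_id or url is required');
  }
  if (!['inline', 'file', 'both'].includes(args.output)) {
    errors.push('output must be inline, file, or both');
  }
  if (args.format !== undefined && !['png', 'jpeg'].includes(args.format)) {
    errors.push('format must be png or jpeg');
  }
  return errors;
}

// One-shot capture of a full page, saved to disk (URL and path are examples).
const oneShot = {
  url: 'https://example.com',
  full_page: true,
  format: 'jpeg',
  output: 'file',
  path: './screenshots/home.jpeg',
};

// Capture of a single element inside an existing session (placeholder ID).
const inSession = {
  session_id: '<session_id from open_session>',
  selector: '#chart',
  output: 'inline',
};

console.log(checkScreenshotArgs(oneShot), checkScreenshotArgs(inSession));
```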

Cookie Files

Cookies are loaded from .txt files in Netscape format. Place them in the folder specified by COOKIES_DIR. All files in the folder are loaded and merged automatically before each request.
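
For reference, a Netscape-format cookie file holds one cookie per line with seven tab-separated fields: domain, include-subdomains flag, path, secure flag, expiry (Unix seconds), name, and value. The parser below is a minimal sketch of that format, not the server's loader; the `#HttpOnly_` prefix is a curl convention, and whether this server honors it is an assumption.

```javascript
// Parses Netscape-format cookie text: seven tab-separated fields per line.
// Fields: domain, includeSubdomains, path, secure, expires, name, value.
function parseNetscapeCookies(text) {
  const cookies = [];
  for (const rawLine of text.split(/\r?\n/)) {
    let line = rawLine;
    let httpOnly = false;
    // curl-style exports mark HttpOnly cookies with this prefix.
    if (line.startsWith('#HttpOnly_')) {
      httpOnly = true;
      line = line.slice('#HttpOnly_'.length);
    } else if (line.startsWith('#') || line.trim() === '') {
      continue; // comment or blank line
    }
    const parts = line.split('\t');
    if (parts.length !== 7) continue; // skip malformed lines
    const [domain, includeSubdomains, path, secure, expires, name, value] = parts;
    cookies.push({
      domain,
      includeSubdomains: includeSubdomains === 'TRUE',
      path,
      secure: secure === 'TRUE',
      expires: Number(expires),
      name,
      value,
      httpOnly,
    });
  }
  return cookies;
}

// Made-up sample file content for illustration.
const sample = [
  '# Netscape HTTP Cookie File',
  '.example.com\tTRUE\t/\tTRUE\t1893456000\tsession\tabc123',
  '#HttpOnly_example.com\tFALSE\t/app\tFALSE\t0\tcsrf\txyz',
].join('\n');

console.log(parseNetscapeCookies(sample));
```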

GUI (Optional)

When PROXY_URL is set, a cookie manager web interface is registered with the MCP proxy. It allows you to list, upload, and delete cookie files through a browser UI.

License

ISC
