@imenam/simple-scraper
v1.0.8
MCP server for web scraping and JavaScript execution using Puppeteer
simple-scraper-mcp
A Model Context Protocol (MCP) server for web scraping and JavaScript execution using a headless browser (Puppeteer). Includes an optional GUI for cookie management.
Features
- scrape_page: Navigate to a URL and return the full rendered HTML of the page
- execute_js: Navigate to a URL and execute custom JavaScript in the page context
- get_page_inputs: Extract all form inputs from a page as a structured JSON object
- get_show_page: Parse a detail/show page and extract key-value blocks and tables as structured JSON
- Interactive sessions: Keep a browser page alive across multiple tool calls to navigate, click, type, run JS, and screenshot the page in any desired state
- screenshot: Capture a screenshot of any page or active session, with inline or file output
- Cookie support: Load Netscape-format cookie files automatically before each request
- Optional GUI: Cookie manager interface, available when integrated with the MCP proxy
Requirements
- Node.js >= 18
- Puppeteer will automatically download a compatible Chromium browser (~300 MB) on first install
Installation
npx @imenam/simple-scraper
Or install globally:
npm install -g @imenam/simple-scraper
simple-scraper
Environment Variables
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| PUPPETEER_HEADLESS | No | true | Run Chromium in headless mode. Set to false to display the browser window. |
| PUPPETEER_TIMEOUT | No | 30000 | Default timeout in milliseconds for page navigation and waits. |
| COOKIES_DIR | No | - | Absolute path to a folder containing Netscape-format .txt cookie files. All files are loaded and merged automatically before each request. |
| MCP_LOG_DIR | No | .mcp-gui/logs | Absolute path to the directory where log files are written. |
| PROXY_URL | No | - | Base URL of the MCP HTTP Gateway. Required to enable the GUI. |
| PROXY_APP_PATH | No | /simple-scraper-mcp | URL path under which the GUI is registered on the proxy. |
| PROXY_APP_NAME | No | Simple Scraper MCP | Display name shown in the proxy's app list. |
| SCRAPER_MAX_SESSIONS | No | 5 | Maximum number of concurrent interactive sessions. |
| SCRAPER_SESSION_TTL_MS | No | 600000 | Inactivity TTL for sessions in milliseconds (default: 10 minutes). Sessions unused beyond this duration are closed automatically. |
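As a rough illustration of how the defaults in the table combine with the environment, the sketch below resolves the variables into a plain config object. The `loadConfig` function, its field names, and the rule used to parse `PUPPETEER_HEADLESS` are assumptions made for this example and are not taken from the package's source; only the variable names and default values come from the table.

```javascript
// Hypothetical helper (not the package's actual code): resolves the documented
// environment variables into a config object using the defaults from the table.
function loadConfig(env = process.env) {
  // Parse an integer variable, falling back to the documented default.
  const int = (name, fallback) => {
    const parsed = Number.parseInt(env[name] ?? '', 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    // Assumption: anything other than the literal string "false" means headless.
    headless: env.PUPPETEER_HEADLESS !== 'false',
    timeoutMs: int('PUPPETEER_TIMEOUT', 30000),
    cookiesDir: env.COOKIES_DIR ?? null,
    logDir: env.MCP_LOG_DIR ?? '.mcp-gui/logs',
    proxyUrl: env.PROXY_URL ?? null,
    proxyAppPath: env.PROXY_APP_PATH ?? '/simple-scraper-mcp',
    proxyAppName: env.PROXY_APP_NAME ?? 'Simple Scraper MCP',
    maxSessions: int('SCRAPER_MAX_SESSIONS', 5),
    sessionTtlMs: int('SCRAPER_SESSION_TTL_MS', 600000),
    // Per the table, the GUI is only enabled when PROXY_URL is set.
    guiEnabled: Boolean(env.PROXY_URL),
  };
}

// Example: show the browser window and allow only two concurrent sessions.
const config = loadConfig({ PUPPETEER_HEADLESS: 'false', SCRAPER_MAX_SESSIONS: '2' });
console.log(config);
```

Passing the environment as an argument keeps the sketch easy to try without touching real environment variables.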
Configuration
Copy .env.example to .env and configure the variables:
# Puppeteer options (optional)
PUPPETEER_HEADLESS=true
PUPPETEER_TIMEOUT=30000
# Optional: path to a folder containing Netscape-format cookie files (.txt)
# All files in this folder will be loaded automatically before each request.
# COOKIES_DIR=/path/to/cookies
# GUI (optional) — required to enable the cookie manager interface
# PROXY_URL=http://localhost:3000
# PROXY_APP_PATH=/simple-scraper-mcp
# PROXY_APP_NAME=Simple Scraper MCP
Usage with Claude Desktop
Add the following to your claude_desktop_config.json. Full example with all available options:
{
  "mcpServers": {
    "simple-scraper": {
      "command": "npx",
      "args": ["@imenam/simple-scraper"],
      "env": {
        "PUPPETEER_HEADLESS": "true",
        "PUPPETEER_TIMEOUT": "30000",
        "COOKIES_DIR": "/path/to/your/cookies",
        "MCP_LOG_DIR": "/path/to/your/logs",
        "PROXY_URL": "http://localhost:4500",
        "PROXY_APP_PATH": "/simple-scraper",
        "PROXY_APP_NAME": "Simple Scraper"
      }
    }
  }
}
To load cookies automatically, add COOKIES_DIR pointing to a folder containing .txt files in Netscape cookie format:
{
  "mcpServers": {
    "simple-scraper": {
      "command": "npx",
      "args": ["@imenam/simple-scraper"],
      "env": {
        "PUPPETEER_HEADLESS": "true",
        "COOKIES_DIR": "/path/to/your/cookies"
      }
    }
  }
}
Usage with Cursor
In Cursor, MCP servers are configured in .cursor/mcp.json. You can pass environment variables directly in the config. Full example with all available options:
{
  "mcpServers": {
    "simple-scraper": {
      "command": "npx",
      "args": ["-y", "@imenam/simple-scraper"],
      "env": {
        "PUPPETEER_HEADLESS": "true",
        "PUPPETEER_TIMEOUT": "30000",
        "COOKIES_DIR": "/path/to/your/cookies",
        "MCP_LOG_DIR": "/path/to/your/logs",
        "PROXY_URL": "http://localhost:4500",
        "PROXY_APP_PATH": "/simple-scraper",
        "PROXY_APP_NAME": "Simple Scraper"
      }
    }
  }
}
Note: The -y flag in args avoids the interactive confirmation prompt when using npx.
MCP Tools
scrape_page
Navigate to a URL and return the full rendered HTML.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL of the page to scrape |
| wait_for | string | | CSS selector to wait for before capturing HTML |
| timeout | number | | Timeout in ms (default: 30000) |
execute_js
Navigate to a URL and execute custom JavaScript in the page context.
The script parameter is executed as the body of a JavaScript function in the browser page context, equivalent to:
new Function(script)();
To return data from the tool, the script must use an explicit return. A bare expression such as document.title will evaluate but the tool will receive undefined.
Example:
return {
  title: document.title,
  url: window.location.href,
  text: document.body.innerText.slice(0, 500)
};
For asynchronous work, return a promise, for example with an async IIFE:
return (async () => {
  const response = await fetch('/api/data');
  return await response.json();
})();
Returned objects and arrays are serialized as formatted JSON. Primitive values are returned as text.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL of the page |
| script | string | ✅ | JavaScript function body to execute in the page context. Use return to send a result back to the tool. |
| wait_for | string | | CSS selector to wait for before executing |
| timeout | number | | Timeout in ms (default: 30000) |
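The return convention described above can be checked outside the browser, because `new Function` behaves the same way in Node.js. The sketch below is only an illustration: `runScript` mirrors the documented `new Function(script)()` equivalence, while `formatResult` is a hypothetical formatter that follows the documented rule (objects and arrays as formatted JSON, primitives as text). The two-space indentation is an assumption.

```javascript
// Mirrors the documented convention: the script is the body of a function.
function runScript(script) {
  return new Function(script)();
}

// Hypothetical formatter: objects and arrays become formatted JSON,
// primitive values become plain text. Two-space indentation is an assumption.
function formatResult(value) {
  if (value !== null && typeof value === 'object') {
    return JSON.stringify(value, null, 2);
  }
  return String(value);
}

const bare = runScript('1 + 1');            // no return: the result is lost
const explicit = runScript('return 1 + 1'); // explicit return: the result is kept
const structured = runScript('return { items: [1, 2] };');

console.log(bare);                     // undefined
console.log(formatResult(explicit));   // 2
console.log(formatResult(structured)); // multi-line JSON
```

The first call shows why a bare expression is not enough: the function body evaluates it, but nothing is handed back to the caller.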
get_page_inputs
Extract all form inputs from a page as a structured JSON object.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL of the page |
| selector | string | | CSS selector to scope the search (e.g. #my-form) |
| wait_for | string | | CSS selector to wait for before extracting |
| show_hidden | boolean | | Include input[type=hidden] fields (default: false) |
| timeout | number | | Timeout in ms (default: 30000) |
get_show_page
Parse a detail page and extract key-value blocks and tables as structured JSON.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL of the page |
| keys_map | object | | Map of HTML label → JS key for field name translation |
| box_selector | string | | CSS selector for section containers (default: .box.box-primary) |
| tables_max_items | number | | Max rows per table (default: 2) |
| wait_for | string | | CSS selector to wait for before extraction |
| timeout | number | | Timeout in ms (default: 30000) |
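To make keys_map and tables_max_items more concrete, the sketch below shows one plausible reading of the two parameters. Both helpers (`applyKeysMap`, `limitRows`) and the sample labels are hypothetical. In particular, keeping unmapped labels unchanged is an assumption, since the parameter table does not say what happens to them.

```javascript
// Hypothetical illustration of keys_map: translate HTML labels into JS keys.
// Assumption: labels without an entry in the map are kept unchanged.
function applyKeysMap(fields, keysMap = {}) {
  const result = {};
  for (const [label, value] of Object.entries(fields)) {
    const key = Object.hasOwn(keysMap, label) ? keysMap[label] : label;
    result[key] = value;
  }
  return result;
}

// Hypothetical illustration of tables_max_items: keep at most N rows per table.
function limitRows(rows, maxItems = 2) {
  return rows.slice(0, maxItems);
}

// Sample data standing in for what a detail page might contain.
const extracted = { 'Full name': 'Ada Lovelace', 'E-mail': 'ada@example.com', Status: 'Active' };
const mapped = applyKeysMap(extracted, { 'Full name': 'fullName', 'E-mail': 'email' });
const rows = limitRows([{ id: 1 }, { id: 2 }, { id: 3 }]);

console.log(mapped);
console.log(rows);
```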
Interactive Sessions
Interactive sessions let you keep a browser page alive across multiple tool calls, so you can bring the page into the exact state you need before extracting data or taking a screenshot.
Typical workflow
open_session → session_click / session_type / session_evaluate → screenshot / session_html → close_session
open_session
Open a persistent browser session. Returns a session_id to use with all session_* tools and screenshot.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| url | string | ✅ | URL to navigate to |
| wait_for | string | | CSS selector to wait for before the session is considered ready |
| timeout | number | | Timeout in ms (default: 30000) |
close_session
Close a session and free its resources. Always call this when you are done.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID returned by open_session |
list_sessions
List all currently active sessions with their IDs and timestamps. No parameters.
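The limits set by SCRAPER_MAX_SESSIONS and SCRAPER_SESSION_TTL_MS can be pictured with a small in-memory registry. This is a sketch under stated assumptions, not the server's implementation: the `SessionRegistry` class, its method names, and the ID format are all hypothetical, and a real session would wrap a Puppeteer page rather than a plain object.

```javascript
// Hypothetical in-memory registry illustrating a session cap and an
// inactivity TTL. The real server may work quite differently.
class SessionRegistry {
  constructor({ maxSessions = 5, ttlMs = 600000, now = Date.now } = {}) {
    this.maxSessions = maxSessions;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, which makes the TTL easy to test
    this.sessions = new Map();
    this.nextId = 1;
  }

  // Forget sessions that have been idle for longer than the TTL.
  sweep() {
    for (const [id, session] of this.sessions) {
      if (this.now() - session.lastUsedAt > this.ttlMs) this.sessions.delete(id);
    }
  }

  open(url) {
    this.sweep();
    if (this.sessions.size >= this.maxSessions) {
      throw new Error(`Session limit reached (${this.maxSessions})`);
    }
    const id = `session-${this.nextId++}`;
    const timestamp = this.now();
    this.sessions.set(id, { url, createdAt: timestamp, lastUsedAt: timestamp });
    return id;
  }

  // Each tool call on a session would refresh its inactivity timer.
  touch(id) {
    this.sweep();
    const session = this.sessions.get(id);
    if (!session) throw new Error(`Unknown or expired session: ${id}`);
    session.lastUsedAt = this.now();
    return session;
  }

  close(id) {
    return this.sessions.delete(id);
  }

  list() {
    this.sweep();
    return [...this.sessions].map(([id, session]) => ({ id, ...session }));
  }
}

// Simulated timeline with a 1-second TTL.
let clock = 0;
const registry = new SessionRegistry({ maxSessions: 2, ttlMs: 1000, now: () => clock });
const first = registry.open('https://example.com/a');
clock = 600;
const second = registry.open('https://example.com/b');
clock = 1200; // first has been idle for 1200 ms, second for 600 ms
const active = registry.list();
console.log(active.map((session) => session.id));
```

In this timeline only the second session survives, which is why calling close_session explicitly is still the safer habit: it frees a slot immediately instead of waiting for the TTL.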
session_goto
Navigate the session to a new URL without closing it.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| url | string | ✅ | URL to navigate to |
| wait_for | string | | CSS selector to wait for after navigation |
| timeout | number | | Timeout in ms (default: 30000) |
session_click
Click an element in the session page.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| selector | string | ✅ | CSS selector of the element to click |
| timeout | number | | Timeout in ms to wait for the element (default: 30000) |
session_type
Type text into an input element in the session page.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| selector | string | ✅ | CSS selector of the input element |
| text | string | ✅ | Text to type |
| clear | boolean | | Clear the field before typing (default: false) |
| timeout | number | | Timeout in ms to wait for the element (default: 30000) |
session_wait_for
Wait for a CSS selector to appear in the session page.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| selector | string | ✅ | CSS selector to wait for |
| timeout | number | | Timeout in ms (default: 30000) |
session_evaluate
Execute JavaScript in the context of the session page. Same conventions as execute_js.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
| script | string | ✅ | JavaScript function body. Use return to get a result back. |
| wait_for | string | | CSS selector to wait for before executing |
| timeout | number | | Timeout in ms (default: 30000) |
session_html
Return the current full rendered HTML of the session page.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | ✅ | Session ID |
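The typical workflow can be written out as a sequence of MCP tool calls. The sketch below only builds the request payloads: the `tools/call` method and the `{ name, arguments }` params shape follow the MCP specification, while the login URL, selectors, credentials, and the `toolCall` and `buildFollowUps` helpers are placeholders invented for this example. Sending the requests and reading the real session ID from the open_session response are left to the MCP client.

```javascript
// Builds an MCP "tools/call" JSON-RPC request. Transport and client setup
// are intentionally omitted.
let nextRequestId = 1;
function toolCall(name, args = {}) {
  return {
    jsonrpc: '2.0',
    id: nextRequestId++,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

// Step 1: open the session. URL and selector are placeholders.
const openRequest = toolCall('open_session', {
  url: 'https://example.com/login',
  wait_for: 'form#login',
});

// Steps 2..n: the session ID is only known after open_session has responded,
// so the remaining calls are built by a function that receives it.
function buildFollowUps(sessionId) {
  const session_id = sessionId;
  return [
    toolCall('session_type', { session_id, selector: '#email', text: 'user@example.com', clear: true }),
    toolCall('session_type', { session_id, selector: '#password', text: 'placeholder-password' }),
    toolCall('session_click', { session_id, selector: 'button[type=submit]' }),
    toolCall('session_wait_for', { session_id, selector: '.dashboard' }),
    toolCall('screenshot', { session_id, output: 'file', full_page: true }),
    toolCall('session_html', { session_id }),
    toolCall('close_session', { session_id }),
  ];
}

const requests = [openRequest, ...buildFollowUps('SESSION_ID_FROM_OPEN_SESSION')];
console.log(JSON.stringify(requests, null, 2));
```

Ending the sequence with close_session matches the advice to always release a session when the work is done.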
screenshot
Capture a screenshot of a page. Use session_id to capture an active session in its current state, or url for a one-shot capture.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| session_id | string | | Session ID. If provided, url is ignored and the current page state is captured. |
| url | string | | URL for a one-shot screenshot. Required if session_id is not provided. |
| wait_for | string | | CSS selector to wait for (one-shot mode only) |
| timeout | number | | Timeout in ms (default: 30000) |
| selector | string | | CSS selector of a specific element to capture |
| full_page | boolean | | Capture the full scrollable page height (default: false, ignored when selector is provided) |
| format | png \| jpeg | | Image format (default: png) |
| output | inline \| file \| both | ✅ | inline embeds the image in the response, file saves to disk and returns the path, both does both |
| path | string | | Absolute or relative path for the saved file. Defaults to ./screenshots/screenshot-<timestamp>.<format> |
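The interplay between the screenshot parameters (session_id versus url, selector versus full_page, and the default path) can be summarized in a small normalizer. This is a hypothetical sketch of the rules in the table, not the tool's real validation code. The `resolveScreenshotArgs` name, the returned field names, and the use of epoch milliseconds for `<timestamp>` are assumptions.

```javascript
// Hypothetical normalizer that applies the rules from the parameter table.
function resolveScreenshotArgs(args, now = new Date()) {
  if (!args.session_id && !args.url) {
    throw new Error('Either session_id or url is required');
  }
  if (!['inline', 'file', 'both'].includes(args.output)) {
    throw new Error('output must be inline, file, or both');
  }
  const format = args.format ?? 'png';
  if (!['png', 'jpeg'].includes(format)) {
    throw new Error('format must be png or jpeg');
  }

  const savesFile = args.output !== 'inline';
  return {
    // When session_id is provided, url is ignored.
    target: args.session_id ? { session_id: args.session_id } : { url: args.url },
    format,
    selector: args.selector ?? null,
    // full_page is ignored when a selector is provided.
    fullPage: args.selector ? false : Boolean(args.full_page),
    output: args.output,
    // Assumption: <timestamp> is rendered as epoch milliseconds.
    path: savesFile
      ? args.path ?? `./screenshots/screenshot-${now.getTime()}.${format}`
      : null,
  };
}

const resolved = resolveScreenshotArgs(
  { session_id: 'abc', url: 'https://example.com', selector: '#chart', full_page: true, output: 'both' },
  new Date(0),
);
console.log(resolved);
```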
Cookie Files
Cookies are loaded from .txt files in Netscape format. Place them in the folder specified by COOKIES_DIR. All files in the folder are loaded and merged automatically before each request.
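For reference, a Netscape-format cookie file has one cookie per line with seven tab-separated fields: domain, include-subdomains flag, path, secure flag, expiry (Unix seconds), name, and value. The sketch below parses such text and merges several files. It is a minimal illustration, not the package's loader: the function names are hypothetical, the `#HttpOnly_` prefix is a convention used by curl and some exporters, and letting later files override earlier ones on a duplicate cookie is an assumption.

```javascript
// Parses the text of one Netscape-format cookie file into cookie objects.
function parseNetscapeCookies(text) {
  const cookies = [];
  for (const rawLine of text.split(/\r?\n/)) {
    let line = rawLine;
    let httpOnly = false;
    if (line.startsWith('#HttpOnly_')) {
      // curl-style marker for HttpOnly cookies; the rest is a normal line.
      httpOnly = true;
      line = line.slice('#HttpOnly_'.length);
    } else if (line.trim() === '' || line.startsWith('#')) {
      continue; // blank line or comment
    }
    const fields = line.split('\t');
    if (fields.length < 7) continue; // malformed line, skipped
    const [domain, , path, secure, expires, name] = fields;
    cookies.push({
      name,
      value: fields.slice(6).join('\t'),
      domain,
      path,
      secure: secure.toUpperCase() === 'TRUE',
      httpOnly,
      expires: Number(expires), // 0 conventionally marks a session cookie
    });
  }
  return cookies;
}

// Merges cookies from several files.
// Assumption: on duplicates (same domain, path, and name) the later file wins.
function mergeCookieFiles(texts) {
  const byKey = new Map();
  for (const text of texts) {
    for (const cookie of parseNetscapeCookies(text)) {
      byKey.set(`${cookie.domain}|${cookie.path}|${cookie.name}`, cookie);
    }
  }
  return [...byKey.values()];
}

const fileA = [
  '# Netscape HTTP Cookie File',
  ['.example.com', 'TRUE', '/', 'TRUE', '1893456000', 'session', 'abc123'].join('\t'),
  '#HttpOnly_' + ['example.com', 'FALSE', '/app', 'FALSE', '0', 'token', 'xyz'].join('\t'),
].join('\n');
const fileB = ['.example.com', 'TRUE', '/', 'TRUE', '1893456000', 'session', 'updated'].join('\t');

const merged = mergeCookieFiles([fileA, fileB]);
console.log(merged);
```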
GUI (Optional)
When PROXY_URL is set, a cookie manager web interface is registered with the MCP proxy. It allows you to list, upload, and delete cookie files through a browser UI.
License
ISC
