Micrawl MCP Server
AI-facing interface to the Micrawl scraping engine – fetch pages, crawl sites, and persist structured knowledge through the Model Context Protocol (MCP).
This server lets Claude Code, Cursor, Windsurf, and any MCP-compatible client tap directly into the shared Micrawl runtime for live browsing, markdown generation, and local storage workflows.
Quick Start
# Build
pnpm install
pnpm --filter @micrawl/core build
pnpm --filter @micrawl/mcp-server build
# Configure your AI client
# Add to .claude/mcp.json or .cursor/mcp.json:
{
"mcpServers": {
"web-docs-saver": {
"description": "Fetch and save structured content from websites",
"command": "node",
"args": ["<absolute-path>/mcp-server/dist/stdio.js"]
}
}
}
# Restart your AI client, then ask:
"Save the Hono documentation from https://hono.dev/docs to ./docs"
"Fetch https://example.com and show me the content"
What is MCP?
The Model Context Protocol (MCP) is an open standard that lets AI assistants:
- Call tools (web scrapers, file systems, APIs)
- Access resources (documents, databases, cached content)
- Receive notifications (progress updates, status changes)
MCP uses stdio (standard input/output) for local integrations, making it perfect for developer tools.
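As a concrete illustration, the messages exchanged over stdio are newline-delimited JSON-RPC 2.0. The TypeScript sketch below shows roughly what a tools/call request and its reply look like; the id and URL are made up for the example, and this is illustrative rather than a transcript from this server.

```typescript
// Illustrative only: the rough shape of an MCP tool call over stdio.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "fetch_page",
    arguments: { url: "https://example.com" },
  },
};

const response = {
  jsonrpc: "2.0",
  id: 1,
  result: {
    // MCP tool results carry a list of content blocks; here, the page as markdown text.
    content: [{ type: "text", text: "# Example Domain\n..." }],
  },
};

// Each message is written to the server's stdin / read from its stdout
// as a single line of JSON.
console.log(JSON.stringify(request));
```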
Available Tools
fetch_page
Fetch clean documentation from a URL and return as markdown.
Parameters:
- url (required) - URL to fetch
Example:
"Get the content from https://hono.dev/docs/getting-started"
"Fetch https://example.com"
What it does:
- Fetches the URL using smart driver selection (Playwright or HTTP)
- Extracts clean content with Mozilla Readability
- Converts to markdown
- Returns the content in the AI conversation
Use when: You want to see/analyze content without saving it.
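For testing outside an AI client, the official TypeScript SDK can drive the server directly. The sketch below assumes the build output path used elsewhere in this README and relies on the SDK's stdio client transport; treat it as a starting point, not a supported API of this package.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server as a child process and talk to it over stdio.
const transport = new StdioClientTransport({
  command: "node",
  args: ["/absolute/path/to/micro-play/mcp-server/dist/stdio.js"],
});

const client = new Client({ name: "fetch-page-demo", version: "0.0.1" });
await client.connect(transport);

// Call the fetch_page tool with a URL and print the returned content blocks.
const result = await client.callTool({
  name: "fetch_page",
  arguments: { url: "https://hono.dev/docs/getting-started" },
});
console.log(result.content);

await client.close();
```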
save_docs
Save documentation to your local filesystem. Intelligently handles:
- Single page - Pass a URL string
- Multiple pages - Pass an array of URLs
- Entire site - Set crawl: true to follow links
Parameters:
- url (required) - Single URL or array of URLs
- outDir (required) - Local directory to save files (e.g., ./docs)
- crawl (optional, default: false) - Follow links to save entire site
- maxPages (optional, default: 20) - Maximum pages when crawling
- maxDepth (optional, default: 2) - Link depth when crawling
Examples:
Single page:
"Save https://hono.dev/docs to ./docs"
Multiple pages:
"Save these URLs to ./docs: https://hono.dev/docs/getting-started and https://hono.dev/docs/api"
Entire site (crawl):
"Save https://hono.dev/docs to ./docs and crawl all pages"
What it does:
- Fetches URL(s) with clean extraction
- Converts to markdown
- Auto-generates filenames from URLs
- Adds YAML frontmatter (url, title, timestamp, depth)
- Saves to specified directory
- Reports success with file paths
Use when: You want offline docs, are building a knowledge base, or are archiving documentation.
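The parameters above map naturally onto a schema. Since zod is listed as a dependency (see Dependencies), here is a sketch of how that validation might look; the schema is a guess at the shape based on the parameter list above, not this package's actual source.

```typescript
import { z } from "zod";

// Sketch of input validation for save_docs, assuming the parameters listed above.
// The real schema in this package may differ.
const saveDocsInput = z.object({
  url: z.union([z.string().url(), z.array(z.string().url())]),
  outDir: z.string(),
  crawl: z.boolean().default(false),
  maxPages: z.number().int().positive().default(20),
  maxDepth: z.number().int().nonnegative().default(2),
});

// Rejects malformed arguments before any scraping starts.
const args = saveDocsInput.parse({
  url: "https://hono.dev/docs",
  outDir: "./docs",
  crawl: true,
});
```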
Installation
Prerequisites
- Node.js 18+
- pnpm (or npm/yarn)
- AI client that supports MCP (Claude Code, Cursor, Windsurf, etc.)
Local Development
- Clone and install:
git clone <repository-url>
cd micro-play
pnpm install
- Build packages:
# Build core library first
pnpm --filter @micrawl/core build
# Build MCP server
pnpm --filter @micrawl/mcp-server build
- Configure your AI client:
See Configuration section below for your specific client.
Using npx
Run the published binary with npx to avoid cloning the repository. The command installs the package on demand and launches the stdio transport so you can wire it directly into your MCP client configuration.
npx @micrawl/mcp-server@latest
For Claude Code, point .claude/mcp.json at the npx invocation:
{
"mcpServers": {
"micrawl": {
"description": "Web documentation scraper",
"command": "npx",
"args": ["@micrawl/mcp-server@latest"],
"env": {
"MICRAWL_DOCS_DIR": "~/micrawl/data/docs",
"MICRAWL_LLMS_DIR": "~/micrawl/data/llms"
}
}
}
}
Tip: add --yes before the package name in the args array if you want npx to skip the interactive install prompt.
Docker Image
Pull the prebuilt container (or build it locally with the provided Dockerfile) and expose stdin/stdout so MCP clients can attach. The example mounts a local data directory for persisted markdown exports.
# Build locally (replace with docker pull micrawl/mcp-server:latest once published)
docker build -t micrawl/mcp-server -f mcp-server/Dockerfile .
# Run interactively so MCP clients can connect over stdio
docker run -it --rm \
-v "$(pwd)/micrawl-data:/app/data" \
micrawl/mcp-server
Configure Claude or another MCP client to execute the container on demand:
{
"mcpServers": {
"micrawl": {
"command": "docker",
"args": [
"run", "-i", "--rm",
"-v", "$(pwd)/micrawl-data:/app/data",
"micrawl/mcp-server"
]
}
}
}
Set additional scraper environment variables with -e NAME=value pairs or the env object in your MCP client configuration.
Configuration
Claude Code
Location: .claude/mcp.json in your project, or ~/.claude/mcp.json for global config
{
"mcpServers": {
"web-docs-saver": {
"description": "Fetch and save documentation from websites as clean markdown files. Use when user wants to download, save, or archive web documentation locally.",
"command": "node",
"args": ["/absolute/path/to/micro-play/mcp-server/dist/stdio.js"],
"env": {
"MICRAWL_DOCS_DIR": "./docs",
"SCRAPER_DEFAULT_TIMEOUT_MS": "60000"
},
"metadata": {
"triggers": ["documentation", "docs", "save", "fetch", "download", "archive", "website", "markdown"]
}
}
}
}
Pro tip: Add a rich description and triggers to help Claude Code discover your MCP server more often.
Cursor
Location: .cursor/mcp.json in your project
{
"mcpServers": {
"web-docs-saver": {
"type": "command",
"command": "node",
"args": ["/absolute/path/to/micro-play/mcp-server/dist/stdio.js"],
"env": {
"MICRAWL_DOCS_DIR": "/Users/you/cursor-docs"
}
}
}
}
Windsurf
Location: ~/.codeium/windsurf/mcp_config.json
{
"mcpServers": {
"web-docs-saver": {
"command": "node",
"args": ["/absolute/path/to/micro-play/mcp-server/dist/stdio.js"]
}
}
}
Other MCP Clients
Any MCP client with stdio support:
- Use node /absolute/path/to/stdio.js as the command
- Pass environment variables via client config
- Restart client to load MCP server
Environment Variables
MCP Server Settings
| Variable | Default | Description |
|----------|---------|-------------|
| MICRAWL_DOCS_DIR | ./docs | Directory where save_docs writes files |
Scraper Settings (from @micrawl/core)
| Variable | Default | Description |
|----------|---------|-------------|
| SCRAPER_DEFAULT_TIMEOUT_MS | 60000 | Page load timeout (1000-120000 ms) |
| SCRAPER_DEFAULT_DRIVER | playwright | Driver: playwright, http, or auto |
| SCRAPER_DEFAULT_LOCALE | en-US | Browser locale |
| SCRAPER_DEFAULT_TIMEZONE | America/New_York | Browser timezone |
| SCRAPER_DEFAULT_VIEWPORT_WIDTH | 1920 | Viewport width (320-4096 px) |
| SCRAPER_DEFAULT_VIEWPORT_HEIGHT | 1080 | Viewport height (320-4096 px) |
| CHROMIUM_BINARY | (auto) | Custom Chromium binary path |
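As an illustration of how these settings might be consumed, the sketch below reads a few of them with the defaults documented above. It is not this package's actual configuration code; the variable names come from the tables, the parsing is an assumption.

```typescript
// Illustrative only: reading scraper settings with the documented defaults.
const env = process.env;

const config = {
  docsDir: env.MICRAWL_DOCS_DIR ?? "./docs",
  timeoutMs: Number(env.SCRAPER_DEFAULT_TIMEOUT_MS ?? 60000),
  driver: env.SCRAPER_DEFAULT_DRIVER ?? "playwright",
  locale: env.SCRAPER_DEFAULT_LOCALE ?? "en-US",
  viewport: {
    width: Number(env.SCRAPER_DEFAULT_VIEWPORT_WIDTH ?? 1920),
    height: Number(env.SCRAPER_DEFAULT_VIEWPORT_HEIGHT ?? 1080),
  },
};

// Keep the timeout inside the documented 1000-120000 ms range.
config.timeoutMs = Math.min(Math.max(config.timeoutMs, 1000), 120000);
```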
Example .env
Create .env in your project root:
# Where to save docs
MICRAWL_DOCS_DIR=/Users/you/Documents/docs
# Scraper behavior
SCRAPER_DEFAULT_TIMEOUT_MS=60000
SCRAPER_DEFAULT_DRIVER=auto
Usage Examples
Fetching Content
User asks:
"What's on https://hono.dev/docs/getting-started?"
"Show me the content from https://example.com"
AI calls: fetch_page({ url: "https://..." })
Result: Clean markdown content shown in conversation
Saving Single Page
User asks:
"Save the Hono getting started guide to my docs folder"
AI calls: save_docs({ url: "https://hono.dev/docs/getting-started", outDir: "./docs" })
Result:
✅ Saved: ./docs/hono-dev-docs-getting-started.md
Title: Getting Started | Hono
URL: https://hono.dev/docs/getting-started
File structure:
docs/
└── hono-dev-docs-getting-started.md
Saving Multiple Pages
User asks:
"Save these 3 Hono docs pages: getting-started, routing, and middleware"
AI calls:
save_docs({
url: [
"https://hono.dev/docs/getting-started",
"https://hono.dev/docs/routing",
"https://hono.dev/docs/middleware"
],
outDir: "./docs"
})
Result:
✅ Saved 3/3 pages to ./docs
1. ./docs/hono-dev-docs-getting-started.md
2. ./docs/hono-dev-docs-routing.md
3. ./docs/hono-dev-docs-middleware.md
Crawling Entire Site
User asks:
"Save all the Hono documentation, crawl the whole site"
AI calls:
save_docs({
url: "https://hono.dev/docs",
outDir: "./docs/hono",
crawl: true,
maxPages: 50,
maxDepth: 3
})
Result:
✅ Crawled and saved 47 pages to ./docs/hono
1. ./docs/hono/hono-dev-docs.md (depth 0)
2. ./docs/hono/hono-dev-docs-getting-started.md (depth 1)
3. ./docs/hono/hono-dev-docs-api-request.md (depth 1)
... (44 more files)
File structure:
docs/
└── hono/
├── hono-dev-docs.md
├── hono-dev-docs-getting-started.md
├── hono-dev-docs-routing.md
└── ... (44 more files)
How It Works
Data Flow
AI Client (Claude/Cursor)
↓ stdio
MCP Server (this package)
↓ imports
@micrawl/core (scraping engine)
↓ uses
Playwright or HTTP driver
↓ fetches
Website → Clean Markdown
File Processing
URL → Filename
https://hono.dev/docs/getting-started → hono-dev-docs-getting-started.md
- Removes protocol, replaces special chars with hyphens
- Lowercased, max 100 chars
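A minimal sketch of that transformation, based only on the rules stated above (this is not the package's actual files.ts implementation):

```typescript
// Turn a URL into a markdown filename: drop the protocol, replace
// runs of special characters with hyphens, lowercase, cap at 100 characters.
function urlToFilename(url: string): string {
  const slug = url
    .replace(/^https?:\/\//, "")     // remove protocol
    .replace(/[^a-zA-Z0-9]+/g, "-")  // special chars -> hyphens
    .replace(/^-+|-+$/g, "")         // trim stray hyphens
    .toLowerCase()
    .slice(0, 100);
  return `${slug}.md`;
}

// urlToFilename("https://hono.dev/docs/getting-started")
// -> "hono-dev-docs-getting-started.md"
```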
Content Processing
- Mozilla Readability extracts main content
- h2m-parser converts HTML → Markdown
- Removes navigation, ads, footers
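A rough sketch of that pipeline using @mozilla/readability with jsdom; the markdown step is shown as a placeholder helper because the exact h2m-parser API is not documented here.

```typescript
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

// Placeholder for the HTML -> Markdown step (h2m-parser in this project).
// A real converter would translate tags to markdown; this stub is illustrative only.
const htmlToMarkdown = (html: string): string => html;

function extractMarkdown(html: string, url: string): { title: string; markdown: string } | null {
  // Readability strips navigation, ads, and footers, keeping the main article.
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) return null;

  return {
    title: article.title ?? "",
    markdown: htmlToMarkdown(article.content ?? ""),
  };
}
```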
Frontmatter (YAML)
---
url: "https://hono.dev/docs/getting-started"
title: "Getting Started | Hono"
scraped_at: "2025-09-30T12:00:00.000Z"
depth: 1
---

# Getting Started
...
Crawling Logic
When crawl: true:
- Start at url (depth 0)
- Extract all same-origin links from the page
- Queue links at depth + 1
- Scrape each page, save markdown
- Repeat until maxDepth or maxPages is reached
- Return summary of saved files
Same-origin only - Won't follow external links
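A condensed sketch of that breadth-first loop. The scrapePage parameter is a hypothetical stand-in for the @micrawl/core scraping call; the real crawler lives in that package and will differ in detail.

```typescript
// Breadth-first crawl with the limits described above.
async function crawl(
  startUrl: string,
  maxPages: number,
  maxDepth: number,
  scrapePage: (url: string) => Promise<{ markdown: string; links: string[] }>,
): Promise<Map<string, string>> {
  const origin = new URL(startUrl).origin;
  const saved = new Map<string, string>(); // url -> markdown
  const seen = new Set<string>([startUrl]);
  const queue: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];

  while (queue.length > 0 && saved.size < maxPages) {
    const { url, depth } = queue.shift()!;
    const page = await scrapePage(url);
    saved.set(url, page.markdown);

    if (depth >= maxDepth) continue;
    for (const link of page.links) {
      const absolute = new URL(link, url);
      // Same-origin only: external links are never followed.
      if (absolute.origin !== origin) continue;
      if (!seen.has(absolute.href)) {
        seen.add(absolute.href);
        queue.push({ url: absolute.href, depth: depth + 1 });
      }
    }
  }
  return saved;
}
```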
Development
Running Tests
# All tests
pnpm --filter @micrawl/mcp-server test
# Watch mode
pnpm --filter @micrawl/mcp-server test --watch
Building
pnpm --filter @micrawl/mcp-server build
Output: mcp-server/dist/stdio.js (entrypoint)
Debugging
Enable verbose logging:
NODE_ENV=development node mcp-server/dist/stdio.js
Logs show:
- Tool invocations with parameters
- Scraping progress (queued → navigating → capturing)
- File saves with paths
- Errors with stack traces
Test with MCP Inspector:
npx @modelcontextprotocol/inspector node mcp-server/dist/stdio.js
Opens a web UI to manually:
- Call fetch_page and save_docs
- Inspect tool schemas
- View server metadata
Security
The Micrawl MCP Server is designed with security as a priority. Key security features:
- stdio-only transport - No network exposure, process isolation
- Path traversal protection - Validates all filesystem operations
- Input validation - All parameters validated with Zod schemas
- No command execution - No shell commands with user input
- Minimal dependencies - Only 3 direct dependencies
Security Warning: Only scrape documentation from trusted sources. Malicious websites could contain content designed to trick the AI into performing unintended actions.
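A minimal sketch of the kind of path check described under "Path traversal protection": resolve the requested filename against the output directory and refuse anything that escapes it. This illustrates the idea, not the package's actual code.

```typescript
import path from "node:path";

// Reject filenames like "../../etc/passwd" that would escape outDir.
function resolveSafely(outDir: string, filename: string): string {
  const root = path.resolve(outDir);
  const target = path.resolve(root, filename);

  // The resolved target must stay inside the resolved output directory.
  if (target !== root && !target.startsWith(root + path.sep)) {
    throw new Error(`Refusing to write outside ${root}: ${filename}`);
  }
  return target;
}

// resolveSafely("./docs", "hono-dev-docs.md")  -> absolute path under ./docs
// resolveSafely("./docs", "../secrets.md")     -> throws
```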
Troubleshooting
"MCP server not found" in AI client
Fix:
- Use absolute paths in config (not relative)
- Rebuild: pnpm --filter @micrawl/mcp-server build
- Restart AI client
Files not saving
Check:
- MICRAWL_DOCS_DIR is writable
- Directory exists (MCP server creates subdirs, not the root)
- No permission errors in logs
Fix:
mkdir -p ./docs
chmod 755 ./docs
Scraping times out
Fix:
# Increase timeout (in .env or config)
SCRAPER_DEFAULT_TIMEOUT_MS=120000
Or tell the AI to use the HTTP driver:
"Save https://example.com using the HTTP driver"
Chromium launch fails
Fix:
# Install Chromium dependencies (Linux)
sudo apt-get install -y \
libnss3 libatk1.0-0 libatk-bridge2.0-0 \
libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2
# Or use HTTP driver (no browser needed)
SCRAPER_DEFAULT_DRIVER=http
AI client doesn't use MCP tools
Fix:
- Add a rich description to config (see Configuration)
- Add triggers keywords in metadata
- Ask more explicitly: "Use the web-docs-saver MCP server to..."
Architecture
Package Structure
mcp-server/
├── src/
│ ├── server.ts # MCP server (2 tools)
│ ├── scraper.ts # Wrapper around @micrawl/core
│ ├── files.ts # File utilities (save, frontmatter)
│ └── stdio.ts # CLI entrypoint
├── tests/
│ └── stdio.integration.test.ts
├── dist/ # Build output
├── package.json
└── tsconfig.json
Total: 315 lines of code (4 files)
Dependencies
- @modelcontextprotocol/sdk - MCP protocol
- @micrawl/core - Scraping engine (shared with API)
- zod - Schema validation
No duplication: Scraping logic lives in @micrawl/core, used by both MCP server and HTTP API.
Design Principles
Following cognitive-load.md:
- 2 tools not 4 - Reduced choice paralysis
- Natural names - fetch_page, save_docs (not "scrape", "crawl")
- Deep modules - Simple interface, complex logic hidden
- Smart defaults - Readability/frontmatter always on
- Example-driven - Tool descriptions show real usage
Before: 4 tools, 363 lines, technical jargon 🧠+++
After: 2 tools, 315 lines, natural language 🧠
Roadmap
Phase 1: Current State ✅
- Clean documentation extraction with Mozilla Readability
- 2 simple tools (fetch_page, save_docs)
- Multi-format output (HTML, Markdown)
- Smart driver selection (Playwright, HTTP, auto)
- Crawling with depth/page limits
Phase 2: Enhanced Intelligence (Planned)
- search_docs - Search locally saved documentation
- Code example extraction from docs
- Auto-save project dependencies (via AI delegation)
- Documentation freshness detection
Phase 3: Advanced Features (Future)
- Semantic search with embeddings
- Multi-source search (local + web)
- Interactive tutorial generation
See docs/development-plan.md for detailed roadmap.
Contributing
See main project README for contribution guidelines.
License
MIT - See LICENSE for details
Support
- Issues: GitHub Issues
- API Docs: Main README
- Roadmap: Development Plan
