@cosmocoder/mcp-web-docs

v1.2.0

Published

12 days ago

MCP server for crawling and indexing web documentation - works with any website

Downloads

409

cosmocoder

mcp model-context-protocol documentation indexing search crawler ai llm cursor claude semantic-search vector-search

MCP Web Docs

Index Any Documentation. Search Locally. Stay Private.

A self-hosted Model Context Protocol (MCP) server that crawls, indexes, and searches documentation from any website. Unlike remote MCP servers limited to GitHub repos or pre-indexed libraries, web-docs gives you full control over what gets indexed, including private documentation behind authentication.

❌ The Problem

AI assistants struggle with documentation:

❌ Remote MCP servers only work with GitHub or pre-indexed libraries
❌ Private docs behind authentication can't be accessed
❌ Outdated indexes don't reflect your team's latest documentation
❌ No control over what gets indexed or when

✅ The Solution

MCP Web Docs crawls and indexes documentation from ANY website locally:

✅ Any website - Docusaurus, Storybook, GitBook, custom sites, internal wikis
✅ Private docs - Interactive browser login for authenticated sites
✅ Always fresh - Re-index anytime with one command
✅ Your data, your machine - No API keys, no cloud, full privacy

✨ Features

🌐 Universal Crawler - Works with any documentation site, not just GitHub
🔍 Hybrid Search - Combines full-text search (FTS) with semantic vector search
🏷️ Tags & Categories - Organize docs with tags and filter searches by project, team, or category
🔐 Authentication Support - Crawl private/protected docs with interactive browser login (auto-detects your default browser)
📊 Smart Extraction - Automatically extracts code blocks, props tables, and structured content
⚡ Local Embeddings - Uses FastEmbed for fast, private embedding generation (no API keys)
🗄️ Persistent Storage - LanceDB for vectors, SQLite for metadata
🔄 Real-time Progress - Track indexing status with progress updates
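
The README does not say how the full-text and vector results are merged. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common way to combine two ranked lists. The function name, the constant k = 60, and the sample IDs are assumptions, not the server's actual implementation.

```typescript
// Hypothetical sketch: merge two ranked result lists with reciprocal rank fusion (RRF).
// Not taken from mcp-web-docs; the real server may combine results differently.
function fuseRankings(ftsIds: string[], vectorIds: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of [ftsIds, vectorIds]) {
    ranking.forEach((id, index) => {
      // Rank is 1-based; each list contributes 1 / (k + rank).
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// "a" and "b" appear in both lists, so they outrank "c" and "d".
const fused = fuseRankings(["a", "b", "c"], ["b", "d", "a"]);
console.log(fused.join(", "));
```

Documents found by both searches accumulate two contributions, which is why a hybrid approach tends to favor results that match both the keywords and the meaning of a query.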

🚀 Installation

Prerequisites

Node.js >= 22.19.0

Option 1: Install from npm (Recommended)

npm install -g @cosmocoder/mcp-web-docs

Option 2: Run with npx

No installation required - just configure your MCP client to use npx (see below).

Option 3: Build from Source

# Clone the repository
git clone https://github.com/cosmocoder/mcp-web-docs.git
cd mcp-web-docs

# Install dependencies (automatically installs Playwright browsers)
npm install

# Build
npm run build

Configure Your MCP Client

Add to your Cursor MCP settings (~/.cursor/mcp.json):

Using npx (no install required):

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

Using local build:

{
  "mcpServers": {
    "web-docs": {
      "command": "node",
      "args": ["/path/to/mcp-web-docs/build/index.js"]
    }
  }
}

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

Using npx:

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

Add to your VS Code MCP config (.vscode/mcp.json in your workspace):

Using npx:

{
  "servers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "servers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

Add to your Windsurf MCP config (~/.codeium/windsurf/mcp_config.json):

Using npx:

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

Add to your Cline MCP settings (~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json on macOS):

Using npx:

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"],
      "disabled": false,
      "autoApprove": []
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs",
      "disabled": false,
      "autoApprove": []
    }
  }
}

Global configuration: Open RooCode → Click MCP icon → "Edit Global MCP"

Project-level configuration: Create .roo/mcp.json at your project root

Using npx:

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}
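
The client snippets above differ mainly in the top-level key (`servers` for `.vscode/mcp.json`, `mcpServers` for the others) and in the launch command. The sketch below merges the npx entry into an already-parsed config object without disturbing other servers. The helper name `withWebDocs` and the type names are hypothetical and exist only for this illustration.

```typescript
// Hypothetical helper: add the web-docs entry to an already-parsed MCP client config.
type ServerEntry = { command: string; args?: string[] };
type McpConfig = Record<string, Record<string, ServerEntry> | undefined>;

function withWebDocs(config: McpConfig, rootKey: "mcpServers" | "servers"): McpConfig {
  const entry: ServerEntry = { command: "npx", args: ["-y", "@cosmocoder/mcp-web-docs"] };
  // Copy rather than mutate so the caller's object stays untouched.
  return { ...config, [rootKey]: { ...(config[rootKey] ?? {}), "web-docs": entry } };
}

// Example: a config that already lists another (made-up) server.
const existing: McpConfig = { mcpServers: { other: { command: "other-server" } } };
const updated = withWebDocs(existing, "mcpServers");
console.log(JSON.stringify(updated, null, 2));
```

Writing the result back to the client's config file is left out, since file locations vary by client and operating system.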

⚡ Quick Start

1. Index public documentation

Index the LanceDB documentation from https://lancedb.com/docs/

The AI assistant will call add_documentation and begin crawling.

2. Search for information

How do I create a table in LanceDB?

The AI will use search_documentation to find relevant content.

3. For private docs, authenticate first

I need to index private documentation at https://internal.company.com/docs/
It requires authentication.

A browser window will open for you to log in. The session is saved for future crawls.
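
The ordering matters here: the login has to happen before the crawl. The sketch below shows that sequencing using only the tool names and parameters that appear in this README. The `ToolCall` type and the `planIndexing` helper are hypothetical, and nothing in it talks to a real server.

```typescript
// Hypothetical planner: returns the tool calls an assistant would make, in order.
type ToolCall = { name: string; arguments: Record<string, unknown> };

function planIndexing(url: string, requiresAuth: boolean): ToolCall[] {
  const calls: ToolCall[] = [];
  if (requiresAuth) {
    // Log in first so the saved session is available to the crawler.
    calls.push({ name: "authenticate", arguments: { url } });
  }
  calls.push({
    name: "add_documentation",
    arguments: requiresAuth ? { url, auth: { requiresAuth: true } } : { url },
  });
  return calls;
}

const plan = planIndexing("https://internal.company.com/docs/", true);
console.log(plan.map((call) => call.name).join(" -> ")); // prints "authenticate -> add_documentation"
```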

🔨 Available Tools

`add_documentation`

Add a new documentation site for indexing.

add_documentation({
  url: "https://docs.example.com/",
  title: "Example Docs",              // Optional
  id: "example-docs",                 // Optional custom ID
  tags: ["frontend", "mycompany"],    // Optional tags for categorization
  auth: {                             // Optional authentication
    requiresAuth: true,
    // browser auto-detected from OS settings if omitted
    loginTimeoutSecs: 300
  }
})
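
Under MCP, a tool invocation like the one above travels as a JSON-RPC 2.0 `tools/call` request. The sketch below builds that message for `add_documentation`. The argument interface is inferred from the example above rather than from the server's published schema, and `toolsCallRequest` is a hypothetical helper.

```typescript
// Argument shape inferred from the README example; treat it as an assumption.
interface AddDocumentationArgs {
  url: string;
  title?: string;
  id?: string;
  tags?: string[];
  auth?: { requiresAuth: boolean; browser?: string; loginTimeoutSecs?: number };
}

// Hypothetical helper: wrap tool arguments in an MCP "tools/call" JSON-RPC request.
function toolsCallRequest<T extends object>(id: number, name: string, args: T) {
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name, arguments: args },
  };
}

const addArgs: AddDocumentationArgs = {
  url: "https://docs.example.com/",
  tags: ["frontend", "mycompany"],
};
const request = toolsCallRequest(1, "add_documentation", addArgs);
console.log(JSON.stringify(request));
```

In normal use the MCP client builds this message for you; seeing it spelled out is mostly useful when debugging server logs.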

`search_documentation`

Search through indexed documentation using hybrid search (FTS + semantic).

search_documentation({
  query: "how to configure authentication",
  url: "https://docs.example.com/",    // Optional: filter to specific site
  tags: ["frontend", "mycompany"],     // Optional: filter by tags
  limit: 10                            // Optional: max results
})

`authenticate`

Open a browser window for interactive login to protected sites. Your default browser is automatically detected from OS settings.

authenticate({
  url: "https://private-docs.example.com/",
  // browser auto-detected from OS settings - only specify to override
  loginTimeoutSecs: 300         // Optional: timeout in seconds
})

`list_documentation`

List all indexed documentation sites with their metadata including tags.

`set_tags`

Set or update tags for a documentation site. Tags help categorize and filter documentation.

set_tags({
  url: "https://docs.example.com/",
  tags: ["frontend", "react", "mycompany"]  // Replaces existing tags
})
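
Because `set_tags` replaces the whole list, adding a single tag means sending the existing tags plus the new one. The sketch below builds that merged list. `mergeTags` is a hypothetical helper, and the current tags are assumed to come from `list_documentation`.

```typescript
// Hypothetical helper: combine current and new tags, dropping duplicates and blanks.
function mergeTags(current: string[], added: string[]): string[] {
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const raw of [...current, ...added]) {
    const tag = raw.trim();
    if (tag !== "" && !seen.has(tag)) {
      seen.add(tag);
      merged.push(tag);
    }
  }
  return merged;
}

// Existing tags stay in their original order; "frontend" is not duplicated.
const nextTags = mergeTags(["frontend", "mycompany"], ["react", "frontend"]);
console.log(nextTags); // pass this array as the `tags` argument of set_tags
```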

`list_tags`

List all available tags with usage counts. Useful to see what tags exist across your indexed docs.

`reindex_documentation`

Re-crawl and re-index a specific documentation site.

`get_indexing_status`

Get the current status of indexing operations.

`delete_documentation`

Delete an indexed documentation site and all its data.

`clear_auth`

Clear saved authentication session for a domain.

💡 Tips

Crafting Better Search Queries

The search uses hybrid full-text and semantic search. For best results:

Be specific - Include unique terms from what you're looking for
- Instead of: "Button props"
- Try: "Button props onClick disabled loading"
Use exact phrases - Wrap in quotes for exact matching
- "authentication middleware" finds that exact phrase
Include context - Add related terms to narrow results
- API docs: "GET /users endpoint authentication headers"
- Config: "webpack config entry output plugins"
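
These tips can be combined mechanically. The sketch below assembles a query string from a list of specific terms and an optional exact phrase. `buildQuery` is a hypothetical helper and is not part of the server.

```typescript
// Hypothetical helper: build a search string from specific terms plus an optional exact phrase.
function buildQuery(terms: string[], exactPhrase?: string): string {
  const parts: string[] = [];
  if (exactPhrase !== undefined && exactPhrase.trim() !== "") {
    // Quotes request exact-phrase matching, as described in the tips above.
    parts.push(`"${exactPhrase.trim()}"`);
  }
  // Drop empty entries so stray whitespace never reaches the query.
  parts.push(...terms.map((term) => term.trim()).filter((term) => term !== ""));
  return parts.join(" ");
}

const specificQuery = buildQuery(["Button", "props", "onClick", "disabled", "loading"]);
const phraseQuery = buildQuery(["headers"], "authentication middleware");
console.log(specificQuery); // Button props onClick disabled loading
console.log(phraseQuery);   // "authentication middleware" headers
```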

Auto-Invoke with Rules

To avoid typing search instructions in every prompt, add a rule to your MCP client:

Cursor (Cursor Settings > Rules):

When I ask about library documentation or need code examples,
use the web-docs MCP server to search indexed documentation.

Windsurf (.windsurfrules):

Always use web-docs search_documentation when I ask about
API references, configuration, or library usage.

Scoping Searches

If you have multiple sites indexed, filter by URL or tags:

// Filter by specific site URL
search_documentation({
  query: "routing",
  url: "https://nextjs.org/docs/"
})

// Filter by tags (searches all docs with matching tags)
search_documentation({
  query: "Button component",
  tags: ["frontend", "mycompany"]  // Only docs tagged with BOTH tags
})
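
The `tags` filter is conjunctive: a site must carry every listed tag to be searched. The sketch below shows that matching rule over an in-memory list. The `IndexedSite` type and `filterByTags` helper are made up for illustration and do not reflect how the server stores or queries tags.

```typescript
type IndexedSite = { url: string; tags: string[] };

// Keep only sites that have every requested tag (AND semantics).
function filterByTags(sites: IndexedSite[], required: string[]): IndexedSite[] {
  return sites.filter((site) => required.every((tag) => site.tags.includes(tag)));
}

const indexedSites: IndexedSite[] = [
  { url: "https://docs.mycompany.com/ui-components/", tags: ["frontend", "mycompany", "react"] },
  { url: "https://docs.mycompany.com/api/", tags: ["backend", "mycompany", "api"] },
];

// Only the UI components site carries both tags.
const matches = filterByTags(indexedSites, ["frontend", "mycompany"]);
console.log(matches.map((site) => site.url));
```

A single shared tag such as "mycompany" matches both sites, which makes broad tags useful for company-wide searches and narrow ones for team-specific searches.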

Organizing with Tags

Tags help organize documentation when you have multiple related sites. Add tags when indexing:

// Index frontend package docs
add_documentation({
  url: "https://docs.mycompany.com/ui-components/",
  tags: ["frontend", "mycompany", "react"]
})

// Index backend API docs
add_documentation({
  url: "https://docs.mycompany.com/api/",
  tags: ["backend", "mycompany", "api"]
})

Later, search across all frontend docs:

search_documentation({
  query: "authentication",
  tags: ["frontend"]  // Searches all frontend-tagged docs
})

You can also tag existing documentation with set_tags. Note that set_tags replaces the current tag list, so include any tags you want to keep.

🚨 Troubleshooting

The content extractor couldn't process the page. Try:

- Re-indexing the documentation
- Checking whether the site relies on JavaScript rendering (Playwright should handle this)
- Looking at the crawled data in ~/.mcp-web-docs/crawlee/datasets/

If crawling a site that requires login fails:

- Make sure you call authenticate before add_documentation
- Keep the browser window open until login is detected
- For OAuth sites, complete the full flow manually
- Your default browser is auto-detected; to use a different one, pass browser: "firefox" (for example)

If search results are poor or missing:

- Try more specific queries with unique terms
- Use quotes for exact phrase matching
- Filter by URL to search within a specific documentation site
- Re-index if the documentation has been updated

If browsers aren't installed, run:

npx playwright install

Data Storage

All data is stored locally in ~/.mcp-web-docs/:

~/.mcp-web-docs/
├── docs.db           # SQLite database for document metadata
├── vectors/          # LanceDB vector database
├── sessions/         # Saved authentication sessions
└── crawlee/          # Crawlee datasets (cached crawl data)
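
For scripts that inspect or back up this data, the layout can be expressed as a small lookup. In this sketch the home directory is passed in rather than read from the environment so the code stays dependency-free, and POSIX-style separators are assumed. `storagePaths` is a hypothetical helper.

```typescript
// Hypothetical helper mirroring the directory layout listed above.
function storagePaths(homeDir: string) {
  // Strip trailing slashes so the joined path never contains "//".
  const base = `${homeDir.replace(/\/+$/, "")}/.mcp-web-docs`;
  return {
    base,
    metadataDb: `${base}/docs.db`, // SQLite database for document metadata
    vectors: `${base}/vectors`,    // LanceDB vector database
    sessions: `${base}/sessions`,  // saved authentication sessions
    crawlCache: `${base}/crawlee`, // cached Crawlee datasets
  };
}

const paths = storagePaths("/home/alice");
console.log(paths.metadataDb); // /home/alice/.mcp-web-docs/docs.db
```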

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

Model Context Protocol - The protocol specification
Crawlee - Web scraping and browser automation
LanceDB - Vector database
FastEmbed - Local embedding generation
Playwright - Browser automation
