@cosmocoder/mcp-web-docs
v1.2.0
MCP server for crawling and indexing web documentation - works with any website
MCP Web Docs
Index Any Documentation. Search Locally. Stay Private.
A self-hosted Model Context Protocol (MCP) server that crawls, indexes, and searches documentation from any website. Unlike remote MCP servers limited to GitHub repos or pre-indexed libraries, web-docs gives you full control over what gets indexed — including private documentation behind authentication.
Features • Installation • Quick Start • Tools • Tips • Troubleshooting • Contributing
❌ The Problem
AI assistants struggle with documentation:
- ❌ Remote MCP servers only work with GitHub or pre-indexed libraries
- ❌ Private docs behind authentication can't be accessed
- ❌ Outdated indexes don't reflect your team's latest documentation
- ❌ No control over what gets indexed or when
✅ The Solution
MCP Web Docs crawls and indexes documentation from ANY website locally:
- ✅ Any website - Docusaurus, Storybook, GitBook, custom sites, internal wikis
- ✅ Private docs - Interactive browser login for authenticated sites
- ✅ Always fresh - Re-index anytime with one command
- ✅ Your data, your machine - No API keys, no cloud, full privacy
✨ Features
- 🌐 Universal Crawler - Works with any documentation site, not just GitHub
- 🔍 Hybrid Search - Combines full-text search (FTS) with semantic vector search
- 🏷️ Tags & Categories - Organize docs with tags and filter searches by project, team, or category
- 🔐 Authentication Support - Crawl private/protected docs with interactive browser login (auto-detects your default browser)
- 📊 Smart Extraction - Automatically extracts code blocks, props tables, and structured content
- ⚡ Local Embeddings - Uses FastEmbed for fast, private embedding generation (no API keys)
- 🗄️ Persistent Storage - LanceDB for vectors, SQLite for metadata
- 🔄 Real-time Progress - Track indexing status with progress updates
🚀 Installation
Prerequisites
- Node.js >= 22.19.0
Option 1: Install from NPM (Recommended)
npm install -g @cosmocoder/mcp-web-docs
Option 2: Run with npx
No installation required - just configure your MCP client to use npx (see below).
Option 3: Build from Source
# Clone the repository
git clone https://github.com/cosmocoder/mcp-web-docs.git
cd mcp-web-docs
# Install dependencies (automatically installs Playwright browsers)
npm install
# Build
npm run build
Configure Your MCP Client
Add to your Cursor MCP settings (~/.cursor/mcp.json):
Using npx (no install required):
{
"mcpServers": {
"web-docs": {
"command": "npx",
"args": ["-y", "@cosmocoder/mcp-web-docs"]
}
}
}
Using global install:
{
"mcpServers": {
"web-docs": {
"command": "mcp-web-docs"
}
}
}
Using local build:
{
"mcpServers": {
"web-docs": {
"command": "node",
"args": ["/path/to/mcp-web-docs/build/index.js"]
}
}
}
Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
Using npx:
{
"mcpServers": {
"web-docs": {
"command": "npx",
"args": ["-y", "@cosmocoder/mcp-web-docs"]
}
}
}
Using global install:
{
"mcpServers": {
"web-docs": {
"command": "mcp-web-docs"
}
}
}
Add to .vscode/mcp.json in your workspace (VS Code):
Using npx:
{
"servers": {
"web-docs": {
"command": "npx",
"args": ["-y", "@cosmocoder/mcp-web-docs"]
}
}
}
Using global install:
{
"servers": {
"web-docs": {
"command": "mcp-web-docs"
}
}
}
Add to ~/.codeium/windsurf/mcp_config.json (Windsurf):
Using npx:
{
"mcpServers": {
"web-docs": {
"command": "npx",
"args": ["-y", "@cosmocoder/mcp-web-docs"]
}
}
}
Using global install:
{
"mcpServers": {
"web-docs": {
"command": "mcp-web-docs"
}
}
}
Add to ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json (Cline):
Using npx:
{
"mcpServers": {
"web-docs": {
"command": "npx",
"args": ["-y", "@cosmocoder/mcp-web-docs"],
"disabled": false,
"autoApprove": []
}
}
}
Using global install:
{
"mcpServers": {
"web-docs": {
"command": "mcp-web-docs",
"disabled": false,
"autoApprove": []
}
}
}
Global configuration: Open RooCode → Click MCP icon → "Edit Global MCP"
Project-level configuration: Create .roo/mcp.json at your project root
Using npx:
{
"mcpServers": {
"web-docs": {
"command": "npx",
"args": ["-y", "@cosmocoder/mcp-web-docs"]
}
}
}
Using global install:
{
"mcpServers": {
"web-docs": {
"command": "mcp-web-docs"
}
}
}
⚡ Quick Start
1. Index public documentation
Index the LanceDB documentation from https://lancedb.com/docs/
The AI assistant will call add_documentation and begin crawling.
2. Search for information
How do I create a table in LanceDB?
The AI will use search_documentation to find relevant content.
3. For private docs, authenticate first
I need to index private documentation at https://internal.company.com/docs/
It requires authentication.
A browser window will open for you to log in. The session is saved for future crawls.
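If you drive the server from a script rather than a chat prompt, the same order applies: authenticate first, then index. The sketch below is a minimal illustration of that order. The `callTool` parameter is a hypothetical stand-in for whatever function your MCP client provides to invoke a tool by name; it is not part of mcp-web-docs. The tool names and argument fields are the ones shown in this README.

```javascript
// Hypothetical helper: `callTool(name, args)` stands in for your MCP client's
// tool-invocation function. It is an assumption, not an mcp-web-docs API.
async function indexPrivateDocs(callTool, url, tags = []) {
  // Step 1: open the interactive login so a session is saved for this site.
  await callTool("authenticate", { url, loginTimeoutSecs: 300 });

  // Step 2: crawl and index, flagging that the site requires authentication.
  return callTool("add_documentation", {
    url,
    tags,
    auth: { requiresAuth: true },
  });
}

// Usage with a stub that only records calls (no real server involved).
const calls = [];
const stubCallTool = async (name, args) => {
  calls.push({ name, args });
  return { ok: true };
};

indexPrivateDocs(stubCallTool, "https://internal.company.com/docs/", ["internal"])
  .then(() => console.log(calls.map((c) => c.name).join(" -> ")));
// prints "authenticate -> add_documentation"
```

Passing `callTool` in as a parameter keeps the sketch independent of any particular MCP client library.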
🔨 Available Tools
add_documentation
Add a new documentation site for indexing.
add_documentation({
url: "https://docs.example.com/",
title: "Example Docs", // Optional
id: "example-docs", // Optional custom ID
tags: ["frontend", "mycompany"], // Optional tags for categorization
auth: { // Optional authentication
requiresAuth: true,
// browser auto-detected from OS settings if omitted
loginTimeoutSecs: 300
}
})
search_documentation
Search through indexed documentation using hybrid search (FTS + semantic).
search_documentation({
query: "how to configure authentication",
url: "https://docs.example.com/", // Optional: filter to specific site
tags: ["frontend", "mycompany"], // Optional: filter by tags
limit: 10 // Optional: max results
})
authenticate
Open a browser window for interactive login to protected sites. Your default browser is automatically detected from OS settings.
authenticate({
url: "https://private-docs.example.com/",
// browser auto-detected from OS settings - only specify to override
loginTimeoutSecs: 300 // Optional: timeout in seconds
})
list_documentation
List all indexed documentation sites with their metadata including tags.
set_tags
Set or update tags for a documentation site. Tags help categorize and filter documentation.
set_tags({
url: "https://docs.example.com/",
tags: ["frontend", "react", "mycompany"] // Replaces existing tags
})
list_tags
List all available tags with usage counts. Useful to see what tags exist across your indexed docs.
reindex_documentation
Re-crawl and re-index a specific documentation site.
get_indexing_status
Get the current status of indexing operations.
delete_documentation
Delete an indexed documentation site and all its data.
clear_auth
Clear saved authentication session for a domain.
💡 Tips
Crafting Better Search Queries
The search uses hybrid full-text and semantic search. For best results:
Be specific - Include unique terms from what you're looking for
- Instead of: "Button props"
- Try: "Button props onClick disabled loading"
Use exact phrases - Wrap in quotes for exact matching
- "authentication middleware" finds that exact phrase
Include context - Add related terms to narrow results
- API docs: "GET /users endpoint authentication headers"
- Config: "webpack config entry output plugins"
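To see why specific terms help, it can be useful to picture how a hybrid search might merge its two result lists. This README does not say how mcp-web-docs combines full-text and semantic results, so the sketch below uses reciprocal rank fusion (RRF), one common merging technique, purely as an illustration. The document IDs and the constant `k = 60` are assumptions.

```javascript
// Reciprocal rank fusion (RRF): score(doc) = sum over lists of 1 / (k + rank).
// Illustrative only; the README does not state how mcp-web-docs merges results.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical result IDs from a keyword search and a vector search.
const ftsHits = ["button-props", "button-events", "form-api"];
const vectorHits = ["button-events", "theming", "button-props"];

console.log(reciprocalRankFusion([ftsHits, vectorHits]));
// prints [ 'button-events', 'button-props', 'theming', 'form-api' ]
```

Documents that rank well in both lists rise to the top, which is why a query that matches on exact terms and on meaning tends to return better results.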
Auto-Invoke with Rules
To avoid typing search instructions in every prompt, add a rule to your MCP client:
Cursor (Cursor Settings > Rules):
When I ask about library documentation or need code examples,
use the web-docs MCP server to search indexed documentation.
Windsurf (.windsurfrules):
Always use web-docs search_documentation when I ask about
API references, configuration, or library usage.
Scoping Searches
If you have multiple sites indexed, filter by URL or tags:
// Filter by specific site URL
search_documentation({
query: "routing",
url: "https://nextjs.org/docs/"
})
// Filter by tags (searches all docs with matching tags)
search_documentation({
query: "Button component",
tags: ["frontend", "mycompany"] // Only docs tagged with BOTH tags
})
Organizing with Tags
Tags help organize documentation when you have multiple related sites. Add tags when indexing:
// Index frontend package docs
add_documentation({
url: "https://docs.mycompany.com/ui-components/",
tags: ["frontend", "mycompany", "react"]
})
// Index backend API docs
add_documentation({
url: "https://docs.mycompany.com/api/",
tags: ["backend", "mycompany", "api"]
})
Later, search across all frontend docs:
search_documentation({
query: "authentication",
tags: ["frontend"] // Searches all frontend-tagged docs
})
You can also add tags to existing documentation with set_tags.
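The tag filter described above matches only docs that carry every requested tag. A minimal sketch of that AND behavior follows, assuming a hypothetical in-memory list of sites. The `sites` shape is for illustration only and is not the server's storage format.

```javascript
// Keep only sites that carry every requested tag (AND semantics).
// The `sites` shape is hypothetical, not the server's storage format.
function filterByTags(sites, requiredTags) {
  return sites.filter((site) =>
    requiredTags.every((tag) => site.tags.includes(tag))
  );
}

const sites = [
  { url: "https://docs.mycompany.com/ui-components/", tags: ["frontend", "mycompany", "react"] },
  { url: "https://docs.mycompany.com/api/", tags: ["backend", "mycompany", "api"] },
];

// Only the ui-components site has both tags.
console.log(filterByTags(sites, ["frontend", "mycompany"]).map((s) => s.url));

// A single shared tag matches both sites.
console.log(filterByTags(sites, ["mycompany"]).length); // prints 2
```

A shared tag such as "mycompany" searches broadly, while adding a second tag narrows the scope.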
🚨 Troubleshooting
The content extractor couldn't process the page. Try:
- Re-indexing the documentation
- Checking if the site uses JavaScript rendering (should work with Playwright)
- Looking at the crawled data in ~/.mcp-web-docs/crawlee/datasets/
If authentication fails:
- Make sure you call authenticate before add_documentation
- The browser window needs to stay open until login is detected
- For OAuth sites, complete the full flow manually
- Your default browser is auto-detected; to use a different one, specify it explicitly, for example browser: "firefox"
If search results are poor:
- Try more specific queries with unique terms
- Use quotes for exact phrase matching
- Filter by URL to search within a specific documentation site
- Re-index if the documentation has been updated
If browsers aren't installed, run:
npx playwright install
Data Storage
All data is stored locally in ~/.mcp-web-docs/:
~/.mcp-web-docs/
├── docs.db # SQLite database for document metadata
├── vectors/ # LanceDB vector database
├── sessions/ # Saved authentication sessions
└── crawlee/ # Crawlee datasets (cached crawl data)
📄 License
MIT License - see LICENSE for details.
🙏 Acknowledgments
- Model Context Protocol - The protocol specification
- Crawlee - Web scraping and browser automation
- LanceDB - Vector database
- FastEmbed - Local embedding generation
- Playwright - Browser automation
