# docs-mcp

npm: `@beejay141/docs-mcp` (v1.0.9)
A Model Context Protocol (MCP) server that crawls documentation websites and exposes structured tools enabling AI agents to search and read library documentation for accurate code generation.
## Features
- Register any docs site via URL — static or JS-rendered (Docusaurus, VitePress, Nextra, TypeDoc)
- Full-text search across all indexed docs with BM25 ranking (SQLite FTS5)
- Clean Markdown output stripped of nav/sidebar noise, preserving code blocks with language hints
- Background sync — on every startup, stale libraries re-crawl automatically without blocking tool calls
- 6 MCP tools visible in Claude Desktop, VS Code Copilot, Cursor, and any MCP client
## Installation

### Claude Desktop

Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):
```json
{
  "mcpServers": {
    "docs-mcp": {
      "command": "npx",
      "args": ["-y", "docs-mcp@latest"],
      "env": {
        "DOCS_MCP_DB": "/Users/you/.docs-mcp/docs.db"
      }
    }
  }
}
```

### VS Code (GitHub Copilot)

Create `.vscode/mcp.json` in your project (already included in this repo):
```json
{
  "servers": {
    "docs-mcp": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "docs-mcp@latest"],
      "env": {
        "DOCS_MCP_DB": "${env:HOME}/.docs-mcp/docs.db"
      }
    }
  }
}
```

Or add it to your VS Code user `settings.json` under the `"mcp"` key:
```json
{
  "mcp": {
    "servers": {
      "docs-mcp": {
        "type": "stdio",
        "command": "npx",
        "args": ["-y", "docs-mcp@latest"]
      }
    }
  }
}
```

### Cursor

Edit `~/.cursor/mcp.json` (global) or `<project>/.cursor/mcp.json` (project-scoped):
```json
{
  "mcpServers": {
    "docs-mcp": {
      "command": "npx",
      "args": ["-y", "docs-mcp@latest"],
      "env": {
        "DOCS_MCP_DB": "/Users/you/.docs-mcp/docs.db"
      }
    }
  }
}
```

### Zed

Merge into `~/.config/zed/settings.json`:
```json
{
  "context_servers": {
    "docs-mcp": {
      "command": {
        "path": "npx",
        "args": ["-y", "docs-mcp@latest"],
        "env": {
          "DOCS_MCP_DB": "/Users/you/.docs-mcp/docs.db"
        }
      }
    }
  }
}
```

### Windsurf

Edit `~/.codeium/windsurf/mcp_config.json`:
```json
{
  "mcpServers": {
    "docs-mcp": {
      "command": "npx",
      "args": ["-y", "docs-mcp@latest"],
      "env": {
        "DOCS_MCP_DB": "/Users/you/.docs-mcp/docs.db"
      }
    }
  }
}
```

After adding the config, restart the application and ask your AI assistant:

> "Add the React docs from https://react.dev/reference/react"
## CLI (optional management tool)

```shell
npx docs-mcp-cli add https://react.dev/reference/react --name "React"
npx docs-mcp-cli add --preset tailwind
npx docs-mcp-cli list
npx docs-mcp-cli search "useState hook"
```

## MCP Tools
| Tool | Description |
| ---------------- | ------------------------------------------------------------------- |
| add_library | Register a documentation source by URL. Triggers background crawl. |
| list_libraries | List all indexed libraries with page counts and sync status. |
| search_docs | Full-text search across docs. Returns ranked results with snippets. |
| get_page | Retrieve full Markdown content of a specific page. |
| list_sections | Browse pages/sections in a library (useful before searching). |
| sync_status | Check background sync progress for one or all libraries. |
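As an illustration, a typical agent flow is search_docs followed by get_page. The request below sketches what an MCP `tools/call` for search_docs might look like on the wire; the JSON-RPC envelope follows the MCP spec, but the `query` argument name is an assumption inferred from the CLI's search command:

```json
{
  "method": "tools/call",
  "params": {
    "name": "search_docs",
    "arguments": { "query": "useState hook" }
  }
}
```

In practice your MCP client (Claude Desktop, Cursor, etc.) constructs this call for you; the table above is what the agent sees when choosing a tool.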
### add_library

- `url` — Required. Base documentation URL. Use `https://` for public sites; use `http://` for internal/private hosts only (e.g. `http://192.168.1.10/docs`).
- `name` — Optional. Human-readable label (auto-derived from the domain).
- `id` — Optional. Library ID slug (auto-derived from the domain).
- `options`:
  - `dynamic` — Force the Playwright crawler for JS-rendered sites.
  - `contentSelector` — CSS selector for the main content area.
  - `excludePatterns` — URL patterns to skip (e.g. `["/blog", "/changelog"]`).
  - `maxPages` — Max pages to crawl (default: 500, max: 5000).
  - `crawlDelay` — Delay between requests in ms (default: 500).
  - `ttlHours` — Re-sync interval in hours (default: 24).

## CLI Reference
```
docs-mcp-cli add <baseUrl>       Register and crawl a doc site
  --name <label>                 Human-readable name
  --id <slug>                    Library ID slug
  --version <ver>                Library version label
  --dynamic                      Force Playwright for JS-rendered sites
  --selector <css>               CSS selector for main content
  --exclude <pattern...>         URL patterns to skip
  --max-pages <n>                Max pages (default: 500)
  --delay <ms>                   Crawl delay (default: 500ms)
  --ttl <hours>                  Re-sync interval (default: 24h)
  --preset <name>                Use a preset from libraries.json

docs-mcp-cli list                List all indexed libraries
docs-mcp-cli remove <id>         Delete a library and its pages
docs-mcp-cli refresh <id>        Force immediate re-crawl
docs-mcp-cli sync-status [id]    Show sync queue and job progress
docs-mcp-cli search <query>      Quick CLI search
  --library <id>                 Limit to a specific library
  --limit <n>                    Number of results (default: 5)
```

## Presets
Popular libraries are pre-configured in `libraries.json`:

```shell
docs-mcp-cli add --preset react
docs-mcp-cli add --preset vue
docs-mcp-cli add --preset tailwind
docs-mcp-cli add --preset nextjs
docs-mcp-cli add --preset typescript
docs-mcp-cli add --preset fastapi
docs-mcp-cli add --preset pydantic
docs-mcp-cli add --preset langchain
```

## How Background Sync Works
On every server startup:

- `SyncManager.startupSync()` queries all libraries where `last_scraped_at IS NULL` or older than their TTL (default: 24 hours)
- Never-synced libraries get `priority = "high"` and run first
- Stale libraries are queued as `priority = "normal"`
- Up to `maxConcurrency` (default: 2) libraries crawl simultaneously
- All MCP tools are available immediately; the sync never blocks tool calls

The sync queue is capped at 50 pending jobs. Rapid `add_library` calls beyond that return `status: "queue_full"`.
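When the cap is hit, the `add_library` result reports the saturated queue. Only the `status` field is documented here; a minimal illustrative result would be:

```json
{ "status": "queue_full" }
```

Wait for pending jobs to drain (check with `sync_status`) and retry the call.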
## Configuration
| Environment Variable | Default | Description |
| -------------------------- | --------------------- | --------------------------- |
| DOCS_MCP_DB | ~/.docs-mcp/docs.db | Path to the SQLite database |
| DOCS_MCP_MAX_CONCURRENCY | 2 | Max simultaneous crawls |
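Both variables are read at server startup. A sketch of overriding them for a project-local database (the path here is illustrative):

```shell
# Keep the index alongside the project instead of the default ~/.docs-mcp/docs.db
export DOCS_MCP_DB="$PWD/.docs-mcp/docs.db"

# Allow up to 4 simultaneous crawls instead of the default 2
export DOCS_MCP_MAX_CONCURRENCY=4

# Then launch the server from your MCP client config, e.g.:
#   npx -y docs-mcp@latest
```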
## Supported Site Types
| Framework | Crawler | Notes |
| --------------- | -------------------- | -------------------------------------------------------------------------- |
| Plain HTML | Static (Cheerio) | Default |
| Docusaurus | Static (Cheerio) | Usually works without --dynamic |
| VitePress | Auto-detected | Falls back to Playwright if < 10 static pages |
| Nextra | Auto-detected | Falls back to Playwright |
| TypeDoc | Static (Cheerio) | .col-content selector auto-detected; sidebar/toolbar noise stripped automatically |
| GitBook | Static | Use --selector ".page-inner" |
| ReadMe.io | Static | May need --selector "section.content" |
| Any JS-rendered | Dynamic (Playwright) | Use --dynamic flag |
| Internal (http) | Static (Cheerio) | Use http:// only for private/internal hosts |
## Troubleshooting

### Indexing a TypeDoc-generated site?

TypeDoc's `.col-content` is auto-detected, and sidebar/toolbar noise (`.col-sidebar`, `.tsd-navigation`, `.tsd-toolbar`) is stripped automatically. No extra flags are needed for most TypeDoc sites:

```shell
docs-mcp-cli add https://mylib.github.io/api --name "MyLib API"
# Or for an internal TypeDoc server:
docs-mcp-cli add http://192.168.1.20:8080 --name "Internal API"
```

### JS-rendered site not indexing?

Add `--dynamic` to force Playwright:

```shell
docs-mcp-cli add https://example.com/docs --dynamic
```

### Too many irrelevant pages being crawled?

Use `--exclude` to skip sections:

```shell
docs-mcp-cli add https://example.com/docs --exclude /blog --exclude /changelog
```

### Getting rate-limited?

Increase the crawl delay:

```shell
docs-mcp-cli add https://example.com/docs --delay 2000
```

### Large site hitting the page limit?

Raise `--max-pages`, or specify a content selector to focus the crawl on relevant pages:

```shell
docs-mcp-cli add https://example.com/docs --max-pages 2000 --selector "article"
```

### Check sync status

```shell
docs-mcp-cli sync-status
# or via MCP tool:
# sync_status {}
```

## Development
```shell
npm install
npm run dev          # Start MCP server
npm run dev:cli      # Run CLI
npm test             # Run all tests
npm run test:watch   # Watch mode
npm run build        # Build for production
```

### Playwright (for dynamic crawling)

If you plan to crawl JS-rendered sites (with `--dynamic`), Playwright and its browsers are required. After installing dependencies, install the Playwright browsers:

```shell
# Install project deps (if not done already)
npm install

# Install the Playwright browsers needed for dynamic crawling
npx playwright install

# (Linux only) install required system dependencies
npx playwright install-deps
```

## Security
- `https://` is accepted for all hosts, public and internal.
- `http://` is only permitted for private/internal hosts (RFC 1918 ranges `10.x`, `172.16–31.x`, `192.168.x`, plus `127.x`, `localhost`, and ULA IPv6). This prevents unencrypted traffic from being proxied to arbitrary public servers.
- Hostnames are resolved via DNS and checked against private IP ranges before crawling begins.
- All DB queries use parameterized statements (no SQL injection).
- FTS5 query input is sanitized before use.
URL policy summary:
| URL | Allowed? |
| ----------------------------------- | ------------------------- |
| https://react.dev/docs | ✅ |
| https://internal.company.com/docs | ✅ |
| http://192.168.1.10/docs | ✅ (private host) |
| http://localhost:8080/docs | ✅ (loopback) |
| http://react.dev/docs | ❌ (public host, no TLS) |
| ftp://example.com | ❌ (unsupported protocol) |
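The private-host rule behind this table can be sketched as a shell predicate. This is a simplified illustration over literal IPs and hostnames only; the actual server resolves DNS first and also covers ULA IPv6:

```shell
# Returns 0 (allowed for http://) for private/loopback hosts, 1 otherwise.
is_private_host() {
  case "$1" in
    10.*|192.168.*|127.*|localhost) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;   # 172.16.0.0/12
    *) return 1 ;;
  esac
}

is_private_host "192.168.1.10" && echo "http:// allowed"
is_private_host "react.dev"    || echo "http:// rejected"
```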
## License
ISC
