# docs-mcp

npm: `@beejay141/docs-mcp` (v1.0.9)
A Model Context Protocol (MCP) server that crawls documentation websites and exposes structured tools enabling AI agents to search and read library documentation for accurate code generation.
## Features
- Register any docs site via URL — static or JS-rendered (Docusaurus, VitePress, Nextra, TypeDoc)
- Full-text search across all indexed docs with BM25 ranking (SQLite FTS5)
- Clean Markdown output stripped of nav/sidebar noise, preserving code blocks with language hints
- Background sync — on every startup, stale libraries re-crawl automatically without blocking tool calls
- 6 MCP tools visible in Claude Desktop, VS Code Copilot, Cursor, and any MCP client
## Installation

### Claude Desktop

Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):
```json
{
  "mcpServers": {
    "docs-mcp": {
      "command": "npx",
      "args": ["-y", "docs-mcp@latest"],
      "env": {
        "DOCS_MCP_DB": "/Users/you/.docs-mcp/docs.db"
      }
    }
  }
}
```

### VS Code (GitHub Copilot)

Create `.vscode/mcp.json` in your project (already included in this repo):
```json
{
  "servers": {
    "docs-mcp": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "docs-mcp@latest"],
      "env": {
        "DOCS_MCP_DB": "${env:HOME}/.docs-mcp/docs.db"
      }
    }
  }
}
```

Or add it to your VS Code user `settings.json` under the `"mcp"` key:
```json
{
  "mcp": {
    "servers": {
      "docs-mcp": {
        "type": "stdio",
        "command": "npx",
        "args": ["-y", "docs-mcp@latest"]
      }
    }
  }
}
```

### Cursor

Edit `~/.cursor/mcp.json` (global) or `<project>/.cursor/mcp.json` (project-scoped):
```json
{
  "mcpServers": {
    "docs-mcp": {
      "command": "npx",
      "args": ["-y", "docs-mcp@latest"],
      "env": {
        "DOCS_MCP_DB": "/Users/you/.docs-mcp/docs.db"
      }
    }
  }
}
```

### Zed

Merge into `~/.config/zed/settings.json`:
```json
{
  "context_servers": {
    "docs-mcp": {
      "command": {
        "path": "npx",
        "args": ["-y", "docs-mcp@latest"],
        "env": {
          "DOCS_MCP_DB": "/Users/you/.docs-mcp/docs.db"
        }
      }
    }
  }
}
```

### Windsurf

Edit `~/.codeium/windsurf/mcp_config.json`:
```json
{
  "mcpServers": {
    "docs-mcp": {
      "command": "npx",
      "args": ["-y", "docs-mcp@latest"],
      "env": {
        "DOCS_MCP_DB": "/Users/you/.docs-mcp/docs.db"
      }
    }
  }
}
```

After adding the config, restart the application and ask your AI assistant:

> "Add the React docs from https://react.dev/reference/react"
## CLI (optional management tool)

```shell
npx docs-mcp-cli add https://react.dev/reference/react --name "React"
npx docs-mcp-cli add --preset tailwind
npx docs-mcp-cli list
npx docs-mcp-cli search "useState hook"
```

## MCP Tools
| Tool | Description |
| ---------------- | ------------------------------------------------------------------- |
| add_library | Register a documentation source by URL. Triggers background crawl. |
| list_libraries | List all indexed libraries with page counts and sync status. |
| search_docs | Full-text search across docs. Returns ranked results with snippets. |
| get_page | Retrieve full Markdown content of a specific page. |
| list_sections | Browse pages/sections in a library (useful before searching). |
| sync_status | Check background sync progress for one or all libraries. |
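As an illustration, a typical agent flow is search_docs followed by get_page. The request below sketches what an MCP `tools/call` for search_docs might look like on the wire; the JSON-RPC envelope follows the MCP spec, but the `query` argument name is an assumption inferred from the CLI's search command:

```json
{
  "method": "tools/call",
  "params": {
    "name": "search_docs",
    "arguments": { "query": "useState hook" }
  }
}
```

In practice your MCP client (Claude Desktop, Cursor, etc.) constructs this call for you; the table above is what the agent sees when choosing a tool.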
### add_library

- `url` — Required. Base documentation URL. Use `https://` for public sites; use `http://` for internal/private hosts only (e.g. `http://192.168.1.10/docs`).
- `name` — Optional. Human-readable label (auto-derived from the domain).
- `id` — Optional. Library ID slug (auto-derived from the domain).
- `options`:
  - `dynamic` — Force the Playwright crawler for JS-rendered sites.
  - `contentSelector` — CSS selector for the main content area.
  - `excludePatterns` — URL patterns to skip (e.g. `["/blog", "/changelog"]`).
  - `maxPages` — Max pages to crawl (default: 500, max: 5000).
  - `crawlDelay` — Delay between requests in ms (default: 500).
  - `ttlHours` — Re-sync interval in hours (default: 24).

## CLI Reference
```
docs-mcp-cli add <baseUrl>       Register and crawl a doc site
  --name <label>                 Human-readable name
  --id <slug>                    Library ID slug
  --version <ver>                Library version label
  --dynamic                      Force Playwright for JS-rendered sites
  --selector <css>               CSS selector for main content
  --exclude <pattern...>         URL patterns to skip
  --max-pages <n>                Max pages (default: 500)
  --delay <ms>                   Crawl delay (default: 500ms)
  --ttl <hours>                  Re-sync interval (default: 24h)
  --preset <name>                Use a preset from libraries.json

docs-mcp-cli list                List all indexed libraries
docs-mcp-cli remove <id>         Delete a library and its pages
docs-mcp-cli refresh <id>        Force immediate re-crawl
docs-mcp-cli sync-status [id]    Show sync queue and job progress
docs-mcp-cli search <query>      Quick CLI search
  --library <id>                 Limit to a specific library
  --limit <n>                    Number of results (default: 5)
```

## Presets
Popular libraries are pre-configured in `libraries.json`:

```shell
docs-mcp-cli add --preset react
docs-mcp-cli add --preset vue
docs-mcp-cli add --preset tailwind
docs-mcp-cli add --preset nextjs
docs-mcp-cli add --preset typescript
docs-mcp-cli add --preset fastapi
docs-mcp-cli add --preset pydantic
docs-mcp-cli add --preset langchain
```

## How Background Sync Works
On every server startup:

- `SyncManager.startupSync()` queries all libraries where `last_scraped_at IS NULL` or older than their TTL (default: 24 hours)
- Never-synced libraries get `priority = "high"` and run first
- Stale libraries are queued as `priority = "normal"`
- Up to `maxConcurrency` (default: 2) libraries crawl simultaneously
- All MCP tools are available immediately; the sync never blocks tool calls

The sync queue is capped at 50 pending jobs. Rapid `add_library` calls beyond that return `status: "queue_full"`.
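When the cap is hit, the `add_library` result reports the saturated queue. Only the `status` field is documented here; a minimal illustrative result would be:

```json
{ "status": "queue_full" }
```

Wait for pending jobs to drain (check with `sync_status`) and retry the call.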
## Configuration
| Environment Variable | Default | Description |
| -------------------------- | --------------------- | --------------------------- |
| DOCS_MCP_DB | ~/.docs-mcp/docs.db | Path to the SQLite database |
| DOCS_MCP_MAX_CONCURRENCY | 2 | Max simultaneous crawls |
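Both variables are read at server startup. A sketch of overriding them for a project-local database (the path here is illustrative):

```shell
# Keep the index alongside the project instead of the default ~/.docs-mcp/docs.db
export DOCS_MCP_DB="$PWD/.docs-mcp/docs.db"

# Allow up to 4 simultaneous crawls instead of the default 2
export DOCS_MCP_MAX_CONCURRENCY=4

# Then launch the server from your MCP client config, e.g.:
#   npx -y docs-mcp@latest
```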
## Supported Site Types
| Framework | Crawler | Notes |
| --------------- | -------------------- | -------------------------------------------------------------------------- |
| Plain HTML | Static (Cheerio) | Default |
| Docusaurus | Static (Cheerio) | Usually works without --dynamic |
| VitePress | Auto-detected | Falls back to Playwright if < 10 static pages |
| Nextra | Auto-detected | Falls back to Playwright |
| TypeDoc | Static (Cheerio) | .col-content selector auto-detected; sidebar/toolbar noise stripped automatically |
| GitBook | Static | Use --selector ".page-inner" |
| ReadMe.io | Static | May need --selector "section.content" |
| Any JS-rendered | Dynamic (Playwright) | Use --dynamic flag |
| Internal (http) | Static (Cheerio) | Use http:// only for private/internal hosts |
## Troubleshooting

### Indexing a TypeDoc-generated site?

TypeDoc's `.col-content` is auto-detected, and sidebar/toolbar noise (`.col-sidebar`, `.tsd-navigation`, `.tsd-toolbar`) is stripped automatically. No extra flags are needed for most TypeDoc sites:

```shell
docs-mcp-cli add https://mylib.github.io/api --name "MyLib API"
# Or for an internal TypeDoc server:
docs-mcp-cli add http://192.168.1.20:8080 --name "Internal API"
```

### JS-rendered site not indexing?

Add `--dynamic` to force Playwright:

```shell
docs-mcp-cli add https://example.com/docs --dynamic
```

### Too many irrelevant pages being crawled?

Use `--exclude` to skip sections:

```shell
docs-mcp-cli add https://example.com/docs --exclude /blog --exclude /changelog
```

### Getting rate-limited?

Increase the crawl delay:

```shell
docs-mcp-cli add https://example.com/docs --delay 2000
```

### Large site hitting the page limit?

Raise `--max-pages`, or specify a content selector to focus the crawl on relevant pages:

```shell
docs-mcp-cli add https://example.com/docs --max-pages 2000 --selector "article"
```

### Check sync status

```shell
docs-mcp-cli sync-status
# or via MCP tool:
# sync_status {}
```

## Development
```shell
npm install
npm run dev          # Start MCP server
npm run dev:cli      # Run CLI
npm test             # Run all tests
npm run test:watch   # Watch mode
npm run build        # Build for production
```

### Playwright (for dynamic crawling)

If you plan to crawl JS-rendered sites (with `--dynamic`), Playwright and its browsers are required. After installing dependencies, install the Playwright browsers:

```shell
# Install project deps (if not done already)
npm install

# Install the Playwright browsers needed for dynamic crawling
npx playwright install

# (Linux only) install required system dependencies
npx playwright install-deps
```

## Security
- `https://` is accepted for all hosts, public and internal.
- `http://` is only permitted for private/internal hosts (RFC 1918 ranges `10.x`, `172.16–31.x`, `192.168.x`, plus `127.x`, `localhost`, and ULA IPv6). This prevents unencrypted traffic from being proxied to arbitrary public servers.
- Hostnames are resolved via DNS and checked against private IP ranges before crawling begins.
- All DB queries use parameterized statements (no SQL injection).
- FTS5 query input is sanitized before use.
URL policy summary:
| URL | Allowed? |
| ----------------------------------- | ------------------------- |
| https://react.dev/docs | ✅ |
| https://internal.company.com/docs | ✅ |
| http://192.168.1.10/docs | ✅ (private host) |
| http://localhost:8080/docs | ✅ (loopback) |
| http://react.dev/docs | ❌ (public host, no TLS) |
| ftp://example.com | ❌ (unsupported protocol) |
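The private-host rule behind this table can be sketched as a shell predicate. This is a simplified illustration over literal IPs and hostnames only; the actual server resolves DNS first and also covers ULA IPv6:

```shell
# Returns 0 (allowed for http://) for private/loopback hosts, 1 otherwise.
is_private_host() {
  case "$1" in
    10.*|192.168.*|127.*|localhost) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;   # 172.16.0.0/12
    *) return 1 ;;
  esac
}

is_private_host "192.168.1.10" && echo "http:// allowed"
is_private_host "react.dev"    || echo "http:// rejected"
```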
## License
ISC
