docshark
v0.1.20
Documentation MCP Server: scrape, index, and search any doc website
DocShark
DocShark is an MCP (Model Context Protocol) server that scrapes, indexes, and searches documentation websites. It builds a local, searchable knowledge base from public documentation pages using SQLite FTS5 (Full-Text Search) with BM25 ranking, so AI assistants can query the latest docs.
Features
- Automated Crawling: Discovers pages via sitemap.xml, with fallback to BFS link crawling.
- Smart Extraction: Uses Readability and Turndown to extract main content and convert it to clean Markdown, filtering out navbars and sidebars.
- Semantic Chunking: Splits content based on headings, preserving contextual headers for better AI understanding.
- High-Performance Search: Built-in SQLite + FTS5 indexing with BM25 ranking for accurate, fast search results.
- JS-Rendered Site Support: A tiered fetching strategy automatically detects React/Vue SPAs (empty shells) and upgrades to puppeteer-core if you have it installed (zero-config, auto-fallback).
- Polite Crawling: Respects robots.txt and implements rate limiting to avoid overloading documentation servers.
- Standard MCP Tooling: Works with Claude Desktop, VS Code, Cursor, and any other MCP-compatible client via the standard stdio or http/sse transports.
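The BFS fallback mentioned under Automated Crawling can be pictured as a depth-limited breadth-first traversal. The following is a minimal sketch, not DocShark's actual implementation: `bfsDiscover` and the `linksOf` callback are hypothetical names, and link extraction is stubbed as a synchronous function so the example stays self-contained (a real crawler would fetch pages asynchronously).

```typescript
// Hypothetical helper: breadth-first page discovery up to a maximum depth.
// `linksOf` stands in for "fetch a page and extract its links" (an assumption
// for illustration, not part of DocShark's API).
type LinkSource = (url: string) => string[];

function bfsDiscover(startUrl: string, maxDepth: number, linksOf: LinkSource): string[] {
  const origin = new URL(startUrl).origin;
  const seen = new Set<string>([startUrl]);
  let frontier: string[] = [startUrl];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const href of linksOf(url)) {
        // Resolve relative links and drop fragments so "#section" variants dedupe.
        const resolved = new URL(href, url);
        resolved.hash = "";
        const key = resolved.toString();
        // Stay on the starting origin and skip pages that are already queued.
        if (resolved.origin !== origin || seen.has(key)) continue;
        seen.add(key);
        next.push(key);
      }
    }
    frontier = next;
  }
  return Array.from(seen);
}
```

Processing one full level before moving to the next is what makes a depth limit, like the `--depth` option shown in the Quick Start, straightforward to enforce.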
What We Have Done (Phase 1)
Phase 1: Core Engine is fully implemented and tested.
- Custom SQLite Database with FTS5 virtual tables and auto-sync triggers.
- Web scraping engine supporting standard fetch() and puppeteer-core.
- Markdown processor utilizing Readability + Turndown.
- Heading-based semantic chunker (500-1200 tokens per chunk).
- Asynchronous job manager and queue system.
- Complete HTTP API (REST endpoints + SSE event streams).
- Seamless integration of 4 MCP tools: manage_library, search_docs, list_libraries, and get_doc_page.
- Robust CLI interface (start, add, rename, search, list).
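To illustrate what a heading-based chunker does, here is a minimal sketch under stated assumptions. `chunkByHeadings` and the `Chunk` shape are hypothetical names, not DocShark's internals, and the sketch does not implement the 500-1200 token sizing mentioned above. It only shows how splitting at headings can keep the chain of parent headings attached to each chunk.

```typescript
// Hypothetical chunk shape: the heading trail gives each chunk its context.
interface Chunk {
  headingPath: string[];
  content: string;
}

const FENCE = "`".repeat(3); // Markdown code-fence marker

function chunkByHeadings(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const stack: { level: number; title: string }[] = [];
  let buffer: string[] = [];
  let inFence = false;

  // Emit the buffered text as a chunk under the current heading trail.
  const flush = (): void => {
    const content = buffer.join("\n").trim();
    if (content) chunks.push({ headingPath: stack.map((h) => h.title), content });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    // Track fenced code so "# comment" lines inside code are not read as headings.
    if (line.trim().startsWith(FENCE)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (!match) {
      buffer.push(line);
      continue;
    }
    flush();
    const level = match[1].length;
    // Drop headings at the same or a deeper level before pushing the new one.
    while (stack.length > 0 && stack[stack.length - 1].level >= level) stack.pop();
    stack.push({ level, title: match[2].trim() });
  }
  flush();
  return chunks;
}
```

A production chunker would additionally merge small sections and split oversized ones to stay within a token budget.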
What We Are Doing
We are actively polishing the integration between the core engine and external MCP clients (like VS Code Agents and Claude Desktop).
What We Plan To Do (Phase 2 & Beyond)
- Web Dashboard: An intuitive SvelteKit dashboard to manage your synced libraries, view crawl progress in real-time (via SSE), and test searches manually.
- Incremental Crawling: Smarter refresh jobs that compare ETag and Last-Modified headers to only re-scrape updated pages.
- Vector Search (RAG): Integration of lightweight vector embeddings for semantic similarity search alongside the existing FTS5 keyword search.
- Advanced Scraping Setup: Support for custom CSS selectors to define exactly where content lives in non-standard documentation websites.
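Since incremental crawling is still planned, the following is only a sketch of how ETag and Last-Modified comparisons are commonly done with standard HTTP conditional requests. `CachedPage`, `conditionalHeaders`, and `isUnchanged` are hypothetical names chosen for illustration. They do not describe DocShark's eventual design.

```typescript
// Hypothetical record of the validators saved from the previous crawl of a page.
interface CachedPage {
  etag?: string;
  lastModified?: string;
}

// Build the standard conditional request headers from the stored validators.
function conditionalHeaders(cached: CachedPage): Record<string, string> {
  const headers: Record<string, string> = {};
  if (cached.etag) headers["If-None-Match"] = cached.etag;
  if (cached.lastModified) headers["If-Modified-Since"] = cached.lastModified;
  return headers;
}

// A 304 Not Modified response means the stored copy is still current.
function isUnchanged(status: number): boolean {
  return status === 304;
}
```

A refresh job could send these headers with each request and skip extraction, chunking, and re-indexing whenever the server answers 304.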
Usage
Quick Start (from npm)
You can run DocShark directly without installing it globally using bunx:
# Add a documentation library to the index
bunx docshark add https://valibot.dev/guides/ --depth 2
# Search your indexed docs
bunx docshark search "schema validation"
Installation
DocShark is intended to be installed and run with Bun. To install it globally as a CLI tool:
# Global Bun installation
bun add -g docshark
After installation, you can use the docshark command:
docshark list
# Update the global Bun installation when a new release is published
docshark update
# Script-friendly update check
docshark update --check --quiet
Interactive CLI runs will also let you know when a newer version is available. Update notices are intentionally skipped for MCP stdio mode so they never interfere with protocol output.
For scripts, docshark update --check exits 0 when current, 10 when a newer version is available, and 1 when the version check could not be completed.
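A script can branch on those exit codes. The sketch below only encodes the three documented values. The `UpdateStatus` type and `interpretUpdateCheck` function are hypothetical names, and treating any other exit code as a failed check is an assumption.

```typescript
type UpdateStatus = "current" | "update-available" | "check-failed";

// Map the exit code of `docshark update --check` to a status.
// Documented codes: 0 (current), 10 (newer version available), 1 (check failed).
function interpretUpdateCheck(exitCode: number): UpdateStatus {
  switch (exitCode) {
    case 0:
      return "current";
    case 10:
      return "update-available";
    default:
      // Assumption: anything other than 0 or 10 is treated like the documented 1.
      return "check-failed";
  }
}
```

The exit code itself could come from any process API, for example the `status` field returned by `spawnSync` in Node's `child_process` module.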
MCP Integration
VS Code (GitHub Copilot / MCP Extension)
Add DocShark to your .vscode/settings.json or global MCP configuration:
{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}
Cursor
- Open Cursor Settings > Models > MCP.
- Click + Add New MCP Server.
- Name: docshark
- Type: command
- Command: bunx -y docshark start --stdio
Claude Desktop
Edit your Claude Desktop configuration file:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}
Development
Local Setup
Ensure you have Bun installed.
# Clone the repository
git clone https://github.com/Michael-Obele/docshark.git
cd docshark
# Install dependencies
bun install
# (Optional) Enable auto-detection and scraping of JavaScript-rendered React/Vue single-page apps
bun add puppeteer-core
# Start the DocShark MCP server in HTTP mode for local testing
bun run src/cli.ts start --port 6380
Local CLI Debugging
# Run CLI directly while developing
bun run src/cli.ts list
Versioning & Changelog
This project uses Google's Release Please to automate versioning and changelog generation.
- Semantic Versioning: Our versions automatically bump (e.g. 0.0.1 -> 0.0.2 or 0.1.0) based on standard Conventional Commits (feat:, fix:, chore:, etc.).
- Automated: A PR is automatically created on master when standard commits are merged, generating a standard CHANGELOG.md.
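To make the bump rule concrete, here is a small sketch of the usual Conventional Commits to SemVer mapping (fix gives a patch bump, feat a minor bump, a `!` marker a major bump). `bumpFor` and `nextVersion` are hypothetical helpers, not part of Release Please, and Release Please can be configured to bump differently, particularly for pre-1.0 versions.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Derive the bump from a commit subject such as "feat(cli): add rename command".
function bumpFor(subject: string): Bump {
  const match = /^(\w+)(\([^)]*\))?(!)?:\s/.exec(subject);
  if (!match) return "none";
  if (match[3] === "!") return "major"; // "feat!:" marks a breaking change
  if (match[1] === "feat") return "minor";
  if (match[1] === "fix") return "patch";
  return "none"; // chore:, docs:, etc. do not trigger a release on their own
}

// Apply a bump to a plain "MAJOR.MINOR.PATCH" version string.
function nextVersion(version: string, bump: Bump): string {
  const [major, minor, patch] = version.split(".").map(Number);
  switch (bump) {
    case "major":
      return `${major + 1}.0.0`;
    case "minor":
      return `${major}.${minor + 1}.0`;
    case "patch":
      return `${major}.${minor}.${patch + 1}`;
    default:
      return version;
  }
}
```

With this mapping, a fix: commit on 0.0.1 yields 0.0.2 and a feat: commit yields 0.1.0, matching the example above.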
License
This project is open-source and available under the MIT License.
Built to empower AI agents with the latest knowledge.
