docshark

v0.1.24

Published

a day ago

🦈 Documentation MCP Server — scrape, index, and search any doc website

0High
0Medium
0Low

svelte-apps

tmcp mcp documentation search ai scraper

🦈 DocShark

DocShark is a powerful MCP (Model Context Protocol) server designed to scrape, index, and search any documentation website. It creates a local, highly-searchable knowledge base from public documentation pages using FTS5 (Full-Text Search) and BM25 ranking, allowing AI assistants to query the latest docs effortlessly.

🚀 Features

Automated Crawling: Discovers pages via sitemap.xml with fallback to BFS link crawling.
Smart Extraction: Uses Readability and Turndown to extract main content and convert it to clean Markdown, filtering out navbars and sidebars.
Semantic Chunking: Splits content based on headings, preserving contextual headers for better AI understanding.
High-Performance Search: Built-in SQLite + FTS5 indexing with BM25 ranking for accurate and lightning-fast search results.
JS-Rendered Site Support: Tiered fetching strategy automatically detects React/Vue SPAs (empty shells) and upgrades to puppeteer-core if you have it installed (zero-config, auto-fallback).
Polite Crawling: Respects robots.txt and implements rate limiting to prevent overloading documentation servers.
Standard MCP Tooling: Connect perfectly with Desktop Claude, VS Code, Cursor, and any other MCP-compatible clients via standard stdio or http/sse transports.

📦 What We Have Done (Phase 1)

Phase 1: Core Engine is fully implemented and tested.

✅ Custom SQLite Database with FTS5 virtual tables and auto-sync triggers.
✅ Web scraping engine supporting standard fetch() and puppeteer-core.
✅ Markdown processor utilizing Readability + Turndown.
✅ Heading-based semantic chunker (500-1200 tokens per chunk).
✅ Asynchronous job manager and queue system.
✅ Complete HTTP API (REST endpoints + SSE event streams).
✅ Seamless integration of 4 MCP tools: manage_library, search_docs, list_libraries, and get_doc_page.
✅ Robust CLI interface (start, add, rename, search, list).

🏗️ What We Are Doing

We are actively polishing the integration between the core engine and external MCP clients (like VS Code Agents and Claude Desktop).

🔮 What We Plan To Do (Phase 2 & Beyond)

Web Dashboard: An intuitive SvelteKit dashboard to manage your synced libraries, view crawl progress in real-time (via SSE), and test searches manually.
Incremental Crawling: Smarter refresh jobs that compare ETag and Last-Modified headers to only re-scrape updated pages.
Vector Search (RAG): Integration of lightweight vector embeddings for semantic similarity search alongside the existing FTS5 keyword search.
Advanced Scraping Setup: Support for custom CSS selectors to define exactly where content lives in non-standard documentation websites.

🛠️ Usage

Quick Start (from npm)

You can run DocShark directly without installing it globally using bunx:

# Add a documentation library to the index
bunx docshark add https://valibot.dev/guides/ --depth 2

# Search your indexed docs
bunx docshark search "schema validation"

Installation

To install DocShark globally as a CLI tool:

DocShark is intended to be installed and run with Bun.

# Global Bun installation
bun add -g docshark

After installation, you can use the docshark command:

docshark list

# Update the global Bun installation when a new release is published
docshark update

# Script-friendly update check
docshark update --check --quiet

Interactive CLI runs will also let you know when a newer version is available. Update notices are intentionally skipped for MCP stdio mode so they never interfere with protocol output.

For scripts, docshark update --check exits 0 when current, 10 when a newer version is available, and 1 when the version check could not be completed.

🔌 MCP Integration

VS Code (GitHub Copilot / MCP Extension)

Add DocShark to your .vscode/settings.json or global MCP configuration:

{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}

Cursor

Open Cursor Settings > Models > MCP.
Click + Add New MCP Server.
Name: docshark
Type: command
Command: bunx -y docshark start --stdio

Claude Desktop

Edit your Claude Desktop configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}

🛠️ Development

Local Setup

Ensure you have Bun installed.

# Clone the repository
git clone https://github.com/Michael-Obele/docshark.git
cd docshark

# Install dependencies
bun install

# (Optional) Enable auto-detection & scraping of Javascript React/Vue single-page apps
bun add puppeteer-core

# Start the DocShark MCP server in HTTP mode for local testing
bun run src/cli.ts start --port 6380

Local CLI Debugging

# Run CLI directly while developing
bun run src/cli.ts list

Tests

Run the regression suite before merging or publishing changes:

bun test scripts/*.test.ts

The suite covers storage and migrations, library management, extraction, chunking, search, crawl helpers, API routes, and MCP tool wrappers.

🔄 Versioning & Changelog

This project uses Google's Release Please to automate versioning and changelog generation.

Semantic Versioning: Our versions automatically bump (e.g. 0.0.1 -> 0.0.2 or 0.1.0) based on standard Conventional Commits (feat:, fix:, chore:, etc.).
Automated: A PR is automatically created on master when standard commits are merged, generating a standard CHANGELOG.md.

📜 License

This project is open-source and available under the MIT License.

Built to empower AI agents with the latest knowledge.