mongodocs-mcp

v11.0.0

Published

6 months ago

Transform any GitHub repository into searchable vector embeddings. MCP server with smart indexing, voyage-context-3 embeddings, and semantic search for Claude/Cursor IDEs.

romiluz13

mcp model-context-protocol semantic-search vector-search voyage-ai embeddings github documentation code-search mongodb-atlas cursor-ide claude-desktop rag ai

mongodocs-mcp

A Model Context Protocol (MCP) server that transforms any GitHub repository into searchable vector embeddings, enabling semantic search across codebases and documentation through IDE integration.

Architecture

The system implements a multi-stage indexing pipeline (clone, chunk, embed, store) with smart change detection:

Repository → Git Clone → Smart Chunking → Vector Embeddings → MongoDB Atlas
                ↓              ↓                ↓                    ↓
           Hash Tracking   Semantic Split   voyage-context-3    Vector Search

Core Components

Indexer (src/core/indexer.ts): Git-based change detection using commit hashes
Semantic Chunker (src/core/semantic-chunker.ts): Multi-strategy content splitting
Embedding Service (src/core/embeddings.ts): Voyage AI integration with batching
Storage Service (src/core/storage.ts): MongoDB Atlas vector operations
Search Service (src/core/search.ts): Vector, hybrid RRF, and MMR algorithms
MCP Server (src/index.ts): Protocol implementation for IDE integration

Installation

Global Package

npm install -g mongodocs-mcp

From Source

git clone https://github.com/yourusername/mongodocs-mcp.git
cd mongodocs-mcp
npm install
npm run build
npm link

Setup

1. MongoDB Atlas

Create a free M0 cluster at cloud.mongodb.com:

# Database structure
Database: mongodb_semantic_docs
Collection: documents

# Connection string format
mongodb+srv://<username>:<password>@<cluster-host>/?retryWrites=true&w=majority

Network Access Configuration:

Navigate to Network Access → Add IP Address
Add 0.0.0.0/0 for development (restrict in production)

Vector Search Index Creation:

Go to Atlas Search → Create Index
Select "JSON Editor"
Paste configuration:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "type": "knnVector",
        "dimensions": 1024,
        "similarity": "cosine"
      }
    }
  }
}

Name: vector_index

2. Voyage AI

Get an API key from voyageai.com:

Model: voyage-context-3
Dimensions: 1024
Context window: 32,000 tokens
Rate limit: 2000 RPM

3. Environment Configuration

Create a .env file:

# Required
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster-host>/?retryWrites=true&w=majority
VOYAGE_API_KEY=pa-your-api-key

# Optional
GITHUB_TOKEN=ghp_your_token  # For private repos
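
A missing variable is easier to diagnose at startup than halfway through an indexing run. The following is a minimal sketch of such a check, not the package's own loader: `AppEnv` and `loadEnv` are hypothetical names, and the environment is passed in as a plain object (for example `process.env`) so the function stays easy to test.

```typescript
// Hypothetical helper: AppEnv and loadEnv are illustrative names,
// not part of the mongodocs-mcp API.
interface AppEnv {
  mongodbUri: string;
  voyageApiKey: string;
  githubToken?: string; // only needed for private repositories
}

function loadEnv(env: Record<string, string | undefined>): AppEnv {
  // Collect every missing required variable so the error lists them all at once.
  const required = ["MONGODB_URI", "VOYAGE_API_KEY"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  const result: AppEnv = {
    mongodbUri: env["MONGODB_URI"] as string,
    voyageApiKey: env["VOYAGE_API_KEY"] as string,
  };
  const token = env["GITHUB_TOKEN"];
  if (token) {
    result.githubToken = token;
  }
  return result;
}
```

In a Node.js entry point this could be called as `loadEnv(process.env)` after the .env file has been loaded.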

Usage

Web Interface

# Start web UI
npm run web

# Opens http://localhost:3000
# 4-step wizard:
# 1. Configure APIs
# 2. Select repositories
# 3. Review MCP setup
# 4. Start processing

Command Line

# Index repositories (smart mode - only changed files)
npm run index

# Force complete rebuild
npm run rebuild

# Monitor indexing progress
npm run progress

# Database statistics
npm run stats

# Clean database
npm run clean

Programmatic API

import { Indexer } from 'mongodocs-mcp';

const config = {
  repositories: [{
    name: 'My Documentation',
    repo: 'owner/repository',
    branch: 'main',
    product: 'custom-my-docs'
  }],
  embedding: {
    model: 'voyage-context-3',
    dimensions: 1024,
    chunkSize: 1000,
    chunkOverlap: 200
  }
};

const indexer = new Indexer(config);
indexer.onProgress((progress) => {
  console.log(`${progress.phase}: ${progress.current}/${progress.total}`);
});
await indexer.index();

MCP Integration

Claude Desktop

File (macOS): ~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "mongodocs": {
      "command": "npx",
      "args": ["mongodocs-mcp"],
      "env": {
        "MONGODB_URI": "your-connection-string",
        "VOYAGE_API_KEY": "your-api-key"
      }
    }
  }
}

Cursor IDE

File: .cursor/mcp.json

{
  "mcpServers": {
    "mongodocs": {
      "command": "npx",
      "args": ["mongodocs-mcp"],
      "env": {
        "MONGODB_URI": "your-connection-string",
        "VOYAGE_API_KEY": "your-api-key"
      }
    }
  }
}

Restart the IDE after changing the configuration.

Search Methods

1. Hybrid RRF Search (Primary)

Reciprocal Rank Fusion combining vector and keyword search:

// Weight configuration
vectorWeight: 0.7
keywordWeight: 0.3

// Ranking formula
score = 1 / (k + rank) where k = 60
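
The README gives the weights and the per-rank formula but not how they combine. One plausible reading, sketched here under that assumption, is that each list contributes `weight / (k + rank)` for every document it contains, with 1-based ranks. `fuseRrf` is a hypothetical name and this is not necessarily how src/core/search.ts does it.

```typescript
// Weighted Reciprocal Rank Fusion over two ranked lists of document ids.
// Assumptions: ranks are 1-based, and each list's weight scales its 1 / (k + rank) term.
function fuseRrf(
  vectorIds: string[],
  keywordIds: string[],
  vectorWeight = 0.7,
  keywordWeight = 0.3,
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();

  const accumulate = (ids: string[], weight: number): void => {
    ids.forEach((id, index) => {
      const contribution = weight / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  };

  accumulate(vectorIds, vectorWeight);
  accumulate(keywordIds, keywordWeight);

  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

A document that appears in both lists collects two contributions, which is why it can outrank a document that sits at the top of only one list.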

2. MMR Search (Diversity)

Maximum Marginal Relevance for result diversity:

// Parameters
fetchK: 20        // Initial candidates
lambdaMult: 0.7   // Relevance vs diversity
limit: 5          // Final results

// Algorithm
MMR = argmax over unselected Di of [ λ * Sim(Di, Q) - (1 - λ) * max over selected Dj of Sim(Di, Dj) ]
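
To make the formula concrete, here is a small greedy selection sketch. It assumes candidates and query are plain number arrays compared by cosine similarity; `cosineSimilarity` and `mmrSelect` are hypothetical names, and the defaults mirror `lambdaMult` and `limit` above.

```typescript
type Vec = number[];

function cosineSimilarity(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Greedy MMR: repeatedly pick the candidate with the best trade-off between
// relevance to the query and similarity to what was already selected.
// Returns the indices of the chosen candidates in selection order.
function mmrSelect(query: Vec, candidates: Vec[], lambdaMult = 0.7, limit = 5): number[] {
  const remaining = candidates.map((vector, index) => ({ index, vector }));
  const selected: Array<{ index: number; vector: Vec }> = [];

  while (selected.length < limit && remaining.length > 0) {
    let bestPos = 0;
    let bestScore = -Infinity;
    for (let pos = 0; pos < remaining.length; pos++) {
      const candidate = remaining[pos];
      if (!candidate) continue;
      const relevance = cosineSimilarity(candidate.vector, query);
      // Redundancy is the highest similarity to any already selected item.
      let redundancy = selected.length === 0 ? 0 : -Infinity;
      for (const chosen of selected) {
        redundancy = Math.max(redundancy, cosineSimilarity(candidate.vector, chosen.vector));
      }
      const score = lambdaMult * relevance - (1 - lambdaMult) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestPos = pos;
      }
    }
    const [picked] = remaining.splice(bestPos, 1);
    if (!picked) break;
    selected.push(picked);
  }
  return selected.map((item) => item.index);
}
```

Lowering `lambdaMult` shifts the balance toward diversity: near-duplicates of an already selected result are penalized more heavily.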

3. Pure Vector Search

Cosine similarity search:

// Configuration
numCandidates: 40  // 7.5x fewer candidates than the default of 300, so queries run faster
limit: 10
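
As an illustration of where these two settings end up, the sketch below builds an aggregation pipeline for MongoDB's `$vectorSearch` stage. It assumes the `vector_index` index and `embedding` field from the setup section, and that the index can be queried through `$vectorSearch`. `buildVectorSearchPipeline` is a hypothetical helper, and the projected field names are taken from the document structure shown later in this README.

```typescript
interface VectorSearchStage {
  $vectorSearch: {
    index: string;
    path: string;
    queryVector: number[];
    numCandidates: number;
    limit: number;
  };
}

interface ProjectStage {
  $project: Record<string, unknown>;
}

// Hypothetical helper: builds the pipeline only, it does not talk to the database.
function buildVectorSearchPipeline(
  queryVector: number[],
  numCandidates = 40,
  limit = 10,
): [VectorSearchStage, ProjectStage] {
  // $vectorSearch rejects numCandidates smaller than limit, so fail early.
  if (numCandidates < limit) {
    throw new Error("numCandidates must be greater than or equal to limit");
  }
  return [
    {
      $vectorSearch: {
        index: "vector_index",
        path: "embedding",
        queryVector,
        numCandidates,
        limit,
      },
    },
    {
      // Return readable fields plus the similarity score, leaving out the raw embedding.
      $project: { title: 1, content: 1, metadata: 1, score: { $meta: "vectorSearchScore" } },
    },
  ];
}
```

The result could then be passed to `collection.aggregate(...)` from the `mongodb` driver, with `queryVector` produced by the same embedding model used at indexing time.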

Technical Implementation

Semantic Chunking

Three-strategy approach with statistical analysis:

1. Interquartile Method

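// Cosine distance between each pair of consecutive sentence embeddings
embeddings = sentences.map(embed)
distances = embeddings.slice(1).map((e, i) => cosineDistance(embeddings[i], e))
// Break where a distance exceeds the upper outlier fence
[Q1, Q3] = quartiles(distances)
threshold = Q3 + 1.5 * (Q3 - Q1)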
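
The threshold above is the usual upper outlier fence. A runnable sketch of that step follows, assuming the consecutive-sentence distances are already computed and that quartiles use linear interpolation (the README does not say which quantile method is used). `quantile` and `findBreakpoints` are hypothetical names.

```typescript
// Linearly interpolated quantile of an ascending-sorted array.
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  const lowValue = sorted[lower] ?? 0;
  const highValue = sorted[upper] ?? lowValue;
  return lowValue + (highValue - lowValue) * (pos - lower);
}

// Returns every index i where the distance between sentence i and sentence i + 1
// exceeds Q3 + 1.5 * IQR, i.e. where a chunk boundary would be placed.
function findBreakpoints(distances: number[]): number[] {
  if (distances.length === 0) return [];
  const sorted = distances.slice().sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const threshold = q3 + 1.5 * (q3 - q1);

  const breakpoints: number[] = [];
  distances.forEach((distance, index) => {
    if (distance > threshold) breakpoints.push(index);
  });
  return breakpoints;
}
```

Because the fence adapts to the spread of distances in each document, a uniformly written text produces few breakpoints, while a text with abrupt topic changes produces one per jump.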

2. Gradient Method

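// Identify semantic transitions as jumps between consecutive distances
gradients = distances.slice(1).map((d, i) => d - distances[i])
// Keep the positions where the jump exceeds the threshold
breakpoints = gradients.flatMap((g, i) => g > threshold ? [i + 1] : [])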

3. Hybrid Scoring

score = 0.6 * interquartile + 0.4 * gradient
// Adaptive to content type

Chunk Optimization

const CHUNK_CONFIG = {
  base: 1000,      // Target size
  min: 100,        // Prevent empty
  max: 2500,       // Respect limits
  overlap: 200,    // Context preservation
  
  // Token validation
  maxTokens: 6000,  // voyage-context-3 safety
  tokenizer: 'cl100k_base'
};
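
The interaction between target size and overlap can be shown with a fixed-window splitter. This sketch assumes sizes are measured in characters, which the README does not state, and it leaves out the `min`, `max`, and token checks. `chunkWithOverlap` is a hypothetical name.

```typescript
// Splits text into windows of `size` characters where each window starts
// `size - overlap` characters after the previous one.
// Assumption: sizes are character counts, not tokens.
function chunkWithOverlap(text: string, size = 1000, overlap = 200): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Expected size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once a window has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}
```

With the defaults, consecutive chunks share 200 characters, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.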

Smart Indexing

Repository state tracking:

// Check existing hash
const existingHash = await storage.getRepositoryHash(repo.name);
const currentHash = await git.getLatestCommit();

if (existingHash === currentHash) {
  console.log('✅ Repository up to date, skipping...');
  return;
}

// Process only changed files
const changedFiles = await git.diff(existingHash, currentHash);
await processFiles(changedFiles);
await storage.updateRepositoryHash(repo.name, currentHash);
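
The snippet above does not show what happens on a first run, when no hash has been stored yet, or how deleted files are handled. The sketch below covers both under stated assumptions: a missing hash is represented as `null`, and changed files come from the tab-separated output of `git diff --name-status`. `planIndexing` and `parseNameStatus` are hypothetical helpers, not functions from src/core/indexer.ts.

```typescript
type IndexPlan =
  | { mode: "skip" }
  | { mode: "full" }
  | { mode: "incremental"; from: string; to: string };

// Decides how much work an indexing run needs.
function planIndexing(existingHash: string | null, currentHash: string): IndexPlan {
  if (existingHash === null) return { mode: "full" }; // never indexed before
  if (existingHash === currentHash) return { mode: "skip" }; // already up to date
  return { mode: "incremental", from: existingHash, to: currentHash };
}

// Parses `git diff --name-status <from> <to>` output into files to (re)index
// and files whose chunks should be deleted.
function parseNameStatus(output: string): { upsert: string[]; remove: string[] } {
  const upsert: string[] = [];
  const remove: string[] = [];
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue;
    const parts = line.split("\t");
    const status = (parts[0] ?? "").charAt(0);
    const firstPath = parts[1];
    const secondPath = parts[2];
    if (status === "D") {
      if (firstPath) remove.push(firstPath);
    } else if (status === "R") {
      // Rename: drop chunks for the old path, index the new one.
      if (firstPath) remove.push(firstPath);
      if (secondPath) upsert.push(secondPath);
    } else if (status === "C") {
      // Copy: only the destination is new.
      if (secondPath) upsert.push(secondPath);
    } else if (firstPath) {
      upsert.push(firstPath); // added, modified, or type-changed
    }
  }
  return { upsert, remove };
}
```

Treating renames as a delete plus an insert keeps stale chunks from lingering under the old file path.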

Error Handling

// Exponential backoff with jitter
const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
const jitter = Math.random() * 1000;
await sleep(delay + jitter);

// Token limit handling
if (error.message.includes('32000 tokens')) {
  // Split chunk and retry
  const subChunks = emergencySplit(chunk);
  return processSubChunks(subChunks);
}
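
The backoff lines above can be wrapped into a reusable retry helper. This is a sketch under assumptions: `backoffDelay` and `withRetry` are hypothetical names, the retry limit of 5 is illustrative, and the random source and sleep function are injectable so the behavior can be tested without waiting.

```typescript
// Delay before retry number `attempt` (0-based): exponential growth capped at 30 s,
// plus up to 1 s of random jitter so parallel clients do not retry in lockstep.
function backoffDelay(attempt: number, random: () => number = Math.random): number {
  const base = Math.min(1000 * Math.pow(2, attempt), 30000);
  return base + random() * 1000;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation` until it succeeds or `maxAttempts` is reached,
// sleeping with backoff between failed attempts.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown = new Error("withRetry called with maxAttempts < 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelay(attempt));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap only the network call, for example the embedding request, so that chunk splitting and other local work is not repeated on each retry.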

Performance Characteristics

Indexing Metrics

Processing rate: 100-150 docs/hour (Voyage API limited)
Batch size: 32 documents optimal
Memory usage: <500MB peak
Network bandwidth: ~10MB/hour
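
Sending documents in groups of 32 comes down to a simple slicing step. A minimal sketch, with `toBatches` as a hypothetical helper name:

```typescript
// Splits a list into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize = 32): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}
```

Each batch can then be embedded in one request, which keeps the request count well below the per-minute rate limit for most repositories.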

Search Performance

Latency: <100ms p99
Throughput: 1000+ QPS
Index size: ~1.5KB per chunk
Cache TTL: 5 minutes

Storage Efficiency

// Document structure (approximate field sizes)
{
  _id: ObjectId,
  title: string,           // 50 bytes
  content: string,         // 1000 bytes
  embedding: float[1024],  // ~4 KB as float32 (1024 x 4 bytes)
  metadata: {              // 200 bytes
    file: string,
    repo: string,
    product: string,
    indexedAt: Date
  }
}

Repository Configuration

Default Repositories

const repositories = [
  {
    name: 'MongoDB Documentation',
    repo: 'mongodb/docs',
    branch: 'master',
    product: 'mongodb-docs',
    priority: 10
  },
  // Add custom repositories...
];

Custom Repository

{
  name: 'Your Documentation',
  repo: 'owner/repository',
  branch: 'main',
  product: 'custom-your-docs',
  
  // Optional filters
  include: ['docs/**/*.md'],
  exclude: ['**/node_modules/**'],
  
  // Processing options
  chunkSize: 1500,
  chunkOverlap: 300
}

Development

Build Pipeline

# Development with watch
npm run dev

# Production build
npm run build

# Type checking
npm run typecheck

# Linting
npm run lint

# Testing
npm test

Project Structure

src/
├── core/
│   ├── indexer.ts           # Orchestration
│   ├── semantic-chunker.ts  # Content splitting
│   ├── embeddings.ts        # Vector generation
│   ├── storage.ts           # Database operations
│   └── search.ts            # Query algorithms
├── config/
│   └── index.ts             # Repository definitions
├── web/
│   ├── server.ts            # Express server
│   ├── coordinator.ts       # Web orchestration
│   └── templates/           # HTML interfaces
└── index.ts                 # MCP server

dist/                        # Compiled output
.repos/                      # Cloned repositories

Key Dependencies

{
  "mongodb": "^6.10.0",           // Native driver
  "voyageai": "^0.0.1-5",         // Embeddings
  "@modelcontextprotocol/sdk": "^1.0.0",  // MCP
  "js-tiktoken": "^1.0.15",       // Tokenization
  "simple-git": "^3.27.0"         // Repository ops
}

Troubleshooting

Connection Issues

# Test MongoDB connection
node -e "
  const { MongoClient } = require('mongodb');
  MongoClient.connect(process.env.MONGODB_URI)
    .then(client => { console.log('✅ Connected'); return client.close(); })
    .catch(err => { console.error('❌', err.message); process.exit(1); });
"

# Test Voyage AI
curl -X POST https://api.voyageai.com/v1/contextualizedembeddings \
  -H "Authorization: Bearer $VOYAGE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"inputs": [["test"]], "model": "voyage-context-3"}'

Index Issues

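# Verify vector index (the connection string names no database, so select it explicitly)
mongosh "$MONGODB_URI" --eval "
  db.getSiblingDB('mongodb_semantic_docs').documents.getSearchIndexes()
"

# Check document structure
mongosh "$MONGODB_URI" --eval "
  db.getSiblingDB('mongodb_semantic_docs').documents.findOne()
"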

Performance Tuning

// Adjust for your use case
const tuning = {
  // Smaller batches for memory constraints
  batchSize: 16,
  
  // More candidates for precision
  numCandidates: 100,
  
  // Larger chunks for context
  chunkSize: 2000,
  
  // Disable for speed
  smartIndexing: false
};

Best Practices

Security

Store credentials in environment variables
Use least-privilege MongoDB user
Rotate API keys regularly
Enable MongoDB audit logging

Optimization

Index during off-peak hours
Use incremental updates
Monitor token usage
Cache frequent queries

Scaling

Horizontal sharding for large corpuses
Read replicas for search traffic
CDN for static assets
Queue system for processing

Contributing

Pull requests welcome. Please ensure:

TypeScript strict mode compliance
Test coverage >80%
Conventional commits
Documentation updates

License

MIT

Support

Issues: GitHub
Discussions: GitHub Discussions
MCP Spec: modelcontextprotocol.io

Built with MongoDB Atlas vector search and Voyage AI embeddings.
