mongodocs-mcp
v11.0.0
Published
Transform any GitHub repository into searchable vector embeddings. MCP server with smart indexing, voyage-context-3 embeddings, and semantic search for Claude/Cursor IDEs.
Maintainers
Readme
mongodocs-mcp
A Model Context Protocol (MCP) server that transforms any GitHub repository into searchable vector embeddings, enabling semantic search across codebases and documentation through IDE integration.
Architecture
The system implements a three-phase indexing pipeline with smart change detection:
Repository → Git Clone → Smart Chunking → Vector Embeddings → MongoDB Atlas
↓ ↓ ↓ ↓
Hash Tracking Semantic Split voyage-context-3 Vector SearchCore Components
- Indexer (
src/core/indexer.ts): Git-based change detection using commit hashes - Semantic Chunker (
src/core/semantic-chunker.ts): Multi-strategy content splitting - Embedding Service (
src/core/embeddings.ts): Voyage AI integration with batching - Storage Service (
src/core/storage.ts): MongoDB Atlas vector operations - Search Service (
src/core/search.ts): Vector, hybrid RRF, and MMR algorithms - MCP Server (
src/index.ts): Protocol implementation for IDE integration
Installation
Global Package
npm install -g mongodocs-mcpFrom Source
git clone https://github.com/yourusername/mongodocs-mcp.git
cd mongodocs-mcp
npm install
npm run build
npm linkSetup
1. MongoDB Atlas
Create free M0 cluster at cloud.mongodb.com:
# Database structure
Database: mongodb_semantic_docs
Collection: documents
# Connection string format
mongodb+srv://username:[email protected]/?retryWrites=true&w=majorityNetwork Access Configuration:
- Navigate to Network Access → Add IP Address
- Add
0.0.0.0/0for development (restrict in production)
Vector Search Index Creation:
- Go to Atlas Search → Create Index
- Select "JSON Editor"
- Paste configuration:
{
"mappings": {
"dynamic": true,
"fields": {
"embedding": {
"type": "knnVector",
"dimensions": 1024,
"similarity": "cosine"
}
}
}
}Name: vector_index
2. Voyage AI
Get API key from voyageai.com:
- Model:
voyage-context-3 - Dimensions: 1024
- Context window: 32,000 tokens
- Rate limit: 2000 RPM
3. Environment Configuration
Create .env file:
# Required
MONGODB_URI=mongodb+srv://username:[email protected]/?retryWrites=true&w=majority
VOYAGE_API_KEY=pa-your-api-key
# Optional
GITHUB_TOKEN=ghp_your_token # For private reposUsage
Web Interface
# Start web UI
npm run web
# Opens http://localhost:3000
# 4-step wizard:
# 1. Configure APIs
# 2. Select repositories
# 3. Review MCP setup
# 4. Start processingCommand Line
# Index repositories (smart mode - only changed files)
npm run index
# Force complete rebuild
npm run rebuild
# Monitor indexing progress
npm run progress
# Database statistics
npm run stats
# Clean database
npm run cleanProgrammatic API
import { Indexer } from 'mongodocs-mcp';
const config = {
repositories: [{
name: 'My Documentation',
repo: 'owner/repository',
branch: 'main',
product: 'custom-my-docs'
}],
embedding: {
model: 'voyage-context-3',
dimensions: 1024,
chunkSize: 1000,
chunkOverlap: 200
}
};
const indexer = new Indexer(config);
indexer.onProgress((progress) => {
console.log(`${progress.phase}: ${progress.current}/${progress.total}`);
});
await indexer.index();MCP Integration
Claude Desktop
File: ~/Library/Application Support/Claude/claude_desktop_config.json
{
"mcpServers": {
"mongodocs": {
"command": "npx",
"args": ["mongodocs-mcp"],
"env": {
"MONGODB_URI": "your-connection-string",
"VOYAGE_API_KEY": "your-api-key"
}
}
}
}Cursor IDE
File: .cursor/mcp_settings.json
{
"mcpServers": {
"mongodocs": {
"command": "npx",
"args": ["mongodocs-mcp"],
"env": {
"MONGODB_URI": "your-connection-string",
"VOYAGE_API_KEY": "your-api-key"
}
}
}
}Restart IDE after configuration.
Search Methods
1. Hybrid RRF Search (Primary)
Reciprocal Rank Fusion combining vector and keyword search:
// Weight configuration
vectorWeight: 0.7
keywordWeight: 0.3
// Ranking formula
score = 1 / (k + rank) where k = 602. MMR Search (Diversity)
Maximum Marginal Relevance for result diversity:
// Parameters
fetchK: 20 // Initial candidates
lambdaMult: 0.7 // Relevance vs diversity
limit: 5 // Final results
// Algorithm
MMR = λ * Sim(Di, Q) - (1-λ) * max Sim(Di, Dj)3. Pure Vector Search
Cosine similarity search:
// Configuration
numCandidates: 40 // 7.5x faster than default 300
limit: 10Technical Implementation
Semantic Chunking
Three-strategy approach with statistical analysis:
1. Interquartile Method
// Calculate sentence distances
distances = sentences.map(embed).map(cosineDistance)
// Find breakpoints at quartile boundaries
Q1, Q3 = quartiles(distances)
threshold = Q3 + 1.5 * (Q3 - Q1)2. Gradient Method
// Identify semantic transitions
gradients = distances.map(derivative)
breakpoints = gradients.filter(g => g > threshold)3. Hybrid Scoring
score = 0.6 * interquartile + 0.4 * gradient
// Adaptive to content typeChunk Optimization
const CHUNK_CONFIG = {
base: 1000, // Target size
min: 100, // Prevent empty
max: 2500, // Respect limits
overlap: 200, // Context preservation
// Token validation
maxTokens: 6000, // voyage-context-3 safety
tokenizer: 'cl100k_base'
};Smart Indexing
Repository state tracking:
// Check existing hash
const existingHash = await storage.getRepositoryHash(repo.name);
const currentHash = await git.getLatestCommit();
if (existingHash === currentHash) {
console.log('✅ Repository up to date, skipping...');
return;
}
// Process only changed files
const changedFiles = await git.diff(existingHash, currentHash);
await processFiles(changedFiles);
await storage.updateRepositoryHash(repo.name, currentHash);Error Handling
// Exponential backoff with jitter
const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
const jitter = Math.random() * 1000;
await sleep(delay + jitter);
// Token limit handling
if (error.message.includes('32000 tokens')) {
// Split chunk and retry
const subChunks = emergencySplit(chunk);
return processSubChunks(subChunks);
}Performance Characteristics
Indexing Metrics
- Processing rate: 100-150 docs/hour (Voyage API limited)
- Batch size: 32 documents optimal
- Memory usage: <500MB peak
- Network bandwidth: ~10MB/hour
Search Performance
- Latency: <100ms p99
- Throughput: 1000+ QPS
- Index size: ~1.5KB per chunk
- Cache TTL: 5 minutes
Storage Efficiency
// Document structure (avg 1.5KB)
{
_id: ObjectId,
title: string, // 50 bytes
content: string, // 1000 bytes
embedding: float[1024], // 4KB compressed
metadata: { // 200 bytes
file: string,
repo: string,
product: string,
indexedAt: Date
}
}Repository Configuration
Default Repositories
const repositories = [
{
name: 'MongoDB Documentation',
repo: 'mongodb/docs',
branch: 'master',
product: 'mongodb-docs',
priority: 10
},
// Add custom repositories...
];Custom Repository
{
name: 'Your Documentation',
repo: 'owner/repository',
branch: 'main',
product: 'custom-your-docs',
// Optional filters
include: ['docs/**/*.md'],
exclude: ['**/node_modules/**'],
// Processing options
chunkSize: 1500,
chunkOverlap: 300
}Development
Build Pipeline
# Development with watch
npm run dev
# Production build
npm run build
# Type checking
npm run typecheck
# Linting
npm run lint
# Testing
npm testProject Structure
src/
├── core/
│ ├── indexer.ts # Orchestration
│ ├── semantic-chunker.ts # Content splitting
│ ├── embeddings.ts # Vector generation
│ ├── storage.ts # Database operations
│ └── search.ts # Query algorithms
├── config/
│ └── index.ts # Repository definitions
├── web/
│ ├── server.ts # Express server
│ ├── coordinator.ts # Web orchestration
│ └── templates/ # HTML interfaces
└── index.ts # MCP server
dist/ # Compiled output
.repos/ # Cloned repositoriesKey Dependencies
{
"mongodb": "^6.10.0", // Native driver
"voyageai": "^0.0.1-5", // Embeddings
"@modelcontextprotocol/sdk": "^1.0.0", // MCP
"js-tiktoken": "^1.0.15", // Tokenization
"simple-git": "^3.27.0" // Repository ops
}Troubleshooting
Connection Issues
# Test MongoDB connection
node -e "
const { MongoClient } = require('mongodb');
MongoClient.connect(process.env.MONGODB_URI)
.then(() => console.log('✅ Connected'))
.catch(err => console.error('❌', err.message));
"
# Test Voyage AI
curl -X POST https://api.voyageai.com/v1/embeddings \
-H "Authorization: Bearer $VOYAGE_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": ["test"], "model": "voyage-context-3"}'Index Issues
# Verify vector index
mongosh $MONGODB_URI --eval "
db.documents.getSearchIndexes()
"
# Check document structure
mongosh $MONGODB_URI --eval "
db.documents.findOne()
"Performance Tuning
// Adjust for your use case
const tuning = {
// Smaller batches for memory constraints
batchSize: 16,
// More candidates for precision
numCandidates: 100,
// Larger chunks for context
chunkSize: 2000,
// Disable for speed
smartIndexing: false
};Best Practices
Security
- Store credentials in environment variables
- Use least-privilege MongoDB user
- Rotate API keys regularly
- Enable MongoDB audit logging
Optimization
- Index during off-peak hours
- Use incremental updates
- Monitor token usage
- Cache frequent queries
Scaling
- Horizontal sharding for large corpuses
- Read replicas for search traffic
- CDN for static assets
- Queue system for processing
Contributing
Pull requests welcome. Please ensure:
- TypeScript strict mode compliance
- Test coverage >80%
- Conventional commits
- Documentation updates
License
MIT
Support
- Issues: GitHub
- Discussions: GitHub Discussions
- MCP Spec: modelcontextprotocol.io
Built with MongoDB Atlas vector search and Voyage AI embeddings.
