@knath2000/codebase-indexing-mcp
v1.0.0
MCP server for codebase indexing with Voyage AI embeddings and Qdrant vector storage
MCP Codebase Indexing Server
A Model Context Protocol (MCP) server that provides intelligent codebase indexing and semantic search capabilities for AI assistants like Cursor. This server uses Voyage AI for embeddings and Qdrant for vector storage to enable powerful semantic code search across your entire codebase.
📋 Table of Contents
- ✨ Features
- 🏗️ Architecture
- 🚀 Quick Start
- 🎯 Cursor Integration Guide
- ⚙️ Customization Guide
- 📝 Configuration
- 🛠️ MCP Tools
- 🌐 Supported Languages
- 🔧 Troubleshooting
- 🤝 Contributing
- 📄 License
✨ Features
- 🧠 Intelligent Code Parsing: Uses tree-sitter to parse code into meaningful chunks (functions, classes, modules, etc.)
- 🔍 Semantic Search: Leverages Voyage AI embeddings for semantic code search beyond keyword matching
- 📊 Vector Storage: Uses Qdrant for efficient vector storage and lightning-fast similarity search
- 🌐 Multiple Language Support: Supports JavaScript, TypeScript, Python, and more
- ⚡ Incremental Indexing: Tracks file changes and only re-indexes when necessary
- 🎯 Flexible Search: Search by language, chunk type, file path, or semantic similarity
- 🔗 Context-Aware: Provides code context and related chunks for better understanding
- 🚀 MCP Compatible: Works seamlessly with Cursor and other MCP-compatible AI assistants
- 🛠️ 12 Powerful Tools: Complete set of indexing and search tools for comprehensive codebase management
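The incremental indexing feature above can be pictured as a fingerprint check per file. The following is a minimal sketch, assuming a content hash is compared between runs; `fingerprint` and `ChangeTracker` are hypothetical names, and the server may track changes differently (for example by modification time).

```typescript
// Hypothetical sketch of change tracking; not the server's actual implementation.

/** FNV-1a-style 32-bit hash over UTF-16 code units, used as a cheap content fingerprint. */
function fingerprint(content: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

class ChangeTracker {
  private readonly seen = new Map<string, string>();

  /** Returns true when the file is new or its content changed since the last call. */
  needsReindex(filePath: string, content: string): boolean {
    const current = fingerprint(content);
    if (this.seen.get(filePath) === current) return false;
    this.seen.set(filePath, current);
    return true;
  }

  /** Forget a file, for example after it has been removed from the index. */
  forget(filePath: string): void {
    this.seen.delete(filePath);
  }
}
```

A caller would run `needsReindex` for each file it walks and skip embedding whenever it returns `false`.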
🏗️ Architecture
graph TB
subgraph "AI Assistant"
A[Cursor/Claude]
end
subgraph "MCP Server"
B[HTTP Server<br/>Custom SSE + JSON-RPC]
C[IndexingService]
D[SearchService]
E[Code Parser<br/>Tree-sitter]
end
subgraph "External Services"
F[Voyage AI<br/>Embeddings]
G[Qdrant<br/>Vector DB]
end
A <--> B
B --> C
B --> D
C --> E
C --> F
C --> G
D --> F
D --> G
style A fill:#e1f5fe
style B fill:#f3e5f5
style F fill:#fff3e0
style G fill:#e8f5e8
The server consists of several key components:
- Code Parser: Tree-sitter based parser that extracts semantic chunks from code
- Voyage Client: Handles embedding generation via Voyage AI API
- Qdrant Client: Manages vector storage and similarity search
- Indexing Service: Orchestrates the indexing process
- Search Service: Provides semantic search capabilities
- MCP Server: Exposes tools via the Model Context Protocol
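To make the Search Service's role concrete, here is a minimal in-memory sketch of similarity ranking. It assumes cosine similarity over embedding vectors with an optional language filter; in the real server this work is delegated to Qdrant, and `StoredChunk` and `rankChunks` are hypothetical names used only for illustration.

```typescript
// In-memory stand-in for a vector search; illustrative only.
interface StoredChunk {
  id: string;
  language: string;
  vector: number[];
}

/** Cosine similarity of two equal-length vectors; returns 0 for zero vectors. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

/** Rank chunks by similarity to the query vector, optionally filtered by language. */
function rankChunks(
  query: number[],
  chunks: StoredChunk[],
  limit: number,
  language?: string,
): Array<{ id: string; score: number }> {
  return chunks
    .filter((c) => language === undefined || c.language === language)
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, limit);
}
```

The query vector would come from the same embedding model as the stored chunks, which is why both the indexing and search paths talk to Voyage AI.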
📦 Installation
NPM Package (Recommended)
# Install globally
npm install -g mcp-codebase-indexing-server
# Or run directly with npx
npx mcp-codebase-indexing-server
Docker
# Pull and run
docker run -p 3001:3001 ghcr.io/your-org/mcp-codebase-indexing-server:latest
From Source
git clone <repository-url>
cd mcp-codebase-indexing-server
npm install
npm run build
npm start
🚀 Quick Start
Prerequisites
- Node.js 18+
- Voyage AI API key (available from https://www.voyageai.com/)
- Qdrant instance (local or cloud)
- AI assistant that supports MCP (like Cursor)
5-Minute Setup
- Get your services ready:
# Start local Qdrant
docker run -d -p 6333:6333 --name qdrant qdrant/qdrant
# Get Voyage AI API key from https://www.voyageai.com/
- Deploy the server:
git clone <repository-url>
cd mcp-codebase-indexing-server
npm install && npm run build
VOYAGE_API_KEY=your_key_here npm start
Connect to Cursor:
- Add MCP server in Cursor settings
- Use server URL: http://localhost:3001
- You should see a green circle with 12 tools available
Test it out:
- Index your codebase: "Index the current directory"
- Search your code: "Find authentication functions in TypeScript"
Prerequisites (Detailed)
- Node.js 18+
- Voyage AI API key
- Qdrant instance (local or cloud)
Installation
- Clone the repository:
git clone <repository-url>
cd mcp-codebase-indexing-server
- Install dependencies:
npm install
- Build the server:
npm run build
🎯 Cursor Integration Guide
Setting Up MCP Server in Cursor
Open Cursor Settings:
- Go to Settings → Features → Model Context Protocol
Add MCP Server:
{
  "name": "codebase-indexing",
  "command": "node",
  "args": ["path/to/your/mcp-codebase-indexing-server/dist/index.js"],
  "env": {
    "VOYAGE_API_KEY": "your_voyage_api_key_here",
    "QDRANT_URL": "http://localhost:6333"
  }
}
Verify Connection:
- Look for green circle indicator in Cursor
- Should show "12 tools" when connected
- If red circle: check logs and troubleshooting section
Using with Cursor
Indexing Your Codebase
"Index the current directory for semantic search"
"Index the src/ folder in my project"
"Re-index the modified files in my codebase"
Searching Your Code
"Find authentication functions in TypeScript"
"Search for error handling patterns"
"Look for database query functions"
"Find classes that handle user data"
"Show me similar functions to the one I'm looking at"
Getting Code Context
"Get context around the login function"
"Show me similar code to this authentication logic"
"Find related functions in this file"
Troubleshooting Cursor Connection
| Issue | Solution |
|-------|----------|
| Red circle (0 tools) | Check VOYAGE_API_KEY is set correctly |
| "No server info found" | Restart Cursor completely |
| Connection timeout | Ensure Qdrant is running on correct port |
| Tools not responding | Check server logs for errors |
⚙️ Customization Guide
For Different Project Types
Large Enterprise Codebases
# Handle large codebases efficiently
BATCH_SIZE=50
MAX_FILE_SIZE=2097152
CHUNK_SIZE=1500 # above the documented 100-1000 range; larger chunks are truncated
EXCLUDE_PATTERNS=node_modules,dist,build,.git,coverage,logs
AI/ML Projects
# Optimize for Python-heavy codebases
SUPPORTED_EXTENSIONS=.py,.ipynb,.md,.yaml,.yml
EMBEDDING_MODEL=voyage-code-2
CHUNK_SIZE=2000 # above the documented 100-1000 range; larger chunks are truncated
Frontend Projects
# Focus on web technologies
SUPPORTED_EXTENSIONS=.js,.jsx,.ts,.tsx,.vue,.svelte,.css,.scss
EXCLUDE_PATTERNS=node_modules,dist,build,.next,coverage
CHUNK_SIZE=800
Microservices Architecture
# Index multiple service repositories
COLLECTION_NAME=microservices-org
BATCH_SIZE=100
# Consider separate instances per service
Advanced Configuration Options
Performance Tuning
# Memory optimization
BATCH_SIZE=25 # Smaller batches for memory-constrained environments
CHUNK_OVERLAP=100 # Reduce overlap to save storage
MAX_FILE_SIZE=1048576 # Limit file size (1MB default)
# Speed optimization
BATCH_SIZE=200 # Larger batches for faster processing
EMBEDDING_MODEL=voyage-code-2 # Optimized model for code
Custom File Filtering
# Include only specific file types
SUPPORTED_EXTENSIONS=.py,.js,.ts,.go,.rs
# Exclude testing and generated files
EXCLUDE_PATTERNS=*test*,*spec*,generated,vendor,node_modules
# Include documentation
SUPPORTED_EXTENSIONS=.md,.rst,.txt,.py,.js,.ts
Multi-Environment Setup
# Development
COLLECTION_NAME=dev-codebase
QDRANT_URL=http://localhost:6333
# Staging
COLLECTION_NAME=staging-codebase
QDRANT_URL=https://staging-qdrant.company.com
# Production
COLLECTION_NAME=prod-codebase
QDRANT_URL=https://qdrant.company.com
QDRANT_API_KEY=prod_api_key
🔒 Privacy & Security
Your Code Stays Private
The MCP server is designed with privacy as a core principle:
Small Code Chunks Only
- Chunk Size: Only small code segments (100-1000 characters) are sent for embedding
- Default: 800 characters maximum per chunk (configurable)
- Enforcement: Automatic truncation of larger chunks with logging
- No Full Files: Complete files are never sent to external services
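The size limits above can be illustrated with a simple sliding-window splitter. This is only a sketch under stated assumptions: the server chunks along tree-sitter syntax nodes rather than fixed character windows, and `splitIntoChunks` is a hypothetical helper. It shows how a 100-1000 character clamp and a character overlap interact.

```typescript
// Character-window sketch of size-limited chunking; the real parser is syntax-aware.
const MIN_CHUNK_SIZE = 100;
const MAX_CHUNK_SIZE = 1000;

function splitIntoChunks(text: string, chunkSize = 800, overlap = 100): string[] {
  // Clamp the requested size into the documented 100-1000 range.
  const size = Math.min(MAX_CHUNK_SIZE, Math.max(MIN_CHUNK_SIZE, chunkSize));
  // Keep the overlap smaller than the window so each step moves forward.
  const safeOverlap = Math.min(Math.max(0, overlap), size - 1);
  const step = size - safeOverlap;

  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // the last window reached the end
  }
  return chunks;
}
```

With the defaults, a 2000-character file yields three chunks, and each chunk repeats the final 100 characters of the previous one.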
One-Way Mathematical Representations
- Embeddings: Code chunks are converted to mathematical vectors (embeddings)
- Lossy: Embeddings are not designed to be converted back to the original code
- Semantic Only: Vectors capture meaning, not exact text
- No Full-File Transfer: Only the small chunks described above are sent for embedding; complete files stay in your environment
Local Processing
- Parsing: All code parsing happens locally using Tree-sitter
- Chunking: Code segmentation occurs on your machine
- Storage: Only vector embeddings stored in your Qdrant instance
- Search: Semantic search runs on your infrastructure
Network Security
- HTTPS: All external API calls use TLS encryption
- API Keys: Securely stored in environment variables
- No Logging: Code content is never logged to external services
- Minimal Data: Only small code chunks are sent to the embedding API, and only vectors come back
Privacy Configuration
# Privacy-optimized settings
CHUNK_SIZE=800 # Max 800 chars per chunk (100-1000 range)
CHUNK_OVERLAP=100 # Reduced overlap for privacy
MAX_FILE_SIZE=1048576 # 1MB file size limit
EXCLUDE_PATTERNS=*.git*,node_modules/**,dist/** # Skip sensitive directories
📝 Configuration
The server is configured via environment variables:
Required Environment Variables
VOYAGE_API_KEY: Your Voyage AI API key
Optional Environment Variables
- QDRANT_URL: Qdrant server URL (default: http://localhost:6333)
- QDRANT_API_KEY: Qdrant API key (if using cloud instance)
- COLLECTION_NAME: Name of the Qdrant collection (default: codebase)
- EMBEDDING_MODEL: Voyage AI model to use (default: voyage-code-3)
- BATCH_SIZE: Batch size for embedding generation (default: 100)
- CHUNK_SIZE: Maximum chunk size in characters (default: 800, range: 100-1000)
- CHUNK_OVERLAP: Overlap between chunks (default: 100)
- MAX_FILE_SIZE: Maximum file size to index in bytes (default: 1048576)
- EXCLUDE_PATTERNS: Comma-separated patterns to exclude (default: see config)
- SUPPORTED_EXTENSIONS: Comma-separated file extensions to support (default: see config)
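As an illustration of how these variables and their defaults fit together, here is a minimal loader sketch. The `ServerConfig` shape, `loadConfig`, and `intFromEnv` are hypothetical names, not the server's actual code; only the variable names and default values are taken from the list above.

```typescript
// Hypothetical config loader; defaults mirror the documented values.
interface ServerConfig {
  voyageApiKey: string;
  qdrantUrl: string;
  qdrantApiKey: string | undefined;
  collectionName: string;
  embeddingModel: string;
  batchSize: number;
  chunkSize: number;
  chunkOverlap: number;
  maxFileSize: number;
}

type Env = Record<string, string | undefined>;

/** Parse an integer variable, falling back to the default when unset or blank. */
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number.parseInt(raw, 10);
  if (Number.isNaN(value)) throw new Error(`${name} must be an integer, got "${raw}"`);
  return value;
}

function loadConfig(env: Env): ServerConfig {
  const voyageApiKey = env["VOYAGE_API_KEY"];
  if (!voyageApiKey) throw new Error("VOYAGE_API_KEY is required");
  return {
    voyageApiKey,
    qdrantUrl: env["QDRANT_URL"] ?? "http://localhost:6333",
    qdrantApiKey: env["QDRANT_API_KEY"],
    collectionName: env["COLLECTION_NAME"] ?? "codebase",
    embeddingModel: env["EMBEDDING_MODEL"] ?? "voyage-code-3",
    batchSize: intFromEnv(env, "BATCH_SIZE", 100),
    // Clamp into the documented 100-1000 range.
    chunkSize: Math.min(1000, Math.max(100, intFromEnv(env, "CHUNK_SIZE", 800))),
    chunkOverlap: intFromEnv(env, "CHUNK_OVERLAP", 100),
    maxFileSize: intFromEnv(env, "MAX_FILE_SIZE", 1048576),
  };
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`.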
Example Configuration
Create a .env file in the project root:
VOYAGE_API_KEY=your_voyage_api_key_here
QDRANT_URL=http://localhost:6333
COLLECTION_NAME=my_codebase
EMBEDDING_MODEL=voyage-code-2
BATCH_SIZE=50
MAX_FILE_SIZE=2097152
Usage
Running the Server
npm start
Or in development mode:
npm run dev
Setting up Qdrant
Local Qdrant (Docker)
docker run -p 6333:6333 qdrant/qdrant
Qdrant Cloud
Sign up at Qdrant Cloud and get your API key and URL.
🛠️ MCP Tools
The server provides the following tools, organized by functionality:
Indexing tools:
- index_directory: Index all files in a directory recursively
- index_file: Index a single file
- reindex_file: Re-index a file (force update)
- remove_file: Remove a file from the index
- clear_index: Clear the entire search index
Search tools:
- codebase_search: 🌟 Natural language search for codebase understanding (e.g., "How is user authentication handled?", "Database connection setup", "Error handling patterns")
- search_code: Search for code chunks using semantic similarity
- search_functions: Search for functions by name or description
- search_classes: Search for classes by name or description
- find_similar: Find code chunks similar to a given chunk
- get_code_context: Get code context around a specific chunk
Statistics and maintenance tools:
- get_indexing_stats: Get statistics about the indexed codebase
- get_search_stats: Get statistics about the search index
- get_enhanced_stats: Get enhanced statistics including cache and hybrid search metrics
- get_health_status: Get comprehensive health status of all services
- clear_search_cache: Clear search cache for fresh results
- invalidate_file_cache: Invalidate cache for a specific file
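MCP clients such as Cursor invoke these tools for you, but it can help to see the message shape. The sketch below builds a JSON-RPC 2.0 `tools/call` request as defined by the Model Context Protocol; `buildToolCall` is a hypothetical helper, and how this server's HTTP transport receives the message is not covered here.

```typescript
// Builds the JSON-RPC envelope that MCP uses for tool invocations.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextRequestId = 1;

/** Wrap a tool name and its arguments in a tools/call request. */
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextRequestId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Example: the request an MCP client might send for a function search.
const exampleRequest = buildToolCall("search_functions", {
  query: "authentication login user",
  language: "typescript",
  limit: 5,
});
console.log(JSON.stringify(exampleRequest, null, 2));
```

The shorter `{"tool": ..., "arguments": ...}` form used in this README maps onto `params.name` and `params.arguments` in this envelope.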
Example Usage
- Index a directory:
{
  "tool": "index_directory",
  "arguments": {
    "directory_path": "/path/to/your/codebase"
  }
}
- 🌟 Natural language codebase search:
{
  "tool": "codebase_search",
  "arguments": {
    "query": "How is user authentication handled?",
    "limit": 5,
    "enable_hybrid": true,
    "enable_reranking": true
  }
}
- Search for authentication functions:
{
  "tool": "search_functions",
  "arguments": {
    "query": "authentication login user",
    "language": "typescript",
    "limit": 5
  }
}
- Search for error handling patterns:
{
  "tool": "search_code",
  "arguments": {
    "query": "error handling exception try catch",
    "chunk_type": "function",
    "threshold": 0.7
  }
}
🌟 Natural Language Search Examples
The codebase_search tool understands natural language queries and provides:
- Relevant code snippets with syntax highlighting
- File paths with line numbers for direct navigation
- Similarity scores as percentages
- Clickable navigation links to jump to specific locations
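As a rough illustration of how a result with a file path, line range, and similarity score could be turned into such a reference, consider the sketch below. `SearchHit` and `formatHit` are hypothetical; the layout follows the sample output shown in this section, and the server's real formatter may differ.

```typescript
// Hypothetical formatter for a single search result.
interface SearchHit {
  filePath: string;
  startLine: number;
  endLine: number;
  chunkType: string;
  language: string;
  score: number; // similarity in the range 0..1
}

function formatHit(rank: number, hit: SearchHit): string {
  // Express the similarity as a percentage with one decimal place.
  const percent = (hit.score * 100).toFixed(1);
  const link = `[📂 ${hit.filePath}:${hit.startLine}](file://${hit.filePath}#L${hit.startLine})`;
  return [
    `### ${rank}. ${link}`,
    `**Lines ${hit.startLine}-${hit.endLine}** | **${hit.chunkType}** | **${hit.language}** | **Similarity: ${percent}%**`,
  ].join("\n");
}
```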
Example queries that work great:
- "How is user authentication handled?"
- "Database connection setup"
- "Error handling patterns"
- "API endpoint definitions"
- "Component state management"
- "Configuration loading"
- "Logging implementation"
Sample output format:
# 🔍 Natural Language Codebase Search
**Query:** "How is user authentication handled?"
## 📊 Search Results
- **Found:** 8 relevant code references
- **Search Time:** 45ms
- **Hybrid Search:** ✅ (Dense + Sparse)
- **LLM Re-ranked:** ✅ (Relevance optimized)
## 📝 Code References with Navigation Links
### 1. [📂 src/auth/auth-service.ts:15](file://src/auth/auth-service.ts#L15)
**Lines 15-28** | **function** | **typescript** | **Similarity: 94.2%**
```typescript
async authenticateUser(token: string): Promise<User | null> {
  try {
    const decoded = jwt.verify(token, this.secretKey);
    return await this.userRepository.findById(decoded.userId);
  } catch (error) {
    logger.error('Authentication failed:', error);
    return null;
  }
}
```
## 🌐 Supported Languages
| Language | File Extensions | Status |
|----------|----------------|--------|
| **JavaScript** | `.js`, `.jsx` | ✅ Full Support |
| **TypeScript** | `.ts`, `.tsx` | ✅ Full Support |
| **Python** | `.py` | ✅ Full Support |
| **Go** | `.go` | 🔄 Coming Soon |
| **Rust** | `.rs` | 🔄 Coming Soon |
| **Java** | `.java` | 🔄 Coming Soon |
> 💡 **Extensible**: Additional languages can be added by installing the corresponding tree-sitter grammars and updating the configuration.
## API Reference
### Indexing Service
The `IndexingService` class provides:
```typescript
// Initialize the service
await indexingService.initialize();
// Index a directory
const stats = await indexingService.indexDirectory('/path/to/code');
// Index a single file
const chunks = await indexingService.indexFile('/path/to/file.ts');
// Remove a file from index
await indexingService.removeFile('/path/to/file.ts');
// Clear entire index
await indexingService.clearIndex();
```
### Search Service
The SearchService class provides:
// Initialize the service
await searchService.initialize();
// Basic search
const results = await searchService.search({
  query: 'authentication',
  language: 'typescript',
  limit: 10
});
// Search functions
const functions = await searchService.searchFunctions('login', 'typescript');
// Find similar chunks
const similar = await searchService.findSimilar('chunk_id', 5);
// Get code context
const context = await searchService.getCodeContext('chunk_id', 5);
Performance Considerations
- Batch Processing: The server processes files in batches to avoid memory issues
- Incremental Updates: Only re-indexes files that have changed
- Embedding Caching: Consider caching embeddings to reduce API calls
- Vector Storage: Qdrant provides efficient vector storage and retrieval
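The batching and caching points above can be combined in one small wrapper. This is a sketch under assumptions: `EmbedBatch` stands in for whatever function calls the embedding API, and `toBatches` and `CachingEmbedder` are hypothetical names. The cache is keyed by the full chunk text for simplicity.

```typescript
// Hypothetical batching and caching wrapper around an embedding function.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

/** Split items into consecutive batches of at most batchSize elements. */
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

class CachingEmbedder {
  private readonly cache = new Map<string, number[]>();

  constructor(
    private readonly embedBatch: EmbedBatch,
    private readonly batchSize = 100,
  ) {}

  /** Embed texts, calling the backend only for texts not seen before. */
  async embed(texts: string[]): Promise<number[][]> {
    const missing = Array.from(new Set(texts.filter((t) => !this.cache.has(t))));
    for (const batch of toBatches(missing, this.batchSize)) {
      const vectors = await this.embedBatch(batch);
      batch.forEach((text, i) => {
        const vector = vectors[i];
        if (vector === undefined) throw new Error("backend returned too few vectors");
        this.cache.set(text, vector);
      });
    }
    return texts.map((t) => this.cache.get(t) as number[]);
  }
}
```

A smaller `batchSize` trades throughput for lower peak memory, which matches the BATCH_SIZE guidance elsewhere in this README.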
🔧 Troubleshooting
Common Issues
MCP Connection Issues
| Problem | Symptoms | Solution |
|---------|----------|----------|
| Server won't start | Error: EADDRINUSE | Port 3001 already in use. Change PORT env var or kill existing process |
| Connection timeout | Cursor shows "connecting..." forever | Check VOYAGE_API_KEY is valid and Qdrant is running |
| Red circle in Cursor | 0 tools shown | Restart Cursor completely, verify server is running |
| "Not connected" error | Tools fail with connection error | Server restarted automatically, wait 30 seconds |
Service Connection Issues
Connection to Qdrant fails:
# Check if Qdrant is running
curl http://localhost:6333/collections
# Start Qdrant if not running
docker run -d -p 6333:6333 --name qdrant qdrant/qdrant
# Check firewall settings
netstat -tulpn | grep 6333
Voyage AI API errors:
# Test API key
curl -H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": ["test"], "model": "voyage-code-2"}' \
https://api.voyageai.com/v1/embeddings
# Check quota at https://www.voyageai.com/dashboard
Performance Issues
Out of memory during indexing:
# Reduce memory usage
BATCH_SIZE=25
MAX_FILE_SIZE=524288
CHUNK_SIZE=500
# Exclude large directories
EXCLUDE_PATTERNS=node_modules,dist,build,.git,logs,coverage,vendor
Slow indexing performance:
# Optimize for speed
BATCH_SIZE=100
CHUNK_OVERLAP=100
# Use faster embedding model if available
EMBEDDING_MODEL=voyage-code-2
Code Parsing Issues
- Tree-sitter parsing errors:
  - Error: Language not supported
    - Solution: Add tree-sitter grammar for your language
  - Error: Failed to parse file
    - Solution: Check file encoding (must be UTF-8)
  - Error: File too large
    - Solution: Increase MAX_FILE_SIZE or exclude the file
Diagnostic Commands
Check Server Health
# Test server is running
curl http://localhost:3001/health
# Test MCP endpoint
curl http://localhost:3001/sse
# Check server logs
npm start 2>&1 | tee server.log
Check Services
# Test Qdrant
curl http://localhost:6333/collections
# Test Voyage AI
curl -H "Authorization: Bearer $VOYAGE_API_KEY" \
-H "Content-Type: application/json" \
https://api.voyageai.com/v1/embeddings \
-d '{"input":["test"],"model":"voyage-code-2"}'
Debug Indexing
# Enable debug mode
DEBUG=1 npm start
# Test specific directory
curl -X POST http://localhost:3001/tools/call \
-H "Content-Type: application/json" \
-d '{"tool":"index_directory","arguments":{"directory_path":"./test"}}'
Debug Mode
Enable comprehensive logging:
# Full debug output
DEBUG=1 npm start
# Service-specific debugging
DEBUG=indexing npm start
DEBUG=search npm start
DEBUG=mcp npm start
Log Analysis
Look for these patterns in logs:
| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| Error: VOYAGE_API_KEY is required | Missing API key | Set VOYAGE_API_KEY environment variable |
| Failed to connect to Qdrant | Vector DB unavailable | Check Qdrant is running and accessible |
| Rate limit exceeded | API quota reached | Wait or upgrade Voyage AI plan |
| Memory usage warning | High memory usage | Reduce BATCH_SIZE or exclude more files |
| Lazy initialization completed | Services ready | Normal startup, server ready for requests |
Getting Help
- Check server logs for specific error messages
- Test each service individually using diagnostic commands
- Verify environment variables are set correctly
- Restart services in order: Qdrant → MCP Server → Cursor
- Create minimal reproduction with a small test directory
If issues persist, create a GitHub issue with:
- Complete error logs
- Environment configuration (without API keys)
- Steps to reproduce
- System information (OS, Node.js version, etc.)
Development
Project Structure
src/
├── types.ts                  # Type definitions
├── index.ts                  # Main MCP server
├── clients/
│   ├── voyage-client.ts      # Voyage AI client
│   └── qdrant-client.ts      # Qdrant client
├── parsers/
│   └── code-parser.ts        # Tree-sitter based parser
└── services/
    ├── indexing-service.ts   # Indexing orchestration
    └── search-service.ts     # Search functionality
Adding New Languages
- Install the tree-sitter grammar:
npm install tree-sitter-rust
- Update the loadLanguage function in code-parser.ts
- Add language configuration in initializeLanguageConfigs
- Update the file extension mapping
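The file extension mapping from the last step might look like the sketch below. `EXTENSION_TO_LANGUAGE` and `detectLanguage` are hypothetical names, and the actual structure in code-parser.ts may differ. The entries shown are the extensions listed as fully supported in this README; a new language would add its own line.

```typescript
// Hypothetical extension-to-language lookup; adding a language means adding an entry.
const EXTENSION_TO_LANGUAGE: Readonly<Record<string, string>> = {
  ".js": "javascript",
  ".jsx": "javascript",
  ".ts": "typescript",
  ".tsx": "typescript",
  ".py": "python",
  // ".rs": "rust",  // example entry once a tree-sitter-rust grammar is wired in
};

/** Return the language for a file path, or undefined when the extension is unknown. */
function detectLanguage(filePath: string): string | undefined {
  const dot = filePath.lastIndexOf(".");
  if (dot === -1) return undefined;
  return EXTENSION_TO_LANGUAGE[filePath.slice(dot).toLowerCase()];
}
```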
Testing
Run tests with:
npm test
Linting
Check code style with:
npm run lint
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License
Acknowledgments
- Voyage AI for embedding API
- Qdrant for vector database
- Tree-sitter for code parsing
- Model Context Protocol for the protocol specification
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
Development Setup
# Clone the repository
git clone <repository-url>
cd mcp-codebase-indexing-server
# Install dependencies
npm install
# Build the project
npm run build
# Run in development mode
npm run dev
# Run tests
npm test
Adding New Languages
- Install the tree-sitter grammar:
npm install tree-sitter-rust
- Update the loadLanguage function in src/parsers/code-parser.ts
- Add language configuration in initializeLanguageConfigs
- Update the file extension mapping
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Voyage AI for providing excellent code embeddings
- Qdrant for the powerful vector database
- Tree-sitter for robust code parsing
- Model Context Protocol for the standardized AI integration protocol
📈 Changelog
v1.0.0 - Production Release
- ✅ Complete MCP protocol implementation with 12 tools
- ✅ Lazy initialization to prevent connection timeouts
- ✅ Custom SSE implementation for Cursor compatibility
- ✅ Support for JavaScript, TypeScript, Python
- ✅ Voyage AI integration for semantic embeddings
- ✅ Qdrant integration for vector storage
- ✅ Incremental indexing with file change tracking
- ✅ Automated Fly.io deployment with GitHub Actions
🔗 Related Projects
- Model Context Protocol - Official MCP documentation
- MCP Servers - Official MCP server implementations
- Cursor - AI-powered code editor with MCP support
