@knath2000/codebase-indexing-mcp

v1.0.0

Published

10 months ago

MCP server for codebase indexing with Voyage AI embeddings and Qdrant vector storage

knath2000

mcp codebase indexing embeddings semantic-search vector-database voyage-ai qdrant

MCP Codebase Indexing Server

A Model Context Protocol (MCP) server that provides codebase indexing and semantic search for AI assistants such as Cursor. It uses Voyage AI for embeddings and Qdrant for vector storage, so an assistant can search your whole codebase by meaning rather than by keyword alone.

✨ Features

🧠 Intelligent Code Parsing: Uses tree-sitter to parse code into meaningful chunks (functions, classes, modules, etc.)
🔍 Semantic Search: Leverages Voyage AI embeddings for semantic code search beyond keyword matching
📊 Vector Storage: Uses Qdrant for efficient vector storage and lightning-fast similarity search
🌐 Multiple Language Support: Supports JavaScript, TypeScript, Python, and more
⚡ Incremental Indexing: Tracks file changes and only re-indexes when necessary
🎯 Flexible Search: Search by language, chunk type, file path, or semantic similarity
🔗 Context-Aware: Provides code context and related chunks for better understanding
🚀 MCP Compatible: Works seamlessly with Cursor and other MCP-compatible AI assistants
🛠️ 12 Powerful Tools: Complete set of indexing and search tools for comprehensive codebase management
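
Incremental indexing can be pictured as comparing a fingerprint of each file with the one recorded on the previous run. The sketch below is only an illustration of that idea: the `FileIndexState` class, its method names, and the FNV-1a fingerprint are assumptions, not the server's actual change-tracking code.

```typescript
// Hypothetical sketch of change tracking for incremental indexing.
// The real server may rely on modification times, content hashes, or another scheme.

/** 32-bit FNV-1a hash, used here as a cheap content fingerprint. */
function fingerprint(content: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

class FileIndexState {
  private readonly seen = new Map<string, string>();

  /** Returns true when the file is new or its content changed since the last call. */
  needsReindex(filePath: string, content: string): boolean {
    const current = fingerprint(content);
    if (this.seen.get(filePath) === current) return false;
    this.seen.set(filePath, current);
    return true;
  }

  /** Forget a file, for example after it has been removed from the index. */
  forget(filePath: string): void {
    this.seen.delete(filePath);
  }
}
```

A caller would skip parsing and embedding whenever `needsReindex` returns `false`, which is what keeps repeated indexing runs cheap.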

🏗️ Architecture

graph TB
    subgraph "AI Assistant"
        A[Cursor/Claude]
    end
    
    subgraph "MCP Server"
        B[HTTP Server<br/>Custom SSE + JSON-RPC]
        C[IndexingService]
        D[SearchService]
        E[Code Parser<br/>Tree-sitter]
    end
    
    subgraph "External Services"
        F[Voyage AI<br/>Embeddings]
        G[Qdrant<br/>Vector DB]
    end
    
    A <--> B
    B --> C
    B --> D
    C --> E
    C --> F
    C --> G
    D --> F
    D --> G
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style F fill:#fff3e0
    style G fill:#e8f5e8

The server consists of several key components:

Code Parser: Tree-sitter based parser that extracts semantic chunks from code
Voyage Client: Handles embedding generation via Voyage AI API
Qdrant Client: Manages vector storage and similarity search
Indexing Service: Orchestrates the indexing process
Search Service: Provides semantic search capabilities
MCP Server: Exposes tools via the Model Context Protocol
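
The components above can be read as two small pipelines: indexing (parse, embed, store) and search (embed the query, look up the nearest vectors). The following sketch wires them together with in-memory stand-ins. Every name in it (`CodeChunk`, `Embedder`, `VectorStore`, `LetterCountEmbedder`, `MemoryStore`, `indexChunks`, `searchChunks`) is hypothetical, and the toy embedder has nothing to do with Voyage AI's models.

```typescript
// Hypothetical, in-memory stand-ins for the components listed above.
// The real clients call Voyage AI and Qdrant over the network and are async.

interface CodeChunk { id: string; filePath: string; content: string; }
interface Embedder { embed(texts: string[]): number[][]; }
interface VectorStore {
  upsert(chunk: CodeChunk, vector: number[]): void;
  nearest(vector: number[], limit: number): CodeChunk[];
}

/** Toy embedder: counts the letters a-z, so texts sharing letters get similar vectors. */
class LetterCountEmbedder implements Embedder {
  embed(texts: string[]): number[][] {
    return texts.map((text) => {
      const counts: number[] = [];
      for (let i = 0; i < 26; i++) counts.push(0);
      const lower = text.toLowerCase();
      for (let i = 0; i < lower.length; i++) {
        const idx = lower.charCodeAt(i) - 97;
        if (idx >= 0 && idx < 26) counts[idx] = (counts[idx] ?? 0) + 1;
      }
      return counts;
    });
  }
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

/** Cosine similarity; returns 0 when either vector is all zeros. */
function cosine(a: number[], b: number[]): number {
  const norm = Math.sqrt(dot(a, a) * dot(b, b));
  return norm === 0 ? 0 : dot(a, b) / norm;
}

class MemoryStore implements VectorStore {
  private readonly rows: { chunk: CodeChunk; vector: number[] }[] = [];

  upsert(chunk: CodeChunk, vector: number[]): void {
    const existing = this.rows.findIndex((row) => row.chunk.id === chunk.id);
    if (existing >= 0) this.rows[existing] = { chunk, vector };
    else this.rows.push({ chunk, vector });
  }

  nearest(vector: number[], limit: number): CodeChunk[] {
    return this.rows
      .slice()
      .sort((x, y) => cosine(y.vector, vector) - cosine(x.vector, vector))
      .slice(0, limit)
      .map((row) => row.chunk);
  }
}

/** Indexing path: embed every chunk, then store chunk and vector together. */
function indexChunks(chunks: CodeChunk[], embedder: Embedder, store: VectorStore): void {
  const vectors = embedder.embed(chunks.map((chunk) => chunk.content));
  chunks.forEach((chunk, i) => store.upsert(chunk, vectors[i] ?? []));
}

/** Search path: embed the query, then ask the store for its nearest chunks. */
function searchChunks(query: string, embedder: Embedder, store: VectorStore, limit = 5): CodeChunk[] {
  const vectors = embedder.embed([query]);
  return store.nearest(vectors[0] ?? [], limit);
}
```

Keeping the embedder and the store behind small interfaces is one way to let both services share the same clients, as the diagram suggests.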

📦 Installation

NPM Package (Recommended)

# Install globally
npm install -g mcp-codebase-indexing-server

# Or run directly with npx
npx mcp-codebase-indexing-server

Docker

# Pull and run
# VOYAGE_API_KEY is required (see Configuration)
docker run -p 3001:3001 -e VOYAGE_API_KEY=your_key_here ghcr.io/your-org/mcp-codebase-indexing-server:latest

From Source

git clone <repository-url>
cd mcp-codebase-indexing-server
npm install
npm run build
npm start

🚀 Quick Start

Prerequisites

Node.js 18+
Voyage AI API key (available from https://www.voyageai.com/)
Qdrant instance (local or cloud)
AI assistant that supports MCP (like Cursor)

5-Minute Setup

Get your services ready:

# Start local Qdrant
docker run -d -p 6333:6333 --name qdrant qdrant/qdrant

# Get Voyage AI API key from https://www.voyageai.com/

Deploy the server:

git clone <repository-url>
cd mcp-codebase-indexing-server
npm install && npm run build
VOYAGE_API_KEY=your_key_here npm start

Connect to Cursor:
- Add MCP server in Cursor settings
- Use server URL: http://localhost:3001
- You should see a green circle with 12 tools available

Test it out:
- Index your codebase: "Index the current directory"
- Search your code: "Find authentication functions in TypeScript"

Prerequisites (Detailed)

Node.js 18+
Voyage AI API key
Qdrant instance (local or cloud)

Installation

Clone the repository:

git clone <repository-url>
cd mcp-codebase-indexing-server

Install dependencies:

npm install

Build the server:

npm run build

🎯 Cursor Integration Guide

Setting Up MCP Server in Cursor

Open Cursor Settings:
- Go to Settings → Features → Model Context Protocol

Add MCP Server:

{
  "name": "codebase-indexing",
  "command": "node",
  "args": ["path/to/your/mcp-codebase-indexing-server/dist/index.js"],
  "env": {
    "VOYAGE_API_KEY": "your_voyage_api_key_here",
    "QDRANT_URL": "http://localhost:6333"
  }
}

Verify Connection:
- Look for green circle indicator in Cursor
- Should show "12 tools" when connected
- If red circle: check logs and troubleshooting section

Using with Cursor

Indexing Your Codebase

"Index the current directory for semantic search"
"Index the src/ folder in my project"
"Re-index the modified files in my codebase"

Searching Your Code

"Find authentication functions in TypeScript"
"Search for error handling patterns"
"Look for database query functions"
"Find classes that handle user data"
"Show me similar functions to the one I'm looking at"

Getting Code Context

"Get context around the login function"
"Show me similar code to this authentication logic"
"Find related functions in this file"

Troubleshooting Cursor Connection

| Issue | Solution |
|-------|----------|
| Red circle (0 tools) | Check VOYAGE_API_KEY is set correctly |
| "No server info found" | Restart Cursor completely |
| Connection timeout | Ensure Qdrant is running on correct port |
| Tools not responding | Check server logs for errors |

⚙️ Customization Guide

For Different Project Types

Large Enterprise Codebases

# Handle large codebases efficiently
BATCH_SIZE=50
MAX_FILE_SIZE=2097152
CHUNK_SIZE=1000            # Upper end of the documented 100-1000 range
EXCLUDE_PATTERNS=node_modules,dist,build,.git,coverage,logs

AI/ML Projects

# Optimize for Python-heavy codebases
SUPPORTED_EXTENSIONS=.py,.ipynb,.md,.yaml,.yml
EMBEDDING_MODEL=voyage-code-2
CHUNK_SIZE=1000            # Upper end of the documented 100-1000 range

Frontend Projects

# Focus on web technologies
SUPPORTED_EXTENSIONS=.js,.jsx,.ts,.tsx,.vue,.svelte,.css,.scss
EXCLUDE_PATTERNS=node_modules,dist,build,.next,coverage
CHUNK_SIZE=800

Microservices Architecture

# Index multiple service repositories
COLLECTION_NAME=microservices-org
BATCH_SIZE=100
# Consider separate instances per service

Advanced Configuration Options

Performance Tuning

# Memory optimization
BATCH_SIZE=25              # Smaller batches for memory-constrained environments
CHUNK_OVERLAP=100          # Reduce overlap to save storage
MAX_FILE_SIZE=1048576      # Limit file size (1MB default)

# Speed optimization  
BATCH_SIZE=200             # Larger batches for faster processing
EMBEDDING_MODEL=voyage-code-2  # Optimized model for code

Custom File Filtering

# Include only specific file types
SUPPORTED_EXTENSIONS=.py,.js,.ts,.go,.rs

# Exclude testing and generated files
EXCLUDE_PATTERNS=*test*,*spec*,generated,vendor,node_modules

# Include documentation
SUPPORTED_EXTENSIONS=.md,.rst,.txt,.py,.js,.ts
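
To make the interaction between `SUPPORTED_EXTENSIONS` and `EXCLUDE_PATTERNS` concrete, here is a minimal sketch of a file filter. It assumes that a pattern matches a single path segment and that `*` is a wildcard within that segment; the server's real matching rules are not documented here and may differ. The function names `parseList`, `patternToRegExp`, and `shouldIndex` are made up for the example.

```typescript
// Hypothetical filter; the server's actual pattern matching may differ.

/** Parse a comma-separated value such as ".py,.js,.ts" into a clean list. */
function parseList(value: string): string[] {
  return value
    .split(",")
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
}

/** Turn a simple pattern ("*test*", "node_modules") into a path-segment RegExp. */
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
    .replace(/\*/g, "[^/]*");              // "*" matches within one path segment
  return new RegExp("(^|/)" + escaped + "(/|$)");
}

/** A file is indexed when no exclude pattern matches and its extension is supported. */
function shouldIndex(filePath: string, extensions: string[], excludePatterns: string[]): boolean {
  const normalized = filePath.replace(/\\/g, "/");
  if (excludePatterns.some((pattern) => patternToRegExp(pattern).test(normalized))) {
    return false;
  }
  return extensions.some((ext) => normalized.endsWith(ext));
}
```

Note that under this assumption a broad pattern such as `*test*` also excludes any directory or file whose name merely contains "test", so narrower patterns may be preferable in practice.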

Multi-Environment Setup

# Development
COLLECTION_NAME=dev-codebase
QDRANT_URL=http://localhost:6333

# Staging  
COLLECTION_NAME=staging-codebase
QDRANT_URL=https://staging-qdrant.company.com

# Production
COLLECTION_NAME=prod-codebase
QDRANT_URL=https://qdrant.company.com
QDRANT_API_KEY=prod_api_key

🔒 Privacy & Security

Your Code Stays Private

The MCP server is designed with privacy as a core principle:

Small Code Chunks Only

Chunk Size: Only small code segments (100-1000 characters) are sent for embedding
Default: 800 characters maximum per chunk (configurable)
Enforcement: Automatic truncation of larger chunks with logging
No Full Files: Complete files are never sent to external services
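
As a rough illustration of how a size limit and an overlap interact, the sketch below splits text into fixed character windows. This is not how the server chunks code (it parses with Tree-sitter and works on syntax nodes); `chunkText` and its parameters are hypothetical and only mirror the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings.

```typescript
// Hypothetical character-window chunker, for illustration only.
// The server itself chunks along tree-sitter syntax nodes.
function chunkText(text: string, chunkSize = 800, overlap = 100): string[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be at least 0 and smaller than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap; // each window starts this far after the previous one
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // the last window reached the end
  }
  return chunks;
}
```

With the defaults, a 2,000-character input yields three chunks, none longer than 800 characters, and consecutive chunks share 100 characters.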

One-Way Mathematical Representations

Embeddings: Code chunks are converted to numerical vectors (embeddings) by the Voyage AI API
Hard to Reverse: Embeddings are not designed to be converted back into the original code
Semantic Only: Vectors capture meaning, not exact text
Chunks Only: Individual chunks are sent to Voyage AI for embedding; complete files are not

Local Processing

Parsing: All code parsing happens locally using Tree-sitter
Chunking: Code segmentation occurs on your machine
Storage: Only vector embeddings stored in your Qdrant instance
Search: Semantic search runs on your infrastructure

Network Security

HTTPS: All external API calls use TLS encryption
API Keys: Securely stored in environment variables
No Logging: Code content is never logged to external services
Minimal Data: Only individual code chunks and search queries are sent to Voyage AI for embedding

Privacy Configuration

# Privacy-optimized settings
CHUNK_SIZE=800                    # Max 800 chars per chunk (100-1000 range)
CHUNK_OVERLAP=100                 # Reduced overlap for privacy
MAX_FILE_SIZE=1048576             # 1MB file size limit
EXCLUDE_PATTERNS=*.git*,node_modules/**,dist/**  # Skip sensitive directories

📝 Configuration

The server is configured via environment variables:

Required Environment Variables

VOYAGE_API_KEY: Your Voyage AI API key

Optional Environment Variables

QDRANT_URL: Qdrant server URL (default: http://localhost:6333)
QDRANT_API_KEY: Qdrant API key (if using cloud instance)
COLLECTION_NAME: Name of the Qdrant collection (default: codebase)
EMBEDDING_MODEL: Voyage AI model to use (default: voyage-code-3)
BATCH_SIZE: Batch size for embedding generation (default: 100)
CHUNK_SIZE: Maximum chunk size in characters (default: 800, range: 100-1000)
CHUNK_OVERLAP: Overlap between chunks (default: 100)
MAX_FILE_SIZE: Maximum file size to index in bytes (default: 1048576)
EXCLUDE_PATTERNS: Comma-separated patterns to exclude (default: see config)
SUPPORTED_EXTENSIONS: Comma-separated file extensions to support (default: see config)
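
A loader for these variables might look like the sketch below. The defaults are the ones listed above; the `ServerConfig` shape, the function names, and the decision to clamp `CHUNK_SIZE` into the 100-1000 range (rather than reject out-of-range values) are assumptions made for the example.

```typescript
// Hypothetical loader; defaults mirror the list above, names are assumptions.
interface ServerConfig {
  voyageApiKey: string;
  qdrantUrl: string;
  qdrantApiKey: string | undefined;
  collectionName: string;
  embeddingModel: string;
  batchSize: number;
  chunkSize: number;
  chunkOverlap: number;
  maxFileSize: number;
}

type Env = Record<string, string | undefined>;

/** Read an integer variable, falling back to a default when it is unset or blank. */
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number.parseInt(raw, 10);
  if (Number.isNaN(parsed)) throw new Error(name + " must be an integer, got: " + raw);
  return parsed;
}

function loadConfig(env: Env): ServerConfig {
  const voyageApiKey = env["VOYAGE_API_KEY"];
  if (!voyageApiKey) throw new Error("VOYAGE_API_KEY is required");

  // Assumption: out-of-range chunk sizes are clamped into the documented range.
  const chunkSize = Math.min(1000, Math.max(100, intFromEnv(env, "CHUNK_SIZE", 800)));

  return {
    voyageApiKey,
    qdrantUrl: env["QDRANT_URL"] || "http://localhost:6333",
    qdrantApiKey: env["QDRANT_API_KEY"],
    collectionName: env["COLLECTION_NAME"] || "codebase",
    embeddingModel: env["EMBEDDING_MODEL"] || "voyage-code-3",
    batchSize: intFromEnv(env, "BATCH_SIZE", 100),
    chunkSize,
    chunkOverlap: intFromEnv(env, "CHUNK_OVERLAP", 100),
    maxFileSize: intFromEnv(env, "MAX_FILE_SIZE", 1048576),
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)` once at startup.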

Example Configuration

Create a .env file in the project root:

VOYAGE_API_KEY=your_voyage_api_key_here
QDRANT_URL=http://localhost:6333
COLLECTION_NAME=my_codebase
EMBEDDING_MODEL=voyage-code-2
BATCH_SIZE=50
MAX_FILE_SIZE=2097152

Usage

Running the Server

npm start

Or in development mode:

npm run dev

Setting up Qdrant

Local Qdrant (Docker)

docker run -p 6333:6333 qdrant/qdrant

Qdrant Cloud

🛠️ MCP Tools

The server provides the following tools, grouped into indexing, search, and statistics and cache management:

index_directory: Index all files in a directory recursively
index_file: Index a single file
reindex_file: Re-index a file (force update)
remove_file: Remove a file from the index
clear_index: Clear the entire search index

codebase_search: 🌟 Natural language search for codebase understanding (e.g., "How is user authentication handled?", "Database connection setup", "Error handling patterns")
search_code: Search for code chunks using semantic similarity
search_functions: Search for functions by name or description
search_classes: Search for classes by name or description
find_similar: Find code chunks similar to a given chunk
get_code_context: Get code context around a specific chunk

get_indexing_stats: Get statistics about the indexed codebase
get_search_stats: Get statistics about the search index
get_enhanced_stats: Get enhanced statistics including cache and hybrid search metrics
get_health_status: Get comprehensive health status of all services
clear_search_cache: Clear search cache for fresh results
invalidate_file_cache: Invalidate cache for a specific file

Example Usage

Index a directory:

{
  "tool": "index_directory",
  "arguments": {
    "directory_path": "/path/to/your/codebase"
  }
}

🌟 Natural language codebase search:

{
  "tool": "codebase_search",
  "arguments": {
    "query": "How is user authentication handled?",
    "limit": 5,
    "enable_hybrid": true,
    "enable_reranking": true
  }
}

Search for authentication functions:

{
  "tool": "search_functions",
  "arguments": {
    "query": "authentication login user",
    "language": "typescript",
    "limit": 5
  }
}

Search for error handling patterns:

{
  "tool": "search_code",
  "arguments": {
    "query": "error handling exception try catch",
    "chunk_type": "function",
    "threshold": 0.7
  }
}
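
The examples above use a shorthand with `tool` and `arguments` keys. On the wire, MCP is built on JSON-RPC 2.0 and a tool invocation is a `tools/call` request whose `params` carry the tool `name` and its `arguments`. The helper below is a small sketch of that envelope; `buildToolCall` and `ToolCallRequest` are hypothetical names, and how this server's HTTP and SSE transport delivers the message is not covered here.

```typescript
// Sketch of the JSON-RPC 2.0 envelope for an MCP tool call.
// buildToolCall is a hypothetical helper, not part of this package.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Equivalent of the "Index a directory" example above.
const indexRequest = buildToolCall(1, "index_directory", {
  directory_path: "/path/to/your/codebase",
});
```

MCP clients such as Cursor build this envelope themselves, so it mostly matters when you are testing the server by hand.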

🌟 Natural Language Search Examples

The codebase_search tool understands natural language queries and provides:

Relevant code snippets with syntax highlighting
File paths with line numbers for direct navigation
Similarity scores as percentages
Clickable navigation links to jump to specific locations

Example queries that work great:

"How is user authentication handled?"
"Database connection setup"
"Error handling patterns"
"API endpoint definitions"
"Component state management"
"Configuration loading"
"Logging implementation"

Sample output format:

# 🔍 Natural Language Codebase Search

**Query:** "How is user authentication handled?"

## 📊 Search Results
- **Found:** 8 relevant code references
- **Search Time:** 45ms
- **Hybrid Search:** ✅ (Dense + Sparse)
- **LLM Re-ranked:** ✅ (Relevance optimized)

## 📝 Code References with Navigation Links

### 1. [📂 src/auth/auth-service.ts:15](file://src/auth/auth-service.ts#L15)
**Lines 15-28** | **function** | **typescript** | **Similarity: 94.2%**

```typescript
async authenticateUser(token: string): Promise<User | null> {
  try {
    const decoded = jwt.verify(token, this.secretKey);
    return await this.userRepository.findById(decoded.userId);
  } catch (error) {
    logger.error('Authentication failed:', error);
    return null;
  }
}
```

## 🌐 Supported Languages

| Language | File Extensions | Status |
|----------|----------------|--------|
| **JavaScript** | `.js`, `.jsx` | ✅ Full Support |
| **TypeScript** | `.ts`, `.tsx` | ✅ Full Support |
| **Python** | `.py` | ✅ Full Support |
| **Go** | `.go` | 🔄 Coming Soon |
| **Rust** | `.rs` | 🔄 Coming Soon |
| **Java** | `.java` | 🔄 Coming Soon |

> 💡 **Extensible**: Additional languages can be added by installing the corresponding tree-sitter grammars and updating the configuration.

## API Reference

### Indexing Service

The `IndexingService` class provides:

```typescript
// Initialize the service
await indexingService.initialize();

// Index a directory
const stats = await indexingService.indexDirectory('/path/to/code');

// Index a single file
const chunks = await indexingService.indexFile('/path/to/file.ts');

// Remove a file from index
await indexingService.removeFile('/path/to/file.ts');

// Clear entire index
await indexingService.clearIndex();
```

Search Service

The SearchService class provides:

// Initialize the service
await searchService.initialize();

// Basic search
const results = await searchService.search({
  query: 'authentication',
  language: 'typescript',
  limit: 10
});

// Search functions
const functions = await searchService.searchFunctions('login', 'typescript');

// Find similar chunks
const similar = await searchService.findSimilar('chunk_id', 5);

// Get code context
const context = await searchService.getCodeContext('chunk_id', 5);

Performance Considerations

Batch Processing: The server processes files in batches to avoid memory issues
Incremental Updates: Only re-indexes files that have changed
Embedding Caching: Consider caching embeddings to reduce API calls
Vector Storage: Qdrant provides efficient vector storage and retrieval
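
Batching and embedding caching, as mentioned above, can be sketched in a few lines. The helpers below are hypothetical (`toBatches`, `withCache`, and the `EmbedFn` type are not part of the package), and the embed function is synchronous only to keep the example short; a real embedding call is an asynchronous HTTP request.

```typescript
// Hypothetical helpers; the real embedding call is an async HTTP request.

/** Split items into consecutive batches of at most batchSize elements. */
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

type EmbedFn = (texts: string[]) => number[][];

/** Wrap an embed function so that identical texts are only embedded once. */
function withCache(embed: EmbedFn): EmbedFn {
  const cache = new Map<string, number[]>();
  return (texts: string[]): number[][] => {
    // Only texts that were never seen before go to the underlying embedder.
    const missing = Array.from(new Set(texts.filter((text) => !cache.has(text))));
    if (missing.length > 0) {
      const vectors = embed(missing);
      missing.forEach((text, i) => cache.set(text, vectors[i] ?? []));
    }
    return texts.map((text) => cache.get(text) ?? []);
  };
}
```

Caching by chunk text pays off most when re-indexing files in which only a few chunks changed, since unchanged chunks then cost no API calls.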

🔧 Troubleshooting

Common Issues

MCP Connection Issues

| Problem | Symptoms | Solution |
|---------|----------|----------|
| Server won't start | Error: EADDRINUSE | Port 3001 already in use. Change PORT env var or kill existing process |
| Connection timeout | Cursor shows "connecting..." forever | Check VOYAGE_API_KEY is valid and Qdrant is running |
| Red circle in Cursor | 0 tools shown | Restart Cursor completely, verify server is running |
| "Not connected" error | Tools fail with connection error | Server restarted automatically, wait 30 seconds |

Service Connection Issues

Connection to Qdrant fails:

# Check if Qdrant is running
curl http://localhost:6333/collections
   
# Start Qdrant if not running
docker run -d -p 6333:6333 --name qdrant qdrant/qdrant
   
# Check that something is listening on port 6333
netstat -tulpn | grep 6333

Voyage AI API errors:

# Test API key
curl -H "Authorization: Bearer YOUR_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"input": ["test"], "model": "voyage-code-2"}' \
     https://api.voyageai.com/v1/embeddings
   
# Check quota at https://www.voyageai.com/dashboard

Performance Issues

Out of memory during indexing:

# Reduce memory usage
BATCH_SIZE=25
MAX_FILE_SIZE=524288
CHUNK_SIZE=500
   
# Exclude large directories
EXCLUDE_PATTERNS=node_modules,dist,build,.git,logs,coverage,vendor

Slow indexing performance:

# Optimize for speed
BATCH_SIZE=100
CHUNK_OVERLAP=100
   
# Use faster embedding model if available
EMBEDDING_MODEL=voyage-code-2

Code Parsing Issues

Tree-sitter parsing errors:
- Error: Language not supported
  - Solution: Add tree-sitter grammar for your language
- Error: Failed to parse file
  - Solution: Check file encoding (must be UTF-8)
- Error: File too large
  - Solution: Increase MAX_FILE_SIZE or exclude the file

Diagnostic Commands

Check Server Health

# Test server is running
curl http://localhost:3001/health

# Test MCP endpoint
curl http://localhost:3001/sse

# Check server logs
npm start 2>&1 | tee server.log

Check Services

# Test Qdrant
curl http://localhost:6333/collections

# Test Voyage AI
curl -H "Authorization: Bearer $VOYAGE_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"input":["test"],"model":"voyage-code-2"}' \
     https://api.voyageai.com/v1/embeddings

Debug Indexing

# Enable debug mode
DEBUG=1 npm start

# Test specific directory
curl -X POST http://localhost:3001/tools/call \
  -H "Content-Type: application/json" \
  -d '{"tool":"index_directory","arguments":{"directory_path":"./test"}}'

Debug Mode

Enable comprehensive logging:

# Full debug output
DEBUG=1 npm start

# Service-specific debugging
DEBUG=indexing npm start
DEBUG=search npm start
DEBUG=mcp npm start

Log Analysis

Look for these patterns in logs:

| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| Error: VOYAGE_API_KEY is required | Missing API key | Set VOYAGE_API_KEY environment variable |
| Failed to connect to Qdrant | Vector DB unavailable | Check Qdrant is running and accessible |
| Rate limit exceeded | API quota reached | Wait or upgrade Voyage AI plan |
| Memory usage warning | High memory usage | Reduce BATCH_SIZE or exclude more files |
| Lazy initialization completed | Services ready | Normal startup, server ready for requests |

Getting Help

Check server logs for specific error messages
Test each service individually using diagnostic commands
Verify environment variables are set correctly
Restart services in order: Qdrant → MCP Server → Cursor
Create minimal reproduction with a small test directory

If issues persist, create a GitHub issue with:

Complete error logs
Environment configuration (without API keys)
Steps to reproduce
System information (OS, Node.js version, etc.)

Development

Project Structure

src/
├── types.ts              # Type definitions
├── index.ts              # Main MCP server
├── clients/
│   ├── voyage-client.ts  # Voyage AI client
│   └── qdrant-client.ts  # Qdrant client
├── parsers/
│   └── code-parser.ts    # Tree-sitter based parser
└── services/
    ├── indexing-service.ts # Indexing orchestration
    └── search-service.ts   # Search functionality

Adding New Languages

Install the tree-sitter grammar:

npm install tree-sitter-rust

Update the loadLanguage function in code-parser.ts
Add language configuration in initializeLanguageConfigs
Update the file extension mapping
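
The last step, the file extension mapping, could be pictured as a lookup table like the one below. This is only a sketch under assumptions: the names `extensionToLanguage`, `registerLanguage`, and `languageForFile` are invented for the example, and the actual mapping in code-parser.ts may be structured differently.

```typescript
// Hypothetical extension map; the real mapping in code-parser.ts may differ.
const extensionToLanguage = new Map<string, string>([
  [".js", "javascript"],
  [".jsx", "javascript"],
  [".ts", "typescript"],
  [".tsx", "typescript"],
  [".py", "python"],
]);

/** Register every extension of a newly added language. */
function registerLanguage(extensions: string[], language: string): void {
  for (let i = 0; i < extensions.length; i++) {
    extensionToLanguage.set(extensions[i].toLowerCase(), language);
  }
}

/** Resolve a file path to a language name, or undefined when unsupported. */
function languageForFile(filePath: string): string | undefined {
  const normalized = filePath.replace(/\\/g, "/");
  const fileName = normalized.slice(normalized.lastIndexOf("/") + 1);
  const dotIndex = fileName.lastIndexOf(".");
  if (dotIndex <= 0) return undefined; // no extension, or a dotfile such as ".env"
  return extensionToLanguage.get(fileName.slice(dotIndex).toLowerCase());
}
```

After installing tree-sitter-rust, a call such as `registerLanguage([".rs"], "rust")` would make `.rs` files resolve to the new language in this sketch.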

Testing

Run tests with:

npm test

Linting

Check code style with:

npm run lint

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests
Submit a pull request

License

MIT License

Acknowledgments

Voyage AI for embedding API
Qdrant for vector database
Tree-sitter for code parsing
Model Context Protocol for the protocol specification

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Clone the repository
git clone <repository-url>
cd mcp-codebase-indexing-server

# Install dependencies
npm install

# Build the project
npm run build

# Run in development mode
npm run dev

# Run tests
npm test

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Voyage AI for providing excellent code embeddings
Qdrant for the powerful vector database
Tree-sitter for robust code parsing
Model Context Protocol for the standardized AI integration protocol

📈 Changelog

v1.0.0 - Production Release

✅ Complete MCP protocol implementation with 12 tools
✅ Lazy initialization to prevent connection timeouts
✅ Custom SSE implementation for Cursor compatibility
✅ Support for JavaScript, TypeScript, Python
✅ Voyage AI integration for semantic embeddings
✅ Qdrant integration for vector storage
✅ Incremental indexing with file change tracking
✅ Automated Fly.io deployment with GitHub Actions

🔗 Related Projects

Model Context Protocol - Official MCP documentation
MCP Servers - Official MCP server implementations
Cursor - AI-powered code editor with MCP support
