@dimitrk/mcp-search v0.1.9
# MCP Search
A production-ready Model Context Protocol (MCP) server for web search and semantic page content retrieval with local vector caching. Built for AI agents that need reliable, fast, and contextually relevant web information.
## ✨ Features
- 🔍 Google Custom Search: Batch queries with rate limiting and error recovery
- 🧠 Semantic Page Reading: Extract and chunk content with embedding-based similarity search
- 💾 Local Vector Caching: DuckDB + VSS extension for persistent, fast retrieval
- 🛡️ Production Security: Input validation, content filtering, graceful degradation
- 📊 Observability: Structured logging, correlation IDs, performance metrics
- 🐳 Container Ready: Docker support with multi-platform builds
- ⚡ High Performance: P50 < 300ms cached, < 3s first-time extraction
- 🔧 CLI Tools: Health checks, database inspection, cleanup utilities
## 🚀 Quick Start

### Prerequisites

Follow this guide to create your Google Search API credentials: Programmable Search Engine.

### Installing the MCP through npm

#### Install Playwright (optional, enables crawling SPAs)

```bash
# Additionally install Playwright with the Chromium browser. This is a peer
# dependency that allows the MCP to crawl SPAs.
npx playwright install --with-deps chromium
```

#### Install the MCP

Add the server to your MCP client configuration:
```json
{
  "mcpServers": {
    "web-search": {
      "command": "npx",
      "args": ["-y", "@dimitrk/mcp-search"],
      "env": {
        "GOOGLE_API_KEY": "[ENTER GOOGLE API KEY]",
        "GOOGLE_SEARCH_ENGINE_ID": "[ENTER GOOGLE SEARCH ID]",
        "EMBEDDING_SERVER_URL": "https://api.openai.com/v1",
        "EMBEDDING_SERVER_API_KEY": "[OPEN AI KEY]",
        "EMBEDDING_MODEL_NAME": "text-embedding-3-small",
        "SIMILARITY_THRESHOLD": "0.72"
      }
    }
  }
}
```

### Installing the MCP through Docker
```json
{
  "mcpServers": {
    "web-search": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "-e", "GOOGLE_API_KEY",
        "-e", "GOOGLE_SEARCH_ENGINE_ID",
        "-e", "EMBEDDING_SERVER_URL",
        "-e", "EMBEDDING_SERVER_API_KEY",
        "-e", "EMBEDDING_MODEL_NAME",
        "-e", "SIMILARITY_THRESHOLD",
        "-v", "mcp_data:/app/data",
        "dimitrisk/mcp-search:latest"
      ],
      "env": {
        "GOOGLE_API_KEY": "[ENTER GOOGLE API KEY]",
        "GOOGLE_SEARCH_ENGINE_ID": "[ENTER GOOGLE SEARCH ENGINE ID]",
        "EMBEDDING_SERVER_URL": "https://api.openai.com/v1",
        "EMBEDDING_SERVER_API_KEY": "[YOUR OPEN AI KEY]",
        "EMBEDDING_MODEL_NAME": "text-embedding-3-small",
        "SIMILARITY_THRESHOLD": "0.72"
      }
    }
  }
}
```

## 🔧 Configuration
### Environment Variables Reference
| Variable | Required | Default | Description |
| -------------------------- | -------- | --------------- | ----------------------------------- |
| GOOGLE_API_KEY | ✅ | - | Google Custom Search API key |
| GOOGLE_SEARCH_ENGINE_ID | ✅ | - | Google Custom Search Engine ID |
| EMBEDDING_SERVER_URL | ✅ | - | OpenAI-compatible embedding API URL |
| EMBEDDING_SERVER_API_KEY | ✅ | - | API key for embedding service |
| EMBEDDING_MODEL_NAME | ✅ | - | Model name for embeddings |
| DATA_DIR | ❌ | OS app data dir | Data storage directory |
| SIMILARITY_THRESHOLD | ❌ | 0.6 | Minimum similarity score (0-1) |
| EMBEDDING_TOKENS_SIZE | ❌ | 512 | Chunk size in tokens |
| REQUEST_TIMEOUT_MS | ❌ | 20000 | HTTP request timeout |
| CONCURRENCY | ❌ | 2 | Max concurrent requests |
| VECTOR_DB_MODE             | ❌        | inline          | inline, thread, or process          |
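To illustrate how `SIMILARITY_THRESHOLD` behaves at query time, here is a minimal sketch; the chunk data and variable names are invented for illustration and this is not the server's actual code:

```js
// Chunks scoring below SIMILARITY_THRESHOLD are dropped from results.
const threshold = parseFloat(process.env.SIMILARITY_THRESHOLD ?? '0.6');

const chunks = [
  { text: 'Highly relevant passage', score: 0.81 },
  { text: 'Loosely related passage', score: 0.58 },
  { text: 'Unrelated passage', score: 0.12 },
];

const kept = chunks.filter((c) => c.score >= threshold);
console.log(kept.length); // 1 with the default threshold of 0.6
```

Raising the threshold (e.g. the 0.72 used in the Quick Start configs) trades recall for precision.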
## 📖 Using it as a library

### Command Line Interface
```bash
# Start MCP server
mcp-search server

# Health check
mcp-search health --verbose

# Database inspection
mcp-search inspect --stats
mcp-search inspect --url "https://example.com"

# Clean up old data
mcp-search cleanup --days 30 --vacuum
```

### MCP Client Integration
Connect to the MCP server from any MCP-compatible client:
```bash
# Using MCP Inspector for debugging
npx @modelcontextprotocol/inspector mcp-search
```

```js
// Programmatic usage (Node.js)
const { Client } = require('@modelcontextprotocol/sdk/client/index.js');

const client = new Client({
  name: 'mcp-search-client',
  version: '1.0.0',
});
```

### Tool Usage Examples
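Each `callTool` example below corresponds to a JSON-RPC `tools/call` request on the wire. A rough sketch of that message (the `id` and argument values are illustrative, not prescribed):

```js
// Approximate shape of the MCP request behind client.callTool().
const request = {
  jsonrpc: '2.0',
  id: 1,
  method: 'tools/call',
  params: {
    name: 'web.search',
    arguments: { query: 'latest AI developments', resultsPerQuery: 5 },
  },
};

console.log(request.method); // tools/call
```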
#### Web Search

```js
// Single query
const result = await client.callTool({
  name: 'web.search',
  arguments: {
    query: 'latest AI developments',
    resultsPerQuery: 5,
  },
});

// Multiple queries in parallel
const results = await client.callTool({
  name: 'web.search',
  arguments: {
    query: ['machine learning', 'neural networks', 'transformers'],
    resultsPerQuery: 3,
  },
});
```

#### Semantic Page Reading
```js
// Extract and search page content
const pageResults = await client.callTool({
  name: 'web.readFromPage',
  arguments: {
    url: 'https://example.com/article',
    query: ['main findings', 'methodology', 'conclusions'],
    maxResults: 8,
    forceRefresh: false,
  },
});

// Returns semantically relevant text chunks with similarity scores
console.log(pageResults.queries[0].results[0]);
// {
//   id: 'chunk-abc123',
//   text: 'Relevant content excerpt...',
//   score: 0.87,
//   sectionPath: ['Introduction', 'Key Findings']
// }
```

### Performance Tuning
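As a rough intuition for the settings below: `EMBEDDING_TOKENS_SIZE` controls how many chunks, and therefore embedding calls, a page produces. A back-of-the-envelope sketch with an invented page length:

```js
// Larger chunks mean fewer embedding requests per page, at the cost of
// coarser retrieval granularity. 8192 tokens is a hypothetical page size.
const pageTokens = 8192;

for (const chunkSize of [256, 512, 1024]) {
  const numChunks = Math.ceil(pageTokens / chunkSize);
  console.log(`EMBEDDING_TOKENS_SIZE=${chunkSize} -> ~${numChunks} chunks`);
}
// EMBEDDING_TOKENS_SIZE=256 -> ~32 chunks
// EMBEDDING_TOKENS_SIZE=512 -> ~16 chunks
// EMBEDDING_TOKENS_SIZE=1024 -> ~8 chunks
```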
```bash
# High-performance setup
CONCURRENCY=8
EMBEDDING_TOKENS_SIZE=1024
SIMILARITY_THRESHOLD=0.7
REQUEST_TIMEOUT_MS=30000
VECTOR_DB_MODE=thread

# Memory-optimized setup
CONCURRENCY=1
EMBEDDING_TOKENS_SIZE=256
VECTOR_DB_MODE=inline

# Accuracy-focused setup
SIMILARITY_THRESHOLD=0.7
EMBEDDING_TOKENS_SIZE=512
```

## 🛠️ Development
### Prerequisites
- Node.js 20+ (22+ recommended)
- npm 9+
- Docker (optional, for containerized development)
- Git
### Setup

```bash
# Clone repository
git clone https://github.com/dimitrk/mcp-search.git
cd mcp-search

# Install dependencies
npm install

# Set up environment
cp .env.example .env
# Edit .env with your API keys

# Build project
npm run build

# Run health check
npm run health
```

### Environment Setup
Create a `.env` file:

```bash
# Required
GOOGLE_API_KEY=your_google_api_key_here
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id_here
EMBEDDING_SERVER_URL=https://api.openai.com/v1
EMBEDDING_SERVER_API_KEY=your_openai_api_key_here
EMBEDDING_MODEL_NAME=text-embedding-3-small  # Embedding model of your choice

# Optional (with defaults)
DATA_DIR=~/.mcp-search       # Data storage location
SIMILARITY_THRESHOLD=0.6     # Similarity cutoff (0-1)
EMBEDDING_TOKENS_SIZE=512    # Chunk size in tokens
REQUEST_TIMEOUT_MS=20000     # HTTP timeout
CONCURRENCY=2                # Concurrent requests
```

### Development Scripts
```bash
# Development
npm run dev              # Start in development mode
npm run dev:mock         # Use mock APIs for testing
npm run build:watch      # Watch mode build

# Testing
npm test                 # Run all tests
npm run test:unit        # Unit tests only
npm run test:integration # Integration tests only
npm run test:coverage    # Coverage report
npm run test:performance # Performance benchmarks

# Quality
npm run lint             # ESLint check
npm run lint:fix         # Auto-fix linting issues
npm run format           # Prettier formatting
npm run typecheck        # TypeScript validation

# Database
npm run db:inspect       # Inspect database contents
npm run cleanup          # Clean old data

# Production
npm start                # Production server
npm run health:verbose   # Detailed health check
```

### Testing
```bash
# Run specific test suites
npm run test:unit -- --testNamePattern="chunker"
npm run test:integration -- --testNamePattern="readFromPage"

# Debug tests
npm run test:debug

# Performance benchmarks
npm run test:performance -- --verbose
```

## 📊 Architecture
### System Overview
```text
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   MCP Client    │──────│    MCP Server    │──────│  Google Search  │
│   (AI Agent)    │      │                  │      │       API       │
└─────────────────┘      └──────────────────┘      └─────────────────┘
                                  │
                                  │
                         ┌──────────────────┐      ┌─────────────────┐
                         │     Content      │──────│    Embedding    │
                         │    Extraction    │      │       API       │
                         └──────────────────┘      └─────────────────┘
                                  │
                                  │
                         ┌──────────────────┐      ┌─────────────────┐
                         │      DuckDB      │──────│     Vector      │
                         │     Database     │      │     Search      │
                         └──────────────────┘      └─────────────────┘
```

### Data Flow
1. Search Request: Client sends an MCP tool call
2. Content Fetching: HTTP client retrieves web content
3. Content Extraction: Multi-stage extraction (Readability → Cheerio → SPA)
4. Semantic Chunking: Intelligent content segmentation
5. Embedding Generation: Vector representations via API
6. Vector Storage: DuckDB + VSS for persistence
7. Similarity Search: Semantic matching for queries
8. Response: Ranked, relevant content chunks
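The chunk/embed/rank steps can be sketched end to end. Everything below is a toy illustration: `embed()` is a letter-frequency stand-in for a real embedding model, and none of these functions exist in the actual codebase.

```js
// Simplified illustration of the chunk -> embed -> rank pipeline.
function embed(text) {
  // Toy "embedding": a 26-dimensional letter-frequency vector.
  const v = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i] += 1;
  }
  return v;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

const chunks = [
  'Transformers dominate modern NLP research.',
  'The stadium hosted a football match.',
];
const query = 'transformer models for NLP';

const ranked = chunks
  .map((text) => ({ text, score: cosine(embed(query), embed(text)) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked[0].text); // the NLP chunk ranks first
```

The real server persists the vectors in DuckDB (via the VSS extension) so repeat queries skip the fetch and embed steps entirely.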
### Key Components
- MCP Server: Protocol-compliant tool server
- HTTP Fetcher: Robust content retrieval with retries
- Content Extractors: Multi-strategy HTML processing
- Semantic Chunker: Token-aware content segmentation
- Vector Store: DuckDB with VSS extension
- Embedding Service: OpenAI-compatible API integration
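To make the "token-aware segmentation" idea concrete, here is a toy chunker that splits on a fixed word budget; the real chunker counts model tokens and respects document structure, so treat this purely as a sketch:

```js
// Toy chunker: groups whitespace-separated words into fixed-size chunks.
// Real token-aware chunking uses a tokenizer, not words.
function chunkWords(text, wordsPerChunk) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(' '));
  }
  return chunks;
}

console.log(chunkWords('one two three four five six seven', 3));
// [ 'one two three', 'four five six', 'seven' ]
```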
## 🐳 Docker Deployment

### Basic Deployment

```bash
# Pull image
docker pull dimitrisk/mcp-search:latest

# Run container
docker run -d \
  --name mcp-search \
  --env-file .env \
  -v mcp_data:/app/data \
  -p 3000:3000 \
  dimitrisk/mcp-search:latest
```

### Docker Compose (Recommended)
```yaml
# docker-compose.yml
version: '3.8'
services:
  mcp-search:
    image: dimitrisk/mcp-search:latest
    container_name: mcp-search
    restart: unless-stopped
    env_file: .env
    volumes:
      - mcp_data:/app/data
    healthcheck:
      test: ['CMD', 'node', 'dist/cli.js', 'health']
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
  mcp_data:
```

### Production Deployment
```bash
# Use production compose file
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Monitor logs
docker-compose logs -f mcp-search

# Health check
docker-compose exec mcp-search node dist/cli.js health --verbose
```

## 🔍 Troubleshooting
### Common Issues

#### Environment Variables Missing

```bash
# Check current environment
mcp-search health --verbose

# Validate specific variables
echo $GOOGLE_API_KEY | wc -c  # Should be >30 characters
```

#### Database Issues
```bash
# Check database status
mcp-search inspect --stats

# Reset database
mcp-search cleanup --days 0 --vacuum

# Manual database reset
rm ~/.mcp-search/db/mpc.duckdb
```

#### Performance Issues
```bash
# Check system resources
mcp-search health --verbose

# Reduce concurrency
export CONCURRENCY=1

# Increase timeouts
export REQUEST_TIMEOUT_MS=30000
```

#### Network/API Issues
```bash
# Test Google API
curl "https://www.googleapis.com/customsearch/v1?key=$GOOGLE_API_KEY&cx=$GOOGLE_SEARCH_ENGINE_ID&q=test"

# Test embedding API
curl -X POST "$EMBEDDING_SERVER_URL/embeddings" \
  -H "Authorization: Bearer $EMBEDDING_SERVER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "'$EMBEDDING_MODEL_NAME'", "input": "test"}'
```

#### Debug Mode
```bash
# Enable verbose logging
DEBUG=mcp-search:* mcp-search server

# Use development configuration
NODE_ENV=development mcp-search server

# Run with MCP Inspector
npx @modelcontextprotocol/inspector mcp-search
```

## 🔧 API Reference
### Tool Schemas

#### web.search

```ts
interface SearchInput {
  query: string | string[]; // Search queries
  resultsPerQuery?: number; // 1-50, default 5
}

interface SearchOutput {
  queries: Array<{
    query: string;
    result: unknown; // Raw Google JSON
  }>;
}
```

#### web.readFromPage
```ts
interface ReadFromPageInput {
  url: string;               // Target URL
  query: string | string[];  // Search queries
  forceRefresh?: boolean;    // Skip cache, default false
  maxResults?: number;       // 1-50, default 8
  includeMetadata?: boolean; // Extra metadata, default false
}

interface ReadFromPageOutput {
  url: string;
  title?: string;
  lastCrawled: string;
  queries: Array<{
    query: string;
    results: Array<{
      id: string;             // Stable chunk ID
      text: string;           // Content text
      score: number;          // Similarity score 0-1
      sectionPath?: string[]; // Document structure
    }>;
  }>;
  note?: string; // Degradation notices
}
```

## 🏗️ Contributing
### Development Workflow

1. Fork & Clone: Fork the repository and clone it locally
2. Branch: Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Develop: Write code following our standards
4. Test: Ensure all tests pass (`npm test`)
5. Commit: Use conventional commits (`git commit -m 'feat: add amazing feature'`)
6. Push: Push to your fork (`git push origin feature/amazing-feature`)
7. PR: Open a Pull Request with a detailed description
### Code Standards
- TypeScript: Strict mode, explicit types
- ESLint: Airbnb config with custom rules
- Prettier: Consistent formatting
- Jest: >90% test coverage requirement
- Conventional Commits: For changelog generation
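The conventional-commit rule can be checked mechanically; a minimal sketch (the exact type list a commitlint config accepts may differ):

```js
// Toy conventional-commit validator; the type list is illustrative.
const CONVENTIONAL =
  /^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([\w-]+\))?!?: .+/;

console.log(CONVENTIONAL.test('feat: add amazing feature')); // true
console.log(CONVENTIONAL.test('added some stuff'));          // false
```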
Release Process
# Version bump (patch/minor/major)
npm version patch
# Push tags
git push origin --tags
# GitHub Actions will:
# 1. Run full test suite
# 2. Security scan
# 3. Build Docker images
# 4. Publish to NPM
# 5. Create GitHub release📋 Roadmap
- [ ] v1.1: PDF and document parsing support
- [ ] v1.2: Local embedding models (node-llama-cpp)
- [ ] v1.3: Advanced chunking strategies (code, tables)
- [ ] v1.4: Vector database alternatives (Qdrant, Weaviate)
- [ ] v1.5: Robots.txt compliance toggle
- [ ] v2.0: GraphQL schema introspection tool
## 📄 License
MIT License - see LICENSE file for details.
## 🙏 Acknowledgments
- Model Context Protocol - Protocol specification
- DuckDB - In-process analytical database
- VSS Extension - Vector similarity search
- Mozilla Readability - Content extraction
- Playwright - Browser automation
Built with ❤️ for the AI agent ecosystem
