@teknologika/mcp-local-knowledge
v0.1.4
Local-first semantic search for documents using MCP protocol. Requires Python 3.10+ and Docling (pip install docling).
A local-first semantic search system for documents using the Model Context Protocol (MCP)
📋 Table of Contents
- Overview
- Features
- Installation
- Quick Start
- Usage
- Configuration
- MCP Client Configuration
- Supported Document Formats
- Architecture
- Troubleshooting
- Development
- Contributing
- License
Overview
The Local Knowledge MCP Server enables AI assistants to search and retrieve information from your document collections using semantic search. It supports PDFs, Word documents, presentations, spreadsheets, and more—all processed locally without cloud dependencies.
Why Use This?
- Find Information Fast: AI assistants can search across all your documents semantically
- Privacy-First: All processing happens locally—your documents never leave your machine
- Multi-Format Support: PDFs, DOCX, PPTX, XLSX, HTML, Markdown, text files, and audio
- Smart Chunking: Structure-aware document chunking preserves context and meaning
- Fast & Efficient: Optimized for quick search responses with intelligent caching
- Easy Integration: Works seamlessly with Claude Desktop and other MCP clients
Features
- 🔒 Local-First: All operations run locally without external API calls
- 🔍 Semantic Search: Find information by meaning, not just keywords
- 📄 Multi-Format Support: PDF, DOCX, PPTX, XLSX, HTML, Markdown, text, audio
- 🤖 MCP Integration: Seamless integration with MCP-compatible AI assistants (Claude Desktop, etc.)
- 🧠 Smart Chunking: Structure-aware document chunking preserves context and hierarchy
- 📝 Docling Integration: Powerful document conversion with OCR support for scanned PDFs
- 🖥️ Web Management UI: Manage knowledge bases through a browser interface
- ⚡ Performance Optimized: Sub-500ms search responses with intelligent caching
- 🎯 Smart Filtering: Filter by document type, exclude test documents
- 📊 Detailed Statistics: Track chunk counts, document counts, and format distribution
- 🔄 Gitignore Support: Respects .gitignore patterns during ingestion
- 🎤 Audio Transcription: Automatic transcription of audio files using Whisper ASR
Installation
Global Installation (Recommended)
npm install -g @teknologika/mcp-local-knowledge

This makes three commands available globally:
- mcp-local-knowledge - MCP server for AI assistants
- mcp-knowledge-ingest - CLI for indexing documents
- mcp-knowledge-manager - Web UI for management
Local Installation
npm install @teknologika/mcp-local-knowledge

Then use with npx:
npx mcp-knowledge-ingest --path ./my-documents --name my-documents
npx mcp-local-knowledge
npx mcp-knowledge-manager

Requirements
- Node.js: 23.0.0 or higher
- npm: 10.0.0 or higher
- Python: 3.10 or higher (for document conversion)
- Disk Space: ~500MB for embedding models (downloaded on first use)
Python Dependencies
This package uses Docling for document conversion (PDF, DOCX, PPTX, XLSX, HTML, and more). You need to install Docling separately:
pip install docling

What is Docling?
Docling is a powerful document conversion library that transforms various document formats into markdown while preserving structure, tables, and formatting. It includes OCR support for scanned PDFs and handles complex document layouts intelligently.
Verifying Installation:
After installing, verify Docling is available:
python -c "import docling; print('Docling installed successfully')"
Quick Start
1. Index Your First Knowledge Base
mcp-knowledge-ingest --path ./my-documents --name my-documents

Example Output:
Ingesting knowledge base: my-documents
Path: /Users/dev/documents/my-documents
Supported formats: PDF, DOCX, PPTX, XLSX, HTML, Markdown, Text, Audio
Scanning directory: [████████████████████] 100% (234/234)
Converting documents: [████████████████████] 100% (200/200)
Chunking documents: [████████████████████] 100% (200/200)
Generating embeddings: [████████████████████] 100% (1,234/1,234)
Storing chunks: [████████████████████] 100% (1,234/1,234)
✓ Ingestion completed successfully!
Statistics:
Total files scanned: 234
Supported files: 200
Unsupported files: 34
Chunks created: 1,234
Duration: 45.2s
Document types:
pdf: 450 chunks (50 files)
docx: 380 chunks (40 files)
markdown: 280 chunks (80 files)
  text: 124 chunks (30 files)

2. Configure Your MCP Client
For Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"local-knowledge": {
"command": "mcp-local-knowledge",
"args": []
}
}
}

3. Start Using in Your AI Assistant
Once configured, your AI assistant can use these tools:
- list_knowledgebases: See all indexed knowledge bases
- search_knowledgebases: Search for information semantically
- get_knowledgebase_stats: View detailed statistics
- open_knowledgebase_manager: Launch and open the Manager UI in your browser
4. (Optional) Explore the Manager UI
mcp-knowledge-manager

Opens http://localhost:8009 in your default browser with a visual interface for:
- Searching documents with filters
- Managing indexed knowledge bases
- Viewing statistics and metadata
- Adding new knowledge bases with real-time progress
Usage
Ingestion CLI
The mcp-knowledge-ingest command indexes documents for semantic search.
Basic Usage
mcp-knowledge-ingest --path <directory> --name <knowledge-base-name>

Options
| Option | Description | Required | Example |
|--------|-------------|----------|---------|
| -p, --path | Path to document directory | Yes | --path ./my-documents |
| -n, --name | Unique name for the knowledge base | Yes | --name my-documents |
| -c, --config | Path to configuration file | No | --config ./config.json |
| --no-gitignore | Disable .gitignore filtering | No | --no-gitignore |
Examples
Index a document folder:
mcp-knowledge-ingest --path ~/Documents/work --name work-docs

Index with custom config:

mcp-knowledge-ingest --path ./reports --name quarterly-reports --config ./custom-config.json

Index without gitignore filtering:

mcp-knowledge-ingest --path ./my-documents --name my-documents --no-gitignore

Re-index an existing knowledge base:

# Simply run the same command again - old data is automatically replaced
mcp-knowledge-ingest --path ~/Documents/work --name work-docs

What Gets Indexed?
- ✅ All files with supported extensions (.pdf, .docx, .pptx, .xlsx, .html, .md, .txt, .mp3, .wav, .m4a, .flac)
- ✅ Files in nested subdirectories (recursive scanning)
- ✅ Semantic document chunks (paragraphs, sections, tables, headings)
- ✅ Metadata tags (document type, page numbers, heading hierarchy)
- ❌ Files larger than 50MB (configurable via maxFileSize)
- ❌ Files in .gitignore (by default; use --no-gitignore to include them)
- ❌ Binary files and unsupported formats
- ❌ Hidden directories (starting with .)
MCP Server
The MCP server exposes tools for AI assistants to search and explore knowledge bases.
Starting the Server
mcp-local-knowledge

The server runs in stdio mode and communicates with MCP clients via standard input/output.
Available Tools
1. list_knowledgebases
Lists all indexed knowledge bases with metadata.
Input: None
Output:
{
"knowledgebases": [
{
"name": "my-documents",
"path": "/path/to/documents",
"chunkCount": 5678,
"documentCount": 450,
"lastIngestion": "2024-01-15T10:30:00Z",
"documentTypes": ["pdf", "docx", "markdown", "text"]
}
]
}

2. search_knowledgebases
Performs semantic search across indexed knowledge bases.
Input:
{
"query": "project timeline and milestones",
"knowledgebaseName": "my-documents", // Optional
"documentType": "pdf", // Optional
"maxResults": 25 // Optional (default: 50)
}

Output:
{
"results": [
{
"filePath": "reports/Q4-2024-Report.pdf",
"content": "Project Timeline: The Q4 milestones include...",
"documentType": "pdf",
"chunkType": "section",
"pageNumber": 5,
"headingPath": ["Executive Summary", "Project Timeline"],
"similarityScore": 0.92,
"knowledgebaseName": "my-documents"
}
],
"totalResults": 1,
"queryTime": 45
}

3. get_knowledgebase_stats
Retrieves detailed statistics for a specific knowledge base.
Input:
{
"name": "my-documents"
}

Output:
{
"name": "my-documents",
"path": "/path/to/documents",
"chunkCount": 5678,
"documentCount": 450,
"lastIngestion": "2024-01-15T10:30:00Z",
"documentTypes": [
{ "type": "pdf", "documentCount": 200, "chunkCount": 3200 },
{ "type": "docx", "documentCount": 150, "chunkCount": 1500 },
{ "type": "markdown", "documentCount": 100, "chunkCount": 978 }
],
"chunkTypes": [
{ "type": "paragraph", "count": 2500 },
{ "type": "section", "count": 1200 },
{ "type": "table", "count": 978 }
],
"sizeBytes": 2500000
}

4. open_knowledgebase_manager
Opens the web-based Manager UI in the default browser. Automatically launches the server if it's not already running.
Input: None
Output:
{
"success": true,
"url": "http://localhost:8009",
"serverStarted": true,
"message": "Manager UI opened in browser. Server was started."
}

Note: The tool checks if the Manager server is running on the configured port. If not, it launches the server in the background before opening the browser.
Manager UI
The Manager UI provides a web-based interface for managing indexed knowledge bases.
Starting the Manager
mcp-knowledge-manager

This will:
- Start a Fastify server on port 8009 (configurable)
- Automatically open http://localhost:8009 in your default browser
- Display all indexed knowledge bases with statistics
Features
Search Tab:
- Semantic search across all knowledge bases
- Filter by knowledge base and max results
- Filter by document type (PDF, DOCX, etc.)
- Exclude test documents checkbox
- Collapsible results with color-coded confidence scores:
- 🟢 Green (0.80-1.00): Excellent match
- 🟡 Yellow (0.60-0.79): Good match
- 🔵 Blue (0.00-0.59): Lower match
Manage Knowledge Bases Tab:
- View all indexed knowledge bases
- See chunk counts, document counts, and last indexed date
- Add new knowledge bases with real-time progress tracking
- Rename knowledge bases
- Remove knowledge bases
- Gitignore filtering checkbox (checked by default)
Ingest Tab:
- Drag-and-drop file upload
- Folder selection and upload
- Real-time progress tracking for each file
- Support for all document formats
Manager Controls:
- Quit Manager button with confirmation dialog (stops server and closes browser tab)
Configuration
The system can be configured using a JSON configuration file. The default location is ~/.knowledge-base/config.json.
Automatic Setup
On first run, the system automatically:
- Creates the ~/.knowledge-base/ directory structure
- Generates a default config.json file with sensible defaults
- Creates subdirectories for:
  - lancedb/ - Vector database storage
  - models/ - Embedding model cache
  - tmp/ - Temporary file uploads
No manual setup is required - just run any command and the system will initialize itself.
Configuration File Example
{
"lancedb": {
"persistPath": "~/.knowledge-base/lancedb"
},
"embedding": {
"modelName": "Xenova/all-MiniLM-L6-v2",
"cachePath": "~/.knowledge-base/models"
},
"server": {
"port": 8009,
"host": "localhost",
"sessionSecret": "change-me-in-production"
},
"mcp": {
"transport": "stdio"
},
"ingestion": {
"batchSize": 100,
"maxFileSize": 52428800
},
"search": {
"defaultMaxResults": 50,
"cacheTimeoutSeconds": 60
},
"document": {
"conversionTimeout": 30000,
"maxTokens": 512,
"chunkSize": 1000,
"chunkOverlap": 200
},
"logging": {
"level": "info"
},
"schemaVersion": "1.0.0"
}

Configuration Options
LanceDB Settings
| Option | Description | Default |
|--------|-------------|---------|
| persistPath | Directory for LanceDB storage | ~/.knowledge-base/lancedb |
Embedding Settings
| Option | Description | Default |
|--------|-------------|---------|
| modelName | Hugging Face model for embeddings | Xenova/all-MiniLM-L6-v2 |
| cachePath | Directory for model cache | ~/.knowledge-base/models |
Server Settings
| Option | Description | Default |
|--------|-------------|---------|
| port | Port for Manager UI server | 8009 |
| host | Host for Manager UI server | localhost |
| sessionSecret | Secret for session cookies | Auto-generated |
Ingestion Settings
| Option | Description | Default |
|--------|-------------|---------|
| batchSize | Documents per batch during ingestion | 100 |
| maxFileSize | Maximum file size in bytes | 52428800 (50MB) |
Document Settings
| Option | Description | Default |
|--------|-------------|---------|
| conversionTimeout | Document conversion timeout (ms) | 30000 (30s) |
| maxTokens | Maximum tokens per chunk | 512 |
| chunkSize | Fallback chunk size (characters) | 1000 |
| chunkOverlap | Fallback chunk overlap (characters) | 200 |
Search Settings
| Option | Description | Default |
|--------|-------------|---------|
| defaultMaxResults | Default maximum search results | 50 |
| cacheTimeoutSeconds | Search result cache timeout | 60 |
Logging Settings
| Option | Description | Default | Options |
|--------|-------------|---------|---------|
| level | Log level | info | debug, info, warn, error |
Custom Configuration
To use a custom configuration file:
# For ingestion
mcp-knowledge-ingest --config ./my-config.json --path ./documents --name my-docs
# For MCP server (via environment variable)
CONFIG_PATH=./my-config.json mcp-local-knowledge

MCP Client Configuration
Claude Desktop
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json
{
"mcpServers": {
"local-knowledge": {
"command": "mcp-local-knowledge",
"args": []
}
}
}

Using Codex MCP CLI
If you have Codex installed, you can add the server with a single command:
codex mcp add local-knowledge \
--env CONFIG_PATH=~/.knowledge-base/config.json \
--env LOG_LEVEL=info \
-- mcp-local-knowledge

This automatically configures the MCP server in your client without manual JSON editing.
Other MCP Clients
For other MCP-compatible clients, use the stdio transport:
{
"mcpServers": {
"local-knowledge": {
"command": "mcp-local-knowledge",
"args": [],
"env": {
"CONFIG_PATH": "~/.knowledge-base/config.json",
"LOG_LEVEL": "info"
}
}
}
}

Verifying Configuration
After configuring your MCP client:
- Restart the client application
- Check that the local-knowledge server appears in the MCP server list
- Try using the list_knowledgebases tool to verify connectivity
Supported Document Formats
The system uses Docling for document conversion and processing. All documents are converted to markdown with structure preservation.
| Format | Extensions | Features |
|--------|-----------|----------|
| PDF | .pdf | OCR support for scanned documents, table extraction, image detection |
| Word | .docx, .doc | Formatting preservation, table extraction, heading hierarchy |
| PowerPoint | .pptx, .ppt | Slide content extraction, speaker notes, embedded text |
| Excel | .xlsx, .xls | Table data extraction, sheet names, cell formatting |
| HTML | .html, .htm | Structure preservation, semantic elements, link extraction |
| Markdown | .md, .markdown | Native processing, heading hierarchy, code blocks |
| Text | .txt | Plain text processing, paragraph detection |
| Audio | .mp3, .wav, .m4a, .flac | Automatic transcription using Whisper ASR |
What Gets Extracted?
For each document, the system extracts:
- Content: Full text content converted to markdown
- Structure: Headings, sections, paragraphs, lists
- Tables: Tabular data with formatting
- Metadata: Title, page count, word count, format, images, tables
- Context: Heading hierarchy for each chunk
- Page Numbers: For paginated documents (PDF, DOCX, PPTX)
Document Chunking
Documents are split into semantic chunks using Docling's HybridChunker:
- Structure-Aware: Respects document hierarchy (headings, sections)
- Token-Aware: Configurable max tokens per chunk (default: 512)
- Context-Preserved: Includes heading path for each chunk
- Type-Tagged: Each chunk labeled as paragraph, section, table, heading, list, or code
File Classification
The system automatically classifies files during ingestion:
Test Documents (tagged with isTest: true):
- Files with "test" or "spec" in the path
- Files in test/, tests/, or spec/ directories
These tags enable filtering in search results.
Architecture
System Overview
┌────────────────────────────────────────────────────────────┐
│ Entry Points │
├──────────────┬──────────────────┬──────────────────────────┤
│ MCP Server │ Ingestion CLI │ Manager UI │
│ (stdio) │ (command-line) │ (web interface) │
└──────┬───────┴────────┬─────────┴──────────┬───────────────┘
│ │ │
│ │ │
┌──────▼────────────────▼────────────────────▼───────────────┐
│ Core Services │
├─────────────┬──────────────┬──────────────┬────────────────┤
│ Knowledge │ Search │ Ingestion │ Embedding │
│ Base │ Service │ Service │ Service │
│ Service │ │ │ │
└──────┬──────┴──────┬───────┴──────┬───────┴────────┬───────┘
│ │ │ │
│ │ │ │
┌──────▼─────────────▼──────────────▼────────────────▼───────┐
│ Storage & External │
├──────────────┬──────────────────┬─────────────────────────┤
│ LanceDB │ Docling SDK │ Hugging Face │
│ (Vector DB) │ (Doc Convert) │ (Embeddings) │
└──────────────┴──────────────────┴─────────────────────────┘

Component Responsibilities
MCP Server (mcp-local-knowledge)
- Exposes tools via Model Context Protocol
- Validates inputs and outputs
- Handles stdio communication
Ingestion CLI (mcp-knowledge-ingest)
- Scans directories recursively
- Respects .gitignore patterns
- Converts documents with Docling
- Chunks documents with HybridChunker
- Classifies test documents
- Generates embeddings
- Stores chunks in LanceDB
Manager UI (mcp-knowledge-manager)
- Fastify web server with SSR
- Real-time ingestion progress via SSE
- Search interface with filters
- Knowledge base management
- File upload with drag-and-drop
Core Services
- Knowledge Base Service: CRUD operations for knowledge bases
- Search Service: Semantic search with filtering and caching
- Ingestion Service: Orchestrates document processing pipeline
- Embedding Service: Generates vector embeddings locally
- Document Converter: Converts documents to markdown via Docling
- Document Chunker: Splits documents into semantic chunks
Data Flow
Ingestion Flow
Documents → File Scanner → Document Converter (Docling) → Markdown
↓
Document Chunker
↓
LanceDB ← Embeddings ← Embedding Service ← Tagged Chunks

Search Flow
Query → Embedding Service → Vector
↓
LanceDB Search
↓
Apply Filters (document type, tests)
↓
Ranked Results → Format → Response

Storage Schema
LanceDB Tables:
- Table naming: kb_{name}_{schemaVersion}
- Example: kb_my-documents_1_0_0
Row Structure:
{
"id": "my-documents_2024-01-15T10:30:00Z_0",
"vector": [0.1, 0.2, ...],
"content": "Project Timeline: The Q4 milestones include...",
"filePath": "reports/Q4-2024-Report.pdf",
"documentType": "pdf",
"chunkType": "section",
"chunkIndex": 5,
"pageNumber": 5,
"headingPath": ["Executive Summary", "Project Timeline"],
"isTest": false,
"ingestionTimestamp": "2024-01-15T10:30:00Z",
"_knowledgebaseName": "my-documents",
"_path": "/path/to/documents",
"_lastIngestion": "2024-01-15T10:30:00Z"
}

Troubleshooting
Common Issues
1. "Command not found: mcp-local-knowledge"
Problem: The package is not installed globally or not in PATH.
Solution:
# Reinstall globally
npm install -g @teknologika/mcp-local-knowledge
# Or use npx
npx mcp-local-knowledge

2. "Failed to initialize LanceDB"
Problem: LanceDB persistence directory is not writable or corrupted.
Solution:
# Check permissions
ls -la ~/.knowledge-base/lancedb
# Reset LanceDB (WARNING: deletes all data)
rm -rf ~/.knowledge-base/lancedb
# Re-ingest knowledge bases
mcp-knowledge-ingest --path ./my-documents --name my-documents

3. "Embedding model download failed"
Problem: Network issues or insufficient disk space.
Solution:
# Check disk space
df -h ~/.knowledge-base
# Clear model cache and retry
rm -rf ~/.knowledge-base/models
# Run ingestion again (will re-download)
mcp-knowledge-ingest --path ./my-documents --name my-documents

4. "Search returns no results"
Problem: Knowledge base not indexed or query too specific.
Solution:
# Verify knowledge base is indexed
mcp-knowledge-manager
# Check the UI for your knowledge base
# Try broader queries
# Instead of: "Q4 2024 financial projections table"
# Try: "financial projections" or "quarterly report"

5. "Manager UI won't open"
Problem: Port 8009 is already in use.
Solution:
# Check what's using port 8009
lsof -i :8009
# Kill the process or use a different port
# Edit ~/.knowledge-base/config.json
{
"server": {
"port": 8010
}
}

6. "MCP client can't connect to server"
Problem: Configuration issue or server not starting.
Solution:
# Test server manually
mcp-local-knowledge
# Verify configuration path
cat ~/Library/Application\ Support/Claude/claude_desktop_config.json
# Check logs for errors

7. "Knowledge base not found" after ingestion
Problem: Knowledge base name contains special characters that were sanitized.
Explanation: Knowledge base names are sanitized to ensure compatibility with the database. Special characters (spaces, hyphens, etc.) are replaced with underscores.
Examples:
- Cloud Forge PRFAQ → Cloud_Forge_PRFAQ
- my-project-docs → my_project_docs
- Q1 2024 Reports → Q1_2024_Reports
Solution:
# Check what name was actually created
mcp-knowledge-manager
# Or use the debug script
node scripts/debug-kb.js
# Use the sanitized name when searching
# If you ingested with: --name "Cloud Forge PRFAQ"
# Use in MCP: Cloud_Forge_PRFAQ

Note: The CLI now displays the sanitized name during ingestion if it differs from your input.
Docling-Specific Issues
8. "Docling not found" or "docling-sdk is not installed"
Problem: Python Docling is not installed or not in PATH.
Solution:
# Verify Python is installed (3.10+ required)
python --version
# Install Docling
pip install docling
# Verify installation
python -c "import docling; print('Docling installed successfully')"
# If using a virtual environment, activate it first
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
pip install docling

Alternative Solution (if pip fails):
# Try with pip3
pip3 install docling
# Or use python -m pip
python -m pip install docling
# For user-level install (no sudo required)
pip install --user docling

9. Python Version Issues
Problem: Docling requires Python 3.10 or higher, but an older version is installed.
Solution:
macOS (using Homebrew):
# Install Python 3.11
brew install [email protected]
# Verify version
python3.11 --version
# Install Docling with specific Python version
python3.11 -m pip install docling

Ubuntu/Debian:
# Add deadsnakes PPA for newer Python versions
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
# Install Python 3.11
sudo apt install python3.11 python3.11-venv
# Install Docling
python3.11 -m pip install docling

Windows:
# Download Python 3.11+ from python.org
# https://www.python.org/downloads/
# After installation, verify
python --version
# Install Docling
pip install docling

Using pyenv (all platforms):
# Install pyenv (see https://github.com/pyenv/pyenv)
curl https://pyenv.run | bash
# Install Python 3.11
pyenv install 3.11.7
pyenv global 3.11.7
# Verify
python --version
# Install Docling
pip install docling

10. Docling Installation Fails on macOS
Problem: Installation fails with compiler errors or missing dependencies.
Solution:
# Install Xcode Command Line Tools
xcode-select --install
# Install required system dependencies via Homebrew
brew install cmake pkg-config
# Try installing Docling again
pip install docling
# If still failing, try with verbose output to see the error
pip install -v docling

Common macOS-specific issues:
# If you see "error: command 'clang' failed"
# Install or update Xcode Command Line Tools
sudo rm -rf /Library/Developer/CommandLineTools
xcode-select --install
# If you see "fatal error: 'Python.h' file not found"
# Install Python development headers
brew reinstall [email protected]

11. Docling Installation Fails on Linux
Problem: Missing system dependencies or compiler errors.
Solution:
Ubuntu/Debian:
# Install build essentials and Python development headers
sudo apt update
sudo apt install build-essential python3-dev python3-pip
# Install additional dependencies
sudo apt install libpoppler-cpp-dev pkg-config
# Try installing Docling again
pip install docling

Fedora/RHEL/CentOS:
# Install development tools
sudo dnf groupinstall "Development Tools"
sudo dnf install python3-devel
# Install additional dependencies
sudo dnf install poppler-cpp-devel
# Try installing Docling again
pip install docling

Arch Linux:
# Install base development packages
sudo pacman -S base-devel python-pip
# Install additional dependencies
sudo pacman -S poppler
# Try installing Docling again
pip install docling

12. Docling Installation Fails on Windows
Problem: Missing Visual C++ build tools or compilation errors.
Solution:
# Install Microsoft C++ Build Tools
# Download from: https://visualstudio.microsoft.com/visual-cpp-build-tools/
# During installation, select "Desktop development with C++"
# After installation, restart your terminal and try again
pip install docling
# Alternative: Use pre-built wheels if available
pip install --only-binary :all: docling

If using Windows Subsystem for Linux (WSL):
# Follow the Linux instructions above
# WSL provides a better environment for Python packages with native dependencies

13. Document Conversion Fails
Problem: Docling fails to convert a specific document.
Solution:
# Check if the document is corrupted
# Try opening it in its native application first
# Check file permissions
ls -l /path/to/document.pdf
# Try converting manually to see the error
python -c "from docling.document_converter import DocumentConverter; \
converter = DocumentConverter(); \
result = converter.convert('/path/to/document.pdf'); \
print(result)"
# Check Docling logs for detailed error messages
# Logs are typically in the system temp directory

Common conversion issues:
Encrypted PDFs:
# Docling cannot process password-protected PDFs
# Remove password protection first using tools like:
# - qpdf: qpdf --decrypt input.pdf output.pdf
# - pdftk: pdftk input.pdf output output.pdf

Scanned PDFs (images):
# Docling uses OCR for scanned PDFs
# Ensure tesseract is installed for better OCR results
brew install tesseract # macOS
sudo apt install tesseract-ocr  # Ubuntu/Debian

Corrupted documents:
# Try repairing the document first
# For PDFs: use tools like pdftk or ghostscript
gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf

14. Document Conversion Timeout
Problem: Large documents take too long to convert and timeout after 30 seconds.
Solution:
Increase timeout in configuration:
{
"document": {
"conversionTimeout": 60000
}
}

Or split large documents:
# For PDFs, split into smaller chunks
# Using pdftk
pdftk large.pdf burst output page_%04d.pdf
# Using ghostscript
gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
-dFirstPage=1 -dLastPage=50 \
-sOutputFile=part1.pdf large.pdf

Optimize document before conversion:
# Compress PDF to reduce size
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=compressed.pdf input.pdf

15. Performance Issues with Large Documents
Problem: Document processing is very slow or uses excessive memory.
Solution:
Adjust batch size for ingestion:
{
"ingestion": {
"batchSize": 50
}
}

Increase Node.js memory limit:
# Set max memory to 4GB
export NODE_OPTIONS="--max-old-space-size=4096"
# Then run ingestion
mcp-knowledge-ingest --path ./docs --name my-docs

Process documents in smaller batches:
# Instead of ingesting entire directory at once
# Process subdirectories separately
mcp-knowledge-ingest --path ./docs/section1 --name my-docs
mcp-knowledge-ingest --path ./docs/section2 --name my-docs

Optimize document files:
# Reduce PDF file sizes
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=optimized.pdf input.pdf
# Convert DOCX to PDF for better processing
# Use LibreOffice in headless mode
libreoffice --headless --convert-to pdf document.docx

Monitor resource usage:
# Check memory usage during ingestion
# macOS
top -pid $(pgrep -f mcp-knowledge-ingest)
# Linux
htop -p $(pgrep -f mcp-knowledge-ingest)
# Windows
# Use Task Manager or Resource Monitor

16. Audio Transcription Issues
Problem: Audio files fail to transcribe or produce poor results.
Solution:
Ensure audio format is supported:
# Supported formats: MP3, WAV, M4A, FLAC
# Convert unsupported formats using ffmpeg
ffmpeg -i input.ogg -acodec libmp3lame output.mp3

Improve transcription quality:
# Convert to WAV with optimal settings for speech recognition
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
# Reduce background noise (requires sox)
sox input.wav output.wav noisered noise-profile.txt 0.21

Check Whisper model installation:
# Docling uses Whisper for audio transcription
# Verify it's installed
python -c "import whisper; print('Whisper available')"
# If not installed
pip install openai-whisper

17. Docling CLI Not Found in PATH
Problem: System can't find the Docling CLI executable.
Solution:
# Find where pip installed Docling
pip show docling | grep Location
# Add Python scripts directory to PATH
# macOS/Linux (add to ~/.bashrc or ~/.zshrc)
export PATH="$HOME/.local/bin:$PATH"
# Windows (add to System Environment Variables)
# Add: C:\Users\YourUsername\AppData\Local\Programs\Python\Python311\Scripts
# Verify Docling is accessible
which docling # macOS/Linux
where docling  # Windows

Alternative: Use Python module directly:
# Instead of calling 'docling' command
# Use Python module invocation
python -m docling.cli convert document.pdf

Performance Tips
Increase batch size for faster ingestion (if you have sufficient RAM):
{ "ingestion": { "batchSize": 200 } }

Adjust cache timeout for frequently repeated queries:

{ "search": { "cacheTimeoutSeconds": 120 } }

Use SSD storage for the LanceDB persistence directory
Exclude unnecessary files using .gitignore patterns
Development
Setup
# Clone the repository
git clone https://github.com/teknologika/mcp-codebase-search.git
cd mcp-codebase-search
# Install dependencies
npm install
# Build the project
npm run build

Scripts
# Build TypeScript
npm run build
# Run tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with coverage
npm run test:coverage
# Lint code
npm run lint
# Fix linting issues
npm run lint:fix
# Security audit
npm run security
# Clean build artifacts
npm run clean
# Type check without building
npm run typecheck

Project Structure
src/
├── bin/ # Entry points (mcp-server, ingest, manager)
├── domains/ # Domain-specific business logic
│ ├── knowledgebase/ # Knowledge base CRUD operations
│ ├── search/ # Semantic search functionality
│ ├── ingestion/ # File scanning and indexing
│ ├── embedding/ # Embedding generation
│ └── document/ # Document conversion and chunking
├── infrastructure/ # External integrations
│ ├── lancedb/ # LanceDB client wrapper
│ ├── mcp/ # MCP server implementation
│ └── fastify/ # Fastify server and routes
├── shared/ # Shared utilities
│ ├── config/ # Configuration management
│ ├── logging/ # Structured logging with Pino
│ ├── types/ # Shared TypeScript types
│ └── utils/ # Utility functions
└── ui/ # Web interface
    └── manager/         # Single-page management UI

Testing
The project uses Vitest for testing with both unit tests and property-based tests.
Test Coverage Requirements:
- Minimum 80% statement coverage
- Minimum 80% branch coverage
- 90%+ coverage for critical paths
Run specific tests:
# Test a specific file
npm test -- src/domains/search/search.service.test.ts
# Test with coverage
npm run test:coverage
# Watch mode for TDD
npm run test:watch

Building and Packaging
# Clean and build
npm run clean && npm run build
# Create npm package
npm pack
# Install package globally for testing
npm install -g ./teknologika-mcp-local-knowledge-1.0.0.tgz
# Test commands
mcp-local-knowledge --version
mcp-knowledge-ingest --help
mcp-knowledge-manager

Contributing
We welcome contributions! Here's how you can help:
Reporting Issues
- Search existing issues to avoid duplicates
- Provide details:
- Node.js version
- Operating system
- Steps to reproduce
- Expected vs actual behavior
- Error messages and logs
Submitting Pull Requests
- Fork the repository
- Create a feature branch: git checkout -b feature/my-feature
- Make your changes:
- Follow existing code style
- Add tests for new functionality
- Update documentation
- Run tests: npm test
- Run linter: npm run lint
- Commit with clear messages: git commit -m "feat: add new feature"
- Push to your fork: git push origin feature/my-feature
- Open a pull request
Code Style
- TypeScript: Strict mode enabled
- Formatting: Follow existing patterns
- Naming: Use descriptive names (camelCase for variables, PascalCase for classes)
- Comments: Document complex logic and public APIs
- Tests: Write both unit tests and property-based tests
Commit Messages
Follow Conventional Commits:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- test: Test changes
- refactor: Code refactoring
- perf: Performance improvements
- chore: Build/tooling changes
Areas for Contribution
- 📄 Document format support: Add more format handlers
- ⚡ Performance: Optimize search and ingestion
- 🎨 UI improvements: Enhance the Manager UI
- 📚 Documentation: Improve guides and examples
- 🧪 Testing: Increase test coverage
- 🐛 Bug fixes: Fix reported issues
- 🔍 Search improvements: Better ranking algorithms
- 🏷️ Document classification: More patterns for test document detection
- 🎤 Audio processing: Improve transcription quality
- 📊 Analytics: Add usage statistics and insights
Security
Local-First Architecture
- ✅ No external API calls: All processing happens locally
- ✅ No telemetry: No usage data is collected or transmitted
- ✅ No cloud dependencies: Embeddings generated locally with Hugging Face Transformers
File System Security
- Path validation: All file paths are validated to prevent directory traversal
- Permission checks: Respects file system permissions
- Gitignore support: Automatically skips files in .gitignore
Input Validation
- Schema validation: All inputs validated with Zod schemas
- Type checking: Strict TypeScript types throughout
- Sanitization: User inputs sanitized before processing
Resource Limits
- Max file size: 50MB default (configurable)
- Max results: 200 maximum per search
- Batch size limits: Prevents memory exhaustion
Network Security
- Localhost only: Manager UI binds to localhost by default
- Security headers: Helmet.js for HTTP security headers
- Session management: Secure session cookies
Recommendations
- Do not expose Manager UI to public networks
- Keep the package updated for security patches
- Run regular security audits: npm audit
- Use strong file system permissions
- Back up data regularly before major updates
License
MIT License - see LICENSE file for details.
Author
Teknologika
Acknowledgments
- Model Context Protocol - MCP specification
- LanceDB - Vector database
- Docling - Document conversion
- Hugging Face - Embedding models
- Fastify - Web framework
Questions or Issues? Open an issue on GitHub
Need Help? Check the Troubleshooting section above
