@vespermcp/mcp-server
v1.5.2
Vesper MCP Server 🚀
AI-powered dataset discovery, quality analysis, and preparation with multimodal support (text, image, audio, video).
Vesper is a Model Context Protocol (MCP) server that helps you find, analyze, and prepare high-quality datasets for machine learning projects. It integrates seamlessly with AI assistants like Claude, providing autonomous dataset workflows.
✨ Features
🔍 Dataset Discovery
- Search across HuggingFace, Kaggle, UCI ML Repository, and more
- Intelligent ranking based on quality, safety, and relevance
- Automatic metadata extraction and enrichment
📊 Quality Analysis
- Text: Missing data, duplicates, column profiling
- Images: Resolution, corruption, blur detection
- Audio: Sample rate, duration, silence detection
- Video: FPS, frame validation, corruption risk
- Unified Reports: Consolidated quality scores (0-100) with recommendations
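As a rough illustration of the text checks listed above, the sketch below counts missing values per column and exact duplicate rows in a small in-memory table. The function name `profile_rows` and the rule that `None` and empty strings count as missing are assumptions made for this example; it is not Vesper's actual implementation.

```python
from collections import Counter


def profile_rows(rows):
    """Count missing values per column and exact duplicate rows.

    `rows` is a list of dicts, one per record. Treating None and empty
    strings as missing is an assumption made for this sketch.
    """
    missing = Counter()
    seen = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
        # Sort by column name so key order does not affect duplicate detection.
        seen[tuple(sorted(row.items(), key=lambda kv: kv[0]))] += 1
    duplicates = sum(count - 1 for count in seen.values())
    return {"rows": len(rows), "missing": dict(missing), "duplicate_rows": duplicates}
```

A report like this could feed the per-modality score that the unified report consolidates.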
🛠️ Data Preparation
- Automated cleaning pipelines
- Format conversion (CSV, JSON, Parquet)
- Train/test/validation splitting
- Automatic installation to project directories
🎯 Multimodal Support
- Analyze mixed datasets (text + images + audio)
- Media-specific quality metrics
- Intelligent modality detection
📦 Installation
🚀 Quick Start (VS Code + Copilot)
The fastest way to install Vesper and configure it for GitHub Copilot Chat or Cursor is to run the automated setup:
npx -y -p @vespermcp/mcp-server@latest vespermcp --setup
- Select Visual Studio Code (Settings.json) from the list.
- Restart VS Code.
- Open Copilot Chat and look for the MCP Servers section.
🛠️ Configuration
Vesper supports:
- GitHub Copilot Chat: Automated setup via settings.json.
- Cursor: Automated setup via mcp.json.
- Claude Desktop: Automated setup via claude_desktop_config.json.
Manual Python Setup (if needed)
pip install opencv-python pillow numpy librosa soundfile
⚙️ MCP Configuration
vesper_extract_web returns “Tool not found” (-32601)
The running MCP process is an older @vespermcp/mcp-server build that does not register vesper_extract_web. Fix: upgrade to a newer @vespermcp/mcp-server build (e.g. npx -y -p @vespermcp/mcp-server@latest vespermcp after publish), then restart the MCP server / IDE. On startup, stderr should show @vespermcp/mcp-server v1.5.x. Cursor's cached JSON under .cursor/.../mcps/ can list tools that the live server does not expose until you upgrade.
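For reference, -32601 is the JSON-RPC 2.0 "Method not found" error code. The sketch below shows one way a client script could recognise that error in a raw response; the helper name `is_tool_missing` and the sample message text are illustrative assumptions, not part of Vesper.

```python
import json

METHOD_NOT_FOUND = -32601  # JSON-RPC 2.0 "Method not found"


def is_tool_missing(raw_response):
    """Return True if a JSON-RPC response carries error code -32601."""
    message = json.loads(raw_response)
    # Successful responses have no "error" member, so fall back to {}.
    error = message.get("error") or {}
    return error.get("code") == METHOD_NOT_FOUND
```

If this returns True for vesper_extract_web, upgrading and restarting as described above is the likely fix.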
For Cursor
- Go to Settings > Features > MCP
- Click Add New MCP Server
- Enter:
  - Name: vesper
  - Type: command
  - Command: vesper
For Claude Desktop
Vesper attempts to auto-configure itself. Restart Claude Desktop and check that the server is listed. If it is not, add the following to claude_desktop_config.json:
{
  "mcpServers": {
    "vesper": {
      "command": "vesper",
      "args": [],
      "env": {
        "HF_TOKEN": "your-huggingface-token"
      }
    }
  }
}
Note: If the vesper command isn't found, you can stick to the absolute path method.
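If you edit the config by script rather than by hand, the main risk is overwriting other MCP servers already listed in it. The sketch below merges a vesper entry into an existing config dictionary without touching other entries. The function name `add_vesper_server` is hypothetical; the entry shape follows the JSON shown above.

```python
import copy


def add_vesper_server(config, hf_token=None):
    """Return a copy of `config` with a `vesper` entry under `mcpServers`."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    entry = {"command": "vesper", "args": [], "env": {}}
    if hf_token:
        entry["env"]["HF_TOKEN"] = hf_token
    # Create "mcpServers" if it is absent, then add or replace only "vesper".
    updated.setdefault("mcpServers", {})["vesper"] = entry
    return updated
```

Load the file with `json.load`, pass the result through this function, and write it back with `json.dump` to apply the change.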
Environment Variables (Optional)
- KAGGLE_USERNAME & KAGGLE_KEY: For Kaggle dataset access
- HF_TOKEN: For private HuggingFace datasets
- VESPER_TELEMETRY_ENDPOINT: Optional HTTP endpoint for lineage telemetry events (lineage.version.appended)
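A small preflight check can report which of these optional integrations are configured before you start the server. This is only a sketch: the function name `configured_integrations` and the keys of the returned dictionary are assumptions made for illustration.

```python
import os


def configured_integrations(environ=None):
    """Report which optional Vesper integrations have their env vars set."""
    environ = os.environ if environ is None else environ
    return {
        # Kaggle needs both variables to be present.
        "kaggle": bool(environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY")),
        "huggingface_private": bool(environ.get("HF_TOKEN")),
        "telemetry": bool(environ.get("VESPER_TELEMETRY_ENDPOINT")),
    }
```

Calling it with no arguments inspects the current process environment.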
Telemetry Transparency (Opt-in)
Vesper does not send telemetry unless VESPER_TELEMETRY_ENDPOINT is explicitly set.
When enabled, Vesper sends only lineage event metadata on version append:
- dataset base/version IDs
- tool name + actor metadata (agent_id, pipeline_id when provided)
- basic output metadata (local_path, rows/columns, format)
- timestamp + host name
It does not upload dataset file contents.
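To make the list above concrete, here is a sketch of what such an event might look like as a Python dictionary. Only `agent_id`, `pipeline_id`, `local_path`, and the event name `lineage.version.appended` come from this README; every other field name and the overall layout are assumptions, so check the DDL files for the real schema.

```python
import socket
from datetime import datetime, timezone


def build_lineage_event(dataset_id, version, tool, local_path, rows, columns, fmt,
                        agent_id=None, pipeline_id=None):
    """Assemble a metadata-only lineage event (hypothetical field names)."""
    event = {
        "event": "lineage.version.appended",
        "dataset_id": dataset_id,
        "version": version,
        "tool": tool,
        "output": {"local_path": local_path, "rows": rows,
                   "columns": columns, "format": fmt},
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": socket.gethostname(),
    }
    # Actor metadata is attached only when the caller provides it.
    if agent_id is not None:
        event["agent_id"] = agent_id
    if pipeline_id is not None:
        event["pipeline_id"] = pipeline_id
    return event
```

Note that the payload carries a path and row/column counts, never the rows themselves.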
Lineage Receiver (for web dashboard backend)
Vesper includes a tiny ingestion server for lineage telemetry events:
npm run telemetry:receiver
Storage backends:
- Postgres: set DATABASE_URL
- SQLite: set SQLITE_PATH (for lightweight/local deployments)
Optional env vars:
- PORT (default 8787)
- LINEAGE_INGEST_PATH (default /vesper/lineage)
Example for hosted backend:
- ingest URL: https://getvesper.dev/vesper/lineage
- client env: VESPER_TELEMETRY_ENDPOINT=https://getvesper.dev/vesper/lineage
DDL files:
- telemetry/sql/lineage_events.postgres.sql
- telemetry/sql/lineage_events.sqlite.sql
Stats endpoint for web dashboard bootstrap:
GET /vesper/lineage/stats?days=30
- Returns JSON: overview, by-tool counts, by-day counts, top datasets, recent activity.
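A dashboard client needs the full stats URL for a given receiver. The sketch below builds it from a base URL and a day window. It assumes the stats route sits directly under the ingest path, as in the default shown above; the helper name `stats_url` is hypothetical.

```python
from urllib.parse import urlencode, urljoin


def stats_url(base_url, days=30, ingest_path="/vesper/lineage"):
    """Build the stats endpoint URL for a lineage receiver."""
    query = urlencode({"days": days})
    # Assumption: the stats route lives at <ingest_path>/stats.
    return urljoin(base_url, f"{ingest_path.rstrip('/')}/stats?{query}")
```

For a local receiver on the default port, `stats_url("http://localhost:8787")` yields the URL to fetch with any HTTP client.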
Optional Kaggle Setup (Not Required)
Core Vesper works without any API keys. Keys are only needed when you explicitly use Kaggle or gated Hugging Face.
Install optional Kaggle client only if you need Kaggle source access:
pip install kaggle
vespermcp config keys
The setup wizard lets you skip steps and stores keys securely via the OS keyring when available, falling back to ~/.vesper/config.toml.
Or use Kaggle's native file:
~/.kaggle/kaggle.json
If credentials are missing and you run Kaggle commands, Vesper shows:
Kaggle support requires API key. Run 'vespermcp config keys' (30 seconds).
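To see which Kaggle credentials would be picked up on your machine, you can check the two standard sources yourself. The sketch below looks at the environment variables first and then at a kaggle.json file, which stores `username` and `key` fields. The lookup order and the function name `find_kaggle_credentials` are assumptions, and the sketch does not cover the OS keyring or ~/.vesper/config.toml.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(environ=None, kaggle_json=None):
    """Return Kaggle credentials from env vars or kaggle.json, else None."""
    environ = os.environ if environ is None else environ
    username, key = environ.get("KAGGLE_USERNAME"), environ.get("KAGGLE_KEY")
    if username and key:
        return {"username": username, "key": key, "source": "env"}
    # Fall back to Kaggle's native credentials file.
    path = Path(kaggle_json) if kaggle_json else Path.home() / ".kaggle" / "kaggle.json"
    if path.is_file():
        data = json.loads(path.read_text(encoding="utf-8"))
        if data.get("username") and data.get("key"):
            return {"username": data["username"], "key": data["key"], "source": str(path)}
    return None
```

A `None` result corresponds to the situation in which Vesper prints the message above.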
CLI Examples
vespermcp discover --source kaggle "credit risk" --limit 10
vespermcp discover --source huggingface "credit risk" --limit 10
vespermcp download kaggle username/dataset-name
vespermcp download kaggle https://www.kaggle.com/datasets/username/dataset-name --target-dir ./data
vespermcp status
vespermcp status --dir ./some/project --max-depth 3
🚀 Quick Start
After installation and configuration, restart your AI assistant and try:
search_datasets(query="sentiment analysis", limit=5)
prepare_dataset(query="image classification cats vs dogs")
generate_quality_report(
    dataset_id="huggingface:imdb",
    dataset_path="/path/to/data"
)
📚 Available Tools
Dataset Discovery
unified_dataset_api
Single facade over multiple dataset backends. Use one tool for provider capability inspection, dataset discovery, dataset download, and dataset info lookup. The gateway prefers public/keyless providers and can also use server-managed credentials for connectors like Kaggle or data.world when configured by the operator.
Parameters:
- operation (string): providers, discover, download, or info
- source (string, optional): auto, huggingface, openml, kaggle, dataworld, s3, bigquery
- query (string, required for discover)
- dataset_id (string, required for download/info)
- limit (number, optional)
- target_dir (string, optional)
- public_only (boolean, optional)
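The required and optional rules in the parameter list can be checked before a call is sent. The sketch below encodes them as a small validator. The function name `validate_request` and the default of `auto` for `source` are assumptions; the allowed values are copied from the list above.

```python
OPERATIONS = {"providers", "discover", "download", "info"}
SOURCES = {"auto", "huggingface", "openml", "kaggle", "dataworld", "s3", "bigquery"}


def validate_request(operation, source="auto", query=None, dataset_id=None):
    """Return a list of problems with a unified_dataset_api call (empty if none)."""
    problems = []
    if operation not in OPERATIONS:
        problems.append(f"unknown operation: {operation}")
    if source not in SOURCES:
        problems.append(f"unknown source: {source}")
    if operation == "discover" and not query:
        problems.append("query is required for discover")
    if operation in {"download", "info"} and not dataset_id:
        problems.append("dataset_id is required for download/info")
    return problems
```

An empty list means the arguments satisfy the documented constraints; the server may still apply checks of its own.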
Examples:
unified_dataset_api(operation="providers")
unified_dataset_api(operation="discover", query="credit risk", source="auto")
unified_dataset_api(operation="download", dataset_id="huggingface:imdb")
search_datasets
Search for datasets across multiple sources.
Parameters:
- query (string): Search query
- limit (number, optional): Max results (default: 10)
- min_quality_score (number, optional): Minimum quality threshold
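To illustrate how `limit` and `min_quality_score` interact, the sketch below applies the threshold first and the limit second to a list of candidate results. The result shape (a dict with `id` and `quality_score`) and ranking by quality alone are assumptions; Vesper's ranking also weighs safety and relevance.

```python
def filter_results(results, limit=10, min_quality_score=None):
    """Drop results below the threshold, rank by score, then cap at `limit`."""
    if min_quality_score is not None:
        results = [r for r in results if r["quality_score"] >= min_quality_score]
    ranked = sorted(results, key=lambda r: r["quality_score"], reverse=True)
    return ranked[:limit]
```

With this ordering, a strict threshold can return fewer than `limit` results.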
Example:
search_datasets(query="medical imaging", limit=5, min_quality_score=70)
Data Preparation
prepare_dataset
Download, analyze, and prepare a dataset for use.
Parameters:
- query (string): Dataset search query or ID
Example:
prepare_dataset(query="squad")
export_dataset
Export a prepared dataset to a custom directory with format conversion.
Parameters:
- dataset_id (string): Dataset identifier
- target_dir (string): Export directory
- format (string, optional): Output format (csv, json, parquet)
Example:
export_dataset(
    dataset_id="huggingface:imdb",
    target_dir="./my-data",
    format="csv"
)
vesper_download_assets
Download image/media assets to a user-controlled local directory.
Parameters:
- dataset_id (string): Dataset identifier
- source (string): huggingface, kaggle, or url
- target_dir (string, optional): Exact local directory where assets should be written
- output_dir (string, optional): Alias for target_dir
- output_format (string, optional): webdataset, imagefolder, or parquet
Example:
vesper_download_assets(
    dataset_id="cats_vs_dogs",
    source="kaggle",
    target_dir="./datasets/cats_dogs_100",
    output_format="imagefolder"
)
Quality Analysis
analyze_image_quality
Analyze image datasets for resolution, corruption, and blur.
Parameters:
- path (string): Path to image file or folder
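This README does not say how blur is measured. One common heuristic is the variance of the Laplacian: sharp images have strong edges and so a high variance, while blurred images score low. The sketch below implements that heuristic in pure Python on a 2D grid of grayscale values. It is an illustrative stand-in, not Vesper's confirmed method, and `laplacian_variance` is a hypothetical name.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a list of rows of grayscale values and must be at least 3x3.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)
```

Any cut-off between "sharp" and "blurry" depends on the dataset and has to be tuned. With OpenCV installed, a similar metric is commonly computed as `cv2.Laplacian(gray, cv2.CV_64F).var()`.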
Example:
analyze_image_quality(path="/path/to/images")
analyze_media_quality
Analyze audio/video files for quality metrics.
Parameters:
- path (string): Path to media file or folder
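Silence detection is listed among the audio checks, but this README does not describe the method. A simple and widely used approach is to split the signal into frames and flag frames whose RMS amplitude falls below a threshold. The sketch below does that on a plain list of samples; the frame size, threshold, and function name `silent_fraction` are illustrative assumptions rather than Vesper's values.

```python
import math


def silent_fraction(samples, frame_size=1024, threshold=0.01):
    """Fraction of frames whose RMS amplitude falls below `threshold`.

    `samples` are floats in [-1.0, 1.0]. The last frame may be shorter
    than `frame_size`.
    """
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    if not frames:
        return 0.0
    silent = sum(
        1 for frame in frames
        if math.sqrt(sum(s * s for s in frame) / len(frame)) < threshold
    )
    return silent / len(frames)
```

A file that is mostly silent by this measure would be a candidate for a lower quality score or for trimming.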
Example:
analyze_media_quality(path="/path/to/audio")
generate_quality_report
Generate a comprehensive unified quality report for multimodal datasets.
Parameters:
- dataset_id (string): Dataset identifier
- dataset_path (string): Path to dataset directory
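The report consolidates per-modality results into a single 0-100 score, but the formula is not documented here. One plausible scheme is a weighted mean of per-modality scores, clamped to the 0-100 range, as sketched below. The function name `unified_score`, the default weight of 1.0, and the weighting scheme itself are assumptions for illustration only.

```python
def unified_score(modality_scores, weights=None):
    """Weighted mean of per-modality scores, clamped to 0-100.

    Weights must be positive; modalities without a weight default to 1.0.
    """
    if not modality_scores:
        raise ValueError("at least one modality score is required")
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in modality_scores)
    weighted = sum(score * weights.get(name, 1.0)
                   for name, score in modality_scores.items())
    return max(0.0, min(100.0, weighted / total_weight))
```

Raising the weight of one modality pulls the unified score toward that modality's result.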
Example:
generate_quality_report(
    dataset_id="my-dataset",
    dataset_path="/path/to/data"
)
Data Splitting
split_dataset
Split a dataset into train/test/validation sets.
Parameters:
- dataset_id (string): Dataset identifier
- train_ratio (number): Training set ratio (0-1)
- test_ratio (number): Test set ratio (0-1)
- val_ratio (number, optional): Validation set ratio (0-1)
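The three ratios are expected to describe a full partition of the dataset. As a sketch of what a ratio-based split involves, the function below checks that the ratios sum to 1, shuffles row indices with a fixed seed, and slices them into three disjoint groups. The name `split_indices`, the seed, and giving the rounding remainder to the validation set are assumptions, not Vesper's documented behaviour.

```python
import random


def split_indices(n_rows, train_ratio, test_ratio, val_ratio=0.0, seed=42):
    """Shuffle row indices reproducibly and split them by ratio."""
    if abs(train_ratio + test_ratio + val_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_train = round(n_rows * train_ratio)
    n_test = round(n_rows * test_ratio)
    return {
        "train": indices[:n_train],
        "test": indices[n_train:n_train + n_test],
        # Whatever remains after rounding goes to validation.
        "val": indices[n_train + n_test:],
    }
```

Because the groups are slices of one shuffled list, no row can appear in more than one split.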
Example:
split_dataset(
    dataset_id="my-dataset",
    train_ratio=0.7,
    test_ratio=0.2,
    val_ratio=0.1
)
🏗️ Architecture
Vesper is built with:
- TypeScript for the MCP server
- Python for image/audio/video processing
- SQLite for metadata storage
- Transformers.js for semantic search
🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
📄 License
MIT License - see LICENSE for details.
🐛 Issues & Support
- Issues: https://github.com/vesper/mcp-server/issues
- Discussions: https://github.com/vesper/mcp-server/discussions
🌟 Acknowledgments
Built with:
Made with ❤️ by the Vesper Team
