@dataset.sh/cli
v0.1.1
Dataset CLI for managing local and remote dataset storage
A command-line interface for managing datasets with local caching, remote downloads, and flexible storage management. It works like a package manager such as pnpm, but is designed specifically for dataset files.
Features
- 📦 Local and Global Installation - Install datasets per-project or globally
- 🔄 Intelligent Caching - Global cache with SHA-256 integrity verification
- 🏷️ Tag and Version Support - Install by semantic tags or specific versions
- 🔗 Symbolic Linking - Efficient storage with automatic linking strategies
- 🌐 Multiple Servers - Support for multiple dataset servers with authentication
- 📤 Dataset Unpacking - Extract dataset contents for direct use
- 🔐 Security - Built-in checksum verification and retry logic
Installation
Global Installation
pnpm add -g @dataset.sh/cli
# or
npm install -g @dataset.sh/cli
After global installation, use the dataset.sh command:
dataset.sh init
dataset.sh install nlp/sentiment
Using npx (No Installation Required)
npx @dataset.sh/cli init
npx @dataset.sh/cli install nlp/sentiment
npx @dataset.sh/cli unpack nlp/sentiment
Local Project Installation
pnpm add @dataset.sh/cli
# or
npm install @dataset.sh/cli
Then use via npm scripts or npx.
Quick Start
1. Initialize a Project
# Using global installation
dataset.sh init
# Using npx (no installation required)
npx @dataset.sh/cli init
2. Install a Dataset
# Install dataset with default tag (main)
dataset.sh install nlp/sentiment
# or
npx @dataset.sh/cli install nlp/sentiment
# Install specific tag
dataset.sh install nlp/sentiment -t v1.2
# Install specific version (using version hash)
dataset.sh install nlp/sentiment -v a1b2c3d4e5f6...
# Install globally
dataset.sh install -g nlp/sentiment
3. Unpack for Direct Use
# Unpack to public/datasets/nlp/sentiment
dataset.sh unpack nlp/sentiment
# or
npx @dataset.sh/cli unpack nlp/sentiment
# Unpack to custom location
dataset.sh unpack nlp/sentiment -d ./data
Global Options
--debug
Enable detailed debug logging to stderr. This shows internal operations including:
- Configuration loading and path resolution
- Network requests and responses
- Cache operations (hits/misses)
- File system operations
- Linking strategies and operations
# Enable debug logging for any command
dataset.sh --debug init
dataset.sh --debug install nlp/sentiment
# Using npx
npx @dataset.sh/cli --debug init
npx @dataset.sh/cli --debug install nlp/sentiment
Debug output includes timestamped logs with module prefixes:
- [CLI] - Command-line interface operations
- [CONFIG] - Configuration and path management
- [NETWORK] - HTTP requests and server communication
- [CACHE] - Cache operations and integrity checking
- [LINKING] - File linking and symlink operations
- [FS] - File system operations
- [INIT] - Init command operations
- [INSTALL] - Install command operations
- [UNPACK] - Unpack command operations
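As an illustration of the prefixes listed above, here is a minimal TypeScript sketch of a debug logger that writes to stderr. The exact line layout the CLI uses is not documented here, so the ISO 8601 timestamp format and the names `formatDebugLine` and `debugLog` are assumptions made for illustration only.

```typescript
// Module prefixes listed in this README.
type DebugScope =
  | "CLI" | "CONFIG" | "NETWORK" | "CACHE" | "LINKING"
  | "FS" | "INIT" | "INSTALL" | "UNPACK";

// Hypothetical formatter. Assumption: ISO 8601 timestamp, then the bracketed prefix.
function formatDebugLine(scope: DebugScope, message: string, now: Date = new Date()): string {
  return `${now.toISOString()} [${scope}] ${message}`;
}

// Debug output goes to stderr so it does not mix with normal stdout output.
function debugLog(enabled: boolean, scope: DebugScope, message: string): void {
  if (enabled) console.error(formatDebugLine(scope, message));
}

debugLog(true, "CACHE", "cache miss for nlp/sentiment");
```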
Commands
dataset.sh init
Creates a datasets.json file in the current directory.
dataset.sh init
dataset.sh install [dataset]
Installs datasets from datasets.json or adds and installs a specific dataset.
# Install all datasets from datasets.json
dataset.sh install
# Install specific dataset
dataset.sh install nlp/sentiment
# Install with options
dataset.sh install nlp/sentiment -t v1.2 -s myserver
dataset.sh install -g nlp/sentiment -v a1b2c3d4e5f6...
Options:
- -g, --global - Install to global directory (~/.dataset_sh/global)
- -s, --server <profile> - Use specific server profile
- -t, --tag <tag> - Install specific tag (default: main)
- -v, --version <version> - Install specific version (64-character hex string)
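The `--version` option expects a 64-character hex string, which is the length of a SHA-256 digest in hex. A small sketch of how such a value could be checked before it is passed to the CLI follows; `isVersionHash` is a hypothetical helper, not part of the package, and accepting uppercase hex digits is an assumption.

```typescript
// Hypothetical helper: checks that a version identifier is a 64-character hex string.
// Assumption: both lowercase and uppercase hex digits are accepted.
function isVersionHash(value: string): boolean {
  return /^[0-9a-fA-F]{64}$/.test(value);
}

console.log(isVersionHash("a".repeat(64))); // true
console.log(isVersionHash("v1.2"));         // false, this looks like a tag
```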
dataset.sh unpack <dataset>
Unpacks dataset content to a destination folder. The dataset must be installed first.
# Unpack to public/datasets
dataset.sh unpack nlp/sentiment
# Unpack to custom directory
dataset.sh unpack nlp/sentiment -d ./data
# Unpack specific version
dataset.sh unpack nlp/sentiment -v a1b2c3d4e5f6...
Options:
- -v, --version <version> - Unpack specific version (default: latest available)
- -d, --dest <folder> - Destination folder (default: public/datasets)
Configuration
Environment Variables
- DSH_CACHE_DIR - Global cache directory (default: ~/.dataset_sh/cache)
- DSH_GLOBAL_DIR - Global install directory (default: ~/.dataset_sh/global)
- DSH_PROFILE_FILE - Server profiles file (default: ~/.dataset_sh/profile.json)
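One plausible way to resolve these three locations is sketched below. This is not the CLI's actual code: `resolveDshPaths` and `DshPaths` are hypothetical names, and treating an empty variable the same as an unset one is an assumption.

```typescript
import * as os from "node:os";
import * as path from "node:path";

interface DshPaths {
  cacheDir: string;
  globalDir: string;
  profileFile: string;
}

// Hypothetical resolver: each variable overrides its documented default under ~/.dataset_sh.
// Assumption: an unset or empty variable falls back to the default.
function resolveDshPaths(
  env: Record<string, string | undefined>,
  home: string = os.homedir(),
): DshPaths {
  const base = path.join(home, ".dataset_sh");
  return {
    cacheDir: env["DSH_CACHE_DIR"] || path.join(base, "cache"),
    globalDir: env["DSH_GLOBAL_DIR"] || path.join(base, "global"),
    profileFile: env["DSH_PROFILE_FILE"] || path.join(base, "profile.json"),
  };
}

console.log(resolveDshPaths(process.env));
```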
Server Profiles
Create ~/.dataset_sh/profile.json to configure server access:
{
  "servers": {
    "production": {
      "host": "https://api.example.com",
      "accessKey": "your-access-key"
    },
    "staging": {
      "host": "https://staging-api.example.com",
      "accessKey": "staging-key"
    }
  }
}
datasets.json Format
The datasets.json file tracks project dependencies:
{
  "datasets": {
    "nlp/sentiment": [
      {
        "tag": "v1.2",
        "host": "https://api.example.com"
      }
    ],
    "vision/imagenet": [
      {
        "version": "a1b2c3d4e5f6789...",
        "host": "https://api.example.com"
      }
    ]
  }
}
File Organization
Local Installation Structure
project/
├── datasets.json              # Project dataset manifest
├── dsh_datasets/              # Local dataset installations
│   └── nlp/
│       └── sentiment/
│           ├── tag/
│           │   ├── main -> ../version/a1b2c3d4...
│           │   └── v1.2 -> ../version/f6e5d4c3...
│           └── version/
│               ├── a1b2c3d4.../
│               └── f6e5d4c3.../
Global Cache Structure
~/.dataset_sh/
├── cache/                     # Global cache with integrity checking
│   └── nlp/
│       └── sentiment/
│           └── version/
│               ├── a1b2c3d4.../
│               │   └── sentiment.dataset
│               └── f6e5d4c3.../
│                   └── sentiment.dataset
├── global/                    # Global installations
├── profile.json               # Server configurations
How It Works
Installation Process
- Tag Resolution - If installing by tag, resolves to specific version via API
- Cache Check - Checks if dataset exists in global cache and validates checksum
- Download - Downloads dataset if not cached or corrupted
- Verification - Validates SHA-256 checksum before caching
- Linking - Creates symbolic links (or copies) to target location
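The linking step can be sketched as follows. This is a minimal illustration, not the CLI's implementation: `linkOrCopy` is a hypothetical helper that tries a directory symlink and falls back to a recursive copy when the link cannot be created. It assumes Node.js 16.7 or later for `fs.cpSync` and an absolute `source` path; the version name `abc123` in the demo is a placeholder.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: expose a cached version directory at `target`.
// Tries a symbolic link first and falls back to a recursive copy,
// for example on platforms where creating symlinks is not permitted.
function linkOrCopy(source: string, target: string): "symlink" | "copy" {
  fs.mkdirSync(path.dirname(target), { recursive: true });
  // Remove any previous link or copy at the target; a symlink is unlinked, not followed.
  fs.rmSync(target, { recursive: true, force: true });
  try {
    fs.symlinkSync(source, target, "dir");
    return "symlink";
  } catch {
    fs.cpSync(source, target, { recursive: true });
    return "copy";
  }
}

// Demo in a throwaway temp directory that mimics linking from the cache into a project.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), "dsh-demo-"));
const demoSource = path.join(demoRoot, "cache", "nlp", "sentiment", "version", "abc123");
fs.mkdirSync(demoSource, { recursive: true });
fs.writeFileSync(path.join(demoSource, "sentiment.dataset"), "example");
const demoTarget = path.join(
  demoRoot, "project", "dsh_datasets", "nlp", "sentiment", "version", "abc123",
);
console.log(linkOrCopy(demoSource, demoTarget)); // "symlink" where links are permitted, otherwise "copy"
```

Either way, the dataset file is readable through the target path, so the rest of the tooling does not need to know which strategy was used.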
Caching Strategy
- Global Cache - All datasets stored in ~/.dataset_sh/cache by version
- Integrity Checking - SHA-256 checksums verify file integrity
- Automatic Redownload - Corrupted cache entries are automatically redownloaded
- Cross-Platform - Uses appropriate linking strategy per platform
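The integrity check and the redownload decision described above can be sketched with Node's built-in `crypto` module. The names `sha256File`, `checkCacheEntry`, and `CacheStatus` are hypothetical, and where the expected checksum comes from (presumably the server) is left out.

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";

// Computes the SHA-256 digest of a file as lowercase hex.
// Reads the whole file into memory, which is fine for a sketch; large files would be streamed.
function sha256File(filePath: string): string {
  return createHash("sha256").update(fs.readFileSync(filePath)).digest("hex");
}

// "hit": file exists and matches; "corrupted": file exists but the digest differs,
// so it should be redownloaded; "miss": file is absent, so it should be downloaded.
type CacheStatus = "hit" | "corrupted" | "miss";

// Hypothetical cache check against an expected SHA-256 checksum.
function checkCacheEntry(filePath: string, expectedSha256: string): CacheStatus {
  if (!fs.existsSync(filePath)) return "miss";
  return sha256File(filePath) === expectedSha256.toLowerCase() ? "hit" : "corrupted";
}
```

Only a "hit" lets the installer skip the download; both other outcomes lead to a fresh download followed by the same check.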
Network Resilience
- Exponential Backoff - Retries failed downloads with 1s, 2s, 4s delays
- Smart Error Handling - Distinguishes between retryable and permanent failures
- Authentication Support - Bearer token authentication for private servers
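A retry loop with the 1s, 2s, 4s schedule could look roughly like the sketch below. It is an illustration under stated assumptions, not the CLI's code: which HTTP statuses count as retryable, the shape of thrown errors (a numeric `status` property), the use of the profile's `accessKey` as the Bearer token, and all function names are hypothetical.

```typescript
// Delays before each retry: baseMs, 2*baseMs, 4*baseMs, ... (1s, 2s, 4s by default).
function backoffDelays(retries: number = 3, baseMs: number = 1000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) delays.push(baseMs * Math.pow(2, i));
  return delays;
}

// Assumption: timeouts (408), rate limits (429) and 5xx responses are worth retrying;
// other 4xx responses (for example 401 or 404) are treated as permanent.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || status >= 500;
}

// Reads a numeric `status` property from a thrown value, if present (hypothetical error shape).
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task`, retrying after each delay unless the failure looks permanent.
// Errors without a status (for example a dropped connection) are treated as retryable.
async function withRetry<T>(
  task: () => Promise<T>,
  delays: number[] = backoffDelays(),
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await task();
    } catch (err) {
      const status = statusOf(err);
      const permanent = status !== undefined && !isRetryableStatus(status);
      const delay = delays[attempt++];
      if (permanent || delay === undefined) throw err;
      await wait(delay);
    }
  }
}

// Assumption: the profile's accessKey is sent as the Bearer token.
function authHeaders(accessKey?: string): Record<string, string> {
  const headers: Record<string, string> = {};
  if (accessKey) headers["Authorization"] = `Bearer ${accessKey}`;
  return headers;
}

console.log(backoffDelays()); // 1s, 2s, 4s expressed in milliseconds
```

Passing the wait function as a parameter keeps the loop testable without real delays.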
Examples
Machine Learning Workflow
# Initialize project
dataset.sh init
# Install training data
dataset.sh install ml/training-data -t latest
# Install validation set
dataset.sh install ml/validation-data -v a1b2c3d4e5f6...
# Unpack for training script
dataset.sh unpack ml/training-data -d ./data/train
dataset.sh unpack ml/validation-data -d ./data/val
Multi-Environment Setup
# Development
dataset.sh install nlp/dataset -t dev -s staging
# Production
dataset.sh install nlp/dataset -t v2.1 -s production
Global Dataset Management
# Install commonly used datasets globally
dataset.sh install -g common/embeddings
dataset.sh install -g common/stopwords
# Use in any project without reinstalling
dataset.sh unpack common/embeddings
Error Handling
The CLI provides clear, actionable error messages:
- Network failures - Suggests checking the connection and retrying
- Authentication errors - Points to profile configuration
- Missing datasets - Shows available versions and tags
- Disk space issues - Advises on freeing space
- Permission errors - Guides on fixing file permissions
Troubleshooting
Debug Mode
When encountering issues, enable debug logging to see detailed internal operations:
dataset.sh --debug install problem/dataset
# or
npx @dataset.sh/cli --debug install problem/dataset
This will show:
- Which server profiles are being used
- Network request details and response codes
- Cache hit/miss information
- File system operations and linking strategies
- Checksum verification steps
Common Issues
"datasets.json not found"
# Run init first
dataset.sh init
# or
npx @dataset.sh/cli init
"Server profile not found"
# Check your profile configuration
cat ~/.dataset_sh/profile.json
# Or create one
mkdir -p ~/.dataset_sh
echo '{"servers":{"default":{"host":"https://api.example.com"}}}' > ~/.dataset_sh/profile.json
"Checksum verification failed"
# Clear cache and retry
rm -rf ~/.dataset_sh/cache/category/dataset
dataset.sh --debug install category/dataset
# or
npx @dataset.sh/cli --debug install category/dataset
Network issues
# Use debug mode to see network details
dataset.sh --debug install category/dataset
# or
npx @dataset.sh/cli --debug install category/dataset
# Check server connectivity
curl -v https://your-server.com/api/health
Development
Building
pnpm build
Testing
pnpm test
pnpm test:watch
Compatibility
- Node.js >= 16.0.0
- TypeScript >= 5.0.0
- Cross-platform - Works on Windows, macOS, and Linux
License
MIT
