@dbclean/cli
v1.0.1
🧹 DBClean
Transform messy CSV data into clean, standardized datasets using AI-powered automation.
DBClean is a powerful command-line tool that automatically cleans, standardizes, and restructures your CSV data using advanced AI models. Perfect for data scientists, analysts, and anyone working with messy datasets.
📁 Project Structure
After processing, your workspace will look like this:
your-project/
├── data.csv                      # Your original input file
├── data/
│   ├── data_cleaned.csv          # After preclean step
│   ├── data_deduped.csv          # After duplicate removal
│   ├── data_stitched.csv         # Final cleaned dataset
│   ├── train.csv                 # Training set (70%)
│   ├── validate.csv              # Validation set (15%)
│   └── test.csv                  # Test set (15%)
├── settings/
│   ├── instructions.txt          # Custom AI instructions
│   └── exclude_columns.txt       # Columns to skip in preclean
└── outputs/
    ├── architect_output.txt      # AI schema design
    ├── column_mapping.json       # Column transformations
    ├── cleaned_columns/          # Individual column results
    ├── cleaner_changes_analysis.html
    └── dedupe_report.txt

✨ Features
- 🤖 AI-Powered Cleaning - Uses advanced language models to intelligently clean and standardize data
- 🏗️ Schema Design - Automatically creates optimal database schemas from your data
- 🔍 Duplicate Detection - AI-powered duplicate identification and removal
- 🎯 Outlier Detection - Uses Isolation Forest to identify and remove anomalies
- ✂️ Data Splitting - Automatically splits cleaned data into training, validation, and test sets
- 🔄 Full Pipeline - Complete automation from raw CSV to clean, structured data
- 📊 Column-by-Column Processing - Detailed cleaning and standardization of individual columns
- 🎯 Model Selection - Choose from multiple AI models for different tasks
- 📋 Custom Instructions - Guide the AI with your specific cleaning requirements
- 💰 Credit-Based Billing - Pay only for what you use with transparent pricing
💳 Credit System
DBClean uses a transparent, pay-as-you-go credit system:
- Free Tier: 5 free requests per month for new users
- Minimum Balance: $0.01 required for paid requests
- Precision: 4 decimal places (charges as low as $0.0001)
- Pricing: Based on actual AI model costs with no markup
- Billing: Credits deducted only after successful processing
Check your balance anytime with dbclean credits or get a complete overview with dbclean account.
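The precision and minimum-balance rules above can be illustrated with a small sketch. This is not DBClean's billing code: the function names are hypothetical, and the half-up rounding mode is an assumption, since the README only states that charges use 4 decimal places.

```python
from decimal import Decimal, ROUND_HALF_UP

PRECISION = Decimal("0.0001")      # 4 decimal places, as documented
MINIMUM_BALANCE = Decimal("0.01")  # minimum balance for paid requests, as documented


def round_charge(raw_cost: str) -> Decimal:
    """Round a raw model cost to 4 decimal places (rounding mode is an assumption)."""
    return Decimal(raw_cost).quantize(PRECISION, rounding=ROUND_HALF_UP)


def can_start_paid_request(balance: str) -> bool:
    """Return True if the balance meets the documented $0.01 minimum."""
    return Decimal(balance) >= MINIMUM_BALANCE


def balance_after(balance: str, raw_cost: str, succeeded: bool) -> Decimal:
    """Deduct the rounded charge only when processing succeeded."""
    current = Decimal(balance)
    if not succeeded:
        return current
    return current - round_charge(raw_cost)
```

Amounts are passed as strings so that `Decimal` stores them exactly instead of inheriting binary floating-point error.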
🚀 Quick Start
1. Initialize Your Account
dbclean init
Enter your email and API key when prompted. Don't have an account? Sign up at dbclean.dev.
2. Verify Setup
dbclean test-auth
dbclean account
3. Process Your Data
# Place your CSV file as data.csv in your current directory
dbclean run
Your cleaned data will be available in data/data_stitched.csv 🎉
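To sanity-check the result, one option is to count rows and list columns with Python's standard `csv` module. The helper name `summarize_csv` is hypothetical and the sketch assumes the output is UTF-8 encoded, which the README does not state.

```python
import csv
from pathlib import Path


def summarize_csv(path: str) -> tuple[list[str], int]:
    """Return (column names, number of data rows) for a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first row is treated as the header
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count


# Example usage once the pipeline has finished:
# columns, rows = summarize_csv("data/data_stitched.csv")
# print(f"{rows} rows, columns: {columns}")
```

Comparing the row count against the original data.csv gives a rough idea of how many records were removed along the way.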
📖 Command Reference
🔧 Setup & Authentication
| Command | Description |
|---------|-------------|
| dbclean init | Initialize with your email and API key |
| dbclean test-auth | Verify your credentials are working |
| dbclean logout | Remove stored credentials |
| dbclean status | Check API key status and account info |
💰 Account Management
| Command | Description |
|---------|-------------|
| dbclean account | Complete account overview (credits, usage, status) |
| dbclean credits | Check your current credit balance |
| dbclean usage | View API usage statistics |
| dbclean usage --detailed | Detailed breakdown by service and model |
| dbclean models | List all available AI models |
📊 Data Processing Pipeline
| Command | Description |
|---------|-------------|
| dbclean run | Execute complete pipeline (recommended) |
| dbclean preclean | Clean CSV data (remove newlines, special chars) |
| dbclean architect | AI-powered schema design and standardization |
| dbclean dedupe | AI-powered duplicate detection and removal |
| dbclean cleaner | AI-powered column-by-column data cleaning |
| dbclean stitcher | Combine all changes into final CSV |
| dbclean isosplit | Detect outliers and split into train/validate/test |
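The preclean step is described only as removing newlines and special characters. As a rough idea of what that kind of cleanup involves, here is a sketch using the standard library. It is not DBClean's implementation: the function names are hypothetical, and treating control characters as the "special chars" is an assumption.

```python
import csv
import io
import re

# Control characters other than tab and line feed; treating these as the
# "special chars" is an assumption made for this sketch.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def preclean_cell(value: str) -> str:
    """Replace embedded line breaks with spaces, drop control characters, trim."""
    value = value.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    value = _CONTROL_CHARS.sub("", value)
    return re.sub(r" {2,}", " ", value).strip()


def preclean_csv(text: str) -> str:
    """Apply preclean_cell to every cell of a CSV document given as a string."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([preclean_cell(cell) for cell in row])
    return out.getvalue()
```

Parsing with `csv.reader` first matters: a newline inside a quoted cell is part of the value, so a naive line-by-line replacement would split one record into two.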
🔄 Complete Pipeline
The recommended approach is to use the full pipeline:
# Basic full pipeline
dbclean run
# With custom AI model
dbclean run -m "gemini-2.0-flash-exp"
# Different models for different steps
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"
# With custom instructions and larger sample
dbclean run -i -x 10
# Skip certain steps
dbclean run --skip-preclean --skip-dedupe
Pipeline Steps
- Preclean - Prepares the raw CSV by removing problematic characters and formatting
- Architect - AI analyzes your data structure and creates an optimized schema
- Dedupe - AI identifies and removes duplicate records intelligently
- Cleaner - AI processes each column to standardize and clean data
- Stitcher - Combines all improvements into the final dataset
- Isosplit - Removes outliers and splits the data into training, validation, and test sets
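The 70/15/15 split produced by the Isosplit step can be pictured with a short sketch. It covers only the splitting half (outlier removal with Isolation Forest is not shown), the function name is hypothetical, and shuffling before the split is an assumption about how DBClean behaves.

```python
import random


def split_rows(rows, train_pct=70, validate_pct=15, seed=0):
    """Shuffle rows and split them into (train, validate, test) lists.

    The test set receives whatever remains after the first two cuts, so the
    three parts always add up to the input size.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    train_end = len(shuffled) * train_pct // 100
    validate_end = train_end + len(shuffled) * validate_pct // 100
    return (
        shuffled[:train_end],
        shuffled[train_end:validate_end],
        shuffled[validate_end:],
    )
```

Integer percentages are used instead of fractions such as 0.15 so the cut points are not affected by floating-point rounding.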
🎛️ Command Options
Model Selection
- -m <model> - Use same model for all AI steps
- --model-architect <model> - Specific model for architect step
- --model-cleaner <model> - Specific model for cleaner step
Processing Options
- -x <number> - Sample size for architect analysis (default: 5)
- -i - Use custom instructions from settings/instructions.txt
- --input <file> - Specify input CSV file (default: data.csv)
Skip Options
- --skip-preclean - Skip data preparation step
- --skip-architect - Skip schema design step
- --skip-dedupe - Skip duplicate detection step
- --skip-cleaner - Skip column cleaning step
- --skip-isosplit - Skip outlier detection and data splitting
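When driving DBClean from a script, the options above map naturally onto an argument list. The sketch below uses only the flags listed in this section. The helper `build_run_command` is hypothetical and not part of DBClean.

```python
# Steps that have a documented --skip-<step> flag.
SKIPPABLE_STEPS = ("preclean", "architect", "dedupe", "cleaner", "isosplit")


def build_run_command(model=None, model_architect=None, model_cleaner=None,
                      sample_size=None, use_instructions=False,
                      input_file=None, skip=()):
    """Assemble an argument list for `dbclean run` from the documented options."""
    command = ["dbclean", "run"]
    if model:
        command += ["-m", model]
    if model_architect:
        command += ["--model-architect", model_architect]
    if model_cleaner:
        command += ["--model-cleaner", model_cleaner]
    if sample_size is not None:
        command += ["-x", str(sample_size)]
    if use_instructions:
        command.append("-i")
    if input_file:
        command += ["--input", input_file]
    for step in skip:
        if step not in SKIPPABLE_STEPS:
            raise ValueError(f"no documented skip flag for step: {step}")
        command.append(f"--skip-{step}")
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a shell string avoids quoting problems with file names that contain spaces.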
🤖 AI Models
Recommended Models
| Model | Best For | Speed | Cost |
|-------|----------|-------|------|
| gemini-2.0-flash-exp | General purpose, fast processing | ⚡⚡⚡ | 💲 |
| gemini-2.0-flash-thinking | Complex data analysis | ⚡⚡ | 💲💲 |
| gemini-1.5-pro | Large, complex datasets | ⚡ | 💲💲💲 |
Model Selection Tips
- For speed and cost: Use gemini-2.0-flash-exp
- For complex, messy data: Use gemini-2.0-flash-thinking for architect
- For mixed workloads: Use different models per step with --model-architect and --model-cleaner
# List all available models
dbclean models
📝 Custom Instructions
Create custom cleaning instructions to guide the AI:
- For architect step: Use the -i flag with a settings/instructions.txt file
- Example instructions:
  - Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)
  - Convert all dates to YYYY-MM-DD format
  - Normalize company names (remove Inc, LLC, etc.)
  - Flag any entries with missing critical information
  - Ensure email addresses are properly formatted
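If you prefer to generate the instructions file from a script, a minimal sketch follows. The helper `write_instructions` is hypothetical. The only details taken from this README are the settings/instructions.txt path and the example instructions themselves. UTF-8 encoding is an assumption.

```python
from pathlib import Path

EXAMPLE_INSTRUCTIONS = """\
- Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)
- Convert all dates to YYYY-MM-DD format
- Normalize company names (remove Inc, LLC, etc.)
- Flag any entries with missing critical information
- Ensure email addresses are properly formatted
"""


def write_instructions(project_dir: str, text: str = EXAMPLE_INSTRUCTIONS) -> Path:
    """Write text to <project_dir>/settings/instructions.txt and return the path."""
    target = Path(project_dir) / "settings" / "instructions.txt"
    target.parent.mkdir(parents=True, exist_ok=True)  # create settings/ if needed
    target.write_text(text, encoding="utf-8")
    return target
```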
dbclean run -i # Uses instructions from settings/instructions.txt
💡 Usage Examples
Basic Processing
# Process a CSV file with default settings
dbclean run
# Use a specific input file
dbclean run --input customer_data.csv
Advanced Processing
# High-quality processing with larger sample
dbclean run -m "gemini-2.0-flash-thinking" -x 15 -i
# Fast processing for large datasets
dbclean run -m "gemini-2.0-flash-exp" --skip-dedupe
# Custom pipeline - architect only
dbclean run --skip-preclean --skip-cleaner --skip-dedupe --skip-isosplit
Individual Steps
# Run architect with custom model and sample size
dbclean architect -m "gemini-2.0-flash-thinking" -x 10 -i
# Clean data with specific model
dbclean cleaner -m "gemini-2.0-flash-exp"
# Remove duplicates with AI analysis
dbclean dedupe
🎯 Best Practices
1. Start Small and Iterate
# Test with small sample first
dbclean architect -x 3
# Review outputs, then run full pipeline
dbclean run
2. Choose the Right Models
# For complex schema design
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"
3. Use Custom Instructions
Create settings/instructions.txt with domain-specific requirements:
Finance data requirements:
- Currency amounts in USD format ($X,XXX.XX)
- Account numbers must be 10-12 digits
- Transaction dates in YYYY-MM-DD format
4. Monitor Your Usage
# Check account status regularly
dbclean account
# Monitor detailed usage
dbclean usage --detailed
❗ Troubleshooting
Common Issues
Authentication Problems
dbclean init # Re-enter credentials
dbclean test-auth # Verify connection
Data File Issues
- Ensure data.csv exists in current directory
- Use --input <file> for different file names
- Check file permissions and encoding
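The three checks above can be automated before starting a run. The sketch below is a hypothetical helper, not part of DBClean, and it assumes the expected encoding is UTF-8, which this README does not specify.

```python
import os


def check_input_file(path: str = "data.csv") -> list[str]:
    """Return a list of problems found with the input file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist or is not a regular file"]
    if not os.access(path, os.R_OK):
        return [f"{path} is not readable by the current user"]
    problems = []
    try:
        # Read the whole file so that a bad byte anywhere is detected.
        with open(path, encoding="utf-8", newline="") as handle:
            for _ in handle:
                pass
    except UnicodeDecodeError as error:
        problems.append(f"{path} is not valid UTF-8: {error}")
    return problems
```

An empty list means the file exists, is readable, and decodes as UTF-8. It does not confirm that the content is well-formed CSV.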
API Limits
- Check credit balance: dbclean credits
- View usage: dbclean usage
- Free tier: 5 requests per month, then paid credits required
Model Availability
dbclean models # See available models
Getting Help
dbclean --help # General help
dbclean run --help # Command-specific help
dbclean help-commands # Detailed command reference
📊 Output Files
After processing, you'll have:
- data/data_stitched.csv - Your final, cleaned dataset
- data/train.csv - Training data (70%)
- data/validate.csv - Validation data (15%)
- data/test.csv - Test data (15%)
- outputs/cleaner_changes_analysis.html - Visual changes report
- outputs/architect_output.txt - AI schema analysis
- outputs/column_mapping.json - Column transformation details
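A quick way to confirm that a run produced everything listed above is to check for each path. The helper below is a hypothetical sketch. The paths come from this README, but which files appear when steps are skipped (for example with --skip-isosplit) is not documented, so treat a missing file as a prompt to investigate, not as a definite error.

```python
from pathlib import Path

# Output paths as listed in this README, relative to the project directory.
EXPECTED_OUTPUTS = (
    "data/data_stitched.csv",
    "data/train.csv",
    "data/validate.csv",
    "data/test.csv",
    "outputs/cleaner_changes_analysis.html",
    "outputs/architect_output.txt",
    "outputs/column_mapping.json",
)


def missing_outputs(project_dir: str = ".") -> list[str]:
    """Return the expected output paths that do not exist under project_dir."""
    root = Path(project_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```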
🤝 Support
- Documentation: dbclean.dev/docs
- Support: dbclean.dev/support
- API Status: Check real-time status and get your API key
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
Ready to clean your data? Start with dbclean init and transform your messy CSV files into pristine datasets! 🚀
