@dbclean/cli

v1.0.1

Published

6 months ago

csv data-cleaning data-processing ai machine-learning data-science automation cli command-line data-transformation schema-design gemini dbclean

🧹 DBClean

Transform messy CSV data into clean, standardized datasets using AI-powered automation.

DBClean is a powerful command-line tool that automatically cleans, standardizes, and restructures your CSV data using advanced AI models. Perfect for data scientists, analysts, and anyone working with messy datasets.

📁 Project Structure

After processing, your workspace will look like this:

your-project/
├── data.csv                  # Your original input file
├── data/
│   ├── data_cleaned.csv      # After preclean step
│   ├── data_deduped.csv      # After duplicate removal
│   ├── data_stitched.csv     # Final cleaned dataset
│   ├── train.csv             # Training set (70%)
│   ├── validate.csv          # Validation set (15%)
│   └── test.csv              # Test set (15%)
├── settings/
│   ├── instructions.txt      # Custom AI instructions
│   └── exclude_columns.txt   # Columns to skip in preclean
└── outputs/
    ├── architect_output.txt  # AI schema design
    ├── column_mapping.json   # Column transformations
    ├── cleaned_columns/      # Individual column results
    ├── cleaner_changes_analysis.html
    └── dedupe_report.txt
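
The tree above lists a 70/15/15 split across train.csv, validate.csv, and test.csv. One way to sanity-check those proportions after a run is to count the data rows in each file. This is a minimal sketch and not part of DBClean: the helper name `split_fractions` is hypothetical, and it assumes each CSV is UTF-8 and starts with a single header row.

```python
import csv
from pathlib import Path


def split_fractions(data_dir="data"):
    """Return the share of data rows held by train/validate/test CSVs.

    Assumption (not confirmed by the DBClean docs): every file has exactly
    one header row, which is excluded from the count.
    """
    counts = {}
    for name in ("train", "validate", "test"):
        path = Path(data_dir) / f"{name}.csv"
        with path.open(newline="", encoding="utf-8") as handle:
            rows = sum(1 for _ in csv.reader(handle))
        counts[name] = max(rows - 1, 0)  # drop the assumed header row
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no data rows found in the split files")
    return {name: count / total for name, count in counts.items()}
```

Calling `split_fractions("data")` from the project root should give values close to 0.70, 0.15, and 0.15; small deviations are expected on small datasets because row counts are whole numbers.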

✨ Features

🤖 AI-Powered Cleaning - Uses advanced language models to intelligently clean and standardize data
🏗️ Schema Design - Automatically creates optimal database schemas from your data
🔍 Duplicate Detection - AI-powered duplicate identification and removal
🎯 Outlier Detection - Uses Isolation Forest to identify and remove anomalies
✂️ Data Splitting - Automatically splits cleaned data into training, validation, and test sets
🔄 Full Pipeline - Complete automation from raw CSV to clean, structured data
📊 Column-by-Column Processing - Detailed cleaning and standardization of individual columns
🎯 Model Selection - Choose from multiple AI models for different tasks
📋 Custom Instructions - Guide the AI with your specific cleaning requirements
💰 Credit-Based Billing - Pay only for what you use with transparent pricing

💳 Credit System

DBClean uses a transparent, pay-as-you-go credit system:

Free Tier: 5 free requests per month for new users
Minimum Balance: $0.01 required for paid requests
Precision: 4 decimal places (charges as low as $0.0001)
Pricing: Based on actual AI model costs with no markup
Billing: Credits deducted only after successful processing

Check your balance anytime with dbclean credits or get a complete overview with dbclean account.
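
To make the numbers above concrete, here is a small arithmetic sketch of a balance tracked to 4 decimal places with a $0.01 minimum. It only illustrates the stated rules and is not DBClean's billing code: the function names are hypothetical, and the rounding mode (half up) is an assumption, since the rounding rule is not documented here.

```python
from decimal import Decimal, ROUND_HALF_UP

PRECISION = Decimal("0.0001")      # 4 decimal places, as stated above
MINIMUM_BALANCE = Decimal("0.01")  # minimum balance for paid requests


def can_start_paid_request(balance):
    """True if the balance meets the stated $0.01 minimum."""
    return Decimal(balance) >= MINIMUM_BALANCE


def apply_charge(balance, cost):
    """Deduct a cost rounded to 4 decimal places.

    The rounding mode is an assumption made for this illustration.
    """
    charge = Decimal(cost).quantize(PRECISION, rounding=ROUND_HALF_UP)
    return (Decimal(balance) - charge).quantize(PRECISION)
```

Pass amounts as strings (for example `apply_charge("1.00", "0.00014")`) so that binary floating-point error does not leak into the result.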

🚀 Quick Start

1. Initialize Your Account

dbclean init

Enter your email and API key when prompted. Don't have an account? Sign up at dbclean.dev.

2. Verify Setup

dbclean test-auth
dbclean account

3. Process Your Data

# Place your CSV file as data.csv in your current directory
dbclean run

Your cleaned data will be available in data/data_stitched.csv 🎉

📖 Command Reference

🔧 Setup & Authentication

| Command | Description |
|---------|-------------|
| dbclean init | Initialize with your email and API key |
| dbclean test-auth | Verify your credentials are working |
| dbclean logout | Remove stored credentials |
| dbclean status | Check API key status and account info |

💰 Account Management

| Command | Description |
|---------|-------------|
| dbclean account | Complete account overview (credits, usage, status) |
| dbclean credits | Check your current credit balance |
| dbclean usage | View API usage statistics |
| dbclean usage --detailed | Detailed breakdown by service and model |
| dbclean models | List all available AI models |

📊 Data Processing Pipeline

| Command | Description |
|---------|-------------|
| dbclean run | Execute complete pipeline (recommended) |
| dbclean preclean | Clean CSV data (remove newlines, special chars) |
| dbclean architect | AI-powered schema design and standardization |
| dbclean dedupe | AI-powered duplicate detection and removal |
| dbclean cleaner | AI-powered column-by-column data cleaning |
| dbclean stitcher | Combine all changes into final CSV |
| dbclean isosplit | Detect outliers and split into train/validate/test |

🔄 Complete Pipeline

The recommended approach is to use the full pipeline:

# Basic full pipeline
dbclean run

# With custom AI model
dbclean run -m "gemini-2.0-flash-exp"

# Different models for different steps
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

# With custom instructions and larger sample
dbclean run -i -x 10

# Skip certain steps
dbclean run --skip-preclean --skip-dedupe

Pipeline Steps

1. Preclean - Prepares raw CSV by removing problematic characters and formatting
2. Architect - AI analyzes your data structure and creates optimized schema
3. Dedupe - AI identifies and removes duplicate records intelligently
4. Cleaner - AI processes each column to standardize and clean data
5. Stitcher - Combines all improvements into final dataset
6. Isosplit - Removes outliers and splits data for machine learning
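
If you prefer to drive the steps one at a time (for example, to inspect the outputs between them), the order above can be scripted. The sketch below builds the individual commands in pipeline order and runs them until one fails. It uses only subcommands and flags shown in this README, but two things are assumptions: that running the steps separately in this order matches what `dbclean run` does, and that `dbclean` exits with a non-zero code on failure. The helper names are hypothetical.

```python
import subprocess

# Step order as listed in "Pipeline Steps".
STEPS = ["preclean", "architect", "dedupe", "cleaner", "stitcher", "isosplit"]


def build_step_commands(model_architect=None, model_cleaner=None,
                        sample_size=None, use_instructions=False):
    """Build one argv list per pipeline step, in order."""
    commands = []
    for step in STEPS:
        cmd = ["dbclean", step]
        if step == "architect":
            if model_architect:
                cmd += ["-m", model_architect]
            if sample_size is not None:
                cmd += ["-x", str(sample_size)]
            if use_instructions:
                cmd.append("-i")
        elif step == "cleaner" and model_cleaner:
            cmd += ["-m", model_cleaner]
        commands.append(cmd)
    return commands


def run_steps(commands):
    """Run each command in order; raises CalledProcessError on a non-zero exit."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_steps(build_step_commands(sample_size=10, use_instructions=True))` would run the six steps with a larger architect sample and custom instructions.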

🎛️ Command Options

Model Selection

-m <model> - Use same model for all AI steps
--model-architect <model> - Specific model for architect step
--model-cleaner <model> - Specific model for cleaner step

Processing Options

-x <number> - Sample size for architect analysis (default: 5)
-i - Use custom instructions from settings/instructions.txt
--input <file> - Specify input CSV file (default: data.csv)

Skip Options

--skip-preclean - Skip data preparation step
--skip-architect - Skip schema design step
--skip-dedupe - Skip duplicate detection step
--skip-cleaner - Skip column cleaning step
--skip-isosplit - Skip outlier detection and data splitting

🤖 AI Models

Recommended Models

| Model | Best For | Speed | Cost |
|-------|----------|-------|------|
| gemini-2.0-flash-exp | General purpose, fast processing | ⚡⚡⚡ | 💲 |
| gemini-2.0-flash-thinking | Complex data analysis | ⚡⚡ | 💲💲 |
| gemini-1.5-pro | Large, complex datasets | ⚡ | 💲💲💲 |

Model Selection Tips

For speed and cost: Use gemini-2.0-flash-exp
For complex, messy data: Use gemini-2.0-flash-thinking for architect
For mixed workloads: Use different models per step with --model-architect and --model-cleaner

# List all available models
dbclean models

📝 Custom Instructions

Guide the AI with your own cleaning rules. Write them in settings/instructions.txt and pass the -i flag; the instructions are applied during the architect step.

Example instructions:

- Standardize all phone numbers to E.164 format (e.g. +1XXXXXXXXXX for US numbers)
- Convert all dates to YYYY-MM-DD format
- Normalize company names (remove Inc, LLC, etc.)
- Flag any entries with missing critical information
- Ensure email addresses are properly formatted

dbclean run -i  # Uses instructions from settings/instructions.txt
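
Because the cleaning is done by a language model, it can be worth checking that the output actually follows instructions like the ones above. The following is a minimal sketch of such a check for the E.164 and YYYY-MM-DD rules. It is independent of DBClean, and the helper names are hypothetical.

```python
import re
from datetime import date

# E.164: a plus sign, then 2 to 15 digits with no leading zero.
E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")
ISO_DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_e164(value):
    """True if the value has the shape of an E.164 phone number."""
    return E164_PATTERN.fullmatch(value) is not None


def is_iso_date(value):
    """True if the value is a real calendar date written as YYYY-MM-DD."""
    if ISO_DATE_PATTERN.fullmatch(value) is None:
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def invalid_values(rows, column, check):
    """Return the values in `column` that fail `check` (rows are dicts)."""
    return [row[column] for row in rows if not check(row[column])]
```

In practice you might feed `invalid_values` the rows from `csv.DictReader` over data/data_stitched.csv. The column names you pass (say, `phone` or `signup_date`) depend on your own data and are not defined by DBClean.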

💡 Usage Examples

Basic Processing

# Process a CSV file with default settings
dbclean run

# Use a specific input file
dbclean run --input customer_data.csv

Advanced Processing

# High-quality processing with larger sample
dbclean run -m "gemini-2.0-flash-thinking" -x 15 -i

# Fast processing for large datasets
dbclean run -m "gemini-2.0-flash-exp" --skip-dedupe

# Custom pipeline: skip preclean, dedupe, cleaner, and isosplit to focus on architect
dbclean run --skip-preclean --skip-cleaner --skip-dedupe --skip-isosplit

Individual Steps

# Run architect with custom model and sample size
dbclean architect -m "gemini-2.0-flash-thinking" -x 10 -i

# Clean data with specific model
dbclean cleaner -m "gemini-2.0-flash-exp"

# Remove duplicates with AI analysis
dbclean dedupe

🎯 Best Practices

1. Start Small and Iterate

# Test with small sample first
dbclean architect -x 3

# Review outputs, then run full pipeline
dbclean run

2. Choose the Right Models

# For complex schema design
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

3. Use Custom Instructions

Create settings/instructions.txt with domain-specific requirements:

Finance data requirements:
- Currency amounts in USD format ($X,XXX.XX)
- Account numbers must be 10-12 digits
- Transaction dates in YYYY-MM-DD format

4. Monitor Your Usage

# Check account status regularly
dbclean account

# Monitor detailed usage
dbclean usage --detailed

❗ Troubleshooting

Common Issues

Authentication Problems

dbclean init       # Re-enter credentials
dbclean test-auth  # Verify connection

Data File Issues

Ensure data.csv exists in current directory
Use --input <file> for different file names
Check file permissions and encoding
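
The three checks above can be run before spending any credits. This is a rough pre-flight sketch, not a DBClean feature: the function name `preflight` is hypothetical, and UTF-8 is an assumed encoding, since this README does not state which encodings DBClean accepts.

```python
import csv
from pathlib import Path


def preflight(path="data.csv", encoding="utf-8"):
    """Return a list of problems with the input file (empty if none found)."""
    file = Path(path)
    if not file.is_file():
        return [f"{path} not found; pass a different file with --input"]
    try:
        with file.open(newline="", encoding=encoding) as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            for _ in reader:  # read to the end so late decode errors surface
                pass
    except PermissionError:
        return [f"{path} is not readable; check file permissions"]
    except UnicodeDecodeError:
        return [f"{path} is not valid {encoding}; check the file encoding"]
    except csv.Error as exc:
        return [f"{path} could not be parsed as CSV: {exc}"]
    if not header:
        return [f"{path} is empty or has no header row"]
    return []
```

An empty list means the file exists, can be read, decodes cleanly, and has a non-empty first row; it says nothing about whether the data itself is clean.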

API Limits

Check credit balance: dbclean credits
View usage: dbclean usage
Free tier: 5 requests per month, then paid credits required

Model Availability

dbclean models   # See available models

Getting Help

dbclean --help              # General help
dbclean run --help          # Command-specific help
dbclean help-commands       # Detailed command reference

📊 Output Files

After processing, you'll have:

data/data_stitched.csv - Your final, cleaned dataset
data/train.csv - Training data (70%)
data/validate.csv - Validation data (15%)
data/test.csv - Test data (15%)
outputs/cleaner_changes_analysis.html - Visual changes report
outputs/architect_output.txt - AI schema analysis
outputs/column_mapping.json - Column transformation details
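
A quick way to confirm that a run produced everything in the list above is to check for each file. The sketch below does that with the paths taken from this README; the helper name `missing_outputs` is hypothetical. It assumes a full run, so if you skipped steps (for example with --skip-isosplit) some files will presumably be absent by design.

```python
from pathlib import Path

# Paths as listed under "Output Files", relative to the project root.
EXPECTED_OUTPUTS = [
    "data/data_stitched.csv",
    "data/train.csv",
    "data/validate.csv",
    "data/test.csv",
    "outputs/cleaner_changes_analysis.html",
    "outputs/architect_output.txt",
    "outputs/column_mapping.json",
]


def missing_outputs(project_root="."):
    """Return the expected output files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty result from `missing_outputs()` means all seven files are present.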

🤝 Support

Documentation: dbclean.dev/docs
Support: dbclean.dev/support
API Status: Check real-time status and get your API key

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Ready to clean your data? Start with dbclean init and transform your messy CSV files into pristine datasets! 🚀