npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@dbclean/cli

v1.0.1

Published

Transform messy CSV data into clean, standardized datasets using AI-powered automation

Readme

🧹 DBClean

Transform messy CSV data into clean, standardized datasets using AI-powered automation.

DBClean is a powerful command-line tool that automatically cleans, standardizes, and restructures your CSV data using advanced AI models. Perfect for data scientists, analysts, and anyone working with messy datasets.

📁 Project Structure

After processing, your workspace will look like this:

your-project/
├── data.csv                  # Your original input file
├── data/
│   ├── data_cleaned.csv      # After preclean step
│   ├── data_deduped.csv      # After duplicate removal
│   ├── data_stitched.csv     # Final cleaned dataset
│   ├── train.csv             # Training set (70%)
│   ├── validate.csv          # Validation set (15%)
│   └── test.csv              # Test set (15%)
├── settings/
│   ├── instructions.txt      # Custom AI instructions
│   └── exclude_columns.txt   # Columns to skip in preclean
└── outputs/
    ├── architect_output.txt  # AI schema design
    ├── column_mapping.json   # Column transformations
    ├── cleaned_columns/      # Individual column results
    ├── cleaner_changes_analysis.html
    └── dedupe_report.txt

✨ Features

  • 🤖 AI-Powered Cleaning - Uses advanced language models to intelligently clean and standardize data
  • 🏗️ Schema Design - Automatically creates optimal database schemas from your data
  • 🔍 Duplicate Detection - AI-powered duplicate identification and removal
  • 🎯 Outlier Detection - Uses Isolation Forest to identify and remove anomalies
  • ✂️ Data Splitting - Automatically splits cleaned data into training, validation, and test sets
  • 🔄 Full Pipeline - Complete automation from raw CSV to clean, structured data
  • 📊 Column-by-Column Processing - Detailed cleaning and standardization of individual columns
  • 🎯 Model Selection - Choose from multiple AI models for different tasks
  • 📋 Custom Instructions - Guide the AI with your specific cleaning requirements
  • 💰 Credit-Based Billing - Pay only for what you use with transparent pricing

💳 Credit System

DBClean uses a transparent, pay-as-you-go credit system:

  • Free Tier: 5 free requests per month for new users
  • Minimum Balance: $0.01 required for paid requests
  • Precision: 4 decimal places (charges as low as $0.0001)
  • Pricing: Based on actual AI model costs with no markup
  • Billing: Credits deducted only after successful processing

Check your balance anytime with dbclean credits or get a complete overview with dbclean account.

🚀 Quick Start

1. Initialize Your Account

dbclean init

Enter your email and API key when prompted. Don't have an account? Sign up at dbclean.dev

2. Verify Setup

dbclean test-auth
dbclean account

3. Process Your Data

# Place your CSV file as data.csv in your current directory
dbclean run

Your cleaned data will be available in data/data_stitched.csv 🎉

📖 Command Reference

🔧 Setup & Authentication

| Command | Description | |---------|-------------| | dbclean init | Initialize with your email and API key | | dbclean test-auth | Verify your credentials are working | | dbclean logout | Remove stored credentials | | dbclean status | Check API key status and account info |

💰 Account Management

| Command | Description | |---------|-------------| | dbclean account | Complete account overview (credits, usage, status) | | dbclean credits | Check your current credit balance | | dbclean usage | View API usage statistics | | dbclean usage --detailed | Detailed breakdown by service and model | | dbclean models | List all available AI models |

📊 Data Processing Pipeline

| Command | Description | |---------|-------------| | dbclean run | Execute complete pipeline (recommended) | | dbclean preclean | Clean CSV data (remove newlines, special chars) | | dbclean architect | AI-powered schema design and standardization | | dbclean dedupe | AI-powered duplicate detection and removal | | dbclean cleaner | AI-powered column-by-column data cleaning | | dbclean stitcher | Combine all changes into final CSV | | dbclean isosplit | Detect outliers and split into train/validate/test |

🔄 Complete Pipeline

The recommended approach is to use the full pipeline:

# Basic full pipeline
dbclean run

# With custom AI model
dbclean run -m "gemini-2.0-flash-exp"

# Different models for different steps
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

# With custom instructions and larger sample
dbclean run -i -x 10

# Skip certain steps
dbclean run --skip-preclean --skip-dedupe

Pipeline Steps

  1. Preclean - Prepares raw CSV by removing problematic characters and formatting
  2. Architect - AI analyzes your data structure and creates optimized schema
  3. Dedupe - AI identifies and removes duplicate records intelligently
  4. Cleaner - AI processes each column to standardize and clean data
  5. Stitcher - Combines all improvements into final dataset
  6. Isosplit - Removes outliers and splits data for machine learning

🎛️ Command Options

Model Selection

  • -m <model> - Use same model for all AI steps
  • --model-architect <model> - Specific model for architect step
  • --model-cleaner <model> - Specific model for cleaner step

Processing Options

  • -x <number> - Sample size for architect analysis (default: 5)
  • -i - Use custom instructions from settings/instructions.txt
  • --input <file> - Specify input CSV file (default: data.csv)

Skip Options

  • --skip-preclean - Skip data preparation step
  • --skip-architect - Skip schema design step
  • --skip-dedupe - Skip duplicate detection step
  • --skip-cleaner - Skip column cleaning step
  • --skip-isosplit - Skip outlier detection and data splitting

🤖 AI Models

Recommended Models

| Model | Best For | Speed | Cost | |-------|----------|-------|------| | gemini-2.0-flash-exp | General purpose, fast processing | ⚡⚡⚡ | 💲 | | gemini-2.0-flash-thinking | Complex data analysis | ⚡⚡ | 💲💲 | | gemini-1.5-pro | Large, complex datasets | ⚡ | 💲💲💲 |

Model Selection Tips

  • For speed and cost: Use gemini-2.0-flash-exp
  • For complex, messy data: Use gemini-2.0-flash-thinking for architect
  • For mixed workloads: Use different models per step with --model-architect and --model-cleaner
# List all available models
dbclean models

📝 Custom Instructions

Create custom cleaning instructions to guide the AI:

  1. For architect step: Use the -i flag with a settings/instructions.txt file
  2. Example instructions:
    - Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)
    - Convert all dates to YYYY-MM-DD format
    - Normalize company names (remove Inc, LLC, etc.)
    - Flag any entries with missing critical information
    - Ensure email addresses are properly formatted
dbclean run -i  # Uses instructions from settings/instructions.txt

💡 Usage Examples

Basic Processing

# Process a CSV file with default settings
dbclean run

# Use a specific input file
dbclean run --input customer_data.csv

Advanced Processing

# High-quality processing with larger sample
dbclean run -m "gemini-2.0-flash-thinking" -x 15 -i

# Fast processing for large datasets
dbclean run -m "gemini-2.0-flash-exp" --skip-dedupe

# Custom pipeline - architect only
dbclean run --skip-preclean --skip-cleaner --skip-dedupe --skip-isosplit

Individual Steps

# Run architect with custom model and sample size
dbclean architect -m "gemini-2.0-flash-thinking" -x 10 -i

# Clean data with specific model
dbclean cleaner -m "gemini-2.0-flash-exp"

# Remove duplicates with AI analysis
dbclean dedupe

🎯 Best Practices

1. Start Small and Iterate

# Test with small sample first
dbclean architect -x 3

# Review outputs, then run full pipeline
dbclean run

2. Choose the Right Models

# For complex schema design
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

3. Use Custom Instructions

Create settings/instructions.txt with domain-specific requirements:

Finance data requirements:
- Currency amounts in USD format ($X,XXX.XX)
- Account numbers must be 10-12 digits
- Transaction dates in YYYY-MM-DD format

4. Monitor Your Usage

# Check account status regularly
dbclean account

# Monitor detailed usage
dbclean usage --detailed

❗ Troubleshooting

Common Issues

Authentication Problems

dbclean init     # Re-enter credentials
dbclean test-auth # Verify connection

Data File Issues

  • Ensure data.csv exists in current directory
  • Use --input <file> for different file names
  • Check file permissions and encoding

API Limits

  • Check credit balance: dbclean credits
  • View usage: dbclean usage
  • Free tier: 5 requests per month, then paid credits required

Model Availability

dbclean models   # See available models

Getting Help

dbclean --help              # General help
dbclean run --help          # Command-specific help
dbclean help-commands       # Detailed command reference

📊 Output Files

After processing, you'll have:

  • data/data_stitched.csv - Your final, cleaned dataset
  • data/train.csv - Training data (70%)
  • data/validate.csv - Validation data (15%)
  • data/test.csv - Test data (15%)
  • outputs/cleaner_changes_analysis.html - Visual changes report
  • outputs/architect_output.txt - AI schema analysis
  • outputs/column_mapping.json - Column transformation details

🤝 Support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Ready to clean your data? Start with dbclean init and transform your messy CSV files into pristine datasets! 🚀