automated-database-seed

v1.0.0

Published

6 months ago

thesaasdevkit

database test-data postgresql seeding faker data-generation testing

🚀 Smart Test Data Fabricator

Generate realistic, schema-aware test data for PostgreSQL databases in seconds, not hours.

Stop wasting time manually scrubbing production data or chasing broken foreign keys left behind by simple faker scripts. Smart Test Data Fabricator reads your database schema and generates consistent, realistic test data that respects its relationships and constraints.

# Generate realistic test data instantly
fabricate-data --url postgresql://localhost/testdb --mode generate

# Safely scrub production data 
fabricate-data --url postgresql://prod/backup --mode scrub --output postgresql://test/db

✨ Why Smart Test Data Fabricator?

The Problem

Manual data scrubbing takes hours and risks PII leaks
Simple faker libraries create data that violates foreign keys
Copying production is dangerous and time-consuming
Writing custom scripts for each schema is tedious

The Solution

🧠 Schema-aware: Automatically understands tables, relationships, and constraints
🔗 Referential integrity: All foreign keys point to valid records
🎯 Realistic data: Infers believable values from column names and types
🛡️ PII protection: Safely anonymizes sensitive data while preserving structure
⚡ Fast: Loads millions of records using bulk inserts and parallel processing
🔧 Easy: Works with simple commands or detailed configurations

🚀 Quick Start

Installation

pip install smart-test-data-fabricator

Generate Your First Dataset

# Connect to your empty database and generate realistic data
fabricate-data --url postgresql://user:pass@localhost/testdb --mode generate

# See what it would do without making changes
fabricate-data --url postgresql://localhost/testdb --mode generate --dry-run

That's it! The tool will:

🔍 Analyze your database schema
📊 Determine table dependencies
🎲 Generate realistic, consistent data
✅ Maintain all foreign key relationships
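
The dependency step can be pictured as a topological sort over foreign keys: parent tables are filled before the tables that reference them. The following is a minimal sketch, assuming the schema has already been reduced to a mapping from each table to the tables it references. The `insertion_order` helper and the example tables are hypothetical and are not taken from the tool's internals.

```python
from collections import deque
from typing import Dict, List, Set


def insertion_order(references: Dict[str, Set[str]]) -> List[str]:
    """Return tables ordered so every table comes after the tables it references.

    `references` maps a table name to the set of tables its foreign keys
    point at. Self-references are ignored here because they do not constrain
    the order between tables.
    """
    tables = set(references)
    for parents in references.values():
        tables |= parents

    # Track unmet parents per table and remember which tables depend on each parent.
    pending = {t: set(references.get(t, ())) - {t} for t in tables}
    children: Dict[str, Set[str]] = {t: set() for t in tables}
    for table, parents in pending.items():
        for parent in parents:
            children[parent].add(table)

    # Start with tables that reference nothing; sorting keeps the result stable.
    ready = deque(sorted(t for t, parents in pending.items() if not parents))
    order: List[str] = []
    while ready:
        table = ready.popleft()
        order.append(table)
        for child in sorted(children[table]):
            pending[child].discard(table)
            if not pending[child]:
                ready.append(child)

    if len(order) != len(tables):
        cycle = sorted(tables - set(order))
        raise ValueError(f"circular foreign keys among: {cycle}")
    return order


schema = {
    "users": set(),
    "products": set(),
    "orders": {"users"},
    "order_items": {"orders", "products"},
}
print(insertion_order(schema))
```

A cycle between tables raises an error in this sketch; a real tool would need a strategy for that case, such as inserting nullable keys as NULL and updating them afterwards.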

💡 Common Use Cases

🏗️ Local Development

Perfect for seeding your development database with realistic data:

# Generate a small dataset for development
fabricate-data --url postgresql://localhost/myapp_dev --template small_demo

🧪 Testing & QA

Create specific scenarios for testing edge cases:

# test_scenario.yml
tables:
  users: 100
  orders: 
    count: "each users 0-15"  # Some users have no orders, others have many
    fields:
      status: [pending:20%, completed:70%, cancelled:10%]
  products: 50

fabricate-data --url postgresql://localhost/test --config test_scenario.yml
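
The bracketed `value:percent%` lists in the configuration above describe a weighted choice. One way to interpret such a list is sketched below, assuming the entries arrive as plain strings. The parsing rules and the `parse_weights` and `sample_values` helpers are assumptions made for illustration, not the tool's documented behavior.

```python
import random
from typing import Dict, List


def parse_weights(entries: List[str]) -> Dict[str, float]:
    """Turn ["pending:20%", "completed:70%"] into {"pending": 20.0, ...}."""
    weights: Dict[str, float] = {}
    for entry in entries:
        # Split on the last colon so values containing ':' still work.
        value, _, percent = entry.rpartition(":")
        weights[value.strip()] = float(percent.strip().rstrip("%"))
    total = sum(weights.values())
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"weights sum to {total}, expected 100")
    return weights


def sample_values(weights: Dict[str, float], n: int, seed: int) -> List[str]:
    """Draw n values with the given relative weights, reproducibly."""
    rng = random.Random(seed)
    return rng.choices(list(weights), weights=list(weights.values()), k=n)


status = parse_weights(["pending:20%", "completed:70%", "cancelled:10%"])
sample = sample_values(status, n=1000, seed=12345)
print({value: sample.count(value) for value in status})
```

Using a dedicated `random.Random(seed)` instance rather than the global generator keeps the draw reproducible even when other code also uses `random`.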

🏭 CI/CD Pipeline Integration

Automatically seed test databases in your deployment pipeline:

# .github/workflows/test.yml
- name: Seed Test Database
  run: |
    fabricate-data \
      --url "${{ secrets.TEST_DATABASE_URL }}" \
      --mode generate \
      --template integration_test \
      --quiet

🔒 Production Data Scrubbing

Safely anonymize production data for development use:

# Scrub sensitive data while preserving relationships
fabricate-data \
  --url postgresql://prod-backup/db \
  --mode scrub \
  --output postgresql://staging/db \
  --config scrub_rules.yml

📚 Configuration Examples

Simple Generation

# quick_demo.yml
mode: generate
tables:
  users: 50
  posts: 200
  comments: 500

Advanced Scenarios

# ecommerce_scenario.yml
mode: generate
settings:
  seed: 12345  # Reproducible data
  
tables:
  users: 
    count: 1000
    fields:
      email_verified: 80% true
      plan: [free:70%, pro:25%, enterprise:5%]
      
  products:
    count: 200
    fields:
      category: [electronics:30%, clothing:25%, books:20%, home:25%]
      
  orders:
    count: "each users 0-10"  # Realistic distribution
    fields:
      status: [pending:5%, shipped:20%, delivered:70%, returned:5%]
      
  order_items:
    count: "each orders 1-5"
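
A count such as `"each users 0-10"` reads as: for every row in the parent table, create between 0 and 10 child rows. The sketch below shows that interpretation, assuming a uniform draw per parent, since the distribution the tool actually uses is not stated here. The helper names are hypothetical.

```python
import random
import re
from typing import Dict, List, Tuple

COUNT_SPEC = re.compile(r"^each\s+(\w+)\s+(\d+)-(\d+)$")


def parse_count_spec(spec: str) -> Tuple[str, int, int]:
    """Parse "each users 0-10" into ("users", 0, 10)."""
    match = COUNT_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognized count spec: {spec!r}")
    parent, low, high = match.group(1), int(match.group(2)), int(match.group(3))
    if low > high:
        raise ValueError(f"empty range in count spec: {spec!r}")
    return parent, low, high


def children_per_parent(spec: str, parent_ids: List[int], seed: int) -> Dict[int, int]:
    """Decide how many child rows each parent row receives (uniform draw)."""
    _, low, high = parse_count_spec(spec)
    rng = random.Random(seed)
    return {parent_id: rng.randint(low, high) for parent_id in parent_ids}


plan = children_per_parent("each users 0-10", parent_ids=list(range(1, 6)), seed=12345)
print(plan, "total orders:", sum(plan.values()))
```

Because the child count is decided per parent, the total size of `orders` is only known after the plan is drawn, which is one reason parent tables have to be generated first.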

PII Scrubbing Rules

# scrub_rules.yml
mode: scrub
auto_detect_pii: true

custom_rules:
  users.email: fake_email
  users.phone: fake_phone  
  users.ssn: mask_with_x
  profiles.bio: lorem_paragraph
  
consistency:
  preserve_relationships: true
  maintain_distributions: true
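
Preserving relationships while scrubbing usually requires that the same original value always maps to the same replacement, so joins on a scrubbed column still line up. The sketch below illustrates that idea with a keyed hash. The function names echo the rule names above, but the implementation, the secret key, and the `example.com` domain are assumptions.

```python
import hashlib
import hmac


def _digest(value: str, key: bytes) -> str:
    """Keyed hash, so the mapping cannot be recomputed without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def fake_email(original: str, key: bytes) -> str:
    """Map an email to a stable, non-identifying address."""
    return f"user_{_digest(original.lower(), key)[:12]}@example.com"


def mask_with_x(original: str, keep_last: int = 0) -> str:
    """Replace letters and digits with 'X', keeping separators and length."""
    cutoff = len(original) - keep_last
    return "".join(
        "X" if ch.isalnum() and i < cutoff else ch
        for i, ch in enumerate(original)
    )


key = b"rotate-me-per-environment"  # hypothetical secret, not a tool setting
print(fake_email("Alice@Corp.com", key) == fake_email("alice@corp.com", key))  # True
print(mask_with_x("123-45-6789"))  # XXX-XX-XXXX
```

A deterministic mapping keeps foreign-key-like joins intact, at the cost that anyone holding the key can recompute the mapping, so the key should not travel with the scrubbed data.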

🛠️ CLI Reference

Basic Commands

# Generate mode - create synthetic data
fabricate-data --url <database_url> --mode generate [options]

# Scrub mode - anonymize existing data  
fabricate-data --url <source_url> --mode scrub --output <target_url> [options]

Essential Options

| Option | Description | Example |
|--------|-------------|---------|
| --config FILE | Use configuration file | --config scenario.yml |
| --template NAME | Use built-in template | --template small_demo |
| --dry-run | Show what would be done | --dry-run |
| --quiet | Minimal output for scripts | --quiet |
| --verbose | Detailed logging | --verbose |
| --seed NUMBER | Reproducible generation | --seed 12345 |
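
When the CLI is driven from a test harness, it can help to assemble the documented flags in one place. The sketch below only builds an argument list from the options listed above. The `build_command` helper is hypothetical, and actually running the result assumes `fabricate-data` is on `PATH`.

```python
from typing import List, Optional


def build_command(
    url: str,
    mode: str = "generate",
    output: Optional[str] = None,
    config: Optional[str] = None,
    template: Optional[str] = None,
    seed: Optional[int] = None,
    dry_run: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Assemble a fabricate-data argument list from the documented options."""
    if mode not in ("generate", "scrub"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "scrub" and output is None:
        raise ValueError("scrub mode needs an --output target")
    command = ["fabricate-data", "--url", url, "--mode", mode]
    if output is not None:
        command += ["--output", output]
    if config is not None:
        command += ["--config", config]
    if template is not None:
        command += ["--template", template]
    if seed is not None:
        command += ["--seed", str(seed)]
    if dry_run:
        command.append("--dry-run")
    if quiet:
        command.append("--quiet")
    return command


cmd = build_command(
    "postgresql://localhost/testdb", template="small_demo", seed=12345, dry_run=True
)
print(" ".join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the database URL contains special characters.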

Built-in Templates

small_demo - Small dataset for development (hundreds of records)
integration_test - Medium dataset for testing (thousands of records)
performance_test - Large dataset for load testing (100K+ records)
saas_app - Typical SaaS application schema
ecommerce - E-commerce platform schema

🎯 Advanced Features

Smart Data Generation

The tool automatically generates realistic data based on column names and types:

email columns → realistic addresses such as user@example.com
phone columns → (555) 123-4567
first_name + last_name → Consistent fake names
created_at → Realistic timestamps
price columns → Reasonable monetary values
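
Name-based generation can be pictured as an ordered list of patterns, each paired with a generator. The sketch below uses only the standard library and made-up value pools. The patterns, helper names, and value ranges are assumptions, not the tool's actual rules.

```python
import random
import re
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

Generator = Callable[[random.Random], object]

FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton"]


def _email(rng: random.Random) -> str:
    name = f"{rng.choice(FIRST_NAMES)}.{rng.choice(LAST_NAMES)}{rng.randint(1, 999)}"
    return f"{name}@example.com".lower()


def _phone(rng: random.Random) -> str:
    return f"(555) {rng.randint(100, 999)}-{rng.randint(0, 9999):04d}"


def _timestamp(rng: random.Random) -> datetime:
    # Some moment within the two years before a fixed reference date.
    return datetime(2024, 1, 1) - timedelta(seconds=rng.randint(0, 2 * 365 * 24 * 3600))


def _price(rng: random.Random) -> float:
    return round(rng.uniform(1, 500), 2)


# The first matching pattern wins, so more specific patterns go first.
RULES: List[Tuple[re.Pattern, Generator]] = [
    (re.compile(r"email"), _email),
    (re.compile(r"phone"), _phone),
    (re.compile(r"first_name"), lambda rng: rng.choice(FIRST_NAMES)),
    (re.compile(r"last_name"), lambda rng: rng.choice(LAST_NAMES)),
    (re.compile(r"_at$"), _timestamp),
    (re.compile(r"price|amount"), _price),
]


def generate_value(column: str, rng: random.Random) -> object:
    """Pick a generator from the column name, falling back to a short token."""
    for pattern, generator in RULES:
        if pattern.search(column.lower()):
            return generator(rng)
    return f"value_{rng.randint(0, 99999)}"


rng = random.Random(12345)
print({col: generate_value(col, rng) for col in ["email", "phone", "created_at", "price"]})
```

Matching on names alone would misfire on a boolean column such as `email_verified`, which is why a real implementation would also need to consult the column type.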

Referential Integrity

Automatically handles complex relationships:

✅ Foreign keys always point to valid records
✅ Handles self-referencing tables (categories, employees)
✅ Manages circular dependencies intelligently
✅ Supports composite keys and unique constraints
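
For a self-referencing table, one common approach is to generate rows in order and let each row's parent be either NULL or a row that already exists, which rules out dangling references and cycles. The sketch below follows that assumption. The `generate_tree_rows` helper and the `parent_id` column are illustrative, and the source does not say which strategy the tool itself uses.

```python
import random
from typing import Dict, List, Optional


def generate_tree_rows(
    count: int, root_share: float, seed: int
) -> List[Dict[str, Optional[int]]]:
    """Create rows whose parent_id is None or the id of an earlier row."""
    rng = random.Random(seed)
    rows: List[Dict[str, Optional[int]]] = []
    for row_id in range(1, count + 1):
        # The first row must be a root; later rows become roots with
        # probability root_share.
        is_root = not rows or rng.random() < root_share
        parent_id = None if is_root else rng.choice(rows)["id"]
        rows.append({"id": row_id, "parent_id": parent_id})
    return rows


categories = generate_tree_rows(count=8, root_share=0.25, seed=12345)
for row in categories:
    print(row)
```

Since every parent has a smaller id than its child, inserting the rows in id order satisfies the foreign key at each step.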

Performance Optimization

Efficient for large datasets:

🚀 Bulk inserts using PostgreSQL COPY protocol
🧵 Parallel processing for independent tables
💾 Memory-efficient streaming for large datasets
📊 Progress reporting for long-running operations
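
Bulk loading through COPY typically means serializing rows to CSV in bounded batches rather than issuing one INSERT per row. The sketch below shows only the batching and CSV serialization, using the standard library. The helper names are hypothetical, and the commented-out call assumes psycopg2's `cursor.copy_expert`, which is one possible driver rather than the one this tool is known to use.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def batched(rows: Iterable[Sequence[object]], size: int) -> Iterator[List[Sequence[object]]]:
    """Yield lists of at most `size` rows without materializing the input."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def to_csv_buffer(batch: Iterable[Sequence[object]]) -> io.StringIO:
    """Serialize one batch into an in-memory CSV file positioned at the start."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(batch)
    buffer.seek(0)
    return buffer


rows = ((i, f"user{i}@example.com") for i in range(1, 2501))  # lazy generator
for batch in batched(rows, size=1000):
    buffer = to_csv_buffer(batch)
    # With psycopg2, this buffer could be sent as:
    # cursor.copy_expert("COPY users (id, email) FROM STDIN WITH (FORMAT csv)", buffer)
    print(len(batch), "rows,", len(buffer.getvalue()), "characters")
```

Only one batch is held in memory at a time, so peak memory depends on the batch size rather than on the total row count.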

🔧 Development Setup

Prerequisites

Python 3.8+
PostgreSQL 9.6+

Local Development

# Clone the repository
git clone https://github.com/your-org/smart-test-data-fabricator
cd smart-test-data-fabricator

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .

# Run tests
pytest tests/

Docker Usage

# Using Docker Compose
docker-compose up -d postgres
export DATABASE_URL="postgresql://test:test@localhost:5432/testdb"
fabricate-data --url "$DATABASE_URL" --mode generate

🐛 Troubleshooting

Common Issues

Connection refused

# Check your database URL and credentials
fabricate-data --url postgresql://user:pass@host:port/db --mode generate --dry-run
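
Before retrying, it can help to confirm what the URL actually resolves to and whether the port is reachable. The following is a small diagnostic sketch using only the standard library. The `describe_url` and `can_connect` helpers are hypothetical and independent of the tool, and the fallback port 5432 is PostgreSQL's conventional default.

```python
import socket
from typing import Dict, Optional
from urllib.parse import urlsplit


def describe_url(url: str) -> Dict[str, Optional[object]]:
    """Split a PostgreSQL URL into its parts, hiding the password."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "has_password": parts.password is not None,
        "host": parts.hostname or "localhost",
        "port": parts.port or 5432,
        "database": parts.path.lstrip("/") or None,
    }


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


info = describe_url("postgresql://user:pass@localhost:5432/testdb")
print(info)
print("reachable:", can_connect(info["host"], info["port"]))
```

If the port is reachable but the tool still fails, the problem is more likely credentials or the database name than networking.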

Foreign key violations

# The tool should prevent this, but if it happens:
fabricate-data --url <database_url> --mode generate --validate-schema

Out of memory with large datasets

# Use streaming mode for large datasets
fabricate-data --url <database_url> --mode generate --batch-size 1000 --stream

PII not detected during scrubbing

# Use custom rules for specific columns
fabricate-data --url <source_url> --mode scrub --output <target_url> --config custom_pii_rules.yml

Getting Help

📖 Check our documentation
🐛 Report bugs on GitHub Issues
💬 Join our Discord community
📧 Email the maintainers

🤝 Contributing

We love contributions! Here's how to help:

🍴 Fork the repository
🌿 Create a feature branch (git checkout -b feature/amazing-feature)
✅ Test your changes (pytest tests/)
📝 Commit your changes (git commit -am 'Add amazing feature')
🚀 Push to the branch (git push origin feature/amazing-feature)
🔄 Open a Pull Request

Development Guidelines

Write tests for new features
Follow PEP 8 style guidelines
Update documentation for user-facing changes
Add type hints for better code quality

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with ❤️ for developers who deserve better test data tools
Inspired by the pain of manual data scrubbing and broken faker scripts
Thanks to the PostgreSQL community for excellent introspection capabilities
Special thanks to the Faker library maintainers

Ready to generate some realistic test data?

pip install smart-test-data-fabricator
fabricate-data --url postgresql://localhost/myapp --mode generate

Made with ❤️ by developers, for developers