samples-generation
Generate sample data for PostgreSQL, ClickHouse, SQLite, and Trino with a unified interface.
Objective
We often need to prefill tables during tests, checks, and measurements. These generators fill tables with random data and apply transformations (template-based construction of values from other columns, lookups in another table, data corruption, etc.). They can be used for a simple prefill or for controlled corruption of data to support further testing, such as entity resolution.
Supported Databases
- PostgreSQL - via the postgres package
- ClickHouse - via @clickhouse/client
- SQLite - via better-sqlite3
- Trino - via trino-client (writes to Iceberg tables)
Installation
pnpm install
Measurements of generations
Environment: local databases, simple setup, 1 billion rows, inserted in batches of 100 million rows.
Trivial generation
5 columns (id, 10-character string, float in the 0-1000 range, string chosen from 3 variants, datetime)
npx tsx scripts/generate-all.ts --scenario simple -r 1_000_000_000 --clickhouse --trino
1 billion rows
ClickHouse, 16 GB RAM: Generated in 12m 44s (generation: 7m 10s, optimisation: 5m 34s), table size: 24.83 GB
Trino, 16 GB RAM + spill by FTE: Generated in 8m 40s (generation: 6m 48s, optimisation: 291ms), table size: 15.70 GB
PostgreSQL: Generated in 54m 26s (generation: 35m 55s, optimisation: 18m 30s), table size: 63.60 GB
10 billion rows
ClickHouse: Generated in 4h 52m 21s (generation: 4h 47m 44s, optimisation: 4m 36s), table size: 248.38 GB
Trino: Generated in 1h 4m 41s (generation: 1h 4m 41s, optimisation: 181ms), table size: 158.62 GB
Names, then templated email based on generated names
7 columns total.
npx tsx scripts/generate-all.ts --scenario english-names -r 1_000_000_000 --clickhouse --trino
1 billion rows
English names:
ClickHouse, 16 GB RAM: Generated in 18m 17s (generation: 7m 25s, transformation: 2m 2s, optimisation: 8m 49s), table size: 44.96 GB
Trino, 16 GB RAM + spill by FTE: Generated in 24m 51s (generation: 7m 47s, transformation: 16m 40s, optimisation: 24.34s), table size: 20.51 GB
Lookup transformation (10K departments → employees)
7 columns in employees table, lookup from department name in 10K-row departments table.
npx tsx scripts/generate-all.ts --scenario lookup-demo -r 1_000_000_000 --clickhouse --trino
1 billion rows
ClickHouse: Generated in 1h 19m 46s (generation: 25m 31s, transformation: 39m 25s, optimisation: 14m 49s), table size: 41.42 GB
Trino, 20 GB RAM: Generated in 32m 57s (generation: 4m 55s, transformation: 26m 36s, optimisation: 1m 25s), table size: 17.94 GB
Trino, 16 GB RAM + spill by FTE: Generated in 46m 11s (generation: 13m 52s, transformation: 31m 43s, optimisation: 35.11s), table size: 17.94 GB
Note
Two configurations are available:
- Standard (20GB): trino - high memory, no spilling
- 16GB comparison: trino-fte (fault-tolerant execution with disk spilling) and clickhouse - both with 16GB limits
See compose/docker-compose.yml for container resource limits and compose/trino/ or compose/trino-fte/ for Trino-specific settings.
Quick Start
Starting Databases
Start all databases:
pnpm compose:up
Or start individual databases:
pnpm compose:postgres # PostgreSQL only
pnpm compose:clickhouse # ClickHouse only
pnpm compose:trino # Trino 20GB, high memory
pnpm compose:trino-fte # Trino 16GB, fault-tolerant execution with spill
Note: trino and trino-fte share port 8080 - stop one before starting the other.
Stop and clean up:
pnpm compose:down # Stop containers
pnpm compose:reset # Stop and remove volumes
Using the Generator API
import {
PostgresDataGenerator,
ClickHouseDataGenerator,
SQLiteDataGenerator,
TrinoDataGenerator,
type TableConfig,
} from "./src/generator/index.js";
const table: TableConfig = {
name: "users",
columns: [
{ name: "id", type: "integer", generator: { kind: "sequence", start: 1 } },
{
name: "name",
type: "string",
generator: { kind: "randomString", length: 10 },
},
{
name: "score",
type: "float",
generator: { kind: "randomFloat", min: 0, max: 100 },
},
{
name: "status",
type: "string",
generator: { kind: "choice", values: ["active", "inactive"] },
},
{ name: "created_at", type: "datetime", generator: { kind: "datetime" } },
],
};
// All generators have the same interface
const generator = new PostgresDataGenerator({
host: "localhost",
port: 5432,
database: "appdb",
username: "postgres",
password: "postgres",
});
await generator.connect();
const result = await generator.generate({
table,
rowCount: 1000,
truncateFirst: true,
resumeSequences: true, // Continue sequence from last max value
batchSize: 100_000_000, // Optional: insert in batches (useful for large datasets)
});
console.log(
`Inserted ${result.rowsInserted} rows in ${result.generateMs}ms (optimize: ${result.optimizeMs}ms)`
);
await generator.disconnect();
Database Configurations
All databases use a consistent host/port configuration:
// PostgreSQL
new PostgresDataGenerator({
host: "localhost",
port: 5432,
database: "appdb",
username: "postgres",
password: "postgres",
});
// ClickHouse
new ClickHouseDataGenerator({
host: "localhost",
port: 8123,
database: "default",
username: "default",
password: "clickhouse",
});
// SQLite
new SQLiteDataGenerator({
path: "data/samples.db",
});
// Trino/Iceberg
new TrinoDataGenerator({
host: "localhost",
port: 8080,
catalog: "iceberg",
schema: "warehouse",
user: "trino",
});
Column Types
| Type | PostgreSQL | ClickHouse | SQLite | Trino |
| ---------- | ---------------- | ---------- | ------- | --------- |
| integer | INTEGER | Int32 | INTEGER | INTEGER |
| bigint | BIGINT | Int64 | INTEGER | BIGINT |
| float | DOUBLE PRECISION | Float64 | REAL | DOUBLE |
| string | TEXT | String | TEXT | VARCHAR |
| boolean | BOOLEAN | Bool | INTEGER | BOOLEAN |
| datetime | TIMESTAMP | DateTime | TEXT | TIMESTAMP |
| date | DATE | Date | TEXT | DATE |
Column Options
Each column can have additional options:
| Option | Type | Default | Description |
| ----------------- | --------- | ------- | ------------------------------------------------ |
| nullable | boolean | false | If true, omits NOT NULL constraint on column |
| nullProbability | number | 0 | Probability of NULL values (0-1) |
Example with nullable column:
{
name: "middle_name",
type: "string",
generator: { kind: "randomString", length: 10 },
nullable: true,
nullProbability: 0.3 // 30% of rows will have NULL
}
Value Generators
| Generator | Kind | Options |
| ---------------- | ---------------- | -------------------------------------- |
| sequence | Auto-increment | start, step |
| randomInt | Random integer | min, max |
| randomFloat | Random float | min, max, precision (default: 2) |
| randomString | Random string | length |
| choice | Pick from list | values |
| choiceByLookup | Optimized choice | values (large arrays) |
| constant | Fixed value | value |
| datetime | Random datetime | from, to |
| uuid | UUID v4 | - |
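As a quick reference, the sketch below combines most of these generators in a single TableConfig. It is a sketch only: the column names are illustrative, and the from/to format for the datetime generator is assumed here to be ISO date strings - check the type definitions for the exact shape.
import { type TableConfig } from "./src/generator/index.js";
// Illustrative table exercising most generator kinds from the table above.
const events: TableConfig = {
  name: "events",
  columns: [
    { name: "id", type: "bigint", generator: { kind: "sequence", start: 1, step: 1 } },
    { name: "event_id", type: "string", generator: { kind: "uuid" } },
    { name: "retries", type: "integer", generator: { kind: "randomInt", min: 0, max: 5 } },
    { name: "duration_s", type: "float", generator: { kind: "randomFloat", min: 0, max: 60, precision: 3 } },
    { name: "source", type: "string", generator: { kind: "constant", value: "import" } },
    { name: "kind", type: "string", generator: { kind: "choice", values: ["click", "view", "purchase"] } },
    {
      name: "occurred_at",
      type: "datetime",
      // from/to format is an assumption (ISO date strings), not confirmed by the docs
      generator: { kind: "datetime", from: "2024-01-01", to: "2024-12-31" },
    },
  ],
};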
choiceByLookup Generator
Use choiceByLookup instead of choice when selecting from thousands of values. It uses CTEs with arrays for O(1) random selection, making it efficient for billions of rows:
{
name: "last_name",
type: "string",
generator: {
kind: "choiceByLookup",
values: ["Smith", "Johnson", "Williams", ...] // thousands of values
}
}
- PostgreSQL: CTE with ARRAY[] and array_length() indexing
- ClickHouse: WITH clause with array variable
- SQLite: CTE with JSON array and json_extract()
- Trino: CTE with ARRAY[] and element_at()
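To make the PostgreSQL bullet above concrete, the generated statement follows roughly this shape. This is an illustrative sketch only, not the library's exact SQL output; table, column, and value names are taken from the example above.
// Illustrative only - roughly the shape of the PostgreSQL approach.
// The CTE exposes the value list as an array written into the SQL once;
// each generated row then picks a random element by index.
const illustrativePostgresSql = `
  WITH lookup AS (SELECT ARRAY['Smith', 'Johnson', 'Williams'] AS vals)
  INSERT INTO employees (last_name)
  SELECT vals[1 + floor(random() * array_length(vals, 1))::int]
  FROM lookup, generate_series(1, 1000);
`;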
Generate Options
interface GenerateOptions {
table: TableConfig;
rowCount: number;
createTable?: boolean; // Default: true
dropFirst?: boolean; // Default: false - drop table before generating
truncateFirst?: boolean; // Default: false
resumeSequences?: boolean; // Default: true - continue from max value
optimize?: boolean; // Default: true - run VACUUM/OPTIMIZE after insert
}
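For example, rebuilding a table from scratch without the post-insert optimisation could look like this - a sketch reusing the generator and table from the Quick Start example, with option values chosen for illustration.
// Reuses `generator` and `table` from the Quick Start example above.
const quickRun = await generator.generate({
  table,
  rowCount: 1_000_000,
  dropFirst: true, // drop and recreate the table
  optimize: false, // skip VACUUM/OPTIMIZE for a quick run
  batchSize: 100_000, // insert in smaller batches
});
console.log(`Inserted ${quickRun.rowsInserted} rows in ${quickRun.generateMs}ms`);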
Transformations
Apply transformations to existing tables. Useful for creating derived columns (like email from first/last name) or introducing realistic data quality issues.
// Generate data first
await generator.generate({ table: usersTable, rowCount: 10000 });
// Then apply transformations
await generator.transform("users", [
{
description: "Generate email addresses",
transformations: [
{
kind: "template",
column: "email",
template: "{first_name}.{last_name}@example.com",
},
],
},
]);
interface TransformResult {
durationMs: number;
batchesApplied: number;
}
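The return value can be used for simple reporting - a sketch assuming transform() resolves to the TransformResult interface shown above, and reusing the generator from the earlier examples.
// Reuses `generator` from the examples above.
const emailResult = await generator.transform("users", [
  {
    description: "Generate email addresses",
    transformations: [
      {
        kind: "template",
        column: "email",
        template: "{first_name}.{last_name}@example.com",
      },
    ],
  },
]);
console.log(
  `Applied ${emailResult.batchesApplied} batch(es) in ${emailResult.durationMs}ms`
);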
Transformation Types
Template Transformation - Build column values from other columns:
{
kind: "template",
column: "email",
template: "{first_name}.{last_name}@example.com",
lowercase: true // Optional: convert result to lowercase
}
Mutate Transformation - Introduce random character mutations:
{
kind: "mutate",
column: "name",
probability: 0.1, // 10% of rows get mutated
operations: ["replace", "delete", "insert"] // Random operation selected per row
}
Lookup Transformation - Assign values from another table via join:
{
kind: "lookup",
column: "category_name", // Column to update
fromTable: "categories", // Source table
fromColumn: "name", // Column to copy value from
joinOn: {
targetColumn: "category_id", // Column in target table
lookupColumn: "id" // Column in source table to match
}
}
Note: For ClickHouse, lookup transformation uses a table swap approach (CREATE → INSERT SELECT with JOIN → RENAME) since ClickHouse doesn't support correlated subqueries in ALTER TABLE UPDATE. This means lookups execute before other transformations in the same batch. If order matters, place lookups in a separate batch.
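For example, to guarantee on ClickHouse that a template sees the looked-up value, keep the lookup in its own batch. This is a sketch using the transformation kinds documented here; the label column and the template are illustrative.
await generator.transform("employees", [
  {
    description: "Resolve department name first (lookup-only batch)",
    transformations: [
      {
        kind: "lookup",
        column: "department_name",
        fromTable: "departments",
        fromColumn: "name",
        joinOn: { targetColumn: "department_id", lookupColumn: "id" },
      },
    ],
  },
  {
    description: "Then build a value that depends on the looked-up column",
    transformations: [
      {
        kind: "template",
        column: "label",
        template: "{last_name} ({department_name})",
      },
    ],
  },
]);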
Swap Transformation - Swap values between two columns with probability:
{
kind: "swap",
column1: "first_name",
column2: "last_name",
probability: 0.1 // 10% of rows get swapped
}
Note: Both columns use the same random decision per row, ensuring atomic swaps (if column1 gets column2's value, column2 always gets column1's value). For ClickHouse, swap also uses the table swap approach (like lookup) since ClickHouse evaluates each rand() call separately. Multiple swaps in the same batch are combined into a single table swap operation for efficiency.
Design Note: PostgreSQL, SQLite, and Trino execute each swap as a separate UPDATE ... WHERE random() < probability statement. This is intentionally not batched because UPDATE is a lightweight operation on these databases. ClickHouse batches swaps because each swap would otherwise require a full table copy (CREATE → INSERT → RENAME → DROP), making the overhead significant.
Batching Transformations
Transformations are organized in batches for efficiency:
- Each batch becomes a separate UPDATE statement (executed sequentially)
- Transformations within a batch are combined into a single UPDATE
- Batches support optional descriptions for logging and debugging
await generator.transform("users", [
{
description: "Generate email addresses",
transformations: [
{
kind: "template",
column: "email",
template: "{first_name}.{last_name}@example.com",
},
],
},
{
description: "Introduce data quality issues",
transformations: [
{
kind: "mutate",
column: "email",
probability: 0.1,
operations: ["replace"],
},
],
},
]);
With descriptions, you'll see helpful logs:
[postgres] Applying transformations: Generate email addresses (1 transformation(s))
[postgres] Applying transformations: Introduce data quality issues (1 transformation(s))
Scenarios
A Scenario orchestrates multi-step data generation workflows - create lookup tables, generate main tables, and apply cross-table transformations in sequence:
import { PostgresDataGenerator, type Scenario } from "./src/generator/index.js";
const scenario: Scenario = {
name: "E-commerce data",
steps: [
// Step 1: Create lookup table
{
table: {
name: "departments",
columns: [
{
name: "id",
type: "integer",
generator: { kind: "sequence", start: 1 },
},
{
name: "name",
type: "string",
generator: {
kind: "choice",
values: ["Engineering", "Sales", "HR"],
},
},
],
},
rowCount: 3,
},
// Step 2: Generate main table with transformations
{
table: {
name: "employees",
columns: [
{
name: "id",
type: "bigint",
generator: { kind: "sequence", start: 1 },
},
{
name: "first_name",
type: "string",
generator: {
kind: "choice",
values: ["John", "Jane", "Bob"],
},
},
{
name: "last_name",
type: "string",
generator: { kind: "choice", values: ["Smith", "Jones"] },
},
{
name: "email",
type: "string",
generator: { kind: "constant", value: "" },
},
{
name: "department_id",
type: "integer",
generator: { kind: "randomInt", min: 1, max: 3 },
},
{
name: "department_name",
type: "string",
generator: { kind: "constant", value: "" },
},
],
},
rowCount: 1000,
transformations: [
{
description: "Generate email from names",
transformations: [
{
kind: "template",
column: "email",
template: "{first_name}.{last_name}@company.com",
lowercase: true,
},
],
},
],
},
// Step 3: Transform-only step - apply cross-table lookup
{
tableName: "employees",
transformations: [
{
description: "Lookup department name",
transformations: [
{
kind: "lookup",
column: "department_name",
fromTable: "departments",
fromColumn: "name",
joinOn: { targetColumn: "department_id", lookupColumn: "id" },
},
],
},
],
},
],
};
const generator = new PostgresDataGenerator({
/* config */
});
await generator.connect();
const result = await generator.runScenario({
scenario,
dropFirst: true,
});
for (const step of result.steps) {
console.log(`[${step.tableName}] ${step.generate?.rowsInserted ?? 0} rows`);
if (step.transform) {
console.log(
` Applied ${step.transform.batchesApplied} transformation batch(es)`
);
}
}
console.log(
`Total: ${result.totalRowsInserted} rows in ${result.durationMs}ms`
);
await generator.disconnect();
Step Types:
| Step Type | Fields | Description |
| -------------------- | --------------------------------------- | ----------------------------------------------- |
| Generate + Transform | table, rowCount, transformations? | Create table, insert rows, optionally transform |
| Transform only | tableName, transformations | Apply transformations to existing table |
interface ScenarioOptions {
scenario: Scenario;
createTable?: boolean; // Default: true
dropFirst?: boolean; // Default: false
truncateFirst?: boolean; // Default: false
resumeSequences?: boolean; // Default: true
optimize?: boolean; // Default: true (runs once at end for all tables)
}
interface ScenarioResult {
steps: ScenarioStepResult[];
totalRowsInserted: number;
durationMs: number; // Total wall-clock time
generateMs: number; // Time spent generating rows
transformMs: number; // Time spent applying transformations
optimizeMs: number; // Time spent on optimisations
}
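Continuing the runScenario() example above, the phase timings can be reported straight from the result; the field names are the ones in the interface above.
// `result` is the ScenarioResult returned by runScenario() in the example above.
console.log(
  `generate: ${result.generateMs}ms, transform: ${result.transformMs}ms, ` +
    `optimize: ${result.optimizeMs}ms (total: ${result.durationMs}ms)`
);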
Escape Utilities
For custom queries, use the exported escape functions:
import {
// Identifier escaping (table/column names)
escapePostgresIdentifier,
escapeClickHouseIdentifier,
escapeTrinoIdentifier,
// Literal escaping (string values)
escapePostgresLiteral,
escapeClickHouseLiteral,
escapeTrinoLiteral,
escapeSqliteLiteral,
} from "@mkven/samples-generation";
// Identifier escaping (for table/column names)
escapePostgresIdentifier("my-table"); // "my-table"
escapePostgresIdentifier('table"name'); // "table""name"
escapeClickHouseIdentifier("my-table"); // `my-table`
escapeClickHouseIdentifier("table`name"); // `table``name`
escapeTrinoIdentifier("samples"); // "samples"
escapeTrinoIdentifier("samples$files"); // "samples$files" (for metadata tables)
// Literal escaping (for string values in SQL)
escapePostgresLiteral("O'Brien"); // 'O''Brien'
escapeClickHouseLiteral("O'Brien"); // 'O\'Brien'
escapeTrinoLiteral("O'Brien"); // 'O''Brien'
escapeSqliteLiteral("O'Brien"); // 'O''Brien'
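For instance, a hand-written query can be assembled from these helpers - a sketch only; executing the SQL with your own database client is up to you, and the table and value here are illustrative.
import {
  escapePostgresIdentifier,
  escapePostgresLiteral,
} from "@mkven/samples-generation";
// Build a custom query with a properly escaped identifier and literal.
const tableName = "users";
const status = "O'Brien's team";
const sql = `SELECT count(*) FROM ${escapePostgresIdentifier(tableName)} WHERE status = ${escapePostgresLiteral(status)}`;
// => SELECT count(*) FROM "users" WHERE status = 'O''Brien''s team'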
Table Size
Get the size of a table (including indexes):
// Get size in bytes
const bytes = await generator.getTableSize("users");
// 1234567
// Get human-readable size
const size = await generator.getTableSizeForHuman("users");
// "1.18 MB"You can also use the formatBytes utility directly:
import { formatBytes } from "./src/generator/index.js";
formatBytes(1024); // "1.00 KB"
formatBytes(1048576); // "1.00 MB"
Optimization
By default, generate() runs database-specific optimization after inserting rows:
| Database | Optimization |
| ---------- | ----------------------------------------------------------------- |
| PostgreSQL | VACUUM ANALYZE - reclaims storage and updates statistics |
| ClickHouse | OPTIMIZE TABLE FINAL - merges all parts for MergeTree engines |
| SQLite | VACUUM + ANALYZE - rebuilds file and gathers statistics |
| Trino | optimize + expire_snapshots + remove_orphan_files - Iceberg |
Disable for quick tests:
await generator.generate({
table,
rowCount: 1000,
optimize: false, // Skip VACUUM/OPTIMIZE
});
Or call manually:
await generator.optimize("users");
Scripts
Generate Data
# Generate 1000 rows in all databases (requires docker-compose up)
npx tsx scripts/generate-all.ts
# Specify row count
npx tsx scripts/generate-all.ts --rows 1000
npx tsx scripts/generate-all.ts -r 1_000_000
# Use batching for large datasets (helps with memory and timeouts)
npx tsx scripts/generate-all.ts -r 1_000_000_000 -b 100_000_000 # 10 batches of 100M
npx tsx scripts/generate-all.ts -r 1_000_000_000 --batch 50_000_000 # 20 batches of 50M
# Choose scenario
npx tsx scripts/generate-all.ts --scenario simple # Default: 5 columns
npx tsx scripts/generate-all.ts --scenario english-names # Names + email template
npx tsx scripts/generate-all.ts --scenario russian-names # Russian names + email
npx tsx scripts/generate-all.ts --scenario lookup-demo # Departments + employees lookup
# Generate for specific databases only
npx tsx scripts/generate-all.ts --sqlite
npx tsx scripts/generate-all.ts --postgres
npx tsx scripts/generate-all.ts --clickhouse
npx tsx scripts/generate-all.ts --trino
# Combine options
npx tsx scripts/generate-all.ts -r 10000 --scenario english-names --postgres --clickhouse
# Show help
npx tsx scripts/generate-all.ts --help
Docker Compose
Start all databases:
pnpm compose:up
Services available:
| Service    | Port(s)                    | Credentials           |
| ---------- | -------------------------- | --------------------- |
| PostgreSQL | 5432                       | postgres:postgres     |
| ClickHouse | 8123 (HTTP), 9009 (native) | default:clickhouse    |
| Trino      | 8080                       | trino (no password)   |
| MinIO      | 9000 (S3), 9001 (console)  | minioadmin:minioadmin |
| Nessie     | 19120                      | -                     |
Testing
# Run unit tests (SQL expression generators, escape functions)
pnpm test
# Run e2e tests against SQLite (default, no setup required)
pnpm test:e2e
# Run e2e tests with specific databases (requires docker-compose up)
TEST_POSTGRES=1 pnpm test:e2e
TEST_CLICKHOUSE=1 pnpm test:e2e
TEST_TRINO=1 pnpm test:e2e
# Run e2e tests against all databases
TEST_POSTGRES=1 TEST_CLICKHOUSE=1 TEST_TRINO=1 pnpm test:e2e
# Or use the shortcut script
./test-all-dbs.sh
Quality Checks
# Run formatting, linting, type checking, and tests
./check.sh
# Check for security vulnerabilities and outdated dependencies
./health.sh
# Check for dependency updates (requires npx renovate)
./renovate-check.sh
# Run all checks
./all-checks.sh
Projects Using This Library
- indexless-query-benchmarks — Benchmark queries on Trino+Iceberg, PostgreSQL, and ClickHouse without indexes
