samples-generation

Generate sample data for multiple databases with a unified interface.

Objective

We often need to prefill tables during tests, checks, and measurements. These generators fill tables with random data and apply transformations: template-based construction of values from other columns, lookups into another table, deliberate data corruptions, and so on. They can be used for simple prefills as well as for controlled corruptions when testing entity resolution.

Supported Databases

  • PostgreSQL - via postgres package
  • ClickHouse - via @clickhouse/client
  • SQLite - via better-sqlite3
  • Trino - via trino-client (writes to Iceberg tables)

Installation

pnpm install

Generation Measurements

Environment: local databases, simple setup, 1 billion rows, inserted in batches of 100 million.

Trivial generation

5 columns (id, 10-char random string, float in 0–1000, string chosen from 3 variants, datetime)

npx tsx scripts/generate-all.ts --scenario simple -r 1_000_000_000 --clickhouse --trino

1 billion rows

ClickHouse, 16 GB RAM: Generated in 12m 44s (generation: 7m 10s, optimisation: 5m 34s), table size: 24.83 GB

Trino, 16 GB RAM + spill via FTE: Generated in 8m 40s (generation: 6m 48s, optimisation: 291ms), table size: 15.70 GB

PostgreSQL: Generated in 54m 26s (generation: 35m 55s, optimisation: 18m 30s), table size: 63.60 GB

10 billion rows

ClickHouse: Generated in 4h 52m 21s (generation: 4h 47m 44s, optimisation: 4m 36s), table size: 248.38 GB

Trino: Generated in 1h 4m 41s (generation: 1h 4m 41s, optimisation: 181ms), table size: 158.62 GB

Names, then a templated email built from the generated names

7 columns total.

npx tsx scripts/generate-all.ts --scenario english-names -r 1_000_000_000 --clickhouse --trino

1 billion rows

English names:

ClickHouse, 16 GB RAM: Generated in 18m 17s (generation: 7m 25s, transformation: 2m 2s, optimisation: 8m 49s), table size: 44.96 GB

Trino, 16 GB RAM + spill via FTE: Generated in 24m 51s (generation: 7m 47s, transformation: 16m 40s, optimisation: 24.34s), table size: 20.51 GB

Lookup transformation (10K departments → employees)

7 columns in the employees table; the department name is looked up from a 10K-row departments table.

npx tsx scripts/generate-all.ts --scenario lookup-demo -r 1_000_000_000 --clickhouse --trino

1 billion rows

ClickHouse: Generated in 1h 19m 46s (generation: 25m 31s, transformation: 39m 25s, optimisation: 14m 49s), table size: 41.42 GB

Trino, 20 GB RAM: Generated in 32m 57s (generation: 4m 55s, transformation: 26m 36s, optimisation: 1m 25s), table size: 17.94 GB

Trino, 16 GB RAM + spill via FTE: Generated in 46m 11s (generation: 13m 52s, transformation: 31m 43s, optimisation: 35.11s), table size: 17.94 GB

Note

Two configurations are available:

  • Standard (20 GB): trino - high memory, no spilling
  • 16 GB comparison: trino-fte (fault-tolerant execution with disk spilling) and clickhouse - both with 16 GB limits

See compose/docker-compose.yml for container resource limits and compose/trino/ or compose/trino-fte/ for Trino-specific settings.

Quick Start

Starting Databases

Start all databases:

pnpm compose:up

Or start individual databases:

pnpm compose:postgres    # PostgreSQL only
pnpm compose:clickhouse  # ClickHouse only
pnpm compose:trino       # Trino 20GB, high memory
pnpm compose:trino-fte   # Trino 16GB, fault-tolerant execution with spill

Note: trino and trino-fte share port 8080 - stop one before starting the other.

Stop and clean up:

pnpm compose:down   # Stop containers
pnpm compose:reset  # Stop and remove volumes

Using the Generator API

import {
  PostgresDataGenerator,
  ClickHouseDataGenerator,
  SQLiteDataGenerator,
  TrinoDataGenerator,
  type TableConfig,
} from "./src/generator/index.js";

const table: TableConfig = {
  name: "users",
  columns: [
    { name: "id", type: "integer", generator: { kind: "sequence", start: 1 } },
    {
      name: "name",
      type: "string",
      generator: { kind: "randomString", length: 10 },
    },
    {
      name: "score",
      type: "float",
      generator: { kind: "randomFloat", min: 0, max: 100 },
    },
    {
      name: "status",
      type: "string",
      generator: { kind: "choice", values: ["active", "inactive"] },
    },
    { name: "created_at", type: "datetime", generator: { kind: "datetime" } },
  ],
};

// All generators have the same interface
const generator = new PostgresDataGenerator({
  host: "localhost",
  port: 5432,
  database: "appdb",
  username: "postgres",
  password: "postgres",
});

await generator.connect();
const result = await generator.generate({
  table,
  rowCount: 1000,
  truncateFirst: true,
  resumeSequences: true, // Continue sequence from last max value
  batchSize: 100_000_000, // Optional: insert in batches (useful for large datasets)
});
console.log(
  `Inserted ${result.rowsInserted} rows in ${result.generateMs}ms (optimize: ${result.optimizeMs}ms)`
);
await generator.disconnect();

Database Configurations

Each database uses a consistent configuration shape (SQLite takes a file path instead of host/port):

// PostgreSQL
new PostgresDataGenerator({
  host: "localhost",
  port: 5432,
  database: "appdb",
  username: "postgres",
  password: "postgres",
});

// ClickHouse
new ClickHouseDataGenerator({
  host: "localhost",
  port: 8123,
  database: "default",
  username: "default",
  password: "clickhouse",
});

// SQLite
new SQLiteDataGenerator({
  path: "data/samples.db",
});

// Trino/Iceberg
new TrinoDataGenerator({
  host: "localhost",
  port: 8080,
  catalog: "iceberg",
  schema: "warehouse",
  user: "trino",
});

Column Types

| Type     | PostgreSQL       | ClickHouse | SQLite  | Trino     |
| -------- | ---------------- | ---------- | ------- | --------- |
| integer  | INTEGER          | Int32      | INTEGER | INTEGER   |
| bigint   | BIGINT           | Int64      | INTEGER | BIGINT    |
| float    | DOUBLE PRECISION | Float64    | REAL    | DOUBLE    |
| string   | TEXT             | String     | TEXT    | VARCHAR   |
| boolean  | BOOLEAN          | Bool       | INTEGER | BOOLEAN   |
| datetime | TIMESTAMP        | DateTime   | TEXT    | TIMESTAMP |
| date     | DATE             | Date       | TEXT    | DATE      |

Column Options

Each column can have additional options:

| Option          | Type    | Default | Description                                  |
| --------------- | ------- | ------- | -------------------------------------------- |
| nullable        | boolean | false   | If true, omits NOT NULL constraint on column |
| nullProbability | number  | 0       | Probability of NULL values (0-1)             |

Example with nullable column:

{
  name: "middle_name",
  type: "string",
  generator: { kind: "randomString", length: 10 },
  nullable: true,
  nullProbability: 0.3  // 30% of rows will have NULL
}

Value Generators

| Generator      | Kind             | Options                          |
| -------------- | ---------------- | -------------------------------- |
| sequence       | Auto-increment   | start, step                      |
| randomInt      | Random integer   | min, max                         |
| randomFloat    | Random float     | min, max, precision (default: 2) |
| randomString   | Random string    | length                           |
| choice         | Pick from list   | values                           |
| choiceByLookup | Optimized choice | values (large arrays)            |
| constant       | Fixed value      | value                            |
| datetime       | Random datetime  | from, to                         |
| uuid           | UUID v4          | -                                |
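
Of these, uuid and range-bounded datetime don't appear in the examples elsewhere in this README. A hedged sketch of both (the exact from/to value format is an assumption; check the package's type definitions):

{ name: "request_id", type: "string", generator: { kind: "uuid" } },
{
  name: "seen_at",
  type: "datetime",
  // from/to written as ISO strings here; this format is an assumption
  generator: {
    kind: "datetime",
    from: "2024-01-01T00:00:00Z",
    to: "2024-12-31T23:59:59Z",
  },
},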

choiceByLookup Generator

Use choiceByLookup instead of choice when selecting from thousands of values. It uses CTEs with arrays for O(1) random selection, making it efficient for billions of rows:

{
  name: "last_name",
  type: "string",
  generator: {
    kind: "choiceByLookup",
    values: ["Smith", "Johnson", "Williams", ...] // thousands of values
  }
}

Per-database implementation:

  • PostgreSQL: CTE with ARRAY[] and array_length() indexing
  • ClickHouse: WITH clause with array variable
  • SQLite: CTE with JSON array and json_extract()
  • Trino: CTE with ARRAY[] and element_at()

Generate Options

interface GenerateOptions {
  table: TableConfig;
  rowCount: number;
  createTable?: boolean; // Default: true
  dropFirst?: boolean; // Default: false - drop table before generating
  truncateFirst?: boolean; // Default: false
  resumeSequences?: boolean; // Default: true - continue from max value
  optimize?: boolean; // Default: true - run VACUUM/OPTIMIZE after insert
}
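
For example, a destructive rebuild that skips the post-insert optimize pass could combine these flags (a sketch reusing the table and generator from the Quick Start):

await generator.generate({
  table,
  rowCount: 1_000_000,
  dropFirst: true, // recreate the table from scratch
  resumeSequences: false, // restart sequences instead of continuing from max
  optimize: false, // skip VACUUM/OPTIMIZE for faster iteration
});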

Transformations

Apply transformations to existing tables. Useful for creating derived columns (like email from first/last name) or introducing realistic data quality issues.

// Generate data first
await generator.generate({ table: usersTable, rowCount: 10000 });

// Then apply transformations
await generator.transform("users", [
  {
    description: "Generate email addresses",
    transformations: [
      {
        kind: "template",
        column: "email",
        template: "{first_name}.{last_name}@example.com",
      },
    ],
  },
]);

transform() resolves to a TransformResult:

interface TransformResult {
  durationMs: number;
  batchesApplied: number;
}
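
So a run can be summarized from the returned value (a small sketch):

const result = await generator.transform("users", [
  {
    transformations: [
      { kind: "mutate", column: "name", probability: 0.05, operations: ["replace"] },
    ],
  },
]);
console.log(`Applied ${result.batchesApplied} batch(es) in ${result.durationMs}ms`);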

Transformation Types

Template Transformation - Build column values from other columns:

{
  kind: "template",
  column: "email",
  template: "{first_name}.{last_name}@example.com",
  lowercase: true  // Optional: convert result to lowercase
}

Mutate Transformation - Introduce random character mutations:

{
  kind: "mutate",
  column: "name",
  probability: 0.1,  // 10% of rows get mutated
  operations: ["replace", "delete", "insert"]  // Random operation selected per row
}

Lookup Transformation - Assign values from another table via join:

{
  kind: "lookup",
  column: "category_name",      // Column to update
  fromTable: "categories",      // Source table
  fromColumn: "name",           // Column to copy value from
  joinOn: {
    targetColumn: "category_id", // Column in target table
    lookupColumn: "id"           // Column in source table to match
  }
}

Note: For ClickHouse, lookup transformation uses a table swap approach (CREATE → INSERT SELECT with JOIN → RENAME) since ClickHouse doesn't support correlated subqueries in ALTER TABLE UPDATE. This means lookups execute before other transformations in the same batch. If order matters, place lookups in a separate batch.
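
For example, to make a lookup complete before a template that reads the looked-up column on ClickHouse, split them into two batches (a sketch using the batching API described below; column names are illustrative):

await generator.transform("employees", [
  {
    description: "Resolve department names first",
    transformations: [
      {
        kind: "lookup",
        column: "department_name",
        fromTable: "departments",
        fromColumn: "name",
        joinOn: { targetColumn: "department_id", lookupColumn: "id" },
      },
    ],
  },
  {
    description: "Then derive values that read the looked-up column",
    transformations: [
      {
        kind: "template",
        column: "email",
        template: "{first_name}.{department_name}@example.com",
        lowercase: true,
      },
    ],
  },
]);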

Swap Transformation - Swap values between two columns with probability:

{
  kind: "swap",
  column1: "first_name",
  column2: "last_name",
  probability: 0.1  // 10% of rows get swapped
}

Note: Both columns use the same random decision per row, ensuring atomic swaps (if column1 gets column2's value, column2 always gets column1's value). For ClickHouse, swap also uses the table swap approach (like lookup) since ClickHouse evaluates each rand() call separately. Multiple swaps in the same batch are combined into a single table swap operation for efficiency.

Design Note: PostgreSQL, SQLite, and Trino execute each swap as a separate UPDATE ... WHERE random() < probability statement. This is intentionally not batched because UPDATE is a lightweight operation on these databases. ClickHouse batches swaps because each swap would otherwise require a full table copy (CREATE → INSERT → RENAME → DROP), making the overhead significant.

Batching Transformations

Transformations are organized in batches for efficiency:

  • Each batch becomes a separate UPDATE statement (executed sequentially)
  • Transformations within a batch are combined into a single UPDATE
  • Batches support optional descriptions for logging and debugging

await generator.transform("users", [
  {
    description: "Generate email addresses",
    transformations: [
      {
        kind: "template",
        column: "email",
        template: "{first_name}.{last_name}@example.com",
      },
    ],
  },
  {
    description: "Introduce data quality issues",
    transformations: [
      {
        kind: "mutate",
        column: "email",
        probability: 0.1,
        operations: ["replace"],
      },
    ],
  },
]);

With descriptions, you'll see helpful logs:

[postgres] Applying transformations: Generate email addresses (1 transformation(s))
[postgres] Applying transformations: Introduce data quality issues (1 transformation(s))

Scenarios

A Scenario orchestrates multi-step data generation workflows - create lookup tables, generate main tables, and apply cross-table transformations in sequence:

import { PostgresDataGenerator, type Scenario } from "./src/generator/index.js";

const scenario: Scenario = {
  name: "E-commerce data",
  steps: [
    // Step 1: Create lookup table
    {
      table: {
        name: "departments",
        columns: [
          {
            name: "id",
            type: "integer",
            generator: { kind: "sequence", start: 1 },
          },
          {
            name: "name",
            type: "string",
            generator: {
              kind: "choice",
              values: ["Engineering", "Sales", "HR"],
            },
          },
        ],
      },
      rowCount: 3,
    },
    // Step 2: Generate main table with transformations
    {
      table: {
        name: "employees",
        columns: [
          {
            name: "id",
            type: "bigint",
            generator: { kind: "sequence", start: 1 },
          },
          {
            name: "first_name",
            type: "string",
            generator: {
              kind: "choice",
              values: ["John", "Jane", "Bob"],
            },
          },
          {
            name: "last_name",
            type: "string",
            generator: { kind: "choice", values: ["Smith", "Jones"] },
          },
          {
            name: "email",
            type: "string",
            generator: { kind: "constant", value: "" },
          },
          {
            name: "department_id",
            type: "integer",
            generator: { kind: "randomInt", min: 1, max: 3 },
          },
          {
            name: "department_name",
            type: "string",
            generator: { kind: "constant", value: "" },
          },
        ],
      },
      rowCount: 1000,
      transformations: [
        {
          description: "Generate email from names",
          transformations: [
            {
              kind: "template",
              column: "email",
              template: "{first_name}.{last_name}@company.com",
              lowercase: true,
            },
          ],
        },
      ],
    },
    // Step 3: Transform-only step - apply cross-table lookup
    {
      tableName: "employees",
      transformations: [
        {
          description: "Lookup department name",
          transformations: [
            {
              kind: "lookup",
              column: "department_name",
              fromTable: "departments",
              fromColumn: "name",
              joinOn: { targetColumn: "department_id", lookupColumn: "id" },
            },
          ],
        },
      ],
    },
  ],
};

const generator = new PostgresDataGenerator({
  /* config */
});
await generator.connect();

const result = await generator.runScenario({
  scenario,
  dropFirst: true,
});

for (const step of result.steps) {
  console.log(`[${step.tableName}] ${step.generate?.rowsInserted ?? 0} rows`);
  if (step.transform) {
    console.log(
      `  Applied ${step.transform.batchesApplied} transformation batch(es)`
    );
  }
}
console.log(
  `Total: ${result.totalRowsInserted} rows in ${result.durationMs}ms`
);

await generator.disconnect();

Step Types:

| Step Type            | Fields                            | Description                                     |
| -------------------- | --------------------------------- | ----------------------------------------------- |
| Generate + Transform | table, rowCount, transformations? | Create table, insert rows, optionally transform |
| Transform only       | tableName, transformations        | Apply transformations to existing table         |

interface ScenarioOptions {
  scenario: Scenario;
  createTable?: boolean; // Default: true
  dropFirst?: boolean; // Default: false
  truncateFirst?: boolean; // Default: false
  resumeSequences?: boolean; // Default: true
  optimize?: boolean; // Default: true (runs once at end for all tables)
}
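
For instance, to skip the end-of-run optimize pass while iterating on a scenario (a sketch):

const quick = await generator.runScenario({
  scenario,
  dropFirst: true,
  optimize: false, // skip the final VACUUM/OPTIMIZE while iterating
});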

interface ScenarioResult {
  steps: ScenarioStepResult[];
  totalRowsInserted: number;
  durationMs: number; // Total wall-clock time
  generateMs: number; // Time spent generating rows
  transformMs: number; // Time spent applying transformations
  optimizeMs: number; // Time spent on optimisations
}

Escape Utilities

For custom queries, use the exported escape functions:

import {
  // Identifier escaping (table/column names)
  escapePostgresIdentifier,
  escapeClickHouseIdentifier,
  escapeTrinoIdentifier,
  // Literal escaping (string values)
  escapePostgresLiteral,
  escapeClickHouseLiteral,
  escapeTrinoLiteral,
  escapeSqliteLiteral,
} from "@mkven/samples-generation";

// Identifier escaping (for table/column names)
escapePostgresIdentifier("my-table"); // "my-table"
escapePostgresIdentifier('table"name'); // "table""name"

escapeClickHouseIdentifier("my-table"); // `my-table`
escapeClickHouseIdentifier("table`name"); // `table``name`

escapeTrinoIdentifier("samples"); // "samples"
escapeTrinoIdentifier("samples$files"); // "samples$files" (for metadata tables)

// Literal escaping (for string values in SQL)
escapePostgresLiteral("O'Brien"); // 'O''Brien'
escapeClickHouseLiteral("O'Brien"); // 'O\'Brien'
escapeTrinoLiteral("O'Brien"); // 'O''Brien'
escapeSqliteLiteral("O'Brien"); // 'O''Brien'
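
Since these helpers return plain strings, they compose directly into hand-written SQL (a sketch; the query itself is illustrative):

import {
  escapePostgresIdentifier,
  escapePostgresLiteral,
} from "@mkven/samples-generation";

const table = escapePostgresIdentifier("users");
const name = escapePostgresLiteral("O'Brien");
const sql = `SELECT count(*) FROM ${table} WHERE last_name = ${name}`;
// -> SELECT count(*) FROM "users" WHERE last_name = 'O''Brien'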

Table Size

Get the size of a table (including indexes):

// Get size in bytes
const bytes = await generator.getTableSize("users");
// 1234567

// Get human-readable size
const size = await generator.getTableSizeForHuman("users");
// "1.18 MB"

You can also use the formatBytes utility directly:

import { formatBytes } from "./src/generator/index.js";

formatBytes(1024); // "1.00 KB"
formatBytes(1048576); // "1.00 MB"

Optimization

By default, generate() runs database-specific optimization after inserting rows:

| Database   | Optimization                                                  |
| ---------- | ------------------------------------------------------------- |
| PostgreSQL | VACUUM ANALYZE - reclaims storage and updates statistics       |
| ClickHouse | OPTIMIZE TABLE FINAL - merges all parts for MergeTree engines  |
| SQLite     | VACUUM + ANALYZE - rebuilds file and gathers statistics        |
| Trino      | optimize + expire_snapshots + remove_orphan_files - Iceberg    |

Disable for quick tests:

await generator.generate({
  table,
  rowCount: 1000,
  optimize: false, // Skip VACUUM/OPTIMIZE
});

Or call manually:

await generator.optimize("users");

Scripts

Generate Data

# Generate 1000 rows in all databases (requires docker-compose up)
npx tsx scripts/generate-all.ts

# Specify row count
npx tsx scripts/generate-all.ts --rows 1000
npx tsx scripts/generate-all.ts -r 1_000_000

# Use batching for large datasets (helps with memory and timeouts)
npx tsx scripts/generate-all.ts -r 1_000_000_000 -b 100_000_000  # 10 batches of 100M
npx tsx scripts/generate-all.ts -r 1_000_000_000 --batch 50_000_000  # 20 batches of 50M

# Choose scenario
npx tsx scripts/generate-all.ts --scenario simple          # Default: 5 columns
npx tsx scripts/generate-all.ts --scenario english-names   # Names + email template
npx tsx scripts/generate-all.ts --scenario russian-names   # Russian names + email
npx tsx scripts/generate-all.ts --scenario lookup-demo     # Departments + employees lookup

# Generate for specific databases only
npx tsx scripts/generate-all.ts --sqlite
npx tsx scripts/generate-all.ts --postgres
npx tsx scripts/generate-all.ts --clickhouse
npx tsx scripts/generate-all.ts --trino

# Combine options
npx tsx scripts/generate-all.ts -r 10000 --scenario english-names --postgres --clickhouse

# Show help
npx tsx scripts/generate-all.ts --help

Docker Compose

Start all databases:

pnpm compose:up

Services available:

| Service    | Port(s)                    | Credentials           |
| ---------- | -------------------------- | --------------------- |
| PostgreSQL | 5432                       | postgres:postgres     |
| ClickHouse | 8123 (HTTP), 9009 (native) | default:clickhouse    |
| Trino      | 8080                       | trino (no password)   |
| MinIO      | 9000 (S3), 9001 (console)  | minioadmin:minioadmin |
| Nessie     | 19120                      | -                     |

Testing

# Run unit tests (SQL expression generators, escape functions)
pnpm test

# Run e2e tests against SQLite (default, no setup required)
pnpm test:e2e

# Run e2e tests with specific databases (requires docker-compose up)
TEST_POSTGRES=1 pnpm test:e2e
TEST_CLICKHOUSE=1 pnpm test:e2e
TEST_TRINO=1 pnpm test:e2e

# Run e2e tests against all databases
TEST_POSTGRES=1 TEST_CLICKHOUSE=1 TEST_TRINO=1 pnpm test:e2e

# Or use the shortcut script
./test-all-dbs.sh

Quality Checks

# Run formatting, linting, type checking, and tests
./check.sh

# Check for security vulnerabilities and outdated dependencies
./health.sh

# Check for dependency updates (requires npx renovate)
./renovate-check.sh

# Run all checks
./all-checks.sh

Projects Using This Library