samples-generation
Generate sample data for PostgreSQL, ClickHouse, SQLite, and Trino with a unified interface.
Objective
We often need to prefill tables during tests, checks, and measurements. These generators fill tables with random data and apply transformations (template-based construction of values from other columns, lookups in another table, data corruption, etc.). They can be used for a simple prefill or for controlled corruption of data to support further testing, such as entity resolution.
Supported Databases
- PostgreSQL - via the postgres package
- ClickHouse - via @clickhouse/client
- SQLite - via better-sqlite3
- Trino - via trino-client (writes to Iceberg tables)
Installation
pnpm install
Measurements of generations
Environment: local databases, simple setup, 1 billion rows, inserted in batches of 100 million rows.
Trivial generation
5 columns (id, 10-character string, float in the 0-1000 range, string chosen from 3 variants, datetime)
npx tsx scripts/generate-all.ts --scenario simple -r 1_000_000_000 --clickhouse --trino
1 billion rows
ClickHouse, 16 GB RAM: Generated in 12m 44s (generation: 7m 10s, optimisation: 5m 34s), table size: 24.83 GB
Trino, 16 GB RAM + spill by FTE: Generated in 8m 40s (generation: 6m 48s, optimisation: 291ms), table size: 15.70 GB
PostgreSQL: Generated in 54m 26s (generation: 35m 55s, optimisation: 18m 30s), table size: 63.60 GB
10 billion rows
ClickHouse: Generated in 4h 52m 21s (generation: 4h 47m 44s, optimisation: 4m 36s), table size: 248.38 GB
Trino: Generated in 1h 4m 41s (generation: 1h 4m 41s, optimisation: 181ms), table size: 158.62 GB
Names, then templated email based on generated names
7 columns total.
npx tsx scripts/generate-all.ts --scenario english-names -r 1_000_000_000 --clickhouse --trino
1 billion rows
English names:
ClickHouse, 16 GB RAM: Generated in 18m 17s (generation: 7m 25s, transformation: 2m 2s, optimisation: 8m 49s), table size: 44.96 GB
Trino, 16 GB RAM + spill by FTE: Generated in 24m 51s (generation: 7m 47s, transformation: 16m 40s, optimisation: 24.34s), table size: 20.51 GB
Lookup transformation (10K departments → employees)
7 columns in employees table, lookup from department name in 10K-row departments table.
npx tsx scripts/generate-all.ts --scenario lookup-demo -r 1_000_000_000 --clickhouse --trino
1 billion rows
ClickHouse: Generated in 1h 19m 46s (generation: 25m 31s, transformation: 39m 25s, optimisation: 14m 49s), table size: 41.42 GB
Trino, 20 GB RAM: Generated in 32m 57s (generation: 4m 55s, transformation: 26m 36s, optimisation: 1m 25s), table size: 17.94 GB
Trino, 16 GB RAM + spill by FTE: Generated in 46m 11s (generation: 13m 52s, transformation: 31m 43s, optimisation: 35.11s), table size: 17.94 GB
Note
Two configurations are available:
- Standard (20GB): trino - high memory, no spilling
- 16GB comparison: trino-fte (fault-tolerant execution with disk spilling) and clickhouse - both with 16GB limits
See compose/docker-compose.yml for container resource limits and compose/trino/ or compose/trino-fte/ for Trino-specific settings.
Quick Start
Starting Databases
Start all databases:
pnpm compose:up
Or start individual databases:
pnpm compose:postgres # PostgreSQL only
pnpm compose:clickhouse # ClickHouse only
pnpm compose:trino # Trino 20GB, high memory
pnpm compose:trino-fte # Trino 16GB, fault-tolerant execution with spill
Note: trino and trino-fte share port 8080 - stop one before starting the other.
Stop and clean up:
pnpm compose:down # Stop containers
pnpm compose:reset # Stop and remove volumes
Using the Generator API
import {
PostgresDataGenerator,
ClickHouseDataGenerator,
SQLiteDataGenerator,
TrinoDataGenerator,
type TableConfig,
} from "./src/generator/index.js";
const table: TableConfig = {
name: "users",
columns: [
{ name: "id", type: "integer", generator: { kind: "sequence", start: 1 } },
{
name: "name",
type: "string",
generator: { kind: "randomString", length: 10 },
},
{
name: "score",
type: "float",
generator: { kind: "randomFloat", min: 0, max: 100 },
},
{
name: "status",
type: "string",
generator: { kind: "choice", values: ["active", "inactive"] },
},
{ name: "created_at", type: "datetime", generator: { kind: "datetime" } },
],
};
// All generators have the same interface
const generator = new PostgresDataGenerator({
host: "localhost",
port: 5432,
database: "appdb",
username: "postgres",
password: "postgres",
});
await generator.connect();
const result = await generator.generate({
table,
rowCount: 1000,
truncateFirst: true,
resumeSequences: true, // Continue sequence from last max value
batchSize: 100_000_000, // Optional: insert in batches (useful for large datasets)
});
console.log(
`Inserted ${result.rowsInserted} rows in ${result.generateMs}ms (optimize: ${result.optimizeMs}ms)`
);
await generator.disconnect();
Database Configurations
All databases use a consistent host/port configuration:
// PostgreSQL
new PostgresDataGenerator({
host: "localhost",
port: 5432,
database: "appdb",
username: "postgres",
password: "postgres",
});
// ClickHouse
new ClickHouseDataGenerator({
host: "localhost",
port: 8123,
database: "default",
username: "default",
password: "clickhouse",
});
// SQLite
new SQLiteDataGenerator({
path: "data/samples.db",
});
// Trino/Iceberg
new TrinoDataGenerator({
host: "localhost",
port: 8080,
catalog: "iceberg",
schema: "warehouse",
user: "trino",
});
Column Types
| Type | PostgreSQL | ClickHouse | SQLite | Trino |
| ---------- | ---------------- | ---------- | ------- | --------- |
| integer | INTEGER | Int32 | INTEGER | INTEGER |
| bigint | BIGINT | Int64 | INTEGER | BIGINT |
| float | DOUBLE PRECISION | Float64 | REAL | DOUBLE |
| string | TEXT | String | TEXT | VARCHAR |
| boolean | BOOLEAN | Bool | INTEGER | BOOLEAN |
| datetime | TIMESTAMP | DateTime | TEXT | TIMESTAMP |
| date | DATE | Date | TEXT | DATE |
Column Options
Each column can have additional options:
| Option | Type | Default | Description |
| ----------------- | --------- | ------- | ------------------------------------------------ |
| nullable | boolean | false | If true, omits NOT NULL constraint on column |
| nullProbability | number | 0 | Probability of NULL values (0-1) |
Example with nullable column:
{
name: "middle_name",
type: "string",
generator: { kind: "randomString", length: 10 },
nullable: true,
nullProbability: 0.3 // 30% of rows will have NULL
}
Value Generators
| Generator | Kind | Options |
| ---------------- | ---------------- | -------------------------------------- |
| sequence | Auto-increment | start, step |
| randomInt | Random integer | min, max |
| randomFloat | Random float | min, max, precision (default: 2) |
| randomString | Random string | length |
| choice | Pick from list | values |
| choiceByLookup | Optimized choice | values (large arrays) |
| constant | Fixed value | value |
| datetime | Random datetime | from, to |
| uuid | UUID v4 | - |
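As a quick reference, the sketch below combines most of these generators in a single TableConfig. It is a sketch only: the column names are illustrative, and the from/to format for the datetime generator is assumed here to be ISO date strings - check the type definitions for the exact shape.
import { type TableConfig } from "./src/generator/index.js";
// Illustrative table exercising most generator kinds from the table above.
const events: TableConfig = {
  name: "events",
  columns: [
    { name: "id", type: "bigint", generator: { kind: "sequence", start: 1, step: 1 } },
    { name: "event_id", type: "string", generator: { kind: "uuid" } },
    { name: "retries", type: "integer", generator: { kind: "randomInt", min: 0, max: 5 } },
    { name: "duration_s", type: "float", generator: { kind: "randomFloat", min: 0, max: 60, precision: 3 } },
    { name: "source", type: "string", generator: { kind: "constant", value: "import" } },
    { name: "kind", type: "string", generator: { kind: "choice", values: ["click", "view", "purchase"] } },
    {
      name: "occurred_at",
      type: "datetime",
      // from/to format is an assumption (ISO date strings), not confirmed by the docs
      generator: { kind: "datetime", from: "2024-01-01", to: "2024-12-31" },
    },
  ],
};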
choiceByLookup Generator
Use choiceByLookup instead of choice when selecting from thousands of values. It uses CTEs with arrays for O(1) random selection, making it efficient for billions of rows:
{
name: "last_name",
type: "string",
generator: {
kind: "choiceByLookup",
values: ["Smith", "Johnson", "Williams", ...] // thousands of values
}
}
- PostgreSQL: CTE with ARRAY[] and array_length() indexing
- ClickHouse: WITH clause with array variable
- SQLite: CTE with JSON array and json_extract()
- Trino: CTE with ARRAY[] and element_at()
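To make the PostgreSQL bullet above concrete, the generated statement follows roughly this shape. This is an illustrative sketch only, not the library's exact SQL output; table, column, and value names are taken from the example above.
// Illustrative only - roughly the shape of the PostgreSQL approach.
// The CTE exposes the value list as an array written into the SQL once;
// each generated row then picks a random element by index.
const illustrativePostgresSql = `
  WITH lookup AS (SELECT ARRAY['Smith', 'Johnson', 'Williams'] AS vals)
  INSERT INTO employees (last_name)
  SELECT vals[1 + floor(random() * array_length(vals, 1))::int]
  FROM lookup, generate_series(1, 1000);
`;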
Generate Options
interface GenerateOptions {
table: TableConfig;
rowCount: number;
createTable?: boolean; // Default: true
dropFirst?: boolean; // Default: false - drop table before generating
truncateFirst?: boolean; // Default: false
resumeSequences?: boolean; // Default: true - continue from max value
optimize?: boolean; // Default: true - run VACUUM/OPTIMIZE after insert
}
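For example, rebuilding a table from scratch without the post-insert optimisation could look like this - a sketch reusing the generator and table from the Quick Start example, with option values chosen for illustration.
// Reuses `generator` and `table` from the Quick Start example above.
const quickRun = await generator.generate({
  table,
  rowCount: 1_000_000,
  dropFirst: true, // drop and recreate the table
  optimize: false, // skip VACUUM/OPTIMIZE for a quick run
  batchSize: 100_000, // insert in smaller batches
});
console.log(`Inserted ${quickRun.rowsInserted} rows in ${quickRun.generateMs}ms`);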
Transformations
Apply transformations to existing tables. Useful for creating derived columns (like email from first/last name) or introducing realistic data quality issues.
// Generate data first
await generator.generate({ table: usersTable, rowCount: 10000 });
// Then apply transformations
await generator.transform("users", [
{
description: "Generate email addresses",
transformations: [
{
kind: "template",
column: "email",
template: "{first_name}.{last_name}@example.com",
},
],
},
]);
interface TransformResult {
durationMs: number;
batchesApplied: number;
}
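The return value can be used for simple reporting - a sketch assuming transform() resolves to the TransformResult interface shown above, and reusing the generator from the earlier examples.
// Reuses `generator` from the examples above.
const emailResult = await generator.transform("users", [
  {
    description: "Generate email addresses",
    transformations: [
      {
        kind: "template",
        column: "email",
        template: "{first_name}.{last_name}@example.com",
      },
    ],
  },
]);
console.log(
  `Applied ${emailResult.batchesApplied} batch(es) in ${emailResult.durationMs}ms`
);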
Transformation Types
Template Transformation - Build column values from other columns:
{
kind: "template",
column: "email",
template: "{first_name}.{last_name}@example.com",
lowercase: true // Optional: convert result to lowercase
}
Mutate Transformation - Introduce random character mutations:
{
kind: "mutate",
column: "name",
probability: 0.1, // 10% of rows get mutated
operations: ["replace", "delete", "insert"] // Random operation selected per row
}
Lookup Transformation - Assign values from another table via join:
{
kind: "lookup",
column: "category_name", // Column to update
fromTable: "categories", // Source table
fromColumn: "name", // Column to copy value from
joinOn: {
targetColumn: "category_id", // Column in target table
lookupColumn: "id" // Column in source table to match
}
}
Note: For ClickHouse, lookup transformation uses a table swap approach (CREATE → INSERT SELECT with JOIN → RENAME) since ClickHouse doesn't support correlated subqueries in ALTER TABLE UPDATE. This means lookups execute before other transformations in the same batch. If order matters, place lookups in a separate batch.
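For example, to guarantee on ClickHouse that a template sees the looked-up value, keep the lookup in its own batch. This is a sketch using the transformation kinds documented here; the label column and the template are illustrative.
await generator.transform("employees", [
  {
    description: "Resolve department name first (lookup-only batch)",
    transformations: [
      {
        kind: "lookup",
        column: "department_name",
        fromTable: "departments",
        fromColumn: "name",
        joinOn: { targetColumn: "department_id", lookupColumn: "id" },
      },
    ],
  },
  {
    description: "Then build a value that depends on the looked-up column",
    transformations: [
      {
        kind: "template",
        column: "label",
        template: "{last_name} ({department_name})",
      },
    ],
  },
]);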
Swap Transformation - Swap values between two columns with probability:
{
kind: "swap",
column1: "first_name",
column2: "last_name",
probability: 0.1 // 10% of rows get swapped
}
Note: Both columns use the same random decision per row, ensuring atomic swaps (if column1 gets column2's value, column2 always gets column1's value). For ClickHouse, swap also uses the table swap approach (like lookup) since ClickHouse evaluates each rand() call separately. Multiple swaps in the same batch are combined into a single table swap operation for efficiency.
Design Note: PostgreSQL, SQLite, and Trino execute each swap as a separate UPDATE ... WHERE random() < probability statement. This is intentionally not batched because UPDATE is a lightweight operation on these databases. ClickHouse batches swaps because each swap would otherwise require a full table copy (CREATE → INSERT → RENAME → DROP), making the overhead significant.
Batching Transformations
Transformations are organized in batches for efficiency:
- Each batch becomes a separate UPDATE statement (executed sequentially)
- Transformations within a batch are combined into a single UPDATE
- Batches support optional descriptions for logging and debugging
await generator.transform("users", [
{
description: "Generate email addresses",
transformations: [
{
kind: "template",
column: "email",
template: "{first_name}.{last_name}@example.com",
},
],
},
{
description: "Introduce data quality issues",
transformations: [
{
kind: "mutate",
column: "email",
probability: 0.1,
operations: ["replace"],
},
],
},
]);
With descriptions, you'll see helpful logs:
[postgres] Applying transformations: Generate email addresses (1 transformation(s))
[postgres] Applying transformations: Introduce data quality issues (1 transformation(s))
Scenarios
A Scenario orchestrates multi-step data generation workflows - create lookup tables, generate main tables, and apply cross-table transformations in sequence:
import { PostgresDataGenerator, type Scenario } from "./src/generator/index.js";
const scenario: Scenario = {
name: "E-commerce data",
steps: [
// Step 1: Create lookup table
{
table: {
name: "departments",
columns: [
{
name: "id",
type: "integer",
generator: { kind: "sequence", start: 1 },
},
{
name: "name",
type: "string",
generator: {
kind: "choice",
values: ["Engineering", "Sales", "HR"],
},
},
],
},
rowCount: 3,
},
// Step 2: Generate main table with transformations
{
table: {
name: "employees",
columns: [
{
name: "id",
type: "bigint",
generator: { kind: "sequence", start: 1 },
},
{
name: "first_name",
type: "string",
generator: {
kind: "choice",
values: ["John", "Jane", "Bob"],
},
},
{
name: "last_name",
type: "string",
generator: { kind: "choice", values: ["Smith", "Jones"] },
},
{
name: "email",
type: "string",
generator: { kind: "constant", value: "" },
},
{
name: "department_id",
type: "integer",
generator: { kind: "randomInt", min: 1, max: 3 },
},
{
name: "department_name",
type: "string",
generator: { kind: "constant", value: "" },
},
],
},
rowCount: 1000,
transformations: [
{
description: "Generate email from names",
transformations: [
{
kind: "template",
column: "email",
template: "{first_name}.{last_name}@company.com",
lowercase: true,
},
],
},
],
},
// Step 3: Transform-only step - apply cross-table lookup
{
tableName: "employees",
transformations: [
{
description: "Lookup department name",
transformations: [
{
kind: "lookup",
column: "department_name",
fromTable: "departments",
fromColumn: "name",
joinOn: { targetColumn: "department_id", lookupColumn: "id" },
},
],
},
],
},
],
};
const generator = new PostgresDataGenerator({
/* config */
});
await generator.connect();
const result = await generator.runScenario({
scenario,
dropFirst: true,
});
for (const step of result.steps) {
console.log(`[${step.tableName}] ${step.generate?.rowsInserted ?? 0} rows`);
if (step.transform) {
console.log(
` Applied ${step.transform.batchesApplied} transformation batch(es)`
);
}
}
console.log(
`Total: ${result.totalRowsInserted} rows in ${result.durationMs}ms`
);
await generator.disconnect();
Step Types:
| Step Type | Fields | Description |
| -------------------- | --------------------------------------- | ----------------------------------------------- |
| Generate + Transform | table, rowCount, transformations? | Create table, insert rows, optionally transform |
| Transform only | tableName, transformations | Apply transformations to existing table |
interface ScenarioOptions {
scenario: Scenario;
createTable?: boolean; // Default: true
dropFirst?: boolean; // Default: false
truncateFirst?: boolean; // Default: false
resumeSequences?: boolean; // Default: true
optimize?: boolean; // Default: true (runs once at end for all tables)
}
interface ScenarioResult {
steps: ScenarioStepResult[];
totalRowsInserted: number;
durationMs: number; // Total wall-clock time
generateMs: number; // Time spent generating rows
transformMs: number; // Time spent applying transformations
optimizeMs: number; // Time spent on optimisations
}
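Continuing the runScenario() example above, the phase timings can be reported straight from the result; the field names are the ones in the interface above.
// `result` is the ScenarioResult returned by runScenario() in the example above.
console.log(
  `generate: ${result.generateMs}ms, transform: ${result.transformMs}ms, ` +
    `optimize: ${result.optimizeMs}ms (total: ${result.durationMs}ms)`
);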
Escape Utilities
For custom queries, use the exported escape functions:
import {
// Identifier escaping (table/column names)
escapePostgresIdentifier,
escapeClickHouseIdentifier,
escapeTrinoIdentifier,
// Literal escaping (string values)
escapePostgresLiteral,
escapeClickHouseLiteral,
escapeTrinoLiteral,
escapeSqliteLiteral,
} from "@mkven/samples-generation";
// Identifier escaping (for table/column names)
escapePostgresIdentifier("my-table"); // "my-table"
escapePostgresIdentifier('table"name'); // "table""name"
escapeClickHouseIdentifier("my-table"); // `my-table`
escapeClickHouseIdentifier("table`name"); // `table``name`
escapeTrinoIdentifier("samples"); // "samples"
escapeTrinoIdentifier("samples$files"); // "samples$files" (for metadata tables)
// Literal escaping (for string values in SQL)
escapePostgresLiteral("O'Brien"); // 'O''Brien'
escapeClickHouseLiteral("O'Brien"); // 'O\'Brien'
escapeTrinoLiteral("O'Brien"); // 'O''Brien'
escapeSqliteLiteral("O'Brien"); // 'O''Brien'
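For instance, a hand-written query can be assembled from these helpers - a sketch only; executing the SQL with your own database client is up to you, and the table and value here are illustrative.
import {
  escapePostgresIdentifier,
  escapePostgresLiteral,
} from "@mkven/samples-generation";
// Build a custom query with a properly escaped identifier and literal.
const tableName = "users";
const status = "O'Brien's team";
const sql = `SELECT count(*) FROM ${escapePostgresIdentifier(tableName)} WHERE status = ${escapePostgresLiteral(status)}`;
// => SELECT count(*) FROM "users" WHERE status = 'O''Brien''s team'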
Table Size
Get the size of a table (including indexes):
// Get size in bytes
const bytes = await generator.getTableSize("users");
// 1234567
// Get human-readable size
const size = await generator.getTableSizeForHuman("users");
// "1.18 MB"You can also use the formatBytes utility directly:
import { formatBytes } from "./src/generator/index.js";
formatBytes(1024); // "1.00 KB"
formatBytes(1048576); // "1.00 MB"
Optimization
By default, generate() runs database-specific optimization after inserting rows:
| Database | Optimization |
| ---------- | ----------------------------------------------------------------- |
| PostgreSQL | VACUUM ANALYZE - reclaims storage and updates statistics |
| ClickHouse | OPTIMIZE TABLE FINAL - merges all parts for MergeTree engines |
| SQLite | VACUUM + ANALYZE - rebuilds file and gathers statistics |
| Trino | optimize + expire_snapshots + remove_orphan_files - Iceberg |
Disable for quick tests:
await generator.generate({
table,
rowCount: 1000,
optimize: false, // Skip VACUUM/OPTIMIZE
});
Or call manually:
await generator.optimize("users");
Scripts
Generate Data
# Generate 1000 rows in all databases (requires docker-compose up)
npx tsx scripts/generate-all.ts
# Specify row count
npx tsx scripts/generate-all.ts --rows 1000
npx tsx scripts/generate-all.ts -r 1_000_000
# Use batching for large datasets (helps with memory and timeouts)
npx tsx scripts/generate-all.ts -r 1_000_000_000 -b 100_000_000 # 10 batches of 100M
npx tsx scripts/generate-all.ts -r 1_000_000_000 --batch 50_000_000 # 20 batches of 50M
# Choose scenario
npx tsx scripts/generate-all.ts --scenario simple # Default: 5 columns
npx tsx scripts/generate-all.ts --scenario english-names # Names + email template
npx tsx scripts/generate-all.ts --scenario russian-names # Russian names + email
npx tsx scripts/generate-all.ts --scenario lookup-demo # Departments + employees lookup
# Generate for specific databases only
npx tsx scripts/generate-all.ts --sqlite
npx tsx scripts/generate-all.ts --postgres
npx tsx scripts/generate-all.ts --clickhouse
npx tsx scripts/generate-all.ts --trino
# Combine options
npx tsx scripts/generate-all.ts -r 10000 --scenario english-names --postgres --clickhouse
# Show help
npx tsx scripts/generate-all.ts --help
Docker Compose
Start all databases:
pnpm compose:up
Services available:
| Service    | Port(s)                    | Credentials           |
| ---------- | -------------------------- | --------------------- |
| PostgreSQL | 5432                       | postgres:postgres     |
| ClickHouse | 8123 (HTTP), 9009 (native) | default:clickhouse    |
| Trino      | 8080                       | trino (no password)   |
| MinIO      | 9000 (S3), 9001 (console)  | minioadmin:minioadmin |
| Nessie     | 19120                      | -                     |
Testing
# Run unit tests (SQL expression generators, escape functions)
pnpm test
# Run e2e tests against SQLite (default, no setup required)
pnpm test:e2e
# Run e2e tests with specific databases (requires docker-compose up)
TEST_POSTGRES=1 pnpm test:e2e
TEST_CLICKHOUSE=1 pnpm test:e2e
TEST_TRINO=1 pnpm test:e2e
# Run e2e tests against all databases
TEST_POSTGRES=1 TEST_CLICKHOUSE=1 TEST_TRINO=1 pnpm test:e2e
# Or use the shortcut script
./test-all-dbs.sh
Quality Checks
# Run formatting, linting, type checking, and tests
./check.sh
# Check for security vulnerabilities and outdated dependencies
./health.sh
# Check for dependency updates (requires npx renovate)
./renovate-check.sh
# Run all checks
./all-checks.sh
Projects Using This Library
- indexless-query-benchmarks — Benchmark queries on Trino+Iceberg, PostgreSQL, and ClickHouse without indexes
