@pujansrt/data-genie

v2.1.7

Published

2 months ago

High-performance ETL engine written in TypeScript


typescript etl engine data processing performance pipeline transformations data-genie data-engine data-pipeline data-transformation

Data-Genie 🧞‍♂️

A high-performance, streaming-first ETL engine for Node.js and TypeScript, designed for processing massive datasets with a constant memory footprint.

Documentation & Examples

Visit our full documentation site for in-depth guides, API reference, and real-world recipes:

https://pujansrt.github.io/data-genie/

Installation

# Install as a library
npm install @pujansrt/data-genie

# OR install globally to use the CLI
npm install -g @pujansrt/data-genie

Declarative Pipelines (CLI)

Instead of writing code, you can define your ETL pipelines in YAML and run them using the data-genie CLI.

# pipeline.yaml
pipeline:
  read: { type: csv, path: input.csv }
  transform:
    - { type: filter, expression: "age > 18" }
    - { type: rename, mapping: { fname: firstName } }
  write: { type: json, path: output.json }

Run it with:

data-genie run pipeline.yaml
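To make the two transform steps in pipeline.yaml concrete, the sketch below mirrors them in plain TypeScript. It is an illustration only, not Data-Genie's implementation: `Row`, `filterAdults`, and `renameFields` are hypothetical names, the expression `age > 18` is hard-coded rather than parsed, and arrays are used for brevity where the engine itself streams.

```typescript
// Hypothetical stand-ins for the `filter` and `rename` steps in pipeline.yaml.
type Row = Record<string, unknown>;

// Mirrors { type: filter, expression: "age > 18" }: keep rows whose age exceeds 18.
// Number() is used because CSV values typically arrive as strings.
function filterAdults(rows: Row[]): Row[] {
  return rows.filter((row) => Number(row.age) > 18);
}

// Mirrors { type: rename, mapping: { fname: firstName } }:
// rename the mapped keys and copy every other key unchanged.
function renameFields(rows: Row[], mapping: Record<string, string>): Row[] {
  return rows.map((row) => {
    const out: Row = {};
    for (const [key, value] of Object.entries(row)) {
      out[mapping[key] ?? key] = value;
    }
    return out;
  });
}

const input: Row[] = [
  { fname: 'Ada', age: '36' },
  { fname: 'Tim', age: '12' },
];

const output = renameFields(filterAdults(input), { fname: 'firstName' });
console.log(JSON.stringify(output));
```

Only the first row passes the filter, and its `fname` key comes out as `firstName`.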

Quick Start (Programmatic)


import { CSVReader, JsonWriter, Job } from '@pujansrt/data-genie';

const reader = new CSVReader('users.csv');
const writer = new JsonWriter('output.json');

(async () => {
    // Process 10GB+ files with just 15MB RAM
    const metrics = await Job.run(reader, writer);
    console.log(`Processed ${metrics.recordCount} records!`);
})();

Preview (Dry Run)

Verify your transformations and filters instantly without writing any data.

// Inspect the first 5 records in a beautiful console table
await Job.preview(pipeline);
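The idea behind a dry run can be sketched with a lazy generator: pull only the first few records, print them, and never touch a writer. This is a rough illustration under stated assumptions, not how `Job.preview` is implemented; `take` and `numberedRows` are hypothetical helpers.

```typescript
// Hypothetical helper: lazily yields at most `limit` items, then stops pulling from the source.
function* take<T>(source: Iterable<T>, limit: number): Generator<T> {
  if (limit <= 0) return;
  let seen = 0;
  for (const item of source) {
    yield item;
    if (++seen >= limit) return; // stop before asking the source for another record
  }
}

// Hypothetical endless source; `produced` counts how many records were actually generated.
let produced = 0;
function* numberedRows(): Generator<{ id: number }> {
  while (true) {
    produced++;
    yield { id: produced };
  }
}

// Materialise only the first five records and show them as a console table.
const previewRows = [...take(numberedRows(), 5)];
console.table(previewRows);
```

Because the source is consumed lazily, previewing five records costs the same whether the underlying input holds ten rows or ten million.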

Why Data-Genie? (Performance Benchmark)

In our latest benchmark (processing 500k records), Data-Genie used 100x less memory than standard array-based processing.

Features

Streaming-First: Constant memory footprint regardless of file size (O(1) memory complexity).
Multi-Format: Support for CSV, TSV, JSON, NDJSON, Parquet, Excel, and SQL.
Transport Agnostic: Read/Write from Local Disk, AWS S3, HTTP APIs, or Memory.
Fault Tolerant: Retries, Circuit Breakers, and Dead Letter Queues (DLQ).
Event Emitter Support: Use Job events to build a monitoring UI for your ETL pipelines.
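The Streaming-First claim above rests on a general pattern: each stage handles one record at a time and holds no reference to earlier ones. The sketch below shows that pattern with async generators. It is not Data-Genie's internals; `generateRecords`, `adultsOnly`, and `countRecords` are hypothetical names, and the source is synthetic rather than a file.

```typescript
// Hypothetical record source: yields one record at a time instead of building an array.
async function* generateRecords(count: number): AsyncGenerator<{ id: number; age: number }> {
  for (let id = 0; id < count; id++) {
    yield { id, age: 10 + (id % 50) };
  }
}

// A streaming transform: consumes and emits records one by one.
async function* adultsOnly<T extends { age: number }>(source: AsyncIterable<T>): AsyncGenerator<T> {
  for await (const record of source) {
    if (record.age > 18) yield record;
  }
}

// The sink keeps only a running count, so memory use does not grow with the input size.
async function countRecords(source: AsyncIterable<unknown>): Promise<number> {
  let recordCount = 0;
  for await (const _ of source) recordCount++;
  return recordCount;
}

countRecords(adultsOnly(generateRecords(100_000))).then((recordCount) => {
  console.log(`Processed ${recordCount} records`);
});
```

At any moment only one record is in flight between stages, which is what keeps the footprint flat as the input grows.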

Common Recipes

1. S3 Parquet to Local CSV

Stream massive datasets directly from the cloud to your local machine.

const source = new S3Source(s3Client, 'mybucket', 'data/users.parquet');
const reader = new ParquetReader(source);
const writer = new CSVWriter('users.csv');

await Job.run(reader, writer);

2. Schema Validation (Zod) + DLQ

Validate data in real-time and divert "poison" records to a Dead Letter Queue.

const validator = new SchemaValidatingReader(reader, z.object({
    email: z.string().email(),
    age: z.number().min(18)
})).setDLQ(new JsonWriter('invalid_records.json'));

await Job.run(validator, new SQLWriter(db, 'users'));
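The routing behind a Dead Letter Queue can be sketched without any library: records that fail validation are diverted to a second sink instead of aborting the run. The example below is a simplified illustration, not the `SchemaValidatingReader` implementation; `isValidUser` is a hand-written stand-in for the Zod schema, and `routeRecords` is a hypothetical helper.

```typescript
type User = { email: string; age: number };

// Hand-rolled check standing in for the Zod schema above (email-shaped string, age >= 18).
function isValidUser(record: unknown): record is User {
  if (typeof record !== 'object' || record === null) return false;
  const r = record as Record<string, unknown>;
  return (
    typeof r.email === 'string' &&
    /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(r.email) &&
    typeof r.age === 'number' &&
    r.age >= 18
  );
}

// Route each record to the main sink or to the dead-letter sink; a bad record never throws.
function routeRecords(
  records: Iterable<unknown>,
  onValid: (user: User) => void,
  onDeadLetter: (record: unknown) => void,
): void {
  for (const record of records) {
    if (isValidUser(record)) onValid(record);
    else onDeadLetter(record);
  }
}

const accepted: User[] = [];
const deadLetters: unknown[] = [];

routeRecords(
  [
    { email: 'a@example.com', age: 30 },
    { email: 'not-an-email', age: 30 },
    { email: 'b@example.com', age: 12 },
  ],
  (user) => accepted.push(user),
  (record) => deadLetters.push(record),
);

console.log(`accepted=${accepted.length} deadLetters=${deadLetters.length}`);
```

Keeping the rejected records intact in a separate sink makes it possible to inspect and replay them later, rather than losing them to a log line.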

3. Parallel Fan-out (Multi-Sink)

Read once, transform, and write to multiple destinations in parallel.

const multiWriter = new MultiWriter(
  new ConsoleWriter(),
  new JsonWriter('processed.json'),
  new SQLWriter(db, 'audit_log')
);

await Job.run(pipeline, multiWriter);
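Conceptually, a fan-out writer forwards each record to every sink and waits for all of them before accepting the next one. The sketch below shows one way to do that with `Promise.all`. The `RecordWriter` interface, `ArrayWriter`, and `FanOutWriter` are hypothetical and written for this example; Data-Genie's actual writer API and `MultiWriter` behaviour may differ.

```typescript
// Hypothetical minimal writer interface; the library's real writer API may differ.
interface RecordWriter<T> {
  write(record: T): Promise<void>;
}

// In-memory sink used here in place of console, JSON, or SQL writers.
class ArrayWriter<T> implements RecordWriter<T> {
  readonly records: T[] = [];
  async write(record: T): Promise<void> {
    this.records.push(record);
  }
}

class FanOutWriter<T> implements RecordWriter<T> {
  private readonly writers: RecordWriter<T>[];

  constructor(...writers: RecordWriter<T>[]) {
    this.writers = writers;
  }

  // Dispatch the record to every sink concurrently; resolves once the slowest sink finishes.
  async write(record: T): Promise<void> {
    await Promise.all(this.writers.map((writer) => writer.write(record)));
  }
}

async function demoFanOut(): Promise<number[]> {
  const first = new ArrayWriter<string>();
  const second = new ArrayWriter<string>();
  const fanOut = new FanOutWriter<string>(first, second);
  for (const record of ['r1', 'r2', 'r3']) {
    await fanOut.write(record);
  }
  return [first.records.length, second.records.length];
}

demoFanOut().then((counts) => console.log(counts));
```

One design note: `Promise.all` rejects as soon as any sink fails, so a single broken destination stops the write. Swapping in `Promise.allSettled` would let healthy sinks continue, at the cost of handling per-sink errors explicitly.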

See 15+ more recipes in our Cookbook

Contributing

Contributions are welcome, whether that means adding a new DataReader, fixing a bug, or improving the documentation.

Check out our Contributing Guide.
Look for Good First Issues.
Submit a PR!

Running Benchmarks

Want to see the performance difference on your own machine? We provide a built-in benchmark script that compares Data-Genie with a standard fs.readFileSync approach.

# Clone the repo and install dependencies
git clone https://github.com/pujansrt/data-genie.git
cd data-genie
npm install

# Run the benchmark
npx tsx benchmarks/run-benchmark.ts