csv-super

v1.0.0

Published

7 days ago

Read a 10GB CSV file without crashing RAM — Streaming CSV parser with fixed 50MB memory footprint

csv-super ⚡

Read a 10GB CSV file without crashing RAM.
Fixed ~50MB memory footprint. RFC 4180 compliant. Zero dependencies.

The Real Problem

// ❌ This CRASHES on a 10GB file — allocates 16+ GB of RAM
const results = [];
fs.createReadStream('data.csv')
  .pipe(csvParser())
  .on('data', row => results.push(row));   // ← ALL rows land in heap

Why it crashes: csv-parser, fast-csv, and papaparse emit rows via EventEmitter 'data' events. With no backpressure applied, the stream reads at disk speed (~500 MB/s) and piles all rows into an array. By the time a 10GB file finishes, the V8 heap holds 15–30GB of objects (JSON has 1.5–3× overhead).

The Solution

import { csvSuper } from 'csv-super';

// ✅ RAM stays at ~50MB regardless of file size
for await (const { rows, totalSoFar } of csvSuper('data.csv', { batch: 1000 })) {
  await db.insertMany(rows);                // process 1000 rows at a time
  console.log(`Processed: ${totalSoFar}`);  // cumulative counter built-in
}

Why it works: csv-super uses an async function* (Async Generator). Each yield suspends the generator until the consumer's await resolves. While suspended, fs.createReadStream is automatically paused — no data accumulates in the heap. This is native JavaScript backpressure.

Performance Comparison

| Metric | csv-parser (traditional) | csv-super | |-------------------|--------------------------|------------------------| | 1GB file | ~2GB RAM | ~50MB RAM | | 10GB file | 💀 OOM crash | ~50MB RAM | | First row delay | After full file read | Within seconds | | TypeScript | Partial | 100% typed | | RFC 4180 | Partial | Full compliance | | Dependencies | 2 | 0 | | Node.js ≥ 18 | ✅ | ✅ |

Installation

npm install csv-super

Requirements: Node.js ≥ 18.0.0 (uses native fs.createReadStream + Async Generators)

Quick Start

import { csvSuper } from 'csv-super';

// Basic: 1000 rows per batch (default)
for await (const batch of csvSuper('sales.csv')) {
  console.log(batch.rows);      // CsvRow[] = Record<string, string>[]
  console.log(batch.batchIndex); // 0, 1, 2, ...
  console.log(batch.count);      // rows in this batch (≤ 1000)
  console.log(batch.totalSoFar); // cumulative rows processed
}

API Reference

`csvSuper(filePath, options?)` → `AsyncGenerator<BatchResult>`

import { csvSuper } from 'csv-super';
import type { CsvSuperOptions, BatchResult } from 'csv-super';

Options

| Option | Type | Default | Description | |-----------------|-----------------------------------------|--------------|------------------------------------------| | batch | number | 1000 | Rows per yielded batch (1–100000) | | delimiter | string | ',' | Field separator (single char) | | quote | string | '"' | Quote character | | escape | string | '"' | Escape char (RFC 4180: same as quote) | | headers | boolean | true | First row = column names | | skipEmptyLines| boolean | true | Skip blank lines | | encoding | 'utf8' \| 'utf16le' \| 'latin1' \| 'auto' | 'auto' | File encoding (auto = BOM detection) | | chunkSize | number | 65536 | Read buffer size in bytes (≥ 1024) | | onProgress | (info: ProgressInfo) => void | null | Progress callback |

BatchResult

interface BatchResult {
  rows:       CsvRow[];    // The parsed rows
  batchIndex: number;      // 0-based batch counter
  count:      number;      // rows.length (convenience)
  totalSoFar: number;      // cumulative rows across all batches
}

ProgressInfo

interface ProgressInfo {
  bytesRead:             number;  // bytes consumed so far
  totalBytes:            number;  // total file size
  percentage:            number;  // 0.0–100.0
  speedMBps:             number;  // current read speed (sliding window)
  estimatedSecondsLeft:  number;  // ETA in seconds
  rowsProcessed:         number;  // rows parsed so far
}

Examples

Insert into Database

import { csvSuper } from 'csv-super';

for await (const batch of csvSuper('customers.csv', { batch: 5000 })) {
  await db.customers.insertMany(batch.rows);
  console.log(`✅ Inserted ${batch.totalSoFar} customers`);
}

Progress Bar

for await (const batch of csvSuper('large.csv', {
  batch: 5000,
  onProgress: ({ percentage, speedMBps, estimatedSecondsLeft }) => {
    const bar = '█'.repeat(Math.floor(percentage / 5)).padEnd(20, '░');
    process.stdout.write(
      `\r[${bar}] ${percentage.toFixed(1)}% @ ${speedMBps.toFixed(1)} MB/s`
    );
  },
})) {
  await processRows(batch.rows);
}

TSV Files

for await (const batch of csvSuper('data.tsv', { delimiter: '\t' })) {
  // ...
}

Early Termination (stream closes automatically, no leak)

for await (const batch of csvSuper('events.csv')) {
  for (const row of batch.rows) {
    if (row.severity === 'CRITICAL') {
      console.log('Critical event found:', row);
      return; // or break — stream closes cleanly
    }
  }
}

No Headers (index-based access)

for await (const batch of csvSuper('raw.csv', { headers: false })) {
  // rows: { '0': 'value1', '1': 'value2', ... }
  console.log(batch.rows[0]?.['0']);
}

Pro Features ($17/month)

For enterprise workloads that process CSV files daily, csv-super Pro adds:

Multi-Thread Processing (Worker Threads)

import { csvSuperPro } from 'csv-super';

for await (const batch of csvSuperPro('huge-file.csv', {
  licenseKey: process.env.CSV_SUPER_KEY,
  threads: 8,         // Use 8 CPU cores in parallel
  batch: 10_000,
})) {
  await db.insertMany(batch.rows);
}

Speed: ~N× faster on N-core machines. A file that takes 60s on 1 core takes ~8s on 8 cores.

Transform Pipeline

import { csvSuperPro, TransformPipeline } from 'csv-super';

const pipeline = new TransformPipeline()
  .filter(row => row.status === 'active')          // Filter rows
  .trim()                                           // Trim all fields
  .select(['id', 'name', 'email', 'salary'])        // Select columns
  .rename({ 'id': 'employee_id' })                  // Rename columns
  .mapField('salary', v => String(parseInt(v, 10))) // Type coerce
  .pipe(async row => {                              // Async enrichment
    const dept = await getDept(row.employee_id);
    return { ...row, department: dept };
  });

for await (const batch of csvSuperPro('employees.csv', {
  licenseKey: process.env.CSV_SUPER_KEY,
  transform: pipeline.toFn(),
})) {
  await db.employees.insertMany(batch.rows);
}

Pro Pricing

| Plan | Price | Features | |------------|------------|------------------------------------------------------| | Free | $0 forever | Streaming + Batch + TypeScript + RFC 4180 + Progress | | Pro | $17/month | + Multi-thread + Transform + Priority support | | Enterprise | Custom | + SLA + Custom seat count + Dedicated support |

Get Pro →

Architecture

I/O Layer           Parser Layer           Delivery Layer
──────────          ──────────────         ───────────────
fs.createReadStream → CsvParser (FSM)  →  BatchController
     ↓                    ↓                     ↓
  64KB chunks      RFC 4180 State Machine   Async Generator
  (backpressure)   (incremental, chunked)   (yield + await)

Key insight: When the consumer awaits inside for await...of, the Async Generator suspends at the yield. While suspended, getNextChunk() is not called, so readStream.resume() is never called — the stream stays paused. This is mechanical, guaranteed backpressure.

Memory formula:

heap ≈ chunkSize (64KB) + batch_size × avg_row_size
     ≈ 0.064MB  +  1000  ×  0.05MB
     ≈ ~50MB (constant regardless of file size)

RFC 4180 Compliance

Full support for all CSV edge cases:

✅ Fields with commas:    "Smith, John"
✅ Fields with newlines:  "line1\nline2"
✅ Escaped quotes (x2):  "He said ""hi"""
✅ Empty fields:          a,,b
✅ CRLF line endings:     \r\n
✅ No trailing newline    (handled by finalize())
✅ Unicode content:       UTF-8, UTF-16 LE/BE
✅ Custom delimiters:     TSV (\t), PSV (|), SSV (;)

Error Handling

import { csvSuper, CsvSuperError, ParseError } from 'csv-super';

try {
  for await (const batch of csvSuper('data.csv')) {
    await processRows(batch.rows);
  }
} catch (err) {
  if (err instanceof ParseError) {
    console.error(`Parse error at line ${err.lineNumber}: ${err.message}`);
  } else if (err instanceof CsvSuperError) {
    console.error(`csv-super error [${err.code}]: ${err.message}`);
  } else {
    throw err; // re-throw unexpected errors
  }
}

TypeScript

Fully typed out of the box. No @types package needed.

import type {
  CsvRow,           // Record<string, string>
  BatchResult,      // { rows, batchIndex, count, totalSoFar }
  ProgressInfo,     // { percentage, speedMBps, ... }
  CsvSuperOptions,  // full options type
  TransformFn,      // (row: CsvRow) => CsvRow | null | Promise<CsvRow | null>
} from 'csv-super';

Contributing

Issues and PRs are welcome at github.com/csv-super/csv-super.

git clone https://github.com/Brah-Timo/csv-super
cd csv-super
npm install
npm test
npm run bench

License

Core (free tier): MIT License
Pro (multi-thread + transform): Commercial License — see csv-super.dev/pro

Built with ❤️ for data engineers who have felt the pain of a 10GB CSV on a 16GB server.