Data Engine
Production-ready Node.js package for importing and exporting massive datasets (millions to billions of rows) with a high-performance Go engine.
Features
✅ Streaming Architecture - Handle datasets far larger than system memory
✅ Multi-Core Performance - Fully utilize all CPU cores with worker pools
✅ Stable Memory Usage - Constant memory footprint regardless of dataset size
✅ Multiple Formats - CSV, TSV, JSONL, XLSX (import) + Parquet (export)
✅ Database Support - PostgreSQL and MySQL with optimized batch operations
✅ Production-Grade - Graceful shutdown, error handling, progress reporting
✅ Zero Native Compilation - Prebuilt binaries downloaded automatically
Installation
npm install @arjunanda/data-engine
The package automatically downloads the appropriate prebuilt binary for your platform during installation.
Supported Platforms:
- Linux (x64, arm64)
- macOS (x64, arm64)
- Windows (x64)
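To see which binary the postinstall step will pick on a given machine, you can compare Node's reported platform and architecture against the list above. This is only a local sanity check, not part of the package's API:

// Sanity check: is this platform/arch in the supported list above?
const supported = new Set(["linux-x64", "linux-arm64", "darwin-x64", "darwin-arm64", "win32-x64"]);
const key = `${process.platform}-${process.arch}`;
console.log(supported.has(key) ? `Supported platform: ${key}` : `No prebuilt binary for: ${key}`);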
Quick Start
Import CSV to PostgreSQL
const { importData } = require("@arjunanda/data-engine");
await importData({
file: "/data/huge-dataset.csv",
format: "auto", // auto-detect format
dsn: "postgres://user:pass@localhost/mydb",
table: "my_table",
batchSize: 5000,
workers: 0, // 0 = auto-detect CPU count
});
Export Database to JSONL
const { exportData } = require("@arjunanda/data-engine");
await exportData({
output: "/data/export.jsonl",
format: "jsonl",
dsn: "postgres://user:pass@localhost/mydb",
query:
"SELECT * FROM large_table WHERE created_at > NOW() - INTERVAL '30 days'",
batchSize: 5000,
workers: 0,
});
TypeScript Support
The package includes full TypeScript type definitions for enhanced IDE support and type safety.
TypeScript Usage
import {
importData,
exportData,
ImportOptions,
ExportOptions,
} from "@arjunanda/data-engine";
// Full type safety and autocomplete
const options: ImportOptions = {
file: "./data.csv",
format: "auto", // Autocomplete: 'auto' | 'csv' | 'tsv' | 'jsonl' | 'xlsx'
dsn: "postgres://localhost/mydb",
table: "users",
batchSize: 10000,
workers: 0,
};
await importData(options);
// Export with type-safe format
const exportOpts: ExportOptions = {
output: "./export.parquet",
format: "parquet", // Autocomplete: 'csv' | 'tsv' | 'jsonl' | 'parquet'
dsn: "postgres://localhost/mydb",
query: "SELECT * FROM users",
};
await exportData(exportOpts);
See examples.ts for more TypeScript examples.
API Reference
importData(options)
Import data from a file into a database.
Options:
- file (string, required) - Path to input file
- format (string) - File format: auto, csv, tsv, jsonl, xlsx (default: auto)
- dsn (string, required) - Database connection string
- table (string, required) - Target table name
- batchSize (number) - Rows per batch (default: 5000)
- workers (number) - Worker count, 0 = auto-detect (default: 0)
Returns: Promise<void>
Example:
await importData({
file: "./data.csv",
format: "csv",
dsn: "postgres://localhost/db",
table: "users",
batchSize: 10000,
});
exportData(options)
Export data from a database to a file.
Options:
- output (string, required) - Path to output file
- format (string, required) - Output format: csv, tsv, jsonl, parquet
- dsn (string, required) - Database connection string
- query (string, required) - SQL query to execute
- batchSize (number) - Rows per batch (default: 5000)
- workers (number) - Worker count, 0 = auto-detect (default: 0)
Returns: Promise<void>
Example:
await exportData({
output: "./export.parquet",
format: "parquet",
dsn: "mysql://user:pass@localhost/db",
query: "SELECT * FROM orders WHERE year = 2024",
});
Supported Formats
Import (File → Database)
| Format | Extension | Notes |
| ------ | ------------------- | -------------------------------------- |
| CSV | .csv | Fully supported, streaming |
| TSV | .tsv | Tab-separated values |
| JSONL | .jsonl, .ndjson | Newline-delimited JSON |
| XLSX | .xlsx | Limited: 100MB max, streaming only |
Not Supported:
- ❌ XLS (legacy Excel) - Convert to XLSX or CSV
- ❌ JSON arrays - Use JSONL instead
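If your source data is a JSON array, a small one-off conversion to JSONL is usually enough. A minimal sketch, assuming the array fits in memory (file names are placeholders):

const fs = require("fs");

// Rewrite a JSON array file as newline-delimited JSON (JSONL)
const rows = JSON.parse(fs.readFileSync("input.json", "utf8"));
const out = fs.createWriteStream("output.jsonl");
for (const row of rows) {
  out.write(JSON.stringify(row) + "\n");
}
out.end();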
Export (Database → File)
| Format | Extension | Notes |
| ------- | ---------- | ---------------------------------------- |
| CSV | .csv | Comma-separated values |
| TSV | .tsv | Tab-separated values |
| JSONL | .jsonl | Newline-delimited JSON |
| Parquet | .parquet | Columnar format, optimized for analytics |
Database Connection Strings
PostgreSQL
postgres://user:password@host:port/database
postgresql://user:password@host:port/database?sslmode=require
MySQL
user:password@tcp(host:port)/database
mysql://user:password@host:port/database
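Credentials containing characters such as @, / or : typically need to be URL-encoded in URL-style DSNs. A minimal sketch (the environment variable names are only an illustration):

// Build a PostgreSQL DSN with URL-encoded credentials
const user = encodeURIComponent(process.env.DB_USER || "user");
const pass = encodeURIComponent(process.env.DB_PASS || "p@ss/word");
const dsn = `postgres://${user}:${pass}@localhost:5432/mydb`;
console.log(dsn); // postgres://user:p%40ss%2Fword@localhost:5432/mydb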
Performance Tuning
Batch Size
- Smaller batches (1000-2000): Lower memory, more frequent DB commits
- Larger batches (10000-20000): Higher throughput, more memory
Workers
- Default (0): Auto-detects CPU count - recommended for most cases
- Manual: Set to CPU count for CPU-bound operations, or higher for I/O-bound
Example: Tuning for 100M row import
await importData({
file: "/data/100m-rows.csv",
dsn: "postgres://localhost/db",
table: "events",
batchSize: 10000, // Larger batches for throughput
workers: 8, // Match CPU cores
});
Expected Performance:
- CSV import: ~100,000 - 500,000 rows/sec (depends on hardware and network)
- Memory usage: Constant (~50-200MB regardless of file size)
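As a rough worked example, at ~120,000 rows/sec a 100M-row import completes in about 14 minutes.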
Error Handling
The package provides detailed error messages and proper exit codes:
try {
await importData({
file: "./data.csv",
dsn: "postgres://localhost/db",
table: "users",
});
console.log("Import successful!");
} catch (err) {
console.error("Import failed:", err.message);
// err.message contains detailed error information
}
Common Errors:
- Invalid DSN format
- File not found
- Unsupported format (XLS, JSON arrays)
- Database connection failure
- XLSX file exceeds 100MB limit
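Transient failures such as a dropped database connection can be retried from the calling code. A minimal sketch (the retry count and delay here are arbitrary):

const { importData } = require("@arjunanda/data-engine");

// Retry an import a few times, waiting briefly between attempts
async function importWithRetry(options, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await importData(options);
    } catch (err) {
      if (i === attempts) throw err;
      console.error(`Attempt ${i} failed: ${err.message}; retrying...`);
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }
  }
}

Note that re-running a partially completed import can insert duplicate rows, depending on your table constraints.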
Graceful Shutdown
The engine supports graceful shutdown on SIGINT (Ctrl+C) and SIGTERM:
const operation = importData({
file: "./huge.csv",
dsn: "postgres://localhost/db",
table: "data",
});
// User presses Ctrl+C
// Engine will:
// 1. Stop reading new data
// 2. Finish processing current batches
// 3. Exit cleanly without data corruption
await operation; // Will reject with "Operation cancelled by user"
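To treat a cancellation differently from a real failure, catch the rejection and inspect its message. A minimal sketch that matches on the message shown above:

const { importData } = require("@arjunanda/data-engine");

const operation = importData({
  file: "./huge.csv",
  dsn: "postgres://localhost/db",
  table: "data",
});

try {
  await operation;
} catch (err) {
  // A Ctrl+C / SIGTERM cancellation surfaces as a rejection like any other error
  if (err.message.includes("cancelled")) {
    console.log("Operation was cancelled; engine shut down cleanly.");
  } else {
    throw err;
  }
}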
Production Deployment
Docker Example
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
CMD ["node", "your-script.js"]Environment Variables
# Database connection
export DATABASE_DSN="postgres://user:pass@db-host/mydb"
# Tuning
export BATCH_SIZE=10000
export WORKERS=8
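Assuming your own script (not the package) is what reads these variables, one way to wire them into the options:

const { importData } = require("@arjunanda/data-engine");

// Pull connection and tuning settings from the environment, with defaults
await importData({
  file: "/data/huge-dataset.csv",
  dsn: process.env.DATABASE_DSN,
  table: "events",
  batchSize: Number(process.env.BATCH_SIZE) || 5000,
  workers: Number(process.env.WORKERS) || 0,
});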
Monitoring
The engine outputs progress to stderr:
[INFO] Mode: import
[INFO] Workers: 8
[INFO] Batch Size: 5000
[INFO] Detected 15 columns: [id, name, email, ...]
[PROGRESS] Processed 500000 rows (125000 rows/sec)
[PROGRESS] Processed 1000000 rows (130000 rows/sec)
[SUCCESS] Operation completed successfully
[INFO] Import completed: 2500000 rows in 20.5 seconds (121951 rows/sec)
Troubleshooting
Binary not found
If the postinstall script fails to download the binary:
- Manually download from GitHub Releases
- Place in node_modules/@arjunanda/data-engine/bin/
- Make executable: chmod +x node_modules/@arjunanda/data-engine/bin/data-engine
XLSX file too large
Error: XLSX file too large: 150000000 bytes (max 100000000 bytes)
Solution: Convert to CSV for large files:
# Using LibreOffice
libreoffice --headless --convert-to csv large-file.xlsx
# Or use online converters
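Another option is a programmatic conversion with a spreadsheet library such as SheetJS (the xlsx npm package, not bundled with this one). This sketch loads the whole workbook into memory, so it only suits files moderately over the limit:

const XLSX = require("xlsx"); // npm install xlsx
const fs = require("fs");

// Read the workbook and write the first sheet out as CSV
const workbook = XLSX.readFile("large-file.xlsx");
const firstSheet = workbook.Sheets[workbook.SheetNames[0]];
fs.writeFileSync("large-file.csv", XLSX.utils.sheet_to_csv(firstSheet));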
Out of memory
If you encounter OOM errors, reduce batchSize:
await importData({
// ... other options
batchSize: 1000, // Reduce from default 5000
});
License
MIT
Contributing
Contributions welcome! Please open an issue or PR on GitHub.
