@danmasta/csv

v1.0.0

CSV Parser

CSV

CSV Parser for Node Apps

Features:

  • Easy to use
  • Simple, lightweight, and fast
  • Parse csv data to js objects
  • Header field mapping
  • Value field mapping
  • Supports the CSV spec (RFC 4180)
  • Memory safe, keeps a maximum of 1 row in memory during streaming
  • Supports streaming, string parsing, and promises
  • Supports multi-byte newlines (\n, \r\n)
  • Custom newline support
  • Works with buffers and strings
  • Only 1 dependency: lodash

About

I was working on a BI system at work, and as our user base kept growing, our jobs got slower and slower. A huge bottleneck for us was csv parsing. We tried many of the popular csv libraries on npm, but were unable to get a consistently decent level of performance from our many jobs. We parse around a billion rows per day, with the average row being around 1.5 KB across 80+ columns (it's not uncommon for a single job to parse over 10 GB of data). Some columns have complex JSON strings, some are empty, some are in languages like Chinese, and some use \r\n for newlines. We needed something fast, chunk-boundary safe, and character-encoding safe; something that worked well on all of our various csv formats. Some of the other parsers failed to accurately parse \r\n newlines across chunk boundaries, some failed altogether, some caused out-of-memory errors, some were just crazy slow, and some had encoding errors with multi-byte characters.

This library is the result of those learnings. It's fast, stable, works on any data, and is highly configurable. Feel free to check out and run the benchmarks. I'm currently able to achieve around 100 MB per second of parsing (on an i7 9700K), depending on the data and CPU.

Usage

Add csv as a dependency for your app and install via npm

npm install @danmasta/csv --save

Require the package in your app

const csv = require('@danmasta/csv');

By default, csv() returns a transform stream interface. You can use the methods promise(), map(), each(), and parse() to consume row objects via promises, or the synchronous Parser.parse() to get them as a plain array.
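For example, map() follows the same pattern as the other instance methods. A minimal sketch (the sample string and column names are just for illustration):

const csv = require('@danmasta/csv');

// Illustrative input with a header row
const str = 'id,name\n1,Alpha\n2,Beta\n';

// map() runs the iterator over each row and resolves with the new array
csv().parse(str).map(row => row.name).then(names => {
    console.log('Names:', names); // e.g. [ 'Alpha', 'Beta' ]
});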

Options

name | type | description
-----|------|------------
headers | boolean\|object\|array\|function | If truthy, reads the first line as header fields. If false, disables header fields and replaces them with integer values of 0-n. If an object, each original header name is replaced with its value in the mapping. If a function, the header field is set to the function's return value. Default is true
values | boolean\|object\|function | Same as headers, but for values. If you want to replace values on the fly, provide an object: {'null': null, 'true': true}. If a function, the value is replaced with the function's return value. Default is null
newline | string | Which character to use for newlines. Default is \n
cb | function | Function to call when ready to flush a complete row. Used only for the Parser class. If you implement a custom parser you will need to include a cb function. Default is this.push
buffer | boolean | If true, uses a string decoder to parse buffers. This is set automatically by the convenience methods, but will need to be set on custom parser instances. Default is false
encoding | string | Which encoding to use when parsing rows in buffer mode. This doesn't matter when using strings or streams not in buffer mode. Default is utf8
stream | object | Options passed to the transform stream constructor. Default is { objectMode: true, encoding: null }
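As an illustration, several of these options can be combined in a single call (a sketch; the sample data and value mapping are hypothetical):

const csv = require('@danmasta/csv');

// Parse CRLF-delimited data and replace common string literals with js values
const opts = {
    newline: '\r\n',
    values: { 'null': null, 'true': true, 'false': false }
};

csv.parse('id,active,note\r\n1,true,null\r\n', opts);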

Methods

Name | Description
-----|------------
Parser(opts, str) | Parser class for generating custom csv parser instances. Accepts an options object and an optional string or buffer to parse
Parser.parse(str, opts) | Synchronous parse function. Accepts a string or buffer and an optional options object. Returns an array of parsed rows
promise() | Returns a promise that resolves with an array of parsed rows
map(fn) | Runs an iterator function over each row. Returns a promise that resolves with the new array
each(fn) | Runs an iterator function over each row. Returns a promise that resolves with undefined
parse(str) | Parses a string or buffer and pushes rows to the stream. Call flush() when finished. Returns the csv parser instance
flush() | Flushes remaining data and pushes final rows to the stream. Returns the csv parser instance
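As a sketch of how parse() and flush() fit together outside of the chained examples below (it assumes str already holds the csv input, and that promise() still resolves with the collected rows after an explicit flush()):

const csv = require('@danmasta/csv');

const parser = csv();

parser.parse(str);  // push parsed rows to the stream
parser.flush();     // flush any remaining buffered data as the final row

parser.promise().then(rows => {
    console.log('Rows:', rows);
});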

Examples

Use CRLF line endings

csv.parse(str, { newline: '\r\n' });

Parse input from a stream and write rows to a destination stream

const fs = require('fs');

let read = fs.createReadStream('./data.csv');

read.pipe(csv()).pipe(myDestinationStream());
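myDestinationStream() above is just a placeholder; any object-mode writable works, since the parser pushes row objects. A minimal sketch using node's stream module:

const { Writable } = require('stream');

// Object-mode writable that logs each parsed row
function myDestinationStream () {
    return new Writable({
        objectMode: true,
        write (row, encoding, callback) {
            console.log('Row:', row);
            callback();
        }
    });
}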

Parse a string and get results in a promise

csv().parse(str).promise().then(res => {
    console.log('Rows:', res);
});

Run an iteration function over each row using each()

csv().parse(str).each(row => {
    console.log('Row:', row);
}).then(() => {
    console.log('Parse complete!');
});

Update header values

let headers = {
    time: 'timestamp',
    latitude: 'lat',
    longitude: 'lng'
};

csv.parse(str, { headers });

Update headers and/or values with functions

function headers (str) {
    return str.toLowerCase();
}

function values (val) {
    if (val === 'false') return false;
    if (val === 'true') return true;
    if (val === 'null' || val === 'undefined') return null;
    return val.toLowerCase();
}

csv.parse(str, { headers, values });

Create a custom parser that pushes to a queue

const Queue = require('queue');
const q = new Queue();

// Pass the queue's push method as cb, so completed rows are handed to the
// queue instead of being pushed to the stream
const parser = csv({ cb: q.push.bind(q) });

parser.parse(str);
parser.flush();

Testing

Testing is currently done with mocha and chai. To execute the tests, run npm run test. To generate unit test coverage reports, run npm run coverage.

Benchmarks

Benchmarks are currently built using gulp. Run gulp bench to test timings and bytes per second.

Filename         Mode    Density  Rows    Bytes       Time (ms)  Rows/Sec      Bytes/Sec
---------------  ------  -------  ------  ----------  ---------  ------------  --------------
Earthquakes      String  0.11     72,689  11,392,064  121.66     597,492.68    93,641,058.06
Earthquakes      Buffer  0.11     72,689  11,392,064  104.8      693,582.44    108,700,567.12
Pop-By-Zip-LA    String  0.18     3,199   120,912     1.88       1,702,582.7   64,352,197.35
Pop-By-Zip-LA    Buffer  0.18     3,199   120,912     1.28       2,490,424.44  94,130,103.07
Stats-By-Zip-NY  String  0.41     2,369   263,168     15.68      151,058.46    16,780,816.02
Stats-By-Zip-NY  Buffer  0.41     2,369   263,168     14.52      163,183.15    18,127,726.59

Speed and throughput are highly dependent on the density (token matches / byte length) of the data, as well as the size of the objects created. The earthquakes file represents a roughly average level of density for common csv datasets. The stats-by-zip-code file is an example of very dense csv data.

Contact

If you have any questions, feel free to get in touch.