@cle-does-things/sunbears

v1.2.1

Published

2 months ago

A CSV data loader for TypeScript with an API similar to Polars and Pandas, written in pure Rust

cle-does-things

CSV pandas polars Rust napi-rs node-addon

`sunbears`

A CSV data loader for TypeScript with an API similar to Polars and Pandas, written in pure Rust.

Installation

Install the package with your favorite package manager:

npm install @cle-does-things/sunbears

Usage

`readCsv`

The main function for sunbears is readCsv, which loads the data contained in a CSV file as a DataFrame, a columnar data format.

import { readCsv } from '@cle-does-things/sunbears'

const df = readCsv('test.csv')

The DataFrame class exposes the following methods:

colDtype: retrieve the data type of the values in a column (integer, float, boolean or string)
get: retrieve a column
writeCsv: write the DataFrame to a CSV file (see the dedicated section)
dropNull / fillNull: drop or fill null values (see the dedicated section)
dropNan / fillNan: drop or fill NaN values (see the dedicated section)

const dt = df.colDtype('name')
const colData = df.get('name')

Based on the data type of the column, you can use one of the following helper functions to extract the associated array of data (as (string | null)[], (boolean | null)[] or (number | null)[]):

import { DataType, asBooleanArray, asFloatArray, asIntArray, asStringArray } from '@cle-does-things/sunbears'

let arr
switch (dt) {
  case DataType.Float:
    arr = asFloatArray(colData)
    break
  case DataType.Integer:
    arr = asIntArray(colData)
    break
  case DataType.Boolean:
    arr = asBooleanArray(colData)
    break
  default:
    arr = asStringArray(colData)
    break
}

If the helper function is used on the wrong data type, it will return null.
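
Because a type mismatch yields null rather than an error, it can be convenient to fail fast instead of passing null along. The following is a minimal sketch on plain values standing in for helper results; `unwrapColumn` is a hypothetical helper and not part of sunbears:

```typescript
// Hypothetical helper (not part of sunbears): turn a null result into an error.
function unwrapColumn<T>(values: T[] | null, column: string): T[] {
  if (values === null) {
    throw new Error(`column "${column}" does not have the requested data type`)
  }
  return values
}

// Stand-ins for what a helper such as asStringArray or asIntArray might return.
const matchingResult: (string | null)[] | null = ['John Doe', null]
const mismatchedResult: (number | null)[] | null = null

// The matching case passes through unchanged.
const nameValues = unwrapColumn(matchingResult, 'name')

// The mismatched case raises an error that names the column.
let mismatchMessage = ''
try {
  unwrapColumn(mismatchedResult, 'name')
} catch (err) {
  mismatchMessage = (err as Error).message
}
```

Whether throwing is preferable to handling null inline depends on how the surrounding code treats unexpected column types.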

You can then chain these methods and functions to perform filter or map operations (natively supported by TypeScript arrays):

const filteredNames = asStringArray(readCsv('test.csv').get('name'))?.filter((n) => n === 'John Doe')
const mappedNames = asStringArray(readCsv('test.csv').get('name'))?.map((n) => n?.toUpperCase() ?? null) // entries may be null
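
Since the extracted arrays can contain null entries, callbacks need to account for them. The sketch below works on a plain array standing in for a helper result and shows two options: dropping nulls with a type guard before mapping, or keeping them in place. `isPresent` is a hypothetical helper, not part of sunbears:

```typescript
// Hypothetical type guard (not part of sunbears): narrows away null entries.
function isPresent<T>(value: T | null): value is T {
  return value !== null
}

// Stand-in for the result of asStringArray(df.get('name')).
const rawNames: (string | null)[] = ['John Doe', null, 'Jane Roe']

// Option 1: drop nulls first, so the map callback receives plain strings.
const upperNames: string[] = rawNames.filter(isPresent).map((n) => n.toUpperCase())

// Option 2: keep nulls in place and only transform the non-null entries.
const upperOrNull: (string | null)[] = rawNames.map((n) => (n === null ? null : n.toUpperCase()))
```

The first option changes the array length, so it is only appropriate when positions do not need to line up with other columns.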

`DataFrame.writeCsv`

The writeCsv method writes a DataFrame to CSV.

You can construct a DataFrame directly from arrays, using the following helper functions:

import { DataFrame, toIntColumn, toFloatColumn, toStringColumn, toBoolColumn } from '@cle-does-things/sunbears'

const col1 = toStringColumn(['hello', 'world'])
const col2 = toFloatColumn([1.2, 2.3])
const col3 = toIntColumn([4, 5])
const col4 = toBoolColumn([true, false])

You can then use the fromColumns factory of the DataFrame class to turn the column data into a DataFrame. If the columns do not all have the same length, an error is thrown.

const df = DataFrame.fromColumns({
  col1: col1,
  col2: col2,
  col3: col3,
  col4: col4,
})
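
If you would rather detect mismatched lengths before calling the factory, a pre-check on the raw arrays is one option. This is a sketch only; `assertSameLength` is a hypothetical helper, not part of sunbears, and it operates on plain arrays rather than on column objects:

```typescript
// Hypothetical pre-check (not part of sunbears): verify that all raw arrays
// have the same length and return that length.
function assertSameLength(columns: Record<string, unknown[]>): number {
  const colNames = Object.keys(columns)
  if (colNames.length === 0) return 0
  const expected = columns[colNames[0]].length
  for (const colName of colNames) {
    const actual = columns[colName].length
    if (actual !== expected) {
      throw new Error(`column "${colName}" has ${actual} rows, expected ${expected}`)
    }
  }
  return expected
}

// Both arrays have two entries, so the check passes.
const checkedRows = assertSameLength({
  col1: ['hello', 'world'],
  col2: [1.2, 2.3],
})
```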

Writing to the CSV file is then trivial:

df.writeCsv('test.csv')

The file will look like this:

col1,col2,col3,col4
hello,1.2,4,true
world,2.3,5,false
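
To make the column-to-row mapping concrete, here is a plain TypeScript sketch that produces the same text from the same data. It is not the sunbears implementation (which is written in Rust); it assumes that no value needs quoting or escaping, and rendering null as an empty field is also an assumption:

```typescript
type Cell = string | number | boolean | null

// Illustrative only: serialize equal-length columns row by row.
function columnsToCsv(columns: Record<string, Cell[]>): string {
  const headers = Object.keys(columns)
  const rowCount = headers.length === 0 ? 0 : columns[headers[0]].length
  const lines = [headers.join(',')]
  for (let i = 0; i < rowCount; i++) {
    // Assumption: null becomes an empty field.
    const fields = headers.map((h) => {
      const cell = columns[h][i]
      return cell === null ? '' : String(cell)
    })
    lines.push(fields.join(','))
  }
  return lines.join('\n')
}

const csvText = columnsToCsv({
  col1: ['hello', 'world'],
  col2: [1.2, 2.3],
  col3: [4, 5],
  col4: [true, false],
})
```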

Null and NaN dropping and filling

The DataFrame class also supports methods for dropping or filling null values in the columns:

const df = readCsv('test.csv')
df.dropNull() // drop null
df.fillNull() // fill null values with the zero value of their type
df.fillNull('a string', 0.5, 4, true) // fill null values with custom values

You can filter out and change NaN values as well (only applies if there are float-typed columns):

df.dropNan()
df.fillNan() // fill with zero value
df.fillNan(99.3) // fill with a specific value
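
The difference between null and NaN handling can be illustrated on plain arrays. The functions below are hypothetical stand-ins, not the sunbears implementation, and they act on a single column of values. The specific zero values used here (0, an empty string, false) are an assumption about what "zero value" means for each type:

```typescript
// Illustrative stand-in: replace null entries with a fallback value.
function fillNullWith<T>(values: (T | null)[], fallback: T): T[] {
  return values.map((v) => (v === null ? fallback : v))
}

// Illustrative stand-in: replace NaN entries in a float column.
function fillNanWith(values: number[], fallback: number): number[] {
  // NaN never compares equal to anything, so an explicit isNaN check is needed.
  return values.map((v) => (isNaN(v) ? fallback : v))
}

// Illustrative stand-in: remove NaN entries from a float column.
function dropNanValues(values: number[]): number[] {
  return values.filter((v) => !isNaN(v))
}

const filledFloats = fillNullWith([1.5, null, 3], 0) // assumed zero value for numbers
const filledStrings = fillNullWith(['a', null], '') // assumed zero value for strings
const replacedNan = fillNanWith([1, NaN, 3], 99.3)
const withoutNan = dropNanValues([1, NaN, 3])
```

Note that null marks a missing value in a column of any type, while NaN is a floating-point value and therefore only occurs in float columns.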

Benchmarking

sunbears was benchmarked with a tinybench-based script. The script reports latency statistics for the readCsv and writeCsv functions when reading and writing increasingly large CSV files (100, 1,000, 100,000 and 1,000,000 rows).

The latest benchmark run was:

| Task | Latency avg (s) | Latency med (s) | Throughput avg (ops/s) | Throughput med (ops/s) | Samples |
| ------------------------- | ---------------- | -------------------- | ---------------------- | ---------------------- | ------- |
| Read a 100-lines CSV | 0.000054 ± 0.24% | 0.000050 ± 0.0000023 | 18964 ± 0.16% | 19967 ± 921 | 18654 |
| Read a 1000-lines CSV | 0.000289 ± 0.54% | 0.000279 ± 0.0000090 | 3518 ± 0.32% | 3583 ± 116 | 3464 |
| Read a 100000-lines CSV | 0.028000 ± 1.38% | 0.027537 ± 0.000254 | 36 ± 1.19% | 36 ± 0 | 64 |
| Read a 1000000-lines CSV | 0.310751 ± 0.75% | 0.308330 ± 0.004228 | 3 ± 0.70% | 3 ± 0 | 64 |
| Write a 100-lines CSV | 0.000076 ± 0.46% | 0.000069 ± 0.0000052 | 13665 ± 0.28% | 14467 ± 1148 | 13140 |
| Write a 1000-lines CSV | 0.000213 ± 0.43% | 0.000209 ± 0.0000056 | 4724 ± 0.16% | 4785 ± 130 | 4700 |
| Write a 100000-lines CSV | 0.013886 ± 0.86% | 0.013756 ± 0.000171 | 72 ± 0.78% | 73 ± 1 | 73 |
| Write a 1000000-lines CSV | 0.146282 ± 1.39% | 0.146108 ± 0.005752 | 7 ± 1.31% | 7 ± 0 | 64 |

Here is how the tool compares to the read_csv and to_csv/write_csv functions in Pandas and Polars (script):

| Dataset | Pandas (s) | Polars (s) |
| ------------------- | ---------- | ---------- |
| Read 100 lines | 0.038291 | 0.033831 |
| Read 1000 lines | 0.037794 | 0.016517 |
| Read 100000 lines | 0.471109 | 0.029076 |
| Read 1000000 lines | 4.153507 | 0.216254 |
| Write 100 lines | 0.035926 | 0.043052 |
| Write 1000 lines | 0.067816 | 0.017617 |
| Write 100000 lines | 0.892885 | 0.031329 |
| Write 1000000 lines | 8.549390 | 0.331897 |

And here is how it compares with csv-parse and csv-stringify+writeFileSync (script):

| Task | Latency avg (s) | Latency med (s) | Throughput avg (ops/s) | Throughput med (ops/s) | Samples |
| ------------------------- | ---------------- | -------------------- | ---------------------- | ---------------------- | ------- |
| Read a 100-lines CSV | 0.000207 ± 2.31% | 0.000191 ± 0.0000087 | 5086 ± 0.31% | 5224 ± 244 | 4842 |
| Read a 1000-lines CSV | 0.001244 ± 0.42% | 0.001233 ± 0.0000229 | 806 ± 0.33% | 811 ± 15 | 805 |
| Read a 100000-lines CSV | 0.120565 ± 0.63% | 0.119515 ± 0.001141 | 8 ± 0.60% | 8 ± 0 | 64 |
| Read a 1000000-lines CSV | 1.216019 ± 0.46% | 1.209978 ± 0.006709 | 1 ± 0.44% | 1 ± 0 | 64 |
| Write a 100-lines CSV | 0.000087 ± 0.52% | 0.000080 ± 0.0000078 | 12010 ± 0.31% | 12526 ± 1267 | 11503 |
| Write a 1000-lines CSV | 0.000290 ± 1.14% | 0.000275 ± 0.0000192 | 3555 ± 0.42% | 3635 ± 258 | 3451 |
| Write a 100000-lines CSV | 0.027303 ± 2.96% | 0.027014 ± 0.000900 | 37 ± 1.87% | 37 ± 1 | 64 |
| Write a 1000000-lines CSV | 0.273814 ± 2.08% | 0.265154 ± 0.005824 | 4 ± 1.77% | 4 ± 0 | 64 |
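
To put the three tables side by side, the sketch below computes rough ratios for one task (reading the 1,000,000-row file) from the average latencies reported above. The Python numbers come from a separate script and may have been measured differently, so the ratios are indicative only:

```typescript
// Average latencies in seconds for reading the 1,000,000-row file,
// copied from the tables above.
const readLatency1M = {
  sunbears: 0.310751,
  pandas: 4.153507,
  polars: 0.216254,
  csvParse: 1.216019,
}

// A ratio above 1 means sunbears was faster than the other tool on this task.
function speedupOverSunbears(otherSeconds: number): number {
  return otherSeconds / readLatency1M.sunbears
}

const readSpeedups = {
  pandas: speedupOverSunbears(readLatency1M.pandas), // roughly 13x
  polars: speedupOverSunbears(readLatency1M.polars), // roughly 0.7x, i.e. Polars was faster here
  csvParse: speedupOverSunbears(readLatency1M.csvParse), // roughly 3.9x
}
```

The same calculation can be repeated for the other rows; at small file sizes the Python timings are likely dominated by fixed overhead, so those ratios say less about parsing speed.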

Development

Requirements:

Install the latest Rust
Install Node.js 10+, which fully supports Node-API
Install Yarn

Test locally

yarn # install dependencies
yarn build # build the package from src/lib.rs
yarn test

And you will see something along the lines of:

$ ava

  ✔ readCsv reads a CSV and returns a DataFrame with correct datatypes
  ✔ DataFrame class methods work correctly
  ✔ Column to array functions work
  ─

  3 tests passed

Benchmarks

To run benchmarks, you will need uv installed (for benchmark data generation).

Run benchmarks with:

yarn bench
yarn bench:comp-py # compare with python libraries
yarn bench:comp-ts # compare with csv-parse

The commands will generate testfiles/generated-*.csv files (with 100, 1,000, 100,000 and 1,000,000 rows) and run time-based benchmarks for the readCsv (sunbears), read_csv (Pandas/Polars) and parse (csv-parse) functions.

Linting and Formatting

You can run formatting for TypeScript, TOML and Rust code with one command:

yarn format

You can also run specific formatting checks:

yarn format:rs
yarn format:rs-check # checks formatting, without modifying files
yarn format:prettier
yarn format:toml

For linting, you need to run both oxlint and clippy:

yarn lint # oxlint
yarn clippy # clippy

Release package

Only necessary for maintainers

NOTE: Ensure you have set your NPM_TOKEN in the GitHub project settings.

When you want to release the package:

yarn version [<newversion> | major | minor | patch | premajor | preminor | prepatch | prerelease [--preid=<prerelease-id>] | from-git]

git push

GitHub Actions will do the rest for you.

WARN: Don't run npm publish manually.

License

MIT
