@cle-does-things/sunbears
v1.1.0
A CSV data loader for TypeScript with an API similar to Polars and Pandas, written in pure Rust
sunbears
A CSV data loader for TypeScript with an API similar to Polars and Pandas, written in pure Rust.
Installation
Install the package with your favorite package manager:
npm install @cle-does-things/sunbears
Usage
readCsv
The main function for sunbears is readCsv, which loads the data contained in a CSV file as a DataFrame, a columnar data format.
import { readCsv } from '@cle-does-things/sunbears'
const df = readCsv('test.csv')
The DataFrame class exposes three methods:
- colDtype: retrieve the data type of the records contained within a column (integer, float, boolean or string)
- get: get a column
- writeCsv: write the DataFrame to CSV (see the DataFrame.writeCsv section)
const dt = df.colDtype('name')
const colData = df.get('name')
Based on the data type of the column, you can use one of the following helper functions to extract the associated array of data (as string[], boolean[] or number[]):
import { DataType, asBooleanArray, asFloatArray, asIntArray, asStringArray } from '@cle-does-things/sunbears'
// The helpers return null on a type mismatch, so the union includes null.
let arr: number[] | boolean[] | string[] | null
switch (dt) {
  case DataType.Float:
    arr = asFloatArray(colData)
    break
  case DataType.Integer:
    arr = asIntArray(colData)
    break
  case DataType.Boolean:
    arr = asBooleanArray(colData)
    break
  default:
    arr = asStringArray(colData)
    break
}
If a helper function is used on a column of the wrong data type, it returns null.
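Because each helper returns null on a type mismatch, the result needs a null check (or a fallback) before it is used. The sketch below models that contract with a hypothetical tagged-union `Column` type and a local `asNumbers` function. Neither is part of sunbears; they only illustrate the pattern of falling back to an empty array.

```typescript
// Hypothetical stand-in for a column value; NOT the sunbears column type.
type Column =
  | { kind: 'string'; values: string[] }
  | { kind: 'float'; values: number[] }

// Mirrors the documented contract: return null when the dtype does not match.
function asNumbers(col: Column): number[] | null {
  return col.kind === 'float' ? col.values : null
}

const names: Column = { kind: 'string', values: ['John Doe', 'Jane Doe'] }
const scores: Column = { kind: 'float', values: [1.5, 2.5] }

// Fall back to an empty array so downstream code never has to handle null.
const wrong = asNumbers(names) ?? []
const right = asNumbers(scores) ?? []
const total = right.reduce((sum, x) => sum + x, 0)
```

With the real helpers, the same `?? []` fallback (or the optional chaining `?.`) should keep array operations safe when the column type is not what you expected.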
You can then chain these methods and functions to perform filter or map operations (natively supported by TypeScript arrays):
const filteredNames = asStringArray(readCsv('test.csv').get('name'))?.filter((n) => n === 'John Doe')
const mappedNames = asStringArray(readCsv('test.csv').get('name'))?.map((n) => n.toUpperCase())
DataFrame.writeCsv
The writeCsv method writes a DataFrame to CSV.
You can construct a DataFrame directly from arrays, using the following helper functions:
import { DataFrame, toIntColumn, toFloatColumn, toStringColumn, toBoolColumn } from '@cle-does-things/sunbears'
const col1 = toStringColumn(['hello', 'world'])
const col2 = toFloatColumn([1.2, 2.3])
const col3 = toIntColumn([4, 5])
const col4 = toBoolColumn([true, false])
You can then use the fromColumns factory of the DataFrame class to turn the columns into a DataFrame (an error is thrown if the columns do not all have the same length):
const df = DataFrame.fromColumns({
  col1: col1,
  col2: col2,
  col3: col3,
  col4: col4,
})
Writing to the CSV file is then trivial:
df.writeCsv('test.csv')
The file will look like this:
col1,col2,col3,col4
hello,1.2,4,true
world,2.3,5,false
Benchmarking
sunbears was benchmarked using the tinybench-based script you can find here. The script reports latency statistics for the readCsv and writeCsv functions reading and writing increasingly large CSV files (100, 1,000, 100,000 and 1,000,000 rows).
The latest benchmark run was:
| Task | Latency avg (s) | Latency med (s) | Throughput avg (ops/s) | Throughput med (ops/s) | Samples |
| ------------------------- | ---------------- | -------------------- | ---------------------- | ---------------------- | ------- |
| Read a 100-lines CSV | 0.000054 ± 0.24% | 0.000050 ± 0.0000023 | 18964 ± 0.16% | 19967 ± 921 | 18654 |
| Read a 1000-lines CSV | 0.000289 ± 0.54% | 0.000279 ± 0.0000090 | 3518 ± 0.32% | 3583 ± 116 | 3464 |
| Read a 100000-lines CSV | 0.028000 ± 1.38% | 0.027537 ± 0.000254 | 36 ± 1.19% | 36 ± 0 | 64 |
| Read a 1000000-lines CSV | 0.310751 ± 0.75% | 0.308330 ± 0.004228 | 3 ± 0.70% | 3 ± 0 | 64 |
| Write a 100-lines CSV | 0.000076 ± 0.46% | 0.000069 ± 0.0000052 | 13665 ± 0.28% | 14467 ± 1148 | 13140 |
| Write a 1000-lines CSV | 0.000213 ± 0.43% | 0.000209 ± 0.0000056 | 4724 ± 0.16% | 4785 ± 130 | 4700 |
| Write a 100000-lines CSV | 0.013886 ± 0.86% | 0.013756 ± 0.000171 | 72 ± 0.78% | 73 ± 1 | 73 |
| Write a 1000000-lines CSV | 0.146282 ± 1.39% | 0.146108 ± 0.005752 | 7 ± 1.31% | 7 ± 0 | 64 |
Here is how the tool compares to the read_csv and to_csv/write_csv functions in Pandas and Polars (script):
| Dataset | Pandas (s) | Polars (s) |
| ------------------- | ---------- | ---------- |
| Read 100 lines | 0.038291 | 0.033831 |
| Read 1000 lines | 0.037794 | 0.016517 |
| Read 100000 lines | 0.471109 | 0.029076 |
| Read 1000000 lines | 4.153507 | 0.216254 |
| Write 100 lines | 0.035926 | 0.043052 |
| Write 1000 lines | 0.067816 | 0.017617 |
| Write 100000 lines | 0.892885 | 0.031329 |
| Write 1000000 lines | 8.549390 | 0.331897 |
And here is how it compares with csv-parse and csv-stringify + writeFileSync (script):
| Task | Latency avg (s) | Latency med (s) | Throughput avg (ops/s) | Throughput med (ops/s) | Samples |
| ------------------------- | ---------------- | -------------------- | ---------------------- | ---------------------- | ------- |
| Read a 100-lines CSV | 0.000207 ± 2.31% | 0.000191 ± 0.0000087 | 5086 ± 0.31% | 5224 ± 244 | 4842 |
| Read a 1000-lines CSV | 0.001244 ± 0.42% | 0.001233 ± 0.0000229 | 806 ± 0.33% | 811 ± 15 | 805 |
| Read a 100000-lines CSV | 0.120565 ± 0.63% | 0.119515 ± 0.001141 | 8 ± 0.60% | 8 ± 0 | 64 |
| Read a 1000000-lines CSV | 1.216019 ± 0.46% | 1.209978 ± 0.006709 | 1 ± 0.44% | 1 ± 0 | 64 |
| Write a 100-lines CSV | 0.000087 ± 0.52% | 0.000080 ± 0.0000078 | 12010 ± 0.31% | 12526 ± 1267 | 11503 |
| Write a 1000-lines CSV | 0.000290 ± 1.14% | 0.000275 ± 0.0000192 | 3555 ± 0.42% | 3635 ± 258 | 3451 |
| Write a 100000-lines CSV | 0.027303 ± 2.96% | 0.027014 ± 0.000900 | 37 ± 1.87% | 37 ± 1 | 64 |
| Write a 1000000-lines CSV | 0.273814 ± 2.08% | 0.265154 ± 0.005824 | 4 ± 1.77% | 4 ± 0 | 64 |
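To make the tables easier to compare, the average latencies for the 1,000,000-line files can be turned into ratios. The snippet below is a small arithmetic sketch that uses only figures copied from the tables in this section; the `speedup` helper and the object layout are illustrative names, not part of any benchmark script.

```typescript
// Average latencies in seconds for the 1,000,000-line files, copied from the tables.
const read = { sunbears: 0.310751, pandas: 4.153507, polars: 0.216254, csvParse: 1.216019 }
const write = { sunbears: 0.146282, pandas: 8.54939, polars: 0.331897, csvStringify: 0.273814 }

// A ratio above 1 means sunbears was faster than the other tool on that task.
const speedup = (other: number, sunbears: number): number => other / sunbears

const readVsCsvParse = speedup(read.csvParse, read.sunbears)
const readVsPolars = speedup(read.polars, read.sunbears)
const writeVsPolars = speedup(write.polars, write.sunbears)
const writeVsCsvStringify = speedup(write.csvStringify, write.sunbears)

console.log({ readVsCsvParse, readVsPolars, writeVsPolars, writeVsCsvStringify })
```

On these figures the read ratio against csv-parse is roughly 3.9, while the read ratio against Polars is below 1, meaning Polars read the largest file faster. The Pandas and Polars timings come from a separate script, so cross-language ratios are best treated as indicative only.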
Development
Requirements:
- Install the latest Rust
- Install Node.js@10+, which fully supports Node-API
- Install yarn
Test locally
- yarn (install dependencies)
- yarn build (build package based on src/lib.rs)
- yarn test
And you will see something along the lines of:
$ ava
✔ readCsv reads a CSV and returns a DataFrame with correct datatypes
✔ DataFrame class methods work correctly
✔ Column to array functions work
─
3 tests passed
Benchmarks
To run benchmarks, you will need uv installed (for benchmark data generation).
Run benchmarks with:
yarn bench
yarn bench:comp-py # compare with python libraries
yarn bench:comp-ts # compare with csv-parse
The commands will generate testfiles/generated-*.csv files (with 100, 1,000, 100,000 and 1,000,000 rows) and run time-based benchmarks for the readCsv (sunbears), read_csv (Pandas/Polars) and parse (csv-parse) functions.
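For readers unfamiliar with time-based benchmarks, the following is a minimal sketch of the idea: call a function repeatedly, record each duration, and report the average and median latency in seconds. It does not use tinybench and is not the project's benchmark script; `measure` and the dummy workload are hypothetical names chosen for illustration.

```typescript
// Run `fn` `samples` times and summarise the observed latency in seconds.
function measure(fn: () => void, samples: number) {
  const durations: number[] = []
  for (let i = 0; i < samples; i++) {
    const start = performance.now()
    fn()
    durations.push((performance.now() - start) / 1000) // milliseconds -> seconds
  }
  // Median: middle value of the sorted durations (mean of the two middle values if even).
  const sorted = [...durations].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  const median = sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
  const avg = durations.reduce((sum, d) => sum + d, 0) / durations.length
  return { avg, median, samples: durations.length }
}

// Dummy workload standing in for a call such as readCsv on a generated file.
const stats = measure(() => {
  let acc = 0
  for (let i = 0; i < 10_000; i++) acc += i
}, 64)
```

A dedicated harness adds warm-up runs and error margins on top of this, which is why the tables report a ± value next to each latency.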
Linting and Formatting
You can run formatting for TypeScript, TOML and Rust code with one command:
yarn format
You can also run the individual formatting steps:
yarn format:rs
yarn format:rs-check # checks formatting, without modifying files
yarn format:prettier
yarn format:toml
For linting, you need to run both oxlint and clippy:
yarn lint # oxlint
yarn clippy # clippy
Release package
Only necessary for maintainers
NOTE: Ensure you have set your NPM_TOKEN in the GitHub project settings.
When you want to release the package:
yarn version [<newversion> | major | minor | patch | premajor | preminor | prepatch | prerelease [--preid=<prerelease-id>] | from-git]
git push
GitHub Actions will do the rest for you.
WARN: Don't run npm publish manually.
License
MIT
