krakens

v1.9.0

Published

2 years ago

A reactive & declarative dataframe implemented in pure JavaScript using RxJS.

brianbuccaneer

dataframe statistics data science data analysis analytics rxjs observables reactive

Krakens (alpha)

Note: This is a work in progress. It's not recommended for public use yet. You can expect a stable version to arrive in July 2019.

Krakens is a simple, declarative and reactive dataframe built in pure JavaScript using RxJS as part of the Swashbuckler machine learning toolchain.

Declarative and platform-agnostic. Krakens allows data-driven applications to declare what data streams they want and implement business logic around those without worrying about where the dataframe stores its data or how the dataframe runs its computations. You can run the exact same data pipelines and operations on input data regardless of whether it is stored in memory, in a CSV file or 1000 miles away in a MongoDB cluster. Krakens makes it possible to offload many calculations to an underlying computational engine (like a MongoDB cluster or GraphQL API), but it also supports in-memory computations.

Built for analysts. Krakens is designed to suit the needs of data analysts and machine learning engineers. It provides helpful functions to support data analysis tasks like data exploration, queries, descriptive statistics and calculations across columns.

Everything is an Observable. Every Krakens function returns an RxJS observable. Columns are Observable streams. Rows are also Observable streams. All of the normal RxJS operators can be applied on columns and rows of data since they are simply RxJS Observables.

Universal. Krakens works with browsers, Node.js, React Native, Electron and IoT devices.

Reactive. Krakens dataframes stream their data from a source. The data source can be static (like an in-memory matrix) but it doesn't need to be since Krakens can also stream data asynchronously on-demand. This allows users to run incremental analysis as additional rows are streamed into the dataframe.
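To make incremental analysis concrete, here is a minimal sketch of a running mean that is updated as each row arrives. It does not use any Krakens API: `runningMean` is a hypothetical reducer of the kind you could pass to the RxJS `scan` operator, the rows are made-up sample data, and a plain loop stands in for the stream so that the snippet has no dependencies.

```javascript
// Hypothetical reducer (not part of Krakens): folds one more row into the
// running statistics without re-reading earlier rows.
function runningMean(state, row) {
  const count = state.count + 1;
  // Incremental mean update: newMean = oldMean + (x - oldMean) / n
  const mean = state.mean + (row.booty - state.mean) / count;
  return { count, mean };
}

// Made-up sample rows, arriving one at a time.
const incomingRows = [
  { name: 'Blackbeard', booty: 100 },
  { name: 'Sparrow', booty: 5 },
  { name: 'Hook', booty: 45 },
];

let state = { count: 0, mean: 0 };
const snapshots = [];
for (const row of incomingRows) {
  state = runningMean(state, row);
  snapshots.push(state.mean); // the mean is available after every row
}

console.log(snapshots);
```

With RxJS, the same reducer could be applied to a stream of rows as `row$.pipe(scan(runningMean, { count: 0, mean: 0 }))`, which emits an updated result for every row instead of waiting for the stream to complete.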

Composable. Since every column in a Krakens dataframe is simply an RxJS Observable, Krakens dataframes can be composed from any data source or multiple data sources. For example, if you want to combine certain fields from a collection in MongoDB with additional calculated fields stored in memory or a CSV file, Krakens can do this. The only requirement is that the rows need to be loaded into Krakens in the same order (at the same index).
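The same-index requirement can be illustrated with a small dependency-free sketch. `combineByIndex` is a hypothetical helper, not a Krakens function, and the field names are made up. It pairs row i of one source with row i of another, which is also how the RxJS `zip` operator pairs emissions from multiple observables.

```javascript
// Hypothetical helper (not part of Krakens): merges two row lists by position.
function combineByIndex(leftRows, rightRows) {
  if (leftRows.length !== rightRows.length) {
    throw new Error('Sources must contain the same number of rows');
  }
  // Row i of the result holds the fields of both sources at index i.
  return leftRows.map((row, i) => ({ ...row, ...rightRows[i] }));
}

// Fields that might come from a database collection.
const fromDatabase = [{ name: 'Blackbeard' }, { name: 'Sparrow' }];
// Calculated fields that might live in memory or in a CSV file.
const fromMemory = [{ bootyRank: 1 }, { bootyRank: 2 }];

const combined = combineByIndex(fromDatabase, fromMemory);
console.log(combined);
// If either source were sorted differently, bootyRank would be attached to
// the wrong pirate, which is why the row order of all sources must match.
```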

Installation

npm i --save krakens
yarn add krakens

Create a Dataframe

From an Array (containing rows) or Observable (containing the rows):

import { create } from 'krakens';
import { from } from 'rxjs';

const pirates = [
  {name: 'Blackbeard', booty: 100},
  {name: 'Sparrow', booty: 5},
  {name: 'Captain Crunch', booty: 10000000},
];

// Create a dataframe from an array (containing the rows):
const dfFromArray$ = create(pirates);

// Create a dataframe from an observable containing the rows:
const pirate$ = from(pirates);
const dfFromObservable$ = create(pirate$);

From a remote or local CSV file:

import { create } from 'krakens-csv';

// from local file
const dfFromLocalCsv$ = create('./pirates.csv'); 
// Remote files work, too:
const dfFromRemoteCsv$ = create('https://datasets.buccaneer.ai/pirates.csv');
// Gzipping remote files is a good idea. Krakens supports it:
const gzippedCsv$ = create('https://datasets.buccaneer.ai/pirates.csv.gz');
// AWS S3 is wildly popular so Krakens handles S3 CSV files out of the box:
const dfFromS3$ = create('s3://mybucket/pirates.csv');

From a MongoDB collection:

import { create } from 'krakens-mongo';

const dfFromMongoCollection$ = create({
  mongoUrl: process.env.MONGO_URL,
  collection: 'pirates',
});

From custom data sources:

You can often just pull all the data into an observable and then create a Krakens dataframe. Here's an example that pulls data from a RESTful HTTP endpoint:

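import { from } from 'rxjs';
import { mergeMap } from 'rxjs/operators';
import superagent from 'superagent'; // install an HTTP client
import { create } from 'krakens';

// Stream data from an HTTP API. A superagent request is thenable, so from()
// turns it into an observable that emits the response.
const httpResponse = superagent.get('https://api.buccaner.ai/pirates');
// superagent exposes the parsed JSON payload on response.body
const row$ = from(httpResponse).pipe(
  mergeMap(response => from(response.body.data))
);
const df$ = create(row$);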

If you want your Krakens client to fetch only what it needs or delegate computations to another software system, it's easy to implement your own Krakens client. Fear not! It really isn't hard!

Using Operators

Krakens implements a handful of pipeable operators that play nice with databases and tabular data. Operators are declarative, meaning that you declare what transformations you want the dataframe to run without worrying about how the data source implements the operator.

Note: Every operator expects its source stream to emit a dataframe. If you pipe an operator onto a stream that does not emit a dataframe, your pipeline will error out. For example:

import { of, from } from 'rxjs';
import { mergeMap } from 'rxjs/operators';
import { create } from 'krakens';
import { count } from 'krakens/operators';

const nauticalFriends = [{name: 'Ahab'}, {name: 'Nemo'}, {name: 'Ariel'}];
const df$ = create(nauticalFriends);

// Yarrr. This works.
df$.pipe(count());

// You can also pipe transformations into krakens operators, as long as they 
// return a dataframe:
of({foo: 'bar'}).pipe(
  mergeMap(() => df$),
  count()
);

// No, Matey! This will fail.
of('notadataframe').pipe(count()); // triggers the onError method

`count()`

Returns an observable containing the row count for the dataframe.

import { create } from 'krakens';
import { count } from 'krakens/operators';

const df$ = create([
  {name: 'Blackbeard', booty: 100},
  {name: 'Sparrow', booty: 5},
  {name: 'Captain Crunch', booty: 10000000},
]);
const allRowCount$ = df$.pipe(
  count()
);
const filteredRowCount$ = df$.pipe(
  count({name: {$eq: 'Blackbeard'}})
);

`createColumn()`

import { create } from 'krakens';
import { createColumn } from 'krakens/operators';

const pirates = [
  {name: 'Blackbeard', booty: 100},
  {name: 'Sparrow', booty: 5},
  {name: 'Captain Crunch', booty: 10000000},
];
const df$ = create(pirates);

`cols(columnNames<Array><String,Integer>)`

Given an array of column names or column indexes (Strings or integers), returns an observable containing the data from those columns.

import { create } from 'krakens';
import { cols } from 'krakens/operators';

const df$ = create([
  {name: 'Blackbeard', booty: 100},
  {name: 'Sparrow', booty: 5},
  {name: 'Captain Crunch', booty: 10000000},
]);

const bootyCol$ = df$.pipe(
  cols(['booty'])
);

const columnsByIndex$ = df$.pipe(
  cols([1]) // this is the second key, which in this case is "booty"
);

const columnsByColumnName$ = df$.pipe(
  cols(['drinksRum', 'booty'])
);
const rowWithColIndex$ = df$.pipe(
  cols([0, 2])
);

`mapCols(mapper<Function>, options<{fields<Array><String,Number>, colName<String>}>)`

Creates a new column by mapping one or more existing columns to a new value, then pushes the new dataframe (containing the new column) to the dataframe's Subject.

import { create } from 'krakens';
import { mapCols } from 'krakens/operators';

const df$ = create([
  {name: 'Blackbeard', booty: 100},
  {name: 'Sparrow', booty: 5},
  {name: 'Captain Crunch', booty: 10000000},
]);

const mappedCol$ = df$.pipe(
  mapCols(
    row => row.booty * row.booty, 
    {fields: ['booty'], colName: 'bootySquared'}
  )
);
mappedCol$.subscribe(); // emits the value of the new 'bootySquared' column

// Henceforth, df$ will emit a dataframe with the new column "bootySquared", 
// which is saved in memory.

`where(query<Query>)`

import { create } from 'krakens';
import { where } from 'krakens/operators';

const pirates = [
  {name: 'Blackbeard', booty: 100},
  {name: 'Sparrow', booty: 5},
  {name: 'Captain Crunch', booty: 10000000},
];
const df$ = create(pirates);

// where() supports the following operators, similar to MongoDB:
const results0$ = df$.pipe(where({booty: {$eq: 100}})); // equal to 100
const results1$ = df$.pipe(where({booty: {$ne: 5}})); // not equal to 5
const results2$ = df$.pipe(where({booty: {$in: [5, 100]}})); // equal to 5 or 100
const results3$ = df$.pipe(where({booty: {$nin: [5, 100]}})); // equal to neither 5 nor 100
const results4$ = df$.pipe(where({booty: {$exists: 1}})); // has a value for booty field
const results5$ = df$.pipe(where({booty: {$exists: 0}})); // has no value for booty field
const results6$ = df$.pipe(where({booty: {$gt: 100}})); // greater than 100
const results7$ = df$.pipe(where({booty: {$gte: 100}})); // greater than or equal to 100
const results8$ = df$.pipe(where({booty: {$lt: 100}})); // less than 100
const results9$ = df$.pipe(where({booty: {$lte: 100}})); // less than or equal to 100

// operators can be combined:
const results10$ = df$.pipe(where({booty: {$gt: 5}, drinksRum: {$eq: true}}));
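To make the semantics of these query operators concrete, here is a minimal in-memory sketch of how a single row could be tested against a query. `matches` and `comparators` are hypothetical names and the `crew` rows are sample data; this is not Krakens' implementation, and treating `$exists` as a check against `undefined` is an assumption.

```javascript
// Hypothetical comparator table keyed by query operator (not Krakens internals).
const comparators = {
  $eq: (value, arg) => value === arg,
  $ne: (value, arg) => value !== arg,
  $in: (value, arg) => arg.includes(value),
  $nin: (value, arg) => !arg.includes(value),
  // Assumption: a field "exists" when its value is not undefined.
  $exists: (value, arg) => (value !== undefined) === Boolean(arg),
  $gt: (value, arg) => value > arg,
  $gte: (value, arg) => value >= arg,
  $lt: (value, arg) => value < arg,
  $lte: (value, arg) => value <= arg,
};

// A row matches when every operator on every queried field passes.
function matches(row, query) {
  return Object.entries(query).every(([field, conditions]) =>
    Object.entries(conditions).every(([op, arg]) => {
      const compare = comparators[op];
      if (!compare) throw new Error(`Unsupported operator: ${op}`);
      return compare(row[field], arg);
    })
  );
}

// Sample rows with an extra drinksRum field for the combined query.
const crew = [
  { name: 'Blackbeard', booty: 100, drinksRum: true },
  { name: 'Sparrow', booty: 5, drinksRum: true },
  { name: 'Captain Crunch', booty: 10000000, drinksRum: false },
];

const query = { booty: { $gt: 5 }, drinksRum: { $eq: true } };
const rumDrinkersWithBooty = crew.filter(row => matches(row, query));
console.log(rumDrinkersWithBooty.map(row => row.name)); // only Blackbeard passes both conditions
```

Conditions on different fields are combined with a logical AND in this sketch, so a row must satisfy all of them to be kept.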

License

MIT
