apr144-hclust

v0.0.1

Agglomerative hierarchical clustering in JavaScript

hclust

Agglomerative hierarchical clustering in JavaScript

Inspired by the MIT-licensed hcluster.js by @ChrisPolis. See the comparison of the two below.


Usage

Browser

<script src="hclust.min.js"></script>
<script>
  hclust.clusterData(...);
  hclust.euclideanDistance(...);
  hclust.avgDistance(...);
</script>

Node

npm install @greenelab/hclust

or

yarn add @greenelab/hclust

then

import { clusterData, euclideanDistance, avgDistance } from '@greenelab/hclust';

clusterData({ data, key, distance, linkage, onProgress })

Parameters

data

The data you want to cluster, in the format:

[
  ...
  [ ... 1, 2, 3 ...],
  [ ... 1, 2, 3 ...],
  [ ... 1, 2, 3 ...],
  ...
]

or, if the key parameter is specified:

[
  ...
  { someKey: [ ... 1, 2, 3 ...] },
  { someKey: [ ... 1, 2, 3 ...] },
  { someKey: [ ... 1, 2, 3 ...] },
  ...
]

The entries in the outer array can be considered the rows and the entries within each row array can be considered the cols. Each row should have the same number of cols.

Default value: []

key

A string key that specifies which values to extract from the data array. If omitted, data is assumed to be an array of arrays. If specified, data is assumed to be an array of objects, each with a property of that name containing the values for that row.

Default value: ''
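
As a rough illustration of the two accepted shapes, the sketch below uses a hypothetical helper, extractRows, which is not part of this package. It only mirrors the behavior described above, so you can see which values end up being clustered; the property name values is made up for the example.

```javascript
// Hypothetical helper (not part of hclust): mirrors the documented behavior
// of the key parameter to show which values end up being clustered.
function extractRows(data = [], key = '') {
  // With no key, each entry is already a row of numbers.
  // With a key, each entry is an object whose key holds the row.
  return key ? data.map((entry) => entry[key]) : data;
}

// Array-of-arrays form (no key needed).
const plainData = [
  [1, 2, 3],
  [4, 5, 6],
];

// Array-of-objects form; 'values' is an illustrative property name.
const keyedData = [
  { name: 'a', values: [1, 2, 3] },
  { name: 'b', values: [4, 5, 6] },
];

const rowsFromPlain = extractRows(plainData);
const rowsFromKeyed = extractRows(keyedData, 'values');
```

Both calls yield the same rows, so clusterData({ data: plainData }) and clusterData({ data: keyedData, key: 'values' }) should describe the same clustering problem.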

distance

A function to calculate the distance between two equal-dimension vectors, used in calculating the distance matrix, in the format:

function (arrayA, arrayB) { return someNumber; }

The function receives two equal-length arrays of numbers (ints or floats) and should return a number (int or float).

Default value: euclideanDistance from this hclust package
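
For example, a custom distance function might look like the sketch below. Manhattan distance is used purely as an illustration; the name manhattanDistance is not something this package provides.

```javascript
// Manhattan (city-block) distance: the sum of absolute coordinate differences.
// Same shape as the distance parameter: two equal-length arrays in, number out.
function manhattanDistance(arrayA, arrayB) {
  let sum = 0;
  for (let i = 0; i < arrayA.length; i++) {
    sum += Math.abs(arrayA[i] - arrayB[i]);
  }
  return sum;
}
```

It would then be passed as clusterData({ data, distance: manhattanDistance }).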

linkage

A function to calculate the distance between pairs of clusters based on a distance matrix, used in determining linkage criterion, in the format:

function (arrayA, arrayB, distanceMatrix) { return someNumber; }

The function receives two sets of indexes and the distance matrix computed between each datum and every other datum. The function should return a number (int or float).

Default value: avgDistance from this hclust package
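
A custom linkage function could be sketched as follows. This one implements complete linkage (the largest pairwise distance), which this package does not ship. It assumes the two sets arrive as arrays of row indexes into the distance matrix; the name maxLinkage is illustrative.

```javascript
// Complete linkage: the distance between two clusters is the largest
// distance between any pair of their members. Assumes setA and setB are
// arrays of row indexes into distanceMatrix.
function maxLinkage(setA, setB, distanceMatrix) {
  let max = -Infinity;
  for (const a of setA) {
    for (const b of setB) {
      max = Math.max(max, distanceMatrix[a][b]);
    }
  }
  return max;
}
```

It would be passed as clusterData({ data, linkage: maxLinkage }).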

onProgress

A function that is called several times throughout clustering, and is provided the current progress through the clustering, in the format:

function (progress) { }

The function receives the fraction of the work completed, as a number between 0 and 1.

Default value: an internal function that logs the progress with console.log

Note: postMessage is called in the same places as onProgress, if the script is running as a web worker.
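
One possible callback is sketched below. It converts the 0 to 1 fraction into a whole-number percentage and skips repeated values. The helper name makeProgressHandler and the simulated calls are illustrative only; in real use the calls would come from clusterData.

```javascript
// Build an onProgress callback that converts the 0..1 fraction into a
// whole-number percentage and hands it to a reporter function.
function makeProgressHandler(report) {
  let lastPercent = -1;
  return function onProgress(progress) {
    const percent = Math.round(progress * 100);
    // Skip repeated values so the UI is not updated needlessly.
    if (percent !== lastPercent) {
      lastPercent = percent;
      report(percent);
    }
  };
}

const seenPercents = [];
const onProgress = makeProgressHandler((percent) => seenPercents.push(percent));

// Simulated calls, standing in for the ones clusterData would make.
[0, 0.25, 0.25, 0.5, 1].forEach((fraction) => onProgress(fraction));
```

In a page, the reporter could update a progress bar instead of pushing to an array.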

Returns

const { clusters, distances, order, clustersGivenK } = clusterData(...);

clusters

The resulting cluster tree, in the format:

{
  indexes: [ ... Number, Number, Number ... ],
  height: Number,
  children: [ ... {}, {}, {} ... ]
}
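
As a sketch of how such a tree might be consumed, the hypothetical helper below cuts it at a height threshold and returns the index groups beneath the cut. It assumes leaves have an empty or missing children array, which the format above does not state explicitly. The tree is hand-built for illustration and is not real library output.

```javascript
// Hypothetical helper: cut a cluster tree at a height threshold and return
// the index groups below the cut. Assumes each node follows the documented
// shape and that leaves have an empty or missing children array.
function cutTree(node, maxHeight) {
  const children = node.children || [];
  if (node.height <= maxHeight || children.length === 0) {
    return [node.indexes];
  }
  return children.flatMap((child) => cutTree(child, maxHeight));
}

// Hand-built tree in the documented shape (not real library output).
const exampleTree = {
  indexes: [0, 1, 2],
  height: 5,
  children: [
    {
      indexes: [0, 1],
      height: 1,
      children: [
        { indexes: [0], height: 0, children: [] },
        { indexes: [1], height: 0, children: [] },
      ],
    },
    { indexes: [2], height: 0, children: [] },
  ],
};

const groupsBelowCut = cutTree(exampleTree, 2);
```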

distances

The computed distance matrix, in the format:

[
  ...
  [ ... Number, Number, Number ...],
  [ ... Number, Number, Number ...],
  [ ... Number, Number, Number ...],
  ...
]

order

The new order of the data, in terms of original data array indexes, in the format:

[ ... Number, Number, Number ... ]

Equivalent to clusters.indexes and to the single group in clustersGivenK[1].
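
Because the input is not modified, applying the new order is left to you. A minimal sketch, using made-up values for both the rows and the order:

```javascript
// Made-up rows and order, for illustration only.
const originalRows = [
  [5, 5],
  [0, 0],
  [5, 6],
];
const exampleOrder = [1, 0, 2];

// Build a reordered copy; originalRows itself is left untouched.
const reorderedRows = exampleOrder.map((index) => originalRows[index]);
```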

clustersGivenK

A list of tree slices in terms of original data array indexes, where the position in the outer list equals K (the number of clusters in that slice), in the format:

[
  [], // K = 0
  [ [] ], // K = 1
  [ [], [] ], // K = 2
  [ [], [], [] ], // K = 3
  [ [], [], [], [] ], // K = 4
  [ [], [], [], [], [] ] // K = 5
  ...
]
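
A common follow-up is to turn one slice into a cluster label per row. The helper below, labelsForK, is hypothetical, and the clustersGivenK value is hand-written in the documented shape rather than produced by the library.

```javascript
// Hypothetical helper: turn the K-th slice into a cluster label per row.
function labelsForK(clustersGivenK, k) {
  const labels = [];
  clustersGivenK[k].forEach((indexes, clusterNumber) => {
    for (const index of indexes) {
      labels[index] = clusterNumber;
    }
  });
  return labels;
}

// Hand-written example in the documented shape (three rows of data).
const exampleClustersGivenK = [
  [], // K = 0
  [[1, 0, 2]], // K = 1
  [[1], [0, 2]], // K = 2
  [[1], [0], [2]], // K = 3
];

// labelsForTwo[i] is the cluster number assigned to row i when K = 2.
const labelsForTwo = labelsForK(exampleClustersGivenK, 2);
```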

euclideanDistance(arrayA, arrayB)

Calculates the euclidean distance between two equal-dimension vectors.
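
For reference, the calculation can be sketched as below. This is a plain restatement of the Euclidean distance formula under the name euclideanSketch, not the package's actual source.

```javascript
// Euclidean distance: the square root of the sum of squared differences.
function euclideanSketch(arrayA, arrayB) {
  let sumOfSquares = 0;
  for (let i = 0; i < arrayA.length; i++) {
    const difference = arrayA[i] - arrayB[i];
    sumOfSquares += difference * difference;
  }
  return Math.sqrt(sumOfSquares);
}
```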


avgDistance(arrayA, arrayB, distanceMatrix)

Calculates the average distance between pairs of clusters based on a distance matrix.
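
Average linkage can be sketched as the mean of all pairwise distances between members of the two clusters. The version below, averageLinkageSketch, is an illustration that assumes both sets are arrays of row indexes into the distance matrix; it is not the package's actual source.

```javascript
// Average linkage: the mean of the distances between every member of setA
// and every member of setB, looked up in the precomputed distance matrix.
function averageLinkageSketch(setA, setB, distanceMatrix) {
  let total = 0;
  for (const a of setA) {
    for (const b of setB) {
      total += distanceMatrix[a][b];
    }
  }
  return total / (setA.length * setB.length);
}
```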


Comparison with hcluster.js

  • This package does not duplicate items from the original dataset in the results. Results are given in terms of indexes, either with respect to the original dataset or the distance matrix.
  • This package uses more modern JavaScript syntax and practices to make the code cleaner and simpler.
  • This package provides an onProgress callback and calls postMessage for use in web workers. Because clustering can take a long time with large data sets, you may want to run it as a web worker so the browser doesn't freeze for a long time, and you may need a callback so you can give users visual feedback on its progress.
  • This package makes some performance optimizations, such as removing unnecessary loops through big sets. It has been tested on various operating systems (Windows, Mac, Linux, iOS, Android), devices (desktop, laptop, mobile), browsers (Chrome, Firefox, Safari), contexts (main thread, web worker), and hosting locations (local, online). The results vary widely and are likely sensitive to the specifics of hardware, CPU cores, browser implementation, etc. In general, though, this package is at least as performant as hcluster.js on average, and often more so. Chrome seems to see the largest gains (up to 10x when the number of rows is high), while Firefox seems to see no gains.
  • This package does not touch the input data object, whereas the hcluster.js package does. D3 often expects you to mutate data objects directly, which is now typically considered bad practice in JavaScript. Instead, this package returns the useful data from the clustering algorithm (including the distance matrix), and allows you to mutate or not mutate the data object depending on your needs.
  • This package leaves out the minDistance and maxDistance functions that are built into hcluster.js because, per this reference, they are not as effective as average linkage (avgDistance in this package).

Making changes to the library

  1. Install Node
  2. Install Yarn
  3. Clone this repo and navigate to it in your command terminal
  4. Run yarn install to install this package's dependencies
  5. Make desired changes to ./src/hclust.js
  6. Run yarn test to automatically rebuild the library and run the test suite
  7. Run yarn build to just rebuild the library, and output the compiled contents to ./build/hclust.min.js
  8. Commit changes to repo if necessary. Make sure to run the build command before committing; it won't happen automatically.

Similar libraries

  • cmpolis/hcluster.js
  • harthur/clustering
  • mljs/hclust
  • math-utils/hierarchical-clustering


Further reading

The AGNES (AGglomerative NESting) method: start with every item as its own node, then repeatedly merge the two nodes that have the least dissimilarity.
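
The idea can be sketched in a few lines. The code below is a naive illustration of agglomerative clustering with Euclidean distance and average linkage. It is not this package's implementation, it rescans every pair of clusters on each merge (so it is slow on large inputs), and the function names are made up for the example.

```javascript
// Euclidean distance between two equal-length vectors.
function distanceBetween(a, b) {
  return Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
}

// Average pairwise distance between two sets of row indexes.
function averageLinkage(setA, setB, matrix) {
  let total = 0;
  for (const a of setA) {
    for (const b of setB) {
      total += matrix[a][b];
    }
  }
  return total / (setA.length * setB.length);
}

// Naive AGNES: start with one cluster per row, then repeatedly merge the
// two clusters with the smallest linkage distance until one remains.
function agnes(rows) {
  const matrix = rows.map((a) => rows.map((b) => distanceBetween(a, b)));
  let clusters = rows.map((_, index) => ({
    indexes: [index],
    height: 0,
    children: [],
  }));

  while (clusters.length > 1) {
    // Find the closest pair of clusters.
    let best = { i: 0, j: 1, distance: Infinity };
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        const distance = averageLinkage(
          clusters[i].indexes,
          clusters[j].indexes,
          matrix
        );
        if (distance < best.distance) best = { i, j, distance };
      }
    }

    // Replace the pair with a single merged node.
    const merged = {
      indexes: [...clusters[best.i].indexes, ...clusters[best.j].indexes],
      height: best.distance,
      children: [clusters[best.i], clusters[best.j]],
    };
    clusters = clusters.filter(
      (_, index) => index !== best.i && index !== best.j
    );
    clusters.push(merged);
  }

  return clusters[0];
}

// Two nearby points and one far away: the nearby pair merges first.
const agnesRoot = agnes([
  [0, 0],
  [0, 1],
  [10, 10],
]);
```

The returned node uses the same indexes, height, and children fields as the clusters format described earlier, purely for readability.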