clusternova

v1.5.5

Published

6 months ago

HDBSCAN clustering algorithm implementation in TypeScript

0High
0Medium
0Low

led101

hdbscan dbscan clustering machine-learning javascript typescript nodejs data-science ml ai

Clusternova - HDBSCAN in Typescript

HDBSCAN clustering in TypeScript. Zero dependencies.

Unlike k-means, HDBSCAN automatically determines the number of clusters and identifies outliers. It handles clusters of varying densities and shapes.

Based on the paper "Density-Based Clustering Based on Hierarchical Density Estimates" by Campello, Moulavi, and Sander. Implementation draws from the Python HDBSCAN library and JSAT.

Install

npm install clusternova

Usage

import HDBSCAN, { findCentralElements, cosine, VectorPoint } from "clusternova";
// manhattan, euclidean are also available imports

// Extend VectorPoint with your additional fields
interface MyDataPoint extends VectorPoint {
  title: string;        // Optional fields
  timestamp: Date;      // that you might need
}

const data: MyDataPoint[] = [
  {
    id: "1",
    vector: [1, 2, 3],
    title: "First point",
    timestamp: new Date()
  },
  {
    id: "2",
    vector: [4, 5, 6],
    title: "Second point",
    timestamp: new Date()
  }
  // ... more points
];

const hdbscan = new HDBSCAN(data, 3, cosine); // minimum points = 3, cosine distance metric
const { clusters, outliers } = hdbscan.run();
// Types of returned data:
// clusters: MyDataPoint[][] - Array of clusters, each containing array of your data points
// outliers: MyDataPoint[] - Array of outlier points

// Example output:
{
    clusters: [
    [
        { id: "1", vector: [1, 2, 3], title: "First point", timestamp: "2024-01-01..." },
        { id: "2", vector: [4, 5, 6], title: "Second point", timestamp: "2024-01-01..." }
    ],
    // ... more clusters
    ],
    outliers: [
    { id: "7", vector: [7, 8, 9], title: "Outlier point", timestamp: "2024-01-01..." }
    // ... more outliers
    ]
}

// Find central elements of each cluster - returns array of elements with distance field (distance from the center of the cluster)
clusters.forEach(cluster => {
  const centralElements = findCentralElements(cluster, 3, cosine);
  // Returns: (MyDataPoint & { distance: number })[]
  // Example:
  // [
  //   { id: "1", vector: [1,2,3], title: "First point", distance: 0.2 },
  //   { id: "2", vector: [4,5,6], title: "Second point", distance: 0.3 }
  // ]
});

Demo

Tweet clustering demo - clusters tweet embeddings and summarizes each cluster with GPT-3.5 (bring your own API key, source in examples/)
Citibike Stories - clusters hourly Citibike trips

Notes

Despite theoretical limitations of clustering in high dimensions, HDBSCAN works well with text embeddings using cosine distance.

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords