# hdbscan-wasm

v0.1.0

HDBSCAN clustering in Rust/WASM — ported from scikit-learn, optimized for 2D Euclidean distance.
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering algorithm, implemented in Rust and compiled to WebAssembly.
## Origin

This is a direct port of scikit-learn's HDBSCAN implementation to Rust/WASM. The core algorithm was ported from:

- `sklearn/cluster/_hdbscan/_tree.pyx` — condensed tree construction, cluster stability, Excess of Mass (EOM) selection, Union-Find label assignment
- `sklearn/cluster/_hdbscan/_reachability.pyx` — core distance and mutual reachability distance computation
The porting was done by reading the scikit-learn source code line by line and translating each step to Rust. This is not a reimplementation from the paper — it follows scikit-learn's specific algorithmic choices and data structures.
## What it does
HDBSCAN finds clusters of varying density in a point cloud. Unlike DBSCAN, it does not require an eps parameter — it automatically discovers clusters at multiple density scales and selects the most stable ones.
Algorithm steps:

- Core distances — for each point, compute the distance to its k-th nearest neighbor (KD-tree, O(n log n))
- Mutual reachability distance — `max(core_dist[a], core_dist[b], euclidean(a, b))`
- Minimum spanning tree — Prim's algorithm on the mutual reachability graph (O(n²) time, O(n) memory — no distance matrix)
- Condensed tree — walk the single-linkage dendrogram, splitting at nodes with fewer than `min_cluster_size` points
- Cluster stability — compute stability scores via Excess of Mass (EOM)
- Label assignment — Union-Find over the condensed tree to assign each point to a cluster or noise (-1)
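The first two steps can be sketched in standalone TypeScript. This is a brute-force illustration for clarity, not the crate's KD-tree implementation, and it counts only the *other* points as neighbors (scikit-learn's k-NN query includes the point itself, so indexing conventions differ slightly):

```typescript
// Sketch of steps 1–2: core distances and mutual reachability.
// Brute-force O(n²) for clarity; the library uses a KD-tree.
function euclidean(ax: number, ay: number, bx: number, by: number): number {
  return Math.hypot(ax - bx, ay - by);
}

// Core distance: distance from point i to its k-th nearest other point.
function coreDistances(xs: number[], ys: number[], k: number): number[] {
  const n = xs.length;
  const core: number[] = [];
  for (let i = 0; i < n; i++) {
    const dists: number[] = [];
    for (let j = 0; j < n; j++) {
      if (j !== i) dists.push(euclidean(xs[i], ys[i], xs[j], ys[j]));
    }
    dists.sort((a, b) => a - b);
    core.push(dists[k - 1]); // k-th nearest (1-based)
  }
  return core;
}

// Mutual reachability distance between points a and b.
function mutualReachability(
  xs: number[], ys: number[], core: number[], a: number, b: number,
): number {
  return Math.max(core[a], core[b], euclidean(xs[a], ys[a], xs[b], ys[b]));
}
```

Mutual reachability pushes sparse points apart: a pair is never closer than either point's core distance, which is what lets the later MST step separate density levels.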
## Installation

```sh
npm install hdbscan-wasm
```

## Usage

### Basic
```ts
import init, { hdbscan } from 'hdbscan-wasm';

// Initialize WASM (call once)
await init();

const xs = new Float64Array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2]);
const ys = new Float64Array([0.0, 0.1, 0.0, 3.0, 3.1, 3.0]);

const labels = hdbscan(xs, ys, 3, 3);
// Int32Array: [0, 0, 0, 1, 1, 1]
// -1 means noise
```

### Web Worker (recommended for large datasets)
```ts
// worker.ts
import init, { hdbscan } from 'hdbscan-wasm';

self.onmessage = async (e) => {
  await init();
  const { xs, ys, minClusterSize, minSamples } = e.data;
  const labels = hdbscan(
    new Float64Array(xs),
    new Float64Array(ys),
    minClusterSize,
    minSamples,
  );
  self.postMessage({ labels: Array.from(labels) });
};
```

### Next.js / Vite / webpack
Most modern bundlers handle WASM automatically. For Next.js, ensure the component using hdbscan-wasm is loaded with dynamic(() => import(...), { ssr: false }) since WASM cannot run during server-side rendering.
## API

Single function, 2D Euclidean distance only:

```ts
function hdbscan(
  xs: Float64Array,         // X coordinates (length n)
  ys: Float64Array,         // Y coordinates (length n)
  min_cluster_size: number, // minimum points to form a cluster
  min_samples: number,      // k for core distance (controls density smoothing)
): Int32Array;              // cluster labels (length n, -1 = noise)
```

### Parameters
| Parameter | Description | Typical range |
|---|---|---|
| min_cluster_size | Minimum number of points to form a cluster. Larger values produce fewer, bigger clusters. | 5–50 |
| min_samples | Number of neighbors for core distance estimation. Larger values are more conservative (more noise). | 1–20 |
Relationship: min_samples controls how "dense" a region must be to be considered a core point. min_cluster_size controls how many core points are needed to form a cluster. They are independent parameters.
## Performance
- Core distance: O(n log n) via KD-tree k-NN queries
- MST: O(n²) via Prim's algorithm — computed on-the-fly without building a distance matrix (O(n) memory)
- Condensed tree + labelling: O(n)
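The labelling pass relies on a Union-Find (disjoint set) structure. A minimal sketch, not the crate's internal representation:

```typescript
// Union-Find with path halving, as used conceptually in the O(n)
// labelling step: points merged under the same condensed-tree cluster
// end up in the same set.
class UnionFind {
  private parent: number[];
  constructor(n: number) {
    this.parent = Array.from({ length: n }, (_, i) => i);
  }
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]]; // path halving
      x = this.parent[x];
    }
    return x;
  }
  union(a: number, b: number): void {
    this.parent[this.find(a)] = this.find(b);
  }
}
```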
Approximate benchmarks (single-threaded WASM):

| Points | Time |
|---|---|
| 1,000 | ~50 ms |
| 5,000 | ~500 ms |
| 10,000 | ~2 s |
The MST step is the bottleneck (O(n²)). For >20,000 points, consider downsampling (e.g., grid coarsening).
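The on-the-fly Prim step can be sketched as follows. This is a standalone illustration, not the crate's Rust internals; `dist` stands in for the mutual reachability distance, computed per edge so no O(n²) matrix is ever stored:

```typescript
// O(n²)-time, O(n)-memory Prim MST: only the cheapest known edge into
// the growing tree is kept per vertex. Returns [from, to, weight] edges.
function primMst(
  n: number,
  dist: (a: number, b: number) => number,
): Array<[number, number, number]> {
  const inTree = new Array<boolean>(n).fill(false);
  const best = new Array<number>(n).fill(Infinity); // cheapest edge into tree
  const parent = new Array<number>(n).fill(-1);
  const edges: Array<[number, number, number]> = [];
  let current = 0;
  inTree[0] = true;
  for (let added = 1; added < n; added++) {
    let next = -1;
    for (let v = 0; v < n; v++) {
      if (inTree[v]) continue;
      // Relax the edge from the most recently added vertex...
      const d = dist(current, v);
      if (d < best[v]) { best[v] = d; parent[v] = current; }
      // ...and track the globally cheapest frontier vertex.
      if (next === -1 || best[v] < best[next]) next = v;
    }
    edges.push([parent[next], next, best[next]]);
    inTree[next] = true;
    current = next;
  }
  return edges;
}
```

Each of the n − 1 iterations scans all vertices once, which is where the quadratic time (and the ~15,000-point practical limit) comes from.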
## Limitations

- 2D Euclidean only. The distance function is hardcoded as `sqrt((x1-x2)² + (y1-y2)²)`. No support for higher dimensions, Haversine, or custom metrics.
- No sample weights. Every point has equal weight. Uniform grids (e.g., 250m population grids) may produce fragmented clusters because density is measured by point count, not by an associated value like population.
- No prediction. Batch algorithm only — no `approximate_predict()` for new points.
- O(n²) MST. The Prim MST step scales quadratically. Practical limit is ~15,000 points in interactive use.
- Tie-breaking. When multiple MST edges have identical weights, the order may differ from scikit-learn (which uses a different MST algorithm). This can cause minor differences in cluster boundaries for ambiguous points, but the overall clustering structure is equivalent.
- No soft clustering. Returns hard labels only, no probability scores.
- EOM only. Only Excess of Mass cluster selection is supported (no leaf clustering).
## Comparison with scikit-learn

| Feature | scikit-learn | hdbscan-wasm |
|---|---|---|
| Dimensions | arbitrary | 2D only |
| Distance metrics | many (euclidean, manhattan, ...) | euclidean only |
| MST algorithm | dual-tree Boruvka (O(n log n) amortized) | Prim (O(n²)) |
| Sample weights | no | no |
| Soft clustering | yes (probabilities) | no |
| Prediction | yes (approximate_predict) | no |
| Cluster selection | EOM + leaf | EOM only |
| Runtime | Python + Cython | WASM (runs in browser) |
## Building from source

Prerequisites: Rust toolchain + wasm-pack

```sh
# Install wasm-pack
cargo install wasm-pack

# Build WASM
wasm-pack build --target bundler --release

# Run Rust tests
cargo test --release
```

## Tests
Five test cases are included in `src/lib.rs`:
- two_moons — Classic two interleaving half-circles (100 points)
- two_blobs_noise — Two Gaussian blobs with uniform noise (110 points)
- grid_clusters — Three separated grid clusters (108 points)
- small_debug — Minimal 10-point debug case
- gangnam_noise_in_hull — Real-world test with projected coordinates, verifying no noise points appear inside cluster convex hulls
```sh
cargo test --release
```

## License
BSD 3-Clause — compatible with scikit-learn's license, as this is a derived work.
