# hdbscan-wasm

v0.1.0

HDBSCAN clustering in Rust/WASM — ported from scikit-learn, optimized for 2D Euclidean distance.
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering algorithm, implemented in Rust and compiled to WebAssembly.
## Origin

This is a direct port of scikit-learn's HDBSCAN implementation to Rust/WASM. The core algorithm was ported from:

- `sklearn/cluster/_hdbscan/_tree.pyx` — condensed tree construction, cluster stability, Excess of Mass (EOM) selection, Union-Find label assignment
- `sklearn/cluster/_hdbscan/_reachability.pyx` — core distance and mutual reachability distance computation
The porting was done by reading the scikit-learn source code line by line and translating each step to Rust. This is not a reimplementation from the paper — it follows scikit-learn's specific algorithmic choices and data structures.
## What it does
HDBSCAN finds clusters of varying density in a point cloud. Unlike DBSCAN, it does not require an eps parameter — it automatically discovers clusters at multiple density scales and selects the most stable ones.
Algorithm steps:

- Core distances — for each point, compute the distance to its k-th nearest neighbor (KD-tree, O(n log n))
- Mutual reachability distance — `max(core_dist[a], core_dist[b], euclidean(a, b))`
- Minimum spanning tree — Prim's algorithm on the mutual reachability graph (O(n²) time, O(n) memory — no distance matrix)
- Condensed tree — walk the single-linkage dendrogram, splitting at nodes with fewer than `min_cluster_size` points
- Cluster stability — compute stability scores via Excess of Mass (EOM)
- Label assignment — Union-Find over the condensed tree to assign each point to a cluster or noise (-1)
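The first two steps can be sketched in standalone TypeScript. This is a brute-force illustration for clarity, not the crate's KD-tree implementation, and it counts only the *other* points as neighbors (scikit-learn's k-NN query includes the point itself, so indexing conventions differ slightly):

```typescript
// Sketch of steps 1–2: core distances and mutual reachability.
// Brute-force O(n²) for clarity; the library uses a KD-tree.
function euclidean(ax: number, ay: number, bx: number, by: number): number {
  return Math.hypot(ax - bx, ay - by);
}

// Core distance: distance from point i to its k-th nearest other point.
function coreDistances(xs: number[], ys: number[], k: number): number[] {
  const n = xs.length;
  const core: number[] = [];
  for (let i = 0; i < n; i++) {
    const dists: number[] = [];
    for (let j = 0; j < n; j++) {
      if (j !== i) dists.push(euclidean(xs[i], ys[i], xs[j], ys[j]));
    }
    dists.sort((a, b) => a - b);
    core.push(dists[k - 1]); // k-th nearest (1-based)
  }
  return core;
}

// Mutual reachability distance between points a and b.
function mutualReachability(
  xs: number[], ys: number[], core: number[], a: number, b: number,
): number {
  return Math.max(core[a], core[b], euclidean(xs[a], ys[a], xs[b], ys[b]));
}
```

Mutual reachability pushes sparse points apart: a pair is never closer than either point's core distance, which is what lets the later MST step separate density levels.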
## Installation

```sh
npm install hdbscan-wasm
```

## Usage

### Basic
```ts
import init, { hdbscan } from 'hdbscan-wasm';

// Initialize WASM (call once)
await init();

const xs = new Float64Array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2]);
const ys = new Float64Array([0.0, 0.1, 0.0, 3.0, 3.1, 3.0]);

const labels = hdbscan(xs, ys, 3, 3);
// Int32Array: [0, 0, 0, 1, 1, 1]
// -1 means noise
```

### Web Worker (recommended for large datasets)
```ts
// worker.ts
import init, { hdbscan } from 'hdbscan-wasm';

self.onmessage = async (e) => {
  await init();
  const { xs, ys, minClusterSize, minSamples } = e.data;
  const labels = hdbscan(
    new Float64Array(xs),
    new Float64Array(ys),
    minClusterSize,
    minSamples,
  );
  self.postMessage({ labels: Array.from(labels) });
};
```

### Next.js / Vite / webpack
Most modern bundlers handle WASM automatically. For Next.js, ensure the component using hdbscan-wasm is loaded with dynamic(() => import(...), { ssr: false }) since WASM cannot run during server-side rendering.
## API

Single function, 2D Euclidean distance only:

```ts
function hdbscan(
  xs: Float64Array,         // X coordinates (length n)
  ys: Float64Array,         // Y coordinates (length n)
  min_cluster_size: number, // minimum points to form a cluster
  min_samples: number,      // k for core distance (controls density smoothing)
): Int32Array;              // cluster labels (length n, -1 = noise)
```

### Parameters
| Parameter | Description | Typical range |
|---|---|---|
| min_cluster_size | Minimum number of points to form a cluster. Larger values produce fewer, bigger clusters. | 5–50 |
| min_samples | Number of neighbors for core distance estimation. Larger values are more conservative (more noise). | 1–20 |
Relationship: min_samples controls how "dense" a region must be to be considered a core point. min_cluster_size controls how many core points are needed to form a cluster. They are independent parameters.
## Performance
- Core distance: O(n log n) via KD-tree k-NN queries
- MST: O(n²) via Prim's algorithm — computed on-the-fly without building a distance matrix (O(n) memory)
- Condensed tree + labelling: O(n)
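The labelling pass relies on a Union-Find (disjoint set) structure. A minimal sketch, not the crate's internal representation:

```typescript
// Union-Find with path halving, as used conceptually in the O(n)
// labelling step: points merged under the same condensed-tree cluster
// end up in the same set.
class UnionFind {
  private parent: number[];
  constructor(n: number) {
    this.parent = Array.from({ length: n }, (_, i) => i);
  }
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]]; // path halving
      x = this.parent[x];
    }
    return x;
  }
  union(a: number, b: number): void {
    this.parent[this.find(a)] = this.find(b);
  }
}
```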
Approximate benchmarks (single-threaded WASM):

| Points | Time |
|---|---|
| 1,000 | ~50 ms |
| 5,000 | ~500 ms |
| 10,000 | ~2 s |
The MST step is the bottleneck (O(n²)). For >20,000 points, consider downsampling (e.g., grid coarsening).
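The on-the-fly Prim step can be sketched as follows. This is a standalone illustration, not the crate's Rust internals; `dist` stands in for the mutual reachability distance, computed per edge so no O(n²) matrix is ever stored:

```typescript
// O(n²)-time, O(n)-memory Prim MST: only the cheapest known edge into
// the growing tree is kept per vertex. Returns [from, to, weight] edges.
function primMst(
  n: number,
  dist: (a: number, b: number) => number,
): Array<[number, number, number]> {
  const inTree = new Array<boolean>(n).fill(false);
  const best = new Array<number>(n).fill(Infinity); // cheapest edge into tree
  const parent = new Array<number>(n).fill(-1);
  const edges: Array<[number, number, number]> = [];
  let current = 0;
  inTree[0] = true;
  for (let added = 1; added < n; added++) {
    let next = -1;
    for (let v = 0; v < n; v++) {
      if (inTree[v]) continue;
      // Relax the edge from the most recently added vertex...
      const d = dist(current, v);
      if (d < best[v]) { best[v] = d; parent[v] = current; }
      // ...and track the globally cheapest frontier vertex.
      if (next === -1 || best[v] < best[next]) next = v;
    }
    edges.push([parent[next], next, best[next]]);
    inTree[next] = true;
    current = next;
  }
  return edges;
}
```

Each of the n − 1 iterations scans all vertices once, which is where the quadratic time (and the ~15,000-point practical limit) comes from.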
## Limitations

- 2D Euclidean only. The distance function is hardcoded as `sqrt((x1-x2)² + (y1-y2)²)`. No support for higher dimensions, Haversine, or custom metrics.
- No sample weights. Every point has equal weight. Uniform grids (e.g., 250m population grids) may produce fragmented clusters because density is measured by point count, not by an associated value like population.
- No prediction. Batch algorithm only — no `approximate_predict()` for new points.
- O(n²) MST. The Prim MST step scales quadratically. Practical limit is ~15,000 points in interactive use.
- Tie-breaking. When multiple MST edges have identical weights, the order may differ from scikit-learn (which uses a different MST algorithm). This can cause minor differences in cluster boundaries for ambiguous points, but the overall clustering structure is equivalent.
- No soft clustering. Returns hard labels only, no probability scores.
- EOM only. Only Excess of Mass cluster selection is supported (no leaf clustering).
## Comparison with scikit-learn

| Feature | scikit-learn | hdbscan-wasm |
|---|---|---|
| Dimensions | arbitrary | 2D only |
| Distance metrics | many (euclidean, manhattan, ...) | euclidean only |
| MST algorithm | dual-tree Boruvka (O(n log n) amortized) | Prim (O(n²)) |
| Sample weights | no | no |
| Soft clustering | yes (probabilities) | no |
| Prediction | yes (approximate_predict) | no |
| Cluster selection | EOM + leaf | EOM only |
| Runtime | Python + Cython | WASM (runs in browser) |
## Building from source

Prerequisites: Rust toolchain + wasm-pack

```sh
# Install wasm-pack
cargo install wasm-pack

# Build WASM
wasm-pack build --target bundler --release

# Run Rust tests
cargo test --release
```

## Tests
Five test cases are included in `src/lib.rs`:
- two_moons — Classic two interleaving half-circles (100 points)
- two_blobs_noise — Two Gaussian blobs with uniform noise (110 points)
- grid_clusters — Three separated grid clusters (108 points)
- small_debug — Minimal 10-point debug case
- gangnam_noise_in_hull — Real-world test with projected coordinates, verifying no noise points appear inside cluster convex hulls
```sh
cargo test --release
```

## License
BSD 3-Clause — compatible with scikit-learn's license, as this is a derived work.
