arrow-supercluster
v0.3.1
Arrow-native spatial clustering engine — Supercluster reimplemented for Apache Arrow tables
A spatial clustering engine for Apache Arrow tables. Reimplements the Supercluster algorithm to work directly with Arrow columnar memory — no GeoJSON serialization, no intermediate JS objects.
Live Demo — see it in action with up to 2M points.
Why
Supercluster expects GeoJSON in and produces GeoJSON out. If your data is already in Arrow format (e.g. loaded from GeoParquet), that means:
- Iterating the Arrow table to build GeoJSON features
- Supercluster internally converts those back to flat arrays
- getClusters() builds new GeoJSON Feature objects on every call
This library skips all of that. It reads coordinate buffers directly from the Arrow geometry column and outputs typed arrays (Float64Array, Uint32Array, Uint8Array) ready for any rendering pipeline.
Install
# pnpm
pnpm add arrow-supercluster apache-arrow
# npm
npm install arrow-supercluster apache-arrow
# yarn
yarn add arrow-supercluster apache-arrow

apache-arrow is a peer dependency, so you control the version (>=14 supported).
Usage
import { ArrowClusterEngine } from "arrow-supercluster";
import type { Table } from "apache-arrow";
// `table` is an Arrow Table with a GeoArrow Point geometry column
// (FixedSizeList[2] of Float64 — the standard encoding for point data)
const engine = new ArrowClusterEngine({
  radius: 75, // cluster radius in pixels (default: 40)
  maxZoom: 16, // max zoom level to cluster (default: 16)
  minZoom: 0, // min zoom level to cluster (default: 0)
  minPoints: 2, // minimum points to form a cluster (default: 2)
});
engine.load(table, "geometry");
// Query clusters for a bounding box and zoom level
const output = engine.getClusters([-180, -85, 180, 85], 4);
// output.positions:   Float64Array [lng0, lat0, lng1, lat1, ...]
// output.pointCounts: Uint32Array  [count0, count1, ...]
// output.ids:         Float64Array [id0, id1, ...]
// output.isCluster:   Uint8Array   [1, 0, 1, ...] (1 = cluster, 0 = point)
// output.length:      number

API
new ArrowClusterEngine(options?)
| Option | Type | Default | Description |
| ----------- | -------- | ------- | ---------------------------------------- |
| radius | number | 40 | Cluster radius in pixels |
| extent | number | 512 | Tile extent (radius is relative to this) |
| minZoom | number | 0 | Minimum zoom level for clustering |
| maxZoom | number | 16 | Maximum zoom level for clustering |
| minPoints | number | 2 | Minimum points to form a cluster |
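
To make the relationship between radius and extent concrete, here is a minimal sketch of how a pixel radius becomes a search distance at a given zoom. It assumes arrow-supercluster follows Supercluster's formula, radius / (extent * 2^zoom), in normalized Mercator units where the whole world spans [0, 1]. The helper name clusterSearchRadius is hypothetical and not part of the library's API.

```typescript
// Hypothetical helper, not part of arrow-supercluster.
// Assumption: the search radius follows Supercluster's formula,
// expressed in normalized Mercator units (world width = 1).
function clusterSearchRadius(radius: number, extent: number, zoom: number): number {
  return radius / (extent * Math.pow(2, zoom));
}

// With the defaults (radius 40, extent 512), the radius at zoom 0
// covers 40/512 of the world width and halves with every zoom level.
const r0 = clusterSearchRadius(40, 512, 0); // 0.078125
const r1 = clusterSearchRadius(40, 512, 1); // 0.0390625
```

Under this assumption, doubling extent has the same effect as halving radius, which is why the two options are documented together.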
engine.load(table, geometryColumn?, idColumn?, filterMask?)
Index an Arrow Table. The geometry column must be GeoArrow Point encoding (FixedSizeList[2] of Float64). Single-chunk tables use a zero-copy fast path.
- geometryColumn: name of the geometry column (default: "geometry")
- idColumn: reserved for future use. Currently ignored; point IDs are always Arrow row indices. (default: "id")
- filterMask: optional Uint8Array of length table.numRows. When provided, only rows where filterMask[i] is non-zero are indexed. Rows with 0 are excluded from clustering entirely. Pass null or omit to include all rows.
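
As an illustration of the filterMask parameter, the sketch below builds a mask from a per-row predicate. The helper buildFilterMask and the magnitudes data are hypothetical; only the mask shape (a Uint8Array with one entry per row, non-zero meaning "index this row") comes from the description above.

```typescript
// Hypothetical helper: builds a mask with 1 for rows to index, 0 for rows to skip.
function buildFilterMask(numRows: number, keep: (row: number) => boolean): Uint8Array {
  const mask = new Uint8Array(numRows); // zero-filled, so every row starts excluded
  for (let i = 0; i < numRows; i++) {
    if (keep(i)) mask[i] = 1;
  }
  return mask;
}

// Example: keep only rows whose (made-up) magnitude value is at least 3.
const magnitudes = new Float64Array([1.2, 3.4, 5.0, 2.9]);
const mask = buildFilterMask(magnitudes.length, (i) => magnitudes[i] >= 3);

// The mask would then be passed as the fourth argument:
// engine.load(table, "geometry", "id", mask);
```

In a real application the predicate would typically read from another column of the same Arrow table, so that row i of the mask lines up with row i of the geometry column.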
engine.getClusters(bbox, zoom) → ClusterOutput
Query clusters within a bounding box [minLng, minLat, maxLng, maxLat] at the given zoom level. Returns typed arrays — no object allocation per result.
The returned arrays are views into reusable internal buffers. They're valid until the next getClusters() call. Copy them if you need to retain the data.
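
One way to retain a result across calls is to copy each array before querying again. The following is a minimal sketch; copyClusterOutput is a hypothetical helper, and it assumes only the first length entries of each array (2 × length for positions) are meaningful.

```typescript
// Shape copied from the ClusterOutput interface documented in this README.
interface ClusterOutput {
  positions: Float64Array;
  pointCounts: Uint32Array;
  ids: Float64Array;
  isCluster: Uint8Array;
  length: number;
}

// Hypothetical helper: copies every array so the result survives
// later getClusters() calls that reuse the internal buffers.
function copyClusterOutput(out: ClusterOutput): ClusterOutput {
  const n = out.length;
  return {
    positions: out.positions.slice(0, n * 2), // slice() on a typed array allocates a copy
    pointCounts: out.pointCounts.slice(0, n),
    ids: out.ids.slice(0, n),
    isCluster: out.isCluster.slice(0, n),
    length: n,
  };
}

// Usage sketch with stand-in data (no engine required):
const live: ClusterOutput = {
  positions: new Float64Array([10, 20, 30, 40]),
  pointCounts: new Uint32Array([5, 1]),
  ids: new Float64Array([1000, 7]),
  isCluster: new Uint8Array([1, 0]),
  length: 2,
};
const kept = copyClusterOutput(live);
live.positions[0] = 999; // simulate the engine overwriting its buffer
// kept.positions[0] is still 10
```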
engine.getChildren(clusterId) → ClusterOutput
Get the immediate children of a cluster.
engine.getLeaves(clusterId, limit?, offset?) → number[]
Get all leaf point indices for a cluster. Returns indices into the original Arrow table — use table.get(index) to materialize rows.
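
For large clusters it can be useful to fetch leaves in pages. The sketch below assumes limit is the maximum number of indices returned and offset is the number of leaves to skip, as in Supercluster; the LeafSource type and collectLeaves helper are hypothetical and exist only to keep the example self-contained.

```typescript
// Minimal structural type covering only the documented getLeaves() signature.
interface LeafSource {
  getLeaves(clusterId: number, limit?: number, offset?: number): number[];
}

// Hypothetical helper: fetches leaf row indices one page at a time
// until a short (or empty) page signals the end.
function collectLeaves(source: LeafSource, clusterId: number, pageSize: number): number[] {
  if (pageSize <= 0) throw new RangeError("pageSize must be positive");
  const all: number[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = source.getLeaves(clusterId, pageSize, offset);
    for (const rowIndex of page) all.push(rowIndex);
    if (page.length < pageSize) break;
  }
  return all;
}
```

Each returned number is an Arrow row index, so a caller would pass it to table.get(index) only for the rows it actually needs to display.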
engine.getClusterExpansionZoom(clusterId) → number
Get the zoom level at which a cluster expands into its children.
engine.getOriginZoom(clusterId) → number
Decode the zoom level from an encoded cluster ID.
engine.getOriginId(clusterId) → number
Decode the origin index from an encoded cluster ID.
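
To show what "encoded cluster ID" can mean, here is a sketch of the packing scheme used by Supercluster. Whether arrow-supercluster uses exactly this layout is an assumption, and the three function names are hypothetical; prefer getOriginZoom() and getOriginId() in real code.

```typescript
// Assumption: IDs are packed the way Supercluster packs them. arrow-supercluster
// may use a different layout, so treat these helpers as illustrative only.
function encodeClusterId(originIndex: number, clusterZoom: number, numPoints: number): number {
  // Low 5 bits hold clusterZoom + 1; the remaining bits hold the origin index.
  // Offsetting by numPoints keeps cluster IDs from colliding with row indices.
  return (originIndex << 5) + (clusterZoom + 1) + numPoints;
}

// Returns clusterZoom + 1: the finer zoom level whose index holds the origin point.
function decodeOriginZoom(clusterId: number, numPoints: number): number {
  return (clusterId - numPoints) % 32;
}

// Returns the index of the origin point within that finer zoom level.
function decodeOriginId(clusterId: number, numPoints: number): number {
  return (clusterId - numPoints) >> 5;
}

const exampleId = encodeClusterId(3, 4, 1000); // (3 << 5) + 5 + 1000 = 1101
```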
ClusterOutput
interface ClusterOutput {
positions: Float64Array; // interleaved [lng, lat, lng, lat, ...]
pointCounts: Uint32Array; // points per cluster (1 for individual points)
ids: Float64Array; // cluster ID or Arrow row index
isCluster: Uint8Array; // 1 = cluster, 0 = individual point
length: number; // total items
}
Performance
Benchmarked against Supercluster with the same datasets:
| Metric                        | 200k points  | 1M points    |
| ----------------------------- | ------------ | ------------ |
| Load time                     | ~1× (parity) | ~1× (parity) |
| Query time (avg)              | ~7.5× faster | ~8× faster   |
| Query time (mid-zoom peak)    | ~20× faster  | ~27× faster  |
| Wire size (Arrow IPC vs JSON) | 84% smaller  | 84% smaller  |
Query speedups come from returning pre-allocated typed arrays instead of GeoJSON Feature objects. The more clustering that happens (low and mid zoom levels), the bigger the win.
How It Works
Same algorithm as Supercluster (~400 lines), different I/O:
- Reads the Float64Array coordinate buffer directly from the Arrow geometry column
- Converts lng/lat → Mercator, packs into flat arrays
- Builds a KDBush spatial index per zoom level (top-down clustering)
- getClusters() does a range query and writes results into reusable typed array buffers
For individual points at high zoom, coordinates are read directly from the original Arrow buffer — no inverse Mercator transform needed.
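
The lng/lat → Mercator step can be sketched as follows. The projection formulas are the standard normalized Web Mercator ones that Supercluster uses; the function names and the choice of separate x and y output arrays are assumptions for illustration, since the library's internal layout is not documented here.

```typescript
// Longitude in degrees -> normalized Mercator x in [0, 1].
function lngToX(lng: number): number {
  return lng / 360 + 0.5;
}

// Latitude in degrees -> normalized Mercator y in [0, 1] (0 = north, 1 = south).
function latToY(lat: number): number {
  const sin = Math.sin((lat * Math.PI) / 180);
  const y = 0.5 - (0.25 * Math.log((1 + sin) / (1 - sin))) / Math.PI;
  return y < 0 ? 0 : y > 1 ? 1 : y; // clamp near the poles
}

// Packs interleaved [lng, lat, ...] input into flat x and y arrays.
function projectInterleaved(coords: Float64Array): { xs: Float64Array; ys: Float64Array } {
  const n = coords.length >> 1; // number of points
  const xs = new Float64Array(n);
  const ys = new Float64Array(n);
  for (let i = 0; i < n; i++) {
    xs[i] = lngToX(coords[2 * i]);
    ys[i] = latToY(coords[2 * i + 1]);
  }
  return { xs, ys };
}

// Two points: the origin (0, 0) and the western edge of the map at the equator.
const projected = projectInterleaved(new Float64Array([0, 0, -180, 0]));
```

Because the input here is already an interleaved Float64Array, the loop reads coordinates in place, which mirrors the idea of working directly on the Arrow coordinate buffer.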
License
ISC (same as Supercluster)
