@iyulab/u-insight

v0.13.0

Published

21 days ago

Statistical analysis and data profiling engine with C FFI bindings.

caveman

iyujune

statistics profiling analytics ffi

u-insight

A statistical analysis and data profiling engine in Rust with C FFI bindings.

What's New in 0.9.1

BREAKING — Rust: InsightError::NonNumericColumn variant removed. The 0.9.0 audit redirected all internal call sites to DegenerateData, leaving the variant unused. Removed per Delete over deprecate policy. External match arms over InsightError must drop the corresponding branch.

What's New in 0.9.0

Kendall tau-b correlation added to CorrelationMethod (Pearson / Spearman / Kendall)
Outlier fences exposed on OutlierResult — lower_fence, upper_fence, center, spread
detect_outliers_slice(&[f64], method) helper for raw-slice input
vif_analysis() and condition_number() standalone multicollinearity diagnostics
WASM: correlation_matrix accepts optional _method field ("pearson" | "spearman" | "kendall"); new detect_univariate_outliers, vif_diagnostic, condition_number_diagnostic
C#: CorrelationMethodKind enum + Correlate(data, method) parameter
BREAKING — Rust: Non-finite numeric inputs now return InsightError::DegenerateData (was NonNumericColumn); audit covered 11 call sites
BREAKING — Rust: OutlierResult gained 4 new fields (exhaustive pattern matches must be updated)
BREAKING — FFI: insight_correlation signature gained a method: u32 parameter (use INSIGHT_CORR_PEARSON = 0 to keep prior behaviour)
BREAKING — C#: Correlate(...) now takes an optional CorrelationMethodKind parameter (default Pearson keeps existing call-sites compiling)
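
For readers unfamiliar with the newly added Kendall option, the following is a minimal sketch of the textbook tau-b statistic, written against plain slices. It is not u-insight's implementation: the function name `kendall_tau_b` and the choice to return `None` for degenerate input are assumptions made for illustration, and the naive pair loop is O(n²).

```rust
/// Naive O(n^2) Kendall tau-b for two equal-length samples.
/// Returns None if the lengths differ, fewer than two points are given,
/// any value is non-finite, or either variable is constant.
/// Hypothetical helper for illustration, not part of u-insight's API.
fn kendall_tau_b(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    if x.iter().chain(y.iter()).any(|v| !v.is_finite()) {
        return None;
    }
    let (mut concordant, mut discordant) = (0i64, 0i64);
    let (mut ties_x_only, mut ties_y_only) = (0i64, 0i64);
    for i in 0..x.len() {
        for j in (i + 1)..x.len() {
            let dx = x[i] - x[j];
            let dy = y[i] - y[j];
            if dx == 0.0 && dy == 0.0 {
                // Tied in both variables: counted in neither term.
            } else if dx == 0.0 {
                ties_x_only += 1;
            } else if dy == 0.0 {
                ties_y_only += 1;
            } else if (dx > 0.0) == (dy > 0.0) {
                concordant += 1;
            } else {
                discordant += 1;
            }
        }
    }
    // tau-b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty))
    let untied = concordant + discordant;
    let denom = (((untied + ties_x_only) as f64) * ((untied + ties_y_only) as f64)).sqrt();
    if denom == 0.0 {
        None
    } else {
        Some((concordant - discordant) as f64 / denom)
    }
}
```

Unlike Pearson, the statistic depends only on the ordering of the values, and the tie terms in the denominator are what distinguish tau-b from the simpler tau-a.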

Overview

u-insight transforms raw tabular data into actionable statistical insights. It operates in two distinct layers with opposite assumptions about input data quality:

CSV (raw)
  │
  ├─→ Profiling ─→ "What is the state of this data?"
  │     Tolerates dirty data (missing values, type mismatches expected)
  │
  │   (external preprocessing)
  │
  └─→ Analysis  ─→ "What can we learn from this data?"
        Requires clean numeric data (no NaN, no missing)

Built on u-analytics (statistical algorithms) and u-numflow (math primitives).
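
The "external preprocessing" step in the diagram is left to the caller. As one possible illustration, the sketch below applies listwise deletion: it keeps only the rows whose values are all finite, so that the result satisfies the "no NaN, no missing" requirement of the analysis layer. The helper name `drop_non_finite_rows` is hypothetical and is not part of u-insight; imputation or other strategies may suit a given dataset better.

```rust
/// Keep only the rows in which every value is finite (no NaN, no infinity).
/// `rows` is row-major: one inner Vec per observation.
/// Hypothetical preprocessing helper, not part of u-insight's API.
fn drop_non_finite_rows(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    rows.iter()
        .filter(|row| row.iter().all(|v| v.is_finite()))
        .cloned()
        .collect()
}
```

The cleaned rows can then be passed to analysis functions that take row-major data, such as the clustering calls shown later in this README.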

Modules

Data Layer

| Module | Description |
|--------|-------------|
| dataframe | Column-major tabular data model (DataFrame, Column, DataType) |
| csv_parser | CSV parsing with automatic type inference |
| error | Error types (InsightError) |

Profiling Layer (dirty data tolerated)

| Module | Description |
|--------|-------------|
| profiling | Column-level and dataset-level data profiling: descriptive stats, missing analysis, outlier flagging (IQR/Z-score/Modified Z-score), diagnostic flags |
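
To make the IQR outlier rule concrete, here is a minimal sketch of Tukey fences on a plain slice. It is an illustration under assumptions, not the profiling module's code: the function names are hypothetical, the quantiles use linear interpolation (u-insight may use a different quantile definition), and the multiplier `k` is left to the caller (1.5 is the conventional choice).

```rust
/// Linear-interpolation quantile on an already sorted, non-empty slice.
fn quantile_sorted(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Returns (lower_fence, upper_fence) = (Q1 - k*IQR, Q3 + k*IQR),
/// or None for empty or non-finite input.
/// Hypothetical helper for illustration, not part of u-insight's API.
fn iqr_fences(data: &[f64], k: f64) -> Option<(f64, f64)> {
    if data.is_empty() || data.iter().any(|v| !v.is_finite()) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(f64::total_cmp);
    let q1 = quantile_sorted(&sorted, 0.25);
    let q3 = quantile_sorted(&sorted, 0.75);
    let iqr = q3 - q1;
    Some((q1 - k * iqr, q3 + k * iqr))
}

/// Flags each value that lies strictly outside the fences.
fn flag_outliers(data: &[f64], k: f64) -> Option<Vec<bool>> {
    let (lower, upper) = iqr_fences(data, k)?;
    Some(data.iter().map(|&v| v < lower || v > upper).collect())
}
```

For the nine values 1 through 8 plus 100 with k = 1.5, the quartiles are 3 and 7, so the fences are -3 and 13 and only the value 100 is flagged.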

Analysis Layer (clean data required)

| Module | Description |
|--------|-------------|
| analysis | Correlation (Pearson/Spearman/Kendall), regression (simple/multiple OLS), Cramer's V contingency analysis |
| clustering | K-Means++ (auto-K, Gap Statistic), Mini-Batch K-Means, DBSCAN, Hierarchical Agglomerative (Single/Complete/Average/Ward), HDBSCAN |
| distribution | ECDF, histogram bins (Sturges/Scott/FD), QQ-plot, normality tests (KS, Jarque-Bera, Shapiro-Wilk, Anderson-Darling), Grubbs test, distribution fitting |
| pca | Principal Component Analysis with auto-scaling option |
| isolation_forest | Isolation Forest anomaly detection (Liu et al. 2008) |
| lof | Local Outlier Factor (LOF) density-based anomaly detection |
| mahalanobis | Mahalanobis distance multivariate outlier detection |
| feature_importance | Variance threshold, correlation filter, VIF, condition number, composite importance, ANOVA F-test selection, Mutual Information, Permutation Importance |

FFI Layer

| Module | Description |
|--------|-------------|
| ffi | C FFI bindings: 32 functions, 20 #[repr(C)] structs, auto-generated C header via cbindgen |

Quick Start

use u_insight::csv_parser::CsvParser;
use u_insight::profiling::profile_dataframe;

// 1. Parse CSV
let csv = "name,value,active\nAlice,1.5,true\nBob,2.3,false\nCharlie,3.1,true\n";
let df = CsvParser::new().parse_str(csv).unwrap();

// 2. Profile
let profiles = profile_dataframe(&df);

Clustering

use u_insight::clustering::{kmeans, dbscan, KMeansConfig, DbscanConfig};

let data = vec![
    vec![0.0, 0.0], vec![0.5, 0.5],
    vec![10.0, 10.0], vec![10.5, 10.5],
];

// K-Means
let km = kmeans(&data, &KMeansConfig::new(2)).unwrap();
assert_eq!(km.k, 2);

// DBSCAN
let db = dbscan(&data, &DbscanConfig::new(1.5, 2)).unwrap();
assert_eq!(db.n_clusters, 2);

Distribution Analysis

use u_insight::distribution::{distribution_analysis, DistributionConfig};

let data: Vec<f64> = (0..50).map(|i| (i as f64 - 25.0) * 0.2).collect();
let result = distribution_analysis(&data, &DistributionConfig::default()).unwrap();
println!("Normal: {}", result.normality.is_normal);

C FFI

u-insight builds as cdylib + staticlib for cross-language interop. A C header (u_insight.h) is auto-generated by cbindgen at build time.

Profiling

| Function | Description |
|----------|-------------|
| insight_profile_csv | Profile a CSV string → opaque context |
| insight_profile_free | Free profile context |
| insight_profile_row_count | Row count from profile |
| insight_profile_col_count | Column count from profile |
| insight_profile_column | Get column summary |

Clustering

| Function | Description |
|----------|-------------|
| insight_kmeans | K-Means++ clustering |
| insight_mini_batch_kmeans | Mini-Batch K-Means clustering |
| insight_dbscan | DBSCAN density-based clustering |
| insight_hierarchical | Hierarchical Agglomerative clustering (4 linkages) |
| insight_hdbscan | HDBSCAN clustering with membership probabilities |
| insight_gap_statistic | Gap statistic for optimal K selection |

Dimensionality Reduction

| Function | Description |
|----------|-------------|
| insight_pca | Principal Component Analysis |

Anomaly Detection

| Function | Description |
|----------|-------------|
| insight_isolation_forest | Isolation Forest anomaly detection |
| insight_lof | Local Outlier Factor detection |
| insight_mahalanobis | Mahalanobis distance outlier detection |

Statistical Analysis

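| Function | Description |
|----------|-------------|
| insight_correlation | Correlation matrix; the method parameter added in 0.9.0 selects the coefficient (INSIGHT_CORR_PEARSON = 0 keeps the earlier Pearson behaviour) |
| insight_regression | Simple linear regression |
| insight_cramers_v | Cramer's V contingency analysis |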

Distribution

| Function | Description |
|----------|-------------|
| insight_distribution | Normality testing (KS, JB, SW, AD) |

Feature Importance

| Function | Description |
|----------|-------------|
| insight_feature_importance | Composite feature importance scores |
| insight_anova_select | ANOVA F-test feature selection |
| insight_mutual_info | Mutual information feature ranking |
| insight_permutation_importance | Permutation importance for regression |

Memory Management

| Function | Description |
|----------|-------------|
| insight_free_labels | Free u32 label arrays |
| insight_free_i32_array | Free i32 arrays |
| insight_free_f64_array | Free f64 arrays |
| insight_free_anova_features | Free ANOVA feature arrays |
| insight_free_mi_features | Free MI feature arrays |
| insight_free_perm_features | Free permutation importance arrays |

Error & Version

| Function | Description |
|----------|-------------|
| insight_last_error | Last error message (thread-local) |
| insight_clear_error | Clear error state |
| insight_version | Library version string |

All FFI functions use catch_unwind to prevent panics from crossing the FFI boundary.
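
The panic-containment pattern mentioned above can be sketched with the standard library alone. Everything named here (`demo_mean`, the `DEMO_*` status codes) is hypothetical and chosen for illustration; u-insight's real signatures and error codes are defined in the generated u_insight.h and may differ. The inner routine panics on purpose for non-finite input so that the caught-panic path is visible.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical status codes for this sketch (not u-insight's actual values).
const DEMO_OK: i32 = 0;
const DEMO_ERR_INVALID_INPUT: i32 = 1;
const DEMO_ERR_PANIC: i32 = 2;

/// Hypothetical entry point: writes the mean of `data[0..len]` to `out`.
/// A real export would also carry the `no_mangle` attribute.
///
/// # Safety
/// `data` must point to `len` readable f64 values and `out` to a writable f64.
unsafe extern "C" fn demo_mean(data: *const f64, len: usize, out: *mut f64) -> i32 {
    if data.is_null() || out.is_null() || len == 0 {
        return DEMO_ERR_INVALID_INPUT;
    }
    // Run the work inside catch_unwind so that a panic becomes an error code
    // instead of unwinding into the C caller.
    let result = catch_unwind(AssertUnwindSafe(|| {
        let values = unsafe { std::slice::from_raw_parts(data, len) };
        // Deliberate panic on non-finite input, to exercise the panic path.
        assert!(values.iter().all(|v| v.is_finite()), "non-finite input");
        values.iter().sum::<f64>() / len as f64
    }));
    match result {
        Ok(mean) => {
            unsafe { *out = mean };
            DEMO_OK
        }
        Err(_) => DEMO_ERR_PANIC,
    }
}
```

Note that catch_unwind only helps when the crate is built with the default unwinding panic strategy; with panic = "abort" the process terminates regardless.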

C# Binding (UInsight)

Install via NuGet — native libraries are bundled automatically:

dotnet add package UInsight

using UInsight;

using var client = new InsightClient();
Console.WriteLine(client.GetVersion());

var data = new double[,] { {0,0}, {1,1}, {10,10}, {11,11} };
var result = client.KMeans(data, k: 2);
Console.WriteLine($"K={result.K}, WCSS={result.Wcss:F2}");

The binding is in bindings/csharp/UInsight/ with:

Interop/NativeLibrary.cs — [LibraryImport] declarations for all 32 FFI functions
Interop/NativeStructs.cs — [StructLayout] mappings for all 20 C structs
InsightClient.cs — High-level managed API (automatic memory management)
InsightException.cs — Error code to exception conversion

Test Status

357 lib tests + 49 doc-tests = 406 total
0 clippy warnings
Build: lib + cdylib + staticlib
C header: auto-generated via cbindgen (20 structs, 32 functions)

Scope & Non-Goals

In Scope:

Data profiling (dirty data → quality report + diagnostic flags)
Statistical analysis (clean data → patterns + relationships)
Correlation, regression, clustering, PCA, anomaly detection
Feature importance and selection (ANOVA, MI, Permutation)
Distribution analysis and normality testing
C FFI for cross-language use
C# binding (UInsight NuGet package)

Out of Scope:

Visualization / charting
Data cleaning / transformation / imputation
ML model training / deployment
Deep learning

Requirements

Rust 1.75+
Dependencies: u-analytics, u-numflow

WebAssembly / npm

Available as an npm package via wasm-pack.

npm install @iyulab/u-insight

Quick Start

import init, { describe, kmeans } from '@iyulab/u-insight';

await init();
const stats = describe({ col1: [1, 2, 3], col2: [4, 5, 6] });

Functions

`describe(data) -> [ColumnResult]`

Descriptive statistics per column. Input: column-major { "col1": [1,2,3] }.

Output: Array of { name, data_type, numeric: { count, min, max, mean, median, std_dev, variance, skewness, kurtosis, q1, q3, iqr, p5, p95, ... } }.

`correlation_matrix(data) -> CorrelationResult`

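Pearson correlation matrix. Input: column-major { "col1": [1,2,3], "col2": [4,5,6] }. Since 0.9.0 the input may also carry an optional _method field ("pearson" | "spearman" | "kendall") to select a different coefficient.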

Output:

{ "names": ["col1","col2"], "matrix": [1,0.99,0.99,1], "n": 2, "high_pairs": [{ "col_a": "col1", "col_b": "col2", "r": 0.99, "p_value": 0.01 }] }

`kmeans(data, k) -> KMeansResult`

K-Means++ clustering on row-major data [[x,y,...], ...].

Output:

{ "k": 3, "labels": [0,0,1,1,2,2], "centroids": [[...]], "wcss": 5.2, "iterations": 12, "cluster_sizes": [2,2,2] }

`silhouette(data, labels, k) -> SilhouetteResult`

Silhouette analysis for an existing clustering assignment. Works with any clustering output (kmeans, dbscan, hierarchical, etc.). data is row-major [[x,y,...], ...], labels is one cluster id per row (each < k), k is the number of distinct clusters. O(n²) — use sparingly on very large inputs.

Output:

{ "avg": 0.74, "per_sample": [0.81, 0.79, 0.62, ...] }

avg ranges from -1 (wrong cluster) to +1 (well-separated); singleton-cluster points report 0.0 in per_sample.
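
To show where the O(n²) cost and the singleton rule come from, here is a minimal sketch of the standard silhouette definition in plain Rust. It is an illustration only: the name `silhouette_sketch`, the Euclidean metric, and the tuple return type are assumptions, not the library's internals.

```rust
/// Euclidean distance between two points of equal dimension.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Naive O(n^2) silhouette. Returns (average, per-sample scores), or None
/// when the input is empty, the lengths differ, or a label is >= k.
/// Hypothetical helper for illustration, not part of u-insight's API.
fn silhouette_sketch(data: &[Vec<f64>], labels: &[usize], k: usize) -> Option<(f64, Vec<f64>)> {
    let n = data.len();
    if n == 0 || labels.len() != n || labels.iter().any(|&l| l >= k) {
        return None;
    }
    let mut sizes = vec![0usize; k];
    for &l in labels {
        sizes[l] += 1;
    }
    let mut per_sample = Vec::with_capacity(n);
    for i in 0..n {
        let own = labels[i];
        if sizes[own] <= 1 {
            // Singleton cluster: the score is defined as 0.
            per_sample.push(0.0);
            continue;
        }
        // Sum of distances from point i to the members of each cluster.
        let mut sums = vec![0.0; k];
        for j in 0..n {
            if i != j {
                sums[labels[j]] += euclidean(&data[i], &data[j]);
            }
        }
        // a: mean distance within the own cluster; b: nearest other cluster.
        let a = sums[own] / (sizes[own] - 1) as f64;
        let b = (0..k)
            .filter(|&c| c != own && sizes[c] > 0)
            .map(|c| sums[c] / sizes[c] as f64)
            .fold(f64::INFINITY, f64::min);
        let s = if b.is_finite() && a.max(b) > 0.0 { (b - a) / a.max(b) } else { 0.0 };
        per_sample.push(s);
    }
    let avg = per_sample.iter().sum::<f64>() / n as f64;
    Some((avg, per_sample))
}
```

Every point is compared against every other point, which is the source of the quadratic cost noted above.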

`pca(data, n_components) -> PcaResult`

Principal Component Analysis on row-major data.

Output:

{ "n_components": 2, "n_features": 4, "eigenvalues": [3.1,0.9], "explained_variance_ratio": [0.77,0.23], "cumulative_variance_ratio": [0.77,1.0], "loadings": [[...]], "scores": [[...]], "means": [...], "stds": [...] }

`dbscan(data, config) -> DbscanResult`

DBSCAN density-based clustering. config: { "epsilon": 1.5, "min_samples": 3 }.

Output:

{ "labels": [0,0,null,1,1], "n_clusters": 2, "noise_count": 1, "cluster_sizes": [2,2], "core_points": [true,true,false,true,true] }

`hierarchical(data, config) -> HierarchicalResult`

Hierarchical agglomerative clustering (nearest-neighbor-chain, O(n²) time / O(n²) memory). config: { "linkage": "ward", "n_clusters": 3 } or { "linkage": "single", "distance_threshold": 5.0 }.

Config fields:

linkage — "single" | "complete" | "average" | "ward" (default "ward").
n_clusters — flat clusters to extract (mutually exclusive with distance_threshold).
distance_threshold — dendrogram cut height (mutually exclusive with n_clusters).
max_points — memory guard; inputs with more points are rejected before allocating the O(n²) distance matrix. Omit for the default (10000, ≈400 MB matrix); set 0 to disable. Raise it for large native batches; lower it for tight memory (e.g. a browser tab).

// large dataset on a memory-constrained page: cap it explicitly
hierarchical(data, { linkage: "ward", n_clusters: 3, max_points: 5000 });
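
The ≈400 MB figure quoted for the default max_points can be sanity-checked with a little arithmetic. The sketch below is a rough estimate under assumptions: the README does not state the matrix layout, so two plausible ones are shown (a condensed upper triangle of f64 and a full n×n matrix of f32), and both land at about 400 MB for 10000 points. The function names are hypothetical; `within_guard` merely mirrors the documented rule that 0 disables the check.

```rust
/// Bytes for a condensed pairwise-distance matrix (upper triangle, no
/// diagonal) stored as f64. Assumed layout, for estimation only.
fn condensed_f64_bytes(n: u64) -> u64 {
    n * n.saturating_sub(1) / 2 * 8
}

/// Bytes for a full n x n matrix stored as f32. Alternative assumed layout.
fn full_f32_bytes(n: u64) -> u64 {
    n * n * 4
}

/// Mirrors the documented guard: reject inputs with more than `max_points`
/// points, unless `max_points` is 0 (guard disabled).
fn within_guard(n_points: usize, max_points: usize) -> bool {
    max_points == 0 || n_points <= max_points
}
```

Because the cost is quadratic, halving max_points to 5000 cuts the estimate to roughly a quarter (about 100 MB under either layout).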

Output:

{ "merges": [{ "cluster_a": 0, "cluster_b": 1, "distance": 1.2, "size": 2 }], "labels": [0,0,1,1,2], "n_clusters": 3 }

`isolation_forest(data, config) -> IsolationForestResult`

Isolation Forest anomaly detection. config: { "n_estimators": 100, "contamination": 0.1, "seed": 42 }.

Output:

{ "scores": [0.45, 0.82], "anomalies": [false, true], "threshold": 0.65, "anomaly_count": 1, "anomaly_fraction": 0.5 }

`lof(data, config) -> LofResult`

Local Outlier Factor anomaly detection. config: { "k": 20, "threshold": 1.5 }.

Output:

{ "scores": [1.0, 2.3], "anomalies": [false, true], "threshold": 1.5, "anomaly_count": 1, "anomaly_fraction": 0.5 }

`distribution_analysis(data, config) -> DistributionResult`

Distribution analysis on a 1-D array. config: { "bin_method": "freedman_diaconis", "bins": null, "significance_level": 0.05, "compute_ecdf": true, "compute_histogram": true, "compute_qq_plot": true, "fit_distributions": false }.

bin_method: "sturges" | "scott" | "freedman_diaconis" — automatic bin count rule (default "freedman_diaconis").
bins (optional, integer >= 1): explicit histogram bin count. When set it takes precedence over bin_method, and the histogram method field echoes "Fixed(n)".
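
For orientation, the three automatic rules named by bin_method can be sketched from their textbook definitions. This is not u-insight's code: the helper names are hypothetical, and details such as the quantile interpolation, the sample standard deviation, and Scott's constant (3.49 here) are assumptions that may differ from the library.

```rust
/// Linear-interpolation quantile on an already sorted, non-empty slice.
fn quantile_sorted(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Number of bins implied by a bin width over the data range (at least 1).
fn bins_from_width(range: f64, width: f64) -> usize {
    if width > 0.0 && range > 0.0 {
        (range / width).ceil() as usize
    } else {
        1
    }
}

/// Returns (sturges, scott, freedman_diaconis) bin counts,
/// or None for empty or non-finite input.
/// Hypothetical helper for illustration, not part of u-insight's API.
fn bin_counts(data: &[f64]) -> Option<(usize, usize, usize)> {
    if data.is_empty() || data.iter().any(|v| !v.is_finite()) {
        return None;
    }
    let n = data.len() as f64;
    let mut sorted = data.to_vec();
    sorted.sort_by(f64::total_cmp);
    let range = sorted[sorted.len() - 1] - sorted[0];

    // Sturges: k = ceil(log2 n) + 1
    let sturges = n.log2().ceil() as usize + 1;

    // Scott: width = 3.49 * s * n^(-1/3), with s the sample standard deviation
    let mean = data.iter().sum::<f64>() / n;
    let var = if data.len() > 1 {
        data.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0)
    } else {
        0.0
    };
    let scott = bins_from_width(range, 3.49 * var.sqrt() / n.cbrt());

    // Freedman-Diaconis: width = 2 * IQR * n^(-1/3)
    let iqr = quantile_sorted(&sorted, 0.75) - quantile_sorted(&sorted, 0.25);
    let fd = bins_from_width(range, 2.0 * iqr / n.cbrt());

    Some((sturges, scott, fd))
}
```

Sturges depends only on the sample size, while Scott and Freedman-Diaconis also react to spread; Freedman-Diaconis uses the IQR and is therefore less sensitive to outliers than Scott.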

Output:

{ "n": 100, "ecdf": { "values": [...], "probabilities": [...] }, "histogram": { "n_bins": 10, "bin_width": 0.5, "edges": [...], "counts": [...] }, "qq_plot": { "theoretical": [...], "sample": [...] }, "normality": { "shapiro_wilk": { "statistic": 0.98, "p_value": 0.45, "rejected": false }, "is_normal": true }, "fits": [] }

`regression(data) -> RegressionResult`

OLS regression analysis.

Input:

{ "predictors": { "x1": [1,2,3,4,5] }, "target": [2.1, 3.9, 6.1, 7.9, 10.1], "target_name": "y" }

Output:

{ "target_name": "y", "predictor_names": ["x1"], "r_squared": 0.99, "adj_r_squared": 0.99, "coefficients": [0.1, 2.0], "p_values": [0.9, 0.0001], "vif": [1.0], "f_p_value": 0.0001 }

`feature_importance(data) -> FeatureImportanceResult`

Feature importance via permutation, ANOVA, or mutual information.

Input:

{ "features": { "f1": [1,2,3], "f2": [5,4,3] }, "target": [0,0,1], "method": "permutation", "n_repeats": 5, "seed": 42 }

Output:

{ "method": "permutation", "features": [{ "name": "f1", "index": 0, "score": 0.8, "std_dev": 0.1 }], "baseline_score": 0.5 }

npm (WebAssembly)

npm install @iyulab/u-insight

The package resolves per environment via a conditional exports map:

| Environment | Entry |
|---|---|
| Bundlers (webpack, Vite, …) | ESM + WebAssembly ESM-integration (default condition) |
| Node.js: require(), ESM import, CJS TS runners (tsx, ts-node) | CJS glue loading the wasm from the filesystem (node condition); no loader hooks or flags needed |

Related

u-analytics -- Statistical analytics
u-numflow -- Mathematical primitives

License

MIT License
