@iyulab/u-insight
v0.3.1
Statistical analysis and data profiling engine with C FFI bindings.
u-insight
A statistical analysis and data profiling engine in Rust with C FFI bindings.
Overview
u-insight transforms raw tabular data into actionable statistical insights. It operates in two distinct layers with opposite assumptions about input data quality:
```
CSV (raw)
 │
 ├─→ Profiling ─→ "What is the state of this data?"
 │     Tolerates dirty data (missing values, type mismatches expected)
 │
 │   (external preprocessing)
 │
 └─→ Analysis ─→ "What can we learn from this data?"
       Requires clean numeric data (no NaN, no missing)
```
Built on u-analytics (statistical algorithms), u-numflow (math primitives).
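The "external preprocessing" step between the two layers is left to the caller, so the following is only a minimal sketch of one possible approach: dropping every row that contains a missing or non-finite cell. `drop_incomplete_rows` is a hypothetical helper written for illustration and is not part of the u-insight API.

```rust
/// Hypothetical helper (not part of u-insight): keeps only rows in which
/// every cell is present and finite, so the result meets the analysis
/// layer's "no NaN, no missing" requirement.
pub fn drop_incomplete_rows(rows: &[Vec<Option<f64>>]) -> Vec<Vec<f64>> {
    rows.iter()
        .filter_map(|row| {
            // Collecting into Option<Vec<_>> yields None as soon as one cell
            // is missing or non-finite, which drops the whole row.
            row.iter()
                .map(|cell| (*cell).filter(|v| v.is_finite()))
                .collect::<Option<Vec<f64>>>()
        })
        .collect()
}
```

Row deletion is the simplest strategy; imputation or column removal may suit a given dataset better, and the profiling output is the natural place to look when choosing between them.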
Modules
Data Layer
| Module | Description |
|--------|-------------|
| dataframe | Column-major tabular data model (DataFrame, Column, DataType) |
| csv_parser | CSV parsing with automatic type inference |
| error | Error types (InsightError) |
Profiling Layer (dirty data tolerated)
| Module | Description |
|--------|-------------|
| profiling | Column-level and dataset-level data profiling — descriptive stats, missing analysis, outlier flagging (IQR/Z-score/Modified Z-score), diagnostic flags |
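To illustrate what IQR-based outlier flagging on dirty data can look like, here is a self-contained sketch that skips missing cells instead of rejecting the column. The function names and the linear-interpolation quantile rule are assumptions made for this example; the `profiling` module may use different conventions.

```rust
/// Linear-interpolation quantile on a sorted, non-empty slice.
/// The interpolation rule is an assumption; u-insight may use another.
fn quantile_sorted(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Hypothetical helper: flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
/// Missing cells (None) and non-finite values are ignored when computing
/// the quartiles, and missing cells are never flagged.
pub fn iqr_outlier_flags(column: &[Option<f64>]) -> Vec<bool> {
    let mut present: Vec<f64> = column
        .iter()
        .flatten()
        .copied()
        .filter(|v| v.is_finite())
        .collect();
    if present.is_empty() {
        return vec![false; column.len()];
    }
    present.sort_by(|a, b| a.total_cmp(b));

    let q1 = quantile_sorted(&present, 0.25);
    let q3 = quantile_sorted(&present, 0.75);
    let iqr = q3 - q1;
    let (low, high) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);

    column
        .iter()
        .map(|cell| matches!(cell, Some(v) if *v < low || *v > high))
        .collect()
}
```

For the column `[1, 2, 3, 4, missing, 100]` the quartiles under this rule are 2 and 4, the fences are -1 and 7, and only the value 100 is flagged.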
Analysis Layer (clean data required)
| Module | Description |
|--------|-------------|
| analysis | Correlation (Pearson/Spearman), regression (simple/multiple OLS), Cramer's V contingency analysis |
| clustering | K-Means++ (auto-K, Gap Statistic), Mini-Batch K-Means, DBSCAN, Hierarchical Agglomerative (Single/Complete/Average/Ward), HDBSCAN |
| distribution | ECDF, histogram bins (Sturges/Scott/FD), QQ-plot, normality tests (KS, Jarque-Bera, Shapiro-Wilk, Anderson-Darling), Grubbs test, distribution fitting |
| pca | Principal Component Analysis with auto-scaling option |
| isolation_forest | Isolation Forest anomaly detection (Liu et al. 2008) |
| lof | Local Outlier Factor (LOF) density-based anomaly detection |
| mahalanobis | Mahalanobis distance multivariate outlier detection |
| feature_importance | Variance threshold, correlation filter, VIF, condition number, composite importance, ANOVA F-test selection, Mutual Information, Permutation Importance |
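The analysis layer's "clean data required" contract can be made concrete with a small example. The sketch below computes a textbook Pearson correlation coefficient and refuses input that is not clean; it is written from the standard formula for illustration, and the `pearson` name and `Option` return type are assumptions rather than the `analysis` module's actual signature.

```rust
/// Pearson correlation coefficient of two equal-length samples.
/// Returns None when the inputs differ in length, have fewer than two
/// points, contain non-finite values, or when either sample is constant.
pub fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    // Clean-data precondition: reject NaN and infinities up front.
    if x.iter().chain(y).any(|v| !v.is_finite()) {
        return None;
    }

    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;

    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None; // correlation is undefined for a constant sample
    }
    Some(sxy / (sxx.sqrt() * syy.sqrt()))
}
```

Rejecting dirty input outright, rather than silently skipping bad values, mirrors the split described in the overview: tolerance belongs to profiling, strictness to analysis.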
FFI Layer
| Module | Description |
|--------|-------------|
| ffi | C FFI bindings — 32 functions, 20 #[repr(C)] structs, auto-generated C header via cbindgen |
Quick Start
```rust
use u_insight::csv_parser::CsvParser;
use u_insight::profiling::profile_dataframe;

// 1. Parse CSV
let csv = "name,value,active\nAlice,1.5,true\nBob,2.3,false\nCharlie,3.1,true\n";
let df = CsvParser::new().parse_str(csv).unwrap();

// 2. Profile
let profiles = profile_dataframe(&df);
```
Clustering
```rust
use u_insight::clustering::{kmeans, dbscan, KMeansConfig, DbscanConfig};

let data = vec![
    vec![0.0, 0.0], vec![0.5, 0.5],
    vec![10.0, 10.0], vec![10.5, 10.5],
];

// K-Means
let km = kmeans(&data, &KMeansConfig::new(2)).unwrap();
assert_eq!(km.k, 2);

// DBSCAN
let db = dbscan(&data, &DbscanConfig::new(1.5, 2)).unwrap();
assert_eq!(db.n_clusters, 2);
```
Distribution Analysis
```rust
use u_insight::distribution::{distribution_analysis, DistributionConfig};

let data: Vec<f64> = (0..50).map(|i| (i as f64 - 25.0) * 0.2).collect();
let result = distribution_analysis(&data, &DistributionConfig::default()).unwrap();
println!("Normal: {}", result.normality.is_normal);
```
C FFI
u-insight builds as cdylib + staticlib for cross-language interop. A C header (u_insight.h) is auto-generated by cbindgen at build time.
Profiling
| Function | Description |
|----------|-------------|
| insight_profile_csv | Profile a CSV string → opaque context |
| insight_profile_free | Free profile context |
| insight_profile_row_count | Row count from profile |
| insight_profile_col_count | Column count from profile |
| insight_profile_column | Get column summary |
Clustering
| Function | Description |
|----------|-------------|
| insight_kmeans | K-Means++ clustering |
| insight_mini_batch_kmeans | Mini-Batch K-Means clustering |
| insight_dbscan | DBSCAN density-based clustering |
| insight_hierarchical | Hierarchical Agglomerative clustering (4 linkages) |
| insight_hdbscan | HDBSCAN clustering with membership probabilities |
| insight_gap_statistic | Gap statistic for optimal K selection |
Dimensionality Reduction
| Function | Description |
|----------|-------------|
| insight_pca | Principal Component Analysis |
Anomaly Detection
| Function | Description |
|----------|-------------|
| insight_isolation_forest | Isolation Forest anomaly detection |
| insight_lof | Local Outlier Factor detection |
| insight_mahalanobis | Mahalanobis distance outlier detection |
Statistical Analysis
| Function | Description |
|----------|-------------|
| insight_correlation | Pearson correlation matrix |
| insight_regression | Simple linear regression |
| insight_cramers_v | Cramer's V contingency analysis |
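For readers unfamiliar with Cramér's V, the following sketch computes it from a contingency table using the standard formula V = sqrt(chi² / (n · min(r − 1, c − 1))). It is an independent illustration that assumes non-negative counts and applies no bias correction; the function name and signature are hypothetical and may not match what `insight_cramers_v` does internally.

```rust
/// Cramér's V for an r x c contingency table of observed counts.
/// Returns None for empty or ragged tables, tables with fewer than two
/// rows or columns, or when any row or column total is not positive.
pub fn cramers_v(table: &[Vec<f64>]) -> Option<f64> {
    let rows = table.len();
    let cols = table.first()?.len();
    if rows < 2 || cols < 2 || table.iter().any(|r| r.len() != cols) {
        return None;
    }

    let row_sums: Vec<f64> = table.iter().map(|r| r.iter().sum::<f64>()).collect();
    let col_sums: Vec<f64> = (0..cols)
        .map(|j| table.iter().map(|r| r[j]).sum::<f64>())
        .collect();
    if row_sums.iter().chain(&col_sums).any(|&s| s <= 0.0) {
        return None;
    }
    let n: f64 = row_sums.iter().sum();

    // Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    let mut chi2 = 0.0_f64;
    for i in 0..rows {
        for j in 0..cols {
            let expected = row_sums[i] * col_sums[j] / n;
            chi2 += (table[i][j] - expected).powi(2) / expected;
        }
    }

    let k = (rows.min(cols) - 1) as f64;
    Some((chi2 / (n * k)).sqrt())
}
```

A table with all mass on the diagonal, such as `[[10, 0], [0, 10]]`, gives V = 1 (perfect association), while a uniform table such as `[[5, 5], [5, 5]]` gives V = 0.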
Distribution
| Function | Description |
|----------|-------------|
| insight_distribution | Normality testing (KS, JB, SW, AD) |
Feature Importance
| Function | Description |
|----------|-------------|
| insight_feature_importance | Composite feature importance scores |
| insight_anova_select | ANOVA F-test feature selection |
| insight_mutual_info | Mutual information feature ranking |
| insight_permutation_importance | Permutation importance for regression |
Memory Management
| Function | Description |
|----------|-------------|
| insight_free_labels | Free u32 label arrays |
| insight_free_i32_array | Free i32 arrays |
| insight_free_f64_array | Free f64 arrays |
| insight_free_anova_features | Free ANOVA feature arrays |
| insight_free_mi_features | Free MI feature arrays |
| insight_free_perm_features | Free permutation importance arrays |
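The dedicated free functions suggest that arrays returned across the FFI boundary are allocated by Rust and must be handed back to Rust for deallocation. Below is a minimal sketch of that ownership pattern. Both function names are hypothetical, and the real signature of `insight_free_f64_array` is not shown in this README, so the pointer-plus-length form here is an assumption.

```rust
// A real export would also carry #[no_mangle] (spelled #[unsafe(no_mangle)]
// in the 2024 edition); it is omitted so the sketch compiles on any edition.

/// Hypothetical producer: hands a heap-allocated f64 array to the caller.
/// The element count is written to `out_len`; the caller must pass both
/// the pointer and the length back to `example_free_f64_array`.
pub extern "C" fn example_alloc_f64_array(n: usize, out_len: *mut usize) -> *mut f64 {
    if out_len.is_null() {
        return std::ptr::null_mut();
    }
    let values: Box<[f64]> = (0..n).map(|i| i as f64).collect();
    unsafe { *out_len = values.len() };
    // Ownership moves to the caller; Rust no longer drops this allocation.
    Box::into_raw(values) as *mut f64
}

/// Hypothetical consumer: reclaims an array produced above.
/// Safety: `ptr` and `len` must come from one `example_alloc_f64_array`
/// call, and the array must not be freed twice or used afterwards.
pub unsafe extern "C" fn example_free_f64_array(ptr: *mut f64, len: usize) {
    if ptr.is_null() {
        return;
    }
    let slice = std::ptr::slice_from_raw_parts_mut(ptr, len);
    // Rebuilding the Box lets Rust's allocator release the memory.
    unsafe { drop(Box::from_raw(slice)) };
}
```

Freeing such memory with C's `free` or another allocator would be undefined behavior, which is one common reason a library exposes a free function per array type.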
Error & Version
| Function | Description |
|----------|-------------|
| insight_last_error | Last error message (thread-local) |
| insight_clear_error | Clear error state |
| insight_version | Library version string |
All FFI functions use catch_unwind to prevent panics from crossing the FFI boundary.
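As an illustration of the pattern, here is a sketch of an exported function that wraps its body in `catch_unwind` and records failures in a thread-local error slot. `example_mean`, `example_last_error`, and the 0 / -1 return convention are assumptions made for this example; the actual u-insight functions and error codes may differ.

```rust
use std::cell::RefCell;
use std::ffi::CString;
use std::os::raw::{c_char, c_int};
use std::panic::{catch_unwind, AssertUnwindSafe};

thread_local! {
    // One error slot per thread, so concurrent callers do not clobber
    // each other's messages.
    static LAST_ERROR: RefCell<Option<CString>> = RefCell::new(None);
}

fn set_last_error(msg: &str) {
    let c = CString::new(msg)
        .unwrap_or_else(|_| CString::new("invalid error message").unwrap());
    LAST_ERROR.with(|slot| *slot.borrow_mut() = Some(c));
}

/// Hypothetical export: writes the mean of `len` doubles to `out`.
/// Returns 0 on success and -1 on failure (assumed convention).
pub extern "C" fn example_mean(data: *const f64, len: usize, out: *mut f64) -> c_int {
    let result = catch_unwind(AssertUnwindSafe(|| {
        if data.is_null() || out.is_null() || len == 0 {
            return Err("null pointer or empty input".to_string());
        }
        let values = unsafe { std::slice::from_raw_parts(data, len) };
        let mean = values.iter().sum::<f64>() / len as f64;
        unsafe { *out = mean };
        Ok(())
    }));

    match result {
        Ok(Ok(())) => 0,
        Ok(Err(msg)) => {
            set_last_error(&msg);
            -1
        }
        // A panic inside the closure ends up here instead of unwinding
        // into the foreign caller.
        Err(_) => {
            set_last_error("panic caught at FFI boundary");
            -1
        }
    }
}

/// Hypothetical accessor: pointer to the calling thread's last error, or
/// null if none. Valid until the next error is recorded on this thread.
pub extern "C" fn example_last_error() -> *const c_char {
    LAST_ERROR.with(|slot| {
        slot.borrow().as_ref().map_or(std::ptr::null(), |c| c.as_ptr())
    })
}
```

Unwinding across an `extern "C"` boundary is not something a C caller can handle, so converting panics into an error code plus a retrievable message keeps failures observable from the host language.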
C# Binding (UInsight)
Install via NuGet — native libraries are bundled automatically:
```
dotnet add package UInsight
```
```csharp
using UInsight;

using var client = new InsightClient();
Console.WriteLine(client.GetVersion());

var data = new double[,] { {0,0}, {1,1}, {10,10}, {11,11} };
var result = client.KMeans(data, k: 2);
Console.WriteLine($"K={result.K}, WCSS={result.Wcss:F2}");
```
The binding is in bindings/csharp/UInsight/ with:
- Interop/NativeLibrary.cs: [LibraryImport] declarations for all 32 FFI functions
- Interop/NativeStructs.cs: [StructLayout] mappings for all 20 C structs
- InsightClient.cs: High-level managed API (automatic memory management)
- InsightException.cs: Error code to exception conversion
Test Status
```
357 lib tests + 49 doc-tests = 406 total
0 clippy warnings
Build: lib + cdylib + staticlib
C header: auto-generated via cbindgen (20 structs, 32 functions)
```
Scope & Non-Goals
In Scope:
- Data profiling (dirty data → quality report + diagnostic flags)
- Statistical analysis (clean data → patterns + relationships)
- Correlation, regression, clustering, PCA, anomaly detection
- Feature importance and selection (ANOVA, MI, Permutation)
- Distribution analysis and normality testing
- C FFI for cross-language use
- C# binding (UInsight NuGet package)
Out of Scope:
- Visualization / charting
- Data cleaning / transformation / imputation
- ML model training / deployment
- Deep learning
Requirements
- Rust 1.75+
- Dependencies: u-analytics, u-numflow
Related
- u-analytics -- Statistical analytics
- u-numflow -- Mathematical primitives
License
MIT License
