@reaatech/classifier-evals-dataset

v0.1.1

Published

22 days ago

Multi-format dataset loader for classifier evaluation

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Multi-format dataset loader for classifier evaluation with CSV (RFC 4180), JSON, and JSONL support. Includes validation, train/test splitting with stratification, K-fold cross-validation, label normalization, alias resolution, and hierarchical label handling.

Installation

npm install @reaatech/classifier-evals-dataset
# or
pnpm add @reaatech/classifier-evals-dataset

Feature Overview

Multi-format loading — CSV (RFC 4180 compliant), JSON (array or { samples, data, results }), JSONL
Schema validation — validates required fields (text, label, predicted_label), confidence ranges, and data types
Train/test splitting — random or stratified splits with reproducible seeding (Mulberry32 PRNG)
K-fold cross-validation — generates K train/test folds with optional full-split pairs
Label normalization — lowercase, trim, separator normalization, custom transforms
Label aliasing — map synonyms to canonical labels (e.g., "password_reset" → "account")
Unknown label handling — keep, remove, or map unknown labels to a canonical "unknown" class
Hierarchical labels — compute metrics at arbitrary hierarchy levels, navigate parent/child relationships
Distribution analysis — imbalance detection, duplicate detection, data leakage checks, confidence distribution analysis
Dual ESM/CJS output — works with import and require

Quick Start

import { loadDataset, validateDataset, splitDataset } from "@reaatech/classifier-evals-dataset";

// Load a CSV dataset
const dataset = await loadDataset("./datasets/test-set.csv");

// Validate for common issues
const validation = validateDataset(dataset);
if (validation.warnings.length > 0) {
  console.log("Warnings:", validation.warnings.map(w => w.message));
}

// Split into train/test (stratified by label, 80/20, seed 42)
const { train, test } = splitDataset(dataset, {
  testSize: 0.2,
  stratify: true,
  seed: 42,
});

console.log(`Train: ${train.samples.length}, Test: ${test.samples.length}`);

API Reference

Dataset Loading

`loadDataset(filePath: string, format?: 'csv' | 'json' | 'jsonl'): Promise<EvalDataset>`

Loads a dataset from a file path. When format is omitted, it is auto-detected from the file extension. Returns an EvalDataset with samples and metadata.

const csvData = await loadDataset("./data/samples.csv");
const jsonData = await loadDataset("./data/samples.json");
const jsonlData = await loadDataset("./data/samples.jsonl");
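
To illustrate extension-based detection, here is a minimal sketch. `detectFormat` is a hypothetical helper written for this example, not a function exported by the package, and rejecting unknown extensions is an assumption about reasonable behavior rather than documented behavior.

```typescript
type DatasetFormat = "csv" | "json" | "jsonl";

// Hypothetical helper: infer the dataset format from the file extension.
function detectFormat(filePath: string): DatasetFormat {
  const dot = filePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : filePath.slice(dot + 1).toLowerCase();
  if (ext === "csv" || ext === "json" || ext === "jsonl") {
    return ext;
  }
  // Assumption: unknown extensions are rejected rather than guessed.
  throw new Error(`Cannot detect dataset format for "${filePath}"`);
}

console.log(detectFormat("./data/samples.jsonl")); // prints "jsonl"
```

Passing the format argument explicitly avoids relying on the extension at all, which helps with files named like `export.txt`.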

`loadDatasetFromContent(content: string, format: 'csv' | 'json' | 'jsonl'): Promise<EvalDataset>`

Loads a dataset from a raw string. Useful for in-memory data or streaming sources.

const csvContent = "text,label,predicted_label\nHello,greeting,greeting";
const dataset = await loadDatasetFromContent(csvContent, "csv");
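
For the JSON and JSONL formats, the sketch below shows one way raw content could be turned into sample rows. `parseJsonlRows` and `parseJsonRows` are hypothetical names, and the order in which the `samples`, `data`, and `results` wrapper keys are checked is an assumption; the package's own parser may differ.

```typescript
interface SampleRow {
  text: string;
  label: string;
  predicted_label: string;
  confidence?: number;
}

// JSONL: one JSON object per non-empty line.
function parseJsonlRows(content: string): SampleRow[] {
  return content
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as SampleRow);
}

// JSON: either a bare array or an object wrapping the array under a known key.
function parseJsonRows(content: string): SampleRow[] {
  const parsed: unknown = JSON.parse(content);
  if (Array.isArray(parsed)) return parsed as SampleRow[];
  if (typeof parsed === "object" && parsed !== null) {
    const wrapper = parsed as Record<string, unknown>;
    // Assumption: keys are tried in this order.
    for (const key of ["samples", "data", "results"]) {
      const value = wrapper[key];
      if (Array.isArray(value)) return value as SampleRow[];
    }
  }
  throw new Error("Expected an array or an object with samples, data, or results");
}
```

Neither function validates the rows; in this sketch that step is left to a separate validation pass.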

Dataset Validation

`validateDataset(dataset: EvalDataset): ValidationResult`

Validates a dataset for common issues. Returns { valid, errors, warnings }.

const result = validateDataset(dataset);

// Schema errors (empty text, missing labels, invalid confidence)
result.errors; // ValidationError[]

// Distribution warnings (imbalance, duplicates, leakage)
result.warnings; // ValidationWarning[]

| Check | Type | When |
|-------|------|------|
| Empty text | Error | text field is missing or empty |
| Empty label | Error | label field is missing or empty |
| Empty predicted_label | Error | predicted_label field is missing or empty |
| Invalid confidence | Error | Confidence outside [0, 1] range |
| Severe imbalance | Warning | Largest-to-smallest class ratio > 10:1 |
| Duplicate texts | Warning | Identical text content across samples |
| Data leakage | Warning | >95% accuracy on raw predictions |
| Low confidence | Warning | >50% of predictions below 0.5 |
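
A few of these checks can be sketched in plain TypeScript. This is an illustrative approximation only: `checkSamples`, `ScoredSample`, and the message strings are hypothetical, and the leakage and low-confidence checks are omitted for brevity.

```typescript
interface ScoredSample {
  text: string;
  label: string;
  predicted_label: string;
  confidence: number;
}

interface CheckReport {
  errors: string[];
  warnings: string[];
}

function checkSamples(samples: ScoredSample[]): CheckReport {
  const errors: string[] = [];
  const warnings: string[] = [];
  const counts = new Map<string, number>();
  const seenTexts = new Set<string>();
  let duplicates = 0;

  samples.forEach((s, i) => {
    if (s.text.trim() === "") errors.push(`sample ${i}: empty text`);
    if (s.label.trim() === "") errors.push(`sample ${i}: empty label`);
    if (s.predicted_label.trim() === "") errors.push(`sample ${i}: empty predicted_label`);
    // Written as a negation so that NaN is also rejected.
    if (!(s.confidence >= 0 && s.confidence <= 1)) {
      errors.push(`sample ${i}: confidence ${s.confidence} outside [0, 1]`);
    }
    counts.set(s.label, (counts.get(s.label) ?? 0) + 1);
    if (seenTexts.has(s.text)) duplicates += 1;
    seenTexts.add(s.text);
  });

  // Imbalance: compare the largest class to the smallest one.
  const sizes = Array.from(counts.values());
  if (sizes.length > 1 && Math.max(...sizes) / Math.min(...sizes) > 10) {
    warnings.push("severe class imbalance (largest/smallest class > 10:1)");
  }
  if (duplicates > 0) warnings.push(`${duplicates} duplicate text(s)`);
  return { errors, warnings };
}
```

Errors describe individual samples, while warnings describe the dataset as a whole, which mirrors the Error/Warning split in the table.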

`validateSamples(samples: ClassificationResult[]): ValidationResult`

Validates a raw array of classification results without full dataset metadata.

`getDatasetSummary(dataset: EvalDataset): DatasetSummary`

Returns a summary object with totalSamples, numLabels, labelDistribution, accuracy, and avgConfidence.
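
As a rough guide to what these fields mean, the sketch below computes them by hand. It assumes that labelDistribution holds raw counts per ground-truth label and that an empty dataset yields 0 for both averages; `summarize` and `SummarySketch` are hypothetical names, not the package's types.

```typescript
interface PredictedSample {
  label: string;
  predicted_label: string;
  confidence: number;
}

interface SummarySketch {
  totalSamples: number;
  numLabels: number;
  labelDistribution: Record<string, number>;
  accuracy: number;
  avgConfidence: number;
}

function summarize(samples: PredictedSample[]): SummarySketch {
  const labelDistribution: Record<string, number> = {};
  let correct = 0;
  let confidenceSum = 0;
  for (const s of samples) {
    labelDistribution[s.label] = (labelDistribution[s.label] ?? 0) + 1;
    if (s.label === s.predicted_label) correct += 1;
    confidenceSum += s.confidence;
  }
  const n = samples.length;
  return {
    totalSamples: n,
    numLabels: Object.keys(labelDistribution).length,
    labelDistribution,
    // Assumption: empty input reports 0 instead of NaN.
    accuracy: n === 0 ? 0 : correct / n,
    avgConfidence: n === 0 ? 0 : confidenceSum / n,
  };
}
```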

Dataset Splitting

`splitDataset(dataset: EvalDataset, options: SplitOptions): { train: EvalDataset, test: EvalDataset }`

Splits a dataset into train and test sets with optional stratification.

const { train, test } = splitDataset(dataset, {
  testSize: 0.3,       // 30% test split
  stratify: true,      // Maintain label proportions
  seed: 42,            // Reproducible splits
  shuffle: true,       // Shuffle before splitting
});

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| testSize | number | (required) | Fraction (0-1) or absolute count for test set |
| stratify | boolean | true | Maintain label proportions across splits |
| seed | number | 42 | Random seed for reproducibility (Mulberry32) |
| shuffle | boolean | true | Shuffle data before splitting |
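
To show how a seeded, stratified split can be reproducible, here is a self-contained sketch built on the widely circulated Mulberry32 generator and a Fisher-Yates shuffle. `seededShuffle` and `stratifiedSplit` are hypothetical helpers, and rounding the test count per class is an assumption; the package's internals may handle rounding and absolute counts differently.

```typescript
// Mulberry32: a small seeded PRNG that returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the supplied generator; returns a copy.
function seededShuffle<T>(items: readonly T[], rand: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

interface Labeled {
  label: string;
}

// Split each label group separately so both sides keep the label proportions.
function stratifiedSplit<T extends Labeled>(
  samples: readonly T[],
  testFraction: number,
  seed: number,
): { train: T[]; test: T[] } {
  const rand = mulberry32(seed);
  const byLabel = new Map<string, T[]>();
  for (const s of samples) {
    const group = byLabel.get(s.label);
    if (group) group.push(s);
    else byLabel.set(s.label, [s]);
  }
  const train: T[] = [];
  const test: T[] = [];
  byLabel.forEach((group) => {
    const shuffled = seededShuffle(group, rand);
    // Assumption: the per-class test count is rounded to the nearest integer.
    const testCount = Math.round(shuffled.length * testFraction);
    test.push(...shuffled.slice(0, testCount));
    train.push(...shuffled.slice(testCount));
  });
  return { train, test };
}
```

Because all randomness flows through one seeded generator, calling `stratifiedSplit` twice with the same input and seed yields identical train and test sets.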

`kFoldSplit(dataset: EvalDataset, k?: number, seed?: number): EvalDataset[]`

Generates K evenly-distributed folds for cross-validation.

const folds = kFoldSplit(dataset, 5, 42);
for (const fold of folds) {
  console.log(`${fold.samples.length} samples in fold`);
}

`kFoldSplits(dataset: EvalDataset, k?: number, seed?: number): { train: EvalDataset, test: EvalDataset }[]`

Generates K train/test pairs for cross-validation, where each fold serves as the test set once.

const splits = kFoldSplits(dataset, 5);
for (const { train, test } of splits) {
  // train = all other folds, test = current fold
}
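
The relationship between folds and train/test pairs can be sketched as follows. `makeFolds` and `makeFoldPairs` are hypothetical helpers, and dealing items round-robin is just one way to get evenly sized folds; shuffling is left out here to keep the example short.

```typescript
// Deal items into k folds round-robin so fold sizes differ by at most one.
function makeFolds<T>(items: readonly T[], k: number): T[][] {
  if (!Number.isInteger(k) || k < 2 || k > items.length) {
    throw new Error("k must be an integer between 2 and the number of items");
  }
  const folds: T[][] = Array.from({ length: k }, () => [] as T[]);
  items.forEach((item, i) => folds[i % k].push(item));
  return folds;
}

// Build k train/test pairs: fold i is the test set, all other folds form train.
function makeFoldPairs<T>(folds: readonly T[][]): { train: T[]; test: T[] }[] {
  return folds.map((testFold, i) => ({
    train: folds
      .filter((_, j) => j !== i)
      .reduce<T[]>((acc, fold) => acc.concat(fold), []),
    test: testFold.slice(),
  }));
}

const pairs = makeFoldPairs(makeFolds([1, 2, 3, 4, 5, 6, 7], 3));
console.log(pairs.map((p) => `${p.train.length}/${p.test.length}`).join(", "));
// prints "4/3, 5/2, 5/2"
```

Every item appears in exactly one test set, so averaging a metric over the k pairs uses each sample for evaluation once.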

Label Management

`normalizeLabels(dataset: EvalDataset, options: NormalizationOptions): EvalDataset`

Normalizes all labels in a dataset with configurable transformations.

const normalized = normalizeLabels(dataset, {
  lowercase: true,
  trim: true,
  normalizeSeparators: "underscores", // "password reset" → "password_reset"
});

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| lowercase | boolean | true | Convert labels to lowercase |
| trim | boolean | true | Trim whitespace |
| normalizeSeparators | "spaces" \| "underscores" \| "none" | (none) | Convert between space and underscore separators |
| custom | (label: string) => string | (none) | Custom normalization function |
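
The effect of these options on a single label can be approximated with the sketch below. `normalizeLabel` is a hypothetical per-label helper; the order of the steps and the choice to run the custom function last are assumptions, not documented behavior.

```typescript
interface NormalizeSketchOptions {
  lowercase?: boolean;
  trim?: boolean;
  normalizeSeparators?: "spaces" | "underscores" | "none";
  custom?: (label: string) => string;
}

function normalizeLabel(label: string, options: NormalizeSketchOptions = {}): string {
  const { lowercase = true, trim = true, normalizeSeparators = "none", custom } = options;
  let out = label;
  if (trim) out = out.trim();
  if (lowercase) out = out.toLowerCase();
  // Collapse runs of whitespace or underscores into a single separator.
  if (normalizeSeparators === "underscores") out = out.replace(/\s+/g, "_");
  if (normalizeSeparators === "spaces") out = out.replace(/_+/g, " ");
  // Assumption: the custom transform sees the already-normalized label.
  return custom ? custom(out) : out;
}

console.log(normalizeLabel("  Password Reset ", { normalizeSeparators: "underscores" }));
// prints "password_reset"
```

Normalizing both ground-truth and predicted labels the same way keeps cosmetic differences such as casing from being counted as misclassifications.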

`applyLabelAliases(dataset: EvalDataset, aliases: LabelAliases): EvalDataset`

Maps synonyms to canonical labels.

const mapped = applyLabelAliases(dataset, {
  "password_reset": "account",
  "forgot_password": "account",
  "change_pw": "account",
});
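
Conceptually, alias resolution is a lookup with a fallback to the original label. The sketch below assumes that both the ground-truth and predicted labels are mapped and that aliases are resolved in a single, non-transitive step; `mapAliases` is a hypothetical helper, not the package's implementation.

```typescript
interface LabeledPrediction {
  label: string;
  predicted_label: string;
}

function mapAliases<T extends LabeledPrediction>(
  samples: readonly T[],
  aliases: Record<string, string>,
): T[] {
  // Use an own-property check so labels like "toString" are not mis-resolved.
  const resolve = (label: string): string =>
    Object.prototype.hasOwnProperty.call(aliases, label) ? aliases[label] : label;
  // Return new objects so the input samples are left untouched.
  return samples.map((s) => ({
    ...s,
    label: resolve(s.label),
    predicted_label: resolve(s.predicted_label),
  }));
}

const aliased = mapAliases(
  [{ label: "forgot_password", predicted_label: "billing" }],
  { forgot_password: "account", change_pw: "account" },
);
console.log(aliased[0].label, aliased[0].predicted_label); // prints "account billing"
```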

`handleUnknownLabels(dataset: EvalDataset, options: UnknownLabelOptions): EvalDataset`

Handles labels not in a known set with configurable actions.

const cleaned = handleUnknownLabels(dataset, {
  action: "map_to_unknown",
  knownLabels: ["greeting", "account", "billing"],
  unknownLabel: "other",
});

| Action | Behavior |
|--------|----------|
| keep | Leave unknown labels as-is |
| remove | Remove samples with unknown labels |
| map_to_unknown | Replace unknown labels with a canonical class |
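
The three actions can be sketched as a small function over an array of samples. `handleUnknown` is a hypothetical helper; it assumes that only the ground-truth label is inspected and that the replacement class defaults to "unknown", neither of which is confirmed by the documentation above.

```typescript
type UnknownAction = "keep" | "remove" | "map_to_unknown";

interface LabelPair {
  label: string;
  predicted_label: string;
}

function handleUnknown<T extends LabelPair>(
  samples: readonly T[],
  knownLabels: readonly string[],
  action: UnknownAction,
  unknownLabel = "unknown",
): T[] {
  const known = new Set(knownLabels);
  if (action === "keep") return samples.slice();
  if (action === "remove") return samples.filter((s) => known.has(s.label));
  // map_to_unknown: rewrite only the samples whose label is not in the known set.
  return samples.map((s) => (known.has(s.label) ? s : { ...s, label: unknownLabel }));
}
```

Choosing `remove` shrinks the dataset, so metrics computed afterwards describe only the known classes, while `map_to_unknown` keeps the sample count stable.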

`getLabelStats(dataset: EvalDataset): LabelStats`

Returns { totalLabels, uniqueLabels, distribution, mostCommon, leastCommon, avgSamplesPerLabel }.

`getParentLabel(label: string, hierarchy: LabelHierarchy): string | null`

Finds the parent of a label in a hierarchy.

`computeHierarchicalMetrics(dataset: EvalDataset, hierarchy: LabelHierarchy, level?: number): HierarchicalMetrics`

Computes accuracy at a specific hierarchy level by walking labels up to their parent nodes.
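
The shape of LabelHierarchy is not shown here, so the sketch below assumes a simple child-to-parent map and counts levels from the root (level 0). All names in it (`ParentMap`, `parentOf`, `labelAtLevel`, `accuracyAtLevel`) are hypothetical and only illustrate the idea of comparing labels after walking them up to a common level.

```typescript
// Assumption: the hierarchy is a child -> parent map; roots have no entry.
type ParentMap = Record<string, string>;

function parentOf(label: string, parents: ParentMap): string | null {
  return Object.prototype.hasOwnProperty.call(parents, label) ? parents[label] : null;
}

// Path from the root down to the label, e.g. ["account", "password_reset"].
function pathFromRoot(label: string, parents: ParentMap): string[] {
  const path: string[] = [];
  const seen = new Set<string>(); // guards against cycles in the map
  let current: string | null = label;
  while (current !== null && !seen.has(current)) {
    seen.add(current);
    path.unshift(current);
    current = parentOf(current, parents);
  }
  return path;
}

// Ancestor at the given depth; labels shallower than the level map to themselves.
function labelAtLevel(label: string, parents: ParentMap, level: number): string {
  const path = pathFromRoot(label, parents);
  return path[Math.min(level, path.length - 1)];
}

function accuracyAtLevel(
  samples: readonly { label: string; predicted_label: string }[],
  parents: ParentMap,
  level: number,
): number {
  if (samples.length === 0) return 0;
  const correct = samples.filter(
    (s) =>
      labelAtLevel(s.label, parents, level) ===
      labelAtLevel(s.predicted_label, parents, level),
  ).length;
  return correct / samples.length;
}
```

With this scheme, predicting "change_email" for a "password_reset" sample is wrong at the leaf level but correct at the root level when both roll up to "account".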

CSV Format (RFC 4180)

The CSV parser follows RFC 4180 with proper quoted-field handling:

text,label,predicted_label,confidence
"Reset my password, please",password_reset,password_reset,0.95
"Cancel my subscription",cancel_subscription,refund_request,0.72
"Where is my order",order_status,order_status,0.88

Required columns: text, label, predicted_label. Optional: confidence (defaults to 1.0).
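
To make the quoting rules concrete, here is a minimal RFC 4180-style parser sketch. It is not the package's parser: `parseCsv` is a hypothetical function that returns raw rows of strings and leaves header mapping and type conversion to the caller.

```typescript
// Parse CSV text into rows of fields, honoring RFC 4180 quoting:
// quoted fields may contain commas, line breaks, and "" as an escaped quote.
function parseCsv(content: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < content.length; i++) {
    const ch = content[i];
    if (inQuotes) {
      if (ch === '"' && content[i + 1] === '"') {
        field += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && content[i + 1] === "\n") i++; // treat CRLF as one break
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else {
      field += ch;
    }
  }
  // Flush the last record when the input does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

const parsed = parseCsv('text,label\r\n"Reset my password, please",password_reset\r\n');
console.log(parsed[1][0]); // prints "Reset my password, please"
```

The comma inside the quoted text field stays part of the field, which is why splitting each line on "," is not a safe substitute.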

Usage Pattern

Each schema export has a matching type export. Use the schema for runtime validation and the type for compile-time checking:

import { EvalDatasetSchema, type EvalDataset } from "@reaatech/classifier-evals";

function handleResponse(raw: unknown): EvalDataset {
  return EvalDatasetSchema.parse(raw);
}

Related Packages

@reaatech/classifier-evals — Core types and schemas
@reaatech/classifier-evals-metrics — Confusion matrix and classification metrics

License

MIT
