table-header-detective

v1.0.0

Published

a year ago

Intelligent detection of header rows in tabular data using multiple heuristics

0High
0Medium
0Low

lyleunderwood

table header detection tabular-data data-analysis csv spreadsheet excel data-processing typescript

🕵️ Table Header Detective

An intelligent utility for detecting header rows in tabular data using multiple heuristics and adaptive algorithms.

🔍 Features

Detects if tabular data has one or more header rows
Analyzes multiple factors to make an intelligent decision:
- Type differences (strings vs numbers)
- Capitalization patterns (Title Case, ALL CAPS)
- Format consistency
- Text length patterns
- Special character usage
- Empty cell analysis
- Value uniqueness
- Common header terminology
Advanced detection capabilities:
- Multi-header detection for tables with title rows
- Adaptive weighting based on table size and characteristics
- Handling of challenging edge cases and varied formats
Returns detailed reasoning for why rows were classified as headers
Written in TypeScript with full type safety
Zero dependencies
Configurable detection parameters
Thoroughly tested with extensive test fixtures

📦 Installation

npm install table-header-detective
# or
yarn add table-header-detective

🚀 Quick Start

// Import the detector and all default heuristics
import { detectHeaderRows, DEFAULT_HEURISTICS } from "table-header-detective";

// Sample tabular data
const tableData = [
  ["ID", "Name", "Age", "Department", "Salary"],
  [1, "John Doe", 32, "Marketing", 58000],
  [2, "Jane Smith", 28, "Engineering", 72000],
  [3, "Bob Johnson", 45, "Finance", 91000],
];

// Detect header rows using all heuristics
const result = detectHeaderRows(tableData, {}, DEFAULT_HEURISTICS);

console.log(`Found ${result.headerRowIndices.length} header rows`);
console.log(`Confidence: ${(result.confidence * 100).toFixed(1)}%`);
console.log("Header rows:", result.headerRowIndices);

Multi-header Detection

The library can detect complex header structures with multiple consecutive header rows:

// Table with title and column headers
const multiHeaderTable = [
  ["QUARTERLY FINANCIAL REPORT", "", "", "", "", ""],
  ["Q1", "Q2", "Q3", "Q4", "YTD", "Budget"],
  [125000, 132000, 141000, 138000, 536000, 520000],
  [95000, 97000, 102000, 99000, 393000, 400000],
];

const result = detectHeaderRows(multiHeaderTable, {}, DEFAULT_HEURISTICS);
// Detects rows 0 and 1 as headers with high confidence

Optimizing Bundle Size

For production applications, you can reduce bundle size by only importing the specific heuristics you need:

import { detectHeaderRows } from "table-header-detective";
import { 
  TypeDifferenceHeuristic, 
  CapitalizationPatternHeuristic 
} from "table-header-detective";

// Use only specific heuristics to reduce bundle size
const result = detectHeaderRows(tableData, {}, [
  TypeDifferenceHeuristic,
  CapitalizationPatternHeuristic
]);

📋 API Reference

`detectHeaderRows(rows, options?, heuristics?)`

Main function that analyzes tabular data and detects header rows.

Parameters

rows: TableData - The tabular data to analyze as an array of arrays
options?: HeaderDetectionOptions - Optional configuration settings
heuristics?: HeaderHeuristic[] - Array of heuristics to use for detection. You must provide either specific heuristics or DEFAULT_HEURISTICS.

Returns

Returns a HeaderDetectionResult object containing:

headerRowIndices: number[] - Array of row indices identified as headers
confidence: number - Confidence level (0-1) that the detected rows are headers
message: string - Human-readable result message
details: RowDetail[] - Detailed analysis of each row

Available Heuristics

The library provides multiple heuristics, each analyzing different aspects of the table:

TypeDifferenceHeuristic - Analyzes data type patterns (strings vs. numbers)
CapitalizationPatternHeuristic - Detects typical header capitalization (ALL CAPS, Title Case)
FormatConsistencyHeuristic - Identifies format pattern differences between rows
TextLengthPatternHeuristic - Analyzes text length patterns (headers tend to be shorter)
SpecialCharUsageHeuristic - Examines special character usage in cells
EmptyCellAnalysisHeuristic - Analyzes patterns of empty cells
ValueUniquenessHeuristic - Checks for unique values typical of headers
HeaderTerminologyHeuristic - Detects common header terms and terminology

Import only the heuristics you need to optimize bundle size, or use DEFAULT_HEURISTICS to include all of them.

Adaptive Weighting System

The detector uses an intelligent adaptive weighting system that adjusts heuristic importance based on table characteristics:

For small tables (1-3 rows):
- Upweights terminology-based detection
- Downweights pattern-based heuristics that need more rows
For large tables (10+ rows):
- Upweights pattern recognition heuristics
- Better utilizes statistical patterns across larger datasets

This ensures optimal header detection regardless of table size or structure.

Configuration Options

You can customize the detection behavior:

import { detectHeaderRows, DEFAULT_HEURISTICS } from "table-header-detective";

const result = detectHeaderRows(myData, {
  // Minimum score to consider a row a header (0-1)
  headerThreshold: 0.6, // Default is 0.6

  // Maximum number of consecutive header rows to detect
  maxHeaderRows: 3, // Default is 3

  // Whether to apply first row bias (higher score for first row)
  applyFirstRowBias: true, // Default is true

  // Additional custom header terms to look for
  customHeaderTerms: ["code", "reference", "location"],
  
  // Minimum rows needed for sampling patterns
  minSampleRows: 2 // Default is 2
}, DEFAULT_HEURISTICS);

// You can also pass configuration options to specific heuristics
import { HeaderTerminologyHeuristic } from "table-header-detective";

const customTerminologyHeuristic = {
  ...HeaderTerminologyHeuristic,
  analyze: (rows) => HeaderTerminologyHeuristic.analyze(rows, {
    customHeaderTerms: ["product_id", "sku_code", "category_name"],
    matchPartialTerms: true,
    headerTermBoost: 0.5
  })
};

const result = detectHeaderRows(myData, options, [
  customTerminologyHeuristic,
  // ...other heuristics
]);

📊 Examples

Example 1: CSV Data

import { detectHeaderRows } from "table-header-detective";
import { DEFAULT_HEURISTICS } from "table-header-detective";
import { parse } from "csv-parse/sync";
import fs from "fs";

// Read and parse CSV
const csvData = fs.readFileSync("data.csv", "utf8");
const records = parse(csvData);

// Detect headers
const result = detectHeaderRows(records, {}, DEFAULT_HEURISTICS);

if (result.headerRowIndices.length > 0) {
  // Extract headers
  const headers = result.headerRowIndices.map((idx) => records[idx]);
  console.log("Detected headers:", headers);

  // Extract data (rows after headers)
  const lastHeaderIdx = Math.max(...result.headerRowIndices);
  const data = records.slice(lastHeaderIdx + 1);
  console.log(`${data.length} data rows found`);
}

Example 2: Exploring Detailed Results

import { detectHeaderRows } from "table-header-detective";

const data = [
  /* your table data */
];
const result = detectHeaderRows(data);

// Examine detailed analysis for each row
result.details.forEach((detail) => {
  console.log(`Row ${detail.rowIndex}:`);
  console.log(`  Is header: ${detail.isLikelyHeader}`);
  console.log(`  Header Type: ${detail.headerType}`);
  console.log(`  Confidence: ${(detail.confidence * 100).toFixed(1)}%`);
  console.log(`  Reasons:`);
  detail.reasons.forEach((reason) => {
    console.log(`    - ${reason}`);
  });
  console.log("");
});

🛠️ Advanced Usage

Custom Heuristic Combinations

You can select and configure specific heuristics based on your data characteristics:

import { detectHeaderRows } from "table-header-detective";
import { 
  TypeDifferenceHeuristic, 
  HeaderTerminologyHeuristic,
  EmptyCellAnalysisHeuristic
} from "table-header-detective";

// For tables with mixed data types and specialized terminology
const result = detectHeaderRows(tableData, options, [
  TypeDifferenceHeuristic,
  HeaderTerminologyHeuristic,
  EmptyCellAnalysisHeuristic
]);

Multi-Header Table Detection

The library is specially designed to recognize complex multi-header tables, like:

// Complex header with title rows and column headers
const complexHeaderTable = [
  ['EMPLOYEE INFORMATION', '', '', '', ''],
  ['FISCAL YEAR 2023', '', '', '', ''],
  ['ID', 'Name', 'Age', 'Department', 'Salary'],
  [1, 'John Doe', 32, 'Marketing', 58000],
  // More data rows...
];

const result = detectHeaderRows(complexHeaderTable, {}, DEFAULT_HEURISTICS);
// Will detect all three rows as headers (0, 1, and 2)

Creating Custom Heuristics

You can create your own custom heuristics by implementing the HeaderHeuristic interface:

import { HeaderHeuristic } from "table-header-detective";
import { TableData, HeuristicResult } from "table-header-detective";

// Custom heuristic for multi-line headers
const MultiLineHeaderHeuristic: HeaderHeuristic = {
  name: 'multiLineHeader',
  analyze: (rows: TableData): HeuristicResult => {
    // Your custom analysis logic here
    const rowScores: number[] = [];
    const reasons: string[][] = [];
    
    // Analyze each row and populate rowScores and reasons
    // ...
    
    return { rowScores, reasons };
  }
};

// Use your custom heuristic
const result = detectHeaderRows(tableData, options, [
  TypeDifferenceHeuristic,
  MultiLineHeaderHeuristic
]);

Helper Functions

The package exports additional helper functions for advanced usage:

isTitleCase(str) - Checks if a string follows Title Case convention
detectCellType(value) - Detects the data type of a cell value
getRowCompleteness(row, options) - Analyzes how complete a row is
normalizeValue(value, options) - Normalizes values for comparison

🧪 Testing

The library includes comprehensive test coverage for various table scenarios:

Basic header detection for simple tables
Complex multi-level header structures
Tables with various formatting challenges
Large tables with many rows
Edge cases and challenging patterns

Each heuristic is individually tested, and integration tests verify the system works correctly as a whole.

🔮 Future Enhancements

Future versions will add support for even more complex table structures:

Full multi-level header detection with column groups spanning multiple columns
Better detection of mid-table category headers
Support for merged cell patterns from spreadsheet exports
Handling of tables with irregular structures