table-header-detective
v1.0.0
Published
Intelligent detection of header rows in tabular data using multiple heuristics
Maintainers
Readme
🕵️ Table Header Detective
An intelligent utility for detecting header rows in tabular data using multiple heuristics and adaptive algorithms.
🔍 Features
- Detects if tabular data has one or more header rows
- Analyzes multiple factors to make an intelligent decision:
- Type differences (strings vs numbers)
- Capitalization patterns (Title Case, ALL CAPS)
- Format consistency
- Text length patterns
- Special character usage
- Empty cell analysis
- Value uniqueness
- Common header terminology
- Advanced detection capabilities:
- Multi-header detection for tables with title rows
- Adaptive weighting based on table size and characteristics
- Handling of challenging edge cases and varied formats
- Returns detailed reasoning for why rows were classified as headers
- Written in TypeScript with full type safety
- Zero dependencies
- Configurable detection parameters
- Thoroughly tested with extensive test fixtures
📦 Installation
npm install table-header-detective
# or
yarn add table-header-detective🚀 Quick Start
// Import the detector and all default heuristics
import { detectHeaderRows, DEFAULT_HEURISTICS } from "table-header-detective";
// Sample tabular data
const tableData = [
["ID", "Name", "Age", "Department", "Salary"],
[1, "John Doe", 32, "Marketing", 58000],
[2, "Jane Smith", 28, "Engineering", 72000],
[3, "Bob Johnson", 45, "Finance", 91000],
];
// Detect header rows using all heuristics
const result = detectHeaderRows(tableData, {}, DEFAULT_HEURISTICS);
console.log(`Found ${result.headerRowIndices.length} header rows`);
console.log(`Confidence: ${(result.confidence * 100).toFixed(1)}%`);
console.log("Header rows:", result.headerRowIndices);Multi-header Detection
The library can detect complex header structures with multiple consecutive header rows:
// Table with title and column headers
const multiHeaderTable = [
["QUARTERLY FINANCIAL REPORT", "", "", "", "", ""],
["Q1", "Q2", "Q3", "Q4", "YTD", "Budget"],
[125000, 132000, 141000, 138000, 536000, 520000],
[95000, 97000, 102000, 99000, 393000, 400000],
];
const result = detectHeaderRows(multiHeaderTable, {}, DEFAULT_HEURISTICS);
// Detects rows 0 and 1 as headers with high confidenceOptimizing Bundle Size
For production applications, you can reduce bundle size by only importing the specific heuristics you need:
import { detectHeaderRows } from "table-header-detective";
import {
TypeDifferenceHeuristic,
CapitalizationPatternHeuristic
} from "table-header-detective";
// Use only specific heuristics to reduce bundle size
const result = detectHeaderRows(tableData, {}, [
TypeDifferenceHeuristic,
CapitalizationPatternHeuristic
]);📋 API Reference
detectHeaderRows(rows, options?, heuristics?)
Main function that analyzes tabular data and detects header rows.
Parameters
rows: TableData- The tabular data to analyze as an array of arraysoptions?: HeaderDetectionOptions- Optional configuration settingsheuristics?: HeaderHeuristic[]- Array of heuristics to use for detection. You must provide either specific heuristics or DEFAULT_HEURISTICS.
Returns
Returns a HeaderDetectionResult object containing:
headerRowIndices: number[]- Array of row indices identified as headersconfidence: number- Confidence level (0-1) that the detected rows are headersmessage: string- Human-readable result messagedetails: RowDetail[]- Detailed analysis of each row
Available Heuristics
The library provides multiple heuristics, each analyzing different aspects of the table:
TypeDifferenceHeuristic- Analyzes data type patterns (strings vs. numbers)CapitalizationPatternHeuristic- Detects typical header capitalization (ALL CAPS, Title Case)FormatConsistencyHeuristic- Identifies format pattern differences between rowsTextLengthPatternHeuristic- Analyzes text length patterns (headers tend to be shorter)SpecialCharUsageHeuristic- Examines special character usage in cellsEmptyCellAnalysisHeuristic- Analyzes patterns of empty cellsValueUniquenessHeuristic- Checks for unique values typical of headersHeaderTerminologyHeuristic- Detects common header terms and terminology
Import only the heuristics you need to optimize bundle size, or use DEFAULT_HEURISTICS to include all of them.
Adaptive Weighting System
The detector uses an intelligent adaptive weighting system that adjusts heuristic importance based on table characteristics:
For small tables (1-3 rows):
- Upweights terminology-based detection
- Downweights pattern-based heuristics that need more rows
For large tables (10+ rows):
- Upweights pattern recognition heuristics
- Better utilizes statistical patterns across larger datasets
This ensures optimal header detection regardless of table size or structure.
Configuration Options
You can customize the detection behavior:
import { detectHeaderRows, DEFAULT_HEURISTICS } from "table-header-detective";
const result = detectHeaderRows(myData, {
// Minimum score to consider a row a header (0-1)
headerThreshold: 0.6, // Default is 0.6
// Maximum number of consecutive header rows to detect
maxHeaderRows: 3, // Default is 3
// Whether to apply first row bias (higher score for first row)
applyFirstRowBias: true, // Default is true
// Additional custom header terms to look for
customHeaderTerms: ["code", "reference", "location"],
// Minimum rows needed for sampling patterns
minSampleRows: 2 // Default is 2
}, DEFAULT_HEURISTICS);
// You can also pass configuration options to specific heuristics
import { HeaderTerminologyHeuristic } from "table-header-detective";
const customTerminologyHeuristic = {
...HeaderTerminologyHeuristic,
analyze: (rows) => HeaderTerminologyHeuristic.analyze(rows, {
customHeaderTerms: ["product_id", "sku_code", "category_name"],
matchPartialTerms: true,
headerTermBoost: 0.5
})
};
const result = detectHeaderRows(myData, options, [
customTerminologyHeuristic,
// ...other heuristics
]);📊 Examples
Example 1: CSV Data
import { detectHeaderRows } from "table-header-detective";
import { DEFAULT_HEURISTICS } from "table-header-detective";
import { parse } from "csv-parse/sync";
import fs from "fs";
// Read and parse CSV
const csvData = fs.readFileSync("data.csv", "utf8");
const records = parse(csvData);
// Detect headers
const result = detectHeaderRows(records, {}, DEFAULT_HEURISTICS);
if (result.headerRowIndices.length > 0) {
// Extract headers
const headers = result.headerRowIndices.map((idx) => records[idx]);
console.log("Detected headers:", headers);
// Extract data (rows after headers)
const lastHeaderIdx = Math.max(...result.headerRowIndices);
const data = records.slice(lastHeaderIdx + 1);
console.log(`${data.length} data rows found`);
}Example 2: Exploring Detailed Results
import { detectHeaderRows } from "table-header-detective";
const data = [
/* your table data */
];
const result = detectHeaderRows(data);
// Examine detailed analysis for each row
result.details.forEach((detail) => {
console.log(`Row ${detail.rowIndex}:`);
console.log(` Is header: ${detail.isLikelyHeader}`);
console.log(` Header Type: ${detail.headerType}`);
console.log(` Confidence: ${(detail.confidence * 100).toFixed(1)}%`);
console.log(` Reasons:`);
detail.reasons.forEach((reason) => {
console.log(` - ${reason}`);
});
console.log("");
});🛠️ Advanced Usage
Custom Heuristic Combinations
You can select and configure specific heuristics based on your data characteristics:
import { detectHeaderRows } from "table-header-detective";
import {
TypeDifferenceHeuristic,
HeaderTerminologyHeuristic,
EmptyCellAnalysisHeuristic
} from "table-header-detective";
// For tables with mixed data types and specialized terminology
const result = detectHeaderRows(tableData, options, [
TypeDifferenceHeuristic,
HeaderTerminologyHeuristic,
EmptyCellAnalysisHeuristic
]);Multi-Header Table Detection
The library is specially designed to recognize complex multi-header tables, like:
// Complex header with title rows and column headers
const complexHeaderTable = [
['EMPLOYEE INFORMATION', '', '', '', ''],
['FISCAL YEAR 2023', '', '', '', ''],
['ID', 'Name', 'Age', 'Department', 'Salary'],
[1, 'John Doe', 32, 'Marketing', 58000],
// More data rows...
];
const result = detectHeaderRows(complexHeaderTable, {}, DEFAULT_HEURISTICS);
// Will detect all three rows as headers (0, 1, and 2)Creating Custom Heuristics
You can create your own custom heuristics by implementing the HeaderHeuristic interface:
import { HeaderHeuristic } from "table-header-detective";
import { TableData, HeuristicResult } from "table-header-detective";
// Custom heuristic for multi-line headers
const MultiLineHeaderHeuristic: HeaderHeuristic = {
name: 'multiLineHeader',
analyze: (rows: TableData): HeuristicResult => {
// Your custom analysis logic here
const rowScores: number[] = [];
const reasons: string[][] = [];
// Analyze each row and populate rowScores and reasons
// ...
return { rowScores, reasons };
}
};
// Use your custom heuristic
const result = detectHeaderRows(tableData, options, [
TypeDifferenceHeuristic,
MultiLineHeaderHeuristic
]);Helper Functions
The package exports additional helper functions for advanced usage:
isTitleCase(str)- Checks if a string follows Title Case conventiondetectCellType(value)- Detects the data type of a cell valuegetRowCompleteness(row, options)- Analyzes how complete a row isnormalizeValue(value, options)- Normalizes values for comparison
🧪 Testing
The library includes comprehensive test coverage for various table scenarios:
- Basic header detection for simple tables
- Complex multi-level header structures
- Tables with various formatting challenges
- Large tables with many rows
- Edge cases and challenging patterns
Each heuristic is individually tested, and integration tests verify the system works correctly as a whole.
🔮 Future Enhancements
Future versions will add support for even more complex table structures:
- Full multi-level header detection with column groups spanning multiple columns
- Better detection of mid-table category headers
- Support for merged cell patterns from spreadsheet exports
- Handling of tables with irregular structures
📄 License
MIT © Lyle Underwood
