table-chunk

v0.2.3

Published

2 months ago

Extract and chunk tables preserving row/column structure

0High
0Medium
0Low

table-chunk

Format-aware table extraction and chunking for RAG pipelines. Parses tables from Markdown, HTML, and CSV/TSV, then produces embedding-optimized chunks where every chunk carries its column headers.

Description

Standard text splitters treat tables as flat text. When a 40-row Markdown table is split at a 512-token boundary, half the rows end up in one chunk without headers and the other half in another. The embedded vectors encode rows stripped of all column context. A retrieval query for "what was Alice's order total?" returns a chunk containing | Alice | 847.50 | ... | with no column headers -- the embedding model has no idea that "847.50" refers to an order total.

table-chunk solves this with a three-stage pipeline:

Detect and extract every table in a document (GFM pipe tables, HTML <table> elements with rowspan/colspan, CSV/TSV files).
Normalize the extracted table into a Table object with a consistent headers: string[] + rows: string[][] representation, expanding merged cells and resolving multi-level headers.
Chunk using a configurable strategy -- row-based, serialized, column-based, cell-level, section-based, or whole-table -- so that every chunk carries the headers it needs to be semantically self-contained.

Zero runtime dependencies. All Markdown, HTML, and CSV parsing is implemented from scratch with no external libraries.

Installation

npm install table-chunk

Quick Start

import { chunkTable } from 'table-chunk';

const markdown = `| Product | SKU | Price | Stock |
| --- | --- | --- | --- |
| Widget A | W-001 | $49.99 | 240 |
| Widget B | W-002 | $39.99 | 85 |
| Gadget X | G-001 | $129.00 | 12 |
| Gadget Y | G-002 | $99.00 | 55 |
| Gizmo Z | Z-001 | $19.99 | 300 |`;

const chunks = chunkTable(markdown, {
  strategy: 'row-based',
  rowsPerChunk: 3,
});

for (const chunk of chunks) {
  console.log(chunk.text);
  // Each chunk is a self-contained table with the header row repeated
  console.log(chunk.metadata.rowRange);
  // [0, 3], [3, 5]
}

Serialized output for natural language embedding

const chunks = chunkTable(markdown, {
  strategy: 'serialized',
  serialization: { format: 'key-value' },
  rowsPerChunk: 1,
});

console.log(chunks[0].text);
// "Product: Widget A, SKU: W-001, Price: $49.99, Stock: 240"

Token-bounded chunking

const chunks = chunkTable(markdown, {
  strategy: 'row-based',
  maxTokens: 512,
  tokenCounter: (text) => Math.ceil(text.length / 4),
});
// Rows are batched to stay within the token budget

Mixed-content document workflow

import { detectTables, parseTable, chunk } from 'table-chunk';

const document = `# Quarterly Report
Some introductory text.

| Quarter | Revenue | Profit |
| --- | --- | --- |
| Q1 | $1M | $200K |
| Q2 | $1.2M | $250K |

Conclusion text.`;

// Step 1: Find all tables in the document
const regions = detectTables(document);

// Step 2: Parse and chunk each table
for (const region of regions) {
  const table = parseTable(region.content!, 'markdown');
  const chunks = chunk(table, {
    strategy: 'serialized',
    serialization: { format: 'key-value' },
    rowsPerChunk: 2,
  });
  // Insert chunks into your vector store
}

Features

Three source formats: GFM Markdown pipe tables, HTML <table> elements, and CSV/TSV delimited text.
Six chunking strategies: row-based, serialized, column-based, cell-level, section-based, and whole-table.
Header preservation: every chunk carries its column headers regardless of strategy.
Format auto-detection: automatically distinguishes Markdown, HTML, and CSV input.
HTML merged cell expansion: rowspan and colspan are expanded into an explicit two-dimensional grid before chunking.
Multi-level HTML header flattening: stacked <th> rows are flattened into qualified column names (e.g., "Q1 Revenue", "Q2 Cost").
CSV auto-delimiter detection: comma, tab, semicolon, and pipe delimiters are detected automatically.
RFC 4180 CSV parsing: quoted fields, escaped quotes, newlines within quoted fields, and CRLF line endings.
Token-bounded chunking: when maxTokens is set, rows are batched to keep chunks within the token budget.
Pluggable token counter: supply your own function wrapping tiktoken, gpt-tokenizer, or any provider tokenizer.
Rich chunk metadata: table index, row range, column range, header list, source format, token count, strategy used, and more.
Table detection in mixed documents: find all table regions in documents containing both prose and tables, respecting fenced code blocks.
Four serialization formats: key-value, newline, sentence, and template.
Reusable chunker factory: createTableChunker amortizes configuration across many calls.
Zero runtime dependencies: all parsing is built in.
Deterministic: same input with same options always produces the same output. No LLM calls, no network access.

API Reference

`chunkTable(input, options?)`

Primary entry point. Detects the source format, parses the table, applies the chunking strategy, and returns TableChunk[].

function chunkTable(input: string, options?: ChunkTableOptions): TableChunk[];

Parameters:

| Parameter | Type | Description | |---|---|---| | input | string | Raw table content (Markdown, HTML, or CSV/TSV) | | options | ChunkTableOptions | Configuration options (see Configuration) |

Returns: TableChunk[]

const chunks = chunkTable(htmlTable, {
  format: 'html',
  strategy: 'serialized',
  serialization: { format: 'key-value' },
  rowsPerChunk: 1,
});

`parseTable(input, format?)`

Parses raw table input into a normalized Table object without chunking.

function parseTable(input: string, format?: TableFormat): Table;

Parameters:

| Parameter | Type | Default | Description | |---|---|---|---| | input | string | | Raw table content | | format | TableFormat | 'auto' | Source format hint |

Returns: Table

const table = parseTable(csvData, 'csv');
console.log(table.headers);   // ['Name', 'Age', 'City']
console.log(table.rows[0]);   // ['Alice', '30', 'New York']
console.log(table.metadata);  // { format: 'csv', rowCount: 2, columnCount: 3, ... }

`chunk(table, options?)`

Chunks an already-parsed Table object. Use this when you have already called parseTable and want to apply a chunking strategy separately.

function chunk(table: Table, options?: ChunkTableOptions): TableChunk[];

Parameters:

| Parameter | Type | Description | |---|---|---| | table | Table | A parsed table object | | options | ChunkTableOptions | Configuration options |

Returns: TableChunk[]

const table = parseTable(input, 'markdown');
const chunks = chunk(table, { strategy: 'row-based', rowsPerChunk: 5 });

`detectTables(document, format?)`

Finds all table regions in a mixed-content document. Supports Markdown GFM pipe tables and HTML <table> elements. Ignores tables inside fenced code blocks.

function detectTables(
  document: string,
  format?: 'auto' | 'markdown' | 'html'
): TableRegion[];

Parameters:

| Parameter | Type | Default | Description | |---|---|---|---| | document | string | | Full document content | | format | 'auto' \| 'markdown' \| 'html' | 'auto' | Restrict detection to a specific format |

Returns: TableRegion[]

Each TableRegion contains:

| Property | Type | Description | |---|---|---| | format | 'markdown' \| 'html' | Detected table format | | startLine | number \| undefined | Zero-based first line (Markdown) | | endLine | number \| undefined | Zero-based last line (Markdown) | | startOffset | number \| undefined | Character offset of <table> (HTML) | | endOffset | number \| undefined | Character offset after </table> (HTML) | | estimatedRows | number | Estimated data row count | | estimatedColumns | number | Estimated column count | | content | string \| undefined | Raw table string extracted from the document |

const regions = detectTables(mixedDocument);
for (const region of regions) {
  const chunks = chunkTable(region.content!, { format: region.format });
}

`serializeRow(row, headers, options?)`

Serializes a single data row into an embedding-friendly string. Supports four serialization formats.

function serializeRow(
  row: string[],
  headers: string[],
  options?: SerializeRowOptions
): string;

Parameters:

| Parameter | Type | Description | |---|---|---| | row | string[] | Cell values for one row | | headers | string[] | Column header names | | options | SerializeRowOptions | Serialization configuration |

Returns: string

Serialization formats:

// key-value (default)
serializeRow(['Alice', '30', 'NY'], ['Name', 'Age', 'City']);
// "Name: Alice, Age: 30, City: NY"

// newline
serializeRow(['Alice', '30', 'NY'], ['Name', 'Age', 'City'], { format: 'newline' });
// "Name: Alice\nAge: 30\nCity: NY"

// sentence
serializeRow(['Alice', '30', 'NY'], ['Name', 'Age', 'City'], { format: 'sentence' });
// "Alice, Age: 30, and City: NY."

// template
serializeRow(
  ['Widget A', 'W-001', '$49.99'],
  ['Product', 'SKU', 'Price'],
  {
    format: 'template',
    template: '{{Product}} ({{SKU}}) costs {{Price}}.',
  }
);
// "Widget A (W-001) costs $49.99."

`estimateTokens(text)`

Default token counter using the chars / 4 heuristic.

function estimateTokens(text: string): number;

Parameters:

| Parameter | Type | Description | |---|---|---| | text | string | Input text |

Returns: number -- estimated token count (Math.ceil(text.length / 4))

`createTableChunker(config)`

Factory that returns a configured chunker instance. Useful when processing many tables with the same configuration.

function createTableChunker(config: ChunkTableOptions): TableChunker;

Returns: TableChunker

The TableChunker interface provides three methods:

| Method | Signature | Description | |---|---|---| | chunk | (input: string) => TableChunk[] | Auto-detect format, parse, and chunk | | parse | (input: string) => Table | Auto-detect format and parse | | chunkTable | (table: Table) => TableChunk[] | Chunk an already-parsed Table |

const chunker = createTableChunker({
  strategy: 'serialized',
  serialization: { format: 'key-value' },
  maxTokens: 512,
});

const chunks = chunker.chunk(tableHtml);
const table = chunker.parse(tableCsv);
const moreChunks = chunker.chunkTable(table);

Configuration

`ChunkTableOptions`

| Option | Type | Default | Description | |---|---|---|---| | format | 'auto' \| 'markdown' \| 'html' \| 'csv' \| 'tsv' | 'auto' | Source table format. When 'auto', the format is detected from the input content. | | strategy | ChunkStrategy | 'row-based' | Chunking strategy to apply. | | rowsPerChunk | number | 10 | Number of data rows per chunk (row-based and serialized strategies). | | columnsPerChunk | number | 5 | Number of columns per chunk (column-based strategy). | | anchorColumns | number[] | [0] | Column indices always included in every chunk (column-based strategy). | | columnOverlap | number | 1 | Number of overlapping columns between adjacent column chunks. | | maxTokens | number | undefined | Maximum token count per chunk. When set, rows are batched by token budget instead of fixed count. | | tokenCounter | (text: string) => number | estimateTokens | Function to count tokens. Default uses chars / 4. | | outputFormat | 'markdown' \| 'csv' \| 'tsv' \| 'plain' | 'markdown' | Output format for row-based and column-based chunk text. | | serialization | SerializeRowOptions | { format: 'key-value' } | Serialization options for the 'serialized' strategy. | | sectionColumn | number | undefined | Column index whose value changes define section boundaries (section-based strategy). | | identifierColumn | number | 0 | Column index used as the row identifier (cell-level strategy). | | hasHeader | boolean \| 'auto' | 'auto' | Whether the first row is a header. When 'auto', headers are detected for CSV/TSV. | | nestedTables | 'ignore' \| 'extract' \| 'flatten' | 'extract' | How to handle nested HTML tables. | | preserveCellHtml | boolean | false | When true, preserves inner HTML markup in cell values instead of stripping tags. | | tableIndex | number | 0 | Table index for multi-table documents. Stored in chunk metadata. | | includeEmptyCells | boolean | false | Include empty cells in serialized output. |

`SerializeRowOptions`

| Option | Type | Default | Description | |---|---|---|---| | format | 'key-value' \| 'newline' \| 'sentence' \| 'template' | 'key-value' | Serialization format. | | template | string | '' | Template string with {{ColumnName}} placeholders (template format). | | templateCaseSensitive | boolean | false | Whether template placeholder matching is case-sensitive. | | removeMissingPlaceholders | boolean | false | Remove unmatched {{placeholder}} tokens instead of preserving them. | | sentenceSubjectColumn | number | 0 | Column index used as the sentence subject (sentence format). | | includeEmptyCells | boolean | false | Include empty cells in serialized output. |

Chunking strategies

| Strategy | Best for | Description | |---|---|---| | 'row-based' | Most tables | Groups N rows per chunk, repeating the header row in each chunk. Output format is configurable (Markdown, CSV, TSV, plain). | | 'serialized' | Natural language search | Converts each row to a flat string (key-value, newline, sentence, or template format). Produces the highest embedding relevance for column-value queries. | | 'column-based' | Wide tables (20+ columns) | Splits columns into groups, with configurable anchor columns included in every chunk. | | 'cell-level' | Lookup/reference tables | Produces one chunk per cell, each containing the row identifier and column name for maximum granularity. | | 'section-based' | Grouped/categorized tables | Splits on blank rows, section header rows, or value changes in a designated section column. | | 'whole-table' | Small tables | Returns the entire table as a single chunk. Marks the chunk as oversized if it exceeds maxTokens. |

Error Handling

table-chunk throws errors in the following cases:

Invalid Markdown table: parseTable(input, 'markdown') throws if the input has fewer than two lines (header + separator) or if no separator row is found.

try {
  parseTable('| A | B |\n| 1 | 2 |', 'markdown');
} catch (err) {
  // Error: Invalid markdown table: no separator row found
}

Missing HTML table element: parseTable(input, 'html') throws if no <table> element is found in the input.

try {
  parseTable('<div>no table</div>', 'html');
} catch (err) {
  // Error: No <table> element found in input
}

Empty CSV input: returns a Table with inferred headers (['Column 1']) and zero rows. Does not throw.
Oversized whole-table chunks: when using the 'whole-table' strategy with maxTokens, chunks that exceed the limit are not split further but are flagged with metadata.oversized: true. Check this field to detect chunks that need alternative handling.

const chunks = chunkTable(largeTable, { strategy: 'whole-table', maxTokens: 512 });
if (chunks[0].metadata.oversized) {
  // Fall back to row-based chunking
  const smallerChunks = chunkTable(largeTable, { strategy: 'row-based', maxTokens: 512 });
}

Advanced Usage

Custom token counter with tiktoken

import { chunkTable } from 'table-chunk';
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4');

const chunks = chunkTable(table, {
  strategy: 'row-based',
  maxTokens: 512,
  tokenCounter: (text) => enc.encode(text).length,
});

Column-based chunking for wide tables

Split a 20-column table into manageable groups while keeping an anchor column (e.g., the row identifier) in every chunk:

const chunks = chunkTable(wideTable, {
  strategy: 'column-based',
  columnsPerChunk: 5,
  anchorColumns: [0],
  columnOverlap: 1,
});

// Every chunk includes column 0 (the anchor) plus a subset of remaining columns
for (const chunk of chunks) {
  console.log(chunk.metadata.columnRange);  // [0, 4], [0, 8], etc.
  console.log(chunk.metadata.headers);      // anchor + group headers
}

Section-based chunking by column value

Split a table into sections wherever a column value changes:

import { chunk, parseTable } from 'table-chunk';

const table = parseTable(csvData, 'csv');
const chunks = chunk(table, {
  strategy: 'section-based',
  sectionColumn: 0,  // split when column 0 value changes
});

for (const c of chunks) {
  console.log(c.metadata.sectionLabel);
  // "Engineering", "Marketing", etc.
}

Template-based serialization

Use custom templates with {{ColumnName}} placeholders for domain-specific embedding text:

import { serializeRow } from 'table-chunk';

const text = serializeRow(
  ['Widget A', 'W-001', '$49.99', '240'],
  ['Product', 'SKU', 'Price', 'Stock'],
  {
    format: 'template',
    template: '{{Product}} (SKU: {{SKU}}) is priced at {{Price}} with {{Stock}} units in stock.',
  }
);
// "Widget A (SKU: W-001) is priced at $49.99 with 240 units in stock."

Template matching is case-insensitive by default. Enable case-sensitive matching or remove unmatched placeholders:

serializeRow(['Alice'], ['Name'], {
  format: 'template',
  template: '{{name}} -- {{Missing}}',
  templateCaseSensitive: false,
  removeMissingPlaceholders: true,
});
// "Alice -- "

HTML tables with merged cells and captions

const html = `<table summary="Q1 Financial Data">
  <caption>Quarterly Results</caption>
  <thead>
    <tr><th colspan="2">Q1</th><th colspan="2">Q2</th></tr>
    <tr><th>Revenue</th><th>Cost</th><th>Revenue</th><th>Cost</th></tr>
  </thead>
  <tbody>
    <tr><td>100</td><td>50</td><td>120</td><td>60</td></tr>
  </tbody>
</table>`;

const table = parseTable(html, 'html');
console.log(table.headers);
// ['Q1 Revenue', 'Q1 Cost', 'Q2 Revenue', 'Q2 Cost']
console.log(table.metadata.caption);
// 'Quarterly Results'
console.log(table.metadata.htmlSummary);
// 'Q1 Financial Data'
console.log(table.metadata.hadMergedCells);
// true
console.log(table.metadata.originalHeaderLevels);
// [['Q1', 'Q1', 'Q2', 'Q2'], ['Revenue', 'Cost', 'Revenue', 'Cost']]

Preserving cell HTML

By default, HTML tags inside cells are stripped. To keep them:

const chunks = chunkTable(htmlTable, {
  format: 'html',
  preserveCellHtml: true,
  strategy: 'whole-table',
});
// Cell values retain their inner HTML: "<b>Alice</b>" instead of "Alice"

Sentence serialization with custom subject

import { serializeRow } from 'table-chunk';

const text = serializeRow(
  ['Alice Johnson', '30', 'New York', 'Engineering'],
  ['Name', 'Age', 'City', 'Department'],
  {
    format: 'sentence',
    sentenceSubjectColumn: 0,
  }
);
// "Alice Johnson, Age: 30, City: New York, and Department: Engineering."

TypeScript

table-chunk is written in TypeScript and ships type declarations with the package. All types are exported from the main entry point.

Exported types

import type {
  Table,
  TableMetadata,
  TableChunk,
  TableChunkMetadata,
  TableRegion,
  ChunkTableOptions,
  SerializeRowOptions,
  TableChunker,
  TableFormat,
  ChunkStrategy,
  RowOutputFormat,
  SerializationFormat,
} from 'table-chunk';

`Table`

interface Table {
  headers: string[];          // Column headers, one per column
  rows: string[][];           // Data rows, each with the same length as headers
  metadata: TableMetadata;    // Source format, dimensions, and parsing details
}

`TableMetadata`

interface TableMetadata {
  format: 'markdown' | 'html' | 'csv' | 'tsv';
  rowCount: number;
  columnCount: number;
  inferredHeaders: boolean;
  caption?: string;                              // HTML <caption> text
  htmlSummary?: string;                          // HTML summary attribute
  alignment?: Array<'left' | 'center' | 'right' | 'none'>;  // Markdown alignment
  hadMergedCells?: boolean;                      // True if rowspan/colspan was expanded
  originalHeaderLevels?: string[][];             // Multi-level headers before flattening
}

`TableChunk`

interface TableChunk {
  text: string;                  // Chunk text, ready for embedding
  metadata: TableChunkMetadata;  // Origin, range, and structural metadata
}

`TableChunkMetadata`

interface TableChunkMetadata {
  chunkIndex: number;
  totalChunks: number;
  tableIndex: number;
  rowRange?: [number, number];             // [startRow, endRow), exclusive end
  columnRange?: [number, number];          // [startCol, endCol), exclusive end
  headers: string[];
  sourceFormat: 'markdown' | 'html' | 'csv' | 'tsv';
  strategy: ChunkStrategy;
  serializationFormat?: SerializationFormat;
  tableRowCount: number;
  tableColumnCount: number;
  tokenCount: number;
  hadMergedCells?: boolean;
  oversized?: boolean;
  caption?: string;
  sectionLabel?: string;                   // Section-based strategy
  cellContext?: {                           // Cell-level strategy
    rowIdentifier: string;
    columnName: string;
  };
}

`TableRegion`

interface TableRegion {
  format: 'markdown' | 'html';
  startLine?: number;          // Markdown: zero-based first line
  endLine?: number;            // Markdown: zero-based last line
  startOffset?: number;        // HTML: character offset of <table>
  endOffset?: number;          // HTML: character offset after </table>
  estimatedRows: number;
  estimatedColumns: number;
  content?: string;            // Raw table string
}

Type aliases

type TableFormat = 'auto' | 'markdown' | 'html' | 'csv' | 'tsv';
type ChunkStrategy = 'row-based' | 'serialized' | 'column-based' | 'cell-level' | 'section-based' | 'whole-table';
type RowOutputFormat = 'markdown' | 'csv' | 'tsv' | 'plain';
type SerializationFormat = 'key-value' | 'newline' | 'sentence' | 'template';

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

table-chunk

Description

Installation

Quick Start

Serialized output for natural language embedding

Token-bounded chunking

Mixed-content document workflow

Features

API Reference

chunkTable(input, options?)

parseTable(input, format?)

chunk(table, options?)

detectTables(document, format?)

serializeRow(row, headers, options?)

estimateTokens(text)

createTableChunker(config)

Configuration

ChunkTableOptions

SerializeRowOptions

Chunking strategies

Error Handling

Advanced Usage

Custom token counter with tiktoken

Column-based chunking for wide tables

Section-based chunking by column value

Template-based serialization

HTML tables with merged cells and captions

Preserving cell HTML

Sentence serialization with custom subject

TypeScript

Exported types

Table

TableMetadata

TableChunk

TableChunkMetadata

TableRegion

Type aliases

License

`chunkTable(input, options?)`

`parseTable(input, format?)`

`chunk(table, options?)`

`detectTables(document, format?)`

`serializeRow(row, headers, options?)`

`estimateTokens(text)`

`createTableChunker(config)`

`ChunkTableOptions`

`SerializeRowOptions`

`Table`

`TableMetadata`

`TableChunk`

`TableChunkMetadata`

`TableRegion`