xlsx-stream-rows
Streaming, chunked, low-memory spreadsheet reader for the browser. Reads XLSX, CSV, and XLS files row-by-row without loading the entire file into memory — a 1 GB workbook reads in roughly the same memory envelope as a 1 MB one.
Zero runtime dependencies. TypeScript. ESM + CJS. Works in browsers and Web Workers.
npm install xlsx-stream-rows
Why
Most browser-side spreadsheet libraries materialise the entire file before yielding a single row. A 300 MB XLSX usually peaks at 1–2 GB of JS heap and crashes the tab. xlsx-stream-rows treats File as a handle, not a byte array: it reads only the ZIP Central Directory at the end of the archive, then pipes the target sheet through DecompressionStream into an incremental SAX parser, emitting rows lazily, on demand.
If you've been searching for terms like streaming xlsx parser, chunked spreadsheet reader, lazy xlsx, partial xlsx parsing, incremental xlsx reader, async iterator over xlsx rows, row-by-row xlsx, on-demand spreadsheet loading, memory-bounded xlsx, or read large xlsx in browser without OOM — that's what this library does.
Features
- True streaming. AsyncIterable<Row> — pull rows on demand, stop whenever. No batch parse, no buffered intermediate result.
- Bounded memory. Peak heap is proportional to the data you consume, not the file size. Measured on a 1,000,000-row XLSX (51 MiB uncompressed sheet, 7.2 MiB on disk): 1.5 MiB peak heap growth, ~34× smaller than the sheet (tests/memory.smoke.test.ts).
- Three stop mechanisms, all equivalent. maxRows, break out of for await, or AbortSignal. The pipeline tears down promptly: no further bytes fetched, decompressed, or parsed.
- Format coverage. XLSX (streaming), CSV (streaming), XLS (delegated to the optional xlsx peer dep, bounded by xlsMaxBytes).
- Auto-detect. ZIP / OLE2 magic-byte sniff with filename extension as fallback (see the sketch after this list).
- Standards-grounded. Every byte offset and XML path traces to PKWARE APPNOTE.TXT (ZIP) or ECMA-376 (OOXML), not reverse-engineered from other libraries.
- OPC-correct. Resolves package parts via _rels/.rels indirection rather than hardcoded paths, so files from LibreOffice, Google Sheets, or custom generators work too.
- Strict + transitional schemas. Both SpreadsheetML namespaces accepted. All ST_CellType values handled (n, s, str, inlineStr, b, e, d).
- Worker-safe. No DOM APIs. Runs inside a Web Worker.
- Zero dependencies for XLSX/CSV. Web Platform APIs only (File, Blob, DecompressionStream, TextDecoderStream).
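For illustration, here is a minimal sketch of such a magic-byte sniff; detectFormat is a hypothetical helper for this README, not the library's exported API:
// Hypothetical helper (not exported by the library): sniff the container format
// from the first 8 bytes, falling back to the filename extension.
// ZIP archives (XLSX/XLSM) start with 50 4B 03 04 ("PK\x03\x04");
// OLE2 compound files (legacy XLS) start with D0 CF 11 E0 A1 B1 1A E1.
async function detectFormat(file: File): Promise<'xlsx' | 'xls' | 'csv' | 'unknown'> {
  const head = new Uint8Array(await file.slice(0, 8).arrayBuffer());
  const zip = [0x50, 0x4b, 0x03, 0x04];
  const ole2 = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
  if (zip.every((b, i) => head[i] === b)) return 'xlsx';
  if (ole2.every((b, i) => head[i] === b)) return 'xls';
  if (/\.csv$/i.test(file.name)) return 'csv';
  return 'unknown';
}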
Quick start
import { openWorkbook, streamRows, readRows } from 'xlsx-stream-rows';
// 1. List sheets without reading row data (≈ 100 KiB read regardless of file size).
const info = await openWorkbook(file);
console.log(info.sheetNames, info.format); // ['Sheet1', 'Data'], 'xlsx'
// 2. Stream rows one at a time. Memory stays bounded for files of any size.
for await (const row of streamRows(file, { maxRows: 100 })) {
console.log(row); // (string | number | boolean | Date | null)[]
}
// 3. Convenience: collect into an array.
const rows = await readRows(file, { sheetName: 'Data', maxRows: 1000 });
Bounded read — preview the first N rows of a huge file
// Reads only the bytes needed for the first 50 rows. The decompression
// stream is cancelled as soon as the limit is hit; nothing past that point
// is fetched from the file.
const preview = await readRows(file, { maxRows: 50 });
Stop on a predicate
// `break` cancels the underlying pipeline — no further bytes are read.
for await (const row of streamRows(file)) {
if (row[0] === 'END') break;
process(row);
}
Cancel from the outside
const ac = new AbortController();
cancelButton.onclick = () => ac.abort();
try {
for await (const row of streamRows(file, { signal: ac.signal })) {
insertIntoTable(row);
}
} catch (e) {
if (e === ac.signal.reason) console.log('user cancelled');
else throw e;
}
AbortSignal cancels in-flight metadata fetches too — you can abort before the first row is yielded (e.g. while sharedStrings.xml is still downloading).
Progress reporting
The pull-based design makes progress trivial — count rows yourself, or report file-position progress via your UI:
let count = 0;
const total = Math.ceil(file.size / 100); // your own rough estimate; the API does not report row counts
for await (const row of streamRows(file)) {
count++;
if (count % 1000 === 0) updateProgress(count, total);
await processRow(row);
}
For byte-level progress you can wrap the input file with a Proxy over slice() that reports cumulative bytes; a rough sketch follows, and you can ask in an issue if you need a fuller example.
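One possible shape of that wrapper. withByteProgress is illustrative, not part of the library, and it assumes the reader pulls sheet bytes through slice() (true for XLSX; CSV goes through stream() instead):
// Illustrative sketch: wrap a File in a Proxy that counts the bytes requested
// via slice() and reports them as coarse progress. Not part of the library API.
function withByteProgress(file: File, onProgress: (readBytes: number, totalBytes: number) => void): File {
  let cumulative = 0;
  return new Proxy(file, {
    get(target, prop) {
      if (prop === 'slice') {
        return (start = 0, end = target.size, contentType?: string) => {
          cumulative += Math.max(0, Math.min(end, target.size) - Math.max(start, 0));
          onProgress(cumulative, target.size);
          return target.slice(start, end, contentType);
        };
      }
      const value = Reflect.get(target, prop, target);
      return typeof value === 'function' ? value.bind(target) : value;
    },
  });
}
// Usage: for await (const row of streamRows(withByteProgress(file, updateBar))) { … }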
Run inside a Web Worker
The library has no DOM dependencies, so move heavy spreadsheet imports off the main thread:
// worker.ts
import { streamRows } from 'xlsx-stream-rows';
self.onmessage = async (e: MessageEvent<File>) => {
const file = e.data;
for await (const row of streamRows(file, { maxRows: 10_000 })) {
self.postMessage({ type: 'row', row });
}
self.postMessage({ type: 'done' });
};
// main.ts
const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' });
worker.onmessage = (e) => { /* … */ };
worker.postMessage(file); // File is structured-cloneable
Pick a sheet by name
const info = await openWorkbook(file);
console.log(info.sheetNames); // ['Summary', 'Q1', 'Q2', 'Q3', 'Q4']
for await (const row of streamRows(file, { sheetName: 'Q3' })) {
// …
}
Date detection (XLSX)
XLSX stores dates as numbers styled with a date format. By default, numeric cells whose style references a date numFmtId (built-in 14–22, 27–36, 45–47, 50–58, plus custom <numFmt> containing date tokens) are returned as Date. Disable this if you want raw serial numbers:
const rows = await readRows(file, { parseDates: false }); // 44197 instead of new Date(2021, 0, 1)
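If you keep raw serials, converting one yourself is straightforward. A minimal sketch assuming the default 1900 date system (the 1904 system used by some legacy Mac workbooks is not covered):
// Convert an Excel serial date to a JS Date in UTC. Day 0 is treated as
// 1899-12-30 so that serials from 61 upward (1900-03-01 onward) land on the
// right calendar day despite Excel's phantom 1900-02-29.
function excelSerialToDate(serial: number): Date {
  const msPerDay = 86_400_000;
  return new Date(Date.UTC(1899, 11, 30) + serial * msPerDay);
}
excelSerialToDate(44197); // 2021-01-01T00:00:00.000Z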
CSV with non-UTF encoding
UTF-8, UTF-16 LE, and UTF-16 BE BOMs are auto-detected. For other encodings (e.g. Windows-1251 / cp1251), pass csvEncoding:
const rows = await readRows(file, { csvEncoding: 'windows-1251' });
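If csvEncoding is handed to the platform's TextDecoder/TextDecoderStream machinery (TextDecoderStream is among the Web APIs the library lists), you can fail fast on an unknown label with a small feature check; isEncodingSupported is a hypothetical helper, not part of the API:
// Hypothetical helper: check whether the runtime recognises an encoding label
// before passing it as csvEncoding. TextDecoder throws a RangeError for
// labels it does not support.
function isEncodingSupported(label: string): boolean {
  try {
    new TextDecoder(label);
    return true;
  } catch {
    return false;
  }
}
isEncodingSupported('windows-1251'); // true in current major browsers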
Format support
| Format | Streaming | Memory peak | Optional dependency |
|--------|-----------|-------------|---------------------|
| XLSX / XLSM | yes | ≈ sharedStrings size + few MiB | none |
| CSV | yes | ≈ one row + decoder window | none |
| XLS | no — full file load | ≤ xlsMaxBytes (default 50 MiB) | xlsx peer dep |
Install xlsx only if you need XLS support:
npm install xlsx
API
type CellValue = string | number | boolean | Date | null;
type Row = CellValue[];
interface WorkbookInfo {
filename: string;
sheetNames: string[];
format: 'xlsx' | 'xls' | 'csv';
}
interface ReadOptions {
/** Stop after yielding this many rows. */
maxRows?: number;
/** Sheet to read. Default: first sheet. Ignored for CSV. */
sheetName?: string;
/** XLSX/XLS only: convert numeric date-styled cells to `Date`. Default true. */
parseDates?: boolean;
/** XLSX only: cap on `xl/sharedStrings.xml` uncompressed size. Default 64 MiB. */
sharedStringsMaxBytes?: number;
/** XLS only: cap on the file size loaded into memory. Default 50 MiB. */
xlsMaxBytes?: number;
/** CSV only: text encoding. UTF-8/16 BOMs are auto-detected. Default 'utf-8'. */
csvEncoding?: string;
/** Cancel the read at any point — including before the first row is yielded. */
signal?: AbortSignal;
}
function openWorkbook(file: File): Promise<WorkbookInfo>;
function streamRows(file: File, options?: ReadOptions): AsyncIterable<Row>;
function readRows(file: File, options?: ReadOptions): Promise<Row[]>;
Per-format adapters (openXlsxWorkbook, streamXlsxRows, openCsvWorkbook, streamCsvRows, openXlsWorkbook, streamXlsRows) and lower-level building blocks (readZipEntries, createRowParser, createCsvParser, resolvePackagePaths, …) are also exported for power users.
Errors
All errors inherit from XlsxStreamError so callers can catch the family with one instanceof check.
| Error | When |
|-------|------|
| NotAZipError | EOCD signature not found within 65,557 bytes of EOF |
| Zip64NotSupportedError | File uses ZIP64 extensions (>4 GiB / >65,535 entries) |
| InvalidLocalHeaderError | LFH signature mismatch — file corrupted |
| UnsupportedCompressionError | ZIP method is not 0 (stored) or 8 (deflate) |
| InvalidOpcPackageError | Missing _rels/.rels or officeDocument relationship |
| SharedStringsTooLargeError | sharedStrings.xml exceeds sharedStringsMaxBytes |
| SheetNotFoundError | Requested sheetName not in workbook |
| EntryTooLargeError | Bounded entry read exceeded its uncompressed cap |
| XlsFileTooLargeError | XLS file exceeds xlsMaxBytes |
| XlsxPackageMissingError | XLS read attempted without the xlsx peer dep installed |
| FormatNotSupportedError | Magic + extension both unrecognised |
Pipeline cancellation is not an XlsxStreamError: hitting maxRows or calling the iterator's return() simply ends the iteration, while an AbortSignal makes it reject with signal.reason.
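A sketch of how a caller might branch on that hierarchy, assuming the error classes are exported from the package root alongside the reader functions (renderTable and showToast are placeholders for your own UI):
import { readRows, XlsxStreamError, SheetNotFoundError } from 'xlsx-stream-rows';

try {
  const rows = await readRows(file, { sheetName: 'Data' });
  renderTable(rows);
} catch (e) {
  if (e instanceof SheetNotFoundError) {
    showToast('This workbook has no "Data" sheet.');
  } else if (e instanceof XlsxStreamError) {
    // Any other spreadsheet-level failure (corrupt ZIP, oversized part, …).
    showToast(`Could not read the spreadsheet: ${e.message}`);
  } else {
    throw e; // not a spreadsheet problem, rethrow
  }
}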
How it works
XLSX files are ZIP archives of XML parts. The Central Directory at the end of the archive lists every entry's offset and size, so:
- Fetch the trailing 64 KiB of the file → locate the EOCD record → read the Central Directory.
- Resolve the workbook part via OPC relationships (_rels/.rels → xl/_rels/workbook.xml.rels).
- Open a ReadableStream over the target sheet's compressed bytes (Blob.slice().stream()), pipe through DecompressionStream('deflate-raw') and TextDecoderStream, and feed the result to a hand-rolled SAX state machine that emits rows as </row> closes.
- Cancelling the iterator (maxRows, break, AbortSignal) cancels the reader, which propagates up the pipeline — no more bytes are fetched.
For CSV: same shape, simpler — file.stream() → TextDecoderStream → RFC-4180 parser.
For XLS: no streaming primitive exists in the BIFF/OLE2 format, so we delegate to the xlsx package and bound the file size to keep memory predictable.
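For the curious, a simplified sketch of the first step. findEocdOffset is illustrative, not the library's internal function, and it assumes the archive's ZIP comment fits in the scanned tail:
// Scan the tail of the file for the End of Central Directory signature
// 0x06054b50 (bytes 50 4B 05 06 on disk). A ZIP comment can be at most
// 65,535 bytes, so the EOCD record always sits in the last 65,557 bytes.
async function findEocdOffset(file: File): Promise<number> {
  const tailSize = Math.min(file.size, 65_557);
  const tailStart = file.size - tailSize;
  const tail = new Uint8Array(await file.slice(tailStart).arrayBuffer());
  for (let i = tail.length - 22; i >= 0; i--) { // EOCD is at least 22 bytes long
    if (tail[i] === 0x50 && tail[i + 1] === 0x4b && tail[i + 2] === 0x05 && tail[i + 3] === 0x06) {
      return tailStart + i; // absolute offset of the EOCD record
    }
  }
  throw new Error('EOCD signature not found: not a ZIP archive');
}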
Limitations
- No ZIP64. Files with the Central Directory or any individual entry over 4 GiB, or with more than 65,535 entries, are rejected with Zip64NotSupportedError. In practice you can comfortably read 5–10 GB of uncompressed sheet data; multi-GB compressed XLSX with monster sharedStrings tables are the edge case to watch.
- No formula evaluation. Formula cells return the cached <v> written by the producer; if absent, null. We do not re-evaluate.
- No merged-cell expansion. Cells appear exactly where the XML places them.
- CSV is comma-only. Semicolon and tab dialects are out of scope — preprocess if needed (a naive preprocessing sketch follows this list).
- No password-protected files.
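A naive preprocessing sketch for the semicolon case, for illustration only: it respects double-quoted fields but reads the whole file as text, so it only makes sense for modest file sizes.
// Convert a semicolon-delimited CSV File into a comma-delimited one before
// handing it to streamRows/readRows. Assumes fields containing literal commas
// are quoted. Loads the entire file into memory; not suitable for huge files.
async function semicolonToCommaCsv(file: File): Promise<File> {
  const text = await file.text();
  let out = '';
  let inQuotes = false;
  for (const ch of text) {
    if (ch === '"') inQuotes = !inQuotes;
    out += ch === ';' && !inQuotes ? ',' : ch;
  }
  return new File([out], file.name, { type: 'text/csv' });
}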
Standards
- ZIP container — PKWARE APPNOTE.TXT 6.3.10
- OOXML — ECMA-376 5th Edition Part 1 (2016) (SpreadsheetML) and Part 2 (2021) (OPC)
- CSV — RFC 4180
Try it in your browser
The repo includes a self-contained playground:
git clone https://github.com/gudoshnikovn/xlsx-stream-rows
cd xlsx-stream-rows
npm install && npm run build
npx serve . # or any static server
# open http://localhost:3000/examples/playground.html
Drop a real spreadsheet in, set maxRows, watch it stream.
Development
npm install
npm test # Node test suite (≈ 150 tests, including fuzz)
npm run test:browser # same suite in real Chromium / Firefox / WebKit via Playwright
npm run test:memory # 1M-row memory smoke (set --pool=forks for forced GC)
npm run build # tsup → dist/ (ESM + CJS + d.ts)
npm run typecheck
Requires Node 20+ for File, Blob.stream(), and DecompressionStream globals. CI runs on Node 20 / 22 plus all three browsers.
License
MIT
