@bitofsky/merge-streams v1.1.0

Merge multiple chunked streams (CSV, JSON_ARRAY, ARROW_STREAM) into one unified stream, perfect for Databricks External Links.


@bitofsky/merge-streams

When Databricks gives you 90+ presigned URLs, merge them into one.

Because nobody wants to explain to their MCP client why it needs to juggle dozens of chunk URLs.


Why I Made This

I was building an MCP Server that queries Databricks SQL for large datasets. I chose the External Links format because the INLINE format would blow up memory.

But then Databricks handed me back something like this:

chunk_0.arrow (presigned URL)
chunk_1.arrow (presigned URL)
chunk_2.arrow (presigned URL)
...
chunk_89.arrow (presigned URL)

My client would have to:

  1. Fetch each chunk sequentially
  2. Parse and merge them correctly (CSV headers? JSON array brackets? Arrow EOS markers?)
  3. Handle errors across 90 HTTP requests
  4. Pray nothing times out

That was unacceptable. So I built this.


The Solution

merge-streams takes those chunked External Links and merges them into a single, unified stream.

90+ presigned URLs → merge-streams → 1 clean stream → S3 → 1 presigned URL

Now my MCP client gets one URL. Done.
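
Here is a minimal sketch of that flow, assuming the merged output is staged in a temp file and then pushed to S3 with the AWS SDK v3 (@aws-sdk/client-s3, @aws-sdk/lib-storage, @aws-sdk/s3-request-presigner). The bucket, key, file path, and externalLinks variable are placeholders, and the S3 glue is my own, not part of this package:

import { createReadStream, createWriteStream } from 'node:fs'
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
import { Upload } from '@aws-sdk/lib-storage'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'
import { mergeStreamsFromUrls } from '@bitofsky/merge-streams'

// Presigned chunk URLs returned by Databricks (External Links format).
const externalLinks: string[] = [/* chunk_0 ... chunk_89 */]

// 1. Merge every chunk into one local file.
const tmpPath = '/tmp/query-123.csv'
await mergeStreamsFromUrls('CSV', {
  urls: externalLinks,
  output: createWriteStream(tmpPath),
})

// 2. Upload the merged file to S3 (multipart-capable, so size doesn't matter).
const s3 = new S3Client({})
const key = 'results/query-123.csv'
await new Upload({
  client: s3,
  params: { Bucket: 'my-bucket', Key: key, Body: createReadStream(tmpPath) },
}).done()

// 3. Hand the MCP client a single presigned URL for the merged object.
const oneUrl = await getSignedUrl(s3, new GetObjectCommand({ Bucket: 'my-bucket', Key: key }), {
  expiresIn: 3600,
})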

What Makes It Fast

  • Pre-connected: Next chunk's connection opens while current chunk streams. No idle time.
  • Zero accumulation: Pure stream piping. Memory stays flat regardless of data size.
  • Format-aware: Not byte concatenation — actual format understanding.
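
To make the pre-connection idea concrete, here is a rough sketch of the pattern (my illustration of the general technique, not this package's actual source; backpressure handling is omitted for brevity):

import { Readable, type Writable } from 'node:stream'

async function pipeChunksPreconnected(urls: string[], output: Writable): Promise<void> {
  // Open the first connection up front.
  let next: Promise<Response> | undefined = urls.length ? fetch(urls[0]) : undefined

  for (let i = 0; i < urls.length; i++) {
    const res = await next
    // Pre-connect: start fetching chunk i+1 while chunk i is still streaming.
    next = i + 1 < urls.length ? fetch(urls[i + 1]) : undefined

    if (!res?.body) throw new Error(`Empty response for chunk ${i}`)
    for await (const chunk of Readable.fromWeb(res.body as any)) {
      output.write(chunk) // naive byte piping; the real library is format-aware
    }
  }
  output.end()
}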

Features

  • CSV: Automatically deduplicates headers across chunks
  • JSON_ARRAY: Properly concatenates JSON arrays (handles brackets and commas)
  • ARROW_STREAM: Merges Arrow IPC streams batch-by-batch (doesn't just byte-concat)
  • Memory-efficient: Streaming-based, never loads entire files into memory
  • AbortSignal support: Cancel mid-stream when needed
  • Progress tracking: Monitor merge progress with byte-level granularity

Installation

npm install @bitofsky/merge-streams

Requires Node.js 20+ (uses native fetch() and Readable.fromWeb())


Quick Start: The Databricks Use Case

See test/databricks.spec.ts for a complete working example.

# Run the integration test
DATABRICKS_TOKEN=dapi... \
DATABRICKS_HOST=xxx.cloud.databricks.com \
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxx \
npm test -- test/databricks.spec.ts

API

URL-based (for Databricks External Links)

import { mergeStreamsFromUrls } from '@bitofsky/merge-streams'

// urls: string[] of presigned chunk URLs, output: any Node.js Writable
await mergeStreamsFromUrls('CSV', { urls, output })
await mergeStreamsFromUrls('JSON_ARRAY', { urls, output })
await mergeStreamsFromUrls('ARROW_STREAM', { urls, output })

With AbortSignal

const controller = new AbortController()

await mergeStreamsFromUrls('CSV', {
  urls,
  output,
  signal: controller.signal,
})

// Cancel anytime
controller.abort()
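
A timeout is just another AbortSignal; for example (my sketch, using AbortSignal.timeout() from Node.js 18+, with an arbitrary 30-second limit):

// Abort automatically if the merge takes longer than 30 seconds.
await mergeStreamsFromUrls('CSV', {
  urls,
  output,
  signal: AbortSignal.timeout(30_000),
})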

With Progress Tracking

await mergeStreamsFromUrls('CSV', {
  urls,
  output,
  onProgress: ({ inputIndex, totalInputs, inputedBytes, mergedBytes }) => {
    console.log(`Processing ${inputIndex + 1}/${totalInputs}: ${inputedBytes} bytes read, ${mergedBytes} bytes merged`)
  },
})
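
Progress callbacks are throttled to once per second by default; progressIntervalMs (see Types below) tunes that. For example, a chunk-level percentage, assuming urls and output are defined as above:

await mergeStreamsFromUrls('CSV', {
  urls,
  output,
  progressIntervalMs: 250, // report up to four times per second
  onProgress: ({ inputIndex, totalInputs }) => {
    const pct = Math.round(((inputIndex + 1) / totalInputs) * 100)
    console.log(`~${pct}% of chunks processed`)
  },
})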

Stream-based (for custom input sources)

import { mergeStreams, mergeCsv, mergeJson, mergeArrow } from '@bitofsky/merge-streams'

// Using unified API
await mergeStreams('CSV', { inputs, output })

// Or use format-specific functions directly
await mergeCsv({ inputs, output, signal })
await mergeJson({ inputs, output, signal })
await mergeArrow({ inputs, output, signal })

Inputs can be:

  • Readable streams directly
  • Sync factories: () => Readable
  • Async factories: () => Promise<Readable> (recommended for lazy fetching)
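
For example, lazy async factories that open each HTTP connection only when the merger reaches that chunk (my own sketch; urls and output are assumed to be defined as in the earlier examples):

import { Readable } from 'node:stream'
import { mergeStreams } from '@bitofsky/merge-streams'

// Each factory runs only when its chunk is needed, so connections open lazily.
const inputs = urls.map((url: string) => async () => {
  const res = await fetch(url)
  if (!res.ok || !res.body) throw new Error(`Failed to fetch ${url}: ${res.status}`)
  return Readable.fromWeb(res.body as any) // cast smooths over web/node stream typings
})

await mergeStreams('CSV', { inputs, output })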

Format Details

| Format       | Behavior                                                            |
|--------------|---------------------------------------------------------------------|
| CSV          | Writes header once, skips duplicate headers from subsequent chunks  |
| JSON_ARRAY   | Wraps in [], strips brackets from chunks, inserts commas            |
| ARROW_STREAM | Re-encodes RecordBatches into single IPC stream (not byte-concat)   |


Types

import type { Readable, Writable } from 'node:stream'

type MergeFormat = 'ARROW_STREAM' | 'CSV' | 'JSON_ARRAY'
type InputSource = Readable | (() => Readable) | (() => Promise<Readable>)

interface MergeOptions {
  inputs: InputSource[]
  output: Writable
  signal?: AbortSignal
  onProgress?: (progress: MergeOptionsProgress) => void
  progressIntervalMs?: number  // Throttle interval (default: 1000, 0 = no throttle)
}

interface MergeOptionsProgress {
  inputIndex: number    // Index of the input being processed
  totalInputs: number   // Total number of inputs
  inputedBytes: number  // Total bytes read from all inputs
  mergedBytes: number   // Total bytes written to output
}

function mergeStreams(
  format: MergeFormat,
  options: MergeOptions
): Promise<void>

function mergeStreamsFromUrls(
  format: MergeFormat,
  options: { urls: string[]; output: Writable; signal?: AbortSignal; onProgress?: (progress: MergeOptionsProgress) => void; progressIntervalMs?: number }
): Promise<void>

Why Not Just Byte-Concatenate?

  • CSV: You'd get duplicate headers scattered throughout
  • JSON_ARRAY: [1,2][3,4] is not valid JSON
  • Arrow: Most Arrow readers stop at the first EOS marker

Each format needs format-aware merging. That's what this library does.
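
A tiny self-contained demo of the JSON_ARRAY case (my own example, not from the package docs): two chunks that would byte-concat to the invalid [1,2][3,4] should merge into one valid array:

import { Readable } from 'node:stream'
import { mergeStreams } from '@bitofsky/merge-streams'

await mergeStreams('JSON_ARRAY', {
  inputs: [
    () => Readable.from(['[1,2]']),
    () => Readable.from(['[3,4]']),
  ],
  output: process.stdout, // expected output: [1,2,3,4]
})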


Scope

This library was born from a specific pain point: making Databricks External Links usable in MCP Server development. It does that one thing well.

If you have other use cases in mind, PRs are welcome.


License

MIT