@bitofsky/merge-streams v1.1.0

Merge multiple chunked streams (CSV, JSON_ARRAY, ARROW_STREAM) into one unified stream, perfect for Databricks External Links.


@bitofsky/merge-streams

When Databricks gives you 90+ presigned URLs, merge them into one.

Because nobody wants to explain to their MCP client why it needs to juggle dozens of chunk URLs.


Why I Made This

I was building an MCP Server that queries Databricks SQL for large datasets. I chose the External Links format because the INLINE format would blow up memory.

But then Databricks handed me back something like this:

chunk_0.arrow (presigned URL)
chunk_1.arrow (presigned URL)
chunk_2.arrow (presigned URL)
...
chunk_89.arrow (presigned URL)

My client would have to:

  1. Fetch each chunk sequentially
  2. Parse and merge them correctly (CSV headers? JSON array brackets? Arrow EOS markers?)
  3. Handle errors across 90 HTTP requests
  4. Pray nothing times out

That was unacceptable. So I built this.


The Solution

merge-streams takes those chunked External Links and merges them into a single, unified stream.

90+ presigned URLs → merge-streams → 1 clean stream → S3 → 1 presigned URL

Now my MCP client gets one URL. Done.
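
Here is a minimal sketch of that flow, assuming the merged output is staged in a temp file and then pushed to S3 with the AWS SDK v3 (@aws-sdk/client-s3, @aws-sdk/lib-storage, @aws-sdk/s3-request-presigner). The bucket, key, file path, and externalLinks variable are placeholders, and the S3 glue is my own, not part of this package:

import { createReadStream, createWriteStream } from 'node:fs'
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
import { Upload } from '@aws-sdk/lib-storage'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'
import { mergeStreamsFromUrls } from '@bitofsky/merge-streams'

// Presigned chunk URLs returned by Databricks (External Links format).
const externalLinks: string[] = [/* chunk_0 ... chunk_89 */]

// 1. Merge every chunk into one local file.
const tmpPath = '/tmp/query-123.csv'
await mergeStreamsFromUrls('CSV', {
  urls: externalLinks,
  output: createWriteStream(tmpPath),
})

// 2. Upload the merged file to S3 (multipart-capable, so size doesn't matter).
const s3 = new S3Client({})
const key = 'results/query-123.csv'
await new Upload({
  client: s3,
  params: { Bucket: 'my-bucket', Key: key, Body: createReadStream(tmpPath) },
}).done()

// 3. Hand the MCP client a single presigned URL for the merged object.
const oneUrl = await getSignedUrl(s3, new GetObjectCommand({ Bucket: 'my-bucket', Key: key }), {
  expiresIn: 3600,
})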

What Makes It Fast

  • Pre-connected: Next chunk's connection opens while current chunk streams. No idle time.
  • Zero accumulation: Pure stream piping. Memory stays flat regardless of data size.
  • Format-aware: Not byte concatenation — actual format understanding.
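
To make the pre-connection idea concrete, here is a rough sketch of the pattern (my illustration of the general technique, not this package's actual source; backpressure handling is omitted for brevity):

import { Readable, type Writable } from 'node:stream'

async function pipeChunksPreconnected(urls: string[], output: Writable): Promise<void> {
  // Open the first connection up front.
  let next: Promise<Response> | undefined = urls.length ? fetch(urls[0]) : undefined

  for (let i = 0; i < urls.length; i++) {
    const res = await next
    // Pre-connect: start fetching chunk i+1 while chunk i is still streaming.
    next = i + 1 < urls.length ? fetch(urls[i + 1]) : undefined

    if (!res?.body) throw new Error(`Empty response for chunk ${i}`)
    for await (const chunk of Readable.fromWeb(res.body as any)) {
      output.write(chunk) // naive byte piping; the real library is format-aware
    }
  }
  output.end()
}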

Features

  • CSV: Automatically deduplicates headers across chunks
  • JSON_ARRAY: Properly concatenates JSON arrays (handles brackets and commas)
  • ARROW_STREAM: Merges Arrow IPC streams batch-by-batch (doesn't just byte-concat)
  • Memory-efficient: Streaming-based, never loads entire files into memory
  • AbortSignal support: Cancel mid-stream when needed
  • Progress tracking: Monitor merge progress with byte-level granularity

Installation

npm install @bitofsky/merge-streams

Requires Node.js 20+ (uses native fetch() and Readable.fromWeb())


Quick Start: The Databricks Use Case

See test/databricks.spec.ts for a complete working example.

# Run the integration test
DATABRICKS_TOKEN=dapi... \
DATABRICKS_HOST=xxx.cloud.databricks.com \
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxx \
npm test -- test/databricks.spec.ts

API

URL-based (for Databricks External Links)

import { mergeStreamsFromUrls } from '@bitofsky/merge-streams'

// urls: string[] of presigned chunk URLs, output: any Node.js Writable
await mergeStreamsFromUrls('CSV', { urls, output })
await mergeStreamsFromUrls('JSON_ARRAY', { urls, output })
await mergeStreamsFromUrls('ARROW_STREAM', { urls, output })

With AbortSignal

const controller = new AbortController()

await mergeStreamsFromUrls('CSV', {
  urls,
  output,
  signal: controller.signal,
})

// Cancel anytime
controller.abort()
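
A timeout is just another AbortSignal; for example (my sketch, using AbortSignal.timeout() from Node.js 18+, with an arbitrary 30-second limit):

// Abort automatically if the merge takes longer than 30 seconds.
await mergeStreamsFromUrls('CSV', {
  urls,
  output,
  signal: AbortSignal.timeout(30_000),
})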

With Progress Tracking

await mergeStreamsFromUrls('CSV', {
  urls,
  output,
  onProgress: ({ inputIndex, totalInputs, inputedBytes, mergedBytes }) => {
    console.log(`Processing ${inputIndex + 1}/${totalInputs}: ${inputedBytes} bytes read, ${mergedBytes} bytes merged`)
  },
})
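
Progress callbacks are throttled to once per second by default; progressIntervalMs (see Types below) tunes that. For example, a chunk-level percentage, assuming urls and output are defined as above:

await mergeStreamsFromUrls('CSV', {
  urls,
  output,
  progressIntervalMs: 250, // report up to four times per second
  onProgress: ({ inputIndex, totalInputs }) => {
    const pct = Math.round(((inputIndex + 1) / totalInputs) * 100)
    console.log(`~${pct}% of chunks processed`)
  },
})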

Stream-based (for custom input sources)

import { mergeStreams, mergeCsv, mergeJson, mergeArrow } from '@bitofsky/merge-streams'

// Using unified API
await mergeStreams('CSV', { inputs, output })

// Or use format-specific functions directly
await mergeCsv({ inputs, output, signal })
await mergeJson({ inputs, output, signal })
await mergeArrow({ inputs, output, signal })

Inputs can be:

  • Readable streams directly
  • Sync factories: () => Readable
  • Async factories: () => Promise<Readable> (recommended for lazy fetching)
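
For example, lazy async factories that open each HTTP connection only when the merger reaches that chunk (my own sketch; urls and output are assumed to be defined as in the earlier examples):

import { Readable } from 'node:stream'
import { mergeStreams } from '@bitofsky/merge-streams'

// Each factory runs only when its chunk is needed, so connections open lazily.
const inputs = urls.map((url: string) => async () => {
  const res = await fetch(url)
  if (!res.ok || !res.body) throw new Error(`Failed to fetch ${url}: ${res.status}`)
  return Readable.fromWeb(res.body as any) // cast smooths over web/node stream typings
})

await mergeStreams('CSV', { inputs, output })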

Format Details

| Format       | Behavior                                                            |
|--------------|---------------------------------------------------------------------|
| CSV          | Writes header once, skips duplicate headers from subsequent chunks  |
| JSON_ARRAY   | Wraps in [], strips brackets from chunks, inserts commas            |
| ARROW_STREAM | Re-encodes RecordBatches into single IPC stream (not byte-concat)   |


Types

import type { Readable, Writable } from 'node:stream'

type MergeFormat = 'ARROW_STREAM' | 'CSV' | 'JSON_ARRAY'
type InputSource = Readable | (() => Readable) | (() => Promise<Readable>)

interface MergeOptions {
  inputs: InputSource[]
  output: Writable
  signal?: AbortSignal
  onProgress?: (progress: MergeOptionsProgress) => void
  progressIntervalMs?: number  // Throttle interval (default: 1000, 0 = no throttle)
}

interface MergeOptionsProgress {
  inputIndex: number    // Index of the input being processed
  totalInputs: number   // Total number of inputs
  inputedBytes: number  // Total bytes read from all inputs
  mergedBytes: number   // Total bytes written to output
}

function mergeStreams(
  format: MergeFormat,
  options: MergeOptions
): Promise<void>

function mergeStreamsFromUrls(
  format: MergeFormat,
  options: { urls: string[]; output: Writable; signal?: AbortSignal; onProgress?: (progress: MergeOptionsProgress) => void; progressIntervalMs?: number }
): Promise<void>

Why Not Just Byte-Concatenate?

  • CSV: You'd get duplicate headers scattered throughout
  • JSON_ARRAY: [1,2][3,4] is not valid JSON
  • Arrow: Most Arrow readers stop at the first EOS marker

Each format needs format-aware merging. That's what this library does.
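
A tiny self-contained demo of the JSON_ARRAY case (my own example, not from the package docs): two chunks that would byte-concat to the invalid [1,2][3,4] should merge into one valid array:

import { Readable } from 'node:stream'
import { mergeStreams } from '@bitofsky/merge-streams'

await mergeStreams('JSON_ARRAY', {
  inputs: [
    () => Readable.from(['[1,2]']),
    () => Readable.from(['[3,4]']),
  ],
  output: process.stdout, // expected output: [1,2,3,4]
})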


Scope

This library was born from a specific pain point: making Databricks External Links usable in MCP Server development. It does that one thing well.

If you have other use cases in mind, PRs are welcome.


License

MIT