# json-to-arrow-ipc
Convert JSON data to Apache Arrow IPC format (without dictionary encoding, making it compatible with DuckDB's arrow extension).
## Installation

```bash
npm install json-to-arrow-ipc
```

## Why this package?

When using Apache Arrow's JavaScript library with `tableFromJSON()`, string columns are automatically dictionary-encoded. DuckDB's arrow extension doesn't support dictionary-encoded columns when reading Arrow IPC streams, resulting in errors like:

```
Schema message field with DictionaryEncoding not supported
```

This package provides functions that convert JSON to Arrow IPC format without dictionary encoding, making it fully compatible with DuckDB's arrow extension.
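For reference, here's a minimal reproduction of the problem using the `apache-arrow` package directly (the exact printed type name may vary by library version):

```typescript
import { tableFromJSON, tableToIPC } from 'apache-arrow';

// tableFromJSON() dictionary-encodes string columns by default.
const table = tableFromJSON([{ city: 'New York' }, { city: 'Boston' }]);

console.log(String(table.schema.fields[0].type));
// Prints a Dictionary<...> type; this is the encoding DuckDB's arrow
// extension rejects when reading the stream from tableToIPC(table, 'stream').
```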
## Features
- No dictionary encoding - Compatible with DuckDB's arrow extension
- Automatic type inference - Samples data to determine optimal column types
- Schema mismatch handling - Configurable behavior for dirty data
- Nested object support - Flatten or serialize as JSON strings
- Single-pass processing - Optimized for large datasets
- Full TypeScript support - Comprehensive type definitions
## Usage

### Zero-Install Scripts with Bun
Bun features transparent dependency installation - when you run a TypeScript file, Bun automatically installs any missing npm packages on the fly. This means you can create standalone `.ts` scripts that "just work" without running `npm install` first.
Simply create a script file and run it directly:

```typescript
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const data = [
  { name: 'John', age: 30, city: 'New York' },
  { name: 'Jane', age: 25, city: 'Los Angeles' }
];

const ipc = jsonToArrowIPC(data);
process.stdout.write(ipc);
```

```bash
# No npm install needed - Bun handles it automatically!
./my-script.ts
```

This is especially powerful when combined with DuckDB's shellfs extension, enabling dynamic data pipelines with zero setup.
### With DuckDB

Create a script (`fetch-data.ts`):

```typescript
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const response = await fetch('https://api.example.com/users');
const users = await response.json();

process.stdout.write(jsonToArrowIPC(users));
```

Then in DuckDB:
```sql
-- First, install and load required extensions
INSTALL shellfs FROM community;
INSTALL arrow FROM community;
LOAD shellfs;
LOAD arrow;

CREATE OR REPLACE MACRO bun(script, args := '') AS TABLE
SELECT * FROM read_arrow('bun ' || script || ' ' || args || ' |');

-- Query the data
SELECT * FROM bun('fetch-data.ts');
```

The trailing `|` is what makes this work: shellfs treats a filename ending in a pipe as a command to execute, and `read_arrow` consumes the Arrow IPC stream the script writes to stdout.
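Because `args` is appended to the command line, scripts can read flags from `process.argv`. Here's a hypothetical extension of `fetch-data.ts` (the `--limit` flag is illustrative, not part of this package):

```typescript
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';

// Invoked from DuckDB as: SELECT * FROM bun('fetch-data.ts', args := '--limit 100');
// shellfs then runs: bun fetch-data.ts --limit 100
const flagIndex = process.argv.indexOf('--limit');
const limit = flagIndex === -1 ? Infinity : Number(process.argv[flagIndex + 1]);

const response = await fetch('https://api.example.com/users');
const users = await response.json();

process.stdout.write(jsonToArrowIPC(users.slice(0, limit)));
```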
### Configuration Options

```typescript
import { jsonToArrowIPC, type JsonToArrowOptions } from 'json-to-arrow-ipc';

const options: JsonToArrowOptions = {
  // Number of rows to sample for schema inference (default: 100)
  schemaSampleSize: 50,

  // How to handle schema mismatches: 'error' | 'skip' | 'coerce' (default: 'coerce')
  onSchemaMismatch: 'skip',

  // Flatten nested objects with dot notation (default: false)
  flattenNestedObjects: true,

  // Serialize arrays as JSON strings (default: true)
  serializeArrays: true
};

const ipc = jsonToArrowIPC(data, options);
```

### Getting Conversion Statistics
```typescript
import { jsonToArrowTableWithStats } from 'json-to-arrow-ipc';

const { table, skippedRows, totalRows } = jsonToArrowTableWithStats(data, {
  onSchemaMismatch: 'skip'
});

console.log(`Converted ${totalRows - skippedRows}/${totalRows} rows`);
```

## API Reference
### `jsonToArrowIPC(data, options?)`
Converts JSON data directly to Arrow IPC stream format.
- `data`: Array of JSON objects or a single object
- `options`: Optional configuration (see below)
- Returns: `Uint8Array` containing Arrow IPC stream bytes
### `jsonToArrowTable(data, options?)`
Converts JSON data to an Apache Arrow Table.
- `data`: Array of JSON objects or a single object
- `options`: Optional configuration (see below)
- Returns: Apache Arrow `Table`
### `jsonToArrowTableWithStats(data, options?)`
Converts JSON data to an Arrow Table with conversion statistics.
- Returns: `{ table: Table, skippedRows: number, totalRows: number }`
### Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `schemaSampleSize` | `number` | `100` | Number of rows to sample for schema inference |
| `onSchemaMismatch` | `'error' \| 'skip' \| 'coerce'` | `'coerce'` | How to handle schema mismatches |
| `flattenNestedObjects` | `boolean` | `false` | Flatten nested objects with dot notation |
| `serializeArrays` | `boolean` | `true` | Serialize arrays as JSON strings |
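The three mismatch modes differ in how much dirty data survives the conversion. A short sketch of the intended behavior (the counts below are assumptions based on the option descriptions, not verified output):

```typescript
import { jsonToArrowTableWithStats } from 'json-to-arrow-ipc';

// Dirty data: the second row's "age" is a string, not a number.
const dirty = [
  { name: 'John', age: 30 },
  { name: 'Jane', age: 'twenty-five' }
];

// 'skip' drops rows that don't match the inferred schema.
const { skippedRows, totalRows } = jsonToArrowTableWithStats(dirty, {
  onSchemaMismatch: 'skip'
});
console.log(`${skippedRows}/${totalRows} skipped`); // expected: 1/2 skipped

// 'coerce' (the default) instead converts mismatched values where possible,
// and 'error' throws on the first mismatch.
```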
## Type Mapping
| JSON Type | Arrow Type |
|-----------|------------|
| string | Utf8 |
| ISO 8601 date strings | DateMillisecond |
| number (integer, 32-bit) | Int32 |
| number (integer, 64-bit) | Int64 |
| number (float) | Float64 |
| boolean | Bool |
| null | nullable column |
| Nested objects/arrays | Utf8 (JSON serialized) |
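To see the inference in action, you can inspect the schema of a converted table (the field names here are illustrative, and the printed type names come from the underlying apache-arrow `DataType.toString()`, which may vary by version):

```typescript
import { jsonToArrowTable } from 'json-to-arrow-ipc';

const table = jsonToArrowTable([
  { id: 1, score: 9.5, active: true, joined: '2024-01-15', meta: { plan: 'pro' } }
]);

// Print each inferred field name and its Arrow type.
for (const field of table.schema.fields) {
  console.log(`${field.name}: ${field.type}`);
}
// Per the mapping above: id -> Int32, score -> Float64, active -> Bool,
// joined -> DateMillisecond, meta -> Utf8 (JSON-serialized)
```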
## License
MIT
