# json-to-arrow-ipc
Convert JSON data to Apache Arrow IPC format (without dictionary encoding, making it compatible with DuckDB's arrow extension).
## Installation

```bash
npm install json-to-arrow-ipc
```

## Why this package?

When using Apache Arrow's JavaScript library with `tableFromJSON()`, string columns are automatically dictionary-encoded. DuckDB's arrow extension doesn't support dictionary-encoded columns when reading Arrow IPC streams, resulting in errors like:

```
Schema message field with DictionaryEncoding not supported
```

This package provides functions that convert JSON to Arrow IPC format without dictionary encoding, making it fully compatible with DuckDB's arrow extension.
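For reference, here's a minimal reproduction of the problem using the `apache-arrow` package directly (the exact printed type name may vary by library version):

```typescript
import { tableFromJSON, tableToIPC } from 'apache-arrow';

// tableFromJSON() dictionary-encodes string columns by default.
const table = tableFromJSON([{ city: 'New York' }, { city: 'Boston' }]);

console.log(String(table.schema.fields[0].type));
// Prints a Dictionary<...> type; this is the encoding DuckDB's arrow
// extension rejects when reading the stream from tableToIPC(table, 'stream').
```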
## Features
- No dictionary encoding - Compatible with DuckDB's arrow extension
- Automatic type inference - Samples data to determine optimal column types
- Schema mismatch handling - Configurable behavior for dirty data
- Nested object support - Flatten or serialize as JSON strings
- Single-pass processing - Optimized for large datasets
- Full TypeScript support - Comprehensive type definitions
## Usage

### Zero-Install Scripts with Bun
Bun features transparent dependency installation - when you run a TypeScript file, Bun automatically installs any missing npm packages on the fly. This means you can create standalone `.ts` scripts that "just work" without running `npm install` first.
Simply create a script file and run it directly:

```typescript
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const data = [
  { name: 'John', age: 30, city: 'New York' },
  { name: 'Jane', age: 25, city: 'Los Angeles' }
];

const ipc = jsonToArrowIPC(data);
process.stdout.write(ipc);
```

```bash
# No npm install needed - Bun handles it automatically!
./my-script.ts
```

This is especially powerful when combined with DuckDB's shellfs extension, enabling dynamic data pipelines with zero setup.
### With DuckDB

Create a script (`fetch-data.ts`):

```typescript
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const response = await fetch('https://api.example.com/users');
const users = await response.json();

process.stdout.write(jsonToArrowIPC(users));
```

Then in DuckDB:
```sql
-- First, install and load required extensions
INSTALL shellfs FROM community;
INSTALL arrow FROM community;
LOAD shellfs;
LOAD arrow;

CREATE OR REPLACE MACRO bun(script, args := '') AS TABLE
SELECT * FROM read_arrow('bun ' || script || ' ' || args || ' |');

-- Query the data
SELECT * FROM bun('fetch-data.ts');
```

The trailing `|` is what makes this work: shellfs treats a filename ending in a pipe as a command to execute, and `read_arrow` consumes the Arrow IPC stream the script writes to stdout.
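Because `args` is appended to the command line, scripts can read flags from `process.argv`. Here's a hypothetical extension of `fetch-data.ts` (the `--limit` flag is illustrative, not part of this package):

```typescript
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';

// Invoked from DuckDB as: SELECT * FROM bun('fetch-data.ts', args := '--limit 100');
// shellfs then runs: bun fetch-data.ts --limit 100
const flagIndex = process.argv.indexOf('--limit');
const limit = flagIndex === -1 ? Infinity : Number(process.argv[flagIndex + 1]);

const response = await fetch('https://api.example.com/users');
const users = await response.json();

process.stdout.write(jsonToArrowIPC(users.slice(0, limit)));
```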
### Configuration Options

```typescript
import { jsonToArrowIPC, type JsonToArrowOptions } from 'json-to-arrow-ipc';

const options: JsonToArrowOptions = {
  // Number of rows to sample for schema inference (default: 100)
  schemaSampleSize: 50,

  // How to handle schema mismatches: 'error' | 'skip' | 'coerce' (default: 'coerce')
  onSchemaMismatch: 'skip',

  // Flatten nested objects with dot notation (default: false)
  flattenNestedObjects: true,

  // Serialize arrays as JSON strings (default: true)
  serializeArrays: true
};

const ipc = jsonToArrowIPC(data, options);
```

### Getting Conversion Statistics
```typescript
import { jsonToArrowTableWithStats } from 'json-to-arrow-ipc';

const { table, skippedRows, totalRows } = jsonToArrowTableWithStats(data, {
  onSchemaMismatch: 'skip'
});

console.log(`Converted ${totalRows - skippedRows}/${totalRows} rows`);
```

## API Reference
### `jsonToArrowIPC(data, options?)`
Converts JSON data directly to Arrow IPC stream format.
- `data`: Array of JSON objects or a single object
- `options`: Optional configuration (see below)
- Returns: `Uint8Array` containing Arrow IPC stream bytes
### `jsonToArrowTable(data, options?)`
Converts JSON data to an Apache Arrow Table.
- `data`: Array of JSON objects or a single object
- `options`: Optional configuration (see below)
- Returns: Apache Arrow `Table`
### `jsonToArrowTableWithStats(data, options?)`
Converts JSON data to an Arrow Table with conversion statistics.
- Returns: `{ table: Table, skippedRows: number, totalRows: number }`
### Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `schemaSampleSize` | `number` | `100` | Number of rows to sample for schema inference |
| `onSchemaMismatch` | `'error' \| 'skip' \| 'coerce'` | `'coerce'` | How to handle schema mismatches |
| `flattenNestedObjects` | `boolean` | `false` | Flatten nested objects with dot notation |
| `serializeArrays` | `boolean` | `true` | Serialize arrays as JSON strings |
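The three mismatch modes differ in how much dirty data survives the conversion. A short sketch of the intended behavior (the counts below are assumptions based on the option descriptions, not verified output):

```typescript
import { jsonToArrowTableWithStats } from 'json-to-arrow-ipc';

// Dirty data: the second row's "age" is a string, not a number.
const dirty = [
  { name: 'John', age: 30 },
  { name: 'Jane', age: 'twenty-five' }
];

// 'skip' drops rows that don't match the inferred schema.
const { skippedRows, totalRows } = jsonToArrowTableWithStats(dirty, {
  onSchemaMismatch: 'skip'
});
console.log(`${skippedRows}/${totalRows} skipped`); // expected: 1/2 skipped

// 'coerce' (the default) instead converts mismatched values where possible,
// and 'error' throws on the first mismatch.
```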
## Type Mapping
| JSON Type | Arrow Type |
|-----------|------------|
| string | Utf8 |
| ISO 8601 date strings | DateMillisecond |
| number (integer, 32-bit) | Int32 |
| number (integer, 64-bit) | Int64 |
| number (float) | Float64 |
| boolean | Bool |
| null | nullable column |
| Nested objects/arrays | Utf8 (JSON serialized) |
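To see the inference in action, you can inspect the schema of a converted table (the field names here are illustrative, and the printed type names come from the underlying apache-arrow `DataType.toString()`, which may vary by version):

```typescript
import { jsonToArrowTable } from 'json-to-arrow-ipc';

const table = jsonToArrowTable([
  { id: 1, score: 9.5, active: true, joined: '2024-01-15', meta: { plan: 'pro' } }
]);

// Print each inferred field name and its Arrow type.
for (const field of table.schema.fields) {
  console.log(`${field.name}: ${field.type}`);
}
// Per the mapping above: id -> Int32, score -> Float64, active -> Bool,
// joined -> DateMillisecond, meta -> Utf8 (JSON-serialized)
```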
## License
MIT
