@aoede/tamper

v1.0.0

Published

6 months ago

ESM encoder/decoder for Tamper - a compact format for bulk categorical datasets

aoede

tamper compression categorical tabular encoding bitmap rle esm

Tamper (ESM)

This repository contains an ESM-native implementation of the encoder and decoder for the Tamper format, originally developed at the New York Times, plus strict parity tooling that checks its output is identical to that of the frozen legacy implementation.

This project is an independent ESM implementation of the Tamper format. It does not define a new format and is not affiliated with the original NYT repository.

Tamper is a column-oriented packer for tabular categorical data (low-cardinality enums, booleans, bucketed integers), for cases where JSON plus general-purpose compression becomes inefficient.

When to use this

Tamper is a good fit when your data is:

Tabular (many rows with the same attributes)
Categorical-heavy (enums, booleans, small integers)
Bulk (transferred or stored as snapshots)
Read-mostly / immutable
Required to match legacy Tamper output exactly
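
One rough way to judge whether a dataset is "categorical-heavy" is to count the distinct values in each column. The sketch below is a hypothetical helper, not part of this package; the `Row` type, the function name, and the sample rows are assumptions made for illustration. Columns whose distinct count stays small as the row count grows are the ones Tamper is aimed at.

```typescript
type Cell = string | number | boolean | null;
type Row = Record<string, Cell>;

/** Count the distinct values observed in each column. */
function columnCardinality(rows: Row[]): Map<string, number> {
  const seen = new Map<string, Set<Cell>>();
  for (const row of rows) {
    for (const [column, value] of Object.entries(row)) {
      let values = seen.get(column);
      if (!values) {
        values = new Set<Cell>();
        seen.set(column, values);
      }
      values.add(value);
    }
  }
  const counts = new Map<string, number>();
  seen.forEach((values, column) => counts.set(column, values.size));
  return counts;
}

// Hypothetical sample rows: `id` is unique per row, the rest are low-cardinality.
const sample: Row[] = [
  { id: 1, status: "active", tier: "gold", flagged: false },
  { id: 2, status: "inactive", tier: "gold", flagged: false },
  { id: 3, status: "active", tier: "silver", flagged: true },
];
const cardinality = columnCardinality(sample);
console.log(cardinality);
```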

Use cases:

Analytics extracts for dashboards
Lookup / reference tables
ML-style categorical feature matrices shipped to JS or WASM

When not to use this

Do not use Tamper for:

Nested or hierarchical objects
General APIs or CRUD payloads
Arbitrary graphs
Free-form documents or HTML

If your data is not mostly categorical and tabular, JSON + Brotli/Zstd or a schema-based format (e.g. Protobuf, Arrow) will likely be a better fit.

Overview

Tamper is a data serialisation protocol originally developed at the New York Times to efficiently transfer large categorical datasets from server to browser.

This repository provides a modern ESM implementation of the original CommonJS codebase, with:

identical encoded output
identical decoded results
strict, automated parity checks against the frozen legacy implementation

Core encoding approach

Tamper packs categorical columns using bitwise encodings, automatically selecting the most efficient strategy per attribute:

Integer packing - sparse or bounded integer values (categorical values stored as integer codes)
Bitmap packing - dense categorical values, including multi-value attributes
Existence packing - tracks which IDs are present, using run-length encoding

These strategies are chosen automatically by the encoder based on observed data characteristics.
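
To make the three ideas concrete, here is a minimal sketch of fixed-width integer packing, bitmap packing, and run-length encoding of presence. It illustrates the general techniques only: it does not reproduce the Tamper wire format, and every function name below is hypothetical rather than part of this package's API.

```typescript
// Illustrative only: this is not the Tamper wire format or this package's API.

/** Bits needed to store integer codes 0..possibilities-1 (at least 1 bit). */
function bitsFor(possibilities: number): number {
  return possibilities <= 1 ? 1 : 32 - Math.clz32(possibilities - 1);
}

/** Set a single bit in `out`, counting bits from the most significant end. */
function setBit(out: Uint8Array, bit: number): void {
  out[bit >> 3] |= 0x80 >> (bit & 7);
}

/** Integer packing: write each code as a fixed-width bit field. */
function packCodes(codes: number[], width: number): Uint8Array {
  const out = new Uint8Array(Math.ceil((codes.length * width) / 8));
  let bit = 0;
  for (const code of codes) {
    for (let i = width - 1; i >= 0; i--, bit++) {
      if ((code >> i) & 1) setBit(out, bit);
    }
  }
  return out;
}

/** Bitmap packing: one bit per possible value per row, so a row can hold several values. */
function packBitmap(rows: string[][], possibilities: string[]): Uint8Array {
  const width = possibilities.length;
  const out = new Uint8Array(Math.ceil((rows.length * width) / 8));
  rows.forEach((values, row) => {
    for (const value of values) {
      const index = possibilities.indexOf(value);
      if (index < 0) throw new Error(`unknown value: ${value}`);
      setBit(out, row * width + index);
    }
  });
  return out;
}

interface Run {
  present: boolean;
  length: number;
}

/** Run-length encoding of presence: collapse ids 0..maxId into present/absent runs. */
function existenceRuns(ids: number[], maxId: number): Run[] {
  const present = new Set(ids);
  const runs: Run[] = [];
  for (let id = 0; id <= maxId; id++) {
    const isPresent = present.has(id);
    const last = runs[runs.length - 1];
    if (last && last.present === isPresent) last.length++;
    else runs.push({ present: isPresent, length: 1 });
  }
  return runs;
}

console.log(bitsFor(5)); // 3 bits cover codes 0..4
console.log(packCodes([0, 1, 2, 3], 2)); // one byte: 0b00011011 (27)
console.log(packBitmap([["a", "c"], ["d"]], ["a", "b", "c", "d"])); // one byte: 0b10100001 (161)
console.log(existenceRuns([0, 1, 2, 7], 9).map((r) => r.length)); // run lengths 3, 4, 1, 2
```

Four two-bit codes fit in a single byte here, and ten IDs collapse to four runs. Savings of this kind are what the strategies exploit.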

Performance

Tamper achieves significant compression for categorical tabular data:

Sparse datasets: 10-15x compression (e.g., 500 events across 10K IDs)
Dense multi-value attributes: 20-30x compression (bitmap encoding)
Very sparse datasets: 4-5x compression at scale (existence encoding with RLE)

The compression ratio improves with dataset size due to fixed header overhead. See real examples with the size comparison script:

npm run example

This script demonstrates four scenarios showing Tamper vs plain JSON size, compression ratios, and the impact of:

Existence encoding for sparse data
Integer encoding for categorical values
Bitmap encoding for multi-value attributes
Fixed overhead on small vs large datasets
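
The effect of fixed overhead can be sketched with a simple size model: a constant header cost plus a per-row cost. The function and all of the numbers below are assumptions chosen for illustration, not measurements of Tamper or output of the example script.

```typescript
// Hypothetical size model: the byte counts used below are made up for
// illustration and are not measurements of Tamper.
function compressionRatio(
  rows: number,
  jsonBytesPerRow: number,
  packedBytesPerRow: number,
  headerBytes: number,
): number {
  const jsonSize = rows * jsonBytesPerRow;
  const packedSize = headerBytes + rows * packedBytesPerRow;
  return jsonSize / packedSize;
}

// With 40 JSON bytes and 2 packed bytes per row, the ratio approaches
// 40 / 2 = 20 as the 400-byte header is amortised over more rows.
for (const rows of [10, 100, 1000, 10000]) {
  console.log(rows, compressionRatio(rows, 40, 2, 400).toFixed(2));
}
```

Under these assumed numbers a 10-row dataset ends up larger than its JSON form, while a 10,000-row dataset gets close to the 20x limit.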

Note: These compression ratios are measured before any transport-level compression. Tamper packs can be compressed further with gzip or Brotli, and the result is often smaller than gzip or Brotli applied to plain JSON, because Tamper eliminates repeated field names and bit-packs values.
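
To check transport-level sizes on your own data, a small measurement helper along these lines can be used. It relies only on Node's built-in `node:zlib`; the helper name and the sample rows are hypothetical. Running it on both the plain JSON and the serialized pack gives a like-for-like comparison.

```typescript
import { brotliCompressSync, gzipSync } from "node:zlib";

interface TransportSizes {
  raw: number;
  gzip: number;
  brotli: number;
}

/** Byte sizes of a string payload before and after gzip and Brotli. */
function transportSizes(payload: string): TransportSizes {
  const bytes = new TextEncoder().encode(payload);
  return {
    raw: bytes.length,
    gzip: gzipSync(bytes).length,
    brotli: brotliCompressSync(bytes).length,
  };
}

// Hypothetical categorical rows serialized as plain JSON.
const jsonRows = Array.from({ length: 1000 }, (_, i) => ({
  id: i,
  status: i % 3 === 0 ? "active" : "inactive",
  flagged: i % 7 === 0,
}));
const sizes = transportSizes(JSON.stringify(jsonRows));
console.log(sizes);
```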

Repository structure

├── clients/js/src/         # ESM decoder (browser-side)
├── encoders/js/
│   ├── core/               # Environment-agnostic encoder logic
│   └── env/                # Node.js & browser adapters
├── legacy/                 # Frozen legacy implementation (reference only)
├── vendor/bitsy/           # Vendored bitset library (no npm deps)
├── scripts/                # Parity verification tools
└── test/                   # Test datasets & canonical outputs

Requirements

Node.js (ESM-capable; tested with current LTS)
npm (for installing dev tooling)
Encoder runtime uses a local vendor/bitsy shim (no network installs)

Install dev dependencies for the tsx-driven scripts:

npm install

Usage

Decoder (ESM)

Exports:

createTamper() - decoder factory
Tamper - decoder methods
default export - alias of createTamper

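import fs from "node:fs/promises";
import createTamper from "./clients/js/src/tamper.ts";

// Create a decoder instance.
const tamper = createTamper();
// Load a previously encoded pack (here, a JSON file on disk).
const pack = JSON.parse(await fs.readFile("pack.json", "utf8"));
// Decode the pack back into items.
const items = tamper.unpackData(pack);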

Encoder (ESM)

Entry points:

Node / standard ESM: encoders/js/index.ts
Browser / edge: compose core + environment adapter

Exports:

createPackSet, PackSet
Pack, IntegerPack, BitmapPack, ExistencePack

import { createPackSet } from "./encoders/js/index.ts";

const tamp = createPackSet();
// configure attributes + pack data...
const json = tamp.toJSON();

Browser / edge example:

import createEncoder from "./encoders/js/core/createEncoder.ts";
import browserEnv from "./encoders/js/env/browser.ts";

const { createPackSet } = createEncoder(browserEnv);

const tamp = createPackSet();
// configure attributes + pack data...
const json = tamp.toJSON();

Parity verification (strict)

Decoder parity compares decoded output from the legacy and ESM implementations:

tsx scripts/compare-decoders.ts

Encoder parity builds packs from test datasets and compares full JSON output against canonical fixtures:

tsx scripts/compare-encoders.ts

The ESM implementation's parity is verified by ensuring all canonical fixtures match byte-for-byte.
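
Byte-for-byte matching is stricter than structural JSON equality: two documents with the same keys in a different order parse to equal objects but differ as bytes. The sketch below shows that kind of check in isolation. It is a hypothetical helper, not the code in scripts/compare-encoders.ts.

```typescript
/** Index of the first differing byte, or -1 when the inputs are identical. */
function firstDifference(a: Uint8Array, b: Uint8Array): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return i;
  }
  // Equal up to the shorter length: identical only if the lengths match too.
  return a.length === b.length ? -1 : shared;
}

const encoder = new TextEncoder();
const canonical = encoder.encode('{"a":1,"b":2}');
const reordered = encoder.encode('{"b":2,"a":1}');

console.log(firstDifference(canonical, canonical)); // -1
console.log(firstDifference(canonical, reordered)); // 2
```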

Notes

Encoder output is tuned to exactly match canonical JSON fixtures (including legacy fields such as max_guid and existence metadata).
The legacy implementation is retained only for parity verification and reference; it is not used at runtime.
The browser encoder uses Uint8Array and DataView and does not depend on Node.js Buffer.
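
As an illustration of working without Buffer, the sketch below reads and writes multi-byte integers using only `Uint8Array` and `DataView`, which are available in browsers, edge runtimes, and Node.js alike. The function names are hypothetical and the byte layout is an arbitrary example, not the encoder's actual output.

```typescript
/** Encode unsigned 32-bit integers as big-endian bytes, without Node's Buffer. */
function encodeUint32s(values: number[]): Uint8Array {
  const bytes = new Uint8Array(values.length * 4);
  const view = new DataView(bytes.buffer);
  // DataView writes big-endian unless littleEndian = true is passed.
  values.forEach((value, i) => view.setUint32(i * 4, value));
  return bytes;
}

/** Decode big-endian unsigned 32-bit integers, ignoring any trailing partial word. */
function decodeUint32s(bytes: Uint8Array): number[] {
  // Respect byteOffset so that subarray views decode correctly.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const values: number[] = [];
  for (let offset = 0; offset + 4 <= bytes.byteLength; offset += 4) {
    values.push(view.getUint32(offset));
  }
  return values;
}

const encodedWords = encodeUint32s([1, 0xdeadbeef]);
console.log(Array.from(encodedWords)); // [0, 0, 0, 1, 222, 173, 190, 239]
console.log(decodeUint32s(encodedWords)); // [1, 3735928559]
```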

Expected parity output

PASS large.json
PASS run.json
PASS run2.json
PASS small.json
PASS small2.json
PASS sparse.json
PASS spstart.json

All 7 file(s) passed parity checks.
...
All 7 file(s) passed encoder parity checks.