@docamz/json-tokenizer
🚀 Advanced JSON tokenizer with multiple encoding strategies for optimal compression and performance.
Lightweight and symmetric JSON tokenizer for compression and optimization. Generates consistent dictionaries and supports alphabetic, numeric, base64, UUID-based, and custom tokenization methods with symmetric encoding/decoding. Perfect for data compression, API optimization, and storage efficiency. Can be used standalone or with MessagePack or Gzip for enhanced compression.
Features
- Multiple Tokenization Methods: Alphabetic, numeric, base64, UUID-short, and custom
- Symmetric Encoding: Perfect reconstruction of original data
- 🔒 Security First: Built-in prototype pollution protection
- High Performance: Optimized algorithms with minimal overhead
- TypeScript Support: Full type safety
- ⚛️ React Hook API: First-class React support with the useJsonTokenizer hook
Installation
npm install @docamz/json-tokenizer
Quick Start
import { generateDictionary, tokenize, detokenize, TokenizationMethod } from "@docamz/json-tokenizer";
const data = { name: "Alice", age: 30, city: "Paris" };
const keys = ["name", "age", "city"];
// Generate dictionary and tokenize
const dict = generateDictionary(keys);
const encoded = tokenize(data, dict.forward);
const decoded = detokenize(encoded, dict.reverse);
console.log(encoded); // { a: "Alice", b: 30, c: "Paris" }
console.log(decoded); // { name: "Alice", age: 30, city: "Paris" }
React Hook API
⚛️ React Hook for seamless integration with React applications
Installation (Hook)
The React hook requires React 16.8.0 or higher (for hooks support):
npm install @docamz/json-tokenizer react
Basic Usage
import { useJsonTokenizer, TokenizationMethod } from "@docamz/json-tokenizer/react";
function MyComponent() {
const data = { name: "Alice", age: 30, city: "Paris" };
const { tokenized, dictionary, isLoading, error } = useJsonTokenizer(data, {
keys: ["name", "age", "city"],
method: TokenizationMethod.ALPHABETIC
});
if (isLoading) return <div>Loading...</div>;
if (error) return <div>Error: {error.message}</div>;
return (
<div>
<h3>Original:</h3>
<pre>{JSON.stringify(data, null, 2)}</pre>
<h3>Tokenized:</h3>
<pre>{JSON.stringify(tokenized, null, 2)}</pre>
</div>
);
}
Manual Control
Disable auto-tokenization and control when tokenization happens:
import { useState } from "react";
import { useJsonTokenizer } from "@docamz/json-tokenizer/react";
function ManualComponent() {
const [data, setData] = useState({ name: "Alice", age: 30 });
const { tokenized, tokenize, detokenize, reset } = useJsonTokenizer(data, {
keys: ["name", "age"],
autoTokenize: false // Don't tokenize automatically
});
return (
<div>
<button onClick={tokenize}>Tokenize</button>
<button onClick={() => detokenize(tokenized)}>Detokenize</button>
<button onClick={reset}>Reset</button>
<pre>{JSON.stringify(tokenized, null, 2)}</pre>
</div>
);
}
With Different Tokenization Methods
function TokenizationMethodsExample() {
const data = { name: "Alice", age: 30, city: "Paris" };
// Numeric tokenization
const numeric = useJsonTokenizer(data, {
keys: ["name", "age", "city"],
method: TokenizationMethod.NUMERIC
});
// Base64 tokenization
const base64 = useJsonTokenizer(data, {
keys: ["name", "age", "city"],
method: TokenizationMethod.BASE64
});
return (
<div>
<h3>Numeric: {JSON.stringify(numeric.tokenized)}</h3>
<h3>Base64: {JSON.stringify(base64.tokenized)}</h3>
</div>
);
}
Custom Tokenization
function CustomTokenization() {
const data = { name: "Alice", age: 30, city: "Paris" };
const { tokenized } = useJsonTokenizer(data, {
keys: ["name", "age", "city"],
method: TokenizationMethod.CUSTOM,
customGenerator: (index) => `field_${index}`
});
return <pre>{JSON.stringify(tokenized, null, 2)}</pre>;
}
Using Pre-generated Dictionary
For consistent tokenization across multiple components:
import { generateDictionary, TokenizationMethod } from "@docamz/json-tokenizer";
// Generate dictionary once (e.g., in a context or constant)
const SHARED_DICTIONARY = generateDictionary(
["name", "age", "city"],
{ method: TokenizationMethod.ALPHABETIC }
);
function ComponentA() {
const { tokenized } = useJsonTokenizer(
{ name: "Alice", age: 30, city: "Paris" },
{ dictionary: SHARED_DICTIONARY }
);
return <pre>{JSON.stringify(tokenized)}</pre>;
}
function ComponentB() {
const { tokenized } = useJsonTokenizer(
{ name: "Bob", age: 25, city: "London" },
{ dictionary: SHARED_DICTIONARY }
);
return <pre>{JSON.stringify(tokenized)}</pre>;
}
React Hook API Reference
useJsonTokenizer(input, options)
Parameters:
- input: any - The JSON data to tokenize
- options: UseJsonTokenizerOptions - Configuration options
Options:
interface UseJsonTokenizerOptions {
keys?: string[]; // Keys to include in dictionary generation
dictionary?: Dictionary; // Pre-generated dictionary (overrides keys)
autoTokenize?: boolean; // Auto-tokenize on input change (default: true)
method?: TokenizationMethod; // Tokenization method (default: ALPHABETIC)
customGenerator?: (index: number) => string; // For custom method
paddingLength?: number; // For padded numeric method
prefix?: string; // Prefix for tokens
}
Returns:
interface UseJsonTokenizerResult {
tokenized: any; // The tokenized data
detokenized: any; // The original/detokenized data
dictionary: Dictionary | null; // The dictionary used
isLoading: boolean; // Loading state
error: Error | null; // Any error that occurred
tokenize: () => void; // Manually trigger tokenization
detokenize: (data: any) => any; // Manually detokenize data
reset: () => void; // Reset state
}
SSR Considerations
The useJsonTokenizer hook is safe for Server-Side Rendering (SSR):
- No browser-specific APIs are used
- Works in Next.js, Remix, and other SSR frameworks
- Dictionary generation happens synchronously
- No side effects during initial render (when autoTokenize is false)
Example with Next.js:
// pages/tokenize.tsx
import { useJsonTokenizer, TokenizationMethod } from "@docamz/json-tokenizer/react";
export default function TokenizePage() {
const data = { name: "Alice", age: 30 };
const { tokenized, isLoading } = useJsonTokenizer(data, {
keys: ["name", "age"],
method: TokenizationMethod.ALPHABETIC
});
return <pre>{JSON.stringify(tokenized, null, 2)}</pre>;
}
TypeScript Support
The React hook exports are fully typed:
import type {
UseJsonTokenizerOptions,
UseJsonTokenizerResult,
Dictionary,
TokenizationMethod
} from "@docamz/json-tokenizer/react";
Tokenization Methods
1. Alphabetic (Default)
Perfect for maximum compression with readable tokens.
const dict = generateDictionary(keys, { method: TokenizationMethod.ALPHABETIC });
// Result: { name: "a", age: "b", city: "c" }
2. Numeric
Simple numeric tokens for databases and APIs.
const dict = generateDictionary(keys, { method: TokenizationMethod.NUMERIC });
// Result: { name: "0", age: "1", city: "2" }
3. Padded Numeric
Fixed-width numeric tokens for consistent formatting.
const dict = generateDictionary(keys, {
method: TokenizationMethod.PADDED_NUMERIC,
paddingLength: 3
});
// Result: { name: "000", age: "001", city: "002" }
4. Base64 Style
High-density encoding using alphanumeric + symbols.
const dict = generateDictionary(keys, { method: TokenizationMethod.BASE64 });
// Supports 64 characters: a-z, A-Z, 0-9, _, $
// Result: { name: "a", age: "b", city: "c", ... key63: "$", key64: "ba" }
5. UUID Short
Distributed-system friendly with timestamp + counter.
const dict = generateDictionary(keys, { method: TokenizationMethod.UUID_SHORT });
// Result: { name: "1a2b00", age: "1a2b01", city: "1a2b02" }
// Format: 4-char timestamp + 2-char counter (6 chars total)
6. Custom Generator
Define your own tokenization logic.
const dict = generateDictionary(keys, {
method: TokenizationMethod.CUSTOM,
customGenerator: (index) => `custom_${index}`
});
// Result: { name: "custom_0", age: "custom_1", city: "custom_2" }
7. Prefixed Tokens
Add prefixes to any tokenization method.
const dict = generateDictionary(keys, {
method: TokenizationMethod.NUMERIC,
prefix: "api_"
});
// Result: { name: "api_0", age: "api_1", city: "api_2" }
Advanced Usage
Complex Nested Objects
const complexData = {
user: {
profile: { firstName: "John", lastName: "Doe", email: "[email protected]" },
settings: { theme: "dark", language: "en", notifications: true }
},
metadata: { version: "2.0", createdAt: "2023-01-01T00:00:00Z" }
};
const keys = [
"user", "profile", "firstName", "lastName", "email",
"settings", "theme", "language", "notifications",
"metadata", "version", "createdAt"
];
const dict = generateDictionary(keys, { method: TokenizationMethod.ALPHABETIC });
const encoded = tokenize(complexData, dict.forward);
const decoded = detokenize(encoded, dict.reverse);
// Perfect reconstruction guaranteed (deep equality, not reference equality)
console.log(JSON.stringify(decoded) === JSON.stringify(complexData)); // true
Arrays of Objects
const arrayData = {
users: [
{ name: "Alice", age: 30, role: "admin" },
{ name: "Bob", age: 25, role: "user" },
{ name: "Charlie", age: 35, role: "moderator" }
]
};
const keys = ["users", "name", "age", "role"];
const dict = generateDictionary(keys, { method: TokenizationMethod.BASE64 });
const encoded = tokenize(arrayData, dict.forward);
// Result: { a: [{ b: "Alice", c: 30, d: "admin" }, ...] }
Dictionary Serialization
import fs from "fs";
// Save dictionary for later use
const dict = generateDictionary(keys, { method: TokenizationMethod.ALPHABETIC });
const serialized = JSON.stringify(dict);
fs.writeFileSync('dictionary.json', serialized);
// Load and use dictionary
const loaded = JSON.parse(fs.readFileSync('dictionary.json', 'utf-8'));
const decoded = detokenize(encodedData, loaded.reverse);
🔒 Security Features
Built-in protection against prototype pollution and security vulnerabilities:
import { tokenize, sanitizeObject, isSafeKey } from "@docamz/json-tokenizer";
// Automatic protection against dangerous keys
const maliciousData = { name: "Alice", "__proto__": { isAdmin: true } };
tokenize(maliciousData, dict.forward); // Throws: "Dangerous key detected"
// Sanitize untrusted input
const cleanData = sanitizeObject(untrustedInput, { throwOnUnsafeKeys: true });
// Validate keys manually
if (isSafeKey(keyName)) {
// Safe to use
}
Protected against:
- __proto__ pollution
- constructor manipulation
- Dangerous property access
- Control character injection
📖 See SECURITY.md for the complete security guide
API Reference
Core Functions
| Function | Parameters | Description |
|----------|------------|-------------|
| generateDictionary(keys, options?) | keys: string[], options?: TokenizationOptions | Generate tokenization dictionary |
| tokenize(obj, dict) | obj: any, dict: Record<string, string> | Replace keys with tokens |
| detokenize(obj, reverse) | obj: any, reverse: Record<string, string> | Restore original keys |
Tokenization Methods Reference
| Method | Description | Use Case |
|--------|-------------|----------|
| ALPHABETIC | a, b, c, ..., z, aa, ab | Maximum compression, readable |
| NUMERIC | 0, 1, 2, 3, ... | Simple, database-friendly |
| PADDED_NUMERIC | 000, 001, 002, ... | Fixed-width, sortable |
| BASE64 | a-z, A-Z, 0-9, _, $ | High-density encoding |
| UUID_SHORT | timestamp + counter | Distributed systems |
| CUSTOM | User-defined function | Custom requirements |
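The ALPHABETIC sequence in the table above follows a bijective base-26 pattern (a–z, then aa, ab, …). As an illustration only, and not the library's internal code, a minimal generator for that pattern might look like this:

```js
// Illustrative sketch of a bijective base-26 generator: a, b, ..., z, aa, ab, ...
// This mirrors the ALPHABETIC pattern described above; the package exposes its own
// generator as generateAlphabeticSequence (see Sequence Generators below).
function alphabeticToken(index) {
  let n = index + 1; // bijective numeration is 1-based
  let token = "";
  while (n > 0) {
    n -= 1;
    token = String.fromCharCode(97 + (n % 26)) + token; // char code 97 = "a"
    n = Math.floor(n / 26);
  }
  return token;
}

console.log(alphabeticToken(0));  // "a"
console.log(alphabeticToken(25)); // "z"
console.log(alphabeticToken(26)); // "aa"
```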
TokenizationOptions
interface TokenizationOptions {
method?: TokenizationMethod; // Default: ALPHABETIC
customGenerator?: (index: number) => string; // For CUSTOM method
paddingLength?: number; // Default: 4 (for PADDED_NUMERIC)
prefix?: string; // Default: "" (empty)
}
Sequence Generators
Access individual generators directly:
import {
generateAlphabeticSequence,
generateNumericSequence,
generatePaddedNumericSequence,
generateBase64Sequence,
generateUuidShortSequence
} from "@docamz/json-tokenizer";
// Use specific generators
const token1 = generateAlphabeticSequence(0); // "a"
const token2 = generateBase64Sequence(63); // "$"
const token3 = generateUuidShortSequence(0); // "1a2b00"
Benchmarks
- model1.json (83.8 KB file): 2,679 rows, 216 unique keys in the dictionary
- model2.json (134.4 KB file): 4,069 rows, 216 unique keys in the dictionary
- model3.json (148.7 KB file): 4,424 rows, 216 unique keys in the dictionary
- model4.json (33.1 KB file): 1,056 rows, 216 unique keys in the dictionary
These files contain complex nested structures and arrays with mixed value types (booleans, URLs, text, numbers, ...) to simulate real-world JSON data.
Compression Ratios
Compression benchmarks for the different tokenization methods on model3.json (148.7 KB, 4,424 rows, 216 unique keys):
| Method | Dict Gen | Tokenize | Total | Original | Tokenized | Compression | Saved |
|--------|----------|----------|-------|----------|-----------|-------------|-------|
| alphabetic | 0.00 ms | 112.28 ms | 112.28 ms | 72.14 KB | 49.26 KB | 31.71% | 22.87 KB |
| base64 | 0.00 ms | 111.24 ms | 111.24 ms | 72.14 KB | 48.70 KB | 32.49% | 23.44 KB |
| numeric | 0.00 ms | 113.88 ms | 113.88 ms | 72.14 KB | 51.52 KB | 28.58% | 20.62 KB |
| padded_numeric | 0.00 ms | 127.31 ms | 127.31 ms | 72.14 KB | 56.87 KB | 21.17% | 15.27 KB |
| uuid_short | 0.00 ms | 113.00 ms | 113.00 ms | 72.14 KB | 63.82 KB | 11.53% | 8.31 KB |
FASTEST TOKENIZATION:
- base64: 111.24 ms
- alphabetic: 112.28 ms
- uuid_short: 113.00 ms
- numeric: 113.88 ms
- padded_numeric: 127.31 ms
BEST COMPRESSION:
- base64: 32.49% (23.44 KB saved)
- alphabetic: 31.71% (22.87 KB saved)
- numeric: 28.58% (20.62 KB saved)
- padded_numeric: 21.17% (15.27 KB saved)
- uuid_short: 11.53% (8.31 KB saved)
MOST SPACE SAVED:
- base64: 23.44 KB
- alphabetic: 22.87 KB
- numeric: 20.62 KB
- padded_numeric: 15.27 KB
- uuid_short: 8.31 KB
EFFICIENCY SCORE (compression % ÷ tokenize time in ms; e.g. 32.49 / 111.24 ≈ 0.2921):
- base64: 0.2921 (32.49% in 111.24 ms)
- alphabetic: 0.2824 (31.71% in 112.28 ms)
- numeric: 0.2510 (28.58% in 113.88 ms)
- padded_numeric: 0.1663 (21.17% in 127.31 ms)
- uuid_short: 0.1020 (11.53% in 113.00 ms)
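The figures above come from the package's benchmark suite. If you want to sanity-check compression ratio and timing on your own payloads, a rough measurement along these lines is enough (a sketch, not the project's benchmark harness):

```js
// Rough reproduction of the compression/time measurement (Node.js sketch).
// Sizes are UTF-8 byte lengths of the serialized JSON.
import { performance } from "perf_hooks";
import { generateDictionary, tokenize, TokenizationMethod } from "@docamz/json-tokenizer";

function measure(data, keys, method = TokenizationMethod.ALPHABETIC) {
  const start = performance.now();
  const dict = generateDictionary(keys, { method });
  const encoded = tokenize(data, dict.forward);
  const totalMs = performance.now() - start;

  const originalKB = Buffer.byteLength(JSON.stringify(data)) / 1024;
  const tokenizedKB = Buffer.byteLength(JSON.stringify(encoded)) / 1024;
  const compressionPct = (1 - tokenizedKB / originalKB) * 100; // e.g. ~32% for base64 on model3
  const efficiency = compressionPct / totalMs;                 // the efficiency score used above

  return { totalMs, originalKB, tokenizedKB, compressionPct, efficiency };
}
```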
Benchmark Results
| Model | Raw Size | Raw→Tok | Tok+Gzip | MsgPack | Tok+Msg | Tok+Msg+Gzip | Tok Enc/Dec | Msg Enc/Dec | Tok+Msg Enc/Dec |
|-------|----------|---------|----------|---------|---------|--------------|-------------|-------------|------------------|
| model1.json | 83.8 KB | 64.6% | 55.5% | 60.3% | 76.3% | 55.5% | 86.5/74.6 ms | 1.3/0.8 ms | 87.1/72.3 ms |
| model2.json | 134.4 KB | 65.7% | 51.8% | 61.3% | 77.2% | 55.7% | 103.9/105.9 ms | 0.3/0.4 ms | 104.2/106.5 ms |
| model3.json | 148.7 KB | 66.9% | 56.5% | 62.7% | 78.0% | 57.4% | 113.4/116.5 ms | 0.3/0.3 ms | 113.7/115.0 ms |
| model4.json | 33.1 KB | 69.9% | 45.6% | 64.1% | 82.2% | 46.2% | 28.1/28.6 ms | 0.2/0.1 ms | 28.1/27.5 ms |
| Average | - | 66.8% | 52.4% | 62.1% | 78.4% | 53.7% | 82.9/81.4 ms | 0.53/0.40 ms | 83.28/80.3 ms |
Key:
- Raw→Tok: Tokenization compression ratio
- Tok+Gzip: Tokenized with Gzip compression
- MsgPack: MessagePack compression ratio
- Tok+Msg: Combined tokenization + MessagePack
- Tok+Msg+Gzip: Best compression (tokenization + MessagePack + Gzip)
- Enc/Dec: Encoding/Decoding performance in milliseconds
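The Tok+Msg+Gzip column above chains all three steps. A sketch of that pipeline, assuming the @msgpack/msgpack package (any MessagePack codec works) and Node's built-in zlib:

```js
// Sketch of the tokenize -> MessagePack -> gzip pipeline (and its inverse).
// @msgpack/msgpack is an assumption here; substitute your preferred MessagePack codec.
import { gzipSync, gunzipSync } from "zlib";
import { encode, decode } from "@msgpack/msgpack";
import { generateDictionary, tokenize, detokenize } from "@docamz/json-tokenizer";

const keys = ["users", "name", "age", "role"];
const dict = generateDictionary(keys);

function compress(data) {
  const tokenized = tokenize(data, dict.forward); // shorten keys
  return gzipSync(encode(tokenized));             // binary-pack, then gzip
}

function decompress(buffer) {
  const tokenized = decode(gunzipSync(buffer));   // un-gzip, then unpack
  return detokenize(tokenized, dict.reverse);     // restore original keys
}
```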
License
MIT License © 2025 DocAmz
