universal-encoding-toolkit

v1.0.1

Published

4 months ago

Universal Encoding Toolkit

A comprehensive encoding detection and conversion toolkit for Node.js — supports 100+ character encodings including CJK, Cyrillic, Arabic, Hebrew, Thai, and more.

Combines the encoding conversion power of iconv-lite with a custom-built multi-stage auto-detection engine.

Features

🔍 Auto-detect 100+ encodings from raw buffers — CJK, Cyrillic, Arabic, Hebrew, Thai, Latin, etc.
🔄 Encode / Decode / Transcode between any supported encodings
🧠 Smart decode — detect + decode in one step
📦 Single dependency — only iconv-lite
💡 TypeScript type declarations included
🌊 Stream support — encode/decode streams for piping

Supported Encodings (10 Groups)

| Group | Count | Examples |
|-------|-------|---------|
| Node.js Built-in | 8 | utf8, ucs2, ascii, base64, hex |
| Unicode Extended | 7 | utf16be, utf32, utf7, utf7-imap |
| Windows Code Pages | 10 | windows-874, windows-1250 ~ 1258 |
| ISO-8859 Series | 15 | iso-8859-1 ~ iso-8859-16 |
| IBM/DOS Code Pages | 28 | cp437, cp850, cp866, cp1125... |
| Macintosh Encodings | 11 | macintosh, macgreek, macukraine... |
| KOI8 Series | 4 | koi8-r, koi8-u, koi8-ru, koi8-t |
| Other Single-byte | 12 | armscii8, viscii, tis620, mik... |
| CJK Multi-byte (DBCS) | 11 | GBK, GB18030, Big5, Shift_JIS, EUC-JP, EUC-KR |
| Common Aliases | 12 | latin1, chinese, korean, sjis... |

Installation

npm install universal-encoding-toolkit

Quick Start

const toolkit = require('universal-encoding-toolkit');

// 1. Encode & Decode
const buf = toolkit.encode('你好世界', 'gbk');
const str = toolkit.decode(buf, 'gbk');
console.log(str); // 你好世界

// 2. Auto-detect encoding
const detected = toolkit.detect(buf);
console.log(detected);
// { encoding: 'gbk', confidence: 0.88, source: 'gbk-analysis' }

// 3. Smart decode (detect + decode in one step)
const result = toolkit.smartDecode(buf);
console.log(result.text);       // 你好世界
console.log(result.encoding);   // gbk
console.log(result.confidence); // 0.88

// 4. Transcode (encoding → encoding)
const big5Buf = toolkit.transcode(buf, 'gbk', 'big5');

// 5. Check encoding support
toolkit.encodingExists('gbk');      // true
toolkit.encodingExists('chinese');  // true (alias)

// 6. Normalize encoding names
toolkit.normalize('sjis');    // 'shiftjis'
toolkit.normalize('latin1');  // 'iso-8859-1'
toolkit.normalize('chinese'); // 'gbk'

API Reference

Encoding Conversion

| Method | Description |
|--------|-------------|
| encode(str, encoding) | String → Buffer |
| decode(buffer, encoding) | Buffer → String |
| transcode(buffer, from, to) | Re-encode buffer from one encoding to another |
| encodeStream(encoding) | Create a transform stream that encodes strings to bytes, for use with pipe() |
| decodeStream(encoding) | Create a transform stream that decodes bytes to strings, for use with pipe() |
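Conceptually, transcoding is a decode followed by an encode. The sketch below illustrates that idea with Node's built-in encodings only; it is not the toolkit's implementation, and `transcodeBuiltin` is a hypothetical helper name. The assumption is that `transcode(buffer, from, to)` behaves the same way for the wider set of encodings the toolkit supports.

```javascript
// Sketch only: uses Node's built-in encodings, not the toolkit itself.
// transcodeBuiltin is a hypothetical helper name.
function transcodeBuiltin(buffer, from, to) {
  const text = buffer.toString(from); // bytes -> JavaScript string
  return Buffer.from(text, to);       // string -> bytes in the target encoding
}

const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" in latin1
const utf8Bytes = transcodeBuiltin(latin1Bytes, 'latin1', 'utf8');
console.log(utf8Bytes); // <Buffer 63 61 66 c3 a9>
```

The single latin1 byte `0xe9` becomes the two-byte UTF-8 sequence `c3 a9`, so the output buffer is one byte longer than the input.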

Encoding Detection

| Method | Description |
|--------|-------------|
| detect(buffer) | Auto-detect, returns { encoding, confidence, source } |
| detectAll(buffer) | Returns all candidates sorted by confidence |
| smartDecode(buffer) | Auto-detect + decode, returns { text, encoding, confidence } |
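Detection results carry a confidence score, so callers may want a fallback when the best candidate is weak. The following is a minimal sketch, not part of the toolkit: `pickEncoding`, its `trusted` flag, and the `utf-8` fallback are hypothetical, and the 0.7 default simply mirrors the threshold used in the batch conversion example in this README. It takes a plain array shaped like the `{ encoding, confidence, source }` objects described above.

```javascript
// Hypothetical helper: chooses an encoding from a candidate list shaped like
// the { encoding, confidence, source } objects described above.
function pickEncoding(candidates, minConfidence = 0.7, fallback = 'utf-8') {
  const best = candidates
    .slice() // do not mutate the caller's array
    .sort((a, b) => b.confidence - a.confidence)[0];

  if (!best || best.confidence < minConfidence) {
    // No candidate is strong enough: fall back and flag the result as untrusted
    return { encoding: fallback, confidence: best ? best.confidence : 0, trusted: false };
  }
  return { encoding: best.encoding, confidence: best.confidence, trusted: true };
}

// Illustrative data, not real detector output.
const sample = [
  { encoding: 'big5', confidence: 0.41, source: 'example' },
  { encoding: 'gbk', confidence: 0.88, source: 'example' },
];
const choice = pickEncoding(sample);
const weak = pickEncoding([{ encoding: 'koi8-r', confidence: 0.35, source: 'example' }]);

console.log(choice.encoding, choice.trusted); // gbk true
console.log(weak.encoding, weak.trusted);     // utf-8 false
```

In real use the candidate list would presumably come from `detectAll(buffer)`, and an untrusted result could be logged or routed to manual review instead of being decoded silently.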

Utilities

| Method | Description |
|--------|-------------|
| encodingExists(name) | Check if an encoding is supported |
| normalize(name) | Normalize encoding name (e.g. 'sjis' → 'shiftjis') |
| getSupportedEncodings() | Get flat list of all supported encoding names |
| getEncodingGroups() | Get encoding groups object |
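Name normalization can be pictured as an alias lookup. The sketch below is an assumption about the general approach, not the toolkit's code: `normalizeSketch` and `ALIASES` are hypothetical names, the table holds only the four pairs shown in this README, and the case-insensitive, whitespace-trimming behavior is a guess.

```javascript
// Hypothetical alias table; only these four pairs come from the examples in this README.
const ALIASES = new Map([
  ['sjis', 'shiftjis'],
  ['latin1', 'iso-8859-1'],
  ['chinese', 'gbk'],
  ['cp936', 'gbk'],
]);

// Assumed behavior: trim, lower-case, then map known aliases to a canonical name.
function normalizeSketch(name) {
  const key = String(name).trim().toLowerCase();
  return ALIASES.has(key) ? ALIASES.get(key) : key;
}

console.log(normalizeSketch('SJIS'));  // shiftjis
console.log(normalizeSketch('cp936')); // gbk
console.log(normalizeSketch('gbk'));   // gbk (already canonical)
```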

Real-World Examples

Read a file with unknown encoding

const fs = require('fs');
const toolkit = require('universal-encoding-toolkit');

const buf = fs.readFileSync('unknown-file.txt');
const { text, encoding, confidence } = toolkit.smartDecode(buf);

console.log(`Detected: ${encoding} (confidence: ${(confidence * 100).toFixed(1)}%)`);
console.log(text);

HTTP response decoding

const http = require('http');
const toolkit = require('universal-encoding-toolkit');

http.get('http://example.com/data', (res) => {
  const chunks = [];
  res.on('data', chunk => chunks.push(chunk));
  res.on('end', () => {
    const buf = Buffer.concat(chunks);
    const { text, encoding } = toolkit.smartDecode(buf);
    console.log(`Response encoding: ${encoding}`);
    console.log(text);
  });
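}).on('error', (err) => {
  // Network failures surface here rather than in the response callback
  console.error(`Request failed: ${err.message}`);
});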
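Many servers declare the charset in the `Content-Type` header, which is usually more reliable than byte-level guessing. As a hedged sketch (this helper is not part of the toolkit, and `charsetFromContentType` is a hypothetical name), the header can be parsed first and detection used only when no charset is declared:

```javascript
// Hypothetical helper: extracts the charset parameter from a Content-Type header.
// Returns a lower-cased charset name, or null when none is declared.
function charsetFromContentType(header) {
  if (typeof header !== 'string') return null;
  // Matches `; charset=value` with optional double quotes around the value
  const match = /;\s*charset\s*=\s*("?)([^";\s]+)\1/i.exec(header);
  return match ? match[2].toLowerCase() : null;
}

console.log(charsetFromContentType('text/html; charset=GBK'));         // gbk
console.log(charsetFromContentType('text/html; charset="Shift_JIS"')); // shift_jis
console.log(charsetFromContentType('application/json'));               // null
```

Inside the response handler, the header is available as `res.headers['content-type']`. One possible flow is to pass a declared charset through `encodingExists` and `decode`, and fall back to `smartDecode` when the header is missing or names an unsupported encoding.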

Batch convert files to UTF-8

const fs = require('fs');
const path = require('path');
const toolkit = require('universal-encoding-toolkit');

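function convertToUTF8(filePath) {
  const buf = fs.readFileSync(filePath);
  const detected = toolkit.detect(buf);

  // Skip files that are already UTF-8 and files with a low-confidence guess
  if (detected.encoding !== 'utf-8' && detected.confidence > 0.7) {
    const text = toolkit.decode(buf, detected.encoding);
    // '\uFEFF' writes a UTF-8 BOM; drop it if downstream tools do not expect one
    fs.writeFileSync(filePath, Buffer.from('\uFEFF' + text, 'utf-8'));
    console.log(`Converted ${filePath}: ${detected.encoding} → utf-8`);
  }
}

// Convert every .txt file in a directory in place ('./legacy' is a placeholder path)
const dir = './legacy';
for (const name of fs.readdirSync(dir)) {
  if (path.extname(name) === '.txt') convertToUTF8(path.join(dir, name));
}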

Stream piping

const fs = require('fs');
const toolkit = require('universal-encoding-toolkit');

// Decode a GBK file to UTF-8 via streams
fs.createReadStream('input-gbk.txt')
  .pipe(toolkit.decodeStream('gbk'))
  .pipe(fs.createWriteStream('output-utf8.txt'));
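Stream decoding matters because a multi-byte character can be split across two chunks. The sketch below shows the problem and the fix using Node's built-in `string_decoder` module with UTF-8; it is an analogy, on the assumption that the toolkit's `decodeStream` keeps similar per-stream state for encodings such as GBK.

```javascript
const { StringDecoder } = require('string_decoder');

const bytes = Buffer.from('你好', 'utf8'); // 6 bytes, 3 per character
const firstChunk = bytes.subarray(0, 2);   // cuts the first character short
const secondChunk = bytes.subarray(2);

// Naive per-chunk decoding corrupts the split character.
const naive = firstChunk.toString('utf8') + secondChunk.toString('utf8');

// A stateful decoder holds the incomplete bytes until the rest arrives.
const decoder = new StringDecoder('utf8');
const streamed = decoder.write(firstChunk) + decoder.write(secondChunk) + decoder.end();

console.log(naive === '你好');    // false
console.log(streamed === '你好'); // true
```

This is why decoding each `data` chunk separately with a one-shot decode call is unsafe for multi-byte encodings, while piping through a decode stream is not.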

Using with ES Modules / TypeScript

import toolkit from 'universal-encoding-toolkit';
// or import specific exports:
import { UniversalEncodingToolkit, EncodingDetector, normalizeEncoding } from 'universal-encoding-toolkit';

const buf = toolkit.encode('Hello, 世界!', 'utf-8');
const result = toolkit.smartDecode(buf);
// Full IntelliSense support with included type declarations

Detection Engine

The auto-detection engine uses a 9-stage pipeline:

Input Buffer
    │
    ├─ Stage 1: BOM signature detection          → confidence 1.0
    ├─ Stage 2: Pure ASCII fast path              → confidence 1.0
    ├─ Stage 3: UTF-8 validation                  → confidence 0.85~0.99
    ├─ Stage 4: UTF-16 null-byte heuristics       → confidence 0.80
    ├─ Stage 5: High-byte pattern analysis
    ├─ Stage 6: CJK multi-byte evaluation
    │           (GBK/GB18030/Big5/Shift_JIS/EUC-JP/EUC-KR)
    ├─ Stage 7: Single-byte statistical scoring
    │           (windows-125x, ISO-8859-x, KOI8-x, etc.)
    ├─ Stage 8: Arbitration (with short-text protection)
    └─ Stage 9: Post-process disambiguation (12 known patterns)
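To make the first three stages concrete, here is a minimal sketch built only on Node built-ins. It is not the toolkit's implementation: `detectEarlyStages`, the `BOMS` table, the `source` labels, and the fixed 0.99 score for valid UTF-8 are assumptions chosen to fit the diagram above. Stages 4 to 9 are omitted, so the function returns `null` where they would take over.

```javascript
// Byte-order marks, longest first so UTF-32LE is not mistaken for UTF-16LE.
const BOMS = [
  { encoding: 'utf-8', bytes: [0xef, 0xbb, 0xbf] },
  { encoding: 'utf-32le', bytes: [0xff, 0xfe, 0x00, 0x00] },
  { encoding: 'utf-32be', bytes: [0x00, 0x00, 0xfe, 0xff] },
  { encoding: 'utf-16le', bytes: [0xff, 0xfe] },
  { encoding: 'utf-16be', bytes: [0xfe, 0xff] },
];

// Hypothetical sketch of stages 1 to 3; returns null when later stages are needed.
function detectEarlyStages(buffer) {
  // Stage 1: BOM signature
  for (const { encoding, bytes } of BOMS) {
    if (buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b)) {
      return { encoding, confidence: 1.0, source: 'bom' };
    }
  }
  // Stage 2: pure ASCII fast path (no byte has the high bit set)
  if (buffer.every((b) => b < 0x80)) {
    return { encoding: 'ascii', confidence: 1.0, source: 'ascii' };
  }
  // Stage 3: strict UTF-8 validation; a fatal TextDecoder throws on invalid input
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(buffer);
    return { encoding: 'utf-8', confidence: 0.99, source: 'utf8-validation' };
  } catch {
    return null; // not UTF-8: stages 4 to 9 would run here
  }
}

console.log(detectEarlyStages(Buffer.from('hello')));                  // ascii
console.log(detectEarlyStages(Buffer.from('你好', 'utf8')));            // utf-8
console.log(detectEarlyStages(Buffer.from([0xc4, 0xe3, 0xba, 0xc3]))); // null (GBK bytes)
```

Running the cheap, near-certain checks first means the statistical stages only see buffers that contain high bytes and are not valid UTF-8.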

Advanced Usage

Custom instance

const { UniversalEncodingToolkit } = require('universal-encoding-toolkit');

const myToolkit = new UniversalEncodingToolkit();
// someBuffer is any Buffer, e.g. the result of fs.readFileSync
const result = myToolkit.detect(someBuffer);

Access sub-modules

const {
  EncodingDetector,
  normalizeEncoding,
  ENCODING_GROUPS,
  ENCODING_ALIASES
} = require('universal-encoding-toolkit');

// Use detector directly (buffer is any Buffer, e.g. the result of fs.readFileSync)
const detector = new EncodingDetector();
const result = detector.detectAll(buffer);

// Normalize encoding names
normalizeEncoding('sjis');   // 'shiftjis'
normalizeEncoding('cp936');  // 'gbk'