jschardet-ultra

v2.1.0

Published

3 months ago

Universal character encoding detection for JavaScript - supports 100+ encodings including CJK, Unicode, Windows code pages, ISO-8859, IBM/DOS, Macintosh, KOI8 and more

sgf227

encoding charset detection unicode cjk gbk big5 shift_jis euc-jp euc-kr iso-8859 windows-1252

jschardet-ultra is built on the Mozilla Universal Charset Detector engine from jschardet-eastasia and adds comprehensive single-byte encoding support via iconv-lite.

Compared to jschardet-eastasia

| Feature | jschardet-eastasia | jschardet-ultra |
|---------|-------------------|-----------------|
| Encodings supported | ~16 | 100+ |
| CJK detection accuracy | High | Same (reuses original engine) |
| Single-byte encodings | ❌ Not active | ✅ Full support |
| Windows code pages | ❌ | ✅ 10 encodings |
| ISO-8859 series | ❌ | ✅ 15 encodings |
| IBM/DOS code pages | ❌ | ✅ 28 encodings |
| Macintosh encodings | ❌ | ✅ 11 encodings |
| KOI8 series | ❌ | ✅ 4 encodings |
| Module system | CommonJS (IIFE) | CommonJS (class-based) |
| Test framework | QUnit (browser) | Jest (Node.js) |
| Dependencies | None | iconv-lite |

Installation

npm install jschardet-ultra

Usage

const fs = require('fs');
const jschardet = require('jschardet-ultra');

// Detect from Buffer
const buf = fs.readFileSync('some-file.txt');
jschardet.detect(buf);
// { encoding: 'utf-8', confidence: 0.99 }

// Detect from binary string
jschardet.detect('\xEF\xBB\xBFHello');
// { encoding: 'utf-8', confidence: 1.0 }

// Check encoding support
jschardet.encodingExists('windows-1251'); // true

// Normalize encoding name
jschardet.normalizeEncoding('sjis'); // 'shift_jis'
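
A detection result is usually a step towards decoding the bytes. The sketch below shows one way to combine the two steps with a fallback when detection is inconclusive. `decodeDetected`, `minConfidence`, and `fallback` are hypothetical names and defaults, not part of the package; the detector and decoder are passed in as functions so the helper does not assume anything beyond the `{ encoding, confidence }` result shape shown above.

```javascript
// Hypothetical helper (not part of jschardet-ultra): detect, then decode.
// `detect` is any function returning { encoding, confidence }.
// `decode` is any function taking (buffer, encodingName) and returning a string.
function decodeDetected(buf, detect, decode, options = {}) {
  // Assumed defaults; tune them for your data.
  const { minConfidence = 0.5, fallback = 'utf-8' } = options;
  const result = detect(buf);
  // Trust the detector only if it named an encoding with enough confidence.
  const trusted = Boolean(result && result.encoding) &&
    result.confidence >= minConfidence;
  const encoding = trusted ? result.encoding : fallback;
  return { encoding, text: decode(buf, encoding) };
}
```

With the real libraries this might be called as `decodeDetected(buf, jschardet.detect, iconv.decode)`, where `iconv` is `require('iconv-lite')`.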

Detection Architecture

Input Data
  ├─ Layer 1: BOM Detection → UTF-8/16/32 (confidence=1.0)
  ├─ Layer 2: ESC Sequence → ISO-2022-*, HZ-GB-2312
  ├─ Layer 3: Multi-byte Statistical → CJK encodings (Mozilla prober engine)
  └─ Layer 4: Single-byte Smart Detection
       ├─ Profile matching (byte signature + invalid byte exclusion)
       ├─ iconv-lite decode + Unicode range language verification
       ├─ DBCS roundtrip validation (fallback for short multi-byte text)
       └─ Brute-force roundtrip (last resort)
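
As an illustration of Layer 1, here is a minimal BOM check written against the standard Unicode byte order marks. It is a sketch, not the package's code: `detectBom` and the encoding labels other than `'utf-8'` are assumptions.

```javascript
// BOM signatures, longest first, so UTF-32 LE (FF FE 00 00) is not
// mistaken for UTF-16 LE (FF FE). Labels are illustrative.
const BOMS = [
  { encoding: 'utf-32be', bytes: [0x00, 0x00, 0xFE, 0xFF] },
  { encoding: 'utf-32le', bytes: [0xFF, 0xFE, 0x00, 0x00] },
  { encoding: 'utf-8', bytes: [0xEF, 0xBB, 0xBF] },
  { encoding: 'utf-16be', bytes: [0xFE, 0xFF] },
  { encoding: 'utf-16le', bytes: [0xFF, 0xFE] },
];

// Returns { encoding, confidence: 1.0 } if the input starts with a BOM,
// otherwise null so the caller can fall through to the next layer.
function detectBom(buf) {
  for (const { encoding, bytes } of BOMS) {
    if (buf.length >= bytes.length && bytes.every((b, i) => buf[i] === b)) {
      return { encoding, confidence: 1.0 };
    }
  }
  return null;
}
```

Note that `FF FE 00 00` is also a valid start for UTF-16 LE text beginning with a NUL character; this sketch resolves that case in favour of UTF-32 LE.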

When the multi-byte (MBCS) prober reports a confidence below 0.80, single-byte detection also runs and the result with the higher confidence wins. This prevents false positives on short text, where byte patterns that look multi-byte can also be valid in single-byte encodings.
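
The threshold rule can be sketched as a small combiner. Only the 0.80 threshold comes from the description above; `chooseResult`, the callback shape, and the tie-break are assumptions made for illustration.

```javascript
const MBCS_THRESHOLD = 0.80; // threshold stated in the text

// Hypothetical combiner: `mbcs` is the multi-byte result (or null);
// `detectSingleByte` is a callback so Layer 4 only runs when needed.
function chooseResult(mbcs, detectSingleByte) {
  if (mbcs && mbcs.confidence >= MBCS_THRESHOLD) {
    return mbcs; // confident multi-byte match: skip single-byte detection
  }
  const sbcs = detectSingleByte();
  const candidates = [mbcs, sbcs].filter((r) => r && r.encoding);
  if (candidates.length === 0) {
    return { encoding: null, confidence: 0 };
  }
  // On a tie the earlier (multi-byte) candidate is kept; this is an assumption.
  return candidates.reduce((best, r) =>
    (r.confidence > best.confidence ? r : best));
}
```

Passing the single-byte step as a callback keeps the expensive path lazy: it is never invoked when the multi-byte result already clears the threshold.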

Supported Encodings

Unicode

UTF-8 (with or without BOM)
UTF-16 LE/BE (with BOM)
UTF-32 LE/BE (with BOM)
ASCII

CJK Multi-byte (DBCS)

Chinese: GB2312, GBK, GB18030, Big5, CP950, CP936, HZ-GB-2312, ISO-2022-CN
Japanese: Shift_JIS, CP932, EUC-JP, ISO-2022-JP
Korean: EUC-KR, CP949, ISO-2022-KR

Windows Code Pages

windows-874 (Thai), windows-1250 (Central European), windows-1251 (Cyrillic)
windows-1252 (Western), windows-1253 (Greek), windows-1254 (Turkish)
windows-1255 (Hebrew), windows-1256 (Arabic), windows-1257 (Baltic)
windows-1258 (Vietnamese)

ISO-8859 Series

ISO-8859-1 through ISO-8859-16 (except 12)

IBM/DOS Code Pages

CP437, CP737, CP775, CP808, CP850, CP852, CP855, CP856, CP857, CP858
CP860–866, CP869, CP922, CP720, CP1046, CP1124–1163

Macintosh

MacRoman, MacCyrillic, MacGreek, MacTurkish, MacIceland
MacCentEuro, MacCroatian, MacRomania, MacUkraine, MacThai

KOI8 Series

KOI8-R, KOI8-U, KOI8-RU, KOI8-T

Other

ARMSCII-8, RK1048, TCVN, Georgian, PT154, VISCII, TIS-620, etc.

API

`jschardet.detect(input)`

Detect the encoding of a Buffer or binary string.

input: Buffer or string
returns: { encoding: string | null, confidence: number }
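
A "binary string" here is taken to mean a string with one byte per character (char codes 0 to 255), as in the `'\xEF\xBB\xBFHello'` usage example. Under that assumption, the sketch below shows how such a string maps to a Buffer in Node.js; `binaryStringToBuffer` is a hypothetical helper, not part of the package.

```javascript
// A binary string holds one byte per character (char codes 0-255).
// Node's 'latin1' encoding maps each such character to the same byte value.
function binaryStringToBuffer(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xFF) {
      // Already-decoded text (e.g. '日本語') is not a binary string.
      throw new RangeError(`Not a binary string: char code > 255 at index ${i}`);
    }
  }
  return Buffer.from(str, 'latin1');
}
```

Passing already-decoded text to a byte-level detector is a common mistake; the guard makes that case fail loudly instead of silently truncating code units.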

`jschardet.detectAll(input)`

Detect encoding with all candidates and their confidence levels.

returns: Array<{ encoding: string, confidence: number }> sorted by confidence
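
One use for the full candidate list is spotting near-ties. The sketch below works on any array with the documented `{ encoding, confidence }` shape; `summarizeCandidates` and the `margin` default of 0.1 are assumptions, not part of the package.

```javascript
// Hypothetical helper: inspects an array shaped like detectAll()'s return value.
function summarizeCandidates(candidates, margin = 0.1) {
  // Sort a copy, highest confidence first, rather than relying on input order.
  const sorted = [...candidates].sort((a, b) => b.confidence - a.confidence);
  if (sorted.length === 0) {
    return { best: null, ambiguous: false, rivals: [] };
  }
  const [best, ...rest] = sorted;
  // Rivals are candidates whose confidence is within `margin` of the best.
  const rivals = rest.filter((c) => best.confidence - c.confidence <= margin);
  return { best, ambiguous: rivals.length > 0, rivals };
}
```

A caller might log or prompt when `ambiguous` is true instead of silently trusting the top candidate.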

`jschardet.encodingExists(name)`

Check if an encoding is supported.

`jschardet.normalizeEncoding(name)`

Normalize an encoding name to its canonical form (e.g. 'sjis' → 'shift_jis').
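
The two name helpers combine naturally when validating a user-supplied label. In this sketch `resolveEncoding` is a hypothetical function, and it assumes that `encodingExists` accepts the canonical names returned by `normalizeEncoding`; the API object is passed in so the helper depends only on the two documented method names.

```javascript
// Hypothetical helper: validate a user-supplied encoding label.
// `api` is expected to expose normalizeEncoding(name) and encodingExists(name).
function resolveEncoding(name, api) {
  const canonical = api.normalizeEncoding(name);
  // Return null rather than throwing so callers can choose their own fallback.
  return api.encodingExists(canonical) ? canonical : null;
}
```

With the package loaded as `jschardet`, a call would look like `resolveEncoding('sjis', jschardet)`.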

Test Results

| Category | Count | Pass Rate |
|----------|-------|-----------|
| BOM detection | 6 | 100% |
| Pure ASCII | 5 | 100% |
| Boundary conditions | 7 | 71% (3-byte edge cases) |
| CJK long text | 7 | 100% |
| CJK short text | 6 | 83% (single char edge) |
| Cyrillic encodings | 5 | 100% |
| Western encodings | 3 | 100% |
| Greek/Hebrew/Arabic/Thai | 5 | 100% |
| Mixed content | 3 | 100% |
| Large data (15KB+) | 4 | 100% |
| Special byte sequences | 3 | 100% |
| Total | 54 | 96.3% |

66-encoding round test: 66/66 (100%)

Known Limitations

Extremely short text (< 4 bytes) may be detected unreliably because there is not enough statistical data.
Encodings within the same language family (e.g. windows-1252 vs ISO-8859-1, or CP437 vs CP850) share nearly identical byte ranges and are inherently ambiguous.
Depends on iconv-lite (~300KB), unlike the zero-dependency original.
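
To make the windows-1252 vs ISO-8859-1 ambiguity concrete: the two encodings differ only in bytes 0x80 to 0x9F, which are C1 control codes in ISO-8859-1 but mostly printable characters in windows-1252. The sketch below is one possible tie-breaker a caller could apply; it is not the package's logic, and both function names are hypothetical.

```javascript
// Bytes 0x80-0x9F are C1 control codes in ISO-8859-1 but mostly printable
// characters (curly quotes, dashes, the euro sign) in windows-1252.
function hasC1RangeByte(buf) {
  for (const byte of buf) {
    if (byte >= 0x80 && byte <= 0x9F) return true;
  }
  return false;
}

// Hypothetical tie-breaker between the two labels.
// Without any 0x80-0x9F byte, both encodings decode the input identically,
// so the choice of label does not change the decoded text.
function preferWesternLabel(buf) {
  return hasC1RangeByte(buf) ? 'windows-1252' : 'iso-8859-1';
}
```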

Project Structure

jschardet-ultra/
├── index.js                     # Root entry
├── src/
│   ├── index.js                 # Main API
│   ├── constants.js             # Detection constants
│   ├── universal-detector.js    # Core detection engine
│   ├── coding-state-machine.js  # Byte state machine
│   ├── charset-group-prober.js  # Group prober base
│   ├── encoding-aliases.js      # Alias resolver
│   ├── probers/                 # Encoding probers
│   │   ├── charset-prober.js
│   │   ├── mb-charset-prober.js
│   │   ├── utf8-prober.js
│   │   ├── esc-prober.js
│   │   ├── jp-probers.js
│   │   ├── cjk-probers.js
│   │   └── mbcs-group-prober.js
│   └── models/                  # Statistical models
│       ├── mbcssm.js            # Multi-byte state machines
│       ├── escsm.js             # ESC state machines
│       ├── chardistribution.js  # Char distribution
│       └── *freq.js             # Frequency tables
├── test/
│   ├── detect.test.js           # Jest unit tests
│   ├── run-round-test.js        # 66-encoding round test
│   └── comprehensive-test.js    # 54-item comprehensive + boundary test
└── test-results/                # Test result JSON files

License

MIT (new code) + LGPL-2.1 (original Mozilla chardet engine)
