exos_codepage

v1.4.1

Published

3 years ago

pure-JS library to handle codepages


roysalgado

codepage iconv convert strings

Codepages for JS

Codepages are character encodings. In many contexts, single- or double-byte character sets are used in lieu of Unicode encodings. The codepages map between characters and numbers.

unicode.org hosts lists of mappings. The build script automatically downloads and parses the mappings in order to generate the full script. The pages.csv description in codepage.md controls which codepages are used.

Setup

In node:

var cptable = require('codepage');

In the browser:

<script src="cptable.js"></script>
<script src="cputils.js"></script>

Alternatively, use the full version in the dist folder:

<script src="cptable.full.js"></script>

The complete set of codepages is large because some of the Double Byte Character Set (DBCS) encodings are large. A much smaller file that includes only the single-byte (SBCS) codepages is provided in this repo (sbcs.js), as is a file for other projects (cpexcel.js).

If you know which codepages you need, you can include individual scripts for each codepage. The individual files are provided in the bits/ directory. For example, to include only the Mac codepages:

<script src="bits/10000.js"></script>
<script src="bits/10006.js"></script>
<script src="bits/10007.js"></script>
<script src="bits/10029.js"></script>
<script src="bits/10079.js"></script>
<script src="bits/10081.js"></script>

All of the browser scripts define and append to the cptable object. To rename the object, edit the JSVAR shell variable in make.sh and run the script.

The utility functions are contained in cputils.js, which assumes that the appropriate codepage scripts have already been loaded.

Usage

The codepages are indexed by number. To get the Unicode character for a given codepoint, use the dec property:

var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ

To get the codepoint for a given character, use the enc property:

var cp10000_711 = cptable[10000].enc[String.fromCharCode(711)]; // 255
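
To illustrate how dec and enc relate to each other, here is a minimal sketch that builds both from a small hand-written table. The buildCodepage helper and the three-entry table are hypothetical and exist only for illustration; the generated tables shipped with the library may be built differently.

```javascript
// Hypothetical helper: build paired lookup tables from a string in which
// the character at index i is the Unicode character for code i.
function buildCodepage(chars) {
  var dec = [];
  var enc = {};
  for (var i = 0; i < chars.length; ++i) {
    var ch = chars.charAt(i);
    dec[i] = ch;
    // Assumption: U+FFFD marks an unassigned code, so it gets no reverse mapping.
    if (ch !== '\uFFFD') enc[ch] = i;
  }
  return { dec: dec, enc: enc };
}

// Toy three-entry table (not a real codepage).
var toy = buildCodepage('A\uFFFD\u02C7');
var toyChar = toy.dec[2];        // code -> character
var toyCode = toy.enc['\u02C7']; // character -> code
```

For every assigned code, enc is the inverse of dec, so toy.enc[toy.dec[i]] gives back i.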

There are a few utilities that deal with strings and buffers:

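var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]); // codepage 936 bytes -> String
var buf = cptable.utils.encode(936, 汇总); // String -> codepage 936 bytes
var sushi = cptable.utils.decode(65001, [0xf0,0x9f,0x8d,0xa3]); // 🍣 (65001 is UTF-8)
var sbuf = cptable.utils.encode(65001, sushi); // String -> UTF-8 bytes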
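
As a cross-check that does not depend on this library, the UTF-8 bytes from the example can be decoded with Node's built-in Buffer. This is only a sketch for Node environments; the variable names are illustrative.

```javascript
// The same four bytes used in the codepage 65001 (UTF-8) example.
var utf8Bytes = [0xf0, 0x9f, 0x8d, 0xa3];

// Decode with Node's Buffer rather than cptable.utils.decode.
var sushiCheck = Buffer.from(utf8Bytes).toString('utf8');

// Encode back to bytes and copy them into a plain Array.
var roundTrip = Array.from(Buffer.from(sushiCheck, 'utf8'));

// U+1F363 lies outside the Basic Multilingual Plane, so the JS string
// holds it as a surrogate pair (two UTF-16 code units).
var codeUnits = sushiCheck.length;
var codePoint = sushiCheck.codePointAt(0);
```

The decoded string is a single character but has length 2, which matters when indexing into decoded strings.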

cptable.utils.encode(CP, data, ofmt) accepts a String or Array of characters and returns a representation controlled by ofmt:

By default, the output is a Buffer (or Array) of bytes (integers between 0 and 255).
If ofmt == 'str', the output is a String where o.charCodeAt(i) is the ith byte.
If ofmt == 'arr', the output is an Array of bytes.
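
The three output formats described above carry the same bytes in different containers. The following sketch shows one way such a conversion could look; formatBytes is a hypothetical helper, not part of the library, and the library's own implementation may differ.

```javascript
// Hypothetical helper mirroring the three output formats described above.
function formatBytes(bytes, ofmt) {
  if (ofmt === 'str') {
    // One character per byte, so charCodeAt(i) recovers the ith byte.
    return bytes.map(function (b) { return String.fromCharCode(b); }).join('');
  }
  if (ofmt === 'arr') return bytes.slice();
  // Default: a Buffer where available, otherwise a plain Array.
  return typeof Buffer !== 'undefined' ? Buffer.from(bytes) : bytes.slice();
}

var bytes = [0xbb, 0xe3, 0xd7, 0xdc];
var asStr = formatBytes(bytes, 'str');
var asArr = formatBytes(bytes, 'arr');
var asDefault = formatBytes(bytes);
```

The 'str' form is a byte string rather than readable text: each character stands for one byte of the encoded output.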

Known Excel Codepages

A much smaller script, including only the codepages known to be used in Excel, is available under the name cpexcel. It exposes the same variable cptable and is suitable as a drop-in replacement when the full codepage tables are not needed.

In node:

var cptable = require('codepage/dist/cpexcel.full');

Rolling your own script

The make.sh script in the repo can take a manifest and generate JS source.

Usage:

bash make.sh path_to_manifest output_file_name JSVAR

where

JSVAR is the name of the exported variable (generally cptable)
output_file_name is the output file (e.g. cpexcel.js, cptable.js)
path_to_manifest is the path to the manifest file.

The manifest file is expected to be a CSV with 3 columns:

<codepage number>,<source>,<size>

If a source is specified, the script will try to download the specified file and parse it. The file is expected to follow the format used on the unicode.org site. The size should be 1 for a single-byte codepage and 2 for a double-byte codepage. For mixed codepages (which use some single- and some double-byte codes), the script assumes the mapping is a prefix code and generates efficient JS code.
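
To show why the prefix-code assumption matters for mixed codepages, here is a minimal decoding sketch. The toy table, the leadBytes set, and decodeMixed are all hypothetical; this is not the code that make.sh generates. Because no single-byte code is also the first byte of a double-byte code, the decoder can tell from the current byte alone how many bytes to consume.

```javascript
// Toy mixed-width table (hypothetical, not a real codepage).
// Single-byte codes are keyed by the byte itself;
// double-byte codes are keyed by (lead << 8) | trail.
var toyDec = {
  0x41: 'A',
  0x42: 'B',
  0x8141: '\u3042',
  0x8142: '\u3044'
};

// In a prefix code, a lead byte is never a complete single-byte code.
var leadBytes = { 0x81: true };

function decodeMixed(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; ++i) {
    var code = bytes[i];
    if (leadBytes[code]) {
      if (i + 1 >= bytes.length) throw new Error('truncated double-byte sequence');
      // Consume the trail byte as part of the same code.
      code = (code << 8) | bytes[++i];
    }
    var ch = toyDec[code];
    // Unknown codes become U+FFFD, the Unicode replacement character.
    out += ch === undefined ? '\uFFFD' : ch;
  }
  return out;
}

var decoded = decodeMixed([0x41, 0x81, 0x41, 0x42, 0x81, 0x42]);
```

Note that the byte 0x41 appears twice in the input with different meanings: once as a standalone code and once as the trail byte of a double-byte code.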

Generated scripts include only the mapping. Concatenate (cat) a mapping with cputils.js to produce a complete script such as cpexcel.full.js.

Building the complete script

This script uses voc. The script to build the codepage tables and the JS source is codepage.md, so building is as simple as voc codepage.md.

Generated Codepages

The complete list of hardcoded codepages can be found in the file pages.csv.

Some codepages are easier to implement algorithmically. Since these are hardcoded in utils, there is no corresponding entry (they are "magic").

Note that MakeEncoding.cs deviates from unicode.org for some codepages. In the case of direct conflicts, unicode.org takes precedence. In cases where the unicode.org listing does not prescribe a value, the MakeEncoding.cs value is used.

NLS refers to the National Language Support files supplied in various versions of Windows. In older versions of Windows (e.g. Windows 98) these files followed the pattern CP_#.NLS, but newer versions use the pattern C_#.NLS.
