unicode-escaper

v1.0.1

Published

3 days ago

A robust Unicode escape/unescape library supporting multiple formats with streaming support

Vulnerabilities: 0 High, 0 Medium, 0 Low

jeongmincho

unicode escape unescape encoding decoding utf8 utf16 html-entities codepoint surrogate-pairs stream

unicode-escaper

A robust, zero-dependency Unicode escape/unescape library for JavaScript and TypeScript. Supports multiple escape formats, bidirectional conversion, and streaming for large files.

Features

Multiple escape formats: \uXXXX, \u{XXXXX}, \xNN, &#xNNNN;, &#NNNN;, U+XXXX
Bidirectional: Both escape and unescape in one package
Streaming support: Process large files efficiently with Node.js and Web Streams
Full Unicode support: Handles BMP, supplementary planes, surrogate pairs, and emoji
Zero dependencies: Lightweight and fast
TypeScript-first: Written in TypeScript with strict types
Dual ESM/CJS: Works with both module systems
Customizable filters: Control exactly which characters to escape

Installation

npm install unicode-escaper
# or
pnpm add unicode-escaper
# or
yarn add unicode-escaper

Quick Start

import { escape, unescape } from "unicode-escaper";

// Escape non-ASCII characters
escape("Hello 世界");
// => 'Hello \u4E16\u754C'

// Unescape back to original
unescape("Hello \\u4E16\\u754C");
// => 'Hello 世界'

Escape Formats

| Format | Example | Description |
| -------------- | ---------- | -------------------------------------------- |
| unicode | \u4E16 | Standard JavaScript Unicode escape (default) |
| unicode-es6 | \u{4E16} | ES6 Unicode escape (supports full range) |
| hex | \xE9 | Hex escape (0x00-0xFF only) |
| html-hex | &#x4E16; | HTML hexadecimal entity |
| html-decimal | &#19990; | HTML decimal entity |
| codepoint | U+4E16 | Unicode code point notation |
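
To make the six notations concrete, here is a minimal sketch that renders one code point in each format using only built-in string methods. It is not the library's implementation: `SketchFormat`, `hex` and `formatCodePoint` are hypothetical names, and rejecting out-of-range input for `hex` is an assumption about one reasonable behavior.

```typescript
// Format names mirror the table above; everything else is illustrative.
type SketchFormat =
  | "unicode"
  | "unicode-es6"
  | "hex"
  | "html-hex"
  | "html-decimal"
  | "codepoint";

// Uppercase hex, left-padded with zeros to at least minWidth digits.
function hex(n: number, minWidth: number): string {
  return n.toString(16).toUpperCase().padStart(minWidth, "0");
}

function formatCodePoint(cp: number, format: SketchFormat): string {
  switch (format) {
    case "unicode": {
      if (cp <= 0xffff) return "\\u" + hex(cp, 4);
      // Above the BMP, \uXXXX has to be written as a UTF-16 surrogate pair.
      const offset = cp - 0x10000;
      const high = 0xd800 + (offset >> 10);
      const low = 0xdc00 + (offset & 0x3ff);
      return "\\u" + hex(high, 4) + "\\u" + hex(low, 4);
    }
    case "unicode-es6":
      return "\\u{" + hex(cp, 1) + "}";
    case "hex":
      // Assumption: values above 0xFF are rejected in this sketch.
      if (cp > 0xff) throw new RangeError("\\xNN covers 0x00-0xFF only");
      return "\\x" + hex(cp, 2);
    case "html-hex":
      return "&#x" + hex(cp, 1) + ";";
    case "html-decimal":
      return "&#" + cp.toString(10) + ";";
    case "codepoint":
      return "U+" + hex(cp, 4);
  }
}

console.log(formatCodePoint(0x4e16, "html-decimal")); // &#19990;
console.log(formatCodePoint(0x1f600, "unicode")); // \uD83D\uDE00
```

The `unicode` branch shows why the ES6 form is shorter for emoji: one braced code point replaces two surrogate escapes.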

API Reference

Core Functions

`escape(input, options?)`

Escapes Unicode characters in a string.

import { escape } from "unicode-escaper";

// Default: preserve ASCII, escape everything else
escape("Café 世界 😀");
// => 'Caf\u00E9 \u4E16\u754C \uD83D\uDE00'

// Use ES6 format for emoji (cleaner output)
escape("Hello 😀", { format: "unicode-es6" });
// => 'Hello \u{1F600}'

// HTML entities
escape("Café", { format: "html-hex" });
// => 'Caf&#xE9;'

escape("Café", { format: "html-decimal" });
// => 'Caf&#233;'

// Escape everything (including ASCII)
escape("Hi", { preserveAscii: false });
// => '\u0048\u0069'

// Preserve Latin-1 characters
escape("Café 世界", { preserveLatin1: true });
// => 'Café \u4E16\u754C'

// Lowercase hex digits
escape("世", { uppercase: false });
// => '\u4e16'
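
The way the three options above interact can be sketched in a few lines. This is an illustration written against the documented outputs, not the library's source: `sketchEscape` and `SketchEscapeOptions` are hypothetical names, only the default `\uXXXX` format is covered, and treating `preserveLatin1` as a superset of `preserveAscii` is an assumption.

```typescript
interface SketchEscapeOptions {
  preserveAscii?: boolean; // keep 0x00-0x7F as-is (default true)
  preserveLatin1?: boolean; // keep 0x00-0xFF as-is (default false)
  uppercase?: boolean; // uppercase hex digits (default true)
}

function sketchEscape(input: string, options: SketchEscapeOptions = {}): string {
  const { preserveAscii = true, preserveLatin1 = false, uppercase = true } = options;
  let out = "";
  // \uXXXX works on UTF-16 code units, so iterating units yields
  // surrogate pairs for supplementary characters without extra logic.
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const keep =
      (preserveAscii && unit <= 0x7f) || (preserveLatin1 && unit <= 0xff);
    if (keep) {
      out += input[i];
      continue;
    }
    let digits = unit.toString(16).padStart(4, "0");
    if (uppercase) digits = digits.toUpperCase();
    out += "\\u" + digits;
  }
  return out;
}

console.log(sketchEscape("Hi", { preserveAscii: false })); // \u0048\u0069
console.log(sketchEscape("\u4E16", { uppercase: false })); // \u4e16
```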

`unescape(input, options?)`

Unescapes Unicode sequences back to characters.

import { unescape } from "unicode-escaper";

// Automatically detects and unescapes all formats
unescape("\\u4E16"); // => '世'
unescape("\\u{1F600}"); // => '😀'
unescape("\\xE9"); // => 'é'
unescape("&#x4E16;"); // => '世'
unescape("&#19990;"); // => '世'
unescape("U+4E16"); // => '世'

// Handle surrogate pairs
unescape("\\uD83D\\uDE00"); // => '😀'

// Only unescape specific formats
unescape("\\u4E16 &#x4E16;", { formats: ["unicode"] });
// => '世 &#x4E16;'

// Strict mode (throws on invalid sequences)
unescape("\\uZZZZ", { lenient: false });
// => throws Error
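
Automatic format detection can be pictured as a single regular expression with one alternative per format. The sketch below assumes lenient behavior (unrecognized or out-of-range sequences are left untouched) and uses hypothetical names `ESCAPE_PATTERN` and `sketchUnescape`. The library's actual matching rules may differ.

```typescript
// One alternative per format: \u{...}, \uXXXX, \xNN, &#x...;, &#...;, U+XXXX
const ESCAPE_PATTERN =
  /\\u\{([0-9A-Fa-f]{1,6})\}|\\u([0-9A-Fa-f]{4})|\\x([0-9A-Fa-f]{2})|&#x([0-9A-Fa-f]{1,6});|&#([0-9]{1,7});|U\+([0-9A-Fa-f]{4,6})\b/g;

function sketchUnescape(input: string): string {
  return input.replace(
    ESCAPE_PATTERN,
    (match: string, ...groups: Array<string | undefined>) => {
      const [es6, u4, x2, htmlHex, htmlDec, uPlus] = groups;
      // \uXXXX and \xNN each name one UTF-16 code unit, so two adjacent
      // surrogate escapes recombine into a single character on their own.
      const unit = u4 ?? x2;
      if (unit !== undefined) return String.fromCharCode(parseInt(unit, 16));
      const hexDigits = es6 ?? htmlHex ?? uPlus;
      const cp =
        hexDigits !== undefined
          ? parseInt(hexDigits, 16)
          : parseInt(htmlDec as string, 10);
      // Lenient choice (an assumption): keep out-of-range values as written.
      return cp <= 0x10ffff ? String.fromCodePoint(cp) : match;
    }
  );
}

console.log(sketchUnescape("\\uD83D\\uDE00") === "\u{1F600}"); // true
console.log(sketchUnescape("\\uZZZZ")); // \uZZZZ (left unchanged)
```

A strict mode would throw instead of returning `match`, and would also need to flag sequences the pattern never matches, such as `\uZZZZ`.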

Convenience Functions

import {
  escapeToUnicode, // \uXXXX format
  escapeToUnicodeES6, // \u{XXXXX} format
  escapeToHex, // \xNN format
  escapeToHtmlHex, // &#xNNNN; format
  escapeToHtmlDecimal, // &#NNNN; format
  escapeToCodePoint, // U+XXXX format
  escapeAll, // Escape all characters
  escapeNonPrintable, // Escape control chars and non-ASCII
} from "unicode-escaper";

escapeToUnicodeES6("😀"); // => '\u{1F600}'
escapeToHtmlHex("世"); // => '&#x4E16;'
escapeAll("Hi"); // => '\u0048\u0069'

import {
  unescapeUnicode, // Only \uXXXX
  unescapeUnicodeES6, // Only \u{XXXXX}
  unescapeHex, // Only \xNN
  unescapeHtmlHex, // Only &#xNNNN;
  unescapeHtmlDecimal, // Only &#NNNN;
  unescapeCodePoint, // Only U+XXXX
  unescapeHtml, // Both HTML formats
  unescapeJs, // All JavaScript formats
} from "unicode-escaper";

Custom Filters

Control which characters to escape using filter functions:

import { escape, isNotAscii, isNotBmp, and, or, oneOf } from "unicode-escaper";

// Escape only non-ASCII (default behavior)
escape("Hello 世界", { filter: isNotAscii });

// Escape only non-BMP characters (most emoji, for example)
escape("Hello 世界 😀", { filter: isNotBmp });
// => 'Hello 世界 \uD83D\uDE00'

// Escape vowels
escape("Hello", { filter: oneOf("aeiouAEIOU") });
// => 'H\u0065ll\u006F'

// Combine filters
escape("Test", { filter: and(isNotAscii, isNotBmp) });

Available filters:

isAscii / isNotAscii - ASCII range (0x00-0x7F)
isLatin1 / isNotLatin1 - Latin-1 range (0x00-0xFF)
isBmp / isNotBmp - Basic Multilingual Plane (0x0000-0xFFFF)
isPrintableAscii / isNotPrintableAscii - Printable ASCII (0x20-0x7E)
isControl - Control characters
isWhitespace - Whitespace characters
isSurrogate / isHighSurrogate / isLowSurrogate - Surrogate code points
inRange(start, end) / notInRange(start, end) - Custom range
oneOf(chars) / noneOf(chars) - Character set
and(...filters) / or(...filters) / not(filter) - Combinators
all / none - Always true/false
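
Filters of this kind are plain predicates, so combinators are small higher-order functions. The definitions below are local stand-ins written for illustration, not imports from the package. The `(char, codePoint)` signature follows the `FilterFunction` example in the TypeScript section, and inclusive range bounds are an assumption based on the ranges listed above.

```typescript
type SketchFilter = (char: string, codePoint: number) => boolean;

// Inclusive range check on the code point.
const inRange = (start: number, end: number): SketchFilter =>
  (_char, cp) => cp >= start && cp <= end;

// Membership in a fixed set of characters (split by code point).
const oneOf = (chars: string): SketchFilter => {
  const set = new Set(Array.from(chars));
  return (char) => set.has(char);
};

const not = (f: SketchFilter): SketchFilter => (c, cp) => !f(c, cp);
const and = (...fs: SketchFilter[]): SketchFilter =>
  (c, cp) => fs.every((f) => f(c, cp));
const or = (...fs: SketchFilter[]): SketchFilter =>
  (c, cp) => fs.some((f) => f(c, cp));

const isAscii = inRange(0x00, 0x7f);
const isBmp = inRange(0x0000, 0xffff);

// "Non-ASCII but still inside the BMP": matches CJK, skips emoji.
const nonAsciiBmp = and(not(isAscii), isBmp);

console.log(nonAsciiBmp("\u4E16", 0x4e16)); // true
console.log(nonAsciiBmp("a", 0x61)); // false
console.log(nonAsciiBmp("\u{1F600}", 0x1f600)); // false
```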

Utility Functions

import {
  getCodePoint, // Get code point of a character
  fromCodePoint, // Create character from code point
  getCharInfo, // Get detailed character information
  toCodePoints, // Convert string to code point array
  fromCodePoints, // Convert code point array to string
  codePointLength, // Get length in code points (not UTF-16)
  toHex, // Convert code point to hex string
  parseHex, // Parse hex string to code point
  isValidUnicode, // Check for unpaired surrogates
  normalizeNFC, // Normalize to NFC
  normalizeNFD, // Normalize to NFD
  unicodeEquals, // Compare Unicode equivalence
} from "unicode-escaper";

// Get code point
getCodePoint("😀"); // => 128512 (0x1F600)

// Character info
getCharInfo("😀");
// => {
//   char: '😀',
//   codePoint: 128512,
//   hex: '1F600',
//   isAscii: false,
//   isBmp: false,
//   isLatin1: false,
//   isHighSurrogate: false,
//   isLowSurrogate: false,
//   utf16Length: 2
// }

// Code point length (differs from string.length for emoji)
"😀".length; // => 2 (UTF-16 code units)
codePointLength("😀"); // => 1 (code points)

// Parse various formats
parseHex("U+1F600"); // => 128512
parseHex("0x4E16"); // => 19990
parseHex("\\u{4E16}"); // => 19990
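
The difference between UTF-16 length and code point length, and the check for unpaired surrogates, can both be expressed with built-ins. The helper names below (`codePointsOf`, `countCodePoints`, `hasLoneSurrogate`) are hypothetical and only approximate what the corresponding utilities might do.

```typescript
// Array.from iterates a string by code point, not by UTF-16 code unit.
const codePointsOf = (s: string): number[] =>
  Array.from(s, (ch) => ch.codePointAt(0) as number);

const countCodePoints = (s: string): number => Array.from(s).length;

// True if the string contains a surrogate that is not part of a valid pair.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = s.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair, skip the low half
        continue;
      }
      return true; // high surrogate without a following low surrogate
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return true; // stray low surrogate
  }
  return false;
}

console.log(countCodePoints("A\u{1F600}")); // 2
console.log(codePointsOf("A\u{1F600}")); // [ 65, 128512 ]
console.log(hasLoneSurrogate("\uD83D")); // true
```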

Streaming Support

Process large files efficiently without loading everything into memory:

Node.js Streams

import { createReadStream, createWriteStream } from "fs";
import { pipeline } from "stream/promises";
import { EscapeStream, UnescapeStream } from "unicode-escaper";

// Escape a file
await pipeline(
  createReadStream("input.txt", "utf8"),
  new EscapeStream({ escapeOptions: { format: "unicode-es6" } }),
  createWriteStream("escaped.txt")
);

// Unescape a file
await pipeline(
  createReadStream("escaped.txt", "utf8"),
  new UnescapeStream(),
  createWriteStream("output.txt")
);
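
A streaming escaper that accepts arbitrary string chunks has to cope with a chunk ending between the two halves of a surrogate pair. The following sketch shows one way to handle that by holding the trailing high surrogate back until the next chunk arrives. `ChunkedEscaper` and `escapeEs6` are hypothetical names, and nothing here is taken from the `EscapeStream` source.

```typescript
// Escape non-ASCII code points as \u{...}; ASCII passes through.
function escapeEs6(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i) as number;
    out += cp <= 0x7f ? text[i] : "\\u{" + cp.toString(16).toUpperCase() + "}";
    i += cp > 0xffff ? 2 : 1; // supplementary characters span two code units
  }
  return out;
}

class ChunkedEscaper {
  private pending = ""; // a high surrogate waiting for its low half

  push(chunk: string): string {
    let text = this.pending + chunk;
    this.pending = "";
    if (text.length > 0) {
      const last = text.charCodeAt(text.length - 1);
      if (last >= 0xd800 && last <= 0xdbff) {
        // Hold the dangling high surrogate until the next chunk arrives.
        this.pending = text.slice(-1);
        text = text.slice(0, -1);
      }
    }
    return escapeEs6(text);
  }

  flush(): string {
    const rest = this.pending;
    this.pending = "";
    return escapeEs6(rest);
  }
}

const escaper = new ChunkedEscaper();
const parts = [escaper.push("Hi \uD83D"), escaper.push("\uDE00!"), escaper.flush()];
console.log(parts.join("")); // Hi \u{1F600}!
```

The same `push`/`flush` shape maps onto the `transform` and `flush` hooks of a Node.js `Transform`. An unescaping stream needs the analogous trick for an escape sequence that is split across two chunks.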

Web Streams API

import {
  createWebEscapeStream,
  createWebUnescapeStream,
} from "unicode-escaper";

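// Works in browsers and modern Node.js
const response = await fetch("data.txt");
// response.body is null for bodiless responses, so guard before piping
if (!response.body) throw new Error("Response has no body");
const escaped = response.body
  .pipeThrough(new TextDecoderStream())
  .pipeThrough(createWebEscapeStream({ format: "html-hex" }))
  .pipeThrough(new TextEncoderStream());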

Detection Utilities

import { hasEscapeSequences, countEscapeSequences } from "unicode-escaper";

hasEscapeSequences("\\u4E16"); // => true
hasEscapeSequences("Hello"); // => false

countEscapeSequences("\\u4E16\\u754C"); // => 2

// Filter by format
hasEscapeSequences("\\u4E16", ["unicode"]); // => true
hasEscapeSequences("\\u4E16", ["html-hex"]); // => false
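
Detection of this kind is essentially pattern counting. As a rough sketch restricted to the two JavaScript `\u` forms, with hypothetical names `UNICODE_ESCAPE` and `countUnicodeEscapes`:

```typescript
// Matches \u{...} (1-6 hex digits) or \uXXXX (exactly 4 hex digits).
const UNICODE_ESCAPE = /\\u\{[0-9A-Fa-f]{1,6}\}|\\u[0-9A-Fa-f]{4}/g;

function countUnicodeEscapes(input: string): number {
  const matches = input.match(UNICODE_ESCAPE);
  return matches === null ? 0 : matches.length;
}

console.log(countUnicodeEscapes("\\u4E16\\u754C")); // 2
console.log(countUnicodeEscapes("Hello")); // 0
```

This sketch counts a surrogate pair such as `\uD83D\uDE00` as two sequences. Whether `countEscapeSequences` does the same is not stated here.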

TypeScript Support

Full TypeScript support with strict types:

import type {
  EscapeFormat,
  EscapeOptions,
  UnescapeOptions,
  FilterFunction,
  CharacterInfo,
  EscapeResult,
} from "unicode-escaper";

// Type-safe options
const options: EscapeOptions = {
  format: "unicode-es6",
  preserveAscii: true,
  uppercase: true,
};

// Custom filter with proper typing
const myFilter: FilterFunction = (char, codePoint) => {
  return codePoint > 0x7f;
};

Comparison with escape-unicode

| Feature | escape-unicode | unicode-escaper |
| --------------- | ---------------- | --------------- |
| Escape formats | \uXXXX only | 6 formats |
| Unescape | Separate package | Built-in |
| Streaming | No | Yes |
| Web Streams | No | Yes |
| ESM + CJS | CJS only | Both |
| Browser support | Node only | Both |
| TypeScript | Yes | Yes (strict) |
| Zero deps | Yes | Yes |

International Language Support

Fully tested with diverse Unicode scripts:

| Language | Script | Example | Escaped |
| ---------- | ----------------------- | -------------- | -------------------------------------- |
| Korean | Hangul | 안녕하세요 | \uC548\uB155\uD558\uC138\uC694 |
| Japanese | Hiragana/Katakana/Kanji | こんにちは | \u3053\u3093\u306B\u3061\u306F |
| Arabic | Arabic | مرحبا | \u0645\u0631\u062D\u0628\u0627 |
| Thai | Thai | สวัสดี | \u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35 |
| Russian | Cyrillic | Привет | \u041F\u0440\u0438\u0432\u0435\u0442 |
| Hindi | Devanagari | नमस्ते | \u0928\u092E\u0938\u094D\u0924\u0947 |
| Chinese | Han | 你好 | \u4F60\u597D |
| Vietnamese | Latin Extended | Xin chào | Xin ch\u00E0o |
| French | Latin Extended | Café | Caf\u00E9 |
| Turkish | Latin Extended | Türkçe | T\u00FCrk\u00E7e |
| Spanish | Latin Extended | ¡Hola! | \u00A1Hola! |
| Portuguese | Latin Extended | São Paulo | S\u00E3o Paulo |

import { escape, unescape } from "unicode-escaper";

// Korean
escape("안녕하세요"); // => '\uC548\uB155\uD558\uC138\uC694'

// Japanese (mixed scripts)
escape("東京 とうきょう トウキョウ");

// Arabic (RTL)
escape("مرحبا"); // => '\u0645\u0631\u062D\u0628\u0627'

// Thai (with combining vowel signs)
escape("สวัสดี");

// Russian
escape("Привет"); // => '\u041F\u0440\u0438\u0432\u0435\u0442'

// Hindi (with combining marks)
escape("नमस्ते"); // => '\u0928\u092E\u0938\u094D\u0924\u0947'

// Chinese
escape("你好世界"); // => '\u4F60\u597D\u4E16\u754C'

// Vietnamese (with diacritics)
escape("Xin chào"); // => 'Xin ch\u00E0o'

// Turkish (special i variants)
escape("İstanbul"); // => '\u0130stanbul'

// Spanish (inverted punctuation)
escape("¡Hola!"); // => '\u00A1Hola!'

// Portuguese (tildes and cedilla)
escape("São Paulo"); // => 'S\u00E3o Paulo'

// Mixed multi-language content
const mixed = "Hello 안녕 こんにちは 你好 مرحبا สวัสดี Привет नमस्ते";
unescape(escape(mixed)) === mixed; // => true
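
Scripts with combining marks produce more escapes than visible glyphs, because each mark is its own code point. A small sketch with built-ins makes that visible for the Hindi example. `listCodePoints` is a hypothetical helper, and the input is spelled with escapes so the code points are unambiguous.

```typescript
// List every code point of a string in U+XXXX notation.
function listCodePoints(text: string): string[] {
  return Array.from(
    text,
    (ch) =>
      "U+" + (ch.codePointAt(0) as number).toString(16).toUpperCase().padStart(4, "0")
  );
}

// The Hindi greeting from the table: four letters plus a virama and a vowel sign.
const hindi = "\u0928\u092E\u0938\u094D\u0924\u0947";

console.log(listCodePoints(hindi).join(" "));
// U+0928 U+092E U+0938 U+094D U+0924 U+0947
```

U+094D (virama) and U+0947 (vowel sign e) attach to the preceding consonants, so six code points render as fewer visible glyphs.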

Supported Features

Combining characters: Thai tone marks, Arabic diacritics, Hindi matras/virama, Vietnamese diacritics
Bidirectional text: RTL markers, mixed LTR/RTL content
Native numerals: Thai ๒๐๒๔, Arabic ٢٠٢٤, Devanagari २०२४
Conjunct consonants: Hindi samyuktakshar (क्ष, त्र, ज्ञ)
Supplementary planes: Emoji, ancient scripts, mathematical symbols
Normalization: Handles NFC/NFD forms correctly
Extended Latin: French accents, Turkish special i (ı İ), Spanish ñ, Portuguese ã/õ
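
Normalization matters for escaping because the same visible text can have different code point sequences. The sketch below uses only `String.prototype.normalize` to show the effect. `toUnicodeEscapes` is a hypothetical helper, not a package export.

```typescript
// Render each UTF-16 code unit as \uXXXX so the difference is visible.
function toUnicodeEscapes(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    out += "\\u" + text.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return out;
}

const composed = "\u00E9"; // "é" as a single code point (NFC form)
const decomposed = composed.normalize("NFD"); // "e" followed by U+0301

console.log(toUnicodeEscapes(composed)); // \u00E9
console.log(toUnicodeEscapes(decomposed)); // \u0065\u0301
console.log(composed === decomposed); // false
console.log(composed === decomposed.normalize("NFC")); // true
```

Normalizing input to one form before escaping keeps the escaped output stable for text that only differs in composition.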

Browser Support

Works in all modern browsers that support ES2022. For older browsers, you may need polyfills for:

String.prototype.codePointAt
String.fromCodePoint
Web Streams API (if using streaming)
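
Before loading polyfills, the three capabilities listed above can be feature-detected at runtime. This is a generic sketch, and the constant names are illustrative.

```typescript
// Code point helpers used for escaping and unescaping.
const hasCodePointApis: boolean =
  typeof String.fromCodePoint === "function" &&
  typeof String.prototype.codePointAt === "function";

// Only needed when the Web Streams helpers are used.
const hasWebStreams: boolean = "TransformStream" in globalThis;

if (!hasCodePointApis) {
  console.warn("String code point APIs are missing; load a polyfill first.");
}
if (!hasWebStreams) {
  console.warn("Web Streams are unavailable; streaming helpers need a polyfill.");
}
```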

License

Apache-2.0
