deghost

v0.0.1

Published

2 months ago

Strip invisible Unicode characters and normalize whitespace. Chainable, typesafe, zero dependencies.

kellymears

unicode invisible zero-width nbsp text normalization sanitize strip clean whitespace control-characters

npm install deghost

Why

Text extracted from binary formats, APIs, and user input often carries invisible Unicode characters: non-breaking spaces, zero-width joiners, directional marks, byte order marks, and control characters. They break string comparison, corrupt search indexes, and garble output.

Existing tools either strip everything indiscriminately or miss entire character categories. deghost gives you category-level control with a chainable API that distinguishes between stripping (remove entirely) and normalizing (replace with a visible substitute).

Quick start

import { deghost } from 'deghost'

// Sensible defaults handle the common cases.
// The leading semicolons keep consecutive template literals from being
// parsed as one tagged template in semicolon-free code.
;`${deghost('Plant\u00a064\u00a0-\u00a0Woodbridge')}`
// → 'Plant 64 - Woodbridge'

;`${deghost('hello\u200Bworld')}`
// → 'helloworld'

// Also works as a tagged template literal
;`${deghost`Plant\u00a064\u00a0-\u00a0Woodbridge`}`
// → 'Plant 64 - Woodbridge'

Chainable API

Fine-grained control over what gets stripped vs. normalized:

import { deghost } from 'deghost'

deghost('text\u200B\u00a0here')
  .strip('format') // zero-width joiners, directional marks, soft hyphens
  .strip('control') // C0/C1 control characters
  .normalize('spaces') // NBSP, en/em space → regular space
  .trim()
  .toString()
// → 'text here'

The chain is immutable — each method returns a new instance, so you can branch without side effects.
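
To make the branching claim concrete, here is a minimal sketch of an immutable string chain. `TextChain` and its methods are hypothetical stand-ins written for illustration, not deghost's actual implementation; the only point is that every call returns a fresh instance, so one base value can feed several independent branches.

```typescript
// Hypothetical sketch of an immutable chain; not deghost's real internals.
class TextChain {
  private readonly value: string

  constructor(value: string) {
    this.value = value
  }

  // Each method returns a NEW chain and leaves `this` untouched.
  strip(pattern: RegExp): TextChain {
    return new TextChain(this.value.replace(pattern, ''))
  }

  normalize(pattern: RegExp, replacement = ' '): TextChain {
    return new TextChain(this.value.replace(pattern, replacement))
  }

  trim(): TextChain {
    return new TextChain(this.value.trim())
  }

  toString(): string {
    return this.value
  }
}

const base = new TextChain('\u00a0a\u200Bb\u00a0')

// Two branches from the same base; neither affects the other.
const stripped = base.strip(/\p{Cf}/gu).toString() // '\u00a0ab\u00a0'
const spaced = base.normalize(/\p{Zs}/gu).trim().toString() // 'a\u200Bb'

console.log(JSON.stringify([stripped, spaced, base.toString()]))
```

Because no method mutates shared state, `base` still holds the original string after both branches have run.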

Chain methods

| Method | Returns | Description |
| --- | --- | --- |
| .strip(category) | DeghostChain | Remove all characters in a category |
| .normalize(category, replacement?) | DeghostChain | Replace characters with a substitute (default: ' ') |
| .replace(category, mapper) | DeghostChain | Replace characters using a function that receives detection metadata |
| .highlight(category?, formatter?) | DeghostChain | Replace ghosts with visible markers like [U+200B] |
| .collapse() | DeghostChain | Collapse runs of whitespace into a single space |
| .trim() | DeghostChain | Trim leading/trailing whitespace |
| .clean() | DeghostChain | Apply the default preset |
| .detect(categories?) | Detection[] | Return detections for the current value |
| .hasGhosts(categories?) | boolean | Check if invisible characters remain |
| .isClean(categories?) | boolean | Inverse of .hasGhosts() |
| .count(categories?) | Record | Count ghosts by category |
| .summary(categories?) | string | Human-readable report of ghosts found |
| .toString() | string | Extract the string |

Categories:

| Category | What it matches | Default behavior |
| --- | --- | --- |
| format | Zero-width joiners, directional marks, soft hyphens (\p{Cf}) | Strip |
| control | C0/C1 control characters (\p{Cc}) | Strip |
| spaces | NBSP, en/em space, thin space, ideographic space (\p{Zs}) | Normalize to ' ' |
| bom | Byte order mark (U+FEFF) | Strip |
| tag | Unicode tag characters (U+E0001–U+E007F) | — |
| fillers | Hangul, Khmer, Mongolian, Ogham fillers | — |
| math | Invisible math operators (U+2061–U+2064) | — |

Reusable cleaners

Build a cleaning pipeline once, apply it to many strings with no per-call chain allocation:

import { cleaner } from 'deghost'

const clean = cleaner().strip('format').strip('control').normalize('spaces').trim().build()

clean('dirty\u00a0string') // 'dirty string'
clean('another\u200Bone') // 'anotherone'

Cleaners also support .replace() and .highlight() for dynamic transformations:

const annotate = cleaner().highlight('format').normalize('spaces').build()

annotate('a\u200Bb\u00a0c') // 'a[U+200B]b c'
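
One way to picture a reusable cleaner is a list of string-to-string steps that is composed once into a single function. The sketch below shows that general technique under stated assumptions; `Step`, `buildCleaner`, and the individual steps are hypothetical names, and deghost's own builder may work differently.

```typescript
// Hypothetical sketch: compose cleaning steps once, then reuse the result.
type Step = (input: string) => string

function buildCleaner(steps: readonly Step[]): Step {
  // Copy the list so later changes to the caller's array cannot alter the cleaner.
  const frozen = [...steps]
  return (input) => frozen.reduce((text, step) => step(text), input)
}

const stripFormat: Step = (s) => s.replace(/\p{Cf}/gu, '')
const normalizeSpaces: Step = (s) => s.replace(/\p{Zs}/gu, ' ')
const trimEnds: Step = (s) => s.trim()

const clean = buildCleaner([stripFormat, normalizeSpaces, trimEnds])

console.log(clean('dirty\u00a0string')) // 'dirty string'
console.log(clean(' another\u200Bone ')) // 'anotherone'
```

The pipeline is assembled a single time; each later call only runs the stored steps over its input.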

Detection

Find out what's hiding in your strings:

import { detect, hasGhosts, isClean, count, first, scan } from 'deghost'

detect('sneaky\u200Btext')
// [{
//   char: '\u200B',
//   codepoint: 'U+200B',
//   name: 'ZERO WIDTH SPACE',
//   category: 'format',
//   offset: 6
// }]

hasGhosts('hello\u200Bworld') // true
isClean('hello world') // true

count('a\u00a0b\u200Bc\u200Bd')
// { spaces: 1, format: 2 }

// Get just the first detection (stops early)
first('a\u200Bb\u00a0c')
// { char: '\u200B', codepoint: 'U+200B', ... }

// Lazy iterator for large strings
for (const d of scan(largeString)) {
  if (d.category === 'format') break
}

All detection functions accept an optional categories array to filter:

detect('a\u200Bb\u00a0c', ['spaces'])
// Only returns the NBSP detection
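
The early-stopping behavior of first and scan can be approximated with a generator over `String.prototype.matchAll`, which yields matches on demand. This is a sketch under assumptions: `Hit` and `scanGhosts` are hypothetical names, only two categories are covered, and offsets are UTF-16 code unit indexes, which may or may not match deghost's own convention.

```typescript
// Hypothetical sketch of lazy detection; the category set is deliberately tiny.
interface Hit {
  char: string
  codepoint: string
  category: 'format' | 'spaces'
  offset: number
}

function* scanGhosts(text: string): Generator<Hit> {
  // Format characters (\p{Cf}), or any space separator other than U+0020.
  const pattern = /\p{Cf}|(?! )\p{Zs}/gu
  for (const match of text.matchAll(pattern)) {
    const char = match[0]
    const hex = char.codePointAt(0)!.toString(16).toUpperCase().padStart(4, '0')
    yield {
      char,
      codepoint: `U+${hex}`,
      category: /\p{Cf}/u.test(char) ? 'format' : 'spaces',
      offset: match.index ?? 0,
    }
  }
}

// Collect everything...
const hits = Array.from(scanGhosts('a\u200Bb\u00a0c'))
// ...or take only the first hit; destructuring closes the generator early.
const [firstHit] = scanGhosts('a\u200Bb\u00a0c')

console.log(hits.map((h) => `${h.codepoint}@${h.offset}`)) // [ 'U+200B@1', 'U+00A0@3' ]
console.log(firstHit.category) // 'format'
```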

Highlighting

Make invisible characters visible for debugging:

import { highlight } from 'deghost'

highlight('hello\u200Bworld')
// 'hello[U+200B]world'

// Custom formatter
highlight('a\u200Bb', (d) => `{${d.name}}`)
// 'a{ZERO WIDTH SPACE}b'

// Filter by category
highlight('a\u00a0b\u200Bc', { categories: ['format'] })
// 'a\u00a0b[U+200B]c'
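
For readers curious how a marker such as [U+200B] can be produced, the following sketch formats each matched code point as uppercase hex. `toMarker` and `showFormatChars` are hypothetical helpers covering only the format category; they are not part of deghost.

```typescript
// Hypothetical sketch of marker-based highlighting; not deghost's internals.
function toMarker(char: string): string {
  // Pad to at least four hex digits, the usual U+XXXX notation.
  const hex = char.codePointAt(0)!.toString(16).toUpperCase().padStart(4, '0')
  return `[U+${hex}]`
}

function showFormatChars(text: string): string {
  // With the `u` flag each match is one full code point, even above U+FFFF.
  return text.replace(/\p{Cf}/gu, toMarker)
}

console.log(showFormatChars('hello\u200Bworld')) // 'hello[U+200B]world'
console.log(showFormatChars('tag\u{E0001}')) // 'tag[U+E0001]'
```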

Summary

Get a human-readable report of all invisible characters:

import { summary } from 'deghost'

summary('hello\u200Bworld\u00a0here')
// 2 invisible characters found.
//
// By category:
//   format: 1
//   spaces: 1
//
// Details:
//   U+200B  ZERO WIDTH SPACE  (format, offset 5)
//   U+00A0  NO-BREAK SPACE  (spaces, offset 11)

Character lookup

Identify a single character or codepoint:

import { identify } from 'deghost'

identify('\u200B')
// { codepoint: 'U+200B', name: 'ZERO WIDTH SPACE', category: 'format' }

identify(0x00a0)
// { codepoint: 'U+00A0', name: 'NO-BREAK SPACE', category: 'spaces' }

identify('a') // undefined — not a ghost

Presets

import { presets } from 'deghost'

// Default: strip format + control + BOM, normalize spaces
presets.clean('text\u00a0with\u200Bghosts')
// → 'text withghosts' (U+200B is stripped, not replaced with a space)

// Aggressive: strip everything invisible
presets.aggressive('text\u2061with\u200Bghosts')
// → 'textwithghosts'

// Spaces only: just normalize whitespace
presets.spaces('text\u00a0here')
// → 'text here'

How it works

deghost uses ES2018 Unicode property escapes (\p{Cf}, \p{Cc}, \p{Zs}) for broad category matching, plus curated codepoint sets for categories not covered by a single Unicode general category (tag characters, script-specific fillers, invisible math operators).
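
The combination of property escapes and curated sets might look like the sketch below. It is an illustration only: the `matchers` table and `categoriesOf` are hypothetical, and the filler list contains just the four Hangul fillers (U+115F, U+1160, U+3164, U+FFA0), which sit in the general category Lo and so cannot be reached through \p{Cf}, \p{Cc}, or \p{Zs}.

```typescript
// Sketch only: a tiny category table, not deghost's actual definitions.
const matchers = {
  format: /\p{Cf}/u, // general category "Format"
  control: /\p{Cc}/u, // general category "Control"
  spaces: /(?! )\p{Zs}/u, // space separators other than U+0020
  fillers: /[\u115F\u1160\u3164\uFFA0]/u, // curated set: Hangul fillers
}

type Category = keyof typeof matchers

function categoriesOf(char: string): Category[] {
  const names = Object.keys(matchers) as Category[]
  return names.filter((name) => matchers[name].test(char))
}

console.log(categoriesOf('\u200B')) // [ 'format' ]
console.log(categoriesOf('\u00a0')) // [ 'spaces' ]
console.log(categoriesOf('\u3164')) // [ 'fillers' ]
console.log(categoriesOf('a')) // []
```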

The key design choice: strip vs. normalize. A non-breaking space (U+00A0) should become a regular space, not disappear — otherwise "Plant\u00a064" becomes "Plant64". deghost handles this by default; out-of-character does not.
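
The difference is easy to reproduce with plain regular expressions. This sketch uses only built-in string methods, so it demonstrates the principle rather than deghost's code.

```typescript
const input = 'Plant\u00a064'

// Stripping removes the separator and fuses the two tokens.
const strippedText = input.replace(/\p{Zs}/gu, '')

// Normalizing swaps it for U+0020 and keeps the word boundary.
const normalizedText = input.replace(/\p{Zs}/gu, ' ')

console.log(strippedText) // 'Plant64'
console.log(normalizedText) // 'Plant 64'
console.log(normalizedText === 'Plant 64') // true
```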

Comparison

| Feature | deghost | out-of-character |
| --- | --- | --- |
| Strip invisible chars | yes | yes |
| Normalize spaces (NBSP → space) | yes | no (strips) |
| Chainable API | yes | no |
| Reusable cleaners | yes | no |
| Detection with metadata | yes | yes |
| Category-level control | yes | no |
| Highlighting / debugging | yes | no |
| Tagged template literal | yes | no |
| TypeScript-native | yes | no |
| Presets | yes | no |
| CLI | not yet | yes |
| Zero dependencies | yes | yes |

Requirements

Node.js >= 18. Uses ES2018 Unicode property escapes (supported in all modern runtimes).
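
If you need to confirm that a given runtime supports property escapes before relying on them, a small feature check is enough. `supportsPropertyEscapes` is a hypothetical helper, not something deghost exports; it builds the pattern through the RegExp constructor so that an unsupported engine throws at call time instead of failing to parse the file.

```typescript
// Hypothetical helper: detect ES2018 Unicode property escape support.
function supportsPropertyEscapes(): boolean {
  try {
    // Engines without support throw a SyntaxError here.
    new RegExp('\\p{Cf}', 'u')
    return true
  } catch {
    return false
  }
}

console.log(supportsPropertyEscapes())
```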

License

MIT
