deghost
v0.0.1
Strip invisible Unicode characters and normalize whitespace. Chainable, typesafe, zero dependencies.
npm install deghost
Why
Text from binary formats, APIs, and user input is full of invisible Unicode characters — non-breaking spaces, zero-width joiners, directional marks, BOM, control characters. They break string comparison, corrupt search indexes, and produce garbled output.
Existing tools either strip everything indiscriminately or miss entire character categories. deghost gives you category-level control with a chainable API that distinguishes between stripping (remove entirely) and normalizing (replace with a visible substitute).
Quick start
import { deghost } from 'deghost'
// Sensible defaults that handle the common cases
;`${deghost('Plant\u00a064\u00a0-\u00a0Woodbridge')}`
// → 'Plant 64 - Woodbridge'
;`${deghost('hello\u200Bworld')}`
// → 'helloworld'
// Also works as a tagged template literal
;`${deghost`Plant\u00a064\u00a0-\u00a0Woodbridge`}`
// → 'Plant 64 - Woodbridge'
Chainable API
Fine-grained control over what gets stripped vs. normalized:
import { deghost } from 'deghost'
deghost('text\u200B\u00a0here')
  .strip('format') // zero-width joiners, directional marks, soft hyphens
  .strip('control') // C0/C1 control characters
  .normalize('spaces') // NBSP, en/em space → regular space
  .trim()
  .toString()
// → 'text here'
The chain is immutable: each method returns a new instance, so you can branch without side effects.
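To make the immutability claim concrete, here is a minimal sketch of how a chain of this kind can be built. `MiniChain` is a hypothetical class written for illustration; it is not deghost's implementation and takes raw regular expressions instead of category names.

```typescript
// Hypothetical MiniChain: illustrates the immutable-chain idea only.
class MiniChain {
  private readonly value: string

  constructor(value: string) {
    this.value = value
  }

  // Remove every match; returns a new instance and leaves `this` untouched.
  strip(pattern: RegExp): MiniChain {
    return new MiniChain(this.value.replace(pattern, ''))
  }

  // Replace every match with a visible substitute (a plain space by default).
  normalize(pattern: RegExp, replacement = ' '): MiniChain {
    return new MiniChain(this.value.replace(pattern, () => replacement))
  }

  trim(): MiniChain {
    return new MiniChain(this.value.trim())
  }

  toString(): string {
    return this.value
  }
}

const base = new MiniChain(' text\u200B\u00a0here ')
const noFormat = base.strip(/\p{Cf}/gu)

// Two branches from the same intermediate step.
const branchA = noFormat.normalize(/\p{Zs}/gu).trim()
const branchB = noFormat.trim()

console.log(JSON.stringify(branchA.toString())) // "text here"
console.log(branchB.toString() === 'text\u00a0here') // true: the NBSP survives in this branch
console.log(base.toString() === ' text\u200B\u00a0here ') // true: the starting value never changed
```

Because every method allocates a fresh instance, `noFormat` can feed both branches without either one affecting the other.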
Chain methods
| Method | Returns | Description |
| ------------------------------------ | -------------- | -------------------------------------------------------------------- |
| .strip(category) | DeghostChain | Remove all characters in a category |
| .normalize(category, replacement?) | DeghostChain | Replace characters with a substitute (default: ' ') |
| .replace(category, mapper) | DeghostChain | Replace characters using a function that receives detection metadata |
| .highlight(category?, formatter?) | DeghostChain | Replace ghosts with visible markers like [U+200B] |
| .collapse() | DeghostChain | Collapse runs of whitespace into a single space |
| .trim() | DeghostChain | Trim leading/trailing whitespace |
| .clean() | DeghostChain | Apply the default preset |
| .detect(categories?) | Detection[] | Return detections for the current value |
| .hasGhosts(categories?) | boolean | Check if invisible characters remain |
| .isClean(categories?) | boolean | Inverse of .hasGhosts() |
| .count(categories?) | Record | Count ghosts by category |
| .summary(categories?) | string | Human-readable report of ghosts found |
| .toString() | string | Extract the string |
Categories:
| Category | What it matches | Default behavior |
| --------- | ------------------------------------------------------------ | ------------------ |
| format | Zero-width joiners, directional marks, soft hyphens (\p{Cf}) | Strip |
| control | C0/C1 control characters (\p{Cc}) | Strip |
| spaces | NBSP, en/em space, thin space, ideographic space (\p{Zs}) | Normalize to ' ' |
| bom | Byte order mark (U+FEFF) | Strip |
| tag | Unicode tag characters (U+E0001–U+E007F) | — |
| fillers | Hangul, Khmer, Mongolian, Ogham fillers | — |
| math | Invisible math operators (U+2061–U+2064) | — |
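The "Default behavior" column can be read as a small rule table. The sketch below expresses the four categories that have a default as data and applies them in order. The names `Rule`, `defaultRules`, and `applyDefaults` are assumptions made for this illustration, and the patterns are taken directly from the table, so the real preset may differ in detail.

```typescript
type Action = { kind: 'strip' } | { kind: 'normalize'; replacement: string }

interface Rule {
  category: string
  pattern: RegExp
  action: Action
}

// Assumed rule set, copied from the table's "What it matches" column.
const defaultRules: Rule[] = [
  { category: 'bom', pattern: /\uFEFF/gu, action: { kind: 'strip' } },
  { category: 'format', pattern: /\p{Cf}/gu, action: { kind: 'strip' } },
  // Note: \p{Cc} also covers tab and line feed, so this sketch removes them.
  // How deghost itself treats those characters is not stated in this README.
  { category: 'control', pattern: /\p{Cc}/gu, action: { kind: 'strip' } },
  { category: 'spaces', pattern: /\p{Zs}/gu, action: { kind: 'normalize', replacement: ' ' } },
]

function applyDefaults(input: string): string {
  let out = input
  for (const { pattern, action } of defaultRules) {
    const substitute = action.kind === 'strip' ? '' : action.replacement
    out = out.replace(pattern, () => substitute)
  }
  return out
}

console.log(applyDefaults('\uFEFFPlant\u00a064\u200B')) // Plant 64
console.log(applyDefaults('a\u2003b')) // a b
```

Categories without a default (`tag`, `fillers`, `math`) are left out of the rule list, which mirrors the dash in the table.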
Reusable cleaners
Build a cleaning pipeline once, apply it to many strings with no per-call chain allocation:
import { cleaner } from 'deghost'
const clean = cleaner().strip('format').strip('control').normalize('spaces').trim().build()
clean('dirty\u00a0string') // 'dirty string'
clean('another\u200Bone') // 'anotherone'
Cleaners also support .replace() and .highlight() for dynamic transformations:
const annotate = cleaner().highlight('format').normalize('spaces').build()
annotate('a\u200Bb\u00a0c') // 'a[U+200B]b c'
Detection
Find out what's hiding in your strings:
import { detect, hasGhosts, isClean, count, first, scan } from 'deghost'
detect('sneaky\u200Btext')
// [{
// char: '\u200B',
// codepoint: 'U+200B',
// name: 'ZERO WIDTH SPACE',
// category: 'format',
// offset: 6
// }]
hasGhosts('hello\u200Bworld') // true
isClean('hello world') // true
count('a\u00a0b\u200Bc\u200Bd')
// { spaces: 1, format: 2 }
// Get just the first detection (stops early)
first('a\u200Bb\u00a0c')
// { char: '\u200B', codepoint: 'U+200B', ... }
// Lazy iterator for large strings
for (const d of scan(largeString)) {
  if (d.category === 'format') break
}
All detection functions accept an optional categories array to filter:
detect('a\u200Bb\u00a0c', ['spaces'])
// Only returns the NBSP detection
Highlighting
Make invisible characters visible for debugging:
import { highlight } from 'deghost'
highlight('hello\u200Bworld')
// 'hello[U+200B]world'
// Custom formatter
highlight('a\u200Bb', (d) => `{${d.name}}`)
// 'a{ZERO WIDTH SPACE}b'
// Filter by category
highlight('a\u00a0b\u200Bc', { categories: ['format'] })
// 'a\u00a0b[U+200B]c'
Summary
Get a human-readable report of all invisible characters:
import { summary } from 'deghost'
summary('hello\u200Bworld\u00a0here')
// 2 invisible characters found.
//
// By category:
// format: 1
// spaces: 1
//
// Details:
// U+200B ZERO WIDTH SPACE (format, offset 5)
// U+00A0 NO-BREAK SPACE (spaces, offset 11)
Character lookup
Identify a single character or codepoint:
import { identify } from 'deghost'
identify('\u200B')
// { codepoint: 'U+200B', name: 'ZERO WIDTH SPACE', category: 'format' }
identify(0x00a0)
// { codepoint: 'U+00A0', name: 'NO-BREAK SPACE', category: 'spaces' }
identify('a') // undefined (not a ghost)
Presets
import { presets } from 'deghost'
// Default: strip format + control + BOM, normalize spaces
presets.clean('text\u00a0with\u200Bghosts')
// → 'text withghosts' (NBSP becomes a space, the zero-width space is removed)
// Aggressive: strip everything invisible
presets.aggressive('text\u2061with\u200Bghosts')
// → 'textwithghosts'
// Spaces only: just normalize whitespace
presets.spaces('text\u00a0here')
// → 'text here'
How it works
deghost uses ES2018 Unicode property escapes (\p{Cf}, \p{Cc}, \p{Zs}) for broad category matching, plus curated codepoint sets for categories not covered by a single Unicode general category (tag characters, script-specific fillers, invisible math operators).
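As a rough illustration of that matching strategy, the sketch below walks a string by code point and classifies each character with property escapes plus one curated range. `Hit`, `patterns`, `toCodepoint`, and `scanSketch` are hypothetical names. Checking the curated range first, excluding the ordinary space U+0020 from `\p{Zs}`, and counting offsets in UTF-16 code units are all assumptions of this sketch, not documented behavior.

```typescript
interface Hit {
  char: string
  codepoint: string
  category: string
  offset: number
}

// Curated range first: U+2061 to U+2064 are also \p{Cf}, so order decides the label.
const patterns: Array<[string, RegExp]> = [
  ['math', /[\u2061-\u2064]/u],
  ['format', /\p{Cf}/u],
  ['control', /\p{Cc}/u],
  ['spaces', /(?! )\p{Zs}/u], // assumption: a regular space is not a ghost
]

// Format a character as U+XXXX, padded to at least four hex digits.
function toCodepoint(char: string): string {
  const cp = char.codePointAt(0) ?? 0
  return 'U+' + cp.toString(16).toUpperCase().padStart(4, '0')
}

// Lazy scan: yields one record per invisible character found.
function* scanSketch(input: string): Generator<Hit> {
  let offset = 0
  for (const char of input) {
    const match = patterns.find(([, re]) => re.test(char))
    if (match) {
      yield { char, codepoint: toCodepoint(char), category: match[0], offset }
    }
    offset += char.length // UTF-16 code units consumed by this code point
  }
}

const hits = [...scanSketch('a\u00a0b\u200Bc\u2061d')]
console.log(hits.map((h) => `${h.codepoint} ${h.category} @${h.offset}`))
// [ 'U+00A0 spaces @1', 'U+200B format @3', 'U+2061 math @5' ]
```

The patterns carry no `g` flag, so `test()` keeps no state between calls and is safe to reuse per character.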
The key design choice: strip vs. normalize. A non-breaking space (U+00A0) should become a regular space, not disappear — otherwise "Plant\u00a064" becomes "Plant64". deghost handles this by default; out-of-character does not.
Comparison
| Feature | deghost | out-of-character |
| ------------------------------- | ------- | ---------------- |
| Strip invisible chars | yes | yes |
| Normalize spaces (NBSP → space) | yes | no (strips) |
| Chainable API | yes | no |
| Reusable cleaners | yes | no |
| Detection with metadata | yes | yes |
| Category-level control | yes | no |
| Highlighting / debugging | yes | no |
| Tagged template literal | yes | no |
| TypeScript-native | yes | no |
| Presets | yes | no |
| CLI | not yet | yes |
| Zero dependencies | yes | yes |
Requirements
Node.js >= 18. Uses ES2018 Unicode property escapes (supported in all modern runtimes).
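If you are unsure whether a target runtime supports property escapes, a small feature check along these lines can help. `supportsUnicodePropertyEscapes` is a hypothetical helper, not part of deghost. It builds the pattern through the `RegExp` constructor so that an unsupported runtime throws at call time instead of failing to parse the whole file.

```typescript
// Hypothetical helper: returns true when \p{...} escapes work in this runtime.
function supportsUnicodePropertyEscapes(): boolean {
  try {
    // U+200B ZERO WIDTH SPACE has general category Cf.
    return new RegExp('\\p{Cf}', 'u').test('\u200B')
  } catch {
    return false
  }
}

console.log(supportsUnicodePropertyEscapes()) // true on Node.js >= 18
```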
License
MIT
