unllm
v2.1.0
Clean LLM output to keyboard-printable text
unllm
Convert LLM output to clean, human-like text by removing AI artifacts and normalizing typography.
import { clean } from 'unllm';
const llmOutput = "Hey there! 👋 This\u00A0message uses\u2014fancy chars\u2026 🚀";
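// Dash and ellipsis normalization are opt-in, so enable them explicitly
const result = clean(llmOutput, { dashes: true, ellipsis: true });
// → "Hey there! 👋 This message uses-fancy chars... 🚀"
Why?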
LLMs (ChatGPT, Claude, etc.) often generate text with problematic Unicode characters that make output look artificial:
- Control characters: NULL (\u0000), invisible formatting marks
- Typographic Unicode: Em dashes (\u2014), fancy spaces (\u00A0), ellipsis
- Invisible chars: Zero-width spaces, byte order marks (BOM), direction marks
This library normalizes LLM output to look natural while preserving emojis, quotes, and international text (Arabic, Chinese, Cyrillic, etc.).
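These characters are hard to spot because most of them render as nothing or as an ordinary-looking space. A minimal sketch of a debugging helper that makes them visible follows. `revealHidden` is a hypothetical name and is not part of unllm; the character ranges are an assumption chosen for illustration (C0/C1 controls, NBSP, soft hyphen, the General Punctuation block, and the BOM).

```typescript
// Hypothetical helper (not part of unllm): replace hard-to-see characters
// with visible \uXXXX escapes so they show up in logs and diffs.
function revealHidden(text: string): string {
  // Assumed ranges: C0 controls except tab/LF/CR, DEL and C1 controls,
  // NBSP (U+00A0), soft hyphen (U+00AD), the General Punctuation block
  // (U+2000 to U+206F: Unicode spaces, dashes, smart quotes, ellipsis,
  // zero-width and direction marks), and the BOM (U+FEFF).
  const suspicious =
    /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F\u00A0\u00AD\u2000-\u206F\uFEFF]/g;
  return text.replace(suspicious, (ch) => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

console.log(revealHidden("This\u00A0message uses\u2014fancy chars\u2026"));
// prints: This\u00A0message uses\u2014fancy chars\u2026
```

Because the range includes the zero-width joiner, this view also exposes the joiners inside multi-part emojis. That is fine for debugging, but it is not a safe rule for deciding what to delete.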
What it does
| Input | Output | Effect |
|-------|--------|------|
| "Hello\u0000World" | "HelloWorld" | Removes NULL |
| "Hello\u00A0World" | "Hello World" | NBSP → space |
| "foo\u2014bar" | "foo-bar" | Em dash → hyphen (opt-in) |
| "Wait\u2026" | "Wait..." | Ellipsis → dots (opt-in) |
| "He said \u201CHi\u201D" | "He said 'Hi'" | Smart quotes → ' (opt-in) |
| "Hi 👋 مرحبا" | "Hi 👋 مرحبا" | Preserves emojis & international text |
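The typography rows above can be approximated with a few regular expressions. The following is a rough sketch under stated assumptions, not unllm's actual implementation: `normalizeTypography` and `SketchOptions` are hypothetical names, the option names and defaults mirror the documented `CleanOptions`, and the exact character sets are guesses.

```typescript
// Option names and defaults mirror the documented CleanOptions;
// the character sets below are assumptions, not the library's rules.
interface SketchOptions {
  spaces?: boolean;   // default: true
  dashes?: boolean;   // default: false (opt-in)
  ellipsis?: boolean; // default: false (opt-in)
}

function normalizeTypography(text: string, options: SketchOptions = {}): string {
  const { spaces = true, dashes = false, ellipsis = false } = options;
  let out = text;
  // NBSP, U+2000 to U+200A, narrow NBSP, medium math space, ideographic space
  if (spaces) out = out.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ");
  // En dash, em dash, minus sign → ASCII hyphen
  if (dashes) out = out.replace(/[\u2013\u2014\u2212]/g, "-");
  // Horizontal ellipsis → three dots
  if (ellipsis) out = out.replace(/\u2026/g, "...");
  return out;
}

console.log(normalizeTypography("Hello\u00A0World"));
// → "Hello World"
console.log(normalizeTypography("foo\u2014bar\u2026") === "foo\u2014bar\u2026");
// → true (dashes and ellipsis are left alone unless enabled)
console.log(normalizeTypography("foo\u2014bar\u2026", { dashes: true, ellipsis: true }));
// → "foo-bar..."
```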
Installation
npm install unllm
# or
pnpm add unllm
# or
bun add unllm
API
clean(text: string, options?: CleanOptions): string
Removes LLM artifacts and normalizes typography to clean, human-like text.
Options:
interface CleanOptions {
invisible?: boolean; // Remove control/invisible chars (default: true)
spaces?: boolean; // Normalize Unicode spaces (default: true)
dashes?: boolean; // Normalize em/en dashes (default: false)
ellipsis?: boolean; // Normalize ellipsis (default: false)
quotes?: boolean | string; // Normalize smart quotes (default: false)
// true = normalize to ', string = normalize to that char
}
What it preserves:
- Emojis (including multi-part with ZWJ: 👨👩👧👦)
- International text (Arabic, Chinese, Cyrillic, etc.)
- Line breaks and tabs
- Regular punctuation and symbols
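The emoji case is worth a closer look, because multi-part emojis are held together by ZERO WIDTH JOINER (U+200D), which is itself an invisible character. The short sketch below uses only standard string operations and shows why a cleaner cannot simply delete every zero-width character. It does not show how unllm decides which joiners to keep.

```typescript
// The family emoji is four person emojis joined by ZERO WIDTH JOINER (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Array.from iterates by code point, so surrogate pairs stay together.
const codePoints = Array.from(family);
console.log(codePoints.length); // 7: four emojis plus three joiners

// Naively deleting every U+200D splits the sequence into four separate emojis.
const naive = family.replace(/\u200D/g, "");
console.log(Array.from(naive).length); // 4
```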
Examples:
import { clean } from 'unllm';
// Basic usage (invisible + spaces only)
clean("Hello\u00A0World");
// → "Hello World"
// Enable all normalizations
clean("Text\u0000\u00A0\u2014test\u2026", {
invisible: true,
spaces: true,
dashes: true,
ellipsis: true,
quotes: true
});
// → "Text -test..."
// Normalize smart quotes to single quote
clean("He said \u201CHello\u201D", { quotes: true });
// → "He said 'Hello'"
// Normalize smart quotes to double quote
clean("He said \u201CHello\u201D", { quotes: '"' });
// → 'He said "Hello"'
// Disable everything (pass-through)
clean("Keep\u00A0all\u2014chars", {
invisible: false,
spaces: false
});
// → "Keep\u00A0all\u2014chars"
// Preserves international text
clean("C'est génial\u00A0!");
// → "C'est génial !"
inspect(text: string, options?: CleanOptions): Issue[]
Analyzes text and returns an array of the issues found. Uses the same options as clean().
Returns:
interface Issue {
char: string; // The problematic character
code: number; // Unicode code point
hex: string; // Hex representation (e.g., "U+00A0")
position: number; // Position in string
type: 'control' | 'invisible' | 'typography';
name: string; // Human-readable name
}
Usage:
import { inspect } from 'unllm';
const issues = inspect("Hello\u00A0World");
console.log(issues);
// [
// {
// char: '\u00A0',
// code: 160,
// hex: 'U+00A0',
// position: 5,
// type: 'typography',
// name: 'NO-BREAK SPACE'
// }
// ]
// Detect smart quotes (disabled by default)
const quoteIssues = inspect("He said \u201CHello\u201D", { quotes: true });
// → 2 issues: LEFT/RIGHT DOUBLE QUOTATION MARK
// Quick check
if (issues.length > 0) {
const cleaned = clean("Hello\u00A0World");
}
Use Cases
- LLM output normalization: Clean ChatGPT/Claude responses for consistent formatting
- Translation quality: Normalize AI-translated text to remove artifacts
- Database storage: Ensure clean text before storing LLM output
- API responses: Remove problematic characters that break JSON/XML
- Content moderation: Detect and fix LLM-generated formatting issues
- Text comparison: Normalize before diffing or deduplication
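As an illustration of the text-comparison use case, here is a small sketch that deduplicates strings by their normalized form. `dedupeNormalized` and the `Normalize` type are hypothetical names introduced for this example. The normalizer is passed in as a parameter so the sketch runs without the package installed; in real use you would likely pass `clean` from 'unllm' instead of the stand-in.

```typescript
// Shape of a text normalizer; unllm's clean(text) fits this signature.
type Normalize = (text: string) => string;

// Hypothetical helper: keep the first occurrence of each string, comparing
// strings by their normalized form instead of their raw characters.
function dedupeNormalized(items: string[], normalize: Normalize): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const item of items) {
    const key = normalize(item);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}

// Stand-in normalizer (an assumption for this sketch): NBSP → regular space.
const nbspToSpace: Normalize = (text) => text.replace(/\u00A0/g, " ");

const deduped = dedupeNormalized(
  ["Hello World", "Hello\u00A0World", "Bye"],
  nbspToSpace
);
console.log(deduped);
// → ["Hello World", "Bye"]
```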
Character Categories
Control Characters (removed)
- NULL (\u0000)
- Other C0/C1 control characters
- Backspace, vertical tab, form feed, etc.
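To make the control-character category concrete, the sketch below scans a string and reports matches in a shape loosely modeled on the documented `Issue` interface. `findControlChars` and `ControlIssue` are hypothetical names. The exact set is an assumption: C0 controls other than tab and line feed, plus DEL and the C1 range. Leaving carriage return alone is also an assumption.

```typescript
// Loosely modeled on the documented Issue interface (type and name omitted).
interface ControlIssue {
  char: string;
  code: number;
  hex: string;      // e.g. "U+0000"
  position: number; // UTF-16 index in the input
}

// Hypothetical scanner: report control characters, but leave tab (U+0009),
// line feed (U+000A), and carriage return (U+000D) alone.
function findControlChars(text: string): ControlIssue[] {
  const issues: ControlIssue[] = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const isC0 = code <= 0x1f && code !== 0x09 && code !== 0x0a && code !== 0x0d;
    const isDelOrC1 = code >= 0x7f && code <= 0x9f;
    if (isC0 || isDelOrC1) {
      const hex = "U+" + ("0000" + code.toString(16).toUpperCase()).slice(-4);
      issues.push({ char: text[i], code, hex, position: i });
    }
  }
  return issues;
}

console.log(findControlChars("Hello\u0000World\n"));
// one issue: code 0, hex "U+0000", position 5
```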
Invisible Characters (removed)
- Zero-width space (\u200B)
- Zero-width non-joiner (\u200C)
- Left-to-right/right-to-left marks
- Word joiner, invisible operators
- Byte order mark (BOM) (\uFEFF)
Typography (normalized)
- Unicode spaces: NBSP (\u00A0), em space, en space, etc. → regular space
- Dashes (opt-in): em dash (\u2014), en dash (\u2013), minus (\u2212) → -
- Ellipsis (opt-in): \u2026 → ...
- Soft hyphen: \u00AD → removed
- Smart quotes (opt-in): “ ” ‘ ’ « » etc. → ' or custom char
Design Principles
- Simple API: Just two functions (clean and inspect)
- Zero configuration: Works out of the box with sensible defaults
- International-friendly: Preserves all legitimate text (Arabic, Chinese, etc.)
- Emoji-aware: Intelligently handles complex emoji sequences
- Zero dependencies: Lightweight and secure
- Type-safe: Full TypeScript support
License
MIT © Teimur Gasanov
