unllm

v2.1.0

Published

2 months ago

Clean LLM output to keyboard-printable text

teimurjan

llm clean text keyboard printable

unllm

Convert LLM output to clean, human-like text by removing AI artifacts and normalizing typography.

import { clean } from 'unllm';

const llmOutput = "Hey there! 👋 This\u00A0message uses\u2014fancy chars\u2026 🚀";
const result = clean(llmOutput, { dashes: true, ellipsis: true });
// → "Hey there! 👋 This message uses-fancy chars... 🚀"

Why?

LLMs (ChatGPT, Claude, etc.) often generate text with problematic Unicode characters that make output look artificial:

Control characters: NULL (\u0000), invisible formatting marks
Typographic Unicode: em dashes (\u2014), non-breaking spaces (\u00A0), ellipsis (\u2026)
Invisible chars: Zero-width spaces, byte order marks (BOM), direction marks

This library normalizes LLM output to look natural while preserving emojis, quotes, and international text (Arabic, Chinese, Cyrillic, etc.).

What it does

| Input | Output | Type |
|-------|--------|------|
| "Hello\u0000World" | "HelloWorld" | Removes NULL |
| "Hello\u00A0World" | "Hello World" | NBSP → space |
| "foo\u2014bar" | "foo-bar" | Em dash → hyphen (opt-in) |
| "Wait\u2026" | "Wait..." | Ellipsis → dots (opt-in) |
| "He said \u201CHi\u201D" | "He said 'Hi'" | Smart quotes → ' (opt-in) |
| "Hi 👋 مرحبا" | "Hi 👋 مرحبا" | Preserves emojis & international text |

Installation

npm install unllm
# or
pnpm add unllm
# or
bun add unllm

API

`clean(text: string, options?: CleanOptions): string`

Removes LLM artifacts and normalizes typography to clean, human-like text.

Options:

interface CleanOptions {
  invisible?: boolean;        // Remove control/invisible chars (default: true)
  spaces?: boolean;           // Normalize Unicode spaces (default: true)
  dashes?: boolean;           // Normalize em/en dashes (default: false)
  ellipsis?: boolean;         // Normalize ellipsis (default: false)
  quotes?: boolean | string;  // Normalize smart quotes (default: false)
                              // true = normalize to ', string = normalize to that char
}

What it preserves:

Emojis (including multi-part with ZWJ: 👨‍👩‍👧‍👦)
International text (Arabic, Chinese, Cyrillic, etc.)
Line breaks and tabs
Regular punctuation and symbols

Examples:

import { clean } from 'unllm';

// Basic usage (invisible + spaces only)
clean("Hello\u00A0World");
// → "Hello World"

// Enable all normalizations
clean("Text\u0000\u00A0\u2014test\u2026", {
  invisible: true,
  spaces: true,
  dashes: true,
  ellipsis: true,
  quotes: true
});
// → "Text -test..."

// Normalize smart quotes to single quote
clean("He said \u201CHello\u201D", { quotes: true });
// → "He said 'Hello'"

// Normalize smart quotes to double quote
clean("He said \u201CHello\u201D", { quotes: '"' });
// → 'He said "Hello"'

// Disable everything (pass-through)
clean("Keep\u00A0all\u2014chars", {
  invisible: false,
  spaces: false
});
// → "Keep\u00A0all\u2014chars"

// Preserves international text
clean("C'est génial\u00A0!");
// → "C'est génial !"

`inspect(text: string, options?: CleanOptions): Issue[]`

Analyzes text and returns an array of the issues found. Accepts the same options as clean().

Returns:

interface Issue {
  char: string;        // The problematic character
  code: number;        // Unicode code point
  hex: string;         // Hex representation (e.g., "U+00A0")
  position: number;    // Position in string
  type: 'control' | 'invisible' | 'typography';
  name: string;        // Human-readable name
}
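
The fields of an Issue can all be derived from a single character and its index. The sketch below shows one way to do that; it is not unllm's implementation. `toHex` and `makeIssue` are hypothetical helpers, and it assumes `position` is a UTF-16 index, which the interface above does not specify.

```typescript
// Illustrative only: hypothetical helpers, not part of unllm's API.
interface Issue {
  char: string;
  code: number;
  hex: string;
  position: number;
  type: 'control' | 'invisible' | 'typography';
  name: string;
}

// Format a code point as "U+XXXX", zero-padded to at least four hex digits.
function toHex(code: number): string {
  return 'U+' + code.toString(16).toUpperCase().padStart(4, '0');
}

// Build an Issue for the character at a UTF-16 index (an assumption).
// The type and name are supplied by the caller; no lookup table is implied.
function makeIssue(
  text: string,
  position: number,
  type: Issue['type'],
  name: string
): Issue {
  const code = text.codePointAt(position);
  if (code === undefined) {
    throw new RangeError(`No character at position ${position}`);
  }
  return { char: String.fromCodePoint(code), code, hex: toHex(code), position, type, name };
}

const nbspIssue = makeIssue('Hello\u00A0World', 5, 'typography', 'NO-BREAK SPACE');
console.log(nbspIssue.hex, nbspIssue.code);
// prints: U+00A0 160
```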

Usage:

import { inspect } from 'unllm';

const issues = inspect("Hello\u00A0World");

console.log(issues);
// [
//   {
//     char: '\u00A0',
//     code: 160,
//     hex: 'U+00A0',
//     position: 5,
//     type: 'typography',
//     name: 'NO-BREAK SPACE'
//   }
// ]

// Detect smart quotes (disabled by default)
const quoteIssues = inspect("He said \u201CHello\u201D", { quotes: true });
// → 2 issues: LEFT/RIGHT DOUBLE QUOTATION MARK

// Quick check
if (issues.length > 0) {
  const cleaned = clean("Hello\u00A0World");
}
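
For logging or reporting, it can help to summarize an Issue[] by type. A minimal sketch follows; `countByType` is a hypothetical helper, not something unllm exports, and the sample issues are hand-written stand-ins rather than real inspect() output.

```typescript
// Illustrative only: countByType is a hypothetical helper, not part of unllm.
type IssueType = 'control' | 'invisible' | 'typography';

interface Issue {
  char: string;
  code: number;
  hex: string;
  position: number;
  type: IssueType;
  name: string;
}

// Count how many issues fall into each category.
function countByType(issues: Issue[]): Record<IssueType, number> {
  const counts: Record<IssueType, number> = { control: 0, invisible: 0, typography: 0 };
  for (const issue of issues) {
    counts[issue.type] += 1;
  }
  return counts;
}

// Hand-written sample data shaped like the Issue interface above.
const sampleIssues: Issue[] = [
  { char: '\u0000', code: 0, hex: 'U+0000', position: 4, type: 'control', name: 'NULL' },
  { char: '\u00A0', code: 160, hex: 'U+00A0', position: 5, type: 'typography', name: 'NO-BREAK SPACE' },
  { char: '\u2014', code: 8212, hex: 'U+2014', position: 6, type: 'typography', name: 'EM DASH' },
];

console.log(JSON.stringify(countByType(sampleIssues)));
// prints: {"control":1,"invisible":0,"typography":2}
```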

Use Cases

LLM output normalization: Clean ChatGPT/Claude responses for consistent formatting
Translation quality: Normalize AI-translated text to remove artifacts
Database storage: Ensure clean text before storing LLM output
API responses: Remove problematic characters that break JSON/XML
Content moderation: Detect and fix LLM-generated formatting issues
Text comparison: Normalize before diffing or deduplication
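
As an illustration of the text-comparison use case, the sketch below deduplicates strings after normalizing them. `dedupeNormalized` is a hypothetical helper that takes the normalizer as a parameter. In a real project you would presumably pass clean from unllm; here a small stand-in that only handles NBSP keeps the example self-contained.

```typescript
// Illustrative only: dedupeNormalized is a hypothetical helper, not part of unllm.
// It keeps the first occurrence of each distinct normalized string.
function dedupeNormalized(
  texts: string[],
  normalize: (text: string) => string
): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const text of texts) {
    const key = normalize(text);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(key);
    }
  }
  return unique;
}

// Stand-in normalizer for this demo (NBSP → space only).
const nbspToSpace = (text: string): string => text.replace(/\u00A0/g, ' ');

const deduped = dedupeNormalized(['Hello World', 'Hello\u00A0World', 'Bye'], nbspToSpace);
console.log(JSON.stringify(deduped));
// prints: ["Hello World","Bye"]
```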

Character Categories

Control Characters (removed)

NULL (\u0000)
Other C0/C1 control characters (line breaks and tabs are preserved)
Backspace, vertical tab, form feed, etc.
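
The control-character rule can be approximated with a single regular expression. This is a sketch of the idea, not unllm's actual implementation. The exact ranges are an assumption, as is treating carriage return as a preserved line break alongside line feed and tab.

```typescript
// Illustrative only: not unllm's actual implementation.
// Matches C0 controls except tab (\u0009), line feed (\u000A) and carriage
// return (\u000D), plus DEL (\u007F) and the C1 range (\u0080-\u009F).
const CONTROL_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/g;

function stripControlChars(text: string): string {
  return text.replace(CONTROL_CHARS, '');
}

console.log(JSON.stringify(stripControlChars('Hello\u0000World\n\tok\u0008')));
// prints: "HelloWorld\n\tok"
```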

Invisible Characters (removed)

Zero-width space (\u200B)
Zero-width non-joiner (\u200C)
Left-to-right/right-to-left marks
Word joiner, invisible operators
Byte order mark (BOM) (\uFEFF)
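
A rough sketch of stripping the invisible characters listed above. It is not the library's code. It simply never touches the zero-width joiner (\u200D), so multi-part emojis survive; how unllm itself distinguishes emoji sequences is not described here, so that simplification is an assumption.

```typescript
// Illustrative only: not unllm's actual implementation.
// Covers zero-width space, zero-width non-joiner, LTR/RTL marks, word joiner,
// the invisible operators (\u2061-\u2064) and the byte order mark.
// The zero-width joiner (\u200D) is deliberately left out.
const INVISIBLE_CHARS = /[\u200B\u200C\u200E\u200F\u2060-\u2064\uFEFF]/g;

function stripInvisibleChars(text: string): string {
  return text.replace(INVISIBLE_CHARS, '');
}

// Man + ZWJ + woman + ZWJ + girl, written with escapes so the joiners are visible.
const familyEmoji = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

console.log(stripInvisibleChars('a\u200Bb\uFEFFc'));
// prints: abc
console.log(stripInvisibleChars(familyEmoji) === familyEmoji);
// prints: true
```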

Typography (normalized)

Unicode spaces: NBSP (\u00A0), em space, en space, etc. → regular space
Dashes (opt-in): em dash (\u2014), en dash (\u2013), minus (\u2212) → -
Ellipsis (opt-in): \u2026 → ...
Soft hyphen: \u00AD → removed
Smart quotes (opt-in): “ ” ‘ ’ « » etc. → ' or custom char
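
The typography rules can be sketched as a few option-gated replacements. This is a simplified illustration, not unllm's implementation. `normalizeTypography` and `TypographyOptions` are hypothetical names, and the character sets are a plausible subset rather than the library's real list. Soft hyphen handling is left out because the text above does not say which option controls it.

```typescript
// Illustrative only: hypothetical names, not part of unllm's API.
interface TypographyOptions {
  spaces?: boolean;
  dashes?: boolean;
  ellipsis?: boolean;
  quotes?: boolean | string;
}

function normalizeTypography(text: string, options: TypographyOptions = {}): string {
  // Defaults mirror the documented ones: only spaces are normalized unless asked.
  const { spaces = true, dashes = false, ellipsis = false, quotes = false } = options;
  let out = text;
  if (spaces) out = out.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, ' ');
  if (dashes) out = out.replace(/[\u2013\u2014\u2212]/g, '-');
  if (ellipsis) out = out.replace(/\u2026/g, '...');
  if (quotes !== false) {
    // true means a plain single quote; a string means that exact replacement.
    const replacement = quotes === true ? "'" : quotes;
    // A replacer function avoids "$" being treated as a special pattern.
    out = out.replace(/[\u2018\u2019\u201C\u201D\u00AB\u00BB]/g, () => replacement);
  }
  return out;
}

console.log(normalizeTypography('foo\u2014bar', { dashes: true }));
// prints: foo-bar
console.log(normalizeTypography('He said \u201CHello\u201D', { quotes: '"' }));
// prints: He said "Hello"
```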

Design Principles

Simple API: Just two functions (clean and inspect)
Zero configuration: Works out of the box with sensible defaults
International-friendly: Preserves all legitimate text (Arabic, Chinese, etc.)
Emoji-aware: Intelligently handles complex emoji sequences
Zero dependencies: Lightweight and secure
Type-safe: Full TypeScript support
