typescript-html-clean

v1.0.0

Published

6 months ago

TypeScript port of lxml_html_clean - HTML sanitization library for security-sensitive environments

florinszilagyi

TypeScript HTML Clean

A TypeScript/JavaScript port of the Python lxml_html_clean library for defensive HTML sanitization in security-sensitive environments.

🔒 Security Notice

Important: This HTML cleaner is designed as a defensive security tool and should be used with caution. It uses a blocklist-based approach, which may not catch every attack vector. For maximum security, consider an allowlist-based sanitizer such as DOMPurify.

✨ Features

Comprehensive HTML Sanitization: Remove scripts, dangerous attributes, and malicious content
JavaScript Protection: Detect and remove JavaScript in attributes, URLs, and CSS
CSS Security: Filter malicious CSS expressions, imports, and embedded content
Configurable: 20+ options to customize sanitization behavior
TypeScript Support: Full type definitions included
Node.js & Browser: Works in both server-side and client-side environments
Auto-linking: Convert plain URLs to clickable links
Word Breaking: Prevent layout-breaking long words

📦 Installation

pnpm add typescript-html-clean
# or
npm install typescript-html-clean

🚀 Quick Start

import { cleanHtml, Cleaner } from 'typescript-html-clean';

// Simple usage with defaults
const maliciousHtml = '<script>alert("xss")</script><p onclick="hack()">Content</p>';
const cleaned = cleanHtml(maliciousHtml);
console.log(cleaned); // '<p>Content</p>'

// Custom configuration
const cleaner = new Cleaner({
  scripts: true,        // Remove <script> tags
  javascript: true,     // Remove JavaScript from attributes and URLs
  comments: true,       // Remove HTML comments
  style: false,         // Keep <style> tags
  embedded: true,       // Remove <iframe>, <embed>, etc.
  safeAttrsOnly: true,  // Only allow safe attributes
  hostWhitelist: ['trusted-domain.com'] // Allow embedded content from specific hosts
});

const result = cleaner.cleanHtml(maliciousHtml);

🛡️ Security Features

JavaScript Removal

Event handlers: onclick, onload, onmouseover, etc.
JavaScript URLs: javascript:, jscript:, vbscript:, etc.
Obfuscated schemes: j a v a s c r i p t :, URL-encoded variants
CSS expressions: IE-specific expression() calls
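
To make the "obfuscated schemes" item concrete, here is a minimal sketch of one way such URLs could be normalized before checking the scheme. It is not the library's implementation: the names `SCRIPT_SCHEMES`, `decodeRepeatedly`, and `hasScriptScheme` are hypothetical, and the exact set of characters stripped is an assumption.

```typescript
// Illustrative sketch only; not the library's actual implementation.
// Schemes treated as script-capable here (taken from the list above).
const SCRIPT_SCHEMES = ["javascript:", "jscript:", "vbscript:"];

/** Percent-decode repeatedly (bounded) so double-encoded input is unwrapped. */
function decodeRepeatedly(value: string, maxRounds = 3): string {
  let current = value;
  for (let i = 0; i < maxRounds; i++) {
    let decoded: string;
    try {
      decoded = decodeURIComponent(current);
    } catch {
      break; // malformed escape sequence: keep what we have so far
    }
    if (decoded === current) break; // nothing left to decode
    current = decoded;
  }
  return current;
}

/** Returns true if the URL starts with a script scheme after normalization. */
function hasScriptScheme(url: string): boolean {
  const normalized = decodeRepeatedly(url)
    // Drop whitespace and ASCII control characters used to split the scheme.
    .replace(/[\s\u0000-\u001f\u007f]+/g, "")
    .toLowerCase();
  return SCRIPT_SCHEMES.some((scheme) => normalized.startsWith(scheme));
}
```

The check only looks at the start of the normalized value, so a harmless URL that merely contains the text `javascript:` in its query string is not flagged by this sketch.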

CSS Protection

Malicious imports: @import statements that could load external threats
Embedded scripts: CSS comments containing HTML/JavaScript
Expression attacks: CSS expressions that execute JavaScript
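
The CSS checks listed above can be sketched as a simple text scan. This is only an illustration under stated assumptions: `stripCssComments` and `looksDangerousCss` are hypothetical helpers, not exports of this package, and the real checks may cover more cases.

```typescript
// Illustrative sketch only; the library's real CSS checks may differ.

/** Strip CSS comments so they cannot split or hide a keyword. */
function stripCssComments(css: string): string {
  return css.replace(/\/\*[\s\S]*?\*\//g, "");
}

/** Returns true if the style text contains one of the constructs listed above. */
function looksDangerousCss(css: string): boolean {
  const compact = stripCssComments(css)
    .replace(/\\/g, "") // drop backslashes used as CSS escapes (e\xpression)
    .replace(/\s+/g, "") // drop whitespace used to split keywords
    .toLowerCase();
  return (
    compact.includes("expression(") || // IE-specific expression() calls
    compact.includes("@import") || // external stylesheet imports
    compact.includes("javascript:") // script URLs inside url(...)
  );
}
```

Removing comments, backslashes, and whitespace before matching is what lets the scan see through inputs such as `expr/* x */ession(...)`.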

Content Filtering

Dangerous tags: Scripts, forms, embedded objects, frames
Unsafe attributes: Remove non-whitelisted attributes
Host validation: Whitelist external content sources
Control characters: Strip characters that could bypass filters
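
Host validation can be pictured as an exact host comparison on the parsed URL. The sketch below is an assumption about the general technique, not this package's code; `isAllowedHost` is a hypothetical helper, and the library's own matching rules (for example, how subdomains are treated) may differ.

```typescript
// Illustrative sketch only; matching rules in the library may differ.

/** Returns true if `src` is an http(s) URL whose host is exactly in `allowedHosts`. */
function isAllowedHost(src: string, allowedHosts: readonly string[]): boolean {
  let hostname: string;
  let protocol: string;
  try {
    const url = new URL(src);
    hostname = url.hostname; // already lowercased by the URL parser
    protocol = url.protocol;
  } catch {
    return false; // relative or malformed URLs are rejected in this sketch
  }
  if (protocol !== "http:" && protocol !== "https:") return false;
  // Exact match: "youtube.com" does not allow "youtube.com.evil.example"
  // or "www.youtube.com".
  return allowedHosts.some((host) => host.toLowerCase() === hostname);
}
```

Comparing the parsed hostname, rather than searching the raw string for the domain, avoids accepting look-alike hosts that merely contain an allowed name.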

📖 API Reference

Main Functions

// Clean HTML with default settings
function cleanHtml<T extends string | Element>(html: T): T

// Create custom cleaner instance
class Cleaner {
  constructor(options?: CleanerOptions)
  cleanHtml<T extends string | Element>(html: T): T
}

// Auto-link URLs in text
function autolinkHtml<T extends string | Element>(html: T, options?: AutolinkOptions): T

// Break long words to prevent layout issues
function wordBreakHtml<T extends string | Element>(html: T, options?: WordBreakOptions): T

// Remove ASCII control characters
function removeControlCharacters(html: string): string

Configuration Options

interface CleanerOptions {
  scripts?: boolean;              // Remove <script> tags (default: true)
  javascript?: boolean;           // Remove JavaScript from attributes (default: true)
  comments?: boolean;             // Remove HTML comments (default: true)
  style?: boolean;                // Remove <style> tags (default: false)
  inlineStyle?: boolean;          // Remove style attributes (default: follows style)
  links?: boolean;                // Remove <link> tags (default: true)
  meta?: boolean;                 // Remove <meta> tags (default: true)
  pageStructure?: boolean;        // Remove <head>, <html>, <title> (default: true)
  embedded?: boolean;             // Remove <iframe>, <embed>, etc. (default: true)
  frames?: boolean;               // Remove frame-related tags (default: true)
  forms?: boolean;                // Remove form elements (default: true)
  annoying?: boolean;             // Remove <blink>, <marquee> (default: true)
  removeTags?: string[];          // Tags to remove (keep content)
  killTags?: string[];            // Tags to kill (remove with content)
  allowTags?: string[];           // Allowed tags (allowlist mode)
  removeUnknownTags?: boolean;    // Remove non-standard tags (default: true)
  safeAttrsOnly?: boolean;        // Only allow safe attributes (default: true)
  safeAttrs?: Set<string>;        // Custom safe attributes set
  addNofollow?: boolean;          // Add rel="nofollow" to links (default: false)
  hostWhitelist?: string[];       // Allowed hosts for embedded content
  whitelistTags?: Set<string>;    // Tags that can use host whitelist
}
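
The defaults documented above, including `inlineStyle` following `style`, could be resolved along these lines. This is a sketch under assumptions: `SketchOptions` is a local mirror of four of the documented options and `resolveOptions` is a hypothetical helper; neither is imported from the package.

```typescript
// Local mirror of a few documented options; not imported from the library.
interface SketchOptions {
  scripts?: boolean;
  style?: boolean;
  inlineStyle?: boolean;
  safeAttrsOnly?: boolean;
}

/** Fill in the documented defaults; inlineStyle falls back to the resolved style. */
function resolveOptions(options: SketchOptions = {}): Required<SketchOptions> {
  const style = options.style ?? false; // documented default: false
  return {
    scripts: options.scripts ?? true, // documented default: true
    style,
    inlineStyle: options.inlineStyle ?? style, // documented: follows style
    safeAttrsOnly: options.safeAttrsOnly ?? true, // documented default: true
  };
}
```

With this resolution, `{ style: true }` also removes inline style attributes unless `inlineStyle: false` is passed explicitly.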

🧪 Examples

Basic Sanitization

import { cleanHtml } from 'typescript-html-clean';

const malicious = `
  <div onclick="steal_data()">
    <script>alert('xss')</script>
    <p style="background: url(javascript:hack())">Content</p>
    <iframe src="https://evil.com/malware"></iframe>
  </div>
`;

const safe = cleanHtml(malicious);
// Result (ignoring whitespace): '<div><p>Content</p></div>'

Whitelist Trusted Domains

import { Cleaner } from 'typescript-html-clean';

const cleaner = new Cleaner({
  embedded: true,  // Remove embeds by default
  hostWhitelist: ['youtube.com', 'vimeo.com']  // But allow these
});

const html = `
  <iframe src="https://youtube.com/embed/VIDEO_ID"></iframe>
  <iframe src="https://evil.com/malware"></iframe>
`;

const result = cleaner.cleanHtml(html);
// Result: Only the YouTube iframe remains

Custom Safe Attributes

import { Cleaner } from 'typescript-html-clean';

const cleaner = new Cleaner({
  safeAttrsOnly: true,
  safeAttrs: new Set(['id', 'class', 'href', 'src', 'alt', 'title'])
});

const html = '<img src="photo.jpg" alt="Photo" onclick="hack()" data-track="analytics">';
const result = cleaner.cleanHtml(html);
// Result: '<img src="photo.jpg" alt="Photo">' (onclick and data-track removed)

Auto-linking URLs

import { autolinkHtml } from 'typescript-html-clean';

const text = '<p>Visit https://github.com and mailto:[email protected]</p>';
const linked = autolinkHtml(text);
// Result: URLs become clickable <a> tags

Word Breaking

import { wordBreakHtml } from 'typescript-html-clean';

const html = '<p>supercalifragilisticexpialidocious</p>';
const broken = wordBreakHtml(html, { maxWidth: 10 });
// Result: Long word is broken with zero-width spaces
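
For intuition, breaking a long word with zero-width spaces can be sketched on plain text as follows. This is not how `wordBreakHtml` is implemented: `breakLongWords` is a hypothetical helper, it ignores HTML structure entirely, and the library may choose break points differently.

```typescript
// Illustrative sketch on plain text; the library operates on HTML and may
// choose break points differently.
const ZERO_WIDTH_SPACE = "\u200b";

/** Insert a zero-width space after every `maxWidth` characters of a long word. */
function breakLongWords(text: string, maxWidth: number): string {
  if (!Number.isInteger(maxWidth) || maxWidth < 1) {
    throw new RangeError("maxWidth must be a positive integer");
  }
  // Match runs of non-whitespace longer than maxWidth.
  const longRun = new RegExp(`\\S{${maxWidth + 1},}`, "g");
  return text.replace(longRun, (word) => {
    const chunks: string[] = [];
    // Slices by UTF-16 code unit, so this sketch may split surrogate pairs.
    for (let i = 0; i < word.length; i += maxWidth) {
      chunks.push(word.slice(i, i + maxWidth));
    }
    return chunks.join(ZERO_WIDTH_SPACE);
  });
}
```

Words at or below `maxWidth` are left untouched, so ordinary prose passes through unchanged.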

🎯 Use Cases

CMS Content: Sanitize user-generated content in blogs, forums, wikis
Email Processing: Clean HTML emails before display
Data Import: Sanitize HTML from external sources
API Responses: Clean HTML before sending to frontend
Markdown Processing: Sanitize HTML output from Markdown renderers

⚠️ Limitations

Blocklist Approach: May not catch novel attack vectors
Complex Parsing: Some HTML parsing edge cases may not be handled correctly
Performance: Processing time grows with document size, so large documents may be slow
Dependencies: Requires jsdom for server-side DOM manipulation

🔄 Migration from Python

If you're migrating from the Python lxml_html_clean library:

# Python
from lxml_html_clean import clean_html, Cleaner

cleaner = Cleaner(scripts=True, javascript=True)
result = cleaner.clean_html(html)

// TypeScript
import { cleanHtml, Cleaner } from 'typescript-html-clean';

const cleaner = new Cleaner({ scripts: true, javascript: true });
const result = cleaner.cleanHtml(html);
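
Judging by the names shown in this README (`clean_html` becomes `cleanHtml`, and the options interface uses camelCase), Python keyword arguments appear to map to camelCase option names. Assuming that pattern holds for every option, a small helper like the hypothetical one below can rename keys when porting configuration. It does not convert values, so types that differ between the two libraries (for example a `Set` for `safeAttrs`) still need manual attention.

```typescript
/** Convert a snake_case name such as "safe_attrs_only" to "safeAttrsOnly". */
function snakeToCamel(name: string): string {
  return name.replace(/_([a-z0-9])/g, (_match, ch: string) => ch.toUpperCase());
}

/** Rename every key of a Python-style keyword map; values are passed through. */
function renameKeys(kwargs: Record<string, unknown>): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(kwargs)) {
    result[snakeToCamel(key)] = value;
  }
  return result;
}
```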

🧪 Testing

pnpm test               # Run all tests
pnpm test:watch         # Watch mode
pnpm test:ui            # Interactive UI
pnpm test:coverage      # Coverage report

📄 License

BSD-3-Clause (same as the original lxml_html_clean)

🤝 Contributing

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Add tests for your changes
Ensure all tests pass (pnpm test)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

🔗 Related Projects

lxml_html_clean: Original Python library
DOMPurify: Allowlist-based HTML sanitizer
js-xss: Another JavaScript XSS filter

⚡ Performance Tips

Use specific options to disable unneeded processing
For large documents, consider streaming/chunking
Cache cleaner instances for repeated use
Profile memory usage with very large inputs
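
The "cache cleaner instances" tip can be sketched as a small memoizing factory. This is an assumption about one reasonable approach, not part of the package: `makeCachedFactory` is a hypothetical helper, and `create` stands in for something like `(o) => new Cleaner(o)`.

```typescript
// Generic cache sketch; `create` stands in for `(o) => new Cleaner(o)`.
function makeCachedFactory<O, C>(create: (options: O) => C): (options: O) => C {
  const cache = new Map<string, C>();
  return (options: O): C => {
    // JSON.stringify is sensitive to key order and serializes a Set as {},
    // so this key is only suitable for plain boolean/string/array options.
    const key = JSON.stringify(options);
    let instance = cache.get(key);
    if (instance === undefined) {
      instance = create(options);
      cache.set(key, instance);
    }
    return instance;
  };
}
```

Because the cache key is derived from the serialized options, two calls with equal plain-object options share one instance, while options that include `Set` values (such as `safeAttrs`) would need a custom key.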

Security Reminder: Always test thoroughly with your specific use case and consider multiple layers of security defense.