typescript-html-clean
v1.0.0
TypeScript port of lxml_html_clean - HTML sanitization library for security-sensitive environments
TypeScript HTML Clean
A TypeScript/JavaScript port of the Python lxml_html_clean library for defensive HTML sanitization in security-sensitive environments.
🔒 Security Notice
Important: This HTML cleaner is a defensive security tool and should be used with caution. It takes a blocklist-based approach, which may not catch every attack vector. For maximum security, consider an allowlist-based sanitizer such as DOMPurify.
✨ Features
- Comprehensive HTML Sanitization: Remove scripts, dangerous attributes, and malicious content
- JavaScript Protection: Detect and remove JavaScript in attributes, URLs, and CSS
- CSS Security: Filter malicious CSS expressions, imports, and embedded content
- Configurable: 20+ options to customize sanitization behavior
- TypeScript Support: Full type definitions included
- Node.js & Browser: Works in both server-side and client-side environments
- Auto-linking: Convert plain URLs to clickable links
- Word Breaking: Prevent layout-breaking long words
📦 Installation
pnpm add typescript-html-clean
# or
npm install typescript-html-clean
🚀 Quick Start
import { cleanHtml, Cleaner } from 'typescript-html-clean';
// Simple usage with defaults
const maliciousHtml = '<script>alert("xss")</script><p onclick="hack()">Content</p>';
const cleaned = cleanHtml(maliciousHtml);
console.log(cleaned); // '<p>Content</p>'
// Custom configuration
const cleaner = new Cleaner({
  scripts: true, // Remove <script> tags
  javascript: true, // Remove JavaScript from attributes and URLs
  comments: true, // Remove HTML comments
  style: false, // Keep <style> tags
  embedded: true, // Remove <iframe>, <embed>, etc.
  safeAttrsOnly: true, // Only allow safe attributes
  hostWhitelist: ['trusted-domain.com'] // Allow embedded content from specific hosts
});
const result = cleaner.cleanHtml(maliciousHtml);
🛡️ Security Features
JavaScript Removal
- Event handlers: onclick, onload, onmouseover, etc.
- JavaScript URLs: javascript:, jscript:, vbscript:, etc.
- Obfuscated schemes: j a v a s c r i p t :, URL-encoded variants
- CSS expressions: IE-specific expression() calls
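To illustrate why obfuscated schemes need normalization before matching, here is a minimal sketch. It is not the library's implementation: `hasBlockedScheme`, `safeDecode`, and the `BLOCKED_SCHEMES` list are hypothetical names, and the list only covers the three schemes named above.

```typescript
// Hypothetical helper for illustration; not part of typescript-html-clean's API.
const BLOCKED_SCHEMES = ["javascript:", "jscript:", "vbscript:"];

// Decode percent-encoding, falling back to the raw text when it is malformed.
function safeDecode(value: string): string {
  try {
    return decodeURIComponent(value);
  } catch {
    return value;
  }
}

// Returns true when the URL, after normalization, starts with a blocked scheme.
function hasBlockedScheme(url: string): boolean {
  const normalized = safeDecode(url)
    // Drop whitespace and ASCII control characters used to split the scheme.
    .replace(/[\s\u0000-\u001f\u007f]+/g, "")
    .toLowerCase();
  return BLOCKED_SCHEMES.some((scheme) => normalized.startsWith(scheme));
}

console.log(hasBlockedScheme("j a v a s c r i p t :alert(1)")); // true
console.log(hasBlockedScheme("https://example.com/page")); // false
```

A plain string comparison against `javascript:` would miss the spaced and percent-encoded variants, which is why the check runs on the normalized value.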
CSS Protection
- Malicious imports: @import statements that could load external threats
- Embedded scripts: CSS comments containing HTML/JavaScript
- Expression attacks: CSS expressions that execute JavaScript
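The checks above can be sketched roughly as follows. This is an assumption about how such a filter might work, not the library's actual code; `stripCssComments` and `isSuspiciousCss` are hypothetical names.

```typescript
// Hypothetical helpers for illustration; not part of typescript-html-clean's API.

// Remove /* ... */ comments so a keyword cannot be hidden by splitting it.
function stripCssComments(css: string): string {
  return css.replace(/\/\*[\s\S]*?\*\//g, "");
}

// Flags CSS that contains expression(), @import, or a javascript: URL.
function isSuspiciousCss(css: string): boolean {
  const compact = stripCssComments(css).replace(/\s+/g, "").toLowerCase();
  return (
    compact.includes("expression(") ||
    compact.includes("@import") ||
    compact.includes("javascript:")
  );
}

console.log(isSuspiciousCss("width: expr/* x */ession(alert(1))")); // true
console.log(isSuspiciousCss("color: red; margin: 0 auto;")); // false
```

Comments are stripped before matching because `expr/* x */ession(` would otherwise slip past a naive substring search.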
Content Filtering
- Dangerous tags: Scripts, forms, embedded objects, frames
- Unsafe attributes: Remove non-whitelisted attributes
- Host validation: Whitelist external content sources
- Control characters: Strip characters that could bypass filters
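As a rough illustration of control-character stripping, the sketch below removes C0 control characters and DEL while keeping tab, line feed, and carriage return. Exactly which characters the library removes is not stated here, so the character range and the name `stripControlChars` are assumptions.

```typescript
// Hypothetical helper for illustration; the exact character set is an assumption.
function stripControlChars(text: string): string {
  // Removes U+0000-U+0008, U+000B, U+000C, U+000E-U+001F and U+007F.
  // Tab (U+0009), line feed (U+000A) and carriage return (U+000D) are kept.
  return text.replace(/[\u0000-\u0008\u000b\u000c\u000e-\u001f\u007f]/g, "");
}

console.log(stripControlChars("java\u0000script")); // "javascript"
```

Stripping these characters first means later pattern checks see the same text a lenient browser parser might reconstruct.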
📖 API Reference
Main Functions
// Clean HTML with default settings
function cleanHtml<T extends string | Element>(html: T): T
// Create custom cleaner instance
class Cleaner {
  constructor(options?: CleanerOptions)
  cleanHtml<T extends string | Element>(html: T): T
}
// Auto-link URLs in text
function autolinkHtml<T extends string | Element>(html: T, options?: AutolinkOptions): T
// Break long words to prevent layout issues
function wordBreakHtml<T extends string | Element>(html: T, options?: WordBreakOptions): T
// Remove ASCII control characters
function removeControlCharacters(html: string): string
Configuration Options
interface CleanerOptions {
  scripts?: boolean; // Remove <script> tags (default: true)
  javascript?: boolean; // Remove JavaScript from attributes (default: true)
  comments?: boolean; // Remove HTML comments (default: true)
  style?: boolean; // Remove <style> tags (default: false)
  inlineStyle?: boolean; // Remove style attributes (default: follows style)
  links?: boolean; // Remove <link> tags (default: true)
  meta?: boolean; // Remove <meta> tags (default: true)
  pageStructure?: boolean; // Remove <head>, <html>, <title> (default: true)
  embedded?: boolean; // Remove <iframe>, <embed>, etc. (default: true)
  frames?: boolean; // Remove frame-related tags (default: true)
  forms?: boolean; // Remove form elements (default: true)
  annoying?: boolean; // Remove <blink>, <marquee> (default: true)
  removeTags?: string[]; // Tags to remove (keep content)
  killTags?: string[]; // Tags to kill (remove with content)
  allowTags?: string[]; // Allowed tags (allowlist mode)
  removeUnknownTags?: boolean; // Remove non-standard tags (default: true)
  safeAttrsOnly?: boolean; // Only allow safe attributes (default: true)
  safeAttrs?: Set<string>; // Custom safe attributes set
  addNofollow?: boolean; // Add rel="nofollow" to links (default: false)
  hostWhitelist?: string[]; // Allowed hosts for embedded content
  whitelistTags?: Set<string>; // Tags that can use host whitelist
}
🧪 Examples
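The inlineStyle option in CleanerOptions is documented as defaulting to whatever style is set to. One plausible way to resolve such a dependent default is sketched below. `StyleOptions` and `resolveStyleOptions` are hypothetical, local stand-ins, and the library may resolve its defaults differently.

```typescript
// Local, hypothetical subset of the documented options; not imported from the library.
interface StyleOptions {
  style?: boolean;
  inlineStyle?: boolean;
}

// Fill in defaults: style falls back to false, inlineStyle falls back to style.
function resolveStyleOptions(options: StyleOptions = {}): Required<StyleOptions> {
  const style = options.style ?? false; // documented default: false
  const inlineStyle = options.inlineStyle ?? style; // documented default: follows style
  return { style, inlineStyle };
}

console.log(resolveStyleOptions({ style: true })); // { style: true, inlineStyle: true }
```

Under this reading, setting `style: true` alone would also remove style attributes, while an explicit `inlineStyle: false` would keep them.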
Basic Sanitization
import { cleanHtml } from 'typescript-html-clean';
const malicious = `
<div onclick="steal_data()">
<script>alert('xss')</script>
<p style="background: url(javascript:hack())">Content</p>
<iframe src="https://evil.com/malware"></iframe>
</div>
`;
const safe = cleanHtml(malicious);
// Result: '<div><p>Content</p></div>'
Whitelist Trusted Domains
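Conceptually, a host whitelist parses each embed URL and compares its hostname against the allowed list. The sketch below assumes exact hostname matching and http/https only; `isHostAllowed` is a hypothetical name, and whether the library also accepts subdomains (for example www.youtube.com) is not stated in this README.

```typescript
// Hypothetical helper for illustration; matching rules are an assumption.
function isHostAllowed(src: string, allowedHosts: string[]): boolean {
  let url: URL;
  try {
    url = new URL(src);
  } catch {
    return false; // relative or malformed URLs are rejected in this sketch
  }
  // Only plain web URLs are considered; javascript:, data:, etc. are rejected.
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const hostname = url.hostname.toLowerCase();
  return allowedHosts.some((host) => hostname === host.toLowerCase());
}

const allowed = ["youtube.com", "vimeo.com"];
console.log(isHostAllowed("https://youtube.com/embed/VIDEO_ID", allowed)); // true
console.log(isHostAllowed("https://youtube.com.evil.com/x", allowed)); // false
```

Comparing the parsed hostname, rather than searching the raw string for "youtube.com", avoids lookalike hosts such as youtube.com.evil.com.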
import { Cleaner } from 'typescript-html-clean';
const cleaner = new Cleaner({
  embedded: true, // Remove embeds by default
  hostWhitelist: ['youtube.com', 'vimeo.com'] // But allow these
});
const html = `
<iframe src="https://youtube.com/embed/VIDEO_ID"></iframe>
<iframe src="https://evil.com/malware"></iframe>
`;
const result = cleaner.cleanHtml(html);
// Result: Only YouTube iframe remains
Custom Safe Attributes
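The idea behind a safe-attributes set is a simple allowlist filter: any attribute whose name is not in the set is dropped. A minimal sketch on a plain record of attributes follows; `filterAttributes` is a hypothetical name and the library operates on DOM elements rather than records.

```typescript
// Hypothetical helper for illustration; works on a plain record, not a DOM element.
function filterAttributes(
  attrs: Record<string, string>,
  safeAttrs: Set<string>
): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const [name, value] of Object.entries(attrs)) {
    // Attribute names are case-insensitive in HTML, so compare in lower case.
    if (safeAttrs.has(name.toLowerCase())) kept[name] = value;
  }
  return kept;
}

const safe = new Set(["id", "class", "href", "src", "alt", "title"]);
const kept = filterAttributes(
  { src: "photo.jpg", alt: "Photo", onclick: "hack()", "data-track": "analytics" },
  safe
);
console.log(kept); // { src: 'photo.jpg', alt: 'Photo' }
```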
import { Cleaner } from 'typescript-html-clean';
const cleaner = new Cleaner({
  safeAttrsOnly: true,
  safeAttrs: new Set(['id', 'class', 'href', 'src', 'alt', 'title'])
});
const html = '<img src="photo.jpg" alt="Photo" onclick="hack()" data-track="analytics">';
const result = cleaner.cleanHtml(html);
// Result: '<img src="photo.jpg" alt="Photo">' (onclick and data-track removed)
Auto-linking URLs
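For intuition, auto-linking amounts to finding URL-like runs in text and wrapping them in anchor tags while escaping everything else. The sketch below works on plain text only and handles only http/https URLs. `autolinkText` and `escapeHtml` are hypothetical names, and the library's own matching rules (including mailto: handling and text inside existing markup) are likely more involved.

```typescript
// Hypothetical helpers for illustration; plain-text input only.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap each http(s) URL in an <a> tag and escape the surrounding text.
function autolinkText(text: string): string {
  const urlPattern = /https?:\/\/[^\s<>"]+/g;
  let result = "";
  let lastIndex = 0;
  for (const match of text.matchAll(urlPattern)) {
    const url = match[0];
    const start = match.index ?? 0;
    result += escapeHtml(text.slice(lastIndex, start));
    result += `<a href="${escapeHtml(url)}">${escapeHtml(url)}</a>`;
    lastIndex = start + url.length;
  }
  return result + escapeHtml(text.slice(lastIndex));
}

console.log(autolinkText("Visit https://github.com now"));
// Visit <a href="https://github.com">https://github.com</a> now
```

Note that this simple pattern treats trailing punctuation (for example a sentence-ending period) as part of the URL.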
import { autolinkHtml } from 'typescript-html-clean';
const text = '<p>Visit https://github.com and mailto:[email protected]</p>';
const linked = autolinkHtml(text);
// Result: URLs become clickable <a> tags
Word Breaking
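Word breaking can be pictured as inserting a zero-width space (U+200B) every maxWidth characters inside any run of non-whitespace longer than maxWidth. The following sketch shows that idea on plain text. `breakLongWords` is a hypothetical name, maxWidth is assumed to be a positive integer, and the library's handling of tags and attributes is not modeled.

```typescript
// Hypothetical helper for illustration; plain-text input only.
const ZERO_WIDTH_SPACE = "\u200b";

function breakLongWords(text: string, maxWidth: number): string {
  if (!Number.isInteger(maxWidth) || maxWidth < 1) {
    throw new RangeError("maxWidth must be a positive integer");
  }
  // Match runs of non-whitespace characters longer than maxWidth.
  const longRun = new RegExp(`\\S{${maxWidth + 1},}`, "g");
  return text.replace(longRun, (word) => {
    const pieces: string[] = [];
    for (let i = 0; i < word.length; i += maxWidth) {
      pieces.push(word.slice(i, i + maxWidth));
    }
    return pieces.join(ZERO_WIDTH_SPACE);
  });
}

const broken = breakLongWords("supercalifragilisticexpialidocious", 10);
console.log(broken.split(ZERO_WIDTH_SPACE));
// [ 'supercalif', 'ragilistic', 'expialidoc', 'ious' ]
```

The zero-width space is invisible but gives the browser a permitted line-break point, so the visible text is unchanged.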
import { wordBreakHtml } from 'typescript-html-clean';
const html = '<p>supercalifragilisticexpialidocious</p>';
const broken = wordBreakHtml(html, { maxWidth: 10 });
// Result: Long word is broken with zero-width spaces
🎯 Use Cases
- CMS Content: Sanitize user-generated content in blogs, forums, wikis
- Email Processing: Clean HTML emails before display
- Data Import: Sanitize HTML from external sources
- API Responses: Clean HTML before sending to frontend
- Markdown Processing: Sanitize HTML output from Markdown renderers
⚠️ Limitations
- Blocklist Approach: May not catch novel attack vectors
- Complex Parsing: Edge cases in HTML parsing may not be handled correctly
- Performance: Large documents take longer to process
- Dependencies: Requires JSDOM for server-side DOM manipulation
🔄 Migration from Python
If you're migrating from the Python lxml_html_clean library:
# Python
from lxml_html_clean import clean_html, Cleaner
cleaner = Cleaner(scripts=True, javascript=True)
result = cleaner.clean_html(html)
// TypeScript
import { cleanHtml, Cleaner } from 'typescript-html-clean';
const cleaner = new Cleaner({ scripts: true, javascript: true });
const result = cleaner.cleanHtml(html);
🧪 Testing
pnpm test # Run all tests
pnpm test:watch # Watch mode
pnpm test:ui # Interactive UI
pnpm test:coverage # Coverage report
📄 License
BSD-3-Clause (same as original lxml_html_clean)
🤝 Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Add tests for your changes
- Ensure all tests pass (pnpm test)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
🔗 Related Projects
- lxml_html_clean: Original Python library
- DOMPurify: Allowlist-based HTML sanitizer
- js-xss: Another JavaScript XSS filter
⚡ Performance Tips
- Use specific options to disable unneeded processing
- For large documents, consider streaming/chunking
- Cache cleaner instances for repeated use
- Profile memory usage with very large inputs
Security Reminder: Always test thoroughly with your specific use case and consider multiple layers of security defense.
