email-scrubber-core
v1.1.0
A privacy-focused email sanitizer that removes trackers and cleans URLs, powered by the ClearURLs ruleset.
A privacy-focused email sanitizer that removes trackers from URLs and HTML content.
Features
🛡️ Universal Privacy Protection
- Cleans All URLs: Removes common tracking parameters from any URL, not just those from specific providers.
- Removes Tracking Pixels: Strips 1x1 pixels and other tracking images from email content.
- Powered by ClearURLs: Uses the excellent ClearURLs ruleset for provider-specific cleaning.
🚀 High Performance & Modern
- Streaming Architecture: Built for modern edge environments like Cloudflare Workers, Vercel Edge Functions, and Deno.
- Buffered API: Provides a simple, memory-based API for traditional Node.js environments.
- Efficient: Uses linkedom for fast, server-side DOM parsing in buffered mode.
🔧 Customizable & Flexible
- Configure or extend the tracking domains and parameters.
- Choose between streaming handlers or a simple buffered function.
- Preserve the full HTML structure or extract only the <body> content.
How It Works
The library takes HTML content and sanitizes it in two main ways:
1. URL Cleaning
A two-pass system ensures comprehensive cleaning:
- Default Cleaning: A built-in list of common tracking parameters (e.g., utm_source, fbclid, gclid) is removed from every URL.
- Provider-Specific Cleaning: For domains found in the ClearURLs ruleset (like Google, Facebook, Amazon), a second, more specific set of rules is applied.
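The two passes described above can be sketched in a few lines using the standard URL API. This is an illustration only, not the library's implementation: the names DEFAULT_PARAMS, PROVIDER_RULES, and cleanUrl, and the parameter patterns in them, are hypothetical and far smaller than the real ClearURLs ruleset.

```javascript
// Hypothetical default list; the library's built-in list is likely longer.
const DEFAULT_PARAMS = new Set(["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"]);

// Hypothetical provider rules, loosely modeled on the ClearURLs idea of
// per-provider parameter patterns. These patterns are illustrative only.
const PROVIDER_RULES = [
  { hostPattern: /(^|\.)amazon\.[a-z.]+$/i, params: [/^ref_?$/, /^pf_rd_[a-z]+$/] },
];

function cleanUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return input; // Leave relative or malformed URLs untouched.
  }

  // Pass 1: strip default tracking parameters from every URL.
  for (const name of [...url.searchParams.keys()]) {
    if (DEFAULT_PARAMS.has(name.toLowerCase())) url.searchParams.delete(name);
  }

  // Pass 2: apply extra rules only when the host matches a known provider.
  for (const rule of PROVIDER_RULES) {
    if (!rule.hostPattern.test(url.hostname)) continue;
    for (const name of [...url.searchParams.keys()]) {
      if (rule.params.some((re) => re.test(name))) url.searchParams.delete(name);
    }
  }
  return url.toString();
}
```

Running the default pass first means unknown senders still get baseline cleaning, while the second pass only costs time on hosts that match a provider pattern.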
2. Tracking Pixel Detection
The library removes <img> tags that are likely to be tracking pixels by checking for:
- 1x1 or 2x2 dimensions.
- Known tracking domains (e.g., google-analytics.com).
- Hidden styles (display: none, visibility: hidden).
- Lack of descriptive alt text.
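As a rough sketch of how the checks above might combine, the function below scores a plain object of img attributes. The README does not say how the library weighs each signal, so the combination used here (a known domain is conclusive, while tiny or hidden images must also lack alt text) is an assumption, and TRACKING_DOMAINS and isLikelyTrackingPixel are hypothetical names.

```javascript
// Hypothetical list; the library's actual domain list is not shown in this README.
const TRACKING_DOMAINS = ["google-analytics.com"];

function parseDimension(value) {
  const n = Number.parseInt(value ?? "", 10);
  return Number.isNaN(n) ? null : n;
}

function isLikelyTrackingPixel(attrs) {
  const { src = "", width, height, style = "", alt = "" } = attrs;

  // Assumption: a known tracking domain is treated as conclusive on its own.
  let host = "";
  try {
    host = new URL(src).hostname;
  } catch {
    // Relative or invalid src: no host to compare.
  }
  if (TRACKING_DOMAINS.some((d) => host === d || host.endsWith("." + d))) return true;

  // 1x1 or 2x2 dimensions declared via attributes.
  const w = parseDimension(width);
  const h = parseDimension(height);
  const tiny = w !== null && w === h && (w === 1 || w === 2);

  // Inline styles that hide the image.
  const hidden = /display\s*:\s*none|visibility\s*:\s*hidden/i.test(style);

  // Assumption: tiny or hidden images are flagged only when alt text is missing.
  const noAlt = alt.trim() === "";
  return (tiny || hidden) && noAlt;
}
```

Requiring missing alt text alongside the size or visibility signal is one way to avoid removing small decorative images that an author has labeled deliberately.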
Installation
npm install email-scrubber-core
Quick Start
For Node.js (Buffered API)
This is the simplest way to use the library in a standard Node.js application.
import { sanitizeEmailBuffered, createMinimalRules } from 'email-scrubber-core';
// Use the built-in minimal ruleset, which includes default cleaning.
const rules = createMinimalRules();
const dirtyEmail = `
<div>
<p>Check out our <a href="https://a-random-site.com/product?utm_source=email">latest product</a>!</p>
<img src="https://google-analytics.com/collect?tid=UA-12345" width="1" height="1">
</div>
`;
// sanitizeEmailBuffered processes the entire HTML string in memory.
const result = sanitizeEmailBuffered(dirtyEmail, rules);
// Resulting HTML:
// <div>
// <p>Check out our <a href="https://a-random-site.com/product">latest product</a>!</p>
// </div>
console.log(result.html);
console.log(
  `Cleaned ${result.urlsCleaned} URLs and removed ${result.trackingPixelsRemoved} tracking pixels.`
);
For Cloudflare Workers (Streaming API)
This is the recommended, high-performance approach for edge environments. It transforms the HTML as it streams without buffering the entire body in memory.
// In your Cloudflare Worker's main file:
import { getStreamingHandlers, createMinimalRules } from 'email-scrubber-core';
// Get the sanitization rules and handlers.
const rules = createMinimalRules();
const handlers = getStreamingHandlers(rules);
// Helper function to detect if email content is HTML
function isHtmlContent(content) {
  // Check for HTML tags (case-insensitive)
  const htmlRegex = /<\s*[a-zA-Z][^>]*>/i;
  return htmlRegex.test(content);
}
export default {
  async email(message, env, ctx) {
    // Create an HTMLRewriter and attach the handlers.
    const rewriter = new HTMLRewriter()
      .on('a[href]', handlers.linkHandler)
      .on('img', handlers.pixelHandler);
    // Create a Response object from the raw email
    const emailResponse = new Response(message.raw, {
      headers: {
        'Content-Type': 'text/html; charset=utf-8'
      }
    });
    // Transform the email content
    const sanitizedResponse = rewriter.transform(emailResponse);
    const sanitizedContent = await sanitizedResponse.text();
    // Forward the email.
    // Note: message.forward(rcptTo, headers?) forwards the original message and
    // does not accept a replacement body, so sanitizedContent cannot be passed
    // to it. Delivering the sanitized content needs a separate mechanism.
    await message.forward("inbox@corp");
  },
};
API Reference
Core Functions
sanitizeEmailBuffered(html, rules, options?)
Sanitizes a complete HTML string in memory. Ideal for Node.js.
- html (string): The HTML content to sanitize.
- rules (ClearUrlRules): The ruleset to apply.
- options (SanitizeEmailOptions): Optional configuration.
- Returns: SanitizeEmailResult
getStreamingHandlers(rules)
Returns handler objects for use with HTMLRewriter in a streaming environment.
- rules (ClearUrlRules): The ruleset to apply.
- Returns: An object with linkHandler and pixelHandler.
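For readers unfamiliar with HTMLRewriter, an element handler is an object with an element() method that receives each matched element and can read, change, or remove it. The sketch below shows that general shape; it is not the code returned by getStreamingHandlers, and stripUtmParams, exampleLinkHandler, and examplePixelHandler are hypothetical names with deliberately simplified logic.

```javascript
// Hypothetical stand-in for a real URL cleaner: drops utm_* parameters only.
function stripUtmParams(href) {
  try {
    const url = new URL(href);
    for (const name of [...url.searchParams.keys()]) {
      if (name.startsWith("utm_")) url.searchParams.delete(name);
    }
    return url.toString();
  } catch {
    return href; // Relative or malformed hrefs are left as they are.
  }
}

// Rewrites the href attribute of each matched <a> element in place.
const exampleLinkHandler = {
  element(el) {
    const href = el.getAttribute("href");
    if (href) el.setAttribute("href", stripUtmParams(href));
  },
};

// Removes matched <img> elements whose src points at one example domain.
const examplePixelHandler = {
  element(el) {
    const src = el.getAttribute("src") || "";
    if (src.includes("google-analytics.com")) el.remove();
  },
};
```

Handlers of this shape are attached with new HTMLRewriter().on(selector, handler), so each one only sees the elements its selector matches.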
createMinimalRules()
Returns a built-in, lightweight ruleset that includes the default * rule for cleaning all URLs, plus specific rules for major providers like Google and Facebook. This is often all you need.
For detailed API documentation on options and return types, please refer to the TypeScript definitions included in the package.
License
This project is licensed under the LGPL-3.0 License due to its use of the ClearURLs ruleset. See the LICENSE file for details. This means you can use it freely in your projects, but if you modify and distribute this library itself, you must share your changes under the same license.
