email-scrubber-core

v1.1.0

Published

10 months ago

A privacy-focused email sanitizer that removes trackers and cleans URLs, powered by the ClearURLs ruleset.

email sanitizer privacy tracker cleaner clearurls url-cleaning tracking-removal email-security html-cleaning typescript cloudflare worker

A privacy-focused email sanitizer that removes trackers from URLs and HTML content.

Features

🛡️ Universal Privacy Protection

Cleans All URLs: Removes common tracking parameters from any URL, not just those from specific providers.
Removes Tracking Pixels: Strips 1x1 pixels and other tracking images from email content.
Powered by ClearURLs: Uses the excellent ClearURLs ruleset for provider-specific cleaning.

🚀 High Performance & Modern

Streaming Architecture: Built for modern edge environments like Cloudflare Workers, Vercel Edge Functions, and Deno.
Buffered API: Provides a simple, memory-based API for traditional Node.js environments.
Efficient: Uses linkedom for fast, server-side DOM parsing in buffered mode.

🔧 Customizable & Flexible

Configure or extend the tracking domains and parameters.
Choose between streaming handlers or a simple buffered function.
Preserve the full HTML structure or extract only the <body> content.

How It Works

The library takes HTML content and sanitizes it in two main ways:

1. URL Cleaning

A two-pass system ensures comprehensive cleaning:

Default Cleaning: A built-in list of common tracking parameters (e.g., utm_source, fbclid, gclid) is removed from every URL.
Provider-Specific Cleaning: For domains found in the ClearURLs ruleset (like Google, Facebook, Amazon), a second, more specific set of rules is applied.
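The two passes described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the names `ProviderRule`, `DEFAULT_PARAMS`, `PROVIDERS`, `stripParams`, and `cleanUrl` are hypothetical, and the parameter lists are a small illustrative subset rather than the real ClearURLs ruleset.

```typescript
// Hypothetical, simplified rule shape; the library's real types may differ.
interface ProviderRule {
  urlPattern: RegExp; // which URLs this provider applies to
  params: RegExp[];   // parameter names to strip for that provider
}

// Pass 1: parameters removed from every URL (illustrative subset).
const DEFAULT_PARAMS: RegExp[] = [/^utm_/i, /^fbclid$/i, /^gclid$/i];

// Pass 2: provider-specific rules (illustrative, not the real ruleset).
const PROVIDERS: ProviderRule[] = [
  {
    urlPattern: /^https?:\/\/([a-z0-9-]+\.)*amazon\.com/i,
    params: [/^pd_rd_/i, /^ref_?$/i],
  },
];

// Delete every query parameter whose name matches one of the patterns.
function stripParams(url: URL, patterns: RegExp[]): void {
  // Copy the keys first so deletion does not disturb iteration.
  for (const key of Array.from(url.searchParams.keys())) {
    if (patterns.some((p) => p.test(key))) {
      url.searchParams.delete(key);
    }
  }
}

function cleanUrl(raw: string): string {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return raw; // relative or malformed URLs are left untouched
  }
  // First pass: default parameters, applied to every URL.
  stripParams(url, DEFAULT_PARAMS);
  // Second pass: extra rules for providers whose pattern matches.
  for (const provider of PROVIDERS) {
    if (provider.urlPattern.test(raw)) {
      stripParams(url, provider.params);
    }
  }
  return url.toString();
}
```

With these assumed rules, `cleanUrl("https://a-random-site.com/product?utm_source=email")` returns `https://a-random-site.com/product`. Note that editing `searchParams` re-serializes the remaining query string, so surviving parameters may come back with slightly different encoding.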

2. Tracking Pixel Detection

The library removes <img> tags that are likely to be tracking pixels by checking for:

1x1 or 2x2 dimensions.
Known tracking domains (e.g., google-analytics.com).
Hidden styles (display: none, visibility: hidden).
Lack of descriptive alt text.
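A rough sketch of how these checks might be combined is shown below. The README does not say how the library weighs the individual signals, so the combination here is an assumption: a known tracking domain or a hidden style is treated as sufficient on its own, while tiny dimensions only count when there is no alt text. `ImgAttrs`, `TRACKING_DOMAINS`, and `isLikelyTrackingPixel` are hypothetical names.

```typescript
// Hypothetical attribute bag for an <img> element.
interface ImgAttrs {
  src?: string;
  width?: string;
  height?: string;
  style?: string;
  alt?: string;
}

// Illustrative subset; the library ships its own list.
const TRACKING_DOMAINS: string[] = ["google-analytics.com"];

// True when the image host is a listed domain or one of its subdomains.
function hostMatches(src: string, domains: string[]): boolean {
  try {
    const host = new URL(src).hostname.toLowerCase();
    return domains.some((d) => host === d || host.endsWith("." + d));
  } catch {
    return false; // relative or malformed src
  }
}

// True for a width/height value of 2 pixels or less.
function isTinyDimension(value: string | undefined): boolean {
  if (value === undefined) return false;
  const n = parseInt(value, 10);
  return !Number.isNaN(n) && n <= 2;
}

function isLikelyTrackingPixel(attrs: ImgAttrs): boolean {
  // Signal: known tracking domain.
  if (hostMatches(attrs.src ?? "", TRACKING_DOMAINS)) return true;

  // Signal: hidden via inline style (whitespace removed before matching).
  const style = (attrs.style ?? "").toLowerCase().replace(/\s+/g, "");
  if (style.includes("display:none") || style.includes("visibility:hidden")) {
    return true;
  }

  // Signals: tiny dimensions combined with missing alt text (assumed pairing).
  const tiny = isTinyDimension(attrs.width) && isTinyDimension(attrs.height);
  const hasAlt = (attrs.alt ?? "").trim().length > 0;
  return tiny && !hasAlt;
}
```

Pairing the size check with the alt-text check is one way to avoid removing small but meaningful images such as labelled icons; a real implementation may choose a different trade-off.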

Installation

npm install email-scrubber-core

Quick Start

For Node.js (Buffered API)

This is the simplest way to use the library in a standard Node.js application.

import { sanitizeEmailBuffered, createMinimalRules } from 'email-scrubber-core';

// Use the built-in minimal ruleset, which includes default cleaning.
const rules = createMinimalRules();

const dirtyEmail = `
  <div>
    <p>Check out our <a href="https://a-random-site.com/product?utm_source=email">latest product</a>!</p>
    <img src="https://google-analytics.com/collect?tid=UA-12345" width="1" height="1">
  </div>
`;

// sanitizeEmailBuffered processes the entire HTML string in memory.
const result = sanitizeEmailBuffered(dirtyEmail, rules);

// Resulting HTML:
// <div>
//   <p>Check out our <a href="https://a-random-site.com/product">latest product</a>!</p>
// </div>
console.log(result.html);

console.log(
  `Cleaned ${result.urlsCleaned} URLs and removed ${result.trackingPixelsRemoved} tracking pixels.`
);

For Cloudflare Workers (Streaming API)

This is the recommended, high-performance approach for edge environments. HTMLRewriter transforms the HTML as it streams, so the transformation itself does not need to buffer the entire body in memory.

// In your Cloudflare Worker's main file:
import { getStreamingHandlers, createMinimalRules } from 'email-scrubber-core';

// Get the sanitization rules and handlers.
const rules = createMinimalRules();
const handlers = getStreamingHandlers(rules);

// Helper function to detect if email content is HTML
function isHtmlContent(content) {
  // Check for HTML tags (case-insensitive)
  const htmlRegex = /<\s*[a-zA-Z][^>]*>/i;
  return htmlRegex.test(content);
}

export default {
  async email(message, env, ctx) {
    // Create an HTMLRewriter and attach the handlers.
    const rewriter = new HTMLRewriter()
      .on('a[href]', handlers.linkHandler)
      .on('img', handlers.pixelHandler);

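    // Wrap the raw email stream in a Response so HTMLRewriter can transform it.
    // Note: message.raw is the full MIME message (headers and body), not just the HTML part.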
    const emailResponse = new Response(message.raw, {
      headers: {
        'Content-Type': 'text/html; charset=utf-8'
      }
    });

    // Transform the email content
    const sanitizedResponse = rewriter.transform(emailResponse);
    const sanitizedContent = await sanitizedResponse.text();

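    // Forward the email.
    // Caveat: message.forward(rcptTo, headers?) forwards the ORIGINAL message unchanged;
    // Cloudflare Email Workers do not yet provide a way to forward a modified body.
    // sanitizedContent is therefore not delivered by this call and has to be stored
    // or sent through another channel.
    await message.forward("inbox@corp");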
  },
};
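For readers unfamiliar with HTMLRewriter, the sketch below shows the general shape of an element handler: an object with an `element(el)` method that reads and rewrites attributes or removes the element. It is not the code returned by `getStreamingHandlers`; `RewriterElement`, `stripUtmParams`, `sketchLinkHandler`, and `sketchPixelHandler` are hypothetical names, and the cleaning logic is deliberately reduced to `utm_*` parameters and 1x1 images.

```typescript
// Minimal subset of the HTMLRewriter Element interface used in this sketch.
interface RewriterElement {
  getAttribute(name: string): string | null;
  setAttribute(name: string, value: string): unknown;
  remove(): unknown;
}

// Hypothetical stand-in for the library's URL cleaner: drops utm_* parameters only.
function stripUtmParams(href: string): string {
  try {
    const url = new URL(href);
    let changed = false;
    for (const key of Array.from(url.searchParams.keys())) {
      if (key.toLowerCase().startsWith("utm_")) {
        url.searchParams.delete(key);
        changed = true;
      }
    }
    // Return the original string when nothing was removed, to avoid
    // rewriting links merely because of URL normalization.
    return changed ? url.toString() : href;
  } catch {
    return href; // relative or malformed values are left as-is
  }
}

// Rewrites the href attribute of a link when cleaning changed it.
const sketchLinkHandler = {
  element(el: RewriterElement): void {
    const href = el.getAttribute("href");
    if (href === null) return;
    const cleaned = stripUtmParams(href);
    if (cleaned !== href) el.setAttribute("href", cleaned);
  },
};

// Removes images declared as 1x1 (a simplified stand-in for pixel detection).
const sketchPixelHandler = {
  element(el: RewriterElement): void {
    if (el.getAttribute("width") === "1" && el.getAttribute("height") === "1") {
      el.remove();
    }
  },
};
```

Handlers of this shape are attached with `.on(selector, handler)`, as in the Worker example above. Because each element is processed as it is parsed, the handler never needs the whole document in memory.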

API Reference

Core Functions

`sanitizeEmailBuffered(html, rules, options?)`

Sanitizes a complete HTML string in memory. Ideal for Node.js.

html (string): The HTML content to sanitize.
rules (ClearUrlRules): The ruleset to apply.
options (SanitizeEmailOptions): Optional configuration.
Returns: SanitizeEmailResult

`getStreamingHandlers(rules)`

Returns handler objects for use with HTMLRewriter in a streaming environment.

rules (ClearUrlRules): The ruleset to apply.
Returns: An object with linkHandler and pixelHandler.

`createMinimalRules()`

Returns a built-in, lightweight ruleset that includes the default * rule for cleaning all URLs, plus specific rules for major providers like Google and Facebook. This is often all you need.
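To make the idea of a ruleset with a default `*` entry more concrete, here is a hedged sketch of what such a structure could look like. The `providers`, `urlPattern`, `rules`, and `exceptions` field names follow the published ClearURLs data format, but the actual `ClearUrlRules` type, the way this library stores its `*` rule, and the specific patterns below are assumptions made for illustration.

```typescript
// Simplified provider entry, modeled on the ClearURLs data format.
interface Provider {
  urlPattern: string;    // regex source matched against the full URL
  rules?: string[];      // regex sources matched against parameter names
  exceptions?: string[]; // regex sources for URLs that must not be cleaned
}

interface RuleSet {
  providers: Record<string, Provider>;
}

// Hypothetical minimal ruleset: a catch-all "*" entry plus one provider.
const sketchRules: RuleSet = {
  providers: {
    "*": { urlPattern: ".*", rules: ["utm_[a-z_]*", "fbclid", "gclid"] },
    google: {
      urlPattern: "^https?://([a-z0-9-]+\\.)*google\\.[a-z.]+",
      rules: ["ved", "ei", "sxsrf"],
    },
  },
};

// Collect the compiled parameter patterns of every provider that applies to a URL.
function matchingParamPatterns(url: string, ruleSet: RuleSet): RegExp[] {
  const patterns: RegExp[] = [];
  for (const provider of Object.values(ruleSet.providers)) {
    if (!new RegExp(provider.urlPattern, "i").test(url)) continue;
    const excluded = (provider.exceptions ?? []).some((e) =>
      new RegExp(e, "i").test(url)
    );
    if (excluded) continue;
    for (const rule of provider.rules ?? []) {
      // Anchor each rule so it must match the whole parameter name.
      patterns.push(new RegExp(`^${rule}$`, "i"));
    }
  }
  return patterns;
}
```

In this sketch, a URL on an unknown domain only picks up the three catch-all patterns, while a Google URL picks up those plus the provider-specific ones.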

For detailed API documentation on options and return types, please refer to the TypeScript definitions included in the package.

License

This project is licensed under the LGPL-3.0 License due to its use of the ClearURLs ruleset. See the LICENSE file for details. This means you can use it freely in your projects, but if you modify and distribute this library itself, you must share your changes under the same license.
