octocode-data-masker

v1.0.0 · Published 8 months ago

A TypeScript library for masking sensitive data in strings, including PII, tokens, API keys, and more

sensitive-data masking privacy data-protection pii gdpr security typescript regex data-sanitization personal-data anonymization redaction

sensitive-data-masker

A high-performance TypeScript library for detecting and masking sensitive data in strings. Protect PII, API keys, tokens, credentials, and other confidential information with intelligent masking algorithms and configurable accuracy levels.

Features

🛡️ 200+ Detection Patterns: Comprehensive coverage for modern security needs
⚡ High Performance: Optimized regex engine with pattern caching
🎯 Accuracy Control: Configure detection sensitivity (high/medium/low)
🔧 Flexible Masking: Smart partial masking that preserves readability
📦 Zero Dependencies: Lightweight and secure
🌍 International Support: Handles US, UK, Canadian, and international formats
🔍 Pattern Filtering: Include or exclude specific pattern types
📊 Detailed Results: Get match counts, positions, and masked values

Installation

npm install sensitive-data-masker

yarn add sensitive-data-masker

Quick Start

import { mask, hasSensitiveContent, getPatternMatches } from 'sensitive-data-masker';

// Basic usage - intelligent partial masking
const text = 'My email is user@example.com and my SSN is 123-45-6789';
const result = mask(text);
console.log(result.output);
// "My email is **er@example.c** and my SSN is **3-45-67**"

console.log(result.found);
// { email: 1, ssn: 1 }

// Check if content contains sensitive data
const isSensitive = hasSensitiveContent(text);
console.log(isSensitive); // true

// Get detailed pattern matches with positions
const matches = getPatternMatches(text);
console.log(matches);
// [
//   {
//     pattern: 'email',
//     matches: [{ match: 'user@example.com', startIndex: 12, endIndex: 27 }]
//   },
//   {
//     pattern: 'ssn',
//     matches: [{ match: '123-45-6789', startIndex: 43, endIndex: 53 }]
//   }
// ]
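The SSN output above is consistent with a scheme that replaces the first two and last two characters of a match and leaves the middle readable. The following is a minimal sketch of that idea only. `partialMask` and its `edge` parameter are hypothetical names, not part of the library's API, and the library's real algorithm may treat other pattern types differently.

```typescript
// Hypothetical helper for illustration; not the library's implementation.
// Replaces the first and last `edge` characters with the mask character
// and keeps the middle of the value readable.
function partialMask(value: string, maskChar: string = '*', edge: number = 2): string {
  // Values too short to keep a readable middle are masked completely.
  if (value.length <= edge * 2) {
    return new Array(value.length + 1).join(maskChar);
  }
  const pad = new Array(edge + 1).join(maskChar);
  return pad + value.slice(edge, value.length - edge) + pad;
}

console.log(partialMask('123-45-6789'));      // **3-45-67**
console.log(partialMask('123-45-6789', '#')); // ##3-45-67##
console.log(partialMask('abcd'));             // ****
```

Masking short values completely is a design choice in this sketch: a four-character value with two characters hidden on each side would otherwise reveal nothing useful while still leaking its length.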

API Reference

`mask(input: string, options?: MaskingOptions): MaskResult`

Masks sensitive content in a string using intelligent partial masking.

Options

interface MaskingOptions {
  maskChar?: string;                    // Character used for masking (default: '*')
  preserveLength?: boolean;             // Preserve original length (default: false)
  excludePatterns?: string[];           // Patterns to exclude from masking
  onlyPatterns?: string[];              // Only mask these patterns
  matchAccuracy?: 'high' | 'medium' | 'low'; // Detection sensitivity
}

Returns

interface MaskResult {
  output: string;                       // Masked string
  found: { [name: string]: number };    // Count of each pattern found
  matches: string[];                    // Original matched values
  masked: string[];                     // Masked versions of matches
}

`hasSensitiveContent(input: string, options?): boolean`

Quickly check if a string contains sensitive data without performing masking.

import { hasSensitiveContent } from 'sensitive-data-masker';

hasSensitiveContent('user@example.com'); // true
hasSensitiveContent('hello world');      // false

// With options
hasSensitiveContent('sk-1234567890abcdef', { 
  matchAccuracy: 'high',
  excludePatterns: ['genericId']
}); // true

`getPatternMatches(input: string, options?): PatternMatch[]`

Get detailed information about all pattern matches including their positions.

import { getPatternMatches } from 'sensitive-data-masker';

const matches = getPatternMatches('Contact: me@example.com and key: sk-123abc');
console.log(matches);
// [
//   {
//     pattern: 'email',
//     matches: [{ match: 'me@example.com', startIndex: 9, endIndex: 22 }]
//   },
//   {
//     pattern: 'openaiApiKey',
//     matches: [{ match: 'sk-123abc', startIndex: 33, endIndex: 41 }]
//   }
// ]
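In the sample output above, `endIndex` appears to be the index of the last matched character (inclusive) rather than one past it: `sk-123abc` has nine characters and spans 33 to 41. Assuming that convention, positions of this shape could be computed with a plain `RegExp.prototype.exec` loop. `MatchPosition` and `findPositions` are hypothetical names used only for this sketch.

```typescript
// Hypothetical types and helper for illustration; not the library's code.
interface MatchPosition {
  match: string;
  startIndex: number;
  endIndex: number; // assumed inclusive, as the sample output suggests
}

function findPositions(input: string, pattern: RegExp): MatchPosition[] {
  // Work on a fresh global copy so the caller's lastIndex is never touched.
  const re = new RegExp(pattern.source, pattern.ignoreCase ? 'gi' : 'g');
  const found: MatchPosition[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(input)) !== null) {
    if (m[0].length === 0) {
      re.lastIndex++; // step forward to avoid looping forever on empty matches
      continue;
    }
    found.push({
      match: m[0],
      startIndex: m.index,
      endIndex: m.index + m[0].length - 1
    });
  }
  return found;
}

const positions = findPositions('key: sk-123abc', /sk-[a-z0-9]+/i);
console.log(positions);
// [ { match: 'sk-123abc', startIndex: 5, endIndex: 13 } ]
```

With an inclusive end index, the original text is recovered with `input.slice(startIndex, endIndex + 1)`.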

Advanced Usage

Custom Masking Options

import { mask } from 'sensitive-data-masker';

// Custom masking character
const result = mask('API key: sk-1234567890abcdef', { maskChar: '#' });
console.log(result.output);
// "API key: ##-1234567890ab##"

// Preserve original length
const result2 = mask('secret123', { preserveLength: true });
console.log(result2.output);
// "*********" (full length masked)

// Use high accuracy mode (fewer false positives)
const result3 = mask('sk-1234567890abcdef', { matchAccuracy: 'high' });
console.log(result3.output);
// "**-1234567890ab**"

Pattern Filtering

// Only mask specific patterns
const result = mask('Email: user@example.com, API: sk-123', { 
  onlyPatterns: ['email', 'openaiApiKey'] 
});

// Exclude certain patterns
const result2 = mask('Email: user@example.com, UUID: 123e4567-e89b-12d3-a456-426614174000', { 
  excludePatterns: ['uuid', 'genericId']
});

// Combine with accuracy control
const result3 = mask(sensitiveText, {
  matchAccuracy: 'high',
  excludePatterns: ['uuid']
});
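The README does not say what happens when `onlyPatterns` and `excludePatterns` are passed together. One plausible reading, sketched below, is that `onlyPatterns` narrows the set first and `excludePatterns` is then subtracted from it. `FilterSketch` and `selectPatternNames` are hypothetical names, and treating an empty `onlyPatterns` as "no restriction" is an assumption of this sketch.

```typescript
// Hypothetical sketch of include/exclude filtering; not the library's code.
interface FilterSketch {
  onlyPatterns?: string[];
  excludePatterns?: string[];
}

function selectPatternNames(all: string[], opts: FilterSketch = {}): string[] {
  const only = opts.onlyPatterns;
  const exclude = opts.excludePatterns || [];
  return all.filter((patternName) => {
    // Assumption: a missing or empty onlyPatterns list means "allow everything".
    const allowed = !only || only.length === 0 || only.indexOf(patternName) !== -1;
    return allowed && exclude.indexOf(patternName) === -1;
  });
}

const available = ['email', 'openaiApiKey', 'uuid', 'genericId'];
console.log(selectPatternNames(available, { excludePatterns: ['uuid', 'genericId'] }));
// [ 'email', 'openaiApiKey' ]
console.log(selectPatternNames(available, { onlyPatterns: ['email', 'uuid'], excludePatterns: ['uuid'] }));
// [ 'email' ]
```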

Supported Pattern Categories

The library detects sensitive data across 25 categories with 200+ patterns:

🆔 Personally Identifiable Information (PII)

Email addresses (multiple formats)
Phone numbers (US, International, E.164)
Social Security Numbers (US with various formats)
Driver's license numbers, Medical record numbers
Tax IDs (TIN/EIN), Canadian SIN, UK National Insurance Numbers

☁️ Cloud Provider Credentials

AWS: Access keys, secret keys, session tokens, account IDs
AWS Resources: EC2, S3, RDS, Lambda ARNs, VPC IDs
Azure: Subscription IDs, client secrets, resource IDs
Google Cloud: API keys, service account keys, project IDs

💳 Financial & Payment Services

Credit card numbers (Visa, MasterCard, Amex, Discover)
Stripe: Secret keys, publishable keys, webhook secrets
PayPal: Access tokens, client IDs
Square: Access tokens, application IDs
Bank account numbers (US routing numbers, IBAN)

🤖 AI Provider Credentials

OpenAI: API keys, organization IDs
Anthropic/Claude: API keys
Google AI: Gemini API keys, Vertex AI tokens
Hugging Face: Access tokens, API keys
Other AI: Groq, Perplexity, Replicate, Together AI

🔐 Authentication & Security

JWT tokens, Bearer tokens
OAuth access tokens, refresh tokens
API keys in headers (X-API-Key, Authorization)
Session IDs, CSRF tokens
Generic secret patterns in environment variables

🔧 Developer Tools & Services

GitHub: Personal access tokens, app tokens
Slack: Bot tokens, webhook URLs, app secrets
Discord: Bot tokens, webhook URLs
Analytics: Google Analytics, Mixpanel, Amplitude
Monitoring: Datadog, New Relic, Sentry keys

🗄️ Database & Storage

Database connection strings (PostgreSQL, MySQL, MongoDB)
File Storage: S3 bucket URLs, Azure Blob Storage
CDN: CloudFront URLs, Azure CDN
Redis connection strings, Elasticsearch URLs

🔑 Cryptographic Materials

RSA private keys, SSH private keys
EC private keys, DSA private keys
X.509 certificates, PGP private key blocks
JSON Web Keys (JWK), PKCS#8 keys

🌐 Network & Location

IPv4/IPv6 addresses, MAC addresses
Geographic coordinates (latitude/longitude)
Private network ranges, subnet masks
URL patterns with embedded secrets

📱 Communication Services

Messaging: Twilio, SendGrid, Mailgun keys
Social Media: Twitter, Facebook, Instagram tokens
Email Services: Mailchimp, Postmark, SparkPost
SMS/Voice: Nexmo, Plivo, MessageBird

🛠️ Infrastructure & DevOps

Container Registries: Docker Hub, ECR, GCR tokens
CI/CD: Jenkins, GitLab CI, CircleCI tokens
Deployment: Vercel, Netlify, Heroku tokens
Monitoring: PagerDuty, Datadog, New Relic

🏢 Enterprise & Business

CRM: Salesforce, HubSpot tokens
E-commerce: Shopify, WooCommerce keys
Business Tools: Slack, Microsoft Teams tokens
Analytics: Google Analytics, Adobe Analytics

🎯 Generic Patterns

UUID v4, Generic IDs
Base64 encoded secrets
Hex-encoded keys (32, 64, 128 bit)
Custom secret patterns in configuration files

🔍 URL & Reference Patterns

URLs with embedded tokens
Database connection URIs
API endpoints with keys
Webhook URLs with secrets

💾 Version Control & Code

Git repository URLs with tokens
Package manager tokens (npm, PyPI)
Container registry credentials
Code hosting platform tokens

Pattern Accuracy Levels

Control detection sensitivity to trade detection coverage against false positives:

High Accuracy

Most specific patterns with minimal false positives
Examples: AWS access keys with AKIA prefix, specific API key formats
Best for production environments

Medium Accuracy (Default)

Balanced detection with reasonable false positive rates
Examples: Generic API keys, common secret patterns
Good for most use cases

Low Accuracy

Broadest detection, may have higher false positive rates
Examples: Generic IDs, loose pattern matching
Useful for comprehensive scanning

// Use high accuracy for production
const prodResult = mask(text, { matchAccuracy: 'high' });

// Use medium accuracy for development  
const devResult = mask(text, { matchAccuracy: 'medium' });

// Use low accuracy for comprehensive scanning
const scanResult = mask(text, { matchAccuracy: 'low' });
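One way to picture the three levels is as a threshold over per-pattern accuracy tags: `'high'` runs only the most specific patterns, `'medium'` adds the medium ones, and `'low'` runs everything. The sketch below assumes that model, and also assumes an untagged pattern counts as `'medium'`. `PatternDef`, `patternsFor`, and the three sample patterns are hypothetical and are not taken from the library's source.

```typescript
// Hypothetical model of accuracy thresholds; not the library's implementation.
type Accuracy = 'high' | 'medium' | 'low';

interface PatternDef {
  name: string;
  regex: RegExp;
  matchAccuracy?: Accuracy; // assumption: untagged patterns count as 'medium'
}

const accuracyRank: { [level: string]: number } = { low: 0, medium: 1, high: 2 };

// Keep every pattern whose own accuracy is at least the requested level.
function patternsFor(level: Accuracy, defs: PatternDef[]): PatternDef[] {
  return defs.filter((d) => accuracyRank[d.matchAccuracy || 'medium'] >= accuracyRank[level]);
}

// Illustrative patterns only.
const sampleDefs: PatternDef[] = [
  { name: 'awsAccessKey', regex: /\bAKIA[0-9A-Z]{16}\b/g, matchAccuracy: 'high' },
  { name: 'genericApiKey', regex: /\bapi[_-]?key\s*[:=]\s*\S+/gi },
  { name: 'genericId', regex: /\b[0-9a-f]{16,}\b/gi, matchAccuracy: 'low' }
];

console.log(patternsFor('high', sampleDefs).map((d) => d.name));   // [ 'awsAccessKey' ]
console.log(patternsFor('medium', sampleDefs).map((d) => d.name)); // [ 'awsAccessKey', 'genericApiKey' ]
console.log(patternsFor('low', sampleDefs).map((d) => d.name));    // all three
```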

TypeScript Support

Full TypeScript support with complete type definitions:

import { mask, hasSensitiveContent, getPatternMatches } from 'sensitive-data-masker';
import type { MaskResult, MaskingOptions } from 'sensitive-data-masker';

// Type-safe masking options
const options: MaskingOptions = {
  maskChar: '#',
  matchAccuracy: 'high',
  excludePatterns: ['uuid']
};

const result: MaskResult = mask(text, options);

Real-World Examples

Log File Sanitization

import { mask } from 'sensitive-data-masker';

const logEntry = `
[2024-01-15 10:30:45] INFO User user@example.com logged in
[2024-01-15 10:31:12] DEBUG API call with key sk-1234567890abcdef
[2024-01-15 10:31:15] ERROR Payment failed for card 4111-1111-1111-1111
[2024-01-15 10:31:20] WARN SSN in request: 123-45-6789
`;

const sanitized = mask(logEntry);
console.log(sanitized.output);
// [2024-01-15 10:30:45] INFO User **er@example.c** logged in
// [2024-01-15 10:31:12] DEBUG API call with key **-1234567890ab**
// [2024-01-15 10:31:15] ERROR Payment failed for card **11-1111-1111-11**
// [2024-01-15 10:31:20] WARN SSN in request: **3-45-67**

console.log(sanitized.found);
// { email: 1, openaiApiKey: 1, creditCard: 1, ssn: 1 }

Configuration File Security

const config = `
DATABASE_URL=postgresql://user:password123@localhost:5432/db
OPENAI_API_KEY=sk-1234567890abcdef1234567890abcdef
STRIPE_SECRET_KEY=sk_live_abcdef123456
ADMIN_EMAIL=admin@example.com
JWT_SECRET=super-secret-key-123
`;

const result = mask(config);
console.log(result.output);
// DATABASE_URL=postgresql://user:**ssword1**@localhost:5432/db
// OPENAI_API_KEY=**-1234567890abcdef1234567890ab**
// STRIPE_SECRET_KEY=**_live_abcdef12**
// ADMIN_EMAIL=**min@example.c**
// JWT_SECRET=**per-secret-key-1**

Multi-Environment Setup

import { mask } from 'sensitive-data-masker';

// Production: Mask everything with high accuracy
const prodResult = mask(sensitiveData, { matchAccuracy: 'high' });

// Development: Allow test emails but mask real API keys
const devResult = mask(sensitiveData, { 
  matchAccuracy: 'medium',
  excludePatterns: ['email'] 
});

// Testing: Only mask financial data
const testResult = mask(sensitiveData, { 
  onlyPatterns: ['creditCard', 'bankAccount', 'ssn'],
  matchAccuracy: 'high'
});
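The three configurations above could be collected behind a single lookup so call sites do not repeat them. This is a sketch: `optionsForEnv` and `MaskingOptionsSketch` are hypothetical names (the interface is a local copy of the documented option fields so the snippet stands alone), and falling back to the strictest settings for unknown environments is an assumption.

```typescript
// Local copy of the documented option fields so this sketch is self-contained.
interface MaskingOptionsSketch {
  maskChar?: string;
  preserveLength?: boolean;
  excludePatterns?: string[];
  onlyPatterns?: string[];
  matchAccuracy?: 'high' | 'medium' | 'low';
}

// Hypothetical helper: map an environment name to masking options.
function optionsForEnv(env: string | undefined): MaskingOptionsSketch {
  switch (env) {
    case 'development':
      return { matchAccuracy: 'medium', excludePatterns: ['email'] };
    case 'test':
      return { matchAccuracy: 'high', onlyPatterns: ['creditCard', 'bankAccount', 'ssn'] };
    default:
      // Unknown or missing environment: treat it like production.
      return { matchAccuracy: 'high' };
  }
}

console.log(optionsForEnv('development'));
// { matchAccuracy: 'medium', excludePatterns: [ 'email' ] }

// Intended use with the library (not executed here):
//   mask(sensitiveData, optionsForEnv(process.env.NODE_ENV));
```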

Data Pipeline Processing

import { hasSensitiveContent, mask } from 'sensitive-data-masker';

// Check if data needs processing
function processBatch(records: string[]) {
  const results = records.map(record => {
    if (hasSensitiveContent(record)) {
      const masked = mask(record, { matchAccuracy: 'high' });
      return {
        data: masked.output,
        hadSensitiveData: true,
        patternsFound: Object.keys(masked.found)
      };
    }
    return { data: record, hadSensitiveData: false };
  });
  
  return results;
}

Performance Considerations

Optimized Regex Engine: Patterns are compiled and cached on first use
Single-Pass Processing: Efficient string traversal with minimal overhead
Memory Efficient: No unnecessary string copies or allocations
Pattern Filtering: Use onlyPatterns when you know which types to look for
Accuracy Optimization: Higher accuracy modes are faster due to more specific patterns

// Optimize for specific use cases
const emailsOnly = mask(text, { onlyPatterns: ['email'] }); // Faster
const highAccuracy = mask(text, { matchAccuracy: 'high' }); // Faster, fewer false positives
const comprehensive = mask(text, { matchAccuracy: 'low' }); // Slower, more thorough
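"Compiled and cached on first use" can be pictured as a lazy lookup table keyed by pattern source and flags. The sketch below shows that general technique. `getRegex`, `compiledCache`, and `compileCount` are hypothetical names, and nothing here is taken from the library's internals.

```typescript
// Hypothetical sketch of lazy regex caching; not the library's implementation.
const compiledCache: { [key: string]: RegExp } = {};
let compileCount = 0;

function getRegex(source: string, flags: string = 'g'): RegExp {
  const key = flags + ':' + source;
  let re = compiledCache[key];
  if (!re) {
    // First request for this pattern: compile once and remember it.
    re = new RegExp(source, flags);
    compiledCache[key] = re;
    compileCount++;
  }
  // Global regexes remember where the last search stopped, so reset before reuse.
  re.lastIndex = 0;
  return re;
}

const first = getRegex('\\b\\d{3}-\\d{2}-\\d{4}\\b');
const second = getRegex('\\b\\d{3}-\\d{2}-\\d{4}\\b');
console.log(first === second); // true
console.log(compileCount);     // 1
```

Resetting `lastIndex` matters because a shared regex with the `g` flag would otherwise resume from wherever the previous caller left off.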

Security Best Practices

Always mask before logging: Ensure sensitive data is masked before writing to logs
Use appropriate accuracy: Higher accuracy for production, lower for development/testing
Store results securely: The matches array contains original sensitive values
Regular updates: Keep the library updated for new pattern definitions
Test your patterns: Verify masking works correctly with your specific data formats
Environment-specific config: Use different settings for dev/staging/production
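Because `matches` holds the original values, one defensive habit is to drop that field before a result object goes anywhere near a logger. A minimal sketch follows. `toLoggable` is a hypothetical helper, `MaskResultSketch` is a local copy of the documented `MaskResult` shape, and the sample object is hand-written rather than produced by the library.

```typescript
// Local copy of the documented result shape so this sketch is self-contained.
interface MaskResultSketch {
  output: string;
  found: { [name: string]: number };
  matches: string[]; // original sensitive values
  masked: string[];
}

// Hypothetical helper: keep only the fields that are safe to log.
function toLoggable(result: MaskResultSketch): {
  output: string;
  found: { [name: string]: number };
  masked: string[];
} {
  return { output: result.output, found: result.found, masked: result.masked };
}

// Hand-written sample, not real library output.
const sample: MaskResultSketch = {
  output: 'SSN in request: **3-45-67**',
  found: { ssn: 1 },
  matches: ['123-45-6789'],
  masked: ['**3-45-67**']
};

console.log(JSON.stringify(toLoggable(sample)));
// {"output":"SSN in request: **3-45-67**","found":{"ssn":1},"masked":["**3-45-67**"]}
```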

Development

Prerequisites

Node.js >= 18.12.0
Yarn or npm

Setup

git clone https://github.com/bgauryy/sensitive-data-mask.git
cd sensitive-data-mask
yarn install

Commands

yarn build          # Build the library
yarn dev           # Build in watch mode
yarn lint          # Run ESLint
yarn test          # Run tests
yarn typecheck     # Run TypeScript compiler checks

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Adding New Patterns

Choose the appropriate category file in src/regexes/
Add your pattern following the existing structure:

{
  name: 'myPattern',
  regex: /your-regex-here/gi,
  description: 'Description of what this detects',
  matchAccuracy: 'medium' // optional: 'high', 'medium', or 'low'
}

Run tests to ensure no regressions
Submit a PR with a clear description
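Before opening a PR, it can help to check a new pattern against a few strings that should and should not match. The sketch below uses the pattern structure shown above with a made-up `ACME-` token format. `PatternDefinition`, `matchesAll`, and `matchesNone` are hypothetical names, and the project's real test setup may look different.

```typescript
// Hypothetical pattern and checks for illustration only.
interface PatternDefinition {
  name: string;
  regex: RegExp;
  description: string;
  matchAccuracy?: 'high' | 'medium' | 'low';
}

const myPattern: PatternDefinition = {
  name: 'myPattern',
  regex: /\bACME-[0-9A-F]{8}\b/gi,
  description: 'Made-up ACME token: "ACME-" followed by 8 hex characters',
  matchAccuracy: 'high'
};

// A regex with the g flag keeps state in lastIndex between test() calls,
// so reset it before every check to avoid false negatives.
function matchesAll(def: PatternDefinition, samples: string[]): boolean {
  return samples.every((s) => {
    def.regex.lastIndex = 0;
    return def.regex.test(s);
  });
}

function matchesNone(def: PatternDefinition, samples: string[]): boolean {
  return samples.every((s) => {
    def.regex.lastIndex = 0;
    return !def.regex.test(s);
  });
}

console.log(matchesAll(myPattern, ['token ACME-12AB34CD here', 'acme-deadbeef']));  // true
console.log(matchesNone(myPattern, ['ACME-123', 'ACME-12AB34CD9', 'hello world'])); // true
```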

License

MIT © guybary

Security

If you discover a security vulnerability, please email [email protected] instead of using the issue tracker.

Made with ❤️ for developers who care about data security