ml-cache

v1.2.0

Published

18 days ago

SDK to collect and store business/product events for future ML training. Store now, train later when AI becomes affordable.


nicolasmondain

ml machine-learning data-collection analytics events s3 glacier aws cold-storage future-proof ai-training data-lake

ml-cache

Store your business data today. Train your AI models tomorrow.

The Problem

Machine learning is transforming every industry, but there's a catch: you need massive amounts of quality data to train effective models. Companies that start collecting data today will have a significant competitive advantage when:

ML training costs continue to drop exponentially
Your business grows and you need personalized AI features
You want to build recommendation engines, fraud detection, or predictive analytics
Custom models become essential for differentiation

The data you're generating right now is invaluable for future AI/ML applications. Don't let it slip away.

The Solution

ml-cache is a lightweight TypeScript SDK that captures your business events and stores them in Amazon S3 or S3 Glacier, some of the lowest-cost cold storage AWS offers. It's designed with a simple philosophy:

Collect everything now. Pay almost nothing. Train models when ready.

Why Cold Storage?

| Storage Type | Cost per TB/month | Retrieval |
| ------------------------ | ----------------- | ---------------- |
| S3 Standard | ~$23 | Instant |
| S3 Glacier | ~$4 | Minutes to hours |
| S3 Glacier Deep Archive | ~$1 | 12-48 hours |

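For ML training data that you won't access for months or years, cold storage is far cheaper than standard storage: at the approximate prices above, S3 Glacier costs about 6x less than S3 Standard, and S3 Glacier Deep Archive about 20x less.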
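To make the comparison concrete, the sketch below estimates monthly storage cost from the rounded per-TB prices in the table above. The prices are illustrative rather than live AWS pricing, request and retrieval fees are ignored, and `monthlyCostUsd` and `savingsFactor` are hypothetical helpers, not part of ml-cache.

```typescript
// Approximate storage prices in USD per TB per month, copied from the table above.
// Rounded illustrations only, not live AWS pricing.
const PRICE_PER_TB_MONTH = {
  STANDARD: 23,
  GLACIER: 4,
  DEEP_ARCHIVE: 1,
} as const;

type StorageTier = keyof typeof PRICE_PER_TB_MONTH;

// Hypothetical helper: storage-only cost, ignoring request and retrieval fees.
function monthlyCostUsd(terabytes: number, tier: StorageTier): number {
  return terabytes * PRICE_PER_TB_MONTH[tier];
}

// How many times cheaper a tier is than S3 Standard.
function savingsFactor(tier: StorageTier): number {
  return PRICE_PER_TB_MONTH.STANDARD / PRICE_PER_TB_MONTH[tier];
}

console.log(monthlyCostUsd(10, 'STANDARD'));     // 230
console.log(monthlyCostUsd(10, 'DEEP_ARCHIVE')); // 10
console.log(savingsFactor('GLACIER'));           // 5.75
```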

Features

Simple API: One method, cache(), handles all your data
Automatic Batching: Groups events to minimize API calls
Smart Retry Logic: Exponential backoff retries transient failures; events that still fail are reported through onError and onFlush
Type-Safe: Full TypeScript support with comprehensive type definitions
Flexible Storage: S3 Standard, Glacier, or Glacier Deep Archive
Rich Context: Capture user, device, page, and campaign data
Zero Dependencies on Analytics: Direct AWS integration, no middlemen
Production Ready: Battle-tested error handling and graceful shutdown
Backend Only: Designed for Node.js server-side applications
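The batching feature can be pictured as a small size-based queue. This is a sketch of the general technique, not ml-cache's internal implementation: `EventBatcher` and its methods are hypothetical names, and the time-based flush is left out for brevity.

```typescript
// Minimal size-based batcher: collects items and emits a batch once maxSize is reached.
// Hypothetical illustration; ml-cache's real batching logic may differ.
class EventBatcher<T> {
  private queue: T[] = [];
  private readonly maxSize: number;
  private readonly onBatch: (batch: T[]) => void;

  constructor(maxSize: number, onBatch: (batch: T[]) => void) {
    this.maxSize = maxSize;
    this.onBatch = onBatch;
  }

  // Queue one item and emit a batch when the queue is full.
  add(item: T): void {
    this.queue.push(item);
    if (this.queue.length >= this.maxSize) {
      this.flush();
    }
  }

  // Emit whatever is queued (for example on shutdown or from a timer).
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.onBatch(batch);
  }

  get size(): number {
    return this.queue.length;
  }
}

const batchSizes: number[] = [];
const batcher = new EventBatcher<{ id: number }>(2, (batch) => {
  batchSizes.push(batch.length);
});
for (let id = 1; id <= 5; id++) batcher.add({ id });
batcher.flush(); // flush the one remaining event
console.log(batchSizes); // [ 2, 2, 1 ]
```

Grouping events this way means five events cost three uploads instead of five; a larger batch size widens that gap.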

Platform Support

Important: This SDK is designed for backend/server-side use only (Node.js 18+).

It is not compatible with:

Browser environments
Edge runtimes (Cloudflare Workers, Vercel Edge)
React Native or mobile apps

The SDK requires Node.js APIs (crypto, Buffer) and direct AWS SDK access, which are not available in browser or edge environments.

Installation

npm install ml-cache

yarn add ml-cache

pnpm add ml-cache

Quick Start

import { MLCacheClient } from 'ml-cache';

// Initialize the client
const mlCache = new MLCacheClient({
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
  },
  s3: {
    bucket: 'my-ml-data-lake',
    region: 'us-east-1',
    storageClass: 'GLACIER', // Cost-effective cold storage
  },
  storageMode: 'S3',
  sourceApp: 'my-webapp',
  environment: 'production',
});

// Cache business data
await mlCache.cache({
  data: {
    productId: 'SKU-12345',
    productName: 'Premium Widget',
    price: 99.99,
    currency: 'USD',
    quantity: 2,
  },
  context: {
    user: {
      userId: 'user-789',
      traits: {
        plan: 'premium',
        signupDate: '2024-01-15',
      },
    },
  },
});

// Graceful shutdown (flushes remaining events)
await mlCache.shutdown();
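The Quick Start uses non-null assertions (`!`) on the environment variables. These satisfy the type checker but still pass `undefined` at runtime when a variable is missing. A small guard can fail fast instead. `requireEnv` below is a hypothetical helper, not part of ml-cache.

```typescript
// Hypothetical helper: read a required variable or throw a clear error.
// The environment map is a parameter so the function is easy to test;
// pass process.env in real code.
function requireEnv(env: Record<string, string | undefined>, name: string): string {
  const value = env[name];
  if (value === undefined || value === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Example with a stand-in environment object.
const exampleEnv: Record<string, string | undefined> = {
  AWS_ACCESS_KEY_ID: 'AKIAEXAMPLE',
  AWS_SECRET_ACCESS_KEY: '',
};
console.log(requireEnv(exampleEnv, 'AWS_ACCESS_KEY_ID')); // AKIAEXAMPLE
```

In the Quick Start, `process.env.AWS_ACCESS_KEY_ID!` would become `requireEnv(process.env, 'AWS_ACCESS_KEY_ID')`.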

Configuration

Full Configuration Options

import { MLCacheClient, type MLCacheConfig } from 'ml-cache';

const config: MLCacheConfig = {
  // Required: AWS Credentials
  credentials: {
    accessKeyId: 'AKIA...',
    secretAccessKey: '...',
    sessionToken: '...', // Optional: for temporary credentials
  },

  // S3 Configuration (required for S3 or S3_TO_GLACIER mode)
  s3: {
    bucket: 'my-ml-data-bucket',
    region: 'us-east-1',
    prefix: 'events/', // Optional: folder prefix for objects
    storageClass: 'GLACIER', // STANDARD, GLACIER, DEEP_ARCHIVE, etc.
  },

  // Glacier Configuration (required for GLACIER mode)
  glacier: {
    vaultName: 'my-ml-vault',
    region: 'us-east-1',
    accountId: '-', // Optional: defaults to current account
  },

  // Storage mode
  storageMode: 'S3', // 'S3' | 'GLACIER' | 'S3_TO_GLACIER'

  // Batching configuration
  batch: {
    enabled: true, // Enable event batching
    maxSize: 100, // Max events per batch
    maxWaitMs: 30000, // Flush every 30 seconds
  },

  // Retry configuration
  retry: {
    maxRetries: 3,
    initialDelayMs: 1000,
    maxDelayMs: 30000,
    exponentialBackoff: true,
  },

  // Logging configuration
  log: {
    level: 'info', // 'debug' | 'info' | 'warn' | 'error' | 'silent'
    enabled: true,
    customLogger: (level, message, data) => {
      // Your custom logging logic
    },
  },

  // Metadata
  sourceApp: 'my-application',
  environment: 'production',
  debug: false,
};

const client = new MLCacheClient(config);
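To see how the retry options interact, the sketch below computes the wait before each retry. It assumes the delay doubles on every attempt and is capped at `maxDelayMs`; the SDK's actual schedule is not documented here and may differ, for instance by adding jitter. `retryDelays` is a hypothetical helper.

```typescript
// Same field names as the retry section of the configuration above.
interface RetryOptions {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  exponentialBackoff: boolean;
}

// Delay in milliseconds before each retry attempt.
// Assumption: the delay doubles per attempt and never exceeds maxDelayMs.
function retryDelays(options: RetryOptions): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < options.maxRetries; attempt++) {
    const raw = options.exponentialBackoff
      ? options.initialDelayMs * 2 ** attempt
      : options.initialDelayMs;
    delays.push(Math.min(raw, options.maxDelayMs));
  }
  return delays;
}

const defaults: RetryOptions = {
  maxRetries: 3,
  initialDelayMs: 1000,
  maxDelayMs: 30000,
  exponentialBackoff: true,
};
console.log(retryDelays(defaults)); // [ 1000, 2000, 4000 ]
```

Under these assumptions the cap only matters with more retries: the sixth attempt would otherwise wait 32000 ms and is held at 30000 ms.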

Storage Classes

Choose the right storage class for your needs:

| Storage Class | Use Case | Retrieval Time |
| -------------- | ------------------------------ | -------------- |
| STANDARD | Frequent access, testing | Instant |
| STANDARD_IA | Infrequent access | Instant |
| GLACIER | Recommended for ML data | 1-5 minutes (expedited) to several hours (standard) |
| DEEP_ARCHIVE | Rarely accessed, lowest cost | 12-48 hours |

Caching Data

Basic Usage

Cache any business data with rich context:

await mlCache.cache({
  data: {
    orderId: 'ORD-123456',
    total: 299.99,
    items: [
      { sku: 'WIDGET-A', quantity: 2, price: 49.99 },
      { sku: 'WIDGET-B', quantity: 1, price: 199.99 },
    ],
    paymentMethod: 'credit_card',
    shippingMethod: 'express',
  },
  context: {
    user: { userId: 'user-123' },
    campaign: {
      source: 'google',
      medium: 'cpc',
      name: 'summer_sale',
    },
  },
});

Event Context

Enrich events with contextual data:

await mlCache.cache({
  data: {
    action: 'feature_used',
    feature: 'dark_mode',
  },
  context: {
    // User context
    user: {
      userId: 'user-123',
      anonymousId: 'anon-456',
      traits: {
        plan: 'pro',
        role: 'admin',
      },
    },

    // Device context
    device: {
      userAgent: 'Mozilla/5.0...',
      deviceType: 'desktop',
      os: 'macOS',
      browser: 'Chrome',
      screenResolution: '1920x1080',
      locale: 'en-US',
      timezone: 'America/New_York',
    },

    // Page context
    page: {
      url: 'https://example.com/settings',
      path: '/settings',
      title: 'Settings',
      referrer: 'https://example.com/home',
    },

    // Campaign/UTM context
    campaign: {
      source: 'newsletter',
      medium: 'email',
      name: 'weekly_digest',
      content: 'cta_button',
    },

    // App context
    app: {
      name: 'MyApp',
      version: '2.1.0',
      build: '456',
    },

    // Custom context
    custom: {
      experimentId: 'exp-123',
      variant: 'B',
    },
  },
});
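Because the SDK runs on the server, campaign fields usually have to be derived from the request. One possible approach is to read the standard UTM query parameters from the landing URL. `campaignFromUrl` is a hypothetical helper, and mapping `utm_campaign` to `name` is an assumption based on the field names in the example above.

```typescript
// Campaign context shape, using the field names from the example above.
interface CampaignContext {
  source?: string;
  medium?: string;
  name?: string;
  content?: string;
}

// Hypothetical helper: map UTM query parameters onto the campaign context.
// Parameters that are absent or empty are left out of the result.
function campaignFromUrl(rawUrl: string): CampaignContext {
  const params = new URL(rawUrl).searchParams;
  const campaign: CampaignContext = {};

  const source = params.get('utm_source');
  const medium = params.get('utm_medium');
  const name = params.get('utm_campaign');
  const content = params.get('utm_content');

  if (source) campaign.source = source;
  if (medium) campaign.medium = medium;
  if (name) campaign.name = name;
  if (content) campaign.content = content;
  return campaign;
}

const landingUrl =
  'https://example.com/?utm_source=newsletter&utm_medium=email&utm_campaign=weekly_digest';
console.log(campaignFromUrl(landingUrl));
// { source: 'newsletter', medium: 'email', name: 'weekly_digest' }
```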

Callbacks & Monitoring

// Monitor all cached events
mlCache.onEvent((event) => {
  console.log('Event cached:', event.eventId);
});

// Handle errors
mlCache.onError((error, event) => {
  console.error('Failed to store event:', error.message);
  // Optionally: send to error tracking service
});

// Monitor flushes
mlCache.onFlush((result) => {
  console.log(`Flushed ${result.eventCount} events`);
  if (result.failedEventIds.length > 0) {
    console.warn('Failed events:', result.failedEventIds);
  }
});

// Health check
const health = await mlCache.getHealth();
console.log('SDK Health:', health);
// {
//   healthy: true,
//   s3Connected: true,
//   glacierConnected: false,
//   queueSize: 5,
//   lastFlush: '2024-01-15T10:30:00.000Z',
// }
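The health object can feed a readiness probe. The sketch below maps it to an HTTP status code. The `HealthStatus` shape is inferred from the sample output above, and both `healthToHttpStatus` and the queue threshold are hypothetical choices rather than part of the SDK.

```typescript
// Shape inferred from the sample getHealth() output above.
interface HealthStatus {
  healthy: boolean;
  s3Connected: boolean;
  glacierConnected: boolean;
  queueSize: number;
  lastFlush?: string;
}

// Hypothetical mapping for a readiness endpoint:
// 200 when the SDK is healthy and the queue is within bounds, otherwise 503.
function healthToHttpStatus(health: HealthStatus, maxQueueSize = 5000): number {
  return health.healthy && health.queueSize <= maxQueueSize ? 200 : 503;
}

const sampleHealth: HealthStatus = {
  healthy: true,
  s3Connected: true,
  glacierConnected: false,
  queueSize: 5,
  lastFlush: '2024-01-15T10:30:00.000Z',
};
console.log(healthToHttpStatus(sampleHealth)); // 200
```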

Data Format

Events are stored in NDJSON (Newline Delimited JSON) format, perfect for:

Apache Spark — Native NDJSON support
AWS Athena — Query directly with SQL
Pandas — pd.read_json(file, lines=True)
Any ML pipeline — Simple line-by-line parsing
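NDJSON is simple enough to read and write without a library. The sketch below shows a round trip; `toNdjson` and `fromNdjson` are hypothetical helper names, not SDK exports.

```typescript
// Serialize records as NDJSON: one JSON document per line, each ending in a newline.
// JSON.stringify escapes newlines inside strings, so every record stays on one line.
function toNdjson(records: object[]): string {
  return records.map((record) => JSON.stringify(record) + '\n').join('');
}

// Parse NDJSON back into objects, skipping blank lines.
function fromNdjson(text: string): unknown[] {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

const sampleEvents = [
  { eventId: 'a', data: { amount: 99.99 } },
  { eventId: 'b', data: { note: 'line one\nline two' } },
];
const ndjson = toNdjson(sampleEvents);
console.log(fromNdjson(ndjson).length); // 2
```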

S3 Object Structure

s3://my-bucket/ml-cache-events/
├── 2024/
│   ├── 01/
│   │   ├── 15/
│   │   │   ├── 10/
│   │   │   │   ├── batch_1705312200_a1b2c3d4.ndjson
│   │   │   │   └── batch_1705312500_e5f6g7h8.ndjson
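A key in this layout could be built as in the sketch below. It assumes the date partitions are in UTC and that the number in the file name is an epoch timestamp in seconds; neither is confirmed by the layout itself, and `buildObjectKey` is a hypothetical helper.

```typescript
// Left-pad a number to two digits, e.g. 1 -> "01".
function pad2(value: number): string {
  return String(value).padStart(2, '0');
}

// Hypothetical key builder mirroring the layout above:
// <prefix>YYYY/MM/DD/HH/batch_<epochSeconds>_<suffix>.ndjson
// Assumes UTC partitions; the SDK's exact scheme may differ.
function buildObjectKey(prefix: string, createdAt: Date, suffix: string): string {
  const partition = [
    String(createdAt.getUTCFullYear()),
    pad2(createdAt.getUTCMonth() + 1),
    pad2(createdAt.getUTCDate()),
    pad2(createdAt.getUTCHours()),
  ].join('/');
  const epochSeconds = Math.floor(createdAt.getTime() / 1000);
  return `${prefix}${partition}/batch_${epochSeconds}_${suffix}.ndjson`;
}

const exampleKey = buildObjectKey(
  'ml-cache-events/',
  new Date('2024-01-15T10:30:00.000Z'),
  'a1b2c3d4',
);
console.log(exampleKey);
// ml-cache-events/2024/01/15/10/batch_1705314600_a1b2c3d4.ndjson
```

Hour-level partitions like these let query engines such as Athena skip whole prefixes when a training job only needs a date range.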

Event Schema

{
  "eventId": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "data": {
    "productId": "SKU-123",
    "amount": 99.99
  },
  "context": {
    "user": { "userId": "user-456" }
  },
  "metadata": {
    "sdkVersion": "1.0.0",
    "sourceApp": "my-app",
    "environment": "production",
    "batchId": "batch_1705312200_a1b2c3d4"
  }
}

AWS Setup

IAM Policy

Create an IAM policy with minimal required permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

For Glacier mode, add:

{
  "Effect": "Allow",
  "Action": ["glacier:UploadArchive", "glacier:DescribeVault"],
  "Resource": "arn:aws:glacier:*:*:vaults/your-vault-name"
}

S3 Lifecycle Policy (Optional)

Automatically transition data to deeper cold storage:

{
  "Rules": [
    {
      "ID": "MLDataLifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "ml-cache-events/" },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ]
    }
  ]
}
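The effect of the transitions can be checked with a small model. The sketch below returns the storage class an object would be in at a given age, assuming it was uploaded as `STANDARD` unless stated otherwise. Real S3 lifecycle transitions run asynchronously, so actual timing is approximate. `storageClassAtAge` is a hypothetical helper.

```typescript
interface Transition {
  days: number;
  storageClass: string;
}

// Storage class of an object after ageDays, given lifecycle transitions.
// Assumption: the object starts in initialClass and each transition applies
// as soon as the object's age reaches its day count.
function storageClassAtAge(
  ageDays: number,
  transitions: Transition[],
  initialClass = 'STANDARD',
): string {
  let current = initialClass;
  const ordered = [...transitions].sort((a, b) => a.days - b.days);
  for (const transition of ordered) {
    if (ageDays >= transition.days) current = transition.storageClass;
  }
  return current;
}

// Transitions from the policy above.
const policyTransitions: Transition[] = [
  { days: 90, storageClass: 'GLACIER' },
  { days: 365, storageClass: 'DEEP_ARCHIVE' },
];
console.log(storageClassAtAge(30, policyTransitions));  // STANDARD
console.log(storageClassAtAge(120, policyTransitions)); // GLACIER
console.log(storageClassAtAge(400, policyTransitions)); // DEEP_ARCHIVE
```

If objects are already uploaded with `storageClass: 'GLACIER'`, as in the Quick Start, the 90-day transition has no effect and only the move to `DEEP_ARCHIVE` matters.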

Best Practices

1. Capture Rich Context

The more context you capture now, the better your models will be:

// Good: Rich context for future ML
await mlCache.cache({
  data: {
    action: 'product_viewed',
    productId: 'SKU-123',
    category: 'electronics',
    price: 299.99,
    inStock: true,
    viewDuration: 45,
    scrollDepth: 0.8,
  },
  context: {
    user: { userId: 'user-456', traits: { segment: 'high-value' } },
    page: { referrer: 'google.com' },
    device: { deviceType: 'mobile', os: 'iOS' },
    custom: { searchQuery: 'best headphones' },
  },
});

2. Graceful Shutdown

Always flush events before application exit:

process.on('SIGTERM', async () => {
  try {
    await mlCache.shutdown();
  } catch (error) {
    console.error('Flush on shutdown failed:', error);
    process.exitCode = 1;
  }
  process.exit();
});

3. Monitor Queue Size

Prevent memory issues in high-traffic scenarios:

setInterval(() => {
  const queueSize = mlCache.getQueueSize();
  if (queueSize > 5000) {
    console.warn(`Queue size high: ${queueSize}`);
  }
}, 60000);

Future ML Use Cases

The data you collect today can power tomorrow's AI features:

| Data Type | Future ML Application |
| --------------------- | ---------------------------------------------- |
| Purchase data | Recommendation engine, demand forecasting |
| Page views | Content personalization, A/B test analysis |
| Search queries | Search ranking, query understanding |
| Support interactions | Automated responses, sentiment analysis |
| User behavior | Churn prediction, engagement scoring |
| Product interactions | Dynamic pricing, inventory optimization |

API Reference

MLCacheClient

| Method | Description |
| ------------------- | ---------------------------------- |
| cache(event) | Cache data for ML training |
| flush() | Manually flush the event queue |
| getHealth() | Get SDK health status |
| getQueueSize() | Get current queue size |
| getVersion() | Get SDK version |
| shutdown() | Gracefully shutdown the client |
| onEvent(callback) | Register event callback |
| onError(callback) | Register error callback |
| onFlush(callback) | Register flush callback |

Event Structure

interface MLCacheEvent {
  // Auto-generated if not provided
  eventId?: string;
  timestamp?: string;

  // Your business data
  data?: Record<string, unknown>;

  // Rich context
  context?: {
    user?: { userId?: string; anonymousId?: string; traits?: Record<string, unknown> };
    device?: { userAgent?: string; deviceType?: string; os?: string; /* ... */ };
    page?: { url?: string; path?: string; title?: string; referrer?: string };
    campaign?: { source?: string; medium?: string; name?: string; /* ... */ };
    app?: { name?: string; version?: string; build?: string };
    custom?: Record<string, unknown>;
  };
}
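The "auto-generated if not provided" behaviour for `eventId` and `timestamp` can be sketched as a small normalizer. This illustrates the idea and is not the SDK's code: `withDefaults` is a hypothetical name, and the ID generator and clock are passed in so the function stays deterministic.

```typescript
// Minimal subset of the event shape above, enough for this sketch.
interface EventInput {
  eventId?: string;
  timestamp?: string;
  data?: Record<string, unknown>;
}

type NormalizedEvent = EventInput & { eventId: string; timestamp: string };

// Hypothetical normalizer: keep caller-supplied values, fill in the rest.
function withDefaults(
  event: EventInput,
  generateId: () => string,
  now: Date,
): NormalizedEvent {
  return {
    ...event,
    eventId: event.eventId ?? generateId(),
    timestamp: event.timestamp ?? now.toISOString(),
  };
}

const fixedNow = new Date('2024-01-15T10:30:00.000Z');
const normalized = withDefaults({ data: { amount: 99.99 } }, () => 'generated-id', fixedNow);
console.log(normalized.eventId);   // generated-id
console.log(normalized.timestamp); // 2024-01-15T10:30:00.000Z
```

In a Node.js service, `randomUUID` from `node:crypto` would be a natural choice for `generateId`.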

License

MIT

Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests to the GitHub repository.
