ml-cache
v1.2.0
SDK to collect and store business/product events for future ML training. Store now, train later when AI becomes affordable.
ml-cache
Store your business data today. Train your AI models tomorrow.
The Problem
Machine learning is transforming every industry, but there's a catch: you need massive amounts of quality data to train effective models. Companies that start collecting data today will have a significant competitive advantage when:
- ML training costs continue to drop exponentially
- Your business grows and you need personalized AI features
- You want to build recommendation engines, fraud detection, or predictive analytics
- Custom models become essential for differentiation
The data you're generating right now is invaluable for future AI/ML applications. Don't let it slip away.
The Solution
ml-cache is a lightweight TypeScript SDK that captures your business events and stores them in Amazon S3 Glacier, one of the most cost-effective cold storage options available. It's designed with a simple philosophy:
Collect everything now. Pay almost nothing. Train models when ready.
Why Cold Storage?
| Storage Type | Cost per TB/month | Retrieval |
| ------------------------ | ----------------- | ---------------- |
| S3 Standard | ~$23 | Instant |
| S3 Glacier | ~$4 | Minutes to hours |
| S3 Glacier Deep Archive | ~$1 | 12-48 hours |
For ML training data that you'll access months or years from now, cold storage is roughly 6x cheaper (Glacier) to more than 20x cheaper (Glacier Deep Archive) than standard storage.
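To make the comparison concrete, the sketch below computes storage-only cost from the approximate per-TB prices in the table above. The prices are the rounded figures quoted in this README, not live AWS pricing, and the names `PRICE_PER_TB_MONTH`, `storageCostUsd`, and `savingsFactor` are illustrative and not part of ml-cache. Retrieval and request fees are ignored.

```typescript
// Approximate storage prices in USD per TB per month, copied from the table above.
// These are rounded illustrations, not live AWS pricing.
const PRICE_PER_TB_MONTH = {
  STANDARD: 23,
  GLACIER: 4,
  DEEP_ARCHIVE: 1,
} as const;

type StorageTier = keyof typeof PRICE_PER_TB_MONTH;

// Storage-only cost of keeping `terabytes` of data for `months` months.
// Retrieval, request, and minimum-duration charges are deliberately ignored.
function storageCostUsd(tier: StorageTier, terabytes: number, months: number): number {
  return PRICE_PER_TB_MONTH[tier] * terabytes * months;
}

// How many times cheaper `tier` is than S3 Standard, per TB stored.
function savingsFactor(tier: StorageTier): number {
  return PRICE_PER_TB_MONTH.STANDARD / PRICE_PER_TB_MONTH[tier];
}

// Example: 5 TB of events kept for 3 years in each tier.
const retentionMonths = 36;
for (const tier of Object.keys(PRICE_PER_TB_MONTH) as StorageTier[]) {
  console.log(`${tier}: $${storageCostUsd(tier, 5, retentionMonths)}`);
}
```

With these figures the savings factor is 5.75 for Glacier and 23 for Deep Archive, which is where the "6x to 20x" range comes from.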
Features
- Simple API — One method to cache all your data: cache()
- Automatic Batching — Efficiently groups events to minimize API calls
- Smart Retry Logic — Exponential backoff ensures no data loss
- Type-Safe — Full TypeScript support with comprehensive type definitions
- Flexible Storage — S3 Standard, Glacier, or Glacier Deep Archive
- Rich Context — Capture user, device, page, and campaign data
- Zero Dependencies on Analytics — Direct AWS integration, no middlemen
- Production Ready — Battle-tested error handling and graceful shutdown
- Backend Only — Designed for Node.js server-side applications
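The retry feature above mentions exponential backoff. As a minimal sketch of that idea, assuming delays double from `initialDelayMs` up to a `maxDelayMs` cap (option names borrowed from the configuration section of this README), a retry loop could look like the following. The helpers `backoffDelayMs` and `withRetry` are hypothetical and do not describe the SDK's internals.

```typescript
// Hypothetical shape mirroring the retry options named in this README.
interface RetryOptions {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
}

// Delay before retry number `attempt` (0-based): doubles each time, capped at maxDelayMs.
function backoffDelayMs(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.initialDelayMs * 2 ** attempt, opts.maxDelayMs);
}

// Run `operation`, retrying failed attempts with exponential backoff.
// `sleep` is injectable so the logic can be exercised without real waiting.
async function withRetry<T>(
  operation: () => Promise<T>,
  opts: RetryOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait only if another attempt will follow.
      if (attempt < opts.maxRetries) {
        await sleep(backoffDelayMs(attempt, opts));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

With `initialDelayMs: 1000` and `maxDelayMs: 30000`, the waits are 1000 ms, 2000 ms, 4000 ms, and so on, never exceeding 30000 ms. Production implementations often add random jitter to avoid synchronized retries; that is omitted here for brevity.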
Platform Support
Important: This SDK is designed for backend/server-side use only (Node.js 18+).
It is not compatible with:
- Browser environments
- Edge runtimes (Cloudflare Workers, Vercel Edge)
- React Native or mobile apps
The SDK requires Node.js APIs (crypto, Buffer) and direct AWS SDK access, which are not available in browser or edge environments.
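Because the SDK targets Node.js 18+ only, an application may want to fail fast when started on an unsupported runtime. The following is a sketch of such a guard; `nodeMajorVersion`, `assertSupportedRuntime`, and `detectNodeVersion` are hypothetical helpers, not part of ml-cache. Some edge runtimes emulate parts of `process`, so this check is a heuristic rather than a guarantee.

```typescript
// Parse the major version from a Node.js version string such as "v18.19.0" or "20.11.1".
function nodeMajorVersion(version: string): number {
  const match = /^v?(\d+)\./.exec(version);
  if (!match) {
    throw new Error(`Unrecognized Node.js version string: ${version}`);
  }
  return Number(match[1]);
}

// Throw a descriptive error unless `version` is a Node.js version >= minMajor.
function assertSupportedRuntime(version: string | undefined, minMajor = 18): void {
  if (version === undefined) {
    throw new Error("No Node.js runtime detected; browser and edge runtimes are not supported.");
  }
  const major = nodeMajorVersion(version);
  if (major < minMajor) {
    throw new Error(`Node.js ${minMajor}+ is required, but found ${version}.`);
  }
}

// Read the Node.js version without assuming Node type definitions are installed.
function detectNodeVersion(): string | undefined {
  const runtime = globalThis as unknown as {
    process?: { versions?: { node?: string } };
  };
  return runtime.process?.versions?.node;
}
```

Calling `assertSupportedRuntime(detectNodeVersion())` once at startup, before constructing the client, turns an obscure runtime failure into a clear error message.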
Installation
npm install ml-cache
yarn add ml-cache
pnpm add ml-cache
Quick Start
import { MLCacheClient } from 'ml-cache';
// Initialize the client
const mlCache = new MLCacheClient({
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
  },
  s3: {
    bucket: 'my-ml-data-lake',
    region: 'us-east-1',
    storageClass: 'GLACIER', // Cost-effective cold storage
  },
  storageMode: 'S3',
  sourceApp: 'my-webapp',
  environment: 'production',
});
// Cache business data
await mlCache.cache({
  data: {
    productId: 'SKU-12345',
    productName: 'Premium Widget',
    price: 99.99,
    currency: 'USD',
    quantity: 2,
  },
  context: {
    user: {
      userId: 'user-789',
      traits: {
        plan: 'premium',
        signupDate: '2024-01-15',
      },
    },
  },
});
// Graceful shutdown (flushes remaining events)
await mlCache.shutdown();
Configuration
Full Configuration Options
import { MLCacheClient, type MLCacheConfig } from 'ml-cache';
const config: MLCacheConfig = {
  // Required: AWS Credentials
  credentials: {
    accessKeyId: 'AKIA...',
    secretAccessKey: '...',
    sessionToken: '...', // Optional: for temporary credentials
  },
  // S3 Configuration (required for S3 or S3_TO_GLACIER mode)
  s3: {
    bucket: 'my-ml-data-bucket',
    region: 'us-east-1',
    prefix: 'events/', // Optional: folder prefix for objects
    storageClass: 'GLACIER', // STANDARD, GLACIER, DEEP_ARCHIVE, etc.
  },
  // Glacier Configuration (required for GLACIER mode)
  glacier: {
    vaultName: 'my-ml-vault',
    region: 'us-east-1',
    accountId: '-', // Optional: defaults to current account
  },
  // Storage mode
  storageMode: 'S3', // 'S3' | 'GLACIER' | 'S3_TO_GLACIER'
  // Batching configuration
  batch: {
    enabled: true, // Enable event batching
    maxSize: 100, // Max events per batch
    maxWaitMs: 30000, // Flush every 30 seconds
  },
  // Retry configuration
  retry: {
    maxRetries: 3,
    initialDelayMs: 1000,
    maxDelayMs: 30000,
    exponentialBackoff: true,
  },
  // Logging configuration
  log: {
    level: 'info', // 'debug' | 'info' | 'warn' | 'error' | 'silent'
    enabled: true,
    customLogger: (level, message, data) => {
      // Your custom logging logic
    },
  },
  // Metadata
  sourceApp: 'my-application',
  environment: 'production',
  debug: false,
};
const client = new MLCacheClient(config);
Storage Classes
Choose the right storage class for your needs:
| Storage Class | Use Case | Retrieval Time |
| -------------- | ------------------------------ | -------------- |
| STANDARD | Frequent access, testing | Instant |
| STANDARD_IA | Infrequent access | Instant |
| GLACIER | Recommended for ML data | Minutes to hours (1-5 minutes with expedited retrieval) |
| DEEP_ARCHIVE | Rarely accessed, lowest cost | 12-48 hours |
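One way to read the table above is as a lookup from "how long can I wait for the data?" to a storage class. The sketch below encodes that reasoning. The worst-case retrieval hours (12 for GLACIER, 48 for DEEP_ARCHIVE) and the cheapest-first ordering by storage price are assumptions made for illustration, and `pickStorageClass` is a hypothetical helper, not part of ml-cache. Retrieval fees and minimum storage durations are not modeled.

```typescript
// Storage class names as listed in the table above.
type StorageClass = "STANDARD" | "STANDARD_IA" | "GLACIER" | "DEEP_ARCHIVE";

// Assumed worst-case time until data is readable, in hours.
const WORST_CASE_RETRIEVAL_HOURS: Record<StorageClass, number> = {
  STANDARD: 0,
  STANDARD_IA: 0,
  GLACIER: 12,
  DEEP_ARCHIVE: 48,
};

// Classes ordered from lowest to highest storage price (an assumption for this sketch).
const CHEAPEST_FIRST: StorageClass[] = ["DEEP_ARCHIVE", "GLACIER", "STANDARD_IA", "STANDARD"];

// Return the cheapest class whose worst-case retrieval time fits within maxWaitHours.
function pickStorageClass(maxWaitHours: number): StorageClass {
  for (const storageClass of CHEAPEST_FIRST) {
    if (WORST_CASE_RETRIEVAL_HOURS[storageClass] <= maxWaitHours) {
      return storageClass;
    }
  }
  // Nothing fits (for example a negative budget): fall back to instant access.
  return "STANDARD";
}

console.log(pickStorageClass(24)); // a one-day wait is acceptable
console.log(pickStorageClass(72)); // a multi-day wait is acceptable
```

The returned string could then be passed as `storageClass` in the S3 configuration shown earlier.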
Caching Data
Basic Usage
Cache any business data with rich context:
await mlCache.cache({
  data: {
    orderId: 'ORD-123456',
    total: 299.99,
    items: [
      { sku: 'WIDGET-A', quantity: 2, price: 49.99 },
      { sku: 'WIDGET-B', quantity: 1, price: 199.99 },
    ],
    paymentMethod: 'credit_card',
    shippingMethod: 'express',
  },
  context: {
    user: { userId: 'user-123' },
    campaign: {
      source: 'google',
      medium: 'cpc',
      name: 'summer_sale',
    },
  },
});
Event Context
Enrich events with contextual data:
await mlCache.cache({
  data: {
    action: 'feature_used',
    feature: 'dark_mode',
  },
  context: {
    // User context
    user: {
      userId: 'user-123',
      anonymousId: 'anon-456',
      traits: {
        plan: 'pro',
        role: 'admin',
      },
    },
    // Device context
    device: {
      userAgent: 'Mozilla/5.0...',
      deviceType: 'desktop',
      os: 'macOS',
      browser: 'Chrome',
      screenResolution: '1920x1080',
      locale: 'en-US',
      timezone: 'America/New_York',
    },
    // Page context
    page: {
      url: 'https://example.com/settings',
      path: '/settings',
      title: 'Settings',
      referrer: 'https://example.com/home',
    },
    // Campaign/UTM context
    campaign: {
      source: 'newsletter',
      medium: 'email',
      name: 'weekly_digest',
      content: 'cta_button',
    },
    // App context
    app: {
      name: 'MyApp',
      version: '2.1.0',
      build: '456',
    },
    // Custom context
    custom: {
      experimentId: 'exp-123',
      variant: 'B',
    },
  },
});
Callbacks & Monitoring
// Monitor all cached events
mlCache.onEvent((event) => {
  console.log('Event cached:', event.eventId);
});
// Handle errors
mlCache.onError((error, event) => {
  console.error('Failed to store event:', error.message);
  // Optionally: send to error tracking service
});
// Monitor flushes
mlCache.onFlush((result) => {
  console.log(`Flushed ${result.eventCount} events`);
  if (result.failedEventIds.length > 0) {
    console.warn('Failed events:', result.failedEventIds);
  }
});
// Health check
const health = await mlCache.getHealth();
console.log('SDK Health:', health);
// {
//   healthy: true,
//   s3Connected: true,
//   glacierConnected: false,
//   queueSize: 5,
//   lastFlush: '2024-01-15T10:30:00.000Z',
// }
Data Format
Events are stored in NDJSON (Newline Delimited JSON) format, perfect for:
- Apache Spark — Native NDJSON support
- AWS Athena — Query directly with SQL
- Pandas — pd.read_json(file, lines=True)
- Any ML pipeline — Simple line-by-line parsing
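For readers unfamiliar with NDJSON, the following sketch shows the encoding and the line-by-line parsing that the tools above rely on. The helpers `toNdjson` and `fromNdjson` are illustrative and are not functions exported by ml-cache; the sample events are made up.

```typescript
type JsonRecord = Record<string, unknown>;

// Serialize records as NDJSON: one JSON document per line, each terminated by "\n".
// JSON.stringify escapes newlines inside strings, so a record never spans two lines.
function toNdjson(records: JsonRecord[]): string {
  return records.map((record) => JSON.stringify(record) + "\n").join("");
}

// Parse NDJSON back into records, skipping blank lines.
function fromNdjson(text: string): JsonRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as JsonRecord);
}

// Round-trip a small batch of made-up events.
const sampleBatch: JsonRecord[] = [
  { eventId: "evt-1", data: { productId: "SKU-123", amount: 99.99 } },
  { eventId: "evt-2", data: { note: "multi\nline text stays on one NDJSON line" } },
];
const encoded = toNdjson(sampleBatch);
const decoded = fromNdjson(encoded);
console.log(`${decoded.length} events decoded`);
```

Because every line is an independent JSON document, a large file can be processed as a stream without loading it into memory all at once.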
S3 Object Structure
s3://my-bucket/ml-cache-events/
├── 2024/
│ ├── 01/
│ │ ├── 15/
│ │ │ ├── 10/
│ │ │ │ ├── batch_1705312200_a1b2c3d4.ndjson
│ │ │ │ └── batch_1705312500_e5f6g7h8.ndjson
Event Schema
{
  "eventId": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "data": {
    "productId": "SKU-123",
    "amount": 99.99
  },
  "context": {
    "user": { "userId": "user-456" }
  },
  "metadata": {
    "sdkVersion": "1.0.0",
    "sourceApp": "my-app",
    "environment": "production",
    "batchId": "batch_1705312200_a1b2c3d4"
  }
}
AWS Setup
IAM Policy
Create an IAM policy with minimal required permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
For Glacier mode, add:
{
  "Effect": "Allow",
  "Action": ["glacier:UploadArchive", "glacier:DescribeVault"],
  "Resource": "arn:aws:glacier:*:*:vaults/your-vault-name"
}
S3 Lifecycle Policy (Optional)
Automatically transition data to deeper cold storage:
{
  "Rules": [
    {
      "ID": "MLDataLifecycle",
      "Status": "Enabled",
      "Prefix": "ml-cache-events/",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ]
    }
  ]
}
Best Practices
1. Capture Rich Context
The more context you capture now, the better your models will be:
// Good: Rich context for future ML
await mlCache.cache({
  data: {
    action: 'product_viewed',
    productId: 'SKU-123',
    category: 'electronics',
    price: 299.99,
    inStock: true,
    viewDuration: 45,
    scrollDepth: 0.8,
  },
  context: {
    user: { userId: 'user-456', traits: { segment: 'high-value' } },
    page: { referrer: 'google.com' },
    device: { deviceType: 'mobile', os: 'iOS' },
    custom: { searchQuery: 'best headphones' },
  },
});
2. Graceful Shutdown
Always flush events before application exit:
process.on('SIGTERM', async () => {
  await mlCache.shutdown();
  process.exit(0);
});
3. Monitor Queue Size
Prevent memory issues in high-traffic scenarios:
setInterval(() => {
  const queueSize = mlCache.getQueueSize();
  if (queueSize > 5000) {
    console.warn(`Queue size high: ${queueSize}`);
  }
}, 60000);
Future ML Use Cases
The data you collect today can power tomorrow's AI features:
| Data Type | Future ML Application |
| --------------------- | ---------------------------------------------- |
| Purchase data | Recommendation engine, demand forecasting |
| Page views | Content personalization, A/B test analysis |
| Search queries | Search ranking, query understanding |
| Support interactions | Automated responses, sentiment analysis |
| User behavior | Churn prediction, engagement scoring |
| Product interactions | Dynamic pricing, inventory optimization |
API Reference
MLCacheClient
| Method | Description |
| ------------------- | ---------------------------------- |
| cache(event) | Cache data for ML training |
| flush() | Manually flush the event queue |
| getHealth() | Get SDK health status |
| getQueueSize() | Get current queue size |
| getVersion() | Get SDK version |
| shutdown() | Gracefully shutdown the client |
| onEvent(callback) | Register event callback |
| onError(callback) | Register error callback |
| onFlush(callback) | Register flush callback |
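To illustrate how the queue-related methods in the table plausibly fit together, here is an in-memory stand-in that could be useful in unit tests. It is not the SDK's implementation: nothing is sent to AWS, the class name `InMemoryMLCache` and the type `SketchEvent` are hypothetical, and the return types (for example `flush()` resolving to a count) are assumptions rather than documented signatures.

```typescript
// Hypothetical, simplified event shape; the real event type has more fields.
interface SketchEvent {
  eventId?: string;
  data?: Record<string, unknown>;
}

// In-memory stand-in mimicking a subset of the methods listed above.
class InMemoryMLCache {
  private queue: SketchEvent[] = [];
  private nextId = 1;
  private readonly maxBatchSize: number;
  // Batches that a real client would have uploaded.
  readonly flushedBatches: SketchEvent[][] = [];

  constructor(maxBatchSize = 100) {
    this.maxBatchSize = maxBatchSize;
  }

  // Queue an event, assigning an id if the caller did not provide one.
  async cache(event: SketchEvent): Promise<void> {
    this.queue.push({ ...event, eventId: event.eventId ?? `evt-${this.nextId++}` });
    // Flush automatically once the batch is full.
    if (this.queue.length >= this.maxBatchSize) {
      await this.flush();
    }
  }

  // Move all queued events into one recorded batch; resolves to the number flushed.
  async flush(): Promise<number> {
    const batch = this.queue;
    this.queue = [];
    if (batch.length > 0) {
      this.flushedBatches.push(batch);
    }
    return batch.length;
  }

  getQueueSize(): number {
    return this.queue.length;
  }

  // Flush whatever is still queued before the process exits.
  async shutdown(): Promise<void> {
    await this.flush();
  }
}
```

In a test, application code can be handed an `InMemoryMLCache` in place of the real client, and assertions can then inspect `flushedBatches` to check what would have been stored.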
Event Structure
interface MLCacheEvent {
  // Auto-generated if not provided
  eventId?: string;
  timestamp?: string;
  // Your business data
  data?: Record<string, unknown>;
  // Rich context
  context?: {
    user?: { userId?: string; anonymousId?: string; traits?: Record<string, unknown> };
    device?: { userAgent?: string; deviceType?: string; os?: string; /* ... */ };
    page?: { url?: string; path?: string; title?: string; referrer?: string };
    campaign?: { source?: string; medium?: string; name?: string; /* ... */ };
    app?: { name?: string; version?: string; build?: string };
    custom?: Record<string, unknown>;
  };
}
License
MIT
Contributing
Contributions are welcome! Please read our contributing guidelines and submit pull requests to the GitHub repository.
