`@andrejs1979/monitoring` v1.0.0
Monitoring, metrics, and observability for a NoSQL database platform.
# NoSQL Monitoring and Observability System

A comprehensive monitoring and observability system for NoSQL databases, providing metrics collection, distributed tracing, structured logging, health checks, alerting, and performance analytics.
## Features

### 🎯 Core Monitoring Capabilities
- Metrics Collection: Request latency, throughput, database operations, vector search performance, memory/storage usage, error rates
- Distributed Tracing: OpenTelemetry-compliant tracing across services, database operations, and vector searches
- Structured Logging: Context-aware logging with error tracking, audit logs, and performance logging
- Health Checks: Deep health monitoring for all services with automatic failover triggers
- Real-time Alerting: Configurable alert rules with multiple notification channels and escalation policies
- Dashboard Integration: Real-time metrics streaming and historical data queries
- Performance Analytics: Query analysis, index optimization recommendations, and cost analysis
### 🌐 Edge-Native Design
- Cloudflare Workers Integration: Native support for Cloudflare's edge compute platform
- Edge Metrics Collection: Per-location performance monitoring and global aggregation
- Low Latency: Sub-millisecond metric collection overhead
- Global Distribution: Monitoring data replicated across edge locations
### 🤖 AI-Powered Insights
- Automated Performance Analysis: Machine learning-based bottleneck detection
- Intelligent Alerting: Context-aware alerts with reduced false positives
- Query Optimization: AI-driven query performance recommendations
- Predictive Scaling: Proactive resource scaling based on usage patterns
## Quick Start
### Basic Setup

```typescript
import {
  MetricsCollector,
  TraceManager,
  Logger,
  HealthChecker,
  AlertManager,
  MonitoringMiddleware
} from '@andrejs1979/monitoring';

// Initialize monitoring components
const metricsCollector = new MetricsCollector({
  flushInterval: 30000,
  defaultLabels: { service: 'nosqldb', version: '1.0.0' }
});

const traceManager = new TraceManager({
  serviceName: 'nosqldb',
  version: '1.0.0',
  environment: 'production',
  samplingRate: 0.1
});

const logger = new Logger({
  component: 'NoSQLDB',
  level: 'info',
  structured: true
});

const healthChecker = new HealthChecker();

const alertManager = new AlertManager({
  evaluationInterval: 60000,
  defaultNotificationChannel: 'default'
});

// Create monitoring middleware
const monitoring = new MonitoringMiddleware(
  metricsCollector,
  traceManager,
  logger,
  healthChecker,
  {
    enableMetrics: true,
    enableTracing: true,
    enableLogging: true,
    enableHealthChecks: true,
    excludePaths: ['/health', '/metrics'],
    slowRequestThreshold: 1000,
    errorSamplingRate: 1.0,
    sensitiveHeaders: ['authorization', 'cookie']
  }
);
```

### Cloudflare Workers Integration
```typescript
export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    return await monitoring.handleRequest(request, env, ctx);
  }
};
```

### Express.js Integration
```typescript
import express from 'express';

const app = express();
app.use(monitoring.expressMiddleware());
```

## Components
### 📊 Metrics Collection
Automatically collects and aggregates metrics:
```typescript
// Record custom metrics
metricsCollector.counter('api_calls_total', 1, { endpoint: '/api/search' });
metricsCollector.histogram('query_duration_seconds', 0.150, { operation: 'vector_search' });
metricsCollector.gauge('active_connections', 42);
```

Built-in metrics include:
- HTTP request/response metrics
- Database operation performance
- Vector search latency and accuracy
- Memory and storage utilization
- Authentication and security events
- Edge location performance
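The metric calls above can mix labels freely; collectors typically key each time series by metric name plus a canonical serialization of its labels, so the same name with the same labels always increments the same series. A minimal sketch of that bookkeeping (hypothetical, not this package's actual internals):

```typescript
// Minimal sketch of label-keyed counter aggregation.
// Hypothetical illustration, not this package's actual internals.
type Labels = Record<string, string>;

class TinyCounterStore {
  private series = new Map<string, number>();

  // Sort label keys so { a, b } and { b, a } map to the same series.
  private key(name: string, labels: Labels): string {
    const sorted = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(',');
    return `${name}{${sorted}}`;
  }

  counter(name: string, value = 1, labels: Labels = {}): void {
    const k = this.key(name, labels);
    this.series.set(k, (this.series.get(k) ?? 0) + value);
  }

  get(name: string, labels: Labels = {}): number {
    return this.series.get(this.key(name, labels)) ?? 0;
  }
}
```

Repeated calls with the same name and labels accumulate into one series; a new label value starts a new series, which is why unbounded label values inflate cardinality.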
### 🔍 Distributed Tracing
OpenTelemetry-compliant distributed tracing:
```typescript
// Create custom spans
const span = traceManager.startSpan({
  operationName: 'vector.search',
  tags: { collection: 'documents', dimensions: 1536 }
});

try {
  const results = await performVectorSearch();
  traceManager.setSpanStatus(span, SpanStatus.OK);
} catch (error) {
  traceManager.setSpanStatus(span, SpanStatus.ERROR, error.message);
  throw error;
} finally {
  traceManager.finishSpan(span);
}
```

Automatic instrumentation for:
- HTTP requests and responses
- Database queries and transactions
- Vector search operations
- Authentication flows
- Cross-service communication
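Cross-service trace propagation in OpenTelemetry-style tracers rides on the W3C Trace Context `traceparent` header. A sketch of encoding and parsing it (illustrative; the package presumably handles this internally):

```typescript
// Encode/parse the W3C `traceparent` header:
// version "00" - 32-hex trace id - 16-hex span id - 2-hex flags (01 = sampled).
interface SpanContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function toTraceparent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

function fromTraceparent(header: string): SpanContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: m[3] === '01' };
}
```

The receiving service parses the header and uses the ids as the parent context for its own spans, which is what stitches a trace together across services.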
### 📝 Structured Logging
Context-aware structured logging:
```typescript
// Log with context
logger.info('User authenticated', {
  userId: 'user123',
  requestId: 'req456',
  operation: 'authentication'
}, {
  method: 'oauth',
  provider: 'google',
  duration: 250
});

// Automatic performance timing
const result = await logger.logWithTiming('database_query', async () => {
  return await db.query('SELECT * FROM users WHERE id = ?', [userId]);
});
```

Log types include:
- HTTP request/response logs
- Database operation logs
- Authentication and security events
- Error and exception tracking
- Performance timing logs
- Audit trail logs
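Each log call above typically serializes to one JSON object per line, merging the context and metadata with the message. A sketch of a plausible entry shape (field names are illustrative, not this package's actual schema):

```typescript
// Build one JSON log line from level, message, context, and metadata.
// Field names are illustrative, not this package's actual schema.
interface LogContext {
  userId?: string;
  requestId?: string;
  operation?: string;
}

function formatLogEntry(
  level: string,
  message: string,
  context: LogContext = {},
  metadata: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context,
    metadata
  });
}
```

One-object-per-line output is what makes these logs queryable by field in downstream log stores.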
### 🏥 Health Checks
Comprehensive health monitoring:
```typescript
// Register custom health check
healthChecker.register('custom_service', async () => {
  const isHealthy = await checkServiceHealth();
  return {
    status: isHealthy ? HealthStatus.HEALTHY : HealthStatus.UNHEALTHY,
    message: isHealthy ? 'Service is running' : 'Service is down',
    duration: 50
  };
});

// Get health report
const report = await healthChecker.getHealthReport();
```

Built-in health checks:
- Database connectivity
- Vector index availability
- Time series ingestion pipeline
- Authentication service
- Cache service availability
- External API dependencies
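A health report typically rolls individual check results up with a "worst status wins" rule. A sketch of that aggregation (a common convention; hypothetical, not the package's exact logic):

```typescript
// Aggregate per-check statuses: the worst individual status
// becomes the overall status ("worst status wins").
type Status = 'healthy' | 'degraded' | 'unhealthy';

const RANK: Record<Status, number> = { healthy: 0, degraded: 1, unhealthy: 2 };

function overallStatus(checks: Record<string, Status>): Status {
  let worst: Status = 'healthy';
  for (const status of Object.values(checks)) {
    if (RANK[status] > RANK[worst]) worst = status;
  }
  return worst;
}
```

Under this rule a single unhealthy dependency marks the whole system unhealthy, which is what lets load balancers and failover triggers act on one aggregate signal.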
### 🚨 Real-time Alerting
Configurable alerting with multiple channels:
```typescript
// Create alert rule
alertManager.addRule({
  id: 'high_error_rate',
  name: 'High Error Rate',
  description: 'Error rate exceeds 5%',
  query: 'rate(errors_total[5m]) > 0.05',
  threshold: 0.05,
  comparison: AlertComparison.GREATER_THAN,
  duration: 300,
  severity: AlertSeverity.HIGH,
  enabled: true
});

// Add notification channel
alertManager.addNotificationChannel({
  id: 'slack_alerts',
  name: 'Slack Alerts',
  type: NotificationType.SLACK,
  config: {
    webhook_url: 'https://hooks.slack.com/...',
    channel: '#alerts'
  },
  enabled: true
});
```

Supported notification channels:
- Email notifications
- Slack integration
- Webhook callbacks
- SMS alerts (Twilio)
- Discord notifications
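For orientation, a Slack channel like the one configured above ultimately POSTs a small JSON body to the webhook URL. A sketch of building that body (the `channel`/`text` fields follow Slack's incoming-webhook format; delivery itself is omitted):

```typescript
// Build a Slack incoming-webhook payload for a fired alert.
// Delivery (POSTing to the webhook URL) is intentionally left out.
interface Alert {
  name: string;
  severity: string;
  description: string;
}

function slackPayload(alert: Alert, channel: string) {
  return {
    channel,
    text: `[${alert.severity.toUpperCase()}] ${alert.name}: ${alert.description}`
  };
}
```

Other channel types follow the same pattern with their own payload shapes (SMTP body for email, arbitrary JSON for generic webhooks).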
### 📈 Dashboard Integration
Real-time metrics streaming and dashboard APIs:
```typescript
// Query metrics
const response = await dashboardAPI.queryMetrics({
  metric: 'http_request_duration_seconds',
  timeRange: { start: Date.now() - 3600000, end: Date.now() },
  step: 60,
  aggregation: AggregationFunction.PERCENTILE_95
});

// Create dashboard
const dashboard = dashboardAPI.createDashboard({
  name: 'System Overview',
  description: 'High-level system metrics',
  widgets: [
    {
      type: WidgetType.METRIC_CHART,
      title: 'Request Rate',
      config: { metric: 'http_requests_total', type: 'line' }
    }
  ]
});

// Real-time streaming
const streamer = new MetricsStreamer(config, metricsCollector);
const subscriptionId = streamer.subscribeToMetric(
  clientId,
  'http_requests_total',
  { interval: 5000 }
);
```

### 🔬 Performance Analytics
Advanced performance analysis and optimization:
```typescript
// Analyze query performance
const analysis = await queryAnalyzer.analyzeQuery(queryPerformance);
console.log(`Query score: ${analysis.score}/100`);
console.log(`Recommendations: ${analysis.recommendations.length}`);

// Generate performance report
const report = await performanceAnalyzer.generatePerformanceReport();
console.log(`Overall grade: ${report.overall.grade}`);
console.log(`Bottlenecks found: ${report.bottlenecks.length}`);
```

Analytics features:
- Query performance analysis
- Index usage statistics
- Storage optimization recommendations
- Bottleneck identification
- Performance trend analysis
- Cost optimization insights
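Aggregations like `AggregationFunction.PERCENTILE_95` used in the dashboard queries reduce each window of samples to a single value. A nearest-rank percentile sketch (illustrative; the package's exact method may differ):

```typescript
// Nearest-rank percentile: sort the samples, then take the value at
// rank ceil(p/100 * n), clamped so tiny inputs still return an element.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Percentiles are preferred over averages for latency because a few slow outliers dominate user experience while barely moving the mean.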
## Configuration

### Environment Variables
```bash
# Cloudflare Analytics
CF_ANALYTICS_TOKEN=your_token_here
CF_ANALYTICS_DATASET_ID=your_dataset_id
CF_ACCOUNT_ID=your_account_id

# OpenTelemetry
OTEL_SERVICE_NAME=nosqldb
OTEL_SERVICE_VERSION=1.0.0
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=your_api_key

# Monitoring
MONITORING_ENABLED=true
METRICS_RETENTION_DAYS=30
TRACING_SAMPLE_RATE=0.1
LOG_LEVEL=info

# Alerting
ALERT_EVALUATION_INTERVAL=60000
SLACK_WEBHOOK_URL=https://hooks.slack.com/...
EMAIL_SMTP_ENDPOINT=https://api.sendgrid.com/v3/mail/send
EMAIL_API_KEY=your_sendgrid_api_key
```

### Advanced Configuration
```typescript
const config: MonitoringConfig = {
  metrics: {
    enabled: true,
    retentionDays: 30,
    aggregationIntervals: [60, 300, 3600, 86400],
    exportInterval: 30000
  },
  tracing: {
    enabled: true,
    samplingRate: 0.1,
    maxSpansPerTrace: 1000,
    retentionDays: 7
  },
  logging: {
    enabled: true,
    level: LogLevel.INFO,
    retentionDays: 14,
    structured: true
  },
  healthChecks: {
    enabled: true,
    interval: 30000,
    timeout: 5000
  },
  alerting: {
    enabled: true,
    evaluationInterval: 60000,
    defaultNotificationChannel: 'default'
  },
  cloudflareAnalytics: {
    enabled: true,
    datasetId: process.env.CF_ANALYTICS_DATASET_ID,
    apiToken: process.env.CF_ANALYTICS_TOKEN
  }
};
```

## API Reference
### Metrics API

- `MetricsCollector.counter(name, value, labels)` - Record a counter metric
- `MetricsCollector.gauge(name, value, labels)` - Record a gauge metric
- `MetricsCollector.histogram(name, value, labels)` - Record a histogram metric
- `MetricsCollector.getMetrics()` - Get all collected metrics

### Tracing API

- `TraceManager.startSpan(options)` - Start a new span
- `TraceManager.finishSpan(span)` - Finish a span
- `TraceManager.setSpanStatus(span, status)` - Set span status
- `TraceManager.createTraceContext(span)` - Create a trace context

### Logging API

- `Logger.info(message, context, metadata)` - Log an info message
- `Logger.error(message, error, context, metadata)` - Log an error
- `Logger.logWithTiming(name, operation)` - Log with automatic timing
- `Logger.child(context)` - Create a child logger with context

### Health Check API

- `HealthChecker.register(name, check, config)` - Register a health check
- `HealthChecker.runCheck(name)` - Run a specific health check
- `HealthChecker.getHealthReport()` - Get the overall health report
- `HealthChecker.isHealthy()` - Check if the system is healthy

### Alert API

- `AlertManager.addRule(rule)` - Add an alert rule
- `AlertManager.addNotificationChannel(channel)` - Add a notification channel
- `AlertManager.triggerManualAlert(name, description, severity)` - Trigger a manual alert
- `AlertManager.getActiveAlerts(filters)` - Get active alerts
## Best Practices

### 1. Metric Naming
Use consistent naming conventions:
```typescript
// Good
metricsCollector.counter('http_requests_total', 1, { method: 'GET', status: '200' });
metricsCollector.histogram('database_query_duration_seconds', 0.1, { operation: 'select' });

// Bad
metricsCollector.counter('requests', 1);
metricsCollector.histogram('db_time', 100);
```

### 2. Structured Logging
Always include relevant context:
```typescript
// Good
logger.info('User login successful', {
  userId: user.id,
  requestId: req.id,
  operation: 'authentication'
}, {
  loginMethod: 'oauth',
  provider: 'google',
  duration: authTime
});

// Bad
logger.info('User logged in');
```

### 3. Span Organization
Create meaningful span hierarchies:
```typescript
const requestSpan = traceManager.createHttpSpan('GET', '/api/search');

const authSpan = traceManager.startSpan({
  operationName: 'auth.validate',
  parentSpan: requestSpan
});

const searchSpan = traceManager.createVectorSearchSpan(
  'similarity_search',
  'documents',
  requestSpan
);
```

### 4. Alert Configuration
Use appropriate thresholds and durations:
```typescript
// Good - prevents false positives
{
  threshold: 0.05,  // 5% error rate
  duration: 300,    // sustained for 5 minutes
  severity: AlertSeverity.HIGH
}

// Bad - too sensitive
{
  threshold: 0.01,
  duration: 30,
  severity: AlertSeverity.CRITICAL
}
```

## Troubleshooting
### Common Issues

#### High Memory Usage
- Reduce metric retention period
- Increase flush intervals
- Enable metric sampling
#### Missing Traces
- Check sampling rate configuration
- Verify trace context propagation
- Ensure spans are properly finished
#### Alert Noise
- Adjust thresholds and durations
- Use alert suppression rules
- Implement alert correlation
#### Performance Impact
- Enable async processing
- Use batching for exports
- Optimize metric cardinality
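One common way to optimize metric cardinality is to cap the distinct values per label, collapsing the overflow into an `other` bucket. A sketch of that guard (hypothetical helper, not part of this package):

```typescript
// Cap distinct values per label; overflow collapses into "other".
// Values admitted before the cap keep reporting under their own name.
class CardinalityLimiter {
  private seen = new Map<string, Set<string>>();

  constructor(private maxValues = 100) {}

  limit(label: string, value: string): string {
    let values = this.seen.get(label);
    if (!values) {
      values = new Set();
      this.seen.set(label, values);
    }
    if (values.has(value)) return value;
    if (values.size >= this.maxValues) return 'other';
    values.add(value);
    return value;
  }
}
```

Applying this to user-controlled label values (URLs, user ids) keeps the number of stored series bounded regardless of traffic.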
### Debug Mode
Enable debug logging for troubleshooting:
```typescript
const logger = new Logger({
  component: 'Monitoring',
  level: LogLevel.DEBUG
});
```

## Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request
## License
MIT License - see LICENSE file for details.
