`@andrejs1979/monitoring` v1.0.0
Monitoring, metrics, and observability for a NoSQL database platform.
# NoSQL Monitoring and Observability System

A comprehensive monitoring and observability system for NoSQL databases, providing metrics collection, distributed tracing, structured logging, health checks, alerting, and performance analytics.
## Features

### 🎯 Core Monitoring Capabilities
- Metrics Collection: Request latency, throughput, database operations, vector search performance, memory/storage usage, error rates
- Distributed Tracing: OpenTelemetry-compliant tracing across services, database operations, and vector searches
- Structured Logging: Context-aware logging with error tracking, audit logs, and performance logging
- Health Checks: Deep health monitoring for all services with automatic failover triggers
- Real-time Alerting: Configurable alert rules with multiple notification channels and escalation policies
- Dashboard Integration: Real-time metrics streaming and historical data queries
- Performance Analytics: Query analysis, index optimization recommendations, and cost analysis
### 🌐 Edge-Native Design
- Cloudflare Workers Integration: Native support for Cloudflare's edge compute platform
- Edge Metrics Collection: Per-location performance monitoring and global aggregation
- Low Latency: Sub-millisecond metric collection overhead
- Global Distribution: Monitoring data replicated across edge locations
### 🤖 AI-Powered Insights
- Automated Performance Analysis: Machine learning-based bottleneck detection
- Intelligent Alerting: Context-aware alerts with reduced false positives
- Query Optimization: AI-driven query performance recommendations
- Predictive Scaling: Proactive resource scaling based on usage patterns
## Quick Start
### Basic Setup

```typescript
import {
  MetricsCollector,
  TraceManager,
  Logger,
  HealthChecker,
  AlertManager,
  MonitoringMiddleware
} from '@andrejs1979/monitoring';

// Initialize monitoring components
const metricsCollector = new MetricsCollector({
  flushInterval: 30000,
  defaultLabels: { service: 'nosqldb', version: '1.0.0' }
});

const traceManager = new TraceManager({
  serviceName: 'nosqldb',
  version: '1.0.0',
  environment: 'production',
  samplingRate: 0.1
});

const logger = new Logger({
  component: 'NoSQLDB',
  level: 'info',
  structured: true
});

const healthChecker = new HealthChecker();

const alertManager = new AlertManager({
  evaluationInterval: 60000,
  defaultNotificationChannel: 'default'
});

// Create monitoring middleware
const monitoring = new MonitoringMiddleware(
  metricsCollector,
  traceManager,
  logger,
  healthChecker,
  {
    enableMetrics: true,
    enableTracing: true,
    enableLogging: true,
    enableHealthChecks: true,
    excludePaths: ['/health', '/metrics'],
    slowRequestThreshold: 1000,
    errorSamplingRate: 1.0,
    sensitiveHeaders: ['authorization', 'cookie']
  }
);
```

### Cloudflare Workers Integration
```typescript
export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    return await monitoring.handleRequest(request, env, ctx);
  }
};
```

### Express.js Integration
```typescript
import express from 'express';

const app = express();
app.use(monitoring.expressMiddleware());
```

## Components
### 📊 Metrics Collection
Automatically collects and aggregates metrics:
```typescript
// Record custom metrics
metricsCollector.counter('api_calls_total', 1, { endpoint: '/api/search' });
metricsCollector.histogram('query_duration_seconds', 0.150, { operation: 'vector_search' });
metricsCollector.gauge('active_connections', 42);
```

Built-in metrics include:
- HTTP request/response metrics
- Database operation performance
- Vector search latency and accuracy
- Memory and storage utilization
- Authentication and security events
- Edge location performance
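The metric calls above can mix labels freely; collectors typically key each time series by metric name plus a canonical serialization of its labels, so the same name with the same labels always increments the same series. A minimal sketch of that bookkeeping (hypothetical, not this package's actual internals):

```typescript
// Minimal sketch of label-keyed counter aggregation.
// Hypothetical illustration, not this package's actual internals.
type Labels = Record<string, string>;

class TinyCounterStore {
  private series = new Map<string, number>();

  // Sort label keys so { a, b } and { b, a } map to the same series.
  private key(name: string, labels: Labels): string {
    const sorted = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(',');
    return `${name}{${sorted}}`;
  }

  counter(name: string, value = 1, labels: Labels = {}): void {
    const k = this.key(name, labels);
    this.series.set(k, (this.series.get(k) ?? 0) + value);
  }

  get(name: string, labels: Labels = {}): number {
    return this.series.get(this.key(name, labels)) ?? 0;
  }
}
```

Repeated calls with the same name and labels accumulate into one series; a new label value starts a new series, which is why unbounded label values inflate cardinality.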
### 🔍 Distributed Tracing
OpenTelemetry-compliant distributed tracing:
```typescript
// Create custom spans
const span = traceManager.startSpan({
  operationName: 'vector.search',
  tags: { collection: 'documents', dimensions: 1536 }
});

try {
  const results = await performVectorSearch();
  traceManager.setSpanStatus(span, SpanStatus.OK);
} catch (error) {
  traceManager.setSpanStatus(span, SpanStatus.ERROR, error.message);
  throw error;
} finally {
  traceManager.finishSpan(span);
}
```

Automatic instrumentation for:
- HTTP requests and responses
- Database queries and transactions
- Vector search operations
- Authentication flows
- Cross-service communication
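Cross-service trace propagation in OpenTelemetry-style tracers rides on the W3C Trace Context `traceparent` header. A sketch of encoding and parsing it (illustrative; the package presumably handles this internally):

```typescript
// Encode/parse the W3C `traceparent` header:
// version "00" - 32-hex trace id - 16-hex span id - 2-hex flags (01 = sampled).
interface SpanContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function toTraceparent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

function fromTraceparent(header: string): SpanContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: m[3] === '01' };
}
```

The receiving service parses the header and uses the ids as the parent context for its own spans, which is what stitches a trace together across services.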
### 📝 Structured Logging
Context-aware structured logging:
```typescript
// Log with context
logger.info('User authenticated', {
  userId: 'user123',
  requestId: 'req456',
  operation: 'authentication'
}, {
  method: 'oauth',
  provider: 'google',
  duration: 250
});

// Automatic performance timing
const result = await logger.logWithTiming('database_query', async () => {
  return await db.query('SELECT * FROM users WHERE id = ?', [userId]);
});
```

Log types include:
- HTTP request/response logs
- Database operation logs
- Authentication and security events
- Error and exception tracking
- Performance timing logs
- Audit trail logs
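Each log call above typically serializes to one JSON object per line, merging the context and metadata with the message. A sketch of a plausible entry shape (field names are illustrative, not this package's actual schema):

```typescript
// Build one JSON log line from level, message, context, and metadata.
// Field names are illustrative, not this package's actual schema.
interface LogContext {
  userId?: string;
  requestId?: string;
  operation?: string;
}

function formatLogEntry(
  level: string,
  message: string,
  context: LogContext = {},
  metadata: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context,
    metadata
  });
}
```

One-object-per-line output is what makes these logs queryable by field in downstream log stores.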
### 🏥 Health Checks
Comprehensive health monitoring:
```typescript
// Register custom health check
healthChecker.register('custom_service', async () => {
  const isHealthy = await checkServiceHealth();
  return {
    status: isHealthy ? HealthStatus.HEALTHY : HealthStatus.UNHEALTHY,
    message: isHealthy ? 'Service is running' : 'Service is down',
    duration: 50
  };
});

// Get health report
const report = await healthChecker.getHealthReport();
```

Built-in health checks:
- Database connectivity
- Vector index availability
- Time series ingestion pipeline
- Authentication service
- Cache service availability
- External API dependencies
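A health report typically rolls individual check results up with a "worst status wins" rule. A sketch of that aggregation (a common convention; hypothetical, not the package's exact logic):

```typescript
// Aggregate per-check statuses: the worst individual status
// becomes the overall status ("worst status wins").
type Status = 'healthy' | 'degraded' | 'unhealthy';

const RANK: Record<Status, number> = { healthy: 0, degraded: 1, unhealthy: 2 };

function overallStatus(checks: Record<string, Status>): Status {
  let worst: Status = 'healthy';
  for (const status of Object.values(checks)) {
    if (RANK[status] > RANK[worst]) worst = status;
  }
  return worst;
}
```

Under this rule a single unhealthy dependency marks the whole system unhealthy, which is what lets load balancers and failover triggers act on one aggregate signal.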
### 🚨 Real-time Alerting
Configurable alerting with multiple channels:
```typescript
// Create alert rule
alertManager.addRule({
  id: 'high_error_rate',
  name: 'High Error Rate',
  description: 'Error rate exceeds 5%',
  query: 'rate(errors_total[5m]) > 0.05',
  threshold: 0.05,
  comparison: AlertComparison.GREATER_THAN,
  duration: 300,
  severity: AlertSeverity.HIGH,
  enabled: true
});

// Add notification channel
alertManager.addNotificationChannel({
  id: 'slack_alerts',
  name: 'Slack Alerts',
  type: NotificationType.SLACK,
  config: {
    webhook_url: 'https://hooks.slack.com/...',
    channel: '#alerts'
  },
  enabled: true
});
```

Supported notification channels:
- Email notifications
- Slack integration
- Webhook callbacks
- SMS alerts (Twilio)
- Discord notifications
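For orientation, a Slack channel like the one configured above ultimately POSTs a small JSON body to the webhook URL. A sketch of building that body (the `channel`/`text` fields follow Slack's incoming-webhook format; delivery itself is omitted):

```typescript
// Build a Slack incoming-webhook payload for a fired alert.
// Delivery (POSTing to the webhook URL) is intentionally left out.
interface Alert {
  name: string;
  severity: string;
  description: string;
}

function slackPayload(alert: Alert, channel: string) {
  return {
    channel,
    text: `[${alert.severity.toUpperCase()}] ${alert.name}: ${alert.description}`
  };
}
```

Other channel types follow the same pattern with their own payload shapes (SMTP body for email, arbitrary JSON for generic webhooks).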
### 📈 Dashboard Integration
Real-time metrics streaming and dashboard APIs:
```typescript
// Query metrics
const response = await dashboardAPI.queryMetrics({
  metric: 'http_request_duration_seconds',
  timeRange: { start: Date.now() - 3600000, end: Date.now() },
  step: 60,
  aggregation: AggregationFunction.PERCENTILE_95
});

// Create dashboard
const dashboard = dashboardAPI.createDashboard({
  name: 'System Overview',
  description: 'High-level system metrics',
  widgets: [
    {
      type: WidgetType.METRIC_CHART,
      title: 'Request Rate',
      config: { metric: 'http_requests_total', type: 'line' }
    }
  ]
});

// Real-time streaming
const streamer = new MetricsStreamer(config, metricsCollector);
const subscriptionId = streamer.subscribeToMetric(
  clientId,
  'http_requests_total',
  { interval: 5000 }
);
```

### 🔬 Performance Analytics
Advanced performance analysis and optimization:
```typescript
// Analyze query performance
const analysis = await queryAnalyzer.analyzeQuery(queryPerformance);
console.log(`Query score: ${analysis.score}/100`);
console.log(`Recommendations: ${analysis.recommendations.length}`);

// Generate performance report
const report = await performanceAnalyzer.generatePerformanceReport();
console.log(`Overall grade: ${report.overall.grade}`);
console.log(`Bottlenecks found: ${report.bottlenecks.length}`);
```

Analytics features:
- Query performance analysis
- Index usage statistics
- Storage optimization recommendations
- Bottleneck identification
- Performance trend analysis
- Cost optimization insights
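Aggregations like `AggregationFunction.PERCENTILE_95` used in the dashboard queries reduce each window of samples to a single value. A nearest-rank percentile sketch (illustrative; the package's exact method may differ):

```typescript
// Nearest-rank percentile: sort the samples, then take the value at
// rank ceil(p/100 * n), clamped so tiny inputs still return an element.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Percentiles are preferred over averages for latency because a few slow outliers dominate user experience while barely moving the mean.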
## Configuration

### Environment Variables
```bash
# Cloudflare Analytics
CF_ANALYTICS_TOKEN=your_token_here
CF_ANALYTICS_DATASET_ID=your_dataset_id
CF_ACCOUNT_ID=your_account_id

# OpenTelemetry
OTEL_SERVICE_NAME=nosqldb
OTEL_SERVICE_VERSION=1.0.0
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=your_api_key

# Monitoring
MONITORING_ENABLED=true
METRICS_RETENTION_DAYS=30
TRACING_SAMPLE_RATE=0.1
LOG_LEVEL=info

# Alerting
ALERT_EVALUATION_INTERVAL=60000
SLACK_WEBHOOK_URL=https://hooks.slack.com/...
EMAIL_SMTP_ENDPOINT=https://api.sendgrid.com/v3/mail/send
EMAIL_API_KEY=your_sendgrid_api_key
```

### Advanced Configuration
```typescript
const config: MonitoringConfig = {
  metrics: {
    enabled: true,
    retentionDays: 30,
    aggregationIntervals: [60, 300, 3600, 86400],
    exportInterval: 30000
  },
  tracing: {
    enabled: true,
    samplingRate: 0.1,
    maxSpansPerTrace: 1000,
    retentionDays: 7
  },
  logging: {
    enabled: true,
    level: LogLevel.INFO,
    retentionDays: 14,
    structured: true
  },
  healthChecks: {
    enabled: true,
    interval: 30000,
    timeout: 5000
  },
  alerting: {
    enabled: true,
    evaluationInterval: 60000,
    defaultNotificationChannel: 'default'
  },
  cloudflareAnalytics: {
    enabled: true,
    datasetId: process.env.CF_ANALYTICS_DATASET_ID,
    apiToken: process.env.CF_ANALYTICS_TOKEN
  }
};
```

## API Reference
### Metrics API

- `MetricsCollector.counter(name, value, labels)` - Record a counter metric
- `MetricsCollector.gauge(name, value, labels)` - Record a gauge metric
- `MetricsCollector.histogram(name, value, labels)` - Record a histogram metric
- `MetricsCollector.getMetrics()` - Get all collected metrics

### Tracing API

- `TraceManager.startSpan(options)` - Start a new span
- `TraceManager.finishSpan(span)` - Finish a span
- `TraceManager.setSpanStatus(span, status)` - Set span status
- `TraceManager.createTraceContext(span)` - Create a trace context

### Logging API

- `Logger.info(message, context, metadata)` - Log an info message
- `Logger.error(message, error, context, metadata)` - Log an error
- `Logger.logWithTiming(name, operation)` - Log with automatic timing
- `Logger.child(context)` - Create a child logger with context

### Health Check API

- `HealthChecker.register(name, check, config)` - Register a health check
- `HealthChecker.runCheck(name)` - Run a specific health check
- `HealthChecker.getHealthReport()` - Get the overall health report
- `HealthChecker.isHealthy()` - Check if the system is healthy

### Alert API

- `AlertManager.addRule(rule)` - Add an alert rule
- `AlertManager.addNotificationChannel(channel)` - Add a notification channel
- `AlertManager.triggerManualAlert(name, description, severity)` - Trigger a manual alert
- `AlertManager.getActiveAlerts(filters)` - Get active alerts
## Best Practices

### 1. Metric Naming
Use consistent naming conventions:
```typescript
// Good
metricsCollector.counter('http_requests_total', 1, { method: 'GET', status: '200' });
metricsCollector.histogram('database_query_duration_seconds', 0.1, { operation: 'select' });

// Bad
metricsCollector.counter('requests', 1);
metricsCollector.histogram('db_time', 100);
```

### 2. Structured Logging
Always include relevant context:
```typescript
// Good
logger.info('User login successful', {
  userId: user.id,
  requestId: req.id,
  operation: 'authentication'
}, {
  loginMethod: 'oauth',
  provider: 'google',
  duration: authTime
});

// Bad
logger.info('User logged in');
```

### 3. Span Organization
Create meaningful span hierarchies:
```typescript
const requestSpan = traceManager.createHttpSpan('GET', '/api/search');

const authSpan = traceManager.startSpan({
  operationName: 'auth.validate',
  parentSpan: requestSpan
});

const searchSpan = traceManager.createVectorSearchSpan(
  'similarity_search',
  'documents',
  requestSpan
);
```

### 4. Alert Configuration
Use appropriate thresholds and durations:
```typescript
// Good - prevents false positives
{
  threshold: 0.05,  // 5% error rate
  duration: 300,    // sustained for 5 minutes
  severity: AlertSeverity.HIGH
}

// Bad - too sensitive
{
  threshold: 0.01,
  duration: 30,
  severity: AlertSeverity.CRITICAL
}
```

## Troubleshooting
### Common Issues

#### High Memory Usage
- Reduce metric retention period
- Increase flush intervals
- Enable metric sampling
#### Missing Traces
- Check sampling rate configuration
- Verify trace context propagation
- Ensure spans are properly finished
#### Alert Noise
- Adjust thresholds and durations
- Use alert suppression rules
- Implement alert correlation
#### Performance Impact
- Enable async processing
- Use batching for exports
- Optimize metric cardinality
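One common way to optimize metric cardinality is to cap the distinct values per label, collapsing the overflow into an `other` bucket. A sketch of that guard (hypothetical helper, not part of this package):

```typescript
// Cap distinct values per label; overflow collapses into "other".
// Values admitted before the cap keep reporting under their own name.
class CardinalityLimiter {
  private seen = new Map<string, Set<string>>();

  constructor(private maxValues = 100) {}

  limit(label: string, value: string): string {
    let values = this.seen.get(label);
    if (!values) {
      values = new Set();
      this.seen.set(label, values);
    }
    if (values.has(value)) return value;
    if (values.size >= this.maxValues) return 'other';
    values.add(value);
    return value;
  }
}
```

Applying this to user-controlled label values (URLs, user ids) keeps the number of stored series bounded regardless of traffic.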
### Debug Mode
Enable debug logging for troubleshooting:
```typescript
const logger = new Logger({
  component: 'Monitoring',
  level: LogLevel.DEBUG
});
```

## Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request
## License
MIT License - see LICENSE file for details.
