@unrdf/observability
v26.4.8
Published
Innovative Prometheus/Grafana observability dashboard for UNRDF distributed workflows
Downloads
572
Maintainers
Readme
@unrdf/observability
Innovative Prometheus/Grafana observability dashboard for UNRDF distributed workflows
Real-time monitoring, alerting, and visualization for workflow execution, resource utilization, and business metrics.
Features
Metrics Collection
- Workflow Execution Metrics: Total executions, duration, active workflows
- Task Performance: Execution time, success rate, queue depth
- Resource Utilization: CPU, memory, disk monitoring
- Event Sourcing: Events appended, store size tracking
- Business Metrics: Policy evaluations, cryptographic receipts
- Latency Percentiles: p50, p90, p95, p99 tracking
Alerting System
- Threshold-Based Alerts: Configurable rules with hysteresis
- Anomaly Detection: Statistical z-score analysis
- Webhook Notifications: HTTP callbacks for alert events
- Alert Deduplication: Smart grouping and correlation
- Severity Levels: INFO, WARNING, CRITICAL
Grafana Dashboards
- Pre-built Dashboards: Ready-to-use visualizations
- Real-time Updates: 5-second refresh intervals
- Custom Variables: Filter by workflow ID, pattern
- Alert Annotations: Visual markers for events
- Exportable JSON: Import directly into Grafana
Installation
pnpm add @unrdf/observabilityQuick Start
Basic Usage
import { createWorkflowMetrics } from '@unrdf/observability';
const metrics = createWorkflowMetrics({
enableDefaultMetrics: true,
prefix: 'unrdf_workflow_',
labels: { environment: 'production' },
});
// Record workflow execution
metrics.recordWorkflowStart('wf-123', 'SEQUENCE');
// ... execute workflow ...
metrics.recordWorkflowComplete('wf-123', 'completed', 2.5, 'SEQUENCE');
// Export metrics
const prometheusMetrics = await metrics.getMetrics();Complete Observability Stack
import { createObservabilityStack } from '@unrdf/observability';
const { metrics, grafana, alerts } = await createObservabilityStack({
metrics: {
enableDefaultMetrics: true,
prefix: 'unrdf_workflow_',
},
alerts: {
rules: [
{
id: 'high-latency',
name: 'High Workflow Latency',
metric: 'workflow_duration',
threshold: 10,
operator: 'gt',
severity: 'warning',
},
],
enableAnomalyDetection: true,
},
grafana: {
title: 'Production Workflows',
datasource: 'Prometheus',
},
});
// Listen for alerts
alerts.on('alert', alert => {
console.log(`Alert: ${alert.name} - ${alert.severity}`);
});Live Demo
Run the included demo to see metrics in action:
cd packages/observability
pnpm install
pnpm demoThen visit:
- Prometheus Metrics: http://localhost:9090/metrics
- Metrics JSON: http://localhost:9090/metrics/json
- Grafana Dashboard: http://localhost:9090/dashboard
- Active Alerts: http://localhost:9090/alerts
- Statistics: http://localhost:9090/stats
API Reference
WorkflowMetrics
Constructor
new WorkflowMetrics(config);Config Options:
enableDefaultMetrics(boolean): Enable Node.js default metrics (default: true)prefix(string): Metric name prefix (default: 'unrdfworkflow')labels(object): Global labels for all metricscollectInterval(number): Collection interval in ms (default: 10000)
Methods
recordWorkflowStart(workflowId, pattern)
Record workflow execution start.
recordWorkflowComplete(workflowId, status, durationSeconds, pattern)
Record workflow completion with duration.
recordTaskExecution(workflowId, taskId, taskType, status, durationSeconds)
Record task execution metrics.
updateTaskQueueDepth(workflowId, queueName, depth)
Update task queue depth gauge.
recordResourceUtilization(resourceType, resourceId, utilizationPercent)
Record resource utilization (0-100%).
recordEventAppended(eventType, workflowId)
Record event appended to event store.
recordPolicyEvaluation(policyName, result)
Record policy evaluation result.
recordCryptoReceipt(workflowId, algorithm)
Record cryptographic receipt generation.
recordError(errorType, workflowId, severity)
Record error occurrence.
getMetrics()
Get metrics in Prometheus text format.
getMetricsJSON()
Get metrics in JSON format.
AlertManager
Constructor
new AlertManager(config);Config Options:
rules(array): Initial alert ruleswebhooks(array): Webhook endpointscheckInterval(number): Rule check interval (default: 10000)enableAnomalyDetection(boolean): Enable anomaly detection (default: true)
Methods
addRule(rule)
Add alert rule.
Rule Schema:
{
id: 'unique-rule-id',
name: 'Human Readable Name',
metric: 'metric_name',
threshold: 100,
operator: 'gt', // gt, lt, gte, lte, eq
severity: 'warning', // info, warning, critical
duration: 60000, // ms
enabled: true
}evaluateMetric(metricName, value, labels)
Evaluate metric against all rules.
getActiveAlerts()
Get currently active alerts.
getAlertHistory(filters)
Get alert history with optional filters.
getStatistics()
Get alert statistics.
Events
alert: Fired when alert triggersalert:resolved: Fired when alert resolveswebhook:error: Fired on webhook delivery failure
GrafanaExporter
Constructor
new GrafanaExporter(config);Config Options:
title(string): Dashboard titledatasource(string): Prometheus datasource name (default: 'Prometheus')refreshInterval(string): Dashboard refresh interval (default: '5s')tags(array): Dashboard tags
Methods
generateDashboard()
Generate complete Grafana dashboard configuration.
exportJSON(pretty)
Export dashboard as JSON string.
generateAlertDashboard()
Generate alert-focused dashboard.
Metrics Collected
Workflow Metrics
unrdf_workflow_executions_total(Counter): Total workflow executionsunrdf_workflow_execution_duration_seconds(Histogram): Workflow durationunrdf_workflow_active_workflows(Gauge): Active workflows
Task Metrics
unrdf_workflow_task_executions_total(Counter): Total task executionsunrdf_workflow_task_duration_seconds(Histogram): Task durationunrdf_workflow_task_queue_depth(Gauge): Queue depth
Resource Metrics
unrdf_workflow_resource_utilization(Gauge): Resource utilization %unrdf_workflow_resource_allocations_total(Counter): Resource allocations
Event Sourcing Metrics
unrdf_workflow_events_appended_total(Counter): Events appendedunrdf_workflow_event_store_size_bytes(Gauge): Event store size
Business Metrics
unrdf_workflow_policy_evaluations_total(Counter): Policy evaluationsunrdf_workflow_crypto_receipts_total(Counter): Crypto receiptsunrdf_workflow_errors_total(Counter): Errors
Performance Metrics
unrdf_workflow_latency_percentiles(Summary): Latency percentiles
Integration with Prometheus
Prometheus Configuration
scrape_configs:
- job_name: 'unrdf-workflows'
scrape_interval: 5s
static_configs:
- targets: ['localhost:9090']Alert Rules (prometheus.yml)
groups:
- name: unrdf_workflow_alerts
interval: 30s
rules:
- alert: HighWorkflowErrorRate
expr: rate(unrdf_workflow_errors_total[5m]) > 1
for: 5m
labels:
severity: critical
annotations:
summary: 'High workflow error rate detected'
description: 'Error rate is {{ $value }} errors/sec'
- alert: HighResourceUtilization
expr: unrdf_workflow_resource_utilization > 90
for: 5m
labels:
severity: warning
annotations:
summary: 'Resource utilization above 90%'
description: '{{ $labels.resource_type }} on {{ $labels.resource_id }}'Grafana Dashboard Import
- Export dashboard JSON:
curl http://localhost:9090/dashboard/export > dashboard.json- Import in Grafana:
- Navigate to Dashboards → Import
- Upload
dashboard.json - Select Prometheus datasource
- Click Import
Architecture
┌─────────────────────────────────────────────┐
│ Workflow Application │
│ (Records metrics via WorkflowMetrics) │
└─────────────────┬───────────────────────────┘
│
↓
┌─────────────────────────────────────────────┐
│ Prometheus Metrics Endpoint │
│ (Express Server) │
│ http://localhost:9090/metrics │
└─────────────────┬───────────────────────────┘
│
↓
┌─────────────────────────────────────────────┐
│ Prometheus Server (Scraper) │
│ - Scrapes metrics every 5s │
│ - Stores time-series data │
│ - Evaluates alert rules │
└─────────────────┬───────────────────────────┘
│
↓
┌─────────────────────────────────────────────┐
│ Grafana Dashboard │
│ - Visualizes metrics │
│ - Real-time graphs │
│ - Alert annotations │
└─────────────────────────────────────────────┘Alert Flow
Metric Value
↓
AlertManager.evaluateMetric()
↓
Check Threshold Rules
↓
Check Anomaly Detection (z-score)
↓
Alert Triggered?
↓
Emit 'alert' Event
↓
Send Webhook Notifications
↓
Store in Alert HistoryPerformance
- Overhead: <1ms per metric recording
- Memory: ~50MB for 1000 workflows
- Throughput: 10,000+ metrics/sec
- Alert Latency: <100ms detection to notification
Best Practices
1. Metric Cardinality
Keep label cardinality low to avoid memory issues:
// ✅ Good - bounded cardinality
metrics.recordWorkflowComplete(workflowId, 'completed', duration, 'SEQUENCE');
// ❌ Bad - unbounded cardinality
// metrics.recordWorkflow(userId, timestamp, randomValue);2. Alert Thresholds
Set thresholds based on baseline + 2σ:
const baseline = 5; // seconds
const stdDev = 1.5;
const threshold = baseline + 2 * stdDev; // 8 seconds
alerts.addRule({
id: 'high-latency',
metric: 'workflow_duration',
threshold,
operator: 'gt',
severity: 'warning',
});3. Dashboard Organization
- Group related panels
- Use template variables for filtering
- Set appropriate time ranges
- Enable auto-refresh (5-10s)
4. Webhook Reliability
- Use exponential backoff
- Implement idempotency
- Monitor webhook failures
- Set reasonable timeouts
Troubleshooting
Metrics Not Appearing
- Check metrics endpoint:
curl http://localhost:9090/metrics - Verify Prometheus scraping: Prometheus UI → Targets
- Check for metric name typos
Alerts Not Firing
- Verify rule configuration:
alerts.rules - Check metric values:
alerts.metricHistory - Enable debug logging:
alerts.on('alert', console.log)
High Memory Usage
- Reduce metric history: limit to 100-500 samples
- Lower cardinality: fewer unique label combinations
- Increase scrape interval: 10-30s instead of 5s
License
MIT
Contributing
See main UNRDF repository for contribution guidelines.
