@unrdf/yawl-observability
v26.4.4
Workflow observability framework with Prometheus metrics and OpenTelemetry tracing for YAWL
Comprehensive workflow observability framework for the YAWL (Yet Another Workflow Language) engine, providing Prometheus metrics, OpenTelemetry distributed tracing, and custom Service Level Indicators (SLIs).
Features
- Prometheus Metrics: Complete workflow execution metrics in Prometheus format
- OpenTelemetry Tracing: Distributed tracing with receipt-based proof correlation
- Custom SLIs: Service Level Indicators for workflow performance monitoring
- Grafana Integration: Pre-built dashboard for visualization
- SLO Compliance: Automated Service Level Objective monitoring
Installation
pnpm add @unrdf/yawl-observability
Quick Start
import express from 'express';
import { createWorkflowEngine } from '@unrdf/yawl';
import {
  YAWLMetricsCollector,
  YAWLTracer,
  YAWLSLICalculator
} from '@unrdf/yawl-observability';
// Create YAWL engine
const engine = createWorkflowEngine();
// Setup observability
const metrics = new YAWLMetricsCollector(engine, {
  defaultLabels: { environment: 'production', region: 'us-east-1' }
});
const tracer = new YAWLTracer(engine, {
  includeReceiptHashes: true
});
const sli = new YAWLSLICalculator(engine, {
  targetCompletionRate: 0.95,
  targetTaskSuccessRate: 0.99,
  targetP95Latency: 5.0
});
// Expose metrics endpoint (Express example)
const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', metrics.contentType);
  res.end(await metrics.getMetrics());
});
// Get SLI snapshot
const snapshot = sli.getSnapshot();
console.log(`SLO Compliance: ${(snapshot.sloCompliance.score * 100).toFixed(1)}%`);
Metrics Catalog
Case Lifecycle Metrics
| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| yawl_workflow_cases_total | Counter | Total cases by status | workflow_id, status |
| yawl_case_completion_time_seconds | Histogram | End-to-end case duration | workflow_id |
Status values: created, started, completed
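To make the table concrete, here is a rough sketch of how a counter with these labels would appear in the Prometheus text exposition format that a metrics endpoint returns. The `renderCounter` and `escapeLabelValue` helpers, the workflow ID, and the sample values are illustrative assumptions and are not part of this package.

```javascript
// Hypothetical helpers (not part of @unrdf/yawl-observability): render one
// counter family in the Prometheus text exposition format.
function escapeLabelValue(value) {
  // Backslash, double quote, and newline must be escaped inside label values.
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, labelValue]) => `${key}="${escapeLabelValue(labelValue)}"`)
      .join(',');
    lines.push(`${name}{${labelText}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

// Assumed sample data: one workflow with 12 created and 9 completed cases.
const exposition = renderCounter('yawl_workflow_cases_total', 'Total cases by status', [
  { labels: { workflow_id: 'order-fulfilment', status: 'created' }, value: 12 },
  { labels: { workflow_id: 'order-fulfilment', status: 'completed' }, value: 9 },
]);
console.log(exposition);
```

Each label combination becomes its own time series, which is why queries that divide one status by another need to aggregate away the `status` label first.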
Example queries:
# Completion rate by workflow (aggregate away the status label so both sides match)
sum by (workflow_id) (rate(yawl_workflow_cases_total{status="completed"}[5m]))
  / sum by (workflow_id) (rate(yawl_workflow_cases_total{status="created"}[5m]))
# Median (p50) case completion time
histogram_quantile(0.50, rate(yawl_case_completion_time_seconds_bucket[5m]))
Task Execution Metrics
| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| yawl_workflow_tasks_total | Counter | Total tasks by status | workflow_id, task_id, status |
| yawl_task_duration_seconds | Histogram | Task execution time | workflow_id, task_id |
| yawl_task_wait_time_seconds | Histogram | Time from enabled to started | workflow_id, task_id |
| yawl_workflow_active_tasks | Gauge | Currently running tasks | workflow_id |
| yawl_workflow_enabled_tasks | Gauge | Currently enabled tasks | workflow_id |
| yawl_task_errors_total | Counter | Task failures and timeouts | workflow_id, task_id, error_type |
Status values: enabled, started, completed, cancelled
Error types: cancelled, timeout, failed
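Percentiles over histogram metrics such as yawl_task_duration_seconds are estimates: Prometheus locates the bucket containing the requested rank and interpolates linearly inside it. The sketch below mirrors that idea for classic cumulative buckets. The `estimateQuantile` function and the bucket counts are assumptions made for illustration, and the sketch expects 0 < q <= 1 and at least one finite bucket.

```javascript
// Hypothetical sketch of bucket-based quantile estimation.
// `buckets` holds cumulative counts sorted by upper bound; the last bound is Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const index = buckets.findIndex((bucket) => bucket.count >= rank);
  // A rank that falls in the +Inf bucket is reported as the highest finite bound.
  if (buckets[index].le === Infinity) return buckets[index - 1].le;
  const lowerBound = index === 0 ? 0 : buckets[index - 1].le;
  const lowerCount = index === 0 ? 0 : buckets[index - 1].count;
  const countInBucket = buckets[index].count - lowerCount;
  const fraction = (rank - lowerCount) / countInBucket;
  return lowerBound + (buckets[index].le - lowerBound) * fraction;
}

// Assumed observations: 100 task durations spread over five buckets.
const taskDurationBuckets = [
  { le: 0.1, count: 40 },
  { le: 0.5, count: 80 },
  { le: 1, count: 95 },
  { le: 2.5, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.5, taskDurationBuckets)); // roughly 0.2
console.log(estimateQuantile(0.95, taskDurationBuckets)); // roughly 1
```

Because the result depends on bucket boundaries, the accuracy of a reported p95 is bounded by how finely the buckets cover the latency range of interest.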
Example queries:
# Task p95 execution time
histogram_quantile(0.95, rate(yawl_task_duration_seconds_bucket[5m]))
# Task error rate by workflow (aggregate both sides so their label sets match)
sum by (workflow_id) (rate(yawl_task_errors_total[5m]))
  / sum by (workflow_id) (rate(yawl_workflow_tasks_total{status="started"}[5m]))
# Current workload
sum(yawl_workflow_active_tasks) by (workflow_id)
Pattern Usage Metrics
| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| yawl_pattern_usage_count | Counter | YAWL pattern usage | workflow_id, task_id, pattern_type, operation |
Pattern types: and, xor, or, sequence
Operations: split, join
Example queries:
# Pattern usage distribution
sum by (pattern_type, operation) (yawl_pattern_usage_count)
# XOR split usage rate
rate(yawl_pattern_usage_count{pattern_type="xor", operation="split"}[5m])
Resource Metrics
| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| yawl_resource_allocations_total | Counter | Resource allocation events | workflow_id, role, resource_id |
| yawl_resource_utilization | Gauge | Resource utilization (0-1) | resource_id, role |
Example queries:
# Average resource utilization
avg(yawl_resource_utilization)
# Resource allocation rate by role
sum by (role) (rate(yawl_resource_allocations_total[5m]))
Service Level Indicators (SLIs)
| Metric | Type | Description |
|--------|------|-------------|
| yawl_sli_completion_rate | Gauge | Workflow completion rate (0-1) |
| yawl_sli_task_success_rate | Gauge | Task success rate (0-1) |
| yawl_sli_task_error_rate | Gauge | Task error rate (0-1) |
| yawl_sli_p95_latency_seconds | Gauge | 95th percentile task latency |
| yawl_sli_resource_utilization | Gauge | Average resource utilization |
| yawl_slo_compliance | Gauge | SLO compliance score (0-1) |
Example queries:
# SLO compliance over time
yawl_slo_compliance
# Task success rate trend (5-minute moving average; rate() applies only to counters)
avg_over_time(yawl_sli_task_success_rate[5m])
OpenTelemetry Tracing
Span Structure
workflow.case (caseId)
├── task.taskId1 (workItemId1)
├── task.taskId2 (workItemId2)
└── task.taskId3 (workItemId3)
Span Attributes
Case Spans:
- workflow.id: Workflow definition ID
- workflow.case.id: Unique case ID
- workflow.case.status: Case status
Task Spans:
- workflow.id: Workflow definition ID
- workflow.case.id: Case ID
- workflow.task.id: Task definition ID
- workflow.task.work_item_id: Work item instance ID
- workflow.task.status: Task status
- workflow.task.resource_id: Assigned resource
- workflow.task.actor: Actor who executed task
Receipt Correlation (when enabled):
- workflow.receipt.hash: BLAKE3 receipt hash
- workflow.receipt.previous_hash: Previous receipt hash
- workflow.receipt.event_type: Event type
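As an illustration of how the receipt attributes could be used, the sketch below maps a receipt onto the three attribute names listed above and checks that consecutive spans form an unbroken hash chain. The receipt field names (`hash`, `previousHash`, `eventType`), the event type strings, and both helper functions are assumptions for this example; the README does not specify the receipt object's shape.

```javascript
// Hypothetical sketch: the receipt shape below is assumed, not taken from the package.
function receiptToSpanAttributes(receipt) {
  const attributes = {
    'workflow.receipt.hash': receipt.hash,
    'workflow.receipt.event_type': receipt.eventType,
  };
  // The first receipt in a chain has no predecessor, so the attribute is omitted.
  if (receipt.previousHash) {
    attributes['workflow.receipt.previous_hash'] = receipt.previousHash;
  }
  return attributes;
}

// Returns true when every span's previous_hash equals the hash of the span before it.
function isReceiptChainLinked(attributeSets) {
  return attributeSets.every(
    (attributes, index) =>
      index === 0 ||
      attributes['workflow.receipt.previous_hash'] ===
        attributeSets[index - 1]['workflow.receipt.hash']
  );
}

// Assumed sample receipts with placeholder hash strings.
const receipts = [
  { hash: 'aaa111', previousHash: null, eventType: 'CASE_CREATED' },
  { hash: 'bbb222', previousHash: 'aaa111', eventType: 'TASK_ENABLED' },
  { hash: 'ccc333', previousHash: 'bbb222', eventType: 'TASK_COMPLETED' },
];
const spanAttributes = receipts.map(receiptToSpanAttributes);
console.log(isReceiptChainLinked(spanAttributes)); // true
```

A check like this can be run over exported spans to confirm that a trace and its receipts describe the same sequence of events.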
Receipt-Correlated Tracing
const tracer = new YAWLTracer(engine, {
  includeReceiptHashes: true
});
// Execute operation with cryptographic proof
await tracer.spanWithReceipt('external.approval', receipt, async () => {
  return await externalService.approve(data);
});
Custom Spans
// Create custom span within case context
await tracer.spanInCase(caseId, 'custom.validation', async () => {
  // Your custom logic
  return validateData(data);
});
Service Level Indicators (SLIs)
Configuration
const sli = new YAWLSLICalculator(engine, {
  windowMs: 300000, // 5 minute window
  targetCompletionRate: 0.95, // 95% completion target
  targetTaskSuccessRate: 0.99, // 99% success target
  targetP95Latency: 5.0, // 5 second p95 target
  targetResourceUtilization: 0.80 // 80% utilization target
});
Getting SLI Snapshots
const snapshot = sli.getSnapshot();
console.log({
  completionRate: snapshot.completionRate,
  taskSuccessRate: snapshot.taskSuccessRate,
  p95Latency: snapshot.p95Latency,
  sloCompliance: snapshot.sloCompliance.score
});
SLO Compliance Report
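An SLO report compares each current SLI value with its configured target. The following is a minimal sketch of one way such a comparison could work, assuming the score is the fraction of targets met, that rates pass when at or above target, and that latency passes when at or below target. The `buildSloReport` function, the 'FAIL' status string, and the sample values are assumptions. Resource utilization is left out because the README does not say in which direction its target applies.

```javascript
// Hypothetical sketch of an SLO comparison; not the package's implementation.
function buildSloReport(current, targets) {
  const checks = [
    { name: 'Completion Rate', current: current.completionRate,
      target: targets.targetCompletionRate, meets: (value, target) => value >= target },
    { name: 'Task Success Rate', current: current.taskSuccessRate,
      target: targets.targetTaskSuccessRate, meets: (value, target) => value >= target },
    { name: 'P95 Latency', current: current.p95Latency,
      target: targets.targetP95Latency, meets: (value, target) => value <= target },
  ];
  const metrics = checks.map(({ name, current: value, target, meets }) => {
    const compliant = meets(value, target);
    return { name, current: value, target, compliant, status: compliant ? 'PASS' : 'FAIL' };
  });
  const meetsCount = metrics.filter((metric) => metric.compliant).length;
  const totalCount = metrics.length;
  return {
    overall: { compliant: meetsCount === totalCount, score: meetsCount / totalCount, meetsCount, totalCount },
    metrics,
  };
}

// Assumed sample values: the task success rate misses its 99% target.
const sketchReport = buildSloReport(
  { completionRate: 0.98, taskSuccessRate: 0.97, p95Latency: 4.2 },
  { targetCompletionRate: 0.95, targetTaskSuccessRate: 0.99, targetP95Latency: 5.0 }
);
console.log(sketchReport.overall);
```

With these inputs two of the three checks pass, so the overall result is non-compliant even though the completion rate and latency targets are met.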
const report = sli.getSLOReport();
console.log(report);
// {
//   timestamp: '2024-01-15T10:30:00.000Z',
//   overall: {
//     compliant: true,
//     score: 1.0,
//     meetsCount: 4,
//     totalCount: 4
//   },
//   metrics: [
//     {
//       name: 'Completion Rate',
//       current: 0.98,
//       target: 0.95,
//       compliant: true,
//       status: 'PASS'
//     },
//     ...
//   ]
// }
Grafana Dashboard
Import the pre-built dashboard for complete visualization:
# Dashboard location
packages/yawl-observability/src/examples/grafana-dashboard.json
Dashboard Panels
- SLO Compliance Score - Overall compliance gauge
- Workflow Completion Rate - Success rate over time
- Task Success Rate - Task-level reliability
- P95 Task Latency - Performance monitoring
- Task Error Rate - Error trending
- Resource Utilization - Resource efficiency
- Cases by Status - Case throughput
- Task Duration Distribution - Latency percentiles (p50, p95, p99)
- Active vs Enabled Tasks - Workflow state
- Pattern Usage - YAWL pattern distribution
- Task Errors by Type - Error breakdown
- Case Completion Time - End-to-end duration heatmap
- Task Wait Time - Queueing metrics
- Resource Allocations - Resource activity
- Per-Resource Utilization - Individual resource metrics
Configuration Steps
- Install Grafana and Prometheus
- Configure Prometheus to scrape your metrics endpoint
- Import grafana-dashboard.json into Grafana
- Select Prometheus datasource
- Set refresh interval (recommended: 10s)
Advanced Usage
Custom Metrics Labels
const metrics = new YAWLMetricsCollector(engine, {
  prefix: 'myapp_yawl',
  defaultLabels: {
    environment: 'production',
    region: 'us-east-1',
    cluster: 'workflow-cluster-1',
    version: '2.1.0'
  }
});
Multiple Exporters
import { NodeSDK } from '@opentelemetry/sdk-node';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
// Setup Prometheus + Jaeger
const prometheusExporter = new PrometheusExporter({ port: 9464 });
const jaegerExporter = new JaegerExporter({ endpoint: 'http://jaeger:14268/api/traces' });
const sdk = new NodeSDK({
  metricReader: prometheusExporter,
  traceExporter: jaegerExporter
});
sdk.start();
// Now create YAWL observability
const tracer = new YAWLTracer(engine);
Alerting Rules
Example Prometheus alerting rules:
groups:
  - name: yawl_slo_alerts
    rules:
      - alert: WorkflowCompletionRateLow
        expr: yawl_sli_completion_rate < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Workflow completion rate below target"
          description: "Completion rate {{ $value }} is below 95% target"
      - alert: TaskErrorRateHigh
        expr: yawl_sli_task_error_rate > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Task error rate above threshold"
          description: "Error rate {{ $value }} exceeds 5% threshold"
      - alert: P95LatencyHigh
        expr: yawl_sli_p95_latency_seconds > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency exceeds target"
          description: "P95 latency {{ $value }}s exceeds 10s target"
API Reference
YAWLMetricsCollector
Constructor
new YAWLMetricsCollector(engine, config)
Config Options:
- prefix: Metric name prefix (default: 'yawl')
- collectDefaultMetrics: Enable Node.js metrics (default: true)
- defaultMetricsInterval: Collection interval in ms (default: 10000)
- defaultLabels: Custom labels for all metrics (default: {})
- durationBuckets: Histogram buckets in seconds (default: [0.001, 0.01, 0.1, 0.5, 1, 2.5, 5, 10, 30, 60])
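The durationBuckets option sets the upper bounds of the cumulative histogram buckets. The sketch below shows how a handful of observed durations would be counted against the default bounds, assuming standard Prometheus semantics in which each bucket counts every observation less than or equal to its bound and an implicit +Inf bucket counts everything. The `bucketize` helper and the sample durations are illustrative assumptions.

```javascript
// Default bounds as listed in the config options, in seconds.
const durationBuckets = [0.001, 0.01, 0.1, 0.5, 1, 2.5, 5, 10, 30, 60];

// Hypothetical helper: builds cumulative bucket counts for a list of durations.
function bucketize(durationsSeconds, upperBounds) {
  const bounds = [...upperBounds, Infinity]; // +Inf bucket holds all observations
  return bounds.map((le) => ({
    le,
    count: durationsSeconds.filter((duration) => duration <= le).length,
  }));
}

// Assumed sample: five task durations, one of them beyond the largest bound.
const histogram = bucketize([0.05, 0.3, 0.3, 4, 75], durationBuckets);
for (const { le, count } of histogram) {
  console.log(`le=${le === Infinity ? '+Inf' : le} count=${count}`);
}
```

The 75-second observation lands only in the +Inf bucket, so any percentile that falls there cannot be resolved beyond "more than 60 seconds". Workflows with long-running tasks may need larger custom bounds.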
Methods
- async getMetrics(): Get metrics in Prometheus format
- getMetric(name): Get specific metric by name
- resetMetrics(): Reset all metric values
- destroy(): Remove event listeners and cleanup
YAWLTracer
Constructor
new YAWLTracer(engine, config)
Config Options:
- tracerName: Tracer name (default: '@unrdf/yawl')
- tracerVersion: Tracer version (optional)
- enableContextPropagation: Auto context propagation (default: true)
- includeReceiptHashes: Add receipt hashes to spans (default: true)
- includeTaskData: Include task I/O in spans (default: false)
- maxTaskDataSize: Max task data size in chars (default: 1000)
Methods
- getCaseSpan(caseId): Get active case span
- getTaskSpan(workItemId): Get active task span
- async spanWithReceipt(name, receipt, fn): Execute with receipt correlation
- async spanInCase(caseId, name, fn): Execute custom span in case context
- destroy(): Remove event listeners and cleanup
YAWLSLICalculator
Constructor
new YAWLSLICalculator(engine, config)
Config Options:
- windowMs: Time window in ms (default: 300000 = 5 min)
- targetCompletionRate: Target completion rate (default: 0.95)
- targetTaskSuccessRate: Target task success rate (default: 0.99)
- targetP95Latency: Target p95 latency in seconds (default: 5.0)
- targetResourceUtilization: Target resource utilization (default: 0.80)
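The windowMs option limits which observations contribute to each SLI. As a rough illustration, the sketch below computes a p95 latency over only the samples inside the window, using the nearest-rank method. The `p95WithinWindow` function, the sample shape, and the choice of nearest-rank are assumptions; the package may compute its percentile differently.

```javascript
// Hypothetical sketch: nearest-rank p95 over samples inside a sliding window.
function p95WithinWindow(samples, nowMs, windowMs) {
  const recentDurations = samples
    .filter((sample) => nowMs - sample.timestampMs <= windowMs)
    .map((sample) => sample.durationSeconds)
    .sort((a, b) => a - b);
  if (recentDurations.length === 0) return null; // nothing observed in the window
  // Nearest rank: smallest value with at least 95% of samples at or below it.
  const rank = Math.ceil((95 * recentDurations.length) / 100);
  return recentDurations[rank - 1];
}

// Assumed data: 20 recent samples lasting 1..20 seconds, plus one old 500 s outlier.
const nowMs = 1000000;
const latencySamples = [];
for (let i = 1; i <= 20; i += 1) {
  latencySamples.push({ timestampMs: nowMs - i * 1000, durationSeconds: i });
}
latencySamples.push({ timestampMs: nowMs - 400000, durationSeconds: 500 });

console.log(p95WithinWindow(latencySamples, nowMs, 300000)); // 19 (outlier is outside the window)
```

A shorter window reacts faster to regressions but is noisier when few tasks complete; a longer one smooths the signal at the cost of slower detection.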
Methods
- calculateCompletionRate(): Get workflow completion rate
- calculateTaskSuccessRate(): Get task success rate
- calculateTaskErrorRate(): Get task error rate
- calculateP95Latency(): Get p95 task latency
- calculateP99Latency(): Get p99 task latency
- calculateMedianLatency(): Get median task latency
- calculateResourceUtilization(): Get resource utilization
- calculateSLOCompliance(): Get SLO compliance status
- getSnapshot(): Get complete SLI snapshot
- getSLOReport(): Get detailed SLO report
- toPrometheus(): Export SLIs as Prometheus metrics
- reset(): Clear collected data
- destroy(): Remove event listeners and cleanup
Examples
See src/examples/basic-usage.mjs for a complete working example.
License
MIT
Contributing
Contributions welcome! Please see the main UNRDF repository for contribution guidelines.
