@vicistack/asterisk-otel-observability
v1.0.0
Asterisk Observability with OpenTelemetry and Grafana — ViciStack call center engineering guide
# Asterisk Observability with OpenTelemetry and Grafana
How to actually see what's happening inside your Asterisk servers. OpenTelemetry as the collection layer, Prometheus for storage, Grafana for dashboards, and distributed tracing to follow a call from SIP INVITE to agent headset. Built from production VICIdial clusters pushing 200K+ daily calls.

---

I've been running Asterisk in production since the 1.4 days. For most of that time, "monitoring" meant SSH into the box, run `asterisk -rx "core show channels"`, squint at the output, and hope that the number of active channels looked about right. Maybe check `/var/log/asterisk/full` when something broke. Maybe not.

That stopped being acceptable around the time we crossed 50,000 daily calls across a 4-server cluster. When a SIP trunk goes down at 2 PM on a Tuesday and 300 agents go idle, you need to know in seconds, not whenever someone notices the real-time report looks weird and pings you on Slack.

This guide covers the full observability stack for Asterisk: metrics collection with OpenTelemetry, storage in Prometheus, visualization in Grafana, and distributed tracing for individual call flows. If you're running VICIdial, everything here applies — VICIdial's call processing is just Asterisk dialplan execution under the hood, and all the telemetry surfaces the same way.

---

## Why OpenTelemetry Instead of Just Prometheus

You could skip OpenTelemetry entirely. Install prometheus-node-exporter on your Asterisk box, write a script that scrapes `asterisk -rx` output into Prometheus metrics, and call it done. I've done exactly that. It works. It's also fragile, custom, and doesn't scale.

OpenTelemetry (OTel) gives you three things that roll-your-own monitoring doesn't:

**Vendor-neutral collection.** The OTel Collector speaks StatsD, Prometheus, OTLP, syslog, and dozens of other formats. Asterisk's built-in res_statsd module pushes metrics via StatsD. AMI events can be forwarded as structured logs. You don't have to write custom parsers — you configure receivers.

**Processing pipelines.**
OTel lets you filter, transform, aggregate, and route telemetry data before it hits your backend. Want to drop debug-level events but keep warnings? Want to add a cluster_name attribute to every metric? Want to sample 10% of traces for non-error calls? All configurable in the collector.

**Multi-backend export.** Send metrics to Prometheus, traces to Jaeger or Tempo, and logs to Loki — from one collector instance. If you ever want to switch from Prometheus to Mimir or from Jaeger to Tempo, you change one exporter config. Nothing on the Asterisk side changes.

That said, if you have a single Asterisk box running 5,000 calls a day, a Prometheus scraper script is probably fine. OTel shines when you have multiple servers, multiple signal types (metrics + traces + logs), or when you're tired of maintaining custom scripts.

---

## Architecture Overview

Here's what we're building:

```
┌──────────────────────────────────────────────────────┐
│ Asterisk Server                                      │
│                                                      │
│  res_statsd ───────────────→ OTel Collector (sidecar)│
│  AMI Events ──→ ami-otel-bridge ──→ OTel Collector   │
│  CDR/CEL ──→ MySQL ──→ mysqld_exporter ──→ Prometheus│
└──────────────────────────────────────────────────────┘
     │ metrics            │ traces             │ logs
     ▼                    ▼                    ▼
┌──────────┐      ┌────────────────┐      ┌──────────┐
│Prometheus│      │ Jaeger / Tempo │      │   Loki   │
└────┬─────┘      └───────┬────────┘      └────┬─────┘
     │                    │                    │
     └────────────────────┼────────────────────┘
                          ▼
                    ┌──────────┐
                    │ Grafana  │
                    └──────────┘
```

Components:

- res_statsd — Asterisk's built-in StatsD module. Emits metrics on channel counts, endpoint status, bridge operations, and more.
- OTel Collector — Runs as a sidecar process on the Asterisk server. Receives StatsD from res_statsd, processes it, exports to Prometheus.
- ami-otel-bridge — A small script that reads Asterisk Manager Interface (AMI) events and converts them to OTel spans/logs.
- mysqld_exporter — Exports MySQL metrics for CDR/CEL table monitoring.
- Prometheus — Time-series database. Stores all metrics.
- Grafana — Dashboards and alerting.
- Jaeger/Tempo — Distributed tracing backend for call flow traces.

Let's build each layer.

---

## Step 1: Install the OpenTelemetry Collector

The OTel Collector runs on each Asterisk server as a systemd service.

```bash
# Download the latest stable release (check https://github.com/open-telemetry/opentelemetry-collector-releases)
OTEL_VERSION="0.96.0"
curl -L "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v${OTEL_VERSION}/otelcol-contrib_${OTEL_VERSION}_linux_amd64.tar.gz" \
  -o /tmp/otelcol.tar.gz
tar xzf /tmp/otelcol.tar.gz -C /usr/local/bin/ otelcol-contrib
chmod +x /usr/local/bin/otelcol-contrib

# Verify
otelcol-contrib --version
```

Create the systemd unit:

```ini
# /etc/systemd/system/otelcol.service
[Unit]
Description=OpenTelemetry Collector
After=network.target

[Service]
Type=simple
User=otelcol
Group=otelcol
ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol/config.yaml
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```

Create the user and directories:

```bash
useradd --system --no-create-home --shell /usr/sbin/nologin otelcol
mkdir -p /etc/otelcol
chown otelcol:otelcol /etc/otelcol
```

---

## Step 2: Configure Asterisk's StatsD Module

Asterisk has had built-in StatsD support since version 13 through res_statsd. It's compiled in by default on most distributions but not loaded by default. Enable it:

```ini
# /etc/asterisk/statsd.conf
[general]
enabled = yes
server = 127.0.0.1:8125   ; OTel Collector's StatsD receiver
prefix = asterisk         ; All metrics will be prefixed with "asterisk."
add_newline = no
```

Load the module:

```bash
asterisk -rx "module load res_statsd.so"

# Verify it's loaded
asterisk -rx "module show like statsd"
```

Output should show:

```
Module            Description             Use Count  Status
res_statsd.so     StatsD client support   0          Running
```

### What Metrics Does res_statsd Emit?
Once loaded, Asterisk pushes the following metrics as StatsD gauges and counters:

| Metric | Type | Description |
|--------|------|-------------|
| asterisk.channels.count | gauge | Current active channel count |
| asterisk.channels.by_type.SIP | gauge | Active SIP channels |
| asterisk.channels.by_type.PJSIP | gauge | Active PJSIP channels |
| asterisk.channels.by_type.Local | gauge | Active Local channels |
| asterisk.endpoints.count | gauge | Registered endpoints |
| asterisk.endpoints.state.online | gauge | Endpoints in online state |
| asterisk.endpoints.state.offline | gauge | Endpoints in offline state |
| asterisk.bridges.count | gauge | Active bridges |
| asterisk.bridges.channels | gauge | Channels in bridges |

These are updated every 10 seconds by default. For a busy system, that's fine. If you need sub-second resolution (you probably don't), you can adjust the interval in statsd.conf.

---

## Step 3: OTel Collector Configuration

Here's the collector config that receives StatsD from Asterisk and exports to Prometheus:

```yaml
# /etc/otelcol/config.yaml
receivers:
  # Receive StatsD metrics from res_statsd
  statsd:
    endpoint: "0.0.0.0:8125"
    aggregation_interval: 10s
    timer_histogram_mapping:
      - statsd_type: "timer"
        observer_type: "histogram"
        histogram:
          explicit:
            - 10
            - 25
            - 50
            - 100
            - 250
            - 500
            - 1000
            - 5000
            - 10000

  # Scrape host metrics (CPU, memory, disk, network)
  hostmetrics:
    collection_interval: 15s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      disk: {}
      network: {}
      load: {}

  # Receive OTLP from custom instrumentation (ami-otel-bridge)
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  # Add resource attributes to every metric
  resource:
    attributes:
      - key: service.name
        value: "asterisk"
        action: upsert
      - key: host.name
        value: "dialer01"   # set per host (or use the resourcedetection processor)
        action: upsert
      - key: cluster.name
        value: "vicidial-prod"
        action: upsert
      - key: server.role
        value: "dialer"
        action: upsert

  # Batch metrics to reduce export overhead
  batch:
    timeout: 10s
    send_batch_size: 1000

  # Memory limiter to prevent OOM
  memory_limiter:
    check_interval: 5s
    limit_mib: 256
    spike_limit_mib: 64

exporters:
  # Export metrics to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    # namespace: "asterisk"  # omitted: metrics already carry the "asterisk." prefix from statsd.conf
    resource_to_telemetry_conversion:
      enabled: true

  # Export traces to Jaeger (or Tempo)
  otlp/jaeger:
    endpoint: "jaeger.monitoring.local:4317"
    tls:
      insecure: true

  # Export logs to Loki
  loki:
    endpoint: "http://loki.monitoring.local:3100/loki/api/v1/push"
    labels:
      attributes:
        service.name: "service_name"
        host.name: "hostname"

  # Debug output (disable in production)
  # debug:
  #   verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [statsd, hostmetrics]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/jaeger]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
  telemetry:
    logs:
      level: "warn"
    metrics:
      address: ":8888"
```

Start the collector:

```bash
systemctl daemon-reload
systemctl enable otelcol
systemctl start otelcol

# Verify it's running and receiving StatsD
curl -s http://localhost:8889/metrics | grep asterisk_channels
```

You should see Prometheus-formatted metrics:

```
# HELP asterisk_channels_count Current active channel count
# TYPE asterisk_channels_count gauge
asterisk_channels_count{cluster_name="vicidial-prod",host_name="dialer01",server_role="dialer"} 47
```

---

## Step 4: Custom Metrics via AMI

StatsD gives you the basics — channel counts, endpoint status, bridge counts. But for VICIdial-specific observability, you need more. The Asterisk Manager Interface (AMI) emits events for everything: new channels, hangups, DTMF, queue joins, agent status changes, you name it.
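Before wiring up the full bridge, it helps to see what AMI actually puts on the wire: events are plain `Key: Value` text blocks separated by blank lines. Here's a minimal parsing sketch — the event text and field values below are illustrative, not captured from a real call:

```python
# A Newchannel event as AMI would send it (illustrative values;
# real events carry more fields).
RAW = (
    "Event: Newchannel\r\n"
    "Channel: PJSIP/agent-101-00000042\r\n"
    "Uniqueid: 1700000000.66\r\n"
    "CallerIDNum: 15551234567\r\n"
)

def parse_ami_event(raw: str) -> dict:
    """Split 'Key: Value' lines into a dict, ignoring malformed lines."""
    event = {}
    for line in raw.strip().split("\r\n"):
        if ": " in line:
            key, value = line.split(": ", 1)
            event[key.strip()] = value.strip()
    return event

evt = parse_ami_event(RAW)
print(evt["Event"])                   # Newchannel
print(evt["Channel"].split("/")[0])   # PJSIP — the channel technology
```

This is exactly the shape of data the bridge below consumes; the `channel_type` label on the call counter comes from that `split("/")[0]`.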
Here's a Python script that connects to AMI, listens for events, and pushes them to the OTel Collector as metrics and traces:

```python
#!/usr/bin/env python3
"""
ami_otel_bridge.py — Bridge AMI events to OpenTelemetry

Runs as a daemon alongside Asterisk.
"""
import socket
import time
import os

from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# OTel setup
metric_exporter = OTLPMetricExporter(endpoint="localhost:4317", insecure=True)
metric_reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=10000)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("ami-bridge")

trace_exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(trace_exporter))
tracer = trace.get_tracer("ami-bridge")

# Metrics
calls_total = meter.create_counter("asterisk.calls.total", description="Total calls")
calls_active = meter.create_up_down_counter("asterisk.calls.active", description="Active calls")
calls_by_disposition = meter.create_counter("asterisk.calls.by_disposition", description="Calls by disposition")
sip_registrations = meter.create_up_down_counter("asterisk.sip.registrations", description="SIP registration events")
queue_callers = meter.create_up_down_counter("asterisk.queue.callers", description="Callers waiting in queue")
call_duration = meter.create_histogram("asterisk.call.duration_ms", description="Call duration in milliseconds")

# Track active call spans for distributed tracing
active_spans = {}

AMI_HOST = os.environ.get("AMI_HOST", "127.0.0.1")
AMI_PORT = int(os.environ.get("AMI_PORT", "5038"))
AMI_USER = os.environ.get("AMI_USER", "admin")
AMI_SECRET = os.environ.get("AMI_SECRET", "amp111")


def connect_ami():
    """Connect to AMI and authenticate."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(30)
    sock.connect((AMI_HOST, AMI_PORT))
    # Read banner
    sock.recv(1024)
    # Login
    login = (
        f"Action: Login\r\n"
        f"Username: {AMI_USER}\r\n"
        f"Secret: {AMI_SECRET}\r\n"
        f"Events: call,agent,cdr\r\n"
        f"\r\n"
    )
    sock.sendall(login.encode())
    response = sock.recv(4096).decode()
    if "Success" not in response:
        raise ConnectionError(f"AMI login failed: {response}")
    print(f"[ami-otel] Connected to AMI at {AMI_HOST}:{AMI_PORT}")
    return sock


def parse_event(raw):
    """Parse an AMI event into a dict."""
    event = {}
    for line in raw.strip().split("\r\n"):
        if ": " in line:
            key, value = line.split(": ", 1)
            event[key.strip()] = value.strip()
    return event


def handle_event(event):
    """Process an AMI event and emit OTel signals."""
    event_type = event.get("Event", "")

    if event_type == "Newchannel":
        channel = event.get("Channel", "unknown")
        calls_total.add(1, {"channel_type": channel.split("/")[0]})
        calls_active.add(1)
        # Start a trace span for this call
        uniqueid = event.get("Uniqueid", "")
        if uniqueid:
            span = tracer.start_span(
                "asterisk.call",
                attributes={
                    "asterisk.channel": channel,
                    "asterisk.uniqueid": uniqueid,
                    "asterisk.caller_id": event.get("CallerIDNum", ""),
                    "asterisk.context": event.get("Context", ""),
                    "asterisk.exten": event.get("Exten", ""),
                },
            )
            active_spans[uniqueid] = {
                "span": span,
                "start_time": time.time(),
            }

    elif event_type == "Hangup":
        calls_active.add(-1)
        uniqueid = event.get("Uniqueid", "")
        cause = event.get("Cause-txt", "Unknown")
        # End the trace span
        if uniqueid in active_spans:
            span_data = active_spans.pop(uniqueid)
            duration_ms = (time.time() - span_data["start_time"]) * 1000
            span_data["span"].set_attribute("asterisk.hangup_cause", cause)
            span_data["span"].set_attribute("asterisk.duration_ms", duration_ms)
            span_data["span"].end()
            call_duration.record(duration_ms, {"cause": cause})

    elif event_type == "AgentComplete":
        dispo = event.get("Reason", "unknown")
        calls_by_disposition.add(1, {"disposition": dispo})

    elif event_type == "PeerStatus":
        peer = event.get("Peer", "")
        status = event.get("PeerStatus", "")
        if status == "Registered":
            sip_registrations.add(1, {"peer": peer})
        elif status == "Unregistered":
            sip_registrations.add(-1, {"peer": peer})

    elif event_type == "Join":
        queue_callers.add(1, {"queue": event.get("Queue", "unknown")})

    elif event_type == "Leave":
        queue_callers.add(-1, {"queue": event.get("Queue", "unknown")})


def main():
    while True:
        try:
            sock = connect_ami()
            buffer = ""
            while True:
                data = sock.recv(4096).decode("utf-8", errors="replace")
                if not data:
                    raise ConnectionError("AMI connection lost")
                buffer += data
                # AMI events are separated by \r\n\r\n
                while "\r\n\r\n" in buffer:
                    raw_event, buffer = buffer.split("\r\n\r\n", 1)
                    if raw_event.strip():
                        event = parse_event(raw_event)
                        if "Event" in event:
                            handle_event(event)
        except Exception as e:
            print(f"[ami-otel] Error: {e}, reconnecting in 5s...")
            time.sleep(5)


if __name__ == "__main__":
    main()
```

Install dependencies and run as a service:

```bash
pip3 install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc

# Create systemd unit
cat > /etc/systemd/system/ami-otel-bridge.service << 'EOF'
[Unit]
Description=AMI to OpenTelemetry Bridge
After=asterisk.service otelcol.service

[Service]
Type=simple
User=asterisk
Environment=AMI_HOST=127.0.0.1
Environment=AMI_PORT=5038
Environment=AMI_USER=admin
Environment=AMI_SECRET=your_ami_password_here
ExecStart=/usr/bin/python3 /usr/local/bin/ami_otel_bridge.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable ami-otel-bridge
systemctl start ami-otel-bridge
```

Now you have
two metric sources feeding the OTel Collector: res_statsd for Asterisk internals, and the AMI bridge for call-level events and distributed traces.

---

## Step 5: Prometheus Configuration

Prometheus needs to scrape the OTel Collector's Prometheus exporter endpoint:

```yaml
# /etc/prometheus/prometheus.yml (add to scrape_configs)
scrape_configs:
  - job_name: 'asterisk-otel'
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'dialer01.internal:8889'
          - 'dialer02.internal:8889'
          - 'dialer03.internal:8889'
        labels:
          environment: 'production'

  # Also scrape the OTel Collector's own health metrics
  - job_name: 'otel-collector'
    scrape_interval: 30s
    static_configs:
      - targets:
          - 'dialer01.internal:8888'
          - 'dialer02.internal:8888'
          - 'dialer03.internal:8888'
```

### Recording Rules for Call Center KPIs

Raw metrics are useful, but derived metrics are where the value lives. Set up recording rules in Prometheus:

```yaml
# /etc/prometheus/rules/asterisk.yml
groups:
  - name: asterisk_kpis
    interval: 30s
    rules:
      # Calls per minute (cluster-wide)
      - record: asterisk:calls_per_minute
        expr: sum(rate(asterisk_calls_total[5m])) * 60

      # Median call duration (p50, 5-minute window)
      - record: asterisk:median_call_duration_sec
        expr: |
          histogram_quantile(0.5,
            rate(asterisk_call_duration_ms_bucket[5m])
          ) / 1000

      # 95th percentile call duration
      - record: asterisk:p95_call_duration_sec
        expr: |
          histogram_quantile(0.95,
            rate(asterisk_call_duration_ms_bucket[5m])
          ) / 1000
```
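As a sanity check on what `asterisk:calls_per_minute` computes, here's the same arithmetic in plain Python, assuming two counter samples taken five minutes apart. This is a simplified sketch of PromQL `rate()` semantics — Prometheus additionally extrapolates to the window edges and handles counter resets, which this ignores:

```python
def calls_per_minute(samples):
    """samples: list of (unix_ts, counter_value) for asterisk_calls_total.
    Per-second increase over the window, scaled to per-minute."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    per_second = (v1 - v0) / (t1 - t0)
    return per_second * 60

# 900 calls counted over a 300 s window → 180 calls/minute
samples = [(1700000000, 12_400), (1700000300, 13_300)]
print(calls_per_minute(samples))  # 180.0
```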
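For the duration rules, here's a rough Python model of what `histogram_quantile()` does over explicit buckets: find the bucket where the target rank falls and interpolate linearly inside it, with the lowest bucket assumed to start at 0. This is a sketch that ignores the `+Inf` bucket and other edge cases Prometheus handles:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound_ms, cumulative_count) pairs, as in a
    Prometheus classic histogram. Returns the approximate q-quantile."""
    rank = q * buckets[-1][1]          # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 calls: 10 under 100 ms, 60 under 250 ms, 90 under 500 ms, all under 1000 ms
buckets = [(100, 10), (250, 60), (500, 90), (1000, 100)]
print(histogram_quantile(0.5, buckets))   # 220.0 ms (the median)
print(histogram_quantile(0.95, buckets))  # 750.0 ms
```

This is also why the bucket boundaries in the collector's `timer_histogram_mapping` matter: a p95 that lands in a wide bucket (500–1000 ms here) is only as precise as that bucket.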
## About
Built by ViciStack — enterprise VoIP and call center infrastructure.
- VICIdial Hosting & Optimization
- Call Center Performance Guides
- Full Article: Asterisk Observability with OpenTelemetry and Grafana
## License
MIT
