ecip-observability-stack
v1.0.0
Published
ECIP M08 — Observability Stack: OTel Collector, Prometheus, Tempo, Grafana, Elasticsearch
Readme
ecip-observability-stack (M08 — Observability Stack)
Team: Platform/Infra · Phase: 5 (Weeks 23–28) · Priority: P5
Platform-wide observability infrastructure. Provides distributed tracing, metrics collection, log aggregation, Grafana dashboards, and alerting for all ECIP modules via OpenTelemetry auto-instrumentation.
Responsibilities
- Deploy and maintain OpenTelemetry Collector
- Operate Grafana + Prometheus (metrics)
- Operate Jaeger or Tempo (distributed tracing)
- Provide dashboards for: analysis latency, query p95, cache hit rates, MCP call graphs
- Configure alerting rules and on-call routing
- Run chaos tests and load tests for production validation
Technology Stack
| Component | Technology | |-----------|-----------| | Instrumentation | OpenTelemetry SDK (auto) | | Metrics | Prometheus + Grafana | | Tracing | Jaeger or Grafana Tempo | | Logs | Loki or ELK | | Alerting | Grafana Alerting + PagerDuty |
Getting Started
git clone [email protected]:ecip/ecip-observability-stack.git
cd ecip-observability-stack
# Deploy the full observability stack to Kubernetes
helm upgrade --install ecip-obs ./helm --namespace monitoring
# Access Grafana locally
kubectl port-forward svc/grafana 3000:3000 -n monitoring
# Open http://localhost:3000 (admin/admin)Required Dashboards
| Dashboard | Key Metrics | |-----------|------------| | Analysis Pipeline | Events consumed/s, analysis duration p50/p95, embedding API latency, error rate | | Query Service | Query duration p50/p95/p99, LLM API latency, cache hit rate, MCP fan-out depth | | MCP Servers | Tool call latency per tool, per-repo RPS, auth failure rate | | Knowledge Store | Redis hit rate, pgvector query duration, write throughput | | Event Bus | Kafka consumer lag, DLQ depth, webhook processing latency | | Platform SLAs | End-to-end query p95 < 1.5s, analysis p95 < 30s, uptime |
SLA Targets
| SLA | Target | |-----|--------| | Query p95 latency | < 1.5s | | Analysis p95 latency (trunk) | < 30s per file | | Platform uptime | > 99.5% | | Cache hit rate (M03) | > 80% |
Module Dependencies
Depends on: OpenTelemetry SDKs in all other modules (auto-instrumented — no per-module code changes required)
Called by: Nothing — pull-based metrics collection
