@dici1435/observability-mcp

v1.0.12

Published

4 days ago

MCP server for querying logs and traces from Loki/Tempo observability stack


dici1433

mcp observability loki tempo logs traces cursor debugging

@dici1435/observability-mcp

MCP (Model Context Protocol) server that exposes observability tools to Cursor for AI-assisted debugging. Query logs, traces, error codes, service health, and error rates directly in your IDE.

Installation

One-Click Install (Recommended)

Click the button below to install directly to Cursor:

Manual Installation

Add to your ~/.cursor/mcp.json:

{
  "mcpServers": {
    "fp-observability": {
      "command": "npx",
      "args": ["-y", "@dici1435/observability-mcp"],
      "env": {
        "LOKI_URL": "http://localhost:3100",
        "TEMPO_URL": "http://localhost:3200",
        "API_GATEWAY_URL": "http://localhost:3000"
      }
    }
  }
}

Then restart Cursor.

Tools

`get_trace`

Retrieve a distributed trace by traceId from Tempo. Now includes error registry metadata inline.

"Get trace abc123def456"
"Show me what happened in trace 5f8d3a..."

Features:

Supports partial trace IDs (minimum 8 characters)
Automatically resolves short IDs from console logs
Shows error code metadata from the live registry (category, severity, retryable)
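
The partial-ID rule above can be illustrated with a small validation sketch. This is not the package's code: `normalizeTraceId` and its return shape are hypothetical, and it assumes trace IDs are 32 hexadecimal characters, as the Troubleshooting section suggests.

```typescript
// Sketch only: normalizeTraceId is a hypothetical helper, not part of the package's API.
interface NormalizedTraceId {
  id: string;
  kind: "full" | "partial";
}

const FULL_LENGTH = 32; // a full trace ID is 32 hex characters
const MIN_PARTIAL_LENGTH = 8; // minimum length accepted for a partial ID

function normalizeTraceId(input: string): NormalizedTraceId {
  // Trim whitespace and drop a trailing "..." as seen in truncated console output.
  const id = input.trim().replace(/\.{3}$/, "").toLowerCase();

  if (!/^[0-9a-f]+$/.test(id)) {
    throw new Error(`Trace ID must be hexadecimal: "${input}"`);
  }
  if (id.length > FULL_LENGTH) {
    throw new Error(`Trace ID is longer than ${FULL_LENGTH} characters: "${input}"`);
  }
  if (id.length < MIN_PARTIAL_LENGTH) {
    throw new Error(`Partial trace ID needs at least ${MIN_PARTIAL_LENGTH} characters: "${input}"`);
  }
  return { id, kind: id.length === FULL_LENGTH ? "full" : "partial" };
}
```

Under these assumptions, `normalizeTraceId("abc123def456")` is classified as partial, while a six-character ID such as `5f8d3a` is rejected as too short to resolve.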

`get_logs`

Query logs from Loki with flexible filtering. Enhanced error/warn display with promoted error attributes.

"Show error logs from api-gateway"
"Get logs for trace abc123"
"Find logs mentioning 'timeout' in the last 30 minutes"

Parameters: service, level, traceId, spanId, flowId, correlationId, search, since, limit
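
As a rough illustration of how a few of these parameters could translate into a LogQL query, here is a minimal sketch. It is an assumption-laden example, not the server's implementation: `buildLogQL` is a hypothetical helper, and it assumes a `service` stream label and JSON-formatted log lines with `level` and `traceId` fields.

```typescript
// Sketch only: the label and field names below are assumptions about the log schema.
interface LogQuery {
  service?: string;
  level?: string;
  traceId?: string;
  search?: string;
}

// Escape backslashes and double quotes so a value is safe inside a LogQL string literal.
function quote(value: string): string {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

function buildLogQL(q: LogQuery): string {
  // LogQL needs at least one matcher that cannot match the empty string,
  // so fall back to "any non-empty service" when no service is given.
  const selector = q.service ? `{service=${quote(q.service)}}` : `{service=~".+"}`;

  const stages: string[] = [];
  // Line filter first: it is cheap and narrows the lines before parsing.
  if (q.search) stages.push(`|= ${quote(q.search)}`);

  // Field filters need the JSON parser stage to extract fields first.
  if (q.level || q.traceId) {
    stages.push("| json");
    if (q.level) stages.push(`| level=${quote(q.level)}`);
    if (q.traceId) stages.push(`| traceId=${quote(q.traceId)}`);
  }
  return [selector].concat(stages).join(" ");
}

// buildLogQL({ service: "api-gateway", level: "error", search: "timeout" })
// → {service="api-gateway"} |= "timeout" | json | level="error"
```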

`analyze_request`

Comprehensive analysis combining trace and log data. Now includes codeRef extraction and smart registry-based recommendations.

"Analyze what happened to request with trace abc123"
"Debug the failed request xyz789"

Smart recommendations based on error registry:

Retryable errors prompt retry verification
External errors point to third-party health
Critical errors flag that an on-call page is expected
Runbook links included when available
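
The rules above could be sketched as a simple mapping from registry metadata to recommendation strings. This is a hypothetical illustration: the `ErrorMeta` shape, the `runbookUrl` field name, and the wording of each message are assumptions, not the registry's actual schema.

```typescript
// Sketch only: ErrorMeta is an assumed shape; runbookUrl is a hypothetical field name.
interface ErrorMeta {
  code: string;
  category: "user" | "system" | "external";
  severity: "critical" | "high" | "medium" | "low";
  retryable: boolean;
  runbookUrl?: string;
}

function recommend(meta: ErrorMeta): string[] {
  const tips: string[] = [];
  if (meta.retryable) {
    tips.push("Retryable error: verify that a retry was attempted and check its outcome.");
  }
  if (meta.category === "external") {
    tips.push("External error: check the health of the third-party dependency.");
  }
  if (meta.severity === "critical") {
    tips.push("Critical severity: an on-call page is expected for this error.");
  }
  if (meta.runbookUrl) {
    tips.push(`Runbook: ${meta.runbookUrl}`);
  }
  return tips;
}

// The metadata values here are made up for illustration.
const tips = recommend({
  code: "FP.LENDERS.SUBMISSION_FAILED",
  category: "external",
  severity: "high",
  retryable: true,
});
// tips contains the retry and third-party recommendations only.
```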

`run_tests`

Execute tests and capture trace IDs from the output.

"Run unit tests for packages/testing"
"Run e2e:api tests"

Test Types: unit (default), e2e:api, e2e:browser, traced
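
One way to capture trace IDs from test output is a plain pattern scan. The sketch below is an assumption about the approach, not the package's code: `extractTraceIds` is a hypothetical helper, and it treats any standalone 32-character hex string as a trace ID, so unrelated hashes of the same length would also match.

```typescript
// Sketch only: assumes trace IDs appear in the output as standalone 32-character hex strings.
function extractTraceIds(output: string): string[] {
  const matches = output.match(/\b[0-9a-f]{32}\b/gi) ?? [];
  const seen: Record<string, boolean> = {};
  const ids: string[] = [];
  for (const match of matches) {
    const id = match.toLowerCase();
    // Keep the first occurrence of each ID, preserving order of appearance.
    if (!seen[id]) {
      seen[id] = true;
      ids.push(id);
    }
  }
  return ids;
}
```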

`search_traces` (NEW)

Search for traces matching criteria. Returns lightweight summaries and avoids fetching each full trace individually (no N+1 requests).

"Find error traces from api-gateway in the last 30 minutes"
"Search for slow traces over 2 seconds"

Parameters: service, operation, minDuration, maxDuration, status (error/ok), tags, since, limit
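
For orientation, here is a hedged sketch of how some of these parameters might map onto Tempo's tag-based search endpoint (`/api/search` with `tags`, `minDuration`, `maxDuration`, and `limit`). `buildTempoSearchUrl` is a hypothetical helper, the default limit of 20 is an assumption, and `status` and `since` are left out because their mapping is not described here.

```typescript
// Sketch only: shows one possible mapping onto Tempo's /api/search query parameters.
interface TraceSearch {
  service?: string;
  operation?: string;
  minDuration?: string; // Go-style duration string, e.g. "2s" or "500ms"
  maxDuration?: string;
  limit?: number;
}

function buildTempoSearchUrl(baseUrl: string, s: TraceSearch): string {
  // Tempo expects tags in logfmt form. Values containing spaces would need
  // logfmt quoting, which this sketch does not handle.
  const tags: string[] = [];
  if (s.service) tags.push(`service.name=${s.service}`);
  if (s.operation) tags.push(`name=${s.operation}`);

  const params: Array<[string, string]> = [];
  if (tags.length > 0) params.push(["tags", tags.join(" ")]);
  if (s.minDuration) params.push(["minDuration", s.minDuration]);
  if (s.maxDuration) params.push(["maxDuration", s.maxDuration]);
  params.push(["limit", String(s.limit ?? 20)]); // default limit is an assumption

  const query = params.map(([key, value]) => `${key}=${encodeURIComponent(value)}`).join("&");
  // Strip trailing slashes so the path is not doubled.
  return `${baseUrl.replace(/\/+$/, "")}/api/search?${query}`;
}

// buildTempoSearchUrl("http://localhost:3200", { service: "api-gateway", minDuration: "2s" })
// → http://localhost:3200/api/search?tags=service.name%3Dapi-gateway&minDuration=2s&limit=20
```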

`check_services` (NEW)

Check the health of all services: the api-gateway, its downstream gRPC microservices, and the infrastructure (Loki, Tempo, Prometheus).

"Are all services healthy?"
"Check service health"

Uses the api-gateway deep health endpoint for gRPC fan-out to identity, core-apps-routing, lenders, finance, and edge-ops.

`get_error_info` (NEW)

Look up FormPiper error code metadata from the live error registry.

"What does FP.LENDERS.SUBMISSION_FAILED mean?"
"Show all system-category errors"
"List all critical severity errors"

Parameters: code, codeRef, category (user/system/external), severity (critical/high/medium/low)

`get_error_rate` (NEW)

Query error rates per service using Loki log counts.

"What's the error rate across services?"
"Show error rates for the last 5 minutes"

Parameters: service, since (default: "5m"), threshold (default: 1%)
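
The arithmetic behind a log-count error rate is simple enough to sketch. The example below assumes the per-service counts have already been fetched from Loki; the `ServiceCounts` shape is hypothetical, and treating the threshold as a strict "greater than" comparison is an assumption.

```typescript
// Sketch only: counts are assumed to come from Loki log-line counts over the `since` window.
interface ServiceCounts {
  service: string;
  errorLines: number;
  totalLines: number;
}

interface ErrorRate {
  service: string;
  ratePercent: number;
  exceedsThreshold: boolean;
}

function errorRates(counts: ServiceCounts[], thresholdPercent: number = 1): ErrorRate[] {
  return counts.map((c) => {
    // A service with no log lines in the window is reported as 0%, not NaN.
    const ratePercent = c.totalLines === 0 ? 0 : (c.errorLines * 100) / c.totalLines;
    return {
      service: c.service,
      ratePercent,
      exceedsThreshold: ratePercent > thresholdPercent,
    };
  });
}

const rates = errorRates([
  { service: "api-gateway", errorLines: 5, totalLines: 200 }, // 2.5%, above the 1% default
  { service: "lenders", errorLines: 1, totalLines: 1000 }, // 0.1%, below the default
]);
```

Because the inputs are log lines rather than requests, a single failed request that logs several error lines inflates the rate, which is why the result is only a proxy.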

`compare_traces` (NEW)

Compare two traces side-by-side with configurable span matching.

"Compare trace abc123 (passing) with trace def456 (failing)"

Parameters: traceIdA, traceIdB, matchStrategy (default/strict/loose)

Matching strategies:

default: operation + service + parent operation with positional tiebreaker
strict: adds spanKind + depth for highly uniform traces
loose: operation + service only for simple request/response traces
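
To make the three strategies concrete, the following sketch builds a matching key per span and pairs spans that share a key, using order of appearance as the positional tiebreaker. It illustrates the idea under assumed names: the `Span` shape, `spanKey`, and `matchSpans` are hypothetical and may differ from the actual implementation.

```typescript
// Sketch only: Span is an assumed, simplified shape for illustration.
type MatchStrategy = "default" | "strict" | "loose";

interface Span {
  operation: string;
  service: string;
  parentOperation?: string;
  spanKind?: string;
  depth?: number;
}

function spanKey(span: Span, strategy: MatchStrategy): string {
  // loose: service + operation only
  const parts = [span.service, span.operation];
  // default and strict: also require the same parent operation
  if (strategy !== "loose") parts.push(span.parentOperation ?? "<root>");
  // strict: additionally require the same span kind and depth
  if (strategy === "strict") parts.push(span.spanKind ?? "unspecified", String(span.depth ?? 0));
  return parts.join("|");
}

// Pair each span of trace A with a span of trace B that has the same key.
// When several spans share a key, they are paired in order of appearance.
function matchSpans(a: Span[], b: Span[], strategy: MatchStrategy = "default"): Array<[Span, Span | undefined]> {
  const buckets: Record<string, Span[]> = {};
  for (const span of b) {
    const key = spanKey(span, strategy);
    if (!buckets[key]) buckets[key] = [];
    buckets[key].push(span);
  }
  return a.map((span): [Span, Span | undefined] => {
    const bucket = buckets[spanKey(span, strategy)];
    return [span, bucket ? bucket.shift() : undefined];
  });
}
```

With this sketch, a span whose parent operation differs between the two traces is unmatched under `default` but still pairs under `loose`, which is the trade-off between precision and tolerance that the strategies expose.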

Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| LOKI_URL | http://localhost:3100 | Loki server URL |
| TEMPO_URL | http://localhost:3200 | Tempo server URL |
| API_GATEWAY_URL | http://localhost:3000 | fp-mono api-gateway URL (error registry + deep health) |
| OBSERVABILITY_API_KEY | (none) | API key for ObservabilityGuard (optional in dev, required in prod) |
| PROMETHEUS_URL | http://localhost:9090 | Prometheus URL (for check_services health) |
| SPAN_MATCH_STRATEGY | default | Default span matching for compare_traces |
| DICI_WORKSPACE_ROOT | Current directory | Workspace root for running tests |
| LOKI_FLOW_ID_ATTR | flowId | LogQL field name for flow ID |
| LOKI_CORRELATION_ID_ATTR | correlationId | LogQL field name for correlation ID |
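
A minimal sketch of resolving these variables with their documented defaults might look like the following. `loadConfig` and the `ObservabilityConfig` shape are hypothetical names for illustration. Treating an empty string as unset is an assumption, and `DICI_WORKSPACE_ROOT` is omitted because its default depends on the process's working directory.

```typescript
// Sketch only: a hypothetical config loader using the defaults from the table above.
interface ObservabilityConfig {
  lokiUrl: string;
  tempoUrl: string;
  apiGatewayUrl: string;
  prometheusUrl: string;
  apiKey?: string;
  spanMatchStrategy: string;
  flowIdAttr: string;
  correlationIdAttr: string;
}

function loadConfig(env: Record<string, string | undefined>): ObservabilityConfig {
  // `||` treats an empty string the same as an unset variable (an assumption).
  return {
    lokiUrl: env["LOKI_URL"] || "http://localhost:3100",
    tempoUrl: env["TEMPO_URL"] || "http://localhost:3200",
    apiGatewayUrl: env["API_GATEWAY_URL"] || "http://localhost:3000",
    prometheusUrl: env["PROMETHEUS_URL"] || "http://localhost:9090",
    apiKey: env["OBSERVABILITY_API_KEY"] || undefined, // no default
    spanMatchStrategy: env["SPAN_MATCH_STRATEGY"] || "default",
    flowIdAttr: env["LOKI_FLOW_ID_ATTR"] || "flowId",
    correlationIdAttr: env["LOKI_CORRELATION_ID_ATTR"] || "correlationId",
  };
}

// Only TEMPO_URL is overridden; every other value falls back to its default.
const config = loadConfig({ TEMPO_URL: "http://tempo:3200" });
```

In a Node.js process, the same function could be called with `process.env`.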

How It Works

Cursor IDE
    │
    │ MCP protocol (stdio)
    ▼
fp-observability MCP server (9 tools)
    │
    │ HTTP calls
    ▼
┌──────────────────────────────────────┐
│  Observability Stack                 │
│  • Loki  (logs)                      │
│  • Tempo (traces)                    │
│  • Prometheus (health checks)        │
│                                      │
│  fp-mono api-gateway                 │
│  • /api/v1/error-registry (metadata) │
│  • /api/v1/health/deep (gRPC fan-out)│
│    → identity (:50051)               │
│    → core-apps-routing (:50052)      │
│    → lenders (:50053)                │
│    → finance (:50054)                │
│    → edge-ops (:50055)               │
└──────────────────────────────────────┘

Development

Using Local Build

{
  "mcpServers": {
    "fp-observability": {
      "command": "node",
      "args": ["/path/to/dici-new/packages/mcp-observability/dist/index.js"],
      "env": {
        "LOKI_URL": "http://localhost:3100",
        "TEMPO_URL": "http://localhost:3200",
        "API_GATEWAY_URL": "http://localhost:3000"
      }
    }
  }
}

Workflow

Make changes to source files in src/
Rebuild: pnpm build
Reload MCP in Cursor: Cmd+Shift+P → "Developer: Reload Window"

Troubleshooting

"Cannot connect to Loki/Tempo"

Verify your observability stack is running
Check that the configured URLs are correct
Use check_services tool to diagnose all services at once

"Error registry unavailable"

Ensure the fp-mono api-gateway is running
Check that API_GATEWAY_URL is correct
In production, set the OBSERVABILITY_API_KEY environment variable

Partial trace ID not resolving

Use the full 32-character trace ID
Or increase the search window with since: "7d"

Known Limitations

Error rates are approximated, not precise. get_error_rate counts log lines in Loki as a proxy for error rates, so the numbers are not true request-level metrics. Until the OTel Collector is configured to export metrics to Prometheus, no request-level error rate data is available.
Dev-environment only. The entire stack depends on Docker Compose being up. This is IDE-integrated debugging for local development, not production observability.
Only as good as the telemetry. If a service has poor span coverage or doesn't propagate trace context correctly, the trace data will have gaps. The MCP tools can't fix bad instrumentation -- they surface what the services emit.
compare_traces matching is inherently fuzzy. Span matching across two different traces relies on heuristics (operation name, service, parent). Structural differences from conditional code paths, retries, or fan-out variations can make comparisons noisy. The three matching strategies (default, strict, loose) help, but aren't perfect.
No real-time streaming. All tools are request/response. There is no live tail of logs or traces -- each query is a point-in-time snapshot.

Roadmap

This MCP server currently runs locally against local Loki/Tempo/Prometheus and fp-mono api-gateway. The goal is to deploy it as a production MCP server for live agent-assisted debugging.

Production Deployment

[ ] Add authentication layer (API key or OAuth) for production Loki/Tempo/Prometheus access
[ ] Set OBSERVABILITY_API_KEY in production for error registry + deep health access
[ ] Add TLS support for all client connections
[ ] Deploy as a standalone service (Docker container or serverless function)
[ ] Add rate limiting to prevent runaway agent queries against production observability stack
[ ] Add read-only query guards (prevent agents from running expensive unbounded queries)
[ ] Support remote MCP transport (SSE or HTTP) instead of stdio for production use
[ ] Add multi-environment support (staging vs production) via environment selector

Tool Enhancements

[ ] Configure OTel Collector to export metrics to Prometheus for real request-level error rates (replacing Loki log-count approximation)
[ ] Add TraceQL support to search_traces for advanced trace querying
[ ] Add Grafana dashboard links in get_trace and get_error_rate output
[ ] Correlate Temporal workflow executions with their traces and logs in a single view
[ ] Proactive anomaly surfacing -- detect elevated error rates or degraded services on session start instead of waiting for the user to ask
[ ] Implement gRPC Health Checking Protocol (grpc.health.v1) on all microservices for cleaner deep health checks

Integrations

[ ] Add alerting integration (query PagerDuty/OpsGenie for active incidents alongside trace data)
[ ] Link to Temporal UI for workflow-level debugging when traces span workflow activities

License

MIT
