@dici1435/observability-mcp
v1.0.12
MCP server for querying logs and traces from Loki/Tempo observability stack
@dici1435/observability-mcp
MCP (Model Context Protocol) server that exposes observability tools to Cursor for AI-assisted debugging. Query logs, traces, error codes, service health, and error rates directly in your IDE.
Installation
One-Click Install (Recommended)
Click the button below to install directly to Cursor:
Manual Installation
Add to your ~/.cursor/mcp.json:
{
  "mcpServers": {
    "fp-observability": {
      "command": "npx",
      "args": ["-y", "@dici1435/observability-mcp"],
      "env": {
        "LOKI_URL": "http://localhost:3100",
        "TEMPO_URL": "http://localhost:3200",
        "API_GATEWAY_URL": "http://localhost:3000"
      }
    }
  }
}
Then restart Cursor.
Tools
get_trace
Retrieve a distributed trace by traceId from Tempo. Now includes error registry metadata inline.
"Get trace abc123def456"
"Show me what happened in trace 5f8d3a..."
Features:
- Supports partial trace IDs (minimum 8 characters)
- Automatically resolves short IDs from console logs
- Shows error code metadata from the live registry (category, severity, retryable)
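As a rough illustration of how a partial ID could be resolved, the sketch below matches a prefix against a list of candidate trace IDs. The function name, the candidate list, and the handling of ambiguous prefixes are assumptions made for illustration, not the package's actual implementation; only the 8-character minimum comes from the feature list above.

```typescript
// Hypothetical helper: resolve a partial trace ID against known full IDs.
// Only the 8-character minimum is documented; everything else is assumed.
const MIN_PARTIAL_LENGTH = 8;

type Resolution =
  | { kind: "resolved"; traceId: string }
  | { kind: "ambiguous"; matches: string[] }
  | { kind: "not_found" }
  | { kind: "too_short" };

function resolvePartialTraceId(partial: string, candidates: string[]): Resolution {
  const prefix = partial.trim().toLowerCase();
  if (prefix.length < MIN_PARTIAL_LENGTH) {
    return { kind: "too_short" };
  }
  // Deduplicate so a trace seen in several log lines counts once.
  const matches = Array.from(
    new Set(candidates.map((id) => id.toLowerCase()).filter((id) => id.startsWith(prefix)))
  );
  if (matches.length === 0) return { kind: "not_found" };
  if (matches.length === 1) return { kind: "resolved", traceId: matches[0] };
  return { kind: "ambiguous", matches };
}
```

Returning an explicit "ambiguous" result, rather than picking the first match, keeps a short prefix from silently resolving to the wrong trace.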
get_logs
Query logs from Loki with flexible filtering. Enhanced error/warn display with promoted error attributes.
"Show error logs from api-gateway"
"Get logs for trace abc123"
"Find logs mentioning 'timeout' in the last 30 minutes"
Parameters: service, level, traceId, spanId, flowId, correlationId, search, since, limit
analyze_request
Comprehensive analysis combining trace and log data. Now includes codeRef extraction and smart registry-based recommendations.
"Analyze what happened to request with trace abc123"
"Debug the failed request xyz789"
Smart recommendations based on error registry:
- Retryable errors prompt retry verification
- External errors point to third-party health
- Critical errors flag that a page (on-call alert) is expected
- Runbook links included when available
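The rules above could be expressed as a small mapping from registry metadata to recommendation text. This is a sketch only: the `ErrorMeta` shape, the `runbookUrl` field name, and the message wording are assumptions, since the README documents only the category, severity, and retryable attributes and the existence of runbook links.

```typescript
// Hypothetical shape of an error registry entry.
interface ErrorMeta {
  code: string;
  category: "user" | "system" | "external";
  severity: "critical" | "high" | "medium" | "low";
  retryable: boolean;
  runbookUrl?: string; // assumed field name
}

// Turn registry metadata into human-readable recommendations,
// one per rule that applies.
function recommend(meta: ErrorMeta): string[] {
  const tips: string[] = [];
  if (meta.retryable) {
    tips.push(`${meta.code} is retryable: verify that a retry was attempted.`);
  }
  if (meta.category === "external") {
    tips.push("Check the health of the third-party dependency.");
  }
  if (meta.severity === "critical") {
    tips.push("Critical severity: a page is expected.");
  }
  if (meta.runbookUrl) {
    tips.push(`Runbook: ${meta.runbookUrl}`);
  }
  return tips;
}
```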
run_tests
Execute tests and capture trace IDs from the output.
"Run unit tests for packages/testing"
"Run e2e:api tests"
Test Types: unit (default), e2e:api, e2e:browser, traced
search_traces (NEW)
Search for traces matching criteria. Returns lightweight summaries without N+1 full trace fetching.
"Find error traces from api-gateway in the last 30 minutes"
"Search for slow traces over 2 seconds"
Parameters: service, operation, minDuration, maxDuration, status (error/ok), tags, since, limit
check_services (NEW)
Check health of all services: api-gateway + downstream gRPC microservices + infrastructure (Loki, Tempo, Prometheus).
"Are all services healthy?"
"Check service health"
Uses the api-gateway deep health endpoint for gRPC fan-out to identity, core-apps-routing, lenders, finance, and edge-ops.
get_error_info (NEW)
Look up FormPiper error code metadata from the live error registry.
"What does FP.LENDERS.SUBMISSION_FAILED mean?"
"Show all system-category errors"
"List all critical severity errors"
Parameters: code, codeRef, category (user/system/external), severity (critical/high/medium/low)
get_error_rate (NEW)
Query error rates per service using Loki log counts.
"What's the error rate across services?"
"Show error rates for the last 5 minutes"
Parameters: service, since (default: "5m"), threshold (default: 1%)
compare_traces (NEW)
Compare two traces side-by-side with configurable span matching.
"Compare trace abc123 (passing) with trace def456 (failing)"
Parameters: traceIdA, traceIdB, matchStrategy (default/strict/loose)
Matching strategies:
- default: operation + service + parent operation with positional tiebreaker
- strict: adds spanKind + depth for highly uniform traces
- loose: operation + service only for simple request/response traces
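One way to picture the three strategies is as different matching keys computed per span, with spans from the two traces paired when their keys are equal. The `SpanInfo` shape and `spanKey` function are hypothetical, and the positional tiebreaker used by the default strategy is left out of this sketch.

```typescript
type MatchStrategy = "default" | "strict" | "loose";

// Hypothetical minimal view of a span, for matching purposes only.
interface SpanInfo {
  operation: string;
  service: string;
  parentOperation?: string;
  spanKind?: string;
  depth: number;
}

// Compute the key used to pair spans across two traces.
function spanKey(span: SpanInfo, strategy: MatchStrategy): string {
  // loose: operation + service only
  const base = [span.service, span.operation];
  if (strategy === "loose") return base.join("|");
  // default: also include the parent operation
  const withParent = [...base, span.parentOperation ?? "(root)"];
  if (strategy === "default") return withParent.join("|");
  // strict: additionally include span kind and depth
  return [...withParent, span.spanKind ?? "unknown", String(span.depth)].join("|");
}
```

The more fields a key includes, the fewer false matches it produces, at the cost of missing spans that moved within the call tree.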
Configuration
Environment Variables
| Variable | Default | Description |
| -------------------------- | ----------------------- | ------------------------------------------------------------------ |
| LOKI_URL | http://localhost:3100 | Loki server URL |
| TEMPO_URL | http://localhost:3200 | Tempo server URL |
| API_GATEWAY_URL | http://localhost:3000 | fp-mono api-gateway URL (error registry + deep health) |
| OBSERVABILITY_API_KEY | (none) | API key for ObservabilityGuard (optional in dev, required in prod) |
| PROMETHEUS_URL | http://localhost:9090 | Prometheus URL (for check_services health) |
| SPAN_MATCH_STRATEGY | default | Default span matching for compare_traces |
| DICI_WORKSPACE_ROOT | Current directory | Workspace root for running tests |
| LOKI_FLOW_ID_ATTR | flowId | LogQL field name for flow ID |
| LOKI_CORRELATION_ID_ATTR | correlationId | LogQL field name for correlation ID |
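As a sketch of how these variables and their defaults could be read, the function below takes an environment map (for example `process.env` in Node.js) and fills in the documented defaults. The `ObservabilityConfig` type and `loadConfig` function are illustrative names, not the package's API; DICI_WORKSPACE_ROOT is omitted because its default depends on the current directory.

```typescript
// Illustrative config shape; field names are assumptions.
interface ObservabilityConfig {
  lokiUrl: string;
  tempoUrl: string;
  apiGatewayUrl: string;
  prometheusUrl: string;
  apiKey?: string;
  spanMatchStrategy: string;
  flowIdAttr: string;
  correlationIdAttr: string;
}

type Env = Record<string, string | undefined>;

// Read settings from an environment map, applying the defaults from the table.
function loadConfig(env: Env): ObservabilityConfig {
  return {
    lokiUrl: env["LOKI_URL"] ?? "http://localhost:3100",
    tempoUrl: env["TEMPO_URL"] ?? "http://localhost:3200",
    apiGatewayUrl: env["API_GATEWAY_URL"] ?? "http://localhost:3000",
    prometheusUrl: env["PROMETHEUS_URL"] ?? "http://localhost:9090",
    apiKey: env["OBSERVABILITY_API_KEY"], // no default; optional in dev
    spanMatchStrategy: env["SPAN_MATCH_STRATEGY"] ?? "default",
    flowIdAttr: env["LOKI_FLOW_ID_ATTR"] ?? "flowId",
    correlationIdAttr: env["LOKI_CORRELATION_ID_ATTR"] ?? "correlationId",
  };
}
```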
How It Works
Cursor IDE
│
│ MCP protocol (stdio)
▼
fp-observability MCP server (9 tools)
│
│ HTTP calls
▼
┌──────────────────────────────────────┐
│ Observability Stack │
│ • Loki (logs) │
│ • Tempo (traces) │
│ • Prometheus (health checks) │
│ │
│ fp-mono api-gateway │
│ • /api/v1/error-registry (metadata) │
│ • /api/v1/health/deep (gRPC fan-out)│
│ → identity (:50051) │
│ → core-apps-routing (:50052) │
│ → lenders (:50053) │
│ → finance (:50054) │
│ → edge-ops (:50055) │
└──────────────────────────────────────┘
Development
Using Local Build
{
  "mcpServers": {
    "fp-observability": {
      "command": "node",
      "args": ["/path/to/dici-new/packages/mcp-observability/dist/index.js"],
      "env": {
        "LOKI_URL": "http://localhost:3100",
        "TEMPO_URL": "http://localhost:3200",
        "API_GATEWAY_URL": "http://localhost:3000"
      }
    }
  }
}
Workflow
- Make changes to source files in src/
- Rebuild: pnpm build
- Reload MCP in Cursor: Cmd+Shift+P → "Developer: Reload Window"
Troubleshooting
"Cannot connect to Loki/Tempo"
- Verify your observability stack is running
- Check the configured URLs are correct
- Use the check_services tool to diagnose all services at once
"Error registry unavailable"
- Ensure fp-mono api-gateway is running
- Check that API_GATEWAY_URL is correct
- If in production, set the OBSERVABILITY_API_KEY env var
Partial trace ID not resolving
- Use the full 32-character trace ID
- Or increase the search window with since: "7d"
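Values such as "5m" and "7d" suggest a compact duration format for since. A minimal parser for that style might look like the following; the supported units (s, m, h, d) and the function name are assumptions, as the README shows only the "5m" and "7d" examples.

```typescript
// Milliseconds per unit. The set of supported units is an assumption.
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Convert a compact duration such as "5m" or "7d" into milliseconds.
function parseSince(since: string): number {
  const match = /^(\d+)([smhd])$/.exec(since.trim());
  if (!match) {
    throw new Error(`Unsupported since value: ${since}`);
  }
  return Number(match[1]) * UNIT_MS[match[2]];
}
```

The start of the search window is then the current time minus `parseSince(since)`.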
Known Limitations
- Error rates are approximated, not precise. get_error_rate counts log lines in Loki as a proxy for error rates. This is not real metrics -- it's an approximation. Until the OTel Collector is configured to export metrics to Prometheus, there is no request-level error rate data available.
- Dev-environment only. The entire stack depends on Docker Compose being up. This is IDE-integrated debugging for local development, not production observability.
- Only as good as the telemetry. If a service has poor span coverage or doesn't propagate trace context correctly, the trace data will have gaps. The MCP tools can't fix bad instrumentation -- they surface what the services emit.
- compare_traces matching is inherently fuzzy. Span matching across two different traces relies on heuristics (operation name, service, parent). Structural differences from conditional code paths, retries, or fan-out variations can make comparisons noisy. The three matching strategies (default, strict, loose) help, but aren't perfect.
- No real-time streaming. All tools are request/response. There is no live tail of logs or traces -- each query is a point-in-time snapshot.
Roadmap
This MCP server currently runs locally against local Loki/Tempo/Prometheus and fp-mono api-gateway. The goal is to deploy it as a production MCP server for live agent-assisted debugging.
Production Deployment
- [ ] Add authentication layer (API key or OAuth) for production Loki/Tempo/Prometheus access
- [ ] Set OBSERVABILITY_API_KEY in production for error registry + deep health access
- [ ] Add TLS support for all client connections
- [ ] Deploy as a standalone service (Docker container or serverless function)
- [ ] Add rate limiting to prevent runaway agent queries against production observability stack
- [ ] Add read-only query guards (prevent agents from running expensive unbounded queries)
- [ ] Support remote MCP transport (SSE or HTTP) instead of stdio for production use
- [ ] Add multi-environment support (staging vs production) via environment selector
Tool Enhancements
- [ ] Configure OTel Collector to export metrics to Prometheus for real request-level error rates (replacing Loki log-count approximation)
- [ ] Add TraceQL support to search_traces for advanced trace querying
- [ ] Add Grafana dashboard links in get_trace and get_error_rate output
- [ ] Correlate Temporal workflow executions with their traces and logs in a single view
- [ ] Proactive anomaly surfacing -- detect elevated error rates or degraded services on session start instead of waiting for the user to ask
- [ ] Implement gRPC Health Checking Protocol (grpc.health.v1) on all microservices for cleaner deep health checks
Integrations
- [ ] Add alerting integration (query PagerDuty/OpsGenie for active incidents alongside trace data)
- [ ] Link to Temporal UI for workflow-level debugging when traces span workflow activities
License
MIT
