observability-toolkit
v2.1.1
MCP server for observability tooling - query traces, metrics, logs from local JSONL or SigNoz
observability-toolkit
MCP server for observability tooling — query traces, metrics, logs, and LLM events from agentic coding tools. Works with any agent emitting OTel GenAI semantic conventions. Ingest via OTLP to Cloudflare R2 or read from local JSONL.
Version: v2.1.0 | Published: 2026-04-07 | License: MIT
Installation
# Claude Code
claude mcp add observability-toolkit -- npx -y observability-toolkit
# Cursor / Windsurf / Continue.dev / Cline — add to MCP config:
{ "mcpServers": { "observability-toolkit": { "command": "npx", "args": ["-y", "observability-toolkit"] } } }
# Local dev
node ~/.claude/mcp-servers/observability-toolkit/dist/server.js

Tools
| Tool | Description |
|------|-------------|
| obs_query_traces | Query spans with filtering, regex, numeric operators, agent/tool attributes |
| obs_query_metrics | Query metrics with aggregations (sum, avg, p50, p95, p99, rate), time buckets |
| obs_query_logs | Query logs with boolean search, field extraction, negation |
| obs_query_llm_events | Query LLM events with token usage, duration, provider/model filters |
| obs_query_evaluations | Query evaluation events with aggregations and groupBy |
| obs_query_verifications | Query human verification events for EU AI Act compliance |
| obs_query_regressions | Detect quality metric regressions via EWMA drift and consecutive breach tracking |
| obs_query_metric_histograms | Query OTLP histogram bucket distributions by metric name |
| obs_health_check | Telemetry system health with cache statistics |
| obs_context_stats | Context window utilization stats |
| obs_token_budget | Context utilization, cache hit rate, headroom per model/session with alert levels |
| obs_hallucination_detection | Hallucination risk from evaluation telemetry — rates, scores, model/method breakdowns |
| obs_multi_agent_coordination | Delegation depth, fan-out ratio, handoff latency, agent token usage |
| obs_routing_telemetry | Model distribution, cost savings, fallback rate, routing latency |
| obs_estimate_cost | Token cost estimation across models |
| obs_audit_trail | Query audit trail events (SHA-256 hash chain) |
| obs_manage_datasets | Create, list, get, delete evaluation datasets (trace promotion) |
| obs_inject_evaluations | Inject evaluation events into local telemetry |
| obs_ingest_spans | Ingest spans to cloud backend via OTLP protobuf |
| obs_ingest_traces | Push complete OTel traces (resourceSpans) with service metadata |
| obs_export_langfuse | Export evaluations to Langfuse via OTLP HTTP |
| obs_export_phoenix | Export evaluations to Arize Phoenix via OTLP HTTP |
| obs_export_datadog | Export evaluations to Datadog LLM Observability |
| obs_export_confident | Export evaluations to Confident AI |
| obs_get_trace_url | Get trace viewer URL |
| obs_setup_claudeignore | Add entries to .claudeignore |
| obs_export_jaeger | Export spans to a local Jaeger instance via OTLP HTTP |
| obs_detect_trace_anomalies | Detect anomalous spans — duration, error status, token usage, unknown names, instrumentation loops |
Configuration
export OBTOOL_API_KEY="obtk_YOUR_KEY_HERE"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.integritystudio.ai"
# Or: doppler run --project integrity-studio --config dev -- npm start

| Variable | Default | Notes |
|----------|---------|-------|
| OBTOOL_API_KEY | — | Required for cloud backend tools |
| OTEL_EXPORTER_OTLP_ENDPOINT | — | Required for cloud telemetry export |
| BACKEND_TYPE | cloud | Query/ingest backend: local or cloud |
| TELEMETRY_DIR | ./.otel | Local telemetry directory (cwd-relative) |
| CACHE_TTL_MS | 60000 | Query cache TTL |
| RETENTION_DAYS | 7 | Telemetry file retention |
| OBTOOL_API_URL | — | Cloud backend query URL |
| OBTOOL_INGEST_URL | — | Cloud ingest URL |
See docs/ENVIRONMENT_SETUP.md for full setup.
Data Sources
Local JSONL (default)
Scans ./.otel/ (cwd-relative; override with TELEMETRY_DIR) plus the project-local directories .claude/telemetry/, telemetry/, and .telemetry/. A directory is included only if it exists on disk at query time. Supports gzip. Compatible with Claude Code natively and with any agent using the OTel file exporter.
File patterns: {traces,logs,metrics,llm-events,evaluations,verifications}-YYYY-MM-DD.jsonl[.gz]
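To make the naming scheme concrete, here is a minimal TypeScript sketch that recognizes file names following the pattern above. The helper name `parseTelemetryFileName` and the `TelemetryFile` type are hypothetical and not part of the toolkit's API; the sketch only mirrors the documented pattern.

```typescript
// Signal names taken from the documented file pattern.
const SIGNALS = ["traces", "logs", "metrics", "llm-events", "evaluations", "verifications"] as const;
type Signal = (typeof SIGNALS)[number];

// Matches e.g. "traces-2026-04-07.jsonl" or "llm-events-2026-04-07.jsonl.gz".
const TELEMETRY_FILE = new RegExp(
  `^(${SIGNALS.join("|")})-(\\d{4}-\\d{2}-\\d{2})\\.jsonl(\\.gz)?$`
);

interface TelemetryFile {
  signal: Signal;
  date: string; // YYYY-MM-DD, not validated as a real calendar date
  gzipped: boolean;
}

// Hypothetical helper: parse a file name, or return null if it does not match.
function parseTelemetryFileName(name: string): TelemetryFile | null {
  const m = TELEMETRY_FILE.exec(name);
  if (m === null) return null;
  return { signal: m[1] as Signal, date: m[2], gzipped: m[3] !== undefined };
}
```

A scanner could apply this to each entry of a telemetry directory and skip anything that returns null.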
Cloud Backend (optional)
All query tools accept backend: 'local' | 'cloud' (default: cloud). Cloud queries obtool-api (D1/R2) via OBTOOL_API_URL + OBTOOL_API_KEY. Circuit breaker protects against cascading failures.
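The README does not describe how the circuit breaker is implemented. As an illustration of the general technique only, the sketch below opens after a number of consecutive failures and allows a probe request once a cooldown has elapsed. The class name, the threshold, and the cooldown are all assumptions.

```typescript
// Illustrative circuit breaker; not the toolkit's actual implementation.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null; // null means the circuit is closed
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // True when a request may be attempted at time `now` (milliseconds).
  allowRequest(now: number): boolean {
    if (this.openedAt === null) return true; // closed: always allow
    return now - this.openedAt >= this.cooldownMs; // open: allow a probe after cooldown
  }

  // A success closes the circuit and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failure at or past the threshold (re)opens the circuit at `now`.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}
```

A caller would check `allowRequest(Date.now())` before each cloud query, fail fast when it returns false, and report the outcome with `recordSuccess()` or `recordFailure(Date.now())`.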
Data Pipelines
Local backend (backend: 'local'):
Claude Code hooks / OTel SDK
│ write
▼
./.otel/<path>.jsonl ← FileSpanExporter (SDK self-telemetry; path is caller-configured)
./.otel/*.jsonl ← hook-written JSONL (TELEMETRY_DIR primary)
.claude/telemetry/ ← legacy project-local hook spans (if present)
telemetry/ .telemetry/ ← other project-local dirs (if present)
│ read (getTelemetryDirectories union)
▼
obs_query_* tools
│ enrich
▼
derive → judge → sync-to-kv.ts → Cloudflare KV

Cloud backend (backend: 'cloud'):
hooks → OTLP HTTP → ingest.integritystudio.ai → R2 → D1 → api.integritystudio.ai

Services
Ingest Worker (obtool-ingest)
Cloudflare Worker (Hono v4) receiving OTLP protobuf telemetry. Deployed at ingest.integritystudio.ai.
- POST /v1/{traces,metrics,logs}: OTLP protobuf ingest (gzip supported)
- R2 storage: telemetry/{signal}/{YYYY-MM-DD}/{HH}/batch-{ts}-{uuid8}.jsonl
- SHA-256 bearer token auth, KV idempotency (5min TTL)
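The R2 key layout above can be sketched as a small function. The README does not define `{ts}` or `{uuid8}`, so this sketch assumes `{ts}` is epoch milliseconds and `{uuid8}` is the first eight hex characters of a UUID, with date and hour taken in UTC. The function name `buildBatchKey` is hypothetical.

```typescript
type IngestSignal = "traces" | "metrics" | "logs";

// Hypothetical helper that builds an object key following the documented layout.
function buildBatchKey(signal: IngestSignal, receivedAt: Date, uuid: string): string {
  const iso = receivedAt.toISOString(); // always UTC, e.g. "2026-04-07T09:15:30.000Z"
  const day = iso.slice(0, 10); // YYYY-MM-DD
  const hour = iso.slice(11, 13); // HH
  const ts = receivedAt.getTime(); // assumption: {ts} is epoch milliseconds
  const uuid8 = uuid.replace(/-/g, "").slice(0, 8); // assumption: first 8 hex characters
  return `telemetry/${signal}/${day}/${hour}/batch-${ts}-${uuid8}.jsonl`;
}
```

Partitioning keys by signal, day, and hour lets a reader list a narrow prefix instead of scanning the whole bucket.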
cd services/obtool-ingest && npm run dev | npm test | npm run deploy

API Worker (obtool-api)
Cloudflare Worker (Hono v4) querying D1/R2. Deployed at api.integritystudio.ai.
Routes: /v1/{traces,metrics,logs,sessions,cost,datasets} + histogram and raw span drill-down. Cursor-based pagination, per-key rate limiting.
cd services/obtool-api && npm run dev | npm test | npm run deploy

API Provisioning
Two-worker system for Flutter client API key lifecycle.
- Sender Worker (public): signup / signin / provision; signs payloads with HMAC-SHA256 before forwarding to receiver
- Receiver Worker (services/api-provisioning-receiver/): verifies HMAC signature + replay window; validates email format, registrable domain, and MX records via Cloudflare DoH; authenticates JWT via Auth0 /userinfo; upserts team organization (type='team', current_plan=tier); adds user to org membership; calls api-keys-create Supabase Edge Function
- api-keys-create (Supabase/Deno): generates obtk_ token, inserts api_keys with tier from request body, syncs to Cloudflare KV
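The sender/receiver handshake relies on HMAC-SHA256 signing plus a replay window. The wire format is not documented here, so the following is only a sketch under stated assumptions: the signed message is `<timestamp>.<body>`, the signature is hex-encoded, and the replay window is five minutes. The function names are hypothetical; the secret would presumably come from the SHARED_SECRET binding.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: the signed message is "<timestampMs>.<body>", hex-encoded.
function signPayload(secret: string, body: string, timestampMs: number): string {
  return createHmac("sha256", secret).update(`${timestampMs}.${body}`).digest("hex");
}

// Rejects stale or future-dated requests first, then compares signatures in constant time.
function verifyPayload(
  secret: string,
  body: string,
  timestampMs: number,
  signatureHex: string,
  nowMs: number,
  replayWindowMs: number = 5 * 60_000 // assumption: 5-minute replay window
): boolean {
  if (Math.abs(nowMs - timestampMs) > replayWindowMs) return false;
  const expected = Buffer.from(signPayload(secret, body, timestampMs), "hex");
  const received = Buffer.from(signatureHex, "hex");
  if (received.length !== expected.length) return false; // timingSafeEqual requires equal lengths
  return timingSafeEqual(expected, received);
}
```

Including the timestamp in the signed message means an attacker cannot replay an old body with a fresh timestamp without knowing the secret.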
Tier values (api_key_tier DB enum): starter | growth | enterprise — defaults to starter.
Receiver env bindings: AUTH0_DOMAIN, SUPABASE_URL, SUPABASE_ANON_KEY, SUPABASE_SERVICE_ROLE_KEY, SHARED_SECRET.
See docs/api-provisioning-flutter-contract.md | docs/auth/api-key-provisioning.md | docs/api-provisioning-security.md.
cd services/e2e && npm test # 7 files, 30 E2E scenarios (api-key-auth, dashboard-auth, receiver-security, sender-receiver, …)

Evaluation Libraries
LLM-as-Judge
G-Eval (chain-of-thought + logprob normalization), QAG faithfulness, position bias mitigation (mitigatedPairwiseEval), panel evaluation, circuit breaker + retry. Zod schemas co-located as single source of truth.
See docs/quality/llm-as-judge.md.
Agent-as-Judge
ProceduralJudge (fixed pipeline, early termination), ReactiveJudge (adaptive routing, LRU state), tool verification (selection 40% / args 30% / result 30%), trajectory efficiency analysis, multi-agent handoff scoring.
See docs/quality/agent-as-judge.md.
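The tool verification weights (selection 40%, args 30%, result 30%) suggest a weighted sum. As a sketch only, assuming each sub-score lies in [0, 1] and the three are combined linearly (the library's actual scoring may differ), with hypothetical names throughout:

```typescript
// Weights as stated in the README.
const TOOL_VERIFICATION_WEIGHTS = { selection: 0.4, args: 0.3, result: 0.3 } as const;

// Assumption: each sub-score is a number in [0, 1].
interface ToolVerificationScores {
  selection: number; // was the right tool chosen?
  args: number; // were the arguments correct?
  result: number; // was the result used correctly?
}

// Hypothetical helper: linear combination of the three sub-scores.
function toolVerificationScore(s: ToolVerificationScores): number {
  const w = TOOL_VERIFICATION_WEIGHTS;
  return w.selection * s.selection + w.args * s.args + w.result * s.result;
}
```

For example, a call with perfect tool selection, half-correct arguments, and an unusable result would score 0.4 + 0.15 + 0 = 0.55 under these assumptions.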
Quality Pipeline (Hooks)
- T1 rule-based: tool_correctness, evaluation_latency, task_completion (every invocation, zero cost)
- T2 LLM judge: relevance, coherence, faithfulness, hallucination (sampled, budget-controlled)
- Divergence detection: entropy-based bimodal alerts for relevance, coherence, task_completion
- Regression detection: post-T2 inline EWMA drift check, emits quality.degradation_confirmed OTel event
- Meta-evaluation: explanation quality scoring via evaluateExplanationQuality() (R6.2)
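The regression check combines EWMA drift with consecutive breach tracking. The parameters used by the pipeline are not given in this README, so the sketch below picks illustrative defaults (smoothing factor 0.3, tolerance 0.05, three consecutive breaches) and a hypothetical function name. It shows the general technique, not the hook's actual code.

```typescript
interface DriftResult {
  ewma: number; // final smoothed score
  breaches: number; // current run of consecutive breaches
  degraded: boolean; // true once the run reached requiredBreaches
}

// Illustrative EWMA drift check; all default parameters are assumptions.
function detectEwmaDrift(
  scores: number[],
  baseline: number,
  alpha: number = 0.3,
  tolerance: number = 0.05,
  requiredBreaches: number = 3
): DriftResult {
  let ewma = baseline; // seed the average with the expected quality level
  let breaches = 0;
  let degraded = false;
  for (const score of scores) {
    ewma = alpha * score + (1 - alpha) * ewma; // exponentially weighted update
    breaches = ewma < baseline - tolerance ? breaches + 1 : 0; // reset on recovery
    if (breaches >= requiredBreaches) degraded = true;
  }
  return { ewma, breaches, degraded };
}
```

Requiring several consecutive breaches keeps a single low score from raising an alert, while a sustained drop still trips the check within a few samples.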
Dashboard
React 19 + Vite 8 in dashboard/ (git submodule). Hono API on :3001, Auth0 Universal Login, role-based access via Supabase.
Routes: / (overview), /metrics/:name, /role/:roleName, /correlations, /coverage, /traces/:traceId, /sessions/:sessionId, /agents/:sessionId, /compliance
Deployed as Cloudflare Pages (frontend) + Worker (quality-metrics-api). Data synced from local pipeline via sync-to-kv.ts.
cd dashboard && npm run dev # :5173 + API :3001
cd dashboard && npm run populate -- --seed

Integrations
| Platform | Method | Status |
|----------|--------|--------|
| Claude Code | Native MCP | Full |
| Cursor / Windsurf / Continue.dev / Cline | MCP config | Full |
| Any OTel agent | OTLP → local JSONL | Full |
| obtool-ingest | OTLP → Cloudflare R2 | Full |
| Langfuse / Phoenix / Datadog / Confident AI | OTLP / HTTP export | Export only |

OTel GenAI semconv v1.40.0 compliance. 15 LLM providers supported (anthropic, openai, gcp.gemini, gcp.vertex_ai, aws.bedrock, azure.ai.openai, mistral_ai, cohere, groq, ollama, together_ai, fireworks_ai, huggingface, replicate, perplexity).
Development
npm install && npm run build && npm test

See docs/repomix/token-tree.txt for the full file tree with token counts.
Documentation
| Doc | Description |
|-----|-------------|
| docs/CHANGELOG.md | Version index v2.0.1 → v3.0.14 |
| docs/roadmap/README.md | Roadmap, research directions, architecture docs index |
| docs/otel-v2/otel-genai-attribute-reference.md | OTel GenAI attribute reference (v1.40.0) |
| docs/otel-v2/agent-span-hierarchies.md | Agent span model and hierarchy patterns |
| docs/otel-v2/schema-migration.md | JSONL → OTLP migration |
| docs/otel-v3/llm-evaluation-frameworks.md | Langfuse, Phoenix, DeepEval, Datadog comparison |
| docs/quality/llm-as-judge.md | LLM-as-Judge architecture |
| docs/quality/agent-as-judge.md | Agent-as-Judge architecture |
| docs/api-provisioning-flutter-contract.md | Flutter client integration |
| docs/auth/api-key-provisioning.md | End-to-end provisioning flow diagram |
| docs/api-provisioning-security.md | Security properties and threat model |
| docs/reliability/security.md | Security controls |
| docs/hooks-integration.md | Hooks system integration: producer-consumer architecture, type chain, path coupling |
| docs/test-anti-patterns.md | Test anti-patterns: patterns to avoid and shared helpers to use instead |
