observability-toolkit
MCP server for observability tooling - query traces, metrics, and logs from local JSONL files for Claude Code sessions. Optionally integrates with SigNoz Cloud for enhanced observability.
Installation
```bash
claude mcp add observability-toolkit -- npx -y observability-toolkit
```

Or for local development:

```bash
claude mcp add observability-toolkit -- node ~/.claude/mcp-servers/observability-toolkit/dist/server.js
```

Tools
| Tool | Description |
|------|-------------|
| obs_query_traces | Query traces with filtering, regex, numeric operators |
| obs_query_metrics | Query metrics with aggregations (sum, avg, p50, p95, p99, rate) |
| obs_query_logs | Query logs with boolean search, field extraction |
| obs_query_llm_events | Query LLM events with token usage and duration metrics |
| obs_query_evaluations | Query evaluation events with aggregations and groupBy |
| obs_query_verifications | Query human verification events for EU AI Act compliance |
| obs_health_check | Check telemetry system health with cache statistics |
| obs_context_stats | Get context window utilization stats |
| obs_get_trace_url | Get SigNoz trace viewer URL (requires SigNoz) |
| obs_setup_claudeignore | Add entries to .claudeignore |
| obs_export_langfuse | Export evaluations to Langfuse via OTLP HTTP |
Configuration
| Variable | Description | Default |
|----------|-------------|---------|
| TELEMETRY_DIR | Local telemetry directory | ~/.claude/telemetry |
| SIGNOZ_URL | SigNoz instance URL | - |
| SIGNOZ_API_KEY | SigNoz API key | - |
| CACHE_TTL_MS | Query cache TTL in milliseconds | 60000 |
| RETENTION_DAYS | Days to retain telemetry files | 7 |
| LANGFUSE_ENDPOINT | Langfuse OTLP endpoint URL | - |
| LANGFUSE_PUBLIC_KEY | Langfuse public key | - |
| LANGFUSE_SECRET_KEY | Langfuse secret key | - |
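As a rough illustration of how these variables combine with the defaults above, a server might resolve them as shown below (a minimal sketch assuming standard process.env lookup; the config object and field names are illustrative, not the toolkit's actual startup code):

```typescript
// Illustrative only: how the environment variables above might be resolved.
// The variable names match the table; the helper object and defaults are a
// sketch, not the server's actual implementation.
import os from 'node:os';
import path from 'node:path';

const env = process.env;

const config = {
  telemetryDir: env.TELEMETRY_DIR ?? path.join(os.homedir(), '.claude', 'telemetry'),
  signozUrl: env.SIGNOZ_URL,                 // optional: enables the SigNoz backend
  signozApiKey: env.SIGNOZ_API_KEY,          // optional
  cacheTtlMs: Number(env.CACHE_TTL_MS ?? 60000),
  retentionDays: Number(env.RETENTION_DAYS ?? 7),
  langfuseEndpoint: env.LANGFUSE_ENDPOINT,   // optional: enables obs_export_langfuse
  langfusePublicKey: env.LANGFUSE_PUBLIC_KEY,
  langfuseSecretKey: env.LANGFUSE_SECRET_KEY,
};
```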
Usage Examples
Query Traces
```javascript
// Basic query
obs_query_traces({ limit: 10 })

// Filter by trace ID
obs_query_traces({ traceId: "abc123..." })

// Filter by service and duration
obs_query_traces({ serviceName: "claude-code", minDurationMs: 100 })

// Regex pattern matching
obs_query_traces({ spanNameRegex: "^http\\..*" })

// Numeric attribute filtering
obs_query_traces({
  numericFilter: [
    { attribute: "http.status_code", operator: "gte", value: 400 }
  ]
})

// Existence checks
obs_query_traces({
  attributeExists: ["error.message"],
  attributeNotExists: ["http.response.body"]
})

// OTel GenAI agent/tool filters
obs_query_traces({ agentName: "Explore", toolName: "Read" })
obs_query_traces({ operationName: "execute_tool", toolCallId: "toolu_123" })
```

Query Logs
```javascript
// Basic severity filter
obs_query_logs({ severity: "ERROR", limit: 20 })

// Boolean search (AND)
obs_query_logs({
  searchTerms: ["timeout", "connection"],
  searchOperator: "AND"
})

// Boolean search (OR)
obs_query_logs({
  searchTerms: ["error", "warning", "critical"],
  searchOperator: "OR"
})

// Field extraction from JSON logs
obs_query_logs({
  extractFields: ["user.id", "request.method", "response.status"]
})

// Exclude patterns
obs_query_logs({
  search: "error",
  excludeSearch: "health-check"
})
```

Query Metrics
```javascript
// Basic query
obs_query_metrics({ metricName: "session.context.size" })

// Aggregations
obs_query_metrics({ metricName: "http.duration", aggregation: "avg" })
obs_query_metrics({ metricName: "http.duration", aggregation: "p95" })
obs_query_metrics({ metricName: "requests.count", aggregation: "rate" })

// Time bucket grouping
obs_query_metrics({
  metricName: "token.usage",
  aggregation: "sum",
  timeBucket: "1h",
  groupBy: ["model"]
})

// Percentiles
obs_query_metrics({ metricName: "latency", aggregation: "p99" })
```

Query LLM Events
```javascript
// Basic query
obs_query_llm_events({ limit: 20 })

// Filter by model and provider
obs_query_llm_events({ model: "claude-3-opus", provider: "anthropic" })

// OTel GenAI operation types
obs_query_llm_events({ operationName: "chat" })
obs_query_llm_events({ operationName: "invoke_agent" })

// Filter by conversation
obs_query_llm_events({ conversationId: "conv-abc123" })

// Combine filters
obs_query_llm_events({
  operationName: "chat",
  provider: "anthropic",
  conversationId: "conv-abc123"
})
```

Multi-Provider LLM Support
Query events from any LLM provider using OTel GenAI standard identifiers:
```javascript
// Anthropic Claude
obs_query_llm_events({ provider: "anthropic", model: "claude-3-opus" })

// OpenAI
obs_query_llm_events({ provider: "openai", model: "gpt-4o" })

// Google Gemini
obs_query_llm_events({ provider: "gcp.gemini", model: "gemini-1.5-pro" })

// Mistral AI
obs_query_llm_events({ provider: "mistral_ai", model: "mistral-large" })

// Cohere
obs_query_llm_events({ provider: "cohere", model: "command-r-plus" })

// AWS Bedrock (multi-model)
obs_query_llm_events({ provider: "aws.bedrock" })

// Azure OpenAI
obs_query_llm_events({ provider: "azure.ai.openai" })

// Local models (Ollama)
obs_query_llm_events({ provider: "ollama", model: "llama3:8b" })

// Groq
obs_query_llm_events({ provider: "groq", model: "llama-3.3-70b" })
```

Provider Fallback Chain: The toolkit uses OTel GenAI v1.39 compliant attribute lookup:

gen_ai.provider.name (primary) → gen_ai.system (legacy OTel) → provider (custom/fallback)
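A minimal sketch of that fallback order (the attribute keys are the ones listed above; resolveProvider is a hypothetical helper, not part of the toolkit's API):

```typescript
// Illustrative sketch of the provider fallback chain described above.
// resolveProvider() is a hypothetical helper, not a toolkit export.
type Attributes = Record<string, unknown>;

function resolveProvider(attrs: Attributes): string | undefined {
  return (
    (attrs['gen_ai.provider.name'] as string | undefined) ?? // OTel GenAI v1.39 (primary)
    (attrs['gen_ai.system'] as string | undefined) ??        // legacy OTel attribute
    (attrs['provider'] as string | undefined)                // custom / fallback
  );
}

// resolveProvider({ 'gen_ai.system': 'openai' }) === 'openai'
```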
Health Check with Cache Stats
```javascript
obs_health_check({ verbose: true })

// Returns:
{
  "status": "ok",
  "backends": { ... },
  "cache": {
    "traces": { "hits": 10, "misses": 5, "hitRate": 0.67, "size": 15, "evictions": 0 },
    "logs": { "hits": 8, "misses": 12, "hitRate": 0.4, "size": 20, "evictions": 2 },
    "metrics": { "hits": 0, "misses": 0, "hitRate": 0, "size": 0, "evictions": 0 },
    "llmEvents": { "hits": 0, "misses": 0, "hitRate": 0, "size": 0, "evictions": 0 }
  }
}
```

Features
Query Capabilities
| Feature | Description |
|---------|-------------|
| Percentile Aggregations | p50, p95, p99 for metrics |
| Time Bucket Grouping | 1m, 5m, 1h, 1d buckets for trend analysis |
| Rate Calculations | Per-second rate of change |
| Numeric Operators | gt, gte, lt, lte, eq for attribute filtering |
| Regex Patterns | Advanced span name filtering |
| Boolean Search | AND/OR operators for log queries |
| Field Extraction | Extract JSON paths from structured logs |
| Negation Filters | Exclude matching spans/logs |
| Existence Checks | Filter by attribute presence |
OTel Compliance
| Feature | Description |
|---------|-------------|
| severityNumber | Standard OTel severity levels |
| statusCode | UNSET, OK, ERROR for spans |
| Histogram Buckets | Full histogram distribution support |
| InstrumentationScope | Library/module metadata |
| Span Links | Cross-trace relationships |
| Exemplars | Metric-to-trace correlation |
| Aggregation Temporality | DELTA, CUMULATIVE support |
OTel GenAI Semantic Conventions (10/10 compliance)
| Feature | Description |
|---------|-------------|
| gen_ai.operation.name | Filter by chat, embeddings, invoke_agent, execute_tool |
| gen_ai.provider.name | Provider fallback: gen_ai.provider.name → gen_ai.system → provider |
| gen_ai.conversation.id | Filter LLM events by conversation ID |
| gen_ai.agent.id/name | Filter traces by agent attributes |
| gen_ai.tool.name/call.id | Filter traces by tool attributes |
| gen_ai.response.model | Actual model that responded |
| gen_ai.response.finish_reasons | Why generation stopped |
| gen_ai.request.temperature | Sampling temperature |
| gen_ai.request.max_tokens | Maximum output tokens |
| Percentiles | p50, p95, p99, rate aggregations |
Supported LLM Providers
| Provider ID | Description | Example Models |
|-------------|-------------|----------------|
| anthropic | Anthropic Claude | claude-3-opus, claude-3-sonnet, claude-3-haiku |
| openai | OpenAI GPT | gpt-4o, gpt-4-turbo, gpt-3.5-turbo, o1, o1-mini |
| gcp.gemini | Google AI Studio | gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash |
| gcp.vertex_ai | Google Vertex AI | gemini-pro, claude-3-opus (via Vertex) |
| aws.bedrock | AWS Bedrock | claude-3-sonnet, titan-text, llama-3 |
| azure.ai.openai | Azure OpenAI | gpt-4-deployment, gpt-35-turbo |
| mistral_ai | Mistral AI | mistral-large, mistral-small, codestral |
| cohere | Cohere | command-r-plus, command-r, embed-english |
| groq | Groq | llama-3.3-70b, mixtral-8x7b |
| ollama | Ollama (local) | llama3, mistral, codellama |
| together_ai | Together AI | llama-3-70b, mixtral-8x7b |
| fireworks_ai | Fireworks AI | llama-v3-70b, mixtral-8x7b |
| huggingface | HuggingFace | Various open models |
| replicate | Replicate | Various hosted models |
| perplexity | Perplexity | sonar-pro, sonar |
Note: Custom provider identifiers are also supported for internal or unlisted LLM services.
Performance
| Feature | Description |
|---------|-------------|
| Query Caching | LRU cache with configurable TTL |
| File Indexing | .idx sidecars for fast lookups |
| Gzip Support | Transparent decompression of .jsonl.gz files |
| BatchWriter | Buffered writes to reduce I/O |
| Streaming | Early termination for large files |
| Parallel Queries | Concurrent multi-directory queries |
| Cursor Pagination | Efficient large result set handling |
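As a rough sketch of the Query Caching row, an LRU cache with a TTL behaves like the class below (illustrative only; the toolkit's real cache also reports hits, misses, and evictions via obs_health_check):

```typescript
// Minimal LRU-with-TTL sketch to illustrate the query cache behaviour above.
// Illustrative only, not the toolkit's actual cache class.
class LruTtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxSize = 100, private ttlMs = 60_000) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {        // expired: drop and report a miss
      this.entries.delete(key);
      return undefined;
    }
    this.entries.delete(key);                  // re-insert to refresh LRU recency
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.entries.size >= this.maxSize) {   // evict the least recently used key
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```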
Internal Observability
| Feature | Description |
|---------|-------------|
| Cache Metrics | Hit/miss/eviction tracking |
| Query Timing | Slow query warnings (>500ms) |
| Circuit Breaker Logging | State transition visibility |
| Health Check Stats | Cache statistics in health output |
Security
| Feature | Description |
|---------|-------------|
| Query Escaping | ClickHouse-specific escaping, 22-pattern blocklist |
| Memory Limits | MAX_RESULTS_IN_MEMORY=10000, streaming aggregation |
| Input Validation | limit≤1000, date range≤365 days, regex limits |
| Type Safety | NaN/Infinity rejection, explicit type assertions |
See docs/security.md for details.
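A sketch of the kind of input validation the table describes (the limits are taken from the table; the function and error messages are illustrative, not the toolkit's code):

```typescript
// Illustrative input validation using the limits from the Security table.
const MAX_LIMIT = 1000;
const MAX_DATE_RANGE_DAYS = 365;

function validateQueryInput(limit: number, startDate: Date, endDate: Date): void {
  if (!Number.isFinite(limit)) throw new Error('limit must be a finite number'); // rejects NaN/Infinity
  if (limit < 1 || limit > MAX_LIMIT) throw new Error(`limit must be 1..${MAX_LIMIT}`);

  const rangeDays = (endDate.getTime() - startDate.getTime()) / 86_400_000;
  if (rangeDays < 0 || rangeDays > MAX_DATE_RANGE_DAYS) {
    throw new Error(`date range must be 0..${MAX_DATE_RANGE_DAYS} days`);
  }
}
```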
Data Sources
Local JSONL (Default)
Scans multiple telemetry directories:
- Global: ~/.claude/telemetry/ (always checked)
- Project-local: .claude/telemetry/, telemetry/, .telemetry/

File patterns (supports gzip compression):

- traces-YYYY-MM-DD.jsonl / .jsonl.gz
- logs-YYYY-MM-DD.jsonl / .jsonl.gz
- metrics-YYYY-MM-DD.jsonl / .jsonl.gz
- llm-events-YYYY-MM-DD.jsonl / .jsonl.gz
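Gzipped files are read transparently; the sketch below shows how a .jsonl.gz telemetry file can be streamed line by line in Node (illustrative only, not the toolkit's actual reader):

```typescript
// Illustrative: stream-read a (possibly gzipped) JSONL telemetry file.
// Shown only to clarify the file format; not the toolkit's reader.
import { createReadStream } from 'node:fs';
import { createGunzip } from 'node:zlib';
import { createInterface } from 'node:readline';

async function* readJsonl(filePath: string): AsyncGenerator<unknown> {
  const raw = createReadStream(filePath);
  const stream = filePath.endsWith('.gz') ? raw.pipe(createGunzip()) : raw;
  const lines = createInterface({ input: stream, crlfDelay: Infinity });
  for await (const line of lines) {
    if (line.trim()) yield JSON.parse(line);   // one telemetry record per line
  }
}

// for await (const record of readJsonl('traces-2026-01-28.jsonl.gz')) { ... }
```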
SigNoz Cloud (Optional)
When configured, queries SigNoz Cloud API with:
- Circuit breaker protection
- Cursor-based pagination
- Response time tracking
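Circuit breaker protection means that after repeated SigNoz failures, further calls are rejected for a cooldown period instead of hammering the API; a minimal sketch of the pattern (illustrative, not the toolkit's implementation):

```typescript
// Illustrative circuit breaker: after `threshold` consecutive failures, calls
// are rejected until `cooldownMs` has elapsed. Not the toolkit's actual class.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 60_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold && Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error('circuit open: SigNoz backend temporarily disabled');
    }
    try {
      const result = await fn();
      this.failures = 0;                       // a success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```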
OTLP Export
Export data in OpenTelemetry format:
```javascript
// Export traces
const otlpTraces = await backend.exportTracesOTLP({ startDate: "2026-01-28" });

// Export logs
const otlpLogs = await backend.exportLogsOTLP({ severity: "ERROR" });

// Export metrics
const otlpMetrics = await backend.exportMetricsOTLP({ metricName: "http.duration" });
```

Langfuse Export (v1.8.6+)
Export evaluations to Langfuse for unified tracing and evaluation analysis:
```javascript
// Export all evaluations from last 7 days
obs_export_langfuse({})

// Export with filters
obs_export_langfuse({
  evaluationName: "quality",
  scoreMin: 0.8,
  limit: 500,
  batchSize: 100
})

// Dry run to preview export
obs_export_langfuse({
  startDate: "2026-01-28",
  dryRun: true
})

// Override credentials (for testing)
obs_export_langfuse({
  endpoint: "https://cloud.langfuse.com",
  publicKey: "pk-lf-...",
  secretKey: "sk-lf-..."
})
```

Features:
- Batched OTLP HTTP export with retry logic
- Memory protection (400MB warn, 600MB abort)
- Progress logging for large exports
- Credential sanitization in error messages
- DNS rebinding protection
Evaluation Libraries
LLM-as-Judge (src/lib/llm-as-judge.ts)
Single-pass LLM evaluation for output quality:
```typescript
import { gEval, qagEvaluate, JudgeCircuitBreaker } from './lib/llm-as-judge.js';

// G-Eval pattern with chain-of-thought
const result = await gEval(testCase, criteria, llmFn);

// QAG faithfulness evaluation
const faithfulness = await qagEvaluate(testCase, llmFn);

// Production circuit breaker
const breaker = new JudgeCircuitBreaker(5, 60000);
const result = await breaker.evaluate(() => gEval(...));
```

Agent-as-Judge (src/lib/agent-as-judge.ts)
Multi-step agent evaluation with trajectory analysis:
```typescript
import {
  verifyToolCalls,
  aggregateStepScores,
  analyzeTrajectory,
  collectiveConsensus,
  ProceduralJudge,
  ReactiveJudge,
} from './lib/agent-as-judge.js';

// Verify tool call correctness
const verifications = verifyToolCalls(actions, expectedTools);

// Analyze agent trajectory efficiency
const metrics = analyzeTrajectory({ actions, expectedSteps: 5 });

// Multi-agent consensus evaluation
const consensus = await collectiveConsensus(judges, { id: 'eval-1' }, {
  rounds: 3,
  convergenceThreshold: 0.05,
});

// Procedural multi-stage evaluation
const proceduralJudge = new ProceduralJudge([
  { name: 'syntax', evaluate: syntaxChecker },
  { name: 'semantic', evaluate: semanticAnalyzer },
]);
const result = await proceduralJudge.evaluate(evaluand);

// Reactive specialist-based evaluation
const reactiveJudge = new ReactiveJudge(router, specialists, deepDiveSpecialists);
const result = await reactiveJudge.evaluate(evaluand);
```

Query Agent Evaluations
```javascript
// Filter by agent ID/name
obs_query_evaluations({
  agentId: 'agent-123',
  agentName: 'TaskRunner',
  evaluationName: 'tool_correctness',
})

// Response includes agent-specific fields
{
  stepScores: [{ step: 0, score: 0.9, explanation: '...' }],
  toolVerifications: [{ toolName: 'search', toolCorrect: true, score: 1.0 }],
  trajectoryLength: 5,
}
```

Development
```bash
cd ~/.claude/mcp-servers/observability-toolkit
npm install
npm run build
npm test          # 3254 tests
npm run start
```

Documentation
- docs/changelog/ - Version history and changelogs
- docs/reliability/security.md - Security controls and hardening
- docs/quality/llm-as-judge.md - LLM-as-Judge architecture
- docs/quality/agent-as-judge.md - Agent-as-Judge architecture
- docs/backlog/ - Feature backlog and roadmap
- docs/changelog/SESSION_HISTORY.md - Development session logs
- docs/Summary.md - Full documentation index
