observability-toolkit
MCP server for observability tooling - query traces, metrics, and logs from local JSONL files for Claude Code sessions. Optionally integrates with SigNoz Cloud for enhanced observability.
Installation
```bash
claude mcp add observability-toolkit -- npx -y observability-toolkit
```

Or for local development:

```bash
claude mcp add observability-toolkit -- node ~/.claude/mcp-servers/observability-toolkit/dist/server.js
```

Tools
| Tool | Description |
|------|-------------|
| obs_query_traces | Query traces with filtering, regex, numeric operators |
| obs_query_metrics | Query metrics with aggregations (sum, avg, p50, p95, p99, rate) |
| obs_query_logs | Query logs with boolean search, field extraction |
| obs_query_llm_events | Query LLM events with token usage and duration metrics |
| obs_query_evaluations | Query evaluation events with aggregations and groupBy |
| obs_query_verifications | Query human verification events for EU AI Act compliance |
| obs_health_check | Check telemetry system health with cache statistics |
| obs_context_stats | Get context window utilization stats |
| obs_get_trace_url | Get SigNoz trace viewer URL (requires SigNoz) |
| obs_setup_claudeignore | Add entries to .claudeignore |
| obs_export_langfuse | Export evaluations to Langfuse via OTLP HTTP |
Configuration
| Variable | Description | Default |
|----------|-------------|---------|
| TELEMETRY_DIR | Local telemetry directory | ~/.claude/telemetry |
| SIGNOZ_URL | SigNoz instance URL | - |
| SIGNOZ_API_KEY | SigNoz API key | - |
| CACHE_TTL_MS | Query cache TTL in milliseconds | 60000 |
| RETENTION_DAYS | Days to retain telemetry files | 7 |
| LANGFUSE_ENDPOINT | Langfuse OTLP endpoint URL | - |
| LANGFUSE_PUBLIC_KEY | Langfuse public key | - |
| LANGFUSE_SECRET_KEY | Langfuse secret key | - |
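As a rough illustration of how these variables combine with the defaults above, a server might resolve them as shown below (a minimal sketch assuming standard process.env lookup; the config object and field names are illustrative, not the toolkit's actual startup code):

```typescript
// Illustrative only: how the environment variables above might be resolved.
// The variable names match the table; the helper object and defaults are a
// sketch, not the server's actual implementation.
import os from 'node:os';
import path from 'node:path';

const env = process.env;

const config = {
  telemetryDir: env.TELEMETRY_DIR ?? path.join(os.homedir(), '.claude', 'telemetry'),
  signozUrl: env.SIGNOZ_URL,                 // optional: enables the SigNoz backend
  signozApiKey: env.SIGNOZ_API_KEY,          // optional
  cacheTtlMs: Number(env.CACHE_TTL_MS ?? 60000),
  retentionDays: Number(env.RETENTION_DAYS ?? 7),
  langfuseEndpoint: env.LANGFUSE_ENDPOINT,   // optional: enables obs_export_langfuse
  langfusePublicKey: env.LANGFUSE_PUBLIC_KEY,
  langfuseSecretKey: env.LANGFUSE_SECRET_KEY,
};
```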
Usage Examples
Query Traces
```javascript
// Basic query
obs_query_traces({ limit: 10 })

// Filter by trace ID
obs_query_traces({ traceId: "abc123..." })

// Filter by service and duration
obs_query_traces({ serviceName: "claude-code", minDurationMs: 100 })

// Regex pattern matching
obs_query_traces({ spanNameRegex: "^http\\..*" })

// Numeric attribute filtering
obs_query_traces({
  numericFilter: [
    { attribute: "http.status_code", operator: "gte", value: 400 }
  ]
})

// Existence checks
obs_query_traces({
  attributeExists: ["error.message"],
  attributeNotExists: ["http.response.body"]
})

// OTel GenAI agent/tool filters
obs_query_traces({ agentName: "Explore", toolName: "Read" })
obs_query_traces({ operationName: "execute_tool", toolCallId: "toolu_123" })
```

Query Logs
```javascript
// Basic severity filter
obs_query_logs({ severity: "ERROR", limit: 20 })

// Boolean search (AND)
obs_query_logs({
  searchTerms: ["timeout", "connection"],
  searchOperator: "AND"
})

// Boolean search (OR)
obs_query_logs({
  searchTerms: ["error", "warning", "critical"],
  searchOperator: "OR"
})

// Field extraction from JSON logs
obs_query_logs({
  extractFields: ["user.id", "request.method", "response.status"]
})

// Exclude patterns
obs_query_logs({
  search: "error",
  excludeSearch: "health-check"
})
```

Query Metrics
```javascript
// Basic query
obs_query_metrics({ metricName: "session.context.size" })

// Aggregations
obs_query_metrics({ metricName: "http.duration", aggregation: "avg" })
obs_query_metrics({ metricName: "http.duration", aggregation: "p95" })
obs_query_metrics({ metricName: "requests.count", aggregation: "rate" })

// Time bucket grouping
obs_query_metrics({
  metricName: "token.usage",
  aggregation: "sum",
  timeBucket: "1h",
  groupBy: ["model"]
})

// Percentiles
obs_query_metrics({ metricName: "latency", aggregation: "p99" })
```

Query LLM Events
```javascript
// Basic query
obs_query_llm_events({ limit: 20 })

// Filter by model and provider
obs_query_llm_events({ model: "claude-3-opus", provider: "anthropic" })

// OTel GenAI operation types
obs_query_llm_events({ operationName: "chat" })
obs_query_llm_events({ operationName: "invoke_agent" })

// Filter by conversation
obs_query_llm_events({ conversationId: "conv-abc123" })

// Combine filters
obs_query_llm_events({
  operationName: "chat",
  provider: "anthropic",
  conversationId: "conv-abc123"
})
```

Multi-Provider LLM Support
Query events from any LLM provider using OTel GenAI standard identifiers:
```javascript
// Anthropic Claude
obs_query_llm_events({ provider: "anthropic", model: "claude-3-opus" })

// OpenAI
obs_query_llm_events({ provider: "openai", model: "gpt-4o" })

// Google Gemini
obs_query_llm_events({ provider: "gcp.gemini", model: "gemini-1.5-pro" })

// Mistral AI
obs_query_llm_events({ provider: "mistral_ai", model: "mistral-large" })

// Cohere
obs_query_llm_events({ provider: "cohere", model: "command-r-plus" })

// AWS Bedrock (multi-model)
obs_query_llm_events({ provider: "aws.bedrock" })

// Azure OpenAI
obs_query_llm_events({ provider: "azure.ai.openai" })

// Local models (Ollama)
obs_query_llm_events({ provider: "ollama", model: "llama3:8b" })

// Groq
obs_query_llm_events({ provider: "groq", model: "llama-3.3-70b" })
```

Provider Fallback Chain: The toolkit uses OTel GenAI v1.39 compliant attribute lookup:

gen_ai.provider.name (primary) → gen_ai.system (legacy OTel) → provider (custom/fallback)
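A minimal sketch of that fallback order (the attribute keys are the ones listed above; resolveProvider is a hypothetical helper, not part of the toolkit's API):

```typescript
// Illustrative sketch of the provider fallback chain described above.
// resolveProvider() is a hypothetical helper, not a toolkit export.
type Attributes = Record<string, unknown>;

function resolveProvider(attrs: Attributes): string | undefined {
  return (
    (attrs['gen_ai.provider.name'] as string | undefined) ?? // OTel GenAI v1.39 (primary)
    (attrs['gen_ai.system'] as string | undefined) ??        // legacy OTel attribute
    (attrs['provider'] as string | undefined)                // custom / fallback
  );
}

// resolveProvider({ 'gen_ai.system': 'openai' }) === 'openai'
```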
Health Check with Cache Stats
```javascript
obs_health_check({ verbose: true })

// Returns:
{
  "status": "ok",
  "backends": { ... },
  "cache": {
    "traces": { "hits": 10, "misses": 5, "hitRate": 0.67, "size": 15, "evictions": 0 },
    "logs": { "hits": 8, "misses": 12, "hitRate": 0.4, "size": 20, "evictions": 2 },
    "metrics": { "hits": 0, "misses": 0, "hitRate": 0, "size": 0, "evictions": 0 },
    "llmEvents": { "hits": 0, "misses": 0, "hitRate": 0, "size": 0, "evictions": 0 }
  }
}
```

Features
Query Capabilities
| Feature | Description |
|---------|-------------|
| Percentile Aggregations | p50, p95, p99 for metrics |
| Time Bucket Grouping | 1m, 5m, 1h, 1d buckets for trend analysis |
| Rate Calculations | Per-second rate of change |
| Numeric Operators | gt, gte, lt, lte, eq for attribute filtering |
| Regex Patterns | Advanced span name filtering |
| Boolean Search | AND/OR operators for log queries |
| Field Extraction | Extract JSON paths from structured logs |
| Negation Filters | Exclude matching spans/logs |
| Existence Checks | Filter by attribute presence |
OTel Compliance
| Feature | Description |
|---------|-------------|
| severityNumber | Standard OTel severity levels |
| statusCode | UNSET, OK, ERROR for spans |
| Histogram Buckets | Full histogram distribution support |
| InstrumentationScope | Library/module metadata |
| Span Links | Cross-trace relationships |
| Exemplars | Metric-to-trace correlation |
| Aggregation Temporality | DELTA, CUMULATIVE support |
OTel GenAI Semantic Conventions (10/10 compliance)
| Feature | Description |
|---------|-------------|
| gen_ai.operation.name | Filter by chat, embeddings, invoke_agent, execute_tool |
| gen_ai.provider.name | Provider fallback: gen_ai.provider.name → gen_ai.system → provider |
| gen_ai.conversation.id | Filter LLM events by conversation ID |
| gen_ai.agent.id/name | Filter traces by agent attributes |
| gen_ai.tool.name/call.id | Filter traces by tool attributes |
| gen_ai.response.model | Actual model that responded |
| gen_ai.response.finish_reasons | Why generation stopped |
| gen_ai.request.temperature | Sampling temperature |
| gen_ai.request.max_tokens | Maximum output tokens |
| Percentiles | p50, p95, p99, rate aggregations |
Supported LLM Providers
| Provider ID | Description | Example Models |
|-------------|-------------|----------------|
| anthropic | Anthropic Claude | claude-3-opus, claude-3-sonnet, claude-3-haiku |
| openai | OpenAI GPT | gpt-4o, gpt-4-turbo, gpt-3.5-turbo, o1, o1-mini |
| gcp.gemini | Google AI Studio | gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash |
| gcp.vertex_ai | Google Vertex AI | gemini-pro, claude-3-opus (via Vertex) |
| aws.bedrock | AWS Bedrock | claude-3-sonnet, titan-text, llama-3 |
| azure.ai.openai | Azure OpenAI | gpt-4-deployment, gpt-35-turbo |
| mistral_ai | Mistral AI | mistral-large, mistral-small, codestral |
| cohere | Cohere | command-r-plus, command-r, embed-english |
| groq | Groq | llama-3.3-70b, mixtral-8x7b |
| ollama | Ollama (local) | llama3, mistral, codellama |
| together_ai | Together AI | llama-3-70b, mixtral-8x7b |
| fireworks_ai | Fireworks AI | llama-v3-70b, mixtral-8x7b |
| huggingface | HuggingFace | Various open models |
| replicate | Replicate | Various hosted models |
| perplexity | Perplexity | sonar-pro, sonar |
Note: Custom provider identifiers are also supported for internal or unlisted LLM services.
Performance
| Feature | Description |
|---------|-------------|
| Query Caching | LRU cache with configurable TTL |
| File Indexing | .idx sidecars for fast lookups |
| Gzip Support | Transparent decompression of .jsonl.gz files |
| BatchWriter | Buffered writes to reduce I/O |
| Streaming | Early termination for large files |
| Parallel Queries | Concurrent multi-directory queries |
| Cursor Pagination | Efficient large result set handling |
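As a rough sketch of the Query Caching row, an LRU cache with a TTL behaves like the class below (illustrative only; the toolkit's real cache also reports hits, misses, and evictions via obs_health_check):

```typescript
// Minimal LRU-with-TTL sketch to illustrate the query cache behaviour above.
// Illustrative only, not the toolkit's actual cache class.
class LruTtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxSize = 100, private ttlMs = 60_000) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {        // expired: drop and report a miss
      this.entries.delete(key);
      return undefined;
    }
    this.entries.delete(key);                  // re-insert to refresh LRU recency
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.entries.size >= this.maxSize) {   // evict the least recently used key
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```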
Internal Observability
| Feature | Description |
|---------|-------------|
| Cache Metrics | Hit/miss/eviction tracking |
| Query Timing | Slow query warnings (>500ms) |
| Circuit Breaker Logging | State transition visibility |
| Health Check Stats | Cache statistics in health output |
Security
| Feature | Description |
|---------|-------------|
| Query Escaping | ClickHouse-specific escaping, 22-pattern blocklist |
| Memory Limits | MAX_RESULTS_IN_MEMORY=10000, streaming aggregation |
| Input Validation | limit≤1000, date range≤365 days, regex limits |
| Type Safety | NaN/Infinity rejection, explicit type assertions |
See docs/security.md for details.
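A sketch of the kind of input validation the table describes (the limits are taken from the table; the function and error messages are illustrative, not the toolkit's code):

```typescript
// Illustrative input validation using the limits from the Security table.
const MAX_LIMIT = 1000;
const MAX_DATE_RANGE_DAYS = 365;

function validateQueryInput(limit: number, startDate: Date, endDate: Date): void {
  if (!Number.isFinite(limit)) throw new Error('limit must be a finite number'); // rejects NaN/Infinity
  if (limit < 1 || limit > MAX_LIMIT) throw new Error(`limit must be 1..${MAX_LIMIT}`);

  const rangeDays = (endDate.getTime() - startDate.getTime()) / 86_400_000;
  if (rangeDays < 0 || rangeDays > MAX_DATE_RANGE_DAYS) {
    throw new Error(`date range must be 0..${MAX_DATE_RANGE_DAYS} days`);
  }
}
```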
Data Sources
Local JSONL (Default)
Scans multiple telemetry directories:
- Global: ~/.claude/telemetry/ (always checked)
- Project-local: .claude/telemetry/, telemetry/, .telemetry/

File patterns (supports gzip compression):

- traces-YYYY-MM-DD.jsonl / .jsonl.gz
- logs-YYYY-MM-DD.jsonl / .jsonl.gz
- metrics-YYYY-MM-DD.jsonl / .jsonl.gz
- llm-events-YYYY-MM-DD.jsonl / .jsonl.gz
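Gzipped files are read transparently; the sketch below shows how a .jsonl.gz telemetry file can be streamed line by line in Node (illustrative only, not the toolkit's actual reader):

```typescript
// Illustrative: stream-read a (possibly gzipped) JSONL telemetry file.
// Shown only to clarify the file format; not the toolkit's reader.
import { createReadStream } from 'node:fs';
import { createGunzip } from 'node:zlib';
import { createInterface } from 'node:readline';

async function* readJsonl(filePath: string): AsyncGenerator<unknown> {
  const raw = createReadStream(filePath);
  const stream = filePath.endsWith('.gz') ? raw.pipe(createGunzip()) : raw;
  const lines = createInterface({ input: stream, crlfDelay: Infinity });
  for await (const line of lines) {
    if (line.trim()) yield JSON.parse(line);   // one telemetry record per line
  }
}

// for await (const record of readJsonl('traces-2026-01-28.jsonl.gz')) { ... }
```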
SigNoz Cloud (Optional)
When configured, queries SigNoz Cloud API with:
- Circuit breaker protection
- Cursor-based pagination
- Response time tracking
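Circuit breaker protection means that after repeated SigNoz failures, further calls are rejected for a cooldown period instead of hammering the API; a minimal sketch of the pattern (illustrative, not the toolkit's implementation):

```typescript
// Illustrative circuit breaker: after `threshold` consecutive failures, calls
// are rejected until `cooldownMs` has elapsed. Not the toolkit's actual class.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 60_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold && Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error('circuit open: SigNoz backend temporarily disabled');
    }
    try {
      const result = await fn();
      this.failures = 0;                       // a success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```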
OTLP Export
Export data in OpenTelemetry format:
```javascript
// Export traces
const otlpTraces = await backend.exportTracesOTLP({ startDate: "2026-01-28" });

// Export logs
const otlpLogs = await backend.exportLogsOTLP({ severity: "ERROR" });

// Export metrics
const otlpMetrics = await backend.exportMetricsOTLP({ metricName: "http.duration" });
```

Langfuse Export (v1.8.6+)
Export evaluations to Langfuse for unified tracing and evaluation analysis:
```javascript
// Export all evaluations from last 7 days
obs_export_langfuse({})

// Export with filters
obs_export_langfuse({
  evaluationName: "quality",
  scoreMin: 0.8,
  limit: 500,
  batchSize: 100
})

// Dry run to preview export
obs_export_langfuse({
  startDate: "2026-01-28",
  dryRun: true
})

// Override credentials (for testing)
obs_export_langfuse({
  endpoint: "https://cloud.langfuse.com",
  publicKey: "pk-lf-...",
  secretKey: "sk-lf-..."
})
```

Features:
- Batched OTLP HTTP export with retry logic
- Memory protection (400MB warn, 600MB abort)
- Progress logging for large exports
- Credential sanitization in error messages
- DNS rebinding protection
Evaluation Libraries
LLM-as-Judge (src/lib/llm-as-judge.ts)
Single-pass LLM evaluation for output quality:
```typescript
import { gEval, qagEvaluate, JudgeCircuitBreaker } from './lib/llm-as-judge.js';

// G-Eval pattern with chain-of-thought
const result = await gEval(testCase, criteria, llmFn);

// QAG faithfulness evaluation
const faithfulness = await qagEvaluate(testCase, llmFn);

// Production circuit breaker
const breaker = new JudgeCircuitBreaker(5, 60000);
const result = await breaker.evaluate(() => gEval(...));
```

Agent-as-Judge (src/lib/agent-as-judge.ts)
Multi-step agent evaluation with trajectory analysis:
```typescript
import {
  verifyToolCalls,
  aggregateStepScores,
  analyzeTrajectory,
  collectiveConsensus,
  ProceduralJudge,
  ReactiveJudge,
} from './lib/agent-as-judge.js';

// Verify tool call correctness
const verifications = verifyToolCalls(actions, expectedTools);

// Analyze agent trajectory efficiency
const metrics = analyzeTrajectory({ actions, expectedSteps: 5 });

// Multi-agent consensus evaluation
const consensus = await collectiveConsensus(judges, { id: 'eval-1' }, {
  rounds: 3,
  convergenceThreshold: 0.05,
});

// Procedural multi-stage evaluation
const proceduralJudge = new ProceduralJudge([
  { name: 'syntax', evaluate: syntaxChecker },
  { name: 'semantic', evaluate: semanticAnalyzer },
]);
const result = await proceduralJudge.evaluate(evaluand);

// Reactive specialist-based evaluation
const reactiveJudge = new ReactiveJudge(router, specialists, deepDiveSpecialists);
const result = await reactiveJudge.evaluate(evaluand);
```

Query Agent Evaluations
```javascript
// Filter by agent ID/name
obs_query_evaluations({
  agentId: 'agent-123',
  agentName: 'TaskRunner',
  evaluationName: 'tool_correctness',
})

// Response includes agent-specific fields
{
  stepScores: [{ step: 0, score: 0.9, explanation: '...' }],
  toolVerifications: [{ toolName: 'search', toolCorrect: true, score: 1.0 }],
  trajectoryLength: 5,
}
```

Development
```bash
cd ~/.claude/mcp-servers/observability-toolkit
npm install
npm run build
npm test          # 3254 tests
npm run start
```

Documentation
- docs/changelog/ - Version history and changelogs
- docs/reliability/security.md - Security controls and hardening
- docs/quality/llm-as-judge.md - LLM-as-Judge architecture
- docs/quality/agent-as-judge.md - Agent-as-Judge architecture
- docs/backlog/ - Feature backlog and roadmap
- docs/changelog/SESSION_HISTORY.md - Development session logs
- docs/Summary.md - Full documentation index
