# Spectral
Causal observability for AI agents. Drop one line into your app and get a full trace of every LLM call, tool use, cost, latency, and behavioral invariant — with a CLI to explore, replay, and evaluate everything.
```bash
npm install spectral-obs
spectral traces
```

## Getting Started

### 1. Install

```bash
npm install spectral-obs
```

### 2. Wrap your Anthropic client
```typescript
import Anthropic from '@anthropic-ai/sdk';
import { spectral } from 'spectral-obs';

const client = spectral.wrap(new Anthropic(), {
  taskType: 'code-review',  // label for grouping traces + evals
  captureInputs: true,      // store prompts (disable for sensitive data)
});

// Use client exactly as before — nothing else changes
const response = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Review this PR...' }],
});
```

That's it. Every call is now traced, hashed, and stored in `~/.spectral/spectral.db`.
## CLI

```bash
spectral traces                       # recent runs
spectral inspect <trace-id>           # tree view of every span
spectral waterfall <trace-id>         # latency waterfall chart
spectral cost --last 7d               # cost breakdown by model
spectral replay <trace-id> \
  --swap-step 2 --with-input "..."    # re-run one step, see the diff
spectral scan                         # silent failure detection
spectral eval learn <task-type>       # mine behavioral invariants
spectral eval run <trace-id> <type>   # run invariants against a trace
spectral eval show <task-type>        # list learned invariants
spectral eval pin <invariant-id>      # lock an invariant across updates
spectral eval export <task-type>      # dump suite as JSON
```

## Core features
### Trace explorer

`spectral traces` lists your most recent runs with cost, latency, and status.
`spectral inspect <id>` renders the full trace DAG as a tree with token counts
and timing for every span.
### Waterfall

`spectral waterfall <trace-id>` renders a terminal bar chart of every span's
contribution to total latency — useful for finding which tool or LLM call is
the bottleneck.
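The rendering idea is simple enough to sketch. The helper below is an illustration of how span durations can be turned into a terminal bar chart; the `Span` shape, bar width, and names here are assumptions for the example, not Spectral's internals.

```typescript
// Illustrative only: scale each span's duration to a bar relative to the
// slowest span, so the bottleneck gets the longest bar.
interface Span {
  name: string;
  ms: number;
}

function waterfall(spans: Span[], width = 40): string[] {
  const max = Math.max(...spans.map((s) => s.ms));
  return spans.map((s) => {
    // At least one block so fast spans remain visible.
    const bar = "█".repeat(Math.max(1, Math.round((s.ms / max) * width)));
    return `${s.name.padEnd(16)} ${bar} ${s.ms} ms`;
  });
}
```

Scaling against the slowest span (rather than total latency) makes the bottleneck visually obvious at a glance.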
### Cost tracking

`spectral cost --last 7d` breaks down spend by model across the last N days.
Pricing is built in for all current Claude models.
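The underlying arithmetic is per-token pricing applied to usage counts. The sketch below shows the shape of that calculation; the model name and per-million-token prices are placeholders, not the pricing table Spectral ships with.

```typescript
// Placeholder prices (USD per million tokens) -- assumptions for the example.
const PRICE_PER_MTOK = {
  "example-model": { input: 3.0, output: 15.0 },
} as const;

// Cost = input tokens at the input rate plus output tokens at the output rate.
function costUsd(
  model: keyof typeof PRICE_PER_MTOK,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICE_PER_MTOK[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```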
### Replay engine

```bash
spectral replay abc123 --swap-step 2 --with-input "Be more concise"
```

Loads the cached trace, replaces step 2's input with your new prompt, calls the API live, and shows:

- A line-by-line diff of the old vs new output
- Cost delta and latency delta

No need to re-run your whole agent to test a single prompt change.
### Silent failure detection

```bash
spectral scan
```

Runs z-score anomaly detection on the output hash distribution for each task type. Flags runs where the output unexpectedly changed while the input didn't — a common sign of silent regressions after a model upgrade or prompt edit.
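To make the z-score idea concrete, here is a minimal sketch of flagging outliers in a series of per-run change rates. This illustrates the statistical technique only; the function names and the 3σ default are assumptions, not Spectral's implementation.

```typescript
// Standard z-score: how many standard deviations each value sits from the mean.
function zScores(values: number[]): number[] {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance);
  // With zero variance, nothing is anomalous.
  return values.map((v) => (std === 0 ? 0 : (v - mean) / std));
}

// Return indices of runs whose output-change rate is more than `threshold`
// standard deviations from the mean for that task type.
function flagAnomalies(changeRates: number[], threshold = 3): number[] {
  return zScores(changeRates)
    .map((z, i) => ({ z, i }))
    .filter(({ z }) => Math.abs(z) > threshold)
    .map(({ i }) => i);
}
```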
### Behavioral evals
Spectral can learn what "normal" looks like from your production traces and then check new traces against those expectations automatically.
#### Learn invariants from traces

```bash
spectral eval learn code-review --limit 50
```

Analyzes your last 50 code-review runs and extracts invariants across three dimensions:
| Dimension | What it mines | Cost |
|-----------|---------------|------|
| Structural | Tool ordering, call counts, step count, repetition loops, never-final tools | Free |
| Content | Output line-count bounds, LLM-extracted presence/absence/format patterns | Free + optional Haiku |
| Causal | Which tool outputs flow into downstream inputs (Jaccard similarity) | Free |
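The causal dimension relies on Jaccard similarity, which can be sketched in a few lines. The tokenization below is an assumption for illustration; how Spectral actually splits tool inputs and outputs may differ.

```typescript
// Naive word-level tokenizer -- an assumption for this example.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
// A low score between a tool's output and a downstream step's input
// suggests the output was never actually used.
function jaccard(a: string, b: string): number {
  const sa = tokenSet(a);
  const sb = tokenSet(b);
  const intersection = [...sa].filter((t) => sb.has(t)).length;
  const union = new Set([...sa, ...sb]).size;
  return union === 0 ? 0 : intersection / union;
}
```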
Each invariant gets a score:

```
score = 0.4·consistency + 0.25·specificity + 0.25·actionability − 0.1·cost
```

Only invariants above the threshold (default 0.5) are saved.
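The formula and threshold translate directly to code. The `InvariantMetrics` shape below is a placeholder for illustration; Spectral derives the component values from the mined invariants themselves.

```typescript
// Hypothetical metrics shape -- each component normalized to [0, 1].
interface InvariantMetrics {
  consistency: number;   // how often the pattern held across traces
  specificity: number;   // how narrowly it constrains behavior
  actionability: number; // how useful a violation report would be
  cost: number;          // relative cost to evaluate the invariant
}

// The documented weighted score: cost is the only penalty term.
function invariantScore(m: InvariantMetrics): number {
  return (
    0.4 * m.consistency +
    0.25 * m.specificity +
    0.25 * m.actionability -
    0.1 * m.cost
  );
}

// Only invariants at or above the threshold (default 0.5) are kept.
const keep = (m: InvariantMetrics, threshold = 0.5): boolean =>
  invariantScore(m) >= threshold;
```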
#### Run evals on a new trace

```bash
spectral eval run <trace-id> code-review
```

Checks the trace against all learned invariants in priority order:
- Structural — pure graph analysis, instant
- Deterministic content — regex / line-count, instant
- Heuristic causal — Jaccard similarity, instant
- LLM judge — Claude Haiku, skipped if a critical violation is already found
```
✓ 11/12 checks passed (91%)

Violations:
✗ [critical] search_files output flows into write_file input
  write_file shows 2% overlap with search_files output (min 8%)
  Fix: write_file may be ignoring output from search_files — blind operation detected
```

#### Pin invariants
```bash
spectral eval pin inv_01abc123
```

Pinned invariants survive future `eval learn` refreshes — useful for
invariants you've manually reviewed and want to treat as ground truth.
## How it works

### Zero-overhead hot path
```
messages.create() called
        │
        ▼
generateId()        ← ~0.001 ms
Date.now() ×2       ← ~0.001 ms
pipeline.push(ref)  ← ring buffer write, ~0.001 ms
        │
        ▼ (background, off the call stack)
drain()             ← serialize + hash
batch flush         ← single SQLite transaction
```

The intercepted call adds ~0.003 ms to TTFT. The rest happens asynchronously.
### Storage

All data lives in `~/.spectral/spectral.db` — a single WAL-mode SQLite file.
No server, no account, no data leaves your machine.
### Performance internals

| Component | Technique | Benefit |
|-----------|-----------|---------|
| `RingBuffer<T>` | Pre-allocated power-of-2 array, bitwise modulo | O(1) push/drain, no GC pressure |
| `fastHash` | Murmur3 × 2 seeds | ~35× faster than SHA-256 |
| `BatchWriter` | Prepared statement + `db.transaction()` | One fsync per batch, not per trace |
| `TracePipeline` | Three-lane: hot → drain → flush | Hot path never touches SQLite |
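The power-of-2 trick in the first row is worth spelling out: when capacity is a power of two, `index & (capacity - 1)` replaces the slower `index % capacity`. The minimal ring buffer below is a sketch of that technique under an assumed overwrite-oldest policy, not Spectral's actual `RingBuffer<T>`.

```typescript
// Minimal power-of-two ring buffer using a bitwise mask instead of `%`.
class Ring<T> {
  private readonly buf: (T | undefined)[];
  private readonly mask: number;
  private head = 0; // next write position (monotonic counter)
  private tail = 0; // next read position (monotonic counter)

  constructor(capacityPow2: number) {
    if (capacityPow2 <= 0 || (capacityPow2 & (capacityPow2 - 1)) !== 0) {
      throw new Error("capacity must be a power of two");
    }
    this.buf = new Array(capacityPow2); // pre-allocated, never resized
    this.mask = capacityPow2 - 1;
  }

  push(item: T): void {
    this.buf[this.head & this.mask] = item;
    this.head++;
    // Overwrite-oldest when full: push stays O(1) and allocation-free.
    if (this.head - this.tail > this.mask + 1) this.tail++;
  }

  drain(): T[] {
    const out: T[] = [];
    while (this.tail < this.head) {
      out.push(this.buf[this.tail & this.mask] as T);
      this.tail++;
    }
    return out;
  }
}
```

Because `head` and `tail` only ever increase, the masked index wraps around the fixed array while the counters keep push/drain bookkeeping trivial.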
## SDK reference

```typescript
import { spectral } from 'spectral-obs';

// Wrap a client
const client = spectral.wrap(anthropicClient, {
  taskType?: string,       // groups runs for evals + cost tracking
  captureInputs?: boolean, // default true
  dbPath?: string,         // default ~/.spectral/spectral.db
});

// Access the underlying stores directly if needed
const store = spectral.getStore();
const pipeline = spectral.getPipeline();

// Clean shutdown (flushes pending traces)
spectral.closeAll();
```

## Development
```bash
npm test        # 249 tests, all green
npm run build   # compile to dist/
npm run dev     # watch mode
```

Tests use Vitest with `pool: 'forks'` for native module compatibility.
All tests are self-contained and create temporary SQLite databases.
## License
MIT
