buzo-sdk

v0.4.0

Published

9 days ago

Buzo AI retrieval observability SDK — mirror your AI agent's vector retrievals and LLM generations to Buzo for usage-weighted findings, served-flagged alerts, cited-flagged attribution, and dead-vector detection.

What is Buzo AI?

Buzo AI is a vector store quality and integrity platform for production AI agents. It connects to your vector database (Pinecone, ChromaDB, pgvector, Qdrant), runs deterministic and AI-powered analysis to detect quality issues (stale, duplicate, contradictory, suspicious content, PII), and provides a dashboard for human-gated remediation with a tamper-evident audit chain.

Learn more at buzoai.com.

What does this SDK do?

Buzo already sees what data exists in your vector store. This SDK lets it see what your agent actually consumes.

After a one-line wrap of your retriever, every query → results pair is mirrored to Buzo asynchronously. That data unlocks:

Usage-weighted findings — "this stale vector was served 1,247 times to your support agent this week" instead of a flat list of issues nobody triages.
SERVED_FLAGGED alerts — if your agent retrieves a vector Buzo previously flagged for review, your team gets an email within minutes (with hourly dedup, no inbox flood).
CITED_FLAGGED alerts (v0.2+) — a strictly stronger signal than SERVED_FLAGGED: the flagged vector's content actually appears in the agent's final response, not just the retrieval context. Enabled by opting into LLM output capture.
Dead-vector detection — vectors no agent has retrieved in 30+ days become prune candidates with estimated storage savings. With v0.2 generation capture, "retrieved but never cited in a response" is a stronger dead-vector signal still.
Continuous retrieval health metric — the % of queries that return at least one flagged vector, trended over time. A leading indicator of agent quality regressions.
Reinforced right-to-be-forgotten attestation — for GDPR/CCPA, prove a vector was deleted and that no agent retrieved it in the days before deletion.
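
To make the retrieval health metric concrete, here is a minimal sketch of how such a percentage could be computed from a window of retrieval events. The `RetrievalEvent` shape and the `flaggedRetrievalRate` helper are hypothetical names used for illustration; Buzo computes the real metric server-side and its exact definition may differ.

```typescript
// Hypothetical event shape for illustration; not the Buzo SDK's own type.
interface RetrievalEvent {
  results: { id: string; score: number }[];
}

/**
 * Percentage (0..100) of queries whose result set contains at least one
 * flagged vector id. Returns 0 for an empty window.
 */
function flaggedRetrievalRate(
  events: RetrievalEvent[],
  flaggedIds: Set<string>,
): number {
  if (events.length === 0) return 0;
  const hits = events.filter((e) =>
    e.results.some((r) => flaggedIds.has(r.id)),
  ).length;
  return (hits / events.length) * 100;
}

const windowEvents: RetrievalEvent[] = [
  { results: [{ id: 'vec_abc', score: 0.87 }, { id: 'vec_def', score: 0.81 }] },
  { results: [{ id: 'vec_xyz', score: 0.9 }] },
  { results: [{ id: 'vec_def', score: 0.75 }] },
  { results: [] },
];

// Two of the four queries returned the flagged vector 'vec_def'.
console.log(flaggedRetrievalRate(windowEvents, new Set<string>(['vec_def']))); // 50
```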

Operational guarantees

Never in your agent's request path. Out-of-band, fire-and-forget POST after the original retrieval already returned.
Never throws to your code. All errors are caught and logged via your configured logger.
Never blocks. Background scheduling via setImmediate (Node) or waitUntil (Edge runtimes).
Bounded memory. Ring buffer drops oldest events under load — no unbounded growth.
Same out-of-band pattern used by established observability tools such as Datadog, Sentry, Langfuse, Helicone, and OpenTelemetry collectors.
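
The bounded-memory guarantee can be pictured with a small ring buffer that overwrites the oldest event once it is full. This is a sketch of the general technique only: the `RingBuffer` class below is hypothetical, and the SDK's internal buffer is not shown in this README.

```typescript
// Fixed-capacity ring buffer that drops the oldest item when full.
// Illustrative only; not the SDK's actual implementation.
class RingBuffer<T> {
  private readonly slots: (T | undefined)[];
  private head = 0; // index of the oldest stored item
  private count = 0; // number of items currently stored
  private droppedCount = 0;

  constructor(private readonly capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.slots = new Array<T | undefined>(capacity).fill(undefined);
  }

  push(item: T): void {
    const tail = (this.head + this.count) % this.capacity;
    this.slots[tail] = item;
    if (this.count === this.capacity) {
      // Buffer was full: the write above replaced the oldest item.
      this.head = (this.head + 1) % this.capacity;
      this.droppedCount += 1;
    } else {
      this.count += 1;
    }
  }

  /** Remove and return up to `max` items, oldest first. */
  drain(max: number): T[] {
    const n = Math.min(max, this.count);
    const out: T[] = [];
    for (let i = 0; i < n; i++) {
      const idx = (this.head + i) % this.capacity;
      out.push(this.slots[idx] as T);
      this.slots[idx] = undefined;
    }
    this.head = (this.head + n) % this.capacity;
    this.count -= n;
    return out;
  }

  get size(): number {
    return this.count;
  }

  get dropped(): number {
    return this.droppedCount;
  }
}

const buffer = new RingBuffer<number>(3);
for (const n of [1, 2, 3, 4, 5]) buffer.push(n);
console.log(buffer.drain(10), buffer.dropped); // [ 3, 4, 5 ] 2
```

Memory use stays constant regardless of how far the producer outruns the flusher; the cost is that the oldest telemetry is lost first.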

Install

npm install buzo-sdk
# Optional, only if you use the LangChain callback integration:
npm install @langchain/core

Get an API key at buzoai.com → Settings → SDK Keys.

Quick start

// lib/buzo.ts
import { Buzo } from 'buzo-sdk';

export const buzo = new Buzo({
  apiKey: process.env.BUZO_API_KEY!,
});

Integration options

1. LangChain callback handler (recommended)

The callback handler is the primary integration. It survives chain composition (LCEL), ensemble retrievers, MultiVectorRetriever, ParentDocumentRetriever, MMR rerankers, and cached retrievers — all places where the simpler wrap() either misses traces or generates noise.

import { createBuzoCallbackHandler } from 'buzo-sdk/langchain';
import { buzo } from './lib/buzo';

const handler = await createBuzoCallbackHandler(buzo, {
  collectionId: 'prod-support-kb',
  agentId: 'support-v3',
});

const docs = await retriever.invoke('how do I reset my password?', {
  callbacks: [handler],
});

2. `wrap()` — drop-in fallback

For isolated retrievers, vanilla SDK use, or codebases without callback infrastructure.

import { wrap } from 'buzo-sdk';
import { buzo } from './lib/buzo';

const tracedRetriever = wrap(buzo, vectorStore.asRetriever({ k: 5 }), {
  collectionId: 'prod-support-kb',
});

const docs = await tracedRetriever.invoke('how do I reset my password?');

3. Vanilla Pinecone

import { Pinecone } from '@pinecone-database/pinecone';
import { wrapPineconeIndex } from 'buzo-sdk';
import { buzo } from './lib/buzo';

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = wrapPineconeIndex(buzo, pc.index('support-kb'), {
  collectionId: 'prod-support-kb',
});

const result = await index.query({ vector, topK: 5, includeMetadata: true });

index.namespace(ns) is also wrapped — chained calls keep tracing.

4. Direct API

If none of the above fit your stack, call recordRetrieval yourself from any retrieval code path.

buzo.recordRetrieval({
  collectionId: 'prod-support-kb',
  query: { text: 'how do I reset my password?' },
  results: [
    { id: 'vec_abc', score: 0.87 },
    { id: 'vec_def', score: 0.81 },
  ],
  latencyMs: 47,
});

Capturing LLM outputs (v0.2+)

Retrieval traces tell Buzo what your agent saw. Generation traces tell Buzo what your agent said — unlocking CITED_FLAGGED alerts: a flagged vector was not just served into the context, its content actually appears in the answer that reached your user.

Output capture is opt-in (default outputCapture: 'off'). LLM generations frequently echo back PII from the conversation — enable only after configuring redaction or with a DPA in place.

The LangChain callback handler covers both retrievals and generations when enabled — no extra wiring:

const buzo = new Buzo({
  apiKey: process.env.BUZO_API_KEY!,
  outputCapture: 'redacted',               // 'off' | 'redacted' | 'plaintext'
  outputRedactPatterns: [
    { pattern: /[\w.-]+@[\w.-]+\.\w+/g, replacement: '<EMAIL>' },
    { pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '<SSN>' },
  ],
});

const handler = await createBuzoCallbackHandler(buzo, {
  collectionId: 'prod-support-kb',
  agentId: 'support-v3',
});

// Same handler, same one line — now fires on both retriever.invoke()
// and chatModel.invoke() in the same LCEL chain.
const answer = await chain.invoke(question, { callbacks: [handler] });

For non-LangChain stacks, call recordGeneration directly after the LLM call. Pass whatever correlation id you use for the chain as runId — the server joins against the matching recordRetrieval call.

buzo.recordGeneration({
  runId: chainRunId,
  output: { text: finalAnswer },
  model: 'gpt-4o',
  promptTokens: tokens.prompt,
  completionTokens: tokens.completion,
  latencyMs: elapsed,
});
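
The runId join can be illustrated with a deliberately crude sketch: match generation and retrieval traces on `runId`, then check whether a flagged chunk's text appears verbatim in the answer. Everything here is an assumption made for illustration, including the trace shapes, the `findCitedFlagged` name, and especially the verbatim-substring test; the README does not describe how Buzo's server actually detects citations.

```typescript
// Hypothetical trace shapes, loosely modelled on the payloads in "What gets sent".
interface RetrievalTrace {
  runId: string;
  results: { id: string; content?: string }[];
}

interface GenerationTrace {
  runId: string;
  output: { text: string };
}

/**
 * Return ids of flagged vectors that were retrieved under the same runId as
 * the generation and whose content appears verbatim (case-insensitively) in
 * the generated answer. Real citation matching is likely fuzzier than this.
 */
function findCitedFlagged(
  retrievals: RetrievalTrace[],
  generation: GenerationTrace,
  flaggedIds: Set<string>,
): string[] {
  const answer = generation.output.text.toLowerCase();
  const cited = new Set<string>();
  for (const trace of retrievals) {
    if (trace.runId !== generation.runId) continue; // different chain run
    for (const r of trace.results) {
      const needle = r.content?.trim().toLowerCase();
      if (needle && flaggedIds.has(r.id) && answer.includes(needle)) {
        cited.add(r.id);
      }
    }
  }
  return Array.from(cited);
}

const citedExample = findCitedFlagged(
  [
    {
      runId: 'chain-run-abc',
      results: [
        { id: 'vec_abc', content: 'reset links expire after 24 hours' },
        { id: 'vec_def' }, // ids-only result: nothing to match against
      ],
    },
  ],
  {
    runId: 'chain-run-abc',
    output: { text: 'Heads up: reset links expire after 24 hours, so act fast.' },
  },
  new Set<string>(['vec_abc', 'vec_def']),
);
console.log(citedExample); // [ 'vec_abc' ]
```

The sketch also shows why a shared `runId` matters: a retrieval recorded under a different id is never considered, no matter how closely its content matches the answer.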

Capturing retrieved content (v0.4+)

By default Buzo only receives { id, score } for each retrieved vector. Citation detection (CITED_FLAGGED) then depends on Buzo having a server-side content_snapshot for that vector — which only exists for vectors that have been fully scanned. In practice that's a small subset of the corpus, so most retrievals never produce a citation.

Opting in to resultsCapture lets Buzo match citations on every retrieved vector, scanned or not, by shipping the raw pageContent alongside id/score. One-line change:

const buzo = new Buzo({
  apiKey: process.env.BUZO_API_KEY!,
  resultsCapture: 'plaintext',             // 'ids-only' (default) | 'plaintext' | 'redacted'
});

'redacted' mode applies your resultsRedactPatterns in-place before the trace leaves the SDK — use this when retrievals may echo user PII:

const buzo = new Buzo({
  apiKey: process.env.BUZO_API_KEY!,
  resultsCapture: 'redacted',
  resultsRedactPatterns: [
    { pattern: /[\w.-]+@[\w.-]+\.\w+/g, replacement: '<EMAIL>' },
    { pattern: /\b\d{16}\b/g, replacement: '<CC>' },
  ],
});
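
Before enabling a redacted mode, it can help to dry-run your patterns locally against sample text. The sketch below applies a `{ pattern, replacement }` list in order; the `redact` helper is a hypothetical stand-in, and the SDK's own redaction may differ in details such as ordering or flag handling.

```typescript
interface RedactPattern {
  pattern: RegExp;
  replacement: string;
}

/** Apply each pattern in order and return the redacted text. */
function redact(text: string, patterns: RedactPattern[]): string {
  return patterns.reduce((acc, { pattern, replacement }) => {
    // Rebuild the RegExp with the global flag so every occurrence is replaced
    // and no lastIndex state is shared between calls.
    const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
    return acc.replace(new RegExp(pattern.source, flags), replacement);
  }, text);
}

const patterns: RedactPattern[] = [
  { pattern: /[\w.-]+@[\w.-]+\.\w+/g, replacement: '<EMAIL>' },
  { pattern: /\b\d{16}\b/g, replacement: '<CC>' },
];

console.log(redact('Contact jane.doe@example.com, card 4111111111111111.', patterns));
// Contact <EMAIL>, card <CC>.

console.log(redact('card 4111 1111 1111 1111', patterns));
// card 4111 1111 1111 1111  (unchanged: the digits are not contiguous)
```

The second call is the reason to test patterns first: `\b\d{16}\b` only matches 16 contiguous digits, so card numbers written with spaces or dashes would pass through unredacted.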

All three integration paths (LangChain callback, wrap(), LlamaIndex) propagate content automatically when the retriever returns it — there is no change to call sites. If you build retrieval traces by hand via recordRetrieval, include content on each result item (it gets stripped or redacted based on the configured mode):

buzo.recordRetrieval({
  collectionId: 'prod-support-kb',
  query: { text: question },
  results: docs.map((d) => ({ id: d.id, score: d.score, content: d.pageContent })),
  latencyMs: elapsed,
});

Payload size vs. observability trade-off: a typical RAG chunk is 500–2000 characters. At k=5 and 1 rps per agent, enabling plaintext adds roughly 2.5–10 KB/sec of upstream content per agent (5 chunks × 500–2000 bytes, assuming about one byte per character and ignoring JSON overhead). Content is hard-capped at 16 KB per vector: oversized chunks are rejected by ingest rather than truncated, so truncate retriever-side if your chunks exceed that.
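
A minimal sketch of retriever-side truncation follows. It assumes the 16 KB cap means 16,384 bytes of UTF-8; the README does not say whether the limit is counted in bytes or characters, or whether a kilobyte is 1,000 or 1,024 bytes, so adjust `MAX_CONTENT_BYTES` accordingly. The `truncateUtf8` helper is a hypothetical name, not part of the SDK.

```typescript
// Assumption: "16 KB" is 16 * 1024 bytes of UTF-8 encoded text.
const MAX_CONTENT_BYTES = 16 * 1024;

/**
 * Truncate `text` so its UTF-8 encoding fits in `maxBytes`, without cutting
 * a multi-byte character in half.
 */
function truncateUtf8(text: string, maxBytes: number = MAX_CONTENT_BYTES): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;

  // bytes[end] is the first byte that will be dropped. If it is a UTF-8
  // continuation byte (10xxxxxx), the cut falls inside a character, so move
  // the cut back to the start of that character.
  let end = maxBytes;
  while (end > 0 && ((bytes[end] ?? 0) & 0xc0) === 0x80) end -= 1;

  return new TextDecoder().decode(bytes.subarray(0, end));
}

console.log(truncateUtf8('héllo', 2)); // h  ('é' needs 2 bytes and would not fit)
console.log(truncateUtf8('a'.repeat(20_000)).length); // 16384
```

The truncated string is what you would pass as `content` on each result item when building traces by hand.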

Configuration

new Buzo({
  apiKey: process.env.BUZO_API_KEY!,

  // Endpoint override (self-hosted Buzo)
  endpoint: 'https://api.buzoai.com',

  // Query capture mode (default 'plaintext')
  // 'plaintext' (default): full query text shipped; best signal, governed by your DPA with Buzo.
  // 'hash':                only SHA-256 of query text; compliance-safe option.
  // 'redact':              regex-based in-place redaction before egress.
  queryCapture: 'plaintext',

  // Used only when queryCapture === 'redact'
  redactPatterns: [
    { pattern: /[\w.-]+@[\w.-]+\.\w+/g, replacement: '<EMAIL>' },
    { pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '<SSN>' },
  ],

  // LLM output capture (v0.2+), default 'off'. Opt-in because outputs
  // frequently contain user PII echoed back from the conversation.
  // 'off' (default): no LLM events emitted; handleLLMStart/End are no-ops.
  // 'redacted':      outputRedactPatterns applied before the trace leaves the SDK.
  // 'plaintext':     raw generation text shipped; unlocks CITED_FLAGGED detection.
  outputCapture: 'off',

  // Used only when outputCapture === 'redacted'
  outputRedactPatterns: [
    { pattern: /[\w.-]+@[\w.-]+\.\w+/g, replacement: '<EMAIL>' },
  ],

  // Retrieved-content capture (v0.4+), default 'ids-only'. Opt-in because
  // the payload grows substantially when full chunk text is shipped.
  // 'ids-only' (default): only { id, score }, the pre-0.4.0 behaviour.
  // 'plaintext':          full content shipped; enables citation matching on every retrieval.
  // 'redacted':           resultsRedactPatterns applied before egress.
  resultsCapture: 'ids-only',

  // Used only when resultsCapture === 'redacted'
  resultsRedactPatterns: [
    { pattern: /[\w.-]+@[\w.-]+\.\w+/g, replacement: '<EMAIL>' },
  ],

  // Sampling: 1.0 captures everything, 0.1 captures 10% (uniform).
  sampleRate: 1.0,

  // Batching
  batchSize: 25,
  flushIntervalMs: 5_000,
  maxBufferSize: 1_000,

  // Hook for SDK-internal telemetry (default: no-op).
  logger: (level, msg, ctx) => console.warn(`[buzo:${level}] ${msg}`, ctx),

  // Disable all network activity (tests / local dev).
  disabled: process.env.NODE_ENV === 'test',
});
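
To see what `queryCapture: 'hash'` implies for the data that leaves your process, here is a sketch of a SHA-256 digest over the query text using Node's built-in crypto module. The hex encoding, the absence of any normalization or salting, and the `sha256Hex` name are assumptions; the README states only that a SHA-256 of the query text is sent.

```typescript
import { createHash } from 'node:crypto';

/** SHA-256 of the UTF-8 encoded text, as a lowercase hex string. */
function sha256Hex(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

const digest = sha256Hex('how do I reset my password?');
console.log(digest.length); // 64 hex characters, regardless of query length

// Identical queries hash identically, so repeated queries can still be
// grouped and counted even though the text itself is not shipped.
console.log(sha256Hex('how do I reset my password?') === digest); // true
```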

Edge runtime (Cloudflare Workers, Vercel Edge, Next.js Edge)

The SDK auto-detects the runtime. Edge runtimes do not have a persistent event loop between requests, so background flushing has no place to run after the response is sent. For guaranteed delivery, call buzo.flush() from inside ctx.waitUntil(...):

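import { waitUntil } from '@vercel/functions'; // Vercel only; on Cloudflare use ctx.waitUntil
import { buzo } from './lib/buzo';

export const runtime = 'edge';

export async function POST(req: Request) {
  // Assumes a JSON request body of the form { "question": "..." }.
  const { question } = (await req.json()) as { question: string };
  // tracedRetriever is the wrap()-ed retriever from "Integration options".
  const docs = await tracedRetriever.invoke(question);
  // ... LLM call ...

  // Ensure traces ship before the worker is recycled.
  waitUntil(buzo.flush());

  return new Response(/* ... */);
}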

If you forget to wire waitUntil, the SDK degrades to best-effort: traces accumulate and ship synchronously on the next request, but may be lost if the isolate is evicted between requests.

What gets sent

Retrieval traces are POSTed to /v1/retrieval-traces in batches:

{
  "events": [
    {
      "clientEventId": "01913e1f-3a40-7b2c-8d4f-aaaa00000001",
      "collectionId": "prod-support-kb",
      "agentId": "support-v3",
      "query": { "text": "how do I reset my password?" },
      "results": [
        { "id": "vec_abc123", "score": 0.87, "content": "To reset your password, click the link..." },
        { "id": "vec_def456", "score": 0.81, "content": "Password reset links expire 24h after they are issued." }
      ],
      "kReturned": 2,
      "latencyMs": 47,
      "runId": "chain-run-abc",
      "timestamp": "2026-04-18T14:22:17.123Z",
      "sdk": { "lang": "ts", "version": "0.4.0" }
    }
  ]
}

Generation traces (only when outputCapture !== 'off') are POSTed to /v1/generation-traces:

{
  "events": [
    {
      "clientEventId": "01913e1f-3a40-7b2c-8d4f-bbbb00000001",
      "runId": "chain-run-abc",
      "parentRunId": "chain-run-root",
      "collectionId": "prod-support-kb",
      "agentId": "support-v3",
      "output": { "text": "Your password reset link expires in 24 hours." },
      "model": "gpt-4o",
      "promptTokens": 82,
      "completionTokens": 14,
      "latencyMs": 812,
      "timestamp": "2026-04-18T14:22:18.015Z",
      "sdk": { "lang": "ts", "version": "0.4.0" }
    }
  ]
}

Notes:

query.text is replaced by query.hash when queryCapture === 'hash'.
results[].content is present only when resultsCapture !== 'ids-only' (default from v0.4+ is 'ids-only' — pre-0.4 behaviour preserved).
clientEventId is a UUIDv7-style id used for idempotent server-side dedup, so retries never double-count.
runId is the LangChain run identifier, or whatever correlation id you pass to recordGeneration in a non-LangChain stack. The server uses it to correlate retrieval and generation events for CITED_FLAGGED attribution.
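
The idempotency note can be illustrated with a small in-memory sketch of dedup keyed on `clientEventId`. This shows the general idea only: the `IdempotentIngest` class is hypothetical, and Buzo's actual server-side storage and dedup window are not described in this README.

```typescript
interface TraceEvent {
  clientEventId: string;
}

// Accepts each clientEventId at most once, so a retried batch adds nothing.
class IdempotentIngest<E extends TraceEvent> {
  private readonly seen = new Set<string>();
  private readonly accepted: E[] = [];

  /** Ingest a batch and return how many events were newly accepted. */
  ingest(batch: E[]): number {
    let added = 0;
    for (const event of batch) {
      if (this.seen.has(event.clientEventId)) continue; // duplicate from a retry
      this.seen.add(event.clientEventId);
      this.accepted.push(event);
      added += 1;
    }
    return added;
  }

  get count(): number {
    return this.accepted.length;
  }
}

const ingest = new IdempotentIngest<TraceEvent>();
const batch: TraceEvent[] = [
  { clientEventId: '01913e1f-3a40-7b2c-8d4f-aaaa00000001' },
  { clientEventId: '01913e1f-3a40-7b2c-8d4f-aaaa00000002' },
];

console.log(ingest.ingest(batch)); // 2
console.log(ingest.ingest(batch)); // 0  (retry of the same batch)
console.log(ingest.count); // 2
```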

Failure semantics

| Scenario | Behavior |
|---|---|
| Buzo backend returns 5xx | Retry → after 5 consecutive failures, circuit opens for 30s. |
| Buzo backend returns 4xx | Batch dropped, no retry, logged via logger. |
| Network error / DNS failure | Same as 5xx. |
| Customer process crashes | Buffered traces lost (intentional: this is telemetry, not a transactional log). |
| Customer's retriever throws | Trace recorded with error field, original error re-thrown to caller unchanged. |
| Buzo constructed inside a request handler | Works, but no batching benefit. Construct once at module load. |
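
The 5xx behaviour follows the familiar circuit-breaker pattern. Below is a minimal sketch using the thresholds from the table (5 consecutive failures, 30 s open). The `CircuitBreaker` class is hypothetical, and resetting the failure count when the circuit opens is an assumption, since the README does not describe what happens once the 30 s elapse.

```typescript
// Opens after `threshold` consecutive failures and stays open for `cooldownMs`.
// Illustrative only; not the SDK's actual implementation.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openUntil = 0; // epoch ms until which sends are suppressed

  constructor(
    private readonly threshold = 5,
    private readonly cooldownMs = 30_000,
  ) {}

  /** True when the circuit is closed and a send may be attempted. */
  canSend(now: number = Date.now()): boolean {
    return now >= this.openUntil;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(now: number = Date.now()): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.openUntil = now + this.cooldownMs;
      this.consecutiveFailures = 0; // assumption: start counting afresh after cooldown
    }
  }
}

const breaker = new CircuitBreaker();
for (let i = 0; i < 5; i++) breaker.recordFailure(0); // five failures at t = 0
console.log(breaker.canSend(1_000)); // false (circuit open)
console.log(breaker.canSend(30_000)); // true  (cooldown elapsed)
```

While the circuit is open, a sender would skip the POST entirely and leave events in the bounded buffer, which keeps a failing backend from adding latency or load to the host process.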