stozer-ai

v0.1.6

Published

20 days ago

Deterministic grounding validation for AI agents and RAG pipelines. Detect hallucinations, data fabrication, and grounding failures without LLM-as-judge — <50ms, zero API calls.

Stozer

Grounding validation for tool-calling AI agents and RAG pipelines.

Detect when an agent's response contradicts its tool outputs or retrieved context — deterministically, without LLM-as-judge overhead.

The Problem

Most "hallucinations" in tool-calling agents and RAG pipelines are grounding failures — the agent gets accurate data from tools or retrieved context, and then ignores it, miscalculates, or fabricates from empty results. The source of truth is already in the trace.

Why Not LLM-as-a-Judge?

| | Stozer | LLM-as-a-Judge | |---|---|---| | Latency | <50ms | 2–10s per call | | Cost | $0 compute | $0.01–0.10 per eval | | Determinism | Same input → same result | Varies across runs | | Explainability | Exact failure code + evidence | "I think there might be an issue" | | Hallucination risk | Zero (no LLM) | The judge can hallucinate too | | Scalability | 10K+ evals/sec | Rate-limited by provider |

The Solution

Stozer validates the agent's response against tool outputs and reports grounding failures with standardized codes — like OBD diagnostic codes for AI.

npm install stozer-ai

Zero LLM calls. Deterministic failure detection. Runs in <50ms.

Getting Started

Create a free account at app.stozer.dev — no credit card required
Copy your API key from Settings → API Keys (starts with stozer_live_)
Set the env var or save it with the CLI:

# Option A: environment variable
export STOZER_API_KEY=stozer_live_your_key

# Option B: save once with the CLI (persists across sessions)
npx stozer auth stozer_live_your_key

The free tier includes full access to grounding evaluation, failure detection, and the dashboard — enough to integrate and test with your agent.

Quick Start — 3 Minutes

1. Evaluate a trace

import { StozerClient, TraceBuilder } from 'stozer-ai';

const client = new StozerClient();  // uses STOZER_API_KEY env var

const trace = new TraceBuilder({ traceId: 'run-001' })
  .addUserInput('How many employees are on leave today?')
  .addToolCall('getLeaveRecords', { date: '2024-03-15' })
  .addToolOutput('getLeaveRecords', [
    { employeeId: 'E01', name: 'Sarah Johnson', status: 'on_leave' },
    { employeeId: 'E02', name: 'Michael Chen', status: 'on_leave' },
  ])
  .addFinalResponse('There are 3 employees on leave today.')  // ← Bug: says 3, data shows 2
  .build();

const result = await client.evaluate(trace);

console.log(result.report.groundingScore);       // 0.5
console.log(result.report.detectedFailures[0]);  // { type: 'grounding.data_ignored', severity: 'high' }

2. Monitor in production (proxy mode)

Works with any language — PHP, Python, Go, Java, Ruby, C#.

Change your AI base URL and add your Stozer key using the composite key format (||):

# OpenAI — add Stozer key before your AI key, separated by ||
OPENAI_BASE_URL=http://localhost:3001/proxy/openai/v1
OPENAI_API_KEY=stozer_live_your_key||sk-your-openai-key

# Anthropic
ANTHROPIC_BASE_URL=http://localhost:3001/proxy/anthropic
ANTHROPIC_API_KEY=stozer_live_your_key||sk-ant-your-key

Your app works exactly the same — zero code changes. Stozer automatically splits the key, identifies your account, and forwards only the AI key to the provider.

Detection

Deterministic failure detectors across six categories: grounding, reasoning, orchestration, safety, semantic, and instrumentation.

Examples:

Fabrication from empty tool results or missing RAG context
Numeric inconsistencies with tool data
Ignored or altered data in the response
Entity misattribution
Prompt leakage and sensitive data exposure
Tool budget exhaustion and fabricated tool inputs

Features

Diagnostic Advisor

Every detected failure includes actionable diagnostics — root cause identification, evidence, and remediation guidance.

Policy Engine

Configure per-failure actions on your Stozer dashboard — block, warn, or observe for each failure type. SDK wrappers automatically enforce your configured policy:

import { wrapOpenAI, GroundingError } from 'stozer-ai';
import OpenAI from 'openai';

const openai = wrapOpenAI(new OpenAI(), {
  mode: 'block',
  threshold: 0.85,
  onEvaluate: (report, trace) => {
    console.log('Score:', report.groundingScore);
    console.log('Failures:', report.detectedFailures.length);
  },
});

Also available: wrapAnthropic for Anthropic SDK, wrapGemini for Google Gemini SDK.

import { wrapAnthropic } from 'stozer-ai';
const client = wrapAnthropic(new Anthropic(), { mode: 'observe' });

import { wrapGemini } from 'stozer-ai';
const model = wrapGemini(genAI.getGenerativeModel({ model: 'gemini-2.0-flash' }), { mode: 'observe' });

MCP Server (VS Code, Cursor, Windsurf, Zed, Claude Desktop)

Use Stozer directly from your IDE — no terminal needed.

Setup (one command):

npx stozer auth stozer_live_your_key
npx stozer mcp-setup

Auto-detects installed IDEs and configures MCP. Restart your IDE after running.

In VS Code: Ctrl+Shift+P → "MCP: Open User Configuration"
Add this to mcp.json:

{
  "servers": {
    "stozer": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "stozer-ai", "mcp"]
    }
  }
}

Restart VS Code

Usage: In Copilot Chat, say: "Call stozer verify_response with this trace: {...}"

8 tools available: verify_response, quick_check, check_trace_quality, list_rules, get_failure_info, evaluate_with_policy, get_live_traces, get_trace_report

Express Middleware

import express from 'express';
import { groundingMiddleware, FileStore } from 'stozer-ai';

const app = express();
app.post('/api/chat', groundingMiddleware({
  mode: 'warn',
  store: new FileStore('./traces/grounding.jsonl'),
  extractTrace: (req, res, body) => body.trace,
}));

CLI

npx stozer auth stozer_live_your_key      # Save API key (once)
npx stozer mcp-setup                      # Auto-configure MCP for all IDEs
npx stozer evaluate trace.json            # Evaluate one trace
npx stozer traces                         # List recent evaluations
npx stozer trace <trace-id>               # Full report for a specific trace
npx stozer rules [category]               # List detection rules (grounding, reasoning, safety, ...)
npx stozer rule <failure_type>            # Details + remediation for a rule

Auto-discovery: Set STOZER_API_KEY env var and all SDK wrappers automatically send traces — no explicit client setup needed.

How It Works

Analyze the agent's response against tool output data
Detect grounding failures using deterministic rules
Score overall grounding quality
Diagnose root causes with actionable remediation

No LLM calls. Deterministic. Supports 11 human languages.

Benchmarks

Tested on public hallucination datasets — no cherry-picking, full results. Same engine, same rules — accuracy depends on data structure.

Tool-Calling Agents (structured JSON — APIs, databases, functions)

| Benchmark | Samples | Precision | Recall | F1 | Runtime | |---|---|---|---|---|---| | HaluEval QA | 16,662 | 96.4% | 93.3% | 96.5% | 8s |

HaluEval QA (Li et al., 2023): 16,662 question–answer pairs with known hallucinations. Near-zero false positives on structured data — when Stozer flags something, it's real.

RAG Pipelines (free-text documents — PDFs, knowledge bases, policies)

| Benchmark | Samples | Precision | Recall | F1 | Runtime | |---|---|---|---|---|---| | FaithBench | 750 | 61.3% | 78.6% | 68.9% | 4s |

FaithBench (Bai et al., 2024): 750 free-text summaries. Lower precision is expected — paraphrased prose is harder than structured data. Stozer reports a coverage metric showing what fraction of claims it verified deterministically vs. semantically, so you know exactly where the certainty boundary is.

Production Traces

| Samples | Precision | Recall | F1 | |---|---|---|---| | 1,500+ verified | 98.6% | 96.2% | 97.8% |

Predominantly tool-calling agents with structured API/database outputs, across HR, finance, and operations domains.

Why the Gap?

The accuracy difference comes from data structure, not configuration:

Tool outputs are structured JSON — field names, typed values, discrete data. Deterministic checks are near-perfect.
RAG documents are free text — paraphrases, coreference, entailment. Harder to match deterministically. Stozer transparently reports what it could verify and what it couldn't.

Reproduce locally: npx ts-node _halueval_bench.ts and npx ts-node _faithbench_bench.ts

License

Proprietary. See LICENSE for details.