Sentient-SDK
A developer-first SDK for automated RAG evaluation using LLM-as-a-Judge and Deterministic Guards.
The Problem
RAG (Retrieval-Augmented Generation) systems have a critical flaw: hallucinations kill trust.
When your LLM generates a response, how do you know if it's:
- Faithful to the retrieved context?
- Relevant to the user's query?
- Free of hallucinations?
Manually reviewing responses doesn't scale. Traditional NLP metrics don't capture semantic meaning. You need a systematic approach.
The Solution
Sentient-SDK automates the Evaluation (Evals) phase of your RAG lifecycle:
import { evaluate } from 'sentient-sdk';
const result = await evaluate(
  {
    context: 'Paris is the capital of France, established in 508 AD.',
    query: 'What is the capital of France?',
    response: 'The capital of France is Paris, a city founded in ancient Roman times.',
  },
  {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  }
);
console.log(result);
// {
//   verdict: 'FAIL',
//   faithfulness: { score: 0.6, rationale: 'Incorrect founding date claim...' },
//   hallucination: { detected: true, evidence: '"ancient Roman times"...' },
//   guards: { piiLeak: false, forbiddenTerms: [] },
//   latencyMs: 1234,
// }
Architecture
Application
└── uses Sentient SDK
    ├── Judge (LLM-based)         → OpenAI / Claude
    ├── Guards (Deterministic)    → PII / Forbidden Terms
    ├── Scorer (combines signals) → Configurable thresholds
    ├── Reporter (structured)     → JSON output
    └── Shadow Runner (async)     → Non-blocking evaluation
Clean Architecture
sentient-sdk/
├── src/
│   ├── domain/            # Core types & interfaces
│   │   ├── types.ts
│   │   ├── Judge.ts
│   │   └── Guard.ts
│   ├── application/       # Use cases
│   │   ├── EvaluateRAG.ts
│   │   ├── ScoringPolicy.ts
│   │   └── ShadowRunner.ts
│   ├── infrastructure/    # Implementations
│   │   ├── judges/
│   │   │   ├── OpenAIJudge.ts
│   │   │   └── ClaudeJudge.ts
│   │   ├── guards/
│   │   │   ├── PiiGuard.ts
│   │   │   └── ForbiddenTermsGuard.ts
│   │   └── reporters/
│   │       └── JsonReporter.ts
│   ├── cli/
│   │   └── shadow.ts
│   └── index.ts
└── tests/
Installation
npm install sentient-sdk
# or
pnpm add sentient-sdk
Quick Start
Basic Evaluation
import { evaluate } from 'sentient-sdk';
const result = await evaluate(
  {
    context: 'The Eiffel Tower is 330 meters tall.',
    query: 'How tall is the Eiffel Tower?',
    response: 'The Eiffel Tower is 330 meters tall.',
  },
  {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  }
);
if (result.verdict === 'PASS') {
  console.log('Response is reliable');
} else {
  console.log('Response failed evaluation');
  console.log('Reason:', result.faithfulness.rationale);
}
Advanced Configuration
import { EvaluateRAG, OpenAIJudge, PiiGuard, ForbiddenTermsGuard, ScoringPolicy } from 'sentient-sdk';
// Create a custom evaluator
const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o',
    temperature: 0,
  }),
  guards: [
    new PiiGuard({ types: ['email', 'phone', 'ssn'] }),
    new ForbiddenTermsGuard({
      terms: ['competitor_name', 'internal_secret'],
      caseInsensitive: true,
    }),
  ],
  scoringPolicy: new ScoringPolicy({
    faithfulnessThreshold: 0.8, // Stricter than default 0.7
    relevanceThreshold: 0.6,
    failOnGuardViolation: true,
  }),
});
const result = await evaluator.run({ context, query, response });
Shadow Testing (Production)
Run evaluations on a percentage of live traffic without affecting latency:
import { ShadowRunner, EvaluateRAG, OpenAIJudge, PiiGuard } from 'sentient-sdk';
const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({ apiKey: process.env.OPENAI_API_KEY }),
  guards: [new PiiGuard()],
});
const shadow = new ShadowRunner({
  evaluator,
  sampleRate: 0.1, // 10% of requests
  onResult: (result, input) => {
    // Log to your observability platform
    metrics.record('rag.evaluation', {
      verdict: result.verdict,
      faithfulness: result.faithfulness.score,
      latencyMs: result.latencyMs,
    });
  },
  onError: (error, input) => {
    logger.error('Evaluation failed', { error, input });
  },
});
// In your RAG handler:
app.post('/chat', async (req, res) => {
  const response = await generateRAGResponse(req.body);
  // Fire-and-forget - doesn't block the response
  shadow.maybeEvaluate({
    context: response.retrievedContext,
    query: req.body.query,
    response: response.text,
  });
  return res.json(response);
});
CLI Usage
Evaluate a Single Response
export OPENAI_API_KEY="sk-..."
sentient eval \
  --context "Paris is the capital of France." \
  --query "What is the capital of France?" \
  --response "The capital of France is Paris."
Shadow Evaluation from JSONL
# Input file: evaluations.jsonl
# {"context": "...", "query": "...", "response": "..."}
# {"context": "...", "query": "...", "response": "..."}
sentient shadow \
  --input evaluations.jsonl \
  --output results.jsonl \
  --sample-rate 1.0 \
  --judge openai \
  --verbose
Evaluation Result Schema
interface EvaluationResult {
  // LLM-based evaluations
  faithfulness: {
    score: number; // 0-1, higher is more faithful
    rationale: string; // Explanation
    unsupportedClaims?: string[];
  };
  relevance: {
    score: number; // 0-1, higher is more relevant
    rationale?: string;
  };
  hallucination: {
    detected: boolean;
    confidence: number; // 0-1
    evidence?: string; // Quote of hallucinated content
  };
  // Deterministic checks
  guards: {
    piiLeak: boolean;
    forbiddenTerms: string[];
    piiDetails?: string[];
  };
  // Final verdict
  verdict: 'PASS' | 'FAIL';
  evaluatedAt: string; // ISO timestamp
  latencyMs: number;
}
Testing Philosophy
This SDK follows TDD (Test-Driven Development):
- Tests are written before implementation
- Every feature has corresponding test cases
- Mock judges make tests deterministic and fast
# Run all tests
pnpm test
# Run with coverage
pnpm test:coverage
# Run specific test files
pnpm test:run tests/domain
Current test coverage: 68 tests across 8 test files.
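As an illustration of the mock-judge approach, here is a minimal test sketch. It assumes a Vitest-style runner (not confirmed by this README) and stubs the judge with a hand-written object; the stub's shape is an assumption about the Judge interface in src/domain/Judge.ts, so adjust it to the real contract.
import { describe, expect, it } from 'vitest';
import { EvaluateRAG, PiiGuard } from 'sentient-sdk';

// Hypothetical stub judge: always reports a faithful, relevant, hallucination-free response.
// Its shape is an assumption about the Judge interface; check src/domain/Judge.ts.
const alwaysPassJudge = {
  evaluate: async () => ({
    faithfulness: { score: 1, rationale: 'Fully supported by the context.' },
    relevance: { score: 1 },
    hallucination: { detected: false, confidence: 0.95 },
  }),
};

describe('EvaluateRAG with a mock judge', () => {
  it('returns PASS without calling a real LLM', async () => {
    const evaluator = new EvaluateRAG({
      judge: alwaysPassJudge as any, // cast because the stub shape is assumed
      guards: [new PiiGuard()],
    });

    const result = await evaluator.run({
      context: 'The Eiffel Tower is 330 meters tall.',
      query: 'How tall is the Eiffel Tower?',
      response: 'The Eiffel Tower is 330 meters tall.',
    });

    expect(result.verdict).toBe('PASS');
  });
});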
Supported Judges
| Judge | Model | Use Case |
|-------|-------|----------|
| OpenAIJudge | GPT-4o (default) | High accuracy, production use |
| ClaudeJudge | Claude 3.5 Sonnet | Alternative provider |
| Custom | Any Judge interface | Bring your own LLM |
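The Custom row means any model can act as a judge, as long as it satisfies the Judge interface from src/domain/Judge.ts. That interface is not spelled out in this README, so the sketch below assumes a single async evaluate method returning the LLM-based fields of the result schema; the method name, return shape, and the local endpoint URL are all assumptions to verify against the source.
import { EvaluateRAG, PiiGuard } from 'sentient-sdk';

// Sketch of a bring-your-own-LLM judge backed by a self-hosted model.
// The evaluate() signature and return shape are assumptions; see src/domain/Judge.ts.
class LocalModelJudge {
  async evaluate(input: { context: string; query: string; response: string }) {
    // Hypothetical endpoint for your own model.
    const raw = await fetch('http://localhost:8080/judge', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    }).then((r) => r.json());

    return {
      faithfulness: { score: raw.faithfulness, rationale: raw.rationale },
      relevance: { score: raw.relevance },
      hallucination: { detected: raw.hallucinated, confidence: raw.confidence },
    };
  }
}

const evaluator = new EvaluateRAG({
  judge: new LocalModelJudge() as any, // cast because the interface shape is assumed
  guards: [new PiiGuard()],
});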
Built-in Guards
| Guard | Detects |
|-------|---------|
| PiiGuard | Emails, phones, SSNs, credit cards, IPs |
| ForbiddenTermsGuard | Custom banned words/phrases |
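Because EvaluateRAG takes an array of guards, project-specific deterministic checks can sit alongside the built-ins. Below is a minimal sketch of a custom guard that flags overly long responses; the check method name and return shape are assumptions about the Guard contract in src/domain/Guard.ts, so verify before use.
// Hypothetical custom guard: flags responses above a character budget.
// The check() signature and return shape are assumptions; see src/domain/Guard.ts.
class MaxLengthGuard {
  constructor(private readonly maxChars: number) {}

  check(input: { response: string }) {
    const tooLong = input.response.length > this.maxChars;
    return {
      violated: tooLong,
      details: tooLong ? [`response exceeds ${this.maxChars} characters`] : [],
    };
  }
}

// Usage alongside the built-ins (cast because the shape is assumed):
// guards: [new PiiGuard(), new MaxLengthGuard(2000) as any]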
API Reference
evaluate(input, options)
Quick evaluation function for simple use cases.
EvaluateRAG
Main evaluation orchestrator with full configuration.
ShadowRunner
Async, sampled evaluation for production traffic.
ScoringPolicy
Configurable thresholds for pass/fail determination.
JsonReporter
Structured JSON output for evaluation results.
What This Enables in Production
- CI/CD Integration: Fail builds if response quality drops (see the sketch after this list)
- Observability: Track faithfulness scores over time
- A/B Testing: Compare prompt versions objectively
- Compliance: Detect PII leaks before they reach users
- Quality Gates: Block low-quality responses from reaching users
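For the CI/CD case, the gate can be a short script that replays a fixture of (context, query, response) records through evaluate and fails the build on any FAIL verdict. A minimal sketch follows; the regression-set.jsonl fixture and the ci-gate.ts script name are illustrative, not part of the SDK.
// ci-gate.ts: hypothetical CI script, run with something like `npx tsx ci-gate.ts`.
import { readFileSync } from 'node:fs';
import { evaluate } from 'sentient-sdk';

async function main() {
  const cases = readFileSync('regression-set.jsonl', 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));

  let failures = 0;
  for (const c of cases) {
    const result = await evaluate(
      { context: c.context, query: c.query, response: c.response },
      { judge: 'openai', openaiApiKey: process.env.OPENAI_API_KEY }
    );
    if (result.verdict === 'FAIL') {
      failures += 1;
      console.error(`FAIL (${c.query}): ${result.faithfulness.rationale}`);
    }
  }

  if (failures > 0) {
    console.error(`${failures}/${cases.length} regression cases failed evaluation`);
    process.exit(1);
  }
  console.log(`All ${cases.length} regression cases passed`);
}

main();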
Contributing
Contributions are welcome! Please read the contributing guidelines first.
- Fork the repository
- Create a feature branch
- Write tests first (TDD)
- Submit a PR
Built with ❤️ by Dharmik, for RAG reliability!
