Sentient-SDK

A developer-first SDK for automated RAG evaluation using LLM-as-a-Judge and Deterministic Guards.

The Problem

RAG (Retrieval-Augmented Generation) systems have a critical flaw: hallucinations kill trust.

When your LLM generates a response, how do you know if it's:

  • Faithful to the retrieved context?
  • Relevant to the user's query?
  • Free of hallucinations?

Manually reviewing responses doesn't scale. Traditional NLP metrics don't capture semantic meaning. You need a systematic approach.

The Solution

Sentient-SDK automates the Evaluation (Evals) phase of your RAG lifecycle:

import { evaluate } from 'sentient-sdk';

const result = await evaluate(
  {
    context: 'Paris is the capital of France, established in 508 AD.',
    query: 'What is the capital of France?',
    response: 'The capital of France is Paris, a city founded in ancient Roman times.',
  },
  {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  }
);

console.log(result);
// {
//   verdict: 'FAIL',
//   faithfulness: { score: 0.6, rationale: 'Incorrect founding date claim...' },
//   hallucination: { detected: true, evidence: '"ancient Roman times"...' },
//   guards: { piiLeak: false, forbiddenTerms: [] },
//   latencyMs: 1234,
// }

Architecture

Application
  └── uses Sentient SDK
          ├── Judge (LLM-based)        → OpenAI / Claude
          ├── Guards (Deterministic)   → PII / Forbidden Terms
          ├── Scorer (combines signals)→ Configurable thresholds
          ├── Reporter (structured)    → JSON output
          └── Shadow Runner (async)    → Non-blocking evaluation

Clean Architecture

sentient-sdk/
├── src/
│   ├── domain/           # Core types & interfaces
│   │   ├── types.ts
│   │   ├── Judge.ts
│   │   └── Guard.ts
│   ├── application/      # Use cases
│   │   ├── EvaluateRAG.ts
│   │   ├── ScoringPolicy.ts
│   │   └── ShadowRunner.ts
│   ├── infrastructure/   # Implementations
│   │   ├── judges/
│   │   │   ├── OpenAIJudge.ts
│   │   │   └── ClaudeJudge.ts
│   │   ├── guards/
│   │   │   ├── PiiGuard.ts
│   │   │   └── ForbiddenTermsGuard.ts
│   │   └── reporters/
│   │       └── JsonReporter.ts
│   ├── cli/
│   │   └── shadow.ts
│   └── index.ts
└── tests/

Installation

npm install sentient-sdk
# or
pnpm add sentient-sdk

Quick Start

Basic Evaluation

import { evaluate } from 'sentient-sdk';

const result = await evaluate(
  {
    context: 'The Eiffel Tower is 330 meters tall.',
    query: 'How tall is the Eiffel Tower?',
    response: 'The Eiffel Tower is 330 meters tall.',
  },
  {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  }
);

if (result.verdict === 'PASS') {
  console.log('Response is reliable');
} else {
  console.log('Response failed evaluation');
  console.log('Reason:', result.faithfulness.rationale);
}

Advanced Configuration

import { EvaluateRAG, OpenAIJudge, PiiGuard, ForbiddenTermsGuard, ScoringPolicy } from 'sentient-sdk';

// Create a custom evaluator
const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o',
    temperature: 0,
  }),
  guards: [
    new PiiGuard({ types: ['email', 'phone', 'ssn'] }),
    new ForbiddenTermsGuard({
      terms: ['competitor_name', 'internal_secret'],
      caseInsensitive: true,
    }),
  ],
  scoringPolicy: new ScoringPolicy({
    faithfulnessThreshold: 0.8,  // Stricter than default 0.7
    relevanceThreshold: 0.6,
    failOnGuardViolation: true,
  }),
});

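// context, query, and response come from your RAG pipeline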
const result = await evaluator.run({ context, query, response });

Shadow Testing (Production)

Run evaluations on a percentage of live traffic without affecting latency:

import { ShadowRunner, EvaluateRAG, OpenAIJudge, PiiGuard } from 'sentient-sdk';

const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({ apiKey: process.env.OPENAI_API_KEY }),
  guards: [new PiiGuard()],
});

const shadow = new ShadowRunner({
  evaluator,
  sampleRate: 0.1, // 10% of requests
  onResult: (result, input) => {
    // Log to your observability platform
    metrics.record('rag.evaluation', {
      verdict: result.verdict,
      faithfulness: result.faithfulness.score,
      latencyMs: result.latencyMs,
    });
  },
  onError: (error, input) => {
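    // Forward evaluation failures to your application's logger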
    logger.error('Evaluation failed', { error, input });
  },
});

// In your RAG handler:
app.post('/chat', async (req, res) => {
  const response = await generateRAGResponse(req.body);
  
  // Fire-and-forget; doesn't block the response
  shadow.maybeEvaluate({
    context: response.retrievedContext,
    query: req.body.query,
    response: response.text,
  });
  
  return res.json(response);
});

CLI Usage

Evaluate a Single Response

export OPENAI_API_KEY="sk-..."

sentient eval \
  --context "Paris is the capital of France." \
  --query "What is the capital of France?" \
  --response "The capital of France is Paris."

Shadow Evaluation from JSONL

# Input file: evaluations.jsonl
# {"context": "...", "query": "...", "response": "..."}
# {"context": "...", "query": "...", "response": "..."}

sentient shadow \
  --input evaluations.jsonl \
  --output results.jsonl \
  --sample-rate 1.0 \
  --judge openai \
  --verbose

Evaluation Result Schema

interface EvaluationResult {
  // LLM-based evaluations
  faithfulness: {
    score: number;           // 0-1, higher is more faithful
    rationale: string;       // Explanation
    unsupportedClaims?: string[];
  };
  relevance: {
    score: number;           // 0-1, higher is more relevant
    rationale?: string;
  };
  hallucination: {
    detected: boolean;
    confidence: number;      // 0-1
    evidence?: string;       // Quote of hallucinated content
  };
  
  // Deterministic checks
  guards: {
    piiLeak: boolean;
    forbiddenTerms: string[];
    piiDetails?: string[];
  };
  
  // Final verdict
  verdict: 'PASS' | 'FAIL';
  evaluatedAt: string;       // ISO timestamp
  latencyMs: number;
}
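The per-field structure supports finer-grained handling than the top-level verdict alone. A minimal sketch, assuming the EvaluationResult type is exported by the package:

import type { EvaluationResult } from 'sentient-sdk';

// Sketch: react to individual findings instead of only PASS/FAIL.
function logFindings(result: EvaluationResult): void {
  if (result.hallucination.detected) {
    console.warn('Possible hallucination:', result.hallucination.evidence);
  }
  if (result.guards.piiLeak) {
    console.error('PII leak:', result.guards.piiDetails);
  }
  if (result.faithfulness.unsupportedClaims?.length) {
    console.warn('Unsupported claims:', result.faithfulness.unsupportedClaims);
  }
}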

Testing Philosophy

This SDK follows TDD (Test-Driven Development):

  1. Tests are written before implementation
  2. Every feature has corresponding test cases
  3. Mock judges make tests deterministic and fast (see the sketch below)

# Run all tests
pnpm test

# Run with coverage
pnpm test:coverage

# Run specific test files
pnpm test:run tests/domain

Current test suite: 68 tests across 8 test files.
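For instance, a unit test can swap the LLM judge for a canned implementation. A minimal sketch, assuming a Judge contract shaped like the LLM-scored portion of EvaluationResult (the real interface lives in src/domain/Judge.ts and may differ):

import { EvaluateRAG } from 'sentient-sdk';

// Illustrative mock judge for deterministic, offline tests. The method
// name and return shape are assumptions based on the EvaluationResult
// schema above, not the SDK's actual Judge contract.
const mockJudge = {
  judge: async (_input: { context: string; query: string; response: string }) => ({
    faithfulness: { score: 1, rationale: 'mocked' },
    relevance: { score: 1 },
    hallucination: { detected: false, confidence: 1 },
  }),
};

// Cast because the assumed shape may not match the real interface.
const evaluator = new EvaluateRAG({ judge: mockJudge as any, guards: [] });
// Assertions can now run without network calls or API keys.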


Supported Judges

| Judge | Model | Use Case |
|-------|-------|----------|
| OpenAIJudge | GPT-4o (default) | High accuracy, production use |
| ClaudeJudge | Claude 3.5 Sonnet | Alternative provider |
| Custom | Any Judge interface | Bring your own LLM |

Built-in Guards

| Guard | Detects |
|-------|---------|
| PiiGuard | Emails, phones, SSNs, credit cards, IPs |
| ForbiddenTermsGuard | Custom banned words/phrases |


API Reference

evaluate(input, options)

Quick evaluation function for simple use cases.

EvaluateRAG

Main evaluation orchestrator with full configuration.

ShadowRunner

Async, sampled evaluation for production traffic.

ScoringPolicy

Configurable thresholds for pass/fail determination.

JsonReporter

Structured JSON output for evaluation results.


What This Enables in Production

  • CI/CD Integration: Fail builds if response quality drops (see the sketch below)
  • Observability: Track faithfulness scores over time
  • A/B Testing: Compare prompt versions objectively
  • Compliance: Detect PII leaks before they reach users
  • Quality Gates: Block low-quality responses from reaching users
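As a sketch of the CI/CD point above, a build step could run the evaluate function from the Quick Start over a small golden dataset and exit non-zero on any FAIL verdict. The file name and dataset format here are assumptions:

// ci-quality-gate.ts (hypothetical CI step)
// golden-cases.json is an assumed fixture of { context, query, response } records.
import { evaluate } from 'sentient-sdk';
import { readFileSync } from 'node:fs';

const cases = JSON.parse(readFileSync('golden-cases.json', 'utf8'));

(async () => {
  for (const c of cases) {
    const result = await evaluate(c, {
      judge: 'openai',
      openaiApiKey: process.env.OPENAI_API_KEY,
    });
    if (result.verdict === 'FAIL') {
      console.error(`Quality gate failed for query: ${c.query}`);
      process.exit(1); // non-zero exit fails the build
    }
  }
  console.log('All golden cases passed');
})();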

Contributing

Contributions are welcome! Please read the contributing guidelines first.

  1. Fork the repository
  2. Create a feature branch
  3. Write tests first (TDD)
  4. Submit a PR

Built with ❤️ by Dharmik for RAG reliability!