Sentient-SDK
A developer-first SDK for automated RAG evaluation using LLM-as-a-Judge and Deterministic Guards.
The Problem
RAG (Retrieval-Augmented Generation) systems have a critical flaw: hallucinations kill trust.
When your LLM generates a response, how do you know if it's:
- Faithful to the retrieved context?
- Relevant to the user's query?
- Free of hallucinations?
Manually reviewing responses doesn't scale. Traditional NLP metrics don't capture semantic meaning. You need a systematic approach.
The Solution
Sentient-SDK automates the Evaluation (Evals) phase of your RAG lifecycle:
import { evaluate } from 'sentient-sdk';
const result = await evaluate(
  {
    context: 'Paris is the capital of France, established in 508 AD.',
    query: 'What is the capital of France?',
    response: 'The capital of France is Paris, a city founded in ancient Roman times.',
  },
  {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  }
);
console.log(result);
// {
//   verdict: 'FAIL',
//   faithfulness: { score: 0.6, rationale: 'Incorrect founding date claim...' },
//   hallucination: { detected: true, evidence: '"ancient Roman times"...' },
//   guards: { piiLeak: false, forbiddenTerms: [] },
//   latencyMs: 1234,
// }
Architecture
Application
└── uses Sentient SDK
    ├── Judge (LLM-based)         → OpenAI / Claude
    ├── Guards (Deterministic)    → PII / Forbidden Terms
    ├── Scorer (combines signals) → Configurable thresholds
    ├── Reporter (structured)     → JSON output
    └── Shadow Runner (async)     → Non-blocking evaluation
Clean Architecture
sentient-sdk/
├── src/
│   ├── domain/            # Core types & interfaces
│   │   ├── types.ts
│   │   ├── Judge.ts
│   │   └── Guard.ts
│   ├── application/       # Use cases
│   │   ├── EvaluateRAG.ts
│   │   ├── ScoringPolicy.ts
│   │   └── ShadowRunner.ts
│   ├── infrastructure/    # Implementations
│   │   ├── judges/
│   │   │   ├── OpenAIJudge.ts
│   │   │   └── ClaudeJudge.ts
│   │   ├── guards/
│   │   │   ├── PiiGuard.ts
│   │   │   └── ForbiddenTermsGuard.ts
│   │   └── reporters/
│   │       └── JsonReporter.ts
│   ├── cli/
│   │   └── shadow.ts
│   └── index.ts
└── tests/
Installation
npm install sentient-sdk
# or
pnpm add sentient-sdk
Quick Start
Basic Evaluation
import { evaluate } from 'sentient-sdk';
const result = await evaluate(
  {
    context: 'The Eiffel Tower is 330 meters tall.',
    query: 'How tall is the Eiffel Tower?',
    response: 'The Eiffel Tower is 330 meters tall.',
  },
  {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  }
);
if (result.verdict === 'PASS') {
  console.log('Response is reliable');
} else {
  console.log('Response failed evaluation');
  console.log('Reason:', result.faithfulness.rationale);
}
Advanced Configuration
import { EvaluateRAG, OpenAIJudge, PiiGuard, ForbiddenTermsGuard, ScoringPolicy } from 'sentient-sdk';
// Create a custom evaluator
const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o',
    temperature: 0,
  }),
  guards: [
    new PiiGuard({ types: ['email', 'phone', 'ssn'] }),
    new ForbiddenTermsGuard({
      terms: ['competitor_name', 'internal_secret'],
      caseInsensitive: true,
    }),
  ],
  scoringPolicy: new ScoringPolicy({
    faithfulnessThreshold: 0.8, // Stricter than default 0.7
    relevanceThreshold: 0.6,
    failOnGuardViolation: true,
  }),
});
const result = await evaluator.run({ context, query, response });
Shadow Testing (Production)
Run evaluations on a percentage of live traffic without affecting latency:
import { ShadowRunner, EvaluateRAG, OpenAIJudge, PiiGuard } from 'sentient-sdk';
const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({ apiKey: process.env.OPENAI_API_KEY }),
  guards: [new PiiGuard()],
});
const shadow = new ShadowRunner({
  evaluator,
  sampleRate: 0.1, // 10% of requests
  onResult: (result, input) => {
    // Log to your observability platform
    metrics.record('rag.evaluation', {
      verdict: result.verdict,
      faithfulness: result.faithfulness.score,
      latencyMs: result.latencyMs,
    });
  },
  onError: (error, input) => {
    logger.error('Evaluation failed', { error, input });
  },
});
// In your RAG handler:
app.post('/chat', async (req, res) => {
  const response = await generateRAGResponse(req.body);
  // Fire-and-forget - doesn't block the response
  shadow.maybeEvaluate({
    context: response.retrievedContext,
    query: req.body.query,
    response: response.text,
  });
  return res.json(response);
});
CLI Usage
Evaluate a Single Response
export OPENAI_API_KEY="sk-..."
sentient eval \
  --context "Paris is the capital of France." \
  --query "What is the capital of France?" \
  --response "The capital of France is Paris."
Shadow Evaluation from JSONL
# Input file: evaluations.jsonl
# {"context": "...", "query": "...", "response": "..."}
# {"context": "...", "query": "...", "response": "..."}
sentient shadow \
  --input evaluations.jsonl \
  --output results.jsonl \
  --sample-rate 1.0 \
  --judge openai \
  --verbose
Evaluation Result Schema
interface EvaluationResult {
  // LLM-based evaluations
  faithfulness: {
    score: number; // 0-1, higher is more faithful
    rationale: string; // Explanation
    unsupportedClaims?: string[];
  };
  relevance: {
    score: number; // 0-1, higher is more relevant
    rationale?: string;
  };
  hallucination: {
    detected: boolean;
    confidence: number; // 0-1
    evidence?: string; // Quote of hallucinated content
  };
  // Deterministic checks
  guards: {
    piiLeak: boolean;
    forbiddenTerms: string[];
    piiDetails?: string[];
  };
  // Final verdict
  verdict: 'PASS' | 'FAIL';
  evaluatedAt: string; // ISO timestamp
  latencyMs: number;
}
Testing Philosophy
This SDK follows TDD (Test-Driven Development):
- Tests are written before implementation
- Every feature has corresponding test cases
- Mock judges make tests deterministic and fast
# Run all tests
pnpm test
# Run with coverage
pnpm test:coverage
# Run specific test files
pnpm test:run tests/domain
Current test coverage: 68 tests across 8 test files.
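As an illustration of the mock-judge approach, here is a minimal test sketch. It assumes a Vitest-style runner (not confirmed by this README) and stubs the judge with a hand-written object; the stub's shape is an assumption about the Judge interface in src/domain/Judge.ts, so adjust it to the real contract.
import { describe, expect, it } from 'vitest';
import { EvaluateRAG, PiiGuard } from 'sentient-sdk';

// Hypothetical stub judge: always reports a faithful, relevant, hallucination-free response.
// Its shape is an assumption about the Judge interface; check src/domain/Judge.ts.
const alwaysPassJudge = {
  evaluate: async () => ({
    faithfulness: { score: 1, rationale: 'Fully supported by the context.' },
    relevance: { score: 1 },
    hallucination: { detected: false, confidence: 0.95 },
  }),
};

describe('EvaluateRAG with a mock judge', () => {
  it('returns PASS without calling a real LLM', async () => {
    const evaluator = new EvaluateRAG({
      judge: alwaysPassJudge as any, // cast because the stub shape is assumed
      guards: [new PiiGuard()],
    });

    const result = await evaluator.run({
      context: 'The Eiffel Tower is 330 meters tall.',
      query: 'How tall is the Eiffel Tower?',
      response: 'The Eiffel Tower is 330 meters tall.',
    });

    expect(result.verdict).toBe('PASS');
  });
});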
Supported Judges
| Judge | Model | Use Case |
|-------|-------|----------|
| OpenAIJudge | GPT-4o (default) | High accuracy, production use |
| ClaudeJudge | Claude 3.5 Sonnet | Alternative provider |
| Custom | Any Judge interface | Bring your own LLM |
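The Custom row means any model can act as a judge, as long as it satisfies the Judge interface from src/domain/Judge.ts. That interface is not spelled out in this README, so the sketch below assumes a single async evaluate method returning the LLM-based fields of the result schema; the method name, return shape, and the local endpoint URL are all assumptions to verify against the source.
import { EvaluateRAG, PiiGuard } from 'sentient-sdk';

// Sketch of a bring-your-own-LLM judge backed by a self-hosted model.
// The evaluate() signature and return shape are assumptions; see src/domain/Judge.ts.
class LocalModelJudge {
  async evaluate(input: { context: string; query: string; response: string }) {
    // Hypothetical endpoint for your own model.
    const raw = await fetch('http://localhost:8080/judge', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    }).then((r) => r.json());

    return {
      faithfulness: { score: raw.faithfulness, rationale: raw.rationale },
      relevance: { score: raw.relevance },
      hallucination: { detected: raw.hallucinated, confidence: raw.confidence },
    };
  }
}

const evaluator = new EvaluateRAG({
  judge: new LocalModelJudge() as any, // cast because the interface shape is assumed
  guards: [new PiiGuard()],
});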
Built-in Guards
| Guard | Detects |
|-------|---------|
| PiiGuard | Emails, phones, SSNs, credit cards, IPs |
| ForbiddenTermsGuard | Custom banned words/phrases |
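Because EvaluateRAG takes an array of guards, project-specific deterministic checks can sit alongside the built-ins. Below is a minimal sketch of a custom guard that flags overly long responses; the check method name and return shape are assumptions about the Guard contract in src/domain/Guard.ts, so verify before use.
// Hypothetical custom guard: flags responses above a character budget.
// The check() signature and return shape are assumptions; see src/domain/Guard.ts.
class MaxLengthGuard {
  constructor(private readonly maxChars: number) {}

  check(input: { response: string }) {
    const tooLong = input.response.length > this.maxChars;
    return {
      violated: tooLong,
      details: tooLong ? [`response exceeds ${this.maxChars} characters`] : [],
    };
  }
}

// Usage alongside the built-ins (cast because the shape is assumed):
// guards: [new PiiGuard(), new MaxLengthGuard(2000) as any]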
API Reference
evaluate(input, options)
Quick evaluation function for simple use cases.
EvaluateRAG
Main evaluation orchestrator with full configuration.
ShadowRunner
Async, sampled evaluation for production traffic.
ScoringPolicy
Configurable thresholds for pass/fail determination.
JsonReporter
Structured JSON output for evaluation results.
What This Enables in Production
- CI/CD Integration: Fail builds if response quality drops (see the sketch after this list)
- Observability: Track faithfulness scores over time
- A/B Testing: Compare prompt versions objectively
- Compliance: Detect PII leaks before they reach users
- Quality Gates: Block low-quality responses from reaching users
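For the CI/CD case, the gate can be a short script that replays a fixture of (context, query, response) records through evaluate and fails the build on any FAIL verdict. A minimal sketch follows; the regression-set.jsonl fixture and the ci-gate.ts script name are illustrative, not part of the SDK.
// ci-gate.ts: hypothetical CI script, run with something like `npx tsx ci-gate.ts`.
import { readFileSync } from 'node:fs';
import { evaluate } from 'sentient-sdk';

async function main() {
  const cases = readFileSync('regression-set.jsonl', 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));

  let failures = 0;
  for (const c of cases) {
    const result = await evaluate(
      { context: c.context, query: c.query, response: c.response },
      { judge: 'openai', openaiApiKey: process.env.OPENAI_API_KEY }
    );
    if (result.verdict === 'FAIL') {
      failures += 1;
      console.error(`FAIL (${c.query}): ${result.faithfulness.rationale}`);
    }
  }

  if (failures > 0) {
    console.error(`${failures}/${cases.length} regression cases failed evaluation`);
    process.exit(1);
  }
  console.log(`All ${cases.length} regression cases passed`);
}

main();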
Contributing
Contributions are welcome! Please read the contributing guidelines first.
- Fork the repository
- Create a feature branch
- Write tests first (TDD)
- Submit a PR
Built with ❤️ by Dharmik, for RAG reliability!
