@mastra/evals
A comprehensive evaluation framework for assessing AI model outputs across multiple dimensions.
Installation
npm install @mastra/evals
Overview
@mastra/evals provides a suite of evaluation metrics for assessing AI model outputs. The package includes both LLM-based and NLP-based metrics, supporting fully automated as well as model-assisted evaluation of AI responses.
Features
LLM-Based Metrics
Answer Relevancy
- Evaluates how well an answer addresses the input question
- Considers uncertainty weighting for more nuanced scoring
- Returns detailed reasoning for scores
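A minimal sketch of wiring this up, assuming the metric follows the same model-plus-options constructor pattern as the Usage examples below; the uncertaintyWeight option name is an assumption based on the feature list:
import { openai } from '@ai-sdk/openai';
import { AnswerRelevancyMetric } from '@mastra/evals';
const relevancyMetric = new AnswerRelevancyMetric({
  model: openai('gpt-4'),
  uncertaintyWeight: 0.3, // assumed option name for uncertainty weighting
});
const result = await relevancyMetric.measure(
  'What is the capital of France?',
  'Paris is the capital of France.',
);
console.log(result.score, result.reason); // score with detailed reasoning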
Bias Detection
- Identifies potential biases in model outputs
- Analyzes opinions and statements for bias indicators
- Provides explanations for detected biases
- Configurable scoring scale
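A minimal sketch under the same constructor-shape assumption, showing the configurable scoring scale:
import { openai } from '@ai-sdk/openai';
import { BiasMetric } from '@mastra/evals';
const biasMetric = new BiasMetric({
  model: openai('gpt-4'),
  scale: 1, // configurable scoring scale
});
const input = 'Describe a typical software engineer.';
const output = 'Software engineers come from many different backgrounds.';
const result = await biasMetric.measure(input, output);
console.log(result.score, result.reason); // explanation for any detected bias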
Context Precision & Relevancy
- Assesses how well responses use provided context
- Evaluates accuracy of context usage
- Measures relevance of context to the response
- Analyzes context positioning in responses
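A minimal sketch, assuming context is passed at construction as in the Context-Aware Evaluation example below (class name inferred from the feature name):
import { openai } from '@ai-sdk/openai';
import { ContextPrecisionMetric } from '@mastra/evals';
const precisionMetric = new ContextPrecisionMetric({
  model: openai('gpt-4'),
  context: ['Paris is the capital of France'],
});
const result = await precisionMetric.measure(
  'What is the capital of France?',
  'Paris is the capital of France.',
);
console.log(result.score); // how precisely the response uses the provided context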
Faithfulness
- Verifies that responses are faithful to provided context
- Detects hallucinations or fabricated information
- Evaluates claims against provided context
- Provides detailed analysis of faithfulness breaches
Prompt Alignment
- Measures how well responses follow given instructions
- Evaluates adherence to multiple instruction criteria
- Provides per-instruction scoring
- Supports custom instruction sets
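A minimal sketch with a custom instruction set; the constructor shape is assumed to match the other LLM-based metrics:
import { openai } from '@ai-sdk/openai';
import { PromptAlignmentMetric } from '@mastra/evals';
const alignmentMetric = new PromptAlignmentMetric({
  model: openai('gpt-4'),
  instructions: ['Answer in one sentence', 'Mention the country'], // custom instruction set
});
const result = await alignmentMetric.measure(
  'What is the capital of France?',
  'Paris is the capital of France.',
);
console.log(result.score, result.info); // per-instruction scoring details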
Toxicity
- Detects toxic or harmful content in responses
- Provides detailed reasoning for toxicity verdicts
- Configurable scoring thresholds
- Considers both input and output context
NLP-Based Metrics
Completeness
- Analyzes structural completeness of responses
- Identifies missing elements from input requirements
- Provides detailed element coverage analysis
- Tracks input-output element ratios
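A minimal sketch, assuming NLP-based metrics need no model and are exported from the @mastra/evals/nlp subpath shown under Package Exports:
import { CompletenessMetric } from '@mastra/evals/nlp';
const completenessMetric = new CompletenessMetric();
const result = await completenessMetric.measure(
  'List the primary colors red, green, and blue.',
  'The primary colors include red and blue.',
);
console.log(result.score, result.info); // element coverage analysis, including missing elements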
Content Similarity
- Measures text similarity between inputs and outputs
- Configurable for case and whitespace sensitivity
- Returns normalized similarity scores
- Uses string comparison algorithms for accuracy
Keyword Coverage
- Tracks presence of key terms from input in output
- Provides detailed keyword matching statistics
- Calculates coverage ratios
- Useful for ensuring comprehensive responses
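A minimal sketch under the same assumptions as the Completeness example above:
import { KeywordCoverageMetric } from '@mastra/evals/nlp';
const keywordMetric = new KeywordCoverageMetric();
const result = await keywordMetric.measure(
  'Compare TCP and UDP for streaming.',
  'UDP suits streaming because it avoids TCP retransmission delays.',
);
console.log(result.score, result.info); // matched keywords and coverage ratio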
Usage
Basic Example
import { openai } from '@ai-sdk/openai';
import { ContentSimilarityMetric, ToxicityMetric } from '@mastra/evals';
// Initialize metrics
const similarityMetric = new ContentSimilarityMetric({
  ignoreCase: true,
  ignoreWhitespace: true,
});
const toxicityMetric = new ToxicityMetric({
  model: openai('gpt-4'),
  scale: 1, // Optional: adjust scoring scale
});
// Evaluate outputs
const input = 'What is the capital of France?';
const output = 'Paris is the capital of France.';
const similarityResult = await similarityMetric.measure(input, output);
const toxicityResult = await toxicityMetric.measure(input, output);
console.log('Similarity Score:', similarityResult.score);
console.log('Toxicity Score:', toxicityResult.score);
Context-Aware Evaluation
import { openai } from '@ai-sdk/openai';
import { FaithfulnessMetric } from '@mastra/evals';
// Initialize with context
const faithfulnessMetric = new FaithfulnessMetric({
  model: openai('gpt-4'),
  context: ['Paris is the capital of France', 'Paris has a population of 2.2 million'],
  scale: 1,
});
// Evaluate response against context
const result = await faithfulnessMetric.measure(
  'Tell me about Paris',
  'Paris is the capital of France with 2.2 million residents',
);
console.log('Faithfulness Score:', result.score);
console.log('Reasoning:', result.reason);
Metric Results
Each metric returns a standardized result object containing:
- score: Normalized score (typically 0-1)
- info: Detailed information about the evaluation
- Additional metric-specific data (e.g., matched keywords, missing elements)
Some metrics also provide:
- reason: Detailed explanation of the score
- verdicts: Individual judgments that contributed to the final score
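For example, consuming the result of the ToxicityMetric from the Basic Example above might look like this, using only the fields listed here:
const { score, info, reason } = await toxicityMetric.measure(input, output);
if (score > 0.5) {
  console.warn('Potentially problematic output:', reason);
}
console.log(info); // metric-specific evaluation details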
Telemetry and Logging
The package includes built-in telemetry and logging capabilities:
- Automatic evaluation tracking through Mastra Storage
- Integration with OpenTelemetry for performance monitoring
- Detailed evaluation traces for debugging
import { attachListeners } from '@mastra/evals';
// Enable basic evaluation tracking
await attachListeners();
// Store evals in Mastra Storage (if storage is enabled)
await attachListeners(mastra);
// Note: When using in-memory storage, evaluations are isolated to the test process.
// When using file storage, evaluations are persisted and can be queried later.
Environment Variables
Required for LLM-based metrics:
- OPENAI_API_KEY: For OpenAI model access
- Additional provider keys as needed (Cohere, Anthropic, etc.)
Package Exports
// Main package exports
import { evaluate } from '@mastra/evals';
// NLP-specific metrics
import { ContentSimilarityMetric } from '@mastra/evals/nlp';
Related Packages
- @mastra/core: Core framework functionality
- @mastra/engine: LLM execution engine
- @mastra/mcp: Model Context Protocol integration
