
evalz

v0.1.0 · 140 downloads

Model-graded evals with TypeScript

evalz is a TypeScript library for creating model-graded, accuracy, and context-based evaluations with a focus on structured output. It provides tools to assess the quality of responses against custom criteria such as relevance, fluency, completeness, and contextual relevance. Under the hood it uses OpenAI and Instructor js (@instructor-ai/instructor) to perform structured model-graded evaluations, offering both simple and weighted evaluation mechanisms. In addition, evalz supports accuracy evaluations based on Levenshtein distance or semantic embeddings, and context-based evaluations that measure precision, recall, and entities recall.

Features

  • Structured Evaluation Models: Define your evaluation logic using Zod schemas to ensure data integrity throughout your application.
  • Flexible Evaluation Strategies: Supports various evaluation strategies, including score-based and binary evaluations, with customizable evaluators.
  • Easy Integration: Designed to integrate seamlessly with existing TypeScript projects, enhancing AI and data processing workflows with minimal setup.
  • Custom Evaluations: Define evaluation criteria tailored to your specific requirements.
  • Weighted Evaluations: Combine multiple evaluations with custom weights to calculate a composite score.
  • Accuracy Evaluations: Evaluate text similarity using Levenshtein distance or semantic embeddings.
  • Context Evaluations: Evaluate context-based criteria such as relevance, precision, recall, and entities recall.

Installation

Install evalz using your preferred package manager:

npm install evalz openai zod @instructor-ai/instructor

bun add evalz openai zod @instructor-ai/instructor

pnpm add evalz openai zod @instructor-ai/instructor

Basic Usage

Creating an Evaluator

First, create an evaluator for assessing a single aspect of a response, such as its relevance:

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function relevanceEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription: "Rate the relevance from 0 to 1."
  });
}

Conducting an Evaluation

Evaluate AI-generated content by passing the response data to your evaluator:

const evaluator = relevanceEval();

const result = await evaluator({ data: yourResponseData });
console.log(result.scoreResults);
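
Here, yourResponseData is a placeholder. Based on the Test Data section below, a model-graded evaluator expects an array of items with a prompt, a completion, and optionally an expectedCompletion; the content in this sketch is purely illustrative:

const yourResponseData = [
  {
    prompt: "Summarize the benefits of unit testing.",
    completion: "Unit testing catches regressions early and documents intended behavior."
  }
];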

Accuracy Evaluations

Create an accuracy evaluator to assess text similarity using Levenshtein distance or semantic embeddings:

import { createAccuracyEvaluator } from "evalz";

function distanceEval() {
  return createAccuracyEvaluator({
    accuracyType: "levenshtein"
  });
}

function semanticEval() {
  return createAccuracyEvaluator({
    accuracyType: "semantic"
  });
}


const evaluator = distanceEval();

const result = await evaluator({ data: [{ completion: "text1", expectedCompletion: "text2" }] });
console.log(result.scoreResults);

Weighted Evaluation

Combine multiple evaluators with specified weights for a comprehensive assessment (fluencyEval and completenessEval below are defined the same way as relevanceEval; see the Evaluation Templates section for their definitions):

import { createWeightedEvaluator } from "evalz";

const weightedEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval()
  },
  weights: {
    relevance: 0.25,
    fluency: 0.25,
    completeness: 0.5
  }
});

const result = await weightedEvaluator({ data: yourResponseData });
console.log(result.scoreResults);

Context Evaluations

Create context evaluators to assess criteria such as relevance, precision, recall, and entities recall:

Entities Recall Evaluation

import { createContextEvaluator } from "evalz";

function contextEntitiesRecallEval() {
  return createContextEvaluator({ type: "entities-recall" });
}


const data = [
  {
    prompt: "When was the first super bowl?",
    contexts: ["The First AFL–NFL World Championship Game was an American football game...", "This first championship game is retroactively referred to as Super Bowl I."],
    groundTruth: "The first superbowl was held on January 15, 1967",
    completion: "The first superbowl was held on January 15, 1967, in Los Angeles."
  }
];


const evaluator = contextEntitiesRecallEval();

const result = await evaluator({ data });
console.log(result.scoreResults);

Precision Evaluation

import { createContextEvaluator } from "evalz";

function contextPrecisionEval() {
  return createContextEvaluator({ type: "precision" });
}


const data = [
  {
    prompt: "When was the first super bowl?",
    contexts: ["The First AFL–NFL World Championship Game was an American football game...", "This first championship game is retroactively referred to as Super Bowl I."],
    groundTruth: "The first superbowl was held on January 15, 1967",
    completion: "The first superbowl was held on January 15, 1967, in Los Angeles."
  }
];


const evaluator = contextPrecisionEval();

const result = await evaluator({ data });
console.log(result.scoreResults);

Recall Evaluation

import { createContextEvaluator } from "evalz";

function contextRecallEval() {
  return createContextEvaluator({ type: "recall" });
}


const data = [
  {
    prompt: "When was the first super bowl?",
    contexts: ["The First AFL–NFL World Championship Game was an American football game...", "This first championship game is retroactively referred to as Super Bowl I."],
    groundTruth: "The first superbowl was held on January 15, 1967",
    completion: "The first superbowl was held on January 15, 1967, in Los Angeles."
  }
];


const evaluator = contextRecallEval();

const result = await evaluator({ data });
console.log(result.scoreResults);

Relevance Evaluation

import { createContextEvaluator } from "evalz";

function contextRelevanceEval() {
  return createContextEvaluator({ type: "relevance" });
}


const data = [
  {
    prompt: "When was the first super bowl?",
    contexts: ["The First AFL–NFL World Championship Game was an American football game...", "This first championship game is retroactively referred to as Super Bowl I."],
    groundTruth: "The first superbowl was held on January 15, 1967",
    completion: "The first superbowl was held on January 15, 1967, in Los Angeles."
  }
];


const evaluator = contextRelevanceEval();

const result = await evaluator({ data });
console.log(result.scoreResults);

Test Data

When constructing test data for evaluations, it is essential to distinguish between model-graded evaluations and accuracy evaluations.

Model-Graded Evaluations

For model-graded evaluations, you typically need data pairs where each item consists of a prompt, completion, and optionally an expectedCompletion. The scores are based on evaluation criteria such as relevance, fluency, and completeness.

Example

const modelGradedData = [
  { prompt: "Discuss the impact of AI on industries.", completion: "AI is transforming many industries.", expectedCompletion: "AI is transforming many industries." },
  { prompt: "Explain the causes of climate change.", completion: "Climate change is caused by human activities.", expectedCompletion: "Climate change is caused by human activities." }
];

Accuracy Evaluations

For accuracy evaluations, the test data should include pairs of strings where each pair consists of a completion and expectedCompletion string. The score is based on the similarity between these strings, assessed using either Levenshtein distance or semantic embeddings.

Example

const accuracyData = [
  { completion: "Hello, world!", expectedCompletion: "Hello world" },
  { completion: "The quick brown fox jumps over the lazy dog.", expectedCompletion: "A quick brown dog jumps over the lazy fox." }
];

Context Evaluations Data

For context-based evaluations, the test data should include prompt, completion, groundTruth, and contexts. The scores are based on evaluation criteria such as relevance, precision, recall, and entities recall.

Example

const contextData = [
  {
    prompt: "When was the first super bowl?",
    completion: "The first super bowl was held on January 15, 1967.",
    groundTruth: "The first super bowl was held on January 15, 1967.",
    contexts: [
      "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
      "This first championship game is retroactively referred to as Super Bowl I."
    ]
  },
  {
    prompt: "Who won the most Super Bowls?",
    completion: "The New England Patriots have won the most Super Bowls.",
    groundTruth: "The New England Patriots have won the Super Bowl a record six times.",
    contexts: [
      "The New England Patriots have won the Super Bowl a record six times.",
      "Other notable teams include the Pittsburgh Steelers with six and the San Francisco 49ers with five."
    ]
  }
];
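
Taken together, the examples above suggest the following informal item shapes. These types are illustrative only, inferred from the examples; they are not exported by evalz:

// Illustrative shapes inferred from the examples above; not part of the evalz API.
type ModelGradedItem = {
  prompt: string;
  completion: string;
  expectedCompletion?: string;
};

type AccuracyItem = {
  completion: string;
  expectedCompletion: string;
};

type ContextItem = {
  prompt: string;
  completion: string;
  groundTruth: string;
  contexts: string[];
};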

Evaluation Templates

Here are some templates for different types of model-graded evaluations using OpenAI models. Simply copy and adjust the evaluation description as needed.

Relevance Evaluation

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function relevanceEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription: "Rate the relevance of the response from 0 (not relevant) to 1 (highly relevant), considering how well the response addresses the main topic."
  });
}

Completeness Evaluation

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function completenessEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription: "Rate the completeness of the response from 0 (not complete) to 1 (fully complete), considering whether the response addresses all parts of the prompt."
  });
}

Answer Correctness Evaluation

import { createAccuracyEvaluator } from "evalz";

function correctnessEval() {
  return createAccuracyEvaluator({
    accuracyType: "semantic"
  });
}


const evaluator = correctnessEval();

const result = await evaluator({
  data: [{ completion: "Einstein was born in Germany in 1879.", expectedCompletion: "Einstein was born in 1879 in Germany." }]
});
console.log(result.scoreResults);

Fluency Evaluation

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function fluencyEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription: "Rate the fluency of the response from 0 (not fluent) to 1 (very fluent), considering the grammatical correctness and natural flow."
  });
}

Sentiment Analysis Evaluation

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function sentimentEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription: "Rate the sentiment of the response from -1 (very negative) to 1 (very positive), considering the emotional tone conveyed by the response."
  });
}

Grammar and Spelling Evaluation

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function grammarEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription: "Rate the grammatical correctness of the response from 0 (many errors) to 1 (no errors), considering grammar, punctuation, and spelling."
  });
}

Coherence Evaluation

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function coherenceEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription: "Rate the coherence of the response from 0 (not coherent) to 1 (highly coherent), considering whether the response logically follows from the prompt and maintains consistency."
  });
}

Coverage Evaluation

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function coverageEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription: "Rate the coverage of the response from 0 (incomplete coverage) to 1 (full coverage), considering whether the response addresses all parts of the prompt."
  });
}

API Reference

createEvaluator

Creates a basic evaluator for assessing AI-generated content based on custom criteria.

Parameters

  • client: OpenAI instance.
  • model: OpenAI model to use (e.g., "gpt-4-turbo").
  • evaluationDescription: Description guiding the evaluation criteria.
  • resultsType: Type of results to return ("score" or "binary").
  • messages: Additional messages to include in the OpenAI API call.

Example

import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

const evaluator = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Rate the relevance from 0 to 1."
});

const result = await evaluator({ data: [{ prompt: "Discuss the importance of AI.", completion: "AI is important for future technology.", expectedCompletion: "AI is important for future technology." }] });
console.log(result.scoreResults);
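
The example above uses the default score-style results. Based on the resultsType parameter listed above, a binary evaluator would presumably be configured as follows; the evaluation description and output shape here are illustrative assumptions, not documented behavior:

const binaryEvaluator = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  resultsType: "binary", // pass/fail-style results instead of a 0-1 score
  evaluationDescription: "Return 1 if the response fully answers the prompt, otherwise 0."
});

const binaryResult = await binaryEvaluator({
  data: [{ prompt: "What is 2 + 2?", completion: "4" }]
});
console.log(binaryResult.scoreResults);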

createAccuracyEvaluator

Creates an evaluator that assesses string similarity using a hybrid approach of Levenshtein distance (factual similarity) and semantic embeddings (semantic similarity), with customizable weights.

Parameters

model (optional): OpenAI.Embeddings.EmbeddingCreateParams["model"] - The OpenAI embedding model to use. Defaults to "text-embedding-3-small".

weights (optional): An object specifying the weights for factual and semantic similarities. Defaults to { factual: 0.5, semantic: 0.5 }.

Example

import { createAccuracyEvaluator } from "evalz";

const evaluator = createAccuracyEvaluator({
  model: "text-embedding-3-small",
  weights: { factual: 0.4, semantic: 0.6 }
});


const data = [
  { completion: "Einstein was born in Germany in 1879.", expectedCompletion: "Einstein was born in 1879 in Germany." }
];

const result = await evaluator({ data });
console.log(result.scoreResults);
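
The weights control how the two similarity signals are blended. Conceptually, the combined score is presumably a weighted sum of a normalized Levenshtein (factual) similarity and an embedding-based (semantic) similarity. This is a minimal sketch of that weighting, not the library's internal implementation:

// Sketch of the weighted combination described above (illustrative only).
function combineAccuracyScores(
  levenshteinSimilarity: number, // 0-1, higher is more similar
  cosineSimilarity: number,      // 0-1, higher is more similar
  weights = { factual: 0.4, semantic: 0.6 }
): number {
  return weights.factual * levenshteinSimilarity + weights.semantic * cosineSimilarity;
}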

createWeightedEvaluator

Combines multiple evaluators with specified weights for a comprehensive assessment.

Parameters

evaluators: An object mapping evaluator names to evaluator functions.

weights: An object mapping evaluator names to their corresponding weights.

Example

import { createWeightedEvaluator } from "evalz";

const weightedEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval()
  },
  weights: {
    relevance: 0.25,
    fluency: 0.25,
    completeness: 0.5
  }
});

const result = await weightedEvaluator({ data: yourResponseData });
console.log(result.scoreResults);
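
Weights are per-evaluator multipliers, so the composite score is presumably the weight-scaled sum of the individual evaluator scores. A sketch of that arithmetic (not the library's code), using hypothetical per-evaluator scores:

// Illustrative only: how a composite score falls out of per-evaluator scores and weights.
const scores = { relevance: 0.9, fluency: 0.8, completeness: 0.7 };
const weights = { relevance: 0.25, fluency: 0.25, completeness: 0.5 };

const composite = Object.entries(scores).reduce(
  (sum, [name, score]) => sum + score * weights[name as keyof typeof weights],
  0
);
console.log(composite); // 0.9 * 0.25 + 0.8 * 0.25 + 0.7 * 0.5 = 0.775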

Create Composite Weighted Evaluation

A weighted evaluator can combine several evaluation types at once:

Example

import { createEvaluator, createAccuracyEvaluator, createContextEvaluator, createWeightedEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});


const relevanceEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the relevance of the response from 0 (not at all relevant) to 1 (highly relevant), considering whether the AI stayed on topic and provided a reasonable answer."
});

const distanceEval = () => createAccuracyEvaluator({
  weights: { factual: 0.5, semantic: 0.5 }
});

const semanticEval = () => createAccuracyEvaluator({
  weights: { factual: 0.0, semantic: 1.0 }
});

const fluencyEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the completeness of the response from 0 (not at all complete) to 1 (completely answered), considering whether the AI addressed all parts of the prompt."
});

const completenessEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the completeness of the response from 0 (not at all complete) to 1 (completely answered), considering whether the AI addressed all parts of the prompt."
});

const contextEntitiesRecallEval = () => createContextEvaluator({ type: "entities-recall" });
const contextPrecisionEval = () => createContextEvaluator({ type: "precision" });
const contextRecallEval = () => createContextEvaluator({ type: "recall" });
const contextRelevanceEval = () => createContextEvaluator({ type: "relevance" });


const compositeWeightedEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval(),
    accuracy: createAccuracyEvaluator({ weights: { factual: 0.6, semantic: 0.4 } }),
    contextPrecision: contextPrecisionEval()
  },
  weights: {
    relevance: 0.2,
    fluency: 0.2,
    completeness: 0.2,
    accuracy: 0.2,
    contextPrecision: 0.2
  }
});


const data = [
  {
    prompt: "When was the first super bowl?",
    completion: "The first super bowl was held on January 15, 1967.",
    expectedCompletion: "The first superbowl was held on January 15, 1967.",
    contexts: ["The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."],
    groundTruth: "The first superbowl was held on January 15, 1967."
  }
];


const result = await compositeWeightedEvaluator({ data });
console.log(result.scoreResults);

createContextEvaluator

Creates an evaluator that assesses context-based criteria such as relevance, precision, recall, and entities recall.

Parameters

type: "entities-recall" | "precision" | "recall" | "relevance" - The type of context evaluation to perform.

model (optional): OpenAI.Embeddings.EmbeddingCreateParams["model"] - The OpenAI embedding model to use. Defaults to "text-embedding-3-small".

Example

import { createContextEvaluator } from "evalz";


const entitiesRecallEvaluator = createContextEvaluator({ type: "entities-recall" });


const precisionEvaluator = createContextEvaluator({ type: "precision" });


const recallEvaluator = createContextEvaluator({ type: "recall" });


const relevanceEvaluator = createContextEvaluator({ type: "relevance" });


const data = [
  { 
    prompt: "When was the first super bowl?", 
    completion: "The first superbowl was held on January 15, 1967.", 
    groundTruth: "The first superbowl was held on January 15, 1967.", 
    contexts: [
      "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967 at the Los Angeles Memorial Coliseum in Los Angeles.",
      "This first championship game is retroactively referred to as Super Bowl I."
    ]
  }
];


const result1 = await entitiesRecallEvaluator({ data });
console.log(result1.scoreResults);


const result2 = await precisionEvaluator({ data });
console.log(result2.scoreResults);


const result3 = await recallEvaluator({ data });
console.log(result3.scoreResults);


const result4 = await relevanceEvaluator({ data });
console.log(result4.scoreResults);

When and Why to Use Different Evaluators

Model-Graded Evaluators

  • When to Use: Assess the quality of responses based on criteria such as relevance, fluency, and completeness.
  • Why to Use: Provides detailed feedback on different aspects of the response, integrates seamlessly with AI systems, and automates the assessment process.
  • Examples: Evaluating chatbot responses, assessing AI-generated content, and measuring customer support responses.

Accuracy Evaluators

  • When to Use: Evaluate the similarity between two pieces of text, validate the accuracy of AI-generated content, and compare generated responses to training data.
  • Why to Use: Provides a precise measure of text similarity using advanced embeddings and flexible metrics like Levenshtein distance and cosine similarity.
  • Examples: Evaluating machine translations, comparing text summaries, and measuring grammar corrections.

Context Evaluators

  • When to Use: Assess the relevance of retrieved documents, measure how well AI-generated answers align with the given context, and evaluate the relevance of categorized content.
  • Why to Use: Provides detailed insights into the contextual relevance of content, evaluates various criteria like precision, recall, and entities recall, and enhances understanding of AI-generated content in context.
  • Examples: Evaluating RAG results, measuring document retrieval relevance, and assessing content matching in personalized feeds.

Composite Weighted Evaluators

  • When to Use: Combine multiple types of evaluations for a comprehensive assessment, including model-graded, accuracy, and context-based evaluations.
  • Why to Use: Provides a holistic evaluation by combining various criteria and applying custom weights to each evaluator.
  • Examples: Evaluating the overall performance of AI-powered systems like chatbots and RAG apps.

Real-World Examples

Chatbot Evaluation

Scenario

A company wants to evaluate the performance of their AI-powered customer support chatbot to ensure it provides relevant and accurate responses to user queries.

Approach

  1. Model-Graded Evaluators: Use relevance, fluency, and completeness evaluators to assess the overall quality of chatbot responses.
  2. Accuracy Evaluators: Measure the accuracy of responses by comparing them to a predefined set of expected answers.
  3. Context Evaluators: Assess the relevance of responses within the context of previous user interactions.

Example Code

import { createEvaluator, createAccuracyEvaluator, createContextEvaluator, createWeightedEvaluator } from "evalz";
import OpenAI from "openai";


const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"]
});


const relevanceEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the relevance of the response from 0 (not at all relevant) to 1 (highly relevant), considering whether the AI stayed on topic and provided a reasonable answer."
});

const distanceEval = () => createAccuracyEvaluator({
  weights: { factual: 0.5, semantic: 0.5 }
});

const fluencyEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the fluency of the response from 0 (not fluent) to 1 (very fluent), considering the grammatical correctness and natural flow."
});

const completenessEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the completeness of the response from 0 (not at all complete) to 1 (completely answered), considering whether the AI addressed all parts of the prompt."
});

const contextRelevanceEval = () => createContextEvaluator({ type: "relevance" });
const compositeEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval(),
    accuracy: distanceEval(),
    contextRelevance: contextRelevanceEval()
  },
  weights: {
    relevance: 0.2,
    fluency: 0.2,
    completeness: 0.2,
    accuracy: 0.2,
    contextRelevance: 0.2
  }
});


const data = [
  {
    prompt: "How can I reset my password?",
    completion: "You can reset your password by clicking on the 'Forgot Password' link on the login page.",
    expectedCompletion: "To reset your password, click on the 'Forgot Password' link on the login page.",
    contexts: ["User asked, 'How can I reset my password?'", "Support response: 'You can reset your password by clicking on the 'Forgot Password' link on the login page.'"],
    groundTruth: "Please guide the user on resetting their password."
  }
];


const result = await compositeEvaluator({ data });
console.log(result.scoreResults);

Document Retrieval System

Scenario

A team building a retrieval-augmented generation (RAG) system wants to evaluate the relevance and precision of its document retrieval component to ensure it retrieves the most relevant documents for user queries.

Approach

  1. Model-Graded Evaluators: Use relevance and coverage evaluators to assess the relevance and completeness of retrieved documents.
  2. Accuracy Evaluators: Measure the accuracy of document summaries by comparing them to reference summaries.
  3. Context Evaluators: Assess the precision and recall of retrieved documents within the context of user queries.

Example Code

import { createEvaluator, createAccuracyEvaluator, createContextEvaluator, createWeightedEvaluator } from "evalz";
import OpenAI from "openai";


const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});


const relevanceEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the relevance of the document from 0 (not relevant) to 1 (highly relevant), considering how well the document addresses the query."
});

const distanceEval = () => createAccuracyEvaluator({
  weights: { factual: 0.5, semantic: 0.5 }
});

const coverageEval = () => createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Please rate the coverage of the document from 0 (incomplete) to 1 (complete), considering whether the document addresses all parts of the query."
});

const contextPrecisionEval = () => createContextEvaluator({ type: "precision" });
const contextRecallEval = () => createContextEvaluator({ type: "recall" });


const compositeEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    coverage: coverageEval(),
    accuracy: distanceEval(),
    contextPrecision: contextPrecisionEval(),
    contextRecall: contextRecallEval()
  },
  weights: {
    relevance: 0.25,
    coverage: 0.25,
    accuracy: 0.25,
    contextPrecision: 0.125,
    contextRecall: 0.125
  }
});


const data = [
  {
    prompt: "What are the symptoms of COVID-19?",
    completion: "The most common symptoms of COVID-19 are fever, dry cough, and tiredness.",
    expectedCompletion: "Symptoms of COVID-19 include fever, dry cough, and tiredness among others.",
    contexts: ["User query: 'What are the symptoms of COVID-19?'", "Retrieved document: 'The most common symptoms of COVID-19 are fever, dry cough, and tiredness.'"],
    groundTruth: "The retrieval system should return documents that list the symptoms of COVID-19."
  }
];


const result = await compositeEvaluator({ data });
console.log(result.scoreResults);

Contributing

Contributions are welcome! Please submit a pull request or open an issue to propose changes or additions.