vitest-evals
Evaluate LLM outputs using the familiar Vitest testing framework.
Installation
npm install -D vitest-evals
Quick Start
import { describeEval } from "vitest-evals";
describeEval("capital cities", {
data: async () => [
{ input: "What is the capital of France?", expected: "Paris" },
{ input: "What is the capital of Japan?", expected: "Tokyo" },
],
task: async (input) => {
const response = await queryLLM(input);
return response; // Simple string return
},
scorers: [
async ({ output, expected }) => ({
score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
}),
],
threshold: 0.8,
});
Tasks
Tasks process inputs and return outputs. Two formats are supported:
// Simple: just return a string
const task = async (input) => "response";
// With tool tracking: return a TaskResult
const task = async (input) => ({
result: "response",
toolCalls: [
{ name: "search", arguments: { query: "..." }, result: {...} }
]
});
Scorers
Scorers evaluate outputs and return a score (0-1). Use built-in scorers or create your own:
// Built-in scorer
import { ToolCallScorer } from "vitest-evals";
// Or import individually
import { ToolCallScorer } from "vitest-evals/scorers/toolCallScorer";
describeEval("tool usage", {
data: async () => [
{ input: "Search weather", expectedTools: [{ name: "weather_api" }] },
],
task: weatherTask,
scorers: [ToolCallScorer()],
});
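// For context: `weatherTask` above is a hypothetical task that returns a
// TaskResult (see the Tasks section), so ToolCallScorer can inspect the
// recorded tool calls. A minimal sketch, assuming a placeholder
// `fetchWeatherAnswer` helper that returns the model text plus the tools it
// invoked; neither name is part of vitest-evals.
const weatherTask = async (input) => {
  const { text, toolCalls } = await fetchWeatherAnswer(input); // placeholder LLM call
  return {
    result: text,
    toolCalls: toolCalls.map((call) => ({
      name: call.name,           // e.g. "weather_api"
      arguments: call.arguments,
    })),
  };
};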
// Custom scorer
const LengthScorer = async ({ output }) => ({
score: output.length > 50 ? 1.0 : 0.0,
});
// TypeScript scorer with custom options
import { type ScoreFn, type BaseScorerOptions } from "vitest-evals";
interface CustomOptions extends BaseScorerOptions {
minLength: number;
}
const TypedScorer: ScoreFn<CustomOptions> = async (opts) => ({
score: opts.output.length >= opts.minLength ? 1.0 : 0.0,
});
Built-in Scorers
ToolCallScorer
Evaluates whether the expected tools were called with the correct arguments.
// Basic usage - strict matching, any order
describeEval("search test", {
data: async () => [
{
input: "Find Italian restaurants",
expectedTools: [
{ name: "search", arguments: { type: "restaurant" } },
{ name: "filter", arguments: { cuisine: "italian" } },
],
},
],
task: myTask,
scorers: [ToolCallScorer()],
});
// Strict evaluation - exact order and parameters
scorers: [
ToolCallScorer({
ordered: true, // Tools must be in exact order
params: "strict", // Parameters must match exactly
}),
];
// Flexible evaluation
scorers: [
ToolCallScorer({
requireAll: false, // Partial matches give partial credit
allowExtras: false, // No additional tools allowed
}),
];
Default behavior:
- Strict parameter matching (exact equality required)
- Any order allowed
- Extra tools allowed
- All expected tools required
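For reference, the defaults above correspond roughly to spelling every option out. This is a sketch using only the options documented here, not necessarily the library's literal internal defaults:
scorers: [
  ToolCallScorer({
    ordered: false,     // any order allowed
    params: "strict",   // exact parameter equality required
    requireAll: true,   // all expected tools must be called
    allowExtras: true,  // extra tool calls are tolerated
  }),
];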
AI SDK Integration
See src/ai-sdk-integration.test.ts for a complete example with the Vercel AI SDK.
Transform provider responses to our format:
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const task = async (input) => {
  const { text, steps } = await generateText({
    model: openai("gpt-4o"),
    prompt: input,
    tools: { myTool: myToolDefinition }, // your AI SDK tool definition
  });
  return {
    result: text,
    toolCalls: steps
      .flatMap((step) => step.toolCalls)
      .map((call) => ({
        name: call.toolName,
        arguments: call.args,
      })),
  };
};
Advanced Usage
Advanced Scorers
Using autoevals
For more sophisticated evaluation, use scorers from the autoevals package:
import { Factuality, ClosedQA } from "autoevals";
scorers: [
Factuality, // LLM-based factuality checking
ClosedQA.partial({
criteria: "Does the answer mention Paris?",
}),
];
Custom LLM-based Factuality Scorer
Here's an example of implementing your own LLM-based factuality scorer using the Vercel AI SDK:
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
const Factuality = (model = openai("gpt-4o")) => async ({ input, output, expected }) => {
if (!expected) {
return { score: 1.0, metadata: { rationale: "No expected answer" } };
}
const { object } = await generateObject({
model,
prompt: `
Compare the factual content of the submitted answer with the expert answer.
Question: ${input}
Expert: ${expected}
Submission: ${output}
Options:
(A) Subset of expert answer
(B) Superset of expert answer
(C) Same content as expert
(D) Contradicts expert answer
(E) Different but factually equivalent
`,
schema: z.object({
answer: z.enum(["A", "B", "C", "D", "E"]),
rationale: z.string(),
}),
});
const scores = { A: 0.4, B: 0.6, C: 1, D: 0, E: 1 };
return {
score: scores[object.answer],
metadata: { rationale: object.rationale, answer: object.answer },
};
};
// Usage
scorers: [Factuality()];
Skip Tests Conditionally
describeEval("gpt-4 tests", {
skipIf: () => !process.env.OPENAI_API_KEY,
// ...
});
Existing Test Suites
For integration with existing Vitest test suites, you can use the .toEval() matcher:
⚠️ Deprecated: The .toEval() helper is deprecated. Use describeEval() instead for better test organization and support for multiple scorers. We may consider bringing back a similar check, but it's currently too limited for many scorer implementations.
import "vitest-evals";
test("capital check", () => {
const simpleFactuality = async ({ output, expected }) => ({
score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
});
expect("What is the capital of France?").toEval(
"Paris",
answerQuestion,
simpleFactuality,
0.8
);
});
Recommended migration to describeEval():
import { describeEval } from "vitest-evals";
describeEval("capital check", {
data: async () => [
{ input: "What is the capital of France?", expected: "Paris" },
],
task: answerQuestion,
scorers: [
async ({ output, expected }) => ({
score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
}),
],
threshold: 0.8,
});
Configuration
Separate Eval Configuration
Create vitest.evals.config.ts:
import { defineConfig } from "vitest/config";
import defaultConfig from "./vitest.config";
export default defineConfig({
...defaultConfig,
test: {
...defaultConfig.test,
include: ["src/**/*.eval.{js,ts}"],
},
});
Run evals separately:
vitest --config=vitest.evals.config.ts
Development
npm install
npm test