vitest-evals
v0.6.0
End-to-end evaluation framework for AI agents, built on Vitest.
Installation
npm install -D vitest-evals
Quick Start
import { describeEval } from "vitest-evals";
describeEval("deploy agent", {
data: async () => [
{ input: "Deploy the latest release to production", expected: "deployed" },
{ input: "Roll back the last deploy", expected: "rolled back" },
],
task: async (input) => {
const response = await myAgent.run(input);
return response;
},
scorers: [
async ({ output, expected }) => ({
score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
}),
],
threshold: 0.8,
});
Tasks
Tasks process inputs and return outputs. Two formats are supported:
// Simple: just return a string
const task = async (input) => "response";
// With tool tracking: return a TaskResult
const task = async (input) => ({
result: "response",
toolCalls: [
{ name: "search", arguments: { query: "..." }, result: {...} }
]
});
Test Data
Each test case requires an input field. Use name to give tests a descriptive label:
data: async () => [
{ name: "simple deploy", input: "Deploy to staging" },
{ name: "deploy with rollback", input: "Deploy to prod, roll back if errors" },
],
Additional fields (like expected, expectedTools) are passed through to scorers.
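Those pass-through fields arrive directly on the scorer's options object. As a sketch (the scorer and its expectedTools field are hypothetical, not part of the library):

```typescript
// Sketch: a custom scorer reading a pass-through field from the test data.
// `expectedTools` is just a field your data() function supplied; the name
// and shape are up to you.
type ToolRef = { name: string };

const ToolMentionScorer = async (opts: {
  output: string;
  expectedTools?: ToolRef[];
}) => {
  const expected = opts.expectedTools ?? [];
  if (expected.length === 0) return { score: 1 };
  // Score = fraction of expected tool names mentioned in the output text.
  const hits = expected.filter((t) => opts.output.includes(t.name)).length;
  return { score: hits / expected.length };
};
```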
Lifecycle Hooks
Use beforeEach and afterEach for setup and teardown:
describeEval("agent with database", {
beforeEach: async () => {
await db.seed();
},
afterEach: async () => {
await db.clean();
},
data: async () => [{ input: "Find recent errors" }],
task: myAgentTask,
scorers: [async ({ output }) => ({ score: output.includes("error") ? 1.0 : 0.0 })],
});
Scorers
Scorers evaluate outputs and return a score (0-1). Use built-in scorers or create your own.
ToolCallScorer
Evaluates whether the expected tools were called with the correct arguments.
import { ToolCallScorer } from "vitest-evals";
describeEval("tool usage", {
data: async () => [
{
input: "Find Italian restaurants",
expectedTools: [
{ name: "search", arguments: { type: "restaurant" } },
{ name: "filter", arguments: { cuisine: "italian" } },
],
},
],
task: myTask,
scorers: [ToolCallScorer()],
});
// Strict order and parameters
scorers: [ToolCallScorer({ ordered: true, params: "strict" })];
// Partial matching: not all expected tools required, no extra calls allowed
scorers: [ToolCallScorer({ requireAll: false, allowExtras: false })];
Default behavior:
- Strict parameter matching (exact equality required)
- Any order allowed
- Extra tools allowed
- All expected tools required
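"Strict parameter matching" can be pictured as deep equality over the expected arguments. The helper below is an illustration of the concept only, not vitest-evals' actual matcher:

```typescript
// Illustration: strict matching treats arguments as equal only when every
// key and nested value matches exactly (deep structural equality).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null)
    return false;
  const ka = Object.keys(a as object);
  const kb = Object.keys(b as object);
  if (ka.length !== kb.length) return false;
  return ka.every((k) =>
    deepEqual(
      (a as Record<string, unknown>)[k],
      (b as Record<string, unknown>)[k]
    )
  );
}

// Under strict matching, an extra or missing argument fails the comparison:
deepEqual({ type: "restaurant" }, { type: "restaurant" }); // true
deepEqual({ type: "restaurant" }, { type: "restaurant", limit: 5 }); // false
```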
StructuredOutputScorer
Evaluates whether the output matches the expected structured data (JSON).
import { StructuredOutputScorer } from "vitest-evals";
describeEval("query generation", {
data: async () => [
{
input: "Show me errors from today",
expected: {
dataset: "errors",
query: "",
sort: "-timestamp",
timeRange: { statsPeriod: "24h" },
},
},
],
task: myTask,
scorers: [StructuredOutputScorer()],
});
// Fuzzy matching
scorers: [StructuredOutputScorer({ match: "fuzzy" })];
// Custom validation
scorers: [
StructuredOutputScorer({
match: (expected, actual, key) => {
if (key === "age") return actual >= 18 && actual <= 100;
return expected === actual;
},
}),
];
Custom Scorers
// Inline scorer
const LengthScorer = async ({ output }) => ({
score: output.length > 50 ? 1.0 : 0.0,
});
// TypeScript scorer with custom options
import { type ScoreFn, type BaseScorerOptions } from "vitest-evals";
interface CustomOptions extends BaseScorerOptions {
minLength: number;
}
const TypedScorer: ScoreFn<CustomOptions> = async (opts) => ({
score: opts.output.length >= opts.minLength ? 1.0 : 0.0,
});
AI SDK Integration
See src/ai-sdk-integration.test.ts for a complete example with the Vercel AI SDK.
Transform provider responses to our format:
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const task = async (input) => {
  const { text, steps } = await generateText({
    model: openai("gpt-4o"),
    prompt: input,
    tools: { myTool: myToolDefinition }, // myToolDefinition: your tool schema
  });
  return {
    result: text,
    toolCalls: steps
      .flatMap((step) => step.toolCalls)
      .map((call) => ({
        name: call.toolName,
        arguments: call.args,
      })),
  };
};
Advanced Usage
Using autoevals
For evaluation using the autoevals library:
import { Factuality, ClosedQA } from "autoevals";
scorers: [
Factuality,
ClosedQA.partial({
criteria: "Does the answer mention Paris?",
}),
];
Skip Tests Conditionally
describeEval("gpt-4 tests", {
skipIf: () => !process.env.OPENAI_API_KEY,
// ...
});
Existing Test Suites
For integration with existing Vitest test suites, you can use the .toEval() matcher:
Deprecated: The .toEval() helper is deprecated. Use describeEval() instead for better test organization and support for multiple scorers.
import "vitest-evals";
test("capital check", () => {
const simpleFactuality = async ({ output, expected }) => ({
score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
});
expect("What is the capital of France?").toEval(
"Paris",
answerQuestion,
simpleFactuality,
0.8
);
});
Configuration
Separate Eval Configuration
Create vitest.evals.config.ts:
import { defineConfig } from "vitest/config";
import defaultConfig from "./vitest.config";
export default defineConfig({
...defaultConfig,
test: {
...defaultConfig.test,
include: ["src/**/*.eval.{js,ts}"],
},
});
Run evals separately:
vitest --config=vitest.evals.config.ts
Development
pnpm install
pnpm test