@ai-sdk-tool/eval (v1.1.0)
Benchmarking and evaluation tools for AI SDK
AI SDK - evaluation tool
This package provides a standardized, extensible, and reproducible way to benchmark and evaluate the performance of Language Models (LanguageModel instances) within the Vercel AI SDK ecosystem.
It allows developers to:
- Compare different models (e.g., Gemma, Llama, GPT) under the same conditions.
- Quantify the impact of model updates or configuration changes.
- Create custom benchmarks tailored to specific use cases (e.g., 'Korean proficiency', 'code generation').
- Automate the evaluation process across a matrix of models and configurations.
Core Concepts
- Benchmark (`LanguageModelV3Benchmark`): A standardized interface for creating an evaluation task. It has a `run` method that takes a `LanguageModel` and returns a `BenchmarkResult`.
- `evaluate` function: The core function that runs a set of benchmarks against one or more models and provides a report on the results.
- Reporter: Formats the evaluation results into different outputs, such as a human-readable console report or a machine-readable JSON object.
Installation
pnpm add @ai-sdk-tool/eval

Quick Start
Here's how to evaluate two different models against the built-in Berkeley Function-Calling Leaderboard (BFCL) benchmark for simple function calls.
import { evaluate, bfclSimpleBenchmark } from "@ai-sdk-tool/eval";
import { openrouter } from "@openrouter/ai-sdk-provider";

// 1. Define the models you want to evaluate
// 2. Run the evaluation
async function runMyEvaluation() {
  console.log("Starting model evaluation...");

  const results = await evaluate({
    models: [/* your models here */],
    benchmarks: [bfclSimpleBenchmark], // Use a built-in benchmark
    reporter: "console", // 'console' or 'json'
  });

  console.log("Evaluation complete!");
  // The console reporter will have already printed a detailed report.
}

runMyEvaluation();

Run the example from this repo:
cd examples/eval-core && pnpm dlx tsx src/bfcl-simple.ts

Built-in Benchmarks
This package includes several pre-built benchmarks.
- `bfclSimpleBenchmark`: Evaluates simple, single function calls.
- `bfclParallelBenchmark`: Evaluates parallel (multi-tool) function calls.
- `bfclMultipleBenchmark`: Evaluates multiple calls to the same function.
- `bfclParallelMultipleBenchmark`: A combination of parallel and multiple function calls.
- `bfclMultiTurnBaseBenchmark`: Evaluates BFCL v4 multi-turn base cases.
- `bfclMultiTurnLongContextBenchmark`: Evaluates BFCL v4 multi-turn long-context cases.
- `bfclMultiTurnMissFuncBenchmark`: Evaluates BFCL v4 multi-turn missing-function cases.
- `bfclMultiTurnMissParamBenchmark`: Evaluates BFCL v4 multi-turn missing-parameter cases.
- `jsonGenerationBenchmark`: Evaluates the model's ability to generate schema-compliant JSON.
Note: Multi-turn benchmarks now execute tool calls with a native TypeScript implementation and do not require Python at runtime.
BFCL evaluation data will be downloaded automatically on first run. For manual download, visit the BFCL repository.
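To give a sense of what a schema-compliance check like `jsonGenerationBenchmark` involves, here is an illustrative sketch. The scoring rule (fraction of required top-level keys present) is an assumption for demonstration, not the package's actual logic:

```typescript
// Hypothetical scoring helper: parse a model's text output and check
// for required top-level keys. Not the package's real implementation.
function scoreJsonOutput(
  text: string,
  requiredKeys: string[],
): { success: boolean; score: number } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return { success: false, score: 0 }; // not valid JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { success: false, score: 0 }; // a bare string/number is not a schema match
  }
  // Partial credit: fraction of required keys that are present.
  const present = requiredKeys.filter((k) => k in (parsed as object));
  const score =
    requiredKeys.length === 0 ? 1 : present.length / requiredKeys.length;
  return { success: score === 1, score };
}
```

A real benchmark would typically validate against a full JSON Schema (types, nesting, enums) rather than just key presence.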
To try a JSON generation run locally:
cd examples/eval-core && pnpm dlx tsx src/json-generation.ts

Creating a Custom Benchmark
You can easily create your own benchmark by implementing the LanguageModelV3Benchmark interface. This is useful for testing model performance on tasks specific to your application.
Example: A custom benchmark to test politeness.
import {
  LanguageModelV3Benchmark,
  BenchmarkResult,
} from "@ai-sdk-tool/eval";
import { LanguageModel, generateText } from "ai";

// Define the benchmark object
export const politenessBenchmark: LanguageModelV3Benchmark = {
  name: "politeness-check",
  version: "1.0.0",
  description: "Checks if the model's response is polite.",

  async run(model: LanguageModel): Promise<BenchmarkResult> {
    const { text } = await generateText({
      model,
      prompt:
        "A customer is angry because their order is late. Write a response.",
    });

    // Naive heuristic for illustration; a real benchmark would use a
    // more robust rubric or a judge model.
    const isPolite = !text.toLowerCase().includes("sorry, but");
    const score = isPolite ? 1 : 0;

    return {
      score,
      success: isPolite,
      metrics: {
        length: text.length,
      },
      logs: [`Response: "${text}"`],
    };
  },
};

// You can then use it in the evaluate function:
// await evaluate({
//   models: [myModel],
//   benchmarks: [politenessBenchmark],
// });

License
Licensed under Apache License 2.0. See the repository LICENSE. Include the NOTICE file in distributions.
