@ai-sdk-tool/eval (v1.1.0)
Benchmarking and evaluation tools for AI SDK
AI SDK - evaluation tool
This package provides a standardized, extensible, and reproducible way to benchmark and evaluate the performance of Language Models (LanguageModel instances) within the Vercel AI SDK ecosystem.
It allows developers to:
- Compare different models (e.g., Gemma, Llama, GPT) under the same conditions.
- Quantify the impact of model updates or configuration changes.
- Create custom benchmarks tailored to specific use cases (e.g., 'Korean proficiency', 'code generation').
- Automate the evaluation process across a matrix of models and configurations.
Core Concepts
- Benchmark (`LanguageModelV3Benchmark`): A standardized interface for creating an evaluation task. It has a `run` method that takes a `LanguageModel` and returns a `BenchmarkResult`.
- `evaluate` function: The core function that runs a set of benchmarks against one or more models and provides a report on the results.
- Reporter: Formats the evaluation results into different outputs, such as a human-readable console report or a machine-readable JSON object.
Installation
pnpm add @ai-sdk-tool/eval

Quick Start
Here's how to evaluate two different models against the built-in Berkeley Function-Calling Leaderboard (BFCL) benchmark for simple function calls.
import { evaluate, bfclSimpleBenchmark } from "@ai-sdk-tool/eval";
import { openrouter } from "@openrouter/ai-sdk-provider";

// 1. Define the models you want to evaluate
// 2. Run the evaluation
async function runMyEvaluation() {
  console.log("Starting model evaluation...");

  const results = await evaluate({
    models: [/* your models here */],
    benchmarks: [bfclSimpleBenchmark], // Use a built-in benchmark
    reporter: "console", // 'console' or 'json'
  });

  console.log("Evaluation complete!");
  // The console reporter will have already printed a detailed report.
}

runMyEvaluation();

Run the example from this repo:
cd examples/eval-core && pnpm dlx tsx src/bfcl-simple.ts

Built-in Benchmarks
This package includes several pre-built benchmarks.
- `bfclSimpleBenchmark`: Evaluates simple, single function calls.
- `bfclParallelBenchmark`: Evaluates parallel (multi-tool) function calls.
- `bfclMultipleBenchmark`: Evaluates multiple calls to the same function.
- `bfclParallelMultipleBenchmark`: A combination of parallel and multiple function calls.
- `bfclMultiTurnBaseBenchmark`: Evaluates BFCL v4 multi-turn base cases.
- `bfclMultiTurnLongContextBenchmark`: Evaluates BFCL v4 multi-turn long-context cases.
- `bfclMultiTurnMissFuncBenchmark`: Evaluates BFCL v4 multi-turn missing-function cases.
- `bfclMultiTurnMissParamBenchmark`: Evaluates BFCL v4 multi-turn missing-parameter cases.
- `jsonGenerationBenchmark`: Evaluates the model's ability to generate schema-compliant JSON.
Note: Multi-turn benchmarks now execute tool calls with a native TypeScript implementation and do not require Python at runtime.
BFCL evaluation data will be downloaded automatically on first run. For manual download, visit the BFCL repository.
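To give a sense of what a schema-compliance check like `jsonGenerationBenchmark` involves, here is an illustrative sketch. The scoring rule (fraction of required top-level keys present) is an assumption for demonstration, not the package's actual logic:

```typescript
// Hypothetical scoring helper: parse a model's text output and check
// for required top-level keys. Not the package's real implementation.
function scoreJsonOutput(
  text: string,
  requiredKeys: string[],
): { success: boolean; score: number } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return { success: false, score: 0 }; // not valid JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { success: false, score: 0 }; // a bare string/number is not a schema match
  }
  // Partial credit: fraction of required keys that are present.
  const present = requiredKeys.filter((k) => k in (parsed as object));
  const score =
    requiredKeys.length === 0 ? 1 : present.length / requiredKeys.length;
  return { success: score === 1, score };
}
```

A real benchmark would typically validate against a full JSON Schema (types, nesting, enums) rather than just key presence.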
To try a JSON generation run locally:
cd examples/eval-core && pnpm dlx tsx src/json-generation.ts

Creating a Custom Benchmark
You can easily create your own benchmark by implementing the LanguageModelV3Benchmark interface. This is useful for testing model performance on tasks specific to your application.
Example: A custom benchmark to test politeness.
import {
  LanguageModelV3Benchmark,
  BenchmarkResult,
} from "@ai-sdk-tool/eval";
import { LanguageModel, generateText } from "ai";

// Define the benchmark object
export const politenessBenchmark: LanguageModelV3Benchmark = {
  name: "politeness-check",
  version: "1.0.0",
  description: "Checks if the model's response is polite.",

  async run(model: LanguageModel): Promise<BenchmarkResult> {
    const { text } = await generateText({
      model,
      prompt:
        "A customer is angry because their order is late. Write a response.",
    });

    // Naive heuristic for illustration; a real benchmark would use a
    // more robust rubric or a judge model.
    const isPolite = !text.toLowerCase().includes("sorry, but");
    const score = isPolite ? 1 : 0;

    return {
      score,
      success: isPolite,
      metrics: {
        length: text.length,
      },
      logs: [`Response: "${text}"`],
    };
  },
};

// You can then use it in the evaluate function:
// await evaluate({
//   models: [myModel],
//   benchmarks: [politenessBenchmark],
// });

License
Licensed under Apache License 2.0. See the repository LICENSE. Include the NOTICE file in distributions.
