vitest-evals

v0.13.1

Harness-backed AI testing on top of Vitest.

Downloads: 78,264

Use this package README for the core authoring model. For a guided setup path, runtime-specific harness examples, replay, and GitHub Actions reporting, start with the docs site: https://vitest-evals.sentry.dev/docs.

Install

npm install -D vitest-evals

Install a first-party harness package for the runtime you want to test:

npm install -D @vitest-evals/harness-pi-ai
# or
npm install -D @vitest-evals/harness-ai-sdk
# or
npm install -D @vitest-evals/harness-openai-agents

For GitHub Actions summaries and annotations, emit Vitest JSON and use the native getsentry/vitest-evals action. No extra npm package is needed in the workflow.

Core Model

  • describeEval(...) binds exactly one harness to a suite
  • the suite callback receives a fixture-backed Vitest it
  • run(input) executes the harness explicitly and returns a normalized HarnessRun
  • the returned result.output is the app-facing value you assert on directly
  • helper assertions usually read the returned result, for example toolCalls(result) or spansByKind(result, "tool")
  • result.session is the canonical JSON-serializable transcript for reporting, replay, tool assertions, and judges
  • result.traces contains JSON-serializable operation spans; the first-party harnesses attach run, model, and tool spans automatically, while createHarness(...) attaches fallback run and tool spans for custom harnesses that do not return traces themselves. Span attributes include typed OpenTelemetry GenAI semantic keys while still allowing provider-specific metadata
  • scenario-specific judge criteria should live in input or explicit matcher options, depending on whether the app or only the judge needs them
  • suite-level judges are optional and run automatically after each run(...)
  • suite-level judgeThreshold controls fail-on-score for those automatic judges
  • every judge is a named object with assess(ctx)
  • every judge receives JudgeContext with typed input, typed output, the normalized run/session, and tool calls; output is only optional when the harness output type includes undefined
  • judges own their prompt, rubric, and parsing; LLM-backed judges use ctx.runJudge(...) from a configured judgeHarness
  • explicit judge assertions use await expect(result).toSatisfyJudge(judge, options)
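
To make the judge-related bullets concrete, here is a minimal, self-contained sketch of a named judge with assess(ctx) and a threshold check. The types SketchJudge, SketchJudgeContext, and SketchJudgeResult and the helpers scoreExpectedTools and passesThreshold are local stand-ins invented for illustration; they are not the types exported by vitest-evals. Treating a score equal to the threshold as a pass is also an assumption.

```typescript
// Local stand-in types for illustration only; not the vitest-evals exports.
type SketchJudgeResult = { score: number; metadata?: Record<string, unknown> };

type SketchJudgeContext<Input, Output> = {
  input: Input;
  output: Output;
  toolCalls: Array<{ name: string }>;
};

type SketchJudge<Input, Output> = {
  name: string;
  assess: (ctx: SketchJudgeContext<Input, Output>) => Promise<SketchJudgeResult>;
};

// Deterministic scoring: the fraction of expected tools that were called.
function scoreExpectedTools(toolCalls: Array<{ name: string }>): SketchJudgeResult {
  const expected = ["lookupInvoice", "createRefund"];
  const called = new Set(toolCalls.map((call) => call.name));
  const hits = expected.filter((name) => called.has(name)).length;
  return { score: hits / expected.length, metadata: { hits } };
}

// A judge is a named object whose assess(ctx) returns a score.
const expectedToolsJudge: SketchJudge<string, { status: string }> = {
  name: "ExpectedToolsJudge",
  assess: async (ctx) => scoreExpectedTools(ctx.toolCalls),
};

// Assumption: a score at or above the threshold counts as a pass.
function passesThreshold(result: SketchJudgeResult, threshold: number): boolean {
  return result.score >= threshold;
}
```

Under these assumptions, a run that only called lookupInvoice scores 0.5 and would fail a threshold of 0.6.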

Explicit Run Example

import { getModel } from "@mariozechner/pi-ai";
import { expect } from "vitest";
import { piAiHarness, piAiJudgeHarness } from "@vitest-evals/harness-pi-ai";
import {
  describeEval,
  FactualityJudge,
  toolCalls,
} from "vitest-evals";
import { createRefundAgent } from "../src/refundAgent";

const judgeHarness = piAiJudgeHarness({
  model: getModel("anthropic", "claude-sonnet-4-5"),
  temperature: 0,
});

describeEval(
  "refund agent",
  {
    harness: piAiHarness({
      agent: () => createRefundAgent(),
    }),
    judgeHarness,
    judges: [
      FactualityJudge({
        expected: "The refund request is approved.",
      }),
    ],
    judgeThreshold: 0.6,
  },
  (it) => {
    it("approves a refundable invoice", async ({ run }) => {
      const result = await run("Refund invoice inv_123");

      expect(result.output).toMatchObject({ status: "approved" });
      expect(toolCalls(result).map((call) => call.name)).toEqual([
        "lookupInvoice",
        "createRefund",
      ]);
    });
  },
);

Table-Driven Vitest Style

If you want case tables, use Vitest's own it.for(...) and call run(...) inside the test body:

// `harness` is any configured harness, such as the piAiHarness(...) shown earlier.
describeEval("refund agent", { harness }, (it) => {
  it.for([
    {
      name: "approves refundable invoice",
      input: "Refund invoice inv_123",
      expectedStatus: "approved",
    },
    {
      name: "denies non-refundable invoice",
      input: "Refund invoice inv_404",
      expectedStatus: "denied",
    },
  ])("$name", async ({ input, expectedStatus }, { run }) => {
    const result = await run(input);

    expect(result.output).toMatchObject({
      status: expectedStatus,
    });
  });
});

Terminal Reporting

The terminal reporter has two eval report levels. Normal mode prints compact test, score, usage, and tool-count summaries. Info mode adds per-tool summaries, arguments, timing/size metadata, replay status, and final output summaries. Set VITEST_EVALS_REPORT_LEVEL=info, or pass --info through the workspace eval scripts, to enable it. --verbose and -v remain aliases for compatibility.

Full transcripts and spans are preserved in the Vitest JSON report metadata.
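
The level selection described above can be pictured with a small sketch. The function resolveReportLevel is hypothetical and is not the reporter's actual implementation; in particular, how the real scripts combine the flags with the environment variable is an assumption here.

```typescript
type ReportLevel = "normal" | "info";

// Hypothetical resolver: any of the documented flags, or the documented
// environment variable set to "info", selects info mode.
function resolveReportLevel(
  env: Record<string, string | undefined>,
  argv: string[],
): ReportLevel {
  // --verbose and -v are treated as aliases of --info.
  const infoFlags = new Set(["--info", "--verbose", "-v"]);
  if (argv.some((arg) => infoFlags.has(arg))) return "info";
  return env["VITEST_EVALS_REPORT_LEVEL"] === "info" ? "info" : "normal";
}
```

Passing the environment and argument list as parameters keeps the sketch independent of any particular runtime.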

Local Report UI

The local report UI reads the same Vitest JSON artifacts and serves a React SPA for drilling into runs, eval cases, harness output, sessions, tool calls, scores, and trace spans.

pnpm exec vitest-evals serve vitest-results.json
pnpm exec vitest-evals serve "eval-results/*.json"
pnpm exec vitest-evals serve eval-results/

GitHub Actions Reporting

Use Vitest JSON as the eval report artifact. It preserves the meta field that contains eval scores and normalized harness runs.

vitest run --config vitest.evals.config.ts \
  --reporter=vitest-evals/reporter \
  --reporter=json \
  --outputFile.json=vitest-results.json

- uses: getsentry/vitest-evals@v0
  if: always()
  with:
    results: vitest-results.json

The GitHub reporter action writes a job summary, emits short failure annotations, can publish a separate Check Run, and can reduce sharded eval JSON artifacts into one combined report.
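
If you want to inspect the meta field yourself, a sketch along these lines may help. The JsonReport type is a hand-written subset of the Vitest JSON report shape, and collectCaseMeta is a hypothetical helper. The keys that vitest-evals writes inside meta are not described in this README, so the sketch treats meta as opaque.

```typescript
// Hand-written subset of the Vitest JSON report; only the fields used here.
type JsonAssertion = {
  fullName: string;
  status: string;
  meta?: Record<string, unknown>;
};

type JsonReport = {
  testResults: Array<{ name: string; assertionResults: JsonAssertion[] }>;
};

type CaseMeta = {
  file: string;
  test: string;
  status: string;
  meta: Record<string, unknown>;
};

// Collect every test that carries non-empty metadata, without assuming its keys.
function collectCaseMeta(report: JsonReport): CaseMeta[] {
  const cases: CaseMeta[] = [];
  for (const file of report.testResults) {
    for (const assertion of file.assertionResults) {
      const meta = assertion.meta ?? {};
      if (Object.keys(meta).length > 0) {
        cases.push({
          file: file.name,
          test: assertion.fullName,
          status: assertion.status,
          meta,
        });
      }
    }
  }
  return cases;
}
```

You would feed it the parsed contents of vitest-results.json; tests without metadata are skipped.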

Existing Agents

For an existing agent, the intended contract is:

  • pass the agent instance or per-test factory through the harness
  • optionally pass run when the app entrypoint is not run(input, runtime)
  • let the harness infer native tools from the existing agent by default
  • only pass an explicit tools override when the agent hides its tool surface

The harness owns normalization, diagnostics, tool capture, replay plumbing, and reporter-facing artifacts. Your app just needs one runtime seam where those wrapped pieces can be injected.

Replay opt-in belongs on the harness, via toolReplay, while replay mode and recording directory can live in Vitest environment config. Tool definitions should stay free of record/replay (VCR) policy.
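
The idea of keeping record/replay policy out of tool definitions can be illustrated with a generic wrapper. This is a sketch of the general technique, not the toolReplay implementation: withReplay, ReplayMode, and the in-memory cassette map are all hypothetical names, and a real setup would persist recordings to a directory.

```typescript
type ReplayMode = "record" | "replay";

// Wrap a plain tool function so recording and replaying happen outside it.
// The cassette is an in-memory stand-in for a recording directory.
function withReplay<Args, Result>(
  name: string,
  tool: (args: Args) => Result,
  mode: ReplayMode,
  cassette: Map<string, Result>,
): (args: Args) => Result {
  return (args) => {
    // Key by tool name plus serialized arguments (property order matters).
    const key = `${name}:${JSON.stringify(args)}`;
    if (mode === "replay") {
      if (!cassette.has(key)) throw new Error(`No recording for ${key}`);
      return cassette.get(key) as Result;
    }
    // Record mode: call the live tool and store its result.
    const result = tool(args);
    cassette.set(key, result);
    return result;
  };
}
```

The tool itself never learns which mode is active, which is the separation the paragraph above asks for.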

For the Pi-specific harness, output/session/usage normalization should usually be inferred automatically. Treat low-level normalization callbacks as an escape hatch, not part of the primary authoring path.

For OpenAI Agents SDK apps, use @vitest-evals/harness-openai-agents with an existing Agent or an agent factory and a Runner or runner factory. The harness calls Runner.run(agent, input, options) by default and exposes the same normalization and replay hooks when the app needs a custom entrypoint or structured domain output mapping.

Custom App Harnesses

First-party harness packages are conveniences, not the only supported path. If you need to test a full application flow, use createHarness(...) to run your app through its normal entrypoint and return the app-facing output. Judges own their prompt/rubric text separately from the system under test. When generics are needed, use createHarness<Input, Output>(...).

import {
  createHarness,
  createJudge,
  createJudgeHarness,
  describeEval,
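} from "vitest-evals";

// replayAppEvents, promptJudgeModel, formatRubricPrompt, and parseRubricVerdict
// are app-specific helpers, assumed to be defined elsewhere in your project.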

type AppEvent = {
  type: string;
  payload: Record<string, string>;
};

type AppEvalInput = {
  events: AppEvent[];
  criteria: {
    contract: string;
    pass: string[];
    fail?: string[];
  };
};

type AppOutput = {
  replies: Array<{ text: string }>;
  sideEffects: string[];
};

const appHarness = createHarness<AppEvalInput, AppOutput>({
  name: "custom-app",
  run: async ({ input, signal }) => {
    const result = await replayAppEvents(input.events, {
      signal,
    });

    return {
      output: {
        replies: result.replies,
        sideEffects: result.sideEffects,
      },
      artifacts: {
        replyCount: result.replies.length,
      },
      usage: {},
    };
  },
});

const judgeHarness = createJudgeHarness({
  name: "app-rubric-judge-model",
  run: async ({ prompt }, { signal }) =>
    promptJudgeModel({ prompt, signal }),
});

const AppRubricJudge = createJudge<AppEvalInput, AppOutput>(
  "AppRubricJudge",
  async (ctx) => {
    if (!ctx.runJudge) {
      throw new Error("AppRubricJudge requires a configured judgeHarness.");
    }

    const verdict = await ctx.runJudge({
      prompt: formatRubricPrompt({
        output: ctx.output,
        criteria: ctx.input.criteria,
      }),
      responseFormat: { type: "json" },
    });

    return parseRubricVerdict(verdict);
  },
);

describeEval(
  "app behavior",
  {
    harness: appHarness,
    judgeHarness,
    judges: [AppRubricJudge],
    judgeThreshold: 0.75,
  },
  (it) => {
    it("handles an event flow", async ({ run }) => {
      await run({
        events: [
          {
            type: "message.created",
            payload: {
              text: "Summarize the current incident.",
            },
          },
        ],
        criteria: {
          contract: "The app posts one user-visible incident summary.",
          pass: ["The reply names the incident status."],
          fail: ["The reply exposes internal metadata."],
        },
      });
    });
  },
);

Use Harness.run(...) for the application under test. Calling ctx.harness.run(...) from inside a judge runs the application a second time, so reserve that for judges that intentionally need a second execution.

Put criteria on input when they are part of the scenario itself; pass case-specific judge criteria through matcher options, or configure suite-wide criteria on the judge instance.

createHarness(...) builds a default user/assistant session from input and typed output; return a full HarnessRun only when you need exact session control.
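
As a rough picture of what a default user/assistant session might look like, consider the following sketch. SketchMessage and defaultSession are hypothetical; the real session schema produced by createHarness(...) is not shown in this README and may carry more fields.

```typescript
// Local stand-in message type; the real session schema may differ.
type SketchMessage = { role: "user" | "assistant"; content: string };

// Strings pass through unchanged; other values are serialized as JSON.
function toText(value: unknown): string {
  if (typeof value === "string") return value;
  return value === undefined ? "" : JSON.stringify(value);
}

// Build a two-message transcript from the input and the app-facing output.
function defaultSession(input: unknown, output: unknown): SketchMessage[] {
  return [
    { role: "user", content: toText(input) },
    { role: "assistant", content: toText(output) },
  ];
}
```

Anything richer than this two-message shape, such as tool calls or intermediate turns, is the case where returning a full HarnessRun becomes worthwhile.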

Provider setup and rubric parsing stay in your judge. The core package only requires the judge to return a JudgeResult with a score and optional metadata.
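
The earlier example leaves parseRubricVerdict undefined, so here is one way such a parser could look. This is a sketch under stated assumptions: the name parseRubricVerdictSketch, the RubricResult type, the verdict shape with score and rationale fields, and the 0 to 1 score scale are all hypothetical choices, not part of the core package.

```typescript
// Local stand-in for a judge result: a score plus optional metadata.
type RubricResult = { score: number; metadata?: Record<string, unknown> };

// Assumed verdict shape: {"score": <number>, "rationale": <string>}.
function parseRubricVerdictSketch(raw: string): RubricResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Unparseable model output is scored as a failure rather than thrown.
    return { score: 0, metadata: { error: "verdict was not valid JSON" } };
  }
  const record = (
    typeof parsed === "object" && parsed !== null ? parsed : {}
  ) as Record<string, unknown>;
  const candidate = record["score"];
  const rawScore =
    typeof candidate === "number" && Number.isFinite(candidate) ? candidate : 0;
  // Clamp so a misbehaving model cannot push the score outside [0, 1].
  const score = Math.min(1, Math.max(0, rawScore));
  return { score, metadata: { rationale: record["rationale"] } };
}
```

Scoring malformed output as 0 instead of throwing is a design choice: it lets the threshold check fail the test with the parse error recorded in metadata.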

Automatic suite-level judges are a good fit when every run(...) should get the same scoring. For cases where only some runs need an LLM judge, keep the suite free of automatic judges and use an explicit matcher:

await expect(result).toSatisfyJudge(AppRubricJudge, {
  threshold: 0.75,
});

Judge Matchers

Use the matcher when a judge should behave like a normal Vitest assertion. In practice, this is usually most useful for factuality, rubric, or grounded answer checks:

import { openai } from "@ai-sdk/openai";
import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
import { expect } from "vitest";
import { FactualityJudge } from "vitest-evals";

const judgeHarness = aiSdkJudgeHarness({
  model: openai("gpt-4.1-mini"),
  temperature: 0,
});
const factualityJudge = FactualityJudge({ judgeHarness });

await expect(result).toSatisfyJudge(factualityJudge, {
  expected: "Paris is the capital of France.",
  threshold: 0.6,
});

For lower-level cases, the matcher also accepts raw values and synthetic judge context. Pass every context field the judge needs when the value did not come from the eval fixture's run(...):

await expect({ status: "approved" }).toSatisfyJudge(MyJudge, {
  input: "Refund invoice inv_123",
});

Use the built-in factuality judge when you want a model-backed factuality grade over the normalized run:

import { openai } from "@ai-sdk/openai";
import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
import { FactualityJudge } from "vitest-evals";

export const judgeHarness = aiSdkJudgeHarness({
  model: openai("gpt-4.1-mini"),
  temperature: 0,
});
export const factualityJudge = FactualityJudge({ judgeHarness });

For custom judge providers, create a dedicated judge harness with the same prompt contract:

import {
  createJudgeHarness,
  FactualityJudge,
  type JudgeHarness,
} from "vitest-evals";
import { callJudgeModel } from "./judgeModel";

export const judgeHarness: JudgeHarness = createJudgeHarness({
  name: "factuality-judge-model",
  run: async ({ system, prompt }, { signal }) =>
    callJudgeModel({ system, prompt, signal }),
});

export const factualityJudge = FactualityJudge({ judgeHarness });

Configure that judge harness once and reuse the same judge with any app harness:

import { describeEval } from "vitest-evals";
import { aiSdkRefundHarness } from "./aiSdkRefundHarness";
import { piRefundHarness } from "./piRefundHarness";
import { factualityJudge } from "./sharedJudges";

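describeEval(
  "ai sdk refund agent",
  { harness: aiSdkRefundHarness, judges: [factualityJudge] },
  (it) => {
    // Eval cases for the AI SDK harness go here.
  },
);

describeEval(
  "pi refund agent",
  { harness: piRefundHarness, judges: [factualityJudge] },
  (it) => {
    // Eval cases for the Pi harness go here.
  },
);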

Use createJudge(...) for custom judges so reporter output gets a stable label. Custom LLM-backed judges should provide their own judge prompt, rubric text, and parser, then call ctx.runJudge(...) for the provider-specific model request.

Bind a reusable default with createJudge({ name, judgeHarness, assess }) or pass judgeHarness on the matcher or suite. Core curries the matcher, judge, or explicit suite judgeHarness into ctx.runJudge(...) with the current run's abort signal. Matcher options win over a judge default, and a judge default wins over the suite default.

Explicit matcher calls can also reuse a single unambiguous judge-level harness from the suite's automatic judges, provided those judges share the same judge harness instance. Automatic judges, however, do not inherit inferred harnesses from sibling judges.

Leave judgeHarness unset for suites that only use deterministic judges. Calling harness.run(...) from a judge executes the application again, so use that only when a second run is intentional.
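
The precedence rule stated above (matcher options, then judge default, then suite default) can be sketched in a few lines. SketchJudgeHarness and resolveJudgeHarness are hypothetical names used only for illustration; the sketch deliberately leaves out the inference from sibling judges.

```typescript
// Local stand-in; a real judge harness carries a run function as well.
type SketchJudgeHarness = { name: string };

// First defined source wins: matcher, then judge default, then suite default.
function resolveJudgeHarness(sources: {
  matcher?: SketchJudgeHarness;
  judge?: SketchJudgeHarness;
  suite?: SketchJudgeHarness;
}): SketchJudgeHarness | undefined {
  return sources.matcher ?? sources.judge ?? sources.suite;
}
```

An undefined result corresponds to a suite that only uses deterministic judges and never configures a judge harness.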

For an EvalHarnessRun returned by fixture run(...), toSatisfyJudge(...) uses the run's typed output and reuses the registered input. It requires any custom judge params and rejects judges whose output type cannot assess the received value.

Inside an eval test, matcher calls on registered output objects or session objects reuse that exact run context when the value can be registered by reference, so expect(result.output).toSatisfyJudge(judge) stays concise for structured outputs. Other raw values fall back to the current test's most recent run(...) context. For manually created runs or values outside an eval context, pass any required input or harness in matcher options.

Structured or programmatic result checks should usually assert on result.output directly. When a judge needs richer normalized context or the configured suite harness, type it with createJudge<Input, Output>(...) or JudgeContext<Input, Output>.
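
Registration by reference, with a fallback to the most recent run, is a pattern that a WeakMap expresses naturally. The sketch below is a guess at the general shape of such a lookup, not the library's code; RunRegistry and RunContext are hypothetical names.

```typescript
type RunContext = { input: unknown; runId: number };

class RunRegistry {
  // Objects are tracked by identity; primitives cannot be WeakMap keys.
  private byReference = new WeakMap<object, RunContext>();
  private mostRecent: RunContext | undefined;

  register(output: unknown, context: RunContext): void {
    this.mostRecent = context;
    if (typeof output === "object" && output !== null) {
      this.byReference.set(output, context);
    }
  }

  // Exact match by reference first, otherwise the most recent run.
  lookup(value: unknown): RunContext | undefined {
    if (typeof value === "object" && value !== null) {
      return this.byReference.get(value) ?? this.mostRecent;
    }
    return this.mostRecent;
  }
}
```

This also shows why structured outputs resolve precisely while strings and other primitives can only fall back: a structurally equal copy of an output is a different reference and is treated like any other raw value.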

When you only need deterministic contract checks, built-ins such as StructuredOutputJudge() and ToolCallJudge() are still available.