vitest-evals
v0.13.1
Harness-backed AI testing on top of Vitest.
Use this package README for the core authoring model. For a guided setup path,
runtime-specific harness examples, replay, and GitHub Actions reporting, start
with the docs site: https://vitest-evals.sentry.dev/docs.
Install
npm install -D vitest-evals
Install a first-party harness package for the runtime you want to test:
npm install -D @vitest-evals/harness-pi-ai
# or
npm install -D @vitest-evals/harness-ai-sdk
# or
npm install -D @vitest-evals/harness-openai-agents
For GitHub Actions summaries and annotations, emit Vitest JSON and use the
native getsentry/vitest-evals action. No extra npm package is needed in the
workflow.
Core Model
- describeEval(...) binds exactly one harness to a suite
- the suite callback receives a fixture-backed Vitest it
- run(input) executes the harness explicitly and returns a normalized HarnessRun
- the returned result.output is the app-facing value you assert on directly
- helper assertions usually read the returned result, for example toolCalls(result) or spansByKind(result, "tool")
- result.session is the canonical JSON-serializable transcript for reporting, replay, tool assertions, and judges
- result.traces contains JSON-serializable operation spans; the first-party harnesses attach run, model, and tool spans automatically, while createHarness(...) attaches fallback run and tool spans for custom harnesses that do not return traces themselves. Span attributes include typed OpenTelemetry GenAI semantic keys while still allowing provider-specific metadata
- scenario-specific judge criteria should live in input or explicit matcher options, depending on whether the app or only the judge needs them
- suite-level judges are optional and run automatically after each run(...)
- suite-level judgeThreshold controls fail-on-score for those automatic judges
- every judge is a named object with assess(ctx)
- every judge receives JudgeContext with typed input, typed output, the normalized run/session, and tool calls; output is only optional when the harness output type includes undefined
- judges own their prompt, rubric, and parsing; LLM-backed judges use ctx.runJudge(...) from a configured judgeHarness
- explicit judge assertions use await expect(result).toSatisfyJudge(judge, options)
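To make the "named object with assess(ctx)" idea concrete, here is a minimal sketch of a deterministic judge. It does not import the package: ToolCall, JudgeContext, JudgeResult, and Judge below are simplified stand-ins, the toolCalls field name on the context is an assumption, and toolOrderJudge and passesThreshold are hypothetical helpers. The score >= threshold comparison is also an assumption about how fail-on-score might behave.

```typescript
// Simplified stand-ins for the package's types (assumptions, not its real definitions).
type ToolCall = { name: string };
type JudgeResult = { score: number; metadata?: Record<string, unknown> };
type JudgeContext<Input, Output> = {
  input: Input;
  output: Output;
  toolCalls: ToolCall[];
};
type Judge<Input, Output> = {
  name: string;
  assess: (ctx: JudgeContext<Input, Output>) => JudgeResult;
};

// Hypothetical deterministic judge: scores how much of the expected tool
// sequence was matched, in order, by the recorded tool calls.
function toolOrderJudge(expected: string[]): Judge<string, unknown> {
  return {
    name: "ToolOrderJudge",
    assess: (ctx) => {
      let matched = 0;
      for (const call of ctx.toolCalls) {
        if (matched < expected.length && call.name === expected[matched]) {
          matched += 1;
        }
      }
      const score = expected.length === 0 ? 1 : matched / expected.length;
      return { score, metadata: { matched, expected: expected.length } };
    },
  };
}

// Assumed threshold semantics: a run passes when score >= threshold.
function passesThreshold(result: JudgeResult, threshold: number): boolean {
  return result.score >= threshold;
}

const judge = toolOrderJudge(["lookupInvoice", "createRefund"]);
const verdict = judge.assess({
  input: "Refund invoice inv_123",
  output: { status: "approved" },
  toolCalls: [{ name: "lookupInvoice" }, { name: "createRefund" }],
});
console.log(verdict.score, passesThreshold(verdict, 0.6)); // 1 true
```

In a real suite the package's own createJudge(...) and JudgeContext types would replace these stand-ins; the sketch only shows the shape of the contract.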
Explicit Run Example
import { getModel } from "@mariozechner/pi-ai";
import { expect } from "vitest";
import { piAiHarness, piAiJudgeHarness } from "@vitest-evals/harness-pi-ai";
import {
  describeEval,
  FactualityJudge,
  toolCalls,
} from "vitest-evals";
import { createRefundAgent } from "../src/refundAgent";

const judgeHarness = piAiJudgeHarness({
  model: getModel("anthropic", "claude-sonnet-4-5"),
  temperature: 0,
});

describeEval(
  "refund agent",
  {
    harness: piAiHarness({
      agent: () => createRefundAgent(),
    }),
    judgeHarness,
    judges: [
      FactualityJudge({
        expected: "The refund request is approved.",
      }),
    ],
    judgeThreshold: 0.6,
  },
  (it) => {
    it("approves a refundable invoice", async ({ run }) => {
      const result = await run("Refund invoice inv_123");
      expect(result.output).toMatchObject({ status: "approved" });
      expect(toolCalls(result).map((call) => call.name)).toEqual([
        "lookupInvoice",
        "createRefund",
      ]);
    });
  },
);
Table-Driven Vitest Style
If you want case tables, use Vitest's own it.for(...) and call run(...)
inside the test body:
describeEval("refund agent", { harness }, (it) => {
  it.for([
    {
      name: "approves refundable invoice",
      input: "Refund invoice inv_123",
      expectedStatus: "approved",
    },
    {
      name: "denies non-refundable invoice",
      input: "Refund invoice inv_404",
      expectedStatus: "denied",
    },
  ])("$name", async ({ input, expectedStatus }, { run }) => {
    const result = await run(input);
    expect(result.output).toMatchObject({
      status: expectedStatus,
    });
  });
});
Terminal Reporting
The terminal reporter has two eval report levels. Normal mode prints compact
test, score, usage, and tool-count summaries. Info mode adds per-tool summaries,
arguments, timing/size metadata, replay status, and final output summaries.
Set VITEST_EVALS_REPORT_LEVEL=info, or pass --info through the workspace
eval scripts, to enable it. --verbose and -v remain aliases for
compatibility.
Full transcripts and spans are preserved in the Vitest JSON report metadata.
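As an illustration of how the two report levels relate to the environment variable and the flags, the selection logic could be pictured roughly as follows. This is a sketch only: resolveReportLevel is a hypothetical helper, not the reporter's implementation, and the precedence between flags and the environment variable is an assumption.

```typescript
type ReportLevel = "normal" | "info";

// Hypothetical resolver: the variable and flag names come from the text above;
// the precedence between them is an assumption made for this sketch.
function resolveReportLevel(
  env: Record<string, string | undefined>,
  argv: string[],
): ReportLevel {
  // --verbose and -v are treated as aliases of --info.
  const infoFlags = ["--info", "--verbose", "-v"];
  if (argv.some((arg) => infoFlags.indexOf(arg) !== -1)) return "info";
  return env["VITEST_EVALS_REPORT_LEVEL"] === "info" ? "info" : "normal";
}

console.log(resolveReportLevel({}, [])); // normal
console.log(resolveReportLevel({ VITEST_EVALS_REPORT_LEVEL: "info" }, [])); // info
console.log(resolveReportLevel({}, ["-v"])); // info
```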
Local Report UI
The local report UI reads the same Vitest JSON artifacts and serves a React SPA for drilling into runs, eval cases, harness output, sessions, tool calls, scores, and trace spans.
pnpm exec vitest-evals serve vitest-results.json
pnpm exec vitest-evals serve "eval-results/*.json"
pnpm exec vitest-evals serve eval-results/
GitHub Actions Reporting
Use Vitest JSON as the eval report artifact. It preserves the meta field that
contains eval scores and normalized harness runs.
vitest run --config vitest.evals.config.ts \
  --reporter=vitest-evals/reporter \
  --reporter=json \
  --outputFile.json=vitest-results.json
- uses: getsentry/vitest-evals@v0
  if: always()
  with:
    results: vitest-results.json
The GitHub reporter action writes a job summary, emits short failure annotations, can publish a separate Check Run, and can reduce sharded eval JSON artifacts into one combined report.
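To give a feel for what reducing sharded JSON artifacts involves, here is a minimal sketch that merges several Vitest JSON reports. It is not the action's implementation: mergeShardReports is a hypothetical helper, and the types cover only the small subset of the Vitest JSON report fields this sketch reads. The contents of meta are left opaque because their exact shape is not described here.

```typescript
// Minimal subset of the Vitest JSON report shape used by this sketch.
type AssertionResult = {
  fullName: string;
  status: string;
  meta?: Record<string, unknown>; // eval scores and harness runs live here
};
type FileResult = { name: string; assertionResults: AssertionResult[] };
type JsonReport = {
  numTotalTests: number;
  numPassedTests: number;
  numFailedTests: number;
  success: boolean;
  testResults: FileResult[];
};

// Hypothetical reducer: concatenates per-file results from every shard and
// recomputes the counters from the merged assertions.
function mergeShardReports(shards: JsonReport[]): JsonReport {
  const testResults: FileResult[] = [];
  let total = 0;
  let passed = 0;
  let failed = 0;
  for (const shard of shards) {
    for (const file of shard.testResults) {
      testResults.push(file);
      for (const assertion of file.assertionResults) {
        total += 1;
        if (assertion.status === "passed") passed += 1;
        if (assertion.status === "failed") failed += 1;
      }
    }
  }
  return {
    numTotalTests: total,
    numPassedTests: passed,
    numFailedTests: failed,
    success: shards.every((shard) => shard.success),
    testResults,
  };
}

const shardA: JsonReport = {
  numTotalTests: 1, numPassedTests: 1, numFailedTests: 0, success: true,
  testResults: [{
    name: "refund.eval.ts",
    assertionResults: [{ fullName: "refund agent approves", status: "passed" }],
  }],
};
const shardB: JsonReport = {
  numTotalTests: 1, numPassedTests: 0, numFailedTests: 1, success: false,
  testResults: [{
    name: "app.eval.ts",
    assertionResults: [{ fullName: "app behavior handles flow", status: "failed" }],
  }],
};

const merged = mergeShardReports([shardA, shardB]);
console.log(merged.numTotalTests, merged.numFailedTests, merged.success); // 2 1 false
```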
Existing Agents
For an existing agent, the intended contract is:
- pass the agent instance or per-test factory through the harness
- optionally pass run when the app entrypoint is not run(input, runtime)
- let the harness infer native tools from the existing agent by default
- only pass an explicit tools override when the agent hides its tool surface
The harness owns normalization, diagnostics, tool capture, replay plumbing, and reporter-facing artifacts. Your app just needs one runtime seam where those wrapped pieces can be injected.
Replay opt-in belongs on the harness, via toolReplay, while replay mode and
recording directory can live in Vitest environment config. Tool definitions
should stay free of VCR policy.
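The "runtime seam" and the separation between tool definitions and replay policy can be pictured with a small sketch. Nothing here is the package's toolReplay API: withReplay, createDemoAgent, the Cassette type, and the in-memory storage are all hypothetical, chosen only to show that tools can be wrapped from the outside and injected through one seam.

```typescript
type Tool = (args: Record<string, unknown>) => Promise<unknown>;
type Tools = Record<string, Tool>;
type ReplayMode = "record" | "replay";

// Hypothetical in-memory cassette; a real setup would persist recordings.
type Cassette = Map<string, unknown>;

// Wraps each tool so calls are recorded, or served from the cassette in
// replay mode. The tool definitions themselves know nothing about replay.
function withReplay(tools: Tools, mode: ReplayMode, cassette: Cassette): Tools {
  const wrapped: Tools = {};
  for (const name of Object.keys(tools)) {
    const tool = tools[name];
    if (!tool) continue;
    wrapped[name] = async (args) => {
      const key = `${name}:${JSON.stringify(args)}`;
      if (mode === "replay") {
        if (!cassette.has(key)) throw new Error(`No recording for ${key}`);
        return cassette.get(key);
      }
      const result = await tool(args);
      cassette.set(key, result);
      return result;
    };
  }
  return wrapped;
}

// The runtime seam: the app receives its tools instead of importing them.
function createDemoAgent(runtime: { tools: Tools }) {
  return {
    run: async (invoiceId: string): Promise<{ status: string }> => {
      const lookup = runtime.tools["lookupInvoice"];
      if (!lookup) throw new Error("lookupInvoice tool is not available");
      const invoice = await lookup({ invoiceId });
      return { status: invoice ? "approved" : "denied" };
    },
  };
}

const cassette: Cassette = new Map();
const liveTools: Tools = {
  lookupInvoice: async (args) => ({ id: args["invoiceId"], refundable: true }),
};
const offlineTools: Tools = {
  lookupInvoice: async () => {
    throw new Error("network disabled");
  },
};

// Record once against the live tool, then replay without touching it.
async function demo(): Promise<string[]> {
  const recordAgent = createDemoAgent({ tools: withReplay(liveTools, "record", cassette) });
  const replayAgent = createDemoAgent({ tools: withReplay(offlineTools, "replay", cassette) });
  const recorded = await recordAgent.run("inv_123");
  const replayed = await replayAgent.run("inv_123");
  return [recorded.status, replayed.status];
}
```

Because the wrapper is applied at the seam, the same agent code runs unchanged in record and replay mode.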
For the Pi-specific harness, output/session/usage normalization should usually be inferred automatically. Treat low-level normalization callbacks as an escape hatch, not part of the primary authoring path.
For OpenAI Agents SDK apps, use
@vitest-evals/harness-openai-agents with an existing Agent or an agent
factory and a Runner or runner factory. The harness calls
Runner.run(agent, input, options) by default and exposes the same
normalization and replay hooks when the app needs a custom entrypoint or
structured domain output mapping.
Custom App Harnesses
First-party harness packages are conveniences, not the only supported path. If
you need to test a full application flow, use createHarness(...) to run your
app through its normal entrypoint and return the app-facing output. Judges own
their prompt/rubric text separately from the system under test.
When generics are needed, use createHarness<Input, Output>(...).
import {
  createHarness,
  createJudge,
  createJudgeHarness,
  describeEval,
} from "vitest-evals";

// replayAppEvents, promptJudgeModel, formatRubricPrompt, and parseRubricVerdict
// are app-specific helpers that this example assumes you provide.

type AppEvent = {
  type: string;
  payload: Record<string, string>;
};

type AppEvalInput = {
  events: AppEvent[];
  criteria: {
    contract: string;
    pass: string[];
    fail?: string[];
  };
};

type AppOutput = {
  replies: Array<{ text: string }>;
  sideEffects: string[];
};

const appHarness = createHarness<AppEvalInput, AppOutput>({
  name: "custom-app",
  run: async ({ input, signal }) => {
    const result = await replayAppEvents(input.events, {
      signal,
    });
    return {
      output: {
        replies: result.replies,
        sideEffects: result.sideEffects,
      },
      artifacts: {
        replyCount: result.replies.length,
      },
      usage: {},
    };
  },
});

const judgeHarness = createJudgeHarness({
  name: "app-rubric-judge-model",
  run: async ({ prompt }, { signal }) =>
    promptJudgeModel({ prompt, signal }),
});

const AppRubricJudge = createJudge<AppEvalInput, AppOutput>(
  "AppRubricJudge",
  async (ctx) => {
    if (!ctx.runJudge) {
      throw new Error("AppRubricJudge requires a configured judgeHarness.");
    }
    const verdict = await ctx.runJudge({
      prompt: formatRubricPrompt({
        output: ctx.output,
        criteria: ctx.input.criteria,
      }),
      responseFormat: { type: "json" },
    });
    return parseRubricVerdict(verdict);
  },
);

describeEval(
  "app behavior",
  {
    harness: appHarness,
    judgeHarness,
    judges: [AppRubricJudge],
    judgeThreshold: 0.75,
  },
  (it) => {
    it("handles an event flow", async ({ run }) => {
      await run({
        events: [
          {
            type: "message.created",
            payload: {
              text: "Summarize the current incident.",
            },
          },
        ],
        criteria: {
          contract: "The app posts one user-visible incident summary.",
          pass: ["The reply names the incident status."],
          fail: ["The reply exposes internal metadata."],
        },
      });
    });
  },
);
Use Harness.run(...) for the application under test. Calling
ctx.harness.run(...) from inside a judge runs the application a second time,
so reserve that for judges that intentionally need a second execution. Put
criteria on input when they are part of the scenario itself; pass
case-specific judge criteria through matcher options, or configure suite-wide
criteria on the judge instance. createHarness(...) builds a default
user/assistant session from input and typed output; return a full
HarnessRun only when you need exact session control.
Provider setup and rubric parsing stay in your judge. The core
package only requires the judge to return a JudgeResult with a score and
optional metadata.
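The custom harness example calls formatRubricPrompt(...) and parseRubricVerdict(...) without defining them, since prompt and parsing belong to your judge. One possible shape for those two helpers is sketched below. Everything here is an assumption: the prompt wording, the {"score", "rationale"} reply format, the 0 to 1 score range, and the choice to accept either a JSON string or an already-parsed object (the value ctx.runJudge(...) resolves to is not specified here). JudgeResult is a local stand-in for the package type.

```typescript
type Criteria = { contract: string; pass: string[]; fail?: string[] };
type JudgeResult = { score: number; metadata?: Record<string, unknown> };

// Hypothetical prompt builder for the rubric judge.
function formatRubricPrompt(args: { output: unknown; criteria: Criteria }): string {
  const { output, criteria } = args;
  const lines = [
    `Contract: ${criteria.contract}`,
    "Must pass:",
    ...criteria.pass.map((item) => `- ${item}`),
    "Must not happen:",
    ...(criteria.fail ?? []).map((item) => `- ${item}`),
    "Output under review:",
    JSON.stringify(output, null, 2),
    'Reply with JSON: {"score": <number from 0 to 1>, "rationale": "<string>"}',
  ];
  return lines.join("\n");
}

// Hypothetical parser: accepts a JSON string or an already-parsed object.
function parseRubricVerdict(raw: unknown): JudgeResult {
  const value: unknown = typeof raw === "string" ? JSON.parse(raw) : raw;
  if (typeof value !== "object" || value === null) {
    throw new Error("Judge verdict must be a JSON object.");
  }
  const { score, rationale } = value as { score?: unknown; rationale?: unknown };
  if (typeof score !== "number" || Number.isNaN(score)) {
    throw new Error("Judge verdict is missing a numeric score.");
  }
  // Clamp so a misbehaving judge model cannot push the score outside 0..1.
  const clamped = Math.min(1, Math.max(0, score));
  return { score: clamped, metadata: { rationale } };
}

const prompt = formatRubricPrompt({
  output: { replies: [{ text: "Incident is mitigated." }], sideEffects: [] },
  criteria: {
    contract: "The app posts one user-visible incident summary.",
    pass: ["The reply names the incident status."],
  },
});
console.log(prompt.split("\n")[0]);
console.log(parseRubricVerdict('{"score": 0.8, "rationale": "Status is named."}').score); // 0.8
```

Throwing on a malformed verdict, rather than returning a score of 0, keeps judge-model failures distinguishable from genuinely low scores; whether that is the right trade-off depends on your suite.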
Automatic suite-level judges are a good fit when every run(...) should get
the same scoring. For cases where only some runs need an LLM judge, keep the
suite free of automatic judges and use an explicit matcher:
await expect(result).toSatisfyJudge(AppRubricJudge, {
  threshold: 0.75,
});
Judge Matchers
Use the matcher when a judge should behave like a normal Vitest assertion. In practice, this is usually most useful for factuality, rubric, or grounded answer checks:
import { openai } from "@ai-sdk/openai";
import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
import { expect } from "vitest";
import { FactualityJudge } from "vitest-evals";

const judgeHarness = aiSdkJudgeHarness({
  model: openai("gpt-4.1-mini"),
  temperature: 0,
});

const factualityJudge = FactualityJudge({ judgeHarness });

await expect(result).toSatisfyJudge(factualityJudge, {
  expected: "Paris is the capital of France.",
  threshold: 0.6,
});
For lower-level cases, the matcher also accepts raw values and synthetic judge
context. Pass every context field the judge needs when the value did not come
from eval fixture run(...):
await expect({ status: "approved" }).toSatisfyJudge(MyJudge, {
  input: "Refund invoice inv_123",
});
Use the built-in factuality judge when you want a model-backed factuality grade over the normalized run:
import { openai } from "@ai-sdk/openai";
import { aiSdkJudgeHarness } from "@vitest-evals/harness-ai-sdk";
import { FactualityJudge } from "vitest-evals";

export const judgeHarness = aiSdkJudgeHarness({
  model: openai("gpt-4.1-mini"),
  temperature: 0,
});

export const factualityJudge = FactualityJudge({ judgeHarness });
For custom judge providers, create a dedicated judge harness with the same prompt contract:
import {
  createJudgeHarness,
  FactualityJudge,
  type JudgeHarness,
} from "vitest-evals";
import { callJudgeModel } from "./judgeModel";

export const judgeHarness: JudgeHarness = createJudgeHarness({
  name: "factuality-judge-model",
  run: async ({ system, prompt }, { signal }) =>
    callJudgeModel({ system, prompt, signal }),
});

export const factualityJudge = FactualityJudge({ judgeHarness });
Configure that judge harness once and reuse the same judge with any app harness:
import { describeEval } from "vitest-evals";
import { aiSdkRefundHarness } from "./aiSdkRefundHarness";
import { piRefundHarness } from "./piRefundHarness";
import { factualityJudge } from "./sharedJudges";

describeEval("ai sdk refund agent", {
  harness: aiSdkRefundHarness,
  judges: [factualityJudge],
});

describeEval("pi refund agent", {
  harness: piRefundHarness,
  judges: [factualityJudge],
});
Use createJudge(...) for custom judges so reporter output gets a stable
label. Custom LLM-backed judges should provide their own judge prompt, rubric
text, and parser, then call ctx.runJudge(...) for the provider-specific model
request. Bind a reusable default with createJudge({ name, judgeHarness,
assess }) or pass judgeHarness on the matcher or suite. Core binds the
matcher, judge, or explicit suite judgeHarness into ctx.runJudge(...), along with
the current run's abort signal. Matcher options win over a judge default, and a
judge default wins over the suite default. Explicit matcher calls can also
reuse a single unambiguous judge-level harness from the suite's automatic
judges, but automatic judges do not inherit inferred harnesses from sibling
judges. That inference requires those judges to share the same judge harness
instance. Leave judgeHarness unset for suites that only use deterministic
judges. Calling harness.run(...) from a judge executes the application again,
so use that only when a second run is intentional.
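The precedence and inference rules above can be summarized in a few lines. This is an illustrative sketch, not the package's resolution code: JudgeHarness and JudgeLike are simplified stand-ins, and resolveJudgeHarness and inferSharedHarness are hypothetical helper names. How the inferred harness ranks against the other sources is not specified above, so the sketch keeps the two functions separate.

```typescript
// Simplified stand-ins (assumptions, not the package's real types).
type JudgeHarness = { name: string };
type JudgeLike = { name: string; judgeHarness?: JudgeHarness };

// Stated precedence: matcher options, then the judge default, then the suite.
function resolveJudgeHarness(sources: {
  matcher?: JudgeHarness;
  judge?: JudgeHarness;
  suite?: JudgeHarness;
}): JudgeHarness | undefined {
  return sources.matcher ?? sources.judge ?? sources.suite;
}

// Inference for explicit matcher calls: only unambiguous when every
// judge-level harness among the suite's judges is the same instance.
function inferSharedHarness(judges: JudgeLike[]): JudgeHarness | undefined {
  let shared: JudgeHarness | undefined;
  for (const judge of judges) {
    const harness = judge.judgeHarness;
    if (!harness) continue; // deterministic judges carry no harness
    if (shared && shared !== harness) return undefined; // ambiguous
    shared = harness;
  }
  return shared;
}

const suiteHarness: JudgeHarness = { name: "suite" };
const judgeDefault: JudgeHarness = { name: "judge-default" };
const matcherHarness: JudgeHarness = { name: "matcher" };

console.log(resolveJudgeHarness({ judge: judgeDefault, suite: suiteHarness })?.name); // judge-default
console.log(
  resolveJudgeHarness({ matcher: matcherHarness, judge: judgeDefault, suite: suiteHarness })?.name,
); // matcher
console.log(
  inferSharedHarness([
    { name: "A", judgeHarness: judgeDefault },
    { name: "B" },
    { name: "C", judgeHarness: judgeDefault },
  ])?.name,
); // judge-default
```

Note that the comparison in inferSharedHarness is by reference: two harnesses configured identically but created separately would count as ambiguous, which matches the "same judge harness instance" requirement.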
For an EvalHarnessRun returned by fixture run(...),
toSatisfyJudge(...) uses the run's typed output and reuses the registered
input. It requires any custom judge params and rejects judges whose
output type cannot assess the received value. Inside an eval test,
matcher calls on registered output objects or session objects reuse that exact
run context when the value can be registered by reference, so
expect(result.output).toSatisfyJudge(judge) stays concise for structured
outputs. Other raw values fall back to the current test's most recent
run(...) context. For
manually-created runs or values outside an eval context, pass any required
input or harness in matcher options. Structured or
programmatic result checks should usually assert on result.output directly.
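The "registered by reference" behavior described above can be pictured with a WeakMap. This is a conceptual sketch under stated assumptions, not the package's internals: RunRegistry and RunContext are hypothetical names, and only the output object is registered here, whereas the text above also mentions session objects.

```typescript
type RunContext = { input: unknown; output: unknown };

// Hypothetical registry: object outputs are keyed by reference, so the same
// output object always maps back to the run that produced it.
class RunRegistry {
  private byReference = new WeakMap<object, RunContext>();
  private mostRecent: RunContext | undefined;

  register(context: RunContext): void {
    this.mostRecent = context;
    const output = context.output;
    // Primitives cannot be WeakMap keys, so only objects are registered.
    if (typeof output === "object" && output !== null) {
      this.byReference.set(output, context);
    }
  }

  // Exact match by reference first; otherwise fall back to the latest run.
  resolve(value: unknown): RunContext | undefined {
    if (typeof value === "object" && value !== null) {
      const exact = this.byReference.get(value);
      if (exact) return exact;
    }
    return this.mostRecent;
  }
}

const registry = new RunRegistry();
const firstRun: RunContext = { input: "Refund invoice inv_123", output: { status: "approved" } };
const secondRun: RunContext = { input: "Refund invoice inv_404", output: { status: "denied" } };
registry.register(firstRun);
registry.register(secondRun);

// The first run's output object still resolves to the first run...
console.log(registry.resolve(firstRun.output)?.input); // Refund invoice inv_123
// ...while a raw string falls back to the most recent run.
console.log(registry.resolve("approved")?.input); // Refund invoice inv_404
```

This also shows why structurally equal copies do not match: a fresh { status: "approved" } object is a different reference and would fall back to the most recent run.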
When a judge needs richer normalized context or the configured suite harness,
type it with createJudge<Input, Output>(...) or JudgeContext<Input, Output>.
When you only need deterministic contract checks, built-ins such as
StructuredOutputJudge() and ToolCallJudge() are still available.
