@halo-sdk/eval
v2.0.0
Cache cost benchmark + behavioral-eval seam for Halo SDK — measure the prefix-cache moat, plug in promptfoo/vitest for quality
Two things, deliberately scoped:
Cache cost benchmark: the differentiated "evidence" that proves Halo's prefix-cache moat.
benchmarkCache(agent, inputs) drives an agent through a multi-turn scenario and reports hit rate, token split, estimated spend, and an A–F grade.
compareCache(scenario, a, b) runs the same scenario through two agents, e.g. to show that SummarizeAppendStrategy retains hit rate where naive truncation collapses it.
Behavioral-eval seam: a thin runEvalCases(agent, cases) harness. For real behavioral/quality evaluation (LLM-as-judge, datasets, regression gates), point promptfoo or vitest at your agent. Halo does not reimplement generic eval.
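To make the truncation claim concrete, here is a minimal, self-contained sketch of why dropping the front of the history tends to collapse a prefix-cache hit rate. It is a simplified model, not Halo's implementation: each array element stands for one token, the cache is assumed to match only the leading run shared with the previous request, and the names commonPrefixLength and simulateHitRate are hypothetical. Real providers add minimum lengths and block granularity that this ignores.

```typescript
// Simplified model: each array element stands for one token of a request.
// These names are illustrative and are not part of @halo-sdk/eval.

// Length of the shared leading run between two sequences.
function commonPrefixLength(a: string[], b: string[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

// Count the prefix shared with the previous request as cache hits.
function simulateHitRate(requests: string[][]): number {
  let cached = 0;
  let total = 0;
  let previous: string[] = [];
  for (const request of requests) {
    cached += commonPrefixLength(previous, request);
    total += request.length;
    previous = request;
  }
  return total === 0 ? 0 : cached / total;
}

// Append-only history: every request extends the previous one.
const appendOnly = [
  ["sys", "u1"],
  ["sys", "u1", "a1", "u2"],
  ["sys", "u1", "a1", "u2", "a2", "u3"],
];

// Naive truncation to the last three entries: the leading tokens change every turn.
const truncated = [
  ["sys", "u1"],
  ["u1", "a1", "u2"],
  ["u2", "a2", "u3"],
];

console.log(simulateHitRate(appendOnly)); // 0.5
console.log(simulateHitRate(truncated)); // 0
```

In this toy model the append-only history reuses 6 of 12 tokens, while the truncated history reuses none, because its first token differs from the previous request on every turn.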
Usage
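import { benchmarkCache, compareCache, runEvalCases } from "@halo-sdk/eval";

// `agent`, `a`, `b`, and `scenario` are assumed to be created elsewhere (not shown here).

// Drive one agent through a multi-turn scenario and report cache metrics.
const report = await benchmarkCache(agent, [
  "Summarize the cache design.",
  "Now expand on breakpoints.",
  "And how does it compare to OpenAI?",
]);
console.log(report.hitRate, report.grade, report.estimatedUsd);

// Run the same scenario through two agents to compare their cache behavior.
const cmp = await compareCache(
  scenario,
  { name: "truncate", agent: a },
  { name: "summarize", agent: b },
);

// Thin behavioral check: each case pairs an input with an assertion on the output.
const evalReport = await runEvalCases(agent, [
  { name: "greets", input: "say hi", assert: (out) => out.toLowerCase().includes("hi") },
]);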
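For readers wondering what a "thin" harness of this kind amounts to, the sketch below shows one plausible shape: run each case's input through the agent and apply its assert predicate to the output. Everything here is an assumption made for illustration. The SketchAgent interface, its run method, the result shape, and runCasesSketch are hypothetical and may not match the real types or behavior of runEvalCases.

```typescript
// Hypothetical shapes for illustration; the real types in @halo-sdk/eval may differ.
interface SketchAgent {
  run(input: string): Promise<string>;
}

interface SketchCase {
  name: string;
  input: string;
  assert: (out: string) => boolean;
}

interface SketchResult {
  name: string;
  passed: boolean;
}

// Run every case in order and record whether its assertion held.
async function runCasesSketch(
  agent: SketchAgent,
  cases: SketchCase[],
): Promise<SketchResult[]> {
  const results: SketchResult[] = [];
  for (const c of cases) {
    let passed = false;
    try {
      passed = c.assert(await agent.run(c.input));
    } catch {
      passed = false; // a thrown error counts as a failed case
    }
    results.push({ name: c.name, passed });
  }
  return results;
}

// Stub agent so the sketch runs without a model or network access.
const echoAgent: SketchAgent = {
  run: async (input) => `You said: ${input}`,
};

runCasesSketch(echoAgent, [
  { name: "greets", input: "say hi", assert: (out) => out.toLowerCase().includes("hi") },
]).then((results) => console.log(results));
```

A harness this small is enough for smoke checks. Anything involving judges, datasets, or regression gates is what the README hands off to promptfoo or vitest.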