@reactive-agents/eval
Evaluation framework for Reactive Agents — benchmark agent quality, track regressions, and run automated test suites against an isolated frozen judge.
Installation
```sh
bun add @reactive-agents/eval
```

Features
- 5-dimension scoring — `accuracy`, `relevance`, `completeness`, `safety`, `cost-efficiency`
- LLM-as-judge — judge runs through `JudgeLLMService`, a tag isolated from the system-under-test (Rule 4: judge MUST differ from SUT)
- Evaluation suites — `EvalSuite` describes cases + dimensions + suite metadata
- SQLite persistence — `createEvalStore` for run history, regression diffs, comparison reports
- Dataset loader — `DatasetService` for sharing evaluation corpora across suites
- CLI integration — `rax eval` runs suites and writes reports
Suite Runner Contract
`runSuite(suite, agentConfig, agentRunner, config?)` takes three required arguments plus an optional config:

- `suite` — cases + dimensions + suite metadata (`EvalSuite`)
- `agentConfig` — string identifying the system under test (used in result records and the Rule-4 guard)
- `agentRunner` — caller-supplied function that invokes YOUR agent and returns its output + metrics. Pre-W6.5 this was hardcoded to a placeholder; callers now supply it themselves
- `config?` — optional `EvalConfig`, including the judge model selection (must differ from the SUT)
```typescript
import {
EvalService,
createEvalLayer,
type SuiteAgentRunner,
} from "@reactive-agents/eval";
import { Effect } from "effect";
// Caller-supplied SUT runner. This invokes YOUR agent. It MUST NOT use the
// JudgeLLMService — that Tag is reserved for the frozen judge per Rule 4 of
// 00-RESEARCH-DISCIPLINE.md. Use LLMService or your agent builder layer here.
const myAgentRunner: SuiteAgentRunner = (input) =>
Effect.gen(function* () {
const result = yield* runMyAgent(input); // your agent invocation
return {
actualOutput: result.output,
metrics: {
latencyMs: result.elapsedMs,
tokensUsed: result.tokens,
costUsd: result.costUsd,
},
};
});
const program = Effect.gen(function* () {
const evalService = yield* EvalService;
const run = yield* evalService.runSuite(
{
id: "qa-benchmark",
name: "QA Benchmark",
dimensions: ["accuracy", "relevance"],
cases: [
{ id: "q1", input: "What is the capital of France?", expectedOutput: "Paris" },
{ id: "q2", input: "Who wrote 'The Great Gatsby'?", expectedOutput: "F. Scott Fitzgerald" },
],
},
"anthropic/claude-sonnet-4-20250514", // SUT identifier
myAgentRunner,
{
judge: {
model: "claude-haiku-4-5-20251001", // judge MUST differ from SUT
provider: "anthropic",
},
},
);
console.log(`avgScore: ${run.summary.avgScore}, passed: ${run.summary.passed}/${run.summary.totalCases}`);
});
```

The judge LLM is wired separately via `JudgeLLMService`, so the judge code path is fully isolated from the SUT. See the `createEvalLayer` JSDoc for layer composition.
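A minimal sketch of that composition, assuming `createEvalLayer` can be called with no arguments (check its JSDoc for the real options):

```typescript
// Provide the eval runtime layer and execute the program from above.
// Zero-argument createEvalLayer() is an assumption for illustration.
const EvalLayer = createEvalLayer();

await Effect.runPromise(program.pipe(Effect.provide(EvalLayer)));
```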
Dimensions
| Dimension | What it measures | Scorer |
| ----------------- | ---------------------------------------------------- | ----------------------- |
| accuracy | Factual correctness against expected output | scoreAccuracy |
| relevance | How well the response addresses the question | scoreRelevance |
| completeness | Coverage of all aspects of the expected answer | scoreCompleteness |
| safety | Absence of harmful, biased, or inappropriate content | scoreSafety |
| cost-efficiency | Token usage and cost relative to quality | scoreCostEfficiency |
Each dimension scorer is a standalone Effect that takes the judge LLM tag plus scoring params and returns a `DimensionScore`.
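A hypothetical direct call to one scorer, assuming the tag-plus-params shape described above; the parameter names below mirror `EvalCase` but are not the confirmed API:

```typescript
import { scoreAccuracy, JudgeLLMService } from "@reactive-agents/eval";
import { Effect } from "effect";

// Assumed signature: scorer takes the judge LLM tag plus scoring params.
// Field names (input/expectedOutput/actualOutput) are assumptions.
const accuracyScore = Effect.gen(function* () {
  return yield* scoreAccuracy(JudgeLLMService, {
    input: "What is the capital of France?",
    expectedOutput: "Paris",
    actualOutput: "Paris, the capital of France.",
  });
});
```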
Persistence
```typescript
import { createEvalStore, makeEvalServicePersistentLive } from "@reactive-agents/eval";

const store = createEvalStore("./eval-history.db");

// `makeEvalServicePersistentLive(store)` wires automatic persistence —
// every `runSuite` call writes to SQLite for diffing and regression checks.
```

EvalStore exposes `listRuns`, `getRun`, `compareRuns`, and `getRegressions` for downstream tooling.
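A short sketch of that downstream surface, reusing the `store` from above; the Effect-returning signatures and result shapes are assumptions, not the confirmed API:

```typescript
import { Effect } from "effect";

// Method names come from the list above; everything else is assumed.
const report = Effect.gen(function* () {
  const runs = yield* store.listRuns("qa-benchmark"); // assumed: newest run first
  const [latest, previous] = runs;
  const diff = yield* store.compareRuns(previous.id, latest.id);
  const regressions = yield* store.getRegressions(latest.id);
  return { diff, regressions };
});
```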
Key Exports
| Export | Purpose |
| --------------------------------------------------------------------- | ------------------------------------------------ |
| EvalService, EvalServiceLive, makeEvalServiceLive | Suite runner with frozen-judge isolation |
| makeEvalServicePersistentLive | Persistent variant wired to EvalStore |
| JudgeLLMService | Frozen-judge tag (Rule 4 isolation) |
| DatasetService, DatasetServiceLive | Dataset loader |
| createEvalStore | SQLite-backed history |
| createEvalLayer | Factory for the runtime layer |
| scoreAccuracy, scoreRelevance, scoreCompleteness, scoreSafety, scoreCostEfficiency | Per-dimension scorers |
| SuiteAgentRunner, EvalSuite, EvalCase, EvalRun, EvalRunSummary, JudgeConfig, EvalConfig | Schemas + types |
| EvalError, BenchmarkError, DatasetError | Tagged errors |
Documentation
- Full docs: docs.reactiveagents.dev
- Eval guide: docs.reactiveagents.dev/guides/evaluation/
License
MIT
