# @cool-ai/beach-evals
Optional evaluation primitives for Beach-based agents, built on the event log: a test case schema, a deterministic replay-based runner, reference scorers, and a pluggable `Scorer` interface.
Home: cool-ai.org · Documentation: cool-ai.org/docs
## Install
```bash
npm install --save-dev @cool-ai/beach-evals
```

## API
### EvalRecorder
Records live turn data as a replayable EvalTestCase.
```ts
import { EvalRecorder } from '@cool-ai/beach-evals';

const recorder = new EvalRecorder({ actorConfig, provider });

// Wrap specialist tools to capture their recorded outputs
const wrappedTools = recorder.wrapRegistry(toolDefinitions);

// After a turn completes, record it
const testCase = recorder.record({
  turnId,
  initialMessages,
  specialistExecutions,
  expected: {
    turnState: 'complete',
    parts: [{ partType: 'domain-data' }],
  },
  promptText: systemPrompt,
});
```

### EvalDataset
Stores test cases on disk — one JSON file per case.
```ts
import { EvalDataset } from '@cool-ai/beach-evals';

const dataset = new EvalDataset();
dataset.add(testCase);
await dataset.save('./evals/cases');

// Load from disk
const loaded = await EvalDataset.load('./evals/cases');
```

### EvalRunner
Replays test cases and scores them.
```ts
import {
  EvalRunner,
  TurnStateScorer,
  PartPresentScorer,
  ExactMatchScorer,
  SchemaValidScorer,
} from '@cool-ai/beach-evals';

const runner = new EvalRunner({ actorConfig, provider, toolDefinitions });

const result = await runner.run(testCase, {
  scorers: [
    new TurnStateScorer('complete'),
    new PartPresentScorer('domain-data'),
    new ExactMatchScorer('response', 'Flights found'),
    new SchemaValidScorer((data) => Array.isArray((data as any).flights)),
  ],
  promptHashMismatch: 'warn', // 'warn' | 'error' | 'ignore'
});

console.log(result.passed, result.scores);
```

### Reference scorers
| Scorer | Purpose |
|--------|---------|
| TurnStateScorer | Assert the replay settled with the expected turnState |
| PartPresentScorer | Assert a part of a given type is present (optionally non-empty) |
| ExactMatchScorer | Assert a response part's text matches exactly |
| SchemaValidScorer | Assert a domain-data part passes a caller-supplied predicate |
Bring your own Scorer implementation for LLM-as-judge, rubric-based, or domain-specific checks.
## Rationale
Beach's audit-first claim is undercut without evaluation — auditability without assertions is just logging. Consumers need to:
- Regression-test their agents: "given this input, does the agent produce this output?"
- Compare models: "same input against different models, compare outputs."
- Iterate prompts: "does changing the prompt change behaviour in the ways I expected?"
- Score outputs: equality, regex, schema-validation, rubric, LLM-as-judge.
@cool-ai/beach-evals ships the primitives for this without being a full evaluation platform. Consumers with more sophisticated needs (rubric-based scoring, human-judged evals, continuous CI with dataset management, cross-run dashboards) write their own scorers against Beach's Scorer interface, or integrate with an external eval platform (Mastra evals, LangSmith, Braintrust, PromptFoo, Logfire).
## Concern
@cool-ai/beach-evals provides:
- `EvalTestCase` schema — `{ id, actorId, initialMessages, specialistExecutions, expected, promptHash?, tags? }` (a typed sketch follows this list).
- `EvalRecorder` — records live turn data as an `EvalTestCase`; `wrapRegistry()` intercepts specialist tool calls transparently so executions are captured without changing actor code.
- `EvalRunner` — replays a test case via `@cool-ai/beach-session`'s `TurnReplayer`, runs scorers against the result, returns an `EvalResult`.
- `EvalDataset` — stores test cases on disk (one JSON file per case) with `save(dir)` / `static load(dir)`.
- Reference scorers — `TurnStateScorer`, `PartPresentScorer`, `ExactMatchScorer`, `SchemaValidScorer`.
- `Scorer` interface — consumers plug in custom scorers (LLM-as-judge, rubric-based, domain-specific).
- `hashPrompt(prompt)` — computes a SHA-256 hash over the actor's system prompt text; stored with the recording and compared at replay time to detect prompt drift.
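
For orientation, here is a typed sketch of that test case shape, assembled from the fields listed above; the property types are illustrative assumptions, not the package's published typings.

```ts
// Sketch only: field names come from the list above; the types are assumptions.
interface EvalTestCase {
  id: string;
  actorId: string;
  initialMessages: unknown[];        // conversation state the replay starts from
  specialistExecutions: unknown[];   // recorded specialist tool calls and their outputs
  expected: {
    turnState?: string;                   // e.g. 'complete'
    parts?: Array<{ partType: string }>;  // e.g. [{ partType: 'domain-data' }]
  };
  promptHash?: string;               // SHA-256 of the resolved system prompt at recording time
  tags?: string[];
}
```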
## Not in this package
- CI integration (GitHub Actions, GitLab Pipelines, report uploaders).
- Dashboards, trend charts, cross-run comparisons.
- Human-in-the-loop eval UI.
- LLM-as-judge reference implementation (consumers bring their own using the `Scorer` interface).
For any of these, integrate an external eval platform. Beach ships the primitives; the rest is a platform-choice concern.
## Scorer interface
```ts
interface Scorer {
  name: string;
  score(result: ReplayResult): ScoreResult;
}

interface ScoreResult {
  scorer: string;
  passed: boolean;
  message?: string;
  details?: Record<string, unknown>;
}
```

A consumer's LLM-as-judge scorer would call an LLM with the actual output plus a rubric, and return a `passed` boolean along with the LLM's reasoning in `details`.
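
As a concrete example, here is a minimal custom scorer sketch against this interface, covering the regex-style check mentioned under Rationale. The import paths, the origin of `ReplayResult`, and the injected text accessor are assumptions; adapt them to the actual replay result shape.

```ts
// Assumption: Scorer and ScoreResult are exported by @cool-ai/beach-evals and
// ReplayResult by @cool-ai/beach-session; adjust imports to the real typings.
import type { Scorer, ScoreResult } from '@cool-ai/beach-evals';
import type { ReplayResult } from '@cool-ai/beach-session';

// Passes when the replayed response text matches a regular expression.
class RegexScorer implements Scorer {
  name = 'regex-match';

  constructor(
    private pattern: RegExp,
    // How response text is read off the replay result is not specified here,
    // so the accessor is injected and left to the consumer.
    private getText: (result: ReplayResult) => string | undefined,
  ) {}

  score(result: ReplayResult): ScoreResult {
    const text = this.getText(result) ?? '';
    const passed = this.pattern.test(text);
    return {
      scorer: this.name,
      passed,
      message: passed ? undefined : `response did not match ${this.pattern}`,
      details: { pattern: String(this.pattern) },
    };
  }
}
```

It drops into the same `scorers` array as the reference scorers in the `EvalRunner` example above.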
## Determinism
@cool-ai/beach-evals uses @cool-ai/beach-session's replay primitives to run tests deterministically:
- LLM calls are mocked from recorded outputs during replay, or run live against a specified model for "fresh" runs.
- Tool calls are mocked from recorded results during replay; specialist-scope tools are mocked from their `specialist_execution` log records (see principle 17).
- Both modes are available; the test case declares which.
### Prompt changes invalidate recorded outputs
A subtle trap: recorded LLM outputs are pinned to the prompt that produced them. If the actor's system prompt changes between recording and replay, the recorded output was made against the old prompt. Replaying with the new prompt but mocking from the old output tests nothing about the new prompt's behaviour — it tests only that the old prompt's behaviour is still consistent.
Rules of use:
- Deterministic-behaviour regressions (data shape, schema conformance, turn-state correctness) are valid against recorded outputs as long as the prompt is fixed.
- Prompt-sensitive regressions (response quality, clarification thresholds, tool-choice heuristics) require fresh runs against a live model. A recorded replay won't catch prompt regressions.
- Rerecord when you change prompts. Test runners should detect prompt hash mismatches and warn.
`@cool-ai/beach-evals` computes a hash over the actor's resolved system prompt (snippet + consumer content) and stores it with the recording; a mismatch at replay time surfaces a warning, not a silent pass.
Consumer eval pipelines should interleave both modes — fresh runs for prompt iteration, replay for regression checks on unchanged prompts.
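
A sketch of that check inside a consumer's test runner follows. `hashPrompt`, the test case's `promptHash` field, and the runner's `promptHashMismatch` option are described above; how the current resolved prompt text is assembled is left to the consumer and assumed here.

```ts
import { hashPrompt } from '@cool-ai/beach-evals';

// Assumption: currentSystemPrompt is the actor's resolved system prompt
// (snippet + consumer content), assembled however the consumer does it.
const currentHash = hashPrompt(currentSystemPrompt);

if (testCase.promptHash && testCase.promptHash !== currentHash) {
  // The prompt has drifted since this case was recorded, so replaying its
  // recorded outputs no longer exercises the current prompt. Rerecord the
  // case, or run it fresh against a live model.
  console.warn(`prompt drift detected for ${testCase.id}: rerecord or run fresh`);
}

// The runner can also surface the same mismatch at replay time:
const result = await runner.run(testCase, {
  scorers,
  promptHashMismatch: 'warn', // or 'error' to treat drift as a failure
});
```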
## Consumers
Any Beach-based agent serious enough to need regression testing. Prototype or throwaway agents can skip.
## Related
- ../session/ — replay API that this package builds on.
- ../inspect/ — dev UI that can show eval run results alongside session traces.
- https://cool-ai.org/docs/design-principles — principle 3.7 (replay and evaluation are first-class).
