# @cool-ai/beach-evals
Optional evaluation primitives for Beach-based agents, built on the event log: a test case schema, a deterministic replay-based runner, reference scorers, and a pluggable `Scorer` interface.
Home: cool-ai.org · Documentation: cool-ai.org/docs
## Install
```bash
npm install --save-dev @cool-ai/beach-evals
```

## API
### EvalRecorder
Records live turn data as a replayable EvalTestCase.
```ts
import { EvalRecorder } from '@cool-ai/beach-evals';

const recorder = new EvalRecorder({ actorConfig, provider });

// Wrap specialist tools to capture their recorded outputs
const wrappedTools = recorder.wrapRegistry(toolDefinitions);

// After a turn completes, record it
const testCase = recorder.record({
  turnId,
  initialMessages,
  specialistExecutions,
  expected: {
    turnState: 'complete',
    parts: [{ partType: 'domain-data' }],
  },
  promptText: systemPrompt,
});
```

### EvalDataset
Stores test cases on disk — one JSON file per case.
```ts
import { EvalDataset } from '@cool-ai/beach-evals';

const dataset = new EvalDataset();
dataset.add(testCase);
await dataset.save('./evals/cases');

// Load from disk
const loaded = await EvalDataset.load('./evals/cases');
```

### EvalRunner
Replays test cases and scores them.
```ts
import {
  EvalRunner,
  TurnStateScorer,
  PartPresentScorer,
  ExactMatchScorer,
  SchemaValidScorer,
} from '@cool-ai/beach-evals';

const runner = new EvalRunner({ actorConfig, provider, toolDefinitions });

const result = await runner.run(testCase, {
  scorers: [
    new TurnStateScorer('complete'),
    new PartPresentScorer('domain-data'),
    new ExactMatchScorer('response', 'Flights found'),
    new SchemaValidScorer((data) => Array.isArray((data as any).flights)),
  ],
  promptHashMismatch: 'warn', // 'warn' | 'error' | 'ignore'
});

console.log(result.passed, result.scores);
```

### Reference scorers
| Scorer | Purpose |
|--------|---------|
| TurnStateScorer | Assert the replay settled with the expected turnState |
| PartPresentScorer | Assert a part of a given type is present (optionally non-empty) |
| ExactMatchScorer | Assert a response part's text matches exactly |
| SchemaValidScorer | Assert a domain-data part passes a caller-supplied predicate |
Bring your own Scorer implementation for LLM-as-judge, rubric-based, or domain-specific checks.
## Rationale
Beach's audit-first claim is undercut without evaluation — auditability without assertions is just logging. Consumers need to:
- Regression-test their agents: "given this input, does the agent produce this output?"
- Compare models: "same input against different models, compare outputs."
- Iterate prompts: "does changing the prompt change behaviour in the ways I expected?"
- Score outputs: equality, regex, schema-validation, rubric, LLM-as-judge.
@cool-ai/beach-evals ships the primitives for this without being a full evaluation platform. Consumers with more sophisticated needs (rubric-based scoring, human-judged evals, continuous CI with dataset management, cross-run dashboards) write their own scorers against Beach's Scorer interface, or integrate with an external eval platform (Mastra evals, LangSmith, Braintrust, PromptFoo, Logfire).
## Concern
@cool-ai/beach-evals provides:
- `EvalTestCase` schema — `{ id, actorId, initialMessages, specialistExecutions, expected, promptHash?, tags? }` (a typed sketch follows this list).
- `EvalRecorder` — records live turn data as an `EvalTestCase`; `wrapRegistry()` intercepts specialist tool calls transparently so executions are captured without changing actor code.
- `EvalRunner` — replays a test case via `@cool-ai/beach-session`'s `TurnReplayer`, runs scorers against the result, returns an `EvalResult`.
- `EvalDataset` — stores test cases on disk (one JSON file per case) with `save(dir)` / `static load(dir)`.
- Reference scorers — `TurnStateScorer`, `PartPresentScorer`, `ExactMatchScorer`, `SchemaValidScorer`.
- `Scorer` interface — consumers plug in custom scorers (LLM-as-judge, rubric-based, domain-specific).
- `hashPrompt(prompt)` — computes a SHA-256 hash over the actor's system prompt text; stored with the recording and compared at replay time to detect prompt drift.
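
For orientation, here is a typed sketch of that test case shape, assembled from the fields listed above; the property types are illustrative assumptions, not the package's published typings.

```ts
// Sketch only: field names come from the list above; the types are assumptions.
interface EvalTestCase {
  id: string;
  actorId: string;
  initialMessages: unknown[];        // conversation state the replay starts from
  specialistExecutions: unknown[];   // recorded specialist tool calls and their outputs
  expected: {
    turnState?: string;                   // e.g. 'complete'
    parts?: Array<{ partType: string }>;  // e.g. [{ partType: 'domain-data' }]
  };
  promptHash?: string;               // SHA-256 of the resolved system prompt at recording time
  tags?: string[];
}
```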
## Not in this package
- CI integration (GitHub Actions, GitLab Pipelines, report uploaders).
- Dashboards, trend charts, cross-run comparisons.
- Human-in-the-loop eval UI.
- LLM-as-judge reference implementation (consumers bring their own using the `Scorer` interface).
For any of these, integrate an external eval platform. Beach ships the primitives; the rest is a platform-choice concern.
## Scorer interface
```ts
interface Scorer {
  name: string;
  score(result: ReplayResult): ScoreResult;
}

interface ScoreResult {
  scorer: string;
  passed: boolean;
  message?: string;
  details?: Record<string, unknown>;
}
```

A consumer's LLM-as-judge scorer would call an LLM with the actual output plus a rubric, and return a `passed` boolean along with the LLM's reasoning in `details`.
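
As a concrete example, here is a minimal custom scorer sketch against this interface, covering the regex-style check mentioned under Rationale. The import paths, the origin of `ReplayResult`, and the injected text accessor are assumptions; adapt them to the actual replay result shape.

```ts
// Assumption: Scorer and ScoreResult are exported by @cool-ai/beach-evals and
// ReplayResult by @cool-ai/beach-session; adjust imports to the real typings.
import type { Scorer, ScoreResult } from '@cool-ai/beach-evals';
import type { ReplayResult } from '@cool-ai/beach-session';

// Passes when the replayed response text matches a regular expression.
class RegexScorer implements Scorer {
  name = 'regex-match';

  constructor(
    private pattern: RegExp,
    // How response text is read off the replay result is not specified here,
    // so the accessor is injected and left to the consumer.
    private getText: (result: ReplayResult) => string | undefined,
  ) {}

  score(result: ReplayResult): ScoreResult {
    const text = this.getText(result) ?? '';
    const passed = this.pattern.test(text);
    return {
      scorer: this.name,
      passed,
      message: passed ? undefined : `response did not match ${this.pattern}`,
      details: { pattern: String(this.pattern) },
    };
  }
}
```

It drops into the same `scorers` array as the reference scorers in the `EvalRunner` example above.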
## Determinism
@cool-ai/beach-evals uses @cool-ai/beach-session's replay primitives to run tests deterministically:
- LLM calls are mocked from recorded outputs during replay, or run live against a specified model for "fresh" runs.
- Tool calls are mocked from recorded results during replay; specialist-scope tools are mocked from their `specialist_execution` log records (see principle 17).
- Both modes are available; the test case declares which.
### Prompt changes invalidate recorded outputs
A subtle trap: recorded LLM outputs are pinned to the prompt that produced them. If the actor's system prompt changes between recording and replay, the recorded output was made against the old prompt. Replaying with the new prompt but mocking from the old output tests nothing about the new prompt's behaviour — it tests only that the old prompt's behaviour is still consistent.
Rules of use:
- Deterministic-behaviour regressions (data shape, schema conformance, turn-state correctness) are valid against recorded outputs as long as the prompt is fixed.
- Prompt-sensitive regressions (response quality, clarification thresholds, tool-choice heuristics) require fresh runs against a live model. A recorded replay won't catch prompt regressions.
- Rerecord when you change prompts. Test runners should detect prompt hash mismatches and warn.
`@cool-ai/beach-evals` computes a hash over the actor's resolved system prompt (snippet + consumer content) and stores it with the recording; a mismatch at replay time surfaces a warning, not a silent pass.
Consumer eval pipelines should interleave both modes — fresh runs for prompt iteration, replay for regression checks on unchanged prompts.
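
A sketch of that check inside a consumer's test runner follows. `hashPrompt`, the test case's `promptHash` field, and the runner's `promptHashMismatch` option are described above; how the current resolved prompt text is assembled is left to the consumer and assumed here.

```ts
import { hashPrompt } from '@cool-ai/beach-evals';

// Assumption: currentSystemPrompt is the actor's resolved system prompt
// (snippet + consumer content), assembled however the consumer does it.
const currentHash = hashPrompt(currentSystemPrompt);

if (testCase.promptHash && testCase.promptHash !== currentHash) {
  // The prompt has drifted since this case was recorded, so replaying its
  // recorded outputs no longer exercises the current prompt. Rerecord the
  // case, or run it fresh against a live model.
  console.warn(`prompt drift detected for ${testCase.id}: rerecord or run fresh`);
}

// The runner can also surface the same mismatch at replay time:
const result = await runner.run(testCase, {
  scorers,
  promptHashMismatch: 'warn', // or 'error' to treat drift as a failure
});
```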
## Consumers
Any Beach-based agent serious enough to need regression testing. Prototype or throwaway agents can skip.
## Related
- ../session/ — replay API that this package builds on.
- ../inspect/ — dev UI that can show eval run results alongside session traces.
- https://cool-ai.org/docs/design-principles — principle 3.7 (replay and evaluation are first-class).
