
@cool-ai/beach-evals

Optional evaluation primitives for Beach-based agents: a test case schema, a deterministic replay-based runner, reference scorers, and a pluggable Scorer interface.

Home: cool-ai.org · Documentation: cool-ai.org/docs

Install

npm install --save-dev @cool-ai/beach-evals

API

EvalRecorder

Records live turn data as a replayable EvalTestCase.

import { EvalRecorder } from '@cool-ai/beach-evals';

const recorder = new EvalRecorder({ actorConfig, provider });

// Wrap specialist tools to capture their recorded outputs
const wrappedTools = recorder.wrapRegistry(toolDefinitions);

// After a turn completes, record it
const testCase = recorder.record({
  turnId,
  initialMessages,
  specialistExecutions,
  expected: {
    turnState: 'complete',
    parts: [{ partType: 'domain-data' }],
  },
  promptText: systemPrompt,
});

EvalDataset

Stores test cases on disk — one JSON file per case.

import { EvalDataset } from '@cool-ai/beach-evals';

const dataset = new EvalDataset();
dataset.add(testCase);
await dataset.save('./evals/cases');

// Load from disk
const loaded = await EvalDataset.load('./evals/cases');

EvalRunner

Replays test cases and scores them.

import { EvalRunner, TurnStateScorer, PartPresentScorer, ExactMatchScorer, SchemaValidScorer } from '@cool-ai/beach-evals';

const runner = new EvalRunner({ actorConfig, provider, toolDefinitions });

const result = await runner.run(testCase, {
  scorers: [
    new TurnStateScorer('complete'),
    new PartPresentScorer('domain-data'),
    new ExactMatchScorer('response', 'Flights found'),
    new SchemaValidScorer((data) => Array.isArray((data as any).flights)),
  ],
  promptHashMismatch: 'warn',  // 'warn' | 'error' | 'ignore'
});

console.log(result.passed, result.scores);

Reference scorers

| Scorer | Purpose |
|--------|---------|
| TurnStateScorer | Assert the replay settled with the expected turnState |
| PartPresentScorer | Assert a part of a given type is present (optionally non-empty) |
| ExactMatchScorer | Assert a response part's text matches exactly |
| SchemaValidScorer | Assert a domain-data part passes a caller-supplied predicate |

Bring your own Scorer implementation for LLM-as-judge, rubric-based, or domain-specific checks.
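
A custom scorer only needs to satisfy the Scorer interface described below. As an illustrative sketch (not package code), here is a domain-specific regex scorer; the type imports and the extractResponseText helper are assumptions about how the exported types and the replay result's response text are exposed.

import type { Scorer, ScoreResult, ReplayResult } from '@cool-ai/beach-evals';

// Sketch only: asserts the response text matches a regular expression.
// extractResponseText is a placeholder for however your ReplayResult
// exposes the response part's text; it is not a package API.
class RegexScorer implements Scorer {
  name = 'regex';

  constructor(
    private pattern: RegExp,
    private extractResponseText: (result: ReplayResult) => string,
  ) {}

  score(result: ReplayResult): ScoreResult {
    const text = this.extractResponseText(result);
    const passed = this.pattern.test(text);
    return {
      scorer: this.name,
      passed,
      message: passed ? 'matched' : `response did not match ${this.pattern}`,
    };
  }
}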

Rationale

Beach's audit-first claim is undercut without evaluation — auditability without assertions is just logging. Consumers need to:

  • Regression-test their agents: "given this input, does the agent produce this output?"
  • Compare models: "same input against different models, compare outputs."
  • Iterate prompts: "does changing the prompt change behaviour in the ways I expected?"
  • Score outputs: equality, regex, schema-validation, rubric, LLM-as-judge.

@cool-ai/beach-evals ships the primitives for this without being a full evaluation platform. Consumers with more sophisticated needs (rubric-based scoring, human-judged evals, continuous CI with dataset management, cross-run dashboards) write their own scorers against Beach's Scorer interface, or integrate with an external eval platform (Mastra evals, LangSmith, Braintrust, PromptFoo, Logfire).

Concern

@cool-ai/beach-evals provides:

  • EvalTestCase schema — { id, actorId, initialMessages, specialistExecutions, expected, promptHash?, tags? }.
  • EvalRecorder — records live turn data as an EvalTestCase; wrapRegistry() intercepts specialist tool calls transparently so executions are captured without changing actor code.
  • EvalRunner — replays a test case via @cool-ai/beach-session's TurnReplayer, runs scorers against the result, returns an EvalResult.
  • EvalDataset — stores test cases on disk (one JSON file per case) with save(dir) / static load(dir).
  • Reference scorers — TurnStateScorer, PartPresentScorer, ExactMatchScorer, SchemaValidScorer.
  • Scorer interface — consumers plug in custom scorers (LLM-as-judge, rubric-based, domain-specific).
  • hashPrompt(prompt) — computes a SHA-256 hash over the actor's system prompt text; stored with the recording and compared at replay time to detect prompt drift.
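
A minimal sketch of using hashPrompt directly to check for drift before replaying, assuming promptHash is stored on the test case as the plain string hashPrompt returns (the runner's promptHashMismatch option performs this check for you):

import { hashPrompt } from '@cool-ai/beach-evals';

// Sketch: compare the current system prompt against the hash captured at
// record time. Assumes testCase.promptHash holds the string hashPrompt returns.
const currentHash = hashPrompt(systemPrompt);
if (testCase.promptHash && testCase.promptHash !== currentHash) {
  console.warn(`Prompt drift detected for ${testCase.id}; recorded outputs were produced against a different prompt.`);
}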

Not in this package

  • CI integration (GitHub Actions, GitLab Pipelines, report uploaders).
  • Dashboards, trend charts, cross-run comparisons.
  • Human-in-the-loop eval UI.
  • LLM-as-judge reference implementation (consumers bring their own using the Scorer interface).

For any of these, integrate an external eval platform. Beach ships the primitives; the rest is a platform-choice concern.

Scorer interface

interface Scorer {
  name: string;
  score(result: ReplayResult): ScoreResult;
}

interface ScoreResult {
  scorer: string;
  passed: boolean;
  message?: string;
  details?: Record<string, unknown>;
}

A consumer's LLM-as-judge scorer would call an LLM with the actual output plus a rubric, and return a passed boolean along with the LLM's reasoning in details.
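
A hedged sketch of such a scorer follows. The judgeWithRubric function and extractResponseText helper are placeholders for a consumer's own LLM client and result access, not package APIs, and the sketch assumes the runner accepts a Promise-returning score(); if the interface is strictly synchronous, run the judge ahead of time and wrap its verdict in a plain scorer.

import type { ReplayResult, ScoreResult } from '@cool-ai/beach-evals';

// Sketch only: an LLM-as-judge scorer. judgeWithRubric stands in for your own
// LLM client; extractResponseText is a placeholder for reading the response
// text off the ReplayResult. Assumes the runner awaits Promise-returning scorers.
class LlmJudgeScorer {
  name = 'llm-judge';

  constructor(
    private rubric: string,
    private judgeWithRubric: (output: string, rubric: string) => Promise<{ pass: boolean; reasoning: string }>,
    private extractResponseText: (result: ReplayResult) => string,
  ) {}

  async score(result: ReplayResult): Promise<ScoreResult> {
    const output = this.extractResponseText(result);
    const verdict = await this.judgeWithRubric(output, this.rubric);
    return {
      scorer: this.name,
      passed: verdict.pass,
      message: verdict.reasoning,
      details: { rubric: this.rubric, output },
    };
  }
}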

Determinism

@cool-ai/beach-evals uses @cool-ai/beach-session's replay primitives to run tests deterministically:

  • LLM calls are mocked from recorded outputs during replay, or run live against a specified model for "fresh" runs.
  • Tool calls are mocked from recorded results during replay; specialist-scope tools are mocked from their specialist_execution log records (see principle 17).
  • Both modes are available; the test case declares which.

Prompt-change invalidates recorded outputs

A subtle trap: recorded LLM outputs are pinned to the prompt that produced them. If the actor's system prompt changes between recording and replay, the recorded output was made against the old prompt. Replaying with the new prompt but mocking from the old output tests nothing about the new prompt's behaviour — it tests only that the old prompt's behaviour is still consistent.

Rules of use:

  • Deterministic-behaviour regressions (data shape, schema conformance, turn-state correctness) are valid against recorded outputs as long as the prompt is fixed.
  • Prompt-sensitive regressions (response quality, clarification thresholds, tool-choice heuristics) require fresh runs against a live model. A recorded replay won't catch prompt regressions.
  • Rerecord when you change prompts. Test runners should detect prompt hash mismatches and warn. @cool-ai/beach-evals computes a hash over the actor's resolved system prompt (snippet + consumer content) and stores it with the recording; mismatch at replay time surfaces a warning, not a silent pass.

Consumer eval pipelines should interleave both modes — fresh runs for prompt iteration, replay for regression checks on unchanged prompts.
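
For the replay half of that pipeline, a regression pass over a stored dataset might look like the sketch below. It uses only the APIs shown above, except that the dataset.cases accessor is an assumption; adapt it to however EvalDataset exposes its loaded cases.

import { EvalDataset, EvalRunner, TurnStateScorer, PartPresentScorer } from '@cool-ai/beach-evals';

// Sketch: replay every recorded case and fail hard on prompt drift, since
// replayed outputs only say something meaningful about an unchanged prompt.
const dataset = await EvalDataset.load('./evals/cases');
const runner = new EvalRunner({ actorConfig, provider, toolDefinitions });

for (const testCase of dataset.cases) {  // assumption: .cases exposes the loaded test cases
  const result = await runner.run(testCase, {
    scorers: [new TurnStateScorer('complete'), new PartPresentScorer('domain-data')],
    promptHashMismatch: 'error',
  });
  if (!result.passed) {
    console.error(`FAIL ${testCase.id}`, result.scores);
  }
}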

Consumers

Any Beach-based agent serious enough to need regression testing. Prototype or throwaway agents can skip.

Related