@verydia/eval
v0.1.0
Evaluation harness for testing and benchmarking Verydia agent flows
# @verydia/eval
Evaluation and regression harness for Verydia flows.
## Dataset format
Eval datasets are JSON or YAML arrays of cases:
```json
[
  {
    "input": { "text": "hello" },
    "expectedOutput": { "text": "hello", "length": 5 },
    "metadata": { "id": "case-1" }
  }
]
```
Each case:
- `input`: value passed directly to `flow.run(input, deps)`.
- `expectedOutput` (optional): deep JSON equality check against the flow output.
- `expectedBehavior` (optional): behavior assertions, e.g. guard activity.
- `metadata` (optional): arbitrary tags (scenario id, notes, etc.).
YAML is supported via a simple parse-then-JSON transform.
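
The case fields listed above can be sketched as a TypeScript type with a minimal shape check. This is an illustration only: `JsonValue`, `EvalCaseSketch`, and `isEvalDatasetSketch` are hypothetical names, not exports of `@verydia/eval`, and the boolean type of the `expectedBehavior` flags is an assumption.

```ts
// Hypothetical types for illustration; the package's real exported types may differ.
type JsonValue =
  | string
  | number
  | boolean
  | null
  | JsonValue[]
  | { [key: string]: JsonValue };

interface EvalCaseSketch {
  input: JsonValue; // passed to flow.run(input, deps)
  expectedOutput?: JsonValue; // compared by deep JSON equality
  // Assumption: the behavior flags are booleans.
  expectedBehavior?: { guardEvaluated?: boolean; noGuardEvaluation?: boolean };
  metadata?: { [key: string]: JsonValue }; // arbitrary tags
}

// Checks only the structural rule stated above:
// a dataset is an array of objects, each carrying an `input` field.
function isEvalDatasetSketch(value: unknown): value is EvalCaseSketch[] {
  return (
    Array.isArray(value) &&
    value.every(
      (c) => typeof c === "object" && c !== null && !Array.isArray(c) && "input" in c
    )
  );
}

const parsed: unknown = JSON.parse(
  '[{"input":{"text":"hello"},"expectedOutput":{"text":"hello","length":5},"metadata":{"id":"case-1"}}]'
);
console.log(isEvalDatasetSketch(parsed)); // true
```

A check like this can catch a malformed dataset file before any flow is run; it does not validate the contents of `expectedBehavior` or `metadata`.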
## API
```ts
import { evaluateFlow, loadEvalDatasetFromFile } from "@verydia/eval";
import type { BuiltFlow, FlowRuntimeDeps } from "@verydia/flow-dsl";
const flow: BuiltFlow<any, any> = /* your flow */;
const deps: FlowRuntimeDeps = { /* memoryStore, llmRegistry, etc. */ };
const dataset = await loadEvalDatasetFromFile("./my-dataset.json");
const result = await evaluateFlow({ flow, dataset, deps });
console.log(result.metrics.passRate);
```
## Assertions
`evaluateFlow` supports basic assertions:
- `expectedOutput`: deep/JSON equality against the actual output.
- `expectedBehavior.guardEvaluated`: expect at least one `policy.evaluate` event.
- `expectedBehavior.noGuardEvaluation`: expect no `policy.evaluate` events.
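
To make the assertion semantics concrete, here is a minimal sketch of how such checks could be written. It is not the package's implementation: `FlowEventSketch`, `deepJsonEqual`, and `checkBehavior` are hypothetical names, and the event shape (an object with a `type` string) is an assumption.

```ts
// Hypothetical event shape: only a `type` string is assumed.
interface FlowEventSketch {
  type: string;
}

// JSON-style deep equality over primitives, arrays, and plain objects.
// Object key order does not matter; array order does.
function deepJsonEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  if (Array.isArray(a) && Array.isArray(b)) {
    return a.length === b.length && a.every((item, i) => deepJsonEqual(item, b[i]));
  }
  const objA = a as Record<string, unknown>;
  const objB = b as Record<string, unknown>;
  const keysA = Object.keys(objA);
  const keysB = Object.keys(objB);
  return (
    keysA.length === keysB.length &&
    keysA.every(
      (k) => Object.prototype.hasOwnProperty.call(objB, k) && deepJsonEqual(objA[k], objB[k])
    )
  );
}

// Evaluates the two behavior assertions against the events recorded for one case.
function checkBehavior(
  expected: { guardEvaluated?: boolean; noGuardEvaluation?: boolean },
  events: FlowEventSketch[]
): boolean {
  const guardEvents = events.filter((e) => e.type === "policy.evaluate").length;
  if (expected.guardEvaluated && guardEvents === 0) return false;
  if (expected.noGuardEvaluation && guardEvents > 0) return false;
  return true;
}

console.log(deepJsonEqual({ text: "hello", length: 5 }, { length: 5, text: "hello" })); // true
console.log(checkBehavior({ guardEvaluated: true }, [{ type: "llm.invoke" }])); // false
```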
## Metrics
For each case, the runner measures:
- Latency (ms)
- Number of `llm.invoke` events (LLM calls)
- Number of `mcp.call` events (tool calls)

Aggregated metrics in `EvalResult.metrics`:
- `totalCases`, `passCount`, `failCount`, `passRate`
- `averageLatencyMs`, `totalLatencyMs`
- `totalLlmCalls`, `totalMcpCalls`
- Optional token and cost estimates (`totalTokensIn`, `totalTokensOut`, `costEstimateTotal`) if you provide an estimator.
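
As a rough illustration of how per-case measurements could roll up into the aggregate fields, consider the sketch below. `CaseResultSketch` and `aggregateMetrics` are hypothetical names, and two details are assumptions rather than documented behavior: `passRate` is treated as a fraction between 0 and 1, and an empty dataset yields 0 for the rate and the average.

```ts
// Hypothetical per-case record; the field names are assumptions.
interface CaseResultSketch {
  passed: boolean;
  latencyMs: number;
  llmCalls: number; // count of llm.invoke events
  mcpCalls: number; // count of mcp.call events
}

interface MetricsSketch {
  totalCases: number;
  passCount: number;
  failCount: number;
  passRate: number; // assumption: fraction in [0, 1]
  averageLatencyMs: number;
  totalLatencyMs: number;
  totalLlmCalls: number;
  totalMcpCalls: number;
}

function aggregateMetrics(results: CaseResultSketch[]): MetricsSketch {
  const totalCases = results.length;
  const passCount = results.filter((r) => r.passed).length;
  const totalLatencyMs = results.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    totalCases,
    passCount,
    failCount: totalCases - passCount,
    // Guard against division by zero for an empty dataset.
    passRate: totalCases === 0 ? 0 : passCount / totalCases,
    averageLatencyMs: totalCases === 0 ? 0 : totalLatencyMs / totalCases,
    totalLatencyMs,
    totalLlmCalls: results.reduce((sum, r) => sum + r.llmCalls, 0),
    totalMcpCalls: results.reduce((sum, r) => sum + r.mcpCalls, 0),
  };
}

const sampleMetrics = aggregateMetrics([
  { passed: true, latencyMs: 100, llmCalls: 1, mcpCalls: 0 },
  { passed: false, latencyMs: 300, llmCalls: 2, mcpCalls: 1 },
]);
console.log(sampleMetrics.passRate, sampleMetrics.averageLatencyMs); // 0.5 200
```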
## CLI: `verydia eval run`
The Verydia CLI exposes a thin wrapper over `@verydia/eval`.
```bash
verydia eval run clinical-triage-dsl --dataset health-eval.json --json eval-report.json
```
This will:
- Load the dataset from `--dataset` (JSON or YAML).
- Run the specified demo flow (`clinical-triage-dsl` or `clinical-triage-graph`).
- Print a summary (cases, pass rate, latency, LLM/MCP calls).
- If `--json` is provided, write the full `EvalResult` to the given file.
Use these reports to compare runs over time and catch regressions in flow behavior, guard activation, or performance.
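
One possible way to compare two runs is sketched here. `ReportMetricsSketch` and `findRegressions` are hypothetical names, the 20% latency tolerance is an arbitrary example value, and the sketch assumes each report exposes the `passRate`, `averageLatencyMs`, and `totalLlmCalls` fields described under Metrics.

```ts
// Subset of the aggregated metrics used for comparison (hypothetical type name).
interface ReportMetricsSketch {
  passRate: number;
  averageLatencyMs: number;
  totalLlmCalls: number;
}

// Returns a human-readable list of regressions; an empty list means none were found.
// maxLatencyIncrease is a relative tolerance (0.2 = 20%), chosen here only as an example.
function findRegressions(
  baseline: ReportMetricsSketch,
  current: ReportMetricsSketch,
  maxLatencyIncrease = 0.2
): string[] {
  const issues: string[] = [];
  if (current.passRate < baseline.passRate) {
    issues.push(`passRate dropped from ${baseline.passRate} to ${current.passRate}`);
  }
  if (current.averageLatencyMs > baseline.averageLatencyMs * (1 + maxLatencyIncrease)) {
    issues.push(
      `averageLatencyMs rose from ${baseline.averageLatencyMs} to ${current.averageLatencyMs}`
    );
  }
  if (current.totalLlmCalls > baseline.totalLlmCalls) {
    issues.push(
      `totalLlmCalls rose from ${baseline.totalLlmCalls} to ${current.totalLlmCalls}`
    );
  }
  return issues;
}

const regressions = findRegressions(
  { passRate: 1, averageLatencyMs: 100, totalLlmCalls: 4 },
  { passRate: 0.75, averageLatencyMs: 150, totalLlmCalls: 4 }
);
console.log(regressions.length); // 2
```

In practice the two inputs would come from the `metrics` field of reports written with `--json`, assuming the file contains the serialized `EvalResult`.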
