@reaatech/agent-eval-harness-suite
v0.1.0
Published
Orchestrated evaluation suite runner with results aggregation for agent-eval-harness
Readme
@reaatech/agent-eval-harness-suite
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Orchestrated evaluation suite runner with results aggregation and run comparison. Executes multi-metric evaluations across trajectory batches with configurable concurrency, YAML-driven configuration, and statistical comparison between runs.
Installation
npm install @reaatech/agent-eval-harness-suiteFeature Overview
- Batch evaluation — run evaluations across hundreds of trajectories with configurable parallel workers
- YAML-driven config — declare metrics, judge models, budget limits, and gate configs in a single file
- Multi-metric scoring — aggregates faithfulness, relevance, tool correctness, cost, latency, coherence, and goal completion into an overall score
- Results aggregation — exports to JSON, JUnit XML, CSV, and Markdown with per-metric breakdowns
- Run comparison — statistical comparison between baseline and candidate runs with regression detection
- Threshold checking — validate results against configurable per-metric thresholds
- Progress tracking — real-time progress callbacks for long-running suites
Quick Start
import { SuiteRunner, parseConfig, createResultsAggregator } from '@reaatech/agent-eval-harness-suite';
import { evaluate } from '@reaatech/agent-eval-harness-trajectory';
import type { Trajectory } from '@reaatech/agent-eval-harness-types';
const config = parseConfig(`
metrics:
- faithfulness
- relevance
- cost
- latency
judge_model: claude-opus
budget_limit: 10.00
parallel_workers: 4
`);
const runner = new SuiteRunner(config);
const result = await runner.run(trajectories, evaluate);
console.log(`Overall: ${result.overallMetrics.overallScore}, Pass rate: ${result.summary.passRate}`);API Reference
Suite Runner
| Name | Type | Description |
|------|------|-------------|
| SuiteRunner | class | Orchestrates batch evaluation with configurable concurrency, timeout, error handling, and progress callbacks |
| createSuiteRunner(config?) | function | Factory: returns a new SuiteRunner instance with optional partial config |
SuiteRunner constructor accepts config?: Partial<SuiteRunnerConfig> and an optional progressCallback. The run(trajectories, evaluator) method executes evaluations in concurrent batches and returns EvalRunResult.
Configuration
| Name | Type | Description |
|------|------|-------------|
| parseConfig(yamlString) | function | Parse a YAML configuration string into a SuiteConfig object |
| validateConfig(config) | function | Validate a SuiteConfig; returns { valid, errors } — checks weights sum to 1.0, threshold ranges, required fields |
| createDefaultConfig(name) | function | Create a default SuiteConfig with all five standard metrics pre-configured |
| mergeConfig(partial) | function | Merge a partial config object with sensible defaults |
| calculateOverallScore(metricScores, config) | function | Weighted composite score from per-metric scores using config weights |
| checkThresholds(metricScores, config) | function | Verify all enabled metric thresholds are met; returns { passed, failures } |
Results Aggregation
| Name | Type | Description |
|------|------|-------------|
| ResultsAggregator | class | Aggregates raw run results into structured breakdowns with export methods |
| createResultsAggregator(config) | function | Factory: returns a new ResultsAggregator for the given SuiteConfig |
ResultsAggregator methods:
| Method | Returns | Description |
|--------|---------|-------------|
| aggregate(runResult) | AggregatedResults | Compute per-metric breakdowns, trajectory results, and summary statistics |
| exportJSON(results) | string | Export aggregated results as formatted JSON |
| exportJUnit(results) | string | Export as JUnit XML for CI test reporters |
| exportCSV(results) | string | Export as CSV with one row per trajectory |
| exportMarkdown(results) | string | Export as Markdown with summary table and per-metric breakdown |
| export(results, format) | string | Export in any supported format ('json' | 'junit' | 'csv' | 'markdown') |
Run Comparison
| Name | Type | Description |
|------|------|-------------|
| RunComparator | class | Statistical comparison engine for two evaluation runs |
| createRunComparator(significanceLevel?, minEffectSize?) | function | Factory with configurable significance alpha (default 0.05) and minimum effect size (default 0.1) |
RunComparator methods:
| Method | Returns | Description |
|--------|---------|-------------|
| compare(baseline, candidate) | RunComparisonResult | Full comparison with metric diffs, statistical significance, regressions, improvements, and verdict |
| generateVisualizationData(comparison) | VisualizationData | Generate bar chart, waterfall, and heatmap data for chart rendering |
Types
| Name | Type | Description |
|------|------|-------------|
| SuiteConfig | interface | Top-level suite configuration: name, metrics, judge, goldenPath, baseline, output |
| MetricConfig | interface | Per-metric config: name, enabled, weight, threshold, config |
| JudgeConfig | interface | Judge settings: model, provider, budgetLimit, calibrationEnabled |
| OutputConfig | interface | Output settings: formats, directory, includeDetails |
| SuiteRunnerConfig | interface | Runtime config: concurrency, continueOnError, timeoutMs, metrics |
| EvalRunResult | interface | Full run result: runId, status, totalTrajectories, trajectoryResults[], overallMetrics, durationMs |
| OverallMetrics | interface | Aggregate scores: overallScore, avgFaithfulness, avgRelevance, toolCorrectnessRate, avgCostPerTask, latencyP50/P90/P99, slaViolations |
| ProgressUpdate | interface | Real-time progress: runId, status, progress, completed, total, currentTrajectory |
| AggregatedResults | interface | Full aggregation: runId, config, overallMetrics, metricBreakdown, trajectoryResults[], summary, timestamp |
| MetricBreakdown | interface | Per-metric stats: name, avgScore, minScore, maxScore, stdDev, passRate, weight |
| TrajectoryResult | interface | Per-trajectory: trajectoryId, overallScore, metricScores, passed, errors |
| SummaryStatistics | interface | Aggregate counts: totalTrajectories, passedTrajectories, failedTrajectories, passRate, overallPassed, durationMs |
| RunComparisonResult | interface | Comparison output: scoreDiff, metricDiffs[], statisticalSignificance, regressions[], improvements[], summary |
| MetricDiff | interface | Per-metric change: metric, baseline, candidate, diff, percentChange, effectSize (Cohen's d) |
| StatisticalResult | interface | Significance test: test, pValue, confidenceInterval, significant, alpha |
Related Packages
| Package | Description | |---------|-------------| | @reaatech/agent-eval-harness-types | Shared domain types and Zod schemas | | @reaatech/agent-eval-harness-trajectory | Trajectory loading, evaluation, and golden comparison | | @reaatech/agent-eval-harness-tool-use | Tool-use validation and schema compliance | | @reaatech/agent-eval-harness-cost | Cost tracking, budgets, and reporting | | @reaatech/agent-eval-harness-latency | Latency monitoring, SLA enforcement, and optimization | | @reaatech/agent-eval-harness-judge | LLM-as-judge with calibration and consensus | | @reaatech/agent-eval-harness-golden | Golden trajectory management and curation | | @reaatech/agent-eval-harness-suite | Suite runner, results aggregation, and comparison | | @reaatech/agent-eval-harness-gate | CI regression gates with JUnit and GitHub output | | @reaatech/agent-eval-harness-mcp-server | MCP server with three-layer tool architecture | | @reaatech/agent-eval-harness-cli | Command-line interface | | @reaatech/agent-eval-harness-observability | OTel tracing, metrics, structured logging, and dashboards |
