@reaatech/agent-eval-harness-suite

v0.1.0

Orchestrated evaluation suite runner with results aggregation for agent-eval-harness

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Orchestrated evaluation suite runner with results aggregation and run comparison. Executes multi-metric evaluations across trajectory batches with configurable concurrency, YAML-driven configuration, and statistical comparison between runs.

Installation

```shell
npm install @reaatech/agent-eval-harness-suite
```

Feature Overview

  • Batch evaluation — run evaluations across hundreds of trajectories with configurable parallel workers
  • YAML-driven config — declare metrics, judge models, budget limits, and gate configs in a single file
  • Multi-metric scoring — aggregates faithfulness, relevance, tool correctness, cost, latency, coherence, and goal completion into an overall score
  • Results aggregation — exports to JSON, JUnit XML, CSV, and Markdown with per-metric breakdowns
  • Run comparison — statistical comparison between baseline and candidate runs with regression detection
  • Threshold checking — validate results against configurable per-metric thresholds
  • Progress tracking — real-time progress callbacks for long-running suites

Quick Start

```typescript
import { SuiteRunner, parseConfig, createResultsAggregator } from '@reaatech/agent-eval-harness-suite';
import { evaluate } from '@reaatech/agent-eval-harness-trajectory';
import type { Trajectory } from '@reaatech/agent-eval-harness-types';

// Parse the YAML suite configuration.
const config = parseConfig(`
metrics:
  - faithfulness
  - relevance
  - cost
  - latency
judge_model: claude-opus
budget_limit: 10.00
parallel_workers: 4
`);

// Placeholder: replace with the trajectories you want to evaluate.
const trajectories: Trajectory[] = [];

// Run every trajectory through the evaluator and print the headline numbers.
const runner = new SuiteRunner(config);
const result = await runner.run(trajectories, evaluate);
console.log(`Overall: ${result.overallMetrics.overallScore}, Pass rate: ${result.summary.passRate}`);
```

API Reference

Suite Runner

| Name | Type | Description |
|------|------|-------------|
| `SuiteRunner` | class | Orchestrates batch evaluation with configurable concurrency, timeout, error handling, and progress callbacks |
| `createSuiteRunner(config?)` | function | Factory: returns a new `SuiteRunner` instance with optional partial config |

The `SuiteRunner` constructor accepts `config?: Partial<SuiteRunnerConfig>` and an optional `progressCallback`. The `run(trajectories, evaluator)` method executes evaluations in concurrent batches and returns an `EvalRunResult`.
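To make "concurrent batches" concrete, here is a minimal, self-contained sketch of the general pattern. It is not the package's implementation: `runInBatches` and `BatchProgress` are hypothetical names invented for illustration, and the sketch ignores the timeout and `continueOnError` options that `SuiteRunnerConfig` lists.

```typescript
// Hypothetical sketch of batch-wise concurrency; not taken from the package source.
interface BatchProgress {
  completed: number;
  total: number;
}

async function runInBatches<T, R>(
  items: readonly T[],
  evaluator: (item: T) => Promise<R>,
  concurrency: number,
  onProgress?: (update: BatchProgress) => void,
): Promise<R[]> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer');
  }
  const results: R[] = [];
  for (let start = 0; start < items.length; start += concurrency) {
    const batch = items.slice(start, start + concurrency);
    // All items in a batch run concurrently; the next batch starts once they all settle.
    const batchResults = await Promise.all(batch.map((item) => evaluator(item)));
    results.push(...batchResults);
    // Report progress once per finished batch.
    onProgress?.({ completed: results.length, total: items.length });
  }
  return results;
}
```

Because `Promise.all` rejects as soon as one evaluation fails, a variant that keeps going after errors would likely use `Promise.allSettled` and record failures per item instead.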

Configuration

| Name | Type | Description |
|------|------|-------------|
| `parseConfig(yamlString)` | function | Parse a YAML configuration string into a `SuiteConfig` object |
| `validateConfig(config)` | function | Validate a `SuiteConfig`; returns `{ valid, errors }`. Checks that weights sum to 1.0, threshold ranges, and required fields |
| `createDefaultConfig(name)` | function | Create a default `SuiteConfig` with all five standard metrics pre-configured |
| `mergeConfig(partial)` | function | Merge a partial config object with sensible defaults |
| `calculateOverallScore(metricScores, config)` | function | Weighted composite score from per-metric scores using config weights |
| `checkThresholds(metricScores, config)` | function | Verify all enabled metric thresholds are met; returns `{ passed, failures }` |
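The weighted composite score and the threshold check can be illustrated with a small standalone sketch. The names below (`MetricSetting`, `weightedScore`, `findThresholdFailures`) are hypothetical and only loosely modeled on the `MetricConfig` fields listed in this README; the package's own `calculateOverallScore` and `checkThresholds` may behave differently, for example in how they treat missing scores.

```typescript
// Hypothetical shape, loosely modeled on the MetricConfig fields (name, enabled, weight, threshold).
interface MetricSetting {
  name: string;
  enabled: boolean;
  weight: number;
  threshold: number;
}

// Weighted average over enabled metrics that have a score.
function weightedScore(scores: Record<string, number>, metrics: MetricSetting[]): number {
  const active = metrics.filter((m) => m.enabled && m.name in scores);
  const totalWeight = active.reduce((sum, m) => sum + m.weight, 0);
  if (totalWeight === 0) return 0;
  // Dividing by the total weight keeps the result in range even if weights do not sum to 1.0.
  return active.reduce((sum, m) => sum + (scores[m.name] ?? 0) * m.weight, 0) / totalWeight;
}

// Names of enabled metrics whose score falls below its threshold.
// Assumption in this sketch: a missing score counts as 0 and therefore fails.
function findThresholdFailures(scores: Record<string, number>, metrics: MetricSetting[]): string[] {
  return metrics
    .filter((m) => m.enabled && (scores[m.name] ?? 0) < m.threshold)
    .map((m) => m.name);
}
```

With weights 0.6 and 0.4 and scores 0.9 and 0.5, the composite is 0.9 × 0.6 + 0.5 × 0.4 = 0.74.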

Results Aggregation

| Name | Type | Description |
|------|------|-------------|
| `ResultsAggregator` | class | Aggregates raw run results into structured breakdowns with export methods |
| `createResultsAggregator(config)` | function | Factory: returns a new `ResultsAggregator` for the given `SuiteConfig` |

ResultsAggregator methods:

| Method | Returns | Description |
|--------|---------|-------------|
| `aggregate(runResult)` | `AggregatedResults` | Compute per-metric breakdowns, trajectory results, and summary statistics |
| `exportJSON(results)` | `string` | Export aggregated results as formatted JSON |
| `exportJUnit(results)` | `string` | Export as JUnit XML for CI test reporters |
| `exportCSV(results)` | `string` | Export as CSV with one row per trajectory |
| `exportMarkdown(results)` | `string` | Export as Markdown with summary table and per-metric breakdown |
| `export(results, format)` | `string` | Export in any supported format (`'json' \| 'junit' \| 'csv' \| 'markdown'`) |
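For readers unfamiliar with JUnit XML, the sketch below shows roughly what such an export looks like when each trajectory becomes one test case. It is an illustration only: `TrajectoryOutcome`, `escapeXml`, and `toJUnitXml` are hypothetical names, the failure message text is invented, and the actual output of `exportJUnit` may be structured differently.

```typescript
// Hypothetical subset of the per-trajectory fields listed under TrajectoryResult.
interface TrajectoryOutcome {
  trajectoryId: string;
  overallScore: number;
  passed: boolean;
}

// Escape characters that are not allowed inside XML attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render one <testcase> per trajectory; failed trajectories get a <failure> child.
function toJUnitXml(suiteName: string, outcomes: TrajectoryOutcome[]): string {
  const failures = outcomes.filter((o) => !o.passed).length;
  const cases = outcomes.map((o) => {
    const open = `  <testcase name="${escapeXml(o.trajectoryId)}" classname="${escapeXml(suiteName)}"`;
    return o.passed
      ? `${open}/>`
      : `${open}>\n    <failure message="score ${o.overallScore.toFixed(2)} below threshold"/>\n  </testcase>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<testsuite name="${escapeXml(suiteName)}" tests="${outcomes.length}" failures="${failures}">`,
    ...cases,
    '</testsuite>',
  ].join('\n');
}
```

Most CI test reporters read the `tests` and `failures` counts on `<testsuite>` and list each `<testcase>` by name, which is why one case per trajectory is a natural mapping.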

Run Comparison

| Name | Type | Description |
|------|------|-------------|
| `RunComparator` | class | Statistical comparison engine for two evaluation runs |
| `createRunComparator(significanceLevel?, minEffectSize?)` | function | Factory with configurable significance alpha (default 0.05) and minimum effect size (default 0.1) |

RunComparator methods:

| Method | Returns | Description |
|--------|---------|-------------|
| `compare(baseline, candidate)` | `RunComparisonResult` | Full comparison with metric diffs, statistical significance, regressions, improvements, and verdict |
| `generateVisualizationData(comparison)` | `VisualizationData` | Generate bar chart, waterfall, and heatmap data for chart rendering |
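The effect size reported per metric is described as Cohen's d. As a rough guide to what that number means, the sketch below computes Cohen's d with a pooled standard deviation, plus a percent change between two means. The function names are hypothetical, and the package may use a different variant of the formula.

```typescript
function mean(xs: readonly number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: readonly number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) * (x - m), 0) / (xs.length - 1);
}

// Cohen's d: difference in means divided by the pooled standard deviation.
// Positive values mean the candidate scored higher than the baseline.
function cohensD(baseline: readonly number[], candidate: readonly number[]): number {
  const n1 = baseline.length;
  const n2 = candidate.length;
  if (n1 < 2 || n2 < 2) {
    throw new RangeError('each sample needs at least two values');
  }
  const pooledVariance =
    ((n1 - 1) * sampleVariance(baseline) + (n2 - 1) * sampleVariance(candidate)) / (n1 + n2 - 2);
  const diff = mean(candidate) - mean(baseline);
  if (pooledVariance === 0) {
    // No spread in either sample: the effect is zero or unbounded.
    return diff === 0 ? 0 : Math.sign(diff) * Infinity;
  }
  return diff / Math.sqrt(pooledVariance);
}

// Relative change of the candidate mean with respect to the baseline mean, in percent.
function percentChange(baselineMean: number, candidateMean: number): number {
  if (baselineMean === 0) return 0; // undefined in general; this sketch returns 0
  return ((candidateMean - baselineMean) / Math.abs(baselineMean)) * 100;
}
```

Under this reading, a comparator with a minimum effect size of 0.1 would presumably treat differences with |d| below 0.1 as too small to count as a regression or improvement.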

Types

| Name | Type | Description |
|------|------|-------------|
| `SuiteConfig` | interface | Top-level suite configuration: `name`, `metrics`, `judge`, `goldenPath`, `baseline`, `output` |
| `MetricConfig` | interface | Per-metric config: `name`, `enabled`, `weight`, `threshold`, `config` |
| `JudgeConfig` | interface | Judge settings: `model`, `provider`, `budgetLimit`, `calibrationEnabled` |
| `OutputConfig` | interface | Output settings: `formats`, `directory`, `includeDetails` |
| `SuiteRunnerConfig` | interface | Runtime config: `concurrency`, `continueOnError`, `timeoutMs`, `metrics` |
| `EvalRunResult` | interface | Full run result: `runId`, `status`, `totalTrajectories`, `trajectoryResults[]`, `overallMetrics`, `durationMs` |
| `OverallMetrics` | interface | Aggregate scores: `overallScore`, `avgFaithfulness`, `avgRelevance`, `toolCorrectnessRate`, `avgCostPerTask`, `latencyP50`/`P90`/`P99`, `slaViolations` |
| `ProgressUpdate` | interface | Real-time progress: `runId`, `status`, `progress`, `completed`, `total`, `currentTrajectory` |
| `AggregatedResults` | interface | Full aggregation: `runId`, `config`, `overallMetrics`, `metricBreakdown`, `trajectoryResults[]`, `summary`, `timestamp` |
| `MetricBreakdown` | interface | Per-metric stats: `name`, `avgScore`, `minScore`, `maxScore`, `stdDev`, `passRate`, `weight` |
| `TrajectoryResult` | interface | Per-trajectory: `trajectoryId`, `overallScore`, `metricScores`, `passed`, `errors` |
| `SummaryStatistics` | interface | Aggregate counts: `totalTrajectories`, `passedTrajectories`, `failedTrajectories`, `passRate`, `overallPassed`, `durationMs` |
| `RunComparisonResult` | interface | Comparison output: `scoreDiff`, `metricDiffs[]`, `statisticalSignificance`, `regressions[]`, `improvements[]`, `summary` |
| `MetricDiff` | interface | Per-metric change: `metric`, `baseline`, `candidate`, `diff`, `percentChange`, `effectSize` (Cohen's d) |
| `StatisticalResult` | interface | Significance test: `test`, `pValue`, `confidenceInterval`, `significant`, `alpha` |
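The `latencyP50`/`P90`/`P99` fields are latency percentiles. As an illustration of what such values represent, here is a nearest-rank percentile sketch; `percentile` is a hypothetical helper, and the package may compute its percentiles with a different method (for example, linear interpolation).

```typescript
// Nearest-rank percentile: the smallest value such that at least p% of samples are <= it.
function percentile(values: readonly number[], p: number): number {
  if (values.length === 0) {
    throw new RangeError('values must not be empty');
  }
  if (p <= 0 || p > 100) {
    throw new RangeError('p must be in the range (0, 100]');
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}
```

For ten latency samples of 10, 20, ..., 100 ms, this method gives a P50 of 50 ms, a P90 of 90 ms, and a P99 of 100 ms.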

Related Packages

| Package | Description |
|---------|-------------|
| `@reaatech/agent-eval-harness-types` | Shared domain types and Zod schemas |
| `@reaatech/agent-eval-harness-trajectory` | Trajectory loading, evaluation, and golden comparison |
| `@reaatech/agent-eval-harness-tool-use` | Tool-use validation and schema compliance |
| `@reaatech/agent-eval-harness-cost` | Cost tracking, budgets, and reporting |
| `@reaatech/agent-eval-harness-latency` | Latency monitoring, SLA enforcement, and optimization |
| `@reaatech/agent-eval-harness-judge` | LLM-as-judge with calibration and consensus |
| `@reaatech/agent-eval-harness-golden` | Golden trajectory management and curation |
| `@reaatech/agent-eval-harness-suite` | Suite runner, results aggregation, and comparison |
| `@reaatech/agent-eval-harness-gate` | CI regression gates with JUnit and GitHub output |
| `@reaatech/agent-eval-harness-mcp-server` | MCP server with three-layer tool architecture |
| `@reaatech/agent-eval-harness-cli` | Command-line interface |
| `@reaatech/agent-eval-harness-observability` | OTel tracing, metrics, structured logging, and dashboards |

License

MIT