@reaatech/agent-eval-harness-gate

v0.1.0

Published

13 days ago

CI regression gates, threshold checks, and JUnit/GitHub integration for agent-eval-harness

reaatech

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

CI/CD regression gates for AI agent evaluation. Define quality, cost, latency, and correctness thresholds that block merges when agents regress. Outputs JUnit XML for test reporters, GitHub Actions annotations for PR comments, and structured JSON for dashboards.

Installation

npm install @reaatech/agent-eval-harness-gate

Feature Overview

Threshold gates — overall quality, faithfulness, relevance, tool correctness, cost, latency, pass rate, SLA violations
Baseline comparison gates — no-regression, improvement-required, statistical significance, per-metric regression
Three presets — standard (quality >= 0.80), strict (quality >= 0.90), lenient (quality >= 0.60)
Custom gate functions — programmatic gates with access to full results and comparison data
CI integration — JUnit XML output, GitHub Actions annotations, step outputs, PR comments
Result caching — configurable TTL caching to speed repeated evaluations

Quick Start

```typescript
import { createGateEngine, getStandardPreset, CIIntegration } from '@reaatech/agent-eval-harness-gate';

// Build an engine from the gates in the standard preset
const engine = createGateEngine(getStandardPreset().gates);

// getAggregatedResults() is a placeholder: load AggregatedResults from your own evaluation run
const results = await getAggregatedResults();
const summary = engine.evaluate(results);

console.log(`Passed: ${summary.overallPassed}, ${summary.passedGates}/${summary.totalGates} gates`);
console.log(`Exit code: ${CIIntegration.getExitCode(summary)}`);
```

API Reference

GateEngine

| Method | Signature | Description |
|--------|-----------|-------------|
| `evaluate` | `(results: AggregatedResults, comparison?: RunComparisonResult) => GateEvaluationSummary` | Evaluate all gates against results |
| `clearCache` | `() => void` | Clear the result cache |
| `getGates` | `() => GateDefinition[]` | Get all registered gates |
| `addGate` | `(gate: GateDefinition) => void` | Add a gate dynamically |
| `removeGate` | `(name: string) => void` | Remove a gate by name |

Factory: createGateEngine(gates: GateDefinition[], cacheTTL?: number): GateEngine
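
The optional `cacheTTL` argument controls how long cached evaluation results are reused. Neither the cache internals nor the unit of `cacheTTL` is documented here, so the following is only a conceptual sketch of time-to-live caching. The `TtlCache` class is hypothetical, the millisecond unit is an assumption, and none of this should be read as the package's implementation.

```typescript
// Conceptual sketch of TTL caching. TtlCache is hypothetical, not a package export.
type Clock = () => number;

class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number; // assumed unit: milliseconds
  private now: Clock;    // injectable clock, which makes expiry testable

  constructor(ttlMs: number, now: Clock = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  clear(): void {
    this.entries.clear(); // similar in spirit to GateEngine.clearCache()
  }
}
```

A short TTL keeps repeated evaluations of the same run cheap within one CI job while limiting the risk of serving stale results across runs.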

Threshold Gate Builders

| Builder | Default | Description |
|---------|---------|-------------|
| `createOverallQualityGate(threshold?)` | 0.8 | Overall quality score >= threshold |
| `createFaithfulnessGate(threshold?)` | 0.8 | Faithfulness score >= threshold |
| `createRelevanceGate(threshold?)` | 0.8 | Relevance score >= threshold |
| `createToolCorrectnessGate(threshold?)` | 0.9 | Tool correctness rate >= threshold |
| `createCostGate(maxCost?)` | 0.05 | Cost per task <= maxCost |
| `createLatencyGate(maxLatencyMs?)` | 5000 | P99 latency <= maxLatencyMs |
| `createPassRateGate(minPassRate?)` | 0.95 | Pass rate >= minPassRate |
| `createSLAViolationsGate(maxViolations?)` | 0 | SLA violations <= maxViolations |
| `buildThresholdGates(config)` | n/a | Build gates from a config object |
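
Each builder above pairs a metric with an operator and a threshold. To illustrate what that comparison amounts to, here is a self-contained sketch using local stand-in types. The `ThresholdGate` and `ThresholdResult` shapes, the `evaluateThreshold` helper, and the metric keys `overallScore` and `costPerTask` are assumptions made for illustration; the package's real evaluation logic may differ.

```typescript
// Local stand-ins that mirror the documented operator set and result fields.
type GateOperator = '>=' | '<=' | '>' | '<' | '==' | '!=';

interface ThresholdGate {
  name: string;
  metric: string; // hypothetical metric key
  operator: GateOperator;
  threshold: number;
}

interface ThresholdResult {
  name: string;
  passed: boolean;
  reason: string;
  actualValue?: number;
  expectedValue?: number;
}

// Compare an observed value against a threshold using the gate's operator.
function compare(actual: number, operator: GateOperator, threshold: number): boolean {
  switch (operator) {
    case '>=': return actual >= threshold;
    case '<=': return actual <= threshold;
    case '>': return actual > threshold;
    case '<': return actual < threshold;
    case '==': return actual === threshold;
    case '!=': return actual !== threshold;
  }
}

// Evaluate one gate against a flat metric map; a missing metric fails the gate.
function evaluateThreshold(gate: ThresholdGate, metrics: Record<string, number>): ThresholdResult {
  const actual = metrics[gate.metric];
  if (actual === undefined) {
    return { name: gate.name, passed: false, reason: `Metric ${gate.metric} not found` };
  }
  return {
    name: gate.name,
    passed: compare(actual, gate.operator, gate.threshold),
    reason: `${gate.metric} = ${actual} (required ${gate.operator} ${gate.threshold})`,
    actualValue: actual,
    expectedValue: gate.threshold,
  };
}

// Defaults taken from the table: quality >= 0.8, cost per task <= 0.05.
const qualityGate: ThresholdGate = { name: 'overall-quality', metric: 'overallScore', operator: '>=', threshold: 0.8 };
const costGate: ThresholdGate = { name: 'cost', metric: 'costPerTask', operator: '<=', threshold: 0.05 };
const observed: Record<string, number> = { overallScore: 0.84, costPerTask: 0.07 };

console.log(evaluateThreshold(qualityGate, observed).passed); // true
console.log(evaluateThreshold(costGate, observed).passed);    // false
```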

Presets

| Preset | Function | Quality | Faithfulness | Relevance | Tool Correctness | Cost | Latency P99 | Pass Rate | SLA Violations |
|--------|----------|---------|--------------|-----------|------------------|------|-------------|-----------|----------------|
| Standard | `getStandardPreset()` | >= 0.80 | >= 0.80 | >= 0.80 | >= 0.90 | <= $0.05 | <= 5000ms | >= 95% | n/a |
| Strict | `getStrictPreset()` | >= 0.90 | >= 0.90 | >= 0.90 | >= 0.95 | <= $0.02 | <= 2000ms | >= 99% | <= 0 |
| Lenient | `getLenientPreset()` | >= 0.60 | >= 0.60 | >= 0.60 | >= 0.70 | <= $0.10 | <= 10000ms | n/a | n/a |
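
To see how the three presets differ in practice, the sketch below encodes the table's thresholds as plain data and checks one made-up run against each. The field names (`minQuality`, `maxCostUsd`, and so on), the `RunMetrics` shape, and the `meetsPreset` helper are hypothetical; only the numeric thresholds come from the table.

```typescript
// Threshold values copied from the preset table; all field names are hypothetical.
interface PresetThresholds {
  minQuality: number;
  minFaithfulness: number;
  minRelevance: number;
  minToolCorrectness: number;
  maxCostUsd: number;
  maxLatencyP99Ms: number;
  minPassRate?: number;      // omitted where the table lists no gate
  maxSlaViolations?: number; // omitted where the table lists no gate
}

const presets: Record<'standard' | 'strict' | 'lenient', PresetThresholds> = {
  standard: { minQuality: 0.8, minFaithfulness: 0.8, minRelevance: 0.8, minToolCorrectness: 0.9,
              maxCostUsd: 0.05, maxLatencyP99Ms: 5000, minPassRate: 0.95 },
  strict:   { minQuality: 0.9, minFaithfulness: 0.9, minRelevance: 0.9, minToolCorrectness: 0.95,
              maxCostUsd: 0.02, maxLatencyP99Ms: 2000, minPassRate: 0.99, maxSlaViolations: 0 },
  lenient:  { minQuality: 0.6, minFaithfulness: 0.6, minRelevance: 0.6, minToolCorrectness: 0.7,
              maxCostUsd: 0.1, maxLatencyP99Ms: 10000 },
};

interface RunMetrics {
  quality: number;
  faithfulness: number;
  relevance: number;
  toolCorrectness: number;
  costUsd: number;
  latencyP99Ms: number;
  passRate: number;
  slaViolations: number;
}

// A run meets a preset only if every threshold that the preset defines holds.
function meetsPreset(m: RunMetrics, p: PresetThresholds): boolean {
  return (
    m.quality >= p.minQuality &&
    m.faithfulness >= p.minFaithfulness &&
    m.relevance >= p.minRelevance &&
    m.toolCorrectness >= p.minToolCorrectness &&
    m.costUsd <= p.maxCostUsd &&
    m.latencyP99Ms <= p.maxLatencyP99Ms &&
    (p.minPassRate === undefined || m.passRate >= p.minPassRate) &&
    (p.maxSlaViolations === undefined || m.slaViolations <= p.maxSlaViolations)
  );
}

// Made-up sample run for illustration.
const sampleRun: RunMetrics = {
  quality: 0.85, faithfulness: 0.82, relevance: 0.88, toolCorrectness: 0.93,
  costUsd: 0.04, latencyP99Ms: 3200, passRate: 0.96, slaViolations: 1,
};

console.log(meetsPreset(sampleRun, presets.standard)); // true
console.log(meetsPreset(sampleRun, presets.strict));   // false
console.log(meetsPreset(sampleRun, presets.lenient));  // true
```

This sample run clears Standard despite one SLA violation because only Strict gates on SLA violations.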

Baseline Gate Builders

| Builder | Description |
|---------|-------------|
| `createNoRegressionGate()` | Fail if any regression detected vs baseline |
| `createImprovementGate(minImprovement?)` | Require minimum overall score improvement |
| `createSignificanceGate(alpha?)` | Require statistical significance (default α=0.05) |
| `createMetricRegressionGate(metric, allowDecline?)` | Per-metric regression gate with tolerance |
| `getBaselinePreset()` | Returns [noRegression, improvement(0)] |
| `getStrictBaselinePreset()` | Returns [noRegression, improvement(0.05), significance(0.05), metricRegression × 3] |
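
The per-metric regression gate takes a tolerance, but its exact semantics are not spelled out here. The sketch below assumes `allowDecline` is an absolute drop on a higher-is-better metric, and uses a hypothetical `MetricDelta` shape in place of the package's `RunComparisonResult`. Treat it as one plausible reading rather than the library's behavior.

```typescript
// Hypothetical comparison shape; the package's RunComparisonResult may differ.
interface MetricDelta {
  metric: string;
  baseline: number;
  current: number;
}

// Assumption: a metric regresses when it drops by more than allowDecline
// (absolute difference, higher values are better).
function hasRegressed(delta: MetricDelta, allowDecline: number = 0): boolean {
  return delta.baseline - delta.current > allowDecline;
}

// No-regression check across all metrics, reporting which ones regressed.
function checkNoRegression(deltas: MetricDelta[], allowDecline: number = 0) {
  const regressed = deltas
    .filter((d) => hasRegressed(d, allowDecline))
    .map((d) => d.metric);
  return { passed: regressed.length === 0, regressed };
}

// Made-up baseline and current scores for illustration.
const deltas: MetricDelta[] = [
  { metric: 'faithfulness', baseline: 0.9, current: 0.85 },
  { metric: 'relevance', baseline: 0.8, current: 0.79 },
  { metric: 'overall', baseline: 0.8, current: 0.83 },
];

console.log(checkNoRegression(deltas).regressed);       // [ 'faithfulness', 'relevance' ]
console.log(checkNoRegression(deltas, 0.02).regressed); // [ 'faithfulness' ]
```

A small tolerance like this absorbs run-to-run noise so that a 0.01 dip does not block a merge, while a 0.05 drop still does.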

CI Integration

| Export | Type | Description |
|--------|------|-------------|
| `CIIntegration` | Class (static methods) | Generate annotations, JUnit XML, PR comments, env vars |
| `writeJUnitReport(summary, filePath)` | Function | Write JUnit XML to file |
| `outputGitHubAnnotations(summary)` | Function | Print GitHub Actions workflow commands |
| `setGitHubOutput(key, value)` | Function | Set GitHub Actions step output |
| `exportForCI(summary, outputDir)` | Function | Export JUnit XML + JSON results + PR comment |

CIIntegration static methods:

| Method | Returns | Description |
|--------|---------|-------------|
| `generateGitHubAnnotations(summary)` | `string` | Workflow command string for GitHub Actions |
| `generateJUnitReport(summary)` | `string` | JUnit XML for test reporters |
| `generatePRComment(summary)` | `string` | Markdown table for PR comments |
| `generateStepSummary(summary)` | `string` | Markdown for GitHub Actions step summary |
| `generateEnvVars(summary)` | `Record<string, string>` | Environment variables for CI |
| `getExitCode(summary)` | `number` | 0 if all passed, 1 otherwise |
| `parseGateConfig(yamlString)` | `GateConfig[]` | Parse gate config from YAML |
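
The exact strings these methods produce are not shown here. As a rough illustration of the idea, the sketch below turns failed gates into GitHub Actions `::error` workflow commands and derives an exit code. The `SummaryLike` type and the `toAnnotations` and `toExitCode` helpers are hypothetical stand-ins, and the annotation title format is an assumption; only the `::error title=...::message` command syntax and its percent-escaping come from GitHub's workflow command format.

```typescript
// Minimal stand-ins for the documented summary fields used here.
interface GateResultLike {
  name: string;
  passed: boolean;
  reason: string;
}

interface SummaryLike {
  overallPassed: boolean;
  results: GateResultLike[];
}

// Escape message text for a workflow command: %, CR and LF must be encoded.
function escapeData(text: string): string {
  return text.replace(/%/g, '%25').replace(/\r/g, '%0D').replace(/\n/g, '%0A');
}

// Property values (such as title) additionally encode ':' and ','.
function escapeProperty(text: string): string {
  return escapeData(text).replace(/:/g, '%3A').replace(/,/g, '%2C');
}

// One ::error command per failed gate; passing gates produce no annotation.
function toAnnotations(summary: SummaryLike): string {
  return summary.results
    .filter((r) => !r.passed)
    .map((r) => `::error title=Gate ${escapeProperty(r.name)} failed::${escapeData(r.reason)}`)
    .join('\n');
}

// Mirrors the documented getExitCode contract: 0 if all passed, 1 otherwise.
function toExitCode(summary: SummaryLike): number {
  return summary.overallPassed ? 0 : 1;
}

const demoSummary: SummaryLike = {
  overallPassed: false,
  results: [
    { name: 'quality', passed: true, reason: 'ok' },
    { name: 'cost', passed: false, reason: 'Cost 0.07 > 0.05\nsee logs' },
  ],
};

console.log(toAnnotations(demoSummary));
// ::error title=Gate cost failed::Cost 0.07 > 0.05%0Asee logs
```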

Types

GateDefinition

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | `string` | yes | Unique gate name |
| `type` | `GateType` | yes | `'threshold' \| 'baseline-comparison' \| 'regression' \| 'custom'` |
| `metric` | `string` | no | Metric to check (for threshold/baseline gates) |
| `operator` | `GateOperator` | no | `'>=' \| '<=' \| '>' \| '<' \| '==' \| '!='` |
| `threshold` | `number` | no | Threshold value for comparison |
| `baseline` | `string` | no | Baseline run ID |
| `allowRegression` | `boolean` | no | Whether regression is allowed |
| `customFn` | `(results, comparison?) => GateResult` | no | Custom evaluation function |
| `enabled` | `boolean` | no | Gate enabled flag (default true) |
| `description` | `string` | no | Human-readable description |

GateResult

| Field | Type | Description |
|-------|------|-------------|
| `name` | `string` | Gate name |
| `passed` | `boolean` | Whether gate passed |
| `reason` | `string` | Pass/fail reason |
| `actualValue` | `number?` | Actual value observed |
| `expectedValue` | `number?` | Expected/threshold value |
| `type` | `GateType` | Gate type |

GateEvaluationSummary

| Field | Type | Description |
|-------|------|-------------|
| `runId` | `string` | Evaluation run ID |
| `totalGates` | `number` | Total gates evaluated |
| `passedGates` | `number` | Gates that passed |
| `failedGates` | `number` | Gates that failed |
| `overallPassed` | `boolean` | All gates passed |
| `results` | `GateResult[]` | Individual gate results |
| `durationMs` | `number` | Evaluation duration |
| `cacheHitRate` | `number?` | Cache hit rate (0-1) |
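
For readers who prefer code to tables, the `GateResult` and `GateEvaluationSummary` fields can be transcribed as TypeScript interfaces. The declarations below are a hand-written transcription of the tables, not the package's source, and the `summarize` helper is hypothetical; it only shows how the count fields relate to the `results` array.

```typescript
// Transcribed from the tables above; the package's actual declarations may differ.
type GateType = 'threshold' | 'baseline-comparison' | 'regression' | 'custom';

interface GateResult {
  name: string;
  passed: boolean;
  reason: string;
  actualValue?: number;
  expectedValue?: number;
  type: GateType;
}

interface GateEvaluationSummary {
  runId: string;
  totalGates: number;
  passedGates: number;
  failedGates: number;
  overallPassed: boolean;
  results: GateResult[];
  durationMs: number;
  cacheHitRate?: number;
}

// Hypothetical helper: derive the count fields from individual gate results.
function summarize(runId: string, results: GateResult[], durationMs: number): GateEvaluationSummary {
  const passedGates = results.filter((r) => r.passed).length;
  return {
    runId,
    totalGates: results.length,
    passedGates,
    failedGates: results.length - passedGates,
    overallPassed: passedGates === results.length, // all gates must pass
    results,
    durationMs,
  };
}

// Made-up gate results for illustration.
const exampleSummary = summarize('run-001', [
  { name: 'overall-quality', passed: true, reason: '0.84 >= 0.8', actualValue: 0.84, expectedValue: 0.8, type: 'threshold' },
  { name: 'cost', passed: false, reason: '0.07 > 0.05', actualValue: 0.07, expectedValue: 0.05, type: 'threshold' },
], 12);

console.log(`${exampleSummary.passedGates}/${exampleSummary.totalGates}`); // 1/2
```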

Advanced Patterns

Custom Programmatic Gates

Custom gates have full access to evaluation results and comparison data, enabling arbitrary logic beyond simple thresholds:

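```typescript
import { createGateEngine, type GateDefinition } from '@reaatech/agent-eval-harness-gate';

const customGate: GateDefinition = {
  name: 'composite-quality',
  type: 'custom',
  description: 'Composite gate combining multiple metrics',
  // comparison is only defined when evaluate() is called with a comparison argument
  customFn: (results, comparison) => {
    const overall = results.overallMetrics.overallScore;
    // Metrics missing from the breakdown fall back to 0
    const faithfulness = results.metricBreakdown.faithfulness?.avgScore ?? 0;
    // Treats the cost score as normalized to [0, 1], so lower cost raises the composite
    const cost = results.metricBreakdown.cost?.avgScore ?? 0;

    // Weighted blend: 50% overall, 30% faithfulness, 20% inverted cost
    const composite = overall * 0.5 + faithfulness * 0.3 + (1 - cost) * 0.2;
    const passed = composite >= 0.75;

    return {
      passed,
      reason: passed
        ? `Composite score ${composite.toFixed(2)} >= 0.75`
        : `Composite score ${composite.toFixed(2)} < 0.75`,
    };
  },
};

// results is the AggregatedResults object from your evaluation run (see Quick Start)
const engine = createGateEngine([customGate]);
const summary = engine.evaluate(results);
```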
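
To make the weighting in the custom gate above concrete, here is the same formula applied to two made-up sets of scores. The `compositeScore` helper and the sample numbers are illustrative only.

```typescript
// Same weighting as the custom gate: 50% overall, 30% faithfulness, 20% inverted cost.
function compositeScore(overall: number, faithfulness: number, cost: number): number {
  return overall * 0.5 + faithfulness * 0.3 + (1 - cost) * 0.2;
}

// 0.82 * 0.5 + 0.9 * 0.3 + (1 - 0.1) * 0.2 = 0.41 + 0.27 + 0.18 = 0.86
const strongRun = compositeScore(0.82, 0.9, 0.1);

// 0.6 * 0.5 + 0.7 * 0.3 + (1 - 0.4) * 0.2 = 0.30 + 0.21 + 0.12 = 0.63
const weakRun = compositeScore(0.6, 0.7, 0.4);

console.log(strongRun.toFixed(2), strongRun >= 0.75); // 0.86 true
console.log(weakRun.toFixed(2), weakRun >= 0.75);     // 0.63 false
```

Because a missing cost metric falls back to 0 in the gate, it contributes the full 0.2 to the composite; a stricter gate might fail outright when a metric is absent.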

CI Pipeline Integration

```yaml
# .github/workflows/eval-gates.yml
name: Agent Evaluation Gates

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # lets the workflow token post the PR comment
    steps:
      - uses: actions/checkout@v4

      # Assumes package.json and a lockfile list the harness and gate packages,
      # so that require() in the last step can resolve them
      - name: Install dependencies
        run: npm ci

      - name: Run evaluation
        run: |
          npx agent-eval-harness eval trajectories/*.jsonl \
            --output results/

      - name: Run regression gates
        id: gates
        run: |
          npx agent-eval-harness gate results/results.json \
            --preset standard \
            --exit-code

      - name: Upload JUnit report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gate-results
          path: results/

      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const { createGateEngine, getStandardPreset, CIIntegration } = require('@reaatech/agent-eval-harness-gate');
            const results = require('./results/results.json');
            // Re-evaluate with the documented engine API to obtain a summary
            const summary = createGateEngine(getStandardPreset().gates).evaluate(results);

            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: CIIntegration.generatePRComment(summary)
            });
```

Related Packages

License

MIT
