@reaatech/agent-eval-harness-types

v0.1.0

Published

3 months ago

Shared domain types and Zod schemas for agent-eval-harness


reaatech

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Canonical TypeScript domain types, Zod schemas, and interfaces for the agent-eval-harness ecosystem. This package is the foundational dependency of every other package in the monorepo.

Installation

```bash
npm install @reaatech/agent-eval-harness-types
# or
pnpm add @reaatech/agent-eval-harness-types
```

Feature Overview

19 domain type interfaces — Turn, Trajectory, EvalResult, JudgeScore, CostBreakdown, LatencyBudget, GoldenTrajectory, RegressionGate, and more
20 Zod schemas — runtime validation for every domain type with full type inference via z.infer
Zero runtime dependencies beyond zod
Dual ESM/CJS output — works with import and require
Golden trajectory markers — golden, expected, and quality_notes fields on every turn
CI gate types — threshold, baseline-comparison, and distribution gates with regression tracking
Suite runner types — configuration, run status, comparison, and metric regression interfaces

Quick Start

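```typescript
import { TurnSchema, type Trajectory, type EvalResult } from '@reaatech/agent-eval-harness-types';

// Validate a raw turn at runtime; parse() throws if the input does not match the schema.
const turn = TurnSchema.parse({
  turn_id: 1,
  role: 'user',
  content: 'Hello',
  timestamp: '2026-04-15T00:00:00Z',
});

// The parsed turn is fully typed, so it can be placed directly into a Trajectory.
const trajectory: Trajectory = { turns: [turn], metadata: { total_turns: 1 } };
```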

API Reference

Core Types

| Name | Type | Description |
|------|------|-------------|
| Turn | interface | Single turn in a trajectory with role, content, timestamp, and optional tool calls, latency, and cost |
| ToolCall | interface | Tool invocation with name, arguments, and optional result |
| CostData | interface | Token usage and cost for a single turn |
| Trajectory | interface | Complete agent execution with turns array and optional metadata |
| EvalResult | interface | Evaluation result with overall score, per-metric scores, and issues |
| EvalIssue | interface | Issue found during evaluation with type, severity, and description |

Judge Types

| Name | Type | Description |
|------|------|-------------|
| JudgeScore | interface | LLM judge scoring result with score, explanation, confidence, and calibration status |

Cost Types

| Name | Type | Description |
|------|------|-------------|
| CostBreakdown | interface | Full cost breakdown for a trajectory with LLM, tool, and per-turn costs |
| TurnCost | interface | Cost breakdown for a single turn with token counts |
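How per-turn costs might roll up into a trajectory-level breakdown can be sketched as follows. The `SketchTurnCost` and `SketchCostBreakdown` interfaces, their field names, and the `summarizeCosts` helper are all assumptions made for illustration; the exported `TurnCost` and `CostBreakdown` types may be shaped differently, so check the package's type definitions before relying on any field.

```typescript
// Hypothetical shapes for illustration only; not the package's actual interfaces.
interface SketchTurnCost {
  turn_id: number;
  input_tokens: number;
  output_tokens: number;
  llm_cost_usd: number;
  tool_cost_usd: number;
}

interface SketchCostBreakdown {
  llm_cost_usd: number;
  tool_cost_usd: number;
  total_cost_usd: number;
  per_turn: SketchTurnCost[];
}

// Sum per-turn LLM and tool costs into one trajectory-level breakdown.
function summarizeCosts(turns: SketchTurnCost[]): SketchCostBreakdown {
  const llm = turns.reduce((sum, t) => sum + t.llm_cost_usd, 0);
  const tool = turns.reduce((sum, t) => sum + t.tool_cost_usd, 0);
  return {
    llm_cost_usd: llm,
    tool_cost_usd: tool,
    total_cost_usd: llm + tool,
    per_turn: turns,
  };
}

const breakdown = summarizeCosts([
  { turn_id: 1, input_tokens: 120, output_tokens: 40, llm_cost_usd: 0.25, tool_cost_usd: 0 },
  { turn_id: 2, input_tokens: 300, output_tokens: 90, llm_cost_usd: 0.5, tool_cost_usd: 0.125 },
]);

console.log(breakdown.total_cost_usd);
```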

Latency Types

| Name | Type | Description |
|------|------|-------------|
| LatencyBudget | interface | Latency SLA budget with P50, P90, P99 thresholds and component breakdowns |
| LatencyResult | interface | Latency measurement result with percentiles, violations, and SLA status |
| LatencyViolation | interface | SLA violation record with turn ID, actual vs threshold values |
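To make the relationship between a budget, measured percentiles, and violations concrete, here is a minimal sketch. It uses the nearest-rank percentile method and locally defined `SketchLatencyBudget` and `SketchViolation` types; these names, their fields, and the per-percentile violation record are assumptions rather than the package's actual definitions (the real `LatencyViolation` is described as carrying a turn ID).

```typescript
// Hypothetical shapes for illustration only; not the package's actual interfaces.
interface SketchLatencyBudget {
  p50_ms: number;
  p90_ms: number;
  p99_ms: number;
}

interface SketchViolation {
  percentile: 'p50' | 'p90' | 'p99';
  actual_ms: number;
  threshold_ms: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Compare each measured percentile with its threshold and collect the ones that exceed it.
function checkBudget(samples: number[], budget: SketchLatencyBudget): SketchViolation[] {
  const checks: Array<[SketchViolation['percentile'], number, number]> = [
    ['p50', 50, budget.p50_ms],
    ['p90', 90, budget.p90_ms],
    ['p99', 99, budget.p99_ms],
  ];
  const found: SketchViolation[] = [];
  for (const [name, p, threshold] of checks) {
    const actual = percentile(samples, p);
    if (actual > threshold) {
      found.push({ percentile: name, actual_ms: actual, threshold_ms: threshold });
    }
  }
  return found;
}

const latencies = [100, 120, 150, 200, 250, 300, 400, 500, 800, 2000];
const violations = checkBudget(latencies, { p50_ms: 300, p90_ms: 1000, p99_ms: 1500 });

console.log(violations);
```

In this sample only the P99 threshold is exceeded, because the single 2000 ms outlier dominates the tail while the median and P90 stay within budget.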

Golden Types

| Name | Type | Description |
|------|------|-------------|
| GoldenTrajectory | interface | Golden reference trajectory with versioning and quality markers |

Gate Types

| Name | Type | Description |
|------|------|-------------|
| RegressionGate | interface | Gate definition with threshold, baseline-comparison, or distribution types |
| GateResult | interface | Single gate evaluation result with pass/fail and actual vs expected values |
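The difference between a threshold gate and a baseline-comparison gate can be illustrated with a small sketch. The `SketchGate` union, the `SketchGateResult` type, the `min` and `max_drop` fields, and the metric names are hypothetical; they are not taken from `RegressionGate` or `GateResult`, and distribution gates are left out entirely.

```typescript
// Hypothetical shapes for illustration only; not the package's actual interfaces.
type SketchGate =
  | { type: 'threshold'; metric: string; min: number }
  | { type: 'baseline-comparison'; metric: string; max_drop: number };

interface SketchGateResult {
  metric: string;
  passed: boolean;
  actual: number;
  expected: number;
}

// A threshold gate compares against a fixed floor; a baseline-comparison gate
// derives its floor from the baseline run minus an allowed drop.
function evaluateGate(
  gate: SketchGate,
  candidate: Record<string, number>,
  baseline: Record<string, number>,
): SketchGateResult {
  const actual = candidate[gate.metric];
  if (actual === undefined) throw new Error(`missing candidate metric: ${gate.metric}`);
  if (gate.type === 'threshold') {
    return { metric: gate.metric, passed: actual >= gate.min, actual, expected: gate.min };
  }
  const base = baseline[gate.metric];
  if (base === undefined) throw new Error(`missing baseline metric: ${gate.metric}`);
  const expected = base - gate.max_drop;
  return { metric: gate.metric, passed: actual >= expected, actual, expected };
}

const gates: SketchGate[] = [
  { type: 'threshold', metric: 'accuracy', min: 0.5 },
  { type: 'baseline-comparison', metric: 'accuracy', max_drop: 0.0625 },
];
const candidateMetrics = { accuracy: 0.5 };
const baselineMetrics = { accuracy: 0.625 };

const gateResults = gates.map((g) => evaluateGate(g, candidateMetrics, baselineMetrics));
console.log(gateResults);
```

Here the same candidate score clears the fixed threshold but fails the baseline comparison, which is the case a regression gate is meant to catch.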

Suite Types

| Name | Type | Description |
|------|------|-------------|
| EvalSuiteConfig | interface | Suite configuration with metrics, judge model, budgets, gates, and parallelism |
| EvalRunStatus | interface | Suite run progress with status, completion counts, and timing |
| RunComparison | interface | Comparison of two evaluation runs with metric diffs and significance testing |
| MetricRegression | interface | Single regression with baseline and candidate values and change percentage |
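As a rough illustration of what a metric regression record captures, the sketch below computes the change percentage between a baseline and a candidate run. The `SketchMetricRegression` type, its field names, the `findRegressions` helper, and the assumption that higher values are better for every metric are all illustrative; none of them come from the package, and significance testing is not covered.

```typescript
// Hypothetical shape for illustration only; not the package's actual interface.
interface SketchMetricRegression {
  metric: string;
  baseline: number;
  candidate: number;
  change_pct: number;
}

// Report metrics whose candidate value dropped by more than tolerancePct
// relative to the baseline. Assumes higher is better for every metric.
function findRegressions(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  tolerancePct: number,
): SketchMetricRegression[] {
  const regressions: SketchMetricRegression[] = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const cand = candidate[metric];
    // Skip metrics missing from the candidate and baselines of zero (undefined percentage).
    if (cand === undefined || base === 0) continue;
    const changePct = ((cand - base) / base) * 100;
    if (changePct < -tolerancePct) {
      regressions.push({ metric, baseline: base, candidate: cand, change_pct: changePct });
    }
  }
  return regressions;
}

const regressions = findRegressions(
  { accuracy: 0.5, recall: 0.75 },
  { accuracy: 0.375, recall: 0.75 },
  5,
);

console.log(regressions);
```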

Schemas

| Name | Type | Description |
|------|------|-------------|
| ToolCallSchema | ZodObject | Validates tool invocation structure |
| CostDataSchema | ZodObject | Validates token counts and cost data |
| TurnSchema | ZodObject | Validates turn structure with optional tool calls, latency, and golden markers |
| TrajectoryMetadataSchema | ZodObject | Validates trajectory metadata |
| TrajectorySchema | ZodObject | Validates complete trajectory (minimum one turn, optional metadata) |
| EvalIssueSchema | ZodObject | Validates evaluation issue records |
| EvalResultSchema | ZodObject | Validates evaluation results with metrics and issues |
| JudgeScoreSchema | ZodObject | Validates judge scoring output |
| CostBreakdownSchema | ZodObject | Validates cost breakdowns with per-turn cost arrays |
| LatencyBudgetSchema | ZodObject | Validates latency budget configuration |
| LatencyViolationSchema | ZodObject | Validates latency SLA violations |
| LatencyResultSchema | ZodObject | Validates latency measurement results |
| QualityMarkersSchema | ZodObject | Validates golden trajectory quality markers |
| GoldenTrajectorySchema | ZodObject | Validates golden trajectories with nested trajectory and quality markers |
| RegressionGateSchema | ZodObject | Validates regression gate definitions |
| GateResultSchema | ZodObject | Validates gate evaluation results |
| EvalSuiteConfigSchema | ZodObject | Validates suite configuration with nested latency budget and gates |
| EvalRunStatusSchema | ZodObject | Validates suite run status |
| MetricRegressionSchema | ZodObject | Validates metric regression records |
| RunComparisonSchema | ZodObject | Validates run comparison results with statistical significance arrays |

Related Packages

| Package | Description |
|---------|-------------|
| @reaatech/agent-eval-harness-types | Shared domain types and Zod schemas |
| @reaatech/agent-eval-harness-trajectory | Trajectory loading, evaluation, and golden comparison |
| @reaatech/agent-eval-harness-tool-use | Tool-use validation and schema compliance |
| @reaatech/agent-eval-harness-cost | Cost tracking, budgets, and reporting |
| @reaatech/agent-eval-harness-latency | Latency monitoring, SLA enforcement, and optimization |
| @reaatech/agent-eval-harness-judge | LLM-as-judge with calibration and consensus |
| @reaatech/agent-eval-harness-golden | Golden trajectory management and curation |
| @reaatech/agent-eval-harness-suite | Suite runner, results aggregation, and comparison |
| @reaatech/agent-eval-harness-gate | CI regression gates with JUnit and GitHub output |
| @reaatech/agent-eval-harness-mcp-server | MCP server with three-layer tool architecture |
| @reaatech/agent-eval-harness-cli | Command-line interface |
| @reaatech/agent-eval-harness-observability | OTel tracing, metrics, structured logging, and dashboards |

License

MIT