@reaatech/agent-eval-harness-types
v0.1.0
Published
Shared domain types and Zod schemas for agent-eval-harness
Readme
@reaatech/agent-eval-harness-types
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Canonical TypeScript domain types, Zod schemas, and interfaces for the agent-eval-harness ecosystem. This package is the foundational dependency of every other package in the monorepo.
Installation
npm install @reaatech/agent-eval-harness-types
# or
pnpm add @reaatech/agent-eval-harness-typesFeature Overview
- 19 domain type interfaces —
Turn,Trajectory,EvalResult,JudgeScore,CostBreakdown,LatencyBudget,GoldenTrajectory,RegressionGate, and more - 20 Zod schemas — runtime validation for every domain type with full type inference via
z.infer - Zero runtime dependencies beyond
zod - Dual ESM/CJS output — works with
importandrequire - Golden trajectory markers —
golden,expected, andquality_notesfields on every turn - CI gate types — threshold, baseline-comparison, and distribution gates with regression tracking
- Suite runner types — configuration, run status, comparison, and metric regression interfaces
Quick Start
import { TurnSchema, type Trajectory, type EvalResult } from '@reaatech/agent-eval-harness-types';
const turn = TurnSchema.parse({
turn_id: 1,
role: 'user',
content: 'Hello',
timestamp: '2026-04-15T00:00:00Z',
});
const trajectory: Trajectory = { turns: [turn], metadata: { total_turns: 1 } };API Reference
Core Types
| Name | Type | Description |
|------|------|-------------|
| Turn | interface | Single turn in a trajectory with role, content, timestamp, and optional tool calls, latency, and cost |
| ToolCall | interface | Tool invocation with name, arguments, and optional result |
| CostData | interface | Token usage and cost for a single turn |
| Trajectory | interface | Complete agent execution with turns array and optional metadata |
| EvalResult | interface | Evaluation result with overall score, per-metric scores, and issues |
| EvalIssue | interface | Issue found during evaluation with type, severity, and description |
Judge Types
| Name | Type | Description |
|------|------|-------------|
| JudgeScore | interface | LLM judge scoring result with score, explanation, confidence, and calibration status |
Cost Types
| Name | Type | Description |
|------|------|-------------|
| CostBreakdown | interface | Full cost breakdown for a trajectory with LLM, tool, and per-turn costs |
| TurnCost | interface | Cost breakdown for a single turn with token counts |
Latency Types
| Name | Type | Description |
|------|------|-------------|
| LatencyBudget | interface | Latency SLA budget with P50, P90, P99 thresholds and component breakdowns |
| LatencyResult | interface | Latency measurement result with percentiles, violations, and SLA status |
| LatencyViolation | interface | SLA violation record with turn ID, actual vs threshold values |
Golden Types
| Name | Type | Description |
|------|------|-------------|
| GoldenTrajectory | interface | Golden reference trajectory with versioning and quality markers |
Gate Types
| Name | Type | Description |
|------|------|-------------|
| RegressionGate | interface | Gate definition with threshold, baseline-comparison, or distribution types |
| GateResult | interface | Single gate evaluation result with pass/fail and actual vs expected values |
Suite Types
| Name | Type | Description |
|------|------|-------------|
| EvalSuiteConfig | interface | Suite configuration with metrics, judge model, budgets, gates, and parallelism |
| EvalRunStatus | interface | Suite run progress with status, completion counts, and timing |
| RunComparison | interface | Comparison of two evaluation runs with metric diffs and significance testing |
| MetricRegression | interface | Single regression with baseline and candidate values and change percentage |
Schemas
| Name | Type | Description |
|------|------|-------------|
| ToolCallSchema | ZodObject | Validates tool invocation structure |
| CostDataSchema | ZodObject | Validates token counts and cost data |
| TurnSchema | ZodObject | Validates turn structure with optional tool calls, latency, and golden markers |
| TrajectoryMetadataSchema | ZodObject | Validates trajectory metadata |
| TrajectorySchema | ZodObject | Validates complete trajectory (minimum one turn, optional metadata) |
| EvalIssueSchema | ZodObject | Validates evaluation issue records |
| EvalResultSchema | ZodObject | Validates evaluation results with metrics and issues |
| JudgeScoreSchema | ZodObject | Validates judge scoring output |
| CostBreakdownSchema | ZodObject | Validates cost breakdowns with per-turn cost arrays |
| LatencyBudgetSchema | ZodObject | Validates latency budget configuration |
| LatencyViolationSchema | ZodObject | Validates latency SLA violations |
| LatencyResultSchema | ZodObject | Validates latency measurement results |
| QualityMarkersSchema | ZodObject | Validates golden trajectory quality markers |
| GoldenTrajectorySchema | ZodObject | Validates golden trajectories with nested trajectory and quality markers |
| RegressionGateSchema | ZodObject | Validates regression gate definitions |
| GateResultSchema | ZodObject | Validates gate evaluation results |
| EvalSuiteConfigSchema | ZodObject | Validates suite configuration with nested latency budget and gates |
| EvalRunStatusSchema | ZodObject | Validates suite run status |
| MetricRegressionSchema | ZodObject | Validates metric regression records |
| RunComparisonSchema | ZodObject | Validates run comparison results with statistical significance arrays |
Related Packages
| Package | Description | |---------|-------------| | @reaatech/agent-eval-harness-types | Shared domain types and Zod schemas | | @reaatech/agent-eval-harness-trajectory | Trajectory loading, evaluation, and golden comparison | | @reaatech/agent-eval-harness-tool-use | Tool-use validation and schema compliance | | @reaatech/agent-eval-harness-cost | Cost tracking, budgets, and reporting | | @reaatech/agent-eval-harness-latency | Latency monitoring, SLA enforcement, and optimization | | @reaatech/agent-eval-harness-judge | LLM-as-judge with calibration and consensus | | @reaatech/agent-eval-harness-golden | Golden trajectory management and curation | | @reaatech/agent-eval-harness-suite | Suite runner, results aggregation, and comparison | | @reaatech/agent-eval-harness-gate | CI regression gates with JUnit and GitHub output | | @reaatech/agent-eval-harness-mcp-server | MCP server with three-layer tool architecture | | @reaatech/agent-eval-harness-cli | Command-line interface | | @reaatech/agent-eval-harness-observability | OTel tracing, metrics, structured logging, and dashboards |
