@reaatech/agent-eval-harness-trajectory
v0.1.0
Trajectory loading, evaluation, and comparison for agent-eval-harness
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Trajectory loading, multi-turn evaluation, and golden-comparison utilities for agent conversation analysis. Parses JSONL turn files, scores coherence and goal completion, and diffs candidate trajectories against golden references.
Installation
npm install @reaatech/agent-eval-harness-trajectory
# or
pnpm add @reaatech/agent-eval-harness-trajectory
Feature Overview
- JSONL loader: parse, validate, and serialize trajectory files with full Zod schema validation via @reaatech/agent-eval-harness-types
- Multi-turn evaluation: coherence analysis, goal completion scoring, and conversation flow metrics
- Golden comparison: diff candidate trajectories against a reference, with regression detection and improvement tracking
- Directory batch load: load and validate all .jsonl files in a directory in a single call
- Custom error types: TrajectoryLoadError with file path and cause tracking for precise debugging
Quick Start
import { loadFromContent, evaluate, compare } from '@reaatech/agent-eval-harness-trajectory';
import type { Trajectory } from '@reaatech/agent-eval-harness-types';
const jsonl =
'{"turn_id":1,"role":"user","content":"Reset password","timestamp":"2026-04-15T00:00:00Z"}\n' +
'{"turn_id":1,"role":"agent","content":"What\'s your email?","tool_calls":[],"timestamp":"2026-04-15T00:00:01Z"}';
const trajectory = loadFromContent(jsonl);
const result = evaluate(trajectory);
console.log(`Score: ${result.overall_score}, Coherence: ${result.metrics.coherence}`);
API Reference
Loader Functions
| Name | Type | Description |
|------|------|-------------|
| parseTurn(line, lineNumber) | (string, number) => Turn | Parse a single JSONL line into a validated Turn object |
| loadFromContent(content, options?) | (string, LoadOptions?) => Trajectory | Load a trajectory from a JSONL content string |
| loadFromFile(filePath, options?) | (string, LoadOptions?) => Promise<Trajectory> | Load a trajectory from a .jsonl file on disk |
| loadFromDirectory(dirPath, options?) | (string, LoadOptions?) => Promise<Trajectory[]> | Load all .jsonl files in a directory |
| serializeToJsonl(trajectory) | (Trajectory) => string | Serialize a trajectory to JSONL string format |
| saveToFile(trajectory, filePath) | (Trajectory, string) => Promise<void> | Save a trajectory to a .jsonl file |
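To make the loader contract above more concrete, the following is a minimal, self-contained sketch of line-by-line JSONL parsing with line-number-aware errors and a serialize round trip. It does not import the package and is not its implementation: the SketchTurn shape is inferred only from the fields shown in Quick Start, the names parseTurnSketch, loadFromContentSketch, and serializeSketch are hypothetical, and the hand-written checks stand in for the Zod validation the package performs.

```typescript
// Hypothetical turn shape, inferred from the Quick Start fields only.
interface SketchTurn {
  turn_id: number;
  role: string;
  content: string;
  timestamp: string;
  tool_calls?: unknown[];
}

// Parse one JSONL line; the line number is carried into every error message.
function parseTurnSketch(line: string, lineNumber: number): SketchTurn {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    throw new Error(`Line ${lineNumber}: invalid JSON`);
  }
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error(`Line ${lineNumber}: expected a JSON object`);
  }
  const { turn_id, role, content, timestamp, tool_calls } = raw as Record<string, unknown>;
  if (
    typeof turn_id !== "number" ||
    typeof role !== "string" ||
    typeof content !== "string" ||
    typeof timestamp !== "string"
  ) {
    throw new Error(`Line ${lineNumber}: missing or mistyped required field`);
  }
  let toolCalls: unknown[] | undefined;
  if (tool_calls !== undefined) {
    if (!Array.isArray(tool_calls)) {
      throw new Error(`Line ${lineNumber}: tool_calls must be an array`);
    }
    toolCalls = tool_calls;
  }
  const turn: SketchTurn = { turn_id, role, content, timestamp };
  return toolCalls === undefined ? turn : { ...turn, tool_calls: toolCalls };
}

// Split content into lines, skip blank ones, and parse the rest in order.
function loadFromContentSketch(content: string): SketchTurn[] {
  const turns: SketchTurn[] = [];
  content.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    turns.push(parseTurnSketch(line, index + 1));
  });
  return turns;
}

// One JSON object per line, joined with newlines.
function serializeSketch(turns: SketchTurn[]): string {
  return turns.map((turn) => JSON.stringify(turn)).join("\n");
}

// Illustrative input (not taken from the package's fixtures).
const sketchInput = [
  '{"turn_id":1,"role":"user","content":"Reset password","timestamp":"2026-04-15T00:00:00Z"}',
  "",
  '{"turn_id":2,"role":"agent","content":"What is your email?","tool_calls":[],"timestamp":"2026-04-15T00:00:01Z"}',
].join("\n");

const sketchTurns = loadFromContentSketch(sketchInput);
const sketchRoundTrip = loadFromContentSketch(serializeSketch(sketchTurns));
```

Reporting the 1-based line number is what makes a malformed record in a long trajectory file easy to locate; the real parseTurn accepts a lineNumber argument, presumably for the same reason.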
Loader Types
| Name | Type | Description |
|------|------|-------------|
| LoadOptions | interface | Options with validate (boolean, default true) and generateId (boolean, default true) |
| TrajectoryLoadError | class extends Error | Custom error with cause, filePath, and descriptive message |
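As an illustration of the error type described above, here is a small sketch of an Error subclass that records a file path and the underlying cause. SketchLoadError, wrapLoadFailure, and the path runs/example.jsonl are hypothetical names used only for this example; the constructor signature of the real TrajectoryLoadError is not documented here and may differ.

```typescript
// Hypothetical stand-in for an error that tracks the file path and root cause.
class SketchLoadError extends Error {
  readonly filePath?: string;
  readonly cause?: unknown;

  constructor(message: string, filePath?: string, cause?: unknown) {
    super(filePath ? `${message} (file: ${filePath})` : message);
    this.name = "SketchLoadError";
    this.filePath = filePath;
    this.cause = cause;
    // Keeps instanceof checks working when compiled to older targets.
    Object.setPrototypeOf(this, SketchLoadError.prototype);
  }
}

// Run an action and rethrow any failure with the file path attached.
function wrapLoadFailure<T>(filePath: string, action: () => T): T {
  try {
    return action();
  } catch (err) {
    throw new SketchLoadError("Failed to load trajectory", filePath, err);
  }
}

let caughtLoadError: SketchLoadError | undefined;
try {
  // The malformed JSON triggers a SyntaxError, which becomes the cause.
  wrapLoadFailure("runs/example.jsonl", () => JSON.parse("{not json"));
} catch (err) {
  if (err instanceof SketchLoadError) {
    caughtLoadError = err;
  }
}
```

Callers can branch on instanceof to separate load failures from other errors, then inspect filePath and cause to find the offending file and the original parse or validation problem.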
Evaluator Functions
| Name | Type | Description |
|------|------|-------------|
| evaluate(trajectory, options?) | (Trajectory, EvaluateOptions?) => EvalResult | Full trajectory evaluation returning overall score, per-metric scores, and issues |
| analyzeCoherence(trajectory) | (Trajectory) => CoherenceResult | Multi-turn coherence analysis with per-transition scoring |
| analyzeGoalCompletion(trajectory) | (Trajectory) => GoalCompletionResult | Determine if the agent completed the user's goal with confidence scoring |
| analyzeConversationFlow(trajectory) | (Trajectory) => FlowAnalysis | Conversation flow analysis with topic changes, interruptions, and flow score |
Evaluator Types
| Name | Type | Description |
|------|------|-------------|
| EvaluateOptions | interface | Options with checkCoherence, checkGoalCompletion, analyzeFlow, and coherenceThreshold |
| CoherenceResult | interface | Coherence score, issues list, and per-transition analysis |
| TurnTransition | interface | Single turn transition with from, to, coherent, and optional reason |
| GoalCompletionResult | interface | Completion status, confidence, evidence array, and unresolved turn IDs |
| FlowAnalysis | interface | Flow metrics: avg turns per topic, topic changes, interruptions, and flow score |
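The shape of a per-transition coherence result can be sketched as follows. This is a toy under stated assumptions, not the package's scoring: the source does not describe how coherence is computed, so the heuristic here (a transition is incoherent when the reply is empty or the same role speaks twice in a row) is a placeholder, and all type names, the use of turn IDs for from and to, and the 0.7 threshold are hypothetical.

```typescript
// Hypothetical minimal turn; only the fields this sketch needs.
interface FlowTurn {
  turn_id: number;
  role: string;
  content: string;
}

// Mirrors the documented fields: from, to, coherent, optional reason.
interface TransitionSketch {
  from: number;
  to: number;
  coherent: boolean;
  reason?: string;
}

interface CoherenceSketch {
  score: number; // fraction of coherent transitions, in [0, 1]
  issues: string[];
  transitions: TransitionSketch[];
}

function analyzeCoherenceSketch(turns: FlowTurn[]): CoherenceSketch {
  const transitions: TransitionSketch[] = [];
  for (let i = 1; i < turns.length; i++) {
    const prev = turns[i - 1];
    const next = turns[i];
    // Placeholder heuristic (an assumption, not the package's rule).
    let reason: string | undefined;
    if (next.content.trim() === "") {
      reason = "empty reply";
    } else if (prev.role === next.role) {
      reason = "same role spoke twice in a row";
    }
    transitions.push(
      reason === undefined
        ? { from: prev.turn_id, to: next.turn_id, coherent: true }
        : { from: prev.turn_id, to: next.turn_id, coherent: false, reason },
    );
  }
  const coherentCount = transitions.filter((t) => t.coherent).length;
  // A trajectory with no transitions is treated as trivially coherent.
  const score = transitions.length === 0 ? 1 : coherentCount / transitions.length;
  const issues = transitions
    .filter((t) => !t.coherent)
    .map((t) => `Turn ${t.from} -> ${t.to}: ${t.reason}`);
  return { score, issues, transitions };
}

const flowTurns: FlowTurn[] = [
  { turn_id: 1, role: "user", content: "Reset password" },
  { turn_id: 2, role: "agent", content: "What is your email?" },
  { turn_id: 3, role: "agent", content: "Are you still there?" },
  { turn_id: 4, role: "user", content: "Yes, it is me@example.com" },
];

const coherenceSketch = analyzeCoherenceSketch(flowTurns);
const sketchCoherenceThreshold = 0.7; // hypothetical value
const coherencePassed = coherenceSketch.score >= sketchCoherenceThreshold;
```

In this example one of the three transitions is flagged, so the score is 2/3 and falls below the hypothetical threshold. Keeping the per-transition records alongside the aggregate score is what lets a report point at the specific turn pair that lowered it.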
Comparator
| Name | Type | Description |
|------|------|-------------|
| compare(candidate, golden, options?) | (Trajectory, GoldenTrajectory \| Trajectory, CompareOptions?) => ComparisonResult | Compare a candidate trajectory against a golden reference |
Comparator Types
| Name | Type | Description |
|------|------|-------------|
| CompareOptions | interface | Options with similarityThreshold (default 0.85), compareTools, compareLatency, compareCosts, and strict |
| ComparisonResult | interface | Overall similarity, pass/fail, diff, regressions, improvements, and per-turn comparisons |
| TrajectoryDiff | interface | Detailed diff with missing turns, extra turns, modified turns, and tool differences |
| TurnDiff | interface | Difference in a single turn field with expected and actual values |
| ToolDiff | interface | Tool call difference with expected/actual tool names and argument differences |
| ArgumentDiff | interface | Single argument difference with expected and actual values |
| Regression | interface | Regression record with type, severity, description, turn ID, and impact score |
| Improvement | interface | Improvement record with type, description, turn ID, and benefit score |
| TurnComparison | interface | Per-turn comparison with similarity, match status, and differences |
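The idea of comparing a candidate against a golden reference with a similarity threshold can be illustrated with the sketch below. It is a stand-in, not the package's algorithm: turns are aligned by position, similarity is a simple token-set Jaccard overlap chosen for this example, and every name (CmpTurn, compareSketch, and the result fields) is hypothetical. Only the 0.85 default threshold comes from the table above.

```typescript
// Hypothetical minimal turn; only the fields this sketch needs.
interface CmpTurn {
  role: string;
  content: string;
}

interface TurnComparisonSketch {
  index: number;
  similarity: number;
  matches: boolean;
}

interface ComparisonSketch {
  overallSimilarity: number;
  passed: boolean;
  missingTurns: number[]; // golden positions with no candidate turn
  extraTurns: number[]; // candidate positions beyond the golden length
  turns: TurnComparisonSketch[];
}

// Lowercased word tokens as a set.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter((t) => t.length > 0));
}

// Jaccard overlap: shared tokens divided by tokens in either text.
function jaccard(a: string, b: string): number {
  const left = tokenSet(a);
  const right = tokenSet(b);
  if (left.size === 0 && right.size === 0) return 1;
  let shared = 0;
  left.forEach((token) => {
    if (right.has(token)) shared++;
  });
  return shared / (left.size + right.size - shared);
}

function compareSketch(
  candidate: CmpTurn[],
  golden: CmpTurn[],
  similarityThreshold = 0.85,
): ComparisonSketch {
  const missingTurns: number[] = [];
  const turns: TurnComparisonSketch[] = [];
  golden.forEach((goldenTurn, index) => {
    const candidateTurn = candidate[index];
    if (candidateTurn === undefined) {
      missingTurns.push(index);
      turns.push({ index, similarity: 0, matches: false });
      return;
    }
    // A role mismatch counts as zero similarity in this sketch.
    const similarity =
      candidateTurn.role === goldenTurn.role
        ? jaccard(candidateTurn.content, goldenTurn.content)
        : 0;
    turns.push({ index, similarity, matches: similarity >= similarityThreshold });
  });
  const extraTurns: number[] = [];
  for (let i = golden.length; i < candidate.length; i++) extraTurns.push(i);
  const total = turns.reduce((sum, t) => sum + t.similarity, 0);
  const overallSimilarity = turns.length === 0 ? 1 : total / turns.length;
  return {
    overallSimilarity,
    passed: overallSimilarity >= similarityThreshold,
    missingTurns,
    extraTurns,
    turns,
  };
}

const goldenTurns: CmpTurn[] = [
  { role: "user", content: "Reset password" },
  { role: "agent", content: "What is your email address?" },
];
const candidateTurns: CmpTurn[] = [
  { role: "user", content: "Reset password" },
  { role: "agent", content: "What is your email?" },
];

const comparisonSketch = compareSketch(candidateTurns, goldenTurns);
```

Here the second turn shares four of five distinct tokens with the golden turn (similarity 0.8), so that turn alone does not match, while the averaged similarity of 0.9 still clears the 0.85 threshold. Reporting both levels separately is one way to surface a per-turn regression that an overall pass would otherwise hide.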
Related Packages
| Package | Description |
|---------|-------------|
| @reaatech/agent-eval-harness-types | Shared domain types and Zod schemas |
| @reaatech/agent-eval-harness-trajectory | Trajectory loading, evaluation, and golden comparison |
| @reaatech/agent-eval-harness-tool-use | Tool-use validation and schema compliance |
| @reaatech/agent-eval-harness-cost | Cost tracking, budgets, and reporting |
| @reaatech/agent-eval-harness-latency | Latency monitoring, SLA enforcement, and optimization |
| @reaatech/agent-eval-harness-judge | LLM-as-judge with calibration and consensus |
| @reaatech/agent-eval-harness-golden | Golden trajectory management and curation |
| @reaatech/agent-eval-harness-suite | Suite runner, results aggregation, and comparison |
| @reaatech/agent-eval-harness-gate | CI regression gates with JUnit and GitHub output |
| @reaatech/agent-eval-harness-mcp-server | MCP server with three-layer tool architecture |
| @reaatech/agent-eval-harness-cli | Command-line interface |
| @reaatech/agent-eval-harness-observability | OTel tracing, metrics, structured logging, and dashboards |
