@reaatech/agent-eval-harness-cli

v0.1.0

CLI interface for agent-eval-harness with eval, judge, compare, gate, golden, report, and serve commands

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Command-line interface for the agent-eval-harness ecosystem. Provides 7 subcommands for full evaluation runs, on-the-fly LLM judging, baseline comparison, CI gate checking, golden trajectory management, multi-format reporting, and an MCP server in stdio mode.

Installation

npm install @reaatech/agent-eval-harness-cli
# or
npm install -g @reaatech/agent-eval-harness-cli

Feature Overview

  • 7 subcommands: eval, judge, compare, gate, golden, report, serve
  • Full evaluation pipeline — load trajectories from files or directories, run multi-metric evaluation, output results as JSON or CSV
  • On-the-fly judging — evaluate faithfulness, relevance, tool correctness, or overall quality with a single command
  • CI gate checking — evaluate gate presets (standard, strict, lenient) against results with exit codes for pipeline integration
  • Golden trajectory management — list, create, update, and validate golden reference trajectories
  • Multi-format reporting — JSON, HTML, Markdown, and PDF output for evaluation results
  • MCP server — stdio-mode MCP server exposing all 13 eval tools to AI coding agents

Quick Start

# Install globally
npm install -g @reaatech/agent-eval-harness-cli

# Run evaluation on a directory of JSONL trajectories
agent-eval-harness eval trajectories/ --config eval-config.yaml --output results/

# Judge a single response on faithfulness
agent-eval-harness judge faithfulness \
  --context "The user's account is associated with email [email protected]" \
  --response "I've sent the password reset to [email protected]"

# Compare two evaluation runs
agent-eval-harness compare results/baseline.json results/candidate.json --format markdown

# Check CI regression gates
agent-eval-harness gate results/results.json --preset standard --exit-code

# List golden trajectories
agent-eval-harness golden --list

# Generate HTML report
agent-eval-harness report results/results.json --format html --output report.html

# Start MCP server
agent-eval-harness serve

API Reference

Binary Entry

agent-eval-harness [global-options] <command> [command-options]

Global Options

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| -v, --verbose | boolean | false | Enable verbose output |
| -c, --config <path> | string | eval-config.yaml | Path to configuration file |
| -o, --output <path> | string | results | Output directory for results |

Subcommand: eval <paths...>

Run full evaluation on trajectory files or directories.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| -g, --golden <path> | string | — | Path to golden trajectory for comparison |
| -m, --metrics <metrics> | string | — | Comma-separated list of metrics to evaluate |
| --judge-model <model> | string | claude-opus | Model to use for LLM judge |
| --no-judge | boolean | false | Disable LLM judge evaluation |
| --budget <budget> | string | 10.00 | Cost budget limit (USD) |
| -f, --format <format> | string | json | Output format (json, junit, csv) |
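
When the CLI is launched from a script, it can help to build the argument list from a typed object instead of concatenating strings. The sketch below maps the flags from the table above onto an argument array. `EvalCliOptions` and `buildEvalArgs` are hypothetical names used for illustration only, they are not exports of this package, and the metric names in the usage example are placeholders.

```typescript
// Hypothetical helper (not part of the package): builds the argument list
// for `agent-eval-harness eval` from the flags documented in the table above.
interface EvalCliOptions {
  paths: string[]; // trajectory files or directories
  golden?: string; // --golden <path>
  metrics?: string[]; // --metrics, joined with commas
  judgeModel?: string; // --judge-model <model>
  noJudge?: boolean; // --no-judge
  budget?: string; // --budget <budget>, in USD
  format?: "json" | "junit" | "csv"; // --format <format>
}

function buildEvalArgs(opts: EvalCliOptions): string[] {
  if (opts.paths.length === 0) {
    throw new Error("eval needs at least one trajectory path");
  }
  const args: string[] = ["eval", ...opts.paths];
  if (opts.golden !== undefined) args.push("--golden", opts.golden);
  if (opts.metrics !== undefined && opts.metrics.length > 0) {
    args.push("--metrics", opts.metrics.join(","));
  }
  if (opts.judgeModel !== undefined) args.push("--judge-model", opts.judgeModel);
  if (opts.noJudge === true) args.push("--no-judge");
  if (opts.budget !== undefined) args.push("--budget", opts.budget);
  if (opts.format !== undefined) args.push("--format", opts.format);
  return args;
}

// Usage: the metric names below are placeholders, not a list of supported metrics.
const evalArgs = buildEvalArgs({
  paths: ["trajectories/"],
  metrics: ["faithfulness", "relevance"],
  budget: "5.00",
  format: "junit",
});
// evalArgs is:
// ["eval", "trajectories/", "--metrics", "faithfulness,relevance",
//  "--budget", "5.00", "--format", "junit"]
```

The resulting array can be passed to any process-spawning API together with the `agent-eval-harness` binary name.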

Subcommand: judge <aspect>

Run LLM judge on a specific evaluation aspect.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| -t, --trajectory <path> | string | — | Path to trajectory file |
| --context <text> | string | — | Context for faithfulness evaluation |
| --response <text> | string | — | Response to evaluate |
| --intent <text> | string | — | User intent for relevance evaluation |
| --model <model> | string | claude-opus | Model to use for judging |
| --calibrated | boolean | false | Use calibrated scores |

Valid aspects: faithfulness, relevance, tool_correctness, overall
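
A wrapper script may want to reject an unknown aspect before spawning the CLI. The following is a minimal sketch of such a check using the four aspect names listed above. `JUDGE_ASPECTS`, `isJudgeAspect`, and `parseJudgeAspect` are hypothetical names and are not exports of this package.

```typescript
// The four aspect names come from the "Valid aspects" line above;
// everything else in this block is a hypothetical helper.
const JUDGE_ASPECTS = ["faithfulness", "relevance", "tool_correctness", "overall"] as const;
type JudgeAspect = (typeof JUDGE_ASPECTS)[number];

// Type guard: narrows an arbitrary string to one of the known aspects.
function isJudgeAspect(value: string): value is JudgeAspect {
  return (JUDGE_ASPECTS as readonly string[]).indexOf(value) !== -1;
}

// Throws a readable error for anything that is not a known aspect.
function parseJudgeAspect(value: string): JudgeAspect {
  if (!isJudgeAspect(value)) {
    throw new Error(
      `Unknown aspect "${value}". Valid aspects: ${JUDGE_ASPECTS.join(", ")}`
    );
  }
  return value;
}

const aspect: JudgeAspect = parseJudgeAspect("tool_correctness");
```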

Subcommand: compare <baseline> <candidate>

Compare two evaluation runs.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --statistical | boolean | false | Run statistical significance tests |
| -f, --format <format> | string | json | Output format (json, markdown, table) |
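
To illustrate what a baseline comparison amounts to, here is a rough sketch that computes per-metric deltas between two runs. It assumes each run can be reduced to a flat map of metric name to score. The real results file format is not documented in this README, so `MetricScores`, `MetricDelta`, and `compareRuns` are hypothetical and do not describe the package's own implementation.

```typescript
// Assumption: a run is summarised as metric name -> score.
// This is not the documented results schema of the CLI.
type MetricScores = Record<string, number>;

interface MetricDelta {
  metric: string;
  baseline: number;
  candidate: number;
  delta: number; // candidate minus baseline; negative means a drop
}

// Compares only the metrics present in both runs, in alphabetical order.
function compareRuns(baseline: MetricScores, candidate: MetricScores): MetricDelta[] {
  const deltas: MetricDelta[] = [];
  for (const metric of Object.keys(baseline).sort()) {
    const b = baseline[metric];
    const c = candidate[metric];
    if (b === undefined || c === undefined) continue;
    deltas.push({ metric, baseline: b, candidate: c, delta: c - b });
  }
  return deltas;
}

const runDeltas = compareRuns(
  { quality: 0.5, faithfulness: 0.75, onlyInBaseline: 1 },
  { quality: 0.75, faithfulness: 0.5 }
);
// Metrics whose score dropped in the candidate run.
const regressedMetrics = runDeltas.filter((d) => d.delta < 0).map((d) => d.metric);
```

A raw delta says nothing about significance, which is presumably what the --statistical flag addresses.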

Subcommand: gate <results>

Check regression gates against evaluation results.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --gates <path> | string | gates.yaml | Path to gate configuration file |
| --preset <preset> | string | standard | Gate preset (standard, strict, lenient) |
| --exit-code | boolean | true | Return CI-compatible exit code |

Subcommand: golden

Manage golden reference trajectories.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| -l, --list | boolean | false | List all golden trajectories |
| -c, --create <path> | string | — | Create new golden trajectory from file |
| -u, --update <id> | string | — | Update existing golden trajectory |
| -d, --delete <id> | string | — | Delete golden trajectory |
| --validate <path> | string | — | Validate golden trajectory quality |
| --dir <path> | string | golden | Golden trajectories directory |

Subcommand: report <results>

Generate evaluation reports.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| -f, --format <format> | string | markdown | Output format (html, markdown, json, pdf) |
| -o, --output <path> | string | — | Output file path |
| --template <path> | string | — | Custom report template |
| --include-raw | boolean | false | Include raw trajectory data in report |

Subcommand: serve

Start the MCP server.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| -p, --port <port> | string | 3000 | Server port |
| --host <host> | string | localhost | Server host |
| --transport <transport> | string | http | Transport type (http, stdio) |
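
Many MCP clients launch stdio servers from a JSON config that names a command and its arguments. The sketch below generates such an entry for the serve subcommand. The `mcpServers` shape is an assumption borrowed from common MCP client configs and is not defined by this package. `stdioServerConfig` and the server name `agent-eval` are hypothetical. The --transport flag is passed explicitly because the table above lists http as the default.

```typescript
// Assumption: the client reads a config of the form
// { "mcpServers": { "<name>": { "command": ..., "args": [...] } } }.
// Check your MCP client's documentation for the exact shape it expects.
interface McpServerEntry {
  command: string;
  args: string[];
}

interface McpClientConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Builds a config entry that starts the server over stdio.
function stdioServerConfig(name: string): McpClientConfig {
  return {
    mcpServers: {
      [name]: {
        command: "agent-eval-harness",
        args: ["serve", "--transport", "stdio"],
      },
    },
  };
}

// JSON text ready to be written into the client's config file.
const mcpConfigJson = JSON.stringify(stdioServerConfig("agent-eval"), null, 2);
```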

Programmatic Use

Command functions and output helpers are available as library exports:

import {
  evalCommand,
  judgeCommand,
  compareCommand,
  gateCommand,
  goldenCommand,
  reportCommand,
  cliOut,
  cliError,
  cliWarn,
} from "@reaatech/agent-eval-harness-cli";

Type Exports

| Type | Description |
|------|-------------|
| EvalOptions | Options interface for evalCommand |
| JudgeOptions | Options interface for judgeCommand |
| CompareOptions | Options interface for compareCommand |
| GateOptions | Options interface for gateCommand |
| GoldenOptions | Options interface for goldenCommand |
| ReportOptions | Options interface for reportCommand |

Usage Patterns

Using in Docker

# Build the image
docker build -t agent-eval-harness .

# Run evaluation with mounted volumes
# (absolute host paths, since older Docker releases reject relative bind mounts)
docker run -v "$(pwd)/trajectories:/app/trajectories" \
  -v "$(pwd)/results:/app/results" \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  agent-eval-harness eval trajectories/ --output results/

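# Start MCP server in stdio mode
# (--transport is passed explicitly because the serve flag table lists http as the default)
docker run -i agent-eval-harness serve --transport stdio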

CI Pipeline Integration

Use the gate subcommand in CI workflows to block regressions:

# .github/workflows/eval.yml
name: Agent Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

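      - name: Run evaluation suite
        env:
          # Assumption: the judge's API key is stored as a repository secret with this name
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          npx @reaatech/agent-eval-harness-cli eval trajectories/ \
            --config eval-config.yaml \
            --output results/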

      - name: Run regression gates
        run: |
          npx @reaatech/agent-eval-harness-cli gate results/results.json \
            --preset standard \
            --exit-code

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/

The --exit-code flag causes the command to exit with code 1 when any gate fails, failing the CI step.

Gate presets provide ready-made thresholds:

| Preset | Overall Quality | Cost Limit | Latency P99 | Tool Correctness | Faithfulness |
|--------|-----------------|------------|-------------|------------------|--------------|
| standard | >= 0.80 | <= $0.05 | <= 5000ms | >= 0.90 | >= 0.80 |
| strict | >= 0.90 | <= $0.02 | <= 2000ms | >= 0.95 | >= 0.90 |
| lenient | >= 0.60 | <= $0.10 | <= 10000ms | >= 0.70 | >= 0.60 |
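
As an illustration of how these thresholds translate into a pass/fail decision, the sketch below encodes the three presets from the table and checks a run summary against one of them. The threshold values are taken from the table. The `RunSummary` shape, the field names, and the functions `failedGates` and `gateExitCode` are hypothetical and do not reflect the package's internals. The table also does not state per what unit the cost limit applies, so the sketch treats it as a single number.

```typescript
// Threshold values come from the preset table above; all names are hypothetical.
interface GateThresholds {
  minOverallQuality: number;
  maxCostUsd: number;
  maxLatencyP99Ms: number;
  minToolCorrectness: number;
  minFaithfulness: number;
}

type PresetName = "standard" | "strict" | "lenient";

const PRESETS: Record<PresetName, GateThresholds> = {
  standard: { minOverallQuality: 0.8, maxCostUsd: 0.05, maxLatencyP99Ms: 5000, minToolCorrectness: 0.9, minFaithfulness: 0.8 },
  strict: { minOverallQuality: 0.9, maxCostUsd: 0.02, maxLatencyP99Ms: 2000, minToolCorrectness: 0.95, minFaithfulness: 0.9 },
  lenient: { minOverallQuality: 0.6, maxCostUsd: 0.1, maxLatencyP99Ms: 10000, minToolCorrectness: 0.7, minFaithfulness: 0.6 },
};

// Assumed summary of one evaluation run (not the CLI's documented results schema).
interface RunSummary {
  overallQuality: number;
  costUsd: number;
  latencyP99Ms: number;
  toolCorrectness: number;
  faithfulness: number;
}

// Returns the names of the gates that the run fails under the given preset.
function failedGates(run: RunSummary, preset: PresetName): string[] {
  const t = PRESETS[preset];
  const failures: string[] = [];
  if (run.overallQuality < t.minOverallQuality) failures.push("overall_quality");
  if (run.costUsd > t.maxCostUsd) failures.push("cost");
  if (run.latencyP99Ms > t.maxLatencyP99Ms) failures.push("latency_p99");
  if (run.toolCorrectness < t.minToolCorrectness) failures.push("tool_correctness");
  if (run.faithfulness < t.minFaithfulness) failures.push("faithfulness");
  return failures;
}

// Mirrors the documented behaviour: exit code 1 when any gate fails, else 0.
function gateExitCode(run: RunSummary, preset: PresetName): 0 | 1 {
  return failedGates(run, preset).length > 0 ? 1 : 0;
}

const sampleRun: RunSummary = {
  overallQuality: 0.85,
  costUsd: 0.03,
  latencyP99Ms: 4200,
  toolCorrectness: 0.92,
  faithfulness: 0.88,
};
// sampleRun passes every standard gate and fails all five strict gates.
```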

Related Packages

| Package | Description |
|---------|-------------|
| @reaatech/agent-eval-harness-types | Shared domain types and schemas |
| @reaatech/agent-eval-harness-trajectory | Trajectory evaluation |
| @reaatech/agent-eval-harness-tool-use | Tool-use validation |
| @reaatech/agent-eval-harness-cost | Cost tracking |
| @reaatech/agent-eval-harness-latency | Latency monitoring |
| @reaatech/agent-eval-harness-judge | LLM-as-judge |
| @reaatech/agent-eval-harness-golden | Golden trajectories |
| @reaatech/agent-eval-harness-suite | Suite runner |
| @reaatech/agent-eval-harness-gate | CI gates |
| @reaatech/agent-eval-harness-mcp-server | MCP server |
| @reaatech/agent-eval-harness-observability | Observability |

License

MIT