@vibeatlas/ship-eval
v1.0.0
AI agent reliability evaluation harness — like SWE-bench but for RELIABILITY
SHIP Eval — AI Agent Reliability Evaluation Harness
Where SWE-bench measures agent capability, SHIP Eval measures agent reliability.
SHIP Eval runs standardized coding tasks through pre-collected agent outputs, scores them using the SHIP Protocol API, and generates comparative reliability reports with statistical analysis.
Quick Start
# Install dependencies
npm install
# Build
npm run build
# Validate task definitions
npx ship-eval validate --tasks ./tasks/
# Score pre-collected agent outputs
npx ship-eval run --tasks ./tasks/ --outputs ./sample-outputs/ -o results.json
# Generate report from results
npx ship-eval report --input results.json -o ./report/
Commands
ship-eval run
Scores pre-collected agent outputs against task definitions using the SHIP API.
npx ship-eval run --tasks <dir> --outputs <dir> [options]
| Option | Description | Default |
|--------|-------------|---------|
| --tasks <dir> | Path to tasks directory | (required) |
| --outputs <dir> | Path to agent outputs directory | (required) |
| -o, --output <file> | Output results file | results.json |
| -c, --concurrency <n> | Max concurrent API calls | 5 |
| --api-url <url> | SHIP API base URL | Production API |
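The `--concurrency` option bounds how many SHIP API calls run in parallel. A minimal sketch of such a limiter (the function name and shape are illustrative, not the runner's actual internals):

```typescript
// Map `fn` over `items`, keeping at most `limit` calls in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With the default of 5, up to five outputs are scored at once while the rest queue behind the workers.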
ship-eval report
Generates HTML + markdown comparison reports from results.
npx ship-eval report --input <file> [options]
| Option | Description | Default |
|--------|-------------|---------|
| --input <file> | Path to results.json | (required) |
| -o, --output <dir> | Output directory for reports | ./report |
ship-eval validate
Validates that task and output definitions are well-formed.
npx ship-eval validate --tasks <dir> [--outputs <dir>]
Adding Custom Tasks
Create a JSON file in tasks/<difficulty>/:
{
"id": "my-task-01",
"title": "My Custom Task",
"description": "Detailed description of what to build...",
"language": "typescript",
"difficulty": "medium",
"category": "security",
"tags": ["auth", "middleware"],
"expected_signals": ["handles edge case X", "validates input Y"]
}
Required fields:
- id — Unique identifier
- title — Human-readable name
- description — What to build (2-3 sentences)
- language — typescript, python, or javascript
- difficulty — easy, medium, or hard
- category — Grouping for analysis (e.g., security, API, data)
- tags — Array of searchable tags
- expected_signals — Array of expected quality signals
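The fields above could be captured in a type plus a structural guard along these lines (an illustrative sketch; the real definitions live in src/types.ts and src/validator.ts, which uses full JSON-schema validation):

```typescript
// Illustrative task-definition shape, mirroring the required fields above.
type Difficulty = "easy" | "medium" | "hard";

interface Task {
  id: string;
  title: string;
  description: string;
  language: "typescript" | "python" | "javascript";
  difficulty: Difficulty;
  category: string;
  tags: string[];
  expected_signals: string[];
}

// Minimal structural check — far weaker than real schema validation,
// but enough to catch a missing or mistyped field.
function isTask(value: unknown): value is Task {
  const t = value as any;
  return (
    typeof t?.id === "string" &&
    typeof t?.title === "string" &&
    typeof t?.description === "string" &&
    ["typescript", "python", "javascript"].includes(t?.language) &&
    ["easy", "medium", "hard"].includes(t?.difficulty) &&
    typeof t?.category === "string" &&
    Array.isArray(t?.tags) &&
    Array.isArray(t?.expected_signals)
  );
}
```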
Adding Agent Outputs
Create a JSON file per task per agent in sample-outputs/<agent-name>/:
{
"task_id": "my-task-01",
"agent": "my-agent",
"agent_version": "v1.0",
"code": "function myTask() {\n // implementation\n}",
"commit_message": "feat: implement my task with proper error handling",
"timestamp": "2026-03-20T10:00:00Z"
}
The output file name should match the task ID (e.g., my-task-01.json).
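The file-per-task-per-agent layout can be expressed as a small helper (field names come from the example above; the helper itself is hypothetical, not part of the harness):

```typescript
import * as path from "node:path";

// Illustrative agent-output shape, mirroring the example JSON above.
interface AgentOutput {
  task_id: string;
  agent: string;
  agent_version: string;
  code: string;
  commit_message: string;
  timestamp: string;
}

// One file per task per agent: <rootDir>/<agent-name>/<task_id>.json.
function outputFilePath(rootDir: string, output: AgentOutput): string {
  return path.join(rootDir, output.agent, `${output.task_id}.json`);
}
```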
Interpreting Reports
Overall Ranking
| Metric | Description |
|--------|-------------|
| Avg Score | Mean SHIP reliability score (0-100) |
| Median | Middle score value |
| Std Dev | Score consistency (lower = more consistent) |
| Pass Rate | Percentage of tasks scoring >= 70 |
| 95% CI | Confidence interval for the true mean |
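For reference, these summary metrics can be computed from per-task scores roughly as follows (a sketch, not the harness's exact code in src/stats.ts; the CI here uses the normal critical value 1.96, while a small-sample implementation might use a t-distribution):

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Sample standard deviation (n - 1 denominator).
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1));
}

// Percentage of tasks at or above the pass threshold (70 by default).
function passRate(xs: number[], threshold = 70): number {
  return (100 * xs.filter((x) => x >= threshold).length) / xs.length;
}

// Approximate 95% confidence interval for the true mean.
function ci95(xs: number[]): [number, number] {
  const half = 1.96 * (stdDev(xs) / Math.sqrt(xs.length));
  const m = mean(xs);
  return [m - half, m + half];
}
```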
What Makes a Reliable Agent?
- High avg score (>70): Consistently produces reliable code
- Low std dev (<15): Predictable quality across task types
- High pass rate (>80%): Rarely produces unreliable output
- Narrow CI: Results are statistically robust
Statistical Significance
The report includes Welch's t-test between agent pairs. "Significant = YES" means the score difference is statistically meaningful (p < 0.05), not just random variation.
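Welch's t-test compares two means without assuming equal variances. A sketch of the core computation (t-statistic and Welch-Satterthwaite degrees of freedom; converting t to a p-value is left to a stats library, and this is not the harness's exact code):

```typescript
// Welch's t-statistic and degrees of freedom for two independent samples.
function welch(a: number[], b: number[]): { t: number; df: number } {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  // Per-sample variance of the mean.
  const va = variance(a) / a.length;
  const vb = variance(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  // Welch-Satterthwaite approximation for the degrees of freedom.
  const df =
    (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```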
Breakdown Views
- By Difficulty: Shows how agents handle increasing complexity
- By Category: Shows agent strengths/weaknesses (security, API, data, etc.)
- Radar Chart: Visual comparison of category strengths
Report Outputs
| File | Description |
|------|-------------|
| report.md | Markdown report with tables |
| report.html | Interactive HTML with Chart.js visualizations |
| report.json | Raw report data for programmatic access |
Sample Data
The repo includes sample outputs for two agents:
- reliablebot (v2.1) — High-quality agent (~75 avg score)
- quickcoder (v1.0) — Low-quality agent (~45 avg score)
These demonstrate how the harness differentiates agent quality.
Development
npm install # Install dependencies
npm run build # Compile TypeScript
npm test # Run unit tests
npm run test:watch # Watch mode
Architecture
ship-eval/
├── src/
│ ├── cli.ts — CLI entry point (commander.js)
│ ├── runner.ts — Loads tasks + outputs, calls SHIP API
│ ├── scorer.ts — SHIP API client with retry logic
│ ├── reporter.ts — Generates markdown + HTML reports
│ ├── stats.ts — Statistical functions (mean, CI, t-test)
│ ├── validator.ts — JSON schema validation
│ └── types.ts — TypeScript type definitions
├── tasks/ — 30 evaluation task definitions
├── sample-outputs/ — Pre-collected agent outputs
└── tests/ — Unit and integration tests
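scorer.ts is described above as the SHIP API client with retry logic. A generic sketch of retry with exponential backoff, the usual shape for this kind of client (the actual attempt count, delays, and error handling in scorer.ts are implementation details):

```typescript
// Retry an async call up to `attempts` times, doubling the delay each time.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Backoff schedule: 500 ms, 1000 ms, 2000 ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```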