@vibeatlas/ship-eval
v1.0.0
AI agent reliability evaluation harness — like SWE-bench but for RELIABILITY
SHIP Eval — AI Agent Reliability Evaluation Harness
Where SWE-bench measures agent capability, SHIP Eval measures agent reliability.
SHIP Eval runs standardized coding tasks through pre-collected agent outputs, scores them using the SHIP Protocol API, and generates comparative reliability reports with statistical analysis.
Quick Start
# Install dependencies
npm install
# Build
npm run build
# Validate task definitions
npx ship-eval validate --tasks ./tasks/
# Score pre-collected agent outputs
npx ship-eval run --tasks ./tasks/ --outputs ./sample-outputs/ -o results.json
# Generate report from results
npx ship-eval report --input results.json -o ./report/
Commands
ship-eval run
Scores pre-collected agent outputs against task definitions using the SHIP API.
npx ship-eval run --tasks <dir> --outputs <dir> [options]
| Option | Description | Default |
|--------|-------------|---------|
| --tasks <dir> | Path to tasks directory | (required) |
| --outputs <dir> | Path to agent outputs directory | (required) |
| -o, --output <file> | Output results file | results.json |
| -c, --concurrency <n> | Max concurrent API calls | 5 |
| --api-url <url> | SHIP API base URL | Production API |
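The `--concurrency` option bounds how many SHIP API calls run in parallel. A minimal sketch of such a limiter (the function name and shape are illustrative, not the runner's actual internals):

```typescript
// Map `fn` over `items`, keeping at most `limit` calls in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With the default of 5, up to five outputs are scored at once while the rest queue behind the workers.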
ship-eval report
Generates HTML + markdown comparison reports from results.
npx ship-eval report --input <file> [options]
| Option | Description | Default |
|--------|-------------|---------|
| --input <file> | Path to results.json | (required) |
| -o, --output <dir> | Output directory for reports | ./report |
ship-eval validate
Validates that task and output definitions are well-formed.
npx ship-eval validate --tasks <dir> [--outputs <dir>]
Adding Custom Tasks
Create a JSON file in tasks/<difficulty>/:
{
"id": "my-task-01",
"title": "My Custom Task",
"description": "Detailed description of what to build...",
"language": "typescript",
"difficulty": "medium",
"category": "security",
"tags": ["auth", "middleware"],
"expected_signals": ["handles edge case X", "validates input Y"]
}
Required fields:
- id — Unique identifier
- title — Human-readable name
- description — What to build (2-3 sentences)
- language — typescript, python, or javascript
- difficulty — easy, medium, or hard
- category — Grouping for analysis (e.g., security, API, data)
- tags — Array of searchable tags
- expected_signals — Array of expected quality signals
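The fields above could be captured in a type plus a structural guard along these lines (an illustrative sketch; the real definitions live in src/types.ts and src/validator.ts, which uses full JSON-schema validation):

```typescript
// Illustrative task-definition shape, mirroring the required fields above.
type Difficulty = "easy" | "medium" | "hard";

interface Task {
  id: string;
  title: string;
  description: string;
  language: "typescript" | "python" | "javascript";
  difficulty: Difficulty;
  category: string;
  tags: string[];
  expected_signals: string[];
}

// Minimal structural check — far weaker than real schema validation,
// but enough to catch a missing or mistyped field.
function isTask(value: unknown): value is Task {
  const t = value as any;
  return (
    typeof t?.id === "string" &&
    typeof t?.title === "string" &&
    typeof t?.description === "string" &&
    ["typescript", "python", "javascript"].includes(t?.language) &&
    ["easy", "medium", "hard"].includes(t?.difficulty) &&
    typeof t?.category === "string" &&
    Array.isArray(t?.tags) &&
    Array.isArray(t?.expected_signals)
  );
}
```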
Adding Agent Outputs
Create a JSON file per task per agent in sample-outputs/<agent-name>/:
{
"task_id": "my-task-01",
"agent": "my-agent",
"agent_version": "v1.0",
"code": "function myTask() {\n // implementation\n}",
"commit_message": "feat: implement my task with proper error handling",
"timestamp": "2026-03-20T10:00:00Z"
}
The output file name should match the task ID (e.g., my-task-01.json).
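The file-per-task-per-agent layout can be expressed as a small helper (field names come from the example above; the helper itself is hypothetical, not part of the harness):

```typescript
import * as path from "node:path";

// Illustrative agent-output shape, mirroring the example JSON above.
interface AgentOutput {
  task_id: string;
  agent: string;
  agent_version: string;
  code: string;
  commit_message: string;
  timestamp: string;
}

// One file per task per agent: <rootDir>/<agent-name>/<task_id>.json.
function outputFilePath(rootDir: string, output: AgentOutput): string {
  return path.join(rootDir, output.agent, `${output.task_id}.json`);
}
```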
Interpreting Reports
Overall Ranking
| Metric | Description |
|--------|-------------|
| Avg Score | Mean SHIP reliability score (0-100) |
| Median | Middle score value |
| Std Dev | Score consistency (lower = more consistent) |
| Pass Rate | Percentage of tasks scoring >= 70 |
| 95% CI | Confidence interval for the true mean |
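For reference, these summary metrics can be computed from per-task scores roughly as follows (a sketch, not the harness's exact code in src/stats.ts; the CI here uses the normal critical value 1.96, while a small-sample implementation might use a t-distribution):

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Sample standard deviation (n - 1 denominator).
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1));
}

// Percentage of tasks at or above the pass threshold (70 by default).
function passRate(xs: number[], threshold = 70): number {
  return (100 * xs.filter((x) => x >= threshold).length) / xs.length;
}

// Approximate 95% confidence interval for the true mean.
function ci95(xs: number[]): [number, number] {
  const half = 1.96 * (stdDev(xs) / Math.sqrt(xs.length));
  const m = mean(xs);
  return [m - half, m + half];
}
```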
What Makes a Reliable Agent?
- High avg score (>70): Consistently produces reliable code
- Low std dev (<15): Predictable quality across task types
- High pass rate (>80%): Rarely produces unreliable output
- Narrow CI: Results are statistically robust
Statistical Significance
The report includes Welch's t-test between agent pairs. "Significant = YES" means the score difference is statistically meaningful (p < 0.05), not just random variation.
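Welch's t-test compares two means without assuming equal variances. A sketch of the core computation (t-statistic and Welch-Satterthwaite degrees of freedom; converting t to a p-value is left to a stats library, and this is not the harness's exact code):

```typescript
// Welch's t-statistic and degrees of freedom for two independent samples.
function welch(a: number[], b: number[]): { t: number; df: number } {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  // Per-sample variance of the mean.
  const va = variance(a) / a.length;
  const vb = variance(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  // Welch-Satterthwaite approximation for the degrees of freedom.
  const df =
    (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```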
Breakdown Views
- By Difficulty: Shows how agents handle increasing complexity
- By Category: Shows agent strengths/weaknesses (security, API, data, etc.)
- Radar Chart: Visual comparison of category strengths
Report Outputs
| File | Description |
|------|-------------|
| report.md | Markdown report with tables |
| report.html | Interactive HTML with Chart.js visualizations |
| report.json | Raw report data for programmatic access |
Sample Data
The repo includes sample outputs for two agents:
- reliablebot (v2.1) — High-quality agent (~75 avg score)
- quickcoder (v1.0) — Low-quality agent (~45 avg score)
These demonstrate how the harness differentiates agent quality.
Development
npm install # Install dependencies
npm run build # Compile TypeScript
npm test # Run unit tests
npm run test:watch # Watch mode
Architecture
ship-eval/
├── src/
│ ├── cli.ts — CLI entry point (commander.js)
│ ├── runner.ts — Loads tasks + outputs, calls SHIP API
│ ├── scorer.ts — SHIP API client with retry logic
│ ├── reporter.ts — Generates markdown + HTML reports
│ ├── stats.ts — Statistical functions (mean, CI, t-test)
│ ├── validator.ts — JSON schema validation
│ └── types.ts — TypeScript type definitions
├── tasks/ — 30 evaluation task definitions
├── sample-outputs/ — Pre-collected agent outputs
└── tests/ — Unit and integration tests
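scorer.ts is described above as the SHIP API client with retry logic. A generic sketch of retry with exponential backoff, the usual shape for this kind of client (the actual attempt count, delays, and error handling in scorer.ts are implementation details):

```typescript
// Retry an async call up to `attempts` times, doubling the delay each time.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Backoff schedule: 500 ms, 1000 ms, 2000 ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```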