@vibeatlas/ship-reliability-spec
v1.0.0
# SHIP Reliability Specification v1.0
An open standard for measuring AI agent output reliability through outcome-validated scoring.
## What This Is
The SHIP Reliability Specification defines a methodology for computing calibrated reliability scores (0-100) for any AI agent's output. A score of 70 means: historically, outputs with similar characteristics succeed approximately 70% of the time.
The specification is:
- Domain-agnostic: The core methodology applies to any AI agent (code generation, customer service, finance, etc.)
- Outcome-validated: Scores are calibrated against real-world outcomes, not heuristics or static rules
- Open: Licensed CC-BY-4.0. Anyone can implement SHIP-compatible scoring.
## Files
| File | Description |
|------|-------------|
| SPEC.md | The full specification (~5000 words, RFC 2119 style) |
| SCORE_SCHEMA.json | JSON Schema (draft 2020-12) for score exchange format |
| examples/code-domain.md | Reference Domain Adapter for code (CI pass/fail outcomes) |
| examples/cs-domain.md | Reference Domain Adapter for customer service (ticket resolution) |
| examples/score-example.json | Example score response conforming to the schema |
| CHANGELOG.md | Version history |
| LICENSE | CC-BY-4.0 |
## Quick Start: Implement SHIP-Compatible Scoring

### 1. Choose a Domain
Pick the domain you want to score (code, customer service, or define a new one following Section 6 of the spec).
### 2. Collect Ground Truth
Gather at least 500 output-outcome pairs. For the code domain, this means AI-generated commits paired with CI pass/fail results. For customer service, this means AI agent responses paired with ticket resolution outcomes.
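One way to represent such a pair, shown here for the code domain (field names are illustrative, not mandated by the spec):

```python
# One output-outcome pair for the code domain.
# Field names are illustrative; the spec does not mandate a storage format.
commit_outcome_pair = {
    "output_id": "commit-a1b2c3",   # identifier for the AI-generated output
    "features": {                    # raw inputs for feature extraction (step 3)
        "lines_changed": 42,
        "files_touched": 3,
        "has_tests": True,
    },
    "outcome": 1,                    # 1 = CI passed, 0 = CI failed
}

# A training set is a list of such pairs; the spec requires at least 500.
dataset = [commit_outcome_pair]
assert all(pair["outcome"] in (0, 1) for pair in dataset)
```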
### 3. Extract Features
Implement feature extractors for your domain. At minimum, you need three metadata features. See examples/code-domain.md Section 3 for the code domain feature taxonomy.
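A minimal extractor sketch, assuming three hypothetical code-domain features (the normative taxonomy is in examples/code-domain.md Section 3):

```python
# Illustrative feature extractors for the code domain. These three feature
# names are hypothetical; consult the Domain Adapter for the real taxonomy.
def extract_features(commit: dict) -> dict:
    """Map a raw commit record to the metadata features the model consumes."""
    return {
        "lines_changed": commit.get("lines_changed", 0),
        "files_touched": commit.get("files_touched", 0),
        "has_tests": int(commit.get("has_tests", False)),
    }

features = extract_features(
    {"lines_changed": 42, "files_touched": 3, "has_tests": True}
)
```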
### 4. Train a Model
Train any supervised classifier (logistic regression, XGBoost, etc.) on your labeled data. The model must achieve AUC-ROC >= 0.70 on a held-out test set.
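The AUC-ROC gate can be checked with a small pairwise implementation (fine for modest held-out sets; prefer a library routine in production):

```python
# Pairwise AUC-ROC: the fraction of (positive, negative) pairs the model
# ranks correctly, counting ties as half. O(P*N), adequate for small sets.
def auc_roc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Held-out labels and predictions from any supervised classifier:
auc = auc_roc([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2])
assert auc >= 0.70, "model does not meet the SHIP AUC gate"
```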
### 5. Calibrate

Apply Platt scaling (recommended) or isotonic regression to your model's raw probabilities. Measure ECE and Brier Score.

```
p_calibrated = sigmoid(A * logit(p_raw) + B)
score = round(p_calibrated * 100)
```

### 6. Emit Conforming Scores
Return scores in the format defined by SCORE_SCHEMA.json:

```json
{
  "score": 73,
  "grade": "C",
  "confidence": 0.82,
  "domain": "code",
  "spec_version": "1.0.0"
}
```

### 7. Validate
Validate your score responses against the JSON Schema:
```shell
# Using ajv-cli
npm install -g ajv-cli ajv-formats
ajv validate -s SCORE_SCHEMA.json -d examples/score-example.json --spec=draft2020
```

```python
# Using Python jsonschema
import json
from jsonschema import Draft202012Validator

with open('SCORE_SCHEMA.json') as f:
    schema = json.load(f)
with open('examples/score-example.json') as f:
    instance = json.load(f)

Draft202012Validator(schema).validate(instance)
print("Valid!")
```

## Adding a New Domain
Follow the six-step Domain Extension Protocol in SPEC.md Section 6:
- Define Outcome Signals -- What does success/failure look like?
- Define Feature Extractors -- What inputs predict the outcome?
- Collect Ground Truth -- Gather 500+ output-outcome pairs
- Train and Validate -- AUC >= 0.70 on held-out data
- Calibrate -- ECE < 0.10 before production use
- Register the Domain -- Document the adapter (see examples/)
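Under the assumption that an adapter is registered in code (SPEC.md Section 6 defines the protocol, not an in-code representation), a sketch might look like:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-code shape for a Domain Adapter; the spec defines the
# six-step protocol and its gates, not this particular representation.
@dataclass
class DomainAdapter:
    name: str
    outcome_signal: str                        # what success/failure means
    feature_extractor: Callable[[dict], dict]  # step 2 of the protocol
    min_ground_truth: int = 500                # required output-outcome pairs
    auc_gate: float = 0.70                     # AUC-ROC floor on held-out data
    ece_gate: float = 0.10                     # ECE ceiling before production

code_adapter = DomainAdapter(
    name="code",
    outcome_signal="CI pass/fail",
    feature_extractor=lambda commit: {
        "lines_changed": commit.get("lines_changed", 0),
    },
)
```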
## Grade Scale

| Grade | Score Range | Meaning |
|-------|-------------|---------|
| A+ | 95-100 | Very high reliability |
| A | 90-94 | High reliability |
| B | 80-89 | Above average |
| C | 65-79 | Moderate; review recommended |
| D | 50-64 | Below average; significant review recommended |
| F | 0-49 | Low reliability; manual review required |
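The grade boundaries above map directly to a lookup function:

```python
# Score-to-grade mapping from the grade scale table.
def score_to_grade(score: int) -> str:
    if not 0 <= score <= 100:
        raise ValueError("score must be in 0-100")
    if score >= 95:
        return "A+"
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 65:
        return "C"
    if score >= 50:
        return "D"
    return "F"
```

For example, the score of 73 in the schema example above falls in the 65-79 band and receives grade "C".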
## Key Formulas

Platt Scaling:

```
p_cal = sigmoid(A * logit(p_raw) + B)
```

Expected Calibration Error:

```
ECE = sum(|B_m|/N * |avg_pred(B_m) - avg_obs(B_m)|) for m = 1..10
```

Confidence:

```
confidence = 2 * model_certainty * data_sufficiency / (model_certainty + data_sufficiency)
model_certainty = 1 - 4 * p * (1 - p)
data_sufficiency = min(1.0, n_similar / N_threshold)
```

Bayesian Shrinkage (cross-tool comparison):

```
adjusted_rate = (n * observed_rate + k * global_rate) / (n + k)   where k = 50
```
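A sketch of the four formulas in Python; the `n_threshold` default of 100 is an assumption, since this document does not fix N_threshold:

```python
import math

def platt(p_raw: float, a: float, b: float) -> float:
    """Platt scaling: p_cal = sigmoid(A * logit(p_raw) + B)."""
    logit = math.log(p_raw / (1.0 - p_raw))
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))

def ece(preds, outcomes, n_bins=10):
    """Expected Calibration Error over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(preds)
    err = 0.0
    for b in bins:
        if b:
            avg_pred = sum(p for p, _ in b) / len(b)
            avg_obs = sum(y for _, y in b) / len(b)
            err += len(b) / total * abs(avg_pred - avg_obs)
    return err

def confidence(p: float, n_similar: int, n_threshold: int = 100) -> float:
    """Harmonic mean of model certainty and data sufficiency."""
    model_certainty = 1.0 - 4.0 * p * (1.0 - p)
    data_sufficiency = min(1.0, n_similar / n_threshold)
    if model_certainty + data_sufficiency == 0.0:
        return 0.0
    return (2 * model_certainty * data_sufficiency
            / (model_certainty + data_sufficiency))

def shrunk_rate(n: int, observed: float, global_rate: float, k: int = 50) -> float:
    """Bayesian shrinkage toward the global rate with prior weight k."""
    return (n * observed + k * global_rate) / (n + k)
```

Note that `confidence` is 0 when p = 0.5 (the model is maximally uncertain) and that `shrunk_rate` returns the global rate when a tool has no observations.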
## Reference Implementation

The SHIP Protocol API at https://ship-protocol.dhruvaapi.workers.dev is the reference implementation of this specification for the code domain. It scores AI-generated code commits against CI pass/fail outcomes using an XGBoost model with Platt scaling calibration.
## Contributing
This specification is open for community input. To propose changes:
- Open an issue describing the proposed change and its rationale
- Reference the specific section(s) affected
- Include any empirical evidence supporting the change
Changes to MUST/SHOULD/MAY requirements or grade boundaries require a major version increment.
## License
This specification is licensed under CC-BY-4.0. You are free to implement, adapt, and redistribute with attribution.
